[mpich-discuss] MPI_Comm_spawn under torque
Suraj Prabhakaran
suraj.prabhakaran at gmail.com
Mon Feb 24 11:41:03 CST 2014
And a short update to the previous email:
doing MPI_Comm_spawn even without the ssh launcher, i.e. using the native Torque TM interface, does not work either and returns the same error.
So in general, I cannot do MPI_Comm_spawn under the Torque environment, regardless of the launcher that I use. Outside the Torque environment, it works fine.
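For reference, the source of ./example is not included in this thread. The sketch below is only a reconstruction of the kind of program involved, inferred from the output quoted below; the collective spawn of two children over MPI_COMM_WORLD and the child-side check via MPI_Comm_get_parent are assumptions, not the actual code:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char host[64];
        MPI_Comm parent, intercomm;

        MPI_Init(&argc, &argv);
        gethostname(host, sizeof(host));
        printf("[pid %d] starting up on host %s!\n", (int)getpid(), host);

        /* Processes launched via MPI_Comm_spawn have a parent communicator;
         * they just report in and exit instead of spawning again. */
        MPI_Comm_get_parent(&parent);
        if (parent != MPI_COMM_NULL) {
            MPI_Comm_disconnect(&parent);
            MPI_Finalize();
            return 0;
        }

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("%d completed MPI_Init\n", rank);

        printf("Parent [pid %d] about to spawn!\n", (int)getpid());
        /* All parents take part in a collective spawn; the process manager
         * places the children on the next allocated node (grsacc02 here). */
        MPI_Comm_spawn("./example", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }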
Best,
Suraj
On Feb 24, 2014, at 6:37 PM, Suraj Prabhakaran wrote:
> Hello,
>
> I am trying to do an MPI_Comm_spawn under the Torque environment. I want to use the ssh launcher instead of the TM interface, since there seems to be a problem launching a large number of processes through the TM interface. mpiexec across nodes works fine in general, but when I use MPI_Comm_spawn to spawn onto another node, I get the following error.
>
> mpiexec -launcher ssh -np 8 ./example
>
> [pid 12026] starting up on host grsacc01!
> [pid 12027] starting up on host grsacc01!
> [pid 12028] starting up on host grsacc01!
> [pid 12021] starting up on host grsacc01!
> [pid 12023] starting up on host grsacc01!
> [pid 12025] starting up on host grsacc01!
> [pid 12022] starting up on host grsacc01!
> [pid 12024] starting up on host grsacc01!
> 0 completed MPI_Init
> 4 completed MPI_Init
> Parent [pid 12025] about to spawn!
> 5 completed MPI_Init
> 3 completed MPI_Init
> Parent [pid 12024] about to spawn!
> Parent [pid 12026] about to spawn!
> 2 completed MPI_Init
> Parent [pid 12023] about to spawn!
> 7 completed MPI_Init
> Parent [pid 12028] about to spawn!
> Parent [pid 12021] about to spawn!
> 6 completed MPI_Init
> Parent [pid 12027] about to spawn!
> 1 completed MPI_Init
> Parent [pid 12022] about to spawn!
> [pid 20535] starting up on host grsacc02!
> [pid 20536] starting up on host grsacc02!
> Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
> internal ABORT - process 0
> Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
> internal ABORT - process 1
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 20535 RUNNING AT grsacc02
> = EXIT CODE: 1
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at grsacc01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:0 at grsacc01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at grsacc01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec at grsacc01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec at grsacc01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at grsacc01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec at grsacc01] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
>
>
> Is there a way to get rid of this?
>
> Best,
> Suraj