[mpich-discuss] MPI_Comm_spawn under torque
Suraj Prabhakaran
suraj.prabhakaran at gmail.com
Mon Feb 24 11:41:03 CST 2014
And a short update to the previous email:
doing MPI_Comm_spawn even without the ssh launcher, i.e. using the native Torque TM interface, does not work either and returns the same error.
So in general, I cannot do MPI_Comm_spawn under the Torque environment, regardless of the launcher that I use. Outside the Torque environment, it works fine.
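For reference, the source of ./example is not included in this thread. The sketch below is only a reconstruction of the kind of program involved, inferred from the output quoted below; the collective spawn of two children over MPI_COMM_WORLD and the child-side check via MPI_Comm_get_parent are assumptions, not the actual code:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char host[64];
        MPI_Comm parent, intercomm;

        MPI_Init(&argc, &argv);
        gethostname(host, sizeof(host));
        printf("[pid %d] starting up on host %s!\n", (int)getpid(), host);

        /* Processes launched via MPI_Comm_spawn have a parent communicator;
         * they just report in and exit instead of spawning again. */
        MPI_Comm_get_parent(&parent);
        if (parent != MPI_COMM_NULL) {
            MPI_Comm_disconnect(&parent);
            MPI_Finalize();
            return 0;
        }

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("%d completed MPI_Init\n", rank);

        printf("Parent [pid %d] about to spawn!\n", (int)getpid());
        /* All parents take part in a collective spawn; the process manager
         * places the children on the next allocated node (grsacc02 here). */
        MPI_Comm_spawn("./example", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }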
Best,
Suraj
On Feb 24, 2014, at 6:37 PM, Suraj Prabhakaran wrote:
> Hello,
>
> I am trying to do an MPI_Comm_spawn under the Torque environment. I want to use the ssh launcher instead of the TM interface, since there seems to be a problem launching a large number of processes through the TM interface. mpiexec across nodes works fine in general, but when I use MPI_Comm_spawn to spawn onto another node, I get the following error.
>
> mpiexec -launcher ssh -np 8 ./example
>
> [pid 12026] starting up on host grsacc01!
> [pid 12027] starting up on host grsacc01!
> [pid 12028] starting up on host grsacc01!
> [pid 12021] starting up on host grsacc01!
> [pid 12023] starting up on host grsacc01!
> [pid 12025] starting up on host grsacc01!
> [pid 12022] starting up on host grsacc01!
> [pid 12024] starting up on host grsacc01!
> 0 completed MPI_Init
> 4 completed MPI_Init
> Parent [pid 12025] about to spawn!
> 5 completed MPI_Init
> 3 completed MPI_Init
> Parent [pid 12024] about to spawn!
> Parent [pid 12026] about to spawn!
> 2 completed MPI_Init
> Parent [pid 12023] about to spawn!
> 7 completed MPI_Init
> Parent [pid 12028] about to spawn!
> Parent [pid 12021] about to spawn!
> 6 completed MPI_Init
> Parent [pid 12027] about to spawn!
> 1 completed MPI_Init
> Parent [pid 12022] about to spawn!
> [pid 20535] starting up on host grsacc02!
> [pid 20536] starting up on host grsacc02!
> Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
> internal ABORT - process 0
> Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
> internal ABORT - process 1
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 20535 RUNNING AT grsacc02
> = EXIT CODE: 1
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at grsacc01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:0 at grsacc01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0 at grsacc01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec at grsacc01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec at grsacc01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at grsacc01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec at grsacc01] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
>
>
> Is there a way to get rid of this?
>
> Best,
> Suraj