[mpich-discuss] MPI_Comm_spawn under Torque

Suraj Prabhakaran suraj.prabhakaran at gmail.com
Mon Feb 24 11:37:32 CST 2014


Hello,

I am trying to use MPI_Comm_spawn in a Torque environment, but with the ssh launcher instead of the TM interface, since there seems to be a problem with launching a large number of processes through TM. A plain mpiexec across nodes generally works fine, but when I call MPI_Comm_spawn to spawn processes on another node, I get the error shown below.
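For reference, here is a minimal sketch along the lines of my ./example (my real code is a bit longer, and the names and output strings here are illustrative): every rank prints its pid and host, completes MPI_Init, and then all ranks collectively spawn a second set of processes, which Hydra places on the next node in the allocation.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char host[256];
    int rank;

    gethostname(host, sizeof(host));
    printf("[pid %d] starting up on host %s!\n", (int)getpid(), host);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("%d completed MPI_Init\n", rank);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        /* Only the original processes spawn; the spawned children
         * have a parent intercommunicator and skip this block. */
        MPI_Comm intercomm;
        printf("Parent [pid %d] about to spawn!\n", (int)getpid());
        /* Collective over MPI_COMM_WORLD; root 0 supplies the command. */
        MPI_Comm_spawn("./example", MPI_ARGV_NULL, 8, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

I launch the parent job with: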

mpiexec -launcher ssh -np 8 ./example

[pid 12026] starting up on host grsacc01!
[pid 12027] starting up on host grsacc01!
[pid 12028] starting up on host grsacc01!
[pid 12021] starting up on host grsacc01!
[pid 12023] starting up on host grsacc01!
[pid 12025] starting up on host grsacc01!
[pid 12022] starting up on host grsacc01!
[pid 12024] starting up on host grsacc01!
0 completed MPI_Init
4 completed MPI_Init
Parent [pid 12025] about to spawn!
5 completed MPI_Init
3 completed MPI_Init
Parent [pid 12024] about to spawn!
Parent [pid 12026] about to spawn!
2 completed MPI_Init
Parent [pid 12023] about to spawn!
7 completed MPI_Init
Parent [pid 12028] about to spawn!
Parent [pid 12021] about to spawn!
6 completed MPI_Init
Parent [pid 12027] about to spawn!
1 completed MPI_Init
Parent [pid 12022] about to spawn!
[pid 20535] starting up on host grsacc02!
[pid 20536] starting up on host grsacc02!
Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
internal ABORT - process 0
Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
internal ABORT - process 1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 20535 RUNNING AT grsacc02
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at grsacc01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0 at grsacc01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at grsacc01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at grsacc01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at grsacc01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at grsacc01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at grsacc01] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion


Is there a way to get rid of this?

Best,
Suraj



