[mpich-discuss] MPI_Comm_spawn under torque
Suraj Prabhakaran
suraj.prabhakaran at gmail.com
Mon Feb 24 11:37:32 CST 2014
Hello,
I am trying to use MPI_Comm_spawn under the Torque environment, but with the ssh launcher rather than the TM interface, since there seems to be a problem with launching large numbers of processes through TM. A plain mpiexec across nodes works fine, but when I call MPI_Comm_spawn to spawn onto another node, I get the error below (a sketch of my test program follows after the trace).
mpiexec -launcher ssh -np 8 ./example
[pid 12026] starting up on host grsacc01!
[pid 12027] starting up on host grsacc01!
[pid 12028] starting up on host grsacc01!
[pid 12021] starting up on host grsacc01!
[pid 12023] starting up on host grsacc01!
[pid 12025] starting up on host grsacc01!
[pid 12022] starting up on host grsacc01!
[pid 12024] starting up on host grsacc01!
0 completed MPI_Init
4 completed MPI_Init
Parent [pid 12025] about to spawn!
5 completed MPI_Init
3 completed MPI_Init
Parent [pid 12024] about to spawn!
Parent [pid 12026] about to spawn!
2 completed MPI_Init
Parent [pid 12023] about to spawn!
7 completed MPI_Init
Parent [pid 12028] about to spawn!
Parent [pid 12021] about to spawn!
6 completed MPI_Init
Parent [pid 12027] about to spawn!
1 completed MPI_Init
Parent [pid 12022] about to spawn!
[pid 20535] starting up on host grsacc02!
[pid 20536] starting up on host grsacc02!
Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
internal ABORT - process 0
Assertion failed in file src/util/procmap/local_proc.c at line 112: my_node_id <= max_node_id
internal ABORT - process 1
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 20535 RUNNING AT grsacc02
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at grsacc01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0 at grsacc01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at grsacc01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at grsacc01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at grsacc01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at grsacc01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at grsacc01] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
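For reference, the test program does roughly the following. This is only a minimal sketch reconstructed from the output above, since the actual ./example source isn't included here; in particular the spawn count of 2 (matching the two pids seen on grsacc02) and the absence of an MPI_Info "host" hint are assumptions.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    char host[256] = "unknown";
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    gethostname(host, sizeof(host));
    printf("[pid %d] starting up on host %s!\n", (int)getpid(), host);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("%d completed MPI_Init\n", rank);

    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        /* Only the original mpiexec-launched ranks spawn. The spawn is
         * collective over MPI_COMM_WORLD, so every parent prints this
         * line; Hydra then places the children on the next host in the
         * Torque-provided node list (grsacc02 in the run above). */
        printf("Parent [pid %d] about to spawn!\n", (int)getpid());
        MPI_Comm_spawn("./example", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

All eight parents run on grsacc01, and the two spawned children come up on grsacc02; it is exactly when those children start on the second node that the my_node_id <= max_node_id assertion fires.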
Is there a way to work around this?
Best,
Suraj