[mpich-discuss] mpich-master-v3.2b4-211-gf91baf0296ce: error spawning processes

Siegmar Gross Siegmar.Gross at informatik.hs-fulda.de
Tue Sep 8 03:09:05 CDT 2015


Hi,

Yesterday I built mpich-master-v3.2b4-211-gf91baf0296ce on my
machines (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE
Linux 12.1 x86_64) with gcc-5.1.0 and Sun C 5.13. I get the
following errors when I run small programs that spawn processes
on two Sparc machines. "mpiexec" is aliased to 'mpiexec -genvnone'.
It doesn't matter whether I use the Sun C or the gcc build of MPICH.

tyr spawn 120 mpiexec -np 1 --host tyr,rs0 spawn_master

Parent process 0 running on tyr.informatik.hs-fulda.de
   I create 4 slave processes

Fatal error in MPI_Init: Unknown error class, error stack:
MPIR_Init_thread(472).................:
MPID_Init(302)........................: spawned process group was unable 
to connect back to the parent on port 
<tag#0$description#tyr$port#40568$ifname#193.174.24.39$>
MPID_Comm_connect(191)................:
MPIDI_Comm_connect(488)...............:
SetupNewIntercomm(1187)...............:
MPIR_Barrier_intra(150)...............:
barrier_smp_intra(96).................:
MPIR_Barrier_impl(332)................: Failure during collective
MPIR_Barrier_impl(327)................:
MPIR_Barrier(292).....................:
MPIR_Barrier_intra(169)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
barrier_smp_intra(111)................:
MPIR_Bcast_impl(1452).................:
MPIR_Bcast(1476)......................:
MPIR_Bcast_intra(1287)................:
MPIR_Bcast_binomial(310)..............: Failure during collective
Fatal error in MPI_Init: Unknown error class, error stack:
MPIR_Init_thread(472)...:
MPID_Init(302)..........: spawned process group was unable to connect 
back to the parent on port 
<tag#0$description#tyr$port#40568$ifname#193.174.24.39$>
MPID_Comm_connect(191)..:
MPIDI_Comm_connect(488).:
SetupNewIntercomm(1187).:
MPIR_Barrier_intra(150).:
barrier_smp_intra(111)..:
MPIR_Bcast_impl(1452)...:
MPIR_Bcast(1476)........:
MPIR_Bcast_intra(1287)..:
MPIR_Bcast_binomial(310): Failure during collective
tyr spawn 121
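

For context, spawn_master does essentially the following. This is only
a minimal sketch assuming the usual MPI_Comm_spawn pattern; the real
test source is not part of this mail, and the child executable name
"spawn_slave" is taken from the output of the working runs below.

/* hypothetical reconstruction of a spawn_master-style parent */
#include <stdio.h>
#include "mpi.h"

#define NUM_SLAVES 4   /* matches "I create 4 slave processes" above */

int main(int argc, char *argv[])
{
    MPI_Comm child_comm;
    int      ntasks_world, ntasks_local, ntasks_remote, mytid, namelen;
    char     processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mytid);
    MPI_Get_processor_name(processor_name, &namelen);
    if (mytid == 0) {
        printf("\nParent process 0 running on %s\n", processor_name);
        printf("  I create %d slave processes\n\n", NUM_SLAVES);
    }
    /* spawn the slaves; the children connect back during MPI_Init */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
                   MPI_INFO_NULL, 0, MPI_COMM_WORLD, &child_comm,
                   MPI_ERRCODES_IGNORE);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks_world);
    MPI_Comm_size(child_comm, &ntasks_local);
    MPI_Comm_remote_size(child_comm, &ntasks_remote);
    if (mytid == 0) {
        printf("Parent: world %d, child local %d, child remote %d\n",
               ntasks_world, ntasks_local, ntasks_remote);
    }
    MPI_Comm_free(&child_comm);
    MPI_Finalize();
    return 0;
}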



With "mpiexec -np 1 --host tyr,rs0 spawn_multiple_master" and
"mpiexec -np 1 --host tyr,rs0 spawn_intra_comm" I get the following
error, or something similar to the error message above.

tyr spawn 127 mpiexec -np 1 --host tyr,rs0 spawn_multiple_master

Parent process 0 running on tyr.informatik.hs-fulda.de
   I create 3 slave processes.

Fatal error in MPI_Comm_spawn_multiple: Unknown error class, error stack:
MPI_Comm_spawn_multiple(162)..: MPI_Comm_spawn_multiple(count=2, 
cmds=ffffffff7fffdf08, argvs=ffffffff7fffdef8, 
maxprocs=ffffffff7fffdef0, infos=ffffffff7fffdee8, root=0, 
MPI_COMM_WORLD, intercomm=ffffffff7fffdee4, errors=0) failed
MPIDI_Comm_spawn_multiple(274):
MPID_Comm_accept(153).........:
MPIDI_Comm_accept(1057).......:
MPIR_Bcast_intra(1287)........:
MPIR_Bcast_binomial(310)......: Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 14925 RUNNING AT rs0
=   EXIT CODE: 10
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
tyr spawn 128
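

For reference, the MPI_Comm_spawn_multiple call that fails above
(count=2, three slaves in total) presumably looks roughly like the
sketch below. The command name "spawn_slave" and the argument strings
are taken from the argv output of the working run further down; the
real spawn_multiple_master source is not attached.

/* hypothetical sketch of a spawn_multiple_master-style parent */
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Comm  child_comm;
    char     *cmds[2]     = {"spawn_slave", "spawn_slave"};
    char     *argv_1[]    = {"program type 1", NULL};
    char     *argv_2[]    = {"program type 2", "another parameter", NULL};
    char    **spawn_argv[2];
    int       maxprocs[2] = {1, 2};            /* 3 slaves in total */
    MPI_Info  infos[2]    = {MPI_INFO_NULL, MPI_INFO_NULL};

    spawn_argv[0] = argv_1;
    spawn_argv[1] = argv_2;

    MPI_Init(&argc, &argv);
    /* spawn two different "program types" of the same executable */
    MPI_Comm_spawn_multiple(2, cmds, spawn_argv, maxprocs, infos,
                            0, MPI_COMM_WORLD, &child_comm,
                            MPI_ERRCODES_IGNORE);
    MPI_Comm_free(&child_comm);
    MPI_Finalize();
    return 0;
}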





Sometimes I also get this error message.

tyr spawn 129 mpiexec -np 1 --host tyr,rs0 spawn_multiple_master

Parent process 0 running on tyr.informatik.hs-fulda.de
   I create 3 slave processes.


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 11444 RUNNING AT tyr
=   EXIT CODE: 10
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at tyr.informatik.hs-fulda.de] HYD_pmcd_pmip_control_cmd_cb 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): 
assert (!closed) failed
[proxy:0:0 at tyr.informatik.hs-fulda.de] HYDT_dmxu_poll_wait_for_event 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/tools/demux/demux_poll.c:76): 
callback returned error status
[proxy:0:0 at tyr.informatik.hs-fulda.de] main 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/pm/pmiserv/pmip.c:206): 
demux engine error waiting for event
[proxy:1:1 at rs0.informatik.hs-fulda.de] HYD_pmcd_pmip_control_cmd_cb 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): 
assert (!closed) failed
[proxy:1:1 at rs0.informatik.hs-fulda.de] HYDT_dmxu_poll_wait_for_event 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/tools/demux/demux_poll.c:76): 
callback returned error status
[proxy:1:1 at rs0.informatik.hs-fulda.de] main 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/pm/pmiserv/pmip.c:206): 
demux engine error waiting for event
[mpiexec at tyr.informatik.hs-fulda.de] HYDT_bscu_wait_for_completion 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:75): 
one of the processes terminated badly; aborting
[mpiexec at tyr.informatik.hs-fulda.de] HYDT_bsci_wait_for_completion 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): 
launcher returned error waiting for completion
[mpiexec at tyr.informatik.hs-fulda.de] HYD_pmci_wait_for_completion 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): 
launcher returned error waiting for completion
[mpiexec at tyr.informatik.hs-fulda.de] main 
(../../../../mpich-master-v3.2b4-211-gf91baf0296ce/src/pm/hydra/ui/mpich/mpiexec.c:344): 
process manager error waiting for completion
tyr spawn 130




Sometimes it even works.

tyr spawn 131 mpiexec -np 1 --host tyr,rs0 spawn_multiple_master

Parent process 0 running on tyr.informatik.hs-fulda.de
   I create 3 slave processes.

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                   tasks in COMM_CHILD_PROCESSES local group:  1
                   tasks in COMM_CHILD_PROCESSES remote group: 3

Slave process 2 of 3 running on rs0.informatik.hs-fulda.de
Slave process 0 of 3 running on rs0.informatik.hs-fulda.de
Slave process 1 of 3 running on tyr.informatik.hs-fulda.de
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 2: argv[1]: program type 2
spawn_slave 2: argv[2]: another parameter
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
tyr spawn 132







The programs seem to work fine on my x86_64 machines; at least I was
not able to provoke an error there.

tyr spawn 121 ssh linpc1
linpc1 fd1026 107 mpiexec -np 1 --host sunpc0,linpc1 spawn_master

Parent process 0 running on sunpc0
   I create 4 slave processes

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                   tasks in COMM_CHILD_PROCESSES local group:  1
                   tasks in COMM_CHILD_PROCESSES remote group: 4

Slave process 0 of 4 running on linpc0
Slave process 2 of 4 running on linpc0
Slave process 1 of 4 running on sunpc0
Slave process 3 of 4 running on sunpc0
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 3: argv[0]: spawn_slave
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 2: argv[0]: spawn_slave



linpc1 fd1026 102 mpiexec -np 1 --host sunpc0,linpc0 spawn_multiple_master

Parent process 0 running on sunpc0
   I create 3 slave processes.

Parent process 0: tasks in MPI_COMM_WORLD:                    1
                   tasks in COMM_CHILD_PROCESSES local group:  1
                   tasks in COMM_CHILD_PROCESSES remote group: 3

Slave process 0 of 3 running on linpc0
Slave process 2 of 3 running on linpc0
Slave process 1 of 3 running on sunpc0
spawn_slave 0: argv[0]: spawn_slave
spawn_slave 0: argv[1]: program type 1
spawn_slave 1: argv[0]: spawn_slave
spawn_slave 1: argv[1]: program type 2
spawn_slave 1: argv[2]: another parameter
spawn_slave 2: argv[0]: spawn_slave
spawn_slave 2: argv[1]: program type 2
spawn_slave 2: argv[2]: another parameter
linpc1 fd1026 103



linpc1 fd1026 103 mpiexec -np 1 --host sunpc0,linpc0 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on sunpc0
     MPI_COMM_WORLD ntasks:              1
     COMM_CHILD_PROCESSES ntasks_local:  1
     COMM_CHILD_PROCESSES ntasks_remote: 2
     COMM_ALL_PROCESSES ntasks:          3
     mytid in COMM_ALL_PROCESSES:        0

Child process 1 running on sunpc0
     MPI_COMM_WORLD ntasks:              2
     COMM_ALL_PROCESSES ntasks:          3
     mytid in COMM_ALL_PROCESSES:        2

Child process 0 running on linpc0
     MPI_COMM_WORLD ntasks:              2
     COMM_ALL_PROCESSES ntasks:          3
     mytid in COMM_ALL_PROCESSES:        1
linpc1 fd1026 104
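

For completeness, the parent side of spawn_intra_comm presumably does
something like the following. This is only a sketch based on the
output above; the real program is likely a single executable that
distinguishes parent and children via MPI_Comm_get_parent, which is
omitted here.

/* hypothetical sketch of the parent side of spawn_intra_comm */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Comm COMM_CHILD_PROCESSES, COMM_ALL_PROCESSES;
    int      ntasks_world, ntasks_all, mytid_all;

    MPI_Init(&argc, &argv);
    /* spawn 2 children ("I create 2 slave processes" above) */
    MPI_Comm_spawn("spawn_intra_comm", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &COMM_CHILD_PROCESSES,
                   MPI_ERRCODES_IGNORE);
    /* merge the intercommunicator into one intra-communicator;
     * high = 0 on the parent side so the parent rank comes first */
    MPI_Intercomm_merge(COMM_CHILD_PROCESSES, 0, &COMM_ALL_PROCESSES);

    MPI_Comm_size(MPI_COMM_WORLD, &ntasks_world);
    MPI_Comm_size(COMM_ALL_PROCESSES, &ntasks_all);
    MPI_Comm_rank(COMM_ALL_PROCESSES, &mytid_all);
    printf("MPI_COMM_WORLD ntasks: %d, COMM_ALL_PROCESSES ntasks: %d, "
           "mytid in COMM_ALL_PROCESSES: %d\n",
           ntasks_world, ntasks_all, mytid_all);

    MPI_Comm_free(&COMM_ALL_PROCESSES);
    MPI_Comm_free(&COMM_CHILD_PROCESSES);
    MPI_Finalize();
    return 0;
}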



Kind regards

Siegmar
