[mpich-discuss] mpich-master-v3.2-331-g44fd9c5f39e5: runtime error spawning processes

Min Si msi at il.is.s.u-tokyo.ac.jp
Thu Jun 9 13:10:40 CDT 2016


Hi Siegmar,

This still appears to be caused by the unaligned memory access issue in 
MPICH. Unfortunately, the patch that fixes it is still in our review 
queue. I hope we can push it to master soon. Please follow this ticket 
to track the status (it will be closed once the fix has been pushed to 
master). Thanks.
https://trac.mpich.org/projects/mpich/ticket/2309
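
For context, here is a minimal sketch of what an unaligned access is and 
one common way to avoid it. It is an illustration only, not the MPICH 
patch, and the helper names load_u64 and store_u64 are hypothetical. On 
SPARC, dereferencing a uint64_t pointer that is not 8-byte aligned raises 
SIGBUS, while x86_64 usually tolerates it, which would be consistent with 
the failure showing up only on the Sparc machine.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 64-bit value from any byte offset.
 * memcpy makes no alignment assumption about src, unlike
 * *(const uint64_t *)src, which can trap on strict-alignment CPUs. */
static uint64_t load_u64(const unsigned char *src)
{
    uint64_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Hypothetical helper: write a 64-bit value to any byte offset. */
static void store_u64(unsigned char *dst, uint64_t value)
{
    memcpy(dst, &value, sizeof value);
}
```

Calling store_u64(buf + 1, v) followed by load_u64(buf + 1) round-trips 
the value even though buf + 1 is an odd address; compilers typically 
turn the memcpy into a plain load or store where the target allows it.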

Min

On 6/8/16 1:14 AM, Siegmar Gross wrote:
> Hi,
>
> I have built mpich-master-v3.2-331-g44fd9c5f39e5 on my machines (Solaris
> 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with
> gcc-5.1.0 and Sun C 5.13. Most of the time, spawning processes on a
> Sparc machine fails, and the error messages differ from run to run.
>
>
> tyr spawn 107 mpiexec -np 1 --host tyr,tyr,tyr,tyr,tyr spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
>   I create 4 slave processes
>
> Parent process 0: tasks in MPI_COMM_WORLD:                    1
>                   tasks in COMM_CHILD_PROCESSES local group:  1
>                   tasks in COMM_CHILD_PROCESSES remote group: 4
>
> Slave process 0 of 4 running on tyr.informatik.hs-fulda.de
> Slave process 1 of 4 running on tyr.informatik.hs-fulda.de
> Slave process 2 of 4 running on tyr.informatik.hs-fulda.de
> Slave process 3 of 4 running on tyr.informatik.hs-fulda.de
> spawn_slave 0: argv[0]: spawn_slave
> spawn_slave 1: argv[0]: spawn_slave
> spawn_slave 2: argv[0]: spawn_slave
> spawn_slave 3: argv[0]: spawn_slave
>
>
>
> tyr spawn 108 mpiexec -np 1 --host tyr,tyr,tyr,tyr,tyr spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
>   I create 4 slave processes
>
> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
> MPI_Comm_spawn(141)...................: MPI_Comm_spawn(cmd="spawn_slave", argv=0, maxprocs=4, MPI_INFO_NULL, root=0, MPI_COMM_WORLD, intercomm=ffffffff7fffdf58, errors=0) failed
> MPIDI_Comm_spawn_multiple(274)........:
> MPID_Comm_accept(153).................:
> MPIDI_Comm_accept(1039)...............:
> MPIDU_Complete_posted_with_error(1137): Process failed
> tyr spawn 109
>
>
>
> tyr spawn 111 mpiexec -np 1 --host tyr,tyr,tyr,tyr,tyr spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
>   I create 4 slave processes
>
>
> =================================================================================== 
>
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 3322 RUNNING AT tyr
> =   EXIT CODE: 10
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =================================================================================== 
>
> [proxy:0:0@tyr.informatik.hs-fulda.de] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-master-v3.2-331-g44fd9c5f39e5/src/pm/hydra/pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
> [proxy:0:0@tyr.informatik.hs-fulda.de] HYDT_dmxu_poll_wait_for_event (../../../../mpich-master-v3.2-331-g44fd9c5f39e5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0@tyr.informatik.hs-fulda.de] main (../../../../mpich-master-v3.2-331-g44fd9c5f39e5/src/pm/hydra/pm/pmiserv/pmip.c:202): demux engine error waiting for event
> [mpiexec@tyr.informatik.hs-fulda.de] HYDT_bscu_wait_for_completion (../../../../mpich-master-v3.2-331-g44fd9c5f39e5/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec@tyr.informatik.hs-fulda.de] HYDT_bsci_wait_for_completion (../../../../mpich-master-v3.2-331-g44fd9c5f39e5/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec@tyr.informatik.hs-fulda.de] HYD_pmci_wait_for_completion (../../../../mpich-master-v3.2-331-g44fd9c5f39e5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec@tyr.informatik.hs-fulda.de] main (../../../../mpich-master-v3.2-331-g44fd9c5f39e5/src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
> tyr spawn 112
>
>
>
>
> I would be grateful if somebody could fix the problem. Please let me
> know if you need more information. Thank you very much in advance for
> any help.
>
>
> Kind regards
>
> Siegmar
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
