[mpich-discuss] runtime error for mpich-master-v3.2-247-g1aec69b70951 with Solaris Sparc
Min Si
msi at il.is.s.u-tokyo.ac.jp
Wed Apr 27 08:52:34 CDT 2016
Hi Siegmar,
I think this is the same issue that you reported several months ago.
The error is caused by unaligned memory access in MPICH's internal code,
which is not allowed on SPARC machines. We have already finished a patch
that fixes it, but the patch is still under review, so it has not been
merged into the MPICH master branch yet. I will let you know once it is
in master.
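
For illustration, here is a minimal sketch of the kind of access that is
at issue (illustrative C only, not the actual MPICH code):

/* Unaligned access demo -- illustrative only, not MPICH source.        */
/* SPARC requires naturally aligned loads/stores; dereferencing a       */
/* misaligned pointer raises SIGBUS there, while x86 tolerates it.      */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[16];
    int value = 42;

    /* Place an int at an odd offset inside the buffer. */
    memcpy(buf + 1, &value, sizeof(int));

    /* Unaligned load: undefined behavior in C; traps with SIGBUS on SPARC. */
    int *p = (int *) (buf + 1);
    printf("direct load: %d\n", *p);

    /* The usual fix: copy through memcpy into an aligned variable. */
    int tmp;
    memcpy(&tmp, buf + 1, sizeof(int));
    printf("memcpy load: %d\n", tmp);
    return 0;
}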
Min
On 4/21/16 9:21 PM, Siegmar Gross wrote:
> Hi,
>
> I have built mpich-master-v3.2-247-g1aec69b70951 on my machines
> (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64)
> with gcc-5.1.0 and Sun C 5.13. I get the following errors for both
> compilers if I run small programs that spawn processes on two Sparc
> machines. Everything works fine if I use Linux and/or Solaris x86_64.
> "mpiexec" is aliased to 'mpiexec -genvnone'. I get different errors,
> if I run the same command several times as you can see below (sometimes
> it even works as expected).
>
>
> tyr spawn 119 mpichversion
> MPICH Version: 3.2
> MPICH Release date: Tue Apr 19 00:00:44 CDT 2016
> MPICH Device: ch3:nemesis
> MPICH configure: --prefix=/usr/local/mpich-3.2.1_64_gcc
> --libdir=/usr/local/mpich-3.2.1_64_gcc/lib64
> --includedir=/usr/local/mpich-3.2.1_64_gcc/include64 CC=gcc CXX=g++
> F77=gfortran FC=gfortran CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64
> FCFLAGS=-m64 LDFLAGS=-m64 -L/usr/lib/sparcv9 -Wl,-rpath
> -Wl,/usr/lib/sparcv9 --enable-fortran=yes --enable-cxx --enable-romio
> --enable-debuginfo --enable-smpcoll --enable-threads=multiple
> --with-thread-package=posix --enable-shared
> MPICH CC: gcc -m64 -O2
> MPICH CXX: g++ -m64 -O2
> MPICH F77: gfortran -m64 -O2
> MPICH FC: gfortran -m64 -O2
>
>
> tyr spawn 120 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester
> spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
> I create 4 slave processes
>
> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
> MPI_Comm_spawn(144)...................:
> MPI_Comm_spawn(cmd="spawn_slave", argv=0, maxprocs=4, MPI_INFO_NULL,
> root=0, MPI_COMM_WORLD, intercomm=ffffffff7fffdf58, errors=0) failed
> MPIDI_Comm_spawn_multiple(274)........:
> MPID_Comm_accept(153).................:
> MPIDI_Comm_accept(1039)...............:
> MPIDU_Complete_posted_with_error(1137): Process failed
>
> ===================================================================================
>
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 3182 RUNNING AT tyr
> = EXIT CODE: 10
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
>
>
>
> tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester
> spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
> I create 4 slave processes
>
> Parent process 0: tasks in MPI_COMM_WORLD: 1
> tasks in COMM_CHILD_PROCESSES local group: 1
> tasks in COMM_CHILD_PROCESSES remote group: 4
>
> Slave process 3 of 4 running on ruester.informatik.hs-fulda.de
> Slave process 2 of 4 running on ruester.informatik.hs-fulda.de
> spawn_slave 2: argv[0]: spawn_slave
> spawn_slave 3: argv[0]: spawn_slave
> Slave process 0 of 4 running on tyr.informatik.hs-fulda.de
> spawn_slave 0: argv[0]: spawn_slave
> Slave process 1 of 4 running on tyr.informatik.hs-fulda.de
> spawn_slave 1: argv[0]: spawn_slave
>
>
>
> tyr spawn 122 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester
> spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
> I create 4 slave processes
>
>
>
> tyr spawn 123 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester
> spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
> I create 4 slave processes
>
> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
> MPI_Comm_spawn(144)...................:
> MPI_Comm_spawn(cmd="spawn_slave", argv=0, maxprocs=4, MPI_INFO_NULL,
> root=0, MPI_COMM_WORLD, intercomm=ffffffff7fffdf58, errors=0) failed
> MPIDI_Comm_spawn_multiple(274)........:
> MPID_Comm_accept(153).................:
> MPIDI_Comm_accept(1039)...............:
> MPIDU_Complete_posted_with_error(1137): Process failed
> tyr spawn 124 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester
> spawn_master
>
> Parent process 0 running on tyr.informatik.hs-fulda.de
> I create 4 slave processes
>
> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
> MPI_Comm_spawn(144)...................:
> MPI_Comm_spawn(cmd="spawn_slave", argv=0, maxprocs=4, MPI_INFO_NULL,
> root=0, MPI_COMM_WORLD, intercomm=ffffffff7fffdf58, errors=0) failed
> MPIDI_Comm_spawn_multiple(274)........:
> MPID_Comm_accept(153).................:
> MPIDI_Comm_accept(1039)...............:
> MPIDU_Complete_posted_with_error(1137): Process failed
>
> ===================================================================================
>
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 3466 RUNNING AT tyr
> = EXIT CODE: 10
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
> [proxy:0:0 at tyr.informatik.hs-fulda.de] HYD_pmcd_pmip_control_cmd_cb
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmip_cb.c:886):
> assert (!closed) failed
> [proxy:0:0 at tyr.informatik.hs-fulda.de[proxy:1:1 at ruester.informatik.hs-fulda.de]
> HYD_pmcd_pmip_control_cmd_cb
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmip_cb.c]
> HYDT_dmxu_poll_wait_for_event
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/tools/demux/demux_poll.c:77):
> callback returned error status
> [proxy:0:0 at tyr.informatik.hs-fulda.de] main
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmip.c:202):
> demux engine error waiting for event
> [mpiexec at tyr.informatik.hs-fulda.de] HYDT_bscu_wait_for_completion
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76):
> one of the processes terminated badly; aborting
> [mpiexec at tyr.informatik.hs-fulda.de] HYDT_bsci_wait_for_completion
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23):
> launcher returned error waiting for completion
> [mpiexec at tyr.informatik.hs-fulda.de] HYD_pmci_wait_for_completion
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218):
> launcher returned error waiting for completion
> [mpiexec at tyr.informatik.hs-fulda.de] main
> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/ui/mpich/mpiexec.c:340):
> process manager error waiting for completion
> tyr spawn 125
>
>
> I would be grateful if somebody could fix the problem. Thank you very
> much in advance for any help.
>
>
> Kind regards
>
> Siegmar
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
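
For reference, a minimal sketch of the kind of spawn test described in the
quoted report (an assumption of what spawn_master/spawn_slave might look
like, reconstructed from the error stack and the printed output; not
Siegmar's actual sources):

/* spawn_master.c -- hypothetical reconstruction, not the original test. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, world_size, local_size, remote_size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Get_processor_name(host, &len);

    printf("Parent process %d running on %s\n", rank, host);
    printf("I create 4 slave processes\n");

    /* The failing call from the error stack: cmd="spawn_slave", maxprocs=4 */
    MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Comm_size(intercomm, &local_size);
    MPI_Comm_remote_size(intercomm, &remote_size);
    printf("tasks in MPI_COMM_WORLD: %d\n"
           "tasks in COMM_CHILD_PROCESSES local group: %d\n"
           "tasks in COMM_CHILD_PROCESSES remote group: %d\n",
           world_size, local_size, remote_size);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

/* spawn_slave.c -- hypothetical counterpart; prints its rank and argv[0]. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    MPI_Comm_get_parent(&parent);

    printf("Slave process %d of %d running on %s\n", rank, size, host);
    printf("spawn_slave %d: argv[0]: %s\n", rank, argv[0]);

    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}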