[mpich-discuss] runtime error for mpich-master-v3.2-247-g1aec69b70951 with Solaris Sparc
Siegmar Gross
siegmar.gross at informatik.hs-fulda.de
Wed Apr 27 10:54:52 CDT 2016
Hi Min,
thank you very much for your help. I'll wait for your message that the
patch has been merged into the master branch.
Kind regards
Siegmar
On 27.04.2016 at 15:52, Min Si wrote:
> Hi Siegmar,
>
> I think this is the same issue you reported several months ago. The
> error is caused by an unaligned memory access in MPICH internal code,
> which is not allowed on SPARC machines. We have already finished a fix,
> but the patch is still under review, so it has not been merged into the
> MPICH master branch yet. I will let you know once it is in the master
> branch.
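>
> A minimal sketch of the kind of access involved (illustrative only, not
> the actual MPICH code): SPARC delivers SIGBUS on a misaligned load,
> while x86_64 silently performs it, which matches the platforms on which
> the error does and does not appear.
>
>     #include <stdio.h>
>     #include <stdint.h>
>
>     int main(void)
>     {
>         char buf[16] = {0};
>         /* buf + 1 is not 4-byte aligned */
>         uint32_t *p = (uint32_t *)(buf + 1);
>
>         /* On SPARC this load raises SIGBUS (bus error);
>          * on x86_64 it is merely an unaligned (slower) load. */
>         printf("%u\n", (unsigned)*p);
>         return 0;
>     }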
>
> Min
>
> On 4/21/16 9:21 PM, Siegmar Gross wrote:
>> Hi,
>>
>> I have built mpich-master-v3.2-247-g1aec69b70951 on my machines
>> (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64)
>> with gcc-5.1.0 and Sun C 5.13. I get the following errors for both
>> compilers if I run small programs that spawn processes on two Sparc
>> machines. Everything works fine if I use Linux and/or Solaris x86_64.
>> "mpiexec" is aliased to 'mpiexec -genvnone'. I get different errors,
>> if I run the same command several times as you can see below (sometimes
>> it even works as expected).
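>>
>> For reference, the failing program is essentially an MPI_Comm_spawn
>> master; a minimal sketch along these lines (an approximation, the real
>> spawn_master.c surely differs in details) is:
>>
>>     #include <stdio.h>
>>     #include <mpi.h>
>>
>>     #define NUM_SLAVES 4
>>
>>     int main(int argc, char *argv[])
>>     {
>>         MPI_Comm intercomm;
>>         int ntasks_world, ntasks_local, ntasks_remote;
>>
>>         MPI_Init(&argc, &argv);
>>         printf("I create %d slave processes\n", NUM_SLAVES);
>>
>>         /* spawn_slave just queries MPI_Comm_get_parent(), prints its
>>          * rank, host name, and argv[0], and finalizes */
>>         MPI_Comm_spawn("spawn_slave", MPI_ARGV_NULL, NUM_SLAVES,
>>                        MPI_INFO_NULL, 0, MPI_COMM_WORLD,
>>                        &intercomm, MPI_ERRCODES_IGNORE);
>>
>>         MPI_Comm_size(MPI_COMM_WORLD, &ntasks_world);
>>         MPI_Comm_size(intercomm, &ntasks_local);
>>         MPI_Comm_remote_size(intercomm, &ntasks_remote);
>>         printf("tasks in MPI_COMM_WORLD: %d\n"
>>                "tasks in COMM_CHILD_PROCESSES local group: %d\n"
>>                "tasks in COMM_CHILD_PROCESSES remote group: %d\n",
>>                ntasks_world, ntasks_local, ntasks_remote);
>>
>>         MPI_Comm_free(&intercomm);
>>         MPI_Finalize();
>>         return 0;
>>     }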
>>
>>
>> tyr spawn 119 mpichversion
>> MPICH Version: 3.2
>> MPICH Release date: Tue Apr 19 00:00:44 CDT 2016
>> MPICH Device: ch3:nemesis
>> MPICH configure: --prefix=/usr/local/mpich-3.2.1_64_gcc
>> --libdir=/usr/local/mpich-3.2.1_64_gcc/lib64
>> --includedir=/usr/local/mpich-3.2.1_64_gcc/include64 CC=gcc CXX=g++
>> F77=gfortran FC=gfortran CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64
>> LDFLAGS=-m64 -L/usr/lib/sparcv9 -Wl,-rpath -Wl,/usr/lib/sparcv9
>> --enable-fortran=yes --enable-cxx --enable-romio --enable-debuginfo
>> --enable-smpcoll --enable-threads=multiple --with-thread-package=posix
>> --enable-shared
>> MPICH CC: gcc -m64 -O2
>> MPICH CXX: g++ -m64 -O2
>> MPICH F77: gfortran -m64 -O2
>> MPICH FC: gfortran -m64 -O2
>>
>>
>> tyr spawn 120 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester spawn_master
>>
>> Parent process 0 running on tyr.informatik.hs-fulda.de
>> I create 4 slave processes
>>
>> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
>> MPI_Comm_spawn(144)...................: MPI_Comm_spawn(cmd="spawn_slave",
>> argv=0, maxprocs=4, MPI_INFO_NULL, root=0, MPI_COMM_WORLD,
>> intercomm=ffffffff7fffdf58, errors=0) failed
>> MPIDI_Comm_spawn_multiple(274)........:
>> MPID_Comm_accept(153).................:
>> MPIDI_Comm_accept(1039)...............:
>> MPIDU_Complete_posted_with_error(1137): Process failed
>>
>> ===================================================================================
>>
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 3182 RUNNING AT tyr
>> = EXIT CODE: 10
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>>
>>
>>
>>
>> tyr spawn 121 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester spawn_master
>>
>> Parent process 0 running on tyr.informatik.hs-fulda.de
>> I create 4 slave processes
>>
>> Parent process 0: tasks in MPI_COMM_WORLD: 1
>> tasks in COMM_CHILD_PROCESSES local group: 1
>> tasks in COMM_CHILD_PROCESSES remote group: 4
>>
>> Slave process 3 of 4 running on ruester.informatik.hs-fulda.de
>> Slave process 2 of 4 running on ruester.informatik.hs-fulda.de
>> spawn_slave 2: argv[0]: spawn_slave
>> spawn_slave 3: argv[0]: spawn_slave
>> Slave process 0 of 4 running on tyr.informatik.hs-fulda.de
>> spawn_slave 0: argv[0]: spawn_slave
>> Slave process 1 of 4 running on tyr.informatik.hs-fulda.de
>> spawn_slave 1: argv[0]: spawn_slave
>>
>>
>>
>> tyr spawn 122 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester spawn_master
>>
>> Parent process 0 running on tyr.informatik.hs-fulda.de
>> I create 4 slave processes
>>
>>
>>
>> tyr spawn 123 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester spawn_master
>>
>> Parent process 0 running on tyr.informatik.hs-fulda.de
>> I create 4 slave processes
>>
>> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
>> MPI_Comm_spawn(144)...................: MPI_Comm_spawn(cmd="spawn_slave",
>> argv=0, maxprocs=4, MPI_INFO_NULL, root=0, MPI_COMM_WORLD,
>> intercomm=ffffffff7fffdf58, errors=0) failed
>> MPIDI_Comm_spawn_multiple(274)........:
>> MPID_Comm_accept(153).................:
>> MPIDI_Comm_accept(1039)...............:
>> MPIDU_Complete_posted_with_error(1137): Process failed
>> tyr spawn 124 mpiexec -np 1 --host tyr,tyr,tyr,ruester,ruester spawn_master
>>
>> Parent process 0 running on tyr.informatik.hs-fulda.de
>> I create 4 slave processes
>>
>> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
>> MPI_Comm_spawn(144)...................: MPI_Comm_spawn(cmd="spawn_slave",
>> argv=0, maxprocs=4, MPI_INFO_NULL, root=0, MPI_COMM_WORLD,
>> intercomm=ffffffff7fffdf58, errors=0) failed
>> MPIDI_Comm_spawn_multiple(274)........:
>> MPID_Comm_accept(153).................:
>> MPIDI_Comm_accept(1039)...............:
>> MPIDU_Complete_posted_with_error(1137): Process failed
>>
>> ===================================================================================
>>
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 3466 RUNNING AT tyr
>> = EXIT CODE: 10
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>>
>> [proxy:0:0 at tyr.informatik.hs-fulda.de] HYD_pmcd_pmip_control_cmd_cb
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmip_cb.c:886):
>> assert (!closed) failed
>> [proxy:0:0 at tyr.informatik.hs-fulda.de[proxy:1:1 at ruester.informatik.hs-fulda.de]
>> HYD_pmcd_pmip_control_cmd_cb
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmip_cb.c]
>> HYDT_dmxu_poll_wait_for_event
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/tools/demux/demux_poll.c:77):
>> callback returned error status
>> [proxy:0:0 at tyr.informatik.hs-fulda.de] main
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmip.c:202):
>> demux engine error waiting for event
>> [mpiexec at tyr.informatik.hs-fulda.de] HYDT_bscu_wait_for_completion
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76):
>> one of the processes terminated badly; aborting
>> [mpiexec at tyr.informatik.hs-fulda.de] HYDT_bsci_wait_for_completion
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23):
>> launcher returned error waiting for completion
>> [mpiexec at tyr.informatik.hs-fulda.de] HYD_pmci_wait_for_completion
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218):
>> launcher returned error waiting for completion
>> [mpiexec at tyr.informatik.hs-fulda.de] main
>> (../../../../mpich-master-v3.2-247-g1aec69b70951/src/pm/hydra/ui/mpich/mpiexec.c:340):
>> process manager error waiting for completion
>> tyr spawn 125
>>
>>
>> I would be grateful if somebody could fix the problem. Thank you very
>> much in advance for any help.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss