[mpich-discuss] error spawning processes in mpich-3.2rc1

Min Si msi at il.is.s.u-tokyo.ac.jp
Wed Oct 21 11:35:25 CDT 2015


Hi Siegmar,

Thanks for providing us with the test machine. We have confirmed that this 
failure is caused by an unaligned memory access inside MPICH, which is why 
it only happens on SPARC: that architecture is alignment-sensitive.
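
For illustration, here is a minimal sketch of this class of bug (this is 
not the actual MPICH code, just an example of an access that x86/x86_64 
tolerates but SPARC does not):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[16];

        memset(buf, 0, sizeof(buf));

        /* buf + 1 is in general not 4-byte aligned, so treating it as an
         * int * is undefined behavior: it usually works on x86/x86_64 but
         * raises SIGBUS on alignment-sensitive SPARC. */
        int *p = (int *)(buf + 1);
        *p = 42;

        printf("%d\n", *p);
        return 0;
    }

The usual fix is to copy such data with memcpy() (or keep it naturally 
aligned) instead of dereferencing a misaligned pointer.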

We will fix it; you can track the progress in this ticket:
https://trac.mpich.org/projects/mpich/ticket/2309#ticket

Because we do not have another SPARC platform, would you mind if we 
continue to use your machine for testing while we work on the fix?

Best regards,
Min
On 10/12/15 9:24 AM, Siegmar Gross wrote:
> Hi Min,
>
>> It seems you have already enabled the most detailed error output. We
>> cannot think of any further clue for now. If you can give us access to
>> your machine, we would be glad to help you debug on it.
>
> Could you send me your email address? I don't want to send login data
> to this list.
>
>
> Kind regards
>
> Siegmar
>
>
>>
>> Min
>>
>> On 10/8/15 12:02 AM, Siegmar Gross wrote:
>>> Hi Min,
>>>
>>> thank you very much for your answer.
>>>
>>>> We cannot reproduce this error with your programs on our test machines
>>>> (Solaris i386, Ubuntu x86_64). Unfortunately, we do not have a Solaris
>>>> SPARC machine, so we could not verify it.
>>>
>>> The programs work fine on my Solaris x86_64 and Linux machines
>>> as well. I only have a problem on Solaris Sparc.
>>>
>>>
>>>> Sometimes you need to add "./" in front of the program path; could you
>>>> try that?
>>>>
>>>> For example, in spawn_master.c:
>>>>> #define SLAVE_PROG "./spawn_slave"
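>>>> Something along these lines (a minimal sketch, not your actual
>>>> spawn_master.c) is what we have in mind:
>>>>
>>>>     #include <mpi.h>
>>>>
>>>>     int main(int argc, char *argv[])
>>>>     {
>>>>         MPI_Comm intercomm;
>>>>
>>>>         MPI_Init(&argc, &argv);
>>>>
>>>>         /* "./spawn_slave" instead of "spawn_slave": the slave binary
>>>>          * is then looked up relative to the current working
>>>>          * directory. */
>>>>         MPI_Comm_spawn("./spawn_slave", MPI_ARGV_NULL, 4,
>>>>                        MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm,
>>>>                        MPI_ERRCODES_IGNORE);
>>>>
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }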
>>>
>>> No, that will not work, because the programs are stored in a different
>>> directory ($HOME/{SunOS, Linux}/{sparc, x86_64}/bin), which is part of
>>> PATH (as is ".").
>>>
>>> Can I do anything to track down the source of the error?
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>>
>>>> Min
>>>>
>>>> On 10/7/15 5:03 AM, Siegmar Gross wrote:
>>>>> Hi,
>>>>>
>>>>> today I built mpich-3.2rc1 on my machines (Solaris 10 Sparc,
>>>>> Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with gcc-5.1.0
>>>>> and Sun C 5.13. I still get the following errors on my Sparc machine,
>>>>> which I had already reported on September 8th. "mpiexec" is aliased
>>>>> to 'mpiexec -genvnone'. It does not matter whether I use my cc or
>>>>> gcc build of MPICH.
>>>>>
>>>>>
>>>>> tyr spawn 119 mpichversion
>>>>> MPICH Version:          3.2rc1
>>>>> MPICH Release date:     Wed Oct  7 00:00:33 CDT 2015
>>>>> MPICH Device:           ch3:nemesis
>>>>> MPICH configure: --prefix=/usr/local/mpich-3.2_64_cc
>>>>> --libdir=/usr/local/mpich-3.2_64_cc/lib64
>>>>> --includedir=/usr/local/mpich-3.2_64_cc/include64 CC=cc CXX=CC 
>>>>> F77=f77
>>>>> FC=f95 CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 
>>>>> LDFLAGS=-m64
>>>>> -L/usr/lib/sparcv9 -R/usr/lib/sparcv9 --enable-fortran=yes
>>>>> --enable-cxx --enable-romio --enable-debuginfo --enable-smpcoll
>>>>> --enable-threads=multiple --with-thread-package=posix --enable-shared
>>>>> MPICH CC:       cc -m64   -O2
>>>>> MPICH CXX:      CC -m64  -O2
>>>>> MPICH F77:      f77 -m64
>>>>> MPICH FC:       f95 -m64  -O2
>>>>> tyr spawn 120
>>>>>
>>>>>
>>>>>
>>>>> tyr spawn 111 mpiexec -np 1 spawn_master
>>>>>
>>>>> Parent process 0 running on tyr.informatik.hs-fulda.de
>>>>>   I create 4 slave processes
>>>>>
>>>>> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
>>>>> MPI_Comm_spawn(144)...........: MPI_Comm_spawn(cmd="spawn_slave",
>>>>> argv=0, maxprocs=4, MPI_INFO_NULL, root=0, MPI_COMM_WORLD,
>>>>> intercomm=ffffffff7fffde50, errors=0) failed
>>>>> MPIDI_Comm_spawn_multiple(274):
>>>>> MPID_Comm_accept(153).........:
>>>>> MPIDI_Comm_accept(1057).......:
>>>>> MPIR_Bcast_intra(1287)........:
>>>>> MPIR_Bcast_binomial(310)......: Failure during collective
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> tyr spawn 112 mpiexec -np 1 spawn_multiple_master
>>>>>
>>>>> Parent process 0 running on tyr.informatik.hs-fulda.de
>>>>>   I create 3 slave processes.
>>>>>
>>>>> Fatal error in MPI_Comm_spawn_multiple: Unknown error class, error
>>>>> stack:
>>>>> MPI_Comm_spawn_multiple(162)..: MPI_Comm_spawn_multiple(count=2,
>>>>> cmds=ffffffff7fffde08, argvs=ffffffff7fffddf8,
>>>>> maxprocs=ffffffff7fffddf0, infos=ffffffff7fffdde8, root=0,
>>>>> MPI_COMM_WORLD, intercomm=ffffffff7fffdde4, errors=0) failed
>>>>> MPIDI_Comm_spawn_multiple(274):
>>>>> MPID_Comm_accept(153).........:
>>>>> MPIDI_Comm_accept(1057).......:
>>>>> MPIR_Bcast_intra(1287)........:
>>>>> MPIR_Bcast_binomial(310)......: Failure during collective
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> tyr spawn 113 mpiexec -np 1 spawn_intra_comm
>>>>> Parent process 0: I create 2 slave processes
>>>>> Fatal error in MPI_Comm_spawn: Unknown error class, error stack:
>>>>> MPI_Comm_spawn(144)...........: 
>>>>> MPI_Comm_spawn(cmd="spawn_intra_comm",
>>>>> argv=0, maxprocs=2, MPI_INFO_NULL, root=0, MPI_COMM_WORLD,
>>>>> intercomm=ffffffff7fffded4, errors=0) failed
>>>>> MPIDI_Comm_spawn_multiple(274):
>>>>> MPID_Comm_accept(153).........:
>>>>> MPIDI_Comm_accept(1057).......:
>>>>> MPIR_Bcast_intra(1287)........:
>>>>> MPIR_Bcast_binomial(310)......: Failure during collective
>>>>> tyr spawn 114
>>>>>
>>>>>
>>>>> I would be grateful if somebody could fix the problem. Thank you very
>>>>> much in advance for any help. I have attached my programs. Please let
>>>>> me know if you need anything else.
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Siegmar
>>>>>
>>>>>
