[mpich-discuss] MPI_Gather fails with 2048 processes and 4096 MB total

Florian.Willich at dlr.de Florian.Willich at dlr.de
Wed Dec 2 02:15:38 CST 2015


Hi Rob,

well, maybe I was addressing the wrong organisation... I am currently testing on the Cray Swan supercomputer, which provides the module cray-mpich/7.2.6 ("Cray Message Passing Toolkit 7.2.6").

I cannot determine whether Cray's MPICH is MPICH with additional implementations or something entirely different from the official MPICH releases. Additionally, I cannot figure out which MPICH version this cray-mpich module is based on. I'll continue investigating and keep you updated.
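
For what it's worth, here is a minimal sketch of how I plan to query the library itself for version information. MPI_Get_library_version and MPI_MAX_LIBRARY_VERSION_STRING are standard MPI-3; the MPICH_VERSION macro is guarded with #ifdef because I do not know whether cray-mpich's mpi.h defines it:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Implementation-specific version string (MPI-3). */
    MPI_Get_library_version(version, &len);

    if (rank == 0) {
        printf("library: %s\n", version);
#ifdef MPICH_VERSION
        /* Compile-time version of the underlying MPICH, if exposed. */
        printf("MPICH_VERSION: %s\n", MPICH_VERSION);
#endif
    }

    MPI_Finalize();
    return 0;
}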

Best Regards

Florian
________________________________________
From: Rob Latham [robl at mcs.anl.gov]
Sent: Tuesday, December 1, 2015 16:48
To: discuss at mpich.org
Subject: Re: [mpich-discuss] MPI_Gather fails with 2048 processes and 4096 MB total

On 11/26/2015 12:38 PM, Archer, Charles J wrote:
> FYI, we hit various flavors of this problem when I was still at IBM, I think mostly in weather codes.
> Apparently Cray hit this too:
>
> https://trac.mpich.org/projects/mpich/ticket/1767
>
> We pretty much told our customers back then that a fix was forthcoming (with no ETA :)) with the revamp of datatypes to use internal 64-bit counts.
> We also provided workarounds.
>
> In the case of this gather operation, we asked the customer to implement the gather as a flat tree using point-to-point:
> the root posts irecvs, then a barrier, then the children send to the root.
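>
> Roughly the shape of it, for the archives (a from-memory sketch, not the code we actually shipped; the tag is arbitrary and the local copy assumes a contiguous datatype):
>
> #include <mpi.h>
> #include <stdlib.h>
> #include <string.h>
>
> /* Flat-tree gather: the root pre-posts one irecv per rank, everyone
>  * synchronizes so the receives are guaranteed to be posted, then each
>  * non-root rank sends its block straight to the root. */
> int flat_gather(const void *sendbuf, int count, MPI_Datatype type,
>                 void *recvbuf, int root, MPI_Comm comm)
> {
>     int rank, size, i, tag = 4242;
>     MPI_Aint lb, extent;
>
>     MPI_Comm_rank(comm, &rank);
>     MPI_Comm_size(comm, &size);
>     MPI_Type_get_extent(type, &lb, &extent);
>
>     if (rank == root) {
>         MPI_Request *reqs = malloc(size * sizeof(MPI_Request));
>         for (i = 0; i < size; i++) {
>             char *dst = (char *) recvbuf + (MPI_Aint) i * count * extent;
>             if (i == root) {
>                 memcpy(dst, sendbuf, (size_t) count * extent);
>                 reqs[i] = MPI_REQUEST_NULL;
>             } else {
>                 MPI_Irecv(dst, count, type, i, tag, comm, &reqs[i]);
>             }
>         }
>         MPI_Barrier(comm);
>         MPI_Waitall(size, reqs, MPI_STATUSES_IGNORE);
>         free(reqs);
>     } else {
>         MPI_Barrier(comm);
>         MPI_Send(sendbuf, count, type, root, tag, comm);
>     }
>     return MPI_SUCCESS;
> }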
>
> IIRC, the giant gather we were debugging was at the very end of the application and was used to gather some statistics for I/O at the root, so it wasn't critical that it perform well.
> I also attempted a workaround using some derived datatypes, but I hit another truncation in the datatype code itself :\
> I should see if I can dig up that implementation and make sure it isn’t still broken for large counts.
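>
> The datatype version was something along these lines, at least in spirit (a sketch of the general idea, not my exact code); note that the total number of bytes is unchanged, so internal byte counts can still be truncated, which may well be the truncation I hit:
>
> #include <mpi.h>
>
> /* Same gather, but each rank contributes a count of 1 of one large
>  * contiguous type instead of a large count of MPI_INT. */
> static int gather_via_contig(const void *sendbuf, void *recvbuf,
>                              int count, int root, MPI_Comm comm)
> {
>     MPI_Datatype blk;
>     MPI_Type_contiguous(count, MPI_INT, &blk);  /* one element = count ints */
>     MPI_Type_commit(&blk);
>     int rc = MPI_Gather(sendbuf, 1, blk, recvbuf, 1, blk, root, comm);
>     MPI_Type_free(&blk);
>     return rc;
> }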

Those are all fine approaches to work around the problem.  The internals
of MPICH, though, need to be 64-bit clean -- there are still 4500 places
where clang warns about a 64-bit value being assigned to a 32-bit type.

Florian Willich, what version of MPICH is this?  The line numbers in
the backtrace don't match up with what I've got, and
I really thought we fixed this class of bug with commits 31d95ed7b18c
and 68f8c7aa7 over the summer.

==rob


>
>
>
>
>
> On Nov 26, 2015, at 12:14 PM, Balaji, Pavan <balaji at anl.gov<mailto:balaji at anl.gov>> wrote:
>
>
> Thanks for reporting.  This looks like an integer-overflow issue, which occurs when the summation of data elements from all processes is larger than INT_MAX (2 billion).  We'll look into it.  I've created a ticket for it and added you as the reporter, so you'll get notified as there are updates.
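>
> For what it's worth, the negative counts in your traces line up exactly with a 32-bit wraparound of an internal byte count covering about half the communicator's data (which would be consistent with the binomial-tree gather forwarding half the data toward the root).  A quick check, assuming the usual two's-complement wrap:
>
> #include <stdio.h>
>
> int main(void)
> {
>     /* 2048 ranks, 2 MB each: half of it is 1024 * 524288 ints * 4 bytes */
>     long long half = 1024LL * 524288 * 4;
>     printf("%lld -> %d\n", half, (int) half);   /* 2147483648 -> -2147483648 */
>
>     /* 2048 ranks, 3 MB each */
>     half = 1024LL * 786432 * 4;
>     printf("%lld -> %d\n", half, (int) half);   /* 3221225472 -> -1073741824 */
>
>     /* 2047 ranks, 3 MB each */
>     half = 1023LL * 786432 * 4;
>     printf("%lld -> %d\n", half, (int) half);   /* 3218079744 -> -1076887552 */
>
>     return 0;
> }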
>
> http://trac.mpich.org/projects/mpich/ticket/2317
>
> Rob: can you create a simple test program for this and add it to the test bucket, so it shows up on the nightlies?
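>
> Something along these lines would probably do as a starting point (a rough sketch; the only requirement is that size * count * sizeof(int) ends up past 2 GiB):
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     /* Pick a per-rank count so the gathered total is about 3 GiB. */
>     long long target = 3LL * 1024 * 1024 * 1024;
>     int count = (int) (target / (4LL * size));
>
>     int *sendbuf = malloc((size_t) count * sizeof(int));
>     int *recvbuf = NULL;
>     if (rank == 0)
>         recvbuf = malloc((size_t) size * count * sizeof(int));
>
>     for (int i = 0; i < count; i++)
>         sendbuf[i] = rank;
>
>     MPI_Gather(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, 0, MPI_COMM_WORLD);
>
>     if (rank == 0)
>         printf("gathered %lld bytes total\n", (long long) size * count * 4);
>
>     free(sendbuf);
>     free(recvbuf);
>     MPI_Finalize();
>     return 0;
> }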
>
> Thanks,
>
>   -- Pavan
>
> On Nov 26, 2015, at 10:18 AM, Florian.Willich at dlr.de wrote:
>
> Dear mpich discussion group,
>
> the following issue appeared when running some benchmarks with MPI Gather:
>
> Gathering data (calling MPI_Gather(...)) with 2048 processes, each sending 2 MB of data (4096 MB total), fails with the following output:
> ____________________________
>
> Rank 1024 [Thu Nov 26 09:43:16 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
> PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=524288, MPI_INT, rbuf=(nil), rcount=524288, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Gather_impl(775).:
> MPIR_Gather(735)......:
> MPIR_Gather_intra(347):
> MPIC_Send(360)........: Negative count, value is -2147483648
> _pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:43:16 2015] PE RANK 1024 exit signal Aborted
> [NID 00307] 2015-11-26 09:43:16 Apid 949450: initiated application termination
> Application 949450 exit codes: 134
> Application 949450 exit signals: Killed
> Application 949450 resources: utime ~1s, stime ~137s, Rss ~2110448, inblocks ~617782, outblocks ~1659320
> ____________________________
>
> The following are some tests that I ran to better understand the problem:
>
> 2047 processes - 2 MB (4094 MB total) -> works!
>
> 2048 processes - 2047.5 KB (~1.999512 MB) (4095 MB total) -> works!
>
> 2048 processes - 3 MB (6144 MB total) -> fails:
> ____________________________
>
> Rank 1024 [Thu Nov 26 09:41:15 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
> PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Gather_impl(775).:
> MPIR_Gather(735)......:
> MPIR_Gather_intra(347):
> MPIC_Send(360)........: Negative count, value is -1073741824
> _pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:41:15 2015] PE RANK 1024 exit signal Aborted
> [NID 00307] 2015-11-26 09:41:15 Apid 949448: initiated application termination
> Application 949448 exit codes: 134
> Application 949448 exit signals: Killed
> Application 949448 resources: utime ~1s, stime ~139s, Rss ~3159984, inblocks ~617782, outblocks ~1659351
> ____________________________
>
> 2047 processes - 3 MB (6141 MB total) -> fails:
> ____________________________
>
> Rank 1024 [Thu Nov 26 09:40:31 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
> PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Gather_impl(775).:
> MPIR_Gather(735)......:
> MPIR_Gather_intra(347):
> MPIC_Send(360)........: Negative count, value is -1076887552
> _pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:40:32 2015] PE RANK 1024 exit signal Aborted
> [NID 00307] 2015-11-26 09:40:32 Apid 949446: initiated application termination
> Application 949446 exit codes: 134
> Application 949446 exit signals: Killed
> Application 949446 resources: utime ~1s, stime ~134s, Rss ~3157072, inblocks ~617780, outblocks ~1659351
> ____________________________
>
> 8 processes - 625 MB (5000 MB total) -> works!
>
> I can think of some pitfalls that might cause this issue, but I do not know the internally called routines well enough to check them. Is someone familiar with the implementation of MPI_Gather(...) willing to help me?
>
> Best regards
>
> Florian
>
> Deutsches Zentrum für Luft- und Raumfahrt e. V. (DLR)
> German Aerospace Center
> Institute of Planetary Research | Planetary Physics | Rutherfordstraße 2 | 12489 Berlin
>
> Florian Willich| Intern - Software Developer (Parallel Applications)
> florian.willlich at dlr.de
> DLR.de

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

