[mpich-discuss] MPI_Gather fails with 2048 processes and 4096 MB total
Archer, Charles J
charles.j.archer at intel.com
Thu Nov 26 12:38:11 CST 2015
FYI, we hit various flavors of this problem when I was still at IBM, I think mostly in weather codes.
Apparently Cray hit this too:
https://trac.mpich.org/projects/mpich/ticket/1767
We pretty much told our customers back then that a fix was forthcoming (with no ETA :) ) with the revamp of datatypes to use internal 64-bit counts.
We also provided workarounds.
In the case of this gather operation, we asked the customer to implement gather as a flat tree using point-to-point operations.
Root posts irecvs, then barrier, children send to root.
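A minimal sketch of that flat-tree workaround (hypothetical function name; assumes MPI_INT data and that the per-process count fits in an int, which is the point: no rank ever handles an aggregated count):

```c
#include <mpi.h>
#include <stdlib.h>

/* Flat-tree gather: the root posts one irecv per rank, everyone
 * synchronizes at a barrier, then each non-root rank sends to the
 * root. Only the per-rank count appears in any MPI call, so the
 * aggregated count never has to fit in a 32-bit int. */
static int flat_gather(const int *sbuf, int count, int *rbuf,
                       int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        MPI_Request *reqs = malloc(size * sizeof(MPI_Request));
        for (int i = 0; i < size; i++) {
            if (i == root) {
                /* copy the root's own contribution locally */
                for (int j = 0; j < count; j++)
                    rbuf[(size_t)i * count + j] = sbuf[j];
                reqs[i] = MPI_REQUEST_NULL;
            } else {
                /* note the size_t offset: the receive-buffer index
                 * can also exceed INT_MAX elements */
                MPI_Irecv(rbuf + (size_t)i * count, count, MPI_INT,
                          i, 0, comm, &reqs[i]);
            }
        }
        MPI_Barrier(comm);
        MPI_Waitall(size, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    } else {
        MPI_Barrier(comm);
        MPI_Send(sbuf, count, MPI_INT, root, 0, comm);
    }
    return MPI_SUCCESS;
}
```

The barrier keeps the children from flooding the root before its receives are posted; beyond that, performance was not a concern for this use case.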
IIRC, the giant gather we were debugging was at the very end of the application and was used to gather some statistics for I/O at the root, so it wasn't critical that it perform well.
I also attempted a workaround using some derived datatypes, but I hit another truncation in the datatype code itself :\
I should see if I can dig up that implementation and make sure it isn’t still broken for large counts.
On Nov 26, 2015, at 12:14 PM, Balaji, Pavan <balaji at anl.gov> wrote:
Thanks for reporting. This looks like an integer-overflow issue, which occurs when the total amount of data gathered from all processes exceeds INT_MAX (about 2 billion). We'll look into it. I've created a ticket for it and added you as the reporter, so you'll be notified as there are updates.
http://trac.mpich.org/projects/mpich/ticket/2317
Rob: can you create a simple test program for this and add it to the test bucket, so it shows up on the nightlies?
Thanks,
-- Pavan
On Nov 26, 2015, at 10:18 AM, Florian.Willich at dlr.de wrote:
Dear mpich discussion group,
the following issue appeared when running some benchmarks with MPI Gather:
Gathering data (calling MPI_Gather(...)) involving 2048 processes, each sending 2 MB of data (4096 MB total), fails with the following output:
____________________________
Rank 1024 [Thu Nov 26 09:43:16 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=524288, MPI_INT, rbuf=(nil), rcount=524288, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Gather_impl(775).:
MPIR_Gather(735)......:
MPIR_Gather_intra(347):
MPIC_Send(360)........: Negative count, value is -2147483648
_pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:43:16 2015] PE RANK 1024 exit signal Aborted
[NID 00307] 2015-11-26 09:43:16 Apid 949450: initiated application termination
Application 949450 exit codes: 134
Application 949450 exit signals: Killed
Application 949450 resources: utime ~1s, stime ~137s, Rss ~2110448, inblocks ~617782, outblocks ~1659320
____________________________
The following are some tests that I ran to better understand the problem:
2047 processes - 2 MB (4094 MB total) -> works!
2048 processes - 2047.5 KB (~1.999512 MB) (4095 MB total) -> works!
2048 processes - 3 MB (6144 MB total) -> fails:
____________________________
Rank 1024 [Thu Nov 26 09:41:15 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Gather_impl(775).:
MPIR_Gather(735)......:
MPIR_Gather_intra(347):
MPIC_Send(360)........: Negative count, value is -1073741824
_pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:41:15 2015] PE RANK 1024 exit signal Aborted
[NID 00307] 2015-11-26 09:41:15 Apid 949448: initiated application termination
Application 949448 exit codes: 134
Application 949448 exit signals: Killed
Application 949448 resources: utime ~1s, stime ~139s, Rss ~3159984, inblocks ~617782, outblocks ~1659351
____________________________
2047 processes - 3 MB (6141 MB total) -> fails:
____________________________
Rank 1024 [Thu Nov 26 09:40:31 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Gather_impl(775).:
MPIR_Gather(735)......:
MPIR_Gather_intra(347):
MPIC_Send(360)........: Negative count, value is -1076887552
_pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:40:32 2015] PE RANK 1024 exit signal Aborted
[NID 00307] 2015-11-26 09:40:32 Apid 949446: initiated application termination
Application 949446 exit codes: 134
Application 949446 exit signals: Killed
Application 949446 resources: utime ~1s, stime ~134s, Rss ~3157072, inblocks ~617780, outblocks ~1659351
____________________________
8 processes - 625 MB (5000 MB total) -> works!
I can think of some pitfalls that might cause this issue, but I do not know the internally called routines well enough to check them. Is someone familiar with the implementation of MPI_Gather(...) and willing to help me?
Best regards
Florian
Deutsches Zentrum für Luft- und Raumfahrt e. V. (DLR)
German Aerospace Center
Institute of Planetary Research | Planetary Physics | Rutherfordstraße 2 | 12489 Berlin
Florian Willich | Intern - Software Developer (Parallel Applications)
florian.willich at dlr.de
DLR.de
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss