[mpich-discuss] MPI_Gather fails with 2048 processes and 4096 MB total

Florian.Willich at dlr.de
Thu Nov 26 10:18:08 CST 2015


Dear mpich discussion group,

the following issue appeared when running some benchmarks with MPI_Gather:

Gathering data (calling MPI_Gather(...)) across 2048 processes, where each process sends 2 MB (4096 MB in total), fails with the following output:
____________________________

Rank 1024 [Thu Nov 26 09:43:16 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=524288, MPI_INT, rbuf=(nil), rcount=524288, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Gather_impl(775).:
MPIR_Gather(735)......:
MPIR_Gather_intra(347):
MPIC_Send(360)........: Negative count, value is -2147483648
_pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:43:16 2015] PE RANK 1024 exit signal Aborted
[NID 00307] 2015-11-26 09:43:16 Apid 949450: initiated application termination
Application 949450 exit codes: 134
Application 949450 exit signals: Killed
Application 949450 resources: utime ~1s, stime ~137s, Rss ~2110448, inblocks ~617782, outblocks ~1659320
____________________________

The following are some tests that I ran to better understand the problem:

2047 processes - 2 MB (4094 MB total) -> works!

2048 processes - 2047.5 KB (~1.999512 MB) (4095 MB total) -> works!

2048 processes - 3 MB (6144 MB total) -> fails:
____________________________

Rank 1024 [Thu Nov 26 09:41:15 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Gather_impl(775).:
MPIR_Gather(735)......:
MPIR_Gather_intra(347):
MPIC_Send(360)........: Negative count, value is -1073741824
_pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:41:15 2015] PE RANK 1024 exit signal Aborted
[NID 00307] 2015-11-26 09:41:15 Apid 949448: initiated application termination
Application 949448 exit codes: 134
Application 949448 exit signals: Killed
Application 949448 resources: utime ~1s, stime ~139s, Rss ~3159984, inblocks ~617782, outblocks ~1659351
____________________________

2047 processes - 3 MB (6141 MB total) -> fails:
____________________________

Rank 1024 [Thu Nov 26 09:40:31 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Gather_impl(775).:
MPIR_Gather(735)......:
MPIR_Gather_intra(347):
MPIC_Send(360)........: Negative count, value is -1076887552
_pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:40:32 2015] PE RANK 1024 exit signal Aborted
[NID 00307] 2015-11-26 09:40:32 Apid 949446: initiated application termination
Application 949446 exit codes: 134
Application 949446 exit signals: Killed
Application 949446 resources: utime ~1s, stime ~134s, Rss ~3157072, inblocks ~617780, outblocks ~1659351
____________________________

8 processes - 625 MB (5000 MB total) -> works!

I can think of some pitfalls that might cause this issue, but I do not know the internally called routines well enough to check them. Is someone familiar with the implementation of MPI_Gather(...) and willing to help me?
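
One such pitfall can be checked with arithmetic alone. The sketch below assumes (this is a guess, not something confirmed from the MPICH source) that the gather is tree based, that rank 1024 forwards its own block plus the blocks of all higher ranks in a single send, and that the byte count of that send is stored in a signed 32-bit integer. The helper names wrap_int32 and forwarded_bytes are made up for this illustration. The per-process sizes follow from the error stack: scount=524288 MPI_INT is 2 MB and scount=786432 MPI_INT is 3 MB, with 4-byte ints.

```python
INT32_MAX = 2**31 - 1
MIB = 1024 * 1024


def wrap_int32(value):
    """Reinterpret an integer as a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def forwarded_bytes(nprocs, bytes_per_rank, rank=1024, subtree=1024):
    """Bytes `rank` would forward if it relays the blocks of ranks
    rank .. min(rank + subtree, nprocs) - 1 (ASSUMED tree layout)."""
    blocks = min(rank + subtree, nprocs) - rank
    return blocks * bytes_per_rank


# (number of processes, bytes sent per process) for the runs listed above
cases = [
    (2048, 2 * MIB),            # fails
    (2048, 3 * MIB),            # fails
    (2047, 3 * MIB),            # fails
    (2047, 2 * MIB),            # works
    (2048, 2047 * 1024 + 512),  # works (2047.5 KB)
]

for nprocs, size in cases:
    total = forwarded_bytes(nprocs, size)
    print(nprocs, size, total, wrap_int32(total), total > INT32_MAX)
```

Under these assumptions the wrapped values for the three failing runs are -2147483648, -1073741824 and -1076887552, which are exactly the negative counts reported by MPIC_Send, and the two working runs with 2047 and 2048 processes stay just below 2^31 - 1 bytes. The same model applied to the 8-process run (rank 4 forwarding 4 blocks of 625 MB) would also exceed that limit, yet that run works, so this can only be part of the picture.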

Best regards

Florian

Deutsches Zentrum für Luft- und Raumfahrt e. V. (DLR)
German Aerospace Center
Institute of Planetary Research | Planetary Physics | Rutherfordstraße 2 | 12489 Berlin

Florian Willich| Intern - Software Developer (Parallel Applications)
florian.willlich at dlr.de<mailto:florian.willlich at dlr.de>
DLR.de<http://www.dlr.de/>