[mpich-discuss] MPI_Gather fails with 2048 processes and 4096 MB total

Balaji, Pavan balaji at anl.gov
Thu Nov 26 12:14:40 CST 2015


Thanks for reporting. This looks like an integer-overflow issue: the call fails when the sum of data elements from all processes exceeds INT_MAX (2^31 - 1, about 2.1 billion). We'll look into it. I've created a ticket for it and added you as the reporter, so you'll be notified when there are updates.

	http://trac.mpich.org/projects/mpich/ticket/2317

Rob: can you create a simple test program for this and add it to the test bucket, so it shows up on the nightlies?

Thanks,

  -- Pavan

> On Nov 26, 2015, at 10:18 AM, Florian.Willich at dlr.de wrote:
> 
> Dear mpich discussion group,
> 
> the following issue appeared when running some benchmarks with MPI_Gather:
> 
> Gathering data with MPI_Gather(...) across 2048 processes, each sending 2 MB (4096 MB total), fails with the following output:
> ____________________________
> 
> Rank 1024 [Thu Nov 26 09:43:16 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
> PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=524288, MPI_INT, rbuf=(nil), rcount=524288, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Gather_impl(775).: 
> MPIR_Gather(735)......: 
> MPIR_Gather_intra(347): 
> MPIC_Send(360)........: Negative count, value is -2147483648
> _pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:43:16 2015] PE RANK 1024 exit signal Aborted
> [NID 00307] 2015-11-26 09:43:16 Apid 949450: initiated application termination
> Application 949450 exit codes: 134
> Application 949450 exit signals: Killed
> Application 949450 resources: utime ~1s, stime ~137s, Rss ~2110448, inblocks ~617782, outblocks ~1659320
> ____________________________
> 
> The following are some tests that I ran to better understand the problem:
> 
> 2047 processes - 2 MB (4094 MB total) -> works!
> 
> 2048 processes - 2047.5 KB (~1.999512 MB) (4095 MB total) -> works!
> 
> 2048 processes - 3 MB (6144 MB total) -> fails:
> ____________________________
> 
> Rank 1024 [Thu Nov 26 09:41:15 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
> PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Gather_impl(775).: 
> MPIR_Gather(735)......: 
> MPIR_Gather_intra(347): 
> MPIC_Send(360)........: Negative count, value is -1073741824
> _pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:41:15 2015] PE RANK 1024 exit signal Aborted
> [NID 00307] 2015-11-26 09:41:15 Apid 949448: initiated application termination
> Application 949448 exit codes: 134
> Application 949448 exit signals: Killed
> Application 949448 resources: utime ~1s, stime ~139s, Rss ~3159984, inblocks ~617782, outblocks ~1659351
> ____________________________
> 
> 2047 processes - 3 MB (6141 MB total) -> fails:
> ____________________________
> 
> Rank 1024 [Thu Nov 26 09:40:31 2015] [c1-0c1s12n3] Fatal error in PMPI_Gather: Invalid count, error stack:
> PMPI_Gather(959)......: MPI_Gather(sbuf=0x2aaab826c010, scount=786432, MPI_INT, rbuf=(nil), rcount=786432, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Gather_impl(775).: 
> MPIR_Gather(735)......: 
> MPIR_Gather_intra(347): 
> MPIC_Send(360)........: Negative count, value is -1076887552
> _pmiu_daemon(SIGCHLD): [NID 00307] [c1-0c1s12n3] [Thu Nov 26 09:40:32 2015] PE RANK 1024 exit signal Aborted
> [NID 00307] 2015-11-26 09:40:32 Apid 949446: initiated application termination
> Application 949446 exit codes: 134
> Application 949446 exit signals: Killed
> Application 949446 resources: utime ~1s, stime ~134s, Rss ~3157072, inblocks ~617780, outblocks ~1659351
> ____________________________
> 
> 8 processes - 625 MB (5000 MB total) -> works!
> 
> I can think of some pitfalls that might cause this issue, but I do not know the internally called routines well enough to check them. Is someone familiar with the implementation of MPI_Gather(...) and willing to help me?
> 
> Best regards
> 
> Florian
> 
> Deutsches Zentrum für Luft- und Raumfahrt e. V. (DLR)
> German Aerospace Center
> Institute of Planetary Research | Planetary Physics | Rutherfordstraße 2 | 12489 Berlin
>  
> Florian Willich | Intern - Software Developer (Parallel Applications)
> florian.willlich at dlr.de
> DLR.de
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
