[mpich-discuss] Possible integer-overflow for MPI_COMM_WORLD in MPI_Iprobe

Jeff Hammond jeff.science at gmail.com
Mon Jan 21 21:46:28 CST 2019


I was able to reproduce with my own test (
https://github.com/jeffhammond/HPCInfo/blob/master/mpi/bugs/iprobe-overflow.c)
with Intel MPI 2019, so I will report that bug to the Intel MPI team.  It
should be easy enough for them to figure out if this bug is from MPICH or
not.

2139000000 iterations, 627.143272 seconds
2140000000 iterations, 627.436206 seconds
2141000000 iterations, 627.729135 seconds
2142000000 iterations, 628.022049 seconds
2143000000 iterations, 628.315015 seconds
2144000000 iterations, 628.608066 seconds
2145000000 iterations, 628.901065 seconds
2146000000 iterations, 629.193992 seconds
2147000000 iterations, 629.488107 seconds
Abort(738833413) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe:
Invalid communicator, error stack:
PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
MPI_COMM_WORLD, flag=0x7ffdf75a396c, status=0x7ffdf75a3970) failed
PMPI_Iprobe(90).: Invalid communicator

Jeff, who works for Intel but knows more about MPICH than Intel MPI

On Mon, Jan 21, 2019 at 11:19 AM Joachim Protze via discuss <
discuss at mpich.org> wrote:

> Hi all,
>
> We detected this behavior with Intel MPI 2019 (which is based on MPICH
> 3.3). We have not yet been able to reproduce it with MPICH 3.3, but I fear
> that our build of MPICH simply does not use the necessary code path or
> build flags.
>
> When calling MPI_Iprobe with the same communicator ~2^31 times
> (which can take 10-30 minutes), the execution stops with:
>
> Abort(201962501) on node 0 (rank 0 in comm 0): Fatal error in
> PMPI_Iprobe: Invalid communicator, error stack:
> PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
> MPI_COMM_WORLD, flag=0x7ffd925056c0, status=0x7ffd92505694) failed
> PMPI_Iprobe(90).: Invalid communicator
>
> From my understanding of the referenced MPICH code lines, I guess that
> the ref-count for MPI_COMM_WORLD overflows, which triggers this error
> message.
>
> Best
> Joachim
>
> --
> Dipl.-Inf. Joachim Protze
>
> IT Center
> Group: High Performance Computing
> Division: Computational Science and Engineering
> RWTH Aachen University
> Seffenter Weg 23
> D 52074  Aachen (Germany)
> Tel: +49 241 80- 24765
> Fax: +49 241 80-624765
> protze at itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/