[mpich-discuss] Possible integer-overflow for MPI_COMM_WORLD in MPI_Iprobe

Jeff Hammond jeff.science at gmail.com
Wed Dec 4 09:48:04 CST 2019


This was fixed in Intel MPI 2019 update 2.

Jeff

On Fri, Apr 26, 2019 at 8:09 AM Jeff Hammond <jeff.science at gmail.com> wrote:

> For anyone who cares about this bug because of Intel MPI, I am told it is
> fixed in Intel MPI 2019 update 3.
>
> Jeff
>
> On Mon, Jan 21, 2019 at 7:46 PM Jeff Hammond <jeff.science at gmail.com>
> wrote:
>
>> I was able to reproduce this with my own test (
>> https://github.com/jeffhammond/HPCInfo/blob/master/mpi/bugs/iprobe-overflow.c)
>> with Intel MPI 2019, so I will report the bug to the Intel MPI team.  It
>> should be easy enough for them to figure out whether or not this bug
>> comes from MPICH.
>>
>> 2139000000 iterations, 627.143272 seconds
>> 2140000000 iterations, 627.436206 seconds
>> 2141000000 iterations, 627.729135 seconds
>> 2142000000 iterations, 628.022049 seconds
>> 2143000000 iterations, 628.315015 seconds
>> 2144000000 iterations, 628.608066 seconds
>> 2145000000 iterations, 628.901065 seconds
>> 2146000000 iterations, 629.193992 seconds
>> 2147000000 iterations, 629.488107 seconds
>>
>> Abort(738833413) on node 0 (rank 0 in comm 0): Fatal error in
>> PMPI_Iprobe: Invalid communicator, error stack:
>> PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
>> MPI_COMM_WORLD, flag=0x7ffdf75a396c, status=0x7ffdf75a3970) failed
>> PMPI_Iprobe(90).: Invalid communicator
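>>
>> For reference, here is a minimal sketch of a reproducer in the same
>> spirit as the linked iprobe-overflow.c test (the exact code is at the
>> URL above; the loop bound and progress interval below are illustrative
>> assumptions, not taken from that test):
>>
>>   #include <mpi.h>
>>   #include <stdio.h>
>>
>>   int main(int argc, char **argv) {
>>       MPI_Init(&argc, &argv);
>>       double t0 = MPI_Wtime();
>>       int flag;
>>       /* Call MPI_Iprobe on MPI_COMM_WORLD a bit more than 2^31 times
>>        * (2^31 ~= 2147483648); in the run above the abort appeared
>>        * shortly after 2147000000 iterations. */
>>       for (long long i = 1; i <= 2200000000LL; i++) {
>>           MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
>>                      &flag, MPI_STATUS_IGNORE);
>>           if (i % 1000000LL == 0)
>>               printf("%lld iterations, %f seconds\n", i, MPI_Wtime() - t0);
>>       }
>>       MPI_Finalize();
>>       return 0;
>>   }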
>>
>> Jeff, who works for Intel but knows more about MPICH than Intel MPI
>>
>> On Mon, Jan 21, 2019 at 11:19 AM Joachim Protze via discuss <
>> discuss at mpich.org> wrote:
>>
>>> Hi all,
>>>
>>> We detected this behavior with Intel MPI 2019 (which is based on MPICH
>>> 3.3). We have not yet been able to reproduce it with MPICH 3.3 itself,
>>> but I fear that our build of MPICH simply does not use the necessary
>>> code path / build flags.
>>>
>>> When MPI_Iprobe is called with the same communicator about 2^31 times
>>> (which can take 10-30 minutes), the execution aborts with:
>>>
>>> Abort(201962501) on node 0 (rank 0 in comm 0): Fatal error in
>>> PMPI_Iprobe: Invalid communicator, error stack:
>>> PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
>>> MPI_COMM_WORLD, flag=0x7ffd925056c0, status=0x7ffd92505694) failed
>>> PMPI_Iprobe(90).: Invalid communicator
>>>
>>> From my understanding of the referenced MPICH code lines, I suspect
>>> that the reference count for MPI_COMM_WORLD overflows, which triggers
>>> this error message.
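>>>
>>> (A purely hypothetical illustration of that guess, not actual MPICH
>>> code: if the communicator's reference count were a signed 32-bit
>>> integer incremented on every probe without a matching decrement, it
>>> would turn negative after about 2^31 increments, and a validity check
>>> such as "ref_count > 0" would then reject MPI_COMM_WORLD.)
>>>
>>>   #include <stdint.h>
>>>   #include <stdio.h>
>>>
>>>   int main(void) {
>>>       /* Largest positive 32-bit value, i.e. the state after
>>>        * 2^31 - 1 increments from zero. */
>>>       int32_t ref_count = INT32_MAX;
>>>       /* One more increment, done in unsigned arithmetic to avoid
>>>        * undefined behavior; on two's-complement machines the result
>>>        * converts back to INT32_MIN, i.e. a negative count. */
>>>       ref_count = (int32_t)((uint32_t)ref_count + 1u);
>>>       printf("ref_count = %d\n", (int)ref_count);
>>>       printf("passes (ref_count > 0) check: %s\n",
>>>              ref_count > 0 ? "yes" : "no");
>>>       return 0;
>>>   }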
>>>
>>> Best
>>> Joachim
>>>
>>> --
>>> Dipl.-Inf. Joachim Protze
>>>
>>> IT Center
>>> Group: High Performance Computing
>>> Division: Computational Science and Engineering
>>> RWTH Aachen University
>>> Seffenter Weg 23
>>> D 52074  Aachen (Germany)
>>> Tel: +49 241 80-24765
>>> Fax: +49 241 80-624765
>>> protze at itc.rwth-aachen.de
>>> www.itc.rwth-aachen.de
>>>
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/