[mpich-discuss] Possible integer-overflow for MPI_COMM_WORLD in MPI_Iprobe

Jeff Hammond jeff.science at gmail.com
Fri Apr 26 10:09:28 CDT 2019


For anyone who cares about this bug because of Intel MPI, I am told it is
fixed in Intel MPI 2019 update 3.

Jeff

On Mon, Jan 21, 2019 at 7:46 PM Jeff Hammond <jeff.science at gmail.com> wrote:

> I was able to reproduce with my own test (
> https://github.com/jeffhammond/HPCInfo/blob/master/mpi/bugs/iprobe-overflow.c)
> with Intel MPI 2019, so I will report the bug to the Intel MPI team.  It
> should be easy enough for them to figure out whether this bug comes from
> MPICH or not.
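>
> For anyone who wants to try this without the linked file, a minimal
> sketch of that kind of stress loop (my assumption of what such a test
> does; the actual iprobe-overflow.c may differ in details) is just:
>
>   #include <mpi.h>
>   #include <stdio.h>
>
>   int main(int argc, char **argv)
>   {
>       int flag;
>       MPI_Status status;
>       double t0;
>       MPI_Init(&argc, &argv);
>       t0 = MPI_Wtime();
>       /* Probe the same communicator a bit more than 2^31 times. */
>       for (long long i = 1; i <= 2200000000LL; i++) {
>           MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
>                      &flag, &status);
>           if (i % 1000000 == 0)   /* periodic progress report */
>               printf("%lld iterations, %f seconds\n", i, MPI_Wtime() - t0);
>       }
>       MPI_Finalize();
>       return 0;
>   }
>
> With a buggy MPI it aborts with "Invalid communicator" somewhere past
> the 2147000000-iteration mark, as in the output below.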
>
> 2139000000 iterations, 627.143272 seconds
> 2140000000 iterations, 627.436206 seconds
> 2141000000 iterations, 627.729135 seconds
> 2142000000 iterations, 628.022049 seconds
> 2143000000 iterations, 628.315015 seconds
> 2144000000 iterations, 628.608066 seconds
> 2145000000 iterations, 628.901065 seconds
> 2146000000 iterations, 629.193992 seconds
> 2147000000 iterations, 629.488107 seconds
>
> Abort(738833413) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe:
> Invalid communicator, error stack:
> PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
> MPI_COMM_WORLD, flag=0x7ffdf75a396c, status=0x7ffdf75a3970) failed
> PMPI_Iprobe(90).: Invalid communicator
>
> Jeff, who works for Intel but knows more about MPICH than Intel MPI
>
> On Mon, Jan 21, 2019 at 11:19 AM Joachim Protze via discuss <
> discuss at mpich.org> wrote:
>
>> Hi all,
>>
>> we detected the behavior with Intel MPI 2019 (which is based on MPICH
>> 3.3). Reproducing it with MPICH 3.3 has not been successful yet, but I
>> fear that our build of MPICH simply does not use the necessary code
>> path / build flags.
>>
>> When calling MPI_Iprobe with the same communicator ~2^31 times
>> (which can take 10-30 minutes), the execution aborts with:
>>
>> Abort(201962501) on node 0 (rank 0 in comm 0): Fatal error in
>> PMPI_Iprobe: Invalid communicator, error stack:
>> PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
>> MPI_COMM_WORLD, flag=0x7ffd925056c0, status=0x7ffd92505694) failed
>> PMPI_Iprobe(90).: Invalid communicator
>>
>> From my understanding of the referenced MPICH code lines, I guess that
>> the reference count for MPI_COMM_WORLD overflows, which triggers this
>> error message.
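>>
>> As a toy illustration of the suspected mechanism (this is not MPICH's
>> actual code, just my reading of it): if each MPI_Iprobe call adds a net
>> reference to the communicator and the count is a signed 32-bit integer,
>> it wraps negative after about 2^31 calls, and a handle check that
>> expects a positive count then reports "Invalid communicator":
>>
>>   #include <stdint.h>
>>   #include <stdio.h>
>>
>>   int main(void)
>>   {
>>       int32_t refcount = 1;           /* communicator starts referenced */
>>       uint32_t calls = 2147483647u;   /* ~2^31 MPI_Iprobe calls */
>>       /* simulate 32-bit wraparound explicitly (incrementing a signed
>>          int past INT32_MAX would be undefined behavior in C) */
>>       refcount = (int32_t)((uint32_t)refcount + calls);
>>       printf("refcount after %u calls: %d -> %s\n", calls, refcount,
>>              refcount > 0 ? "valid" : "Invalid communicator");
>>       return 0;
>>   }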
>>
>> Best
>> Joachim
>>
>> --
>> Dipl.-Inf. Joachim Protze
>>
>> IT Center
>> Group: High Performance Computing
>> Division: Computational Science and Engineering
>> RWTH Aachen University
>> Seffenter Weg 23
>> D 52074  Aachen (Germany)
>> Tel: +49 241 80-24765
>> Fax: +49 241 80-624765
>> protze at itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/