[mpich-discuss] Possible integer overflows at scale in gather.c

Ignacio Laguna lagunaperalt1 at llnl.gov
Wed Sep 30 16:08:45 CDT 2015


Hi,

I found some 'potential' integer overflows when running mpich at large 
scale and/or with large inputs in gather.c. I believe that they are 
related to ticket #1767 
(http://trac.mpich.org/projects/mpich/ticket/1767), but I didn't see any 
other bug reports about them so I thought I should confirm with mpich 
developers.

In addition to this case in line 171:

     98	    int tmp_buf_size, missing;
...
    169		if (nbytes < MPIR_CVAR_GATHER_VSMALL_MSG_SIZE) tmp_buf_size++;
    170	
    171		tmp_buf_size *= nbytes;

which I believe is fixed in mpich-3.2b4 (where tmp_buf_size is 
declared as a 64-bit MPI_Aint), I found the following cases in the same 
file src/mpi/coll/gather.c (in both mpich-3.1.4 and mpich-3.2b4):

Case 1:

    222	mpi_errno = MPIC_Recv(((char *)recvbuf +
    223	            (((rank + mask) % comm_size)*recvcount*extent)),
    224	            recvblks * recvcount, recvtype, src,
    225	            MPIR_GATHER_TAG, comm,
    226	            &status, errflag);

In line 223 I believe we get an integer overflow as follows. Suppose I 
run 2^20 = 1,048,576 ranks and do a gather with 4,096 elements. In this 
case (if I understand the algorithm correctly), ((rank + mask) % 
comm_size) would be 2^20 / 2 = 524,288, and recvcount = 4,096. Then the 
((rank + mask) % comm_size)*recvcount expression would overflow a 32-bit 
int: 524,288 * 4,096 = 2,147,483,648 = 2^31, which is one more than 
INT_MAX, so the result would typically wrap around to a negative value.

When that result is multiplied by 'extent', which is size_t or MPI_Aint, 
I believe it will either stay negative or become a huge positive value. 
In either case it will point to the wrong location in the recvbuf 
buffer, unless of course this wraparound behavior is intended.

Case 2:

There might be a similar problem in line 224 in the above code. With 
2^20 ranks, recvblks becomes 524,288 (again, if I understand the 
algorithm correctly), so the recvblks * recvcount operation will also 
overflow.

I might be wrong on this -- I'm catching these issues with LLVM symbolic 
analysis -- so they can be totally false positives, but I just wanted to 
check with the mpich developers if they are valid issues or not. If they 
are, I believe fixes can be easy to implement (just make all these 
computations size_t).

Thanks!

-- 
Ignacio Laguna
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
Phone: 925-422-7308, Fax: 925-422-6287

