[mpich-discuss] Possible integer overflows at scale in gather.c
Ignacio Laguna
lagunaperalt1 at llnl.gov
Wed Sep 30 16:08:45 CDT 2015
Hi,
I found some potential integer overflows in gather.c when running mpich
at large scale and/or with large inputs. I believe they are related to
ticket #1767 (http://trac.mpich.org/projects/mpich/ticket/1767), but I
didn't see any other bug reports about them, so I thought I should
confirm with the mpich developers.
In addition to this case in line 171:
98 int tmp_buf_size, missing;
...
169 if (nbytes < MPIR_CVAR_GATHER_VSMALL_MSG_SIZE) tmp_buf_size++;
170
171 tmp_buf_size *= nbytes;
which I believe is fixed in mpich-3.2b4 (where tmp_buf_size is
declared as a 64-bit MPI_Aint), I found the following cases in the same
file src/mpi/coll/gather.c (in both mpich-3.1.4 and mpich-3.2b4):
Case 1:
222 mpi_errno = MPIC_Recv(((char *)recvbuf +
223 (((rank + mask) % comm_size)*recvcount*extent)),
224 recvblks * recvcount, recvtype, src,
225 MPIR_GATHER_TAG, comm,
226 &status, errflag);
In line 223 I believe we get an integer overflow as follows. Suppose I
run 2^20 = 1,048,576 ranks and do a gather with 4,096 elements. In this
case (if I understand the algorithm correctly), ((rank + mask) %
comm_size) would be 2^20 / 2 = 524,288, and recvcount = 4,096. Then the
expression ((rank + mask) % comm_size)*recvcount would overflow:
524,288 * 4,096 = 2,147,483,648 = 2^31, which is one more than the
largest 32-bit signed int (2,147,483,647), so the result would typically
wrap around and become negative.
When that value is multiplied by 'extent', which is size_t or MPI_Aint,
I believe the result is either negative or a huge positive number. In
either case it will point to the wrong location in the recvbuf buffer,
unless of course this wraparound behavior is intended.
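To make the arithmetic in Case 1 easy to check, here is a minimal sketch
that does the multiplication in 64 bits and models what a 32-bit int
would hold afterwards. The helper names (wide_product, wrapped_int32,
product_overflows_int, check_case1) are made up for this example and are
not part of mpich, and it assumes a 32-bit int. Signed overflow is
undefined behavior in C, so the wrapped value only models what is
commonly observed.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Exact product of two int operands, computed in 64 bits so that the
 * demonstration itself never overflows. */
static int64_t wide_product(int a, int b)
{
    return (int64_t)a * (int64_t)b;
}

/* Value a 32-bit two's-complement int would hold after wraparound.
 * This is a model only: the C standard leaves signed overflow undefined. */
static int64_t wrapped_int32(int64_t v)
{
    uint32_t low = (uint32_t)v;   /* keep the low 32 bits (well defined) */
    return (low & 0x80000000u) ? (int64_t)low - 4294967296LL : (int64_t)low;
}

/* Returns 1 if the exact product does not fit in an int, else 0. */
static int product_overflows_int(int a, int b)
{
    int64_t p = wide_product(a, b);
    return p > INT_MAX || p < INT_MIN;
}

/* Numbers from Case 1: 2^19 * 2^12 = 2^31, one past a 32-bit INT_MAX. */
static void check_case1(void)
{
    assert(wide_product(524288, 4096) == 2147483648LL);
    assert(product_overflows_int(524288, 4096));
    assert(wrapped_int32(wide_product(524288, 4096)) == -2147483648LL);
}
```

Calling check_case1() passes on a platform with 32-bit int, which matches
the numbers above.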
Case 2:
There might be a similar problem in line 224 in the above code. With
2^20 ranks, recvblks becomes 524,288 (again, if I understand the
algorithm correctly), so the recvblks * recvcount operation will also
overflow for recvcount = 4,096.
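For Case 2, one way to detect the problem before the receive is posted
is a checked multiplication. This is only a sketch: checked_count is a
hypothetical helper, not an mpich function, and it assumes the count
passed to the receive call is an int, as the int arithmetic in the quoted
call suggests.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of mpich. Stores recvblks * recvcount in
 * *out and returns 0 when the exact product fits in an int. Returns -1
 * and leaves *out untouched when it does not. */
static int checked_count(int recvblks, int recvcount, int *out)
{
    assert(out != NULL);

    /* Widen both operands first so the product is evaluated in 64 bits. */
    int64_t wide = (int64_t)recvblks * (int64_t)recvcount;
    if (wide > INT_MAX || wide < INT_MIN)
        return -1;

    *out = (int)wide;
    return 0;
}
```

With the numbers above, checked_count(524288, 4096, &n) returns -1 on a
platform with 32-bit int, while a smaller case such as
checked_count(1024, 4096, &n) returns 0 and sets n to 4,194,304.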
I might be wrong on this -- I'm catching these issues with LLVM symbolic
analysis -- so they could be entirely false positives, but I just wanted
to check with the mpich developers whether they are valid issues. If
they are, I believe fixes can be easy to implement (just do all these
computations in size_t).
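As a rough illustration of the kind of fix I have in mind for line 223,
the offset could be computed in a wide type from the first operand
onwards. This is a sketch under assumptions, not a patch: aint_t is a
stand-in for MPI_Aint (assumed here to be a 64-bit signed type) so that
it builds without mpi.h, and recv_offset is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MPI_Aint so the sketch builds without mpi.h.
 * Assumption: a 64-bit signed type. */
typedef int64_t aint_t;

/* Byte offset into recvbuf, mirroring the expression from line 223:
 *   ((rank + mask) % comm_size) * recvcount * extent
 * Widening the first operand means every later multiplication is done
 * in aint_t, so no intermediate product is evaluated in int. */
static aint_t recv_offset(int rank, int mask, int comm_size,
                          int recvcount, aint_t extent)
{
    assert(comm_size > 0 && recvcount >= 0 && extent >= 0);

    aint_t block = (aint_t)((rank + mask) % comm_size);
    return block * (aint_t)recvcount * extent;
}
```

For the example above with a 4-byte extent, recv_offset(0, 524288,
1048576, 4096, 4) gives 8,589,934,592 (2^33) instead of a wrapped value.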
Thanks!
--
Ignacio Laguna
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
Phone: 925-422-7308, Fax: 925-422-6287