[mpich-discuss] Possible integer overflows at scale in gather.c
Rob Latham
robl at mcs.anl.gov
Thu Oct 1 11:15:50 CDT 2015
On 09/30/2015 04:08 PM, Ignacio Laguna wrote:
> Hi,
>
> I found some 'potential' integer overflows when running mpich at large
> scale and/or with large inputs in gather.c. I believe that they are
> related to ticket #1767
> (http://trac.mpich.org/projects/mpich/ticket/1767), but I didn't see any
> other bug reports about them so I thought I should confirm with mpich
> developers.
>
> In addition to this case in line 171:
>
> 98 int tmp_buf_size, missing;
> ...
> 169 if (nbytes < MPIR_CVAR_GATHER_VSMALL_MSG_SIZE) tmp_buf_size++;
> 170
> 171 tmp_buf_size *= nbytes;
>
> which I believe is fixed in mpich-3.2b4 (where tmp_buf_size is
> declared as a 64-bit MPI_Aint),
I fixed this (I sure hope!) over the summer in commit 68f8c7aa798f7009
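For anyone following along, here is a tiny standalone demo (my own
illustration, not the gather.c code; assumes MPI_Aint is 64 bits) of why
the promotion matters -- the same product is exact in MPI_Aint and wraps
when stored in a 32-bit int:

#include <stdio.h>
#include <limits.h>
#include <mpi.h>

int main(void)
{
    /* the numbers from the report: ~2^19 blocks of 4096 bytes each */
    int nbytes = 4096;
    int blocks = 524288;

    /* computed in 64-bit MPI_Aint: exact */
    MPI_Aint product = (MPI_Aint)blocks * nbytes;
    printf("MPI_Aint product: %lld (INT_MAX is %d)\n",
           (long long)product, INT_MAX);

    /* what a 32-bit int ends up holding after the same multiplication
     * (implementation-defined; typically wraps to a negative value) */
    printf("stored in an int: %d\n", (int)product);
    return 0;
}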
> I found the following cases in the same
> file src/mpi/coll/gather.c (in both mpich-3.1.4 and mpich-3.2b4):
>
> Case 1:
>
> 222 mpi_errno = MPIC_Recv(((char *)recvbuf +
> 223 (((rank + mask) % comm_size)*recvcount*extent)),
> 224 recvblks * recvcount, recvtype, src,
> 225 MPIR_GATHER_TAG, comm,
> 226 &status, errflag);
>
> In line 223 I believe we get an integer overflow as follows. Suppose I
> run 2^20 = 1,048,576 ranks and do a gather with 4,096 elements. In this
> case (if I understand the algorithm well), ((rank + mask) % comm_size)
> would be 2^20 / 2 = 524,288, and recvcount = 4,096. Then the ((rank +
> mask) % comm_size)*recvcount expression would overflow: 524,288 * 4,096
> = 2,147,483,648, and become negative.
We're doing address math here, so I hope the compiler is not trying to
store intermediate results in an integer. However, if LLVM is saying
otherwise, then sure, let's promote those types.
((rank + mask) % comm_size)*recvcount is of type int*int, which is
then multiplied by an MPI_Aint... yeah, looks suspicious.
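If we do need to force it, this is the kind of change I'd expect for line
223 (untested sketch, not what's committed) -- cast before the int*int
product so the whole offset is computed in MPI_Aint:

/* sketch only: widen to MPI_Aint before multiplying so the offset
 * into recvbuf cannot wrap in 32-bit int arithmetic */
mpi_errno = MPIC_Recv(((char *)recvbuf +
                       ((MPI_Aint)((rank + mask) % comm_size)
                        * recvcount * extent)),
                      recvblks * recvcount, recvtype, src,
                      MPIR_GATHER_TAG, comm,
                      &status, errflag);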
There are quite a few places where we do math on recvcount and
recvblks -- does your static analysis show problems with all of those
spots?
> When multiplied with 'extent', which is size_t or MPI_Aint, it will
> become negative, I believe, or a huge positive value; in either case it
> will point to the wrong location in the recvbuf buffer, unless of
> course this wraparound behavior is intended.
Nope, wraparound sure is not intended! We would have no hope of
cleaning up all these locations if it were not for clang's
-Wshorten-64-to-32 flag.
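If you want to hunt for these locally: build with clang and that flag and
it reports every implicit 64-to-32-bit narrowing. A contrived example of
the sort of line it catches (not actual mpich code):

MPI_Aint nbytes = (MPI_Aint)1048576 * 4096;
int total = nbytes;   /* clang -Wshorten-64-to-32: implicit conversion
                         loses integer precision */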
> Case 2:
>
> There might be a similar problem in line 224 in the above code. With
> 2^20 ranks, recvblks becomes 524,288 (again if I understand well the
> algorithm), so the recvblks * recvcount operation will also overflow.
>
> I might be wrong on this -- I'm catching these issues with LLVM symbolic
> analysis -- so they could well be false positives,
Tell me more about LLVM symbolic analysis! We have started using
Coverity for static analysis but yes, there are a lot of messages
obscuring the genuine bugs.
> but I just wanted to
> check with the mpich developers whether they are valid issues or not.
> If they are, I believe the fixes would be easy to implement (just do
> all of these computations in size_t).
I promoted the prototype for MPIC_Recv to this:
int MPIC_Recv(void *buf, MPI_Aint count,
              MPI_Datatype datatype, int source, int tag,
              MPID_Comm *comm_ptr, MPI_Status *status,
              MPIR_Errflag_t *errflag)
Top-level MPI routines are stuck using an 'int' count, but internally we
can (and now do) use a larger MPI_Aint type.
True, int*int can overflow, but since that parameter to MPIC_Recv is of
the larger MPI_Aint type, we're ok here. LLVM disagrees?
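If LLVM does disagree -- i.e. if what it flags is the int*int product
being evaluated before the conversion to the MPI_Aint parameter -- the
defensive call-site version would just widen one operand first (untested
sketch; 'recvptr' here stands in for the recvbuf offset expression from
case 1):

/* sketch only: widen recvblks so the count passed to MPIC_Recv is
 * computed in MPI_Aint rather than 32-bit int */
MPI_Aint rcount = (MPI_Aint)recvblks * recvcount;
mpi_errno = MPIC_Recv(recvptr, rcount, recvtype, src,
                      MPIR_GATHER_TAG, comm, &status, errflag);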
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA