[mpich-discuss] Possible integer overflows at scale in gather.c
Rob Latham
robl at mcs.anl.gov
Thu Oct 1 11:15:50 CDT 2015
On 09/30/2015 04:08 PM, Ignacio Laguna wrote:
> Hi,
>
> I found some 'potential' integer overflows when running mpich at large
> scale and/or with large inputs in gather.c. I believe that they are
> related to ticket #1767
> (http://trac.mpich.org/projects/mpich/ticket/1767), but I didn't see any
> other bug reports about them so I thought I should confirm with mpich
> developers.
>
> In addition to this case in line 171:
>
> 98 int tmp_buf_size, missing;
> ...
> 169 if (nbytes < MPIR_CVAR_GATHER_VSMALL_MSG_SIZE) tmp_buf_size++;
> 170
> 171 tmp_buf_size *= nbytes;
>
> which I believe is fixed in mpich-3.2b4 (where tmp_buf_size is
> declared as a 64-bit MPI_Aint),
I fixed this (I sure hope!) over the summer in commit 68f8c7aa798f7009
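For anyone following along, here is a tiny standalone demo (my own
illustration, not the gather.c code; assumes MPI_Aint is 64 bits) of why
the promotion matters -- the same product is exact in MPI_Aint and wraps
when stored in a 32-bit int:

#include <stdio.h>
#include <limits.h>
#include <mpi.h>

int main(void)
{
    /* the numbers from the report: ~2^19 blocks of 4096 bytes each */
    int nbytes = 4096;
    int blocks = 524288;

    /* computed in 64-bit MPI_Aint: exact */
    MPI_Aint product = (MPI_Aint)blocks * nbytes;
    printf("MPI_Aint product: %lld (INT_MAX is %d)\n",
           (long long)product, INT_MAX);

    /* what a 32-bit int ends up holding after the same multiplication
     * (implementation-defined; typically wraps to a negative value) */
    printf("stored in an int: %d\n", (int)product);
    return 0;
}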
> I found the following cases in the same
> file src/mpi/coll/gather.c (in both mpich-3.1.4 and mpich-3.2b4):
>
> Case 1:
>
> 222 mpi_errno = MPIC_Recv(((char *)recvbuf +
> 223 (((rank + mask) % comm_size)*recvcount*extent)),
> 224 recvblks * recvcount, recvtype, src,
> 225 MPIR_GATHER_TAG, comm,
> 226 &status, errflag);
>
> In line 223 I believe we get an integer overflow as follows. Suppose I
> run 2^20 = 1,048,576 ranks and do a gather with 4,096 elements. In this
> case (if I understand the algorithm well), ((rank + mask) % comm_size)
> would be 2^20 / 2 = 524,288, and recvcount = 4,096. Then the ((rank +
> mask) % comm_size)*recvcount expression would overflow: 524,288 * 4,096
> = 2,147,483,648, and become negative.
We're doing address math here, so I hope the compiler is not trying to
store intermediate results in an integer. However, if LLVM is saying
otherwise, then sure, let's promote those types.
((rank + mask) % comm_size)*recvcount is of type int*int, which is
then multiplied by an MPI_Aint... yeah, looks suspicious.
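If we do need to force it, this is the kind of change I'd expect for line
223 (untested sketch, not what's committed) -- cast before the int*int
product so the whole offset is computed in MPI_Aint:

/* sketch only: widen to MPI_Aint before multiplying so the offset
 * into recvbuf cannot wrap in 32-bit int arithmetic */
mpi_errno = MPIC_Recv(((char *)recvbuf +
                       ((MPI_Aint)((rank + mask) % comm_size)
                        * recvcount * extent)),
                      recvblks * recvcount, recvtype, src,
                      MPIR_GATHER_TAG, comm,
                      &status, errflag);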
There are quite a few places where we do math on recvcount and
recvblks -- does your static analysis show problems with all of those
spots?
> When multiplied with 'extent', which is size_t or MPI_Aint, it will
> become negative, I believe, or a huge positive value; in either case it
> will point to the wrong location in the recvbuf buffer, unless of
> course this wraparound behavior is intended.
Nope, wraparound sure is not intended! We would have no hope of
cleaning up all these locations if it were not for clang's
-Wshorten-64-to-32 flag.
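If you want to hunt for these locally: build with clang and that flag and
it reports every implicit 64-to-32-bit narrowing. A contrived example of
the sort of line it catches (not actual mpich code):

MPI_Aint nbytes = (MPI_Aint)1048576 * 4096;
int total = nbytes;   /* clang -Wshorten-64-to-32: implicit conversion
                         loses integer precision */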
> Case 2:
>
> There might be a similar problem in line 224 in the above code. With
> 2^20 ranks, recvblks becomes 524,288 (again if I understand well the
> algorithm), so the recvblks * recvcount operation will also overflow.
>
> I might be wrong on this -- I'm catching these issues with LLVM symbolic
> analysis -- so they could well be false positives,
Tell me more about LLVM symbolic analysis! We have started using
Coverity for static analysis but yes, there are a lot of messages
obscuring the genuine bugs.
> but I just wanted to
> check with the mpich developers whether they are valid issues or not.
> If they are, I believe the fixes would be easy to implement (just do
> all of these computations in size_t).
I promoted the prototype for MPIC_Recv to this:
int MPIC_Recv(void *buf, MPI_Aint count,
              MPI_Datatype datatype, int source, int tag,
              MPID_Comm *comm_ptr, MPI_Status *status,
              MPIR_Errflag_t *errflag)
Top-level MPI routines are stuck using an 'int' count, but internally we
can (and now do) use a larger MPI_Aint type.
True, int*int can overflow, but since that parameter to MPIC_Recv is of
the larger MPI_Aint type, we're ok here. LLVM disagrees?
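If LLVM does disagree -- i.e. if what it flags is the int*int product
being evaluated before the conversion to the MPI_Aint parameter -- the
defensive call-site version would just widen one operand first (untested
sketch; 'recvptr' here stands in for the recvbuf offset expression from
case 1):

/* sketch only: widen recvblks so the count passed to MPIC_Recv is
 * computed in MPI_Aint rather than 32-bit int */
MPI_Aint rcount = (MPI_Aint)recvblks * recvcount;
mpi_errno = MPIC_Recv(recvptr, rcount, recvtype, src,
                      MPIR_GATHER_TAG, comm, &status, errflag);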
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA