[mpich-devel] MPICH hangs in MPI_Waitall when MPI_Cancel is used

Halim Amer aamer at anl.gov
Thu Jun 4 13:11:57 CDT 2015


That's right, but more importantly, cancelling sends is still not 
supported by the MXM and OFI netmods. Intel and Mellanox are working on 
it (tickets 2266 and 2270). It works fine so far with the TCP and 
Portals4 netmods though.

--Halim


On 6/4/15 1:00 PM, Rob Latham wrote:
>
>
> On 06/04/2015 12:19 PM, Jeff Hammond wrote:
>> Thanks for pointing that out.  It runs correctly now.  Sorry for the
>> stupid question.
>
>   It just so happens, Jeff, that they've been spending a lot of time
> debugging cancel send operations for all our various devices and so
> "cancel semantics" are (more so than usual) quite warm in the cache.
>
> ==rob
>
>> On Thu, Jun 4, 2015 at 11:49 AM, Halim Amer <aamer at anl.gov> wrote:
>>> Hi Jeff,
>>>
>>> I don't think it is a correct program. If the send is correctly canceled
>>> then the origin has to satisfy the destination with another send. The
>>> hang is an expected result.
>>>
>>> This is what the standard says (P102):
>>>
>>> "...or that the send is successfully cancelled, in which case no part
>>> of the message was received at the destination. Then, any matching
>>> receive has to be satisfied by another send."
>>>
>>> --Halim
>>>
>>> Abdelhalim Amer (Halim)
>>> Postdoctoral Appointee
>>> MCS Division
>>> Argonne National Laboratory
>>>
>>>
>>> On 6/4/15 9:21 AM, Jeff Hammond wrote:
>>>>
>>>> I can't tell for sure if this is a correct program, but multiple
>>>> members of the MPI Forum suggested it is.
>>>>
>>>> If it is a correct program, it appears to expose a bug in MPICH,
>>>> because the MPI_Waitall hangs.
>>>>
>>>> Thanks,
>>>>
>>>> Jeff
>>>>
>>>> $ mpicc -g -Wall -std=c99 cancel-sucks.c && mpiexec -n 4 ./a.out
>>>>
>>>> $ mpichversion
>>>> MPICH Version:    3.2b1
>>>> MPICH Release date: unreleased development copy
>>>> MPICH Device:    ch3:nemesis
>>>> MPICH configure: CC=gcc-4.9 CXX=g++-4.9 FC=gfortran-4.9
>>>> F77=gfortran-4.9 --enable-cxx --enable-fortran
>>>> --enable-threads=runtime --enable-g=dbg --with-pm=hydra
>>>> --prefix=/opt/mpich/dev/gcc/default --enable-wrapper-rpath
>>>> --enable-static --enable-shared
>>>> MPICH CC: gcc-4.9    -g -O2
>>>> MPICH CXX: g++-4.9   -g -O2
>>>> MPICH F77: gfortran-4.9   -g -O2
>>>> MPICH FC: gfortran-4.9   -g -O2
>>>>
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <mpi.h>
>>>>
>>>> const int n=1000;
>>>>
>>>> int main(void)
>>>> {
>>>>       MPI_Init(NULL,NULL);
>>>>
>>>>       int size, rank;
>>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>       if (size<2) {
>>>>           printf("You must use 2 or more processes!\n");
>>>>           MPI_Finalize();
>>>>           exit(1);
>>>>       }
>>>>
>>>>       MPI_Request reqs[2*n];
>>>>
>>>>       int target = (rank+1)%size;
>>>>       for (int i=0; i<n; i++) {
>>>>           MPI_Issend(NULL,0,MPI_BYTE,target,0,MPI_COMM_WORLD,&(reqs[i]));
>>>>       }
>>>>
>>>>       srand((unsigned)(rank+MPI_Wtime()));
>>>>       int slot = rand()%n;
>>>>       printf("Cancelling send %d.\n", slot);
>>>>       MPI_Cancel(&reqs[slot]);
>>>>
>>>> #if 1
>>>>       MPI_Barrier(MPI_COMM_WORLD);
>>>> #endif
>>>>
>>>>       int origin = (rank==0) ? (size-1) : (rank-1);
>>>>       for (int i=0; i<n; i++) {
>>>>           MPI_Irecv(NULL,0,MPI_BYTE,origin,0,MPI_COMM_WORLD,&(reqs[n+i]));
>>>>       }
>>>>
>>>>       MPI_Status stats[2*n];
>>>>       MPI_Waitall(2*n,reqs,stats);
>>>>
>>>>       for (int i=0; i<n; i++) {
>>>>           int flag;
>>>>           MPI_Test_cancelled(&(stats[i]),&flag);
>>>>           if (flag) {
>>>>               printf("Status %d indicates cancel was successful.\n", i);
>>>>           }
>>>>       }
>>>>
>>>>       MPI_Finalize();
>>>>       return 0;
>>>> }
>>>>
>>>>
>>
>>
>>
>
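
The counting argument in Halim's reply (a successfully cancelled send
delivers nothing, so a matching receive must be satisfied by another send)
can be made concrete with a toy model. This is only a sketch: it is plain C
bookkeeping, not MPI code and not a description of how MPICH matches
messages, and the function name pending_receives and its parameters are
made up for illustration.

```c
#include <assert.h>

/*
 * Toy bookkeeping model, NOT MPI code and not how MPICH matches messages.
 * pending_receives() and its parameters are hypothetical names.
 *
 * Returns how many posted receives can never complete, given how many sends
 * the origin posted, how many of those were successfully cancelled, and how
 * many replacement sends the origin posted afterwards.
 */
int pending_receives(int sends_posted, int sends_cancelled,
                     int replacement_sends, int recvs_posted)
{
    assert(sends_posted >= 0 && recvs_posted >= 0 && replacement_sends >= 0);
    assert(sends_cancelled >= 0 && sends_cancelled <= sends_posted);

    /* A successfully cancelled send delivers nothing to the destination. */
    int deliverable = sends_posted - sends_cancelled + replacement_sends;

    /* Each deliverable message satisfies at most one matching receive. */
    int pending = recvs_posted - deliverable;
    return pending > 0 ? pending : 0;
}
```

With the numbers from Jeff's program (n = 1000, one send cancelled per
rank), pending_receives(1000, 1, 0, 1000) is 1 whenever the cancel succeeds,
which corresponds to the MPI_Waitall that never returns. Posting one
replacement send, pending_receives(1000, 1, 1, 1000), brings the count back
to 0.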

