[mpich-devel] MPICH hangs in MPI_Waitall when MPI_Cancel is used

Balaji, Pavan balaji at anl.gov
Thu Jun 4 14:02:32 CDT 2015


Actually, we just discovered that the portals4 implementation is incorrect as well.  I'll try to write a test to demonstrate it.
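
For what it's worth, here is a hypothetical sketch of the shape such a test could take (this is not the actual test; the rank numbers, tag, and termination scheme are illustrative). Rank 0 cancels a synchronous send and then completes the request; if the cancel succeeded, it posts the replacement send the standard requires, so a correct implementation terminates for either cancel outcome:

#include <mpi.h>
#include <stdio.h>

int main(void)
{
    MPI_Init(NULL, NULL);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size >= 2 && rank == 0) {
        MPI_Request req;
        MPI_Status st;
        int cancelled;

        /* synchronous-mode send: it cannot complete until it is matched
           (or successfully cancelled) */
        MPI_Issend(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Cancel(&req);
        MPI_Wait(&req, &st);   /* a cancelled request must still be completed */
        MPI_Test_cancelled(&st, &cancelled);
        printf("cancel %s\n", cancelled ? "succeeded" : "failed");

        if (cancelled) {
            /* no part of the message was received, so the matching
               receive must be satisfied by another send */
            MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {
        /* matched either by the original send (cancel failed) or by the
           replacement send (cancel succeeded) */
        MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}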

  -- Pavan





On 6/4/15, 1:11 PM, "Halim Amer" <aamer at anl.gov> wrote:

>That's right, but more importantly, cancelling sends is still not 
>supported by the MXM and OFI netmods. Intel and Mellanox are working on 
>it (tickets 2266 and 2270). It works fine so far with the TCP and 
>Portals4 netmods though.
>
>--Halim
>
>
>On 6/4/15 1:00 PM, Rob Latham wrote:
>>
>>
>> On 06/04/2015 12:19 PM, Jeff Hammond wrote:
>>> Thanks for pointing that out.  It runs correctly now.  Sorry for the
>>> stupid question.
>>
>> It just so happens, Jeff, that they've been spending a lot of time
>> debugging cancel-send operations for all our various devices, so
>> "cancel semantics" are (more so than usual) quite warm in the cache.
>>
>> ==rob
>>
>>> On Thu, Jun 4, 2015 at 11:49 AM, Halim Amer <aamer at anl.gov> wrote:
>>>> Hi Jeff,
>>>>
>>>> I don't think it is a correct program. If the send is successfully
>>>> cancelled, then the origin has to satisfy the destination's matching
>>>> receive with another send. Since each rank posts n receives but, after
>>>> a successful cancel, delivers only n-1 messages, one receive can never
>>>> complete, so the hang in MPI_Waitall is an expected result.
>>>>
>>>> This is what the standard says (p. 102):
>>>>
>>>> "...or that the send is successfully cancelled, in which case no part
>>>> of the message was received at the destination. Then, any matching
>>>> receive has to be satisfied by another send."
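>>>>
>>>> A minimal sketch of the obligation this places on the sender
>>>> (illustrative fragment, not code from Jeff's program; `req` and
>>>> `target` stand for the cancelled request and its destination):
>>>>
>>>>     MPI_Status st;
>>>>     MPI_Cancel(&req);
>>>>     MPI_Wait(&req, &st);  /* a cancelled request must still be completed */
>>>>     int cancelled;
>>>>     MPI_Test_cancelled(&st, &cancelled);
>>>>     if (cancelled) {
>>>>         /* nothing was delivered; the matching receive must be
>>>>            satisfied by another send */
>>>>         MPI_Send(NULL, 0, MPI_BYTE, target, 0, MPI_COMM_WORLD);
>>>>     }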
>>>>
>>>> --Halim
>>>>
>>>> Abdelhalim Amer (Halim)
>>>> Postdoctoral Appointee
>>>> MCS Division
>>>> Argonne National Laboratory
>>>>
>>>>
>>>> On 6/4/15 9:21 AM, Jeff Hammond wrote:
>>>>>
>>>>> I can't tell for sure if this is a correct program, but multiple
>>>>> members of the MPI Forum suggested it is.
>>>>>
>>>>> If it is a correct program, it appears to expose a bug in MPICH,
>>>>> because the MPI_Waitall hangs.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jeff
>>>>>
>>>>> $ mpicc -g -Wall -std=c99 cancel-sucks.c && mpiexec -n 4 ./a.out
>>>>>
>>>>> $ mpichversion
>>>>> MPICH Version:    3.2b1
>>>>> MPICH Release date: unreleased development copy
>>>>> MPICH Device:    ch3:nemesis
>>>>> MPICH configure: CC=gcc-4.9 CXX=g++-4.9 FC=gfortran-4.9
>>>>> F77=gfortran-4.9 --enable-cxx --enable-fortran
>>>>> --enable-threads=runtime --enable-g=dbg --with-pm=hydra
>>>>> --prefix=/opt/mpich/dev/gcc/default --enable-wrapper-rpath
>>>>> --enable-static --enable-shared
>>>>> MPICH CC: gcc-4.9    -g -O2
>>>>> MPICH CXX: g++-4.9   -g -O2
>>>>> MPICH F77: gfortran-4.9   -g -O2
>>>>> MPICH FC: gfortran-4.9   -g -O2
>>>>>
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <mpi.h>
>>>>>
>>>>> const int n=1000;
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>       MPI_Init(NULL,NULL);
>>>>>
>>>>>       int size, rank;
>>>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>       if (size<2) {
>>>>>           printf("You must use 2 or more processes!\n");
>>>>>           MPI_Finalize();
>>>>>           exit(1);
>>>>>       }
>>>>>
>>>>>       MPI_Request reqs[2*n];
>>>>>
>>>>>       int target = (rank+1)%size;
>>>>>       for (int i=0; i<n; i++) {
>>>>>           MPI_Issend(NULL,0,MPI_BYTE,target,0,MPI_COMM_WORLD,&(reqs[i]));
>>>>>       }
>>>>>
>>>>>       srand((unsigned)(rank+MPI_Wtime()));
>>>>>       int slot = rand()%n;
>>>>>       printf("Cancelling send %d.\n", slot);
>>>>>       MPI_Cancel(&reqs[slot]);
>>>>>
>>>>> #if 1
>>>>>       MPI_Barrier(MPI_COMM_WORLD);
>>>>> #endif
>>>>>
>>>>>       int origin = (rank==0) ? (size-1) : (rank-1);
>>>>>       for (int i=0; i<n; i++) {
>>>>>           MPI_Irecv(NULL,0,MPI_BYTE,origin,0,MPI_COMM_WORLD,&(reqs[n+i]));
>>>>>       }
>>>>>
>>>>>       MPI_Status stats[2*n];
>>>>>       MPI_Waitall(2*n,reqs,stats);
>>>>>
>>>>>       for (int i=0; i<n; i++) {
>>>>>           int flag;
>>>>>           MPI_Test_cancelled(&(stats[i]),&flag);
>>>>>           if (flag) {
>>>>>               printf("Status %d indicates cancel was successful.\n", i);
>>>>>           }
>>>>>       }
>>>>>
>>>>>       MPI_Finalize();
>>>>>       return 0;
>>>>> }
>>>>>
>>>>>