[mpich-discuss] ULFM revoke doesn't work properly?

Nils-Arne Dreier n.dreier at uni-muenster.de
Thu May 18 05:03:32 CDT 2017


Dear MPICH community,

I'm currently experimenting with the ULFM features for fault tolerance.
I know that these features are experimental, but I would like to discuss
the following example.

I used the following minimal example, which deadlocks on all nonzero
ranks at MPI_Ssend. As I understand it, MPI_Ssend should return
MPI_ERR_REVOKED, shouldn't it?

If I replace MPI_Ssend with MPI_Send, all ranks reach the
MPIX_Comm_shrink call but then deadlock. Very rarely the shrink
succeeds, but I can't determine why.

#include <iostream>
#include <mpi.h>

// Print the MPI error message if a call did not return MPI_SUCCESS.
void checkMPIresult(int result){
    if(result!=MPI_SUCCESS){
        int len;
        char msg[MPI_MAX_ERROR_STRING];
        MPI_Error_string(result, msg, &len);
        std::cout << msg << std::endl;
    }
}

int main(int argc, char** argv){
    checkMPIresult(MPI_Init(&argc,&argv));
    MPI_Comm comm, new_comm;
    checkMPIresult(MPI_Comm_dup(MPI_COMM_WORLD,&comm));
    // Return error codes to the caller instead of aborting the job.
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    int rank = -1;
    checkMPIresult(MPI_Comm_rank(comm,&rank));
    int error = 1;
    if(rank==0){
        // Rank 0 revokes comm and never posts a matching receive.
        std::cout << "revoking..." << std::endl;
        checkMPIresult(MPIX_Comm_revoke(comm));
        error = 0;
    }else{
        // Deadlocks here; I would expect MPI_ERR_REVOKED to be returned.
        checkMPIresult(MPI_Ssend(&error,1,MPI_INT,0,0,comm));
    }
    std::cout << rank << "\twaiting for agree..." << std::endl;
    checkMPIresult(MPIX_Comm_agree(comm,&error));
    std::cout << rank << "\tagreed on " << error << std::endl;
    // With MPI_Send instead of MPI_Ssend, the deadlock occurs here.
    checkMPIresult(MPIX_Comm_shrink(comm,&new_comm));
    std::cout << rank << "\tcomm shrinked" << std::endl;
    checkMPIresult(MPI_Comm_free(&new_comm));
    checkMPIresult(MPI_Comm_free(&comm));
    checkMPIresult(MPI_Finalize());
    return 0;
}

I compile with
mpicxx -std=c++11 -pthread mpich-shrink.cc -o mpich-shrink

and run with
mpirun -n 4 --disable-auto-cleanup ./mpich-shrink

I built from a recent master branch. The output of mpirun --version is:
HYDRA build details:
    Version:                                 3.3a2
    Release Date:                            unreleased development copy
    CC:                              gcc   
    CXX:                             g++   
    F77:                             gfortran  
    F90:                             gfortran  
    Configure options:                       '--disable-option-checking'
'--prefix=/home/nils/tmp/mpich-git-install'
'--enable-error-checking=all' '--cache-file=/dev/null' '--srcdir=.'
'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS=
-I/home/nils/tmp/mpich/src/mpl/include
-I/home/nils/tmp/mpich/src/mpl/include
-I/home/nils/tmp/mpich/src/openpa/src
-I/home/nils/tmp/mpich/src/openpa/src -D_REENTRANT
-I/home/nils/tmp/mpich/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf
sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs
cobalt
    Checkpointing libraries available:       blcr
    Demux engines available:                 poll select

Thank you
Nils

