[mpich-discuss] ULFM revoke doesn't work properly?

Guo, Yanfei yguo at anl.gov
Fri May 26 11:00:42 CDT 2017


Hi Nils,

The hang in revoke+shrink is a known issue (see https://github.com/pmodels/mpich/issues/2198). ULFM support is still experimental, but we will keep working on it.

Best,

Yanfei Guo
Postdoctoral Researcher
MCS Division, ANL


On 5/18/17, 5:03 AM, "Nils-Arne Dreier" <n.dreier at uni-muenster.de> wrote:

    Dear MPICH community.
    
    I'm currently playing around with the ULFM features for fault tolerance.
    I know that these features are experimental, but I want to discuss the
    following example.
    
    I used the following minimal example, which deadlocks for all nonzero
    ranks in MPI_Ssend. To my understanding, MPI_Ssend should return
    MPI_ERR_REVOKED, shouldn't it?
    
    If I substitute MPI_Ssend with MPI_Send, all ranks reach the
    MPIX_Comm_shrink call but then deadlock. Very rarely I have observed
    that the shrink succeeds, but I can't determine the reason.
    
    #include <iostream>
    #include <mpi.h>
    
    void checkMPIresult(int result){
        if(result!=MPI_SUCCESS){
            int len;
            char msg[MPI_MAX_ERROR_STRING];
            MPI_Error_string(result, msg, &len);
            std::cout << msg << std::endl;
        }
    }
    
    int main(int argc, char** argv){
        checkMPIresult(MPI_Init(&argc,&argv));
        MPI_Comm comm, new_comm;
        checkMPIresult(MPI_Comm_dup(MPI_COMM_WORLD,&comm));
        checkMPIresult(MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN));
        int rank = -1;
        checkMPIresult(MPI_Comm_rank(comm,&rank));
        int error = 1;
        if(rank==0){
            std::cout << "revoking..." << std::endl;
            checkMPIresult(MPIX_Comm_revoke(comm));
            error = 0;
        }else{
            checkMPIresult(MPI_Ssend(&error,1,MPI_INT,0,0,comm));
        }
        std::cout << rank << "\twaiting for agree..." << std::endl;
        checkMPIresult(MPIX_Comm_agree(comm,&error));
        std::cout << rank << "\tagreed on " << error << std::endl;
        checkMPIresult(MPIX_Comm_shrink(comm,&new_comm));
        std::cout << rank << "\tcomm shrinked" << std::endl;
        checkMPIresult(MPI_Comm_free(&new_comm));
        checkMPIresult(MPI_Comm_free(&comm));
        checkMPIresult(MPI_Finalize());
        return 0;
    }
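    
    For reference (not part of the reproducer above), this is the kind of
    check I would expect to be able to write once the revoke is noticed,
    assuming MPICH exposes the ULFM error classes under the names
    MPIX_ERR_REVOKED and MPIX_ERR_PROC_FAILED; this is only a sketch of how
    I read the proposed semantics, not something I have verified:
    
    #include <iostream>
    #include <mpi.h>
    
    // Variant of checkMPIresult that also reports the ULFM error class.
    // MPIX_ERR_REVOKED / MPIX_ERR_PROC_FAILED are assumed to be the MPICH
    // names for the proposed MPI_ERR_REVOKED / MPI_ERR_PROC_FAILED classes.
    void checkMPIresultVerbose(int result){
        if(result!=MPI_SUCCESS){
            int len, errclass;
            char msg[MPI_MAX_ERROR_STRING];
            MPI_Error_class(result, &errclass); // map the error code to its class
            MPI_Error_string(result, msg, &len);
            if(errclass==MPIX_ERR_REVOKED){
                std::cout << "communicator revoked: " << msg << std::endl;
            }else if(errclass==MPIX_ERR_PROC_FAILED){
                std::cout << "process failure: " << msg << std::endl;
            }else{
                std::cout << msg << std::endl;
            }
        }
    }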
    
    I compile with
    mpicxx -std=c++11 -pthread mpich-shrink.cc -o mpich-shrink
    
    and run with
    mpirun -n 4 --disable-auto-cleanup ./mpich-shrink
    
    I used a recent master branch. The output of mpirun --version is:
    HYDRA build details:
        Version:                                 3.3a2
        Release Date:                            unreleased development copy
        CC:                              gcc   
        CXX:                             g++   
        F77:                             gfortran  
        F90:                             gfortran  
        Configure options:                       '--disable-option-checking'
    '--prefix=/home/nils/tmp/mpich-git-install'
    '--enable-error-checking=all' '--cache-file=/dev/null' '--srcdir=.'
    'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS=
    -I/home/nils/tmp/mpich/src/mpl/include
    -I/home/nils/tmp/mpich/src/mpl/include
    -I/home/nils/tmp/mpich/src/openpa/src
    -I/home/nils/tmp/mpich/src/openpa/src -D_REENTRANT
    -I/home/nils/tmp/mpich/src/mpi/romio/include' 'MPLLIBNAME=mpl'
        Process Manager:                         pmi
        Launchers available:                     ssh rsh fork slurm ll lsf
    sge manual persist
        Topology libraries available:            hwloc
        Resource management kernels available:   user slurm ll lsf sge pbs
    cobalt
        Checkpointing libraries available:       blcr
        Demux engines available:                 poll select
    
    Thank you
    Nils
    
    
    


