[mpich-discuss] ULFM revoke doesn't work properly?
Guo, Yanfei
yguo at anl.gov
Fri May 26 11:00:42 CDT 2017
Hi Nils,
The hang in revoke+shrink is a known issue (see https://github.com/pmodels/mpich/issues/2198). ULFM is still experimental, but we will keep working on it.
Best,
Yanfei Guo
Postdoctoral Researcher
MCS Division, ANL
On 5/18/17, 5:03 AM, "Nils-Arne Dreier" <n.dreier at uni-muenster.de> wrote:
Dear MPICH community,
I'm currently playing around with the ULFM features for fault tolerance.
I know that these features are experimental, but I would like to discuss
the following example.
The following minimal example deadlocks on all nonzero ranks at
MPI_Ssend. As I understand it, MPI_Ssend should return
MPIX_ERR_REVOKED, shouldn't it?
If I replace MPI_Ssend with MPI_Send, all ranks reach the
MPIX_Comm_shrink call but then deadlock. Very rarely the shrink
succeeds, but I can't determine the reason.
#include <iostream>
#include <mpi.h>

void checkMPIresult(int result){
    if(result!=MPI_SUCCESS){
        int len;
        char msg[MPI_MAX_ERROR_STRING];
        MPI_Error_string(result, msg, &len);
        std::cout << msg << std::endl;
    }
}

int main(int argc, char** argv){
    checkMPIresult(MPI_Init(&argc,&argv));
    MPI_Comm comm, new_comm;
    checkMPIresult(MPI_Comm_dup(MPI_COMM_WORLD,&comm));
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    int rank = -1;
    checkMPIresult(MPI_Comm_rank(comm,&rank));
    int error = 1;
    if(rank==0){
        std::cout << "revoking..." << std::endl;
        checkMPIresult(MPIX_Comm_revoke(comm));
        error = 0;
    }else{
        checkMPIresult(MPI_Ssend(&error,1,MPI_INT,0,0,comm));
    }
    std::cout << rank << "\twaiting for agree..." << std::endl;
    checkMPIresult(MPIX_Comm_agree(comm,&error));
    std::cout << rank << "\tagreed on " << error << std::endl;
    checkMPIresult(MPIX_Comm_shrink(comm,&new_comm));
    std::cout << rank << "\tcomm shrinked" << std::endl;
    checkMPIresult(MPI_Comm_free(&new_comm));
    checkMPIresult(MPI_Comm_free(&comm));
    checkMPIresult(MPI_Finalize());
    return 0;
}
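
The behaviour expected from the example (a send on a revoked
communicator returns an error instead of blocking, after which every
rank moves on to agree and shrink) can be sketched as a small decision
helper. This is only an illustration under stated assumptions: the names
Recovery, ErrorClasses and nextStep are hypothetical and are not part of
MPICH or ULFM, and the error classes are passed in as plain integers so
that the sketch does not depend on an MPI installation.

```cpp
// Hypothetical names for illustration; none of these are MPICH or ULFM APIs.
enum class Recovery { Continue, Shrink, RevokeThenShrink, Abort };

// Error classes supplied by the caller. With MPICH's ULFM extensions these
// are assumed to be MPI_SUCCESS, MPIX_ERR_REVOKED and MPIX_ERR_PROC_FAILED.
struct ErrorClasses {
    int success;
    int revoked;
    int procFailed;
};

// Map the error class of a completed MPI call to the next recovery step.
Recovery nextStep(int errorClass, const ErrorClasses& classes) {
    if (errorClass == classes.success) {
        return Recovery::Continue;          // nothing to recover from
    }
    if (errorClass == classes.revoked) {
        return Recovery::Shrink;            // someone already revoked the communicator
    }
    if (errorClass == classes.procFailed) {
        return Recovery::RevokeThenShrink;  // tell the other ranks first, then shrink
    }
    return Recovery::Abort;                 // unrelated error: no ULFM recovery applies
}
```

In a real program the errorClass argument would come from
MPI_Error_class applied to the return code of the failing call, since
return codes themselves are implementation-specific. In the example
above, the nonzero ranks would then be expected to take the Shrink
branch once MPI_Ssend returns.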
I compile with
mpicxx -std=c++11 -pthread mpich-shrink.cc -o mpich-shrink
and run with
mpirun -n 4 --disable-auto-cleanup ./mpich-shrink
I used a recent master branch. The output of mpirun --version is:
HYDRA build details:
    Version: 3.3a2
    Release Date: unreleased development copy
    CC: gcc
    CXX: g++
    F77: gfortran
    F90: gfortran
    Configure options: '--disable-option-checking'
        '--prefix=/home/nils/tmp/mpich-git-install'
        '--enable-error-checking=all' '--cache-file=/dev/null' '--srcdir=.'
        'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS=
        -I/home/nils/tmp/mpich/src/mpl/include
        -I/home/nils/tmp/mpich/src/mpl/include
        -I/home/nils/tmp/mpich/src/openpa/src
        -I/home/nils/tmp/mpich/src/openpa/src -D_REENTRANT
        -I/home/nils/tmp/mpich/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager: pmi
    Launchers available: ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available: hwloc
    Resource management kernels available: user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available: blcr
    Demux engines available: poll select
Thank you
Nils