[mpich-discuss] Error in mutex destruction at the end of MPI Program

Raffenetti, Ken raffenet at anl.gov
Wed Dec 8 12:02:54 CST 2021

We do not see this issue in our regular multithreaded testing. Is it possible the other thread is using MPI while it is being finalized? I imagine that could lead to an error when destroying the mutex. It would also explain the nondeterministic nature of the error.


On 12/8/21, 7:26 AM, "Pedro Henrique Di Francia Rosso via discuss" <discuss at mpich.org> wrote:

    Hello there,
    I'm Pedro, and I work in a research group studying the use of OpenMP on distributed systems, with MPI as the communication layer.

    In particular, we work with multithreaded MPICH: there are two main "users" of MPI in our system, an Event System and a Fault Tolerance (FT) system, which run together in separate threads. These systems mostly issue asynchronous MPI messages, whose requests are either freed or tested until completion. Everything works fine and correctly.

    Recently, we have sometimes been getting an assertion error at program exit, when calling MPI_Finalize(). Here is the error with part of the call stack. (An important note: this error does not always happen; in fact, it is much more common for the application to finish correctly than to assert like this.)

    Error in system call pthread_mutex_destroy: Device or resource busy
    Assertion failed in file src/mpi/init/mutex.c at line 91: err == 0
    /usr/local/mpi/lib/libmpi.so.12(MPL_backtrace_show+0x35) [0x7f3429505673]
    /usr/local/mpi/lib/libmpi.so.12(+0x3248b4) [0x7f34294a38b4]
    /usr/local/mpi/lib/libmpi.so.12(MPI_Finalize+0xb8) [0x7f34293a4b28]
    /builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.rtl.mpi.so(+0xafa5) [0x7f342b800fa5]
    /lib/x86_64-linux-gnu/libc.so.6(+0x43161) [0x7f342a849161]
    /lib/x86_64-linux-gnu/libc.so.6(+0x4325a) [0x7f342a84925a]
    /builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.rtl.mpi.so(+0xf3d9) [0x7f342b8053d9]
    /builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.so.12(__tgt_register_lib+0xf9) [0x7f342b8673c9]
    ./ompcluster/main() [0x40ef3d]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x78) [0x7f342a827b88]
    ./ompcluster/main() [0x402a5a]
    Abort(1) on node 2: Internal error

    I've looked at the mutex.c file and saw that the assertion fires when destroying the global mutex used by multithreaded MPI. I would like to ask whether there are any known scenarios or common causes for this problem, to help me find out what could be happening at the end of the execution.

    Here is the MPICH configuration in our container:

    $ mpichversion
    MPICH Version:          3.4.2
    MPICH Release date:     Wed May 26 15:51:40 CDT 2021
    MPICH Device:           ch4:ucx
    MPICH configure:        --prefix=/usr/local/mpi --disable-static --with-device=ch4:ucx --with-ucx=/usr/local/ucx
    MPICH CC:       gcc    -O2
    MPICH CXX:      g++   -O2
    MPICH F77:      gfortran   -O2
    MPICH FC:       gfortran   -O2
    MPICH Custom Information: 

    Regards, Pedro
