[mpich-discuss] Error in mutex destruction at the end of MPI Program
Pedro Henrique Di Francia Rosso
p233687 at dac.unicamp.br
Wed Dec 8 07:25:46 CST 2021
Hello there,
I'm Pedro, and I work in a research group that studies the use of OpenMP
in distributed systems, with MPI as the communication layer.
In particular, we are working with multithreaded MPICH. Our system has two
main "users" of MPI, an Event System and a Fault Tolerance (FT) system,
which work together in separate threads. Both systems mainly exchange
asynchronous MPI messages, whose requests are either freed or tested
until completion. Normally, everything works correctly.
Recently, we have occasionally been getting an assertion failure at the end
of the program, when MPI_Finalize() is called. Here is the error with part
of the call stack. (An important note: this error does not always happen.
In fact, it is much more common for the application to finish correctly
than to fail like this.)
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/mutex.c:90
Assertion failed in file src/mpi/init/mutex.c at line 91: err == 0
/usr/local/mpi/lib/libmpi.so.12(MPL_backtrace_show+0x35) [0x7f3429505673]
/usr/local/mpi/lib/libmpi.so.12(+0x3248b4) [0x7f34294a38b4]
/usr/local/mpi/lib/libmpi.so.12(MPI_Finalize+0xb8) [0x7f34293a4b28]
/builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.rtl.mpi.so(+0xafa5) [0x7f342b800fa5]
/lib/x86_64-linux-gnu/libc.so.6(+0x43161) [0x7f342a849161]
/lib/x86_64-linux-gnu/libc.so.6(+0x4325a) [0x7f342a84925a]
/builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.rtl.mpi.so(+0xf3d9) [0x7f342b8053d9]
/builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.so.12(__tgt_register_lib+0xf9) [0x7f342b8673c9]
./ompcluster/main() [0x40ef3d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x78) [0x7f342a827b88]
./ompcluster/main() [0x402a5a]
Abort(1) on node 2: Internal error
I've looked at the mutex.c file and saw that the failure occurs while
destroying the global mutex used by multithreaded MPI. I would like to ask
whether there are any known scenarios or common causes for this problem,
to help me find out what could be happening at the end of the
execution.
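To illustrate the kind of scenario in question, here is a minimal sketch in plain pthreads. It is not MPICH's code: `global_lock`, `worker`, and `run_and_shutdown` are hypothetical names, and the lock is only a stand-in for a library-wide mutex. An EBUSY from pthread_mutex_destroy is typically reported when the mutex is still locked or referenced by another thread, so the sketch stops and joins every thread that touches the lock before destroying it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a library-wide lock; not MPICH's actual code. */
static pthread_mutex_t global_lock;
static atomic_int stop_requested;
static long shared_counter;

/* Worker: takes the global lock at least once, then repeats until asked to stop. */
static void *worker(void *arg)
{
    (void)arg;
    do {
        pthread_mutex_lock(&global_lock);
        shared_counter++;
        pthread_mutex_unlock(&global_lock);
    } while (!atomic_load(&stop_requested));
    return NULL;
}

/* Starts nthreads workers, asks them to stop, joins them, and only then
 * destroys the lock. Returns the result of pthread_mutex_destroy
 * (0 on success), or -1 on invalid input or failed initialization. */
int run_and_shutdown(int nthreads)
{
    pthread_t tids[16];

    if (nthreads < 1 || nthreads > 16)
        return -1;

    atomic_store(&stop_requested, 0);
    shared_counter = 0;
    if (pthread_mutex_init(&global_lock, NULL) != 0)
        return -1;

    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }

    /* Shutdown order matters: signal, join, then destroy. */
    atomic_store(&stop_requested, 1);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    /* No thread can hold or wait on the lock at this point. */
    return pthread_mutex_destroy(&global_lock);
}
```

Calling `run_and_shutdown(4)` should return 0 because no worker can still hold the lock when it is destroyed. Destroying the lock before the joins would be the analogue of finalizing while another thread is still inside the library.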
Here is the MPICH configuration in our container:
$ mpichversion
MPICH Version: 3.4.2
MPICH Release date: Wed May 26 15:51:40 CDT 2021
MPICH Device: ch4:ucx
MPICH configure: --prefix=/usr/local/mpi --disable-static --with-device=ch4:ucx --with-ucx=/usr/local/ucx
MPICH CC: gcc -O2
MPICH CXX: g++ -O2
MPICH F77: gfortran -O2
MPICH FC: gfortran -O2
MPICH Custom Information:
Regards, Pedro