[mpich-discuss] Error in mutex destruction at the end of MPI Program
protze at itc.rwth-aachen.de
Wed Dec 8 12:47:12 CST 2021
OpenMP has no explicit finalization call. The runtime detects library destruction and performs cleanup; this cleanup includes finalization of offloading devices.
If your OpenMP library supports omp_pause_resource_all (https://www.openmp.org/spec-html/5.0/openmpsu153.html#x190-9040003.2.44), you should call this function before the actual MPI_Finalize().
From: Raffenetti, Ken via discuss <discuss at mpich.org>
Sent: Wednesday, December 8, 2021 7:02:54 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Raffenetti, Ken <raffenet at anl.gov>
Subject: Re: [mpich-discuss] Error in mutex destruction at the end of MPI Program
We do not see this issue in our regular multithreaded testing. Is it possible the other thread is still using MPI while MPI is being finalized? I imagine that could lead to an error when destroying the mutex. It would also explain the nondeterministic nature of the error.
On 12/8/21, 7:26 AM, "Pedro Henrique Di Francia Rosso via discuss" <discuss at mpich.org> wrote:
I'm Pedro, and I work in a research group that studies the use of OpenMP in distributed systems with MPI as the communication layer.
In particular, we are working with multithreaded MPICH. There are two main "users" of MPI in our system, an Event System and a Fault Tolerance (FT) system, which work together in separate threads. These systems mainly issue asynchronous MPI messages whose requests are either freed or tested until completion. Everything normally works correctly.
Recently, we have occasionally been getting an assertion failure when calling MPI_Finalize() at the end of the program. Here is the error with part of the call stack (an important note: this error does not always happen; in fact, the application usually finishes correctly rather than asserting like this):
Error in system call pthread_mutex_destroy: Device or resource busy
Assertion failed in file src/mpi/init/mutex.c at line 91: err == 0
/builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.rtl.mpi.so(+0xafa5) [0x7f342b800fa5]
/builds/ompcluster/llvm-project/build/projects/openmp/libomptarget/libomptarget.rtl.mpi.so(+0xf3d9) [0x7f342b8053d9]
Abort(1) on node 2: Internal error
I've looked at the mutex.c file and saw that the failure occurs when destroying the global mutex used by multithreaded MPI. I would like to ask whether there are any known scenarios or common causes for this problem, to help me find out what could be happening at the end of the execution.
Here is the MPICH configuration in our container:
MPICH Version: 3.4.2
MPICH Release date: Wed May 26 15:51:40 CDT 2021
MPICH Device: ch4:ucx
MPICH configure: --prefix=/usr/local/mpi --disable-static --with-device=ch4:ucx --with-ucx=/usr/local/ucx
MPICH CC: gcc -O2
MPICH CXX: g++ -O2
MPICH F77: gfortran -O2
MPICH FC: gfortran -O2
MPICH Custom Information: