[mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2
Zhou, Hui
zhouh at anl.gov
Wed Dec 13 09:39:27 CST 2023
Hi Edric,
I am not sure which part is hanging, but you don't need to enable ofi:shm (the libfabric shm provider). The ch4 device comes with its own shared-memory support.
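As a rough sketch of what that could look like: this assumes the only change needed is dropping ",shm" from the device string, and that every other option from the original configure line below still applies. PREFIX is a hypothetical shell variable standing in for the install prefix, and the line has not been verified against mpich-4.1.2.

```shell
# Assumption: same options as the original configure line, minus ",shm".
# PREFIX is a placeholder for the install prefix (hypothetical variable).
./configure --prefix "$PREFIX" --with-device=ch4:ofi:tcp --enable-shared \
    --with-libfabric=embedded --enable-fortran --enable-efa=no
```

With a build like this, intra-node traffic would presumably go through ch4's own shared-memory path rather than the libfabric shm provider.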
--
Hui
________________________________
From: Edric Ellis via discuss <discuss at mpich.org>
Sent: Wednesday, December 13, 2023 7:05 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Edric Ellis <eellis at mathworks.com>
Subject: [mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2
I'm working on getting a build of mpich-4.1.2 ready to replace our old build of mpich-3.3.2. With older MPICH releases, we used the "nemesis" channel via ch3 to provide support for shared-memory configurations as well as TCP/IP. In ch4, I thought the nearest equivalent would be:
--with-device=ch4:ofi:tcp,shm
The "tcp" portion seems to work fine, but "shm" hangs during (I think) MPI_Finalize, and I have to press Ctrl-C to kill it. For example, in the build area:
$ ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
Process 0 of 2 is on uk-eellis-l
Process 1 of 2 is on uk-eellis-l
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000019
^C[mpiexec at uk-eellis-l] Sending Ctrl-C to processes as requested
[mpiexec at uk-eellis-l] Press Ctrl-C again to force abort
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 829015 RUNNING AT uk-eellis-l
= EXIT CODE: 2
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Things work fine if I force FI_PROVIDER=tcp. Am I missing something?
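For reference, a minimal sketch of that workaround, assuming it is run from the build area used in the transcript above. FI_PROVIDER is the libfabric environment variable that restricts which providers are considered. The guard around the run is only there so the snippet does nothing when the build tree is absent.

```shell
# Restrict libfabric to its tcp provider for this shell session.
export FI_PROVIDER=tcp

# Re-run the example from the build area (paths as in the transcript above).
if [ -x ./src/pm/hydra/mpiexec.hydra ]; then
    ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
fi
```

Unsetting the variable (unset FI_PROVIDER) returns to the default provider selection, which is the case that hangs here.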
Here's the configure line I'm using:
$ ./configure --prefix <prefix> --with-device=ch4:ofi:tcp,shm --enable-shared --with-libfabric=embedded --enable-fortran --enable-efa=no
This is running on a Debian 11 system, gcc 10.3.0.
Cheers,
Edric.