[mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2

Edric Ellis eellis at mathworks.com
Wed Dec 13 10:45:40 CST 2023


OK, that's good to know. I'll stick with plain "ofi:tcp" for now.

Thanks,
Edric.
________________________________
From: Zhou, Hui <zhouh at anl.gov>
Sent: 13 December 2023 15:39
To: discuss at mpich.org <discuss at mpich.org>
Cc: Edric Ellis <eellis at mathworks.com>
Subject: Re: Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2

Hi Edric,

I am not sure which part is hanging, but you don't need to enable ofi:shm (the libfabric shm provider). The ch4 device comes with its own shared memory functionality.
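
As a sketch of what that suggestion would presumably look like in practice: the configure line from the original report with only the --with-device value changed. It assumes all other options stay as reported, and <prefix> remains the same placeholder used there.

```shell
# Sketch only: same options as the reported configure line, minus the
# libfabric "shm" provider. Intra-node traffic is then left to ch4's
# built-in shared memory support. <prefix> is a placeholder, not a real path.
./configure --prefix <prefix> --with-device=ch4:ofi:tcp --enable-shared --with-libfabric=embedded --enable-fortran --enable-efa=no
```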

--
Hui
________________________________
From: Edric Ellis via discuss <discuss at mpich.org>
Sent: Wednesday, December 13, 2023 7:05 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Edric Ellis <eellis at mathworks.com>
Subject: [mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2

I'm working on getting a build of mpich-4.1.2 ready to replace our old build of mpich-3.3.2. With older MPICH releases, we used the "nemesis" channel via ch3 to provide support for shared-memory configurations as well as TCP/IP. In ch4, I thought the nearest equivalent would be:

--with-device=ch4:ofi:tcp,shm

The "tcp" portion of this seems to work just fine, but "shm" hangs during (I think) MPI_Finalize, requiring a CTRL-C to kill it. For example, in the build area,

$ ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
Process 0 of 2 is on uk-eellis-l
Process 1 of 2 is on uk-eellis-l
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000019
^C[mpiexec at uk-eellis-l] Sending Ctrl-C to processes as requested
[mpiexec at uk-eellis-l] Press Ctrl-C again to force abort

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 829015 RUNNING AT uk-eellis-l
=   EXIT CODE: 2
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Things work fine if I force FI_PROVIDER=tcp. Am I missing something?
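
To compare the two cases without a manual Ctrl-C, something like the following could be used. This is a minimal sketch that assumes the GNU coreutils "timeout" command is available; the helper name run_with_limit and the 30-second limit are illustrative and not part of MPICH or hydra.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit and report whether
# it exited on its own or had to be stopped (GNU timeout exits with 124
# when the limit is reached).
run_with_limit() {
    limit="$1"; shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG: no exit within ${limit}s: $*"
    else
        echo "EXIT $status: $*"
    fi
    return "$status"
}

# Intended use from the build area (commented out so the block runs anywhere):
# run_with_limit 30 ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
# run_with_limit 30 env FI_PROVIDER=tcp ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
```

The second invocation goes through "env" so that FI_PROVIDER is set only for that run, which keeps the default-provider and forced-tcp cases separate.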

Here's the configure line I'm using:

$ ./configure --prefix <prefix> --with-device=ch4:ofi:tcp,shm --enable-shared --with-libfabric=embedded --enable-fortran --enable-efa=no

This is running on a Debian 11 system, gcc 10.3.0.

Cheers,
Edric.
