[mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2

Joachim Jenke jenke at itc.rwth-aachen.de
Wed Dec 13 11:16:32 CST 2023


If your code hangs in MPI_Finalize only with certain communication 
implementations, that suggests an uncompleted communication. Are you 
sure that no MPI communication is still in progress when you call 
MPI_Finalize?
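
One way to check whether the hang really depends on the provider is to 
run the same binary under each provider with a time limit. The following 
is only a sketch: the helper name run_with_limit is made up for 
illustration, and it assumes GNU coreutils timeout, which exits with 
status 124 when the limit expires.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit so that a hang
# shows up as a distinct result instead of needing a manual Ctrl-C.
# Assumes GNU coreutils `timeout` (exit status 124 means the limit was hit).
run_with_limit() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "hang: no exit within ${limit}s: $*"
    else
        echo "exit status ${rc}: $*"
    fi
    return "$rc"
}
```

For example, from the build area one might compare the two providers 
(60 seconds is an arbitrary limit; tcp and shm are the libfabric 
provider names mentioned in this thread):

    run_with_limit 60 env FI_PROVIDER=tcp ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
    run_with_limit 60 env FI_PROVIDER=shm ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi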

  - Joachim

On 13.12.23 at 17:45, Edric Ellis via discuss wrote:
> Ok, that's good to know, I'll stick with simply "ofi:tcp" for now.
> 
> Thanks,
> Edric.
> ------------------------------------------------------------------------
> *From:* Zhou, Hui <zhouh at anl.gov>
> *Sent:* 13 December 2023 15:39
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Edric Ellis <eellis at mathworks.com>
> *Subject:* Re: Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2
> Hi Edric,
> 
> I am not sure which part is hanging, but you don't need to enable 
> "ofi:shm" (libfabric shm provider). The ch4 device comes with its own 
> shared memory functionality.
> 
> -- 
> Hui
> ------------------------------------------------------------------------
> *From:* Edric Ellis via discuss <discuss at mpich.org>
> *Sent:* Wednesday, December 13, 2023 7:05 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Edric Ellis <eellis at mathworks.com>
> *Subject:* [mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in 
> mpich-4.1.2
> I'm working on getting a build of mpich-4.1.2 ready to replace our old 
> build of mpich-3.3.2. With older MPICH releases, we used the "nemesis" 
> channel via ch3 to provide support for shared-memory configurations as 
> well as TCP/IP. In ch4, I thought the nearest equivalent would be:
> 
> --with-device=ch4:ofi:tcp,shm
> 
> The "tcp" portion of this seems to work just fine, but "shm" hangs 
> during (I think) MPI_Finalize, requiring a CTRL-C to kill it. For 
> example, in the build area,
> 
> $ ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
> Process 0 of 2 is on uk-eellis-l
> Process 1 of 2 is on uk-eellis-l
> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> wall clock time = 0.000019
> ^C[mpiexec at uk-eellis-l] Sending Ctrl-C to processes as requested
> [mpiexec at uk-eellis-l] Press Ctrl-C again to force abort
> 
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 829015 RUNNING AT uk-eellis-l
> =   EXIT CODE: 2
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
> This typically refers to a problem with your application.
> Please see the FAQ page for debugging suggestions
> 
> Things work fine if I force FI_PROVIDER=tcp. Am I missing something?
> 
> Here's the configure line I'm using:
> 
> $ ./configure --prefix <prefix> --with-device=ch4:ofi:tcp,shm 
> --enable-shared --with-libfabric=embedded --enable-fortran --enable-efa=no
> 
> This is running on a Debian 11 system, gcc 10.3.0.
> 
> Cheers,
> Edric.

-- 
Dr. rer. nat. Joachim Jenke

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
www.itc.rwth-aachen.de
