[mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2
Joachim Jenke
jenke at itc.rwth-aachen.de
Wed Dec 13 11:16:32 CST 2023
If your code hangs in MPI_Finalize with certain communication
implementations, this sounds like an uncompleted communication. Are you
sure that you have no MPI communication ongoing when calling MPI_Finalize?
- Joachim
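
To check whether the hang reported in the quoted message depends on the
provider, the run can be wrapped in a time limit instead of waiting for a
manual Ctrl-C. This is only a minimal sketch: it assumes GNU coreutils
`timeout` is installed, `run_with_limit` is a hypothetical helper name, and
the 30-second limit in the usage lines is arbitrary.

```shell
# Run a command under a time limit and report whether it finished or hung.
# Assumption: GNU coreutils `timeout`, which exits with status 124 when the
# limit is reached. `run_with_limit` is a made-up name for illustration.
run_with_limit() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "HUNG after ${limit}s: $*"
    else
        echo "exited with status $status: $*"
    fi
    return "$status"
}

# Usage from the build area (command line taken from the report below):
# run_with_limit 30 ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
# run_with_limit 30 env FI_PROVIDER=tcp ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
```

If the first run reports a hang and the second exits normally, that matches
the observation that forcing FI_PROVIDER=tcp avoids the problem. It does not
show which call the processes are blocked in.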
On 13.12.23 at 17:45, Edric Ellis via discuss wrote:
> Ok, that's good to know, I'll stick with simply "ofi:tcp" for now.
>
> Thanks,
> Edric.
> ------------------------------------------------------------------------
> *From:* Zhou, Hui <zhouh at anl.gov>
> *Sent:* 13 December 2023 15:39
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Edric Ellis <eellis at mathworks.com>
> *Subject:* Re: Hang during MPI_Finalize using ch4:ofi:shm in mpich-4.1.2
> Hi Edric,
>
> I am not sure which part is hanging, but you don't need to enable
> "ofi:shm" (libfabric shm provider). The ch4 device comes with its own
> shared memory functionality.
>
> --
> Hui
> ------------------------------------------------------------------------
> *From:* Edric Ellis via discuss <discuss at mpich.org>
> *Sent:* Wednesday, December 13, 2023 7:05 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Edric Ellis <eellis at mathworks.com>
> *Subject:* [mpich-discuss] Hang during MPI_Finalize using ch4:ofi:shm in
> mpich-4.1.2
> I'm working on getting a build of mpich-4.1.2 ready to replace our old
> build of mpich-3.3.2. With older MPICH releases, we used the "nemesis"
> channel via ch3 to provide support for shared-memory configurations as
> well as TCP/IP. In ch4, I thought the nearest equivalent would be:
>
> --with-device=ch4:ofi:tcp,shm
>
> The "tcp" portion of this seems to work just fine, but "shm" hangs
> during (I think) MPI_Finalize, requiring a CTRL-C to kill it. For
> example, in the build area,
>
> $ ./src/pm/hydra/mpiexec.hydra -n 2 ./examples/cpi
> Process 0 of 2 is on uk-eellis-l
> Process 1 of 2 is on uk-eellis-l
> pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> wall clock time = 0.000019
> ^C[mpiexec at uk-eellis-l] Sending Ctrl-C to processes as requested
> [mpiexec at uk-eellis-l] Press Ctrl-C again to force abort
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 829015 RUNNING AT uk-eellis-l
> = EXIT CODE: 2
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Interrupt (signal 2)
> This typically refers to a problem with your application.
> Please see the FAQ page for debugging suggestions
>
> Things work fine if I force FI_PROVIDER=tcp. Am I missing something?
>
> Here's the configure line I'm using:
>
> $ ./configure --prefix <prefix> --with-device=ch4:ofi:tcp,shm \
>     --enable-shared --with-libfabric=embedded --enable-fortran --enable-efa=no
>
> This is running on a Debian 11 system, gcc 10.3.0.
>
> Cheers,
> Edric.
--
Dr. rer. nat. Joachim Jenke
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
www.itc.rwth-aachen.de