[mpich-discuss] [EXTERNAL] Re: Spawned process hanging in MPI_Finalize
Joachim Protze
protze at itc.rwth-aachen.de
Thu Mar 4 02:02:14 CST 2021
Hi Kurt,
As a mental model, we can view MPI_Comm_spawn and MPI_Intercomm_create as
collective constructors of an inter-communicator object. Every
collectively created MPI object also needs a collective destructor call
for full cleanup, and MPI_Comm_disconnect is that collective destructor.
MPI_Finalize implies MPI_Comm_disconnect for all communicators of the
process that have not yet been destroyed. From that perspective, it might
even be sufficient to call MPI_Comm_disconnect in the parent process and
let MPI_Finalize in the child process do the job.
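A minimal sketch of that pattern, assuming the child was started with
MPI_Comm_spawn (error checking omitted; "child_exe" and the process count
are placeholders):

#include <mpi.h>

/* Parent program */
int main(int argc, char **argv)
{
    MPI_Comm child_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("child_exe", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child_comm, MPI_ERRCODES_IGNORE);
    /* ... exchange messages with the child ... */
    MPI_Comm_disconnect(&child_comm);  /* collective destructor call */
    MPI_Finalize();
    return 0;
}

/* Child program */
int main(int argc, char **argv)
{
    MPI_Comm parent_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent_comm);
    /* ... exchange messages with the parent ... */
    MPI_Comm_disconnect(&parent_comm); /* matches the parent's disconnect */
    MPI_Finalize();                    /* no longer blocks on the parent */
    return 0;
}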
If MPI_Intercomm_create created an inter-communicator in a process you
want to stop early, all processes involved in that inter-communicator
need to call MPI_Comm_disconnect before the process can terminate.
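The same rule as a sketch (local_comm, remote_leader, and the tag value
are placeholders; all ranks in both groups execute this):

MPI_Comm intercomm;
/* collective over both local groups; the peer comm bridges the leaders */
MPI_Intercomm_create(local_comm, 0 /* local leader */,
                     MPI_COMM_WORLD, remote_leader,
                     42 /* tag */, &intercomm);
/* ... use the inter-communicator ... */
/* every rank in both groups must call this before any of them can
   terminate early */
MPI_Comm_disconnect(&intercomm);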
Best
Joachim
On 04.03.21 at 01:55, Mccall, Kurt E. (MSFC-EV41) wrote:
> Joachim,
>
> Thanks, that helped! Is it necessary to call MPI_Comm_disconnect on inter-communicators that are created by MPI_Intercomm_create?
>
> Kurt
>
> -----Original Message-----
> From: Joachim Protze <protze at itc.rwth-aachen.de>
> Sent: Wednesday, March 3, 2021 5:52 AM
> To: discuss at mpich.org
> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> Subject: [EXTERNAL] Re: [mpich-discuss] Spawned process hanging in MPI_Finalize
>
> Hi Kurt,
>
> did you call MPI_Comm_disconnect on all processes connected by the
> inter-communicator? The parent process also needs to disconnect from the
> inter-comm before MPI_Comm_disconnect can return.
>
> - Joachim
>
> On 03.03.21 at 01:54, Mccall, Kurt E. (MSFC-EV41) via discuss wrote:
>> I have a parent process that creates a child via MPI_Comm_spawn(). When the child decides it has to exit, it hangs in MPI_Finalize(). It does the same if it calls MPI_Comm_disconnect() before MPI_Finalize().
>>
>> Here is the stack trace in the child:
>>
>> (gdb) where
>> #0 0x00007fc6f2fedde0 in __poll_nocancel () from /usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libc.so.6
>> #1 0x00007fc6f4dc840e in MPID_nem_tcp_connpoll () at src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c:1819
>> #2 0x00007fc6f4db857e in MPID_nem_network_poll () at src/mpid/ch3/channels/nemesis/src/mpid_nem_network_poll.c:16
>> #3 0x00007fc6f4dafc43 in MPIDI_CH3I_Progress () at src/mpid/ch3/channels/nemesis/src/ch3_progress.c:1019
>> #4 0x00007fc6f4d5094d in MPIDI_CH3U_VC_WaitForClose () at src/mpid/ch3/src/ch3u_handle_connection.c:383
>> #5 0x00007fc6f4d94efa in MPID_Finalize () at src/mpid/ch3/src/mpid_finalize.c:110
>> #6 0x00007fc6f4c432ca in PMPI_Finalize () at src/mpi/init/finalize.c:260
>> #7 0x0000000000408a85 in needles::MpiWorker::finalize () at src/MpiWorker.cpp:470
>>
>> Maybe I have a communication that hasn't completed, or the child is waiting for the parent to call MPI_Finalize. I believe that you (Ken, Hui) told me that it shouldn't do the latter.
>>
>> Is there a way for the child to cleanly exit without hanging in MPI_Finalize? I tried calling MPI_Cancel() in the child on the only possible communication request that I knew of, but it didn't help.
>> It just occurred to me that I haven't tried calling MPI_Cancel on the requests in the parent...
>>
>> Thanks,
>> Kurt
>>
>>
>>
>
>
--
Dipl.-Inf. Joachim Protze
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
protze at itc.rwth-aachen.de
www.itc.rwth-aachen.de