[mpich-discuss] MPI_Waitany got abort signal.

Zhou, Hui zhouh at anl.gov
Wed Nov 24 11:09:31 CST 2021


Hi Anatoly,

Without a simple reproducible case, it will be difficult to pin down the issue. To debug, I would start by adding print statements right above the assertion to see what is actually in pkt_type. But since mpich-3.1 is very old, can you try a newer release? The latest is mpich 4.0b1 -- https://www.mpich.org/downloads/
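
As a rough illustration of the kind of print I mean, here is a minimal standalone sketch. It is not the MPICH source: the enum values, the struct layout, and the names demo_pkt_type, demo_header and demo_check_pkt_type are all hypothetical stand-ins. The idea is only to log the raw value before the assertion fires, so that the log shows what actually arrived on the socket.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the two packet types the assertion accepts.
 * The real constants live in the MPICH source and are not reproduced here. */
enum demo_pkt_type {
    DEMO_PKT_ID_INFO = 1,
    DEMO_PKT_TMPVC_INFO = 2
};

/* Hypothetical header layout: only the field being checked. */
struct demo_header {
    int pkt_type;
};

/* Returns 1 if the packet type is one of the two expected values,
 * 0 otherwise. For an unexpected type, the raw value is written to
 * stderr in decimal and hex so it is visible before any abort. */
static int demo_check_pkt_type(const struct demo_header *hdr)
{
    if (hdr->pkt_type == DEMO_PKT_ID_INFO ||
        hdr->pkt_type == DEMO_PKT_TMPVC_INFO) {
        return 1;
    }
    fprintf(stderr, "unexpected pkt_type: %d (0x%x)\n",
            hdr->pkt_type, (unsigned int) hdr->pkt_type);
    fflush(stderr);
    return 0;
}
```

In the real code the equivalent fprintf would go directly above the failing assertion in socksm.c. A value that looks like random bytes would hint at memory corruption or at an unrelated peer connecting to the port, while a valid but different packet type would point to a protocol ordering problem.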

--
Hui Zhou

________________________________
From: Anatoly G via discuss <discuss at mpich.org>
Sent: Wednesday, November 24, 2021 3:49 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Anatoly G <anatolyrishon at gmail.com>
Subject: [mpich-discuss] MPI_Waitany got abort signal.

Hello Mpich,
I use mpich-3.1 on Ubuntu 14.
Each process has complicated logic except process 0.
Process 0 is used as a router to communicate with an application and broadcast/collect results from other processes.
During night runs, I sometimes see a single failure of the process with rank 0.

From process 0 I get the following output:
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 596: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 0
*** Error in `/export/home/fpd/versions/current_ver/third_party/MPI_Scheduler': double free or corruption (fasttop): 0x00007f27003f59a0 ***

Stack trace:
2d40 /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f2712ff2d40]
 gsignal + 57
 abort + 328
f394 /lib/x86_64-linux-gnu/libc.so.6(+0x73394) [0x7f271302f394]
b66e /lib/x86_64-linux-gnu/libc.so.6(+0x7f66e) [0x7f271303b66e]
 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() + 31
8259 /lib/x86_64-linux-gnu/libc.so.6(+0x3c259) [0x7f2712ff8259]
82a5 /lib/x86_64-linux-gnu/libc.so.6(+0x3c2a5) [0x7f2712ff82a5]
d049 /export/home/fpd/versions/current_ver/third_party/libMPIServices.so(+0x224049) [0x7f2714bdd049]
 MPID_Abort + 103
 MPIR_Assert_fail + 37
d598 /export/home/fpd/versions/current_ver/third_party/libMPIServices.so(+0x244598) [0x7f2714bfd598]
 MPID_nem_tcp_connpoll + 366
 MPIDI_CH3I_Progress + 1408
 MPI_Waitany + 1072

From the rest of the processes I get:
terminate called after throwing an instance of 'int'
s/netmod/tcp/socksm.c at line 596: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 1
*** Error in `Scheduler': double free or corruption (fasttop): 0x0000000003e16d60 ***

Unfortunately, I can't reproduce this failure on a simplified system.
Even on the real system, the failure happens at most once a night.

We have a memory monitor, which shows that there is free memory on the computer.
Can you please advise me what could be the reason for the failure?

Regards,
Anatoly.
