[mpich-discuss] MPI_Waitany got abort signal.

Anatoly G anatolyrishon at gmail.com
Wed Nov 24 03:49:09 CST 2021


Hello Mpich,
I use mpich-3.1 on Ubuntu 14.
Every process except process 0 runs complicated logic.
Process 0 acts as a router: it communicates with an application and
broadcasts work to, and collects results from, the other processes.
During overnight runs, I sometimes see a single failure of the process with rank 0.

From process 0 I get the following output:
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 596: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 0
*** Error in `/export/home/fpd/versions/current_ver/third_party/MPI_Scheduler': double free or corruption (fasttop): 0x00007f27003f59a0 ***

Stack trace:
2d40 /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f2712ff2d40]
 gsignal + 57
 abort + 328
f394 /lib/x86_64-linux-gnu/libc.so.6(+0x73394) [0x7f271302f394]
b66e /lib/x86_64-linux-gnu/libc.so.6(+0x7f66e) [0x7f271303b66e]
 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() + 31
8259 /lib/x86_64-linux-gnu/libc.so.6(+0x3c259) [0x7f2712ff8259]
82a5 /lib/x86_64-linux-gnu/libc.so.6(+0x3c2a5) [0x7f2712ff82a5]
d049 /export/home/fpd/versions/current_ver/third_party/libMPIServices.so(+0x224049) [0x7f2714bdd049]
 MPID_Abort + 103
 MPIR_Assert_fail + 37
d598 /export/home/fpd/versions/current_ver/third_party/libMPIServices.so(+0x244598) [0x7f2714bfd598]
 MPID_nem_tcp_connpoll + 366
 MPIDI_CH3I_Progress + 1408
 MPI_Waitany + 1072
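
The raw frames in this trace carry module-relative offsets (the `+0x...` part), which can be mapped back to source lines. As a minimal sketch, assuming binutils' `addr2line` is installed and the libraries were built with debug information, the Python helper below extracts those offsets and prints the commands one could run. The names `parse_frames` and `addr2line_commands` are illustrative only and are not part of MPICH.

```python
import re

# Matches "<module>(+0x<offset>) [0x<address>]". The \s* tolerates the
# line wraps that mail archives often introduce inside a frame.
FRAME_RE = re.compile(r"(\S+)\(\+0x([0-9a-fA-F]+)\)\s*\[0x([0-9a-fA-F]+)\]")


def parse_frames(trace):
    """Return a (module, offset, address) tuple for each raw frame."""
    return [(m.group(1), int(m.group(2), 16), int(m.group(3), 16))
            for m in FRAME_RE.finditer(trace)]


def addr2line_commands(trace):
    """Build one addr2line command per frame (nothing is executed here)."""
    return ["addr2line -f -C -e {} {:#x}".format(module, offset)
            for module, offset, _ in parse_frames(trace)]


# Two frames copied from the trace in this message.
sample = """\
2d40 /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f2712ff2d40]
d049
/export/home/fpd/versions/current_ver/third_party/libMPIServices.so(+0x224049)
[0x7f2714bdd049]
"""

for cmd in addr2line_commands(sample):
    print(cmd)
```

Frames that already show a symbol name, such as `MPID_nem_tcp_connpoll + 366`, do not match the pattern and are skipped, since they are already resolved.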

From the rest of the processes I get:
terminate called after throwing an instance of 'int'
s/netmod/tcp/socksm.c at line 596: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 1
*** Error in `Scheduler': double free or corruption (fasttop): 0x0000000003e16d60 ***

Unfortunately, I can't reproduce this failure on a simplified system.
Even on the real system, the failure happens only about once a night.

Our memory monitor shows that the machine still has free memory.
Could you please advise what might cause this failure?

Regards,
Anatoly.