[mpich-discuss] Seeking possible causes for an assertion error in socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...

Raffenetti, Ken raffenet at anl.gov
Sun Jul 7 07:23:48 CDT 2024


Are you able to update the MPICH version? MPICH 3.2.1 was released in 2017 and is no longer actively supported/maintained.

Ken

From: "kumar.tarun--- via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Saturday, July 6, 2024 at 3:49 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "kumar.tarun at siemens.com" <kumar.tarun at siemens.com>
Subject: [mpich-discuss] Seeking possible causes for an assertion error in socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...

Hi,
We are hitting the following assertion:
Assertion failed in file ...nemesis/netmod/tcp/socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...
I looked at the assert; in …/nemesis/netmod/tcp/socksm.c it reads:
MPIU_Assert(hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO ||
            hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO);
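
To make the failing check concrete, here is a minimal C sketch of what the assertion's condition tests, assuming a simplified handshake header: the header read from the peer must carry one of the two handshake packet types, otherwise the process aborts. All names below (`sketch_hdr_t`, `SKETCH_PKT_*`, `first_header_is_valid`, `check_first_header_or_abort`) and the header layout are hypothetical stand-ins, not MPICH's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* HYPOTHETICAL packet types standing in for
 * MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO and MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO.
 * The numeric values are illustrative only. */
enum sketch_pkt_type {
    SKETCH_PKT_ID_INFO = 1,
    SKETCH_PKT_TMPVC_INFO = 2,
    SKETCH_PKT_OTHER = 3
};

/* HYPOTHETICAL header layout; MPICH's real struct differs. */
typedef struct {
    int32_t pkt_type;
    int32_t datalen;
} sketch_hdr_t;

/* Returns 1 if buf holds a complete header whose type is one of the two
 * handshake types the assertion accepts, 0 otherwise. Unlike the
 * assertion, this reports the problem instead of aborting. */
int first_header_is_valid(const unsigned char *buf, size_t len)
{
    sketch_hdr_t hdr;

    if (buf == NULL || len < sizeof hdr)
        return 0; /* short read: not a complete header */
    memcpy(&hdr, buf, sizeof hdr); /* copy out to avoid unaligned access */
    return hdr.pkt_type == SKETCH_PKT_ID_INFO ||
           hdr.pkt_type == SKETCH_PKT_TMPVC_INFO;
}

/* Mirrors the effect of the MPIU_Assert line, using standard assert()
 * as a stand-in: any other packet type terminates the process. */
void check_first_header_or_abort(const unsigned char *buf, size_t len)
{
    (void)buf; /* silence unused warnings when NDEBUG is defined */
    (void)len;
    assert(first_header_is_valid(buf, len));
}
```

Under this reading, any peer whose bytes are not a handshake header (for example, a non-MPI client sending `"GET / HT"` to the listening port) would fail the check. Whether that is what happens in this report is an assumption; the report itself does not establish the cause.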

We have tried from 2 to 8 cores/partitions, and the behaviour is the same. A process is also aborted, and a message appears saying so. Usually it is process 0 that aborts, but I have seen other processes report the crash as well. We are using mpich-3.2.1. I'm trying to understand the possible causes of this error. I have searched the forum, and none of the causes mentioned there (such as the machine running out of memory) apply here. Any suggestions? Are there any debug/log/trace options I can use with mpiexec to root-cause this further?

Regards
Tarun
