[mpich-discuss] Seeking possible causes for an assertion error in socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...

kumar.tarun at siemens.com
Sat Jul 6 15:49:21 CDT 2024


Hi,
    We are hitting the following assertion:
Assertion failed in file ...nemesis/netmod/tcp/socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...
I looked up the assert in .../nemesis/netmod/tcp/socksm.c, and it reads:
MPIU_Assert(hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO ||
            hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO);
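
To spell out my reading of that check, here is a minimal standalone sketch. Only the two packet-type names come from the assert. The enum values, the `hdr_t` layout, the `PKT_OTHER` value, and the helper name `hdr_type_ok` are hypothetical stand-ins of mine, not MPICH code.

```c
/* Hypothetical stand-ins: the names echo the assert, but the numeric
   values and the header layout are assumptions, not copied from MPICH. */
typedef enum {
    PKT_ID_INFO = 1,     /* stands in for MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO */
    PKT_TMPVC_INFO = 2,  /* stands in for MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO */
    PKT_OTHER = 99       /* any value that is neither of the two above */
} pkt_type_t;

typedef struct {
    int pkt_type;  /* packet type field tested by the assert */
    int datalen;   /* assumed extra field, not used by the check */
} hdr_t;

/* Returns 1 if the header carries one of the two accepted packet types,
   0 otherwise. The assert fails exactly when this would return 0. */
int hdr_type_ok(const hdr_t *hdr)
{
    return hdr->pkt_type == PKT_ID_INFO ||
           hdr->pkt_type == PKT_TMPVC_INFO;
}
```

If this reading is right, the assertion fires whenever `hdr.pkt_type` holds anything other than those two values, so what I am really asking is how such a header could end up there.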

We have tried runs with 2 to 8 cores/partitions and the behaviour is the same each time. One process is aborted, and a message appears indicating that. It is usually process 0 that is aborted, but I have also seen other processes report the crash. We are using mpich-3.2.1.

What are the possible causes of this error? I have searched the forum, and none of the causes suggested there, such as the machine running out of memory, apply here. Any suggestions would be appreciated. Are there any debug, log, or trace options I can use with mpiexec to help find the root cause?

Regards
Tarun
