[mpich-discuss] Seeking possible causes for an assertion error in socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...

kumar.tarun at siemens.com kumar.tarun at siemens.com
Wed Jul 17 15:38:26 CDT 2024


Thanks. I had further discussion on this on GitHub: https://github.com/pmodels/mpich/discussions/7052

I’m reposting my last response there as I need some answers.

We got confirmation from the customer that a security application was killing our application because it (and its executables) was not whitelisted.

I have a couple of questions:


  1.  Is the latest stable release, MPICH 4.2.2, robust against such incidents? Does a single-machine run still require TCP ports, or is it done using shared memory?
  2.  Is there anything we can do while configuring MPICH 3.2.1 (the version we currently use) to make the application more robust against such incidents? As I said, it's a single-machine run, so I'm still not sure why TCP ports are involved.
Thanks
Tarun


From: Raffenetti, Ken <raffenet at anl.gov>
Sent: Sunday, July 7, 2024 5:24 AM
To: discuss at mpich.org
Cc: Kumar, Tarun (DI SW ICS DVT RD QSCE) <kumar.tarun at siemens.com>
Subject: Re: [mpich-discuss] Seeking possible causes for an assertion error in socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...

Are you able to update the MPICH version? MPICH 3.2.1 was released in 2017 and is no longer actively supported/maintained.

Ken

From: "kumar.tarun--- via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Saturday, July 6, 2024 at 3:49 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "kumar.tarun at siemens.com" <kumar.tarun at siemens.com>
Subject: [mpich-discuss] Seeking possible causes for an assertion error in socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...

Hi,
    We are hitting the following assertion:
Assertion failed in file ...nemesis/netmod/tcp/socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || ...
I looked at the assert, and it looks like this in the file …/nemesis/netmod/tcp/socksm.c:
MPIU_Assert(hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO ||
hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO);
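
To illustrate the shape of this check, here is a minimal standalone sketch. It is not MPICH code: the enum values, the header struct layout, and the function names are all hypothetical stand-ins chosen for illustration. It only shows the logic visible in the assert, namely that a received header is accepted when its packet type is one of two values and the process aborts otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the real
 * MPICH definitions and the numeric values are made up. */
enum sketch_pkt_type {
    SKETCH_PKT_ID_INFO = 1,    /* stands in for MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO */
    SKETCH_PKT_TMPVC_INFO = 2, /* stands in for MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO */
    SKETCH_PKT_OTHER = 3       /* any type the check does not accept */
};

/* Assumed header layout; the real struct may differ. */
struct sketch_header {
    int32_t pkt_type;
    int32_t datalen;
};

/* Returns 1 if the header carries one of the two accepted types, else 0. */
static int sketch_header_is_expected(const struct sketch_header *hdr)
{
    return hdr->pkt_type == SKETCH_PKT_ID_INFO ||
           hdr->pkt_type == SKETCH_PKT_TMPVC_INFO;
}

/* Mirrors the shape of the failing check: aborts the process when the
 * header type is not one of the two accepted values. */
static void sketch_check_first_header(const struct sketch_header *hdr)
{
    assert(sketch_header_is_expected(hdr));
}
```

In this sketch, passing a header whose pkt_type is SKETCH_PKT_OTHER to sketch_check_first_header() aborts the process, which is the kind of failure message quoted above. What actually puts an unexpected value in the header in our runs is the open question.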

We have tried multiple core/partition counts, from 2 to 8, and the behaviour is the same. In each case a process is aborted and a message reports it. Mostly it is process 0 that is aborted, but I have seen other processes report the crash as well. We are using mpich-3.2.1. I'm trying to understand the possible causes of this error. I have searched the forum, and none of the causes suggested there, such as the machine running out of memory, apply here. Please suggest. Are there any debug/log/trace options I can use with mpiexec to root-cause this further?

Regards
Tarun


