[mpich-discuss] Issue in MPICH while submitting jobs through slurm in NONMEM application

Zhou, Hui zhouh at anl.gov
Thu Jun 15 20:20:38 CDT 2023


Hi Thomas,

The assertion error indicates packet corruption. It's not clear what the cause could be unless you can provide a reproducer. In any case, mpich-3.2.1 is quite old. My first suggestion would be to try a newer MPICH release and see if the error persists.

--
Hui Zhou
________________________________
From: Thomas Jayaseelan-External via discuss <discuss at mpich.org>
Sent: Thursday, June 15, 2023 9:58 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Thomas Jayaseelan-External <thomas.jayaseelan at regeneron.com>; Sundaresh Krishnasamy-External <sundaresh.krishnasam at regeneron.com>; Hariram Jayaram-External <hariram.jayaram at regeneron.com>
Subject: Re: [mpich-discuss] Issue in MPICH while submitting jobs through slurm in NONMEM application


Hi All,



It would be helpful if you could assist with the issue below, which we are facing in our application that uses MPI.



Best Regards,

Thomas



From: Thomas Jayaseelan-External
Sent: Thursday, June 15, 2023 10:58 AM
To: discuss at mpich.org
Cc: Sundaresh Krishnasamy-External <sundaresh.krishnasam at regeneron.com>; Hariram Jayaram-External <hariram.jayaram at regeneron.com>
Subject: Issue in MPICH while submitting jobs through slurm in NONMEM application



Hi Team,



This is Thomas. I am part of the HPCOPs team at Regeneron Pharmaceuticals. We build and support the HPC cluster infrastructure for the business according to its requirements.



I am reaching out for help with an issue that we are currently facing with MPI. It would be great if you could help us find a solution.



NONMEM is a CLI-based application through which users submit jobs. When a user runs a job with a larger number of cores, the job runs for 10 to 15 hours and then intermittently stops. The output file contains the error messages below.



Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO

internal ABORT - process 1231

Done with nonmem execution



Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO

internal ABORT - process 163

Done with nonmem execution



Details:

NONMEM application version - NM750

Slurm version - 21.08.6

MPICH version - 3.2.1

OS - Amazon Linux 2



Please let me know if you need anything from my end.



Best Regards,

Thomas


