[mpich-discuss] Intermittent MPICH error

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Tue Oct 6 07:34:59 CDT 2020


Thanks, Hui. I’ll investigate further.

Kurt

From: Zhou, Hui <zhouh at anl.gov>
Sent: Monday, October 5, 2020 6:16 PM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] Re: [mpich-discuss] Intermittent MPICH error

I believe it means one of the node processes was terminated unexpectedly. We’ll need more clues to guess at the actual cause.

--
Hui Zhou


From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: discuss at mpich.org
Date: Monday, October 5, 2020 at 4:47 PM
To: discuss at mpich.org
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] Intermittent MPICH error

I am using MPICH 3.3.2 on CentOS 3.10 with Torque 5.1.1.

I have never seen this error when running my MPI jobs as myself, but when other users run the same code, it occurs intermittently. It seems to happen when the job is shutting down.

[mpiexec@n006.cluster.com] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec@n006.cluster.com] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec@n006.cluster.com] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec@n006.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@n006.cluster.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@n006.cluster.com] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
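
Each line of the trace above appears to have the form "[process@host] function (file:line): message", with the innermost failure printed first and main() last. The following is a minimal Python sketch for splitting such lines into fields and printing the call chain outermost-first, which can make it easier to compare traces from different users' runs. The names parse_hydra_trace and call_chain are hypothetical helpers, not part of MPICH, and the line format is inferred only from the six lines in this message.

```python
import re

# One diagnostic line: [proc@host] function (file:line): message
# Archived copies of the log show "@" rewritten as " at ", so both are accepted.
LINE_RE = re.compile(
    r"^\[(?P<proc>\S+?)(?:@| at )(?P<host>[^\]]+)\]\s+"
    r"(?P<func>\w+)\s+\((?P<file>[^:()]+):(?P<line>\d+)\):\s+(?P<msg>.*)$"
)

# The trace from this message, copied verbatim.
TRACE = """\
[mpiexec@n006.cluster.com] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec@n006.cluster.com] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec@n006.cluster.com] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec@n006.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@n006.cluster.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@n006.cluster.com] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
"""


def parse_hydra_trace(text):
    """Return one dict per recognized line, in the order the lines were printed."""
    frames = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            frame = match.groupdict()
            frame["line"] = int(frame["line"])
            frames.append(frame)
    return frames


def call_chain(frames):
    """Join function names outermost-first (the log lists the innermost first)."""
    return " -> ".join(frame["func"] for frame in reversed(frames))


if __name__ == "__main__":
    frames = parse_hydra_trace(TRACE)
    print(call_chain(frames))
    # The first printed line is the lowest-level failure in the chain.
    print("lowest-level failure:", frames[0]["func"], "-", frames[0]["msg"])
```

Read this way, the lowest-level failure is the write in HYDU_sock_write returning "Bad file descriptor" while mpiexec was trying to forward a signal to a proxy; the remaining lines are callers reporting that same failure upward.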

Does anyone know what these errors mean? I hope this is enough information to go on. If not, please let me know what else I should provide.

Thanks,
Kurt