[mpich-discuss] Intermittent MPICH error

Zhou, Hui zhouh at anl.gov
Mon Oct 5 18:15:34 CDT 2020


I believe it means that one of the node processes was terminated unexpectedly. We’ll need more clues to guess at the actual cause.

--
Hui Zhou


From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Monday, October 5, 2020 at 4:47 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] Intermittent MPICH error

I am using MPICH 3.3.2 on CentOS 3.10 with Torque 5.1.1.

I have never seen this error when running my MPI jobs as myself, but when other users run the same code, this error may or may not occur.   It seems to happen when the job is shutting down.

[mpiexec at n006.cluster.com] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec at n006.cluster.com] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec at n006.cluster.com] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec at n006.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at n006.cluster.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec at n006.cluster.com] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

Does anyone have a clue what these errors mean?   I hope this is enough info to go on.   If not, please let me know what else I should provide.

Thanks,
Kurt