[mpich-discuss] Intermittent MPICH error

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Mon Oct 5 16:47:07 CDT 2020


I am using MPICH 3.3.2 with Torque 5.1.1 on CentOS 3.10.

I have never seen this error when running my MPI jobs as myself, but when other users run the same code, it occurs intermittently. It seems to happen when the job is shutting down.

[mpiexec@n006.cluster.com] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec@n006.cluster.com] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec@n006.cluster.com] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec@n006.cluster.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@n006.cluster.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@n006.cluster.com] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
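
For context, "Bad file descriptor" in the first line is the standard message for errno EBADF, which a write reports when the descriptor it is given is not open for writing. The sketch below is only an illustration of that errno in isolation, assuming a Linux/POSIX system; the function name write_to_closed_fd is hypothetical and nothing here reflects how Hydra manages its sockets.

```python
import errno
import os


def write_to_closed_fd() -> OSError:
    """Write to a pipe end that was already closed and return the error raised."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)  # write_fd no longer refers to an open descriptor
    try:
        os.write(write_fd, b"signal")  # attempt to use the stale descriptor
    except OSError as exc:
        return exc
    finally:
        os.close(read_fd)  # release the remaining end of the pipe
    raise RuntimeError("write to a closed descriptor unexpectedly succeeded")


if __name__ == "__main__":
    err = write_to_closed_fd()
    # Print the symbolic errno name and the message the C library attaches to it.
    print(errno.errorcode[err.errno], "-", os.strerror(err.errno))
```

The log itself does not show why mpiexec would be holding such a descriptor while sending a signal to its proxy at shutdown.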

Does anyone know what these errors mean? I hope this is enough information to go on; if not, please let me know what else I should provide.

Thanks,
Kurt