[mpich-discuss] Failure to do anything under LSF

Zhou, Hui zhouh at anl.gov
Wed Feb 17 19:34:29 CST 2021


Not sure, but looks like it is not able to hold or establish stable socket connection between nodes.

--
Hui Zhou


From: Sendu Bala via discuss <discuss at mpich.org>
Date: Wednesday, February 17, 2021 at 9:14 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Sendu Bala <sb10 at sanger.ac.uk>
Subject: [mpich-discuss] Failure to do anything under LSF
Hi,

We had an mpi app running under LSF that worked fine tiled across 64 hosts.

Since moving to a new platform (LSF, but inside an OpenStack cluster*), the app is unreliable when tiled across more than 2 hosts.
The likelyhood of failure increases until when tiled across 16 hosts, it almost never works (but still can). It always works when using 16 cores of a single host.

The symptoms of failure are that our app doesn’t really start up (it logs nothing), the -outfile-pattern output files don’t get created, and it kills itself after 5mins of apparently doing nothing. (When it works, the -outfile-pattern files are created ~immediately and the app produces output.)

When failing, the mpirun process spawns a hydra_pmi_proxy process which spawns the app as well as 15 blaunch processes, which correspond to processes on the 15 other hosts, which have spawned a process for the app each.

strace says the mpirun and app processes are doing nothing during the 5mins wait, with the app processes stuck on reading from a socket.

After the 5 mins, the mpirun process exits with:
strace -p 2526
strace: Process 2526 attached
restart_syscall(<... resuming interrupted poll ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=2498, si_uid=0} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2530, si_uid=0, si_status=255, si_utime=1, si_stime=0} ---
fstat(1, {st_mode=S_IFREG|0640, st_size=0, ...}) = 0
write(1, "[mpiexec at node-name] ", 21) = 21
write(1, "Sending Ctrl-C to processes as r"..., 41) = 41
write(1, "[mpiexec at node-name] ", 21) = 21
write(1, "Press Ctrl-C again to force abor"..., 34) = 34
write(4, "\1\0\0\0\2\0\0\0", 8) = 8
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=9, events=POLLIN}, {fd=13, events=POLLIN}, {fd=12, events=POLLIN}, {fd=15, events=POLLIN}, {fd=14, events=POLLIN}, {fd=17, events=POLLIN}, {fd=16, events=POLLIN}, {fd=19, events=POLLIN}, {fd=18, events=POLLIN}, {fd=21, events=POLLIN}, {fd=20, events=POLLIN}, {fd=23, events=POLLIN}, {fd=22, events=POLLIN}, {fd=25, events=POLLIN}, {fd=24, events=POLLIN}, {fd=27, events=POLLIN}, {fd=26, events=POLLIN}, {fd=29, events=POLLIN}, {fd=28, events=POLLIN}, {fd=31, events=POLLIN}, {fd=30, events=POLLIN}, {fd=33, events=POLLIN}, {fd=32, events=POLLIN}, {fd=35, events=POLLIN}, {fd=34, events=POLLIN}, {fd=37, events=POLLIN}, ...], 47, -1) = 6 ([{fd=3, revents=POLLIN}, {fd=15, revents=POLLIN}, {fd=19, revents=POLLIN}, {fd=21, revents=POLLIN}, {fd=25, revents=POLLIN}, {fd=31, revents=POLLIN}])
read(3, "\1\0\0\0\2\0\0\0", 8) = 8
write(6, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(48, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(46, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(38, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(43, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(40, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(45, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(42, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(47, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(50, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(44, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(41, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(-1, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = -1 EBADF (Bad file descriptor)
write(2, "[mpiexec at node-name] ", 21) = 21
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2541, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---
write(2, "HYDU_sock_write (utils/sock/sock"..., 41) = 41
write(2, "write error (Bad file descriptor"..., 34) = 34
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2529, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---
write(2, "[mpiexec at node-name] ", 21) = 21
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2537, si_uid=0, si_status=255, si_utime=0, si_stime=1} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2528, si_uid=0, si_status=255, si_utime=0, si_stime=2} ---
write(2, "HYD_pmcd_pmiserv_send_signal (pm"..., 60) = 60
write(2, "unable to write data to proxy\n", 30) = 30
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "ui_cmd_cb (pm/pmiserv/pmiserv_pm"..., 42) = 42
write(2, "unable to send signal downstream"..., 33) = 33
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "HYDT_dmxu_poll_wait_for_event (t"..., 61) = 61
write(2, "callback returned error status\n", 31) = 31
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "HYD_pmci_wait_for_completion (pm"..., 62) = 62
write(2, "error waiting for event\n", 24) = 24
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "main (ui/mpich/mpiexec.c:326): ", 31) = 31
write(2, "process manager error waiting fo"..., 45) = 45
exit_group(-1) = ?
+++ exited with 255 +++

The bsub -e file says the corresponding:
[mpiexec at node-name] HYDU_sock_write (utils/sock/sock.c:254): write error (Bad file descriptor)
[mpiexec at node-name] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:176): unable to write data to proxy
[mpiexec at node-name] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:42): unable to send signal downstream
[mpiexec at node-name] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at node-name] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
[mpiexec at node-name] main (ui/mpich/mpiexec.c:326): process manager error waiting for completion

The issue isn’t specific to our app; I get the same symptoms with:

bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]” -n16 mpirun mpich/examples/cpi

bsub.e is empty, and bsub.o has this:
[mpiexec at node-name] Sending Ctrl-C to processes as requested
[mpiexec at node-name] Press Ctrl-C again to force abort
[…]
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun mpich/examples/cpi
------------------------------------------------------------

Exited with exit code 141.

Resource usage summary:

 CPU time : 0.49 sec.
 Max Memory : 85 MB
 Average Memory : 70.89 MB
 Total Requested Memory : -
 Delta Memory : -
 Max Swap : -
 Max Processes : 63
 Max Threads : 80
 Run time : 310 sec.
 Turnaround time : 317 sec.

(I did not initiate any Ctrl-C or similar myself.)

This is with mpich-3.4.1 configured --with-device=ch4:ucx. It’s worth noting I get the exact same symptoms using latest OpenMPI as well, so this is not an mpich-specific issue.

What can I do to investigate further or try to resolve this, so it works reliably with 16 or ideally 64 hosts again?


[*] I know very little about networking, but I’m told we have new nodes in the LSF cluster that use a software-defined network, but also old nodes that use a hardware network like our old system; limiting to the old nodes doesn’t help. But maybe there are some subtleties here I’ve overlooked.

Cheers,
Sendu
-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20210218/609ed79a/attachment.html>


More information about the discuss mailing list