[mpich-discuss] Failure to do anything under LSF [EXT]

Zhou, Hui zhouh at anl.gov
Thu Feb 18 08:11:39 CST 2021


There are two pieces to running MPI applications: one is the process manager, the other is the MPI library. The former launches your application; the latter runs inside your application. Both can output more debug information. Your description points to a problem in the process launcher. Try adding `-v` or `--debug` to `mpiexec` and it should show more details, in particular the port information used for the connections between the main process manager -- `mpiexec` -- and the launcher on each node -- `hydra_pmi_proxy`.
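
For example, something like this (a rough sketch only; the queue name and paths are placeholders taken from your reproduction below):

bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]" -n16 mpiexec -v mpich/examples/cpi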

You can control some of its behavior using environment variables. In particular, `MPIEXEC_PORT_RANGE` can be used to control the ports being used. More information is available in the man page of `mpiexec`.
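
For example (a sketch only; the range 10000:10100 is an arbitrary placeholder, and the format is typically `low:high` -- check the man page to confirm for your build):

export MPIEXEC_PORT_RANGE=10000:10100
mpiexec -n 16 mpich/examples/cpi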

--
Hui Zhou


From: Sendu Bala <sb10 at sanger.ac.uk>
Date: Thursday, February 18, 2021 at 3:26 AM
To: discuss at mpich.org <discuss at mpich.org>, Zhou, Hui <zhouh at anl.gov>
Subject: Re: Failure to do anything under LSF [EXT]
Is there a way to get mpich to output more debug information, or to get it to do more retries or similar?

Is it using certain ports for the sockets, and can those be changed?

When used with LSF, how does it get the host names? Does it only use $LSB_HOSTS? I have identified an issue with another app that is fixed by using IPs instead of hostnames, so I have a shim that re-sets LSB_HOSTS to IPs, but it doesn’t seem to help with mpich.
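
(For illustration, a minimal sketch of what such a shim might look like -- a hypothetical wrapper script run ahead of mpirun, using getent as just one way to resolve each name to an IP:)

#!/bin/sh
# Hypothetical wrapper: rewrite LSB_HOSTS so each hostname becomes its IP,
# then exec the wrapped command (e.g. mpirun ...).
ips=""
for h in $LSB_HOSTS; do
    ip=$(getent hosts "$h" | awk '{print $1; exit}')
    ips="$ips ${ip:-$h}"   # fall back to the original name if lookup fails
done
export LSB_HOSTS="${ips# }"
exec "$@"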

Cheers,
Sendu
On 18 Feb 2021, 01:34 +0000, Zhou, Hui <zhouh at anl.gov>, wrote:

Not sure, but it looks like it is not able to establish or hold a stable socket connection between nodes.

--
Hui Zhou


From: Sendu Bala via discuss <discuss at mpich.org>
Date: Wednesday, February 17, 2021 at 9:14 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Sendu Bala <sb10 at sanger.ac.uk>
Subject: [mpich-discuss] Failure to do anything under LSF
Hi,

We had an MPI app running under LSF that worked fine tiled across 64 hosts.

Since moving to a new platform (LSF, but inside an OpenStack cluster*), the app is unreliable when tiled across more than 2 hosts.
The likelihood of failure increases with the number of hosts; when tiled across 16 hosts it almost never works (though it occasionally still can). It always works when using 16 cores of a single host.

The symptoms of failure are that our app doesn’t really start up (it logs nothing), the -outfile-pattern output files don’t get created, and it kills itself after 5 minutes of apparently doing nothing. (When it works, the -outfile-pattern files are created ~immediately and the app produces output.)

When failing, the mpirun process spawns a hydra_pmi_proxy process, which spawns the app as well as 15 blaunch processes; these correspond to the 15 other hosts, each of which has spawned its own copy of the app.

strace says the mpirun and app processes are doing nothing during the 5-minute wait, with the app processes stuck reading from a socket.

After the 5 minutes, the mpirun process exits with:
strace -p 2526
strace: Process 2526 attached
restart_syscall(<... resuming interrupted poll ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=2498, si_uid=0} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2530, si_uid=0, si_status=255, si_utime=1, si_stime=0} ---
fstat(1, {st_mode=S_IFREG|0640, st_size=0, ...}) = 0
write(1, "[mpiexec at node-name] ", 21) = 21
write(1, "Sending Ctrl-C to processes as r"..., 41) = 41
write(1, "[mpiexec at node-name] ", 21) = 21
write(1, "Press Ctrl-C again to force abor"..., 34) = 34
write(4, "\1\0\0\0\2\0\0\0", 8) = 8
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=9, events=POLLIN}, {fd=13, events=POLLIN}, {fd=12, events=POLLIN}, {fd=15, events=POLLIN}, {fd=14, events=POLLIN}, {fd=17, events=POLLIN}, {fd=16, events=POLLIN}, {fd=19, events=POLLIN}, {fd=18, events=POLLIN}, {fd=21, events=POLLIN}, {fd=20, events=POLLIN}, {fd=23, events=POLLIN}, {fd=22, events=POLLIN}, {fd=25, events=POLLIN}, {fd=24, events=POLLIN}, {fd=27, events=POLLIN}, {fd=26, events=POLLIN}, {fd=29, events=POLLIN}, {fd=28, events=POLLIN}, {fd=31, events=POLLIN}, {fd=30, events=POLLIN}, {fd=33, events=POLLIN}, {fd=32, events=POLLIN}, {fd=35, events=POLLIN}, {fd=34, events=POLLIN}, {fd=37, events=POLLIN}, ...], 47, -1) = 6 ([{fd=3, revents=POLLIN}, {fd=15, revents=POLLIN}, {fd=19, revents=POLLIN}, {fd=21, revents=POLLIN}, {fd=25, revents=POLLIN}, {fd=31, revents=POLLIN}])
read(3, "\1\0\0\0\2\0\0\0", 8) = 8
write(6, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(48, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(46, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(38, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(43, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(40, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(45, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(42, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(47, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(50, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(44, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(41, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32
write(-1, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = -1 EBADF (Bad file descriptor)
write(2, "[mpiexec at node-name] ", 21) = 21
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2541, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---
write(2, "HYDU_sock_write (utils/sock/sock"..., 41) = 41
write(2, "write error (Bad file descriptor"..., 34) = 34
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2529, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---
write(2, "[mpiexec at node-name] ", 21) = 21
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2537, si_uid=0, si_status=255, si_utime=0, si_stime=1} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2528, si_uid=0, si_status=255, si_utime=0, si_stime=2} ---
write(2, "HYD_pmcd_pmiserv_send_signal (pm"..., 60) = 60
write(2, "unable to write data to proxy\n", 30) = 30
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "ui_cmd_cb (pm/pmiserv/pmiserv_pm"..., 42) = 42
write(2, "unable to send signal downstream"..., 33) = 33
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "HYDT_dmxu_poll_wait_for_event (t"..., 61) = 61
write(2, "callback returned error status\n", 31) = 31
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "HYD_pmci_wait_for_completion (pm"..., 62) = 62
write(2, "error waiting for event\n", 24) = 24
write(2, "[mpiexec at node-name] ", 21) = 21
write(2, "main (ui/mpich/mpiexec.c:326): ", 31) = 31
write(2, "process manager error waiting fo"..., 45) = 45
exit_group(-1) = ?
+++ exited with 255 +++

The bsub -e file contains the corresponding errors:
[mpiexec at node-name] HYDU_sock_write (utils/sock/sock.c:254): write error (Bad file descriptor)
[mpiexec at node-name] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:176): unable to write data to proxy
[mpiexec at node-name] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:42): unable to send signal downstream
[mpiexec at node-name] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at node-name] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
[mpiexec at node-name] main (ui/mpich/mpiexec.c:326): process manager error waiting for completion

The issue isn’t specific to our app; I get the same symptoms with:

bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]" -n16 mpirun mpich/examples/cpi

bsub.e is empty, and bsub.o has this:
[mpiexec at node-name] Sending Ctrl-C to processes as requested
[mpiexec at node-name] Press Ctrl-C again to force abort
[…]
Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun mpich/examples/cpi
------------------------------------------------------------

Exited with exit code 141.

Resource usage summary:

 CPU time : 0.49 sec.
 Max Memory : 85 MB
 Average Memory : 70.89 MB
 Total Requested Memory : -
 Delta Memory : -
 Max Swap : -
 Max Processes : 63
 Max Threads : 80
 Run time : 310 sec.
 Turnaround time : 317 sec.

(I did not initiate any Ctrl-C or similar myself.)

This is with mpich-3.4.1 configured with --with-device=ch4:ucx. It’s worth noting I get the exact same symptoms using the latest OpenMPI as well, so this is not an mpich-specific issue.

What can I do to investigate further or try to resolve this, so it works reliably with 16 or ideally 64 hosts again?


[*] I know very little about networking, but I’m told we have new nodes in the LSF cluster that use a software-defined network, as well as old nodes that use a hardware network like our old system’s; limiting the job to the old nodes doesn’t help. But maybe there are some subtleties here I’ve overlooked.

Cheers,
Sendu
-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

