<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

span.EmailStyle18

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style>

</head>

<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">

<div class="WordSection1">

<p class="MsoNormal">Not sure, but it looks like it is unable to establish or hold a stable socket connection between nodes.<o:p></o:p></p>
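<p class="MsoNormal">One rough way to check this, offered as a sketch rather than a definitive diagnostic: from one node, repeatedly open TCP connections to a port on each of the other nodes and count the failures. The helper names <b>probe</b> and <b>probe_many</b>, the choice of port, and the assumption that some service is already listening on that port on every node are all illustrative; none of this is part of MPICH, Hydra, or LSF.<o:p></o:p></p>

<pre>
```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Try one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and connects, honouring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers refused, timed out, and name-resolution errors
        return False, time.monotonic() - start, str(exc)


def probe_many(hosts, port, attempts=3, timeout=5.0):
    """Probe each host several times; return {host: number_of_failed_attempts}."""
    failures = {}
    for host in hosts:
        failures[host] = sum(
            0 if probe(host, port, timeout)[0] else 1 for _ in range(attempts)
        )
    return failures
```
</pre>

<p class="MsoNormal">For example, probe_many(list_of_hosts, 22) would test a hypothetical host list against port 22, assuming sshd is listening there. Intermittent failures or long connect times between particular pairs of nodes would point at the network rather than at the MPI library.<o:p></o:p></p>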

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<div>

<div>

<p class="MsoNormal">-- <br>

Hui Zhou<o:p></o:p></p>

</div>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal" style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:12.0pt;margin-left:.5in">

<b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">Sendu Bala via discuss &lt;discuss@mpich.org&gt;<br>

<b>Date: </b>Wednesday, February 17, 2021 at 9:14 AM<br>

<b>To: </b>discuss@mpich.org &lt;discuss@mpich.org&gt;<br>

<b>Cc: </b>Sendu Bala &lt;sb10@sanger.ac.uk&gt;<br>

<b>Subject: </b>[mpich-discuss] Failure to do anything under LSF<o:p></o:p></span></p>

</div>

<div name="messageBodySection">

<div>

<p class="MsoNormal" style="margin-left:.5in">Hi,<br>

<br>

We had an mpi app running under LSF that worked fine tiled across 64 hosts.<br>

<br>

Since moving to a new platform (LSF, but inside an OpenStack cluster*), the app is unreliable when tiled across more than 2 hosts.<br>

The likelihood of failure increases with the number of hosts: when tiled across 16 hosts, it almost never works (though it occasionally does). It always works when using 16 cores of a single host.<br>

<br>

The symptoms of failure are that our app doesn’t really start up (it logs nothing), the -outfile-pattern output files don’t get created, and it kills itself after 5 minutes of apparently doing nothing. (When it works, the -outfile-pattern files are created almost immediately and the app produces output.)<br>

<br>

When failing, the mpirun process spawns a hydra_pmi_proxy process which spawns the app as well as 15 blaunch processes, which correspond to processes on the 15 other hosts, which have spawned a process for the app each.<br>

<br>

strace shows the mpirun and app processes doing nothing during the 5-minute wait, with the app processes blocked reading from a socket.<br>

<br>

After the 5 mins, the mpirun process exits with:<br>

strace -p 2526<br>

strace: Process 2526 attached<br>

restart_syscall(&lt;... resuming interrupted poll ...&gt;) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)<br>

--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=2498, si_uid=0} ---<br>

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2530, si_uid=0, si_status=255, si_utime=1, si_stime=0} ---<br>

fstat(1, {st_mode=S_IFREG|0640, st_size=0, ...}) = 0<br>

write(1, "[mpiexec@node-name] ", 21) = 21<br>

write(1, "Sending Ctrl-C to processes as r"..., 41) = 41<br>

write(1, "[mpiexec@node-name] ", 21) = 21<br>

write(1, "Press Ctrl-C again to force abor"..., 34) = 34<br>

write(4, "\1\0\0\0\2\0\0\0", 8) = 8<br>

rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)<br>

poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=9, events=POLLIN}, {fd=13, events=POLLIN}, {fd=12, events=POLLIN}, {fd=15, events=POLLIN}, {fd=14, events=POLLIN},

 {fd=17, events=POLLIN}, {fd=16, events=POLLIN}, {fd=19, events=POLLIN}, {fd=18, events=POLLIN}, {fd=21, events=POLLIN}, {fd=20, events=POLLIN}, {fd=23, events=POLLIN}, {fd=22, events=POLLIN}, {fd=25, events=POLLIN}, {fd=24, events=POLLIN}, {fd=27, events=POLLIN},

 {fd=26, events=POLLIN}, {fd=29, events=POLLIN}, {fd=28, events=POLLIN}, {fd=31, events=POLLIN}, {fd=30, events=POLLIN}, {fd=33, events=POLLIN}, {fd=32, events=POLLIN}, {fd=35, events=POLLIN}, {fd=34, events=POLLIN}, {fd=37, events=POLLIN}, ...], 47, -1) =

 6 ([{fd=3, revents=POLLIN}, {fd=15, revents=POLLIN}, {fd=19, revents=POLLIN}, {fd=21, revents=POLLIN}, {fd=25, revents=POLLIN}, {fd=31, revents=POLLIN}])<br>

read(3, "\1\0\0\0\2\0\0\0", 8) = 8<br>

write(6, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(48, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(46, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(38, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(43, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(40, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(45, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(42, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(47, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(50, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(44, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(41, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = 32<br>

write(-1, "\3\0\0\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\2\0\0\0", 32) = -1 EBADF (Bad file descriptor)<br>

write(2, "[mpiexec@node-name] ", 21) = 21<br>

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2541, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---<br>

write(2, "HYDU_sock_write (utils/sock/sock"..., 41) = 41<br>

write(2, "write error (Bad file descriptor"..., 34) = 34<br>

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2529, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---<br>

write(2, "[mpiexec@node-name] ", 21) = 21<br>

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2537, si_uid=0, si_status=255, si_utime=0, si_stime=1} ---<br>

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2528, si_uid=0, si_status=255, si_utime=0, si_stime=2} ---<br>

write(2, "HYD_pmcd_pmiserv_send_signal (pm"..., 60) = 60<br>

write(2, "unable to write data to proxy\n", 30) = 30<br>

write(2, "[mpiexec@node-name] ", 21) = 21<br>

write(2, "ui_cmd_cb (pm/pmiserv/pmiserv_pm"..., 42) = 42<br>

write(2, "unable to send signal downstream"..., 33) = 33<br>

write(2, "[mpiexec@node-name] ", 21) = 21<br>

write(2, "HYDT_dmxu_poll_wait_for_event (t"..., 61) = 61<br>

write(2, "callback returned error status\n", 31) = 31<br>

write(2, "[mpiexec@node-name] ", 21) = 21<br>

write(2, "HYD_pmci_wait_for_completion (pm"..., 62) = 62<br>

write(2, "error waiting for event\n", 24) = 24<br>

write(2, "[mpiexec@node-name] ", 21) = 21<br>

write(2, "main (ui/mpich/mpiexec.c:326): ", 31) = 31<br>

write(2, "process manager error waiting fo"..., 45) = 45<br>

exit_group(-1) = ?<br>

+++ exited with 255 +++<br>

<br>

The bsub -e file contains the corresponding messages:<br>

[mpiexec@node-name] HYDU_sock_write (utils/sock/sock.c:254): write error (Bad file descriptor)<br>

[mpiexec@node-name] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:176): unable to write data to proxy<br>

[mpiexec@node-name] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:42): unable to send signal downstream<br>

[mpiexec@node-name] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status<br>

[mpiexec@node-name] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:160): error waiting for event<br>

[mpiexec@node-name] main (ui/mpich/mpiexec.c:326): process manager error waiting for completion<br>

<br>

The issue isn’t specific to our app; I get the same symptoms with:<br>

<br>

bsub -q normal -o bsub.o -e bsub.e -R"span[ptile=1]" -n16 mpirun mpich/examples/cpi<br>

<br>

bsub.e is empty, and bsub.o has this:<br>

[mpiexec@node-name] Sending Ctrl-C to processes as requested<br>

[mpiexec@node-name] Press Ctrl-C again to force abort<br>

[…]<br>

Your job looked like:<br>

<br>

------------------------------------------------------------<br>

# LSBATCH: User input<br>

mpirun mpich/examples/cpi<br>

------------------------------------------------------------<br>

<br>

Exited with exit code 141.<br>

<br>

Resource usage summary:<br>

<br>

 CPU time : 0.49 sec.<br>

 Max Memory : 85 MB<br>

 Average Memory : 70.89 MB<br>

 Total Requested Memory : -<br>

 Delta Memory : -<br>

 Max Swap : -<br>

 Max Processes : 63<br>

 Max Threads : 80<br>

 Run time : 310 sec.<br>

 Turnaround time : 317 sec.<br>

<br>

(I did not initiate any Ctrl-C or similar myself.)<br>

<br>

This is with mpich-3.4.1 configured --with-device=ch4:ucx. It’s worth noting that I get exactly the same symptoms using the latest OpenMPI as well, so this is not an mpich-specific issue.<br>

<br>

What can I do to investigate further or try to resolve this, so it works reliably with 16 or ideally 64 hosts again?<br>

<br>

<br>

[*] I know very little about networking, but I’m told we have new nodes in the LSF cluster that use a software-defined network, but also old nodes that use a hardware network like our old system; limiting to the old nodes doesn’t help. But maybe there are some

 subtleties here I’ve overlooked.<o:p></o:p></p>

</div>

</div>

<div name="messageSignatureSection">

<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>

<div>

<p class="MsoNormal" style="margin-left:.5in">Cheers, <o:p></o:p></p>

<div>

<p class="MsoNormal" style="margin-left:.5in">Sendu<o:p></o:p></p>

</div>

</div>

</div>


</div>

</body>

</html>