<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr">Hello, Wesley.<div>I think my previous mail was not clear enough.</div><div><br></div><div><div style="color:rgb(0,0,0);font-size:12.6666669845581px">The system has the following processes:</div><div style=""><ul style=""><li style=""><font color="#000000"><span style="font-size:12.6666669845581px">Application (Main) - this process executes:<br><span class="" id=":2te.1" tabindex="-1">execvp</span> with parameters <span style="font-size:12.6666669845581px"><span class="" id=":2te.2" tabindex="-1">mpiexec</span>.hydra -<span class="" id=":2te.3" tabindex="-1">genvall</span> -disable-auto-cleanup <br>-f <span class="" id=":2te.4" tabindex="-1">MpiConfigMachines</span>.<span class="" id=":2te.5" tabindex="-1">txt</span> -launcher=ssh -n 3 <span class="" id=":2te.6" tabindex="-1">MPI</span>_<span class="" id=":2te.7" tabindex="-1">Prog</span></span></span></font></li><li style=""><font color="#000000"><span style="font-size:12.6666669845581px"> <span class="" id=":2te.8" tabindex="-1">MPI</span>_<span class="" id=":2te.9" tabindex="-1">Prog</span> - this program performs calculations. Its instances (processes) are created by <span class="" id=":2te.10" tabindex="-1">mpiexec</span>.hydra.</span></font></li></ul><font color="#000000"><span style="font-size:12.6666669845581px">After some </span></font><span style="color:rgb(0,0,0);font-size:12.6666669845581px">execution </span><span style="font-size:12.6666669845581px;color:rgb(0,0,0)">time, the code of </span><span style="color:rgb(0,0,0);font-size:12.6666669845581px">Application (Main) calls the "abort()" function and fails without sending SIGTERM to the <span class="" id=":2te.11" tabindex="-1">mpiexec</span>.hydra process. </span></div><div style=""><span style="font-size:12.6666669845581px;color:rgb(0,0,0)"><br></span></div><div style=""><span style="color:rgb(0,0,0);font-size:12.6666669845581px">I understand that this is a bug. 
It will be fixed by the Application developers.</span><br></div><div style=""><span style="color:rgb(0,0,0);font-size:12.6666669845581px"><br></span></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px">But after the abnormal Application termination, <span class="" id=":2te.12" tabindex="-1">mpiexec</span>.hydra's parent becomes the <span class="" id=":2te.13" tabindex="-1">init</span> process. This is <span class="" id=":2te.14" tabindex="-1">ok</span>.</span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px"><br></span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px">But then I see via <span class="" id=":2te.15" tabindex="-1">netstat</span> that <span class="" id=":2te.16" tabindex="-1">mpiexec</span>.hydra opens sockets to another process (called Controller), which was not part of the <span class="" id=":2te.17" tabindex="-1">MPI</span> execution.</span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px"><br></span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px">Is hydra trying to establish/restore a connection with its parent (the killed Application) process? 
</span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px"><br></span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px">I understand that this is an emergency situation and unexpected <span class="" id=":2te.18" tabindex="-1">mpiexec</span>.hydra usage.</span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px"><br></span></font></div><div style=""><font color="#000000"><span style="font-size:12.6666669845581px">I was sure that <span class="" id=":2te.19" tabindex="-1">mpiexec</span>.hydra would not react to the Application failure and would behave exactly as if the Application process still existed.</span></font></div><div style=""><span style="color:rgb(0,0,0);font-size:12.6666669845581px"><br></span></div><div style=""><span style="color:rgb(0,0,0);font-size:12.6666669845581px">Maybe you can explain this strange situation?</span></div><div style=""><br></div><div style="">Regards,</div><div style="color:rgb(0,0,0);font-size:12.6666669845581px"><p class="MsoNormal"><span class="" id=":2te.20" tabindex="-1">Anatoly</span>.</p></div></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jan 5, 2015 at 7:09 PM, Wesley Bland <span dir="ltr"><<a href="mailto:wbland@anl.gov" target="_blank">wbland@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><p style="margin:1.2em 0px!important">When you pass <code style="font-size:0.85em;font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;padding:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);background-color:rgb(248,248,248);border-top-left-radius:3px;border-top-right-radius:3px;border-bottom-right-radius:3px;border-bottom-left-radius:3px;display:inline">-disable-auto-cleanup</code> on the command line to mpiexec, you’re telling Hydra not to clean up other 
processes when one process in your job fails. It’s assumed that those processes will either clean themselves up or complete successfully.</p>
<p style="margin:1.2em 0px!important">It’s not clear to me what your program is trying to do that would be erroneous, but usually when a process crashes, it’s the result of an erroneous program rather than a bug in MPICH. I’m not saying that there are no bugs in MPICH, but we’d like to be able to narrow down where to look.</p>
<p style="margin:1.2em 0px!important">Thanks,<br>Wesley</p>
</div></div><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Thu, Jan 1, 2015 at 6:35 AM, Anatoly G <span dir="ltr"><<a href="mailto:anatolyrishon@gmail.com" target="_blank">anatolyrishon@gmail.com</a>></span> wrote:<br></span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div dir="ltr"><span class="">
<div>Dear <span>MPICH</span>.</div>
<div>I have some additional information.</div>
<div>This "strange configuration" (hydra connected to a computer that is not in the list) is the result of an
<span>unhandled</span> Main process failure (similar to an abort() call) that did not kill the child process (hydra). </div>
<div>Thus I can see that the "<span>init</span>" process becomes the parent of the hydra process. </div>
<div>Can you please refer me to a document explaining hydra's behavior when its parent process is dead (an emergency situation)?</div>
<div>I understand that this situation shouldn't happen and this bug will be fixed, but I'm curious about the hydra logic.</div>
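<div>The situation described here (the launcher dies and its child is re-parented to init) can be sketched from the launcher's side. The following is a minimal Python sketch, assuming a Linux host, of one way a launcher could ask the kernel to send SIGTERM to its child when the launcher itself dies, using prctl(PR_SET_PDEATHSIG). The helper names build_mpiexec_argv and launch are hypothetical, the original Main uses execvp rather than Python, and the sketch does not establish how hydra reacts to that signal.</div>
<pre>
```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # constant from the Linux header linux/prctl.h


def _request_sigterm_on_parent_death():
    """Runs in the child between fork and exec (Linux only)."""
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM)) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def build_mpiexec_argv(machine_file, nprocs, program):
    """Rebuild the command line quoted in this thread as an argv list."""
    return ["mpiexec.hydra", "-genvall", "-disable-auto-cleanup",
            "-f", machine_file, "-launcher=ssh", "-n", str(nprocs), program]


def launch(argv):
    """Start argv so that it receives SIGTERM if this launcher process dies."""
    return subprocess.Popen(argv, preexec_fn=_request_sigterm_on_parent_death)


# Only the argv is printed here; launch() is not called in this sketch.
print(build_mpiexec_argv("MpiConfigMachines.txt", 3, "MPI_Prog"))
```
</pre>
<div>Calling launch(build_mpiexec_argv("MpiConfigMachines.txt", 3, "MPI_Prog")) would start the command with the death signal armed. The request is made in the child, so it survives the exec of a normal (non-setuid) binary.</div>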
<div><br>
</div>
<div>Regards,</div>
<div><span>Anatoly</span>.</div>
<br>
</span><div class="gmail_quote"><span class="">---------- Forwarded message ----------<br>
From: <b class="gmail_sendername"><span>Anatoly</span> G</b>
<span dir="ltr"><<span>anatolyrishon</span>@<a href="http://gmail.com" target="_blank">gmail.com</a>></span><br>
Date: Wed, Dec 24, 2014 at 1:00 PM<br>
Subject: <span>mpiexec</span>.hydra creates <span>
unexpectable</span> <span>TCP</span> socket.<br>
To: discuss@<span>mpich</span>.org<br>
<br>
<br>
</span><div><div class="h5"><div dir="ltr">Dear <span><span>MPICH</span></span>.
<div>I'm using <span><span>mpich</span></span> 3.1 (hydra+<span><span>MPI</span></span>).</div>
<div>I execute the main application (Main), which calls <span><span>mpiexec</span></span>.hydra in the following way:</div>
<div><br>
</div>
<div><span><span>mpiexec</span></span>.hydra -<span><span>genvall</span></span> -disable-auto-cleanup -f
<span><span>MpiConfigMachines</span></span>.<span><span>txt</span></span> -launcher=ssh -n 3
<span><span>MPI</span></span>_<span><span>Prog</span></span> <br>
</div>
<div><br>
</div>
<div><span><span>MpiConfigMachines</span></span>.<span><span>txt</span></span> content:<br>
</div>
<div>
<div>10.3.2.100:1</div>
<div>10.3.2.101:2</div>
</div>
<div><br>
</div>
<div>where 10.3.2.100 is the local host.</div>
<div>As a result I get:</div>
<div>
<ul>
<li>Main + a single <span><span>MPI</span></span>_<span><span>Prog</span></span> process on the local computer<br>
</li><li>2 <span><span>MPI</span></span>_<span><span>Prog</span></span> processes on the remote one.</li></ul>
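<div>The placement listed above follows from the host:count entries of the machine file combined with -n 3. A minimal Python sketch of that arithmetic, assuming processes fill the listed slots in file order (the helper names parse_machine_file and place_ranks are hypothetical and not part of hydra):</div>
<pre>
```python
def parse_machine_file(text):
    """Parse 'host:count' lines into (host, slots) pairs; count defaults to 1."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        host, _, count = line.partition(":")
        entries.append((host, int(count) if count else 1))
    return entries


def place_ranks(entries, nprocs):
    """Count processes per host, filling slots in file order (assumed policy)."""
    slots = [host for host, count in entries for _ in range(count)]
    placement = {}
    for rank in range(nprocs):
        host = slots[rank % len(slots)]  # wrap around if nprocs exceeds the slots
        placement[host] = placement.get(host, 0) + 1
    return placement


machines = "10.3.2.100:1\n10.3.2.101:2\n"
print(place_ranks(parse_machine_file(machines), 3))
# prints {'10.3.2.100': 1, '10.3.2.101': 2}
```
</pre>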
<div>The Main application establishes a <span><span>TCP</span></span> socket with the local
<span><span>MPI</span></span>_<span><span>Prog</span></span>.</div>
</div>
<div>The Main application also establishes a <span><span>TCP</span></span> socket with a controller on another computer, 10.3.2.170, which is not included in the
<span><span>MpiConfigMachines</span></span>.<span><span>txt</span></span> file.</div>
<div><br>
</div>
<div>After running for some time (hours, sometimes days), I see via <span><span>netstat</span></span> that a new connection has been created between
<span><span>mpiexec</span></span>.hydra and the controller. </div>
<div><br>
</div>
<div>Before executing <span><span>mpiexec</span></span>.hydra I set the environment variable:</div>
<div>
<p class="MsoNormal"><span><span>setenv</span></span>
<span><span>MPIEXEC</span></span>_PORT_RANGE 50010:65535</p>
<p class="MsoNormal">According to the manual, this variable limits hydra's destination ports to [50010:65535].</p>
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">I see that hydra uses these ports with <span><span>MPI</span></span>_<span><span>Prog</span></span>, but the connection with the controller is made on port 701 (on the controller computer).</p>
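<p class="MsoNormal">The mismatch can be made explicit with a small check. This is a minimal Python sketch that only parses a low:high string in the form used for MPIEXEC_PORT_RANGE and tests a port against it. The helper names parse_port_range and port_in_range are hypothetical, and the sketch says nothing about which side of a connection hydra applies the range to.</p>
<pre>
```python
def parse_port_range(value):
    """Split a 'low:high' string into two integers."""
    low, high = value.split(":")
    return int(low), int(high)


def port_in_range(port, value):
    """Return True when port lies inside the inclusive low:high range."""
    low, high = parse_port_range(value)
    return port in range(low, high + 1)


port_range = "50010:65535"
for port in (50010, 701):
    print(port, port_in_range(port, port_range))
# prints:
# 50010 True
# 701 False
```
</pre>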
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">The controller program is a server. It can only accept connections.<br>
</p>
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">Can you please advise how to deal with this problem?</p>
<p class="MsoNormal">How does hydra learn the controller's <span><span>IP</span></span> and establish a connection with it?</p>
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">Sincerely,</p>
<p class="MsoNormal"><span><span>Anatoly</span></span>.</p>
</div>
<div><br>
</div>
</div>
</div></div></div>
<br>
</div>
</div>
</blockquote></div><br></div>
</blockquote></div><br></div>