<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr">Dear Wesley.<div>I asked the developers who saw this strange configuration, and they are not sure about the <span class="" id=":5cm.11" tabindex="-1">MPI</span>_<span class="" id=":5cm.12" tabindex="-1">Prog</span> processes. These are "calculation processes" which may fail. I need to check myself whether any of the <span class="" id=":5cm.13" tabindex="-1">MPI</span>_<span class="" id=":5cm.14" tabindex="-1">Prog</span> processes failed.</div><div><br></div><div>I suppose that the <span class="" id=":5cm.15" tabindex="-1">MPI</span>_<span class="" id=":5cm.16" tabindex="-1">Prog</span> processes (created by <span class="" id=":5cm.17" tabindex="-1">mpiexec</span>.hydra) didn't fail, but that the Main application process, which created the <span class="" id=":5cm.18" tabindex="-1">mpiexec</span>.hydra process, did.</div><div><br></div><div>I know that my story looks strange. </div><div>I suppose there are no dependencies between the Main application and <span class="" id=":5cm.19" tabindex="-1">mpiexec</span>.hydra after the hydra process is created.</div><div><br></div><div>I'll try to gather more information and details.</div><div><br></div><div><br></div><div>Regards,</div><div><span class="" id=":5cm.20" tabindex="-1">Anatoly</span>.</div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 7, 2015 at 5:33 PM, Wesley Bland <span dir="ltr"><<a href="mailto:wbland@anl.gov" target="_blank">wbland@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Are all of the processes in your application aborting or just a subset?</div><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Wed, Jan 7, 2015 at 5:06 AM, Anatoly G <span dir="ltr"><<a href="mailto:anatolyrishon@gmail.com" target="_blank">anatolyrishon@gmail.com</a>></span> 
wrote:<br></span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div dir="ltr"><span class="">Hello, Wesley.
<div>I think my previous mail was not clear enough.</div>
<div><br>
</div>
</span><div><span class="">
<div style="color:rgb(0,0,0);font-size:12.6666669845581px">The system has the following processes:</div>
</span><div>
<ul>
<li><font color="#000000"><span style="font-size:12.6666669845581px"><span class="">Application (Main) - this process executes:<br>
</span><span>execvp</span> with parameters <span style="font-size:12.6666669845581px"><span>mpiexec</span>.hydra -<span>genvall</span> -disable-auto-cleanup
<br><span class=""><span>
-f <span>MpiConfigMachines</span>.<span>txt</span> -launcher=ssh -n 3
<span>MPI</span>_<span>Prog</span></span></span></span></span></font></li><span class=""><li><font color="#000000"><span style="font-size:12.6666669845581px"> <span>MPI</span>_<span>Prog</span> - this program performs calculations. Its instances (processes) are created
by <span>mpiexec</span>.hydra.</span></font></li></span></ul><span class="">
<font color="#000000"><span style="font-size:12.6666669845581px">After some </span></font><span style="color:rgb(0,0,0);font-size:12.6666669845581px">execution </span><span style="font-size:12.6666669845581px;color:rgb(0,0,0)">time, the code of </span><span style="color:rgb(0,0,0);font-size:12.6666669845581px">Application
(Main) calls the "abort()" function and terminates without sending SIGTERM to the <span>
mpiexec</span>.hydra process. </span></span></div><span class="">
<div><span style="font-size:12.6666669845581px;color:rgb(0,0,0)"><br>
</span></div>
<div><span style="color:rgb(0,0,0);font-size:12.6666669845581px">I understand that this is a bug. It will be fixed by Application developers.</span><br>
</div>
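<div><span style="color:rgb(0,0,0);font-size:12.6666669845581px">As a rough illustration of the missing cleanup step, here is a minimal sketch (not the Application's actual code) of a launcher that sends SIGTERM to its child on every exit path it controls. The names run_with_cleanup and PLACEHOLDER_CMD are hypothetical, and the placeholder command merely stands in for the real mpiexec.hydra command line.</span></div>
<pre>
```python
import signal
import subprocess
import sys


def run_with_cleanup(cmd, work):
    """Start cmd as a child process, run work(child), then always stop the child."""
    child = subprocess.Popen(cmd)
    try:
        work(child)
    finally:
        # Reached on normal return and on Python exceptions. It is NOT reached
        # if the launcher itself dies from abort() or SIGKILL.
        if child.poll() is None:
            child.send_signal(signal.SIGTERM)
        child.wait()
    return child.returncode


# Hypothetical stand-in for the real "mpiexec.hydra ..." command line.
PLACEHOLDER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]
```
</pre>
<div><span style="color:rgb(0,0,0);font-size:12.6666669845581px">This only covers failure paths that the launcher handles itself. A direct abort() skips the cleanup, so the fatal-error path would have to signal the child before aborting.</span></div>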
<div><span style="color:rgb(0,0,0);font-size:12.6666669845581px"><br>
</span></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px">But after this abnormal Application termination, the parent of
<span>mpiexec</span>.hydra becomes the <span>
init</span> process. This is <span>OK</span>.</span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px"><br>
</span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px">But then I see via
<span>netstat</span> that <span>
mpiexec</span>.hydra opens sockets to another process (called Controller), which is not part of the
<span>MPI</span> execution.</span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px"><br>
</span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px">Is hydra trying to establish or restore a connection with its parent process (the killed Application)? </span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px"><br>
</span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px">I understand that this is an emergency situation and an unexpected use of
<span>mpiexec</span>.hydra.</span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px"><br>
</span></font></div>
<div><font color="#000000"><span style="font-size:12.6666669845581px">I was sure that
<span>mpiexec</span>.hydra would not react to the Application failure and would behave exactly as if the Application process still existed.</span></font></div>
<div><span style="color:rgb(0,0,0);font-size:12.6666669845581px"><br>
</span></div>
<div><span style="color:rgb(0,0,0);font-size:12.6666669845581px">Maybe you can explain this strange situation.</span></div>
<div><br>
</div>
<div>Regards,</div>
<div style="color:rgb(0,0,0);font-size:12.6666669845581px">
<p class="MsoNormal"><span>Anatoly</span>.</p>
</div>
</span></div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote"><span class=""><div><div>On Mon, Jan 5, 2015 at 7:09 PM, Wesley Bland <span dir="ltr">
<<a href="mailto:wbland@anl.gov" target="_blank">wbland@anl.gov</a>></span> wrote:<br>
</div></div></span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div><div>
<div dir="ltr">
<div>
<p style="margin:1.2em 0px!important">When you pass <code style="font-size:0.85em;font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;padding:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);background-color:rgb(248,248,248);border-top-left-radius:3px;border-top-right-radius:3px;border-bottom-right-radius:3px;border-bottom-left-radius:3px;display:inline">
-disable-auto-cleanup</code> on the command line to mpiexec, you’re telling Hydra not to clean up other processes when one process in your job fails. It’s assumed that those processes will either clean themselves up or complete successfully.</p>
<p style="margin:1.2em 0px!important">It’s not clear to me what your program is trying to do that would be erroneous, but usually when a process crashes, it’s the result of an erroneous program rather than a bug in MPICH. I’m not saying that there are no bugs
in MPICH, but we’d like to be able to narrow down where to look.</p>
<p style="margin:1.2em 0px!important">Thanks,<br>
Wesley</p>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote"><span>On Thu, Jan 1, 2015 at 6:35 AM, Anatoly G
<span dir="ltr"><<a href="mailto:anatolyrishon@gmail.com" target="_blank">anatolyrishon@gmail.com</a>></span> wrote:<br>
</span>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div dir="ltr"><span>
<div>Dear <span>MPICH</span>.</div>
<div>I have some additional information.</div>
<div>This "strange configuration" (hydra connected to a computer that is not in the list) is the result of an
<span>unhandled</span> failure of the Main process (similar to an abort() call) that did not kill its child process (hydra). </div>
<div>Thus I can see that the "<span>init"</span> process becomes the parent of the hydra process. </div>
<div>Can you please refer me to a document explaining hydra's behavior when its parent process is dead (an emergency situation)?</div>
<div>I understand that this situation shouldn't happen and this bug will be fixed, but I'm curious about the hydra logic.</div>
<div><br>
</div>
<div>Regards,</div>
<div><span>Anatoly</span>.</div>
<br>
</span>
<div class="gmail_quote"><span>---------- Forwarded message ----------<br>
From: <b class="gmail_sendername"><span>Anatoly</span> G</b> <span dir="ltr"><<span>anatolyrishon</span>@<a href="http://gmail.com" target="_blank">gmail.com</a>></span><br>
Date: Wed, Dec 24, 2014 at 1:00 PM<br>
Subject: <span>mpiexec</span>.hydra creates <span>unexpectable</span> <span>TCP</span> socket.<br>
To: discuss@<span>mpich</span>.org<br>
<br>
<br>
</span>
<div>
<div>
<div dir="ltr">Dear <span><span>MPICH</span></span>.
<div>I'm using <span><span>mpich</span></span> 3.1 (hydra+<span><span>MPI</span></span>).</div>
<div>I run a main application (Main), which calls <span><span>mpiexec</span></span>.hydra in the following way:</div>
<div><br>
</div>
<div><span><span>mpiexec</span></span>.hydra -<span><span>genvall</span></span> -disable-auto-cleanup -f
<span><span>MpiConfigMachines</span></span>.<span><span>txt</span></span> -launcher=ssh -n 3
<span><span>MPI</span></span>_<span><span>Prog</span></span> <br>
</div>
<div><br>
</div>
<div><span><span>MpiConfigMachines</span></span>.<span><span>txt</span></span> content:<br>
</div>
<div>
<div>10.3.2.100:1</div>
<div>10.3.2.101:2</div>
</div>
<div><br>
</div>
<div>Here 10.3.2.100 is the local host.</div>
<div>As a result I get:</div>
<div>
<ul>
<li>Main plus a single <span><span>MPI</span></span>_<span><span>Prog</span></span> process on the local computer<br>
</li><li>2 <span><span>MPI</span></span>_<span><span>Prog</span></span> processes on the remote one.</li></ul>
<div>The Main application establishes a <span><span>TCP</span></span> socket with the local <span>
<span>MPI</span></span>_<span><span>Prog</span></span>.</div>
</div>
<div>The Main application also establishes a <span><span>TCP</span></span> socket with a controller on another computer, 10.3.2.170, which is not included in the
<span><span>MpiConfigMachines</span></span>.<span><span>txt</span></span> file.</div>
<div><br>
</div>
<div>After running for some time (hours, sometimes days), I see via <span><span>netstat</span></span> that a new connection has been created between
<span><span>mpiexec</span></span>.hydra and the controller. </div>
<div><br>
</div>
<div>Before executing <span><span>mpiexec</span></span>.hydra I set the environment variable:</div>
<div>
<p class="MsoNormal"><span><span>setenv</span></span> <span><span>MPIEXEC</span></span>_PORT_RANGE 50010:65535</p>
<p class="MsoNormal">According to the manual, this variable limits hydra's destination ports to [50010:65535].</p>
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">I see that hydra uses these ports with <span><span>MPI</span></span>_<span><span>Prog</span></span>, but the connection with the controller is made on port 701 (on the controller computer).</p>
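<p class="MsoNormal">The check behind this observation can be written down as a small sketch. It assumes only the low:high format of the variable shown above; the function names are illustrative and are not part of hydra.</p>
<pre>
```python
def parse_port_range(spec):
    """Parse a "low:high" string into a pair of integers."""
    low_text, high_text = spec.split(":")
    low, high = int(low_text), int(high_text)
    if low > high:
        raise ValueError(f"empty port range: {spec!r}")
    return low, high


def port_in_range(port, spec):
    """Return True if port lies inside the inclusive range described by spec."""
    low, high = parse_port_range(spec)
    return port in range(low, high + 1)
```
</pre>
<p class="MsoNormal">With the range 50010:65535, port 50010 is inside the range and port 701 is outside it, which is why the connection to the controller stands out.</p>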
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">The controller program is a server. It can only accept connections.<br>
</p>
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">Can you please advise how to deal with this problem?</p>
<p class="MsoNormal">How does hydra learn the controller's <span><span>IP</span></span> and establish a connection with it?</p>
<p class="MsoNormal"><br>
</p>
<p class="MsoNormal">Sincerely,</p>
<p class="MsoNormal"><span><span>Anatoly</span></span>.</p>
</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
<br></div></div></div></div><span class="">
_______________________________________________<br>
discuss mailing list <a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a><br>
To manage subscription options or unsubscribe:<br>
<a href="https://lists.mpich.org/mailman/listinfo/discuss" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>
</span></blockquote>
</div>
<br>
</div>
</div>
</blockquote></div><br></div>
</blockquote></div><br></div>