[mpich-discuss] Fwd: mpiexec.hydra creates unexpected TCP socket.

Anatoly G anatolyrishon at gmail.com
Wed Jan 7 05:06:33 CST 2015

Hello, Wesley.
I think my previous mail was not clear enough.

The system has the following processes:

   - Application (Main) - this process executes:
   execvp with the parameters  mpiexec.hydra -genvall  -disable-auto-cleanup
   -f MpiConfigMachines.txt -launcher=ssh -n 3 MPI_Prog
   - MPI_Prog - this program performs the calculations. Its instances
   (processes) are created by mpiexec.hydra.

After some execution time, the Application (Main) calls the "abort()"
function and dies without sending SIGTERM to the mpiexec.hydra process.

I understand that this is a bug. It will be fixed by Application developers.

But after this improper Application termination, mpiexec.hydra's parent
becomes the init process. This is fine.

But then I see via netstat that mpiexec.hydra opens sockets to another
process (called Controller), which was not part of the MPI execution.

Is hydra trying to establish/restore a connection with its parent (the
killed Application) process?

I understand that this is an emergency, unintended use of mpiexec.hydra.

I was sure that mpiexec.hydra would not react to the Application's failure
and would behave exactly as if the Application process still existed.

Maybe you can explain this strange situation.



On Mon, Jan 5, 2015 at 7:09 PM, Wesley Bland <wbland at anl.gov> wrote:

> When you pass -disable-auto-cleanup on the command line to mpiexec,
> you’re telling Hydra not to clean up other processes when one process in
> your job fails. It’s assumed that those processes will either clean
> themselves up or complete successfully.
> It’s not clear to me what your program is trying to do that would be
> erroneous, but usually when a process crashes, it’s the result of an
> erroneous program rather than a bug in MPICH. I’m not saying that there are
> no bugs in MPICH, but we’d like to be able to narrow down where to look.
> Thanks,
> Wesley
> On Thu, Jan 1, 2015 at 6:35 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>  Dear MPICH.
>> I have additional information.
>> This "strange configuration" (hydra connected to a computer not on the
>> list) is the result of an unhandled Main process failure (similar to an
>> abort() call) without killing the child process (hydra).
>> Thus I can see the "init" process becoming the parent of the hydra process.
>> Can you please refer me to a document explaining hydra's behavior when the
>> parent process is dead (an emergency situation)?
>> I understand that this situation shouldn't happen and this bug will be
>> fixed, but I'm curious about the hydra logic.
>>  Regards,
>> Anatoly.
>> ---------- Forwarded message ----------
>> From: Anatoly G <anatolyrishon at gmail.com>
>> Date: Wed, Dec 24, 2014 at 1:00 PM
>> Subject: mpiexec.hydra creates unexpected TCP socket.
>> To: discuss at mpich.org
>> Dear MPICH.
>> I'm using mpich 3.1 (hydra+MPI).
>> I execute the main application (Main), which calls mpiexec.hydra in the
>> following way:
>>  mpiexec.hydra -genvall  -disable-auto-cleanup  -f MpiConfigMachines.txt
>> -launcher=ssh -n 3 MPI_Prog
>>  MpiConfigMachines.txt content:
>>  Where is a local host.
>> As a result I get:
>>    - Main + a single MPI_Prog process on the local computer
>>    - 2 MPI_Prog processes on the remote one.
>> The Main application establishes a TCP socket with the local MPI_Prog.
>>  The Main application establishes a TCP socket with a controller on another
>> computer, which is not included in the MpiConfigMachines.txt file.
>>  After some execution time (hours, sometimes days) I see via netstat
>> that a new connection is created between mpiexec.hydra and the controller.
>>  Before executing mpiexec.hydra I set the environment variable:
>> setenv MPIEXEC_PORT_RANGE 50010:65535
>> According to the manual, this variable limits hydra's ports to
>> [50010:65535].
>>  I see that hydra uses these ports with MPI_Prog, but the connection with
>> the controller is made on port 701 (on the controller computer).
>>  The Controller program is a server; it can only accept connections.
>>  Can you please advise how to deal with this problem?
>> How does hydra discover the controller's IP and establish a connection with it?
>>  Sincerely,
>> Anatoly.
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss