[mpich-discuss] Fwd: mpiexec.hydra creates unexpected TCP socket.

Wesley Bland wbland at anl.gov
Wed Jan 7 09:33:17 CST 2015


Are all of the processes in your application aborting or just a subset?

On Wed, Jan 7, 2015 at 5:06 AM, Anatoly G <anatolyrishon at gmail.com> wrote:

>  Hello, Wesley.
> I think my previous mail was not clear enough.
>
>  The system has following processes:
>
>    - Application (Main) - this process calls execvp with the parameters:
>    mpiexec.hydra -genvall -disable-auto-cleanup
>    -f MpiConfigMachines.txt -launcher=ssh -n 3 MPI_Prog
>    - MPI_Prog - this program performs the calculations. Its instances
>    (processes) are created by mpiexec.hydra.
>
> After some execution time, the code of Application (Main) calls the
> "abort()" function and fails without sending SIGTERM to the mpiexec.hydra
> process.
>
>  I understand that this is a bug. It will be fixed by Application
> developers.
>
>  But after this wrong Application termination, mpiexec.hydra's parent
> becomes the init process. This is ok.
>
>  But then I see via netstat that mpiexec.hydra opens sockets to another
> process (called Controller), which was not part of the MPI execution.
>
>  Is hydra trying to establish/restore a connection with its parent (the
> killed Application) process?
>
>  I understand that this is an emergency and an unexpected use of
> mpiexec.hydra.
>
>  I was sure that mpiexec.hydra would not react to the Application failure
> and would behave exactly as if the Application process still existed.
>
>  Maybe you can explain this strange situation.
>
>  Regards,
>
> Anatoly.
>
>
> On Mon, Jan 5, 2015 at 7:09 PM, Wesley Bland <wbland at anl.gov> wrote:
>
>>  When you pass -disable-auto-cleanup on the command line to mpiexec,
>> you’re telling Hydra not to clean up other processes when one process in
>> your job fails. It’s assumed that those processes will either clean
>> themselves up or complete successfully.
>>
>> It’s not clear to me what your program is trying to do that would be
>> erroneous, but usually when a process crashes, it’s the result of an
>> erroneous program rather than a bug in MPICH. I’m not saying that there’s
>> no bugs in MPICH, but we’d like to be able to narrow down where to look.
>>
>> Thanks,
>> Wesley
>> On Thu, Jan 1, 2015 at 6:35 AM, Anatoly G <anatolyrishon at gmail.com>
>> wrote:
>>
>>>  Dear MPICH.
>>> I have an additional information.
>>> This "strange configuration" (hydra connected to computer not from the
>>> list) is result of unhandled Main process fail (similar to abort()
>>> call) without killing children process (hydra).
>>> Thus I can see that the "init" process becomes the parent of the hydra
>>> process.
>>> Can you please point me to a document explaining hydra's behavior when its
>>> parent process is dead (an emergency situation)?
>>> I understand that this situation shouldn't happen and this bug will be
>>> fixed, but I'm curious about the hydra logic.
>>>
>>>  Regards,
>>> Anatoly.
>>>
>>>  ---------- Forwarded message ----------
>>> From: Anatoly G <anatolyrishon at gmail.com>
>>> Date: Wed, Dec 24, 2014 at 1:00 PM
>>> Subject: mpiexec.hydra creates unexpected TCP socket.
>>> To: discuss at mpich.org
>>>
>>>
>>>   Dear MPICH.
>>> I'm using mpich 3.1 (hydra+MPI).
>>> I execute a main application (Main), which calls mpiexec.hydra in the
>>> following way:
>>>
>>>  mpiexec.hydra -genvall  -disable-auto-cleanup  -f MpiConfigMachines.txt
>>> -launcher=ssh -n 3 MPI_Prog
>>>
>>>  MpiConfigMachines.txt content:
>>>  10.3.2.100:1
>>> 10.3.2.101:2
>>>
>>>  Where 10.3.2.100 is the local host.
>>> As a result I get:
>>>
>>>    - Main + a single MPI_Prog process on the local computer
>>>    - 2 MPI_Prog processes on the remote one.
>>>
>>> The Main application establishes a TCP socket with the local MPI_Prog.
>>> The Main application also establishes a TCP socket with a controller on
>>> another computer, 10.3.2.170, which is not included in the
>>> MpiConfigMachines.txt file.
>>>
>>>  After some execution time (hours, sometimes days) I see via netstat that
>>> a new connection has been created between mpiexec.hydra and the controller.
>>>
>>>  Before executing mpiexec.hydra I set the environment variable
>>>
>>> setenv MPIEXEC_PORT_RANGE 50010:65535
>>>
>>> According to the manual, this variable limits the ports hydra uses to the
>>> range [50010:65535].
>>>
>>>
>>>  I see that hydra uses these ports with MPI_Prog, but the connection with
>>> the controller is made on port 701 (on the controller's computer).
>>>
>>>
>>>  The Controller program is a server. It can only accept connections.
>>>
>>>
>>>  Can you please advise how to deal with this problem?
>>>
>>> How does hydra discover the controller's IP and establish a connection
>>> with it?
>>>
>>>
>>>  Sincerely,
>>>
>>> Anatoly.
>>>
>>>
>>>
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>
>

