[mpich-discuss] Fwd: mpiexec.hydra creates unexpected TCP socket.

Anatoly G anatolyrishon at gmail.com
Sun Jan 11 08:07:45 CST 2015


Dear Wesley.
I asked the developers who saw this strange configuration, and they are
not sure about the MPI_Prog processes. These are "calculation processes"
which may fail. I need to check for myself whether any MPI_Prog process
actually failed.

I suspect that the MPI_Prog processes (created by mpiexec.hydra) did not
fail; what failed was the Main application process, which created the
mpiexec.hydra process.

I know that my story looks strange.
I assume there are no dependencies between the Main application and
mpiexec.hydra after the hydra process is created.

I'll try to gather more information and details.


Regards,
Anatoly.




On Wed, Jan 7, 2015 at 5:33 PM, Wesley Bland <wbland at anl.gov> wrote:

> Are all of the processes in your application aborting or just a subset?
>
> On Wed, Jan 7, 2015 at 5:06 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>
>>  Hello, Wesley.
>> I think my previous mail was not clear enough.
>>
>>  The system has the following processes:
>>
>>    - Application (Main) - this process calls execvp with the parameters:
>>    mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines.txt
>>    -launcher=ssh -n 3 MPI_Prog
>>    - MPI_Prog - this program performs calculations. Its instances
>>    (processes) are created by mpiexec.hydra.
>>
>> After some execution time, the Application (Main) code calls the abort()
>> function and dies without sending SIGTERM to the mpiexec.hydra process.
>>
>>  I understand that this is a bug. It will be fixed by Application
>> developers.
>>
>>  But after this abnormal Application termination, mpiexec.hydra's parent
>> becomes the init process. This is ok.
>>
>>  But then I see via netstat that mpiexec.hydra opens sockets to another
>> process (called Controller), which was not part of the MPI execution.
>>
>>  Is hydra trying to establish/restore a connection with its parent (the
>> killed Application) process?
>>
>>  I understand that this is an emergency and an unexpected use of
>> mpiexec.hydra.
>>
>>  I was sure that mpiexec.hydra would not react to the Application's
>> failure and would behave exactly as if the Application process still
>> existed.
>>
>>  Maybe you can explain this strange situation.
>>
>>  Regards,
>>
>> Anatoly.
>>
>>
>> On Mon, Jan 5, 2015 at 7:09 PM, Wesley Bland <wbland at anl.gov> wrote:
>>
>>>  When you pass -disable-auto-cleanup on the command line to mpiexec,
>>> you’re telling Hydra not to clean up other processes when one process in
>>> your job fails. It’s assumed that those processes will either clean
>>> themselves up or complete successfully.
>>>
>>> It’s not clear to me what your program is trying to do that would be
>>> erroneous, but usually when a process crashes, it’s the result of an
>>> erroneous program rather than a bug in MPICH. I’m not saying that there
>>> are no bugs in MPICH, but we’d like to be able to narrow down where to look.
>>>
>>> Thanks,
>>> Wesley
>>> On Thu, Jan 1, 2015 at 6:35 AM, Anatoly G <anatolyrishon at gmail.com>
>>> wrote:
>>>
>>>>  Dear MPICH,
>>>> I have additional information.
>>>> This "strange configuration" (hydra connected to a computer not on the
>>>> list) is the result of an unhandled Main process failure (similar to an
>>>> abort() call) without killing the child process (hydra).
>>>> Thus I can see the "init" process become the parent of the hydra process.
>>>> Can you please refer me to a document explaining hydra's behavior when
>>>> the parent process is dead (an emergency situation)?
>>>> I understand that this situation shouldn't happen and this bug will be
>>>> fixed, but I'm curious about the hydra logic.
>>>>
>>>>  Regards,
>>>> Anatoly.
>>>>
>>>>  ---------- Forwarded message ----------
>>>> From: Anatoly G <anatolyrishon at gmail.com>
>>>> Date: Wed, Dec 24, 2014 at 1:00 PM
>>>> Subject: mpiexec.hydra creates unexpected TCP socket.
>>>> To: discuss at mpich.org
>>>>
>>>>
>>>>   Dear MPICH,
>>>> I'm using MPICH 3.1 (hydra + MPI).
>>>> I execute the main application (Main), which calls mpiexec.hydra in the
>>>> following way:
>>>>
>>>>  mpiexec.hydra -genvall -disable-auto-cleanup -f MpiConfigMachines.txt
>>>> -launcher=ssh -n 3 MPI_Prog
>>>>
>>>>  MpiConfigMachines.txt content:
>>>>  10.3.2.100:1
>>>> 10.3.2.101:2
>>>>
>>>>  Where 10.3.2.100 is the local host.
>>>> As a result I get:
>>>>
>>>>    - Main + a single MPI_Prog process on the local computer
>>>>    - 2 MPI_Prog processes on the remote one.
>>>>
>>>> The Main application establishes a TCP socket with the local MPI_Prog.
>>>> The Main application establishes a TCP socket with a controller on
>>>> another computer, 10.3.2.170, which is not included in the
>>>> MpiConfigMachines.txt file.
>>>>
>>>>  After executing for some time (hours, sometimes days), I see via
>>>> netstat that a new connection has been created between mpiexec.hydra
>>>> and the controller.
>>>>
>>>>  Before executing mpiexec.hydra I set the environment variable:
>>>>
>>>> setenv MPIEXEC_PORT_RANGE 50010:65535
>>>>
>>>> According to the manual, this variable limits the ports hydra uses to
>>>> [50010:65535].
>>>>
>>>>
>>>>  I see that hydra uses these ports with MPI_Prog, but the connection
>>>> with the controller is made on port 701 (on the controller computer).
>>>>
>>>>
>>>>  The Controller program is a server. It can only accept connections.
>>>>
>>>>
>>>>  Can you please advise how to deal with this problem?
>>>>
>>>> How does hydra recognize the controller's IP and establish a connection
>>>> with it?
>>>>
>>>>
>>>>  Sincerely,
>>>>
>>>> Anatoly.
>>>>
>>>>
>>>>
>>>
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>
>>
>>
>

