[mpich-discuss] bind/listen race condition
Kenneth Raffenetti
raffenet at mcs.anl.gov
Thu Jun 29 09:22:55 CDT 2017
Hi Steve,
Thanks for the report. I was able to reproduce it on my laptop with the
latest git revision. I've created a GitHub issue to track it. Hopefully
it will be fixed soon.
https://github.com/pmodels/mpich/issues/2665
Ken
On 06/27/2017 02:11 PM, Steve Krueger wrote:
> I'm using a rather old 1.4.1p1 version, but I checked the latest sources, and I believe the
> problem still exists.
>
> When starting multiple MPICH jobs, and using an MPICH_PORT_RANGE, I occasionally see the error:
> HYDU_sock_listen (./utils/sock/sock.c:128): listen error (Address already in use)
>
> I believe two processes are successfully bind()ing the same port in the range, but then the second listen() call returns the error.
>
> The code in sock.c loops through the port range attempting to find a port to bind to, and once the bind() succeeds, it only calls listen() once. I think the code should tolerate an EADDRINUSE error from listen() and retry with a new port.
>
> The problem is easier to reproduce if you add a sleep(10) between the bind() and listen(), and
> then run two MPI jobs with the same MPICH_PORT_RANGE.
>
> sk
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
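
For reference, the retry suggested above could look roughly like the sketch
below. It is not the actual code in utils/sock/sock.c; the function name
listen_in_range and its signature are made up for illustration. The idea is
to treat bind() and listen() as one step, so that EADDRINUSE from either
call moves on to the next port in the range instead of failing the launch.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPICH): try each port in [low, high].
 * On success, return a listening socket and store its port in *port.
 * On failure, return -1 with errno set. */
int listen_in_range(uint16_t low, uint16_t high, uint16_t *port)
{
    /* unsigned loop counter so that high == 65535 cannot wrap around */
    for (unsigned p = low; p <= high; p++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons((uint16_t) p);

        /* The port only counts as ours once both calls have succeeded. */
        if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) == 0 &&
            listen(fd, SOMAXCONN) == 0) {
            *port = (uint16_t) p;
            return fd;
        }

        int err = errno;        /* save before close() can overwrite it */
        close(fd);
        if (err != EADDRINUSE) {
            errno = err;        /* unrelated error: give up */
            return -1;
        }
        /* EADDRINUSE from bind() or listen(): try the next port */
    }

    errno = EADDRINUSE;         /* every port in the range was taken */
    return -1;
}
```

A caller would pass the two ends of the MPICH_PORT_RANGE and check for a
return value of -1. The socket is closed and recreated for each attempt, so
a failed listen() does not leave a descriptor bound to the contested port.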