[mpich-discuss] bind/listen race condition

Kenneth Raffenetti raffenet at mcs.anl.gov
Thu Jun 29 09:22:55 CDT 2017


Hi Steve,

Thanks for the report. I was able to reproduce it on my laptop with the 
latest git revision. I've created a GitHub issue to track it. Hopefully 
it will be fixed soon.

https://github.com/pmodels/mpich/issues/2665

Ken

On 06/27/2017 02:11 PM, Steve Krueger wrote:
> I'm using a rather old 1.4.1p1 version, but I checked the latest sources, and I believe the
> problem still exists.
> 
> When starting multiple MPICH jobs, and using an MPICH_PORT_RANGE, I occasionally see the error:
> HYDU_sock_listen (./utils/sock/sock.c:128): listen error (Address already in use)
> 
> I believe two processes are successfully bind()ing the same port in the range, but then the second listen() call returns the error.
> 
> The code in sock.c loops through the port range attempting to find a port to bind to, and once the bind() succeeds, it only calls listen() once. I think the code should tolerate an EADDRINUSE error from listen() and retry with a new port.
> 
> The problem is easier to reproduce if you add a sleep(10) between the bind() and listen(), and
> then run two MPI jobs with the same MPICH_PORT_RANGE.
> 
> sk
> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
> 
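
For reference, the retry suggested above could look roughly like the sketch
below. It is not the code in sock.c and not the eventual fix.
listen_in_range() is a hypothetical helper written only for illustration,
and the use of SO_REUSEADDR is an assumption: as I understand it, on Linux
that option is what lets two sockets bind() the same port while neither is
listening yet. The idea is simply to treat EADDRINUSE from listen() the same
way as EADDRINUSE from bind(), and move on to the next port in the range.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of MPICH): try each port in [low, high]
 * and return a listening socket, storing the chosen port in *port_out.
 * Returns -1 with errno set if no port in the range could be used. */
static int listen_in_range(uint16_t low, uint16_t high, uint16_t *port_out)
{
    assert(port_out != NULL);

    /* unsigned int counter so that high == 65535 cannot wrap around */
    for (unsigned int port = low; port <= high; port++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        /* Assumption: SO_REUSEADDR is set, which is what makes the
         * bind()/listen() race possible in the first place. */
        int one = 1;
        if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
            int err = errno;
            close(fd);
            errno = err;
            return -1;
        }

        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons((uint16_t) port);

        /* Success only if both bind() and listen() succeed on this port. */
        if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) == 0 &&
            listen(fd, SOMAXCONN) == 0) {
            *port_out = (uint16_t) port;
            return fd;
        }

        /* Either call failed: remember why, drop the socket, and only
         * keep going if the port was simply taken by someone else. */
        int err = errno;
        close(fd);
        if (err != EADDRINUSE) {
            errno = err;
            return -1;
        }
    }

    errno = EADDRINUSE;   /* every port in the range was busy */
    return -1;
}
```

The socket is closed and recreated for each attempt so that a port lost to
the race never stays bound. A caller would pass the two ends of its
MPICH_PORT_RANGE and report an error only when the whole range is exhausted.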