[mpich-discuss] bind/listen race condition
Steve.Krueger at sas.com
Tue Jun 27 14:11:23 CDT 2017
I'm using a rather old 1.4.1p1 version, but I checked the latest sources, and I believe the
problem still exists.
When starting multiple MPICH jobs, and using an MPICH_PORT_RANGE, I occasionally see the error:
HYDU_sock_listen (./utils/sock/sock.c:128): listen error (Address already in use)
I believe two processes are successfully bind()ing the same port in the range, but then the listen() call in the second process returns the error.
The code in sock.c loops through the port range attempting to find a port to bind to, and once the bind() succeeds, it only calls listen() once. I think the code should tolerate an EADDRINUSE error from listen() and retry with the next port in the range.
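A rough sketch of the retry I have in mind is below. This is not the actual sock.c code: listen_in_range() and its parameters are hypothetical names, and it assumes the sockets are created with SO_REUSEADDR, which as far as I understand is what lets two sockets bind() the same port on Linux as long as neither is listening yet. The point is only that an EADDRINUSE from listen() is handled the same way as one from bind().

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper, not the real HYDU_sock_listen().
 * Tries each port in [low, high]. EADDRINUSE from either bind() or
 * listen() means "port taken", so the loop moves on to the next port.
 * Returns a listening fd and stores the port in *port_out, or returns -1
 * with errno set. */
static int listen_in_range(uint16_t low, uint16_t high, uint16_t *port_out)
{
    /* unsigned loop variable so that high == 65535 does not wrap around */
    for (unsigned port = low; port <= high; port++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        /* Assumption: the real code also sets SO_REUSEADDR. */
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons((uint16_t)port);

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0 &&
            listen(fd, SOMAXCONN) == 0) {
            *port_out = (uint16_t)port;
            return fd;
        }

        /* Save errno before close() can overwrite it. */
        int saved = errno;
        close(fd);
        if (saved != EADDRINUSE) {
            errno = saved;   /* a real error, give up */
            return -1;
        }
        /* EADDRINUSE from bind() or listen(): try the next port */
    }

    errno = EADDRINUSE;      /* every port in the range was taken */
    return -1;
}
```

The socket is closed and recreated for each port because a socket that has been bound cannot be bound again to a different port.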
The problem is easier to reproduce if you add a sleep(10) between the bind() and listen(), and
then run two MPI jobs with the same MPICH_PORT_RANGE.
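The same window can probably be shown without MPICH at all, inside a single process. The sketch below is an assumption-laden test program, not anything taken from Hydra: bound_socket() and second_listen_errno() are hypothetical helpers, and the behavior depends on the platform. On Linux I would expect both bind() calls to succeed (because of SO_REUSEADDR) and the second listen() to fail with EADDRINUSE; other systems may refuse the second bind() instead.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: create a TCP socket, set SO_REUSEADDR, and bind it
 * to `port` on all interfaces. Returns the fd, or -1 on any failure. */
static int bound_socket(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = htons(port);

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) != 0 ||
        bind(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Bind two sockets to the same port before either one listens, then call
 * listen() on both. Returns the errno of the second listen() (0 if it
 * succeeded), or -1 if the platform refused one of the binds or the first
 * listen(). */
static int second_listen_errno(unsigned short port)
{
    int result = -1;
    int a = bound_socket(port);
    int b = bound_socket(port);

    if (a >= 0 && b >= 0 && listen(a, 1) == 0)
        result = (listen(b, 1) == 0) ? 0 : errno;

    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return result;
}
```

If second_listen_errno() returns EADDRINUSE on a given machine, that is the same failure HYDU_sock_listen reports, just without needing the sleep(10) to widen the window.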