[mpich-discuss] Deadlock in MPI_Ibarrier over ch3:sock

Dave Goodell goodell at mcs.anl.gov
Wed Jan 23 14:44:51 CST 2013


On Jan 23, 2013, at 2:13 PM CST, Jed Brown wrote:

> As we've discussed here several times before, PETSc's --download-mpich option uses ch3:sock by default because this issue
> 
> http://trac.mpich.org/projects/mpich/ticket/79
> 
> makes nemesis unusable when oversubscribed.

Your complaint has been duly noted (again).

> We're also checking for MPI_Ibarrier and using Torsten's "nonblocking consensus" algorithm when possible. Unfortunately, MPI_Ibarrier deadlocks when run over ch3:sock. (Both procs have posted the Ibarrier, but the requests never complete.) I can prepare a reduced test case if you'd like.

Hmm… it looks like we aren't poking NBC progress properly when a "test" or "iprobe" routine is called.  MPIDI_CH3i_Progress_test is missing a call to MPIDU_Sched_progress: http://git.mpich.org/mpich.git/blob/refs/heads/master:/src/mpid/ch3/channels/sock/src/ch3_progress.c#l51

I'm surprised this isn't caught by the "coll/nonblocking3" test. We may just be making enough progress some other way to hide the problem, or that test may not exercise the "test" routines thoroughly enough: http://git.mpich.org/mpich.git/blob/refs/heads/master:/test/mpi/coll/nonblocking3.c

A modest test case would be very helpful.

> Are you still supporting ch3:sock?

We are, although we have chosen not to support a small subset of the MPI-3 functionality in ch3:sock, such as shared memory windows.  We still run sock in the nightly tests:

http://www.mpich.org/static/cron/tests/#hydra-ch3:sock

> Is there some test we can do to check whether ch3:sock is being used so that we can choose a safe algorithm?

A run-time test that doesn't require looking at which symbols are compiled into the app?  I can't think of one off the top of my head, although I suppose we could add one in the future via either MPI_T or the new MPI_Get_library_version routine.
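
For illustration only, here is a rough sketch of what the MPI_Get_library_version idea might look like on the application side. It assumes the library puts the channel name (e.g. "ch3:sock") somewhere in its version string, which nothing above promises. The helper name version_mentions_channel is hypothetical and is not part of MPI or MPICH.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an MPI or MPICH routine): returns 1 if the
 * library version string mentions the given channel name, 0 otherwise.
 * ASSUMPTION: the MPI library embeds the channel name, e.g. "ch3:sock",
 * in the string it reports; that is not guaranteed by anything here.
 *
 * Intended use with an MPI-3 library (sketch, needs <mpi.h>):
 *     char buf[MPI_MAX_LIBRARY_VERSION_STRING];
 *     int len;
 *     MPI_Get_library_version(buf, &len);
 *     if (version_mentions_channel(buf, "ch3:sock")) { ...fallback... }
 */
int version_mentions_channel(const char *version, const char *channel)
{
    /* Treat missing input or an empty channel name as "not found". */
    if (version == NULL || channel == NULL || channel[0] == '\0')
        return 0;

    /* Plain substring search; no parsing of the string's layout. */
    return strstr(version, channel) != NULL;
}
```

A plain substring match keeps the check independent of how the rest of the string is laid out, at the cost of false negatives whenever the library does not report its channel at all. In that case the caller would have to fall back to the conservative algorithm anyway.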

-Dave



