[mpich-devel] Deadlock when using MPIX_Grequest interface

Latham, Robert J. robl at mcs.anl.gov
Wed Mar 8 16:25:23 CST 2017


On Wed, 2017-03-08 at 09:25 +0000, Giuseppe Congiu wrote:
> Hello Rob,
>  
> > I'm excited to see someone using the MPIX_Grequest interface.  We
> > used
> > the MPIX_Grequest interface to implement non-blocking collective
> > I/O,
> > and had some bad interactions between libc's aio and the grequest
> > callbacks.  I don't know if you are running into something similar.
> 
> Maybe. Do you have a description of the problem somewhere?

The guy who did that work just left last Friday.  I'll have to dig up
the archives. Looks like it was a hard-to-debug segfault:
https://trac.mpich.org/projects/mpich/ticket/2201

>  
> > Do you have any desire or plans to submit these changes into
> > upstream
> > ROMIO?  
> 
> The idea would be to push these changes to upstream ROMIO if this is
> relevant for the community. 

I don't encounter many BeeGFS users, but ROMIO file system drivers are
fairly self-contained, so it wouldn't be a burden to ship this one with
ROMIO.


> In principle here I have the same intent. The difference is that I
> cannot check on progress since
> BeeGFS does not provide a way for checking the status of a single
> request. Instead it only
> offers a blocking wait interface for all the requests submitted for a
> certain file (identified
> by the filename). Thus I need to invoke deeper_cache_flush_wait()
> from inside one of the
> callbacks.

Blocking the progress engine when it expects to repeatedly call
non-blocking functions could work, as long as deeper_cache_flush_wait()
eventually finishes and nothing else needs MPI in the meantime.

Now, all I know about DEEP-ER is what I just read on
https://www.beegfs.com/wiki/cacheAPI, so I'm sure I don't know the whole
picture, but can you call deeper_cache_flush_is_finished() and
deeper_cache_flush() without the WAIT flag?  Stick those two routines in
the poll_fn().
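
A minimal sketch of that poll-based shape, under loud assumptions: it
does not include mpi.h or the DEEP-ER headers, and stub_flush_start() and
stub_flush_is_finished() are hypothetical stand-ins for the non-blocking
deeper_cache_flush() and deeper_cache_flush_is_finished() calls, whose
real signatures are not reproduced here. The flush_state struct and
drive_until_complete() are likewise invented for illustration. A real
MPICH poll callback also receives an MPI_Status pointer and would call
MPI_Grequest_complete() where the comment indicates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-request state passed around as extra_state. */
typedef struct {
    const char *filename; /* file whose cached data is being flushed */
    int polls_left;       /* simulation only: flush "finishes" after this many checks */
    int started;          /* non-zero once the flush has been submitted */
    int complete;         /* non-zero once the request has been marked complete */
} flush_state;

/* Stub for a non-blocking flush submission (assumption, not the real API). */
static void stub_flush_start(flush_state *st)
{
    st->started = 1;
}

/* Stub for a non-blocking completion check (assumption, not the real API).
 * Returns 1 once the simulated flush is done, 0 otherwise. */
static int stub_flush_is_finished(flush_state *st)
{
    if (st->polls_left > 0)
        st->polls_left--;
    return st->polls_left == 0;
}

/* Poll callback: submits the flush on first use, then only checks status.
 * It never blocks, so the caller stays free to do other work between calls. */
static int poll_fn(void *extra_state)
{
    flush_state *st = extra_state;
    assert(st != NULL);

    if (!st->started)
        stub_flush_start(st);

    if (!st->complete && stub_flush_is_finished(st)) {
        /* A real callback would call MPI_Grequest_complete(request) here. */
        st->complete = 1;
    }
    return 0; /* stands in for MPI_SUCCESS */
}

/* Simulated progress engine: calls poll_fn() repeatedly until the request
 * completes and returns how many polls that took. */
int drive_until_complete(flush_state *st)
{
    int calls = 0;
    while (!st->complete) {
        poll_fn(st);
        calls++;
    }
    return calls;
}
```

The point of the shape is that each poll_fn() call returns promptly, so
the progress engine is never parked inside a blocking wait.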

The generalized request extensions provide a wait_fn() that should be
able to handle this, too...

When it gets stuck, what does the call stack look like?

Is it stuck for good or just making progress really really slowly?

==rob

