[mpich-devel] romio questions

Balaji, Pavan balaji at anl.gov
Fri Apr 4 17:12:33 CDT 2014


Every time the MPICH stack is “stuck” in the progress engine (i.e., nothing happened for some time), it’ll release the lock so other threads can grab it.  So unless File_write_all is blocking on a file-system operation, other threads should be able to make progress just fine.
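
Roughly, the pattern looks like the sketch below -- not the actual
MPICH code; poll_once() and request_done() are illustrative stand-ins
for the real progress poll and completion test:

    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>

    /* one "big lock", taken at the MPI entry point */
    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    bool poll_once(void);          /* assumed: true if progress was made */
    bool request_done(void *req);  /* assumed: completion test */

    void progress_wait(void *req)
    {
        /* caller already holds big_lock */
        while (!request_done(req)) {
            if (!poll_once()) {
                /* stuck: drop the lock so other threads can enter
                 * MPI, then re-acquire it and poll again */
                pthread_mutex_unlock(&big_lock);
                sched_yield();
                pthread_mutex_lock(&big_lock);
            }
        }
    }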

  — Pavan

On Apr 4, 2014, at 4:34 PM, Rob Latham <robl at mcs.anl.gov> wrote:

> This is getting out of my depth.  MPICH guys: Mark Allen is updating Platform's MPI and has been asking me some questions about multi-threaded ROMIO.
> 
> The independent ROMIO routines might make some local MPI calls to process the data type, but otherwise, yes, the only blocking call would be the file system operation.
> 
> I haven't thought about what would happen if MPI_File_write_all was called from two threads.
> 
> MPICH guys: there's no way the CS_ENTER/EXIT macros can be that clever, right?
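> 
> For reference, my mental model of those macros is a plain
> lock/unlock pair -- a simplified sketch with illustrative names, not
> the real definitions:
> 
>     #include <pthread.h>
> 
>     static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;
> 
>     /* no hidden release while blocked inside the call */
>     #define MPIO_CS_ENTER() pthread_mutex_lock(&cs_mutex)
>     #define MPIO_CS_EXIT()  pthread_mutex_unlock(&cs_mutex)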
> 
> ==rob
> 
> On 04/04/2014 04:28 PM, Mark Allen wrote:
>> Thanks.  I looked a bit at the CS_ENTER/EXIT code.  Am I right that all
>> the non-collective MPIO calls like MPI_File_read etc. are only blocking
>> in the sense of waiting on some local operation to complete?  If that's
>> the case it would be okay for them to hold a lock from beginning to end
>> of the call like that.  But for the collective MPIO operations I don't
>> think you could just hold the lock for the entire function call since
>> you might end up waiting for remote peers to arrive, and if two
>> collectives were taking place at the same time they could mismatch and
>> deadlock.  But I'm not sure if that's what the code is doing or not.  Is
>> there a release of the lock hidden somewhere when collectives block?  If,
>> inside the MPI_Allgather for example, MPI released and re-acquired the
>> MPIO lock, would the blocking MPIO collective be holding any shared
>> resource that could get corrupted by letting other threads in at that point?
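>> 
>> To make the worry concrete, here's a sketch of the mismatch I have
>> in mind, assuming each collective held a process-wide lock for its
>> whole duration (fhA and fhB are two files opened on the same
>> communicator):
>> 
>>     /* thread 0 on each rank */
>>     MPI_File_write_all(fhA, bufA, n, MPI_BYTE, &stA);
>>     /* thread 1 on each rank */
>>     MPI_File_write_all(fhB, bufB, n, MPI_BYTE, &stB);
>> 
>>     /* If rank 0's thread 0 takes the lock first (collective on fhA)
>>      * while rank 1's thread 1 takes it first (collective on fhB),
>>      * each rank then waits for a peer that can never enter the
>>      * matching collective: deadlock. */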
>> 
>> Fwiw I also noticed mpi-io/ioreq_f2c.c looks to be a case where an early
>> return runs the risk of a mismatched CS_ENTER w/o a corresponding
>> CS_EXIT.  It looks like all the other files use goto fn_exit to ensure
>> a match.
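>> 
>> That is, the safe shape used elsewhere is roughly this (a sketch
>> with illustrative names, not the actual source):
>> 
>>     extern void MPIO_CS_ENTER(void);   /* stand-ins for the macros */
>>     extern void MPIO_CS_EXIT(void);
>> 
>>     int some_mpio_call(int bad_argument)
>>     {
>>         int error_code = 0;
>>         MPIO_CS_ENTER();
>>         if (bad_argument) {
>>             error_code = -1;
>>             goto fn_exit;        /* not "return": EXIT must still run */
>>         }
>>         /* ... real work ... */
>>     fn_exit:
>>         MPIO_CS_EXIT();          /* matches the ENTER on every path */
>>         return error_code;
>>     }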
>> 
>> Mark
>> 
>> From: Rob Latham <robl at mcs.anl.gov>
>> To: Paul Coffman/Rochester/IBM at IBMUS
>> Cc: Mark Allen/Dallas/IBM at IBMUS
>> Date: 04/04/2014 09:49 AM
>> Subject: Re: Fw: romio questions
>> 
>> ------------------------------------------------------------------------
>> 
>> On 04/04/2014 01:22 AM, Paul Coffman wrote:
>> 
>> > ----- Forwarded by Paul Coffman/Rochester/IBM on 04/04/2014 01:22 AM -----
>> >
>> > From: Mark Allen/Dallas/IBM
>> > To: Paul Coffman/Rochester/IBM at IBMUS,
>> > Date: 04/04/2014 01:07 AM
>> > Subject: romio questions
>> > ------------------------------------------------------------------------
>> >
>> >
>> > I have two questions/topics for you:
>> >
>> > First I wanted to ask do you happen to know if romio is thread safe?  I
>> > see a fair number of critical-section begin/end macros and am guessing
>> > it is, but thought I'd ask anyway.
>> 
>> The internal romio routines (things in the ADIO and ADIOI namespaces)
>> rely on several bits of global state -- the flattened representations of
>> the file and memory datatypes come to mind first, but there are probably
>> others.  The critical-section macros sit at the MPI-IO interface to
>> romio and should provide a "big lock" around everything below it.
>> 
>> I haven't tried this though.  I would be a little nervous about it
>> working without a few patches.
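>> 
>> To give a concrete example of the kind of global state I mean: romio
>> keeps a global list of flattened datatypes (ADIOI_Flatlist), and an
>> unlocked insert like the sketch below (simplified, not the real
>> code) is the sort of thing two threads would corrupt:
>> 
>>     struct flatnode {
>>         /* flattened (offset, length) description would live here */
>>         struct flatnode *next;
>>     };
>> 
>>     static struct flatnode *flatlist = NULL;  /* global, shared */
>> 
>>     void flatlist_insert(struct flatnode *n)
>>     {
>>         n->next = flatlist;   /* two threads can both read the old head */
>>         flatlist = n;         /* ...and one of the inserts is lost */
>>     }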
>> 
>> > Second, I noticed romio uses an extension of generalized requests
>> > described here:
>> >
>> > http://www.cs.illinois.edu/~wgropp/bib/papers/2007/latham_grequest-enhance-4.pdf
>> >
>> > but the code looks confused on whether the proposed wait_fn callback is
>> > waiting for a single request or all the requests.
>> >
>> > In romio's definitions of
>> >      ADIOI_PVFS2_aio_wait_fn
>> >      ADIOI_GEN_aio_wait_fn
>> > the wait_fn looks to me very much like it's waiting for all the reqs, vs
>> > the NTFS function that looks like it's waiting on some
>> >      ADIOI_NTFS_aio_wait_fn
>> 
>> The intent for the extended wait_fn was indeed to wait for all
>> outstanding generalized requests.  Specifically, to call aio_suspend on
>> more than one operation at a time.
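>> 
>> In other words, the shape of the GEN/PVFS2 wait_fn is roughly this
>> (a sketch, not the actual code):
>> 
>>     #include <aio.h>
>>     #include <errno.h>
>> 
>>     /* wait until *all* the outstanding aio operations finish,
>>      * suspending on the whole set at once rather than one by one */
>>     void wait_all_sketch(struct aiocb **cbs, int count)
>>     {
>>         int done = 0;
>>         while (done < count) {
>>             /* returns when at least one operation completes */
>>             aio_suspend((const struct aiocb *const *) cbs, count, NULL);
>>             done = 0;
>>             for (int i = 0; i < count; i++)
>>                 if (aio_error(cbs[i]) != EINPROGRESS)
>>                     done++;
>>         }
>>     }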
>> 
>> I would not read too much into NTFS_aio_wait_fn.  That work was done
>> quickly for the paper and then set aside.
>> 
>> > And where these wait_fn callbacks get used, MPI_Waitsome uses
>> > MPIR_Grequest_progress_poke which conditionally calls wait_fn. It seems
>> > to me this would make MPI_Waitsome erroneously block until all its
>> > generalized requests finish (the code in MPIR_Grequest_progress_poke
>> > looks like it believes the wait_fn is supposed to just complete one
>> > request).
>> 
>> It has been some years since I looked at the waitsome/extended-grequest
>> interaction, but it does sound like I could have implemented it better...
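>> 
>> The mismatch, as I understand it, is roughly this (a sketch of the
>> logic, not the real MPICH code):
>> 
>>     typedef struct grequest grequest;      /* opaque, assumed */
>>     int wait_fn(int n, grequest *reqs[]);  /* returns only when ALL
>>                                               n requests are done */
>> 
>>     /* MPI_Waitsome's contract is to return once at least one
>>      * request completes; calling a wait-for-all wait_fn from the
>>      * progress poke makes it block until all n finish instead */
>>     int waitsome_poke_sketch(int n, grequest *reqs[])
>>     {
>>         return wait_fn(n, reqs);
>>     }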
>> 
>> > Is this an area you've worked with any or something we need to worry
>> > about?  The reason I was looking at it was in order to pull in the new
>> > romio I figured I'd just add MPICH's concept of extended generalized
>> > requests into our MPI, so I wanted to make sure I understood how they
>> > were intended to work in mpich and I think it's pretty muddled there
>> > unless I'm reading it wrong.
>> 
>> You would be the first developers besides me to look closely at our
>> extended generalized request proposal.  It's no surprise to me it's
>> muddled, since it hasn't had a lot of attention over the years.
>> 
>> At various points over the last six years, MPICH developers have modified
>> the implementation of generalized requests to keep the common code paths
>> speedy.  That might explain some of the murkiness, or things like
>> 'waitsome' actually waiting for everything.
>> 
>> ==rob
>> 
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>> 
>> 
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA


