[mpich-discuss] single node mpiio
Rob Latham
robl at mcs.anl.gov
Tue Jan 12 16:34:52 CST 2016
On 01/12/2016 02:01 PM, Jaln wrote:
> Thanks Rob,
> What is the reason for the whole-file lock? Is it due to the bug in
> Lustre 2.5 that has been fixed in 2.6 and 2.7, or is it just due to
> this I/O pattern?
> Without file locking, does the "single link" to the 10 OSTs actually
> prevent the I/O from being truly parallel?
As Wei-keng clarified, the whole-file lock and its subsequent revocation
only matter when two or more nodes are involved (not merely multiple MPI
processes, if they are all on the same node).
I don't know anything about a bug in Lustre 2.5.
I think this was a design choice -- while Lustre allows parallel access,
the design assumption is that most accesses are serial. Something like
O_DIRECT can bypass the locking.
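
For illustration, here is a rough, untested sketch of bypassing the page
cache with O_DIRECT from plain POSIX; the 4096-byte alignment and the
file name are placeholders -- check what your Lustre setup actually
requires:

    #define _GNU_SOURCE          /* exposes O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;   /* write 1 MiB */
        void *buf = NULL;

        /* O_DIRECT requires the buffer, file offset, and transfer length
         * to be aligned; 4096 bytes is a common choice. */
        if (posix_memalign(&buf, 4096, len) != 0)
            return 1;
        memset(buf, 0, len);

        int fd = open("/scratch/testfile",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;
        if (write(fd, buf, len) < 0)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }
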
When creating a file with a non-default stripe (default Lustre stripe
counts are typically 1 or 4, not "all available OSTs"), ROMIO's Lustre
driver will set the O_LOV_DELAY_CREATE flag so that locks are assigned a
bit later in the process.
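
If you want to experiment with a wider stripe from MPI, the usual route
is to pass the ROMIO hints "striping_factor" and "striping_unit" when
the file is created (they have no effect on a file that already exists).
A minimal sketch, with the file name and values as placeholders:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ask ROMIO for a 10-OST stripe with a 1 MiB stripe size.
         * These hints only take effect at file-creation time. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "10");
        MPI_Info_set(info, "striping_unit", "1048576");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);

        /* ... independent or collective writes go here ... */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }
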
==rob
>
> Best,
> Jialin
> Lawrence Berkeley Lab
>
> On Tue, Jan 12, 2016 at 11:22 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>
>
>
> On 01/12/2016 11:36 AM, Jaln wrote:
>
> Hi,
> I'm running HDF5 with MPI-IO on a single compute node (32 cores) and
> 10 OSTs; the file system is Lustre v2.5.
> I submit the job with 3 processes writing to a shared file of about
> 3 GB, and each process writes 1/3 of the file. For example, the array
> is a 4D double array of shape 3*32*1024*128, so each process writes a
> contiguous 32*1024*128 chunk to the file.
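>
> Ignoring HDF5 and writing the same pattern directly with MPI-IO, the
> independent version is roughly the following sketch (file name and
> error handling are placeholders):
>
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     #define NDOUBLES (32L * 1024 * 128)   /* per-rank contiguous chunk */
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         double *buf = malloc(NDOUBLES * sizeof(double));
>         /* ... fill buf with this rank's slab ... */
>
>         MPI_File fh;
>         MPI_File_open(MPI_COMM_WORLD, "shared.dat",
>                       MPI_MODE_CREATE | MPI_MODE_WRONLY,
>                       MPI_INFO_NULL, &fh);
>
>         /* each rank writes its contiguous 1/3 of the file */
>         MPI_Offset off = (MPI_Offset)rank * NDOUBLES * sizeof(double);
>         MPI_File_write_at(fh, off, buf, NDOUBLES, MPI_DOUBLE,
>                           MPI_STATUS_IGNORE);
>
>         MPI_File_close(&fh);
>         free(buf);
>         MPI_Finalize();
>         return 0;
>     }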
>
> I observed some weird performance numbers. I tried both independent
> I/O and collective I/O.
> In the case of independent I/O, the ranks seem to block each other
> and finish writing one after another. But in collective I/O, all
> three ranks report the same I/O cost; I think this is because there
> is only one aggregator.
> My question is: in the case of independent I/O, are the I/Os blocking
> when accessing the file?
> If they are not blocking, can I expect linear speedup on a single
> node by increasing the number of processes?
>
>
> There is a lot going on under the hood in collective mode. Let's
> start with independent mode. Here's my mental model of what's going on:
>
> - Your 3-process application issues 3 Lustre writes.
> - One of those will win the race to the file system, acquire a
> whole-file lock, and begin to write.
> - Now the request from the second-place process arrives. Lustre will
> revoke the whole-file lock and issue a lock from the beginning of
> that request to the end of the file.
> - When the last process arrives, Lustre will revoke locks yet again
> and issue a lock from the beginning of its request to the end of the
> file.
>
> So yep, there is indeed a lot of blocking going on. It's not a
> formal queue, more of a side effect of waiting for the Lustre MDS to
> issue the required locks.
>
> In collective I/O (which I'll assume is a relatively recent ROMIO
> implementation), one of your three MPI processes will be the
> aggregator. You seem to know about two-phase collective I/O, so I
> will keep the rest of the explanation brief. Yes, that aggregator
> will receive data from the other processes and issue all the I/O.
> The other processes wait until the aggregator finishes, which is
> why you see all processes reporting the same run time for I/O.
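>
> As a sketch (reusing fh, off, buf, and NDOUBLES from the independent
> example above), the call that takes the two-phase path is simply the
> collective _all variant; the number of aggregators is governed by
> ROMIO's "cb_nodes" / "cb_config_list" hints, which default to one
> aggregator per host:
>
>     /* Collective counterpart of the independent write above: every
>      * rank must make this call; ROMIO's two-phase engine ships the
>      * data to the aggregator(s), which do the actual file I/O. */
>     MPI_File_write_at_all(fh, off, buf, NDOUBLES, MPI_DOUBLE,
>                           MPI_STATUS_IGNORE);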
>
> You are unlikely to get a speedup from a single node for two main
> reasons: there are the aforementioned lock-contention issues, and
> there is a single network link from that node to the 10 Lustre OSTs.
>
> In fact, due to the lock revocation activity I described above, you
> are likely to see a massive performance crash when you involve a
> second node. Then, as you add more nodes, you will finally light up
> enough network links to see a speedup.
>
> ==rob
>
>
> Best,
> Jialin
> Lawrence Berkeley Lab
>
>
> --
>
> Genius only means hard-working all one's life
>
>
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss