[mpich-discuss] single node mpiio
Rob Latham
robl at mcs.anl.gov
Tue Jan 12 16:34:52 CST 2016
On 01/12/2016 02:01 PM, Jaln wrote:
> Thanks Rob,
> What is the reason for the whole-file lock? Is it due to the bug in
> Lustre 2.5 that has been fixed in 2.6 and 2.7, or is it just due to
> this I/O pattern?
> Without file locking, does the "single link" to the 10 OSTs actually
> prevent the I/O from being truly parallel?
As Wei-keng clarified, the whole-file lock and its subsequent revocation
only matter when two or more nodes are involved (not merely multiple MPI
processes, if they are all on the same node).
I don't know anything about a bug in Lustre 2.5.
I think this was a design choice -- while Lustre allows parallel access,
the design assumption is that most accesses are serial. Something like
O_DIRECT can bypass the locking.
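
For illustration, here is a rough, untested sketch of bypassing the page
cache with O_DIRECT from plain POSIX; the 4096-byte alignment and the
file name are placeholders -- check what your Lustre setup actually
requires:

    #define _GNU_SOURCE          /* exposes O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;   /* write 1 MiB */
        void *buf = NULL;

        /* O_DIRECT requires the buffer, file offset, and transfer length
         * to be aligned; 4096 bytes is a common choice. */
        if (posix_memalign(&buf, 4096, len) != 0)
            return 1;
        memset(buf, 0, len);

        int fd = open("/scratch/testfile",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;
        if (write(fd, buf, len) < 0)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }
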
When creating a file with a non-default stripe (default Lustre stripe
counts are typically 1 or 4, not "all available OSTs"), ROMIO's Lustre
driver will set the O_LOV_DELAY_CREATE flag so that locks are assigned a
bit later in the process.
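
If you want to experiment with a wider stripe from MPI, the usual route
is to pass the ROMIO hints "striping_factor" and "striping_unit" when
the file is created (they have no effect on a file that already exists).
A minimal sketch, with the file name and values as placeholders:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ask ROMIO for a 10-OST stripe with a 1 MiB stripe size.
         * These hints only take effect at file-creation time. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "10");
        MPI_Info_set(info, "striping_unit", "1048576");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);

        /* ... independent or collective writes go here ... */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }
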
==rob
>
> Best,
> Jialin
> Lawrence Berkeley Lab
>
> On Tue, Jan 12, 2016 at 11:22 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>
>
>
> On 01/12/2016 11:36 AM, Jaln wrote:
>
> Hi,
> I'm running HDF5 with MPI-IO on a single compute node (32 cores) and
> 10 OSTs; the file system is Lustre v2.5.
> I submit the job with 3 processes writing to a shared file of about
> 3 GB, and each process writes 1/3 of the file. For example, the array
> is a 4D double array of shape 3*32*1024*128, so each process writes a
> contiguous 32*1024*128 chunk to the file.
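>
> Ignoring HDF5 and writing the same pattern directly with MPI-IO, the
> independent version is roughly the following sketch (file name and
> error handling are placeholders):
>
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     #define NDOUBLES (32L * 1024 * 128)   /* per-rank contiguous chunk */
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         double *buf = malloc(NDOUBLES * sizeof(double));
>         /* ... fill buf with this rank's slab ... */
>
>         MPI_File fh;
>         MPI_File_open(MPI_COMM_WORLD, "shared.dat",
>                       MPI_MODE_CREATE | MPI_MODE_WRONLY,
>                       MPI_INFO_NULL, &fh);
>
>         /* each rank writes its contiguous 1/3 of the file */
>         MPI_Offset off = (MPI_Offset)rank * NDOUBLES * sizeof(double);
>         MPI_File_write_at(fh, off, buf, NDOUBLES, MPI_DOUBLE,
>                           MPI_STATUS_IGNORE);
>
>         MPI_File_close(&fh);
>         free(buf);
>         MPI_Finalize();
>         return 0;
>     }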
>
> I observed some weird performance numbers. I tried both independent
> I/O and collective I/O.
> In the case of independent I/O, the ranks seem to block each other
> and finish writing one after another. But in collective I/O, all
> three ranks report the same I/O cost; I think this is because there
> is only one aggregator.
> My question is: in the case of independent I/O, are the I/Os blocking
> when accessing the file?
> If they are not blocking, can I expect linear speedup on a single
> node by increasing the number of processes?
>
>
> There is a lot going on under the hood in collective mode. Let's
> start with independent mode. Here's my mental model of what's going on:
>
> - Your 3-process application issues 3 Lustre writes.
> - One of those will win the race to the file system, acquire a
> whole-file lock, and begin to write.
> - Now the request from the second-place process arrives. Lustre will
> revoke the whole-file lock and issue a lock from the beginning of
> that request to the end of the file.
> - When the last process arrives, Lustre will revoke locks yet again
> and issue a lock from the beginning of its request to the end of the
> file.
>
> So yep, there is indeed a lot of blocking going on. It's not a
> formal queue, more of a side effect of waiting for the Lustre MDS to
> issue the required locks.
>
> In collective I/O (which I'll assume is a relatively recent ROMIO
> implementation), one of your three MPI processes will be the
> aggregator. You seem to know about two-phase collective I/O, so I
> will keep the rest of the explanation brief. Yes, that aggregator
> will receive data from the other processes and issue all the I/O.
> The other processes wait until the aggregator finishes, which is
> why you see all processes reporting the same run time for I/O.
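>
> As a sketch (reusing fh, off, buf, and NDOUBLES from the independent
> example above), the call that takes the two-phase path is simply the
> collective _all variant; the number of aggregators is governed by
> ROMIO's "cb_nodes" / "cb_config_list" hints, which default to one
> aggregator per host:
>
>     /* Collective counterpart of the independent write above: every
>      * rank must make this call; ROMIO's two-phase engine ships the
>      * data to the aggregator(s), which do the actual file I/O. */
>     MPI_File_write_at_all(fh, off, buf, NDOUBLES, MPI_DOUBLE,
>                           MPI_STATUS_IGNORE);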
>
> You are unlikely to get a speedup from a single node for two main
> reasons: there are the aforementioned lock-contention issues, and
> there is a single network link from that node to the 10 Lustre OSTs.
>
> In fact, due to the lock revocation activity I described above, you
> are likely to see a massive performance crash when you involve a
> second node. Then, as you add more nodes, you will finally light up
> enough network links to see a speedup.
>
> ==rob
>
>
> Best,
> Jialin
> Lawrence Berkeley Lab
>
>
> --
>
> Genius only means hard-working all one's life
>
>
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss