[mpich-discuss] single node mpiio

Jaln valiantljk at gmail.com
Tue Jan 12 14:01:21 CST 2016


Thanks Rob,
What is the reason for the whole-file lock? Is it due to the bug in Lustre
2.5 that was fixed in 2.6 and 2.7, or is it just a consequence of this I/O
pattern?
Also, without file locking, does the "single link" to the 10 OSTs still
prevent the I/O from being truly parallel?

Best,
Jialin
Lawrence Berkeley Lab

On Tue, Jan 12, 2016 at 11:22 AM, Rob Latham <robl at mcs.anl.gov> wrote:

>
>
> On 01/12/2016 11:36 AM, Jaln wrote:
>
>> Hi,
>> I'm running HDF5 with MPI-IO on a single compute node (32 cores) with 10
>> OSTs; the file system is Lustre v2.5.
>> I submit the job with 3 processes. They write to a shared file of about
>> 3 GB, and each process writes 1/3 of the file. For example, for a 4D
>> double array of shape 3*32*1024*128, each process writes a contiguous
>> 32*1024*128 block to the file.
>>
>> I observed some weird performance numbers. I tried both independent I/O
>> and collective I/O.
>> With independent I/O, the ranks seem to block one another and finish
>> writing one after another. With collective I/O, all three ranks report
>> the same I/O cost, which I think is because there is only one
>> aggregator.
>> My question is: in the case of independent I/O, do the I/Os block one
>> another when accessing the file?
>> If they do not block, can I expect linear speedup on a single node by
>> increasing the number of processes?
>>
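
For concreteness, here is a minimal MPI-IO sketch of the access pattern
described above (not the actual test program; the file name and the
independent/collective switch are made up for illustration). Each of the 3
ranks writes its contiguous 32*1024*128-double slab at offset
rank * slab_bytes and times its own write:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 32 * 1024 * 128;             /* doubles per rank */
    double *slab = malloc(count * sizeof(double)); /* fill omitted     */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    int collective = 0;   /* 0 = independent, 1 = collective */
    double t0 = MPI_Wtime();
    if (collective)
        MPI_File_write_at_all(fh, offset, slab, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
    else
        MPI_File_write_at(fh, offset, slab, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    double t1 = MPI_Wtime();
    printf("rank %d wrote %zu bytes in %.3f s\n",
           rank, (size_t)count * sizeof(double), t1 - t0);

    MPI_File_close(&fh);
    free(slab);
    MPI_Finalize();
    return 0;
}

Run with mpiexec -n 3; flipping "collective" to 1 switches to the two-phase
path discussed below.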
>
> There is a lot going on under the hood in collective mode.  Let's start
> with independent mode.  Here's my mental model of what's going on:
>
> - Your 3-process application issues 3 Lustre writes.
> - One of those will win the race to the file system, acquire a whole-file
> lock, and begin to write.
> - Now the request from the second-place process arrives.  Lustre will revoke
> the whole-file lock and issue a lock from the beginning of that request to
> the end of the file.
> - When the last process arrives, Lustre will revoke locks yet again and
> issue a lock from the beginning of its request to the end of the file.
>
> So yep, there is indeed a lot of blocking going on.  It's not a formal
> queue, more of a side effect of waiting for the Lustre MDS to issue the
> required locks.
>
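
If it helps: the file's Lustre striping can be requested at create time
through the reserved MPI-IO hints "striping_factor" and "striping_unit", so
that once more than one client node is writing, each writer mostly touches
its own stripes. Whether the hints are honored depends on the ROMIO driver
and on the file not existing yet; the values below are only illustrative,
and the function would stand in for the plain MPI_File_open in the sketch
above:

#include <mpi.h>

/* Hedged sketch: open/create a shared file with Lustre striping hints.
 * "striping_factor" and "striping_unit" are reserved MPI-IO hint keys, but
 * whether they take effect depends on the file-system driver and on the
 * file not already existing.  Values here are only illustrative. */
static int open_striped(const char *path, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "10");    /* spread over 10 OSTs */
    MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripes       */

    int rc = MPI_File_open(MPI_COMM_WORLD, path,
                           MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
    return rc;
}

On a single node this changes little, since all three ranks still share the
same Lustre client and network link.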
> In collective I/O (which I'll assume is a relatively recent ROMIO
> implementation), one of your three MPI processes will be the aggregator.
> You seem to know about two-phase collective I/O, so I will keep the rest of
> the explanation brief.  Yes, that aggregator will receive data from the
> other processes and issue all of the I/O.  The other processes will wait
> until the aggregator finishes, which is why you see all processes reporting
> the same I/O time.
>
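
On the aggregator point: the collective-buffering setup can be inspected and
nudged through ROMIO hints such as "romio_cb_write" and "cb_nodes", although
the implementation is free to ignore or adjust them. A rough sketch, with
illustrative values:

#include <mpi.h>
#include <stdio.h>

/* Hedged sketch: request collective buffering and check what ROMIO applied.
 * MPI_File_get_info reports the hint values actually in effect. */
static void report_cb_nodes(const char *path)
{
    MPI_Info info, used;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable"); /* force two-phase writes */
    MPI_Info_set(info, "cb_nodes", "2");            /* ask for 2 aggregators  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_File_get_info(fh, &used);                   /* hints actually in use  */
    char value[MPI_MAX_INFO_VAL + 1];
    int flag;
    MPI_Info_get(used, "cb_nodes", MPI_MAX_INFO_VAL, value, &flag);
    if (flag)
        printf("cb_nodes in effect: %s\n", value);

    MPI_Info_free(&used);
    MPI_Info_free(&info);
    MPI_File_close(&fh);
}

With all ranks on one node this will usually still report a single
aggregator, which matches the behavior described above.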
> You are unlikely to get a speedup on a single node for two main reasons:
> the aforementioned lock-contention issues, and the single link from that
> node to the 10 Lustre OSTs.
>
> In fact, due to the lock revocation activity I described above, you are
> likely to see a massive performance crash when you involve a second node.
> Then, as you add more nodes, you will finally light up enough network links
> to see a speedup.
>
> ==rob
>
>
>> Best,
>> Jialin
>> Lawrence Berkeley Lab
>>
>>



-- 

Genius only means hard-working all one's life