<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr">Thanks Rob,<div>What is the reason for the whole-file lock? Is it due to the bug in Lustre 2.5 that has been fixed in 2.6 and 2.7?</div><div>Or is it just due to this I/O pattern?</div><div>Without the file locking, would the "single link" to the 10 OSTs still keep the I/O from being truly parallel?</div><div><br></div><div>Best,</div><div>Jialin</div><div><span style="font-size:12.8px">Lawrence Berkeley Lab</span><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 12, 2016 at 11:22 AM, Rob Latham <span dir="ltr"><<a href="mailto:robl@mcs.anl.gov" target="_blank">robl@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>


<br>


On 01/12/2016 11:36 AM, Jaln wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Hi,<br>


I'm running HDF5 with MPI-IO on a single compute node (32 cores) with 10<br>


OSTs; the file system is Lustre v2.5.<br>


I submit the job with 3 processes. They write to a shared file,<br>


which is about 3 GB,<br>


and each process writes 1/3 of the file. For example,<br>


the array is a 4D double array, 3*32*1024*128, and each process writes<br>


a contiguous 32*1024*128 block to the file.<br>


<br>


I observed some weird performance numbers. I tried both independent I/O<br>


and collective I/O.<br>


In the case of independent I/O, the ranks seem to block each other and<br>


finish writing one after another. But in collective I/O, all three ranks<br>


report the same I/O cost; I think this is because there is only one<br>


aggregator.<br>


My question is, in the case of independent I/O, are the I/Os blocking<br>


when accessing the file?<br>


If they are not blocking, can I expect linear speedup on a single node by<br>


increasing the number of processes?<br>


</blockquote>


<br>


There is a lot going on under the hood in collective mode.  Let's start with independent mode.  Here's my mental model of what's going on:<br>


<br>


- Your 3-process application issues 3 Lustre writes.<br>


- One of those will win the race to the file system, acquire a whole-file lock, and begin to write.<br>


- Now the request from the 2nd place process arrives.  Lustre will revoke the whole-file lock and issue a lock from the beginning of that request to the end of the file.<br>


- When the last process arrives, Lustre will revoke locks yet again, and issue a lock from the beginning of that request to the end of the file.<br>


<br>


So yep, there is indeed a lot of blocking going on.  It's not a formal queue, more of a side effect of waiting for the Lustre lock servers to issue the required locks.<br>
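<br>
The lock sequence described above can be sketched as a toy model. This is only an illustration of the mental model, assuming the writes arrive in order of increasing file offset; the name `simulate_lock_grants` and the extent bookkeeping are made up for this sketch and are not how Lustre's lock manager is implemented.<br>

```python
import math

# Per-rank payload from the original post: 32*1024*128 doubles, 8 bytes each.
BYTES_PER_RANK = 32 * 1024 * 128 * 8


def simulate_lock_grants(starts):
    """Replay the lock hand-offs for writes arriving in the given order.

    starts: byte offsets at which each write begins, in arrival order.
            Assumed (for this toy model) to be in increasing order.
    Returns (locks, revocations), where locks is a list of (start, end)
    byte extents with end exclusive (math.inf stands for end of file) and
    revocations counts how often a granted lock had to be called back.
    """
    if list(starts) != sorted(starts):
        raise ValueError("toy model assumes arrival in increasing offset order")

    locks = []
    revocations = 0
    for i, start in enumerate(starts):
        if locks:
            # The previous holder's lock runs to end of file, so it
            # conflicts with the newcomer and is cut back to end at `start`.
            prev_start, _ = locks[-1]
            locks[-1] = (prev_start, start)
            revocations += 1
        # First writer gets a whole-file lock; later writers get a lock
        # from the beginning of their request to the end of the file.
        grant_start = 0 if i == 0 else start
        locks.append((grant_start, math.inf))
    return locks, revocations


if __name__ == "__main__":
    offsets = [rank * BYTES_PER_RANK for rank in range(3)]
    final_locks, revoked = simulate_lock_grants(offsets)
    print(final_locks, revoked)
```

In this model every writer after the first forces one revocation, so the writers end up serialized behind the lock traffic even though their byte ranges never overlap.<br>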


<br>


In collective I/O (which I'll assume is a relatively recent ROMIO implementation), one of your three MPI processes will be the aggregator.  You seem to know about two-phase collective I/O so I will make the rest of the explanation brief.  Yes, that aggregator will receive data from the other processes and issue all I/O.   The other processes will wait until the aggregator finishes, which is why you see all processes reporting the same run time for I/O.<br>


<br>


You are unlikely to get speedup from a single node for two main reasons: the aforementioned lock contention, and the single link from that node to the 10 Lustre OSTs.<br>


<br>


In fact, due to the lock revocation activity I described above, you are likely to see a massive performance crash when you involve a second node.  Then, as you add more nodes, you will finally light up enough network links to see a speedup.<br>
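<br>
The link argument can be put as a back-of-the-envelope bound. This is a rough sketch that deliberately ignores lock contention; the function name `peak_write_bandwidth` and the bandwidth figures in the usage lines are hypothetical placeholders, not measurements from any real system.<br>

```python
def peak_write_bandwidth(num_nodes, link_gb_per_s, num_osts, ost_gb_per_s):
    """Upper bound on aggregate write bandwidth in GB/s.

    The client side can push at most num_nodes * link_gb_per_s, and the
    server side can absorb at most num_osts * ost_gb_per_s. Lock
    contention is ignored, so real numbers can be far lower.
    """
    client_side = num_nodes * link_gb_per_s
    server_side = num_osts * ost_gb_per_s
    return min(client_side, server_side)


if __name__ == "__main__":
    # Hypothetical figures: 5 GB/s per node link, 1 GB/s per OST.
    for nodes in (1, 2, 4):
        print(nodes, peak_write_bandwidth(nodes, 5.0, 10, 1.0))
```

With one node the bound is set by that node's single link no matter how many processes run on it, which is why adding processes on the same node does not help.<br>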


<br>


==rob<br>


<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>


Best,<br>


Jialin<br>


Lawrence Berkeley Lab<br>


<br>


<br>


_______________________________________________<br>


discuss mailing list     <a href="mailto:discuss@mpich.org" target="_blank">discuss@mpich.org</a><br>


To manage subscription options or unsubscribe:<br>


<a href="https://lists.mpich.org/mailman/listinfo/discuss" rel="noreferrer" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>


<br>


</blockquote>




</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><pre style="margin:0in;font-family:Arial,Helvetica,sans-serif;font-size:large" lang="en-US" align="left"><font size="2"><font face="comic sans ms,sans-serif"><font color="#666666"><span>Genius only means hard-working all one's life</span></font></font></font></pre></div>


</div>