<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr">ah, Thanks Wei-keng, I'll try that. <div><br></div><div>Best,</div><div>Jialin</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 12, 2016 at 11:59 AM, Wei-keng Liao <span dir="ltr"><<a href="mailto:wkliao@eecs.northwestern.edu" target="_blank">wkliao@eecs.northwestern.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi, Rob,<br>


<br>


I just want to clarify the locking behavior on Lustre (at least from my<br>


understanding). The 3 independent I/O requests coming from the same compute<br>


node will be treated by Lustre as requests from the same client<br>


(because I/O calls are system calls and all 3 processes run on the same OS).<br>


Thus, no lock granting or revoking will occur among the processes on<br>


the same node.<br>


<br>


The three independent requests appear to run one after another because<br>


(as you explained) there is only one link from that node to the Lustre<br>


system. This is probably the system behavior. Jialin, you might want to<br>


verify this by writing a simple MPI program using POSIX write calls.<br>
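<br>
A minimal sketch of such a test follows. To stay self-contained it uses Python's multiprocessing with os.pwrite as a stand-in for an MPI program calling POSIX write; the function names, the fork start method (Linux only), and the 1 MiB default chunk size are assumptions for illustration. Each writer waits at a barrier, writes its contiguous chunk at offset rank * chunk_bytes, and reports wall-clock start and end times. Overlapping intervals would suggest concurrent writes; back-to-back intervals would suggest serialization.<br>
<pre>
```python
import multiprocessing as mp
import os
import sys
import tempfile
import time


def writer(path, rank, chunk_bytes, barrier, results):
    """Write one contiguous chunk with POSIX pwrite and report its timing."""
    buf = memoryview(bytes([rank + 1]) * chunk_bytes)  # distinct fill byte per rank
    fd = os.open(path, os.O_WRONLY)
    barrier.wait()  # release all writers together, in the spirit of MPI_Barrier
    start = time.time()
    written = 0
    while written != chunk_bytes:
        # pwrite may write fewer bytes than requested, so loop until done
        written += os.pwrite(fd, buf[written:], rank * chunk_bytes + written)
    os.fsync(fd)  # ask the OS to flush the data before stopping the clock
    end = time.time()
    os.close(fd)
    results.put((rank, start, end))


def run_writers(path, nprocs=3, chunk_bytes=1024 * 1024):
    """Start nprocs writers on one node; return sorted (rank, start, end) tuples."""
    open(path, "wb").close()  # create or truncate the shared file
    ctx = mp.get_context("fork")  # assumption: Linux, where fork is available
    barrier = ctx.Barrier(nprocs)
    results = ctx.Queue()
    procs = [
        ctx.Process(target=writer,
                    args=(path, rank, chunk_bytes, barrier, results),
                    daemon=True)
        for rank in range(nprocs)
    ]
    for proc in procs:
        proc.start()
    timings = sorted(results.get(timeout=120) for _ in procs)
    for proc in procs:
        proc.join()
    return timings


if __name__ == "__main__":
    # Optional first argument: a file path on the file system under test.
    if len(sys.argv) == 2:
        target = sys.argv[1]
    else:
        target = os.path.join(tempfile.mkdtemp(), "shared.dat")
    timings = run_writers(target)
    t0 = min(start for _, start, _ in timings)
    for rank, start, end in timings:
        print(f"rank {rank}: start {start - t0:.4f} s, end {end - t0:.4f} s")
```
</pre>
To try it on Lustre, pass a path on the Lustre mount as the first argument and raise chunk_bytes toward the real per-process write size; with the small default, caching effects may hide any serialization.<br>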


<br>


<br>


Wei-keng<br>


<br>


On Jan 12, 2016, at 1:22 PM, Rob Latham wrote:<br>


<br>


><br>


><br>


> On 01/12/2016 11:36 AM, Jaln wrote:<br>


>> Hi,<br>


>> I'm running HDF5 with MPI-IO on a single compute node (32 cores) with 10<br>


>> OSTs; the file system is Lustre v2.5.<br>


>> I submit the job with 3 processes that write to a shared file<br>


>> of about 3 GB,<br>


>> and each process writes 1/3 of the file. For example,<br>


>> the array is a 4D double array, 3*32*1024*128, and each process writes a<br>


>> contiguous 32*1024*128 block to the file.<br>


>><br>


>> I observed some weird performance numbers. I tried both independent I/O<br>


>> and collective I/O.<br>


>> In the case of independent I/O, the ranks seem to block each other and<br>


>> finish writing one after another. But in collective I/O, all three ranks<br>


>> report the same I/O cost; I think this is because there is only one<br>


>> aggregator.<br>


>> My question is, in the case of independent I/O, are the I/Os blocking<br>


>> when accessing the file?<br>


>> If they are not blocking, can I expect linear speedup on a single node by<br>


>> increasing the number of processes?<br>


><br>


> There is a lot going on under the hood in collective mode.  Let's start with independent mode.  Here's my mental model of what's going on:<br>


><br>


> - Your 3-process application issues 3 Lustre writes.<br>


> - One of those will win the race to the file system, acquire a whole-file lock, and begin to write.<br>


> - Now the request from the 2nd place process arrives.  Lustre will revoke the whole-file lock and issue a lock from the beginning of the request to the end of the file.<br>


> - When the last process arrives, Lustre will revoke locks yet again, and issue a lock from the beginning of the request to the end of the file.<br>


><br>


> So yep, there is indeed a lot of blocking going on.  It's not a formal queue, more of a side effect of waiting for the Lustre MDS to issue the required locks.<br>


><br>


> In collective I/O (which I'll assume is a relatively recent ROMIO implementation), one of your three MPI processes will be the aggregator.  You seem to know about two-phase collective I/O so I will make the rest of the explanation brief.  Yes, that aggregator will receive data from the other processes and issue all I/O.  The other processes will wait until the aggregator finishes, which is why you see all processes reporting the same run time for I/O.<br>


><br>


> You are unlikely to get speedup from a single node for two main reasons: the aforementioned lock contention issues, and the single link from that node to the 10 Lustre OSTs.<br>


><br>


> In fact, due to the lock revocation activity I described above, you are likely to see a massive performance crash when you involve a second node.  Then, as you add more nodes, you will finally light up enough network links to see a speedup.<br>


><br>


> ==rob<br>


><br>


>><br>


>> Best,<br>


>> Jialin<br>


>> Lawrence Berkeley Lab<br>


>><br>


>><br>


>> _______________________________________________<br>


>> discuss mailing list     <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>


>> To manage subscription options or unsubscribe:<br>


>> <a href="https://lists.mpich.org/mailman/listinfo/discuss" rel="noreferrer" target="_blank">https://lists.mpich.org/mailman/listinfo/discuss</a><br>


>><br>




</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><pre style="margin:0in;font-family:Arial,Helvetica,sans-serif;font-size:large" lang="en-US" align="left"><font size="2"><font face="comic sans ms,sans-serif"><font color="#666666"><span>Genius only means hard-working all one's life</span></font></font></font></pre></div>


</div>