Ah, thanks Wei-keng, I'll try that.

Best,
Jialin

On Tue, Jan 12, 2016 at 11:59 AM, Wei-keng Liao <wkliao@eecs.northwestern.edu> wrote:

Hi, Rob,

I just want to clarify the locking behavior on Lustre (at least as I
understand it). The three independent I/O requests coming from the same
compute node will be seen by Lustre as requests from the same client
(because the I/O operations are system calls and all three processes run
under the same OS instance). Thus, no lock granting or revoking will occur
among processes on the same node.

The reason the three independent requests appear to run one after another
is that (as you explained) there is only one link from that node to the
Lustre system. This is probably just the system behavior. Jialin, you might
want to verify this by writing a simple MPI program using plain POSIX write
calls, something like the sketch below.
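A minimal sketch of such a test, assuming the per-rank slice size from the
original description (the file name, permissions, and lack of error checking
are placeholders, not part of the original setup):

/* Each rank issues one plain POSIX pwrite() to its own contiguous slice of
 * a shared file, with no MPI-IO involved, and reports its own elapsed time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t nbytes = (size_t)32 * 1024 * 128 * sizeof(double);  /* one slice */
    char *buf = calloc(nbytes, 1);

    int fd = open("testfile", O_CREAT | O_WRONLY, 0644);
    double t0 = MPI_Wtime();
    ssize_t rc = pwrite(fd, buf, nbytes, (off_t)rank * nbytes); /* rank-specific offset */
    double t1 = MPI_Wtime();
    close(fd);

    printf("rank %d wrote %zd bytes in %.3f seconds\n", rank, rc, t1 - t0);

    free(buf);
    MPI_Finalize();
    return 0;
}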


Wei-keng

On Jan 12, 2016, at 1:22 PM, Rob Latham wrote:

>
>
> On 01/12/2016 11:36 AM, Jaln wrote:
>> Hi,
>> I'm running HDF5 with MPI-IO on a single compute node (32 cores) and 10
>> OSTs; the file system is Lustre v2.5.
>> I submit the job with 3 processes. They write to a shared file of about
>> 3 GB, and each process writes 1/3 of the file. For example, the array is
>> a 4D double array of shape 3*32*1024*128, so each process writes a
>> contiguous 32*1024*128-element block to the file.
>>
>> I observed some weird performance numbers when trying both independent
>> I/O and collective I/O.
>> With independent I/O, the ranks seem to block one another and finish
>> writing one after another. But with collective I/O, all three ranks
>> report the same I/O cost; I think this is because there is only one
>> aggregator.
>> My question is: in the case of independent I/O, are the I/Os blocking
>> when accessing the file?
>> If they are not blocking, can I expect linear speedup on a single node
>> by increasing the number of processes?
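For illustration, a minimal sketch of the independent-write pattern described
above, with each rank writing its contiguous 32*1024*128-double slice at a
rank-specific offset (the file name and the omission of error checking are
simplifications, not from the original code):

/* Each of the 3 ranks writes its contiguous slice at offset rank * slice. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 32 * 1024 * 128;                           /* doubles per rank */
    double *buf = malloc((size_t)n * sizeof(double));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at(fh, off, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}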
>
> There is a lot going on under the hood in collective mode. Let's start with independent mode. Here's my mental model of what's going on:
>
> - Your 3-process application issues 3 Lustre writes.
> - One of those will win the race to the file system, acquire a whole-file lock, and begin to write.
> - Now the request from the 2nd-place process arrives. Lustre will revoke the whole-file lock and issue a lock from the beginning of that request to the end of the file.
> - When the last process arrives, Lustre will revoke locks yet again and issue a lock from the beginning of its request to the end of the file.
>
> So yep, there is indeed a lot of blocking going on. It's not a formal queue; it's more a side effect of waiting for the Lustre MDS to issue the required locks.
>
> In collective I/O (which I'll assume is a relatively recent ROMIO implementation), one of your three MPI processes will be the aggregator. You seem to know about two-phase collective I/O, so I will keep the rest of the explanation brief. Yes, that aggregator will receive data from the other processes and issue all of the I/O. The other processes wait until the aggregator finishes, which is why you see all processes reporting the same run time for I/O.
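Continuing the independent-I/O sketch above (reusing rank, n, and buf), the
collective path is simply the _all variant of the same call; the "cb_nodes"
hint is a standard ROMIO hint controlling the number of aggregator nodes, and
the value shown here is illustrative only:

/* Collective variant: write_at_all lets ROMIO's two-phase machinery ship all
 * data to the aggregator(s), which then issue the file-system writes. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "1");        /* number of aggregator nodes */

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "shared.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);
MPI_File_write_at_all(fh, off, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

MPI_File_close(&fh);
MPI_Info_free(&info);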
>
> You are unlikely to get speedup from a single node for two main reasons: the aforementioned lock contention, and the fact that there is a single link from that node to the 10 Lustre OSTs.
>
> In fact, due to the lock revocation activity I described above, you are likely to see a massive performance crash when you involve a second node. Then, as you add more nodes, you will finally light up enough network links to see a speedup.
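As an aside, seeing that multi-node speedup also requires the shared file to
actually be striped across the 10 OSTs. One way to request that, sketched
below with illustrative values, is the standard MPI-IO striping hints at file
creation time (an "lfs setstripe" on the output directory achieves the same):

/* Illustrative only: ask for the newly created file to be striped across all
 * 10 OSTs with a 1 MiB stripe size. These hints take effect only at creation. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "10");     /* stripe count = number of OSTs */
MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MiB stripe size */

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "shared.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_Info_free(&info);
/* ... writes as before ... */
MPI_File_close(&fh);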
>
> ==rob
>
>>
>> Best,
>> Jialin
>> Lawrence Berkeley Lab
>>
>>

_______________________________________________
discuss mailing list     discuss@mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

-- 
Genius only means hard-working all one's life