[mpich-discuss] single node mpiio

Wei-keng Liao wkliao at eecs.northwestern.edu
Tue Jan 12 13:59:02 CST 2016


Hi, Rob,

I just want to clarify the locking behavior on Lustre (at least from my
understanding). The 3 independent I/O requests coming from the same compute
node will be considered by Lustre as requests from the same client
(because the I/O calls are system calls and all 3 processes run on the
same OS). Thus, no lock granting or revoking will occur among the
processes on the same node.

The reason the three independent requests appear to run one after another
is, as you explained, that there is only one link from that node to the
Lustre system. This is probably the system behavior. Jialin, you might want to
verify this by writing a simple MPI program using POSIX write calls; a
sketch is included below.
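
Something along these lines would do. This is a minimal, untested sketch;
the file name "posix_test.out" and the error handling are my own choices,
and the per-rank buffer size is picked to match the 32*1024*128-double
layout Jialin described. Each rank writes its contiguous 1/3 of the shared
file with a plain POSIX pwrite() at its own offset and times the call, so
any serialization on the single node shows up directly in the per-rank
timings:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 32*1024*128 doubles per rank, i.e. 32 MiB each */
    size_t count  = (size_t)32 * 1024 * 128;
    size_t nbytes = count * sizeof(double);
    double *buf = malloc(nbytes);
    for (size_t i = 0; i < count; i++)
        buf[i] = (double)rank;

    int fd = open("posix_test.out", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* each rank writes its own contiguous 1/3 of the shared file */
    off_t offset = (off_t)rank * nbytes;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    ssize_t written = pwrite(fd, buf, nbytes, offset);
    double t1 = MPI_Wtime();

    if (written != (ssize_t)nbytes)
        fprintf(stderr, "rank %d: short write\n", rank);
    printf("rank %d wrote %zu bytes in %.3f seconds\n",
           rank, nbytes, t1 - t0);

    close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}

If the per-rank times still look serialized with plain POSIX writes, that
would point at the node's single link to the Lustre servers rather than at
lock granting/revoking among the processes.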


Wei-keng

On Jan 12, 2016, at 1:22 PM, Rob Latham wrote:

> 
> 
> On 01/12/2016 11:36 AM, Jaln wrote:
>> Hi,
>> I'm running HDF5 with MPI-IO on a single compute node (32 cores) and 10
>> OSTs; the file system is Lustre v2.5.
>> I submit the job with 3 processes. They write to a shared file of about
>> 3 GB, and each process writes 1/3 of the file. For example, the array is
>> a 4D double array, 3*32*1024*128, and each process writes a contiguous
>> 32*1024*128 block to the file.
>> 
>> I observed some weird performance numbers. I tried both independent I/O
>> and collective I/O.
>> In the case of independent I/O, the ranks seem to block each other and
>> finish writing one after another. But in collective I/O, all three ranks
>> report the same I/O cost; I think this is because there is only one
>> aggregator.
>> My question is: in the case of independent I/O, are the I/Os blocking
>> when accessing the file?
>> If they are not blocking, can I expect linear speedup on a single node
>> by increasing the number of processes?
> 
> There is a lot going on under the hood in collective mode.  Let's start with independent mode.  Here's my mental model of what's going on:
> 
> - Your 3-process application issues 3 Lustre writes.
> - One of those will win the race to the file system, acquire a whole-file lock, and begin to write.
> - Now the request from the 2nd-place process arrives.  Lustre will revoke the whole-file lock and issue a lock from the beginning of that request to the end of the file.
> - When the last process arrives, Lustre will revoke locks yet again, and issue a lock from the beginning of its request to the end of the file.
> 
> So yep, there is indeed a lot of blocking going on.  It's not a formal queue, more of a side effect of waiting for the Lustre MDS to issue the required locks.
> 
> In collective I/O (which I'll assume is a relatively recent ROMIO implementation), one of your three MPI processes will be the aggregator.  You seem to know about two-phase collective I/O so I will make the rest of the explanation brief.  Yes, that aggregator will receive data from the other processes and issue all I/O.  The other processes will wait until the aggregator finishes, which is why you see all processes reporting the same run time for I/O.
> 
> You are unlikely to get speedup from a single node for two main reasons: first, the aforementioned lock contention; second, there is a single link from that node to the 10 Lustre OSTs.
> 
> In fact, due to the lock revocation activity I described above, you are likely to see a massive performance crash when you involve a second node.  Then, as you add more nodes, you will finally light up enough network links to see a speedup.
> 
> ==rob
> 
>> 
>> Best,
>> Jialin
>> Lawrence Berkeley Lab
>> 

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

