[mpich-discuss] single node mpiio

Rob Latham robl at mcs.anl.gov
Tue Jan 12 13:22:20 CST 2016



On 01/12/2016 11:36 AM, Jaln wrote:
> Hi,
> I'm running HDF5 with MPI-IO on a single compute node (32 cores) and
> 10 OSTs; the file system is Lustre v2.5.
> I submit the job with 3 processes writing to a shared file of about
> 3 GB; each process writes 1/3 of the file. For example, the array is
> a 4D double array, 3*32*1024*128, so each process writes a contiguous
> 32*1024*128 block to the file.
>
> I observed some weird performance numbers. I tried both independent
> I/O and collective I/O.
> With independent I/O, the ranks seem to block each other and finish
> writing one after another. But with collective I/O, all three ranks
> report the same I/O cost; I think this is because there is only one
> aggregator.
> My question is: in the case of independent I/O, are the I/Os blocking
> when accessing the file?
> If not, can I expect linear speedup on a single node by increasing
> the number of processes?

There is a lot going on under the hood in collective mode.  Let's start 
with independent mode.  Here's my mental model of what's going on:

- Your 3-process application issues 3 Lustre writes.
- One of those will win the race to the file system, acquire a 
whole-file lock, and begin to write.
- Now the request from the 2nd-place process arrives.  Lustre will 
revoke the whole-file lock and issue a lock from the beginning of that 
request to the end of the file.
- When the last process arrives, Lustre will revoke locks yet again 
and issue a lock from the beginning of its request to the end of the file.

So yep, there is indeed a lot of blocking going on.  It's not a formal 
queue, more of a side effect of waiting for the Lustre lock servers on 
the OSTs to issue the required locks.
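
For concreteness, here's a minimal sketch (my reconstruction, not your 
actual code) of the independent-mode pattern we're talking about: each 
rank issues one contiguous MPI_File_write_at at its own offset, and 
those three writes are what race for the Lustre extent locks:

    /* independent-mode sketch: 3 ranks, one contiguous write each */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 32 * 1024 * 128;           /* doubles per rank */
        double *buf = calloc(count, sizeof(double)); /* rank's 1/3 of the array */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* each rank writes its contiguous 1/3 at rank * bytes-per-rank;
           these three writes are the ones racing for extent locks */
        MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }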

In collective I/O (which I'll assume is a relatively recent ROMIO 
implementation), one of your three MPI processes will be the aggregator. 
You seem to know about two-phase collective I/O, so I will keep the 
rest of the explanation brief: yes, that aggregator will receive data 
from the other processes and issue all of the I/O.  The other processes 
will wait until the aggregator finishes, which is why you see all 
processes reporting the same run time for I/O.
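
For comparison, the collective version of the sketch above is the same 
call with _all appended, plus (optionally) ROMIO's collective-buffering 
hints.  This is just an illustrative fragment reusing the declarations 
above; cb_nodes and romio_cb_write are standard ROMIO hints, and on a 
single node you'd end up with one aggregator by default anyway:

    /* collective-mode sketch: two-phase I/O through an aggregator */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable"); /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "1");            /* number of aggregator nodes */

    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Info_free(&info);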

You are unlikely to get a speedup from a single node for two main 
reasons: there are the aforementioned lock-contention issues, and there 
is only a single network link from that node to the 10 Lustre OSTs.

In fact, due to the lock revocation activity I described above, you are 
likely to see a massive performance crash when you involve a second 
node.  Then, as you add more nodes, you will finally light up enough 
network links to see a speedup.
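
Before that multi-node experiment, it's worth confirming the file 
really is striped across all 10 OSTs.  One way (values illustrative) is 
the standard ROMIO striping hints, which are only honored when the file 
is first created; 'lfs setstripe' on the directory works too.  Another 
fragment reusing the declarations above:

    /* striping sketch: request all 10 OSTs with a 1 MiB stripe size */
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "10");    /* stripe across 10 OSTs */
    MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripe size */
    MPI_File_open(MPI_COMM_WORLD, "newfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);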

==rob

>
> Best,
> Jialin
> Lawrence Berkeley Lab
>
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

