[mpich-discuss] single node mpiio
Rob Latham
robl at mcs.anl.gov
Tue Jan 12 13:22:20 CST 2016
On 01/12/2016 11:36 AM, Jaln wrote:
> Hi,
> I'm running HDF5 with MPI-IO on a single compute node (32 cores) against
> 10 OSTs; the file system is Lustre v2.5.
> I submit the job with 3 processes writing to a shared file of about 3 GB,
> and each process writes 1/3 of the file. For example,
> the array is a 4D double array of shape 3*32*1024*128, and each process
> writes a contiguous 32*1024*128 block to the file.
>
> I observed some weird performance numbers. I tried both independent I/O
> and collective I/O.
> In the case of independent I/O, the ranks seem to block each other and
> finish writing one after another. But in collective I/O, all three ranks
> report the same I/O cost; I think this is because there is only one
> aggregator.
> My question is, in the case of independent I/O, are the I/Os blocking
> when accessing the file?
> If not blocking, can I expect linear speedup on a single node by
> increasing the number of processes?
There is a lot going on under the hood in collective mode. Let's start
with independent mode. Here's my mental model of what's going on:
- Your 3-process application issues 3 Lustre writes.
- One of those will win the race to the file system, acquire a
whole-file lock, and begin to write.
- Now the request from the 2nd-place process arrives. Lustre will
revoke the whole-file lock and issue a lock from the beginning of that
request to the end of the file.
- When the last process arrives, Lustre will revoke locks yet again,
and issue a lock from the beginning of the request to the end of the file.
So yep, there is indeed a lot of blocking going on. It's not a formal
queue; it's more a side effect of waiting for the Lustre MDS to issue
the required locks.
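To make the pattern concrete, here is a minimal sketch of what I
understand your independent-mode writes to look like. The file name is
a placeholder and the counts come from your description; this is not
your code, just the shape of the access:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 32 * 1024 * 128;      /* doubles per rank */
        double *buf = calloc(count, sizeof(double));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* independent write: each rank issues its own request, so
           all three race for (and keep revoking) the extent locks */
        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }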
In collective I/O (which I'll assume is a relatively recent ROMIO
implementation), one of your three MPI processes will be the aggregator.
You seem to know about two-phase collective I/O, so I will keep the
rest of the explanation brief. Yes, that aggregator will receive data
from the other processes and issue all the I/O. The other processes will
wait until the aggregator finishes, which is why you see all processes
reporting the same I/O time.
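For contrast, the collective version of the sketch above differs by a
single call: swap MPI_File_write_at for its _all counterpart. That one
change is what allows ROMIO to funnel everything through the aggregator:

    /* collective write: two-phase I/O kicks in; non-aggregator ranks
       ship their data to the aggregator and then wait for it, which
       is why all three ranks report the same I/O time */
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);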
You are unlikely to get a speedup from a single node for two main
reasons: first, the aforementioned lock contention; second, there is
only a single network link from that node to the 10 Lustre OSTs.
In fact, due to the lock revocation activity I described above, you are
likely to see a massive performance crash when you involve a second
node. Then, as you add more nodes, you will finally light up enough
network links to see a speedup.
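When you do go multi-node, the usual knobs are MPI-IO hints. A hedged
sketch, assuming ROMIO's Lustre driver; note the striping hints only
take effect when the file is created, and "4" aggregator nodes is just
an example value, not a recommendation:

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "10");    /* all 10 OSTs */
    MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripes */
    MPI_Info_set(info, "cb_nodes", "4");            /* aggregator count */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);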
==rob
>
> Best,
> Jialin
> Lawrence Berkeley Lab
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss