[mpich-discuss] single node mpiio

Jaln valiantljk at gmail.com
Tue Jan 12 14:02:57 CST 2016


Ah, thanks Wei-keng, I'll try that.

Best,
Jialin

On Tue, Jan 12, 2016 at 11:59 AM, Wei-keng Liao <wkliao at eecs.northwestern.edu> wrote:

> Hi, Rob,
>
> I just want to clarify the locking behavior on Lustre (at least from my
> understanding). The 3 independent I/O requests coming from the same compute
> node will be considered by Lustre as requests from the same client
> (because the I/O requests are system calls and all 3 processes run on the
> same OS). Thus, no lock granting or revoking will occur among the processes
> on the same node.
>
> The reason the three independent requests appear to run one after another
> is, as you explained, that there is only one link from that node to the
> Lustre system. This is probably the system behavior. Jialin, you might want
> to verify this by writing a simple MPI program using POSIX write calls.
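>
> (A minimal sketch of such a test, in case it helps. Untested; the file
> name "testfile" is a placeholder, and the per-rank size just mirrors the
> decomposition described below. Each rank pwrite()s its own contiguous
> region and times it, with no MPI-IO layer involved.)
>
> #include <mpi.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* same per-process amount as in the decomposition below */
>     size_t nbytes = (size_t)32 * 1024 * 128 * sizeof(double);
>     char *buf = malloc(nbytes);
>     memset(buf, rank, nbytes);
>
>     /* plain POSIX I/O into this rank's slice of the shared file */
>     int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
>     double t0 = MPI_Wtime();
>     pwrite(fd, buf, nbytes, (off_t)rank * (off_t)nbytes);
>     double t1 = MPI_Wtime();
>     close(fd);
>
>     printf("rank %d: %zu bytes in %.3f s\n", rank, nbytes, t1 - t0);
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }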
>
>
> Wei-keng
>
> On Jan 12, 2016, at 1:22 PM, Rob Latham wrote:
>
> >
> >
> > On 01/12/2016 11:36 AM, Jaln wrote:
> >> Hi,
> >> I'm running HDF5 with MPI-IO on a single compute node (32 cores) and 10
> >> OSTs; the file system is Lustre v2.5.
> >> I submit the job with 3 processes, and they write to a shared file of
> >> about 3 GB, each process writing 1/3 of the file. For example:
> >> the array is a 4D double array, 3*32*1024*128, so each process writes
> >> 32*1024*128 doubles to the file, which is contiguous.
> >>
> >> I observed some weird performance numbers when I tried both independent
> >> I/O and collective I/O.
> >> In the case of independent I/O, the ranks seem to block each other and
> >> finish writing one after another. But in collective I/O, all three ranks
> >> report the same I/O cost; I think this is because there is only one
> >> aggregator.
> >> My question is: in the case of independent I/O, are the I/Os blocking
> >> when accessing the file?
> >> If they are not blocking, can I expect linear speedup on a single node
> >> by increasing the number of processes?
> >
> > There is a lot going on under the hood in collective mode.  Let's start
> > with independent mode.  Here's my mental model of what's going on:
> >
> > - Your 3-process application issues 3 Lustre writes.
> > - One of those will win the race to the file system, acquire a
> >   whole-file lock, and begin to write.
> > - Now the request from the 2nd-place process arrives.  Lustre will
> >   revoke the whole-file lock and issue a lock from the beginning of that
> >   request to the end of the file.
> > - When the last process arrives, Lustre will revoke locks yet again,
> >   and issue a lock from the beginning of its request to the end of the
> >   file.
> >
> > So yep, there is indeed a lot of blocking going on.  It's not a formal
> > queue, more of a side effect of waiting for the Lustre MDS to issue the
> > required locks.
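> >
> > (For concreteness, a sketch of the independent-write pattern that
> > triggers this lock sequence, assuming plain MPI-IO rather than HDF5;
> > fh, buf, and rank are assumed to be set up as in your program:)
> >
> >     /* each rank issues one independent write to its own 1/3 of the file */
> >     MPI_Offset nbytes = (MPI_Offset)32 * 1024 * 128 * sizeof(double);
> >     MPI_File_open(MPI_COMM_WORLD, "shared.dat",
> >                   MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
> >     MPI_File_write_at(fh, rank * nbytes, buf, 32 * 1024 * 128,
> >                       MPI_DOUBLE, MPI_STATUS_IGNORE);
> >     /* each write reaches Lustre separately, so the locks above are
> >        granted and revoked one request at a time */
> >     MPI_File_close(&fh);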
> >
> > In collective I/O (which I'll assume is a relatively recent ROMIO
> > implementation), one of your three MPI processes will be the aggregator.
> > You seem to know about two-phase collective I/O, so I will keep the rest
> > of the explanation brief.  Yes, that aggregator will receive data from
> > the other processes and issue all the I/O.  The other processes will wait
> > until the aggregator finishes, which is why you see all processes
> > reporting the same run time for I/O.
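> >
> > (Sketch of the collective counterpart.  "cb_nodes" and "romio_cb_write"
> > are standard ROMIO hints, but whether extra aggregators actually get
> > placed, or help at all, on a single node depends on the ROMIO version
> > and its defaults; same assumed fh, buf, rank, nbytes as above:)
> >
> >     MPI_Info info;
> >     MPI_Info_create(&info);
> >     MPI_Info_set(info, "cb_nodes", "3");            /* ask for 3 aggregators */
> >     MPI_Info_set(info, "romio_cb_write", "enable"); /* force two-phase writes */
> >     MPI_File_open(MPI_COMM_WORLD, "shared.dat",
> >                   MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
> >     MPI_File_write_at_all(fh, rank * nbytes, buf, 32 * 1024 * 128,
> >                           MPI_DOUBLE, MPI_STATUS_IGNORE);
> >     MPI_File_close(&fh);
> >     MPI_Info_free(&info);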
> >
> > You are unlikely to get a speedup from a single node for two main
> > reasons: there are the aforementioned lock-contention issues, and there
> > is a single link from that node to the 10 Lustre OSTs.
> >
> > In fact, due to the lock-revocation activity I described above, you are
> > likely to see a massive performance crash when you involve a second node.
> > Then, as you add more nodes, you will finally light up enough network
> > links to see a speedup.
> >
> > ==rob
> >
> >>
> >> Best,
> >> Jialin
> >> Lawrence Berkeley Lab
> >>
> >>



-- 

Genius only means hard-working all one's life
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

