[mpich-devel] ROMIO collective i/o memory use
bobc at us.ibm.com
Mon May 6 16:54:10 CDT 2013
I did a couple quick searches and this goes back to bg/l, not bg/p. Jeff
once investigated a problem and said
In general, for smaller blocks, MPI-IO performed better than POSIX IO. For
a midplane, they are about equal. For a rack, MPI-IO is noticeably slower.
I am now suspecting the collective phase of MPI-IO may be taking the time.
Specifying -env BGLMPIO_TUNEGATHER=0 did not significantly change the 1
rack result. This controls using allgather (0) vs allreduce (1 - the
default) to communicate start and end offsets among the nodes.
Specifying -env BGLMPIO_COMM=1 made the 1 rack result twice as slow.
This controls using alltoallv (0 - the default) vs send/recv (1) to do the
Specifying -env BGLMPIO_TUNEBLOCKING=0 made the 1 rack result so slow that
I cancelled the job. This controls whether to take psets and GPFS into
account (1 - the default) or not (0).
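
For readers unfamiliar with the BGLMPIO_TUNEGATHER choice, here is a minimal single-process sketch of the two ways every rank can learn every other rank's start and end offsets. The quoted text only says the variable selects allgather or allreduce. The arrangement below (each rank fills its own two slots of a zeroed vector and the vectors are combined with an elementwise max) is an assumption about how such an allreduce could be set up, and it relies on offsets being non-negative. The function names are hypothetical and no MPI library is called.

```python
def allgather_offsets(local):
    """Simulate allgather: every rank ends up with the full list of
    (start, end) pairs, in rank order."""
    return [list(local) for _ in local]


def allreduce_offsets(local):
    """Simulate the same exchange with an allreduce (assumed layout).

    Each rank contributes a zeroed vector of length 2 * nprocs with only
    its own start and end filled in. An elementwise max over all vectors
    recovers every pair, assuming offsets are non-negative.
    """
    nprocs = len(local)
    contributions = []
    for rank, (start, end) in enumerate(local):
        vec = [0] * (2 * nprocs)
        vec[2 * rank] = start
        vec[2 * rank + 1] = end
        contributions.append(vec)
    # Elementwise max across all ranks' vectors (the "reduce" step).
    reduced = [max(column) for column in zip(*contributions)]
    pairs = [(reduced[2 * r], reduced[2 * r + 1]) for r in range(nprocs)]
    # Every rank receives the same reduced result (the "all" step).
    return [list(pairs) for _ in range(nprocs)]


if __name__ == "__main__":
    local = [(0, 99), (100, 199), (200, 299), (300, 399)]
    print(allgather_offsets(local) == allreduce_offsets(local))
```

Both functions hand each rank identical data, so the setting changes only which collective carries it. The sketch says nothing about which is faster on Blue Gene.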
I can't find the issue where he first implemented BGLMPIO_COMM. I seem to
remember it performed MUCH better on small scattered i/o than send/recv.
He had tables and numbers which I can't find right now.
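
To make the alltoallv versus send/recv choice concrete, here is a minimal single-process sketch of the data movement only. It is not ROMIO or MPI code: the function names and the list-of-lists buffers are hypothetical, and it measures neither speed nor memory. It only shows that a pairwise exchange which skips empty messages delivers the same data as the alltoallv semantics, where rank i's block for rank j is described by sendcounts[i][j] and sdispls[i][j].

```python
def alltoallv(sendbufs, sendcounts, sdispls):
    """Reference semantics: rank dst receives, in source-rank order,
    the block each rank src addressed to it."""
    nprocs = len(sendbufs)
    recvbufs = []
    for dst in range(nprocs):
        out = []
        for src in range(nprocs):
            start = sdispls[src][dst]
            out.extend(sendbufs[src][start:start + sendcounts[src][dst]])
        recvbufs.append(out)
    return recvbufs


def pairwise_send_recv(sendbufs, sendcounts, sdispls):
    """Same exchange as individual messages, skipping pairs with no data."""
    nprocs = len(sendbufs)
    # mailbox[dst][src] holds the message that src sent to dst, if any.
    mailbox = [[None] * nprocs for _ in range(nprocs)]
    for src in range(nprocs):
        for dst in range(nprocs):
            count = sendcounts[src][dst]
            if count == 0:
                continue  # nothing to send to this peer
            start = sdispls[src][dst]
            mailbox[dst][src] = sendbufs[src][start:start + count]
    recvbufs = []
    for dst in range(nprocs):
        out = []
        for src in range(nprocs):
            if mailbox[dst][src] is not None:
                out.extend(mailbox[dst][src])
        recvbufs.append(out)
    return recvbufs


if __name__ == "__main__":
    sendbufs = [[1, 2, 3], [4], [5, 6]]
    sendcounts = [[1, 2, 0], [0, 0, 1], [1, 0, 1]]
    sdispls = [[0, 1, 3], [0, 0, 0], [0, 1, 1]]
    print(pairwise_send_recv(sendbufs, sendcounts, sdispls))
```

In a real MPI program the first pattern is a single MPI_Alltoallv call whose count and displacement arrays each have one entry per rank in the communicator, while the second posts one message per non-empty pair.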
Bob Cernohous: (T/L 553) 507-253-6093
BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester, MN 55901-7829
> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.
devel-bounces at mpich.org wrote on 05/06/2013 04:35:11 PM:
> From: Jeff Hammond <jhammond at alcf.anl.gov>
> To: devel at mpich.org,
> Cc: mpich2-dev at mcs.anl.gov, devel-bounces at mpich.org
> Date: 05/06/2013 04:36 PM
> Subject: Re: [mpich-devel] ROMIO collective i/o memory use
> Sent by: devel-bounces at mpich.org
> Does alltoallv actually run faster than send-recv for the MPIO use case?
> For >1MB messages, is alltoallv noticeably faster than a well-written
> send-recv implementation?
> At least on BGQ, send-recv turns into a receiver-side PAMI_Rget for
> large messages; I would guess the optimized alltoallv implementation
> is rput-based at the SPI level. Other than overhead, they should run
> at the same speed, no? If execution overhead is not significant, then
> the implementation that minimizes memory usage should be the default.
> I suppose I should just write alltoallv using send-recv and see what
> the difference is...
> On Mon, May 6, 2013 at 3:05 PM, Bob Cernohous <bobc at us.ibm.com> wrote:
> >> From: "Rob Latham" <robl at mcs.anl.gov>
> >> On Mon, May 06, 2013 at 02:30:15PM -0500, Bob Cernohous wrote:
> >> > > From: Rob Ross <rross at mcs.anl.gov>
> >> > >
> >> > > Should we consider this as interest in working on this problem on
> >> > > the IBM side :)? -- Rob
> >> >
> >> > Say what?! ;)
> >> RobR's excited that IBM's looking at the ROMIO piece of DCMF. We
> >> thought we were on our own with that one.
> >> > I was looking more for agreement that collective i/o is 'what it
> >> > is'... and maybe some idea if we just have some known limitations
> >> > scaling it. Yes, that BG alltoallv is a bigger problem that we can
> >> > avoid with an env var -- is that just going to have to be 'good
> >> > enough'? (I think that Jeff P wrote that on BG/P and got good
> >> > performance with alltoallv. Trading memory for performance, not
> >> > unusual, and at least it's selectable.)
> >> I can't test while our Blue Gene is under maintenance. I know the
> >> environment variable selection helps only a little bit (like improves
> >> scaling from 4k to 8k maybe? don't have the notes offhand).
> > Ouch. So you've seen the scaling failures at 8k... ranks? racks?
> > failing at... 16 racks x 16 ranks per node... I think ... so 256k
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> ALCF docs: http://www.alcf.anl.gov/user-guides