[mpich-devel] ROMIO collective i/o memory use

Bob Cernohous bobc at us.ibm.com
Mon May 6 16:54:10 CDT 2013


I did a couple of quick searches and this goes back to BG/L, not BG/P.  Jeff 
once investigated a problem and said:
----------------
In general, for smaller blocks, MPI-IO performed better than POSIX IO.  For 
a midplane, they are about equal.  For a rack, MPI-IO is noticeably slower. 
I am now suspecting the collective phase of MPI-IO may be taking the time 
(#2 above).

Specifying -env BGLMPIO_TUNEGATHER=0 did not significantly change the 1 
rack result.  This controls using allgather (0) vs allreduce (1 - the 
default) to communicate start and end offsets among the nodes.

**Specifying -env BGLMPIO_COMM=1 made the 1 rack result twice as slow. 
This controls using alltoallv (0 - the default) vs send/recv (1) to do the 
consolidation phase.

Specifying -env BGLMPIO_TUNEBLOCKING=0 made the 1 rack result so slow that 
I cancelled the job.  This controls whether to take psets and GPFS into 
account (1 - the default) or not (0).
---------------
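
To make the first of those knobs concrete: each rank needs every other 
rank's start/end file offsets before the file domains can be built.  Here's 
a rough sketch of the two ways you can get that table around -- this is NOT 
the ad_bgl/ROMIO source, gather_offsets() is a made-up name, and the 
allreduce trick is just my guess at how an allreduce can carry per-rank 
values:

/* Sketch, not the ROMIO code: two ways every rank can learn every other
 * rank's start/end file offsets, which is roughly what BGLMPIO_TUNEGATHER
 * picks between (my reading of the description above).  On return,
 * all_offsets[2*i] / all_offsets[2*i+1] hold rank i's start / end offset
 * on every rank. */
#include <mpi.h>
#include <string.h>

void gather_offsets(MPI_Offset my_start, MPI_Offset my_end,
                    MPI_Offset *all_offsets,  /* length 2*nprocs */
                    MPI_Comm comm, int use_allreduce)
{
    int nprocs, rank;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &rank);

    if (!use_allreduce) {
        /* Plain allgather of each rank's {start, end} pair. */
        MPI_Offset mine[2] = { my_start, my_end };
        MPI_Allgather(mine, 2, MPI_OFFSET, all_offsets, 2, MPI_OFFSET, comm);
    } else {
        /* Allreduce variant: everyone contributes a zero-filled vector with
         * only its own slots populated; MPI_MAX merges them (offsets >= 0,
         * so zero is the identity). */
        memset(all_offsets, 0, 2 * (size_t)nprocs * sizeof(MPI_Offset));
        all_offsets[2 * rank]     = my_start;
        all_offsets[2 * rank + 1] = my_end;
        MPI_Allreduce(MPI_IN_PLACE, all_offsets, 2 * nprocs,
                      MPI_OFFSET, MPI_MAX, comm);
    }
}
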
I can't find the issue where he first implemented BGLMPIO_COMM.  I seem to 
remember the alltoallv performed MUCH better on small, scattered i/o than 
send/recv did.  He had tables and numbers which I can't find right now.
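
For anyone who hasn't looked at what BGLMPIO_COMM actually switches: the 
consolidation phase boils down to either one MPI_Alltoallv over the whole 
communicator or pairwise nonblocking send/recv that only touches the ranks 
you actually exchange data with.  A sketch (again, not the real code -- 
exchange_data() and the byte-count arguments are made up):

/* Sketch only -- not the ROMIO/ad_bgl source.  Illustrates the two exchange
 * strategies BGLMPIO_COMM selects between for the consolidation phase:
 * one MPI_Alltoallv vs. pairwise nonblocking send/recv that skips empty
 * pairs.  All counts/displacements are in bytes for simplicity. */
#include <mpi.h>
#include <stdlib.h>

static void exchange_data(char *sbuf, int *scounts, int *sdispls,
                          char *rbuf, int *rcounts, int *rdispls,
                          MPI_Comm comm, int use_sendrecv)
{
    int nprocs, i, nreq = 0;
    MPI_Comm_size(comm, &nprocs);

    if (!use_sendrecv) {
        /* One collective; the library may allocate O(nprocs) worth of
         * internal staging, which is where the memory concern comes in. */
        MPI_Alltoallv(sbuf, scounts, sdispls, MPI_BYTE,
                      rbuf, rcounts, rdispls, MPI_BYTE, comm);
        return;
    }

    /* Pairwise exchange: only post sends/recvs for nonzero counts. */
    MPI_Request *reqs = malloc(2 * (size_t)nprocs * sizeof(MPI_Request));
    for (i = 0; i < nprocs; i++)
        if (rcounts[i] > 0)
            MPI_Irecv(rbuf + rdispls[i], rcounts[i], MPI_BYTE,
                      i, 0, comm, &reqs[nreq++]);
    for (i = 0; i < nprocs; i++)
        if (scounts[i] > 0)
            MPI_Isend(sbuf + sdispls[i], scounts[i], MPI_BYTE,
                      i, 0, comm, &reqs[nreq++]);
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

The alltoallv path hands the whole exchange to the collective in one shot, 
which is where the "trade memory for performance" comes from; the pairwise 
path only allocates requests and only posts what's nonzero.
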


Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.


devel-bounces at mpich.org wrote on 05/06/2013 04:35:11 PM:

> From: Jeff Hammond <jhammond at alcf.anl.gov>
> To: devel at mpich.org, 
> Cc: mpich2-dev at mcs.anl.gov, devel-bounces at mpich.org
> Date: 05/06/2013 04:36 PM
> Subject: Re: [mpich-devel] ROMIO collective i/o memory use
> Sent by: devel-bounces at mpich.org
> 
> Does alltoallv actually run faster than send-recv for the MPIO use case?
>  For >1MB messages, is alltoallv noticeably faster than a well-written
> send-recv implementation?
> 
> At least on BGQ, send-recv turns into a receiver-side PAMI_Rget for
> large messages; I would guess the optimized alltoallv implementation
> is rput-based at the SPI level.  Other than overhead, they should run
> at the same speed, no?  If execution overhead is not significant, then
> the implementation that minimizes memory usage should be the default.
> 
> I suppose I should just write alltoallv using send-recv and see what
> the difference is...
> 
> Jeff
> 
> On Mon, May 6, 2013 at 3:05 PM, Bob Cernohous <bobc at us.ibm.com> wrote:
> >
> >> From: "Rob Latham" <robl at mcs.anl.gov>
> >>
> >> On Mon, May 06, 2013 at 02:30:15PM -0500, Bob Cernohous wrote:
> >> > > From: Rob Ross <rross at mcs.anl.gov>
> >> > >
> >> > > Should we consider this as interest in working on this problem on
> >> > > the IBM side :)? -- Rob
> >> >
> >> > Say what?! ;)
> >>
> >> RobR's excited that IBM's looking at the ROMIO piece of DCMF.  We
> >> thought we were on our own with that one.
> >>
> >>
> >> > I was looking more for agreement that collective i/o is 'what it
> >> > is'... and maybe some idea if we just have some known limitations on
> >> > scaling it.  Yes, that BG alltoallv is a bigger problem that we can
> >> > avoid with an env var -- is that just going to have to be 'good
> >> > enough'?  (I think that Jeff P wrote that on BG/P and got good
> >> > performance with that alltoallv.  Trading memory for performance,
> >> > not unusual, and at least it's selectable.)
> >>
> >> I can't test while our Blue Gene is under maintenance.    I know the
> >> environment variable selection helps only a little bit (like improves
> >> scaling from 4k to 8k maybe?  don't have the notes offhand).
> >
> > Ouch.  So you've seen the scaling failures at 8k... ranks? racks?  Kevin
> > is failing at... 16 racks x 16 ranks per node... I think ... so 256k
> > ranks.
> 
> 
> 
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> ALCF docs: http://www.alcf.anl.gov/user-guides
> 
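
Along the lines of Jeff's "just write alltoallv using send-recv and see 
what the difference is": here's a throwaway timing harness that does 
exactly that on a uniform pattern.  The 1MB message size, 10 iterations, 
and dense all-to-all pattern are arbitrary assumptions (a real ROMIO 
exchange is sparse and uneven), so treat any numbers from it accordingly:

/* Throwaway comparison of MPI_Alltoallv vs. hand-rolled nonblocking
 * send/recv on a uniform all-to-all pattern.  MSG_BYTES and ITERS are
 * arbitrary; note the int displacements overflow at large rank counts,
 * so this is for small-scale curiosity only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)   /* 1 MB per pair, the size Jeff asks about */
#define ITERS 10

int main(int argc, char **argv)
{
    int nprocs, rank, i, it;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *sbuf = malloc((size_t)nprocs * MSG_BYTES);
    char *rbuf = malloc((size_t)nprocs * MSG_BYTES);
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    MPI_Request *reqs = malloc(2 * (size_t)nprocs * sizeof(MPI_Request));
    for (i = 0; i < nprocs; i++) {
        counts[i] = MSG_BYTES;
        displs[i] = i * MSG_BYTES;
    }

    /* Time the collective. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (it = 0; it < ITERS; it++)
        MPI_Alltoallv(sbuf, counts, displs, MPI_BYTE,
                      rbuf, counts, displs, MPI_BYTE, MPI_COMM_WORLD);
    double t_a2av = (MPI_Wtime() - t0) / ITERS;

    /* Time the equivalent nonblocking send/recv exchange. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (it = 0; it < ITERS; it++) {
        int nreq = 0;
        for (i = 0; i < nprocs; i++)
            MPI_Irecv(rbuf + displs[i], counts[i], MPI_BYTE, i, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        for (i = 0; i < nprocs; i++)
            MPI_Isend(sbuf + displs[i], counts[i], MPI_BYTE, i, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }
    double t_p2p = (MPI_Wtime() - t0) / ITERS;

    if (rank == 0)
        printf("alltoallv %.3f ms   send/recv %.3f ms\n",
               t_a2av * 1e3, t_p2p * 1e3);

    free(sbuf); free(rbuf); free(counts); free(displs); free(reqs);
    MPI_Finalize();
    return 0;
}
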