<font size=2 face="sans-serif">I did a couple of quick searches, and this

goes back to BG/L, not BG/P.   Jeff once investigated a problem and

said:</font>

<br><font size=2 face="sans-serif">----------------</font>

<br><font size=2 face="sans-serif">In general, for smaller blocks, MPI-IO

performed better than POSIX IO.  For a midplane, they are about equal.

 For a rack, MPI-IO is noticeably slower.  I am now suspecting

the collective phase of MPI-IO may be taking the time (#2 above).</font>

<br>

<br><font size=2 face="sans-serif">Specifying -env BGLMPIO_TUNEGATHER=0

did not significantly change the 1 rack result.  This controls using

allgather (0) vs allreduce (1 - the default) to communicate start and end

offsets among the nodes.</font>

<br>

<br><font size=2 face="sans-serif">**Specifying -env BGLMPIO_COMM=1 made

the 1 rack result twice as slow.  This controls using alltoallv (0

- the default) vs send/recv (1) to do the consolidation phase.</font>

<br>

<br><font size=2 face="sans-serif">Specifying -env BGLMPIO_TUNEBLOCKING=0

made the 1 rack result so slow that I cancelled the job.  This controls

whether to take psets and GPFS into account (1 - the default) or not (0).</font>

<br><font size=2 face="sans-serif">---------------</font>
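<br><font size=2 face="sans-serif">To illustrate the allgather vs allreduce choice that BGLMPIO_TUNEGATHER is described as controlling, here is a minimal sketch in plain Python. It only simulates ranks with lists and uses no MPI library. The zero-padded-vector-plus-sum trick for the allreduce path is an assumption about how such an exchange could work, not a statement of what the BG/L code does, and all function names are hypothetical.</font>

```python
# Simulation only: plain Python lists stand in for MPI ranks; no MPI is used.
# All names here are hypothetical and chosen for illustration.

def allgather(values):
    """Every rank contributes one value; every rank receives the full list."""
    return [list(values) for _ in values]

def allreduce_sum(vectors):
    """Element-wise sum across ranks; every rank receives the summed vector."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]

def offsets_via_allgather(starts, ends):
    """Two gathers: one for start offsets, one for end offsets."""
    return allgather(starts), allgather(ends)

def offsets_via_allreduce(starts, ends):
    """One reduction (ASSUMED scheme): each rank fills only its own two
    slots of a zeroed vector of length 2*n, so the sum reproduces every
    rank's offsets."""
    n = len(starts)
    contributions = []
    for rank in range(n):
        vec = [0] * (2 * n)
        vec[rank] = starts[rank]
        vec[n + rank] = ends[rank]
        contributions.append(vec)
    reduced = allreduce_sum(contributions)
    return [v[:n] for v in reduced], [v[n:] for v in reduced]

# Four simulated ranks, each owning a contiguous 4 KiB range.
starts = [0, 4096, 8192, 12288]
ends = [4095, 8191, 12287, 16383]
assert offsets_via_allgather(starts, ends) == offsets_via_allreduce(starts, ends)
```

<br><font size=2 face="sans-serif">Both paths leave every rank with the same offset tables, so in this sketch the setting would only change which collective carries the data, which fits the observation that it did not significantly change the 1 rack result.</font>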

<br><font size=2 face="sans-serif">I can't find the issue where he first

implemented BGLMPIO_COMM.  I seem to remember it performed MUCH better

on small scattered i/o than send/recv.  He had tables and numbers

which I can't find right now.</font>
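<br><font size=2 face="sans-serif">For reference, the two consolidation options that BGLMPIO_COMM selects between move the same data. The sketch below is a plain-Python simulation (no MPI library, hypothetical function names) of alltoallv semantics next to an equivalent exchange built from per-pair messages. It says nothing about performance or memory use on Blue Gene; it only shows that the two produce identical receive buffers.</font>

```python
from collections import deque

# Simulation only: lists stand in for per-rank buffers; no MPI is used.
# Function names are hypothetical.

def alltoallv_direct(sendbufs, sendcounts, sdispls):
    """Reference semantics: rank dst receives, in source-rank order, the
    slice that each source rank addressed to it."""
    n = len(sendbufs)
    recv = []
    for dst in range(n):
        out = []
        for src in range(n):
            start = sdispls[src][dst]
            out.extend(sendbufs[src][start:start + sendcounts[src][dst]])
        recv.append(out)
    return recv

def alltoallv_sendrecv(sendbufs, sendcounts, sdispls):
    """Same exchange expressed as one message per (src, dst) pair."""
    n = len(sendbufs)
    mailboxes = {(s, d): deque() for s in range(n) for d in range(n)}
    # "Send" phase: post a message only where there is data to move.
    for src in range(n):
        for dst in range(n):
            count = sendcounts[src][dst]
            if count:
                start = sdispls[src][dst]
                mailboxes[(src, dst)].append(sendbufs[src][start:start + count])
    # "Recv" phase: each rank drains its mailboxes in source-rank order.
    recv = []
    for dst in range(n):
        out = []
        for src in range(n):
            box = mailboxes[(src, dst)]
            if box:
                out.extend(box.popleft())
        recv.append(out)
    return recv

# Three simulated ranks with uneven, partly empty exchanges.
sendbufs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
sendcounts = [[1, 2, 0], [0, 1, 1], [2, 0, 2]]
sdispls = [[0, 1, 3], [0, 0, 1], [0, 2, 2]]
assert alltoallv_direct(sendbufs, sendcounts, sdispls) == \
    alltoallv_sendrecv(sendbufs, sendcounts, sdispls)
```

<br><font size=2 face="sans-serif">The per-pair version skips empty messages, which is one reason a send/recv path might behave differently on small scattered i/o; that is a guess, and the real numbers would have to come from the tables mentioned above.</font>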

<br>

<br><font size=2 face="sans-serif"><br>

Bob Cernohous:  (T/L 553) 507-253-6093<br>

<br>

BobC@us.ibm.com<br>

IBM Rochester, Building 030-2(C335), Department 61L<br>

3605 Hwy 52 North, Rochester,  MN 55901-7829<br>

<br>

> Chaos reigns within.<br>

> Reflect, repent, and reboot.<br>

> Order shall return.<br>

</font>

<br>

<br><tt><font size=2>devel-bounces@mpich.org wrote on 05/06/2013 04:35:11

PM:<br>

<br>

> From: Jeff Hammond <jhammond@alcf.anl.gov></font></tt>

<br><tt><font size=2>> To: devel@mpich.org, </font></tt>

<br><tt><font size=2>> Cc: mpich2-dev@mcs.anl.gov, devel-bounces@mpich.org</font></tt>

<br><tt><font size=2>> Date: 05/06/2013 04:36 PM</font></tt>

<br><tt><font size=2>> Subject: Re: [mpich-devel] ROMIO collective i/o

memory use</font></tt>

<br><tt><font size=2>> Sent by: devel-bounces@mpich.org</font></tt>

<br><tt><font size=2>> <br>

> Does alltoallv actually run faster than send-recv for the MPIO use case?<br>

>  For >1MB messages, is alltoallv noticeably faster than a

well-written<br>

> send-recv implementation?<br>

> <br>

> At least on BGQ, send-recv turns into a receiver-side PAMI_Rget for<br>

> large messages; I would guess the optimized alltoallv implementation<br>

> is rput-based at the SPI level.  Other than overhead, they should

run<br>

> at the same speed, no?  If execution overhead is not significant,

then<br>

> the implementation that minimizes memory usage should be the default.<br>

> <br>

> I suppose I should just write alltoallv using send-recv and see what<br>

> the difference is...<br>

> <br>

> Jeff<br>

> <br>

> On Mon, May 6, 2013 at 3:05 PM, Bob Cernohous <bobc@us.ibm.com>

wrote:<br>

> ><br>

> >> From: "Rob Latham" <robl@mcs.anl.gov><br>

> >><br>

> >> On Mon, May 06, 2013 at 02:30:15PM -0500, Bob Cernohous wrote:<br>

> >> > > From: Rob Ross <rross@mcs.anl.gov><br>

> >> > ><br>

> >> > > Should we consider this as interest in working

on this problem on<br>

> >> > > the IBM side :)? -- Rob<br>

> >> ><br>

> >> > Say what?! ;)<br>

> >><br>

> >> RobR's excited that IBM's looking at the ROMIO piece of DCMF.

 We<br>

> >> thought we were on our own with that one.<br>

> >><br>

> >><br>

> >> > I was looking more for agreement that collective i/o

is 'what it<br>

> >> > is'... and maybe some idea if we just have some known

limitations on<br>

> >> > scaling it.  Yes, that BG alltoallv is a bigger

problem that we can<br>

> >> > avoid<br>

> >> > with an env var -- is that just going to have to be

'good enough'?  (I<br>

> >> > think that Jeff P wrote that on BG/P and got good performance

with that<br>

> >> > alltoallv.  Trading memory for performance, not

unusual, and at least<br>

> >> > it's<br>

> >> > selectable.)<br>

> >><br>

> >> I can't test while our Blue Gene is under maintenance.  

 I know the<br>

> >> environment variable selection helps only a little bit (like

improves<br>

> >> scaling from 4k to 8k maybe?  don't have the notes offhand).<br>

> ><br>

> > Ouch.  So you've seen the scaling failures at 8k... ranks?

racks?  Kevin is<br>

> > failing at... 16 racks x 16 ranks per node... I think ... so

256k ranks.<br>

> <br>

> <br>

> <br>

> -- <br>

> Jeff Hammond<br>

> Argonne Leadership Computing Facility<br>

> University of Chicago Computation Institute<br>

> jhammond@alcf.anl.gov / (630) 252-5381<br>

> </font></tt><a href=http://www.linkedin.com/in/jeffhammond><tt><font size=2>http://www.linkedin.com/in/jeffhammond</font></tt></a><tt><font size=2><br>

> </font></tt><a href=https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond><tt><font size=2>https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond</font></tt></a><tt><font size=2><br>

> ALCF docs: </font></tt><a href="http://www.alcf.anl.gov/user-guides"><tt><font size=2>http://www.alcf.anl.gov/user-guides</font></tt></a><tt><font size=2><br>

> <br>

</font></tt>