<font size=2 face="sans-serif">I think my memory was off. Jeff actually

put in a different alltoall involving...</font>

<br>

<br><font size=2 face="sans-serif">---------------------------------------------------------------------</font>

<br><font size=2 face="sans-serif">The performance problem has been isolated

to a single function in the MPI-IO ROMIO common directory, ADIOI_Calc_others_req().

 This function was consuming 90% of the time between I/O syscalls.

 It was doing Isend/Irecv between all of the nodes, twice.  I

copied that function to a BG/L-specific module, rewrote it to use MPI_Alltoallv(),

and changed MPI-IO collective write and read to call it by default, or

when -env BGLMPIO_TUNEBLOCKING=1 is specified.  The new function only

consumed 16% of the time between I/O syscalls, and brought the total time

in line with using MPI_Gather() and POSIX I/O. </font>

<br><font size=2 face="sans-serif">...</font>

<br><font size=2 face="sans-serif">Both the performance fix and the memory

leak fix went into the V1R3M2 DRV140_2007-070417.  I compiled both

testcases against that driver and ran them on a full rack.  The performance

was good (see below), and the memory leak was gone.</font>

<br>

<br><font size=2 face="sans-serif">Performance fix:</font>

<br><font size=2 face="sans-serif">[0] using 5120 blocks per task, and

ntasks = 1024 ...</font>

<br><font size=2 face="sans-serif">[0] Time to write the file with MPI_File_write_all

= 61.602 seconds, bandwidth = 0.340 MB/sec</font>

<br><font size=2 face="sans-serif">[0] file block size = 87600</font>

<br><font size=2 face="sans-serif">[0] Time to write the file with posix

write = 84.563 seconds, bandwidth = 0.248 MB/sec</font>

<br><font size=2 face="sans-serif">-------------------------------------------------------------------</font>

<br>

<br><font size=2 face="sans-serif">Anyway, this is somewhat off the original

topic of O(p) allocations, except that it often seems to be a trade-off

between performance and memory.</font>

<br>

<br><font size=2 face="sans-serif"><br>

Bob Cernohous:  (T/L 553) 507-253-6093<br>

<br>

BobC@us.ibm.com<br>

IBM Rochester, Building 030-2(C335), Department 61L<br>

3605 Hwy 52 North, Rochester,  MN 55901-7829<br>

<br>

> Chaos reigns within.<br>

> Reflect, repent, and reboot.<br>

> Order shall return.<br>

</font>

<br>

<br><tt><font size=2>devel-bounces@mpich.org wrote on 05/06/2013 04:54:10

PM:<br>

<br>

> From: Bob Cernohous/Rochester/IBM@IBMUS</font></tt>

<br><tt><font size=2>> To: mpich2-dev@mcs.anl.gov, </font></tt>

<br><tt><font size=2>> Date: 05/06/2013 04:55 PM</font></tt>

<br><tt><font size=2>> Subject: Re: [mpich-devel] ROMIO collective i/o

memory use</font></tt>

<br><tt><font size=2>> Sent by: devel-bounces@mpich.org</font></tt>

<br><tt><font size=2>> <br>

> I did a couple quick searches and this goes back to bg/l, not bg/p.

<br>

> Jeff once investigated a problem and said <br>

> ---------------- <br>

> In general, for smaller blocks, MPI-IO performed better than POSIX

<br>

> IO.  For a midplane, they are about equal.  For a rack,

MPI-IO is <br>

> noticeably slower.  I am now suspecting the collective phase of

MPI-<br>

> IO may be taking the time (#2 above). <br>

> <br>

> Specifying -env BGLMPIO_TUNEGATHER=0 did not significantly change

<br>

> the 1 rack result.  This controls using allgather (0) vs allreduce

<br>

> (1 - the default) to communicate start and end offsets among the nodes.

<br>

> <br>

> **Specifying -env BGLMPIO_COMM=1 made the 1 rack result twice as <br>

> slow.  This controls using alltoallv (0 - the default) vs send/recv

<br>

> (1) to do the consolidation phase. <br>

> <br>

> Specifying -env BGLMPIO_TUNEBLOCKING=0 made the 1 rack result so <br>

> slow that I cancelled the job.  This controls whether to take

psets <br>

> and GPFS into account (1 - the default) or not (0). <br>

> --------------- <br>

> I can't find the issue where he first implemented BGLMPIO_COMM.  I

<br>

> seem to remember it performed MUCH better on small scattered i/o <br>

> than send/recv.  He had tables and numbers which I can't find

right now. <br>

> <br>

> <br>

> Bob Cernohous:  (T/L 553) 507-253-6093<br>

> <br>

> BobC@us.ibm.com<br>

> IBM Rochester, Building 030-2(C335), Department 61L<br>

> 3605 Hwy 52 North, Rochester,  MN 55901-7829<br>

> <br>

> > Chaos reigns within.<br>

> > Reflect, repent, and reboot.<br>

> > Order shall return.<br>

> <br>

> <br>

> devel-bounces@mpich.org wrote on 05/06/2013 04:35:11 PM:<br>

> <br>

> > From: Jeff Hammond <jhammond@alcf.anl.gov> <br>

> > To: devel@mpich.org, <br>

> > Cc: mpich2-dev@mcs.anl.gov, devel-bounces@mpich.org <br>

> > Date: 05/06/2013 04:36 PM <br>

> > Subject: Re: [mpich-devel] ROMIO collective i/o memory use <br>

> > Sent by: devel-bounces@mpich.org <br>

> > <br>

> > Does alltoallv actually run faster than send-recv for the MPIO

use case?<br>

> >  For >1MB messages, is alltoallv noticeably faster than

a well-written<br>

> > send-recv implementation?<br>

> > <br>

> > At least on BGQ, send-recv turns into a receiver-side PAMI_Rget

for<br>

> > large messages; I would guess the optimized alltoallv implementation<br>

> > is rput-based at the SPI level.  Other than overhead, they

should run<br>

> > at the same speed, no?  If execution overhead is not significant,

then<br>

> > the implementation that minimizes memory usage should be the

default.<br>

> > <br>

> > I suppose I should just write alltoallv using send-recv and see

what<br>

> > the difference is...<br>

> > <br>

> > Jeff<br>

> > <br>

> > On Mon, May 6, 2013 at 3:05 PM, Bob Cernohous <bobc@us.ibm.com>

wrote:<br>

> > ><br>

> > >> From: "Rob Latham" <robl@mcs.anl.gov><br>

> > >><br>

> > >> On Mon, May 06, 2013 at 02:30:15PM -0500, Bob Cernohous

wrote:<br>

> > >> > > From: Rob Ross <rross@mcs.anl.gov><br>

> > >> > ><br>

> > >> > > Should we consider this as interest in working

on this problem on<br>

> > >> > > the IBM side :)? -- Rob<br>

> > >> ><br>

> > >> > Say what?! ;)<br>

> > >><br>

> > >> RobR's excited that IBM's looking at the ROMIO piece

of DCMF.  We<br>

> > >> thought we were on our own with that one.<br>

> > >><br>

> > >><br>

> > >> > I was looking more for agreement that collective

i/o is 'what it<br>

> > >> > is'... and maybe some idea if we just have some

known limitations on<br>

> > >> > scaling it.  Yes, that BG alltoallv is a bigger

problem that we can<br>

> > >> > avoid<br>

> > >> > with an env var -- is that just going to have to

be 'good enough'?  (I<br>

> > >> > think that Jeff P wrote that on BG/P and got good

performance with that<br>

> > >> > alltoallv.  Trading memory for performance,

not unusual, and at least<br>

> > >> > it's<br>

> > >> > selectable.)<br>

> > >><br>

> > >> I can't test while our Blue Gene is under maintenance.

   I know the<br>

> > >> environment variable selection helps only a little bit

(like improves<br>

> > >> scaling from 4k to 8k maybe?  don't have the notes

offhand).<br>

> > ><br>

> > > Ouch.  So you've seen the scaling failures at 8k...

ranks? <br>

> racks?  Kevin is<br>

> > > failing at... 16 racks x 16 ranks per node... I think ...

so 256k ranks.<br>

> > <br>

> > <br>

> > <br>

> > -- <br>

> > Jeff Hammond<br>

> > Argonne Leadership Computing Facility<br>

> > University of Chicago Computation Institute<br>

> > jhammond@alcf.anl.gov / (630) 252-5381<br>

> > </font></tt><a href=http://www.linkedin.com/in/jeffhammond><tt><font size=2>http://www.linkedin.com/in/jeffhammond</font></tt></a><tt><font size=2><br>

> > </font></tt><a href=https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond><tt><font size=2>https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond</font></tt></a><tt><font size=2><br>

> > ALCF docs: </font></tt><a href="http://www.alcf.anl.gov/user-guides"><tt><font size=2>http://www.alcf.anl.gov/user-guides</font></tt></a><tt><font size=2><br>

> > </font></tt>