[mpich-devel] ROMIO collective i/o memory use

Mon May 6 17:05:51 CDT 2013

I think my memory was off. Jeff actually put in a different alltoall 
involving...

---------------------------------------------------------------------
The performance problem has been isolated to a single function in the 
MPI-IO ROMIO common directory, ADIOI_Calc_others_req().  This function was 
consuming 90% of the time between I/O syscalls.  It was doing Isend/Irecv 
between all of the nodes, twice.  I copied that function to a 
BG/L-specific module, rewrote it to use MPI_Alltoallv(), and changed 
MPI-IO collective write and read to call it by default, or when -env 
BGLMPIO_TUNEBLOCKING=1 is specified.  The new function only consumed 16% 
of the time between I/O syscalls, and brought the total time in line with 
using MPI_Gather() and POSIX I/O. 
...
Both the performance fix and the memory leak fix went into the V1R3M2 
DRV140_2007-070417.  I compiled both testcases against that driver and ran 
them on a full rack.  The performance was good (see below), and the memory 
leak was gone.

Performance fix:
[0] using 5120 blocks per task, and ntasks = 1024 ...
[0] Time to write the file with MPI_File_write_all = 61.602 seconds, bandwidth = 0.340 MB/sec
[0] file block size = 87600
[0] Time to write the file with posix write = 84.563 seconds, bandwidth = 0.248 MB/sec
-------------------------------------------------------------------

Anyway, this is somewhat off the original topic of O(p) allocations, except 
that it often seems to be a trade-off between performance and memory.
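
To put a rough number on the memory side of that trade-off, here is a minimal
sketch in C. It only counts the four int arrays that MPI_Alltoallv() takes as
arguments (sendcounts, sdispls, recvcounts, rdispls), each with one entry per
rank. The helper names are hypothetical, and the 1024 nodes per rack used in
the note below is an assumption, not something taken from the ROMIO or BG code.
It ignores every other O(p) allocation ROMIO makes, so treat it as a lower
bound rather than a measurement.

```c
#include <stddef.h>

/* Hypothetical helper: total ranks for a given machine shape.
 * All three inputs are assumptions supplied by the caller. */
static size_t total_ranks(size_t racks, size_t nodes_per_rack,
                          size_t ranks_per_node)
{
    return racks * nodes_per_rack * ranks_per_node;
}

/* Hypothetical helper: bytes each rank needs just for the four int
 * argument arrays of MPI_Alltoallv (sendcounts, sdispls, recvcounts,
 * rdispls), each of length nranks. This is O(p) per rank. */
static size_t alltoallv_arg_bytes(size_t nranks)
{
    return 4 * nranks * sizeof(int);
}
```

For the 16 racks x 16 ranks per node case mentioned elsewhere in this thread,
total_ranks(16, 1024, 16) gives 262144 ranks, and with a 4-byte int
alltoallv_arg_bytes(262144) comes to 4194304 bytes (4 MiB) on every rank
before any data buffers are counted.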

Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.

devel-bounces at mpich.org wrote on 05/06/2013 04:54:10 PM:

> From: Bob Cernohous/Rochester/IBM at IBMUS
> To: mpich2-dev at mcs.anl.gov, 
> Date: 05/06/2013 04:55 PM
> Subject: Re: [mpich-devel] ROMIO collective i/o memory use
> Sent by: devel-bounces at mpich.org
> 
> I did a couple quick searches and this goes back to bg/l, not bg/p. 
> Jeff once investigated a problem and said 
> ---------------- 
> In general, for smaller blocks, MPI-IO performed better than POSIX 
> IO.  For a midplane, they are about equal.  For a rack, MPI-IO is 
> noticeably slower.  I am now suspecting the collective phase of 
> MPI-IO may be taking the time (#2 above). 
> 
> Specifying -env BGLMPIO_TUNEGATHER=0 did not significantly change 
> the 1 rack result.  This controls using allgather (0) vs allreduce 
> (1 - the default) to communicate start and end offsets among the nodes. 
> 
> **Specifying -env BGLMPIO_COMM=1 made the 1 rack result twice as 
> slow.  This controls using alltoallv (0 - the default) vs send/recv 
> (1) to do the consolidation phase. 
> 
> Specifying -env BGLMPIO_TUNEBLOCKING=0 made the 1 rack result so 
> slow that I cancelled the job.  This controls whether to take psets 
> and GPFS into account (1 - the default) or not (0). 
> --------------- 
> I can't find the issue where he first implemented BGLMPIO_COMM.  I 
> seem to remember it performed MUCH better on small scattered i/o 
> than send/recv.  He had tables and numbers which I can't find right now. 

> 
> 
> 
> devel-bounces at mpich.org wrote on 05/06/2013 04:35:11 PM:
> 
> > From: Jeff Hammond <jhammond at alcf.anl.gov> 
> > To: devel at mpich.org, 
> > Cc: mpich2-dev at mcs.anl.gov, devel-bounces at mpich.org 
> > Date: 05/06/2013 04:36 PM 
> > Subject: Re: [mpich-devel] ROMIO collective i/o memory use 
> > Sent by: devel-bounces at mpich.org 
> > 
> > Does alltoallv actually run faster than send-recv for the MPIO use case?
> >  For >1MB messages, is alltoallv noticeably faster than a well-written
> > send-recv implementation?
> > 
> > At least on BGQ, send-recv turns into a receiver-side PAMI_Rget for
> > large messages; I would guess the optimized alltoallv implementation
> > is rput-based at the SPI level.  Other than overhead, they should run
> > at the same speed, no?  If execution overhead is not significant, then
> > the implementation that minimizes memory usage should be the default.
> > 
> > I suppose I should just write alltoallv using send-recv and see what
> > the difference is...
> > 
> > Jeff
> > 
> > On Mon, May 6, 2013 at 3:05 PM, Bob Cernohous <bobc at us.ibm.com> wrote:
> > >
> > >> From: "Rob Latham" <robl at mcs.anl.gov>
> > >>
> > >> On Mon, May 06, 2013 at 02:30:15PM -0500, Bob Cernohous wrote:
> > >> > > From: Rob Ross <rross at mcs.anl.gov>
> > >> > >
> > >> > > Should we consider this as interest in working on this problem on
> > >> > > the IBM side :)? -- Rob
> > >> >
> > >> > Say what?! ;)
> > >>
> > >> RobR's excited that IBM's looking at the ROMIO piece of DCMF.  We
> > >> thought we were on our own with that one.
> > >>
> > >>
> > >> > I was looking more for agreement that collective i/o is 'what it
> > >> > is'... and maybe some idea if we just have some known limitations
> > >> > on scaling it.  Yes, that BG alltoallv is a bigger problem that we
> > >> > can avoid with an env var -- is that just going to have to be 'good
> > >> > enough'?  (I think that Jeff P wrote that on BG/P and got good
> > >> > performance with that alltoallv.  Trading memory for performance,
> > >> > not unusual, and at least it's selectable.)
> > >>
> > >> I can't test while our Blue Gene is under maintenance.  I know the
> > >> environment variable selection helps only a little bit (like improves
> > >> scaling from 4k to 8k maybe?  don't have the notes offhand).
> > >
> > > Ouch.  So you've seen the scaling failures at 8k... ranks?  racks?
> > > Kevin is failing at... 16 racks x 16 ranks per node... I think ... so
> > > 256k ranks.
> > 
> > 
> > 
> > -- 
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > University of Chicago Computation Institute
> > jhammond at alcf.anl.gov / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> > https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> > ALCF docs: http://www.alcf.anl.gov/user-guides
> > 