[mpich-devel] ROMIO collective i/o memory use
bobc at us.ibm.com
Mon May 6 17:05:51 CDT 2013
I think my memory was off. Jeff actually put in a different alltoall
The performance problem has been isolated to a single function in the
MPI-IO ROMIO common directory, ADIOI_Calc_others_req(). This function was
consuming 90% of the time between I/O syscalls. It was doing Isend/Irecv
between all of the nodes, twice. I copied that function to a
BG/L-specific module, rewrote it to use MPI_Alltoallv(), and changed
MPI-IO collective write and read to call it by default, or when -env
BGLMPIO_TUNEBLOCKING=1 is specified. The new function only consumed 16%
of the time between I/O syscalls, and brought the total time in line with
using MPI_Gather() and POSIX I/O.
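
For readers who have not used MPI_Alltoallv(), here is a minimal pure-Python sketch of the data layout that such a rewrite relies on. It is not the ROMIO or BG/L code. The names pack_for_alltoallv and simulate_alltoallv are hypothetical, and the "collective" is only simulated in one process. The idea is that each rank flattens its per-destination request lists into one send buffer plus a counts array and a displacements array, so a single collective call can replace a loop of per-peer Isend/Irecv pairs.

```python
def pack_for_alltoallv(per_dest):
    """Flatten per-destination lists into one send buffer plus the
    counts/displacements arrays that an alltoallv-style call takes."""
    counts = [len(items) for items in per_dest]
    # Each destination's slice starts at the exclusive prefix sum of counts.
    displs = []
    offset = 0
    for c in counts:
        displs.append(offset)
        offset += c
    sendbuf = [x for items in per_dest for x in items]
    return sendbuf, counts, displs


def simulate_alltoallv(all_per_dest):
    """Toy single-process stand-in for the collective: rank `dest` receives,
    in source-rank order, the slice that every source addressed to it."""
    nranks = len(all_per_dest)
    packed = [pack_for_alltoallv(p) for p in all_per_dest]
    received = []
    for dest in range(nranks):
        recvbuf = []
        for src in range(nranks):
            buf, counts, displs = packed[src]
            start = displs[dest]
            recvbuf.extend(buf[start:start + counts[dest]])
        received.append(recvbuf)
    return received


# Hypothetical example: 3 ranks, each holding file offsets destined for others.
requests = [
    [[0, 8], [], [16]],     # rank 0: two items for rank 0, none for 1, one for 2
    [[24], [32, 40], []],   # rank 1
    [[], [48], [56, 64]],   # rank 2
]
result = simulate_alltoallv(requests)
```

In this toy run, rank 0 packs its data as sendbuf [0, 8, 16] with counts [2, 0, 1] and displacements [0, 2, 2]. Note that every rank carries counts and displacements arrays whose length equals the number of ranks, whether or not it has anything to send to a given peer.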
Both the performance fix and the memory leak fix went into the V1R3M2
DRV140_2007-070417. I compiled both testcases against that driver and ran
them on a full rack. The performance was good (see below), and the memory
leak was gone.
 using 5120 blocks per task, and ntasks = 1024 ...
 Time to write the file with MPI_File_write_all = 61.602 seconds,
bandwidth = 0.340 MB/sec
 file block size = 87600
 Time to write the file with posix write = 84.563 seconds, bandwidth =
Anyway, this is somewhat off the original topic of O(p) allocations except
that it often seems to be a trade-off between performance and memory.
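
To put a rough number on the O(p) side of that trade-off, here is a back-of-the-envelope sketch. It counts only the four integer arrays that MPI_Alltoallv() takes as arguments (sendcounts, sdispls, recvcounts, rdispls), each with one entry per rank. The 4-byte int and the 1024 nodes per rack are assumptions, not figures from this thread. The 16 racks x 16 ranks per node scale is the one quoted further down. ROMIO's own per-process bookkeeping and any buffers internal to the collective are not included, so treat the result as a lower bound.

```python
def alltoallv_arg_bytes(nranks, int_bytes=4):
    """Bytes one rank needs for the four int arrays passed to MPI_Alltoallv
    (sendcounts, sdispls, recvcounts, rdispls), each of length nranks."""
    return 4 * nranks * int_bytes


NODES_PER_RACK = 1024   # assumption, not stated in the thread
RANKS_PER_NODE = 16     # from "16 racks x 16 ranks per node"
RACKS = 16

nranks = RACKS * NODES_PER_RACK * RANKS_PER_NODE
per_rank = alltoallv_arg_bytes(nranks)   # assumes 4-byte int
per_node = RANKS_PER_NODE * per_rank

print("ranks:", nranks)
print("MiB per rank:", per_rank / 2**20)
print("MiB per node:", per_node / 2**20)
```

Under those assumptions the argument arrays alone come to 4 MiB per rank, or 64 MiB per node, at 256k ranks.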
Bob Cernohous: (T/L 553) 507-253-6093
BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester, MN 55901-7829
> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.
devel-bounces at mpich.org wrote on 05/06/2013 04:54:10 PM:
> From: Bob Cernohous/Rochester/IBM at IBMUS
> To: mpich2-dev at mcs.anl.gov,
> Date: 05/06/2013 04:55 PM
> Subject: Re: [mpich-devel] ROMIO collective i/o memory use
> Sent by: devel-bounces at mpich.org
> I did a couple quick searches and this goes back to bg/l, not bg/p.
> Jeff once investigated a problem and said
> In general, for smaller blocks, MPI-IO performed better than POSIX
> IO. For a midplane, they are about equal. For a rack, MPI-IO is
> noticeably slower. I am now suspecting the collective phase of MPI-IO
> may be taking the time (#2 above).
> Specifying -env BGLMPIO_TUNEGATHER=0 did not significantly change
> the 1 rack result. This controls using allgather (0) vs allreduce
> (1 - the default) to communicate start and end offsets among the nodes.
> Specifying -env BGLMPIO_COMM=1 made the 1 rack result twice as
> slow. This controls using alltoallv (0 - the default) vs send/recv
> (1) to do the consolidation phase.
> Specifying -env BGLMPIO_TUNEBLOCKING=0 made the 1 rack result so
> slow that I cancelled the job. This controls whether to take psets
> and GPFS into account (1 - the default) or not (0).
> I can't find the issue where he first implemented BGLMPIO_COMM. I
> seem to remember it performed MUCH better on small scattered i/o
> than send/recv. He had tables and numbers which I can't find right now.
> devel-bounces at mpich.org wrote on 05/06/2013 04:35:11 PM:
> > From: Jeff Hammond <jhammond at alcf.anl.gov>
> > To: devel at mpich.org,
> > Cc: mpich2-dev at mcs.anl.gov, devel-bounces at mpich.org
> > Date: 05/06/2013 04:36 PM
> > Subject: Re: [mpich-devel] ROMIO collective i/o memory use
> > Sent by: devel-bounces at mpich.org
> > Does alltoallv actually run faster than send-recv for the MPIO use case?
> > For >1MB messages, is alltoallv noticeably faster than a well-written
> > send-recv implementation?
> > At least on BGQ, send-recv turns into a receiver-side PAMI_Rget for
> > large messages; I would guess the optimized alltoallv implementation
> > is rput-based at the SPI level. Other than overhead, they should run
> > at the same speed, no? If execution overhead is not significant, then
> > the implementation that minimizes memory usage should be the default.
> > I suppose I should just write alltoallv using send-recv and see what
> > the difference is...
> > Jeff
> > On Mon, May 6, 2013 at 3:05 PM, Bob Cernohous <bobc at us.ibm.com> wrote:
> > >
> > >> From: "Rob Latham" <robl at mcs.anl.gov>
> > >>
> > >> On Mon, May 06, 2013 at 02:30:15PM -0500, Bob Cernohous wrote:
> > >> > > From: Rob Ross <rross at mcs.anl.gov>
> > >> > >
> > >> > > Should we consider this as interest in working on this problem
> > >> > > the IBM side :)? -- Rob
> > >> >
> > >> > Say what?! ;)
> > >>
> > >> RobR's excited that IBM's looking at the ROMIO piece of DCMF. We
> > >> thought we were on our own with that one.
> > >>
> > >>
> > >> > I was looking more for agreement that collective i/o is 'what it
> > >> > is'... and maybe some idea if we just have some known limitations
> > >> > scaling it. Yes, that BG alltoallv is a bigger problem that we
> > >> > avoid
> > >> > with an env var -- is that just going to have to be 'good
> > >> > think that Jeff P wrote that on BG/P and got good performance with
> > >> > alltoallv. Trading memory for performance, not unusual, and at
> > >> > it's
> > >> > selectable.)
> > >>
> > >> I can't test while our Blue Gene is under maintenance. I know
> > >> environment variable selection helps only a little bit (like
> > >> scaling from 4k to 8k maybe? don't have the notes offhand).
> > >
> > > Ouch. So you've seen the scaling failures at 8k... ranks? racks? Kevin is
> > > failing at... 16 racks x 16 ranks per node... I think ... so 256k
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > University of Chicago Computation Institute
> > jhammond at alcf.anl.gov / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> > https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> > ALCF docs: http://www.alcf.anl.gov/user-guides