[mpich-devel] ROMIO collective i/o memory use
Rob Latham
robl at mcs.anl.gov
Tue May 7 16:37:37 CDT 2013
On Mon, May 06, 2013 at 01:41:07PM -0500, Bob Cernohous wrote:
> I agree and suggested:
> ---------------------
> It appears they don't have enough memory for an alltoallv exchange. Try
> '1'...
>
> * - BGMPIO_COMM - Define how data is exchanged on collective
> *   reads and writes. Possible values:
> *   - 0 - Use MPI_Alltoallv.
> *   - 1 - Use MPI_Isend/MPI_Irecv.
> *   - Default is 0.
> ---------------------
>
> but they didn't want a workaround; they wanted a 'fix for O(p)
> allocations'. From a quick glance, there are O(p) allocations all over
> collective i/o. Just wanted some input from the experts about scaling
> ROMIO. I haven't heard whether the suggestion worked.
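
(For context, here is a minimal sketch -- not the actual ROMIO source,
all names illustrative -- of the two exchange styles that BGMPIO_COMM
chooses between. Style 0 needs one contiguous buffer sized to the *sum*
of the per-peer counts; style 1 posts per-peer operations and can skip
empty peers.)

    /* Minimal sketch, NOT the ROMIO source: the two exchange styles
     * selected by BGMPIO_COMM.  All names are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    /* BGMPIO_COMM=0: one alltoallv.  Every rank hands MPI a single
     * contiguous buffer covering the sum of all per-peer counts (the
     * rtail/stail mallocs quoted further down) -- that sum is what
     * blows up at scale. */
    static void exchange_style0(char *all_send_buf, int *send_size,
                                int *sdispls, char *all_recv_buf,
                                int *recv_size, int *rdispls,
                                MPI_Comm comm)
    {
        MPI_Alltoallv(all_send_buf, send_size, sdispls, MPI_BYTE,
                      all_recv_buf, recv_size, rdispls, MPI_BYTE, comm);
    }

    /* BGMPIO_COMM=1: per-peer isend/irecv.  Still O(p) requests, but
     * buffers are per-peer and zero-size peers are skipped. */
    static void exchange_style1(char **send_bufs, int *send_size,
                                char **recv_bufs, int *recv_size,
                                int nprocs, MPI_Comm comm)
    {
        MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));
        int i, nreqs = 0;
        for (i = 0; i < nprocs; i++)
            if (recv_size[i] > 0)
                MPI_Irecv(recv_bufs[i], recv_size[i], MPI_BYTE, i, 0,
                          comm, &reqs[nreqs++]);
        for (i = 0; i < nprocs; i++)
            if (send_size[i] > 0)
                MPI_Isend(send_bufs[i], send_size[i], MPI_BYTE, i, 0,
                          comm, &reqs[nreqs++]);
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }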
MOAB, a framework for "doing stuff to meshed data", makes for a simple
experiment: read and write a mesh with 8 million "tetrahedron" elements.
Setting BGMPIO_COMM to 1 only pushes the memory allocation problem a
little bit down the road. (I could not run this test case until our
Blue Gene/P came out of maintenance.)

With default BG/P settings, MOAB cannot even read in the initial
8-million-tet dataset with 2048 MPI processes.

With BGMPIO_COMM set to 1, 2048 MPI processes work, but 4192 does not.
I'm still working off of our /P, but the code for /Q is the same:
stderr[3260] Abort(-1) on node 3260: Unable to allocate non-contiguous buffer (Not enough space for file )
(that error message is misleading: it's a malloc that has returned NULL)
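
(The failure path looks roughly like this -- a simplified illustration,
not the actual DCMF/ROMIO source: the allocator reports whatever error
string its caller passed in, even though the real problem is plain heap
exhaustion. The trace below ends in MPID_Abort for exactly this reason.)

    /* Simplified illustration (not the actual source): the caller's
     * error string gets reported no matter what actually ran out. */
    #include <stdio.h>
    #include <stdlib.h>

    static void *alloc_or_abort(size_t nbytes, const char *what)
    {
        void *p = malloc(nbytes);
        if (p == NULL) {
            /* prints e.g. "Not enough space for file" even though
             * this is just malloc returning NULL */
            fprintf(stderr, "Abort(-1): Unable to allocate %s\n", what);
            abort();
        }
        return p;
    }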
0x01a7e0ec  raise
            ../nptl/sysdeps/unix/sysv/linux/raise.c:67
0x01a3ce40  abort
            /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdlib/abort.c:73
0x018a9df4  MPID_Abort
            /gpfs/home/robl/src/dcmf/BGP/IBM_V1R4M3/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/misc/mpid_abort.c:81
0x018ad4e8  MPIDI_DCMF_StartMsg
            /gpfs/home/robl/src/dcmf/BGP/IBM_V1R4M3/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/pt2pt/mpidi_startmessage.c:173
0x018abcb0  MPID_Isend
            /gpfs/home/robl/src/dcmf/BGP/IBM_V1R4M3/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/pt2pt/mpid_isend.c:105
0x01878c34  PMPI_Isend
            /gpfs/home/robl/src/dcmf/BGP/IBM_V1R4M3/bgp/comm/lib/dev/mpich2/src/mpi/pt2pt/isend.c:124
0x01890818  ADIOI_R_Exchange_data
            /gpfs/home/robl/src/dcmf/BGP/IBM_V1R4M3/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_rdcoll.c:828
0x018929e0  ADIOI_Read_and_exch
            /gpfs/home/robl/src/dcmf/BGP/IBM_V1R4M3/bgp/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bgl/ad_bgl_rdcoll.c:660
So the suggestion to turn off the alltoallv optimization doesn't really
work, I would say, if we can't even run on one rack of /P.
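
That outcome is consistent with the code quoted below: the isend/irecv
branch still allocates one receive buffer per sending peer (lines 783
and 786 in the ad_bg_rdcoll.c grep), so each rank ends up holding the
same sum of recv_size[] -- just in p pieces instead of one contiguous
chunk. A rough sketch (not the real code) of that accounting:

    /* Sketch, not the actual ad_bg_rdcoll.c: per-peer receive buffers
     * still total sum(recv_size[0..nprocs-1]) bytes on each rank, the
     * same footprint as one rtail-sized alltoallv buffer. */
    #include <stdlib.h>

    static char **alloc_recv_bufs(const int *recv_size, int nprocs)
    {
        char **recv_buf = malloc(nprocs * sizeof(char *));
        int i;
        for (i = 0; i < nprocs; i++)
            recv_buf[i] = recv_size[i] > 0 ? malloc(recv_size[i]) : NULL;
        return recv_buf;   /* heap used: sum of recv_size[] */
    }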
==rob
> Bob Cernohous: (T/L 553) 507-253-6093
>
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester, MN 55901-7829
>
> > Chaos reigns within.
> > Reflect, repent, and reboot.
> > Order shall return.
>
>
> devel-bounces at mpich.org wrote on 05/04/2013 09:43:10 PM:
>
> > From: "Rob Latham" <robl at mcs.anl.gov>
> > To: devel at mpich.org,
> > Cc: mpich2-dev at mcs.anl.gov
> > Date: 05/04/2013 09:48 PM
> > Subject: Re: [mpich-devel] ROMIO collective i/o memory use
> > Sent by: devel-bounces at mpich.org
> >
> > On Mon, Apr 29, 2013 at 10:28:01AM -0500, Bob Cernohous wrote:
> > > A customer (Argonne ;) is complaining about O(p) allocations in
> > > collective i/o. A collective read is failing at larger scale.
> > >
> > > Any thoughts or comments or advice? There appears to be lots of
> > > O(p) in ROMIO collective I/O. Plus a lot of (possibly large)
> > > aggregated data buffers. A quick search shows
> >
> > The O(p) allocations are a concern, sure. For two-phase, though, the
> > real problem lies in ADIOI_R_Exchange_data_alltoallv and
> > ADIOI_W_Exchange_data_alltoallv. The O(p) allocations are the least
> > of our worries!
> >
> > Around line 1063 of ad_bg_rdcoll.c:
> >
> >     all_recv_buf = (char *) ADIOI_Malloc( rtail );
> >     all_send_buf = (char *) ADIOI_Malloc( stail );
> >
> > (rtail and stail are the sums of the receive-size and send-size arrays)
> > ==rob
> >
> > > The common ROMIO read collective code:
> > >
> > > Find all "ADIOI_Malloc", Match case, Regular expression (UNIX)
> > > File Z:\bgq\comm\lib\dev\mpich2\src\mpi\romio\adio\common\ad_read_coll.c
> > >
> > > 124 38: st_offsets = (ADIO_Offset *) ADIOI_Malloc(nprocs*sizeof(ADIO_Offset));
> > > 125 39: end_offsets = (ADIO_Offset *) ADIOI_Malloc(nprocs*sizeof(ADIO_Offset));
> > > 317 44: *offset_list_ptr = (ADIO_Offset *) ADIOI_Malloc(2*sizeof(ADIO_Offset));
> > > 318 41: *len_list_ptr = (ADIO_Offset *) ADIOI_Malloc(2*sizeof(ADIO_Offset));
> > > 334 44: *offset_list_ptr = (ADIO_Offset *) ADIOI_Malloc(2*sizeof(ADIO_Offset));
> > > 335 41: *len_list_ptr = (ADIO_Offset *) ADIOI_Malloc(2*sizeof(ADIO_Offset));
> > > 436 18: ADIOI_Malloc((contig_access_count+1)*sizeof(ADIO_Offset));
> > > 437 41: *len_list_ptr = (ADIO_Offset *) ADIOI_Malloc((contig_access_count+1)*sizeof(ADIO_Offset));
> > > 573 37: if (ntimes) read_buf = (char *) ADIOI_Malloc(coll_bufsize);
> > > 578 21: count = (int *) ADIOI_Malloc(nprocs * sizeof(int));
> > > 587 25: send_size = (int *) ADIOI_Malloc(nprocs * sizeof(int));
> > > 590 25: recv_size = (int *) ADIOI_Malloc(nprocs * sizeof(int));
> > > 598 25: start_pos = (int *) ADIOI_Malloc(nprocs*sizeof(int));
> > > 739 32: tmp_buf = (char *) ADIOI_Malloc(for_next_iter);
> > > 744 33: read_buf = (char *) ADIOI_Malloc(for_next_iter+coll_bufsize);
> > > 805 9: ADIOI_Malloc((nprocs_send+nprocs_recv+1)*sizeof(MPI_Request));
> > > 827 30: recv_buf = (char **) ADIOI_Malloc(nprocs * sizeof(char*));
> > > 830 44: (char *) ADIOI_Malloc(recv_size[i]);
> > > 870 31: statuses = (MPI_Status *) ADIOI_Malloc((nprocs_send+nprocs_recv+1) * \
> > > 988 35: curr_from_proc = (unsigned *) ADIOI_Malloc(nprocs * sizeof(unsigned));
> > > 989 35: done_from_proc = (unsigned *) ADIOI_Malloc(nprocs * sizeof(unsigned));
> > > 990 35: recv_buf_idx = (unsigned *) ADIOI_Malloc(nprocs * sizeof(unsigned));
> > >
> > > Total found: 22
> > >
> > >
> > > Our BG version of read collective:
> > >
> > > File Z:\bgq\comm\lib\dev\mpich2\src\mpi\romio\adio\ad_bg\ad_bg_rdcoll.c
> > >
> > > 179 40: st_offsets = (ADIO_Offset *) ADIOI_Malloc(nprocs*sizeof(ADIO_Offset));
> > > 180 40: end_offsets = (ADIO_Offset *) ADIOI_Malloc(nprocs*sizeof(ADIO_Offset));
> > > 183 43: bg_offsets0 = (ADIO_Offset *) ADIOI_Malloc(2*nprocs*sizeof(ADIO_Offset));
> > > 184 43: bg_offsets = (ADIO_Offset *) ADIOI_Malloc(2*nprocs*sizeof(ADIO_Offset));
> > > 475 37: if (ntimes) read_buf = (char *) ADIOI_Malloc(coll_bufsize);
> > > 480 21: count = (int *) ADIOI_Malloc(nprocs * sizeof(int));
> > > 489 25: send_size = (int *) ADIOI_Malloc(nprocs * sizeof(int));
> > > 492 25: recv_size = (int *) ADIOI_Malloc(nprocs * sizeof(int));
> > > 500 25: start_pos = (int *) ADIOI_Malloc(nprocs*sizeof(int));
> > > 676 32: tmp_buf = (char *) ADIOI_Malloc(for_next_iter);
> > > 681 33: read_buf = (char *) ADIOI_Malloc(for_next_iter+coll_bufsize);
> > > 761 9: ADIOI_Malloc((nprocs_send+nprocs_recv+1)*sizeof(MPI_Request));
> > > 783 30: recv_buf = (char **) ADIOI_Malloc(nprocs * sizeof(char*));
> > > 786 44: (char *) ADIOI_Malloc(recv_size[i]);
> > > 826 31: statuses = (MPI_Status *) ADIOI_Malloc((nprocs_send+nprocs_recv+1) * \
> > > 944 35: curr_from_proc = (unsigned *) ADIOI_Malloc(nprocs * sizeof(unsigned));
> > > 945 35: done_from_proc = (unsigned *) ADIOI_Malloc(nprocs * sizeof(unsigned));
> > > 946 35: recv_buf_idx = (unsigned *) ADIOI_Malloc(nprocs * sizeof(unsigned));
> > > 1058 23: rdispls = (int *) ADIOI_Malloc( nprocs * sizeof(int) );
> > > 1063 29: all_recv_buf = (char *) ADIOI_Malloc( rtail );
> > > 1064 26: recv_buf = (char **) ADIOI_Malloc(nprocs * sizeof(char *));
> > > 1068 23: sdispls = (int *) ADIOI_Malloc( nprocs * sizeof(int) );
> > > 1073 29: all_send_buf = (char *) ADIOI_Malloc( stail );
> > >
> > > Total found: 23
> >
> > --
> > Rob Latham
> > Mathematics and Computer Science Division
> > Argonne National Lab, IL USA
> >
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA