[mpich-discuss] About mpi i/o, mpi hints and collective i/o memory usage

Latham, Robert J. robl at mcs.anl.gov
Fri Apr 21 11:15:13 CDT 2017


On Fri, 2017-04-21 at 16:13 +0200, pramod kumbhar wrote:
> 
> Dear All,
> 
> I would like to understand some details about MPI I/O hints on bg-q
> and out of memory error while doing collective i/o.

Are you the same Pramod Kumbhar that works at EPFL?  I only ask because
 it does not seem you have asked ALCF support (who run Mira) about your
problem.

We're going to get off-topic quickly for this general MPICH mailing
list, but let me give some on-topic answers before we take this
discussion to a more machine-specific venue.

1: Collective I/O does consume some memory.  Not only is there an
internal "collective buffer" maintained by MPI-IO itself, but the data
exchange also copies data before calling ALLTOALL.

1a: Your Darshan summary looks like it's doing the right stuff.  Your
tiny MPI-IO operations are getting transformed into more GPFS-friendly
multi-megabyte requests.  Good.
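
For reference, here is how I would dump everything ROMIO attached to
the open handle instead of querying keys one at a time, so you can
compare what you set against what the BG/Q driver kept or overrode.
In this sketch "fh" is an open file handle and "dump_hints" is just a
name I made up:

#include <stdio.h>
#include <mpi.h>

/* print every hint attached to an open MPI-IO file handle */
static void dump_hints(MPI_File fh)
{
    MPI_Info info_used;
    int nkeys, i, flag;
    char key[MPI_MAX_INFO_KEY + 1];
    char value[MPI_MAX_INFO_VAL + 1];

    MPI_File_get_info(fh, &info_used);
    MPI_Info_get_nkeys(info_used, &nkeys);
    for (i = 0; i < nkeys; i++) {
        MPI_Info_get_nthkey(info_used, i, key);
        MPI_Info_get(info_used, key, MPI_MAX_INFO_VAL, value, &flag);
        if (flag)
            printf("%s = %s\n", key, value);
    }
    MPI_Info_free(&info_used);
}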

Paul Coffman has done a one-sided two-phase implementation that
should have lower memory overhead.  But from here we should take the
discussion off-list.

==rob

> 
> Quick Summary :
> 
> 1. On bg-q I see cb_buffer_size as 16 MB when we query the file handle
> using MPI_File_get_info.
> An application that we are looking at has a code section like: 
> 
> ….
> MPI_File_set_view( fh, position_to_write, MPI_FLOAT, mappingType,
> "native", MPI_INFO_NULL );
> max_mb_on_any_rank_using_Kernel_GetMemorySize () => 275 MB
> MPI_File_write_all( fh, mappingBuffer, ....................
> MPI_FLOAT, &status);
> max_mb_on_any_rank_using_Kernel_GetMemorySize () => 373 MB
> ……
> 
> Why do we see that spike in memory usage?  (see the More Details section
> for size information)
> 
> I have seen “Kernel_GetMemorySize(KERNEL_MEMSIZE_HEAP….)” not
> return an accurate memory footprint, but I am not sure if that is the
> case here.
> The attached Darshan screenshot shows the access sizes while running
> on 4 racks.
> 
> 2. Is romio_cb_alltoall ignored on bg-q? Even if I disable it, I see
> “automatic” in the output.
> 
> (I am looking at
> srcV1R2M4/comm/lib/dev/mpich2/src/mpi/romio/adio/ad_bg/ad_bg_hints.c
> and see that the relevant code section is commented out.)
> 
> More Details :
> 
> We are debugging an application on MIRA which runs on 1, 2, and 4 racks
> but fails at 8 racks while dumping a custom checkpoint. These are strong
> scaling runs and the size of the checkpoint remains the same (~172 GB),
> with 32 ranks per node. The maximum memory usage before the start of the
> checkpoint (i.e. before the single write_all call) for 8 racks is
> ~300 MB. The checkpoint size from each rank is between a few KB and a
> few MB (as shown by Darshan). Once the application calls the checkpoint
> routine, we see the error below:
> 
>   Out of memory in file
> /bgsys/source/srcV1R2M2.15270/comm/lib/dev/mpich2/src/mpi/romio/adio/
> ad_bg/ad_bg_wrcoll.c,     line 500
> 
> Hence I am confused about the behaviour mentioned in question 1.
> If someone has any insight, it would be a great help!
> 
> Regards,
> Pramod
> 
> p.s. 
> 
> Default values of all hints 
> 
> cb_buffer_size, value = 16777216
> romio_cb_read, value = enable
> romio_cb_write, value = enable
> cb_nodes, value = 8320             (changes based on partition size)
> romio_no_indep_rw, value = false
> romio_cb_pfr, value = disable
> romio_cb_fr_types, value = aar
> romio_cb_fr_alignment, value = 1
> romio_cb_ds_threshold, value = 0
> romio_cb_alltoall, value = automatic
> ind_rd_buffer_size, value = 4194304
> romio_ds_read, value = automatic
> romio_ds_write, value = disable
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss