<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr">Hi Rob,<div><br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Are you the same Pramod Kumbhar that works at EPFL?</blockquote><div><br></div><div>


<p style="margin:0px;font-size:12px;line-height:normal;font-family:helvetica">Yes. After seeing the failure on 8 racks, I started debugging/profiling on our local</p><p style="margin:0px;font-size:12px;line-height:normal;font-family:helvetica">4-rack bg-q system. I planned to send an email to the support team with more detailed</p><p style="margin:0px;font-size:12px;line-height:normal;font-family:helvetica">information (for which a job is currently in the queue).</p>


</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


1: collective I/O does consume some memory.  not only is there an<br>


internal "collective buffer" maintained by MPI-IO itself, but the data<br>


exchange copies data as well before calling ALLTOALL.<br></blockquote><div><br></div><div>Just wondering if there is any way to print or query some internal statistics about this.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


Paul Coffman has done a one-sided based two-phase implementation that<br>


should be lower memory overhead.  But here we should take the<br>


discussion off-list.<br></blockquote><div><br></div><div>Perfect! Thanks!</div><div><br></div><div>Regards,</div><div>Pramod </div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


==rob<br>


<div><div class="gmail-h5"><br>


><br>


> Quick Summary:<br>


><br>


> 1. On bg-q I see cb_buffer_size as 16 MB when we query the file handle<br>


> using MPI_File_get_info.<br>


> An application that we are looking at has a code section like:<br>


><br>


> ….<br>


> MPI_File_set_view( fh, position_to_write, MPI_FLOAT, mappingType,<br>


> _native_, MPI_INFO_NULL );<br>


> max_mb_on_any_rank_using_<wbr>Kernel_GetMemorySize () => 275 MB<br>


> MPI_File_write_all( fh, mappingBuffer, ....................<br>


> MPI_FLOAT, &status);<br>


> max_mb_on_any_rank_using_<wbr>Kernel_GetMemorySize () => 373 MB<br>


> ……<br>


><br>


> Why do we see that spike in memory usage? (See the More Details section for size<br>


> information)<br>


><br>


> I have seen “Kernel_GetMemorySize(KERNEL_<wbr>MEMSIZE_HEAP….)” not<br>


> returning an accurate memory footprint, but I am not sure if that is the<br>


> case here.<br>


> The attached Darshan screenshot shows the access sizes while running on 4<br>


> racks.<br>


><br>


> 2. Is romio_cb_alltoall ignored on bg-q? Even if I disable it, I see<br>


> “automatic” in the output.<br>


><br>


> (I am looking at<br>


> srcV1R2M4/comm/lib/dev/mpich2/<wbr>src/mpi/romio/adio/ad_bg/ad_<wbr>bg_hints.c<br>


> and see that the code section is commented out.)<br>


><br>


> More Details:<br>


><br>


> We are debugging an application on Mira which runs on 1, 2 and 4 racks but<br>


> fails at 8 racks while dumping a custom checkpoint. These are strong<br>


> scaling runs and the size of the checkpoint remains the same (~172GB), with 32<br>


> ranks per node. Max memory usage before the start of the checkpoint (i.e.<br>


> before the single write_all call)<br>


> for 8 racks is ~300 MB. The checkpoint size from each rank is between<br>


> KBs and a few MBs (as shown by Darshan). Once the application calls the<br>


> checkpoint, we see the error below:<br>


><br>


>   Out of memory in file<br>


> /bgsys/source/srcV1R2M2.15270/<wbr>comm/lib/dev/mpich2/src/mpi/<wbr>romio/adio/<br>


> ad_bg/ad_bg_wrcoll.c,     line 500<br>


><br>


> Hence I am confused about the behaviour mentioned in question 1.<br>


> If someone has any insight, it would be a great help!<br>


><br>


> Regards,<br>


> Pramod<br>


><br>


> p.s. <br>


><br>


> Default values of all hints <br>


><br>


> cb_buffer_size, value = 16777216<br>


> romio_cb_read, value = enable<br>


> romio_cb_write, value = enable<br>


> cb_nodes, value = 8320             (changes with partition size)<br>


> romio_no_indep_rw, value = false<br>


> romio_cb_pfr, value = disable<br>


> romio_cb_fr_types, value = aar<br>


> romio_cb_fr_alignment, value = 1<br>


> romio_cb_ds_threshold, value = 0<br>


> romio_cb_alltoall, value = automatic<br>


> ind_rd_buffer_size, value = 4194304<br>


> romio_ds_read, value = automatic<br>


> romio_ds_write, value = disable<br>


</div></div>> ______________________________<wbr>_________________<br>


> discuss mailing list     <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>


> To manage subscription options or unsubscribe:<br>


> <a href="https://lists.mpich.org/mailman/listinfo/discuss" rel="noreferrer" target="_blank">https://lists.mpich.org/<wbr>mailman/listinfo/discuss</a><br>


</blockquote></div><br></div></div></div>