[mpich-discuss] Bug (?) report: potential division by zero in ADIOI_LUSTRE_Docollect

Rob Latham robl at mcs.anl.gov
Mon Nov 9 17:14:21 CST 2015



On 10/30/2015 03:36 PM, Constantine Khroulev wrote:
> Dear MPICH developers,
>
> I am writing to you to report what I think is a bug in ADIO (which, if
> I understand it correctly, is a part of ROMIO, which is a part of
> MPICH).
>
> The function int ADIOI_LUSTRE_Docollect(ADIO_File, int, ADIO_Offset *,
> int) defined in src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
> (MPICH version 3.1.4 and several earlier versions) contains an
> unprotected division:
>
>      /* estimate average req_size */
>      avg_req_size = (int)(total_req_size / total_access_count);
>
> I suggest adding an if statement protecting from division by zero and
> stopping (if appropriate).

Thanks for taking a closer look at the Lustre code. Halim and I chatted
a bit about this. We are frozen for the upcoming release, but once we
re-open I'll apply the patch below:

% git diff src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
diff --git a/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c b/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
index cd552829..e7901212 100644
--- a/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
+++ b/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
@@ -309,9 +309,14 @@ int ADIOI_LUSTRE_Docollect(ADIO_File fd, int contig_access_count,
      MPI_Allreduce(&req_size, &total_req_size, 1, MPI_LONG_LONG_INT, MPI_SUM,
                 fd->comm);
      MPI_Allreduce(&contig_access_count, &total_access_count, 1, MPI_INT, MPI_SUM,
-               fd->comm);
-    /* estimate average req_size */
-    avg_req_size = (int)(total_req_size / total_access_count);
+              fd->comm);
+    /* avoid possible divide-by-zero */
+    if (total_access_count != 0) {
+       /* estimate average req_size */
+       avg_req_size = (int)(total_req_size / total_access_count);
+    } else {
+       avg_req_size = 0;
+    }
      /* get hint of big_req_size */
      big_req_size = fd->hints->fs_hints.lustre.coll_threshold;
      /* Don't perform collective I/O if there are big requests */

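For readers who want to check the guard in isolation, here is a minimal standalone sketch of the same logic. The helper name average_request_size is hypothetical and not part of ROMIO; the sketch assumes, as in the snippet quoted above, that the total size is a long long, the total count is an int, and that an average of 0 is an acceptable result when no process reports any accesses.

```c
#include <assert.h>

/* Hypothetical helper (not a ROMIO function): average request size with a
 * guard against a zero access count. Mirrors the long long / int division
 * and the (int) cast from the quoted snippet. */
static int average_request_size(long long total_req_size, int total_access_count)
{
    /* Assumption: a sum of per-process access counts is never negative. */
    assert(total_access_count >= 0);

    if (total_access_count == 0) {
        /* No accesses anywhere: skip the division and report 0. */
        return 0;
    }
    return (int)(total_req_size / total_access_count);
}
```

With this shape, average_request_size(0, 0) yields 0 instead of raising SIGFPE, and average_request_size(100, 4) yields 25. The cast to int can still truncate very large averages, exactly as in the original expression.
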
>
> Some context: I am debugging a failure of PISM [1] on NASA
> Pleiades [2], which uses the SGI MPI implementation, which also uses
> ROMIO. PISM crashed with SIGFPE in ADIOI_LUSTRE_Docollect deep inside
> HDF5 and NetCDF. I am not asking for help with this [3] here; I will
> contact NASA's support as soon as I have a way of reproducing the
> issue outside of PISM.

I wonder if Michael Raymond is on this list?

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA