[mpich-discuss] Bug (?) report: potential division by zero in ADIOI_LUSTRE_Docollect
Rob Latham
robl at mcs.anl.gov
Mon Nov 9 17:14:21 CST 2015
On 10/30/2015 03:36 PM, Constantine Khroulev wrote:
> Dear MPICH developers,
>
> I am writing to you to report what I think is a bug in ADIO (which, if
> I understand it correctly, is a part of ROMIO, which is a part of
> MPICH).
>
> The function int ADIOI_LUSTRE_Docollect(ADIO_File, int, ADIO_Offset *,
> int) defined in src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
> (MPICH version 3.1.4 and several earlier versions) contains an
> unprotected division:
>
> /* estimate average req_size */
> avg_req_size = (int)(total_req_size / total_access_count);
>
> I suggest adding an if statement protecting from division by zero and
> stopping (if appropriate).
Thanks for taking a closer look at the Lustre code. Halim and I chatted
a bit about this. We are frozen for the upcoming release, but once we
re-open I'll apply the patch below:
% git diff src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
diff --git a/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c b/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
index cd552829..e7901212 100644
--- a/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
+++ b/src/mpi/romio/adio/ad_lustre/ad_lustre_aggregate.c
@@ -309,9 +309,14 @@ int ADIOI_LUSTRE_Docollect(ADIO_File fd, int contig_access_count,
MPI_Allreduce(&req_size, &total_req_size, 1, MPI_LONG_LONG_INT, MPI_SUM,
fd->comm);
MPI_Allreduce(&contig_access_count, &total_access_count, 1, MPI_INT, MPI_SUM,
- fd->comm);
- /* estimate average req_size */
- avg_req_size = (int)(total_req_size / total_access_count);
+ fd->comm);
+ /* avoid possible divide-by-zero */
+ if (total_access_count != 0) {
+ /* estimate average req_size */
+ avg_req_size = (int)(total_req_size / total_access_count);
+ } else {
+ avg_req_size = 0;
+ }
/* get hint of big_req_size */
big_req_size = fd->hints->fs_hints.lustre.coll_threshold;
/* Don't perform collective I/O if there are big requests */
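
For anyone who wants to try the guard outside of ROMIO, here is a minimal
standalone sketch of the same logic. The helper name safe_avg_req_size is
hypothetical and not part of MPICH; the two parameters stand in for the
values the MPI_Allreduce calls above produce. Treat it as an illustration
of the intended behavior, not as the committed code:

```c
/* Hypothetical helper (not in ROMIO): average request size with a guard
 * against a zero access count, mirroring the intent of the patch. */
static int safe_avg_req_size(long long total_req_size, int total_access_count)
{
    /* Divide only when at least one contiguous access was counted. */
    if (total_access_count != 0) {
        /* Same truncating cast to int as the original statement. */
        return (int) (total_req_size / total_access_count);
    }
    /* No accesses anywhere in the communicator: report an average of 0. */
    return 0;
}
```

With a zero count the helper returns 0 instead of raising SIGFPE, and with
a non-zero count it gives the same truncated quotient as the unguarded
division.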
>
> Some context: I am debugging a failure of PISM [1] on NASA
> Pleiades [2], which uses the SGI MPI implementation, which also uses
> ROMIO. PISM crashed with SIGFPE in ADIOI_LUSTRE_Docollect deep inside
> HDF5 and NetCDF. I am not asking for help with this [3] here; I will
> contact NASA's support as soon as I have a way of reproducing the
> issue outside of PISM.
I wonder if Michael Raymond is on this list?
==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA