[mpich-devel] Collective i/o failure
Rob Latham
robl at mcs.anl.gov
Thu Jul 25 10:57:29 CDT 2013
On Thu, Jul 25, 2013 at 09:16:22AM -0500, Bob Cernohous wrote:
> Question: if a rank fails a collective I/O, does it make sense for all
> ranks to fail that collective I/O, or just the single rank? The problem
> with just a single rank failing is that the remaining ranks likely get hung
> in some internal collective call. I'm wondering if having all ranks fail is
> *better*?
>
> In write_all in particular:
>
> http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/mpi-io/write_all.c#l89
>
>
> 89     if (file_ptr_type == ADIO_EXPLICIT_OFFSET && offset < 0)
> 90     {
> 91         error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
> 92                                           myname, __LINE__, MPI_ERR_ARG,
> 93                                           "**iobadoffset", 0);
> 94         error_code = MPIO_Err_return_file(adio_fh, error_code);
> 95         goto fn_exit;
> 96     }
> 97     /* --END ERROR HANDLING-- */
>
> We *could* allreduce the offset and fail all ranks if any rank has a
> negative offset. Not sure that's the right answer.
There are some places already in the I/O path where we collectively
check the parameters, such as a check to ensure all processes
open/create a file with the same AMODE. Adding more collectives to
the I/O path might make sense, but should be done carefully.
I recently added MPIO_CHECK_INFO_ALL to address an IBM-reported bug,
so CHECK_OFFSET_ALL would not be without precedent.
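A minimal sketch of what such a check could look like (CHECK_OFFSET_ALL
is only a hypothetical name, and I'm assuming adio_fh->comm is the
file's communicator, as elsewhere in ROMIO):

    int local_bad = (file_ptr_type == ADIO_EXPLICIT_OFFSET && offset < 0);
    int any_bad = 0;

    /* every process contributes its flag; the allreduce guarantees
     * that all of them take the same error path */
    MPI_Allreduce(&local_bad, &any_bad, 1, MPI_INT, MPI_LOR, adio_fh->comm);
    if (any_bad) {
        error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
                                          myname, __LINE__, MPI_ERR_ARG,
                                          "**iobadoffset", 0);
        error_code = MPIO_Err_return_file(adio_fh, error_code);
        goto fn_exit;
    }

The cost, of course, is an extra allreduce on every collective write,
whether or not anything is actually wrong.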
> What happened was that a single rank failed and the remaining ranks
> entered a (hardware-based) allreduce.
How did this single rank get a negative offset? Was there some
integer math that overflowed?
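For example (numbers invented), 32-bit size arithmetic that overflows
before being widened to MPI_Offset produces exactly this kind of
negative offset:

    #include <mpi.h>

    /* 70000 * 32768 = 2,293,760,000, which does not fit in a signed
     * 32-bit int; the multiply happens in int and wraps negative
     * before the assignment widens it */
    MPI_Offset bad_offset(int nblocks /* 70000 */, int blocksize /* 32768 */)
    {
        MPI_Offset bad = nblocks * blocksize;               /* overflows */
        MPI_Offset good = (MPI_Offset) nblocks * blocksize; /* widen first */
        (void) good;
        return bad;
    }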
> The failing rank happened to enter an
> application-level (hardware-based) allreduce, and you get a hardware-detected
> failure that no app developer is even going to try to decode and debug. So
> I get the call. Without the hardware involved, it just hangs on two
> independent allreduces. They might have figured that one out by looking
> at where the (hundreds/thousands of) stacks were hung. Of course, they
> didn't actually check the return from the write_all :) Although I'm not
> sure how MPIR_ERR_RECOVERABLE it is at that point.
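For what it's worth, the mismatch is easy to reproduce in miniature.
Here's a hypothetical reproducer (filename and offsets invented): rank 0
errors out of the collective write locally, falls through into the
application's allreduce, and the remaining ranks can get stuck in
ROMIO's internal collectives waiting for it:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 42, sum = 0;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* rank 0 gets a negative offset, as if from overflowed size
         * math; files default to MPI_ERRORS_RETURN, so the error
         * does not abort anything */
        offset = (rank == 0) ? -1 : (MPI_Offset) rank * sizeof(int);

        /* rank 0 returns early with an error; the other ranks may
         * block in the two-phase collective waiting for rank 0 */
        MPI_File_write_at_all(fh, offset, &buf, 1, MPI_INT,
                              MPI_STATUS_IGNORE);

        /* rank 0 arrives here first: two independent collectives are
         * now in flight, and the job deadlocks */
        MPI_Allreduce(&buf, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }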
In parallel-netcdf I chose to force the caller to check errors so that
I could avoid an allreduce inside pnetcdf. That seemed like the
right thing to do for pnetcdf.
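If the caller does the checking, it looks something like this sketch
(names illustrative); the point is that any global agreement on the
error is opt-in and paid for by the application, not by every call
into the library:

    #include <mpi.h>

    static void checked_write_all(MPI_File fh, MPI_Offset offset,
                                  int *buf, int count)
    {
        int err = MPI_File_write_at_all(fh, offset, buf, count,
                                        MPI_INT, MPI_STATUS_IGNORE);

        /* every rank checks locally -- the part pnetcdf forces on its
         * callers instead of hiding an allreduce inside the library */
        int local_fail = (err != MPI_SUCCESS);
        int any_fail = 0;

        /* optional: agree globally so all ranks take the same path */
        MPI_Allreduce(&local_fail, &any_fail, 1, MPI_INT, MPI_LOR,
                      MPI_COMM_WORLD);
        if (any_fail)
            MPI_Abort(MPI_COMM_WORLD, 1);
    }

Of course, this only helps when the collective actually returns on
every rank; if some ranks are wedged inside the library's internal
collectives, no caller-side check ever runs.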
> Just wondering if there's something I can fix here in addition to the
> application.
>
> Bob Cernohous: (T/L 553) 507-253-6093
>
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester, MN 55901-7829
>
> > Chaos reigns within.
> > Reflect, repent, and reboot.
> > Order shall return.
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA