[mpich-devel] Collective i/o failure
Rob Latham
robl at mcs.anl.gov
Thu Jul 25 10:57:29 CDT 2013
On Thu, Jul 25, 2013 at 09:16:22AM -0500, Bob Cernohous wrote:
> Question: if a rank fails a collective I/O, does it make sense for all
> ranks to fail that collective I/O, or just the single rank? The problem
> with just a single rank failing is that the remaining ranks likely get hung
> in some internal collective call. I'm wondering if having all ranks fail is
> *better*?
>
> In write_all in particular:
>
> http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/mpi-io/write_all.c#l89
>
>
> 89     if (file_ptr_type == ADIO_EXPLICIT_OFFSET && offset < 0)
> 90     {
> 91         error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
> 92                                           myname, __LINE__, MPI_ERR_ARG,
> 93                                           "**iobadoffset", 0);
> 94         error_code = MPIO_Err_return_file(adio_fh, error_code);
> 95         goto fn_exit;
> 96     }
> 97     /* --END ERROR HANDLING-- */
>
> We *could* allreduce the offset and fail all ranks if any rank has a
> negative offset. Not sure that's the right answer.
There are some places already in the I/O path where we collectively
check the parameters, such as a check to ensure all processes
open/create a file with the same AMODE. Adding more collectives to
the I/O path might make sense, but should be done carefully.
I recently added MPIO_CHECK_INFO_ALL to address an IBM-reported bug,
so CHECK_OFFSET_ALL would not be without precedent.
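A minimal sketch of what such a check could look like (CHECK_OFFSET_ALL
is only a hypothetical name, and I'm assuming adio_fh->comm is the
file's communicator, as elsewhere in ROMIO):

    int local_bad = (file_ptr_type == ADIO_EXPLICIT_OFFSET && offset < 0);
    int any_bad = 0;

    /* every process contributes its flag; the allreduce guarantees
     * that all of them take the same error path */
    MPI_Allreduce(&local_bad, &any_bad, 1, MPI_INT, MPI_LOR, adio_fh->comm);
    if (any_bad) {
        error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
                                          myname, __LINE__, MPI_ERR_ARG,
                                          "**iobadoffset", 0);
        error_code = MPIO_Err_return_file(adio_fh, error_code);
        goto fn_exit;
    }

The cost, of course, is an extra allreduce on every collective write,
whether or not anything is actually wrong.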
> What happened was that a single rank failed and the remaining ranks
> entered a (hardware-based) allreduce.
How did this single rank get a negative offset? Was there some
integer math that overflowed?
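For example (numbers invented), 32-bit size arithmetic that overflows
before being widened to MPI_Offset produces exactly this kind of
negative offset:

    #include <mpi.h>

    /* 70000 * 32768 = 2,293,760,000, which does not fit in a signed
     * 32-bit int; the multiply happens in int and wraps negative
     * before the assignment widens it */
    MPI_Offset bad_offset(int nblocks /* 70000 */, int blocksize /* 32768 */)
    {
        MPI_Offset bad = nblocks * blocksize;               /* overflows */
        MPI_Offset good = (MPI_Offset) nblocks * blocksize; /* widen first */
        (void) good;
        return bad;
    }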
> The failing rank happened to enter an
> application-level (hardware-based) allreduce, and you get a hardware-detected
> failure that no app developer is even going to try to decode and debug. So
> I get the call. Without the hardware involved, it just hangs on two
> independent allreduces. They might have figured that one out by looking
> at where the (hundreds/thousands of) stacks were hung. Of course, they
> didn't actually check the return from the write_all :) Although I'm not
> sure how MPIR_ERR_RECOVERABLE it is at that point.
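For what it's worth, the mismatch is easy to reproduce in miniature.
Here's a hypothetical reproducer (filename and offsets invented): rank 0
errors out of the collective write locally, falls through into the
application's allreduce, and the remaining ranks can get stuck in
ROMIO's internal collectives waiting for it:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 42, sum = 0;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* rank 0 gets a negative offset, as if from overflowed size
         * math; files default to MPI_ERRORS_RETURN, so the error
         * does not abort anything */
        offset = (rank == 0) ? -1 : (MPI_Offset) rank * sizeof(int);

        /* rank 0 returns early with an error; the other ranks may
         * block in the two-phase collective waiting for rank 0 */
        MPI_File_write_at_all(fh, offset, &buf, 1, MPI_INT,
                              MPI_STATUS_IGNORE);

        /* rank 0 arrives here first: two independent collectives are
         * now in flight, and the job deadlocks */
        MPI_Allreduce(&buf, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }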
In parallel-netcdf I chose to force the caller to check errors so that
I could avoid an allreduce inside pnetcdf. That seemed like the
right thing to do for pnetcdf.
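If the caller does the checking, it looks something like this sketch
(names illustrative); the point is that any global agreement on the
error is opt-in and paid for by the application, not by every call
into the library:

    #include <mpi.h>

    static void checked_write_all(MPI_File fh, MPI_Offset offset,
                                  int *buf, int count)
    {
        int err = MPI_File_write_at_all(fh, offset, buf, count,
                                        MPI_INT, MPI_STATUS_IGNORE);

        /* every rank checks locally -- the part pnetcdf forces on its
         * callers instead of hiding an allreduce inside the library */
        int local_fail = (err != MPI_SUCCESS);
        int any_fail = 0;

        /* optional: agree globally so all ranks take the same path */
        MPI_Allreduce(&local_fail, &any_fail, 1, MPI_INT, MPI_LOR,
                      MPI_COMM_WORLD);
        if (any_fail)
            MPI_Abort(MPI_COMM_WORLD, 1);
    }

Of course, this only helps when the collective actually returns on
every rank; if some ranks are wedged inside the library's internal
collectives, no caller-side check ever runs.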
> Just wondering if there's something I can fix here in addition to the
> application.
>
> Bob Cernohous: (T/L 553) 507-253-6093
>
> BobC at us.ibm.com
> IBM Rochester, Building 030-2(C335), Department 61L
> 3605 Hwy 52 North, Rochester, MN 55901-7829
>
> > Chaos reigns within.
> > Reflect, repent, and reboot.
> > Order shall return.
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA