[mpich-devel] Collective i/o failure

Bob Cernohous bobc at us.ibm.com
Thu Jul 25 09:16:22 CDT 2013


Question.  If a rank fails a collective i/o, does it make sense for all 
ranks to fail that collective i/o, or just the single rank?   The problem 
with only a single rank failing is that the remaining ranks will likely 
hang in some internal collective call.   I'm wondering if having all 
ranks fail is *better*?

In write_all in particular:

http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/mpi-io/write_all.c#l89


  89     if (file_ptr_type == ADIO_EXPLICIT_OFFSET && offset < 0)
  90     {
  91         error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
  92                                           myname, __LINE__, MPI_ERR_ARG,
  93                                           "**iobadoffset", 0);
  94         error_code = MPIO_Err_return_file(adio_fh, error_code);
  95         goto fn_exit;
  96     }
  97     /* --END ERROR HANDLING-- */

We *could* allreduce the offset and fail all ranks if any rank has a 
negative offset.  Not sure that's the right answer.
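
For concreteness, a minimal sketch of what that might look like, assuming 
the communicator is reachable as adio_fh->comm at this point and ignoring 
the cost of an extra allreduce on every call:

    /* Sketch only: make the offset check collective so every rank
     * takes the error path together instead of diverging. */
    int local_bad = (file_ptr_type == ADIO_EXPLICIT_OFFSET && offset < 0);
    int any_bad   = 0;

    /* Logical OR across the communicator the collective write uses
     * (assumed to be adio_fh->comm). */
    MPI_Allreduce(&local_bad, &any_bad, 1, MPI_INT, MPI_LOR, adio_fh->comm);

    if (any_bad)
    {
        /* Every rank now returns the same error class, even the
         * ranks whose own offset was fine. */
        error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,
                                          myname, __LINE__, MPI_ERR_ARG,
                                          "**iobadoffset", 0);
        error_code = MPIO_Err_return_file(adio_fh, error_code);
        goto fn_exit;
    }

The obvious downside is paying a reduction on every call just to agree on 
an argument check.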

What happened was that a single rank failed and returned, while the 
remaining ranks entered an internal (hardware-based) allreduce.  The 
failing rank then happened to enter an application (hardware-based) 
allreduce, and the mismatched collectives produced a hardware-detected 
failure that no app developer is going to even try to decode and debug. 
So I get the call.  Without the hardware involved, it would just hang on 
two independent allreduces.  They might have figured that one out by 
looking at where the (hundreds/thousands of) stacks were hung.   Of 
course, they didn't actually check the return from the write_all :)   
Although I'm not sure how MPIR_ERR_RECOVERABLE it is at that point. 
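
For reference, the application-side fix is just to look at the error code 
before entering the next collective.  A rough sketch (the file is assumed 
to already be open; buf and count are placeholders):

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: check the result of the collective write instead of
     * sailing on into the next allreduce.  File handles default to
     * MPI_ERRORS_RETURN, so the error is silently dropped unless the
     * application looks at it. */
    static void checked_write_all(MPI_File fh, double *buf, int count)
    {
        MPI_Status status;
        int err = MPI_File_write_all(fh, buf, count, MPI_DOUBLE, &status);

        if (err != MPI_SUCCESS)
        {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "MPI_File_write_all failed: %s\n", msg);
            MPI_Abort(MPI_COMM_WORLD, err);  /* fail loudly rather than hang */
        }
    }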

Just wondering if there's something I can fix here in addition to the 
application.

Bob Cernohous:  (T/L 553) 507-253-6093

BobC at us.ibm.com
IBM Rochester, Building 030-2(C335), Department 61L
3605 Hwy 52 North, Rochester,  MN 55901-7829

> Chaos reigns within.
> Reflect, repent, and reboot.
> Order shall return.