<font size=2 face="sans-serif">Question: if one rank fails a collective
I/O call, does it make sense for all ranks to fail that call, or
just the single rank? The problem with only a single rank failing
is that the remaining ranks likely hang in some internal collective call.
I'm wondering if having all ranks fail is *better*?</font>
<br>
<br><font size=2 face="sans-serif">In write_all in particular:</font>
<br>
<br><a href="http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/mpi-io/write_all.c#l89"><font size=2 face="sans-serif">http://git.mpich.org/mpich.git/blob/HEAD:/src/mpi/romio/mpi-io/write_all.c#l89</font></a>
<br>
<br>
<br><font size=2 face="sans-serif"> 89 if (file_ptr_type == ADIO_EXPLICIT_OFFSET &amp;&amp; offset &lt; 0)</font>
<br><font size=2 face="sans-serif"> 90 {</font>
<br><font size=2 face="sans-serif"> 91 error_code = MPIO_Err_create_code(MPI_SUCCESS, MPIR_ERR_RECOVERABLE,</font>
<br><font size=2 face="sans-serif"> 92 myname, __LINE__, MPI_ERR_ARG,</font>
<br><font size=2 face="sans-serif"> 93 "**iobadoffset", 0);</font>
<br><font size=2 face="sans-serif"> 94 error_code = MPIO_Err_return_file(adio_fh, error_code);</font>
<br><font size=2 face="sans-serif"> 95 goto fn_exit;</font>
<br><font size=2 face="sans-serif"> 96 }</font>
<br><font size=2 face="sans-serif"> 97 /* --END ERROR HANDLING-- */</font>
<br>
<br><font size=2 face="sans-serif">We *could* allreduce the result of the
offset check and fail all ranks if any rank has a negative offset. I'm not
sure that's the right answer.</font>
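<br>
<br><font size=2 face="sans-serif">A minimal sketch of that idea, written in
plain C so it runs without an MPI installation. The loop stands in for a real
MPI_Allreduce with MPI_MAX over a per-rank "bad offset" flag. The function
names offset_is_bad and all_ranks_should_fail are hypothetical and are not
part of ROMIO. This only illustrates the agreement step, not how write_all
would actually be changed.</font>

```c
#include <assert.h>
#include <stddef.h>

/* Same local test as the explicit-offset check in write_all:
 * a negative offset is invalid. (Hypothetical helper, not a ROMIO function.) */
static int offset_is_bad(long long offset)
{
    return offset < 0;
}

/* Simulates the agreement step across ranks.
 * offsets[i] is the offset that rank i passed to the collective call.
 * Returns 1 if every rank should fail the collective, 0 if all may proceed.
 *
 * In real MPI code each rank would compute only its own flag and call
 *   MPI_Allreduce(&local_bad, &any_bad, 1, MPI_INT, MPI_MAX, comm);
 * The loop below is an assumed stand-in for that reduction. */
int all_ranks_should_fail(const long long *offsets, size_t nranks)
{
    assert(offsets != NULL && nranks > 0);

    int any_bad = 0;
    for (size_t i = 0; i < nranks; i++) {
        int local_bad = offset_is_bad(offsets[i]); /* per-rank local check */
        if (local_bad > any_bad)                   /* MAX reduction */
            any_bad = local_bad;
    }
    return any_bad; /* every rank would receive this same value */
}
```

<br><font size=2 face="sans-serif">Because every rank sees the same reduced
flag, either all of them return the error or all of them continue into the
internal collectives together. The trade-off is one extra allreduce on every
collective write, even when nothing is wrong.</font>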
<br>
<br><font size=2 face="sans-serif">What happened: a single rank failed the
check and returned, while the remaining ranks entered a (hardware-based)
allreduce. The failing rank then happened to enter an application-level
(hardware-based) allreduce, and you get a hardware-detected failure that no
app developer is going to even try to decode and debug. So I get the call.
Without the hardware involved, it just hangs on two mismatched allreduces.
They might have figured that one out by looking at where the
(hundreds/thousands of) stacks were hung. Of course, they didn't actually
check the return from the write_all :) Although I'm not sure how
MPIR_ERR_RECOVERABLE the error really is at that point.</font>
<br>
<br><font size=2 face="sans-serif">Just wondering if there's something
I can fix here in addition to the application.</font>
<br><font size=2 face="sans-serif"><br>
Bob Cernohous: (T/L 553) 507-253-6093<br>
<br>
BobC@us.ibm.com<br>
IBM Rochester, Building 030-2(C335), Department 61L<br>
3605 Hwy 52 North, Rochester, MN 55901-7829<br>
<br>
> Chaos reigns within.<br>
> Reflect, repent, and reboot.<br>
> Order shall return.<br>
</font>