[mpich-devel] Collective i/o failure
Rob Ross
rross at mcs.anl.gov
Thu Jul 25 13:03:40 CDT 2013
Just to reiterate the point that RobL made: adding collectives to check for completion, etc., on other ranks adds overhead to the calls even when they are successful. That in turn makes people not use MPI-IO, because it becomes slower; good for reducing bug reports, but bad for encouraging use of standard interfaces.
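To make the cost concrete: the check being asked for amounts to something like the sketch below (the function name and error choice are made up, and it assumes the file was opened on MPI_COMM_WORLD). The extra allreduce is paid on every call, including all the ones that would have succeeded anyway.

#include <mpi.h>

/* Sketch only: vote on argument validity before the collective write,
 * so every rank learns whether any rank is about to fail.  The price
 * is one extra MPI_Allreduce on every call, errors or not. */
int checked_write_at_all(MPI_File fh, MPI_Offset off, const void *buf,
                         int count, MPI_Datatype type, MPI_Status *st)
{
    int ok = (off >= 0);   /* local sanity check, e.g. a negative offset */
    int all_ok;

    /* assumes fh was opened on MPI_COMM_WORLD */
    MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
    if (!all_ok)
        return MPI_ERR_ARG;   /* every rank bails out together */

    return MPI_File_write_at_all(fh, off, buf, count, type, st);
}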
Rob
On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:
> > From: "Rob Latham" <robl at mcs.anl.gov>
>
> > How did this single rank get a negative offset? Was there some
> > integer math that overflowed?
>
> That's for the app developer to figure out. My issue is that if all ranks had failed the write, he probably would have started figuring that out a few days ago and I wouldn't have gotten involved :) It's the weird hardware error that dragged me into this: the non-failing ranks entered an allreduce inside ROMIO while the failing ranks entered an allreduce in the app.
>
> Like I said:
>
> > > Just wondering if there's something I can fix here in addition to the
> > > application.
>
> Not the highest priority, really. But coincidentally I just got another report (from ANL this time) of an app hung with half the ranks in write_at_all and half the ranks in a later barrier. It could be something similar; I don't have enough information yet to know, but I've suggested they look at the errors returned from the write.
>
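For concreteness, the mismatch described above boils down to something like the hypothetical reproducer below. The file name, chunk size, and exact overflow are invented, and where an implementation detects a bad offset relative to its internal collectives is implementation-dependent, but this is the shape of the hang:

#include <mpi.h>
#include <string.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv)
{
    int rank;
    char buf[CHUNK];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'x', sizeof buf);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* 32-bit math: for a large enough rank this wraps negative before
     * it is widened to MPI_Offset, which is the suspected overflow. */
    int bad = rank * CHUNK;
    MPI_Offset off = bad;

    /* Files default to MPI_ERRORS_RETURN, so a bad offset comes back
     * as a return code instead of aborting.  If the implementation
     * rejects it before its internal allreduce, this rank returns
     * while the healthy ranks are still inside the collective... */
    int err = MPI_File_write_at_all(fh, off, buf, CHUNK, MPI_BYTE,
                                    MPI_STATUS_IGNORE);
    (void)err;   /* deliberately ignored, as in the failing apps */

    MPI_Barrier(MPI_COMM_WORLD);   /* ...and waits here forever. */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}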