[mpich-devel] Collective i/o failure

Rob Ross rross at mcs.anl.gov
Thu Jul 25 13:03:40 CDT 2013


Just to reiterate the point that RobL made: adding collectives to check for completion, etc., of other ranks adds overhead even to the calls that succeed. That in turn discourages people from using MPI-IO because it becomes slower, which is good for reducing bug reports but bad for encouraging use of standard interfaces.
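
To make the cost concrete, here is a rough sketch of what such a check might look like (this is not ROMIO code; the wrapper is made up). The extra allreduce runs on every call, including the ones where every rank succeeded:

    #include <mpi.h>

    /* Hypothetical wrapper: agree across ranks on the outcome of a
     * collective write before returning. */
    int checked_write_at_all(MPI_File fh, MPI_Offset offset,
                             const void *buf, int count,
                             MPI_Datatype type, MPI_Status *status)
    {
        int err = MPI_File_write_at_all(fh, offset, buf, count,
                                        type, status);
        int ok = (err == MPI_SUCCESS) ? 1 : 0;
        int all_ok = 0;

        /* This is the overhead in question: one extra allreduce per
         * write, paid even on the success path. */
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);

        return all_ok ? err : MPI_ERR_OTHER;
    }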

Rob

On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:

> > From: "Rob Latham" <robl at mcs.anl.gov> 
> 
> > How did this single rank get a negative offset?  Was there some
> > integer math that overflowed?
> 
> That's for the app developer to figure out.  My issue is that if all ranks had failed the write, he probably would have started figuring that out a few days ago and I wouldn't have gotten involved :)   It's the weird hw error that dragged me into this: the non-failing ranks entered an allreduce inside ROMIO while the failing ranks entered an allreduce in the app. 
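> 
> For what it's worth, the classic way a single rank ends up with a negative offset is 32-bit int arithmetic that overflows before the result is widened to MPI_Offset. A made-up illustration (the names and values here are mine, not the app's): 
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(void)
>     {
>         int rank  = 20000;        /* a high rank number          */
>         int block = 128 * 1024;   /* bytes contributed per rank  */
> 
>         /* Wrong: the product is computed in int and, on typical
>          * systems, wraps negative before the widening assignment. */
>         MPI_Offset bad  = rank * block;
> 
>         /* Right: widen one operand first so the multiply is 64-bit. */
>         MPI_Offset good = (MPI_Offset)rank * block;
> 
>         printf("bad=%lld good=%lld\n", (long long)bad, (long long)good);
>         return 0;
>     }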
> 
> Like I said: 
> 
> > > Just wondering if there's something I can fix here in addition to the 
> > > application.
> 
> Not the highest priority, really.  But coincidentally I just got another report (from ANL this time) of an app hung with half the ranks in write_at_all and half in a later barrier.  It could be something similar.  I don't have enough information yet to know, but I've suggested they look at the errors returned from the write. 
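> 
> Something like the sketch below is what I mean by looking at the errors from the write (a minimal made-up program, not their app). Since file handles default to MPI_ERRORS_RETURN, the error comes back to the caller, and a rank that ignores it will happily diverge into the barrier: 
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>         MPI_File fh;
>         MPI_File_open(MPI_COMM_WORLD, "out.dat",
>                       MPI_MODE_CREATE | MPI_MODE_WRONLY,
>                       MPI_INFO_NULL, &fh);
> 
>         char buf[4] = "abc";
>         MPI_Status status;
>         int err = MPI_File_write_at_all(fh, (MPI_Offset)rank * 4,
>                                         buf, 4, MPI_CHAR, &status);
>         if (err != MPI_SUCCESS) {
>             char msg[MPI_MAX_ERROR_STRING];
>             int len;
>             MPI_Error_string(err, msg, &len);
>             fprintf(stderr, "rank %d: write_at_all: %s\n", rank, msg);
>             MPI_Abort(MPI_COMM_WORLD, err);  /* fail loudly, don't diverge */
>         }
> 
>         MPI_Barrier(MPI_COMM_WORLD);
>         MPI_File_close(&fh);
>         MPI_Finalize();
>         return 0;
>     }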
> 


