[mpich-devel] Collective i/o failure

Bob Cernohous bobc at us.ibm.com
Thu Jul 25 12:10:49 CDT 2013


> From: "Rob Latham" <robl at mcs.anl.gov>

> How did this single rank get a negative offset?  Was there some
> integer math that overflowed?

That's for the app developer to figure out.  My issue is that if all ranks 
had failed the write, he probably would have started figuring that out a 
few days ago and I wouldn't have gotten involved :)   It's the weird hw 
error that dragged me into this: the non-failing ranks entered allreduce 
in ROMIO while the failing ranks entered allreduce in the app.

Like I said:

> > Just wondering if there's something I can fix here in addition to the 
> > application.

Not the highest priority really.  But I coincidentally just got another 
report (from ANL this time) that an app is hung with half the ranks in 
write_at_all and half the ranks in a later barrier.  It could be something 
similar.  I don't have enough information yet to know, but I've suggested 
they check the errors returned from the write.
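
On the application side, one way to avoid this kind of mismatch is to have
every rank agree that its arguments are valid before any rank enters the
collective write.  What follows is only a minimal sketch of that agreement
logic in plain C, not what ROMIO or the app actually does.  The names
offset_is_valid and all_ranks_ok are hypothetical, and the loop stands in
for what would be an MPI_Allreduce with MPI_MIN over each rank's local flag
in a real MPI program, done before calling MPI_File_write_at_all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local check: a file offset must be non-negative. */
int offset_is_valid(int64_t offset)
{
    return offset >= 0;
}

/*
 * Models a min-reduction over per-rank validity flags.
 * Assumption: offsets[i] is the offset rank i is about to pass to the
 * collective write.  Returns 1 only if every rank's offset is valid,
 * so either all ranks go ahead with the write or all ranks skip it.
 */
int all_ranks_ok(const int64_t *offsets, size_t nranks)
{
    int ok = 1;
    for (size_t i = 0; i < nranks; i++) {
        if (!offset_is_valid(offsets[i]))
            ok = 0;   /* one bad rank vetoes the collective for everyone */
    }
    return ok;
}
```

With offsets {0, -4096, 8192}, all_ranks_ok returns 0, so no rank would
enter the write and no rank is left waiting in a collective that the others
have already abandoned.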
