[mpich-devel] Collective i/o failure

Rob Ross rross at mcs.anl.gov
Thu Jul 25 13:31:16 CDT 2013


See historical discussions on collective argument checking. -- Rob

On Jul 25, 2013, at 1:13 PM, Jeff Hammond wrote:

> MPI_Allreduce on an integer should be almost infinitely fast on BG, so maybe an ifdef guard is all that is required to keep this from standing in the way of widespread acceptance of MPI-IO.
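> 
> A minimal sketch of what such a guard might look like; the macro name,
> the comm variable, and the placement are made up here, not ROMIO's
> actual code:
> 
>     /* Hypothetical build-time switch: agree on the argument-check result
>        across ranks so every rank takes the same path. */
>     #ifdef ROMIO_COLLECTIVE_ARG_CHECK
>         int my_err  = (offset < 0) ? 1 : 0;   /* local check */
>         int any_err = 0;
> 
>         /* a single-integer allreduce; cheap on BG-class networks */
>         MPI_Allreduce(&my_err, &any_err, 1, MPI_INT, MPI_MAX, comm);
> 
>         if (any_err)
>             return MPI_ERR_ARG;   /* all ranks take the error path together */
>     #endif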
> 
> Best,
> 
> Jeff
> 
> ----- Original Message -----
>> From: "Rob Ross" <rross at mcs.anl.gov>
>> To: devel at mpich.org
>> Sent: Thursday, July 25, 2013 2:03:40 PM
>> Subject: Re: [mpich-devel] Collective i/o failure
>> 
>> Just to reiterate the point that RobL made: adding collectives to
>> check for completion, errors, etc. on other ranks adds overhead to
>> the calls when they succeed. That in turn makes people avoid MPI-IO
>> because it becomes slower, which is good for reducing bug reports
>> but bad for encouraging use of standard interfaces.
>> 
>> Rob
>> 
>> On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:
>> 
>>>> From: "Rob Latham" <robl at mcs.anl.gov>
>>> 
>>>> How did this single rank get a negative offset?  Was there some
>>>> integer math that overflowed?
>>> 
>>> That's for the app developer to figure out.  My issue is that if
>>> all ranks had failed the write, he probably would have started
>>> figuring that out a few days ago and I wouldn't have gotten
>>> involved :)   It's the weird hardware error that dragged me into
>>> this, when the non-failing ranks entered the allreduce in ROMIO and
>>> the failing ranks entered an allreduce in the app.
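>>> 
>>> For what it's worth, one common way a single rank ends up with a
>>> negative offset is 32-bit integer math overflowing before the result
>>> is widened to MPI_Offset.  A made-up illustration, not the app's code:
>>> 
>>>     int rank, block = 4 * 1024 * 1024;           /* 4 MiB per rank */
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> 
>>>     /* BUG: rank * block is evaluated in 32-bit int and wraps negative
>>>        for rank >= 512, and only then is widened to MPI_Offset */
>>>     MPI_Offset bad_off  = rank * block;
>>> 
>>>     /* FIX: promote to MPI_Offset before multiplying */
>>>     MPI_Offset good_off = (MPI_Offset)rank * block;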
>>> 
>>> Like I said :
>>> 
>>>>> Just wondering if there's something I can fix here in addition
>>>>> to the
>>>>> application.
>>> 
>>> Not the highest priority, really.  But coincidentally I just got
>>> another report (from ANL this time) that an app is hung with half
>>> the ranks in write_at_all and half the ranks in a later barrier.
>>> It could be something similar.  I don't have enough information
>>> yet to know, but I've suggested they look at errors from the write.
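>>> 
>>> A rough sketch of the kind of check I'm suggesting they add (it
>>> assumes the file was opened with the MPI_ERRORS_RETURN error handler;
>>> the variable names are made up):
>>> 
>>>     int rc = MPI_File_write_at_all(fh, off, buf, count, MPI_BYTE, &st);
>>>     if (rc != MPI_SUCCESS) {
>>>         char msg[MPI_MAX_ERROR_STRING]; int len;
>>>         MPI_Error_string(rc, msg, &len);
>>>         fprintf(stderr, "rank %d: write_at_all failed: %s\n", rank, msg);
>>>         MPI_Abort(MPI_COMM_WORLD, rc);   /* don't just fall through to
>>>                                             the later barrier */
>>>     }
>>>     MPI_Barrier(MPI_COMM_WORLD);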
>>> 
>> 
>> 
> 
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> ALCF docs: http://www.alcf.anl.gov/user-guides
> 


