[mpich-devel] Collective i/o failure

Rob Ross rross at mcs.anl.gov
Thu Jul 25 14:02:51 CDT 2013


Hi,

No, discussions of whether we should check parameters for validity in a collective way, so that we can give a proper error back to all ranks when one rank has a problem.

It is not generally the case that collectives are free. This type of error checking has been done in add-on libraries, for example, to help with debugging; I think A. Chan et al. built an instance of this some time back.
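For illustration, a minimal sketch of the kind of collective check being discussed (the MPIO_COLLECTIVE_ARG_CHECKS switch and the check_offset_all helper are hypothetical, not existing ROMIO code); the MPI_Allreduce is exactly the overhead paid even when every rank's arguments are fine:

    #include <mpi.h>

    /* Hypothetical collective argument check: every rank validates its
     * own offset, then all ranks agree on the result so each one can
     * return a proper error even if only one rank had a bad argument. */
    static int check_offset_all(MPI_Offset offset, MPI_Comm comm)
    {
        int local_ok  = (offset >= 0);
        int global_ok = local_ok;
    #ifdef MPIO_COLLECTIVE_ARG_CHECKS
        /* MPI_MIN: if any rank saw a bad offset, every rank sees 0. */
        MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_MIN, comm);
    #endif
        return global_ok ? MPI_SUCCESS : MPI_ERR_ARG;
    }

With the switch compiled out, only the rank with the bad offset sees the error (the behavior discussed below); with it compiled in, all ranks return the error together, at the cost of one extra collective per call.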

Rob

On Jul 25, 2013, at 1:57 PM, Jeff Hammond wrote:

> BG-specific discussions?  I don't care what the Forum thinks about an implementation detail that makes BG more productive for our users.
> 
> If MPIO_CHECK_OFFSET_ALL is optional and IBM chooses to enable it, what does that matter to anyone else?
> 
> Jeff
> 
> ----- Original Message -----
>> From: "Rob Ross" <rross at mcs.anl.gov>
>> To: devel at mpich.org
>> Sent: Thursday, July 25, 2013 2:31:16 PM
>> Subject: Re: [mpich-devel] Collective i/o failure
>> 
>> See historical discussions on collective argument checking. -- Rob
>> 
>> On Jul 25, 2013, at 1:13 PM, Jeff Hammond wrote:
>> 
>>> MPI_Allreduce on an integer should be almost infinitely fast on BG,
>>> so maybe an ifdef guard is all that is required to keep this from
>>> standing in the way of widespread acceptance of MPI-IO.
>>> 
>>> Best,
>>> 
>>> Jeff
>>> 
>>> ----- Original Message -----
>>>> From: "Rob Ross" <rross at mcs.anl.gov>
>>>> To: devel at mpich.org
>>>> Sent: Thursday, July 25, 2013 2:03:40 PM
>>>> Subject: Re: [mpich-devel] Collective i/o failure
>>>> 
>>>> Just to reiterate the point that RobL made: adding collectives to
>>>> check for completion, etc. of other ranks adds overhead to the
>>>> calls
>>>> when they are successful. This in turn makes people not use
>>>> MPI-IO,
>>>> because it becomes slower, which is good for reducing bug reports,
>>>> but bad for encouraging use of standard interfaces.
>>>> 
>>>> Rob
>>>> 
>>>> On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:
>>>> 
>>>>>> From: "Rob Latham" <robl at mcs.anl.gov>
>>>>> 
>>>>>> How did this single rank get a negative offset?  Was there some
>>>>>> integer math that overflowed?
>>>>> 
>>>>> That's for the app developer to figure out.  My issue is that if
>>>>> all ranks had failed the write he probably would have started
>>>>> figuring that out a few days ago and I wouldn't have gotten
>>>>> involved :)   It's the weird hw error that dragged me into this
>>>>> when the non-failing ranks entered allreduce in romio and the
>>>>> failing ranks entered allreduce in the app.
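As an aside, one plausible answer to RobL's question above is offset arithmetic done in 32-bit int that overflows before it is widened to MPI_Offset; the names and numbers below are made up for illustration:

    #include <mpi.h>

    MPI_Offset compute_offset(int rank, int bytes_per_rank)
    {
        /* e.g. rank = 20000, bytes_per_rank = 200000:
         * 20000 * 200000 = 4e9 overflows 32-bit int arithmetic and
         * goes negative before it is ever widened to MPI_Offset. */
        MPI_Offset broken = rank * bytes_per_rank;
        (void)broken;

        /* Widening one operand first keeps the multiply in 64 bits. */
        return (MPI_Offset)rank * bytes_per_rank;
    }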
>>>>> 
>>>>> Like I said :
>>>>> 
>>>>>>> Just wondering if there's something I can fix here in addition
>>>>>>> to the
>>>>>>> application.
>>>>> 
>>>>> Not the highest priority really.  But I coincidentally just got
>>>>> another report (from ANL this time) that an app is hung with half
>>>>> the ranks in write_at_all and half the ranks in a later barrier.
>>>>> It could be something similar.  I don't have enough information
>>>>> yet to know but I've suggested they look at errors from write.
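A minimal sketch of what "look at errors from write" means in practice (the wrapper name is made up; MPI file handles default to the MPI_ERRORS_RETURN error handler, so the return code is meaningful and a failing rank can report instead of silently moving on):

    #include <mpi.h>
    #include <stdio.h>

    /* Check the collective write's result before proceeding, so a rank
     * that fails does not skip ahead and leave the others stuck in a
     * later collective (e.g. a barrier). */
    int checked_write_at_all(MPI_File fh, MPI_Offset offset,
                             const void *buf, int count, MPI_Datatype dtype)
    {
        int err = MPI_File_write_at_all(fh, offset, buf, count, dtype,
                                        MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "MPI_File_write_at_all failed: %s\n", msg);
        }
        return err;
    }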
>>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> Jeff Hammond
>>> Argonne Leadership Computing Facility
>>> University of Chicago Computation Institute
>>> jhammond at alcf.anl.gov / (630) 252-5381
>>> http://www.linkedin.com/in/jeffhammond
>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>> ALCF docs: http://www.alcf.anl.gov/user-guides
>>> 
>> 
>> 
> 
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> ALCF docs: http://www.alcf.anl.gov/user-guides
> 


