[mpich-devel] Collective i/o failure
Rob Ross
rross at mcs.anl.gov
Thu Jul 25 14:02:51 CDT 2013
Hi,
No, I mean discussions of whether we should be checking that parameters are valid in a collective way, so that we can give a proper error back to all ranks when one rank has a problem.
It is not generally the case that collectives are free. This type of error checking has been done in add-on libraries, for example, to help in debugging. I think A. Chan et al. did an instance of this some time back.
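As a rough sketch (not ROMIO's actual code; the routine name check_offset_all is made up for illustration), the kind of check being discussed boils down to one small allreduce on a per-rank validity flag, possibly guarded by a compile-time option such as the MPIO_CHECK_OFFSET_ALL idea discussed below:

    #include <mpi.h>

    /* Hypothetical collective argument check: every rank contributes a
     * "my offset is valid" flag, and a MIN allreduce tells all ranks
     * whether any rank passed a bad value, so every rank can return the
     * same error instead of diverging into different collectives. */
    static int check_offset_all(MPI_Offset offset, MPI_Comm comm)
    {
        int ok = (offset >= 0);   /* this rank's local validity */
        int all_ok = 0;

        /* This one extra allreduce per collective call is exactly the
         * overhead at issue in this thread. */
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_MIN, comm);

        return all_ok ? MPI_SUCCESS : MPI_ERR_ARG;
    }

With every rank seeing the same all_ok value, the ranks either all proceed into the collective write or all return the same error, rather than some ranks sitting in ROMIO's internal allreduce while the failing rank has already returned.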
Rob
On Jul 25, 2013, at 1:57 PM, Jeff Hammond wrote:
> BG-specific discussions? I don't care what the Forum thinks about an implementation detail that makes BG more productive for our users.
>
> If MPIO_CHECK_OFFSET_ALL is optional and IBM chooses to enable it, what does that matter to anyone else?
>
> Jeff
>
> ----- Original Message -----
>> From: "Rob Ross" <rross at mcs.anl.gov>
>> To: devel at mpich.org
>> Sent: Thursday, July 25, 2013 2:31:16 PM
>> Subject: Re: [mpich-devel] Collective i/o failure
>>
>> See historical discussions on collective argument checking. -- Rob
>>
>> On Jul 25, 2013, at 1:13 PM, Jeff Hammond wrote:
>>
>>> MPI_Allreduce on an integer should be almost infinitely fast on BG,
>>> so maybe an ifdef guard is all that is required to keep this from
>>> standing in the way of widespread acceptance of MPI-IO.
>>>
>>> Best,
>>>
>>> Jeff
>>>
>>> ----- Original Message -----
>>>> From: "Rob Ross" <rross at mcs.anl.gov>
>>>> To: devel at mpich.org
>>>> Sent: Thursday, July 25, 2013 2:03:40 PM
>>>> Subject: Re: [mpich-devel] Collective i/o failure
>>>>
>>>> Just to reiterate the point that RobL made: adding collectives to
>>>> check for completion, etc. of other ranks adds overhead to the calls
>>>> when they are successful. This in turn makes people not use MPI-IO,
>>>> because it becomes slower, which is good for reducing bug reports,
>>>> but bad for encouraging use of standard interfaces.
>>>>
>>>> Rob
>>>>
>>>> On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:
>>>>
>>>>>> From: "Rob Latham" <robl at mcs.anl.gov>
>>>>>
>>>>>> How did this single rank get a negative offset? Was there some
>>>>>> integer math that overflowed?
>>>>>
>>>>> That's for the app developer to figure out. My issue is that if
>>>>> all ranks had failed the write, he probably would have started
>>>>> figuring that out a few days ago and I wouldn't have gotten
>>>>> involved :) It's the weird hw error that dragged me into this
>>>>> when the non-failing ranks entered allreduce in ROMIO and the
>>>>> failing ranks entered allreduce in the app.
>>>>>
>>>>> Like I said:
>>>>>
>>>>>>> Just wondering if there's something I can fix here in addition to
>>>>>>> the application.
>>>>>
>>>>> Not the highest priority, really. But I coincidentally just got
>>>>> another report (from ANL this time) that an app is hung with half
>>>>> the ranks in write_at_all and half the ranks in a later barrier.
>>>>> It could be something similar. I don't have enough information
>>>>> yet to know, but I've suggested they look at errors from the write.
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Jeff Hammond
>>> Argonne Leadership Computing Facility
>>> University of Chicago Computation Institute
>>> jhammond at alcf.anl.gov / (630) 252-5381
>>> http://www.linkedin.com/in/jeffhammond
>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>> ALCF docs: http://www.alcf.anl.gov/user-guides
>>>
>>
>>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> ALCF docs: http://www.alcf.anl.gov/user-guides
>