[mpich-devel] Collective i/o failure
Rusty Lusk
lusk at mcs.anl.gov
Thu Jul 25 14:59:32 CDT 2013
Yes, the collchk library, which is an MPI profiling-library tool, was done by a student, me, and Anthony a while back, and is still available as part of MPE. Its main contribution is to check the consistency of input parameters to collective calls (e.g., that every rank calls MPI_Bcast with the same root). The same approach could certainly be used to check consistency of return codes.
Advantage: works with any MPI, does not touch the MPI implementation.
Disadvantage: introduces overhead of a subroutine call, plus an extra collective.
Easy to swap in and out.
Rusty
On Thursday, Jul 25, 2013, at 2:02 PM, Rob Ross wrote:
> Hi,
>
> No, discussions of whether we should check parameters for validity in a collective way, so that we can return a proper error to all ranks when one rank has a problem.
>
> It is not generally the case that collectives are free. This type of error checking has been done in add-on libraries, for example, to help in debugging. I think A. Chan et al. did an instance of this some time back.
>
> Rob
>
> On Jul 25, 2013, at 1:57 PM, Jeff Hammond wrote:
>
>> BG-specific discussions? I don't care what the Forum thinks about an implementation detail that makes BG more productive for our users.
>>
>> If MPIO_CHECK_OFFSET_ALL is optional and IBM chooses to enable it, what does that matter to anyone else?
>>
>> Jeff
>>
>> ----- Original Message -----
>>> From: "Rob Ross" <rross at mcs.anl.gov>
>>> To: devel at mpich.org
>>> Sent: Thursday, July 25, 2013 2:31:16 PM
>>> Subject: Re: [mpich-devel] Collective i/o failure
>>>
>>> See historical discussions on collective argument checking. -- Rob
>>>
>>> On Jul 25, 2013, at 1:13 PM, Jeff Hammond wrote:
>>>
>>>> MPI_Allreduce on an integer should be almost infinitely fast on BG
>>>> so maybe an ifdef guard is all that is required to keep this from
>>>> standing in the way of widespread acceptance of MPI-IO.
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>> ----- Original Message -----
>>>>> From: "Rob Ross" <rross at mcs.anl.gov>
>>>>> To: devel at mpich.org
>>>>> Sent: Thursday, July 25, 2013 2:03:40 PM
>>>>> Subject: Re: [mpich-devel] Collective i/o failure
>>>>>
>>>>> Just to reiterate the point that RobL made: adding collectives to
>>>>> check for completion, etc. of other ranks adds overhead to the calls
>>>>> when they are successful. This in turn makes people not use MPI-IO,
>>>>> because it becomes slower, which is good for reducing bug reports,
>>>>> but bad for encouraging use of standard interfaces.
>>>>>
>>>>> Rob
>>>>>
>>>>> On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:
>>>>>
>>>>>>> From: "Rob Latham" <robl at mcs.anl.gov>
>>>>>>
>>>>>>> How did this single rank get a negative offset? Was there some
>>>>>>> integer math that overflowed?
>>>>>>
>>>>>> That's for the app developer to figure out. My issue is that if
>>>>>> all ranks had failed the write, he probably would have started
>>>>>> figuring that out a few days ago, and I wouldn't have gotten
>>>>>> involved :) It's the weird hw error that dragged me into this,
>>>>>> when the non-failing ranks entered the allreduce in ROMIO and the
>>>>>> failing ranks entered an allreduce in the app.
>>>>>>
>>>>>> Like I said:
>>>>>>
>>>>>>>> Just wondering if there's something I can fix here in addition
>>>>>>>> to the
>>>>>>>> application.
>>>>>>
>>>>>> Not the highest priority, really. But I coincidentally just got
>>>>>> another report (from ANL this time) that an app is hung with half
>>>>>> the ranks in write_at_all and half the ranks in a later barrier.
>>>>>> It could be something similar. I don't have enough information
>>>>>> yet to know, but I've suggested they look at errors from the write.
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>> ALCF docs: http://www.alcf.anl.gov/user-guides
>>>>
>>>
>>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>> ALCF docs: http://www.alcf.anl.gov/user-guides
>>
>