[mpich-devel] Collective i/o failure
Rusty Lusk
lusk at mcs.anl.gov
Thu Jul 25 14:59:32 CDT 2013
Yes, the collchk library, which is an MPI profiling-library tool, was done by a student, me, and Anthony a while back, and is still available as part of MPE. Its main contribution is to check the consistency of input parameters to collective calls (e.g., that every rank calls MPI_Bcast with the same root). The same approach could certainly be used to check consistency of return codes.
Advantage: works with any MPI, does not touch the MPI implementation.
Disadvantage: introduces overhead of a subroutine call, plus an extra collective.
Easy to swap in and out.
Rusty
On Thursday, Jul 25, 2013, at 2:02 PM, Rob Ross wrote:
> Hi,
>
> No, discussions of whether we should check parameters for validity in a collective way, so that we can return a proper error to all ranks when one rank has a problem.
>
> It is not generally the case that collectives are free. This type of error checking has been done in add-on libraries, for example, to help in debugging. I think A. Chan et al. did an instance of this some time back.
>
> Rob
>
> On Jul 25, 2013, at 1:57 PM, Jeff Hammond wrote:
>
>> BG-specific discussions? I don't care what the Forum thinks about an implementation detail that makes BG more productive for our users.
>>
>> If MPIO_CHECK_OFFSET_ALL is optional and IBM chooses to enable it, what does that matter to anyone else?
>>
>> Jeff
>>
>> ----- Original Message -----
>>> From: "Rob Ross" <rross at mcs.anl.gov>
>>> To: devel at mpich.org
>>> Sent: Thursday, July 25, 2013 2:31:16 PM
>>> Subject: Re: [mpich-devel] Collective i/o failure
>>>
>>> See historical discussions on collective argument checking. -- Rob
>>>
>>> On Jul 25, 2013, at 1:13 PM, Jeff Hammond wrote:
>>>
>>>> MPI_Allreduce on an integer should be almost infinitely fast on BG
>>>> so maybe an ifdef guard is all that is required to keep this from
>>>> standing in the way of widespread acceptance of MPI-IO.
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>> ----- Original Message -----
>>>>> From: "Rob Ross" <rross at mcs.anl.gov>
>>>>> To: devel at mpich.org
>>>>> Sent: Thursday, July 25, 2013 2:03:40 PM
>>>>> Subject: Re: [mpich-devel] Collective i/o failure
>>>>>
>>>>> Just to reiterate the point that RobL made: adding collectives to
>>>>> check for completion, etc. of other ranks adds overhead to the calls
>>>>> when they are successful. This in turn makes people not use MPI-IO,
>>>>> because it becomes slower, which is good for reducing bug reports,
>>>>> but bad for encouraging use of standard interfaces.
>>>>>
>>>>> Rob
>>>>>
>>>>> On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:
>>>>>
>>>>>>> From: "Rob Latham" <robl at mcs.anl.gov>
>>>>>>
>>>>>>> How did this single rank get a negative offset? Was there some
>>>>>>> integer math that overflowed?
>>>>>>
>>>>>> That's for the app developer to figure out. My issue is that if
>>>>>> all ranks had failed the write, he probably would have started
>>>>>> figuring that out a few days ago, and I wouldn't have gotten
>>>>>> involved :) It's the weird hw error that dragged me into this,
>>>>>> when the non-failing ranks entered the allreduce in ROMIO and the
>>>>>> failing ranks entered an allreduce in the app.
>>>>>>
>>>>>> Like I said:
>>>>>>
>>>>>>>> Just wondering if there's something I can fix here in addition
>>>>>>>> to the
>>>>>>>> application.
>>>>>>
>>>>>> Not the highest priority, really. But I coincidentally just got
>>>>>> another report (from ANL this time) that an app is hung with half
>>>>>> the ranks in write_at_all and half the ranks in a later barrier.
>>>>>> It could be something similar. I don't have enough information
>>>>>> yet to know, but I've suggested they look at errors from the write.
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>> ALCF docs: http://www.alcf.anl.gov/user-guides
>>>>
>>>
>>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>> ALCF docs: http://www.alcf.anl.gov/user-guides
>>
>