[mpich-devel] Collective i/o failure

Michael Blocksome blocksom at us.ibm.com
Thu Jul 25 14:01:20 CDT 2013


I'm actually a bit surprised that there isn't already a 
'--enable-collective-argument-checks' configure option that would default 
to "disabled".

Michael Blocksome
Blue Gene Messaging
blocksom at us.ibm.com




From:   Jeff Hammond <jhammond at alcf.anl.gov>
To:     devel at mpich.org, 
Date:   07/25/2013 01:59 PM
Subject:        Re: [mpich-devel] Collective i/o failure
Sent by:        devel-bounces at mpich.org



BG-specific discussions?  I don't care what the Forum thinks about an 
implementation detail that makes BG more productive for our users.

If MPIO_CHECK_OFFSET_ALL is optional and IBM chooses to enable it, what 
does that matter to anyone else?

Jeff

----- Original Message -----
> From: "Rob Ross" <rross at mcs.anl.gov>
> To: devel at mpich.org
> Sent: Thursday, July 25, 2013 2:31:16 PM
> Subject: Re: [mpich-devel] Collective i/o failure
> 
> See historical discussions on collective argument checking. -- Rob
> 
> On Jul 25, 2013, at 1:13 PM, Jeff Hammond wrote:
> 
> > MPI_Allreduce on an integer should be almost infinitely fast on BG,
> > so maybe an ifdef guard is all that is required to keep this from
> > standing in the way of widespread acceptance of MPI-IO.
> > 
> > Best,
> > 
> > Jeff
> > 
> > ----- Original Message -----
> >> From: "Rob Ross" <rross at mcs.anl.gov>
> >> To: devel at mpich.org
> >> Sent: Thursday, July 25, 2013 2:03:40 PM
> >> Subject: Re: [mpich-devel] Collective i/o failure
> >> 
> >> Just to reiterate the point that RobL made: adding collectives to
> >> check for completion, etc., of other ranks adds overhead to the
> >> calls even when they are successful. This in turn makes people not
> >> use MPI-IO, because it becomes slower, which is good for reducing
> >> bug reports but bad for encouraging use of standard interfaces.
> >> 
> >> Rob
> >> 
> >> On Jul 25, 2013, at 12:10 PM, Bob Cernohous wrote:
> >> 
> >>>> From: "Rob Latham" <robl at mcs.anl.gov>
> >>> 
> >>>> How did this single rank get a negative offset?  Was there some
> >>>> integer math that overflowed?
> >>> 
> >>> That's for the app developer to figure out.  My issue is that if
> >>> all ranks had failed the write, he probably would have started
> >>> figuring that out a few days ago and I wouldn't have gotten
> >>> involved :)   It's the weird hardware error that dragged me into
> >>> this when the non-failing ranks entered allreduce in ROMIO and
> >>> the failing ranks entered allreduce in the app.
> >>> 
> >>> Like I said :
> >>> 
> >>>>> Just wondering if there's something I can fix here in addition
> >>>>> to the
> >>>>> application.
> >>> 
> >>> Not the highest priority, really.  But I coincidentally just got
> >>> another report (from ANL this time) that an app is hung with half
> >>> the ranks in write_at_all and half the ranks in a later barrier.
> >>> It could be something similar.  I don't have enough information
> >>> yet to know, but I've suggested they look at the errors returned
> >>> from the write.
> >>> 
> >> 
> >> 
> > 
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > University of Chicago Computation Institute
> > jhammond at alcf.anl.gov / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> > https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> > ALCF docs: http://www.alcf.anl.gov/user-guides
> > 
> 
> 

-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides

