[mpich-discuss] any clue on this mpiio error

Jaln valiantljk at gmail.com
Wed Feb 10 17:17:57 CST 2016


Hi David,
That sounds great.
I appreciate your input.

I'm going to try it on the two I/O tickets I have on hand, and I hope to
update you all shortly.

Best,
Jialin


On Wed, Feb 10, 2016 at 1:53 PM, David Knaak <knaak at cray.com> wrote:

> > > On 02/10/2016 11:39 AM, Jaln wrote:
> > > My jobs on Edison die on IO errors like this:
> > >
> > > ADIOI_CRAY_WRITECONTIG(284): filename='OUT/rei20_0.g029'
> > > error='Input/output error' errno=5 PE=00044 W_rec=33518
> > > off=2232920756 len=0000524288
> > > See MPICH_MPIIO_ABORT_ON_RW_ERROR.
> > >
> > > Any ideas about this error? I couldn't find anything on Google.
> > > Thanks
>
> > On Wed, Feb 10, 2016 at 9:46 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> >
> > Cray doesn't share their modifications to ROMIO with us.  You'll have
> > more luck with your Cray support contact.
> >
> > errno 5 indicates a general I/O error of some kind.  Your offset is
> > just past 2 GiB, so maybe you are hitting some kind of Cray 32-bit
> > limitation?
> >
> > But I'm only guessing, as Cray doesn't share source with us.
> >
> > ==rob
>
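Rob's 32-bit guess can be checked with simple arithmetic on the values in the error message. A minimal sketch in Python (the helper name `fits_in` is made up for illustration): it tests whether the reported offset, and the offset of the last byte of the write, still fit in signed and unsigned 32-bit integers. Exceeding the signed limit only shows the guess is plausible; it says nothing about what Cray's code actually does.

```python
# Values copied from the error message in this thread.
OFFSET = 2232920756   # off=2232920756
LENGTH = 524288       # len=0000524288 (512 KiB)

INT32_MAX = 2**31 - 1    # largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value


def fits_in(value, limit):
    """Return True if value is non-negative and does not exceed limit."""
    return 0 <= value <= limit


if __name__ == "__main__":
    last_byte = OFFSET + LENGTH - 1
    for name, value in (("offset", OFFSET), ("last byte", last_byte)):
        print(f"{name}={value} "
              f"fits int32: {fits_in(value, INT32_MAX)}, "
              f"fits uint32: {fits_in(value, UINT32_MAX)}")
```

The offset is past the signed 32-bit limit (2147483647) but well below the unsigned one, so a signed 32-bit overflow cannot be ruled out from the numbers alone.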
> Hi Jaln,
>
> This message means that Cray's MPIIO had just made a system write() call
> and write() returned a status of -1.  The MPIIO routine that made the
> call, "ADIOI_CRAY_WriteContig", is giving you as much information as it
> can.  "errno" has a value of 5, which translates to the general I/O
> error message "Input/output error".  That doesn't really tell you much.
> The message recommends that you look at MPICH_MPIIO_ABORT_ON_RW_ERROR in
> the Cray MPI "intro_mpi" man page:
>
>   MPICH_MPIIO_ABORT_ON_RW_ERROR
>       If set to enable, causes MPI-IO to abort immediately after
>       issuing an error message if an I/O error occurs during a
>       system read() or write() call. This applies only to I/O
>       errors for system read() and write() calls made as a result
>       of MPI I/O calls. It does not apply to I/O errors for other
>       MPI I/O calls such as MPI_File_open(), nor does it apply to
>       read() and write() calls made by means other than MPI I/O
>       calls.
>
>       Abort on error is not standard behavior. The MPI Standard
>       specifies that the default error handling for MPI I/O calls
>       is to return an error code to the application rather than
>       aborting the application, but since errors on write or read
>       are almost always unexpected and usually not recoverable, it
>       may be preferable to abort as soon as the error is detected.
>       Doing so does not allow any recovery, but does provide the
>       most information about the error and terminates the job
>       quickly.
>
>       If the Cray Abnormal Termination Processing (ATP) feature is
>       enabled, the abort will result in a full stack backtrace
>       written to stderr and a graphical merged stack backtrace
>       tree (a "dot" file) that shows exactly where each process
>       was at the time of the abort.
>
>       This environment variable is global for all files opened by
>       MPI_File_open(). To enable this behavior only for specific
>       files, use the MPICH_MPIIO_HINTS abort_on_rw_error option.
>
>       Default: disable
>
> If this problem is repeatable, enable ATP (see man atp), set the
> environment variable, and see what the backtrace tells you.
>
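The debugging setup described above amounts to two environment variables on the job launch. A hedged sketch in Python: `MPICH_MPIIO_ABORT_ON_RW_ERROR=enable` comes from the man page excerpt; `ATP_ENABLED=1` is an assumption about how ATP is switched on, so check `man atp` on your system; the helper name `debug_env` and the `aprun` command line in the comment are hypothetical.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with abort-on-error debugging enabled."""
    env = dict(os.environ if base is None else base)
    # From the intro_mpi excerpt: abort right after the failing read()/write().
    env["MPICH_MPIIO_ABORT_ON_RW_ERROR"] = "enable"
    # Assumption: ATP is enabled through ATP_ENABLED=1 (verify with `man atp`).
    env["ATP_ENABLED"] = "1"
    return env


if __name__ == "__main__":
    env = debug_env()
    # Hypothetical launch line; substitute your own job command, e.g.
    #   subprocess.run(["aprun", "-n", "64", "./my_app"], env=env, check=False)
    for key in ("MPICH_MPIIO_ABORT_ON_RW_ERROR", "ATP_ENABLED"):
        print(f"{key}={env[key]}")
```

In a batch script the same effect is normally achieved by exporting the two variables before the launch command.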
> You can contact me directly (knaak at cray.com).
>
> David Knaak
>
>
>
>
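David's reading of `errno=5` can be reproduced on any POSIX system. A minimal sketch using Python's standard `errno` and `os` modules; the message text comes from the platform's C library, so the exact wording may differ from the "Input/output error" shown in the thread on systems other than Linux with glibc.

```python
import errno
import os

code = 5  # errno value reported in the error message

# Symbolic name for the number (EIO on Linux and other POSIX systems).
name = errno.errorcode.get(code, "unknown")

# Human-readable text, the same string strerror(3) returns in C.
text = os.strerror(code)

print(f"errno {code}: {name} ({text})")
```

As David notes, EIO is a catch-all: it confirms that write() failed at the system level but not why, which is the reason the backtrace is the next step.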


-- 

Genius only means hard-working all one's life