[mpich-discuss] any clue on this mpiio error
Jaln
valiantljk at gmail.com
Wed Feb 10 17:17:57 CST 2016
Hi David,
That sounds great.
I appreciate your input.
I'm going to try it with the two I/O tickets I have on hand, and I hope to
update you all shortly.
Best,
Jialin
On Wed, Feb 10, 2016 at 1:53 PM, David Knaak <knaak at cray.com> wrote:
> > > On 02/10/2016 11:39 AM, Jaln wrote:
> > > My jobs on Edison die on IO errors like this:
> > >
> > > ADIOI_CRAY_WRITECONTIG(284): filename='OUT/rei20_0.g029'
> > > error='Input/output error' errno=5 PE=00044 W_rec=33518
> > > off=2232920756 len=0000524288
> > > See MPICH_MPIIO_ABORT_ON_RW_ERROR.
> > >
> > > Any ideas about this error info? I couldn't find anything on Google.
> > > Thanks
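
For readers who want to work with a message like the one quoted above, here is a minimal sketch that splits it into its key=value fields and maps the errno number to its symbolic name. The function name parse_fields is made up for illustration, and the regular expression is an assumption based only on this one sample message, not on any documented Cray message format.

```python
import errno
import re

# The error text from the post, joined onto one line.
MESSAGE = (
    "ADIOI_CRAY_WRITECONTIG(284): filename='OUT/rei20_0.g029' "
    "error='Input/output error' errno=5 PE=00044 W_rec=33518 "
    "off=2232920756 len=0000524288"
)


def parse_fields(message):
    """Return the key=value pairs of the message as a dict of strings.

    Values are either single-quoted (may contain spaces) or bare tokens.
    This pattern is inferred from the sample message only.
    """
    pairs = re.findall(r"(\w+)=(?:'([^']*)'|(\S+))", message)
    return {key: quoted or bare for key, quoted, bare in pairs}


fields = parse_fields(MESSAGE)
code = int(fields["errno"])

# errno.errorcode maps the numeric value to its symbolic name.
print("errno", code, "is", errno.errorcode[code])
print("file:", fields["filename"])
print("off:", int(fields["off"]), "len:", int(fields["len"]))
```

The numeric fields carry leading zeros in the message, so they are converted with int() before any arithmetic.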
>
> > On Wed, Feb 10, 2016 at 9:46 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> >
> > Cray doesn't share their modifications to ROMIO with us. You'll have
> > more luck with your Cray support contact.
> >
> > errno 5 indicates a general I/O error of some kind. Your offset is
> > just past 2^31, so maybe you are hitting some kind of Cray 32-bit
> > limitation?
> >
> > But I'm only guessing, as Cray doesn't share source with us.
> >
> > ==rob
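
The arithmetic behind the 32-bit guess can be checked in a few lines. This is only a sketch: it assumes "off" is a byte offset and "len" a byte count, which the message itself does not state, and it shows only that the numbers are consistent with the guess, not that a 32-bit limit is the actual cause.

```python
OFFSET = 2232920756  # "off=" value from the error message
LENGTH = 524288      # "len=" value from the error message

INT32_MAX = 2**31 - 1   # largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # largest unsigned 32-bit value

end = OFFSET + LENGTH   # first byte past the failed write

print("offset exceeds signed 32-bit max:", OFFSET > INT32_MAX)
print("bytes past the signed limit:", OFFSET - INT32_MAX)
print("end of write fits in unsigned 32 bits:", end <= UINT32_MAX)
```

The offset is above the signed 32-bit maximum but the whole write still fits in an unsigned 32-bit range, so only a signed 32-bit offset would be affected.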
>
> Hi Jaln,
>
> This message means that Cray's MPIIO had just made a system write() call
> and write() returned a status of -1. The MPIIO routine that made the
> call, "ADIOI_CRAY_WriteContig", is giving you as much information as it
> can. "errno" has a value of 5, which translates to the general I/O
> error message "Input/output error". That doesn't really tell you much.
> The message recommends that you look at MPICH_MPIIO_ABORT_ON_RW_ERROR in
> the Cray MPI "intro_mpi" man page:
>
> MPICH_MPIIO_ABORT_ON_RW_ERROR
> If set to enable, causes MPI-IO to abort immediately after
> issuing an error message if an I/O error occurs during a
> system read() or write() call. This applies only to I/O
> errors for system read() and write() calls made as a result
> of MPI I/O calls. It does not apply to I/O errors for other
> MPI I/O calls such as MPI_File_open(), nor does it apply to
> read() and write() calls made by means other than MPI I/O
> calls.
>
> Abort on error is not standard behavior. The MPI Standard
> specifies that the default error handling for MPI I/O calls
> is to return an error code to the application rather than
> aborting the application, but since errors on write or read
> are almost always unexpected and usually not recoverable, it
> may be preferable to abort as soon as the error is detected.
> Doing so does not allow any recovery, but does provide the
> most information about the error and terminates the job
> quickly.
>
> If the Cray Abnormal Termination Processing (ATP) feature is
> enabled, the abort will result in a full stack backtrace
> written to stderr and a graphical merged stack backtrace
> tree (a "dot" file) that shows exactly where each process
> was at the time of the abort.
>
> This environment variable is global for all files opened by
> MPI_File_open(). To enable this behavior only for specific
> files, use the MPICH_MPIIO_HINTS abort_on_rw_error option.
>
> Default: disable
>
> If this problem is repeatable, enable ATP (see man atp), set
> MPICH_MPIIO_ABORT_ON_RW_ERROR to enable, and see what the backtrace
> tells you.
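
One way to prepare such a run from a script is sketched below. MPICH_MPIIO_ABORT_ON_RW_ERROR and its value "enable" come from the man page text quoted above. ATP_ENABLED=1 is an assumption about how ATP is switched on and should be checked against man atp on the system. The helper name debug_env is hypothetical.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with abort-on-error debugging set.

    base defaults to the current process environment; the copy is
    returned so the caller's environment is left untouched.
    """
    env = dict(os.environ if base is None else base)
    # From the intro_mpi excerpt: abort right after the error message.
    env["MPICH_MPIIO_ABORT_ON_RW_ERROR"] = "enable"
    # Assumption: ATP is enabled through this variable (see man atp).
    env["ATP_ENABLED"] = "1"
    return env


# Start from an empty environment here just to show what gets added.
env = debug_env({})
print(env)
```

The returned dict can be passed as the env argument of subprocess.run() together with whatever job launch command the site uses. The launch command is deliberately left out because it is site-specific.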
>
> You can contact me directly (knaak at cray.com).
>
> David Knaak
--
Genius only means hard-working all one's life