[mpich-discuss] any clue on this mpiio error
David Knaak
knaak at cray.com
Wed Feb 10 15:53:57 CST 2016
> > On 02/10/2016 11:39 AM, Jaln wrote:
> > My jobs on Edison die on IO errors like this:
> >
> > ADIOI_CRAY_WRITECONTIG(284): filename='OUT/rei20_0.g029'
> > error='Input/output error' errno=5 PE=00044 W_rec=33518
> > off=2232920756 len=0000524288
> > See MPICH_MPIIO_ABORT_ON_RW_ERROR.
> >
> > Any ideas about this error info? I couldn't find anything on Google.
> > Thanks
> On Wed, Feb 10, 2016 at 9:46 AM, Rob Latham <robl at mcs.anl.gov> wrote:
>
> Cray doesn't share their modifications to ROMIO with us. You'll have
> more luck with your Cray support contact.
>
> errno 5 might be indicative of a general I/O error of some kind. Your
> offset is just large enough that maybe you are hitting some kind of
> Cray 32-bit limitation?
>
> But I'm only guessing, as Cray doesn't share source with us.
>
> ==rob
Hi Jaln,

This message means that Cray's MPIIO had just made a write() system call
and write() returned -1. The MPIIO routine that made the call,
"ADIOI_CRAY_WriteContig", is giving you as much information as it can.
"errno" has a value of 5 (EIO), which translates to the general I/O
error message "Input/output error". That doesn't really tell you much
about the underlying cause.
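
As a rough illustration of how to read the fields in that message, the
Python sketch below parses the text quoted at the top of this thread and
decodes the numbers. The regular expression and the helper name
parse_rw_error are made up for this example and are not part of any Cray
tooling; the errno lookup assumes a Linux system, where errno 5 is EIO.

```python
import errno
import os
import re

# The error text as quoted earlier in this thread, joined onto one line.
MESSAGE = (
    "ADIOI_CRAY_WRITECONTIG(284): filename='OUT/rei20_0.g029' "
    "error='Input/output error' errno=5 PE=00044 W_rec=33518 "
    "off=2232920756 len=0000524288"
)


def parse_rw_error(text):
    """Hypothetical helper: extract the numeric key=value fields."""
    fields = dict(re.findall(r"(\w+)=(\d+)", text))
    return {key: int(value) for key, value in fields.items()}


info = parse_rw_error(MESSAGE)

# Map the errno number to its symbolic name and generic description.
errno_name = errno.errorcode.get(info["errno"], "unknown")
errno_text = os.strerror(info["errno"])

# Compare the write against 32-bit limits (the guess quoted above).
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1
past_signed_32 = info["off"] > INT32_MAX
past_unsigned_32 = info["off"] + info["len"] > UINT32_MAX

print(errno_name, "-", errno_text)
print("offset beyond signed 32-bit range:", past_signed_32)
print("write ends beyond unsigned 32-bit range:", past_unsigned_32)
```

The offset is past the signed 32-bit limit (2147483647) while the write
ends short of 4 GiB. That is consistent with the 32-bit guess quoted
above, but it does not confirm it.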

The message recommends that you look at MPICH_MPIIO_ABORT_ON_RW_ERROR in
the Cray MPI "intro_mpi" man page:

    MPICH_MPIIO_ABORT_ON_RW_ERROR
        If set to enable, causes MPI-IO to abort immediately after
        issuing an error message if an I/O error occurs during a
        system read() or write() call. This applies only to I/O
        errors for system read() and write() calls made as a result
        of MPI I/O calls. It does not apply to I/O errors for other
        MPI I/O calls such as MPI_File_open(), nor does it apply to
        read() and write() calls made by means other than MPI I/O
        calls.

        Abort on error is not standard behavior. The MPI Standard
        specifies that the default error handling for MPI I/O calls
        is to return an error code to the application rather than
        aborting the application, but since errors on write or read
        are almost always unexpected and usually not recoverable, it
        may be preferable to abort as soon as the error is detected.
        Doing so does not allow any recovery, but does provide the
        most information about the error and terminates the job
        quickly.

        If the Cray Abnormal Termination Processing (ATP) feature is
        enabled, the abort will result in a full stack backtrace
        written to stderr and a graphical merged stack backtrace
        tree (a "dot" file) that shows exactly where each process
        was at the time of the abort.

        This environment variable is global for all files opened by
        MPI_File_open(). To enable this behavior only for specific
        files, use the MPICH_MPIIO_HINTS abort_on_rw_error option.

        Default: disable

If this problem is repeatable, enable ATP (see "man atp"), set
MPICH_MPIIO_ABORT_ON_RW_ERROR to enable, and see what the backtrace
tells you. You can also contact me directly (knaak at cray.com).
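
One way to follow that suggestion is to set the variables in the
environment of the job launch. The Python sketch below only builds such
an environment. MPICH_MPIIO_ABORT_ON_RW_ERROR and its "enable" value
come from the man page text above; ATP_ENABLED=1 is an assumption about
how ATP is switched on at your site (check "man atp"), and the helper
name and the launch command in the trailing comment are placeholders.

```python
import os


def debug_environment(base=None):
    """Hypothetical helper: copy an environment and add the debug settings."""
    env = dict(os.environ if base is None else base)
    # From the intro_mpi excerpt: abort on a failed system read()/write().
    env["MPICH_MPIIO_ABORT_ON_RW_ERROR"] = "enable"
    # Assumption: ATP is enabled through this variable; verify with "man atp".
    env["ATP_ENABLED"] = "1"
    return env


env = debug_environment({"PATH": "/usr/bin"})
for name in sorted(env):
    print(f"{name}={env[name]}")

# The job would then be launched with this environment, for example:
# subprocess.run(["aprun", "-n", "64", "./my_app"], env=env)  # placeholder
```

Building on a copy leaves the calling process's own environment
untouched, so the abort behavior applies only to the job being debugged.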
David Knaak