[mpich-discuss] any clue on this mpiio error

David Knaak knaak at cray.com
Wed Feb 10 15:53:57 CST 2016


> > On 02/10/2016 11:39 AM, Jaln wrote:
> > My jobs on Edison die on IO errors like this:
> >
> > ADIOI_CRAY_WRITECONTIG(284): filename='OUT/rei20_0.g029'
> > error='Input/output error' errno=5 PE=00044 W_rec=33518
> > off=2232920756 len=0000524288
> > See MPICH_MPIIO_ABORT_ON_RW_ERROR.
> >
> > Any ideas about this error message? I couldn't find anything on Google.
> > Thanks

> On Wed, Feb 10, 2016 at 9:46 AM, Rob Latham <robl at mcs.anl.gov> wrote:
> 
> Cray doesn't share their modifications to ROMIO with us.  You'll have
> more luck with your Cray support contact.
> 
> errno 5 indicates a general I/O error of some kind.  Your offset
> (2232920756) is just above 2^31, so maybe you are hitting some kind of
> Cray 32-bit limitation?
> 
> But I'm only guessing, as Cray doesn't share source with us.
> 
> ==rob
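
Rob's 32-bit guess can at least be checked arithmetically. The sketch
below compares the offset and length from the error message against the
signed and unsigned 32-bit limits. The constant and function names are
illustrative only and do not come from any Cray or ROMIO source. It shows
that the offset is past 2^31 - 1 but still below 2^32; it says nothing
about whether such a limit actually exists in Cray's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the error message in the original post. */
static const uint64_t REPORTED_OFFSET = 2232920756ULL;
static const uint64_t REPORTED_LENGTH = 524288ULL;

/* True if v cannot be represented as a signed 32-bit integer. */
static bool exceeds_int32(uint64_t v)
{
    return v > (uint64_t)INT32_MAX;
}

/* True if v cannot be represented as an unsigned 32-bit integer. */
static bool exceeds_uint32(uint64_t v)
{
    return v > (uint64_t)UINT32_MAX;
}
```

Here exceeds_int32(REPORTED_OFFSET) is true, while
exceeds_uint32(REPORTED_OFFSET + REPORTED_LENGTH) is false, so only a
signed 32-bit offset would overflow at this point in the file.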

Hi Jaln,

This message means that Cray's MPIIO had just made a system write() call
and write() returned a status of -1.  The MPIIO routine that made the
call, "ADIOI_CRAY_WriteContig", is giving you as much information as it
can.  "errno" has a value of 5, which translates to the general I/O
error message "Input/output error".  That doesn't really tell you much.
The message recommends that you look at MPICH_MPIIO_ABORT_ON_RW_ERROR in
the Cray MPI "intro_mpi" man page:

  MPICH_MPIIO_ABORT_ON_RW_ERROR
      If set to enable, causes MPI-IO to abort immediately after
      issuing an error message if an I/O error occurs during a
      system read() or write() call. This applies only to I/O
      errors for system read() and write() calls made as a result
      of MPI I/O calls. It does not apply to I/O errors for other
      MPI I/O calls such as MPI_File_open(), nor does it apply to
      read() and write() calls made by means other than MPI I/O
      calls.

      Abort on error is not standard behavior. The MPI Standard
      specifies that the default error handling for MPI I/O calls
      is to return an error code to the application rather than
      aborting the application, but since errors on write or read
      are almost always unexpected and usually not recoverable, it
      may be preferable to abort as soon as the error is detected.
      Doing so does not allow any recovery, but does provide the
      most information about the error and terminates the job
      quickly.

      If the Cray Abnormal Termination Processing (ATP) feature is
      enabled, the abort will result in a full stack backtrace
      written to stderr and a graphical merged stack backtrace
      tree (a "dot" file) that shows exactly where each process
      was at the time of the abort.

      This environment variable is global for all files opened by
      MPI_File_open(). To enable this behavior only for specific
      files, use the MPICH_MPIIO_HINTS abort_on_rw_error option.

      Default: disable
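
For reference, the decoding described earlier (write() returning -1 and
errno being translated to text) looks roughly like this in plain C. This
is a minimal sketch and not Cray's code: checked_write is a hypothetical
helper, and its message format only loosely imitates the one in the
report.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: issue one write() call and report a failure.
 * Returns 0 if write() did not fail, otherwise the errno value it set.
 * Short (partial) writes are not handled in this sketch.
 */
static int checked_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n == -1) {
        int saved = errno;  /* save before any other call can change it */
        fprintf(stderr, "write failed: error='%s' errno=%d len=%zu\n",
                strerror(saved), saved, len);
        return saved;
    }
    return 0;
}
```

Calling checked_write(-1, "x", 1) returns EBADF because the descriptor is
invalid. A failure with errno 5 (EIO on Linux) would be reported the same
way, with the text supplied by strerror().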

If this problem is repeatable, enable ATP (see man atp), set
MPICH_MPIIO_ABORT_ON_RW_ERROR to enable, and see what the backtrace
tells you.

You can contact me directly (knaak at cray.com).

David Knaak


