[mpich-devel] Two issues with MPI_File error handling

Lisandro Dalcin dalcinl at gmail.com
Thu May 7 06:58:20 CDT 2015


This is with the preview release 3.2b2. I discovered these issues
months ago, and it seems I forgot to report them.

1) MPI_File_call_errhandler(file, value) should return MPI_SUCCESS
instead of value under normal operation (as the MPI spec says), and
particularly if the error handler is set to MPI_ERRORS_RETURN. This
bug is trivial to fix.

2) Setting the predefined error handler MPI_ERRORS_ARE_FATAL and
calling MPI_File_call_errhandler(file, errcode) should "gracefully"
abort the processes the MPI way, instead of segfaulting. The problem
here is that ROMIO has to handle MPI_ERRORS_ARE_FATAL in some special
way, and I'm not sure how to implement that. Right now it tries to
invoke a callback that is set to NULL.
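
To make the two expectations concrete, here is a rough sketch of the
dispatch logic as a toy model in plain C. It does not use MPI at all,
and none of the toy_* names come from MPICH or ROMIO; they are
hypothetical stand-ins. The assumption is only that predefined handlers
carry no callback, so they have to be recognized before any function
pointer is called, and that the call itself reports success.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for MPI_SUCCESS and an error class;
 * any nonzero value works for TOY_ERR_FILE. */
enum { TOY_SUCCESS = 0, TOY_ERR_FILE = 27 };

typedef enum { TOY_ERRORS_RETURN, TOY_ERRORS_ARE_FATAL, TOY_USER_HANDLER } toy_kind;
typedef enum { TOY_ACT_NONE, TOY_ACT_ABORT, TOY_ACT_CALLBACK } toy_action;
typedef void (*toy_callback)(int *errorcode);

typedef struct {
    toy_kind kind;
    toy_callback fn;   /* NULL for the predefined handlers */
} toy_errhandler;

/* Example user callback: records the code it was handed. */
int toy_seen_code = TOY_SUCCESS;
void toy_sample_handler(int *errorcode) { toy_seen_code = *errorcode; }

/* Decide what to do without ever touching a NULL callback. */
toy_action toy_pick_action(const toy_errhandler *eh)
{
    assert(eh != NULL);
    if (eh->kind == TOY_ERRORS_ARE_FATAL) return TOY_ACT_ABORT;
    if (eh->kind == TOY_USER_HANDLER && eh->fn != NULL) return TOY_ACT_CALLBACK;
    return TOY_ACT_NONE;
}

/* Issue 1: always return success, never the error code passed in.
 * Issue 2: the fatal handler aborts instead of jumping to address 0. */
int toy_call_errhandler(const toy_errhandler *eh, int errorcode)
{
    switch (toy_pick_action(eh)) {
    case TOY_ACT_ABORT:
        /* Stand-in for aborting "the MPI way"; this path does not return. */
        fprintf(stderr, "fatal error %d, aborting\n", errorcode);
        exit(EXIT_FAILURE);
    case TOY_ACT_CALLBACK:
        eh->fn(&errorcode);
        break;
    case TOY_ACT_NONE:
        break;
    }
    return TOY_SUCCESS;
}
```

With a toy_errhandler of kind TOY_ERRORS_RETURN, the call
toy_call_errhandler(&eh, TOY_ERR_FILE) returns TOY_SUCCESS; with kind
TOY_ERRORS_ARE_FATAL it exits with a nonzero status rather than
crashing. How ROMIO should actually detect the predefined handler is
the part I don't know.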

I'm attaching a test case showcasing the two issues and pasting its output below:

$ cat file_errhdl.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int ierr;
  MPI_File fh;
  MPI_Init(&argc, &argv);

  MPI_File_open(MPI_COMM_WORLD,"/tmp/datafile",
                MPI_MODE_CREATE|MPI_MODE_RDWR|MPI_MODE_DELETE_ON_CLOSE,
                MPI_INFO_NULL,&fh);

  MPI_File_set_errhandler(fh,MPI_ERRORS_RETURN);
  ierr = MPI_File_call_errhandler(fh,MPI_ERR_FILE);
  printf("ierr: %d, expected: %d\n", ierr, (int)MPI_SUCCESS);

  MPI_File_set_errhandler(fh,MPI_ERRORS_ARE_FATAL);
  MPI_File_call_errhandler(fh,MPI_ERR_FILE); /* should abort */
  MPI_File_close(&fh);

  MPI_Finalize();
  return 0;
}

$ mpicc file_errhdl.c

$ ./a.out
ierr: 27, expected: 0
Segmentation fault (core dumped)

$ valgrind -q ./a.out
ierr: 27, expected: 0
==30036== Jump to the invalid address stated on the next line
==30036==    at 0x0: ???
==30036==    by 0x4CE8EDC: PMPI_File_call_errhandler (file_call_errhandler.c:105)
==30036==    by 0x400951: main (in /home/dalcinl/Devel/BUGS-MPI/mpich/a.out)
==30036==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==30036==
==30036==
==30036== Process terminating with default action of signal 11 (SIGSEGV)
==30036==  Bad permissions for mapped region at address 0x0
==30036==    at 0x0: ???
==30036==    by 0x4CE8EDC: PMPI_File_call_errhandler (file_call_errhandler.c:105)
==30036==    by 0x400951: main (in /home/dalcinl/Devel/BUGS-MPI/mpich/a.out)
Segmentation fault (core dumped)


-- 
Lisandro Dalcin
============
Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Numerical Porous Media Center (NumPor)
King Abdullah University of Science and Technology (KAUST)
http://numpor.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 4332
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file_errhdl.c
Type: text/x-csrc
Size: 618 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/devel/attachments/20150507/c672e268/attachment.c>

