[mpich-discuss] Return error code issues - MPICH 3.2.1
Laguna Peralta, Ignacio
lagunaperalt1 at llnl.gov
Wed Oct 10 10:41:41 CDT 2018
Hi MPICH developers,
I would like to report some (unconfirmed) issues with how error return
codes are propagated within the library (e.g., when using the
MPI_ERRORS_RETURN handler).
We are developing a static analysis framework that detects the code
locations of such bugs, and we recently tested it on MPICH 3.2.1. The
framework produces a number of reports, and we would like to confirm
with you whether these are real bugs or just false alarms.
** What the framework does **
It analyzes all functions and call paths within the library trying to
identify cases where a return error code is either:
(a) not saved by the calling function
(b) saved but later overwritten without MPICH taking an action on the
error. An action could be, for example, printing a message, aborting, or
returning the error to the calling function.
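To make the two cases concrete, here is a minimal sketch in C. All of
the names in it (setup_step, compute_step, and so on) are hypothetical
and are not MPICH functions; the sketch only shows the shape of the
patterns the framework looks for, next to a version that propagates
the error.

```c
/* Hypothetical stand-ins; none of these names come from MPICH. */
enum { OK = 0, ERR_SETUP = 1 };

static int setup_step(void)   { return ERR_SETUP; } /* always fails */
static int compute_step(void) { return OK; }        /* always succeeds */

/* Case (a): the return value of setup_step() is never saved. */
int case_a_dropped(void)
{
    int err;
    setup_step();           /* error code discarded here */
    err = compute_step();
    return err;
}

/* Case (b): the error is saved, then overwritten before any action. */
int case_b_overwritten(void)
{
    int err;
    err = setup_step();
    err = compute_step();   /* overwrites ERR_SETUP */
    return err;
}

/* One possible action: return the error to the caller right away. */
int propagated(void)
{
    int err = setup_step();
    if (err != OK)
        return err;
    return compute_step();
}
```

In this sketch both case_a_dropped() and case_b_overwritten() return OK
even though setup_step() failed, while propagated() returns ERR_SETUP.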
An example of case (a) we think we found is in src/mpi/coll/bcast.c:1402:
1400     /* Get the local intracommunicator */
1401     if (!comm_ptr->local_comm)
1402         MPIR_Setup_intercomm_localcomm( comm_ptr );
1403
1404     newcomm_ptr = comm_ptr->local_comm;
1405
1406     /* now do the usual broadcast on this intracommunicator
1407        with rank 0 as root. */
1408     mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0,
                                      newcomm_ptr, errflag);
On line 1402, MPIR_Setup_intercomm_localcomm is called, but the error
code that this function could return is not saved. By contrast, on line
1408 the return value is saved in mpi_errno, which is what we would
expect for line 1402 as well.
We see that MPICH takes actions such as the following when an error is
observed (from the same file):
1408     mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0,
                                      newcomm_ptr, errflag);
1409     if (mpi_errno) {
1410         /* for communication errors, just record the error
                but continue */
1411         *errflag = MPIR_ERR_GET_CLASS(mpi_errno);
1412         MPIR_ERR_SET(mpi_errno, *errflag, "**fail");
1413         MPIR_ERR_ADD(mpi_errno_ret, mpi_errno);
1414     }
So we took this code and adapted it to line 1402. When we then inject an
artificial error code (anything other than MPI_SUCCESS) into
MPIR_Setup_intercomm_localcomm, the change fixes the bug: the program
receives the error when calling MPI_Bcast. Without the change, the
artificially injected error is lost and the program cannot see it.
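The experiment can be modeled outside MPICH with a small toy in C. This
is only a sketch under stated assumptions: every name in it is a
hypothetical stand-in, and toy_record is a much simplified substitute
for the MPIR_ERR_GET_CLASS / MPIR_ERR_SET / MPIR_ERR_ADD sequence quoted
above, not a reimplementation of those macros.

```c
/* Toy model only: every name below is a hypothetical stand-in,
   not an MPICH function or macro. */
#define TOY_SUCCESS 0

/* Stands in for MPIR_Setup_intercomm_localcomm; returns the injected code. */
static int toy_setup_localcomm(int injected)
{
    return injected;
}

/* Stands in for the intracommunicator broadcast; always succeeds here. */
static int toy_bcast_intra(void)
{
    return TOY_SUCCESS;
}

/* Record an error without aborting: set the flag, keep the first error. */
static void toy_record(int err, int *errflag, int *err_ret)
{
    *errflag = err;
    if (*err_ret == TOY_SUCCESS)
        *err_ret = err;
}

/* check_setup == 0 models the unpatched code, where the setup result is
   never acted on; check_setup != 0 models the adapted code. */
int toy_bcast_inter(int injected, int check_setup, int *errflag)
{
    int mpi_errno_ret = TOY_SUCCESS;
    int mpi_errno = toy_setup_localcomm(injected);

    if (check_setup && mpi_errno)
        toy_record(mpi_errno, errflag, &mpi_errno_ret);

    mpi_errno = toy_bcast_intra();
    if (mpi_errno)
        toy_record(mpi_errno, errflag, &mpi_errno_ret);

    return mpi_errno_ret;
}
```

With an injected code of 7, toy_bcast_inter(7, 0, &flag) returns
TOY_SUCCESS and leaves flag untouched, whereas toy_bcast_inter(7, 1,
&flag) returns 7 and sets flag to 7, which mirrors what we observed.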
** Reports **
Our framework reports the same pattern at the following MPICH 3.2.1
locations:
src/mpi/coll/allgather.c:694
src/mpi/coll/allreduce.c:657
src/mpi/coll/bcast.c:1402
src/mpi/coll/iallgather.c:525
src/mpi/coll/iallreduce.c:551
src/mpi/coll/iscatter.c:505
src/mpi/coll/red_scat_block.c:965
src/mpi/coll/scatter.c:509
src/mpi/comm/comm_create.c:314
src/mpi/comm/comm_split.c:158
src/mpi/comm/intercomm_merge.c:301
** What we need from you **
Could someone take a look at these reports and confirm whether they are
real bugs (where some action is missing) or just false alarms, because
at these locations there is no reason to save the error code and act
on it?
Thank you for your help. This will help us improve the framework and,
we hope, let us report issues that make MPICH more reliable.
Thanks!
--
Ignacio Laguna
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
Phone: 925-422-7308, Fax: 925-422-6287