[mpich-discuss] Return error code issues - MPICH 3.2.1

Laguna Peralta, Ignacio lagunaperalt1 at llnl.gov
Wed Oct 10 10:41:41 CDT 2018


Hi MPICH developers,

I would like to report some (unconfirmed) issues with how error return
codes are propagated within the library (e.g., when using the
MPI_ERRORS_RETURN handler).

We are developing a static analysis framework that detects the code
locations of such bugs, and we recently tested it on MPICH 3.2.1. The
framework is giving us a number of reports, and we would like to confirm
with you whether these are real bugs or just false alarms.

** What the framework does **

It analyzes all functions and call paths within the library trying to 
identify cases where a return error code is either:

(a) not saved by the calling function
(b) saved but later overwritten without MPICH taking an action on the
error. An action could be, for example, printing a message, aborting, or
returning the error to the calling function.
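
The two patterns can be illustrated with a minimal sketch. All names
here (setup_step, comm_step, pattern_a, pattern_b, checked, and the
error constants) are hypothetical stand-ins for illustration, not MPICH
functions:

```c
#include <assert.h>

#define OK        0   /* stand-in for MPI_SUCCESS */
#define ERR_SETUP 1   /* hypothetical error code */

/* Hypothetical callee that always fails, so a lost error is visible. */
int setup_step(void) { return ERR_SETUP; }

/* Hypothetical callee that always succeeds. */
int comm_step(void) { return OK; }

/* Case (a): the return value of setup_step() is never saved. */
int pattern_a(void)
{
    int err;
    setup_step();            /* error code dropped here */
    err = comm_step();
    return err;
}

/* Case (b): the error is saved, then overwritten before any action. */
int pattern_b(void)
{
    int err = setup_step();
    err = comm_step();       /* overwrites ERR_SETUP */
    return err;
}

/* One possible action: return the error to the calling function. */
int checked(void)
{
    int err = setup_step();
    if (err != OK)
        return err;
    return comm_step();
}
```

In this sketch, pattern_a() and pattern_b() both return OK even though
setup_step() failed, while checked() returns ERR_SETUP to its caller.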

An example of case (a) we think we found is in src/mpi/coll/bcast.c:1402:

   1400	        /* Get the local intracommunicator */
   1401	        if (!comm_ptr->local_comm)
   1402	            MPIR_Setup_intercomm_localcomm( comm_ptr );
   1403	
   1404	        newcomm_ptr = comm_ptr->local_comm;
   1405	
   1406	        /* now do the usual broadcast on this intracommunicator
   1407	           with rank 0 as root. */
   1408	        mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0, newcomm_ptr, errflag);

In line 1402, MPIR_Setup_intercomm_localcomm is called, but the error
code it could return is not saved. By contrast, line 1408 saves the
return value of MPIR_Bcast_intra in mpi_errno, which is what we would
expect for line 1402 as well.

We see that MPICH takes actions such as the following when an error is
observed (from the same file):

   1408	        mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0, newcomm_ptr, errflag);
   1409	        if (mpi_errno) {
   1410	            /* for communication errors, just record the error but continue */
   1411	            *errflag = MPIR_ERR_GET_CLASS(mpi_errno);
   1412	            MPIR_ERR_SET(mpi_errno, *errflag, "**fail");
   1413	            MPIR_ERR_ADD(mpi_errno_ret, mpi_errno);
   1414	        }

So we took this error-handling code and adapted it to line 1402. When we
then inject an artificial error code (something different from
MPI_SUCCESS) into MPIR_Setup_intercomm_localcomm, the program receives
the error when calling MPI_Bcast, so the change fixes the bug. Without
the change, the artificially injected error is lost and the program
cannot see it.
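
The experiment can be modeled with a minimal, self-contained sketch. It
does not use MPICH internals: setup_localcomm_stub, bcast_intra_stub,
record_error, and INJECTED_ERROR are hypothetical stand-ins, and
record_error is only a simplified "keep the first error" model, not the
actual semantics of the MPIR_ERR_* macros:

```c
#include <assert.h>

#define SUCCESS        0    /* stand-in for MPI_SUCCESS */
#define INJECTED_ERROR 16   /* arbitrary nonzero value, for illustration */

/* Hypothetical stand-in for MPIR_Setup_intercomm_localcomm with an
   artificially injected failure. */
int setup_localcomm_stub(void) { return INJECTED_ERROR; }

/* Hypothetical stand-in for the intracommunicator broadcast. */
int bcast_intra_stub(void) { return SUCCESS; }

/* Simplified model of "record the error but continue": set the flag
   and keep the first error seen. */
void record_error(int *err_ret, int *errflag, int err)
{
    *errflag = err;
    if (*err_ret == SUCCESS)
        *err_ret = err;
}

/* Shape of the reported code: the setup call's result is dropped. */
int bcast_inter_unchecked(int *errflag)
{
    int mpi_errno, mpi_errno_ret = SUCCESS;

    setup_localcomm_stub();               /* return value not saved */

    mpi_errno = bcast_intra_stub();
    if (mpi_errno)
        record_error(&mpi_errno_ret, errflag, mpi_errno);
    return mpi_errno_ret;
}

/* Adapted shape: the setup call is checked like the broadcast call. */
int bcast_inter_checked(int *errflag)
{
    int mpi_errno, mpi_errno_ret = SUCCESS;

    mpi_errno = setup_localcomm_stub();
    if (mpi_errno)
        record_error(&mpi_errno_ret, errflag, mpi_errno);

    mpi_errno = bcast_intra_stub();
    if (mpi_errno)
        record_error(&mpi_errno_ret, errflag, mpi_errno);
    return mpi_errno_ret;
}
```

In this model, bcast_inter_unchecked returns SUCCESS despite the
injected failure, while bcast_inter_checked returns INJECTED_ERROR to
its caller. Whether continuing after a failed setup is appropriate in
the real code is a separate question for the MPICH developers.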

** Reports **

Our framework reports the same pattern at the following MPICH 3.2.1 locations:

src/mpi/coll/allgather.c:694
src/mpi/coll/allreduce.c:657
src/mpi/coll/bcast.c:1402
src/mpi/coll/iallgather.c:525
src/mpi/coll/iallreduce.c:551
src/mpi/coll/iscatter.c:505
src/mpi/coll/red_scat_block.c:965
src/mpi/coll/scatter.c:509
src/mpi/comm/comm_create.c:314
src/mpi/comm/comm_split.c:158
src/mpi/comm/intercomm_merge.c:301

** What we need from you **

Could someone take a look at these reports and confirm whether they are
real bugs (where some action is missing) or false alarms (because there
is no reason to save the error code and act on it at these locations)?

Thank you for your help. Your feedback will help us improve the
framework, and we hope it lets us report issues that make MPICH more
reliable.

Thanks!

-- 
Ignacio Laguna
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
Phone: 925-422-7308, Fax: 925-422-6287