[mpich-discuss] Return error code issues - MPICH 3.2.1

Raffenetti, Kenneth J. raffenet at mcs.anl.gov
Wed Oct 10 16:23:12 CDT 2018


Hi Ignacio,

Would you mind creating a GitHub issue
(https://github.com/pmodels/mpich) to track this? It will take some time
to go through all the locations you have identified, but judging from the
example, IMO you are onto a real issue.

Ken

On 10/10/18 10:41 AM, Laguna Peralta, Ignacio wrote:
> Hi MPICH developers,
> 
> I would like to report some (unconfirmed) issues on how return error
> codes are propagated within the library (e.g., when using the
> MPI_ERRORS_RETURN handler).
> 
> We are developing a static analysis framework that detects the code
> location of such bugs and we have tested it recently in MPICH 3.2.1. The
> framework is giving us a number of reports and we would like to confirm
> with you whether these are real bug cases or just false alarms.
> 
> ** What the framework does **
> 
> It analyzes all functions and call paths within the library trying to
> identify cases where a return error code is either:
> 
> (a) not saved by the calling function
> (b) saved but later overwritten without MPICH taking an action on the
> error. An action could be, for example, printing a message, aborting, or
> returning the error to the calling function.
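
Cases (a) and (b) above can be made concrete with a minimal C sketch. It is not MPICH code: setup(), communicate(), and the error values are hypothetical stand-ins, chosen only so that the first call fails and the second succeeds.

```c
#include <assert.h>

#define OK        0   /* stands in for MPI_SUCCESS */
#define ERR_SETUP 1   /* hypothetical error code, not an MPI value */

/* Hypothetical callee that always fails. */
int setup(void) { return ERR_SETUP; }

/* Hypothetical callee that always succeeds. */
int communicate(void) { return OK; }

/* Case (a): the return value of setup() is never saved. */
int case_a(void)
{
    int err;
    setup();                /* error code dropped here */
    err = communicate();
    return err;
}

/* Case (b): the error is saved, then overwritten before anyone acts on it. */
int case_b(void)
{
    int err = setup();
    err = communicate();    /* ERR_SETUP is lost here */
    return err;
}

/* Checked variant: the error is returned to the caller before it can be lost. */
int checked(void)
{
    int err = setup();
    if (err != OK)
        return err;
    return communicate();
}

/* Usage: with the stubs above, only the checked variant surfaces the failure. */
void self_check(void)
{
    assert(case_a() == OK);
    assert(case_b() == OK);
    assert(checked() == ERR_SETUP);
}
```

In both case_a() and case_b() the caller sees success even though setup() failed, which is the situation the reports are meant to flag.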
> 
> An example of case (a) we think we found is in src/mpi/coll/bcast.c:1402:
> 
>     1400	        /* Get the local intracommunicator */
>     1401	        if (!comm_ptr->local_comm)
>     1402	            MPIR_Setup_intercomm_localcomm( comm_ptr );
>     1403	
>     1404	        newcomm_ptr = comm_ptr->local_comm;
>     1405	
>     1406	        /* now do the usual broadcast on this intracommunicator
>     1407	           with rank 0 as root. */
>     1408	        mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0, newcomm_ptr, errflag);
> 
> In line 1402, MPIR_Setup_intercomm_localcomm is called, but the error
> code it could return is not saved. In line 1408, by contrast, the return
> value of MPIR_Bcast_intra is saved in mpi_errno, which is what we would
> expect for line 1402 as well.
> 
> We see that MPICH takes actions such as the following when an error is
> observed (from the same file):
> 
>     1408	        mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0, newcomm_ptr, errflag);
>     1409	        if (mpi_errno) {
>     1410	            /* for communication errors, just record the error but continue */
>     1411	            *errflag = MPIR_ERR_GET_CLASS(mpi_errno);
>     1412	            MPIR_ERR_SET(mpi_errno, *errflag, "**fail");
>     1413	            MPIR_ERR_ADD(mpi_errno_ret, mpi_errno);
>     1414	        }
> 
> So we took this code and adapted it to line 1402. When we then inject an
> artificial error code (something different from MPI_SUCCESS) into
> MPIR_Setup_intercomm_localcomm, the program receives the error when
> calling MPI_Bcast. Without the change, the artificially injected error
> is lost and the program cannot see it.
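
The injection experiment described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the sketch_* functions, injected_error, and record_error() are hypothetical stand-ins, not MPICH functions or macros, and the error handling is a simplified imitation of the "record the error but continue" pattern quoted from lines 1408-1414. Whether continuing after a failed setup is the right policy in MPICH itself is not settled here.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS 0   /* stands in for MPI_SUCCESS */

/* Test knob: the code the stubbed setup call returns (the "injected" error). */
int injected_error = SKETCH_SUCCESS;

/* Hypothetical stand-in for MPIR_Setup_intercomm_localcomm. */
int sketch_setup_localcomm(void) { return injected_error; }

/* Hypothetical stand-in for MPIR_Bcast_intra; it always succeeds here. */
int sketch_bcast_intra(void) { return SKETCH_SUCCESS; }

/* Simplified imitation of the quoted handling: remember the error in
 * *errflag and keep the first error seen as the eventual return value. */
static void record_error(int err, int *errflag, int *ret)
{
    if (err != SKETCH_SUCCESS) {
        *errflag = err;
        if (*ret == SKETCH_SUCCESS)
            *ret = err;
    }
}

/* Shape of the reported code: the setup result is discarded. */
int bcast_unpatched(void)
{
    sketch_setup_localcomm();
    return sketch_bcast_intra();
}

/* Shape of the adapted code: the setup result is saved and acted upon. */
int bcast_patched(int *errflag)
{
    int mpi_errno;
    int mpi_errno_ret = SKETCH_SUCCESS;

    assert(errflag != NULL);

    mpi_errno = sketch_setup_localcomm();
    record_error(mpi_errno, errflag, &mpi_errno_ret);

    mpi_errno = sketch_bcast_intra();
    record_error(mpi_errno, errflag, &mpi_errno_ret);

    return mpi_errno_ret;
}
```

Setting injected_error to a nonzero value and calling both variants reproduces the observation: bcast_unpatched() still reports success, while bcast_patched() returns the injected code and sets *errflag.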
> 
> ** Reports **
> 
> Our framework is reporting the same case at the following MPICH 3.2.1 locations:
> 
> src/mpi/coll/allgather.c:694
> src/mpi/coll/allreduce.c:657
> src/mpi/coll/bcast.c:1402
> src/mpi/coll/iallgather.c:525
> src/mpi/coll/iallreduce.c:551
> src/mpi/coll/iscatter.c:505
> src/mpi/coll/red_scat_block.c:965
> src/mpi/coll/scatter.c:509
> src/mpi/comm/comm_create.c:314
> src/mpi/comm/comm_split.c:158
> src/mpi/comm/intercomm_merge.c:301
> 
> ** What we need from you **
> 
> Could someone take a look at these reports and confirm whether they are
> real bug cases (where some action is missing) or just false alarms,
> because in these locations there is no reason to save the error code
> and take an action?
> 
> Thank you for your help. This will help us improve the framework and
> hopefully let us report issues to MPICH to make it more reliable.
> 
> Thanks!
> 

