[mpich-discuss] Return error code issues - MPICH 3.2.1

Ignacio Laguna lagunaperalt1 at llnl.gov
Wed Oct 10 17:03:48 CDT 2018


Hi Ken,

Thank you, I just submitted an issue (#3366).

Note that all the reports are associated with the same function call, 
MPIR_Setup_intercomm_localcomm, apparently used for intercommunicators, 
so they essentially refer to the same class of bug, but in different 
files or communication functions.

If at least a simple case could be verified (e.g., bcast), it would be 
useful for us to mark it as a success case and continue improving the 
analysis.

Thank you!!

Ignacio


On 10/10/18 2:23 PM, Raffenetti, Kenneth J. wrote:
> Hi Ignacio,
> 
> Would you mind creating a Github issue
> (https://github.com/pmodels/mpich) to track this? It will take some time
> to go through all the locations you have identified, but IMO from the
> example you are onto a real issue.
> 
> Ken
> 
> On 10/10/18 10:41 AM, Laguna Peralta, Ignacio wrote:
>> Hi MPICH developers,
>>
>> I would like to report some (unconfirmed) issues on how return error
>> codes are propagated within the library (e.g., when using the
>> MPI_ERRORS_RETURN handler).
>>
>> We are developing a static analysis framework that detects the code
>> location of such bugs and we have tested it recently in MPICH 3.2.1. The
>> framework is giving us a number of reports and we would like to confirm
>> with you whether these are real bug cases or just false alarms.
>>
>> ** What the framework does **
>>
>> It analyzes all functions and call paths within the library trying to
>> identify cases where a return error code is either:
>>
>> (a) not saved by the calling function
>> (b) saved but later overwritten without MPICH taking an action on the
>> error. An action could be, for example, printing a message, aborting, or
>> returning the error to the calling function.
>>
>> An example of case (a) we think we found is in src/mpi/coll/bcast.c:1402:
>>
>>      1400	        /* Get the local intracommunicator */
>>      1401	        if (!comm_ptr->local_comm)
>>      1402	            MPIR_Setup_intercomm_localcomm( comm_ptr );
>>      1403	
>>      1404	        newcomm_ptr = comm_ptr->local_comm;
>>      1405	
>>      1406	        /* now do the usual broadcast on this intracommunicator
>>      1407	           with rank 0 as root. */
>>      1408	        mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0, newcomm_ptr, errflag);
>>
>> In line 1402, MPIR_Setup_intercomm_localcomm is called, but the error
>> code that this function could return is not saved. In line 1408, by
>> contrast, the error code is saved in mpi_errno, which is what we would
>> expect for line 1402 as well.
>>
>> We see that MPICH takes actions such as the following when an error is
>> observed (from the same file):
>>
>>      1408	        mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0, newcomm_ptr, errflag);
>>      1409	        if (mpi_errno) {
>>      1410	            /* for communication errors, just record the error but continue */
>>      1411	            *errflag = MPIR_ERR_GET_CLASS(mpi_errno);
>>      1412	            MPIR_ERR_SET(mpi_errno, *errflag, "**fail");
>>      1413	            MPIR_ERR_ADD(mpi_errno_ret, mpi_errno);
>>      1414	        }
>>
>> So we took this code and adapted it to line 1402. When we then inject an
>> artificial error code (something different from MPI_SUCCESS) into
>> MPIR_Setup_intercomm_localcomm, the adapted code fixes the bug and the
>> program receives the error when calling MPI_Bcast; without the change,
>> the artificially injected error is lost and the program cannot see it.
>>
>> ** Reports **
>>
>> Our framework is reporting the same case in the following MPICH 3.2.1
>> locations:
>>
>> src/mpi/coll/allgather.c:694
>> src/mpi/coll/allreduce.c:657
>> src/mpi/coll/bcast.c:1402
>> src/mpi/coll/iallgather.c:525
>> src/mpi/coll/iallreduce.c:551
>> src/mpi/coll/iscatter.c:505
>> src/mpi/coll/red_scat_block.c:965
>> src/mpi/coll/scatter.c:509
>> src/mpi/comm/comm_create.c:314
>> src/mpi/comm/comm_split.c:158
>> src/mpi/comm/intercomm_merge.c:301
>>
>> ** What we need from you **
>>
>> Could someone take a look at these reports and confirm whether they are
>> real bug cases (where some action is missing) or just false alarms,
>> because in these locations there is no reason to save the error code
>> and take an action?
>>
>> Thank you for your help. This will help us improve the framework and
>> hopefully report issues to MPICH to make it more reliable.
>>
>> Thanks!
>>

