[mpich-discuss] Return error code issues - MPICH 3.2.1
Raffenetti, Kenneth J.
raffenet at mcs.anl.gov
Wed Oct 10 16:23:12 CDT 2018
Hi Ignacio,
Would you mind creating a GitHub issue
(https://github.com/pmodels/mpich) to track this? It will take some time
to go through all the locations you have identified, but judging from the
example, IMO you are onto a real issue.
Ken
On 10/10/18 10:41 AM, Laguna Peralta, Ignacio wrote:
> Hi MPICH developers,
>
> I would like to report some (unconfirmed) issues on how return error
> codes are propagated within the library (e.g., when using the
> MPI_ERRORS_RETURN handler).
>
> We are developing a static analysis framework that detects the code
> location of such bugs and we have tested it recently in MPICH 3.2.1. The
> framework is giving us a number of reports and we would like to confirm
> with you whether these are real bug cases or just false alarms.
>
> ** What the framework does **
>
> It analyzes all functions and call paths within the library trying to
> identify cases where a return error code is either:
>
> (a) not saved by the calling function
> (b) saved but later overwritten without MPICH taking an action on the
> error. An action could be, for example, printing a message, aborting, or
> returning the error to the calling function.
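
The two cases above can be illustrated with a minimal sketch in C. All names here (setup_step, work_step, SK_OK, SK_ERR_SETUP) are hypothetical and chosen for illustration; none of this is MPICH code. The third function shows one possible "action" from the list above, namely returning the error to the calling function.

```c
#include <stdio.h>

/* Hypothetical error codes, for illustration only. */
#define SK_OK        0
#define SK_ERR_SETUP 1

/* Hypothetical callee that always fails. */
static int setup_step(void) { return SK_ERR_SETUP; }

/* Hypothetical callee that always succeeds. */
static int work_step(void) { return SK_OK; }

/* Case (a): the return value of setup_step() is never saved. */
static int case_a_unsaved(void)
{
    int err;
    setup_step();           /* error code is dropped here */
    err = work_step();
    return err;             /* caller sees SK_OK */
}

/* Case (b): the error is saved, then overwritten before it is checked. */
static int case_b_overwritten(void)
{
    int err;
    err = setup_step();
    err = work_step();      /* overwrites SK_ERR_SETUP */
    return err;             /* caller sees SK_OK */
}

/* One possible action: check the code and return it to the caller. */
static int checked(void)
{
    int err;
    err = setup_step();
    if (err != SK_OK)
        return err;         /* caller sees SK_ERR_SETUP */
    err = work_step();
    return err;
}
```

In this sketch, case_a_unsaved() and case_b_overwritten() both return SK_OK even though setup_step() failed, while checked() returns SK_ERR_SETUP.
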
>
> An example of case (a) we think we found is in src/mpi/coll/bcast.c:1402:
>
> 1400 /* Get the local intracommunicator */
> 1401 if (!comm_ptr->local_comm)
> 1402 MPIR_Setup_intercomm_localcomm( comm_ptr );
> 1403
> 1404 newcomm_ptr = comm_ptr->local_comm;
> 1405
> 1406 /* now do the usual broadcast on this intracommunicator
> 1407 with rank 0 as root. */
> 1408 mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0,
> newcomm_ptr, errflag);
>
> In line 1402, the MPIR_Setup_intercomm_localcomm function is called but
> the error code that it could return is not saved. In line 1408, by
> contrast, the error code is saved in mpi_errno, which is what we would
> expect for line 1402 as well.
>
> We see that MPICH takes actions such as the following when an error is
> observed (from the same file):
>
> 1408 mpi_errno = MPIR_Bcast_intra(buffer, count, datatype, 0,
> newcomm_ptr, errflag);
> 1409 if (mpi_errno) {
> 1410 /* for communication errors, just record the error
> but continue */
> 1411 *errflag = MPIR_ERR_GET_CLASS(mpi_errno);
> 1412 MPIR_ERR_SET(mpi_errno, *errflag, "**fail");
> 1413 MPIR_ERR_ADD(mpi_errno_ret, mpi_errno);
> 1414 }
>
> So we took this code and adapted it to line 1402. When we then inject an
> artificial error code (anything other than MPI_SUCCESS) into
> MPIR_Setup_intercomm_localcomm, we see that this fixes the bug and the
> program receives the error when calling MPI_Bcast; otherwise the
> artificially injected error is lost and the program cannot see it.
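
The experiment described above can be approximated with a self-contained sketch. This is not the actual patch and does not use MPICH internals: every sketch_* name, the injected_setup_error knob, and the error values are assumptions made for illustration. In particular, sketch_err_add is a simplified stand-in that keeps the first recorded error; it is not claimed to match the behavior of MPIR_ERR_ADD.

```c
#include <stddef.h>

/* Hypothetical error codes; real MPI error values differ. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_OTHER = 15 };

/* Hypothetical fault-injection knob: the code the setup call returns. */
static int injected_setup_error = SKETCH_SUCCESS;

/* Stand-in for the setup call whose return value was being ignored. */
static int sketch_setup_localcomm(void) { return injected_setup_error; }

/* Stand-in for the intracommunicator broadcast; always succeeds here. */
static int sketch_bcast_intra(void) { return SKETCH_SUCCESS; }

/* Simplified accumulator: remember the first error that was recorded. */
static void sketch_err_add(int *ret, int err)
{
    if (*ret == SKETCH_SUCCESS)
        *ret = err;
}

/* Original shape: the setup call's return value is dropped. */
static int sketch_bcast_unfixed(void)
{
    int mpi_errno = SKETCH_SUCCESS, mpi_errno_ret = SKETCH_SUCCESS;

    sketch_setup_localcomm();                 /* error lost here */

    mpi_errno = sketch_bcast_intra();
    if (mpi_errno)
        sketch_err_add(&mpi_errno_ret, mpi_errno);
    return mpi_errno_ret;
}

/* Adapted shape: the same record-and-continue check is applied to setup. */
static int sketch_bcast_fixed(void)
{
    int mpi_errno = SKETCH_SUCCESS, mpi_errno_ret = SKETCH_SUCCESS;

    mpi_errno = sketch_setup_localcomm();
    if (mpi_errno)
        sketch_err_add(&mpi_errno_ret, mpi_errno);

    mpi_errno = sketch_bcast_intra();
    if (mpi_errno)
        sketch_err_add(&mpi_errno_ret, mpi_errno);
    return mpi_errno_ret;
}
```

Setting injected_setup_error to SKETCH_ERR_OTHER and calling both functions shows the difference: the unfixed variant still returns SKETCH_SUCCESS, while the fixed variant returns the injected code to its caller.
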
>
> ** Reports **
>
> Our framework is reporting the same case in other MPICH 3.2.1 locations:
>
> src/mpi/coll/allgather.c:694
> src/mpi/coll/allreduce.c:657
> src/mpi/coll/bcast.c:1402
> src/mpi/coll/iallgather.c:525
> src/mpi/coll/iallreduce.c:551
> src/mpi/coll/iscatter.c:505
> src/mpi/coll/red_scat_block.c:965
> src/mpi/coll/scatter.c:509
> src/mpi/comm/comm_create.c:314
> src/mpi/comm/comm_split.c:158
> src/mpi/comm/intercomm_merge.c:301
>
> ** What we need from you **
>
> Could someone take a look at these reports and confirm whether they are
> real bugs (where some action is missing) or just false alarms, because
> in these locations there is no reason to save the error code and take
> an action?
>
> Thank you for your help. This will help us improve the framework and,
> we hope, report issues to MPICH that make it more reliable.
>
> Thanks!
>