[mpich-discuss] issue/question

Michael P. Deignan michael.p.deignan at gmail.com
Fri Sep 15 10:07:58 CDT 2023


I have a user who is running an MPI model using mvapich2, compiled with the 
Intel oneAPI compiler (2023.1.0) on a Rocky 8.6 OpenHPC cluster. Bear with me 
for a minute as I lay the foundation to get to rdma-core.

Periodically (randomly after X minutes, where X has been as few as 5 and as 
many as 30) the model will crash, and the error log will contain thousands 
of messages like this:

[mv2_mcast_resend_window] Failed to post mcast send errno:0

Basically this message repeats until the disk fills up and the job terminates 
with an out of disk space message.

I tracked the error to the mvapich2 source code at 
src/mpid/ch3/channels/common/include/ibv_mcast.h:

    int ret;                                                    \
    ret = ibv_post_send(_mcast_ctx->ud_ctx->qp,                 \
                &(_v->desc.u.sr), &(_v->desc.y.bad_sr));        \
    if (ret) {                                                  \
        PRINT_ERROR("Failed to post mcast send errno:%d\n",     \
                                errno);                         \
    }                                                           \
    _mcast_ctx->ud_ctx->send_wqes_avail--;                      \
    while (_mcast_ctx->ud_ctx->send_wqes_avail <= 0) {          \
        MPIDI_CH3I_Progress(FALSE, NULL);                       \
    }

which leads me to the ibv_post_send subroutine in include/infiniband/verbs.h:

/**
 * ibv_post_send - Post a list of work requests to a send queue.
 *
 * If IBV_SEND_INLINE flag is set, the data buffers can be reused
 * immediately after the call returns.
 */
static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                struct ibv_send_wr **bad_wr)
{
        return qp->context->ops.post_send(qp, wr, bad_wr);
}

For some reason, this function call (ibv_post_send) is returning a nonzero 
value to the caller (the "if (ret)" branch is taken), yet no error code is 
being set (errno = 0).

As the model does run for a random amount of time before failing, this would 
seem to suggest some type of hardware problem, but everything we've checked 
(system, IB card, IB switch, etc.) doesn't show any errors.

Can anyone shed some light on the circumstances under which a call to 
ibv_post_send would start failing, without setting errno, when previously it 
succeeded?

I realize I'm probably not giving much data to really help. I suppose there 
could be some type of underlying operating system issue, but even there I'm 
not sure how I would determine where the problem is, since everything else is 
running fine (it appears that only this one model causes the problem) and 
there is nothing abnormal in the event logs. I tried running the model on an 
older Rocks CentOS 7.x cluster and received the same error.

Having some basic understanding of what would cause the post send call to 
fail and not set an error code might help.
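
One thing that may be worth checking: per the ibv_post_send(3) man page, the 
call returns 0 on success or the value of errno on failure, so the error code 
would be carried in ret rather than in the global errno. That could explain 
the "errno:0" in the log. The sketch below is only an illustration of that 
convention: fake_post_send() and describe_post_send() are hypothetical names 
that are not part of mvapich2 or rdma-core, and ENOMEM is used purely as an 
example failure value.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for ibv_post_send(): it reports failure through
 * its return value and leaves the global errno untouched. */
static int fake_post_send(int simulate_failure)
{
    return simulate_failure ? ENOMEM : 0;
}

/* Build a diagnostic from the return code instead of from errno.
 * Writes an empty string and returns 0 when ret indicates success;
 * otherwise returns the length of the formatted message. */
static int describe_post_send(int ret, char *buf, size_t len)
{
    if (ret == 0) {
        if (len > 0)
            buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, len, "Failed to post mcast send ret:%d (%s)",
                    ret, strerror(ret));
}
```

If the real call behaves the same way here, changing the PRINT_ERROR argument 
from errno to ret (or to strerror(ret)) should at least show which error the 
provider is reporting.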