[mpich-discuss] issue/question

Panda, Dhabaleswar panda at cse.ohio-state.edu
Fri Sep 15 10:46:01 CDT 2023


Hi Michael,

I am cc’ing this note to the MVAPICH-Core developers. One of the developers will follow-up with you.

Thanks,

DK

From: Thakur, Rajeev via discuss <discuss at mpich.org>
Sent: Friday, September 15, 2023 11:42 AM
To: discuss at mpich.org
Cc: Thakur, Rajeev <thakur at anl.gov>
Subject: Re: [mpich-discuss] issue/question

The mailing list for MVAPICH2 is mvapich-discuss at lists.osu.edu.

Rajeev

From: "Michael P. Deignan via discuss" <discuss at mpich.org<mailto:discuss at mpich.org>>
Reply-To: "discuss at mpich.org<mailto:discuss at mpich.org>" <discuss at mpich.org<mailto:discuss at mpich.org>>
Date: Friday, September 15, 2023 at 10:08 AM
To: "discuss at mpich.org<mailto:discuss at mpich.org>" <discuss at mpich.org<mailto:discuss at mpich.org>>
Cc: "Michael P. Deignan" <michael.p.deignan at gmail.com<mailto:michael.p.deignan at gmail.com>>
Subject: [mpich-discuss] issue/question



I have a user who is running an MPI model using MVAPICH2, compiled with the Intel oneAPI compiler (2023.1.0) on a Rocky 8.6 OpenHPC cluster. Bear with me for a minute as I lay the foundation to get to rdma-core.



Periodically (randomly, after X minutes, where X has been as few as 5 and as many as 30) the model will crash, and the error log fills with thousands of copies of this message:



[mv2_mcast_resend_window] Failed to post mcast send errno:0



Basically, this message repeats until the disk fills up and the job terminates with an out-of-disk-space error.



I tracked the error message to the MVAPICH2 source code, in src/mpid/ch3/channels/common/include/ibv_mcast.h:





    int ret;                                                    \
    ret = ibv_post_send(_mcast_ctx->ud_ctx->qp,                 \
                &(_v->desc.u.sr), &(_v->desc.y.bad_sr));        \
    if (ret) {                                                  \
        PRINT_ERROR("Failed to post mcast send errno:%d\n",     \
                                errno);                         \
    }                                                           \
    _mcast_ctx->ud_ctx->send_wqes_avail--;                      \
    while (_mcast_ctx->ud_ctx->send_wqes_avail <= 0) {          \
        MPIDI_CH3I_Progress(FALSE, NULL);                       \
    }
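
For anyone reading along without the MVAPICH2 tree open, here is my reading of that fragment rewritten as a plain function, with simplified names standing in for the _mcast_ctx/_v fields (this is my paraphrase for discussion, not the actual code):

#include <errno.h>
#include <stdio.h>
#include <infiniband/verbs.h>

/* Paraphrase of the mcast resend path quoted above; 'ud_qp', 'sr' and
 * 'send_wqes_avail' stand in for the _mcast_ctx / _v fields. */
static void mcast_resend_paraphrase(struct ibv_qp *ud_qp,
                                    struct ibv_send_wr *sr,
                                    int *send_wqes_avail)
{
    struct ibv_send_wr *bad_sr = NULL;

    /* Post the multicast send WR on the UD QP. */
    int ret = ibv_post_send(ud_qp, sr, &bad_sr);
    if (ret) {
        /* This is the message that floods our logs; note that it
         * prints errno, not ret. */
        fprintf(stderr, "Failed to post mcast send errno:%d\n", errno);
    }

    /* One send WQE has been consumed; when none remain, the real code
     * spins in MPIDI_CH3I_Progress() until completions free some up
     * (omitted here, since the progress engine can't be reproduced in
     * a standalone sketch). */
    (*send_wqes_avail)--;
}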



which leads me to the ibv_post_send subroutine in include/infiniband/verbs.h:





/**
 * ibv_post_send - Post a list of work requests to a send queue.
 *
 * If IBV_SEND_INLINE flag is set, the data buffers can be reused
 * immediately after the call returns.
 */
static inline int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                struct ibv_send_wr **bad_wr)
{
        return qp->context->ops.post_send(qp, wr, bad_wr);
}



For some reason, this call to ibv_post_send is returning a nonzero value to the caller, but no error code shows up in errno (errno = 0), so the logged message tells me nothing about why the post failed.
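
One thing I noticed while digging: the ibv_post_send(3) man page describes the return value, not errno, as the failure indication (0 on success, otherwise an error code), with *bad_wr pointing at the first work request that could not be posted. If the provider hands back that code without touching the global errno, that would explain the errno:0 above. So my next step is probably to log the return value as well; a rough standalone sketch of the kind of check I mean (the wrapper name is mine, purely for illustration):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Hypothetical wrapper: post a send WR and log the return code itself,
 * since errno alone has been printing as 0 in our failures. */
static int post_send_logged(struct ibv_qp *qp, struct ibv_send_wr *wr)
{
    struct ibv_send_wr *bad_wr = NULL;
    int ret = ibv_post_send(qp, wr, &bad_wr);

    if (ret) {
        /* If ret is an errno-style code, strerror() will name it;
         * bad_wr identifies which WR in the list was rejected. */
        fprintf(stderr,
                "ibv_post_send failed: ret=%d (%s) errno=%d wr_id=%llu\n",
                ret, strerror(ret), errno,
                bad_wr ? (unsigned long long)bad_wr->wr_id : 0ULL);
    }
    return ret;
}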



Since the model runs for a random amount of time before crashing, this would seem to suggest some kind of hardware problem, but everything we've checked (system, IB card, IB switch, etc.) shows no errors.



Can anyone shed some light on the circumstances under which a call to ibv_post_send would start failing this way (nonzero return, errno left at 0) after having worked fine up to that point?
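
In case it helps frame an answer: one cheap diagnostic I'm considering is dumping the state of the multicast UD QP when the failure first shows up, to rule out the QP having dropped out of RTS. I don't know whether that is what's happening here; this is just an untested sketch, and 'qp' would be the _mcast_ctx->ud_ctx->qp from the snippet above:

#include <stdio.h>
#include <infiniband/verbs.h>

/* One-off diagnostic: print the current state of a QP right after
 * ibv_post_send starts failing. */
static void dump_qp_state(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init_attr;

    if (ibv_query_qp(qp, &attr, IBV_QP_STATE | IBV_QP_CAP, &init_attr)) {
        perror("ibv_query_qp");
        return;
    }

    /* attr.qp_state is an enum ibv_qp_state (IBV_QPS_RTS, IBV_QPS_SQE,
     * IBV_QPS_ERR, ...); attr.cap.max_send_wr is the send queue depth. */
    fprintf(stderr, "QP 0x%x state=%d max_send_wr=%u\n",
            qp->qp_num, attr.qp_state, attr.cap.max_send_wr);
}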



I realize I'm probably not giving enough data to really help. I suppose there could be some kind of underlying operating system issue, but even then I'm not sure how I would determine where the problem lies: everything else runs fine, only this one model seems to trigger the problem, and there is nothing abnormal in the system logs. I also tried running the model on an older Rocks CentOS 7.x cluster and hit the same error there.



Having some basic understanding of what would cause the post-send call to fail without setting an error code might help.