[mpich-discuss] Fail on MPI_Wait

Zhou, Hui zhouh at anl.gov
Fri Jun 21 13:03:11 CDT 2024


>../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_decptn/install \
            --with-device=ch4:ofi:sockets --with-libfabric=embedded \
            --without-ucx CC=gcc CXX=g++

You are statically configured to use the "sockets" provider. Try --with-device=ch4:ofi instead.
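
For example, keeping your other flags and just dropping the hard-coded provider (a sketch; same flags as the configure line quoted above):

../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_decptn/install \
            --with-device=ch4:ofi --with-libfabric=embedded \
            --without-ucx CC=gcc CXX=g++

The provider can then be selected at run time with FI_PROVIDER.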

Hui


________________________________
From: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Sent: Friday, June 21, 2024 12:27 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Subject: RE: Fail on MPI_Wait


Hui,



When I set FI_PROVIDER=tcp, the code crashes in MPI_Init. Specifically, this code will fail on one process:



#include "mpi.h"



int main(int argc, char **argv) {

  MPI_Init(&argc, &argv);

  MPI_Finalize();

}
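
A minimal way to build and run it would be something like this (init.c is just a placeholder file name):

mpicc init.c -o init
FI_PROVIDER=tcp mpiexec -n 1 ./init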



I’m running on a system with the following modules



[d3g293@deception02 testing]$ module list
Currently Loaded Modulefiles:
  1) gcc/11.2.0            3) python/3.7.0          5) mkl/2019u4
  2) cmake/3.21.4          4) git/2.42.0(default)   6) cuda/11.8



and a home-built version of mpich-4.2.1 configured with



../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_decptn/install \
            --with-device=ch4:ofi:sockets --with-libfabric=embedded \
            --without-ucx CC=gcc CXX=g++



I thought it might have something to do with my application build using a configuration that is set up to include CUDA, but it also fails in MPI_Init with a non-CUDA configuration if I set the FI_PROVIDER variable.

Bruce



From: Zhou, Hui <zhouh at anl.gov>
Sent: Friday, June 14, 2024 9:27 AM
To: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>; discuss at mpich.org
Subject: Re: Fail on MPI_Wait



Never mind. It is v4.2.1.

________________________________

From: Zhou, Hui <zhouh at anl.gov<mailto:zhouh at anl.gov>>
Sent: Friday, June 14, 2024 11:26 AM
To: Palmer, Bruce J <Bruce.Palmer at pnnl.gov<mailto:Bruce.Palmer at pnnl.gov>>; discuss at mpich.org<mailto:discuss at mpich.org> <discuss at mpich.org<mailto:discuss at mpich.org>>
Subject: Re: Fail on MPI_Wait



Bruce,



What is the MPICH version, BTW?



--

Hui

________________________________

From: Zhou, Hui <zhouh at anl.gov<mailto:zhouh at anl.gov>>
Sent: Friday, June 14, 2024 10:55 AM
To: Palmer, Bruce J <Bruce.Palmer at pnnl.gov<mailto:Bruce.Palmer at pnnl.gov>>; discuss at mpich.org<mailto:discuss at mpich.org> <discuss at mpich.org<mailto:discuss at mpich.org>>
Subject: Re: Fail on MPI_Wait



Bruce,



You are using the sockets provider. Could you try setting FI_PROVIDER=tcp to see if it makes a difference?

Meanwhile, if you can get a small reproducer, with the sockets provider or any other provider, I'll try to debug it. It is difficult to guess the true source of the issue without one.
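
Something as small as the following nonblocking send/wait pair would already help (a hypothetical host-memory sketch, not your actual Global Arrays code):

#include "mpi.h"

/* Hypothetical reproducer sketch: rank 0 posts a nonblocking send and
 * waits on the request; rank 1 posts the matching blocking receive. */
int main(int argc, char **argv)
{
    int rank, buf = 42;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the call that fails in your trace */
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Run it the same way the failing job is launched, e.g. two nodes with one process per node.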



--

Hui

________________________________

From: Palmer, Bruce J <Bruce.Palmer at pnnl.gov<mailto:Bruce.Palmer at pnnl.gov>>
Sent: Friday, June 14, 2024 10:47 AM
To: Zhou, Hui <zhouh at anl.gov<mailto:zhouh at anl.gov>>; discuss at mpich.org<mailto:discuss at mpich.org> <discuss at mpich.org<mailto:discuss at mpich.org>>
Subject: Re: Fail on MPI_Wait



The output to standard out from running on 2 nodes and one process per node is attached.



From: Zhou, Hui <zhouh at anl.gov<mailto:zhouh at anl.gov>>
Date: Tuesday, June 11, 2024 at 5:49 PM
To: discuss at mpich.org<mailto:discuss at mpich.org> <discuss at mpich.org<mailto:discuss at mpich.org>>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov<mailto:Bruce.Palmer at pnnl.gov>>
Subject: Re: Fail on MPI_Wait

>MPIDI_OFI_handle_cq_error(793): OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)

This is an error coming from the libfabric provider. First we need to find out which provider you are using. Try setting the environment variable MPIR_CVAR_DEBUG_SUMMARY=1 and running a simple MPI_Init+MPI_Finalize test code. Could you post its console output?
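
For example (assuming the test binary is named init):

MPIR_CVAR_DEBUG_SUMMARY=1 mpiexec -n 2 ./init

The summary printed during MPI_Init should show which libfabric provider was selected.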



--

Hui

________________________________

From: Palmer, Bruce J via discuss <discuss at mpich.org<mailto:discuss at mpich.org>>
Sent: Tuesday, June 11, 2024 3:17 PM
To: discuss at mpich.org<mailto:discuss at mpich.org> <discuss at mpich.org<mailto:discuss at mpich.org>>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov<mailto:Bruce.Palmer at pnnl.gov>>
Subject: [mpich-discuss] Fail on MPI_Wait




Hi,



I’m trying to debug a GPU-aware runtime for the Global Arrays library. We had a version of this working a while ago, but it has mysteriously started failing and we are trying to track down why. Currently, we are getting failures in MPI_Wait and were wondering if anyone could provide some information on what exactly seems to be failing inside the wait call. The error we are getting is



Abort(206752655) on node 0: Fatal error in internal_Wait: Other MPI error, error stack:
internal_Wait(68205)..........: MPI_Wait(request=0x500847a0, status=0x7ffff9331800) failed
MPIR_Wait(780)................:
MPIR_Wait_state(737)..........:
MPIDI_progress_test(134)......:
MPIDI_OFI_handle_cq_error(793): OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)



I’ve verified that the handle corresponding to 0x500847a0 is set earlier in the code by an MPI_Isend call and that no MPI_Wait or MPI_Test is called on that handle before it crashes with the above error message. I’m using MPICH 4.2.1 built with gcc/8.3.0. The MPICH library was configured with



../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_newell/install \
            --with-device=ch4:ofi:sockets --with-libfabric=embedded \
            --without-ucx --enable-threads=multiple --with-slurm \
            CC=gcc CXX=g++



I’ve tried building with UCX and gotten the same results.



Are these errors indicative of corruption of the request handle, of problems with some internal MPI data structures, or of something else? Any information you can provide would be appreciated.

Thanks,

Bruce