[mpich-discuss] Fail on MPI_Wait
Palmer, Bruce J
Bruce.Palmer at pnnl.gov
Fri Jun 14 10:47:37 CDT 2024
The output to standard out from running on 2 nodes and one process per node is attached.
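For reference, the test was launched with an invocation along the following lines (the srun flags shown here are an illustrative assumption based on the --with-slurm build; adjust for your scheduler):

  MPIR_CVAR_DEBUG_SUMMARY=1 srun -N 2 --ntasks-per-node=1 ./init_finalize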
From: Zhou, Hui <zhouh at anl.gov>
Date: Tuesday, June 11, 2024 at 5:49 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: Re: Fail on MPI_Wait
>MPIDI_OFI_handle_cq_error(793): OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
This is an error coming from the libfabric provider. First we need to find out which provider you are using. Try setting the environment variable MPIR_CVAR_DEBUG_SUMMARY=1 and running a simple MPI_Init + MPI_Finalize test code. Could you post its console output?
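A minimal test along those lines might look like this (just a sketch; the provider summary is printed by MPICH itself when MPIR_CVAR_DEBUG_SUMMARY=1 is set, so the program body can be trivial):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0)
          printf("init/finalize completed\n");
      MPI_Finalize();
      return 0;
  }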
--
Hui
________________________________
From: Palmer, Bruce J via discuss <discuss at mpich.org>
Sent: Tuesday, June 11, 2024 3:17 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: [mpich-discuss] Fail on MPI_Wait
Hi,
I’m trying to debug a GPU-aware runtime for the Global Arrays library. We had a version of this working a while ago, but it has mysteriously started failing, and we are trying to track down why. Currently we are getting failures in MPI_Wait, and we were wondering if anyone could provide some information on what exactly seems to be failing inside the wait call. The error we are getting is:
Abort(206752655) on node 0: Fatal error in internal_Wait: Other MPI error, error stack:
internal_Wait(68205)..........: MPI_Wait(request=0x500847a0, status=0x7ffff9331800) failed
MPIR_Wait(780)................:
MPIR_Wait_state(737)..........:
MPIDI_progress_test(134)......:
MPIDI_OFI_handle_cq_error(793): OFI poll failed (ofi_events.c:793:MPIDI_OFI_handle_cq_error:Input/output error)
I’ve verified that the handle corresponding to 0x500847a0 is set earlier in the code by an MPI_Isend call, and that no MPI_Wait or MPI_Test is called on the handle before it crashes with the above error message. I’m using MPICH 4.2.1 built with gcc/8.3.0. The MPICH library was configured with:
../configure --prefix=/people/d3g293/mpich/mpich-4.2.1/build_newell/install \
--with-device=ch4:ofi:sockets --with-libfabric=embedded \
--without-ucx --enable-threads=multiple --with-slurm \
CC=gcc CXX=g++
I’ve tried building with UCX and gotten the same results.
Are these errors indicative of corruption of the request handle or problems with some internal MPI data structures or something else? Any information you can provide would be appreciated.
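Stripped of the Global Arrays machinery, the failing pattern reduces to something like the sketch below (sizes, tags, and ranks are placeholders, and the sketch uses host memory to stay self-contained; in the actual runtime the send buffer is GPU device memory):

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int n = 1024;                          /* placeholder size */
      double *buf = malloc(n * sizeof(double));

      if (rank == 0) {
          MPI_Request req;
          MPI_Isend(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
          /* no MPI_Wait or MPI_Test on req before this point */
          MPI_Wait(&req, MPI_STATUS_IGNORE); /* aborts here in our runs */
      } else if (rank == 1) {
          MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
      }

      free(buf);
      MPI_Finalize();
      return 0;
  }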
Thanks,
Bruce
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.out
Type: application/octet-stream
Size: 1923 bytes
Desc: test.out
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20240614/ca0b3e03/attachment.obj>