[mpich-devel] MPICH with SYCL on Aurora

Wozniak, Justin M. woz at anl.gov
Thu Apr 2 10:32:59 CDT 2026


That makes it run, but it is about 6x slower when running with 2 processes.  Is there something else I can check?  I get:

# --with-device=ch4:ofi
# *** Device Configuration
# *** --------------------
# *** device      : ch4:ofi
# *** ipc feature : xpmem gpudirect
# *** gpu support : ZE

Note that I do not need any network support, as I am using this MPICH for node-local communication only.  So I also do not need integrated process launch features.


--

Justin M Wozniak


________________________________
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: Wednesday, April 1, 2026 9:10
To: Wozniak, Justin M. <woz at anl.gov>; devel at mpich.org <devel at mpich.org>; Harms, Kevin <harms at alcf.anl.gov>
Subject: Re: MPICH with SYCL on Aurora

If building with Intel GPU support on Aurora, you should have --with-ze and --enable-ze-native=pvc in the configure line. See https://github.com/pmodels/mpich/wiki/Using-MPICH-on-Aurora%40ALCF for more details.
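
As a minimal sketch of what such a configure line might look like: the two ZE flags are the ones named above and --with-device=ch4:ofi matches the build reported elsewhere in this thread, but PREFIX is a hypothetical install location, and any further options a given Aurora build needs are not shown here (the wiki page is the authoritative source).

```shell
# Sketch of a configure line for an MPICH build with Intel GPU (ZE) support.
# PREFIX is a hypothetical install location, not taken from the thread.
PREFIX="${PREFIX:-$HOME/sfw/mpich-ze}"

# --with-device=ch4:ofi matches the build reported in this thread;
# --with-ze and --enable-ze-native=pvc are the two flags suggested above.
CONFIGURE_ARGS="--prefix=$PREFIX --with-device=ch4:ofi --with-ze --enable-ze-native=pvc"

# Print the command for review; drop the leading "echo" to actually run it
# from the top of the MPICH source tree.
echo ./configure $CONFIGURE_ARGS
```

Printing the command first makes it easy to compare against the "Device Configuration" summary that configure reports before committing to a full rebuild.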

Ken

From: Wozniak, Justin M. <woz at anl.gov>
Date: Tuesday, March 31, 2026 at 4:28 PM
To: Raffenetti, Ken <raffenet at anl.gov>, devel at mpich.org <devel at mpich.org>, Harms, Kevin <harms at alcf.anl.gov>
Subject: Re: MPICH with SYCL on Aurora

I should have mentioned: I configured with --with-ze=no to avoid this build-time error:

  CC       src/backend/ze/pup/yaksuri_zei_get_ptr_attr.lo
 OCLOC (spirv)  src/backend/ze/pup/yaksuri_zei_pup_char.cl
Could not determine device target: skl.
Error: Cannot get HW Info for device skl.
Invalid device error, trying to fallback to former ocloc libocloc_legacy1.so
Couldn't load former ocloc libocloc_legacy1.so
Command was: ocloc compile -file src/backend/ze/pup/yaksuri_zei_pup_char.cl -device skl -spv_only -out_dir src/backend/ze/pup -output_no_suffix -q -options "-I ./src/backend/ze/include -cl-std=CL2.0"
make[2]: *** [Makefile:13671: src/backend/ze/pup/yaksuri_zei_pup_char.c] Error 223
make[2]: Leaving directory '/tmp/wozniak/mpich-5.0.0rc3-ze/src/mpi/datatype/typerep/yaksa'

Maybe this is the main thing I need to address?  Do I just need to force the ocloc device for Aurora?

--

Justin M Wozniak


________________________________
From: Wozniak, Justin M. via devel <devel at mpich.org>
Sent: Tuesday, March 31, 2026 12:02
To: Raffenetti, Ken <raffenet at anl.gov>; devel at mpich.org <devel at mpich.org>; Harms, Kevin <harms at alcf.anl.gov>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: [mpich-devel] MPICH with SYCL on Aurora

This is mpich-5.0.0rc3. I will try that, thanks.

--

Justin M Wozniak


________________________________
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: Tuesday, March 31, 2026 11:30
To: devel at mpich.org <devel at mpich.org>; Harms, Kevin <harms at alcf.anl.gov>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: MPICH with SYCL on Aurora

Which version of MPI is this? This might be a known issue in CMA support (fixed here: https://github.com/pmodels/mpich/pull/7743). You can try disabling CMA with MPIR_CVAR_CH4_CMA_ENABLE=0 to avoid that path or pull in the fix to your copy and rebuild.
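
A minimal sketch of the environment-variable workaround, assuming a 2-process launch through mpiexec. The variable name is the one given above; APP and the mpiexec arguments are placeholders for the real launch line, not something stated in this thread.

```shell
# Disable the CMA (process_vm_readv) copy path for this shell and its children.
export MPIR_CVAR_CH4_CMA_ENABLE=0

# APP is a placeholder for the real application binary.
APP="${APP:-./agent}"

# Launch if both mpiexec and the binary are available; otherwise just print
# the equivalent command line.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$APP" ]; then
    mpiexec -n 2 "$APP"
else
    echo "MPIR_CVAR_CH4_CMA_ENABLE=$MPIR_CVAR_CH4_CMA_ENABLE mpiexec -n 2 $APP"
fi
```

Exporting the variable, rather than setting it on a single command, keeps it in effect for any wrapper scripts that start the ranks.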

Ken

From: Wozniak, Justin M. via devel <devel at mpich.org>
Date: Tuesday, March 31, 2026 at 11:27 AM
To: Harms, Kevin <harms at alcf.anl.gov>, devel at mpich.org <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: [mpich-devel] MPICH with SYCL on Aurora

With MPIR_CVAR_REQUEST_ERR_FATAL=1 in a 2-process run, this looks like:

Abort(270742287) on node 0: Fatal error in internal_Waitall: Other MPI error, error stack:
internal_Waitall(126)..: MPI_Waitall(count=1, array_of_requests=0x797cdb0, array_of_statuses=0x7ca3fb0) failed
MPIR_Waitall(916)......:
MPIDI_IPC_rndv_cb(172).:
MPIDI_CMA_copy_data(54):
copy_iovs(202).........: process_vm_readv failed (errno 14)

Abort(270742287) on node 1: Fatal error in internal_Waitall: Other MPI error, error stack:
(same)

This succeeds for a 1-process run with SYCL enabled, or for a 2-process run with SYCL disabled in the app at configure time.

The app looks like:

$ ldd =agent
        libmpicxx.so.0 => /lus/flare/projects/EpiCalib/sfw/mpich-5.0.0rc3/lib/libmpicxx.so.0 (0x0000146726aed000)
        libmpi.so.0 => /lus/flare/projects/EpiCalib/sfw/mpich-5.0.0rc3/lib/libmpi.so.0 (0x0000146725265000)
        libmkl_sycl_blas.so.5 => /opt/aurora/26.26.0/oneapi/mkl/latest/lib/libmkl_sycl_blas.so.5 (0x0000146721063000)
   (etc.)
        libstdc++.so.6 => /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/gcc-13.4.0-hgnyg4p/lib64/libstdc++.so.6 (0x0000146708cd9000)
        libm.so.6 => /lib64/libm.so.6 (0x0000146708b77000)
        libgcc_s.so.1 => /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/gcc-13.4.0-hgnyg4p/lib64/libgcc_s.so.1 (0x0000146708b53000)
        libsycl.so.8 => /opt/aurora/26.26.0/oneapi/compiler/latest/lib/libsycl.so.8 (0x0000146708758000)
        libOpenCL.so.1 => /opt/aurora/26.26.0/support/libraries/khronos/default/lib64/libOpenCL.so.1 (0x0000146708743000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000014670871f000)
        libc.so.6 => /lib64/libc.so.6 (0x000014670852a000)
        libhwloc.so.15 => /opt/aurora/26.26.0/oneapi/tcm/latest/lib/libhwloc.so.15 (0x00001467082cc000)
(etc.)

Thanks


--

Justin M Wozniak


________________________________
From: Harms, Kevin <harms at alcf.anl.gov>
Sent: Monday, March 30, 2026 14:50
To: devel at mpich.org <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: MPICH with SYCL on Aurora

Justin,

  can you provide the specific error?

kevin

________________________________________
From: Wozniak, Justin M. via devel <devel at mpich.org>
Sent: Monday, March 30, 2026 2:24 PM
To: devel at mpich.org
Cc: Wozniak, Justin M.
Subject: [mpich-devel] MPICH with SYCL on Aurora

Hi
    I am trying to port a simulation ensemble workflow that runs ExaEpi/AMReX/SYCL to Aurora.  The outer workflow uses the system MPI, and we run the app with node-local parallelism using a hand-built MPICH.  On Aurora, I get errors in early MPI calls that I think are due to SYCL.  This approach works on NVIDIA systems like Perlmutter.  Is there some simple way to make MPICH aware of SYCL?
    Thanks

--

Justin M Wozniak
