[mpich-devel] MPICH with SYCL on Aurora
Wozniak, Justin M.
woz at anl.gov
Thu Apr 2 10:32:59 CDT 2026
That makes it run, but it is about 6x slower when running with 2 processes. Is there something else I can check? I get:
# --with-device=ch4:ofi
# *** Device Configuration
# *** --------------------
# *** device : ch4:ofi
# *** ipc feature : xpmem gpudirect
# *** gpu support : ZE
Note that I do not need any network support, as I am using this MPICH for node-local communication only. So I also do not need integrated process launch features.
--
Justin M Wozniak
________________________________
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: Wednesday, April 1, 2026 9:10
To: Wozniak, Justin M. <woz at anl.gov>; devel at mpich.org <devel at mpich.org>; Harms, Kevin <harms at alcf.anl.gov>
Subject: Re: MPICH with SYCL on Aurora
If building with Intel GPU support on Aurora, you should have --with-ze and --enable-ze-native=pvc in the configuration line. See https://github.com/pmodels/mpich/wiki/Using-MPICH-on-Aurora%40ALCF for more details.
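As a rough sketch of what such a configure line could look like: the two ZE flags are the ones named above and the device is the ch4:ofi device from the configuration summary quoted in this thread, while the install prefix is a hypothetical placeholder and any other options a real Aurora build may need are omitted. The snippet only assembles and prints the command so it can be reviewed before running it from the MPICH source tree.

```shell
#!/bin/sh
# Sketch only: assemble a configure line with Intel GPU (Level Zero) support.
# PREFIX is a hypothetical install location, not taken from the thread.
PREFIX="${PREFIX:-$HOME/sfw/mpich-ze}"

# --with-ze and --enable-ze-native=pvc are the flags named above;
# --with-device=ch4:ofi matches the configuration summary quoted in this thread.
CONFIGURE_ARGS="--prefix=$PREFIX --with-device=ch4:ofi --with-ze --enable-ze-native=pvc"

# Print the command for review; run it by hand from the MPICH source directory.
CONFIGURE_CMD="./configure $CONFIGURE_ARGS"
echo "$CONFIGURE_CMD"
```

Printing rather than executing keeps the sketch harmless outside a source tree; once the line looks right, it can be pasted into the build directory.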
Ken
From: Wozniak, Justin M. <woz at anl.gov>
Date: Tuesday, March 31, 2026 at 4:28 PM
To: Raffenetti, Ken <raffenet at anl.gov>, devel at mpich.org <devel at mpich.org>, Harms, Kevin <harms at alcf.anl.gov>
Subject: Re: MPICH with SYCL on Aurora
I should have mentioned: I configured with --with-ze=no to avoid this build-time error:
CC src/backend/ze/pup/yaksuri_zei_get_ptr_attr.lo
OCLOC (spirv) src/backend/ze/pup/yaksuri_zei_pup_char.cl
Could not determine device target: skl.
Error: Cannot get HW Info for device skl.
Invalid device error, trying to fallback to former ocloc libocloc_legacy1.so
Couldn't load former ocloc libocloc_legacy1.so
Command was: ocloc compile -file src/backend/ze/pup/yaksuri_zei_pup_char.cl -device skl -spv_only -out_dir src/backend/ze/pup -output_no_suffix -q -options "-I ./src/backend/ze/include -cl-std=CL2.0"
make[2]: *** [Makefile:13671: src/backend/ze/pup/yaksuri_zei_pup_char.c] Error 223
make[2]: Leaving directory '/tmp/wozniak/mpich-5.0.0rc3-ze/src/mpi/datatype/typerep/yaksa'
Maybe this is the main thing I need to address? Do I just need to force the ocloc device for Aurora?
--
Justin M Wozniak
________________________________
From: Wozniak, Justin M. via devel <devel at mpich.org>
Sent: Tuesday, March 31, 2026 12:02
To: Raffenetti, Ken <raffenet at anl.gov>; devel at mpich.org <devel at mpich.org>; Harms, Kevin <harms at alcf.anl.gov>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: [mpich-devel] MPICH with SYCL on Aurora
This is mpich-5.0.0rc3. I will try that, thanks.
--
Justin M Wozniak
________________________________
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: Tuesday, March 31, 2026 11:30
To: devel at mpich.org <devel at mpich.org>; Harms, Kevin <harms at alcf.anl.gov>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: MPICH with SYCL on Aurora
Which version of MPI is this? This might be a known issue in CMA support (fixed here: https://github.com/pmodels/mpich/pull/7743). You can try disabling CMA with MPIR_CVAR_CH4_CMA_ENABLE=0 to avoid that path, or pull in the fix to your copy and rebuild.
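A minimal sketch of the first option, assuming a 2-process launch with mpiexec and an application binary named agent (as in the ldd output elsewhere in this thread). The launch line is left commented out because the exact command depends on the local setup.

```shell
#!/bin/sh
# Turn off the CMA path via the control variable named above.
export MPIR_CVAR_CH4_CMA_ENABLE=0

# Confirm what the MPI processes will inherit from this environment.
echo "MPIR_CVAR_CH4_CMA_ENABLE=$MPIR_CVAR_CH4_CMA_ENABLE"

# Hypothetical launch line (assumed launcher and binary name); adjust as needed:
# mpiexec -n 2 ./agent
```

Setting the variable in the environment avoids a rebuild, which makes it a quick way to check whether the CMA path is involved before pulling in the fix.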
Ken
From: Wozniak, Justin M. via devel <devel at mpich.org>
Date: Tuesday, March 31, 2026 at 11:27 AM
To: Harms, Kevin <harms at alcf.anl.gov>, devel at mpich.org <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: [mpich-devel] MPICH with SYCL on Aurora
With MPIR_CVAR_REQUEST_ERR_FATAL=1 in a 2-process run, this looks like:
Abort(270742287) on node 0: Fatal error in internal_Waitall: Other MPI error, error stack:
internal_Waitall(126)..: MPI_Waitall(count=1, array_of_requests=0x797cdb0, array_of_statuses=0x7ca3fb0) failed
MPIR_Waitall(916)......:
MPIDI_IPC_rndv_cb(172).:
MPIDI_CMA_copy_data(54):
copy_iovs(202).........: process_vm_readv failed (errno 14)
Abort(270742287) on node 1: Fatal error in internal_Waitall: Other MPI error, error stack:
(same)
This succeeds for a 1-process run with SYCL enabled, or for a 2-process run with SYCL disabled in the app at configure time.
The app looks like:
$ ldd =agent
libmpicxx.so.0 => /lus/flare/projects/EpiCalib/sfw/mpich-5.0.0rc3/lib/libmpicxx.so.0 (0x0000146726aed000)
libmpi.so.0 => /lus/flare/projects/EpiCalib/sfw/mpich-5.0.0rc3/lib/libmpi.so.0 (0x0000146725265000)
libmkl_sycl_blas.so.5 => /opt/aurora/26.26.0/oneapi/mkl/latest/lib/libmkl_sycl_blas.so.5 (0x0000146721063000)
(etc.)
libstdc++.so.6 => /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/gcc-13.4.0-hgnyg4p/lib64/libstdc++.so.6 (0x0000146708cd9000)
libm.so.6 => /lib64/libm.so.6 (0x0000146708b77000)
libgcc_s.so.1 => /opt/aurora/26.26.0/spack/unified/1.1.1/install/linux-x86_64/gcc-13.4.0-hgnyg4p/lib64/libgcc_s.so.1 (0x0000146708b53000)
libsycl.so.8 => /opt/aurora/26.26.0/oneapi/compiler/latest/lib/libsycl.so.8 (0x0000146708758000)
libOpenCL.so.1 => /opt/aurora/26.26.0/support/libraries/khronos/default/lib64/libOpenCL.so.1 (0x0000146708743000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000014670871f000)
libc.so.6 => /lib64/libc.so.6 (0x000014670852a000)
libhwloc.so.15 => /opt/aurora/26.26.0/oneapi/tcm/latest/lib/libhwloc.so.15 (0x00001467082cc000)
(etc.)
Thanks
--
Justin M Wozniak
________________________________
From: Harms, Kevin <harms at alcf.anl.gov>
Sent: Monday, March 30, 2026 14:50
To: devel at mpich.org <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: MPICH with SYCL on Aurora
Justin,
can you provide the specific error?
kevin
________________________________________
From: Wozniak, Justin M. via devel <devel at mpich.org>
Sent: Monday, March 30, 2026 2:24 PM
To: devel at mpich.org
Cc: Wozniak, Justin M.
Subject: [mpich-devel] MPICH with SYCL on Aurora
Hi
I am trying to port a simulation ensemble workflow that runs ExaEpi/AMReX/SYCL to Aurora. The outer workflow uses the system MPI, and we use a hand-built MPICH to run the app with node-local parallelism. On Aurora, I get errors in early MPI calls that I think are due to SYCL. This approach works on NVIDIA systems like Perlmutter. Is there a simple way to make MPICH aware of SYCL?
Thanks
--
Justin M Wozniak