[mpich-discuss] MPICH: SHM bandwidth very low on IPC test
Joachim Jenke
jenke at itc.rwth-aachen.de
Thu Aug 14 11:20:39 CDT 2025
Hi Sam,
the 10 GB/s STREAM bandwidth figure counts both the bytes read and the
bytes written (see lines 190/366).
I assume that your MPI bandwidth calculation accounts only for the
buffer size (i.e., only the read or only the write). In shm
communication, one process (and therefore one core) streams/memcopies
the data from the send buffer to the receive buffer. So a send bandwidth
of 3.5 GB/s actually corresponds to about 7 GB/s of STREAM Copy
bandwidth.
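To make the two accounting conventions concrete, here is a minimal C++ sketch. The function names are hypothetical and not part of STREAM or MPICH. It assumes the MPI test reports payload bytes in GiB/s (2^30 bytes, as shmem_check.cpp does) and that STREAM Copy reports read plus written bytes in MB/s (10^6 bytes). The memcpy timer is only a rough single-core baseline, not a replacement for STREAM.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Payload rate as the MPI test reports it: bytes delivered, in GiB/s (2^30 bytes).
double payload_gib_per_s(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0) / seconds;
}

// Rate as STREAM Copy counts it: bytes read plus bytes written, in MB/s (10^6 bytes).
double stream_copy_mb_per_s(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(bytes) / 1.0e6 / seconds;
}

// Best-of-N wall time for a single-threaded memcpy of `bytes` bytes.
double best_memcpy_seconds(std::size_t bytes, int repetitions) {
    assert(bytes > 0 && repetitions > 0);
    // The vector constructors write every element, so all pages are
    // already mapped before the timed loop starts.
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    double best = 1e300;
    for (int i = 0; i < repetitions; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        std::memcpy(dst.data(), src.data(), bytes);
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (s > 0.0 && s < best) best = s;
    }
    // Read the destination so the copy is not treated as dead code.
    volatile char sink = dst[bytes - 1];
    (void)sink;
    return best;
}
```

With these definitions, a 256 MiB message delivered at 3.5 GiB/s corresponds to roughly 7516 MB/s in STREAM Copy terms, which is the number to compare against the 10038.1 MB/s Copy result, not 3.5.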
As a side effect of shm communication, we have seen that the placement
of the copying process can determine the first-touch allocation of the
buffer. Even if the memory is allocated with calloc, it is not backed by
physical pages until it is first touched. A bcast/scatter to node-local
processes can therefore result in all buffers being paged in on the same
socket (which you typically want to avoid).
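One way to keep the copying process from deciding the placement is to let each process touch its own buffer before the first communication. The following is a minimal sketch under assumptions: the helper name is hypothetical, it relies on POSIX sysconf, and whether a touched page actually lands on the local socket depends on the kernel's NUMA policy and on how the processes are pinned.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unistd.h>  // sysconf (POSIX)

// Allocate a buffer and write one byte per page from the calling process,
// so the pages are mapped by the owner and not by whichever process later
// copies message data into the buffer.
std::unique_ptr<char[]> allocate_first_touched(std::size_t bytes) {
    std::unique_ptr<char[]> buf(new char[bytes]);
    long ps = sysconf(_SC_PAGESIZE);
    std::size_t page = ps > 0 ? static_cast<std::size_t>(ps) : 4096;
    assert(page > 0);
    for (std::size_t off = 0; off < bytes; off += page) {
        buf[off] = 0;  // first touch of this page
    }
    if (bytes > 0) {
        buf[bytes - 1] = 0;  // make sure the last page is touched as well
    }
    return buf;
}
```

In shmem_check.cpp this would stand in for the plain `new char[packet_size]`, called on both ranks before the timing loop.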
Best
Joachim
On 14.08.25 at 07:16, Sam Austin wrote:
> Hi Joachim,
>
> Thanks for this suggestion! I used stream to test the single-core memory
> bandwidth. I am running on a Xeon E5-2699A v4, which has 55MB last level
> cache. So, I ran with 30 million elements per the instructions. It
> appears that I am seeing about 10 GB/s if I'm reading that right? If so,
> I am still not sure why I am only seeing ~3.5 GB/s on shared memory
> performance with MPICH.
>
> -------------------------------------------------------------
> STREAM version $Revision: 5.10 $
> -------------------------------------------------------------
> This system uses 8 bytes per array element.
> -------------------------------------------------------------
> Array size = 30000000 (elements), Offset = 0 (elements)
> Memory per array = 228.9 MiB (= 0.2 GiB).
> Total memory required = 686.6 MiB (= 0.7 GiB).
> Each kernel will be executed 10 times.
> The *best* time for each kernel (excluding the first iteration)
> will be used to compute the reported bandwidth.
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 30276 microseconds.
> (= 30276 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:           10038.1     0.048342     0.047818     0.050004
> Scale:          10342.2     0.048738     0.046412     0.056605
> Add:            10580.3     0.068542     0.068051     0.069805
> Triad:          10703.0     0.067615     0.067271     0.068143
> -------------------------------------------------------------
> Solution Validates: avg error less than 1.000000e-13 on all three arrays
> -------------------------------------------------------------
>
> Thanks,
> Sam
>
> On Wed, Aug 13, 2025 at 5:26 PM Jenke, Joachim
> <jenke at itc.rwth-aachen.de> wrote:
>
> Hi Sam,
>
> Can you try out stream to understand the single-core memory
> bandwidth of the system?
>
> https://www.cs.virginia.edu/stream/ref.html
>
> Copy bandwidth for large chunks (exceeding cache sizes) should
> provide you with an upper bound for shm communication bandwidth.
>
> Best
> Joachim
>
> On 13.08.2025 at 22:04, Sam Austin via discuss
> <discuss at mpich.org> wrote:
> Hi all,
>
> I am working to configure MPICH and run a few examples on my
> standalone server (single node). Here are the system specs:
> Server: Dell PowerEdge C4130
> CPUs: 2x Xeon E5-2699A v4
> GPUs: 4x Tesla V100s connected with NVLink, tied to motherboard with
> PCIe gen 3
> OS: Ubuntu 24.04 LTS
> I intend to use this system to develop multi-process programs for
> eventual execution in a large, distributed HPC environment. I ran a
> few tests with and without CUDA support; here is my mpichversion output:
>
> MPICH Version: 4.3.1
> MPICH Release date: Fri Jun 20 09:24:41 AM CDT 2025
> MPICH ABI: 17:1:5
> MPICH Device: ch4:ofi
> MPICH configure: --prefix=/opt/mpich/4.2.1-cpu --without-cuda
> MPICH CC: gcc -O2
> MPICH CXX: g++ -O2
> MPICH F77: gfortran -O2
> MPICH FC: gfortran -O2
> MPICH features: threadcomm
>
> The first example that I ran was a bandwidth test for CPU-CPU and
> GPU-GPU communication. This simple program sends small packets back
> and forth between processes to test the bandwidth over the various
> intra-node networks.
>
> The GPU-GPU bandwidth test showed that the GPU interconnect was
> saturating at ~45 GB/s, which is nominal for the NVLink interconnect
> topology present on the node (this was run with a CUDA-aware build
> of MPICH). The problem appears during the CPU-CPU IPC test. In
> theory, this test is pretty vanilla, as it is communicating between
> processes using shared memory, and does not involve traversing any
> of the intra-node networks (PCIe or NVLink). My understanding is
> that the bandwidth observed on the CPU-CPU IPC test should be quite
> high, at least higher than 10 GB/s.
>
> However, the intra-node IPC bandwidth appears to be very low, around
> 3.5 GB/s, when running this test. I tried the following fixes in an
> attempt to force MPICH to use shared memory, but to no avail:
> - Passing the option to explicitly specify `nemesis` during the build
>   configuration: "--with-device=ch3:nemesis --with-cuda"
> - Passing the option to explicitly specify shared memory with ch4 to
>   the configuration: "--with-ch4-shmmods=posix --with-cuda"
> - Rebuilding MPICH without GPU support: "--without-cuda"
> - Switching to Open MPI and running the same test
> These results, especially the last one, in which I saw the same
> issue when running with Open MPI, make me think it might be an
> issue with my system configuration. The question is: why is the IPC
> bandwidth so low despite supposedly using the SHM protocol? I'm
> wondering if anyone has encountered this issue before or might be
> able to lend some advice here. Any help would be greatly appreciated!
>
> Some interesting observations from the output below: when I run with
> "mpiexec -np 2 -genv FI_PROVIDER=shm ...", the log file reports
> "Opened fabric: shm". However, when I run without "-genv
> FI_PROVIDER=shm", the log file reports "Opened fabric:
> 10.133.0.0/21", which I believe means that MPICH is falling back on
> the TCP socket protocol. In this case, my key point of confusion is
> that the observed bandwidth is essentially the same between the SHM
> and TCP protocols. Perhaps my test script isn't set up properly?
>
> Thanks,
> Sam
>
> The following is attached below:
> Bandwidth test program
> Run script for the program
> Output of the script on my machine
> ----------------------------------------------------------------------------------------------------------------
> In case the attachment doesn't go through, here are the contents of
> my test program, "shmem_check.cpp":
>
> // shmem_check.cpp
> //
> // This is a minimal benchmark to test the raw bandwidth of MPI communication
> // between two processes on the same node, using only host (CPU) memory.
> // It completely removes CUDA to isolate the performance of the MPI library's
> // on-node communication mechanism (e.g., shared memory vs. TCP loopback).
> //
> // Compile/run:
> //   /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
> //   /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 ./shmem_check
>
> #include <iostream>
> #include <vector>
> #include <numeric>
> #include <mpi.h>
>
> int main(int argc, char* argv[]) {
>     MPI_Init(&argc, &argv);
>
>     int rank, size;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (size != 2) {
>         if (rank == 0) {
>             std::cerr << "Error: This program must be run with exactly 2 MPI processes." << std::endl;
>         }
>         MPI_Finalize();
>         return 1;
>     }
>
>     const int num_samples = 100;
>     const long long packet_size = 1LL << 28; // 256 MB
>
>     // Allocate standard host memory. 'new' is sufficient.
>     char* buffer = new char[packet_size];
>
>     if (rank == 0) {
>         std::cout << "--- Starting Host-to-Host MPI Bandwidth Test ---" << std::endl;
>         std::cout << "Packet Size: " << (packet_size / (1024*1024)) << " MB" << std::endl;
>     }
>
>     std::vector<double> timings;
>     for (int i = 0; i < num_samples; ++i) {
>         MPI_Barrier(MPI_COMM_WORLD);
>         double start_time = MPI_Wtime();
>
>         if (rank == 0) {
>             MPI_Send(buffer, packet_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>             MPI_Recv(buffer, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // Wait for confirmation
>         } else { // rank == 1
>             MPI_Recv(buffer, packet_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             MPI_Send(buffer, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); // Send confirmation
>         }
>
>         double end_time = MPI_Wtime();
>         if (i >= 10) { // Discard warmup runs
>             timings.push_back(end_time - start_time);
>         }
>     }
>
>     if (rank == 0) {
>         double total_time = std::accumulate(timings.begin(), timings.end(), 0.0);
>         double avg_time = total_time / timings.size();
>         double bandwidth = (static_cast<double>(packet_size) / (1024.0 * 1024.0 * 1024.0)) / avg_time;
>
>         std::cout << "------------------------------------------------" << std::endl;
>         std::cout << "Average Host-to-Host Bandwidth: " << bandwidth << " GB/s" << std::endl;
>         std::cout << "------------------------------------------------" << std::endl;
>     }
>
>     // Clean up host memory
>     delete[] buffer;
>
>     MPI_Finalize();
>     return 0;
> }
>
> ----------------------------------------------------------------------------------------------------------------
> Here is the script to run the test with verbose compilation and the
> `shm` layer forced and unforced:
>
> #!/usr/bin/zsh
> source ~/.zshrc
>
> # Compile
> /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
>
> # Run with shm forced
> /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_PROVIDER=shm -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_shm.txt
>
> # Run without shm forced
> /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_no_shm.txt
>
> echo "Output of script with SHM forced: "
> grep -i "opened fabric" output_shm.txt
>
> echo "Output of script with SHM not forced: "
> grep -i "opened fabric" output_no_shm.txt
>
> ----------------------------------------------------------------------------------------------------------------
> Here is the output :
>
> --- Starting Host-to-Host MPI Bandwidth Test ---
> Packet Size: 256 MB
> ------------------------------------------------
> Average Host-to-Host Bandwidth: 3.35709 GB/s
> ------------------------------------------------
> --- Starting Host-to-Host MPI Bandwidth Test ---
> Packet Size: 256 MB
> ------------------------------------------------
> Average Host-to-Host Bandwidth: 3.54924 GB/s
> ------------------------------------------------
> Output of script with SHM forced:
> libfabric:3174297:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
> libfabric:3174298:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
> Output of script with SHM not forced:
> libfabric:3174351:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
> libfabric:3174350:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
>
--
Dr. rer. nat. Joachim Jenke
Deputy Group Lead
IT Center
Group: HPC - Parallelism, Runtime Analysis & Machine Learning
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
www.itc.rwth-aachen.de