[mpich-discuss] MPICH: SHM bandwidth very low on IPC test
Zhou, Hui
zhouh at anl.gov
Thu Aug 14 09:51:53 CDT 2025
Hi Sam,
I think you are not getting the 10 GB/s because the buffer you used in your test code is freshly malloc'ed. The memory still needs to be paged in as you first access it, which adds to the messaging overhead. Try repeating the measurement a few times; you should see higher bandwidth numbers in the later rounds.
The intranode data movement is limited by CPU execution. Since there is only a single thread moving the data, it is typically limited to around 10 GB/s. In order to reach higher bandwidth, you need to use more processes or threads. The latter is tricky, since you need to tell MPI about the thread context. With MPICH, use a different communicator in each thread and enable `MPIR_CVAR_CH4_NUM_VCIS=8` (adjust the number to match your thread count).
GPU-to-GPU bandwidth is higher because the transfer is offloaded to the GPU copy engine, which can max out the PCIe bandwidth.
--
Hui
________________________________
From: Sam Austin via discuss <discuss at mpich.org>
Sent: Thursday, August 14, 2025 12:16 AM
To: Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: Sam Austin <sam.austin.p at gmail.com>; discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH: SHM bandwidth very low on IPC test
Hi Joachim,
Thanks for this suggestion! I used stream to test the single-core memory bandwidth. I am running on a Xeon E5-2699A v4, which has 55MB last level cache. So, I ran with 30 million elements per the instructions. It appears that I am seeing about 10 GB/s if I'm reading that right? If so, I am still not sure why I am only seeing ~3.5 GB/s on shared memory performance with MPICH.
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 30000000 (elements), Offset = 0 (elements)
Memory per array = 228.9 MiB (= 0.2 GiB).
Total memory required = 686.6 MiB (= 0.7 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 30276 microseconds.
(= 30276 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10038.1     0.048342     0.047818     0.050004
Scale:          10342.2     0.048738     0.046412     0.056605
Add:            10580.3     0.068542     0.068051     0.069805
Triad:          10703.0     0.067615     0.067271     0.068143
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Thanks,
Sam
On Wed, Aug 13, 2025 at 5:26 PM Jenke, Joachim <jenke at itc.rwth-aachen.de> wrote:
Hi Sam,
Can you try out stream to understand the single-core memory bandwidth of the system?
https://www.cs.virginia.edu/stream/ref.html
Copy bandwidth for large chunks (exceeding cache sizes) should give you an upper bound for shm communication bandwidth.
Best
Joachim
On 13.08.2025 at 22:04, Sam Austin via discuss <discuss at mpich.org> wrote:
Hi all,
I am working to configure MPICH and run a few examples on my standalone server (single node). Here are the system specs:
Server: Dell PowerEdge C4130
CPUs: 2x Xeon E5-2699A v4
GPUs: 4x Tesla V100s connected with NVLink, tied to motherboard with PCIe gen 3
OS: Ubuntu 24.04 LTS
I intend to use this system to develop multi-process programs for eventual execution in a large, distributed HPC environment. I ran a few tests with and without CUDA support; here is my mpichversion output:
MPICH Version: 4.3.1
MPICH Release date: Fri Jun 20 09:24:41 AM CDT 2025
MPICH ABI: 17:1:5
MPICH Device: ch4:ofi
MPICH configure: --prefix=/opt/mpich/4.2.1-cpu --without-cuda
MPICH CC: gcc -O2
MPICH CXX: g++ -O2
MPICH F77: gfortran -O2
MPICH FC: gfortran -O2
MPICH features: threadcomm
The first example that I ran was a bandwidth test for CPU-CPU and GPU-GPU communication. This simple program sends small packets back and forth between processes to test the bandwidth over the various intra-node networks.
The GPU-GPU bandwidth test showed that the GPU interconnect was saturating at ~45 GB/s, which is nominal for the NVLink interconnect topology present on the node (this was run with a CUDA-aware build of MPICH). The problem appears during the CPU-CPU IPC test. In theory, this test is pretty vanilla, as it is communicating between processes using shared memory, and does not involve traversing any of the intra-node networks (PCIe or NVLink). My understanding is that the bandwidth observed on the CPU-CPU IPC test should be quite high, at least higher than 10 GB/s.
However, the intra-node IPC bandwidth appears to be very low, around 3.5 GB/s, when running this test. I tried the following fixes in an attempt to force MPICH to use shared memory, but to no avail:
Passing the option to explicitly specify `nemesis` during the build configuration: "--with-device=ch3:nemesis --with-cuda"
Passing the option to explicitly specify shared memory with ch4 to the configuration: "--with-ch4-shmmods=posix --with-cuda"
Rebuilding MPICH without GPU support: "--without-cuda"
Switching to Open MPI and running the same test
These results, especially the last one in which I saw the same issue when running with Open MPI, make me think it might be a problem with my system configuration. The question is: why is the IPC bandwidth so low despite supposedly using the SHM protocol? I'm wondering if anyone has encountered this issue before or might be able to lend some advice here. Any help would be greatly appreciated!
Some interesting observations from the output below: when I run with "mpiexec -np 2 -genv FI_PROVIDER=shm ...", the log file reports "Opened fabric: shm". However, when I run without "-genv FI_PROVIDER=shm", the log file reports "Opened fabric: 10.133.0.0/21", which I believe means that MPICH is falling back to the TCP socket protocol. In this case, my key point of confusion is that the observed bandwidth is essentially the same between the SHM and TCP protocols. Perhaps my test script isn't set up properly?
Thanks,
Sam
The following is attached below:
Bandwidth test program
Run script for the program
Output of the script on my machine
----------------------------------------------------------------------------------------------------------------
In case the attachment doesn't go through, here are the contents of my test program, "shmem_check.cpp":
// shmem_check.cpp
//
// This is a minimal benchmark to test the raw bandwidth of MPI communication
// between two processes on the same node, using only host (CPU) memory.
// It completely removes CUDA to isolate the performance of the MPI library's
// on-node communication mechanism (e.g., shared memory vs. TCP loopback).
//
// Compile/run:
// /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
// /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 ./shmem_check
#include <iostream>
#include <vector>
#include <numeric>
#include <mpi.h>
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 2) {
        if (rank == 0) {
            std::cerr << "Error: This program must be run with exactly 2 MPI processes." << std::endl;
        }
        MPI_Finalize();
        return 1;
    }

    const int num_samples = 100;
    const long long packet_size = 1LL << 28; // 256 MB

    // Allocate standard host memory. 'new' is sufficient.
    char* buffer = new char[packet_size];

    if (rank == 0) {
        std::cout << "--- Starting Host-to-Host MPI Bandwidth Test ---" << std::endl;
        std::cout << "Packet Size: " << (packet_size / (1024*1024)) << " MB" << std::endl;
    }

    std::vector<double> timings;
    for (int i = 0; i < num_samples; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start_time = MPI_Wtime();

        if (rank == 0) {
            MPI_Send(buffer, packet_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buffer, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // Wait for confirmation
        } else { // rank == 1
            MPI_Recv(buffer, packet_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buffer, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); // Send confirmation
        }

        double end_time = MPI_Wtime();
        if (i >= 10) { // Discard warmup runs
            timings.push_back(end_time - start_time);
        }
    }

    if (rank == 0) {
        double total_time = std::accumulate(timings.begin(), timings.end(), 0.0);
        double avg_time = total_time / timings.size();
        double bandwidth = (static_cast<double>(packet_size) / (1024.0 * 1024.0 * 1024.0)) / avg_time;
        std::cout << "------------------------------------------------" << std::endl;
        std::cout << "Average Host-to-Host Bandwidth: " << bandwidth << " GB/s" << std::endl;
        std::cout << "------------------------------------------------" << std::endl;
    }

    // Clean up host memory
    delete[] buffer;

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------------------------------------------------
Here is the script that compiles the test and runs it with the `shm` provider forced and unforced, with verbose libfabric logging:
#!/usr/bin/zsh
source ~/.zshrc
# Compile
/opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
# Run with shm forced
/opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_PROVIDER=shm -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_shm.txt
# Run without shm forced
/opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_no_shm.txt
echo "Output of script with SHM forced: "
grep -i "opened fabric" output_shm.txt
echo "Output of script with SHM not forced: "
grep -i "opened fabric" output_no_shm.txt
----------------------------------------------------------------------------------------------------------------
Here is the output:
--- Starting Host-to-Host MPI Bandwidth Test ---
Packet Size: 256 MB
------------------------------------------------
Average Host-to-Host Bandwidth: 3.35709 GB/s
------------------------------------------------
--- Starting Host-to-Host MPI Bandwidth Test ---
Packet Size: 256 MB
------------------------------------------------
Average Host-to-Host Bandwidth: 3.54924 GB/s
------------------------------------------------
Output of script with SHM forced:
libfabric:3174297:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
libfabric:3174298:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
Output of script with SHM not forced:
libfabric:3174351:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
libfabric:3174350:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21