[mpich-discuss] MPICH: SHM bandwidth very low on IPC test

Thu Aug 14 00:16:48 CDT 2025

Hi Joachim,

Thanks for this suggestion! I used STREAM to test the single-core memory
bandwidth. I am running on a Xeon E5-2699A v4, which has a 55 MB last-level
cache, so I ran with 30 million elements per the instructions. If I'm reading
the output right, I am seeing about 10 GB/s. If so, I am still not sure why I
am only seeing ~3.5 GB/s of shared-memory bandwidth with MPICH.

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 30000000 (elements), Offset = 0 (elements)
Memory per array = 228.9 MiB (= 0.2 GiB).
Total memory required = 686.6 MiB (= 0.7 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 30276 microseconds.
   (= 30276 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10038.1     0.048342     0.047818     0.050004
Scale:          10342.2     0.048738     0.046412     0.056605
Add:            10580.3     0.068542     0.068051     0.069805
Triad:          10703.0     0.067615     0.067271     0.068143
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Thanks,
Sam

On Wed, Aug 13, 2025 at 5:26 PM Jenke, Joachim <jenke at itc.rwth-aachen.de>
wrote:

> Hi Sam,
>
> Can you try out stream to understand the single-core memory bandwidth of
> the system?
>
> https://www.cs.virginia.edu/stream/ref.html
>
> Copy bandwidth for large chunks (exceeding cache sizes) should provide you
> an upper bound for shm communication bandwidth.
>
> Best
> Joachim
>
> On 13.08.2025 at 22:04, Sam Austin via discuss <discuss at mpich.org> wrote:
> Hi all,
>
> I am working to configure MPICH and run a few examples on my standalone
> server (single node). Here are the system specs:
> Server: Dell PowerEdge C4130
> CPUs: 2x Xeon E5-2699A v4
> GPUs: 4x Tesla V100s connected with NVLink, tied to motherboard with PCIe
> gen 3
> OS: Ubuntu 24.04 LTS
> I intend to use this system to develop multi-process programs for eventual
> execution in a large, distributed HPC environment. I ran a few tests with
> and without CUDA support; here is my mpichversion output:
>
> MPICH Version:      4.3.1
> MPICH Release date: Fri Jun 20 09:24:41 AM CDT 2025
> MPICH ABI:          17:1:5
> MPICH Device:       ch4:ofi
> MPICH configure:    --prefix=/opt/mpich/4.2.1-cpu --without-cuda
> MPICH CC:           gcc     -O2
> MPICH CXX:          g++   -O2
> MPICH F77:          gfortran   -O2
> MPICH FC:           gfortran   -O2
> MPICH features:     threadcomm
>
> The first example that I ran was a bandwidth test for CPU-CPU and GPU-GPU
> communication. This simple program sends small packets back and forth
> between processes to test the bandwidth over the various intra-node
> networks.
>
> The GPU-GPU bandwidth test showed that the GPU interconnect was saturating
> at ~45 GB/s, which is nominal for the NVLink interconnect topology present
> on the node (this was run with a CUDA-aware build of MPICH). The problem
> appears during the CPU-CPU IPC test. In theory, this test is pretty
> vanilla, as it is communicating between processes using shared memory, and
> does not involve traversing any of the intra-node networks (PCIe or
> NVLink). My understanding is that the bandwidth observed on the CPU-CPU IPC
> test should be quite high, at least higher than 10 GB/s.
>
> However, the intra-node IPC bandwidth appears to be very low, around 3.5
> GB/s, when running this test. I tried the following fixes in an attempt to
> force MPICH to use shared memory, but to no avail:
> - Passing the option to explicitly specify `nemesis` during the build
>   configuration: "--with-device=ch3:nemesis --with-cuda"
> - Passing the option to explicitly specify shared memory with ch4 to the
>   configuration: "--with-ch4-shmmods=posix --with-cuda"
> - Rebuilding MPICH without GPU support: "--without-cuda"
> - Switching to Open MPI and running the same test
> These results, especially the last one, in which I saw the same issue when
> running with Open MPI, make me think it might be an issue with my system
> configuration. The question is: why is the IPC bandwidth so low despite
> supposedly using the SHM protocol? I'm wondering if anyone has encountered
> this issue before or might be able to lend some advice here. Any help would
> be greatly appreciated!
>
> Some interesting observations from the output below: when I run with
> "mpiexec -np 2 -genv FI_PROVIDER=shm ...", the log file reports "Opened
> fabric: shm". However, when I run without "-genv FI_PROVIDER=shm", the log
> file reports "Opened fabric: 10.133.0.0/21",
> which I believe means that MPICH is falling back on the TCP socket
> protocol. In this case, my key point of confusion is that the observed
> bandwidth is essentially the same between the SHM and TCP protocols.
> Perhaps my test script isn't set up properly?
>
> Thanks,
> Sam
>
> The following is attached below:
> Bandwidth test program
> Run script for the program
> Output of the script on my machine
>
> ----------------------------------------------------------------------------------------------------------------
> In case the attachment doesn't go through, here are the contents of my
> test program, "shmem_check.cpp":
>
> // shmem_check.cpp
> //
> // This is a minimal benchmark to test the raw bandwidth of MPI communication
> // between two processes on the same node, using only host (CPU) memory.
> // It completely removes CUDA to isolate the performance of the MPI library's
> // on-node communication mechanism (e.g., shared memory vs. TCP loopback).
> //
> // Compile/run:
> // /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
> // /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 ./shmem_check
>
> #include <iostream>
> #include <vector>
> #include <numeric>
> #include <mpi.h>
>
> int main(int argc, char* argv[]) {
>     MPI_Init(&argc, &argv);
>
>     int rank, size;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (size != 2) {
>         if (rank == 0) {
>             std::cerr << "Error: This program must be run with exactly 2 MPI processes." << std::endl;
>         }
>         MPI_Finalize();
>         return 1;
>     }
>
>     const int num_samples = 100;
>     const long long packet_size = 1LL << 28; // 256 MB
>
>     // Allocate standard host memory. 'new' is sufficient.
>     char* buffer = new char[packet_size];
>
>     if (rank == 0) {
>         std::cout << "--- Starting Host-to-Host MPI Bandwidth Test ---" << std::endl;
>         std::cout << "Packet Size: " << (packet_size / (1024*1024)) << " MB" << std::endl;
>     }
>
>     std::vector<double> timings;
>     for (int i = 0; i < num_samples; ++i) {
>         MPI_Barrier(MPI_COMM_WORLD);
>         double start_time = MPI_Wtime();
>
>         if (rank == 0) {
>             MPI_Send(buffer, packet_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>             MPI_Recv(buffer, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // Wait for confirmation
>         } else { // rank == 1
>             MPI_Recv(buffer, packet_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             MPI_Send(buffer, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); // Send confirmation
>         }
>
>         double end_time = MPI_Wtime();
>         if (i >= 10) { // Discard warmup runs
>             timings.push_back(end_time - start_time);
>         }
>     }
>
>     if (rank == 0) {
>         double total_time = std::accumulate(timings.begin(), timings.end(), 0.0);
>         double avg_time = total_time / timings.size();
>         double bandwidth = (static_cast<double>(packet_size) / (1024.0 * 1024.0 * 1024.0)) / avg_time;
>
>         std::cout << "------------------------------------------------" << std::endl;
>         std::cout << "Average Host-to-Host Bandwidth: " << bandwidth << " GB/s" << std::endl;
>         std::cout << "------------------------------------------------" << std::endl;
>     }
>
>     // Clean up host memory
>     delete[] buffer;
>
>     MPI_Finalize();
>     return 0;
> }
>
>
> ----------------------------------------------------------------------------------------------------------------
> Here is the script to run the test with verbose compilation and the `shm`
> layer forced and unforced:
>
> #!/usr/bin/zsh
> source ~/.zshrc
>
> # Compile
> /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
>
> # Run with shm forced
> /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_PROVIDER=shm -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_shm.txt
>
> # Run without shm forced
> /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_no_shm.txt
>
> echo "Output of script with SHM forced: "
> grep -i "opened fabric" output_shm.txt
>
> echo "Output of script with SHM not forced: "
> grep -i "opened fabric" output_no_shm.txt
>
>
> ----------------------------------------------------------------------------------------------------------------
> Here is the output:
>
> --- Starting Host-to-Host MPI Bandwidth Test ---
> Packet Size: 256 MB
> ------------------------------------------------
> Average Host-to-Host Bandwidth: 3.35709 GB/s
> ------------------------------------------------
> --- Starting Host-to-Host MPI Bandwidth Test ---
> Packet Size: 256 MB
> ------------------------------------------------
> Average Host-to-Host Bandwidth: 3.54924 GB/s
> ------------------------------------------------
> Output of script with SHM forced:
> libfabric:3174297:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
> libfabric:3174298:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
> Output of script with SHM not forced:
> libfabric:3174351:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
> libfabric:3174350:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
>