[mpich-discuss] MPICH: SHM bandwidth very low on IPC test

Thu Aug 14 11:20:39 CDT 2025

Hi Sam,

the 10 GB/s STREAM bandwidth figure counts both the bytes read and the 
bytes written (see lines 190/366).

I would assume that your MPI bandwidth calculation only accounts for 
the buffer size (i.e., only the read or only the write). In shm 
communication, one process (and therefore one core) streams/memcopies 
the data from the send buffer to the receive buffer. So the 3.5 GB/s 
send bandwidth you see actually corresponds to 7 GB/s of STREAM Copy 
bandwidth.

As a side effect of shm communication, we have seen that the placement 
of the copying process can determine the first-touch allocation of the 
buffer. Even if the memory is allocated with calloc, its pages are not 
necessarily mapped to physical memory until they are first touched. A 
bcast/scatter to node-local processes can therefore result in all 
buffers being paged in on the same socket (which you typically want to 
avoid).

Best
Joachim

On 14.08.25 at 07:16, Sam Austin wrote:
> Hi Joachim,
> 
> Thanks for this suggestion! I used stream to test the single-core memory 
> bandwidth. I am running on a Xeon E5-2699A v4, which has 55MB last level 
> cache. So, I ran with 30 million elements per the instructions. It 
> appears that I am seeing about 10 GB/s if I'm reading that right? If so, 
> I am still not sure why I am only seeing ~3.5 GB/s on shared memory 
> performance with MPICH.
> 
> -------------------------------------------------------------
> STREAM version $Revision: 5.10 $
> -------------------------------------------------------------
> This system uses 8 bytes per array element.
> -------------------------------------------------------------
> Array size = 30000000 (elements), Offset = 0 (elements)
> Memory per array = 228.9 MiB (= 0.2 GiB).
> Total memory required = 686.6 MiB (= 0.7 GiB).
> Each kernel will be executed 10 times.
>   The *best* time for each kernel (excluding the first iteration)
>   will be used to compute the reported bandwidth.
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 30276 microseconds.
>     (= 30276 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function    Best Rate MB/s  Avg time     Min time     Max time
> Copy:           10038.1     0.048342     0.047818     0.050004
> Scale:          10342.2     0.048738     0.046412     0.056605
> Add:            10580.3     0.068542     0.068051     0.069805
> Triad:          10703.0     0.067615     0.067271     0.068143
> -------------------------------------------------------------
> Solution Validates: avg error less than 1.000000e-13 on all three arrays
> -------------------------------------------------------------
> 
> Thanks,
> Sam
> 
> On Wed, Aug 13, 2025 at 5:26 PM Jenke, Joachim <jenke at itc.rwth-aachen.de> wrote:
> 
>     Hi Sam,
> 
>     Can you try out stream to understand the single-core memory
>     bandwidth of the system?
> 
>     https://www.cs.virginia.edu/stream/ref.html
> 
>     Copy bandwidth for large chunks (exceeding cache sizes) should
>     provide you an upper bound for shm communication bandwidth.
> 
>     Best
>     Joachim
> 
>     On 13.08.2025 at 22:04, Sam Austin via discuss
>     <discuss at mpich.org> wrote:
>     Hi all,
> 
>     I am working to configure MPICH and run a few examples on my
>     standalone server (single node). Here are the system specs:
>     Server: Dell PowerEdge C4130
>     CPUs: 2x Xeon E5-2699A v4
>     GPUs: 4x Tesla V100s connected with NVLink, tied to motherboard with
>     PCIe gen 3
>     OS: Ubuntu 24.04 LTS
>     I intend to use this system to develop multi-process programs for
>     eventual execution in a large, distributed HPC environment. I ran a
>     few tests with and without CUDA support; here is my mpichversion output:
> 
>     MPICH Version:      4.3.1
>     MPICH Release date: Fri Jun 20 09:24:41 AM CDT 2025
>     MPICH ABI:          17:1:5
>     MPICH Device:       ch4:ofi
>     MPICH configure:    --prefix=/opt/mpich/4.2.1-cpu --without-cuda
>     MPICH CC:           gcc     -O2
>     MPICH CXX:          g++   -O2
>     MPICH F77:          gfortran   -O2
>     MPICH FC:           gfortran   -O2
>     MPICH features:     threadcomm
> 
>     The first example that I ran was a bandwidth test for CPU-CPU and
>     GPU-GPU communication. This simple program sends packets back
>     and forth between processes to test the bandwidth over the various
>     intra-node networks.
> 
>     The GPU-GPU bandwidth test showed that the GPU interconnect was
>     saturating at ~45 GB/s, which is nominal for the NVLink interconnect
>     topology present on the node (this was run with a CUDA-aware build
>     of MPICH). The problem appears during the CPU-CPU IPC test. In
>     theory, this test is pretty vanilla, as it is communicating between
>     processes using shared memory, and does not involve traversing any
>     of the intra-node networks (PCIe or NVLink). My understanding is
>     that the bandwidth observed on the CPU-CPU IPC test should be quite
>     high, at least higher than 10 GB/s.
> 
>     However, the intra-node IPC bandwidth appears to be very low, around
>     3.5 GB/s, when running this test. I tried the following fixes in an
>     attempt to force MPICH to use shared memory, but to no avail:
>     - Passing the option to explicitly specify `nemesis` during the build
>       configuration: "--with-device=ch3:nemesis --with-cuda"
>     - Passing the option to explicitly specify shared memory with ch4 to
>       the configuration: "--with-ch4-shmmods=posix --with-cuda"
>     - Rebuilding MPICH without GPU support: "--without-cuda"
>     - Switching to Open MPI and running the same test
>     These results, especially the last one, in which I saw the same
>     issue when running with Open MPI, make me think it might be an
>     issue with my system configuration. The question is: why is the IPC
>     bandwidth so low despite supposedly using the SHM protocol? I'm
>     wondering if anyone has encountered this issue before or might be
>     able to lend some advice here. Any help would be greatly appreciated!
> 
>     Some interesting observations from the output below: when I run with
>     "mpiexec -np 2 -genv FI_PROVIDER=shm ...", the log file reports
>     "Opened fabric: shm". However, when I run without "-genv
>     FI_PROVIDER=shm", the log file reports "Opened fabric:
>     10.133.0.0/21", which I believe means that MPICH is falling back on
>     the TCP socket protocol. In this case, my key point of confusion is
>     that the observed bandwidth is essentially the same between the SHM
>     and TCP protocols. Perhaps my test script isn't set up properly?
> 
>     Thanks,
>     Sam
> 
>     The following is attached below:
>     Bandwidth test program
>     Run script for the program
>     Output of the script on my machine
>     ----------------------------------------------------------------------------------------------------------------
>     In case the attachment doesn't go through, here are the contents of
>     my test program, "shmem_check.cpp":
> 
>     // shmem_check.cpp
>     //
>     // This is a minimal benchmark to test the raw bandwidth of MPI communication
>     // between two processes on the same node, using only host (CPU) memory.
>     // It completely removes CUDA to isolate the performance of the MPI library's
>     // on-node communication mechanism (e.g., shared memory vs. TCP loopback).
>     //
>     // Compile/run:
>     // /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
>     // /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 ./shmem_check
> 
>     #include <iostream>
>     #include <vector>
>     #include <numeric>
>     #include <mpi.h>
> 
>     int main(int argc, char* argv[]) {
>          MPI_Init(&argc, &argv);
> 
>          int rank, size;
>          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>          MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>          if (size != 2) {
>              if (rank == 0) {
>                  std::cerr << "Error: This program must be run with exactly 2 MPI processes." << std::endl;
>              }
>              MPI_Finalize();
>              return 1;
>          }
> 
>          const int num_samples = 100;
>          const long long packet_size = 1LL << 28; // 256 MB
> 
>          // Allocate standard host memory. 'new' is sufficient.
>          char* buffer = new char[packet_size];
> 
>          if (rank == 0) {
>              std::cout << "--- Starting Host-to-Host MPI Bandwidth Test ---" << std::endl;
>              std::cout << "Packet Size: " << (packet_size / (1024*1024)) << " MB" << std::endl;
>          }
> 
>          std::vector<double> timings;
>          for (int i = 0; i < num_samples; ++i) {
>              MPI_Barrier(MPI_COMM_WORLD);
>              double start_time = MPI_Wtime();
> 
>              if (rank == 0) {
>                  MPI_Send(buffer, packet_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>                  MPI_Recv(buffer, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // Wait for confirmation
>              } else { // rank == 1
>                  MPI_Recv(buffer, packet_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>                  MPI_Send(buffer, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); // Send confirmation
>              }
> 
>              double end_time = MPI_Wtime();
>              if (i >= 10) { // Discard warmup runs
>                  timings.push_back(end_time - start_time);
>              }
>          }
> 
>          if (rank == 0) {
>              double total_time = std::accumulate(timings.begin(), timings.end(), 0.0);
>              double avg_time = total_time / timings.size();
>              double bandwidth = (static_cast<double>(packet_size) / (1024.0 * 1024.0 * 1024.0)) / avg_time;
> 
>              std::cout << "------------------------------------------------" << std::endl;
>              std::cout << "Average Host-to-Host Bandwidth: " << bandwidth << " GB/s" << std::endl;
>              std::cout << "------------------------------------------------" << std::endl;
>          }
> 
>          // Clean up host memory
>          delete[] buffer;
> 
>          MPI_Finalize();
>          return 0;
>     }
> 
>     ----------------------------------------------------------------------------------------------------------------
>     Here is the script to run the test with verbose compilation and the
>     `shm` layer forced and unforced:
> 
>     #!/usr/bin/zsh
>     source ~/.zshrc
> 
>     # Compile
>     /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
> 
>     # Run with shm forced
>     /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_PROVIDER=shm -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_shm.txt
> 
>     # Run without shm forced
>     /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_no_shm.txt
> 
>     echo "Output of script with SHM forced: "
>     grep -i "opened fabric" output_shm.txt
> 
>     echo "Output of script with SHM not forced: "
>     grep -i "opened fabric" output_no_shm.txt
> 
>     ----------------------------------------------------------------------------------------------------------------
>     Here is the output :
> 
>     --- Starting Host-to-Host MPI Bandwidth Test ---
>     Packet Size: 256 MB
>     ------------------------------------------------
>     Average Host-to-Host Bandwidth: 3.35709 GB/s
>     ------------------------------------------------
>     --- Starting Host-to-Host MPI Bandwidth Test ---
>     Packet Size: 256 MB
>     ------------------------------------------------
>     Average Host-to-Host Bandwidth: 3.54924 GB/s
>     ------------------------------------------------
>     Output of script with SHM forced:
>     libfabric:3174297:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
>     libfabric:3174298:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
>     Output of script with SHM not forced:
>     libfabric:3174351:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
>     libfabric:3174350:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
> 

-- 
Dr. rer. nat. Joachim Jenke
Deputy Group Lead

IT Center
Group: HPC - Parallelism, Runtime Analysis & Machine Learning
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
jenke at itc.rwth-aachen.de
www.itc.rwth-aachen.de