[mpich-discuss] MPICH: SHM bandwidth very low on IPC test
Sam Austin
sam.austin.p at gmail.com
Wed Aug 13 15:03:47 CDT 2025
Hi all,
I am working to configure MPICH and run a few examples on my standalone
server (single node). Here are the system specs:
Server: Dell PowerEdge C4130
CPUs: 2x Xeon E5-2699A v4
GPUs: 4x Tesla V100s connected with NVLink, tied to motherboard with PCIe
gen 3
OS: Ubuntu 24.04 LTS
I intend to use this system to develop multi-process programs for eventual
execution in a large, distributed HPC environment. I ran a few tests with
and without CUDA support; here is my mpichversion output:
MPICH Version: 4.3.1
MPICH Release date: Fri Jun 20 09:24:41 AM CDT 2025
MPICH ABI: 17:1:5
MPICH Device: ch4:ofi
MPICH configure: --prefix=/opt/mpich/4.2.1-cpu --without-cuda
MPICH CC: gcc -O2
MPICH CXX: g++ -O2
MPICH F77: gfortran -O2
MPICH FC: gfortran -O2
MPICH features: threadcomm
The first example that I ran was a bandwidth test for CPU-CPU and GPU-GPU
communication. This simple program sends packets back and forth between
processes to test the bandwidth over the various intra-node networks.
The GPU-GPU bandwidth test showed that the GPU interconnect was saturating
at ~45 GB/s, which is nominal for the NVLink interconnect topology present
on the node (this was run with a CUDA-aware build of MPICH). The problem
appears during the CPU-CPU IPC test. In theory, this test is pretty
vanilla, as it is communicating between processes using shared memory, and
does not involve traversing any of the intra-node networks (PCIe or
NVLink). My understanding is that the bandwidth observed on the CPU-CPU IPC
test should be quite high, at least higher than 10 GB/s.
However, the intra-node IPC bandwidth appears to be very low, around 3.5
GB/s, when running this test. I tried the following fixes in an attempt to
force MPICH to use shared memory, but to no avail:
- Passing the option to explicitly specify `nemesis` during the build
  configuration: "--with-device=ch3:nemesis --with-cuda"
- Passing the option to explicitly specify shared memory with ch4 to the
  configuration: "--with-ch4-shmmods=posix --with-cuda"
- Rebuilding MPICH without GPU support: "--without-cuda"
- Switching to Open MPI and running the same test
These results, especially the last one, in which I saw the same issue when
running with Open MPI, make me think it might be an issue with my system
configuration. The question is: why is the IPC bandwidth so low despite
supposedly using the SHM protocol? I'm wondering if anyone has encountered
this issue before or might be able to lend some advice here. Any help would
be greatly appreciated!
Some interesting observations from the output below: when I run with
"mpiexec -np 2 -genv FI_PROVIDER=shm ...", the log file reports "Opened
fabric: shm". However, when I run without "-genv FI_PROVIDER=shm", the log
file reports "Opened fabric: 10.133.0.0/21", which I believe means that
MPICH is falling back on the TCP socket protocol. In this case, my key
point of confusion is that the observed bandwidth is essentially the same
between the SHM and TCP protocols. Perhaps my test script isn't set up
properly?
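One way to rule out an environment-propagation problem is to have each
process report the FI_PROVIDER value it actually sees. The following is a
minimal sketch, not part of the original test: the helper name
`provider_seen` is hypothetical, and it only reads the environment with the
standard `std::getenv`, so it says nothing about which provider libfabric
finally selects.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of the named environment variable,
// or "(unset)" when the variable is not present in this process.
std::string provider_seen(const char* name = "FI_PROVIDER") {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}
```

Printing `provider_seen()` together with the rank right after MPI_Init would
show whether the "-genv FI_PROVIDER=shm" setting reaches both processes.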
Thanks,
Sam
The following are attached below:
- Bandwidth test program
- Run script for the program
- Output of the script on my machine
----------------------------------------------------------------------------------------------------------------
In case the attachment doesn't go through, here are the contents of my test
program, "shmem_check.cpp":
// shmem_check.cpp
//
// This is a minimal benchmark to test the raw bandwidth of MPI communication
// between two processes on the same node, using only host (CPU) memory.
// It completely removes CUDA to isolate the performance of the MPI library's
// on-node communication mechanism (e.g., shared memory vs. TCP loopback).
//
// Compile/run:
//   /opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include shmem_check.cpp -o shmem_check
//   /opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 ./shmem_check
#include <iostream>
#include <vector>
#include <numeric>
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 2) {
        if (rank == 0) {
            std::cerr << "Error: This program must be run with exactly 2 MPI processes."
                      << std::endl;
        }
        MPI_Finalize();
        return 1;
    }

    const int num_samples = 100;
    const long long packet_size = 1LL << 28; // 256 MB

    // Allocate standard host memory. 'new' is sufficient.
    char* buffer = new char[packet_size];

    if (rank == 0) {
        std::cout << "--- Starting Host-to-Host MPI Bandwidth Test ---" << std::endl;
        std::cout << "Packet Size: " << (packet_size / (1024*1024)) << " MB" << std::endl;
    }

    std::vector<double> timings;
    for (int i = 0; i < num_samples; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start_time = MPI_Wtime();

        if (rank == 0) {
            MPI_Send(buffer, packet_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buffer, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE); // Wait for confirmation
        } else { // rank == 1
            MPI_Recv(buffer, packet_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buffer, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD); // Send confirmation
        }

        double end_time = MPI_Wtime();
        if (i >= 10) { // Discard warmup runs
            timings.push_back(end_time - start_time);
        }
    }

    if (rank == 0) {
        double total_time = std::accumulate(timings.begin(), timings.end(), 0.0);
        double avg_time = total_time / timings.size();
        double bandwidth =
            (static_cast<double>(packet_size) / (1024.0 * 1024.0 * 1024.0)) / avg_time;

        std::cout << "------------------------------------------------" << std::endl;
        std::cout << "Average Host-to-Host Bandwidth: " << bandwidth << " GB/s" << std::endl;
        std::cout << "------------------------------------------------" << std::endl;
    }

    // Clean up host memory
    delete[] buffer;

    MPI_Finalize();
    return 0;
}
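For comparison, a plain single-threaded memcpy of the same size gives a
rough idea of how fast one core can copy a buffer on this machine. The
following is a minimal sketch, not part of the original test: the function
name `memcpy_bandwidth_gib` is hypothetical, and it rests on the assumption
that the shared-memory path ultimately moves the data with CPU copies.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: average single-threaded memcpy bandwidth in GiB/s
// for copies of `bytes` bytes, measured over `reps` timed repetitions.
double memcpy_bandwidth_gib(std::size_t bytes, int reps) {
    assert(bytes > 0 && reps > 0);
    // Filling both vectors touches every page up front, so page faults
    // are not charged to the timed copies.
    std::vector<char> src(bytes, 1), dst(bytes, 0);
    std::memcpy(dst.data(), src.data(), bytes);  // untimed warmup copy

    volatile char sink = 0;  // keeps the copies from being optimised away
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        src[0] = static_cast<char>(i);  // make each copy observably different
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[0];
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    (void)sink;

    const double gib = static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
    return gib * reps / elapsed.count();
}
```

Calling `memcpy_bandwidth_gib(1ULL << 28, 20)` uses the same 256 MB size as
the MPI test. If the result lands in the same range as the MPI number, the
copy itself, rather than the choice of transport, may be the limit.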
----------------------------------------------------------------------------------------------------------------
Here is the script that compiles the test and runs it with debug logging,
once with the `shm` provider forced and once without:
#!/usr/bin/zsh
source ~/.zshrc

# Compile
/opt/mpich/4.2.1-cpu/bin/mpicxx -std=c++17 -I/opt/mpich/4.2.1-cpu/include \
    shmem_check.cpp -o shmem_check

# Run with shm forced
/opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_PROVIDER=shm \
    -genv FI_LOG_LEVEL=debug ./shmem_check 2> output_shm.txt

# Run without shm forced
/opt/mpich/4.2.1-cpu/bin/mpiexec -np 2 -genv FI_LOG_LEVEL=debug \
    ./shmem_check 2> output_no_shm.txt

echo "Output of script with SHM forced: "
grep -i "opened fabric" output_shm.txt
echo "Output of script with SHM not forced: "
grep -i "opened fabric" output_no_shm.txt
----------------------------------------------------------------------------------------------------------------
Here is the output:
--- Starting Host-to-Host MPI Bandwidth Test ---
Packet Size: 256 MB
------------------------------------------------
Average Host-to-Host Bandwidth: 3.35709 GB/s
------------------------------------------------
--- Starting Host-to-Host MPI Bandwidth Test ---
Packet Size: 256 MB
------------------------------------------------
Average Host-to-Host Bandwidth: 3.54924 GB/s
------------------------------------------------
Output of script with SHM forced:
libfabric:3174297:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
libfabric:3174298:1755114546::core:core:fi_fabric_():1503<info> Opened fabric: shm
Output of script with SHM not forced:
libfabric:3174351:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
libfabric:3174350:1755114554::core:core:fi_fabric_():1503<info> Opened fabric: 10.133.0.0/21
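
Note that the test program divides by 1024^3, so the figures it labels
"GB/s" are binary GiB/s. The arithmetic for converting them to decimal GB/s
and to time per transfer can be sketched as follows; the function names are
hypothetical and are not part of the original test.

```cpp
#include <cassert>

// Convert GiB/s (2^30 bytes per second) to decimal GB/s (10^9 bytes per second).
double gib_to_gb_per_s(double gib_per_s) {
    return gib_per_s * (1024.0 * 1024.0 * 1024.0) / 1e9;
}

// Seconds taken by one transfer of `bytes` bytes at the given GiB/s rate.
double seconds_per_transfer(long long bytes, double gib_per_s) {
    assert(bytes > 0 && gib_per_s > 0.0);
    return (static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0)) / gib_per_s;
}
```

For the reported 3.35709 figure, this works out to roughly 3.6 GB/s in
decimal units, or about 74 ms per 256 MB transfer. That time also includes
the 1-byte confirmation message, since the timer in the test spans both.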
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shmem_check.cpp
Type: application/octet-stream
Size: 2803 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20250813/c7f6a495/attachment.obj>