[mpich-discuss] MPICH: SHM bandwidth very low on IPC test

Jenke, Joachim jenke at itc.rwth-aachen.de
Fri Aug 15 01:44:33 CDT 2025


Hi Sam,

First, I'm not sure why reaching peak performance on a development system matters. To understand production performance, you would need to test on the target system anyway, especially because a single-node system is not a good proxy for communication in a distributed-memory system.

On 15.08.2025 02:16, Sam Austin <sam.austin.p at gmail.com> wrote:
I would assume that your MPI bandwidth calculation only accounts for the buffer size (i.e., it counts either the read or the write, not both).
This is a good point, although I'm not sure why the bandwidth would be 1.5x higher on the other machine that has very similar memory performance.

Depending on the process placement, this might be caused by moving the data from one socket to the other; see below.

One more piece of information: The single-socket machine (Xeon E5-2650 v4) has four RAM sticks in a quad-channel configuration, all tied to the same socket as you can see in lstopo. On the dual-socket machine (the machine in question), the four RAM sticks are in a dual-channel configuration, with two sticks on each socket. So, I'm not sure if the dual- vs quad-channel configuration is hurting maximum memory bandwidth per socket on the dual-socket machine, despite the total bandwidths being approximately the same in STREAM.

As Peter already pointed out: there are at least two bandwidth limits in a multicore system. One is the bandwidth a single core can stream using its load/store pipelines. The other is the bandwidth of the connection between the CPU package and the memory. Going from dual-channel to quad-channel increases the latter, so you need more processes/threads to saturate the bandwidth with a quad-channel configuration.
In multi-socket systems there is a third bandwidth limit for accessing memory attached to a different socket, which is typically lower. For the p2p bandwidth test, this bandwidth should not be the limiting factor, but the increased latency of these remote accesses might reduce the achievable single-core bandwidth.

I have tried to make the tests more uniform by binding the processes to the same core. But that still calls into question the total memory bandwidth of the dual-channel vs quad-channel memory configuration.

Do you mean binding both processes to the same core, or using a symmetric process placement on the two systems?
Try binding the processes to cores on the same and on different sockets. Also, make sure to initialize the buffers before starting communication, so that the pages are placed locally (first-touch policy). Repeated communication in the same direction might cause the OS to trigger page migration, so make sure to communicate back and forth.
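For reference, same-socket vs. cross-socket placement can be selected directly from MPICH's Hydra mpiexec (a sketch; ./bw_test stands in for your benchmark binary, and the exact option spellings may vary with your MPICH version, so check the mpiexec documentation for your installation):

```shell
# Both ranks pinned to cores on the same socket:
mpiexec -n 2 -bind-to core -map-by core ./bw_test

# One rank per socket, to expose the inter-socket path:
mpiexec -n 2 -bind-to core -map-by socket ./bw_test

# Verify the binding actually applied to each rank:
mpiexec -n 2 -bind-to core -map-by socket hwloc-bind --get
```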

Do you have any thoughts on this? The question I'm trying to answer is: for what reasons would the memory bandwidth be significantly lower on the dual-socket Dell machine? I understand that making comparisons across machines is tricky, but I've tried to provide as much information as possible to isolate the key aspects of the memory configuration.

In the end, the p2p bandwidth will never be reached in a large-scale program, because filling the node quickly saturates the overall memory bandwidth.

Best
Joachim