<div dir="ltr">Hi all,<div><br></div><div>I'm working on a project in which half of the processes on each node need to send data to the other half.</div><div>I'm using the passive target mode of one-sided communication: the receivers expose memory with MPI_Win_create and then wait in MPI_Win_free, while the senders write the data with MPI_Put.</div><div>The code works, but the performance of these concurrent MPI_Put operations is odd. The peak aggregate bandwidth is only around 5 GB/s, which seems too low for communication within a single node.<br></div><div>I thought node-local communication was implemented as a local memcpy.</div><div>But concurrent memcpy on the same testbed achieves 4x to 5x higher aggregate bandwidth.</div><div>Even concurrent memcpy through Linux shared memory across processes is 3x faster than my code.</div><div>I'm using CH3 in MPICH 3.2.1. CH4 in MPICH 3.3 is even 2x slower.</div><div>Does this performance make sense? Does MPICH have some queue for all one-sided communication within a node? Or am I misunderstanding something?</div><div><br clear="all"><div><div dir="ltr" class="gmail-m_-7134994236507619519gmail_signature"><div dir="ltr">Thanks<div>Kun</div></div></div></div></div></div>