[mpich-discuss] Weird performance of locally concurrent MPI_Put

Mon Dec 31 16:29:57 CST 2018

Good advice.
However, I now find that performance drops back to the sequential level when I
issue all the MPI_Put calls from threads within a single MPI rank, even when
using MPI_Win_allocate.
Have you seen similar behavior before?

Thanks
Kun

On Sat, Dec 29, 2018 at 9:36 PM Jeff Hammond <jeff.science at gmail.com> wrote:

> I don’t know why you are timing win_allocate. I’d only time
> lock-put-unlock or put-flush.
>
> Jeff
>
> On Sat, Dec 29, 2018 at 9:11 AM Kun Feng <kfeng1 at hawk.iit.edu> wrote:
>
>> Thank you for the replies.
>> MPI_Win_allocate gives me much better performance. It is even faster than
>> what I got from a pure memory bandwidth test.
>> I'm putting the same memory block from the source rank to the same memory
>> address on the destination rank, followed by MPI_Win_flush to synchronize.
>> Am I doing this correctly? The source code is attached.
>>
>> Thanks
>> Kun
>>
>>
>> On Fri, Dec 21, 2018 at 11:15 AM Jeff Hammond <jeff.science at gmail.com>
>> wrote:
>>
>>> Use MPI_Win_allocate instead of MPI_Win_create.  MPI_Win_create cannot
>>> allocate shared memory so you will not get good performance within a node.
>>>
>>> Jeff
>>>
>>> On Fri, Dec 21, 2018 at 8:18 AM Kun Feng via discuss <discuss at mpich.org>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm working on a project in which one half of the processes on each node
>>>> needs to send data to the other half.
>>>> I'm using the passive target mode of one-sided communication: the
>>>> receivers expose memory using MPI_Win_create and wait on MPI_Win_free, and
>>>> the senders send the data using MPI_Put.
>>>> The code works. However, I get weird performance from this concurrent
>>>> MPI_Put communication. The peak aggregate bandwidth is only around 5 GB/s,
>>>> which does not make sense as aggregate performance within a single node.
>>>> I thought node-local communication was implemented as a local memcpy,
>>>> but concurrent memcpy on the same testbed has 4x to 5x higher aggregate
>>>> bandwidth.
>>>> Even concurrent memcpy using Linux shared memory across processes is 3x
>>>> faster than my code.
>>>> I'm using CH3 in MPICH 3.2.1. CH4 in MPICH 3.3 is even 2x slower.
>>>> Does this performance make sense? Does MPICH have some queue for all
>>>> one-sided communication within one node? Or am I misunderstanding something?
>>>>
>>>> Thanks
>>>> Kun
>>>
>>>
>>> --
>>> Jeff Hammond
>>> jeff.science at gmail.com
>>> http://jeffhammond.github.io/
>>>
>> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
>
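
For reference, the concurrent memcpy baseline that the original post compares
against could be measured roughly as follows. This is only a sketch under
stated assumptions, not the benchmark attached to the thread: the names
copy_task, copy_worker, now_sec and concurrent_memcpy_gbs are hypothetical, it
uses POSIX threads inside one process (not separate processes with shared
memory), and the timing includes thread startup and join.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical per-thread work item: copy `bytes` from src to dst, `reps` times. */
typedef struct {
    char *src, *dst;
    size_t bytes;
    int reps;
} copy_task;

static void *copy_worker(void *arg) {
    copy_task *t = (copy_task *)arg;
    for (int r = 0; r < t->reps; r++)
        memcpy(t->dst, t->src, t->bytes);
    return NULL;
}

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs `nthreads` concurrent copy loops on private buffers and returns the
 * aggregate bandwidth in GB/s (1e9 bytes/s), or -1.0 on bad input or failure.
 * Thread creation and join are inside the timed region, so use enough
 * repetitions to make that overhead negligible. */
double concurrent_memcpy_gbs(int nthreads, size_t bytes, int reps) {
    if (nthreads <= 0 || bytes == 0 || reps <= 0) return -1.0;

    pthread_t *tid = calloc((size_t)nthreads, sizeof *tid);
    copy_task *task = calloc((size_t)nthreads, sizeof *task);
    if (!tid || !task) { free(tid); free(task); return -1.0; }

    int ok = 1;
    for (int i = 0; i < nthreads; i++) {
        task[i].src = malloc(bytes);
        task[i].dst = malloc(bytes);
        task[i].bytes = bytes;
        task[i].reps = reps;
        if (!task[i].src || !task[i].dst) { ok = 0; continue; }
        memset(task[i].src, i + 1, bytes);  /* distinct pattern; also touches pages */
        memset(task[i].dst, 0, bytes);
    }

    int started = 0;
    double t0 = now_sec();
    for (; ok && started < nthreads; started++) {
        if (pthread_create(&tid[started], NULL, copy_worker, &task[started]) != 0) {
            ok = 0;
            break;
        }
    }
    for (int i = 0; i < started; i++) pthread_join(tid[i], NULL);
    double elapsed = now_sec() - t0;

    for (int i = 0; i < nthreads; i++) {
        /* Check the copy really happened before releasing the buffers. */
        if (ok && memcmp(task[i].src, task[i].dst, bytes) != 0) ok = 0;
        free(task[i].src);
        free(task[i].dst);
    }
    free(tid);
    free(task);

    if (!ok || elapsed <= 0.0) return -1.0;
    return (double)nthreads * (double)bytes * (double)reps / elapsed / 1e9;
}
```

As an illustration, concurrent_memcpy_gbs(8, (size_t)64 << 20, 20) would copy
64 MiB per thread 20 times across 8 threads; those numbers are arbitrary.
Comparing the result against the MPI_Put bandwidth is only meaningful if the
MPI timing likewise covers just lock-put-unlock or put-flush, as suggested
earlier in the thread.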