[mpich-discuss] Weird performance of locally concurrent MPI_Put

Jeff Hammond jeff.science at gmail.com
Thu Jan 3 16:13:32 CST 2019


I don't use RMA from threads much, but I do not expect MPICH to have
internal concurrency for bandwidth-limited RMA operations in shared memory,
so it does not surprise me that the bandwidth does not improve with multiple
threads.
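
For concreteness, a minimal sketch of the pattern under discussion: one rank
drives several OpenMP threads that each issue MPI_Put into a window on a
node-local peer. This is not the attached code; the thread count and transfer
size are made up, and it assumes MPI_THREAD_MULTIPLE support plus OpenMP.

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int nthreads = 4;              /* illustrative */
    const MPI_Aint chunk = 1 << 20;      /* 1 MiB per thread, illustrative */
    char *base = NULL;
    MPI_Win win;
    MPI_Win_allocate(nthreads * chunk, 1, MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    if (rank == 0 && nranks > 1) {
        char *src = malloc(nthreads * chunk);   /* contents do not matter here */
        #pragma omp parallel num_threads(nthreads)
        {
            int t = omp_get_thread_num();
            /* Each thread targets a disjoint region of rank 1's window. */
            MPI_Put(src + t * chunk, (int)chunk, MPI_CHAR,
                    1, t * chunk, (int)chunk, MPI_CHAR, win);
        }
        MPI_Win_flush(1, win);   /* complete all outstanding puts at rank 1 */
        free(src);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Even though the puts are issued concurrently, the node-local data movement may
still be serialized inside the library, so the aggregate bandwidth of a loop
like this need not scale with the thread count.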

Jeff

On Mon, Dec 31, 2018 at 2:30 PM Kun Feng <kfeng1 at hawk.iit.edu> wrote:

> Good advice.
> Now I find that the performance drops back to sequential when I move all
> of the MPI_Put sources into threads within a single MPI rank, even when
> using MPI_Win_allocate.
> Have you had a similar experience before?
>
> Thanks
> Kun
>
>
> On Sat, Dec 29, 2018 at 9:36 PM Jeff Hammond <jeff.science at gmail.com>
> wrote:
>
>> I don’t know why you are timing win_allocate. I’d only time
>> lock-put-unlock or put-flush.
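
A rough sketch of timing only the data movement, assuming the window comes
from MPI_Win_allocate and with buf, bytes, iters and target as placeholder
parameters:

#include <mpi.h>

/* Sketch: only the put/flush epoch is timed, not window creation.
 * Returns bytes per second. */
static double time_puts(const void *buf, int bytes, int iters,
                        int target, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Put(buf, bytes, MPI_CHAR, target, 0, bytes, MPI_CHAR, win);
        MPI_Win_flush(target, win);   /* complete this put at the target */
    }
    double t1 = MPI_Wtime();

    MPI_Win_unlock(target, win);
    return (double)iters * bytes / (t1 - t0);
}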
>>
>> Jeff
>>
>> On Sat, Dec 29, 2018 at 9:11 AM Kun Feng <kfeng1 at hawk.iit.edu> wrote:
>>
>>> Thank you for the replies.
>>> MPI_Win_allocate gives me much better performance. It is even faster
>>> than what I got from a pure memory bandwidth test.
>>> I'm putting the same memory block from the source rank to the same
>>> memory address on the destination rank, followed by MPI_Win_flush to
>>> synchronize.
>>> Am I doing it correctly? The source code is attached.
>>>
>>> Thanks
>>> Kun
>>>
>>>
>>> On Fri, Dec 21, 2018 at 11:15 AM Jeff Hammond <jeff.science at gmail.com>
>>> wrote:
>>>
>>>> Use MPI_Win_allocate instead of MPI_Win_create.  MPI_Win_create cannot
>>>> allocate shared memory, so you will not get good performance within a node.
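
A minimal sketch of the difference (the 1 MiB window size is arbitrary):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Aint bytes = 1 << 20;    /* illustrative window size */

    /* MPI_Win_create exposes memory the user allocated, so MPICH cannot
     * place it in a shared-memory segment for node-local peers. */
    void *buf = malloc(bytes);
    MPI_Win win_created;
    MPI_Win_create(buf, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win_created);

    /* MPI_Win_allocate lets the library choose the memory, so it can back
     * the window with shared memory for ranks on the same node. */
    void *base;
    MPI_Win win_allocated;
    MPI_Win_allocate(bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                     &base, &win_allocated);

    MPI_Win_free(&win_allocated);
    MPI_Win_free(&win_created);
    free(buf);
    MPI_Finalize();
    return 0;
}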
>>>>
>>>> Jeff
>>>>
>>>> On Fri, Dec 21, 2018 at 8:18 AM Kun Feng via discuss <discuss at mpich.org>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm working on a project in which one half of the processes in each
>>>>> node need to send data to the other half.
>>>>> I'm using the passive-target mode of one-sided communication, in which
>>>>> the receivers expose memory using MPI_Win_create and wait on
>>>>> MPI_Win_free, and the senders send the data using MPI_Put.
>>>>> The code works. However, I get weird performance from this concurrent
>>>>> MPI_Put communication. The peak aggregate bandwidth is only around
>>>>> 5 GB/s, which does not make sense as aggregate performance within a
>>>>> single node.
>>>>> I thought node-local communication was implemented as a local memcpy,
>>>>> but concurrent memcpy on the same testbed has 4x to 5x higher
>>>>> aggregate bandwidth.
>>>>> Even concurrent memcpy using Linux shared memory across processes is
>>>>> 3x faster than my code.
>>>>> I'm using CH3 in MPICH 3.2.1. CH4 in MPICH 3.3 is even 2x slower.
>>>>> Does the performance make sense? Does MPICH have some queue for all
>>>>> one-sided communication within a node? Or do I understand it
>>>>> incorrectly?
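
As a rough sketch of the exposure/put pattern described above (not the
attached code; the buffer size and the sender-to-receiver pairing are
illustrative, and an even number of ranks is assumed):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);   /* assumes an even count */

    const MPI_Aint bytes = 1 << 20;           /* illustrative */
    int receiver = (rank < nranks / 2);       /* first half receives */
    char *buf = malloc(bytes);

    MPI_Win win;
    /* Receivers expose their buffer; senders contribute a zero-size window. */
    MPI_Win_create(buf, receiver ? bytes : 0, 1,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (!receiver && nranks > 1) {
        int target = rank - nranks / 2;       /* paired receiver */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(buf, (int)bytes, MPI_CHAR, target, 0,
                (int)bytes, MPI_CHAR, win);
        MPI_Win_unlock(target, win);          /* completes the put */
    }

    MPI_Win_free(&win);    /* collective: receivers effectively wait here */
    free(buf);
    MPI_Finalize();
    return 0;
}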
>>>>>
>>>>> Thanks
>>>>> Kun
>>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> jeff.science at gmail.com
>>>> http://jeffhammond.github.io/
>>>>
>>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>>
>

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/