[mpich-discuss] Memory alignment with MPI_Alloc_mem

Jeff Hammond jeff.science at gmail.com
Wed Feb 18 08:48:31 CST 2015


I have a hard time imagining that Cray doesn't do what's necessary to
ensure proper utilization of their network with the system software
they package.

The only place MPI does atomics on user buffers is in MPI-3 RMA, and
Cray MPI uses the software implementation in CH3 by default (they have
a DMAPP implementation as an option, but I don't know the details).

The easiest way to deal with the NIC alignment requirements is for
Cray to patch glibc so that malloc always returns an 8-byte-aligned
address, if it doesn't do so already.  And stack variables are aligned
to at least basic-datatype granularity, so the only place it could be
an issue is atomics on 32-bit types on the stack, no?
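
And if that stack case ever does matter, the user-side fix is easy:
hand the atomic an aligned heap buffer instead of a stack variable.  A
minimal sketch, assuming plain MPI-3 calls plus posix_memalign (the
8-byte boundary is the GNI atomic requirement Scott mentions):

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* 8-byte-aligned heap target instead of a 4-byte int on the stack */
      int *counter;
      if (posix_memalign((void **)&counter, 8, sizeof(int)))
          MPI_Abort(MPI_COMM_WORLD, 1);
      *counter = 0;

      MPI_Win win;
      MPI_Win_create(counter, sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      /* the only MPI atomics on user buffers: MPI-3 RMA accumulate ops */
      int one = 1, old;
      MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
      MPI_Fetch_and_op(&one, &old, MPI_INT, 0, 0, MPI_SUM, win);
      MPI_Win_unlock(0, win);

      MPI_Win_free(&win);
      free(counter);
      MPI_Finalize();
      return 0;
  }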

Best,

Jeff

On Mon, Feb 16, 2015 at 4:53 AM, Atchley, Scott <atchleyes at ornl.gov> wrote:
> Jeff,
>
> I expect that he is concerned with GNI's 4-byte alignment requirement for both the address and length for RDMA Reads and the 8-byte alignment for atomics.
>
> Scott
>
> On Feb 16, 2015, at 1:42 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
>
>> If you are going to suballocate from a slab yourself, you can handle
>> alignment yourself easily enough, no?  Or do I misunderstand what you
>> mean here?  And what sort of alignment do you want?  Are you trying to
>> align to 32/64 bytes because of AVX or some other x86 feature on Cray
>> XC30 or do you want page alignment?
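>>
>> If you roll your own suballocator, the alignment is just a round-up on
>> the offset.  A minimal sketch, with made-up names, no bounds checking,
>> and a power-of-two alignment assumed:
>>
>>   #include <stdint.h>
>>   #include <stddef.h>
>>
>>   /* bump-pointer suballocation from a slab: round the current offset up
>>      to 'align' (a power of two), hand out the aligned address, advance */
>>   static void *slab_alloc(char *slab, size_t *offset, size_t bytes,
>>                           size_t align)
>>   {
>>       uintptr_t p = (uintptr_t)(slab + *offset);
>>       uintptr_t aligned = (p + align - 1) & ~(uintptr_t)(align - 1);
>>       *offset = (size_t)(aligned - (uintptr_t)slab) + bytes;
>>       return (void *)aligned;
>>   }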
>>
>> But what do you really want to achieve?  While it is usually
>> beneficial to use pre-registered buffers on RDMA networks, good MPI
>> implementations have a page-registration cache.  If, as you say, you
>> are suballocating from a slab, Cray MPI should have the backing pages
>> in the registration cache after you use them as MPI buffers.
>>
>> You can maximize the efficiency of the page registration cache by
>> using large pages.  Search for intro_hugepages using 'man' or on the
>> Internet to learn the specifics of this.  I suspect that using large
>> pages will deliver much of the benefit you hoped to achieve with an
>> explicitly-registering MPI_Alloc_mem.
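>>
>> If you want to place a huge-page-backed slab yourself rather than go
>> through the craype-hugepages modules, one Linux-only alternative is
>> mmap with MAP_HUGETLB.  A sketch, assuming huge pages have been
>> reserved on the nodes and the length is a multiple of the huge-page
>> size:
>>
>>   #define _GNU_SOURCE
>>   #include <stddef.h>
>>   #include <sys/mman.h>
>>
>>   /* huge-page-backed anonymous slab; falls back to normal pages if the
>>      system has no huge pages reserved */
>>   static void *huge_slab(size_t bytes)
>>   {
>>       void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
>>                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
>>       if (p == MAP_FAILED)
>>           p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
>>                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>       return p;  /* MAP_FAILED if both attempts fail */
>>   }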
>>
>> If you really want to max out RDMA on Cray networks, you need to use
>> DMAPP.  I have some simple examples and pointers to docs here:
>> https://github.com/jeffhammond/HPCInfo/tree/master/dmapp.  I have more
>> examples in other places that I'll migrate to that location if requested.
>>
>> If you're interested in portability, MPI-3 RMA is a good abstraction
>> for RDMA networks.  Some implementations do a better job than others
>> at exposing this relationship.  Cray MPI has a DMAPP back-end for RMA
>> now, although it is not active by default.  You could also try Torsten
>> Hoefler's foMPI
>> [http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI/].
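>>
>> A minimal sketch of that pattern, using only standard MPI-3 calls
>> (window allocation plus a one-sided put in a passive-target epoch;
>> sizes and the neighbor choice are arbitrary):
>>
>>   #include <mpi.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>       MPI_Init(&argc, &argv);
>>       int rank, np;
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       MPI_Comm_size(MPI_COMM_WORLD, &np);
>>
>>       /* let the implementation allocate (and register) the window memory */
>>       double *base;
>>       MPI_Win win;
>>       MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
>>                        MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
>>
>>       /* one-sided write into the right neighbor's window */
>>       double val = (double)rank;
>>       MPI_Win_lock_all(0, win);
>>       MPI_Put(&val, 1, MPI_DOUBLE, (rank + 1) % np, 0, 1, MPI_DOUBLE, win);
>>       MPI_Win_unlock_all(win);
>>
>>       MPI_Win_free(&win);
>>       MPI_Finalize();
>>       return 0;
>>   }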
>>
>> Best,
>>
>> Jeff
>>
>> On Sat, Feb 14, 2015 at 2:37 PM, Marcin Zalewski
>> <marcin.zalewski at gmail.com> wrote:
>>> I am using Cray MPT, and I would like to allocate a large region of
>>> memory from which, in turn, I will allocate buffers to be used with
>>> MPI. I am wondering if there is any benefit from allocating that heap
>>> with MPI_Alloc_mem. I would hope that it could be pre-registered for
>>> RDMA, speeding things up. However, I need this memory to have a
>>> specific alignment. Is there a general way in MPICH or maybe a
>>> specific way for MPT to request alignment with MPI_Alloc_mem?
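>>>
>>> To make the intent concrete, a sketch of the pattern I have in mind,
>>> with made-up sizes and a 64-byte alignment chosen only as an example
>>> (over-allocate the slab, then round the first buffer up by hand):
>>>
>>>   #include <mpi.h>
>>>   #include <stdint.h>
>>>
>>>   int main(int argc, char **argv)
>>>   {
>>>       MPI_Init(&argc, &argv);
>>>
>>>       /* one large slab from MPI_Alloc_mem, over-allocated by the alignment */
>>>       void *slab;
>>>       MPI_Aint bytes = 64 * 1024 * 1024;
>>>       MPI_Alloc_mem(bytes + 64, MPI_INFO_NULL, &slab);
>>>
>>>       /* round the first suballocated buffer up to a 64-byte boundary */
>>>       uintptr_t p = (uintptr_t)slab;
>>>       void *buf = (void *)((p + 63) & ~(uintptr_t)63);
>>>       (void)buf;  /* buffers like this then go to MPI_Isend/MPI_Irecv etc. */
>>>
>>>       MPI_Free_mem(slab);
>>>       MPI_Finalize();
>>>       return 0;
>>>   }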
>>>
>>> Thanks,
>>> Marcin
>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

