[mpich-discuss] Optimizing runtime for 4 mpiexec.hydra jobs

Wed Nov 13 11:36:28 CST 2019

Hi Doug,

With -map-by numa, processes are mapped to the available NUMA domains round-robin. So if you run a single MPI job with -bind-to numa on a machine with two NUMA domains, rank 0 goes to numa0:core0:hwthread0, rank 1 to numa1:core64:hwthread128, rank 2 to numa0:core0:hwthread1, and so on, which is what you were expecting. However, if you run multiple MPI jobs with the same binding policy, the jobs are independent of each other and all map to the same cores and hwthreads. So with 256 MPI processes spread over 4 separate jobs you are still using only half of the cores and a fourth of the hwthreads. If you want each job to map to different hwthreads, you can use a custom policy that is different for every job:

mpiexec -bind-to --help

-bind-to: Process-core binding type to use

    Binding type options:
        Default:
            none             -- no binding (default)

        Architecture unaware options:
            rr               -- round-robin as OS assigned processor IDs
            user:0+2,1+4,3,2 -- user specified binding

The user policy allows you to do exactly that. There you list the hwthreads you want to bind your ranks to. The +N suffix means your MPI process can move across hwthreads with a stride of N. So if you have 32 hwthreads and one process and request -bind-to user:0+2, your process can move across hwthreads 0, 2, 4, 6, 8, …
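As a rough sketch of the per-job idea, the script below builds a different user: list for each of 4 jobs and prints the resulting mpiexec command lines without running them. The layout is an assumption for illustration only: 64 ranks per job, 256 hwthreads numbered 0..255, and each job taking a contiguous block of IDs. The helper name user_binding and the ./app executable are hypothetical placeholders. How OS hwthread IDs relate to cores and NUMA domains differs between machines, so check the real numbering before relying on a layout like this.

```shell
#!/bin/sh
# Build a "user:" binding list covering `count` consecutive hwthread IDs,
# starting at `start`. Hypothetical helper, not part of Hydra.
user_binding() {
    start=$1
    count=$2
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        id=$((start + i))
        if [ -z "$list" ]; then
            list="$id"
        else
            list="$list,$id"
        fi
        i=$((i + 1))
    done
    printf 'user:%s\n' "$list"
}

# Assumed layout: 4 jobs, 64 ranks each, hwthreads numbered 0..255.
ranks_per_job=64
job=0
while [ "$job" -lt 4 ]; do
    binding=$(user_binding $((job * ranks_per_job)) "$ranks_per_job")
    # Only print the command; ./app is a placeholder for the real executable.
    echo "mpiexec -n $ranks_per_job -bind-to $binding ./app"
    job=$((job + 1))
done
```

Printing the commands first makes it easy to confirm that no hwthread ID appears in two jobs before anything is launched.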

If you want to see what binding your processes get, you can export HYDRA_TOPO_DEBUG=1 and use /bin/true as a simple test program to inspect the binding you get for different options.
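A small sketch of that check, looping over a few of the binding options mentioned in this thread. The helper name show_binding_cmd and the choice of 4 ranks are assumptions for illustration. The command is always printed, and only executed if an mpiexec is found on the PATH.

```shell
#!/bin/sh
# Format the debug command for a given rank count and binding policy.
# Hypothetical helper; it only builds the command line as text.
show_binding_cmd() {
    printf 'HYDRA_TOPO_DEBUG=1 mpiexec -n %s -bind-to %s /bin/true\n' "$1" "$2"
}

for policy in rr numa user:0+2; do
    show_binding_cmd 4 "$policy"
    # Run it only when mpiexec is available; /bin/true exits immediately,
    # so the only output of interest is the binding debug information.
    if command -v mpiexec >/dev/null 2>&1; then
        HYDRA_TOPO_DEBUG=1 mpiexec -n 4 -bind-to "$policy" /bin/true
    fi
done
```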

If you don’t do any binding, MPI processes will be rescheduled over the course of the run onto different cores and hwthreads (losing cached data) and possibly even onto different NUMA domains. On Linux, memory is allocated with a first-touch policy by default, which means physical memory is allocated on the NUMA domain where your process is running when it first accesses that memory. If your process is later moved to another NUMA domain, it will be accessing its memory across NUMA domains, which has higher latency. So if your application is latency sensitive, you may take a big hit on performance.
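One way to see whether a process is free to migrate is to look at the CPU list the Linux kernel allows it to run on. The sketch below reads the Cpus_allowed_list field from /proc/self/status. The helper name allowed_cpus is hypothetical, and the optional file argument exists only so the parsing can be tried on a saved copy of a status file. An unbound process typically reports every CPU in the machine, while a bound one reports only the hwthreads it was bound to.

```shell
#!/bin/sh
# Print the CPUs a process may run on, taken from a /proc status file.
# With no argument it inspects /proc/self/status (Linux only); the awk
# child inherits the affinity of the shell that starts it.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${1:-/proc/self/status}"
}

echo "allowed CPUs: $(allowed_cpus)"
```

Launching this script as the MPI program, in place of the real application, would print one line per rank with the CPU set that rank is restricted to.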

Hope this helps.

Best,
Giuseppe

On Nov 13, 2019, at 10:39 AM, Douglas Dommermuth via discuss <discuss at mpich.org> wrote:

256
