[mpich-discuss] Binding ranks to CPUs + Hydra-Bug
Jan Bierbaum
jan.bierbaum at tudos.org
Wed Dec 18 11:25:34 CST 2013
Hi!
Can somebody please explain the CPU binding in MPICH in more detail than
'mpiexec -bind-to -help' does? In particular I'd like to know how to use
'-bind-to user' correctly.
What I want to do is manually assign each rank to a CPU. Not all
available CPUs will necessarily be used and there may be more ranks than
CPUs. What I learned from the help text and some trial and error testing
is the following:
- The general syntax I need is '-bind-to user:0,1,2,...', which assigns
CPU0 to rank 0, CPU1 to rank 1 and so on.
- Specifying more CPU ids than ranks does not matter - they are simply
ignored.
- Specifying too few CPU ids will leave the remaining ranks unbound and
thus free to migrate from CPU to CPU.
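The three observations above can be written down as a small model. This is a Python sketch; `observed_binding` is a hypothetical name, and it encodes only what the trial runs showed, not documented MPICH semantics.

```python
from typing import List, Optional


def observed_binding(cpu_ids: List[int], nranks: int) -> List[Optional[int]]:
    """Model of the '-bind-to user:<ids>' behaviour seen in trial runs.

    Hypothetical, not MPICH's documented semantics: rank i is bound to
    cpu_ids[i]; surplus ids are ignored, and ranks beyond the end of the
    list stay unbound (None), i.e. free to migrate between CPUs.
    """
    return [cpu_ids[i] if i < len(cpu_ids) else None for i in range(nranks)]


# Three ids, four ranks: rank 3 is left unbound.
print(observed_binding([0, 1, 2], 4))  # [0, 1, 2, None]
# Five ids, two ranks: the extra ids are simply ignored.
print(observed_binding([0, 1, 2, 3, 4], 2))  # [0, 1]
```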
Now assume I want to run r ranks on a machine with c CPUs and use
only c/2 of them. This works fine as long as r <= c. Otherwise (r > c),
when I specify *exactly* c CPU ids, MPICH seems to "extend" the given
list to the remaining ranks. Is this expected and reliable behavior or
mere coincidence?
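The workaround implied by this observation is to pass a list of exactly c ids that repeats the chosen c/2 CPUs. A Python sketch of building such a list follows; `cycled_user_binding` is a hypothetical helper, and whether Hydra really reuses the list for ranks beyond c is exactly the open question above, so treat the result as untested.

```python
from typing import List


def cycled_user_binding(cpus_to_use: List[int], total_cpus: int) -> str:
    """Build a '-bind-to' value containing exactly total_cpus ids.

    Hypothetical workaround based only on behaviour observed with MPICH 3.0.4:
    a list of exactly c (= total_cpus) ids appeared to be reused for ranks
    beyond c, while a longer list hit a Hydra assertion. The chosen CPUs are
    therefore repeated round-robin until the list has exactly c entries.
    """
    if not cpus_to_use:
        raise ValueError("cpus_to_use must not be empty")
    ids = [cpus_to_use[i % len(cpus_to_use)] for i in range(total_cpus)]
    return "user:" + ",".join(str(cpu) for cpu in ids)


# c = 8 CPUs, of which only the first c/2 = 4 should be used.
print(cycled_user_binding([0, 1, 2, 3], 8))  # user:0,1,2,3,0,1,2,3
```

The string would then be passed as, e.g., 'mpiexec -n 16 -bind-to user:0,1,2,3,0,1,2,3 ./app', keeping the list at c entries so it stays below the length that triggered the assertion.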
The main problem is that as soon as I try to hand '-bind-to' more than c
CPU ids, it triggers the following assertion failure:
> [mpiexec at local] control_cb (/home/user/mpich-3.0.4/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
> [mpiexec at local] HYDT_dmxu_poll_wait_for_event (/home/user/mpich-3.0.4/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at local] HYD_pmci_wait_for_completion (/home/user/mpich-3.0.4/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec at local] main (/home/user/mpich-3.0.4/src/pm/hydra/ui/mpich/mpiexec.c:331): process manager error waiting for completion
As you can see from the output, I'm running MPICH 3.0.4. During the
'configure' step I used '--with-pm=hydra --enable-hydra-procbind' to
enable CPU binding.
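To check whether binding actually took effect in a given run, each process can report its own affinity mask. This is a minimal Python sketch, assuming Linux (`os.sched_getaffinity` is Linux-only) and that Hydra exports each process's rank in the `PMI_RANK` environment variable; the file name check_affinity.py is only illustrative.

```python
#!/usr/bin/env python3
"""check_affinity.py: report which CPUs the calling process may run on.

Assumptions: Linux (os.sched_getaffinity is not available elsewhere) and a
launcher that exports the rank as PMI_RANK, as Hydra is assumed to do.
"""
import os


def describe_affinity() -> str:
    # Rank as exported by the process manager; "?" when run without mpiexec.
    rank = os.environ.get("PMI_RANK", "?")
    # Pid 0 means "this process"; the result is the set of allowed CPU ids.
    cpus = sorted(os.sched_getaffinity(0))
    return f"rank {rank}: cpus {cpus}"


if __name__ == "__main__":
    print(describe_affinity(), flush=True)
```

Run it in place of the application, e.g. 'mpiexec -n 4 -bind-to user:0,1 python3 check_affinity.py'. A bound rank should list a single CPU, while an unbound rank typically lists every CPU it is allowed to use on the machine.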
Regards, Jan