[mpich-devel] mpi process wireup and apache yarn

Ryan Lewis me at ryanlewis.net
Wed Jul 27 00:31:47 CDT 2016


Hi,

I am doing a bit of experimentation with the goal of getting MPI to run on
top of Apache YARN. I know that a few others have written here looking for
help with mpich2-yarn and with the strangely unreleased Hamster project on
the Hadoop JIRA. I'm not interested in those things.

I am writing this note to document my progress so far, and get some
confirmation that what I am doing is considered a "supported" mode of
operation.

For context, within YARN, a Java-based process called a "YARN Application
Master" submits requests for resources to the YARN ResourceManager and
launches "YARN Containers" via its own process launcher. There can be many
AppMasters, and each of them may do different things.

As a proof of concept, I want to have a given Application Master request N
containers and start the individual MPI processes within each of them.

I'm used to, honestly, never dealing with any of this: I show up at some
cluster where SLURM (or whatever) already exists, and all I need to do is
write code, compile it, and 'qsub' it. So all of this is a learning
experience.

Looking at Hydra, it seems that the intention is for Hydra to start
processes, and oddly (and surprisingly to me) it is designed to need to
_ask_ an RM for resources, with different logic for each RM. There is not a
huge amount of documentation here, so I was largely flying blind. I was
expecting that an RM just starts processes on machines, and that wireup
happens via some set of environment variables, shell commands, and perhaps
black magic.

After some googling, and a private discussion with Jeff Hammond, he pointed
me at the -launcher manual flag for mpirun.

By issuing:

   [rlewis@skynet03 build]$ mpirun -np 2 -launcher manual -hosts skynet01,skynet02 a.out

I was able to get these two hydra_pmi_proxy command lines, which, after I
ran them on the two machines, roughly seem to make my MPI program execute
normally:

   HYDRA_LAUNCH: /usr/lib64/mpich/bin/hydra_pmi_proxy --control-port skynet03:58584 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
   HYDRA_LAUNCH: /usr/lib64/mpich/bin/hydra_pmi_proxy --control-port skynet03:58584 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
   HYDRA_LAUNCH_END
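
One straightforward way to run them, assuming --proxy-id 0 belongs on
skynet01 (the first host passed to -hosts) and --proxy-id 1 on skynet02, is
via ssh from skynet03 in two terminals (the proxies stay in the foreground):

   ssh skynet01 '/usr/lib64/mpich/bin/hydra_pmi_proxy --control-port skynet03:58584 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0'
   ssh skynet02 '/usr/lib64/mpich/bin/hydra_pmi_proxy --control-port skynet03:58584 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1'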

However, using MPI_Send, I would then see this occur:

   Fatal error in MPI_Send: A process has failed, error stack:
   MPI_Send(171)..............: MPI_Send(buf=0x7fff2e38b044, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
   MPID_nem_tcp_connpoll(1833): Communication error with rank 1: Connection refused

   =====================================================================

It seems that when I add the option `-disable-hostname-propagation`, the
underlying code works. I'm not exactly sure if this is an accident.
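
In other words, the invocation that appears to work end to end is something
like:

   mpirun -np 2 -launcher manual -disable-hostname-propagation -hosts skynet01,skynet02 a.out

(and then the printed hydra_pmi_proxy lines get run on the two hosts as
before).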

However, assuming that this is all I need, it seems that essentially each
YARN container needs to execute one of these command lines (differing only
in the --proxy-id):

 /usr/lib64/mpich/bin/hydra_pmi_proxy --control-port skynet03:58584 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0

The containers can get these lines from the MPI control process (the mpirun
with -launcher manual) started on the machine that runs the YARN Application
Master, roughly as sketched below.
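
As a sketch of the AppMaster-side script (assuming the HYDRA_LAUNCH lines go
to mpirun's stdout, and using launch.txt / proxy_cmds.txt as placeholder
file names):

   # on the AppMaster host: start mpirun in manual-launch mode and keep it
   # running; it listens on the control port the proxies connect back to
   mpirun -np 2 -launcher manual -disable-hostname-propagation -hosts skynet01,skynet02 a.out > launch.txt &

   # wait until hydra has printed all of the proxy launch lines
   until grep -q '^HYDRA_LAUNCH_END' launch.txt 2>/dev/null; do sleep 1; done

   # one hydra_pmi_proxy command line per container; container i runs the
   # line carrying --proxy-id i
   grep '^HYDRA_LAUNCH:' launch.txt | sed 's/^HYDRA_LAUNCH: //' > proxy_cmds.txt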

And then they will all just work. Is this accurate? Is this a "supported"
mode of operation? It is certainly an extremely easy way to get MPI to run
on top of YARN, with zero changes needed to the MPICH codebase. I'm not sure
how portable this is across MPI implementations, but for now I don't care.

Best,

-rhl

