[mpich-discuss] Hydra torque support issues
Bharath Ramesh
bramesh at vt.edu
Wed Apr 17 12:04:29 CDT 2013
On Thu, Mar 07, 2013 at 04:18:58PM -0600, Pavan Balaji wrote:
>
> This was already present in 1.5, though not enabled by default.
> However, there were some bug fixes in 3.0, so using the latest version
> is the best bet.
>
> -- Pavan
>
> On 03/07/2013 03:57 PM US Central Time, Dave Goodell wrote:
> > On Mar 7, 2013, at 3:51 PM CST, Bharath Ramesh <bramesh at vt.edu> wrote:
> >
> >> I am using mvapich2-1.9a2, which is based on mpich2-1.5. We have
> >> enabled Torque integration with hydra. We are noticing an issue
> >> wherein Torque does not track the resources used by MPI
> >> applications when they are built with mvapich2. Further
> >> investigation revealed that the mvapich2 hydra process launcher
> >> was not setting the session id to the one Torque used when
> >> forking the shell. I am wondering if this is a known issue or
> >> something that has already been fixed. If it has been fixed, what
> >> would be the best way to upgrade just hydra without affecting the
> >> rest of the MPI stack?
> >
> > I can't speak to whether the issue is known or fixed (Pavan will know). But you can install a different version of Hydra from one of the release tarballs or nightly tarballs:
Sorry about the delayed response. I installed mvapich2-1.9b, which
is based on mpich-3.0.2, and I can confirm that the issue still
exists. To illustrate, I am attaching the relevant portions of the
ps axf output, which show how the hydra launcher behaves compared
to OpenMPI. OpenMPI does the correct thing, which allows Torque to
track the resources used.
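
To make the session id point concrete, here is a minimal Python sketch. It is not hydra or Torque code, and the helper name child_session_ids is made up for illustration. It assumes, as the report above suggests, that resource tracking follows the session of the shell that Torque forked. A child that calls setsid() leaves that session, while a child that does not call it stays in it.

```python
import os

def child_session_ids(detach):
    """Fork a child and return (parent_sid, child_sid).

    If detach is True the child calls setsid() first, the way a
    launcher might when it daemonizes its processes (an assumption,
    not something confirmed for hydra in this thread).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        os.close(read_fd)
        if detach:
            os.setsid()  # start a new session; the child becomes its leader
        os.write(write_fd, str(os.getsid(0)).encode())
        os._exit(0)
    # parent: read the session id reported by the child
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_sid = int(pipe.read())
    os.waitpid(pid, 0)
    return os.getsid(0), child_sid

if __name__ == "__main__":
    print("inherited:", child_session_ids(detach=False))
    print("detached: ", child_session_ids(detach=True))
```

In the inherited case both ids match. In the detached case they differ, so anything that accounts for usage by session id would no longer see the child.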
--
Bharath
-------------- next part --------------
mvapich2 1.9b (based on mpich-3.0.2)
====================================
mother superior
---------------
 3451 ?        SLsl  36:40 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
21456 ?        Ss     0:00  \_ -bash
21492 ?        S      0:00  |   \_ pbs_demux
21518 ?        S      0:00  |   \_ /bin/bash /opt/torque/4.1.5.1/spool/mom_priv/jobs/10861.master.cluster.SC
21537 ?        S      0:00  |       \_ mpiexec -np 24 ./hello_world_mvapich2
21538 ?        Ss     0:00  \_ /opt/apps/gcc4_5/mvapich2/1.9b/bin/hydra_pmi_proxy --control-port hs162:54560 --rmk pbs --launcher pb
21539 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21540 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21541 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21542 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21543 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21544 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21545 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21546 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21547 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21548 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21549 ?        SLsl   0:00      \_ ./hello_world_mvapich2
21550 ?        SLsl   0:00      \_ ./hello_world_mvapich2
sister mom
----------
 3425 ?        SLsl  29:08 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
26585 ?        Ss     0:00  \_ /opt/apps/gcc4_5/mvapich2/1.9b/bin/hydra_pmi_proxy --control-port hs162:54560 --rmk pbs --launcher pb
26594 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26595 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26596 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26597 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26598 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26599 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26600 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26601 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26602 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26603 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26604 ?        SLsl   0:00      \_ ./hello_world_mvapich2
26605 ?        SLsl   0:00      \_ ./hello_world_mvapich2
openmpi-1.6.4
=============
mother superior
---------------
 3451 ?        SLsl  36:40 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
22070 ?        Ss     0:00  \_ -bash
22106 ?        S      0:00      \_ pbs_demux
22132 ?        S      0:00      \_ /bin/bash /opt/torque/4.1.5.1/spool/mom_priv/jobs/10862.master.cluster.SC
22151 ?        S      0:00          \_ mpiexec -np 24 ./hello_world_ompi
22152 ?        SLl    0:00              \_ ./hello_world_ompi
22153 ?        SLl    0:00              \_ ./hello_world_ompi
22154 ?        SLl    0:00              \_ ./hello_world_ompi
22155 ?        SLl    0:00              \_ ./hello_world_ompi
22156 ?        SLl    0:00              \_ ./hello_world_ompi
22157 ?        SLl    0:00              \_ ./hello_world_ompi
22158 ?        SLl    0:00              \_ ./hello_world_ompi
22159 ?        SLl    0:00              \_ ./hello_world_ompi
22160 ?        SLl    0:00              \_ ./hello_world_ompi
22161 ?        SLl    0:00              \_ ./hello_world_ompi
22162 ?        SLl    0:00              \_ ./hello_world_ompi
22163 ?        SLl    0:00              \_ ./hello_world_ompi
sister mom
----------
 3425 ?        SLsl  29:09 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
27161 ?        Ss     0:00  \_ orted -mca ess tm -mca orte_ess_jobid 3506896896 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp
27162 ?        SLl    0:00      \_ ./hello_world_ompi
27163 ?        SLl    0:00      \_ ./hello_world_ompi
27164 ?        SLl    0:00      \_ ./hello_world_ompi
27165 ?        SLl    0:00      \_ ./hello_world_ompi
27166 ?        SLl    0:00      \_ ./hello_world_ompi
27167 ?        SLl    0:00      \_ ./hello_world_ompi
27168 ?        SLl    0:00      \_ ./hello_world_ompi
27169 ?        SLl    0:00      \_ ./hello_world_ompi
27170 ?        SLl    0:00      \_ ./hello_world_ompi
27171 ?        SLl    0:00      \_ ./hello_world_ompi
27172 ?        SLl    0:00      \_ ./hello_world_ompi
27173 ?        SLl    0:00      \_ ./hello_world_ompi
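
One way to read the listings above: in the STAT column of ps, a lowercase "s" marks a session leader. Every hello_world_mvapich2 rank shows SLsl, so each one leads its own session, while the hello_world_ompi ranks show SLl and stay in their parent's session. The Python sketch below pulls that flag out of ps axf lines. The function name session_leaders is hypothetical, and the sample lines are copied from the listings with the whitespace collapsed.

```python
def session_leaders(ps_lines):
    """Return {pid: is_session_leader} for lines of `ps axf` output.

    Assumes the column order PID, TTY, STAT, TIME, COMMAND used in the
    listings above. Lines that do not start with a PID are skipped.
    """
    result = {}
    for line in ps_lines:
        fields = line.split(None, 4)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip headings, separators, and wrapped lines
        pid, stat = int(fields[0]), fields[2]
        # Lowercase "s" is the session-leader flag; uppercase "S" is the
        # unrelated "interruptible sleep" process state.
        result[pid] = "s" in stat
    return result

sample = [
    "21539 ? SLsl 0:00 \\_ ./hello_world_mvapich2",
    "22152 ? SLl 0:00 \\_ ./hello_world_ompi",
]

if __name__ == "__main__":
    print(session_leaders(sample))  # {21539: True, 22152: False}
```

Adding the sess column (for example, ps axf -o pid,sess,stat,cmd) shows the session ids directly, which would confirm or refute this reading.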