[mpich-discuss] Hydra torque support issues

Bharath Ramesh bramesh at vt.edu
Wed Apr 17 12:04:29 CDT 2013


On Thu, Mar 07, 2013 at 04:18:58PM -0600, Pavan Balaji wrote:
> 
> This was already present in 1.5, though not enabled by default.
> However, there were some bug fixes in 3.0, so using the latest version
> is the best bet.
> 
>  -- Pavan
> 
> On 03/07/2013 03:57 PM US Central Time, Dave Goodell wrote:
> > On Mar 7, 2013, at 3:51 PM CST, Bharath Ramesh <bramesh at vt.edu> wrote:
> > 
> >> I am using mvapich2-1.9a2, which is based on mpich2-1.5. We have
> >> enabled Torque integration with hydra. We are noticing an issue
> >> wherein Torque does not track the resources used by MPI
> >> applications when they are built with mvapich2. Further
> >> investigation revealed that the hydra process launcher shipped
> >> with mvapich2 was not setting its session id to the one Torque
> >> used when forking the shell. I am wondering whether this is a
> >> known issue or something that has already been fixed. If it has
> >> been fixed, what would be the best way to upgrade just hydra
> >> without affecting the rest of the MPI stack?
> > 
> > I can't speak to whether the issue is known or fixed (Pavan will know).  But you can install a different version of Hydra from one of the release tarballs or nightly tarballs:

Sorry about the delayed response. I installed mvapich2-1.9b, which
is based on mpich-3.0.2, and the issue still exists. To illustrate
it, I am attaching the relevant portions of ps axf output, which
show how the hydra launcher behaves compared to OpenMPI. OpenMPI
does the correct thing, which allows Torque to track the resources
used.
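For context, here is a minimal sketch in C of the session-id
mechanics involved. This is not Hydra's actual code; it only
illustrates why a process that calls setsid() escapes session-based
accounting: once the forked child creates a new session, its sid no
longer matches the session that pbs_mom set up for the job.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* The parent stands in for the session pbs_mom forked for the job. */
    printf("parent: pid=%d sid=%d\n", (int)getpid(), (int)getsid(0));

    pid_t child = fork();
    if (child == 0) {
        /* The child starts out in the parent's session... */
        printf("child before setsid: pid=%d sid=%d\n",
               (int)getpid(), (int)getsid(0));

        /* ...but after setsid() it leads a brand-new session, which a
         * session-id-based resource tracker will not associate with
         * the original job session. */
        if (setsid() == (pid_t)-1)
            perror("setsid");
        printf("child after setsid:  pid=%d sid=%d\n",
               (int)getpid(), (int)getsid(0));
        _exit(0);
    }

    waitpid(child, NULL, 0);
    return 0;
}

In the ps output below, a lowercase "s" in the STAT column marks a
session leader, which is one way to spot where a new session begins.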

-- 
Bharath
-------------- next part --------------
mvapich2-1.9b (based on mpich-3.0.2)
====================================
mother superior
---------------

 3451 ?        SLsl  36:40 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
 21456 ?        Ss     0:00  \_ -bash
 21492 ?        S      0:00  |   \_ pbs_demux
 21518 ?        S      0:00  |   \_ /bin/bash /opt/torque/4.1.5.1/spool/mom_priv/jobs/10861.master.cluster.SC
 21537 ?        S      0:00  |       \_ mpiexec -np 24 ./hello_world_mvapich2
 21538 ?        Ss     0:00  \_ /opt/apps/gcc4_5/mvapich2/1.9b/bin/hydra_pmi_proxy --control-port hs162:54560 --rmk pbs --launcher pb
 21539 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21540 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21541 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21542 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21543 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21544 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21545 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21546 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21547 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21548 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21549 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 21550 ?        SLsl   0:00      \_ ./hello_world_mvapich2


sister mom
----------

 3425 ?        SLsl  29:08 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
 26585 ?        Ss     0:00  \_ /opt/apps/gcc4_5/mvapich2/1.9b/bin/hydra_pmi_proxy --control-port hs162:54560 --rmk pbs --launcher pb
 26594 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26595 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26596 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26597 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26598 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26599 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26600 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26601 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26602 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26603 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26604 ?        SLsl   0:00      \_ ./hello_world_mvapich2
 26605 ?        SLsl   0:00      \_ ./hello_world_mvapich2


openmpi-1.6.4
=============
mother superior
---------------

 3451 ?        SLsl  36:40 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
 22070 ?        Ss     0:00  \_ -bash
 22106 ?        S      0:00      \_ pbs_demux
 22132 ?        S      0:00      \_ /bin/bash /opt/torque/4.1.5.1/spool/mom_priv/jobs/10862.master.cluster.SC
 22151 ?        S      0:00          \_ mpiexec -np 24 ./hello_world_ompi
 22152 ?        SLl    0:00              \_ ./hello_world_ompi
 22153 ?        SLl    0:00              \_ ./hello_world_ompi
 22154 ?        SLl    0:00              \_ ./hello_world_ompi
 22155 ?        SLl    0:00              \_ ./hello_world_ompi
 22156 ?        SLl    0:00              \_ ./hello_world_ompi
 22157 ?        SLl    0:00              \_ ./hello_world_ompi
 22158 ?        SLl    0:00              \_ ./hello_world_ompi
 22159 ?        SLl    0:00              \_ ./hello_world_ompi
 22160 ?        SLl    0:00              \_ ./hello_world_ompi
 22161 ?        SLl    0:00              \_ ./hello_world_ompi
 22162 ?        SLl    0:00              \_ ./hello_world_ompi
 22163 ?        SLl    0:00              \_ ./hello_world_ompi


sister mom
----------

 3425 ?        SLsl  29:09 /opt/torque/4.1.5.1/sbin/pbs_mom -q -d /opt/torque/4.1.5.1/spool
 27161 ?        Ss     0:00  \_ orted -mca ess tm -mca orte_ess_jobid 3506896896 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp
 27162 ?        SLl    0:00      \_ ./hello_world_ompi
 27163 ?        SLl    0:00      \_ ./hello_world_ompi
 27164 ?        SLl    0:00      \_ ./hello_world_ompi
 27165 ?        SLl    0:00      \_ ./hello_world_ompi
 27166 ?        SLl    0:00      \_ ./hello_world_ompi
 27167 ?        SLl    0:00      \_ ./hello_world_ompi
 27168 ?        SLl    0:00      \_ ./hello_world_ompi
 27169 ?        SLl    0:00      \_ ./hello_world_ompi
 27170 ?        SLl    0:00      \_ ./hello_world_ompi
 27171 ?        SLl    0:00      \_ ./hello_world_ompi
 27172 ?        SLl    0:00      \_ ./hello_world_ompi
 27173 ?        SLl    0:00      \_ ./hello_world_ompi