[mpich-discuss] mpich-3.3.1 cannot launch hydra_pmi_proxy in SGE

Reuti reuti at staff.uni-marburg.de
Mon Jun 24 12:07:09 CDT 2019


Hi,

> On 14.06.2019 at 02:59, Shuwei Zhao via discuss <discuss at mpich.org> wrote:
> 
> Hi mpich team,
> 
>  
> 
> My team ran into an issue with mpich-3.2.1, as Xiaopeng mentioned before. As you replied, the bug was fixed in mpich-3.3.1.
> 
> However, I have now installed mpich-3.3.1, finished the integration and launched a test.
> 
> A single-machine run is fine, but a cross-machine run fails only in the SGE environment; it works well in LSF.

For me it still works fine. We have no SSH in the cluster, and everything is started solely by `qrsh -inherit …` on the slave nodes.

Master of the parallel job:

11870 ?        S      0:00  \_ sge_shepherd-308116 -bg
11933 ?        SNs    0:00      \_ /bin/sh /var/spool/sge/node26/job_scripts/308116
11934 ?        SN     0:00          \_ mpiexec ./mpihello
11935 ?        SNs    0:00              \_ /home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
11942 ?        RNs    1:43              |   \_ ./mpihello
11943 ?        RNs    1:43              |   \_ ./mpihello
11936 ?        SNsl   0:00              \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node28 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
11937 ?        SNsl   0:00              \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node27 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
11938 ?        SNsl   0:00              \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node25 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 3
11939 ?        SNsl   0:00              \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node24 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 4
11940 ?        SNsl   0:00              \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node29 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 5
11941 ?        SNsl   0:00              \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node23 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 6
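
To check that every slave node got its own proxy, the `qrsh -inherit` lines of such a listing can be reduced to a host-to-proxy-id map. This is only a minimal sketch: the function name `remote_proxies` is made up for illustration, and the sample lines are abbreviated copies of the listing above (tree markers and some options dropped).

```python
import re

# Matches the remote host and the proxy id in a `qrsh -inherit -V <host> ...` line.
QRSH = re.compile(r"qrsh -inherit -V (\S+) .*--proxy-id (\d+)")


def remote_proxies(listing):
    """Map each remote host to the proxy id that qrsh was asked to start there."""
    result = {}
    for line in listing.splitlines():
        match = QRSH.search(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


# Abbreviated sample lines (assumption: same layout as the listing above).
listing = """\
11934 ? SN   0:00 mpiexec ./mpihello
11936 ? SNsl 0:00 /usr/sge/bin/lx24-em64t/qrsh -inherit -V node28 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --proxy-id 1
11941 ? SNsl 0:00 /usr/sge/bin/lx24-em64t/qrsh -inherit -V node23 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --proxy-id 6
"""

print(remote_proxies(listing))
```

A host that is part of the job but missing from the map would point to a proxy that was never launched on it.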

One of the slaves:

 3904 ?        Sl   958:10 /usr/sge/bin/lx24-em64t/sge_execd
 4012 ?        S     14:55  \_ /bin/sh /usr/sge/cluster/tmpspace.sh
17401 ?        Sl     0:00  \_ sge_shepherd-308116 -bg
17402 ?        SNs    0:00      \_ /usr/sge/utilbin/lx24-em64t/qrsh_starter /var/spool/sge/node23/active_jobs/308116.1/1.node23
17409 ?        SN     0:00          \_ /home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 6
17410 ?        RNs    0:38              \_ ./mpihello
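
To compare this with the `ps -eo user,pid,ppid,cmd` output quoted below, the parent chain of a process can be rebuilt from the pid and ppid columns. This is a minimal sketch that assumes exactly that four-column format; the function names are made up for illustration, and the `/path/to/` prefixes in the sample stand in for the long paths of the quoted listing.

```python
def parse_ps(text):
    """Parse `ps -eo user,pid,ppid,cmd` lines into {pid: (ppid, cmd)}."""
    procs = {}
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not parts[1].isdigit() or not parts[2].isdigit():
            continue  # skip the header line and anything that is not a process row
        _user, pid, ppid, cmd = parts
        procs[int(pid)] = (int(ppid), cmd)
    return procs


def ancestry(procs, pid):
    """Return command names from the oldest known ancestor down to `pid`."""
    chain = []
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)  # guard against a malformed listing with a cycle
        ppid, cmd = procs[pid]
        # Keep only the executable name, without directory or arguments.
        chain.append(cmd.split()[0].rsplit("/", 1)[-1])
        pid = ppid
    return list(reversed(chain))


# Sample rows with placeholder paths (assumption: pids taken from the quoted listing).
sample = """\
USER       PID  PPID CMD
sgeadmin  3938     1 /remote/sge3/default/bin/lx-amd64/sge_execd
sgeadmin  2288  3938 sge_shepherd-768144 -bg
szhao     2368  2288 /path/to/mpiexec -n 1 /path/to/finesim
szhao     2372  2368 /path/to/hydra_pmi_proxy --proxy-id 0
szhao     2374  2372 /path/to/finesim
"""

print(" -> ".join(ancestry(parse_ps(sample), 2374)))
```

For the sample rows this prints `sge_execd -> sge_shepherd-768144 -> mpiexec -> hydra_pmi_proxy -> finesim`. A chain that ends in a defunct `qrsh` instead of `hydra_pmi_proxy` shows where the launch stopped.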

-- Reuti


> Relevant processes, grepped out using “ps -eo user,pid,ppid,cmd”, first for mpich-3.2.1 and then for mpich-3.3.1:
> 
> sgeadmin  2288  3938 sge_shepherd-768144 -bg
> 
> sgeadmin  2317  3938 sge_shepherd-768145 -bg
> 
> szhao     2367  2317 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/mpiexec -n 1 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/bin/finesim my_command_line_argument
> 
> szhao     2368  2288 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/mpiexec -n 1 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/bin/finesim my_command_line_arguments
> 
> szhao     2372  2368 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/hydra_pmi_proxy --control-port broad1029.internal.synopsys.com:39346 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
> 
> szhao     2373  2367 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/hydra_pmi_proxy --control-port broad1029.internal.synopsys.com:44694 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
> 
> szhao     2374  2372 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/platform/linux64/finesim my_command_line_arguments
> 
> szhao     2375  2373 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/platform/linux64/finesim my_command_line_arguments
> 
> sgeadmin  3938     1 /remote/sge3/default/bin/lx-amd64/sge_execd
> 
> sgeadmin 11119  3158 sge_shepherd-801927 -bg
> 
> sgeadmin 11151  3158 sge_shepherd-801928 -bg
> 
> szhao    11149 11119 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/finesim/bin/mpich/mpiexec -n 1 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/bin/finesim my_command_line_argument
> 
> szhao    11180 11151 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/finesim/bin/mpich/mpiexec -n 1 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/bin/finesim -o xxx -new_flow input.sp -DEBUG -ns 1 -dpmt_flow -dscale xxx.dynamic -pmc
> 
> szhao    11150 11149 [qrsh] <defunct>
> 
> szhao    11181 11180 [qrsh] <defunct>
> 
> sgeadmin  3158     1 /remote/sge3/default/bin/lx-amd64/sge_execd
> 
> Process spawning sequence
> 
> With mpich-3.2.1, the processes are spawned like this:
> 
> sge_execd -> sge_shepherd -> mpiexec -> hydra_pmi_proxy -> my_process
> 
> With mpich-3.3.1, the processes are spawned like this:
> 
> sge_execd -> sge_shepherd -> mpiexec -X-> hydra_pmi_proxy -> my_process
> 
>  
> 
> 
> Tracking the pid and ppid, it looks like hydra_pmi_proxy cannot be spawned when running mpich-3.3 in the SGE environment. Could you please help confirm whether this is a potential bug, or whether there is something I didn’t take care of?
> 
>  
> 
> Thanks,
> 
> Shuwei
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss


