[mpich-discuss] mpich-3.3.1 cannot launch hydra_pmi_proxy in SGE
Reuti
reuti at staff.uni-marburg.de
Mon Jun 24 12:07:09 CDT 2019
Hi,
> Am 14.06.2019 um 02:59 schrieb Shuwei Zhao via discuss <discuss at mpich.org>:
>
> Hi mpich team,
>
>
>
> My team hit an issue with mpich-3.2.1, as Xiaopeng mentioned before. As you replied, the bug was fixed in mpich-3.3.1.
>
> However, I installed mpich-3.3.1, finished the integration and ran the tests.
>
> A single machine is fine, but cross-machine runs fail, and only in the SGE environment; under LSF they work well.
For me it still works fine. We have no SSH in the cluster and everything is started solely by `qrsh -inherit …` on the slave nodes.
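For reference, the submission here is nothing special; a stripped-down version looks like the following (the PE name "mpich" and the slot count are placeholders for whatever parallel environment you configured):

$ cat job.sh
#!/bin/sh
# No -n given: in this setup Hydra takes the granted hosts and slot counts
# from SGE (via $PE_HOSTFILE) and starts one rank per slot.
mpiexec ./mpihello

$ qsub -pe mpich 16 job.sh

Adding -verbose to the mpiexec line prints, among a lot of other output, the launch commands Hydra issues for the proxies, so you can check whether it really picked the sge launcher (as it did in the process lists below).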
Master of the parallel job:
11870 ? S 0:00 \_ sge_shepherd-308116 -bg
11933 ? SNs 0:00 \_ /bin/sh /var/spool/sge/node26/job_scripts/308116
11934 ? SN 0:00 \_ mpiexec ./mpihello
11935 ? SNs 0:00 \_ /home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
11942 ? RNs 1:43 | \_ ./mpihello
11943 ? RNs 1:43 | \_ ./mpihello
11936 ? SNsl 0:00 \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node28 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
11937 ? SNsl 0:00 \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node27 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
11938 ? SNsl 0:00 \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node25 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 3
11939 ? SNsl 0:00 \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node24 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 4
11940 ? SNsl 0:00 \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node29 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 5
11941 ? SNsl 0:00 \_ /usr/sge/bin/lx24-em64t/qrsh -inherit -V node23 "/home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy" --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 6
One of the slaves:
3904 ? Sl 958:10 /usr/sge/bin/lx24-em64t/sge_execd
4012 ? S 14:55 \_ /bin/sh /usr/sge/cluster/tmpspace.sh
17401 ? Sl 0:00 \_ sge_shepherd-308116 -bg
17402 ? SNs 0:00 \_ /usr/sge/utilbin/lx24-em64t/qrsh_starter /var/spool/sge/node23/active_jobs/308116.1/1.node23
17409 ? SN 0:00 \_ /home/reuti/local/mpich-3.3.1_gcc-6.5.0/bin/hydra_pmi_proxy --control-port node26:40054 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 6
17410 ? RNs 0:38 \_ ./mpihello
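On the SGE side the main prerequisite for this kind of startup is a tightly integrated parallel environment, i.e. one with control_slaves set to TRUE so that `qrsh -inherit` is allowed to start the proxies on the slave nodes. A minimal PE definition looks roughly like this (PE name, slot count and allocation rule are placeholders; the decisive line is control_slaves TRUE):

$ qconf -sp mpich
pe_name            mpich
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

With control_slaves FALSE the execution daemons will refuse the `qrsh -inherit` calls and hydra_pmi_proxy never shows up on the remote nodes. The mechanism can also be tested by hand from inside a job, independent of MPICH:

cat $PE_HOSTFILE                  # hosts and slot counts granted by SGE
qrsh -inherit node28 hostname     # the same kind of call Hydra wraps around hydra_pmi_proxy

(node28 is just one of the slave hostnames from the job above.)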
-- Reuti
> Relevant processes grepped out using “ps -eo user,pid,ppid,cmd”:
>
> mpich-3.2.1:
>
> sgeadmin 2288 3938 sge_shepherd-768144 -bg
>
> sgeadmin 2317 3938 sge_shepherd-768145 -bg
>
> szhao 2367 2317 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/mpiexec -n 1 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/bin/finesim my_command_line_argument
>
> szhao 2368 2288 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/mpiexec -n 1 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/bin/finesim my_command_line_arguments
>
> szhao 2372 2368 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/hydra_pmi_proxy --control-port broad1029.internal.synopsys.com:39346 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
>
> szhao 2373 2367 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/bin/mpich/hydra_pmi_proxy --control-port broad1029.internal.synopsys.com:44694 --rmk sge --launcher sge --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
>
> szhao 2374 2372 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/platform/linux64/finesim my_command_line_arguments
>
> szhao 2375 2373 /remote/swefs1/PE/products/cktsim/p2019.06_rel/image/nightly/finesim_optimize/D20190611_5647184/Testing/finesim/platform/linux64/finesim my_command_line_arguments
>
> sgeadmin 3938 1 /remote/sge3/default/bin/lx-amd64/sge_execd
>
> mpich-3.3.1:
>
> sgeadmin 11119 3158 sge_shepherd-801927 -bg
>
> sgeadmin 11151 3158 sge_shepherd-801928 -bg
>
> szhao 11149 11119 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/finesim/bin/mpich/mpiexec -n 1 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/bin/finesim my_command_line_argument
>
> szhao 11180 11151 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/finesim/bin/mpich/mpiexec -n 1 /slowfs/finesim6/users/szhao/src/finesim/cktsim_dev/snps/finesim/optimize/bin/finesim -o xxx -new_flow input.sp -DEBUG -ns 1 -dpmt_flow -dscale xxx.dynamic -pmc
>
> szhao 11150 11149 [qrsh] <defunct>
>
> szhao 11181 11180 [qrsh] <defunct>
>
> sgeadmin 3158 1 /remote/sge3/default/bin/lx-amd64/sge_execd
>
> Process spawning procedure
>
> With mpich-3.2.1, we can see from the above that processes are spawned like this:
>
> sge_execd -> sge_shepherd -> mpiexec -> hydra_pmi_proxy -> my_process
>
>
>
> With mpich-3.3.1, processes are spawned like this, where the step marked -X-> fails:
>
> sge_execd -> sge_shepherd -> mpiexec -X-> hydra_pmi_proxy -> my_process
>
>
>
>
> If you track the pid and ppid, it looks like hydra_pmi_proxy cannot be spawned when running with mpich-3.3.1 in the SGE environment. Could you please help confirm whether this is a potential bug or whether there is something I didn't take care of?
>
>
>
> Thanks,
>
> Shuwei
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss