[mpich-discuss] Problems in SGE submission when upgrade from 3.2.1 to 3.3.2
Shuwei Zhao
shuweizhao1991 at gmail.com
Thu May 28 21:40:08 CDT 2020
Hi,

I was trying to upgrade our mpich version from 3.2.1 to 3.3.2 to pick up
the latest stable release, but runs with the new version fail.

We have two parallel environments (PEs), mt and dp, configured as follows:

pe_name               mt           dp
slots                 2000000      10000
used_slots            1181         0
bound_slots           0            0
user_lists            NONE         NONE
xuser_lists           NONE         NONE
start_proc_args       /bin/true    /bin/true
stop_proc_args        /bin/true    /bin/true
per_pe_task_prolog    NONE         NONE
per_pe_task_epilog    NONE         NONE
allocation_rule       $pe_slots    $round_robin
control_slaves        FALSE        TRUE
job_is_first_task     FALSE        FALSE
urgency_slots         min          min
accounting_summary    FALSE        FALSE
daemon_forks_slaves   FALSE        FALSE
master_forks_slaves   TRUE         FALSE
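
To make the differences between the two PEs easier to see, here is a minimal Python sketch that compares them. It assumes the settings shown above are copied in as plain "key value" text; parse_pe and diff_pe are hypothetical helper names for illustration, not part of SGE or MPICH.

```python
# PE settings copied from the listings above ("key value" per line).
MT_PE = """\
pe_name mt
slots 2000000
used_slots 1181
bound_slots 0
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
per_pe_task_prolog NONE
per_pe_task_epilog NONE
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
daemon_forks_slaves FALSE
master_forks_slaves TRUE
"""

DP_PE = """\
pe_name dp
slots 10000
used_slots 0
bound_slots 0
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
per_pe_task_prolog NONE
per_pe_task_epilog NONE
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
daemon_forks_slaves FALSE
master_forks_slaves FALSE
"""


def parse_pe(text):
    """Parse 'key value' lines into a dict (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only; the rest of the line is the value.
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings


def diff_pe(a, b):
    """Return {key: (a_value, b_value)} for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k, "<unset>"), b.get(k, "<unset>"))
        for k in keys
        if a.get(k) != b.get(k)
    }


mt = parse_pe(MT_PE)
dp = parse_pe(DP_PE)
differences = diff_pe(mt, dp)

# Print one line per differing setting.
for key, (mt_value, dp_value) in differences.items():
    print(f"{key:22} mt={mt_value:12} dp={dp_value}")
```

Apart from the name and the slot counts, the two PEs differ only in allocation_rule, control_slaves and master_forks_slaves.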

Running command:

We use qsub to submit the workers, and the master-worker connection is
established using MPI_COMM_ACCEPT and MPI_COMM_CONNECT.

qsub -P bnormal -pe mt 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments
qsub -P bnormal -pe dp 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments

Running result:

               mt      dp
Mpich 3.2.1    Pass    Pass
Mpich 3.3.2    Fail    Fail

Error with mt on Mpich 3.3.2:
[proxy:0:0@host1] HYDU_sock_write (utils/sock/sock.c:289): write error (Bad file descriptor)
2 [proxy:0:0@host1] main (pm/pmiserv/pmip.c:189): unable to send the proxyID to the server

Error with dp on Mpich 3.3.2:
error: executing task of job 490150 failed: execution daemon on host "host1" didn't accept task

Could you please explain why the run fails in mpich-3.3, and whether there
is any solution we can use to get it to pass with mpich-3.3.2?

I would appreciate any help.

Thanks