[mpich-discuss] Problems in SGE submission when upgrade from 3.2.1 to 3.3.2
Shuwei Zhao
shuweizhao1991 at gmail.com
Thu May 28 22:11:34 CDT 2020
Corrected error messages are below:

             | mt        | dp
-------------+-----------+----------
MPICH 3.2.1  | pass      | pass
MPICH 3.3.2  | fail (a)  | fail (b)

(a) error: executing task of job 490150 failed: execution daemon on host "host1" didn't accept task
(b) [proxy:0:0@host1] HYDU_sock_write (utils/sock/sock.c:289): write error (Bad file descriptor)
    [proxy:0:0@host1] main (pm/pmiserv/pmip.c:189): unable to send the proxyID to the server
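
For comparison, the two PE definitions quoted below differ in only a few fields. A minimal Python sketch that lists them, assuming the settings are pasted in as plain text (the names MT_CONFIG, DP_CONFIG, parse_pe and diff_pe are illustrative and not part of SGE or MPICH):

```python
# Illustrative only: the two PE definitions from the quoted message, as text.
MT_CONFIG = """
pe_name mt
slots 2000000
used_slots 1181
bound_slots 0
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
per_pe_task_prolog NONE
per_pe_task_epilog NONE
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
daemon_forks_slaves FALSE
master_forks_slaves TRUE
"""

DP_CONFIG = """
pe_name dp
slots 10000
used_slots 0
bound_slots 0
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
per_pe_task_prolog NONE
per_pe_task_epilog NONE
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
daemon_forks_slaves FALSE
master_forks_slaves FALSE
"""


def parse_pe(text):
    """Parse 'key value' lines into a dict of setting name -> value."""
    settings = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        settings[key] = value.strip()
    return settings


def diff_pe(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


mt = parse_pe(MT_CONFIG)
dp = parse_pe(DP_CONFIG)

# Print one line per setting that differs between the two PEs.
for key, (mt_val, dp_val) in diff_pe(mt, dp).items():
    print(f"{key}: mt={mt_val} dp={dp_val}")
```

Apart from the name and the slot counts, the settings that differ are allocation_rule, control_slaves and master_forks_slaves.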
On Thu, May 28, 2020 at 9:40 PM Shuwei Zhao <shuweizhao1991 at gmail.com>
wrote:
> Hi,
>
> I was trying to upgrade our MPICH version from 3.2.1 to 3.3.2 to pick up
> the latest stable release; however, runs with the new version fail.
>
> We have two parallel environment (PE) configurations, mt and dp, as shown
> below:
>
> pe_name             mt
> slots               2000000
> used_slots          1181
> bound_slots         0
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /bin/true
> stop_proc_args      /bin/true
> per_pe_task_prolog  NONE
> per_pe_task_epilog  NONE
> allocation_rule     $pe_slots
> control_slaves      FALSE
> job_is_first_task   FALSE
> urgency_slots       min
> accounting_summary  FALSE
> daemon_forks_slaves FALSE
> master_forks_slaves TRUE
>
> pe_name             dp
> slots               10000
> used_slots          0
> bound_slots         0
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /bin/true
> stop_proc_args      /bin/true
> per_pe_task_prolog  NONE
> per_pe_task_epilog  NONE
> allocation_rule     $round_robin
> control_slaves      TRUE
> job_is_first_task   FALSE
> urgency_slots       min
> accounting_summary  FALSE
> daemon_forks_slaves FALSE
> master_forks_slaves FALSE
>
> Run commands:
>
> We use qsub to submit the workers; the master-worker connection is then
> established using MPI_COMM_ACCEPT and MPI_COMM_CONNECT.
>
> qsub -P bnormal -pe mt 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments
>
> qsub -P bnormal -pe dp 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments
>
> Results:
>
>              | mt        | dp
> -------------+-----------+----------
> MPICH 3.2.1  | pass      | pass
> MPICH 3.3.2  | fail (a)  | fail (b)
>
> (a) [proxy:0:0@host1] HYDU_sock_write (utils/sock/sock.c:289): write error (Bad file descriptor)
>     [proxy:0:0@host1] main (pm/pmiserv/pmip.c:189): unable to send the proxyID to the server
> (b) error: executing task of job 490150 failed: execution daemon on host "host1" didn't accept task
>
> Could you please explain why the runs fail with mpich-3.3, and whether
> there is a solution we can use to make them pass with mpich-3.3.2?
>
> I would appreciate any help.
>
> Thanks