[mpich-discuss] Problems in SGE submission when upgrading from 3.2.1 to 3.3.2

Shuwei Zhao shuweizhao1991 at gmail.com
Thu May 28 22:11:34 CDT 2020


The error messages in my previous mail were attached to the wrong columns. The corrected results are below:

Version        mt      dp
MPICH 3.2.1    pass    pass
MPICH 3.3.2    fail    fail

mt failure with MPICH 3.3.2:

error: executing task of job 490150 failed: execution daemon on host "host1" didn't accept task

dp failure with MPICH 3.3.2:

[proxy:0:0@host1] HYDU_sock_write (utils/sock/sock.c:289): write error (Bad file descriptor)
[proxy:0:0@host1] main (pm/pmiserv/pmip.c:189): unable to send the proxyID to the server
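
Since the two parallel environments fail in different ways, it may help to see exactly which settings differ between them. The following is a minimal sketch, not a fix: `diff_pe` is a hypothetical helper name, and the two definitions are copied from the listing quoted below so that the script runs without a scheduler. On a cluster the same text would presumably come from `qconf -sp mt` and `qconf -sp dp`.

```shell
#!/bin/sh
# Compare two SGE parallel-environment definitions and print the settings
# whose values differ. The values below are copied from this thread; on a
# live cluster they could be captured with `qconf -sp mt` and `qconf -sp dp`.

pe_mt='pe_name                mt
slots                  2000000
used_slots             1181
bound_slots            0
user_lists             NONE
xuser_lists            NONE
start_proc_args        /bin/true
stop_proc_args         /bin/true
per_pe_task_prolog     NONE
per_pe_task_epilog     NONE
allocation_rule        $pe_slots
control_slaves         FALSE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     FALSE
daemon_forks_slaves    FALSE
master_forks_slaves    TRUE'

pe_dp='pe_name                dp
slots                  10000
used_slots             0
bound_slots            0
user_lists             NONE
xuser_lists            NONE
start_proc_args        /bin/true
stop_proc_args         /bin/true
per_pe_task_prolog     NONE
per_pe_task_epilog     NONE
allocation_rule        $round_robin
control_slaves         TRUE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     FALSE
daemon_forks_slaves    FALSE
master_forks_slaves    FALSE'

# diff_pe (hypothetical helper): $1 and $2 hold the text of two PE
# definitions, one "key value" pair per line. Prints key, first value,
# second value for every key whose values differ.
diff_pe() {
    printf '%s\n---\n%s\n' "$1" "$2" | awk '
        $0 == "---" { second = 1; next }      # separator between definitions
        NF < 2      { next }                  # skip blank or malformed lines
        !second     { first[$1] = $2; next }  # remember values of definition 1
        ($1 in first) && first[$1] != $2 {
            printf "%-22s %-14s %s\n", $1, first[$1], $2
        }
    '
}

diff_pe "$pe_mt" "$pe_dp"
```

For the two definitions above, control_slaves, allocation_rule and master_forks_slaves are among the rows printed. Whether any of these differences accounts for the failures is not established in this thread.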

On Thu, May 28, 2020 at 9:40 PM Shuwei Zhao <shuweizhao1991 at gmail.com>
wrote:

> Hi,
>
> I was trying to upgrade our MPICH version from 3.2.1 to 3.3.2 to pick up
> the latest stable release; however, our runs fail with the new version.
>
> We have two parallel environment configurations, mt and dp:
>
> mt:
>
> pe_name                mt
> slots                  2000000
> used_slots             1181
> bound_slots            0
> user_lists             NONE
> xuser_lists            NONE
> start_proc_args        /bin/true
> stop_proc_args         /bin/true
> per_pe_task_prolog     NONE
> per_pe_task_epilog     NONE
> allocation_rule        $pe_slots
> control_slaves         FALSE
> job_is_first_task      FALSE
> urgency_slots          min
> accounting_summary     FALSE
> daemon_forks_slaves    FALSE
> master_forks_slaves    TRUE
>
> dp:
>
> pe_name                dp
> slots                  10000
> used_slots             0
> bound_slots            0
> user_lists             NONE
> xuser_lists            NONE
> start_proc_args        /bin/true
> stop_proc_args         /bin/true
> per_pe_task_prolog     NONE
> per_pe_task_epilog     NONE
> allocation_rule        $round_robin
> control_slaves         TRUE
> job_is_first_task      FALSE
> urgency_slots          min
> accounting_summary     FALSE
> daemon_forks_slaves    FALSE
> master_forks_slaves    FALSE
>
> Running command:
>
> We use qsub to submit the workers; the master-worker connection is
> established with MPI_COMM_ACCEPT and MPI_COMM_CONNECT.
>
> qsub -P bnormal -pe mt 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments
>
> qsub -P bnormal -pe dp 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments
>
> Running result:
>
> Version        mt      dp
> MPICH 3.2.1    pass    pass
> MPICH 3.3.2    fail    fail
>
> mt failure with MPICH 3.3.2:
>
> [proxy:0:0@host1] HYDU_sock_write (utils/sock/sock.c:289): write error (Bad file descriptor)
> [proxy:0:0@host1] main (pm/pmiserv/pmip.c:189): unable to send the proxyID to the server
>
> dp failure with MPICH 3.3.2:
>
> error: executing task of job 490150 failed: execution daemon on host "host1" didn't accept task
>
> Could you please explain why the runs fail with mpich-3.3, and whether
> there is any solution we can use to get them to pass with mpich-3.3.2?
>
> I appreciate any help.
>
> Thanks
>