[mpich-discuss] Problems in SGE submission when upgrade from 3.2.1 to 3.3.2

Shuwei Zhao shuweizhao1991 at gmail.com
Thu May 28 21:40:08 CDT 2020


Hi,

I was trying to upgrade our MPICH version from 3.2.1 to 3.3.2 to pick up
the latest stable release. However, runs with the new MPICH version fail.



We have two parallel environment (PE) configurations, mt and dp, as below:

mt:

pe_name                mt
slots                  2000000
used_slots             1181
bound_slots            0
user_lists             NONE
xuser_lists            NONE
start_proc_args        /bin/true
stop_proc_args         /bin/true
per_pe_task_prolog     NONE
per_pe_task_epilog     NONE
allocation_rule        $pe_slots
control_slaves         FALSE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     FALSE
daemon_forks_slaves    FALSE
master_forks_slaves    TRUE

dp:

pe_name                dp
slots                  10000
used_slots             0
bound_slots            0
user_lists             NONE
xuser_lists            NONE
start_proc_args        /bin/true
stop_proc_args         /bin/true
per_pe_task_prolog     NONE
per_pe_task_epilog     NONE
allocation_rule        $round_robin
control_slaves         TRUE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     FALSE
daemon_forks_slaves    FALSE
master_forks_slaves    FALSE
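
To make the differences between the two PEs easier to spot, here is a minimal Python sketch that compares them. It assumes the configuration text is pasted in as strings (abbreviated here to a few of the keys listed above); parse_pe and diff_pe are illustrative helper names, not part of SGE or MPICH.

```python
# Sketch: list the settings that differ between the two parallel environments.
# MT_PE and DP_PE are abbreviated copies of the configurations quoted above.
# parse_pe and diff_pe are hypothetical helpers written for this illustration.

MT_PE = """\
pe_name                mt
slots                  2000000
allocation_rule        $pe_slots
control_slaves         FALSE
job_is_first_task      FALSE
master_forks_slaves    TRUE
"""

DP_PE = """\
pe_name                dp
slots                  10000
allocation_rule        $round_robin
control_slaves         TRUE
job_is_first_task      FALSE
master_forks_slaves    FALSE
"""


def parse_pe(text):
    """Turn 'key   value' lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def diff_pe(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    differences = diff_pe(parse_pe(MT_PE), parse_pe(DP_PE))
    for key, (mt_val, dp_val) in differences.items():
        print(f"{key:22} mt={str(mt_val):14} dp={str(dp_val)}")
```

For the abbreviated input, everything except job_is_first_task differs, including allocation_rule, control_slaves and master_forks_slaves.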



Running command:

We use qsub to submit the workers, and the master-worker connection is
established using MPI_COMM_ACCEPT and MPI_COMM_CONNECT.

qsub -P bnormal -pe mt 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments

qsub -P bnormal -pe dp 1 -e sge_err -o sge_out mpiexec -n 1 /path/to/my/binary binary_arguments



Running result:

               mt      dp
MPICH 3.2.1    pass    pass
MPICH 3.3.2    fail    fail

Error with mt on MPICH 3.3.2:

[proxy:0:0 at host1] HYDU_sock_write (utils/sock/sock.c:289): write error (Bad file descriptor)
[proxy:0:0 at host1] main (pm/pmiserv/pmip.c:189): unable to send the proxyID to the server

Error with dp on MPICH 3.3.2:

error: executing task of job 490150 failed: execution daemon on host "host1" didn't accept task


Could you please explain why the runs fail with mpich-3.3.2, and whether
there is any solution we can use to get them to pass with mpich-3.3.2?

Appreciate any help.



Thanks