[mpich-discuss] mpich hangs
Jeff Hammond
jeff.science at gmail.com
Thu Jun 27 22:31:45 CDT 2013
Can you run the cpi program? If that doesn't run, something is wrong
with the MPICH installation or the cluster setup, because that program
is trivial and correct.
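
For reference, cpi (in the examples directory of the MPICH source tree) approximates pi by midpoint-rule integration of 4/(1+x^2) over [0,1]. Each rank sums a strided subset of the intervals and the partial sums are then reduced. Below is a serial Python sketch of that arithmetic only, not the MPI program itself. The names partial_pi and emulate_cpi are hypothetical, and the loop over ranks merely stands in for the MPI reduction.

```python
import math


def partial_pi(rank, nprocs, n):
    """Midpoint-rule partial sum that a single rank would contribute."""
    h = 1.0 / n
    total = 0.0
    # Rank r handles intervals r+1, r+1+nprocs, ... (strided distribution).
    for i in range(rank + 1, n + 1, nprocs):
        x = h * (i - 0.5)
        total += 4.0 / (1.0 + x * x)
    return h * total


def emulate_cpi(nprocs=8, n=10000):
    """Add up the per-rank partial sums (stand-in for the MPI reduction)."""
    return sum(partial_pi(rank, nprocs, n) for rank in range(nprocs))


if __name__ == "__main__":
    approx = emulate_cpi()
    print("pi is approximately %.16f, error is %.16f"
          % (approx, abs(approx - math.pi)))
```

The real cpi exercises process launch and a collective across all ranks, which is why it is a useful smoke test before running something as heavy as HPL.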
Jeff
On Thu, Jun 27, 2013 at 10:29 PM, Syed. Jahanzeb Maqbool Hashmi
<jahanzeb.maqbool at gmail.com> wrote:
> again that same error:
> Fatal error in PMPI_Wait: A process has failed, error stack:
> PMPI_Wait(180)............: MPI_Wait(request=0xbebb9a1c, status=0xbebb99f0)
> failed
> MPIR_Wait_impl(77)........:
> dequeue_and_set_error(888): Communication error with rank 4
>
> here is the verbose output:
>
> --------------START------------------
>
> host: weiser1
> host: weiser2
>
> ==================================================================================================
> mpiexec options:
> ----------------
> Base path: /mnt/nfs/install/mpich-install/bin/
> Launcher: (null)
> Debug level: 1
> Enable X: -1
>
> Global environment:
> -------------------
> TERM=xterm
> SHELL=/bin/bash
>
> XDG_SESSION_COOKIE=218a1dd8e20ea6d6ec61475b00000019-1372384778.679329-1845893422
> SSH_CLIENT=192.168.0.3 57311 22
> OLDPWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1
> SSH_TTY=/dev/pts/0
> USER=linaro
>
> LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
> LD_LIBRARY_PATH=:/mnt/nfs/install/mpich-install/lib
> MAIL=/var/mail/linaro
>
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/mnt/nfs/install/mpich-install/bin
> PWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a
> LANG=C.UTF-8
> SHLVL=1
> HOME=/home/linaro
> LOGNAME=linaro
> SSH_CONNECTION=192.168.0.3 57311 192.168.0.101 22
> LESSOPEN=| /usr/bin/lesspipe %s
> LESSCLOSE=/usr/bin/lesspipe %s %s
> _=/mnt/nfs/install/mpich-install/bin/mpiexec
>
> Hydra internal environment:
> ---------------------------
> GFORTRAN_UNBUFFERED_PRECONNECTED=y
>
>
> Proxy information:
> *********************
> [1] proxy: weiser1 (4 cores)
> Exec list: ./xhpl (4 processes);
>
> [2] proxy: weiser2 (4 cores)
> Exec list: ./xhpl (4 processes);
>
>
> ==================================================================================================
>
> [mpiexec at weiser1] Timeout set to -1 (-1 means infinite)
> [mpiexec at weiser1] Got a control port string of weiser1:45851
>
> Proxy launch args: /mnt/nfs/install/mpich-install/bin/hydra_pmi_proxy
> --control-port weiser1:45851 --debug --rmk user --launcher ssh --demux poll
> --pgid 0 --retries 10 --usize -2 --proxy-id
>
> Arguments being passed to proxy 0:
> --version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
> weiser1 --global-core-map 0,4,8 --pmi-id-map 0,0 --global-process-count 8
> --auto-cleanup 1 --pmi-kvsname kvs_24541_0 --pmi-process-mapping
> (vector,(0,2,4)) --ckpoint-num -1 --global-inherited-env 20 'TERM=xterm'
> 'SHELL=/bin/bash'
> 'XDG_SESSION_COOKIE=218a1dd8e20ea6d6ec61475b00000019-1372384778.679329-1845893422'
> 'SSH_CLIENT=192.168.0.3 57311 22'
> 'OLDPWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1' 'SSH_TTY=/dev/pts/0'
> 'USER=linaro'
> 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:'
> 'LD_LIBRARY_PATH=:/mnt/nfs/install/mpich-install/lib'
> 'MAIL=/var/mail/linaro'
> 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/mnt/nfs/install/mpich-install/bin'
> 'PWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a' 'LANG=C.UTF-8'
> 'SHLVL=1' 'HOME=/home/linaro' 'LOGNAME=linaro' 'SSH_CONNECTION=192.168.0.3
> 57311 192.168.0.101 22' 'LESSOPEN=| /usr/bin/lesspipe %s'
> 'LESSCLOSE=/usr/bin/lesspipe %s %s'
> '_=/mnt/nfs/install/mpich-install/bin/mpiexec' --global-user-env 0
> --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
> --proxy-core-count 4 --exec --exec-appnum 0 --exec-proc-count 4
> --exec-local-env 0 --exec-wdir
> /mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a --exec-args 1 ./xhpl
>
> Arguments being passed to proxy 1:
> --version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
> weiser2 --global-core-map 0,4,8 --pmi-id-map 0,4 --global-process-count 8
> --auto-cleanup 1 --pmi-kvsname kvs_24541_0 --pmi-process-mapping
> (vector,(0,2,4)) --ckpoint-num -1 --global-inherited-env 20 'TERM=xterm'
> 'SHELL=/bin/bash'
> 'XDG_SESSION_COOKIE=218a1dd8e20ea6d6ec61475b00000019-1372384778.679329-1845893422'
> 'SSH_CLIENT=192.168.0.3 57311 22'
> 'OLDPWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1' 'SSH_TTY=/dev/pts/0'
> 'USER=linaro'
> 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:'
> 'LD_LIBRARY_PATH=:/mnt/nfs/install/mpich-install/lib'
> 'MAIL=/var/mail/linaro'
> 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/mnt/nfs/install/mpich-install/bin'
> 'PWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a' 'LANG=C.UTF-8'
> 'SHLVL=1' 'HOME=/home/linaro' 'LOGNAME=linaro' 'SSH_CONNECTION=192.168.0.3
> 57311 192.168.0.101 22' 'LESSOPEN=| /usr/bin/lesspipe %s'
> 'LESSCLOSE=/usr/bin/lesspipe %s %s'
> '_=/mnt/nfs/install/mpich-install/bin/mpiexec' --global-user-env 0
> --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
> --proxy-core-count 4 --exec --exec-appnum 0 --exec-proc-count 4
> --exec-local-env 0 --exec-wdir
> /mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a --exec-args 1 ./xhpl
>
> [mpiexec at weiser1] Launch arguments:
> /mnt/nfs/install/mpich-install/bin/hydra_pmi_proxy --control-port
> weiser1:45851 --debug --rmk user --launcher ssh --demux poll --pgid 0
> --retries 10 --usize -2 --proxy-id 0
> [mpiexec at weiser1] Launch arguments: /usr/bin/ssh -x weiser2
> "/mnt/nfs/install/mpich-install/bin/hydra_pmi_proxy" --control-port
> weiser1:45851 --debug --rmk user --launcher ssh --demux poll --pgid 0
> --retries 10 --usize -2 --proxy-id 1
> [proxy:0:0 at weiser1] got pmi command (from 0): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:0 at weiser1] got pmi command (from 0): get_maxes
>
> [proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:0 at weiser1] got pmi command (from 15): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:0 at weiser1] got pmi command (from 15): get_maxes
>
> [proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:0 at weiser1] got pmi command (from 8): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:0 at weiser1] got pmi command (from 0): get_appnum
>
> [proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
> [proxy:0:0 at weiser1] got pmi command (from 15): get_appnum
>
> [proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
> [proxy:0:0 at weiser1] got pmi command (from 0): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 8): get_maxes
>
> [proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:0 at weiser1] got pmi command (from 0): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 6): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:0 at weiser1] got pmi command (from 15): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 0): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:0 at weiser1] got pmi command (from 8): get_appnum
>
> [proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
> [proxy:0:0 at weiser1] got pmi command (from 15): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 8): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 0): put
> kvsname=kvs_24541_0 key=sharedFilename[0]
> value=/dev/shm/mpich_shar_tmpnEZdQ9
> [proxy:0:0 at weiser1] cached command:
> sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
> [proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:0 at weiser1] got pmi command (from 15): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:0 at weiser1] got pmi command (from 0): barrier_in
>
> [proxy:0:0 at weiser1] got pmi command (from 6): get_maxes
>
> [proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:0 at weiser1] got pmi command (from 8): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 15): barrier_in
>
> [proxy:0:0 at weiser1] got pmi command (from 8): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:0 at weiser1] got pmi command (from 6): get_appnum
>
> [proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
> [proxy:0:0 at weiser1] got pmi command (from 8): barrier_in
>
> [proxy:0:0 at weiser1] got pmi command (from 6): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 6): get_my_kvsname
>
> [proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:0 at weiser1] got pmi command (from 6): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:0 at weiser1] got pmi command (from 6): barrier_in
>
> [proxy:0:0 at weiser1] flushing 1 put command(s) out
> [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
> sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
> [proxy:0:0 at weiser1] forwarding command (cmd=put
> sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9) upstream
> [proxy:0:0 at weiser1] forwarding command (cmd=barrier_in) upstream
> [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
> [proxy:0:1 at weiser2] got pmi command (from 7): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:1 at weiser2] got pmi command (from 5): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:1 at weiser2] got pmi command (from 7): get_maxes
>
> [proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:1 at weiser2] got pmi command (from 4): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:1 at weiser2] got pmi command (from 7): get_appnum
>
> [proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
> [proxy:0:1 at weiser2] got pmi command (from 4): get_maxes
>
> [proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:1 at weiser2] got pmi command (from 7): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 4): get_appnum
>
> [proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
> [proxy:0:1 at weiser2] got pmi command (from 7): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 4): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 7): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:1 at weiser2] got pmi command (from 4): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 7): barrier_in
>
> [proxy:0:1 at weiser2] got pmi command (from 4): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:1 at weiser2] got pmi command (from 5): get_maxes
>
> [proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:1 at weiser2] got pmi command (from 5): get_appnum
>
> [proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
> [proxy:0:1 at weiser2] got pmi command (from 4): put
> kvsname=kvs_24541_0 key=sharedFilename[4]
> value=/dev/shm/mpich_shar_tmpuKzlSa
> [proxy:0:1 at weiser2] cached command:
> sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
> [proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:1 at weiser2] got pmi command (from 5): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 4): barrier_in
>
> [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
> sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
> [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
> [mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=keyval_cache
> sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
> sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
> [mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=keyval_cache
> sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
> sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
> [mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=barrier_out
> [mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=barrier_out
> [proxy:0:1 at weiser2] got pmi command (from 5): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 5): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:1 at weiser2] got pmi command (from 10): init
> pmi_version=1 pmi_subversion=1
> [proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
> pmi_subversion=1 rc=0
> [proxy:0:1 at weiser2] got pmi command (from 5): barrier_in
>
> [proxy:0:1 at weiser2] got pmi command (from 10): get_maxes
>
> [proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
> vallen_max=1024
> [proxy:0:1 at weiser2] got pmi command (from 10): get_appnum
>
> [proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
> [proxy:0:1 at weiser2] got pmi command (from 10): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 10): get_my_kvsname
>
> [proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
> [proxy:0:1 at weiser2] got pmi command (from 10): get
> kvsname=kvs_24541_0 key=PMI_process_mapping
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=(vector,(0,2,4))
> [proxy:0:1 at weiser2] got pmi command (from 10): barrier_in
>
> [proxy:0:1 at weiser2] flushing 1 put command(s) out
> [proxy:0:1 at weiser2] forwarding command (cmd=put
> sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa) upstream
> [proxy:0:1 at weiser2] forwarding command (cmd=barrier_in) upstream
> [proxy:0:0 at weiser1] PMI response: cmd=barrier_out
> [proxy:0:0 at weiser1] PMI response: cmd=barrier_out
> [proxy:0:0 at weiser1] PMI response: cmd=barrier_out
> [proxy:0:0 at weiser1] PMI response: cmd=barrier_out
> [proxy:0:0 at weiser1] got pmi command (from 6): get
> kvsname=kvs_24541_0 key=sharedFilename[0]
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=/dev/shm/mpich_shar_tmpnEZdQ9
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] got pmi command (from 5): get
> kvsname=kvs_24541_0 key=sharedFilename[4]
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=/dev/shm/mpich_shar_tmpuKzlSa
> [proxy:0:1 at weiser2] got pmi command (from 7): get
> kvsname=kvs_24541_0 key=sharedFilename[4]
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=/dev/shm/mpich_shar_tmpuKzlSa
> [proxy:0:1 at weiser2] got pmi command (from 10): get
> kvsname=kvs_24541_0 key=sharedFilename[4]
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=/dev/shm/mpich_shar_tmpuKzlSa
> [proxy:0:0 at weiser1] got pmi command (from 8): get
> kvsname=kvs_24541_0 key=sharedFilename[0]
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=/dev/shm/mpich_shar_tmpnEZdQ9
> [proxy:0:0 at weiser1] got pmi command (from 15): get
> kvsname=kvs_24541_0 key=sharedFilename[0]
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=/dev/shm/mpich_shar_tmpnEZdQ9
> [proxy:0:0 at weiser1] got pmi command (from 0): put
> kvsname=kvs_24541_0 key=P0-businesscard
> value=description#weiser1$port#56190$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] cached command:
> P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:0 at weiser1] got pmi command (from 8): put
> kvsname=kvs_24541_0 key=P2-businesscard
> value=description#weiser1$port#40019$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] cached command:
> P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:0 at weiser1] got pmi command (from 15): put
> kvsname=kvs_24541_0 key=P3-businesscard
> value=description#weiser1$port#57150$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] cached command:
> P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:0 at weiser1] got pmi command (from 0): barrier_in
>
> [proxy:0:0 at weiser1] got pmi command (from 6): put
> kvsname=kvs_24541_0 key=P1-businesscard
> value=description#weiser1$port#34048$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] cached command:
> P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:0 at weiser1] got pmi command (from 8): barrier_in
>
> [proxy:0:0 at weiser1] got pmi command (from 6): barrier_in
>
> [proxy:0:0 at weiser1] got pmi command (from 15): barrier_in
>
> [proxy:0:0 at weiser1] flushing 4 put command(s) out
> [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
> P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
> P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
> P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
> P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
> [proxy:0:0 at weiser1] forwarding command (cmd=put
> P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
> P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
> P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
> P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$)
> upstream
> [proxy:0:0 at weiser1] forwarding command (cmd=barrier_in) upstream
> [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
> [proxy:0:1 at weiser2] got pmi command (from 4): put
> kvsname=kvs_24541_0 key=P4-businesscard
> value=description#weiser2$port#60693$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] cached command:
> P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:1 at weiser2] got pmi command (from 5): put
> kvsname=kvs_24541_0 key=P5-businesscard
> value=description#weiser2$port#49938$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] cached command:
> P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:1 at weiser2] got pmi command (from 7): put
> kvsname=kvs_24541_0 key=P6-businesscard
> value=description#weiser2$port#33516$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] cached command:
> P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:1 at weiser2] got pmi command (from 10): put
> kvsname=kvs_24541_0 key=P7-businesscard
> value=description#weiser2$port#43116$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] cached command:
> P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
> P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
> P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
> P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
> P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
> PMI response: cmd=put_result rc=0 msg=success
> [proxy:0:1 at weiser2] got pmi command (from 4): barrier_in
>
> [proxy:0:1 at weiser2] got pmi command (from 5): barrier_in
>
> [proxy:0:1 at weiser2] got pmi command (from 7): barrier_in
> [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
> [mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=keyval_cache
> P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
> P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
> P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
> P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
> P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
> P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
> P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
> P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
> [mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=keyval_cache
> P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
> P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
> P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
> P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
> P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
> P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
> P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
> P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
> [mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=barrier_out
> [mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=barrier_out
> [proxy:0:0 at weiser1] PMI response: cmd=barrier_out
> [proxy:0:0 at weiser1]
> [proxy:0:1 at weiser2] got pmi command (from 10): barrier_in
>
> [proxy:0:1 at weiser2] flushing 4 put command(s) out
> [proxy:0:1 at weiser2] forwarding command (cmd=put
> P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
> P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
> P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
> P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$)
> upstream
> [proxy:0:1 at weiser2] forwarding command (cmd=barrier_in) upstream
> PMI response: cmd=barrier_out
> [proxy:0:0 at weiser1] PMI response: cmd=barrier_out
> [proxy:0:0 at weiser1] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] PMI response: cmd=barrier_out
> [proxy:0:1 at weiser2] got pmi command (from 4): get
> kvsname=kvs_24541_0 key=P0-businesscard
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=description#weiser1$port#56190$ifname#192.168.0.101$
> ================================================================================
> HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
> Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory,
> UTK
> Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
> Modified by Julien Langou, University of Colorado Denver
> ================================================================================
>
> An explanation of the input/output parameters follows:
> T/V : Wall time / encoded variant.
> N : The order of the coefficient matrix A.
> NB : The partitioning blocking factor.
> P : The number of process rows.
> Q : The number of process columns.
> Time : Time in seconds to solve the linear system.
> Gflops : Rate of execution for solving the linear system.
>
> The following parameter values will be used:
>
> N : 14616
> NB : 168
> PMAP : Row-major process mapping
> P : 2
> Q : 4
> PFACT : Right
> NBMIN : 4
> NDIV : 2
> RFACT : Crout
> BCAST : 1ringM
> DEPTH : 1
> SWAP : Mix (threshold = 64)
> L1 : transposed form
> U : transposed form
> EQUIL : yes
> ALIGN : 8 double precision words
>
> --------------------------------------------------------------------------------
>
> - The matrix A is randomly generated for each test.
> - The following scaled residual check will be computed:
> ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
> - The relative machine precision (eps) is taken to be
> 1.110223e-16
> [proxy:0:0 at weiser1] got pmi command (from 6): get
> - Computational tests pass if scaled residuals are less than
> 16.0
>
> kvsname=kvs_24541_0 key=P5-businesscard
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=description#weiser2$port#49938$ifname#192.168.0.102$
> [proxy:0:0 at weiser1] got pmi command (from 15): get
> kvsname=kvs_24541_0 key=P7-businesscard
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=description#weiser2$port#43116$ifname#192.168.0.102$
> [proxy:0:0 at weiser1] got pmi command (from 8): get
> kvsname=kvs_24541_0 key=P6-businesscard
> [proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
> value=description#weiser2$port#33516$ifname#192.168.0.102$
> [proxy:0:1 at weiser2] got pmi command (from 5): get
> kvsname=kvs_24541_0 key=P1-businesscard
> [proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
> value=description#weiser1$port#34048$ifname#192.168.0.101$
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
>
> ----------- END --------------
>
> if that can help :(
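
A side check on the problem size in the HPL parameters quoted above (N = 14616, P x Q = 2 x 4): HPL factors a dense N x N matrix of double-precision values distributed over all ranks. The sketch below only does that arithmetic. The function names are hypothetical, the even split across ranks is an approximation, and workspace and operating-system overhead are ignored. It sizes the problem and does not by itself explain the exit code 9 above.

```python
def hpl_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed for an n x n matrix of double-precision entries."""
    return n * n * bytes_per_entry


def per_node_bytes(n, total_ranks, ranks_per_node):
    """Approximate share of the matrix held on one node (assumes an even split)."""
    return hpl_matrix_bytes(n) * ranks_per_node // total_ranks


if __name__ == "__main__":
    n = 14616  # N from the quoted HPL output
    total = hpl_matrix_bytes(n)
    node = per_node_bytes(n, total_ranks=8, ranks_per_node=4)
    print("whole matrix: %.2f GiB" % (total / 2**30))
    print("per node (4 of 8 ranks): %.2f GiB" % (node / 2**30))
```

Comparing the per-node figure with the RAM actually available on each board (for example via free -m) would show whether the problem size is plausible for this cluster.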
>
>
>
>
>
>
> On Fri, Jun 28, 2013 at 12:24 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>
>>
>> Looks like your application aborted for some reason.
>>
>> -- Pavan
>>
>>
>> On 06/27/2013 10:21 PM, Syed. Jahanzeb Maqbool Hashmi wrote:
>>>
>>> My bad, I just found out that there was a duplicate entry like:
>>> weiser1 127.0.1.1
>>> weiser1 192.168.0.101
>>> so I removed the 127.x.x.x entry and kept the hostfile contents similar
>>> on both nodes. Now the previous error is reduced to this one:
>>>
>>> ------ START OF OUTPUT -------
>>>
>>> ....some HPL startup string (no final result)
>>> ...skip.....
>>>
>>>
>>> ===================================================================================
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = EXIT CODE: 9
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>> ===================================================================================
>>> [proxy:0:0 at weiser1] HYD_pmcd_pmip_control_cmd_cb
>>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>>> [proxy:0:0 at weiser1] HYDT_dmxu_poll_wait_for_event
>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:0 at weiser1] main (./pm/pmiserv/pmip.c:206): demux engine error
>>> waiting for event
>>> [mpiexec at weiser1] HYDT_bscu_wait_for_completion
>>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>> terminated badly; aborting
>>> [mpiexec at weiser1] HYDT_bsci_wait_for_completion
>>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
>>> for completion
>>> [mpiexec at weiser1] HYD_pmci_wait_for_completion
>>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>>> completion
>>> [mpiexec at weiser1] main (./ui/mpich/mpiexec.c:331): process manager error
>>> waiting for completion
>>>
>>> ------ END OF OUTPUT -------
>>>
>>>
>>>
>>> On Fri, Jun 28, 2013 at 12:12 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>>
>>>
>>> On 06/27/2013 10:08 PM, Syed. Jahanzeb Maqbool Hashmi wrote:
>>>
>>>
>>> P4-businesscard=description#weiser2$port#57651$ifname#192.168.0.102$
>>> P5-businesscard=description#weiser2$port#52622$ifname#192.168.0.102$
>>> P6-businesscard=description#weiser2$port#55935$ifname#192.168.0.102$
>>> P7-businesscard=description#weiser2$port#54952$ifname#192.168.0.102$
>>> P0-businesscard=description#weiser1$port#41958$ifname#127.0.1.1$
>>> P2-businesscard=description#weiser1$port#35049$ifname#127.0.1.1$
>>> P1-businesscard=description#weiser1$port#39634$ifname#127.0.1.1$
>>> P3-businesscard=description#weiser1$port#51802$ifname#127.0.1.1$
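
The business cards quoted above advertise a loopback ifname for every rank on weiser1, which ranks on weiser2 cannot connect to. A minimal sketch for spotting this in a verbose log follows. The key#value$ layout is inferred from the log lines in this thread rather than from MPICH documentation, and parse_businesscard and loopback_cards are hypothetical helper names.

```python
import ipaddress


def parse_businesscard(card):
    """Split a 'key#value$key#value$' string into a dict."""
    fields = {}
    for part in card.split("$"):
        if "#" in part:
            key, value = part.split("#", 1)
            fields[key] = value
    return fields


def loopback_cards(cards):
    """Return the card names whose ifname is a loopback address."""
    flagged = []
    for name, card in cards.items():
        ifname = parse_businesscard(card).get("ifname", "")
        try:
            if ipaddress.ip_address(ifname).is_loopback:
                flagged.append(name)
        except ValueError:
            # Missing or non-numeric ifname: nothing to check.
            continue
    return flagged


if __name__ == "__main__":
    # Two sample cards taken from the output quoted in this thread.
    sample = {
        "P4-businesscard": "description#weiser2$port#57651$ifname#192.168.0.102$",
        "P0-businesscard": "description#weiser1$port#41958$ifname#127.0.1.1$",
    }
    print(loopback_cards(sample))
```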
>>>
>>>
>>>
>>> I have two concerns with your output. Let's start with the first.
>>>
>>> Did you look at this question on the FAQ page?
>>>
>>> "Is your /etc/hosts file consistent across all nodes? Unless you are
>>> using an external DNS server, the /etc/hosts file on every machine
>>> should contain the correct IP information about all hosts in the
>>> system."
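
The FAQ check quoted above can be partly automated. The following is a rough sketch, assuming the standard /etc/hosts layout of one address followed by one or more names per line. It lists names other than the usual localhost aliases that resolve to a loopback address. loopback_hostnames is a hypothetical helper, and the sample text is illustrative rather than copied from either node.

```python
import ipaddress


def loopback_hostnames(hosts_text):
    """Names, other than localhost/loopback aliases, mapped to a loopback address."""
    flagged = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        address, *names = line.split()
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # not a plain IP address; skip the line
        if is_loopback:
            flagged.update(
                n for n in names
                if "localhost" not in n and "loopback" not in n
            )
    return sorted(flagged)


if __name__ == "__main__":
    sample = """\
127.0.0.1      localhost
127.0.1.1      weiser1
192.168.0.101  weiser1
192.168.0.102  weiser2
"""
    print(loopback_hostnames(sample))
```

Running the same check on every node (reading /etc/hosts instead of the sample string) and comparing the results is one way to confirm the files are consistent.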
>>>
>>>
>>> -- Pavan
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>>
>>>
>>
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>
>
>
--
Jeff Hammond
jeff.science at gmail.com