[mpich-discuss] mpich hangs

Syed. Jahanzeb Maqbool Hashmi jahanzeb.maqbool at gmail.com
Thu Jun 27 22:29:33 CDT 2013


Again, the same error:
Fatal error in PMPI_Wait: A process has failed, error stack:
PMPI_Wait(180)............: MPI_Wait(request=0xbebb9a1c, status=0xbebb99f0)
failed
MPIR_Wait_impl(77)........:
dequeue_and_set_error(888): Communication error with rank 4
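
For context: a stack like this surfaces wherever a nonblocking operation is
completed with MPI_Wait and the peer process dies in the meantime, so it points
at where the failure was noticed, not at why rank 4 died. A minimal sketch of
that call pattern (made-up ranks and tag, purely for illustration, not HPL's
actual code; run with at least 5 processes):

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, buf = 0;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* post a nonblocking receive from rank 4, then complete it;
             * if rank 4 dies first, the failure is reported inside
             * MPI_Wait, as in the "Communication error with rank 4"
             * stack above */
            MPI_Irecv(&buf, 1, MPI_INT, 4, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &status);
        } else if (rank == 4) {
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }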

Here is the verbose output:

--------------START------------------
host: weiser1
host: weiser2

==================================================================================================
mpiexec options:
----------------
  Base path: /mnt/nfs/install/mpich-install/bin/
  Launcher: (null)
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    TERM=xterm
    SHELL=/bin/bash

XDG_SESSION_COOKIE=218a1dd8e20ea6d6ec61475b00000019-1372384778.679329-1845893422
    SSH_CLIENT=192.168.0.3 57311 22
    OLDPWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1
    SSH_TTY=/dev/pts/0
    USER=linaro

LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
    LD_LIBRARY_PATH=:/mnt/nfs/install/mpich-install/lib
    MAIL=/var/mail/linaro

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/mnt/nfs/install/mpich-install/bin
    PWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a
    LANG=C.UTF-8
    SHLVL=1
    HOME=/home/linaro
    LOGNAME=linaro
    SSH_CONNECTION=192.168.0.3 57311 192.168.0.101 22
    LESSOPEN=| /usr/bin/lesspipe %s
    LESSCLOSE=/usr/bin/lesspipe %s %s
    _=/mnt/nfs/install/mpich-install/bin/mpiexec

  Hydra internal environment:
  ---------------------------
    GFORTRAN_UNBUFFERED_PRECONNECTED=y


    Proxy information:
    *********************
      [1] proxy: weiser1 (4 cores)
      Exec list: ./xhpl (4 processes);

      [2] proxy: weiser2 (4 cores)
      Exec list: ./xhpl (4 processes);


==================================================================================================

[mpiexec at weiser1] Timeout set to -1 (-1 means infinite)
[mpiexec at weiser1] Got a control port string of weiser1:45851

Proxy launch args: /mnt/nfs/install/mpich-install/bin/hydra_pmi_proxy
--control-port weiser1:45851 --debug --rmk user --launcher ssh --demux poll
--pgid 0 --retries 10 --usize -2 --proxy-id

Arguments being passed to proxy 0:
--version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
weiser1 --global-core-map 0,4,8 --pmi-id-map 0,0 --global-process-count 8
--auto-cleanup 1 --pmi-kvsname kvs_24541_0 --pmi-process-mapping
(vector,(0,2,4)) --ckpoint-num -1 --global-inherited-env 20 'TERM=xterm'
'SHELL=/bin/bash'
'XDG_SESSION_COOKIE=218a1dd8e20ea6d6ec61475b00000019-1372384778.679329-1845893422'
'SSH_CLIENT=192.168.0.3 57311 22'
'OLDPWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1' 'SSH_TTY=/dev/pts/0'
'USER=linaro'
'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:'
'LD_LIBRARY_PATH=:/mnt/nfs/install/mpich-install/lib'
'MAIL=/var/mail/linaro'
'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/mnt/nfs/install/mpich-install/bin'
'PWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a' 'LANG=C.UTF-8'
'SHLVL=1' 'HOME=/home/linaro' 'LOGNAME=linaro' 'SSH_CONNECTION=192.168.0.3
57311 192.168.0.101 22' 'LESSOPEN=| /usr/bin/lesspipe %s'
'LESSCLOSE=/usr/bin/lesspipe %s %s'
'_=/mnt/nfs/install/mpich-install/bin/mpiexec' --global-user-env 0
--global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
--proxy-core-count 4 --exec --exec-appnum 0 --exec-proc-count 4
--exec-local-env 0 --exec-wdir
/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a --exec-args 1 ./xhpl

Arguments being passed to proxy 1:
--version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
weiser2 --global-core-map 0,4,8 --pmi-id-map 0,4 --global-process-count 8
--auto-cleanup 1 --pmi-kvsname kvs_24541_0 --pmi-process-mapping
(vector,(0,2,4)) --ckpoint-num -1 --global-inherited-env 20 'TERM=xterm'
'SHELL=/bin/bash'
'XDG_SESSION_COOKIE=218a1dd8e20ea6d6ec61475b00000019-1372384778.679329-1845893422'
'SSH_CLIENT=192.168.0.3 57311 22'
'OLDPWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1' 'SSH_TTY=/dev/pts/0'
'USER=linaro'
'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:'
'LD_LIBRARY_PATH=:/mnt/nfs/install/mpich-install/lib'
'MAIL=/var/mail/linaro'
'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/mnt/nfs/install/mpich-install/bin'
'PWD=/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a' 'LANG=C.UTF-8'
'SHLVL=1' 'HOME=/home/linaro' 'LOGNAME=linaro' 'SSH_CONNECTION=192.168.0.3
57311 192.168.0.101 22' 'LESSOPEN=| /usr/bin/lesspipe %s'
'LESSCLOSE=/usr/bin/lesspipe %s %s'
'_=/mnt/nfs/install/mpich-install/bin/mpiexec' --global-user-env 0
--global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
--proxy-core-count 4 --exec --exec-appnum 0 --exec-proc-count 4
--exec-local-env 0 --exec-wdir
/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a --exec-args 1 ./xhpl

[mpiexec at weiser1] Launch arguments:
/mnt/nfs/install/mpich-install/bin/hydra_pmi_proxy --control-port
weiser1:45851 --debug --rmk user --launcher ssh --demux poll --pgid 0
--retries 10 --usize -2 --proxy-id 0
[mpiexec at weiser1] Launch arguments: /usr/bin/ssh -x weiser2
"/mnt/nfs/install/mpich-install/bin/hydra_pmi_proxy" --control-port
weiser1:45851 --debug --rmk user --launcher ssh --demux poll --pgid 0
--retries 10 --usize -2 --proxy-id 1
[proxy:0:0 at weiser1] got pmi command (from 0): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at weiser1] got pmi command (from 0): get_maxes

[proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at weiser1] got pmi command (from 15): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at weiser1] got pmi command (from 15): get_maxes

[proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at weiser1] got pmi command (from 8): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at weiser1] got pmi command (from 0): get_appnum

[proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
[proxy:0:0 at weiser1] got pmi command (from 15): get_appnum

[proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
[proxy:0:0 at weiser1] got pmi command (from 0): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 8): get_maxes

[proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at weiser1] got pmi command (from 0): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 6): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at weiser1] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:0 at weiser1] got pmi command (from 15): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 0): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:0 at weiser1] got pmi command (from 8): get_appnum

[proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
[proxy:0:0 at weiser1] got pmi command (from 15): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 8): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 0): put
kvsname=kvs_24541_0 key=sharedFilename[0]
value=/dev/shm/mpich_shar_tmpnEZdQ9
[proxy:0:0 at weiser1] cached command:
sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
[proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at weiser1] got pmi command (from 15): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:0 at weiser1] got pmi command (from 0): barrier_in

[proxy:0:0 at weiser1] got pmi command (from 6): get_maxes

[proxy:0:0 at weiser1] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:0 at weiser1] got pmi command (from 8): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 15): barrier_in

[proxy:0:0 at weiser1] got pmi command (from 8): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:0 at weiser1] got pmi command (from 6): get_appnum

[proxy:0:0 at weiser1] PMI response: cmd=appnum appnum=0
[proxy:0:0 at weiser1] got pmi command (from 8): barrier_in

[proxy:0:0 at weiser1] got pmi command (from 6): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 6): get_my_kvsname

[proxy:0:0 at weiser1] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:0 at weiser1] got pmi command (from 6): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:0 at weiser1] got pmi command (from 6): barrier_in

[proxy:0:0 at weiser1] flushing 1 put command(s) out
[mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
[proxy:0:0 at weiser1] forwarding command (cmd=put
sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9) upstream
[proxy:0:0 at weiser1] forwarding command (cmd=barrier_in) upstream
[mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at weiser2] got pmi command (from 7): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:1 at weiser2] got pmi command (from 5): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:1 at weiser2] got pmi command (from 7): get_maxes

[proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:1 at weiser2] got pmi command (from 4): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:1 at weiser2] got pmi command (from 7): get_appnum

[proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
[proxy:0:1 at weiser2] got pmi command (from 4): get_maxes

[proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:1 at weiser2] got pmi command (from 7): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 4): get_appnum

[proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
[proxy:0:1 at weiser2] got pmi command (from 7): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 7): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:1 at weiser2] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 7): barrier_in

[proxy:0:1 at weiser2] got pmi command (from 4): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:1 at weiser2] got pmi command (from 5): get_maxes

[proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:1 at weiser2] got pmi command (from 5): get_appnum

[proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
[proxy:0:1 at weiser2] got pmi command (from 4): put
kvsname=kvs_24541_0 key=sharedFilename[4]
value=/dev/shm/mpich_shar_tmpuKzlSa
[proxy:0:1 at weiser2] cached command:
sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
[proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:1 at weiser2] got pmi command (from 5): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 4): barrier_in

[mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
[mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=keyval_cache
sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
[mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=keyval_cache
sharedFilename[0]=/dev/shm/mpich_shar_tmpnEZdQ9
sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa
[mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=barrier_out
[mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=barrier_out
[proxy:0:1 at weiser2] got pmi command (from 5): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 5): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:1 at weiser2] got pmi command (from 10): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at weiser2] PMI response: cmd=response_to_init pmi_version=1
pmi_subversion=1 rc=0
[proxy:0:1 at weiser2] got pmi command (from 5): barrier_in

[proxy:0:1 at weiser2] got pmi command (from 10): get_maxes

[proxy:0:1 at weiser2] PMI response: cmd=maxes kvsname_max=256 keylen_max=64
vallen_max=1024
[proxy:0:1 at weiser2] got pmi command (from 10): get_appnum

[proxy:0:1 at weiser2] PMI response: cmd=appnum appnum=0
[proxy:0:1 at weiser2] got pmi command (from 10): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 10): get_my_kvsname

[proxy:0:1 at weiser2] PMI response: cmd=my_kvsname kvsname=kvs_24541_0
[proxy:0:1 at weiser2] got pmi command (from 10): get
kvsname=kvs_24541_0 key=PMI_process_mapping
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,4))
[proxy:0:1 at weiser2] got pmi command (from 10): barrier_in

[proxy:0:1 at weiser2] flushing 1 put command(s) out
[proxy:0:1 at weiser2] forwarding command (cmd=put
sharedFilename[4]=/dev/shm/mpich_shar_tmpuKzlSa) upstream
[proxy:0:1 at weiser2] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at weiser1] PMI response: cmd=barrier_out
[proxy:0:0 at weiser1] PMI response: cmd=barrier_out
[proxy:0:0 at weiser1] PMI response: cmd=barrier_out
[proxy:0:0 at weiser1] PMI response: cmd=barrier_out
[proxy:0:0 at weiser1] got pmi command (from 6): get
kvsname=kvs_24541_0 key=sharedFilename[0]
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=/dev/shm/mpich_shar_tmpnEZdQ9
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] got pmi command (from 5): get
kvsname=kvs_24541_0 key=sharedFilename[4]
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=/dev/shm/mpich_shar_tmpuKzlSa
[proxy:0:1 at weiser2] got pmi command (from 7): get
kvsname=kvs_24541_0 key=sharedFilename[4]
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=/dev/shm/mpich_shar_tmpuKzlSa
[proxy:0:1 at weiser2] got pmi command (from 10): get
kvsname=kvs_24541_0 key=sharedFilename[4]
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=/dev/shm/mpich_shar_tmpuKzlSa
[proxy:0:0 at weiser1] got pmi command (from 8): get
kvsname=kvs_24541_0 key=sharedFilename[0]
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=/dev/shm/mpich_shar_tmpnEZdQ9
[proxy:0:0 at weiser1] got pmi command (from 15): get
kvsname=kvs_24541_0 key=sharedFilename[0]
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=/dev/shm/mpich_shar_tmpnEZdQ9
[proxy:0:0 at weiser1] got pmi command (from 0): put
kvsname=kvs_24541_0 key=P0-businesscard
value=description#weiser1$port#56190$ifname#192.168.0.101$
[proxy:0:0 at weiser1] cached command:
P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
[proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at weiser1] got pmi command (from 8): put
kvsname=kvs_24541_0 key=P2-businesscard
value=description#weiser1$port#40019$ifname#192.168.0.101$
[proxy:0:0 at weiser1] cached command:
P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
[proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at weiser1] got pmi command (from 15): put
kvsname=kvs_24541_0 key=P3-businesscard
value=description#weiser1$port#57150$ifname#192.168.0.101$
[proxy:0:0 at weiser1] cached command:
P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
[proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at weiser1] got pmi command (from 0): barrier_in

[proxy:0:0 at weiser1] got pmi command (from 6): put
kvsname=kvs_24541_0 key=P1-businesscard
value=description#weiser1$port#34048$ifname#192.168.0.101$
[proxy:0:0 at weiser1] cached command:
P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
[proxy:0:0 at weiser1] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at weiser1] got pmi command (from 8): barrier_in

[proxy:0:0 at weiser1] got pmi command (from 6): barrier_in

[proxy:0:0 at weiser1] got pmi command (from 15): barrier_in

[proxy:0:0 at weiser1] flushing 4 put command(s) out
[mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
[proxy:0:0 at weiser1] forwarding command (cmd=put
P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$)
upstream
[proxy:0:0 at weiser1] forwarding command (cmd=barrier_in) upstream
[mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at weiser2] got pmi command (from 4): put
kvsname=kvs_24541_0 key=P4-businesscard
value=description#weiser2$port#60693$ifname#192.168.0.102$
[proxy:0:1 at weiser2] cached command:
P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
[proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:1 at weiser2] got pmi command (from 5): put
kvsname=kvs_24541_0 key=P5-businesscard
value=description#weiser2$port#49938$ifname#192.168.0.102$
[proxy:0:1 at weiser2] cached command:
P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
[proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:1 at weiser2] got pmi command (from 7): put
kvsname=kvs_24541_0 key=P6-businesscard
value=description#weiser2$port#33516$ifname#192.168.0.102$
[proxy:0:1 at weiser2] cached command:
P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
[proxy:0:1 at weiser2] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:1 at weiser2] got pmi command (from 10): put
kvsname=kvs_24541_0 key=P7-businesscard
value=description#weiser2$port#43116$ifname#192.168.0.102$
[proxy:0:1 at weiser2] cached command:
P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
[proxy:0:1 at weiser2] [mpiexec at weiser1] [pgid: 0] got PMI command: cmd=put
P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
PMI response: cmd=put_result rc=0 msg=success
[proxy:0:1 at weiser2] got pmi command (from 4): barrier_in

[proxy:0:1 at weiser2] got pmi command (from 5): barrier_in

[proxy:0:1 at weiser2] got pmi command (from 7): barrier_in
[mpiexec at weiser1] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=keyval_cache
P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
[mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=keyval_cache
P0-businesscard=description#weiser1$port#56190$ifname#192.168.0.101$
P2-businesscard=description#weiser1$port#40019$ifname#192.168.0.101$
P3-businesscard=description#weiser1$port#57150$ifname#192.168.0.101$
P1-businesscard=description#weiser1$port#34048$ifname#192.168.0.101$
P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$
[mpiexec at weiser1] PMI response to fd 6 pid 10: cmd=barrier_out
[mpiexec at weiser1] PMI response to fd 7 pid 10: cmd=barrier_out
[proxy:0:0 at weiser1] PMI response: cmd=barrier_out
[proxy:0:0 at weiser1]
[proxy:0:1 at weiser2] got pmi command (from 10): barrier_in

[proxy:0:1 at weiser2] flushing 4 put command(s) out
[proxy:0:1 at weiser2] forwarding command (cmd=put
P4-businesscard=description#weiser2$port#60693$ifname#192.168.0.102$
P5-businesscard=description#weiser2$port#49938$ifname#192.168.0.102$
P6-businesscard=description#weiser2$port#33516$ifname#192.168.0.102$
P7-businesscard=description#weiser2$port#43116$ifname#192.168.0.102$)
upstream
[proxy:0:1 at weiser2] forwarding command (cmd=barrier_in) upstream
PMI response: cmd=barrier_out
[proxy:0:0 at weiser1] PMI response: cmd=barrier_out
[proxy:0:0 at weiser1] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] PMI response: cmd=barrier_out
[proxy:0:1 at weiser2] got pmi command (from 4): get
kvsname=kvs_24541_0 key=P0-businesscard
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=description#weiser1$port#56190$ifname#192.168.0.101$
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing
Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   14616
NB     :     168
PMAP   : Row-major process mapping
P      :       2
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be
1.110223e-16
[proxy:0:0 at weiser1] got pmi command (from 6): get
- Computational tests pass if scaled residuals are less than
 16.0

kvsname=kvs_24541_0 key=P5-businesscard
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=description#weiser2$port#49938$ifname#192.168.0.102$
[proxy:0:0 at weiser1] got pmi command (from 15): get
kvsname=kvs_24541_0 key=P7-businesscard
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=description#weiser2$port#43116$ifname#192.168.0.102$
[proxy:0:0 at weiser1] got pmi command (from 8): get
kvsname=kvs_24541_0 key=P6-businesscard
[proxy:0:0 at weiser1] PMI response: cmd=get_result rc=0 msg=success
value=description#weiser2$port#33516$ifname#192.168.0.102$
[proxy:0:1 at weiser2] got pmi command (from 5): get
kvsname=kvs_24541_0 key=P1-businesscard
[proxy:0:1 at weiser2] PMI response: cmd=get_result rc=0 msg=success
value=description#weiser1$port#34048$ifname#192.168.0.101$

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================


----------- END --------------

I hope this helps :(
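
For reference, the /etc/hosts that both nodes now share looks something like
this (a sketch; the weiser1/weiser2 addresses are the ones that appear in the
SSH_CONNECTION and businesscard lines above):

    127.0.0.1       localhost
    192.168.0.101   weiser1
    192.168.0.102   weiser2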






On Fri, Jun 28, 2013 at 12:24 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> Looks like your application aborted for some reason.
>
>  -- Pavan
>
>
> On 06/27/2013 10:21 PM, Syed. Jahanzeb Maqbool Hashmi wrote:
>
>> My bad, I just found out that there was a duplicate entry like:
>> weiser1 127.0.1.1
>> weiser1 192.168.0.101
>> so I removed the 127.x.x.x entry and kept the /etc/hosts contents consistent
>> on both nodes. Now the previous error is reduced to this one:
>>
>> ------ START OF OUTPUT -------
>>
>> ....some HPL startup string (no final result)
>> ...skip.....
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   EXIT CODE: 9
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> [proxy:0:0 at weiser1] HYD_pmcd_pmip_control_cmd_cb
>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>> [proxy:0:0 at weiser1] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:0 at weiser1] main (./pm/pmiserv/pmip.c:206): demux engine error
>> waiting for event
>> [mpiexec at weiser1] HYDT_bscu_wait_for_completion
>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>> terminated badly; aborting
>> [mpiexec at weiser1] HYDT_bsci_wait_for_completion
>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
>> for completion
>> [mpiexec at weiser1] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>> completion
>> [mpiexec at weiser1] main (./ui/mpich/mpiexec.c:331): process manager error
>> waiting for completion
>>
>> ------ END OF OUTPUT -------
>>
>>
>>
>> On Fri, Jun 28, 2013 at 12:12 PM, Pavan Balaji <balaji at mcs.anl.gov
>> <mailto:balaji at mcs.anl.gov>> wrote:
>>
>>
>>     On 06/27/2013 10:08 PM, Syed. Jahanzeb Maqbool Hashmi wrote:
>>
>>         P4-businesscard=description#weiser2$port#57651$ifname#192.168.0.102$
>>         P5-businesscard=description#weiser2$port#52622$ifname#192.168.0.102$
>>         P6-businesscard=description#weiser2$port#55935$ifname#192.168.0.102$
>>         P7-businesscard=description#weiser2$port#54952$ifname#192.168.0.102$
>>         P0-businesscard=description#weiser1$port#41958$ifname#127.0.1.1$
>>         P2-businesscard=description#weiser1$port#35049$ifname#127.0.1.1$
>>         P1-businesscard=description#weiser1$port#39634$ifname#127.0.1.1$
>>         P3-businesscard=description#weiser1$port#51802$ifname#127.0.1.1$
>>
>>
>>
>>     I have two concerns with your output.  Let's start with the first.
>>
>>     Did you look at this question on the FAQ page?
>>
>>     "Is your /etc/hosts file consistent across all nodes? Unless you are
>>     using an external DNS server, the /etc/hosts file on every machine
>>     should contain the correct IP information about all hosts in the
>>     system."
>>
>>
>>       -- Pavan
>>
>>     --
>>     Pavan Balaji
>>     http://www.mcs.anl.gov/~balaji
>>
>>
>>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>