[mpich-discuss] Optimal Firewall Settings for MPICH/HYDRA

Capehart, William J William.Capehart at sdsmt.edu
Tue Jul 22 14:57:03 CDT 2014


Hi Pavan:

I had already done that.  With the firewall down, all is fine.  With the
firewall up, and with the firewall rules allowing the range of ports requested
in MPIEXEC_PORT_RANGE, the program starts, but once data begins to be tossed
about in the cpi or fpi program, things go badly.
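
For completeness, here is roughly how each machine is set up.  This is a
sketch rather than the literal commands we ran; the service names are the
SL 6.5 / RHEL 6 style ones, {peer.machine.ip.address} is a placeholder in
the same spirit as the masked addresses in the dump, and the real iptables
rules may differ in chain ordering and source restrictions:

  # Temporarily dropping the firewall for the test you suggested
  # (run on both machines; everything works in this state):
  service iptables stop

  # Port-range variables (tcsh), same values as in the environment dump:
  setenv MPIEXEC_PORT_RANGE       10000:10100
  setenv MPIR_CVAR_CH3_PORT_RANGE 10000:10100

  # With the firewall up: allow TCP in that range from the other node
  # (ssh on port 22 is already open between the machines):
  iptables -A INPUT -p tcp -s {peer.machine.ip.address} --dport 10000:10100 -j ACCEPT
  service iptables save

Both variables are set to the same window, so the Hydra and CH3 listeners
should land inside the range the firewall allows.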

I’m including the full verbose dump from the cpi run below.

Bill


[me at localhost:/home/me]% mpiexec -v -n 2 -f nodesshort cpi.exe
host: {local.machine.ip.address}
host: {remote.machine.ip.address}

===========================================================================
=======================
mpiexec options:
----------------
  Base path: /usr/local/mpich/bin/
  Launcher: (null)
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    USER=me
    LOGNAME=me
    HOME=/home/me
    
PATH=./:/bin:/usr/bin:/usr/local/bin:/usr/local/lib:/usr/local/netcdf/bin:/
usr/local/mpich/bin/openmpi:/usr/local/mpich/bin:/usr/local/mpich/include:/
usr/local/mpich/lib:/usr/local/ncarg:/usr/local/netcdf:/projects/WRF_UTIL/W
PSV3:/usr/local/netcdf/lib:/usr/local/netcdf/include:/usr/local/include:/us
r/local/lib:/home/me/bin:/usr/local/ncarg/bin:/usr/lib64/qt-3.3/bin:/usr/lo
cal/bin:/bin:/usr/bin:/usr/local/pgi/linux86-64/2014/bin:/usr/local/pgi/lin
ux86-64/2014/lib
    MAIL=/var/spool/mail/me
    SHELL=/bin/tcsh
    SSH_CLIENT={local.machine.ip.address} 41583 22
    SSH_CONNECTION={local.machine.ip.address} 41583
{local.machine.ip.address} 22
    SSH_TTY=/dev/pts/3
    TERM=xterm-color
    SELINUX_ROLE_REQUESTED=
    SELINUX_LEVEL_REQUESTED=
    SELINUX_USE_CURRENT_RANGE=
    HOSTTYPE=x86_64-linux
    VENDOR=unknown
    OSTYPE=linux
    MACHTYPE=x86_64
    SHLVL=1
    PWD=/home/me
    GROUP=iasusers
    HOST=local.host.name
    REMOTEHOST=local.host.name
    HOSTNAME=local.host.name
    
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;
01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;
42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;
31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*
.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;3
1:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.
rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=
01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01
;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;3
5:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:
*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.
mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=
01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;
35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*
.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.fla
c=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=
01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;
36:
    CVS_RSH=ssh
    GDL_PATH=+/usr/share/gnudatalanguage
    G_BROKEN_FILENAMES=1
    SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
    LANG=en_US.UTF-8
    LESSOPEN=|/usr/bin/lesspipe.sh %s
    QTDIR=/usr/lib64/qt-3.3
    QTINC=/usr/lib64/qt-3.3/include
    QTLIB=/usr/lib64/qt-3.3/lib
    COMPILER_OPTION=PGI
    LINUX_MPIHOME=/usr/local/mpich
    MPICH=/usr/local/mpich
    
LD_LIBRARY_PATH=/usr/local/mpich/lib:/usr/local/mpich/lib:/usr/local/lib:/u
sr/local/netcdf/lib:/usr/local/pgi/linux86-64/2014/libso/
    
LD_RUN_PATH=/usr/local/mpich/include/openmpi:/usr/local/mpich/include:/usr/
local/netcdf/include:/usr/local/include:/usr/local/lib
    NODES=/home/me/nodes
    HYDRA_HOST_FILE=/home/me/nodes
    MPIEXEC_PORT_RANGE=10000:10100
    MPIR_CVAR_CH3_PORT_RANGE=10000:10100
    NCARG_ROOT=/usr/local/ncarg
    NCARG_BIN=/usr/local/ncarg/bin
    NCARG_LIB=/usr/local/ncarg/lib
    NCARG_INCLUDE=/usr/local/ncarg/include
    NCL_COMMAND=/usr/local/ncarg/bin/ncl
    NCARG_RANGS=/data/NCAR/RANGS
    ITT=/usr/local/exelis
    IDL_DIR=/usr/local/exelis/idl83
    ENVI_DIR=/usr/local/exelis/envi51
    EXELIS_DIR=/usr/local/exelis
    
IDL_PATH=+/home/me/tools:+/usr/local/exelis/idl83/lib:+/usr/local/exelis/id
l83/examples:/projects/idl_coyote
    NETCDF=/usr/local/netcdf
    NETCDFLIB=/usr/local/netcdf/lib
    NETCDFINC=/usr/local/netcdf/include
    NETCDF4=1
    PNETINC=-I/usr/local/parallel_netcdf_hdf/include
    PNETLIB=-L/usr/local/parallel_netcdf_hdf/lib  -lnetcdf -lnetcdff -ldl
-lhdf5 -lhdf5_hl -lz -lsz
    HDF5=/usr/local
    HDFLIB=/usr/local/lib
    HDFINC=/usr/local/include
    PGI=/usr/local/pgi
    PGIVERSION=/usr/local/pgi/linux86-64/2014
    LM_LICENSE_FILE=/usr/local/pgi/license.dat
    CC=pgcc
    FC=pgfortran
    F90=pgfortran
    F77=pgfortran
    CXX=pgcpp
    MPIFC=mpif90
    MPIF90=mpif90
    MPIF77=mpif90
    MPICC=mpicc
    MPICXX=mpicxx
    CPP=pgCC -E
    CFLAGS= -Msignextend -fPIC
    CPPFLAGS= -DNDEBUG -DpgiFortran  -fPIC
    CXXFLAGS=  -fPIC
    F90FLAGS=  -fPIC
    FFLAGS= -w  -fPIC
    LDFLAGS= 
    RSHCOMMAND=ssh
    MP_STACK_SIZE=80000000
    OMP_NUM_THREADS=16
    JASPERLIB=/usr/local/lib
    JASPERINC=/usr/local/include
    LFC=-lgfortran
    LDSO=/lib64/ld-linux-x86-64.so.2
    GCCDIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7
    GCCINC=/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include
    G77DIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7
    HDF5_DISABLE_VERSION_CHECK=1
    WRFIO_NCD_LARGE_FILE_SUPPORT=1
    ESMF_DIR=/usr/local/esmfinstall
    ESMF_OS=Linux
    ESMF_BOPT=O
    ESMF_OPTLEVEL=0
    ESMF_ABI=64
    ESMF_COMM=mpich
    ESMF_COMPILER=pgi
    ESMF_INSTALL_PREFIX=/usr/local/esmf

  Hydra internal environment:
  ---------------------------
    GFORTRAN_UNBUFFERED_PRECONNECTED=y


    Proxy information:
    *********************
      [1] proxy: {local.machine.ip.address} (1 cores)
      Exec list: cpi.exe (1 processes);

      [2] proxy: {remote.machine.ip.address} (1 cores)
      Exec list: cpi.exe (1 processes);


===========================================================================
=======================

[mpiexec at local.host.name] Timeout set to -1 (-1 means infinite)
[mpiexec at local.host.name] Got a control port string of
{local.machine.ip.address}:10000

Proxy launch args: /usr/local/mpich/bin/hydra_pmi_proxy --control-port
{local.machine.ip.address}:10000 --debug --rmk user --launcher ssh --demux
poll --pgid 0 --retries 10 --usize -2 --proxy-id

Arguments being passed to proxy 0:
--version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
{local.machine.ip.address} --global-core-map 0,1,2 --pmi-id-map 0,0
--global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_6142_0
--pmi-process-mapping (vector,(0,2,1)) --ckpoint-num -1
--global-inherited-env 102 'USER=me' 'LOGNAME=me' 'HOME=/home/me'
'PATH=./:/bin:/usr/bin:/usr/local/bin:/usr/local/lib:/usr/local/netcdf/bin:
/usr/local/mpich/bin/openmpi:/usr/local/mpich/bin:/usr/local/mpich/include:
/usr/local/mpich/lib:/usr/local/ncarg:/usr/local/netcdf:/projects/WRF_UTIL/
WPSV3:/usr/local/netcdf/lib:/usr/local/netcdf/include:/usr/local/include:/u
sr/local/lib:/home/me/bin:/usr/local/ncarg/bin:/usr/lib64/qt-3.3/bin:/usr/l
ocal/bin:/bin:/usr/bin:/usr/local/pgi/linux86-64/2014/bin:/usr/local/pgi/li
nux86-64/2014/lib' 'MAIL=/var/spool/mail/me' 'SHELL=/bin/tcsh'
'SSH_CLIENT={local.machine.ip.address} 41583 22'
'SSH_CONNECTION={local.machine.ip.address} 41583
{local.machine.ip.address} 22' 'SSH_TTY=/dev/pts/3' 'TERM=xterm-color'
'SELINUX_ROLE_REQUESTED=' 'SELINUX_LEVEL_REQUESTED='
'SELINUX_USE_CURRENT_RANGE=' 'HOSTTYPE=x86_64-linux' 'VENDOR=unknown'
'OSTYPE=linux' 'MACHTYPE=x86_64' 'SHLVL=1' 'PWD=/home/me' 'GROUP=iasusers'
'HOST=local.host.name' 'REMOTEHOST=local.host.name'
'HOSTNAME=local.host.name'
'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33
;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30
;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01
;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:
*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;
31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*
.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg
=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=0
1;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;
35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35
:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*
.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm
=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01
;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:
*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.fl
ac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg
=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01
;36:' 'CVS_RSH=ssh' 'GDL_PATH=+/usr/share/gnudatalanguage'
'G_BROKEN_FILENAMES=1'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'LANG=en_US.UTF-8'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'QTDIR=/usr/lib64/qt-3.3'
'QTINC=/usr/lib64/qt-3.3/include' 'QTLIB=/usr/lib64/qt-3.3/lib'
'COMPILER_OPTION=PGI' 'LINUX_MPIHOME=/usr/local/mpich'
'MPICH=/usr/local/mpich'
'LD_LIBRARY_PATH=/usr/local/mpich/lib:/usr/local/mpich/lib:/usr/local/lib:/
usr/local/netcdf/lib:/usr/local/pgi/linux86-64/2014/libso/'
'LD_RUN_PATH=/usr/local/mpich/include/openmpi:/usr/local/mpich/include:/usr
/local/netcdf/include:/usr/local/include:/usr/local/lib'
'NODES=/home/me/nodes' 'HYDRA_HOST_FILE=/home/me/nodes'
'MPIEXEC_PORT_RANGE=10000:10100' 'MPIR_CVAR_CH3_PORT_RANGE=10000:10100'
'NCARG_ROOT=/usr/local/ncarg' 'NCARG_BIN=/usr/local/ncarg/bin'
'NCARG_LIB=/usr/local/ncarg/lib' 'NCARG_INCLUDE=/usr/local/ncarg/include'
'NCL_COMMAND=/usr/local/ncarg/bin/ncl' 'NCARG_RANGS=/data/NCAR/RANGS'
'ITT=/usr/local/exelis' 'IDL_DIR=/usr/local/exelis/idl83'
'ENVI_DIR=/usr/local/exelis/envi51' 'EXELIS_DIR=/usr/local/exelis'
'IDL_PATH=+/home/me/tools:+/usr/local/exelis/idl83/lib:+/usr/local/exelis/i
dl83/examples:/projects/idl_coyote' 'NETCDF=/usr/local/netcdf'
'NETCDFLIB=/usr/local/netcdf/lib' 'NETCDFINC=/usr/local/netcdf/include'
'NETCDF4=1' 'PNETINC=-I/usr/local/parallel_netcdf_hdf/include'
'PNETLIB=-L/usr/local/parallel_netcdf_hdf/lib  -lnetcdf -lnetcdff -ldl
-lhdf5 -lhdf5_hl -lz -lsz ' 'HDF5=/usr/local' 'HDFLIB=/usr/local/lib'
'HDFINC=/usr/local/include' 'PGI=/usr/local/pgi'
'PGIVERSION=/usr/local/pgi/linux86-64/2014'
'LM_LICENSE_FILE=/usr/local/pgi/license.dat' 'CC=pgcc' 'FC=pgfortran'
'F90=pgfortran' 'F77=pgfortran' 'CXX=pgcpp' 'MPIFC=mpif90' 'MPIF90=mpif90'
'MPIF77=mpif90' 'MPICC=mpicc' 'MPICXX=mpicxx' 'CPP=pgCC -E' 'CFLAGS=
-Msignextend -fPIC ' 'CPPFLAGS= -DNDEBUG -DpgiFortran  -fPIC  ' 'CXXFLAGS=
 -fPIC   ' 'F90FLAGS=  -fPIC  ' 'FFLAGS= -w  -fPIC ' 'LDFLAGS= '
'RSHCOMMAND=ssh' 'MP_STACK_SIZE=80000000' 'OMP_NUM_THREADS=16'
'JASPERLIB=/usr/local/lib' 'JASPERINC=/usr/local/include' 'LFC=-lgfortran'
'LDSO=/lib64/ld-linux-x86-64.so.2'
'GCCDIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'GCCINC=/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include'
'G77DIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'HDF5_DISABLE_VERSION_CHECK=1' 'WRFIO_NCD_LARGE_FILE_SUPPORT=1'
'ESMF_DIR=/usr/local/esmfinstall' 'ESMF_OS=Linux' 'ESMF_BOPT=O'
'ESMF_OPTLEVEL=0' 'ESMF_ABI=64' 'ESMF_COMM=mpich' 'ESMF_COMPILER=pgi'
'ESMF_INSTALL_PREFIX=/usr/local/esmf' --global-user-env 0
--global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
--proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1
--exec-local-env 0 --exec-wdir /home/me --exec-args 1 cpi.exe

Arguments being passed to proxy 1:
--version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
{remote.machine.ip.address} --global-core-map 0,1,2 --pmi-id-map 0,1
--global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_6142_0
--pmi-process-mapping (vector,(0,2,1)) --ckpoint-num -1
--global-inherited-env 102 'USER=me' 'LOGNAME=me' 'HOME=/home/me'
'PATH=./:/bin:/usr/bin:/usr/local/bin:/usr/local/lib:/usr/local/netcdf/bin:
/usr/local/mpich/bin/openmpi:/usr/local/mpich/bin:/usr/local/mpich/include:
/usr/local/mpich/lib:/usr/local/ncarg:/usr/local/netcdf:/projects/WRF_UTIL/
WPSV3:/usr/local/netcdf/lib:/usr/local/netcdf/include:/usr/local/include:/u
sr/local/lib:/home/me/bin:/usr/local/ncarg/bin:/usr/lib64/qt-3.3/bin:/usr/l
ocal/bin:/bin:/usr/bin:/usr/local/pgi/linux86-64/2014/bin:/usr/local/pgi/li
nux86-64/2014/lib' 'MAIL=/var/spool/mail/me' 'SHELL=/bin/tcsh'
'SSH_CLIENT={local.machine.ip.address} 41583 22'
'SSH_CONNECTION={local.machine.ip.address} 41583
{local.machine.ip.address} 22' 'SSH_TTY=/dev/pts/3' 'TERM=xterm-color'
'SELINUX_ROLE_REQUESTED=' 'SELINUX_LEVEL_REQUESTED='
'SELINUX_USE_CURRENT_RANGE=' 'HOSTTYPE=x86_64-linux' 'VENDOR=unknown'
'OSTYPE=linux' 'MACHTYPE=x86_64' 'SHLVL=1' 'PWD=/home/me' 'GROUP=iasusers'
'HOST=local.host.name' 'REMOTEHOST=local.host.name'
'HOSTNAME=local.host.name'
'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33
;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30
;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01
;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:
*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;
31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*
.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg
=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=0
1;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;
35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35
:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*
.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm
=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01
;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:
*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.fl
ac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg
=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01
;36:' 'CVS_RSH=ssh' 'GDL_PATH=+/usr/share/gnudatalanguage'
'G_BROKEN_FILENAMES=1'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'LANG=en_US.UTF-8'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'QTDIR=/usr/lib64/qt-3.3'
'QTINC=/usr/lib64/qt-3.3/include' 'QTLIB=/usr/lib64/qt-3.3/lib'
'COMPILER_OPTION=PGI' 'LINUX_MPIHOME=/usr/local/mpich'
'MPICH=/usr/local/mpich'
'LD_LIBRARY_PATH=/usr/local/mpich/lib:/usr/local/mpich/lib:/usr/local/lib:/
usr/local/netcdf/lib:/usr/local/pgi/linux86-64/2014/libso/'
'LD_RUN_PATH=/usr/local/mpich/include/openmpi:/usr/local/mpich/include:/usr
/local/netcdf/include:/usr/local/include:/usr/local/lib'
'NODES=/home/me/nodes' 'HYDRA_HOST_FILE=/home/me/nodes'
'MPIEXEC_PORT_RANGE=10000:10100' 'MPIR_CVAR_CH3_PORT_RANGE=10000:10100'
'NCARG_ROOT=/usr/local/ncarg' 'NCARG_BIN=/usr/local/ncarg/bin'
'NCARG_LIB=/usr/local/ncarg/lib' 'NCARG_INCLUDE=/usr/local/ncarg/include'
'NCL_COMMAND=/usr/local/ncarg/bin/ncl' 'NCARG_RANGS=/data/NCAR/RANGS'
'ITT=/usr/local/exelis' 'IDL_DIR=/usr/local/exelis/idl83'
'ENVI_DIR=/usr/local/exelis/envi51' 'EXELIS_DIR=/usr/local/exelis'
'IDL_PATH=+/home/me/tools:+/usr/local/exelis/idl83/lib:+/usr/local/exelis/i
dl83/examples:/projects/idl_coyote' 'NETCDF=/usr/local/netcdf'
'NETCDFLIB=/usr/local/netcdf/lib' 'NETCDFINC=/usr/local/netcdf/include'
'NETCDF4=1' 'PNETINC=-I/usr/local/parallel_netcdf_hdf/include'
'PNETLIB=-L/usr/local/parallel_netcdf_hdf/lib  -lnetcdf -lnetcdff -ldl
-lhdf5 -lhdf5_hl -lz -lsz ' 'HDF5=/usr/local' 'HDFLIB=/usr/local/lib'
'HDFINC=/usr/local/include' 'PGI=/usr/local/pgi'
'PGIVERSION=/usr/local/pgi/linux86-64/2014'
'LM_LICENSE_FILE=/usr/local/pgi/license.dat' 'CC=pgcc' 'FC=pgfortran'
'F90=pgfortran' 'F77=pgfortran' 'CXX=pgcpp' 'MPIFC=mpif90' 'MPIF90=mpif90'
'MPIF77=mpif90' 'MPICC=mpicc' 'MPICXX=mpicxx' 'CPP=pgCC -E' 'CFLAGS=
-Msignextend -fPIC ' 'CPPFLAGS= -DNDEBUG -DpgiFortran  -fPIC  ' 'CXXFLAGS=
 -fPIC   ' 'F90FLAGS=  -fPIC  ' 'FFLAGS= -w  -fPIC ' 'LDFLAGS= '
'RSHCOMMAND=ssh' 'MP_STACK_SIZE=80000000' 'OMP_NUM_THREADS=16'
'JASPERLIB=/usr/local/lib' 'JASPERINC=/usr/local/include' 'LFC=-lgfortran'
'LDSO=/lib64/ld-linux-x86-64.so.2'
'GCCDIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'GCCINC=/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include'
'G77DIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'HDF5_DISABLE_VERSION_CHECK=1' 'WRFIO_NCD_LARGE_FILE_SUPPORT=1'
'ESMF_DIR=/usr/local/esmfinstall' 'ESMF_OS=Linux' 'ESMF_BOPT=O'
'ESMF_OPTLEVEL=0' 'ESMF_ABI=64' 'ESMF_COMM=mpich' 'ESMF_COMPILER=pgi'
'ESMF_INSTALL_PREFIX=/usr/local/esmf' --global-user-env 0
--global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
--proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1
--exec-local-env 0 --exec-wdir /home/me --exec-args 1 cpi.exe

[mpiexec at local.host.name] Launch arguments:
/usr/local/mpich/bin/hydra_pmi_proxy --control-port
{local.machine.ip.address}:10000 --debug --rmk user --launcher ssh --demux
poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
[mpiexec at local.host.name] Launch arguments: /usr/bin/ssh -x
{remote.machine.ip.address} "/usr/local/mpich/bin/hydra_pmi_proxy"
--control-port {local.machine.ip.address}:10000 --debug --rmk user
--launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
[proxy:0:0 at local.host.name] got pmi command (from 0): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at local.host.name] PMI response: cmd=response_to_init
pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0 at local.host.name] got pmi command (from 0): get_maxes

[proxy:0:0 at local.host.name] PMI response: cmd=maxes kvsname_max=256
keylen_max=64 vallen_max=1024
[proxy:0:0 at local.host.name] got pmi command (from 0): get_appnum

[proxy:0:0 at local.host.name] PMI response: cmd=appnum appnum=0
[proxy:0:0 at local.host.name] got pmi command (from 0): get_my_kvsname

[proxy:0:0 at local.host.name] PMI response: cmd=my_kvsname kvsname=kvs_6142_0
[proxy:0:0 at local.host.name] got pmi command (from 0): get_my_kvsname

[proxy:0:0 at local.host.name] PMI response: cmd=my_kvsname kvsname=kvs_6142_0
[proxy:0:0 at local.host.name] got pmi command (from 0): get
kvsname=kvs_6142_0 key=PMI_process_mapping
[proxy:0:0 at local.host.name] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,1))
[proxy:0:0 at local.host.name] got pmi command (from 0): barrier_in

[proxy:0:0 at local.host.name] forwarding command (cmd=barrier_in) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at remote.host.name] got pmi command (from 4): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at remote.host.name] PMI response: cmd=response_to_init
pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_maxes

[proxy:0:1 at remote.host.name] PMI response: cmd=maxes kvsname_max=256
keylen_max=64 vallen_max=1024
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_appnum

[proxy:0:1 at remote.host.name] PMI response: cmd=appnum appnum=0
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at remote.host.name] PMI response: cmd=my_kvsname
kvsname=kvs_6142_0
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_my_kvsname

[proxy:0:1 at remote.host.name] PMI response: cmd=my_kvsname
kvsname=kvs_6142_0
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at local.host.name] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at local.host.name] PMI response to fd 7 pid 4: cmd=barrier_out
[proxy:0:1 at remote.host.name] got pmi command (from 4): get
kvsname=kvs_6142_0 key=PMI_process_mapping
[proxy:0:1 at remote.host.name] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,1))
[proxy:0:1 at remote.host.name] got pmi command (from 4): barrier_in

[proxy:0:1 at remote.host.name] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at local.host.name] PMI response: cmd=barrier_out
[proxy:0:0 at local.host.name] got pmi command (from 0): put
kvsname=kvs_6142_0 key=P0-businesscard
value=description#{local.machine.ip.address}$port#33774$ifname#{local.machi
ne.ip.address}$ 
[proxy:0:0 at local.host.name] cached command:
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
[proxy:0:0 at local.host.name] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at local.host.name] got pmi command (from 0): barrier_in

[proxy:0:0 at local.host.name] flushing 1 put command(s) out
[proxy:0:0 at local.host.name] forwarding command (cmd=put
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=put
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
[proxy:0:0 at local.host.name] forwarding command (cmd=barrier_in) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at remote.host.name] PMI response: cmd=barrier_out
[proxy:0:1 at remote.host.name] got pmi command (from 4): put
kvsname=kvs_6142_0 key=P1-businesscard
value=description#{remote.machine.ip.address}$port#44324$ifname#{remote.mac
hine.ip.address}$ 
[proxy:0:1 at remote.host.name] cached command:
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[proxy:0:1 at remote.host.name] PMI response: cmd=put_result rc=0 msg=success
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=put
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[proxy:0:1 at remote.host.name] got pmi command (from 4): barrier_in

[proxy:0:1 at remote.host.name] flushing 1 put command(s) out
[proxy:0:1 at remote.host.name] forwarding command (cmd=put
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$) upstream
[proxy:0:1 at remote.host.name] forwarding command (cmd=barrier_in) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at local.host.name] PMI response to fd 6 pid 4: cmd=keyval_cache
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[mpiexec at local.host.name] PMI response to fd 7 pid 4: cmd=keyval_cache
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[mpiexec at local.host.name] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at local.host.name] PMI response to fd 7 pid 4: cmd=barrier_out
[proxy:0:0 at local.host.name] PMI response: cmd=barrier_out
Process 0 of 2 is on local.host.name
[proxy:0:0 at local.host.name] got pmi command (from 0): get
kvsname=kvs_6142_0 key=P1-businesscard
[proxy:0:0 at local.host.name] PMI response: cmd=get_result rc=0 msg=success
value=description#{remote.machine.ip.address}$port#44324$ifname#{remote.mac
hine.ip.address}$
[proxy:0:1 at remote.host.name] PMI response: cmd=barrier_out
Process 1 of 2 is on remote.host.name
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff91d58820,
rbuf=0x7fff91d58828, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD)
failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1

===========================================================================
========
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===========================================================================
========
[proxy:0:1 at remote.host.name] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1 at remote.host.name] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at remote.host.name] main (./pm/pmiserv/pmip.c:206): demux engine
error waiting for event
[mpiexec at local.host.name] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at local.host.name] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
for completion
[mpiexec at local.host.name] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
completion
[mpiexec at local.host.name] main (./ui/mpich/mpiexec.c:331): process manager
error waiting for completion





On 7/22/14, 13:44 MDT, "Balaji, Pavan" <balaji at anl.gov> wrote:

>Bill,
>
>Just to make sure this is a firewall problem, can you try disabling the
>firewall for a short time to try out MPICH and see if it works correctly?
> Remember to turn off the firewall on all machines, not just the head
>node.
>
>  — Pavan
>
>On Jul 22, 2014, at 2:18 PM, Capehart, William J
><William.Capehart at sdsmt.edu> wrote:
>
>> That would be the one that comes with PGI 14.6 (MPICH 3.0.4)
>> 
>> Bill
>> 
>> 
>> On 7/22/14, 11:52 MDT, "Kenneth Raffenetti" <raffenet at mcs.anl.gov>
>>wrote:
>> 
>>> What version of MPICH/Hydra is this?
>>> 
>>> On 07/22/2014 12:48 PM, Capehart, William J wrote:
>>>> Hi All
>>>> 
>>>> We're running MPICH on a couple of machines with a brand new UNIX
>>>> distro (SL 6.5) that are on a vulnerable network, and rather than
>>>> leave the firewalls down we would like to run it through the firewall.
>>>> 
>>>> We have set the MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE
>>>> variables and have adjusted our iptables accordingly, in line with the
>>>> FAQ guidance.
>>>> 
>>>> Our passwordless SSH works fine between the machines.
>>>> 
>>>> But all of this gives us momentary success with the cpi and fpi MPICH
>>>> test programs.  But they crash with the firewall up. (but of course
>>>>run
>>>> happily with the firewall down).
>>>> 
>>>> An example of the basic output is below (nodesshort sends one process
>>>> to "this.machine" and one to the remote "that.machine"):
>>>> 
>>>> 
>>>> [this.machine]% mpiexec -n 2 -f nodesshort cpi.exe
>>>> 
>>>> Process 0 of 2 is on this.machine
>>>> 
>>>> Process 1 of 2 is on that.machine
>>>> 
>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>> 
>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff466a94d0,
>>>> rbuf=0x7fff466a94d8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>>>> MPI_COMM_WORLD) failed
>>>> 
>>>> MPIR_Reduce_impl(1029)..........:
>>>> 
>>>> MPIR_Reduce_intra(835)..........:
>>>> 
>>>> MPIR_Reduce_binomial(144).......:
>>>> 
>>>> MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
>>>> 
>>>> 
>>>> 
>>>> 
>>>>=======================================================================
>>>>==
>>>> ==========
>>>> 
>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>> 
>>>> =   EXIT CODE: 1
>>>> 
>>>> =   CLEANING UP REMAINING PROCESSES
>>>> 
>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>> 
>>>> 
>>>> 
>>>>=======================================================================
>>>>==
>>>> ==========
>>>> 
>>>> [proxy:0:1 at that.machine] HYD_pmcd_pmip_control_cmd_cb
>>>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>>>> 
>>>> [proxy:0:1 at that.machine] HYDT_dmxu_poll_wait_for_event
>>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>>> 
>>>> [proxy:0:1 at that.machine] main (./pm/pmiserv/pmip.c:206): demux engine
>>>> error waiting for event
>>>> 
>>>> [mpiexec at this.machine] HYDT_bscu_wait_for_completion
>>>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>>> terminated badly; aborting
>>>> 
>>>> [mpiexec at this.machine] HYDT_bsci_wait_for_completion
>>>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>>>>waiting
>>>> for completion
>>>> 
>>>> [mpiexec at this.machine] HYD_pmci_wait_for_completion
>>>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>>>> completion
>>>> 
>>>> [mpiexec at this.machine] main (./ui/mpich/mpiexec.c:331): process
>>>>manager
>>>> error waiting for completion
>>>> 
>>>> 
>>>> 
>>>> In debug mode it affirms that it is at least *starting* with the first
>>>> available port listed in MPIEXEC_PORT_RANGE.
>>>> 
>>>> But later we get output like this:
>>>> 
>>>> [mpiexec at this.machine] PMI response to fd 6 pid 4: cmd=keyval_cache
>>>> 
>>>> 
>>>> P0-businesscard=description#{this.machine's.ip.address}$port#54105$ifname#{this.machine's.ip.address}$
>>>> 
>>>> 
>>>> P1-businesscard=description#{that.machine's.ip.address}$port#47302$ifname#{that.machine's.ip.address}$
>>>> 
>>>> 
>>>> 
>>>> Does this mean that we have missed a firewall setting either in the
>>>> environment variables or in the ip tables themselves?
>>>> 
>>>> 
>>>> Ideas?
>>>> 
>>>> 
>>>> 
>>>> Thanks Much
>>>> 
>>>> Bill
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> discuss mailing list     discuss at mpich.org
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>> 
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>
>--
>Pavan Balaji
>http://www.mcs.anl.gov/~balaji
>
>_______________________________________________
>discuss mailing list     discuss at mpich.org
>To manage subscription options or unsubscribe:
>https://lists.mpich.org/mailman/listinfo/discuss


