[mpich-discuss] Optimal Firewall Settings for MPICH/HYDRA
Capehart, William J
William.Capehart at sdsmt.edu
Tue Jul 22 14:57:03 CDT 2014
Hi Pavan:
I had already done that. With the firewall down, all is fine. With the
firewall up, and with the firewall rules allowing the range of ports requested
in MPIEXEC_PORT_RANGE, the program starts, but once data begins to be tossed
about in the cpi or fpi program, things go badly.
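For reference, the relevant pieces on both machines look roughly like this
(tcsh syntax, since that is our login shell; the iptables rule is a sketch of
the kind of rule we added, not our exact rules file):

    # advertise and listen only on the opened range
    setenv MPIEXEC_PORT_RANGE       10000:10100
    setenv MPIR_CVAR_CH3_PORT_RANGE 10000:10100

    # firewall side: accept inbound TCP on that range (ssh/22 is already open)
    iptables -I INPUT -p tcp --dport 10000:10100 -j ACCEPT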
I'm adding the full verbose dump of the cpi run below.
Bill
[me at localhost:/home/me]% mpiexec -v -n 2 -f nodesshort cpi.exe
host: {local.machine.ip.address}
host: {remote.machine.ip.address}
===========================================================================
=======================
mpiexec options:
----------------
Base path: /usr/local/mpich/bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
USER=me
LOGNAME=me
HOME=/home/me
PATH=./:/bin:/usr/bin:/usr/local/bin:/usr/local/lib:/usr/local/netcdf/bin:/
usr/local/mpich/bin/openmpi:/usr/local/mpich/bin:/usr/local/mpich/include:/
usr/local/mpich/lib:/usr/local/ncarg:/usr/local/netcdf:/projects/WRF_UTIL/W
PSV3:/usr/local/netcdf/lib:/usr/local/netcdf/include:/usr/local/include:/us
r/local/lib:/home/me/bin:/usr/local/ncarg/bin:/usr/lib64/qt-3.3/bin:/usr/lo
cal/bin:/bin:/usr/bin:/usr/local/pgi/linux86-64/2014/bin:/usr/local/pgi/lin
ux86-64/2014/lib
MAIL=/var/spool/mail/me
SHELL=/bin/tcsh
SSH_CLIENT={local.machine.ip.address} 41583 22
SSH_CONNECTION={local.machine.ip.address} 41583
{local.machine.ip.address} 22
SSH_TTY=/dev/pts/3
TERM=xterm-color
SELINUX_ROLE_REQUESTED=
SELINUX_LEVEL_REQUESTED=
SELINUX_USE_CURRENT_RANGE=
HOSTTYPE=x86_64-linux
VENDOR=unknown
OSTYPE=linux
MACHTYPE=x86_64
SHLVL=1
PWD=/home/me
GROUP=iasusers
HOST=local.host.name
REMOTEHOST=local.host.name
HOSTNAME=local.host.name
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;
01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;
42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;
31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*
.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;3
1:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.
rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=
01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01
;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;3
5:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:
*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.
mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=
01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;
35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*
.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.fla
c=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=
01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;
36:
CVS_RSH=ssh
GDL_PATH=+/usr/share/gnudatalanguage
G_BROKEN_FILENAMES=1
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
LANG=en_US.UTF-8
LESSOPEN=|/usr/bin/lesspipe.sh %s
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
QTLIB=/usr/lib64/qt-3.3/lib
COMPILER_OPTION=PGI
LINUX_MPIHOME=/usr/local/mpich
MPICH=/usr/local/mpich
LD_LIBRARY_PATH=/usr/local/mpich/lib:/usr/local/mpich/lib:/usr/local/lib:/u
sr/local/netcdf/lib:/usr/local/pgi/linux86-64/2014/libso/
LD_RUN_PATH=/usr/local/mpich/include/openmpi:/usr/local/mpich/include:/usr/
local/netcdf/include:/usr/local/include:/usr/local/lib
NODES=/home/me/nodes
HYDRA_HOST_FILE=/home/me/nodes
MPIEXEC_PORT_RANGE=10000:10100
MPIR_CVAR_CH3_PORT_RANGE=10000:10100
NCARG_ROOT=/usr/local/ncarg
NCARG_BIN=/usr/local/ncarg/bin
NCARG_LIB=/usr/local/ncarg/lib
NCARG_INCLUDE=/usr/local/ncarg/include
NCL_COMMAND=/usr/local/ncarg/bin/ncl
NCARG_RANGS=/data/NCAR/RANGS
ITT=/usr/local/exelis
IDL_DIR=/usr/local/exelis/idl83
ENVI_DIR=/usr/local/exelis/envi51
EXELIS_DIR=/usr/local/exelis
IDL_PATH=+/home/me/tools:+/usr/local/exelis/idl83/lib:+/usr/local/exelis/id
l83/examples:/projects/idl_coyote
NETCDF=/usr/local/netcdf
NETCDFLIB=/usr/local/netcdf/lib
NETCDFINC=/usr/local/netcdf/include
NETCDF4=1
PNETINC=-I/usr/local/parallel_netcdf_hdf/include
PNETLIB=-L/usr/local/parallel_netcdf_hdf/lib -lnetcdf -lnetcdff -ldl
-lhdf5 -lhdf5_hl -lz -lsz
HDF5=/usr/local
HDFLIB=/usr/local/lib
HDFINC=/usr/local/include
PGI=/usr/local/pgi
PGIVERSION=/usr/local/pgi/linux86-64/2014
LM_LICENSE_FILE=/usr/local/pgi/license.dat
CC=pgcc
FC=pgfortran
F90=pgfortran
F77=pgfortran
CXX=pgcpp
MPIFC=mpif90
MPIF90=mpif90
MPIF77=mpif90
MPICC=mpicc
MPICXX=mpicxx
CPP=pgCC -E
CFLAGS= -Msignextend -fPIC
CPPFLAGS= -DNDEBUG -DpgiFortran -fPIC
CXXFLAGS= -fPIC
F90FLAGS= -fPIC
FFLAGS= -w -fPIC
LDFLAGS=
RSHCOMMAND=ssh
MP_STACK_SIZE=80000000
OMP_NUM_THREADS=16
JASPERLIB=/usr/local/lib
JASPERINC=/usr/local/include
LFC=-lgfortran
LDSO=/lib64/ld-linux-x86-64.so.2
GCCDIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7
GCCINC=/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include
G77DIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7
HDF5_DISABLE_VERSION_CHECK=1
WRFIO_NCD_LARGE_FILE_SUPPORT=1
ESMF_DIR=/usr/local/esmfinstall
ESMF_OS=Linux
ESMF_BOPT=O
ESMF_OPTLEVEL=0
ESMF_ABI=64
ESMF_COMM=mpich
ESMF_COMPILER=pgi
ESMF_INSTALL_PREFIX=/usr/local/esmf
Hydra internal environment:
---------------------------
GFORTRAN_UNBUFFERED_PRECONNECTED=y
Proxy information:
*********************
[1] proxy: {local.machine.ip.address} (1 cores)
Exec list: cpi.exe (1 processes);
[2] proxy: {remote.machine.ip.address} (1 cores)
Exec list: cpi.exe (1 processes);
===========================================================================
=======================
[mpiexec at local.host.name] Timeout set to -1 (-1 means infinite)
[mpiexec at local.host.name] Got a control port string of
{local.machine.ip.address}:10000
Proxy launch args: /usr/local/mpich/bin/hydra_pmi_proxy --control-port
{local.machine.ip.address}:10000 --debug --rmk user --launcher ssh --demux
poll --pgid 0 --retries 10 --usize -2 --proxy-id
Arguments being passed to proxy 0:
--version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
{local.machine.ip.address} --global-core-map 0,1,2 --pmi-id-map 0,0
--global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_6142_0
--pmi-process-mapping (vector,(0,2,1)) --ckpoint-num -1
--global-inherited-env 102 'USER=me' 'LOGNAME=me' 'HOME=/home/me'
'PATH=./:/bin:/usr/bin:/usr/local/bin:/usr/local/lib:/usr/local/netcdf/bin:
/usr/local/mpich/bin/openmpi:/usr/local/mpich/bin:/usr/local/mpich/include:
/usr/local/mpich/lib:/usr/local/ncarg:/usr/local/netcdf:/projects/WRF_UTIL/
WPSV3:/usr/local/netcdf/lib:/usr/local/netcdf/include:/usr/local/include:/u
sr/local/lib:/home/me/bin:/usr/local/ncarg/bin:/usr/lib64/qt-3.3/bin:/usr/l
ocal/bin:/bin:/usr/bin:/usr/local/pgi/linux86-64/2014/bin:/usr/local/pgi/li
nux86-64/2014/lib' 'MAIL=/var/spool/mail/me' 'SHELL=/bin/tcsh'
'SSH_CLIENT={local.machine.ip.address} 41583 22'
'SSH_CONNECTION={local.machine.ip.address} 41583
{local.machine.ip.address} 22' 'SSH_TTY=/dev/pts/3' 'TERM=xterm-color'
'SELINUX_ROLE_REQUESTED=' 'SELINUX_LEVEL_REQUESTED='
'SELINUX_USE_CURRENT_RANGE=' 'HOSTTYPE=x86_64-linux' 'VENDOR=unknown'
'OSTYPE=linux' 'MACHTYPE=x86_64' 'SHLVL=1' 'PWD=/home/me' 'GROUP=iasusers'
'HOST=local.host.name' 'REMOTEHOST=local.host.name'
'HOSTNAME=local.host.name'
'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33
;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30
;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01
;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:
*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;
31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*
.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg
=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=0
1;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;
35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35
:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*
.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm
=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01
;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:
*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.fl
ac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg
=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01
;36:' 'CVS_RSH=ssh' 'GDL_PATH=+/usr/share/gnudatalanguage'
'G_BROKEN_FILENAMES=1'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'LANG=en_US.UTF-8'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'QTDIR=/usr/lib64/qt-3.3'
'QTINC=/usr/lib64/qt-3.3/include' 'QTLIB=/usr/lib64/qt-3.3/lib'
'COMPILER_OPTION=PGI' 'LINUX_MPIHOME=/usr/local/mpich'
'MPICH=/usr/local/mpich'
'LD_LIBRARY_PATH=/usr/local/mpich/lib:/usr/local/mpich/lib:/usr/local/lib:/
usr/local/netcdf/lib:/usr/local/pgi/linux86-64/2014/libso/'
'LD_RUN_PATH=/usr/local/mpich/include/openmpi:/usr/local/mpich/include:/usr
/local/netcdf/include:/usr/local/include:/usr/local/lib'
'NODES=/home/me/nodes' 'HYDRA_HOST_FILE=/home/me/nodes'
'MPIEXEC_PORT_RANGE=10000:10100' 'MPIR_CVAR_CH3_PORT_RANGE=10000:10100'
'NCARG_ROOT=/usr/local/ncarg' 'NCARG_BIN=/usr/local/ncarg/bin'
'NCARG_LIB=/usr/local/ncarg/lib' 'NCARG_INCLUDE=/usr/local/ncarg/include'
'NCL_COMMAND=/usr/local/ncarg/bin/ncl' 'NCARG_RANGS=/data/NCAR/RANGS'
'ITT=/usr/local/exelis' 'IDL_DIR=/usr/local/exelis/idl83'
'ENVI_DIR=/usr/local/exelis/envi51' 'EXELIS_DIR=/usr/local/exelis'
'IDL_PATH=+/home/me/tools:+/usr/local/exelis/idl83/lib:+/usr/local/exelis/i
dl83/examples:/projects/idl_coyote' 'NETCDF=/usr/local/netcdf'
'NETCDFLIB=/usr/local/netcdf/lib' 'NETCDFINC=/usr/local/netcdf/include'
'NETCDF4=1' 'PNETINC=-I/usr/local/parallel_netcdf_hdf/include'
'PNETLIB=-L/usr/local/parallel_netcdf_hdf/lib -lnetcdf -lnetcdff -ldl
-lhdf5 -lhdf5_hl -lz -lsz ' 'HDF5=/usr/local' 'HDFLIB=/usr/local/lib'
'HDFINC=/usr/local/include' 'PGI=/usr/local/pgi'
'PGIVERSION=/usr/local/pgi/linux86-64/2014'
'LM_LICENSE_FILE=/usr/local/pgi/license.dat' 'CC=pgcc' 'FC=pgfortran'
'F90=pgfortran' 'F77=pgfortran' 'CXX=pgcpp' 'MPIFC=mpif90' 'MPIF90=mpif90'
'MPIF77=mpif90' 'MPICC=mpicc' 'MPICXX=mpicxx' 'CPP=pgCC -E' 'CFLAGS=
-Msignextend -fPIC ' 'CPPFLAGS= -DNDEBUG -DpgiFortran -fPIC ' 'CXXFLAGS=
-fPIC ' 'F90FLAGS= -fPIC ' 'FFLAGS= -w -fPIC ' 'LDFLAGS= '
'RSHCOMMAND=ssh' 'MP_STACK_SIZE=80000000' 'OMP_NUM_THREADS=16'
'JASPERLIB=/usr/local/lib' 'JASPERINC=/usr/local/include' 'LFC=-lgfortran'
'LDSO=/lib64/ld-linux-x86-64.so.2'
'GCCDIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'GCCINC=/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include'
'G77DIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'HDF5_DISABLE_VERSION_CHECK=1' 'WRFIO_NCD_LARGE_FILE_SUPPORT=1'
'ESMF_DIR=/usr/local/esmfinstall' 'ESMF_OS=Linux' 'ESMF_BOPT=O'
'ESMF_OPTLEVEL=0' 'ESMF_ABI=64' 'ESMF_COMM=mpich' 'ESMF_COMPILER=pgi'
'ESMF_INSTALL_PREFIX=/usr/local/esmf' --global-user-env 0
--global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
--proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1
--exec-local-env 0 --exec-wdir /home/me --exec-args 1 cpi.exe
Arguments being passed to proxy 1:
--version 3.0.4 --iface-ip-env-name MPICH_INTERFACE_HOSTNAME --hostname
{remote.machine.ip.address} --global-core-map 0,1,2 --pmi-id-map 0,1
--global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_6142_0
--pmi-process-mapping (vector,(0,2,1)) --ckpoint-num -1
--global-inherited-env 102 'USER=me' 'LOGNAME=me' 'HOME=/home/me'
'PATH=./:/bin:/usr/bin:/usr/local/bin:/usr/local/lib:/usr/local/netcdf/bin:
/usr/local/mpich/bin/openmpi:/usr/local/mpich/bin:/usr/local/mpich/include:
/usr/local/mpich/lib:/usr/local/ncarg:/usr/local/netcdf:/projects/WRF_UTIL/
WPSV3:/usr/local/netcdf/lib:/usr/local/netcdf/include:/usr/local/include:/u
sr/local/lib:/home/me/bin:/usr/local/ncarg/bin:/usr/lib64/qt-3.3/bin:/usr/l
ocal/bin:/bin:/usr/bin:/usr/local/pgi/linux86-64/2014/bin:/usr/local/pgi/li
nux86-64/2014/lib' 'MAIL=/var/spool/mail/me' 'SHELL=/bin/tcsh'
'SSH_CLIENT={local.machine.ip.address} 41583 22'
'SSH_CONNECTION={local.machine.ip.address} 41583
{local.machine.ip.address} 22' 'SSH_TTY=/dev/pts/3' 'TERM=xterm-color'
'SELINUX_ROLE_REQUESTED=' 'SELINUX_LEVEL_REQUESTED='
'SELINUX_USE_CURRENT_RANGE=' 'HOSTTYPE=x86_64-linux' 'VENDOR=unknown'
'OSTYPE=linux' 'MACHTYPE=x86_64' 'SHLVL=1' 'PWD=/home/me' 'GROUP=iasusers'
'HOST=local.host.name' 'REMOTEHOST=local.host.name'
'HOSTNAME=local.host.name'
'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33
;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30
;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01
;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:
*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;
31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*
.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg
=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=0
1;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;
35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35
:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*
.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm
=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01
;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:
*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.fl
ac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg
=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01
;36:' 'CVS_RSH=ssh' 'GDL_PATH=+/usr/share/gnudatalanguage'
'G_BROKEN_FILENAMES=1'
'SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass' 'LANG=en_US.UTF-8'
'LESSOPEN=|/usr/bin/lesspipe.sh %s' 'QTDIR=/usr/lib64/qt-3.3'
'QTINC=/usr/lib64/qt-3.3/include' 'QTLIB=/usr/lib64/qt-3.3/lib'
'COMPILER_OPTION=PGI' 'LINUX_MPIHOME=/usr/local/mpich'
'MPICH=/usr/local/mpich'
'LD_LIBRARY_PATH=/usr/local/mpich/lib:/usr/local/mpich/lib:/usr/local/lib:/
usr/local/netcdf/lib:/usr/local/pgi/linux86-64/2014/libso/'
'LD_RUN_PATH=/usr/local/mpich/include/openmpi:/usr/local/mpich/include:/usr
/local/netcdf/include:/usr/local/include:/usr/local/lib'
'NODES=/home/me/nodes' 'HYDRA_HOST_FILE=/home/me/nodes'
'MPIEXEC_PORT_RANGE=10000:10100' 'MPIR_CVAR_CH3_PORT_RANGE=10000:10100'
'NCARG_ROOT=/usr/local/ncarg' 'NCARG_BIN=/usr/local/ncarg/bin'
'NCARG_LIB=/usr/local/ncarg/lib' 'NCARG_INCLUDE=/usr/local/ncarg/include'
'NCL_COMMAND=/usr/local/ncarg/bin/ncl' 'NCARG_RANGS=/data/NCAR/RANGS'
'ITT=/usr/local/exelis' 'IDL_DIR=/usr/local/exelis/idl83'
'ENVI_DIR=/usr/local/exelis/envi51' 'EXELIS_DIR=/usr/local/exelis'
'IDL_PATH=+/home/me/tools:+/usr/local/exelis/idl83/lib:+/usr/local/exelis/i
dl83/examples:/projects/idl_coyote' 'NETCDF=/usr/local/netcdf'
'NETCDFLIB=/usr/local/netcdf/lib' 'NETCDFINC=/usr/local/netcdf/include'
'NETCDF4=1' 'PNETINC=-I/usr/local/parallel_netcdf_hdf/include'
'PNETLIB=-L/usr/local/parallel_netcdf_hdf/lib -lnetcdf -lnetcdff -ldl
-lhdf5 -lhdf5_hl -lz -lsz ' 'HDF5=/usr/local' 'HDFLIB=/usr/local/lib'
'HDFINC=/usr/local/include' 'PGI=/usr/local/pgi'
'PGIVERSION=/usr/local/pgi/linux86-64/2014'
'LM_LICENSE_FILE=/usr/local/pgi/license.dat' 'CC=pgcc' 'FC=pgfortran'
'F90=pgfortran' 'F77=pgfortran' 'CXX=pgcpp' 'MPIFC=mpif90' 'MPIF90=mpif90'
'MPIF77=mpif90' 'MPICC=mpicc' 'MPICXX=mpicxx' 'CPP=pgCC -E' 'CFLAGS=
-Msignextend -fPIC ' 'CPPFLAGS= -DNDEBUG -DpgiFortran -fPIC ' 'CXXFLAGS=
-fPIC ' 'F90FLAGS= -fPIC ' 'FFLAGS= -w -fPIC ' 'LDFLAGS= '
'RSHCOMMAND=ssh' 'MP_STACK_SIZE=80000000' 'OMP_NUM_THREADS=16'
'JASPERLIB=/usr/local/lib' 'JASPERINC=/usr/local/include' 'LFC=-lgfortran'
'LDSO=/lib64/ld-linux-x86-64.so.2'
'GCCDIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'GCCINC=/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include'
'G77DIR=/usr/lib/gcc/x86_64-redhat-linux/4.4.7'
'HDF5_DISABLE_VERSION_CHECK=1' 'WRFIO_NCD_LARGE_FILE_SUPPORT=1'
'ESMF_DIR=/usr/local/esmfinstall' 'ESMF_OS=Linux' 'ESMF_BOPT=O'
'ESMF_OPTLEVEL=0' 'ESMF_ABI=64' 'ESMF_COMM=mpich' 'ESMF_COMPILER=pgi'
'ESMF_INSTALL_PREFIX=/usr/local/esmf' --global-user-env 0
--global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y'
--proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1
--exec-local-env 0 --exec-wdir /home/me --exec-args 1 cpi.exe
[mpiexec at local.host.name] Launch arguments:
/usr/local/mpich/bin/hydra_pmi_proxy --control-port
{local.machine.ip.address}:10000 --debug --rmk user --launcher ssh --demux
poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
[mpiexec at local.host.name] Launch arguments: /usr/bin/ssh -x
{remote.machine.ip.address} "/usr/local/mpich/bin/hydra_pmi_proxy"
--control-port {local.machine.ip.address}:10000 --debug --rmk user
--launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
[proxy:0:0 at local.host.name] got pmi command (from 0): init
pmi_version=1 pmi_subversion=1
[proxy:0:0 at local.host.name] PMI response: cmd=response_to_init
pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0 at local.host.name] got pmi command (from 0): get_maxes
[proxy:0:0 at local.host.name] PMI response: cmd=maxes kvsname_max=256
keylen_max=64 vallen_max=1024
[proxy:0:0 at local.host.name] got pmi command (from 0): get_appnum
[proxy:0:0 at local.host.name] PMI response: cmd=appnum appnum=0
[proxy:0:0 at local.host.name] got pmi command (from 0): get_my_kvsname
[proxy:0:0 at local.host.name] PMI response: cmd=my_kvsname kvsname=kvs_6142_0
[proxy:0:0 at local.host.name] got pmi command (from 0): get_my_kvsname
[proxy:0:0 at local.host.name] PMI response: cmd=my_kvsname kvsname=kvs_6142_0
[proxy:0:0 at local.host.name] got pmi command (from 0): get
kvsname=kvs_6142_0 key=PMI_process_mapping
[proxy:0:0 at local.host.name] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,1))
[proxy:0:0 at local.host.name] got pmi command (from 0): barrier_in
[proxy:0:0 at local.host.name] forwarding command (cmd=barrier_in) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at remote.host.name] got pmi command (from 4): init
pmi_version=1 pmi_subversion=1
[proxy:0:1 at remote.host.name] PMI response: cmd=response_to_init
pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_maxes
[proxy:0:1 at remote.host.name] PMI response: cmd=maxes kvsname_max=256
keylen_max=64 vallen_max=1024
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_appnum
[proxy:0:1 at remote.host.name] PMI response: cmd=appnum appnum=0
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_my_kvsname
[proxy:0:1 at remote.host.name] PMI response: cmd=my_kvsname
kvsname=kvs_6142_0
[proxy:0:1 at remote.host.name] got pmi command (from 4): get_my_kvsname
[proxy:0:1 at remote.host.name] PMI response: cmd=my_kvsname
kvsname=kvs_6142_0
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at local.host.name] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at local.host.name] PMI response to fd 7 pid 4: cmd=barrier_out
[proxy:0:1 at remote.host.name] got pmi command (from 4): get
kvsname=kvs_6142_0 key=PMI_process_mapping
[proxy:0:1 at remote.host.name] PMI response: cmd=get_result rc=0 msg=success
value=(vector,(0,2,1))
[proxy:0:1 at remote.host.name] got pmi command (from 4): barrier_in
[proxy:0:1 at remote.host.name] forwarding command (cmd=barrier_in) upstream
[proxy:0:0 at local.host.name] PMI response: cmd=barrier_out
[proxy:0:0 at local.host.name] got pmi command (from 0): put
kvsname=kvs_6142_0 key=P0-businesscard
value=description#{local.machine.ip.address}$port#33774$ifname#{local.machi
ne.ip.address}$
[proxy:0:0 at local.host.name] cached command:
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
[proxy:0:0 at local.host.name] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0 at local.host.name] got pmi command (from 0): barrier_in
[proxy:0:0 at local.host.name] flushing 1 put command(s) out
[proxy:0:0 at local.host.name] forwarding command (cmd=put
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=put
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
[proxy:0:0 at local.host.name] forwarding command (cmd=barrier_in) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0:1 at remote.host.name] PMI response: cmd=barrier_out
[proxy:0:1 at remote.host.name] got pmi command (from 4): put
kvsname=kvs_6142_0 key=P1-businesscard
value=description#{remote.machine.ip.address}$port#44324$ifname#{remote.mac
hine.ip.address}$
[proxy:0:1 at remote.host.name] cached command:
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[proxy:0:1 at remote.host.name] PMI response: cmd=put_result rc=0 msg=success
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=put
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[proxy:0:1 at remote.host.name] got pmi command (from 4): barrier_in
[proxy:0:1 at remote.host.name] flushing 1 put command(s) out
[proxy:0:1 at remote.host.name] forwarding command (cmd=put
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$) upstream
[proxy:0:1 at remote.host.name] forwarding command (cmd=barrier_in) upstream
[mpiexec at local.host.name] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at local.host.name] PMI response to fd 6 pid 4: cmd=keyval_cache
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[mpiexec at local.host.name] PMI response to fd 7 pid 4: cmd=keyval_cache
P0-businesscard=description#{local.machine.ip.address}$port#33774$ifname#{l
ocal.machine.ip.address}$
P1-businesscard=description#{remote.machine.ip.address}$port#44324$ifname#{
remote.machine.ip.address}$
[mpiexec at local.host.name] PMI response to fd 6 pid 4: cmd=barrier_out
[mpiexec at local.host.name] PMI response to fd 7 pid 4: cmd=barrier_out
[proxy:0:0 at local.host.name] PMI response: cmd=barrier_out
Process 0 of 2 is on local.host.name
[proxy:0:0 at local.host.name] got pmi command (from 0): get
kvsname=kvs_6142_0 key=P1-businesscard
[proxy:0:0 at local.host.name] PMI response: cmd=get_result rc=0 msg=success
value=description#{remote.machine.ip.address}$port#44324$ifname#{remote.mac
hine.ip.address}$
[proxy:0:1 at remote.host.name] PMI response: cmd=barrier_out
Process 1 of 2 is on remote.host.name
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff91d58820,
rbuf=0x7fff91d58828, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD)
failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
===========================================================================
========
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===========================================================================
========
[proxy:0:1 at remote.host.name] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1 at remote.host.name] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at remote.host.name] main (./pm/pmiserv/pmip.c:206): demux engine
error waiting for event
[mpiexec at local.host.name] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at local.host.name] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
for completion
[mpiexec at local.host.name] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
completion
[mpiexec at local.host.name] main (./ui/mpich/mpiexec.c:331): process manager
error waiting for completion
On 7/22/14, 13:44 MDT, "Balaji, Pavan" <balaji at anl.gov> wrote:
>Bill,
>
>Just to make sure this is a firewall problem, can you try disabling the
>firewall for a short time to try out MPICH and see if it works correctly?
> Remember to turn off the firewall on all machines, not just the head
>node.
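>
>On SL 6.5, temporarily stopping the firewall is typically done with
>something like the following on every node (run as root):
>
>    # stop the IPv4 and IPv6 firewall services (SL 6.x init scripts)
>    service iptables stop
>    service ip6tables stop
>
>and "service iptables start" brings it back afterwards.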
>
> — Pavan
>
>On Jul 22, 2014, at 2:18 PM, Capehart, William J
><William.Capehart at sdsmt.edu> wrote:
>
>> That would be the one that comes with PGI 14.6 (MPICH 3.0.4)
>>
>> Bill
>>
>>
>> On 7/22/14, 11:52 MDT, "Kenneth Raffenetti" <raffenet at mcs.anl.gov>
>>wrote:
>>
>>> What version of MPICH/Hydra is this?
>>>
>>> On 07/22/2014 12:48 PM, Capehart, William J wrote:
>>>> Hi All
>>>>
>>>> We're running MPICH on a couple of machines with a brand new UNIX distro
>>>> (SL 6.5) that are on a vulnerable network, and rather than leave the
>>>> firewalls down we would like to run it through the firewall.
>>>>
>>>> We have set the MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE
>>>> environment variables and have adjusted our iptables accordingly, in line
>>>> with the "FAQ" guidance.
>>>>
>>>> Our passwordless SSH works fine between the machines.
>>>>
>>>> But all of this gives us only momentary success with the cpi and fpi MPICH
>>>> test programs: they crash with the firewall up (but of course run
>>>> happily with the firewall down).
>>>>
>>>> An example of the basic output is below (nodesshort sends one process to
>>>> "this.machine" and one to the remote "that.machine"):
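>>>>
>>>> (For what it's worth, nodesshort is just a plain Hydra host file along
>>>> these lines, with the real hostnames in place of the placeholders,
>>>> one slot per machine:
>>>>
>>>>     this.machine:1
>>>>     that.machine:1
>>>> )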
>>>>
>>>>
>>>> [this.machine]% mpiexec -n 2 -f nodesshort cpi.exe
>>>>
>>>> Process 0 of 2 is on this.machine
>>>>
>>>> Process 1 of 2 is on that.machine
>>>>
>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>
>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff466a94d0,
>>>> rbuf=0x7fff466a94d8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>>>> MPI_COMM_WORLD) failed
>>>>
>>>> MPIR_Reduce_impl(1029)..........:
>>>>
>>>> MPIR_Reduce_intra(835)..........:
>>>>
>>>> MPIR_Reduce_binomial(144).......:
>>>>
>>>> MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
>>>>
>>>>
>>>>
>>>>
>>>>=======================================================================
>>>>==
>>>> ==========
>>>>
>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>
>>>> = EXIT CODE: 1
>>>>
>>>> = CLEANING UP REMAINING PROCESSES
>>>>
>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>
>>>>
>>>>
>>>>=======================================================================
>>>>==
>>>> ==========
>>>>
>>>> [proxy:0:1 at that.machine] HYD_pmcd_pmip_control_cmd_cb
>>>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>>>>
>>>> [proxy:0:1 at that.machine] HYDT_dmxu_poll_wait_for_event
>>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>>>
>>>> [proxy:0:1 at that.machine] main (./pm/pmiserv/pmip.c:206): demux engine
>>>> error waiting for event
>>>>
>>>> [mpiexec at this.machine] HYDT_bscu_wait_for_completion
>>>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>>> terminated badly; aborting
>>>>
>>>> [mpiexec at this.machine] HYDT_bsci_wait_for_completion
>>>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>>>>waiting
>>>> for completion
>>>>
>>>> [mpiexec at this.machine] HYD_pmci_wait_for_completion
>>>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>>>> completion
>>>>
>>>> [mpiexec at this.machine] main (./ui/mpich/mpiexec.c:331): process
>>>>manager
>>>> error waiting for completion
>>>>
>>>>
>>>>
>>>> In debug mode it affirms that it is at least *starting* with the first
>>>> available port listed in MPIEXEC_PORT_RANGE.
>>>>
>>>> But later we get output like this:
>>>>
>>>> [mpiexec at this.machine] PMI response to fd 6 pid 4: cmd=keyval_cache
>>>>
>>>>
>>>> P0-businesscard=description#{this.machine's.ip.address}$port#54105$ifname#{this.machine's.ip.address}$
>>>>
>>>> P1-businesscard=description#{that.machine's.ip.address}$port#47302$ifname#{that.machine's.ip.address}$
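>>>>
>>>> (A quick way we can double-check which ports the ranks actually bind,
>>>> assuming net-tools is installed on both machines, is something like:
>>>>
>>>>     netstat -tlnp | grep cpi
>>>>
>>>> run on each node while the job is up.)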
>>>>
>>>>
>>>>
>>>> Does this mean that we have missed a firewall setting, either in the
>>>> environment variables or in the iptables rules themselves?
>>>>
>>>>
>>>> Ideas?
>>>>
>>>>
>>>>
>>>> Thanks Much
>>>>
>>>> Bill
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> discuss mailing list discuss at mpich.org
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>
>>> _______________________________________________
>>> discuss mailing list discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>> _______________________________________________
>> discuss mailing list discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>
>--
>Pavan Balaji
>http://www.mcs.anl.gov/~balaji
>
>_______________________________________________
>discuss mailing list discuss at mpich.org
>To manage subscription options or unsubscribe:
>https://lists.mpich.org/mailman/listinfo/discuss