[mpich-discuss] ./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed

Pavan Balaji balaji at mcs.anl.gov
Tue Aug 27 09:40:04 CDT 2013


Please don't drop discuss at mpich.org from the cc list.

I doubt demux, --assert-level and blcr are relevant here.  Also, the 
output of "make testing" is not helpful to us, because those tests can 
fail simply because your machines are too slow.

Did you try my suggestions from the previous email?  Could you try them 
and report back (with just that information)?
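
To recap, a minimal sketch of those runs, assuming you run them from 
your mpich-3.0.4 build directory and HYDRA_HOST_FILE still points at 
your hosts file; adding HYDRA_DEBUG=1 (the knob already in your 
environment script) should also show which proxy never connects back:

   mpiexec -np 4 ./examples/cpi                 # two machines, two slots each
   mpiexec -np 6 ./examples/cpi                 # all three machines
   HYDRA_DEBUG=1 mpiexec -np 6 ./examples/cpi   # same run, with launch tracing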

  -- Pavan

On 08/27/2013 09:35 AM, Joni-Pekka Kurronen wrote:
>
> Hi,
>
> I have already checked that.
> This is not a new install; I just moved from MPICH2 to MPICH3 and
> from gcc 4.6 to gcc 4.7.
>
> I use rsh-redone-rsh as the main launcher but have tested ssh as well,
> because after a crash the clients keep running on the slaves and keep
> using the hard disk.
>
> The following work:
> any machine alone
> mpi1 and kaak, or mpi1 and ugh
> but not all three together, except for running hostname.
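>
> (Spelled out with Hydra's -hosts option, those combinations would be
> run like this; the cpi path is a guess based on my build directory:)
>
>    mpiexec -hosts 192.168.0.41,192.168.0.42 -np 4 /mpi3/S3/mpich-3.0.4/examples/cpi
>    mpiexec -hosts 192.168.0.41,192.168.0.43 -np 4 /mpi3/S3/mpich-3.0.4/examples/cpi
>    mpiexec -hosts 192.168.0.41,192.168.0.42,192.168.0.43 -np 6 /mpi3/S3/mpich-3.0.4/examples/cpi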
>
> This could be related to:
> - demux (I have tried select and poll; with poll I have to restart the
> slave machines)
> - nfs4 (for some reason nfs4 currently has to be mounted manually on
> the slaves after a restart)
> - --assert-level changed to 0 (default 2)
> - blcr
>
>
> ch3:sock settings:
>
> =============
> hosts file:
> 192.168.0.41:2
> 192.168.0.42:2
> #192.168.0.43:2
> =============
> summary.xml errors
> <MPITEST>
> <NAME>spawninfo1</NAME>
> <NP>1</NP>
> <WORKDIR>./spawn</WORKDIR>
> <STATUS>fail</STATUS>
> <TESTDIFF>
> [mpiexec at mpi1] APPLICATION TIMED OUT
> [proxy:0:0 at mpi1] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:0 at mpi1] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at mpi1] main (./pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [proxy:1:0 at mpi1] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:1:0 at mpi1] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:1:0 at mpi1] main (./pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [mpiexec at mpi1] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec at mpi1] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
> for completion
> [mpiexec at mpi1] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:188): launcher returned error waiting for
> completion
> [mpiexec at mpi1] main (./ui/mpich/mpiexec.c:331): process manager error
> waiting for completion
> </TESTDIFF>
> </MPITEST>
> <MPITEST>
> <NAME>rdwrord</NAME>
> <NP>4</NP>
> <WORKDIR>./io</WORKDIR>
> <STATUS>fail</STATUS>
> <TESTDIFF>
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_2]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_0]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> </TESTDIFF>
> </MPITEST>
> <MPITEST>
> <NAME>rdwrzero</NAME>
> <NP>4</NP>
> <WORKDIR>./io</WORKDIR>
> <STATUS>fail</STATUS>
> <TESTDIFF>
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_2]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_0]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> </TESTDIFF>
> </MPITEST>
> <MPITEST>
> <NAME>getextent</NAME>
> <NP>2</NP>
> <WORKDIR>./io</WORKDIR>
> <STATUS>pass</STATUS>
> </MPITEST>
> <MPITEST>
> <NAME>setinfo</NAME>
> <NP>4</NP>
> <WORKDIR>./io</WORKDIR>
> <STATUS>fail</STATUS>
> <TESTDIFF>
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_2]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_0]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> </TESTDIFF>
> </MPITEST>
> <MPITEST>
> <NAME>setviewcur</NAME>
> <NP>4</NP>
> <WORKDIR>./io</WORKDIR>
> <STATUS>fail</STATUS>
> <TESTDIFF>
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_2]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
> Fatal error in PMPI_Bcast: Other MPI error
> [cli_0]: aborting job:
> Fatal error in PMPI_Bcast: Other MPI error
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> </TESTDIFF>
> </MPITEST>
>
> ....
> ....
> ....
>
>
> ============
> ============
> hosts file:
> 192.168.0.41:2
> 192.168.0.42:2
> 192.168.0.43:2
> ============
> ============
>
> When more than 4 processes are needed, the job hangs:
>
> Unexpected output in allred3: [mpiexec at mpi1] APPLICATION TIMED OUT
> Unexpected output in allred3: [proxy:0:0 at mpi1]
> HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert
> (!closed) failed
> Unexpected output in allred3: [proxy:0:0 at mpi1]
> HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback
> returned error status
> Unexpected output in allred3: [proxy:0:0 at mpi1] main
> (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
>
> ======
>
> When using ch3:nemesis the hard disk is not busy all the time, as it
> is with ch3:sock.
>
>
>
> ================================================
> This environment file is read for every process launched via rsh-redone-rsh:
> ================================================
> #!/bin/bash
>
> # JPK-Integration for Ubuntu 12.04 LTS
> # https://sites.google.com/site/jpsdatareviewstheboy007/ubuntu-lts-12-4-companion-whit-ltsp-mpich2-elmer-openfoam
> #
> # The CMake build loops and cannot finish; the documentation says the
> # CMake build is still under development
>
> # gcc 4.7
> # bdver1 optimization
>
> shopt -s expand_aliases
> export JPK_MPI_DIR=/mpi3         # MAIN DIRECTORY; SUBDIRECTORIES:
> export JPK_VER=C3                # BINARY CODE
> export JPK_VER_S=S3              # SOURCE CODE
> export JPK_VER_B=B3              # BASH FILES TO COMPILE AND CONFIGURE
> export JPK_INS=$JPK_MPI_DIR/$JPK_VER
> export JPK_BUI=$JPK_MPI_DIR/$JPK_VER_S
> export JPK_ELMER=$JPK_INS/elmer_6283 #035
> export JPK_ELMER_S=$JPK_BUI/elmer_6283
> export JPK_NETGEN_S=$JPK_BUI/netgen_668
> export JPK_NETGEN=$JPK_INS/netgen_668
>
> #GCC
> #export JPK_FLAGS="-Wl,--no-as-needed -fPIC -DAdd_ -m64 -pthread -O3 -fopenmp -lgomp -march=bdver1 -ftree-vectorize -funroll-loops"
> #export CFLAGS="-Wl,--no-as-needed -fPIC -DAdd_ -m64 -pthread -fopenmp -lgomp"
>
> # M A K E
>
> export JPK_JOBS=7
>
> # O P E N  MP
> export OMP_NUM_THREADS=6
>
>
> # M P I C 3
> # http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
> export JPK_MPICH2_S=$JPK_BUI/mpich-3.0.4
> export JPK_MPICH2=$JPK_INS/mpich-3.0.4
> export PATH=$JPK_MPICH2/bin:$PATH
> export MPI_HOME=$JPK_MPICH2
> export MPI_LIBS="-L$JPK_MPICH2/lib -lmpich -lmpichf90 -lmpl -lopa -lmpichcxx"
> export LD_LIBRARY_PATH=$JPK_MPICH2/lib:$JPK_MPICH2/bin # FIRST
>
> # M P I
>
> export MPI_IMPLEMENTATION=mpich
>
> export OMPI_CC=/$JPK_MPI_DIR/$JPK_VER/mpich-3.0.4/bin/mpicc
> export OMPI_CXX=/$JPK_MPI_DIR/$JPK_VER/mpich-3.0.4/bin/mpicxx
> export OMPI_77=/$JPK_MPI_DIR/$JPK_VER/mpich-3.0.4/bin/mpif77
> export OMPI_90=/$JPK_MPI_DIR/$JPK_VER/mpich-3.0.4/bin/mpif90
>
> # http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager
> export HYDRA_DEBUG=0
> export HYDRA_HOST_FILE=/mpi4/hosts
> export HYDRA_LAUNCHER=rsh
> #export HYDRA_LAUNCHER=ssh
> #export HYDRA_LAUNCHER_EXEC=/usr/bin/netkit-rsh
> export HYDRA_LAUNCHER_EXEC=/usr/bin/rsh-redone-rsh
> #export HYDRA_LAUNCHER_EXEC=/usr/bin/ssh
> export HYDRA_DEMUX=select
> #export HYDRA_DEMUX=select       # more processes than cores
> export HYDRA_PROXY_RETRY_COUNT=3
> #export HYDRA_RMK=pbs
> #export HYDRA_DEFAULT_RMK=pbs
> export HYDRA_ENV=all
> export MPIEXEC_PORT_RANGE=7000:7500
> #mpirun -launcher rsh -launcher-exec /usr/bin/netkit-rsh -demux select -n 21 ddd ./cpi
>
> # b l c r
>
> export HYDRA_CKPOINTLIB=blcr
> export HYDRA_CKPOINT_PREFIX=/mpi3/chekpoint/default.chk
> export HYDRA_CKPOINT_INTERVAL=10800
> export PATH=$JPK_INS/blcr-0.8.5/bin:$PATH
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JPK_INS/blcr-0.8.5/lib
> #-ckpoint-num 5
>
> # C M A K E - BUILD
> export PATH=$JPK_INS/cmake-2.8.10.2/bin:$PATH
> export CMAKE_COMMAND=$JPK_INS/cmake-2.8.10.2/bin
>
> # T O G L - UI netgen
>
> export JPK_TOGL="$JPK_INS/Togl-1.7"
> export JPK_TOGL_S="$JPK_BUI/Togl-1.7"
>
> # OCC
> export JPK_OCC=/usr/include/oce
>
> # M A T H
>
> export JPK_ARPACK_S=$JPK_BUI/ARPACK
> export JPK_ARPACK=$JPK_INS/ARPACK
>
> export BLAS=$JPK_INS/acml5.3.1/gfortran64_mp
> export BLAS32=$JPK_INS/acml5.3.1/gfortran64_mp
> #export BLAS=$JPK_INS/clAmdBlas-1.10.321/lib64
> #export BLAS32=$JPK_INS/clAmdBlas-1.10.321/include
> export FFT=$JPK_INS/fftw2
> export LAPACK=$BLAS
> export SCALAPACK=$JPK_INS/scalapack-2.0.2
> export BLACS=$SCALAPACK
>
> export JPK_LMETISDIR_S=$JPK_BUI/ParMetis-3.2.0
> export JPK_LMETISDIR=$JPK_INS/ParMetis-3.2.0
> export JPK_LMETISDIR32=$JPK_LMETISDIR
> export JPK_LMETISDIR_S5=$JPK_BUI/parmetis-4.0.2
> export JPK_LMETISDIR5=$JPK_INS/parmetis-4.0.2
>
> export METIS_DIR="" #$JPK_LMETISDIR MUST BE EMPTY
> export METIS_INCLUDE_DIR=$JPK_LMETISDIR
> export METIS_LIBDIR=$JPK_LMETISDIR
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JPK_LMETISDIR:$BLAS/lib:$FFT/lib
> #/mpi4/S/metis-5.0.2/GKlib
>
> export SCOTCHDIR=$JPK_INS/scotch_6.0.0
> export JPK_SCOTCHDIR_S=$JPK_BUI/scotch_6.0.0_esmumps
>
> export MUMPS_I=$JPK_INS/MUMPS_4.10.0
> export MUMPS=$JPK_BUI/MUMPS_4.10.0
>
> export HYPRE=$JPK_INS/hypre-2.8.0b
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HYPRE/lib
> export JPK_HYPRE_S=$JPK_BUI/hypre-2.8.0b
>
> export PARDISOLICMESSAGE=1
> export PARDISO=$JPK_INS/pardiso
> export PARDISO_LIC_PATH=$PARDISO
> export MKL_SERIAL=YES
>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SCOTCHDIR/lib:$MUMPS/lib:$BLAS/lib:$SCALAPACK/lib:$HYPRE/lib:$PARDISO:$METIS_LIBDIR:$JPK_ARPACK
>
> #HDF5
> #export JPK_HDF5_S=$JPK_BUI/hdf5-1.8.10-patch1 for vtk testing
> #export JPK_HDF5=$JPK_INS/hdf5-1.8.10-patch1
> export JPK_HDF5_S=$JPK_BUI/hdf5-1.8.10-patch1
> export JPK_HDF5=$JPK_INS/hdf5-1.8.10-patch1
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JPK_HDF5/lib
>
> # V T K
> export JPK_VTK_DIR=$JPK_INS/VTK-5.8.0
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JPK_VTK_DIR/lib/vtk-5.8
> export VTK_INCLUDEPATH=$JPK_VTK_DIR/include
>
> # Q T
> export QT_QMAKE_EXECUTABLE=/usr/bin/qmake-qt4
>
> # O P E N    F O A M
> # http://www.openfoam.org/download/source.php
>
> #export WM_SCHEDULER=wmakeScheduler
> #export WM_HOSTS="192.168.0.41:6 192.168.0.42:6 192.168.0.43:6"
> #export WM_NCOMPPROCS=$($WM_SCHEDULER -count)
> #export WM_COLOURS="black blue green cyan red magenta yellow"
>
> #export FOAM_INST_DIR=/mpi2/OpenFOAM
> #foamDotFile=$FOAM_INST_DIR/OpenFOAM-2.1.x/etc/bashrc
> #[ -f $foamDotFile ] && . $foamDotFile
> #source /mpi3/OpenFOAM/OpenFOAM-2.1.x/etc/bashrc
>
> #export FOAM_RUN=/mpi2/om
> #export OpenBIN=/mpi2/OpenFOAM/OpenFOAM-2.1.x/bin/tools
> #export PATH=OpenBIN$:$PATH
> #export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/OpenFOAM/tools/lib64
>
> #export ParaView_DIR=/mpi2/OpenFOAM/ThirdParty-2.1.x/platforms/linux64Gcc/paraview-3.12.0
> #export PATH=$ParaView_DIR/bin:$PATH
> #export PV_PLUGIN_PATH=$FOAM_LIBBIN/paraview-3.12
>
> # E L M E R
> export ELMER_HOME=$JPK_ELMER
> export ELMER_LIB=$JPK_ELMER/share/elmersolver/lib
> export PATH=$PATH:$ELMER_HOME/bin:$ELMER_HOME/lib
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ELMER_HOME/lib
> export ELMERGUI_HOME=$ELMER_HOME/bin
> export ELMER_POST_HOME=$ELMER_HOME/bin
>
> # S a l o m é
> #cd /mpi2/salome-meca/SALOME-MECA-2012.2-LGPL ; source envSalomeMeca.sh
> #cd ~/
>
> # Paraview
> #export PATH=$PATH:$JPK_INS/ParaView-3.14.1-Linux-64bit
> export PATH=$PATH:$JPK_INS/ParaView3
>
> # N E T G E N   P A R A L L E L $JPK_TCL/lib:$JPK_TK/lib:
> #export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JPK_TOGL:$JPK_NETGEN\par/lib:/usr/lib/
> #export NETGENDIR=$JPK_NETGEN\par/bin
> # NETGEN
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JPK_TOGL:$JPK_NETGEN/lib:/usr/lib/
> export NETGENDIR=$JPK_NETGEN/bin
>
> #crontab, ext editor
> export EDITOR=nano
>
> #space ball
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
>
> #vrpn & hidapi
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/$JPK_MPI_DIR/$JPK_VER/vrpn/lib:/$JPK_MPI_DIR/$JPK_VER/hidapi/lib:/usr/include/libusb-1.0
>
>
>
>
>
> 2013/8/27 Pavan Balaji <balaji at mcs.anl.gov>
>
>
>     This is almost certainly a network issue with your third machine
>     (kaak, I presume?).
>
>     Thanks for making sure "hostname" works fine on all machines.  That
>     means that your ssh connections are set up correctly.  But a non-MPI
>     program, such as hostname, does not check the connection from kaak
>     back to mpi1.
>
>     Can you try a simple program like "examples/cpi" in the build
>     directory on all machines?  Try it on 2 machines (mpiexec -np 4) and
>     3 machines (mpiexec -np 6).
>
>     If the third machine is in fact having problems running the
>     application (example checks are sketched after this list):
>
>     1. Make sure there's no firewall on the third machine.
>
>     2. Make sure the /etc/hosts file is consistent on both machines
>     (mpi1 and kaak).
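>
>     For example (the exact commands depend on your distribution; ufw
>     and iptables are just the usual suspects on Ubuntu):
>
>        sudo ufw status        # should be inactive, or allow the cluster subnet
>        sudo iptables -L -n    # no REJECT/DROP rules for 192.168.0.0/24
>        cat /etc/hosts         # mpi1/ugh/kaak entries must match on every machine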
>
>       -- Pavan
>
>
>     On 08/27/2013 06:46 AM, Joni-Pekka Kurronen wrote:
>
>
>         I have:
>         -Ubuntu 12.04
>         -rsh-redone-rsh
>         -three machines
>         -mpich3
>         -have tried export HYDRA_DEMUX=select / poll
>         -have tried ssh/rsh
>         -have added to LIBS: event_core event_pthreads
>
>         I can run the tests on one or two machines without error, but
>         when I add the third machine to the cluster the demux engine
>         goes mad: a connection hangs and nothing happens.
>
>
>         <MPITEST>
>         <NAME>uoplong</NAME>
>         <NP>11</NP>
>         <WORKDIR>./coll</WORKDIR>
>         <STATUS>fail</STATUS>
>         <TESTDIFF>
>         [mpiexec at mpi1] APPLICATION TIMED OUT
>         [proxy:0:0 at mpi1] HYD_pmcd_pmip_control_cmd_cb
>         (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>         [proxy:0:0 at mpi1] HYDT_dmxu_poll_wait_for_event
>         (./tools/demux/demux_poll.c:77): callback returned error status
>         [proxy:0:0 at mpi1] main (./pm/pmiserv/pmip.c:206): demux engine error
>         waiting for event
>         [mpiexec at mpi1] HYDT_bscu_wait_for_completion
>         (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>         terminated badly; aborting
>         [mpiexec at mpi1] HYDT_bsci_wait_for_completion
>         (./tools/bootstrap/src/bsci_wait.c:23): launcher returned
>         error waiting
>         for completion
>         [mpiexec at mpi1] HYD_pmci_wait_for_completion
>         (./pm/pmiserv/pmiserv_pmci.c:188): launcher returned error
>         waiting for
>         completion
>         [mpiexec at mpi1] main (./ui/mpich/mpiexec.c:331): process manager
>         error
>         waiting for completion
>         </TESTDIFF>
>         </MPITEST>
>
>         Also I can run
>         joni at mpi1:/mpi3/S3/hpcc-1.4.2$ mpiexec -np 6 hostname
>         mpi1
>         mpi1
>         ugh
>         ugh
>         kaak
>         kaak
>
>         but if I run
>         joni at mpi1:/mpi3/S3/hpcc-1.4.2$ mpiexec -np 6 ls
>         I get the directory listing from only one process, and the
>         system hangs until I have restarted the slave machines!
>
>
>
>
>
>     --
>     Pavan Balaji
>     http://www.mcs.anl.gov/~balaji
>
>
>
>
> --
> Joni-Pekka Kurronen
> Joni.Kurronen at gmail.com
> gsm. +358 50 521 2279

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


