[mpich-discuss] _get_addr error while running application using MPICH

Joseph Schuchart schuchart at hlrs.de
Tue Nov 20 09:16:54 CST 2018


Zhifeng,

Another way to approach this is to start the application under gdb and 
set a breakpoint on MPICH's internal abort function (MPID_Abort, iirc). 
Once you hit it you can walk up the stack and try to find out where 
_get_addr was found to be faulty. Since you are running with a single 
process, starting under GDB should be straightforward:

$ gdb -ex "b MPID_Abort" -ex r ./real.exe

(If you pass arguments to real.exe, you have to use gdb's --args option.)
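For instance, the invocation with arguments might look like this (the argument name here is hypothetical; substitute whatever real.exe actually takes):

```shell
# Program arguments go after gdb's --args flag; --args must come last,
# and the -ex commands (set breakpoint, then run) still execute first.
# "namelist.input" is a made-up placeholder argument.
gdb -ex "b MPID_Abort" -ex r --args ./real.exe namelist.input
```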

Cheers
Joseph

On 11/19/18 1:49 PM, Zhifeng Yang via discuss wrote:
> Hi Hui,
> 
> I just searched the whole code. There is no MPI_T_* name in the code. I 
> may try the newer version later on. Thank you very much.
> 
> Zhifeng
> 
> 
> On Mon, Nov 19, 2018 at 12:14 PM Zhou, Hui <zhouh at anl.gov 
> <mailto:zhouh at anl.gov>> wrote:
> 
>     Hi Zhifeng,
> 
>     We just had a new mpich release: mpich-3.3rc1. You may try that
>     release and see if you still have the same error.
> 
>     That aside, does your code use the MPI_T_ interfaces? You may try
>     searching for the MPI_T_ prefix in your code base. In particular, I am
>     interested in any MPI_T_ calls before the MPI_Init call.
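A search like the one Hui suggests could look like this (the directory name and file contents below are a made-up demo, not Zhifeng's actual source tree):

```shell
# Hypothetical demo: build a tiny source tree containing one MPI_T_ call,
# then recursively search for the MPI_T_ prefix with file and line numbers.
mkdir -p demo_src
printf 'CALL MPI_T_init_thread(0, provided, ierr)\n' > demo_src/module_example.f90
grep -rn "MPI_T_" demo_src/
```

If this prints nothing, no MPI_T_ calls exist in the searched tree.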
> 
>     -- 
>     Hui Zhou
> 
>     On Mon, Nov 19, 2018 at 10:39:20AM -0500, Zhifeng Yang wrote:
>      >Hi Hui,
>      >Here are the outputs. I tried the following commands
>      >mpirun --version
>      >./cpi
>      >mpirun ./cpi
>      >mpirun -np 1 ./cpi
>      >
>      >[vy57456 at maya-usr1 em_real]$mpirun --version
>      >HYDRA build details:
>      >    Version:                                 3.2.1
>      >    Release Date:                            Fri Nov 10 20:21:01 CST 2017
>      >    CC:                              gcc
>      >    CXX:                             g++
>      >    F77:                             gfortran
>      >    F90:                             gfortran
>      >    Configure options:                       '--disable-option-checking'
>      >'--prefix=/umbc/xfs1/zzbatmos/users/vy57456/application/gfortran/mpich-3.2.1'
>      >'CC=gcc' 'CXX=g++' 'FC=gfortran' 'F77=gfortran'
>     '--cache-file=/dev/null'
>      >'--srcdir=.' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=-lpthread ' 'CPPFLAGS=
>      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpl/include
>      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpl/include
>      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/openpa/src
>      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/openpa/src
>      >-D_REENTRANT
>      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpi/romio/include'
>      >'MPLLIBNAME=mpl'
>      >    Process Manager:                         pmi
>      >    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
>      >    Topology libraries available:            hwloc
>      >    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
>      >    Checkpointing libraries available:
>      >    Demux engines available:                 poll select
>      >
>      >
>      >[vy57456 at maya-usr1 examples]$./cpi
>      >Process 0 of 1 is on maya-usr1
>      >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
>      >wall clock time = 0.000066
>      >
>      >
>      >[vy57456 at maya-usr1 examples]$mpirun ./cpi
>      >Process 0 of 1 is on maya-usr1
>      >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
>      >wall clock time = 0.000095
>      >
>      >[vy57456 at maya-usr1 examples]$mpirun -np 1 ./cpi
>      >Process 0 of 1 is on maya-usr1
>      >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
>      >wall clock time = 0.000093
>      >
>      >There is no error.
>      >
>      >Zhifeng
>      >
>      >
>      >On Mon, Nov 19, 2018 at 10:33 AM Zhou, Hui <zhouh at anl.gov
>     <mailto:zhouh at anl.gov>> wrote:
>      >
>      >> On Mon, Nov 19, 2018 at 10:14:54AM -0500, Zhifeng Yang wrote:
>      >> >Thank you for helping me with this error. Actually, real.exe is
>      >> >a portion of a very large weather model. It is very difficult to
>      >> >extract it or duplicate the error in a simple Fortran code, since
>      >> >I am not sure where the problem is. In fact, I can barely follow
>      >> >your discussion. I do not even know what "_get_addr" is. Is it
>      >> >related to MPI?
>      >>
>      >> It is difficult to pinpoint the problem without reproducing it.
>      >>
>      >> Anyway, let's start with mpirun. What is your output if you try:
>      >>
>      >>     mpirun --version
>      >>
>      >> Next, what is your mpich version? If you built mpich, locate the
>      >> `cpi` program in the examples folder and try `./cpi` and
>      >> `mpirun ./cpi`. Do you get any errors?
>      >>
>      >> --
>      >> Hui Zhou
>      >>
> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
> 
