[mpich-discuss] _get_addr error while running application using MPICH

Zhifeng Yang yangzf01 at gmail.com
Wed Nov 21 15:44:03 CST 2018


Hi Joseph,

I have not tried gdb yet, but I found a description in the initialization
module. Here is what it says:


! <DESCRIPTION>
! This routine USES the modules in WRF and then calls the init routines
! they provide to perform module specific initializations at the
! beginning of a run.  Note, this is done only once per run, not once per
! domain; domain specific initializations should be handled elsewhere,
! such as in <a href=start_domain.html>start_domain</a>.
!
! Certain framework specific module initializations in this file are
! dependent on order they are called. For example, since the quilt module
! relies on internal I/O, the init routine for internal I/O must be
! called first.  In the case of DM_PARALLEL compiles, the quilt module
! calls MPI_INIT as part of setting up and dividing communicators between
! compute and I/O server tasks.  Therefore, it must be called prior to
! module_dm, which will <em>also</em> try to call MPI_INIT if it sees
! that MPI has not been initialized yet (implementations of module_dm
! should in fact behave this way by first calling MPI_INITIALIZED before
! they try to call MPI_INIT).  If MPI is already initialized before
! the quilting module is called, quilting will not work.
!
! The phase argument is used to allow other superstructures like ESMF to
! place their initialization calls following the WRF initialization call
! that calls MPI_INIT().  When used with ESMF, ESMF will call wrf_init()
! which in turn will call phase 2 of this routine.  Phase 1 will be called
! earlier.
!
! </DESCRIPTION>

 INTEGER, INTENT(IN) :: phase    ! phase==1 means return after MPI_INIT()
                                 ! phase==2 means resume after MPI_INIT()

It mentions something about MPI_INIT, but I cannot quite understand its
meaning. Perhaps it will help you understand the problem.
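
If I am reading the comment right, the pattern it describes is the usual
one of checking MPI_INITIALIZED before calling MPI_INIT, so that whichever
module runs first (the quilting module in a DM_PARALLEL build) is the one
that actually initializes MPI. A rough Fortran sketch of that pattern, only
as an illustration and not WRF's actual code (the subroutine name is made
up), would be:

  ! Illustrative helper, not taken from WRF
  SUBROUTINE ensure_mpi_initialized
    IMPLICIT NONE
    INCLUDE 'mpif.h'
    LOGICAL :: already_init   ! .TRUE. if MPI_INIT was already called
    INTEGER :: ierr

    ! Ask MPI whether some other module (e.g. the I/O quilting code)
    ! has already called MPI_INIT.
    CALL MPI_INITIALIZED( already_init, ierr )

    ! Only initialize MPI if nobody has done so yet;
    ! calling MPI_INIT twice is an error.
    IF ( .NOT. already_init ) THEN
      CALL MPI_INIT( ierr )
    END IF
  END SUBROUTINE ensure_mpi_initialized

In other words, if anything calls MPI_INIT before the quilting module runs,
the comment says quilting will not work.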

Best,
Zhifeng


On Tue, Nov 20, 2018 at 10:17 AM Joseph Schuchart via discuss <
discuss at mpich.org> wrote:

> Zhifeng,
>
> Another way to approach this is to start the application under gdb and
> set a breakpoint on MPICH's internal abort function (MPID_Abort, iirc).
> Once you hit it, you can walk up the stack and try to find out where
> _get_addr was found to be faulty. Since you are running with a single
> process, starting under GDB should be straightforward:
>
> $ gdb -ex "b MPID_Abort" -ex r ./real.exe
>
> (If you pass arguments to real.exe you have to pass --args to gdb)
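>
> Once the breakpoint triggers, something along these lines should let you
> walk up to the caller:
>
>     (gdb) bt
>     (gdb) up
>     (gdb) info locals
>
> Here `bt` prints the backtrace at the abort, `up` moves one frame towards
> the caller, and `info locals` shows the variables in the current frame.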
>
> Cheers
> Joseph
>
> On 11/19/18 1:49 PM, Zhifeng Yang via discuss wrote:
> > Hi Hui,
> >
> > I just searched the whole code. There is no MPI_T_* name in the code. I
> > may try the newer version later on. Thank you very much.
> >
> > Zhifeng
> >
> >
> > On Mon, Nov 19, 2018 at 12:14 PM Zhou, Hui <zhouh at anl.gov> wrote:
> >
> >     Hi Zhifeng,
> >
> >     We just had a new mpich release: mpich-3.3rc1. You may try that
> >     release to see if you still have the same error.
> >
> >     That aside, does your code use MPI_T_ interfaces? You may try searching
> >     for the MPI_T_ prefix in your code base. In particular, I am interested
> >     in any MPI_T_ calls before the MPI_Init call.
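> >
> >     For example, something like this from the top of your source tree
> >     would catch any such calls:
> >
> >         grep -rn "MPI_T_" .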
> >
> >     --
> >     Hui Zhou
> >
> >     On Mon, Nov 19, 2018 at 10:39:20AM -0500, Zhifeng Yang wrote:
> >      >Hi Hui,
> >      >Here are the outputs. I tried the following commands
> >      >mpirun --version
> >      >./cpi
> >      >mpirun ./cpi
> >      >mpirun -np 1 ./cpi
> >      >
> >      >[vy57456 at maya-usr1 em_real]$mpirun --version
> >      >HYDRA build details:
> >      >    Version:                                 3.2.1
> >      >    Release Date:                            Fri Nov 10 20:21:01 CST 2017
> >      >    CC:                              gcc
> >      >    CXX:                             g++
> >      >    F77:                             gfortran
> >      >    F90:                             gfortran
> >      >    Configure options:                       '--disable-option-checking'
> >      >'--prefix=/umbc/xfs1/zzbatmos/users/vy57456/application/gfortran/mpich-3.2.1'
> >      >'CC=gcc' 'CXX=g++' 'FC=gfortran' 'F77=gfortran' '--cache-file=/dev/null'
> >      >'--srcdir=.' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=-lpthread ' 'CPPFLAGS=
> >      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpl/include
> >      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpl/include
> >      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/openpa/src
> >      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/openpa/src
> >      >-D_REENTRANT
> >      >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpi/romio/include'
> >      >'MPLLIBNAME=mpl'
> >      >    Process Manager:                         pmi
> >      >    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
> >      >    Topology libraries available:            hwloc
> >      >    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
> >      >    Checkpointing libraries available:
> >      >    Demux engines available:                 poll select
> >      >
> >      >
> >      >[vy57456 at maya-usr1 examples]$./cpi
> >      >Process 0 of 1 is on maya-usr1
> >      >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> >      >wall clock time = 0.000066
> >      >
> >      >
> >      >[vy57456 at maya-usr1 examples]$mpirun ./cpi
> >      >Process 0 of 1 is on maya-usr1
> >      >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> >      >wall clock time = 0.000095
> >      >
> >      >[vy57456 at maya-usr1 examples]$mpirun -np 1 ./cpi
> >      >Process 0 of 1 is on maya-usr1
> >      >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> >      >wall clock time = 0.000093
> >      >
> >      >There is no error.
> >      >
> >      >Zhifeng
> >      >
> >      >
> >      >On Mon, Nov 19, 2018 at 10:33 AM Zhou, Hui <zhouh at anl.gov> wrote:
> >      >
> >      >> On Mon, Nov 19, 2018 at 10:14:54AM -0500, Zhifeng Yang wrote:
> >      >> >Thank you for helping me on this error. Actually, real.exe is a
> >      >> >portion of a very large weather model. It is very difficult to
> >      >> >extract it or duplicate the error in a simple fortran code, since
> >      >> >I am not sure where the problem is. From your discussion, I can
> >      >> >barely understand it, in fact. I do not even know what "_get_addr"
> >      >> >is. Is it related to MPI?
> >      >>
> >      >> It is difficult to pin-point the problem without reproducing it.
> >      >>
> >      >> Anyway, let's start with mpirun. What is your output if you try:
> >      >>
> >      >>     mpirun --version
> >      >>
> >      >> Next, what is your mpich version? If you built mpich, locate the
> >      >> `cpi` program in the examples folder and try `./cpi` and
> >      >> `mpirun ./cpi`. Do you get any errors?
> >      >>
> >      >> --
> >      >> Hui Zhou
> >      >>
> >
> >