[mpich-discuss] _get_addr error while running application using MPICH
Zhifeng Yang
yangzf01 at gmail.com
Wed Nov 21 15:44:03 CST 2018
Hi Joseph,
I have not tried gdb yet. But I found a description in the initialization
module. Here is what it says:
! <DESCRIPTION>
! This routine USES the modules in WRF and then calls the init routines
! they provide to perform module specific initializations at the
! beginning of a run. Note, this is done only once per run, not once per
! domain; domain specific initializations should be handled elsewhere,
! such as in <a href=start_domain.html>start_domain</a>.
!
! Certain framework specific module initializations in this file are
! dependent on the order in which they are called. For example, since the quilt module
! relies on internal I/O, the init routine for internal I/O must be
! called first. In the case of DM_PARALLEL compiles, the quilt module
! calls MPI_INIT as part of setting up and dividing communicators between
! compute and I/O server tasks. Therefore, it must be called prior to
! module_dm, which will <em>also</em> try to call MPI_INIT if it sees
! that MPI has not been initialized yet (implementations of module_dm
! should in fact behave this way by first calling MPI_INITIALIZED before
! they try to call MPI_INIT). If MPI is already initialized before the
! quilting module is called, quilting will not work.
!
! The phase argument is used to allow other superstructures like ESMF to
! place their initialization calls following the WRF initialization call
! that calls MPI_INIT(). When used with ESMF, ESMF will call wrf_init()
! which in turn will call phase 2 of this routine. Phase 1 will be called
! earlier.
!
! </DESCRIPTION>
INTEGER, INTENT(IN) :: phase ! phase==1 means return after MPI_INIT()
! phase==2 means resume after MPI_INIT()
It mentions something about MPI_INIT, but I cannot quite understand its meaning.
It may help you understand the problem.
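The part about MPI_INITIALIZED seems to describe the usual guard of
checking whether MPI is already up before calling MPI_INIT. A minimal
stand-alone Fortran sketch of that pattern (not the actual WRF code)
would be something like:

      program init_check
        use mpi                      ! standard MPI Fortran bindings
        implicit none
        logical :: already_inited
        integer :: ierr
        ! Only initialize MPI if another component (e.g. the quilt
        ! module) has not already done so.
        call MPI_Initialized(already_inited, ierr)
        if (.not. already_inited) then
           call MPI_Init(ierr)
        end if
        call MPI_Finalize(ierr)
      end program init_check

If the quilt module has already called MPI_INIT, module_dm would then
skip its own call.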
Best,
Zhifeng
On Tue, Nov 20, 2018 at 10:17 AM Joseph Schuchart via discuss <
discuss at mpich.org> wrote:
> Zhifeng,
>
> Another way to approach this is to start the application under gdb and
> set a breakpoint on MPICH's internal abort function (MPID_Abort, iirc).
> Once you hit it you can walk up the stack and try to find out where
> _get_addr was found to be faulty. Since you are running with a single
> process, starting under GDB should be straightforward:
>
> $ gdb -ex "b MPID_Abort" -ex r ./real.exe
>
> (If you pass arguments to real.exe, you have to pass --args to gdb.)
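>
> For example, if you did pass arguments (hypothetical ones here), the
> invocation would look roughly like:
>
> $ gdb -ex "b MPID_Abort" -ex r --args ./real.exe <your args>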
>
> Cheers
> Joseph
>
> On 11/19/18 1:49 PM, Zhifeng Yang via discuss wrote:
> > Hi Hui,
> >
> > I just searched the whole code. There is no MPI_T_* name in the code. I
> > may try the newer version later on. Thank you very much.
> >
> > Zhifeng
> >
> >
> > On Mon, Nov 19, 2018 at 12:14 PM Zhou, Hui <zhouh at anl.gov> wrote:
> >
> > Hi Zhifeng,
> >
> > We just had a new mpich release: mpich-3.3rc1. You may try that
> > release and see if you still get the same error.
> >
> > That aside, does your code use the MPI_T_ interfaces? You may try
> > searching for the MPI_T_ prefix in your code base. In particular, I am
> > interested in any MPI_T_ calls before the MPI_Init call.
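> >
> > For example, running something along these lines from the top of the
> > source tree would list any occurrences:
> >
> >     grep -rn "MPI_T_" .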
> >
> > --
> > Hui Zhou
> >
> > On Mon, Nov 19, 2018 at 10:39:20AM -0500, Zhifeng Yang wrote:
> > >Hi Hui,
> > >Here are the outputs. I tried the following commands
> > >mpirun --version
> > >./cpi
> > >mpirun ./cpi
> > >mpirun -np 1 ./cpi
> > >
> > >[vy57456 at maya-usr1 em_real]$mpirun --version
> > >HYDRA build details:
> > >    Version:                                 3.2.1
> > >    Release Date:                            Fri Nov 10 20:21:01 CST 2017
> > >    CC:                                      gcc
> > >    CXX:                                     g++
> > >    F77:                                     gfortran
> > >    F90:                                     gfortran
> > >    Configure options:                       '--disable-option-checking'
> > >'--prefix=/umbc/xfs1/zzbatmos/users/vy57456/application/gfortran/mpich-3.2.1'
> > >'CC=gcc' 'CXX=g++' 'FC=gfortran' 'F77=gfortran' '--cache-file=/dev/null'
> > >'--srcdir=.' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=-lpthread ' 'CPPFLAGS=
> > >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpl/include
> > >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpl/include
> > >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/openpa/src
> > >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/openpa/src
> > >-D_REENTRANT
> > >-I/home/vy57456/zzbatmos_user/application/gfortran/source_code/mpich-3.2.1/src/mpi/romio/include'
> > >'MPLLIBNAME=mpl'
> > >    Process Manager:                         pmi
> > >    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
> > >    Topology libraries available:            hwloc
> > >    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
> > >    Checkpointing libraries available:
> > >    Demux engines available:                 poll select
> > >
> > >
> > >[vy57456 at maya-usr1 examples]$./cpi
> > >Process 0 of 1 is on maya-usr1
> > >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > >wall clock time = 0.000066
> > >
> > >
> > >[vy57456 at maya-usr1 examples]$mpirun ./cpi
> > >Process 0 of 1 is on maya-usr1
> > >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > >wall clock time = 0.000095
> > >
> > >[vy57456 at maya-usr1 examples]$mpirun -np 1 ./cpi
> > >Process 0 of 1 is on maya-usr1
> > >pi is approximately 3.1415926544231341, Error is 0.0000000008333410
> > >wall clock time = 0.000093
> > >
> > >There is no error.
> > >
> > >Zhifeng
> > >
> > >
> > >On Mon, Nov 19, 2018 at 10:33 AM Zhou, Hui <zhouh at anl.gov> wrote:
> > >
> > >> On Mon, Nov 19, 2018 at 10:14:54AM -0500, Zhifeng Yang wrote:
> > >> >Thank you for helping me on this error. Actually, real.exe is a
> > >> >portion of a very large weather model. It is very difficult to
> > >> >extract it or to duplicate the error in a simple Fortran code,
> > >> >since I am not sure where the problem is. In fact, I can barely
> > >> >understand your discussion, and I do not even know what "_get_addr"
> > >> >is. Is it related to MPI?
> > >>
> > >> It is difficult to pinpoint the problem without reproducing it.
> > >>
> > >> Anyway, let's start with mpirun. What is your output if you try:
> > >>
> > >> mpirun --version
> > >>
> > >> Next, what is your mpich version? If you built mpich, locate the
> > >> `cpi` program in the examples folder and try `./cpi` and
> > >> `mpirun ./cpi`. Do you get an error?
> > >>
> > >> --
> > >> Hui Zhou
> > >>
> >
> >