[mpich-discuss] MPICH 3.2 on BlueGene/Q

pramod kumbhar pramod.s.kumbhar at gmail.com
Thu Feb 25 14:49:52 CST 2016


Dear Rob,

I have built your fork of MPICH on BG-Q, and I see the error below when testing a
simple example of MPI_File_iwrite_all:

○ → srun -n 2 ./nonblock_io
Abort(1) on node 0: 'MPI_IN_PLACE' requries support for `PAMI_IN_PLACE`
Abort(1) on node 1: 'MPI_IN_PLACE' requries support for `PAMI_IN_PLACE`
2016-02-25 21:45:18.183 (WARN ) [0xfffacbf8f50]
424617:ibm.runjob.client.Job: normal termination with status 1 from rank 0
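
For reference, the test is essentially the following minimal sketch (my own reconstruction of nonblock_io, so the file name and buffer size are illustrative, not the exact source): each rank writes its own block of integers with the nonblocking collective MPI_File_iwrite_all and waits on the returned request.

#include <mpi.h>

#define COUNT 1024

int main(int argc, char **argv)
{
    int rank, buf[COUNT];
    MPI_File fh;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < COUNT; i++)
        buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * COUNT * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

    /* nonblocking collective write; this is the call that hits the abort above */
    MPI_File_iwrite_all(fh, buf, COUNT, MPI_INT, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}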

I have compiled mpich as follows:

./configure --enable-threads=multiple --host=powerpc64-bgq-linux \
    --with-device=pamid --prefix=$INSTALL_PREFIX --with-file-system=gpfs:BGQ \
    --with-bgq-install-dir=/bgsys/drivers/V1R2M0/ppc64 \
    --with-pami=/bgsys/drivers/V1R2M0/ppc64/comm/sys \
    --with-pami-include=/bgsys/drivers/V1R2M0/ppc64/comm/sys/include \
    --with-pami-lib=/bgsys/drivers/V1R2M0/ppc64/comm/sys/lib \
    --enable-fast=nochkmsg,notiming,O3 --with-assert-level=0 \
    --disable-error-messages --disable-debuginfo \
    MPICHLIB_CXXFLAGS="-qhot -qinline=800 -qflag=i:i -qsaveopt -qsuppress=1506-236" \
    MPICHLIB_CFLAGS="${MPICHLIB_CXXFLAGS}" \
    MPICHLIB_FFLAGS="${MPICHLIB_CXXFLAGS}" \
    MPICHLIB_F90FLAGS="${MPICHLIB_CXXFLAGS}"
make -j12
make install
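
If it helps in reproducing this, a quick way to see which pamid settings the build picks up at runtime is to run with PAMID_VERBOSE=2 (the same variable that appears in Dominic's output quoted below), roughly:

export PAMID_VERBOSE=2
srun -n 2 ./nonblock_io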

Let me know if I am missing anything.

-Pramod

On Thu, Feb 25, 2016 at 8:04 PM, pramod kumbhar <pramod.s.kumbhar at gmail.com>
wrote:

> Dear All,
>
> I came across the thread below in the archives about MPICH 3.2 on BG-Q.
>
> I am testing non-blocking collective and I/O functions on a cluster and would like to do the same on BG-Q. I have the following questions:
>
> 1. The last email from Dominic suggests that the last "compilable" version doesn't support async progress. Is there any version that has non-blocking support and compiles on BG-Q? (the fork from Rob?)
>
> 2. Do I have to consider anything specific on BG-Q while benchmarking non-blocking functions (from MPI-3)?
>
> Thanks in advance!
>
> Regards,
>
> Pramod
>
> P.S. I am copying the email thread from the archive; I am not sure whether this will be delivered to the correct thread...
>
>
> Hi All,
>
> Here is an update:
>
> MPICH 3.1.3 is the last version that passed the nonblocking test, even without setting PAMID_THREAD_MULTIPLE. However, setting PAMID_ASYNC_PROGRESS=1 causes an error (Abort(1) on node 7: 'locking' async progress not applicable...).
>
> [chiensh at cumulus coll.bak]$ which mpif90
> ~/apps/mpich/3.1.3/bin/mpif90
> [chiensh at cumulus coll.bak]$ make nonblocking
>   CC       nonblocking.o
>   CCLD     nonblocking
> [chiensh at cumulus coll.bak]$ srun -n 2 ./nonblocking
>  No errors
> [chiensh at cumulus coll.bak]$ srun -n 4 ./nonblocking
>  No errors
> [chiensh at cumulus coll.bak]$ srun -n 16 ./nonblocking
>  No errors
>
> Thanks all!
>
> Regards,
> Dominic
>
>
> On 11 Jan, 2016, at 2:29 pm, Dominic Chien <chiensh.acrc at gmail.com> wrote:
>
> > Thank you Jeff and Halim,
> >
> > Halim, I have tried 3.1.4 but it does not return 0 (error) when the program is finished, e.g. for a helloworld program
> > ==================================================================
> >   program hello
> >   include 'mpif.h'
> >   integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
> >
> >   call MPI_INIT(ierror)
> >   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
> >   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
> >   print*, 'node', rank, ': Hello world'
> >   call MPI_FINALIZE(ierror)
> >   end
> > ==================================================================
> >
> > Using MPI 3.1.rc4
> > ==================================================================
> > [chiensh at cumulus test]$ which mpif90
> > ~/apps/mpich/3.1.rc4/bin/mpif90
> > [chiensh at cumulus test]$ srun -n 2 ./a.out
> > node 1 : Hello world
> > node 0 : Hello world
> > [chiensh at cumulus test]$
> > ==================================================================
> > Using MPI 3.1.4
> > ==================================================================
> > [chiensh at cumulus test]$ which mpif90
> > ~/apps/mpich/3.1.4/bin/mpif90
> > [chiensh at cumulus test]$ srun -n 2 ./a.out
> > node 1 : Hello world
> > node 0 : Hello world
> > 2016-01-11 14:24:25.968 (WARN ) [0xfff7ef48b10] 75532:ibm.runjob.client.Job: terminated by signal 11
> > 2016-01-11 14:24:25.968 (WARN ) [0xfff7ef48b10] 75532:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 1
> > ==================================================================
> >
> > Jeff, after I set PAMID_THREAD_MULTIPLE=1 and PAMID_ASYNC_PROGRESS=1, it seems to have some "improvement": the nonblocking test can run with up to 4 processes sometimes, but sometimes it just gets a "deadlock" (see below)
> > ==========================================================
> > [chiensh at cumulus coll.bak]$ srun --nodes=4 --ntasks-per-node=1 nonblocking
> > MPIDI_Process.*
> >  verbose               : 2
> >  statistics            : 1
> >  contexts              : 32
> >  async_progress        : 1
> >  context_post          : 1
> >  pt2pt.limits
> >    application
> >      eager
> >        remote, local   : 4097, 4097
> >      short
> >        remote, local   : 113, 113
> >    internal
> >      eager
> >        remote, local   : 4097, 4097
> >      short
> >        remote, local   : 113, 113
> >  rma_pending           : 1000
> >  shmem_pt2pt           : 1
> >  disable_internal_eager_scale : 524288
> >  optimized.collectives : 0
> >  optimized.select_colls: 2
> >  optimized.subcomms    : 1
> >  optimized.memory      : 0
> >  optimized.num_requests: 1
> >  mpir_nbc              : 1
> >  numTasks              : 4
> > mpi thread level        : 'MPI_THREAD_SINGLE'
> > MPIU_THREAD_GRANULARITY : 'per object'
> > ASSERT_LEVEL            : 0
> > MPICH_LIBDIR           : not defined
> > The following MPICH_* environment variables were specified:
> > The following PAMID_* environment variables were specified:
> >  PAMID_STATISTICS=1
> >  PAMID_ASYNC_PROGRESS=1
> >  PAMID_THREAD_MULTIPLE=1
> >  PAMID_VERBOSE=2
> > The following PAMI_* environment variables were specified:
> > The following COMMAGENT_* environment variables were specified:
> > The following MUSPI_* environment variables were specified:
> > The following BG_* environment variables were specified:
> > No errors
> > ==========================================================
> > [chiensh at cumulus coll.bak]$ srun --nodes=4 --ntasks-per-node=1 nonblocking
> > MPIDI_Process.*
> >  verbose               : 2
> >  statistics            : 1
> >  contexts              : 32
> >  async_progress        : 1
> >  context_post          : 1
> >  pt2pt.limits
> >    application
> >      eager
> >        remote, local   : 4097, 4097
> >      short
> >        remote, local   : 113, 113
> >    internal
> >      eager
> >        remote, local   : 4097, 4097
> >      short
> >        remote, local   : 113, 113
> >  rma_pending           : 1000
> >  shmem_pt2pt           : 1
> >  disable_internal_eager_scale : 524288
> >  optimized.collectives : 0
> >  optimized.select_colls: 2
> >  optimized.subcomms    : 1
> >  optimized.memory      : 0
> >  optimized.num_requests: 1
> >  mpir_nbc              : 1
> >  numTasks              : 4
> > mpi thread level        : 'MPI_THREAD_SINGLE'
> > MPIU_THREAD_GRANULARITY : 'per object'
> > ASSERT_LEVEL            : 0
> > MPICH_LIBDIR           : not defined
> > The following MPICH_* environment variables were specified:
> > The following PAMID_* environment variables were specified:
> >  PAMID_STATISTICS=1
> >  PAMID_ASYNC_PROGRESS=1
> >  PAMID_THREAD_MULTIPLE=1
> >  PAMID_VERBOSE=2
> > The following PAMI_* environment variables were specified:
> > The following COMMAGENT_* environment variables were specified:
> > The following MUSPI_* environment variables were specified:
> > The following BG_* environment variables were specified:
> > (never return from here)
> > ==========================================================
> >
> > Thanks!
> >
> > Regards,
> > Dominic
> >
> >
> > On 11 Jan, 2016, at 12:08 pm, Halim Amer <aamer at anl.gov> wrote:
> >
> >> Dominic,
> >>
> >> There were a bunch of fixes that went to PAMID since v3.1rc4. You could try a release from the 3.1 series (i.e. from 3.1 through 3.1.4).
> >>
> >> Regards,
> >> --Halim
> >>
> >> www.mcs.anl.gov/~aamer
> >
> > On 11 Jan, 2016, at 11:30 am, Jeff Hammond <jeff.science at gmail.com> wrote:
> >
> >> I recall MPI-3 RMA on BGQ deadlocks if you set PAMID_THREAD_MULTIPLE (please see ALCF MPI docs to verify exact name), which is required for async progress.
> >>
> >> ARMCI-MPI test suite is one good way to validate MPI-3 RMA is working.
> >>
> >> Jeff
> >
>
>