[mpich-discuss] Support for MIC in MPICH 3.0.4
Maciej.Golebiewski at csiro.au
Mon Jul 8 18:21:04 CDT 2013
Hi Pavan,
Thanks for your reply, although it is not what I hoped to hear.
Unfortunately, converting the MPI application I'm trying to run to hybrid OpenMP/MPI would take too much effort to be a viable option at this point.
I wonder how IntelMPI is working around this limit (I've run it with up to 2 (host) + 2 x 240 (MIC) ranks).
Anyway, thanks again and I guess I will have to wait for the updates to MPSS/SCIF.
Cheers,
Maciej
> -----Original Message-----
> From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
> Sent: Tuesday, July 09, 2013 12:32 AM
> To: discuss at mpich.org
> Cc: Golebiewski, Maciej (CSIRO IM&T, Docklands)
> Subject: Re: [mpich-discuss] Support for MIC in MPICH 3.0.4
>
> Hi Maciej,
>
> Intel is aware of this problem that occurs when a large number of
> MIC processes are used. They are looking into it. Right now, the
> only workaround we can offer is to use fewer MIC processes and use
> threads on the MIC instead.
>
> -- Pavan
>
> On 07/08/2013 01:34 AM, Maciej.Golebiewski at csiro.au wrote:
> > Hi,
> >
> > I have managed to build MPICH for the MIC device and for the host
> > system with support for SCIF, using Intel compilers. I am able to
> > start MPI processes on the MIC device using mpirun from the host or
> > mpiexec on the device itself.
> >
> > I can also run the application with some ranks on the host and some
> > on the device.
> >
> > However, I can't start more than 4 ranks on any device if I try
> > to run my application across more than 1 node (be it host or MIC
> > cards):
> >
> > env I_MPI_MIC=enable mpirun -n 1 -host mike.it.csiro.au ./mpptest -bcast : -n 30 -host mic0 /tmp/mpptest.scif -bcast
> > 0: 1: 00000011: 00000170: readv err 0
> > Fatal error in MPI_Allreduce: Other MPI error, error stack:
> > MPI_Allreduce(861)...............: MPI_Allreduce(sbuf=0x7fff62945a08, rbuf=0x7c5de0, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
> > MPIR_Allreduce_impl(719).........:
> > MPIR_Allreduce_intra(201)........:
> > allreduce_intra_or_coll_fn(110)..:
> > MPIR_Allreduce_intra(362)........:
> > MPIC_Sendrecv(213)...............:
> > MPIC_Wait(569)...................:
> > MPIDI_CH3I_Progress(367).........:
> > MPID_nem_mpich_blocking_recv(894):
> > state_commrdy_handler(175).......:
> > state_commrdy_handler(138).......:
> > MPID_nem_scif_recv_handler(115)..: Communication error with rank 1
> > [...bunch of more error messages follows...]
> >
> > This problem does not occur if all ranks run on the same node:
> >
> > env I_MPI_MIC=enable mpirun -n 30 -host mic0 /tmp/mpptest.scif -bcast
> > set default
> > set font variable
> > set curve window y 0.15 0.90
> > set order d d d x y d
> > title left 'time (us)', bottom 'Size (bytes)',
> >       top 'Comm Perf for MPI (pkmic-mic0)',
> >  'type = blocking'
> >
> > #p0 p1 dist len ave time (us) rate
> > 0 29 29 0 11.448860 0.00
> > 0 29 29 32 12.009144 2.665e+6
> > [...more expected output follows...]
> >
> > Last November there was a short thread on this list (archive:
> > http://lists.mpich.org/pipermail/discuss/2012-November/000109.html)
> > about a seemingly similar problem, but it seemed to fizzle out before
> > any solution was offered.
> >
> > I wonder whether there has been any progress on it, or whether
> > someone else has encountered this problem too and managed to find a
> > workaround/solution, or at least has an explanation of what is
> > happening.
> >
> > In case someone would like to try to reproduce it, my configure
> > invocation for MPICH on MIC was:
> >
> > ./configure CC="icc" CXX=icpc F77="ifort -warn noalignments" FC="ifort -warn noalignments" \
> >   CXXFLAGS=-mmic CFLAGS=-mmic FFLAGS=-mmic FCFLAGS=-mmic \
> >   MPICHLIB_CFLAGS="-g" MPICHLIB_CXXFLAGS="-g" MPICHLIB_FFLAGS="-g" MPICHLIB_FCFLAGS="-g" \
> >   MPICHLIB_LDFLAGS="-g" LIBS=-lscif LDFLAGS=-L/opt/intel/mic/filesystem/base/lib64 \
> >   --with-device=ch3:nemesis:scif,tcp --disable-romio --disable-mpe --enable-smpcoll \
> >   --enable-fast=defopt --prefix=/apps/mpich-scif/3.0.4/mic --enable-error-checking=runtime \
> >   --enable-timing=runtime --host=i686-linux-gnu \
> >   --with-cross=maint/fcrosscompile/cross_values.txt.mike
> >
> > The settings in cross_values.txt.mike were:
> >
> > # The Fortran related cross compilation values.
> > CROSS_F77_SIZEOF_INTEGER="4"
> > CROSS_F77_SIZEOF_REAL="4"
> > CROSS_F77_SIZEOF_DOUBLE_PRECISION="8"
> > CROSS_F77_TRUE_VALUE="-1"
> > CROSS_F77_FALSE_VALUE="0"
> > CROSS_F90_ADDRESS_KIND="8"
> > CROSS_F90_OFFSET_KIND="8"
> > CROSS_F90_INTEGER_KIND="4"
> > CROSS_F90_REAL_MODEL=" 6 , 37"
> > CROSS_F90_DOUBLE_MODEL=" 15 , 307"
> > CROSS_F90_INTEGER_MODEL=" 9"
> > CROSS_F90_ALL_INTEGER_MODELS=" 2 , 1, 4 , 2, 9 , 4, 18 , 8,"
> > CROSS_F90_INTEGER_MODEL_MAP=" { 2 , 1 , 1 }, { 4 , 2 , 2 }, { 9 , 4 , 4 }, { 18 , 8 , 8 },"
> >
> > And finally, the configure invocation for MPICH on the host was:
> >
> > ./configure CC="icc" CXX=icpc F77="ifort -warn noalignments" FC="ifort -warn noalignments" \
> >   MPICHLIB_CFLAGS="-g" MPICHLIB_CXXFLAGS="-g" MPICHLIB_FFLAGS="-g" MPICHLIB_FCFLAGS="-g" \
> >   MPICHLIB_LDFLAGS="-g" LIBS=-lscif LDFLAGS=-L/usr/lib64 --with-device=ch3:nemesis:scif,tcp \
> >   --disable-romio --disable-mpe --enable-smpcoll --enable-fast=defopt \
> >   --prefix=/apps/mpich-scif/3.0.4/intel64 --enable-error-checking=runtime \
> >   --enable-timing=runtime
> >
> > Thanks,
> >
> > Maciej
> >
> > Maciej Golebiewski
> > Applications Support, Advanced Scientific Computing CSIRO IM&T
> >
> >
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
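
The workaround suggested in the thread (fewer MIC ranks, with threads on the MIC making up the difference) comes down to simple arithmetic plus a per-rank thread setting. Below is a minimal sketch of that arithmetic, not anything taken from MPICH itself. Its assumptions: the card exposes 240 hardware threads (inferred from the 240 ranks per card reported for Intel MPI), 4 ranks per card is the most that works across nodes (as observed in the original report), and the application has already been converted to hybrid MPI/OpenMP so that it honours OMP_NUM_THREADS. The helper names `threads_per_rank` and `hybrid_launch` are hypothetical. The host and executable names are reused from the thread purely as placeholders; mpptest itself is not a hybrid code.

```python
import shlex


def threads_per_rank(hw_threads: int, ranks: int) -> int:
    """Threads each rank may run so that ranks * threads <= hw_threads."""
    if ranks < 1:
        raise ValueError("need at least one rank")
    if hw_threads < ranks:
        raise ValueError("more ranks than hardware threads")
    return hw_threads // ranks


def hybrid_launch(host, host_exe, mic, mic_exe, mic_ranks, mic_hw_threads, args=()):
    """Build an mpirun command line: 1 host rank plus a few threaded MIC ranks.

    OMP_NUM_THREADS is passed only to the MIC ranks, using Hydra's
    per-executable `-env NAME VALUE` option.
    """
    nthreads = threads_per_rank(mic_hw_threads, mic_ranks)
    cmd = [
        "mpirun",
        "-n", "1", "-host", host, host_exe, *args,
        ":",
        "-n", str(mic_ranks), "-host", mic,
        "-env", "OMP_NUM_THREADS", str(nthreads),
        mic_exe, *args,
    ]
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # Assumed values: 4 ranks on the card, 240 hardware threads per card.
    print(hybrid_launch("mike.it.csiro.au", "./mpptest",
                        "mic0", "/tmp/mpptest.scif",
                        mic_ranks=4, mic_hw_threads=240,
                        args=["-bcast"]))
```

With these assumed numbers, each of the 4 MIC ranks would run 60 threads. Whether 4 ranks per card is really the practical ceiling depends on the SCIF issue discussed above, so treat both numbers as adjustable inputs.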