[mpich-discuss] Support for MIC in MPICH 3.0.4
Pavan Balaji
balaji at mcs.anl.gov
Mon Jul 8 09:32:21 CDT 2013
Hi Maciej,
Intel is aware of this problem that occurs when a large number of MIC
processes are used. They are looking into it. Right now, the only
workaround we can offer is to use fewer MIC processes and use threads on
the MIC instead.
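As a rough illustration of that workaround (a sketch only, not something tested in this thread), the helper below trades ranks for threads: it caps the number of ranks per device and derives how many threads each rank would need to keep the same total parallelism. Several things here are assumptions: the cap of 4 is simply the limit reported in the quoted message below, the executables ./app.host and /tmp/app.mic are hypothetical names for a hybrid MPI+OpenMP build of the application that honours OMP_NUM_THREADS, and the thread count is passed with Hydra's per-executable -env option.

```python
import shlex


def plan_layout(workers, max_ranks):
    """Split `workers` units of parallelism into (ranks, threads_per_rank).

    The rank count never exceeds `max_ranks`; threads are rounded up so
    that ranks * threads >= workers.
    """
    if workers < 1 or max_ranks < 1:
        raise ValueError("workers and max_ranks must be positive")
    ranks = min(workers, max_ranks)
    threads = -(-workers // ranks)  # ceiling division
    return ranks, threads


def launch_line(host, host_exe, device, device_exe, workers, max_ranks):
    """Build an mpirun command: one host rank plus a capped set of device ranks."""
    ranks, threads = plan_layout(workers, max_ranks)
    argv = [
        "mpirun",
        "-n", "1", "-host", host, host_exe,
        ":",
        "-n", str(ranks), "-host", device,
        "-env", "OMP_NUM_THREADS", str(threads),  # assumes an OpenMP application
        device_exe,
    ]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical executables; 30 workers squeezed into at most 4 MIC ranks.
    print(launch_line("mike.it.csiro.au", "./app.host",
                      "mic0", "/tmp/app.mic", workers=30, max_ranks=4))
```

With 30 workers and a cap of 4, this yields 4 ranks of 8 threads each on mic0. The application itself has to be threaded for this to help; the helper only works out the launch arithmetic.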
-- Pavan
On 07/08/2013 01:34 AM, Maciej.Golebiewski at csiro.au wrote:
> Hi,
>
> I have managed to build MPICH with SCIF support, using the Intel compilers, for both the MIC device and the host system. I am able to start MPI processes on the MIC device using mpirun from the host or mpiexec on the device itself.
>
> I can also run an application with some ranks on the host and some on the device.
>
> However, I can't start more than 4 ranks on any device if I try to run my application across more than 1 node (be it the host or MIC cards):
>
> env I_MPI_MIC=enable mpirun -n 1 -host mike.it.csiro.au ./mpptest -bcast : -n 30 -host mic0 /tmp/mpptest.scif -bcast
> 0: 1: 00000011: 00000170: readv err 0
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(861)...............: MPI_Allreduce(sbuf=0x7fff62945a08, rbuf=0x7c5de0, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
> MPIR_Allreduce_impl(719).........:
> MPIR_Allreduce_intra(201)........:
> allreduce_intra_or_coll_fn(110)..:
> MPIR_Allreduce_intra(362)........:
> MPIC_Sendrecv(213)...............:
> MPIC_Wait(569)...................:
> MPIDI_CH3I_Progress(367).........:
> MPID_nem_mpich_blocking_recv(894):
> state_commrdy_handler(175).......:
> state_commrdy_handler(138).......:
> MPID_nem_scif_recv_handler(115)..: Communication error with rank 1
> [...more error messages follow...]
>
> This problem does not occur if all ranks run on the same node:
>
> env I_MPI_MIC=enable mpirun -n 30 -host mic0 /tmp/mpptest.scif -bcast
> set default
> set font variable
> set curve window y 0.15 0.90
> set order d d d x y d
> title left 'time (us)', bottom 'Size (bytes)',
> top 'Comm Perf for MPI (pkmic-mic0)',
> 'type = blocking'
>
> #p0 p1 dist len ave time (us) rate
> 0 29 29 0 11.448860 0.00
> 0 29 29 32 12.009144 2.665e+6
> [...more expected output follows...]
>
> Last November there was a short thread on this list (archive: http://lists.mpich.org/pipermail/discuss/2012-November/000109.html) about a seemingly similar problem, but it seemed to fizzle out before any solution was offered.
>
> I wonder whether there has been any progress on it, or whether someone else has encountered this problem too and managed to find a workaround or solution, or at least has an explanation of what is happening.
>
> In case someone would like to try and reproduce it, my configure invocation for MPICH on MIC was:
>
> ./configure CC="icc" CXX=icpc F77="ifort -warn noalignments" FC="ifort -warn noalignments" \
> CXXFLAGS=-mmic CFLAGS=-mmic FFLAGS=-mmic FCFLAGS=-mmic \
> MPICHLIB_CFLAGS="-g" MPICHLIB_CXXFLAGS="-g" MPICHLIB_FFLAGS="-g" MPICHLIB_FCFLAGS="-g" \
> MPICHLIB_LDFLAGS="-g" LIBS=-lscif LDFLAGS=-L/opt/intel/mic/filesystem/base/lib64 \
> --with-device=ch3:nemesis:scif,tcp --disable-romio --disable-mpe --enable-smpcoll \
> --enable-fast=defopt --prefix=/apps/mpich-scif/3.0.4/mic --enable-error-checking=runtime \
> --enable-timing=runtime --host=i686-linux-gnu \
> --with-cross=maint/fcrosscompile/cross_values.txt.mike
>
> The settings in cross_values.txt.mike were:
>
> # The Fortran related cross compilation values.
> CROSS_F77_SIZEOF_INTEGER="4"
> CROSS_F77_SIZEOF_REAL="4"
> CROSS_F77_SIZEOF_DOUBLE_PRECISION="8"
> CROSS_F77_TRUE_VALUE="-1"
> CROSS_F77_FALSE_VALUE="0"
> CROSS_F90_ADDRESS_KIND="8"
> CROSS_F90_OFFSET_KIND="8"
> CROSS_F90_INTEGER_KIND="4"
> CROSS_F90_REAL_MODEL=" 6 , 37"
> CROSS_F90_DOUBLE_MODEL=" 15 , 307"
> CROSS_F90_INTEGER_MODEL=" 9"
> CROSS_F90_ALL_INTEGER_MODELS=" 2 , 1, 4 , 2, 9 , 4, 18 , 8,"
> CROSS_F90_INTEGER_MODEL_MAP=" { 2 , 1 , 1 }, { 4 , 2 , 2 }, { 9 , 4 , 4 }, { 18 , 8 , 8 },"
>
> And finally, the configure invocation for MPICH on the host was:
>
> ./configure CC="icc" CXX=icpc F77="ifort -warn noalignments" FC="ifort -warn noalignments" \
> MPICHLIB_CFLAGS="-g" MPICHLIB_CXXFLAGS="-g" MPICHLIB_FFLAGS="-g" MPICHLIB_FCFLAGS="-g" \
> MPICHLIB_LDFLAGS="-g" LIBS=-lscif LDFLAGS=-L/usr/lib64 --with-device=ch3:nemesis:scif,tcp \
> --disable-romio --disable-mpe --enable-smpcoll --enable-fast=defopt \
> --prefix=/apps/mpich-scif/3.0.4/intel64 --enable-error-checking=runtime \
> --enable-timing=runtime
>
> Thanks,
>
> Maciej
>
> Maciej Golebiewski
> Applications Support, Advanced Scientific Computing CSIRO IM&T
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji