[mpich-discuss] Support for MIC in MPICH 3.0.4
Maciej.Golebiewski at csiro.au
Mon Jul 8 01:34:21 CDT 2013
Hi,
I have managed to build MPICH for the MIC device and for the host system, with SCIF support and using the Intel compilers. I am able to start MPI processes on the MIC device using mpirun from the host or mpiexec on the device itself.
I can also run an application with some ranks on the host and some on the device.
However, I can't start more than 4 ranks on any device if I try to run my application across more than one node (be it the host or MIC cards):
env I_MPI_MIC=enable mpirun -n 1 -host mike.it.csiro.au ./mpptest -bcast : -n 30 -host mic0 /tmp/mpptest.scif -bcast
0: 1: 00000011: 00000170: readv err 0
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(861)...............: MPI_Allreduce(sbuf=0x7fff62945a08, rbuf=0x7c5de0, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(719).........:
MPIR_Allreduce_intra(201)........:
allreduce_intra_or_coll_fn(110)..:
MPIR_Allreduce_intra(362)........:
MPIC_Sendrecv(213)...............:
MPIC_Wait(569)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 1
[...a bunch more error messages follow...]
This problem does not occur if all ranks run on the same node:
env I_MPI_MIC=enable mpirun -n 30 -host mic0 /tmp/mpptest.scif -bcast
set default
set font variable
set curve window y 0.15 0.90
set order d d d x y d
title left 'time (us)', bottom 'Size (bytes)',
top 'Comm Perf for MPI (pkmic-mic0)',
'type = blocking'
#p0 p1 dist len ave time (us) rate
0 29 29 0 11.448860 0.00
0 29 29 32 12.009144 2.665e+6
[...more expected output follows...]
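For what it's worth, here is a minimal sketch of a test (my own, not part of mpptest; the file and binary names are just what I happened to use) that makes the same MPI_Allreduce call that appears in the error stack above (count=1, MPI_DOUBLE, MPI_MAX on MPI_COMM_WORLD), so it should exercise the same code path when launched across the host and a card:

/* allred_min.c - minimal sketch of a cross-node allreduce test;
   not the actual mpptest code, just the same call shape as in the
   error stack above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    double in, out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    in = (double) rank;
    /* count=1, MPI_DOUBLE, MPI_MAX on MPI_COMM_WORLD, as in the failing call */
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    printf("rank %d of %d on %s: max = %f\n", rank, size, name, out);

    MPI_Finalize();
    return 0;
}

I compile it once with the host mpicc and once with the MIC mpicc from the two installations described below, and launch it the same way as mpptest, e.g.:

env I_MPI_MIC=enable mpirun -n 1 -host mike.it.csiro.au ./allred_min : -n 30 -host mic0 /tmp/allred_min.mic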
Last November there was a short thread on this list (archive: http://lists.mpich.org/pipermail/discuss/2012-November/000109.html) about a seemingly similar problem, but it seemed to fizzle out before any solution was offered.
I wonder whether there has been any progress on it, or whether someone else has encountered this problem too and managed to find a workaround/solution, or at least has an explanation of what is happening.
In case someone would like to try to reproduce it, my configure invocation for MPICH on the MIC was:
./configure CC="icc" CXX=icpc F77="ifort -warn noalignments" FC="ifort -warn noalignments" \
CXXFLAGS=-mmic CFLAGS=-mmic FFLAGS=-mmic FCFLAGS=-mmic \
MPICHLIB_CFLAGS="-g" MPICHLIB_CXXFLAGS="-g" MPICHLIB_FFLAGS="-g" MPICHLIB_FCFLAGS="-g" \
MPICHLIB_LDFLAGS="-g" LIBS=-lscif LDFLAGS=-L/opt/intel/mic/filesystem/base/lib64 \
--with-device=ch3:nemesis:scif,tcp --disable-romio --disable-mpe --enable-smpcoll \
--enable-fast=defopt --prefix=/apps/mpich-scif/3.0.4/mic --enable-error-checking=runtime \
--enable-timing=runtime --host=i686-linux-gnu \
--with-cross=maint/fcrosscompile/cross_values.txt.mike
The settings in cross_values.txt.mike were:
# The Fortran related cross compilation values.
CROSS_F77_SIZEOF_INTEGER="4"
CROSS_F77_SIZEOF_REAL="4"
CROSS_F77_SIZEOF_DOUBLE_PRECISION="8"
CROSS_F77_TRUE_VALUE="-1"
CROSS_F77_FALSE_VALUE="0"
CROSS_F90_ADDRESS_KIND="8"
CROSS_F90_OFFSET_KIND="8"
CROSS_F90_INTEGER_KIND="4"
CROSS_F90_REAL_MODEL=" 6 , 37"
CROSS_F90_DOUBLE_MODEL=" 15 , 307"
CROSS_F90_INTEGER_MODEL=" 9"
CROSS_F90_ALL_INTEGER_MODELS=" 2 , 1, 4 , 2, 9 , 4, 18 , 8,"
CROSS_F90_INTEGER_MODEL_MAP=" { 2 , 1 , 1 }, { 4 , 2 , 2 }, { 9 , 4 , 4 }, { 18 , 8 , 8 },"
And finally, the configure invocation for MPICH on the host was:
./configure CC="icc" CXX=icpc F77="ifort -warn noalignments" FC="ifort -warn noalignments" \
MPICHLIB_CFLAGS="-g" MPICHLIB_CXXFLAGS="-g" MPICHLIB_FFLAGS="-g" MPICHLIB_FCFLAGS="-g" \
MPICHLIB_LDFLAGS="-g" LIBS=-lscif LDFLAGS=-L/usr/lib64 --with-device=ch3:nemesis:scif,tcp \
--disable-romio --disable-mpe --enable-smpcoll --enable-fast=defopt \
--prefix=/apps/mpich-scif/3.0.4/intel64 --enable-error-checking=runtime \
--enable-timing=runtime
Thanks,
Maciej
Maciej Golebiewski
Applications Support, Advanced Scientific Computing, CSIRO IM&T