[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2

Aaron Knister aaron.s.knister at nasa.gov
Tue Sep 5 12:11:06 CDT 2017


Hey Matt,

Last I looked, hydra and SLURM both speak the same PMI2 wire protocol (or
at least close enough) for jobs to start up even if MPICH wasn't built
with --with-pmi=slurm. All you should need to do is run with "srun --mpi=pmi2".
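
For example, something along these lines (the node and per-node task
counts are just placeholders) should start the job through Slurm's PMI2
support:

   srun --mpi=pmi2 -N 4 --ntasks-per-node=24 ./GEOSgcm.x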

-Aaron


On 09/05/2017 01:05 PM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND
APPLICATIONS INC] wrote:
> Ken,
>
> I thought about that, but doesn't that mean I am stuck with srun as my
> PM? I've never had great luck with srun compared to hydra (with other
> MPI stacks).
>
> I know I can't just add your suggestion as-is, because configure then
> complains:
>
> configure: error: The PM chosen (hydra) requires the PMI
> implementation simple but slurm was selected as the PMI implementation.
>
> I am currently trying a build with --with-pm=none and will test that,
> though.
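>
> In other words, something like this (the install prefix and Slurm path
> below are just placeholders):
>
>    ./configure --prefix=<install/prefix> --with-pm=none \
>        --with-pmi=slurm --with-slurm=<path/to/slurm/install>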
>
> Matt
>
> On 09/05/2017 11:21 AM, Kenneth Raffenetti wrote:
>> It looks like you are using the Slurm launcher, but you might not
>> have configured MPICH to use Slurm PMI. Try adding this to your
>> configure line:
>>
>>    --with-pmi=slurm --with-slurm=<path/to/slurm/install>
>>
>> Ken
>>
>> On 09/05/2017 09:34 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS
>> AND APPLICATIONS INC] wrote:
>>> All,
>>>
>>> I've been evaluating different MPI stacks on our cluster and found
>>> that MPICH 3.2 does really well on some simple little benchmarks. It
>>> also runs Hello World just fine, so I decided to apply it to our
>>> climate model (GEOS).
>>>
>>> However, the first time I did that, things went a bit nuts.
>>> Essentially:
>>>
>>>> (1065) $ mpirun -np 96 ./GEOSgcm.x |& tee withoutppn.log
>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>> Fatal error in PMPI_Comm_create: Unknown error class, error stack:
>>>> PMPI_Comm_create(564).................:
>>>> MPI_Comm_create(MPI_COMM_WORLD, group=0x88000000,
>>>> new_comm=0x106d1740) failed
>>>> PMPI_Comm_create(541).................:
>>>> MPIR_Comm_create_intra(215)...........:
>>>> MPIR_Get_contextid_sparse_group(500)..:
>>>> MPIR_Allreduce_impl(764)..............:
>>>> MPIR_Allreduce_intra(257).............:
>>>> allreduce_intra_or_coll_fn(163).......:
>>>> MPIR_Allreduce_intra(417).............:
>>>> MPIDU_Complete_posted_with_error(1137): Process failed
>>>> MPIR_Allreduce_intra(417).............:
>>>> MPIDU_Complete_posted_with_error(1137): Process failed
>>>> MPIR_Allreduce_intra(268).............:
>>>> MPIR_Bcast_impl(1452).................:
>>>> MPIR_Bcast(1476)......................:
>>>> MPIR_Bcast_intra(1287)................:
>>>> MPIR_Bcast_binomial(310)..............: Failure during collective
>>>
>>> (NOTE: The srun.slurm thing is just an error/warning we always get.
>>> Doesn't matter if it's MPT, Open MPI, MVAPICH2, Intel MPI...it
>>> happens.)
>>>
>>> The thing is, it works just fine at an (NX-by-NY) layout of 1x6 and
>>> 2x12, but once I go to 3x18, boom, collapse. As I am on 28-core
>>> nodes, my first thought was that it was due to crossing nodes. But
>>> those benchmarks I ran did just fine on 192 nodes, so... hmm.
>>>
>>> Out of desperation, I finally wondered whether the problem was that
>>> 28 doesn't evenly divide 96, so I passed in -ppn:
>>>
>>>> (1068) $ mpirun -ppn 12 -np 96 ./GEOSgcm.x |& tee withppn.log
>>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>>
>>>>  In MAPL_Shmem:
>>>>      NumCores per Node =           12
>>>>      NumNodes in use   =            8
>>>>      Total PEs         =           96
>>>> ...
>>>
>>> Starts up just fine! Note that every other MPI stack (MPT, Intel
>>> MPI, MVAPICH2, and Open MPI) handles the non-ppn job just fine,
>>> but it's possible that they are evenly distributing the processes
>>> themselves. The "MAPL_Shmem" lines you see are just reporting
>>> what the process structure looks like. I've added some print
>>> statements, including this:
>>>
>>>     if (present(CommIn)) then
>>>         CommCap = CommIn
>>>     else
>>>         CommCap = MPI_COMM_WORLD
>>>     end if
>>>
>>>     if (.not.present(CommIn)) then
>>>        call mpi_init(status)
>>>        VERIFY_(STATUS)
>>>     end if
>>>     write (*,*) "MPI Initialized."
>>>
>>> So, boring: CommIn is *not* present, so we are using MPI_COMM_WORLD,
>>> and mpi_init is called as one would expect. Now if I run:
>>>
>>>    mpirun -np 96 ./GEOSgcm.x | grep 'MPI Init' | wc -l
>>>
>>> to count the number of processes that initialized, multiple runs
>>> give results like 40, 56, 56, 45, 68. Never consistent.
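>>>
>>> A minimal standalone program along these lines (illustrative only,
>>> not GEOS code) could be used to check that same count outside the
>>> model; every rank should print exactly one matching line:
>>>
>>>     program init_count
>>>        use mpi
>>>        implicit none
>>>        integer :: ierr, rank, nprocs
>>>        ! Each rank prints one "MPI Initialized." line, so piping the
>>>        ! output through grep 'MPI Init' | wc -l should equal -np.
>>>        call mpi_init(ierr)
>>>        call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
>>>        call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
>>>        write (*,*) "MPI Initialized. Rank", rank, "of", nprocs
>>>        call mpi_finalize(ierr)
>>>     end program init_count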
>>>
>>> So, I'm a bit at a loss. I freely admit I might have built MPICH3
>>> incorrectly; it was essentially my first time building it. I
>>> configured with:
>>>
>>>>  ./configure --prefix=$SWDEV/MPI/mpich/3.2/intel_17.0.4.196 \
>>>>     --disable-wrapper-rpath CC=icc CXX=icpc FC=ifort F77=ifort \
>>>>      --enable-fortran=all --enable-cxx |& tee configure.intel_17.0.4.196.log
>>>
>>> which might be too vanilla for a SLURM/InfiniBand cluster, and yet it
>>> works with -ppn. But maybe I need extra options for it to work in all
>>> cases? --with-ibverbs? --with-slurm?
>>>
>>> Any ideas on what's happening and what I might have done wrong?
>>>
>>> Thanks,
>>> Matt
>
>
