[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Tue Sep 5 12:05:33 CDT 2017


Ken,

I thought about that, but doesn't that mean I am stuck with srun as my 
PM? I've never had great luck with srun compared to hydra (with other 
MPI stacks).

I did find that I can't just add your suggestion to my existing configure 
line as-is, because:

configure: error: The PM chosen (hydra) requires the PMI implementation 
simple but slurm was selected as the PMI implementation.

I am currently trying a build with --with-pm=none, though, and will test it.
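
For reference, the full configure line I'm testing now looks roughly like 
this (your two options plus --with-pm=none; the Slurm path is just a 
placeholder for our local install):

   # same options as my original build, plus the Slurm PMI bits and no
   # built-in process manager (<path/to/slurm/install> is a placeholder)
   ./configure --prefix=$SWDEV/MPI/mpich/3.2/intel_17.0.4.196 \
       --disable-wrapper-rpath CC=icc CXX=icpc FC=ifort F77=ifort \
       --enable-fortran=all --enable-cxx \
       --with-pm=none --with-pmi=slurm --with-slurm=<path/to/slurm/install>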

Matt

On 09/05/2017 11:21 AM, Kenneth Raffenetti wrote:
> It looks like you are using the Slurm launcher, but you might not have 
> configured MPICH to use Slurm PMI. Try adding this to your configure line:
> 
>    --with-pmi=slurm --with-slurm=<path/to/slurm/install>
> 
> Ken
> 
> On 09/05/2017 09:34 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND 
> APPLICATIONS INC] wrote:
>> All,
>>
>> I've been evaluating different MPI stacks on our cluster and found 
>> that MPICH 3.2 does really well on some simple little benchmarks. It 
>> also runs Hello World just fine, so I decided to apply it to our 
>> climate model (GEOS).
>>
>> However, the first time I did that, things went a bit nuts. Essentially:
>>
>>> (1065) $ mpirun -np 96 ./GEOSgcm.x |& tee withoutppn.log
>>> srun.slurm: cluster configuration lacks support for cpu binding
>>> Fatal error in PMPI_Comm_create: Unknown error class, error stack:
>>> PMPI_Comm_create(564).................: 
>>> MPI_Comm_create(MPI_COMM_WORLD, group=0x88000000, 
>>> new_comm=0x106d1740) failed
>>> PMPI_Comm_create(541).................: 
>>> MPIR_Comm_create_intra(215)...........: 
>>> MPIR_Get_contextid_sparse_group(500)..: 
>>> MPIR_Allreduce_impl(764)..............: 
>>> MPIR_Allreduce_intra(257).............: 
>>> allreduce_intra_or_coll_fn(163).......: 
>>> MPIR_Allreduce_intra(417).............: 
>>> MPIDU_Complete_posted_with_error(1137): Process failed
>>> MPIR_Allreduce_intra(417).............: 
>>> MPIDU_Complete_posted_with_error(1137): Process failed
>>> MPIR_Allreduce_intra(268).............: 
>>> MPIR_Bcast_impl(1452).................: 
>>> MPIR_Bcast(1476)......................: 
>>> MPIR_Bcast_intra(1287)................: 
>>> MPIR_Bcast_binomial(310)..............: Failure during collective
>>
>> (NOTE: The srun.slurm line is just an error/warning we always get. It 
>> doesn't matter if it's MPT, Open MPI, MVAPICH2, or Intel MPI; it 
>> happens.)
>>
>> The thing is, it works just fine at an (NX-by-NY) layout of 1x6 or 
>> 2x12, but once I go to 3x18, boom, collapse. As I am on 28-core nodes, 
>> my first thought was that it was due to crossing nodes. But those 
>> benchmarks I ran did just fine for 192 nodes, so...hmm.
>>
>> Out of desperation, I finally wondered: what if it's the fact that 28 
>> doesn't divide 96 evenly? So I passed in -ppn and:
>>
>>> (1068) $ mpirun -ppn 12 -np 96 ./GEOSgcm.x |& tee withppn.log
>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>
>>>  In MAPL_Shmem:
>>>      NumCores per Node =           12
>>>      NumNodes in use   =            8
>>>      Total PEs         =           96
>>> ...
>>
>> Starts up just fine! Note that every other MPI stack (MPT, Intel MPI, 
>> MVAPICH2, and Open MPI) handles the non-ppn job just fine, though it's 
>> possible they are evenly distributing the processes themselves. The 
>> "MAPL_Shmem" lines you see are just reporting what the process 
>> structure looks like. I've added some print statements, including this:
>>
>>     if (present(CommIn)) then
>>         CommCap = CommIn
>>     else
>>         CommCap = MPI_COMM_WORLD
>>     end if
>>
>>     if (.not.present(CommIn)) then
>>        call mpi_init(status)
>>        VERIFY_(STATUS)
>>     end if
>>     write (*,*) "MPI Initialized."
>>
>> So, boring: CommIn is *not* present, so we are using MPI_COMM_WORLD, 
>> and mpi_init is called as one would expect. Now if I run:
>>
>>    mpirun -np 96 ./GEOSgcm.x | grep 'MPI Init' | wc -l
>>
>> to count the number of ranks that initialized, and run it multiple 
>> times, I get results like 40, 56, 56, 45, 68. Never consistent, and 
>> never the 96 I'd expect.
>>
>> So, I'm a bit at a loss. I freely admit I might have built MPICH3 
>> incorrectly; it was essentially my first time building it. I configured 
>> with:
>>
>>>  ./configure --prefix=$SWDEV/MPI/mpich/3.2/intel_17.0.4.196 \
>>>     --disable-wrapper-rpath CC=icc CXX=icpc FC=ifort F77=ifort \
>>>     --enable-fortran=all --enable-cxx |& tee configure.intel_17.0.4.196.log
>>
>> which might be too vanilla for a SLURM/Infiniband cluster, and yet it 
>> works with -ppn. But maybe I need extra options for it to work in all 
>> cases? --with-ibverbs? --with-slurm?
>>
>> Any ideas on what's happening and what I might have done wrong?
>>
>> Thanks,
>> Matt


-- 
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson

