[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Tue Sep 5 10:28:01 CDT 2017


An update!

I started reading all the MPICH wiki pages I could find and thought I 
should try -hosts or -f, and that *does* work:

> (1024) $ mpirun -f machinefile -np 96 ./GEOSgcm.x
> srun.slurm: cluster configuration lacks support for cpu binding
>  
>  In MAPL_Shmem:
>      NumCores per Node varies from           12  to           28
>      NumNodes in use   =            4
>      Total PEs         =           96
>  

So, I guess the answer is that MPICH 3.2 can't quite decode the SLURM 
environment to work out the node list on its own, so I need to make a 
machinefile myself.
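
In case it's useful, here is a minimal sketch of one way I could build 
that machinefile from the SLURM allocation (scontrol and 
SLURM_JOB_NODELIST come from SLURM itself; the ":28" per-host slot 
count is just an assumption based on our 28-core nodes):

    # Expand the allocated node list into one hostname per line and
    # append a slot count in Hydra's "host:nprocs" machinefile format.
    scontrol show hostnames "$SLURM_JOB_NODELIST" | \
        sed 's/$/:28/' > machinefile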

Would this be the best way to do this, or is there a way to 
build/configure MPICH to better support this?
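
(For what it's worth, skimming the installer's guide suggests two other 
routes, neither of which I've tried yet; the SLURM install path below 
is just a placeholder:

    # Build MPICH against SLURM's PMI and launch with srun instead of
    # mpirun (per the MPICH FAQ, if I'm reading it right):
    ./configure --with-pmi=slurm --with-pm=none \
        --with-slurm=/usr/local/slurm ...
    srun -n 96 ./GEOSgcm.x

    # Or keep Hydra's mpirun but tell it to query SLURM for the node list:
    mpirun -rmk slurm -np 96 ./GEOSgcm.x

A sanity check on whether either of these is the recommended approach 
would be appreciated.)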

Next up: trying to figure out how to get InfiniBand supported, as I 
think I'm currently using TCP:

> (1108) $ mpichversion | grep Device
> MPICH Device:    	ch3:nemesis
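
If I'm reading the README right, a bare ch3:nemesis device means the 
default TCP netmod, so I would presumably need to rebuild with an 
IB-capable netmod. An untested sketch, assuming Mellanox MXM (or 
libfabric, for ch3:nemesis:ofi) is installed where configure can find 
it, with the rest of my original configure line unchanged:

    # Rebuild nemesis with the MXM netmod instead of the default TCP one.
    ./configure --with-device=ch3:nemesis:mxm \
        --prefix=$SWDEV/MPI/mpich/3.2/intel_17.0.4.196 \
        --disable-wrapper-rpath CC=icc CXX=icpc FC=ifort F77=ifort \
        --enable-fortran=all --enable-cxx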




On 09/05/2017 10:34 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND 
APPLICATIONS INC] wrote:
> All,
> 
> I've been evaluating different MPI stacks on our cluster and found that 
> MPICH 3.2 does really well on some simple little benchmarks. It also 
> runs Hello World just fine, so I decided to apply it to our climate 
> model (GEOS).
> 
> However, the first time I did that, things went a bit nuts. Essentially:
> 
>> (1065) $ mpirun -np 96 ./GEOSgcm.x |& tee withoutppn.log
>> srun.slurm: cluster configuration lacks support for cpu binding
>> Fatal error in PMPI_Comm_create: Unknown error class, error stack:
>> PMPI_Comm_create(564).................: 
>> MPI_Comm_create(MPI_COMM_WORLD, group=0x88000000, new_comm=0x106d1740) 
>> failed
>> PMPI_Comm_create(541).................: 
>> MPIR_Comm_create_intra(215)...........: 
>> MPIR_Get_contextid_sparse_group(500)..: 
>> MPIR_Allreduce_impl(764)..............: 
>> MPIR_Allreduce_intra(257).............: 
>> allreduce_intra_or_coll_fn(163).......: 
>> MPIR_Allreduce_intra(417).............: 
>> MPIDU_Complete_posted_with_error(1137): Process failed
>> MPIR_Allreduce_intra(417).............: 
>> MPIDU_Complete_posted_with_error(1137): Process failed
>> MPIR_Allreduce_intra(268).............: 
>> MPIR_Bcast_impl(1452).................: 
>> MPIR_Bcast(1476)......................: 
>> MPIR_Bcast_intra(1287)................: 
>> MPIR_Bcast_binomial(310)..............: Failure during collective
> 
> (NOTE: The srun.slurm thing is just an error/warning we always get. 
> Doesn't matter if it's MPT, Open MPI, MVAPICH2, Intel MPI...it happens.)
> 
> The thing is, it works just fine at an NX-by-NY layout of 1x6 or 2x12, 
> but once I go to 3x18, boom, collapse. As I am on 28-core nodes, my 
> first thought was that it was due to crossing nodes. But those 
> benchmarks I ran did just fine on 192 nodes, so...hmm.
> 
> Out of desperation, I finally thought: what if it's the fact that 28 
> doesn't evenly divide 96? So I passed in -ppn:
> 
>> (1068) $ mpirun -ppn 12 -np 96 ./GEOSgcm.x |& tee withppn.log
>> srun.slurm: cluster configuration lacks support for cpu binding
>>
>>  In MAPL_Shmem:
>>      NumCores per Node =           12
>>      NumNodes in use   =            8
>>      Total PEs         =           96
>> ...
> 
> Starts up just fine! Note that every other MPI stack (MPT, Intel MPI, 
> MVAPICH2, and Open MPI) handles the non-ppn job just fine, but it's 
> possible they are evenly distributing the processes themselves. And 
> the "MAPL_Shmem" lines you see are just reporting what the process 
> layout looks like. I've added some print statements, including this:
> 
>     if (present(CommIn)) then
>         CommCap = CommIn
>     else
>         CommCap = MPI_COMM_WORLD
>     end if
> 
>     if (.not.present(CommIn)) then
>        call mpi_init(status)
>        VERIFY_(STATUS)
>     end if
>     write (*,*) "MPI Initialized."
> 
> So, boring, and CommIn is *not* present, so we are using MPI_COMM_WORLD, 
> and mpi_init is called as one would expect. Now if I run:
> 
>    mpirun -np 96 ./GEOSgcm.x | grep 'MPI Init' | wc -l
> 
> to count the number of processes that report initializing, repeated 
> runs give results like 40, 56, 56, 45, 68. Never consistent.
> 
> So, I'm a bit at a loss. I freely admit I might have built MPICH3 
> incorrectly. It was essentially my first time. I configured with:
> 
>>  ./configure --prefix=$SWDEV/MPI/mpich/3.2/intel_17.0.4.196 \
>>     --disable-wrapper-rpath CC=icc CXX=icpc FC=ifort F77=ifort \
>>     --enable-fortran=all --enable-cxx |& tee configure.intel_17.0.4.196.log
> 
> which might be too vanilla for a SLURM/InfiniBand cluster, and yet it 
> works with -ppn. But maybe I need extra options for it to work in all 
> cases? --with-ibverbs? --with-slurm?
> 
> Any ideas on what's happening and what I might have done wrong?
> 
> Thanks,
> Matt


-- 
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

