[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Wed Sep 6 10:53:11 CDT 2017


On 09/06/2017 11:02 AM, Kenneth Raffenetti wrote:
> On 09/05/2017 10:28 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND 
> APPLICATIONS INC] wrote:
>> An update!
>>
>> I started reading all the MPICH wiki pages I could find and thought I 
>> should try -hosts or -f, and that *does* work:
>>
>>> (1024) $ mpirun -f machinefile -np 96 ./GEOSgcm.x
>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>
>>>  In MAPL_Shmem:
>>>      NumCores per Node varies from           12  to           28
>>>      NumNodes in use   =            4
>>>      Total PEs         =           96
>>>
>>
>> So, I guess the answer is that MPICH 3.2 can't quite decode the SLURM 
>> environment to figure out a machinefile, so I need to make one myself.
> 
> Can you check the config.log in src/pm/hydra and see if slurm was 
> detected? If not, you can specify --with-slurm=<path/to/install>. Hydra 
> should be able to detect and understand the slurm host list from the job 
> environment.

I tried compiling three different ways:

If I just add --with-slurm, nothing happens. Configure seems not to 
react and I get the same mpirun behavior.

If I add --with-slurm and --with-pmi=slurm, configure dies saying:

configure: error: The PM chosen (hydra) requires the PMI implementation 
simple but slurm was selected as the PMI implementation.

If I do --with-slurm --with-pmi=slurm --with-pm=none, then I don't get 
mpiexec.hydra built. So srun is all that can work.
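
For reference, the machinefile workaround from my earlier message can be scripted rather than written by hand. This is only a rough sketch, not something Hydra or SLURM provides: the helper names expand_tasks and make_machinefile are made up for illustration, and it assumes that SLURM_TASKS_PER_NODE uses the usual "28(x2),12" style, that its entries come in the same order as the hosts printed by scontrol show hostnames, and that Hydra accepts host:nprocs lines in the file passed to -f.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of MPICH or SLURM.

# Expand a SLURM_TASKS_PER_NODE-style string such as "28(x2),12"
# into one task count per line (28, 28, 12).
expand_tasks() {
    echo "$1" | tr ',' '\n' | while IFS= read -r item; do
        case "$item" in
            *"(x"*)
                count=${item%%"("*}   # text before "(" is the task count
                reps=${item##*x}      # text after "x" is "N)"
                reps=${reps%")"}      # drop the trailing ")"
                i=0
                while [ "$i" -lt "$reps" ]; do
                    echo "$count"
                    i=$((i + 1))
                done
                ;;
            *) echo "$item" ;;
        esac
    done
}

# Pair each host (one per line in $1) with its task count (from $2)
# and print host:count lines, assuming both lists are in the same order.
make_machinefile() {
    hosts_file=$(mktemp)
    counts_file=$(mktemp)
    printf '%s\n' "$1" > "$hosts_file"
    expand_tasks "$2" > "$counts_file"
    paste -d: "$hosts_file" "$counts_file"
    rm -f "$hosts_file" "$counts_file"
}

# Only write a machinefile when actually running inside a SLURM job.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    make_machinefile "$(scontrol show hostnames "$SLURM_JOB_NODELIST")" \
        "$SLURM_TASKS_PER_NODE" > machinefile
fi
```

Inside a job this would be followed by the same mpirun -f machinefile -np 96 ./GEOSgcm.x command as before. Writing per-host counts matters here because the core count varies from 12 to 28 across the nodes.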

> 
>>
>> Would this be the best way to do this, or is there a way to 
>> build/configure MPICH to better support this?
>>
>> Next up: trying to figure out how to get InfiniBand supported as I 
>> think I'm using TCP:
> 
> If you are using Mellanox infiniband, try --with-device=ch3:nemesis:mxm 
> --with-mxm=<path/to/install>. The MXM library is part of MOFED, or can 
> be downloaded separately from the Mellanox website.

Now this I was able to do, and I could see from the configure output 
that it found MXM and configured the build differently.

The problem? If I run a simple Hello World, the code locks up after 
printing "Hello world from..."  It's like it never finalizes, or gets 
stuck somewhere.

Matt
-- 
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson

