[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2
Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]
matthew.thompson at nasa.gov
Wed Sep 6 10:53:11 CDT 2017
On 09/06/2017 11:02 AM, Kenneth Raffenetti wrote:
> On 09/05/2017 10:28 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND
> APPLICATIONS INC] wrote:
>> An update!
>>
>> I started reading all the MPICH wiki pages I could find and thought I
>> should try -hosts or -f, and that *does* work:
>>
>>> (1024) $ mpirun -f machinefile -np 96 ./GEOSgcm.x
>>> srun.slurm: cluster configuration lacks support for cpu binding
>>>
>>> In MAPL_Shmem:
>>> NumCores per Node varies from 12 to 28
>>> NumNodes in use = 4
>>> Total PEs = 96
>>>
>>
>> So, I guess the answer is that MPICH 3.2 can't quite decode the SLURM
>> environment to figure out a machinefile, so I need to make one myself.
>
> Can you check the config.log in src/pm/hydra and see if slurm was
> detected? If not, you can specify --with-slurm=<path/to/install>. Hydra
> should be able to detect and understand the slurm host list from the job
> environment.

I tried compiling three different ways:

If I just add --with-slurm, nothing happens. Configure seems not to
react and I get the same mpirun behavior.

If I add --with-slurm and --with-pmi=slurm, configure dies saying:

  configure: error: The PM chosen (hydra) requires the PMI implementation
  simple but slurm was selected as the PMI implementation.

If I do --with-slurm --with-pmi=slurm --with-pm=none, then I don't get
mpiexec.hydra built. So srun is all that can work.

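
Until Hydra picks up the SLURM host list on its own, the hand-made
machinefile from the earlier message can be generated inside the job
rather than typed by hand. What follows is only a sketch, under these
assumptions: the host names come from `scontrol show hostnames
"$SLURM_JOB_NODELIST"`, the per-node counts come from
SLURM_TASKS_PER_NODE in its compressed form (for example "28(x3),12"),
and the file uses Hydra's "host:count" line format. The function names
and the node names are hypothetical, and the 28/28/28/12 split is just
one layout consistent with "4 nodes, 12 to 28 cores, 96 PEs", not the
actual job.

```python
import re


def expand_tasks_per_node(spec):
    """Expand a SLURM_TASKS_PER_NODE string such as '28(x3),12'
    into a flat list of per-node counts: [28, 28, 28, 12]."""
    counts = []
    for part in spec.split(","):
        # Each entry is either 'N' or 'N(xM)' meaning N tasks on M nodes.
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", part.strip())
        if match is None:
            raise ValueError(f"unrecognized entry: {part!r}")
        tasks = int(match.group(1))
        repeat = int(match.group(2) or 1)
        counts.extend([tasks] * repeat)
    return counts


def machinefile_lines(hosts, tasks_spec):
    """Pair each host with its task count as 'host:count' lines.

    `hosts` is assumed to be in the same order SLURM reports the
    per-node task counts."""
    counts = expand_tasks_per_node(tasks_spec)
    if len(hosts) != len(counts):
        raise ValueError(
            f"{len(hosts)} hosts but {len(counts)} task counts"
        )
    return [f"{host}:{count}" for host, count in zip(hosts, counts)]


# Hypothetical node names, for illustration only.
example_hosts = ["node001", "node002", "node003", "node004"]
example_lines = machinefile_lines(example_hosts, "28(x3),12")
```

Writing "\n".join(example_lines) to a file named machinefile and then
running mpirun -f machinefile -np 96 ./GEOSgcm.x would match the
invocation that worked above.
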
>
>>
>> Would this be the best way to do this, or is there a way to
>> build/configure MPICH to better support this?
>>
>> Next up: trying to figure out how to get InfiniBand supported as I
>> think I'm using TCP:
>
> If you are using Mellanox infiniband, try --with-device=ch3:nemesis:mxm
> --with-mxm=<path/to/install>. The MXM library is part of MOFED, or can
> be downloaded separately from the Mellanox website.

Now this I was able to do, and I could see from the configure output that
it found MXM and did something different.

The problem? If I run a simple Hello World, the code locks up after
printing "Hello world from..." It's like it never finalizes, or gets
stuck somewhere.

Matt

--
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson