[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2
raffenet at mcs.anl.gov
Wed Sep 6 10:02:10 CDT 2017
On 09/05/2017 10:28 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
> An update!
> I started reading all the MPICH wiki pages I could find and thought I
> should try -hosts or -f, and that *does* work:
>> (1024) $ mpirun -f machinefile -np 96 ./GEOSgcm.x
>> srun.slurm: cluster configuration lacks support for cpu binding
>> In MAPL_Shmem:
>> NumCores per Node varies from 12 to 28
>> NumNodes in use = 4
>> Total PEs = 96
> So, I guess the answer is that MPICH 3.2 can't quite decode the SLURM
> environment to figure out a machinefile, so I need to make one myself.
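As a stopgap, a machinefile like that does not have to be written by hand. Inside an allocation, SLURM exports the compressed host list in SLURM_JOB_NODELIST (e.g. borg[01-04]) and the per-node task counts in SLURM_TASKS_PER_NODE (e.g. 28(x2),12); "scontrol show hostnames" expands the former from the shell. The Python sketch below does both expansions and writes Hydra's host:count lines, so each node is given the number of ranks SLURM allocated to it. It assumes at most one bracket group per host expression, and the example host names are hypothetical.

```python
from __future__ import annotations

import os
import re
import sys


def split_top_level(s: str) -> list[str]:
    """Split on commas that are not inside [...] brackets."""
    parts: list[str] = []
    cur: list[str] = []
    depth = 0
    for ch in s:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append("".join(cur))
    return parts


def expand_nodelist(nodelist: str) -> list[str]:
    """Expand a compressed host list such as 'borg[01-03,07]'.

    Assumption: at most one [...] group per comma-separated expression.
    """
    hosts: list[str] = []
    for expr in split_top_level(nodelist):
        m = re.fullmatch(r"([^\[]*)\[([^\]]+)\](.*)", expr)
        if not m:
            hosts.append(expr)  # plain host name, nothing to expand
            continue
        prefix, ranges, suffix = m.groups()
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            hi = hi or lo
            width = len(lo)  # preserve zero padding, e.g. 01..03
            for i in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{i:0{width}d}{suffix}")
    return hosts


def expand_tasks_per_node(spec: str) -> list[int]:
    """Expand SLURM_TASKS_PER_NODE, e.g. '28(x2),12' -> [28, 28, 12]."""
    counts: list[int] = []
    for item in spec.split(","):
        m = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item)
        if not m:
            raise ValueError(f"unexpected task spec: {item!r}")
        n, reps = int(m.group(1)), int(m.group(2) or 1)
        counts.extend([n] * reps)
    return counts


def machinefile_lines(nodelist: str, tasks_per_node: str) -> list[str]:
    """Pair each host with its task count in Hydra's host:count format."""
    hosts = expand_nodelist(nodelist)
    counts = expand_tasks_per_node(tasks_per_node)
    if len(hosts) != len(counts):
        raise ValueError("host list and task counts differ in length")
    return [f"{h}:{c}" for h, c in zip(hosts, counts)]


if __name__ == "__main__":
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    tasks = os.environ.get("SLURM_TASKS_PER_NODE")
    if nodelist and tasks:
        print("\n".join(machinefile_lines(nodelist, tasks)))
    else:
        sys.stderr.write("SLURM_JOB_NODELIST / SLURM_TASKS_PER_NODE not set\n")
```

In a batch script this would run as, say, python3 make_machinefile.py > machinefile (hypothetical script name) just before mpirun -f machinefile -np 96 ./GEOSgcm.x.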
Can you check the config.log in src/pm/hydra and see whether SLURM was
detected? If not, you can specify --with-slurm=<path/to/install>. Hydra
should be able to detect and understand the SLURM host list from the job
environment.
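Checking config.log comes down to searching it for mentions of slurm. A minimal Python filter follows; it assumes it is run from the MPICH build directory where configure was invoked (pass another path as the first argument otherwise). It does not know the exact wording configure logs, so the matching lines still have to be read.

```python
from __future__ import annotations

import re
import sys
from pathlib import Path
from typing import Iterable

SLURM_RE = re.compile(r"slurm", re.IGNORECASE)


def slurm_lines(lines: Iterable[str]) -> list[str]:
    """Keep only lines that mention slurm, case-insensitively."""
    return [line.rstrip("\n") for line in lines if SLURM_RE.search(line)]


if __name__ == "__main__":
    # Default path is relative to the MPICH build directory (assumption).
    log = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("src/pm/hydra/config.log")
    if log.is_file():
        with log.open(errors="replace") as fh:
            print("\n".join(slurm_lines(fh)) or "no slurm mentions found")
    else:
        sys.stderr.write(f"{log} not found\n")
```

From the shell, grep -i slurm src/pm/hydra/config.log does the same job.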
> Would this be the best way to do this, or is there a way to
> build/configure MPICH to better support this?
> Next up: trying to figure out how to get InfiniBand supported, as I
> think I'm using TCP:
If you are using Mellanox InfiniBand, try --with-device=ch3:nemesis:mxm
--with-mxm=<path/to/install>. The MXM library is part of MOFED, or can
be downloaded separately from the Mellanox website.
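The MXM device and SLURM support can be requested in a single configure invocation. The sketch below only assembles that command line from the flags named in this thread; the prefixes are placeholders for wherever MXM (from MOFED or the standalone download) and SLURM live on your system. The has_libmxm check is a heuristic that assumes the library sits under lib or lib64 of the given prefix, and the script name in the usage message is hypothetical.

```python
from __future__ import annotations

import shlex
import sys
from pathlib import Path


def has_libmxm(prefix: str) -> bool:
    """Heuristic: look for libmxm.* under <prefix>/lib or <prefix>/lib64."""
    for sub in ("lib", "lib64"):
        d = Path(prefix) / sub
        if d.is_dir() and any(d.glob("libmxm.*")):
            return True
    return False


def configure_args(mxm_prefix: str, slurm_prefix: str | None = None) -> list[str]:
    """Build an MPICH configure command using the flags from this thread."""
    args = ["./configure", "--with-device=ch3:nemesis:mxm",
            f"--with-mxm={mxm_prefix}"]
    if slurm_prefix:
        args.append(f"--with-slurm={slurm_prefix}")
    return args


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.stderr.write("usage: mpich_configure.py <mxm-prefix> [slurm-prefix]\n")
    else:
        mxm = sys.argv[1]
        if not has_libmxm(mxm):
            sys.stderr.write(f"warning: no libmxm.* under {mxm}/lib or {mxm}/lib64\n")
        slurm = sys.argv[2] if len(sys.argv) > 2 else None
        # Print a shell-quoted command line to copy into the build script.
        print(shlex.join(configure_args(mxm, slurm)))
```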
discuss mailing list discuss at mpich.org