[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2
Kenneth Raffenetti
raffenet at mcs.anl.gov
Wed Sep 6 10:02:10 CDT 2017
On 09/05/2017 10:28 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] wrote:
> An update!
>
> I started reading all the MPICH wiki pages I could find and thought I
> should try -hosts or -f, and that *does* work:
>
>> (1024) $ mpirun -f machinefile -np 96 ./GEOSgcm.x
>> srun.slurm: cluster configuration lacks support for cpu binding
>>
>> In MAPL_Shmem:
>> NumCores per Node varies from 12 to 28
>> NumNodes in use = 4
>> Total PEs = 96
>>
>
> So, I guess the answer is that MPICH 3.2 can't quite decode the SLURM
> environment to figure out a machinefile, so I need to make one myself.
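If a hand-written machinefile is needed in the meantime, it can be generated from the job environment instead of typed by hand. This is a minimal sketch under stated assumptions, not an MPICH or SLURM tool: the helper names `expand_tasks_per_node` and `make_machinefile` are hypothetical, and it assumes it runs inside a SLURM allocation where `SLURM_JOB_NODELIST` and `SLURM_TASKS_PER_NODE` are set and `scontrol` is on the PATH. It writes one `host:count` line per node, the host file format Hydra accepts with `-f`.

```shell
#!/bin/sh
# Sketch: build a Hydra machinefile ("host:count" per line) from SLURM's
# job environment. Helper names are hypothetical, not MPICH or SLURM APIs.

# Expand SLURM's compressed task counts, e.g. "28(x2),12" -> 28, 28, 12
# (one count per output line, in node-list order).
expand_tasks_per_node() {
    echo "$1" | tr ',' '\n' | while IFS= read -r item; do
        case "$item" in
            *"(x"*")")
                count=${item%%"("*}   # "28(x2)" -> "28"
                reps=${item#*"(x"}    # "28(x2)" -> "2)"
                reps=${reps%")"}      # "2)" -> "2"
                i=0
                while [ "$i" -lt "$reps" ]; do
                    echo "$count"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$item"
                ;;
        esac
    done
}

# Read host names (one per line) on stdin and pair each with its count.
make_machinefile() {
    counts=$(expand_tasks_per_node "$1")
    set -- $counts                # word-split the counts into $1, $2, ...
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        [ "$#" -gt 0 ] || break   # more hosts than counts: stop early
        echo "$host:$1"
        shift
    done
}

# Only query SLURM when actually running inside an allocation.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    scontrol show hostnames "$SLURM_JOB_NODELIST" \
        | make_machinefile "$SLURM_TASKS_PER_NODE" > machinefile
    # Then launch as before, e.g.: mpirun -f machinefile -np 96 ./GEOSgcm.x
fi
```

Pairing `scontrol show hostnames` with `SLURM_TASKS_PER_NODE` relies on both listing nodes in the same order, which is how SLURM documents the variable; the per-node counts preserve the uneven 12-to-28 layout rather than spreading ranks round-robin.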
Can you check config.log in src/pm/hydra to see whether SLURM was
detected? If it was not, reconfigure with --with-slurm=<path/to/install>.
Hydra should be able to detect and parse the SLURM host list from the job
environment on its own.
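The config.log check suggested above can be scripted roughly as follows. This is only a sketch: `check_hydra_slurm` is a hypothetical helper, the default path assumes an in-tree build run from the top of the MPICH 3.2 source directory, and a matching line is a hint to read, not proof that Hydra's SLURM support was enabled.

```shell
#!/bin/sh
# Sketch: print the lines of Hydra's config.log that mention SLURM so you
# can judge whether configure picked it up. Exit status: 0 = mentions found,
# 1 = no mentions, 2 = config.log missing.
check_hydra_slurm() {
    log=${1:-src/pm/hydra/config.log}   # assumed in-tree build location
    if [ ! -f "$log" ]; then
        echo "config.log not found: $log" >&2
        return 2
    fi
    # Case-insensitive search; print matching lines with their line numbers.
    if grep -in slurm "$log"; then
        return 0
    fi
    echo "no mention of slurm in $log;" \
         "consider reconfiguring with --with-slurm=<path/to/install>" >&2
    return 1
}
```

Call it as `check_hydra_slurm`, or pass the path to a different config.log for an out-of-tree build. If nothing relevant turns up, rerun configure with `--with-slurm=<path/to/install>` and rebuild before trying `mpirun` without `-f` again.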
>
> Would this be the best way to do this, or is there a way to
> build/configure MPICH to better support this?
>
> Next up: figuring out how to get InfiniBand support working, as I
> think I'm using TCP:
If you are using Mellanox InfiniBand, try configuring with
--with-device=ch3:nemesis:mxm --with-mxm=<path/to/install>. The MXM library
ships with MOFED, or it can be downloaded separately from the Mellanox website.
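A configure step for the MXM route might look like the sketch below. `configure_mpich_mxm` is a hypothetical wrapper and every path is a placeholder; where MXM lives depends on whether it came from MOFED or a separate download. The optional third argument folds in the `--with-slurm` option mentioned earlier so Hydra can also read the SLURM host list.

```shell
#!/bin/sh
# Sketch: configure MPICH 3.2 with the ch3:nemesis:mxm device so traffic
# uses Mellanox InfiniBand through MXM instead of TCP.
# Arguments (all placeholders): MXM install dir, MPICH install prefix,
# optional SLURM install dir.
configure_mpich_mxm() {
    mxm_home=$1
    prefix=$2
    slurm_home=${3:-}
    if [ ! -d "$mxm_home" ]; then
        echo "MXM directory not found: $mxm_home" >&2
        return 1
    fi
    if [ ! -x ./configure ]; then
        echo "run this from the top of the MPICH source tree" >&2
        return 1
    fi
    # --with-slurm is only added when a SLURM path was given.
    ./configure --prefix="$prefix" \
        --with-device=ch3:nemesis:mxm \
        --with-mxm="$mxm_home" \
        ${slurm_home:+"--with-slurm=$slurm_home"}
}
```

For example, `configure_mpich_mxm /path/to/mxm /path/to/mpich-install /path/to/slurm` followed by `make && make install`. Rebuilding into a separate prefix keeps the existing TCP build available for comparison.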
Ken
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss