[mpich-discuss] Code works with -ppn, fails without using MPICH 3.2

Kenneth Raffenetti raffenet at mcs.anl.gov
Wed Sep 6 10:02:10 CDT 2017


On 09/05/2017 10:28 AM, Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND 
APPLICATIONS INC] wrote:
> An update!
> 
> I started reading all the MPICH wiki pages I could find and thought I 
> should try -hosts or -f, and that *does* work:
> 
>> (1024) $ mpirun -f machinefile -np 96 ./GEOSgcm.x
>> srun.slurm: cluster configuration lacks support for cpu binding
>>
>>  In MAPL_Shmem:
>>      NumCores per Node varies from           12  to           28
>>      NumNodes in use   =            4
>>      Total PEs         =           96
>>
> 
> So, I guess the answer is that MPICH 3.2 can't quite decode the SLURM 
> environment to figure out a machinefile, so I need to make one myself.
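
(A minimal sketch of building the machinefile by hand inside a Slurm job, in case it helps. `scontrol show hostnames` is a standard Slurm command. The function name make_machinefile and the uniform slot count are assumptions for illustration only; nodes with differing core counts, as in the output above, would need per-host values.)

```shell
# Read hostnames on stdin, one per line, and print Hydra host-file lines.
# With an argument, each line becomes "host:slots"; without one, just "host".
make_machinefile() {
    slots="${1:-}"
    while IFS= read -r host; do
        # Skip blank lines so they do not end up in the host file.
        if [ -z "$host" ]; then
            continue
        fi
        if [ -n "$slots" ]; then
            printf '%s:%s\n' "$host" "$slots"
        else
            printf '%s\n' "$host"
        fi
    done
}

# Inside a Slurm allocation (not run here):
#   scontrol show hostnames "$SLURM_JOB_NODELIST" | make_machinefile > machinefile
#   mpirun -f machinefile -np 96 ./GEOSgcm.x
```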

Can you check the config.log in src/pm/hydra and see if slurm was 
detected? If not, you can pass --with-slurm=<path/to/install> to 
configure and rebuild. Hydra should be able to detect and understand 
the slurm host list from the job environment.
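
One way to do that check from the top of the MPICH build tree, as a rough sketch. The helper name show_slurm_lines is made up for illustration, and the exact wording of the matching lines depends on the configure script, so read them rather than relying on a fixed string.

```shell
# Print every line of Hydra's config.log that mentions slurm (case-insensitive),
# prefixed with its line number. The default path is the one named above.
show_slurm_lines() {
    log="${1:-src/pm/hydra/config.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # grep exits with status 1 when nothing matches, i.e. no mention of slurm.
    grep -i -n 'slurm' "$log"
}

# From the top of the build tree (not run here):
#   show_slurm_lines
# If slurm was not picked up, re-run configure with the install prefix
# of your Slurm installation and rebuild:
#   ./configure --with-slurm=/path/to/install
```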

> 
> Would this be the best way to do this, or is there a way to 
> build/configure MPICH to better support this?
> 
> Next up: trying to figure out how to get InfiniBand supported as I 
> think I'm using TCP:

If you are using Mellanox InfiniBand, try --with-device=ch3:nemesis:mxm 
--with-mxm=<path/to/install> when configuring MPICH. The MXM library is 
part of MOFED (Mellanox OFED), or can be downloaded separately from the 
Mellanox website.
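
A small sketch of putting those two options together. The helper name mxm_configure_args and the prefix /opt/mellanox/mxm are assumptions for illustration; use whatever directory MXM is actually installed in on your system.

```shell
# Print the two configure options above, one per line, for a given MXM
# install prefix. Fails early if the prefix is not an existing directory.
mxm_configure_args() {
    prefix="$1"
    if [ ! -d "$prefix" ]; then
        echo "MXM prefix not found: $prefix" >&2
        return 1
    fi
    printf '%s\n' "--with-device=ch3:nemesis:mxm" "--with-mxm=$prefix"
}

# From the top of the MPICH source tree (not run here; the prefix is hypothetical
# and must not contain spaces for this unquoted expansion to work):
#   ./configure $(mxm_configure_args /opt/mellanox/mxm)
```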

Ken