[mpich-discuss] Problem with Mpich3.0.4 build for WRF run across multiple nodes in a cluster.

Rob Latham robl at mcs.anl.gov
Wed Jun 22 10:33:44 CDT 2016



On 06/21/2016 10:09 PM, Teck-Bin Arthur Lim wrote:
> Dear Rob,
>
> Thank you for your pointer.  It came down to a simple machinefile syntax
> error; mpich is fine.  I got it working across multiple nodes.

That's right: the 'cpu=3' notation does not work.  Where did you get the
idea to do that?  If it's in some documentation we have, we'd like to
correct it.

For anyone coming here via Google: the correct way to tell mpich that
you have, say, 6 CPUs on a node is with a colon.  Here's an example:

$ cat machinefile
node1:6
node2:6
node3:1
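
With that machinefile, a launch that fills all 13 slots might look like
the following (a minimal sketch; ./a.out stands in for your actual
executable):

$ mpiexec -f machinefile -n 13 ./a.out

Hydra should place the first 6 ranks on node1, the next 6 on node2, and
the last rank on node3.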

https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager

==rob

>
> Arthur
>
> -----Original Message-----
> From: Teck-Bin Arthur Lim [mailto:limtba at ihpc.a-star.edu.sg]
> Sent: Wednesday, June 22, 2016 10:23 AM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Problem with Mpich3.0.4 build for WRF run across multiple nodes in a cluster.
>
> Dear Rob,
>
> Thanks, my machinefile and script are simple, as follows:
> *********************************************
> [limtba at fdns00_ws testenv-testlib]$ more mpi-script
> nohup time mpirun -machinefile machinefile -np 8 ./a.out >& log.aout &
>
> [limtba at fdns00_ws testenv-testlib]$ more machinefile
> fdns00_ib cpu=3
> fdns01_ib cpu=2
> fdns02_ib cpu=3
> *********************************************
> Is there some syntax error?  This is only a small test.
> Without a machinefile, mpich works fine on a single node.
>
> Arthur
>
> -----Original Message-----
> From: Rob Latham [mailto:robl at mcs.anl.gov]
> Sent: Tuesday, June 21, 2016 11:21 PM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Problem with Mpich3.0.4 build for WRF run across multiple nodes in a cluster.
>
>
>
> On 06/21/2016 05:58 AM, Teck-Bin Arthur Lim wrote:
>> Hi ,
>>
>> I have run into a basic problem trying to get an old mpich version
>> (3.0.4) to do a parallel run across different machines in a
>> mini-cluster.  The error messages from invoking mpirun are:
>>
>> *****************error messages********************
>>
>> [limtba at fdns00_ws testenv-testlib]$ ./mpi-script
>>
>> [limtba at fdns00_ws testenv-testlib]$ more log.aout
>>
>> [mpiexec at fdns00_ws] HYDU_process_mfile_token (./utils/args/args.c:299):
>> token cpu not supported at this time
>>
>> [mpiexec at fdns00_ws] HYDU_parse_hostfile (./utils/args/args.c:347):
>> unable to process token
>
> What does your machinefile look like?  It sounds like you've got something in there that Hydra does not expect.
>
> ==rob
>>
>> [mpiexec at fdns00_ws] mfile_fn (./ui/mpich/utils.c:341): error parsing
>> hostfile
>>
>> [mpiexec at fdns00_ws] match_arg (./utils/args/args.c:153): match handler
>> returned error
>>
>> [mpiexec at fdns00_ws] HYDU_parse_array (./utils/args/args.c:175):
>> argument matching returned error
>>
>> [mpiexec at fdns00_ws] parse_args (./ui/mpich/utils.c:1609): error
>> parsing input array
>>
>> [mpiexec at fdns00_ws] HYD_uii_mpx_get_parameters
>> (./ui/mpich/utils.c:1660): unable to parse user arguments
>>
>> [mpiexec at fdns00_ws] main (./ui/mpich/mpiexec.c:153): error parsing
>> parameters
>>
>> Command exited with non-zero status 255
>>
>> *****************error messages********************
>>
>> This old version was downloaded from the WRF site
>> (http://www2.mmm.ucar.edu/wrf/OnLineTutorial/compilation_tutorial.php#STEP2)
>> and was built with essentially all the default configuration settings,
>> with no option arguments given in the configure/make/make-install
>> process:
>>
>> $ ./configure -prefix=$DIR/mpich
>> $ make
>> $ make install
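>>
>> Since WRF needs gcc 4.4 or higher, the build can also be pinned to
>> that compiler by passing it to configure; roughly like this (gcc44 and
>> gfortran44 are the CentOS 5 names for the 4.4 compilers here, and may
>> differ on other setups):
>>
>> $ ./configure -prefix=$DIR/mpich CC=gcc44 FC=gfortran44
>> $ make
>> $ make install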
>>
>> There were no error messages during the build process, and mpirun
>> works fine for parallel runs using multiple processors on a single
>> node, but I get the above error messages when attempting a parallel
>> run across multiple machines.
>>
>> I need some advice on how to get this old mpich 3.0.4 working across
>> machines.  The OS on these machines is CentOS 5.5, with gcc 4.1.2 and
>> gcc 4.4.7 installations.  As WRF needs gcc 4.4 or higher, I have built
>> mpich 3.0.4 using gcc 4.4.7.
>>
>> I would appreciate any help and advice.
>>
>> Many Thanks.
>>
>>
>>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

