[mpich-discuss] Problems with running make

Gustavo Correa gus at ldeo.columbia.edu
Sat Mar 1 13:20:30 CST 2014


Hi Ron

Please, see my comments inline.

On Mar 1, 2014, at 6:29 AM, Reuti wrote:

> Hi,
> 
> Am 01.03.2014 um 12:18 schrieb Ron Palmer:
> 
>> Gus and others,
>> many thanks for your comments and suggestions. I am unsure whether I need fortran included as I am not in control of the software to be run on this linux cluster. I have asked but not yet got a reply...
>> 
>> However, I did a yum search gfortran and installed the gcc-gfortran.x86_64 (I hope that was the one you referred to). I followed up with the ./configure ... as per your suggestion. That did not work. I then tried with only ./configure --prefix=/my/bin/folder. That worked so I also did make and make install.
>> 

Maybe there was a typo in the configure command line.
I am not 100% sure about the required gfortran yum RPMs.
You may also need the corresponding "devel" package, if you want to check,
but I am not really sure.  Maybe you already installed everything you need for
a functional gfortran.
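If you want to double-check that gfortran itself works, something along these lines
should be enough (the package name is the one you installed; rpm output varies by distribution):

gfortran --version
rpm -qa | grep gfortran

If the first command prints a version number, the compiler is installed and in your PATH.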

As far as I can tell from your c2.txt file, configure seems to have recognized gfortran.
I can't tell for sure whether it actually built the MPICH F77 and F90 interfaces, although chances are that it did.
You can check whether mpif77 and mpif90 are present in the MPICH bin directory,
which would be a good sign.
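For instance (assuming /my/bin/folder is the --prefix you used):

ls /my/bin/folder/bin/mpif77 /my/bin/folder/bin/mpif90
/my/bin/folder/bin/mpif90 -show

The -show option just prints the underlying compiler command the wrapper would run,
so it is a cheap way to confirm that the Fortran wrappers were built and point to gfortran.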

I would not leave the F77 and F90 interfaces out of the game, 
because sooner or later somebody will come with a Fortran MPI program 
to run on your machine.

>> I have a machinefile with three linux computers, all sharing the same mpi binaries. Running hostname on all 20 cores works, so I guess I should conclude that mpi works, right?
> 
> Are they using the same binaries in /my/bin/folder for a non-interactive startup too, i.e.:
> 
> $ ssh sargeant which hydra_pmi_proxy
> 
> returns the same as on the master (maybe ~/.bashrc needs to set the $PATH)?
> 
> -- Reuti
> 

Make sure you check Reuti's suggestions.
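For example, a quick loop over the nodes (substitute your actual host names) would be:

for host in constable gainsborough sargeant ; do ssh $host which hydra_pmi_proxy ; done

Every line of output should show the same path, inside your MPICH bin directory.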

Did you install MPICH in an NFS shared (across all machines) directory,
or did you install on each machine in a local directory?

As Reuti pointed out, the environment setup is important.
If MPICH is not installed in a directory that Linux searches by default, such as /usr or /usr/local,
you need to add the MPICH bin directory to the PATH, and probably the lib directory to
LD_LIBRARY_PATH (if MPICH was built with shared libraries).
The simplest way is to do it in the .bashrc/.tcshrc file in the user's home directory (which I presume
is shared via NFS across the computers; otherwise you need to edit these files on each
computer).
Say:

- bash:

export PATH=${PATH}:/my/mpich/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/my/mpich/lib

- tcsh:

setenv PATH ${PATH}:/my/mpich/bin
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/my/mpich/lib
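
After editing the file, open a new shell (or source the file) and verify that the settings
are picked up, both locally and through a non-interactive ssh, with something like:

which mpicc mpiexec
echo $LD_LIBRARY_PATH
ssh sargeant 'which mpiexec ; echo $LD_LIBRARY_PATH'

(sargeant here is just one of your hosts, used as an example; the paths printed should
include the MPICH bin and lib directories).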

Also, you can go a long way in troubleshooting these problems if you read the MPICH
documentation (Installer's Guide, User's Guide, README file):

http://www.mpich.org/documentation/guides/

If you really could run mpiexec hostname across all computers, I would guess you
don't have any security or network related issues (i.e. passwordless ssh works
across the nodes, there is no firewall blocking, the node names are properly
resolved, most likely via /etc/hosts, and so on).

To make sure MPICH works with a simple MPI program, 
compile the cpi.c program in the examples directory (comes with the source code tarball),
and run it across all nodes.

Something like this:

mpicc -o cpi cpi.c 
mpiexec -machinefile all_machines -np 24 ./cpi 

If it runs, then MPICH is alright, and any issues you are having with the other program
are not an MPICH issue, but a bug or misconfiguration in that program.
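
In case it helps, the machinefile is just a plain text file with one host per line,
optionally followed by the number of processes to place on that host.
The name "all_machines" above is arbitrary; the host names and counts below are only
an example (three hosts with 8 slots each matches the -np 24 above); adjust the counts
to the actual cores on each of your machines, you mentioned 20 in total, and match -np
accordingly:

constable:8
gainsborough:8
sargeant:8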


>> I then tried the software I am supposed to run on the cluster. It works on all nodes on the master computer, but not if other computers are involved. I have captured the output from such a failed attempt to run that software on all available nodes on all available computers in the cluster, see below. I have attached c2.txt and m2.txt (and the output of the failed run) in the attached gzipped file (the failed run in run2.txt and in the image below may not be from the same failed run...).
>> 

>> So, is it an issue for the software developers or is it an issue with mpich?
>> 

In the screenshot you sent, MPI ranks 9, 11, 12, and 14 don't print out their ranks.
This may or may not indicate a programming error; however, combined with the fact that
the program fails in MPI_Barrier, which is a collective call, it suggests a bug in the program.
One possibility is that the MPI program is wrong, and somehow ranks 9, 11, 12, and 14
did not call MPI_Barrier (collective calls, as the name says, are collective:
they must be called by all ranks in the communicator).

Another lead is that both the screenshot and the run2.txt file
complain of "Communication error with rank 8" in the MPI_Barrier.
In each case rank 8 is on a different computer (gainsborough and constable).
It may be that rank 8 somehow never called MPI_Barrier.
For instance, MPI collective calls inside "if" conditionals can cause traps like this,
because the "if" condition may turn out to be true on some ranks and false on others.

Judging from the small amount of output, the program seems to have failed
very early, which may help narrow the search for a possible bug (early in the code).

However, these are just wild guesses.

Is this an in-house program, public domain, or commercial software?
In any case, this may be an issue to bring to the developers of that code.

I hope this helps,
Gus Correa

> 
>> I would appreciate it if you could let me know whether mpich is properly compiled and installed and whether I should follow up with the application developers instead. I know that they are successfully running this code on V1.4.1.
>> 
>> Thanks for your prompt support.
>> 
>> Cheers,
>> Ron
>> 
>> 
>> 
>> <ghdahehb.png>
>> 
>> 
>> 
>> On 1/03/2014 08:19, Gus Correa wrote:
>>> Hi Ron 
>>> 
>>> Your config.log shows that configure is picking up Gnu 
>>> /usr/bin/f77 as the Fortran compiler. 
>>> See an excerpt of config.log below. 
>>> 
>>> The Gnu f77/g77 cannot compile Fortran-90 and later code. 
>>> 
>>> You may want to install gfortran. 
>>> Configure didn't find gfortran. 
>>> However, you can install it easily from RPM through yum. 
>>> 
>>> Once you install gfortran, to avoid picking up wrong compilers, 
>>> you can also point configure to the compilers. 
>>> Assuming you want to use all Gnu compilers, this would do it: 
>>> 
>>> ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran --prefix=/wherever/you/want ... 
>>> 
>>> Another possibility is to follow Reuti's suggestion of 
>>> not compiling the MPICH Fortran-90 interface. 
>>> 
>>> I hope this helps, 
>>> Gus Correa 
>>> 
>>> ******* from your config.log ********* 
>>> ... 
>>> configure:17977: result: no 
>>> configure:17947: checking for gfortran 
>>> configure:17977: result: no 
>>> configure:17947: checking for f77 
>>> configure:17963: found /usr/bin/f77 
>>> configure:17974: result: f77 
>>> configure:18000: checking for Fortran 77 compiler version 
>>> configure:18009: f77 --version >&5 
>>> GNU Fortran (GCC) 3.4.6 20060404 (Red Hat 3.4.6-19.el6) 
>>> ... 
>>> ***************************************** 
>>> 
>>> On 02/28/2014 05:02 AM, Reuti wrote: 
>>>> Hi, 
>>>> 
>>>> Am 28.02.2014 um 09:22 schrieb Ron Palmer: 
>>>> 
>>>>> I am struggling with running make on MPICH, V3.1. I believe that ./configure went well, but I do not understand what caused the errors in make. I thought it might be associated with the Fortran compile, but I have F77 installed (though I am somewhat confused about whether Fortran 90 is also installed). The README file suggested that I send an email to this list with the four nominated text files attached. 
>>>> 
>>>> You can try to skip f90: 
>>>> 
>>>> $ ./configure --disable-fc ... 
>>>> 
>>>> -- Reuti 
>>>> 
>>>> 
>>>>> The end of the m.txt file is included here for easy access, full details are attached... 
>>>>> 
>>>>>  CC       src/mpi/topo/nhb_alltoallw.lo 
>>>>>  CC       src/binding/f90/create_f90_int.lo 
>>>>>  CC       src/binding/f90/create_f90_real.lo 
>>>>> src/binding/f90/create_f90_real.c: In function 'PMPI_Type_create_f90_real': 
>>>>> src/binding/f90/create_f90_real.c:73: error: expected expression before ',' token 
>>>>> src/binding/f90/create_f90_real.c:74: error: expected expression before ',' token 
>>>>> make[2]: *** [src/binding/f90/create_f90_real.lo] Error 1 
>>>>> make[2]: Leaving directory `/inv/mpich-3.1' 
>>>>> make[1]: *** [all-recursive] Error 1 
>>>>> make[1]: Leaving directory `/inv/mpich-3.1' 
>>>>> make: *** [all] Error 2 
>>>>> 
>>>>> I look forward to any suggestions any of you may have. 
>>>>> Regards, Ron 
>>>>> 
>>>>> 
>>>>> <mpi_logs.tar.gz>
>>>> 
>>> 
>> 
>> 
>> 
>> <run2.tar.gz>
> 



