[mpich-discuss] MPICH3 and Nagfor: Corrupts writing/IO?

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC] matthew.thompson at nasa.gov
Thu Jan 12 10:54:50 CST 2017


All,

I've been having some "fun" recently trying to get an MPI stack built 
with nagfor 6.1. I tried Open MPI and MVAPICH2 and couldn't even build 
the MPI stack; SGI MPT doesn't like nagfor, and I'm guessing Intel MPI 
wouldn't either.

So, I figured I'd go for MPICH3. And, lo and behold, building with 
nagfor 6.1 and gcc 5.3 (for CC and CXX) with:

> ./configure --prefix=$SWDEV/MPI/mpich/3.2/nagfor_6.1-gcc_5.3-nomismatchall \
>      --disable-wrapper-rpath CC=gcc CXX=g++ FC=nagfor F77=nagfor \
>      CFLAGS='-fpic -m64' CXXFLAGS='-fpic -m64' \
>      FCFLAGS='-PIC -abi=64' FFLAGS='-PIC -abi=64 -mismatch' \
>      --enable-fortran=all --enable-cxx

I got something to build. Huzzah!

I then tried the cpi test, and it worked! It even detected I was on Slurm, 
according to mpirun -verbose.

I then tried a simple Fortran 90 Hello world program and...crash:

> (1211)(master) $ cat helloWorld.F90
> program hello_world
>
>    use mpi
>
>    implicit none
>
>    integer :: comm
>    integer :: myid, npes, ierror
>    integer :: name_length
>
>    character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name
>
>    call mpi_init(ierror)
>
>    comm = MPI_COMM_WORLD
>
>    call MPI_Comm_Rank(comm,myid,ierror)
>    call MPI_Comm_Size(comm,npes,ierror)
>    call MPI_Get_Processor_Name(processor_name,name_length,ierror)
>
>    write (*,'(A,1X,I4,1X,A,1X,I4,1X,A,1X,A)') "Process", myid, "of", npes, "is on", trim(processor_name)
>
>    call MPI_Finalize(ierror)
>
> end program hello_world
> (1212)(master) $ mpifort -o helloWorld.exe helloWorld.F90
> NAG Fortran Compiler Release 6.1(Tozai) Build 6113
> [NAG Fortran Compiler normal termination]
> (1213)(master) $ mpirun -np 4 ./helloWorld.exe
> srun.slurm: cluster configuration lacks support for cpu binding
> Runtime Error: Buffer overflow on output
> Program terminated by I/O error on unit 6 (Output_Unit,Unformatted,Direct)
> Runtime Error: Buffer overflow on output
> Program terminated by I/O error on unit 6 (Output_Unit,Unformatted,Direct)
> Runtime Error: Buffer overflow on output
> Program terminated by I/O error on unit Runtime Error: Buffer overflow on output
> Program terminated by I/O error on unit 6 (Output_Unit,Unformatted,Direct)
> 6 (Output_Unit,Unformatted,Direct)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 7642 RUNNING AT borgl189
> =   EXIT CODE: 134
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
> This typically refers to a problem with your application.
> Please see the FAQ page for debugging suggestions

Weird. So, I decided to try something different, this program:

> program main
>    implicit none
>    real :: a
>    a = 1240.0
>    write (*,*) "Hello world", a
> end program main

It looks boring, it's standard-compliant, and nagfor likes it:

> (1226) $ nagfor test.F90 && ./a.out
> NAG Fortran Compiler Release 6.1(Tozai) Build 6113
> [NAG Fortran Compiler normal termination]
>  Hello world   1.2400000E+03

Looks correct. Now let's try mpifort:

> (1232) $ mpifort test.F90 && ./a.out
> NAG Fortran Compiler Release 6.1(Tozai) Build 6113
> [NAG Fortran Compiler normal termination]
>  Hello world
> Segmentation fault (core dumped)

You can't really see it here, but that "Hello world" is surrounded by LF 
characters. Like a literal LineFeed...and then it core dumps.
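
For anyone who wants to see those invisible characters: piping output 
through od -c prints every byte, so a line feed shows up as \n. This is 
only a sketch. The printf below is a stand-in for ./a.out, and the bytes 
it emits are an assumption for illustration, not a capture of the real run.

```shell
# Stand-in for the program's output (assumed bytes, NOT a capture of the
# actual mpifort-built binary): a line feed, the text, another line feed.
printf '\n Hello world\n' | od -c

# With the real binary one would instead run something like:
#   ./a.out | od -c
# and look for \n entries surrounding the text.
```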

Now let's try running with mpirun as well:

> (1233) $ mpifort test.F90 && mpirun -np 1 ./a.out
> NAG Fortran Compiler Release 6.1(Tozai) Build 6113
> [NAG Fortran Compiler normal termination]
> srun.slurm: cluster configuration lacks support for cpu binding
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 8520 RUNNING AT borgl189
> =   EXIT CODE: 139
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
> This typically refers to a problem with your application.
> Please see the FAQ page for debugging suggestions

All righty then.

Does anyone have advice for this? I'll fully accept that I configured 
MPICH3 wrong, as it's the first time in a while I've built MPICH (think 
MPICH2). But, still, I'm not using any exciting flags.

Matt

-- 
Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson