[mpich-discuss] MPI_File_open fails on the cluster (2 node)

Wesley Bland wbland at mcs.anl.gov
Tue Aug 13 10:25:13 CDT 2013


Can you update to a recent version of MPICH? 1.4.1p1 is very old at this point. The most recent stable version is 3.0.4.

Wesley  


On Tuesday, August 13, 2013 at 9:43 AM, 정재용 wrote:

> I set up a 2-node cluster and installed MPICH using the Intel compiler. The OS is CentOS 6.4.
>  
> The nodes are connected by Ethernet and share the directories where the compiler and MPICH are installed via NFS.
>  
> SSH connections work without a password.
>  
> As a test, I compiled a simple MPI_File_open example with mpicc and ran it with mpiexec from the NFS directory.
>  
> The code is  
>  
> #include "mpi.h"
> #include
>  
>  
> int main( int argc, char *argv[] )
> {
> MPI_Fint handleA, handleB;
> int rc;
> int errs = 0;
> int rank;
> MPI_File cFile;
>  
>  
> MPI_Init( &argc, &argv );
> MPI_Comm_rank( MPI_COMM_WORLD, &rank );
> rc = MPI_File_open( MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_DELETE_ON_CLOSE | MPI_MODE_CREATE, MPI_INFO_NULL, &cFile );
> if (rc) {
> printf( "Unable to open file \"temp\"\n" );fflush(stdout);
> }
> else {
> MPI_File_close( &cFile );
> }
> MPI_Finalize();
> return 0;
> }
>  
> This example just opens the file and closes it in parallel.
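>  
> As an aside, here is a minimal sketch of how the error code returned by MPI_File_open could be decoded into a readable message, using only the standard MPI_Error_string routine on the rc and rank from the example above (msg and msglen are just local names for the sketch):
>  
> char msg[MPI_MAX_ERROR_STRING];
> int msglen;
> if (rc != MPI_SUCCESS) {
>     /* Translate the MPI error code into a human-readable string. */
>     MPI_Error_string( rc, msg, &msglen );
>     printf( "MPI_File_open failed on rank %d: %s\n", rank, msg );
>     fflush( stdout );
> }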
>  
> On a single machine, it worked well.
>  
> However, it failed when processes on the two machines executed the code in parallel.
>  
> The error message:
>  
> Internal Error: invalid error code 209e0e (Ring ids do not match) in MPIR_Bcast_intra:1119
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1478)......: MPI_Bcast(buf=0x1deb980, count=1, MPI_CHAR, root=0, comm=0x84000004) failed
> MPIR_Bcast_impl(1321).:  
> MPIR_Bcast_intra(1119):  
>  
> When I added the prefix nfs: to the filename, as "nfs:temp", it worked well.  
>  
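> For concreteness, a minimal sketch of the same open call with the file-system prefix added (only the filename string changes; everything else in the example stays the same):
>  
> /* The "nfs:" prefix tells ROMIO explicitly which file-system driver to use
>    for this file, rather than letting each process detect it on its own. */
> rc = MPI_File_open( MPI_COMM_WORLD, "nfs:temp",
>                     MPI_MODE_RDWR | MPI_MODE_DELETE_ON_CLOSE | MPI_MODE_CREATE,
>                     MPI_INFO_NULL, &cFile );
>  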
> However, I would rather not add that prefix, because I need to run a very large code on the cluster, and modifying every place it opens a file would be very hard.
>  
> Could you let me know what the problem is and how to solve it?
>  
> The NFS options are:
> (client)
> mount -t nfs -o noac,nfsvers=3 "server_directory" "mount_directory"
> (server, /etc/exports)
> sync,rw,no_root_squash
>  
> I shared /opt and /home/username  
>  
> The MPICH configuration is:
> MPICH2 Version:        1.4.1p1
> MPICH2 Release date:    Thu Sep  1 13:53:02 CDT 2011
> MPICH2 Device:        ch3:nemesis
> MPICH2 configure:     --prefix=/home/master/lib/mpich/mpich2-1.4.1p1 CC=icc F77=ifort CXX=icpc FC=ifort --enable-romio --with-file-system=nfs+ufs --with-pm=hydra
> MPICH2 CC:     icc    -O2
> MPICH2 CXX:     icpc   -O2
> MPICH2 F77:     ifort   -O2
> MPICH2 FC:     ifort   -O2
>  
> ------------------------------------------------------
> Jaeyong Jeong  
> Department of Mechanical Engineering
> Pohang University of Science and Technology
>  
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>  
>  

