[mpich-commits] [mpich] MPICH primary repository branch, master, updated. v3.2a2-84-gef1cf14

Service Account noreply at mpich.org
Fri Dec 19 16:15:56 CST 2014


This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "MPICH primary repository".

The branch, master has been updated
       via  ef1cf141c1bd4f498f8a5fc6498ce021d7b030ab (commit)
      from  580d9ce8907143a68768079eae1af334fccdabdb (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.mpich.org/mpich.git/commitdiff/ef1cf141c1bd4f498f8a5fc6498ce021d7b030ab

commit ef1cf141c1bd4f498f8a5fc6498ce021d7b030ab
Author: Paul Coffman <pkcoff at us.ibm.com>
Date:   Fri Dec 12 14:51:18 2014 -0600

    barrier in close whenever shared files supported
    
    Currently MPI_File_close performs a barrier only when the
    ADIO_SHARED_FP feature is enabled AND the ADIO_UNLINK_AFTER_CLOSE
    feature is disabled.  The barrier sits right before the code that
    closes the shared file pointer and potentially unlinks the shared file
    itself.  PE testing on GPFS revealed a situation using the
    non-collective MPI_File_read_shared/MPI_File_write_shared where,
    with this implementation, all tasks need to wait for all other tasks
    to complete processing before the shared file pointer is unlinked, or
    the open of the shared file pointer can fail.  The simplest
    illustration is 2 tasks that each do this:
    MPI_File_open
    MPI_File_set_view
    MPI_File_read_shared
    MPI_File_close
    
    Both tasks call MPI_File_read_shared at the same time.  That call first
    runs ADIO_Get_shared_fp, which opens the shared file pointer in create
    mode.  Only 1 task can actually create the file, so there is a race to
    see which one gets it done first.  If task 0 creates it, task 0 is the
    winner: it goes on to use the shared file pointer, reads the file, and
    then calls MPI_File_close, which first unlinks the shared file pointer
    and then closes the output file.  Meanwhile, task 1 lost the race to
    create the file and gets an error; the error handling in gpfs goes into
    effect and task 1 now just tries to open the file that task 0 created.
    The problem is that this error handling took longer than task 0 took to
    read and close the output file.  At the time task 0 does the close it
    is the only process with a link, since task 1 is still in the create
    file error handling code, so gpfs goes ahead and deletes the shared
    file pointer.  When the error handling code for task 1 does complete
    and task 1 tries to do the open, the file is no longer there, so the
    open fails, as does the subsequent read of the shared file pointer.
    Currently GPFS has the ADIO_UNLINK_AFTER_CLOSE feature enabled, so the
    barrier is skipped there.  The fix is to remove the additional
    condition that ADIO_UNLINK_AFTER_CLOSE be disabled, so that the barrier
    in the close is always done when shared file pointers are supported.
    Presumably this could be an issue for any parallel file system, so this
    change is being done in the common code.
    
    See ticket #2214
    
    Signed-off-by: Paul Coffman <pkcoff at us.ibm.com>
    Signed-off-by: Rob Latham <robl at mcs.anl.gov>

diff --git a/src/mpi/romio/mpi-io/close.c b/src/mpi/romio/mpi-io/close.c
index cb1df99..520f206 100644
--- a/src/mpi/romio/mpi-io/close.c
+++ b/src/mpi/romio/mpi-io/close.c
@@ -58,9 +58,9 @@ int MPI_File_close(MPI_File *fh)
 	/* POSIX semantics say a deleted file remains available until all
 	 * processes close the file.  But since when was NFS posix-compliant?
 	 */
-	if (!ADIO_Feature(adio_fh, ADIO_UNLINK_AFTER_CLOSE)) {
-		MPI_Barrier((adio_fh)->comm);
-	}
+	/* this used to be gated by the lack of the UNLINK_AFTER_CLOSE feature,
+	 * but a race condition in GPFS necessated this.  See ticket #2214 */
+	MPI_Barrier((adio_fh)->comm);
 	if ((adio_fh)->shared_fp_fd != ADIO_FILE_NULL) {
 	    MPI_File *fh_shared = &(adio_fh->shared_fp_fd);
 	    ADIO_Close((adio_fh)->shared_fp_fd, &error_code);

-----------------------------------------------------------------------

Summary of changes:
 src/mpi/romio/mpi-io/close.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)


hooks/post-receive
-- 
MPICH primary repository

