[mpich-discuss] Worries with ROMIO on NFS since commit b4ab2f118d

Eric Chamberland Eric.Chamberland at giref.ulaval.ca
Mon Dec 4 19:18:59 CST 2017


Hi,

I have taken some time to put together a relatively "small" code that 
reproduces the problem.

The good thing is that I can now extract the exact sequence of almost 
all the MPI I/O calls we make in our code and replay it as pure MPI 
calls in a simple C program, independent of our in-house code.

To reproduce the bug, just compile the attached file with any 
mpich/master since commit b4ab2f118d (Nov 8), and launch the resulting 
executable with 3 processes, with the second attachment 
(file_for_bug.data) saved in the current working directory on an *NFS* 
path.

You should see something like this:

ERROR Returned by MPI: 604040736
ERROR_string Returned by MPI: Other I/O error , error stack:
ADIOI_NFS_READSTRIDED(523): Other I/O error Bad file descriptor
ERROR Returned by MPI: 268496416
ERROR_string Returned by MPI: Other I/O error , error stack:
ADIOI_NFS_READSTRIDED(523): Other I/O error Operation now in progress
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
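
The trailing text on each ADIOI_NFS_READSTRIDED line ("Bad file 
descriptor", "Operation now in progress") matches the standard glibc 
strerror() messages for EBADF and EINPROGRESS. A minimal C sketch of how 
such a line could be assembled follows. The helper name format_io_error 
is hypothetical, and that ROMIO builds its message from errno in this 
way is an assumption, not something confirmed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not ROMIO code): append the strerror() text for a
 * saved errno value to a generic "Other I/O error" prefix.
 * Returns the number of characters snprintf() wanted to write. */
int format_io_error(int saved_errno, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "Other I/O error %s", strerror(saved_errno));
}
```

Calling format_io_error(EBADF, buf, sizeof buf) fills buf with the prefix 
followed by the EBADF message. If two ranks report different strings for 
the same failing line, one possibility is that errno was left over from 
an unrelated earlier call rather than set by the read itself.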

If you launch it on a local drive, it works.

Can someone please confirm that this reproduces the problem?

Moreover, if I launch it with valgrind, even on a local disk, it 
complains like this, on process 0:

==99023== Warning: invalid file descriptor -1 in syscall read()
==99023==    at 0x53E2CB0: __read_nocancel (in /lib64/libc-2.19.so)
==99023==    by 0x5041606: file_to_info_all (system_hints.c:101)
==99023==    by 0x5041606: ADIOI_process_system_hints (system_hints.c:150)
==99023==    by 0x50311B8: ADIO_Open (ad_open.c:123)
==99023==    by 0x50161FD: PMPI_File_open (open.c:154)
==99023==    by 0x400F91: main (mpich_mpiio_nfs_bug_read.c:42)
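
The warning says read() was called with descriptor -1, which is the 
value open() returns on failure. One possible reading, stated here as an 
assumption, is that an optional file could not be opened and the read 
was attempted anyway. The sketch below shows the guarded pattern in 
plain POSIX C; read_optional_file is a hypothetical helper and is not 
taken from system_hints.c.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not ROMIO code): read up to len-1 bytes of an
 * optional file into buf and NUL-terminate it.
 * Returns the number of bytes read, 0 if the file does not exist,
 * or -1 on any other error (errno is left set). */
ssize_t read_optional_file(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);
    buf[0] = '\0';

    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        /* A missing optional file is not an error: skip read() entirely
         * instead of passing -1 to it. */
        return (errno == ENOENT) ? 0 : -1;
    }

    ssize_t n = read(fd, buf, len - 1);
    int saved_errno = errno;    /* close() may overwrite errno */
    close(fd);
    if (n == -1) {
        errno = saved_errno;
        return -1;
    }
    buf[n] = '\0';
    return n;
}
```

With this shape, a missing file never reaches read(), so valgrind has 
nothing to warn about on that path.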

Thanks,

Eric

On 21/11/17 02:49 PM, Eric Chamberland wrote:
> Hi Mr. Latham,
>
> I have more information now.
>
> When I try to run my example on NFS, I get the following error code:
>
> error #812707360
> Other I/O error , error stack:
> ADIOI_NFS_READSTRIDED(523): Other I/O error Success
>
> that is returned by MPI_File_read_all_begin
>
> When I try on a local disk, everything is fine.
>
> Here are all files about my actual build:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_config.log
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_c.txt
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_m.txt
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mi.txt
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpl_config.log
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_pm_hydra_config.log
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpiexec_info.txt
>
> I hope this helps to dig further into this issue.
>
> Thanks,
>
> Eric
>
>
> On 15/11/17 03:55 PM, Eric Chamberland wrote:
>> Hi,
>>
>> We have been compiling with mpich/master every night since August 2016.
>>
>> Since Nov 8, the mpich/master branch has been failing our nightly build 
>> tests.
>>
>> Here is the Nov 8 config.log:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.08.05h36m02s_config.log
>>
>> And the Nov 7 config.log:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.07.05h36m01s_config.log
>>
>> Since Nov 8, a specific ROMIO test hangs indefinitely in optimized 
>> mode, and in DEBUG mode it triggers a strange (yet to be debugged) 
>> assertion in our code.
>>
>> I reran the test manually, and when I wrote the results to a local 
>> disk, everything was fine.
>>
>> However, when I write over *NFS*, the test fails.
>>
>> I have not yet debugged this far enough, but I suspect something 
>> related to one of:
>>
>> MPI_File_write_all_begin
>> MPI_File_write_all_end
>> MPI_File_read_all_begin
>> MPI_File_read_all_end
>> MPI_File_set_view
>> MPI_Type_free
>>
>> Am I the only one seeing these problems?
>>
>> Thanks,
>> Eric
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpich_mpiio_nfs_bug_read.c
Type: text/x-csrc
Size: 22211 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20171204/ee5da515/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file_for_bug.data
Type: application/octet-stream
Size: 106542 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20171204/ee5da515/attachment.obj>