[mpich-discuss] Worries with ROMIO on NFS since commit b4ab2f118d
Eric Chamberland
Eric.Chamberland at giref.ulaval.ca
Mon Dec 4 19:18:59 CST 2017
Hi,
I have taken some time to write a relatively small code that reproduces the
problem.
The good thing is that I can now extract the exact sequence of
almost any MPI I/O calls made by our code into a simple C program that
contains only MPI calls, independent of our in-house code.
To reproduce the bug, just compile the attached file with any
mpich/master since commit b4ab2f118d (Nov 8), and launch the resulting
executable with 3 processes, with the second attachment
(file_for_bug.data) saved in the current working directory on an *NFS* path.
You should see something like this:
ERROR Returned by MPI: 604040736
ERROR_string Returned by MPI: Other I/O error , error stack:
ADIOI_NFS_READSTRIDED(523): Other I/O error Bad file descriptor
ERROR Returned by MPI: 268496416
ERROR_string Returned by MPI: Other I/O error , error stack:
ADIOI_NFS_READSTRIDED(523): Other I/O error Operation now in progress
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
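A note for readers decoding the output above: the text after "Other I/O error" ("Bad file descriptor", "Operation now in progress", and "Success" in the earlier report quoted below) matches the strings glibc's strerror() returns for EBADF, EINPROGRESS and 0. The sketch below only illustrates that mapping. The helper name format_io_error is hypothetical, and whether ROMIO builds its message this way is an assumption, not something confirmed from its source.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not ROMIO code): builds "<prefix> <strerror(err)>"
 * in out and returns the value snprintf() returns. */
int format_io_error(char *out, size_t len, const char *prefix, int err)
{
    /* strerror() maps an errno value to the C library's message text,
     * e.g. EBADF -> "Bad file descriptor". */
    return snprintf(out, len, "%s %s", prefix, strerror(err));
}
```

Calling format_io_error(buf, sizeof buf, "Other I/O error", EBADF) gives a line of the same shape as the one printed above. If the text really does come from errno, seeing different strings for the same source line on different ranks might mean errno was not set by the call being reported, but that is only a guess.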
If you launch it on a local drive, it works.
Can someone confirm that they can reproduce the problem, please?
Moreover, if I launch it with valgrind, even on a local disk, it
complains like this, on process 0:
==99023== Warning: invalid file descriptor -1 in syscall read()
==99023== at 0x53E2CB0: __read_nocancel (in /lib64/libc-2.19.so)
==99023== by 0x5041606: file_to_info_all (system_hints.c:101)
==99023== by 0x5041606: ADIOI_process_system_hints (system_hints.c:150)
==99023== by 0x50311B8: ADIO_Open (ad_open.c:123)
==99023== by 0x50161FD: PMPI_File_open (open.c:154)
==99023== by 0x400F91: main (mpich_mpiio_nfs_bug_read.c:42)
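The valgrind warning above says that read() was called with descriptor -1, which is the value open() returns on failure. Here is a minimal sketch of that situation, assuming (this is not confirmed from system_hints.c) that a descriptor from a failed open() is read without being checked. The helper name and the path used in the usage note are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not ROMIO code): opens path, deliberately skips the
 * check on the returned descriptor, and reads from it.
 * Returns the errno left by a failed read(), or 0 if the read succeeded. */
int errno_after_unchecked_read(const char *path)
{
    char buf[16];
    int fd = open(path, O_RDONLY);          /* -1 if the file cannot be opened */

    errno = 0;
    ssize_t n = read(fd, buf, sizeof buf);  /* read(-1, ...) fails with EBADF */
    int saved = (n < 0) ? errno : 0;

    if (fd >= 0)
        close(fd);
    return saved;
}
```

With a path that does not exist, for example errno_after_unchecked_read("/nonexistent/romio-hints"), the function returns EBADF, and running it under valgrind should give the same kind of "invalid file descriptor -1 in syscall read()" warning. Checking fd before the read() avoids the call altogether.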
Thanks,
Eric
On 21/11/17 02:49 PM, Eric Chamberland wrote:
> Hi Mr. Latham,
>
> I have more information now.
>
> When I try to run my example on NFS, I have the following error code:
>
> error #812707360
> Other I/O error , error stack:
> ADIOI_NFS_READSTRIDED(523): Other I/O error Success
>
> that is returned by MPI_File_read_all_begin
>
> When I try on a local disk, everything is fine.
>
> Here are all the files from my current build:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_config.log
>
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_c.txt
>
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_m.txt
>
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mi.txt
>
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpl_config.log
>
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_pm_hydra_config.log
>
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpiexec_info.txt
>
>
> Hope this helps to dig further into this issue.
>
> Thanks,
>
> Eric
>
>
> On 15/11/17 03:55 PM, Eric Chamberland wrote:
>> Hi,
>>
>> We have been compiling with mpich/master every night since August 2016...
>>
>> Since Nov 8, the mpich/master branch has been failing our nightly build
>> tests.
>>
>> Here is the Nov 8 config.log:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.08.05h36m02s_config.log
>>
>>
>> And the Nov 7 config.log:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.07.05h36m01s_config.log
>>
>>
>>
>> Since Nov 8, a specific ROMIO test hangs indefinitely in
>> optimized mode, and in DEBUG mode I get a strange (yet to be
>> debugged) assertion in our code.
>>
>> I reran the test manually, and when I wrote the results to a local
>> disk, everything was fine.
>>
>> However, when I write over *NFS*, the test fails.
>>
>> I have not yet debugged this enough, but I suspect something
>> related to one of:
>>
>> MPI_File_write_all_begin
>> MPI_File_write_all_end
>> MPI_File_read_all_begin
>> MPI_File_read_all_end
>> MPI_File_set_view
>> MPI_Type_free
>>
>> Am I the only one seeing these problems?
>>
>> Thanks,
>> Eric
>>
>> _______________________________________________
>> discuss mailing list discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpich_mpiio_nfs_bug_read.c
Type: text/x-csrc
Size: 22211 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20171204/ee5da515/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: file_for_bug.data
Type: application/octet-stream
Size: 106542 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20171204/ee5da515/attachment.obj>