[mpich-discuss] BUG with ROMIO on NFS since commit b4ab2f118d
Eric Chamberland
Eric.Chamberland at giref.ulaval.ca
Thu Dec 14 10:21:53 CST 2017
Hi,
Could I get an account so that I can file this bug in the bug tracker?
Thanks,
Eric
On 08/12/17 10:14 AM, Guo, Yanfei wrote:
> Hi Eric,
>
> Sorry about the delay. I am able to reproduce the problem with your example (even with the master branch of MPICH). Maybe Rob can take a look at the problem.
>
> Yanfei Guo
> Assistant Computer Scientist
> MCS Division, ANL
>
>
>
> On 12/7/17, 1:59 PM, "Eric Chamberland" <Eric.Chamberland at giref.ulaval.ca> wrote:
>
> Hi,
>
> I first posted this bug to the list on Nov 15 and have still had no reply.
>
> Is there something I should know or is there a better place to post an
> MPICH bug?
>
> Thanks,
>
> Eric
>
>
> On 04/12/17 08:18 PM, Eric Chamberland wrote:
>> Hi,
>>
>> I have taken some time to put together a relatively small program that
>> reproduces the problem.
>>
>> The good thing is that I can now extract the exact sequence of almost
>> any MPI I/O calls our code makes and replay them as pure MPI calls in a
>> simple C program, independent of our in-house code.
>>
>> To reproduce the bug, compile the attached file with any mpich/master
>> since commit b4ab2f118d (Nov 8), then launch the resulting executable
>> with 3 processes, with the second attachment (file_for_bug.data) saved
>> in the working directory on an *NFS* path.
>>
>> You should see something like this:
>>
>> ERROR Returned by MPI: 604040736
>> ERROR_string Returned by MPI: Other I/O error , error stack:
>> ADIOI_NFS_READSTRIDED(523): Other I/O error Bad file descriptor
>> ERROR Returned by MPI: 268496416
>> ERROR_string Returned by MPI: Other I/O error , error stack:
>> ADIOI_NFS_READSTRIDED(523): Other I/O error Operation now in progress
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
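
A note on reading the output above: the words after "Other I/O error" ("Bad file descriptor", "Operation now in progress") are the same as the usual C library descriptions of EBADF and EINPROGRESS. Assuming (this is not confirmed anywhere in this thread) that ROMIO appends strerror(errno) to the message, two processes reporting different texts for the same failing call would suggest that errno held unrelated values when the error was raised. A minimal sketch for mapping such a text back to an errno value; errno_from_text is a hypothetical helper, not part of MPICH:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: return the first errno value in [0, 255] whose
 * strerror() text equals msg, or -1 if none matches. The texts are
 * platform- and locale-dependent, so this is only a diagnostic aid. */
int errno_from_text(const char *msg)
{
    for (int e = 0; e < 256; e++) {
        if (strcmp(strerror(e), msg) == 0)
            return e;
    }
    return -1;
}
```

Calling errno_from_text("Bad file descriptor") should give EBADF on common C libraries; the exact wording for other codes varies between libc implementations, so a -1 result only means that no description matched on the machine where it was run.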
>>
>> If you launch it on a local drive, it works.
>>
>> Can someone please confirm that they can reproduce the problem?
>>
>> Moreover, if I launch it with valgrind, even on a local disk, it
>> complains like this, on process 0:
>>
>> ==99023== Warning: invalid file descriptor -1 in syscall read()
>> ==99023== at 0x53E2CB0: __read_nocancel (in /lib64/libc-2.19.so)
>> ==99023== by 0x5041606: file_to_info_all (system_hints.c:101)
>> ==99023== by 0x5041606: ADIOI_process_system_hints (system_hints.c:150)
>> ==99023== by 0x50311B8: ADIO_Open (ad_open.c:123)
>> ==99023== by 0x50161FD: PMPI_File_open (open.c:154)
>> ==99023== by 0x400F91: main (mpich_mpiio_nfs_bug_read.c:42)
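
The trace above shows read() being called with descriptor -1 from file_to_info_all. A common cause of that pattern is an open() whose failure is not checked before reading; whether that is what system_hints.c does is an assumption here, since its source is not shown in this thread. The sketch below shows what the kernel reports for such a call and one way to guard against it. Both function names are hypothetical and are not taken from ROMIO:

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* read() on descriptor -1 fails and sets errno to EBADF.
 * Returns the errno observed, or 0 if the read unexpectedly succeeded. */
int errno_of_read_on_invalid_fd(void)
{
    char byte;
    errno = 0;
    if (read(-1, &byte, 1) == -1)
        return errno;
    return 0;
}

/* Hypothetical guarded reader for an optional file: returns the number
 * of bytes read into buf, or -1 if the file cannot be opened or read.
 * read() is never called when open() has failed. */
long read_optional_file(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;          /* e.g. the optional file does not exist */
    ssize_t n = read(fd, buf, len);
    close(fd);
    return (long)n;
}
```

With the guard in place, a missing file is reported through the return value and no system call is made on an invalid descriptor, so valgrind would have nothing to warn about in this code path.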
>>
>> Thanks,
>>
>> Eric
>>
>> On 21/11/17 02:49 PM, Eric Chamberland wrote:
>>> Hi Mr. Latham,
>>>
>>> I have more information now.
>>>
>>> When I try to run my example on NFS, I get the following error:
>>>
>>> error #812707360
>>> Other I/O error , error stack:
>>> ADIOI_NFS_READSTRIDED(523): Other I/O error Success
>>>
>>> This error is returned by MPI_File_read_all_begin.
>>>
>>> When I try on a local disk, everything is fine.
>>>
>>> Here are all the files for my current build:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_config.log
>>>
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_c.txt
>>>
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_m.txt
>>>
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mi.txt
>>>
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpl_config.log
>>>
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_pm_hydra_config.log
>>>
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpiexec_info.txt
>>>
>>>
>>> Hope this helps to dig further into this issue.
>>>
>>> Thanks,
>>>
>>> Eric
>>>
>>>
>>> On 15/11/17 03:55 PM, Eric Chamberland wrote:
>>>> Hi,
>>>>
>>>> We have been compiling with mpich/master every night since August 2016.
>>>>
>>>> Since Nov 8, the mpich/master branch has been failing our nightly build
>>>> tests.
>>>>
>>>> Here is the Nov 8 config.log:
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.08.05h36m02s_config.log
>>>>
>>>>
>>>> And here is the Nov 7 config.log:
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.07.05h36m01s_config.log
>>>>
>>>>
>>>>
>>>> Since Nov 8, a specific ROMIO test hangs indefinitely in optimized
>>>> mode, and in DEBUG mode I get a strange (yet to be debugged) assertion
>>>> in our code.
>>>>
>>>> I reran the test manually: when I write the results to a local disk,
>>>> everything is fine.
>>>>
>>>> However, when I write over *NFS*, the test fails.
>>>>
>>>> I have not debugged this far enough yet, but I suspect something
>>>> related to one of:
>>>>
>>>> MPI_File_write_all_begin
>>>> MPI_File_write_all_end
>>>> MPI_File_read_all_begin
>>>> MPI_File_read_all_end
>>>> MPI_File_set_view
>>>> MPI_Type_free
>>>>
>>>> Am I the only one seeing these problems?
>>>>
>>>> Thanks,
>>>> Eric
>>>>
>>>> _______________________________________________
>>>> discuss mailing list discuss at mpich.org
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mpich.org/mailman/listinfo/discuss