[mpich-discuss] BUG with ROMIO on NFS since commit b4ab2f118d

Jeff Hammond jeff.science at gmail.com
Thu Dec 14 11:57:57 CST 2017


Anyone with a GitHub account should be able to create an issue here:
https://github.com/pmodels/mpich/issues/.

Best,

Jeff

On Thu, Dec 14, 2017 at 8:21 AM, Eric Chamberland <
Eric.Chamberland at giref.ulaval.ca> wrote:

> Hi,
>
> Can I have an account so that I can report this bug in the bug tracker?
>
> Thanks,
>
> Eric
>
>
> On 08/12/17 10:14 AM, Guo, Yanfei wrote:
>
>> Hi Eric,
>>
>> Sorry about the delay. I am able to reproduce the problem with your
>> example (even with the master branch of MPICH). Maybe Rob can take a look
>> at the problem.
>>
>> Yanfei Guo
>> Assistant Computer Scientist
>> MCS Division, ANL
>>
>>
>> On 12/7/17, 1:59 PM, "Eric Chamberland" <Eric.Chamberland at giref.ulaval.ca>
>> wrote:
>>
>> Hi,
>>
>> I first posted this bug to the list on Nov 15 and have still had no reply.
>>
>> Is there something I should know or is there a better place to post an
>> MPICH bug?
>>
>> Thanks,
>>
>> Eric
>>
>>
>> On 04/12/17 08:18 PM, Eric Chamberland wrote:
>>
>>> Hi,
>>>
>>> I have taken some time to write a relatively "small" code that
>>> reproduces the problem.
>>>
>>> The good thing is that I can now extract the exact sequence of almost
>>> any MPI I/O calls we make in our code as pure "MPI" calls in a simple
>>> C program, independent of our in-house code.
>>>
>>> To reproduce the bug, just compile the attached file with any
>>> mpich/master since commit b4ab2f118d (Nov 8), and launch the resulting
>>> executable with 3 processes, with the second attachment
>>> (file_for_bug.data) saved in the pwd on an *NFS* path.
>>>
>>> You should see something like this:
>>>
>>> ERROR Returned by MPI: 604040736
>>> ERROR_string Returned by MPI: Other I/O error , error stack:
>>> ADIOI_NFS_READSTRIDED(523): Other I/O error Bad file descriptor
>>> ERROR Returned by MPI: 268496416
>>> ERROR_string Returned by MPI: Other I/O error , error stack:
>>> ADIOI_NFS_READSTRIDED(523): Other I/O error Operation now in progress
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>>>
>>> If you launch it on a local drive, it works.
>>>
>>> Can someone confirm that they can reproduce the problem, please?
>>>
>>> Moreover, if I launch it with valgrind, even on a local disk, it
>>> complains like this, on process 0:
>>>
>>> ==99023== Warning: invalid file descriptor -1 in syscall read()
>>> ==99023==    at 0x53E2CB0: __read_nocancel (in /lib64/libc-2.19.so)
>>> ==99023==    by 0x5041606: file_to_info_all (system_hints.c:101)
>>> ==99023==    by 0x5041606: ADIOI_process_system_hints (system_hints.c:150)
>>> ==99023==    by 0x50311B8: ADIO_Open (ad_open.c:123)
>>> ==99023==    by 0x50161FD: PMPI_File_open (open.c:154)
>>> ==99023==    by 0x400F91: main (mpich_mpiio_nfs_bug_read.c:42)
>>>
>>> Thanks,
>>>
>>> Eric
>>>
>>> On 21/11/17 02:49 PM, Eric Chamberland wrote:
>>>
>>>> Hi Mr. Latham,
>>>>
>>>> I have more information now.
>>>>
>>>> When I try to run my example on NFS, I get the following error code:
>>>>
>>>> error #812707360
>>>> Other I/O error , error stack:
>>>> ADIOI_NFS_READSTRIDED(523): Other I/O error Success
>>>>
>>>> This error is returned by MPI_File_read_all_begin.
>>>>
>>>> When I try on a local disk, everything is fine.
>>>>
>>>> Here are all the files from my current build:
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_config.log
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_c.txt
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_m.txt
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mi.txt
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpl_config.log
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_pm_hydra_config.log
>>>>
>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpiexec_info.txt
>>>>
>>>>
>>>> I hope this helps in digging further into this issue.
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>>>
>>>> On 15/11/17 03:55 PM, Eric Chamberland wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have been compiling with mpich/master every night since August 2016...
>>>>>
>>>>> Since Nov 8, the mpich/master branch has been failing our nightly build
>>>>> tests.
>>>>>
>>>>> Here is the Nov 8 config.log:
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.08.05h36m02s_config.log
>>>>>
>>>>>
>>>>> And the Nov 7 config.log:
>>>>>
>>>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.07.05h36m01s_config.log
>>>>>
>>>>>
>>>>>
>>>>> Since Nov 8, a specific ROMIO test hangs indefinitely in optimized
>>>>> mode, and in DEBUG mode I get a strange (yet to be debugged)
>>>>> assertion in our code.
>>>>>
>>>>> I reran the test manually, and when I wrote the results to a local
>>>>> disk, everything was fine.
>>>>>
>>>>> However, when I write over *NFS*, the test fails.
>>>>>
>>>>> I have not yet debugged this enough, but I suspect something
>>>>> related to one of:
>>>>>
>>>>> MPI_File_write_all_begin
>>>>> MPI_File_write_all_end
>>>>> MPI_File_read_all_begin
>>>>> MPI_File_read_all_end
>>>>> MPI_File_set_view
>>>>> MPI_Type_free
>>>>>
>>>>> Am I the only one seeing these problems?
>>>>>
>>>>> Thanks,
>>>>> Eric
>>>>>
>>>>> _______________________________________________
>>>>> discuss mailing list     discuss at mpich.org
>>>>> To manage subscription options or unsubscribe:
>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>>
>



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/