[mpich-discuss] BUG with ROMIO on NFS since commit b4ab2f118d

Guo, Yanfei yguo at anl.gov
Fri Dec 8 09:14:14 CST 2017


Hi Eric,

Sorry about the delay. I am able to reproduce the problem with your example (even with the master branch of MPICH). Maybe Rob can take a look at the problem.

Yanfei Guo
Assistant Computer Scientist
MCS Division, ANL

 

On 12/7/17, 1:59 PM, "Eric Chamberland" <Eric.Chamberland at giref.ulaval.ca> wrote:

Hi,

I first posted this bug to the list on Nov 15 and I still have no reply.

Is there something I should know or is there a better place to post an 
MPICH bug?

Thanks,

Eric


On 04/12/17 08:18 PM, Eric Chamberland wrote:
> Hi,
> 
> I have taken some time to produce a relatively small code example that
> reproduces the problem.
> 
> The good thing is that I can now extract the exact call sequence of 
> almost any MPI I/O call we make in our code into a simple C program of 
> pure MPI calls, independent of our in-house code.
> 
> To reproduce the bug, just compile the attached file against any 
> mpich/master build since commit b4ab2f118d (Nov 8), and launch the 
> resulting executable with 3 processes, with the second attachment 
> (file_for_bug.data) saved in the working directory on an *NFS* path.
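> A minimal sketch of the compile-and-launch steps described above (the 
> mpicc/mpiexec here are assumed to come from the nightly mpich/master 
> build, and the NFS path is hypothetical):
> 
> ```shell
> # compile the attached reproducer with the mpich/master build
> mpicc mpich_mpiio_nfs_bug_read.c -o mpich_mpiio_nfs_bug_read
> # run with 3 ranks from an NFS-mounted directory holding the data file
> cd /mnt/nfs/testdir          # hypothetical NFS mount point
> cp ~/file_for_bug.data .
> mpiexec -n 3 ./mpich_mpiio_nfs_bug_read
> ```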
> 
> You should see something like this:
> 
> ERROR Returned by MPI: 604040736
> ERROR_string Returned by MPI: Other I/O error , error stack:
> ADIOI_NFS_READSTRIDED(523): Other I/O error Bad file descriptor
> ERROR Returned by MPI: 268496416
> ERROR_string Returned by MPI: Other I/O error , error stack:
> ADIOI_NFS_READSTRIDED(523): Other I/O error Operation now in progress
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
> 
> If you launch it on a local drive, it works.
> 
> Can someone confirm that they can reproduce the problem, please?
> 
> Moreover, if I launch it with valgrind, even on a local disk, it 
> complains like this, on process 0:
> 
> ==99023== Warning: invalid file descriptor -1 in syscall read()
> ==99023==    at 0x53E2CB0: __read_nocancel (in /lib64/libc-2.19.so)
> ==99023==    by 0x5041606: file_to_info_all (system_hints.c:101)
> ==99023==    by 0x5041606: ADIOI_process_system_hints (system_hints.c:150)
> ==99023==    by 0x50311B8: ADIO_Open (ad_open.c:123)
> ==99023==    by 0x50161FD: PMPI_File_open (open.c:154)
> ==99023==    by 0x400F91: main (mpich_mpiio_nfs_bug_read.c:42)
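> For reference, a sketch of such a valgrind run through mpiexec (the 
> option choice is illustrative, not what our harness actually passes):
> 
> ```shell
> # run each rank under valgrind; --track-fds reports bad file descriptors
> mpiexec -n 3 valgrind --track-fds=yes ./mpich_mpiio_nfs_bug_read
> ```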
> 
> Thanks,
> 
> Eric
> 
> On 21/11/17 02:49 PM, Eric Chamberland wrote:
>> Hi M. Latham,
>>
>> I have more information now.
>>
>> When I try to run my example on NFS, I have the following error code:
>>
>> error #812707360
>> Other I/O error , error stack:
>> ADIOI_NFS_READSTRIDED(523): Other I/O error Success
>>
>> that is returned by MPI_File_read_all_begin
>>
>> When I try on a local disk, everything is fine.
>>
>> Here are all files about my actual build:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_c.txt
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_m.txt
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mi.txt
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpl_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_pm_hydra_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.21.05h40m02s_mpiexec_info.txt
>>
>> Hope this helps to dig further into this issue.
>>
>> Thanks,
>>
>> Eric
>>
>>
>> On 15/11/17 03:55 PM, Eric Chamberland wrote:
>>> Hi,
>>>
>>> We have been compiling with mpich/master each night since August 2016...
>>>
>>> Since Nov 8, the mpich/master branch has been failing our nightly 
>>> build tests.
>>>
>>> Here is the Nov 8 config.log:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.08.05h36m02s_config.log 
>>>
>>>
>>> And here is the Nov 7 configure log:
>>>
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2017.11.07.05h36m01s_config.log 
>>>
>>>
>>>
>>> Since Nov 8, a specific ROMIO test hangs indefinitely in optimized 
>>> mode, and in DEBUG mode I hit a strange (yet to be debugged) 
>>> assertion in our code.
>>>
>>> I reran the test manually, and when I wrote the results to a local 
>>> disk, everything was fine.
>>>
>>> However, when I write over *NFS*, the test is faulty.
>>>
>>> I have not yet debugged this deeply, but I suspect something 
>>> related to one of:
>>>
>>> MPI_File_write_all_begin
>>> MPI_File_write_all_end
>>> MPI_File_read_all_begin
>>> MPI_File_read_all_end
>>> MPI_File_set_view
>>> MPI_Type_free
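>>> For reference, a minimal sketch of the split-collective read pattern 
>>> these calls form (the file name, datatype, and counts here are 
>>> illustrative, not our test's actual values):
>>>
>>> ```c
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Init(&argc, &argv);
>>>
>>>     MPI_File fh;
>>>     MPI_File_open(MPI_COMM_WORLD, "file_for_bug.data",
>>>                   MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
>>>
>>>     /* a strided file view is what routes NFS reads through
>>>        ADIOI_NFS_ReadStrided, where the error is reported */
>>>     MPI_Datatype filetype;
>>>     MPI_Type_vector(4, 1, 3, MPI_INT, &filetype);
>>>     MPI_Type_commit(&filetype);
>>>     MPI_File_set_view(fh, 0, MPI_INT, filetype, "native",
>>>                       MPI_INFO_NULL);
>>>
>>>     /* split collective: begin posts the read, end completes it */
>>>     int buf[4];
>>>     MPI_File_read_all_begin(fh, buf, 4, MPI_INT);
>>>     MPI_Status status;
>>>     MPI_File_read_all_end(fh, buf, &status);
>>>
>>>     MPI_Type_free(&filetype);
>>>     MPI_File_close(&fh);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>> ```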
>>>
>>> Am I alone to see these problems?
>>>
>>> Thanks,
>>> Eric
>>>
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss