[mpich-discuss] very slow file writes independent of file size

Geoffrey Irving irving at naml.us
Tue Mar 4 12:05:15 CST 2014


Thanks.  I get the same non-Lustre error for scratch, but I can ask
NERSC about that to confirm.  I will be paranoid about stripe_count in
the future. :)

Geoffrey

On Mon, Mar 3, 2014 at 10:07 PM, Wei-keng Liao
<wkliao at eecs.northwestern.edu> wrote:
>
> The project directory is GPFS; the lfs command is for Lustre only.
>
> You can run the lfs command on the directory under $SCRATCH where you wrote your files.
> Files created in a Lustre directory inherit the striping settings of that directory.
> The default stripe_count on Edison appears to be one, so if you have a large amount
> of data to write, changing stripe_count to a larger value should give much
> higher performance. All you need to do is create a new directory, set its
> stripe_count to, say, 96 with the lfs command, and write your files there.
> For example,
>     mkdir $SCRATCH/new_directory
>     lfs setstripe -s 1M -c 96 -o -1 $SCRATCH/new_directory
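>
> If it is easier, the same striping request can also be made from inside the MPI
> program through info hints at file-creation time. Here is a minimal sketch in C,
> assuming the MPI-IO implementation on Edison honors the standard "striping_factor"
> and "striping_unit" hints (they only take effect when the file is newly created);
> the output file name is just a placeholder.
>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>
>         /* Ask for 96 stripes of 1 MiB each when the file is created. */
>         MPI_Info info;
>         MPI_Info_create(&info);
>         MPI_Info_set(info, "striping_factor", "96");
>         MPI_Info_set(info, "striping_unit", "1048576");
>
>         MPI_File fh;
>         MPI_File_open(MPI_COMM_WORLD, "output.dat",
>                       MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
>         MPI_Info_free(&info);
>
>         /* ... collective writes go here ... */
>
>         MPI_File_close(&fh);
>         MPI_Finalize();
>         return 0;
>     }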
>
> Using independent file pointers will most likely give you much better performance,
> as some implementations of the shared file pointer MPI-IO functions use a temporary
> file to store the pointer (so that one process's change to the pointer is immediately
> visible to the other processes). Because of this and other overheads, the use of
> shared-pointer functions is generally discouraged.
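>
> As a concrete illustration (a sketch, not your code), the same rank-ordered layout
> as MPI_File_write_ordered can be produced with explicit offsets and a collective
> write: an exclusive prefix sum over each rank's byte count gives that rank's
> starting offset. This assumes every rank writes a contiguous block of 'count'
> bytes starting at the beginning of the file.
>
>     #include <mpi.h>
>
>     /* Write each rank's local bytes in rank order using explicit offsets. */
>     static void write_in_rank_order(MPI_File fh, const void *buf, int count)
>     {
>         int rank;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         /* Exclusive prefix sum of byte counts = this rank's starting offset. */
>         long long bytes = count, offset = 0;
>         MPI_Exscan(&bytes, &offset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
>         if (rank == 0) offset = 0;   /* MPI_Exscan leaves rank 0's result undefined */
>
>         MPI_File_write_at_all(fh, (MPI_Offset)offset, buf, count, MPI_BYTE,
>                               MPI_STATUS_IGNORE);
>     }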
>
> However, I think the poor performance in your case is mainly due to the file
> striping issue.
>
>
> Wei-keng
>
> On Mar 3, 2014, at 11:36 PM, Geoffrey Irving wrote:
>
>> Unfortunately, I have moved the files since I ran the job (from
>> scratch to project).  Running lfs getstripe on files generated by a
>> different run in the project directory produces
>>
>> edison04:restart-5% lfs getstripe sparse-15.npy
>> can't find fs root for
>> '/global/project/projectdirs/pentago/restart-5/sparse-15.npy': -19
>> 'sparse-15.npy' is not on a Lustre filesystem: No such device (19)
>> cb_getstripe: 'sparse-15.npy' not on a Lustre fs?: Inappropriate ioctl
>> for device (25)
>> error: getstripe failed for sparse-15.npy.
>>
>> I will certainly use MPI_File_write_at_all in the future, though it's
>> unintuitive that accidentally using a different but still collective
>> routine results in hangs of more than 3 minutes.
>>
>> Thanks,
>> Geoffrey
>>
>>
>>
>> On Mon, Mar 3, 2014 at 10:43 AM, Wei-keng Liao
>> <wkliao at eecs.northwestern.edu> wrote:
>>> Hi, Geoffrey
>>>
>>> Edison has a Lustre file system, and I assume your program writes its files
>>> there. If this is the case, please check the file striping settings of your
>>> output files with the command "lfs getstripe filename".
>>>
>>> Usually, high I/O performance can be obtained if you configure the Lustre
>>> striping to a higher stripe_count (96, or at most 144, on Edison).
>>>
>>> In addition, MPI_File_write_ordered uses shared file pointers, which can
>>> hurt performance. Using independent file pointers often performs better,
>>> so you might want to consider changing your program to use those
>>> functions (e.g., MPI_File_write_all).
>>>
>>>
>>> Wei-keng
>>>
>>> On Mar 3, 2014, at 12:21 AM, Geoffrey Irving wrote:
>>>
>>>> I'm doing postmortem on a 2048 node (16384 rank) job on Edison, and
>>>> trying to understand why my I/O performance might have been slow.
>>>>
>>>> Here's the data:
>>>>
>>>> Measured I/O bandwidth:
>>>> slice 35 write sparse bandwidth = 6082640 / (3.52519e+06 s / 16384) = 2.63287e-05 GB/s
>>>> slice 34 write sparse bandwidth = 13824080 / (3.66608e+06 s / 16384) = 5.75379e-05 GB/s
>>>> slice 33 write sparse bandwidth = 24754256 / (2.83647e+06 s / 16384) = 0.000133166 GB/s
>>>> slice 32 write sparse bandwidth = 39370832 / (3.47016e+06 s / 16384) = 0.000173119 GB/s
>>>> slice 31 write sparse bandwidth = 55812176 / (2.53623e+06 s / 16384) = 0.000335785 GB/s
>>>> slice 30 write sparse bandwidth = 74741840 / (2.5714e+06 s / 16384) = 0.00044352 GB/s
>>>> slice 29 write sparse bandwidth = 93560912 / (2.67336e+06 s / 16384) = 0.000534019 GB/s
>>>> slice 28 write sparse bandwidth = 112803920 / (2.74639e+06 s / 16384) = 0.000626733 GB/s
>>>> slice 27 write sparse bandwidth = 128194640 / (3.1603e+06 s / 16384) = 0.000618958 GB/s
>>>> slice 26 write sparse bandwidth = 141281360 / (3.12754e+06 s / 16384) = 0.00068929 GB/s
>>>> slice 25 write sparse bandwidth = 148193360 / (2.62376e+06 s / 16384) = 0.000861835 GB/s
>>>> slice 24 write sparse bandwidth = 151861328 / (3.2145e+06 s / 16384) = 0.000720865 GB/s
>>>> slice 23 write sparse bandwidth = 148193360 / (2.44736e+06 s / 16384) = 0.000923956 GB/s
>>>> slice 22 write sparse bandwidth = 142055504 / (3.15962e+06 s / 16384) = 0.000686031 GB/s
>>>> slice 21 write sparse bandwidth = 130388048 / (3.09774e+06 s / 16384) = 0.000642263 GB/s
>>>> slice 20 write sparse bandwidth = 117964880 / (3.02676e+06 s / 16384) = 0.000594696 GB/s
>>>> slice 19 write sparse bandwidth = 101560400 / (2.97198e+06 s / 16384) = 0.000521434 GB/s
>>>> slice 18 write sparse bandwidth = 86372432 / (2.96247e+06 s / 16384) = 0.000444878 GB/s
>>>> slice 18 write sections bandwidth = 1954957518434 / (1.83937e+07 s / 16384) = 1.62177 GB/s
>>>> slice 17 write sparse bandwidth = 70170704 / (2.88973e+06 s / 16384) = 0.000370526 GB/s
>>>> slice 17 write sections bandwidth = 1475380615039 / (1.36018e+07 s / 16384) = 1.65511 GB/s
>>>> slice 17 read bandwidth (192 nodes) = 1475380615039 / 3383.36 s = 0.406122 GB/s
>>>> per node: measured = 2.16598 MB/s, theoretical peak = 33.1341 MB/s
>>>>
>>>> Focusing on the "sparse" lines, the main point is that the time seems
>>>> to be roughly independent of file size (plot attached).  Each timing
>>>> sample consists of (1) setup, which I believe is negligible, (2)
>>>> MPI_File_open, (3) MPI_File_write_ordered, and (4) MPI_File_close.
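>>>>
>>>> For reference, each sample is roughly the following pattern (sketched here with
>>>> placeholder buffer, count, and file name, not the actual code):
>>>>
>>>>     #include <mpi.h>
>>>>
>>>>     /* One timing sample: open, ordered write, close. */
>>>>     static double timed_write(const char *path, const void *buf, int count)
>>>>     {
>>>>         double t0 = MPI_Wtime();
>>>>         MPI_File fh;
>>>>         MPI_File_open(MPI_COMM_WORLD, path,
>>>>                       MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
>>>>         MPI_File_write_ordered(fh, buf, count, MPI_BYTE, MPI_STATUS_IGNORE);
>>>>         MPI_File_close(&fh);
>>>>         return MPI_Wtime() - t0;   /* per-rank time; the totals above sum over ranks */
>>>>     }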
>>>>
>>>> What might have caused these file writes to take so long?
>>>>
>>>> Geoffrey
>>>> <2014-03-02-221950_1220x500.png>
>>>
>


