[mpich-discuss] very slow file writes independent of file size

Wei-keng Liao wkliao at eecs.northwestern.edu
Tue Mar 4 00:07:17 CST 2014


The project directory is on GPFS, and the lfs command works only on Lustre.

You can run the lfs command on the directory under $SCRATCH where you wrote your files.
Files created in a Lustre directory inherit the striping settings of that directory.
The default stripe_count on Edison appears to be one. So, if you have a large amount
of data to write, changing stripe_count to a larger value should give much
higher performance. All you need to do is create a new directory, set its
stripe_count to, say, 96 using the lfs command, and write your files there.
For example,
    mkdir $SCRATCH/new_directory
    lfs setstripe -s 1M -c 96 -o -1 $SCRATCH/new_directory
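
To illustrate why stripe_count matters, here is a minimal sketch of how
round-robin striping spreads a file over storage targets. It is plain Python
with no Lustre dependency. The function names are made up for illustration,
and it assumes simple RAID-0 style striping with a fixed stripe size.

```python
MIB = 1 << 20


def ost_for_offset(offset, stripe_size, stripe_count):
    """Index (0-based, within the file's own stripe set) of the target that
    holds the byte at `offset`, assuming round-robin striping."""
    return (offset // stripe_size) % stripe_count


def targets_touched(file_size, stripe_size, stripe_count):
    """Number of distinct targets a contiguous file of `file_size` bytes uses."""
    n_stripes = -(-file_size // stripe_size)  # ceiling division
    return min(n_stripes, stripe_count)


# File sizes in bytes, taken from the "sparse" lines of the log quoted below.
largest = 151861328
smallest = 6082640

print(targets_touched(largest, MIB, 1))    # 1
print(targets_touched(largest, MIB, 96))   # 96
print(targets_touched(smallest, MIB, 96))  # 6
```

With stripe_count 1 every byte lands on a single target, while a file smaller
than stripe_size * stripe_count cannot use all of the targets it is allowed.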

Using independent file pointers will most likely give you much better performance,
as some implementations of shared file pointer MPI-IO functions use a temporary
file to store the pointer (so one process's change to the pointer can be immediately
visible to other processes). Because of this and other overheads, the use of
shared-pointer functions is generally discouraged.
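
As a rough sketch of the alternative: the rank-ordered layout that
MPI_File_write_ordered produces can also be obtained with explicit offsets,
where each rank's offset is the total byte count of all lower ranks (in MPI
this would typically come from MPI_Exscan) and the data is then written with
an explicit-offset call such as MPI_File_write_at_all. The code below only
simulates this in plain Python without MPI. The "ranks" are list entries, the
file is an in-memory buffer, and the helper names are hypothetical.

```python
import io


def exclusive_prefix_sum(counts):
    """offsets[i] = sum of counts[0..i-1], i.e. what an exclusive scan returns."""
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets


def write_at_offsets(chunks, order):
    """Simulate ranks writing in arbitrary `order`, each at its own offset."""
    offsets = exclusive_prefix_sum([len(c) for c in chunks])
    f = io.BytesIO()
    for rank in order:
        f.seek(offsets[rank])
        f.write(chunks[rank])
    return f.getvalue()


# Hypothetical per-rank buffers of different sizes (rank 1 has nothing to write).
chunks = [b"aaa", b"", b"bb", b"cccc"]
print(exclusive_prefix_sum([len(c) for c in chunks]))  # [0, 3, 3, 5]
print(write_at_offsets(chunks, order=[3, 0, 2, 1]))    # b'aaabbcccc'
```

The result is in rank order no matter which rank writes first, so no shared
pointer has to be kept consistent between processes.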

However, I think the poor performance in your case is mainly because of the file
striping issue.
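
For reference, the arithmetic behind the "sparse" lines in the log quoted
below can be redone as follows. This is only a sketch under two assumptions
that are not stated in the log: that the seconds figure is summed over all
16384 ranks (which is why it is divided by 16384), and that GB means 2^30
bytes (this choice reproduces the logged values). The three samples are
copied from the log, and the helper names are made up.

```python
RANKS = 16384
GIB = 2 ** 30  # assumption: "GB" in the log means 2^30 bytes

# (slice, bytes written, seconds summed over ranks), copied from the log
samples = [
    (35, 6082640, 3.52519e6),
    (24, 151861328, 3.2145e6),
    (17, 70170704, 2.88973e6),
]


def wall_seconds(summed_seconds, ranks=RANKS):
    """Average elapsed time per rank for one write."""
    return summed_seconds / ranks


def bandwidth_gb_per_s(nbytes, summed_seconds, ranks=RANKS):
    """Aggregate bandwidth as computed in the log."""
    return nbytes / wall_seconds(summed_seconds, ranks) / GIB


for slice_id, nbytes, seconds in samples:
    print(f"slice {slice_id}: {wall_seconds(seconds):.0f} s per write, "
          f"{bandwidth_gb_per_s(nbytes, seconds):.3g} GB/s")
```

Each of these writes took roughly three to four minutes even though the file
sizes differ by about a factor of 25.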


Wei-keng

On Mar 3, 2014, at 11:36 PM, Geoffrey Irving wrote:

> Unfortunately, I have moved the files since I ran the job (from
> scratch to project).  Running lfs getstripe on files generated by a
> different run in the project directory produces
> 
> edison04:restart-5% lfs getstripe sparse-15.npy
> can't find fs root for
> '/global/project/projectdirs/pentago/restart-5/sparse-15.npy': -19
> 'sparse-15.npy' is not on a Lustre filesystem: No such device (19)
> cb_getstripe: 'sparse-15.npy' not on a Lustre fs?: Inappropriate ioctl
> for device (25)
> error: getstripe failed for sparse-15.npy.
> 
> I will certainly use MPI_File_write_at_all in future, though it's
> unintuitive that accidentally using a different but still collective
> routine results in hangs of more than 3 minutes.
> 
> Thanks,
> Geoffrey
> 
> 
> 
> On Mon, Mar 3, 2014 at 10:43 AM, Wei-keng Liao
> <wkliao at eecs.northwestern.edu> wrote:
>> Hi, Geoffrey
>> 
>> Edison has a Lustre file system and I assume your program writes files
>> there. If this is the case, please check the file striping setting of your
>> output files, using the command "lfs getstripe filename".
>> 
>> Usually, high I/O performance can be obtained if you configure
>> the Lustre striping setting with a higher stripe_count (96, or 144 max, on Edison).
>> 
>> In addition, MPI_File_write_ordered uses shared file pointers,
>> which can hurt performance. Using independent file pointers
>> often results in better performance. You might want to consider
>> changing your program to use those functions (e.g. MPI_File_write_all).
>> 
>> 
>> Wei-keng
>> 
>> On Mar 3, 2014, at 12:21 AM, Geoffrey Irving wrote:
>> 
>>> I'm doing a postmortem on a 2048 node (16384 rank) job on Edison, and
>>> trying to understand why my I/O performance might have been slow.
>>> 
>>> Here's the data:
>>> 
>>> Measured I/O bandwidth:
>>> slice 35 write sparse bandwidth = 6082640 / (3.52519e+06 s / 16384) = 2.63287e-05 GB/s
>>> slice 34 write sparse bandwidth = 13824080 / (3.66608e+06 s / 16384) = 5.75379e-05 GB/s
>>> slice 33 write sparse bandwidth = 24754256 / (2.83647e+06 s / 16384) = 0.000133166 GB/s
>>> slice 32 write sparse bandwidth = 39370832 / (3.47016e+06 s / 16384) = 0.000173119 GB/s
>>> slice 31 write sparse bandwidth = 55812176 / (2.53623e+06 s / 16384) = 0.000335785 GB/s
>>> slice 30 write sparse bandwidth = 74741840 / (2.5714e+06 s / 16384) = 0.00044352 GB/s
>>> slice 29 write sparse bandwidth = 93560912 / (2.67336e+06 s / 16384) = 0.000534019 GB/s
>>> slice 28 write sparse bandwidth = 112803920 / (2.74639e+06 s / 16384) = 0.000626733 GB/s
>>> slice 27 write sparse bandwidth = 128194640 / (3.1603e+06 s / 16384) = 0.000618958 GB/s
>>> slice 26 write sparse bandwidth = 141281360 / (3.12754e+06 s / 16384) = 0.00068929 GB/s
>>> slice 25 write sparse bandwidth = 148193360 / (2.62376e+06 s / 16384) = 0.000861835 GB/s
>>> slice 24 write sparse bandwidth = 151861328 / (3.2145e+06 s / 16384) = 0.000720865 GB/s
>>> slice 23 write sparse bandwidth = 148193360 / (2.44736e+06 s / 16384) = 0.000923956 GB/s
>>> slice 22 write sparse bandwidth = 142055504 / (3.15962e+06 s / 16384) = 0.000686031 GB/s
>>> slice 21 write sparse bandwidth = 130388048 / (3.09774e+06 s / 16384) = 0.000642263 GB/s
>>> slice 20 write sparse bandwidth = 117964880 / (3.02676e+06 s / 16384) = 0.000594696 GB/s
>>> slice 19 write sparse bandwidth = 101560400 / (2.97198e+06 s / 16384) = 0.000521434 GB/s
>>> slice 18 write sparse bandwidth = 86372432 / (2.96247e+06 s / 16384) = 0.000444878 GB/s
>>> slice 18 write sections bandwidth = 1954957518434 / (1.83937e+07 s / 16384) = 1.62177 GB/s
>>> slice 17 write sparse bandwidth = 70170704 / (2.88973e+06 s / 16384) = 0.000370526 GB/s
>>> slice 17 write sections bandwidth = 1475380615039 / (1.36018e+07 s / 16384) = 1.65511 GB/s
>>> slice 17 read bandwidth (192 nodes) = 1475380615039 / 3383.36 s = 0.406122 GB/s
>>> per node: measured = 2.16598 MB/s, theoretical peak = 33.1341 MB/s
>>> 
>>> Focusing on the "sparse" lines, the main point is that the time seems
>>> to be roughly independent of file size (plot attached).  Each timing
>>> sample consists of (1) setup which I believe is negligible, (2)
>>> MPI_File_open, (3) MPI_File_write_ordered, (4) MPI_File_close.
>>> 
>>> What might have caused these file writes to take so long?
>>> 
>>> Geoffrey
>>> <2014-03-02-221950_1220x500.png>
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>> 



