[mpich-discuss] very slow file writes independent of file size

Geoffrey Irving irving at naml.us
Mon Mar 3 23:36:54 CST 2014


Unfortunately, I have moved the files since I ran the job (from
scratch to project).  Running lfs getstripe on files generated by a
different run in the project directory produces

edison04:restart-5% lfs getstripe sparse-15.npy
can't find fs root for
'/global/project/projectdirs/pentago/restart-5/sparse-15.npy': -19
'sparse-15.npy' is not on a Lustre filesystem: No such device (19)
cb_getstripe: 'sparse-15.npy' not on a Lustre fs?: Inappropriate ioctl
for device (25)
error: getstripe failed for sparse-15.npy.
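
For future runs, raising the stripe count on the scratch output directory before writing (per Wei-keng's suggestion below) would look roughly like this; the path and count here are illustrative, not from my job:

```shell
# Widen striping on the output directory; files created afterwards inherit it.
# 96 is an example stripe count from the advice below; check limits with "lfs df".
lfs setstripe -c 96 $SCRATCH/restart-5
# Confirm the new layout (this only works while the files live on Lustre):
lfs getstripe $SCRATCH/restart-5
```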

I will certainly use MPI_File_write_at_all in the future, though it's
unintuitive that accidentally using a different but still collective
routine results in >3 minute hangs.
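
For the record, a minimal sketch of the change, with explicit per-rank offsets replacing the shared file pointer (buffer size, filename, and layout are made up for illustration):

```c
/* Sketch: collective write with explicit offsets (MPI_File_write_at_all)
 * instead of the shared-file-pointer MPI_File_write_ordered.
 * Buffer size, filename, and data layout are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                /* 1 MiB per rank, for example */
    char *buf = calloc(count, 1);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "sparse-15.npy",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank computes its own offset, so there is no rank-ordered
     * serialization through a shared file pointer. */
    MPI_Offset offset = (MPI_Offset)rank * count;
    MPI_File_write_at_all(fh, offset, buf, count, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```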

Thanks,
Geoffrey



On Mon, Mar 3, 2014 at 10:43 AM, Wei-keng Liao
<wkliao at eecs.northwestern.edu> wrote:
> Hi, Geoffrey
>
> Edison has a Lustre file system, and I assume your program writes files
> there. If so, please check the striping setting of your output files
> using the command "lfs getstripe filename".
>
> Usually, high I/O performance can be obtained if you configure the
> Lustre striping to a higher stripe_count (96 or 144 max on Edison).
>
> In addition, MPI_File_write_ordered uses a shared file pointer, which
> can hurt performance. Using independent file pointers often performs
> better. You might want to change your program to use those functions
> (e.g. MPI_File_write_all).
>
>
> Wei-keng
>
> On Mar 3, 2014, at 12:21 AM, Geoffrey Irving wrote:
>
>> I'm doing a postmortem on a 2048-node (16384-rank) job on Edison, and
>> trying to understand why my I/O performance might have been slow.
>>
>> Here's the data:
>>
>> Measured I/O bandwidth:
>> slice 35 write sparse bandwidth = 6082640 / (3.52519e+06 s / 16384) = 2.63287e-05 GB/s
>> slice 34 write sparse bandwidth = 13824080 / (3.66608e+06 s / 16384) = 5.75379e-05 GB/s
>> slice 33 write sparse bandwidth = 24754256 / (2.83647e+06 s / 16384) = 0.000133166 GB/s
>> slice 32 write sparse bandwidth = 39370832 / (3.47016e+06 s / 16384) = 0.000173119 GB/s
>> slice 31 write sparse bandwidth = 55812176 / (2.53623e+06 s / 16384) = 0.000335785 GB/s
>> slice 30 write sparse bandwidth = 74741840 / (2.5714e+06 s / 16384) = 0.00044352 GB/s
>> slice 29 write sparse bandwidth = 93560912 / (2.67336e+06 s / 16384) = 0.000534019 GB/s
>> slice 28 write sparse bandwidth = 112803920 / (2.74639e+06 s / 16384) = 0.000626733 GB/s
>> slice 27 write sparse bandwidth = 128194640 / (3.1603e+06 s / 16384) = 0.000618958 GB/s
>> slice 26 write sparse bandwidth = 141281360 / (3.12754e+06 s / 16384) = 0.00068929 GB/s
>> slice 25 write sparse bandwidth = 148193360 / (2.62376e+06 s / 16384) = 0.000861835 GB/s
>> slice 24 write sparse bandwidth = 151861328 / (3.2145e+06 s / 16384) = 0.000720865 GB/s
>> slice 23 write sparse bandwidth = 148193360 / (2.44736e+06 s / 16384) = 0.000923956 GB/s
>> slice 22 write sparse bandwidth = 142055504 / (3.15962e+06 s / 16384) = 0.000686031 GB/s
>> slice 21 write sparse bandwidth = 130388048 / (3.09774e+06 s / 16384) = 0.000642263 GB/s
>> slice 20 write sparse bandwidth = 117964880 / (3.02676e+06 s / 16384) = 0.000594696 GB/s
>> slice 19 write sparse bandwidth = 101560400 / (2.97198e+06 s / 16384) = 0.000521434 GB/s
>> slice 18 write sparse bandwidth = 86372432 / (2.96247e+06 s / 16384) = 0.000444878 GB/s
>> slice 18 write sections bandwidth = 1954957518434 / (1.83937e+07 s / 16384) = 1.62177 GB/s
>> slice 17 write sparse bandwidth = 70170704 / (2.88973e+06 s / 16384) = 0.000370526 GB/s
>> slice 17 write sections bandwidth = 1475380615039 / (1.36018e+07 s / 16384) = 1.65511 GB/s
>> slice 17 read bandwidth (192 nodes) = 1475380615039 / 3383.36 s = 0.406122 GB/s
>>  per node: measured = 2.16598 MB/s, theoretical peak = 33.1341 MB/s
>>
>> Focusing on the "sparse" lines, the main point is that the time seems
>> to be roughly independent of file size (plot attached).  Each timing
>> sample consists of (1) setup, which I believe is negligible, (2)
>> MPI_File_open, (3) MPI_File_write_ordered, and (4) MPI_File_close.
>>
>> What might have caused these file writes to take so long?
>>
>> Geoffrey
>> <2014-03-02-221950_1220x500.png>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>


