[mpich-discuss] Efficient message size for MPI_Bcast

Latham, Robert J. robl at mcs.anl.gov
Fri Apr 23 16:12:40 CDT 2021


On Fri, 2021-04-23 at 19:31 +0000, Mccall, Kurt E. (MSFC-EV41) via
discuss wrote:
> I have a file containing multiple Mb of data that I need to be read
> by all of my processes.   I’m assuming it is more efficient to have
> one process read it in and then broadcast it to the rest, rather than
> have all processes (about 20) hammer our NFS server (?).  

You could have all processes read the file collectively.  Normally
MPICH won't try to optimize such a simple workload, so you'd also have
to set the "romio_cb_read" hint to "enable".

Did your eyes glaze over?  Yeah, that's a little inside-baseball for
sure.
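
If you want to try it anyway, here's roughly what it looks like.  This
is a minimal sketch, not a drop-in solution: "input.dat" is a
placeholder, error checking is omitted, and I'm assuming the file is
small enough that its size fits in an int-sized count.

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;
        MPI_Offset fsize;
        char *buf;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* ask ROMIO to use collective buffering even for this
           simple access pattern */
        MPI_Info_set(info, "romio_cb_read", "enable");

        MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                      info, &fh);
        MPI_File_get_size(fh, &fsize);

        buf = malloc(fsize);
        /* every rank asks for the whole file; with collective
           buffering a few aggregator ranks do the actual NFS reads
           and ship the data to everyone else */
        MPI_File_read_at_all(fh, 0, buf, (int)fsize, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        /* ... use buf ... */
        free(buf);
        MPI_Finalize();
        return 0;
    }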

Read and broadcast is not a bad solution, either.  What's the best
size to use?  That depends on your network.  You could run a message
size benchmark: just about every network shows an S-shaped curve, where
small messages perform worse than larger messages up to a point, and
then things level off.

MPI_Bcast is going to behave a bit differently than a point-to-point
workload, but the same general pattern should hold: broadcasting a
bunch of 8-byte regions is going to be bad, 1 MiB regions significantly
better, and 8 MiB at a time might be a few percentage points better but
will probably perform the same.
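
A minimal sketch of the read-and-broadcast pattern, using a 1 MiB
chunk size per the numbers above and assuming the whole file fits in
memory on every rank ("input.dat" is again a placeholder, error
checking omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define CHUNK (1 << 20)  /* 1 MiB: near the knee of most curves */

    int main(int argc, char **argv)
    {
        int rank;
        long fsize = 0;
        FILE *fp = NULL;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {  /* only rank 0 ever touches the NFS server */
            fp = fopen("input.dat", "rb");
            fseek(fp, 0, SEEK_END);
            fsize = ftell(fp);
            rewind(fp);
        }
        MPI_Bcast(&fsize, 1, MPI_LONG, 0, MPI_COMM_WORLD);

        buf = malloc(fsize);
        for (long off = 0; off < fsize; off += CHUNK) {
            int n = (fsize - off < CHUNK) ? (int)(fsize - off) : CHUNK;
            if (rank == 0)
                fread(buf + off, 1, n, fp);
            /* everyone else receives the chunk rank 0 just read */
            MPI_Bcast(buf + off, n, MPI_BYTE, 0, MPI_COMM_WORLD);
        }
        if (rank == 0)
            fclose(fp);
        /* ... use buf ... */
        free(buf);
        MPI_Finalize();
        return 0;
    }

For a file of only a few MB you could just as well read it all and do
one big MPI_Bcast; the chunked loop mainly matters once the buffer
gets large.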

Some benchmarks to help answer your question on your specific hardware
if you want a more precise number:

- Intel MPI Benchmarks
https://software.intel.com/content/www/us/en/develop/articles/intel-mpi-benchmarks.html
- LLNL's 'mpiBench' https://github.com/LLNL/mpiBench
- OSU MPI Benchmarks https://mvapich.cse.ohio-state.edu/benchmarks/
(though the broadcast benchmark reports latency: dividing message size
by latency should give you bandwidth)

Reading from NFS in this way is about as good as you can
expect.  Please do not expect anything more than "best effort" when
writing to NFS in parallel.  It is not a parallel file system, and in
fact it goes out of its way to make life hard for MPI writes (cache
behavior is unpredictable, blocks are falsely shared among processes,
and the single NFS server is a communication bottleneck).

==rob

