<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr"><span style="font-size:12.8000001907349px">Could you try mpich3.2b2?  I tested your code with it on my laptop. My timing is</span><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px">np    time(s)<br><div>2      102.57<br></div><div>4      51.285<br></div><div>8      25.6425<br></div><div>16   12.8212</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr">--Junchao Zhang</div></div></div>


<br><div class="gmail_quote">On Mon, May 4, 2015 at 10:14 AM, David Froger <span dir="ltr"><<a href="mailto:david.froger.ml@mailoo.org" target="_blank">david.froger.ml@mailoo.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks Junchao.<br>


<span class=""><br>


> I don't see you measure MPI_Allreduce.<br>


<br>


</span>You're right, let's call my code "a simple example to reproduce a bug" rather<br>


than a benchmark.<br>


<span class=""><br>


> Basically you only measured some random numbers across processes.<br>


<br>


</span>The usleep simulates the time spent on computation in my real code<br>
(computational fluid dynamics software). bench_mpi.cxx only does a<br>
usleep(microseconds), then calls MPI_Allreduce. microseconds is a constant base<br>
time divided by mpi_size, plus a random overhead between 0% and 5% so that<br>
MPI_Allreduce is not called at the same wall-clock time on all processes (though I<br>
think I should have used a different seed on each process).<br>


<br>


So because all the code does is usleep(base_time / mpi_size), I expect the<br>
wall clock time to halve when the number of processes doubles.<br>


<br>


With MPICH 3.1.4, the wall clock time increases with 7 or more processes.<br>
MPI_Allreduce becomes very slow for no apparent reason. I'm trying to understand<br>
why.<br>


</blockquote></div><br></div></div>