<div dir="ltr"><div>Sam, </div> I got an update from IBM last week. They reproduced the problem and found it<span style="font-size:13px;font-family:sans-serif"> only happens when the number of MPI ranks is not a power of 2. Their advice: since the IBM BG/Q </span><span style="font-size:13px;font-family:sans-serif">optimized collectives are mostly designed to help only blocks with power-of-2 geometries, you can test in your program whether </span><span style="font-size:13px;font-family:sans-serif">subsequent collective calls with PAMID_COLLECTIVES=1 are actually faster than with PAMID_COLLECTIVES=0 on communicators with a non-power-of-2 geometry. </span><span style="font-size:13px;font-family:sans-serif">If the answer is no, then you can just run with PAMID_COLLECTIVES=0 and avoid the dup/split performance issue. Otherwise, IBM may prioritize this ticket.</span><div><div><font face="sans-serif"><br></font></div><div> Thanks.</div><div><div><div><div><div class="gmail_extra"><div><div dir="ltr">--Junchao Zhang</div></div>
<br><div class="gmail_quote">On Thu, Jul 3, 2014 at 4:41 PM, Junchao Zhang <span dir="ltr"><<a href="mailto:jczhang@mcs.anl.gov" target="_blank">jczhang@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Hi, Sam, <div> I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results suggest the problem lies in an IBM PAMI library call, PAMI_Geometry_create_taskrange(). Unfortunately, I don't have access to the PAMI source code, so I don't know the cause. I reported it to IBM and hope IBM will fix it.<br>
Alternatively, you can set the environment variable PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests showed that it at least fixes the scalability problem of Comm_split and Comm_dup.</div><div> Also through profiling, I found that the qsort() called in the MPICH code actually uses the merge sort algorithm in Mira's libc library.<span><font color="#888888"><br>
<div> </div></font></span></div></div><div class="gmail_extra"><span><font color="#888888"><br clear="all"><div><div dir="ltr">--Junchao Zhang</div></div></font></span><div><div>
<br><br><div class="gmail_quote">On Sat, May 17, 2014 at 9:06 AM, Sam Williams <span dir="ltr"><<a href="mailto:swwilliams@lbl.gov" target="_blank">swwilliams@lbl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K (SPARC) supercomputers. I've noticed that the time required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2). As such, communicator creation eventually becomes a bottleneck. That is, although the benefit of using a subcommunicator is huge (multigrid solves are weak-scalable), the penalty of creating one (multigrid build time) is also huge.<br>
<br>
For example, when scaling from 1 to 46K nodes (= cubes of integers) on Mira, the time (in seconds) required to build a MG solver (including a subcommunicator) scales as<br>
222335.output: Total time in MGBuild 0.056704<br>
222336.output: Total time in MGBuild 0.060834<br>
222348.output: Total time in MGBuild 0.064782<br>
222349.output: Total time in MGBuild 0.090229<br>
222350.output: Total time in MGBuild 0.075280<br>
222351.output: Total time in MGBuild 0.091852<br>
222352.output: Total time in MGBuild 0.137299<br>
222411.output: Total time in MGBuild 0.301552<br>
222413.output: Total time in MGBuild 0.606444<br>
222415.output: Total time in MGBuild 0.745272<br>
222417.output: Total time in MGBuild 0.779757<br>
222418.output: Total time in MGBuild 4.671838<br>
222419.output: Total time in MGBuild 15.123162<br>
222420.output: Total time in MGBuild 33.875626<br>
222421.output: Total time in MGBuild 49.494547<br>
222422.output: Total time in MGBuild 151.329026<br>
<br>
If I disable the call to MPI_Comm_split, my time scales as<br>
224982.output: Total time in MGBuild 0.050143<br>
224983.output: Total time in MGBuild 0.052607<br>
224988.output: Total time in MGBuild 0.050697<br>
224989.output: Total time in MGBuild 0.078343<br>
224990.output: Total time in MGBuild 0.054634<br>
224991.output: Total time in MGBuild 0.052158<br>
224992.output: Total time in MGBuild 0.060286<br>
225008.output: Total time in MGBuild 0.062925<br>
225009.output: Total time in MGBuild 0.097357<br>
225010.output: Total time in MGBuild 0.061807<br>
225011.output: Total time in MGBuild 0.076617<br>
225012.output: Total time in MGBuild 0.099683<br>
225013.output: Total time in MGBuild 0.125580<br>
225014.output: Total time in MGBuild 0.190711<br>
225016.output: Total time in MGBuild 0.218329<br>
225017.output: Total time in MGBuild 0.282081<br>
<br>
Although I didn't directly measure it, this suggests the time for MPI_Comm_split is growing roughly quadratically with process concurrency.<br>
<br>
<br>
<br>
<br>
I see the same effect on the K machine (8...64K nodes) where the code uses comm_split/dup in conjunction:<br>
run00008_7_1.sh.o2412931: Total time in MGBuild 0.026458 seconds<br>
run00064_7_1.sh.o2415876: Total time in MGBuild 0.039121 seconds<br>
run00512_7_1.sh.o2415877: Total time in MGBuild 0.086800 seconds<br>
run01000_7_1.sh.o2414496: Total time in MGBuild 0.129764 seconds<br>
run01728_7_1.sh.o2415878: Total time in MGBuild 0.224576 seconds<br>
run04096_7_1.sh.o2415880: Total time in MGBuild 0.738979 seconds<br>
run08000_7_1.sh.o2414504: Total time in MGBuild 2.123800 seconds<br>
run13824_7_1.sh.o2415881: Total time in MGBuild 6.276573 seconds<br>
run21952_7_1.sh.o2415882: Total time in MGBuild 13.634200 seconds<br>
run32768_7_1.sh.o2415884: Total time in MGBuild 36.508670 seconds<br>
run46656_7_1.sh.o2415874: Total time in MGBuild 58.668228 seconds<br>
run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds<br>
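The growth in the K listing above can be checked with a small C sketch that compares observed MGBuild time ratios against what linear and quadratic scaling would predict. It assumes the number in each run file name is the node count, and it uses total MGBuild time rather than an isolated comm_split/dup timing. The function names are made up for illustration.<br>

```c
#include <assert.h>
#include <stdio.h>

#define NUM_RUNS 12

/* Node counts (assumed to be the number in each run file name) and
 * MGBuild times in seconds, copied from the K machine listing. */
static const double nodes[NUM_RUNS] = {
    8, 64, 512, 1000, 1728, 4096, 8000, 13824, 21952, 32768, 46656, 64000
};
static const double seconds[NUM_RUNS] = {
    0.026458, 0.039121, 0.086800, 0.129764, 0.224576, 0.738979,
    2.123800, 6.276573, 13.634200, 36.508670, 58.668228, 117.322217
};

/* Observed growth in build time going from run a to run b. */
double observed_ratio(int a, int b)
{
    assert(a >= 0 && a < NUM_RUNS && b >= 0 && b < NUM_RUNS);
    return seconds[b] / seconds[a];
}

/* Growth predicted if time scaled as P^k, for a small integer k. */
double predicted_ratio(int a, int b, int k)
{
    assert(a >= 0 && a < NUM_RUNS && b >= 0 && b < NUM_RUNS && k >= 0);
    double node_ratio = nodes[b] / nodes[a];
    double prediction = 1.0;
    for (int e = 0; e < k; e++)
        prediction *= node_ratio;
    return prediction;
}

/* Print, for each consecutive pair of runs, the observed time ratio
 * next to the linear (P^1) and quadratic (P^2) predictions. */
void print_growth_table(void)
{
    printf("%8s %8s %10s %10s %10s\n",
           "from", "to", "observed", "linear", "quadratic");
    for (int i = 1; i < NUM_RUNS; i++)
        printf("%8.0f %8.0f %10.3f %10.3f %10.3f\n",
               nodes[i - 1], nodes[i],
               observed_ratio(i - 1, i),
               predicted_ratio(i - 1, i, 1),
               predicted_ratio(i - 1, i, 2));
}
```

Calling print_growth_table() from a main() prints the per-step comparison. Going from 4096 to 64000 nodes, the observed ratio (about 159x) falls between the linear (about 15.6x) and quadratic (about 244x) predictions.<br>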
<br>
<br>
A glance at the implementation on Mira (I don't know if the implementation on K is stock) suggests it uses qsort to sort by key. Unfortunately, qsort is not performance-robust the way heap sort and merge sort are. If one were to be productive and call comm_split like...<br>
MPI_Comm_split(...,mycolor,myrank,...)<br>
then one runs the risk that the keys are presorted. For a quicksort-based qsort, this can hit the worst-case computational complexity... O(P^2). Demanding that programmers avoid sending sorted keys seems unreasonable.<br>
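To make the presorted-key concern concrete, here is a minimal C sketch that counts key comparisons when key == rank (already sorted). It assumes a textbook quicksort with a first-element pivot, which is not necessarily what MPICH or any particular libc qsort() does; the merge sort and all function names are likewise illustrative only.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Running count of key comparisons made by the sort under test. */
static unsigned long long comparisons;

static int less_than(int a, int b) { comparisons++; return a < b; }

static void swap_keys(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Textbook quicksort on keys[lo..hi], first element as pivot.
 * Recurses on the smaller side to keep the stack shallow. */
static void naive_quicksort(int *keys, long lo, long hi)
{
    while (lo < hi) {
        int pivot = keys[lo];
        long i = lo;
        for (long j = lo + 1; j <= hi; j++)
            if (less_than(keys[j], pivot))
                swap_keys(&keys[++i], &keys[j]);
        swap_keys(&keys[lo], &keys[i]);
        if (i - lo < hi - i) { naive_quicksort(keys, lo, i - 1); lo = i + 1; }
        else                 { naive_quicksort(keys, i + 1, hi); hi = i - 1; }
    }
}

/* Top-down merge sort on keys[lo..hi), using scratch as a buffer. */
static void merge_sort(int *keys, int *scratch, long lo, long hi)
{
    if (hi - lo < 2) return;
    long mid = lo + (hi - lo) / 2;
    merge_sort(keys, scratch, lo, mid);
    merge_sort(keys, scratch, mid, hi);
    long i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        scratch[k++] = less_than(keys[j], keys[i]) ? keys[j++] : keys[i++];
    while (i < mid) scratch[k++] = keys[i++];
    while (j < hi)  scratch[k++] = keys[j++];
    memcpy(keys + lo, scratch + lo, (size_t)(hi - lo) * sizeof *keys);
}

/* Comparisons used by the naive quicksort on n presorted keys. */
unsigned long long quicksort_comparisons_presorted(long n)
{
    assert(n > 0);
    int *keys = malloc((size_t)n * sizeof *keys);
    assert(keys != NULL);
    for (long r = 0; r < n; r++) keys[r] = (int)r;   /* key == rank */
    comparisons = 0;
    naive_quicksort(keys, 0, n - 1);
    free(keys);
    return comparisons;
}

/* Comparisons used by the merge sort on n presorted keys. */
unsigned long long mergesort_comparisons_presorted(long n)
{
    assert(n > 0);
    int *keys = malloc((size_t)n * sizeof *keys);
    int *scratch = malloc((size_t)n * sizeof *scratch);
    assert(keys != NULL && scratch != NULL);
    for (long r = 0; r < n; r++) keys[r] = (int)r;   /* key == rank */
    comparisons = 0;
    merge_sort(keys, scratch, 0, n);
    free(scratch);
    free(keys);
    return comparisons;
}
```

With P = 1024 presorted keys, this naive quicksort makes P(P-1)/2 = 523776 comparisons, while the merge sort makes (P/2) log2 P = 5120.<br>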
<br>
<br>
I should note, I see a similar lack of scaling with MPI_Comm_dup on the K machine. Unfortunately, my BGQ data used an earlier version of the code that did not use comm_dup. As such, I can’t definitively say that it is a problem on that machine as well.<br>
<br>
Thus, I'm asking that scalable implementations of comm_split/dup, using merge or heap sort whose worst-case complexity is still O(P log P), be prioritized in the next update.<br>
<br>
<br>
thanks<br>
</blockquote></div><br></div></div></div>
</blockquote></div><br></div></div></div></div></div></div></div>