Finding the task range implementation requires thinking in C++, but it's not hard. Just grep your way down a few layers.

Jeff

On Monday, July 7, 2014, Rob Latham <robl@mcs.anl.gov> wrote:
On 07/07/2014 10:34 AM, Junchao Zhang wrote:
Rob,
Is it possible for me to install a debug version of PAMI on Mira? I read the InstallReadme*_BGQ.txt. It is quite complex, and it looks like I need root privileges.
If it is possible, I can profile the code further.
I know, that install process is crazy. It seems like one should be able to get a PAMI library out of comm/sys/pami by setting enough environment variables -- is there no configure process for PAMI?

==rob
--Junchao Zhang

On Mon, Jul 7, 2014 at 10:19 AM, Rob Latham <robl@mcs.anl.gov> wrote:
On 07/03/2014 04:45 PM, Jeff Hammond wrote:
PAMI is open source via https://repo.anl-external.org/repos/bgq-driver/.
I believe ALCF has already reported this bug, but you can contact support@alcf.anl.gov for an update.
In a nice bit of circular logic, ALCF keeps trying to close that ticket, saying "this is being discussed on the MPICH list".

Specifically to Jeff's point, the PAMI pieces are in bgq-VERSION-gpl.tar.gz.
Junchao: you can find the implementation of PAMI_Geometry_create_taskrange in comm/sys/pami/api/c/pami.cc, but all it does is immediately call the object's create_taskrange member function, so now you have to find where *that* is...
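
For illustration only, here is a minimal sketch of the wrapper pattern being described: a flat C entry point that immediately forwards to a C++ member function. All class, function, and parameter names below are hypothetical, not the actual PAMI source; the point is only that the real work (and the scaling behavior) lives in the member function, which is the next thing to grep for.

// Hypothetical sketch only -- NOT the actual PAMI source.  It shows the
// shape of the wrapper: the C entry point does nothing but delegate to a
// C++ member function, so the behavior of interest lives in that member.
#include <cstddef>

class hypothetical_geometry_t {
public:
  // Hypothetical member; the real create_taskrange takes PAMI-specific types.
  int create_taskrange(const int *task_ranges, std::size_t range_count) {
    (void)task_ranges;
    (void)range_count;
    return 0;  // placeholder: the real work would happen here
  }
};

extern "C" int hypothetical_geometry_create_taskrange(hypothetical_geometry_t *geometry,
                                                       const int *task_ranges,
                                                       std::size_t range_count) {
  // Thin shim: forward straight to the object's member function.
  return geometry->create_taskrange(task_ranges, range_count);
}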

==rob
Best,

Jeff
On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang <jczhang@mcs.anl.gov> wrote:
Hi, Sam,
I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results suggested the problem lies in an IBM PAMI library call, PAMI_Geometry_create_taskrange(). Unfortunately, I don't have access to the PAMI source code and don't know why. I reported it to IBM and hope IBM will fix it.
Alternatively, you can set the environment variable PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests showed it at least fixed the scalability problem of Comm_split and Comm_dup.
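
For reference, here is a minimal sketch of the kind of Comm_split/Comm_dup timing micro-benchmark described above. It is a reconstruction for illustration, not the code actually used; the two-color split and the max-reduction of the timings are arbitrary choices. PAMID_COLLECTIVES=0 would be set in the job's environment before launch to try the workaround.

// Minimal sketch of a Comm_split/Comm_dup timing micro-benchmark
// (a reconstruction for illustration, not the benchmark used above).
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Time MPI_Comm_split with an arbitrary two-color split and rank as the key.
  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  MPI_Comm split_comm;
  MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
  double t_split = MPI_Wtime() - t0;

  // Time MPI_Comm_dup of the full communicator.
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  MPI_Comm dup_comm;
  MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
  double t_dup = MPI_Wtime() - t0;

  // Report the slowest rank, which is what limits the build time.
  double max_split, max_dup;
  MPI_Reduce(&t_split, &max_split, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&t_dup,   &max_dup,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("P=%d  Comm_split %.6f s  Comm_dup %.6f s\n", size, max_split, max_dup);

  MPI_Comm_free(&split_comm);
  MPI_Comm_free(&dup_comm);
  MPI_Finalize();
  return 0;
}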
Also, through profiling, I found that the qsort() called in the MPICH code actually uses the merge sort algorithm in Mira's libc.

--Junchao Zhang
On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams@lbl.gov> wrote:
I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K (SPARC) supercomputers. I've noticed that the time required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2). As such, their performance eventually becomes a bottleneck. That is, although the benefit of using a subcommunicator is huge (multigrid solves are weak-scalable), the penalty of creating one (multigrid build time) is also huge.

For example, when scaling from 1 to 46K nodes (= cubes of integers) on Mira, the time (in seconds) required to build an MG solver (including a subcommunicator) scales as:
222335.output: Total time in MGBuild 0.056704
222336.output: Total time in MGBuild 0.060834
222348.output: Total time in MGBuild 0.064782
222349.output: Total time in MGBuild 0.090229
222350.output: Total time in MGBuild 0.075280
222351.output: Total time in MGBuild 0.091852
222352.output: Total time in MGBuild 0.137299
222411.output: Total time in MGBuild 0.301552
222413.output: Total time in MGBuild 0.606444
222415.output: Total time in MGBuild 0.745272
222417.output: Total time in MGBuild 0.779757
222418.output: Total time in MGBuild 4.671838
222419.output: Total time in MGBuild 15.123162
222420.output: Total time in MGBuild 33.875626
222421.output: Total time in MGBuild 49.494547
222422.output: Total time in MGBuild 151.329026

If I disable the call to MPI_Comm_split, my time scales as:
224982.output: Total time in MGBuild 0.050143
224983.output: Total time in MGBuild 0.052607
224988.output: Total time in MGBuild 0.050697
224989.output: Total time in MGBuild 0.078343
224990.output: Total time in MGBuild 0.054634
224991.output: Total time in MGBuild 0.052158
224992.output: Total time in MGBuild 0.060286
225008.output: Total time in MGBuild 0.062925
225009.output: Total time in MGBuild 0.097357
225010.output: Total time in MGBuild 0.061807
225011.output: Total time in MGBuild 0.076617
225012.output: Total time in MGBuild 0.099683
225013.output: Total time in MGBuild 0.125580
225014.output: Total time in MGBuild 0.190711
225016.output: Total time in MGBuild 0.218329
225017.output: Total time in MGBuild 0.282081

Although I didn't directly measure it, this suggests the time for MPI_Comm_split is growing roughly quadratically with process concurrency.

I see the same effect on the K machine (8...64K nodes), where the code uses Comm_split and Comm_dup in conjunction:
run00008_7_1.sh.o2412931: Total time in MGBuild 0.026458 seconds
run00064_7_1.sh.o2415876: Total time in MGBuild 0.039121 seconds
run00512_7_1.sh.o2415877: Total time in MGBuild 0.086800 seconds
run01000_7_1.sh.o2414496: Total time in MGBuild 0.129764 seconds
run01728_7_1.sh.o2415878: Total time in MGBuild 0.224576 seconds
run04096_7_1.sh.o2415880: Total time in MGBuild 0.738979 seconds
run08000_7_1.sh.o2414504: Total time in MGBuild 2.123800 seconds
run13824_7_1.sh.o2415881: Total time in MGBuild 6.276573 seconds
run21952_7_1.sh.o2415882: Total time in MGBuild 13.634200 seconds
run32768_7_1.sh.o2415884: Total time in MGBuild 36.508670 seconds
run46656_7_1.sh.o2415874: Total time in MGBuild 58.668228 seconds
run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds

A glance at the implementation on Mira (I don't know if the implementation on K is stock) suggests it should be using qsort to sort based on keys. Unfortunately, qsort is not performance-robust like heap/merge sort. If one were to be productive and call Comm_split like
MPI_Comm_split(..., mycolor, myrank, ...)
then one runs the risk that the keys are presorted. This hits the worst-case computational complexity for qsort: O(P^2). Demanding that programmers avoid sending sorted keys seems unreasonable.
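
Concretely, that calling pattern looks something like the sketch below (ranks_per_subcomm and the color choice are placeholders for whatever the application does). Because every rank passes its own rank as the key, the keys within each color arrive already sorted, which is the worst case for a naive quicksort.

#include <mpi.h>

// Sketch of the calling pattern described above: each rank passes its own
// rank as the key, so within each color the keys are already in sorted
// order.  ranks_per_subcomm is a placeholder for the application's color rule.
MPI_Comm make_subcomm(int ranks_per_subcomm) {
  int myrank;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  int mycolor = myrank / ranks_per_subcomm;  // group consecutive ranks
  MPI_Comm subcomm;
  MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &subcomm);
  return subcomm;
}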

I should note that I see a similar lack of scaling with MPI_Comm_dup on the K machine. Unfortunately, my BGQ data used an earlier version of the code that did not use Comm_dup. As such, I can't definitively say that it is a problem on that machine as well.

Thus, I'm asking that scalable implementations of Comm_split/dup, using a merge/heap sort whose worst-case complexity is still P log P, be prioritized in the next update.
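
For illustration only, the requested behavior amounts to something like the sketch below inside a Comm_split implementation; this is not MPICH's actual code, and the struct and function names are made up. The idea is to sort the gathered (color, key, rank) triples with a sort whose worst case is O(P log P), such as a merge sort, and to keep the sort stable so that ties in key fall back to parent-communicator rank order as MPI_Comm_split requires.

// Sketch of the requested fix, not MPICH's actual implementation: sort the
// gathered (color, key, rank) triples with a worst-case O(P log P) sort.
#include <algorithm>
#include <vector>

struct split_entry {
  int color;  // color argument passed by this rank
  int key;    // key argument passed by this rank
  int rank;   // rank in the parent communicator
};

// std::stable_sort is typically a merge sort: worst case O(P log P) given
// scratch memory, and stable, so equal keys keep parent-rank order as
// MPI_Comm_split requires.  Presorted keys are no longer a pathological case.
static void sort_split_table(std::vector<split_entry> &table) {
  std::stable_sort(table.begin(), table.end(),
                   [](const split_entry &a, const split_entry &b) {
                     if (a.color != b.color) return a.color < b.color;
                     return a.key < b.key;
                   });
}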

thanks

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

--
Jeff Hammond
jeff.science@gmail.com
http://jeffhammond.github.io/