Finding the task range implementation requires thinking in C++, but it's not hard. Just grep your way down a few layers.

Jeff

On Monday, July 7, 2014, Rob Latham <robl@mcs.anl.gov> wrote:
On 07/07/2014 10:34 AM, Junchao Zhang wrote:
Rob,
Is it possible for me to install a debug version of PAMI on Mira? I read the InstallReadme*_BGQ.txt. It is quite complex, and it looks like I need root privileges.
If it is possible, I can profile the code further.
I know, that install process is crazy. It seems like one should be able to get a PAMI library out of comm/sys/pami by setting enough environment variables -- is there no configure process for PAMI?

==rob
--Junchao Zhang

On Mon, Jul 7, 2014 at 10:19 AM, Rob Latham <robl@mcs.anl.gov> wrote:
On 07/03/2014 04:45 PM, Jeff Hammond wrote:
PAMI is open source via https://repo.anl-external.org/repos/bgq-driver/.
I believe ALCF has already reported this bug, but you can contact support@alcf.anl.gov for an update.
In a nice bit of circular logic, ALCF keeps trying to close that ticket, saying "this is being discussed on the MPICH list".

Specifically to Jeff's point, the PAMI pieces are in bgq-VERSION-gpl.tar.gz.
Junchao: you can find the implementation of PAMI_Geometry_create_taskrange in comm/sys/pami/api/c/pami.cc, but all it does is immediately call the object's create_taskrange member function, so now you have to find where *that* is...
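
For illustration only, here is a minimal sketch of the wrapper pattern being described: a flat C entry point that immediately forwards to a C++ member function. All class, function, and parameter names below are hypothetical, not the actual PAMI source; the point is only that the real work (and the scaling behavior) lives in the member function, which is the next thing to grep for.

// Hypothetical sketch only -- NOT the actual PAMI source.  It shows the
// shape of the wrapper: the C entry point does nothing but delegate to a
// C++ member function, so the behavior of interest lives in that member.
#include <cstddef>

class hypothetical_geometry_t {
public:
  // Hypothetical member; the real create_taskrange takes PAMI-specific types.
  int create_taskrange(const int *task_ranges, std::size_t range_count) {
    (void)task_ranges;
    (void)range_count;
    return 0;  // placeholder: the real work would happen here
  }
};

extern "C" int hypothetical_geometry_create_taskrange(hypothetical_geometry_t *geometry,
                                                       const int *task_ranges,
                                                       std::size_t range_count) {
  // Thin shim: forward straight to the object's member function.
  return geometry->create_taskrange(task_ranges, range_count);
}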

==rob
Best,

Jeff
On Thu, Jul 3, 2014 at 2:41 PM, Junchao Zhang <jczhang@mcs.anl.gov> wrote:
Hi, Sam,
I wrote micro-benchmarks for MPI_Comm_split/dup. My profiling results suggested the problem lies in an IBM PAMI library call, PAMI_Geometry_create_taskrange(). Unfortunately, I don't have access to the PAMI source code and don't know why. I reported it to IBM and hope IBM will fix it.
Alternatively, you can set the environment variable PAMID_COLLECTIVES=0 to disable PAMI collectives. My tests showed it at least fixed the scalability problem of Comm_split and Comm_dup.
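
For reference, here is a minimal sketch of the kind of Comm_split/Comm_dup timing micro-benchmark described above. It is a reconstruction for illustration, not the code actually used; the two-color split and the max-reduction of the timings are arbitrary choices. PAMID_COLLECTIVES=0 would be set in the job's environment before launch to try the workaround.

// Minimal sketch of a Comm_split/Comm_dup timing micro-benchmark
// (a reconstruction for illustration, not the benchmark used above).
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Time MPI_Comm_split with an arbitrary two-color split and rank as the key.
  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  MPI_Comm split_comm;
  MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split_comm);
  double t_split = MPI_Wtime() - t0;

  // Time MPI_Comm_dup of the full communicator.
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  MPI_Comm dup_comm;
  MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
  double t_dup = MPI_Wtime() - t0;

  // Report the slowest rank, which is what limits the build time.
  double max_split, max_dup;
  MPI_Reduce(&t_split, &max_split, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&t_dup,   &max_dup,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("P=%d  Comm_split %.6f s  Comm_dup %.6f s\n", size, max_split, max_dup);

  MPI_Comm_free(&split_comm);
  MPI_Comm_free(&dup_comm);
  MPI_Finalize();
  return 0;
}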
Also, through profiling, I found that the qsort() called in the MPICH code actually uses the merge sort algorithm in Mira's libc.

--Junchao Zhang
On Sat, May 17, 2014 at 9:06 AM, Sam Williams <swwilliams@lbl.gov> wrote:
I've been conducting scaling experiments on the Mira (Blue Gene/Q) and K (SPARC) supercomputers. I've noticed that the time required for MPI_Comm_split and MPI_Comm_dup can grow quickly with scale (~P^2). As such, their performance eventually becomes a bottleneck. That is, although the benefit of using a subcommunicator is huge (multigrid solves are weak-scalable), the penalty of creating one (multigrid build time) is also huge.

For example, when scaling from 1 to 46K nodes (= cubes of integers) on Mira, the time (in seconds) required to build an MG solver (including a subcommunicator) scales as:
222335.output: Total time in MGBuild 0.056704
222336.output: Total time in MGBuild 0.060834
222348.output: Total time in MGBuild 0.064782
222349.output: Total time in MGBuild 0.090229
222350.output: Total time in MGBuild 0.075280
222351.output: Total time in MGBuild 0.091852
222352.output: Total time in MGBuild 0.137299
222411.output: Total time in MGBuild 0.301552
222413.output: Total time in MGBuild 0.606444
222415.output: Total time in MGBuild 0.745272
222417.output: Total time in MGBuild 0.779757
222418.output: Total time in MGBuild 4.671838
222419.output: Total time in MGBuild 15.123162
222420.output: Total time in MGBuild 33.875626
222421.output: Total time in MGBuild 49.494547
222422.output: Total time in MGBuild 151.329026

If I disable the call to MPI_Comm_split, my time scales as:
224982.output: Total time in MGBuild 0.050143
224983.output: Total time in MGBuild 0.052607
224988.output: Total time in MGBuild 0.050697
224989.output: Total time in MGBuild 0.078343
224990.output: Total time in MGBuild 0.054634
224991.output: Total time in MGBuild 0.052158
224992.output: Total time in MGBuild 0.060286
225008.output: Total time in MGBuild 0.062925
225009.output: Total time in MGBuild 0.097357
225010.output: Total time in MGBuild 0.061807
225011.output: Total time in MGBuild 0.076617
225012.output: Total time in MGBuild 0.099683
225013.output: Total time in MGBuild 0.125580
225014.output: Total time in MGBuild 0.190711
225016.output: Total time in MGBuild 0.218329
225017.output: Total time in MGBuild 0.282081

Although I didn't directly measure it, this suggests the time for MPI_Comm_split is growing roughly quadratically with process concurrency.

I see the same effect on the K machine (8...64K nodes), where the code uses Comm_split and Comm_dup in conjunction:
run00008_7_1.sh.o2412931: Total time in MGBuild 0.026458 seconds
run00064_7_1.sh.o2415876: Total time in MGBuild 0.039121 seconds
run00512_7_1.sh.o2415877: Total time in MGBuild 0.086800 seconds
run01000_7_1.sh.o2414496: Total time in MGBuild 0.129764 seconds
run01728_7_1.sh.o2415878: Total time in MGBuild 0.224576 seconds
run04096_7_1.sh.o2415880: Total time in MGBuild 0.738979 seconds
run08000_7_1.sh.o2414504: Total time in MGBuild 2.123800 seconds
run13824_7_1.sh.o2415881: Total time in MGBuild 6.276573 seconds
run21952_7_1.sh.o2415882: Total time in MGBuild 13.634200 seconds
run32768_7_1.sh.o2415884: Total time in MGBuild 36.508670 seconds
run46656_7_1.sh.o2415874: Total time in MGBuild 58.668228 seconds
run64000_7_1.sh.o2415875: Total time in MGBuild 117.322217 seconds

A glance at the implementation on Mira (I don't know if the implementation on K is stock) suggests it should be using qsort to sort based on keys. Unfortunately, qsort is not performance-robust like heap/merge sort. If one were to be productive and call Comm_split like
MPI_Comm_split(..., mycolor, myrank, ...)
then one runs the risk that the keys are presorted. This hits the worst-case computational complexity for qsort: O(P^2). Demanding that programmers avoid sending sorted keys seems unreasonable.
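
Concretely, that calling pattern looks something like the sketch below (ranks_per_subcomm and the color choice are placeholders for whatever the application does). Because every rank passes its own rank as the key, the keys within each color arrive already sorted, which is the worst case for a naive quicksort.

#include <mpi.h>

// Sketch of the calling pattern described above: each rank passes its own
// rank as the key, so within each color the keys are already in sorted
// order.  ranks_per_subcomm is a placeholder for the application's color rule.
MPI_Comm make_subcomm(int ranks_per_subcomm) {
  int myrank;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  int mycolor = myrank / ranks_per_subcomm;  // group consecutive ranks
  MPI_Comm subcomm;
  MPI_Comm_split(MPI_COMM_WORLD, mycolor, myrank, &subcomm);
  return subcomm;
}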

I should note that I see a similar lack of scaling with MPI_Comm_dup on the K machine. Unfortunately, my BGQ data used an earlier version of the code that did not use Comm_dup. As such, I can't definitively say that it is a problem on that machine as well.

Thus, I'm asking that scalable implementations of Comm_split/dup, using a merge/heap sort whose worst-case complexity is still P log P, be prioritized in the next update.
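
For illustration only, the requested behavior amounts to something like the sketch below inside a Comm_split implementation; this is not MPICH's actual code, and the struct and function names are made up. The idea is to sort the gathered (color, key, rank) triples with a sort whose worst case is O(P log P), such as a merge sort, and to keep the sort stable so that ties in key fall back to parent-communicator rank order as MPI_Comm_split requires.

// Sketch of the requested fix, not MPICH's actual implementation: sort the
// gathered (color, key, rank) triples with a worst-case O(P log P) sort.
#include <algorithm>
#include <vector>

struct split_entry {
  int color;  // color argument passed by this rank
  int key;    // key argument passed by this rank
  int rank;   // rank in the parent communicator
};

// std::stable_sort is typically a merge sort: worst case O(P log P) given
// scratch memory, and stable, so equal keys keep parent-rank order as
// MPI_Comm_split requires.  Presorted keys are no longer a pathological case.
static void sort_split_table(std::vector<split_entry> &table) {
  std::stable_sort(table.begin(), table.end(),
                   [](const split_entry &a, const split_entry &b) {
                     if (a.color != b.color) return a.color < b.color;
                     return a.key < b.key;
                   });
}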

thanks

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

--
Jeff Hammond
jeff.science@gmail.com
http://jeffhammond.github.io/