[mpich-discuss] spurious lock ups on collective merge intercom

Dmitriy Lyubimov dlieu.7 at gmail.com
Mon Feb 6 13:27:04 CST 2017


Thank you, Kenneth.

Here is a simple C++ equivalent of what I am doing:

server.cpp:
=============================

#include <iostream>
#include <mpi.h>
#include <stdlib.h>

using namespace std;

// The only argument must be the number of processes in the communicator we
// expect to build.
int main(int argc, char** argv)
{

    int np = atoi(argv[1]);

    MPI_Init(&argc, &argv);

    char portName[MPI_MAX_PORT_NAME];

    MPI_Open_port(MPI_INFO_NULL, portName);

    cout << portName << "\n";

    MPI_Comm intercomm, intracomm = MPI_COMM_SELF;

    // Build an intracomm dynamically until np processes are reached.
    for (int i = 1; i < np; i++) {

        MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);

        cout << "Accepted.\n";

        MPI_Intercomm_merge(intercomm, false, &intracomm);

        cout << "Merged to an intracom.\n";

        MPI_Comm_free(&intercomm);
    }

    // intracomm now contains all np processes; this is the one we can use with the n-grid.
    MPI_Comm_free(&intracomm);

    MPI_Close_port(portName);

    MPI_Finalize();
}


client.cpp:
===============================
#include <iostream>
#include <mpi.h>
#include <stdlib.h>

using namespace std;

// This expects the intracomm size and the port name to connect to.
// When passing the port name from a shell, use single quotes to avoid shell substitution.
int main(int argc, char** argv)
{

    MPI_Init(&argc, &argv);

    int np = atoi(argv[1]);
    char* portName = argv[2];

    cout << "Connecting to " << portName << "\n";

    MPI_Comm intercomm, intracomm;

    MPI_Comm_connect(portName, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    cout << "Connected.\n";

    MPI_Intercomm_merge(intercomm, true, &intracomm);

    cout << "Merged.\n";

    MPI_Comm_free(&intercomm);

    int i;

    MPI_Comm_size(intracomm, &i);

    // Keep accepting and merging until np processes are reached.
    for (; i < np; i++) {

        MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);

        cout << "Accepted.\n";

        MPI_Intercomm_merge(intercomm, false, &intracomm);

        cout << "Merged to an intracom.\n";

        MPI_Comm_free(&intercomm);
    }

    // intracomm now contains all np processes; this is the one we can use with the n-grid.
    MPI_Comm_free(&intracomm);

    MPI_Finalize();
}

============================
Run example on one machine, intracomm size = 2 (in this case I ran 3.3a):

dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ server.cpp -o server
dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ client.cpp -o client
dmitriy at Intel-Kubu:~/projects/mpitests$
dmitriy at Intel-Kubu:~/projects/mpitests$
dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./server 2
tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
Accepted.
Merged to an intracom.
dmitriy at Intel-Kubu:~/projects/mpitests$

(in another shell)
dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./client 2
'tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$'
Connecting to tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
Connected.
Merged.
dmitriy at Intel-Kubu:~/projects/mpitests$

The first parameter is the eventual size of the intracomm we are trying to
build dynamically, and the client also needs to know the port reported by the
server process. There is therefore 1 server and (n-1) clients that connect to
form the intracomm.
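
For reference, larger runs can be driven from a shell along these lines. This
is only a minimal sketch (assuming bash); the PORT value is a placeholder for
the string printed by the server, pasted in single quotes as noted in
client.cpp, and 192/191 match the cluster run described below:

# shell 1: start the server for a 192-process intracomm; it prints the port string
mpiexec ./server 192

# shell 2: paste the printed port string (single quotes keep the '$' characters literal)
PORT='tag#0$description#...$port#...$ifname#...$'   # placeholder; copy from the server output
for i in $(seq 1 191); do
    mpiexec ./client 192 "$PORT" &
done
wait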

Now, if I do that for 192 processes on a 192-core cluster (occasionally
slightly overloaded in terms of CPU load), I get a lock-up more often than
not. The incidence is more frequent with 3.2 than with 3.3a2. This cluster
has Mellanox InfiniBand.

3.2 usually locks up on the intercomm merge call; 3.3a2 locked up at least
once with 2 clients connected and waiting on the intercomm merge at the same
time (but my understanding is that only one client should be connected at a
time, even if there are massive connects pending from other clients).
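
In case it helps with diagnosing such a hang, one way to see where each rank
is stuck is to grab backtraces of the client processes on a node. A minimal
sketch only, assuming gdb and pgrep are available there and the binary is
named ./client as above:

# dump a backtrace of every thread of every running ./client on this node
for pid in $(pgrep -f './client'); do
    echo "=== pid $pid ==="
    gdb -batch -p "$pid" -ex 'thread apply all bt'
done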

Hope this gives a little more material.
-Dmitriy


On Fri, Feb 3, 2017 at 8:40 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov>
wrote:

> Hi Dmitriy,
>
> MPICH does appear to be reporting a process exit/crash in this case. A
> simple reproducer would be useful to test if that is indeed the cause or if
> there's something else going on.
>
> I see below that you are using a non-standard MPI binding. If the test
> case is simple enough, we can try to port it and investigate further.
>
> Ken
>
> On 01/19/2017 06:58 PM, Dmitriy Lyubimov wrote:
>
>> These lock-ups seem to be gone in 3.3a2.
>>
>> I do occasionally get the following though:
>>
>> Unknown error class, error stack:
>> PMPI_Comm_accept(129).................:
>> MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$", MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
>> MPID_Comm_accept(153).................:
>> MPIDI_Comm_accept(1244)...............:
>> MPIR_Get_contextid_sparse_group(499)..:
>> MPIR_Allreduce_impl(755)..............:
>> MPIR_Allreduce_intra(414).............:
>> MPIDU_Complete_posted_with_error(1137): Process failed
>>
>> What does this message mean? Did some process just exit/die (e.g., with a
>> seg fault)?
>>
>> Thank you.
>> -Dmitriy
>>
>> On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>>
>>     further debugging shows that it's not actually mergeIntercom that
>>     locks up, but a pair of send/recv calls that two nodes decide to
>>     execute before MPI_Intercomm_merge.
>>
>>     so the total snapshot of the situation is that everyone waits on
>>     mergeIntercom except for two processes that wait in send/recv
>>     respectively, while the majority of the others have already
>>     entered the collective barrier.
>>
>>     it would seem that this sort of asymmetric logic should be
>>     acceptable, since the send/recv pair is balanced before the merge
>>     is to occur, but in practice it seems to lock up -- increasingly
>>     so as the number of participating processes increases. It is
>>     almost as if, once a collective barrier of a certain cardinality
>>     is formed, point-to-point messages no longer go through.
>>
>>     If this scenario begets any ideas, please let me know.
>>
>>     thank you!
>>     -Dmitriy
>>
>>
>>
>>     On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>>
>>         Maybe it has something to do with the fact that it is going
>>         through JVM JNI and that somehow interferes with MPI's
>>         threading model, although it is a single-threaded JVM process,
>>         and JVM bindings for MPI have been done before (e.g., OpenMPI
>>         had an effort towards that).
>>
>>         The strange thing is that I never had a lock-up with fewer
>>         than 120 processes, but something changes after that: the
>>         spurious condition becomes much more common. By the time I am
>>         at 150 processes in the intercomm, I am almost certain to have
>>         a merge lock-up.
>>
>>
>>         On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>>
>>             Thanks.
>>             it would not be easy for me to do immediately, as I am using
>>             a proprietary Scala binding API for MPI.
>>
>>             it would help me to know if there has been a known problem
>>             like that in the past, or whether the mergeIntercomm API is
>>             generally known to work on hundreds of processes. Sounds
>>             like there are no known issues with that.
>>
>>
>>
>>             On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>>
>>                 Hello Dmitriy,
>>
>>                 can you maybe create a simple example program to
>>                 reproduce this failure?
>>                 It is also often easier to look at a code example
>>                 to identify a problem.
>>
>>                 Thanks,
>>                 Lena
>>                 > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>>                 >
>>                 > Hello,
>>                 >
>>                 > (mpich 3.2)
>>                 >
>>                 > I have a scenario where I add a few extra processes
>>                 > to an existing intercomm.
>>                 >
>>                 > it works as a simple loop --
>>                 > (1) the current n processes accept on the n-process intracomm
>>                 > (2) 1 process connects
>>                 > (3) the resulting intercomm is merged into an (n+1)-process
>>                 > intracomm; the intercomm and the old intracomm are closed
>>                 > (4) repeat 1-3 as needed.
>>                 >
>>                 > Occasionally, I observe that step 3 spuriously locks
>>                 > up (once I get into the range of 100+ processes).
>>                 > From what I can tell, all processes in step 3 are
>>                 > accounted for and are waiting on the merge, but
>>                 > nothing happens. The collective barrier locks up.
>>                 >
>>                 > I am really having trouble resolving this issue; any
>>                 > ideas are appreciated!
>>                 >
>>                 > Thank you very much.
>>                 > -Dmitriy
>>                 >
>>                 >