[mpich-discuss] spurious lock ups on collective merge intercom

Balaji, Pavan balaji at anl.gov
Tue Feb 7 13:15:17 CST 2017


Hi Dmitriy,

You should use name publishing/lookup (MPI_Publish_name / MPI_Lookup_name) for this, instead of manually printing and copying the port name.  I've attached updated server and client codes that do that.

You'll need to start the Hydra nameserver for this using:

% hydra_nameserver

Once the nameserver has started, you can connect all mpiexecs to it using something like:

% mpiexec -nameserver localhost ./server 2

Here my nameserver was on the localhost, but you'll need to give the host on which the nameserver is running.  The nameserver is persistent, so you run it once and reuse it for any number of mpiexecs.
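For reference, the publish/lookup pattern looks roughly like this (a minimal sketch rather than the attached code verbatim; the service name "dynamic-grid" is arbitrary, and error handling plus the accept/connect loop are omitted):

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    char portName[MPI_MAX_PORT_NAME];

    // Server side: open a port and publish it under an agreed-on service
    // name with the Hydra nameserver, instead of printing the port string.
    MPI_Open_port(MPI_INFO_NULL, portName);
    MPI_Publish_name("dynamic-grid", MPI_INFO_NULL, portName);

    // Client side would instead resolve the service name to a port string:
    //   char portName[MPI_MAX_PORT_NAME];
    //   MPI_Lookup_name("dynamic-grid", MPI_INFO_NULL, portName);
    //   MPI_Comm_connect(portName, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    // Cleanup: unpublish before closing the port.
    MPI_Unpublish_name("dynamic-grid", MPI_INFO_NULL, portName);
    MPI_Close_port(portName);

    MPI_Finalize();
    return 0;
}
```

Both sides must reach the same nameserver, which is why every mpiexec gets the -nameserver option.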

I'm not seeing lockups like you reported, but I tried it at a much smaller scale (4 nodes).

  -- Pavan



> On Feb 6, 2017, at 1:27 PM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
> Thank you, Kenneth.
>
> Here is a simple C++ equivalent of what i am doing:
>
> server.cpp:
> =============================
>
> #include <iostream>
> #include <mpi.h>
> #include <stdlib.h>
>
> using namespace std;
>
> // The only argument must be the number of processes in the communicator we expect to build.
> int main(int argc, char** argv)
> {
>
>     int np = atoi(argv[1]);
>
>     MPI_Init(&argc, &argv);
>
>     char portName[MPI_MAX_PORT_NAME];
>
>     MPI_Open_port(MPI_INFO_NULL, portName);
>
>     cout << portName << "\n";
>
>     MPI_Comm intercomm, intracomm = MPI_COMM_SELF;
>
>     // Grow the intracomm dynamically until np processes are reached.
>     for (int i = 1; i < np; i++) {
>
>         MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);
>
>         cout << "Accepted.\n";
>
>         // Merge into a new intracomm and free the old one afterwards
>         // (MPI_COMM_SELF must not be freed).
>         MPI_Comm merged;
>         MPI_Intercomm_merge(intercomm, false, &merged);
>
>         cout << "Merged to an intracomm.\n";
>
>         MPI_Comm_free(&intercomm);
>         if (intracomm != MPI_COMM_SELF)
>             MPI_Comm_free(&intracomm);
>         intracomm = merged;
>     }
>
>     // intracomm now contains all np processes and can be used for the grid.
>     MPI_Comm_free(&intracomm);
>
>     MPI_Close_port(portName);
>
>     MPI_Finalize();
> }
>
>
> client.cpp:
> ===============================
> #include <iostream>
> #include <mpi.h>
> #include <stdlib.h>
>
> using namespace std;
>
> // This expects the intracomm size and the port name to connect to.
> // When passing the port name in a shell, use single quotes to avoid shell substitution.
> int main(int argc, char** argv)
> {
>
>     MPI_Init(&argc, &argv);
>
>     int np = atoi(argv[1]);
>     char* portName = argv[2];
>
>     cout << "Connecting to " << portName << "\n";
>
>     MPI_Comm intercomm, intracomm;
>
>     MPI_Comm_connect(portName, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
>
>     cout << "Connected.\n";
>
>     MPI_Intercomm_merge(intercomm, true, &intracomm);
>
>     cout << "Merged.\n";
>
>     MPI_Comm_free(&intercomm);
>
>     int i;
>
>     MPI_Comm_size(intracomm, &i);
>
>     // Grow the intracomm dynamically until np processes are reached.
>     for (; i < np; i++) {
>
>         MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);
>
>         cout << "Accepted.\n";
>
>         // Merge into a new intracomm and free the old one.
>         MPI_Comm merged;
>         MPI_Intercomm_merge(intercomm, false, &merged);
>
>         cout << "Merged to an intracomm.\n";
>
>         MPI_Comm_free(&intercomm);
>         MPI_Comm_free(&intracomm);
>         intracomm = merged;
>     }
>
>     // intracomm now contains all np processes and can be used for the grid.
>     MPI_Comm_free(&intracomm);
>
>     MPI_Finalize();
> }
>
> ============================
> Example run on one machine, intracomm size = 2 (in this case I ran 3.3a):
>
> dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ server.cpp -o server
> dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ client.cpp -o client
> dmitriy at Intel-Kubu:~/projects/mpitests$
> dmitriy at Intel-Kubu:~/projects/mpitests$
> dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./server 2
> tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
> Accepted.
> Merged to an intracom.
> dmitriy at Intel-Kubu:~/projects/mpitests$
>
> (in another shell)
> dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./client 2 'tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$'
> Connecting to tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
> Connected.
> Merged.
> dmitriy at Intel-Kubu:~/projects/mpitests$
>
> The first parameter is the eventual size of the intracomm we are trying to build dynamically, and the client also needs to know the port reported by the server process. There is therefore 1 server and (n-1) clients that connect to form the intracomm.
>
> Now, if I do that for 192 processes on a 192-core cluster (occasionally slightly overloaded in terms of CPU load), I get a lock-up more often than not. The incidence is more frequent with 3.2 than 3.3a2. This cluster has Mellanox InfiniBand.
>
> 3.2 usually locks up on the intercomm merge call; 3.3a2 locked up at least once with 2 clients connected and waiting on the intercomm merge at the same time (but my understanding is that only one client should be connected at a time, even if there are many connects pending from other clients).
>
> Hope this gives a little more material.
> -Dmitriy
>
>
> On Fri, Feb 3, 2017 at 8:40 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
> Hi Dmitriy,
>
> MPICH does appear to be reporting a process exit/crash in this case. A simple reproducer would be useful to test whether that is indeed the cause or if there's something else going on.
>
> I see below that you are using a non-standard MPI binding. If the test case is simple enough, we can try to port it and investigate further.
>
> Ken
>
> On 01/19/2017 06:58 PM, Dmitriy Lyubimov wrote:
> These lock-ups seem to be gone in 3.3a2.
>
> I do occasionally get the following though:
>
> Unknown error class, error stack:
> PMPI_Comm_accept(129).................:
> MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$", MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
> MPID_Comm_accept(153).................:
> MPIDI_Comm_accept(1244)...............:
> MPIR_Get_contextid_sparse_group(499)..:
> MPIR_Allreduce_impl(755)..............:
> MPIR_Allreduce_intra(414).............:
> MPIDU_Complete_posted_with_error(1137): Process failed
>
> What does this message mean? Did some process just exit/die (e.g., with a
> seg fault)?
>
> Thank you.
> -Dmitriy
>
> On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com
> <mailto:dlieu.7 at gmail.com>> wrote:
>
>     Further debugging shows that it's not actually the intercomm merge that
>     locks up, but a pair of send/recv calls that two processes decide to
>     execute before MPI_Intercomm_merge.
>
>     So the total snapshot of the situation is that everyone waits on the
>     intercomm merge, except for two processes that wait in send and recv
>     respectively, while the majority of the others have already entered the
>     collective barrier.
>
>     It would seem that this sort of asymmetric logic should be acceptable,
>     since the send/recv pair is balanced before the merge is to occur, but
>     in practice it seems to lock up -- increasingly so as the number of
>     participating processes increases. It is almost as if, once a collective
>     barrier of a certain cardinality is formed, point-to-point messages no
>     longer go through.
>
>     If this scenario begets any ideas, please let me know.
>
>     thank you!
>     -Dmitriy
>
>
>
>     On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>         Maybe it has something to do with the fact that it is stepping
>         through the JVM via JNI and that somehow breaks the threading model
>         of MPI, although it is a single-threaded JVM process, and MPI
>         bindings like this have been done before (e.g., Open MPI had an
>         effort towards that).
>
>         The strange thing is that I never had a lock-up with fewer than 120
>         processes, but something changes beyond that: the spurious condition
>         becomes much more common. By the time I am at 150 processes in the
>         intercomm, I am almost certain to get a merge lock-up.
>
>
>         On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>             Thanks.
>             It would not be easy for me to do immediately, as I am using a
>             proprietary Scala binding API for MPI.
>
>             It would help me to know whether there has been a known problem
>             like this in the past, or whether the intercomm merge API is
>             generally known to work on hundreds of processes. Sounds like
>             there are no known issues with that.
>
>
>
>             On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>
>                 Hello Dmitriy,
>
>                 Can you maybe create a simple example program to
>                 reproduce this failure?
>                 It is often easier to look at a code example
>                 to identify a problem.
>
>                 Thanks,
>                 Lena
>                 > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>                 >
>                 > Hello,
>                 >
>                 > (mpich 3.2)
>                 >
>                 > I have a scenario where I add a few extra processes to an
>                 > existing intercomm.
>                 >
>                 > it works as a simple loop --
>                 > (1) n processes accept on the n-process intracomm
>                 > (2) 1 process connects
>                 > (3) the resulting intercomm is merged into an (n+1)-process
>                 > intracomm; the intercomm and the old intracomm are closed
>                 > (4) repeat 1-3 as needed.
>                 >
>                 > Occasionally, I observe that step 3 spuriously locks
>                 > up (once I get in the range of 100+ processes). From
>                 > what I can tell, all processes in step 3 are accounted
>                 > for and are waiting on the merge, but nothing happens:
>                 > the collective barrier locks up.
>                 >
>                 > I really have trouble resolving this issue; any ideas
>                 > are appreciated!
>                 >
>                 > Thank you very much.
>                 > -Dmitriy
>                 >
>                 >
>                 > _______________________________________________
>                 > discuss mailing list     discuss at mpich.org
>                 > To manage subscription options or unsubscribe:
>                 > https://lists.mpich.org/mailman/listinfo/discuss
>
>
>
>
>
>
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: client.cpp
Type: application/octet-stream
Size: 1297 bytes
Desc: client.cpp
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20170207/a450513d/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: server.cpp
Type: application/octet-stream
Size: 1120 bytes
Desc: server.cpp
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20170207/a450513d/attachment-0001.obj>

