[mpich-discuss] spurious lock ups on collective merge intercom

Tue Feb 7 16:08:56 CST 2017

On Tue, Feb 7, 2017 at 11:15 AM, Balaji, Pavan <balaji at anl.gov> wrote:

> Hi Dmitriy,
>
> You should use publish/lookup name for this, instead of relying on manual
> printing.  I've attached updated server and client codes that do that.
> Please see attached.
>
> You'll need to start the Hydra nameserver for this using:
>
> % hydra_nameserver
>
> Once the nameserver has started, you can connect all mpiexecs to it using
> something like:
>
> % mpiexec -nameserver localhost ./server 2
>

Thanks, but we have our own name resolution architecture (which is not
manual printing/pasting). This is a simple example that was requested to
isolate the problem, not the actual code.

The name is guaranteed to be delivered to the client verbatim, down to the
bit. The usage is fully consistent with the MPI-3 spec.
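
As an aside, a cheap way to sanity-check that a port name arrived intact is
to split it into its fields before calling MPI_Comm_connect. Below is a
minimal sketch, assuming the key#value$ layout seen in the MPICH port strings
in this thread. That layout is an implementation detail rather than anything
the MPI standard defines, and parsePortName is a hypothetical helper, not
part of MPI or of our code.

```cpp
#include <map>
#include <string>

// Hypothetical helper: split "k1#v1$k2#v2$..." into a key/value map.
// Returns an empty map if a field is not terminated by '$' or has no '#',
// which is what a truncated or mangled port name would look like.
std::map<std::string, std::string> parsePortName(const std::string& port) {
    std::map<std::string, std::string> fields;
    std::size_t pos = 0;
    while (pos < port.size()) {
        const std::size_t end = port.find('$', pos);
        if (end == std::string::npos) return {};  // unterminated field
        const std::string field = port.substr(pos, end - pos);
        const std::size_t sep = field.find('#');
        if (sep == std::string::npos) return {};  // no key/value separator
        fields[field.substr(0, sep)] = field.substr(sep + 1);
        pos = end + 1;
    }
    return fields;
}
```

For the port string printed by the server in the example run, this yields
four fields (tag, description, port, ifname); a string cut off in transit
comes back as an empty map.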

Moreover, we use different ports in the same job multiple times, and we may
have multiple ports open on the cluster at the same time (though not in the
lock-up case), so the publish/lookup pattern is too simplistic for our use.

> Here my nameserver was on the localhost, but you'll need to give the host
> on which the nameserver is running.  The nameserver is persistent, so you
> run it once and reuse it for any number of mpiexecs.
>
> I'm not seeing lockups like you reported, but I tried it at a much smaller
> scale (4 nodes).
>

Yes, as I said, I can reach the lock-up state only sporadically, on a
192-core cluster, and only if I spin up almost all cores (one process per
core). Even then, I guess the connects have to be orchestrated in a
massively parallel fashion to improve the probability of it happening. This
is not easy to reproduce; I have never seen it happen with fewer than 140
processes. However, it does preclude us from fully utilizing the cluster's
capacity, and we have fully analyzed it: the lock-up occurs on a completely
legitimate sequence, in either the merge into an intracomm or a barrier call
(if inserted after accept) on the intercomm returned from accept/connect. At
the moment, some percentage of the jobs just has to be torn down on timeout
in such situations. The incidence definitely increases with load and core
count, and becomes extremely unsettling in some jobs.
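
To illustrate the kind of timeout-based teardown mentioned above, here is a
minimal sketch of a deadline-bounded polling loop. It is not our production
code. In a real job the predicate would wrap something like MPI_Test on the
request of a nonblocking collective such as MPI_Ibarrier; here it is an
arbitrary callable so that the sketch compiles without MPI. The name
pollUntil and its parameters are hypothetical.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: call `done` repeatedly until it returns true or
// `timeout` has elapsed. Returns true on completion, false on timeout.
bool pollUntil(const std::function<bool()>& done,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // caller decides how to tear the job down
        }
        std::this_thread::sleep_for(interval);  // avoid a hot spin
    }
    return true;
}
```

A false return is the point at which the caller would give up on the
collective and abort the job.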

We also have more complicated, tree-like process merges into a grid; this
is the simplest, most naive one (but the most prone to spurious lock-ups on a
collective-after-accept). And it mostly works... until it doesn't.
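
To put a rough number on how many collective steps the simple one-at-a-time
scheme involves, here is a small sketch that only counts merge participation
and does not call MPI. It assumes the pattern of the reproducer quoted below:
in round r the r existing members accept, one new process connects, and all
r+1 merge. The function name mergeCallsPerProcess is made up for
illustration.

```cpp
#include <cassert>
#include <vector>

// For each process in join order (index 0 is the server), count how many
// MPI_Intercomm_merge calls it would take part in while the intracomm
// grows one process at a time to `np` members.
std::vector<int> mergeCallsPerProcess(int np) {
    assert(np >= 1);
    std::vector<int> calls(np, 0);
    // Round r: members 0..r-1 accept, process r connects, all r+1 merge.
    for (int r = 1; r < np; ++r) {
        for (int p = 0; p <= r; ++p) {
            ++calls[p];
        }
    }
    return calls;
}
```

For np = 192 this comes to 191 sequential rounds and 18527 merge calls summed
over all processes, and a single call that never returns stalls every round
after it.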

>
>
>   -- Pavan
>
>
>
> > On Feb 6, 2017, at 1:27 PM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> >
> > Thank you, Kenneth.
> >
> > Here is a simple C++ equivalent of what i am doing:
> >
> > server.cpp:
> > =============================
> >
> > #include <iostream>
> > #include <mpi.h>
> > #include <stdlib.h>
> >
> > using namespace std;
> >
> > // The only argument must be the number of processes in communicator we expect to build.
> > int main(int argc, char** argv)
> > {
> >
> >     int np = atoi(argv[1]);
> >
> >     int ac = 0;
> >     MPI_Init(&ac, &argv);
> >
> >     char portName[MPI_MAX_PORT_NAME];
> >
> >     MPI_Open_port(MPI_INFO_NULL, portName);
> >
> >     cout << portName << "\n";
> >
> >     MPI_Comm intercomm, intracomm = MPI_COMM_SELF;
> >
> >     // Build an intracom dynamically until n processes are reached.
> >     for (int i = 1; i < np; i++) {
> >
> >         MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);
> >
> >         cout << "Accepted.\n";
> >
> >         MPI_Intercomm_merge(intercomm, false, &intracomm);
> >
> >         cout << "Merged to an intracom.\n";
> >
> >         MPI_Comm_free(&intercomm);
> >     }
> >
> >     // Intracomm contains the one we can now use with n-grid.
> >     MPI_Comm_free(&intracomm);
> >
> >     MPI_Close_port(portName);
> >
> >     MPI_Finalize();
> > }
> >
> >
> > client.cpp:
> > ===============================
> > #include <iostream>
> > #include <mpi.h>
> > #include <stdlib.h>
> >
> > using namespace std;
> >
> > // This expects intracom size and the port name to connect to.
> > // When using with shell, use single quotes to avoid shell substitution.
> > int main(int argc, char** argv)
> > {
> >
> >     int ac = 0;
> >     MPI_Init(&ac, &argv);
> >
> >     int np = atoi(argv[1]);
> >     char* portName = argv[2];
> >
> >     cout << "Connecting to " << portName << "\n";
> >
> >     MPI_Comm intercomm, intracomm;
> >
> >     MPI_Comm_connect(portName, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
> >
> >     cout << "Connected.\n";
> >
> >     MPI_Intercomm_merge(intercomm, true, &intracomm);
> >
> >     cout << "Merged.\n";
> >
> >     MPI_Comm_free(&intercomm);
> >
> >     int i;
> >
> >     MPI_Comm_size(intracomm, &i);
> >
> >     // Build an intracom dynamically until n processes are reached.
> >     for (; i < np; i++) {
> >
> >         MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);
> >
> >         cout << "Accepted.\n";
> >
> >         MPI_Intercomm_merge(intercomm, false, &intracomm);
> >
> >         cout << "Merged to an intracom.\n";
> >
> >         MPI_Comm_free(&intercomm);
> >     }
> >
> >     // Intracomm contains the one we can now use with n-grid.
> >     MPI_Comm_free(&intracomm);
> >
> >     MPI_Finalize();
> > }
> >
> > ============================
> > Run example on one machine, intracom size=2 (in this case I have run 3.3a)
> >
> > dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ server.cpp -o server
> > dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ client.cpp -o client
> > dmitriy at Intel-Kubu:~/projects/mpitests$
> > dmitriy at Intel-Kubu:~/projects/mpitests$
> > dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./server 2
> > tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
> > Accepted.
> > Merged to an intracom.
> > dmitriy at Intel-Kubu:~/projects/mpitests$
> >
> > (in another shell)
> > dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./client 2 'tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$'
> > Connecting to tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
> > Connected.
> > Merged.
> > dmitriy at Intel-Kubu:~/projects/mpitests$
> >
> > The first parameter is the eventual size of the intracom we are trying to
> > build dynamically; the client also needs to know the port reported by the
> > server process. So there is 1 server and (n-1) clients that connect to
> > form the intracom.
> >
> > Now, if I do that for 192 processes on a 192-core cluster (occasionally
> > slightly overloaded in terms of CPU load), I get a lock-up more often than
> > not. The incidence is higher for 3.2 than for 3.3a2. This cluster has
> > Mellanox InfiniBand.
> >
> > 3.2 usually locks up on the merge intercom call; 3.3a2 locked up at
> > least once with 2 clients connected and waiting on merge intercom at the
> > same time (but my understanding is that only one client should be connected
> > at a time, even if there are many connects pending from other clients).
> >
> > Hope this gives a little more material.
> > -Dmitriy
> >
> >
> > On Fri, Feb 3, 2017 at 8:40 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
> > Hi Dmitriy,
> >
> > MPICH does appear to be reporting a process exit/crash in this case. A
> > simple reproducer would be useful to test if that is indeed the cause or if
> > there's something else going on.
> >
> > I see below that you are using a non-standard MPI binding. If the test
> > case is simple enough, we can try to port it and investigate further.
> >
> > Ken
> >
> > On 01/19/2017 06:58 PM, Dmitriy Lyubimov wrote:
> > These lock-ups seem to be gone in 3.3a2.
> >
> > I do occasionally get the following though:
> >
> > Unknown error class, error stack:
> > PMPI_Comm_accept(129).................:
> > MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$",
> > MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
> > MPID_Comm_accept(153).................:
> > MPIDI_Comm_accept(1244)...............:
> > MPIR_Get_contextid_sparse_group(499)..:
> > MPIR_Allreduce_impl(755)..............:
> > MPIR_Allreduce_intra(414).............:
> > MPIDU_Complete_posted_with_error(1137): Process failed
> >
> > What does this message mean? Did some process just exit/die (e.g., with a
> > segfault)?
> >
> > Thank you.
> > -Dmitriy
> >
> > On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> >
> >     Further debugging shows that it's not actually mergeIntercom that
> >     locks up but a pair of send/recv that two nodes decide to execute
> >     before MPI_Intercomm_merge.
> >
> >     So the total snapshot of the situation is that everyone waits on
> >     mergeIntercom except for two processes that wait in send/recv
> >     respectively, while the majority of the others have already entered
> >     the collective barrier.
> >
> >     It would seem that this sort of asymmetric logic would be
> >     acceptable, since the send/recv pair is balanced before the merge is
> >     to occur, but in practice it seems to lock up, increasingly so as
> >     the number of participating processes increases. It is almost as if,
> >     once a collective barrier of a certain cardinality is formed,
> >     point-to-point messages no longer go through.
> >
> >     If this scenario begets any ideas, please let me know.
> >
> >     thank you!
> >     -Dmitriy
> >
> >
> >
> >     On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> >
> >         Maybe it has something to do with the fact that it is stepping
> >         through JVM JNI and that somehow upsets the threading model of MPI,
> >         although it is a single-threaded JVM process, and MPI mappings
> >         are known to have been done before (e.g., Open MPI had an effort
> >         towards that).
> >
> >         The strange thing is that I never had a lock-up with fewer than
> >         120 processes, but beyond that the spurious condition becomes
> >         much more common. By the time I am
> >         at 150 processes in the intercom, I am almost certain to have a
> >         merge lock-up.
> >
> >
> >         On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> >
> >             Thanks.
> >             It would not be easy for me to do immediately, as I am using
> >             a proprietary Scala binding API for MPI.
> >
> >             It would help me to know whether there has been a known problem
> >             like that in the past, or whether the mergeIntercomm API is
> >             generally known to work on hundreds of processes. It sounds
> >             like there are no known issues with that.
> >
> >
> >
> >             On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
> >
> >                 Hello Dmitriy,
> >
> >                 can you maybe create a simple example program to
> >                 reproduce this failure?
> >                 It is also often easier to identify a problem by looking
> >                 at a code example.
> >
> >                 Thanks,
> >                 Lena
> >                 > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> >                 >
> >                 > Hello,
> >                 >
> >                 > (mpich 3.2)
> >                 >
> >                 > I have a scenario where I add a few extra processes to
> >                 > an existing intracom.
> >                 >
> >                 > It works as a simple loop:
> >                 > (1) n processes accept on the n-process intracom
> >                 > (2) 1 process connects
> >                 > (3) the resulting intercom is merged into an (n+1)-process
> >                 > intracom; the intercom and the n-process intracom are closed
> >                 > (4) repeat 1-3 as needed.
> >                 >
> >                 > Occasionally, I observe that step 3 spuriously locks
> >                 > up (once I get into the range of 100+ processes). From
> >                 > what I can tell, all processes in step 3 are accounted
> >                 > for and are waiting on the merge, but nothing happens.
> >                 > The collective barrier locks up.
> >                 >
> >                 > I really have trouble resolving this issue; any ideas
> >                 > are appreciated!
> >                 >
> >                 > Thank you very much.
> >                 > -Dmitriy
> >                 >
> >                 >
> >                 > _______________________________________________
> >                 > discuss mailing list     discuss at mpich.org
> >                 > To manage subscription options or unsubscribe:
> >                 > https://lists.mpich.org/mailman/listinfo/discuss