[mpich-discuss] spurious lock ups on collective merge intercom
Dmitriy Lyubimov
dlieu.7 at gmail.com
Mon Feb 6 13:27:04 CST 2017
Thank you, Kenneth.
Here is a simple C++ equivalent of what I am doing:
server.cpp:
=============================
#include <iostream>
#include <mpi.h>
#include <stdlib.h>

using namespace std;

// The only argument must be the number of processes in the communicator we
// expect to build.
int main(int argc, char** argv)
{
    int np = atoi(argv[1]);
    int ac = 0;
    MPI_Init(&ac, &argv);

    char portName[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, portName);
    cout << portName << "\n";

    MPI_Comm intercomm, intracomm = MPI_COMM_SELF;

    // Build an intracomm dynamically until np processes are reached.
    for (int i = 1; i < np; i++) {
        MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);
        cout << "Accepted.\n";
        MPI_Intercomm_merge(intercomm, false, &intracomm);
        cout << "Merged to an intracom.\n";
        MPI_Comm_free(&intercomm);
    }

    // The intracomm now contains the one we can use with the n-grid.
    MPI_Comm_free(&intracomm);
    MPI_Close_port(portName);
    MPI_Finalize();
}
=============================
client.cpp:
===============================
#include <iostream>
#include <mpi.h>
#include <stdlib.h>

using namespace std;

// This expects the intracomm size and the port name to connect to.
// When passing the port name on the shell command line, use single quotes
// to avoid shell substitution.
int main(int argc, char** argv)
{
    int ac = 0;
    MPI_Init(&ac, &argv);

    int np = atoi(argv[1]);
    char* portName = argv[2];
    cout << "Connecting to " << portName << "\n";

    MPI_Comm intercomm, intracomm;
    MPI_Comm_connect(portName, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    cout << "Connected.\n";
    MPI_Intercomm_merge(intercomm, true, &intracomm);
    cout << "Merged.\n";
    MPI_Comm_free(&intercomm);

    int i;
    MPI_Comm_size(intracomm, &i);

    // Keep building the intracomm dynamically until np processes are reached.
    for (; i < np; i++) {
        MPI_Comm_accept(portName, MPI_INFO_NULL, 0, intracomm, &intercomm);
        cout << "Accepted.\n";
        MPI_Intercomm_merge(intercomm, false, &intracomm);
        cout << "Merged to an intracom.\n";
        MPI_Comm_free(&intercomm);
    }

    // The intracomm now contains the one we can use with the n-grid.
    MPI_Comm_free(&intracomm);
    MPI_Finalize();
}
============================
Example run on one machine, intracomm size = 2 (in this case I ran 3.3a):
dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ server.cpp -o server
dmitriy at Intel-Kubu:~/projects/mpitests$ mpic++ client.cpp -o client
dmitriy at Intel-Kubu:~/projects/mpitests$
dmitriy at Intel-Kubu:~/projects/mpitests$
dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./server 2
tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
Accepted.
Merged to an intracom.
dmitriy at Intel-Kubu:~/projects/mpitests$
(in another shell)
dmitriy at Intel-Kubu:~/projects/mpitests$ mpiexec ./client 2 'tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$'
Connecting to tag#0$description#Intel-Kubu$port#39210$ifname#127.0.1.1$
Connected.
Merged.
dmitriy at Intel-Kubu:~/projects/mpitests$
The first parameter is the eventual size of the intracomm we are trying to
build dynamically, and the client also needs to know the port reported by the
server process. There is therefore 1 server and (n-1) clients that connect to
form the intracomm.
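(As an aside, the port string does not strictly have to be copied by hand: the
standard MPI name service could publish it under a well-known name. A rough
sketch follows; the service name "mpitests" is made up for illustration, and
whether a lookup works across separate mpiexec invocations depends on the
runtime having a shared name server, so treat this as an assumption rather
than something I have verified here.)
=============================
#include <mpi.h>

// Server side: publish the opened port under an arbitrary service name
// ("mpitests" is just a placeholder chosen for this sketch).
void publishPort(const char* portName)
{
    MPI_Publish_name("mpitests", MPI_INFO_NULL, portName);
}

// Client side: look the port up instead of passing it on the command line.
// 'portName' must have room for MPI_MAX_PORT_NAME characters.
void lookupPort(char* portName)
{
    MPI_Lookup_name("mpitests", MPI_INFO_NULL, portName);
}

// The server should call MPI_Unpublish_name("mpitests", MPI_INFO_NULL,
// portName) before MPI_Close_port.
=============================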
Now, if I do that for 192 processes on a 192-core cluster (occasionally
slightly overloaded in terms of CPU load), I get a lock-up more often than
not. The incidence is more frequent with 3.2 than with 3.3a2. This cluster
has Mellanox InfiniBand.
3.2 usually locks up on the intercomm merge call; 3.3a2 locked up at least
once with 2 clients connected and waiting on the intercomm merge at the same
time (but my understanding is that only one client should be connected at a
time, even if there is a massive number of connects pending from other clients).
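For illustration, here is a minimal sketch of the asymmetric pattern I
described earlier in the thread (quoted below): two members of the existing
group exchange one point-to-point message right before the merge, while every
other rank enters MPI_Intercomm_merge immediately. This is a reconstruction
for discussion, not the actual application code; the ranks, tag and payload
are made up, and in the real run the send/recv may have gone over the
intercommunicator rather than within the existing group.
=============================
#include <mpi.h>

// Sketch only: 'intracomm' is the group built so far by the accept/merge
// loop, 'intercomm' is the intercommunicator to the newly connected process.
// Ranks 0 and size-1 exchange one message before merging; everyone else
// merges right away. The send/recv pair is matched, so in principle this
// should still complete.
void mergeWithSideChannel(MPI_Comm intracomm, MPI_Comm intercomm,
                          MPI_Comm* newIntracomm)
{
    int rank, size;
    MPI_Comm_rank(intracomm, &rank);
    MPI_Comm_size(intracomm, &size);

    int token = 42;    // made-up payload
    const int tag = 7; // made-up tag
    if (size > 1 && rank == 0) {
        MPI_Send(&token, 1, MPI_INT, size - 1, tag, intracomm);
    } else if (size > 1 && rank == size - 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, tag, intracomm, MPI_STATUS_IGNORE);
    }

    // All ranks of the existing group now enter the collective merge.
    MPI_Intercomm_merge(intercomm, 0, newIntracomm);
}
=============================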
Hope this gives a little more material.
-Dmitriy
On Fri, Feb 3, 2017 at 8:40 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov>
wrote:
> Hi Dmitriy,
>
> MPICH does appear to be reporting a process exit/crash in this case. A
> simple reproducer would be useful to test if that is indeed the cause or if
> there's something else going on.
>
> I see below that you are using a non-standard MPI binding. If the test
> case is simple enough, we can try to port it and investigate further.
>
> Ken
>
> On 01/19/2017 06:58 PM, Dmitriy Lyubimov wrote:
>
>> These lock-ups seem to be gone in 3.3a2.
>>
>> I do occasionally get the following though:
>>
>> Unknown error class, error stack:
>> PMPI_Comm_accept(129).................:
>> MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$",
>> MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
>> MPID_Comm_accept(153).................:
>> MPIDI_Comm_accept(1244)...............:
>> MPIR_Get_contextid_sparse_group(499)..:
>> MPIR_Allreduce_impl(755)..............:
>> MPIR_Allreduce_intra(414).............:
>> MPIDU_Complete_posted_with_error(1137): Process failed
>>
>> What does this message mean? Did some process just exit/die (e.g., with a
>> segfault)?
>>
>> Thank you.
>> -Dmitriy
>>
>> On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>>
>> Further debugging shows that it's not actually mergeIntercom that
>> locks up but a pair of send/recv calls that two nodes decide to execute
>> before MPI_Intercomm_merge.
>>
>> So the total snapshot of the situation is that everyone waits on
>> mergeIntercom except for two processes that wait in send/recv
>> respectively, while the majority of the others have already entered the
>> collective barrier.
>>
>> It would seem that this sort of asymmetric logic should be acceptable,
>> since the send/recv pair is balanced before the merge is to occur, but
>> in practice it seems to lock up -- increasingly so as the number of
>> participating processes increases. It is almost as if, once a collective
>> barrier of a certain cardinality has formed, point-to-point messages no
>> longer go through.
>>
>> If this scenario begets any ideas, please let me know.
>>
>> thank you!
>> -Dmitriy
>>
>>
>>
>> On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>>
>> Maybe it has something to do with the fact that it is going through
>> JVM JNI and that somehow interferes with the threading model of MPI,
>> although it is a single-threaded JVM process, and JVM mappings for MPI
>> have been done before (e.g., Open MPI had an effort towards that).
>>
>> The strange thing is that I never had a lock-up with fewer than 120
>> processes, but something changes after that: the spurious condition
>> becomes much more common. By the time I am at 150 processes in the
>> intercomm, I am almost certain to hit a merge lock-up.
>>
>>
>> On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>>
>> Thanks.
>> It would not be easy for me to do immediately, as I am using a
>> proprietary Scala binding API for MPI.
>>
>> It would help me to know whether there has been a known problem like
>> that in the past, or whether the mergeIntercomm API is generally known
>> to work on hundreds of processes. It sounds like there are no known
>> issues with that.
>>
>>
>>
>> On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>>
>> Hello Dmitriy,
>>
>> Can you maybe create a simple example program to
>> reproduce this failure?
>> It is also often easier to look at a code example
>> to identify a problem.
>>
>> Thanks,
>> Lena
>> > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > (mpich 3.2)
>> >
>> > I have a scenario where I add a few extra processes to an existing
>> > intercom.
>> >
>> > It works as a simple loop --
>> > (1) n processes accept on the n-intercom
>> > (2) 1 process connects
>> > (3) the intracom is merged into an (n+1)-intercom; the intracom and
>> > n-intercom are closed
>> > (4) repeat 1-3 as needed.
>> >
>> > Occasionally, I observe that step 3 spuriously locks up (once I get
>> > into the range of 100+ processes). From what I can tell, all processes
>> > in step 3 are accounted for and are waiting on the merge, but nothing
>> > happens. The collective barrier locks up.
>> >
>> > I really have trouble resolving this issue; any ideas are appreciated!
>> >
>> > Thank you very much.
>> > -Dmitriy
>> >
>> >