[mpich-discuss] spurious lock ups on collective merge intercom
Kenneth Raffenetti
raffenet at mcs.anl.gov
Fri Feb 3 10:40:23 CST 2017
Hi Dmitriy,
MPICH does appear to be reporting a process exit/crash in this case. A
simple reproducer would be useful to test whether that is indeed the cause
or whether something else is going on.
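As an aside, if you want to see exactly which error code/class your binding
is getting back, you can install MPI_ERRORS_RETURN on the communicator and
decode the return value yourself. A rough sketch (the helper name and the
usage comment are just illustrative, not taken from your program):

    #include <mpi.h>
    #include <stdio.h>

    /* Decode an MPI return code into an error class and message text.
     * With MPI_ERRORS_RETURN installed on the communicator, a failing
     * MPI_Comm_accept() hands the code back instead of aborting the job. */
    static void report_mpi_error(const char *where, int rc)
    {
        int eclass = 0, len = 0;
        char msg[MPI_MAX_ERROR_STRING];

        MPI_Error_class(rc, &eclass);
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "%s failed, class %d: %s\n", where, eclass, msg);
    }

    /* usage, assuming 'comm' is the communicator passed to accept:
     *   MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
     *   int rc = MPI_Comm_accept(port, MPI_INFO_NULL, root, comm, &newcomm);
     *   if (rc != MPI_SUCCESS) report_mpi_error("MPI_Comm_accept", rc);
     */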
I see below that you are using a non-standard MPI binding. If the test
case is simple enough, we can try to port it and investigate further.
Ken
On 01/19/2017 06:58 PM, Dmitriy Lyubimov wrote:
> These lock-ups seem to be gone in 3.3a2.
>
> I do occasionally get the following though:
>
> Unknown error class, error stack:
> PMPI_Comm_accept(129).................:
> MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$",
> MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
> MPID_Comm_accept(153).................:
> MPIDI_Comm_accept(1244)...............:
> MPIR_Get_contextid_sparse_group(499)..:
> MPIR_Allreduce_impl(755)..............:
> MPIR_Allreduce_intra(414).............:
> MPIDU_Complete_posted_with_error(1137): Process failed
>
> What does this message mean? Did some process simply exit/die (e.g.,
> with a segfault)?
>
> Thank you.
> -Dmitriy
>
> On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>     Further debugging shows that it's not actually the intercomm merge
>     that locks up but a send/recv pair that two processes decide to
>     execute before MPI_Intercomm_merge.
>
>     So the overall snapshot of the situation is that everyone waits on
>     the merge except for two processes that wait in send and recv
>     respectively, while the majority of the others have already entered
>     the collective barrier.
>
>     It would seem that this sort of asymmetric logic should be
>     acceptable, since the send/recv pair is matched before the merge
>     occurs, but in practice it seems to lock up -- increasingly so as
>     the number of participating processes grows. It is almost as if,
>     once a collective barrier of a certain cardinality has formed,
>     point-to-point messages no longer go through.
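>
>     Roughly, the pattern looks like this (a stripped-down sketch, not our
>     actual code; the ranks, the tag, the communicator used for the
>     send/recv, and the 'high' flag are made up for illustration):
>
>         #include <mpi.h>
>
>         /* Two ranks exchange one point-to-point message and only then
>          * join the collective merge; every other rank calls the merge
>          * right away. */
>         static void merge_after_handshake(MPI_Comm intercomm, MPI_Comm *merged)
>         {
>             int rank, token = 0;
>             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>             if (rank == 0) {
>                 token = 42;
>                 MPI_Send(&token, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
>             } else if (rank == 1) {
>                 MPI_Recv(&token, 1, MPI_INT, 0, 7, MPI_COMM_WORLD,
>                          MPI_STATUS_IGNORE);
>             }
>
>             /* everyone, including ranks 0 and 1, eventually reaches the merge */
>             MPI_Intercomm_merge(intercomm, /* high = */ 0, merged);
>         }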
>
> If this scenario begets any ideas, please let me know.
>
> thank you!
> -Dmitriy
>
>
>
> On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>         Maybe it has something to do with the fact that it is going
>         through JVM JNI and that this somehow interferes with MPI's
>         threading model, although it is a single-threaded JVM process,
>         and JVM bindings for MPI have been done before (e.g., Open MPI
>         had an effort toward that).
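>
>         For what it's worth, one quick way to rule out a threading
>         mismatch is a tiny standalone test that reports the thread level
>         the library actually provides (sketch only; whether the binding
>         uses MPI_Init or MPI_Init_thread is an assumption on my side):
>
>             #include <mpi.h>
>             #include <stdio.h>
>
>             int main(int argc, char **argv)
>             {
>                 int provided = 0;
>                 /* a single-threaded JVM caller should be fine with
>                  * MPI_THREAD_SINGLE; 'provided' shows what we really got */
>                 MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
>                 printf("provided thread level: %d\n", provided);
>                 MPI_Finalize();
>                 return 0;
>             }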
>
>         The strange thing is that I never had a lock-up with fewer than
>         120 processes, but something changes after that: the spurious
>         condition becomes much more common. By the time I am at 150
>         processes in the intercomm, I am almost certain to hit a merge
>         lock-up.
>
>
> On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>             Thanks.
>             It would not be easy for me to do immediately, as I am using
>             a proprietary Scala binding API for MPI.
>
>             It would help me to know whether there is a known problem
>             like this from the past, or whether the intercomm-merge API
>             is generally known to work with hundreds of processes. It
>             sounds like there are no known issues with that.
>
>
>
> On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>
> Hello Dmitriy,
>
> Can you maybe create a simple example program to
> reproduce this failure?
> It is often easier to look at a code example
> when trying to identify a problem.
>
> Thanks,
> Lena
> > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> >
> > Hello,
> >
> > (mpich 3.2)
> >
> > I have a scenario where I add a few extra processes to an
> > existing intercomm.
> >
> > it works as a simple loop (sketched in code below) --
> > (1) the n existing processes accept on the n-process communicator
> > (2) 1 new process connects
> > (3) the resulting intercomm is merged into an (n+1)-process
> > communicator; the intercomm and the old n-process communicator
> > are closed
> > (4) repeat 1-3 as needed.
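> >
> > In MPI terms, one round of the loop on the accepting side looks
> > roughly like this (a minimal sketch of what the binding does under the
> > hood; the port handling, root rank, 'high' value, and error handling
> > are simplified assumptions):
> >
> >     #include <mpi.h>
> >
> >     /* One round of growing the group: the n existing processes accept
> >      * on 'comm', the newcomer connects elsewhere, both sides merge,
> >      * and the temporaries are freed.  'port' is assumed to have been
> >      * opened with MPI_Open_port and published out of band. */
> >     static MPI_Comm grow_group(MPI_Comm comm, const char *port)
> >     {
> >         MPI_Comm intercomm, merged;
> >
> >         MPI_Comm_accept(port, MPI_INFO_NULL, /* root = */ 0, comm, &intercomm);
> >         MPI_Intercomm_merge(intercomm, /* high = */ 0, &merged);
> >         MPI_Comm_free(&intercomm);
> >         MPI_Comm_free(&comm);
> >         return merged;   /* (n+1)-process communicator for the next round */
> >     }
> >
> >     /* The single joining process would do the mirror image:
> >      *   MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
> >      *   MPI_Intercomm_merge(intercomm, 1, &merged);
> >      */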
> >
> > Occasionally, I observe that step 3 spuriously locks
> > up (once I get into the range of 100+ processes). From
> > what I can tell, all processes in step 3 are accounted
> > for and are waiting on the merge, but nothing happens;
> > the collective barrier locks up.
> >
> > I am really having trouble resolving this issue; any ideas
> > are appreciated!
> >
> > Thank you very much.
> > -Dmitriy
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss