[mpich-discuss] spurious lock ups on collective merge intercom
Kenneth Raffenetti
raffenet at mcs.anl.gov
Fri Feb 3 10:40:23 CST 2017
Hi Dmitriy,
MPICH does appear to be reporting a process exit/crash in this case. A
simple reproducer would be useful to test whether that is indeed the cause
or whether something else is going on.
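As an aside, if you want to see exactly which error code/class your binding
is getting back, you can install MPI_ERRORS_RETURN on the communicator and
decode the return value yourself. A rough sketch (the helper name and the
usage comment are just illustrative, not taken from your program):

    #include <mpi.h>
    #include <stdio.h>

    /* Decode an MPI return code into an error class and message text.
     * With MPI_ERRORS_RETURN installed on the communicator, a failing
     * MPI_Comm_accept() hands the code back instead of aborting the job. */
    static void report_mpi_error(const char *where, int rc)
    {
        int eclass = 0, len = 0;
        char msg[MPI_MAX_ERROR_STRING];

        MPI_Error_class(rc, &eclass);
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "%s failed, class %d: %s\n", where, eclass, msg);
    }

    /* usage, assuming 'comm' is the communicator passed to accept:
     *   MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
     *   int rc = MPI_Comm_accept(port, MPI_INFO_NULL, root, comm, &newcomm);
     *   if (rc != MPI_SUCCESS) report_mpi_error("MPI_Comm_accept", rc);
     */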
I see below that you are using a non-standard MPI binding. If the test
case is simple enough, we can try to port it and investigate further.
Ken
On 01/19/2017 06:58 PM, Dmitriy Lyubimov wrote:
> These lock-ups seem to be gone in 3.3a2.
>
> I do occasionally get the following though:
>
> Unknown error class, error stack:
> PMPI_Comm_accept(129).................:
> MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$",
> MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
> MPID_Comm_accept(153).................:
> MPIDI_Comm_accept(1244)...............:
> MPIR_Get_contextid_sparse_group(499)..:
> MPIR_Allreduce_impl(755)..............:
> MPIR_Allreduce_intra(414).............:
> MPIDU_Complete_posted_with_error(1137): Process failed
>
> What does this message mean? Did some process simply exit/die (e.g.,
> with a segfault)?
>
> Thank you.
> -Dmitriy
>
> On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>     Further debugging shows that it's not actually the intercomm merge
>     that locks up but a send/recv pair that two processes decide to
>     execute before MPI_Intercomm_merge.
>
>     So the overall snapshot of the situation is that everyone waits on
>     the merge except for two processes that wait in send and recv
>     respectively, while the majority of the others have already entered
>     the collective barrier.
>
>     It would seem that this sort of asymmetric logic should be
>     acceptable, since the send/recv pair is matched before the merge
>     occurs, but in practice it seems to lock up -- increasingly so as
>     the number of participating processes grows. It is almost as if,
>     once a collective barrier of a certain cardinality has formed,
>     point-to-point messages no longer go through.
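>
>     Roughly, the pattern looks like this (a stripped-down sketch, not our
>     actual code; the ranks, the tag, the communicator used for the
>     send/recv, and the 'high' flag are made up for illustration):
>
>         #include <mpi.h>
>
>         /* Two ranks exchange one point-to-point message and only then
>          * join the collective merge; every other rank calls the merge
>          * right away. */
>         static void merge_after_handshake(MPI_Comm intercomm, MPI_Comm *merged)
>         {
>             int rank, token = 0;
>             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>             if (rank == 0) {
>                 token = 42;
>                 MPI_Send(&token, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
>             } else if (rank == 1) {
>                 MPI_Recv(&token, 1, MPI_INT, 0, 7, MPI_COMM_WORLD,
>                          MPI_STATUS_IGNORE);
>             }
>
>             /* everyone, including ranks 0 and 1, eventually reaches the merge */
>             MPI_Intercomm_merge(intercomm, /* high = */ 0, merged);
>         }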
>
> If this scenario begets any ideas, please let me know.
>
> thank you!
> -Dmitriy
>
>
>
> On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>         Maybe it has something to do with the fact that it is going
>         through JVM JNI and that this somehow interferes with MPI's
>         threading model, although it is a single-threaded JVM process,
>         and JVM bindings for MPI have been done before (e.g., Open MPI
>         had an effort toward that).
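>
>         For what it's worth, one quick way to rule out a threading
>         mismatch is a tiny standalone test that reports the thread level
>         the library actually provides (sketch only; whether the binding
>         uses MPI_Init or MPI_Init_thread is an assumption on my side):
>
>             #include <mpi.h>
>             #include <stdio.h>
>
>             int main(int argc, char **argv)
>             {
>                 int provided = 0;
>                 /* a single-threaded JVM caller should be fine with
>                  * MPI_THREAD_SINGLE; 'provided' shows what we really got */
>                 MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);
>                 printf("provided thread level: %d\n", provided);
>                 MPI_Finalize();
>                 return 0;
>             }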
>
>         The strange thing is that I never had a lock-up with fewer than
>         120 processes, but something changes after that: the spurious
>         condition becomes much more common. By the time I am at 150
>         processes in the intercomm, I am almost certain to hit a merge
>         lock-up.
>
>
> On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
>             Thanks.
>             It would not be easy for me to do immediately, as I am using
>             a proprietary Scala binding API for MPI.
>
>             It would help me to know whether there is a known problem
>             like this from the past, or whether the intercomm-merge API
>             is generally known to work with hundreds of processes. It
>             sounds like there are no known issues with that.
>
>
>
> On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>
> Hello Dmitriy,
>
> Can you maybe create a simple example program to
> reproduce this failure?
> It is often easier to look at a code example
> when trying to identify a problem.
>
> Thanks,
> Lena
> > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> >
> > Hello,
> >
> > (mpich 3.2)
> >
> > I have a scenario where I add a few extra processes to an
> > existing intercomm.
> >
> > it works as a simple loop (sketched in code below) --
> > (1) the n existing processes accept on the n-process communicator
> > (2) 1 new process connects
> > (3) the resulting intercomm is merged into an (n+1)-process
> > communicator; the intercomm and the old n-process communicator
> > are closed
> > (4) repeat 1-3 as needed.
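> >
> > In MPI terms, one round of the loop on the accepting side looks
> > roughly like this (a minimal sketch of what the binding does under the
> > hood; the port handling, root rank, 'high' value, and error handling
> > are simplified assumptions):
> >
> >     #include <mpi.h>
> >
> >     /* One round of growing the group: the n existing processes accept
> >      * on 'comm', the newcomer connects elsewhere, both sides merge,
> >      * and the temporaries are freed.  'port' is assumed to have been
> >      * opened with MPI_Open_port and published out of band. */
> >     static MPI_Comm grow_group(MPI_Comm comm, const char *port)
> >     {
> >         MPI_Comm intercomm, merged;
> >
> >         MPI_Comm_accept(port, MPI_INFO_NULL, /* root = */ 0, comm, &intercomm);
> >         MPI_Intercomm_merge(intercomm, /* high = */ 0, &merged);
> >         MPI_Comm_free(&intercomm);
> >         MPI_Comm_free(&comm);
> >         return merged;   /* (n+1)-process communicator for the next round */
> >     }
> >
> >     /* The single joining process would do the mirror image:
> >      *   MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
> >      *   MPI_Intercomm_merge(intercomm, 1, &merged);
> >      */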
> >
> > Occasionally, I observe that step 3 spuriously locks
> > up (once I get into the range of 100+ processes). From
> > what I can tell, all processes in step 3 are accounted
> > for and are waiting on the merge, but nothing happens;
> > the collective barrier locks up.
> >
> > I am really having trouble resolving this issue; any ideas
> > are appreciated!
> >
> > Thank you very much.
> > -Dmitriy
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss