[mpich-discuss] spurious lock-ups on collective merge intercomm

Dmitriy Lyubimov dlieu.7 at gmail.com
Thu Jan 19 18:58:19 CST 2017


These lock-ups seem to be gone in 3.3a2.

I do occasionally get the following though:

Unknown error class, error stack:
PMPI_Comm_accept(129).................:
MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$", MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
MPID_Comm_accept(153).................:
MPIDI_Comm_accept(1244)...............:
MPIR_Get_contextid_sparse_group(499)..:
MPIR_Allreduce_impl(755)..............:
MPIR_Allreduce_intra(414).............:
MPIDU_Complete_posted_with_error(1137): Process failed

What does this message mean? Did some process just exit/die (e.g., with a
seg fault)?

Thank you.
-Dmitriy

On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com>
wrote:

> Further debugging shows that it's not actually the intercomm merge that
> locks up, but a send/recv pair that two nodes decide to execute before
> MPI_Intercomm_merge.
>
> So the overall snapshot of the situation is that everyone waits in the
> merge except for two processes that wait in send and recv respectively,
> while the majority of the others have already entered the collective barrier.
>
> It would seem that this sort of asymmetric logic should be acceptable,
> since the send/recv pair is matched before the merge occurs, but in
> practice it locks up, increasingly so as the number of participating
> processes grows. It is almost as if, once a collective barrier of a
> certain cardinality has formed, point-to-point messages no longer go
> through.
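>
> In rough C terms, the pattern looks like the sketch below (the helper name,
> the tag, and the choice of rank 0 as the point-to-point peer are assumptions
> for illustration; the real code goes through the Scala binding):
>
> #include <mpi.h>
>
> /* One round of the asymmetric exchange: rank 0 of the accepting group and
>  * the single connecting process trade a token over the intercommunicator,
>  * then every process (including those that did no point-to-point work)
>  * enters the collective MPI_Intercomm_merge. */
> static void merge_with_handshake(MPI_Comm intercomm, int is_acceptor,
>                                  MPI_Comm *merged)
> {
>     int rank, token = 42;
>     MPI_Comm_rank(intercomm, &rank);   /* rank within the local group */
>
>     if (is_acceptor && rank == 0) {
>         /* acceptor side: rank 0 talks to the lone remote process */
>         MPI_Send(&token, 1, MPI_INT, 0, 7, intercomm);
>         MPI_Recv(&token, 1, MPI_INT, 0, 7, intercomm, MPI_STATUS_IGNORE);
>     } else if (!is_acceptor) {
>         /* the single connecting process mirrors the exchange */
>         MPI_Recv(&token, 1, MPI_INT, 0, 7, intercomm, MPI_STATUS_IGNORE);
>         MPI_Send(&token, 1, MPI_INT, 0, 7, intercomm);
>     }
>
>     /* Every other acceptor rank arrives here immediately and waits in the
>      * merge; this is where the hang shows up once the process count grows
>      * past roughly 120. */
>     MPI_Intercomm_merge(intercomm, is_acceptor ? 0 : 1, merged);
> }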
>
> If this scenario begets any ideas, please let me know.
>
> thank you!
> -Dmitriy
>
>
>
> On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com>
> wrote:
>
>> Maybe it has something to do with the fact that the calls go through JVM
>> JNI and that somehow interferes with MPI's threading model, although it is
>> a single-threaded JVM process, and MPI bindings for the JVM have been done
>> before (e.g., Open MPI had an effort in that direction).
>>
>> The strange thing is that I never had a lock-up with the number of
>> processes under 120, but something changes after that: the spurious
>> condition becomes much more common. By the time I am at 150 processes in
>> the intercomm, I am almost certain to hit a merge lock-up.
>>
>>
>> On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com>
>> wrote:
>>
>>> Thanks.
>>> It would not be easy for me to do immediately, as I am using a proprietary
>>> Scala binding API for MPI.
>>>
>>> It would help me to know whether there is a known problem like this from
>>> the past, or whether the intercomm merge API is generally known to work
>>> with hundreds of processes. It sounds like there are no known issues with
>>> that.
>>>
>>>
>>>
>>> On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>>>
>>>> Hello Dmitriy,
>>>>
>>>> Can you maybe create a simple example program to reproduce this failure?
>>>> It is often easier to look at a code example to identify a problem.
>>>>
>>>> Thanks,
>>>> Lena
>>>> > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com>
>>>> wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > (mpich 3.2)
>>>> >
>>>> > I have a scenario where I add a few extra processes to an existing
>>>> > intercommunicator.
>>>> >
>>>> > It works as a simple loop:
>>>> > (1) the existing n processes accept on their n-process intracommunicator
>>>> > (2) 1 new process connects
>>>> > (3) the resulting intercommunicator is merged into an (n+1)-process
>>>> > intracommunicator; the intercommunicator and the old n-process
>>>> > intracommunicator are closed
>>>> > (4) repeat 1-3 as needed (see the rough sketch below)
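>>>> >
>>>> > Roughly, in C (a sketch of one iteration only; the port string from
>>>> > MPI_Open_port is assumed to be distributed out of band, the function
>>>> > name is made up, and error handling is omitted):
>>>> >
>>>> > #include <mpi.h>
>>>> >
>>>> > /* Grow the intracommunicator by one process. The existing n processes
>>>> >  * call this with is_new_process = 0; the joining process calls it with
>>>> >  * is_new_process = 1 and ignores the incoming value of *intracomm. */
>>>> > static void grow_by_one(MPI_Comm *intracomm, const char *port,
>>>> >                         int is_new_process)
>>>> > {
>>>> >     MPI_Comm intercomm, merged;
>>>> >
>>>> >     if (is_new_process) {
>>>> >         /* (2) the single new process connects over MPI_COMM_SELF */
>>>> >         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>> >                          &intercomm);
>>>> >     } else {
>>>> >         /* (1) the existing n processes collectively accept on their
>>>> >          *     current intracommunicator */
>>>> >         MPI_Comm_accept(port, MPI_INFO_NULL, 0, *intracomm, &intercomm);
>>>> >     }
>>>> >
>>>> >     /* (3) merge into an (n+1)-process intracommunicator, then free the
>>>> >      *     intercommunicator and the old n-process intracommunicator */
>>>> >     MPI_Intercomm_merge(intercomm, is_new_process ? 1 : 0, &merged);
>>>> >     MPI_Comm_free(&intercomm);
>>>> >     if (!is_new_process)
>>>> >         MPI_Comm_free(intracomm);
>>>> >
>>>> >     *intracomm = merged;   /* (4) the caller repeats with this comm */
>>>> > }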
>>>> >
>>>> > Occasionally, I observe that step 3 spuriously locks up (once I get
>>>> > into the range of 100+ processes). From what I can tell, all processes
>>>> > in step 3 are accounted for and are waiting on the merge, but nothing
>>>> > happens; the collective barrier locks up.
>>>> >
>>>> > I am really having trouble resolving this issue; any ideas are appreciated!
>>>> >
>>>> > Thank you very much.
>>>> > -Dmitriy
>>>> >
>>>> >
>>>
>>>
>>
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

