[mpich-discuss] spurious lock-ups on collective intercomm merge

Dmitriy Lyubimov dlieu.7 at gmail.com
Thu Jan 12 13:55:28 CST 2017


Further debugging shows that it's not actually mergeIntercom itself that
locks up, but a send/recv pair that two nodes decide to execute before
MPI_Intercomm_merge.

So the overall snapshot of the situation is that everyone is waiting in
mergeIntercom except for two processes, which are waiting in send and recv
respectively, while the majority of the others have already entered the
collective barrier.

It would seem that this sort of asymmetric logic should be acceptable,
since the send/recv pair is matched before the merge occurs, but in
practice it locks up -- increasingly often as the number of participating
processes grows. It is almost as if, once a collective barrier of a certain
cardinality has formed, point-to-point messages no longer get through.
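
For concreteness, the pattern at that point looks roughly like the sketch
below (plain MPI C rather than the actual Scala binding code; the choice of
communicator carrying the send/recv, the peer ranks, the tag, and the
payload are all placeholders -- note that on an intercommunicator the
destination rank refers to the remote group):

#include <mpi.h>

/* Two processes exchange one message right before the merge; everyone
 * else goes straight into MPI_Intercomm_merge. All ranks of both groups
 * must eventually reach the collective merge call. */
static void merge_with_side_message(MPI_Comm intercomm, int my_rank,
                                    int sender, int receiver, int high,
                                    MPI_Comm *merged)
{
    int payload = 0;
    if (my_rank == sender) {
        MPI_Send(&payload, 1, MPI_INT, receiver, 0, intercomm);
    } else if (my_rank == receiver) {
        MPI_Recv(&payload, 1, MPI_INT, sender, 0, intercomm,
                 MPI_STATUS_IGNORE);
    }
    MPI_Intercomm_merge(intercomm, high, merged);
}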

If this scenario begets any ideas, please let me know.

thank you!
-Dmitriy



On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:

> Maybe it has something to do with the fact that the calls go through JVM
> JNI and that somehow interferes with MPI's threading model, although it is
> a single-threaded JVM process, and JVM-to-MPI bindings have been done
> before (e.g., Open MPI has had an effort in that direction).
>
> The strange thing is that I never had a lock-up with fewer than 120
> processes, but something changes after that: the spurious condition
> becomes much more common. By the time I am at 150 processes in the
> intercomm, I am almost certain to hit a merge lock-up.
>
>
> On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com>
> wrote:
>
>> Thanks.
>> It would not be easy for me to do immediately, as I am using a
>> proprietary Scala binding API for MPI.
>>
>> It would help me to know whether a problem like this has been seen in
>> the past, or whether the mergeIntercomm API is generally known to work
>> with hundreds of processes. It sounds like there are no known issues
>> with that.
>>
>>
>>
>> On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>>
>>> Hello Dmitriy,
>>>
>>> Can you maybe create a simple example program that reproduces this
>>> failure? It is often easier to identify a problem by looking at a code
>>> example.
>>>
>>> Thanks,
>>> Lena
>>> > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com>
>>> wrote:
>>> >
>>> > Hello,
>>> >
>>> > (MPICH 3.2)
>>> >
>>> > I have a scenario in which I add a few extra processes to an existing
>>> > intercomm.
>>> >
>>> > It works as a simple loop:
>>> > (1) the n existing processes collectively accept on their current
>>> > communicator
>>> > (2) 1 new process connects
>>> > (3) the resulting intercomm is merged into an (n+1)-process intracomm,
>>> > and the intercomm and the old n-process communicator are closed
>>> > (sketched below)
>>> > (4) steps 1-3 are repeated as needed.
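>>> >
>>> > In plain MPI C terms, one iteration of that loop looks roughly like the
>>> > following sketch (illustrative only -- the real code goes through a
>>> > Scala binding, and port_name, current_comm, and is_existing_member are
>>> > assumed to be set up elsewhere):
>>> >
>>> > /* One iteration of the grow-by-one loop. */
>>> > MPI_Comm intercomm, grown;
>>> > if (is_existing_member) {
>>> >     /* steps (1) and (3): the n current processes accept collectively,
>>> >      * then take part in the merge */
>>> >     MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, current_comm,
>>> >                     &intercomm);
>>> >     MPI_Intercomm_merge(intercomm, /* high = */ 0, &grown);
>>> > } else {
>>> >     /* step (2): the single new process connects on MPI_COMM_SELF and
>>> >      * joins the same merge */
>>> >     MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>> >                      &intercomm);
>>> >     MPI_Intercomm_merge(intercomm, /* high = */ 1, &grown);
>>> > }
>>> > /* release the old communicators; 'grown' becomes current_comm for the
>>> >  * next iteration */
>>> > MPI_Comm_disconnect(&intercomm);
>>> > if (is_existing_member && current_comm != MPI_COMM_WORLD)
>>> >     MPI_Comm_free(&current_comm);
>>> > current_comm = grown;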
>>> >
>>> > Occasionally, I observe that step 3 spuriously locks up (once I get
>>> > into the range of 100+ processes). From what I can tell, all processes
>>> > in step 3 are accounted for and are waiting on the merge, but nothing
>>> > happens; the collective barrier locks up.
>>> >
>>> > I'm really having trouble resolving this issue; any ideas are appreciated!
>>> >
>>> > Thank you very much.
>>> > -Dmitriy
>>> >
>>> >
>>>
>>
>>
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


More information about the discuss mailing list