[mpich-discuss] spurious lock ups on collective merge intercom

Dmitriy Lyubimov dlieu.7 at gmail.com
Wed Jan 11 11:38:59 CST 2017


Maybe it has something to do with the fact that the calls go through JVM
JNI, and that somehow interferes with MPI's threading model, although it is
a single-threaded JVM process, and JVM bindings for MPI are known to have
been done before (e.g., Open MPI had an effort in that direction).
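
One thing worth ruling out is which threading level MPI actually came up
with inside the JVM process. A minimal C sketch of the check (the same two
calls should be reachable through any binding):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* request the single-threaded model the JVM process implies */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SINGLE, &provided);

        /* report what the library actually granted */
        MPI_Query_thread(&provided);
        printf("provided thread level: %d (MPI_THREAD_SINGLE = %d)\n",
               provided, MPI_THREAD_SINGLE);

        MPI_Finalize();
        return 0;
    }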

The strange thing is that I never saw a lock-up with fewer than 120
processes, but something changes after that: the spurious condition becomes
much more common. By the time I am at 150 processes in the intercomm, I am
almost certain to hit a merge lock-up.
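
For concreteness (and as a possible skeleton for the simple example program
Lena asked about below), the loop from my original message, steps (1)-(4)
in the quote, looks roughly like this in plain C. This is only a sketch:
the port exchange (MPI_Open_port on the accepting side plus handing the
name to the newcomer) and error handling are elided, and grow() and
is_newcomer are illustrative names:

    #include <mpi.h>

    /* grow the group by one: all current members accept, the single
       newcomer connects, both sides merge, old communicators are closed */
    MPI_Comm grow(MPI_Comm cur, const char *port, int is_newcomer)
    {
        MPI_Comm inter, merged;

        if (is_newcomer) {
            /* (2) the one new process connects */
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        } else {
            /* (1) all n current processes accept collectively */
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, cur, &inter);
        }

        /* (3) merge the intercommunicator into an (n+1)-process
           intracommunicator; the newcomer is ordered last */
        MPI_Intercomm_merge(inter, is_newcomer ? 1 : 0, &merged);

        /* close the old communicators */
        MPI_Comm_disconnect(&inter);
        if (!is_newcomer && cur != MPI_COMM_WORLD)
            MPI_Comm_free(&cur);

        return merged; /* (4) the caller loops with the grown communicator */
    }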


On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:

> Thanks.
> It would not be easy for me to do immediately, as I am using a proprietary
> Scala binding API for MPI.
>
> It would help me to know whether a problem like this has been seen before,
> or whether the mergeIntercomm API is generally known to work with hundreds
> of processes. It sounds like there are no known issues with that.
>
>
>
> On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden at anl.gov> wrote:
>
>> Hello Dmitriy,
>>
>> Can you maybe create a simple example program to reproduce this failure?
>> It is often easier to look at a code example to identify a problem.
>>
>> Thanks,
>> Lena
>> > On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com>
>> wrote:
>> >
>> > Hello,
>> >
>> > (mpich 3.2)
>> >
>> > I have a scenario where I add a few extra processes to an existing
>> > intercomm.
>> >
>> > It works as a simple loop:
>> > (1) the n existing processes accept on the n-process communicator
>> > (2) 1 new process connects
>> > (3) the resulting intercommunicator is merged into an (n+1)-process
>> > communicator, and the old communicators are closed
>> > (4) repeat 1-3 as needed.
>> >
>> > Occasionally, I observe that step 3 spuriously locks up (once I get into
>> > the range of 100+ processes). From what I can tell, all processes in step
>> > 3 are accounted for and are waiting on the merge, but nothing happens;
>> > the collective barrier locks up.
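>> >
>> > (For what it is worth, this is how I tell: I wrap the merge in logging,
>> > roughly as below. "peer" stands for the intercommunicator obtained from
>> > accept/connect, "high" is set as in step 3, and logged_merge is just an
>> > illustrative name. Every rank prints the first line and no rank ever
>> > prints the second.)
>> >
>> > #include <mpi.h>
>> > #include <stdio.h>
>> >
>> > MPI_Comm logged_merge(MPI_Comm peer, int high)
>> > {
>> >     int rank;
>> >     MPI_Comm merged;
>> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >     fprintf(stderr, "[%d] entering merge\n", rank);
>> >     MPI_Intercomm_merge(peer, high, &merged);
>> >     /* on a hang this line is never reached */
>> >     fprintf(stderr, "[%d] merge done\n", rank);
>> >     return merged;
>> > }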
>> >
>> > I am really having trouble resolving this issue; any ideas are
>> > appreciated!
>> >
>> > Thank you very much.
>> > -Dmitriy