[mpich-discuss] spurious lock ups on collective merge intercom
Oden, Lena
loden at anl.gov
Wed Jan 11 01:53:34 CST 2017
Hello Dmitriy,
Can you create a simple example program that reproduces this failure?
It is often easier to identify a problem by looking at a code example.
Thanks,
Lena
> On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
>
> Hello,
>
> (mpich 3.2)
>
> I have a scenario where I add a few extra processes to an existing communicator.
>
> It works as a simple loop:
> (1) n processes accept on the n-process intracomm
> (2) 1 process connects
> (3) the resulting intercomm is merged into an (n+1)-process intracomm; the intercomm and the old n-process intracomm are closed
> (4) repeat 1-3 as needed.
>
> Occasionally, I observe that step 3 spuriously locks up (once I get into the range of 100+ processes). From what I can tell, all processes in step 3 are accounted for and are waiting on the merge, but nothing happens. The collective barrier locks up.
>
> I am really having trouble resolving this issue; any ideas are appreciated!
>
> Thank you very much.
> -Dmitriy
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss