[mpich-discuss] spurious lock ups on collective merge intercom

Oden, Lena loden at anl.gov
Wed Jan 11 01:53:34 CST 2017


Hello Dmitriy,

Could you create a simple example program that reproduces this failure?
It is often easier to identify a problem by looking at a code example.

Thanks,
Lena 
> On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:
> 
> Hello,
> 
> (mpich 3.2)
> 
> I have a scenario where I add a few extra processes to an existing intercom.
> 
> It works as a simple loop -- 
> (1) n processes accept on n-intercom
> (2) 1 process connects 
> (3) intracom is merged into n+1 intercom, intracom and n-intercom are closed
> (4) repeat 1-3 as needed.
> 
> Occasionally, I observe that step 3 spuriously locks up (once I get into the range of 100+ processes). From what I can tell, all processes in step 3 are accounted for and are waiting on the merge, but nothing happens. The collective barrier locks up.
> 
> I really have trouble resolving this issue, any ideas are appreciated!
> 
> Thank you very much.
> -Dmitriy
> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
