[mpich-discuss] spurious lock ups on collective merge intercom

Dmitriy Lyubimov dlieu.7 at gmail.com
Tue Feb 7 16:29:02 CST 2017


Also, as I said, with 3.3a2 it was much (much!) harder to reproduce, but
it still happens from time to time. The stack analysis for 3.3a2 is
slightly different. In 3.2 we have one post-connect stack and (n-1)
post-accept stacks, all sitting and waiting in intercomm_merge().

With 3.3a2 it is slightly different: we observed two post-connects waiting
on merge, which this scheme should not allow at all. When we go through
the merge, there should always be one post-connect merge call and (n-1)
post-accept merge calls. So either one of the post-connects is still there
from the previous loop (which should not be possible, as it would prevent
accepting on the current loop), or two clients were somehow accepted at
the same time (which should not be the case either, as three-way
intercomms are not possible). But the analysis is very thorough: we get
full backtraces of all processes, and their internal state (such as the
current intracomm size before the accept) is also known, so we rule out
things like process failures. I am pretty confident of that. We have done
this so many times that it is hard to believe we still haven't accounted
for naive things like dead processes.
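
For clarity, this is roughly the join loop I mean (a minimal sketch, not
our actual code; the helper names are illustrative, and the port name is
assumed to reach the client out of band, e.g. via a file or
MPI_Publish_name):

/* Minimal sketch of the connect/accept/merge join scheme described
 * above. Assumes exactly one client joins per iteration. */
#include <mpi.h>

/* Called collectively by all n current members to admit one client;
 * these are the (n-1) post-accept stacks (plus the root) waiting in
 * the merge. */
static MPI_Comm accept_one(MPI_Comm intracomm, const char *port)
{
    MPI_Comm inter, merged;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, intracomm, &inter);
    /* high = 0: existing members come first in the merged ranks */
    MPI_Intercomm_merge(inter, 0, &merged);
    MPI_Comm_disconnect(&inter);
    return merged;
}

/* Called by the single joining process; this is the one post-connect
 * stack waiting in the merge. */
static MPI_Comm join_existing(const char *port)
{
    MPI_Comm inter, merged;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Intercomm_merge(inter, 1, &merged);
    MPI_Comm_disconnect(&inter);
    return merged;
}

Under this scheme there can never be two post-connect merges in flight at
once: the next client cannot connect until the previous merge completes
and the grown intracomm collectively accepts again. That is why the
3.3a2 backtraces look impossible.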

On Tue, Feb 7, 2017 at 2:10 PM, Dmitriy Lyubimov <dlieu.7 at gmail.com> wrote:

>
>> Yes, like I said, I am able to achieve the lock-up state spuriously on the
>> 192-core cluster only if I spin up almost all cores (per process)
>>
>
> should read "1 core per process"
>