[mpich-discuss] segment fault when using knem

Dave Goodell goodell at mcs.anl.gov
Fri Apr 26 08:06:57 CDT 2013

On Apr 25, 2013, at 8:03 PM CDT, M Xie <xmxmxie at gmail.com> wrote:

> I am running some tests with mpich on a dual-processor SMP server.
> I use mpich-3.0.3, with knem to accelerate intra-node communication.
> The channel I use is nemesis:tcp, with nemesis-local-lmt=knem.
> But when I run the bandwidth test from osu_benchmarks, osu_bw segfaults
> at the 2MB message size, which is the default value of MPICH_NEM_LMT_DMA_THRESHOLD.
> It seems the segfault occurs when the DMA channel in knem is used.
> When I set MPICH_NEM_LMT_DMA_THRESHOLD to a smaller value, such as
> 131072, osu_bw segfaults at 131072.
> I also tested the NAS Parallel Benchmarks. I noticed that when the DMA channel in knem is used,
> the NPB tests sometimes freeze after running for a while.
> In the attachment, I list some config and core dump files.
> Has anyone encountered or solved a similar problem?

(I saw this over on the knem list but didn't have time until now to respond)

Our knem support hasn't been touched in a long time.  I think the version we ship in "contrib" is very old (0.5.0), before the full RDMA-style interface existed.  Our side of the code probably hasn't been updated since that timeframe, modulo a patch that Brice contributed.  Also, we don't run it in our nightly tests, so it's possible it has bit-rotted for another reason.

It looks like there's a bad request handle being dereferenced, though just from the stack trace it's hard to tell whether that's a corrupted handle (garbage) or a stale handle value which has been freed.
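
To make the corrupted-vs-stale distinction concrete, here is a minimal sketch in C of one way a handle pool can tell the two apart when debugging. This is NOT how MPICH encodes request handles; the layout (a 16-bit generation counter packed above a 16-bit slot index) and the names handle_alloc, handle_free, and handle_check are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handle layout (not MPICH's): the high 16 bits hold a
 * generation counter, the low 16 bits hold the slot index. */
#define POOL_SIZE 4

typedef enum { HANDLE_OK, HANDLE_STALE, HANDLE_GARBAGE } handle_status;

typedef struct {
    uint16_t generation; /* bumped each time the slot is freed */
    int in_use;          /* nonzero while a live handle refers to the slot */
} slot;

static slot pool[POOL_SIZE];

/* Classify a handle value. Wraparound of the 16-bit generation counter
 * is ignored to keep the sketch short. */
static handle_status handle_check(uint32_t h)
{
    uint32_t idx = h & 0xFFFFu;
    uint16_t gen = (uint16_t)(h >> 16);

    if (idx >= POOL_SIZE)
        return HANDLE_GARBAGE;            /* index was never valid */
    if (gen == pool[idx].generation)
        return pool[idx].in_use ? HANDLE_OK : HANDLE_GARBAGE;
    if (gen < pool[idx].generation)
        return HANDLE_STALE;              /* was valid once, freed since */
    return HANDLE_GARBAGE;                /* generation never issued */
}

/* Hand out the first free slot, tagged with its current generation. */
static uint32_t handle_alloc(void)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return ((uint32_t)pool[i].generation << 16) | i;
        }
    }
    return UINT32_MAX; /* pool exhausted; classified as garbage by handle_check */
}

/* Release a slot. Bumping the generation makes every outstanding copy
 * of the old handle detectably stale. */
static void handle_free(uint32_t h)
{
    assert(handle_check(h) == HANDLE_OK); /* catches double free and bad handles */
    uint32_t idx = h & 0xFFFFu;
    pool[idx].in_use = 0;
    pool[idx].generation++;
}
```

With a scheme like this, a handle that was freed and then used again reports HANDLE_STALE, while a value whose index or generation could never have been issued reports HANDLE_GARBAGE. A bare stack trace gives no such hint, which is why the two cases are hard to tell apart here.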

We could be hitting this longstanding LMT bug: https://trac.mpich.org/projects/mpich/ticket/1039

Let me try to set up an environment where I can reproduce this and see if there's an easy/obvious fix.

