[mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more - a slurm problem
Howard Pritchard
hppritcha at gmail.com
Mon Jul 28 09:41:04 CDT 2025
Hi Folks,
We are seeing a strange performance issue on our HPE SS11 system when
testing osu_latency inter-node with MPICH.
First, the relevant info:
libfabric 1.22.0
Slurm 24.11.5
Here's my mpichversion output:
MPICH Version: 5.0.0a1
MPICH Release date: unreleased development copy
MPICH ABI: 0:0:0
MPICH Device: ch4:ofi
MPICH configure: --prefix=/XXXX/mpich_again/install --enable-g=no
--enable-error-checking=no --with-device=ch4:ofi --enable-threads=multiple
--with-ch4-shmmods=posix,xpmem --enable-thread-cs=per-vci
--with-libfabric=/opt/cray/libfabric/1.22.0
--with-xpmem=/opt/cray/xpmem/default --with-pmix=/opt/pmix/gcc4x/5.0.8
--enable-fast=O3
MPICH CC: gcc -O3
MPICH CXX: g++ -O3
MPICH F77: gfortran -O3
MPICH FC: gfortran -O3
MPICH features: threadcomm
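In case the wrapped configure line above is hard to read, it corresponds to roughly the following invocation (the /XXXX prefix is redacted; everything else is taken straight from the mpichversion output):

./configure \
  --prefix=/XXXX/mpich_again/install \
  --enable-g=no \
  --enable-error-checking=no \
  --with-device=ch4:ofi \
  --enable-threads=multiple \
  --with-ch4-shmmods=posix,xpmem \
  --enable-thread-cs=per-vci \
  --with-libfabric=/opt/cray/libfabric/1.22.0 \
  --with-xpmem=/opt/cray/xpmem/default \
  --with-pmix=/opt/pmix/gcc4x/5.0.8 \
  --enable-fast=O3
make -j && make install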
And here is the output from the osu_latency run (the slurmstepd/pmix errors below are followed, eventually, by the actual numbers):
slurmstepd: error: mpi/pmix_v4: pmixp_coll_belong_chk: nid001439 [1]:
pmixp_coll.c:280: No process controlled by this slurmstepd is involved in
this collective.
slurmstepd: error: mpi/pmix_v4: _process_server_request: nid001439 [1]:
pmixp_server.c:923: Unable to pmixp_state_coll_get()
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_check: nid001438 [0]:
pmixp_coll_ring.c:614: 0x15005c005dc0: unexpected contrib from nid001439:1,
expected is 0
slurmstepd: error: mpi/pmix_v4: _process_server_request: nid001438 [0]:
pmixp_server.c:937: 0x15005c005dc0: unexpected contrib from nid001439:1,
coll->seq=0, seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001438
[0]: pmixp_coll_ring.c:738: 0x1500580532f0: collective timeout seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: nid001438 [0]:
pmixp_coll.c:286: Dumping collective state
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:756: 0x1500580532f0: COLL_FENCE_RING state seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:758: my peerid: 0:nid001438
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:765: neighbor id: next 1:nid001439, prev 1:nid001439
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:775: Context ptr=0x150058053368, #0, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:775: Context ptr=0x1500580533a0, #1, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:775: Context ptr=0x1500580533d8, #2, in-use=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:786: seq=0 contribs: loc=1/prev=0/fwd=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:788: neighbor contribs [2]:
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:821: done contrib: -
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:823: wait contrib: nid001439
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:825: status=PMIXP_COLL_RING_PROGRESS
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
pmixp_coll_ring.c:829: buf (offset/size): 36/16384
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001439
[1]: pmixp_coll_ring.c:738: 0x151d0c053400: collective timeout seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: nid001439 [1]:
pmixp_coll.c:286: Dumping collective state
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:756: 0x151d0c053400: COLL_FENCE_RING state seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:758: my peerid: 1:nid001439
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:765: neighbor id: next 0:nid001438, prev 0:nid001438
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:775: Context ptr=0x151d0c053478, #0, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:775: Context ptr=0x151d0c0534b0, #1, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:775: Context ptr=0x151d0c0534e8, #2, in-use=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:786: seq=0 contribs: loc=1/prev=0/fwd=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:788: neighbor contribs [2]:
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:821: done contrib: -
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:823: wait contrib: nid001438
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:825: status=PMIXP_COLL_RING_PROGRESS
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
pmixp_coll_ring.c:829: buf (offset/size): 36/16384
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 1.66
1 9.29
2 9.57
4 9.69
8 9.76
16 9.77
32 9.76
64 9.77
128 10.32
256 7.54
512 7.45
1024 7.38
2048 7.37
4096 7.45
8192 9.21
16384 9.70
32768 10.63
65536 13.15
131072 16.96
262144 23.84
524288 36.16
1048576 60.36
2097152 108.43
4194304 228.31
Note the slurm behavior: I launch the job, go get coffee, do some Duolingo, read some emails, and then after about 10 minutes the osu_latency run finally starts.
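(For completeness: the job was launched through slurm's pmix plugin, which is where the mpi/pmix_v4 messages above come from. A typical two-node invocation would look roughly like the following; the flags are illustrative, not a verbatim copy of my command line:)

# two ranks, one per node, via slurm's pmix plugin
srun -N 2 --ntasks-per-node=1 --mpi=pmix ./osu_latency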
I did not get the slurm problems using an older MPICH 4.3.1, but I did see the same performance issue. 9 usecs doesn't seem right for an 8-byte pingpong over libfabric on SS11; I was expecting something more like 1.6 usecs.
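One thing I still need to double-check (so treat this as a guess rather than a finding) is whether the run is actually landing on the CXI provider, i.e. the SS11 fast path, or quietly falling back to a slower libfabric provider. The standard knobs for that check are roughly:

# have MPICH print the netmod/provider it selected at init
MPIR_CVAR_DEBUG_SUMMARY=1 srun -N 2 --ntasks-per-node=1 --mpi=pmix ./osu_latency

# confirm the cxi provider is visible to libfabric, and/or pin it explicitly
fi_info -p cxi
FI_PROVIDER=cxi srun -N 2 --ntasks-per-node=1 --mpi=pmix ./osu_latency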
I am confident the slurm issue is unrelated to the latency issue, but thanks in advance for any suggestions on how to address either one.