[mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more - a slurm problem
Howard Pritchard
hppritcha at gmail.com
Thu Aug 21 16:39:48 CDT 2025
Here you go, Hui!
MPICH debug output, plus the slurm step output to boot. Again, no such
slurm errors occur with the 4.3.1 release.
Something must have changed in the way MPICH uses the PMIx group
constructor ops, or something along those lines.
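For reference, the dump below came from a run along these lines. This is a sketch: the env var is the one Hui asked for, and the srun invocation and benchmark path are assumed from the 4.3.x run quoted later in this thread.

```shell
# Reproduce the MPICH debug dump below (sketch; adjust the benchmark
# path for your own build tree).
export MPIR_CVAR_DEBUG_SUMMARY=1
srun --mpi=pmix -n 2 ./osu_latency
```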
Required minimum FI_VERSION: 0, current version: 10016
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]
Required minimum FI_VERSION: 10005, current version: 10016
==== Capability set configuration ====
libfabric provider: cxi - cxi
MPIDI_OFI_ENABLE_DATA: 1
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 1
MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 0
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 1
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_TRIGGERED: 0
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0
MPIDI_OFI_CONTEXT_BITS: 20
MPIDI_OFI_SOURCE_BITS: 0
MPIDI_OFI_TAG_BITS: 20
MPIDI_OFI_VNI_USE_DOMAIN: 1
MAXIMUM SUPPORTED RANKS: 4294967296
MAXIMUM TAG: 1048576
==== Provider global thresholds ====
max_buffered_send: 192
max_buffered_write: 192
max_msg_size: 4294967295
max_order_raw: -1
max_order_war: -1
max_order_waw: -1
tx_iov_limit: 1
rx_iov_limit: 1
rma_iov_limit: 1
max_mr_key_size: 4
==== Various sizes and limits ====
MPIDI_OFI_AM_MSG_HEADER_SIZE: 24
MPIDI_OFI_MAX_AM_HDR_SIZE: 255
sizeof(MPIDI_OFI_am_request_header_t): 416
sizeof(MPIDI_OFI_per_vci_t): 52480
MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024
MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384
======================================
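Independently of MPICH, libfabric's fi_info utility can confirm that the cxi provider is visible on a node. This is a generic sanity check on my part, not something run in this thread:

```shell
# List libfabric providers, restricted to cxi; a non-empty listing means
# the provider MPICH selected in the capability dump above is available
# on this node.
fi_info -p cxi
```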
slurmstepd: error: mpi/pmix_v4: pmixp_coll_belong_chk: nid001406 [1]:
pmixp_coll.c:280: No process controlled by this slurmstepd is involved in
this collective.
slurmstepd: error: mpi/pmix_v4: _process_server_request: nid001406 [1]:
pmixp_server.c:923: Unable to pmixp_state_coll_get()
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_check: nid001405 [0]:
pmixp_coll_ring.c:614: 0x14b448006e10: unexpected contrib from nid001406:1,
expected is 0
slurmstepd: error: mpi/pmix_v4: _process_server_request: nid001405 [0]:
pmixp_server.c:937: 0x14b448006e10: unexpected contrib from nid001406:1,
coll->seq=0, seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001405
[0]: pmixp_coll_ring.c:738: 0x14b454052fc0: collective timeout seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: nid001405 [0]:
pmixp_coll.c:286: Dumping collective state
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:756: 0x14b454052fc0: COLL_FENCE_RING state seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:758: my peerid: 0:nid001405
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:765: neighbor id: next 1:nid001406, prev 1:nid001406
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:775: Context ptr=0x14b454053038, #0, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:775: Context ptr=0x14b454053070, #1, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:775: Context ptr=0x14b4540530a8, #2, in-use=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:786: seq=0 contribs: loc=1/prev=0/fwd=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:788: neighbor contribs [2]:
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:821: done contrib: -
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:823: wait contrib: nid001406
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:825: status=PMIXP_COLL_RING_PROGRESS
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]:
pmixp_coll_ring.c:829: buf (offset/size): 36/16384
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001406
[1]: pmixp_coll_ring.c:738: 0x14aa28053100: collective timeout seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: nid001406 [1]:
pmixp_coll.c:286: Dumping collective state
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:756: 0x14aa28053100: COLL_FENCE_RING state seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:758: my peerid: 1:nid001406
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:765: neighbor id: next 0:nid001405, prev 0:nid001405
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:775: Context ptr=0x14aa28053178, #0, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:775: Context ptr=0x14aa280531b0, #1, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:775: Context ptr=0x14aa280531e8, #2, in-use=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:786: seq=0 contribs: loc=1/prev=0/fwd=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:788: neighbor contribs [2]:
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:821: done contrib: -
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:823: wait contrib: nid001405
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:825: status=PMIXP_COLL_RING_PROGRESS
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]:
pmixp_coll_ring.c:829: buf (offset/size): 36/16384
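When debugging pmix_v4 collective errors like the ones above, one first sanity check (my suggestion, not something done in this thread) is confirming which MPI plugins the installed slurm actually offers before digging into the collective state dumps:

```shell
# List the MPI plugin types this slurm installation supports;
# pmix_v4 should appear here if --mpi=pmix is expected to work.
srun --mpi=list
```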
==== Various sizes and limits ====
sizeof(MPIDI_per_vci_t): 128
==== collective selection ====
MPIR_CVAR_DEVICE_COLLECTIVES: percoll
MPIR: MPII_coll_generic_json
MPID: MPIDI_coll_generic_json
MPID/shm: MPIDI_POSIX_coll_generic_json
==== OFI dynamic settings ====
num_vcis: 1
num_nics: 1
======================================
error checking : disabled
QMPI : disabled
debugger support : disabled
thread level : MPI_THREAD_SINGLE
thread CS : per-vci
threadcomm : enabled
==== data structure summary ====
sizeof(MPIR_Comm): 1832
sizeof(MPIR_Request): 520
sizeof(MPIR_Datatype): 280
================================
# OSU MPI Latency Test v5.8
# Size Latency (us)
0 2.04
1 10.08
2 10.10
4 10.11
8 10.12
16 10.12
32 10.13
64 10.12
128 10.67
256 8.10
512 8.18
1024 8.11
2048 7.86
4096 7.80
8192 10.25
16384 11.04
32768 12.04
65536 14.05
131072 17.89
262144 24.61
524288 37.51
1048576 61.48
2097152 110.06
4194304 228.67
On Wed, Aug 13, 2025 at 1:10 PM Zhou, Hui <zhouh at anl.gov> wrote:
> Hi Howard,
>
> Could you run with `MPIR_CVAR_DEBUG_SUMMARY=1`? It should print some debug
> messages. I want to confirm it is running the `cxi` provider.
>
>
> Hui
> ------------------------------
> *From:* Howard Pritchard <hppritcha at gmail.com>
> *Sent:* Wednesday, July 30, 2025 4:37 PM
> *To:* Thakur, Rajeev <thakur at anl.gov>
> *Cc:* discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>
> *Subject:* Re: [mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus
> more - a slurm problem
>
> Hi Rajeev,
>
> Here are the results for 4.3.x branch:
>
> hpp at nid001293:/usr/projects/artab/users/hpp/osu-micro-benchmarks-5.8-mpich/mpi/pt2pt>srun
> --mpi=pmix -n 2 ./osu_latency
>
> # OSU MPI Latency Test v5.8
>
> # Size Latency (us)
>
> 0 1.92
>
> 1 1.98
>
> 2 1.98
>
> 4 1.98
>
> 8 1.98
>
> 16 1.98
>
> 32 1.99
>
> 64 1.99
>
> 128 2.47
>
> 256 2.59
>
> 512 2.65
>
> 1024 2.76
>
> 2048 2.95
>
> 4096 3.00
>
> 8192 5.96
>
> 16384 6.64
>
> 32768 7.44
>
> 65536 8.75
>
> 131072 11.52
>
> 262144 17.08
>
> 524288 27.96
>
> 1048576 49.38
>
> 2097152 92.96
>
> 4194304 179.74
>
> These are more like what I would expect for the SS11/OFI CXI provider.
>
> Howard
>
> On Wed, Jul 30, 2025 at 12:48 PM Thakur, Rajeev <thakur at anl.gov> wrote:
>
> Hi Howard,
>
> What was the latency with the 4.3.x branch?
>
>
>
> Rajeev
>
>
>
>
>
> *From: *Howard Pritchard via discuss <discuss at mpich.org>
> *Reply-To: *"discuss at mpich.org" <discuss at mpich.org>
> *Date: *Wednesday, July 30, 2025 at 1:43 PM
> *To: *"Zhou, Hui" <zhouh at anl.gov>
> *Cc: *Howard Pritchard <hppritcha at gmail.com>, "discuss at mpich.org" <
> discuss at mpich.org>
> *Subject: *Re: [mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus
> more - a slurm problem
>
>
>
>
> Hi Hui
>
>
>
> That didn’t help. I am not surprised though as our cluster is an NVIDIA
> free zone. What did help is to switch to the mpich 4.3.x branch and
> latency results are nominal and the slurm problem went away too. So we
> will stick with that branch.
>
>
>
> Howard
>
>
>
> On Mon, Jul 28, 2025 at 4:15 PM Zhou, Hui <zhouh at anl.gov> wrote:
>
> Hi Howard,
>
>
>
> I wonder whether it is due to the overhead of querying pointer
> attributes. Could you try disable GPU support with `MPIR_CVAR_ENABLE_GPU=0`
> and see if the latency improves?
>
>
>
> Hui
> ------------------------------
>
> *From:* Howard Pritchard via discuss <discuss at mpich.org>
> *Sent:* Monday, July 28, 2025 9:41 AM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Howard Pritchard <hppritcha at gmail.com>
> *Subject:* [mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more
> - a slurm problem
>
>
>
>
> Hi Folks,
>
>
>
> We are seeing a strange performance issue on our HPE SS11 system when
> testing osu_latency inter-node with MPICH.
>
>
>
> First the info:
>
> system using libfabric 1.22.0
>
> slurm - 24.11.5
>
>
>
> Here's my mpichversion output:
>
>
>
> MPICH Version: 5.0.0a1
>
> MPICH Release date: unreleased development copy
>
> MPICH ABI: 0:0:0
>
> MPICH Device: ch4:ofi
>
> MPICH configure: --prefix=/XXXX/mpich_again/install --enable-g=no
> --enable-error-checking=no --with-device=ch4:ofi --enable-threads=multiple
> --with-ch4-shmmods=posix,xpmem --enable-thread-cs=per-vci
> --with-libfabric=/opt/cray/libfabric/1.22.0
> --with-xpmem=/opt/cray/xpmem/default --with-pmix=/opt/pmix/gcc4x/5.0.8
> --enable-fast=O3
>
> MPICH CC: gcc -O3
>
> MPICH CXX: g++ -O3
>
> MPICH F77: gfortran -O3
>
> MPICH FC: gfortran -O3
>
> MPICH features: threadcomm
>
>
>
> And here's the OSU latency results:
>
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_belong_chk: nid001439 [1]:
> pmixp_coll.c:280: No process controlled by this slurmstepd is involved in
> this collective.
>
> slurmstepd: error: mpi/pmix_v4: _process_server_request: nid001439 [1]:
> pmixp_server.c:923: Unable to pmixp_state_coll_get()
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_check: nid001438 [0]:
> pmixp_coll_ring.c:614: 0x15005c005dc0: unexpected contrib from nid001439:1,
> expected is 0
>
> slurmstepd: error: mpi/pmix_v4: _process_server_request: nid001438 [0]:
> pmixp_server.c:937: 0x15005c005dc0: unexpected contrib from nid001439:1,
> coll->seq=0, seq=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001438
> [0]: pmixp_coll_ring.c:738: 0x1500580532f0: collective timeout seq=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: nid001438 [0]:
> pmixp_coll.c:286: Dumping collective state
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:756: 0x1500580532f0: COLL_FENCE_RING state seq=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:758: my peerid: 0:nid001438
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:765: neighbor id: next 1:nid001439, prev 1:nid001439
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:775: Context ptr=0x150058053368, #0, in-use=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:775: Context ptr=0x1500580533a0, #1, in-use=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:775: Context ptr=0x1500580533d8, #2, in-use=1
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:786: seq=0 contribs: loc=1/prev=0/fwd=1
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:788: neighbor contribs [2]:
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:821: done contrib: -
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:823: wait contrib: nid001439
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:825: status=PMIXP_COLL_RING_PROGRESS
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]:
> pmixp_coll_ring.c:829: buf (offset/size): 36/16384
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001439
> [1]: pmixp_coll_ring.c:738: 0x151d0c053400: collective timeout seq=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: nid001439 [1]:
> pmixp_coll.c:286: Dumping collective state
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:756: 0x151d0c053400: COLL_FENCE_RING state seq=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:758: my peerid: 1:nid001439
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:765: neighbor id: next 0:nid001438, prev 0:nid001438
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:775: Context ptr=0x151d0c053478, #0, in-use=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:775: Context ptr=0x151d0c0534b0, #1, in-use=0
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:775: Context ptr=0x151d0c0534e8, #2, in-use=1
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:786: seq=0 contribs: loc=1/prev=0/fwd=1
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:788: neighbor contribs [2]:
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:821: done contrib: -
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:823: wait contrib: nid001438
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:825: status=PMIXP_COLL_RING_PROGRESS
>
> slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]:
> pmixp_coll_ring.c:829: buf (offset/size): 36/16384
>
> # OSU MPI Latency Test v5.8
>
> # Size Latency (us)
>
> 0 1.66
>
> 1 9.29
>
> 2 9.57
>
> 4 9.69
>
> 8 9.76
>
> 16 9.77
>
> 32 9.76
>
> 64 9.77
>
> 128 10.32
>
> 256 7.54
>
> 512 7.45
>
> 1024 7.38
>
> 2048 7.37
>
> 4096 7.45
>
> 8192 9.21
>
> 16384 9.70
>
> 32768 10.63
>
> 65536 13.15
>
> 131072 16.96
>
> 262144 23.84
>
> 524288 36.16
>
> 1048576 60.36
>
> 2097152 108.43
>
> 4194304 228.31
>
>
>
> Note the slurm behavior: I launch the job, go get coffee, do some
> Duolingo, read some emails, and then after about 10 minutes the osu_latency
> test finally runs.
>
>
>
> I did not get the slurm problems using the older MPICH 4.3.1, but I did
> get the same performance issue. 9 usecs doesn't seem right for an 8-byte
> pingpong over libfabric/SS11. I was expecting more like 1.6 or so.
>
>
>
> I am confident the slurm issue is unrelated to the latency issue.
>
> Thanks for any suggestions on how to address either issue, though.
>
>
>
>
>
>