[mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more - a slurm problem

Raffenetti, Ken raffenet at anl.gov
Fri Aug 22 11:03:00 CDT 2025


Hi Howard,

The PMIx stuff is likely related to the new sessions implementation coming in 5.0.x. I’ll look for a Slurm cluster to try and figure out what’s going on with that.

What commit hash are you working with that shows the poor latency? I just built from the HEAD of main and don’t see the behavior on Aurora.

Ken

From: Howard Pritchard via discuss <discuss at mpich.org>
Date: Thursday, August 21, 2025 at 4:40 PM
To: Zhou, Hui <zhouh at anl.gov>
Cc: Howard Pritchard <hppritcha at gmail.com>, discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more - a slurm problem

This Message Is From an External Sender
This message came from outside your organization.

Here you go Hui!

MPICH debug output and slurm step output to boot.  Again, no such slurmstepd errors with the 4.3.1 release.
Something must have changed in the way MPICH uses the PMIx group constructor operations, or something like that.


Required minimum FI_VERSION: 0, current version: 10016

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

provider: cxi, score = 5, pref = 0, FI_FORMAT_UNSPEC [8]

Required minimum FI_VERSION: 10005, current version: 10016

==== Capability set configuration ====

libfabric provider: cxi - cxi

MPIDI_OFI_ENABLE_DATA: 1

MPIDI_OFI_ENABLE_AV_TABLE: 1

MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 0

MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0

MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0

MPIDI_OFI_ENABLE_MR_ALLOCATED: 1

MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 0

MPIDI_OFI_ENABLE_MR_PROV_KEY: 0

MPIDI_OFI_ENABLE_TAGGED: 1

MPIDI_OFI_ENABLE_AM: 1

MPIDI_OFI_ENABLE_RMA: 1

MPIDI_OFI_ENABLE_ATOMICS: 1

MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1

MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0

MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0

MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1

MPIDI_OFI_ENABLE_TRIGGERED: 0

MPIDI_OFI_ENABLE_HMEM: 0

MPIDI_OFI_NUM_AM_BUFFERS: 8

MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0

MPIDI_OFI_CONTEXT_BITS: 20

MPIDI_OFI_SOURCE_BITS: 0

MPIDI_OFI_TAG_BITS: 20

MPIDI_OFI_VNI_USE_DOMAIN: 1

MAXIMUM SUPPORTED RANKS: 4294967296

MAXIMUM TAG: 1048576

==== Provider global thresholds ====

max_buffered_send: 192

max_buffered_write: 192

max_msg_size: 4294967295

max_order_raw: -1

max_order_war: -1

max_order_waw: -1

tx_iov_limit: 1

rx_iov_limit: 1

rma_iov_limit: 1

max_mr_key_size: 4

==== Various sizes and limits ====

MPIDI_OFI_AM_MSG_HEADER_SIZE: 24

MPIDI_OFI_MAX_AM_HDR_SIZE: 255

sizeof(MPIDI_OFI_am_request_header_t): 416

sizeof(MPIDI_OFI_per_vci_t): 52480

MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024

MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384

======================================

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_belong_chk: nid001406 [1]: pmixp_coll.c:280: No process controlled by this slurmstepd is involved in this collective.

slurmstepd: error:  mpi/pmix_v4: _process_server_request: nid001406 [1]: pmixp_server.c:923: Unable to pmixp_state_coll_get()

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_check: nid001405 [0]: pmixp_coll_ring.c:614: 0x14b448006e10: unexpected contrib from nid001406:1, expected is 0

slurmstepd: error:  mpi/pmix_v4: _process_server_request: nid001405 [0]: pmixp_server.c:937: 0x14b448006e10: unexpected contrib from nid001406:1, coll->seq=0, seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001405 [0]: pmixp_coll_ring.c:738: 0x14b454052fc0: collective timeout seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_log: nid001405 [0]: pmixp_coll.c:286: Dumping collective state

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:756: 0x14b454052fc0: COLL_FENCE_RING state seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:758: my peerid: 0:nid001405

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:765: neighbor id: next 1:nid001406, prev 1:nid001406

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:775: Context ptr=0x14b454053038, #0, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:775: Context ptr=0x14b454053070, #1, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:775: Context ptr=0x14b4540530a8, #2, in-use=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:786:  seq=0 contribs: loc=1/prev=0/fwd=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:788:  neighbor contribs [2]:

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:821:  done contrib: -

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:823:  wait contrib: nid001406

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:825:  status=PMIXP_COLL_RING_PROGRESS

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001405 [0]: pmixp_coll_ring.c:829:  buf (offset/size): 36/16384

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001406 [1]: pmixp_coll_ring.c:738: 0x14aa28053100: collective timeout seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_log: nid001406 [1]: pmixp_coll.c:286: Dumping collective state

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:756: 0x14aa28053100: COLL_FENCE_RING state seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:758: my peerid: 1:nid001406

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:765: neighbor id: next 0:nid001405, prev 0:nid001405

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:775: Context ptr=0x14aa28053178, #0, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:775: Context ptr=0x14aa280531b0, #1, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:775: Context ptr=0x14aa280531e8, #2, in-use=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:786:  seq=0 contribs: loc=1/prev=0/fwd=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:788:  neighbor contribs [2]:

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:821:  done contrib: -

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:823:  wait contrib: nid001405

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:825:  status=PMIXP_COLL_RING_PROGRESS

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001406 [1]: pmixp_coll_ring.c:829:  buf (offset/size): 36/16384

==== Various sizes and limits ====

sizeof(MPIDI_per_vci_t): 128

==== collective selection ====

MPIR_CVAR_DEVICE_COLLECTIVES: percoll

MPIR: MPII_coll_generic_json

MPID: MPIDI_coll_generic_json

MPID/shm: MPIDI_POSIX_coll_generic_json

==== OFI dynamic settings ====

num_vcis: 1

num_nics: 1

======================================

error checking    : disabled

QMPI              : disabled

debugger support  : disabled

thread level      : MPI_THREAD_SINGLE

thread CS         : per-vci

threadcomm        : enabled

==== data structure summary ====

sizeof(MPIR_Comm): 1832

sizeof(MPIR_Request): 520

sizeof(MPIR_Datatype): 280

================================

# OSU MPI Latency Test v5.8

# Size          Latency (us)

0                       2.04

1                      10.08

2                      10.10

4                      10.11

8                      10.12

16                     10.12

32                     10.13

64                     10.12

128                    10.67

256                     8.10

512                     8.18

1024                    8.11

2048                    7.86

4096                    7.80

8192                   10.25

16384                  11.04

32768                  12.04

65536                  14.05

131072                 17.89

262144                 24.61

524288                 37.51

1048576                61.48

2097152               110.06

4194304               228.67


On Wed, Aug 13, 2025 at 1:10 PM Zhou, Hui <zhouh at anl.gov> wrote:
Hi Howard,

Could you run with `MPIR_CVAR_DEBUG_SUMMARY=1`? It should print some debug messages. I want to confirm it is running the `cxi` provider.
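For reference, one way to set that CVAR for a run like the ones elsewhere in this thread might be (a sketch; the `srun` flags and the `osu_latency` binary are assumed from the earlier messages):

```shell
# Set the MPICH debug-summary CVAR for a 2-rank inter-node run.
# The srun invocation mirrors the one shown later in this thread.
MPIR_CVAR_DEBUG_SUMMARY=1 srun --mpi=pmix -n 2 ./osu_latency
```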


Hui
________________________________
From: Howard Pritchard <hppritcha at gmail.com>
Sent: Wednesday, July 30, 2025 4:37 PM
To: Thakur, Rajeev <thakur at anl.gov>
Cc: discuss at mpich.org; Zhou, Hui <zhouh at anl.gov>
Subject: Re: [mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more - a slurm problem


Hi Rajeev,

Here are the results for 4.3.x branch:


hpp at nid001293:/usr/projects/artab/users/hpp/osu-micro-benchmarks-5.8-mpich/mpi/pt2pt>srun --mpi=pmix -n 2 ./osu_latency

# OSU MPI Latency Test v5.8

# Size          Latency (us)

0                       1.92

1                       1.98

2                       1.98

4                       1.98

8                       1.98

16                      1.98

32                      1.99

64                      1.99

128                     2.47

256                     2.59

512                     2.65

1024                    2.76

2048                    2.95

4096                    3.00

8192                    5.96

16384                   6.64

32768                   7.44

65536                   8.75

131072                 11.52

262144                 17.08

524288                 27.96

1048576                49.38

2097152                92.96

4194304               179.74

These are more like what I would expect for the SS11/OFI CXI provider.

Howard

On Wed, Jul 30, 2025 at 12:48 PM Thakur, Rajeev <thakur at anl.gov> wrote:

Hi Howard,

What was the latency with the 4.3.x branch?



Rajeev





From: Howard Pritchard via discuss <discuss at mpich.org>
Reply-To: discuss at mpich.org
Date: Wednesday, July 30, 2025 at 1:43 PM
To: Zhou, Hui <zhouh at anl.gov>
Cc: Howard Pritchard <hppritcha at gmail.com>; discuss at mpich.org
Subject: Re: [mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more - a slurm problem




Hi Hui



That didn’t help.  I am not surprised, though, as our cluster is an NVIDIA-free zone.  What did help was switching to the mpich 4.3.x branch: the latency results are nominal, and the slurm problem went away too.  So we will stick with that branch.



Howard



On Mon, Jul 28, 2025 at 4:15 PM Zhou, Hui <zhouh at anl.gov> wrote:

Hi Howard,



I wonder whether it is due to the overhead of querying pointer attributes. Could you try disabling GPU support with `MPIR_CVAR_ENABLE_GPU=0` and see if the latency improves?
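One way to apply that setting for a single run could look like the following (a sketch; the `srun` flags and `osu_latency` binary are assumed from the runs shown elsewhere in this thread):

```shell
# Disable MPICH GPU support for one run, to rule out pointer-attribute
# query overhead as the source of the extra latency.
MPIR_CVAR_ENABLE_GPU=0 srun --mpi=pmix -n 2 ./osu_latency
```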



Hui

________________________________

From: Howard Pritchard via discuss <discuss at mpich.org>
Sent: Monday, July 28, 2025 9:41 AM
To: discuss at mpich.org
Cc: Howard Pritchard <hppritcha at gmail.com>
Subject: [mpich-discuss] MPICH 5.0.1 performance on HPE SS11 plus more - a slurm problem




Hi Folks,



We are seeing a strange performance issue on our HPE SS11 system when testing osu_latency inter-node with MPICH.



First the info:

system using libfabric 1.22.0

slurm - 24.11.5



Here's my mpichversion output:



MPICH Version:      5.0.0a1

MPICH Release date: unreleased development copy

MPICH ABI:          0:0:0

MPICH Device:       ch4:ofi

MPICH configure:    --prefix=/XXXX/mpich_again/install --enable-g=no --enable-error-checking=no --with-device=ch4:ofi --enable-threads=multiple --with-ch4-shmmods=posix,xpmem --enable-thread-cs=per-vci --with-libfabric=/opt/cray/libfabric/1.22.0 --with-xpmem=/opt/cray/xpmem/default --with-pmix=/opt/pmix/gcc4x/5.0.8 --enable-fast=O3

MPICH CC:           gcc     -O3

MPICH CXX:          g++   -O3

MPICH F77:          gfortran   -O3

MPICH FC:           gfortran   -O3

MPICH features:     threadcomm



And here's the OSU latency results:



slurmstepd: error:  mpi/pmix_v4: pmixp_coll_belong_chk: nid001439 [1]: pmixp_coll.c:280: No process controlled by this slurmstepd is involved in this collective.

slurmstepd: error:  mpi/pmix_v4: _process_server_request: nid001439 [1]: pmixp_server.c:923: Unable to pmixp_state_coll_get()

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_check: nid001438 [0]: pmixp_coll_ring.c:614: 0x15005c005dc0: unexpected contrib from nid001439:1, expected is 0

slurmstepd: error:  mpi/pmix_v4: _process_server_request: nid001438 [0]: pmixp_server.c:937: 0x15005c005dc0: unexpected contrib from nid001439:1, coll->seq=0, seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001438 [0]: pmixp_coll_ring.c:738: 0x1500580532f0: collective timeout seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_log: nid001438 [0]: pmixp_coll.c:286: Dumping collective state

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:756: 0x1500580532f0: COLL_FENCE_RING state seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:758: my peerid: 0:nid001438

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:765: neighbor id: next 1:nid001439, prev 1:nid001439

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:775: Context ptr=0x150058053368, #0, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:775: Context ptr=0x1500580533a0, #1, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:775: Context ptr=0x1500580533d8, #2, in-use=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:786:  seq=0 contribs: loc=1/prev=0/fwd=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:788:  neighbor contribs [2]:

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:821:  done contrib: -

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:823:  wait contrib: nid001439

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:825:  status=PMIXP_COLL_RING_PROGRESS

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001438 [0]: pmixp_coll_ring.c:829:  buf (offset/size): 36/16384

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_reset_if_to: nid001439 [1]: pmixp_coll_ring.c:738: 0x151d0c053400: collective timeout seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_log: nid001439 [1]: pmixp_coll.c:286: Dumping collective state

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:756: 0x151d0c053400: COLL_FENCE_RING state seq=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:758: my peerid: 1:nid001439

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:765: neighbor id: next 0:nid001438, prev 0:nid001438

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:775: Context ptr=0x151d0c053478, #0, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:775: Context ptr=0x151d0c0534b0, #1, in-use=0

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:775: Context ptr=0x151d0c0534e8, #2, in-use=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:786:  seq=0 contribs: loc=1/prev=0/fwd=1

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:788:  neighbor contribs [2]:

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:821:  done contrib: -

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:823:  wait contrib: nid001438

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:825:  status=PMIXP_COLL_RING_PROGRESS

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: nid001439 [1]: pmixp_coll_ring.c:829:  buf (offset/size): 36/16384

# OSU MPI Latency Test v5.8

# Size          Latency (us)

0                       1.66

1                       9.29

2                       9.57

4                       9.69

8                       9.76

16                      9.77

32                      9.76

64                      9.77

128                    10.32

256                     7.54

512                     7.45

1024                    7.38

2048                    7.37

4096                    7.45

8192                    9.21

16384                   9.70

32768                  10.63

65536                  13.15

131072                 16.96

262144                 23.84

524288                 36.16

1048576                60.36

2097152               108.43

4194304               228.31



Note the slurm behavior: I launch the job, go get coffee, do some Duolingo, read some emails, and then after about 10 minutes the osu latency test runs.



I did not get the slurm problems using the older mpich 4.3.1, but I did get the same performance issue.  9 usec doesn't seem right for an 8-byte ping-pong over the libfabric SS11 path; I was expecting more like 1.6 or so.



I am confident the slurm issue is unrelated to the latency issue.

Thanks for any suggestions on how to address either issue, though.





