<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
I see. My guess (and I’ll let Hui confirm) is that you’re not actually using the PSM2 capabilities when you compile with
<font face="Menlo" class="">--with-device=ch4:ofi</font>. Instead, you’re getting a fallback capability set that is supposed to work with any provider. When you switch to
<font face="Menlo" class="">--with-device=ch4:ofi:psm2</font>, you’re forcing MPICH to use the PSM2 capabilities (as we currently define them for OFI 1.11 or whatever version we expect right now), and your version of PSM2 doesn’t support them, so it crashes during
initialization because it can’t find a provider that meets its requirements. If you set the environment variable MPIR_CVAR_CH4_OFI_CAPABILITY_SETS_DEBUG, MPICH will print the capability set (and provider) it is using, so you can confirm.
<div class=""><br class="">
</div>
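<div class="">For reference, here is a minimal sketch of that check. It assumes the CVAR is enabled by setting it to 1 and uses a hypothetical benchmark path (<font face="Menlo" class="">./osu_bw</font>) that you’d replace with your own binary; neither detail comes from your setup, so adjust as needed.</div>
<pre class="">
```shell
# Ask MPICH to print the capability set and provider it selects.
# Assumption: setting the CVAR to 1 turns the debug output on.
export MPIR_CVAR_CH4_OFI_CAPABILITY_SETS_DEBUG=1

# Hypothetical benchmark path; point BENCH at your own binary.
BENCH="${BENCH:-./osu_bw}"

# Launch two ranks only if the benchmark binary is actually there.
if [ -x "$BENCH" ]; then
    mpiexec -n 2 "$BENCH"
else
    echo "benchmark not found at $BENCH; set BENCH to your binary"
fi
```
</pre>
<div class="">Running this against the ch4:ofi build should show whether it picked the PSM2 capability set or the generic one.</div>
<div class=""><br class="">
</div>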
<div class="">Assuming that’s the case, there’s probably some manual set of CVARs that will get you the right set of capabilities, but I’m not sure what it would be off the top of my head. 1.5 is pretty old at this point so it’s disappeared from my brain. :)</div>
<div class=""><br class="">
</div>
<div class="">I’m not that surprised that MVAPICH might be winning with an older version of OFI. My understanding is that it’s still on CH3 (I might be wrong here) and isn’t using the CH4 capability set code. I think the capability sets probably improve things
in MPICH when matching our expected versions, but could cause these sorts of legacy issues. Just a guess though.</div>
<div class=""><br class="">
</div>
<div class="">Good luck!</div>
<div class="">Wes<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Jun 16, 2021, at 2:48 AM, Antonio J. Peña <<a href="mailto:antonio.pena@bsc.es" class="">antonio.pena@bsc.es</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class=""><br class="">
Hi Wesley,<br class="">
<br class="">
Happy to hear from you. With that setting I cannot get past this runtime error at init:<br class="">
<br class="">
Abort(69832847) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:<br class="">
PMPI_Init(98)............: MPI_Init(argc=0x7ffd4776159c, argv=0x7ffd47761590) failed<br class="">
MPII_Init_thread(196)....:<br class="">
MPID_Init(472)...........:<br class="">
MPID_Init_local(585).....:<br class="">
MPIDI_OFI_init_local(633):<br class="">
open_fabric(1360)........: OFI fi_getinfo() failed (ofi_init.c:1360:open_fabric:Function not implemented)<br class="">
<br class="">
There must be something in MPICH, at least as of v3.4.2, or I'm doing something wrong on the MPICH side. My libfabric is working fine: I get good performance with fi_pingpong, and I've also just tried MVAPICH (with the psm2 netmod, not through libfabric), which
gave good performance out of the box.<br class="">
<br class="">
Although I'd prefer to tweak MPICH because I'm far more comfortable with that code, I'm okay moving ahead with MVAPICH, so unless this is of interest on your side (I guess you don't care that much about psm2 now), we can close this thread.<br class="">
<br class="">
Thanks a lot for your help.<br class="">
<br class="">
Toni<br class="">
<br class="">
<br class="">
<br class="">
On 15/6/21 at 15:03, Wesley Bland via discuss wrote:<br class="">
<blockquote type="cite" class="">Hey Toni,<br class="">
<br class="">
I’d be surprised if the performance dropped that much, but you can try --with-device=ch4:ofi:psm2 to convert at least some of the branches to compile-time instead of runtime. Beyond that, I don’t remember enough about OFI 1.5. There might have been some changes
in MPICH over the last year or two that make that version not perform as well…<br class="">
<br class="">
Thanks,<br class="">
Wes<br class="">
<br class="">
<blockquote type="cite" class="">On Jun 15, 2021, at 4:46 AM, Antonio Peña via discuss <<a href="mailto:discuss@mpich.org" class="">discuss@mpich.org</a>> wrote:<br class="">
<br class="">
<br class="">
Hi folks,<br class="">
<br class="">
I'm setting up MPICH over libfabric over psm2 on MareNostrum (Omni-Path) to try out some ideas.<br class="">
<br class="">
I've compiled libfabric 1.5 (the last one that compiles on this machine) over opa-psm2-11.2.185, and mpich-3.4.2 + mpich-4.0a1 in both ch3 and ch4 (yes, 4 MPICH variants). There's only psm2 support in libfabric, so no danger of falling back to other providers. ldd
confirms my libfabric is linked.<br class="">
<br class="">
./fi_info<br class="">
provider: psm2<br class="">
fabric: psm2<br class="">
domain: psm2<br class="">
version: 1.5<br class="">
type: FI_EP_RDM<br class="">
protocol: FI_PROTO_PSMX2<br class="">
<br class="">
I'm comparing 2-node pt2pt performance against impi/2017.4 using the OSU microbenchmarks.<br class="">
<br class="">
While both fi_pingpong and impi give me a max. BW of ~10 MB/s, all MPICH versions stick at ~3 MB/s.<br class="">
<br class="">
Is this expected? I mean, is there that much secret sauce in impi? Or am I likely doing something wrong?<br class="">
<br class="">
I'm doing fairly plain configures, nothing fancy, e.g.:<br class="">
./configure --prefix=... --with-device=ch4:ofi --with-libfabric=...<br class="">
<br class="">
I'd appreciate some guidance - my MPICH tweaking is a little rusty :)<br class="">
<br class="">
Best,<br class="">
Toni<br class="">
_______________________________________________<br class="">
discuss mailing list <a href="mailto:discuss@mpich.org" class="">discuss@mpich.org</a><br class="">
To manage subscription options or unsubscribe:<br class="">
<a href="https://lists.mpich.org/mailman/listinfo/discuss" class="">https://lists.mpich.org/mailman/listinfo/discuss</a><br class="">
</blockquote>
</blockquote>
<br class="">
-- <br class="">
Antonio J. Peña (PhD)<br class="">
Team Lead, Accelerators and Communications for HPC | Teaching and Research Staff<br class="">
Sr. Researcher, Computer Sciences Department | Computer Architecture Department<br class="">
Barcelona Supercomputing Center (BSC) | Universitat Politècnica de Catalunya (UPC)<br class="">
<a href="http://www.bsc.es/pena-antonio" class="">http://www.bsc.es/pena-antonio</a><br class="">
<br class="">
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>