[mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Audet, Martin
Martin.Audet at cnrc-nrc.gc.ca
Mon Jun 23 14:48:25 CDT 2025
Thanks for this quick reply.
You say that hcoll does not work correctly at runtime. In that case, there should be a warning or some other notice for users who try to enable it (as we did). Slower but correct results are far better than faster but incorrect ones. I will recompile the library without this option so that it does not create problems for the users of our cluster.
Since which version has hcoll not worked correctly?
I may also disable it in the older mpich versions we keep available for our users.
Thanks,
Martin
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: June 23, 2025 15:40
To: discuss at mpich.org
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: EXT: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hi Martin,
My apologies for the lack of update on this topic. We did not include this patch because even with successful compilation, MPICH hcoll integration does not function correctly at runtime in our tests. Due to other priorities, we have not yet spent the time to fix the issue.
Ken
From: Audet, Martin via discuss <discuss at mpich.org>
Date: Monday, June 23, 2025 at 10:04 AM
To: discuss at mpich.org
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hello,
It seems that the silly compilation problem with hcoll_rte.c that I had back in April with mpich 4.3.0 when using the --with-hcoll=/opt/mellanox/hcoll configuration option is still present in 4.3.1, see:
https://lists.mpich.org/mailman/htdig/discuss/2025-April/006725.html
It seems that the following very simple patch, which I was told to try with 4.3.0, has not been included in 4.3.1:
--- src/mpid/common/hcoll/hcoll_rte.c 2025-04-16 12:54:24.847337975 -0400
+++ src/mpid/common/hcoll/hcoll_rte.c 2025-04-16 12:55:05.428164974 -0400
@@ -55,7 +55,7 @@
/* FIXME: The hcoll library needs to be updated to return
* error codes. The progress function pointer right now
* expects that the function returns void. */
- ret = hcoll_do_progress(&made_progress);
+ ret = hcoll_do_progress(-1, &made_progress);
MPIR_Assert(ret == MPI_SUCCESS);
}
}
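To illustrate why the one-argument call stops compiling, here is a minimal sketch. It assumes (this is not confirmed in the thread) that the installed hcoll header declares the progress function with two parameters. The names stub_do_progress, progress_once and SKETCH_SUCCESS are hypothetical stand-ins, not the real hcoll or MPICH symbols, and the meaning of the first argument (-1 in the patch) is not stated here.

```c
#include <assert.h>

/* Stand-in for MPI_SUCCESS; hypothetical, for this sketch only. */
#define SKETCH_SUCCESS 0

/* Hypothetical stand-in for a two-parameter progress entry point, as the
 * patched call implies. The real prototype lives in the hcoll headers and
 * may differ; the purpose of the first argument is an unknown here. */
int stub_do_progress(int first_arg, int *made_progress)
{
    (void) first_arg;      /* unused in this stub */
    *made_progress = 1;    /* pretend some progress was made */
    return SKETCH_SUCCESS;
}

/* Mirrors the shape of the patched call site in the diff above. */
int progress_once(int *made_progress)
{
    /* Writing stub_do_progress(made_progress) instead would be rejected
     * by the compiler, because the prototype above requires two arguments. */
    int ret = stub_do_progress(-1, made_progress);
    assert(ret == SKETCH_SUCCESS);
    return ret;
}
```

Calling progress_once(&flag) from a small main() compiles cleanly, while reverting to the one-argument call reproduces a "too few arguments" style error under the assumption above.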
So it looks like this code path is not compiled very often by the MPICH developers or their QA process.
BTW, applying the same patch fixes the compilation problem, but:
What does this mean for us users? Should we still use this option? BTW, hcoll is a very cool mechanism for improving the efficiency of collective operations. Is this option obsolete? Was it replaced by something else?
Thanks,
Martin Audet