[mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Raffenetti, Ken
raffenet at anl.gov
Mon Jun 23 14:40:00 CDT 2025
Hi Martin,
My apologies for the lack of update on this topic. We did not include this patch because even with successful compilation, MPICH hcoll integration does not function correctly at runtime in our tests. Due to other priorities, we have not yet spent the time to fix the issue.
Ken
From: Audet, Martin via discuss <discuss at mpich.org>
Date: Monday, June 23, 2025 at 10:04 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hello,
It seems that the silly compilation problem with hcoll_rte.c that I had back in April with mpich 4.3.0, when using the --with-hcoll=/opt/mellanox/hcoll configuration option, is still present in 4.3.1. See:
https://lists.mpich.org/mailman/htdig/discuss/2025-April/006725.html
It seems that the following very simple patch, which I was told to try with 4.3.0, hasn't been included in 4.3.1:
--- src/mpid/common/hcoll/hcoll_rte.c 2025-04-16 12:54:24.847337975 -0400
+++ src/mpid/common/hcoll/hcoll_rte.c 2025-04-16 12:55:05.428164974 -0400
@@ -55,7 +55,7 @@
/* FIXME: The hcoll library needs to be updated to return
* error codes. The progress function pointer right now
* expects that the function returns void. */
- ret = hcoll_do_progress(&made_progress);
+ ret = hcoll_do_progress(-1, &made_progress);
MPIR_Assert(ret == MPI_SUCCESS);
}
}
So it looks like this code path is not compiled very often by the mpich developers or their QA process.
BTW, applying the same patch fixes the compilation problem, but:
What does this mean for us users? Should we still use this option? BTW, hcoll is a very cool mechanism for improving the efficiency of collective operations. Is this option obsolete? Was it replaced by something else?
Thanks,
Martin Audet