[mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Zhou, Hui
zhouh at anl.gov
Wed Jun 25 11:50:11 CDT 2025
Sorry for the misspelling - Martin 🙂
________________________________
From: Zhou, Hui via discuss <discuss at mpich.org>
Sent: Wednesday, June 25, 2025 11:46 AM
To: Raffenetti, Ken <raffenet at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Cc: Zhou, Hui <zhouh at anl.gov>
Subject: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hi Margin,
Just to say that your support and patience are very much appreciated.
As for the hcoll issue: the fix is simple (in fact, we have had the PR open for a year), but it was blocked by testing failures. We were not aware of any hcoll users at the time, so investigating those failures received low priority. In general, if you or your users depend on a feature and experience breakage, I would encourage you to be harsh and, more importantly, persistent; it's okay to shout from time to time :) . A stale issue needs a shout to raise attention, and we need usage feedback to motivate us and to set priorities properly.
Again, thank you for your support and understanding of our resource constraints.
--
Hui
________________________________
From: Audet, Martin via discuss <discuss at mpich.org>
Sent: Wednesday, June 25, 2025 9:03 AM
To: Raffenetti, Ken <raffenet at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hello Ken,
Thanks for the information. I have recompiled the mpich 4.3.0 and 4.3.1 builds used on our small cluster without the --with-hcoll= option to make sure that hcoll is not used (no problems reported so far).
BTW, I was a little harsh in my previous message. We should thank you for that great piece of software that is mpich. We have used it on our clusters since... 1996.
Martin Audet
________________________________
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: June 24, 2025 2:03 PM
To: Audet, Martin; discuss at mpich.org
Subject: EXT: Re: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hi Martin,
You are correct, we failed to warn users about the possible runtime issues using hcoll. I will have to go back and check when the last successful tests were run. As you guessed, this integration has been neglected somewhat in recent times. I have created a github issue to update our documentation and hopefully backport a fix to the stable branch once confirmed. https://github.com/pmodels/mpich/issues/7475
Ken
From: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Date: Monday, June 23, 2025 at 2:48 PM
To: Raffenetti, Ken <raffenet at anl.gov>, discuss at mpich.org <discuss at mpich.org>
Subject: RE: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Thanks for this quick reply.
You say that hcoll doesn't work correctly at runtime, so there should be a warning of some kind for users who try to use it (like us). Slower but correct results are far better than faster but incorrect ones. I will recompile the library without this option so that it doesn't create problems for the users of our cluster.
Since which version has hcoll not worked correctly?
I may also disable it in the older mpich versions we keep available for our users.
Thanks,
Martin
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: June 23, 2025 15:40
To: discuss at mpich.org
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: EXT: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hi Martin,
My apologies for the lack of update on this topic. We did not include this patch because even with successful compilation, MPICH hcoll integration does not function correctly at runtime in our tests. Due to other priorities, we have not yet spent the time to fix the issue.
Ken
From: Audet, Martin via discuss <discuss at mpich.org>
Date: Monday, June 23, 2025 at 10:04 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hello,
It seems that the silly compilation problem with hcoll_rte.c I had back in April with mpich 4.3.0 when using the --with-hcoll=/opt/mellanox/hcoll configuration option is still present in 4.3.1, see:
https://lists.mpich.org/mailman/htdig/discuss/2025-April/006725.html
It seems that the following very simple patch I was told to try with 4.3.0 hasn't been included in 4.3.1:
--- src/mpid/common/hcoll/hcoll_rte.c 2025-04-16 12:54:24.847337975 -0400
+++ src/mpid/common/hcoll/hcoll_rte.c 2025-04-16 12:55:05.428164974 -0400
@@ -55,7 +55,7 @@
         /* FIXME: The hcoll library needs to be updated to return
          * error codes. The progress function pointer right now
          * expects that the function returns void. */
-        ret = hcoll_do_progress(&made_progress);
+        ret = hcoll_do_progress(-1, &made_progress);
         MPIR_Assert(ret == MPI_SUCCESS);
     }
 }
So it looks like this code path is not compiled very often by the mpich developers or their QA process.
BTW, applying the same patch fixes the compilation problem, but:
What does this mean for us users? Should we still use this option? BTW, hcoll is a very cool mechanism for improving the efficiency of collective operations. Is this option obsolete? Was it replaced by something else?
Thanks,
Martin Audet