[mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll

Audet, Martin Martin.Audet at cnrc-nrc.gc.ca
Wed Jun 25 09:03:52 CDT 2025


Hello Ken,

Thanks for the information. I have recompiled mpich 4.3.0 and 4.3.1 used on our small cluster without the --with-hcoll= option to make sure that hcoll is not used (no problems reported so far).

BTW I was a little harsh in my previous message. We should thank you for that great piece of software that is mpich. We have used it on our clusters since... 1996.

Martin Audet

________________________________
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: June 24, 2025 2:03 PM
To: Audet, Martin; discuss at mpich.org
Subject: EXT: Re: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll


Hi Martin,

You are correct, we failed to warn users about the possible runtime issues using hcoll. I will have to go back and check when the last successful tests were run. As you guessed, this integration has been neglected somewhat in recent times. I have created a github issue to update our documentation and hopefully backport a fix to the stable branch once confirmed. https://github.com/pmodels/mpich/issues/7475

Ken

From: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Date: Monday, June 23, 2025 at 2:48 PM
To: Raffenetti, Ken <raffenet at anl.gov>, discuss at mpich.org <discuss at mpich.org>
Subject: RE: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll

Thanks for this quick reply.

You say that hcoll doesn't work correctly (at runtime), so in this case there should be a warning or something to alert users who try to use it (like us). Slower but correct results are far better than faster but incorrect ones. I will recompile the library without this option so that it doesn't create problems for the users of our cluster.

Since which version has hcoll not worked correctly?

I may also disable it in the older mpich versions we keep available for our users.
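
A minimal sketch of how one might check whether a given MPICH installation was configured with hcoll. It assumes that the `mpichversion` utility shipped with MPICH is on the PATH and that its output lists the configure options; the helper name `uses_hcoll` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the text on stdin
# mentions a --with-hcoll configure option.
# Note: this is a plain text match, so --with-hcoll=no would also match.
uses_hcoll() {
    grep -q -- '--with-hcoll'
}

# Assumption: mpichversion prints the configure options of the build.
# The check is skipped when mpichversion is not installed.
if command -v mpichversion >/dev/null 2>&1; then
    if mpichversion | uses_hcoll; then
        echo "this MPICH build was configured with hcoll"
    else
        echo "no --with-hcoll option found"
    fi
fi
```

Running this once per installed MPICH version (with the PATH pointing at that version) would show which builds need to be recompiled.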

Thanks,

Martin
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: June 23, 2025 15:40
To: discuss at mpich.org
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: EXT: Re: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll


Hi Martin,

My apologies for the lack of update on this topic. We did not include this patch because even with successful compilation, MPICH hcoll integration does not function correctly at runtime in our tests. Due to other priorities, we have not yet spent the time to fix the issue.

Ken

From: Audet, Martin via discuss <discuss at mpich.org>
Date: Monday, June 23, 2025 at 10:04 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: [mpich-discuss] mpich 4.3.1 still have compilation problem when using --with-hcoll=/opt/mellanox/hcoll


Hello,



It seems that the silly compilation problem with hcoll_rte.c that I had back in April with mpich 4.3.0 when using the --with-hcoll=/opt/mellanox/hcoll configuration option is still present in 4.3.1, see:

https://lists.mpich.org/mailman/htdig/discuss/2025-April/006725.html



It seems that the following very simple patch I was told to try with 4.3.0 hasn't been included in 4.3.1:


--- src/mpid/common/hcoll/hcoll_rte.c   2025-04-16 12:54:24.847337975 -0400
+++ src/mpid/common/hcoll/hcoll_rte.c   2025-04-16 12:55:05.428164974 -0400
@@ -55,7 +55,7 @@
         /* FIXME: The hcoll library needs to be updated to return
          * error codes.  The progress function pointer right now
          * expects that the function returns void. */
-        ret = hcoll_do_progress(&made_progress);
+        ret = hcoll_do_progress(-1, &made_progress);
         MPIR_Assert(ret == MPI_SUCCESS);
     }
 }

So it looks like this code path is not compiled very often by the mpich developers or their QA process.



BTW applying the same patch fixes the compilation problem, but:



What does it mean for us users? Should we still use this option? BTW hcoll is a very cool mechanism for improving the efficiency of collective operations. Is this option obsolete? Was it replaced by something else?



Thanks,



Martin Audet