[mpich-discuss] mpich 4.3.0 compilation problem when using --with-hcoll=/opt/mellanox/hcoll

Audet, Martin Martin.Audet at cnrc-nrc.gc.ca
Wed Apr 16 13:54:27 CDT 2025


Hello Hui,


I tried the patch and it works (it compiles at least).

Thanks for your very quick response!

Martin

________________________________
From: Zhou, Hui <zhouh at anl.gov>
Sent: April 16, 2025 12:17 PM
To: discuss at mpich.org
Cc: Audet, Martin
Subject: EXT: Re: [mpich-discuss] mpich 4.3.0 compilation problem when using --with-hcoll=/opt/mellanox/hcoll


Hi Martin,

Could you try the patch in https://github.com/pmodels/mpich/pull/7047 ?

--
Hui

________________________________
From: Audet, Martin via discuss <discuss at mpich.org>
Sent: Wednesday, April 16, 2025 11:13 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>
Subject: [mpich-discuss] mpich 4.3.0 compilation problem when using --with-hcoll=/opt/mellanox/hcoll

Hello mpich community,


When I try to compile mpich 4.3.0 configured with the --with-hcoll=/opt/mellanox/hcoll option, I get a compilation error: the hcoll_do_progress() function is defined with two arguments in hcoll_init.c, but it is called with only one in hcoll_rte.c!


Here is the error message I get:


src/mpid/common/hcoll/hcoll_rte.c: In function 'progress':
src/mpid/common/hcoll/hcoll_rte.c:58:33: warning: passing argument 1 of 'hcoll_do_progress' makes integer from pointer without a cast [-Wint-conversion]
   58 |         ret = hcoll_do_progress(&made_progress);
      |                                 ^~~~~~~~~~~~~~
      |                                 |
      |                                 int *
In file included from ./src/mpid/ch4/netmod/include/../ucx/ucx_coll.h:11,
                 from ./src/mpid/ch4/netmod/include/../ucx/netmod_inline.h:15,
                 from ./src/mpid/ch4/netmod/include/netmod_impl.h:1589,
                 from ./src/mpid/ch4/include/mpidch4.h:448,
                 from ./src/mpid/ch4/include/mpidpost.h:10,
                 from ./src/include/mpiimpl.h:232,
                 from src/mpid/common/hcoll/hcoll_rte.c:6:
./src/mpid/ch4/netmod/include/../ucx/../../../common/hcoll/hcoll.h:42:27: note: expected 'int' but argument is of type 'int *'
   42 | int hcoll_do_progress(int vci, int *made_progress);
      |                       ~~~~^~~
src/mpid/common/hcoll/hcoll_rte.c:58:15: error: too few arguments to function 'hcoll_do_progress'
   58 |         ret = hcoll_do_progress(&made_progress);
      |               ^~~~~~~~~~~~~~~~~
In file included from ./src/mpid/ch4/netmod/include/../ucx/ucx_coll.h:11,
                 from ./src/mpid/ch4/netmod/include/../ucx/netmod_inline.h:15,
                 from ./src/mpid/ch4/netmod/include/netmod_impl.h:1589,
                 from ./src/mpid/ch4/include/mpidch4.h:448,
                 from ./src/mpid/ch4/include/mpidpost.h:10,
                 from ./src/include/mpiimpl.h:232,
                 from src/mpid/common/hcoll/hcoll_rte.c:6:
./src/mpid/ch4/netmod/include/../ucx/../../../common/hcoll/hcoll.h:42:5: note: declared here
   42 | int hcoll_do_progress(int vci, int *made_progress);
      |     ^~~~~~~~~~~~~~~~~
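
For readers skimming the log, the mismatch can be reproduced in isolation. The following is only a minimal sketch, not the actual MPICH fix: the prototype is copied from the hcoll.h line quoted in the error output, while the stub body, the caller name progress_sketch(), and the placeholder value 0 passed for vci are assumptions made purely for illustration.

```c
#include <stdio.h>

/* Two-argument prototype, as quoted from hcoll.h:42 in the error output. */
int hcoll_do_progress(int vci, int *made_progress);

/* Stand-in body so this sketch links on its own (ASSUMPTION: the real
 * implementation in hcoll_init.c does actual hcoll progress work). */
int hcoll_do_progress(int vci, int *made_progress)
{
    (void) vci;             /* unused in this stub */
    *made_progress = 1;     /* pretend some progress was made */
    return 0;               /* the stub always reports success */
}

/* Hypothetical caller modeled on the call at hcoll_rte.c:58. */
int progress_sketch(void)
{
    int made_progress = 0;

    /* The failing call was hcoll_do_progress(&made_progress): one argument,
     * so the pointer lands in the 'int vci' slot and the second argument is
     * missing. Supplying both arguments satisfies the prototype.
     * ASSUMPTION: 0 is only a placeholder VCI index for this sketch. */
    int ret = hcoll_do_progress(0, &made_progress);

    printf("ret=%d made_progress=%d\n", ret, made_progress);
    return ret;
}
```

Calling progress_sketch() from a small main() compiles cleanly with gcc, whereas restoring the one-argument call reproduces both the -Wint-conversion warning and the "too few arguments" error shown above. Which value the real call site should pass for vci is decided by the upstream patch, not by this sketch.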



I have compiled mpich versions 3.4.x, 4.1.x, and 4.2.x configured with this option (--with-hcoll=) in the past without any problems. It looks like some recent changes in the related files introduced a problem that slipped into 4.3.0 and makes compilation impossible.

Could it be fixed? Or could the --with-hcoll option be removed if it is no longer relevant (I guess that if we use ch4:ucx, ucx may itself use hcoll internally to optimize collective operations when running in a hierarchical environment)?

Here are some details:


arch: x86_64
OS: RHEL 9.5 (up to date except kernel)
MOFED: 24.10-2.1.8.0-LTS
hcoll: 4.8.3230-1.2410068
ucx: 1.18.0-1.2410068


The complete configuration line:

./configure --with-device=ch4:ucx --with-hcoll=/opt/mellanox/hcoll --prefix=/work/software/x86_64/mpich/mpich-ch4_ucx-4.3.0 --with-xpmem --enable-g=none --enable-fast=all --enable-romio --with-file-system=ufs+nfs+lustre --enable-shared --enable-sharedlibs=gcc

Thanks,

Martin Audet