[mpich-discuss] error in installing MPICH-4.0.1: libhwloc.so.15: cannot open shared object file: No such file or directory

Zhou, Hui zhouh at anl.gov
Sat Mar 26 23:18:47 CDT 2022


Hi Bruce,

libhwloc is installed on the host where you build MPICH but is missing on one or more of your nodes. By default, when you build MPICH, configure checks for a system hwloc and, if it finds one, uses it rather than the embedded copy. To force the embedded copy -- which I recommend in your case -- add --with-hwloc=embedded to your configure options.
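For example, a rebuild might look like this (a sketch only; the prefix is copied from your mpiexec -info output below, adjust as needed):

./configure --prefix=/home/mpiuser/mpich-install --with-hwloc=embedded
make
make install

With the embedded hwloc compiled in, hydra_pmi_proxy should no longer need the system libhwloc.so.15 on any node.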

Cheers,
--
Hui
________________________________
From: Bruce Rout via discuss <discuss at mpich.org>
Sent: Saturday, March 26, 2022 9:59 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Bruce Rout <bbrout at rasl.ca>
Subject: [mpich-discuss] error in installing MPICH-4.0.1: libhwloc.so.15: cannot open shared object file: No such file or directory

Greetings,

I have been trying for about a week to get MPICH to work.

I am running Ubuntu 20.04.4 on an HP Z600 acting as the master, with three nodes, all under NFS.

NFS is working, with passwordless ssh access and a common shared directory across all four computers (the master plus three nodes).

MPICH had some problems with the Hydra proxies; however, a reinstall with apt-get install mpich solved the problem, and it worked with processes running across the entire array.

I installed a hard drive for a different user on the master node, leaving the master user mpiuser unchanged. There was some difficulty in getting the drive to mount for the different user, but now mpich only works on the master node. I have reinstalled a number of times and am installing by downloading mpich-4.0.1 from http://www.mpich.org.
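For reference, the build sequence I followed was roughly this (per the README, with the prefix matching my install):

tar xf mpich-4.0.1.tar.gz
cd mpich-4.0.1
./configure --prefix=/home/mpiuser/mpich-install
make
make install
export PATH=/home/mpiuser/mpich-install/bin:$PATH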

Running on a single node works:

mpiuser at BD-Main:~$ mpiexec -n 3 ./examples/cpi
Invalid MIT-MAGIC-COOKIE-1 keyInvalid MIT-MAGIC-COOKIE-1 keyInvalid MIT-MAGIC-COOKIE-1 keyProcess 0 of 3 is on BD-Main
Process 1 of 3 is on BD-Main
Process 2 of 3 is on BD-Main
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.002317
mpiuser at BD-Main:~$

and there is an invalid-key message somewhere in that output. The machinefile contains the following (this is the host file, correct?):

mpiuser at BD-Main:~$ more machinefile
node1:4
node2:2
node3:4

mpiuser at BD-Main:~$

And there is the following error with multiple nodes:

mpiuser at BD-Main:~$ mpiexec -f machinefile -n 3 ./examples/cpi
/home/mpiuser/mpich-install/bin/hydra_pmi_proxy: error while loading shared libraries: libhwloc.so.15: cannot open shared object file: No such file or directory
^C[mpiexec at BD-Main] Sending Ctrl-C to processes as requested
[mpiexec at BD-Main] Press Ctrl-C again to force abort
[mpiexec at BD-Main] HYDU_sock_write (utils/sock/sock.c:254): write error (Bad file descriptor)
[mpiexec at BD-Main] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:176): unable to write data to proxy
[mpiexec at BD-Main] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:42): unable to send signal downstream
[mpiexec at BD-Main] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at BD-Main] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
[mpiexec at BD-Main] main (ui/mpich/mpiexec.c:325): process manager error waiting for completion
mpiuser at BD-Main:~$

The program hangs after the first error message
/home/mpiuser/mpich-install/bin/hydra_pmi_proxy: error while loading shared libraries: libhwloc.so.15: cannot open shared object file: No such file or directory

and I had to exit with Ctrl-C.
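I assume one could narrow this down (given the passwordless ssh access described above) by resolving the proxy's shared libraries directly on a node, e.g.:

ssh node1 ldd /home/mpiuser/mpich-install/bin/hydra_pmi_proxy | grep hwloc

If that prints something like "libhwloc.so.15 => not found", the library is missing on node1 specifically.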

I have hunted everywhere for libhwloc.so.15 and found two files, an RPM (hwloc-libs-1.11.8-4.el7.x86_64.rpm) and a source tarball (hwloc-2.7.1.tar.bz2), which I managed to download but cannot install the libraries from. Synaptic is no help since the repository for hwloc is not installed.
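If the library is supposed to come from the Ubuntu 20.04 repositories, I assume installing it would be something like the following on each node (the package name libhwloc15 is my guess from the focal archive):

sudo apt-get install libhwloc15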

Another indication of a problem is that the file mpich-4.0.1/src/pm/hydra/tools/topo/hwloc/hwloc/config.log is missing after the ./configure step. Obviously it is not included in the zipped files attached here. The configure, make, and make install steps ran without a hitch and without error.

This is the output of mpiexec -info:

mpiuser at BD-Main:~$ mpiexec -info
HYDRA build details:
    Version:                                 4.0.1
    Release Date:                            Tue Feb 22 16:37:51 CST 2022
    CC:                              gcc
    Configure options:                       '--disable-option-checking' '--prefix=/home/mpiuser/mpich-install' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= ' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= '
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Demux engines available:                 poll select
mpiuser at BD-Main:~$

The master node is BD-Main, and these are the contents of /etc/hosts, which is similar on all nodes:

mpiuser at BD-Main:~$ more /etc/hosts
127.0.0.1      localhost
127.0.1.1      BD-Main


# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

# The following sets up the local network for cluster

10.0.0.1     master
10.0.0.2     node1
10.0.0.3     node2
10.0.0.4     node3

As I have mentioned before, this setup seems to have worked, and NFS is working well. I have purged both hydra and mpich and then reinstalled mpich-4.0.1 according to the instructions in the README file.

Any help would be greatly appreciated. Thank you for your time.

Yours,

Bruce

