[mpich-discuss] error in installing MPICH-4.0.1: libhwloc.so.15: cannot open shared object file: No such file or directory

Bruce Rout bbrout at rasl.ca
Sat Mar 26 21:59:28 CDT 2022


Greetings,

I have been trying for about a week to get MPICH to work.

I am running Ubuntu 20.04.4 on an HP Z-600 as the master, with three nodes, all
under NFS.

NFS is working, with passwordless ssh access and a common hard-drive directory
shared between all four computers (master plus three nodes).
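For reference, the shared directory is mounted on each node with an fstab entry
along these lines (a sketch only; the export path is from my setup and may differ):

# /etc/fstab on node1..node3 -- export path assumed
master:/home/mpiuser  /home/mpiuser  nfs  defaults  0  0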

MPICH had some problems with the Hydra proxies; however, a re-install with
apt-get install mpich solved the problem, and it worked with processes
running across the entire array.

I installed a hard drive for a different user on the master node, leaving the
master user mpiuser unchanged. There was some difficulty getting the drive to
mount for that user, but now mpich only works on the master node. I have
reinstalled a number of times and am installing by downloading mpich-4.0.1
from http://www.mpich.org.
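For reference, the build followed the README steps, along these lines (the
install prefix matches the mpiexec -info output further down):

tar xzf mpich-4.0.1.tar.gz
cd mpich-4.0.1
./configure --prefix=/home/mpiuser/mpich-install 2>&1 | tee c.txt
make 2>&1 | tee m.txt
make install 2>&1 | tee mi.txt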

Running on a single node works:

mpiuser at BD-Main:~$ mpiexec -n 3 ./examples/cpi
Invalid MIT-MAGIC-COOKIE-1 keyInvalid MIT-MAGIC-COOKIE-1 keyInvalid
MIT-MAGIC-COOKIE-1 keyProcess 0 of 3 is on BD-Main
Process 1 of 3 is on BD-Main
Process 2 of 3 is on BD-Main
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.002317
mpiuser at BD-Main:~$

and there is a repeated "Invalid MIT-MAGIC-COOKIE-1" warning embedded in that
output; as far as I can tell this is an X11 display authorization message
rather than anything from MPICH itself.
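(A minimal check, assuming those warnings really are X11 noise: clearing
DISPLAY for one run should make them disappear without changing the MPI output.)

unset DISPLAY
mpiexec -n 3 ./examples/cpi

The machinefile (this is the host file, as I understand it) contains the
following: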

mpiuser at BD-Main:~$ more machinefile
node1:4
node2:2
node3:4

mpiuser at BD-Main:~$

And with multiple nodes I get the following error:

mpiuser at BD-Main:~$ mpiexec -f machinefile -n 3 ./examples/cpi
/home/mpiuser/mpich-install/bin/hydra_pmi_proxy: error while loading shared
libraries: libhwloc.so.15: cannot open shared object file: No such file or
directory
^C[mpiexec at BD-Main] Sending Ctrl-C to processes as requested
[mpiexec at BD-Main] Press Ctrl-C again to force abort
[mpiexec at BD-Main] HYDU_sock_write (utils/sock/sock.c:254): write error (Bad
file descriptor)
[mpiexec at BD-Main] HYD_pmcd_pmiserv_send_signal
(pm/pmiserv/pmiserv_cb.c:176): unable to write data to proxy
[mpiexec at BD-Main] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:42): unable to send
signal downstream
[mpiexec at BD-Main] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at BD-Main] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
[mpiexec at BD-Main] main (ui/mpich/mpiexec.c:325): process manager error
waiting for completion
mpiuser at BD-Main:~$

The program hangs after the first error message:
/home/mpiuser/mpich-install/bin/hydra_pmi_proxy: error while loading shared
libraries: libhwloc.so.15: cannot open shared object file: No such file or
directory

and I had to exit with Ctrl-C.
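For what it is worth, a quick way to check which shared libraries the proxy
fails to resolve on a given node (a sketch, assuming the same NFS-mounted
install path on every node):

ssh node1 ldd /home/mpiuser/mpich-install/bin/hydra_pmi_proxy | grep 'not found'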

I have hunted everywhere for libhwloc.so.15 and found two files, an RPM
(hwloc-libs-1.11.8-4.el7.x86_64.rpm) and a source tarball (hwloc-2.7.1.tar.bz2),
which I managed to download but could not install the libraries from. Synaptic
is no help, since no repository providing hwloc is set up.
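My understanding (an assumption worth checking with apt-cache search hwloc) is
that on Ubuntu 20.04 this library ships in the libhwloc15 package, so
installing that on the master and on each node might supply libhwloc.so.15
directly:

# run on BD-Main and on node1, node2, node3; package name assumed for Ubuntu 20.04
sudo apt-get update
sudo apt-get install libhwloc15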

Another indication of a problem is that the file
mpich-4.0.1/src/pm/hydra/tools/topo/hwloc/hwloc/config.log is missing after
the ./configure step; it is therefore not included in the zipped files
attached here. The configure, make and make install steps ran without a
hitch and without error.
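Two possible workarounds occur to me, both sketches rather than anything I can
vouch for: (a) pointing LD_LIBRARY_PATH at wherever libhwloc.so.15 actually
lives, in a file that non-interactive ssh logins on the nodes read (e.g. near
the top of ~/.bashrc), or (b) rebuilding MPICH against its embedded hwloc so
that hydra_pmi_proxy has no external hwloc dependency (I am assuming the
option is --with-hwloc=embedded; ./configure --help would confirm):

# (a) library location assumed; adjust to wherever libhwloc.so.15 is found
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

# (b) configure option name assumed; verify with ./configure --help
./configure --prefix=/home/mpiuser/mpich-install --with-hwloc=embedded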

This is the output of mpiexec -info:

mpiuser at BD-Main:~$ mpiexec -info
HYDRA build details:
    Version:                                 4.0.1
    Release Date:                            Tue Feb 22 16:37:51 CST 2022
    CC:                              gcc
    Configure options:                       '--disable-option-checking'
'--prefix=/home/mpiuser/mpich-install' '--cache-file=/dev/null'
'--srcdir=.' 'CC=gcc' 'CFLAGS= ' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= '
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge
manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs
cobalt
    Demux engines available:                 poll select
mpiuser at BD-Main:~$

The master node is BD-Main, and this is the content of /etc/hosts, which is
similar on all nodes:

mpiuser at BD-Main:~$ more /etc/hosts
127.0.0.1      localhost
127.0.1.1      BD-Main


# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

# The following sets up the local network for cluster

10.0.0.1     master
10.0.0.2     node1
10.0.0.3     node2
10.0.0.4     node3

As I have mentioned, this set-up worked before, and NFS is working well. I
have purged both hydra and mpich and then reinstalled mpich-4.0.1 according
to the instructions in the README file.

Any help would be greatly appreciated. Thank you for your time.

Yours,

Bruce
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpifiles.zip
Type: application/zip
Size: 107281 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20220326/4896c017/attachment-0001.zip>

