[mpich-discuss] Parallel test hanging with mpich on rhel7

Orion Poplawski orion at cora.nwra.com
Mon Feb 3 20:29:21 CST 2014


We're starting to do the Fedora EPEL builds for EPEL7.  I'm building
hdf5 1.8.12 with:

mpich-3.0.4-4.el7.x86_64
gcc-4.8.2-3.el7.x86_64

The following test hangs here:

$ mpirun -np 4 ./t_cache
===================================
Parallel metadata cache tests
        mpi_size     = 4
        express_test = 1
===================================
*** Hint ***
You can use environment variable HDF5_PARAPREFIX to run parallel test files in a
different directory or to add file type prefix. E.g.,
   HDF5_PARAPREFIX=pfs:/PFS/user/me
   export HDF5_PARAPREFIX
*** End of Hint ***
0:setup_rand(): seed = 138071.
3:setup_rand(): seed = 149196.
2:setup_rand(): seed = 160135.
1:setup_rand(): seed = 180134.
Testing server smoke check
PASSED
Testing smoke check #1 -- process 0 only md write strategy

The same test runs fine with openmpi.  I'm not seeing the problem in
Fedora either, which has similar versions, so I'm really not sure what
the issue is or how to debug it further.

The processes are consuming 100% CPU, looping continually on calls to
poll() (as seen with strace).  Here are gdb backtrace snapshots from
three of the processes:

(gdb) bt
#0  0x00007fdfdcd9fae2 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpich.so.10
#1  0x00007fdfdce495e8 in PMPI_Recv () from /usr/lib64/mpich/lib/libmpich.so.10
#2  0x0000000000403b24 in recv_mssg (mssg_ptr=0x7fffccf6e2c0, mssg_tag_offset=1) at ../../testpar/t_cache.c:1050
#3  0x000000000040655f in flush_datum (f=0x30fb600, dxpl_id=167772175, dest=0, addr=177681, thing=0x61cd10 <data+21840>) at ../../testpar/t_cache.c:2542
#4  0x00007fdfdd905513 in H5C_flush_single_entry (f=0x30fb600, primary_dxpl_id=167772175, secondary_dxpl_id=167772175, type_ptr=0x614d20 <types>, addr=177681, flags=0, first_flush_ptr=0x7fffccf6e464, del_entry_from_slist_on_destroy=0) at ../../src/H5C.c:7745
#5  0x00007fdfdd905f03 in H5C_make_space_in_cache (f=0x30fb600, primary_dxpl_id=167772175, secondary_dxpl_id=167772175, space_needed=0, write_permitted=1, first_flush_ptr=0x7fffccf6e464) at ../../src/H5C.c:8173
#6  0x00007fdfdd8f8a38 in H5C_flush_to_min_clean (f=0x30fb600, primary_dxpl_id=167772175, secondary_dxpl_id=167772175) at ../../src/H5C.c:1972
#7  0x00007fdfdd8d6298 in H5AC_rsp__p0_only__flush_to_min_clean (f=0x30fb600, dxpl_id=167772175, cache_ptr=0x7fdfddfb3010) at ../../src/H5AC.c:4754
#8  0x00007fdfdd8d64dc in H5AC_run_sync_point (f=0x30fb600, dxpl_id=167772175, sync_point_op=0) at ../../src/H5AC.c:4854
#9  0x00007fdfdd8ce3d3 in H5AC_insert_entry (f=0x30fb600, dxpl_id=167772168, type=0x614d20 <types>, addr=1856348, thing=0x64d360 <data+220064>, flags=0) at ../../src/H5AC.c:1006
#10 0x0000000000406cf9 in insert_entry (cache_ptr=0x7fdfddfb3010, file_ptr=0x30fb600, idx=1058, flags=0) at ../../testpar/t_cache.c:2939
#11 0x000000000040a542 in smoke_check_1 (metadata_write_strategy=0) at ../../testpar/t_cache.c:5375
#12 0x000000000040da3d in main (argc=1, argv=0x7fffccf6e798) at ../../testpar/t_cache.c:7261


(gdb) bt
#0  0x00007ffd121dab2b in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpich.so.10
#1  0x00007ffd1218dd2d in MPIC_Wait () from /usr/lib64/mpich/lib/libmpich.so.10
#2  0x00007ffd1218e020 in MPIC_Recv () from /usr/lib64/mpich/lib/libmpich.so.10
#3  0x00007ffd1218e829 in MPIC_Recv_ft () from /usr/lib64/mpich/lib/libmpich.so.10
#4  0x00007ffd122131c2 in MPIR_Bcast_binomial.isra.1 () from /usr/lib64/mpich/lib/libmpich.so.10
#5  0x00007ffd122139da in MPIR_Bcast_intra () from /usr/lib64/mpich/lib/libmpich.so.10
#6  0x00007ffd1221428d in MPIR_Bcast_impl () from /usr/lib64/mpich/lib/libmpich.so.10
#7  0x00007ffd122147a2 in PMPI_Bcast () from /usr/lib64/mpich/lib/libmpich.so.10
#8  0x00007ffd12d0faba in H5AC_receive_and_apply_clean_list (f=0x3a535d0, primary_dxpl_id=167772175, secondary_dxpl_id=167772175, cache_ptr=0x7ffd133ee010) at ../../src/H5AC.c:4133
#9  0x00007ffd12d0f942 in H5AC_propagate_flushed_and_still_clean_entries_list (f=0x3a535d0, dxpl_id=167772175, cache_ptr=0x7ffd133ee010) at ../../src/H5AC.c:4073
#10 0x00007ffd12d11347 in H5AC_rsp__p0_only__flush_to_min_clean (f=0x3a535d0, dxpl_id=167772175, cache_ptr=0x7ffd133ee010) at ../../src/H5AC.c:4769
#11 0x00007ffd12d114dc in H5AC_run_sync_point (f=0x3a535d0, dxpl_id=167772175, sync_point_op=0) at ../../src/H5AC.c:4854
#12 0x00007ffd12d093d3 in H5AC_insert_entry (f=0x3a535d0, dxpl_id=167772168, type=0x614d20 <types>, addr=1856348, thing=0x64d360 <data+220064>, flags=0) at ../../src/H5AC.c:1006
#13 0x0000000000406cf9 in insert_entry (cache_ptr=0x7ffd133ee010, file_ptr=0x3a535d0, idx=1058, flags=0) at ../../testpar/t_cache.c:2939
#14 0x000000000040a542 in smoke_check_1 (metadata_write_strategy=0) at ../../testpar/t_cache.c:5375
#15 0x000000000040da3d in main (argc=1, argv=0x7fffe754ac68) at ../../testpar/t_cache.c:7261

(gdb) bt
#0  0x00007f1aff58a2e0 in MPID_nem_tcp_connpoll () from /usr/lib64/mpich/lib/libmpich.so.10
#1  0x00007f1aff576b35 in MPIDI_CH3I_Progress () from /usr/lib64/mpich/lib/libmpich.so.10
#2  0x00007f1aff6205e8 in PMPI_Recv () from /usr/lib64/mpich/lib/libmpich.so.10
#3  0x0000000000403b24 in recv_mssg (mssg_ptr=0x7fff921b4210, mssg_tag_offset=11) at ../../testpar/t_cache.c:1050
#4  0x0000000000408b94 in verify_entry_writes (addr=250487, expected_entry_writes=1) at ../../testpar/t_cache.c:4506
#5  0x000000000040839b in verify_writes (num_writes=158, written_entries_tbl=0x282cda0) at ../../testpar/t_cache.c:4151
#6  0x00007f1b000abf5e in H5AC_receive_and_apply_clean_list (f=0x27fa5d0, primary_dxpl_id=167772175, secondary_dxpl_id=167772175, cache_ptr=0x7f1b0078a010) at ../../src/H5AC.c:4179
#7  0x00007f1b000ab942 in H5AC_propagate_flushed_and_still_clean_entries_list (f=0x27fa5d0, dxpl_id=167772175, cache_ptr=0x7f1b0078a010) at ../../src/H5AC.c:4073
#8  0x00007f1b000ad347 in H5AC_rsp__p0_only__flush_to_min_clean (f=0x27fa5d0, dxpl_id=167772175, cache_ptr=0x7f1b0078a010) at ../../src/H5AC.c:4769
#9  0x00007f1b000ad4dc in H5AC_run_sync_point (f=0x27fa5d0, dxpl_id=167772175, sync_point_op=0) at ../../src/H5AC.c:4854
#10 0x00007f1b000a53d3 in H5AC_insert_entry (f=0x27fa5d0, dxpl_id=167772168, type=0x614d20 <types>, addr=1856348, thing=0x64d360 <data+220064>, flags=0) at ../../src/H5AC.c:1006
#11 0x0000000000406cf9 in insert_entry (cache_ptr=0x7f1b0078a010, file_ptr=0x27fa5d0, idx=1058, flags=0) at ../../testpar/t_cache.c:2939
#12 0x000000000040a542 in smoke_check_1 (metadata_write_strategy=0) at ../../testpar/t_cache.c:5375
#13 0x000000000040da3d in main (argc=1, argv=0x7fff921b4668) at ../../testpar/t_cache.c:7261
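
To make backtraces like the ones above easier to compare, here is a
minimal sketch (not part of the hdf5 or mpich tooling) that pulls out the
MPI call a process is sitting in and the function that made the call.
The names parse_frames and blocked_in are made up for illustration, and
the two excerpts are copied from the first and second backtraces above.

```python
import re

# Start of a gdb frame: "#N  [0xADDR in ]function ..."
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")
# Trailing "at file:line" that gdb prints when debug info is available.
SOURCE_LOC = re.compile(r"\bat\s+(\S+):(\d+)\s*$")


def parse_frames(backtrace_text):
    """Return [(function, "file:line" or None), ...], innermost frame first."""
    joined = []
    for line in backtrace_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("(gdb)"):
            continue  # skip blank lines and the gdb prompt
        if stripped.startswith("#"):
            joined.append(stripped)
        elif joined:
            # gdb and mail clients wrap long frames; glue the pieces together.
            joined[-1] += " " + stripped
    frames = []
    for line in joined:
        m = FRAME_START.match(line)
        if m:
            loc = SOURCE_LOC.search(line)
            where = f"{loc.group(1)}:{loc.group(2)}" if loc else None
            frames.append((m.group(2), where))
    return frames


def blocked_in(backtrace_text):
    """Return (mpi_call, caller, caller_location), or None if no MPI call is found.

    Assumption: the public entry point shows up as PMPI_* or MPI_* in the trace.
    """
    frames = parse_frames(backtrace_text)
    for i, (func, _) in enumerate(frames):
        if func.startswith(("PMPI_", "MPI_")):
            caller, where = frames[i + 1] if i + 1 < len(frames) else (None, None)
            return func, caller, where
    return None


# Excerpt from the first backtrace above (wrapped as in the mail).
RECV_EXCERPT = """
#0  0x00007fdfdcd9fae2 in MPIDI_CH3I_Progress () from
/usr/lib64/mpich/lib/libmpich.so.10
#1  0x00007fdfdce495e8 in PMPI_Recv () from
/usr/lib64/mpich/lib/libmpich.so.10
#2  0x0000000000403b24 in recv_mssg (mssg_ptr=0x7fffccf6e2c0,
mssg_tag_offset=1) at ../../testpar/t_cache.c:1050
"""

# Excerpt from the second backtrace above (frames #7 and #8 only).
BCAST_EXCERPT = """
#7  0x00007ffd122147a2 in PMPI_Bcast () from
/usr/lib64/mpich/lib/libmpich.so.10
#8  0x00007ffd12d0faba in H5AC_receive_and_apply_clean_list
(f=0x3a535d0, primary_dxpl_id=167772175, secondary_dxpl_id=167772175,
cache_ptr=0x7ffd133ee010)
    at ../../src/H5AC.c:4133
"""

if __name__ == "__main__":
    for name, text in (("first", RECV_EXCERPT), ("second", BCAST_EXCERPT)):
        print(name, blocked_in(text))
```

On these excerpts it reports one process waiting in PMPI_Recv called
from recv_mssg (t_cache.c:1050) and another waiting in PMPI_Bcast called
from H5AC_receive_and_apply_clean_list (H5AC.c:4133), which matches the
full backtraces.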

Thanks for any help,

  Orion

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion at cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com


