[mpich-discuss] netcdf-fortran 4.5.4 test hang on s390x

Orion Poplawski orion at nwra.com
Thu Jan 13 21:32:08 CST 2022


Hello -

   I'm working on updating the Fedora netcdf-fortran package to version 
4.5.4.  However, a test is hanging when run on s390x (other arches are 
fine).  This is with:

mpich-3.4.1-3.fc35.s390x
hdf5-mpich-1.12.1-1.fc36.s390x
netcdf-mpich-4.8.1-2.fc36.s390x

The test program is nf03_test4/f90tst_parallel_compressed and is run with:

mpiexec -np 4 ./f90tst_parallel_compressed

Here's a backtrace where it seems to be stuck:

#0  0x000003ffb14fbc94 in MPID_nem_tcp_connpoll 
(in_blocking_poll=<optimized out>) at 
src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c:1822
#1  0x000003ffb14ef78c in MPID_nem_network_poll (in_blocking_progress=1) 
at src/mpid/ch3/channels/nemesis/src/mpid_nem_network_poll.c:11
#2  MPID_nem_mpich_blocking_recv (completions=615, in_fbox=<synthetic 
pointer>, cell=<synthetic pointer>) at 
./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:964
#3  MPIDI_CH3I_Progress (progress_state=0x3ffdf66c168, 
is_blocking=<optimized out>) at 
src/mpid/ch3/channels/nemesis/src/ch3_progress.c:363
#4  0x000003ffb1b2ecc0 in MPIR_Wait_state (status=<optimized out>, 
state=0x3ffdf66c24c, request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at 
src/mpi/request/wait.c:32
#5  MPIR_Wait_impl.isra.0 (request_ptr=request_ptr at entry=0x3ffb1d17510 
<MPIR_Request_direct>, status=<optimized out>) at src/mpi/request/wait.c:55
#6  0x000003ffb1484ca0 in MPID_Wait (status=0x1, 
request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at 
./src/mpid/ch3/include/mpidpost.h:264
#7  MPIC_Wait (request_ptr=0x3ffb1d17510 <MPIR_Request_direct>, 
errflag=errflag at entry=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:64
#8  0x000003ffb1485a04 in MPIC_Sendrecv (sendbuf=<optimized out>, 
sendcount=1, sendtype=<optimized out>, dest=<optimized out>, 
sendtag=<optimized out>, recvbuf=0x2aa1bf8f8b0, recvcount=1, 
recvtype=1275069445, source=1, recvtag=14,
     comm_ptr=0x3ffae42f690, status=<optimized out>, 
errflag=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:333
#9  0x000003ffb141fe4e in MPIR_Allreduce_intra_recursive_doubling 
(sendbuf=<optimized out>, recvbuf=0x3ffdf66c9e0, count=<optimized out>, 
datatype=<optimized out>, op=<optimized out>, comm_ptr=0x3ffae42f690, 
errflag=0x3ffdf66c8ec)
     at src/mpi/coll/allreduce/allreduce_intra_recursive_doubling.c:127
#10 0x000003ffb1322a68 in MPIR_Allreduce_allcomm_auto 
(sendbuf=0x3ffdf66c9d0, sendbuf at entry=<error reading variable: value has 
been optimized out>, recvbuf=0x3ffdf66c9e0,
     recvbuf at entry=<error reading variable: value has been optimized 
out>, count=count at entry=1, datatype=datatype at entry=1275069445, 
op=op at entry=1476395011, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec)
     at src/mpi/coll/allreduce/allreduce.c:145
#11 0x000003ffb1322c28 in MPIR_Allreduce_impl (sendbuf=<error reading 
variable: value has been optimized out>, recvbuf=<error reading 
variable: value has been optimized out>, count=<optimized out>, 
datatype=<optimized out>,
     op=op at entry=1476395011, comm_ptr=0x3ffae42f690, 
errflag=0x3ffdf66c8ec) at src/mpi/coll/allreduce/allreduce.c:210
#12 0x000003ffb1322d6a in MPID_Allreduce (errflag=0x3ffdf66c8ec, 
comm=0x3ffae42f690, op=op at entry=1476395011, datatype=<optimized out>, 
datatype at entry=1275069445, count=<optimized out>, count at entry=1,
     recvbuf=<error reading variable: value has been optimized out>, 
sendbuf=<error reading variable: value has been optimized out>) at 
./src/mpid/ch3/include/mpid_coll.h:41
#13 MPIR_Allreduce (sendbuf=<error reading variable: value has been 
optimized out>, recvbuf=<error reading variable: value has been 
optimized out>, count=count at entry=1, datatype=datatype at entry=1275069445, 
op=op at entry=1476395011,
     comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec) at 
src/mpi/coll/allreduce/allreduce.c:262
#14 0x000003ffb1323742 in PMPI_Allreduce (sendbuf=<optimized out>, 
sendbuf at entry=<error reading variable: value has been optimized out>, 
recvbuf=<optimized out>, recvbuf at entry=0x3ffdf66c9e0, count=count at entry=1,
     datatype=datatype at entry=1275069445, op=op at entry=1476395011, 
comm=-1006632958) at src/mpi/coll/allreduce/allreduce.c:387
#15 0x000003ffb0e11644 in H5D__mpio_array_gatherv.constprop.0 
(local_array=<optimized out>, local_array at entry=0x2aa1bfac440, 
local_array_num_entries=<optimized out>, _gathered_array=<optimized out>,
     _gathered_array at entry=0x3ffdf66cb78, 
_gathered_array_num_entries=_gathered_array_num_entries at entry=0x3ffdf66cb88, 
allgather=allgather at entry=true, comm=-1006632958, sort_func=0x0, root=0, 
array_entry_size=368)
     at ../../src/H5Dmpio.c:552
#16 0x000003ffb0e086cc in H5D__link_chunk_filtered_collective_io 
(fm=0x2aa1bef9c60, type_info=0x3ffdf66ce00, io_info=0x3ffdf66ce80) at 
../../src/H5Dmpio.c:1409
#17 H5D__chunk_collective_io (io_info=0x3ffdf66ce80, 
type_info=0x3ffdf66ce00, fm=fm at entry=0x2aa1bef9c60) at 
../../src/H5Dmpio.c:886
#18 0x000003ffb0e0abb4 in H5D__chunk_collective_write 
(io_info=<optimized out>, type_info=<optimized out>, nelmts=<optimized 
out>, file_space=<optimized out>, mem_space=0x2aa1bf8fac0, 
fm=0x2aa1bef9c60) at ../../src/H5Dmpio.c:967
#19 0x000003ffb0b78702 in H5D__write (dataset=0x2aa1bf0cda0, 
mem_type_id=<optimized out>, mem_space=<optimized out>, 
file_space=0x2aa1beee270, buf=<optimized out>) at ../../src/H5Dio.c:761
#20 0x000003ffb0dd9598 in H5VL__native_dataset_write (obj=0x2aa1bf0cda0, 
mem_type_id=216172782113784251, mem_space_id=<optimized out>, 
file_space_id=288230376151711833, dxpl_id=<optimized out>, 
buf=0x3ffa41b9010, req=0x0)
     at ../../src/H5VLnative_dataset.c:206
#21 0x000003ffb0db7b72 in H5VL__dataset_write (obj=<optimized out>, 
cls=<optimized out>, mem_type_id=mem_type_id at entry=216172782113784251, 
mem_space_id=mem_space_id at entry=288230376151711834,
     file_space_id=file_space_id at entry=288230376151711833, 
dxpl_id=792633534417207358, buf=0x3ffa41b9010, req=0x0) at 
../../src/H5VLcallback.c:2079
#22 0x000003ffb0dc1b9a in H5VL_dataset_write (vol_obj=0x2aa1bf0cd00, 
mem_type_id=216172782113784251, mem_space_id=288230376151711834, 
file_space_id=288230376151711833, dxpl_id=<optimized out>, 
buf=0x3ffa41b9010, req=0x0)
     at ../../src/H5VLcallback.c:2111
#23 0x000003ffb0b7506c in H5Dwrite (dset_id=<optimized out>, 
mem_type_id=216172782113784251, mem_space_id=288230376151711834, 
file_space_id=288230376151711833, dxpl_id=792633534417207358, 
buf=0x3ffa41b9010) at ../../src/H5Dio.c:289
#24 0x000003ffb1f8fdc8 in NC4_put_vars (ncid=<optimized out>, varid=0, 
startp=0x0, countp=0x2aa1bf0e720, stridep=0x3ffdf66f388, data=<optimized 
out>, mem_nc_type=5) at ../libhdf5/../../libhdf5/hdf5var.c:1594
#25 0x000003ffb1f90148 in NC4_put_vara (ncid=<optimized out>, 
varid=<optimized out>, startp=<optimized out>, countp=<optimized out>, 
op=<optimized out>, memtype=5) at ../libhdf5/../../libhdf5/hdf5var.c:1178
#26 0x000003ffb1f3b96c in NC_put_vara (ncid=<optimized out>, 
varid=<optimized out>, start=0x2aa1bf8f620, edges=0x2aa1bf8f910, 
value=value at entry=0x3ffa41b9010, memtype=5) at 
../libdispatch/../../libdispatch/dvarput.c:97
#27 0x000003ffb1f3be08 in nc_put_vara_float (ncid=<optimized out>, 
varid=<optimized out>, startp=<optimized out>, countp=<optimized out>, 
op=0x3ffa41b9010) at ../libdispatch/../../libdispatch/dvarput.c:695
#28 0x000003ffb269464e in nf_put_vara_real_ (ncid=<optimized out>, 
varid=<optimized out>, start=<error reading variable: value has been 
optimized out>, counts=<error reading variable: value has been optimized 
out>, rvals=...)
     at ../../fortran/nf_varaio.F90:372
#29 0x000003ffb26ede9e in netcdf::nf90_put_var_4d_fourbytereal 
(ncid=<error reading variable: value has been optimized out>, 
varid=<error reading variable: value has been optimized out>,
     values=<error reading variable: value has been optimized out>, 
start=<error reading variable: value has been optimized out>,
     count=<error reading variable: value requires 27021597768417284 
bytes, which is more than max-value-size>, stride=<error reading 
variable: value has been optimized out>, map=<error reading variable: 
value has been optimized out>)
     at netcdf_expanded.f90:1035
#30 0x000002aa1b1840ba in f90tst_parallel_compressed () at 
../../nf03_test4/f90tst_parallel_compressed.F90:255
#31 0x000002aa1b182588 in main (argc=<optimized out>, argv=<optimized 
out>) at ../../nf03_test4/f90tst_parallel_compressed.F90:14

It seems to be stuck in MPID_nem_mpich_blocking_recv.

I have no idea how to debug this further.  Any suggestions would be 
greatly appreciated.

It does appear to have been introduced between netcdf-fortran 4.5.2 and 
4.5.4.  I can still build and run the tests for 4.5.2.

-- 
Orion Poplawski
he/him/his - surely the least important thing about me
Manager of NWRA Technical Systems          720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                 https://www.nwra.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3847 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20220113/104baeba/attachment.p7s>


More information about the discuss mailing list