[mpich-discuss] netcdf-fortran 4.5.4 test hang on s390x

Zhou, Hui zhouh at anl.gov
Thu Jan 13 23:19:21 CST 2022


The backtrace simply shows that the process is in the progress loop, i.e. waiting for a message from another process that never comes. The cause could be a bug in mpich, a bug in the application library, or a bug in the test.
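For illustration only (a generic sketch, not code from netcdf-fortran or HDF5): if one rank skips a collective that the others enter, the ranks that did enter sit in the progress loop forever, which is exactly the signature in your backtrace. With mpich, a toy program like this hangs the same way when run with mpiexec -np 4:

    /* mismatch.c -- hypothetical example, compile with mpicc */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, one = 1, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Ranks 0-2 enter the collective; rank 3 never does, so the
           other three block inside the progress engine waiting for a
           message that never arrives. */
        if (rank != 3)
            MPI_Allreduce(&one, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

In your trace the collective that never completes is the MPI_Allreduce issued from H5D__mpio_array_gatherv, so one thing to check is whether all four ranks actually reach that call.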

--
Hui
________________________________
From: Orion Poplawski via discuss <discuss at mpich.org>
Sent: Thursday, January 13, 2022 9:32 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Orion Poplawski <orion at nwra.com>
Subject: [mpich-discuss] netcdf-fortran 4.5.4 test hang on s390x

Hello -

   I'm working on updating the Fedora netcdf-fortran package to version
4.5.4.  However, a test is hanging when run on s390x (other arches are
fine).  This is with:

mpich-3.4.1-3.fc35.s390x
hdf5-mpich-1.12.1-1.fc36.s390x
netcdf-mpich-4.8.1-2.fc36.s390x

The test program is nf03_test4/f90tst_parallel_compressed and is run with:

mpiexec -np 4 ./f90tst_parallel_compressed
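
A backtrace of a stuck rank can be grabbed while the test is wedged by attaching gdb to one of the ranks (a sketch, assuming gdb is installed; <pid> stands for one of the f90tst_parallel_compressed processes):

    gdb --batch -ex 'thread apply all bt' -p <pid>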

Here's a backtrace where it seems to be stuck:

#0  0x000003ffb14fbc94 in MPID_nem_tcp_connpoll
(in_blocking_poll=<optimized out>) at
src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c:1822
#1  0x000003ffb14ef78c in MPID_nem_network_poll (in_blocking_progress=1)
at src/mpid/ch3/channels/nemesis/src/mpid_nem_network_poll.c:11
#2  MPID_nem_mpich_blocking_recv (completions=615, in_fbox=<synthetic
pointer>, cell=<synthetic pointer>) at
./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:964
#3  MPIDI_CH3I_Progress (progress_state=0x3ffdf66c168,
is_blocking=<optimized out>) at
src/mpid/ch3/channels/nemesis/src/ch3_progress.c:363
#4  0x000003ffb1b2ecc0 in MPIR_Wait_state (status=<optimized out>,
state=0x3ffdf66c24c, request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at
src/mpi/request/wait.c:32
#5  MPIR_Wait_impl.isra.0 (request_ptr=request_ptr@entry=0x3ffb1d17510
<MPIR_Request_direct>, status=<optimized out>) at src/mpi/request/wait.c:55
#6  0x000003ffb1484ca0 in MPID_Wait (status=0x1,
request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at
./src/mpid/ch3/include/mpidpost.h:264
#7  MPIC_Wait (request_ptr=0x3ffb1d17510 <MPIR_Request_direct>,
errflag=errflag@entry=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:64
#8  0x000003ffb1485a04 in MPIC_Sendrecv (sendbuf=<optimized out>,
sendcount=1, sendtype=<optimized out>, dest=<optimized out>,
sendtag=<optimized out>, recvbuf=0x2aa1bf8f8b0, recvcount=1,
recvtype=1275069445, source=1, recvtag=14,
     comm_ptr=0x3ffae42f690, status=<optimized out>,
errflag=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:333
#9  0x000003ffb141fe4e in MPIR_Allreduce_intra_recursive_doubling
(sendbuf=<optimized out>, recvbuf=0x3ffdf66c9e0, count=<optimized out>,
datatype=<optimized out>, op=<optimized out>, comm_ptr=0x3ffae42f690,
errflag=0x3ffdf66c8ec)
     at src/mpi/coll/allreduce/allreduce_intra_recursive_doubling.c:127
#10 0x000003ffb1322a68 in MPIR_Allreduce_allcomm_auto
(sendbuf=0x3ffdf66c9d0, sendbuf@entry=<error reading variable: value has
been optimized out>, recvbuf=0x3ffdf66c9e0,
     recvbuf@entry=<error reading variable: value has been optimized
out>, count=count@entry=1, datatype=datatype@entry=1275069445,
op=op@entry=1476395011, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec)
     at src/mpi/coll/allreduce/allreduce.c:145
#11 0x000003ffb1322c28 in MPIR_Allreduce_impl (sendbuf=<error reading
variable: value has been optimized out>, recvbuf=<error reading
variable: value has been optimized out>, count=<optimized out>,
datatype=<optimized out>,
     op=op@entry=1476395011, comm_ptr=0x3ffae42f690,
errflag=0x3ffdf66c8ec) at src/mpi/coll/allreduce/allreduce.c:210
#12 0x000003ffb1322d6a in MPID_Allreduce (errflag=0x3ffdf66c8ec,
comm=0x3ffae42f690, op=op@entry=1476395011, datatype=<optimized out>,
datatype@entry=1275069445, count=<optimized out>, count@entry=1,
     recvbuf=<error reading variable: value has been optimized out>,
sendbuf=<error reading variable: value has been optimized out>) at
./src/mpid/ch3/include/mpid_coll.h:41
#13 MPIR_Allreduce (sendbuf=<error reading variable: value has been
optimized out>, recvbuf=<error reading variable: value has been
optimized out>, count=count@entry=1, datatype=datatype@entry=1275069445,
op=op@entry=1476395011,
     comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec) at
src/mpi/coll/allreduce/allreduce.c:262
#14 0x000003ffb1323742 in PMPI_Allreduce (sendbuf=<optimized out>,
sendbuf@entry=<error reading variable: value has been optimized out>,
recvbuf=<optimized out>, recvbuf@entry=0x3ffdf66c9e0, count=count@entry=1,
     datatype=datatype@entry=1275069445, op=op@entry=1476395011,
comm=-1006632958) at src/mpi/coll/allreduce/allreduce.c:387
#15 0x000003ffb0e11644 in H5D__mpio_array_gatherv.constprop.0
(local_array=<optimized out>, local_array@entry=0x2aa1bfac440,
local_array_num_entries=<optimized out>, _gathered_array=<optimized out>,
     _gathered_array@entry=0x3ffdf66cb78,
_gathered_array_num_entries=_gathered_array_num_entries@entry=0x3ffdf66cb88,
allgather=allgather@entry=true, comm=-1006632958, sort_func=0x0, root=0,
array_entry_size=368)
     at ../../src/H5Dmpio.c:552
#16 0x000003ffb0e086cc in H5D__link_chunk_filtered_collective_io
(fm=0x2aa1bef9c60, type_info=0x3ffdf66ce00, io_info=0x3ffdf66ce80) at
../../src/H5Dmpio.c:1409
#17 H5D__chunk_collective_io (io_info=0x3ffdf66ce80,
type_info=0x3ffdf66ce00, fm=fm@entry=0x2aa1bef9c60) at
../../src/H5Dmpio.c:886
#18 0x000003ffb0e0abb4 in H5D__chunk_collective_write
(io_info=<optimized out>, type_info=<optimized out>, nelmts=<optimized
out>, file_space=<optimized out>, mem_space=0x2aa1bf8fac0,
fm=0x2aa1bef9c60) at ../../src/H5Dmpio.c:967
#19 0x000003ffb0b78702 in H5D__write (dataset=0x2aa1bf0cda0,
mem_type_id=<optimized out>, mem_space=<optimized out>,
file_space=0x2aa1beee270, buf=<optimized out>) at ../../src/H5Dio.c:761
#20 0x000003ffb0dd9598 in H5VL__native_dataset_write (obj=0x2aa1bf0cda0,
mem_type_id=216172782113784251, mem_space_id=<optimized out>,
file_space_id=288230376151711833, dxpl_id=<optimized out>,
buf=0x3ffa41b9010, req=0x0)
     at ../../src/H5VLnative_dataset.c:206
#21 0x000003ffb0db7b72 in H5VL__dataset_write (obj=<optimized out>,
cls=<optimized out>, mem_type_id=mem_type_id@entry=216172782113784251,
mem_space_id=mem_space_id@entry=288230376151711834,
     file_space_id=file_space_id@entry=288230376151711833,
dxpl_id=792633534417207358, buf=0x3ffa41b9010, req=0x0) at
../../src/H5VLcallback.c:2079
#22 0x000003ffb0dc1b9a in H5VL_dataset_write (vol_obj=0x2aa1bf0cd00,
mem_type_id=216172782113784251, mem_space_id=288230376151711834,
file_space_id=288230376151711833, dxpl_id=<optimized out>,
buf=0x3ffa41b9010, req=0x0)
     at ../../src/H5VLcallback.c:2111
#23 0x000003ffb0b7506c in H5Dwrite (dset_id=<optimized out>,
mem_type_id=216172782113784251, mem_space_id=288230376151711834,
file_space_id=288230376151711833, dxpl_id=792633534417207358,
buf=0x3ffa41b9010) at ../../src/H5Dio.c:289
#24 0x000003ffb1f8fdc8 in NC4_put_vars (ncid=<optimized out>, varid=0,
startp=0x0, countp=0x2aa1bf0e720, stridep=0x3ffdf66f388, data=<optimized
out>, mem_nc_type=5) at ../libhdf5/../../libhdf5/hdf5var.c:1594
#25 0x000003ffb1f90148 in NC4_put_vara (ncid=<optimized out>,
varid=<optimized out>, startp=<optimized out>, countp=<optimized out>,
op=<optimized out>, memtype=5) at ../libhdf5/../../libhdf5/hdf5var.c:1178
#26 0x000003ffb1f3b96c in NC_put_vara (ncid=<optimized out>,
varid=<optimized out>, start=0x2aa1bf8f620, edges=0x2aa1bf8f910,
value=value@entry=0x3ffa41b9010, memtype=5) at
../libdispatch/../../libdispatch/dvarput.c:97
#27 0x000003ffb1f3be08 in nc_put_vara_float (ncid=<optimized out>,
varid=<optimized out>, startp=<optimized out>, countp=<optimized out>,
op=0x3ffa41b9010) at ../libdispatch/../../libdispatch/dvarput.c:695
#28 0x000003ffb269464e in nf_put_vara_real_ (ncid=<optimized out>,
varid=<optimized out>, start=<error reading variable: value has been
optimized out>, counts=<error reading variable: value has been optimized
out>, rvals=...)
     at ../../fortran/nf_varaio.F90:372
#29 0x000003ffb26ede9e in netcdf::nf90_put_var_4d_fourbytereal
(ncid=<error reading variable: value has been optimized out>,
varid=<error reading variable: value has been optimized out>,
     values=<error reading variable: value has been optimized out>,
start=<error reading variable: value has been optimized out>,
     count=<error reading variable: value requires 27021597768417284
bytes, which is more than max-value-size>, stride=<error reading
variable: value has been optimized out>, map=<error reading variable:
value has been optimized out>)
     at netcdf_expanded.f90:1035
#30 0x000002aa1b1840ba in f90tst_parallel_compressed () at
../../nf03_test4/f90tst_parallel_compressed.F90:255
#31 0x000002aa1b182588 in main (argc=<optimized out>, argv=<optimized
out>) at ../../nf03_test4/f90tst_parallel_compressed.F90:14

It seems to be stuck in MPID_nem_mpich_blocking_recv.

I have no idea how to debug this further.  Any suggestions would be
greatly appreciated.

The hang does appear to have been introduced between netcdf-fortran 4.5.2 and
4.5.4; I can still build and run the tests with 4.5.2.
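
Failing other ideas, I may try bisecting the netcdf-fortran history between the two releases. A rough sketch, assuming the upstream Unidata git repository and that the release tags are named v4.5.2 and v4.5.4 (I have not checked the exact tag names):

    git clone https://github.com/Unidata/netcdf-fortran.git
    cd netcdf-fortran
    git bisect start v4.5.4 v4.5.2   # bad release first, then good
    # at each step: build, run the hanging test under a timeout, e.g.
    #   timeout 300 mpiexec -np 4 ./f90tst_parallel_compressed
    # then mark the commit with 'git bisect good' or 'git bisect bad'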

--
Orion Poplawski
he/him/his - surely the least important thing about me
Manager of NWRA Technical Systems          720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                 https://www.nwra.com/

