[mpich-discuss] mpich hanging on startup

Orion Poplawski orion@cora.nwra.com
Wed Jan 20 18:05:27 CST 2016


I've got a strange situation: I'm trying to build hdf5 1.8.16 for Fedora.  The
Fedora builders have no network access (/etc/resolv.conf is set to
"nameserver 127.0.0.1" and there is no local nameserver) for security
purposes.  The hdf5 t_mpi parallel test hangs on launch with mpich 3.2, but
only on the arm builders.

The processes seem to deadlock in the MPI code, each blocking in poll(),
presumably waiting for communication from the other process.
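
For what it's worth, the MPI-level call that hangs is just MPI_Comm_dup on
MPI_COMM_WORLD (H5Pset_fapl_mpio duplicates the communicator internally), so
an untested sketch as small as this should exercise the same code path,
without HDF5 in the picture:

/* Untested sketch: calls MPI_Comm_dup on MPI_COMM_WORLD, the same
 * MPI-level operation H5Pset_fapl_mpio performs in the traces below.
 * Internally MPI_Comm_dup allocates a context id via an allreduce,
 * which is where both processes end up blocked. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm dup;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);   /* presumably where it would hang */
    printf("rank %d: MPI_Comm_dup succeeded\n", rank);
    MPI_Comm_free(&dup);
    MPI_Finalize();
    return 0;
}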

Here are gdb stack traces of a two-process t_mpi job:

(gdb) bt
#0  0xb69176ec in poll () from /lib/libc.so.6
#1  0xb6b4b178 in poll (__timeout=-1, __nfds=<optimized out>,
    __fds=<optimized out>) at /usr/include/bits/poll2.h:46
#2  MPIDU_Sock_wait (sock_set=0x7f57b3e0,
    millisecond_timeout=millisecond_timeout@entry=-1, eventp=0xb6fc6000,
    eventp@entry=0xbeffe054) at src/mpid/common/sock/poll/sock_wait.i:123
#3  0xb6b30960 in MPIDI_CH3i_Progress_wait (progress_state=0xbeffe0a0)
    at src/mpid/ch3/channels/sock/src/ch3_progress.c:221
#4  MPIDI_CH3I_Progress (blocking=blocking@entry=1,
    state=state@entry=0xbeffe0a0)
    at src/mpid/ch3/channels/sock/src/ch3_progress.c:962
#5  0xb6abfeac in MPIC_Wait (request_ptr=0xb6c0a180 <MPID_Request_direct>,
    errflag=errflag@entry=0xbeffe330) at src/mpi/coll/helper_fns.c:225
#6  0xb6ac01cc in MPIC_Recv (buf=buf@entry=0x7f5a0820, count=count@entry=65,
    datatype=datatype@entry=1275069445, source=<optimized out>,
    tag=tag@entry=11, comm_ptr=0xb6bf0908 <MPID_Comm_direct>,
    comm_ptr@entry=0x7f586250, status=0xbeffe190,
    status@entry=0xb6ffec88 <__stack_chk_guard>, errflag=0xbeffe330,
    errflag@entry=0x7f586c80) at src/mpi/coll/helper_fns.c:355
#7  0xb6a0326c in MPIR_Reduce_binomial (errflag=0x7f586c80,
    comm_ptr=0x7f586250, root=-1231016932, op=-1090526356,
    datatype=1275069445, count=65, recvbuf=0xb6bf0908 <MPID_Comm_direct>,
    sendbuf=<optimized out>) at src/mpi/coll/reduce.c:181
#8  MPIR_Reduce_intra (sendbuf=sendbuf@entry=0xffffffff,
    recvbuf=recvbuf@entry=0xbeffe36c, count=count@entry=65,
    datatype=datatype@entry=1275069445, op=1476395014, op@entry=-1228891236,
    root=root@entry=0, comm_ptr=0xb6bf0908 <MPID_Comm_direct>,
    comm_ptr@entry=0x58000006, errflag=errflag@entry=0xbeffe330)
    at src/mpi/coll/reduce.c:874
#9  0xb6a02c1c in MPIR_Reduce_impl (sendbuf=0xffffffff,
    sendbuf@entry=0xb6bf1288 <MPID_Comm_builtin>,
    recvbuf=recvbuf@entry=0xbeffe36c,
    count=count@entry=65, datatype=datatype@entry=1275069445, op=1476395014,
    op@entry=-1228991864, root=root@entry=0,
    comm_ptr=comm_ptr@entry=0xb6bf0908 <MPID_Comm_direct>, errflag=0xbeffe330,
    errflag@entry=0xb6ffec88 <__stack_chk_guard>) at src/mpi/coll/reduce.c:1068
#10 0xb69f7b64 in MPIR_Allreduce_intra (sendbuf=0xb6bf1288 <MPID_Comm_builtin>,
    sendbuf@entry=0xffffffff, recvbuf=recvbuf@entry=0xbeffe36c,
    count=count@entry=65,
    datatype=datatype@entry=1275069445, op=op@entry=1476395014,
    comm_ptr=0xb6bf1288 <MPID_Comm_builtin>,
    comm_ptr@entry=0xb6acdd88 <MPIR_Get_contextid_sparse_group+824>,
    errflag=errflag@entry=0xbeffe330) at src/mpi/coll/allreduce.c:234
#11 0xb69f9414 in MPIR_Allreduce_impl (sendbuf=sendbuf@entry=0xffffffff,
    recvbuf=recvbuf@entry=0xbeffe36c, count=count@entry=65, datatype=1275069445,
    op=1476395014, comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    errflag=0xbeffe330, errflag@entry=0xbeffe4c8) at src/mpi/coll/allreduce.c:763
#12 0xb6acdd88 in MPIR_Get_contextid_sparse_group (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    group_ptr=0xb6998228 <__libc_malloc_initialized>, group_ptr@entry=0x0,
    tag=-1090526416, context_id=0xfffff000, context_id@entry=0xbeffe4c0,
    ignore_id=ignore_id@entry=0) at src/mpi/comm/contextid.c:496
#13 0xb6ace5a4 in MPIR_Get_contextid_sparse (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    context_id=context_id@entry=0xbeffe4c0, ignore_id=ignore_id@entry=0)
    at src/mpi/comm/contextid.c:298
#14 0xb6acc14c in MPIR_Comm_copy (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>, size=2,
    outcomm_ptr=outcomm_ptr@entry=0xbeffe538) at src/mpi/comm/commutil.c:736
#15 0xb6a50a48 in MPIR_Comm_dup_impl (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    newcomm_ptr=0xbeffe538,
    newcomm_ptr@entry=0xbeffe530) at src/mpi/comm/comm_dup.c:56
#16 0xb6a50cf0 in PMPI_Comm_dup (comm=1140850688, newcomm=0xbeffe574,
    newcomm@entry=0xbeffe56c) at src/mpi/comm/comm_dup.c:161
#17 0xb6daa0dc in H5FD_mpi_comm_info_dup (comm=<optimized out>, info=469762048,
    comm_new=comm_new@entry=0x7f5a0810, info_new=info_new@entry=0x7f5a0814)
    at ../../src/H5FDmpi.c:271
#18 0xb6daaa80 in H5FD_mpio_fapl_copy (_old_fa=0xbeffe64c)
    at ../../src/H5FDmpio.c:789
#19 0xb6d9e0ec in H5FD_pl_copy (copied_pl=0xbeffe5d0, old_pl=0xbeffe64c,
    pl_size=<optimized out>, copy_func=<optimized out>) at ../../src/H5FD.c:625
#20 H5FD_fapl_copy (copied_fapl=0xbeffe5d0, old_fapl=0xbeffe64c,
    driver_id=<optimized out>) at ../../src/H5FD.c:800
#21 H5FD_fapl_open (plist=plist@entry=0x7f5a06c8, driver_id=-1224955460,
    driver_id@entry=134217729, driver_info=driver_info@entry=0xbeffe64c)
    at ../../src/H5FD.c:753
#22 0xb6e6aea4 in H5P_set_driver (plist=plist@entry=0x7f5a06c8,
    new_driver_id=134217729, new_driver_info=0xbeffe64c,
    new_driver_info@entry=0xbeffe644) at ../../src/H5Pfapl.c:633
#23 0xb6dacec4 in H5Pset_fapl_mpio (fapl_id=<optimized out>, comm=1140850688,
    info=469762048) at ../../src/H5FDmpio.c:353
#24 0x7f5570e8 in parse_options (argv=<optimized out>, argc=<optimized out>)
    at ../../testpar/t_mpi.c:1041
#25 main (argc=0, argv=0x0) at ../../testpar/t_mpi.c:1106
(gdb) up 2
#2  MPIDU_Sock_wait (sock_set=0x7f57b3e0,
    millisecond_timeout=millisecond_timeout@entry=-1, eventp=0xb6fc6000,
    eventp@entry=0xbeffe054) at src/mpid/common/sock/poll/sock_wait.i:123
123                         n_fds = poll(sock_set->pollfds, sock_set->poll_array_elems,
(gdb) print millisecond_timeout
$1 = -1
(gdb) print *sock_set
$2 = {id = 0, starting_elem = 0, poll_array_sz = 32, poll_array_elems = 1,
  pollfds = 0x7f57b588, pollinfos = 0x7f57b690, eventq_head = 0x0,
  eventq_tail = 0x0, pollfds_active = 0x0, pollfds_updated = 0,
  wakeup_posted = 0, intr_sock = 0x0, intr_fds = {-1, -1}}
(gdb) print *sock_set->pollfds
$3 = {fd = 10, events = 1, revents = 0}



#0  0xb69176ec in poll () from /lib/libc.so.6
#1  0xb6b4b178 in poll (__timeout=-1, __nfds=<optimized out>,
    __fds=<optimized out>) at /usr/include/bits/poll2.h:46
#2  MPIDU_Sock_wait (sock_set=0x7f57b3e0,
    millisecond_timeout=millisecond_timeout@entry=-1, eventp=0x7f5e5700,
    eventp@entry=0xbeffe074) at src/mpid/common/sock/poll/sock_wait.i:123
#3  0xb6b30960 in MPIDI_CH3i_Progress_wait (progress_state=0xbeffe0c0)
    at src/mpid/ch3/channels/sock/src/ch3_progress.c:221
#4  MPIDI_CH3I_Progress (blocking=blocking@entry=1,
    state=state@entry=0xbeffe0c0)
    at src/mpid/ch3/channels/sock/src/ch3_progress.c:962
#5  0xb6abfeac in MPIC_Wait (request_ptr=0xb6c0a180 <MPID_Request_direct>,
    errflag=errflag@entry=0xbeffe330) at src/mpi/coll/helper_fns.c:225
#6  0xb6ac000c in MPIC_Send (buf=buf@entry=0xb6bf0908 <MPID_Comm_direct>,
    count=count@entry=65, datatype=datatype@entry=1275069445,
    dest=<optimized out>, tag=tag@entry=11,
    comm_ptr=0xb6bf0908 <MPID_Comm_direct>, comm_ptr@entry=0x7f586250,
    errflag=0xbeffe330, errflag@entry=0x7f586c80)
    at src/mpi/coll/helper_fns.c:302
#7  0xb6a03438 in MPIR_Reduce_binomial (errflag=0x7f586c80,
    comm_ptr=0x7f586250, root=-1231016932, op=-1090526356,
    datatype=1275069445, count=65, recvbuf=0xb6bf0908 <MPID_Comm_direct>,
    sendbuf=<optimized out>) at src/mpi/coll/reduce.c:210
#8  MPIR_Reduce_intra (sendbuf=sendbuf@entry=0xbeffe36c,
    recvbuf=recvbuf@entry=0x0, count=count@entry=65,
    datatype=datatype@entry=1275069445, op=1476395014, op@entry=-1228891236,
    root=root@entry=0, comm_ptr=0xb6bf0908 <MPID_Comm_direct>,
    comm_ptr@entry=0x58000006, errflag=errflag@entry=0xbeffe330)
    at src/mpi/coll/reduce.c:874
#9  0xb6a02c1c in MPIR_Reduce_impl (sendbuf=sendbuf@entry=0xbeffe36c,
    recvbuf=recvbuf@entry=0x0, count=count@entry=65,
    datatype=datatype@entry=1275069445, op=1476395014, op@entry=-1228991864,
    root=root@entry=0, comm_ptr=comm_ptr@entry=0xb6bf0908 <MPID_Comm_direct>,
    errflag=0xbeffe330, errflag@entry=0xb6ffec88 <__stack_chk_guard>)
    at src/mpi/coll/reduce.c:1068
#10 0xb69f8b94 in MPIR_Allreduce_intra (
    sendbuf=0xb6bf1288 <MPID_Comm_builtin>, sendbuf@entry=0xffffffff,
    recvbuf=recvbuf@entry=0xbeffe36c, count=count@entry=65,
    datatype=datatype@entry=1275069445, op=op@entry=1476395014,
    comm_ptr=0xb6bf1288 <MPID_Comm_builtin>,
    comm_ptr@entry=0xb6acdd88 <MPIR_Get_contextid_sparse_group+824>,
    errflag=errflag@entry=0xbeffe330) at src/mpi/coll/allreduce.c:226
#11 0xb69f9414 in MPIR_Allreduce_impl (sendbuf=sendbuf@entry=0xffffffff,
    recvbuf=recvbuf@entry=0xbeffe36c, count=count@entry=65,
    datatype=1275069445, op=1476395014,
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    errflag=0xbeffe330, errflag@entry=0xbeffe4c8)
    at src/mpi/coll/allreduce.c:763
#12 0xb6acdd88 in MPIR_Get_contextid_sparse_group (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    group_ptr=0xb6998228 <__libc_malloc_initialized>, group_ptr@entry=0x0,
    tag=-1090526416, context_id=0xfffff000, context_id@entry=0xbeffe4c0,
    ignore_id=ignore_id@entry=0) at src/mpi/comm/contextid.c:496
#13 0xb6ace5a4 in MPIR_Get_contextid_sparse (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    context_id=context_id@entry=0xbeffe4c0, ignore_id=ignore_id@entry=0)
    at src/mpi/comm/contextid.c:298
#14 0xb6acc14c in MPIR_Comm_copy (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>, size=2,
    outcomm_ptr=outcomm_ptr@entry=0xbeffe538) at src/mpi/comm/commutil.c:736
#15 0xb6a50a48 in MPIR_Comm_dup_impl (
    comm_ptr=comm_ptr@entry=0xb6bf1288 <MPID_Comm_builtin>,
    newcomm_ptr=0xbeffe538, newcomm_ptr@entry=0xbeffe530)
    at src/mpi/comm/comm_dup.c:56
#16 0xb6a50cf0 in PMPI_Comm_dup (comm=1140850688, newcomm=0xbeffe574,
    newcomm@entry=0xbeffe56c) at src/mpi/comm/comm_dup.c:161
#17 0xb6daa0dc in H5FD_mpi_comm_info_dup (comm=<optimized out>,
    info=469762048, comm_new=comm_new@entry=0x7f5a0810,
    info_new=info_new@entry=0x7f5a0814) at ../../src/H5FDmpi.c:271
#18 0xb6daaa80 in H5FD_mpio_fapl_copy (_old_fa=0xbeffe64c)
    at ../../src/H5FDmpio.c:789
#19 0xb6d9e0ec in H5FD_pl_copy (copied_pl=0xbeffe5d0, old_pl=0xbeffe64c,
    pl_size=<optimized out>, copy_func=<optimized out>) at ../../src/H5FD.c:625
#20 H5FD_fapl_copy (copied_fapl=0xbeffe5d0, old_fapl=0xbeffe64c,
    driver_id=<optimized out>) at ../../src/H5FD.c:800
#21 H5FD_fapl_open (plist=plist@entry=0x7f5a06c8, driver_id=-1224955460,
    driver_id@entry=134217729, driver_info=driver_info@entry=0xbeffe64c)
    at ../../src/H5FD.c:753
#22 0xb6e6aea4 in H5P_set_driver (plist=plist@entry=0x7f5a06c8,
    new_driver_id=134217729, new_driver_info=0xbeffe64c,
    new_driver_info@entry=0xbeffe644) at ../../src/H5Pfapl.c:633
#23 0xb6dacec4 in H5Pset_fapl_mpio (fapl_id=<optimized out>, comm=1140850688,
    info=469762048) at ../../src/H5FDmpio.c:353
#24 0x7f5570e8 in parse_options (argv=<optimized out>, argc=<optimized out>)
    at ../../testpar/t_mpi.c:1041
#25 main (argc=0, argv=0x0) at ../../testpar/t_mpi.c:1106
(gdb) up
#2  MPIDU_Sock_wait (sock_set=0x7f57b3e0,
    millisecond_timeout=millisecond_timeout@entry=-1, eventp=0x7f5e5700,
    eventp@entry=0xbeffe074) at src/mpid/common/sock/poll/sock_wait.i:123
123                         n_fds = poll(sock_set->pollfds, sock_set->poll_array_elems,
(gdb) list
118                      just use the same code as above.  Otherwise, use
119                      multithreaded code (and we don't then need the
120                      MPIU_THREAD_CHECK_BEGIN/END macros) */
121                     if (!MPIR_ThreadInfo.isThreaded) {
122                         MPIDI_FUNC_ENTER(MPID_STATE_POLL);
123                         n_fds = poll(sock_set->pollfds, sock_set->poll_array_elems,
124                                      millisecond_timeout);
125                         MPIDI_FUNC_EXIT(MPID_STATE_POLL);
126                     }
127                     else
(gdb) print *sock_set
$2 = {id = 0, starting_elem = 0, poll_array_sz = 32, poll_array_elems = 1,
  pollfds = 0x7f57b588, pollinfos = 0x7f57b690, eventq_head = 0x0,
  eventq_tail = 0x0, pollfds_active = 0x0, pollfds_updated = 0,
  wakeup_posted = 0, intr_sock = 0x0, intr_fds = {-1, -1}}
(gdb) print millisecond_timeout
$3 = -1
(gdb) print *sock_set->pollfds
$4 = {fd = 6, events = 1, revents = 0}
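
So both processes are sitting in what amounts to a plain blocking poll() on a
single connected socket, waiting for POLLIN (events = 1) with an infinite
timeout.  Stripped of the MPICH wrappers, the wait is equivalent to this
(sketch only; wait_for_peer is just an illustrative name, and fd is 10 in one
process and 6 in the other):

#include <poll.h>

/* Equivalent of the blocked call in both traces: one pollfd asking for
 * POLLIN, timeout -1, so poll() does not return until the peer's data
 * arrives, which in this hang it never does. */
int wait_for_peer(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    return poll(&pfd, 1, -1);   /* -1 => block indefinitely */
}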


Is there anything else that might be useful in tracking this down?

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion@nwra.com
Boulder, CO 80301                   http://www.nwra.com