[mpich-devel] segfault calling neighbor collectives in communicator with no topology

Lisandro Dalcin dalcinl at gmail.com
Tue Apr 30 02:53:36 CDT 2013


I'm adding support for the MPI-3 neighborhood collectives to mpi4py.
By mistake, I called a neighbor collective on COMM_SELF and got a
segfault. Running under valgrind produces the trace below.

It seems that MPICH (I'm running 3.0.4) does not check whether the
communicator has a topology attached. This should be fixed in
MPIR_Topo_canon_nhb_count() in src/mpi/topo/topoutil.c by adding a
check after the following line:

    topo_ptr = MPIR_Topology_get(comm_ptr);

BTW, the same kind of check should also be added to MPIR_Topo_canon_nhb().
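
A minimal, self-contained sketch of the kind of check meant above. All
types and names here (topo_t, comm_t, topo_get, nhb_count, OK,
ERR_TOPOLOGY) are hypothetical stand-ins chosen for illustration, not
MPICH's actual internals. It assumes the topology lookup returns NULL
when no topology is attached, which is consistent with the read at
address 0x0 in the trace; a real fix would presumably report an MPI
error such as MPI_ERR_TOPOLOGY instead of the placeholder code used here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, NOT MPICH's real types. */
enum { OK = 0, ERR_TOPOLOGY = 1 };

typedef struct {
    int indegree;   /* number of incoming neighbors */
    int outdegree;  /* number of outgoing neighbors */
} topo_t;

typedef struct {
    topo_t *topo;   /* NULL when no topology is attached (assumption) */
} comm_t;

/* Stand-in for the topology lookup: returns NULL if none is attached. */
topo_t *topo_get(const comm_t *comm)
{
    assert(comm != NULL);  /* a NULL communicator is a caller bug */
    return comm->topo;
}

/* Stand-in for the neighbor-count query, with the missing check added. */
int nhb_count(const comm_t *comm, int *indegree, int *outdegree)
{
    topo_t *topo_ptr = topo_get(comm);

    /* The check: fail with an error instead of dereferencing NULL. */
    if (topo_ptr == NULL)
        return ERR_TOPOLOGY;

    *indegree = topo_ptr->indegree;
    *outdegree = topo_ptr->outdegree;
    return OK;
}
```

With a communicator whose topo field is NULL (the COMM_SELF case here),
nhb_count() returns ERR_TOPOLOGY and leaves the output arguments
untouched, so the caller can raise an error rather than crash.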


==14696== Invalid read of size 4
==14696==    at 0xDE0ED39: MPIR_Topo_canon_nhb_count (topoutil.c:283)
==14696==    by 0xDFA870A: MPIR_Ineighbor_allgather_default (inhb_allgather.c:50)
==14696==    by 0xDFA8B8B: MPIR_Ineighbor_allgather_impl (inhb_allgather.c:98)
==14696==    by 0xDFAE27C: MPIR_Neighbor_allgather_default (nhb_allgather.c:37)
==14696==    by 0xDFAE350: MPIR_Neighbor_allgather_impl (nhb_allgather.c:58)
==14696==    by 0xDFAE918: PMPI_Neighbor_allgather (nhb_allgather.c:155)
==14696==    by 0xDAD7B77: __pyx_pw_6mpi4py_3MPI_9Intracomm_25Neighbor_allgather (mpi4py.MPI.c:87767)
==14696==    by 0x31784DD280: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==14696==    by 0x31784DCEF0: PyEval_EvalFrameEx (in /usr/lib64/libpython2.7.so.1.0)
==14696==    by 0x31784DDCBE: PyEval_EvalCodeEx (in /usr/lib64/libpython2.7.so.1.0)
==14696==    by 0x317846DA36: ??? (in /usr/lib64/libpython2.7.so.1.0)
==14696==    by 0x3178449C0D: PyObject_Call (in /usr/lib64/libpython2.7.so.1.0)
==14696==  Address 0x0 is not stack'd, malloc'd or (recently) free'd


--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169

