[mpich-discuss] Segmentation violation when launching 182 local processes

Edric Ellis eellis at mathworks.com
Fri Jan 14 03:19:37 CST 2022


There appears to be a problem in MPICH v3.4.3 when launching a job that uses 182 or more processes on a single machine. The problem does not occur in MPICH v3.3.2; I did not check any other releases. I'm configuring MPICH like this on a 64-bit Linux system:

$ ./configure --enable-fortran --enable-shared --disable-cairo --disable-cpuid --disable-libxml2 --disable-pci --disable-opencl --disable-cuda --disable-nvml --disable-gl --disable-libnuma --disable-libudev --prefix=/local/mpich-install-3.4.3 --enable-shared --disable-static --enable-g=dbg --with-device=ch3:nemesis


uname -m = x86_64
uname -r = 4.19.0-18-amd64
uname -s = Linux
uname -v = #1 SMP Debian 4.19.208-1 (2021-09-29)


gcc (Debian 8.3.0-6) 8.3.0

Launching any MPI program using "mpiexec -hosts localhost -n 182" fails with a segmentation fault, while "-n 181" works.

I got valgrind to emit this (the line numbers might be a little awry because I was fiddling with mpid_nem_init.c to try to figure things out):

==3395== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3395==  Access not within mapped region at address 0xFFFFFFFF85DB6E40
==3395==    at 0x4C7E838: MPL_atomic_relaxed_store_int (mpl_atomic_c11.h:99)
==3395==    by 0x4C7E838: MPID_nem_init (mpid_nem_init.c:375)
==3395==    by 0x4C65075: MPIDI_CH3_Init (ch3_init.c:94)
==3395==    by 0x4C41BD0: MPID_Init (mpid_init.c:162)
==3395==    by 0x4A7E9CF: MPIR_Init_thread (initthread.c:158)
==3395==    by 0x4A7E62B: PMPI_Init (init.c:131)
==3395==    by 0x1091F7: main (check.c:15)

The problem appears to be in the MPID_nem_init function, which stores num_local as an "int" - when num_local reaches 182, some of the computations involving it overflow the int range. For example, the first argument of MPL_MAX in the line below overflows:

size_t fbox_len = MPL_MAX((num_local*((num_local-1) * MPID_NEM_FBOX_LEN)),

I was able to stop the initial crash by fixing that call and the loop involving MAILBOX_INDEX, but I've no idea whether other similar problems are lurking...

I discovered this because we have a regression test dating back to a similar report: https://lists.mcs.anl.gov/pipermail/mpich-discuss/2012-January/011619.html - that earlier problem was encountered by one of our users on a high-core-count system.
