[mpich-discuss] mpiexec Hydra and large numbers of command line options

Doug Johnson djohnson at osc.edu
Mon Feb 25 08:37:11 CST 2019


Hi,

We have encountered a bug in mpiexec that is triggered when 1000 or
more arguments are passed on the command line.  It was uncovered
through a utility called pbsdcp, which uses MPI_Bcast to efficiently
copy files to internal disks for parallel jobs.  If someone uses a
wildcard to pass a list of files to copy, the shell can expand it into
a very large number of arguments to mpiexec, which then crashes with a
segmentation fault.

The problem can be trivially reproduced with the nonsensical command
line shown below.

owens-rw01:~> mpiexec `seq 999`
[proxy:0:0@owens-rw01.ten.osc.edu] HYDU_create_process (utils/launch/launch.c:75): execvp error on file 1 (No such file or directory)
owens-rw01:~> mpiexec `seq 1000`
Segmentation fault
owens-rw01:~> mpiexec --version
HYDRA build details:
    Version:                                 3.2.1
    Release Date:                            General Availability Release
    CC:                              icc
    CXX:                             icpc
    F77:                             ifort
    F90:                             ifort
    Configure options:                       '--disable-option-checking' '--prefix=/opt/mvapich2/intel/18.0/2.3' '--enable-shared' '--with-mpe' '--enable-romio' '--enable-mpit-pvars=mv2' '--with-file-system=ufs+nfs+gpfs' '--with-pbs=/opt/torque' '--with-pbs-lib=/opt/torque/lib64' '--with-pbs-include=/opt/torque/include' 'CC=icc' 'CXX=icpc' 'FC=ifort' 'F77=ifort' '--cache-file=/dev/null' '--srcdir=.' 'CFLAGS= -DNDEBUG -DNVALGRIND -O2' 'LDFLAGS=-L/lib -L/lib -L/lib -Wl,-rpath,/lib -L/lib -Wl,-rpath,/lib -L/lib -L/lib' 'LIBS=-libmad -lrdmacm -libumad -libverbs -lrt -lpthread ' 'CPPFLAGS= -I/dev/shm/tmp.PxAZLCzt4P/src/mpl/include -I/dev/shm/tmp.PxAZLCzt4P/src/mpl/include -I/dev/shm/tmp.PxAZLCzt4P/src/openpa/src -I/dev/shm/tmp.PxAZLCzt4P/src/openpa/src -D_REENTRANT -I/dev/shm/tmp.PxAZLCzt4P/src/mpi/romio/include -I/include -I/include -I/include -I/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge pbs manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Checkpointing libraries available:
    Demux engines available:                 poll select


The root cause seems to be this definition in
src/pm/hydra/include/hydra.h:

#define HYD_NUM_TMP_STRINGS 1000

This constant is used to size statically allocated arrays when
processing argv in mpiexec.  We could increase the value as a
short-term workaround, which may fix some use cases (the limit reported
by 'getconf ARG_MAX' could serve as an upper bound).  However, a
cleaner fix is probably to pass argc along with argv so that dynamic
allocation can be used when processing arguments.
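
To make the second suggestion concrete, here is a minimal sketch of
what sizing the allocation from argc could look like.  This is not
Hydra code: copy_args and free_args are hypothetical helper names, and
the real fix would have to fit Hydra's own allocation and error
handling conventions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Hydra): duplicate argv[0..argc-1]
 * into a heap-allocated, NULL-terminated array whose size comes from
 * argc rather than from a compile-time constant.
 * Returns NULL if argc is negative or an allocation fails. */
char **copy_args(int argc, char **argv)
{
    if (argc < 0)
        return NULL;

    /* One extra slot for the terminating NULL that execvp() expects. */
    char **copy = malloc(((size_t) argc + 1) * sizeof *copy);
    if (copy == NULL)
        return NULL;

    for (int i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]) + 1;   /* include the '\0' */
        copy[i] = malloc(len);
        if (copy[i] == NULL) {
            /* Release what was built so far before reporting failure. */
            while (i-- > 0)
                free(copy[i]);
            free(copy);
            return NULL;
        }
        memcpy(copy[i], argv[i], len);
    }
    copy[argc] = NULL;
    return copy;
}

/* Hypothetical helper: release an array returned by copy_args(). */
void free_args(char **args)
{
    if (args == NULL)
        return;
    for (char **p = args; *p != NULL; p++)
        free(*p);
    free(args);
}
```

A caller would pass argc and argv straight from main() and release the
result with free_args() when done.  With this approach the only limit
on the number of arguments is the one the operating system already
enforces.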


Thanks,
Doug


