[mpich-discuss] hydra crashes with high number of processes

Thomas Ropars thomas.ropars at epfl.ch
Wed Jul 24 08:01:30 CDT 2013


Hi,

I'm working with MPICH 3.0.4 and I get a segfault in Hydra when I try to 
run an application on a large number of processes (8192).

I simply run the following command:
mpirun -f ~/machine_list -n 8192 my_exec_file

I ran mpirun under valgrind to identify the problem, and here is 
the output:
Invalid read of size 1
==44266==    at 0x4A077F2: __GI_strlen (mc_replace_strmem.c:284)
==44266==    by 0x3D774802B5: strdup (in /lib64/libc-2.12.so)
==44266==    by 0x40EBA5: HYD_pmcd_pmi_fill_in_exec_launch_info (pmiserv_utils.c:375)
==44266==    by 0x40A5C2: HYD_pmci_launch_procs (pmiserv_pmci.c:121)
==44266==    by 0x403A1E: main (mpiexec.c:326)
==44266==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
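
The trace is consistent with strdup() being handed a NULL pointer at
pmiserv_utils.c:375: strlen() then reads one byte at address 0x0. Which
value is NULL cannot be told from this output alone. A minimal sketch in C
of that failure mode and a defensive wrapper follows; safe_strdup is a
hypothetical name used for illustration, not a Hydra function, and this is
not Hydra's actual code:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Hydra): duplicate a string, but
 * return NULL instead of crashing when the input itself is NULL.
 * Plain strdup(NULL) is undefined behaviour; with glibc it ends up
 * calling strlen() on address 0x0, which is the kind of invalid read
 * valgrind reports above. */
static char *safe_strdup(const char *s)
{
    if (s == NULL)
        return NULL;              /* caller must check for this */

    size_t len = strlen(s) + 1;   /* include the terminating NUL */
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

A caller would test the result and report which field was missing rather
than dereference it, which would turn the segfault into a readable error.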

If I run on a smaller number of processes (e.g., 512), everything 
works fine.

Any suggestions on how to solve this problem?

Thomas




