[mpich-discuss] Bug in HYDT_dbg_setup_procdesc

Chris January chris.january at allinea.com
Tue Apr 30 04:59:17 CDT 2013


Hello,

We (Allinea) have noticed a regression in HYDT_dbg_setup_procdesc,
introduced between 3.0.2 and 3.0.3 by this commit:

http://trac.mpich.org/projects/mpich/changeset/e04dd4b64ff618f2df58789265b741a8e9fab081/

When debugging a 4-process job on a 32-core machine using DDT, we find
that all 4 entries in MPIR_proctable have the same pid.

Here is how to reproduce the issue outside of DDT:

jbray@mic3:31053% gdb --args mpirun -np 4 wave_f.exe
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show
copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols
from /home/jbray/prog/mpich/mpich-3.0.3/mic_gnu/install/bin/mpirun...done.
(gdb) break MPIR_Breakpoint
Breakpoint 1 at 0x428a70: file ./tools/debugger/debugger.c, line 25.
(gdb) r
Starting
program: /home/jbray/prog/mpich/mpich-3.0.3/mic_gnu/install/bin/mpirun
-np 4 wave_f.exe
[Thread debugging using libthread_db enabled]
Detaching after fork from child process 107606.

Breakpoint 1, MPIR_Breakpoint () at ./tools/debugger/debugger.c:25
25	}
Missing separate debuginfos, use: debuginfo-install
glibc-2.12-1.107.el6.x86_64 libxml2-2.7.6-12.el6_4.1.x86_64
zlib-1.2.3-29.el6.x86_64
(gdb) print *MPIR_proctable@4
$1 = {{host_name = 0x6723f0 "mic3", executable_name = 0x6723d0
"./wave_f.exe", pid = 107847}, {host_name = 0x6723b0 "mic3",
executable_name = 0x672390 "./wave_f.exe", pid = 107847}, {
    host_name = 0x672370 "mic3", executable_name = 0x672350
"./wave_f.exe", pid = 107847}, {host_name = 0x672330 "mic3",
executable_name = 0x672310 "./wave_f.exe", pid = 107847}}
(gdb) 

As you can see, MPIR_proctable reports the same pid (107847) for every
rank, when in reality each rank has its own pid:

-bash-4.1$ ps aux | grep 'wave_f.exe'
jbray    107841  0.3  0.0  99896 18772 pts/2    S+   10:57   0:00 gdb
--args mpirun -np 4 ./wave_f.exe
jbray    107843  0.0  0.0  23144  1288 pts/2    T    10:57
0:00 /home/jbray/prog/mpich/mpich-3.0.3/mic_gnu/install/bin/mpirun -np
4 ./wave_f.exe
jbray    107847  0.0  0.0  46488  1504 ?        Ss   10:57
0:00 ./wave_f.exe
jbray    107848  0.0  0.0  29332  1472 ?        Ss   10:57
0:00 ./wave_f.exe
jbray    107849  0.0  0.0  29332  1472 ?        Ss   10:57
0:00 ./wave_f.exe
jbray    107850  0.0  0.0  29332  1472 ?        Ss   10:57
0:00 ./wave_f.exe
cjanuary 107870  0.0  0.0 103244   864 pts/6    S+   10:57   0:00
grep ./wave_f.exe
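
For anyone who wants to check for this without reading the gdb output by
eye, here is a minimal sketch in Python. It is not part of MPICH or DDT;
the helper name duplicate_pids is made up for illustration, and it only
assumes that gdb prints each entry with a "pid = N" field as in the
session above. The "expected" string is constructed from the pids in the
ps listing, not taken from a real gdb run.

```python
import re
from collections import Counter


def duplicate_pids(gdb_output):
    """Return {pid: count} for every pid printed more than once.

    Hypothetical helper: scans text such as the output of
    `print *MPIR_proctable@4` for "pid = N" fields.
    """
    pids = [int(p) for p in re.findall(r"\bpid = (\d+)", gdb_output)]
    return {pid: n for pid, n in Counter(pids).items() if n > 1}


# Entries as printed by gdb in the session above (3.0.3).
broken = (
    '{{host_name = 0x6723f0 "mic3", executable_name = 0x6723d0 '
    '"./wave_f.exe", pid = 107847}, {host_name = 0x6723b0 "mic3", '
    'executable_name = 0x672390 "./wave_f.exe", pid = 107847}, '
    '{host_name = 0x672370 "mic3", executable_name = 0x672350 '
    '"./wave_f.exe", pid = 107847}, {host_name = 0x672330 "mic3", '
    'executable_name = 0x672310 "./wave_f.exe", pid = 107847}}'
)

# Illustrative only: what the table would contain if it matched the
# pids reported by ps (107847 to 107850).
expected = "{{pid = 107847}, {pid = 107848}, {pid = 107849}, {pid = 107850}}"

print(duplicate_pids(broken))    # {107847: 4}
print(duplicate_pids(expected))  # {}
```

An empty result means every entry has a distinct pid; anything else
lists each repeated pid with the number of entries that share it.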

Regards,
Chris January - VP Engineering - Allinea Software Ltd.




