[mpich-discuss] Bug in HYDT_dbg_setup_procdesc
Pavan Balaji
balaji at mcs.anl.gov
Tue Apr 30 07:54:14 CDT 2013
Gah. Yes, good catch. I've fixed it here:
http://git.mpich.org/mpich.git/commitdiff/a9640dc4
-- Pavan
On 04/30/2013 04:59 AM US Central Time, Chris January wrote:
> Hello,
>
> We (Allinea) have noticed a bug in HYDT_dbg_setup_procdesc, introduced
> between 3.0.2 and 3.0.3 by this commit:
>
> http://trac.mpich.org/projects/mpich/changeset/e04dd4b64ff618f2df58789265b741a8e9fab081/
>
> When debugging a 4-process job on a 32-core machine using DDT, we find
> that all 4 entries in MPIR_proctable have the same pid.
>
> Here is how to reproduce the issue outside of DDT:
>
> jbray@mic3:31053% gdb --args mpirun -np 4 wave_f.exe
> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6)
> Copyright (C) 2010 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /home/jbray/prog/mpich/mpich-3.0.3/mic_gnu/install/bin/mpirun...done.
> (gdb) break MPIR_Breakpoint
> Breakpoint 1 at 0x428a70: file ./tools/debugger/debugger.c, line 25.
> (gdb) r
> Starting program: /home/jbray/prog/mpich/mpich-3.0.3/mic_gnu/install/bin/mpirun -np 4 wave_f.exe
> [Thread debugging using libthread_db enabled]
> Detaching after fork from child process 107606.
>
> Breakpoint 1, MPIR_Breakpoint () at ./tools/debugger/debugger.c:25
> 25 }
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 libxml2-2.7.6-12.el6_4.1.x86_64 zlib-1.2.3-29.el6.x86_64
> (gdb) print *MPIR_proctable@4
> $1 = {{host_name = 0x6723f0 "mic3", executable_name = 0x6723d0
> "./wave_f.exe", pid = 107847}, {host_name = 0x6723b0 "mic3",
> executable_name = 0x672390 "./wave_f.exe", pid = 107847}, {
> host_name = 0x672370 "mic3", executable_name = 0x672350
> "./wave_f.exe", pid = 107847}, {host_name = 0x672330 "mic3",
> executable_name = 0x672310 "./wave_f.exe", pid = 107847}}
> (gdb)
>
> As you can see, MPIR_proctable claims that every rank has the same pid,
> when in reality each rank has its own:
>
> -bash-4.1$ ps aux | grep 'wave_f.exe'
> jbray 107841 0.3 0.0 99896 18772 pts/2 S+ 10:57 0:00 gdb --args mpirun -np 4 ./wave_f.exe
> jbray 107843 0.0 0.0 23144 1288 pts/2 T 10:57 0:00 /home/jbray/prog/mpich/mpich-3.0.3/mic_gnu/install/bin/mpirun -np 4 ./wave_f.exe
> jbray 107847 0.0 0.0 46488 1504 ? Ss 10:57 0:00 ./wave_f.exe
> jbray 107848 0.0 0.0 29332 1472 ? Ss 10:57 0:00 ./wave_f.exe
> jbray 107849 0.0 0.0 29332 1472 ? Ss 10:57 0:00 ./wave_f.exe
> jbray 107850 0.0 0.0 29332 1472 ? Ss 10:57 0:00 ./wave_f.exe
> cjanuary 107870 0.0 0.0 103244 864 pts/6 S+ 10:57 0:00 grep ./wave_f.exe
>
> Regards,
> Chris January - VP Engineering - Allinea Software Ltd.
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
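The symptom in the report (distinct ranks sharing one pid in MPIR_proctable) can be checked mechanically. Below is a minimal C sketch, assuming the entry layout shown in the gdb output above (host_name, executable_name, pid), which the MPIR process acquisition interface names MPIR_PROCDESC. The helper proctable_find_duplicate is hypothetical and is not part of MPICH, Hydra, or DDT. Because pids are only unique per host, an entry counts as a duplicate only when both its host_name and its pid match an earlier entry.

```c
#include <assert.h>
#include <string.h>

/* One process-table entry, with the fields shown in the gdb output above.
 * The MPIR process acquisition interface calls this type MPIR_PROCDESC. */
typedef struct {
    char *host_name;        /* host the rank runs on */
    char *executable_name;  /* path of the rank's executable */
    int pid;                /* OS process id of the rank on that host */
} MPIR_PROCDESC;

/* Hypothetical sanity check (not an MPICH, Hydra, or DDT function).
 * Returns the index of the first entry whose (host_name, pid) pair
 * repeats an earlier entry, or -1 if all entries are distinct.
 * Quadratic in size, which is fine for a one-off check. */
int proctable_find_duplicate(const MPIR_PROCDESC *table, int size)
{
    assert(size == 0 || table != NULL);

    for (int i = 1; i < size; i++) {
        for (int j = 0; j < i; j++) {
            /* pids are only unique per host, so compare both fields */
            if (table[i].pid == table[j].pid &&
                strcmp(table[i].host_name, table[j].host_name) == 0)
                return i;
        }
    }
    return -1;
}
```

A tool could run this over MPIR_proctable (which has MPIR_proctable_size entries in the MPIR interface) once MPIR_Breakpoint is hit. On the table printed above it flags entry 1, whereas a table built from the pids reported by ps (107847 to 107850) passes.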
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji