[mpich-discuss] Hydra fills in the MPIR_proctable[] incorrectly with multiple processes per node

John DelSignore John.DelSignore at roguewave.com
Mon Mar 11 14:59:14 CDT 2013


Hi,

I'm pretty sure this is an MPICH Hydra bug, but I wanted to ask this group before filing a bug report. As far as I can tell, filing one requires an MPICH Trac account, and I don't know how to create one.

As far as I can tell, Hydra fills in the MPIR_proctable[] incorrectly when there are multiple processes per node. The index into the MPIR_proctable[] is supposed to be the MPI process's rank in MPI_COMM_WORLD. To demonstrate the problem, I created a simple MPI "hello world" program in which each MPI process prints its rank and pid; I attached it to this email.

This is the version of MPICH I am using:

shell% /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun --version|head -3
HYDRA build details:
    Version:                                 1.4.1p1
    Release Date:                            Thu Sep  1 13:53:02 CDT 2011
shell% 

I ran the code under TotalView using 2 nodes and 6 processes (3 per node). I enabled logging so that TotalView would output the contents of the MPIR_proctable[] as it extracted it from the mpirun process. Here is the output of the run:

shell% tv8cli \
  -verbosity errors \
  -x15 \
  -parallel_stop no \
  -debug_file debug.log \
  -args \
    /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun \
    -hosts 127.0.0.2,127.0.0.3 \
    -n 6 \
    ./mpichbug
d1.<> dcont
Hello from rank 0 of 6, getpid()==2691
Hello from rank 1 of 6, getpid()==2729
Hello from rank 2 of 6, getpid()==2693
Hello from rank 3 of 6, getpid()==2730
Hello from rank 4 of 6, getpid()==2694
Hello from rank 5 of 6, getpid()==2734
d1.<> quit -force
shell%

Grepping for "proctable" in the debugger's log file shows the contents of the MPIR_proctable[]:

shell% grep proctable debug.log
mpir_proctable_t::create: extracting hostname/execname/pids for 6 processes
mpir_proctable_t::create: MPIR_proctable[0]: host_name(0x0056be20)="127.0.0.2", executable_name(0x0056be80)="./mpichbug", pid=2691
mpir_proctable_t::create: MPIR_proctable[1]: host_name(0x0056be00)="127.0.0.2", executable_name(0x0056bde0)="./mpichbug", pid=2693
mpir_proctable_t::create: MPIR_proctable[2]: host_name(0x005859a0)="127.0.0.2", executable_name(0x0056c080)="./mpichbug", pid=2694
mpir_proctable_t::create: MPIR_proctable[3]: host_name(0x005856b0)="127.0.0.3", executable_name(0x005856d0)="./mpichbug", pid=2729
mpir_proctable_t::create: MPIR_proctable[4]: host_name(0x005856f0)="127.0.0.3", executable_name(0x00585710)="./mpichbug", pid=2730
mpir_proctable_t::create: MPIR_proctable[5]: host_name(0x00585730)="127.0.0.3", executable_name(0x00585750)="./mpichbug", pid=2734
shell% 

Matching up the pid values shows that, for some of the MPI processes, the index of the process's entry in MPIR_proctable[] does not match the rank returned to the program by MPI_Comm_rank(). Here is the mapping from MPI rank to MPIR_proctable[] index:
0 => 0
1 => 3
2 => 1
3 => 4
4 => 2
5 => 5    
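
The pid matching above can be sketched in C. This is only an illustration, not TotalView or Hydra code: the MPIR_PROCDESC layout follows the MPIR process acquisition interface, the data arrays are copied from the two outputs above, and the helper names (proctable_index_of_pid, count_mismatched_ranks, host_grouped_index) are hypothetical. The last helper encodes an observation rather than a diagnosis: the logged table is grouped by host while the ranks alternate between the two hosts, and the mapping above is what that combination would produce.

```c
#include <assert.h>
#include <stddef.h>

/* Entry layout as defined by the MPIR process acquisition interface. */
typedef struct {
    char *host_name;
    char *executable_name;
    int   pid;
} MPIR_PROCDESC;

/* pids printed by ranks 0..5 in the run above. */
static const int rank_pids[6] = {2691, 2729, 2693, 2730, 2694, 2734};

/* MPIR_proctable[] contents as logged by the debugger. */
static const MPIR_PROCDESC logged_table[6] = {
    {"127.0.0.2", "./mpichbug", 2691},
    {"127.0.0.2", "./mpichbug", 2693},
    {"127.0.0.2", "./mpichbug", 2694},
    {"127.0.0.3", "./mpichbug", 2729},
    {"127.0.0.3", "./mpichbug", 2730},
    {"127.0.0.3", "./mpichbug", 2734},
};

/* Hypothetical helper: index of the entry holding pid, or -1 if absent. */
int proctable_index_of_pid(const MPIR_PROCDESC *table, int size, int pid)
{
    assert(table != NULL);
    for (int i = 0; i < size; i++) {
        if (table[i].pid == pid)
            return i;
    }
    return -1;
}

/* Hypothetical helper: number of ranks whose entry is not at table[rank]. */
int count_mismatched_ranks(const MPIR_PROCDESC *table, int size,
                           const int *pids_by_rank)
{
    int mismatched = 0;
    for (int rank = 0; rank < size; rank++) {
        if (proctable_index_of_pid(table, size, pids_by_rank[rank]) != rank)
            mismatched++;
    }
    return mismatched;
}

/* Assumption, not taken from Hydra's source: ranks are placed round-robin
 * over nhosts hosts, and the table is filled host by host with ppn
 * entries per host. Returns the table index that rank would land at. */
int host_grouped_index(int rank, int nhosts, int ppn)
{
    return (rank % nhosts) * ppn + rank / nhosts;
}
```

For this run, count_mismatched_ranks(logged_table, 6, rank_pids) counts ranks 1 through 4 as misplaced, and host_grouped_index(rank, 2, 3) reproduces each line of the mapping listed above.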

Do you agree that this is an MPICH Hydra bug?

Any advice on how to create an MPICH Trac account so that I can report the bug?

Thanks, John D.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpichbug.c
Type: text/x-csrc
Size: 361 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20130311/2bbbfb75/attachment.bin>

