[mpich-discuss] Hydra fills in the MPIR_proctable[] incorrectly with multiple processes per node

Pavan Balaji balaji at mcs.anl.gov
Tue Mar 12 11:07:43 CDT 2013


John,

This does seem like a bug.  Specifically, this is a problem with the
wrap-around of hosts.  For example, I don't expect this problem to show
up when you do:

mpiexec -hosts 127.0.0.2:3,127.0.0.3:3 -n 6 ./mpichbug

This should only show up when the number of cores is not sufficient in
the first round and mpiexec has to wrap around to the first host again:

mpiexec -hosts 127.0.0.2,127.0.0.3 -n 6 ./mpichbug
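
To make the wrap-around concrete, here's a rough standalone sketch (not
Hydra's actual code) of what seems to be happening: the ranks get dealt
out round-robin across the hosts, but the proctable is then filled
host-by-host instead of in rank order:

    #include <stdio.h>

    #define NHOSTS 2
    #define NPROCS 6

    int main(void)
    {
        const char *hosts[NHOSTS] = { "127.0.0.2", "127.0.0.3" };
        int ranks_on_host[NHOSTS][NPROCS];
        int nranks[NHOSTS] = { 0, 0 };

        /* No core counts given, so ranks wrap around the host list:
         * 127.0.0.2 gets ranks 0,2,4 and 127.0.0.3 gets ranks 1,3,5. */
        for (int rank = 0; rank < NPROCS; rank++) {
            int h = rank % NHOSTS;
            ranks_on_host[h][nranks[h]++] = rank;
        }

        /* Filling the table host-by-host reproduces the ordering John
         * reported: index 1 ends up describing rank 2, index 2 rank 4, etc. */
        printf("filled per host:\n");
        for (int h = 0, i = 0; h < NHOSTS; h++)
            for (int s = 0; s < nranks[h]; s++, i++)
                printf("  MPIR_proctable[%d] -> rank %d on %s\n",
                       i, ranks_on_host[h][s], hosts[h]);

        /* What the MPIR interface expects: entry i describes rank i. */
        printf("filled by rank:\n");
        for (int rank = 0; rank < NPROCS; rank++)
            printf("  MPIR_proctable[%d] -> rank %d on %s\n",
                   rank, rank, hosts[rank % NHOSTS]);
        return 0;
    }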

I'm working on a patch.  I'll commit it shortly.

 -- Pavan

On 03/11/2013 02:59 PM US Central Time, John DelSignore wrote:
> Hi,
> 
> I'm pretty sure this is an MPICH Hydra bug, but I wanted to ask this group before going through the trouble of filing an MPICH bug report, which I believe requires an MPICH Trac account that I don't know how to create.
> 
> As far as I can tell, Hydra fills in the MPIR_proctable[] incorrectly when there are multiple processes per node. The index into the MPIR_proctable[] is supposed to be the MPI process's rank in MPI_COMM_WORLD. To demonstrate the problem, I created a simple MPI "hello world" program in which each MPI process prints its rank and pid; it is attached to this email.
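> 
> In case it helps, the program is essentially the following (a minimal sketch of what I just described; the attached file is the authoritative version):
> 
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         int rank, size;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         /* Print the MPI_COMM_WORLD rank and the OS pid so they can be
>            matched against the debugger's view of MPIR_proctable[]. */
>         printf("Hello from rank %d of %d, getpid()==%d\n",
>                rank, size, (int)getpid());
>         MPI_Finalize();
>         return 0;
>     }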
> 
> This is the version of MPICH I am using:
> 
> shell% /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun --version|head -3
> HYDRA build details:
>     Version:                                 1.4.1p1
>     Release Date:                            Thu Sep  1 13:53:02 CDT 2011
> shell% 
> 
> I ran the code under TotalView using 2 nodes and 6 processes (3 per node). I enabled logging so that TotalView would output the contents of the MPIR_proctable[] as it extracted it from the mpirun process. Here is the output of the run:
> 
> shell% tv8cli \
>   -verbosity errors \
>   -x15 \
>   -parallel_stop no \
>   -debug_file debug.log \
>   -args \
>     /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun \
>     -hosts 127.0.0.2,127.0.0.3 \
>     -n 6 \
>     ./mpichbug
> d1.<> dcont
> Hello from rank 0 of 6, getpid()==2691
> Hello from rank 1 of 6, getpid()==2729
> Hello from rank 2 of 6, getpid()==2693
> Hello from rank 3 of 6, getpid()==2730
> Hello from rank 4 of 6, getpid()==2694
> Hello from rank 5 of 6, getpid()==2734
> d1.<> quit -force
> shell%
> 
> Grepping for "proctable" in the debugger's log file shows the contents of the MPIR_proctable[]:
> 
> shell% grep proctable debug.log
> mpir_proctable_t::create: extracting hostname/execname/pids for 6 processes
> mpir_proctable_t::create: MPIR_proctable[0]: host_name(0x0056be20)="127.0.0.2", executable_name(0x0056be80)="./mpichbug", pid=2691
> mpir_proctable_t::create: MPIR_proctable[1]: host_name(0x0056be00)="127.0.0.2", executable_name(0x0056bde0)="./mpichbug", pid=2693
> mpir_proctable_t::create: MPIR_proctable[2]: host_name(0x005859a0)="127.0.0.2", executable_name(0x0056c080)="./mpichbug", pid=2694
> mpir_proctable_t::create: MPIR_proctable[3]: host_name(0x005856b0)="127.0.0.3", executable_name(0x005856d0)="./mpichbug", pid=2729
> mpir_proctable_t::create: MPIR_proctable[4]: host_name(0x005856f0)="127.0.0.3", executable_name(0x00585710)="./mpichbug", pid=2730
> mpir_proctable_t::create: MPIR_proctable[5]: host_name(0x00585730)="127.0.0.3", executable_name(0x00585750)="./mpichbug", pid=2734
> shell% 
> 
> Matching up the pid values shows that, for some of the MPI processes, the entry MPIR_proctable[rank] does not describe the process that MPI_Comm_rank() reported as rank. Here's the MPI rank to MPIR_proctable index mapping:
> 0 => 0
> 1 => 3
> 2 => 1
> 3 => 4
> 4 => 2
> 5 => 5    
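> 
> (For context, the fields I'm matching on are the ones the MPIR process acquisition interface exposes on the mpirun side, roughly:)
> 
>     typedef struct {
>         char *host_name;        /* host the process is running on */
>         char *executable_name;  /* name of the launched executable */
>         int   pid;              /* pid of the process on that host */
>     } MPIR_PROCDESC;
> 
>     MPIR_PROCDESC *MPIR_proctable;  /* entry i is expected to describe rank i */
>     int MPIR_proctable_size;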
> 
> Do you agree that this is an MPICH Hydra bug?
> 
> Any advice on how to create an MPICH Trac account so that I can report the bug?
> 
> Thanks, John D.
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


