[mpich-discuss] Hydra fills in the MPIR_proctable[] incorrectly with multiple processes per node
Pavan Balaji
balaji at mcs.anl.gov
Tue Mar 12 11:07:43 CDT 2013
John,
This does seem like a bug. Specifically, this is a problem with the
wrap-around of hosts. For example, I don't expect this problem to show
up when you do:
mpiexec -hosts 127.0.0.2:3,127.0.0.3:3 -n 6 ./mpichbug
This should only show up when the number of cores is not sufficient in
the first round and mpiexec has to wrap around to the first host again:
mpiexec -hosts 127.0.0.2,127.0.0.3 -n 6 ./mpichbug
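To illustrate (this is inferred from the placement visible in your output below, not from the Hydra source): the first form should lay out ranks 0-2 on 127.0.0.2 and ranks 3-5 on 127.0.0.3 in a single pass, whereas the second form wraps around and places ranks 0, 2, 4 on 127.0.0.2 and ranks 1, 3, 5 on 127.0.0.3. The problem is that the proctable entries then come out grouped by host rather than ordered by rank.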
I'm working on a patch. I'll commit it shortly.
-- Pavan
On 03/11/2013 02:59 PM US Central Time, John DelSignore wrote:
> Hi,
>
> I'm pretty sure this is an MPICH Hydra bug, but I wanted to ask this group before I go through the trouble of figuring out how to file an MPICH bug report; I believe that requires creating an MPICH Trac account, which I don't know how to do.
>
> As far as I can tell, Hydra fills in the MPIR_proctable[] incorrectly when there are multiple processes per node. The index into MPIR_proctable[] is supposed to be the MPI process's rank in MPI_COMM_WORLD. To demonstrate the problem, I created a simple MPI "hello world" program where each MPI process prints out its rank and pid; I attached it to this email.
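>
> The attached program is essentially the following sketch (the actual attachment may differ in minor details):
>
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, size;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         /* Print the rank and pid so each MPIR_proctable[] entry can be
>          * matched against the process that owns it. */
>         printf("Hello from rank %d of %d, getpid()==%d\n",
>                rank, size, (int)getpid());
>         MPI_Finalize();
>         return 0;
>     }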
>
> This is the version of MPICH I am using:
>
> shell% /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun --version|head -3
> HYDRA build details:
> Version: 1.4.1p1
> Release Date: Thu Sep 1 13:53:02 CDT 2011
> shell%
>
> I ran the code under TotalView using 2 nodes and 6 processes (3 per node). I enabled logging so that TotalView would output the contents of the MPIR_proctable[] as it extracted it from the mpirun process. Here is the output of the run:
>
> shell% tv8cli \
> -verbosity errors \
> -x15 \
> -parallel_stop no \
> -debug_file debug.log \
> -args \
> /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun \
> -hosts 127.0.0.2,127.0.0.3 \
> -n 6 \
> ./mpichbug
> d1.<> dcont
> Hello from rank 0 of 6, getpid()==2691
> Hello from rank 1 of 6, getpid()==2729
> Hello from rank 2 of 6, getpid()==2693
> Hello from rank 3 of 6, getpid()==2730
> Hello from rank 4 of 6, getpid()==2694
> Hello from rank 5 of 6, getpid()==2734
> d1.<> quit -force
> shell%
>
> Grepping for "proctable" in the debugger's log file shows the contents of the MPIR_proctable[]:
>
> shell% grep proctable debug.log
> mpir_proctable_t::create: extracting hostname/execname/pids for 6 processes
> mpir_proctable_t::create: MPIR_proctable[0]: host_name(0x0056be20)="127.0.0.2", executable_name(0x0056be80)="./mpichbug", pid=2691
> mpir_proctable_t::create: MPIR_proctable[1]: host_name(0x0056be00)="127.0.0.2", executable_name(0x0056bde0)="./mpichbug", pid=2693
> mpir_proctable_t::create: MPIR_proctable[2]: host_name(0x005859a0)="127.0.0.2", executable_name(0x0056c080)="./mpichbug", pid=2694
> mpir_proctable_t::create: MPIR_proctable[3]: host_name(0x005856b0)="127.0.0.3", executable_name(0x005856d0)="./mpichbug", pid=2729
> mpir_proctable_t::create: MPIR_proctable[4]: host_name(0x005856f0)="127.0.0.3", executable_name(0x00585710)="./mpichbug", pid=2730
> mpir_proctable_t::create: MPIR_proctable[5]: host_name(0x00585730)="127.0.0.3", executable_name(0x00585750)="./mpichbug", pid=2734
> shell%
>
> Matching up the pid values shows that, for some of the MPI processes, the index of the process's MPIR_proctable[] entry does not match the rank returned to the program by MPI_Comm_rank(). Here's the MPI rank to MPIR_proctable[] index mapping:
> 0 => 0
> 1 => 3
> 2 => 1
> 3 => 4
> 4 => 2
> 5 => 5
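>
> For reference, the MPIR_proctable[] entries that the debugger extracts follow the MPIR process acquisition interface, which is roughly:
>
>     typedef struct {
>         char *host_name;        /* host the process is running on */
>         char *executable_name;  /* name of the executable image   */
>         int   pid;              /* pid of the process             */
>     } MPIR_PROCDESC;
>     extern MPIR_PROCDESC *MPIR_proctable;
>     extern int MPIR_proctable_size;
>
> The debugger treats MPIR_proctable[i] as describing MPI rank i, which is why the shuffled ordering above breaks process identification.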
>
> Do you agree that this is an MPICH Hydra bug?
>
> Any advice on how to create an MPICH Trac account so that I can report the bug?
>
> Thanks, John D.
>
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji