[mpich-discuss] Hydra fills in the MPIR_proctable[] incorrectly with multiple processes per node

Jeff Hammond jhammond at alcf.anl.gov
Mon Mar 11 21:22:34 CDT 2013


The standard response in this situation is to ask you to try MPICH 3.0.x
instead of a version from September 2011.

The MPICH developers can easily create a Trac account for you if needed.

Best,

Jeff

On Mon, Mar 11, 2013 at 2:59 PM, John DelSignore
<John.DelSignore at roguewave.com> wrote:
> Hi,
>
> I'm pretty sure this is an MPICH Hydra bug, but I wanted to ask this group before going to the trouble of filing an MPICH bug report. I think filing one requires an MPICH Trac account, and I don't know how to create one.
>
> As far as I can tell, Hydra fills in the MPIR_proctable[] incorrectly when there are multiple processes per node. The index into the MPIR_proctable[] is supposed to be the MPI process's rank in MPI_COMM_WORLD. To demonstrate the problem, I created a simple MPI "hello world" program in which each MPI process prints its rank and pid; I attached it to this email.
>
> This is the version of MPICH I am using:
>
> shell% /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun --version|head -3
> HYDRA build details:
>     Version:                                 1.4.1p1
>     Release Date:                            Thu Sep  1 13:53:02 CDT 2011
> shell%
>
> I ran the code under TotalView using 2 nodes and 6 processes (3 per node). I enabled logging so that TotalView would output the contents of the MPIR_proctable[] as it extracted it from the mpirun process. Here is the output of the run:
>
> shell% tv8cli \
>   -verbosity errors \
>   -x15 \
>   -parallel_stop no \
>   -debug_file debug.log \
>   -args \
>     /home/mware/argonne/mpich2/1.4.1p1/x86_64-linux/bin/mpirun \
>     -hosts 127.0.0.2,127.0.0.3 \
>     -n 6 \
>     ./mpichbug
> d1.<> dcont
> Hello from rank 0 of 6, getpid()==2691
> Hello from rank 1 of 6, getpid()==2729
> Hello from rank 2 of 6, getpid()==2693
> Hello from rank 3 of 6, getpid()==2730
> Hello from rank 4 of 6, getpid()==2694
> Hello from rank 5 of 6, getpid()==2734
> d1.<> quit -force
> shell%
>
> Grepping for "proctable" in the debugger's log file shows the contents of the MPIR_proctable[]:
>
> shell% grep proctable debug.log
> mpir_proctable_t::create: extracting hostname/execname/pids for 6 processes
> mpir_proctable_t::create: MPIR_proctable[0]: host_name(0x0056be20)="127.0.0.2", executable_name(0x0056be80)="./mpichbug", pid=2691
> mpir_proctable_t::create: MPIR_proctable[1]: host_name(0x0056be00)="127.0.0.2", executable_name(0x0056bde0)="./mpichbug", pid=2693
> mpir_proctable_t::create: MPIR_proctable[2]: host_name(0x005859a0)="127.0.0.2", executable_name(0x0056c080)="./mpichbug", pid=2694
> mpir_proctable_t::create: MPIR_proctable[3]: host_name(0x005856b0)="127.0.0.3", executable_name(0x005856d0)="./mpichbug", pid=2729
> mpir_proctable_t::create: MPIR_proctable[4]: host_name(0x005856f0)="127.0.0.3", executable_name(0x00585710)="./mpichbug", pid=2730
> mpir_proctable_t::create: MPIR_proctable[5]: host_name(0x00585730)="127.0.0.3", executable_name(0x00585750)="./mpichbug", pid=2734
> shell%
>
> Matching up the pid values shows that, for some of the MPI processes, the MPIR_proctable[] index does not match the rank returned to the program by MPI_Comm_rank(). Here is the mapping from MPI rank to MPIR_proctable[] index:
> 0 => 0
> 1 => 3
> 2 => 1
> 3 => 4
> 4 => 2
> 5 => 5
>
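The mapping above can be reproduced mechanically by joining the two listings on pid. The following is a minimal sketch in Python, assuming the pids are copied by hand from the "Hello from rank" output and from the debug.log lines; the names rank_to_pid, proctable, and rank_to_proctable_index are hypothetical and are not part of MPICH or TotalView.

```python
# PIDs reported by each rank (copied from the "Hello from rank ..." lines).
rank_to_pid = {0: 2691, 1: 2729, 2: 2693, 3: 2730, 4: 2694, 5: 2734}

# (host_name, pid) for each MPIR_proctable[] index (copied from debug.log).
proctable = [
    ("127.0.0.2", 2691),
    ("127.0.0.2", 2693),
    ("127.0.0.2", 2694),
    ("127.0.0.3", 2729),
    ("127.0.0.3", 2730),
    ("127.0.0.3", 2734),
]


def rank_to_proctable_index(rank_to_pid, proctable):
    """Join the two listings on pid: MPI rank -> MPIR_proctable[] index."""
    pid_to_index = {pid: i for i, (_host, pid) in enumerate(proctable)}
    return {rank: pid_to_index[pid] for rank, pid in sorted(rank_to_pid.items())}


mapping = rank_to_proctable_index(rank_to_pid, proctable)

# Ranks whose table index differs from the rank itself.
mismatched = [rank for rank, index in mapping.items() if rank != index]

# Host each rank actually ran on, looked up through the table entry.
host_of_rank = {rank: proctable[index][0] for rank, index in mapping.items()}

for rank, index in mapping.items():
    print(f"{rank} => {index}  ({host_of_rank[rank]})")
print("mismatched ranks:", mismatched)
```

In this data the even ranks ran on 127.0.0.2 and the odd ranks on 127.0.0.3, while the table lists all entries for one host before the other. That pattern would be consistent with ranks being placed round-robin across hosts and the table being filled host by host, but this is an inference from one run, not a confirmed diagnosis.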
> Do you agree that this is an MPICH Hydra bug?
>
> Any advice on how to create an MPICH Trac account so that I can report the bug?
>
> Thanks, John D.



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


