[mpich-discuss] Re: Communication Error when installing MPICH on multi HOSTS.

维洛逐风 wu_0317 at qq.com
Wed Feb 19 01:39:31 CST 2014


Thank you! It helps!
I checked my /etc/hosts, as below:


127.0.0.1       localhost
127.0.1.1       mpislaver1
10.10.10.10     mpimaster
10.10.10.11     mpislaver1
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters



It seems that the second line, "127.0.1.1 mpislaver1", causes ambiguity.
The error was gone after I deleted this line on both hosts!
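For anyone who hits the same symptom: a quick way to spot this misconfiguration is to check what the machine's own hostname resolves to, since that is the address remote ranks are given to connect back to. A minimal sketch in Python (a hypothetical check of my own, not from the MPICH docs; run it on each node):

```python
import socket

# Resolve this machine's own hostname, the same kind of lookup the MPI
# processes rely on when a remote rank needs an address to connect to.
hostname = socket.gethostname()
addr = socket.gethostbyname(hostname)
print(f"{hostname} -> {addr}")

# An answer in 127.0.0.0/8 (e.g. the "127.0.1.1 <hostname>" line Ubuntu
# puts in /etc/hosts) means remote processes would try to connect to
# their own loopback interface and fail, matching the error in this thread.
if addr.startswith("127."):
    print("warning: hostname resolves to loopback; check /etc/hosts")
```

If this prints a loopback address instead of the host's LAN address (10.10.10.x here), the /etc/hosts entry is the likely culprit.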


------------------
Jie-Jun Wu
Department of Computer Science,
Sun Yat-sen University,
Guangzhou, 
P.R. China
 


 




------------------ Original Message ------------------
From: "Balaji, Pavan";<balaji at anl.gov>;
Sent: Wednesday, February 19, 2014, 12:38 PM
To: "discuss at mpich.org"<discuss at mpich.org>; 

Subject: Re: [mpich-discuss] Communication Error when installing MPICH on multi HOSTS.



 
 
 It’s hard to tell, but this does indicate some problem with your communication setup.  Did you verify your /etc/hosts as described on the FAQ page?
 
 
   — Pavan
 
 
   From: 维洛逐风 <wu_0317 at qq.com>
 Reply-To: "discuss at mpich.org" <discuss at mpich.org>
 Date: Tuesday, February 18, 2014 at 5:21 AM
 To: discuss <discuss at mpich.org>
 Subject: [mpich-discuss] Communication Error when installing MPICH on multi HOSTS.
 
 
 
 Hi.
 
 
 My environment:
 Two VMware VMs running Ubuntu Server 12.04, called mpimaster and mpislaver1;
 they are both linked to a virtual network 10.0.0.1;
 they can ssh to each other without a password;
 I have disabled the firewalls with "sudo ufw disable";
 I installed mpich-3.0.4 on an NFS share served by mpimaster.
 
 
 
 I installed mpich-3.0.4 following the "readme.txt"; there is a communication problem when processes from different hosts communicate with each other.
 
 
 
 
 From the picture above we can see it's OK to run "cpi" on both hosts separately.
 
 
 If you can't see the picture, please see the shell output below.
 
 
  ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -n 4 ./examples/cpi
 Process 0 of 4 is on mpimaster
 Process 1 of 4 is on mpimaster
 Process 2 of 4 is on mpimaster
 Process 3 of 4 is on mpimaster
 pi is approximately 3.1415926544231239, Error is 0.0000000008333307
 wall clock time = 0.028108
 ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpimaster -n 4 ./examples/cpi
 Process 2 of 4 is on mpimaster
 Process 0 of 4 is on mpimaster
 Process 1 of 4 is on mpimaster
 Process 3 of 4 is on mpimaster
 pi is approximately 3.1415926544231239, Error is 0.0000000008333307
 wall clock time = 0.027234
 ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpislaver1 -n 4 ./examples/cpi
 Process 0 of 4 is on mpislaver1
 pi is approximately 3.1415926544231239, Error is 0.0000000008333307
 wall clock time = 0.000093
 Process 1 of 4 is on mpislaver1
 Process 2 of 4 is on mpislaver1
 Process 3 of 4 is on mpislaver1
 ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpimaster,mpislaver1 -n 4 ./examples/cpi
 Process 0 of 4 is on mpimaster
 Process 2 of 4 is on mpimaster
 Fatal error in PMPI_Reduce: A process has failed, error stack:
 PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff73a51ce8, rbuf=0x7fff73a51cf0, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
 MPIR_Reduce_impl(1029)..........:
 MPIR_Reduce_intra(779)..........:
 MPIR_Reduce_impl(1029)..........:
 MPIR_Reduce_intra(835)..........:
 MPIR_Reduce_binomial(144).......:
 MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
 MPIR_Reduce_intra(799)..........:
 MPIR_Reduce_impl(1029)..........:
 MPIR_Reduce_intra(835)..........:
 MPIR_Reduce_binomial(206).......: Failure during collective
 
 
 ================================================================================
 =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
 =   EXIT CODE: 1
 =   CLEANING UP REMAINING PROCESSES
 =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
 ================================================================================
 [proxy:0:1 at mpislaver1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886)
 [proxy:0:1 at mpislaver1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c
 [proxy:0:1 at mpislaver1] main (./pm/pmiserv/pmip.c:206): demux engine error waitin
 [mpiexec at mpimaster] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_
 [mpiexec at mpimaster] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wa
 [mpiexec at mpimaster] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:21
 [mpiexec at mpimaster] main (./ui/mpich/mpiexec.c:331): process manager error waiti
 ailab at mpimaster:~/Downloads/mpich-3.0.4$
 
 
 
 Please help. Thanks!
 
 
 
 
  ------------------
  Jie-Jun Wu
 Department of Computer Science,
 Sun Yat-sen University,
 Guangzhou, 
 P.R. China
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140219/4a9a8bc2/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6FBB8BAD at 5353463D.B35F0453.png
Type: application/octet-stream
Size: 68330 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140219/4a9a8bc2/attachment.obj>

