[mpich-discuss] Communication Error when installing MPICH on multi HOSTS.

维洛逐风 wu_0317 at qq.com
Tue Feb 18 05:21:39 CST 2014


Hi,


My environment:
Two VMware VMs running Ubuntu Server 12.04, named mpimaster and mpislaver1.
Both are connected to a virtual network, 10.0.0.1.
They can ssh to each other without a password.
I have disabled the firewalls with "sudo ufw disable".
I installed MPICH 3.0.4 on an NFS share served by mpimaster.
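
A quick sanity check of a setup like this can be sketched in Python. This is only an illustration under stated assumptions, not something taken from the MPICH documentation: the helper names (is_loopback, resolve, check_hosts) are made up for this sketch, and the only inputs are the two host names listed above. It reports what address each host name resolves to on the machine where it runs, and flags names that resolve to a loopback address (127.x.x.x). If a host name maps to loopback, processes on the other machine may be handed an address that only works locally; whether that is what happens in this setup is not established.

```python
import ipaddress
import socket

# Host names taken from the environment description above.
HOSTS = ["mpimaster", "mpislaver1"]


def is_loopback(address):
    """Return True if the address lies in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def resolve(host):
    """Return the IPv4 address the local resolver gives for host, or None."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_hosts(hosts, resolver=resolve):
    """Map each host name to (address, verdict).

    verdict is "unresolved", "loopback", or "ok".
    The resolver argument is a hypothetical hook so the logic
    can be exercised without touching the network.
    """
    report = {}
    for host in hosts:
        address = resolver(host)
        if address is None:
            verdict = "unresolved"
        elif is_loopback(address):
            verdict = "loopback"
        else:
            verdict = "ok"
        report[host] = (address, verdict)
    return report


if __name__ == "__main__":
    for host, (address, verdict) in check_hosts(HOSTS).items():
        print(f"{host}: {address} ({verdict})")
```

Running it once on each VM and comparing the two reports shows whether both machines agree on the address of each host name.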



I installed MPICH 3.0.4 following the "readme.txt". There is a communication problem when processes on different hosts communicate with each other.




From the picture above, we can see that "cpi" runs fine on each host separately.


If you can't see the picture, please see the shell output below.


ailab@mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -n 4 ./examples/cpi
Process 0 of 4 is on mpimaster
Process 1 of 4 is on mpimaster
Process 2 of 4 is on mpimaster
Process 3 of 4 is on mpimaster
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.028108
ailab@mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpimaster -n 4 ./examples/cpi
Process 2 of 4 is on mpimaster
Process 0 of 4 is on mpimaster
Process 1 of 4 is on mpimaster
Process 3 of 4 is on mpimaster
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.027234
ailab@mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpislaver1 -n 4 ./examples/cpi
Process 0 of 4 is on mpislaver1
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000093
Process 1 of 4 is on mpislaver1
Process 2 of 4 is on mpislaver1
Process 3 of 4 is on mpislaver1
ailab@mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpimaster,mpislaver1 -n 4 ./examples/cpi
Process 0 of 4 is on mpimaster
Process 2 of 4 is on mpimaster
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff73a51ce8, rbuf=0x7fff73a51cf0, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective


================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
================================================================================
[proxy:0:1@mpislaver1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886)
[proxy:0:1@mpislaver1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c
[proxy:0:1@mpislaver1] main (./pm/pmiserv/pmip.c:206): demux engine error waitin
[mpiexec@mpimaster] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_
[mpiexec@mpimaster] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wa
[mpiexec@mpimaster] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:21
[mpiexec@mpimaster] main (./ui/mpich/mpiexec.c:331): process manager error waiti
ailab@mpimaster:~/Downloads/mpich-3.0.4$



Please help. Thanks!




------------------
Jie-Jun Wu
Department of Computer Science,
Sun Yat-sen University,
Guangzhou, 
P.R. China
-------------- next part --------------
A non-text attachment was scrubbed...
Name: E1179051 at 3853243A.43420353.png
Type: application/octet-stream
Size: 68330 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140218/e7058ac7/attachment.obj>

