[mpich-discuss] Communication Error when installing MPICH on multi HOSTS.

Balaji, Pavan balaji at anl.gov
Tue Feb 18 22:38:38 CST 2014


It’s hard to tell, but this does indicate some problem with your communication setup.  Did you verify your /etc/hosts as described on the FAQ page?

  — Pavan
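
For context, the FAQ entry in question concerns hostname resolution: each host's name must resolve to an address reachable from the other node, not to a loopback address. A sketch of what /etc/hosts might look like on both VMs — the 10.0.0.x addresses are assumptions based on the poster's described subnet, not taken from the FAQ:

```
# /etc/hosts on both mpimaster and mpislaver1
127.0.0.1   localhost
# Do NOT map the machine's own hostname to 127.0.1.1 here
# (Ubuntu's installer adds such a line by default).
10.0.0.1    mpimaster
10.0.0.2    mpislaver1   # assumed address for the second VM
```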

From: 维洛逐风 <wu_0317 at qq.com>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Tuesday, February 18, 2014 at 5:21 AM
To: discuss <discuss at mpich.org>
Subject: [mpich-discuss] Communication Error when installing MPICH on multi HOSTS.

Hi,

My environment:
Two VMware VMs running ubuntu-server 12.04, named mpimaster and mpislaver1;
both are attached to the virtual network 10.0.0.1;
they can ssh to each other without a password;
I have disabled the firewalls with "sudo ufw disable";
I installed mpich-3.0.4 onto an NFS share served by mpimaster.

I installed mpich-3.0.4 following the "readme.txt". There is a communication problem when processes on different hosts communicate with each other.
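
One common culprit with this kind of setup is Ubuntu's default /etc/hosts entry that maps the machine's own hostname to 127.0.1.1, which makes remote connections fail exactly like this. A hedged sketch of a check — the sample file below is illustrative; run the grep against the real /etc/hosts on each VM:

```shell
# Illustrative sample of an Ubuntu-style /etc/hosts containing the
# problematic loopback mapping (replace with the real file's path):
cat > /tmp/hosts.sample <<'EOF'
127.0.0.1 localhost
127.0.1.1 mpimaster
10.0.0.1 mpimaster
EOF

# Flag any hostname mapped to a 127.x address other than localhost;
# such a line should be removed or changed to the host's LAN address.
grep -E '^127\.' /tmp/hosts.sample | grep -v localhost
# -> 127.0.1.1 mpimaster
```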
As the shell transcript below shows (also attached as a screenshot), running "cpi" on each host separately works fine; the failure occurs only when both hosts are used together.

ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -n 4 ./examples/cpi
Process 0 of 4 is on mpimaster
Process 1 of 4 is on mpimaster
Process 2 of 4 is on mpimaster
Process 3 of 4 is on mpimaster
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.028108
ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpimaster -n 4 ./examples/cpi
Process 2 of 4 is on mpimaster
Process 0 of 4 is on mpimaster
Process 1 of 4 is on mpimaster
Process 3 of 4 is on mpimaster
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.027234
ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpislaver1 -n 4 ./examples/cpi
Process 0 of 4 is on mpislaver1
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000093
Process 1 of 4 is on mpislaver1
Process 2 of 4 is on mpislaver1
Process 3 of 4 is on mpislaver1
ailab at mpimaster:~/Downloads/mpich-3.0.4$ mpiexec -hosts mpimaster,mpislaver1 -n 4 ./examples/cpi
Process 0 of 4 is on mpimaster
Process 2 of 4 is on mpimaster
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff73a51ce8, rbuf=0x7fff73a51cf0, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective

================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
================================================================================
[proxy:0:1 at mpislaver1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886)
[proxy:0:1 at mpislaver1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c
[proxy:0:1 at mpislaver1] main (./pm/pmiserv/pmip.c:206): demux engine error waitin
[mpiexec at mpimaster] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_
[mpiexec at mpimaster] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wa
[mpiexec at mpimaster] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:21
[mpiexec at mpimaster] main (./ui/mpich/mpiexec.c:331): process manager error waiti
ailab at mpimaster:~/Downloads/mpich-3.0.4$

Please help. Thanks!


------------------
Jie-Jun Wu
Department of Computer Science,
Sun Yat-sen University,
Guangzhou,
P.R. China

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140219/d9519600/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: E1179051 at 3853243A.43420353.png
Type: application/octet-stream
Size: 68330 bytes
Desc: E1179051 at 3853243A.43420353.png
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140219/d9519600/attachment.obj>
