[mpich-discuss] Problem with cpi example on multiple hosts

HEINZ Josef (AREVA) josef.heinz at areva.com
Thu Mar 26 09:10:44 CDT 2015


Hello everybody!

First of all, thank you very much for your many answers. I have read them and learned a lot about MPI.

However, I am running into problems with MPI communication between two hosts. I am using the cpi example to rule out programming errors on my side.
mpich-3.0.4 is installed on both hosts.

From hostname1 I can reach hostname2 through the ports xx:yy (6 ports in total) using the command "ssh hostname2 -p <port>".
In the other direction (from hostname2 to hostname1) I get a "connection refused" when I try "ssh hostname1 -p <port>" (same ports as before). My admin assures me that these ports are open in the firewall and that the connection is refused only because ssh is not listening on them. MPI should open the listening sockets itself when it starts the tasks.
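
Whether a failed connection points at the firewall or simply at a missing listener can be checked with a small probe. The following is a minimal sketch, assuming a bash built with /dev/tcp support and the coreutils timeout command; probe_port is a hypothetical helper, and the host and port arguments stand in for the real values, which are not given in this post.

```shell
# Sketch: classify a TCP connection attempt as open, refused, or filtered.
# probe_port is a hypothetical helper, not part of MPICH or OpenSSH.
probe_port() {
    local host=$1 port=$2 rc
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # timeout(1) bounds the wait to 3 seconds.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
    rc=$?
    case $rc in
        0)   echo "open" ;;      # a listener accepted the connection
        124) echo "filtered" ;;  # no reply in time, typically packets dropped by a firewall
        *)   echo "refused" ;;   # immediate failure: usually no listener (other errors land here too)
    esac
}

# Example (placeholders, run on hostname2):
#   probe_port hostname1 <port>
```

An immediate "refused" with nothing listening is consistent with the admin's explanation, while "filtered" would suggest that packets are being dropped on the way.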

Since the default port 22 does not work for ssh on hostname1, I configured my ~/.ssh/config (on hostname1) as follows to connect over port xx:
Host hostname2
      Hostname hostname2
      PORT xx

On hostname2, ~/.ssh/config is empty, since I can ssh through port 22 there.

Before starting the process, I restrict the port range as follows:
export MPIEXEC_PORT_RANGE=xx:yy
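
A range given as low:high can be sanity-checked before it is exported. This is only a sketch: count_ports is a hypothetical helper, and 50000:50005 is an assumed stand-in for the elided xx:yy, chosen so that it covers 6 ports.

```shell
# Sketch: validate a low:high port range and count the ports it covers.
# count_ports is a hypothetical helper; 50000:50005 is a placeholder range.
count_ports() {
    local low high
    case "$1" in
        *[!0-9:]*|*:*:*|:*|*:|'') echo "invalid range: $1" >&2; return 1 ;;
        *:*) ;;                                   # exactly one colon, digits on both sides
        *)   echo "invalid range: $1" >&2; return 1 ;;
    esac
    low=${1%%:*}     # part before the colon
    high=${1##*:}    # part after the colon
    if [ "$low" -gt "$high" ]; then
        echo "invalid range: $1" >&2
        return 1
    fi
    echo $((high - low + 1))
}

range=50000:50005
count_ports "$range"    # prints 6
```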

Starting the process:
            mpiexec  -f hostfile -np 2  ./cpi

The hostfile contains:
      Hostname1
      Hostname2
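
To see which host names a launcher would be given, the hostfile can be listed with a few lines of shell. This is a sketch under assumptions: list_hosts is a hypothetical helper, and it assumes a Hydra-style file with one host per line, an optional ":N" process count, and "#" comments. It does not claim to reproduce how mpiexec itself parses the file.

```shell
# Sketch: print the host names found in a hostfile, one per line.
# list_hosts is a hypothetical helper; the file format is an assumption.
list_hosts() {
    local line host
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%%#*}                                  # drop comments
        host=$(printf '%s' "$line" | tr -d '[:space:]')   # drop all whitespace
        host=${host%%:*}                                  # drop an optional ":N" count
        if [ -n "$host" ]; then
            echo "$host"
        fi
    done < "$1"
}

# Example: check non-interactive ssh access to every listed host.
#   for h in $(list_hosts hostfile); do ssh -o BatchMode=yes "$h" hostname; done
```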

This is the error output when running the example:
Process 0 of 2 is on hostname1
Process 1 of 2 is on hostname2
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0xbfca84f8, rbuf=0xbfca84f0, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1


Running cpi on hostname1 alone or on hostname2 alone works without problems:
Process 0 of 2 is on hostname1
Process 1 of 2 is on hostname1
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000408

I hope I have provided enough information. Does anyone have an idea which setting is incorrect?

Best regards,

Josef
