[mpich-discuss] error with MPI_Reduce running cpi

Zhou, Hui zhouh at anl.gov
Mon Jun 17 12:59:04 CDT 2019


The error message says that process 0 tried to connect to process 1 and the connection was refused. A common reason is firewall rules. The processes listen on a random port, such as 23456, so even when ssh works (port 22), the other ports may still be blocked.
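
One way to check whether a non-ssh port is reachable between the two hosts is a small TCP probe. The sketch below is not part of MPICH: the function names, the timeouts, and the use of port 23456 (the illustrative number from the paragraph above) are assumptions made for this example only.

```python
import socket


def listen_once(port, timeout=60.0):
    """Accept a single TCP connection on `port` and return the peer's address.

    Returns None if nobody connects within `timeout` seconds.
    Hypothetical helper for this check, not an MPICH API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", port))          # all interfaces; port 0 lets the OS pick
        srv.listen(1)
        srv.settimeout(timeout)
        try:
            conn, peer = srv.accept()
        except socket.timeout:
            return None
        conn.close()
        return peer[0]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and unreachable hosts.
        return False


# Example use (host names taken from the thread; the port is arbitrary):
#   on csews2:  listen_once(23456)
#   on csews1:  can_connect("csews2", 23456)
```

A refused connection is only meaningful while the listener is actually running on the other host, since connecting to a port with no listener is refused even without a firewall. If the listener is up and can_connect still returns False while ssh works, filtering of non-ssh ports between the hosts is a likely cause.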

—
Hui Zhou









On Jun 17, 2019, at 11:28 AM, Jinang_Shah <sjinang at iitk.ac.in> wrote:


1) $ ./configure --prefix=/users/misc/sjinang/mpich-install

2) No. It is showing an error:

   $ mpiexec -n 2 -f hostfile ./mpi/mpich-3.3.1/examples/send_recv
   Fatal error in PMPI_Send: Unknown error class, error stack:
   PMPI_Send(159).............: MPI_Send(buf=0x7ffe5d59e2a4, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
   MPID_nem_tcp_connpoll(1845): Communication error with rank 1: Connection refused

3) I have tested ssh from both sides and it works fine. Can you see the problem now?

Thanks.




On 17-06-2019 21:28, Zhou, Hui wrote:

Hi Jinang_Shah,

Could you list your configure line (try `head config.log`)?

If you try a simple example where process 0 sends a short message to process 1, do you get a similar error?

Can csews1 and csews2 connect to each other freely, i.e., is there a firewall, router, etc. between these two hosts?

—
Hui Zhou





On Jun 17, 2019, at 10:11 AM, Jinang_Shah via discuss <discuss at mpich.org> wrote:


$ mpiexec -n 2 -f hostfile ./mpi/mpich-3.3.1/examples/cpi
Process 1 of 2 is on csews2
Process 0 of 2 is on csews1
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(523)................: MPI_Reduce(sbuf=0x7ffeab580c10, rbuf=0x7ffeab580c18, count=1, datatype=MPI_DOUBLE, op=MPI_SUM, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Reduce(509)................:
MPIR_Reduce_impl(316)...........:
MPIR_Reduce_intra_auto(231).....:
MPIR_Reduce_intra_binomial(125).:
MPIDI_CH3U_Recvq_FDU_or_AEP(629): Communication error with rank 1


Can anyone explain this error and how to overcome it? I have installed just MPICH and not the HYDRA package.

By the way, this same command works fine with the hello program, the one where each node reports its identity.

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
