[mpich-discuss] having problem running MPICH on multiple nodes

Kenneth Raffenetti raffenet at mcs.anl.gov
Wed Nov 26 09:25:38 CST 2014


The "Connection refused" error makes me think a firewall is getting in the 
way. Is TCP communication limited to specific ports on the cluster? If so, 
you can use this environment variable to restrict MPICH to a range of ports.

MPIR_CVAR_CH3_PORT_RANGE
     Description: The MPIR_CVAR_CH3_PORT_RANGE environment variable 
allows you to specify the range of TCP ports to be used by the process 
manager and the MPICH library. The format of this variable is 
<low>:<high>.  To specify any available port, use 0:0.
     Default: {0,0}
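
For example, if the cluster only allows TCP traffic on ports 50000-50100 
(the actual permitted range is something you would need to confirm with 
your cluster admin), you could set the variable on the mpirun command line:

     MPIR_CVAR_CH3_PORT_RANGE=50000:50100 mpirun -hostfile hosts-hydra -np 2 test_dup

Depending on how your environment is propagated to the remote nodes, you 
may instead need to pass it explicitly with mpiexec's -genv option.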

On 11/25/2014 11:50 PM, Amin Hassani wrote:
> Tried with the new configure too. Same problem :(
>
> $ mpirun -hostfile hosts-hydra -np 2  test_dup
> Fatal error in MPI_Send: Unknown error class, error stack:
> MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1,
> MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
> MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection
> refused
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 5459 RUNNING AT oakmnt-0-a
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb
> (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
> [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event
> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback
> returned error status
> [proxy:0:1 at oakmnt-0-b] main
> (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion
> (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of
> the processes terminated badly; aborting
> [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion
> (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher
> returned error waiting for completion
> [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion
> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
> returned error waiting for completion
> [mpiexec at oakmnt-0-a] main
> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
> waiting for completion
>
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>
>     So the error only happens when there is communication.
>
>     It may be caused by IB as you guessed before. Could you try to
>     reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp"
>     and try again?
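>
>     (For reference, a typical rebuild would look something like the
>     following; the install prefix here is only an example, use whatever
>     prefix you normally install into:
>
>         ./configure --with-device=ch3:nemesis:tcp --prefix=$HOME/usr/mpi
>         make && make install
>
>     and then recompile the test program against the rebuilt MPICH.)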
>
>     Huiwei
>
>      > On Nov 25, 2014, at 11:23 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>      >
>      > Yes it works.
>      > output:
>      >
>      > $ mpirun -hostfile hosts-hydra -np 2  test
>      > rank 1
>      > rank 0
>      >
>      >
>      > Amin Hassani,
>      > CIS department at UAB,
>      > Birmingham, AL, USA.
>      >
>      > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>      > Could you try to run the following simple code to see if it works?
>      >
>      > #include <mpi.h>
>      > #include <stdio.h>
>      > int main(int argc, char** argv)
>      > {
>      >     int rank, size;
>      >     MPI_Init(&argc, &argv);
>      >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>      >     printf("rank %d\n", rank);
>      >     MPI_Finalize();
>      >     return 0;
>      > }
>      >
>      > —
>      > Huiwei
>      >
>      > > On Nov 25, 2014, at 11:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>      > >
>      > > No, I checked. Also, I always install my MPIs in
>     /nethome/students/ahassani/usr/mpi; I never install them in
>     /nethome/students/ahassani/usr, so MPI files will never get there.
>     Even if I put /usr/mpi/bin in front of /usr/bin, it won't affect
>     anything. There has never been any MPI installed in /usr/bin.
>      > >
>      > > Thank you.
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

