[mpich-discuss] having problem running MPICH on multiple nodes

Amin Hassani ahassani at cis.uab.edu
Wed Nov 26 16:25:05 CST 2014


I disabled the firewall entirely on those machines, but I still get the same
problem: connection refused.
I also ran the program on a completely different set of machines that we
have, but the problem is the same there.
Any other thoughts on where the problem could be?

Thanks.

Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.

On Wed, Nov 26, 2014 at 9:25 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov>
wrote:

> The "connection refused" error makes me think a firewall is getting in the
> way. Is TCP communication limited to specific ports on the cluster? If so,
> you can use this environment variable to restrict MPICH to a range of ports.
>
> MPIR_CVAR_CH3_PORT_RANGE
>     Description: The MPIR_CVAR_CH3_PORT_RANGE environment variable allows
> you to specify the range of TCP ports to be used by the process manager and
> the MPICH library. The format of this variable is <low>:<high>.  To specify
> any available port, use 0:0.
>     Default: {0,0}
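>
> For example, you could pass the variable to all launched processes with
> mpiexec's -genv option. The range 50000:51000 below is only illustrative;
> use whichever ports are actually open between your nodes:
>
>     $ mpirun -genv MPIR_CVAR_CH3_PORT_RANGE 50000:51000 \
>              -hostfile hosts-hydra -np 2 test_dup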
>
>
> On 11/25/2014 11:50 PM, Amin Hassani wrote:
>
>> Tried with the new configure too; same problem :(
>>
>> $ mpirun -hostfile hosts-hydra -np 2  test_dup
>> Fatal error in MPI_Send: Unknown error class, error stack:
>> MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1,
>> MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
>> MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection
>> refused
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   PID 5459 RUNNING AT oakmnt-0-a
>> =   EXIT CODE: 1
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb
>> (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed)
>> failed
>> [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event
>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback
>> returned error status
>> [proxy:0:1 at oakmnt-0-b] main
>> (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
>> waiting for event
>> [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion
>> (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of
>> the processes terminated badly; aborting
>> [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion
>> (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher
>> returned error waiting for completion
>> [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion
>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
>> returned error waiting for completion
>> [mpiexec at oakmnt-0-a] main
>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>> waiting for completion
>>
>>
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>
>>     So the error only happens when there is communication.
>>
>>     It may be caused by IB, as you guessed before. Could you try to
>>     reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp"
>>     and try again?
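>>
>>     For example (the --prefix path below is only a placeholder; substitute
>>     the install location you normally use), the rebuild would look roughly
>>     like this:
>>
>>         $ ./configure --with-device=ch3:nemesis:tcp --prefix=$HOME/usr/mpi/mpich-tcp
>>         $ make && make install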
>>
>>     Huiwei
>>
>>      > On Nov 25, 2014, at 11:23 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>      >
>>      > Yes it works.
>>      > output:
>>      >
>>      > $ mpirun -hostfile hosts-hydra -np 2  test
>>      > rank 1
>>      > rank 0
>>      >
>>      >
>>      > Amin Hassani,
>>      > CIS department at UAB,
>>      > Birmingham, AL, USA.
>>      >
>>      > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>      > Could you try to run the following simple code to see if it works?
>>      >
>>      > #include <mpi.h>
>>      > #include <stdio.h>
>>      > int main(int argc, char** argv)
>>      > {
>>      >     int rank;
>>      >     MPI_Init(&argc, &argv);
>>      >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>      >     printf("rank %d\n", rank);
>>      >     MPI_Finalize();
>>      >     return 0;
>>      > }
>>      >
>>      > —
>>      > Huiwei
>>      >
>>      > > On Nov 25, 2014, at 11:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>      > >
>>      > > No, I checked. Also, I always install my MPI builds in
>>     /nethome/students/ahassani/usr/mpi. I never install them in
>>     /nethome/students/ahassani/usr, so MPI files will never get there.
>>     Even if I put the /usr/mpi/bin in front of /usr/bin, it won't affect
>>     anything. There has never been any MPI installed in /usr/bin.
>>      > >
>>      > > Thank you.
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


More information about the discuss mailing list