No subject
Tue Jun 18 13:52:11 CDT 2019

I have no idea. You may try to trace all events as described at
http://wiki.mpich.org/mpich/index.php/Debug_Event_Logging
From the trace log, one may find out something abnormal.

--Junchao Zhang
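(If I remember that wiki page correctly, event logging needs an MPICH build
configured with --enable-g=log, after which the MPICH_DBG_CLASS and
MPICH_DBG_LEVEL environment variables control what gets traced; please
double-check the exact names on the page. A sketch of one run under those
assumptions:

  $ MPICH_DBG_CLASS=ALL MPICH_DBG_LEVEL=VERBOSE mpirun -hostfile hosts-hydra -np 2 test_dup

Then look through the resulting trace output for the failing connect.)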
On Wed, Nov 26, 2014 at 4:25 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> I disabled the whole firewall on those machines, but I still get the same
> problem: connection refused.
> I ran the program on a completely different set of machines that we have,
> but the problem is the same.
> Any other thoughts on where the problem could be?
>
> Thanks.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Wed, Nov 26, 2014 at 9:25 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov>
> wrote:
>
>> The connection refused makes me think a firewall is getting in the way.
>> Is TCP communication limited to specific ports on the cluster? If so, you
>> can use this envvar to enforce a range of ports in MPICH.
>>
>> MPIR_CVAR_CH3_PORT_RANGE
>> Description: The MPIR_CVAR_CH3_PORT_RANGE environment variable allows
>> you to specify the range of TCP ports to be used by the process manager and
>> the MPICH library. The format of this variable is <low>:<high>. To specify
>> any available port, use 0:0.
>> Default: {0,0}
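>>
>> For example, if the cluster only allows TCP on ports 50000-50100 (a
>> made-up range here; substitute whatever your admins have actually opened),
>> you would launch with:
>>
>>   $ MPIR_CVAR_CH3_PORT_RANGE=50000:50100 mpirun -hostfile hosts-hydra -np 2 test_dup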
>>
>>
>> On 11/25/2014 11:50 PM, Amin Hassani wrote:
>>
>>> Tried with the new configure too. Same problem :(
>>>
>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>> Fatal error in MPI_Send: Unknown error class, error stack:
>>> MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1,
>>> MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
>>> MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection
>>> refused
>>>
>>> ===================================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   PID 5459 RUNNING AT oakmnt-0-a
>>> =   EXIT CODE: 1
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> ===================================================================================
>>> [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb
>>> (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed)
>>> failed
>>> [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event
>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback
>>> returned error status
>>> [proxy:0:1 at oakmnt-0-b] main
>>> (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
>>> waiting for event
>>> [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion
>>> (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of
>>> the processes terminated badly; aborting
>>> [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion
>>> (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher
>>> returned error waiting for completion
>>> [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion
>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
>>> returned error waiting for completion
>>> [mpiexec at oakmnt-0-a] main
(../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>> waiting for completion
>>>
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>>
>>> So the error only happens when there is communication.
>>>
>>> It may be caused by IB, as you guessed before. Could you try to
>>> reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp"
>>> and try again?
>>>
>>> --
>>> Huiwei
>>>
>>> > On Nov 25, 2014, at 11:23 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>> >
>>> > Yes it works.
>>> > output:
>>> >
>>> > $ mpirun -hostfile hosts-hydra -np 2 test
>>> > rank 1
>>> > rank 0
>>> >
>>> >
>>> > Amin Hassani,
>>> > CIS department at UAB,
>>> > Birmingham, AL, USA.
>>> >
>>> > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>> > Could you try to run the following simple code to see if it works?
>>> >
>>> > #include <mpi.h>
>>> > #include <stdio.h>
>>> > int main(int argc, char** argv)
>>> > {
>>> > int rank, size;
>>> > MPI_Init(&argc, &argv);
>>> > MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> > printf("rank %d\n", rank);
>>> > MPI_Finalize();
>>> > return 0;
>>> > }
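>>> > (Save it as test.c, build it with the MPI wrapper compiler, e.g.
>>> > "mpicc test.c -o test", and launch it the same way as your failing
>>> > program.)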
>>> >
>>> > --
>>> > Huiwei
>>> >
>>> > > On Nov 25, 2014, at 11:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>> > >
>>> > > No, I checked. Also, I always install my MPIs in
>>> /nethome/students/ahassani/usr/mpi; I never install them in
>>> /nethome/students/ahassani/usr, so the MPI files will never end up there.
>>> Even if I put /usr/mpi/bin in front of /usr/bin, it won't affect
>>> anything. There has never been any MPI installed in /usr/bin.
>>> > >
>>> > > Thank you.
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss