I have no idea. You may try to trace all events as described at
http://wiki.mpich.org/mpich/index.php/Debug_Event_Logging
From the trace log, one may find out something abnormal.
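
A rough sketch of what that tracing setup usually looks like (the configure
option and environment-variable names below are from memory, so please verify
them against the wiki page above):

$ ./configure --enable-g=dbg,log ...   # rebuild MPICH with debug logging compiled in
$ MPICH_DBG_CLASS=ALL MPICH_DBG_LEVEL=VERBOSE mpirun -hostfile hosts-hydra -np 2 test_dup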


--Junchao Zhang

On Wed, Nov 26, 2014 at 4:25 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:

> I disabled the whole firewall on those machines but still get the same
> problem: connection refused.
> I also ran the program on another, totally different set of machines that
> we have, but still the same problem.
> Any other thoughts on where the problem could be?
>
> Thanks.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Wed, Nov 26, 2014 at 9:25 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
>
>> The connection refused makes me think a firewall is getting in the way.
>> Is TCP communication limited to specific ports on the cluster? If so, you
>> can use this envvar to enforce a range of ports in MPICH.
>>
>> MPIR_CVAR_CH3_PORT_RANGE
>>     Description: The MPIR_CVAR_CH3_PORT_RANGE environment variable allows
>> you to specify the range of TCP ports to be used by the process manager and
>> the MPICH library. The format of this variable is <low>:<high>. To specify
>> any available port, use 0:0.
>>     Default: {0,0}
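>>
>> For example, with a hypothetical open range of 50000-51000 on the cluster
>> nodes, the range can be forwarded to every process via mpiexec's -genv
>> option:
>>
>> $ mpirun -genv MPIR_CVAR_CH3_PORT_RANGE 50000:51000 -hostfile hosts-hydra -np 2 test_dup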
>>
>>
>> On 11/25/2014 11:50 PM, Amin Hassani wrote:
>>
>>> Tried with the new configure too. same problem :(
>>>
>>> $ mpirun -hostfile hosts-hydra -np 2  test_dup
>>> Fatal error in MPI_Send: Unknown error class, error stack:
>>> MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1,
>>> MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
>>> MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection
>>> refused
>>>
>>> ===================================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   PID 5459 RUNNING AT oakmnt-0-a
>>> =   EXIT CODE: 1
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> ===================================================================================
>>> [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb
>>> (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed)
>>> failed
>>> [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event
>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback
>>> returned error status
>>> [proxy:0:1 at oakmnt-0-b] main
>>> (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
>>> waiting for event
>>> [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion
>>> (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of
>>> the processes terminated badly; aborting
>>> [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion
>>> (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher
>>> returned error waiting for completion
>>> [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion
>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
>>> returned error waiting for completion
>>> [mpiexec at oakmnt-0-a] main
>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>> waiting for completion
>>>
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>>
>>>     So the error only happens when there is communication.
>>>
>>>     It may be caused by IB as you guessed before. Could you try to
>>>     reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp"
>>>     and try again?
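>>>
>>>     A minimal rebuild sketch (the install prefix below is the one
>>>     mentioned elsewhere in this thread; adjust it to your setup):
>>>
>>>     $ ./configure --prefix=/nethome/students/ahassani/usr/mpi --with-device=ch3:nemesis:tcp
>>>     $ make && make install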
>>>
>>>     --
>>>     Huiwei
>>>
>>>      > On Nov 25, 2014, at 11:23 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>      >
>>>      > Yes it works.
>>>      > output:
>>>      >
>>>      > $ mpirun -hostfile hosts-hydra -np 2  test
>>>      > rank 1
>>>      > rank 0
>>>      >
>>>      >
>>>      > Amin Hassani,
>>>      > CIS department at UAB,
>>>      > Birmingham, AL, USA.
>>>      >
>>>      > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>>      > Could you try to run the following simple code to see if it works?
>>>      >
>>>      > #include <mpi.h>
>>>      > #include <stdio.h>
>>>      > int main(int argc, char** argv)
>>>      > {
>>>      >     int rank, size;
>>>      >     MPI_Init(&argc, &argv);
>>>      >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>      >     printf("rank %d\n", rank);
>>>      >     MPI_Finalize();
>>>      >     return 0;
>>>      > }
>>>      >
>>>      > --
>>>      > Huiwei
>>>      >
>>>      > > On Nov 25, 2014, at 11:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>      > >
>>>      > > No, I checked. Also, I always install my MPIs in
>>>      > > /nethome/students/ahassani/usr/mpi. I never install them in
>>>      > > /nethome/students/ahassani/usr, so MPI files will never get there.
>>>      > > Even if I put /usr/mpi/bin in front of /usr/bin, it won't affect
>>>      > > anything. There has never been any MPI installed in /usr/bin.
>>>      > >
>>>      > > Thank you.

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss