[mpich-discuss] start MPD daemons on 2 different subnets
Pavan Balaji
balaji at mcs.anl.gov
Thu Jan 17 18:09:41 CST 2013
Hello,
You need to be able to ssh between the nodes in both directions. From the
error below, it looks like ssh from one node to the other is not working
with those IP addresses.
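
A quick way to check (the addresses below are the ones from your hosts
file; each command should print the remote hostname without timing out or
asking for a password):

    # from node_001 (192.168.0.1)
    ssh 192.168.1.1 hostname

    # from node_002 (192.168.1.1)
    ssh 192.168.0.1 hostname

If either command hangs or times out, the problem is in the routing,
firewall, or sshd configuration between the two subnets rather than in
Hydra itself.
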
-- Pavan
On 01/17/2013 06:08 PM US Central Time, Na Zhang wrote:
> Hello Dave,
>
> Thanks for your reply.
> We followed your advice and installed Hydra on each node.
>
> We specify the IP addresses in the hosts file. For example:
>
> shell $ mpiexec -f hosts -np 2 ./app
> shell $ cat hosts
> 192.168.0.1
> 192.168.1.1
>
> (The two node IPs belong to two different subnets: subnet #1 is
> 192.168.0.0/24 and subnet #2 is 192.168.1.0/24.)
>
> The error output is
> “ssh connect to host 192.168.1.1 port 22: connection time out”.
>
> So, is there an option for Hydra that solves this problem?
>
> Thank you!
>
> Sincerely,
> Na Zhang
>
> On Fri, Jan 11, 2013 at 5:20 PM, <discuss-request at mpich.org> wrote:
>
> Send discuss mailing list submissions to
> discuss at mpich.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.mpich.org/mailman/listinfo/discuss
> or, via email, send a message with subject or body 'help' to
> discuss-request at mpich.org
>
> You can reach the person managing the list at
> discuss-owner at mpich.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of discuss digest..."
>
>
> Today's Topics:
>
> 1. Re: [PATCH] Use attribute layout_compatible for pair types
> (Jed Brown)
> 2. Re: [PATCH] Use attribute layout_compatible for pair types
> (Dmitri Gribenko)
> 3. start MPD daemons on 2 different subnets (Na Zhang)
> 4. Re: start MPD daemons on 2 different subnets (Dave Goodell)
> 5. Fatal error in PMPI_Reduce (Michael Colonno)
> 6. Re: Fatal error in PMPI_Reduce (Pavan Balaji)
> 7. Re: Fatal error in PMPI_Reduce (Pavan Balaji)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 9 Jan 2013 14:00:50 -0600
> From: Jed Brown <jedbrown at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] [PATCH] Use attribute layout_compatible
> for pair types
> Message-ID:
> <CAM9tzSnqJHaj6wbKBdAWp5YveG+UW_OWiA768GRb1spHjn+TZw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Jan 9, 2013 12:56 PM, "Dmitri Gribenko" <gribozavr at gmail.com> wrote:
>
> > On Wed, Jan 9, 2013 at 8:19 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> > > Both implemented and pushed as d440abb and ac15f7a. Thanks.
> > >
> > > -Dave
> > >
> > > On Jan 1, 2013, at 11:14 PM CST, Jed Brown wrote:
> > >
> > >> In addition, I suggest guarding these definitions. Leaving these in
> > >> increases the total number of symbols in an example executable linking
> > >> PETSc by a factor of 2. (They're all read-only, but they're still there.)
> > >> Clang is smart enough to remove these, presumably because it understands
> > >> the special attributes.
> >
> > No, LLVM removes these not because of the attributes, but because
> > these are unused. And when they are used, most of the time they don't
> > have their address taken, so their value is propagated to the point
> > where they are read and the constants again become unused.
> >
> > I'd expect GCC to be smart enough to do the same. Do you compile
> > with optimization?
> >
>
> Dmitri, as discussed in the other thread, it's smart enough, but only
> when optimization is turned on. There's no reason to make debug builds
> heavier than necessary. This is not a big deal either way.
>
> ------------------------------
>
> Message: 2
> Date: Wed, 9 Jan 2013 22:57:14 +0200
> From: Dmitri Gribenko <gribozavr at gmail.com>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] [PATCH] Use attribute layout_compatible
> for pair types
> Message-ID:
> <CA+Y5xYeBp974pDiL0QFAhjxpeqpB2Xykjx-atYFtLWQ2Oq+aoA at mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> On Wed, Jan 9, 2013 at 10:00 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> > On Jan 9, 2013 12:56 PM, "Dmitri Gribenko" <gribozavr at gmail.com> wrote:
> >> I'd expect GCC to be smart enough to do the same. Do you compile
> >> with optimization?
> >
> > Dmitri, as discussed in the other thread, it's smart enough, but only
> > when optimization is turned on. There's no reason to make debug builds
> > heavier than necessary. This is not a big deal either way.
>
> Oh, now I see -- in debug builds it still emits these. Thank you
> for fixing!
>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 11 Jan 2013 13:55:34 -0500
> From: Na Zhang <na.zhang at stonybrook.edu>
> To: discuss at mpich.org
> Subject: [mpich-discuss] start MPD daemons on 2 different subnets
> Message-ID:
> <CAFbC_ZLoN6vwy3JnwvOpWR50tff8zn1qWt2WcxxfKHghigRzDw at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Dear developers,
>
> We want to start MPD daemons on 2 different subnets: for example,
> subnet #1 192.168.0.0/24 and subnet #2: 192.168.1.0/24
>
> The two subnets are connected via switches and they can talk to each
> other. Next, we'd start MPD daemons on two nodes:
>
> Node #1 (in subnet #1): (hostname:node_001) IP=192.168.0.1
> Node #2 (in subnet #2): (hostname:node_002) IP=192.168.1.1
>
> We used the following commands:
>
> on Node #1: mpd --ifhn=192.168.0.1 --daemon
> (daemon is successfully started on node_001)
>
> on Node #2: mpd -h node_001 -p <node1's_port_number>
> --ifhn=192.168.1.1 --daemon
> (the daemon cannot be started on Node #2. There is no error message;
> when we run "mpdtrace" on Node #1, it shows no daemon started on
> Node #2.)
>
> Node #2 cannot join the ring that is generated by node #1.
>
> What should we do?
>
> Thank you in advance.
>
> Sincerely,
> Na Zhang
>
> --
> Na Zhang, Ph.D. Candidate
> Dept. of Applied Mathematics and Statistics
> Stony Brook University
> Phone: 631-838-3205
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 11 Jan 2013 12:59:07 -0600
> From: Dave Goodell <goodell at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] start MPD daemons on 2 different subnets
> Message-ID: <F78FA019-E83D-40C1-A1CE-26C778411507 at mcs.anl.gov>
> Content-Type: text/plain; charset=us-ascii
>
> Use hydra instead of MPD:
>
> http://wiki.mpich.org/mpich/index.php/FAQ#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F
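>
> With Hydra there is no daemon ring to boot: you pass mpiexec a host
> file listing the machines. A minimal sketch (the hostnames and the
> application name are placeholders):
>
>   $ cat hosts
>   node_001
>   node_002
>   $ mpiexec -f hosts -n 2 ./app
>
> Hydra launches the remote processes over ssh by default, so
> password-less ssh between the nodes is still required.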
>
> -Dave
>
> On Jan 11, 2013, at 12:55 PM CST, Na Zhang wrote:
>
> > Dear developers,
> >
> > We want to start MPD daemons on 2 different subnets: for example,
> > subnet #1 192.168.0.0/24 and subnet #2: 192.168.1.0/24
> >
> > The two subnets are connected via switches and they can talk to each
> > other. Next, we'd start MPD daemons on two nodes:
> >
> > Node #1 (in subnet #1): (hostname:node_001) IP=192.168.0.1
> > Node #2 (in subnet #2): (hostname:node_002) IP=192.168.1.1
> >
> > We used the following commands:
> >
> > on Node #1: mpd --ifhn=192.168.0.1 --daemon
> > (daemon is successfully started on node_001)
> >
> > on Node #2: mpd -h node_001 -p <node1's_port_number>
> > --ifhn=192.168.1.1 --daemon
> > (the daemon cannot be started on Node #2. There is no error message;
> > when we run "mpdtrace" on Node #1, it shows no daemon started on
> > Node #2.)
> >
> > Node #2 cannot join the ring that is generated by node #1.
> >
> > What should we do?
> >
> > Thank you in advance.
> >
> > Sincerely,
> > Na Zhang
> >
> > --
> > Na Zhang, Ph.D. Candidate
> > Dept. of Applied Mathematics and Statistics
> > Stony Brook University
> > Phone: 631-838-3205
> > _______________________________________________
> > discuss mailing list discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
>
>
>
> ------------------------------
>
> Message: 5
> Date: Fri, 11 Jan 2013 13:31:48 -0800
> From: "Michael Colonno" <mcolonno at stanford.edu
> <mailto:mcolonno at stanford.edu>>
> To: <discuss at mpich.org <mailto:discuss at mpich.org>>
> Subject: [mpich-discuss] Fatal error in PMPI_Reduce
> Message-ID: <0cc801cdf043$10bc1890$323449b0$@stanford.edu
> <http://stanford.edu>>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi All ~
>
> I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on a
> CentOS 6.3 x64 system using SLURM as the process manager. My
> configure was simply:
>
> ./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
>
> No errors during build or install. When I compile and run the example
> program cxxcpi I get (truncated):
>
> $ srun -n32 /usr/local/apps/cxxcpi
> Fatal error in PMPI_Reduce: A process has failed, error stack:
> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
> rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0,
> MPI_COMM_WORLD) failed
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(779)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
> MPIR_Reduce_intra(799)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(206).......: Failure during collective
> srun: error: task 0: Exited with exit code 1
>
> This error is experienced with many of my MPI programs. A
> different application yields:
>
> PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT,
> root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1369).:
> MPIR_Bcast_intra(1160):
> MPIR_SMP_Bcast(1077)..: Failure during collective
>
> Can anyone point me in the right direction?
>
> Thanks,
> ~Mike C.
>
>
> ------------------------------
>
> Message: 6
> Date: Fri, 11 Jan 2013 16:19:23 -0600
> From: Pavan Balaji <balaji at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce
> Message-ID: <50F08FEB.1050901 at mcs.anl.gov>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Michael,
>
> Did you try just using mpiexec?
>
> mpiexec -n 32 /usr/local/apps/cxxcpi
>
> -- Pavan
>
> On 01/11/2013 03:31 PM US Central Time, Michael Colonno wrote:
> > Hi All ~
> >
> > I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on a
> > CentOS 6.3 x64 system using SLURM as the process manager. My
> > configure was simply:
> >
> > ./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
> >
> > No errors during build or install. When I compile and run the example
> > program cxxcpi I get (truncated):
> >
> > $ srun -n32 /usr/local/apps/cxxcpi
> > Fatal error in PMPI_Reduce: A process has failed, error stack:
> > PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
> > rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0,
> > MPI_COMM_WORLD) failed
> > MPIR_Reduce_impl(1029)..........:
> > MPIR_Reduce_intra(779)..........:
> > MPIR_Reduce_impl(1029)..........:
> > MPIR_Reduce_intra(835)..........:
> > MPIR_Reduce_binomial(144).......:
> > MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
> > MPIR_Reduce_intra(799)..........:
> > MPIR_Reduce_impl(1029)..........:
> > MPIR_Reduce_intra(835)..........:
> > MPIR_Reduce_binomial(206).......: Failure during collective
> > srun: error: task 0: Exited with exit code 1
> >
> > This error is experienced with many of my MPI programs. A
> > different application yields:
> >
> > PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT,
> > root=0, MPI_COMM_WORLD) failed
> > MPIR_Bcast_impl(1369).:
> > MPIR_Bcast_intra(1160):
> > MPIR_SMP_Bcast(1077)..: Failure during collective
> >
> > Can anyone point me in the right direction?
> >
> > Thanks,
> > ~Mike C.
> >
> > _______________________________________________
> > discuss mailing list discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
> >
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>
> ------------------------------
>
> Message: 7
> Date: Fri, 11 Jan 2013 16:20:00 -0600
> From: Pavan Balaji <balaji at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce
> Message-ID: <50F09010.5010003 at mcs.anl.gov>
> Content-Type: text/plain; charset=ISO-8859-1
>
>
> FYI, the reason I suggested this is because mpiexec will automatically
> detect and use slurm internally.
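>
> For example, from inside an existing SLURM allocation (a sketch; node
> and process counts are illustrative, and it assumes an mpiexec built
> with Hydra, MPICH's default process manager, is on your PATH; a build
> configured with --with-pm=no may not include one):
>
>   salloc -N 2
>   mpiexec -n 32 /usr/local/apps/cxxcpi
>
> Hydra should then detect the SLURM environment and place the processes
> on the allocated nodes.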
>
> -- Pavan
>
> On 01/11/2013 04:19 PM US Central Time, Pavan Balaji wrote:
> > Michael,
> >
> > Did you try just using mpiexec?
> >
> > mpiexec -n 32 /usr/local/apps/cxxcpi
> >
> > -- Pavan
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>
> ------------------------------
>
> _______________________________________________
> discuss mailing list
> discuss at mpich.org
> https://lists.mpich.org/mailman/listinfo/discuss
>
> End of discuss Digest, Vol 3, Issue 9
> *************************************
>
>
>
>
> --
> Sincerely,
>
> Na Zhang, Ph.D. Candidate
> Dept. of Applied Mathematics and Statistics
> Stony Brook University
> Phone: 631-838-3205
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji