[mpich-discuss] start MPD daemons on 2 different subnets

Na Zhang na.zhang at stonybrook.edu
Thu Jan 17 18:08:21 CST 2013


Hello Dave,

Thanks for your reply.
We followed your advice and installed Hydra on each node.

We specify the node IP addresses in a hosts file and launch with mpiexec. For example:

shell $ cat hosts
192.168.0.1
192.168.1.1
shell $ mpiexec -f hosts -np 2 ./app

(The two node IPs belong to two different subnets: subnet #1 is 192.168.0.0/24
and subnet #2 is 192.168.1.0/24.)
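
To rule out a stale launcher, we first confirm that the mpiexec being picked up
is the Hydra one we just installed (a rough sketch of the checks; we assume the
-info flag is available in this build):

shell $ which mpiexec
shell $ mpiexec -info    # should show a HYDRA build with ssh among the launchers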

The error output is:
"ssh: connect to host 192.168.1.1 port 22: Connection timed out".
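
For reference, this is roughly how we are checking basic reachability from
Node #1 before launching (a sketch; "user" is a placeholder for the actual
account name):

shell $ ping -c 3 192.168.1.1             # is Node #2 reachable across the subnets at all?
shell $ ip route get 192.168.1.1          # which interface/gateway Node #1 uses to reach it
shell $ ssh -v user@192.168.1.1 hostname  # does the ssh step that Hydra relies on work?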

So is there an option in Hydra to solve this problem?
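
For example, once the ssh step itself works, would something along these lines
be the right way to tell Hydra which interface to use for MPI traffic (assuming
-iface is the relevant option here; eth0 is a placeholder for the interface
name on each node)?

shell $ mpiexec -f hosts -iface eth0 -np 2 ./app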

Thank you!

Sincerely,
Na Zhang

On Fri, Jan 11, 2013 at 5:20 PM, <discuss-request at mpich.org> wrote:

> Send discuss mailing list submissions to
>         discuss at mpich.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.mpich.org/mailman/listinfo/discuss
> or, via email, send a message with subject or body 'help' to
>         discuss-request at mpich.org
>
> You can reach the person managing the list at
>         discuss-owner at mpich.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of discuss digest..."
>
>
> Today's Topics:
>
>    1. Re:  [PATCH] Use attribute layout_compatible for  pair types
>       (Jed Brown)
>    2. Re:  [PATCH] Use attribute layout_compatible for  pair types
>       (Dmitri Gribenko)
>    3.  start MPD daemons on 2 different subnets (Na Zhang)
>    4. Re:  start MPD daemons on 2 different subnets (Dave Goodell)
>    5.  Fatal error in PMPI_Reduce (Michael Colonno)
>    6. Re:  Fatal error in PMPI_Reduce (Pavan Balaji)
>    7. Re:  Fatal error in PMPI_Reduce (Pavan Balaji)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 9 Jan 2013 14:00:50 -0600
> From: Jed Brown <jedbrown at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] [PATCH] Use attribute layout_compatible
>         for     pair types
> Message-ID:
>         <
> CAM9tzSnqJHaj6wbKBdAWp5YveG+UW_OWiA768GRb1spHjn+TZw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Jan 9, 2013 12:56 PM, "Dmitri Gribenko" <gribozavr at gmail.com> wrote:
>
> > On Wed, Jan 9, 2013 at 8:19 PM, Dave Goodell <goodell at mcs.anl.gov>
> wrote:
> > > Both implemented and pushed as d440abb and ac15f7a.  Thanks.
> > >
> > > -Dave
> > >
> > > On Jan 1, 2013, at 11:14 PM CST, Jed Brown wrote:
> > >
> > >> In addition, I suggest guarding these definitions. Leaving these in
> > increases the total number of symbols in an example executable linking
> > PETSc by a factor of 2. (They're all read-only, but they're still there.)
> > Clang is smart enough to remove these, presumably because it understands
> > the special attributes.
> >
> > No, LLVM removes these not because of the attributes, but because
> > these are unused.  And when they are used, most of the time they don't
> > have their address taken, so their value is propagated to the point
> > where they are read and the constants again become unused.
> >
> > I don't think GCC is any less capable of doing the same.  Do you compile
> > with optimization?
> >
>
> Dmitri, as discussed in the other thread, it's smart enough, but only when
> optimization is turned on. There's no reason to needlessly make debug
> builds heavier than necessary. This is not a big deal either way.
>
> ------------------------------
>
> Message: 2
> Date: Wed, 9 Jan 2013 22:57:14 +0200
> From: Dmitri Gribenko <gribozavr at gmail.com>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] [PATCH] Use attribute layout_compatible
>         for     pair types
> Message-ID:
>         <
> CA+Y5xYeBp974pDiL0QFAhjxpeqpB2Xykjx-atYFtLWQ2Oq+aoA at mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> On Wed, Jan 9, 2013 at 10:00 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> > On Jan 9, 2013 12:56 PM, "Dmitri Gribenko" <gribozavr at gmail.com> wrote:
> >> I don't think GCC is any less capable of doing the same.  Do you compile
> >> with optimization?
> >
> > Dmitri, as discussed in the other thread, it's smart enough, but only
> when
> > optimization is turned on. There's no reason to needlessly make debug
> builds
> > heavier than necessary. This is not a big deal either way.
>
> Oh, now I see -- in debug builds it still emits these.  Thank you for
> fixing!
>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 11 Jan 2013 13:55:34 -0500
> From: Na Zhang <na.zhang at stonybrook.edu>
> To: discuss at mpich.org
> Subject: [mpich-discuss] start MPD daemons on 2 different subnets
> Message-ID:
>         <
> CAFbC_ZLoN6vwy3JnwvOpWR50tff8zn1qWt2WcxxfKHghigRzDw at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Dear developers,
>
> We want to start MPD daemons on 2 different subnets: for example,
> subnet #1 192.168.0.0/24 and subnet #2: 192.168.1.0/24
>
> The two subnets are connected via switches and they can talk to each
> other. Next, we'd start MPD daemons on two nodes:
>
> Node #1 (in subnet #1): (hostname:node_001) IP=192.168.0.1
> Node #2 (in subnet #2): (hostname:node_002) IP=192.168.1.1
>
> We used the following commands:
>
> on Node #1: mpd --ifhn=192.168.0.1 --daemon
> (daemon is successfully started on node_001)
>
> on Node #2: mpd -h node_001 -p <node1's_port_number>
> --ifhn=192.168.1.1  --daemon
> (the daemon cannot be started on Node #2; there is no error message. When we
> run "mpdtrace" on Node #1, it shows no daemon running on Node #2.)
>
> Node #2 cannot join the ring that is generated by node #1.
>
> What should we do?
>
> Thank you in advance.
>
> Sincerely,
> Na Zhang
>
> --
> Na Zhang, Ph.D. Candidate
> Dept. of Applied Mathematics and Statistics
> Stony Brook University
> Phone: 631-838-3205
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 11 Jan 2013 12:59:07 -0600
> From: Dave Goodell <goodell at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] start MPD daemons on 2 different subnets
> Message-ID: <F78FA019-E83D-40C1-A1CE-26C778411507 at mcs.anl.gov>
> Content-Type: text/plain; charset=us-ascii
>
> Use hydra instead of MPD:
>
>
> http://wiki.mpich.org/mpich/index.php/FAQ#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F
>
> -Dave
>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Fri, 11 Jan 2013 13:31:48 -0800
> From: "Michael Colonno" <mcolonno at stanford.edu>
> To: <discuss at mpich.org>
> Subject: [mpich-discuss] Fatal error in PMPI_Reduce
> Message-ID: <0cc801cdf043$10bc1890$323449b0$@stanford.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi All ~
>
> I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on a CentOS 6.3
> x64 system using SLURM as the process manager. My configure was simply:
>
> ./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
>
> No errors during build or install. When I compile and run the example
> program cxxcpi I get (truncated):
>
> $ srun -n32 /usr/local/apps/cxxcpi
> Fatal error in PMPI_Reduce: A process has failed, error stack:
> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
> rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD)
> failed
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(779)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
> MPIR_Reduce_intra(799)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(206).......: Failure during collective
> srun: error: task 0: Exited with exit code 1
>
> This error is experienced with many of my MPI programs. A different
> application yields:
>
> PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT,
> root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1369).:
> MPIR_Bcast_intra(1160):
> MPIR_SMP_Bcast(1077)..: Failure during collective
>
> Can anyone point me in the right direction?
>
> Thanks,
> ~Mike C.
>
>
> ------------------------------
>
> Message: 6
> Date: Fri, 11 Jan 2013 16:19:23 -0600
> From: Pavan Balaji <balaji at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce
> Message-ID: <50F08FEB.1050901 at mcs.anl.gov>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Michael,
>
> Did you try just using mpiexec?
>
> mpiexec -n 32 /usr/local/apps/cxxcpi
>
>  -- Pavan
>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>
> ------------------------------
>
> Message: 7
> Date: Fri, 11 Jan 2013 16:20:00 -0600
> From: Pavan Balaji <balaji at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce
> Message-ID: <50F09010.5010003 at mcs.anl.gov>
> Content-Type: text/plain; charset=ISO-8859-1
>
>
> FYI, the reason I suggested this is because mpiexec will automatically
> detect and use slurm internally.
>
>  -- Pavan
>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>
> ------------------------------
>
> _______________________________________________
> discuss mailing list
> discuss at mpich.org
> https://lists.mpich.org/mailman/listinfo/discuss
>
> End of discuss Digest, Vol 3, Issue 9
> *************************************
>



-- 
Sincerely,

Na Zhang, Ph.D. Candidate
Dept. of Applied Mathematics and Statistics
Stony Brook University
Phone: 631-838-3205

