[mpich-discuss] start MPD daemons on 2 different subnets

Pavan Balaji balaji at mcs.anl.gov
Thu Jan 17 18:09:41 CST 2013


Hello,

You need to be able to ssh between the nodes in both directions.  From the
error below, it looks like you cannot ssh between the nodes using these IP
addresses.
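
As a quick check (a minimal sketch; substitute your actual node IPs and
hostnames), verify that each node can reach the other over ssh before
involving mpiexec at all:

  node_001$ ssh 192.168.1.1 hostname
  node_002$ ssh 192.168.0.1 hostname

If either command times out the way your mpiexec run did, the problem is
routing or firewalling between the two subnets, not Hydra.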

 -- Pavan

On 01/17/2013 06:08 PM US Central Time, Na Zhang wrote:
> Hello Dave,
> 
> Thanks for your reply.
> We followed your advice and installed Hydra on each node. 
> 
> We specify the IP addresses in the hosts file. For example:
> 
> shell $ mpiexec -f hosts -np 2 ./app
> shell $ cat hosts
> 192.168.0.1
> 192.168.1.1
> 
> (The two node IPs belong to 2 different subnets: subnet #1 is
> 192.168.0.0/24 and subnet #2 is 192.168.1.0/24.)
> 
> The error output is:
> "ssh: connect to host 192.168.1.1 port 22: Connection timed out".
> 
> So is there an option for Hydra to solve this problem?
> 
> Thank you!
> 
> Sincerely,
> Na Zhang
> 
> On Fri, Jan 11, 2013 at 5:20 PM, <discuss-request at mpich.org> wrote:
> 
> 
>     Today's Topics:
> 
>        1. Re: [PATCH] Use attribute layout_compatible for pair types
>           (Jed Brown)
>        2. Re: [PATCH] Use attribute layout_compatible for pair types
>           (Dmitri Gribenko)
>        3. start MPD daemons on 2 different subnets (Na Zhang)
>        4. Re: start MPD daemons on 2 different subnets (Dave Goodell)
>        5. Fatal error in PMPI_Reduce (Michael Colonno)
>        6. Re: Fatal error in PMPI_Reduce (Pavan Balaji)
>        7. Re: Fatal error in PMPI_Reduce (Pavan Balaji)
> 
> 
>     ----------------------------------------------------------------------
> 
>     Message: 1
>     Date: Wed, 9 Jan 2013 14:00:50 -0600
>     From: Jed Brown <jedbrown at mcs.anl.gov>
>     To: discuss at mpich.org
>     Subject: Re: [mpich-discuss] [PATCH] Use attribute layout_compatible
>             for pair types
> 
>     On Jan 9, 2013 12:56 PM, "Dmitri Gribenko" <gribozavr at gmail.com> wrote:
> 
>     > On Wed, Jan 9, 2013 at 8:19 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>     > > Both implemented and pushed as d440abb and ac15f7a.  Thanks.
>     > >
>     > > -Dave
>     > >
>     > > On Jan 1, 2013, at 11:14 PM CST, Jed Brown wrote:
>     > >
>     > >> In addition, I suggest guarding these definitions. Leaving these in
>     > increases the total number of symbols in an example executable linking
>     > PETSc by a factor of 2. (They're all read-only, but they're still
>     > there.) Clang is smart enough to remove these, presumably because it
>     > understands the special attributes.
>     >
>     > No, LLVM removes these not because of the attributes, but because
>     > these are unused.  And when they are used, most of the time they don't
>     > have their address taken, so their value is propagated to the point
>     > where they are read and the constants again become unused.
>     >
>     > I would be surprised if GCC weren't smart enough to do the same.  Do
>     > you compile with optimization?
>     >
> 
>     Dmitri, as discussed in the other thread, it's smart enough, but only
>     when optimization is turned on.  There's no reason to make debug builds
>     heavier than necessary.  This is not a big deal either way.
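> 
>     (Not from the original message; a minimal sketch of the effect being
>     discussed.  The file and variable names are made up; the real tables
>     live in mpi.h.  The behavior noted in the comments is typical of GCC
>     but not guaranteed for every version.)
> 
>     $ cat > pairdemo.c <<'EOF'
>     /* hypothetical stand-in for an unused read-only table in a header */
>     static const int unused_pair_table[2] = {1, 2};
>     int main(void) { return 0; }
>     EOF
>     $ gcc -O0 -c pairdemo.c && nm pairdemo.o | grep unused_pair_table
>     # debug build: the dead constant typically remains in the object file
>     $ gcc -O2 -c pairdemo.c && nm pairdemo.o | grep unused_pair_table
>     # optimized build: the unused constant is typically discarded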
> 
>     ------------------------------
> 
>     Message: 2
>     Date: Wed, 9 Jan 2013 22:57:14 +0200
>     From: Dmitri Gribenko <gribozavr at gmail.com>
>     To: discuss at mpich.org
>     Subject: Re: [mpich-discuss] [PATCH] Use attribute layout_compatible
>             for pair types
> 
>     On Wed, Jan 9, 2013 at 10:00 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
>     > Dmitri, as discussed in the other thread, it's smart enough, but only
>     > when optimization is turned on.  There's no reason to make debug
>     > builds heavier than necessary.  This is not a big deal either way.
> 
>     Oh, now I see -- in debug builds it still emits these.  Thank you
>     for fixing!
> 
>     Dmitri
> 
>     --
>     main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
>     (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
> 
> 
>     ------------------------------
> 
>     Message: 3
>     Date: Fri, 11 Jan 2013 13:55:34 -0500
>     From: Na Zhang <na.zhang at stonybrook.edu>
>     To: discuss at mpich.org
>     Subject: [mpich-discuss] start MPD daemons on 2 different subnets
> 
>     Dear developers,
> 
>     We want to start MPD daemons on 2 different subnets: for example,
>     subnet #1 192.168.0.0/24 and subnet #2 192.168.1.0/24.
> 
>     The two subnets are connected via switches and can talk to each
>     other.  Next, we'd like to start MPD daemons on two nodes:
> 
>     Node #1 (in subnet #1): (hostname:node_001) IP=192.168.0.1
>     Node #2 (in subnet #2): (hostname:node_002) IP=192.168.1.1
> 
>     We used the following commands:
> 
>     on Node #1: mpd --ifhn=192.168.0.1 --daemon
>     (daemon is successfully started on node_001)
> 
>     on Node #2: mpd -h node_001 -p <node1's_port_number>
>     --ifhn=192.168.1.1 --daemon
>     (the daemon cannot be started on Node #2.  There is no error message;
>     when we run "mpdtrace" on Node #1, it shows no daemon running on
>     Node #2.)
> 
>     Node #2 cannot join the ring that is generated by node #1.
> 
>     What should we do?
> 
>     Thank you in advance.
> 
>     Sincerely,
>     Na Zhang
> 
>     --
>     Na Zhang, Ph.D. Candidate
>     Dept. of Applied Mathematics and Statistics
>     Stony Brook University
>     Phone: 631-838-3205
> 
> 
>     ------------------------------
> 
>     Message: 4
>     Date: Fri, 11 Jan 2013 12:59:07 -0600
>     From: Dave Goodell <goodell at mcs.anl.gov>
>     To: discuss at mpich.org
>     Subject: Re: [mpich-discuss] start MPD daemons on 2 different subnets
> 
>     Use hydra instead of MPD:
> 
>     http://wiki.mpich.org/mpich/index.php/FAQ#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F
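> 
>     (Not part of the original reply; a minimal sketch of the Hydra
>     equivalent, reusing the two hosts from the question and assuming ssh
>     works between them.  The interface name below is only an example.)
> 
>     $ cat hosts
>     192.168.0.1
>     192.168.1.1
>     $ mpiexec -f hosts -n 2 ./app
>     $ mpiexec -f hosts -iface eth0 -n 2 ./app   # pin Hydra to one interface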
> 
>     -Dave
> 
>     On Jan 11, 2013, at 12:55 PM CST, Na Zhang wrote:
> 
>     > [...]
> 
> 
> 
>     ------------------------------
> 
>     Message: 5
>     Date: Fri, 11 Jan 2013 13:31:48 -0800
>     From: "Michael Colonno" <mcolonno at stanford.edu>
>     To: discuss at mpich.org
>     Subject: [mpich-discuss] Fatal error in PMPI_Reduce
> 
>     Hi All ~
> 
>     I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on a
>     CentOS 6.3 x64 system using SLURM as the process manager.  My
>     configure was simply:
> 
>     ./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
> 
>     No errors during build or install.  When I compile and run the example
>     program cxxcpi I get (truncated):
> 
>     $ srun -n32 /usr/local/apps/cxxcpi
>     Fatal error in PMPI_Reduce: A process has failed, error stack:
>     PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
>     rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>     MPI_COMM_WORLD) failed
>     MPIR_Reduce_impl(1029)..........:
>     MPIR_Reduce_intra(779)..........:
>     MPIR_Reduce_impl(1029)..........:
>     MPIR_Reduce_intra(835)..........:
>     MPIR_Reduce_binomial(144).......:
>     MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
>     MPIR_Reduce_intra(799)..........:
>     MPIR_Reduce_impl(1029)..........:
>     MPIR_Reduce_intra(835)..........:
>     MPIR_Reduce_binomial(206).......: Failure during collective
>     srun: error: task 0: Exited with exit code 1
> 
>     This error is experienced with many of my MPI programs.  A different
>     application yields:
> 
>     PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT,
>     root=0, MPI_COMM_WORLD) failed
>     MPIR_Bcast_impl(1369).:
>     MPIR_Bcast_intra(1160):
>     MPIR_SMP_Bcast(1077)..: Failure during collective
> 
>     Can anyone point me in the right direction?
> 
>     Thanks,
>     ~Mike C.
> 
> 
>     ------------------------------
> 
>     Message: 6
>     Date: Fri, 11 Jan 2013 16:19:23 -0600
>     From: Pavan Balaji <balaji at mcs.anl.gov>
>     To: discuss at mpich.org
>     Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce
> 
>     Michael,
> 
>     Did you try just using mpiexec?
> 
>     mpiexec -n 32 /usr/local/apps/cxxcpi
> 
>      -- Pavan
> 
>     On 01/11/2013 03:31 PM US Central Time, Michael Colonno wrote:
>     > [...]
> 
>     --
>     Pavan Balaji
>     http://www.mcs.anl.gov/~balaji
> 
> 
>     ------------------------------
> 
>     Message: 7
>     Date: Fri, 11 Jan 2013 16:20:00 -0600
>     From: Pavan Balaji <balaji at mcs.anl.gov>
>     To: discuss at mpich.org
>     Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce
> 
> 
>     FYI, the reason I suggested this is that mpiexec will automatically
>     detect and use SLURM internally.
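> 
>     (Not from the original message; a minimal sketch of that route.  One
>     assumption: the configure line above used --with-pm=no, which skips
>     installing Hydra's mpiexec, so this rebuilds with the default process
>     manager first.  Paths follow Michael's install prefix.)
> 
>     $ ./configure --prefix=/usr/local/apps/MPICH2   # default PM is Hydra
>     $ make && make install
>     $ salloc -n 32 mpiexec -n 32 /usr/local/apps/cxxcpi
>     # inside a SLURM allocation, Hydra detects SLURM and launches through it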
> 
>      -- Pavan
> 
>     On 01/11/2013 04:19 PM US Central Time, Pavan Balaji wrote:
>     > Did you try just using mpiexec?
>     >
>     > mpiexec -n 32 /usr/local/apps/cxxcpi
>     >
>     > [...]
> 
>     --
>     Pavan Balaji
>     http://www.mcs.anl.gov/~balaji
> 
> 
>     ------------------------------
> 
> 
>     End of discuss Digest, Vol 3, Issue 9
>     *************************************
> 
> 
> 
> 
> -- 
> Sincerely,
> 
> Na Zhang, Ph.D. Candidate
> Dept. of Applied Mathematics and Statistics
> Stony Brook University
> Phone: 631-838-3205
> 
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


