[mpich-discuss] having problem running MPICH on multiple nodes
Amin Hassani
ahassani at cis.uab.edu
Tue Nov 25 22:58:51 CST 2014
Here you go!
host machine:
~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~
$ echo $LD_LIBRARY_PATH
/nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~
$ echo $PATH
/nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin:/opt/matlab-R2013a/bin
oakmnt-0-a:
$ echo $LD_LIBRARY_PATH
/nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
~{ahassani at oakmnt-0-a:~/usr/bin}~{Tue Nov 25 10:56 PM}~
$ echo $PATH
/nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
oakmnt-0-b:
~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~
$ echo $LD_LIBRARY_PATH
/nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~
$ echo $PATH
/nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
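To double-check what the launched processes themselves actually see (the check Huiwei asks about below), one option is a small MPI program along these lines. This is just a sketch, not part of the original test; it assumes it is compiled with the mpicc of the MPICH install under test, and "env_check" is only a placeholder name for the binary. Each rank prints its host name plus the PATH and LD_LIBRARY_PATH it inherited, so the values can be compared across oakmnt-0-a and oakmnt-0-b:

/* env_check.c -- sketch: print the environment each MPI rank actually sees */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* getenv() reports the environment of this process on its own node */
    const char *path = getenv("PATH");
    const char *ld   = getenv("LD_LIBRARY_PATH");
    printf("rank %d on %s\n  PATH=%s\n  LD_LIBRARY_PATH=%s\n",
           rank, host, path ? path : "(unset)", ld ? ld : "(unset)");

    MPI_Finalize();
    return 0;
}

Running it with the same hostfile, e.g. "mpirun -hostfile hosts-hydra -np 2 ./env_check", should print two entries whose paths match the ones above.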
Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.
On Tue, Nov 25, 2014 at 10:55 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
> So your ssh connection is correct, and we already confirmed that the code
> itself is correct. The problem may be somewhere else.
>
> Could you check the PATH and LD_LIBRARY_PATH on these three machines
> (oakmnt-0-a, oakmnt-0-b, and the host machine) to make sure they are the
> same, so that mpirun is using the same library on all of them?
>
> —
> Huiwei
>
> > On Nov 25, 2014, at 10:33 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> >
> > Here you go!
> >
> > $ mpirun -hostfile hosts-hydra -np 2 hostname
> > oakmnt-0-a
> > oakmnt-0-b
> >
> > Thanks.
> >
> > Amin Hassani,
> > CIS department at UAB,
> > Birmingham, AL, USA.
> >
> > On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
> > I can run your simplest code on my machine without a problem, so I guess
> > there is some problem with the cluster connection. Could you give me the
> > output of the following?
> >
> > $ mpirun -hostfile hosts-hydra -np 2 hostname
> >
> > —
> > Huiwei
> >
> > > On Nov 25, 2014, at 10:24 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> > >
> > > Hi,
> > >
> > > The code that I gave you had more stuff in it that I didn't want to
> > > distract you with. Here is the simpler send/recv test that I just ran,
> > > and it failed.
> > >
> > > which mpirun: the specific directory where I install my MPIs
> > > /nethome/students/ahassani/usr/mpi/bin/mpirun
> > >
> > > mpirun with no arguments:
> > > $ mpirun
> > > [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided
> > > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed
> > > [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters
> > >
> > >
> > >
> > > #include <mpi.h>
> > > #include <stdio.h>
> > > #include <malloc.h>
> > > #include <unistd.h>
> > > #include <stdlib.h>
> > >
> > > int skip = 10;
> > > int iter = 30;
> > >
> > > int main(int argc, char** argv)
> > > {
> > >     int rank, size;
> > >     int i, j, k;
> > >     double t1, t2;
> > >     int rc;
> > >
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
> > >     MPI_Comm_rank(world, &rank);
> > >     MPI_Comm_size(world, &size);
> > >
> > >     /* rank 0 sends one int to rank 1; the other rank receives it */
> > >     int a = 0, b = 1;
> > >     if (rank == 0) {
> > >         MPI_Send(&a, 1, MPI_INT, 1, 0, world);
> > >     } else {
> > >         MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE);
> > >     }
> > >
> > >     printf("b is %d\n", b);
> > >     MPI_Finalize();
> > >
> > >     return 0;
> > > }
> > >
> > > Thank you.
> > >
> > >
> > > Amin Hassani,
> > > CIS department at UAB,
> > > Birmingham, AL, USA.
> > >
> > > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
> > > Hi, Amin,
> > >
> > > Could you quickly give us the output of the following command: "which
> > > mpirun"?
> > >
> > > Also, your simplest code doesn't compile correctly: "error: ‘t_avg’
> > > undeclared (first use in this function)". Can you fix it?
> > >
> > > —
> > > Huiwei
> > >
> > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> > > >
> > > > This is the simplest code I have that doesn't run.
> > > >
> > > >
> > > > #include <mpi.h>
> > > > #include <stdio.h>
> > > > #include <malloc.h>
> > > > #include <unistd.h>
> > > > #include <stdlib.h>
> > > >
> > > > int main(int argc, char** argv)
> > > > {
> > > >     int rank, size;
> > > >     int i, j, k;
> > > >     double t1, t2;
> > > >     int rc;
> > > >
> > > >     MPI_Init(&argc, &argv);
> > > >     MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
> > > >     MPI_Comm_rank(world, &rank);
> > > >     MPI_Comm_size(world, &size);
> > > >
> > > >     t2 = 1;
> > > >     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
> > > >     t_avg = t_avg / size;
> > > >
> > > >     MPI_Finalize();
> > > >
> > > >     return 0;
> > > > }
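For reference, the "t_avg" compile error Huiwei mentions above comes from this snippet: t_avg is used as the MPI_Allreduce receive buffer but is never declared. A minimal corrected sketch, trimmed to the variables the test actually uses and only assuming t_avg is meant to hold the summed value before averaging:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double t2 = 1.0;    /* value contributed by each rank */
    double t_avg = 0.0; /* the declaration missing in the original snippet */

    MPI_Init(&argc, &argv);
    MPI_Comm world = MPI_COMM_WORLD;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);

    /* sum t2 across all ranks, then divide by the number of ranks */
    MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
    t_avg = t_avg / size;

    printf("rank %d: t_avg = %f\n", rank, t_avg);

    MPI_Finalize();
    return 0;
}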
> > > >
> > > > Amin Hassani,
> > > > CIS department at UAB,
> > > > Birmingham, AL, USA.
> > > >
> > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov> wrote:
> > > >
> > > > Hi Amin,
> > > >
> > > > Can you share with us a minimal piece of code with which you can
> > > > reproduce this issue?
> > > >
> > > > Thanks,
> > > > Antonio
> > > >
> > > >
> > > >
> > > > On 11/25/2014 12:52 PM, Amin Hassani wrote:
> > > >> Hi,
> > > >>
> > > >> I am having a problem running MPICH on multiple nodes. When I run
> > > >> multiple MPI processes on one node, it works fine, but when I try to
> > > >> run on multiple nodes, it fails with the error below.
> > > >> My machines run Debian and have both InfiniBand and TCP interconnects.
> > > >> I'm guessing it has something to do with the TCP network, but I can
> > > >> run Open MPI on these machines with no problem; for some reason I
> > > >> cannot run MPICH on multiple nodes. Please let me know if more info is
> > > >> needed from my side. I'm guessing there is some configuration that I
> > > >> am missing. I used MPICH 3.1.3 for this test. I googled this problem
> > > >> but couldn't find any solution.
> > > >>
> > > >> In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD.
> > > >>
> > > >> my host file (hosts-hydra) is something like this:
> > > >> oakmnt-0-a:1
> > > >> oakmnt-0-b:1
> > > >>
> > > >> I get this error:
> > > >>
> > > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup
> > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
> > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
> > > >> internal ABORT - process 1
> > > >> internal ABORT - process 0
> > > >>
> > > >> ===================================================================================
> > > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > > >> = PID 30744 RUNNING AT oakmnt-0-b
> > > >> = EXIT CODE: 1
> > > >> = CLEANING UP REMAINING PROCESSES
> > > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > > >> ===================================================================================
> > > >> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor)
> > > >> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy
> > > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
> > > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
> > > >> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
> > > >>
> > > >> Thanks.
> > > >> Amin Hassani,
> > > >> CIS department at UAB,
> > > >> Birmingham, AL, USA.
> > > >>
> > > >>
> > > >
> > > >
> > > > --
> > > > Antonio J. Peña
> > > > Postdoctoral Appointee
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > > 9700 South Cass Avenue, Bldg. 240, Of. 3148
> > > > Argonne, IL 60439-4847
> > > >
> > > > apenya at mcs.anl.gov
> > > > www.mcs.anl.gov/~apenya
> > > >
> > >
> >