[mpich-discuss] having a problem running MPICH on multiple nodes

Amin Hassani ahassani at cis.uab.edu
Tue Nov 25 22:35:23 CST 2014


It might be an issue with the cluster, but if I could somehow run MPICH in
debug mode, that might be useful. I have no idea how to do it in MPICH, though.
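
(For example, if Hydra's -verbose option is the right knob for this, something
like

$ mpirun -verbose -hostfile hosts-hydra -np 2 ./test_dup

is what I would try; test_dup is just the binary from my earlier runs, and I
have not checked yet how much of the launch it actually traces.)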

Thanks.

Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.

On Tue, Nov 25, 2014 at 10:33 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:

> Here you go!
>
> $ mpirun -hostfile hosts-hydra -np 2 hostname
> oakmnt-0-a
> oakmnt-0-b
>
> Thanks.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>
>> I can run your simplest code on my machine without a problem, so I guess
>> there is some problem with the cluster connection. Could you give me the
>> output of the following?
>>
>> $ mpirun -hostfile hosts-hydra -np 2 hostname
>>
>> —
>> Huiwei
>>
>> > On Nov 25, 2014, at 10:24 PM, Amin Hassani <ahassani at cis.uab.edu>
>> wrote:
>> >
>> > Hi,
>> >
>> > The code that I gave you had extra stuff in it that I didn't want to
>> > distract you with. Here is a simpler send/recv test that I just ran, and
>> > it also failed.
>> >
>> > which mpirun: it points to the specific directory where I install my MPIs:
>> > /nethome/students/ahassani/usr/mpi/bin/mpirun
>> >
>> > mpirun with no arguments:
>> > $ mpirun
>> > [mpiexec at oakmnt-0-a] set_default_values
>> (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided
>> > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters
>> (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values
>> failed
>> > [mpiexec at oakmnt-0-a] main
>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters
>> >
>> >
>> >
>> > #include <mpi.h>
>> > #include <stdio.h>
>> > #include <malloc.h>
>> > #include <unistd.h>
>> > #include <stdlib.h>
>> >
>> > int skip = 10;
>> > int iter = 30;
>> >
>> > int main(int argc, char** argv)
>> > {
>> >     int rank, size;
>> >     int i, j, k;
>> >     double t1, t2;
>> >     int rc;
>> >
>> >     MPI_Init(&argc, &argv);
>> >     MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>> >     MPI_Comm_rank(world, &rank);
>> >     MPI_Comm_size(world, &size);
>> >     int a = 0, b = 1;
>> >     if(rank == 0){
>> >         MPI_Send(&a, 1, MPI_INT, 1, 0, world);
>> >     }else{
>> >         MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE);
>> >     }
>> >
>> >     printf("b is %d\n", b);
>> >     MPI_Finalize();
>> >
>> >     return 0;
>> > }
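>> >
>> > (For completeness, I build and launch it in the usual way, e.g.
>> >
>> >     mpicc sendrecv_test.c -o sendrecv_test
>> >     mpirun -hostfile hosts-hydra -np 2 ./sendrecv_test
>> >
>> > where sendrecv_test is just a placeholder name for the file above.)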
>> >
>> > Thank you.
>> >
>> >
>> > Amin Hassani,
>> > CIS department at UAB,
>> > Birmingham, AL, USA.
>> >
>> > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov>
>> wrote:
>> > Hi, Amin,
>> >
>> > Could you quickly give us the output of the following command: "which
>> > mpirun"?
>> >
>> > Also, your simplest code couldn’t compile correctly: “error: ‘t_avg’
>> > undeclared (first use in this function)”. Can you fix it?
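>> >
>> > (Presumably it just needs a declaration before the reduction, something like
>> >
>> >     double t_avg = 0.0;
>> >     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>> >
>> > but that is only a guess at what was intended.)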
>> >
>> > —
>> > Huiwei
>> >
>> > > On Nov 25, 2014, at 2:58 PM, Amin Hassani <ahassani at cis.uab.edu>
>> wrote:
>> > >
>> > > This is the simplest code I have that doesn't run.
>> > >
>> > >
>> > > #include <mpi.h>
>> > > #include <stdio.h>
>> > > #include <malloc.h>
>> > > #include <unistd.h>
>> > > #include <stdlib.h>
>> > >
>> > > int main(int argc, char** argv)
>> > > {
>> > >     int rank, size;
>> > >     int i, j, k;
>> > >     double t1, t2;
>> > >     int rc;
>> > >
>> > >     MPI_Init(&argc, &argv);
>> > >     MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>> > >     MPI_Comm_rank(world, &rank);
>> > >     MPI_Comm_size(world, &size);
>> > >
>> > >     t2 = 1;
>> > >     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>> > >     t_avg = t_avg / size;
>> > >
>> > >     MPI_Finalize();
>> > >
>> > >     return 0;
>> > > }
>> > >
>> > > Amin Hassani,
>> > > CIS department at UAB,
>> > > Birmingham, AL, USA.
>> > >
>> > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <
>> apenya at mcs.anl.gov> wrote:
>> > >
>> > > Hi Amin,
>> > >
>> > > Can you share with us a minimal piece of code with which you can
>> reproduce this issue?
>> > >
>> > > Thanks,
>> > >   Antonio
>> > >
>> > >
>> > >
>> > > On 11/25/2014 12:52 PM, Amin Hassani wrote:
>> > >> Hi,
>> > >>
>> > >> I am having a problem running MPICH on multiple nodes. When I run
>> > >> multiple MPI processes on one node, it works fine, but when I try to
>> > >> run on multiple nodes, it fails with the error below.
>> > >> My machines run Debian and have both InfiniBand and TCP interconnects.
>> > >> I'm guessing it has something to do with the TCP network, but I can run
>> > >> Open MPI on these machines with no problem; for some reason I just
>> > >> cannot run MPICH on multiple nodes. Please let me know if more info is
>> > >> needed from my side. I'm guessing there is some configuration that I am
>> > >> missing. I used MPICH 3.1.3 for this test. I googled this problem but
>> > >> couldn't find any solution.
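>> > >>
>> > >> (If the launcher is picking the wrong network, one thing I could try is
>> > >> Hydra's -iface option to pin it to a specific interface, e.g.
>> > >>
>> > >> $ mpirun -iface eth0 -hostfile hosts-hydra -np 2 test_dup
>> > >>
>> > >> where eth0 is only a placeholder for whichever TCP interface the nodes
>> > >> actually share.)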
>> > >>
>> > >> In my MPI program, I am doing a simple allreduce over
>> > >> MPI_COMM_WORLD.
>> > >>
>> > >> My host file (hosts-hydra) is something like this:
>> > >> oakmnt-0-a:1
>> > >> oakmnt-0-b:1
>> > >>
>> > >> I get this error:
>> > >>
>> > >> $ mpirun -hostfile hosts-hydra -np 2  test_dup
>> > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>> status->MPI_TAG == recvtag
>> > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>> status->MPI_TAG == recvtag
>> > >> internal ABORT - process 1
>> > >> internal ABORT - process 0
>> > >>
>> > >>
>> ===================================================================================
>> > >> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> > >> =   PID 30744 RUNNING AT oakmnt-0-b
>> > >> =   EXIT CODE: 1
>> > >> =   CLEANING UP REMAINING PROCESSES
>> > >> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> > >>
>> ===================================================================================
>> > >> [mpiexec at vulcan13] HYDU_sock_read
>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file
>> descriptor)
>> > >> [mpiexec at vulcan13] control_cb
>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read
>> command from proxy
>> > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event
>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
>> error status
>> > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion
>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for
>> event
>> > >> [mpiexec at vulcan13] main
>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>> waiting for completion
>> > >>
>> > >> Thanks.
>> > >> Amin Hassani,
>> > >> CIS department at UAB,
>> > >> Birmingham, AL, USA.
>> > >>
>> > >>
>> > >
>> > >
>> > > --
>> > > Antonio J. Peña
>> > > Postdoctoral Appointee
>> > > Mathematics and Computer Science Division
>> > > Argonne National Laboratory
>> > > 9700 South Cass Avenue, Bldg. 240, Of. 3148
>> > > Argonne, IL 60439-4847
>> > >
>> > > apenya at mcs.anl.gov
>> > > www.mcs.anl.gov/~apenya
>> > >
>> >
>>
>>
>
>

