[mpich-discuss] having problem running MPICH on multiple nodes
Lu, Huiwei
huiweilu at mcs.anl.gov
Tue Nov 25 23:08:05 CST 2014
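You could try putting /nethome/students/ahassani/usr/mpi/lib at the very front of LD_LIBRARY_PATH and /nethome/students/ahassani/usr/mpi/bin at the very front of PATH, so that they are searched before /nethome/students/ahassani/usr/lib and /nethome/students/ahassani/usr/bin.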
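As a rough sketch of what that reordering could look like in a Bourne-style shell (the MPI_PREFIX variable name is only an illustration, and the prefix is simply the one quoted in this thread):

```shell
# Illustrative variable name; the prefix is the one quoted in this thread.
MPI_PREFIX=/nethome/students/ahassani/usr/mpi

# Put the MPI directories first so they are searched before any other entry.
export PATH="$MPI_PREFIX/bin:$PATH"
# ${VAR:+...} appends the old value only if it is set, avoiding a stray trailing colon.
export LD_LIBRARY_PATH="$MPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Show which mpirun the shell now resolves first (prints a note if none is found).
command -v mpirun || echo "mpirun not found on PATH"
```

This would need to be done on the host machine and on both oakmnt nodes. Since mpirun starts the remote processes over ssh, the lines likely have to live in a startup file that non-interactive logins also read.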
—
Huiwei
> On Nov 25, 2014, at 11:06 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>
> Is there a chance that some old MPI libraries sit in /nethome/students/ahassani/usr/lib?
> Or that an old mpirun sits in /nethome/students/ahassani/usr/bin?
>
> —
> Huiwei
>
>> On Nov 25, 2014, at 10:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>
>>
>> Here you go!
>>
>> host machine:
>> ~{ahassani@vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~
>> $ echo $LD_LIBRARY_PATH
>> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
>> ~{ahassani@vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~
>> $ echo $PATH
>> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin:/opt/matlab-R2013a/bin
>>
>> oakmnt-0-a:
>> $ echo $LD_LIBRARY_PATH
>> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
>> ~{ahassani@oakmnt-0-a:~/usr/bin}~{Tue Nov 25 10:56 PM}~
>> $ echo $PATH
>> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
>>
>> oakmnt-0-b:
>> ~{ahassani@oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~
>> $ echo $LD_LIBRARY_PATH
>> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
>> ~{ahassani@oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~
>> $ echo $PATH
>> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
>>
>>
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 10:55 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>> So your ssh connection works, and we confirmed earlier that the code itself is correct. The problem may be somewhere else.
>>
>> Could you check PATH and LD_LIBRARY_PATH on these three machines (oakmnt-0-a, oakmnt-0-b, and the host machine) to make sure they are the same, so that mpirun uses the same library on all of them?
>>
>> —
>> Huiwei
>>
>>> On Nov 25, 2014, at 10:33 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>
>>> Here you go!
>>>
>>> $ mpirun -hostfile hosts-hydra -np 2 hostname
>>> oakmnt-0-a
>>> oakmnt-0-b
>>>
>>> Thanks.
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>> I can run your simplest code on my machine without a problem, so I guess there is some problem with the cluster connection. Could you give me the output of the following?
>>>
>>> $ mpirun -hostfile hosts-hydra -np 2 hostname
>>>
>>> —
>>> Huiwei
>>>
>>>> On Nov 25, 2014, at 10:24 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>>
>>>> Hi,
>>>>
>>>> The code I gave you originally had more in it, which I cut out so it wouldn't distract you. Here is a simpler send/recv test that I just ran, and it failed.
>>>>
>>>> which mpirun (the directory where I install my MPI builds):
>>>> /nethome/students/ahassani/usr/mpi/bin/mpirun
>>>>
>>>> mpirun with no argument:
>>>> $ mpirun
>>>> [mpiexec@oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided
>>>> [mpiexec@oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed
>>>> [mpiexec@oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters
>>>>
>>>>
>>>>
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <malloc.h>
>>>> #include <unistd.h>
>>>> #include <stdlib.h>
>>>>
>>>> int skip = 10;
>>>> int iter = 30;
>>>>
>>>> int main(int argc, char** argv)
>>>> {
>>>>     int rank, size;
>>>>     int i, j, k;
>>>>     double t1, t2;
>>>>     int rc;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>>>>     MPI_Comm_rank(world, &rank);
>>>>     MPI_Comm_size(world, &size);
>>>>     int a = 0, b = 1;
>>>>     if(rank == 0){
>>>>         MPI_Send(&a, 1, MPI_INT, 1, 0, world);
>>>>     }else{
>>>>         MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE);
>>>>     }
>>>>
>>>>     printf("b is %d\n", b);
>>>>     MPI_Finalize();
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>>
>>>> On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>>> Hi, Amin,
>>>>
>>>> Could you quickly give us the output of the following command: "which mpirun"
>>>>
>>>> Also, your simplest code doesn't compile: "error: ‘t_avg’ undeclared (first use in this function)". Can you fix it?
>>>>
>>>> —
>>>> Huiwei
>>>>
>>>>> On Nov 25, 2014, at 2:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>>>
>>>>> This is the simplest code I have that doesn't run.
>>>>>
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <malloc.h>
>>>>> #include <unistd.h>
>>>>> #include <stdlib.h>
>>>>>
>>>>> int main(int argc, char** argv)
>>>>> {
>>>>>     int rank, size;
>>>>>     int i, j, k;
>>>>>     double t1, t2;
>>>>>     int rc;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>>>>>     MPI_Comm_rank(world, &rank);
>>>>>     MPI_Comm_size(world, &size);
>>>>>
>>>>>     t2 = 1;
>>>>>     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>>>     t_avg = t_avg / size;
>>>>>
>>>>>     MPI_Finalize();
>>>>>
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> Amin Hassani,
>>>>> CIS department at UAB,
>>>>> Birmingham, AL, USA.
>>>>>
>>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov> wrote:
>>>>>
>>>>> Hi Amin,
>>>>>
>>>>> Can you share with us a minimal piece of code with which you can reproduce this issue?
>>>>>
>>>>> Thanks,
>>>>> Antonio
>>>>>
>>>>>
>>>>>
>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am having a problem running MPICH on multiple nodes. When I run multiple MPI processes on one node, it works fine, but when I try to run on multiple nodes, it fails with the error below.
>>>>>> My machines run Debian and have both InfiniBand and TCP interconnects. I'm guessing it has something to do with the TCP network, but I can run Open MPI on these machines with no problem; for some reason I just cannot run MPICH on multiple nodes. Please let me know if you need more info from my side. I'm guessing there is some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution.
>>>>>>
>>>>>> In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD.
>>>>>>
>>>>>> My host file (hosts-hydra) looks like this:
>>>>>> oakmnt-0-a:1
>>>>>> oakmnt-0-b:1
>>>>>>
>>>>>> I get this error:
>>>>>>
>>>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
>>>>>> internal ABORT - process 1
>>>>>> internal ABORT - process 0
>>>>>>
>>>>>> ===================================================================================
>>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>> = PID 30744 RUNNING AT oakmnt-0-b
>>>>>> = EXIT CODE: 1
>>>>>> = CLEANING UP REMAINING PROCESSES
>>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>> ===================================================================================
>>>>>> [mpiexec@vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor)
>>>>>> [mpiexec@vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy
>>>>>> [mpiexec@vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
>>>>>> [mpiexec@vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
>>>>>> [mpiexec@vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
>>>>>>
>>>>>> Thanks.
>>>>>> Amin Hassani,
>>>>>> CIS department at UAB,
>>>>>> Birmingham, AL, USA.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> discuss mailing list
>>>>>> discuss at mpich.org
>>>>>>
>>>>>> To manage subscription options or unsubscribe:
>>>>>>
>>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>>
>>>>>
>>>>> --
>>>>> Antonio J. Peña
>>>>> Postdoctoral Appointee
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Laboratory
>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>>> Argonne, IL 60439-4847
>>>>>
>>>>> apenya at mcs.anl.gov
>>>>> www.mcs.anl.gov/~apenya
>>>>>