[mpich-discuss] having problem running MPICH on multiple nodes

Lu, Huiwei huiweilu at mcs.anl.gov
Tue Nov 25 23:08:05 CST 2014


You may try to put /nethome/students/ahassani/usr/mpi/lib and /nethome/students/ahassani/usr/mpi/bin at the very front of LD_LIBRARY_PATH and PATH, respectively.
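For example, in a bash-style shell that would look something like the lines below (just a sketch; adjust to your shell and startup files, and make sure the same ordering is picked up on the compute nodes as well, since that is the environment the Hydra proxies will see):

$ export PATH=/nethome/students/ahassani/usr/mpi/bin:$PATH
$ export LD_LIBRARY_PATH=/nethome/students/ahassani/usr/mpi/lib:$LD_LIBRARY_PATH
$ hash -r          # clear the shell's cached command locations
$ which mpirun     # verify it resolves to /nethome/students/ahassani/usr/mpi/bin/mpirun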
—
Huiwei

> On Nov 25, 2014, at 11:06 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
> 
> Is there a chance that some old MPI libraries sit in /nethome/students/ahassani/usr/lib?
> Or that an old mpirun sits in /nethome/students/ahassani/usr/bin?
> 
> Huiwei
> 
>> On Nov 25, 2014, at 10:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>> 
>> 
>> Here you go!
>> 
>> host machine:
>> ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~
>> $ echo $LD_LIBRARY_PATH
>> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
>> ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~
>> $ echo $PATH
>> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin:/opt/matlab-R2013a/bin
>> 
>> oakmnt-0-a:
>> $ echo $LD_LIBRARY_PATH
>> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
>> ~{ahassani at oakmnt-0-a:~/usr/bin}~{Tue Nov 25 10:56 PM}~
>> $ echo $PATH
>> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
>> 
>> oakmnt-0-b:
>> ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~
>> $ echo $LD_LIBRARY_PATH
>> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib:
>> ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~
>> $ echo $PATH
>> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
>> 
>> 
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>> 
>> On Tue, Nov 25, 2014 at 10:55 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>> So your ssh connection is correct, and we confirmed earlier that the code itself is correct. The problem may be somewhere else.
>> 
>> Could you check PATH and LD_LIBRARY_PATH on these three machines (oakmnt-0-a, oakmnt-0-b, and the host machine) to make sure they are the same, so that mpirun uses the same library on all of them?
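>> For example, something like this one-liner from the host machine would show what each node sees (a sketch; hostnames taken from your hostfile, and note it reports the non-interactive ssh environment, which is what the Hydra proxies get):
>> 
>> $ for h in oakmnt-0-a oakmnt-0-b; do echo "== $h =="; ssh $h 'echo $PATH; echo $LD_LIBRARY_PATH; which mpirun'; done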
>> 
>> Huiwei
>> 
>>> On Nov 25, 2014, at 10:33 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>> 
>>> Here you go!
>>> 
>>> $ mpirun -hostfile hosts-hydra -np 2 hostname
>>> oakmnt-0-a
>>> oakmnt-0-b
>>> 
>>> Thanks.
>>> 
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>> 
>>> On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>> I can run your simplest code on my machine without a problem, so I guess there is some problem with the cluster connection. Could you give me the output of the following?
>>> 
>>> $ mpirun -hostfile hosts-hydra -np 2 hostname
>>> 
>>> Huiwei
>>> 
>>>> On Nov 25, 2014, at 10:24 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> The code that I gave you had more stuff in it, and I didn't want that to distract you. Here is the simpler send/recv test that I just ran, and it failed.
>>>> 
>>>> which mpirun: it points to the specific directory where I install my MPIs:
>>>> /nethome/students/ahassani/usr/mpi/bin/mpirun
>>>> 
>>>> mpirun with no arguments:
>>>> $ mpirun
>>>> [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided
>>>> [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed
>>>> [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters
>>>> 
>>>> 
>>>> 
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <malloc.h>
>>>> #include <unistd.h>
>>>> #include <stdlib.h>
>>>> 
>>>> int skip = 10;
>>>> int iter = 30;
>>>> 
>>>> int main(int argc, char** argv)
>>>> {
>>>>    int rank, size;
>>>>    int i, j, k;
>>>>    double t1, t2;
>>>>    int rc;
>>>> 
>>>>    MPI_Init(&argc, &argv);
>>>>    MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>>>>    MPI_Comm_rank(world, &rank);
>>>>    MPI_Comm_size(world, &size);
>>>>    int a = 0, b = 1;
>>>>    if(rank == 0){
>>>>        MPI_Send(&a, 1, MPI_INT, 1, 0, world);
>>>>    }else{
>>>>        MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE);
>>>>    }
>>>> 
>>>>    printf("b is %d\n", b);
>>>>    MPI_Finalize();
>>>> 
>>>>    return 0;
>>>> }
>>>> 
>>>> Thank you.
>>>> 
>>>> 
>>>> Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>> 
>>>> On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>>> Hi, Amin,
>>>> 
>>>> Could you quickly give us the output of the following command: "which mpirun"
>>>> 
>>>> Also, your simplest code doesn't compile: "error: ‘t_avg’ undeclared (first use in this function)". Can you fix it?
>>>> 
>>>> Huiwei
>>>> 
>>>>> On Nov 25, 2014, at 2:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>>> 
>>>>> This is the simplest code I have that doesn't run.
>>>>> 
>>>>> 
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <malloc.h>
>>>>> #include <unistd.h>
>>>>> #include <stdlib.h>
>>>>> 
>>>>> int main(int argc, char** argv)
>>>>> {
>>>>>    int rank, size;
>>>>>    int i, j, k;
>>>>>    double t1, t2;
>>>>>    int rc;
>>>>> 
>>>>>    MPI_Init(&argc, &argv);
>>>>>    MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>>>>>    MPI_Comm_rank(world, &rank);
>>>>>    MPI_Comm_size(world, &size);
>>>>> 
>>>>>    t2 = 1;
>>>>>    MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>>>    t_avg = t_avg / size;
>>>>> 
>>>>>    MPI_Finalize();
>>>>> 
>>>>>    return 0;
>>>>> }
>>>>> 
>>>>> Amin Hassani,
>>>>> CIS department at UAB,
>>>>> Birmingham, AL, USA.
>>>>> 
>>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov> wrote:
>>>>> 
>>>>> Hi Amin,
>>>>> 
>>>>> Can you share with us a minimal piece of code with which you can reproduce this issue?
>>>>> 
>>>>> Thanks,
>>>>>  Antonio
>>>>> 
>>>>> 
>>>>> 
>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I am having a problem running MPICH on multiple nodes. When I run multiple MPI processes on one node, it works fine, but when I try to run on multiple nodes, it fails with the error below.
>>>>>> My machines run Debian and have both InfiniBand and TCP interconnects. I'm guessing it has something to do with the TCP network, but I can run Open MPI on these machines with no problem; for some reason I just cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there is some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution.
>>>>>> 
>>>>>> In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD.
>>>>>> 
>>>>>> My host file (hosts-hydra) looks something like this:
>>>>>> oakmnt-0-a:1
>>>>>> oakmnt-0-b:1
>>>>>> 
>>>>>> I get this error:
>>>>>> 
>>>>>> $ mpirun -hostfile hosts-hydra -np 2  test_dup
>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
>>>>>> internal ABORT - process 1
>>>>>> internal ABORT - process 0
>>>>>> 
>>>>>> ===================================================================================
>>>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>> =   PID 30744 RUNNING AT oakmnt-0-b
>>>>>> =   EXIT CODE: 1
>>>>>> =   CLEANING UP REMAINING PROCESSES
>>>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>> ===================================================================================
>>>>>> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor)
>>>>>> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy
>>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
>>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
>>>>>> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
>>>>>> 
>>>>>> Thanks.
>>>>>> Amin Hassani,
>>>>>> CIS department at UAB,
>>>>>> Birmingham, AL, USA.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Antonio J. Peña
>>>>> Postdoctoral Appointee
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Laboratory
>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>>> Argonne, IL 60439-4847
>>>>> 
>>>>> apenya at mcs.anl.gov
>>>>> www.mcs.anl.gov/~apenya
>>>>> 
>>>> 
>>> 
>> 
> 

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

