[mpich-discuss] having a problem running MPICH on multiple nodes
Amin Hassani
ahassani at cis.uab.edu
Tue Nov 25 21:41:06 CST 2014
Is there any debugging flag I can turn on to help figure out the problem?
Thanks.
Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.
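
One way to get more detail at the MPI level, independent of any launcher
flag, is to switch MPI_COMM_WORLD from the default MPI_ERRORS_ARE_FATAL to
MPI_ERRORS_RETURN and print the error string of whichever call fails first.
It may not help with an internal assertion abort, but it can narrow down the
failing rank and call. A minimal sketch along those lines (check_mpi is just
an illustrative helper, not part of MPICH, and the allreduce only roughly
mirrors the failing one):

#include <mpi.h>
#include <stdio.h>

/* Illustrative helper: print the MPI error string on the failing rank. */
static void check_mpi(int rc, const char *where, int rank)
{
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: %s failed: %s\n", rank, where, msg);
    }
}

int main(int argc, char **argv)
{
    int rank, rc;
    double in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Report errors instead of aborting immediately. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    check_mpi(rc, "MPI_Allreduce", rank);

    MPI_Finalize();
    return 0;
}

Running mpiexec with the -verbose option should also show more of what the
Hydra launcher is doing on each host.
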
On Tue, Nov 25, 2014 at 9:31 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> Now I'm getting this error with MPICH-3.2a2.
> Any thoughts?
>
> $ mpirun -hostfile hosts-hydra -np 2 test_dup
> Fatal error in MPI_Allreduce: Unknown error class, error stack:
> MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffa5240e60,
> rbuf=0x7fffa5240e68, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
> MPIR_Allreduce_impl(769)..............:
> MPIR_Allreduce_intra(419).............:
> MPIDU_Complete_posted_with_error(1192): Process failed
> Fatal error in MPI_Allreduce: Unknown error class, error stack:
> MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffaf6ef070,
> rbuf=0x7fffaf6ef078, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
> MPIR_Allreduce_impl(769)..............:
> MPIR_Allreduce_intra(419).............:
> MPIDU_Complete_posted_with_error(1192): Process failed
>
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 451 RUNNING AT oakmnt-0-a
> = EXIT CODE: 1
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
>
>
> Thanks.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani <ahassani at cis.uab.edu>
> wrote:
>
>> Ok, I'll try to test the alpha version. I'll let you know the results.
>>
>> Thank you.
>>
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. <wbland at anl.gov> wrote:
>>
>>> It’s hard to tell then. Other than some problems compiling (not
>>> declaring all of your variables), everything seems OK. Can you try running
>>> with the most recent alpha? I have no idea what bug we could have fixed
>>> here that would make things work, but it’d be good to eliminate the possibility.
>>>
>>> Thanks,
>>> Wesley
>>>
>>> On Nov 25, 2014, at 10:11 PM, Amin Hassani <ahassani at cis.uab.edu>
>>> wrote:
>>>
>>> I've attached the config.log from the root folder where MPICH was
>>> compiled. I'm not too familiar with MPICH, but there are other config.logs
>>> in other directories as well; I'm not sure whether you need those too.
>>> I don't set any environment variables related to MPICH. I also tried
>>> export HYDRA_HOST_FILE=<path to host file>,
>>> but I get the same problem.
>>> I don't do anything FT-related; I don't think this version of MPICH has
>>> anything related to FT in it.
>>>
>>> Thanks.
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. <wbland at anl.gov>
>>> wrote:
>>>
>>>> Can you also provide your config.log and any CVARs or other relevant
>>>> environment variables that you might be setting (for instance, in relation
>>>> to fault tolerance)?
>>>>
>>>> Thanks,
>>>> Wesley
>>>>
>>>>
>>>> On Nov 25, 2014, at 3:58 PM, Amin Hassani <ahassani at cis.uab.edu>
>>>> wrote:
>>>>
>>>> This is the simplest code I have that doesn't run.
>>>>
>>>>
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>> #include <malloc.h>
>>>> #include <unistd.h>
>>>> #include <stdlib.h>
>>>>
>>>> int main(int argc, char** argv)
>>>> {
>>>> int rank, size;
>>>> int i, j, k;
>>>> double t1, t2, t_avg;
>>>> int rc;
>>>>
>>>> MPI_Init(&argc, &argv);
>>>> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>>>> MPI_Comm_rank(world, &rank);
>>>> MPI_Comm_size(world, &size);
>>>>
>>>> t2 = 1;
>>>> MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>> t_avg = t_avg / size;
>>>>
>>>> MPI_Finalize();
>>>>
>>>> return 0;
>>>> }
>>>>
>>>> Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>>
>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi Amin,
>>>>>
>>>>> Can you share with us a minimal piece of code with which you can
>>>>> reproduce this issue?
>>>>>
>>>>> Thanks,
>>>>> Antonio
>>>>>
>>>>>
>>>>>
>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am having a problem running MPICH on multiple nodes. When I run
>>>>> multiple MPI processes on one node, it works fine, but when I try to run
>>>>> on multiple nodes, it fails with the error below.
>>>>> My machines run Debian and have both InfiniBand and TCP interconnects.
>>>>> I'm guessing it has something to do with the TCP network, but I can run
>>>>> Open MPI on these machines with no problem; for some reason I just cannot
>>>>> run MPICH on multiple nodes. Please let me know if you need more info from
>>>>> my side. I'm guessing there is some configuration that I am missing. I
>>>>> used MPICH 3.1.3 for this test. I googled this problem but couldn't find
>>>>> any solution.
>>>>>
>>>>> In my MPI program, I am doing a simple allreduce over
>>>>> MPI_COMM_WORLD.
>>>>>
>>>>> My host file (hosts-hydra) looks like this:
>>>>> oakmnt-0-a:1
>>>>> oakmnt-0-b:1
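>>>>>
>>>>> As a sanity check that the two ranks really land on oakmnt-0-a and
>>>>> oakmnt-0-b, a tiny hostname test launched with the same host file can
>>>>> help (this is only a sketch, separate from the failing program):
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int rank, len;
>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     /* Print which node each rank ended up on. */
>>>>>     MPI_Get_processor_name(name, &len);
>>>>>     printf("rank %d is running on %s\n", rank, name);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }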
>>>>>
>>>>> I get this error:
>>>>>
>>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>> status->MPI_TAG == recvtag
>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>> status->MPI_TAG == recvtag
>>>>> internal ABORT - process 1
>>>>> internal ABORT - process 0
>>>>>
>>>>>
>>>>> ===================================================================================
>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>> = PID 30744 RUNNING AT oakmnt-0-b
>>>>> = EXIT CODE: 1
>>>>> = CLEANING UP REMAINING PROCESSES
>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>
>>>>> ===================================================================================
>>>>> [mpiexec at vulcan13] HYDU_sock_read
>>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file
>>>>> descriptor)
>>>>> [mpiexec at vulcan13] control_cb
>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read
>>>>> command from proxy
>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event
>>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
>>>>> error status
>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion
>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for
>>>>> event
>>>>> [mpiexec at vulcan13] main
>>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>>>> waiting for completion
>>>>>
>>>>> Thanks.
>>>>> Amin Hassani,
>>>>> CIS department at UAB,
>>>>> Birmingham, AL, USA.
>>>>>
>>>>>
>>>>> --
>>>>> Antonio J. Peña
>>>>> Postdoctoral Appointee
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Laboratory
>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>>> Argonne, IL 60439-4847
>>>>> apenya at mcs.anl.gov
>>>>> www.mcs.anl.gov/~apenya
>>>>>
>>>>>
>>>>
>>>
>>> <config.log>
>>>
>>
>>
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
More information about the discuss mailing list