[mpich-discuss] having problem running MPICH on multiple nodes
Amin Hassani
ahassani at cis.uab.edu
Tue Nov 25 21:31:19 CST 2014
Now I'm getting this error with MPICH-3.2a2.
Any thoughts?
$ mpirun -hostfile hosts-hydra -np 2 test_dup
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffa5240e60,
rbuf=0x7fffa5240e68, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(769)..............:
MPIR_Allreduce_intra(419).............:
MPIDU_Complete_posted_with_error(1192): Process failed
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffaf6ef070,
rbuf=0x7fffaf6ef078, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(769)..............:
MPIR_Allreduce_intra(419).............:
MPIDU_Complete_posted_with_error(1192): Process failed
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 451 RUNNING AT oakmnt-0-a
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
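
For what it's worth, a minimal sketch (not the actual test program) of how the failing call could be made to report the error class and message instead of aborting, assuming the default error handler on MPI_COMM_WORLD is switched to MPI_ERRORS_RETURN:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    double in = 1.0, out = 0.0;
    char msg[MPI_MAX_ERROR_STRING];
    int rc, err_class, len;

    MPI_Init(&argc, &argv);

    /* Return error codes to the caller instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Decode the returned code into an error class and readable string. */
        MPI_Error_class(rc, &err_class);
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Allreduce failed: class %d: %s\n", err_class, msg);
    }

    MPI_Finalize();
    return 0;
}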
Thanks.
Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.
On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> Ok, I'll try to test the alpha version. I'll let you know the results.
>
> Thank you.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. <wbland at anl.gov> wrote:
>
>> It’s hard to tell then. Other than some problems compiling (not
>> declaring all of your variables), everything seems OK. Can you try running
>> with the most recent alpha? I have no idea what bug we could have fixed
>> here to make things work, but it’d be good to eliminate the possibility.
>>
>> Thanks,
>> Wesley
>>
>> On Nov 25, 2014, at 10:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>
>> I've attached the config.log that sits in the root folder where MPICH was
>> compiled. I'm not too familiar with MPICH, but there are other config.log
>> files in other directories as well; I'm not sure whether you need those too.
>> I don't have any environment variables set that relate to MPICH. I also
>> tried
>> export HYDRA_HOST_FILE=<address to host file>,
>> but I get the same problem.
>> I don't do anything FT-related, and I don't think this version of MPICH
>> has anything related to FT in it.
>>
>> Thanks.
>>
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. <wbland at anl.gov> wrote:
>>
>>> Can you also provide your config.log and any CVARs or other relevant
>>> environment variables that you might be setting (for instance, in relation
>>> to fault tolerance)?
>>>
>>> Thanks,
>>> Wesley
>>>
>>>
>>> On Nov 25, 2014, at 3:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>
>>> This is the simplest code I have that doesn't run.
>>>
>>>
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char** argv)
>>> {
>>>     int rank, size;
>>>     double t2, t_avg;    /* t_avg holds the reduced sum, then the average */
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm world = MPI_COMM_WORLD;
>>>     MPI_Comm_rank(world, &rank);
>>>     MPI_Comm_size(world, &size);
>>>
>>>     /* Sum one double across all ranks, then average it. */
>>>     t2 = 1;
>>>     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>     t_avg = t_avg / size;
>>>
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov>
>>> wrote:
>>>
>>>>
>>>> Hi Amin,
>>>>
>>>> Can you share with us a minimal piece of code with which you can
>>>> reproduce this issue?
>>>>
>>>> Thanks,
>>>> Antonio
>>>>
>>>>
>>>>
>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am having a problem running MPICH on multiple nodes. When I run
>>>> multiple MPI processes on one node it works fine, but when I try to run
>>>> on multiple nodes, it fails with the error below.
>>>> My machines run Debian and have both InfiniBand and TCP interconnects. I'm
>>>> guessing it has something to do with the TCP network, but I can run Open MPI
>>>> on these machines with no problem. For some reason I cannot run MPICH
>>>> on multiple nodes. Please let me know if more info is needed from my side;
>>>> I'm guessing there is some configuration that I am missing. I used MPICH
>>>> 3.1.3 for this test. I googled this problem but couldn't find any solution.
>>>>
>>>> In my MPI program, I am doing a simple allreduce over
>>>> MPI_COMM_WORLD.
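>>>> A minimal sketch of that pattern (illustrative variable names, not the
>>>> actual program):
>>>>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     double local = 1.0, global = 0.0;
>>>>     MPI_Init(&argc, &argv);
>>>>     /* One double reduced with MPI_SUM across all ranks. */
>>>>     MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }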
>>>>
>>>> my host file (hosts-hydra) is something like this:
>>>> oakmnt-0-a:1
>>>> oakmnt-0-b:1
>>>>
>>>> I get this error:
>>>>
>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>> status->MPI_TAG == recvtag
>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>> status->MPI_TAG == recvtag
>>>> internal ABORT - process 1
>>>> internal ABORT - process 0
>>>>
>>>>
>>>> ===================================================================================
>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>> = PID 30744 RUNNING AT oakmnt-0-b
>>>> = EXIT CODE: 1
>>>> = CLEANING UP REMAINING PROCESSES
>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>
>>>> ===================================================================================
>>>> [mpiexec at vulcan13] HYDU_sock_read
>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file
>>>> descriptor)
>>>> [mpiexec at vulcan13] control_cb
>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read
>>>> command from proxy
>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event
>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
>>>> error status
>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion
>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for
>>>> event
>>>> [mpiexec at vulcan13] main
>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>>> waiting for completion
>>>>
>>>> Thanks.
>>>> Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Antonio J. Peña
>>>> Postdoctoral Appointee
>>>> Mathematics and Computer Science Division
>>>> Argonne National Laboratory
>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>> Argonne, IL 60439-4847
>>>> apenya at mcs.anl.gov
>>>> www.mcs.anl.gov/~apenya
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>> <config.log>
>>
>>
>>
>>
>
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss