[mpich-discuss] having problem running MPICH on multiple nodes

Amin Hassani ahassani at cis.uab.edu
Tue Nov 25 21:41:06 CST 2014


Is there any debugging flag that I can turn on to figure out problems?
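For example, would something along these lines get more detail out of Hydra? (I'm only guessing at the options here.)

$ HYDRA_DEBUG=1 mpirun -verbose -print-all-exitcodes -hostfile hosts-hydra -np 2 test_dup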

Thanks.

Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.

On Tue, Nov 25, 2014 at 9:31 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:

> Now I'm getting this error with MPICH-3.2a2.
> Any thoughts?
>
> $ mpirun -hostfile hosts-hydra -np 2  test_dup
> Fatal error in MPI_Allreduce: Unknown error class, error stack:
> MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffa5240e60,
> rbuf=0x7fffa5240e68, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
> MPIR_Allreduce_impl(769)..............:
> MPIR_Allreduce_intra(419).............:
> MPIDU_Complete_posted_with_error(1192): Process failed
> Fatal error in MPI_Allreduce: Unknown error class, error stack:
> MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffaf6ef070,
> rbuf=0x7fffaf6ef078, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
> MPIR_Allreduce_impl(769)..............:
> MPIR_Allreduce_intra(419).............:
> MPIDU_Complete_posted_with_error(1192): Process failed
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 451 RUNNING AT oakmnt-0-a
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
>
> Thanks.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani <ahassani at cis.uab.edu>
> wrote:
>
>> Ok, I'll try to test the alpha version. I'll let you know the results.
>>
>> Thank you.
>>
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. <wbland at anl.gov> wrote:
>>
>>> It's hard to tell then. Other than some problems compiling (not
>>> declaring all of your variables), everything seems OK. Can you try running
>>> with the most recent alpha release? I have no idea what bug we could have
>>> fixed that would make things work here, but it'd be good to eliminate the
>>> possibility.
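>>>
>>> (It's also worth double-checking which install is actually being picked up;
>>> something like this should confirm the version, assuming the new build's
>>> bin directory is first in your PATH:)
>>>
>>> $ which mpirun
>>> $ mpichversion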
>>>
>>>  Thanks,
>>> Wesley
>>>
>>>  On Nov 25, 2014, at 10:11 PM, Amin Hassani <ahassani at cis.uab.edu>
>>> wrote:
>>>
>>> Here I attached the config.log from the root of the directory where MPICH
>>> was compiled. I'm not too familiar with MPICH; there are other config.log
>>> files in other directories as well, but I'm not sure whether you need those
>>> too.
>>> I don't have any environment variables set that relate to MPICH. I also
>>> tried
>>> export HYDRA_HOST_FILE=<address to host file>,
>>> but I have the same problem.
>>> I don't do anything FT-related in MPICH; I don't think this version of
>>> MPICH has anything related to FT in it anyway.
>>>
>>>  Thanks.
>>>
>>>  Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. <wbland at anl.gov>
>>> wrote:
>>>
>>>> Can you also provide your config.log and any CVARs or other relevant
>>>> environment variables that you might be setting (for instance, in relation
>>>> to fault tolerance)?
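>>>>
>>>> (For instance, something like this should catch anything relevant; the
>>>> exact grep pattern is only a suggestion:)
>>>>
>>>> $ env | grep -E 'MPIR_CVAR|MPICH|HYDRA'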
>>>>
>>>>  Thanks,
>>>> Wesley
>>>>
>>>>
>>>>  On Nov 25, 2014, at 3:58 PM, Amin Hassani <ahassani at cis.uab.edu>
>>>> wrote:
>>>>
>>>>  This is the simplest code I have that reproduces the failure.
>>>>
>>>>
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>>
>>>> int main(int argc, char** argv)
>>>> {
>>>>     int rank, size;
>>>>     double t2, t_avg;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm world = MPI_COMM_WORLD;
>>>>     MPI_Comm_rank(world, &rank);
>>>>     MPI_Comm_size(world, &size);
>>>>
>>>>     /* Each rank contributes 1.0; the allreduce sums the contributions
>>>>      * and the result is averaged over all ranks. */
>>>>     t2 = 1;
>>>>     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>>     t_avg = t_avg / size;
>>>>
>>>>     MPI_Finalize();
>>>>
>>>>     return 0;
>>>> }
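>>>>
>>>> For reference, I build and run it roughly like this (assuming MPICH's
>>>> mpicc and mpirun are the ones on my PATH):
>>>>
>>>> $ mpicc -o test_dup test_dup.c
>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup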
>>>>
>>>>  Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>>
>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi Amin,
>>>>>
>>>>> Can you share with us a minimal piece of code with which you can
>>>>> reproduce this issue?
>>>>>
>>>>> Thanks,
>>>>>   Antonio
>>>>>
>>>>>
>>>>>
>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>>
>>>>>   Hi,
>>>>>
>>>>> I am having a problem running MPICH on multiple nodes. When I run
>>>>> multiple MPI processes on one node, it works fine, but when I try to run
>>>>> across multiple nodes, it fails with the error below.
>>>>> My machines run Debian and have both InfiniBand and TCP interconnects.
>>>>> I'm guessing it has something to do with the TCP network, but I can run
>>>>> Open MPI on these machines with no problem; for some reason I just cannot
>>>>> run MPICH across nodes. I used MPICH 3.1.3 for this test and googled the
>>>>> problem, but couldn't find any solution. I'm guessing there is some
>>>>> configuration I am missing; please let me know if you need more
>>>>> information from my side.
>>>>>
>>>>> In my MPI program, I am doing a simple allreduce over
>>>>> MPI_COMM_WORLD.
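>>>>> Essentially it is a single call like this (the variable names here are
>>>>> just placeholders, not the ones in my actual code):
>>>>>
>>>>>   MPI_Allreduce(&local_val, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);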
>>>>>
>>>>> My host file (hosts-hydra) looks like this:
>>>>> oakmnt-0-a:1
>>>>> oakmnt-0-b:1
>>>>>
>>>>> I get this error:
>>>>>
>>>>>   $ mpirun -hostfile hosts-hydra -np 2  test_dup
>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>> status->MPI_TAG == recvtag
>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>> status->MPI_TAG == recvtag
>>>>> internal ABORT - process 1
>>>>> internal ABORT - process 0
>>>>>
>>>>>
>>>>> ===================================================================================
>>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>> =   PID 30744 RUNNING AT oakmnt-0-b
>>>>> =   EXIT CODE: 1
>>>>> =   CLEANING UP REMAINING PROCESSES
>>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>
>>>>> ===================================================================================
>>>>> [mpiexec at vulcan13] HYDU_sock_read
>>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file
>>>>> descriptor)
>>>>> [mpiexec at vulcan13] control_cb
>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read
>>>>> command from proxy
>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event
>>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
>>>>> error status
>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion
>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for
>>>>> event
>>>>> [mpiexec at vulcan13] main
>>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>>>> waiting for completion
>>>>>
>>>>>  Thanks.
>>>>>   Amin Hassani,
>>>>> CIS department at UAB,
>>>>> Birmingham, AL, USA.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Antonio J. Peña
>>>>> Postdoctoral Appointee
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Laboratory
>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>>> Argonne, IL 60439-4847
>>>>> apenya at mcs.anl.gov
>>>>> www.mcs.anl.gov/~apenya
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>  <config.log>
>>>
>>>
>>>
>>>
>>
>>
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

