[mpich-discuss] having problem running MPICH on multiple nodes

Junchao Zhang jczhang at mcs.anl.gov
Tue Nov 25 21:53:50 CST 2014


Is the failure specific to MPI_Allreduce?  Did other tests (like simple
send/recv) work?
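
A minimal point-to-point check along those lines might look like the sketch below. It is an illustration only, not code from this thread: the file name sendrecv_test.c, the token value, and the tags 0 and 1 are arbitrary choices. Rank 0 sends one integer to rank 1, rank 1 echoes it back, and each rank prints the host it runs on so you can confirm that the two processes really landed on different nodes.

```c
/* sendrecv_test.c (hypothetical file name): two-rank echo test. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on %s\n", rank, size, host);

    if (size < 2) {
        /* Nothing to exchange with a single process. */
        if (rank == 0)
            fprintf(stderr, "need at least 2 processes; nothing tested\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        int token = 42, echo = 0;   /* arbitrary payload */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&echo, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got %d back (%s)\n", echo,
               echo == token ? "ok" : "MISMATCH");
    } else if (rank == 1) {
        int token = 0;
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    /* Ranks above 1, if any, only print their host name. */

    MPI_Finalize();
    return 0;
}
```

It can be built with "mpicc sendrecv_test.c -o sendrecv_test" and launched the same way as the failing program, for example "mpirun -hostfile hosts-hydra -np 2 ./sendrecv_test". If this also fails across nodes, the problem is probably not specific to the collective.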

--Junchao Zhang

On Tue, Nov 25, 2014 at 9:41 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:

> Is there any debugging flag that I can turn on to figure out problems?
>
> Thanks.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Tue, Nov 25, 2014 at 9:31 PM, Amin Hassani <ahassani at cis.uab.edu>
> wrote:
>
>> Now I'm getting this error with MPICH-3.2a2.
>> Any thoughts?
>>
>> $ mpirun -hostfile hosts-hydra -np 2  test_dup
>> Fatal error in MPI_Allreduce: Unknown error class, error stack:
>> MPI_Allreduce(912)....................:
>> MPI_Allreduce(sbuf=0x7fffa5240e60, rbuf=0x7fffa5240e68, count=1,
>> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
>> MPIR_Allreduce_impl(769)..............:
>> MPIR_Allreduce_intra(419).............:
>> MPIDU_Complete_posted_with_error(1192): Process failed
>> Fatal error in MPI_Allreduce: Unknown error class, error stack:
>> MPI_Allreduce(912)....................:
>> MPI_Allreduce(sbuf=0x7fffaf6ef070, rbuf=0x7fffaf6ef078, count=1,
>> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
>> MPIR_Allreduce_impl(769)..............:
>> MPIR_Allreduce_intra(419).............:
>> MPIDU_Complete_posted_with_error(1192): Process failed
>>
>>
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   PID 451 RUNNING AT oakmnt-0-a
>> =   EXIT CODE: 1
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> ===================================================================================
>>
>> Thanks.
>>
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani <ahassani at cis.uab.edu>
>> wrote:
>>
>>> Ok, I'll try to test the alpha version. I'll let you know the results.
>>>
>>> Thank you.
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. <wbland at anl.gov>
>>> wrote:
>>>
>>>>  It’s hard to tell then. Other than some problems compiling (not
>>>> declaring all of your variables), everything seems ok. Can you try running
>>>> with the most recent alpha? I have no idea what bug we could have fixed
>>>> here to make things work, but it’d be good to eliminate the possibility.
>>>>
>>>>  Thanks,
>>>> Wesley
>>>>
>>>>  On Nov 25, 2014, at 10:11 PM, Amin Hassani <ahassani at cis.uab.edu>
>>>> wrote:
>>>>
>>>>   I have attached the config.log that exists in the root folder where
>>>> MPICH was compiled. I'm not too familiar with MPICH; there are other
>>>> config.log files in other directories, but I'm not sure whether you need
>>>> them too.
>>>>  I don't set any environment variables related to MPICH. I also tried
>>>> export HYDRA_HOST_FILE=<address to host file>,
>>>>  but I have the same problem.
>>>> I don't do anything FT-related in MPICH, and I don't think this version of
>>>> MPICH has anything related to FT in it.
>>>>
>>>>  Thanks.
>>>>
>>>>  Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>>
>>>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. <wbland at anl.gov>
>>>> wrote:
>>>>
>>>>> Can you also provide your config.log and any CVARs or other relevant
>>>>> environment variables that you might be setting (for instance, in relation
>>>>> to fault tolerance)?
>>>>>
>>>>>  Thanks,
>>>>> Wesley
>>>>>
>>>>>
>>>>>  On Nov 25, 2014, at 3:58 PM, Amin Hassani <ahassani at cis.uab.edu>
>>>>> wrote:
>>>>>
>>>>>   This is the simplest code I have that doesn't run.
>>>>>
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(int argc, char** argv)
>>>>> {
>>>>>     int rank, size;
>>>>>     double t2, t_avg;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm world = MPI_COMM_WORLD;
>>>>>     MPI_Comm_rank(world, &rank);
>>>>>     MPI_Comm_size(world, &size);
>>>>>
>>>>>     /* Every rank contributes 1, so the sum equals the number of ranks. */
>>>>>     t2 = 1;
>>>>>     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>>>     t_avg = t_avg / size;  /* average over all ranks */
>>>>>
>>>>>     MPI_Finalize();
>>>>>
>>>>>     return 0;
>>>>> }
>>>>>
>>>>>  Amin Hassani,
>>>>> CIS department at UAB,
>>>>> Birmingham, AL, USA.
>>>>>
>>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña"
>>>>> <apenya at mcs.anl.gov> wrote:
>>>>>
>>>>>>
>>>>>> Hi Amin,
>>>>>>
>>>>>> Can you share with us a minimal piece of code with which you can
>>>>>> reproduce this issue?
>>>>>>
>>>>>> Thanks,
>>>>>>   Antonio
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>>  I am having a problem running MPICH on multiple nodes. When I run
>>>>>> multiple MPI processes on one node, it works fine, but when I try to run
>>>>>> on multiple nodes, it fails with the error below.
>>>>>>  My machines run Debian and have both InfiniBand and TCP interconnects.
>>>>>> I'm guessing it has something to do with the TCP network, but I can run
>>>>>> Open MPI on these machines with no problem. For some reason I cannot run
>>>>>> MPICH on multiple nodes. Please let me know if more info is needed from my
>>>>>> side. I'm guessing there is some configuration that I am missing. I used
>>>>>> MPICH 3.1.3 for this test. I googled this problem but couldn't find any
>>>>>> solution.
>>>>>>
>>>>>>   In my MPI program, I am doing a simple allreduce over
>>>>>> MPI_COMM_WORLD.
>>>>>>
>>>>>>   My host file (hosts-hydra) is something like this:
>>>>>> oakmnt-0-a:1
>>>>>> oakmnt-0-b:1
>>>>>>
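
For readers unfamiliar with this host file layout, the sketch below shows one way to read a "host:N" line, where N is taken to be the number of processes to place on that host. It is an illustration only and not Hydra's actual parser: the function name parse_host_line is hypothetical, and treating a missing ":N" as 1 and a leading "#" as a comment are assumptions made here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse one host file line of the form "host" or
 * "host:N". Returns 0 on success and -1 if the line is blank, a comment,
 * or malformed. Assumption: a missing ":N" means one process. */
int parse_host_line(const char *line, char *host, size_t hostlen, int *nprocs)
{
    assert(line != NULL && host != NULL && nprocs != NULL);

    /* Skip leading blanks. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n' || *line == '#')
        return -1;

    /* The host name runs up to ':', whitespace, or end of line. */
    size_t n = strcspn(line, ": \t\r\n");
    if (n == 0 || n >= hostlen)
        return -1;
    memcpy(host, line, n);
    host[n] = '\0';

    *nprocs = 1;
    if (line[n] == ':') {
        char *end;
        long v = strtol(line + n + 1, &end, 10);
        /* Reject a missing, non-positive, or absurdly large count. */
        if (end == line + n + 1 || v <= 0 || v > 1000000)
            return -1;
        *nprocs = (int)v;
    }
    return 0;
}
```

Called on each line of the file above, this would yield the two host names oakmnt-0-a and oakmnt-0-b with one process each, which matches the "-np 2" on the command line.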
>>>>>>   I get this error:
>>>>>>
>>>>>>   $ mpirun -hostfile hosts-hydra -np 2  test_dup
>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>>> status->MPI_TAG == recvtag
>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>>> status->MPI_TAG == recvtag
>>>>>> internal ABORT - process 1
>>>>>> internal ABORT - process 0
>>>>>>
>>>>>>
>>>>>> ===================================================================================
>>>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>> =   PID 30744 RUNNING AT oakmnt-0-b
>>>>>> =   EXIT CODE: 1
>>>>>> =   CLEANING UP REMAINING PROCESSES
>>>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>>
>>>>>> ===================================================================================
>>>>>> [mpiexec at vulcan13] HYDU_sock_read
>>>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file
>>>>>> descriptor)
>>>>>> [mpiexec at vulcan13] control_cb
>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read
>>>>>> command from proxy
>>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event
>>>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
>>>>>> error status
>>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion
>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for
>>>>>> event
>>>>>> [mpiexec at vulcan13] main
>>>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>>>>> waiting for completion
>>>>>>
>>>>>>  Thanks.
>>>>>>   Amin Hassani,
>>>>>> CIS department at UAB,
>>>>>> Birmingham, AL, USA.
>>>>>>
>>>>>>
>>>>>>  _______________________________________________
>>>>>> discuss mailing list     discuss at mpich.org
>>>>>> To manage subscription options or unsubscribe:
>>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Antonio J. Peña
>>>>>> Postdoctoral Appointee
>>>>>> Mathematics and Computer Science Division
>>>>>> Argonne National Laboratory
>>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>>>> Argonne, IL 60439-4847
>>>>>> apenya at mcs.anl.gov
>>>>>> www.mcs.anl.gov/~apenya
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>  <config.log>
>>>
>>>
>>
>