[mpich-discuss] having a problem running MPICH on multiple nodes

Amin Hassani ahassani at cis.uab.edu
Tue Nov 25 21:31:19 CST 2014


Now I'm getting this error with MPICH 3.2a2.
Any thoughts?

$ mpirun -hostfile hosts-hydra -np 2 test_dup
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffa5240e60,
rbuf=0x7fffa5240e68, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(769)..............:
MPIR_Allreduce_intra(419).............:
MPIDU_Complete_posted_with_error(1192): Process failed
Fatal error in MPI_Allreduce: Unknown error class, error stack:
MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffaf6ef070,
rbuf=0x7fffaf6ef078, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(769)..............:
MPIR_Allreduce_intra(419).............:
MPIDU_Complete_posted_with_error(1192): Process failed

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 451 RUNNING AT oakmnt-0-a
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
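
For reference, here is a minimal sketch (assuming the same MPI_MAX reduction as in the trace above; this is illustrative, not my actual test program) of how the failing call can be made to report an error class and string instead of aborting, by switching MPI_COMM_WORLD to MPI_ERRORS_RETURN:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, rc, eclass, len;
    char estr[MPI_MAX_ERROR_STRING];
    double in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors to the caller instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Decode the error code returned by the failing collective. */
        MPI_Error_class(rc, &eclass);
        MPI_Error_string(rc, estr, &len);
        fprintf(stderr, "rank %d: MPI_Allreduce failed, class %d: %s\n",
                rank, eclass, estr);
    }

    MPI_Finalize();
    return 0;
}

This does not fix anything by itself, but it may surface a more specific error class than the "Unknown error class" shown above.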

Thanks.

Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.

On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:

> OK, I'll test the alpha version and let you know the results.
>
> Thank you.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. <wbland at anl.gov> wrote:
>
>>  It’s hard to tell, then. Other than some compile problems (not all of your
>> variables are declared), everything seems OK. Can you try running
>> with the most recent alpha? I have no idea what bug we could have fixed
>> that would make things work here, but it’d be good to eliminate the possibility.
>>
>>  Thanks,
>> Wesley
>>
>>  On Nov 25, 2014, at 10:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>
>>   I've attached the config.log that is in the root folder where MPICH was
>> compiled. I'm not too familiar with MPICH, but there are other config.logs
>> in other directories as well; I'm not sure whether you need those too.
>>  I don't have any environment variables set that relate to MPICH.
>> I also tried
>> export HYDRA_HOST_FILE=<address to host file>,
>> but I have the same problem.
>> I don't do anything FT-related in MPICH, and I don't think this version of
>> MPICH has anything FT-related in it.
>>
>>  Thanks.
>>
>>  Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. <wbland at anl.gov> wrote:
>>
>>> Can you also provide your config.log and any CVARs or other relevant
>>> environment variables that you might be setting (for instance, in relation
>>> to fault tolerance)?
>>>
>>>  Thanks,
>>> Wesley
>>>
>>>
>>>  On Nov 25, 2014, at 3:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>
>>>   This is the simplest code I have that doesn't run.
>>>
>>>
>>>  #include <mpi.h>
>>> #include <stdio.h>
>>> #include <malloc.h>
>>> #include <unistd.h>
>>> #include <stdlib.h>
>>>
>>>  int main(int argc, char** argv)
>>> {
>>>     int rank, size;
>>>     int i, j, k;
>>>     double t1, t2, t_avg;
>>>     int rc;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
>>>     MPI_Comm_rank(world, &rank);
>>>     MPI_Comm_size(world, &size);
>>>
>>>     t2 = 1;
>>>     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>     t_avg = t_avg / size;
>>>
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>>  Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov>
>>> wrote:
>>>
>>>>
>>>> Hi Amin,
>>>>
>>>> Can you share with us a minimal piece of code with which you can
>>>> reproduce this issue?
>>>>
>>>> Thanks,
>>>>   Antonio
>>>>
>>>>
>>>>
>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>
>>>>   Hi,
>>>>
>>>>  I am having a problem running MPICH on multiple nodes. When I run
>>>> multiple MPI processes on one node, it works fine, but when I try to run
>>>> on multiple nodes, it fails with the error below.
>>>>  My machines run Debian and have both InfiniBand and TCP interconnects. I'm
>>>> guessing it has something to do with the TCP network, but I can run Open MPI
>>>> on these machines with no problem; for some reason I just cannot run MPICH
>>>> on multiple nodes. Please let me know if more info is needed from my side.
>>>> I'm guessing there is some configuration I am missing. I used MPICH
>>>> 3.1.3 for this test and googled the problem but couldn't find any solution.
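>>>>
>>>>  As a sanity check of the plain TCP path between the two nodes, something
>>>> like the two-rank ping-pong sketched below could be run with the same host
>>>> file (illustrative only, not my actual test program):
>>>>
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     int rank, size, msg = 0;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>
>>>>     /* Rank 0 sends one int to rank 1 and waits for the echo. */
>>>>     if (rank == 0 && size > 1) {
>>>>         msg = 42;
>>>>         MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>>>         MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>         printf("ping-pong across nodes completed (msg=%d)\n", msg);
>>>>     } else if (rank == 1) {
>>>>         MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>         MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
>>>>     }
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }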
>>>>
>>>>   In my MPI program, I am doing a simple allreduce over
>>>> MPI_COMM_WORLD.
>>>>
>>>>   My host file (hosts-hydra) is something like this:
>>>> oakmnt-0-a:1
>>>> oakmnt-0-b:1
>>>>
>>>>   I get this error:
>>>>
>>>>   $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>> status->MPI_TAG == recvtag
>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>> status->MPI_TAG == recvtag
>>>> internal ABORT - process 1
>>>> internal ABORT - process 0
>>>>
>>>>
>>>> ===================================================================================
>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>> =   PID 30744 RUNNING AT oakmnt-0-b
>>>> =   EXIT CODE: 1
>>>> =   CLEANING UP REMAINING PROCESSES
>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>
>>>> ===================================================================================
>>>> [mpiexec at vulcan13] HYDU_sock_read
>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file
>>>> descriptor)
>>>> [mpiexec at vulcan13] control_cb
>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read
>>>> command from proxy
>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event
>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
>>>> error status
>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion
>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for
>>>> event
>>>> [mpiexec at vulcan13] main
>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>>> waiting for completion
>>>>
>>>>  Thanks.
>>>>   Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>>
>>>> --
>>>> Antonio J. Peña
>>>> Postdoctoral Appointee
>>>> Mathematics and Computer Science Division
>>>> Argonne National Laboratory
>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>> Argonne, IL 60439-4847
>>>> apenya at mcs.anl.gov
>>>> www.mcs.anl.gov/~apenya
>>>>
>>>
>>
>>  <config.log>
>
>