[mpich-discuss] having problem running MPICH on multiple nodes
Amin Hassani
ahassani at cis.uab.edu
Tue Nov 25 22:02:24 CST 2014
Same type of problem. It seems to be a network issue, but as I mentioned, I
can run Open MPI on these same machines without trouble, over both TCP and
InfiniBand. The machines are not behind a firewall. I get the same problem
even if I run mpirun from one of the compute nodes (not the head node).
Fatal error in MPI_Send: Unknown error class, error stack:
MPI_Send(174)..............: MPI_Send(buf=0x7fff9cb16128, count=1, MPI_INT,
dest=1, tag=0, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection
refused
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 1438 RUNNING AT oakmnt-0-a
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb
(../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event
(../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
error status
[proxy:0:1 at oakmnt-0-b] main
(../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
waiting for event
[mpiexec at vulcan13] HYDT_bscu_wait_for_completion
(../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the
processes terminated badly; aborting
[mpiexec at vulcan13] HYDT_bsci_wait_for_completion
(../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher
returned error waiting for completion
[mpiexec at vulcan13] HYD_pmci_wait_for_completion
(../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned
error waiting for completion
[mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344):
process manager error waiting for completion
Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.
On Tue, Nov 25, 2014 at 9:53 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
> Is the failure specific to MPI_Allreduce? Did other tests (like simple
> send/recv) work?
>
> --Junchao Zhang
>
> On Tue, Nov 25, 2014 at 9:41 PM, Amin Hassani <ahassani at cis.uab.edu>
> wrote:
>
>> Is there any debugging flag that I can turn on to figure out problems?
>>
>> Thanks.
>>
>> Amin Hassani,
>> CIS department at UAB,
>> Birmingham, AL, USA.
>>
>> On Tue, Nov 25, 2014 at 9:31 PM, Amin Hassani <ahassani at cis.uab.edu>
>> wrote:
>>
>>> Now I'm getting this error with MPICH-3.2a2. Any thoughts?
>>>
>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>> Fatal error in MPI_Allreduce: Unknown error class, error stack:
>>> MPI_Allreduce(912)....................:
>>> MPI_Allreduce(sbuf=0x7fffa5240e60, rbuf=0x7fffa5240e68, count=1,
>>> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
>>> MPIR_Allreduce_impl(769)..............:
>>> MPIR_Allreduce_intra(419).............:
>>> MPIDU_Complete_posted_with_error(1192): Process failed
>>> Fatal error in MPI_Allreduce: Unknown error class, error stack:
>>> MPI_Allreduce(912)....................:
>>> MPI_Allreduce(sbuf=0x7fffaf6ef070, rbuf=0x7fffaf6ef078, count=1,
>>> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed
>>> MPIR_Allreduce_impl(769)..............:
>>> MPIR_Allreduce_intra(419).............:
>>> MPIDU_Complete_posted_with_error(1192): Process failed
>>>
>>>
>>> ===================================================================================
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = PID 451 RUNNING AT oakmnt-0-a
>>> = EXIT CODE: 1
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>> ===================================================================================
>>>
>>>
>>> Thanks.
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani <ahassani at cis.uab.edu>
>>> wrote:
>>>
>>>> Ok, I'll try to test the alpha version. I'll let you know the results.
>>>>
>>>> Thank you.
>>>>
>>>> Amin Hassani,
>>>> CIS department at UAB,
>>>> Birmingham, AL, USA.
>>>>
>>>> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. <wbland at anl.gov>
>>>> wrote:
>>>>
>>>>> It’s hard to tell, then. Other than some problems compiling (not
>>>>> declaring all of your variables), everything seems OK. Can you try
>>>>> running with the most recent alpha? I have no idea what bug we could have
>>>>> fixed here to make things work, but it’d be good to eliminate the
>>>>> possibility.
>>>>>
>>>>> Thanks,
>>>>> Wesley
>>>>>
>>>>> On Nov 25, 2014, at 10:11 PM, Amin Hassani <ahassani at cis.uab.edu>
>>>>> wrote:
>>>>>
>>>>> I've attached the config.log that sits in the root folder where MPICH
>>>>> was compiled. I'm not too familiar with MPICH; there are other
>>>>> config.logs in other directories as well, but I'm not sure whether you
>>>>> need those too.
>>>>> I don't have any environment variables set that relate to MPICH. I also
>>>>> tried
>>>>> export HYDRA_HOST_FILE=<address to host file>,
>>>>> but I have the same problem.
>>>>> I don't do anything FT-related; I don't think this version of MPICH has
>>>>> anything related to FT in it.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Amin Hassani,
>>>>> CIS department at UAB,
>>>>> Birmingham, AL, USA.
>>>>>
>>>>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. <wbland at anl.gov>
>>>>> wrote:
>>>>>
>>>>>> Can you also provide your config.log and any CVARs or other relevant
>>>>>> environment variables that you might be setting (for instance, in relation
>>>>>> to fault tolerance)?
>>>>>>
>>>>>> Thanks,
>>>>>> Wesley
>>>>>>
>>>>>>
>>>>>> On Nov 25, 2014, at 3:58 PM, Amin Hassani <ahassani at cis.uab.edu>
>>>>>> wrote:
>>>>>>
>>>>>> This is the simplest code I have that doesn't run.
>>>>>>
>>>>>>
>>>>>> #include <mpi.h>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> int main(int argc, char** argv)
>>>>>> {
>>>>>>     int rank, size;
>>>>>>     double t2, t_avg;  /* t_avg must be declared before use */
>>>>>>
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     MPI_Comm world = MPI_COMM_WORLD;
>>>>>>     MPI_Comm_rank(world, &rank);
>>>>>>     MPI_Comm_size(world, &size);
>>>>>>
>>>>>>     t2 = 1;
>>>>>>     MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
>>>>>>     t_avg = t_avg / size;
>>>>>>
>>>>>>     MPI_Finalize();
>>>>>>
>>>>>>     return 0;
>>>>>> }
>>>>>>
>>>>>> Amin Hassani,
>>>>>> CIS department at UAB,
>>>>>> Birmingham, AL, USA.
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <
>>>>>> apenya at mcs.anl.gov> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Amin,
>>>>>>>
>>>>>>> Can you share with us a minimal piece of code with which you can
>>>>>>> reproduce this issue?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Antonio
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am having a problem running MPICH on multiple nodes. When I run
>>>>>>> multiple MPI processes on one node it works fine, but when I try to
>>>>>>> run on multiple nodes, it fails with the error below.
>>>>>>> My machines run Debian and have both InfiniBand and TCP interconnects.
>>>>>>> I'm guessing it has something to do with the TCP network, but I can
>>>>>>> run Open MPI on these machines with no problem; for some reason I
>>>>>>> cannot run MPICH on multiple nodes. Please let me know if more info is
>>>>>>> needed from my side. I'm guessing there is some configuration I am
>>>>>>> missing. I used MPICH 3.1.3 for this test. I googled this problem but
>>>>>>> couldn't find any solution.
>>>>>>>
>>>>>>> In my MPI program, I am doing a simple allreduce over
>>>>>>> MPI_COMM_WORLD.
>>>>>>>
>>>>>>> My host file (hosts-hydra) is something like this:
>>>>>>> oakmnt-0-a:1
>>>>>>> oakmnt-0-b:1
>>>>>>>
>>>>>>> I get this error:
>>>>>>>
>>>>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>>>> status->MPI_TAG == recvtag
>>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490:
>>>>>>> status->MPI_TAG == recvtag
>>>>>>> internal ABORT - process 1
>>>>>>> internal ABORT - process 0
>>>>>>>
>>>>>>>
>>>>>>> ===================================================================================
>>>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>>> = PID 30744 RUNNING AT oakmnt-0-b
>>>>>>> = EXIT CODE: 1
>>>>>>> = CLEANING UP REMAINING PROCESSES
>>>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>>>
>>>>>>> ===================================================================================
>>>>>>> [mpiexec at vulcan13] HYDU_sock_read
>>>>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file
>>>>>>> descriptor)
>>>>>>> [mpiexec at vulcan13] control_cb
>>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read
>>>>>>> command from proxy
>>>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event
>>>>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
>>>>>>> error status
>>>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion
>>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for
>>>>>>> event
>>>>>>> [mpiexec at vulcan13] main
>>>>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>>>>>> waiting for completion
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Amin Hassani,
>>>>>>> CIS department at UAB,
>>>>>>> Birmingham, AL, USA.
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> discuss mailing list discuss at mpich.org
>>>>>>> To manage subscription options or unsubscribe:
>>>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Antonio J. Peña
>>>>>>> Postdoctoral Appointee
>>>>>>> Mathematics and Computer Science Division
>>>>>>> Argonne National Laboratory
>>>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>>>>>>> Argonne, IL 60439-4847
>>>>>>> apenya at mcs.anl.gov
>>>>>>> www.mcs.anl.gov/~apenya
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> <config.log>
>>>>
>>>>
>>>
>>
>
>