[mpich-discuss] mpich hangs

Syed. Jahanzeb Maqbool Hashmi jahanzeb.maqbool at gmail.com
Thu Jun 27 21:41:51 CDT 2013


Sorry for giving so little information.

OK, here is the output that sometimes appears after a long hang:

================START OF OUTPUT=====================

linaro at weiser1:/mnt/nfs/jahanzeb/bench/hpl/hpl-2.1/bin/armv7-a$ mpirun -np 8 -machinefile machines ./xhpl
Fatal error in MPI_Send: A process has failed, error stack:
MPI_Send(171)..............: MPI_Send(buf=0xbe84fc50, count=1, MPI_INT,
dest=0, tag=9001, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 0: Connection
refused

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at weiser1] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0 at weiser1] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at weiser1] main (./pm/pmiserv/pmip.c:206): demux engine error
waiting for event
[mpiexec at weiser1] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at weiser1] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at weiser1] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
completion
[mpiexec at weiser1] main (./ui/mpich/mpiexec.c:331): process manager error
waiting for completion

================END OF OUTPUT=====================

On Fri, Jun 28, 2013 at 11:39 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> On 06/27/2013 09:36 PM, Syed. Jahanzeb Maqbool Hashmi wrote:
>
>> I am trying to run HPL on a cluster of nodes. The problem I am facing is
>> with mpich, even though I have configured mpich successfully. The program
>> runs on a single node without the -machinefile argument. But as soon as I
>> run it on multiple nodes (-machinefile nodes), the program hangs
>> indefinitely right after I issue the command.
>>
>
> Given how little information you have provided, here's the only response I
> can give:
>
> You are doing something wrong.
>
>  -- Pavan
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>