[mpich-discuss] ./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
Pavan Balaji
balaji at mcs.anl.gov
Tue Aug 27 09:01:44 CDT 2013
This is almost certainly a network issue with your third machine (kaak,
I presume?).
Thanks for making sure "hostname" works fine on all machines. That
means that your ssh connections are set up correctly. But a non-MPI
program, such as hostname, does not check the connection from kaak back
to mpi1.
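
One way to exercise that reverse path is a plain TCP connect from kaak
to mpi1. The following is a minimal sketch, not part of MPICH:
can_connect is a hypothetical helper, and the port used in the note
below is an arbitrary assumption (the ports used in a real MPI run will
generally differ).

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Hypothetical helper for illustration; it only tests basic TCP
    reachability, not anything MPI-specific.
    """
    try:
        # create_connection resolves the name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

To use it, start any listener on mpi1 on a free port (5000, say), then
call can_connect("mpi1", 5000) on kaak. A False result would point at a
firewall or name-resolution problem on that path.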
Can you try a simple program like "examples/cpi" in the build directory
on all machines? Try it on 2 machines (mpiexec -np 4) and 3 machines
(mpiexec -np 6).
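
The two runs can be sketched as follows. This assumes a Hydra host file
passed with "-f" and two processes per host (which matches the hostname
output quoted below); the helper names and the file name hosts.txt are
made up for illustration, and your setup may select hosts differently.

```python
def hostfile_text(hosts, procs_per_host=2):
    """Build host file contents, one 'host:count' entry per line."""
    return "".join(f"{host}:{procs_per_host}\n" for host in hosts)


def cpi_argv(hostfile, hosts, procs_per_host=2, program="./examples/cpi"):
    """Build the mpiexec argument list for running cpi on the given hosts."""
    nprocs = len(hosts) * procs_per_host
    return ["mpiexec", "-f", hostfile, "-np", str(nprocs), program]


# Two machines first (-np 4), then all three (-np 6).
two_hosts = ["mpi1", "ugh"]
three_hosts = ["mpi1", "ugh", "kaak"]
print(" ".join(cpi_argv("hosts.txt", two_hosts)))
print(" ".join(cpi_argv("hosts.txt", three_hosts)))
```

Write hostfile_text(...) to hosts.txt before each run so that the file
lists exactly the machines under test.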
If the third machine is in fact having problems running the application:
1. Make sure there's no firewall on the third machine.
2. Make sure the /etc/hosts file is consistent on both machines
(mpi1 and kaak).
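
For the second check, here is a rough sketch of how the two /etc/hosts
files could be compared once you have copied their contents to one
place. The function names are hypothetical, and the loopback check
reflects a commonly reported pitfall (a machine's own name mapped to a
127.x address), which may or may not apply here.

```python
def parse_hosts(text):
    """Map each host name in /etc/hosts-style text to its first address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, address)  # keep the first entry seen
    return mapping


def inconsistent_names(text_a, text_b, names):
    """Return {name: (addr_a, addr_b)} for names that differ or are missing."""
    a, b = parse_hosts(text_a), parse_hosts(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in names
        if a.get(name) is None or a.get(name) != b.get(name)
    }


def loopback_names(text, names):
    """Return the names that resolve to a 127.x address in this file."""
    mapping = parse_hosts(text)
    return [n for n in names if mapping.get(n, "").startswith("127.")]
```

Calling inconsistent_names(hosts_mpi1, hosts_kaak, ["mpi1", "kaak"])
with the two file contents should return an empty dict when the files
agree on those names.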
-- Pavan
On 08/27/2013 06:46 AM, Joni-Pekka Kurronen wrote:
>
> I have:
> -Ubuntu 12.4
> -rsh-redo-rsh
> -three machines
> -mpich3
> -have tried export HYDRA_DEMUX=select / poll
> -have tried ssh/rsh
> -have added to LIBS: event_core event_pthreads
>
> I can run the tests on up to two machines without error, but
> when I add a third machine to the cluster the demux engine goes mad:
> a connection hangs and nothing happens.
>
>
> <MPITEST>
> <NAME>uoplong</NAME>
> <NP>11</NP>
> <WORKDIR>./coll</WORKDIR>
> <STATUS>fail</STATUS>
> <TESTDIFF>
> [mpiexec@mpi1] APPLICATION TIMED OUT
> [proxy:0:0@mpi1] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:0@mpi1] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0@mpi1] main (./pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [mpiexec@mpi1] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec@mpi1] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
> for completion
> [mpiexec@mpi1] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:188): launcher returned error waiting for
> completion
> [mpiexec@mpi1] main (./ui/mpich/mpiexec.c:331): process manager error
> waiting for completion
> </TESTDIFF>
> </MPITEST>
>
> Also I can run
> joni@mpi1:/mpi3/S3/hpcc-1.4.2$ mpiexec -np 6 hostname
> mpi1
> mpi1
> ugh
> ugh
> kaak
> kaak
>
> but if I run
> joni@mpi1:/mpi3/S3/hpcc-1.4.2$ mpiexec -np 6 ls
> I get only one directory as output, and the
> system hangs until I restart the slave machines!
>
>
>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji