[mpich-discuss] Parallel test hanging with mpich on rhel7
Orion Poplawski
orion at cora.nwra.com
Thu Feb 6 18:04:03 CST 2014
On 02/06/2014 09:12 AM, Kenneth Raffenetti wrote:
> Hi Orion,
>
> On 02/04/2014 03:23 PM, Orion Poplawski wrote:
>> However, I'm still seeing a hang on our Fedora builders in a different test:
>>
>>
>> make[4]: Entering directory `/builddir/build/BUILD/hdf5-1.8.12/mpich/testpar'
>> ============================
>> Testing t_mpi
>>
>> Full log:
>> http://koji.fedoraproject.org/koji/getfile?taskID=6492001&name=build.log
>>
>> Unfortunately I'm not able to reproduce this on my own machines so I'm at a
>> loss here.
>
> We'll look into this and let you know if we find anything in our testing.
>
> Ken
I may have something - one special thing about the Fedora builders is that
they do not have network access. And in the particular environment that is
failing, ssh is outputting:
ssh: Could not resolve hostname buildvm-11.phx2.fedoraproject.org: Name or
service not known
This output seems to wedge mpiexec. Here are some strace snippets:
+ strace -f mpirun -np 4 ./xCbtest_MPI-LINUX-0
execve("/usr/lib64/mpich/bin/mpirun", ["mpirun", "-np", "4",
"./xCbtest_MPI-LINUX-0"], [/* 46 vars */]) = 0
....
[pid 7662] execve("/usr/bin/ssh", ["/usr/bin/ssh", "-x",
"buildvm-11.phx2.fedoraproject.or"..., "\"/usr/lib64/mpich/bin/hydra_pmi_"...,
"--control-port", "buildvm-11.phx2.fedoraproject.or"..., "--rmk", "user",
"--launcher", "ssh", "--demux", "poll", "--pgid", "0", "--retries", "10",
...], [/* 46 vars */]) = 0
[pid 7662] write(2, "ssh: Could not resolve hostname "..., 94) = 94
[pid 7661] <... poll resumed> ) = 1 ([{fd=10, revents=POLLIN}])
[pid 7662] exit_group(255) = ?
[pid 7661] fcntl(10, F_GETFL) = 0 (flags O_RDONLY)
[pid 7661] fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid 7661] fcntl(2, F_GETFL) = 0x1 (flags O_WRONLY)
[pid 7661] fcntl(2, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
[pid 7661] read(10, "ssh: Could not resolve hostname "..., 65536) = 94
[pid 7661] write(2, "ssh: Could not resolve hostname "..., 94ssh: Could not
resolve hostname buildvm-11.phx2.fedoraproject.org: Name or service not known
) = 94
[pid 7661] gettimeofday({1391730785, 32416}, NULL) = 0
[pid 7661] poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
events=POLLIN}, {fd=10, events=POLLIN}], 4, 4294967295 <unfinished ...>
[pid 7662] +++ exited with 255 +++
<... poll resumed> ) = 2 ([{fd=8, revents=POLLHUP}, {fd=10,
revents=POLLHUP}])
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7662, si_status=255,
si_utime=0, si_stime=0} ---
brk(0) = 0x1e0b000
brk(0x1e3a000) = 0x1e3a000
fcntl(8, F_GETFL) = 0 (flags O_RDONLY)
fcntl(8, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl(1, F_GETFL) = 0x1 (flags O_WRONLY)
fcntl(1, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
read(8, "", 65536) = 0
close(8) = 0
read(10, "", 65536) = 0
close(10) = 0
gettimeofday({1391730785, 33070}, NULL) = 0
and we are stuck here....
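
For anyone who wants to look at the pattern outside of a full build, here is a minimal sketch of what the trace shows: a parent polls a child's stderr pipe, the child prints an error and exits with 255, and the parent reads the message followed by a zero-length read (EOF). This is not Hydra's code. The function name watch_child_stderr, the FAILING_CHILD command, and the example.invalid hostname are made up for illustration, and the sketch assumes a platform where select.poll is available (Linux, not Windows). It uses a finite poll timeout and reaps the child after EOF, so a launcher that died early shows up as an exit status instead of a hang.

```python
import os
import select
import subprocess
import sys

# Hypothetical stand-in for the failing ssh launcher: print an error on
# stderr and exit with status 255, as ssh does in the trace.
FAILING_CHILD = [
    sys.executable,
    "-c",
    "import sys; "
    "sys.stderr.write('ssh: Could not resolve hostname example.invalid\\n'); "
    "sys.exit(255)",
]


def watch_child_stderr(argv, timeout_ms=5000):
    """Spawn argv, poll its stderr pipe, and return (events, output, returncode).

    events is a list of "DATA", "EOF" or "TIMEOUT" markers in the order seen.
    """
    child = subprocess.Popen(argv, stderr=subprocess.PIPE)
    fd = child.stderr.fileno()

    poller = select.poll()
    poller.register(fd, select.POLLIN)

    events = []
    output = b""
    while True:
        # Finite timeout: never block forever on a child that may be gone.
        ready = poller.poll(timeout_ms)
        if not ready:
            events.append("TIMEOUT")
            break
        data = os.read(fd, 65536)
        if data:
            events.append("DATA")
            output += data
        else:
            # Zero-length read: the write end of the pipe has been closed.
            events.append("EOF")
            poller.unregister(fd)
            break

    child.stderr.close()
    if events[-1] == "TIMEOUT":
        # The child is still holding the pipe open; stop it before reaping.
        child.kill()
    returncode = child.wait()  # reap the child so its exit status is seen
    return events, output, returncode


if __name__ == "__main__":
    events, output, returncode = watch_child_stderr(FAILING_CHILD)
    print(events, output.decode().strip(), returncode)
```

The point of the sketch is only that once every launcher pipe has reached EOF and the child has exited non-zero, there is nothing left to wait for, so the parent can report the failure rather than go back to polling.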
The full logs will be available here for a while:
http://kojipkgs.fedoraproject.org//work/tasks/2917/6502917/build.log
Is there some way we can disable mpiexec trying to use ssh?
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane orion at nwra.com
Boulder, CO 80301 http://www.nwra.com