[mpich-discuss] Parallel test hanging with mpich on rhel7

Orion Poplawski orion at cora.nwra.com
Thu Feb 6 18:04:03 CST 2014


On 02/06/2014 09:12 AM, Kenneth Raffenetti wrote:
> Hi Orion,
> 
> On 02/04/2014 03:23 PM, Orion Poplawski wrote:
>> However, I'm still seeing a hang on our Fedora builders in a different test:
>>
>>
>> make[4]: Entering directory `/builddir/build/BUILD/hdf5-1.8.12/mpich/testpar'
>> ============================
>> Testing  t_mpi
>>
>> Full log:
>> http://koji.fedoraproject.org/koji/getfile?taskID=6492001&name=build.log
>>
>> Unfortunately I'm not able to reproduce this on my own machines so I'm at a
>> loss here.
> 
> We'll look into this and let you know if we find anything in our testing.
> 
> Ken

I may have something - one special thing about the Fedora builders is that
they do not have network access.  And in the particular environment that is
failing, ssh is outputting:

ssh: Could not resolve hostname buildvm-11.phx2.fedoraproject.org: Name or
service not known

This output seems to wedge mpiexec.  Here are some strace snippets:

+ strace -f mpirun -np 4 ./xCbtest_MPI-LINUX-0
execve("/usr/lib64/mpich/bin/mpirun", ["mpirun", "-np", "4",
"./xCbtest_MPI-LINUX-0"], [/* 46 vars */]) = 0
....

[pid  7662] execve("/usr/bin/ssh", ["/usr/bin/ssh", "-x",
"buildvm-11.phx2.fedoraproject.or"..., "\"/usr/lib64/mpich/bin/hydra_pmi_"...,
"--control-port", "buildvm-11.phx2.fedoraproject.or"..., "--rmk", "user",
"--launcher", "ssh", "--demux", "poll", "--pgid", "0", "--retries", "10",
...], [/* 46 vars */]) = 0

[pid  7662] write(2, "ssh: Could not resolve hostname "..., 94) = 94
[pid  7661] <... poll resumed> )        = 1 ([{fd=10, revents=POLLIN}])
[pid  7662] exit_group(255)             = ?
[pid  7661] fcntl(10, F_GETFL)          = 0 (flags O_RDONLY)
[pid  7661] fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid  7661] fcntl(2, F_GETFL)           = 0x1 (flags O_WRONLY)
[pid  7661] fcntl(2, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
[pid  7661] read(10, "ssh: Could not resolve hostname "..., 65536) = 94
[pid  7661] write(2, "ssh: Could not resolve hostname "..., 94ssh: Could not
resolve hostname buildvm-11.phx2.fedoraproject.org: Name or service not known
) = 94
[pid  7661] gettimeofday({1391730785, 32416}, NULL) = 0
[pid  7661] poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
events=POLLIN}, {fd=10, events=POLLIN}], 4, 4294967295 <unfinished ...>
[pid  7662] +++ exited with 255 +++
<... poll resumed> )                    = 2 ([{fd=8, revents=POLLHUP}, {fd=10,
revents=POLLHUP}])
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7662, si_status=255,
si_utime=0, si_stime=0} ---
brk(0)                                  = 0x1e0b000
brk(0x1e3a000)                          = 0x1e3a000
fcntl(8, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(8, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
fcntl(1, F_GETFL)                       = 0x1 (flags O_WRONLY)
fcntl(1, F_SETFL, O_WRONLY|O_NONBLOCK)  = 0
read(8, "", 65536)                      = 0
close(8)                                = 0
read(10, "", 65536)                     = 0
close(10)                               = 0
gettimeofday({1391730785, 33070}, NULL) = 0

and we are stuck here....
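
To make the sequence in the trace easier to follow, here is a minimal Python
sketch of the same pattern: a parent polls a child's stderr pipe, reads the
error text, then sees the hangup and reads EOF once the child exits with 255.
This is only an illustration of the pipe handling, not Hydra's actual code.
The helper name drain_child_stderr, the stand-in child command, and the host
name example.invalid are all made up for the example, and it assumes a
Unix-like system where select.poll is available.

```python
import os
import select
import subprocess
import sys


def drain_child_stderr(cmd):
    """Run cmd, poll its stderr pipe until EOF, return (output, events, status).

    Hypothetical helper for illustration; it mirrors the poll/read/close
    sequence visible in the strace output, nothing more.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.PIPE)
    fd = proc.stderr.fileno()

    poller = select.poll()
    poller.register(fd, select.POLLIN)  # POLLHUP is reported regardless

    chunks = []
    events_seen = []
    open_fds = {fd}
    while open_fds:
        for ready_fd, revents in poller.poll():
            events_seen.append(revents)
            data = os.read(ready_fd, 65536)
            if data:
                chunks.append(data)
            else:
                # read() returned 0 bytes: the write end is closed (EOF),
                # so stop watching this descriptor.
                poller.unregister(ready_fd)
                open_fds.discard(ready_fd)

    proc.stderr.close()
    status = proc.wait()  # reap the child, as after SIGCHLD in the trace
    return b"".join(chunks), events_seen, status


if __name__ == "__main__":
    # Stand-in for the failing ssh: print an error to stderr and exit 255.
    child = [sys.executable, "-c",
             "import sys; "
             "sys.stderr.write('ssh: Could not resolve hostname "
             "example.invalid\\n'); sys.exit(255)"]
    output, events, status = drain_child_stderr(child)
    print(output.decode().rstrip())
    print("exit status:", status)
```

In this sketch the parent has nothing left to wait on once the pipe is closed
and the child is reaped, so it returns.  In the trace above, mpiexec has done
the equivalent reads and closes on fds 8 and 10 but does not exit.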


A fuller log is available here for a while:
http://kojipkgs.fedoraproject.org//work/tasks/2917/6502917/build.log


Is there some way we can disable mpiexec trying to use ssh?


-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion at nwra.com
Boulder, CO 80301                   http://www.nwra.com


