[mpich-discuss] Parallel test hanging with mpich on rhel7

Balaji, Pavan balaji at anl.gov
Thu Feb 6 22:10:53 CST 2014


Thanks.  That's very useful analysis.  Would you be willing to try the
attached patch to see if it solves this issue?

  -- Pavan
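
The attached patch is scrubbed from this archive, so its contents are not
shown here.  Purely as an illustration of what "improving localhost
detection" could mean, the Python sketch below decides whether a host name
refers to the local machine without doing any DNS lookup, which matters on
builders that have no network access.  The function name `is_local_host`,
its `local_name` parameter, and the matching rules are assumptions made for
this example; they are not taken from the patch or from Hydra's source.

```python
import socket

# Names that always refer to the local machine (assumed list for this sketch).
LOOPBACK_NAMES = {"localhost", "127.0.0.1", "::1"}


def is_local_host(name, local_name=None):
    """Guess whether `name` is this machine, without resolving it via DNS.

    `local_name` defaults to socket.gethostname(); it is a parameter only so
    the check can be exercised with a fixed host name.
    """
    if local_name is None:
        local_name = socket.gethostname()

    # Normalise case and drop a trailing dot from fully qualified names.
    name = name.strip().lower().rstrip(".")
    local = local_name.strip().lower().rstrip(".")

    if name in LOOPBACK_NAMES or name == local:
        return True

    # If either side is unqualified, fall back to comparing short names,
    # e.g. "buildvm-11" against "buildvm-11.phx2.fedoraproject.org".
    if "." not in name or "." not in local:
        return name.split(".")[0] == local.split(".")[0]

    return False
```

A launcher that applied a check along these lines could start local
processes directly instead of going through ssh; whether the real patch
works this way is not something this message confirms.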

On 2/6/14, 6:04 PM, "Orion Poplawski" <orion@cora.nwra.com> wrote:

>On 02/06/2014 09:12 AM, Kenneth Raffenetti wrote:
>> Hi Orion,
>>
>> On 02/04/2014 03:23 PM, Orion Poplawski wrote:
>>> However, I'm still seeing a hang on our Fedora builders in a different
>>> test:
>>>
>>>
>>> make[4]: Entering directory
>>> `/builddir/build/BUILD/hdf5-1.8.12/mpich/testpar'
>>> ============================
>>> Testing  t_mpi
>>>
>>> Full log:
>>>
>>> http://koji.fedoraproject.org/koji/getfile?taskID=6492001&name=build.log
>>>
>>> Unfortunately I'm not able to reproduce this on my own machines so I'm
>>> at a loss here.
>>
>> We'll look into this and let you know if we find anything in our
>> testing.
>>
>> Ken
>
>I may have something - one special thing about the Fedora builders is that
>they do not have network access.  And in the particular environment that
>is failing, ssh is outputting:
>
>ssh: Could not resolve hostname buildvm-11.phx2.fedoraproject.org: Name or
>service not known
>
>This output seems to wedge mpiexec.  Here are some strace snippets:
>
> strace -f mpirun -np 4 ./xCbtest_MPI-LINUX-0
>execve("/usr/lib64/mpich/bin/mpirun", ["mpirun", "-np", "4",
>"./xCbtest_MPI-LINUX-0"], [/* 46 vars */]) = 0
>....
>
>[pid  7662] execve("/usr/bin/ssh", ["/usr/bin/ssh", "-x",
>"buildvm-11.phx2.fedoraproject.or"...,
>"\"/usr/lib64/mpich/bin/hydra_pmi_"...,
>"--control-port", "buildvm-11.phx2.fedoraproject.or"..., "--rmk", "user",
>"--launcher", "ssh", "--demux", "poll", "--pgid", "0", "--retries", "10",
>...], [/* 46 vars */]) = 0
>
>[pid  7662] write(2, "ssh: Could not resolve hostname "..., 94) = 94
>[pid  7661] <... poll resumed> )        = 1 ([{fd=10, revents=POLLIN}])
>[pid  7662] exit_group(255)             = ?
>[pid  7661] fcntl(10, F_GETFL)          = 0 (flags O_RDONLY)
>[pid  7661] fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
>[pid  7661] fcntl(2, F_GETFL)           = 0x1 (flags O_WRONLY)
>[pid  7661] fcntl(2, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
>[pid  7661] read(10, "ssh: Could not resolve hostname "..., 65536) = 94
>[pid  7661] write(2, "ssh: Could not resolve hostname "..., 94ssh: Could not
>resolve hostname buildvm-11.phx2.fedoraproject.org: Name or service not known
>) = 94
>[pid  7661] gettimeofday({1391730785, 32416}, NULL) = 0
>[pid  7661] poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
>events=POLLIN}, {fd=10, events=POLLIN}], 4, 4294967295 <unfinished ...>
>[pid  7662] +++ exited with 255 +++
><... poll resumed> )                    = 2 ([{fd=8, revents=POLLHUP},
>{fd=10, revents=POLLHUP}])
>--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7662,
>si_status=255, si_utime=0, si_stime=0} ---
>brk(0)                                  = 0x1e0b000
>brk(0x1e3a000)                          = 0x1e3a000
>fcntl(8, F_GETFL)                       = 0 (flags O_RDONLY)
>fcntl(8, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
>fcntl(1, F_GETFL)                       = 0x1 (flags O_WRONLY)
>fcntl(1, F_SETFL, O_WRONLY|O_NONBLOCK)  = 0
>read(8, "", 65536)                      = 0
>close(8)                                = 0
>read(10, "", 65536)                     = 0
>close(10)                               = 0
>gettimeofday({1391730785, 33070}, NULL) = 0
>
>and we are stuck here....
>
>
>Full logs for a bit are here:
>http://kojipkgs.fedoraproject.org//work/tasks/2917/6502917/build.log
>
>
>Is there some way we can disable mpiexec trying to use ssh?
>
>
>-- 
>Orion Poplawski
>Technical Manager                     303-415-9701 x222
>NWRA, Boulder/CoRA Office             FAX: 303-415-9702
>3380 Mitchell Lane                       orion@nwra.com
>Boulder, CO 80301                   http://www.nwra.com
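
For readers following the quoted strace output: the parent (pid 7661) polls
the child's output pipes, copies what it reads to its own stderr, and sees
POLLHUP followed by zero-length reads once ssh (pid 7662) has exited with
status 255.  The Python sketch below reproduces only that generic
poll/read/EOF pattern with a stand-in child process.  It is not Hydra's code
and does not reproduce the hang; the helper name `relay_child_stderr` is
hypothetical, and it assumes a POSIX system where `select.poll` is available.

```python
import os
import select
import subprocess


def relay_child_stderr(argv):
    """Run argv, copy its stderr to ours, return (exit_code, captured_bytes)."""
    child = subprocess.Popen(argv, stderr=subprocess.PIPE)
    fd = child.stderr.fileno()

    poller = select.poll()
    poller.register(fd, select.POLLIN)  # POLLHUP is reported even if not requested

    captured = b""
    open_fds = {fd}
    while open_fds:
        # Block until the pipe has data or the writer has gone away.
        for ready_fd, _events in poller.poll():
            data = os.read(ready_fd, 65536)
            if data:
                captured += data
                os.write(2, data)  # forward to our own stderr, as in the trace
            else:
                # A zero-length read means EOF: stop watching this descriptor.
                poller.unregister(ready_fd)
                open_fds.discard(ready_fd)

    child.stderr.close()
    return child.wait(), captured
```

In this sketch the loop ends as soon as every watched descriptor reaches
EOF, and the child is then reaped with `wait()`.  What mpiexec is still
waiting for after the final `gettimeofday` call in the trace cannot be
determined from the output shown.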

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Improve-localhost-detection.patch
Type: application/octet-stream
Size: 1521 bytes
Desc: 0001-Improve-localhost-detection.patch
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140207/b18b6062/attachment.obj>

