[mpich-devel] Hydra fails to launch hello world on 1 proc

Jeff Hammond jhammond at alcf.anl.gov
Wed Apr 10 23:14:39 CDT 2013


Ugghh, I am just stupid.  I fixed the hostname situation - I am
definitely not on the Argonne network right now - and all of my
problems disappeared.
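(For the archives: if I understand Hydra's -launcher option correctly - the
proxy arguments in the strace output below include "--launcher ssh" - then
forcing the local launcher with something like "mpiexec -launcher fork -n 1
./hello.x" would probably have sidestepped ssh entirely while I sorted out
the hostname.  That's a guess on my part, not advice from the Hydra folks.)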

In the interest of idiot-proofing Hydra, maybe it could time out when the
SSH connection hangs, but clearly you guys have better things to do.
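
Something like the sketch below is roughly what I had in mind: not Hydra's
actual code, just a standalone illustration of putting a poll()-based
timeout on the launch connection instead of blocking forever.  The helper
name and the 10-second limit are invented; the address and port are the
ones from the strace output below.

----8<----
/* Sketch only: connect() with a timeout using a non-blocking socket plus
 * poll().  Everything here is illustrative, not Hydra code. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static int connect_with_timeout(const char *ip, int port, int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* non-blocking so connect() returns immediately with EINPROGRESS */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    /* wait for the connection to complete, but only timeout_ms long */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0) {               /* 0 = timed out, <0 = poll error */
        close(fd);
        return -1;
    }

    /* connection completed; check whether it actually succeeded */
    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    /* 140.221.3.26:22 is the ssh connection seen in the strace output */
    int fd = connect_with_timeout("140.221.3.26", 22, 10000);
    if (fd < 0) {
        fprintf(stderr, "launch connection failed or timed out; "
                        "check the hostname and ssh setup\n");
        return 1;
    }
    close(fd);
    return 0;
}
----8<----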

Thanks for your help.

Jeff

On Wed, Apr 10, 2013 at 9:58 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> Oh, I should also have suggested the preliminary step of running:
>
> "mpiexec -v -n 1 hostname"
>
> Based on the strace output, it looks like hydra is trying to ssh to the local machine and that ssh connection is simply hanging for some reason (try manually ssh-ing to goldstone.mcs.anl.gov).
>
> ----8<----
> execve("/usr/bin/ssh", ["/usr/bin/ssh", "-x", "goldstone.mcs.anl.gov", "\"/home/jeff/eclipse/MPICH/git/in"..., "--control-port", "goldstone.mcs.anl.gov:51163", "--rmk", "user", "--launcher", "ssh", "--demux", "poll", "--pgid", "0", "--retries", "10", ...], [/* 108 vars */]) = 0
> […]
> socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("140.221.3.26")}, 16
> ----8<----
>
> I'm not sure why hydra is trying to use ssh for a single (same) node mpiexec.  Some/all of that may also have been clear from the "mpiexec -v" output.
>
> -Dave
>
> On Apr 10, 2013, at 10:45 PM CDT, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>
>> Not sure if this helps at all.  It makes no sense to me.
>>
>> Jeff
>>
>> On Wed, Apr 10, 2013 at 9:33 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>> Linux or Mac?  If it's Linux, an "strace -f -ff -o strace.out mpiexec -n 1 hostname" might shed some light on the situation.
>>>
>>> -Dave
>>>
>>> On Apr 10, 2013, at 10:30 PM CDT, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>>>
>>>> "mpiexec -n 1 hostname" hangs with Hydra but runs fine with OpenMPI.
>>>>
>> I'm having issues with MPI+Pthreads code with both MPICH and OpenMPI
>> that indicate my system is not behaving the way other systems do, but
>> I'll need to do a lot more work to figure out what the important
>> differences are.
>>>>
>>>> Jeff
>>>>
>>>> On Wed, Apr 10, 2013 at 9:24 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
>>>>> Does it run non-MPI jobs OK?  ("mpiexec -n 1 hostname", for example)
>>>>>
>>>>> Is this Linux or a Mac?
>>>>>
>>>>> If you temporarily disable the firewall, does that make a difference?
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Apr 10, 2013, at 6:34 PM CDT, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm using the latest Git trunk build of MPICH with GCC and am unable
>>>>>> to run a 'hello, world' program using mpiexec.
>>>>>>
>>>>>> Any clues what the problem is?  I have not seen this problem before,
>>>>>> but this is a newly refreshed laptop.  The firewall is active, but I
>>>>>> would not have expected Hydra to need to go through the firewall to
>>>>>> launch a serial job.
>>>>>>
>>>>>> If there's something wrong with my setup, it would be nice if Hydra
>>>>>> would issue a warning/error instead of hanging.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> I compiled MPICH like this:
>>>>>> ../configure CC=gcc CXX=g++ FC=gfortran F77=gfortran --enable-threads
>>>>>> --enable-f77 --enable-fc --enable-g --with-pm=hydra --enable-rpath
>>>>>> --disable-static --enable-shared --with-device=ch3:nemesis
>>>>>> --prefix=/home/jeff/eclipse/MPICH/git/install-gcc
>>>>>>
>>>>>> jeff at goldstone:~/eclipse/OSPRI/mcs.svn/trunk/tests/devices/mpi-pt> mpicc -show
>>>>>> gcc -I/home/jeff/eclipse/MPICH/git/install-gcc/include
>>>>>> -L/home/jeff/eclipse/MPICH/git/install-gcc/lib64 -Wl,-rpath
>>>>>> -Wl,/home/jeff/eclipse/MPICH/git/install-gcc/lib64 -lmpich -lopa -lmpl
>>>>>> -lrt -lpthread
>>>>>>
>>>>>> jeff at goldstone:~/eclipse/OSPRI/mcs.svn/trunk/tests/devices/mpi-pt> make
>>>>>> mpicc -g -O0 -Wall -std=gnu99 -DDEBUG -c hello.c -o hello.o
>>>>>> mpicc -g -O0 -Wall -std=gnu99 safemalloc.o hello.o -lm -o hello.x
>>>>>> rm hello.o
>>>>>>
>>>>>> jeff at goldstone:~/eclipse/OSPRI/mcs.svn/trunk/tests/devices/mpi-pt>
>>>>>> mpiexec -n 1 ./hello.x
>>>>>> ^C[mpiexec at goldstone.mcs.anl.gov] Sending Ctrl-C to processes as requested
>>>>>> [mpiexec at goldstone.mcs.anl.gov] Press Ctrl-C again to force abort
>>>>>> [mpiexec at goldstone.mcs.anl.gov] HYDU_sock_write
>>>>>> (../../../../src/pm/hydra/utils/sock/sock.c:291): write error (Bad
>>>>>> file descriptor)
>>>>>> [mpiexec at goldstone.mcs.anl.gov] HYD_pmcd_pmiserv_send_signal
>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:170): unable to
>>>>>> write data to proxy
>>>>>> [mpiexec at goldstone.mcs.anl.gov] ui_cmd_cb
>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to
>>>>>> send signal downstream
>>>>>> [mpiexec at goldstone.mcs.anl.gov] HYDT_dmxu_poll_wait_for_event
>>>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:77): callback
>>>>>> returned error status
>>>>>> [mpiexec at goldstone.mcs.anl.gov] HYD_pmci_wait_for_completion
>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:197): error
>>>>>> waiting for event
>>>>>> [mpiexec at goldstone.mcs.anl.gov] main
>>>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:331): process manager
>>>>>> error waiting for completion
>>>>>>
>>>>>> jeff at goldstone:~/eclipse/OSPRI/mcs.svn/trunk/tests/devices/mpi-pt> ./hello.x
>>>>>> <no errors>
>>>>>>
>>>>>> jeff at goldstone:~/eclipse/OSPRI/mcs.svn/trunk/tests/devices/mpi-pt> cat hello.c
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>>
>>>>>> #include <mpi.h>
>>>>>>
>>>>>> int main(int argc, char * argv[])
>>>>>> {
>>>>>>  int provided;
>>>>>>
>>>>>>  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>>>>  if (provided!=MPI_THREAD_MULTIPLE)
>>>>>>      MPI_Abort(MPI_COMM_WORLD, 1);
>>>>>>
>>>>>>  int rank, size;
>>>>>>  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>  MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>
>>>>>>  MPI_Finalize();
>>>>>>
>>>>>>  return 0;
>>>>>> }
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> <strace.out.11831><strace.out.11832>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond