[mpich-discuss] success and failure report for mpich-3.0.2

Siegmar Gross Siegmar.Gross at informatik.hs-fulda.de
Wed May 1 09:25:01 CDT 2013


Hi

I have subscribed again, so my email should now be delivered to
discuss at mpich.org.

> First, don't add the full path.  That'll not help when the executable is
> at two different paths on the two machines.
> 
> Can you please run this from both sunpc1 and linpc1:
> 
> % mpiexec -np 2 -hosts sunpc1,linpc1 which hostname

linpc1 fd1026 108 mpiexec -np 2 -hosts sunpc1,linpc1 which hostname
/bin/hostname
/usr/local/bin/hostname
 
linpc1 fd1026 109 ssh sunpc1
sunpc1 fd1026 105 mpiexec -np 2 -hosts sunpc1,linpc1 which hostname
/usr/local/bin/hostname
which: no hostname in 
(...:/usr/local/bin:...:/usr/bin:...:/usr/local/mpich-3.0.2_64_cc/bin:/home/fd1026/SunOS/x86_64/bin:.)

============================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=============================================================================
sunpc1 fd1026 106 

You don't use PATH from Linux for linpc1, but PATH from Solaris,
because I get
"which: no hostname in (...:/home/fd1026/SunOS/x86_64/bin:...)".
"mpiexec" works if I use Linux as the local machine, because all (!)
machines have "/usr/local/bin" in PATH (/usr/local is a link to the
operating-system-specific directory for open source binaries, libs,
etc.), and "date" and "hostname" are available in /usr/local/bin on
Solaris but not on Linux.
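
The failing lookup can be reproduced without MPI. The following is
only a sketch: "find_in_path" is a made-up helper name, and it
assumes that the remote process searches the PATH it was handed, the
way execvp searches the PATH of the calling process.

```shell
#!/bin/sh
# find_in_path CMD PATHSTRING   (hypothetical helper, not part of MPICH)
# Print the first executable named CMD found in the colon-separated
# PATHSTRING and return 0; return 1 if no directory contains it.
find_in_path() {
    cmd=$1
    old_ifs=$IFS
    IFS=:
    # Unquoted on purpose, so that the shell splits $2 at the colons.
    for dir in $2; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$cmd" ] && [ -x "$dir/$cmd" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$cmd"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

Called on linpc1 as find_in_path hostname "<PATH copied from sunpc1>",
it should fail in the same way as "which" above, because that PATH
has no "/bin" and Linux has "hostname" only in "/bin".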


> The reason I'm still curious whether both machines are seeing the same
> path is because one of the machines is accessed locally (through fork)
> while the other is accessed over ssh.  So the environment you are seeing
> by logging in might not be the same as the environment you'd see by a
> non-interactive ssh launch.

No, they don't, because every machine sees the environment of its
own operating system, and Linux and Solaris sometimes use different
directories, e.g., /bin and /usr/bin. Today I used a non-interactive
ssh launch. The following lines are extracted from further down in
this email (I have removed many lines so that the difference is more
obvious). "environ_mpi.c" is a small MPI program that prints the
contents of some environment variables, so that I can see what the
program sees :-).

> > sunpc1 hello_1 110 mpiexec -np 2 -host sunpc1 environ_mpi
> >     PATH
> >                        /usr/local/bin
> >                        /usr/bin
> >                        /home/fd1026/SunOS/x86_64/bin

> > linpc1 fd1026 102  mpiexec -np 2 -host linpc1 environ_mpi
> >     PATH
> >                        /usr/local/bin
> >                        /bin
> >                        /usr/bin
> >                        /home/fd1026/Linux/x86_64/bin
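
A similar one-directory-per-line listing can be produced without an
MPI program. This is a minimal sketch; the function name
"print_path_entries" is made up for illustration.

```shell
#!/bin/sh
# print_path_entries PATHSTRING   (hypothetical helper)
# Print each directory of a colon-separated PATHSTRING on its own line.
print_path_entries() {
    printf '%s\n' "$1" | tr ':' '\n'
}
```

For example, print_path_entries "`ssh linpc1 'echo $PATH'`" would list
what a non-interactive ssh login on linpc1 sees, assuming that ssh
access works without a password prompt.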


linpc1 fd1026 102 where date
/bin/date

linpc1 fd1026 103 where hostname
/bin/hostname


sunpc1 fd1026 108 where date
/usr/local/bin/date
/usr/bin/date

sunpc1 fd1026 109 where hostname
/usr/local/bin/hostname
/usr/bin/hostname


It wouldn't help to add "/bin" to PATH on my Solaris machines,
because I still need "/home/fd1026/`uname -s`/.../bin" to store
my MPI programs for different operating systems and architectures.
If PATH must have the same value on all machines, I could use
symbolic links to hide the operating-system- and
architecture-specific directories. Nevertheless, it would be better
if MPICH used PATH of the target machine on each machine and not
PATH of the local machine for all machines. Now I know at least why
the program didn't work. Is it possible for "mpiexec" to use the
correct environment of the target machine? Thank you very much in
advance for any help.
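
One possible workaround, sketched here under several assumptions, is
a small wrapper script that has the same pathname on every machine
(for example in the NFS home directory) and picks the directory for
the machine it runs on. The names "run_by_os", "os_bin_dir", and
"MPI_ARCH" are made up for illustration, and the layout
$HOME/<uname -s>/<arch>/bin is only inferred from the two x86_64
directories mentioned above.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as $HOME/bin/run_by_os in the NFS
# home directory so that it has the same pathname on every machine.

# os_bin_dir HOME_DIR OS ARCH
# Print the directory that holds the binaries for one operating
# system and architecture (assumed layout: HOME_DIR/OS/ARCH/bin).
os_bin_dir() {
    printf '%s/%s/%s/bin\n' "$1" "$2" "$3"
}

# run_by_os PROGRAM [ARGS...]
# Put the bin directory of the machine this runs on first in PATH,
# then replace the shell with PROGRAM.
run_by_os() {
    # Backticks instead of $(...), so that older Bourne shells can
    # run this as well.
    os=`uname -s`
    # MPI_ARCH is an assumed variable; x86_64 is only a default.
    dir=`os_bin_dir "$HOME" "$os" "${MPI_ARCH:-x86_64}"`
    PATH=$dir:$PATH
    export PATH
    exec "$@"
}

# Run only when a program name was given, so that the file can also
# be sourced to get the two functions.
if [ "$#" -gt 0 ]; then
    run_by_os "$@"
fi
```

With the script stored as /home/fd1026/bin/run_by_os (an assumed
location), a call such as
"mpiexec -np 2 -hosts sunpc1,linpc1 /home/fd1026/bin/run_by_os environ_mpi"
would let each machine resolve "environ_mpi" in its own directory,
whatever PATH the launcher passes along.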


Kind regards

Siegmar



> On 05/01/2013 04:29 AM US Central Time, Siegmar Gross wrote:
> > Hi
> > 
> >> On 04/30/2013 05:55 AM US Central Time, Siegmar Gross wrote:
> >>> It seems, that I don't need a path, if the command has the same path
> >>> on both machines. It breaks, if the program has different pathnames.
> >>
> >> From the launching logic, I don't know how that'll be true.  I just
> >> tested this as well and it works fine for me.
> >>
> >>> sunpc1 fd1026 108 mpiexec -np 2 -host sunpc1,linpc1 hostname
> >>> sunpc1
> >>> [proxy:0:1 at linpc1] HYDU_create_process 
> >>> (../../../../mpich-3.0.2/src/pm/hydra/utils/launch/launch.c:74):
> >>>   execvp error on file hostname (No such file or directory)
> >>
> >> My guess is that "hostname" is on your path on one of the machines
> >> and not on the other.
> > 
> > No, both machines know "hostname". I try to show you, which PATH is
> > available on both machines.
> > 
> > 
> > sunpc1 hello_1 110 mpiexec -np 2 -host sunpc1 environ_mpi
> > 
> > Now 1 slave tasks are sending their environment.
> > 
> > Environment from task 1:
> >   message type:        3
> >   msg length:          3394 characters
> >   message:             
> >     hostname:          sunpc1
> >     operating system:  SunOS
> >     release:           5.10
> >     processor:         i86pc
> >     PATH
> >                        /usr/local/eclipse-3.6.1
> >                        /usr/local/NetBeans-4.0/bin
> >                        /usr/local/jdk1.7.0_07/bin/amd64
> >                        /usr/local/apache-ant-1.6.2/bin
> >                        /usr/local/gcc-4.8.0/bin
> >                        /opt/solstudio12.3/bin
> >                        /usr/local/bin
> >                        /usr/local/ssl/bin
> >                        /usr/local/pgsql/bin
> >                        /usr/bin
> >                        /usr/openwin/bin
> >                        /usr/dt/bin
> >                        /usr/ccs/bin
> >                        /usr/sfw/bin
> >                        /opt/sfw/bin
> >                        /usr/ucb
> >                        /usr/lib/lp/postscript
> >                        /usr/local/teTeX-1.0.7/bin/i386-pc-solaris2.10
> >                        /usr/local/bluej-2.1.2
> >                        /usr/local/mpich-3.0.2_64_cc/bin
> >                        /home/fd1026/SunOS/x86_64/bin
> >                        .
> >                        /usr/sbin
> >     LD_LIBRARY_PATH_64
> > ...
> > 
> > 
> > sunpc1 hello_1 111 mpiexec -np 2 -host linpc1 environ_mpi
> > [proxy:0:0 at linpc1] HYDU_create_process 
> > (../../../../mpich-3.0.2/src/pm/hydra/utils/launch/launch.c:74):
> >   execvp error on file environ_mpi (No such file or directory)
> > [proxy:0:0 at linpc1] HYDU_create_process 
> > (../../../../mpich-3.0.2/src/pm/hydra/utils/launch/launch.c:74):
> >   execvp error on file environ_mpi (No such file or directory)
> > 
> > ======================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   EXIT CODE: 255
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > =======================================================================
> > sunpc1 hello_1 112 
> > 
> > 
> > 
> > Now I switch the local host from Solaris to Linux and try everything
> > once more.
> > 
> > sunpc1 hello_1 112 ssh linpc1
> > linpc1 fd1026 102  mpiexec -np 2 -host linpc1 environ_mpi
> > 
> > Now 1 slave tasks are sending their environment.
> > 
> > Environment from task 1:
> >   message type:        3
> >   msg length:          3452 characters
> >   message:             
> >     hostname:          linpc1
> >     operating system:  Linux
> >     release:           3.1.10-1.16-desktop
> >     processor:         x86_64
> >     PATH
> >                        /usr/local/eclipse-3.6.1
> >                        /usr/local/NetBeans-4.0/bin
> >                        /usr/local/jdk1.7.0_07-64/bin
> >                        /usr/local/apache-ant-1.6.2/bin
> >                        /usr/local/icc-9.1/idb/bin
> >                        /usr/local/icc-9.1/cc/bin
> >                        /usr/local/icc-9.1/fc/bin
> >                        /usr/local/gcc-4.8.0/bin
> >                        /opt/solstudio12.3/bin
> >                        /usr/local/bin
> >                        /usr/local/ssl/bin
> >                        /usr/local/pgsql/bin
> >                        /bin
> >                        /usr/bin
> >                        /usr/X11R6/bin
> >                        /usr/local/teTeX-1.0.7/bin/i586-pc-linux-gnu
> >                        /usr/local/bluej-2.1.2
> >                        /usr/local/mpich-3.0.2_64_cc/bin
> >                        /home/fd1026/Linux/x86_64/bin
> >                        .
> >                        /usr/sbin
> >     LD_LIBRARY_PATH_64
> > ...
> > 
> > 
> > 
> > linpc1 fd1026 103 mpiexec -np 2 -host sunpc1 environ_mpi
> > 
> > ====================================================================
> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > =   EXIT CODE: 9
> > =   CLEANING UP REMAINING PROCESSES
> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > =====================================================================
> > YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
> > This typically refers to a problem with your application.
> > Please see the FAQ page for debugging suggestions
> > 
> > 
> > OK, now let's try with a full pathname.
> > 
> > linpc1 fd1026 104 mpiexec -np 2 -host sunpc1 /home/fd1026/SunOS/x86_64/bin/environ_mpi
> > 
> > Now 1 slave tasks are sending their environment.
> > 
> > Environment from task 1:
> >   message type:        3
> >   msg length:          3436 characters
> >   message:             
> >     hostname:          sunpc1
> >     operating system:  SunOS
> >     release:           5.10
> >     processor:         i86pc
> >     PATH
> >                        /usr/local/eclipse-3.6.1
> >                        /usr/local/NetBeans-4.0/bin
> >                        /usr/local/jdk1.7.0_07-64/bin
> >                        /usr/local/apache-ant-1.6.2/bin
> >                        /usr/local/icc-9.1/idb/bin
> >                        /usr/local/icc-9.1/cc/bin
> >                        /usr/local/icc-9.1/fc/bin
> >                        /usr/local/gcc-4.8.0/bin
> >                        /opt/solstudio12.3/bin
> >                        /usr/local/bin
> >                        /usr/local/ssl/bin
> >                        /usr/local/pgsql/bin
> >                        /bin
> >                        /usr/bin
> >                        /usr/X11R6/bin
> >                        /usr/local/teTeX-1.0.7/bin/i586-pc-linux-gnu
> >                        /usr/local/bluej-2.1.2
> >                        /usr/local/mpich-3.0.2_64_cc/bin
> >                        /home/fd1026/Linux/x86_64/bin
> >                        .
> >                        /usr/sbin
> >     LD_LIBRARY_PATH_64
> > ...
> > 
> > 
> > Ah, you are still using PATH from Linux and not from SunOS. I
> > was lucky with "date", because Linux contains its "default"
> > pathnames and "/usr/local/bin", while "/bin" is not a "default"
> > pathname for Solaris as you can see above. My MPI programs are
> > stored in "/home/fd1026/Linux/x86_64/bin" for Linux and in
> > "/home/fd1026/SunOS/x86_64/bin" for Solaris x86_64 (I'm using
> > NFS, so that I need different directories for the same program
> > on different operating systems).
> > 
> > 
> > 
> >>> linpc1 fd1026 105 mpiexec -np 2 -host sunpc1,linpc1 hostname
> >>> linpc1
> >>> sunpc1
> >>
> >> Are all of /bin /usr/local/bin and /usr/bin in your path?
> > 
> > No, PATH depends on the operating system and architecture, but
> > PATH contains all directories necessary to find all programs.
> > All environment variables are set via $HOME/.cshrc. Does MPICH
> > need the same PATH on all machines? How do you distinguish a
> > program for different operating systems in a NFS environment?
> > Do you need a link "$HOME/mpich_programs", which points to the
> > operating system specific directory and which is part of PATH?
> > 
> > 
> > Kind regards
> > 
> > Siegmar
> > 
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji



