[mpich-discuss] MPICH 2 script to MPICH 3

Balaji, Pavan balaji at anl.gov
Mon Jan 19 17:42:13 CST 2015


FYI, mpd also uses ssh, so that shouldn't be an issue.  I wouldn't recommend the manual launcher; it's only meant for tool developers who build things on top of Hydra.

AFAICT, you can remove most of the mpd management code in your script and simply pass the host list to mpiexec in Hydra:

http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager

If Condor stores the host list allocated for your job somewhere, try passing it to Hydra with the "-f" option to mpiexec.
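
A minimal sketch of what the converted script could collapse to, assuming ssh works between the allocated nodes and that you can write the Condor host list to a file (the hostfile name and the way it is produced are placeholders, not something Hydra or Condor defines under these names):

#!/bin/sh

# Same MPICH bin directory as in the current script
MPDIR=/opt/local/bin
PATH=$MPDIR:.:$PATH
export PATH

# Placeholder: one allocated hostname per line, produced however your
# Condor setup exposes the machine list for the job
HOSTFILE=`pwd`/hosts

if [ "$_CONDOR_PROCNO" -eq 0 ]
then
	# Hydra starts hydra_pmi_proxy on every host in $HOSTFILE itself,
	# so the mpd bootstrapping and the condor_chirp handshake go away
	mpiexec -f "$HOSTFILE" -n "$_CONDOR_NPROCS" "$@"
	exit $?
fi

# Non-head Condor slots have nothing to launch here; whether they should
# exit immediately or stay alive until the job finishes is setup-specific
exit 0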

  -- Pavan

> On Jan 19, 2015, at 10:04 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
> 
> Hi Livan,
> 
> It should be possible to convert your script to use Hydra. Hydra has two main components: the main executable, mpiexec, and the proxy, hydra_pmi_proxy.
> 
> In a normal ssh-launched scenario, mpiexec launches proxies on all the nodes being used for the job, and those proxies launch the MPI processes.
> 
> If ssh is not possible in your setup, you should be able to use the Hydra manual launcher (mpiexec -launcher manual). When you run mpiexec with that option, it prints the proxy launch commands that need to be run on the remote nodes so they can connect back to it, then waits for all the proxies to connect. Here's an example:
> 
> raffenet at doom:mpich/ $ mpiexec -launcher manual -n 3 -hosts a,b,c /bin/hostname
> HYDRA_LAUNCH: /sandbox/mpich/i/bin/hydra_pmi_proxy --control-port doom:36311 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
> HYDRA_LAUNCH: /sandbox/mpich/i/bin/hydra_pmi_proxy --control-port doom:36311 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
> HYDRA_LAUNCH: /sandbox/mpich/i/bin/hydra_pmi_proxy --control-port doom:36311 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
> HYDRA_LAUNCH_END
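> 
> To be concrete (this just restates the output above, nothing beyond what Hydra already printed): each HYDRA_LAUNCH line is a complete command that you run verbatim on the node it is intended for. For example, on the third host (c above):
> 
> /sandbox/mpich/i/bin/hydra_pmi_proxy --control-port doom:36311 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
> 
> Once all the proxies have connected back to the control port, mpiexec continues and the proxies start the MPI processes.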
> 
> Let me know if you have other questions.
> 
> Ken
> 
> On 01/19/2015 05:39 AM, Livan Valladares wrote:
>> Hello,
>> I have been using HTCondor with MPICH 2, but I upgraded from MPICH 2 to MPICH 3, and now the MPD process manager has been deprecated and Hydra is the default process manager.
>> The script I was using relies on commands like mpdtrace and mpdallexit, which are no longer supported.
>> Here is my script. Is there any way to change it so that it works with Hydra?
>> Thank you very much,
>> Livan Valladares Martell
>> Script:
>> 
>> #!/bin/sh
>> 
>> ##**************************************************************
>> ##
>> ## Copyright (C) 1990-2014, Condor Team, Computer Sciences Department,
>> ## University of Wisconsin-Madison, WI.
>> ##
>> ## Licensed under the Apache License, Version 2.0 (the "License"); you
>> ## may not use this file except in compliance with the License.  You may
>> ## obtain a copy of the License at
>> ##
>> ##    http://www.apache.org/licenses/LICENSE-2.0
>> ##
>> ## Unless required by applicable law or agreed to in writing, software
>> ## distributed under the License is distributed on an "AS IS" BASIS,
>> ## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> ## See the License for the specific language governing permissions and
>> ## limitations under the License.
>> ##
>> ##**************************************************************
>> 
>> 
>> # Set this to the bin directory of MPICH installation
>> MPDIR=/opt/local/bin
>> PATH=$MPDIR:.:$PATH
>> export PATH
>> 
>> _CONDOR_PROCNO=$_CONDOR_PROCNO
>> _CONDOR_NPROCS=$_CONDOR_NPROCS
>> 
>> # Remove the contact file, so if we are held and released
>> # it can be recreated anew
>> 
>> rm -f $CONDOR_CONTACT_FILE
>> 
>> PATH=`condor_config_val libexec`/:$PATH
>> 
>> # mpd needs a conf file, and it must be
>> # permissions 0700
>> mkdir tmp
>> MPD_CONF_FILE=`pwd`/tmp/mpd_conf_file
>> export MPD_CONF_FILE
>> 
>> ulimit -c 0
>> 
>> # If you have a shared file system, maybe you
>> # want to put the mpd.conf file in your home
>> # directory
>> 
>> echo "password=somepassword" > $MPD_CONF_FILE
>> chmod 0700 $MPD_CONF_FILE
>> 
>> # If on the head node, start mpd, get the port and host,
>> # and condor_chirp it back into the ClassAd
>> # so the non-head nodes can find the head node.
>> 
>> if [ $_CONDOR_PROCNO -eq 0 ]
>> then
>> 	mpd > mpd.out.$_CONDOR_PROCNO 2>&1 &
>> 	sleep 1
>> 	host=`mpdtrace -l | sed 1q | tr '_' ' ' | awk '{print $1}'`
>> 	port=`mpdtrace -l | sed 1q | tr '_' ' ' | awk '{print $2}'`
>> 
>> 	condor_chirp set_job_attr MPICH_PORT $port
>> 	condor_chirp set_job_attr MPICH_HOST \"$host\"
>> 	
>> 	num_hosts=1
>> 	retries=0
>> 	while [ $num_hosts -ne $_CONDOR_NPROCS ]
>> 	do
>> 		num_hosts=`mpdtrace | wc -l`
>> 		sleep 2
>> 		retries=`expr $retries + 1`
>> 		if [ $retries -gt 100 ]
>> 		then
>> 			echo "Too many retries, could not start all $_CONDOR_NPROCS nodes, only started $num_hosts, giving up.  Here are the hosts I could start "
>> 			mpdtrace
>> 			exit 1
>> 		fi
>> 	done
>> 
>> 	## run the actual mpi job, which was the command line argument
>>  	## to the invocation of this shell script
>>  	mpiexec -n $_CONDOR_NPROCS "$@"
>> 	e=$?
>> 
>> 	mpdallexit
>> 	sleep 20
>> 	echo $e
>> else
>> 	# If NOT the head node, acquire the host and port of
>>  	# the head node
>>  	retries=0
>> 	host=UNDEFINED
>> 	while [ "$host" = "UNDEFINED" ]
>> 	do
>> 		host=`condor_chirp get_job_attr MPICH_HOST`
>> 		sleep 2
>> 		retries=`expr $retries + 1`
>> 		if [ $retries -gt 100 ]
>> 		then
>> 			echo "Too many retries, could not get mpd host from condor_chirp, giving up."
>> 			exit 1
>> 		fi
>> 	done
>> 
>> 	port=`condor_chirp get_job_attr MPICH_PORT`
>> 	host=`echo $host | tr -d '"'`
>> 	mpd --host=$host --port=$port > mpd.out.$_CONDOR_PROCNO 2>&1
>> fi
>> 
>> 
>> 
>> 

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji


