[mpich-discuss] Using mpiexec.hydra with -disable-auto-cleanup.

Anatoly G anatolyrishon at gmail.com
Sun Mar 30 03:44:37 CDT 2014


Dear MPICH team,

I use MPICH2.

My configuration is as follows.
A Main application executes:
"mpiexec.hydra -genvall  -disable-auto-cleanup  -f MpiConfigMachines.txt
-launcher=rsh -n 20 node"

After a single "node" process fails, I need to restart the whole system
without restarting the Main application process.

When a "node" process fails, I execute some internal logic and then call
MPI_Abort from the Master process (rank 0) to abort all "node" processes. Then
I send SIGTERM to mpiexec.hydra so that the hydra process exits, and
execute again:
"mpiexec.hydra -genvall  -disable-auto-cleanup  -f MpiConfigMachines.txt
-launcher=rsh -n 20 node"

The problem:
Sometimes mpiexec.hydra hangs and does not exit.

I execute:
::kill(mProcId, SIGTERM)
where mProcId is the process ID of "mpiexec.hydra".

Afterwards I see that the mpiexec.hydra process still exists, and sometimes
hydra_pmi_proxy does too.
If I use "kill -9", mpiexec.hydra is always killed, and hydra_pmi_proxy too.
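
To make the sequence concrete, here is a minimal sketch of the "SIGTERM
first, kill -9 as a fallback" pattern described above. It assumes
mpiexec.hydra is a direct child of the Main application, so that waitpid()
can observe its exit; the helper name terminateChild, the timeout parameter,
and the 50 ms polling step are hypothetical. It only illustrates the
workaround I am asking about and is not meant as the recommended procedure
for Hydra.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper: ask a child process to exit with SIGTERM, poll for
// up to timeoutMs milliseconds, and fall back to SIGKILL if it is still
// alive. Returns true if SIGTERM alone was enough, false otherwise
// (SIGKILL was needed, or the signal/wait call failed).
// Assumes pid is a direct child of the caller, so waitpid() can reap it.
bool terminateChild(pid_t pid, int timeoutMs) {
    if (::kill(pid, SIGTERM) != 0) return false;

    const int stepMs = 50;
    for (int waited = 0; waited < timeoutMs; waited += stepMs) {
        int status = 0;
        pid_t r = ::waitpid(pid, &status, WNOHANG);
        if (r == pid) return true;   // child exited and was reaped
        if (r < 0) return false;     // not our child, or already reaped
        ::usleep(stepMs * 1000);     // still running: wait and retry
    }

    ::kill(pid, SIGKILL);            // equivalent of "kill -9"
    ::waitpid(pid, nullptr, 0);      // reap the child to avoid a zombie
    return false;
}
```

With this helper, the call in my code would become something like
terminateChild(mProcId, 5000), where the 5 second timeout is an arbitrary
value chosen for the example.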

My question: is "kill -9" the recommended way to handle a "node" process
failure?
If not, what is the recommended way?

Regards,
Anatoly.