[mpich-discuss] MPI_Comm_spawn zombies have risen from the dead.

Thomas Pak thomas.pak at maths.ox.ac.uk
Tue Oct 16 10:57:25 CDT 2018


Dear all,

My MPI application spawns a large number of MPI processes using MPI_Comm_spawn over its lifetime. Unfortunately, this results in zombie hydra_pmi_proxy processes accumulating over time. As a result, my MPI application eventually crashes because there are so many zombie processes that no new processes can be created.
This issue does not seem to be new; I have found references to similar issues in at least three different places:
- [mpich-discuss] MPI_Comm_Spawn causing zombies of hydra_pmi_proxy (https://lists.mpich.org/pipermail/discuss/2013-March/000599.html) (March 2013)
- [GitHub issue] hydra_pmi_proxy zombie MPI_comm_spawn #1677 (https://github.com/pmodels/mpich/issues/1677) (October 2016)
- [Google Groups] Hanging in spawn in master-slave code and hydra_pmi_proxy zombies (https://groups.google.com/forum/#!msg/mpi4py/A9mN-2UkFf8/2gnUEtArDwAJ) (December 2016)
In the first reference, Silvan Brändli proposed a dirty fix: simply reap all zombie processes in a loop at the start of the function HYDU_create_process in src/pm/hydra/utils/launch/launch.c. However, I have found that this approach comes with its own set of problems and makes my MPI application unstable.
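For clarity, the idea behind that workaround is roughly the following (my own sketch of the idea, not the actual patch): before launching a new process, non-blockingly collect any children that have already exited.

#include <sys/wait.h>

/* Rough sketch of the reaping-loop workaround (not the actual patch):
   call this at the start of HYDU_create_process to collect any child
   processes that have already exited. */
static void reap_finished_children(void)
{
    /* WNOHANG makes waitpid return immediately; loop until no exited
       children remain to be collected. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}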

In the GitHub issue, the problem was marked as "worksforme". However, the problem persists for me on every machine I have tried to reproduce it on, so it clearly has not been resolved.
In the Google Groups thread, the problem was "resolved" by avoiding spawning MPI processes entirely, which is not an option in my case.
Pavan Balaji mentions in the 2013 mpich-discuss thread that this issue was a problem "once upon a time". It seems to have risen from the dead again.
I have attached a short, self-contained program written in C that reproduces the problem. The program simply spawns child processes using MPI_Comm_spawn in an infinite loop, where each child process exits after writing a message to stdout. I use MPI_Comm_disconnect to disconnect the intercommunicator, but the same problem occurs when using MPI_Comm_free instead. A sketch of the program is included below for reference.
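The attached mpi-spawn.c is essentially the following (a simplified sketch; the actual attachment may differ in minor details):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        /* Child: write a message to stdout and exit. */
        printf("Hello from spawned child\n");
        MPI_Comm_disconnect(&parent);
    } else {
        /* Parent: spawn one child per iteration, forever. Each iteration
           should clean up after itself, yet a hydra_pmi_proxy zombie is
           left behind every time. */
        for (;;) {
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&intercomm);
        }
    }

    MPI_Finalize();
    return 0;
}
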
I have also attached the logfiles generated while building MPICH, as described in the README, except for mpich-3.2.1/src/pm/hydra/tools/topo/hwloc/hwloc/config.log, which was missing. I am using MPICH version 3.2.1, the latest stable release at the time of writing.
Thanks in advance for your help!
Best wishes,
Thomas Pak
Attachments:
- mpi-spawn.c (984 bytes): http://lists.mpich.org/pipermail/discuss/attachments/20181016/ae9002cd/attachment.obj
- mpich-3.2.1-logfiles.tar.gz (119177 bytes): http://lists.mpich.org/pipermail/discuss/attachments/20181016/ae9002cd/attachment-0001.obj