[mpich-discuss] MPI_Comm_spawn zombies have risen from the dead.
Thomas Pak
thomas.pak at maths.ox.ac.uk
Thu Dec 6 05:53:22 CST 2018
Dear all,
Does anyone have any feedback on this issue? The problem still persists in MPICH 3.3 and appears to point to a serious flaw in how MPICH handles dynamic process creation. To reiterate, the following short MPICH program creates zombie processes indefinitely until no more processes can be created.
"""
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
    // Initialize MPI
    MPI_Init(NULL, NULL);

    // Get parent
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    // If the process was not spawned
    if (parent == MPI_COMM_NULL) {
        puts("I was not spawned!");

        // Spawn child process in loop
        char *cmd = argv[0];
        char **cmd_argv = MPI_ARGV_NULL;
        int maxprocs = 1;
        MPI_Info info = MPI_INFO_NULL;
        int root = 0;
        MPI_Comm comm = MPI_COMM_SELF;
        MPI_Comm intercomm;
        int *array_of_errcodes = MPI_ERRCODES_IGNORE;

        for (;;) {
            MPI_Comm_spawn(cmd, cmd_argv, maxprocs, info, root, comm,
                           &intercomm, array_of_errcodes);
            MPI_Comm_disconnect(&intercomm);
        }

    // If process was spawned
    } else {
        puts("I was spawned!");
        MPI_Comm_disconnect(&parent);
    }

    // Finalize
    MPI_Finalize();
}
"""
Thanks in advance for your help.
Best wishes,
Thomas Pak
On Oct 16 2018, at 4:57 pm, Thomas Pak <thomas.pak at maths.ox.ac.uk> wrote:
>
> Dear all,
> My MPI application spawns a large number of MPI processes using MPI_Comm_spawn over its total lifetime. Unfortunately, this results in zombie hydra_pmi_proxy processes accumulating over time. As a result, my MPI application eventually crashes because there are so many zombie processes that no new processes can be created.
> This issue does not seem to be new; I have found references to similar issues in at least three different places:
> - [mpich-discuss] MPI_Comm_Spawn causing zombies of hydra_pmi_proxy (https://lists.mpich.org/pipermail/discuss/2013-March/000599.html) (March 2013)
> - [GitHub issue] hydra_pmi_proxy zombie MPI_comm_spawn #1677 (https://github.com/pmodels/mpich/issues/1677) (October 2016)
> - [Google Groups] Hanging in spawn in master-slave code and hydra_pmi_proxy zombies (https://groups.google.com/forum/#!msg/mpi4py/A9mN-2UkFf8/2gnUEtArDwAJ) (December 2016)
> In the first reference, a dirty fix was proposed by Silvan Brändli to simply reap all the zombie processes in a loop at the start of the function HYDU_create_process in src/pm/hydra/utils/launch/launch.c. However, I have found that this comes with its own set of problems and makes my MPI application unstable.
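> For reference, that workaround amounts to something like the sketch below. The placement at the start of HYDU_create_process in src/pm/hydra/utils/launch/launch.c is as described above, but the code itself (including the helper name reap_zombie_children) is my own reconstruction, not the original patch:
> """
> #include <sys/wait.h>
>
> /* Reap any already-terminated (zombie) children without blocking.
>  * WNOHANG makes waitpid return immediately; a positive return value
>  * means one zombie was collected, so loop until none remain. */
> static void reap_zombie_children(void)
> {
>     int status;
>
>     while (waitpid(-1, &status, WNOHANG) > 0)
>         continue;
> }
> """
> Note that waitpid(-1, ...) collects any terminated child, so this could also reap children whose exit status Hydra still needs, which may be why the workaround causes problems in practice.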
>
> In the GitHub issue, the problem was marked as "worksforme". However, this problem persists for me and other machines that I have tried to reproduce it on, so it clearly has not yet been resolved.
> In the Google Groups thread, the problem was "resolved" by avoiding spawning MPI processes entirely, which is not an option in my case.
> Pavan Balaji mentions in the 2013 mpich-discuss thread that this issue was a problem "once upon a time". It seems to have risen from the dead again.
> I have attached a short and self-contained program written in C that reproduces the problem. The program simply spawns child processes using MPI_Comm_spawn in an infinite loop, where each child process exits after writing a message to stdout. I use MPI_Comm_disconnect to disconnect the intercommunicator, but the same problem occurs when using MPI_Comm_free instead.
> I have also attached the logfiles generated while building MPICH, as described in the README, except for the file mpich-3.2.1/src/pm/hydra/tools/topo/hwloc/hwloc/config.log, as that was missing. I am using MPICH version 3.2.1, which is the latest stable release at the time of writing.
> Thanks in advance for your help!
> Best wishes,
> Thomas Pak