[mpich-discuss] MPI_Comm_spawn zombies have risen from the dead.

Thomas Pak thomas.pak at maths.ox.ac.uk
Thu Dec 6 08:11:54 CST 2018


Hi Hui,

That is exactly right, thanks so much for looking into it! Please let me know if you make any progress on this issue, as the application I am developing critically depends on MPI_Comm_spawn.
Best wishes,
Thomas Pak

On Dec 6 2018, at 1:57 pm, Zhou, Hui <zhouh at anl.gov> wrote:
>
> Hi Thomas,
>
> To summarize the issue: the MPICH proxy does not reap its child processes after they exit?
>
> I’ll look into the issue. From my reading of the current MPI standard, it does not really specify the behavior after MPI_Finalize. The dynamic process part of MPI is weakly supported, primarily due to a lack of application adoption. Regardless, I agree that the proxy that creates a process should be responsible for reaping its dead children.
>
> Hui Zhou
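For readers less familiar with the underlying POSIX mechanism: a zombie is simply a child process whose parent has not yet collected its exit status with waitpid(), so the behavior Hui describes amounts to the proxy collecting its exited children. Below is a minimal sketch of that pattern using a SIGCHLD handler and WNOHANG; it illustrates the general technique only and is not MPICH/hydra code.

"""
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child that has already exited.  WNOHANG keeps the loop from
 * blocking once no further children are ready to be collected. */
static void reap_children(int sig)
{
    (void)sig;
    int status;
    while (waitpid(-1, &status, WNOHANG) > 0)
        ;  /* each successful waitpid() removes one zombie */
}

int main(void)
{
    /* Install the handler before creating any children. */
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigaction(SIGCHLD, &sa, NULL);

    /* Stand-in for the launcher spawning a worker process. */
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);  /* child exits immediately */

    /* Parent keeps running; the exited child is reaped by the handler
     * instead of lingering as a zombie. */
    sleep(1);
    puts("parent done, no zombie left behind");
    return 0;
}
"""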
> > On Dec 6, 2018, at 3:53 AM, Thomas Pak via discuss <discuss at mpich.org> wrote:
> > Dear all,
> > Does anyone have any feedback on this issue? The problem still persists in MPICH-3.3, and it appears to point to a serious flaw in how MPICH handles dynamic process creation. To reiterate, the following short program using MPICH creates zombie processes indefinitely until no more processes can be created.
> > """
> > #include <stdio.h>
> > #include <mpi.h>
> >
> > int main(int argc, char *argv[]) {
> > // Initialize MPI
> > MPI_Init(NULL, NULL);
> >
> > // Get parent
> > MPI_Comm parent;
> > MPI_Comm_get_parent(&parent);
> >
> > // If the process was not spawned
> > if (parent == MPI_COMM_NULL) {
> >
> > puts("I was not spawned!");
> > // Spawn child process in loop
> > char *cmd = argv[0];
> > char **cmd_argv = MPI_ARGV_NULL;
> > int maxprocs = 1;
> > MPI_Info info = MPI_INFO_NULL;
> > int root = 0;
> > MPI_Comm comm = MPI_COMM_SELF;
> > MPI_Comm intercomm;
> > int *array_of_errcodes = MPI_ERRCODES_IGNORE;
> >
> > for (;;) {
> > MPI_Comm_spawn(cmd, cmd_argv, maxprocs, info, root, comm,
> > &intercomm, array_of_errcodes);
> >
> > MPI_Comm_disconnect(&intercomm);
> > }
> >
> > // If process was spawned
> > } else {
> >
> > puts("I was spawned!");
> > MPI_Comm_disconnect(&parent);
> > }
> >
> > // Finalize
> > MPI_Finalize();
> >
> > }
> > """
> >
> > Thanks in advance for your help.
> > Best wishes,
> > Thomas Pak
> >
> > On Oct 16 2018, at 4:57 pm, Thomas Pak <thomas.pak at maths.ox.ac.uk> wrote:
> > >
> > > Dear all,
> > > My MPI application spawns a large number of MPI processes using MPI_Comm_spawn over its total lifetime. Unfortunately, I have found that this results in zombie hydra_pmi_proxy processes accumulating over time. As a result, my MPI application eventually crashes because there are so many zombie processes that no new processes can be created.
> > > This issue does not seem to be new; I have found references to similar issues in at least three different places:
> > > - [mpich-discuss] MPI_Comm_Spawn causing zombies of hydra_pmi_proxy (https://lists.mpich.org/pipermail/discuss/2013-March/000599.html) (March 2013)
> > > - [GitHub issue] hydra_pmi_proxy zombie MPI_comm_spawn #1677 (https://github.com/pmodels/mpich/issues/1677) (October 2016)
> > > - [Google Groups] Hanging in spawn in master-slave code and hydra_pmi_proxy zombies (https://groups.google.com/forum/#!msg/mpi4py/A9mN-2UkFf8/2gnUEtArDwAJ) (December 2016)
> > > In the first reference, a dirty fix was proposed by Silvan Brändli to simply reap all the zombie processes in a loop at the start of the function HYDU_create_process in src/pm/hydra/utils/launch/launch.c. However, I have found that this comes with its own set of problems and makes my MPI application unstable.
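To make that workaround concrete: it boils down to a non-blocking reap loop executed before each new child is created. The sketch below reconstructs the idea only; it is not the actual patch to HYDU_create_process.

"""
#include <sys/wait.h>

/* Illustration of the workaround described above: before launching a new
 * child, collect any children that have already exited so their
 * process-table entries are released. */
static void reap_exited_children(void)
{
    int status;

    /* WNOHANG makes waitpid() return immediately when no exited child is
     * waiting, so this never blocks the launch path. */
    while (waitpid(-1, &status, WNOHANG) > 0) {
        /* each iteration reaps exactly one zombie child */
    }
}
"""

One plausible reason an unconditional reap like this destabilizes things is that it can collect children whose exit status other parts of the launcher still expect to retrieve.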
> > >
> > > In the GitHub issue, the problem was marked as "worksforme". However, the problem persists for me on every machine where I have tried to reproduce it, so it has clearly not been resolved.
> > > In the Google Groups thread, the problem was "resolved" by avoiding spawning MPI processes entirely, which is not an option in my case.
> > > Pavan Balaji mentions in the 2013 mpich-discuss thread that this issue was a problem "once upon a time". It seems to have risen from the dead again.
> > > I have attached a short and self-contained program written in C that reproduces the problem. The program simply spawns child processes using MPI_Comm_spawn in an infinite loop, where each child process exits after writing a message to stdout. I use MPI_Comm_disconnect to disconnect the Intercommunicator, but the same problem occurs when using MPI_Comm_free instead.
> > > I have also attached the logfiles generated while building MPICH, as described in the README, except for the file mpich-3.2.1/src/pm/hydra/tools/topo/hwloc/hwloc/config.log, as that was missing. I am using MPICH version 3.2.1, which is the latest stable release at the time of writing.
> > > Thanks in advance for your help!
> > > Best wishes,
> > > Thomas Pak
> >
> > _______________________________________________
> > discuss mailing list discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss
> >
