[mpich-discuss] MPI_Comm_Spawn causing zombies of hydra_pmi_proxy

Silvan Brändli silvan.braendli at tuhh.de
Tue Mar 26 06:30:16 CDT 2013


Just for the record: I believe my zombie problem was caused by ignoring
non-zero return codes via the workaround described in
http://lists.mpich.org/pipermail/discuss/2013-February/000429.html .

A dirty fix that still ignores non-zero return codes but gets rid of the
zombies (caused by spawning processes) is to add something like

     /* requires <sys/types.h> and <sys/wait.h> */
     int waitstatus, child_pid;
     /* reap all exited children without blocking */
     while ((child_pid = waitpid(-1, &waitstatus, WNOHANG)) > 0)
     {
         printf("HYDRA: child with pid %d finished\n", child_pid);
     }

at the beginning of the function HYDU_create_process in
src/pm/hydra/utils/launch/launch.c.
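
For illustration, here is a minimal, self-contained sketch of the same
reaping technique outside of hydra; the forked children are stand-ins for
the proxy processes, so this is a demo of the pattern, not hydra code:

     /* demo: reap exited children with waitpid() and WNOHANG so they
        do not linger as zombies */
     #include <stdio.h>
     #include <unistd.h>
     #include <sys/types.h>
     #include <sys/wait.h>

     int main(void)
     {
         int i, waitstatus;
         pid_t child_pid;

         for (i = 0; i < 3; i++)   /* fork a few short-lived children */
             if (fork() == 0)
                 _exit(0);         /* child exits immediately */

         sleep(1);                 /* let the children terminate */

         /* same non-blocking reap loop as in the fix above */
         while ((child_pid = waitpid(-1, &waitstatus, WNOHANG)) > 0)
             printf("reaped child with pid %d\n", (int)child_pid);
         return 0;
     }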

Best regards
Silvan

On 21.03.2013 12:31, Silvan Brändli wrote:
> PS: Some additional information on compilers: so far I used icpc;
> after switching to gcc-4.7.2 I still get the zombies. The only change
> in the linked libraries is an additional
> libdl.so.2 => /lib64/libdl.so.2
>
> Any hint / idea?
> Best regards
> Silvan
>
> On 19.03.2013 10:47, Silvan Brändli wrote:
>> Dear Pavan,
>>
>> in the attached example I use mpich3, but I still get the zombies. Is
>> there something wrong with
>> - my MPI function calls? (disconnect, finalize)
>> - my linked libraries? (see below)
>>
>> Thanks in advance!
>> Best regards
>> Silvan
>>
>> ldd_hi
>>      linux-vdso.so.1 (0x00007fffa01ff000)
>>      libmpichcxx.so.10 => /opt/mpich3/lib/libmpichcxx.so.10 (0x00007fdfec0e4000)
>>      libmpich.so.10 => /opt/mpich3/lib/libmpich.so.10 (0x00007fdfebc6f000)
>>      libopa.so.1 => /opt/mpich3/lib/libopa.so.1 (0x00007fdfeba6d000)
>>      libmpl.so.1 => /opt/mpich3/lib/libmpl.so.1 (0x00007fdfeb868000)
>>      libaio.so.1 => /lib64/libaio.so.1 (0x00007fdfeb666000)
>>      libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdfeb449000)
>>      libm.so.6 => /lib64/libm.so.6 (0x00007fdfeb14e000)
>>      libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libstdc++.so.6 (0x00007fdfeae47000)
>>      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fdfeac30000)
>>      libc.so.6 => /lib64/libc.so.6 (0x00007fdfea882000)
>>      libdl.so.2 => /lib64/libdl.so.2 (0x00007fdfea67e000)
>>      libgfortran.so.3 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libgfortran.so.3 (0x00007fdfea361000)
>>      libquadmath.so.0 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libquadmath.so.0 (0x00007fdfea12b000)
>>      /lib64/ld-linux-x86-64.so.2 (0x00007fdfec308000)
>>      librt.so.1 => /lib64/librt.so.1 (0x00007fdfe9f23000)
>> ldd_main
>>      linux-vdso.so.1 (0x00007fffeb1ff000)
>>      libmpichcxx.so.10 => /opt/mpich3/lib/libmpichcxx.so.10 (0x00007f6fc797e000)
>>      libmpich.so.10 => /opt/mpich3/lib/libmpich.so.10 (0x00007f6fc7509000)
>>      libopa.so.1 => /opt/mpich3/lib/libopa.so.1 (0x00007f6fc7307000)
>>      libmpl.so.1 => /opt/mpich3/lib/libmpl.so.1 (0x00007f6fc7102000)
>>      libaio.so.1 => /lib64/libaio.so.1 (0x00007f6fc6f00000)
>>      libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6fc6ce3000)
>>      libm.so.6 => /lib64/libm.so.6 (0x00007f6fc69e8000)
>>      libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libstdc++.so.6 (0x00007f6fc66e1000)
>>      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6fc64ca000)
>>      libc.so.6 => /lib64/libc.so.6 (0x00007f6fc611c000)
>>      libdl.so.2 => /lib64/libdl.so.2 (0x00007f6fc5f18000)
>>      libgfortran.so.3 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libgfortran.so.3 (0x00007f6fc5bfb000)
>>      libquadmath.so.0 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libquadmath.so.0 (0x00007f6fc59c5000)
>>      /lib64/ld-linux-x86-64.so.2 (0x00007f6fc7ba2000)
>>      librt.so.1 => /lib64/librt.so.1 (0x00007f6fc57bd000)
>>
>> On 07.03.2013 17:23, Pavan Balaji wrote:
>>>
>>> I just tried this with the mpich master and it seems to work correctly,
>>> and there are no zombie processes (though I reduced the number of
>>> iterations to 10000, instead of 200000).  This was a problem in mpich
>>> once upon a time, but that was a few years ago.  Are you using the
>>> latest version of mpich (3.0.2)?
>>>
>>>   -- Pavan
>>>
>>> On 03/07/2013 09:44 AM US Central Time, Silvan Brändli wrote:
>>>> PS: The attached programs are a simplification of my code. They
>>>> reproduce the zombie problem. Waiting for the 32k zombies takes a
>>>> while... but I expect the same behaviour as with my original code.
>>>>
>>>> Am I missing something when finishing the called program? I just use
>>>> MPI_Comm_disconnect and MPI_Finalize.
>>>>
>>>> Best regards
>>>> Silvan
>>>>
>>>> main.cpp
>>>>
>>>> #include <mpi.h>
>>>> #include <cstdio>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>>    int myrank;
>>>>    int spawnerror;
>>>>    int value = 123;
>>>>    void *buf = &value;
>>>>    MPI_Comm child_comm;
>>>>
>>>>    if (MPI_Init(&argc, &argv) != MPI_SUCCESS)
>>>>    {
>>>>      printf("MPI_Init failed");
>>>>    }
>>>>
>>>>    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>>>>
>>>>    char* hiargv[] = {(char*)"23", NULL};
>>>>    for (int i = 1; i <= 200000; i++)
>>>>    {
>>>>      value = i;
>>>>      printf("Main before spawn %d\n", i);
>>>>      /* with MPI_COMM_SELF the only valid root is 0; myrank works
>>>>         here only because the example is launched on a single rank */
>>>>      MPI_Comm_spawn("./hi", hiargv, 1, MPI_INFO_NULL, myrank,
>>>>                     MPI_COMM_SELF, &child_comm, &spawnerror);
>>>>      MPI_Send(buf, 1, MPI_INT, 0, 1, child_comm); /* MPI_INT, not MPI_INTEGER */
>>>>      MPI_Comm_disconnect(&child_comm);
>>>>    }
>>>>
>>>>    MPI_Finalize();
>>>>    return 0;
>>>> }
>>>>
>>>> hi.cpp:
>>>>
>>>> #include <mpi.h>
>>>> #include <cstdio>
>>>>
>>>> int main(int argc, char** argv) {
>>>>    MPI_Comm parent;
>>>>    MPI_Status status;
>>>>    int err;
>>>>    int value = -1;
>>>>    void* buf = &value;
>>>>
>>>>    if (MPI_Init(&argc, &argv) != MPI_SUCCESS)
>>>>    {
>>>>      printf("MPI_Init failed");
>>>>    }
>>>>    MPI_Comm_get_parent(&parent);
>>>>    if (parent == MPI_COMM_NULL) printf("No parent!");
>>>>
>>>>    MPI_Recv(buf, 1, MPI_INT, 0, MPI_ANY_TAG, parent, &status); /* MPI_INT, not MPI_INTEGER */
>>>>    MPI_Comm_disconnect(&parent);
>>>>    err = MPI_Finalize();
>>>>    printf("hi finalized %d %d\n", err, value);
>>>>    return 0;
>>>> }
>>>>
>>>>
>>>>
>>>> On 07.03.2013 12:38, Silvan Brändli wrote:
>>>>> Dear all,
>>>>>
>>>>> again I have a question related to spawning processes. I understand
>>>>> the situation as follows:
>>>>>
>>>>> My program A spawns program B. Program B spawns programs C1, C2, ...,
>>>>> C10000, ...
>>>>> Program Cx terminates correctly before Cx+1 is started, but returns 1
>>>>> to mpiexec. To handle this I use the workaround described in
>>>>> http://lists.mpich.org/pipermail/discuss/2013-February/000429.html
>>>>>
>>>>> Now it looks like every spawn starts a "hydra_pmi_proxy" whose parent
>>>>> process is mpiexec. When program Cx is finished, this "hydra_pmi_proxy"
>>>>> remains as a zombie until programs A, B and mpiexec are finished. When
>>>>> approx. 32k of those "hydra_pmi_proxy" zombies exist I get problems
>>>>> (too many processes or something similar; the kernel's default pid_max
>>>>> of 32768 would explain that limit, since zombies still occupy process
>>>>> table entries).
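>>>>>
>>>>> (For illustration, a minimal stand-alone sketch, not MPI code, of how
>>>>> an unreaped child turns into a zombie; while the parent sleeps, ps
>>>>> shows the child as <defunct>:)
>>>>>
>>>>>      #include <unistd.h>
>>>>>
>>>>>      int main(void)
>>>>>      {
>>>>>          if (fork() == 0)
>>>>>              _exit(0);  /* child exits immediately */
>>>>>          /* parent never calls wait()/waitpid(), so the child stays
>>>>>             a zombie until the parent itself exits */
>>>>>          sleep(60);
>>>>>          return 0;
>>>>>      }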
>>>>>
>>>>> What can I do to get rid of the finished "hydra_pmi_proxy" processes
>>>>> while my programs A, B and mpiexec are still running?
>>>>>
>>>>> I'm glad about every hint.
>>>>>
>>>>> Best regards
>>>>> Silvan
>>>>>


-- 
Dipl.-Ing. Silvan Brändli
Numerische Strukturanalyse mit Anwendungen in der Schiffstechnik (M-10)

Technische Universität Hamburg-Harburg
Schwarzenbergstraße 95c
21073 Hamburg

Tel.  : +49 (0)40 42878 - 6187
Fax.  : +49 (0)40 42878 - 6090
e-mail: silvan.braendli at tuhh.de
www   : http://www.tuhh.de/skf

5th GACM Colloquium on Computational Mechanics
http://www.tu-harburg.de/gacm2013


