[mpich-discuss] MPI_Finalize hangs in dynamic connection in case of failed process

Roy, Hirak Hirak_Roy at mentor.com
Mon Mar 2 23:01:07 CST 2015


Hi Wesley,

As I mentioned in my email, MPI_Comm_disconnect hangs.
Here is a short program that you can run (sketched below; the full sources are attached at the end of this message).
Please note that there is an "assert" in client.c.
Compile:
>> mpicc server.c -o server
>> mpicc client.c -o client

To run, use two shells/terminals:
Term1>> mpiexec -n 1 ./server
Term2>> mpiexec -n 1 ./client

Please press any key on the server terminal.
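
In case the attachments do not come through, here is a minimal sketch of what server.c and client.c do. The port exchange through a port.txt file, the getchar() prompt, and the inline comments are simplifications I am making for this sketch; the attached files are the actual reproducer.

/* server.c (sketch): accept one client, wait for a key press, then try
 * to disconnect.  With the client already dead, MPI_Comm_disconnect is
 * where the hang shows up. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;
    FILE *f;

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);
    f = fopen("port.txt", "w");     /* file-based port exchange is an
                                       assumption of this sketch */
    fprintf(f, "%s", port);
    fclose(f);

    /* Blocks until the client calls MPI_Comm_connect. */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    printf("client connected; press Enter once the client has died\n");
    getchar();

    MPI_Comm_disconnect(&client);   /* hangs: the peer died without
                                       ever disconnecting */
    MPI_Close_port(port);
    MPI_Finalize();                 /* never reached */
    return 0;
}

/* client.c (sketch): connect, then die on a failed assert so the
 * intercommunicator is never disconnected on the client side. */
#include <mpi.h>
#include <assert.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm server;
    FILE *f;

    MPI_Init(&argc, &argv);

    f = fopen("port.txt", "r");     /* written by the server sketch above */
    fgets(port, MPI_MAX_PORT_NAME, f);
    fclose(f);

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);

    assert(0);                      /* deliberate failure: the process
                                       aborts here */

    MPI_Comm_disconnect(&server);   /* never reached */
    MPI_Finalize();
    return 0;
}

The key point is that the client dies on the assert without ever calling MPI_Comm_disconnect, which leaves the server's side of the intercommunicator connected.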



I earlier filed a bug related to this: http://trac.mpich.org/projects/mpich/ticket/2205
My questions:

1>    What happens if we don't call MPI_Finalize and instead call exit(0)? (see the sketch below)

2>    Is there any way I can forcefully complete MPI_Comm_disconnect from the server side?
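
To make question 1 concrete, the shutdown path in question looks roughly like this (a sketch only; client_has_failed is a hypothetical flag the server maintains, not part of the attached code):

/* Sketch of the workaround behind question 1 (not a recommendation):
 * if the client is known to be dead, skip MPI_Comm_disconnect and
 * MPI_Finalize entirely and just exit.  Whether MPICH tolerates this
 * is exactly what the question asks. */
#include <mpi.h>
#include <stdlib.h>

void shutdown_server(MPI_Comm *client, int client_has_failed)
{
    if (client_has_failed) {
        exit(0);                  /* bail out; both calls below would hang */
    }
    MPI_Comm_disconnect(client);
    MPI_Finalize();
    exit(0);
}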


Thanks,
Hirak


PS: The reason for using sock is a bug in nemesis: http://trac.mpich.org/projects/mpich/ticket/1103 and http://trac.mpich.org/projects/mpich/ticket/79


Wesley Bland wbland at anl.gov
Thu Feb 26 10:13:42 CST 2015


________________________________

First, I believe the sock device is untested with most of the MPICH fault tolerance features, so YMMV here.



Is there a reason that you aren't calling MPI_Comm_disconnect for the failed process? Did you try it and something bad happened? That seems like the most straightforward way of doing things.



Otherwise, this sounds like a known issue that we're seeing from time to time with MPI_Finalize and the FT work. It's something I'm trying to figure out now. If you can reduce your code down to the minimum and send it to me, I can use it as a test case to try to fix the problem.



Thanks,

Wesley



> On Feb 19, 2015, at 5:15 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:

>

> Hi All,

>

> I am using MPICH with the sock device.

> I also set up processes using the dynamic connection method (MPI_Comm_connect/MPI_Comm_accept). It's a master-slave architecture where the master accepts connections from the slaves.

>

> Now if one of the processes dies (or gets killed), I can still recover from this (without using a checkpoint/restart method).

> For that particular process, I do not call MPI_Comm_disconnect in the master (it hangs and does not complete).

> As a result, MPI_Finalize in the master hangs and does not complete.

> Do you have a workaround to forcefully complete MPI_Finalize or MPI_Comm_disconnect?

> I tried MPI_Comm_free on the failed connection. However, it does not resolve the hang in MPI_Finalize.

>

> Thanks,

> Hirak

Attachments:
server.c: <http://lists.mpich.org/pipermail/discuss/attachments/20150303/6680b580/attachment.c>
client.c: <http://lists.mpich.org/pipermail/discuss/attachments/20150303/6680b580/attachment-0001.c>

