[mpich-discuss] MPI_Finalize hangs in dynamic connection in case of failed process
Roy, Hirak
Hirak_Roy at mentor.com
Mon Mar 2 23:01:07 CST 2015
Hi Wesley,
As I mentioned in my email, MPI_Comm_disconnect hangs.
Here is a short program that you can run.
Please note that there is an "assert" in client.c to simulate the process failure.
Compile:
>> mpicc server.c -o server
>> mpicc client.c -o client
To run, use two shells/terminals:
Term1>> mpiexec -n 1 ./server
Term2>> mpiexec -n 1 ./client
Then press any key in the server terminal.
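In case the attachments do not come through, here is a minimal sketch of the kind of server/client pair described above. This is a reconstruction for illustration, not the actual attached code; in particular, exchanging the port name through a file named port.txt is an assumption.

/*
 * server.c -- minimal sketch (not the actual attachment).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;
    FILE *f;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);

    /* Publish the port name so the client can find it. */
    f = fopen("port.txt", "w");
    fprintf(f, "%s\n", port);
    fclose(f);

    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

    /* By the time a key is pressed, the client has already died on its
     * assert, so the remote side of this communicator is gone. */
    printf("Client connected; press any key once it has failed...\n");
    getchar();

    MPI_Comm_disconnect(&client);   /* hangs: the remote process is dead */
    MPI_Finalize();                 /* never reached */
    return 0;
}

/*
 * client.c -- minimal sketch; the assert kills the client right after the
 * connection is established, simulating a failed process.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm server;
    FILE *f;

    MPI_Init(&argc, &argv);

    /* Read the port name published by the server. */
    f = fopen("port.txt", "r");
    fgets(port, MPI_MAX_PORT_NAME, f);
    fclose(f);
    port[strcspn(port, "\n")] = '\0';

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);

    assert(0);   /* die abruptly: no MPI_Comm_disconnect, no MPI_Finalize */
    return 0;
}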
I filed a bug related to this earlier: http://trac.mpich.org/projects/mpich/ticket/2205
My questions:
1> What happens if we don't call MPI_Finalize and just call exit(0)?
2> Is there any way I can forcefully complete MPI_Comm_disconnect from the server side?
Thanks,
Hirak
PS: The reason for using sock is a pair of bugs in nemesis: http://trac.mpich.org/projects/mpich/ticket/1103 and http://trac.mpich.org/projects/mpich/ticket/79
Wesley Bland wbland at anl.gov
Thu Feb 26 10:13:42 CST 2015
________________________________
First, I believe the sock device is untested with most of the MPICH fault tolerance features, so YMMV here.
Is there a reason you aren't calling MPI_Comm_disconnect for the failed process? Did you try it and something bad happened? That seems like the most straightforward way of doing things.
Otherwise, this sounds like a known issue that we're seeing from time to time with MPI_Finalize and the FT work. It's something I'm trying to figure out now. If you can reduce your code down to the minimum and send it to me, I can use it as a test case to try to fix the problem.
Thanks,
Wesley
> On Feb 19, 2015, at 5:15 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>
> Hi All,
>
> I am using MPICH with the sock channel.
> I set up processes using the dynamic connection method (MPI_Comm_connect/MPI_Comm_accept). It's a master-slave architecture where the master accepts connections from the slaves.
>
> Now if one of the processes dies (or gets killed), I can still recover from this (without using a checkpoint/restore method).
> For that particular process, the master does not call MPI_Comm_disconnect (it hangs and does not complete).
> As a result, MPI_Finalize in the master hangs and does not complete.
> Do you have a workaround to forcefully complete MPI_Finalize or MPI_Comm_disconnect?
> I tried MPI_Comm_free on the failed connection; however, it does not solve the hang in finalize.
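> For reference, the master-side cleanup looks roughly like this (a sketch with assumed names; slave_comms, alive, and nslaves are illustrative, not my actual code):
>
> #include <mpi.h>
>
> /* Disconnect the live slaves and free the handle of the failed one.
>  * Neither path avoids the hang in MPI_Finalize. */
> static void master_shutdown(MPI_Comm *slave_comms, const int *alive, int nslaves)
> {
>     int i;
>     for (i = 0; i < nslaves; i++) {
>         if (alive[i])
>             MPI_Comm_disconnect(&slave_comms[i]);  /* normal teardown */
>         else
>             MPI_Comm_free(&slave_comms[i]);        /* frees the handle only */
>     }
>     MPI_Finalize();                                /* still hangs */
> }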
>
> Thanks,
> Hirak
Attachments:
server.c: <http://lists.mpich.org/pipermail/discuss/attachments/20150303/6680b580/attachment.c>
client.c: <http://lists.mpich.org/pipermail/discuss/attachments/20150303/6680b580/attachment-0001.c>