[mpich-discuss] MPI_Finalize hangs in dynamic connection in case of failed process

Wesley Bland wbland at anl.gov
Thu Feb 26 10:13:42 CST 2015


First, I believe the sock device is untested with most of the MPICH fault tolerance features, so YMMV here.

Is there a reason that you aren’t calling MPI_Comm_disconnect for the failed process? Did you try it and something bad happened? That seems like the most straightforward way of doing things.

Otherwise, this sounds like a known issue that we’re seeing from time to time with MPI_Finalize and the FT work. It’s something I’m trying to figure out now. If you can reduce your code down to the minimum and send it to me, I can use it as a test case to try to fix the problem.

Thanks,
Wesley

> On Feb 19, 2015, at 5:15 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> 
> Hi All,
>  
> I am using MPICH with sock connection.
> I also set up the processes using the dynamic connection method (MPI_Comm_connect/MPI_Comm_accept). It’s a master-slave architecture where the master accepts the connections from the slaves.
>  
> Now if one of the processes dies (or gets killed), I can still recover from this (without using the checkpoint/restore method).
> For that particular process, I do not call MPI_Comm_disconnect in the master (it hangs and does not complete).
> As a result, my MPI_Finalize in the master hangs and does not complete.
> Do you have a workaround to forcefully complete MPI_Finalize or MPI_Comm_disconnect?
> I tried MPI_Comm_free on the failed connection. However, it does not solve the hang in finalize.
>  
> Thanks,
> Hirak
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss