[mpich-discuss] MPI_Finalize hangs in dynamic connection in case of failed process

Wesley Bland wbland at anl.gov
Wed Mar 4 18:13:54 CST 2015


On Mon, Mar 2, 2015 at 9:01 PM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:

>  Hi Wesley,
>
>
>
>  As I mentioned in my email, MPI_Comm_disconnect hangs.
>
> Here is a short program which you can run.
>
> Please note that there is an “assert” in client.c
>
> Compile :
>
> >> mpicc server.c -o server
>
> >> mpicc client.c -o client
>
>
>
> To run use two shell/terminal :
>
> Term1>> mpiexec -n 1 ./server
>
> Term2>> mpiexec -n 1 ./client
>
>
>
> Please press any key on the server terminal.
>
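The server.c and client.c sources referenced above do not survive in the archived text, so here is a minimal sketch of what such a reproducer might look like. The port-name exchange through a "port.txt" file and the prompt wording are assumptions for illustration, not the original code; the key point is the assert in the client, which kills it before it ever disconnects.

/* server.c -- accepts one dynamic connection, then tries to disconnect. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;
    FILE *fp;

    MPI_Init(&argc, &argv);

    /* Publish a port that the client can connect to. */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    fp = fopen("port.txt", "w");
    fprintf(fp, "%s\n", port_name);
    fclose(fp);

    /* Block until the client connects. */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    /* Wait for a keypress; by then the client has already aborted. */
    printf("Client connected. Press a key after the client has died.\n");
    getchar();

    /* This is where the hang is reported: disconnecting from a dead peer. */
    MPI_Comm_disconnect(&intercomm);

    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}

/* client.c -- connects to the server and then dies via assert(0). */
#include <mpi.h>
#include <stdio.h>
#include <assert.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;
    FILE *fp;

    MPI_Init(&argc, &argv);

    /* Read the port name published by the server. */
    fp = fopen("port.txt", "r");
    fscanf(fp, "%s", port_name);
    fclose(fp);

    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    /* Simulate a failed process: abort before disconnecting. */
    assert(0);

    /* Never reached. */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}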
> I earlier filed a bug related to this:
> http://trac.mpich.org/projects/mpich/ticket/2205
>
> My questions:
>
> 1>    What happens if we don’t call MPI_Finalize and call exit(0)?
>
Mostly nothing. In theory, some things might not get cleaned up, but
probably the worst you'll see is a nasty error message.
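Concretely, the workaround being discussed might look something like the sketch below. The shutdown_server helper and the peer_is_dead flag are illustrative assumptions, not part of any MPICH API; whether skipping finalization leaves anything un-cleaned-up is implementation-dependent, as noted above.

/* Sketch of the workaround: if the connected peer is known to be dead,
 * skip MPI_Comm_disconnect/MPI_Finalize entirely and exit. */
#include <mpi.h>
#include <stdlib.h>

void shutdown_server(MPI_Comm *intercomm, int peer_is_dead)
{
    if (peer_is_dead) {
        /* Disconnecting from a dead peer hangs, so give up on a clean
         * MPI shutdown and just leave. */
        exit(0);
    }
    MPI_Comm_disconnect(intercomm);
    MPI_Finalize();
}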

>  2>    Is there any way I can forcefully complete MPI_Comm_disconnect from
> the server side?
>
I don't know of anything that will allow you to do that.

>
>
>
>
> Thanks,
>
> Hirak
>
>
>
>
>
> PS: The reason for using sock is these bugs in nemesis:
> http://trac.mpich.org/projects/mpich/ticket/1103 and
> http://trac.mpich.org/projects/mpich/ticket/79
>
>
>
>
>
> Wesley Bland wbland at anl.gov
> Thu Feb 26 10:13:42 CST 2015
>
> ------------------------------
>
> First, I believe the sock device is untested with most of the MPICH fault tolerance features, so YMMV here.
>
>
>
> Is there a reason that you aren’t calling MPI_Comm_disconnect for the failed process? Did you try it and something bad happened? That seems like the most straightforward way of doing things.
>
>
>
> Otherwise, this sounds like a known issue that we’re seeing from time to time with MPI_Finalize and the FT work. It’s something I’m trying to figure out now. If you can reduce your code down to the minimum and send it to me, I can use it as a test case to try to fix the problem.
>
>
>
> Thanks,
>
> Wesley
>
>
>
> > On Feb 19, 2015, at 5:15 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> >
> > Hi All,
> >
> > I am using MPICH with the sock channel.
> > I also set up the processes using the dynamic connection method
> > (MPI_Comm_connect/MPI_Comm_accept). It's a master-slave architecture in
> > which the master accepts connections from the slaves.
> >
> > Now if one of the processes dies (or gets killed), I can still recover
> > from this (without using a checkpoint/restart method).
> > For that particular process, the master does not call
> > MPI_Comm_disconnect (it hangs and does not complete).
> > As a result, MPI_Finalize in the master hangs and does not complete.
> > Do you have a workaround to forcefully complete MPI_Finalize or
> > MPI_Comm_disconnect?
> > I tried MPI_Comm_free on the failed connection. However, it does not
> > solve the hang in finalize.
> >
> > Thanks,
> > Hirak
> > _______________________________________________
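For reference, the master-side cleanup described in that original message might look roughly like the sketch below. The slave_t structure, the is_dead flag, and the cleanup_slaves helper are assumptions for illustration only, not the poster's actual code.

/* Rough sketch of the master-side cleanup pattern described above. */
#include <mpi.h>

typedef struct {
    MPI_Comm intercomm;   /* returned by MPI_Comm_accept */
    int      is_dead;     /* set by the application when a slave fails */
} slave_t;

void cleanup_slaves(slave_t *slaves, int nslaves)
{
    int i;
    for (i = 0; i < nslaves; i++) {
        if (slaves[i].is_dead) {
            /* MPI_Comm_disconnect on a dead peer hangs; MPI_Comm_free was
             * tried instead, but MPI_Finalize below still hangs. */
            MPI_Comm_free(&slaves[i].intercomm);
        } else {
            MPI_Comm_disconnect(&slaves[i].intercomm);
        }
    }
    MPI_Finalize();   /* reported to hang when any slave has died */
}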
>