[mpich-discuss] MPI process killed and SIGUSR1
Balaji, Pavan
balaji at anl.gov
Thu Oct 9 11:12:22 CDT 2014
I’ll let Wesley answer the FT notification part.
MPI-4 will be a major release of the MPI standard; that will take a few years. We are currently working on the MPI-3.1 release. (Hopefully you know the difference between MPI and MPICH; otherwise it’ll take many emails to explain that part :-).)
— Pavan
On Oct 9, 2014, at 11:09 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> Hi Pavan,
>
> Just wondering: with the current release, do we have any way to notify the server that the client has terminated unexpectedly?
> Another point: when do we expect the MPI-4 release to be out?
>
> Thanks,
> Hirak
>
>
> Please don’t rely on this feature. We are preparing for MPI-4 Fault Tolerance and are in the process of reworking a bunch of this stuff. If you are planning to use this in production code, be aware that it might or might not exist in the future.
>
> — Pavan
>
> On Oct 9, 2014, at 10:57 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>
> >
> > Hi Sangmin,
> >
> > The readme of mpich says the following :
> >
> > FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
> > ALMOST CERTAINLY CHANGE IN THE FUTURE!
> >
> > In the current release, hydra notifies the MPICH library of failed
> > processes by sending a SIGUSR1 signal. The application can catch
> > this signal to be notified of failed processes. If the application
> > replaces the library's signal handler with its own, the application
> > must be sure to call the library's handler from its own
> > handler. Note that you cannot call any MPI function from inside a
> > signal handler.
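> >
> > For concreteness, here is a minimal sketch of the chaining the readme asks for (handler and variable names are mine; it installs the handler after MPI_Init so that signal() returns the handler the library installed):
> >
> >   #include <mpi.h>
> >   #include <signal.h>
> >   #include <unistd.h>
> >
> >   static void (*prev_handler)(int) = SIG_DFL;
> >
> >   static void my_usr1_handler(int sig)
> >   {
> >       /* Only async-signal-safe work here; no MPI calls. */
> >       write(STDERR_FILENO, "got SIGUSR1\n", 12);
> >       /* Chain to the library's handler, as the readme requires. */
> >       if (prev_handler != SIG_DFL && prev_handler != SIG_IGN)
> >           prev_handler(sig);
> >   }
> >
> >   int main(int argc, char **argv)
> >   {
> >       MPI_Init(&argc, &argv);
> >       /* Installed after MPI_Init, so signal() hands back the
> >          handler that the MPICH library set up. */
> >       prev_handler = signal(SIGUSR1, my_usr1_handler);
> >       /* ... application work ... */
> >       MPI_Finalize();
> >       return 0;
> >   }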
> >
> > If this is true, shouldn’t I expect SIGUSR1?
> >
> >
> > Thanks,
> > Hirak
> > First of all, MPI functions are not signal-safe, so if you try to use signals within your MPI program, things might break.
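> >
> > The usual safe pattern is to do nothing in the handler except set a flag, and react to it from normal code, e.g. (a sketch; the flag and handler names are illustrative):
> >
> >   #include <signal.h>
> >
> >   static volatile sig_atomic_t peer_failed = 0;
> >
> >   static void usr1_handler(int sig)
> >   {
> >       (void)sig;
> >       peer_failed = 1;   /* async-signal-safe; no MPI calls here */
> >   }
> >
> >   /* Later, in the main loop (outside signal context):
> >      if (peer_failed) { ... tear down, report, exit ... } */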
> >
> > — Sangmin
> >
> >
> > On Oct 9, 2014, at 7:37 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> >
> > Hi ,
> >
> > I have two MPI processes (server and client) launched independently by two different mpiexec commands (mpich-3.0.4, sock device):
> > 1> mpiexec -disable-auto-cleanup -n 1 ./server
> > 2> mpiexec -disable-auto-cleanup -n 1 ./client
> >
> > The server opens a port and does MPI_Comm_accept.
> > The client gets the port information and does MPI_Comm_connect, and hence we get a new intercommunicator.
> > I don’t do MPI_Comm_merge.
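> >
> > For reference, the accept/connect sequence is roughly the following (fragments, assuming <mpi.h> and MPI_Init are already in place; how the port name travels from server to client is application-specific):
> >
> >   /* Server: */
> >   char port[MPI_MAX_PORT_NAME];
> >   MPI_Comm client;
> >   MPI_Open_port(MPI_INFO_NULL, port);
> >   /* ... publish the port name to the client (file, name service, ...) ... */
> >   MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
> >
> >   /* Client: */
> >   char port[MPI_MAX_PORT_NAME];
> >   MPI_Comm server;
> >   /* ... obtain the server's port name ... */
> >   MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);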
> >
> > I have installed my own signal handler for SIGUSR1 even before I call MPI_Init (I guess this will automatically chain the signal handlers).
> >
> > >> signal (SIGUSR1, mysignalhandler);
> >
> > Now suppose the ‘client’ process gets killed (I forcefully kill it with signal 9); I thought I would then get SIGUSR1 in the ‘server’ process.
> > However, I don’t get any signal in the ‘server’ process.
> > Am I doing something wrong?
> > I have noticed that if I start 4 client processes with a single mpiexec command and one client gets killed, the remaining 3 clients receive SIGUSR1.
> >
> > Does this mean SIGUSR1 is not forwarded across processes connected via an intercommunicator?
> >
> >
> > Thanks,
> > Hirak
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss