[mpich-discuss] MPI process killed and SIGUSR1

Balaji, Pavan balaji at anl.gov
Thu Oct 9 11:12:22 CDT 2014


I’ll let Wesley answer the FT notification part.

MPI-4 is a major standard release of MPI.  That’ll take a few years.  We are currently working on the MPI-3.1 release.  (hopefully you know the difference between MPI and MPICH, otherwise it’ll take many emails to explain that part :-) ).

  — Pavan

On Oct 9, 2014, at 11:09 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:

> Hi Pavan,
>  
> Just wondering whether, with the current release, we have any way to notify the server that the client has terminated unexpectedly.
> Another point: when do we expect to have the MPI-4 release out?
>  
> Thanks,
> Hirak
>  
>  
> Please don’t rely on this feature.  We are preparing for MPI-4 Fault Tolerance and are in the process of reworking a bunch of this stuff.  This might or might not exist in the future, so keep that in mind if you are planning to use it in production code.
>  
>   — Pavan
>  
> On Oct 9, 2014, at 10:57 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>  
> >  
> > Hi Sangmin,
> >  
> > The README of MPICH says the following:
> >  
> > FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
> > ALMOST CERTAINLY CHANGE IN THE FUTURE!
> >  
> >    In the current release, hydra notifies the MPICH library of failed
> >    processes by sending a SIGUSR1 signal.  The application can catch
> >    this signal to be notified of failed processes.  If the application
> >    replaces the library's signal handler with its own, the application
> >    must be sure to call the library's handler from its own
> >    handler.  Note that you cannot call any MPI function from inside a
> >    signal handler.
> >  
> > If this is true, shouldn't I expect a SIGUSR1?
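> >  
> > In other words, something like the sketch below, if I read the README right (names such as peer_failed and my_sigusr1 are mine, not MPICH API): install the handler after MPI_Init, save the handler the library installed so it can be chained, and only set a flag inside the handler because MPI calls are not signal safe.
> >  
> >     #include <signal.h>
> >     #include <mpi.h>
> >  
> >     static volatile sig_atomic_t peer_failed = 0;  /* set by the handler, read in the main loop */
> >     static void (*lib_handler)(int) = NULL;        /* the library's handler, saved so we can chain */
> >  
> >     static void my_sigusr1(int sig)
> >     {
> >         peer_failed = 1;                           /* no MPI calls here: MPI is not signal safe */
> >         if (lib_handler != NULL && lib_handler != SIG_IGN && lib_handler != SIG_DFL)
> >             lib_handler(sig);                      /* chain to the handler the library installed */
> >     }
> >  
> >     int main(int argc, char **argv)
> >     {
> >         MPI_Init(&argc, &argv);
> >         /* install our handler after MPI_Init; signal() returns the previous
> >            (library) handler so the chain can be preserved */
> >         lib_handler = signal(SIGUSR1, my_sigusr1);
> >         /* ... application code that checks peer_failed from time to time ... */
> >         MPI_Finalize();
> >         return 0;
> >     }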
> >  
> >  
> > Thanks,
> > Hirak
> > First of all, MPI functions are not signal safe. So, if you try to use signals within your MPI program, things might break.
> >  
> > — Sangmin
> >  
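> > Given that, a sketch of reacting to the notification outside the handler, assuming a flag such as peer_failed set by a SIGUSR1 handler as above (again illustrative names, not MPICH API): post a nonblocking receive and poll the flag alongside MPI_Test.
> >  
> >     #include <signal.h>
> >     #include <mpi.h>
> >  
> >     extern volatile sig_atomic_t peer_failed;      /* set by the SIGUSR1 handler */
> >  
> >     /* Wait for a message from the peer, but give up if a failure was signalled. */
> >     static int recv_or_fail(MPI_Comm inter, int *buf)
> >     {
> >         MPI_Request req;
> >         int done = 0;
> >  
> >         MPI_Irecv(buf, 1, MPI_INT, 0, 0, inter, &req);
> >         while (!done) {
> >             if (peer_failed) {                     /* handle the failure at MPI level here */
> >                 MPI_Cancel(&req);
> >                 MPI_Request_free(&req);
> >                 return -1;
> >             }
> >             MPI_Test(&req, &done, MPI_STATUS_IGNORE);
> >         }
> >         return 0;
> >     }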
> >  
> > On Oct 9, 2014, at 7:37 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> >  
> > Hi,
> >  
> > I have two MPI processes (server and client) launched independently by two different mpiexec commands (mpich-3.0.4, sock device):
> > 1>    mpiexec -disable-auto-cleanup -n 1 ./server
> > 2>    mpiexec -disable-auto-cleanup -n 1 ./client
> >  
> > The server opens a port and does MPI_Comm_accept.
> > The client gets the port information and does MPI_Comm_connect, which gives us a new intercommunicator.
> > I don’t do MPI_Intercomm_merge.
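> >  
> > Roughly, the code on each side looks like this (a sketch under the assumptions above; the port string printed by the server is handed to the client out of band, e.g. on its command line):
> >  
> >     #include <stdio.h>
> >     #include <mpi.h>
> >  
> >     int main(int argc, char **argv)
> >     {
> >         char port[MPI_MAX_PORT_NAME];
> >         MPI_Comm inter;
> >  
> >         MPI_Init(&argc, &argv);
> >         if (argc < 2) {                  /* no argument: act as the server */
> >             MPI_Open_port(MPI_INFO_NULL, port);
> >             printf("port: %s\n", port);  /* give this string to the client */
> >             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
> >         } else {                         /* argument given: act as the client, argv[1] is the port */
> >             MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
> >         }
> >         /* inter is the new intercommunicator; no MPI_Intercomm_merge is done */
> >         MPI_Comm_free(&inter);
> >         MPI_Finalize();
> >         return 0;
> >     }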
> >  
> > I have installed my own signal handler for SIGUSR1 even before I call MPI_Init (I guess this will automatically chain the signal handlers).
> >  
> > >> signal (SIGUSR1, mysignalhandler);
> >  
> > Now suppose the ‘client’ process gets killed (I forcefully kill it with signal 9); I thought I would then get a SIGUSR1 in the ‘server’ process.
> > However, I don’t get any signal in ‘server’ process.
> > Am I doing something wrong?
> > I have noticed that if I start 4 client processes with a single mpiexec command and one client gets killed, the remaining 3 clients receive SIGUSR1.
> >  
> > Does this mean SIGUSR1 is not forwarded across processes connected using an intercommunicator?
> >  
> >  
> > Thanks,
> > Hirak
>  

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

