[mpich-discuss] MPI process killed and SIGUSR1

Balaji, Pavan balaji at anl.gov
Thu Oct 9 11:04:16 CDT 2014


Please don’t rely on this feature.  We are preparing for MPI-4 Fault Tolerance and are in the process of reworking a bunch of this stuff.  If you are planning to use this in production code, be aware that it might or might not exist in the future.

  — Pavan

On Oct 9, 2014, at 10:57 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:

>  
> Hi Sangmin,
>  
> The MPICH README says the following:
>  
> FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
> ALMOST CERTAINLY CHANGE IN THE FUTURE!
>  
>    In the current release, hydra notifies the MPICH library of failed
>    processes by sending a SIGUSR1 signal.  The application can catch
>    this signal to be notified of failed processes.  If the application
>    replaces the library's signal handler with its own, the application
>    must be sure to call the library's handler from its own
>    handler.  Note that you cannot call any MPI function from inside a
>    signal handler.
>  
> If this is true, shouldn't I expect SIGUSR1?
>  
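For illustration, the chaining rule in the README excerpt above could look roughly like the following in plain POSIX C. This is only a sketch under assumptions: library_handler is a hypothetical stand-in for whatever handler an MPI library might install, the install_* and chain_demo names are made up for this example, and nothing here reflects how MPICH or hydra actually implement the feature. It contains no MPI calls; it only shows an application handler saving the previous SIGUSR1 disposition with sigaction() and forwarding the signal to it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Counters updated only from the signal handlers below. */
static volatile sig_atomic_t library_calls = 0;
static volatile sig_atomic_t app_calls = 0;

/* Disposition SIGUSR1 had before the application installed its handler. */
static struct sigaction previous_action;

/* HYPOTHETICAL stand-in for a handler installed by an MPI library. */
static void library_handler(int signo)
{
    (void)signo;
    library_calls = library_calls + 1;
}

/* Application handler: record the event, then chain to the old handler. */
static void app_handler(int signo, siginfo_t *info, void *context)
{
    app_calls = app_calls + 1;

    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL)
            previous_action.sa_sigaction(signo, info, context);
    } else if (previous_action.sa_handler != SIG_DFL &&
               previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(signo);
    }
}

/* Stand-in for library initialisation (an assumption, not MPICH code). */
static int install_library_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = library_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Install the application handler and remember what it replaced. */
static int install_app_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = app_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, &previous_action);
}

/* Returns 1 if one SIGUSR1 reached both handlers once, 0 if not, -1 on error. */
int chain_demo(void)
{
    library_calls = 0;
    app_calls = 0;
    if (install_library_handler() != 0 || install_app_handler() != 0)
        return -1;
    raise(SIGUSR1); /* simulates the launcher's notification */
    return app_calls == 1 && library_calls == 1;
}
```

Calling chain_demo() from main() exercises the chain once. Using sigaction() rather than signal() is a design choice of this sketch: it returns the previous disposition in one call, which is what makes forwarding possible.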
>  
> Thanks,
> Hirak
> First of all, MPI functions are not signal safe. So, if you try to use signals within your MPI program, things might break.
>  
> — Sangmin
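Since MPI functions are not signal safe, one common way to live with that restriction is to do nothing in the handler except set a flag, and to react to the flag later from ordinary code. The sketch below shows that pattern in plain POSIX C. All names (on_sigusr1, install_flag_handler, consume_failure_notification, run_loop) are hypothetical and chosen for this example; raise() stands in for the launcher's notification, and the comment marks the place where MPI calls could be made in a real program.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, cleared by normal code. */
static volatile sig_atomic_t failure_pending = 0;

/* The handler only sets a flag; it calls nothing that is not signal safe. */
static void on_sigusr1(int signo)
{
    (void)signo;
    failure_pending = 1;
}

/* Install the flag-setting handler for SIGUSR1. Returns 0 on success. */
int install_flag_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/*
 * Called from normal (non-handler) code. Returns 1 and clears the flag if
 * a notification is pending, 0 otherwise. Several signals that arrive
 * close together may be reported as a single notification.
 */
int consume_failure_notification(void)
{
    if (!failure_pending)
        return 0;
    failure_pending = 0;
    return 1;
}

/*
 * Hypothetical main loop. At step `signal_at_step` it raises SIGUSR1 to
 * simulate a notification (pass -1 to simulate none). Returns the number
 * of notifications handled outside the signal handler.
 */
int run_loop(int iterations, int signal_at_step)
{
    int handled = 0;
    for (int step = 0; step < iterations; step++) {
        if (step == signal_at_step)
            raise(SIGUSR1);
        if (consume_failure_notification()) {
            /* Safe point: we are outside the handler, so this is where
             * a real program could call MPI functions or clean up. */
            handled++;
        }
    }
    return handled;
}
```

The cost of this approach is latency: the program only notices the signal the next time it polls the flag.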
>  
>  
> On Oct 9, 2014, at 7:37 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>  
> Hi ,
>  
> I have two MPI processes (server and client) launched independently by two different mpiexec commands (mpich-3.0.4, sock device):
> 1>    mpiexec -disable-auto-cleanup -n 1 ./server
> 2>    mpiexec -disable-auto-cleanup -n 1 ./client
>  
> The server opens a port and does MPI_Comm_accept.
> The client gets the port information and does MPI_Comm_connect and hence we get a new intercommunicator.
> I don’t call MPI_Intercomm_merge.
>  
> I have installed my own signal handler for SIGUSR1 before I even call MPI_Init (I guess this will automatically chain the signal handler).
>  
> >> signal (SIGUSR1, mysignalhandler);
>  
> Now suppose the ‘client’ process gets killed (I forcefully kill the process with signal 9). I thought I would get SIGUSR1 in the ‘server’ process.
> However, I don’t get any signal in ‘server’ process.
> Am I doing something wrong?
> I have noticed that if I start 4 client processes with a single mpiexec command and one client gets killed, the remaining 3 clients receive SIGUSR1.
>  
> Does this mean that SIGUSR1 is not forwarded across processes connected through an inter-communicator?
>  
>  
> Thanks,
> Hirak
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
