[mpich-discuss] MPI process killed and SIGUSR1
Balaji, Pavan
balaji at anl.gov
Thu Oct 9 11:04:16 CDT 2014
Please don’t rely on this feature. We are preparing for MPI-4 Fault Tolerance and are reworking much of this code, so the feature might or might not exist in the future. Keep that in mind if you are planning to use it in production code.
— Pavan
On Oct 9, 2014, at 10:57 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>
> Hi Sangmin,
>
> The MPICH README says the following:
>
> FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
> ALMOST CERTAINLY CHANGE IN THE FUTURE!
>
> In the current release, hydra notifies the MPICH library of failed
> processes by sending a SIGUSR1 signal. The application can catch
> this signal to be notified of failed processes. If the application
> replaces the library's signal handler with its own, the application
> must be sure to call the library's handler from its own
> handler. Note that you cannot call any MPI function from inside a
> signal handler.
>
> If this is true, shouldn't I expect SIGUSR1?
>
>
> Thanks,
> Hirak
> First of all, MPI functions are not signal safe. So, if you try to use signals within your MPI program, things might break.
>
> — Sangmin
>
>
> On Oct 9, 2014, at 7:37 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>
> Hi ,
>
> I have two MPI processes (server and client) launched independently by two different mpiexec commands (mpich-3.0.4, sock device):
> 1> mpiexec -disable-auto-cleanup -n 1 ./server
> 2> mpiexec -disable-auto-cleanup -n 1 ./client
>
> The server opens a port and calls MPI_Comm_accept.
> The client gets the port information and calls MPI_Comm_connect, which gives us a new intercommunicator.
> I don’t call MPI_Intercomm_merge.
>
> I installed my own signal handler for SIGUSR1 before calling MPI_Init (I guess this will automatically chain the signal handler).
>
> >> signal (SIGUSR1, mysignalhandler);
>
> Now suppose the ‘client’ process gets killed (I forcefully kill it with signal 9). I expected to get SIGUSR1 in the ‘server’ process.
> However, I don’t get any signal in the ‘server’ process.
> Am I doing something wrong?
> I have noticed that if I start 4 client processes with a single mpiexec command and one client gets killed, the remaining 3 clients receive SIGUSR1.
>
> Does this mean SIGUSR1 is not forwarded across processes connected through an intercommunicator?
>
>
> Thanks,
> Hirak
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
--
Pavan Balaji ✉️
http://www.mcs.anl.gov/~balaji