[mpich-discuss] MPI process killed and SIGUSR1

Wesley Bland wbland at anl.gov
Thu Oct 9 11:17:53 CDT 2014


MPICH should detect and notify you about process failures without you having to install your own signal handler. You’ll be notified via the return code of your MPI call. You can also use an MPI error handler (MPI_Errhandler) to catch this notification. So once you’ve set up the intercommunicator between your two processes, you should be able to change its error handler from the default MPI_ERRORS_ARE_FATAL to your own custom error handler (or just use MPI_ERRORS_RETURN and check the return codes).
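A minimal sketch of the return-code approach. Note the assumptions: it uses MPI_COMM_WORLD and a deliberately invalid destination rank only so it can run as a single standalone process; in the scenario from this thread you would instead call MPI_Comm_set_errhandler on the intercommunicator returned by MPI_Comm_accept/MPI_Comm_connect, and the failing call would be a receive from the dead peer.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Switch off the default abort-on-error behavior so MPI calls
       report failures through their return code instead.  In the
       thread's scenario, set this on the intercommunicator. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int buf = 42;
    /* Rank 5 does not exist in a single-process run, so this call
       fails; with MPI_ERRORS_RETURN it returns an error code rather
       than aborting the job. */
    int rc = MPI_Send(&buf, 1, MPI_INT, 5, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI call failed: %s\n", msg);
        /* Application-specific recovery or cleanup would go here. */
    }

    MPI_Finalize();
    return 0;
}
```

The same pattern works with a custom handler created via MPI_Comm_create_errhandler if you prefer a callback over checking every return code.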

If you’re asking about the implementation of the proposed fault tolerance features for MPI-4 in MPICH, that’s a work in progress. We hope to have something for the MPICH 3.2 release cycle, but it’s not guaranteed and it will still be a very experimental feature given that the MPI Forum has not yet actually adopted the fault tolerance proposal.

Thanks,
Wesley

> On Oct 9, 2014, at 11:12 AM, Balaji, Pavan <balaji at anl.gov> wrote:
> 
> 
> I’ll let Wesley answer the FT notification part.
> 
> MPI-4 is a major standard release of MPI.  That’ll take a few years.  We are currently working on the MPI-3.1 release.  (hopefully you know the difference between MPI and MPICH, otherwise it’ll take many emails to explain that part :-) ).
> 
>  — Pavan
> 
> On Oct 9, 2014, at 11:09 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
> 
>> Hi Pavan,
>> 
>> Just wondering whether, with the current release, we have any way to notify the server that the client has terminated unexpectedly.
>> Another point: when do we expect the MPI-4 release to be out?
>> 
>> Thanks,
>> Hirak
>> 
>> 
>> Please don’t rely on this feature.  We are preparing for MPI-4 Fault Tolerance and are in the process of reworking a bunch of this functionality.  If you are planning to use this in production code, be aware that it might or might not exist in future releases.
>> 
>>  — Pavan
>> 
>> On Oct 9, 2014, at 10:57 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>> 
>>> 
>>> Hi Sangmin,
>>> 
>>> The README of mpich says the following:
>>> 
>>> FAILURE NOTIFICATION: THIS IS AN UNSUPPORTED FEATURE AND WILL
>>> ALMOST CERTAINLY CHANGE IN THE FUTURE!
>>> 
>>>   In the current release, hydra notifies the MPICH library of failed
>>>   processes by sending a SIGUSR1 signal.  The application can catch
>>>   this signal to be notified of failed processes.  If the application
>>>   replaces the library's signal handler with its own, the application
>>>   must be sure to call the library's handler from its own
>>>   handler.  Note that you cannot call any MPI function from inside a
>>>   signal handler.
>>> 
>>> If this is true, shouldn’t I expect SIGUSR1?
>>> 
>>> 
>>> Thanks,
>>> Hirak
>>> First of all, MPI functions are not signal-safe. So, if you try to use signals within your MPI program, things might break.
>>> 
>>> — Sangmin
>>> 
>>> 
>>> On Oct 9, 2014, at 7:37 AM, Roy, Hirak <Hirak_Roy at mentor.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I have two MPI processes (server and client) launched independently by two different mpiexec commands (mpich-3.0.4, sock device):
>>> 1>    mpiexec --disable-auto-cleanup -n 1 ./server
>>> 2>    mpiexec --disable-auto-cleanup -n 1 ./client
>>> 
>>> The server opens a port and does MPI_Comm_accept.
>>> The client gets the port information and does MPI_Comm_connect and hence we get a new intercommunicator.
>>> I don’t do MPI_Comm_merge.
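The accept/connect setup described above can be sketched as follows. The role selection via argv and the hand-off of the port name on stdout are assumptions for illustration; the thread does not say how the port name reaches the client, and any out-of-band channel works:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Sketch: server opens a port and accepts; client connects with a
   port name obtained out of band.  No MPI_Comm_merge, as above. */
int main(int argc, char **argv)
{
    MPI_Comm intercomm = MPI_COMM_NULL;
    char port[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);   /* hand this string to the client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Close_port(port);
    } else if (argc > 2 && strcmp(argv[1], "client") == 0) {
        strncpy(port, argv[2], MPI_MAX_PORT_NAME - 1);
        port[MPI_MAX_PORT_NAME - 1] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    }

    /* ... communicate over intercomm here ... */

    MPI_Finalize();
    return 0;
}
```

Each side would be launched under its own mpiexec, as in the commands above.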
>>> 
>>> I have installed my own signal handler for SIGUSR1 even before I call MPI_Init (I guess this will automatically chain the signal handlers).
>>> 
>>>    signal(SIGUSR1, mysignalhandler);
>>> 
>>> Now suppose the ‘client’ process gets killed (I forcefully kill the process with signal 9). I thought I would get SIGUSR1 in the ‘server’ process.
>>> However, I don’t get any signal in the ‘server’ process.
>>> Am I doing something wrong?
>>> I have noticed that if I start 4 client processes with a single mpiexec command and one client gets killed, the remaining 3 clients receive SIGUSR1.
>>> 
>>> Does this mean that SIGUSR1 is not forwarded across processes connected through an intercommunicator?
>>> 
>>> 
>>> Thanks,
>>> Hirak
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 
