[mpich-discuss] question about -disable-auto-cleanup

Kenneth Raffenetti raffenet at mcs.anl.gov
Thu Aug 31 11:19:45 CDT 2017


Hi Zaak,

I'll try my best to explain here. There are a few things to consider.

1. Hydra: -disable-auto-cleanup tells Hydra (mpiexec) to let the other 
processes in the job keep running if an MPI process dies. Since Hydra is 
already monitoring the MPI processes to detect when one dies, passing 
this option has no impact inside Hydra.
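
For concreteness, here is a hypothetical launch line; ./my_app and the 
rank count are placeholders, and the sketch only prints the command 
rather than running it:

```shell
# Build the mpiexec command line with the fault-tolerance option.
# -disable-auto-cleanup: if one rank dies, surviving ranks keep running.
cmd="mpiexec -disable-auto-cleanup -n 4 ./my_app"

# Print rather than execute, since ./my_app is only a placeholder.
echo "$cmd"
```

Without the flag, Hydra's default is to tear down the whole job as soon 
as any rank exits abnormally.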

2. Application behavior: The default error handler in MPI is 
MPI_ERRORS_ARE_FATAL, and some applications rely on that: they expect 
the running job to be aborted/cleaned up if a process quits. With 
-disable-auto-cleanup this is no longer the case. An application can 
still call MPI_Abort() to force the old behavior, however.

Ken

On 08/30/2017 12:29 PM, Zaak Beekman wrote:
> OK, since there were no responses here to my previous email, perhaps a 
> better question would be:
> 
> What is a good resource to learn about the impact of passing 
> `--disable-auto-cleanup` at runtime?
> 
> Some google searches bring up discussions of what appear to be bugs in 
> the standard and/or implementation, but I'm not sure where to look to 
> find out about even the intended runtime semantics.
> 
> Any and all help pointing me in the right direction would be much 
> appreciated.
> 
> Thanks,
> Zaak
> 
> On Wed, Aug 30, 2017 at 1:00 PM <discuss-request at mpich.org 
> <mailto:discuss-request at mpich.org>> wrote:
> 
>     Send discuss mailing list submissions to
>     discuss at mpich.org <mailto:discuss at mpich.org>
> 
>     To subscribe or unsubscribe via the World Wide Web, visit
>     https://lists.mpich.org/mailman/listinfo/discuss
>     or, via email, send a message with subject or body 'help' to
>     discuss-request at mpich.org <mailto:discuss-request at mpich.org>
> 
>     You can reach the person managing the list at
>     discuss-owner at mpich.org <mailto:discuss-owner at mpich.org>
> 
>     When replying, please edit your Subject line so it is more specific
>     than "Re: Contents of discuss digest..."
> 
> 
>     Today's Topics:
> 
>         1.  question about -disable-auto-cleanup (Zaak Beekman)
>         2.  Torque MPICH jobs stuck (Souparno Adhikary)
>         3. Re:  Torque MPICH jobs stuck (Halim Amer)
> 
> 
>     ----------------------------------------------------------------------
> 
>     Message: 1
>     Date: Tue, 29 Aug 2017 21:22:49 +0000
>     From: Zaak Beekman <zbeekman at gmail.com <mailto:zbeekman at gmail.com>>
>     To: discuss at mpich.org <mailto:discuss at mpich.org>
>     Subject: [mpich-discuss] question about -disable-auto-cleanup
> 
>     I know that --disable-auto-cleanup is required to enable the
>     fault-tolerant
>     MPI features, but are there downsides to passing this? Performance
>     implications?
> 
>     I ask because over at
>     https://github.com/sourceryinstitute/OpenCoarrays we've
>     implemented much of the Fortran 2015 failed images feature on top of
>     MPICH
>     and other MPI implementations. But to use this,
>     --disable-auto-cleanup must
>     be passed to mpiexec. We provide wrapper scripts to try to abstract the
>     back end (GASNet, MPI, OpenSHMEM etc.) in the form of a Fortran compiler
>     wrapper, and an executable launcher. So I'm wondering, since failed
>     images
>     are part of the standard (2015) would it be dumb if we always pass
>     --disable-auto-cleanup to mpiexec and only turn off support when
>     explicitly
>     asked for by the user, or is it safer/more performant to default to
>     requiring the user to pass an additional flag to our wrapper script that
>     results in --disable-auto-cleanup getting passed to mpiexec?
> 
>     Feedback would be much appreciated. Feel free to post responses at
>     https://github.com/sourceryinstitute/OpenCoarrays/issues/401 as well.
> 
>     Thanks,
>     Zaak
> 
>     ------------------------------
> 
>     Message: 2
>     Date: Wed, 30 Aug 2017 13:48:00 +0530
>     From: Souparno Adhikary <souparnoa91 at gmail.com
>     <mailto:souparnoa91 at gmail.com>>
>     To: discuss at mpich.org <mailto:discuss at mpich.org>
>     Subject: [mpich-discuss] Torque MPICH jobs stuck
> 
>     I know this is not the proper place to discuss this, but as the
>     Torque-mpich list seems dead, I can't think of anywhere else to
>     post it.
> 
>     MPICH2 was installed on the servers. I installed Torque afterwards,
>     and opened the required ports in the iptables file.
> 
>     Torque MPI jobs (even simple jobs like hostname) remain stuck.
>     However, the jobs are properly distributed across the nodes, and
>     pbsnodes -a shows them in order.
> 
>     The sched_log and server_logs files do not show anything unusual.
>     Therefore, it might be a problem with MPICH2.
> 
>     Can you please suggest where I should start troubleshooting?
> 
>     Thanks,
> 
>     Souparno Adhikary,
>     CHPC Lab,
>     Department of Microbiology,
>     University of Calcutta.
> 
>     ------------------------------
> 
>     Message: 3
>     Date: Wed, 30 Aug 2017 11:00:51 -0500
>     From: Halim Amer <aamer at anl.gov <mailto:aamer at anl.gov>>
>     To: <discuss at mpich.org <mailto:discuss at mpich.org>>
>     Subject: Re: [mpich-discuss] Torque MPICH jobs stuck
> 
>     Which MPICH version are you using? Have you tried the latest 3.2
>     version?
> 
>     If it still fails, can you attach your simple Torque job script here?
> 
>     Halim
>     www.mcs.anl.gov/~aamer <http://www.mcs.anl.gov/~aamer>
> 
> 
> 
>     ------------------------------
> 
> 
>     ------------------------------
> 
>     End of discuss Digest, Vol 58, Issue 18
>     ***************************************
> 
> 
> 
> 
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

