[mpich-discuss] question about -disable-auto-cleanup

Zaak Beekman zbeekman at gmail.com
Thu Aug 31 11:49:37 CDT 2017


Ken,

Thanks so much for your concise response!

Is my understanding correct that if one MPI rank, or all the ranks on a
given node, fail (say, because of a hardware issue), there is no way to
detect that this has happened unless the MPI library supports the
user-level failure mitigation (ULFM) features? In other words, if one
passed --disable-auto-cleanup to a runtime without ULFM support and
processes on remote MPI ranks died, there would be no way to query for the
failure and call MPI_Abort from a different, still-running rank?
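
For concreteness, the kind of query I have in mind would look roughly like
the sketch below. It assumes an MPICH build that exposes the ULFM prototype
calls MPIX_Comm_failure_ack and MPIX_Comm_failure_get_acked, plus a
communicator whose error handler has been set to MPI_ERRORS_RETURN; the
helper name is made up for illustration.

#include <mpi.h>
#include <stdio.h>

/* Call after a communication call reports an error: acknowledge the
 * failures known locally, retrieve them as a group, and abort the job
 * from this (still running) rank if anything has died. */
static void abort_if_peers_failed(MPI_Comm comm)
{
    MPI_Group failed;
    int nfailed;

    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_failure_get_acked(comm, &failed);
    MPI_Group_size(failed, &nfailed);

    if (nfailed > 0) {
        fprintf(stderr, "detected %d failed rank(s), aborting\n", nfailed);
        MPI_Abort(comm, 1);
    }
    MPI_Group_free(&failed);
}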

Thanks again,
Zaak

On Thu, Aug 31, 2017 at 12:19 PM <discuss-request at mpich.org> wrote:

>
> Today's Topics:
>
>    1.  question about -disable-auto-cleanup (Zaak Beekman)
>    2. Re:  Torque MPICH jobs stuck (Souparno Adhikary)
>    3. Re:  question about -disable-auto-cleanup (Kenneth Raffenetti)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 30 Aug 2017 17:29:11 +0000
> From: Zaak Beekman <zbeekman at gmail.com>
> To: discuss at mpich.org
> Subject: [mpich-discuss] question about -disable-auto-cleanup
>
> OK, since there were no responses here to my previous email, perhaps a
> better question would be:
>
> What is a good resource to learn about the impact of passing
> `--disable-auto-cleanup` at runtime?
>
> Some Google searches bring up discussions of what appear to be bugs in the
> standard and/or the implementation, but I'm not sure where to look to learn
> even the intended runtime semantics.
>
> Any and all help pointing me in the right direction would be much
> appreciated.
>
> Thanks,
> Zaak
>
> On Wed, Aug 30, 2017 at 1:00 PM <discuss-request at mpich.org> wrote:
>
> >
> > Message: 1
> > Date: Tue, 29 Aug 2017 21:22:49 +0000
> > From: Zaak Beekman <zbeekman at gmail.com>
> > To: discuss at mpich.org
> > Subject: [mpich-discuss] question about -disable-auto-cleanup
> >
> > I know that --disable-auto-cleanup is required to enable the
> > fault-tolerant MPI features, but are there downsides to passing it?
> > Performance implications?
> >
> > I ask because over at https://github.com/sourceryinstitute/OpenCoarrays
> > we've implemented much of the Fortran 2015 failed-images feature on top
> > of MPICH and other MPI implementations. To use it,
> > --disable-auto-cleanup must be passed to mpiexec. We provide wrapper
> > scripts that try to abstract the back end (GASNet, MPI, OpenSHMEM, etc.)
> > in the form of a Fortran compiler wrapper and an executable launcher. So
> > I'm wondering: since failed images are part of the (2015) standard,
> > would it be dumb to always pass --disable-auto-cleanup to mpiexec and
> > only turn support off when the user explicitly asks, or is it safer/more
> > performant to require the user to pass an additional flag to our wrapper
> > script that results in --disable-auto-cleanup being passed to mpiexec?
> >
> > Feedback would be much appreciated. Feel free to post responses at
> > https://github.com/sourceryinstitute/OpenCoarrays/issues/401 as well.
> >
> > Thanks,
> > Zaak
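
For what it's worth, the job-level effect of the flag can be observed with
a toy program along the lines of the sketch below (the file name, rank
count, and sleep time are arbitrary). Run it once as
`mpiexec -n 4 ./fail_demo` and once with -disable-auto-cleanup added; in
the first case the whole job should be cleaned up when rank 1 dies, in the
second the remaining ranks keep running.

/* fail_demo.c: rank 1 exits abnormally; the remaining ranks report
 * whether they are still alive a few seconds later. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        fprintf(stderr, "rank 1: exiting without calling MPI_Finalize\n");
        exit(1);  /* simulate a crashed process */
    }

    sleep(5);  /* give rank 1 time to disappear */
    printf("rank %d: still running after rank 1 died\n", rank);

    /* Don't wait in MPI_Finalize for a rank that will never arrive;
     * end the job explicitly instead. */
    MPI_Abort(MPI_COMM_WORLD, 0);
    return 0;
}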
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Wed, 30 Aug 2017 13:48:00 +0530
> > From: Souparno Adhikary <souparnoa91 at gmail.com>
> > To: discuss at mpich.org
> > Subject: [mpich-discuss] Torque MPICH jobs stuck
> >
> > I know this is not the proper place to discuss this, but since the
> > Torque-mpich list seems dead, I can't think of anywhere else to post it.
> >
> > MPICH2 was installed on the servers, and I installed Torque afterwards.
> > I opened the required ports by adding them to the iptables file.
> >
> > Torque MPI jobs (even simple ones like hostname) remain stuck, even
> > though the jobs are properly distributed across the nodes and
> > pbsnodes -a shows them in order.
> >
> > The sched_log and server_logs files do not show anything unusual, so it
> > might be a problem with MPICH2 itself.
> >
> > Can you please suggest where I should start troubleshooting?
> >
> > Thanks,
> >
> > Souparno Adhikary,
> > CHPC Lab,
> > Department of Microbiology,
> > University of Calcutta.
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Wed, 30 Aug 2017 11:00:51 -0500
> > From: Halim Amer <aamer at anl.gov>
> > To: <discuss at mpich.org>
> > Subject: Re: [mpich-discuss] Torque MPICH jobs stuck
> >
> > Which MPICH version are you using? Have you tried the latest 3.2 version?
> >
> > If it still fails, can you attach your simple Torque job script here?
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> ------------------------------
>
> Message: 2
> Date: Thu, 31 Aug 2017 13:00:34 +0530
> From: Souparno Adhikary <souparnoa91 at gmail.com>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Torque MPICH jobs stuck
>
> We are using mpich2-1.4.1p1. I can give the latest version a try. My job
> script is as follows:
>
> #!/bin/sh
> #PBS -N asyn
> #PBS -q batch
> #PBS -l nodes=4:ppn=4
> #PBS -l walltime=120:00:00
> #PBS -V
> cd $PBS_O_WORKDIR
> mpirun -np 16 gmx_mpi mdrun -deffnm asyn_10ns
>
>
> Souparno Adhikary,
> CHPC Lab,
> Department of Microbiology,
> University of Calcutta.
>
> ------------------------------
>
> Message: 3
> Date: Thu, 31 Aug 2017 11:19:45 -0500
> From: Kenneth Raffenetti <raffenet at mcs.anl.gov>
> To: <discuss at mpich.org>
> Subject: Re: [mpich-discuss] question about -disable-auto-cleanup
>
> Hi Zaak,
>
> I'll try my best to explain here. There are a few things to consider.
>
> 1. Hydra: -disable-auto-cleanup means if an MPI process dies, let other
> processes in the job continue running. Since Hydra (mpiexec) is already
> monitoring MPI processes to detect when one dies, there is no impact
> inside Hydra from passing this option.
>
> 2. Application behavior: Since the default error handler in MPI is
> MPI_ERRORS_ARE_FATAL, some applications may rely on that fact and expect
> a running job to be aborted/cleaned up if a process quits. With
> -disable-auto-cleanup this will no longer be the case. An application
> can call MPI_Abort() to force the old behavior, however.
>
> Ken
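
A minimal sketch of the application-side pattern described here, with an
illustrative helper name and an illustrative receive; whether a failed
communication is reported back as an error or simply hangs depends on how
much fault-tolerance support the MPI library provides, which is the caveat
raised at the top of this thread:

#include <mpi.h>
#include <stdio.h>

/* Report errors to the caller instead of aborting the whole job, and fall
 * back to the old behavior explicitly with MPI_Abort() when communication
 * with a (possibly dead) partner fails. */
static void recv_or_abort(MPI_Comm comm, int partner)
{
    int value = 0;

    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    if (MPI_Recv(&value, 1, MPI_INT, partner, 0, comm,
                 MPI_STATUS_IGNORE) != MPI_SUCCESS) {
        fprintf(stderr, "communication with rank %d failed; aborting job\n",
                partner);
        MPI_Abort(comm, 1);
    }
}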