[mpich-discuss] question about -disable-auto-cleanup
Kenneth Raffenetti
raffenet at mcs.anl.gov
Thu Aug 31 12:40:48 CDT 2017
Correct, there is no MPI-level query mechanism without the ULFM extensions.
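When the ULFM extensions are present (MPIX_-prefixed in MPICH builds that enable them), the query path looks roughly like the sketch below. This is a non-authoritative illustration assuming a ULFM-enabled build; the MPIX_ names follow the ULFM proposal and may not exist in a stock installation.

```c
/* Sketch: ask which ranks have failed, assuming ULFM (MPIX_) support.
 * Without these extensions there is no MPI-level way to ask at all. */
#include <mpi.h>
#include <stdio.h>

void report_failures(MPI_Comm comm)
{
    MPI_Group failed;
    int nfailed;

    /* Acknowledge locally known failures, then retrieve them as a group. */
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_failure_get_acked(comm, &failed);
    MPI_Group_size(failed, &nfailed);

    if (nfailed > 0) {
        fprintf(stderr, "%d rank(s) have failed\n", nfailed);
        /* A surviving rank could decide here to call MPI_Abort(comm, 1). */
    }
    MPI_Group_free(&failed);
}
```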
Ken
On 08/31/2017 11:49 AM, Zaak Beekman wrote:
> Ken,
>
> Thanks so much for your concise response!
>
> Is my understanding correct that if one MPI rank, or all the ranks on a
> given node, were to fail (say, from a hardware issue), there is no means
> of determining that this has happened unless the MPI library supports
> the user-level failure mitigation (ULFM) features? So if one were to
> pass --disable-auto-cleanup when the runtime didn't have support for
> ULFM, and processes on remote MPI ranks died, there'd be no way to query
> for them and call MPI_Abort from a different, still-running rank?
>
> Thanks again,
> Zaak
>
> On Thu, Aug 31, 2017 at 12:19 PM <discuss-request at mpich.org> wrote:
>
> Send discuss mailing list submissions to
> discuss at mpich.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.mpich.org/mailman/listinfo/discuss
> or, via email, send a message with subject or body 'help' to
> discuss-request at mpich.org
>
> You can reach the person managing the list at
> discuss-owner at mpich.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of discuss digest..."
>
>
> Today's Topics:
>
> 1. question about -disable-auto-cleanup (Zaak Beekman)
> 2. Re: Torque MPICH jobs stuck (Souparno Adhikary)
> 3. Re: question about -disable-auto-cleanup (Kenneth Raffenetti)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 30 Aug 2017 17:29:11 +0000
> From: Zaak Beekman <zbeekman at gmail.com>
> To: discuss at mpich.org
> Subject: [mpich-discuss] question about -disable-auto-cleanup
> Message-ID: <CAAbnBwb8L2VtXqsbbGjZvSOHpTkd7mK6SeQuhHKTEG0brOPSBQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> OK, since there were no responses here to my previous email, perhaps a
> better question would be:
>
> What is a good resource to learn about the impact of passing
> `--disable-auto-cleanup` at runtime?
>
> Some Google searches bring up discussions of what appear to be bugs in
> the standard and/or implementation, but I'm not sure where to look to
> find out about even the intended runtime semantics.
>
> Any and all help pointing me in the right direction would be much
> appreciated.
>
> Thanks,
> Zaak
>
> On Wed, Aug 30, 2017 at 1:00 PM <discuss-request at mpich.org> wrote:
>
> >
> >
> > Today's Topics:
> >
> > 1. question about -disable-auto-cleanup (Zaak Beekman)
> > 2. Torque MPICH jobs stuck (Souparno Adhikary)
> > 3. Re: Torque MPICH jobs stuck (Halim Amer)
> >
> >
> >
> ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Tue, 29 Aug 2017 21:22:49 +0000
> > From: Zaak Beekman <zbeekman at gmail.com>
> > To: discuss at mpich.org
> > Subject: [mpich-discuss] question about -disable-auto-cleanup
> > Message-ID: <CAAbnBwZrQ03YmmmayhcHEywh8bEFMZ_AycBydOqZFB023KeJZQ at mail.gmail.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > I know that --disable-auto-cleanup is required to enable the
> > fault-tolerant MPI features, but are there downsides to passing it?
> > Performance implications?
> >
> > I ask because, over at https://github.com/sourceryinstitute/OpenCoarrays,
> > we've implemented much of the Fortran 2015 failed-images feature on top
> > of MPICH and other MPI implementations. But to use this,
> > --disable-auto-cleanup must be passed to mpiexec. We provide wrapper
> > scripts that try to abstract the back end (GASNet, MPI, OpenSHMEM, etc.)
> > in the form of a Fortran compiler wrapper and an executable launcher.
> > So I'm wondering: since failed images are part of the standard (2015),
> > would it be unwise to always pass --disable-auto-cleanup to mpiexec and
> > only turn off support when the user explicitly asks, or is it
> > safer/more performant to default to requiring the user to pass an
> > additional flag to our wrapper script that results in
> > --disable-auto-cleanup getting passed to mpiexec?
> >
> > Feedback would be much appreciated. Feel free to post responses at
> > https://github.com/sourceryinstitute/OpenCoarrays/issues/401 as well.
> >
> > Thanks,
> > Zaak
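One way to frame the design question above: the wrapper can default to the fault-tolerant launch and expose an explicit opt-out. A minimal sketch of that flag handling; the `build_launch_cmd` function and the `--no-fail-images` option are hypothetical names for illustration, not OpenCoarrays' actual interface, and `echo` stands in for exec'ing mpiexec.

```shell
# Hypothetical wrapper logic: pass --disable-auto-cleanup by default,
# drop it only when the user explicitly opts out. The function and
# option names here are illustrative, not the real OpenCoarrays CLI.
build_launch_cmd() {
    ft_flag="--disable-auto-cleanup"
    args=""
    for a in "$@"; do
        case "$a" in
            --no-fail-images) ft_flag="" ;;  # explicit opt-out
            *) args="$args $a" ;;
        esac
    done
    # A real wrapper would `exec mpiexec ...`; echo keeps this testable.
    echo mpiexec $ft_flag $args
}
```

The default-on choice matches the observation in the email that failed images are standard Fortran 2015 behavior, while still giving users a way back to the stock MPI semantics.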
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Wed, 30 Aug 2017 13:48:00 +0530
> > From: Souparno Adhikary <souparnoa91 at gmail.com>
> > To: discuss at mpich.org
> > Subject: [mpich-discuss] Torque MPICH jobs stuck
> > Message-ID: <CAL6QJ1BF8FAYAvLiyqtKGMo+6e_3vdSf95wmH2n2F8efHMyfCw at mail.gmail.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > I know this is not the proper place to discuss this, but as the
> > Torque-mpich list seems dead, I can't think of any other place to
> > post it.
> >
> > MPICH2 was installed on the servers. I installed Torque afterwards and
> > opened the ports by adding them to the iptables file.
> >
> > Torque MPI jobs (even simple ones like hostname) remain stuck, but the
> > jobs are properly distributed across the nodes and pbsnodes -a shows
> > them in order.
> >
> > The sched_log and server_logs files do not show anything unusual, so
> > it might be a problem with MPICH2.
> >
> > Can you please suggest where I can start troubleshooting?
> >
> > Thanks,
> >
> > Souparno Adhikary,
> > CHPC Lab,
> > Department of Microbiology,
> > University of Calcutta.
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Wed, 30 Aug 2017 11:00:51 -0500
> > From: Halim Amer <aamer at anl.gov>
> > To: discuss at mpich.org
> > Subject: Re: [mpich-discuss] Torque MPICH jobs stuck
> > Message-ID: <3a2d0cc3-51a5-c646-4afc-40ece230bb04 at anl.gov>
> > Content-Type: text/plain; charset="utf-8"; format=flowed
> >
> > Which MPICH version are you using? Have you tried the latest 3.2
> version?
> >
> > If it still fails, can you attach your simple Torque job script here?
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> >
> >
> > ------------------------------
> >
> > ------------------------------
> >
> > End of discuss Digest, Vol 58, Issue 18
> > ***************************************
> >
>
> ------------------------------
>
> Message: 2
> Date: Thu, 31 Aug 2017 13:00:34 +0530
> From: Souparno Adhikary <souparnoa91 at gmail.com>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Torque MPICH jobs stuck
> Message-ID: <CAL6QJ1BaHHTQdfESDxFEcxNes-ZyjQD48KXe5==FTS_9=4Mw8w at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> We are using mpich2-1.4.1p1; I can give the latest version a try. My
> job script is as follows:
>
> #!/bin/sh
> #PBS -N asyn
> #PBS -q batch
> #PBS -l nodes=4:ppn=4
> #PBS -l walltime=120:00:00
> #PBS -V
> cd $PBS_O_WORKDIR
> mpirun -np 16 gmx_mpi mdrun -deffnm asyn_10ns
>
>
> Souparno Adhikary,
> CHPC Lab,
> Department of Microbiology,
> University of Calcutta.
>
> ------------------------------
>
> Message: 3
> Date: Thu, 31 Aug 2017 11:19:45 -0500
> From: Kenneth Raffenetti <raffenet at mcs.anl.gov>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] question about -disable-auto-cleanup
> Message-ID: <8a531da2-8de2-70ab-9ff2-bd4e36660154 at mcs.anl.gov>
> Content-Type: text/plain; charset="utf-8"; format=flowed
>
> Hi Zaak,
>
> I'll try my best to explain here. There are a few things to consider.
>
> 1. Hydra: -disable-auto-cleanup means if an MPI process dies, let other
> processes in the job continue running. Since Hydra (mpiexec) is already
> monitoring MPI processes to detect when one dies, there is no impact
> inside Hydra from passing this option.
>
> 2. Application behavior: Since the default error handler in MPI is
> MPI_ERRORS_ARE_FATAL, some applications may rely on that fact and expect
> a running job to be aborted/cleaned up if a process quits. With
> -disable-auto-cleanup this will no longer be the case. An application
> can call MPI_Abort() to force the old behavior, however.
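Point 2 in practice: a rank that wants to survive a peer's failure switches the communicator's error handler from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN, checks return codes, and can still force the old job-wide teardown with MPI_Abort. A minimal sketch using standard MPI calls; the two-rank ping and the failure condition are illustrative only.

```c
/* Sketch: opt out of fatal-by-default error handling, then decide
 * explicitly whether to tear the whole job down. Standard MPI calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, rc;
    char buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The default handler is MPI_ERRORS_ARE_FATAL; when running under
     * -disable-auto-cleanup we instead ask for error codes back and
     * handle failures ourselves. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Send(&buf, 1, MPI_CHAR, (rank + 1) % 2, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* The peer may have died; restore the old "abort everything"
         * behavior on demand instead of relying on the runtime. */
        fprintf(stderr, "rank %d: send failed, aborting job\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}
```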
>
> Ken
>
>
>
> ------------------------------
>
>
> ------------------------------
>
> End of discuss Digest, Vol 58, Issue 19
> ***************************************
>
>
>
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss