[mpich-discuss] Unexpected "Bad termination"

Luiz Carlos da Costa Junior lccostajr at gmail.com
Mon Feb 1 02:07:20 CST 2016


Pavan, I will try to attach a debugger to my application as you suggested.
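I will also enable core dumps, so that if it really is a segmentation fault I can get a backtrace after the fact. A rough sketch of what I plan to try (assuming core files are allowed on the nodes and the binary was built with -g; the core file name may vary with the system's core_pattern):

% ulimit -c unlimited                                  # allow core files to be written
% mpiexec -np <num_procs> ./my_application <params>    # reproduce the crash
% gdb ./my_application core                            # open the core left by the crashed rank
(gdb) bt                                               # print the backtrace at the crash point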
Thanks. Regards, Luiz

On 29 January 2016 at 16:23, Balaji, Pavan <balaji at anl.gov> wrote:

> Luiz,
>
> The exit code (11) that was displayed indicates a segmentation fault, though
> a human-readable string saying so would have been better.
>
> Can you try attaching a debugger?  If you have ddd installed, you can do
> this:
>
> % mpiexec -np <num_procs> ddd ./your_application
> <your_application_parameters>
>
> This will fire up a bunch of ddd windows, one for each process.
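>
> If ddd is not available on the compute nodes, a similar approach with gdb
> inside xterm windows should also work (a rough sketch, assuming gdb and
> xterm are installed and X forwarding to your display is set up):
>
> % mpiexec -np <num_procs> xterm -e gdb --args ./your_application
> <your_application_parameters>
>
> Each xterm then holds one gdb session per process; type "run" in each, and
> gdb will stop at the instruction that caused the segmentation fault.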
>
>   -- Pavan
>
> > > On Jan 29, 2016, at 12:00 PM, Luiz Carlos da Costa Junior <lccostajr at gmail.com> wrote:
> >
> > Pavan, I am not suspicious of MPICH itself; it is very stable and we
> > have been using it for almost a decade.
> > I was thinking of some kind of bad interaction with our OS, like Daniel
> > said.
> >
> > My intention here is only to try to get some help from your experience.
> > As I commented, it isn't easy to reproduce the problem, and when it
> > happens there are no clues about it.
> >
> > I forgot to mention that the problem happens at the very beginning of the
> > execution of my application (I am not sure, since there is no output from
> > it on screen, not even the "logo" with the program version).
> > If it is a segmentation fault, as Halim said, am I supposed to get any
> > output related to it? If so, is there something I can do to activate or
> > redirect it?
> >
> > Thanks again for your help.
> >
> > Best regards,
> > Luiz
> >
> > On 29 January 2016 at 15:26, Daniel Ibanez <dan.a.ibanez at gmail.com> wrote:
> > You're right, Pavan, I got mixed up.
> > I think it is the "assert (!closed) failed" part that confuses me;
> > the other messages are clear enough.
> >
> > On Fri, Jan 29, 2016 at 12:09 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> >
> > Every time the application (not MPICH) has an error, the same error
> > message is shown.
> >
> > I'm going to go and delete that error message now, since users keep
> > thinking this is a problem in MPICH, while MPICH is only trying to be
> > helpful and telling them that "this is not MPICH's fault.  Your application
> > is broken."  Maybe I should just say that in the error message.
> >
> >   -- Pavan
> >
> > > On Jan 29, 2016, at 10:49 AM, Daniel Ibanez <dan.a.ibanez at gmail.com> wrote:
> > >
> > > Hello all,
> > >
> > > Sorry to intrude on someone else's question,
> > > but we also see this error message on our Linux cluster,
> > > with code that we're very confident is free of MPI bugs
> > > (has run on most of Mira, etc...).
> > >
> > > The MPICH build being used was compiled on our
> > > workstations, and runs well on the workstations
> > > (up to the 16 ranks their CPUs handle).
> > > However, using that MPICH install on the cluster
> > > causes this error for small (2-4 rank) runs.
> > >
> > > I'm saying this to suggest that it has to do with interactions
> > > between MPICH and the architecture/OS, and maybe also with
> > > the MPICH build configuration.
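> > >
> > > One way to compare the workstation and cluster installs is to check the
> > > build configuration on each side, a rough sketch (assuming the
> > > mpichversion tool was installed alongside mpiexec):
> > >
> > > % mpichversion        # prints the MPICH version and configure options
> > > % mpiexec -info       # prints the Hydra build details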
> > >
> > > I'll try to come up with a reproducing test program today.
> > >
> > > On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer at anl.gov> wrote:
> > > Luiz,
> > >
> > > You are experiencing a segmentation fault.
> > > We don't have enough information to pinpoint the source of the
> > > problem, however. We usually require a small piece of code that
> > > reproduces the bug to help with debugging.
> > >
> > > --Halim
> > >
> > > www.mcs.anl.gov/~aamer
> > >
> > >
> > > On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
> > > Dear all,
> > >
> > > We have been using MPICH with our software, running it on Amazon AWS
> > > Linux servers, for a long time.
> > > Our production environment has MPICH version 1.4.1p1 (which, I know, is
> > > very old), but it has been very stable in recent years.
> > > However, recently we have been facing a "Bad termination" problem once
> > > in a while, so we decided to investigate this issue.
> > > In principle, we don't have an apparent reason to believe that the
> > > problem lies in our code, since there were no changes that would explain
> > > this behavior.
> > > The other point is that it occurs in an intermittent fashion: if we run
> > > the program again it doesn't happen, so it has been difficult to
> > > debug/trace the source of the problem.
> > >
> > > Our first step, then, was to update MPICH to the latest version, 3.2.
> > > However, we faced the same problem (output below):
> > >
> > >
> > >     =====================================================================================
> > >     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > >     =   EXIT CODE: 11
> > >     =   CLEANING UP REMAINING PROCESSES
> > >     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > >     =====================================================================================
> > >     [proxy:0:0 at ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
> > >     (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> > >     [proxy:0:0 at ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
> > >     (./tools/demux/demux_poll.c:77): callback returned error status
> > >     [mpiexec at ip-10-137-129-86] HYDT_bscu_wait_for_completion
> > >     (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
> > >     terminated badly; aborting
> > >     [mpiexec at ip-10-137-129-86] HYDT_bsci_wait_for_completion
> > >     (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> > >     waiting for completion
> > >     [mpiexec at ip-10-137-129-86] HYD_pmci_wait_for_completion
> > >     (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting
> > >     for completion
> > >     [mpiexec at ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
> > >     manager error waiting for completion
> > >
> > >
> > > Do you have any clue about what might have been causing this problem?
> > > Any suggestion at this point would be highly appreciated.
> > >
> > > Best regards,
> > > Luiz
> > >
> > >
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


More information about the discuss mailing list