[mpich-discuss] Unexpected "Bad termination"

Balaji, Pavan balaji at anl.gov
Fri Jan 29 12:23:58 CST 2016


Luiz,

The exit code (11) that was displayed indicates a segmentation fault (signal 11 is SIGSEGV), though a human-readable string saying so would have been better.

Can you try attaching a debugger?  If you have ddd installed, you can do this:

% mpiexec -np <num_procs> ddd ./your_application <your_application_parameters>

This will fire up a bunch of ddd windows, one for each process.
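
If ddd is not available on the nodes, gdb works the same way, one xterm per process (this assumes X forwarding to the node is set up, and reuses the placeholders from the command above):

% mpiexec -np <num_procs> xterm -e gdb --args ./your_application <your_application_parameters>

Another option, when no display is available, is to enable core dumps and inspect the crash afterwards.  This is only a sketch: the core file name and location depend on your system's core_pattern setting, and on a multi-node run the ulimit has to be in effect on every node.

% ulimit -c unlimited
% mpiexec -np <num_procs> ./your_application <your_application_parameters>
% gdb ./your_application core        # then type "bt" for a backtrace

Either way, the backtrace should show where the segmentation fault happens.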

  -- Pavan

> On Jan 29, 2016, at 12:00 PM, Luiz Carlos da Costa Junior <lccostajr at gmail.com> wrote:
> 
> Pavan, I am not suspicious of MPICH itself; it is very stable and we have been using it for almost a decade.
> I was thinking of some kind of bad interaction with our OS, like Daniel said.
> 
> My intention here is only to try to get some help from your experience.
> As I commented, it isn't easy to reproduce the problem, and when it happens there are no clues about it.
> 
> I forgot to mention that the problem happens at the very beginning of my application's execution (I am not sure, since there is no output from it on screen, not even the "logo" with the program version).
> In the case of a segmentation fault, as Halim said, am I supposed to get any output related to it? If so, is there something I can do to activate or redirect it?
> 
> Thanks again for your help.
> 
> Best regards,
> Luiz
> 
> On 29 January 2016 at 15:26, Daniel Ibanez <dan.a.ibanez at gmail.com> wrote:
> You're right, Pavan, I got mixed up.
> I think it is the "assert (!closed) failed" that confuses me;
> the other messages are clear enough.
> 
> On Fri, Jan 29, 2016 at 12:09 PM, Balaji, Pavan <balaji at anl.gov> wrote:
> 
> Every time the application (not MPICH) has an error, the same error message is shown.
> 
> I'm going to go and delete that error message now, since users keep thinking this is a problem in MPICH, when MPICH is only trying to be helpful by telling them "this is not MPICH's fault; your application is broken."  Maybe I should just say that in the error message.
> 
>   -- Pavan
> 
> > On Jan 29, 2016, at 10:49 AM, Daniel Ibanez <dan.a.ibanez at gmail.com> wrote:
> >
> > Hello all,
> >
> > Sorry to intrude on someone else's question,
> > but we also see this error message on our Linux cluster,
> > with code that we're very confident is free of MPI bugs
> > (it has run on most of Mira, etc.).
> >
> > The MPICH build being used was compiled on our
> > workstations, and runs well on the workstations
> > (up to the 16 ranks their CPUs handle).
> > However, using that MPICH install on the cluster
> > causes this error for small (2-4 rank) runs.
> >
> > I'm saying this to suggest that it has to do with the interaction
> > between MPICH and the architecture/OS, and maybe also with the
> > MPICH build configuration.
> >
> > I'll try to come up with a reproducing test program today.
> >
> > On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer at anl.gov> wrote:
> > Luiz,
> >
> > You are experiencing a segmentation fault.
> > We don't have enough information to pinpoint the source of the problem, however. We usually require a small piece of code that reproduces the bug to help debugging.
> >
> > --Halim
> >
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
> > Dear all,
> >
> > We have been using MPICH with our software, running it on Amazon AWS
> > Linux servers, for a long time.
> > In our production environment we used to have MPICH version 1.4.1p1
> > (which, I know, is very old), but it has been very stable in recent years.
> > However, recently we have been facing a "Bad termination" problem once
> > in a while, so we decided to investigate this issue.
> > In principle, we have no apparent reason to believe that the problem
> > lies in our code, since there were no changes that would explain this
> > behavior.
> > The other point is that it occurs in an intermittent fashion: if we run
> > the program again it doesn't happen, so it has been difficult to
> > debug/trace the source of the problem.
> >
> > Our first step, then, was to update MPICH to the latest version, 3.2.
> > However, we faced the same problem (output below):
> >
> >     =====================================================================================
> >     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> >     =   EXIT CODE: 11
> >     =   CLEANING UP REMAINING PROCESSES
> >     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >     =====================================================================================
> >     [proxy:0:0 at ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
> >     (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> >     [proxy:0:0 at ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
> >     (./tools/demux/demux_poll.c:77): callback returned error status
> >     [mpiexec at ip-10-137-129-86] HYDT_bscu_wait_for_completion
> >     (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
> >     terminated badly; aborting
> >     [mpiexec at ip-10-137-129-86] HYDT_bsci_wait_for_completion
> >     (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> >     waiting for completion
> >     [mpiexec at ip-10-137-129-86] HYD_pmci_wait_for_completion
> >     (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting
> >     for completion
> >     [mpiexec at ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
> >     manager error waiting for completion
> >
> >
> > Do you have any clue about what might have been causing this problem?
> > Any suggestion at this point would be highly appreciated.
> >
> > Best regards,
> > Luiz

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
