[mpich-discuss] Unexpected "Bad termination"

Daniel Ibanez dan.a.ibanez at gmail.com
Fri Jan 29 11:26:34 CST 2016


You're right, Pavan, I got mixed up.
I think it is the "assert (!closed) failed" that confuses me,
the other messages are clear enough.
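
For anyone reading the output quoted below, a small sketch of how the
"EXIT CODE: 11" line maps to a segmentation fault. It assumes the number
shown is the signal that ended the process, which is how Halim reads it;
the helper name describe_exit_code is made up for illustration and is not
part of MPICH.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return the name of the signal whose number equals `code`, if any.

    Assumption: the launcher printed the terminating signal number
    as the exit code, so 11 is looked up as a signal.
    """
    try:
        return signal.Signals(code).name
    except ValueError:
        # The number does not correspond to any signal on this platform.
        return "not a signal number"


if __name__ == "__main__":
    print(describe_exit_code(11))
```

On Linux, signal 11 is SIGSEGV, so the crash is in one of the application
processes; the assert and "callback returned error" lines that follow are
cleanup noise from the launcher.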

On Fri, Jan 29, 2016 at 12:09 PM, Balaji, Pavan <balaji at anl.gov> wrote:

>
> Every time the application (not MPICH) has an error, the same error
> message is shown.
>
> I'm going to go and delete that error message now since users keep
> thinking this is a problem in MPICH, while MPICH is only trying to be
> helpful and telling them that "this is not MPICH's fault.  your application
> is broken."  Maybe I should just say that in the error message.
>
>   -- Pavan
>
> > On Jan 29, 2016, at 10:49 AM, Daniel Ibanez <dan.a.ibanez at gmail.com> wrote:
> >
> > Hello all,
> >
> > Sorry to intrude on someone else's question,
> > but we also see this error message on our Linux cluster,
> > with code that we're very confident is free of MPI bugs
> > (has run on most of Mira, etc...).
> >
> > The MPICH build being used was compiled on our
> > workstations, and runs well on the workstations
> > (up to the 16 ranks their CPUs handle).
> > However, using that MPICH install on the cluster
> > causes this for small (2-4) rank operations.
> >
> > I'm saying this to suggest that it has to do with interactions
> > of MPICH and the architecture/OS, maybe also with
> > MPICH build configuration.
> >
> > I'll try to come up with a reproducing test program today.
> >
> > On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer at anl.gov> wrote:
> > Luiz,
> >
> > You are experiencing a segmentation fault.
> > We don't have enough information to pinpoint the source of the problem,
> > however. We usually require a small piece of code that reproduces the bug
> > to help debugging.
> >
> > --Halim
> >
> > www.mcs.anl.gov/~aamer
> >
> >
> > On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
> > Dear all,
> >
> > We have been using MPICH with our software, running on Amazon AWS Linux
> > servers, for a long time.
> > Our production environment has been running MPICH version 1.4.1p1
> > (which, I know, is very old), and it has been very stable in recent years.
> > However, recently we have been facing a "Bad termination" problem once
> > in a while, so we decided to investigate this issue.
> > In principle, we have no apparent reason to believe that the
> > problem lies in our code, since there were no changes that would explain
> > this behavior.
> > The other point is that it occurs intermittently: if we run
> > the program again it doesn't happen, so it has been difficult to
> > debug/trace the source of the problem.
> >
> > Our first step, then, was to update the MPI version to the latest
> > version 3.2.
> > However, we faced the same problem (output below):
> >
> >
> >     =====================================================================================
> >     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> >     =   EXIT CODE: 11
> >     =   CLEANING UP REMAINING PROCESSES
> >     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >     =====================================================================================
> >     [proxy:0:0 at ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
> >     (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> >     [proxy:0:0 at ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
> >     (./tools/demux/demux_poll.c:77): callback returned error status
> >     [mpiexec at ip-10-137-129-86] HYDT_bscu_wait_for_completion
> >     (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
> >     terminated badly; aborting
> >     [mpiexec at ip-10-137-129-86] HYDT_bsci_wait_for_completion
> >     (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> >     waiting for completion
> >     [mpiexec at ip-10-137-129-86] HYD_pmci_wait_for_completion
> >     (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting
> >     for completion
> >     [mpiexec at ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
> >     manager error waiting for completion
> >
> >
> > Do you have any clue about what might have been causing this problem?
> > Any suggestion at this point would be highly appreciated.
> >
> > Best regards,
> > Luiz
> >
> >
> > _______________________________________________
> > discuss mailing list     discuss at mpich.org
> > To manage subscription options or unsubscribe:
> > https://lists.mpich.org/mailman/listinfo/discuss