[mpich-discuss] Unexpected "Bad termination"

Luiz Carlos da Costa Junior lccostajr at gmail.com
Fri Jan 29 12:00:55 CST 2016


Pavan, I am not suspicious of MPICH itself; it is very stable and we have
been using it for almost a decade.
I was thinking of some kind of bad interaction with our OS, as Daniel
suggested.

My intention here is simply to draw on your experience.
As I mentioned, the problem isn't easy to reproduce, and when it happens
it leaves no clues.

I forgot to mention that the problem happens at the very beginning of my
application's execution (I can't be sure, though, since there is no output
from it on screen, not even the "logo" with the program version).
In the case of a segmentation fault, as Halim said, am I supposed to get
any output related to it? If so, is there something I can do to enable or
redirect that output?

Thanks again for your help.

Best regards,
Luiz

On 29 January 2016 at 15:26, Daniel Ibanez <dan.a.ibanez at gmail.com> wrote:

> You're right, Pavan, I got mixed up.
> I think it is the "assert (!closed) failed" that confuses me,
> the other messages are clear enough.
>
> On Fri, Jan 29, 2016 at 12:09 PM, Balaji, Pavan <balaji at anl.gov> wrote:
>
>>
>> Every time the application (not MPICH) has an error, the same error
>> message is shown.
>>
>> I'm going to go and delete that error message now since users keep
>> thinking this is a problem in MPICH, while MPICH is only trying to be
>> helpful and telling them that "this is not MPICH's fault.  your application
>> is broken."  Maybe I should just say that in the error message.
>>
>>   -- Pavan
>>
>> > On Jan 29, 2016, at 10:49 AM, Daniel Ibanez <dan.a.ibanez at gmail.com>
>> wrote:
>> >
>> > Hello all,
>> >
>> > Sorry to intrude on someone else's question,
>> > but we also see this error message on our Linux cluster,
>> > with code that we're very confident is free of MPI bugs
>> > (has run on most of Mira, etc...).
>> >
>> > The MPICH build in question was compiled on our
>> > workstations, where it runs well
>> > (up to the 16 ranks their CPUs handle).
>> > However, using that MPICH install on the cluster
>> > causes this failure even for small (2-4 rank) runs.
>> >
>> > I'm saying this to suggest that it has to do with interactions
>> > of MPICH and the architecture/OS, maybe also with
>> > MPICH build configuration.
>> >
>> > I'll try to come up with a reproducing test program today.
>> >
>> > On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer at anl.gov> wrote:
>> > Luiz,
>> >
>> > You are experiencing a segmentation fault.
>> > However, we don't have enough information to pinpoint the source of the
>> problem. To help with debugging, we usually ask for a small piece of code
>> that reproduces the bug.
>> >
>> > --Halim
>> >
>> > www.mcs.anl.gov/~aamer
>> >
>> >
>> > On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
>> > Dear all,
>> >
>> > We have been using MPICH with our software, running on Amazon AWS
>> > Linux servers, for a long time.
>> > Our production environment used MPICH version 1.4.1p1 (which, I know,
>> > is very old), but it has been very stable in recent years.
>> > Recently, however, we have been facing a "Bad termination" problem once
>> > in a while, so we decided to investigate the issue.
>> > In principle, we have no apparent reason to believe that the
>> > problem lies in our code, since there were no changes that would explain
>> > this behavior.
>> > The other point is that it occurs in an intermittent fashion; if we run
>> > the program again it doesn't happen, so it has been difficult to
>> > debug/trace the source of the problem.
>> >
>> > Our first step, then, was to update MPICH to the latest version, 3.2.
>> > However, we faced the same problem (output below):
>> >
>> >
>> > =====================================================================================
>> > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> > =   EXIT CODE: 11
>> > =   CLEANING UP REMAINING PROCESSES
>> > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> > =====================================================================================
>> > [proxy:0:0 at ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
>> > [proxy:0:0 at ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> > [mpiexec at ip-10-137-129-86] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
>> > [mpiexec at ip-10-137-129-86] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
>> > [mpiexec at ip-10-137-129-86] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
>> > [mpiexec at ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>> >
>> > Do you have any clue about what might have been causing this problem?
>> > Any suggestion at this point would be highly appreciated.
>> >
>> > Best regards,
>> > Luiz
>> >
>> >
>> > _______________________________________________
>> > discuss mailing list     discuss at mpich.org
>> > To manage subscription options or unsubscribe:
>> > https://lists.mpich.org/mailman/listinfo/discuss
>> >
>>
>>
>
>
>

