[mpich-discuss] Unexpected "Bad termination"
Balaji, Pavan
balaji at anl.gov
Fri Jan 29 11:09:29 CST 2016
Every time the application (not MPICH) has an error, the same error message is shown.
I'm going to go and delete that error message now, since users keep thinking this is a problem in MPICH, while MPICH is only trying to be helpful and tell them "this is not MPICH's fault; your application is broken." Maybe I should just say that in the error message.
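Concretely, this is all it takes to make hydra print that banner. A minimal sketch (the file name and the deliberate NULL write are mine, purely for illustration):

/* crash.c: one rank dereferences NULL; mpiexec then prints the
 * "BAD TERMINATION ... EXIT CODE: 11" banner. The 11 is the signal
 * number of the segfault, raised by the application, not by MPICH. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        int *p = NULL;
        *p = 42;    /* the application bug */
    }
    printf("rank %d finished cleanly\n", rank);
    MPI_Finalize();
    return 0;
}

Build and run it with something like "mpicc crash.c -o crash && mpiexec -n 2 ./crash" and you should see a banner just like the one quoted below.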
-- Pavan
> On Jan 29, 2016, at 10:49 AM, Daniel Ibanez <dan.a.ibanez at gmail.com> wrote:
>
> Hello all,
>
> Sorry to intrude on someone else's question,
> but we also see this error message on our Linux cluster,
> with code that we're very confident is free of MPI bugs
> (it has run on most of Mira, etc.).
>
> The MPICH build being used was compiled on our
> workstations, and it runs well there
> (up to the 16 ranks their CPUs handle).
> However, using that MPICH install on the cluster
> triggers this error even for small (2-4 rank) runs.
>
> I mention this to suggest that the problem has to do with
> interactions between MPICH and the architecture/OS, and perhaps
> also with the MPICH build configuration.
>
> I'll try to come up with a reproducing test program today.
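> In the meantime, here is roughly the shape of test I have in mind;
> an untested sketch, nothing more (plain point-to-point traffic at the
> small rank counts that trigger the problem for us):
>
> /* ring.c: pass a token once around the ranks. */
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     int rank, size, token = 0;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     int next = (rank + 1) % size;
>     int prev = (rank + size - 1) % size;
>     if (rank == 0) {
>         /* rank 0 starts the ring, then waits for the token to return */
>         token = 1;
>         MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
>         MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     } else {
>         /* everyone else receives, then forwards */
>         MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
>     }
>     printf("rank %d passed the token\n", rank);
>     MPI_Finalize();
>     return 0;
> }
>
> Run with mpiexec -n 2 through -n 4 to match the failing configuration.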
>
> On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer at anl.gov> wrote:
> Luiz,
>
> You are experiencing a segmentation fault.
> However, we don't have enough information to pinpoint the source of the problem. To help with debugging, we usually need a small piece of code that reproduces the bug.
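>
> In the meantime, one thing that sometimes helps locate the faulting
> frame without attaching a debugger is a SIGSEGV handler that dumps a
> backtrace. A Linux/glibc-specific sketch (backtrace() is not
> async-signal-safe, so treat this strictly as a best-effort debugging
> aid, not production code):
>
> #include <execinfo.h>
> #include <signal.h>
> #include <unistd.h>
>
> static void segv_handler(int sig)
> {
>     void *frames[32];
>     int n = backtrace(frames, 32);
>     backtrace_symbols_fd(frames, n, STDERR_FILENO);  /* symbols to stderr */
>     _exit(sig);  /* keep a nonzero exit code for mpiexec to report */
> }
>
> /* install it early, e.g. right after MPI_Init:
>  *     signal(SIGSEGV, segv_handler);
>  */
>
> Compile with -g -rdynamic so backtrace_symbols_fd can print function names.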
>
> --Halim
>
> www.mcs.anl.gov/~aamer
>
>
> On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
> Dear all,
>
> We have been using MPICH with our software, running on Amazon AWS
> Linux servers, for a long time.
> In our production environment we have MPICH version 1.4.1p1 (which,
> I know, is very old), but it has been very stable in recent years.
> Recently, however, we have been facing a "Bad termination" problem
> once in a while, so we decided to investigate.
> In principle, we have no apparent reason to believe that the problem
> lies in our code, since there were no changes that would explain this
> behavior.
> The other point is that it occurs intermittently: if we run the
> program again, it doesn't happen. This has made the source of the
> problem difficult to debug and trace.
>
> Our first step, then, was to update MPICH to the latest version, 3.2.
> However, we faced the same problem (output below):
>
> =====================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0 at ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0 at ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at ip-10-137-129-86] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
> terminated badly; aborting
> [mpiexec at ip-10-137-129-86] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> waiting for completion
> [mpiexec at ip-10-137-129-86] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting
> for completion
> [mpiexec at ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
> manager error waiting for completion
>
>
> Do you have any clue about what might be causing this problem?
> Any suggestion at this point would be highly appreciated.
>
> Best regards,
> Luiz
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss