[mpich-discuss] Unexpected "Bad termination"
Daniel Ibanez
dan.a.ibanez at gmail.com
Fri Jan 29 10:49:27 CST 2016
Hello all,
Sorry to intrude on someone else's question,
but we also see this error message on our Linux cluster,
with code that we're very confident is free of MPI bugs
(has run on most of Mira, etc...).
The MPICH build being used was compiled on our
workstations, and runs well on the workstations
(up to the 16 ranks their CPUs handle).
However, using that same MPICH install on the cluster
produces this error even for small (2-4 rank) runs.
I mention this because it suggests the problem has to do with the
interaction between MPICH and the architecture/OS, and maybe also with
the MPICH build configuration.
I'll try to come up with a reproducing test program today.
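
A side note on the "EXIT CODE: 11" in Luiz's output quoted below: on Linux,
signal number 11 is SIGSEGV, which is presumably why it is being read as a
segmentation fault. This assumes the launcher is reporting the signal number
in that field. The sketch below is not MPI code; it only illustrates how a
parent process sees a child that died from SIGSEGV. The helper name
describe_exit is made up for this example.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Return a human-readable description of a child's exit status."""
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


# Child process that sends SIGSEGV to itself, standing in for a crashing rank.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
result = subprocess.run([sys.executable, "-c", child_code])

print(int(signal.SIGSEGV))  # the numeric value of SIGSEGV on this platform
print(describe_exit(result.returncode))
```

The same mapping can be checked from a shell with "kill -l 11".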
On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer at anl.gov> wrote:
> Luiz,
>
> You are experiencing a segmentation fault.
> We don't have enough information to pinpoint the source of the problem,
> however. We usually require a small piece of code that reproduces the bug
> to help with debugging.
>
> --Halim
>
> www.mcs.anl.gov/~aamer
>
>
> On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
>
>> Dear all,
>>
>> We have been using MPICH with our software and running it on
>> Amazon AWS Linux servers for a long time.
>> Our production environment used MPICH version 1.4.1p1 (which, I know,
>> is very old), but it has been very stable in recent
>> years.
>> However, recently we have been facing a "Bad termination" problem once
>> in a while, so we decided to investigate this issue.
>> In principle, we have no apparent reason to believe that the
>> problem lies in our code, since there were no changes that would explain
>> this behavior.
>> The other point is that it occurs intermittently: if we run
>> the program again it doesn't happen, so it has been difficult to
>> debug/trace the source of the problem.
>>
>> Our first step, then, was to update the MPI version to the latest
>> version 3.2.
>> However, we faced the same problem (output below):
>>
>>
>> =====================================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = EXIT CODE: 11
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> =====================================================================================
>> [proxy:0:0 at ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
>> (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
>> [proxy:0:0 at ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at ip-10-137-129-86] HYDT_bscu_wait_for_completion
>> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
>> terminated badly; aborting
>> [mpiexec at ip-10-137-129-86] HYDT_bsci_wait_for_completion
>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>> waiting for completion
>> [mpiexec at ip-10-137-129-86] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting
>> for completion
>> [mpiexec at ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
>> manager error waiting for completion
>>
>>
>> Do you have any clue about what might have been causing this problem?
>> Any suggestion at this point would be highly appreciated.
>>
>> Best regards,
>> Luiz
>>
>>
>> _______________________________________________
>> discuss mailing list discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss