[mpich-discuss] Unexpected "Bad termination"

Luiz Carlos da Costa Junior lccostajr at gmail.com
Fri Jan 29 10:27:48 CST 2016


Dear all,

We have been using MPICH with our software, running it on Amazon AWS
Linux servers, for a long time.
Our production environment has used MPICH version 1.4.1p1 (which, I
know, is very old), and it has been very stable in recent years.
However, we have recently been hitting a "Bad termination" problem once in
a while, so we decided to investigate.
In principle, we have no apparent reason to believe that the problem
lies in our code, since there were no changes that would explain this
behavior.
The other point is that it occurs intermittently: if we run the
program again it doesn't happen, so it has been difficult to debug or
trace the source of the problem.

Our first step, then, was to update MPICH to the latest version, 3.2.
However, we hit the same problem (output below):

> =====================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0 at ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0 at ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at ip-10-137-129-86] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated
> badly; aborting
> [mpiexec at ip-10-137-129-86] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at ip-10-137-129-86] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for
> completion
> [mpiexec at ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
> manager error waiting for completion
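
One possible reading of the "EXIT CODE: 11" line above, offered as an assumption rather than something confirmed for this MPICH version: if the number is a raw POSIX wait status, then 11 would mean the process was killed by signal 11, which is SIGSEGV (segmentation fault) on Linux. A minimal Python sketch of that decoding follows; `describe_status` is a hypothetical helper name, not part of MPICH.

```python
import os
import signal


def describe_status(status: int) -> str:
    """Describe a raw POSIX wait status, as returned by os.wait()/os.waitpid().

    Assumption: the number reported by the launcher is such a raw status.
    """
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {sig} ({name})"
    if os.WIFEXITED(status):
        return f"exited normally with code {os.WEXITSTATUS(status)}"
    return f"unrecognized status {status}"


# Decode the value shown in the output above.
print(describe_status(11))
```

On Linux this reports signal 11 as SIGSEGV. If that reading is right, the failing rank crashed on an invalid memory access, and the proxy and mpiexec messages would be a consequence of that crash rather than its cause.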


Do you have any clue about what might have been causing this problem?
Any suggestion at this point would be highly appreciated.

Best regards,
Luiz