You're right, Pavan, I got mixed up.
I think it is the "assert (!closed) failed" that confuses me;
the other messages are clear enough.

On Fri, Jan 29, 2016 at 12:09 PM, Balaji, Pavan <balaji@anl.gov> wrote:

Every time the application (not MPICH) has an error, this same error message is shown.

I'm going to go and delete that error message now, since users keep thinking this is a problem in MPICH, while MPICH is only trying to be helpful and tell them: "this is not MPICH's fault; your application is broken." Maybe I should just say that in the error message.
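
For illustration, a minimal sketch of the kind of application-side bug that produces this output (hypothetical code, not anyone's actual application): one rank dereferences a bad pointer and dies with SIGSEGV, which mpiexec reports as "EXIT CODE: 11" followed by the cleanup messages quoted below.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            volatile int *p = 0;
            *p = 42;    /* application bug: SIGSEGV (signal 11) */
        }
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }

Nothing in the resulting output points at a defect inside MPICH; the process manager is only reporting how the application process died.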

  -- Pavan

> On Jan 29, 2016, at 10:49 AM, Daniel Ibanez <dan.a.ibanez@gmail.com> wrote:
>
> Hello all,
>
> Sorry to intrude on someone else's question,
> but we also see this error message on our Linux cluster,
> with code that we're very confident is free of MPI bugs
> (it has run on most of Mira, etc.).
>
> The MPICH build being used was compiled on our
> workstations, and it runs well there
> (up to the 16 ranks their CPUs handle).
> However, using that MPICH install on the cluster
> triggers this error even for small (2-4 rank) runs.
>
> I mention this to suggest that the problem involves interactions
> between MPICH and the architecture/OS, and maybe also the
> MPICH build configuration.
>
> I'll try to come up with a reproducing test program today.
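> Roughly, I expect it to look like the sketch below (hypothetical,
> assuming the "assert (!closed)" fires when a rank dies before
> reaching MPI_Finalize, so its connection to the hydra proxy closes
> unexpectedly):
>
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         if (rank == 1)
>             exit(1);                 /* dies without MPI_Finalize */
>         MPI_Barrier(MPI_COMM_WORLD); /* remaining ranks wait here */
>         MPI_Finalize();
>         return 0;
>     }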
>
> On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer@anl.gov> wrote:
> Luiz,
>
> You are experiencing a segmentation fault.
> However, we don't have enough information to pinpoint the source of the problem. We usually need a small piece of code that reproduces the bug to help with debugging.
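> For an intermittent crash like this, it can also help to dump a
> backtrace at the moment of the fault. A sketch, assuming Linux/glibc
> (compile with -g -rdynamic so the frames get symbol names):
>
>     #include <execinfo.h>
>     #include <signal.h>
>     #include <unistd.h>
>
>     static void segv_handler(int sig)
>     {
>         void *frames[64];
>         int n = backtrace(frames, 64);
>         /* backtrace_symbols_fd is async-signal-safe */
>         backtrace_symbols_fd(frames, n, STDERR_FILENO);
>         _exit(sig); /* die so mpiexec still sees the failure */
>     }
>
>     /* early in main(), before MPI_Init: */
>     /*     signal(SIGSEGV, segv_handler); */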
>
> --Halim
>
> www.mcs.anl.gov/~aamer
>
>
> On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
> Dear all,
>
> We have been using MPICH with our software, running on Amazon AWS
> Linux servers, for a long time.
> Our production environment has MPICH version 1.4.1p1 (which, I know,
> is very old), but it has been very stable in recent years.
> Recently, however, we have been facing a "Bad termination" problem
> once in a while, so we decided to investigate the issue.
> In principle, we have no apparent reason to believe that the problem
> lies in our code, since there were no changes that would explain
> this behavior.
> The other point is that it occurs intermittently: if we run the
> program again it doesn't happen, so it has been difficult to
> debug/trace the source of the problem.
>
> Our first step, then, was to update MPICH to the latest version, 3.2.
> However, we faced the same problem (output below):
>
> =====================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0@ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0@ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec@ip-10-137-129-86] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
> terminated badly; aborting
> [mpiexec@ip-10-137-129-86] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> waiting for completion
> [mpiexec@ip-10-137-129-86] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting
> for completion
> [mpiexec@ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
> manager error waiting for completion
>
>
> Do you have any clue about what might be causing this problem?
> Any suggestion at this point would be highly appreciated.
>
> Best regards,
> Luiz
>

_______________________________________________
discuss mailing list     discuss@mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss