Pavan, I am not suspicious of MPICH itself; it is very stable and we have been using it for almost a decade.
I was thinking of some kind of bad interaction with our OS, as Daniel said.

My intention here is only to get some help from your experience.
As I mentioned, the problem is not easy to reproduce, and when it does happen there are no clues about it.

I forgot to mention that the problem occurs at the very beginning of my application's execution (I cannot be sure, since there is no output from it on screen, not even the "logo" with the program version).
In the case of a segmentation fault, as Halim suggested, should I expect any output related to it? If so, is there something I can do to activate or redirect that output? (To make the question more concrete, I put a small sketch of what I have in mind at the bottom of this message, after the quoted thread.)

Thanks again for your help.

Best regards,
Luiz

On 29 January 2016 at 15:26, Daniel Ibanez <dan.a.ibanez@gmail.com> wrote:

You're right, Pavan, I got mixed up.
I think it is the "assert (!closed) failed" message that confuses me;
the other messages are clear enough.

On Fri, Jan 29, 2016 at 12:09 PM, Balaji, Pavan <balaji@anl.gov> wrote:
Every time the application (not MPICH) has an error, the same error message is shown.

I'm going to go and delete that error message now, since users keep thinking it points to a problem in MPICH, when MPICH is only trying to be helpful and tell them: "this is not MPICH's fault; your application is broken." Maybe I should just say that in the error message.
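As a concrete illustration (this is only a sketch and has nothing to do with your code): the "EXIT CODE: 11" in your output corresponds to signal 11 (SIGSEGV), which is why Halim called it a segmentation fault. Any application that dies from a signal triggers exactly the same banner; something like the program below, built with mpicc and run as "mpiexec -n 2 ./crash_demo" (hypothetical names), reproduces it:

/* crash_demo.c -- deliberately broken MPI program, for illustration only */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        volatile int *p = 0;   /* deliberate NULL dereference on rank 0 */
        *p = 42;               /* -> SIGSEGV -> the BAD TERMINATION banner */
    }

    MPI_Finalize();            /* never reached on rank 0 */
    return 0;
}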

-- Pavan

> On Jan 29, 2016, at 10:49 AM, Daniel Ibanez <dan.a.ibanez@gmail.com> wrote:
>
> Hello all,
>
> Sorry to intrude on someone else's question,
> but we also see this error message on our Linux cluster,
> with code that we're very confident is free of MPI bugs
> (it has run on most of Mira, etc.).
>
> The MPICH build being used was compiled on our
> workstations, and it runs well on the workstations
> (up to the 16 ranks their CPUs handle).
> However, using that MPICH install on the cluster
> causes this failure for small runs (2-4 ranks).
>
> I'm saying this to suggest that it has to do with interactions
> between MPICH and the architecture/OS, and maybe also with
> the MPICH build configuration.
>
> I'll try to come up with a reproducing test program today.
>
> On Fri, Jan 29, 2016 at 11:42 AM, Halim Amer <aamer@anl.gov> wrote:
> Luiz,
>
> You are experiencing a segmentation fault.
> We don't have enough information to pinpoint the source of the problem, however. We usually require a small piece of code that reproduces the bug to help debugging.
>
> --Halim
>
> www.mcs.anl.gov/~aamer
>
>
> On 1/29/16 10:27 AM, Luiz Carlos da Costa Junior wrote:
> Dear all,
>
> We have been using MPICH with our software, running it on
> Amazon AWS Linux servers, for a long time.
> We had MPICH version 1.4.1p1 in our production environment (which,
> I know, is very old), but it has been very stable in recent years.
> However, recently we have been facing a "Bad termination" problem once
> in a while, so we decided to investigate the issue.
> In principle, we have no apparent reason to believe that the
> problem lies in our code, since there were no changes that would
> explain this behavior.
> The other point is that it occurs intermittently: if we run
> the program again it doesn't happen, so it has been difficult to
> debug/trace the source of the problem.
>
> Our first step, then, was to update MPICH to the latest
> version, 3.2.
> However, we ran into the same problem (output below):
>
> =====================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 11
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> =====================================================================================
> [proxy:0:0@ip-10-137-129-86] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0@ip-10-137-129-86] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec@ip-10-137-129-86] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
> terminated badly; aborting
> [mpiexec@ip-10-137-129-86] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> waiting for completion
> [mpiexec@ip-10-137-129-86] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting
> for completion
> [mpiexec@ip-10-137-129-86] main (./ui/mpich/mpiexec.c:405): process
> manager error waiting for completion
>
>
> Do you have any clue about what might have been causing this problem?
> Any suggestion at this point would be highly appreciated.
>
> Best regards,
> Luiz
>
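P.S. To make my question above more concrete: would something along the lines of the sketch below be a reasonable way to capture a backtrace when the segmentation fault happens? This is only a guess on my part (it assumes glibc's backtrace() facility on Linux, compiling with -g and linking with -rdynamic so the frames have readable names); we do not have anything like it in the code today. Or should we simply enable core dumps with "ulimit -c unlimited" on the nodes and inspect the core files with gdb?

/* trace_segv.c -- hypothetical sketch of a SIGSEGV handler that prints a
 * backtrace to stderr before the process dies; assumes glibc (execinfo.h). */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    /* backtrace_symbols_fd writes straight to the fd, avoiding malloc here */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);   /* or re-raise the signal to still produce a core dump */
}

int main(int argc, char **argv)
{
    signal(SIGSEGV, segv_handler);
    /* ... MPI_Init() and the rest of the application would follow here ... */
    (void)argc; (void)argv;
    return 0;
}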
_______________________________________________
discuss mailing list     discuss@mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss