These lock-ups seem to be gone in 3.3a2.

I do occasionally get the following, though:

Unknown error class, error stack:
PMPI_Comm_accept(129).................: MPI_Comm_accept(port="tag#0$description#aaa.com$port#36230$ifname#192.168.42.99$", MPI_INFO_NULL, root=180, comm=0x84000003, newcomm=0x7f3cf681842c) failed
MPID_Comm_accept(153).................:
MPIDI_Comm_accept(1244)...............:
MPIR_Get_contextid_sparse_group(499)..:
MPIR_Allreduce_impl(755)..............:
MPIR_Allreduce_intra(414).............:
MPIDU_Complete_posted_with_error(1137): Process failed

What does this message mean? Does it mean that some process just exited or died (e.g., with a segfault)?
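For reference, a minimal sketch of how such a failure could be decoded on the accepting side instead of letting the default handler abort the job -- the function name, communicator, and port string below are placeholders, not the actual (Scala-side) code:

#include <mpi.h>
#include <stdio.h>

/* Sketch only: report an MPI_Comm_accept failure instead of aborting. */
static void accept_with_report(const char *port_name, MPI_Comm comm, int root)
{
    MPI_Comm newcomm;
    char msg[MPI_MAX_ERROR_STRING];
    int err, errclass, msglen;

    /* Return error codes to the caller rather than aborting the job. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    err = MPI_Comm_accept(port_name, MPI_INFO_NULL, root, comm, &newcomm);
    if (err != MPI_SUCCESS) {
        MPI_Error_class(err, &errclass);      /* numeric error class */
        MPI_Error_string(err, msg, &msglen);  /* human-readable error stack */
        fprintf(stderr, "accept failed, class %d: %s\n", errclass, msg);
    }
}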
Thank you.
-Dmitriy

On Thu, Jan 12, 2017 at 11:55 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

Further debugging shows that it is not actually mergeIntercom that locks up, but a send/recv pair that two nodes decide to execute before MPI_Intercomm_merge.

So the total snapshot of the situation is that everyone waits on mergeIntercom except for two processes that wait in send and recv respectively, while the majority of the others have already entered the collective barrier.

It would seem that this sort of asymmetric logic should be acceptable, since the send/recv pair is matched before the merge occurs, but in practice it seems to lock up -- increasingly so as the number of participating processes grows. It is almost as if, once a collective barrier of a certain cardinality has formed, point-to-point messages no longer go through.

If this scenario suggests any ideas, please let me know.

Thank you!
-Dmitriy

On Wed, Jan 11, 2017 at 9:38 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

Maybe it has something to do with the fact that the calls go through JVM JNI and that somehow interferes with MPI's threading model, although it is a single-threaded JVM process, and MPI mappings for the JVM are known to have been done before (e.g., Open MPI had an effort towards that).

The strange thing is that I never had a lock-up with fewer than 120 processes, but something changes after that: the spurious condition becomes much more common. By the time I am at 150 processes in the intercomm, I am almost certain to hit a merge lock-up.

On Wed, Jan 11, 2017 at 9:34 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

Thanks.
It would not be easy for me to do immediately, as I am using a proprietary Scala binding API for MPI.

It would help me to know whether there is a known problem like this from the past, or whether the mergeIntercomm API is generally known to work on hundreds of processes. Sounds like there are no known issues with that.
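To make the ordering from the Jan 12 message above concrete, the pattern looks roughly like the following -- a sketch only; the choice of ranks, the tag, and the role flags are assumptions, not the actual binding code:

#include <mpi.h>

/* Sketch of the ordering described above: two processes exchange one
 * point-to-point message over the new intercommunicator while all the
 * other processes go straight into the collective merge. */
void merge_with_extra_exchange(MPI_Comm intercomm, int is_acceptor_root,
                               int is_connector)
{
    MPI_Comm merged;
    int token = 0;

    if (is_acceptor_root) {
        /* one process on the accepting side sends to the newcomer ... */
        MPI_Send(&token, 1, MPI_INT, 0 /* remote rank */, 42, intercomm);
    } else if (is_connector) {
        /* ... and the newly connected process receives it ... */
        MPI_Recv(&token, 1, MPI_INT, 0 /* remote rank */, 42, intercomm,
                 MPI_STATUS_IGNORE);
    }

    /* ... while everyone else is already waiting in the merge; this is
     * where the lock-up is observed once 100+ processes are involved. */
    MPI_Intercomm_merge(intercomm, is_connector ? 1 : 0, &merged);
}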
On Tue, Jan 10, 2017 at 11:53 PM, Oden, Lena <loden@anl.gov> wrote:

Hello Dmitriy,

Can you maybe create a simple example program to reproduce this failure?
It is also often easier to look at a code example to identify a problem.

Thanks,
Lena
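A stripped-down reproducer of the loop described in the quoted message below might look roughly like this (a sketch under assumptions: the port string from MPI_Open_port reaches the new process out of band, rank 0 of the current group acts as the root, and the old communicators are freed after each merge):

#include <mpi.h>

/* Existing members call this each round; "group" starts as MPI_COMM_WORLD. */
MPI_Comm accept_one(MPI_Comm group, const char *port_name)
{
    MPI_Comm inter, merged;

    /* (1) the n current processes accept on their communicator */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0 /* root */, group, &inter);

    /* (3) merge into an (n+1)-process intracomm, drop the old comms */
    MPI_Intercomm_merge(inter, 0 /* low group */, &merged);
    MPI_Comm_free(&inter);
    if (group != MPI_COMM_WORLD)
        MPI_Comm_free(&group);
    return merged;             /* (4) repeat with the grown communicator */
}

/* (2) the newly launched process connects and merges from its side. */
MPI_Comm connect_one(const char *port_name)
{
    MPI_Comm inter, merged;

    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0 /* root */,
                     MPI_COMM_SELF, &inter);
    MPI_Intercomm_merge(inter, 1 /* high group */, &merged);
    MPI_Comm_free(&inter);
    return merged;
}

Repeating accept_one/connect_one until the group reaches 100+ processes would be the closest match to the failing scenario.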
> On Jan 11, 2017, at 2:45 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> Hello,
>
> (MPICH 3.2)
>
> I have a scenario where I add a few extra processes to an existing intercomm.
>
> It works as a simple loop:
> (1) n processes accept on the n-intercomm
> (2) 1 process connects
> (3) the intracomm is merged into an (n+1)-intercomm; the intracomm and the n-intercomm are closed
> (4) repeat 1-3 as needed.
>
> Occasionally, I observe that step 3 spuriously locks up (once I get into the range of 100+ processes). From what I can tell, all processes in step 3 are accounted for and are waiting on the merge, but nothing happens; the collective barrier locks up.
>
> I really have trouble resolving this issue; any ideas are appreciated!
>
> Thank you very much.
> -Dmitriy
>
>

_______________________________________________
discuss mailing list     discuss@mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss