<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
code
{mso-style-priority:99;
font-family:"Courier New";}
p.xmsonormal, li.xmsonormal, div.xmsonormal
{mso-style-name:x_msonormal;
margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
p.xxmsonormal, li.xxmsonormal, div.xxmsonormal
{mso-style-name:x_xmsonormal;
margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle22
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">We just released 4.0b1 last week! If no major issues are discovered during this period, we will release 4.0rc1 and then 4.0 GA, each roughly two weeks apart.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<div>
<p class="MsoNormal">-- <br>
Hui Zhou<o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:12.0pt;margin-left:.5in">
<b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">Mccall, Kurt E. (MSFC-EV41) via discuss &lt;discuss@mpich.org&gt;<br>
<b>Date: </b>Monday, November 22, 2021 at 9:13 AM<br>
<b>To: </b>discuss@mpich.org &lt;discuss@mpich.org&gt;<br>
<b>Cc: </b>Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Subject: </b>Re: [mpich-discuss] Maximum number of inter-communicators?<o:p></o:p></span></p>
</div>
<p class="MsoNormal" style="margin-left:.5in">Hui, <o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">A reboot of that failing node fixed the problem. My original question was whether there is a limit on the number of inter-communicators. With 4.0a2, my job runs fine with the increased number of inter-communicators, filling up each node nicely. Should I just step up to 3.4.2, or wait until 4.0 is released? Do you know when 4.0 will be released?<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Thanks for your help,<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Kurt<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">Hi Kurt,</span><o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">There is indeed a limit on the maximum number of communicators you can have, counting both intra-communicators and inter-communicators. Try freeing the communicators that you no longer need. In older versions of MPICH, there may be an additional limit on how many dynamic processes one can connect. If you still hit a crash after making sure there aren't too many simultaneously active communicators, could you try the latest release --
</span><a href="https://gcc02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.mpich.org%2Fstatic%2Fdownloads%2F4.0a2%2Fmpich-4.0a2.tar.gz&data=04%7C01%7Ckurt.e.mccall%40nasa.gov%7C08341d1f0e7f480dcdd708d9adc1e714%7C7005d45845be48ae8140d43da96dd17b%7C0%7C0%7C637731870592965533%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vaTalwrjoeP76g%2BX04eQyZHSMYry2UxV5XZ2z%2FPMP20%3D&reserved=0"><span style="font-size:12.0pt">http://www.mpich.org/static/downloads/4.0a2/mpich-4.0a2.tar.gz</span></a><span style="font-size:12.0pt;color:black">, and see if the issue persists?</span><o:p></o:p></p>
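A minimal sketch of this advice (the names here, such as `comms`, are illustrative and not from the thread's code; requires an MPI installation): free each inter-communicator as soon as it is no longer needed, so the number of simultaneously active communicators stays bounded.

```cpp
// Sketch: cap the number of simultaneously active communicators by
// releasing each one as soon as it is no longer needed.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    std::vector<MPI_Comm> comms;  // e.g. one inter-communicator per peer
    // ... fill `comms` via MPI_Comm_spawn / MPI_Comm_connect ...

    for (MPI_Comm &c : comms) {
        // ... last communication over `c` ...
        MPI_Comm_free(&c);  // sets c to MPI_COMM_NULL; for comms to
                            // dynamically connected processes,
                            // MPI_Comm_disconnect also waits for
                            // pending communication to complete
    }

    MPI_Finalize();
    return 0;
}
```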
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">--
</span><o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">Hui</span><o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-left:.5in"><b>From:</b> Mccall, Kurt E. (MSFC-EV41) via discuss &lt;discuss@mpich.org&gt;<br>
<b>Sent:</b> Monday, November 22, 2021 8:10 AM<br>
<b>To:</b> discuss@mpich.org<br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;kurt.e.mccall@nasa.gov&gt;<br>
<b>Subject:</b> Re: [mpich-discuss] [EXTERNAL] Re: Maximum number of inter-communicators?<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Hui, just one process on a particular node failed (without any error message). We just had an O/S upgrade on the whole cluster, and maybe an error was made on that node during the upgrade. I’m trying to figure out what is different about that node.<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Kurt<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-left:.5in"><b>From:</b> Zhou, Hui &lt;<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>&gt;<br>
<b>Sent:</b> Friday, November 19, 2021 4:29 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>&gt;<br>
<b>Subject:</b> [EXTERNAL] Re: Maximum number of inter-communicators?<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">Kurt,<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in">That usually means a process went away without telling the process manager (hydra) first. Those messages come from hydra when its connection to the process is closed unexpectedly. Did all processes call `MPI_Finalize` before exiting?<o:p></o:p></p>
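As background for this question, a shutdown order that avoids the dropped-connection symptom might look like the sketch below (illustrative only; requires an MPI installation).

```cpp
// Sketch: every process should reach MPI_Finalize before exiting;
// calling exit()/abort() first is what typically leaves hydra with
// an unexpectedly closed connection.
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    // ... application work ...
    // MPI_Comm_disconnect(&intercomm);  // for each inter-communicator
    //                                   // to a dynamic process, first
    MPI_Finalize();  // tells the process manager we exited cleanly
    return 0;
}
```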
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<div>
<div>
<div>
<p class="MsoNormal" style="margin-left:.5in">-- <br>
Hui Zhou<o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:.5in"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="mso-margin-top-alt:0in;margin-right:0in;margin-bottom:12.0pt;margin-left:1.0in">
<b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">Mccall, Kurt E. (MSFC-EV41) via discuss &lt;<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>&gt;<br>
<b>Date: </b>Friday, November 19, 2021 at 2:26 PM<br>
<b>To: </b><a href="mailto:discuss@mpich.org">discuss@mpich.org</a> &lt;<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>&gt;<br>
<b>Cc: </b>Mccall, Kurt E. (MSFC-EV41) &lt;<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>&gt;<br>
<b>Subject: </b>Re: [mpich-discuss] Maximum number of inter-communicators?<o:p></o:p></span></p>
</div>
<p class="MsoNormal" style="margin-left:1.0in">Thanks Hui, replacing MPI_LOGICAL with MPI_CXX_BOOL enabled the code to run without the segfault (using 4.0a2). Do you have any clue what this error message means?<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[mpiexec@n022.cluster.com] control_cb (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:206): assert (!closed) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[mpiexec@n022.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0a2/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[mpiexec@n022.cluster.com] HYD_pmci_wait_for_completion (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:160): error waiting for event<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[mpiexec@n022.cluster.com] main (../../../../mpich-4.0a2/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[proxy:0:0@n022.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmip_cb.c:896): assert (!closed) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[proxy:0:0@n022.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0a2/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[proxy:0:0@n022.cluster.com] main (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmip.c:169): de[proxy:0:11@n010.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmip_cb.c:896):
assert (!closed) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[proxy:0:11@n010.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0a2/src/pm/hydra/tools/demu[proxy:0:14@n007.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmip_cb.c:896):
assert (!closed) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[proxy:0:14@n007.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0a2/src/pm/hydra/tools/demu[proxy:0:15@n006.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmip_cb.c:896):
assert (!closed) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[proxy:0:15@n006.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0a2/src/pm/hydra/tools/demu[proxy:0:13@n008.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0a2/src/pm/hydra/pm/pmiserv/pmip_cb.c:896):
assert (!closed) failed<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">[proxy:0:13@n008.cluster.com] HYDT_dmxu_poll_<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">Thanks, <o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in">Kurt<o:p></o:p></p>
<p class="MsoNormal" style="margin-left:1.0in"><o:p> </o:p></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-left:1.0in"><b>From:</b> Zhou, Hui &lt;<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>&gt;<br>
<b>Sent:</b> Monday, November 15, 2021 9:44 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>&gt;<br>
<b>Subject:</b> [EXTERNAL] Re: Maximum number of inter-communicators?<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal" style="margin-left:1.0in"><o:p> </o:p></p>
<div>
<p class="MsoNormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">Hi Kurt,<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">Did you build mpich with Fortran disabled? MPI_LOGICAL is a Fortran datatype and is unavailable when the Fortran binding is disabled. Try using
</span><code><span style="font-size:10.0pt;color:black">MPI_C_BOOL</span></code><span style="font-size:12.0pt;color:black"> instead. If you didn't disable
</span><code><span style="font-size:10.0pt;color:black">cxx</span></code><span style="font-size:12.0pt;color:black">, you may also try using
</span><code><span style="font-size:10.0pt;color:black">MPI_CXX_BOOL</span></code><span style="font-size:12.0pt;color:black"> since you are programming in C++.<o:p></o:p></span></p>
</div>
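Applied to a struct datatype, that substitution looks like this sketch (the `Flag` struct and all names are illustrative; `MPI_CXX_BOOL` assumes cxx support was not disabled, and `MPI_C_BOOL` is the plain-C alternative; requires an MPI installation):

```cpp
// Sketch: describing a C++ bool member with MPI_CXX_BOOL.  Fortran's
// MPI_LOGICAL is typically 4 bytes, while C++ bool is usually 1 byte,
// so the Fortran type would not match the struct layout even when it
// is available.
#include <mpi.h>
#include <cstddef>

struct Flag { bool ok; double value; };  // illustrative

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int          block_len[2] = {1, 1};
    MPI_Aint     disps[2]     = {(MPI_Aint)offsetof(Flag, ok),
                                 (MPI_Aint)offsetof(Flag, value)};
    MPI_Datatype types[2]     = {MPI_CXX_BOOL, MPI_DOUBLE};

    MPI_Datatype flag_type;
    MPI_Type_create_struct(2, block_len, disps, types, &flag_type);
    MPI_Type_commit(&flag_type);

    // ... use flag_type in sends/receives ...

    MPI_Type_free(&flag_type);
    MPI_Finalize();
    return 0;
}
```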
<div>
<p class="MsoNormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">Hui<o:p></o:p></span></p>
</div>
<div style="margin-left:.5in">
<div class="MsoNormal" align="center" style="margin-left:.5in;text-align:center">
<hr size="0" width="62%" align="center">
</div>
</div>
<div id="divRplyFwdMsg">
<p class="MsoNormal" style="margin-left:1.0in"><b><span style="color:black">From:</span></b><span style="color:black"> Mccall, Kurt E. (MSFC-EV41) via discuss &lt;<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>&gt;<br>
<b>Sent:</b> Saturday, November 13, 2021 2:56 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a> &lt;<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>&gt;<br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>&gt;<br>
<b>Subject:</b> Re: [mpich-discuss] Maximum number of inter-communicators?</span><o:p></o:p></p>
<div>
<p class="MsoNormal" style="margin-left:1.0in"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="xmsonormal" style="margin-left:1.0in">Hui,<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">I built MPICH 4.0a2 with gcc 4.8.5, and included the --enable-g=all flag to “configure” so that debugging symbols would be present. The code is crashing in my call to MPI_Type_commit, in libpthreads.so. gdb gives the stack trace below. Have there been changes since MPICH 3.3.2 in how custom types are created (the code worked in 3.3.2)? I included my type-creating code after the stack trace.<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">Program received signal SIGSEGV, Segmentation fault.<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">MPIR_Typerep_create_struct (count=count@entry=8, array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_types=array_of_types@entry=0x7fffcaa24410,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">659 MPIR_Ensure_Aint_fits_in_int(old_dtp->builtin_element_size);<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">(gdb) where<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#0 MPIR_Typerep_create_struct (count=count@entry=8, array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_types=array_of_types@entry=0x7fffcaa24410,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#1 0x00007fec8b9b2608 in type_struct (count=count@entry=8, blocklength_array=blocklength_array@entry=0x128b6b0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacement_array=displacement_array@entry=0x7fffcaa243c0, oldtype_array=oldtype_array@entry=0x7fffcaa24410,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> newtype=newtype@entry=0x7fffcaa242dc) at ../mpich-4.0a2/src/mpi/datatype/type_create.c:206<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#2 0x00007fec8b9b4b9e in type_struct (newtype=0x7fffcaa242dc, oldtype_array=0x7fffcaa24410,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacement_array=0x7fffcaa243c0, blocklength_array=0x128b6b0, count=8)<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> at ../mpich-4.0a2/src/mpi/datatype/type_create.c:227<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#3 MPIR_Type_struct (count=count@entry=8, blocklength_array=0x128b6b0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacement_array=displacement_array@entry=0x7fffcaa243c0, oldtype_array=oldtype_array@entry=0x7fffcaa24410,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> newtype=newtype@entry=0x7fffcaa242dc) at ../mpich-4.0a2/src/mpi/datatype/type_create.c:235<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#4 0x00007fec8b9b7b08 in MPIR_Type_create_struct_impl (count=count@entry=8,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_blocklengths=array_of_blocklengths@entry=0x7fffcaa24440,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_types=array_of_types@entry=0x7fffcaa24410, newtype=newtype@entry=0x12853fc)<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> at ../mpich-4.0a2/src/mpi/datatype/type_create.c:908<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#5 0x00007fec8b85ad26 in internal_Type_create_struct (newtype=0x12853fc, array_of_types=0x7fffcaa24410,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_displacements=<optimized out>, array_of_blocklengths=0x7fffcaa24440, count=8)<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:79<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#6 PMPI_Type_create_struct (count=8, array_of_blocklengths=0x7fffcaa24440, array_of_displacements=0x7fffcaa243c0,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> array_of_types=0x7fffcaa24410, newtype=0x12853fc)<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:164<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#7 0x0000000000438dfb in needles::MpiMsgBasic::createMsgDataType (this=0x12853fc) at src/MsgBasic.cpp:97<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#8 0x0000000000412b77 in needles::NeedlesMpiManager::init (this=0x12853a0, argc=23, argv=0x7fffcaa24e08, rank=20,
<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> world_size=21) at src/NeedlesMpiManager.cpp:204<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">#9 0x000000000040605f in main (argc=23, argv=0x7fffcaa24e08) at src/NeedlesMpiManagerMain.cpp:142<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">(gdb) <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">Here is my code that creates the custom type and then calls MPI_Type_commit:<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> MsgBasic obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> int struct_len = 8, i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> int block_len[struct_len];<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> MPI_Datatype types[struct_len];<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> MPI_Aint displacements[struct_len];<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> i = 0;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = 1;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_LOGICAL;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.tuple_valid_ - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> ++i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = 1;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_LOGICAL;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.tuple_seq_valid_ - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> // the int array "start_" member<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> ++i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = Tuple::N_INDICES_MAX_;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_SHORT;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.start_ - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> // the int array "end_" member<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> ++i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = Tuple::N_INDICES_MAX_;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_SHORT;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.end_ - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> // the integer "opcode_" member<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> ++i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = 1;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_INT;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.opcode_ - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> // the boolean "success_" member<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> ++i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = 1;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_LOGICAL; // NOTE: might be MPI_BOOLEAN in later version<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.success_ - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> // the double "run_time_sec_" member<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> ++i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = 1;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_DOUBLE;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.run_time_sec_ - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> // the char array "error_msg_" member<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> ++i;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> block_len[i] = NeedlesMpi::ERROR_MSG_LEN_ + 1;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types[i] = MPI_CHAR;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> displacements[i] = (size_t) &obj.error_msg_[0] - (size_t) &obj;<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> MPI_Type_create_struct(struct_len, block_len, displacements,<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> types, &msg_data_type_);<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> MPI_Type_commit(&msg_data_type_);<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">Thanks,<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in">Kurt<o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"><b> </b><o:p></o:p></p>
<p class="xmsonormal" style="margin-left:1.0in"><b>From:</b> Zhou, Hui &lt;<a href="mailto:zhouh@anl.gov">zhouh@anl.gov</a>&gt;<br>
<b>Sent:</b> Sunday, October 24, 2021 6:46 PM<br>
<b>To:</b> <a href="mailto:discuss@mpich.org">discuss@mpich.org</a><br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;<a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a>&gt;<br>
<b>Subject:</b> [EXTERNAL] Re: Maximum number of inter-communicators?<o:p></o:p></p>
</div>
</div>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<div>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">Hi Kurt,</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">There is indeed a limit on the maximum number of communicators you can have, counting both intra-communicators and inter-communicators. Try freeing the communicators that you no longer need. In older versions of MPICH, there may be an additional limit on how many dynamic processes one can connect. If you still hit a crash after making sure there aren't too many simultaneously active communicators, could you try the latest release --
</span><a href="https://gcc02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.mpich.org%2Fstatic%2Fdownloads%2F4.0a2%2Fmpich-4.0a2.tar.gz&data=04%7C01%7Ckurt.e.mccall%40nasa.gov%7C08341d1f0e7f480dcdd708d9adc1e714%7C7005d45845be48ae8140d43da96dd17b%7C0%7C0%7C637731870592965533%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vaTalwrjoeP76g%2BX04eQyZHSMYry2UxV5XZ2z%2FPMP20%3D&reserved=0"><span style="font-size:12.0pt">http://www.mpich.org/static/downloads/4.0a2/mpich-4.0a2.tar.gz</span></a><span style="font-size:12.0pt;color:black">, and see if the issue persists?</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">--
</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal" style="margin-left:1.0in"><span style="font-size:12.0pt;color:black">Hui</span><o:p></o:p></p>
</div>
<div style="margin-left:.5in">
<div class="MsoNormal" align="center" style="margin-left:.5in;text-align:center">
<hr size="0" width="53%" align="center">
</div>
</div>
<div id="x_divRplyFwdMsg">
<p class="xmsonormal" style="margin-left:1.0in"><b><span style="color:black">From:</span></b><span style="color:black"> Mccall, Kurt E. (MSFC-EV41) via discuss &lt;</span><a href="mailto:discuss@mpich.org">discuss@mpich.org</a><span style="color:black">&gt;<br>
<b>Sent:</b> Sunday, October 24, 2021 2:37 PM<br>
<b>To:</b> </span><a href="mailto:discuss@mpich.org">discuss@mpich.org</a><span style="color:black"> &lt;</span><a href="mailto:discuss@mpich.org">discuss@mpich.org</a><span style="color:black">&gt;<br>
<b>Cc:</b> Mccall, Kurt E. (MSFC-EV41) &lt;</span><a href="mailto:kurt.e.mccall@nasa.gov">kurt.e.mccall@nasa.gov</a><span style="color:black">&gt;<br>
<b>Subject:</b> [mpich-discuss] Maximum number of inter-communicators?</span> <o:p>
</o:p></p>
<div>
<p class="xmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="xxmsonormal" style="margin-left:1.0in">Hi,<o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in">Based on a paper I read about giving an MPI job some fault tolerance, I’m exclusively connecting my processes with inter-communicators. I’ve found that if I increase the number of processes beyond a certain point, many processes don’t get created at all and the whole job crashes. Am I running up against an operating system limit (like the number of open file descriptors – it is set at 1024), or some sort of MPICH limit?<o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in">If it matters, my process architecture (a tree) is as follows: one master process connected to 21 manager processes on 21 other nodes, and each manager connected to 8 worker processes on the manager’s own node. This is the largest job I’ve been able to create without it crashing. Attempting to increase the number of workers beyond 8 results in a crash.<o:p></o:p></p>
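For readers following the thread, a tree like the one described is typically built with MPI_Comm_spawn, which returns exactly the kind of inter-communicator being counted here. A sketch with illustrative names ("worker_exe" is hypothetical; the real code is not shown in the thread; requires an MPI installation):

```cpp
// Sketch: a manager spawning its workers; the spawn call returns an
// inter-communicator whose local group is the manager and whose
// remote group is the workers.
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int n_workers = 8;  // the count where crashes began
    MPI_Comm  workers;        // inter-communicator to the workers
    MPI_Comm_spawn("worker_exe",  // illustrative executable name
                   MPI_ARGV_NULL, n_workers, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    // Note: hydra keeps sockets open per proxy/process, so a 1024
    // open-file limit (`ulimit -n`) is worth checking alongside any
    // MPICH-internal limit.

    MPI_Comm_disconnect(&workers);
    MPI_Finalize();
    return 0;
}
```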
<p class="xxmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in">I’m using MPICH 3.3.2 on Centos 3.10.0. MPICH was compiled with the Portland Group compiler pgc++ 19.5-0.<o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in"> <o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in">Thanks,<o:p></o:p></p>
<p class="xxmsonormal" style="margin-left:1.0in">Kurt<o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
</body>
</html>