[mpich-discuss] Maximum number of inter-communicators?

Jeff Hammond jeff.science at gmail.com
Tue Nov 16 15:00:15 CST 2021


Unless MPICH configure has changed recently, --enable-g=debug enables debug
symbols, and is all you need. Your choice (--enable-g=all) is sufficient but
overkill, and it may introduce nontrivial performance overhead.

Jeff

On Sat, Nov 13, 2021 at 10:57 PM Mccall, Kurt E. (MSFC-EV41) via discuss <
discuss at mpich.org> wrote:

> Hui,
>
>
>
> I built MPICH 4.0a2 with gcc 4.8.5, and included the --enable-g=all flag to
> “configure” so that debugging symbols would be present.   The code is
> crashing in my call to MPI_Type_commit, in libpthreads.so.   gdb gives the
> stack trace below.  Since MPICH 3.3.2, have there been changes in how custom
> types are created (the code worked in 3.3.2)?   I included my type-creating
> code after the stack trace.
>
>
>
> Program received signal SIGSEGV, Segmentation fault.
> MPIR_Typerep_create_struct (count=count@entry=8,
>     array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
>     array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
>     array_of_types=array_of_types@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)
>     at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659
> 659         MPIR_Ensure_Aint_fits_in_int(old_dtp->builtin_element_size);
>
> (gdb) where
> #0  MPIR_Typerep_create_struct (count=count@entry=8,
>     array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
>     array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
>     array_of_types=array_of_types@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)
>     at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659
> #1  0x00007fec8b9b2608 in type_struct (count=count@entry=8,
>     blocklength_array=blocklength_array@entry=0x128b6b0,
>     displacement_array=displacement_array@entry=0x7fffcaa243c0,
>     oldtype_array=oldtype_array@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fffcaa242dc)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:206
> #2  0x00007fec8b9b4b9e in type_struct (newtype=0x7fffcaa242dc,
>     oldtype_array=0x7fffcaa24410, displacement_array=0x7fffcaa243c0,
>     blocklength_array=0x128b6b0, count=8)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:227
> #3  MPIR_Type_struct (count=count@entry=8, blocklength_array=0x128b6b0,
>     displacement_array=displacement_array@entry=0x7fffcaa243c0,
>     oldtype_array=oldtype_array@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fffcaa242dc)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:235
> #4  0x00007fec8b9b7b08 in MPIR_Type_create_struct_impl (count=count@entry=8,
>     array_of_blocklengths=array_of_blocklengths@entry=0x7fffcaa24440,
>     array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
>     array_of_types=array_of_types@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x12853fc)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:908
> #5  0x00007fec8b85ad26 in internal_Type_create_struct (newtype=0x12853fc,
>     array_of_types=0x7fffcaa24410, array_of_displacements=<optimized out>,
>     array_of_blocklengths=0x7fffcaa24440, count=8)
>     at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:79
> #6  PMPI_Type_create_struct (count=8, array_of_blocklengths=0x7fffcaa24440,
>     array_of_displacements=0x7fffcaa243c0, array_of_types=0x7fffcaa24410,
>     newtype=0x12853fc)
>     at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:164
> #7  0x0000000000438dfb in needles::MpiMsgBasic::createMsgDataType
>     (this=0x12853fc) at src/MsgBasic.cpp:97
> #8  0x0000000000412b77 in needles::NeedlesMpiManager::init (this=0x12853a0,
>     argc=23, argv=0x7fffcaa24e08, rank=20, world_size=21)
>     at src/NeedlesMpiManager.cpp:204
> #9  0x000000000040605f in main (argc=23, argv=0x7fffcaa24e08)
>     at src/NeedlesMpiManagerMain.cpp:142
> (gdb)
>
> Here is my code that creates the custom type and then calls
> MPI_Type_commit:
>
>
>
>     MsgBasic obj;
>     int struct_len = 8, i;
>
>     int block_len[struct_len];
>     MPI_Datatype types[struct_len];
>     MPI_Aint displacements[struct_len];
>
>     i = 0;
>     block_len[i] = 1;
>     types[i] = MPI_LOGICAL;
>     displacements[i] = (size_t) &obj.tuple_valid_ - (size_t) &obj;
>
>     ++i;
>     block_len[i] = 1;
>     types[i] = MPI_LOGICAL;
>     displacements[i] = (size_t) &obj.tuple_seq_valid_ - (size_t) &obj;
>
>     // the int array "start_" member
>     ++i;
>     block_len[i] = Tuple::N_INDICES_MAX_;
>     types[i] = MPI_SHORT;
>     displacements[i] = (size_t) &obj.start_ - (size_t) &obj;
>
>     // the int array "end_" member
>     ++i;
>     block_len[i] = Tuple::N_INDICES_MAX_;
>     types[i] = MPI_SHORT;
>     displacements[i] = (size_t) &obj.end_ - (size_t) &obj;
>
>     // the integer "opcode_" member
>     ++i;
>     block_len[i] = 1;
>     types[i] = MPI_INT;
>     displacements[i] = (size_t) &obj.opcode_ - (size_t) &obj;
>
>     // the boolean "success_" member
>     ++i;
>     block_len[i] = 1;
>     types[i] = MPI_LOGICAL;  // NOTE: might be MPI_BOOLEAN in later version
>     displacements[i] = (size_t) &obj.success_ - (size_t) &obj;
>
>     // the double "run_time_sec_" member
>     ++i;
>     block_len[i] = 1;
>     types[i] = MPI_DOUBLE;
>     displacements[i] = (size_t) &obj.run_time_sec_ - (size_t) &obj;
>
>     // the char array "error_msg_" member
>     ++i;
>     block_len[i] = NeedlesMpi::ERROR_MSG_LEN_ + 1;
>     types[i] = MPI_CHAR;
>     displacements[i] = (size_t) &obj.error_msg_[0] - (size_t) &obj;
>
>     MPI_Type_create_struct(struct_len, block_len, displacements,
>         types, &msg_data_type_);
>     MPI_Type_commit(&msg_data_type_);
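
The displacement arithmetic in the quoted code can also be written with
offsetof, which avoids the pointer-to-size_t casts. Below is a minimal
sketch, not a diagnosis of the crash. The real MsgBasic definition is not
shown in this thread, so the struct MsgBasicSketch, its member types, and
the two array lengths are all assumptions made for illustration. The sketch
deliberately needs no MPI header; the values it computes are what would be
stored in the MPI_Aint displacements[] array.

```cpp
#include <array>
#include <cassert>      // assert, used by check_displacements()
#include <cstddef>      // offsetof, std::size_t
#include <type_traits>  // std::is_standard_layout

// Assumed sizes; the real Tuple::N_INDICES_MAX_ and
// NeedlesMpi::ERROR_MSG_LEN_ values are not given in the thread.
constexpr int N_INDICES_MAX = 4;
constexpr int ERROR_MSG_LEN = 63;

// Hypothetical stand-in for MsgBasic (member types are guesses).
struct MsgBasicSketch {
    bool   tuple_valid_;
    bool   tuple_seq_valid_;
    short  start_[N_INDICES_MAX];
    short  end_[N_INDICES_MAX];
    int    opcode_;
    bool   success_;
    double run_time_sec_;
    char   error_msg_[ERROR_MSG_LEN + 1];
};

// offsetof is only unconditionally supported for standard-layout types.
static_assert(std::is_standard_layout<MsgBasicSketch>::value,
              "offsetof requires a standard-layout type");

constexpr int kFields = 8;

// Byte offset of each member, in declaration order.
inline std::array<std::size_t, kFields> field_displacements() {
    return {{
        offsetof(MsgBasicSketch, tuple_valid_),
        offsetof(MsgBasicSketch, tuple_seq_valid_),
        offsetof(MsgBasicSketch, start_),
        offsetof(MsgBasicSketch, end_),
        offsetof(MsgBasicSketch, opcode_),
        offsetof(MsgBasicSketch, success_),
        offsetof(MsgBasicSketch, run_time_sec_),
        offsetof(MsgBasicSketch, error_msg_),
    }};
}

// True if the offsets strictly increase and all fall inside the struct.
inline bool displacements_are_sane() {
    const auto d = field_displacements();
    for (int i = 0; i < kFields; ++i) {
        if (d[i] >= sizeof(MsgBasicSketch)) return false;
        if (i > 0 && d[i] <= d[i - 1]) return false;
    }
    return true;
}

// Aborts (in builds without NDEBUG) if the layout check fails.
inline void check_displacements() {
    assert(displacements_are_sane());
}
```

If the flag members really are C++ bool, MPI_CXX_BOOL (MPI-3.0) is the
predefined datatype that matches them; MPI_LOGICAL describes a Fortran
LOGICAL.
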
>
>
>
> Thanks,
>
> Kurt
>
>
>
> *From:* Zhou, Hui <zhouh at anl.gov>
> *Sent:* Sunday, October 24, 2021 6:46 PM
> *To:* discuss at mpich.org
> *Cc:* Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> *Subject:* [EXTERNAL] Re: Maximum number of inter-communicators?
>
>
>
> Hi Kurt,
>
>
>
> There is indeed a limit on the maximum number of communicators that you can
> have, including both intra-communicators and inter-communicators. Try
> freeing the communicators that you no longer need. In older versions of
> MPICH, there may be an additional limit on how many dynamic processes one
> can connect. If you still hit a crash after making sure there aren't too
> many simultaneously active communicators, could you try the latest release,
> http://www.mpich.org/static/downloads/4.0a2/mpich-4.0a2.tar.gz, and see if
> the issue persists?
>
>
>
> --
>
> Hui
> ------------------------------
>
> *From:* Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
> *Sent:* Sunday, October 24, 2021 2:37 PM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> *Subject:* [mpich-discuss] Maximum number of inter-communicators?
>
>
>
> Hi,
>
>
>
> Based on a paper I read about giving an MPI job some fault tolerance, I’m
> exclusively connecting my processes with inter-communicators. I’ve found
> that if I increase the number of processes beyond a certain point, many
> processes don’t get created at all and the whole job crashes. Am I running
> up against an operating system limit (like the number of open file
> descriptors – it is set at 1024), or some sort of MPICH limit?
>
>
>
> If it matters, my process architecture (a tree) is as follows: one master
> process connected to 21 manager processes on 21 other nodes, and each
> manager connected to 8 worker processes on the manager’s own node. This is
> the largest job I’ve been able to create without it crashing. Attempting
> to increase the number of workers beyond 8 results in a crash.
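
For scale, the tree described above can be counted out. This is a small
sketch with hypothetical names (TreeCounts, count_tree), and it assumes one
inter-communicator per parent/child edge; the message does not say exactly
how the links are created.

```cpp
// Sizes of a two-level tree: one master, `managers` managers, and
// `workers_per_manager` workers under each manager.
struct TreeCounts {
    int processes;      // total processes in the job
    int edges;          // parent/child links (assumed: one inter-communicator each)
    int master_links;   // links held by the master
    int manager_links;  // links held by each manager (1 up + workers down)
};

constexpr TreeCounts count_tree(int managers, int workers_per_manager) {
    return TreeCounts{
        1 + managers + managers * workers_per_manager,
        managers + managers * workers_per_manager,
        managers,
        1 + workers_per_manager
    };
}
```

Under that assumption, count_tree(21, 8) gives 190 processes and 189 links
in total, with 21 links on the master and 9 on each manager.
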
>
>
>
> I’m using MPICH 3.3.2 on Centos 3.10.0.   MPICH was compiled with the
> Portland Group compiler pgc++ 19.5-0.
>
>
>
> Thanks,
>
> Kurt
-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/