[mpich-discuss] Maximum number of inter-communicators?
Jeff Hammond
jeff.science at gmail.com
Tue Nov 16 15:00:15 CST 2021
Unless MPICH's configure has changed recently, --enable-g=debug enables
debug symbols and is all you need. Your choice (--enable-g=all) is
sufficient but overkill, and it may introduce nontrivial performance
overhead.
Jeff
On Sat, Nov 13, 2021 at 10:57 PM Mccall, Kurt E. (MSFC-EV41) via discuss <
discuss at mpich.org> wrote:
> Hui,
>
>
>
> I built MPICH 4.0a2 with gcc 4.8.5, and included the --enable-g=all flag to
> “configure” so that debugging symbols would be present. The code is
> crashing in my call to MPI_Type_commit, in libpthreads.so. gdb gives the
> stack trace below. Since MPICH 3.3.2, have there been changes in how custom
> types are created (the code worked in 3.3.2)? I included my type-creating
> code after the stack trace.
>
>
>
> Program received signal SIGSEGV, Segmentation fault.
> MPIR_Typerep_create_struct (count=count@entry=8,
>     array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
>     array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
>     array_of_types=array_of_types@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)
>     at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659
> 659     MPIR_Ensure_Aint_fits_in_int(old_dtp->builtin_element_size);
>
> (gdb) where
>
> #0 MPIR_Typerep_create_struct (count=count@entry=8,
>     array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
>     array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
>     array_of_types=array_of_types@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)
>     at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659
> #1 0x00007fec8b9b2608 in type_struct (count=count@entry=8,
>     blocklength_array=blocklength_array@entry=0x128b6b0,
>     displacement_array=displacement_array@entry=0x7fffcaa243c0,
>     oldtype_array=oldtype_array@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fffcaa242dc)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:206
> #2 0x00007fec8b9b4b9e in type_struct (newtype=0x7fffcaa242dc,
>     oldtype_array=0x7fffcaa24410,
>     displacement_array=0x7fffcaa243c0, blocklength_array=0x128b6b0,
>     count=8)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:227
> #3 MPIR_Type_struct (count=count@entry=8, blocklength_array=0x128b6b0,
>     displacement_array=displacement_array@entry=0x7fffcaa243c0,
>     oldtype_array=oldtype_array@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x7fffcaa242dc)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:235
> #4 0x00007fec8b9b7b08 in MPIR_Type_create_struct_impl (count=count@entry=8,
>     array_of_blocklengths=array_of_blocklengths@entry=0x7fffcaa24440,
>     array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
>     array_of_types=array_of_types@entry=0x7fffcaa24410,
>     newtype=newtype@entry=0x12853fc)
>     at ../mpich-4.0a2/src/mpi/datatype/type_create.c:908
> #5 0x00007fec8b85ad26 in internal_Type_create_struct (newtype=0x12853fc,
>     array_of_types=0x7fffcaa24410,
>     array_of_displacements=<optimized out>,
>     array_of_blocklengths=0x7fffcaa24440, count=8)
>     at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:79
> #6 PMPI_Type_create_struct (count=8,
>     array_of_blocklengths=0x7fffcaa24440,
>     array_of_displacements=0x7fffcaa243c0,
>     array_of_types=0x7fffcaa24410, newtype=0x12853fc)
>     at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:164
> #7 0x0000000000438dfb in needles::MpiMsgBasic::createMsgDataType
>     (this=0x12853fc) at src/MsgBasic.cpp:97
> #8 0x0000000000412b77 in needles::NeedlesMpiManager::init
>     (this=0x12853a0, argc=23, argv=0x7fffcaa24e08, rank=20,
>     world_size=21) at src/NeedlesMpiManager.cpp:204
> #9 0x000000000040605f in main (argc=23, argv=0x7fffcaa24e08)
>     at src/NeedlesMpiManagerMain.cpp:142
> (gdb)
>
> Here is my code that creates the custom type and then calls
> MPI_Type_commit:
>
>
>
> MsgBasic obj;
> int struct_len = 8, i;
>
> int block_len[struct_len];
> MPI_Datatype types[struct_len];
> MPI_Aint displacements[struct_len];
>
> i = 0;
> block_len[i] = 1;
> types[i] = MPI_LOGICAL;
> displacements[i] = (size_t) &obj.tuple_valid_ - (size_t) &obj;
>
> ++i;
> block_len[i] = 1;
> types[i] = MPI_LOGICAL;
> displacements[i] = (size_t) &obj.tuple_seq_valid_ - (size_t) &obj;
>
> // the int array "start_" member
> ++i;
> block_len[i] = Tuple::N_INDICES_MAX_;
> types[i] = MPI_SHORT;
> displacements[i] = (size_t) &obj.start_ - (size_t) &obj;
>
> // the int array "end_" member
> ++i;
> block_len[i] = Tuple::N_INDICES_MAX_;
> types[i] = MPI_SHORT;
> displacements[i] = (size_t) &obj.end_ - (size_t) &obj;
>
> // the integer "opcode_" member
> ++i;
> block_len[i] = 1;
> types[i] = MPI_INT;
> displacements[i] = (size_t) &obj.opcode_ - (size_t) &obj;
>
> // the boolean "success_" member
> ++i;
> block_len[i] = 1;
> types[i] = MPI_LOGICAL; // NOTE: might be MPI_BOOLEAN in later version
> displacements[i] = (size_t) &obj.success_ - (size_t) &obj;
>
> // the double "run_time_sec_" member
> ++i;
> block_len[i] = 1;
> types[i] = MPI_DOUBLE;
> displacements[i] = (size_t) &obj.run_time_sec_ - (size_t) &obj;
>
> // the char array "error_msg_" member
> ++i;
> block_len[i] = NeedlesMpi::ERROR_MSG_LEN_ + 1;
> types[i] = MPI_CHAR;
> displacements[i] = (size_t) &obj.error_msg_[0] - (size_t) &obj;
>
> MPI_Type_create_struct(struct_len, block_len, displacements,
>                        types, &msg_data_type_);
> MPI_Type_commit(&msg_data_type_);
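
The displacement arithmetic in the quoted code can be checked without MPI.
The sketch below is only an illustration: MsgBasicSketch is a hypothetical
stand-in for MsgBasic, and its member types and array lengths are
assumptions, since the post does not show the class definition. For a
standard-layout type, offsetof yields the same byte offsets as the
pointer-difference casts above (MPI_Get_address is the MPI-provided way to
obtain such addresses). The helper verifies that the fields are listed in
ascending order, do not overlap, and fit inside the struct.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for MsgBasic. Member names follow the post; the
// member types and array lengths are assumptions made for illustration.
struct MsgBasicSketch {
    bool   tuple_valid_;
    bool   tuple_seq_valid_;
    short  start_[4];
    short  end_[4];
    int    opcode_;
    bool   success_;
    double run_time_sec_;
    char   error_msg_[65];
};

// One entry per struct member: its name, byte offset, and size in bytes.
struct FieldLayout {
    const char* name;
    std::size_t offset;
    std::size_t bytes;
};

// Build the layout table with offsetof instead of pointer-to-integer casts.
inline std::vector<FieldLayout> msg_basic_layout() {
    return {
        {"tuple_valid_",     offsetof(MsgBasicSketch, tuple_valid_),     sizeof(MsgBasicSketch::tuple_valid_)},
        {"tuple_seq_valid_", offsetof(MsgBasicSketch, tuple_seq_valid_), sizeof(MsgBasicSketch::tuple_seq_valid_)},
        {"start_",           offsetof(MsgBasicSketch, start_),           sizeof(MsgBasicSketch::start_)},
        {"end_",             offsetof(MsgBasicSketch, end_),             sizeof(MsgBasicSketch::end_)},
        {"opcode_",          offsetof(MsgBasicSketch, opcode_),          sizeof(MsgBasicSketch::opcode_)},
        {"success_",         offsetof(MsgBasicSketch, success_),         sizeof(MsgBasicSketch::success_)},
        {"run_time_sec_",    offsetof(MsgBasicSketch, run_time_sec_),    sizeof(MsgBasicSketch::run_time_sec_)},
        {"error_msg_",       offsetof(MsgBasicSketch, error_msg_),       sizeof(MsgBasicSketch::error_msg_)},
    };
}

// True if the fields appear in ascending order, do not overlap, and the
// last one ends inside the struct.
inline bool layout_is_consistent(const std::vector<FieldLayout>& fields,
                                 std::size_t struct_size) {
    assert(struct_size > 0);
    std::size_t end_of_previous = 0;
    for (const FieldLayout& f : fields) {
        if (f.offset < end_of_previous) return false;
        end_of_previous = f.offset + f.bytes;
    }
    return end_of_previous <= struct_size;
}
```

Calling layout_is_consistent(msg_basic_layout(), sizeof(MsgBasicSketch))
before building the MPI type is a cheap sanity check. Separately, and
without claiming it explains the crash: MPI_LOGICAL is the Fortran LOGICAL
datatype, while MPI_CXX_BOOL and MPI_C_BOOL are the predefined types that
correspond to a C++ bool and a C _Bool.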
>
>
>
> Thanks,
>
> Kurt
>
>
>
> *From:* Zhou, Hui <zhouh at anl.gov>
> *Sent:* Sunday, October 24, 2021 6:46 PM
> *To:* discuss at mpich.org
> *Cc:* Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> *Subject:* [EXTERNAL] Re: Maximum number of inter-communicators?
>
>
>
> Hi Kurt,
>
>
>
> There is indeed a limit on the maximum number of communicators that you can
> have, including both intra-communicators and inter-communicators. Try
> freeing the communicators that you no longer need. In older versions of
> MPICH, there may be an additional limit on how many dynamic processes one
> can connect. If you still hit a crash after making sure there aren't too
> many simultaneously active communicators, could you try the latest release,
> http://www.mpich.org/static/downloads/4.0a2/mpich-4.0a2.tar.gz,
> and see if the issue persists?
>
>
>
> --
>
> Hui
> ------------------------------
>
> *From:* Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
> *Sent:* Sunday, October 24, 2021 2:37 PM
> *To:* discuss at mpich.org <discuss at mpich.org>
> *Cc:* Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
> *Subject:* [mpich-discuss] Maximum number of inter-communicators?
>
>
>
> Hi,
>
>
>
> Based on a paper I read about giving an MPI job some fault tolerance, I’m
> exclusively connecting my processes with inter-communicators. I’ve found
> that if I increase the number of processes beyond a certain point, many
> processes don’t get created at all and the whole job crashes. Am I running
> up against an operating system limit (like the number of open file
> descriptors – it is set at 1024), or some sort of MPICH limit?
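
To see which descriptor limit a process is actually running under, the
limit can be read from inside the program. This is a minimal POSIX-only
sketch, not part of the original code: FdLimits, query_fd_limits, and
fits_under_soft_limit are hypothetical names, and the number of
descriptors a given MPICH process needs is not something the thread
establishes.

```cpp
#include <sys/resource.h>  // getrlimit, RLIMIT_NOFILE, RLIM_INFINITY (POSIX)
#include <cassert>

// Soft and hard per-process limits on open file descriptors.
struct FdLimits {
    unsigned long long soft;
    unsigned long long hard;
    bool ok;  // false if getrlimit itself failed
};

// Read RLIMIT_NOFILE for the calling process.
inline FdLimits query_fd_limits() {
    struct rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return FdLimits{0ULL, 0ULL, false};
    }
    return FdLimits{static_cast<unsigned long long>(rl.rlim_cur),
                    static_cast<unsigned long long>(rl.rlim_max),
                    true};
}

// True if `needed` descriptors fit under the soft limit. `needed` is the
// caller's own estimate (an assumption, not a figure from MPICH).
inline bool fits_under_soft_limit(const FdLimits& limits,
                                  unsigned long long needed) {
    assert(limits.ok);
    if (limits.soft == static_cast<unsigned long long>(RLIM_INFINITY)) {
        return true;  // no soft limit configured
    }
    return needed <= limits.soft;
}
```

Logging query_fd_limits() from each manager process just before it creates
its workers would show whether the 1024 figure applies to the processes
that fail, since limits can differ between a login shell and a launched job.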
>
>
>
> If it matters, my process architecture (a tree) is as follows: one
> master process connected to 21 manager processes on 21 other nodes,
> and each manager connected to 8 worker processes on the manager’s own
> node. This is the largest job I’ve been able to create without it
> crashing. Attempting to increase the number of workers beyond 8 results
> in a crash.
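
For scale, the tree described above can be counted directly. The sketch
below assumes one inter-communicator per parent/child edge, which the post
does not state explicitly; TreeShape and the function names are
hypothetical.

```cpp
#include <cassert>

// Shape of the process tree: one master, `managers` managers, and
// `workers_per_manager` workers under each manager.
struct TreeShape {
    int managers;
    int workers_per_manager;
};

// Master + managers + all workers.
constexpr int total_processes(TreeShape t) {
    return 1 + t.managers + t.managers * t.workers_per_manager;
}

// Assumption: one inter-communicator per parent/child edge of the tree.
constexpr int master_intercomms(TreeShape t) { return t.managers; }
constexpr int manager_intercomms(TreeShape t) { return 1 + t.workers_per_manager; }

// A tree with N nodes has N - 1 edges.
constexpr int total_edges(TreeShape t) { return total_processes(t) - 1; }

// The largest working job from the post: 21 managers with 8 workers each.
static_assert(total_processes(TreeShape{21, 8}) == 190, "1 + 21 + 168");
static_assert(manager_intercomms(TreeShape{21, 8}) == 9, "1 parent + 8 workers");
```

Under that assumption the working job has 190 processes and 189
inter-communicators in total, with at most 21 held by any single process
(the master). Going to 9 workers per manager raises the total to 211
processes.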
>
>
>
> I’m using MPICH 3.3.2 on Centos 3.10.0. MPICH was compiled with the
> Portland Group compiler pgc++ 19.5-0.
>
>
>
> Thanks,
>
> Kurt
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/