[mpich-discuss] Maximum number of inter-communicators?
Zhou, Hui
zhouh at anl.gov
Mon Nov 15 21:44:29 CST 2021
Hi Kurt,
Did you build MPICH with Fortran disabled? MPI_LOGICAL is a Fortran datatype and is unavailable when the Fortran binding is disabled. Try using MPI_C_BOOL instead. If you didn't disable cxx, you may also try MPI_CXX_BOOL, since you are programming in C++.
Hui
________________________________
From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Sent: Saturday, November 13, 2021 2:56 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: [mpich-discuss] Maximum number of inter-communicators?
Hui,
I built MPICH 4.0a2 with gcc 4.8.5 and passed the -enable-g=all flag to “configure” so that debugging symbols would be present. The code is crashing in my call to MPI_Type_commit, in libpthreads.so. gdb gives the stack trace below. Have there been changes in how custom types are created since MPICH 3.3.2 (the code worked in 3.3.2)? I included my type-creating code after the stack trace.
Program received signal SIGSEGV, Segmentation fault.
MPIR_Typerep_create_struct (count=count@entry=8, array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
array_of_types=array_of_types@entry=0x7fffcaa24410,
newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)
at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659
659 MPIR_Ensure_Aint_fits_in_int(old_dtp->builtin_element_size);
(gdb) where
#0 MPIR_Typerep_create_struct (count=count@entry=8, array_of_blocklengths=array_of_blocklengths@entry=0x128b6b0,
array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
array_of_types=array_of_types@entry=0x7fffcaa24410,
newtype=newtype@entry=0x7fec8bedd258 <MPIR_Datatype_direct+1400>)
at ../mpich-4.0a2/src/mpi/datatype/typerep/src/typerep_dataloop_create.c:659
#1 0x00007fec8b9b2608 in type_struct (count=count@entry=8, blocklength_array=blocklength_array@entry=0x128b6b0,
displacement_array=displacement_array@entry=0x7fffcaa243c0, oldtype_array=oldtype_array@entry=0x7fffcaa24410,
newtype=newtype@entry=0x7fffcaa242dc) at ../mpich-4.0a2/src/mpi/datatype/type_create.c:206
#2 0x00007fec8b9b4b9e in type_struct (newtype=0x7fffcaa242dc, oldtype_array=0x7fffcaa24410,
displacement_array=0x7fffcaa243c0, blocklength_array=0x128b6b0, count=8)
at ../mpich-4.0a2/src/mpi/datatype/type_create.c:227
#3 MPIR_Type_struct (count=count@entry=8, blocklength_array=0x128b6b0,
displacement_array=displacement_array@entry=0x7fffcaa243c0, oldtype_array=oldtype_array@entry=0x7fffcaa24410,
newtype=newtype@entry=0x7fffcaa242dc) at ../mpich-4.0a2/src/mpi/datatype/type_create.c:235
#4 0x00007fec8b9b7b08 in MPIR_Type_create_struct_impl (count=count@entry=8,
array_of_blocklengths=array_of_blocklengths@entry=0x7fffcaa24440,
array_of_displacements=array_of_displacements@entry=0x7fffcaa243c0,
array_of_types=array_of_types@entry=0x7fffcaa24410, newtype=newtype@entry=0x12853fc)
at ../mpich-4.0a2/src/mpi/datatype/type_create.c:908
#5 0x00007fec8b85ad26 in internal_Type_create_struct (newtype=0x12853fc, array_of_types=0x7fffcaa24410,
array_of_displacements=<optimized out>, array_of_blocklengths=0x7fffcaa24440, count=8)
at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:79
#6 PMPI_Type_create_struct (count=8, array_of_blocklengths=0x7fffcaa24440, array_of_displacements=0x7fffcaa243c0,
array_of_types=0x7fffcaa24410, newtype=0x12853fc)
at ../mpich-4.0a2/src/binding/c/datatype/type_create_struct.c:164
#7 0x0000000000438dfb in needles::MpiMsgBasic::createMsgDataType (this=0x12853fc) at src/MsgBasic.cpp:97
#8 0x0000000000412b77 in needles::NeedlesMpiManager::init (this=0x12853a0, argc=23, argv=0x7fffcaa24e08, rank=20,
world_size=21) at src/NeedlesMpiManager.cpp:204
#9 0x000000000040605f in main (argc=23, argv=0x7fffcaa24e08) at src/NeedlesMpiManagerMain.cpp:142
(gdb)
Here is my code that creates the custom type and then calls MPI_Type_commit:
MsgBasic obj;
int struct_len = 8, i;
int block_len[struct_len];
MPI_Datatype types[struct_len];
MPI_Aint displacements[struct_len];
i = 0;
block_len[i] = 1;
types[i] = MPI_LOGICAL;
displacements[i] = (size_t) &obj.tuple_valid_ - (size_t) &obj;
++i;
block_len[i] = 1;
types[i] = MPI_LOGICAL;
displacements[i] = (size_t) &obj.tuple_seq_valid_ - (size_t) &obj;
// the int array "start_" member
++i;
block_len[i] = Tuple::N_INDICES_MAX_;
types[i] = MPI_SHORT;
displacements[i] = (size_t) &obj.start_ - (size_t) &obj;
// the int array "end_" member
++i;
block_len[i] = Tuple::N_INDICES_MAX_;
types[i] = MPI_SHORT;
displacements[i] = (size_t) &obj.end_ - (size_t) &obj;
// the integer "opcode_" member
++i;
block_len[i] = 1;
types[i] = MPI_INT;
displacements[i] = (size_t) &obj.opcode_ - (size_t) &obj;
// the boolean "success_" member
++i;
block_len[i] = 1;
types[i] = MPI_LOGICAL; // NOTE: might be MPI_BOOLEAN in later version
displacements[i] = (size_t) &obj.success_ - (size_t) &obj;
// the double "run_time_sec_" member
++i;
block_len[i] = 1;
types[i] = MPI_DOUBLE;
displacements[i] = (size_t) &obj.run_time_sec_ - (size_t) &obj;
// the char array "error_msg_" member
++i;
block_len[i] = NeedlesMpi::ERROR_MSG_LEN_ + 1;
types[i] = MPI_CHAR;
displacements[i] = (size_t) &obj.error_msg_[0] - (size_t) &obj;
MPI_Type_create_struct(struct_len, block_len, displacements,
types, &msg_data_type_);
MPI_Type_commit(&msg_data_type_);
Thanks,
Kurt
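The pointer-difference arithmetic in the code above can also be written with offsetof, which needs no MsgBasic instance. The following is only a sketch: MsgSketch, kMaxIndices, kErrorMsgLen and member_displacements are hypothetical names, and the member types and array sizes are assumptions, since the real MsgBasic class is not shown in this thread.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Assumed sizes; the real Tuple::N_INDICES_MAX_ and
// NeedlesMpi::ERROR_MSG_LEN_ values are not given in the thread.
constexpr std::size_t kMaxIndices = 4;
constexpr std::size_t kErrorMsgLen = 63;

// Hypothetical stand-in for MsgBasic with assumed member types.
struct MsgSketch {
    bool   tuple_valid_;
    bool   tuple_seq_valid_;
    short  start_[kMaxIndices];
    short  end_[kMaxIndices];
    int    opcode_;
    bool   success_;
    double run_time_sec_;
    char   error_msg_[kErrorMsgLen + 1];
};

// offsetof is only guaranteed to work on standard-layout types.
static_assert(std::is_standard_layout<MsgSketch>::value,
              "offsetof requires a standard-layout type");

// Byte offset of each member from the start of the object, listed in the
// same order as the block_len and types arrays in the code above.
inline std::array<std::size_t, 8> member_displacements() {
    return {{
        offsetof(MsgSketch, tuple_valid_),
        offsetof(MsgSketch, tuple_seq_valid_),
        offsetof(MsgSketch, start_),
        offsetof(MsgSketch, end_),
        offsetof(MsgSketch, opcode_),
        offsetof(MsgSketch, success_),
        offsetof(MsgSketch, run_time_sec_),
        offsetof(MsgSketch, error_msg_),
    }};
}
```

Each value would then be assigned to the matching entry of the MPI_Aint displacements array before the call to MPI_Type_create_struct.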
From: Zhou, Hui <zhouh at anl.gov>
Sent: Sunday, October 24, 2021 6:46 PM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] Re: Maximum number of inter-communicators?
Hi Kurt,
There is indeed a limit on the maximum number of communicators that you can have, including both intra-communicators and inter-communicators. Try freeing the communicators that you no longer need. In older versions of MPICH, there may be an additional limit on how many dynamic processes one can connect. If you still hit the crash after making sure there aren't too many simultaneously active communicators, could you try the latest release, http://www.mpich.org/static/downloads/4.0a2/mpich-4.0a2.tar.gz, and see if the issue persists?
--
Hui
________________________________
From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Sent: Sunday, October 24, 2021 2:37 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] Maximum number of inter-communicators?
Hi,
Based on a paper I read about giving an MPI job some fault tolerance, I’m exclusively connecting my processes with inter-communicators.
I’ve found that if I increase the number of processes beyond a certain point, many processes don’t get created at all and the whole job
crashes. Am I running up against an operating system limit (like the number of open file descriptors – it is set at 1024), or some sort of
MPICH limit?
If it matters, my process architecture (a tree) is as follows: one master process connected to 21 manager processes on 21 other nodes,
and each manager connected to 8 worker processes on the manager’s own node. This is the largest job I’ve been able to create
without it crashing. Attempting to increase the number of workers beyond 8 results in a crash.
I’m using MPICH 3.3.2 on CentOS 3.10.0. MPICH was compiled with the Portland Group compiler pgc++ 19.5-0.
Thanks,
Kurt
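For reference, the size of the tree described above can be worked out with a few lines of arithmetic. This is a rough sketch that assumes exactly one inter-communicator per parent-child link; TreeShape and the function names are hypothetical, and the count ignores MPI_COMM_WORLD, MPI_COMM_SELF and any communicators created elsewhere in the application.

```cpp
#include <cstddef>

// Shape of the process tree: one master, some managers, workers per manager.
struct TreeShape {
    std::size_t managers;             // managers attached to the single master
    std::size_t workers_per_manager;  // workers attached to each manager
};

// Total number of processes: 1 master + managers + all workers.
constexpr std::size_t total_processes(TreeShape t) {
    return 1 + t.managers + t.managers * t.workers_per_manager;
}

// Assumption: one inter-communicator per parent-child link in the tree.
constexpr std::size_t total_links(TreeShape t) {
    return t.managers + t.managers * t.workers_per_manager;
}

// Inter-communicators held by the master (one per manager).
constexpr std::size_t master_links(TreeShape t) {
    return t.managers;
}

// Inter-communicators held by each manager (one to the master, one per worker).
constexpr std::size_t manager_links(TreeShape t) {
    return 1 + t.workers_per_manager;
}
```

With TreeShape{21, 8} this gives 190 processes and 189 links in total, of which the master holds 21 and each manager holds 9. Under the same assumption, adding a ninth worker per manager raises the totals to 211 processes and 210 links.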