[mpich-discuss] Is PMIx working well for ch4:ucx? Intermittent Seg Fault in MPIR_pmi_init()

Zhou, Hui zhouh at anl.gov
Mon Jan 4 15:08:10 CST 2021


Hi Martin,

PMIx with MPICH is not well tested with a large number of nodes, and we are not aware of the issue you described. We'll look into it. Meanwhile, if you could file a GitHub issue, that would help us track it better.

Thanks for reporting it.

--
Hui Zhou


From: Audet, Martin via discuss <discuss at mpich.org>
Date: Tuesday, December 22, 2020 at 5:16 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca>, Raymond, Stephane <Stephane.Raymond at cnrc-nrc.gc.ca>
Subject: [mpich-discuss] Is PMIx working well for ch4:ucx? Intermittent Seg Fault in MPIR_pmi_init()
Hello MPICH_Users &&  MPICH_Developers,

Is the new MPICH version 3.4rc1 supposed to work well with PMIx when using the ch4:ucx device, or is it still considered experimental?

Yesterday, when trying it with my usual “hellompihost2.cpp” test program on our cluster, it turned out that it caused intermittent segmentation faults in the MPIR_pmi_init() initialization function, which is indirectly called by MPI_Init(). It seems that the higher the number of “processors” (MPI ranks) involved, the higher the chance of hitting this problem. With 24 nodes, each with 24 or 48 “processors” (MPI ranks), the probability seems to be 100%. When keeping 24 “processors” per node, the probability of hitting the problem increases rapidly when going from 9 to 10 nodes (i.e. at 9 nodes it works most of the time and at 10 nodes it fails most of the time).
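
To give an idea of how these probabilities were estimated, a sweep along the following lines can be used; the loop itself is only a sketch, and only the srun options resemble the real runs shown at the end of this message:

for nodes in 8 9 10 12 16 24; do
    for trial in 1 2 3 4 5; do
        if srun --nodes=$nodes --ntasks-per-node=24 --exclusive ./hellompihost2_ch4_ucx_34rc1 > /dev/null 2>&1; then
            echo "$nodes nodes, trial $trial: OK"
        else
            echo "$nodes nodes, trial $trial: FAILED"
        fi
    done
done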

Note that my “hellompihost2.cpp” program works very well with Open MPI, and also with the same MPICH 3.4rc1 when built with the ch3:sock or ch3:nemesis channel (using the PMI2 startup mechanism), as illustrated below.
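
To be explicit about the two startup paths: the ch3 builds go through Slurm's PMI2 plugin, while the ch4:ucx build goes through PMIx. The explicit --mpi selections and the ch3 binary name below are only illustrative; they are not copied from our actual job scripts:

srun --mpi=pmi2 --nodes=16 --ntasks-per-node=24 --exclusive ./hellompihost2_ch3_nemesis        # works reliably
srun --mpi=pmix --nodes=16 --ntasks-per-node=24 --exclusive ./hellompihost2_ch4_ucx_34rc1      # intermittent segfault in MPIR_pmi_init()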

We use Slurm version 20.02.6 and PMIx 3.1.5. Our OS is CentOS 7.9 (latest kernel and packages) and we use MOFED 4.9.2.2.4.0. As you can see, our environment is not exotic at all and should be very common among MPICH users.

Below are the output of uname and mpichversion, the source code of hellompihost2.cpp, and a sample of the output when it crashes.

Thanks,

Martin Audet


[audetm at hn audetm]$ uname -a
Linux hn.galerkin.res.nrc.gc.ca 3.10.0-1160.6.1.el7.x86_64 #1 SMP Tue Nov 17 13:59:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[audetm at hn audetm]$ mpichversion
MPICH Version:       3.4rc1
MPICH Release date:  Thu Dec 10 14:41:59 CST 2020
MPICH Device:        ch4:ucx
MPICH configure:     --with-device=ch4:ucx --with-hcoll=/opt/mellanox/hcoll --with-pmix=/usr --prefix=/work/software/x86_64/mpi/mpich-ch4_ucx-3.4rc1 --enable-fast=all --enable-romio --with-file-system=ufs+nfs+lustre --enable-shared --enable-sharedlibs=gcc
MPICH CC: gcc -std=gnu99 -std=gnu99    -DNDEBUG -DNVALGRIND -O2
MPICH CXX:      g++   -DNDEBUG -DNVALGRIND -O2
MPICH F77:      gfortran   -O2
MPICH FC: gfortran   -O2
MPICH Custom Information:
[audetm at hn audetm]$


#include <mpi.h>
#include <unistd.h>
#include <sched.h>
#include <iostream>
#include <vector>
#include <sstream>
#include <string>

// Each MPI rank builds a one-line report with its rank, host name, PID and CPU affinity;
// rank 0 collects the reports from all ranks and prints them in rank order.
int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);

   std::ostringstream ostr;

   const MPI_Comm cur_comm = MPI_COMM_WORLD;

   int comm_rank, comm_size;

   MPI_Comm_rank(cur_comm, &comm_rank);
   MPI_Comm_size(cur_comm, &comm_size);

   char      name_buf[MPI_MAX_PROCESSOR_NAME+1];
   int       name_len;

   MPI_Get_processor_name(name_buf, &name_len);

   name_buf[name_len] = '\0';

   ostr << "rank " << comm_rank << " running on " << name_buf << " PID " << getpid() << " pinned to CPUs ";

   // Query this process's CPU affinity mask and list the CPUs it may run on.
   cpu_set_t cpuset;

   const int ret = sched_getaffinity(0, sizeof(cpuset), &cpuset);

   if (ret == 0) {
      bool prev = false;
      for (int i_cpu=0; i_cpu < CPU_SETSIZE; i_cpu++) {
          if (CPU_ISSET(i_cpu, &cpuset)) {
             if (prev) {
                ostr << ',';
             }
             ostr << i_cpu;
             prev = true;
          }
      }
   }
   else {
      ostr << "(unknown)";
   }

   const std::string msg(ostr.str());

   const int msg_len  = int(msg.size());
   const int msg_len2 = msg_len;

   enum { STR_LEN_TAG, STR_VAL_TAG };

   const int ROOT_RANK = 0;

   MPI_Request req_tbl[2];

   // Every rank (including the root) sends its report length and then the report itself to rank 0.
   MPI_Isend(&msg_len2,        1, MPI_INT,  ROOT_RANK, STR_LEN_TAG, cur_comm, &req_tbl[0]);
   MPI_Isend(msg.data(), msg_len, MPI_CHAR, ROOT_RANK, STR_VAL_TAG, cur_comm, &req_tbl[1]);

   // Rank 0 receives one report per rank and prints them in rank order.
   if (comm_rank == ROOT_RANK) {
      std::cout << "Running with " << comm_size << " process\n";

      std::vector<char> recv_msg;

      for (int i_rank=0; i_rank < comm_size; i_rank++) {
          int        recv_len;
          MPI_Status stat;

          MPI_Recv(&recv_len,           1, MPI_INT,  i_rank, STR_LEN_TAG, cur_comm, &stat);

          recv_msg.resize(recv_len+2);

          MPI_Recv(&recv_msg[0], recv_len, MPI_CHAR, i_rank, STR_VAL_TAG, cur_comm, &stat);

          recv_msg[recv_len  ] = '\n';
          recv_msg[recv_len+1] = '\0';

          std::cout << &recv_msg[0];
      }
   }
   std::cout << std::flush;

   MPI_Status stat_tbl[2];
   MPI_Waitall(2, req_tbl, stat_tbl);

   MPI_Finalize();

   return 0;
}
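
For completeness, the test binary used in the run below was presumably built with the MPICH C++ compiler wrapper from the ch4:ucx installation, along these lines (the exact command is an assumption, not copied from my build notes):

/work/software/x86_64/mpi/mpich-ch4_ucx-3.4rc1/bin/mpicxx -O2 hellompihost2.cpp -o hellompihost2_ch4_ucx_34rc1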

[audetm at hn audetm]$ time srun --nodes=16 --ntasks-per-node=48 --ntasks-per-core=2 --exclusive ./hellompihost2_ch4_ucx_34rc1
[cn12:29033:0:29033] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x905aea0)
[cn12:29032:0:29032] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9726690)
[cn12:29049:0:29049] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x423fea0)
[cn12:29034:0:29034] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3308690)
[cn9:29497:0:29497] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x986e460)
[cn9:29523:0:29523] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9a3f460)
[cn9:29491:0:29491] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x31d0460)
[cn9:29521:0:29521] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8a86460)
[cn12:29037:0:29037] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x55e5690)
[cn9:29493:0:29493] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x96de460)
==== backtrace (tid:  29032) ====
0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
1 0x0000000000305a6c MPIR_pmi_init()  :0
2 0x000000000031303f MPID_Init()  :0
3 0x0000000000202fa6 MPIR_Init_thread()  :0
4 0x0000000000202df2 PMPI_Init()  ???:0
5 0x0000000000401805 main()  ???:0
6 0x0000000000022555 __libc_start_main()  ???:0
7 0x0000000000401709 _start()  ???:0
=================================
[cn9:29485:0:29485] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4381c50)
==== backtrace (tid:  29033) ====
0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
1 0x0000000000305a6c MPIR_pmi_init()  :0
2 0x000000000031303f MPID_Init()  :0
3 0x0000000000202fa6 MPIR_Init_thread()  :0
4 0x0000000000202df2 PMPI_Init()  ???:0
5 0x0000000000401805 main()  ???:0
6 0x0000000000022555 __libc_start_main()  ???:0
7 0x0000000000401709 _start()  ???:0
=================================
==== backtrace (tid:  29049) ====
0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
1 0x0000000000305a6c MPIR_pmi_init()  :0
2 0x000000000031303f MPID_Init()  :0
3 0x0000000000202fa6 MPIR_Init_thread()  :0
4 0x0000000000202df2 PMPI_Init()  ???:0
5 0x0000000000401805 main()  ???:0
6 0x0000000000022555 __libc_start_main()  ???:0
7 0x0000000000401709 _start()  ???:0
=================================
[cn7:34315:0:34315] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xbbcd450)
[cn7:34310:0:34310] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x47b4c40)
[cn7:34311:0:34311] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x86f2c40)
[cn7:34320:0:34320] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2775c40)
[cn7:34322:0:34322] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8a5e450)
[cn2:34371:0:34371] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9e8fc70)
[cn2:34394:0:34394] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3036480)
[cn2:34370:0:34370] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2951480)
[cn12:29030:0:29030] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x5b4e690)
[cn11:29124:0:29124] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x843c470)
==== backtrace (tid:  29034) ====
0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
1 0x0000000000305a6c MPIR_pmi_init()  :0
2 0x000000000031303f MPID_Init()  :0
3 0x0000000000202fa6 MPIR_Init_thread()  :0
4 0x0000000000202df2 PMPI_Init()  ???:0
5 0x0000000000401805 main()  ???:0
6 0x0000000000022555 __libc_start_main()  ???:0
7 0x0000000000401709 _start()  ???:0
=================================
[cn12:29059:0:29059] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xa1f3690)
[cn7:34331:0:34331] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2630c40)
[cn12:29044:0:29044] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x639b690)
[cn11:29159:0:29159] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xaa1fc60)
[cn11:29126:0:29126] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x735bc60)
==== backtrace (tid:  29037) ====
0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
1 0x0000000000305a6c MPIR_pmi_init()  :0
2 0x000000000031303f MPID_Init()  :0
3 0x0000000000202fa6 MPIR_Init_thread()  :0
4 0x0000000000202df2 PMPI_Init()  ???:0
5 0x0000000000401805 main()  ???:0
6 0x0000000000022555 __libc_start_main()  ???:0
7 0x0000000000401709 _start()  ???:0
=================================