[mpich-discuss] MPI_Info key host error in SLURM

Iker Martín Álvarez martini at uji.es
Wed Feb 24 11:53:49 CST 2021


Hi,

I have been working with the MPI_Comm_spawn function, passing the key "host"
with a value through the Info argument, on a system managed by the SLURM
resource manager (slurm-wlm 17.11.2). The function works as expected when I
run the code directly on the machine SLURM would choose for the job, but
when I submit the same code with the *sbatch* command, an error arises. In
both cases, when I do not set the "host" key, everything works just fine.
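
For reference, the submission itself is nothing unusual: the job script
simply launches the binary with mpiexec and is handed to SLURM with sbatch
(for example "sbatch job.sh", where job.sh contains no more than
"mpiexec -n 2 ./spawn_test"; the script name, binary name and process count
here are only placeholders).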

The same code has been tested with MPICH 3.3.2 and 3.4.1, which give
different errors. I have also tried it with other implementations (Open MPI
and Intel MPI), which work as expected, creating the processes on the
indicated host.

I would like to use MPI_Comm_spawn to create processes on a host of my
choosing, so if there are other keys or values for the Info argument I could
try them, but I have not found any while looking through the MPICH
documentation.
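
In case it helps: as far as I can tell, the only keys the MPI standard
itself reserves for MPI_Comm_spawn are "host", "arch", "wdir", "path",
"file" and "soft", and of these only "host" controls placement. A minimal
sketch of the ones I could still try (the helper name and every value below
are just examples of mine):

#include <mpi.h>

/* Hypothetical helper: fills an Info object with some of the keys the
 * MPI standard reserves for MPI_Comm_spawn. Only "host" affects where
 * the children run; all values are placeholders. */
static MPI_Info make_spawn_info(void) {
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "host", "n00");  /* target node (example name) */
  MPI_Info_set(info, "wdir", "/tmp"); /* working directory of the children */
  MPI_Info_set(info, "path", "/tmp"); /* directories searched for the binary */
  MPI_Info_set(info, "soft", "1:4");  /* accept any child count from 1 to 4 */
  return info;
}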

Here is the code I have been using:

#include <stdio.h>
#include <mpi.h>

#define ROOT 0

int main(int argc, char ** argv) {
  int myId, numP;
  MPI_Info info;
  MPI_Comm comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myId);
  MPI_Comm_size(MPI_COMM_WORLD, &numP);

  /* Spawned children have a parent communicator; the originals do not. */
  MPI_Comm comm_par;
  MPI_Comm_get_parent(&comm_par);
  if(comm_par != MPI_COMM_NULL) {
    if(myId == ROOT) {
      printf("SONS\n"); fflush(stdout);
    }
  } else {
    if(myId == ROOT) {
      printf("FATHERS\n"); fflush(stdout);
    }
    /* Ask the runtime to place all the children on node "n00". */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "n00");
    MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, numP, info, ROOT,
                   MPI_COMM_WORLD, &comm, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
  }
  MPI_Finalize();
  return 0;
}
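
For what it is worth, with numP initial processes the spawn creates numP
children, so the expected output is a single "FATHERS" line from the
parents followed by a single "SONS" line from the children.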

With MPICH 3.4.1 there is sometimes no error and the code simply hangs at
the MPI_Comm_spawn call; other times this error is shown:
Abort(1615120) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:
Internal MPI error!, error stack:
MPIR_Init_thread(152)...:
MPID_Init(562)..........:
MPIDU_Init_shm_init(195):
Init_shm_barrier(94)....: Internal MPI error!  barrier not initialized

And this is the error output for MPICH 3.3.2:

Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c
at line 683: our_pg_rank < pg->size
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c
at line 683: our_pg_rank < pg->size
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c
at line 683: our_pg_rank < pg->size
libbacktrace: no debug info in ELF executable
(previous line repeated 12 times in total)
internal ABORT - process 0

Thanks, Iker