[mpich-discuss] MPI_Info key host error in SLURM

Raffenetti, Kenneth J. raffenet at mcs.anl.gov
Fri Feb 26 09:15:48 CST 2021


Hi,

Could you add "-v" to your mpiexec command and provide the output? The "host" info key is handled by the process manager when executing the spawn command.
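
For reference, a sketch of what that could look like inside the batch job (node counts, script contents, and the binary name are assumptions, not taken from the report):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2
# Job-script sketch: add -v so Hydra prints verbose launch information,
# including how it handles the "host" info key when the spawn is executed.
# "./spawn_test" and "-n 1" are placeholders for the actual launch line.
mpiexec -v -n 1 ./spawn_test
```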

Ken

On 2/24/21, 11:55 AM, "Iker Martín Álvarez via discuss" <discuss at mpich.org> wrote:

    Hi,
    I have been experimenting with the MPI_Comm_spawn function and the info key "host" (with a hostname value) on a system managed by SLURM (slurm-wlm 17.11.2). The function works as expected, but when I submit the code with the sbatch command, an error arises. The error does not occur when I run the code directly on the same machine SLURM chose for the sbatch job. Without the "host" key, it works fine in both cases.
    
    The same code has been tested with MPICH 3.3.2 and 3.4.1, which give different errors. I also tried it with other implementations (Open MPI and Intel MPI), which work as expected, creating the processes on the indicated host.
    
    My goal is to create processes with MPI_Comm_spawn on a chosen host, so if there are other info keys I could try for this purpose, I would be glad to test them, but I have not found any in the MPICH documentation.
    
    Here is the code I have been using:
    
    #include <stdio.h>
    #include <mpi.h>
    
    #define ROOT 0  /* assumed; the original listing did not define ROOT */
    
    int main(int argc, char ** argv) {
    
      int myId, numP;
      MPI_Info info;
      MPI_Comm comm;
    
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myId);
      MPI_Comm_size(MPI_COMM_WORLD, &numP);
    
      MPI_Comm comm_par;
      MPI_Comm_get_parent(&comm_par);
      if(comm_par != MPI_COMM_NULL) {
        if(myId == ROOT) {
          printf("SONS\n"); fflush(stdout);
        }
      } else {
        if(myId == ROOT) {
          printf("FATHERS\n"); fflush(stdout);
        }
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "n00");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, numP, info, ROOT,
                       MPI_COMM_WORLD, &comm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
      }
      MPI_Finalize();
      return 0;
    }
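    
    A hedged sketch of how the listing above might be built and submitted (the compiler wrapper invocation, script name, and node list are assumptions; one possibly relevant detail is whether "n00" is inside the sbatch allocation, since process managers typically launch spawned processes only on allocated nodes):
    
    ```shell
    # Build the test program with the MPICH compiler wrapper.
    mpicc -g -o spawn_test spawn_test.c
    
    # Submit under SLURM; requesting n00 explicitly would ensure the host
    # named in the "host" info key is part of the allocation (the hostname
    # and node count are assumptions, and job.sh is a hypothetical script).
    sbatch --nodes=2 --nodelist=n00 job.sh
    ```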
    
    
    With MPICH 3.4.1, sometimes there is no error and the program simply hangs in MPI_Comm_spawn; other times this error is shown:
    Abort(1615120) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Internal MPI error!, error stack:
    MPIR_Init_thread(152)...:
    MPID_Init(562)..........:
    MPIDU_Init_shm_init(195):
    Init_shm_barrier(94)....: Internal MPI error!  barrier not initialized
    
    
    And the error output for MPICH 3.3.2:
    
    Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
    Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
    Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
    libbacktrace: no debug info in ELF executable
    (the libbacktrace line is repeated 12 times)
    internal ABORT - process 0
    
    Thanks, Iker


