[mpich-discuss] MPI_Info key host error in SLURM

Raffenetti, Kenneth J. raffenet at mcs.anl.gov
Tue Mar 2 16:03:43 CST 2021


Thanks for sending. I haven't found time to look closely yet. Will update you asap.

Ken

On 3/1/21, 1:38 PM, "Iker Martín Álvarez" <martini at uji.es> wrote:

    Hi Ken,
    Thanks for your reply.
    Attached are two files with the output obtained using the "-v" argument. BatchOutput.txt contains the output, including the error message, produced when the code is submitted with the sbatch command, while InteractiveOutput.txt contains the output produced when the code is executed interactively on the node, where it works as expected. Both were compiled and executed with MPICH 3.4.1.
    
    Thanks, Iker
    
    
    On Fri, Feb 26, 2021 at 16:15, Raffenetti, Kenneth J. (<raffenet at mcs.anl.gov>) wrote:
    
    
    Hi,
    
    Could you add "-v" to your mpiexec command and provide the output? The "host" info key is handled by the process manager when executing the spawn command.
    
    Ken
    
    On 2/24/21, 11:55 AM, "Iker Martín Álvarez via discuss" <discuss at mpich.org> wrote:
    
        Hi,
        I have been working with the MPI_Comm_spawn function, using the "host" info key with a value, on a system with the SLURM resource manager (slurm-wlm 17.11.2). The function works as expected when run interactively, but when I submit the code with the sbatch command, an error arises. This does not happen when I execute the code directly on the same machine that SLURM chose when the job was submitted with sbatch. Both cases use the "host" key; when I do not use the key, it works just fine.
    
        The same code has been tested with MPICH 3.3.2 and 3.4.1, which give different errors. I also tried other implementations (OpenMPI and Intel MPI), which work as expected and create the processes on the indicated host.
    
        I would like to create processes with MPI_Comm_spawn on an assigned host, so if there are other info keys I could use for this, I would try them, but I have not found any in the MPICH documentation.
    
        Here is the code I have been using:
    
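        #include <mpi.h>
        #include <stdio.h>

        #define ROOT 0  /* assumed value; ROOT was not defined in the posted snippet */

        int main(int argc, char ** argv) {
    
          int myId, numP;
          MPI_Info info;
          MPI_Comm comm;
    
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &myId);
          MPI_Comm_size(MPI_COMM_WORLD, &numP);
    
          /* A non-null parent communicator means this process was spawned. */
          MPI_Comm comm_par;
          MPI_Comm_get_parent(&comm_par);
          if(comm_par != MPI_COMM_NULL ) {
            if(myId == ROOT) {
              printf("SONS\n"); fflush(stdout);
            }
          } else {
            if(myId == ROOT) {
              printf("FATHERS\n"); fflush(stdout);
            }
            /* Ask for the children to be placed on host "n00". */
            MPI_Info_create(&info);
            MPI_Info_set(info, "host", "n00");
            /* Spawn as many children as there are parents, running the same binary. */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, numP, info, ROOT, MPI_COMM_WORLD, &comm, MPI_ERRCODES_IGNORE);
          }
          MPI_Finalize();
          return 0;
        }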
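
        Since the behaviour differs between sbatch and an interactive run, a small diagnostic like the one below might help compare the two environments. It is only a sketch and is not part of the program above: the helper names env_or_unset and describe_allocation are hypothetical, and it assumes SLURM exports SLURM_JOB_NODELIST inside the job. It does not call MPI at all; it only reports the local host name and the node list, so the value passed to the "host" key can be checked against them.

        ```c
        #define _POSIX_C_SOURCE 200112L  /* for gethostname() under strict C modes */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        /* Hypothetical helper: value of an environment variable,
         * or "(unset)" when it is missing or empty. */
        const char *env_or_unset(const char *name) {
            const char *value = getenv(name);
            return (value != NULL && value[0] != '\0') ? value : "(unset)";
        }

        /* Hypothetical helper: writes "host=<name> nodelist=<list>" into buf.
         * Returns the snprintf result (characters that would be written). */
        int describe_allocation(char *buf, size_t len) {
            char host[256];
            if (gethostname(host, sizeof(host)) != 0) {
                strcpy(host, "(unknown)");
            }
            host[sizeof(host) - 1] = '\0';  /* gethostname may not terminate on truncation */
            return snprintf(buf, len, "host=%s nodelist=%s",
                            host, env_or_unset("SLURM_JOB_NODELIST"));
        }
        ```

        Calling describe_allocation on a local buffer and printing the result just before MPI_Comm_spawn, in both the sbatch and the interactive run, would show whether "n00" appears in the reported node list in each case.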
    
    
        With MPICH 3.4.1, sometimes no error is reported and the code hangs in the MPI_Comm_spawn function; other times this error is shown:
        Abort(1615120) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Internal MPI error!, error stack:
        MPIR_Init_thread(152)...:
        MPID_Init(562)..........:
        MPIDU_Init_shm_init(195):
        Init_shm_barrier(94)....: Internal MPI error!  barrier not initialized
    
    
        The error output for MPICH 3.3.2 is:
    
        Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
        Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
        Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        libbacktrace: no debug info in ELF executable
        internal ABORT - process 0
    
        Thanks, Iker
    
    


