[mpich-discuss] MPI_Info key host error in SLURM
Raffenetti, Kenneth J.
raffenet at mcs.anl.gov
Fri Feb 26 09:15:48 CST 2021
Hi,
Could you add "-v" to your mpiexec command and provide the output? The "host" info key is handled by the process manager when executing the spawn command.
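For illustration, a verbose invocation might look something like the following (the binary name `spawn_test` and the process count are placeholders, not from the original message):

```shell
# Hypothetical invocation; "spawn_test" and "-n 2" are placeholders.
# -v makes the Hydra process manager print its internal state, including
# how it handles the "host" info key when executing the spawn command.
mpiexec -v -n 2 ./spawn_test
```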
Ken
On 2/24/21, 11:55 AM, "Iker Martín Álvarez via discuss" <discuss at mpich.org> wrote:
Hi,
I have been working with the MPI_Comm_spawn function and the "host" info key (with a value) on a system managed by the SLURM resource manager (slurm-wlm 17.11.2). The function works as expected when I run the program directly, but when I submit it with the sbatch command, an error arises. The error does not occur when I execute the code directly on the very machine SLURM had chosen for the sbatch job. Without the "host" key, it works just fine in both cases.
I have tested the same code with MPICH 3.3.2 and 3.4.1, which produce different errors. I also tried other implementations (Open MPI and Intel MPI), and both work as expected, creating the processes on the indicated host.
My goal is to create processes with MPI_Comm_spawn on an assigned host, so if there are other key/value pairs for the Info argument I could try, I would be happy to test them, but I have not found any in the MPICH documentation.
Here is the code I have been using:
#include <mpi.h>
#include <stdio.h>

#define ROOT 0

int main(int argc, char **argv) {
    int myId, numP;
    MPI_Info info;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myId);
    MPI_Comm_size(MPI_COMM_WORLD, &numP);

    MPI_Comm comm_par;
    MPI_Comm_get_parent(&comm_par);

    if (comm_par != MPI_COMM_NULL) {
        /* Spawned processes have a parent communicator. */
        if (myId == ROOT) {
            printf("SONS\n"); fflush(stdout);
        }
    } else {
        /* Original processes: spawn numP copies of this binary on host "n00". */
        if (myId == ROOT) {
            printf("FATHERS\n"); fflush(stdout);
        }
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "n00");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, numP, info, ROOT,
                       MPI_COMM_WORLD, &comm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
    }

    MPI_Finalize();
    return 0;
}
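For reference, the failing case is submission through sbatch; a minimal batch script along these lines (job name, node count, and binary name are placeholders, not from the original message) reproduces the setup:

```shell
#!/bin/bash
#SBATCH --job-name=spawn_test   # placeholder job name
#SBATCH --nodes=2               # placeholder: the allocation must include "n00"
#SBATCH --ntasks=2              # placeholder task count

# Launch with MPICH's mpiexec inside the SLURM allocation; the program
# then tries to spawn its children on host "n00" via the info key.
mpiexec -n 2 ./spawn_test
```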
With MPICH 3.4.1, sometimes there is no error message and the code simply hangs at the MPI_Comm_spawn call; other times this error is shown:
Abort(1615120) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(152)...:
MPID_Init(562)..........:
MPIDU_Init_shm_init(195):
Init_shm_barrier(94)....: Internal MPI error! barrier not initialized
And this is the error output for MPICH 3.3.2:
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank < pg->size
libbacktrace: no debug info in ELF executable
[previous line repeated 12 times in total]
internal ABORT - process 0
Thanks, Iker