[mpich-discuss] MPI_Info key host error in SLURM
Iker Martín Álvarez
martini at uji.es
Wed Feb 24 11:53:49 CST 2021
Hi,
I have been working with the MPI_Comm_spawn function, using the "host" info
key with a value, on a system managed by the SLURM resource manager
(slurm-wlm 17.11.2). The function works as expected when I run the program
directly, but when I submit it with the sbatch command, an error arises.
The error does not appear when I run the program directly on the same
machine that SLURM chose when the job was submitted with *sbatch*. In both
cases the "host" key is set; when I do not use the key, everything works
fine.
The same code has been tested with MPICH 3.3.2 and 3.4.1, which give
different errors. I also tried other implementations (OpenMPI and Intel
MPI), which work as expected and create the processes on the indicated
host.
I would like MPI_Comm_spawn to create the processes on a specific host, so
if there are other keys I could pass in the Info argument, I would be glad
to try them, but I have not found any in the MPICH documentation.
Here is the code I have been using:
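#include <stdio.h>
#include <mpi.h>

#define ROOT 0  /* assumed value; the definition was not in the original post */

int main(int argc, char **argv) {
    int myId, numP;
    MPI_Info info;
    MPI_Comm comm;      /* intercommunicator to the spawned processes */
    MPI_Comm comm_par;  /* parent intercommunicator (MPI_COMM_NULL if not spawned) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myId);
    MPI_Comm_size(MPI_COMM_WORLD, &numP);
    MPI_Comm_get_parent(&comm_par);

    if (comm_par != MPI_COMM_NULL) {
        /* Spawned processes */
        if (myId == ROOT) {
            printf("SONS\n"); fflush(stdout);
        }
    } else {
        /* Original processes: spawn numP copies of this program on host n00 */
        if (myId == ROOT) {
            printf("FATHERS\n"); fflush(stdout);
        }
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "n00");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, numP, info, ROOT,
                       MPI_COMM_WORLD, &comm, MPI_ERRCODES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}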
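The host name "n00" is hard-coded above, so it has to match the node SLURM
actually assigns. A minimal sketch of one way to avoid that, assuming SLURM
exports the SLURMD_NODENAME environment variable to the job (the helper
name spawn_host is hypothetical and not part of the program above, and
whether this changes the behaviour under sbatch has not been tested):

```c
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the original test program.
 * Returns the node name to use as the value of the "host" info key:
 * the contents of SLURMD_NODENAME if that variable is set and non-empty
 * (assumed to be exported by SLURM inside the job), otherwise the
 * caller-supplied fallback (for example "n00").
 */
const char *spawn_host(const char *fallback)
{
    const char *node = getenv("SLURMD_NODENAME");

    if (node != NULL && node[0] != '\0')
        return node;
    return fallback;
}
```

With this helper, the call in the program above would become
MPI_Info_set(info, "host", spawn_host("n00")); and the rest stays the same.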
With MPICH 3.4.1, sometimes no error is reported and the program simply
stops at the MPI_Comm_spawn call; other times this error is shown:
Abort(1615120) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init:
Internal MPI error!, error stack:
MPIR_Init_thread(152)...:
MPID_Init(562)..........:
MPIDU_Init_shm_init(195):
Init_shm_barrier(94)....: Internal MPI error! barrier not initialized
And this is the error output with MPICH 3.3.2:
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c
at line 683: our_pg_rank < pg->size
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c
at line 683: our_pg_rank < pg->size
Assertion failed in file src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c
at line 683: our_pg_rank < pg->size
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
libbacktrace: no debug info in ELF executable
internal ABORT - process 0
Thanks, Iker