[mpich-discuss] MPI_Info key host error in SLURM

Iker Martín Álvarez martini at uji.es
Wed Mar 24 10:43:44 CDT 2021


Sorry for the late reply,

I tried using that option, but the outcome was the same.
However, with the info key "hostfile" and a file containing the
following text, it works as expected. In this example, it creates 10 and 5
processes on n00 and n01, respectively:

n00:10
n01:5
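
For reference, the spawn call with this key looks roughly like the following
(a minimal sketch, not the exact program; the file above is assumed to be
saved as "hosts.txt", and only the original job spawns so the children do not
spawn again):

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Comm parent, children;
  MPI_Info info;

  MPI_Init(&argc, &argv);
  MPI_Comm_get_parent(&parent);

  if (parent == MPI_COMM_NULL) {  /* only the original job spawns */
    MPI_Info_create(&info);
    /* "hostfile" points to the file shown above (n00:10, n01:5) */
    MPI_Info_set(info, "hostfile", "hosts.txt");
    MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 15, info, 0,
                   MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
  }

  MPI_Finalize();
  return 0;
}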

While trying this key, I noticed that if the parent processes try to finish
their execution while their children are still doing some work, the parents
block in MPI_Finalize until the children call that function too.
This happens even if all processes in both groups call
MPI_Comm_disconnect. I think this happens because the two groups are
still connected, and therefore the parents wait until the children also call
MPI_Finalize (see the sketch below).
Could it be that I am missing something?
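
The pattern described above is roughly the following (a minimal sketch of the
disconnect/finalize sequence, reusing the "hosts.txt" assumption from the
sketch above; it is not the full program):

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Comm intercomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_get_parent(&intercomm);

  if (intercomm == MPI_COMM_NULL) {
    /* Parent group: spawn the children, then try to leave early */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "hostfile", "hosts.txt");
    MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 15, info, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);

    MPI_Comm_disconnect(&intercomm);  /* every parent calls this */
    MPI_Finalize();                   /* still blocks until the children finalize */
  } else {
    /* Child group: do some work, then disconnect from the parents */
    MPI_Comm_disconnect(&intercomm);  /* every child calls this */
    MPI_Finalize();
  }
  return 0;
}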

I based my conclusion on what is stated here:
https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node226.htm

Thanks, Iker

On Thu, Mar 4, 2021 at 6:31 PM, Raffenetti, Kenneth J. (<
raffenet at mcs.anl.gov>) wrote:

> I notice in the output that the sbatch version uses --launcher slurm while
> the interactive version uses --launcher ssh. Can you try adding --launcher
> ssh to the mpiexec command of your sbatch script and see if it makes a
> difference?
>
> Ken
>
> On 3/1/21, 1:38 PM, "Iker Martín Álvarez" <martini at uji.es> wrote:
>
>     Hi Ken,
>     Thanks for your reply.
>     Attached are two files with the output from adding "-v". BatchOutput.txt
> contains the output with the error message when the code is executed with
> the sbatch command, while InteractiveOutput.txt is from the run executed
> interactively on the node, which works as expected. Both were compiled and
> executed with MPICH 3.4.1.
>
>     Thanks, Iker
>
>
>     On Fri, Feb 26, 2021 at 4:15 PM, Raffenetti, Kenneth J. (<
> raffenet at mcs.anl.gov>) wrote:
>
>
>     Hi,
>
>     Could you add "-v" to your mpiexec command and provide the output? The
> "host" info key is handled by the process manager when executing the spawn
> command.
>
>     Ken
>
>     On 2/24/21, 11:55 AM, "Iker Martín Álvarez via discuss" <
> discuss at mpich.org> wrote:
>
>         Hi,
>         I have been working with the MPI_Comm_spawn function and the
> "host" info key (with a value) on a system managed by SLURM (slurm-wlm
> 17.11.2). The function works as expected, but when I submit the code with
> the sbatch command, an error arises. This does not happen when I execute
> the code directly on the same machine that SLURM chose when the job was
> submitted with sbatch. In both cases, when I do not use the "host" key, it
> works just fine.
>
>         The same code has been tested with MPICH 3.3.2 and 3.4.1, which
> give different errors. I also tried it with other implementations (OpenMPI
> and Intel MPI), which work as expected, creating the processes on the
> indicated host.
>
>         I would like to create processes with MPI_Comm_spawn on an
> assigned host, so if there are other keys for the Info argument I could try
> them, but I have not found any while looking through the MPICH
> documentation.
>
>         Here is the code I have been using:
>
>         #include <mpi.h>
>         #include <stdio.h>
>
>         #define ROOT 0
>
>         int main(int argc, char ** argv) {
>
>           int myId, numP;
>           MPI_Info info;
>           MPI_Comm comm;
>
>           MPI_Init(&argc, &argv);
>           MPI_Comm_rank(MPI_COMM_WORLD, &myId);
>           MPI_Comm_size(MPI_COMM_WORLD, &numP);
>
>           /* Check whether this process was spawned by MPI_Comm_spawn */
>           MPI_Comm comm_par;
>           MPI_Comm_get_parent(&comm_par);
>           if(comm_par != MPI_COMM_NULL) {
>             if(myId == ROOT) {
>               printf("SONS\n"); fflush(stdout);
>             }
>           } else {
>             if(myId == ROOT) {
>               printf("FATHERS\n"); fflush(stdout);
>             }
>             /* Spawn numP copies of this binary on host n00 */
>             MPI_Info_create(&info);
>             MPI_Info_set(info, "host", "n00");
>             MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, numP, info, ROOT,
>                            MPI_COMM_WORLD, &comm, MPI_ERRCODES_IGNORE);
>           }
>           MPI_Finalize();
>           return 0;
>         }
>
>
>         For MPICH 3.4.1, sometimes there is no error and the code simply
> stops at the MPI_Comm_spawn call; other times this error is shown:
>         Abort(1615120) on node 0 (rank 0 in comm 0): Fatal error in
> PMPI_Init: Internal MPI error!, error stack:
>         MPIR_Init_thread(152)...:
>         MPID_Init(562)..........:
>         MPIDU_Init_shm_init(195):
>         Init_shm_barrier(94)....: Internal MPI error!  barrier not
> initialized
>
>
>         Also, here is the error output for MPICH 3.3.2:
>
>         Assertion failed in file
> src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank
> < pg->size
>         Assertion failed in file
> src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank
> < pg->size
>         Assertion failed in file
> src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c at line 683: our_pg_rank
> < pg->size
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         libbacktrace: no debug info in ELF executable
>         internal ABORT - process 0
>
>         Thanks, Iker
>
>
>
>