[mpich-discuss] MPI_Comm_spawn and SLURM

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Fri Nov 26 17:05:27 CST 2021


I am attempting to run MPICH under SLURM for the first time, so there could be a lot of things I am doing wrong. All processes get launched, but the master process freezes in MPI_Comm_spawn. The stack trace is below, followed by the SLURM command I use to start the job. Even though all 20 processes are running, one per node as desired, srun reports that "Requested node configuration is not available". I am not sure whether that is why MPI_Comm_spawn is frozen. Thanks for any help.

#0  0x00007f0183d08a08 in poll () from /lib64/libc.so.6
#1  0x00007f0185aed611 in MPID_nem_tcp_connpoll (in_blocking_poll=<optimized out>)
    at ../mpich-4.0b1/src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c:1765
#2  0x00007f0185ad8186 in MPID_nem_mpich_blocking_recv (completions=<optimized out>, in_fbox=<synthetic pointer>,
    cell=<synthetic pointer>) at ../mpich-4.0b1/src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:947
#3  MPIDI_CH3I_Progress (progress_state=progress_state@entry=0x7ffcc4839760, is_blocking=is_blocking@entry=1)
    at ../mpich-4.0b1/src/mpid/ch3/channels/nemesis/src/ch3_progress.c:360
#4  0x00007f0185a8811a in MPIDI_Create_inter_root_communicator_accept (vc_pptr=<synthetic pointer>,
    comm_pptr=<synthetic pointer>, port_name=<optimized out>) at ../mpich-4.0b1/src/mpid/ch3/src/ch3u_port.c:417
#5  MPIDI_Comm_accept (port_name=<optimized out>, info=<optimized out>, root=0,
    comm_ptr=0x7f0185ef1540 <MPIR_Comm_builtin+832>, newcomm=0x7ffcc4839bb8)
    at ../mpich-4.0b1/src/mpid/ch3/src/ch3u_port.c:1176
#6  0x00007f0185ac3d45 in MPID_Comm_accept (
    port_name=port_name@entry=0x7ffcc4839a00 "tag#0$description#n001$port#55907$ifname#172.16.56.1$",
    info=info@entry=0x0, root=root@entry=0, comm=comm@entry=0x7f0185ef1540 <MPIR_Comm_builtin+832>,
    newcomm_ptr=newcomm_ptr@entry=0x7ffcc4839bb8) at ../mpich-4.0b1/src/mpid/ch3/src/mpid_port.c:130
#7  0x00007f0185a73285 in MPIDI_Comm_spawn_multiple (count=<optimized out>, commands=0x7ffcc4839b78,
    argvs=0x7ffcc4839b70, maxprocs=0x7ffcc4839b6c, info_ptrs=<optimized out>, root=<optimized out>,
    comm_ptr=0x7f0185ef1540 <MPIR_Comm_builtin+832>, intercomm=0x7ffcc4839bb8, errcodes=<optimized out>)
    at ../mpich-4.0b1/src/mpid/ch3/src/ch3u_comm_spawn_multiple.c:258
#8  0x00007f0185abec99 in MPID_Comm_spawn_multiple (count=count@entry=1,
    array_of_commands=array_of_commands@entry=0x7ffcc4839b78, array_of_argv=array_of_argv@entry=0x7ffcc4839b70,
    array_of_maxprocs=array_of_maxprocs@entry=0x7ffcc4839b6c,
    array_of_info_ptrs=array_of_info_ptrs@entry=0x7ffcc4839b60, root=root@entry=0,
    comm_ptr=0x7f0185ef1540 <MPIR_Comm_builtin+832>, intercomm=0x7ffcc4839bb8, array_of_errcodes=0x7ffcc4839c98)
    at ../mpich-4.0b1/src/mpid/ch3/src/mpid_comm_spawn_multiple.c:49
#9  0x00007f0185a34895 in MPIR_Comm_spawn_impl (command=<optimized out>, command@entry=0x995428 "NeedlesMpiMM",
    argv=<optimized out>, argv@entry=0x993f50, maxprocs=<optimized out>, maxprocs@entry=1, info_ptr=<optimized out>,
    root=root@entry=0, comm_ptr=comm_ptr@entry=0x7f0185ef1540 <MPIR_Comm_builtin+832>, p_intercomm_ptr=0x7ffcc4839bb8,
    array_of_errcodes=0x7ffcc4839c98) at ../mpich-4.0b1/src/mpi/spawn/spawn_impl.c:168
#10 0x00007f0185953637 in internal_Comm_spawn (array_of_errcodes=0x7ffcc4839c98, intercomm=0x7ffcc4839d7c,
    comm=1140850689, root=0, info=-1677721600, maxprocs=1, argv=<optimized out>, command=0x995428 "NeedlesMpiMM")
    at ../mpich-4.0b1/src/binding/c/spawn/comm_spawn.c:83
#11 PMPI_Comm_spawn (command=0x995428 "NeedlesMpiMM", argv=0x993f50, maxprocs=1, info=-1677721600, root=0,
    comm=1140850689, intercomm=0x7ffcc4839d7c, array_of_errcodes=0x7ffcc4839c98)
    at ../mpich-4.0b1/src/binding/c/spawn/comm_spawn.c:169
#12 0x000000000040cec3 in needles::NeedlesMpiMaster::spawnNewManager (this=0x995380, nodenum=0,
    host_name="n001.cluster.pssclabs.com", intercom=@0x7ffcc4839d7c: 67108864) at src/NeedlesMpiMaster.cpp:1432
#13 0x00000000004084cb in needles::NeedlesMpiMaster::init (this=0x995380, argc=23, argv=0x7ffcc483a548, rank=0,
    world_size=20) at src/NeedlesMpiMaster.cpp:246
#14 0x0000000000406799 in main (argc=23, argv=0x7ffcc483a548) at src/NeedlesMpiManagerMain.cpp:96
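
For reference, the spawn call that hangs (frames #11 and #12 above, NeedlesMpiMaster::spawnNewManager) boils down to roughly the sketch below. This is a simplified approximation rather than the actual source: the spawn_manager_on() wrapper, the "host" info key, MPI_ARGV_NULL and the parent communicator shown here are illustrative, while the command name "NeedlesMpiMM", maxprocs=1 and root=0 come straight from the trace.

/* Simplified approximation of NeedlesMpiMaster::spawnNewManager.
 * The "host" info key, wrapper name, MPI_ARGV_NULL and MPI_COMM_SELF are
 * illustrative; the command name, maxprocs=1 and root=0 match the trace. */
#include <mpi.h>

static int spawn_manager_on(const char *host_name, MPI_Comm *intercomm)
{
    MPI_Info info;
    int errcode = MPI_ERR_UNKNOWN;

    /* Ask MPICH to place the new process on a specific node. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_name);

    /* Spawn one manager process; the master never returns from this call. */
    MPI_Comm_spawn("NeedlesMpiMM", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, intercomm, &errcode);

    MPI_Info_free(&info);
    return errcode;
}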


Here is my salloc command to start the job. I want one task per node, reserving the rest of the cores on each node for spawning additional processes.


$ salloc --ntasks=20 --cpus-per-task=24 --verbose

Here is what salloc reports:

salloc: -------------------- --------------------
salloc: cpus-per-task       : 24
salloc: ntasks              : 20
salloc: verbose             : 1
salloc: -------------------- --------------------
salloc: end of defined options
salloc: Linear node selection plugin loaded with argument 4
salloc: select/cons_res loaded with argument 4
salloc: Cray/Aries node selection plugin loaded
salloc: select/cons_tres loaded with argument 4
salloc: Granted job allocation 34311
srun: error: Unable to create step for job 34311: Requested node configuration is not available
