<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">I am attempting to run MPICH under SLURM for the first time, so there may be several things I am doing wrong. All processes are launched, but the master process freezes in MPI_Comm_spawn. The stack trace is below, followed
by the salloc command I use to start the job. Even though all 20 processes are running, one per node as desired, the salloc output reports "Requested node configuration is not available". I am not sure whether that is why MPI_Comm_spawn hangs. Thanks for any help.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">#0 0x00007f0183d08a08 in poll () from /lib64/libc.so.6<o:p></o:p></p>
<p class="MsoNormal">#1 0x00007f0185aed611 in MPID_nem_tcp_connpoll (in_blocking_poll=&lt;optimized out&gt;)<o:p></o:p></p>
<p class="MsoNormal"> at ../mpich-4.0b1/src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c:1765<o:p></o:p></p>
<p class="MsoNormal">#2 0x00007f0185ad8186 in MPID_nem_mpich_blocking_recv (completions=&lt;optimized out&gt;, in_fbox=&lt;synthetic pointer&gt;,<o:p></o:p></p>
<p class="MsoNormal"> cell=&lt;synthetic pointer&gt;) at ../mpich-4.0b1/src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:947<o:p></o:p></p>
<p class="MsoNormal">#3 MPIDI_CH3I_Progress (progress_state=progress_state@entry=0x7ffcc4839760, is_blocking=is_blocking@entry=1)<o:p></o:p></p>
<p class="MsoNormal"> at ../mpich-4.0b1/src/mpid/ch3/channels/nemesis/src/ch3_progress.c:360<o:p></o:p></p>
<p class="MsoNormal">#4 0x00007f0185a8811a in MPIDI_Create_inter_root_communicator_accept (vc_pptr=&lt;synthetic pointer&gt;,<o:p></o:p></p>
<p class="MsoNormal"> comm_pptr=&lt;synthetic pointer&gt;, port_name=&lt;optimized out&gt;) at ../mpich-4.0b1/src/mpid/ch3/src/ch3u_port.c:417<o:p></o:p></p>
<p class="MsoNormal">#5 MPIDI_Comm_accept (port_name=&lt;optimized out&gt;, info=&lt;optimized out&gt;, root=0,<o:p></o:p></p>
<p class="MsoNormal"> comm_ptr=0x7f0185ef1540 &lt;MPIR_Comm_builtin+832&gt;, newcomm=0x7ffcc4839bb8)<o:p></o:p></p>
<p class="MsoNormal"> at ../mpich-4.0b1/src/mpid/ch3/src/ch3u_port.c:1176<o:p></o:p></p>
<p class="MsoNormal">#6 0x00007f0185ac3d45 in MPID_Comm_accept (<o:p></o:p></p>
<p class="MsoNormal"> port_name=port_name@entry=0x7ffcc4839a00 "tag#0$description#n001$port#55907$ifname#172.16.56.1$",<o:p></o:p></p>
<p class="MsoNormal"> info=info@entry=0x0, root=root@entry=0, comm=comm@entry=0x7f0185ef1540 &lt;MPIR_Comm_builtin+832&gt;,<o:p></o:p></p>
<p class="MsoNormal"> newcomm_ptr=newcomm_ptr@entry=0x7ffcc4839bb8) at ../mpich-4.0b1/src/mpid/ch3/src/mpid_port.c:130<o:p></o:p></p>
<p class="MsoNormal">#7 0x00007f0185a73285 in MPIDI_Comm_spawn_multiple (count=&lt;optimized out&gt;, commands=0x7ffcc4839b78,<o:p></o:p></p>
<p class="MsoNormal"> argvs=0x7ffcc4839b70, maxprocs=0x7ffcc4839b6c, info_ptrs=&lt;optimized out&gt;, root=&lt;optimized out&gt;,<o:p></o:p></p>
<p class="MsoNormal"> comm_ptr=0x7f0185ef1540 &lt;MPIR_Comm_builtin+832&gt;, intercomm=0x7ffcc4839bb8, errcodes=&lt;optimized out&gt;)<o:p></o:p></p>
<p class="MsoNormal"> at ../mpich-4.0b1/src/mpid/ch3/src/ch3u_comm_spawn_multiple.c:258<o:p></o:p></p>
<p class="MsoNormal">#8 0x00007f0185abec99 in MPID_Comm_spawn_multiple (count=count@entry=1,<o:p></o:p></p>
<p class="MsoNormal"> array_of_commands=array_of_commands@entry=0x7ffcc4839b78, array_of_argv=array_of_argv@entry=0x7ffcc4839b70,<o:p></o:p></p>
<p class="MsoNormal"> array_of_maxprocs=array_of_maxprocs@entry=0x7ffcc4839b6c,<o:p></o:p></p>
<p class="MsoNormal"> array_of_info_ptrs=array_of_info_ptrs@entry=0x7ffcc4839b60, root=root@entry=0,<o:p></o:p></p>
<p class="MsoNormal"> comm_ptr=0x7f0185ef1540 &lt;MPIR_Comm_builtin+832&gt;, intercomm=0x7ffcc4839bb8, array_of_errcodes=0x7ffcc4839c98)<o:p></o:p></p>
<p class="MsoNormal"> at ../mpich-4.0b1/src/mpid/ch3/src/mpid_comm_spawn_multiple.c:49<o:p></o:p></p>
<p class="MsoNormal">#9 0x00007f0185a34895 in MPIR_Comm_spawn_impl (command=&lt;optimized out&gt;, command@entry=0x995428 "NeedlesMpiMM",<o:p></o:p></p>
<p class="MsoNormal"> argv=&lt;optimized out&gt;, argv@entry=0x993f50, maxprocs=&lt;optimized out&gt;, maxprocs@entry=1, info_ptr=&lt;optimized out&gt;,<o:p></o:p></p>
<p class="MsoNormal"> root=root@entry=0, comm_ptr=comm_ptr@entry=0x7f0185ef1540 &lt;MPIR_Comm_builtin+832&gt;, p_intercomm_ptr=0x7ffcc4839bb8,<o:p></o:p></p>
<p class="MsoNormal"> array_of_errcodes=0x7ffcc4839c98) at ../mpich-4.0b1/src/mpi/spawn/spawn_impl.c:168<o:p></o:p></p>
<p class="MsoNormal">#10 0x00007f0185953637 in internal_Comm_spawn (array_of_errcodes=0x7ffcc4839c98, intercomm=0x7ffcc4839d7c,<o:p></o:p></p>
<p class="MsoNormal"> comm=1140850689, root=0, info=-1677721600, maxprocs=1, argv=&lt;optimized out&gt;, command=0x995428 "NeedlesMpiMM")<o:p></o:p></p>
<p class="MsoNormal"> at ../mpich-4.0b1/src/binding/c/spawn/comm_spawn.c:83<o:p></o:p></p>
<p class="MsoNormal">#11 PMPI_Comm_spawn (command=0x995428 "NeedlesMpiMM", argv=0x993f50, maxprocs=1, info=-1677721600, root=0,<o:p></o:p></p>
<p class="MsoNormal"> comm=1140850689, intercomm=0x7ffcc4839d7c, array_of_errcodes=0x7ffcc4839c98)<o:p></o:p></p>
<p class="MsoNormal"> at ../mpich-4.0b1/src/binding/c/spawn/comm_spawn.c:169<o:p></o:p></p>
<p class="MsoNormal">#12 0x000000000040cec3 in needles::NeedlesMpiMaster::spawnNewManager (this=0x995380, nodenum=0,<o:p></o:p></p>
<p class="MsoNormal"> host_name="n001.cluster.pssclabs.com", intercom=@0x7ffcc4839d7c: 67108864) at src/NeedlesMpiMaster.cpp:1432<o:p></o:p></p>
<p class="MsoNormal">#13 0x00000000004084cb in needles::NeedlesMpiMaster::init (this=0x995380, argc=23, argv=0x7ffcc483a548, rank=0,<o:p></o:p></p>
<p class="MsoNormal"> world_size=20) at src/NeedlesMpiMaster.cpp:246<o:p></o:p></p>
<p class="MsoNormal">#14 0x0000000000406799 in main (argc=23, argv=0x7ffcc4839a548) at src/NeedlesMpiManagerMain.cpp:96<o:p></o:p></p>
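<p class="MsoNormal">For reference, the port_name string passed to MPID_Comm_accept in the trace ("tag#0$description#n001$port#55907$ifname#172.16.56.1$") can be split into fields to see which host, port, and interface the master is waiting on. The following is only a minimal sketch: it assumes the string is a sequence of key#value pairs each terminated by $, which is inferred from the string itself rather than from MPICH documentation, and parsePortName is a hypothetical helper, not an MPICH function.<o:p></o:p></p>

```cpp
#include <map>
#include <string>

// Hypothetical helper (NOT part of MPICH). Splits a string of the assumed
// form "key#value$key#value$" into a key -> value map. The format is an
// assumption inferred from the port_name value shown in the stack trace.
std::map<std::string, std::string> parsePortName(const std::string& portName) {
    std::map<std::string, std::string> fields;
    std::size_t start = 0;
    while (start < portName.size()) {
        // Each pair ends at the next '$' (or at the end of the string).
        std::size_t end = portName.find('$', start);
        if (end == std::string::npos) {
            end = portName.size();
        }
        const std::string pair = portName.substr(start, end - start);
        // Within a pair, '#' separates the key from the value.
        const std::size_t sep = pair.find('#');
        if (sep != std::string::npos) {
            fields[pair.substr(0, sep)] = pair.substr(sep + 1);
        }
        start = end + 1;
    }
    return fields;
}
```

<p class="MsoNormal">Applied to the string from the trace, this gives description = n001, port = 55907, and ifname = 172.16.56.1, which is presumably the endpoint the spawned NeedlesMpiMM process has to connect back to before MPI_Comm_spawn can return.<o:p></o:p></p>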
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>Here is my salloc command to start the job. I want one task per node, reserving the rest of the cores on the node for spawning of additional processes.
<o:p></o:p></b></p>
<p class="MsoNormal"><b><o:p> </o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><span style="color:red">$ salloc --ntasks=20 --cpus-per-task=24 --verbose<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>Here is what salloc reports:<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">salloc: -------------------- --------------------<o:p></o:p></p>
<p class="MsoNormal">salloc: cpus-per-task : 24<o:p></o:p></p>
<p class="MsoNormal">salloc: ntasks : 20<o:p></o:p></p>
<p class="MsoNormal">salloc: verbose : 1<o:p></o:p></p>
<p class="MsoNormal">salloc: -------------------- --------------------<o:p></o:p></p>
<p class="MsoNormal">salloc: end of defined options<o:p></o:p></p>
<p class="MsoNormal">salloc: Linear node selection plugin loaded with argument 4<o:p></o:p></p>
<p class="MsoNormal">salloc: select/cons_res loaded with argument 4<o:p></o:p></p>
<p class="MsoNormal">salloc: Cray/Aries node selection plugin loaded<o:p></o:p></p>
<p class="MsoNormal">salloc: select/cons_tres loaded with argument 4<o:p></o:p></p>
<p class="MsoNormal">salloc: Granted job allocation 34311<o:p></o:p></p>
<p class="MsoNormal"><b><span style="color:red">srun: error: Unable to create step for job 34311: Requested node configuration is not available</span><o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>