<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div>Dear MPICH community,</div>
<div dir="ltr">
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
I am quite new to MPI, but I have a small Slurm cluster with 3 compute nodes up and running.
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
I can run simple jobs like `srun -N3 hostname`, and I am now trying to run an MPI hello-world application. The problem is that the job hangs and then fails after a few seconds:</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Consolas,Courier,monospace"># srun -N2 -n4 /scratch/helloworld-mpi
</span>
<div><span style="font-family:Consolas,Courier,monospace">srun: error: mpi/pmi2: failed to send temp kvs to compute nodes</span></div>
<div><span style="font-family:Consolas,Courier,monospace">srun: Job step aborted: Waiting up to 32 seconds for job step to finish.</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmstepd: error: *** STEP 2.0 ON nid001001-cluster-1 CANCELLED AT 2022-03-22T16:43:07 ***</span></div>
<div><span style="font-family:Consolas,Courier,monospace">srun: error: nid001002-cluster-1: task 3: Killed</span></div>
<div><span style="font-family:Consolas,Courier,monospace">srun: launch/slurm: _step_signal: Terminating StepId=2.0</span></div>
<span style="font-family:Consolas,Courier,monospace">srun: error: nid001001-cluster-1: tasks 0-2: Killed</span></div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Arial,Helvetica,sans-serif">I can see this in the slurmd logs:</span></div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Consolas,Courier,monospace">slurmd: debug3: CPUs=40 Boards=1 Sockets=40 Cores=1 Threads=1 Memory=515174 TmpDisk=211436 Uptime=1920011 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)</span>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: in the service_connection</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: launch task StepId=2.0 request from UID:0 GID:0 HOST:172.29.113.47 PORT:40062</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug:  Checking credential with 468 bytes of sig data</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: _group_cache_lookup_internal: no entry found for root</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug:  task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug:  task/affinity: lllp_distribution: binding tasks:3 to nodes:0 sockets:3:0 cores:3:0 threads:3</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: task/affinity: lllp_distribution: JobId=2 implicit auto binding: sockets,one_thread, dist 8192</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: task/affinity: lllp_distribution: JobId=2 will use lllp_cyclic because of SelectTypeParameters</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000000000002</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000004</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000100000000</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000002</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: task/affinity: _lllp_generate_cpu_bind: 3 19 58</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [2]: mask_cpu,one_thread, 0x0000000000000001,0x0000000100000000,0x0000000000000002</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug:  task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x0000000000000001,0x0000000100000000,0x0000000000000002)</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: _insert_job_state: we already have a job state for job 2.  No big deal, just an FYI.</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: slurmstepd rank 0 (nid001001-cluster-1), parent rank -1 (NONE), children 0, depth 0, max_depth 0</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd</span></div>
<span style="font-family:Consolas,Courier,monospace">slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS</span>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: in the service_connection</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: Start processing RPC: REQUEST_FORWARD_DATA</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug2: Processing RPC: REQUEST_FORWARD_DATA</span></div>
<div><span style="font-family:Consolas,Courier,monospace">slurmd: debug3: Entering _rpc_forward_data, address: /var/spool/slurmd/sock.pmi2.2.0, len: 84</span></div>
<span style="font-family:Consolas,Courier,monospace">slurmd: debug2: failed connecting to specified socket '/var/spool/slurmd/sock.pmi2.2.0': Connection refused</span></div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
...</div>
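<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
The debug3 output above is noisy. A minimal sketch for pulling out only the lines that report a failure, assuming the log is available as plain text lines (the helper name failure_lines and the keyword list are illustrative choices, not anything Slurm provides):</div>
<pre style="font-family:Consolas,Courier,monospace">
```python
import re

# Hypothetical helper, not part of Slurm: the keyword list is an assumption
# about which words mark a failure in slurmd debug output.
FAILURE = re.compile(r"\b(error|failed|refused)\b", re.IGNORECASE)


def failure_lines(log_lines):
    """Return the log lines that mention an error, a failure, or a refused connection."""
    return [line.rstrip("\n") for line in log_lines if FAILURE.search(line)]


# Three lines copied from the slurmd log in this message.
sample = [
    "slurmd: debug2: Processing RPC: REQUEST_FORWARD_DATA",
    "slurmd: debug3: Entering _rpc_forward_data, address: /var/spool/slurmd/sock.pmi2.2.0, len: 84",
    "slurmd: debug2: failed connecting to specified socket '/var/spool/slurmd/sock.pmi2.2.0': Connection refused",
]

for line in failure_lines(sample):
    print(line)
```
</pre>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
On the excerpt above, this keeps only the "failed connecting to specified socket" line, which is the one that appears to matter here.</div>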
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
I compiled mpich-4.0.0, and I can run MPI jobs outside Slurm:</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<span style="font-family:Consolas,Courier,monospace"># mpirun -ppn 2 --hosts nid001001-cluster-1,nid001003-cluster-1,nid001003-cluster-1 /scratch/helloworld-mpi</span>
<div><span style="font-family:Consolas,Courier,monospace">Warning: Permanently added 'nid001003-cluster-1,172.29.9.83' (ECDSA) to the list of known hosts.</span></div>
<div><span style="font-family:Consolas,Courier,monospace">Hello world from processor nid001003-cluster-1, rank 2 out of 4 processors</span></div>
<div><span style="font-family:Consolas,Courier,monospace">Hello world from processor nid001003-cluster-1, rank 3 out of 4 processors</span></div>
<div><span style="font-family:Consolas,Courier,monospace">Hello world from processor nid001001-cluster-1, rank 0 out of 4 processors</span></div>
<span style="font-family:Consolas,Courier,monospace">Hello world from processor nid001001-cluster-1, rank 1 out of 4 processors</span></div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Could someone please give me a hint about what to look at to get MPI jobs running under Slurm?</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Thank you very much.<br>
</div>
</div>
</body>
</html>