[mpich-discuss] Fw: not able to run mpi jobs

Tue Mar 22 12:20:58 CDT 2022

Dear MPICH community,

I am quite new to MPI, but I have a small Slurm cluster with 3 compute nodes up and running.
I can run simple jobs like `srun -N3 hostname`, and I am now trying to run an MPI hello-world app. The problem is that the job hangs and then fails after a few seconds:

# srun -N2 -n4 /scratch/helloworld-mpi
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2.0 ON nid001001-cluster-1 CANCELLED AT 2022-03-22T16:43:07 ***
srun: error: nid001002-cluster-1: task 3: Killed
srun: launch/slurm: _step_signal: Terminating StepId=2.0
srun: error: nid001001-cluster-1: tasks 0-2: Killed

I can see this in the slurmd logs:

slurmd: debug3: CPUs=40 Boards=1 Sockets=40 Cores=1 Threads=1 Memory=515174 TmpDisk=211436 Uptime=1920011 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_TASKS
slurmd: launch task StepId=2.0 request from UID:0 GID:0 HOST:172.29.113.47 PORT:40062
slurmd: debug:  Checking credential with 468 bytes of sig data
slurmd: debug2: _group_cache_lookup_internal: no entry found for root
slurmd: debug:  task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: debug:  task/affinity: lllp_distribution: binding tasks:3 to nodes:0 sockets:3:0 cores:3:0 threads:3
slurmd: task/affinity: lllp_distribution: JobId=2 implicit auto binding: sockets,one_thread, dist 8192
slurmd: debug2: task/affinity: lllp_distribution: JobId=2 will use lllp_cyclic because of SelectTypeParameters
slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 40 c 1; hw s 40 c 1 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 core mask from slurmctld: 0x0000000007
slurmd: debug3: task/affinity: _get_avail_map: StepId=2.0 CPU final mask for local node: 0x0000000000000007
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000000000002
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000004
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:0] 0x0000000000000001
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:1] 0x0000000100000000
slurmd: debug3: task/affinity: _task_layout_display_masks: _task_layout_display_masks jobid [2:2] 0x0000000000000002
slurmd: debug3: task/affinity: _lllp_generate_cpu_bind: 3 19 58
slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [2]: mask_cpu,one_thread, 0x0000000000000001,0x0000000100000000,0x0000000000000002
slurmd: debug:  task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x0000000000000001,0x0000000100000000,0x0000000000000002)
slurmd: debug2: _insert_job_state: we already have a job state for job 2.  No big deal, just an FYI.
slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
slurmd: debug2: _read_slurm_cgroup_conf_int: No cgroup.conf file (/etc/slurm/cgroup.conf)
slurmd: debug3: slurmstepd rank 0 (nid001001-cluster-1), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_launch_tasks: return from _forkexec_slurmstepd
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_TASKS
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_FORWARD_DATA
slurmd: debug2: Processing RPC: REQUEST_FORWARD_DATA
slurmd: debug3: Entering _rpc_forward_data, address: /var/spool/slurmd/sock.pmi2.2.0, len: 84
slurmd: debug2: failed connecting to specified socket '/var/spool/slurmd/sock.pmi2.2.0': Connection refused
...
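
The last log line points at the step daemon's PMI2 socket under /var/spool/slurmd. A minimal sketch for checking whether such a socket exists on a compute node while a step is running, assuming the spool directory seen in the log (the `SPOOL` variable and the `list_pmi2_sockets` helper are names made up for this illustration, and reading the directory may need root):

```shell
#!/bin/sh
# Sketch only: list PMI2 step sockets in the slurmd spool directory.
# SPOOL defaults to the path seen in the log above; override if yours differs.
SPOOL="${SPOOL:-/var/spool/slurmd}"

# Hypothetical helper: print every sock.pmi2.* entry that is a real socket.
list_pmi2_sockets() {
    found=0
    for s in "$1"/sock.pmi2.*; do
        # Skip the unexpanded pattern and anything that is not a socket.
        [ -S "$s" ] || continue
        ls -l "$s"
        found=1
    done
    [ "$found" -eq 1 ] || echo "no sock.pmi2.* sockets under $1"
}

list_pmi2_sockets "$SPOOL"
```

This would be run on the node itself while the `srun` step is still alive. The slurm.conf documentation asks for SlurmdSpoolDir to be local to each node, so it may also be worth confirming that this directory is not on a shared file system.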

I compiled mpich-4.0.0, and I can run MPI jobs outside Slurm:

# mpirun -ppn 2 --hosts nid001001-cluster-1,nid001003-cluster-1,nid001003-cluster-1 /scratch/helloworld-mpi
Warning: Permanently added 'nid001003-cluster-1,172.29.9.83' (ECDSA) to the list of known hosts.
Hello world from processor nid001003-cluster-1, rank 2 out of 4 processors
Hello world from processor nid001003-cluster-1, rank 3 out of 4 processors
Hello world from processor nid001001-cluster-1, rank 0 out of 4 processors
Hello world from processor nid001001-cluster-1, rank 1 out of 4 processors

Could someone please give me a hint about what to look at to get MPI jobs running under Slurm?
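
A minimal sketch of checks that may help narrow this down, assuming the Slurm client tools and the MPICH install are in PATH on the node where it is run (the `run_if_present` helper and the `BIN` variable are made up for this illustration). It only gathers information: which MPI plugin types `srun` offers, the configured MpiDefault and SlurmdSpoolDir, how MPICH was configured, and which libraries the test binary links against.

```shell
#!/bin/sh
# Sketch only: collect PMI-related facts about Slurm, MPICH and the binary.
# BIN is the test program from this post; override it for another binary.
BIN="${BIN:-/scratch/helloworld-mpi}"

# Hypothetical helper: run a command if its tool exists, otherwise say so.
run_if_present() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1
    else
        echo "== $tool: not found in PATH, skipping"
    fi
}

# MPI plugin types this Slurm build offers (e.g. pmi2).
run_if_present srun --mpi=list

# Default MPI plugin and the slurmd spool directory from the running config.
run_if_present scontrol show config | grep -Ei '^==|MpiDefault|SlurmdSpoolDir'

# Configure options the installed MPICH was built with.
run_if_present mpichversion

# Shared libraries the test binary links against (look for a PMI library).
run_if_present ldd "$BIN"
```

If `srun --mpi=list` shows pmi2, the plugin can be requested explicitly with `srun --mpi=pmi2 -N2 -n4 /scratch/helloworld-mpi`. Whether the MPICH build here was configured to use Slurm's PMI2 rather than its own Hydra launcher is not known from the output above, which is what the `mpichversion` and `ldd` lines are meant to show.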

Thank you very much.