[mpich-discuss] [EXTERNAL] Re: mpiexec error
Mccall, Kurt E. (MSFC-EV41)
kurt.e.mccall at nasa.gov
Fri May 6 12:30:52 CDT 2022
Ken,
That was it -- looks like Torque requires FQDNs for MPI_Comm_spawn. I had the code set up for Slurm, which requires short host names. Thanks for your help!
Kurt
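For readers hitting the same mismatch, here is a minimal sketch of one way to make the requested host name agree with whatever form the allocation uses. It assumes the application picks the spawn host by name (for example via the "host" info key passed to MPI_Comm_spawn). The function name match_allocated_host is hypothetical and is not part of MPICH or Torque.

```python
def match_allocated_host(requested, allocated):
    """Return the entry of `allocated` that names the same host as `requested`.

    Hypothetical helper: `allocated` is the list of host names exactly as
    they appear in the resource manager's node file. Returns None when no
    entry (or more than one distinct entry) matches.
    """
    def short(name):
        # "n022.cluster.com" -> "n022"; a short name is returned unchanged.
        return name.split(".", 1)[0].lower()

    # Prefer an exact (case-insensitive) match.
    for name in allocated:
        if name.lower() == requested.lower():
            return name

    # Otherwise compare only the short (first-label) form of each name.
    candidates = {name for name in allocated if short(name) == short(requested)}
    if len(candidates) == 1:
        return candidates.pop()
    return None
```

The returned string, rather than the name the code started with, would then be the one handed to the launcher, so the same code could run under a resource manager that lists short names and one that lists FQDNs.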
-----Original Message-----
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: Friday, May 6, 2022 12:23 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: Re: [EXTERNAL] Re: [mpich-discuss] mpiexec error
Are the hostnames in the file all fully qualified? For example, "n022.cluster.com". This error message suggests it's looking for host "n022".
[mpiexec at n022.cluster.com] HYDT_bscd_pbs_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/pbs_launch.c:74): error finding PBS node ID for host n022
I am trying to understand whether it's failing to match because the fully qualified part is being stripped somewhere.
Ken
On 5/6/22, 12:17 PM, "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov> wrote:
Ken,
The node file has 420 lines, 20 for each host, including n022.cluster.com (21 nodes, 20 cores per node).
Thanks,
Kurt
-----Original Message-----
From: Raffenetti, Ken <raffenet at anl.gov>
Sent: Friday, May 6, 2022 12:13 PM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] Re: [mpich-discuss] mpiexec error
Hi Kurt,
Before running mpiexec, can you print out the hostfile to confirm the contents? Something like this:
cat $PBS_NODEFILE
Ken
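With hundreds of lines in the node file, a summary can be easier to read than the raw listing. The following is a minimal sketch, assuming the file holds one host name per line, repeated once per slot. The function name summarize_nodefile is hypothetical.

```python
from collections import Counter


def summarize_nodefile(text):
    """Summarize the contents of a node file given as a string.

    Returns (slots, unqualified): `slots` maps each host name to the number
    of lines it occupies, and `unqualified` is the sorted list of host names
    that contain no dot, i.e. that are not fully qualified.
    """
    # Ignore blank lines and surrounding whitespace.
    hosts = [line.strip() for line in text.splitlines() if line.strip()]
    slots = Counter(hosts)
    unqualified = sorted(h for h in slots if "." not in h)
    return slots, unqualified
```

Passing in the text of the node file and printing len(slots), sum(slots.values()) and unqualified would show at a glance how many hosts and slots were allocated and whether any names are short.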
On 5/6/22, 11:55 AM, "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org> wrote:
Running MPICH 4.0.1 under Torque 5.1, I’m getting the mpiexec error “user specified host not in the PBS allocated list”. My qsub command is:
qsub -V -j oe -e stdio -o stdio -f -X -l nodes=21:ppn=20 <bash_script>
My mpiexec command is:
mpiexec -print-all-exitcodes -enable-x -np 21 -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1 <more args> …
Here is the full error message. Thanks for any help.
[mpiexec at n022.cluster.com] find_pbs_node_id (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/pbs_launch.c:27): user specified host not in the PBS allocated list
[mpiexec at n022.cluster.com] HYDT_bscd_pbs_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/pbs_launch.c:74): error finding PBS node ID for host n022
[mpiexec at n022.cluster.com] HYDT_bsci_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/src/bsci_launch.c:17): launcher returned error while launching processes
[mpiexec at n022.cluster.com] fn_spawn (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v1.c:580): launcher cannot launch processes
[mpiexec at n022.cluster.com] handle_pmi_cmd (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:48): PMI handler returned error
[mpiexec at n022.cluster.com] control_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:284): unable to process PMI command
[mpiexec at n022.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at n022.cluster.com] HYD_pmci_wait_for_completion (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
[mpiexec at n022.cluster.com] main (../../../../mpich-4.0.1/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
[proxy:0:0 at n022.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:0 at n022.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/[proxy:0:2 at n020.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:2 at n020.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/[proxy:0:5 at n016.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:5 at n016.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/[proxy:0:15 at n006.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:15 at n006.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demu[proxy:0:16 at n005.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:16 at n005.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demu[proxy:0:19 at n002.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:19 at n002.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demu[proxy:0:20 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
[proxy:0:20 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demudemux_poll.c:76): callback returned error status
[proxy:0:0 at n022.cluster.com] main (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip.c:169): demux engine error waiting for event
demux_poll.c:76): callback returned error status
[proxy:0:2 at n020.cluster.com] main (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip.c:169): demux engine error waiting for event
[proxy:0:1 at n021.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed