[mpich-discuss] [EXTERNAL] Re: mpiexec error

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Fri May 6 12:30:52 CDT 2022


Ken,

That was it -- it looks like Torque requires FQDNs for MPI_Comm_spawn. I had the code set up for Slurm, which requires short host names. Thanks for your help!
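
For anyone hitting the same thing, a minimal sketch of a spawn call that satisfies Hydra's PBS launcher under Torque, assuming the "host" string has to match the node file entries exactly (the host name and the worker binary below are placeholders, not the code from this job):

    /* Minimal sketch, not the code from this job: spawn one worker on a
     * Torque-allocated node, passing the fully qualified host name via the
     * "host" info key.  "n022.cluster.com" and "./worker" are placeholders. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* Under Torque this must match the PBS allocated node list, i.e. the
         * FQDN; under Slurm the short name ("n022") is what matches. */
        MPI_Info_set(info, "host", "n022.cluster.com");

        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

Under Slurm the same "host" key would carry the short name ("n022"), which is why code written for one resource manager can fail under the other.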

Kurt

-----Original Message-----
From: Raffenetti, Ken <raffenet at anl.gov> 
Sent: Friday, May 6, 2022 12:23 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: Re: [EXTERNAL] Re: [mpich-discuss] mpiexec error

Are the hostnames in the file all fully qualified? For example "n022.cluster.com". This error message suggests it's looking for host "n022".

[mpiexec at n022.cluster.com] HYDT_bscd_pbs_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/pbs_launch.c:74): error finding PBS node ID for host n022

I am trying to understand if it's failing to match because the fully qualified part is being stripped somewhere.
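
A quick way to see where the qualification gets lost, assuming the spawn target is derived from the node's own reported name (MPI_Get_processor_name below is only an illustration, not the code from this job), is to print the MPI-reported name next to the resolver's canonical form and compare both against the node file entries:

    /* Hedged check: print what MPI reports for this node versus the
     * resolver's canonical (fully qualified) name, then compare both
     * against the Torque node file entries. */
    #include <mpi.h>
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        char mpiname[MPI_MAX_PROCESSOR_NAME];
        int len;
        struct addrinfo hints, *res = NULL;

        MPI_Init(&argc, &argv);
        MPI_Get_processor_name(mpiname, &len);

        memset(&hints, 0, sizeof(hints));
        hints.ai_flags = AI_CANONNAME;
        if (getaddrinfo(mpiname, NULL, &hints, &res) == 0 && res != NULL)
            printf("MPI name: %s   canonical: %s\n", mpiname, res->ai_canonname);
        else
            printf("MPI name: %s   (no canonical name found)\n", mpiname);
        if (res != NULL)
            freeaddrinfo(res);

        MPI_Finalize();
        return 0;
    }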

Ken

On 5/6/22, 12:17 PM, "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov> wrote:

    Ken,

    The node file has 420 lines, with 20 lines for each host, including n022.cluster.com (21 nodes, 20 cores per node).

    Thanks,
    Kurt

    -----Original Message-----
    From: Raffenetti, Ken <raffenet at anl.gov> 
    Sent: Friday, May 6, 2022 12:13 PM
    To: discuss at mpich.org
    Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
    Subject: [EXTERNAL] Re: [mpich-discuss] mpiexec error

    Hi Kurt,

    Before running mpiexec, can you print out the hostfile to confirm the contents? Something like this:

      cat $PBS_NODEFILE

    Ken

    On 5/6/22, 11:55 AM, "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org> wrote:

        Running MPICH 4.0.1 under Torque 5.1, I’m getting the mpiexec error “user specified host not in the PBS allocated list”. My qsub command is:

        qsub -V -j oe -e stdio -o stdio -f -X -l nodes=21:ppn=20  <bash_script>


        My mpiexec command is:

        mpiexec -print-all-exitcodes -enable-x -np 21  -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1  <more args> …


        Here is the full error message. Thanks for any help.

        [mpiexec at n022.cluster.com] find_pbs_node_id (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/pbs_launch.c:27): user specified host not in the PBS allocated list
        [mpiexec at n022.cluster.com] HYDT_bscd_pbs_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/pbs_launch.c:74): error finding PBS node ID for host n022
        [mpiexec at n022.cluster.com] HYDT_bsci_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/src/bsci_launch.c:17): launcher returned error while launching processes
        [mpiexec at n022.cluster.com] fn_spawn (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_pmi_v1.c:580): launcher cannot launch processes
        [mpiexec at n022.cluster.com] handle_pmi_cmd (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:48): PMI handler returned error
        [mpiexec at n022.cluster.com] control_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:284): unable to process PMI command
        [mpiexec at n022.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
        [mpiexec at n022.cluster.com] HYD_pmci_wait_for_completion (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
        [mpiexec at n022.cluster.com] main (../../../../mpich-4.0.1/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
        [proxy:0:0 at n022.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
        [proxy:0:0 at n022.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/[proxy:0:2 at n020.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
        [proxy:0:2 at n020.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/[proxy:0:5 at n016.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
        [proxy:0:5 at n016.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demux/[proxy:0:15 at n006.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
        [proxy:0:15 at n006.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demu[proxy:0:16 at n005.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
        [proxy:0:16 at n005.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demu[proxy:0:19 at n002.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
        [proxy:0:19 at n002.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demu[proxy:0:20 at n001.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
        [proxy:0:20 at n001.cluster.com] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.0.1/src/pm/hydra/tools/demudemux_poll.c:76): callback returned error status
        [proxy:0:0 at n022.cluster.com] main (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip.c:169): demux engine error waiting for event
        demux_poll.c:76): callback returned error status
        [proxy:0:2 at n020.cluster.com] main (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip.c:169): demux engine error waiting for event
        [proxy:0:1 at n021.cluster.com] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:899): assert (!closed) failed




