[mpich-discuss] MPI_Init hangs under Slurm

Raffenetti, Ken raffenet at anl.gov
Fri Feb 18 14:20:56 CST 2022


So it looks like only nodes 1 and 2 successfully launched.

n001.cluster.pssclabs.com
n002.cluster.pssclabs.com

while the others are stuck trying to reach their nodes over ssh. Basically, it's a system configuration issue. That said, it is preferable to use the Slurm launcher instead of ssh, though I recall that in the past spawning new processes with it produced an error like:

srun: Job 84993 step creation temporarily disabled, retrying (Requested nodes are busy)

Can you try removing the --exclusive flag from your sbatch command and also removing -launcher ssh from your mpiexec command? I think what might be happening is that the Slurm launcher won't spawn any processes because it thinks the currently running processes need access to the whole node.
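
Concretely, something like this (a sketch based on the commands quoted below; the elided arguments stay as they are):

    sbatch --nodes=20 --ntasks=20 --job-name $job_name --verbose <batch script>
    mpiexec -verbose -print-all-exitcodes -np 20 -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1 <many more args…>

Without -launcher ssh, Hydra should detect the Slurm allocation and use srun to launch the proxies.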

Ken

On 2/18/22, 2:08 PM, "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov> wrote:

    I’m clueless.  Here is the correct output, running mpiexec inside of sbatch.

    Kurt

    From: Zhou, Hui <zhouh at anl.gov> 
    Sent: Friday, February 18, 2022 2:00 PM
    To: Raffenetti, Ken <raffenet at anl.gov>; discuss at mpich.org
    Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
    Subject: [EXTERNAL] Re: [mpich-discuss] MPI_Init hangs under Slurm



    Hi Kurt,



    Did you run mpiexec inside sbatch? It will need sbatch to allocate the nodes.
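
    For example, a minimal batch script might look like this (the script and program names here are placeholders):

        #!/bin/bash
        #SBATCH --nodes=20
        #SBATCH --ntasks=20
        mpiexec -np 20 -ppn 1 ./my_program

    submitted with sbatch my_script.sh, so that Slurm allocates the nodes before mpiexec runs.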



    -- 

    Hui

    ________________________________________

    From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
    Sent: Friday, February 18, 2022 1:39 PM
    To: Raffenetti, Ken <raffenet at anl.gov>; discuss at mpich.org <discuss at mpich.org>
    Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
    Subject: Re: [mpich-discuss] MPI_Init hangs under Slurm 



    Here is the --verbose output. Is it trying to launch all of the processes on the head node rocci.ndc.nasa.gov?

    Kurt

    -----Original Message-----
    From: Raffenetti, Ken <raffenet at anl.gov> 
    Sent: Friday, February 18, 2022 1:31 PM
    To: discuss at mpich.org
    Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
    Subject: [EXTERNAL] Re: [mpich-discuss] MPI_Init hangs under Slurm

    From the looks of it, the ssh launcher might not be able to access all the nodes. To confirm, can you try launching a non-MPI program? Something like

          mpiexec -verbose -launcher ssh -print-all-exitcodes -np 20 -ppn 1 hostname

    Ken

    On 2/17/22, 2:39 PM, "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org> wrote:

        Sorry, my attachment with an .out extension was blocked.  Here is the file with a .txt extension.

        From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov> 
        Sent: Thursday, February 17, 2022 2:36 PM
        To: discuss at mpich.org
        Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
        Subject: MPI_Init hangs under Slurm



        Things were working fine when I was launching 1-node jobs under Slurm 20.11.8, but when I launched a 20-node job, MPICH hung in MPI_Init. The output of “mpiexec -verbose” is attached, and the stack trace at the point where it hangs is below.

        In the “mpiexec -verbose” output, I wonder why variables such as PATH_modshare point to our Intel MPI implementation, which I am not using. I am using MPICH 4.0 with a patch that Ken Raffenetti provided (which makes MPICH recognize the “host” info key). My $PATH and $LD_LIBRARY_PATH variables definitely point to the correct MPICH installation.
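
        (As a sanity check, commands along these lines confirm which installation is found first; a generic sketch, not output from this system:

            which mpiexec
            mpiexec --version
            echo $LD_LIBRARY_PATH | tr ':' '\n'

        run on both the head node and, over ssh, a compute node.)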

        I appreciate any help you can give.


        Here is the Slurm sbatch command:

        sbatch --nodes=20 --ntasks=20 --job-name $job_name --exclusive --verbose


        Here is the mpiexec command:

        mpiexec -verbose -launcher ssh -print-all-exitcodes -np 20 -wdir ${work_dir} -env DISPLAY localhost:10.0 --ppn 1 <many more args…>


        Stack trace at MPI_Init:

        #0  0x00007f6d85f499b2 in read () from /lib64/libpthread.so.0
        #1  0x00007f6d87a5753a in PMIU_readline (fd=5, buf=buf at entry=0x7ffd6fb596e0 "", maxlen=maxlen at entry=1024)
            at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmiutil.c:134
        #2  0x00007f6d87a57a56 in GetResponse (request=0x7f6d87b48351 "cmd=barrier_in\n",
            expectedCmd=0x7f6d87b48345 "barrier_out", checkRc=0) at ../mpich-slurm-patch-4.0/src/pmi/simple/simple_pmi.c:818
        #3  0x00007f6d87a29915 in MPIDI_PG_SetConnInfo (rank=rank at entry=0,
            connString=connString at entry=0x1bbf5a0 "description#n001$port#33403$ifname#172.16.56.1$")
            at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpidi_pg.c:559
        #4  0x00007f6d87a38611 in MPID_nem_init (pg_rank=pg_rank at entry=0, pg_p=pg_p at entry=0x1bbf850, has_parent=<optimized out>)
            at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/mpid_nem_init.c:393
        #5  0x00007f6d87a2ad93 in MPIDI_CH3_Init (has_parent=<optimized out>, pg_p=0x1bbf850, pg_rank=0)
            at ../mpich-slurm-patch-4.0/src/mpid/ch3/channels/nemesis/src/ch3_init.c:83
        #6  0x00007f6d87a1b3b7 in init_world () at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:190
        #7  MPID_Init (requested=<optimized out>, provided=provided at entry=0x7f6d87e03540 <MPIR_ThreadInfo>)
            at ../mpich-slurm-patch-4.0/src/mpid/ch3/src/mpid_init.c:76
        #8  0x00007f6d879828eb in MPII_Init_thread (argc=argc at entry=0x7ffd6fb5a5cc, argv=argv at entry=0x7ffd6fb5a5c0,
            user_required=0, provided=provided at entry=0x7ffd6fb5a574, p_session_ptr=p_session_ptr at entry=0x0)
            at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:208
        #9  0x00007f6d879832a5 in MPIR_Init_impl (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)
            at ../mpich-slurm-patch-4.0/src/mpi/init/mpir_init.c:93
        #10 0x00007f6d8786388e in PMPI_Init (argc=0x7ffd6fb5a5cc, argv=0x7ffd6fb5a5c0)
            at ../mpich-slurm-patch-4.0/src/binding/c/init/init.c:46
        #11 0x000000000040640d in main (argc=23, argv=0x7ffd6fb5ad68) at src/NeedlesMpiManagerMain.cpp:53


