[mpich-discuss] mpiexec fails to launch any processes

Zhou, Hui zhouh at anl.gov
Tue Jun 14 10:41:46 CDT 2022


Kurt,

The stack trace shows mpiexec in its polling loop. Since no other PMI messages are being logged, it is still waiting for the processes to call PMI_Init. Can you make your program print something before and after MPI_Init? That will tell us whether the program is stuck in MPI_Init or before MPI_Init.
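
One way to add those markers is sketched below in C. It assumes the application is a C or C++ MPI program; the helper name trace_checkpoint and the message format are illustrative and not part of MPICH. Each marker is flushed immediately and carries the host name and PID, so the output of the 20 processes can be told apart and is not lost in a stdio buffer if a process hangs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for gethostname() */
#endif
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of MPICH): write one progress marker,
 * tagged with host name and PID, and flush it immediately so the line
 * is visible even if the process hangs right afterwards.
 * Returns the number of bytes written, or a negative value on error. */
int trace_checkpoint(FILE *out, const char *stage)
{
    char host[256];

    if (gethostname(host, sizeof host) != 0) {
        snprintf(host, sizeof host, "unknown-host");
    }
    host[sizeof host - 1] = '\0';  /* gethostname may not terminate a truncated name */

    int n = fprintf(out, "[%s pid %ld] %s\n", host, (long)getpid(), stage);
    fflush(out);
    return n;
}

/* Intended call sites in the MPI program (shown as a comment so that
 * this sketch compiles without mpi.h):
 *
 *     int main(int argc, char **argv)
 *     {
 *         trace_checkpoint(stderr, "before MPI_Init");
 *         MPI_Init(&argc, &argv);
 *         trace_checkpoint(stderr, "after MPI_Init");
 *         // rest of the program
 *         MPI_Finalize();
 *         return 0;
 *     }
 */
```

If "before MPI_Init" shows up for a process but "after MPI_Init" never does, that process is stuck inside MPI_Init. If neither line appears, the program is not reaching MPI_Init at all.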

Hui
________________________________
From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Sent: Tuesday, June 14, 2022 2:21 AM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: mpiexec fails to launch any processes


Hui,



Slurm doesn’t seem to be killing the job, as it still shows up when I run squeue. A gdb stack trace shows where mpiexec is stuck. Does this tell you anything?



#0  0x00007f9c895ddaa8 in poll () from /lib64/libc.so.6
#1  0x000000000045352c in HYDT_dmxu_poll_wait_for_event (wtime=-1)
    at ../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux_poll.c:39
#2  0x0000000000452e9a in HYDT_dmx_wait_for_event (wtime=-1)
    at ../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux.c:168
#3  0x000000000040cda4 in HYD_pmci_wait_for_completion (timeout=-1)
    at ../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:157
#4  0x0000000000404177 in main (argc=33, argv=0x7fff054e6888)
    at ../../../../mpich-4.0.1/src/pm/hydra/ui/mpich/mpiexec.c:324



Thanks,

Kurt



From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Sent: Monday, June 13, 2022 4:16 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: [mpich-discuss] [EXTERNAL] Re: mpiexec fails to launch any processes



Hui,



That worked too. I guess I’ll have to find a way to pass a “verbose” argument to sbatch and see why Slurm is killing my application.



Thanks,

Kurt



From: Zhou, Hui <zhouh at anl.gov>
Sent: Monday, June 13, 2022 4:11 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: Re: [EXTERNAL] Re: mpiexec fails to launch any processes



Kurt,



Could you try launching hostname with the same command?



    mpiexec -launcher ssh -verbose -print-all-exitcodes -wdir  <directory> -np 20 -ppn 1 hostname



If that works, the problem seems to point to your application. Something in your code made Slurm kill the job.



--

Hui

________________________________

From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Sent: Monday, June 13, 2022 4:02 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Subject: RE: [EXTERNAL] Re: mpiexec fails to launch any processes



Hui,



$ mpiexec -N 10 -hostfile MySlurmNodeFile2 hostname



works properly, reporting from each of 10 nodes.



Kurt



From: Zhou, Hui <zhouh at anl.gov>
Sent: Monday, June 13, 2022 2:44 PM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] Re: mpiexec fails to launch any processes



Hi Kurt,



I don't have much of a clue. Are you able to launch a trivial application, for example "hostname"?



--

Hui

________________________________

From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Sent: Monday, June 13, 2022 12:29 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: [mpich-discuss] mpiexec fails to launch any processes



Outlook blocked the output file slurm.out that I had attached. Trying to send it again as slurm.txt.



Kurt





Hi,



My mpiexec command fails to launch any processes. I ran it with the -verbose option but didn’t see any obvious errors in the output (attached).



The command is:



mpiexec -launcher ssh -verbose -print-all-exitcodes -wdir  <directory> -np 20 -ppn 1  <more args…>



I am running MPICH 4.0.1 under Slurm 20.11.8.  Thanks for any help.



Kurt