[mpich-discuss] mpiexec fails to launch any processes
Zhou, Hui
zhouh at anl.gov
Tue Jun 14 10:41:46 CDT 2022
Kurt,
The stack trace shows mpiexec in its polling loop. Since no other PMI messages are being logged, it is still waiting for the processes to call PMI_Init. Can you make your program print something before and after MPI_Init? That will tell us whether the program is stuck in MPI_Init or before MPI_Init.
Hui
________________________________
From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Sent: Tuesday, June 14, 2022 2:21 AM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: mpiexec fails to launch any processes
Hui,
Slurm doesn’t seem to be killing the job, as it still shows up when I run squeue. A gdb stack trace shows where mpiexec is stuck – does this tell you anything?
#0 0x00007f9c895ddaa8 in poll () from /lib64/libc.so.6
#1 0x000000000045352c in HYDT_dmxu_poll_wait_for_event (wtime=-1)
    at ../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux_poll.c:39
#2 0x0000000000452e9a in HYDT_dmx_wait_for_event (wtime=-1)
    at ../../../../mpich-4.0.1/src/pm/hydra/tools/demux/demux.c:168
#3 0x000000000040cda4 in HYD_pmci_wait_for_completion (timeout=-1)
    at ../../../../mpich-4.0.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:157
#4 0x0000000000404177 in main (argc=33, argv=0x7fff054e6888)
    at ../../../../mpich-4.0.1/src/pm/hydra/ui/mpich/mpiexec.c:324
Thanks,
Kurt
From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Sent: Monday, June 13, 2022 4:16 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: [mpich-discuss] [EXTERNAL] Re: mpiexec fails to launch any processes
Hui,
That worked too. I guess I’ll have to find a way to pass a “verbose” argument to sbatch and see why Slurm is killing my application.
Thanks,
Kurt
From: Zhou, Hui <zhouh at anl.gov>
Sent: Monday, June 13, 2022 4:11 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>; discuss at mpich.org
Subject: Re: [EXTERNAL] Re: mpiexec fails to launch any processes
Kurt,
Could you try launching hostname with the same command?
mpiexec -launcher ssh -verbose -print-all-exitcodes -wdir <directory> -np 20 -ppn 1 hostname
If that works, the problem seems to point to your application. Something in your code made Slurm kill the job.
--
Hui
________________________________
From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Sent: Monday, June 13, 2022 4:02 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Subject: RE: [EXTERNAL] Re: mpiexec fails to launch any processes
Hui,
$ mpiexec -N 10 -hostfile MySlurmNodeFile2 hostname
works properly, reporting from each of 10 nodes.
Kurt
From: Zhou, Hui <zhouh at anl.gov>
Sent: Monday, June 13, 2022 2:44 PM
To: discuss at mpich.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [EXTERNAL] Re: mpiexec fails to launch any processes
Hi Kurt,
I don't have much of a clue. Are you able to launch some trivial applications, for example, "hostname"?
--
Hui
________________________________
From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Sent: Monday, June 13, 2022 12:29 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: Re: [mpich-discuss] mpiexec fails to launch any processes
Outlook blocked the output file slurm.out that I had attached. Trying to send it again as slurm.txt.
Kurt
Hi,
My mpiexec command fails to launch any processes. I ran it with the -verbose option but didn’t see any obvious errors in the output (attached).
The command is:
mpiexec -launcher ssh -verbose -print-all-exitcodes -wdir <directory> -np 20 -ppn 1 <more args…>
I am running MPICH 4.0.1 under Slurm 20.11.8. Thanks for any help.
Kurt