[mpich-devel] Aurora: Trying to start MPICH inside bigger job
Wozniak, Justin M.
woz at anl.gov
Mon Apr 27 12:15:51 CDT 2026
That worked, thanks! I just had to unset PMIX_NAMESPACE at runtime.
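For the archive, the final working pattern was roughly the following (a sketch; paths as in the thread below, and the exact set of variables to unset may be system-dependent):

/opt/cray/pals/1.8/bin/mpiexec \
    env -u PMIX_NAMESPACE \
    /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 ./main.x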
--
Justin M Wozniak
________________________________
From: Zhou, Hui via devel <devel at mpich.org>
Sent: Thursday, April 23, 2026 13:44
To: devel at mpich.org <devel at mpich.org>
Cc: Zhou, Hui <zhouh at anl.gov>
Subject: Re: [mpich-devel] Aurora: Trying to start MPICH inside bigger job
I see the issue! The application main.x is linked against a version of MPICH built with PMIx, while hydra only speaks PMI-1 or PMI-2. You need to rebuild MPICH without PMIx and relink the app against the custom-built MPICH. There is a way to build MPICH with both PMIx and PMI-1/2, but it is not well tested. Try specifying both configure options --with-pmix=... and --with-pmilib=mpich when building MPICH, and let me know if it works.
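A minimal sketch of that configure line (the prefix and PMIx path below are placeholders, not actual Aurora paths):

./configure --prefix=/path/to/mpich-install \
    --with-pmix=/path/to/pmix \
    --with-pmilib=mpich
make -j && make install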
--
Hui
________________________________
From: Wozniak, Justin M. via devel <devel at mpich.org>
Sent: Thursday, April 23, 2026 1:14 PM
To: Zhou, Hui via devel <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: [mpich-devel] Aurora: Trying to start MPICH inside bigger job
When running directly in an interactive allocation, I get the output below. Yes, Perlmutter is Slurm. Could there be some way to force the node-local MPICH to run without PMI? Using -launcher fork does not seem to have an effect. I can use a very minimal MPICH, but I do need GPU support. Thanks
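For reference, the fork attempt was simply (same binary and test program as below):

/lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -launcher fork -n 2 ./main.x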
$ /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 ./main.x
host: x4217c5s3b0n0.hsn.cm.aurora.alcf.anl.gov
[mpiexec at x4217c5s3b0n0] Timeout set to -1 (-1 means infinite)
==================================================================================================
mpiexec options:
----------------
Base path: /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
...
[mpiexec at x4217c5s3b0n0] Launch arguments: /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/hydra_pmi_proxy --control-port x4217c5s3b0n0.hsn.cm.aurora.alcf.anl.gov:41641 --debug --rmk pbs --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[proxy:0 at x4217c5s3b0n0] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at x4217c5s3b0n0] Sending upstream hdr.cmd = CMD_STDERR
[proxy:0 at x4217c5s3b0n0] Sending upstream hdr.cmd = CMD_STDERR
Abort(672810000): Fatal error in internal_Init: Internal MPI error!, error stack:
internal_Init(70)...............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(204)...........:
MPIR_pmi_init(225)..............:
check_MPIR_CVAR_PMI_VERSION(158): Runtime environment uses unsupported PMI version PMI-1 or PMI-2. Aborting.
Abort(672810000): Fatal error in internal_Init: Internal MPI error!, error stack:
internal_Init(70)...............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(204)...........:
MPIR_pmi_init(225)..............:
check_MPIR_CVAR_PMI_VERSION(158): Runtime environment uses unsupported PMI version PMI-1 or PMI-2. Aborting.
[proxy:0 at x4217c5s3b0n0] Sending upstream hdr.cmd = CMD_EXIT_STATUS
________________________________
From: Zhou, Hui <zhouh at anl.gov>
Sent: Thursday, April 23, 2026 12:21
To: Zhou, Hui via devel <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: Re: Aurora: Trying to start MPICH inside bigger job
/opt/cray/pals/1.8/bin/mpiexec \
env $UNSETS \
/lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 \
~/../main.x
First, are you able to run hydra (MPICH's mpiexec) directly, i.e. `/lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 ~/../main.x`?
If that works, then the key issue is that we have two PMI environments concurrently active: one from PALS, presumably running PMIx, and one from hydra, presumably running PMI-1. This may confuse the MPI processes about which PMI protocol to use.
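One quick check is to compare what each launcher injects into the environment, e.g. (plain shell, using the paths from your messages):

/opt/cray/pals/1.8/bin/mpiexec -n 1 env | grep -i -E 'pmi|pals'
/lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 1 env | grep -i pmi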
What is the launcher on Perlmutter? Is it Slurm? It would have the same double-active-PMI issue, but I guess it happens to work there.
What error in MPI_Init are you seeing?
Hui
________________________________
From: Wozniak, Justin M. via devel <devel at mpich.org>
Sent: Thursday, April 23, 2026 12:04 PM
To: Zhou, Hui via devel <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: [mpich-devel] Aurora: Trying to start MPICH inside bigger job
Hi
I am trying to run a simulation ensemble where the system MPI coordinates many tasks, each of which is a call to a plain MPICH-built app (ExaEpi/AMReX). The app runs on its own, but when called from inside the bigger MPI job, it causes an error in MPI_Init(). This approach works on other systems like Perlmutter. I think I can reproduce this with something as simple as:
/opt/cray/pals/1.8/bin/mpiexec \
env $UNSETS \
/lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 \
~/../main.x
where main.x is a toy MPI program, mpich-git is my MPICH build on Aurora, and UNSETS is a variety of attempts to unset environment variables; none of this has worked so far.
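For concreteness, UNSETS expands to a list of env -u flags; one attempted variant looked roughly like this (the variable names below are illustrative, not a known-good list):

UNSETS="-u PMI_RANK -u PMI_SIZE -u PMIX_NAMESPACE"  # illustrative only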
Is there any known reason why I am unable to run MPICH inside this context, when it does run on its own?
Also, any other tips for debugging MPI startup would help. I am using the following, but I am not getting much detail:
export PMI_DEBUG=1
export PMIX_DEBUG=1
export HYDRA_DEBUG=1
I get:
[mpiexec at x4217c5s7b0n0] Launch arguments: /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/hydra_pmi_proxy --control-port
x4217c5s7b0n0:37193 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0
--gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_STDERR
Abort(16): Fatal error in internal_Init: Internal MPI error!
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_STDERR
Abort(16): Fatal error in internal_Init: Internal MPI error!
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_EXIT_STATUS
x4217c5s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 exited with code 16
Thanks
Justin
--
Justin M Wozniak