[mpich-devel] Aurora: Trying to start MPICH inside bigger job

Zhou, Hui zhouh at anl.gov
Thu Apr 23 12:21:20 CDT 2026


/opt/cray/pals/1.8/bin/mpiexec \
  env $UNSETS \
  /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 \
  ~/../main.x

First, are you able to run hydra (MPICH's mpiexec) directly, i.e. `/lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 ~/../main.x`?

If that works, then the key issue is that we have two PMI environments active concurrently: one from PALS, presumably running PMIx, and one from hydra, presumably running PMI-1. This may confuse the MPI processes about which PMI protocol to use.
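
One way to check whether both PMI families are visible to a process is to count the environment variables by name prefix. The sketch below is only illustrative: the function name `pmi_families` is hypothetical, and it assumes the two launchers advertise themselves through variables starting with `PMI_` and `PMIX_`, which has not been confirmed for this PALS or hydra build.

```shell
# Count environment variables by assumed PMI name prefix.
# Assumption: PMI-1 settings start with PMI_ and PMIx settings with PMIX_.
pmi_families() {
  awk 'BEGIN {
    for (n in ENVIRON) {
      if (n ~ /^PMIX_/) pmix++        # assumed PMIx family
      else if (n ~ /^PMI_/) pmi++     # assumed PMI-1 family
    }
    printf "PMI_*=%d PMIX_*=%d\n", pmi, pmix
  }'
}

pmi_families
```

If both counts are nonzero in the environment that main.x inherits, that would be consistent with the double-PMI explanation, though it would not prove it.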

What is the launcher on Perlmutter? Is it Slurm? It would have the same issue of two concurrently active PMI environments, but I guess it happened to work.

What error are you seeing in MPI_Init?

Hui

________________________________
From: Wozniak, Justin M. via devel <devel at mpich.org>
Sent: Thursday, April 23, 2026 12:04 PM
To: Zhou, Hui via devel <devel at mpich.org>
Cc: Wozniak, Justin M. <woz at anl.gov>
Subject: [mpich-devel] Aurora: Trying to start MPICH inside bigger job

Hi
    I am trying to run a simulation ensemble where the system MPI coordinates many tasks, each of which is a call to a plain MPICH-built app (ExaEpi/AMReX).  The app runs on its own, but when called from inside the bigger MPI job, it causes an error in MPI_Init().  This approach works on other systems like Perlmutter.  I think I can reproduce this with something as simple as:

/opt/cray/pals/1.8/bin/mpiexec \
  env $UNSETS \
  /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 \
  ~/../main.x

where main.x is a toy MPI program, mpich-git is my MPICH build on Aurora, and UNSETS is a variety of attempts to unset environment variables.  None of these attempts has worked so far.
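
For reference, one possible way to do the unsetting at run time, inside the process that the outer launcher starts rather than in the submitting shell, is sketched below. The helper name `run_clean` is hypothetical, and the prefixes `PMI_`, `PMIX_`, and `PALS_` are assumptions about which variables matter; clearing them may well not be sufficient.

```shell
# run_clean CMD [ARGS...]
# Run CMD with every variable matching the assumed launcher prefixes removed.
# The subshell keeps the caller's own environment untouched.
run_clean() (
  # Collect matching names from the environment as seen at run time.
  for name in $(awk 'BEGIN {
      for (n in ENVIRON)
        if (n ~ /^(PMI_|PMIX_|PALS_)/) print n   # assumed prefixes
    }'); do
    unset "$name"
  done
  exec "$@"
)

# Local demonstration with a stand-in command:
PMI_RANK=3 run_clean sh -c 'echo "PMI_RANK is ${PMI_RANK:-unset}"'
```

In the nested case, the function would go into a small wrapper script that the outer mpiexec runs in place of `env $UNSETS`, so that the variable list is computed on the compute node after the outer launcher has set up its environment.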

Is there any known reason why I am unable to run MPICH inside this context, when it does run on its own?

Also, any other tips for debugging MPI startup might help.  I am using the following but not getting much detail:

export PMI_DEBUG=1
export PMIX_DEBUG=1
export HYDRA_DEBUG=1

I get:

[mpiexec at x4217c5s7b0n0] Launch arguments: /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/hydra_pmi_proxy --control-port
x4217c5s7b0n0:37193 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0
--gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_STDERR
Abort(16): Fatal error in internal_Init: Internal MPI error!
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_STDERR
Abort(16): Fatal error in internal_Init: Internal MPI error!
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_EXIT_STATUS
x4217c5s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 exited with code 16

    Thanks
    Justin


--

Justin M Wozniak
