[mpich-devel] Aurora: Trying to start MPICH inside bigger job

Wozniak, Justin M. woz at anl.gov
Thu Apr 23 12:04:57 CDT 2026


Hi
    I am trying to run a simulation ensemble in which the system MPI coordinates many tasks, each of which is a call to a plain MPICH-built app (ExaEpi/AMReX).  The app runs fine on its own, but when launched from inside the bigger MPI job it fails in MPI_Init().  This approach works on other systems such as Perlmutter.  I think I can reproduce the problem with something as simple as:

/opt/cray/pals/1.8/bin/mpiexec \
  env $UNSETS \
  /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/mpiexec -n 2 \
  ~/../main.x

where main.x is a toy MPI program, mpich-git is my MPICH build on Aurora, and UNSETS holds my various attempts at unsetting environment variables; none of these has worked so far.
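
For reference, main.x is nothing more elaborate than the usual init/report/finalize test, something along these lines:

#include <mpi.h>
#include <stdio.h>

/* Toy MPI program: initialize, report rank and size, finalize. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}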

Is there any known reason why I am unable to run MPICH inside this context, when it does run on its own?

Also, any other tips for debugging MPI startup would be appreciated.  I am setting the following, but I am not getting much detail:

export PMI_DEBUG=1
export PMIX_DEBUG=1
export HYDRA_DEBUG=1

I get:

[mpiexec at x4217c5s7b0n0] Launch arguments: /lus/flare/projects/EpiCalib/sfw/mpich-git/bin/hydra_pmi_proxy --control-port
x4217c5s7b0n0:37193 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0
--gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_STDERR
Abort(16): Fatal error in internal_Init: Internal MPI error!
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_STDERR
Abort(16): Fatal error in internal_Init: Internal MPI error!
[proxy:0 at x4217c5s7b0n0] Sending upstream hdr.cmd = CMD_EXIT_STATUS
x4217c5s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 exited with code 16

    Thanks
    Justin


--

Justin M Wozniak
