[mpich-discuss] MPICH -- too many open files

Mccall, Kurt E. (MSFC-EV41) kurt.e.mccall at nasa.gov
Tue Mar 22 12:55:01 CDT 2022


My application, which spawns multiple subprocesses via MPI_Comm_spawn, fails on a Slurm cluster once I scale up the number of processes. The error is:

[mpiexec at n002.cluster.pssclabs.com] HYDU_create_process (../../../../mpich-4.0.1/src/pm/hydra/utils/launch/launch.c:21): pipe error (Too many open files)
[mpiexec at n002.cluster.pssclabs.com] HYDT_bscd_common_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/external_common_launch.c:296): create process returned error
free(): invalid pointer
/var/spool/slurm/job235999/slurm_script: line 296: 3778907 Aborted                 (core dumped)

The same application works fine on a different (Torque) cluster, even for very large job sizes.

"ulimit -n" (number of open files) on both machines returns 1024.
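For reference, "ulimit -n" reports only the soft limit; the hard limit ("ulimit -Hn") can be higher, and an unprivileged process may raise its soft limit up to the hard limit. Below is a minimal sketch for checking both values on a compute node, using Python's standard resource module. The function names and the 4096 target are illustrative assumptions, not anything from MPICH or Slurm, and whether raising the limit actually avoids the Hydra pipe error is untested.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target):
    """Raise the soft limit toward `target`, never past the hard limit.

    Returns the resulting (soft, hard) pair. The soft limit is never lowered.
    """
    soft, hard = nofile_limits()
    unlimited = resource.RLIM_INFINITY
    # Cap the request at the hard limit unless the hard limit is unlimited.
    new_soft = target if hard == unlimited else min(target, hard)
    if soft != unlimited and new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return nofile_limits()


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"before: soft={soft} hard={hard}")
    # 4096 is an arbitrary illustrative target, not a recommended value.
    soft, hard = raise_soft_nofile(4096)
    print(f"after:  soft={soft} hard={hard}")
```

Resource limits are inherited by child processes, so the shell equivalent would be to run "ulimit -Sn <value>" in the job script before calling mpiexec, provided the hard limit on the node allows it.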

I'm hoping there is some other system setting on the Slurm cluster that would allow larger jobs. I can provide the "-verbose" output file if that would help.

Thanks,
Kurt