[mpich-discuss] MPICH -- too many open files

Zhou, Hui zhouh at anl.gov
Tue Mar 22 13:19:45 CDT 2022


MPICH currently launches an individual proxy for each spawn; I think that is probably what accounts for the flood of fds on the server. For now, the solution is to ask your system admin to increase the fd limit. Feel free to open an issue at https://github.com/pmodels/mpich/issues and we'll prioritize getting it enhanced.
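
In the meantime, note that a process may raise its own soft fd limit up to the hard limit without any privileges; only raising the hard limit itself needs the admin. Since the "Too many open files" is hit inside mpiexec, the limit has to be raised in whatever launches mpiexec (for example, an "ulimit -n" with a higher value in the batch script before the mpiexec line). A minimal C sketch of the underlying getrlimit/setrlimit mechanics:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    /* Print the soft/hard fd limits, then raise the soft limit to the
     * hard limit.  An unprivileged process may always do this; raising
     * the hard limit itself requires the administrator. */
    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return EXIT_FAILURE;
        }
        printf("RLIMIT_NOFILE: soft=%llu hard=%llu\n",
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;   /* soft may be raised up to hard */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }

If the hard limit on the Slurm nodes is also 1024, this won't help and the admin does need to raise it.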

--
Hui

________________________________
From: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
Sent: Tuesday, March 22, 2022 12:55 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] MPICH -- too many open files


My application, which spawns multiple subprocesses via MPI_Comm_spawn, eventually fails on one Slurm cluster as I scale up the number of processes, with the error:



[mpiexec at n002.cluster.pssclabs.com] HYDU_create_process (../../../../mpich-4.0.1/src/pm/hydra/utils/launch/launch.c:21): pipe error (Too many open files)
[mpiexec at n002.cluster.pssclabs.com] HYDT_bscd_common_launch_procs (../../../../mpich-4.0.1/src/pm/hydra/tools/bootstrap/external/external_common_launch.c:296): create process returned error
free(): invalid pointer
/var/spool/slurm/job235999/slurm_script: line 296: 3778907 Aborted (core dumped)



It works fine on a different (Torque) cluster for very large job sizes.
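
For reference, the spawn pattern is essentially the following (a minimal sketch, not the actual application; the worker binary name and spawn count are made up):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Hypothetical spawn loop: each MPI_Comm_spawn call asks
         * mpiexec to launch another process, so the pipes and fds
         * on the node running mpiexec accumulate with each spawn. */
        int nspawns = 64;   /* scaled up until mpiexec hits the fd limit */
        for (int i = 0; i < nspawns; i++) {
            MPI_Comm intercomm;
            MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &intercomm,
                           MPI_ERRCODES_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }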



“ulimit -n” (number of open files) on both machines returns 1024.



I’m hoping that there is some other system setting on the Slurm cluster that would allow larger jobs. I can provide the “-verbose” output file if that would help.
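
One possible difference between the two clusters: Slurm normally propagates the resource limits of the shell where sbatch was run into the job (the PropagateResourceLimits setting in slurm.conf), so the limit mpiexec actually sees inside the job may not match what “ulimit -n” reports in a login shell. In case it helps to narrow this down, here is a small Linux-only diagnostic (hypothetical, not part of MPICH) that counts the fds a process has open by listing /proc/<pid>/fd:

    #include <dirent.h>
    #include <stdio.h>

    /* Count open file descriptors of a process by listing /proc/<pid>/fd
     * (Linux-specific).  With no argument it inspects itself. */
    int main(int argc, char **argv)
    {
        const char *pid = (argc > 1) ? argv[1] : "self";
        char path[64];
        snprintf(path, sizeof path, "/proc/%s/fd", pid);

        DIR *dir = opendir(path);
        if (!dir) {
            perror(path);
            return 1;
        }
        int count = 0;
        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL)
            if (entry->d_name[0] != '.')   /* skip "." and ".." */
                count++;
        closedir(dir);

        printf("%s: %d open fds\n", path, count);
        return 0;
    }

Run against the mpiexec pid while the job scales up, this would show how quickly the fd count approaches the 1024 limit.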



Thanks,

Kurt