[mpich-discuss] too many ssh connections warning

Reuti reuti at staff.uni-marburg.de
Sun Dec 8 07:21:23 CST 2019


Hi Kurt:

> Am 07.12.2019 um 16:07 schrieb Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>:
> 
> Thanks Reuti.   I assume that by the "task manager interface without ssh", you mean
> 
> $ mpiexec -launcher rsh ...

No.

If the default compilation doesn't include SLURM support in your case, MPICH has to be compiled with:

$ ./configure --with-slurm=[PATH] …
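As a sketch, a full build against SLURM could look like this (the install prefix and the SLURM location under `/usr` are assumptions; adjust both to your site):

```shell
# Hypothetical paths -- adjust to your installation.
# Building MPICH against the SLURM PMI library lets Hydra start its
# daemons through srun instead of ssh/rsh.
./configure --prefix=$HOME/local/mpich-3.3.2 \
            --with-slurm=/usr \
            2>&1 | tee configure.log
make -j 4 2>&1 | tee make.log
make install
```

Afterwards `$HOME/local/mpich-3.3.2/bin` should contain `mpiexec` and `hydra_pmi_proxy` as seen in the process listing below.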

It should look like:

$ ps -e f
…
 16594 ?        Sl     0:00 slurmstepd: [166806.batch]
 16599 ?        S      0:00  \_ /bin/bash /var/spool/slurm/d/job166806/slurm_script
 16755 ?        S      0:00      \_ mpiexec ./mpihello
 16757 ?        Ssl    0:00          \_ /bin/srun -N 2 -n 2 --input none /home/reuti/local/mpich-3.3.2/bin/hydra_pmi_pro
 16758 ?        S      0:00              \_ /bin/srun -N 2 -n 2 --input none /home/reuti/local/mpich-3.3.2/bin/hydra_pmi
 16766 ?        Sl     0:00 slurmstepd: [166806.0]
 16772 ?        S      0:00  \_ /home/reuti/local/mpich-3.3.2/bin/hydra_pmi_proxy --control-port node045:3
 16773 ?        Rs     0:09      \_ ./mpihello
 16774 ?        Rs     0:09      \_ ./mpihello
…
(and on the slave nodes only the second daemon is present)

No SSH, no RSH. Hence it's a tight integration into the queuing system.

https://slurm.schedmd.com/mpi_guide.html

In addition, to switch to `srun` as the startup method, one might need the settings described here (not used in the example above):

https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions

(The FAQ notes that the default build of MPICH will work fine in SLURM environments and no extra steps are needed.)

It should even work on a cluster where `ssh` to the nodes is not available at all (I allow `ssh` to the nodes only for admin staff).
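A batch script for such a tightly integrated setup can then be as simple as the following sketch (node/task counts and the `mpihello` binary are just placeholders):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2

# With a SLURM-aware build of MPICH, mpiexec picks up the allocation
# from the SLURM environment and starts its hydra_pmi_proxy daemons
# via srun -- no ssh or rsh connections to the nodes are made.
mpiexec ./mpihello
```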

===

Are the nodes on a private network, i.e. unreachable from the Internet? Then one might also weigh the pros and cons of allowing `rsh`. If all communication between the nodes has to be encrypted even inside a private cluster, I fear neither MPICH nor any other MPI implementation provides this.

===

AFAIR "fork" was used in pre-Hydra times as an alternative to start the MPI tasks locally on a single machine only. I haven't checked "fork" or "smpd" for some time, since Hydra appeared.

-- Reuti



> I can't use rsh on our cluster due to security concerns.   Another launcher option is "fork", but when I tried it, the whole job froze.   Does "fork" refer to a specific binary like ssh, or does it refer to the Linux system call?
> 
> Thanks,
> Kurt
> 
>>> Sorry, I forgot to mention that I am starting the job under PBS/Torque with the qsub command.
> 
>> Then it should be possible to use the task manager interface without `ssh`:
> 
> -----Original Message-----
> From: Reuti via discuss <discuss at mpich.org> 
> Sent: Tuesday, December 3, 2019 9:02 AM
> To: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
> Cc: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [mpich-discuss] [EXTERNAL] Re: too many ssh connections warning
> 
> Hi:
> 
>> Am 03.12.2019 um 15:45 schrieb Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>:
>> 
>> Reuti,
>> 
>> Sorry, I forgot to mention that I am starting the job under PBS/Torque with the qsub command.
> 
> Then it should be possible to use the task manager interface without `ssh`:
> 
> http://docs.adaptivecomputing.com/torque/4-2-7/Content/topics/7-messagePassing/MPICH.htm
> 
> 
>>  I'll check with our sysadmins to see if there are firewall issues.
> 
> This could also become an issue later, if MPICH connects directly to other machines to talk to the already started daemons.
> 
> 
>>  What is PAM?
> 
> https://en.wikipedia.org/wiki/Linux_PAM
> 
> Several limits can be set here, depending on your distribution:
> 
> ls /lib64/security/
> 
> will show the available ones which are installed by default and are then used/configured in /etc/pam.d
> 
> -- Reuti
> 
> 
>> Hui Zhou,
>> 
>> What do you expect would be making multiple SSH connections to the node?  The creation of inter-communicators?   Individual MPI_Iprobe/MPI_Isend/MPI_IRecv commands?  If you have a guess, that would help me know how to fix the problem.
>> 
>> Kurt
>> 
>> 
>> -----Original Message-----
>> From: Reuti <reuti at staff.uni-marburg.de> 
>> Sent: Monday, December 2, 2019 3:20 PM
>> To: discuss at mpich.org
>> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
>> Subject: [EXTERNAL] Re: [mpich-discuss] too many ssh connections warning
>> 
>> 
>>> Am 02.12.2019 um 22:14 schrieb Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>:
>>> 
>>> My application uses mainly inter-communicators rather than intra-communicators for fault tolerance.    A particular process might have 20 inter-communicators active at one time.   I’m receiving the warning
>>> 
>>> [mpiexec at n010.cluster.com] WARNING: too many ssh connections to n009.cluster.com; waiting 6 seconds
>>> 
>>> What is the cause of this?   I have several guesses:
>>> 
>>> 1)      MPICH has an internal limit on the number of  connections
>>> 2)      I’m bumping up against a Linux limit on the number of connections
>>> 3)      Non-blocking communication using MPI_Isend() creates a temporary ssh connection (not likely)
>> 
>> 4) Firewall or PAM settings on the target prevent too many logins within a certain timeframe.
>> 
>> Are you using a queuing system and have the chance to skip SSH and startup MPICH by the queuing system?
>> 
>> -- Reuti
>> 
>> 
>>> The other question is, what are  the consequences of “waiting 6 seconds”?   Are some non-blocking messages dropped?
>>> 
>>> I’m using MPICH 3.3.2, CentOS 3.10 and the Portland Group compiler pgc++ 19.5.0.
>>> 
>>> 
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>> 
> 
