[mpich-discuss] too many ssh connections warning
Reuti
reuti at staff.uni-marburg.de
Sun Dec 8 07:21:23 CST 2019
Hi Kurt:
> On 07.12.2019 at 16:07, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov> wrote:
>
> Thanks Reuti. I assume that by the "task manager interface without ssh", you mean
>
> $ mpiexec -launcher rsh ...
No.
If the default compilation doesn't include SLURM support in your case, one has to compile MPICH with:
$ ./configure --with-slurm=[PATH] …
With SLURM support compiled in, the process tree should look like this:
$ ps -e f
…
16594 ? Sl 0:00 slurmstepd: [166806.batch]
16599 ? S 0:00 \_ /bin/bash /var/spool/slurm/d/job166806/slurm_script
16755 ? S 0:00 \_ mpiexec ./mpihello
16757 ? Ssl 0:00 \_ /bin/srun -N 2 -n 2 --input none /home/reuti/local/mpich-3.3.2/bin/hydra_pmi_pro
16758 ? S 0:00 \_ /bin/srun -N 2 -n 2 --input none /home/reuti/local/mpich-3.3.2/bin/hydra_pmi
16766 ? Sl 0:00 slurmstepd: [166806.0]
16772 ? S 0:00 \_ /home/reuti/local/mpich-3.3.2/bin/hydra_pmi_proxy --control-port node045:3
16773 ? Rs 0:09 \_ ./mpihello
16774 ? Rs 0:09 \_ ./mpihello
…
(and on the slave nodes only the second daemon is present)
No SSH, no RSH. Hence it's a tight integration into the queuing system.
https://slurm.schedmd.com/mpi_guide.html
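As a rough way to confirm this on a node, one could scan the process listing for `ssh` or `rsh` clients while the job runs. The sketch below is only an illustration: the function name `check_launch` and the labels "tight" and "loose" are made up here, and the check looks at whatever listing it is given, so unrelated interactive `ssh` sessions on the same node would also be reported.

```shell
# Reads a `ps -e f` style listing on stdin.
# Prints "loose" if an ssh or rsh client shows up as a command word on any
# line, otherwise "tight". The labels are hypothetical names for this sketch.
# "sshd" does not match, because the pattern requires whitespace or
# end-of-line directly after "ssh" or "rsh".
check_launch() {
    if grep -Eq '(^|[[:space:]/])(ssh|rsh)([[:space:]]|$)'; then
        echo "loose"
    else
        echo "tight"
    fi
}
```

Typical use would be `ps -e f | check_launch` on a compute node while the MPI job is running.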
In addition, to switch to `srun` as the startup method, one might need the following (not used in the example above):
https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Note_that_the_default_build_of_MPICH_will_work_fine_in_SLURM_environments._No_extra_steps_are_needed.
This should work on a cluster that has no `ssh` access to the nodes at all (I allow `ssh` to the nodes only for admin staff).
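To see whether a given build already includes SLURM support before recompiling, one option is to look at the build summary that Hydra's `mpiexec -info` prints. The sketch below assumes that the summary contains a line beginning with "Launchers available:", as it does in the Hydra builds I have seen; the helper names `list_launchers` and `has_launcher` are invented for this example, and the exact output format may differ between versions.

```shell
# Print the launcher names found in `mpiexec -info` style output on stdin.
# Assumes a line of the form "  Launchers available:   ssh rsh fork ...".
list_launchers() {
    sed -n 's/^[[:space:]]*Launchers available:[[:space:]]*//p'
}

# Succeed if launcher $1 is listed in the `mpiexec -info` output on stdin.
has_launcher() {
    list_launchers | tr ' ' '\n' | grep -qx "$1"
}
```

For example, `mpiexec -info | has_launcher slurm && echo "slurm launcher built in"` would report whether the `slurm` launcher is available in the installed MPICH.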
===
Are the nodes on a private network, i.e. the nodes can't be reached from the Internet? Then one might also discuss the pros and cons of allowing `rsh` or not. If all communication between the nodes has to be encrypted even inside a private cluster, I fear neither MPICH nor any other MPI implementation provides this.
===
AFAIR, "fork" was used in the times before Hydra as an alternative way to start the MPI tasks locally on a single machine only. I haven't looked at "fork" or "smpd" for some time, since Hydra appeared.
-- Reuti
> I can't use rsh on our cluster due to security concerns. Another launcher option is "fork", but when I tried it, the whole job froze. Does "fork" refer to a specific binary like ssh, or does it refer to the Linux system call?
>
> Thanks,
> Kurt
>
>>> Sorry, I forgot to mention that I am starting the job under PBS/Torque with the qsub command.
>
>> Then it should be possible to use the task manager interface without `ssh`:
>
> -----Original Message-----
> From: Reuti via discuss <discuss at mpich.org>
> Sent: Tuesday, December 3, 2019 9:02 AM
> To: Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org>
> Cc: Reuti <reuti at staff.uni-marburg.de>
> Subject: Re: [mpich-discuss] [EXTERNAL] Re: too many ssh connections warning
>
> Hi:
>
>> On 03.12.2019 at 15:45, Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org> wrote:
>>
>> Reuti,
>>
>> Sorry, I forgot to mention that I am starting the job under PBS/Torque with the qsub command.
>
> Then it should be possible to use the task manager interface without `ssh`:
>
> http://docs.adaptivecomputing.com/torque/4-2-7/Content/topics/7-messagePassing/MPICH.htm
>
>
>> I'll check with our sysadmins to see if there are firewall issues.
>
> This could also become an issue later, when MPICH connects directly to other machines to talk to the already started daemons.
>
>
>> What is PAM?
>
> https://en.wikipedia.org/wiki/Linux_PAM
>
> Several limits can be set here, depending on your distribution:
>
> ls /lib64/security/
>
> will show the modules that are installed by default; they are then used and configured in /etc/pam.d
>
> -- Reuti
>
>
>> Hui Zhou,
>>
>> What do you expect would be making multiple SSH connections to the node? The creation of inter-communicators? Individual MPI_Iprobe/MPI_Isend/MPI_Irecv calls? If you have a guess, that would help me know how to fix the problem.
>>
>> Kurt
>>
>>
>> -----Original Message-----
>> From: Reuti <reuti at staff.uni-marburg.de>
>> Sent: Monday, December 2, 2019 3:20 PM
>> To: discuss at mpich.org
>> Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mccall at nasa.gov>
>> Subject: [EXTERNAL] Re: [mpich-discuss] too many ssh connections warning
>>
>>
>>> On 02.12.2019 at 22:14, Mccall, Kurt E. (MSFC-EV41) via discuss <discuss at mpich.org> wrote:
>>>
>>> My application uses mainly inter-communicators rather than intra-communicators for fault tolerance. A particular process might have 20 inter-communicators active at one time. I’m receiving the warning
>>>
>>> [mpiexec at n010.cluster.com] WARNING: too many ssh connections to n009.cluster.com; waiting 6 seconds
>>>
>>> What is the cause of this? I have several guesses:
>>>
>>> 1) MPICH has an internal limit on the number of connections
>>> 2) I’m bumping up against a Linux limit on the number of connections
>>> 3) Non-blocking communication using MPI_Isend() creates a temporary ssh connection (not likely)
>>
>> 4) Firewall or PAM settings on the target prevent too many logins within a certain time frame.
>>
>> Are you using a queuing system and have the chance to skip SSH and startup MPICH by the queuing system?
>>
>> -- Reuti
>>
>>
>>> The other question is, what are the consequences of “waiting 6 seconds”? Are some non-blocking messages dropped?
>>>
>>> I’m using MPICH 3.3.2, CentOS 3.10 and the Portland Group compiler pgc++ 19.5.0.
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>