From: Lu Wei <luwei32@huawei.com>
hulk inclusion
category: bugfix
bugzilla: NA
CVE: NA
--------------------------------
Commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range in connect()") allocates even ports for connect() first while leaving odd ports for bind() and this works well in busy servers.
But this strategy causes severe performance degradation in busy clients. When a client has used more than half of the local ports set in /proc/sys/net/ipv4/ip_local_port_range and tries to connect to a server again, the connect time increases rapidly, because the kernel traverses all the even ports even though they are exhausted.
So this patch provides another strategy by introducing a sysctl option: local_port_allocation. On a busy client, users should set it to 1 to use sequential allocation; in other situations it should be left at 0. Its default value is 0.
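The cost difference between the two strategies can be illustrated with a small user-space model of the scan loop. This is only a sketch: scan_ports() and struct scan_result are hypothetical names, the starting offset is fixed at 0 instead of being derived from a hash, and the reserved-port and bind-bucket checks of the real __inet_hash_connect() are omitted.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the port scan; not kernel code. */
struct scan_result {
	int port;	/* chosen port, or -1 if no port is free */
	int probes;	/* number of candidate ports examined */
};

/*
 * Scan the range [low, high) for a free port. used[i] is true when
 * port (low + i) is taken. The range size is assumed to be even.
 * sequential == 0 models local_port_allocation = 0 (even ports first,
 * then odd ports); non-zero models sequential allocation.
 */
static struct scan_result scan_ports(const bool *used, int low, int high,
				     int sequential)
{
	int span_size = sequential ? 1 : 2;
	int remaining = high - low;
	unsigned int offset = 0;	/* assumption: scan starts at low */
	struct scan_result res = { -1, 0 };
	int i, port;

other_parity_scan:
	port = low + offset;
	for (i = 0; i < remaining; i += span_size, port += span_size) {
		if (port >= high)
			port -= remaining;
		res.probes++;
		if (!used[port - low]) {
			res.port = port;
			return res;
		}
	}

	offset++;
	if ((offset & 1) && remaining > 1 && span_size == 2)
		goto other_parity_scan;

	return res;
}
```

In this model, when every even port is taken and every odd port is free, the even/odd strategy examines all even ports before it reaches the first odd one, while the sequential strategy stops at the second port it examines.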
See: https://gitee.com/src-openeuler/kernel/issues/I2CT3R?from=project-issue
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 Documentation/networking/ip-sysctl.txt |  9 +++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/inet_hashtables.c             | 11 ++++++++---
 net/ipv4/sysctl_net_ipv4.c             |  7 +++++++
 net/ipv4/tcp_ipv4.c                    |  1 +
 5 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c99a4423ffba7..60a0309ce1609 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -885,6 +885,15 @@ ip_local_reserved_ports - list of comma separated ranges

 	Default: Empty

+local_port_allocation - INTEGER
+	This is a per-namespace sysctl. It defines whether to use
+	sequential allocation of local ports. If it is set to zero,
+	even ports will be allocated to connect() while leaving odd
+	ports for bind(); If it is set to non-zero, sequential allocation
+	will be applied.
+
+	default: 0
+
 ip_unprivileged_port_start - INTEGER
 	This is a per-namespace sysctl. It defines the first
 	unprivileged port in the network namespace. Privileged ports
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 623cfbb7b8dcb..73d5a32d4118e 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -92,6 +92,7 @@ struct netns_ipv4 {
 	int sysctl_icmp_errors_use_inbound_ifaddr;

 	struct local_ports ip_local_ports;
+	int sysctl_local_port_allocation;

 	int sysctl_tcp_ecn;
 	int sysctl_tcp_ecn_fallback;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 941477baa8d22..421fda1c1b138 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -678,7 +678,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
-	int ret, i, low, high;
+	int ret, i, low, high, span_size;
 	static u32 hint;
 	int l3mdev;

@@ -699,6 +699,11 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		return ret;
 	}

+	/* local_port_allocation 0 means even and odd port allocation strategy
+	 * will be applied, so span size is 2; otherwise sequential allocation
+	 * will be used and span size is 1. Default value is 0.
+	 */
+	span_size = (int)net->ipv4.sysctl_local_port_allocation ? 1 : 2;
 	l3mdev = inet_sk_bound_l3mdev(sk);

 	inet_get_local_port_range(net, &low, &high);
@@ -714,7 +719,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	offset &= ~1U;
 other_parity_scan:
 	port = low + offset;
-	for (i = 0; i < remaining; i += 2, port += 2) {
+	for (i = 0; i < remaining; i += span_size, port += span_size) {
 		if (unlikely(port >= high))
 			port -= remaining;
 		if (inet_is_local_reserved_port(net, port))
@@ -755,7 +760,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	}

 	offset++;
-	if ((offset & 1) && remaining > 1)
+	if ((offset & 1) && remaining > 1 && span_size == 2)
 		goto other_parity_scan;

 	return -EADDRNOTAVAIL;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d427114bd8bab..fdd166ee80f34 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -682,6 +682,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_local_port_range,
 	},
+	{
+		.procname	= "local_port_allocation",
+		.data		= &init_net.ipv4.sysctl_local_port_allocation,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{
 		.procname	= "ip_local_reserved_ports",
 		.data		= &init_net.ipv4.sysctl_local_reserved_ports,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 393c032e82a75..229ae272b2400 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2689,6 +2689,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
 	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
 	atomic_set(&net->ipv4.tfo_active_disable_times, 0);
+	net->ipv4.sysctl_local_port_allocation = 0;

 	/* Reno is always built in */
 	if (!net_eq(net, &init_net) &&