Eric Dumazet (3): tcp/dccp: bypass empty buckets in inet_twsk_purge() tcp/dccp: do not care about families in inet_twsk_purge() tcp: do not export tcp_twsk_purge()
Florian Westphal (1): tcp: prevent concurrent execution of tcp_sk_exit_batch
include/net/inet_timewait_sock.h | 2 +- include/net/tcp.h | 2 +- net/dccp/ipv4.c | 2 +- net/dccp/ipv6.c | 6 ------ net/ipv4/inet_timewait_sock.c | 16 +++++++++------- net/ipv4/tcp_ipv4.c | 16 +++++++++++++++- net/ipv4/tcp_minisocks.c | 7 +++---- net/ipv6/tcp_ipv6.c | 6 ------ 8 files changed, 30 insertions(+), 27 deletions(-)
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v6.6.48 commit 9624febd6968ad8174ab6236b82d5c1b1639db99 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAOXZW CVE: CVE-2024-44991
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 50e2907ef8bb52cf80ecde9eec5c4dac07177146 ]
TCP ehash table is often sparsely populated.
inet_twsk_purge() spends too much time calling cond_resched().
This patch can reduce time spent in inet_twsk_purge() by 20x.
Signed-off-by: Eric Dumazet edumazet@google.com Reviewed-by: Kuniyuki Iwashima kuniyu@amazon.com Link: https://lore.kernel.org/r/20240327191206.508114-1-edumazet@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Stable-dep-of: 565d121b6998 ("tcp: prevent concurrent execution of tcp_sk_exit_batch") Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Wang Liang wangliang74@huawei.com --- net/ipv4/inet_timewait_sock.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index 757ae3a4e2f1..55f60d1b46f2 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -281,12 +281,17 @@ EXPORT_SYMBOL_GPL(__inet_twsk_schedule); /* Remove all non full sockets (TIME_WAIT and NEW_SYN_RECV) for dead netns */ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family) { + struct inet_ehash_bucket *head = &hashinfo->ehash[0]; + unsigned int ehash_mask = hashinfo->ehash_mask; struct hlist_nulls_node *node; unsigned int slot; struct sock *sk;
- for (slot = 0; slot <= hashinfo->ehash_mask; slot++) { - struct inet_ehash_bucket *head = &hashinfo->ehash[slot]; + for (slot = 0; slot <= ehash_mask; slot++, head++) { + + if (hlist_nulls_empty(&head->chain)) + continue; + restart_rcu: cond_resched(); rcu_read_lock();
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v6.6.48 commit 7348061662c7149b81e38e545d5dd6bd427bec81 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAOXZW CVE: CVE-2024-44991
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 1eeb5043573981f3a1278876515851b7f6b1df1b ]
We lost ability to unload ipv6 module a long time ago.
Instead of calling expensive inet_twsk_purge() twice, we can handle all families in one round.
Also remove an extra line added in my prior patch, per Kuniyuki Iwashima feedback.
Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/netdev/20240327192934.6843-1-kuniyu@amazon.com/ Reviewed-by: Kuniyuki Iwashima kuniyu@amazon.com Link: https://lore.kernel.org/r/20240329153203.345203-1-edumazet@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Stable-dep-of: 565d121b6998 ("tcp: prevent concurrent execution of tcp_sk_exit_batch") Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Wang Liang wangliang74@huawei.com --- include/net/inet_timewait_sock.h | 2 +- include/net/tcp.h | 2 +- net/dccp/ipv4.c | 2 +- net/dccp/ipv6.c | 6 ------ net/ipv4/inet_timewait_sock.c | 9 +++------ net/ipv4/tcp_ipv4.c | 2 +- net/ipv4/tcp_minisocks.c | 6 +++--- net/ipv6/tcp_ipv6.c | 6 ------ 8 files changed, 10 insertions(+), 25 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h index 4a8e578405cb..9365e5af8d6d 100644 --- a/include/net/inet_timewait_sock.h +++ b/include/net/inet_timewait_sock.h @@ -114,7 +114,7 @@ static inline void inet_twsk_reschedule(struct inet_timewait_sock *tw, int timeo
void inet_twsk_deschedule_put(struct inet_timewait_sock *tw);
-void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family); +void inet_twsk_purge(struct inet_hashinfo *hashinfo);
static inline struct net *twsk_net(const struct inet_timewait_sock *twsk) diff --git a/include/net/tcp.h b/include/net/tcp.h index 245387812940..5cb57b974543 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -349,7 +349,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb); void tcp_rcv_space_adjust(struct sock *sk); int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp); void tcp_twsk_destructor(struct sock *sk); -void tcp_twsk_purge(struct list_head *net_exit_list, int family); +void tcp_twsk_purge(struct list_head *net_exit_list); ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags); diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c index f94d30b17199..f8d38262751c 100644 --- a/net/dccp/ipv4.c +++ b/net/dccp/ipv4.c @@ -1042,7 +1042,7 @@ static void __net_exit dccp_v4_exit_net(struct net *net)
static void __net_exit dccp_v4_exit_batch(struct list_head *net_exit_list) { - inet_twsk_purge(&dccp_hashinfo, AF_INET); + inet_twsk_purge(&dccp_hashinfo); }
static struct pernet_operations dccp_v4_ops = { diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index 683e4291b348..d25e962b18a5 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -1122,15 +1122,9 @@ static void __net_exit dccp_v6_exit_net(struct net *net) inet_ctl_sock_destroy(pn->v6_ctl_sk); }
-static void __net_exit dccp_v6_exit_batch(struct list_head *net_exit_list) -{ - inet_twsk_purge(&dccp_hashinfo, AF_INET6); -} - static struct pernet_operations dccp_v6_ops = { .init = dccp_v6_init_net, .exit = dccp_v6_exit_net, - .exit_batch = dccp_v6_exit_batch, .id = &dccp_v6_pernet_id, .size = sizeof(struct dccp_v6_pernet), }; diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index 55f60d1b46f2..fff53144250c 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -279,7 +279,7 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm) EXPORT_SYMBOL_GPL(__inet_twsk_schedule);
/* Remove all non full sockets (TIME_WAIT and NEW_SYN_RECV) for dead netns */ -void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family) +void inet_twsk_purge(struct inet_hashinfo *hashinfo) { struct inet_ehash_bucket *head = &hashinfo->ehash[0]; unsigned int ehash_mask = hashinfo->ehash_mask; @@ -288,7 +288,6 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family) struct sock *sk;
for (slot = 0; slot <= ehash_mask; slot++, head++) { - if (hlist_nulls_empty(&head->chain)) continue;
@@ -303,15 +302,13 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family) TCPF_NEW_SYN_RECV)) continue;
- if (sk->sk_family != family || - refcount_read(&sock_net(sk)->ns.count)) + if (refcount_read(&sock_net(sk)->ns.count)) continue;
if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) continue;
- if (unlikely(sk->sk_family != family || - refcount_read(&sock_net(sk)->ns.count))) { + if (refcount_read(&sock_net(sk)->ns.count)) { sock_gen_put(sk); goto restart; } diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index e0cbde24dded..c1e278f0bab1 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3307,7 +3307,7 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) { struct net *net;
- tcp_twsk_purge(net_exit_list, AF_INET); + tcp_twsk_purge(net_exit_list);
list_for_each_entry(net, net_exit_list, exit_list) { inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo); diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 5b9738c4ecbc..1f2d0365fbc7 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk) } EXPORT_SYMBOL_GPL(tcp_twsk_destructor);
-void tcp_twsk_purge(struct list_head *net_exit_list, int family) +void tcp_twsk_purge(struct list_head *net_exit_list) { bool purged_once = false; struct net *net; @@ -371,9 +371,9 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family) list_for_each_entry(net, net_exit_list, exit_list) { if (net->ipv4.tcp_death_row.hashinfo->pernet) { /* Even if tw_refcount == 1, we must clean up kernel reqsk */ - inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family); + inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo); } else if (!purged_once) { - inet_twsk_purge(&tcp_hashinfo, family); + inet_twsk_purge(&tcp_hashinfo); purged_once = true; } } diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index d0034916d386..83b48dc2b3ee 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -2213,15 +2213,9 @@ static void __net_exit tcpv6_net_exit(struct net *net) inet_ctl_sock_destroy(net->ipv6.tcp_sk); }
-static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list) -{ - tcp_twsk_purge(net_exit_list, AF_INET6); -} - static struct pernet_operations tcpv6_net_ops = { .init = tcpv6_net_init, .exit = tcpv6_net_exit, - .exit_batch = tcpv6_net_exit_batch, };
int __init tcpv6_init(void)
From: Florian Westphal fw@strlen.de
stable inclusion from stable-v6.6.48 commit 99580ae890ec8bd98b21a2a9c6668f8f1555b62e category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAOXZW CVE: CVE-2024-44991
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 565d121b69980637f040eb4d84289869cdaabedf ]
Its possible that two threads call tcp_sk_exit_batch() concurrently, once from the cleanup_net workqueue, once from a task that failed to clone a new netns. In the latter case, error unwinding calls the exit handlers in reverse order for the 'failed' netns.
tcp_sk_exit_batch() calls tcp_twsk_purge(). Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), this function picks up twsk in any dying netns, not just the one passed in via exit_batch list.
This means that the error unwind of setup_net() can "steal" and destroy timewait sockets belonging to the exiting netns.
This allows the netns exit worker to proceed to call
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
without the expected 1 -> 0 transition, which then splats.
At same time, error unwind path that is also running inet_twsk_purge() will splat as well:
WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 ... refcount_dec include/linux/refcount.h:351 [inline] inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 setup_net+0x714/0xb40 net/core/net_namespace.c:375 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
This doesn't seem like an actual bug (no tw sockets got lost and I don't see a use-after-free) but as erroneous trigger of debug check.
Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() blocks other task from doing final _dec_and_test before mutex-owner has removed all tw sockets of dying netns.
Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ Signed-off-by: Florian Westphal fw@strlen.de Reviewed-by: Kuniyuki Iwashima kuniyu@amazon.com Reviewed-by: Jason Xing kerneljasonxing@gmail.com Reviewed-by: Eric Dumazet edumazet@google.com Link: https://patch.msgid.link/20240812222857.29837-1-fw@strlen.de Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Wang Liang wangliang74@huawei.com --- net/ipv4/tcp_ipv4.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c1e278f0bab1..90eed56ad1ae 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -94,6 +94,8 @@ EXPORT_SYMBOL(tcp_hashinfo);
static DEFINE_PER_CPU(struct sock *, ipv4_tcp_sk);
+static DEFINE_MUTEX(tcp_exit_batch_mutex); + static u32 tcp_v4_init_seq(const struct sk_buff *skb) { return secure_tcp_seq(ip_hdr(skb)->daddr, @@ -3307,6 +3309,16 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) { struct net *net;
+ /* make sure concurrent calls to tcp_sk_exit_batch from net_cleanup_work + * and failed setup_net error unwinding path are serialized. + * + * tcp_twsk_purge() handles twsk in any dead netns, not just those in + * net_exit_list, the thread that dismantles a particular twsk must + * do so without other thread progressing to refcount_dec_and_test() of + * tcp_death_row.tw_refcount. + */ + mutex_lock(&tcp_exit_batch_mutex); + tcp_twsk_purge(net_exit_list);
list_for_each_entry(net, net_exit_list, exit_list) { @@ -3314,6 +3326,8 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); tcp_fastopen_ctx_destroy(net); } + + mutex_unlock(&tcp_exit_batch_mutex); }
static struct pernet_operations __net_initdata tcp_sk_ops = {
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v6.6.48 commit f0974e6bc385f0e53034af17abbb86ac0246ef1c category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IAOXZW CVE: CVE-2024-44991
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit c51db4ac10d57c366f9a92121e3889bfc6c324cd upstream.
After commit 1eeb50435739 ("tcp/dccp: do not care about families in inet_twsk_purge()") tcp_twsk_purge() is no longer potentially called from a module.
Signed-off-by: Eric Dumazet edumazet@google.com Cc: Kuniyuki Iwashima kuniyu@amazon.com Reviewed-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Wang Liang wangliang74@huawei.com --- net/ipv4/tcp_minisocks.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 1f2d0365fbc7..96031fba3884 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -378,7 +378,6 @@ void tcp_twsk_purge(struct list_head *net_exit_list) } } } -EXPORT_SYMBOL_GPL(tcp_twsk_purge);
/* Warning : This function is called without sk_listener being locked. * Be sure to read socket fields once, as their value could change under us.
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/11638 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/D...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/11638 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/D...