some backport bugfix for sockmap
Cong Wang (1): bpf, sock_map: Move cancel_work_sync() out of sock lock
Eric Dumazet (1): net: deal with most data-races in sk_wait_event()
Jakub Sitnicki (2): bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
Pengcheng Yang (3): bpf, sockmap: Fix repeated calls to sock_put() when msg has more_data bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes bpf, sockmap: Fix data loss caused by using apply_bytes on ingress redirect
Wang Yufen (1): bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues
zhangmingyi (1): bpf: fix bpf_tcp_ingress addr use after free
include/linux/skmsg.h | 3 +- include/linux/util_macros.h | 12 ++++++++ include/net/tcp.h | 4 +-- net/core/skmsg.c | 16 +++++----- net/core/sock_map.c | 60 +++++++++++++++++++++---------------- net/core/stream.c | 12 ++++---- net/ipv4/tcp_bpf.c | 36 +++++++++++++--------- net/llc/af_llc.c | 8 +++-- net/smc/smc_close.c | 4 +-- net/smc/smc_rx.c | 4 +-- net/smc/smc_tx.c | 4 +-- net/tipc/socket.c | 4 +-- net/tls/tls_main.c | 3 +- net/tls/tls_sw.c | 6 ++-- 14 files changed, 105 insertions(+), 71 deletions(-)
From: Wang Yufen wangyufen@huawei.com
stable inclusion from stable-v5.10.155 commit cc21dc48a78cc9e5af9a4d039cd456446a6e73ff category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I64YCD CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 8ec95b94716a1e4d126edc3fb2bc426a717e2dba ]
When running `test_sockmap` selftests, the following warning appears:
WARNING: CPU: 2 PID: 197 at net/core/stream.c:205 sk_stream_kill_queues+0xd3/0xf0 Call Trace: <TASK> inet_csk_destroy_sock+0x55/0x110 tcp_rcv_state_process+0xd28/0x1380 ? tcp_v4_do_rcv+0x77/0x2c0 tcp_v4_do_rcv+0x77/0x2c0 __release_sock+0x106/0x130 __tcp_close+0x1a7/0x4e0 tcp_close+0x20/0x70 inet_release+0x3c/0x80 __sock_release+0x3a/0xb0 sock_close+0x14/0x20 __fput+0xa3/0x260 task_work_run+0x59/0xb0 exit_to_user_mode_prepare+0x1b3/0x1c0 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x48/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae
The root case is in commit 84472b436e76 ("bpf, sockmap: Fix more uncharged while msg has more_data"), where I used msg->sg.size to replace the tosend, causing breakage:
if (msg->apply_bytes && msg->apply_bytes < tosend) tosend = psock->apply_bytes;
Fixes: 84472b436e76 ("bpf, sockmap: Fix more uncharged while msg has more_data") Reported-by: Jakub Sitnicki jakub@cloudflare.com Signed-off-by: Wang Yufen wangyufen@huawei.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: John Fastabend john.fastabend@gmail.com Acked-by: Jakub Sitnicki jakub@cloudflare.com Link: https://lore.kernel.org/bpf/1667266296-8794-1-git-send-email-wangyufen@huawe... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Chen Jiahao chenjiahao16@huawei.com Signed-off-by: Liu Jian liujian56@huawei.com --- net/ipv4/tcp_bpf.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 438992e28efe..f49554584898 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -385,7 +385,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, { bool cork = false, enospc = sk_msg_full(msg); struct sock *sk_redir; - u32 tosend, delta = 0; + u32 tosend, origsize, sent, delta = 0; u32 eval = __SK_NONE; int ret;
@@ -440,10 +440,12 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, cork = true; psock->cork = NULL; } - sk_msg_return(sk, msg, msg->sg.size); + sk_msg_return(sk, msg, tosend); release_sock(sk);
+ origsize = msg->sg.size; ret = tcp_bpf_sendmsg_redir(sk_redir, msg, tosend, flags); + sent = origsize - msg->sg.size;
if (eval == __SK_REDIRECT) sock_put(sk_redir); @@ -482,7 +484,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, msg->sg.data[msg->sg.start].page_link && msg->sg.data[msg->sg.start].length) { if (eval == __SK_REDIRECT) - sk_mem_charge(sk, msg->sg.size); + sk_mem_charge(sk, tosend - sent); goto more_data; } }
From: Cong Wang cong.wang@bytedance.com
mainline inclusion from mainline-v6.1-rc5 commit 8bbabb3fddcd0f858be69ed5abc9b470a239d6f2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Stanislav reported a lockdep warning, which is caused by the cancel_work_sync() called inside sock_map_close(), as analyzed below by Jakub:
psock->work.func = sk_psock_backlog() ACQUIRE psock->work_mutex sk_psock_handle_skb() skb_send_sock() __skb_send_sock() sendpage_unlocked() kernel_sendpage() sock->ops->sendpage = inet_sendpage() sk->sk_prot->sendpage = tcp_sendpage() ACQUIRE sk->sk_lock tcp_sendpage_locked() RELEASE sk->sk_lock RELEASE psock->work_mutex
sock_map_close() ACQUIRE sk->sk_lock sk_psock_stop() sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED) cancel_work_sync() __cancel_work_timer() __flush_work() // wait for psock->work to finish RELEASE sk->sk_lock
We can move the cancel_work_sync() out of the sock lock protection, but still before saved_close() was called.
Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") Reported-by: Stanislav Fomichev sdf@google.com Signed-off-by: Cong Wang cong.wang@bytedance.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Tested-by: Jakub Sitnicki jakub@cloudflare.com Acked-by: John Fastabend john.fastabend@gmail.com Acked-by: Jakub Sitnicki jakub@cloudflare.com Link: https://lore.kernel.org/bpf/20221102043417.279409-1-xiyou.wangcong@gmail.com (cherry picked from commit 8bbabb3fddcd0f858be69ed5abc9b470a239d6f2) Signed-off-by: Liu Jian liujian56@huawei.com
Conflicts: net/core/skmsg.c --- include/linux/skmsg.h | 2 +- net/core/skmsg.c | 7 ++----- net/core/sock_map.c | 7 ++++--- 3 files changed, 7 insertions(+), 9 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index a83885c5bb86..dda4e951aeba 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -393,7 +393,7 @@ static inline void sk_psock_report_error(struct sk_psock *psock, int err) }
struct sk_psock *sk_psock_init(struct sock *sk, int node); -void sk_psock_stop(struct sk_psock *psock, bool wait); +void sk_psock_stop(struct sk_psock *psock);
int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock); void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock); diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 26a39b672966..78cb3782326d 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -695,16 +695,13 @@ static void sk_psock_link_destroy(struct sk_psock *psock) } }
-void sk_psock_stop(struct sk_psock *psock, bool wait) +void sk_psock_stop(struct sk_psock *psock) { spin_lock_bh(&psock->ingress_lock); sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED); sk_psock_cork_free(psock); __sk_psock_zap_ingress(psock); spin_unlock_bh(&psock->ingress_lock); - - if (wait) - cancel_work_sync(&psock->work); }
static void sk_psock_destroy_deferred(struct work_struct *gc) @@ -750,7 +747,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock) sk_psock_stop_verdict(sk, psock); write_unlock_bh(&sk->sk_callback_lock);
- sk_psock_stop(psock, false); + sk_psock_stop(psock); call_rcu(&psock->rcu, sk_psock_destroy); } EXPORT_SYMBOL_GPL(sk_psock_drop); diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 46a73cbf0729..0e00e9a0d22d 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -1651,7 +1651,7 @@ void sock_map_destroy(struct sock *sk) saved_destroy = psock->saved_destroy; sock_map_remove_links(sk, psock); rcu_read_unlock(); - sk_psock_stop(psock, false); + sk_psock_stop(psock); sk_psock_put(sk, psock); saved_destroy(sk); } @@ -1674,9 +1674,10 @@ void sock_map_close(struct sock *sk, long timeout) saved_close = psock->saved_close; sock_map_remove_links(sk, psock); rcu_read_unlock(); - sk_psock_stop(psock, true); - sk_psock_put(sk, psock); + sk_psock_stop(psock); release_sock(sk); + cancel_work_sync(&psock->work); + sk_psock_put(sk, psock); saved_close(sk, timeout); }
From: Pengcheng Yang yangpc@wangsu.com
stable inclusion from stable-v5.10.163 commit 28e4a763cd4a2b1a78852216ef4bd7df3a05cec6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6AVM6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 7a9841ca025275b5b0edfb0b618934abb6ceec15 ]
In tcp_bpf_send_verdict() redirection, the eval variable is assigned to __SK_REDIRECT after the apply_bytes data is sent, if msg has more_data, sock_put() will be called multiple times.
We should reset the eval variable to __SK_NONE every time more_data starts.
This causes:
IPv4: Attempt to release TCP socket in state 1 00000000b4c925d7 ------------[ cut here ]------------ refcount_t: addition on 0; use-after-free. WARNING: CPU: 5 PID: 4482 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0x110 Modules linked in: CPU: 5 PID: 4482 Comm: sockhash_bypass Kdump: loaded Not tainted 6.0.0 #1 Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014 Call Trace: <TASK> __tcp_transmit_skb+0xa1b/0xb90 ? __alloc_skb+0x8c/0x1a0 ? __kmalloc_node_track_caller+0x184/0x320 tcp_write_xmit+0x22a/0x1110 __tcp_push_pending_frames+0x32/0xf0 do_tcp_sendpages+0x62d/0x640 tcp_bpf_push+0xae/0x2c0 tcp_bpf_sendmsg_redir+0x260/0x410 ? preempt_count_add+0x70/0xa0 tcp_bpf_send_verdict+0x386/0x4b0 tcp_bpf_sendmsg+0x21b/0x3b0 sock_sendmsg+0x58/0x70 __sys_sendto+0xfa/0x170 ? xfd_validate_state+0x1d/0x80 ? switch_fpu_return+0x59/0xe0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function") Signed-off-by: Pengcheng Yang yangpc@wangsu.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Jakub Sitnicki jakub@cloudflare.com Link: https://lore.kernel.org/bpf/1669718441-2654-2-git-send-email-yangpc@wangsu.c... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Wang Hai wanghai38@huawei.com Signed-off-by: Liu Jian liujian56@huawei.com --- net/ipv4/tcp_bpf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index f49554584898..e5a2aaa02f2a 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -386,7 +386,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, bool cork = false, enospc = sk_msg_full(msg); struct sock *sk_redir; u32 tosend, origsize, sent, delta = 0; - u32 eval = __SK_NONE; + u32 eval; int ret;
more_data: @@ -417,6 +417,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, tosend = msg->sg.size; if (psock->apply_bytes && psock->apply_bytes < tosend) tosend = psock->apply_bytes; + eval = __SK_NONE;
switch (psock->eval) { case __SK_PASS:
From: Pengcheng Yang yangpc@wangsu.com
mainline inclusion from mainline-v6.2-rc1 commit a351d6087bf7d3d8440d58d3bf244ec64b89394a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS flag from the msg->flags. If apply_bytes is used and it is larger than the current data being processed, sk_psock_msg_verdict() will not be called when sendmsg() is called again. At this time, the msg->flags is 0, and we lost the BPF_F_INGRESS flag.
So we need to save the BPF_F_INGRESS flag in sk_psock and use it when redirection.
Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support") Signed-off-by: Pengcheng Yang yangpc@wangsu.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Jakub Sitnicki jakub@cloudflare.com Link: https://lore.kernel.org/bpf/1669718441-2654-3-git-send-email-yangpc@wangsu.c... Signed-off-by: Liu Jian liujian56@huawei.com
Conflicts: include/net/tcp.h --- include/linux/skmsg.h | 1 + include/net/tcp.h | 4 ++-- net/core/skmsg.c | 9 ++++++--- net/ipv4/tcp_bpf.c | 11 ++++++----- net/tls/tls_sw.c | 6 ++++-- 5 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index dda4e951aeba..9c14c5fbb8c0 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -94,6 +94,7 @@ struct sk_psock { u32 apply_bytes; u32 cork_bytes; u32 eval; + bool redir_ingress; /* undefined if sk_redir is null */ struct sk_msg *cork; struct sk_psock_progs progs; struct sk_psock_parser parser; diff --git a/include/net/tcp.h b/include/net/tcp.h index 1b832683ccf8..36ab6cf44553 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2238,8 +2238,8 @@ static inline void tcp_bpf_clone(const struct sock *sk, struct sock *newsk) #endif /* CONFIG_BPF_STREAM_PARSER */
#ifdef CONFIG_NET_SOCK_MSG -int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, u32 bytes, - int flags); +int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress, + struct sk_msg *msg, u32 bytes, int flags); int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, int len, int flags); #endif /* CONFIG_NET_SOCK_MSG */ diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 78cb3782326d..cc6570d71911 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -784,13 +784,16 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, ret = sk_psock_map_verd(ret, msg->sk_redir); psock->apply_bytes = msg->apply_bytes; if (ret == __SK_REDIRECT) { - if (psock->sk_redir) + if (psock->sk_redir) { sock_put(psock->sk_redir); - psock->sk_redir = msg->sk_redir; - if (!psock->sk_redir) { + psock->sk_redir = NULL; + } + if (!msg->sk_redir) { ret = __SK_DROP; goto out; } + psock->redir_ingress = sk_msg_to_ingress(msg); + psock->sk_redir = msg->sk_redir; sock_hold(psock->sk_redir); } out: diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index e5a2aaa02f2a..8c2ea1aaef77 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -213,10 +213,9 @@ static int tcp_bpf_push_locked(struct sock *sk, struct sk_msg *msg, return ret; }
-int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, - u32 bytes, int flags) +int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress, + struct sk_msg *msg, u32 bytes, int flags) { - bool ingress = sk_msg_to_ingress(msg); struct sk_psock *psock = sk_psock_get(sk); int ret;
@@ -383,7 +382,7 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, struct sk_msg *msg, int *copied, int flags) { - bool cork = false, enospc = sk_msg_full(msg); + bool cork = false, enospc = sk_msg_full(msg), redir_ingress; struct sock *sk_redir; u32 tosend, origsize, sent, delta = 0; u32 eval; @@ -429,6 +428,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, sk_msg_apply_bytes(psock, tosend); break; case __SK_REDIRECT: + redir_ingress = psock->redir_ingress; sk_redir = psock->sk_redir; sk_msg_apply_bytes(psock, tosend); if (!psock->apply_bytes) { @@ -445,7 +445,8 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, release_sock(sk);
origsize = msg->sg.size; - ret = tcp_bpf_sendmsg_redir(sk_redir, msg, tosend, flags); + ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, + msg, tosend, flags); sent = origsize - msg->sg.size;
if (eval == __SK_REDIRECT) diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 3ee8aa7ec04d..c8f8881032ed 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -807,7 +807,7 @@ static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk, struct sk_psock *psock; struct sock *sk_redir; struct tls_rec *rec; - bool enospc, policy; + bool enospc, policy, redir_ingress; int err = 0, send; u32 delta = 0;
@@ -852,6 +852,7 @@ static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk, } break; case __SK_REDIRECT: + redir_ingress = psock->redir_ingress; sk_redir = psock->sk_redir; memcpy(&msg_redir, msg, sizeof(*msg)); if (msg->apply_bytes < send) @@ -861,7 +862,8 @@ static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk, sk_msg_return_zero(sk, msg, send); msg->sg.size -= send; release_sock(sk); - err = tcp_bpf_sendmsg_redir(sk_redir, &msg_redir, send, flags); + err = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, + &msg_redir, send, flags); lock_sock(sk); if (err < 0) { *copied -= sk_msg_free_nocharge(sk, &msg_redir);
From: Pengcheng Yang yangpc@wangsu.com
stable inclusion from stable-v5.10.163 commit c58df40e3e67fd7603290b5dd24c47cdc1f9ef74 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6AVM6 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 9072931f020bfd907d6d89ee21ff1481cd78b407 ]
Use apply_bytes on ingress redirect, when apply_bytes is less than the length of msg data, some data may be skipped and lost in bpf_tcp_ingress().
If there is still data in the scatterlist that has not been consumed, we cannot move the msg iter.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Pengcheng Yang yangpc@wangsu.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Jakub Sitnicki jakub@cloudflare.com Link: https://lore.kernel.org/bpf/1669718441-2654-4-git-send-email-yangpc@wangsu.c... Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Wang Hai wanghai38@huawei.com Signed-off-by: Liu Jian liujian56@huawei.com --- net/ipv4/tcp_bpf.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 8c2ea1aaef77..fd6fbf3d80a2 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -126,8 +126,11 @@ static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock, tmp->sg.end = i; if (apply) { apply_bytes -= size; - if (!apply_bytes) + if (!apply_bytes) { + if (sge->length) + sk_msg_iter_var_prev(i); break; + } } } while (i != msg->sg.end);
From: Jakub Sitnicki jakub@cloudflare.com
stable inclusion from stable-v5.10.168 commit 9bd6074e1872d22190a8da30e796cbf937d334f0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6IXN2 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit ddce1e091757d0259107c6c0c7262df201de2b66 ]
A listening socket linked to a sockmap has its sk_prot overridden. It points to one of the struct proto variants in tcp_bpf_prots. The variant depends on the socket's family and which sockmap programs are attached.
A child socket cloned from a TCP listener initially inherits their sk_prot. But before cloning is finished, we restore the child's proto to the listener's original non-tcp_bpf_prots one. This happens in tcp_create_openreq_child -> tcp_bpf_clone.
Today, in tcp_bpf_clone we detect if the child's proto should be restored by checking only for the TCP_BPF_BASE proto variant. This is not correct. The sk_prot of listening socket linked to a sockmap can point to to any variant in tcp_bpf_prots.
If the listeners sk_prot happens to be not the TCP_BPF_BASE variant, then the child socket unintentionally is left if the inherited sk_prot by tcp_bpf_clone.
This leads to issues like infinite recursion on close [1], because the child state is otherwise not set up for use with tcp_bpf_prot operations.
Adjust the check in tcp_bpf_clone to detect all of tcp_bpf_prots variants.
Note that it wouldn't be sufficient to check the socket state when overriding the sk_prot in tcp_bpf_update_proto in order to always use the TCP_BPF_BASE variant for listening sockets. Since commit b8b8315e39ff ("bpf, sockmap: Remove unhash handler for BPF sockmap usage") it is possible for a socket to transition to TCP_LISTEN state while already linked to a sockmap, e.g. connect() -> insert into map -> connect(AF_UNSPEC) -> listen().
[1]: https://lore.kernel.org/all/00000000000073b14905ef2e7401@google.com/
Fixes: e80251555f0b ("tcp_bpf: Don't let child socket inherit parent protocol ops on copy") Reported-by: syzbot+04c21ed96d861dccc5cd@syzkaller.appspotmail.com Signed-off-by: Jakub Sitnicki jakub@cloudflare.com Acked-by: John Fastabend john.fastabend@gmail.com Link: https://lore.kernel.org/r/20230113-sockmap-fix-v2-2-1e0ee7ac2f90@cloudflare.... Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Wang Hai wanghai38@huawei.com Signed-off-by: Liu Jian liujian56@huawei.com --- include/linux/util_macros.h | 12 ++++++++++++ net/ipv4/tcp_bpf.c | 4 ++-- 2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/include/linux/util_macros.h b/include/linux/util_macros.h index 72299f261b25..43db6e47503c 100644 --- a/include/linux/util_macros.h +++ b/include/linux/util_macros.h @@ -38,4 +38,16 @@ */ #define find_closest_descending(x, a, as) __find_closest(x, a, as, >=)
+/** + * is_insidevar - check if the @ptr points inside the @var memory range. + * @ptr: the pointer to a memory address. + * @var: the variable which address and size identify the memory range. + * + * Evaluates to true if the address in @ptr lies within the memory + * range allocated to @var. + */ +#define is_insidevar(ptr, var) \ + ((uintptr_t)(ptr) >= (uintptr_t)(var) && \ + (uintptr_t)(ptr) < (uintptr_t)(var) + sizeof(var)) + #endif diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index fd6fbf3d80a2..63a21c703740 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -6,6 +6,7 @@ #include <linux/bpf.h> #include <linux/init.h> #include <linux/wait.h> +#include <linux/util_macros.h>
#include <net/inet_common.h> #include <net/tls.h> @@ -726,10 +727,9 @@ struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock) */ void tcp_bpf_clone(const struct sock *sk, struct sock *newsk) { - int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4; struct proto *prot = newsk->sk_prot;
- if (prot == &tcp_bpf_prots[family][TCP_BPF_BASE]) + if (is_insidevar(prot, tcp_bpf_prots)) newsk->sk_prot = sk->sk_prot_creator; } #endif /* CONFIG_BPF_STREAM_PARSER */
From: Jakub Sitnicki jakub@cloudflare.com
mainline inclusion from mainline-v6.2-rc7 commit 5b4a79ba65a1ab479903fff2e604865d229b70a9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
sock_map proto callbacks should never call themselves by design. Protect against bugs like [1] and break out of the recursive loop to avoid a stack overflow in favor of a resource leak.
[1] https://lore.kernel.org/all/00000000000073b14905ef2e7401@google.com/
Suggested-by: Eric Dumazet edumazet@google.com Signed-off-by: Jakub Sitnicki jakub@cloudflare.com Acked-by: John Fastabend john.fastabend@gmail.com Link: https://lore.kernel.org/r/20230113-sockmap-fix-v2-1-1e0ee7ac2f90@cloudflare.... Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Liu Jian liujian56@huawei.com --- net/core/sock_map.c | 61 +++++++++++++++++++++++++-------------------- 1 file changed, 34 insertions(+), 27 deletions(-)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 0e00e9a0d22d..66b7f3fb01ed 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -1623,15 +1623,16 @@ void sock_map_unhash(struct sock *sk) psock = sk_psock(sk); if (unlikely(!psock)) { rcu_read_unlock(); - if (sk->sk_prot->unhash) - sk->sk_prot->unhash(sk); - return; + saved_unhash = READ_ONCE(sk->sk_prot)->unhash; + } else { + saved_unhash = psock->saved_unhash; + sock_map_remove_links(sk, psock); + rcu_read_unlock(); } - - saved_unhash = psock->saved_unhash; - sock_map_remove_links(sk, psock); - rcu_read_unlock(); - saved_unhash(sk); + if (WARN_ON_ONCE(saved_unhash == sock_map_unhash)) + return; + if (saved_unhash) + saved_unhash(sk); }
void sock_map_destroy(struct sock *sk) @@ -1643,17 +1644,18 @@ void sock_map_destroy(struct sock *sk) psock = sk_psock_get(sk); if (unlikely(!psock)) { rcu_read_unlock(); - if (sk->sk_prot->destroy) - sk->sk_prot->destroy(sk); - return; + saved_destroy = READ_ONCE(sk->sk_prot)->destroy; + } else { + saved_destroy = psock->saved_destroy; + sock_map_remove_links(sk, psock); + rcu_read_unlock(); + sk_psock_stop(psock); + sk_psock_put(sk, psock); } - - saved_destroy = psock->saved_destroy; - sock_map_remove_links(sk, psock); - rcu_read_unlock(); - sk_psock_stop(psock); - sk_psock_put(sk, psock); - saved_destroy(sk); + if (WARN_ON_ONCE(saved_destroy == sock_map_destroy)) + return; + if (saved_destroy) + saved_destroy(sk); } EXPORT_SYMBOL_GPL(sock_map_destroy);
@@ -1668,16 +1670,21 @@ void sock_map_close(struct sock *sk, long timeout) if (unlikely(!psock)) { rcu_read_unlock(); release_sock(sk); - return sk->sk_prot->close(sk, timeout); + saved_close = READ_ONCE(sk->sk_prot)->close; + } else { + saved_close = psock->saved_close; + sock_map_remove_links(sk, psock); + rcu_read_unlock(); + sk_psock_stop(psock); + release_sock(sk); + cancel_work_sync(&psock->work); + sk_psock_put(sk, psock); } - - saved_close = psock->saved_close; - sock_map_remove_links(sk, psock); - rcu_read_unlock(); - sk_psock_stop(psock); - release_sock(sk); - cancel_work_sync(&psock->work); - sk_psock_put(sk, psock); + /* Make sure we do not recurse. This is a bug. + * Leak the socket instead of crashing on a stack overflow. + */ + if (WARN_ON_ONCE(saved_close == sock_map_close)) + return; saved_close(sk, timeout); }
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v5.10.181 commit 4493914009609d6351b3a41dfe3b0ac5209bd4c6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
---------------------------
[ Upstream commit d0ac89f6f9879fae316c155de77b5173b3e2c9c9 ]
__condition is evaluated twice in sk_wait_event() macro.
First invocation is lockless, and reads can race with writes, as spotted by syzbot.
BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect
write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1: tcp_disconnect+0x2cd/0xdb0 inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911 __sys_shutdown_sock net/socket.c:2343 [inline] __sys_shutdown net/socket.c:2355 [inline] __do_sys_shutdown net/socket.c:2363 [inline] __se_sys_shutdown+0xf8/0x140 net/socket.c:2361 __x64_sys_shutdown+0x31/0x40 net/socket.c:2361 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0: sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75 tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484 inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] __sys_sendto+0x246/0x300 net/socket.c:2142 __do_sys_sendto net/socket.c:2154 [inline] __se_sys_sendto net/socket.c:2150 [inline] __x64_sys_sendto+0x78/0x90 net/socket.c:2150 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x00000000 -> 0x00000068
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Liu Jian liujian56@huawei.com --- net/core/stream.c | 12 ++++++------ net/ipv4/tcp_bpf.c | 2 +- net/llc/af_llc.c | 8 +++++--- net/smc/smc_close.c | 4 ++-- net/smc/smc_rx.c | 4 ++-- net/smc/smc_tx.c | 4 ++-- net/tipc/socket.c | 4 ++-- net/tls/tls_main.c | 3 ++- 8 files changed, 22 insertions(+), 19 deletions(-)
diff --git a/net/core/stream.c b/net/core/stream.c index cf4a8704c87c..8267b46e9ae4 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -73,8 +73,8 @@ int sk_stream_wait_connect(struct sock *sk, long *timeo_p) add_wait_queue(sk_sleep(sk), &wait); sk->sk_write_pending++; done = sk_wait_event(sk, timeo_p, - !sk->sk_err && - !((1 << sk->sk_state) & + !READ_ONCE(sk->sk_err) && + !((1 << READ_ONCE(sk->sk_state)) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)), &wait); remove_wait_queue(sk_sleep(sk), &wait); sk->sk_write_pending--; @@ -87,9 +87,9 @@ EXPORT_SYMBOL(sk_stream_wait_connect); * sk_stream_closing - Return 1 if we still have things to send in our buffers. * @sk: socket to verify */ -static inline int sk_stream_closing(struct sock *sk) +static int sk_stream_closing(const struct sock *sk) { - return (1 << sk->sk_state) & + return (1 << READ_ONCE(sk->sk_state)) & (TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_LAST_ACK); }
@@ -142,8 +142,8 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); sk->sk_write_pending++; - sk_wait_event(sk, ¤t_timeo, sk->sk_err || - (sk->sk_shutdown & SEND_SHUTDOWN) || + sk_wait_event(sk, ¤t_timeo, READ_ONCE(sk->sk_err) || + (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN) || (sk_stream_memory_free(sk) && !vm_wait), &wait); sk->sk_write_pending--; diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 63a21c703740..ad612109317f 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -263,7 +263,7 @@ static int tcp_bpf_wait_data(struct sock *sk, struct sk_psock *psock, sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); ret = sk_wait_event(sk, &timeo, !list_empty(&psock->ingress_msg) || - !skb_queue_empty(&sk->sk_receive_queue), &wait); + !skb_queue_empty_lockless(&sk->sk_receive_queue), &wait); sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); remove_wait_queue(sk_sleep(sk), &wait); return ret; diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c index 99a37c411323..01e26698285a 100644 --- a/net/llc/af_llc.c +++ b/net/llc/af_llc.c @@ -582,7 +582,8 @@ static int llc_ui_wait_for_disc(struct sock *sk, long timeout)
add_wait_queue(sk_sleep(sk), &wait); while (1) { - if (sk_wait_event(sk, &timeout, sk->sk_state == TCP_CLOSE, &wait)) + if (sk_wait_event(sk, &timeout, + READ_ONCE(sk->sk_state) == TCP_CLOSE, &wait)) break; rc = -ERESTARTSYS; if (signal_pending(current)) @@ -602,7 +603,8 @@ static bool llc_ui_wait_for_conn(struct sock *sk, long timeout)
add_wait_queue(sk_sleep(sk), &wait); while (1) { - if (sk_wait_event(sk, &timeout, sk->sk_state != TCP_SYN_SENT, &wait)) + if (sk_wait_event(sk, &timeout, + READ_ONCE(sk->sk_state) != TCP_SYN_SENT, &wait)) break; if (signal_pending(current) || !timeout) break; @@ -621,7 +623,7 @@ static int llc_ui_wait_for_busy_core(struct sock *sk, long timeout) while (1) { rc = 0; if (sk_wait_event(sk, &timeout, - (sk->sk_shutdown & RCV_SHUTDOWN) || + (READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN) || (!llc_data_accept_state(llc->state) && !llc->remote_busy_flag && !llc->p_flag), &wait)) diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c index 84102db5bb31..149a59ecd299 100644 --- a/net/smc/smc_close.c +++ b/net/smc/smc_close.c @@ -64,8 +64,8 @@ static void smc_close_stream_wait(struct smc_sock *smc, long timeout)
rc = sk_wait_event(sk, &timeout, !smc_tx_prepared_sends(&smc->conn) || - sk->sk_err == ECONNABORTED || - sk->sk_err == ECONNRESET || + READ_ONCE(sk->sk_err) == ECONNABORTED || + READ_ONCE(sk->sk_err) == ECONNRESET || smc->conn.killed, &wait); if (rc) diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c index 7f7e983e42b1..3757aff6c2f0 100644 --- a/net/smc/smc_rx.c +++ b/net/smc/smc_rx.c @@ -203,9 +203,9 @@ int smc_rx_wait(struct smc_sock *smc, long *timeo, sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); add_wait_queue(sk_sleep(sk), &wait); rc = sk_wait_event(sk, timeo, - sk->sk_err || + READ_ONCE(sk->sk_err) || cflags->peer_conn_abort || - sk->sk_shutdown & RCV_SHUTDOWN || + READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN || conn->killed || fcrit(conn), &wait); diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c index 52ef1fca0b60..2429f9fc7e0e 100644 --- a/net/smc/smc_tx.c +++ b/net/smc/smc_tx.c @@ -110,8 +110,8 @@ static int smc_tx_wait(struct smc_sock *smc, int flags) break; /* at least 1 byte of free & no urgent data */ set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); sk_wait_event(sk, &timeo, - sk->sk_err || - (sk->sk_shutdown & SEND_SHUTDOWN) || + READ_ONCE(sk->sk_err) || + (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN) || smc_cdc_rxed_any_close(conn) || (atomic_read(&conn->sndbuf_space) && !conn->urg_tx_pend), diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 8f3c9fbb9916..7cf9b40b5c73 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -300,9 +300,9 @@ static void tsk_rej_rx_queue(struct sock *sk, int error) tipc_sk_respond(sk, skb, error); }
-static bool tipc_sk_connected(struct sock *sk) +static bool tipc_sk_connected(const struct sock *sk) { - return sk->sk_state == TIPC_ESTABLISHED; + return READ_ONCE(sk->sk_state) == TIPC_ESTABLISHED; }
/* tipc_sk_type_connectionless - check if the socket is datagram socket diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index 9d2254b2879d..5790f992b7f8 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -92,7 +92,8 @@ int wait_on_pending_writer(struct sock *sk, long *timeo) break; }
- if (sk_wait_event(sk, timeo, !sk->sk_write_pending, &wait)) + if (sk_wait_event(sk, timeo, + !READ_ONCE(sk->sk_write_pending), &wait)) break; } remove_wait_queue(sk_sleep(sk), &wait);
From: zhangmingyi zhangmingyi5@huawei.com
euleros inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I545NW CVE: NA
--------------------------------
fix a bug in bpf_tcp_ingress(), addr use after free
Signed-off-by: zhangmingyi zhangmingyi5@huawei.com Reviewed-by: liuxin liuxin350@huawei.com Reviewed-by: wuchangye wuchangye@huawei.com Fixes: 8818e269f18d ("bpf, sockmap: Add sk_rmem_alloc check for sockmap") Signed-off-by: Liu Jian liujian56@huawei.com --- net/ipv4/tcp_bpf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index ad612109317f..1cff6ae3f6fd 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -138,7 +138,8 @@ static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock, if (!ret) { msg->sg.start = i; sk_psock_queue_msg(psock, tmp); - atomic_add(tmp->sg.size, &sk->sk_rmem_alloc); + if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) + atomic_add(tmp->sg.size, &sk->sk_rmem_alloc); sk_psock_data_ready(sk, psock); } else { sk_msg_free(sk, tmp);