Patch set for CVE-2024-26921.
Florian Westphal (1):
  inet: inet_defrag: prevent sk release while still in use

Guillaume Nault (1):
  inet: frags: re-introduce skb coalescing for local delivery

Vasily Averin (2):
  skbuff: introduce skb_expand_head()
  skb_expand_head() adjust skb->truesize incorrectly

Ziyang Xuan (2):
  net: Fix KABI break for introducing is_skb_wmem()
  sk_buff: Fix KABI break for the modification of struct sk_buff
 include/linux/skbuff.h                  |   1 +
 include/net/inet_frag.h                 |   2 +-
 include/net/tcp.h                       |   2 +-
 include/net/tcp_ext.h                   |  14 +++
 net/core/skbuff.c                       |  52 +++++++++++
 net/ipv4/inet_fragment.c                | 109 ++++++++++++++++++------
 net/ipv4/ip_fragment.c                  |  10 ++-
 net/ipv6/netfilter/nf_conntrack_reasm.c |   4 +-
 net/ipv6/reassembly.c                   |   2 +-
 9 files changed, 165 insertions(+), 31 deletions(-)
 create mode 100644 include/net/tcp_ext.h
From: Vasily Averin <vvs@virtuozzo.com>
mainline inclusion
from mainline-v5.15-rc1
commit f1260ff15a71b8fc122b2c9abd8a7abffb6e0168
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9HVTH
CVE: CVE-2024-26921
--------------------------------
Like skb_realloc_headroom(), the new helper increases the headroom of the specified skb. Unlike skb_realloc_headroom(), it avoids allocating a new skb when possible; it copies skb->sk to the new skb when needed and frees the original skb on failure.
This helps to simplify ip[6]_finish_output2() and a few other similar cases.
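For illustration only (not part of the patch): with the helper available, the open-coded headroom fixup in a caller such as ip_finish_output2() can shrink roughly as follows. This is a sketch; exact caller code differs per kernel version.

	/* Before: each caller re-implements clone/own/free by hand. */
	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
		struct sk_buff *skb2;

		skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
		if (!skb2) {
			kfree_skb(skb);
			return -ENOMEM;
		}
		if (skb->sk)
			skb_set_owner_w(skb2, skb->sk);
		consume_skb(skb);
		skb = skb2;
	}

	/* After: skb_expand_head() frees the original skb on failure. */
	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
		skb = skb_expand_head(skb, hh_len);
		if (!skb)
			return -ENOMEM;
	}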
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b67d42871ee9..c7d1d8b5f41b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1059,6 +1059,7 @@ static inline struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom,
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail, gfp_t gfp_mask);
 struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
				     unsigned int headroom);
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
				int newtailroom, gfp_t priority);
 int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 69081cdfab43..9986f237817a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1562,6 +1562,48 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom)
 }
 EXPORT_SYMBOL(skb_realloc_headroom);

+/**
+ * skb_expand_head - reallocate header of &sk_buff
+ * @skb: buffer to reallocate
+ * @headroom: needed headroom
+ *
+ * Unlike skb_realloc_headroom, this one does not allocate a new skb
+ * if possible; copies skb->sk to new skb as needed
+ * and frees original skb in case of failures.
+ *
+ * It expect increased headroom and generates warning otherwise.
+ */
+
+struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
+{
+	int delta = headroom - skb_headroom(skb);
+
+	if (WARN_ONCE(delta <= 0,
+		      "%s is expecting an increase in the headroom", __func__))
+		return skb;
+
+	/* pskb_expand_head() might crash, if skb is shared */
+	if (skb_shared(skb)) {
+		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+
+		if (likely(nskb)) {
+			if (skb->sk)
+				skb_set_owner_w(nskb, skb->sk);
+			consume_skb(skb);
+		} else {
+			kfree_skb(skb);
+		}
+		skb = nskb;
+	}
+	if (skb &&
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
+		kfree_skb(skb);
+		skb = NULL;
+	}
+	return skb;
+}
+EXPORT_SYMBOL(skb_expand_head);
+
 /**
  * skb_copy_expand - copy and expand sk_buff
  * @skb: buffer to copy
From: Vasily Averin <vvs@virtuozzo.com>
mainline inclusion
from mainline-v5.15
commit 7f678def99d29c520418607509bb19c7fc96a6db
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9HVTH
CVE: CVE-2024-26921
--------------------------------
Christoph Paasch reports [1] incorrect skb->truesize after the skb_expand_head() call in ip6_xmit(). This may happen for two reasons:
- skb_set_owner_w() for the newly cloned skb is called too early, before
  pskb_expand_head(), where truesize is adjusted for the (!skb->sk) case.
- pskb_expand_head() does not adjust truesize in the (skb->sk) case.
  In that case sk->sk_wmem_alloc should be adjusted too.
[1] https://lkml.org/lkml/2021/8/20/1082
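The invariant at stake, as a minimal sketch (names mirror the diff below): when a buffer charged to a socket grows, the growth must be reflected in both skb->truesize and sk->sk_wmem_alloc, and only after pskb_expand_head() has actually resized the head:

	int osize = skb_end_offset(skb);	/* size before expansion */

	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC))
		goto fail;

	if (skb->sk) {
		/* charge only the real growth to the owning socket */
		int grown = skb_end_offset(skb) - osize;

		refcount_add(grown, &skb->sk->sk_wmem_alloc);
		skb->truesize += grown;
	}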
Fixes: f1260ff15a71 ("skbuff: introduce skb_expand_head()")
Fixes: 2d85a1b31dde ("ipv6: ip6_finish_output2: set sk into newly allocated nskb")
Reported-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/644330dd-477e-0462-83bf-9f514c41edd1@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Conflicts:
	net/core/skbuff.c
[The version does not include commit 7b7ed885aff2 and commit 2544af0344ba]
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
---
 net/core/skbuff.c          | 37 ++++++++++++++++++++++++-------------
 net/core/sock_destructor.h | 12 ++++++++++++
 2 files changed, 36 insertions(+), 13 deletions(-)
 create mode 100644 net/core/sock_destructor.h
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9986f237817a..1e0999f22f0b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -77,6 +77,8 @@
 #include <linux/capability.h>
 #include <linux/user_namespace.h>

+#include "sock_destructor.h"
+
 struct kmem_cache *skbuff_head_cache __ro_after_init;
 static struct kmem_cache *skbuff_fclone_cache __ro_after_init;
 int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;
@@ -1577,30 +1579,39 @@ EXPORT_SYMBOL(skb_realloc_headroom);
 struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom)
 {
 	int delta = headroom - skb_headroom(skb);
+	int osize = skb_end_offset(skb);
+	struct sock *sk = skb->sk;

 	if (WARN_ONCE(delta <= 0,
		      "%s is expecting an increase in the headroom", __func__))
 		return skb;

-	/* pskb_expand_head() might crash, if skb is shared */
-	if (skb_shared(skb)) {
+	delta = SKB_DATA_ALIGN(delta);
+	/* pskb_expand_head() might crash, if skb is shared. */
+	if (skb_shared(skb) || !is_skb_wmem(skb)) {
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);

-		if (likely(nskb)) {
-			if (skb->sk)
-				skb_set_owner_w(nskb, skb->sk);
-			consume_skb(skb);
-		} else {
-			kfree_skb(skb);
-		}
+		if (unlikely(!nskb))
+			goto fail;
+
+		if (sk)
+			skb_set_owner_w(nskb, sk);
+		consume_skb(skb);
 		skb = nskb;
 	}
-	if (skb &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-		kfree_skb(skb);
-		skb = NULL;
+	if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC))
+		goto fail;
+
+	if (sk && is_skb_wmem(skb)) {
+		delta = skb_end_offset(skb) - osize;
+		refcount_add(delta, &sk->sk_wmem_alloc);
+		skb->truesize += delta;
 	}
 	return skb;
+
+fail:
+	kfree_skb(skb);
+	return NULL;
 }
 EXPORT_SYMBOL(skb_expand_head);
diff --git a/net/core/sock_destructor.h b/net/core/sock_destructor.h
new file mode 100644
index 000000000000..2f396e6bfba5
--- /dev/null
+++ b/net/core/sock_destructor.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _NET_CORE_SOCK_DESTRUCTOR_H
+#define _NET_CORE_SOCK_DESTRUCTOR_H
+#include <net/tcp.h>
+
+static inline bool is_skb_wmem(const struct sk_buff *skb)
+{
+	return skb->destructor == sock_wfree ||
+	       skb->destructor == __sock_wfree ||
+	       (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
+}
+#endif
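is_skb_wmem() recognizes a buffer charged against a socket's write allocation by its destructor (sock_wfree(), __sock_wfree() or tcp_wfree()). A hedged usage sketch of the guard it enables, not taken verbatim from the patch:

	/* Adjust sk_wmem_alloc only if the skb is actually accounted
	 * there; skbs owned via other destructors (e.g. sock_rfree)
	 * must leave the write counter untouched.
	 */
	if (skb->sk && is_skb_wmem(skb)) {
		refcount_add(delta, &skb->sk->sk_wmem_alloc);
		skb->truesize += delta;
	}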
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9HVTH
CVE: CVE-2024-26921
--------------------------------
Fix the KABI break introduced when backporting commit 7f678def99d2 ("skb_expand_head() adjust skb->truesize incorrectly"), by moving is_skb_wmem() out of the private net/core/sock_destructor.h and into include/net/tcp_ext.h.
Fixes: 7f678def99d2 ("skb_expand_head() adjust skb->truesize incorrectly")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
---
 include/net/tcp.h                                   |  2 +-
 net/core/sock_destructor.h => include/net/tcp_ext.h | 10 ++++++----
 net/core/skbuff.c                                   |  3 +--
 3 files changed, 8 insertions(+), 7 deletions(-)
 rename net/core/sock_destructor.h => include/net/tcp_ext.h (72%)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 44deff714e93..68e67b79534b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -43,6 +43,7 @@
 #include <net/tcp_states.h>
 #include <net/inet_ecn.h>
 #include <net/dst.h>
+#include <net/tcp_ext.h>

 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
@@ -338,7 +339,6 @@ int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
			 size_t size, int flags);
 void tcp_release_cb(struct sock *sk);
-void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
 void tcp_delack_timer_handler(struct sock *sk);
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
diff --git a/net/core/sock_destructor.h b/include/net/tcp_ext.h
similarity index 72%
rename from net/core/sock_destructor.h
rename to include/net/tcp_ext.h
index 2f396e6bfba5..733534808c4c 100644
--- a/net/core/sock_destructor.h
+++ b/include/net/tcp_ext.h
@@ -1,7 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _NET_CORE_SOCK_DESTRUCTOR_H
-#define _NET_CORE_SOCK_DESTRUCTOR_H
-#include <net/tcp.h>
+
+#ifndef _TCP_EXT_H
+#define _TCP_EXT_H
+
+void tcp_wfree(struct sk_buff *skb);

 static inline bool is_skb_wmem(const struct sk_buff *skb)
 {
@@ -9,4 +11,4 @@ static inline bool is_skb_wmem(const struct sk_buff *skb)
	       skb->destructor == __sock_wfree ||
	       (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree);
 }
-#endif
+#endif /* _TCP_EXT_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1e0999f22f0b..4953be162818 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,6 +70,7 @@
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
 #include <net/xfrm.h>
+#include <net/tcp_ext.h>

 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -77,8 +78,6 @@
 #include <linux/capability.h>
 #include <linux/user_namespace.h>

-#include "sock_destructor.h"
-
 struct kmem_cache *skbuff_head_cache __ro_after_init;
 static struct kmem_cache *skbuff_fclone_cache __ro_after_init;
 int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;
From: Guillaume Nault <gnault@redhat.com>
mainline inclusion
from mainline-v5.3-rc6
commit 891584f48a9084ba462f10da4c6bb28b6181b543
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9HVTH
CVE: CVE-2024-26921
--------------------------------
Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus generating many fragments) and running over an IPsec tunnel, reported more than 6Gbps throughput. After that patch, the same test gets only 9Mbps when receiving on a be2net nic (driver can make a big difference here, for example, ixgbe doesn't seem to be affected).
By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing (IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert "ipv4: use skb coalescing in defragmentation"")).
Without fragment coalescing, be2net runs out of Rx ring entries and starts to drop frames (ethtool reports rx_drops_no_frags errors). Since the netperf traffic is composed only of UDP fragments, any lost packet prevents reassembly of the full datagram. Therefore, fragments which have no possibility of ever being reassembled pile up in the reassembly queue, until the memory accounting exceeds the threshold. At that point no fragment is accepted anymore, which effectively discards all netperf traffic.
When reassembly timeout expires, some stale fragments are removed from the reassembly queue, so a few packets can be received, reassembled and delivered to the netperf receiver. But the nic still drops frames and soon the reassembly queue gets filled again with stale fragments. These long time frames where no datagram can be received explain why the performance drop is so significant.
Re-introducing fragment coalescing is enough to restore the initial performance (6.6Gbps with be2net): the driver doesn't drop frames anymore (no more rx_drops_no_frags errors) and the reassembly engine works at full speed.
This patch is quite conservative and only coalesces skbs for local IPv4 and IPv6 delivery (in order to avoid changing skb geometry when forwarding). Coalescing could be extended in the future if need be, as more scenarios would probably benefit from it.
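The resulting per-call-site policy, summarized as a sketch (the actual hunks follow further below):

	/* IPv4 reassembly: coalesce only for local delivery */
	inet_frag_reasm_finish(&qp->q, skb, reasm_data,
			       ip_frag_coalesce_ok(qp));

	/* IPv6 conntrack reassembly (packet may still be forwarded):
	 * never coalesce
	 */
	inet_frag_reasm_finish(&fq->q, skb, reasm_data, false);

	/* IPv6 reassembly (local input only): always coalesce */
	inet_frag_reasm_finish(&fq->q, skb, reasm_data, true);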
[0]: Test configuration

Sender:
  ip xfrm policy flush
  ip xfrm state flush
  ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 \
     aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 \
     mode transport sel src fc00:1::1 dst fc00:2::1
  ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in \
     tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
  ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 \
     aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 \
     mode transport sel src fc00:2::1 dst fc00:1::1
  ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out \
     tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
  netserver -D -L fc00:2::1

Receiver:
  ip xfrm policy flush
  ip xfrm state flush
  ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 \
     aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 \
     mode transport sel src fc00:2::1 dst fc00:1::1
  ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in \
     tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
  ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 \
     aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 \
     mode transport sel src fc00:1::1 dst fc00:2::1
  ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out \
     tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
  netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM \
     -I 99,5 -i 5,5 -T5,5 -6
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
	net/ieee802154/6lowpan/reassembly.c
	net/ipv4/inet_fragment.c
[The version does not include commit 254c5dbe15d4 and 6ce3b4dcee4f.]
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
---
 include/net/inet_frag.h                 |  2 +-
 net/ipv4/inet_fragment.c                | 39 ++++++++++++++++++-------
 net/ipv4/ip_fragment.c                  |  8 ++++-
 net/ipv6/netfilter/nf_conntrack_reasm.c |  2 +-
 net/ipv6/reassembly.c                   |  2 +-
 5 files changed, 38 insertions(+), 15 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index b02bf737d019..7d373beb70e6 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -162,7 +162,7 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
 void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
			      struct sk_buff *parent);
 void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
-			    void *reasm_data);
+			    void *reasm_data, bool try_coalesce);
 struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);

 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 9f69411251d0..1ab592d45afd 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -437,11 +437,12 @@ void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
 EXPORT_SYMBOL(inet_frag_reasm_prepare);

 void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
-			    void *reasm_data)
+			    void *reasm_data, bool try_coalesce)
 {
 	struct sk_buff **nextp = (struct sk_buff **)reasm_data;
 	struct rb_node *rbn;
 	struct sk_buff *fp;
+	int sum_truesize;

 	skb_push(head, head->data - skb_network_header(head));

@@ -449,25 +450,41 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 	fp = FRAG_CB(head)->next_frag;
 	rbn = rb_next(&head->rbnode);
 	rb_erase(&head->rbnode, &q->rb_fragments);
+
+	sum_truesize = head->truesize;
 	while (rbn || fp) {
 		/* fp points to the next sk_buff in the current run;
 		 * rbn points to the next run.
 		 */
 		/* Go through the current run. */
 		while (fp) {
-			*nextp = fp;
-			nextp = &fp->next;
-			fp->prev = NULL;
-			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
-			fp->sk = NULL;
-			head->data_len += fp->len;
-			head->len += fp->len;
+			struct sk_buff *next_frag = FRAG_CB(fp)->next_frag;
+			bool stolen;
+			int delta;
+
+			sum_truesize += fp->truesize;
 			if (head->ip_summed != fp->ip_summed)
 				head->ip_summed = CHECKSUM_NONE;
 			else if (head->ip_summed == CHECKSUM_COMPLETE)
 				head->csum = csum_add(head->csum, fp->csum);
-			head->truesize += fp->truesize;
-			fp = FRAG_CB(fp)->next_frag;
+
+			if (try_coalesce && skb_try_coalesce(head, fp, &stolen,
+							     &delta)) {
+				kfree_skb_partial(fp, stolen);
+			} else {
+				fp->prev = NULL;
+				memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+				fp->sk = NULL;
+
+				head->data_len += fp->len;
+				head->len += fp->len;
+				head->truesize += fp->truesize;
+
+				*nextp = fp;
+				nextp = &fp->next;
+			}
+
+			fp = next_frag;
 		}
 		/* Move to the next run. */
 		if (rbn) {
@@ -478,7 +495,7 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
 			rbn = rbnext;
 		}
 	}
-	sub_frag_mem_limit(q->net, head->truesize);
+	sub_frag_mem_limit(q->net, sum_truesize);

 	*nextp = NULL;
 	skb_mark_not_on_list(head);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5a1d39e32196..a4cd5db79a50 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -397,6 +397,11 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	return err;
 }

+static bool ip_frag_coalesce_ok(const struct ipq *qp)
+{
+	return qp->q.key.v4.user == IP_DEFRAG_LOCAL_DELIVER;
+}
+
 /* Build a new IP datagram from all its fragments. */
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
			 struct sk_buff *prev_tail, struct net_device *dev)
@@ -425,7 +430,8 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	if (len > 65535)
 		goto out_oversize;

-	inet_frag_reasm_finish(&qp->q, skb, reasm_data);
+	inet_frag_reasm_finish(&qp->q, skb, reasm_data,
+			       ip_frag_coalesce_ok(qp));

 	skb->dev = dev;
 	IPCB(skb)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 35d5a76867d0..247a81467721 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -359,7 +359,7 @@ static int nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *skb,

 	skb_reset_transport_header(skb);

-	inet_frag_reasm_finish(&fq->q, skb, reasm_data);
+	inet_frag_reasm_finish(&fq->q, skb, reasm_data, false);

 	skb->ignore_df = 1;
 	skb->dev = dev;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index b596727f0497..3265811845c2 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -288,7 +288,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *skb,

 	skb_reset_transport_header(skb);

-	inet_frag_reasm_finish(&fq->q, skb, reasm_data);
+	inet_frag_reasm_finish(&fq->q, skb, reasm_data, true);

 	skb->dev = dev;
 	ipv6_hdr(skb)->payload_len = htons(payload_len);
From: Florian Westphal <fw@strlen.de>
mainline inclusion
from mainline-v6.9-rc2
commit 18685451fc4e546fc0e718580d32df3c0e5c8272
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9HVTH
CVE: CVE-2024-26921
--------------------------------
ip_local_out() and other functions can pass skb->sk as function argument.
If the skb is a fragment and reassembly happens before such function call returns, the sk must not be released.
This affects skb fragments reassembled via netfilter or similar modules, e.g. openvswitch or act_ct.c, when run as part of the tx pipeline.
Eric Dumazet made an initial analysis of this bug. Quoting Eric: Calling ip_defrag() in output path is also implying skb_orphan(), which is buggy because output path relies on sk not disappearing.
A relevant old patch about the issue was: 8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()")
[..]
net/ipv4/ip_output.c depends on skb->sk being set, and probably to an inet socket, not an arbitrary one.
If we orphan the packet in ipvlan, then downstream things like FQ packet scheduler will not work properly.
We need to change ip_defrag() to only use skb_orphan() when really needed, ie whenever frag_list is going to be used.
Eric suggested to stash sk in fragment queue and made an initial patch. However there is a problem with this:
If the skb is refragmented again right after, ip_do_fragment() will copy head->sk to the new fragments and set up the destructor to sock_wfree. IOW, we have no choice but to fix up sk_wmem accounting to reflect the fully reassembled skb, else wmem will underflow.
This change moves the orphan down into the core, to last possible moment. As ip_defrag_offset is aliased with sk_buff->sk member, we must move the offset into the FRAG_CB, else skb->sk gets clobbered.
This allows to delay the orphaning long enough to learn if the skb has to be queued or if the skb is completing the reasm queue.
In the former case, things work as before, skb is orphaned. This is safe because skb gets queued/stolen and won't continue past reasm engine.
In the latter case, we will steal the skb->sk reference, reattach it to the head skb, and fix up wmem accounting when inet_frag inflates truesize.
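In outline, the flow implemented by the diff below (a sketch; error paths elided):

	/* inet_frag_reasm_prepare(), TX path only */
	if (sk && is_skb_wmem(skb)) {		/* sk must outlive reasm */
		orig_truesize = skb->truesize;
		destructor = skb->destructor;	/* park the sk reference */
	}
	...
	skb->sk = NULL;		/* keep skb_morph() from dropping sk */
	skb->destructor = NULL;
	skb_morph(skb, head);
	...
	head->sk = sk;		/* reattach sk to the reassembled head */
	head->destructor = destructor;
	refcount_add(head->truesize - orig_truesize, &sk->sk_wmem_alloc);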
Fixes: 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn().")
Diagnosed-by: Eric Dumazet <edumazet@google.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Reported-by: syzbot+e5167d7144a62715044c@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240326101845.30836-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Conflicts:
	net/ipv4/inet_fragment.c
[The version does not include commit 2e47eece158a and commit 8672406eb5d7]
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
---
 include/linux/skbuff.h                  |  5 +--
 net/ipv4/inet_fragment.c                | 70 ++++++++++++++++++-----
 net/ipv4/ip_fragment.c                  |  2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c |  2 +-
 4 files changed, 60 insertions(+), 19 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c7d1d8b5f41b..9f38aad04145 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -683,10 +683,7 @@ struct sk_buff {
 		struct list_head	list;
 	};

-	union {
-		struct sock		*sk;
-		int			ip_defrag_offset;
-	};
+	struct sock		*sk;

 	union {
 		ktime_t		tstamp;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1ab592d45afd..5d004c3f4e2b 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -28,6 +28,8 @@
 #include <net/ip.h>
 #include <net/ipv6.h>

+#include <net/tcp_ext.h>
+
 /* Use skb->cb to track consecutive/adjacent fragments coming at
  * the end of the queue. Nodes in the rb-tree queue will
  * contain "runs" of one or more adjacent fragments.
@@ -43,6 +45,7 @@ struct ipfrag_skb_cb {
 	};
 	struct sk_buff		*next_frag;
 	int			frag_run_len;
+	int			ip_defrag_offset;
 };

 #define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
@@ -319,12 +322,12 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
	 */
	if (!last)
		fragrun_create(q, skb);  /* First fragment. */
-	else if (last->ip_defrag_offset + last->len < end) {
+	else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) {
		/* This is the common case: skb goes to the end. */
		/* Detect and discard overlaps. */
-		if (offset < last->ip_defrag_offset + last->len)
+		if (offset < FRAG_CB(last)->ip_defrag_offset + last->len)
			return IPFRAG_OVERLAP;
-		if (offset == last->ip_defrag_offset + last->len)
+		if (offset == FRAG_CB(last)->ip_defrag_offset + last->len)
			fragrun_append_to_last(q, skb);
		else
			fragrun_create(q, skb);
@@ -341,13 +344,13 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,

			parent = *rbn;
			curr = rb_to_skb(parent);
-			curr_run_end = curr->ip_defrag_offset +
+			curr_run_end = FRAG_CB(curr)->ip_defrag_offset +
					FRAG_CB(curr)->frag_run_len;
-			if (end <= curr->ip_defrag_offset)
+			if (end <= FRAG_CB(curr)->ip_defrag_offset)
				rbn = &parent->rb_left;
			else if (offset >= curr_run_end)
				rbn = &parent->rb_right;
-			else if (offset >= curr->ip_defrag_offset &&
+			else if (offset >= FRAG_CB(curr)->ip_defrag_offset &&
				 end <= curr_run_end)
				return IPFRAG_DUP;
			else
@@ -361,7 +364,7 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
		rb_insert_color(&skb->rbnode, &q->rb_fragments);
	}

-	skb->ip_defrag_offset = offset;
+	FRAG_CB(skb)->ip_defrag_offset = offset;

	return IPFRAG_OK;
 }
@@ -371,13 +374,28 @@ void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
			      struct sk_buff *parent)
 {
	struct sk_buff *fp, *head = skb_rb_first(&q->rb_fragments);
-	struct sk_buff **nextp;
+	void (*destructor)(struct sk_buff *);
+	unsigned int orig_truesize = 0;
+	struct sk_buff **nextp = NULL;
+	struct sock *sk = skb->sk;
	int delta;

+	if (sk && is_skb_wmem(skb)) {
+		/* TX: skb->sk might have been passed as argument to
+		 * dst->output and must remain valid until tx completes.
+		 *
+		 * Move sk to reassembled skb and fix up wmem accounting.
+		 */
+		orig_truesize = skb->truesize;
+		destructor = skb->destructor;
+	}
+
	if (head != skb) {
		fp = skb_clone(skb, GFP_ATOMIC);
-		if (!fp)
-			return NULL;
+		if (!fp) {
+			head = skb;
+			goto out_restore_sk;
+		}
		FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
		if (RB_EMPTY_NODE(&skb->rbnode))
			FRAG_CB(parent)->next_frag = fp;
@@ -386,6 +404,12 @@ void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
				&q->rb_fragments);
		if (q->fragments_tail == skb)
			q->fragments_tail = fp;
+
+		if (orig_truesize) {
+			/* prevent skb_morph from releasing sk */
+			skb->sk = NULL;
+			skb->destructor = NULL;
+		}
		skb_morph(skb, head);
		FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
		rb_replace_node(&head->rbnode, &skb->rbnode,
@@ -393,13 +417,13 @@ void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
				&q->rb_fragments);
		consume_skb(head);
		head = skb;
	}
-	WARN_ON(head->ip_defrag_offset != 0);
+	WARN_ON(FRAG_CB(head)->ip_defrag_offset != 0);

	delta = -head->truesize;

	/* Head of list must not be cloned. */
	if (skb_unclone(head, GFP_ATOMIC))
-		return NULL;
+		goto out_restore_sk;

	delta += head->truesize;
	if (delta)
@@ -415,7 +439,7 @@ void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,

		clone = alloc_skb(0, GFP_ATOMIC);
		if (!clone)
-			return NULL;
+			goto out_restore_sk;
		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
		skb_frag_list_init(head);
		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
@@ -432,6 +456,21 @@ void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
		nextp = &skb_shinfo(head)->frag_list;
	}

+out_restore_sk:
+	if (orig_truesize) {
+		int ts_delta = head->truesize - orig_truesize;
+
+		/* if this reassembled skb is fragmented later,
+		 * fraglist skbs will get skb->sk assigned from head->sk,
+		 * and each frag skb will be released via sock_wfree.
+		 *
+		 * Update sk_wmem_alloc.
+		 */
+		head->sk = sk;
+		head->destructor = destructor;
+		refcount_add(ts_delta, &sk->sk_wmem_alloc);
+	}
+
	return nextp;
 }
 EXPORT_SYMBOL(inet_frag_reasm_prepare);
@@ -440,6 +479,8 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
			    void *reasm_data, bool try_coalesce)
 {
	struct sk_buff **nextp = (struct sk_buff **)reasm_data;
+	struct sock *sk = is_skb_wmem(head) ? head->sk : NULL;
+	const unsigned int head_truesize = head->truesize;
	struct rb_node *rbn;
	struct sk_buff *fp;
	int sum_truesize;
@@ -501,6 +542,9 @@ void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
	skb_mark_not_on_list(head);
	head->prev = NULL;
	head->tstamp = q->stamp;
+
+	if (sk)
+		refcount_add(sum_truesize - head_truesize, &sk->sk_wmem_alloc);
 }
 EXPORT_SYMBOL(inet_frag_reasm_finish);

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index a4cd5db79a50..8f458d61554e 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -380,6 +380,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
	}

	skb_dst_drop(skb);
+	skb_orphan(skb);
	return -EINPROGRESS;

 insert_error:
@@ -483,7 +484,6 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
	struct ipq *qp;

	__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
-	skb_orphan(skb);

	/* Lookup (or create) queue header */
	qp = ip_find(net, ip_hdr(skb), user, vif);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 247a81467721..ced03ecce475 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -307,6 +307,7 @@ static int nf_ct_frag6_queue(struct frag_queue *fq, struct sk_buff *skb,
	}

	skb_dst_drop(skb);
+	skb_orphan(skb);
	return -EINPROGRESS;

 insert_error:
@@ -473,7 +474,6 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff *skb, u32 user)
	hdr = ipv6_hdr(skb);
	fhdr = (struct frag_hdr *)skb_transport_header(skb);

-	skb_orphan(skb);
	fq = fq_find(net, fhdr->identification, user, hdr,
		     skb->dev ? skb->dev->ifindex : 0);
	if (fq == NULL) {
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9HVTH
CVE: CVE-2024-26921
--------------------------------
Fix the KABI break introduced when backporting the CVE-2024-26921 patch by reverting the modification of struct sk_buff.
Fixes: 18685451fc4e ("inet: inet_defrag: prevent sk release while still in use")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
---
 include/linux/skbuff.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9f38aad04145..c7d1d8b5f41b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -683,7 +683,10 @@ struct sk_buff {
 		struct list_head	list;
 	};

-	struct sock		*sk;
+	union {
+		struct sock		*sk;
+		int			ip_defrag_offset;
+	};

 	union {
 		ktime_t		tstamp;
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/7193
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/E...