Patchset for CVE-2021-47131.
v4: - Resend for PR create failed. v3: - Delete "Reference" head comment. v2: - Add bugfix patches for CVE patch.
Maxim Mikityanskiy (5): net/tls: Replace TLS_RX_SYNC_RUNNING with RCU net/tls: Fix use-after-free after the TLS device goes down and up tls: Fix context leak on tls_device_down net/tls: Remove the context from the list in tls_device_down net/tls: Use RCU API to access tls_ctx->netdev
include/net/tls.h | 13 ++++- net/tls/tls_device.c | 89 ++++++++++++++++++++++++++++------- net/tls/tls_device_fallback.c | 9 +++- net/tls/tls_main.c | 1 + 4 files changed, 93 insertions(+), 19 deletions(-)
From: Maxim Mikityanskiy maximmi@nvidia.com
mainline inclusion from mainline-v5.13-rc5 commit 05fc8b6cbd4f979a6f25759c4a17dd5f657f7ecd category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I990AF CVE: CVE-2021-47131
--------------------------------
RCU synchronization is guaranteed to finish in finite time, unlike a busy loop that polls a flag. This patch is a preparation for the bugfix in the next patch, where the same synchronize_net() call will also be used to sync with the TX datapath.
Signed-off-by: Maxim Mikityanskiy maximmi@nvidia.com Reviewed-by: Tariq Toukan tariqt@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Conflicts: include/net/tls.h net/tls/tls_device.c Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com --- include/net/tls.h | 1 - net/tls/tls_device.c | 9 +++------ 2 files changed, 3 insertions(+), 7 deletions(-)
diff --git a/include/net/tls.h b/include/net/tls.h index e2d84b791e5ca..bf7af80e682f8 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -169,7 +169,6 @@ enum { };
enum tls_context_flags { - TLS_RX_SYNC_RUNNING = 0, /* tls_dev_del was called for the RX side, device state was released, * but tls_ctx->netdev might still be kept, because TX-side driver * resources might not be released yet. Used to prevent the second diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index 228e3ce48d437..a4d3c5462d058 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -571,12 +571,11 @@ static void tls_device_resync_rx(struct tls_context *tls_ctx, { struct net_device *netdev;
- if (WARN_ON(test_and_set_bit(TLS_RX_SYNC_RUNNING, &tls_ctx->flags))) - return; + rcu_read_lock(); netdev = READ_ONCE(tls_ctx->netdev); if (netdev) netdev->tlsdev_ops->tls_dev_resync_rx(netdev, sk, seq, rcd_sn); - clear_bit_unlock(TLS_RX_SYNC_RUNNING, &tls_ctx->flags); + rcu_read_unlock(); }
void handle_device_resync(struct sock *sk, u32 seq, u64 rcd_sn) @@ -991,9 +990,7 @@ static int tls_device_down(struct net_device *netdev) netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_RX); WRITE_ONCE(ctx->netdev, NULL); - smp_mb__before_atomic(); /* pairs with test_and_set_bit() */ - while (test_bit(TLS_RX_SYNC_RUNNING, &ctx->flags)) - usleep_range(10, 200); + synchronize_net(); dev_put(netdev); list_del_init(&ctx->list);
From: Maxim Mikityanskiy maximmi@nvidia.com
mainline inclusion from mainline-v5.13-rc5 commit c55dcdd435aa6c6ad6ccac0a4c636d010ee367a4 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I990AF CVE: CVE-2021-47131
--------------------------------
When a netdev with active TLS offload goes down, tls_device_down is called to stop the offload and tear down the TLS context. However, the socket stays alive, and it still points to the TLS context, which is now deallocated. If a netdev goes up, while the connection is still active, and the data flow resumes after a number of TCP retransmissions, it will lead to a use-after-free of the TLS context.
This commit addresses this bug by keeping the context alive until its normal destruction, and implements the necessary fallbacks, so that the connection can resume in software (non-offloaded) kTLS mode.
On the TX side tls_sw_fallback is used to encrypt all packets. The RX side already has all the necessary fallbacks, because receiving non-decrypted packets is supported. The thing needed on the RX side is to block resync requests, which are normally produced after receiving non-decrypted packets.
The necessary synchronization is implemented for a graceful teardown: first the fallbacks are deployed, then the driver resources are released (it used to be possible to have a tls_dev_resync after tls_dev_del).
A new flag called TLS_RX_DEV_DEGRADED is added to indicate the fallback mode. It's used to skip the RX resync logic completely, as it becomes useless, and some objects may be released (for example, resync_async, which is allocated and freed by the driver).
Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Maxim Mikityanskiy maximmi@nvidia.com Reviewed-by: Tariq Toukan tariqt@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Conflicts: include/net/tls.h net/tls/tls_device.c net/tls/tls_device_fallback.c net/tls/tls_main.c Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com --- include/net/tls.h | 10 +++++++ net/tls/tls_device.c | 54 +++++++++++++++++++++++++++++++---- net/tls/tls_device_fallback.c | 7 +++++ net/tls/tls_main.c | 1 + 4 files changed, 66 insertions(+), 6 deletions(-)
diff --git a/include/net/tls.h b/include/net/tls.h index bf7af80e682f8..83040729a6a62 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -169,6 +169,11 @@ enum { };
enum tls_context_flags { + /* tls_device_down was called after the netdev went down, device state + * was released, and kTLS works in software, even though rx_conf is + * still TLS_HW (needed for transition). + */ + TLS_RX_DEV_DEGRADED = 0, /* tls_dev_del was called for the RX side, device state was released, * but tls_ctx->netdev might still be kept, because TX-side driver * resources might not be released yet. Used to prevent the second @@ -222,6 +227,8 @@ struct tls_context { u16 pending_open_record_frags; int (*push_pending_record)(struct sock *sk, int flags);
+ struct sock *sk; + void (*sk_write_space)(struct sock *sk); void (*sk_destruct)(struct sock *sk); void (*sk_proto_close)(struct sock *sk, long timeout); @@ -335,6 +342,9 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx) struct sk_buff * tls_validate_xmit_skb(struct sock *sk, struct net_device *dev, struct sk_buff *skb); +struct sk_buff * +tls_validate_xmit_skb_sw(struct sock *sk, struct net_device *dev, + struct sk_buff *skb);
static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk) { diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index a4d3c5462d058..70f53b5f444cd 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -48,6 +48,7 @@ static void tls_device_gc_task(struct work_struct *work); static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task); static LIST_HEAD(tls_device_gc_list); static LIST_HEAD(tls_device_list); +static LIST_HEAD(tls_device_down_list); static DEFINE_SPINLOCK(tls_device_lock);
static void tls_device_free_ctx(struct tls_context *ctx) @@ -588,6 +589,8 @@ void handle_device_resync(struct sock *sk, u32 seq, u64 rcd_sn)
if (tls_ctx->rx_conf != TLS_HW) return; + if (unlikely(test_bit(TLS_RX_DEV_DEGRADED, &tls_ctx->flags))) + return;
rx_ctx = tls_offload_ctx_rx(tls_ctx); resync_req = atomic64_read(&rx_ctx->resync_req); @@ -698,7 +701,18 @@ int tls_device_decrypted(struct sock *sk, struct sk_buff *skb)
ctx->sw.decrypted |= is_decrypted;
- /* Return immedeatly if the record is either entirely plaintext or + if (unlikely(test_bit(TLS_RX_DEV_DEGRADED, &tls_ctx->flags))) { + if (likely(is_encrypted || is_decrypted)) + return 0; + + /* After tls_device_down disables the offload, the next SKB will + * likely have initial fragments decrypted, and final ones not + * decrypted. We need to reencrypt that single SKB. + */ + return tls_device_reencrypt(sk, skb); + } + + /* Return immediately if the record is either entirely plaintext or * entirely ciphertext. Otherwise handle reencrypt partially decrypted * record. */ @@ -982,6 +996,26 @@ static int tls_device_down(struct net_device *netdev) spin_unlock_irqrestore(&tls_device_lock, flags);
list_for_each_entry_safe(ctx, tmp, &list, list) { + /* Stop offloaded TX and switch to the fallback. + * tls_is_sk_tx_device_offloaded will return false. + */ + WRITE_ONCE(ctx->sk->sk_validate_xmit_skb, tls_validate_xmit_skb_sw); + + /* Stop the RX and TX resync. + * tls_dev_resync must not be called after tls_dev_del. + */ + WRITE_ONCE(ctx->netdev, NULL); + + /* Start skipping the RX resync logic completely. */ + set_bit(TLS_RX_DEV_DEGRADED, &ctx->flags); + + /* Sync with inflight packets. After this point: + * TX: no non-encrypted packets will be passed to the driver. + * RX: resync requests from the driver will be ignored. + */ + synchronize_net(); + + /* Release the offload context on the driver side. */ if (ctx->tx_conf == TLS_HW) netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_TX); @@ -989,13 +1023,21 @@ static int tls_device_down(struct net_device *netdev) !test_bit(TLS_RX_DEV_CLOSED, &ctx->flags)) netdev->tlsdev_ops->tls_dev_del(netdev, ctx, TLS_OFFLOAD_CTX_DIR_RX); - WRITE_ONCE(ctx->netdev, NULL); - synchronize_net(); + dev_put(netdev); - list_del_init(&ctx->list);
- if (refcount_dec_and_test(&ctx->refcount)) - tls_device_free_ctx(ctx); + /* Move the context to a separate list for two reasons: + * 1. When the context is deallocated, list_del is called. + * 2. It's no longer an offloaded context, so we don't want to + * run offload-specific code on this context. + */ + spin_lock_irqsave(&tls_device_lock, flags); + list_move_tail(&ctx->list, &tls_device_down_list); + spin_unlock_irqrestore(&tls_device_lock, flags); + + /* Device contexts for RX and TX will be freed in on sk_destruct + * by tls_device_free_ctx. rx_conf and tx_conf stay in TLS_HW. + */ }
up_write(&device_offload_lock); diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c index 6cf832891b53e..55f1b7ed80f37 100644 --- a/net/tls/tls_device_fallback.c +++ b/net/tls/tls_device_fallback.c @@ -427,6 +427,13 @@ struct sk_buff *tls_validate_xmit_skb(struct sock *sk, } EXPORT_SYMBOL_GPL(tls_validate_xmit_skb);
+struct sk_buff *tls_validate_xmit_skb_sw(struct sock *sk, + struct net_device *dev, + struct sk_buff *skb) +{ + return tls_sw_fallback(sk, skb); +} + int tls_sw_fallback_init(struct sock *sk, struct tls_offload_context_tx *offload_ctx, struct tls_crypto_info *crypto_info) diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index 19646ef9f6f61..b3eafb85b8e16 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -560,6 +560,7 @@ static struct tls_context *create_ctx(struct sock *sk) ctx->setsockopt = sk->sk_prot->setsockopt; ctx->getsockopt = sk->sk_prot->getsockopt; ctx->sk_proto_close = sk->sk_prot->close; + ctx->sk = sk; return ctx; }
From: Maxim Mikityanskiy maximmi@nvidia.com
mainline inclusion from mainline-v5.18-rc7 commit 3740651bf7e200109dd42d5b2fb22226b26f960a category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I990AF CVE: CVE-2021-47131
--------------------------------
The commit cited below claims to fix a use-after-free condition after tls_device_down. Apparently, the description wasn't fully accurate. The context stayed alive, but ctx->netdev became NULL, and the offload was torn down without a proper fallback, so a bug was present, but a different kind of bug.
Due to misunderstanding of the issue, the original patch dropped the refcount_dec_and_test line for the context to avoid the alleged premature deallocation. That line has to be restored, because it matches the refcount_inc_not_zero from the same function, otherwise the contexts that survived tls_device_down are leaked.
This patch fixes the described issue by restoring refcount_dec_and_test. After this change, there is no leak anymore, and the fallback to software kTLS still works.
Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy maximmi@nvidia.com Reviewed-by: Tariq Toukan tariqt@nvidia.com Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com --- net/tls/tls_device.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index 70f53b5f444cd..524a16fdcadf6 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -1037,7 +1037,10 @@ static int tls_device_down(struct net_device *netdev)
/* Device contexts for RX and TX will be freed in on sk_destruct * by tls_device_free_ctx. rx_conf and tx_conf stay in TLS_HW. + * Now release the ref taken above. */ + if (refcount_dec_and_test(&ctx->refcount)) + tls_device_free_ctx(ctx); }
up_write(&device_offload_lock);
From: Maxim Mikityanskiy maximmi@nvidia.com
mainline inclusion from mainline-v5.19 commit f6336724a4d4220c89a4ec38bca84b03b178b1a3 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I990AF CVE: CVE-2021-47131
--------------------------------
tls_device_down takes a reference on all contexts it's going to move to the degraded state (software fallback). If sk_destruct runs afterwards, it can reduce the reference counter back to 1 and return early without destroying the context. Then tls_device_down will release the reference it took and call tls_device_free_ctx. However, the context will still stay in tls_device_down_list forever. The list will contain an item, memory for which is released, making a memory corruption possible.
Fix the above bug by properly removing the context from all lists before any call to tls_device_free_ctx.
Fixes: 3740651bf7e2 ("tls: Fix context leak on tls_device_down") Signed-off-by: Maxim Mikityanskiy maximmi@nvidia.com Reviewed-by: Tariq Toukan tariqt@nvidia.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com --- net/tls/tls_device.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index 524a16fdcadf6..bce6e76dd7ead 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -1039,8 +1039,13 @@ static int tls_device_down(struct net_device *netdev) * by tls_device_free_ctx. rx_conf and tx_conf stay in TLS_HW. * Now release the ref taken above. */ - if (refcount_dec_and_test(&ctx->refcount)) + if (refcount_dec_and_test(&ctx->refcount)) { + /* sk_destruct ran after tls_device_down took a ref, and + * it returned early. Complete the destruction here. + */ + list_del(&ctx->list); tls_device_free_ctx(ctx); + } }
up_write(&device_offload_lock);
From: Maxim Mikityanskiy maximmi@nvidia.com
mainline inclusion from mainline-v6.0-rc1 commit 94ce3b64c62d4b628cf85cd0d9a370aca8f7e43a category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I990AF CVE: CVE-2021-47131
--------------------------------
Currently, tls_device_down synchronizes with tls_device_resync_rx using RCU, however, the pointer to netdev is stored using WRITE_ONCE and loaded using READ_ONCE.
Although such approach is technically correct (rcu_dereference is essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store NULL), using special RCU helpers for pointers is more valid, as it includes additional checks and might change the implementation transparently to the callers.
Mark the netdev pointer as __rcu and use the correct RCU helpers to access it. For non-concurrent access pass the right conditions that guarantee safe access (locks taken, refcount value). Also use the correct helper in mlx5e, where even READ_ONCE was missing.
The transition to RCU exposes existing issues, fixed by this commit:
1. bond_tls_device_xmit could read netdev twice, and it could become NULL the second time, after the NULL check passed.
2. Drivers shouldn't stop processing the last packet if tls_device_down just set netdev to NULL, before tls_dev_del was called. This prevents a possible packet drop when transitioning to the fallback software mode.
Fixes: 89df6a810470 ("net/bonding: Implement TLS TX device offload") Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy maximmi@nvidia.com Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski kuba@kernel.org Conflicts: drivers/net/bonding/bond_main.c drivers/net/ethernet/chelsio/inline_crypto/ch_ktls/chcr_ktls.c drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c include/net/tls.h net/tls/tls_device.c net/tls/tls_device_fallback.c Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com --- include/net/tls.h | 2 +- net/tls/tls_device.c | 24 +++++++++++++++++------- net/tls/tls_device_fallback.c | 2 +- 3 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/include/net/tls.h b/include/net/tls.h index 83040729a6a62..3a6ecd48ec630 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -207,7 +207,7 @@ struct tls_context { union tls_crypto_context crypto_recv;
struct list_head list; - struct net_device *netdev; + struct net_device __rcu *netdev; refcount_t refcount;
void *priv_ctx_tx; diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index bce6e76dd7ead..378c928c40b7a 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -68,6 +68,7 @@ static void tls_device_free_ctx(struct tls_context *ctx) static void tls_device_gc_task(struct work_struct *work) { struct tls_context *ctx, *tmp; + struct net_device *netdev; unsigned long flags; LIST_HEAD(gc_list);
@@ -76,7 +77,11 @@ static void tls_device_gc_task(struct work_struct *work) spin_unlock_irqrestore(&tls_device_lock, flags);
list_for_each_entry_safe(ctx, tmp, &gc_list, list) { - struct net_device *netdev = ctx->netdev; + /* Safe, because this is the destroy flow, refcount is 0, so + * tls_device_down can't store this field in parallel. + */ + netdev = rcu_dereference_protected(ctx->netdev, + !refcount_read(&ctx->refcount));
if (netdev && ctx->tx_conf == TLS_HW) { netdev->tlsdev_ops->tls_dev_del(netdev, ctx, @@ -96,7 +101,7 @@ static void tls_device_attach(struct tls_context *ctx, struct sock *sk, if (sk->sk_destruct != tls_device_sk_destruct) { refcount_set(&ctx->refcount, 1); dev_hold(netdev); - ctx->netdev = netdev; + RCU_INIT_POINTER(ctx->netdev, netdev); spin_lock_irq(&tls_device_lock); list_add_tail(&ctx->list, &tls_device_list); spin_unlock_irq(&tls_device_lock); @@ -573,7 +578,7 @@ static void tls_device_resync_rx(struct tls_context *tls_ctx, struct net_device *netdev;
rcu_read_lock(); - netdev = READ_ONCE(tls_ctx->netdev); + netdev = rcu_dereference(tls_ctx->netdev); if (netdev) netdev->tlsdev_ops->tls_dev_resync_rx(netdev, sk, seq, rcd_sn); rcu_read_unlock(); @@ -958,7 +963,8 @@ void tls_device_offload_cleanup_rx(struct sock *sk) struct net_device *netdev;
down_read(&device_offload_lock); - netdev = tls_ctx->netdev; + netdev = rcu_dereference_protected(tls_ctx->netdev, + lockdep_is_held(&device_offload_lock)); if (!netdev) goto out;
@@ -967,7 +973,7 @@ void tls_device_offload_cleanup_rx(struct sock *sk)
if (tls_ctx->tx_conf != TLS_HW) { dev_put(netdev); - tls_ctx->netdev = NULL; + rcu_assign_pointer(tls_ctx->netdev, NULL); } else { set_bit(TLS_RX_DEV_CLOSED, &tls_ctx->flags); } @@ -987,7 +993,11 @@ static int tls_device_down(struct net_device *netdev)
spin_lock_irqsave(&tls_device_lock, flags); list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) { - if (ctx->netdev != netdev || + struct net_device *ctx_netdev = + rcu_dereference_protected(ctx->netdev, + lockdep_is_held(&device_offload_lock)); + + if (ctx_netdev != netdev || !refcount_inc_not_zero(&ctx->refcount)) continue;
@@ -1004,7 +1014,7 @@ static int tls_device_down(struct net_device *netdev) /* Stop the RX and TX resync. * tls_dev_resync must not be called after tls_dev_del. */ - WRITE_ONCE(ctx->netdev, NULL); + rcu_assign_pointer(ctx->netdev, NULL);
/* Start skipping the RX resync logic completely. */ set_bit(TLS_RX_DEV_DEGRADED, &ctx->flags); diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c index 55f1b7ed80f37..319aaed05571f 100644 --- a/net/tls/tls_device_fallback.c +++ b/net/tls/tls_device_fallback.c @@ -420,7 +420,7 @@ struct sk_buff *tls_validate_xmit_skb(struct sock *sk, struct net_device *dev, struct sk_buff *skb) { - if (dev == tls_get_ctx(sk)->netdev) + if (dev == rcu_dereference_bh(tls_get_ctx(sk)->netdev)) return skb;
return tls_sw_fallback(sk, skb);
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/5476 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/I...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/5476 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/I...