From: Daniel Borkmann <daniel@iogearbox.net>
stable inclusion
from linux-4.19.226
commit bd6e97e2b6f59a19894c7032a83f03ad38ede28e
--------------------------------
commit 710ad98c363a66a0cd8526465426c5c5f8377ee0 upstream.
Laurent reported that they have seen a significant number of TCP retransmissions at high throughput from applications residing in network namespaces talking to the outside world via veths. The drops were seen on the qdisc layer (fq_codel, as per systemd default) of the phys device such as ena or virtio_net due to all traffic hitting a _single_ TX queue _despite_ a multi-queue device. (Note that the setup was _not_ using XDP on veths, as the issue is generic.)
More specifically, after edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence"), which made it all the way back to v4.19.184+, skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX queue by default for veths) instead of leaving it at 0.
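For context, the skbuff helpers store the recorded index off by one so that a queue_mapping of 0 can mean "nothing recorded"; paraphrased from include/linux/skbuff.h (simplified, not the verbatim source):

  static inline void skb_record_rx_queue(struct sk_buff *skb, u16 rx_queue)
  {
          skb->queue_mapping = rx_queue + 1;
  }

  static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
  {
          return skb->queue_mapping != 0;
  }

  static inline u16 skb_get_rx_queue(const struct sk_buff *skb)
  {
          return skb->queue_mapping - 1;
  }

Hence recording RX queue 0 on a default veth leaves queue_mapping at 1 rather than the untouched 0.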
This is eventually retained, and callbacks like ena_select_queue() will also pick a single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the traffic is forwarded to that device via the upper stack or other means. Similarly, for others not implementing ndo_select_queue(), if XPS is disabled, netdev_pick_tx() might call into skb_tx_hash() and check for a prior skb_rx_queue_recorded() as well.
In general, it is a _bad_ idea for virtual devices like veth to mess around with queue selection [by default]. Given dev->real_num_tx_queues is 1 by default, the skb->queue_mapping was left untouched, and so prior to edbea9220251 netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device.
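To make the effect concrete, here is a rough sketch of the selection logic, modeled on skb_tx_hash()/netdev_pick_tx(); the helper name pick_tx_queue() is ours and the details are simplified, so treat it as an approximation rather than kernel source:

  /* Approximation only; assumes kernel context (<linux/netdevice.h>). */
  static u16 pick_tx_queue(const struct net_device *dev, struct sk_buff *skb)
  {
          if (skb_rx_queue_recorded(skb)) {
                  /* a recorded RX queue wins; veth pinned all flows here */
                  u16 hash = skb_get_rx_queue(skb);

                  while (unlikely(hash >= dev->real_num_tx_queues))
                          hash -= dev->real_num_tx_queues;
                  return hash;
          }

          /* otherwise flows are spread across queues by flow hash */
          return (u16)reciprocal_scale(skb_get_hash(skb),
                                       dev->real_num_tx_queues);
  }

With queue_mapping pinned by the veth, the first branch always returns the same TX queue regardless of the flow hash.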
Unbreak this and restore the prior behavior by removing the skb_record_rx_queue() call from veth_xmit() altogether.
If the veth peer has an XDP program attached, then it would return the first RX queue index in xdp_md->rx_queue_index (unless configured in a non-default manner). However, this is still better than breaking the generic case.
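For illustration, an XDP program attached to the veth peer sees that index via xdp_md; a minimal, hypothetical example (the program name and trace output are ours):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("xdp")
  int show_rxq(struct xdp_md *ctx)
  {
          /* on a default single-queue veth this reports rxq=0 */
          bpf_printk("rxq=%u", ctx->rx_queue_index);
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";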
Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence")
Fixes: 638264dc9022 ("veth: Support per queue XDP ring")
Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts:
        drivers/net/veth.c
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
---
 drivers/net/veth.c | 2 --
 1 file changed, 2 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 41a00cd76955..749faa6fcd82 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -197,8 +197,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
         if (rxq < rcv->real_num_rx_queues) {
                 rq = &rcv_priv->rq[rxq];
                 rcv_xdp = rcu_access_pointer(rq->xdp_prog);
-                if (rcv_xdp)
-                        skb_record_rx_queue(skb, rxq);
         }
 
         if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
From: Cui GaoSheng <cuigaosheng1@huawei.com>
hulk inclusion
category: bugfix
bugzilla: 186384 https://gitee.com/openeuler/kernel/issues/I4X1AI?from=project-issue
CVE: NA
--------------------------------
This reverts commit 67ab712f2b0e7448052f3b8fdee727fb7e204448.
Signed-off-by: Cui GaoSheng <cuigaosheng1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
---
 kernel/audit.c | 5 -----
 1 file changed, 5 deletions(-)
diff --git a/kernel/audit.c b/kernel/audit.c
index 3de5ebb94559..c5e034fe14bb 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -740,8 +740,6 @@ static int kauditd_send_queue(struct sock *sk, u32 portid,
                 if (!sk) {
                         if (err_hook)
                                 (*err_hook)(skb);
-                        if (queue == &audit_hold_queue)
-                                goto out;
                         continue;
                 }
 
@@ -758,8 +756,6 @@ static int kauditd_send_queue(struct sock *sk, u32 portid,
                                 (*err_hook)(skb);
                         if (rc == -EAGAIN)
                                 rc = 0;
-                        if (queue == &audit_hold_queue)
-                                goto out;
                         /* continue to drain the queue */
                         continue;
                 } else
@@ -771,7 +767,6 @@ static int kauditd_send_queue(struct sock *sk, u32 portid,
                 }
         }
 
-out:
         return (rc >= 0 ? 0 : rc);
 }
 
From: Paul Moore <paul@paul-moore.com>
mainline inclusion
from mainline-v5.17-rc3
commit f26d04331360d42dbd6b58448bd98e4edbfbe1c5
category: bugfix
bugzilla: 186384 https://gitee.com/openeuler/kernel/issues/I4X1AI?from=project-issue
CVE: NA
--------------------------------
When an admin enables audit at early boot via the "audit=1" kernel command line the audit queue behavior is slightly different; the audit subsystem goes to greater lengths to avoid dropping records, which unfortunately can result in problems when the audit daemon is forcibly stopped for an extended period of time.
This patch makes a number of changes designed to improve the audit queuing behavior so that leaving the audit daemon in a stopped state for an extended period does not have a significant impact on the system.
- kauditd_send_queue() is now limited to looping through the passed queue only once per call. This not only prevents the function from looping indefinitely when records are returned to the current queue, it also allows any recovery handling in kauditd_thread() to take place when kauditd_send_queue() returns (see the sketch after this list).
- Transient netlink send errors seen as -EAGAIN now cause the record to be returned to the retry queue instead of going to the hold queue. The intention of the hold queue is to store, perhaps for an extended period of time, the events which led up to the audit daemon going offline. The retry queue remains a temporary queue intended to protect against transient issues between the kernel and the audit daemon.
- The retry queue is now limited by the audit_backlog_limit setting, the same as the other queues. This allows admins to bound the size of all of the audit queues on the system.
- kauditd_rehold_skb() now returns records to the end of the hold queue to ensure ordering is preserved in the face of recent changes to kauditd_send_queue().
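The once-per-call behavior relies on snapshotting the queue tail before draining. A minimal sketch of the pattern, reduced from the diff below (assumes kernel context, <linux/skbuff.h>; the helper name drain_once() is ours):

  /* Drain a queue at most once, even if the error hook puts records
   * back on the same queue: remember the current tail and stop once
   * it has been dequeued.
   */
  static void drain_once(struct sk_buff_head *queue)
  {
          struct sk_buff *skb = NULL;
          struct sk_buff *skb_tail;

          skb_tail = skb_peek_tail(queue);
          while ((skb != skb_tail) && (skb = skb_dequeue(queue))) {
                  /* ... attempt to send skb; a failure hook may requeue
                   * it on this same queue without causing an endless loop ...
                   */
          }
  }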
Cc: stable@vger.kernel.org
Fixes: 5b52330bbfe63 ("audit: fix auditd/kernel connection state tracking")
Fixes: f4b3ee3c85551 ("audit: improve robustness of the audit queue handling")
Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Cui GaoSheng <cuigaosheng1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
---
 kernel/audit.c | 62 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 19 deletions(-)
diff --git a/kernel/audit.c b/kernel/audit.c
index c5e034fe14bb..7dc14a4d9e3c 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -549,20 +549,22 @@ static void kauditd_printk_skb(struct sk_buff *skb)
 /**
  * kauditd_rehold_skb - Handle a audit record send failure in the hold queue
  * @skb: audit record
+ * @error: error code (unused)
  *
  * Description:
  * This should only be used by the kauditd_thread when it fails to flush the
  * hold queue.
  */
-static void kauditd_rehold_skb(struct sk_buff *skb)
+static void kauditd_rehold_skb(struct sk_buff *skb, __always_unused int error)
 {
-        /* put the record back in the queue at the same place */
-        skb_queue_head(&audit_hold_queue, skb);
+        /* put the record back in the queue */
+        skb_queue_tail(&audit_hold_queue, skb);
 }
 
 /**
  * kauditd_hold_skb - Queue an audit record, waiting for auditd
  * @skb: audit record
+ * @error: error code
  *
  * Description:
  * Queue the audit record, waiting for an instance of auditd. When this
@@ -572,19 +574,31 @@ static void kauditd_rehold_skb(struct sk_buff *skb)
  * and queue it, if we have room. If we want to hold on to the record, but we
  * don't have room, record a record lost message.
  */
-static void kauditd_hold_skb(struct sk_buff *skb)
+static void kauditd_hold_skb(struct sk_buff *skb, int error)
 {
         /* at this point it is uncertain if we will ever send this to auditd so
          * try to send the message via printk before we go any further */
         kauditd_printk_skb(skb);
 
         /* can we just silently drop the message? */
-        if (!audit_default) {
-                kfree_skb(skb);
-                return;
+        if (!audit_default)
+                goto drop;
+
+        /* the hold queue is only for when the daemon goes away completely,
+         * not -EAGAIN failures; if we are in a -EAGAIN state requeue the
+         * record on the retry queue unless it's full, in which case drop it
+         */
+        if (error == -EAGAIN) {
+                if (!audit_backlog_limit ||
+                    skb_queue_len(&audit_retry_queue) < audit_backlog_limit) {
+                        skb_queue_tail(&audit_retry_queue, skb);
+                        return;
+                }
+                audit_log_lost("kauditd retry queue overflow");
+                goto drop;
         }
 
-        /* if we have room, queue the message */
+        /* if we have room in the hold queue, queue the message */
         if (!audit_backlog_limit ||
             skb_queue_len(&audit_hold_queue) < audit_backlog_limit) {
                 skb_queue_tail(&audit_hold_queue, skb);
@@ -593,24 +607,32 @@ static void kauditd_hold_skb(struct sk_buff *skb)
 
         /* we have no other options - drop the message */
         audit_log_lost("kauditd hold queue overflow");
+drop:
         kfree_skb(skb);
 }
 
 /**
  * kauditd_retry_skb - Queue an audit record, attempt to send again to auditd
  * @skb: audit record
+ * @error: error code (unused)
  *
  * Description:
  * Not as serious as kauditd_hold_skb() as we still have a connected auditd,
  * but for some reason we are having problems sending it audit records so
  * queue the given record and attempt to resend.
  */
-static void kauditd_retry_skb(struct sk_buff *skb)
+static void kauditd_retry_skb(struct sk_buff *skb, __always_unused int error)
 {
-        /* NOTE: because records should only live in the retry queue for a
-         * short period of time, before either being sent or moved to the hold
-         * queue, we don't currently enforce a limit on this queue */
-        skb_queue_tail(&audit_retry_queue, skb);
+        if (!audit_backlog_limit ||
+            skb_queue_len(&audit_retry_queue) < audit_backlog_limit) {
+                skb_queue_tail(&audit_retry_queue, skb);
+                return;
+        }
+
+        /* we have to drop the record, send it via printk as a last effort */
+        kauditd_printk_skb(skb);
+        audit_log_lost("kauditd retry queue overflow");
+        kfree_skb(skb);
 }
 
 /**
@@ -648,7 +670,7 @@ static void auditd_reset(const struct auditd_connection *ac)
         /* flush the retry queue to the hold queue, but don't touch the main
          * queue since we need to process that normally for multicast */
         while ((skb = skb_dequeue(&audit_retry_queue)))
-                kauditd_hold_skb(skb);
+                kauditd_hold_skb(skb, -ECONNREFUSED);
 }
 
 /**
@@ -722,16 +744,18 @@ static int kauditd_send_queue(struct sock *sk, u32 portid,
                               struct sk_buff_head *queue,
                               unsigned int retry_limit,
                               void (*skb_hook)(struct sk_buff *skb),
-                              void (*err_hook)(struct sk_buff *skb))
+                              void (*err_hook)(struct sk_buff *skb, int error))
 {
         int rc = 0;
-        struct sk_buff *skb;
+        struct sk_buff *skb = NULL;
+        struct sk_buff *skb_tail;
         unsigned int failed = 0;
 
         /* NOTE: kauditd_thread takes care of all our locking, we just use
          * the netlink info passed to us (e.g. sk and portid) */
 
-        while ((skb = skb_dequeue(queue))) {
+        skb_tail = skb_peek_tail(queue);
+        while ((skb != skb_tail) && (skb = skb_dequeue(queue))) {
                 /* call the skb_hook for each skb we touch */
                 if (skb_hook)
                         (*skb_hook)(skb);
@@ -739,7 +763,7 @@ static int kauditd_send_queue(struct sock *sk, u32 portid,
 
                 /* can we send to anyone via unicast? */
                 if (!sk) {
                         if (err_hook)
-                                (*err_hook)(skb);
+                                (*err_hook)(skb, -ECONNREFUSED);
                         continue;
                 }
@@ -753,7 +777,7 @@ static int kauditd_send_queue(struct sock *sk, u32 portid,
                             rc == -ECONNREFUSED || rc == -EPERM) {
                                 sk = NULL;
                                 if (err_hook)
-                                        (*err_hook)(skb);
+                                        (*err_hook)(skb, rc);
                                 if (rc == -EAGAIN)
                                         rc = 0;
                                 /* continue to drain the queue */