This patchset fixes the packet stuck problem mentioned in [1].
Patch 1: Add STATE_MISSED flag to fix packet stuck problem.
Patch 2: Fix a tx_action rescheduling problem after STATE_MISSED flag is added in patch 1.
Patch 3: Fix the significantly higher CPU consumption problem when multiple threads are competing on a saturated outgoing device.
V6: Some performance optimization in patch 1 suggested by Jakub and drop NET_XMIT_DROP checking in patch 3.
V5: add patch 3 to fix the problem reported by Michal Kubecek.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED and add patch 2.
[1]. https://lkml.org/lkml/2019/10/9/42
Yunsheng Lin (3):
  net: sched: fix packet stuck problem for lockless qdisc
  net: sched: fix endless tx action reschedule during deactivation
  net: sched: fix tx action reschedule issue with stopped queue
 include/net/pkt_sched.h   |  7 +------
 include/net/sch_generic.h | 33 ++++++++++++++++++++++++++++++++-
 net/core/dev.c            | 29 ++++++++++++++++++++++++-----
 net/sched/sch_generic.c   | 31 +++++++++++++++++++++++++++++--
 4 files changed, 86 insertions(+), 14 deletions(-)
Lockless qdisc has the following concurrency problem:

   cpu0                      cpu1
     .                         .
 q->enqueue                    .
     .                         .
 qdisc_run_begin()             .
     .                         .
 dequeue_skb()                 .
     .                         .
 sch_direct_xmit()             .
     .                         .
     .                    q->enqueue
     .                 qdisc_run_begin()
     .               return and do nothing
     .                         .
 qdisc_run_end()               .
cpu1 enqueues a skb without calling __qdisc_run() because cpu0 has not released the lock yet and spin_trylock() returns false for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb enqueued by cpu1 when calling dequeue_skb() because cpu1 may enqueue the skb after cpu0 has called dequeue_skb() and before cpu0 calls qdisc_run_end().
Lockless qdisc has another concurrency problem when tx_action is involved:
cpu0(serving tx_action)       cpu1                 cpu2
          .                     .                    .
          .                q->enqueue                .
          .              qdisc_run_begin()           .
          .                dequeue_skb()             .
          .                     .                q->enqueue
          .                     .                    .
          .              sch_direct_xmit()           .
          .                     .             qdisc_run_begin()
          .                     .           return and do nothing
          .                     .                    .
clear __QDISC_STATE_SCHED       .                    .
   qdisc_run_begin()            .                    .
 return and do nothing          .                    .
          .                     .                    .
          .               qdisc_run_end()            .
This patch fixes the above data race by:
1. If the first spin_trylock() returns false and STATE_MISSED is not set, set STATE_MISSED and retry another spin_trylock() in case the other CPU may not see STATE_MISSED after it releases the lock.
2. Reschedule if STATE_MISSED is set after the lock is released at the end of qdisc_run_end().
For the tx_action case, STATE_MISSED is also set when cpu1 is at the end of qdisc_run_end(), so tx_action will be rescheduled again to dequeue the skb enqueued by cpu2.
Clear STATE_MISSED before retrying the dequeue when the dequeue returns NULL, in order to reduce the overhead of the second spin_trylock() and the __netif_schedule() call.
The performance impact of this patch, tested using pktgen and dummy netdev with pfifo_fast qdisc attached:
threads  without+this_patch  with+this_patch   delta
   1          2.61Mpps          2.60Mpps       -0.3%
   2          3.97Mpps          3.82Mpps       -3.7%
   4          5.62Mpps          5.59Mpps       -0.5%
   8          2.78Mpps          2.77Mpps       -0.3%
  16          2.22Mpps          2.22Mpps       -0.0%
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Juergen Gross <jgross@suse.com>
---
V6: Check MISSED after the first trylock, and remove the atomic test and set for performance's sake as suggested by Jakub.
V4: Change STATE_NEED_RESCHEDULE to STATE_MISSED mirroring NAPI's NAPIF_STATE_MISSED, and add Juergen's "Tested-by" tag, as there is only renaming and typo fixing between V4 and V3.
V3: Fix a compile error and a few comment typos, remove the __QDISC_STATE_DEACTIVATED checking, and update the performance data.
V2: Avoid the overhead of fixing the data race as much as possible.
---
 include/net/sch_generic.h | 33 ++++++++++++++++++++++++++++++++-
 net/sched/sch_generic.c   | 19 +++++++++++++++++++
 2 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..e5df394 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_MISSED,
 };
 
 struct qdisc_size_table {
@@ -159,8 +160,33 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
+		if (spin_trylock(&qdisc->seqlock))
+			goto nolock_empty;
+
+		/* If the MISSED flag is set, it means other thread has
+		 * set the MISSED flag before second spin_trylock(), so
+		 * we can return false here to avoid multi cpus doing
+		 * the set_bit() and second spin_trylock() concurrently.
+		 */
+		if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+			return false;
+
+		/* Set the MISSED flag before the second spin_trylock(),
+		 * if the second spin_trylock() return false, it means
+		 * other cpu holding the lock will do dequeuing for us
+		 * or it will see the MISSED flag set after releasing
+		 * lock and reschedule the net_tx_action() to do the
+		 * dequeuing.
+		 */
+		set_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Retry again in case other CPU may not see the new flag
+		 * after it releases the lock at the end of qdisc_run_end().
+		 */
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
+
+nolock_empty:
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
@@ -176,8 +202,13 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
 		spin_unlock(&qdisc->seqlock);
+
+		if (unlikely(test_bit(__QDISC_STATE_MISSED,
+				      &qdisc->state)))
+			__netif_schedule(qdisc);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..795d986 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 	struct sk_buff *skb = NULL;
+	bool need_retry = true;
 	int band;
 
+retry:
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
@@ -652,6 +654,23 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	}
 	if (likely(skb)) {
 		qdisc_update_stats_at_dequeue(qdisc, skb);
+	} else if (need_retry &&
+		   test_bit(__QDISC_STATE_MISSED, &qdisc->state)) {
+		/* Delay clearing the STATE_MISSED here to reduce
+		 * the overhead of the second spin_trylock() in
+		 * qdisc_run_begin() and __netif_schedule() calling
+		 * in qdisc_run_end().
+		 */
+		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
+
+		/* Make sure dequeuing happens after clearing
+		 * STATE_MISSED.
+		 */
+		smp_mb__after_atomic();
+
+		need_retry = false;
+
+		goto retry;
 	} else {
 		WRITE_ONCE(qdisc->empty, true);
 	}
Currently qdisc_run() checks STATE_DEACTIVATED of a lockless qdisc before calling __qdisc_run(), which ultimately clears STATE_MISSED once all the skbs are dequeued. If STATE_DEACTIVATED is set before clearing STATE_MISSED, there may be endless rescheduling of net_tx_action() at the end of qdisc_run_end(), see below:
CPU0(net_tx_action)       CPU1(__dev_xmit_skb)      CPU2(dev_deactivate)
          .                        .                         .
          .                 set STATE_MISSED                 .
          .                __netif_schedule()                .
          .                        .               set STATE_DEACTIVATED
          .                        .                    qdisc_reset()
          .                        .                         .
          .<---------------        .                 synchronize_net()
clear __QDISC_STATE_SCHED  |       .                         .
          .                |       .                         .
          .                |       .                         .
          .                |       .               --------->.
          .                |       .               |         .
  test STATE_DEACTIVATED   |       .               | some_qdisc_is_busy()
__qdisc_run() *not* called |       .               |-----return *true*
          .                |       .                         .
   test STATE_MISSED       |       .                         .
 __netif_schedule()--------|       .                         .
          .                        .                         .
          .                        .                         .
__qdisc_run() is not called by net_tx_action() on CPU0 because CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(). Since STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule() is called endlessly at the end of qdisc_run_end(), causing the endless tx action rescheduling problem.
qdisc_run() called by net_tx_action() runs in softirq context, which should have the same semantics as qdisc_run() called by __dev_xmit_skb() under rcu_read_lock_bh(). And since there is a synchronize_net() between the STATE_DEACTIVATED flag being set and qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can safely bail out for a deactivated lockless qdisc in net_tx_action(), and qdisc_reset() will reset all skbs not dequeued yet.
So add rcu_read_lock() explicitly to protect qdisc_run() and do the STATE_DEACTIVATED checking in net_tx_action() before calling qdisc_run_begin(). Another option is to do the checking in qdisc_run_end(), but that would add unnecessary overhead for the non-tx_action case, because __dev_queue_xmit() will not see a qdisc with STATE_DEACTIVATED after synchronize_net(); a qdisc with STATE_DEACTIVATED can only be seen by net_tx_action() because of __netif_schedule().
The STATE_DEACTIVATED checking in qdisc_run() is there to avoid the race between net_tx_action() and qdisc_reset(), see commit d518d2ed8640 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc"). As the bailout added above for a deactivated lockless qdisc in net_tx_action() provides better protection for that race without calling qdisc_run() at all, remove the STATE_DEACTIVATED checking in qdisc_run().
After qdisc_reset(), there is no skb in qdisc to be dequeued, so clear the STATE_MISSED in dev_reset_queue() too.
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/pkt_sched.h |  7 +------
 net/core/dev.c          | 26 ++++++++++++++++++++++----
 net/sched/sch_generic.c |  4 +++-
 3 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index f5c1bee..6d7b12c 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -128,12 +128,7 @@ void __qdisc_run(struct Qdisc *q);
 static inline void qdisc_run(struct Qdisc *q)
 {
 	if (qdisc_run_begin(q)) {
-		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
-		 * to avoid racing with dev_qdisc_reset()
-		 */
-		if (!(q->flags & TCQ_F_NOLOCK) ||
-		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			__qdisc_run(q);
+		__qdisc_run(q);
 		qdisc_run_end(q);
 	}
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 222b1d3..d596cd7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5025,25 +5025,43 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 		sd->output_queue_tailp = &sd->output_queue;
 		local_irq_enable();
 
+		rcu_read_lock();
+
 		while (head) {
 			struct Qdisc *q = head;
 			spinlock_t *root_lock = NULL;
 
 			head = head->next_sched;
 
-			if (!(q->flags & TCQ_F_NOLOCK)) {
-				root_lock = qdisc_lock(q);
-				spin_lock(root_lock);
-			}
 			/* We need to make sure head->next_sched is read
 			 * before clearing __QDISC_STATE_SCHED
 			 */
 			smp_mb__before_atomic();
+
+			if (!(q->flags & TCQ_F_NOLOCK)) {
+				root_lock = qdisc_lock(q);
+				spin_lock(root_lock);
+			} else if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,
+						     &q->state))) {
+				/* There is a synchronize_net() between
+				 * STATE_DEACTIVATED flag being set and
+				 * qdisc_reset()/some_qdisc_is_busy() in
+				 * dev_deactivate(), so we can safely bail out
+				 * early here to avoid data race between
+				 * qdisc_deactivate() and some_qdisc_is_busy()
+				 * for lockless qdisc.
+				 */
+				clear_bit(__QDISC_STATE_SCHED, &q->state);
+				continue;
+			}
+
 			clear_bit(__QDISC_STATE_SCHED, &q->state);
 			qdisc_run(q);
 			if (root_lock)
 				spin_unlock(root_lock);
 		}
+
+		rcu_read_unlock();
 	}
 
 	xfrm_dev_backlog(sd);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 795d986..d86c4cc 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1177,8 +1177,10 @@ static void dev_reset_queue(struct net_device *dev,
 
 	qdisc_reset(qdisc);
 
 	spin_unlock_bh(qdisc_lock(qdisc));
-	if (nolock)
+	if (nolock) {
+		clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
 		spin_unlock_bh(&qdisc->seqlock);
+	}
 }
static bool some_qdisc_is_busy(struct net_device *dev)
The netdev queue might be stopped when the byte queue limit has been reached or the tx hw ring is full. In that case net_tx_action() may still be rescheduled endlessly if STATE_MISSED is set, which consumes a lot of CPU without dequeuing or transmitting any skb because the netdev queue is stopped, see qdisc_run_end().
This patch fixes it by checking the netdev queue state before calling qdisc_run() and clearing STATE_MISSED if the netdev queue is stopped during qdisc_run(). net_tx_action() is rescheduled again when the netdev queue is restarted, see netif_tx_wake_queue().
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V6: Drop the NET_XMIT_DROP checking, as it is not really relevant to this patch and it may cause a performance regression with multiple pktgen threads on a dummy netdev with the pfifo_fast qdisc.
---
 net/core/dev.c          | 3 ++-
 net/sched/sch_generic.c | 8 +++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index d596cd7..ef8cf76 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3853,7 +3853,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		if (likely(!netif_xmit_frozen_or_stopped(txq)))
+			qdisc_run(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d86c4cc..37403c1 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -74,6 +74,7 @@ static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q)
 		}
 	} else {
 		skb = SKB_XOFF_MAGIC;
+		clear_bit(__QDISC_STATE_MISSED, &q->state);
 	}
 }
 
@@ -242,6 +243,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			}
 		} else {
 			skb = NULL;
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 		}
 		if (lock)
 			spin_unlock(lock);
@@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	*validate = true;
 
 	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
+	    netif_xmit_frozen_or_stopped(txq)) {
+		clear_bit(__QDISC_STATE_MISSED, &q->state);
 		return skb;
+	}
 
 	skb = qdisc_dequeue_skb_bad_txq(q);
 	if (unlikely(skb)) {
@@ -311,6 +315,8 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		HARD_TX_LOCK(dev, txq, smp_processor_id());
 		if (!netif_xmit_frozen_or_stopped(txq))
 			skb = dev_hard_start_xmit(skb, dev, txq, &ret);
+		else
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 
 		HARD_TX_UNLOCK(dev, txq);
 	} else {
On Mon, 10 May 2021 09:42:36 +0800 Yunsheng Lin wrote:
The netdev qeueue might be stopped when byte queue limit has reached or tx hw ring is full, net_tx_action() may still be rescheduled endlessly if STATE_MISSED is set, which consumes a lot of cpu without dequeuing and transmiting any skb because the netdev queue is stopped, see qdisc_run_end().
This patch fixes it by checking the netdev queue state before calling qdisc_run() and clearing STATE_MISSED if netdev queue is stopped during qdisc_run(), the net_tx_action() is recheduled again when netdev qeueue is restarted, see netif_tx_wake_queue().
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Reported-by: Michal Kubecek mkubecek@suse.cz Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
Patches 1 and 2 look good to me but this one I'm not 100% sure.
@@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	*validate = true;
 
 	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
+	    netif_xmit_frozen_or_stopped(txq)) {
+		clear_bit(__QDISC_STATE_MISSED, &q->state);
 		return skb;
+	}
The queues are woken asynchronously without holding any locks via netif_tx_wake_queue(). Theoretically we can have a situation where:
  CPU 0                              CPU 1
    .                                  .
 dequeue_skb()                         .
 netif_xmit_frozen..()  # true         .
    .                                [IRQ]
    .                         netif_tx_wake_queue()
    .                           <end of IRQ>
    .                         netif_tx_action()
    .                            set MISSED
 clear MISSED
 return NULL
 ret from qdisc_restart()
 ret from __qdisc_run()
 qdisc_run_end() -> MISSED not set
On 2021/5/11 12:22, Jakub Kicinski wrote:
On Mon, 10 May 2021 09:42:36 +0800 Yunsheng Lin wrote:
The netdev qeueue might be stopped when byte queue limit has reached or tx hw ring is full, net_tx_action() may still be rescheduled endlessly if STATE_MISSED is set, which consumes a lot of cpu without dequeuing and transmiting any skb because the netdev queue is stopped, see qdisc_run_end().
This patch fixes it by checking the netdev queue state before calling qdisc_run() and clearing STATE_MISSED if netdev queue is stopped during qdisc_run(), the net_tx_action() is recheduled again when netdev qeueue is restarted, see netif_tx_wake_queue().
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Reported-by: Michal Kubecek mkubecek@suse.cz Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
Patches 1 and 2 look good to me but this one I'm not 100% sure.
@@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	*validate = true;
 
 	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
+	    netif_xmit_frozen_or_stopped(txq)) {
+		clear_bit(__QDISC_STATE_MISSED, &q->state);
 		return skb;
+	}
The queues are woken asynchronously without holding any locks via netif_tx_wake_queue(). Theoretically we can have a situation where:
  CPU 0                              CPU 1
    .                                  .
 dequeue_skb()                         .
 netif_xmit_frozen..()  # true         .
    .                                [IRQ]
    .                         netif_tx_wake_queue()
    .                           <end of IRQ>
    .                         netif_tx_action()
    .                            set MISSED
 clear MISSED
 return NULL
 ret from qdisc_restart()
 ret from __qdisc_run()
 qdisc_run_end() -> MISSED not set
Yes, the above does seem to have that data race.
As I understand it, there are two ways to fix the above data race:
1. Do not clear STATE_MISSED for the netif_xmit_frozen_or_stopped() case, and just check netif_xmit_frozen_or_stopped() before calling __netif_schedule() at the end of qdisc_run_end(). This seems to only work for a qdisc with the TCQ_F_ONETXQUEUE flag, because it seems we can only check netif_xmit_frozen_or_stopped() with q->dev_queue, and I am not sure which netdev queue q->dev_queue is pointing to when the qdisc does not have the TCQ_F_ONETXQUEUE flag set.
2. Clear STATE_MISSED for the netif_xmit_frozen_or_stopped() case as this patch does, and protect __netif_schedule() with q->seqlock in netif_tx_wake_queue(), which might bring unnecessary overhead for the non-stopped queue case.
Any better idea?
On 2021/5/11 17:04, Yunsheng Lin wrote:
On 2021/5/11 12:22, Jakub Kicinski wrote:
On Mon, 10 May 2021 09:42:36 +0800 Yunsheng Lin wrote:
The netdev qeueue might be stopped when byte queue limit has reached or tx hw ring is full, net_tx_action() may still be rescheduled endlessly if STATE_MISSED is set, which consumes a lot of cpu without dequeuing and transmiting any skb because the netdev queue is stopped, see qdisc_run_end().
This patch fixes it by checking the netdev queue state before calling qdisc_run() and clearing STATE_MISSED if netdev queue is stopped during qdisc_run(), the net_tx_action() is recheduled again when netdev qeueue is restarted, see netif_tx_wake_queue().
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Reported-by: Michal Kubecek mkubecek@suse.cz Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
Patches 1 and 2 look good to me but this one I'm not 100% sure.
@@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	*validate = true;
 
 	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
+	    netif_xmit_frozen_or_stopped(txq)) {
+		clear_bit(__QDISC_STATE_MISSED, &q->state);
 		return skb;
+	}
The queues are woken asynchronously without holding any locks via netif_tx_wake_queue(). Theoretically we can have a situation where:
  CPU 0                              CPU 1
    .                                  .
 dequeue_skb()                         .
 netif_xmit_frozen..()  # true         .
    .                                [IRQ]
    .                         netif_tx_wake_queue()
    .                           <end of IRQ>
    .                         netif_tx_action()
    .                            set MISSED
 clear MISSED
 return NULL
 ret from qdisc_restart()
 ret from __qdisc_run()
 qdisc_run_end() -> MISSED not set
Yes, the above does seems to have the above data race.
As my understanding, there is two ways to fix the above data race:
do not clear the STATE_MISSED for netif_xmit_frozen_or_stopped() case, just check the netif_xmit_frozen_or_stopped() before calling __netif_schedule() at the end of qdisc_run_end(). This seems to only work with qdisc with TCQ_F_ONETXQUEUE flag because it seems we can only check the netif_xmit_frozen_or_stopped() with q->dev_queue, I am not sure q->dev_queue is pointint to which netdev queue when qdisc is not set with TCQ_F_ONETXQUEUE flag.
clearing the STATE_MISSED for netif_xmit_frozen_or_stopped() case as this patch does, and protect the __netif_schedule() with q->seqlock for netif_tx_wake_queue(), which might bring unnecessary overhead for non-stopped queue case
Any better idea?
3. Or check the netif_xmit_frozen_or_stopped() again after clearing STATE_MISSED, like below:
	if (netif_xmit_frozen_or_stopped(txq)) {
		clear_bit(__QDISC_STATE_MISSED, &q->state);

		/* Make sure the below netif_xmit_frozen_or_stopped()
		 * checking happens after clearing STATE_MISSED.
		 */
		smp_mb__after_atomic();

		/* Checking netif_xmit_frozen_or_stopped() again to
		 * make sure __QDISC_STATE_MISSED is set if the
		 * __QDISC_STATE_MISSED set by netif_tx_wake_queue()'s
		 * rescheduling of net_tx_action() is cleared by the
		 * above clear_bit().
		 */
		if (!netif_xmit_frozen_or_stopped(txq))
			set_bit(__QDISC_STATE_MISSED, &q->state);
	}
It is kind of ugly, but it does seem to fix the above data race too. And it seems like a common pattern to deal with the concurrency between xmit and NAPI polling, as below:
https://elixir.bootlin.com/linux/v5.12-rc2/source/drivers/net/ethernet/hisil...
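For reference, that idiom looks roughly like the following. This is only a simplified sketch using hypothetical foo_* driver names, not the actual hns3 code; the xmit side stops the queue, issues a barrier, and re-checks, pairing with a barrier on the completion side:

/* Hypothetical ring abstraction, only for this sketch. */
struct foo_ring {
	u16 index;			/* tx queue index, hypothetical */
	/* ... descriptor ring bookkeeping ... */
};

static int foo_ring_space(const struct foo_ring *ring);	/* free descriptors, hypothetical */
static struct foo_ring *foo_skb_ring(struct net_device *dev,
				     struct sk_buff *skb);	/* hypothetical */

static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct foo_ring *ring = foo_skb_ring(dev, skb);

	if (unlikely(foo_ring_space(ring) < MAX_SKB_FRAGS + 1)) {
		netif_stop_subqueue(dev, ring->index);

		/* Order the queue stop against the space re-check below,
		 * pairing with the barrier on the completion path.
		 */
		smp_mb();

		/* A completion may have freed descriptors between the
		 * space check and the stop, so its wake-up was lost;
		 * re-check and wake the queue ourselves.
		 */
		if (foo_ring_space(ring) < MAX_SKB_FRAGS + 1)
			return NETDEV_TX_BUSY;

		netif_start_subqueue(dev, ring->index);
	}

	/* ... map the skb and post it to the hardware ring ... */
	return NETDEV_TX_OK;
}

/* Completion (NAPI) side: reclaim descriptors, then wake the queue if it
 * was stopped and there is room again.
 */
static void foo_clean_tx_ring(struct net_device *dev, struct foo_ring *ring)
{
	/* ... reclaim completed descriptors ... */

	smp_mb();	/* pairs with the barrier on the xmit path */

	if (netif_tx_queue_stopped(netdev_get_tx_queue(dev, ring->index)) &&
	    foo_ring_space(ring) >= MAX_SKB_FRAGS + 1)
		netif_wake_subqueue(dev, ring->index);
}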
On Tue, 11 May 2021 20:13:56 +0800 Yunsheng Lin wrote:
On 2021/5/11 17:04, Yunsheng Lin wrote:
On 2021/5/11 12:22, Jakub Kicinski wrote:
The queues are woken asynchronously without holding any locks via netif_tx_wake_queue(). Theoretically we can have a situation where:
  CPU 0                              CPU 1
    .                                  .
 dequeue_skb()                         .
 netif_xmit_frozen..()  # true         .
    .                                [IRQ]
    .                         netif_tx_wake_queue()
    .                           <end of IRQ>
    .                         netif_tx_action()
    .                            set MISSED
 clear MISSED
 return NULL
 ret from qdisc_restart()
 ret from __qdisc_run()
 qdisc_run_end()
[...]
Yes, the above does seems to have the above data race.
As my understanding, there is two ways to fix the above data race:
- do not clear the STATE_MISSED for netif_xmit_frozen_or_stopped() case, just check the netif_xmit_frozen_or_stopped() before calling __netif_schedule() at the end of qdisc_run_end(). This seems to only work with qdisc with TCQ_F_ONETXQUEUE flag because it seems we can only check the netif_xmit_frozen_or_stopped() with q->dev_queue, I am not sure q->dev_queue is pointint to which netdev queue when qdisc is not set with TCQ_F_ONETXQUEUE flag.
Isn't the case where we have a NOLOCK qdisc without TCQ_F_ONETXQUEUE rather unexpected? It'd have to be a single pfifo on multi-queue netdev, right? Sounds not worth optimizing for. How about:
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	write_seqcount_end(&qdisc->running);
	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		if (unlikely(test_bit(__QDISC_STATE_MISSED,
				      &qdisc->state))) {
			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
			if (!(q->flags & TCQ_F_ONETXQUEUE) ||
			    !netif_xmit_frozen_or_stopped(q->dev_queue))
				__netif_schedule(qdisc);
		}
	}
}
For the strange non-ONETXQUEUE case we'd have an occasional unnecessary net_tx_action, but no infinite loop possible.
- clearing the STATE_MISSED for netif_xmit_frozen_or_stopped() case as this patch does, and protect the __netif_schedule() with q->seqlock for netif_tx_wake_queue(), which might bring unnecessary overhead for non-stopped queue case
Any better idea?
Or check the netif_xmit_frozen_or_stopped() again after clearing STATE_MISSED, like below:
	if (netif_xmit_frozen_or_stopped(txq)) {
		clear_bit(__QDISC_STATE_MISSED, &q->state);

		/* Make sure the below netif_xmit_frozen_or_stopped()
		 * checking happens after clearing STATE_MISSED.
		 */
		smp_mb__after_atomic();

		/* Checking netif_xmit_frozen_or_stopped() again to
		 * make sure __QDISC_STATE_MISSED is set if the
		 * __QDISC_STATE_MISSED set by netif_tx_wake_queue()'s
		 * rescheduling of net_tx_action() is cleared by the
		 * above clear_bit().
		 */
		if (!netif_xmit_frozen_or_stopped(txq))
			set_bit(__QDISC_STATE_MISSED, &q->state);
	}
It is kind of ugly, but it does seem to fix the above data race too. And it seems like a common pattern to deal with the concurrency between xmit and NAPI polling, as below:
https://elixir.bootlin.com/linux/v5.12-rc2/source/drivers/net/ethernet/hisil...
This is indeed the idiomatic way of dealing with Tx queue stopping race, but it's a bit of code to sprinkle around. My vote would be option 1.
On 2021/5/12 7:30, Jakub Kicinski wrote:
On Tue, 11 May 2021 20:13:56 +0800 Yunsheng Lin wrote:
On 2021/5/11 17:04, Yunsheng Lin wrote:
On 2021/5/11 12:22, Jakub Kicinski wrote:
The queues are woken asynchronously without holding any locks via netif_tx_wake_queue(). Theoretically we can have a situation where:
  CPU 0                              CPU 1
    .                                  .
 dequeue_skb()                         .
 netif_xmit_frozen..()  # true         .
    .                                [IRQ]
    .                         netif_tx_wake_queue()
    .                           <end of IRQ>
    .                         netif_tx_action()
    .                            set MISSED
 clear MISSED
 return NULL
 ret from qdisc_restart()
 ret from __qdisc_run()
 qdisc_run_end()
[...]
Yes, the above does seems to have the above data race.
As my understanding, there is two ways to fix the above data race:
- do not clear the STATE_MISSED for netif_xmit_frozen_or_stopped() case, just check the netif_xmit_frozen_or_stopped() before calling __netif_schedule() at the end of qdisc_run_end(). This seems to only work with qdisc with TCQ_F_ONETXQUEUE flag because it seems we can only check the netif_xmit_frozen_or_stopped() with q->dev_queue, I am not sure q->dev_queue is pointint to which netdev queue when qdisc is not set with TCQ_F_ONETXQUEUE flag.
Isn't the case where we have a NOLOCK qdisc without TCQ_F_ONETXQUEUE rather unexpected? It'd have to be a single pfifo on multi-queue netdev, right? Sounds not worth optimizing for. How about:
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	write_seqcount_end(&qdisc->running);
	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		if (unlikely(test_bit(__QDISC_STATE_MISSED,
				      &qdisc->state))) {
			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
			if (!(q->flags & TCQ_F_ONETXQUEUE) ||
			    !netif_xmit_frozen_or_stopped(q->dev_queue))
				__netif_schedule(qdisc);
		}
	}
}
For the strange non-ONETXQUEUE case we'd have an occasional unnecessary net_tx_action, but no infinite loop possible.
- clearing the STATE_MISSED for netif_xmit_frozen_or_stopped() case as this patch does, and protect the __netif_schedule() with q->seqlock for netif_tx_wake_queue(), which might bring unnecessary overhead for non-stopped queue case
Any better idea?
Or check the netif_xmit_frozen_or_stopped() again after clearing STATE_MISSED, like below:
	if (netif_xmit_frozen_or_stopped(txq)) {
		clear_bit(__QDISC_STATE_MISSED, &q->state);

		/* Make sure the below netif_xmit_frozen_or_stopped()
		 * checking happens after clearing STATE_MISSED.
		 */
		smp_mb__after_atomic();

		/* Checking netif_xmit_frozen_or_stopped() again to
		 * make sure __QDISC_STATE_MISSED is set if the
		 * __QDISC_STATE_MISSED set by netif_tx_wake_queue()'s
		 * rescheduling of net_tx_action() is cleared by the
		 * above clear_bit().
		 */
		if (!netif_xmit_frozen_or_stopped(txq))
			set_bit(__QDISC_STATE_MISSED, &q->state);
	}
It is kind of ugly, but it does seem to fix the above data race too. And it seems like a common pattern to deal with the concurrency between xmit and NAPI polling, as below:
https://elixir.bootlin.com/linux/v5.12-rc2/source/drivers/net/ethernet/hisil...
This is indeed the idiomatic way of dealing with Tx queue stopping race, but it's a bit of code to sprinkle around. My vote would be option 1.
I have done some performance testing to see which is better, using pktgen and a dummy netdev with the pfifo_fast qdisc attached:
unit: Mpps
threads    V6      V6 + option 1    V6 + option 3
   1      2.60         2.54             2.60
   2      3.86         3.84             3.84
   4      5.56         5.50             5.51
   8      2.79         2.77             2.77
  16      2.23         2.24             2.22
So it seems the netif_xmit_frozen_or_stopped() checking overhead for a non-stopped queue is noticeable with 1 pktgen thread.
And the performance increase for V6 + option 1 with 16 pktgen threads is because of the "clear_bit(__QDISC_STATE_MISSED, &qdisc->state)" at the end of qdisc_run_end(), which may avoid another round of dequeuing in pfifo_fast_dequeue(). With that "clear_bit(__QDISC_STATE_MISSED, &qdisc->state)" added to V6 + option 3, the data for 16 pktgen threads also goes up to 2.24Mpps.
So it seems V6 + option 3 with "clear_bit(__QDISC_STATE_MISSED, &qdisc->state)" at the end of qdisc_run_end() is better?
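For clarity, the combination being suggested here (option 3's re-check in dequeue_skb() plus clearing the flag before rescheduling) would make qdisc_run_end() look roughly like the following; this is only a sketch of the idea under discussion, not a final patch:

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	write_seqcount_end(&qdisc->running);

	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		if (unlikely(test_bit(__QDISC_STATE_MISSED,
				      &qdisc->state))) {
			/* Clear MISSED before rescheduling so the retry
			 * logic in pfifo_fast_dequeue() does not see a
			 * stale flag and do an extra dequeue round.
			 */
			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);
			__netif_schedule(qdisc);
		}
	}
}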
On Wed, 12 May 2021 11:34:55 +0800 Yunsheng Lin wrote:
This is indeed the idiomatic way of dealing with Tx queue stopping race, but it's a bit of code to sprinkle around. My vote would be option 1.
I had done some performance testing to see which is better, tested using pktgen and dummy netdev with pfifo_fast qdisc attached:
unit: Mpps
threads    V6      V6 + option 1    V6 + option 3
   1      2.60         2.54             2.60
   2      3.86         3.84             3.84
   4      5.56         5.50             5.51
   8      2.79         2.77             2.77
  16      2.23         2.24             2.22
So it seems the netif_xmit_frozen_or_stopped checking overhead for non-stopped queue is noticable for 1 pktgen thread.
And the performance increase for V6 + option 1 with 16 pktgen threads is because of "clear_bit(__QDISC_STATE_MISSED, &qdisc->state)" at the end of qdisc_run_end(), which may avoid the another round of dequeuing in the pfifo_fast_dequeue(). And adding the "clear_bit(__QDISC_STATE_MISSED, &qdisc->state)" for V6 + option 3, the data for 16 pktgen thread also go up to 2.24Mpps.
So it seems V6 + option 3 with "clear_bit(__QDISC_STATE_MISSED, &qdisc->state)" at the end of qdisc_run_end() is better?
Alright, sounds good.