On 2021/4/19 10:04, Yunsheng Lin wrote:
On 2021/4/19 6:59, Michal Kubecek wrote:
On Thu, Mar 25, 2021 at 11:13:11AM +0800, Yunsheng Lin wrote:
Lockless qdisc has below concurrent problem: cpu0 cpu1 . . q->enqueue . . . qdisc_run_begin() . . . dequeue_skb() . . . sch_direct_xmit() . . . . q->enqueue . qdisc_run_begin() . return and do nothing . . qdisc_run_end() .
cpu1 enqueue a skb without calling __qdisc_run() because cpu0 has not released the lock yet and spin_trylock() return false for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb enqueued by cpu1 when calling dequeue_skb() because cpu1 may enqueue the skb after cpu0 calling dequeue_skb() and before cpu0 calling qdisc_run_end().
Lockless qdisc has below another concurrent problem when tx_action is involved:
cpu0(serving tx_action) cpu1 cpu2 . . . . q->enqueue . . qdisc_run_begin() . . dequeue_skb() . . . q->enqueue . . . . sch_direct_xmit() . . . qdisc_run_begin() . . return and do nothing . . . clear __QDISC_STATE_SCHED . . qdisc_run_begin() . . return and do nothing . . . . . . qdisc_run_end() .
This patch fixes the above data race by:
- Get the flag before doing spin_trylock().
- If the first spin_trylock() return false and the flag is not set before the first spin_trylock(), Set the flag and retry another spin_trylock() in case other CPU may not see the new flag after it releases the lock.
- reschedule if the flags is set after the lock is released at the end of qdisc_run_end().
For tx_action case, the flags is also set when cpu1 is at the end if qdisc_run_end(), so tx_action will be rescheduled again to dequeue the skb enqueued by cpu2.
Only clear the flag before retrying a dequeuing when dequeuing returns NULL in order to reduce the overhead of the above double spin_trylock() and __netif_schedule() calling.
The performance impact of this patch, tested using pktgen and dummy netdev with pfifo_fast qdisc attached:
threads without+this_patch with+this_patch delta 1 2.61Mpps 2.60Mpps -0.3% 2 3.97Mpps 3.82Mpps -3.7% 4 5.62Mpps 5.59Mpps -0.5% 8 2.78Mpps 2.77Mpps -0.3% 16 2.22Mpps 2.22Mpps -0.0%
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V3: fix a compile error and a few comment typo, remove the __QDISC_STATE_DEACTIVATED checking, and update the performance data. V2: Avoid the overhead of fixing the data race as much as possible.
I tried this patch o top of 5.12-rc7 with real devices. I used two machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on a saturated ethernet, the CPU consumption grows quite a lot:
threads unpatched 5.12-rc7 5.12-rc7 + v3 1 25.6% 30.6% 8 73.1% 241.4% 128 132.2% 1012.0%
I do not really read the above number, but I understand that v3 has a cpu usage impact when it is patched to 5.12-rc7, so I do a test too on a arm64 system with a hns3 100G netdev, which is in node 0, and node 0 has 32 cpus.
root@(none)$ cat /sys/class/net/eth4/device/numa_node 0 root@(none)$ numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 128646 MB node 0 free: 127876 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 129019 MB node 1 free: 128796 MB node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 2 size: 129019 MB node 2 free: 128755 MB node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 3 size: 127989 MB node 3 free: 127772 MB node distances: node 0 1 2 3 0: 10 16 32 33 1: 16 10 25 32 2: 32 25 10 16 3: 33 32 16 10
and I use below cmd to only use 16 tx/rx queue with 72 tx queue depth in order to trigger the tx queue stopping handling: ethtool -L eth4 combined 16 ethtool -G eth4 tx 72
threads unpatched patched_v4 patched_v4+queue_stopped_opt 16 11% (si: 3.8%) 20% (si:13.5%) 11% (si:3.8%) 32 12% (si: 4.4%) 30% (si:22%) 13% (si:4.4%) 64 13% (si: 4.9%) 35% (si:26%) 13% (si:4.7%)
"11% (si: 3.8%)": 11% means the total cpu useage in node 0, and 3.8% means the softirq cpu useage . And thread number is as below iperf cmd: taskset -c 0-31 iperf -c 192.168.100.2 -t 1000000 -i 1 -P *thread*
It seems after applying the queue_stopped_opt patch, the cpu usage is closed to the unpatch one, at least with my testcase, maybe you can try your testcase with the queue_stopped_opt patch to see if it make any difference?
patched_v4: https://patchwork.kernel.org/project/netdevbpf/patch/1618535809-11952-2-git-...
And queue_stopped_opt: diff --git a/net/core/dev.c b/net/core/dev.c index 5d6f2e2..62008d5 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3762,7 +3762,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
if (q->flags & TCQ_F_NOLOCK) { rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK; - qdisc_run(q); + if (likely(rc == NET_XMIT_SUCCESS && + !netif_xmit_frozen_or_stopped(txq))) + qdisc_run(q);
if (unlikely(to_free)) kfree_skb_list(to_free); diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 6d7f954..079a825 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -74,6 +74,7 @@ static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q) } } else { skb = SKB_XOFF_MAGIC; + clear_bit(__QDISC_STATE_MISSED, &q->state); } }
@@ -242,6 +243,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate, } } else { skb = NULL; + clear_bit(__QDISC_STATE_MISSED, &q->state); } if (lock) spin_unlock(lock); @@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate, *validate = true;
if ((q->flags & TCQ_F_ONETXQUEUE) && - netif_xmit_frozen_or_stopped(txq)) + netif_xmit_frozen_or_stopped(txq)) { + clear_bit(__QDISC_STATE_MISSED, &q->state); return skb; + }
skb = qdisc_dequeue_skb_bad_txq(q); if (unlikely(skb)) { @@ -311,6 +315,8 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q, HARD_TX_LOCK(dev, txq, smp_processor_id()); if (!netif_xmit_frozen_or_stopped(txq)) skb = dev_hard_start_xmit(skb, dev, txq, &ret); + else + clear_bit(__QDISC_STATE_MISSED, &q->state);
HARD_TX_UNLOCK(dev, txq); } else {
(The values are normalized to one core, i.e. 100% corresponds to one fully used logical CPU.) I didn't perform a full statistical evaluation but the growth is way beyond any statistical fluctuation with one exception: 8-thread test of patched kernel showed values from 155.5% to 311.4%. Closer look shows that most of the CPU time was spent in softirq and running top in parallel with the test confirms that there are multiple ksofirqd threads running at 100% CPU. I had similar problem with earlier versions of my patch (work in progress, I still need to check some corner cases and write commit message explaining the logic)
Great, if there is a better idea, maybe share the core idea first so that we both can work on the that?
and tracing confirmed that similar problem (non-empty queue, no other thread going to clean it up but device queue stopped) was happening repeatedly most of the time.
Make sense, maybe that is why the dummy netdevice can not catch this kind of performance degradation because it always consume the skb without stopping the tx queue, and a real netdevice with limited queue depth may stop the queue when there are multil skbs queuing concurrently.
I think for the above to happen, there may be a lot of tx doorbell batching happening, it would be better to see how many packets is enqueued at a time when trace_qdisc_dequeue() tracepoint is enabled?
The biggest problem IMHO is that the loop in __qdisc_run() may finish without rescheduling not only when the qdisc queue is empty but also when the corresponding device Tx queue is stopped which devices tend to do whenever they cannot send any more packets out. Thus whenever __QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or frozen, we keep rescheduling the queue cleanup without any chance to progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all we need is another thready to fail the first spin_trylock() while device queue is stopped and qdisc queue not empty.
Yes, We could just return false before doing the second spin_trylock() if the netdev queue corresponding qdisc is stopped, and when the netdev queue is restarted, __netif_schedule() is called again, see netif_tx_wake_queue().
Maybe add a sch_qdisc_stopped() function and do the testting in qdisc_run_begin:
if (dont_retry || sch_qdisc_stopped()) return false;
bool sch_qdisc_stopped(struct Qdisc *q) { const struct netdev_queue *txq = q->dev_queue;
if (!netif_xmit_frozen_or_stopped(txq)) return true;
reture false; }
At least for qdisc with TCQ_F_ONETXQUEUE flags set is doable?
Another part of the problem may be that to avoid the race, the logic is too pessimistic: consider e.g. (dotted lines show "barriers" where ordering is important):
CPU A CPU B spin_trylock() succeeds pfifo_fast_enqueue() .................................................................. skb_array empty, exit loop first spin_trylock() fails set __QDISC_STATE_NEED_RESCHEDULE second spin_trylock() fails .................................................................. spin_unlock() call __netif_schedule()
When we switch the order of spin_lock() on CPU A and second spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE before CPU A tests it), we end up scheduling a queue cleanup even if there is already one running. And either of these is quite realistic.
But I am not sure it is a good thing or bad thing when the above happen, because it may be able to enable the tx batching?
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index f7a6e14..e3f46eb 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -36,6 +36,7 @@ struct qdisc_rate_table { enum qdisc_state_t { __QDISC_STATE_SCHED, __QDISC_STATE_DEACTIVATED,
- __QDISC_STATE_NEED_RESCHEDULE,
};
I'm not sure if putting the flag here is a good idea. If you look at the history of struct Qdisc reshufflings, this part (cacheline) should be used for members which don't change too often. However, this new flag is going to be touched about as often and in similar places as q->seqlock or q->empty so that it should probably be in the last cacheline with them.
I am not sure about that too, but we can always adjust it's location if it provdes to be the case:)
[...]
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 44991ea..4953430 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc) { struct pfifo_fast_priv *priv = qdisc_priv(qdisc); struct sk_buff *skb = NULL;
- bool need_retry = true; int band;
+retry: for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) { struct skb_array *q = band2list(priv, band);
@@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc) } if (likely(skb)) { qdisc_update_stats_at_dequeue(qdisc, skb);
- } else if (need_retry &&
test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
&qdisc->state)) {
/* do another enqueuing after clearing the flag to
* avoid calling __netif_schedule().
*/
smp_mb__after_atomic();
need_retry = false;
} else { WRITE_ONCE(qdisc->empty, true); }igoto retry;
Does the retry really provide significant improvement? IIUC it only helps if all of enqueue, first spin_trylock() failure and setting the __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the skb_array empty and checking the flag. If enqueue happens before we check the array (and flag is set before the check), the retry is useless. If the flag is set after we check it, we don't catch it (no matter if the enqueue happened before or after we found skb_array empty).
In odrder to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"
__QDISC_STATE_MISSED is in V4, so should be: set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state)
The above data race is not really the main concern here, the main concern is when to clear the __QDISC_STATE_NEED_RESCHEDULE in order to avoid the multi cpus doing the set_bit().
as much as possible, the __QDISC_STATE_NEED_RESCHEDULE need to be set as as much as possible, so only clear __QDISC_STATE_NEED_RESCHEDULE when the queue is empty. And it has about 5% performance improvement.
Michal
.
Linuxarm mailing list -- linuxarm@openeuler.org To unsubscribe send an email to linuxarm-leave@openeuler.org