On Mon, Apr 19, 2021 at 08:21:38PM +0800, Yunsheng Lin wrote:
On 2021/4/19 10:04, Yunsheng Lin wrote:
On 2021/4/19 6:59, Michal Kubecek wrote:
I tried this patch on top of 5.12-rc7 with real devices. I used two machines with 10Gb/s Intel ixgbe NICs; the sender has 16 CPUs (2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, the receiver has 48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on a saturated ethernet, the CPU consumption grows quite a lot:
threads  unpatched 5.12-rc7  5.12-rc7 + v3
      1              25.6%           30.6%
      8              73.1%          241.4%
    128             132.2%         1012.0%
I cannot fully interpret the numbers above, but I understand that v3 has a CPU usage impact when applied on top of 5.12-rc7, so I ran a test too, on an arm64 system with a hns3 100G netdev. The netdev is in node 0, and node 0 has 32 CPUs.
root@(none)$ cat /sys/class/net/eth4/device/numa_node
0
root@(none)$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 128646 MB
node 0 free: 127876 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 129019 MB
node 1 free: 128796 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 2 size: 129019 MB
node 2 free: 128755 MB
node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 127989 MB
node 3 free: 127772 MB
node distances:
node   0   1   2   3
  0:  10  16  32  33
  1:  16  10  25  32
  2:  32  25  10  16
  3:  33  32  16  10
and I use the commands below to use only 16 Tx/Rx queues with a Tx queue depth of 72, in order to trigger the Tx queue stopping handling:

ethtool -L eth4 combined 16
ethtool -G eth4 tx 72
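As a side note, a small sketch (assuming the same eth4 device as above) for confirming that the channel and ring settings actually took effect before benchmarking; `ethtool -l` and `ethtool -g` print the current values:

```shell
# Sketch, not from the thread: verify the queue configuration applied above.
ethtool -l eth4   # current channel counts; "Combined" should read 16
ethtool -g eth4   # current ring sizes; "TX" should read 72
```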
threads  unpatched        patched_v4       patched_v4+queue_stopped_opt
     16  11% (si: 3.8%)   20% (si: 13.5%)  11% (si: 3.8%)
     32  12% (si: 4.4%)   30% (si: 22%)    13% (si: 4.4%)
     64  13% (si: 4.9%)   35% (si: 26%)    13% (si: 4.7%)
"11% (si: 3.8%)": 11% means the total cpu useage in node 0, and 3.8% means the softirq cpu useage . And thread number is as below iperf cmd: taskset -c 0-31 iperf -c 192.168.100.2 -t 1000000 -i 1 -P *thread*
The problem I see with this is that iperf's -P option only opens multiple connections; it does not actually run them in multiple threads. Therefore this may not result in as much concurrency as it seems.
Also, 100Gb/s ethernet is not so easy to saturate; trying 10Gb/s or 1Gb/s might put more pressure on the concurrency in the qdisc code.
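One way around the single-process limitation of -P, sketched here under the assumptions of the setup above (same iperf binary, same server at 192.168.100.2), is to launch several independent single-connection iperf clients, each pinned to its own CPU, so the connections really run concurrently:

```shell
# Sketch, not from the thread: one iperf client process per node-0 CPU,
# each with a single connection, instead of one process with -P N.
for cpu in $(seq 0 31); do
    taskset -c "$cpu" iperf -c 192.168.100.2 -t 60 -i 1 -P 1 &
done
wait
```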
Michal
It seems that after applying the queue_stopped_opt patch, the CPU usage is close to the unpatched one, at least with my test case. Maybe you can try your test case with the queue_stopped_opt patch to see if it makes any difference?
I will; I was not aware of the v4 submission. I'll write a short note on it so that it does not accidentally get applied before we know for sure what the CPU usage impact is.
Michal