On Mon, Apr 19, 2021 at 08:21:38PM +0800, Yunsheng Lin wrote:
On 2021/4/19 10:04, Yunsheng Lin wrote:
On 2021/4/19 6:59, Michal Kubecek wrote:
I tried this patch on top of 5.12-rc7 with real devices. I used two machines with 10Gb/s Intel ixgbe NICs; the sender has 16 CPUs (2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, the receiver has 48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on a saturated ethernet, the CPU consumption grows quite a lot:
threads  unpatched 5.12-rc7  5.12-rc7 + v3
      1              25.6%           30.6%
      8              73.1%          241.4%
    128             132.2%         1012.0%
I cannot fully interpret the numbers above, but I understand that v3 has a CPU usage impact when applied on top of 5.12-rc7, so I ran a test too, on an arm64 system with a hns3 100G netdev. The netdev is in node 0, and node 0 has 32 CPUs.
root@(none)$ cat /sys/class/net/eth4/device/numa_node
0
root@(none)$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 128646 MB
node 0 free: 127876 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 129019 MB
node 1 free: 128796 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 2 size: 129019 MB
node 2 free: 128755 MB
node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 127989 MB
node 3 free: 127772 MB
node distances:
node   0   1   2   3
  0:  10  16  32  33
  1:  16  10  25  32
  2:  32  25  10  16
  3:  33  32  16  10
and I use the commands below to use only 16 Tx/Rx queues with a Tx queue depth of 72, in order to trigger the Tx queue stopping handling:

ethtool -L eth4 combined 16
ethtool -G eth4 tx 72
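As a side note, a small sketch (assuming the same eth4 device as above) for confirming that the channel and ring settings actually took effect before benchmarking; `ethtool -l` and `ethtool -g` print the current values:

```shell
# Sketch, not from the thread: verify the queue configuration applied above.
ethtool -l eth4   # current channel counts; "Combined" should read 16
ethtool -g eth4   # current ring sizes; "TX" should read 72
```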
threads  unpatched        patched_v4       patched_v4+queue_stopped_opt
     16  11% (si: 3.8%)   20% (si: 13.5%)  11% (si: 3.8%)
     32  12% (si: 4.4%)   30% (si: 22%)    13% (si: 4.4%)
     64  13% (si: 4.9%)   35% (si: 26%)    13% (si: 4.7%)
"11% (si: 3.8%)": 11% means the total cpu useage in node 0, and 3.8% means the softirq cpu useage . And thread number is as below iperf cmd: taskset -c 0-31 iperf -c 192.168.100.2 -t 1000000 -i 1 -P *thread*
The problem I see with this is that iperf's -P option only opens multiple connections; it does not actually run them in multiple threads. Therefore this may not result in as much concurrency as it seems.
Also, 100Gb/s ethernet is not so easy to saturate; trying 10Gb/s or 1Gb/s might put more pressure on the concurrency in the qdisc code.
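One way around the single-process limitation of -P, sketched here under the assumptions of the setup above (same iperf binary, same server at 192.168.100.2), is to launch several independent single-connection iperf clients, each pinned to its own CPU, so the connections really run concurrently:

```shell
# Sketch, not from the thread: one iperf client process per node-0 CPU,
# each with a single connection, instead of one process with -P N.
for cpu in $(seq 0 31); do
    taskset -c "$cpu" iperf -c 192.168.100.2 -t 60 -i 1 -P 1 &
done
wait
```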
Michal
It seems that after applying the queue_stopped_opt patch, the CPU usage is close to the unpatched one, at least with my test case. Maybe you can try your test case with the queue_stopped_opt patch to see if it makes any difference?
I will; I was not aware of the v4 submission. I'll write a short note on it so that it does not accidentally get applied before we know for sure what the CPU usage impact is.
Michal