From: moyufeng Sent: Wednesday, June 2, 2021 1:13 PM To: tanhuazhong tanhuazhong@huawei.com; shenjian (K) shenjian15@huawei.com; lipeng (Y) lipeng321@huawei.com; Zhuangyuzeng (Yisen) yisen.zhuang@huawei.com; linyunsheng linyunsheng@huawei.com; zhangjiaran zhangjiaran@huawei.com; huangguangbin (A) huangguangbin2@huawei.com; chenhao (DY) chenhao288@hisilicon.com; moyufeng moyufeng@huawei.com; Salil Mehta salil.mehta@huawei.com Subject: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: Yunsheng Lin linyunsheng@huawei.com
Currently the rx page is reused to receive future packets when the stack releases the previous skb quickly. If the old page can not be reused, a new page is allocated and mapped, which consumes a lot of CPU when the IOMMU is in strict mode, especially when the application and irq/NAPI happen to run on the same CPU.
So allocate a new frag and memcpy the data into it to avoid the costly IOMMU unmapping/mapping operations, and add "frag_alloc_err" and "frag_alloc" stats to the "ethtool -S ethX" output.
Throughput improves by more than 50% when running a single iperf TCP stream with the IOMMU in strict mode and iperf sharing the same CPU with irq/NAPI (rx_copybreak = 2048 and mtu = 1500).
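As an illustration of the idea only, a rough sketch with hypothetical names (hnsX_rx_bounce_copy, struct hypo_rx_stats and its fields, the copybreak parameter), not the actual hns3 code: when the received length is within rx_copybreak, the payload is copied into a frag taken from the per-CPU frag cache and attached to the skb, so the DMA-mapped rx page is never unmapped and can be handed straight back to the hardware.

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/skbuff.h>
#include <linux/string.h>
#include <linux/types.h>

/* Hypothetical per-ring stats mirroring the "frag_alloc"/"frag_alloc_err"
 * counters mentioned above.
 */
struct hypo_rx_stats {
        u64 frag_alloc;
        u64 frag_alloc_err;
};

/* Copy a small rx payload into a bounce frag instead of unmapping and
 * remapping the rx page. Returns 0 on success; on any failure the caller
 * falls back to the normal page-based path.
 */
static int hnsX_rx_bounce_copy(struct sk_buff *skb, void *rx_va,
                               unsigned int len, unsigned int copybreak,
                               struct hypo_rx_stats *stats)
{
        unsigned int frag_size = SKB_DATA_ALIGN(len);
        struct page *page;
        void *frag;

        if (!copybreak || len > copybreak)
                return -E2BIG;          /* too big: take the normal page path */

        frag = napi_alloc_frag(frag_size);      /* bounce buffer from the per-CPU frag cache */
        if (unlikely(!frag)) {
                stats->frag_alloc_err++;        /* reported via "ethtool -S ethX" */
                return -ENOMEM;
        }

        memcpy(frag, rx_va, len);               /* copy instead of IOMMU unmap/map of the rx page */
        page = virt_to_head_page(frag);
        skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
                        frag - page_address(page), len, frag_size);
        stats->frag_alloc++;
        return 0;
}

Because the rx page stays mapped, it can be recycled to the hardware immediately, which is where the CPU saving under IOMMU strict mode comes from.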
Performance gains are quite good!
A few questions:
How have we ensured this will work efficiently in real-world workloads and that there are no repercussions?
Also, is there any impact on end-to-end *latency* or *jitter* with this approach compared to without it?
Also, have you checked why the mlx5 driver removed this copybreak concept for small packets while mlx4 did have it, or why other recent drivers don't have it?
Hope I have not missed this anywhere, but what are the default values for both {rx,tx}_copybreak?
Thanks
-----Original Message----- From: Salil Mehta Sent: Thursday, June 3, 2021 11:47 AM To: moyufeng moyufeng@huawei.com; tanhuazhong tanhuazhong@huawei.com; shenjian (K) shenjian15@huawei.com; lipeng (Y) lipeng321@huawei.com; Zhuangyuzeng (Yisen) yisen.zhuang@huawei.com; linyunsheng linyunsheng@huawei.com; zhangjiaran zhangjiaran@huawei.com; huangguangbin (A) huangguangbin2@huawei.com; chenhao (DY) chenhao288@hisilicon.com; Linuxarm linuxarm@huawei.com; linuxarm@openeuler.org Cc: xuwei (O) xuwei5@huawei.com; Jonathan Cameron jonathan.cameron@huawei.com; Song Bao Hua (Barry Song) song.bao.hua@hisilicon.com Subject: RE: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: moyufeng Sent: Wednesday, June 2, 2021 1:13 PM To: tanhuazhong tanhuazhong@huawei.com; shenjian (K) shenjian15@huawei.com; lipeng (Y) lipeng321@huawei.com; Zhuangyuzeng (Yisen) yisen.zhuang@huawei.com; linyunsheng linyunsheng@huawei.com; zhangjiaran zhangjiaran@huawei.com; huangguangbin (A) huangguangbin2@huawei.com; chenhao (DY) chenhao288@hisilicon.com; moyufeng moyufeng@huawei.com; Salil Mehta salil.mehta@huawei.com Subject: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: Yunsheng Lin linyunsheng@huawei.com
Currently the rx page is reused to receive future packets when the stack releases the previous skb quickly. If the old page can not be reused, a new page is allocated and mapped, which consumes a lot of CPU when the IOMMU is in strict mode, especially when the application and irq/NAPI happen to run on the same CPU.
So allocate a new frag and memcpy the data into it to avoid the costly IOMMU unmapping/mapping operations, and add "frag_alloc_err" and "frag_alloc" stats to the "ethtool -S ethX" output.
Throughput improves by more than 50% when running a single iperf TCP stream with the IOMMU in strict mode and iperf sharing the same CPU with irq/NAPI (rx_copybreak = 2048 and mtu = 1500).
Performance gains are quite good!
A few questions:
How have we ensured this will work efficiently in real-world workloads and that there are no repercussions?
Maybe we can test it with an nginx server. Usually it is hard to set up that environment on our side :-(
So maybe we can simulate some load noise with a command like $ stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10s to check whether the patch gives a consistent performance increase.
Also, is there any impact on end-to-end *latency* or *jitter* with this approach compared to without it?
Also, have you checked why the mlx5 driver removed this copybreak concept for small packets while mlx4 did have it, or why other recent drivers don't have it?
Hope I have not missed this anywhere, but what are the default values for both {rx,tx}_copybreak?
Thanks
Thanks Barry
On 2021/6/3 7:47, Salil Mehta wrote:
From: moyufeng Sent: Wednesday, June 2, 2021 1:13 PM To: tanhuazhong tanhuazhong@huawei.com; shenjian (K) shenjian15@huawei.com; lipeng (Y) lipeng321@huawei.com; Zhuangyuzeng (Yisen) yisen.zhuang@huawei.com; linyunsheng linyunsheng@huawei.com; zhangjiaran zhangjiaran@huawei.com; huangguangbin (A) huangguangbin2@huawei.com; chenhao (DY) chenhao288@hisilicon.com; moyufeng moyufeng@huawei.com; Salil Mehta salil.mehta@huawei.com Subject: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: Yunsheng Lin linyunsheng@huawei.com
Currently the rx page is reused to receive future packets when the stack releases the previous skb quickly. If the old page can not be reused, a new page is allocated and mapped, which consumes a lot of CPU when the IOMMU is in strict mode, especially when the application and irq/NAPI happen to run on the same CPU.
So allocate a new frag and memcpy the data into it to avoid the costly IOMMU unmapping/mapping operations, and add "frag_alloc_err" and "frag_alloc" stats to the "ethtool -S ethX" output.
Throughput improves by more than 50% when running a single iperf TCP stream with the IOMMU in strict mode and iperf sharing the same CPU with irq/NAPI (rx_copybreak = 2048 and mtu = 1500).
Performance gains are quite good!
A few questions:
How have we ensured this will work efficiently in real-world workloads and that there are no repercussions?
Also, is there any impact on end-to-end *latency* or *jitter* with this approach compared to without it?
Sorry for the late reply. This optimization targets specific scenarios. In those scenarios the performance is improved, and the change has passed the iteration test we use for quality assurance.
Also, have you checked why the mlx5 driver removed this copybreak concept for small packets while mlx4 did have it, or why other recent drivers don't have it?
Sorry, I'm not sure why mlx5 doesn't have this either :(
Hope I have not missed this anywhere, but what are the default values for both {rx,tx}_copybreak?
The default values for both {rx,tx}_copybreak are 0. They can be changed with "ethtool --set-tunable <devname> {rx,tx}-copybreak".
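For example, assuming an interface named eth0 and the 2048-byte threshold used in the test above (both just placeholders):

ethtool --set-tunable eth0 rx-copybreak 2048
ethtool --set-tunable eth0 tx-copybreak 2048
ethtool --get-tunable eth0 rx-copybreak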
Thanks.