
-----Original Message-----
From: Salil Mehta
Sent: Thursday, June 3, 2021 11:47 AM
To: moyufeng <moyufeng@huawei.com>; tanhuazhong <tanhuazhong@huawei.com>; shenjian (K) <shenjian15@huawei.com>; lipeng (Y) <lipeng321@huawei.com>; Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; linyunsheng <linyunsheng@huawei.com>; zhangjiaran <zhangjiaran@huawei.com>; huangguangbin (A) <huangguangbin2@huawei.com>; chenhao (DY) <chenhao288@hisilicon.com>; Linuxarm <linuxarm@huawei.com>; linuxarm@openeuler.org
Cc: xuwei (O) <xuwei5@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
Subject: RE: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: moyufeng
Sent: Wednesday, June 2, 2021 1:13 PM
To: tanhuazhong <tanhuazhong@huawei.com>; shenjian (K) <shenjian15@huawei.com>; lipeng (Y) <lipeng321@huawei.com>; Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; linyunsheng <linyunsheng@huawei.com>; zhangjiaran <zhangjiaran@huawei.com>; huangguangbin (A) <huangguangbin2@huawei.com>; chenhao (DY) <chenhao288@hisilicon.com>; moyufeng <moyufeng@huawei.com>; Salil Mehta <salil.mehta@huawei.com>
Subject: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: Yunsheng Lin <linyunsheng@huawei.com>
Currently the rx page will be reused to receive future packets when the stack releases the previous skb quickly. If the old page cannot be reused, a new page will be allocated and mapped, which consumes a lot of CPU when the IOMMU is in strict mode, especially when the application and irq/NAPI happen to run on the same CPU.
So allocate a new frag and memcpy the data into it to avoid the costly IOMMU unmapping/mapping operations, and add "frag_alloc_err" and "frag_alloc" stats to the "ethtool -S ethX" output.
Throughput improves by more than 50% when running a single iperf TCP thread with the IOMMU in strict mode and iperf sharing the same CPU with irq/NAPI (rx_copybreak = 2048 and mtu = 1500).
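[For reference, a minimal sketch of the copybreak idea described above. This is not the actual hns3 code; the helper name rx_bounce_copy() is hypothetical, while napi_alloc_frag() and the frag_alloc/frag_alloc_err counters come from the commit message:]

#include <linux/skbuff.h>
#include <linux/string.h>

/*
 * When the rx page cannot be recycled and the packet is small
 * enough, copy it into a freshly allocated page fragment so the
 * original DMA mapping can be kept and reused, avoiding a costly
 * IOMMU unmap/map per packet in strict mode.
 */
static void *rx_bounce_copy(const void *dma_buf, unsigned int len)
{
	void *frag = napi_alloc_frag(SKB_DATA_ALIGN(len));

	if (unlikely(!frag))
		return NULL;	/* frag_alloc_err++; fall back to page alloc */

	memcpy(frag, dma_buf, len);	/* frag_alloc++ */
	return frag;
}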
Performance gains are quite good!
A few questions:
How have we ensured that this will work efficiently in real-world workloads and that there are no repercussions?
Maybe we can test it with an nginx server. It is usually hard to set up such an environment on our side :-( So maybe we can simulate some load noise with a command like $ stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10s to check whether the patch gives a consistent performance increase (see the example sequence below).
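[For example, something along these lines; the server address, device name, and durations are placeholders:]

$ stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 60s &
$ iperf -c <server-ip> -t 60
$ ethtool -S eth0 | grep frag_alloc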
Also, is there any impact on end-to-end *latency* or *jitter* with this approach compared to without it?
Also, have you checked why the MLX5 driver removed this copybreak concept for small packets while MLX4 did have it, or why other recent drivers do not have it?
Hope I have not missed this anywhere, but what are the default values for both {rx,tx}_copybreak (see the example below)?
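[For what it is worth, assuming the driver wires these up through the standard ethtool tunables interface, the values can be inspected and changed at run time; eth0 and 2048 are placeholders:]

$ ethtool --get-tunable eth0 rx-copybreak
$ ethtool --set-tunable eth0 rx-copybreak 2048
$ ethtool --get-tunable eth0 tx-copybreak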
Thanks
Thanks, Barry