
-----Original Message-----
From: Salil Mehta
Sent: Thursday, June 3, 2021 11:47 AM
To: moyufeng <moyufeng@huawei.com>; tanhuazhong <tanhuazhong@huawei.com>; shenjian (K) <shenjian15@huawei.com>; lipeng (Y) <lipeng321@huawei.com>; Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; linyunsheng <linyunsheng@huawei.com>; zhangjiaran <zhangjiaran@huawei.com>; huangguangbin (A) <huangguangbin2@huawei.com>; chenhao (DY) <chenhao288@hisilicon.com>; Linuxarm <linuxarm@huawei.com>; linuxarm@openeuler.org
Cc: xuwei (O) <xuwei5@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
Subject: RE: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: moyufeng
Sent: Wednesday, June 2, 2021 1:13 PM
To: tanhuazhong <tanhuazhong@huawei.com>; shenjian (K) <shenjian15@huawei.com>; lipeng (Y) <lipeng321@huawei.com>; Zhuangyuzeng (Yisen) <yisen.zhuang@huawei.com>; linyunsheng <linyunsheng@huawei.com>; zhangjiaran <zhangjiaran@huawei.com>; huangguangbin (A) <huangguangbin2@huawei.com>; chenhao (DY) <chenhao288@hisilicon.com>; moyufeng <moyufeng@huawei.com>; Salil Mehta <salil.mehta@huawei.com>
Subject: [PATCH net-next 7/7] {topost} net: hns3: use bounce buffer when rx page can not be reused
From: Yunsheng Lin <linyunsheng@huawei.com>
Currently the rx page will be reused to receive future packets when the stack releases the previous skb quickly. If the old page cannot be reused, a new page will be allocated and mapped, which consumes a lot of CPU when the IOMMU is in strict mode, especially when the application and irq/NAPI happen to run on the same CPU.
So allocate a new frag and memcpy the data into it to avoid the costly IOMMU unmapping/mapping operations, and add "frag_alloc_err" and "frag_alloc" stats to the "ethtool -S ethX" output.
Throughput improves by more than 50% when running a single iperf TCP thread with the IOMMU in strict mode and iperf sharing the same CPU with irq/NAPI (rx_copybreak = 2048 and mtu = 1500).
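[For reference, a minimal sketch of the copybreak idea described above. This is not the actual hns3 code; the helper name rx_bounce_copy() is hypothetical, while napi_alloc_frag() and the frag_alloc/frag_alloc_err counters come from the commit message:]

#include <linux/skbuff.h>
#include <linux/string.h>

/*
 * When the rx page cannot be recycled and the packet is small
 * enough, copy it into a freshly allocated page fragment so the
 * original DMA mapping can be kept and reused, avoiding a costly
 * IOMMU unmap/map per packet in strict mode.
 */
static void *rx_bounce_copy(const void *dma_buf, unsigned int len)
{
	void *frag = napi_alloc_frag(SKB_DATA_ALIGN(len));

	if (unlikely(!frag))
		return NULL;	/* frag_alloc_err++; fall back to page alloc */

	memcpy(frag, dma_buf, len);	/* frag_alloc++ */
	return frag;
}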
Performance gains are quite good!
A few questions:
How have we ensured that this will work efficiently in real-world workloads and that there are no repercussions?
Maybe we can test it with an nginx server. It is usually hard to set up such an environment on our side :-( So maybe we can simulate some load noise with a command like $ stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10s to check whether the patch gives a consistent performance increase (see the example sequence below).
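[For example, something along these lines; the server address, device name, and durations are placeholders:]

$ stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 60s &
$ iperf -c <server-ip> -t 60
$ ethtool -S eth0 | grep frag_alloc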
Also, is there any impact on end-to-end *latency* or *jitter* with this approach compared to without it?
Also, have you checked why the MLX5 driver removed this copybreak concept for small packets while MLX4 did have it, or why other recent drivers do not have it?
Hope I have not missed this anywhere, but what are the default values for both {rx,tx}_copybreak (see the example below)?
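[For what it is worth, assuming the driver wires these up through the standard ethtool tunables interface, the values can be inspected and changed at run time; eth0 and 2048 are placeholders:]

$ ethtool --get-tunable eth0 rx-copybreak
$ ethtool --set-tunable eth0 rx-copybreak 2048
$ ethtool --get-tunable eth0 tx-copybreak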
Thanks
Thanks, Barry