[Linuxarm] Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag count in page pool

28 Jul 2021


      On Tue, Jul 27, 2021 at 12:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
...
On 2021/7/26 0:49, Alexander Duyck wrote:
...
On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
...
On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
...
On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
...
On 2021/7/22 23:18, Alexander Duyck wrote:
...
>>
<snip>
Rather than trying to reuse the devices page pool it might make more
sense to see if you couldn't have TCP just use some sort of circular
buffer of memory that is directly mapped for the device that it is
going to be transmitting to. Essentially what you would be doing is
creating a pre-mapped page and would need to communicate that the
memory is already mapped for the device you want to send it to so that
it could skip that step.
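
To make the circular buffer idea concrete, here is a minimal user-space sketch in C. It is only a model: `struct tx_ring`, `tx_ring_get()`, `tx_ring_put()` and `SLOT_SIZE` are hypothetical names, not kernel APIs. It assumes the whole region was DMA-mapped once up front (represented by `dma_base`) and that tx completions arrive in order. It also shows the case where the ring runs dry because completions are too slow.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_SIZE 2048u /* hypothetical fixed buffer size per slot */

/* Hypothetical model of a region that was mapped for the device once. */
struct tx_ring {
	uint64_t dma_base;   /* device address of the region (assumed pre-mapped) */
	unsigned int nslots; /* number of slots, assumed to be a power of two */
	unsigned int head;   /* slots handed out so far */
	unsigned int tail;   /* slots returned by tx completion so far */
};

/* Hand out the next slot. Returns 0 and the slot's device address, or -1
 * when every slot is still in flight. No per-buffer mapping call is needed
 * because the address is derived from the pre-mapped base. */
static int tx_ring_get(struct tx_ring *r, uint64_t *dma)
{
	if (r->head - r->tail == r->nslots)
		return -1;
	*dma = r->dma_base + (uint64_t)(r->head % r->nslots) * SLOT_SIZE;
	r->head++;
	return 0;
}

/* Return the oldest slot, modelling an in-order tx completion. */
static void tx_ring_put(struct tx_ring *r)
{
	assert(r->head != r->tail); /* nothing in flight would be a caller bug */
	r->tail++;
}
```

With two slots, a third `tx_ring_get()` fails until one `tx_ring_put()` has run, after which the first slot's device address is handed out again without any new mapping.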
IIUC sk_page_frag_refill() already does page reuse similar to the rx
reuse implemented in most drivers, except that it does not pre-map
the page.
And it seems that even if we pre-map the page and communicate that the
memory is already mapped to the driver, it is likely that we will not
be able to reuse the page when the circular buffer is not big enough
or tx completion/TCP ack is not happening quickly enough, which might
mean unmapping/deallocating the old circular buffer and allocating/mapping
a new circular buffer.
Using page pool we might be able to alleviate the above problem as it
does for rx?
I would say that instead of looking at going straight for the page
pool it might make more sense to look at seeing if we can coalesce the
DMA mapping of the pages first at the socket layer rather than trying
to introduce the overhead for the page pool. In the case of sockets we
already have the destructors that are called when the memory is freed,
so instead of making sockets use page pool it might make more sense to
extend the socket buffer allocation/freeing to incorporate bulk
mapping and unmapping of pages to optimize the socket Tx path in the
32K page case.
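
The difference between per-fragment mapping and bulk mapping can be sketched with a small user-space model in C. Everything here is hypothetical: `fake_dma_map()` is a stand-in that only counts calls and returns a made-up device address, and `map_per_frag()` / `map_bulk()` are illustrative names, not socket-layer or DMA API functions. The sketch assumes the fragments are carved contiguously out of one 32K page.

```c
#include <assert.h>
#include <stdint.h>

#define BIG_PAGE_SIZE (32u * 1024u) /* the 32K page case */

static unsigned int map_calls; /* how often the stand-in mapping routine ran */

/* Hypothetical stand-in for a mapping call: takes an offset into the page
 * and returns a fake device address for it. */
static uint64_t fake_dma_map(uint64_t cpu_off)
{
	map_calls++;
	return 0x80000000ull + cpu_off;
}

/* Per-fragment mapping: one mapping call for every fragment. */
static void map_per_frag(unsigned int nfrags, unsigned int frag_len,
			 uint64_t *dma)
{
	assert((uint64_t)nfrags * frag_len <= BIG_PAGE_SIZE);
	for (unsigned int i = 0; i < nfrags; i++)
		dma[i] = fake_dma_map((uint64_t)i * frag_len);
}

/* Bulk mapping: map the page once, then derive each fragment's device
 * address from the base by adding its offset. */
static void map_bulk(unsigned int nfrags, unsigned int frag_len,
		     uint64_t *dma)
{
	assert((uint64_t)nfrags * frag_len <= BIG_PAGE_SIZE);
	uint64_t base = fake_dma_map(0);

	for (unsigned int i = 0; i < nfrags; i++)
		dma[i] = base + (uint64_t)i * frag_len;
}
```

For eight 4K fragments, the per-fragment path calls the mapping routine eight times and the bulk path calls it once, while both produce the same device addresses in this model.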
I was able to prototype tx recycling based on page pool and run some
performance tests. The improvement is about +20% (30Gbit -> 38Gbit) for
a single-thread iperf TCP flow when the IOMMU is in strict mode, and CPU
usage decreases by about 10% for a four-thread iperf TCP flow at the line
speed of 100Gbit when the IOMMU is in strict mode.
That isn't surprising given that for most devices the IOMMU will be
called per frag which can add a fair bit of overhead.
...
Looking at the prototyping code, I agree that it is a bit controversial
to use the page pool for tx, as the page pool assumes NAPI polling
protection on the allocation side.
So I will take a deeper look at your suggestion above to see how to
implement it.
Also, I am assuming the "destructors" means tcp_wfree() for TCP, right?
It seems tcp_wfree() is mainly used to do memory accounting and free
"struct sock" if necessary.
Yes, that is what I was thinking. If we had some way to add something
like an argument or way to push the information about where the skbs
are being freed back to the socket, the socket could then look at
pre-mapping the pages for the device if we assume a 1:1 mapping from
the socket to the device.
...
I am not familiar enough with the socket layer to understand how the
"destructors" would help here. Do you have a more detailed idea of how
to use them?
The basic idea is the destructors are called when the skb is orphaned
or freed. So it might be a good spot to put in any logic to free pages
from your special pool. The only thing you would need to sort out is
making certain to bump reference counts appropriately if the skb is
cloned and the destructor is copied.
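
As a rough illustration of the reference counting concern, here is a toy user-space model in C. It is not kernel code: `struct fake_skb`, `struct tx_page` and the helper functions are hypothetical, and the model simply assumes that a clone copies the destructor pointer, which is the case described above. It does not claim to mirror how the kernel's own clone path treats destructors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page taken from a special pre-mapped pool. */
struct tx_page {
	int refs;     /* buffers that still point at this page */
	int released; /* set once the page went back to the pool */
};

/* Hypothetical, heavily simplified stand-in for an skb. */
struct fake_skb {
	struct tx_page *page;
	void (*destructor)(struct fake_skb *skb);
};

/* Runs when a buffer is freed; the last reference releases the page. */
static void tx_page_destructor(struct fake_skb *skb)
{
	assert(skb->page->refs > 0);
	if (--skb->page->refs == 0)
		skb->page->released = 1;
}

/* Attach a pool page to a buffer and install the destructor. */
static void fake_skb_attach(struct fake_skb *skb, struct tx_page *page)
{
	page->refs++;
	skb->page = page;
	skb->destructor = tx_page_destructor;
}

/* Clone: the destructor pointer is copied along with the rest, so the
 * page reference count has to be bumped or the page is released early. */
static void fake_skb_clone(struct fake_skb *dst, const struct fake_skb *src)
{
	*dst = *src;
	if (dst->destructor)
		dst->page->refs++;
}

/* Free: call the destructor once, then clear the buffer's pointers. */
static void fake_skb_free(struct fake_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->page = NULL;
}
```

In this model, freeing the original while a clone is still alive leaves the page in use; only freeing the clone as well marks the page as released. Dropping the `refs++` in `fake_skb_clone()` would release the page while the clone still points at it.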