On 2021/7/28 2:38, Alexander Duyck wrote:
On Tue, Jul 27, 2021 at 12:54 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
On 2021/7/26 0:49, Alexander Duyck wrote:
On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
On 2021/7/22 23:18, Alexander Duyck wrote:
<snip>
Rather than trying to reuse the device's page pool, it might make more sense to see if you couldn't have TCP just use some sort of circular buffer of memory that is directly mapped for the device it is going to be transmitting to. Essentially you would be creating pre-mapped pages, and would need to communicate that the memory is already mapped for the device you want to send to, so that it could skip that step.
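Something like the below, purely as a sketch: struct tcp_tx_ring and tcp_tx_ring_init_slot() are made-up names for illustration, not existing kernel APIs.

#define TX_RING_SIZE 256 /* arbitrary for the example */

/* Pre-mapped circular buffer of pages tied to one transmit device. */
struct tcp_tx_ring {
        struct device *dev;               /* device the pages are mapped for */
        struct page *pages[TX_RING_SIZE];
        dma_addr_t dma[TX_RING_SIZE];     /* mappings done once at setup */
        unsigned int head, tail;
};

/* Allocate and map slot i up front so the xmit path can skip dma_map_page(). */
static int tcp_tx_ring_init_slot(struct tcp_tx_ring *ring, unsigned int i)
{
        ring->pages[i] = alloc_page(GFP_KERNEL);
        if (!ring->pages[i])
                return -ENOMEM;

        ring->dma[i] = dma_map_page(ring->dev, ring->pages[i], 0,
                                    PAGE_SIZE, DMA_TO_DEVICE);
        if (dma_mapping_error(ring->dev, ring->dma[i])) {
                __free_page(ring->pages[i]);
                ring->pages[i] = NULL;
                return -ENOMEM;
        }
        return 0;
}

The transmit path would then carve skb frags out of the page at head, and only advance tail as tx completions/acks come back.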
IIUC, sk_page_frag_refill() already does a similar kind of reuse to the Rx page reuse implemented in most drivers, except for the pre-mapping part; see the sketch below.
And it seems that even if we pre-map the page and communicate to the driver that the memory is already mapped, it is likely that we will not be able to reuse the page when the circular buffer is not big enough, or when tx completion/tcp ack is not happening quickly enough, which might mean unmapping/deallocating the old circular buffer and allocating/mapping a new one.
Using a page pool, we might be able to alleviate the above problem, as it does for Rx?
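The reuse I am referring to is roughly the pattern below, simplified from skb_page_frag_refill() in net/core/sock.c (the real function also tries high-order pages first):

/* The page is reused in place once the stack has dropped all other
 * references to it (refcount back to 1), much like the Rx page reuse
 * trick in many drivers.
 */
static bool frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
{
        if (pfrag->page) {
                if (page_ref_count(pfrag->page) == 1) {
                        pfrag->offset = 0;      /* sole owner: reuse in place */
                        return true;
                }
                if (pfrag->offset + sz <= pfrag->size)
                        return true;            /* still room in current page */
                put_page(pfrag->page);          /* give up and allocate anew */
        }

        pfrag->page = alloc_page(gfp);
        if (likely(pfrag->page)) {
                pfrag->offset = 0;
                pfrag->size = PAGE_SIZE;
                return true;
        }
        return false;
}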
I would say that instead of going straight for the page pool, it might make more sense to see if we can coalesce the DMA mapping of the pages at the socket layer first, rather than introducing the overhead of the page pool. In the case of sockets we already have destructors that are called when the memory is freed, so instead of making sockets use the page pool it might make more sense to extend the socket buffer allocation/freeing to incorporate bulk mapping and unmapping of pages, to optimize the socket Tx path in the 32K page case.
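As a rough sketch of what I mean, assuming a made-up struct page_frag_dma wrapper (note how the unmap of a replaced page is exactly what the destructor would have to take care of):

/* Carry one dma mapping per frag page so every chunk carved out of the
 * page shares it, instead of the driver mapping each frag separately.
 */
struct page_frag_dma {
        struct page_frag pfrag;
        struct device *dev;
        dma_addr_t dma;         /* covers the whole page */
};

static bool sock_frag_refill_mapped(unsigned int sz, struct page_frag_dma *f,
                                    struct device *dev, gfp_t gfp)
{
        struct page *old = f->pfrag.page;

        if (!skb_page_frag_refill(sz, &f->pfrag, gfp))
                return false;

        if (f->pfrag.page != old) {
                /* A replaced page cannot be unmapped here since in-flight
                 * skbs may still be using it; that is the destructor's job
                 * once the last user is done.
                 */
                f->dma = dma_map_page(dev, f->pfrag.page, 0, PAGE_SIZE,
                                      DMA_TO_DEVICE);
                if (dma_mapping_error(dev, f->dma)) {
                        put_page(f->pfrag.page);
                        f->pfrag.page = NULL;
                        return false;
                }
                f->dev = dev;
        }
        return true;
}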
I was able to prototype tx recycling based on the page pool and run some performance tests. The performance improvement is about +20% (30 Gbit -> 38 Gbit) for a single-thread iperf tcp flow when the IOMMU is in strict mode, and CPU usage decreases by about 10% for a four-thread iperf tcp flow at 100 Gbit line speed when the IOMMU is in strict mode.
That isn't surprising given that for most devices the IOMMU will be called per frag, which can add a fair bit of overhead.
Looking at the prototyping code, I agree that it is a bit controversial to use the page pool for tx, as the page pool assumes NAPI polling protection on the allocation side.
So I will take a deeper look at your suggestion above to see how to implement it.
Also, I am assuming the "destructors" mean tcp_wfree() for TCP, right? It seems tcp_wfree() is mainly used to do memory accounting and to free "struct sock" if necessary.
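For reference, the assignment I am looking at in __tcp_transmit_skb() is roughly this (slightly trimmed):

        skb_orphan(skb);
        skb->sk = sk;
        skb->destructor = skb_is_tcp_pure_ack(skb) ? __sock_wfree : tcp_wfree;
        refcount_add(skb->truesize, &sk->sk_wmem_alloc);

and tcp_wfree() mostly just releases the sk_wmem_alloc charge taken here.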
Yes, that is what I was thinking. If we had some way, such as an extra argument or callback, to push the information about where the skbs are being freed back to the socket, the socket could then look at pre-mapping the pages for the device, if we assume a 1:1 mapping from the socket to the device.
I am not familiar enough with the socket layer to understand how the "destructors" would be helpful here; any detailed idea on how to use the "destructors" for this?
The basic idea is the destructors are called when the skb is orphaned or freed. So it might be a good spot to put in any logic to free pages from your special pool. The only thing you would need to sort out is making certain to bump reference counts appropriately if the skb is cloned and the destructor is copied.
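Something along these lines, where my_pool_put()/my_pool_hold() are made-up helpers for whatever special pool you end up with:

/* Illustrative destructor handing the frag pages back to a private pool.
 * Any clone sharing these frags must take a pool reference first, via
 * something like my_pool_hold(), so only the last user recycles a page.
 */
static void my_tx_destructor(struct sk_buff *skb)
{
        int i;

        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
                my_pool_put(skb_frag_page(&skb_shinfo(skb)->frags[i]));

        sock_wfree(skb);        /* keep the normal memory accounting */
}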
It seems the destructor is not copied when a skb is cloned, see: https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L1050
For IPv4 TCP, tcp_write_xmit() calls __tcp_transmit_skb() to send a newly cloned skb using ip_queue_xmit(), and the original skb is kept in sk->tcp_rtx_queue to wait for the ack packet. The destructor is assigned to the new cloned skb in __tcp_transmit_skb(), and when the destructor is called in the tx completion process, the frag pages might still be in use by the retransmitting process, which means it is better not to unmap or recycle those frag pages in skb->destructor?
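The clone path I am referring to looks roughly like this in __tcp_transmit_skb() (slightly trimmed):

        if (clone_it) {
                oskb = skb;
                ...
                if (unlikely(skb_cloned(oskb)))
                        skb = pskb_copy(oskb, gfp_mask);
                else
                        skb = skb_clone(oskb, gfp_mask);
                if (unlikely(!skb))
                        return -ENOBUFS;
        }

so the clone that goes down to the driver and the original skb kept on the rtx queue share the same frag pages.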
I also tried to implement a frag pool to replace the page pool, but it seems the features needed by a frag pool are already implemented by the page pool, so implementing a new frag pool does not make sense.
For Rx, we have a 1 : 1 relation between a struct napi_struct instance and a struct page_pool instance. It seems we have the below options if the recycling pool makes sense for Tx too:
1. 1 : 1 relation between struct net_device instance and struct page_pool instance.
2. 1 : 1 relation between struct napi_struct instance and struct page_pool instance.
3. 1 : 1 relation between struct sock instance and struct page_pool instance.
Option 2 seems to make the most sense to me, if we can reuse the same page pool for both Tx and Rx.
As for where or when to "bump reference counts appropriately", __skb_frag_ref() might be a good spot to increment the frag count when the frag is copied. Since the "struct page" pointer in frag->bv_page is aligned, bit 0 of it is always zero and effectively spare, in the same way the lower 12 bits of the page's dma address are, so we can use bit 0 in frag->bv_page to indicate whether the corresponding page is from a page pool. If the page is from a page pool, then __skb_frag_ref() can do an atomic_inc(pp_frag_count) instead of get_page(). This might also mean that skb->pp_recycle is only needed to indicate whether the head data page is from a page pool, since bit 0 of frag->bv_page can indicate whether the corresponding frag page is.
And doing atomic_inc(pp_frag_count) in __skb_frag_ref() seems to match the semantics of "recycle after all users of the page are done with it", at least for most users in the netstack?
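Something like the below sketch, where skb_frag_is_pp_page()/skb_frag_pp_page() are made-up helpers and pp_frag_count is assumed to be the page pool frag count discussed above:

/* Sketch only: bit 0 of frag->bv_page marks a page pool page. */
static inline bool skb_frag_is_pp_page(const skb_frag_t *frag)
{
        return (unsigned long)frag->bv_page & 1UL;
}

static inline struct page *skb_frag_pp_page(const skb_frag_t *frag)
{
        return (struct page *)((unsigned long)frag->bv_page & ~1UL);
}

static inline void __skb_frag_ref(skb_frag_t *frag)
{
        if (skb_frag_is_pp_page(frag)) {
                /* pool-aware reference instead of get_page() */
                atomic_long_inc(&skb_frag_pp_page(frag)->pp_frag_count);
        } else {
                get_page(skb_frag_page(frag));
        }
}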