This patchset adds frag page support in page pool and enable skb's page frag recycling based on page pool in hns3 drvier.
V2: 1. resend based on the latest net-next.
V1: 1. avoid atomic_long_read() in case of freeing or draining page frag, and drop RFC tag.
RFC v6: 1. Disable frag page support in system 32-bit arch and 64-bit DMA.
RFC v5: 1. Rename dma_addr[0] to pp_frag_count and adjust codes according to the rename.
RFC v4: 1. Use the dma_addr[1] to store bias. 2. Default to a pagecnt_bias of PAGE_SIZE - 1. 3. other minor comment suggested by Alexander.
RFC v3: 1. Implement the semantic of "page recycling only wait for the page pool user instead of all user of a page" 2. Support the frag allocation of different sizes 3. Merge patch 4 & 5 to one patch as it does not make sense to use page_pool_dev_alloc_pages() API directly with elevated refcnt. 4. other minor comment suggested by Alexander.
RFC v2: 1. Split patch 1 to more reviewable one. 2. Repurpose the lower 12 bits of the dma address to store the pagecnt_bias as suggested by Alexander. 3. support recycling to pool->alloc for elevated refcnt case too.
Yunsheng Lin (4): page_pool: keep pp info as long as page pool owns the page page_pool: add interface to manipulate frag count in page pool page_pool: add frag page recycling support in page pool net: hns3: support skb's frag page recycling based on page pool
drivers/net/ethernet/hisilicon/Kconfig | 1 + drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 79 +++++++++++++++-- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 3 + drivers/net/ethernet/marvell/mvneta.c | 6 +- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 +- drivers/net/ethernet/ti/cpsw.c | 2 +- drivers/net/ethernet/ti/cpsw_new.c | 2 +- include/linux/mm_types.h | 18 ++-- include/linux/skbuff.h | 4 +- include/net/page_pool.h | 68 +++++++++++--- net/core/page_pool.c | 112 +++++++++++++++++++++++- 11 files changed, 258 insertions(+), 39 deletions(-)
Currently, page->pp is cleared and set everytime the page is recycled, which is unnecessary.
So only set the page->pp when the page is added to the page pool and only clear it when the page is released from the page pool.
This is also a preparation to support allocating frag page in page pool.
Reviewed-by: Ilias Apalodimas ilias.apalodimas@linaro.org Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- drivers/net/ethernet/marvell/mvneta.c | 6 +----- drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 2 +- drivers/net/ethernet/ti/cpsw.c | 2 +- drivers/net/ethernet/ti/cpsw_new.c | 2 +- include/linux/skbuff.h | 4 +--- include/net/page_pool.h | 7 ------- net/core/page_pool.c | 21 +++++++++++++++++---- 7 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index ff8db31..5d1007e 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -2327,7 +2327,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool, if (!skb) return ERR_PTR(-ENOMEM);
- skb_mark_for_recycle(skb, virt_to_page(xdp->data), pool); + skb_mark_for_recycle(skb);
skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); @@ -2339,10 +2339,6 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool, skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, skb_frag_page(frag), skb_frag_off(frag), skb_frag_size(frag), PAGE_SIZE); - /* We don't need to reset pp_recycle here. It's already set, so - * just mark fragments for recycling. - */ - page_pool_store_mem_info(skb_frag_page(frag), pool); }
return skb; diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index 99bd8b8..744f58f 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -3995,7 +3995,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, }
if (pp) - skb_mark_for_recycle(skb, page, pp); + skb_mark_for_recycle(skb); else dma_unmap_single_attrs(dev->dev.parent, dma_addr, bm_pool->buf_size, DMA_FROM_DEVICE, diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index abf9a2a..c451eaa 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -431,7 +431,7 @@ static void cpsw_rx_handler(void *token, int len, int status) skb->protocol = eth_type_trans(skb, ndev);
/* mark skb for recycling */ - skb_mark_for_recycle(skb, page, pool); + skb_mark_for_recycle(skb); netif_receive_skb(skb);
ndev->stats.rx_bytes += len; diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c index ae16722..d197623 100644 --- a/drivers/net/ethernet/ti/cpsw_new.c +++ b/drivers/net/ethernet/ti/cpsw_new.c @@ -375,7 +375,7 @@ static void cpsw_rx_handler(void *token, int len, int status) skb->protocol = eth_type_trans(skb, ndev);
/* mark skb for recycling */ - skb_mark_for_recycle(skb, page, pool); + skb_mark_for_recycle(skb); netif_receive_skb(skb);
ndev->stats.rx_bytes += len; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 783cc23..6bdb0db 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4712,11 +4712,9 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb) }
#ifdef CONFIG_PAGE_POOL -static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page, - struct page_pool *pp) +static inline void skb_mark_for_recycle(struct sk_buff *skb) { skb->pp_recycle = 1; - page_pool_store_mem_info(page, pp); } #endif
diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 3dd62dd..8d7744d 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -253,11 +253,4 @@ static inline void page_pool_ring_unlock(struct page_pool *pool) spin_unlock_bh(&pool->ring.producer_lock); }
-/* Store mem_info on struct page and use it while recycling skb frags */ -static inline -void page_pool_store_mem_info(struct page *page, struct page_pool *pp) -{ - page->pp = pp; -} - #endif /* _NET_PAGE_POOL_H */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 5e4eb45..78838c6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -206,6 +206,19 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) return true; }
+static void page_pool_set_pp_info(struct page_pool *pool, + struct page *page) +{ + page->pp = pool; + page->pp_magic |= PP_SIGNATURE; +} + +static void page_pool_clear_pp_info(struct page *page) +{ + page->pp_magic = 0; + page->pp = NULL; +} + static struct page *__page_pool_alloc_page_order(struct page_pool *pool, gfp_t gfp) { @@ -222,7 +235,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool, return NULL; }
- page->pp_magic |= PP_SIGNATURE; + page_pool_set_pp_info(pool, page);
/* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; @@ -266,7 +279,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, put_page(page); continue; } - page->pp_magic |= PP_SIGNATURE; + + page_pool_set_pp_info(pool, page); pool->alloc.cache[pool->alloc.count++] = page; /* Track how many pages are held 'in-flight' */ pool->pages_state_hold_cnt++; @@ -345,7 +359,7 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) DMA_ATTR_SKIP_CPU_SYNC); page_pool_set_dma_addr(page, 0); skip_dma_unmap: - page->pp_magic = 0; + page_pool_clear_pp_info(page);
/* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -644,7 +658,6 @@ bool page_pool_return_skb_page(struct page *page) * The page will be returned to the pool here regardless of the * 'flipped' fragment being in use or not. */ - page->pp = NULL; page_pool_put_full_page(pp, page, false);
return true;
For 32 bit systems with 64 bit dma, dma_addr[1] is used to store the upper 32 bit dma addr, those system should be rare those days.
For normal system, the dma_addr[1] in 'struct page' is not used, so we can reuse dma_addr[1] for storing frag count, which means how many frags this page might be splited to.
In order to simplify the page frag support in the page pool, the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate the 32 bit systems with 64 bit dma, and the page frag support in page pool is disabled for such system.
The newly added page_pool_set_frag_count() is called to reserve the maximum frag count before any page frag is passed to the user. The page_pool_atomic_sub_frag_count_return() is called when user is done with the page frag.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- include/linux/mm_types.h | 18 +++++++++++++----- include/net/page_pool.h | 46 +++++++++++++++++++++++++++++++++++++++------- net/core/page_pool.c | 4 ++++ 3 files changed, 56 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 52bbd2b..7f8ee09 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -103,11 +103,19 @@ struct page { unsigned long pp_magic; struct page_pool *pp; unsigned long _pp_mapping_pad; - /** - * @dma_addr: might require a 64-bit value on - * 32-bit architectures. - */ - unsigned long dma_addr[2]; + unsigned long dma_addr; + union { + /** + * dma_addr_upper: might require a 64-bit + * value on 32-bit architectures. + */ + unsigned long dma_addr_upper; + /** + * For frag page support, not supported in + * 32-bit architectures with 64-bit DMA. + */ + atomic_long_t pp_frag_count; + }; }; struct { /* slab, slob and slub */ union { diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 8d7744d..42e6997 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -45,7 +45,10 @@ * Please note DMA-sync-for-CPU is still * device driver responsibility */ -#define PP_FLAG_ALL (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV) +#define PP_FLAG_PAGE_FRAG BIT(2) /* for page frag feature */ +#define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\ + PP_FLAG_DMA_SYNC_DEV |\ + PP_FLAG_PAGE_FRAG)
/* * Fast allocation side cache array/stack @@ -198,19 +201,48 @@ static inline void page_pool_recycle_direct(struct page_pool *pool, page_pool_put_full_page(pool, page, true); }
+#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \ + (sizeof(dma_addr_t) > sizeof(unsigned long)) + static inline dma_addr_t page_pool_get_dma_addr(struct page *page) { - dma_addr_t ret = page->dma_addr[0]; - if (sizeof(dma_addr_t) > sizeof(unsigned long)) - ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16; + dma_addr_t ret = page->dma_addr; + + if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) + ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16; + return ret; }
static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr) { - page->dma_addr[0] = addr; - if (sizeof(dma_addr_t) > sizeof(unsigned long)) - page->dma_addr[1] = upper_32_bits(addr); + page->dma_addr = addr; + if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) + page->dma_addr_upper = upper_32_bits(addr); +} + +static inline void page_pool_set_frag_count(struct page *page, long nr) +{ + atomic_long_set(&page->pp_frag_count, nr); +} + +static inline long page_pool_atomic_sub_frag_count_return(struct page *page, + long nr) +{ + long ret; + + /* As suggested by Alexander, atomic_long_read() may cover up the + * reference count errors, so avoid calling atomic_long_read() in + * the cases of freeing or draining the page_frags, where we would + * not expect it to match or that are slowpath anyway. + */ + if (__builtin_constant_p(nr) && + atomic_long_read(&page->pp_frag_count) == nr) + return 0; + + ret = atomic_long_sub_return(nr, &page->pp_frag_count); + WARN_ON(ret < 0); + return ret; }
static inline bool is_page_pool_compiled_in(void) diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 78838c6..68fab94 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -67,6 +67,10 @@ static int page_pool_init(struct page_pool *pool, */ }
+ if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT && + pool->p.flags & PP_FLAG_PAGE_FRAG) + return -EINVAL; + if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) return -ENOMEM;
On 06/08/2021 04.46, Yunsheng Lin wrote:
I find the macro name confusing.
I think it would be easier to read the code, if it was called: PAGE_POOL_DMA_CANNOT_USE_PP_FRAG_COUNT
I read this as: if the page_pool use pp_frag_count and have flag set, then it is invalid/no-allowed, which seems wrong.
I find this code more intuitive to read:
+ if (PAGE_POOL_DMA_CANNOT_USE_PP_FRAG_COUNT && + pool->p.flags & PP_FLAG_PAGE_FRAG) + return -EINVAL;
--Jesper
On 2021/8/10 22:58, Jesper Dangaard Brouer wrote:
Actually, there is a *DMA* in tha above macro, which means DMA addr uses the PP_FRAG_COUNT field. Perhaps PAGE_POOL_DMA_ADDR_UPPER_USE_PP_FRAG_COUNT is more obvious here?
On 06/08/2021 04.46, Yunsheng Lin wrote:
I worried about this WARN_ON() as it generates an 'ud2' instruction which influence I-cache fetching. But I have disassembled (objdump) the page_pool.o binary and the ud2 gets placed last in the main function page_pool_put_page() that use this inlined function. Thus, I assume this is not a problem :-)
Currently page pool only support page recycling when there is only one user of the page, and the split page reusing implemented in the most driver can not use the page pool as bing-pong way of reusing requires the multi user support in page pool.
Those reusing or recycling has below limitations: 1. page from page pool can only be used be one user in order for the page recycling to happen. 2. Bing-pong way of reusing in most driver does not support multi desc using different part of the same page in order to save memory.
So add multi-users support and frag page recycling in page pool to overcome the above limitation.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- include/net/page_pool.h | 15 +++++++++ net/core/page_pool.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 102 insertions(+)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 42e6997..a408240 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -91,6 +91,9 @@ struct page_pool { unsigned long defer_warn;
u32 pages_state_hold_cnt; + unsigned int frag_offset; + struct page *frag_page; + long frag_users;
/* * Data structure for allocation side @@ -140,6 +143,18 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool) return page_pool_alloc_pages(pool, gfp); }
+struct page *page_pool_alloc_frag(struct page_pool *pool, unsigned int *offset, + unsigned int size, gfp_t gfp); + +static inline struct page *page_pool_dev_alloc_frag(struct page_pool *pool, + unsigned int *offset, + unsigned int size) +{ + gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); + + return page_pool_alloc_frag(pool, offset, size, gfp); +} + /* get the stored dma direction. A driver might decide to treat this locally and * avoid the extra cache line from page_pool to determine the direction */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 68fab94..ac11604 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -24,6 +24,8 @@ #define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ)
+#define BIAS_MAX LONG_MAX + static int page_pool_init(struct page_pool *pool, const struct page_pool_params *params) { @@ -423,6 +425,11 @@ static __always_inline struct page * __page_pool_put_page(struct page_pool *pool, struct page *page, unsigned int dma_sync_size, bool allow_direct) { + /* It is not the last user for the page frag case */ + if (pool->p.flags & PP_FLAG_PAGE_FRAG && + page_pool_atomic_sub_frag_count_return(page, 1)) + return NULL; + /* This allocator is optimized for the XDP mode that uses * one-frame-per-page, but have fallbacks that act like the * regular page allocator APIs. @@ -515,6 +522,84 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data, } EXPORT_SYMBOL(page_pool_put_page_bulk);
+static struct page *page_pool_drain_frag(struct page_pool *pool, + struct page *page) +{ + long drain_count = BIAS_MAX - pool->frag_users; + + /* Some user is still using the page frag */ + if (likely(page_pool_atomic_sub_frag_count_return(page, + drain_count))) + return NULL; + + if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) { + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) + page_pool_dma_sync_for_device(pool, page, -1); + + return page; + } + + page_pool_return_page(pool, page); + return NULL; +} + +static void page_pool_free_frag(struct page_pool *pool) +{ + long drain_count = BIAS_MAX - pool->frag_users; + struct page *page = pool->frag_page; + + pool->frag_page = NULL; + + if (!page || + page_pool_atomic_sub_frag_count_return(page, drain_count)) + return; + + page_pool_return_page(pool, page); +} + +struct page *page_pool_alloc_frag(struct page_pool *pool, + unsigned int *offset, + unsigned int size, gfp_t gfp) +{ + unsigned int max_size = PAGE_SIZE << pool->p.order; + struct page *page = pool->frag_page; + + if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) || + size > max_size)) + return NULL; + + size = ALIGN(size, dma_get_cache_alignment()); + *offset = pool->frag_offset; + + if (page && *offset + size > max_size) { + page = page_pool_drain_frag(pool, page); + if (page) + goto frag_reset; + } + + if (!page) { + page = page_pool_alloc_pages(pool, gfp); + if (unlikely(!page)) { + pool->frag_page = NULL; + return NULL; + } + + pool->frag_page = page; + +frag_reset: + pool->frag_users = 1; + *offset = 0; + pool->frag_offset = size; + page_pool_set_frag_count(page, BIAS_MAX); + return page; + } + + pool->frag_users++; + pool->frag_offset = *offset + size; + return page; +} +EXPORT_SYMBOL(page_pool_alloc_frag); + static void page_pool_empty_ring(struct page_pool *pool) { struct page *page; @@ -620,6 +705,8 @@ void page_pool_destroy(struct page_pool *pool) if (!page_pool_put(pool)) return;
+ page_pool_free_frag(pool); + if (!page_pool_release(pool)) return;
This patch adds skb's frag page recycling support based on the frag page support in page pool.
The performance improves above 10~20% for single thread iperf TCP flow with IOMMU disabled when iperf server and irq/NAPI have a different CPU.
The performance improves about 135%(14Gbit to 33Gbit) for single thread iperf TCP flow when IOMMU is in strict mode and iperf server shares the same cpu with irq/NAPI.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- drivers/net/ethernet/hisilicon/Kconfig | 1 + drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 79 +++++++++++++++++++++++-- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 3 + 3 files changed, 78 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/Kconfig b/drivers/net/ethernet/hisilicon/Kconfig index 094e4a3..2ba0e7b 100644 --- a/drivers/net/ethernet/hisilicon/Kconfig +++ b/drivers/net/ethernet/hisilicon/Kconfig @@ -91,6 +91,7 @@ config HNS3 tristate "Hisilicon Network Subsystem Support HNS3 (Framework)" depends on PCI select NET_DEVLINK + select PAGE_POOL help This selects the framework support for Hisilicon Network Subsystem 3. This layer facilitates clients like ENET, RoCE and user-space ethernet diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index cb8d5da..fcbeb1f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -3205,6 +3205,21 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring, unsigned int order = hns3_page_order(ring); struct page *p;
+ if (ring->page_pool) { + p = page_pool_dev_alloc_frag(ring->page_pool, + &cb->page_offset, + hns3_buf_size(ring)); + if (unlikely(!p)) + return -ENOMEM; + + cb->priv = p; + cb->buf = page_address(p); + cb->dma = page_pool_get_dma_addr(p); + cb->type = DESC_TYPE_PP_FRAG; + cb->reuse_flag = 0; + return 0; + } + p = dev_alloc_pages(order); if (!p) return -ENOMEM; @@ -3227,8 +3242,13 @@ static void hns3_free_buffer(struct hns3_enet_ring *ring, if (cb->type & (DESC_TYPE_SKB | DESC_TYPE_BOUNCE_HEAD | DESC_TYPE_BOUNCE_ALL | DESC_TYPE_SGL_SKB)) napi_consume_skb(cb->priv, budget); - else if (!HNAE3_IS_TX_RING(ring) && cb->pagecnt_bias) - __page_frag_cache_drain(cb->priv, cb->pagecnt_bias); + else if (!HNAE3_IS_TX_RING(ring)) { + if (cb->type & DESC_TYPE_PAGE && cb->pagecnt_bias) + __page_frag_cache_drain(cb->priv, cb->pagecnt_bias); + else if (cb->type & DESC_TYPE_PP_FRAG) + page_pool_put_full_page(ring->page_pool, cb->priv, + false); + } memset(cb, 0, sizeof(*cb)); }
@@ -3315,7 +3335,7 @@ static int hns3_alloc_and_map_buffer(struct hns3_enet_ring *ring, int ret;
ret = hns3_alloc_buffer(ring, cb); - if (ret) + if (ret || ring->page_pool) goto out;
ret = hns3_map_buffer(ring, cb); @@ -3337,7 +3357,8 @@ static int hns3_alloc_and_attach_buffer(struct hns3_enet_ring *ring, int i) if (ret) return ret;
- ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma); + ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + + ring->desc_cb[i].page_offset);
return 0; } @@ -3367,7 +3388,8 @@ static void hns3_replace_buffer(struct hns3_enet_ring *ring, int i, { hns3_unmap_buffer(ring, &ring->desc_cb[i]); ring->desc_cb[i] = *res_cb; - ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma); + ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + + ring->desc_cb[i].page_offset); ring->desc[i].rx.bd_base_info = 0; }
@@ -3539,6 +3561,12 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i, u32 frag_size = size - pull_len; bool reused;
+ if (ring->page_pool) { + skb_add_rx_frag(skb, i, desc_cb->priv, frag_offset, + frag_size, truesize); + return; + } + /* Avoid re-using remote or pfmem page */ if (unlikely(!dev_page_is_reusable(desc_cb->priv))) goto out; @@ -3856,6 +3884,9 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length, /* We can reuse buffer as-is, just make sure it is reusable */ if (dev_page_is_reusable(desc_cb->priv)) desc_cb->reuse_flag = 1; + else if (desc_cb->type & DESC_TYPE_PP_FRAG) + page_pool_put_full_page(ring->page_pool, desc_cb->priv, + false); else /* This page cannot be reused so discard it */ __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias); @@ -3863,6 +3894,10 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length, hns3_rx_ring_move_fw(ring); return 0; } + + if (ring->page_pool) + skb_mark_for_recycle(skb); + u64_stats_update_begin(&ring->syncp); ring->stats.seg_pkt_cnt++; u64_stats_update_end(&ring->syncp); @@ -3901,6 +3936,10 @@ static int hns3_add_frag(struct hns3_enet_ring *ring) "alloc rx fraglist skb fail\n"); return -ENXIO; } + + if (ring->page_pool) + skb_mark_for_recycle(new_skb); + ring->frag_num = 0;
if (ring->tail_skb) { @@ -4705,6 +4744,29 @@ static void hns3_put_ring_config(struct hns3_nic_priv *priv) priv->ring = NULL; }
+static void hns3_alloc_page_pool(struct hns3_enet_ring *ring) +{ + struct page_pool_params pp_params = { + .flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG | + PP_FLAG_DMA_SYNC_DEV, + .order = hns3_page_order(ring), + .pool_size = ring->desc_num * hns3_buf_size(ring) / + (PAGE_SIZE << hns3_page_order(ring)), + .nid = dev_to_node(ring_to_dev(ring)), + .dev = ring_to_dev(ring), + .dma_dir = DMA_FROM_DEVICE, + .offset = 0, + .max_len = PAGE_SIZE << hns3_page_order(ring), + }; + + ring->page_pool = page_pool_create(&pp_params); + if (IS_ERR(ring->page_pool)) { + dev_warn(ring_to_dev(ring), "page pool creation failed: %ld\n", + PTR_ERR(ring->page_pool)); + ring->page_pool = NULL; + } +} + static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring) { int ret; @@ -4724,6 +4786,8 @@ static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring) goto out_with_desc_cb;
if (!HNAE3_IS_TX_RING(ring)) { + hns3_alloc_page_pool(ring); + ret = hns3_alloc_ring_buffers(ring); if (ret) goto out_with_desc; @@ -4764,6 +4828,11 @@ void hns3_fini_ring(struct hns3_enet_ring *ring) devm_kfree(ring_to_dev(ring), tx_spare); ring->tx_spare = NULL; } + + if (!HNAE3_IS_TX_RING(ring) && ring->page_pool) { + page_pool_destroy(ring->page_pool); + ring->page_pool = NULL; + } }
static int hns3_buf_size2type(u32 buf_size) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index 15af3d9..27809d6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -6,6 +6,7 @@
#include <linux/dim.h> #include <linux/if_vlan.h> +#include <net/page_pool.h>
#include "hnae3.h"
@@ -307,6 +308,7 @@ enum hns3_desc_type { DESC_TYPE_BOUNCE_ALL = 1 << 3, DESC_TYPE_BOUNCE_HEAD = 1 << 4, DESC_TYPE_SGL_SKB = 1 << 5, + DESC_TYPE_PP_FRAG = 1 << 6, };
struct hns3_desc_cb { @@ -451,6 +453,7 @@ struct hns3_enet_ring { struct hnae3_queue *tqp; int queue_index; struct device *dev; /* will be used for DMA mapping of descriptors */ + struct page_pool *page_pool;
/* statistic */ struct ring_stats stats;
Hi Jakub
After adding page pool to hns3 receiving package process, we want to add some debug info. Such as below:
1. count of page pool allocate and free page, which is defined for pages_state_hold_cnt and pages_state_release_cnt in page pool framework.
2. pool size、order、nid、dev、max_len, which is setted for each rx ring in hns3 driver.
In this regard, we consider two ways to show these info:
1. Add it to queue statistics and query it by ethtool -S.
2. Add a file node "page_pool_info" for debugfs, then cat this file node, print as below:
queue_id allocate_cnt free_cnt pool_size order nid dev max_len 000 xxx xxx xxx xxx xxx xxx xxx 001 002 . . Which one is more acceptable, or would you have some other suggestion?
Thanks
On 2021/8/6 10:46, Yunsheng Lin wrote:
On Wed, 8 Sep 2021 16:31:40 +0800 moyufeng wrote:
Normally I'd say put the stats in ethtool -S and the rest in debugfs but I'm not sure if exposing pages_state_hold_cnt and pages_state_release_cnt directly. Those are short counters, and will very likely wrap. They are primarily meaningful for calculating page_pool_inflight(). Given this I think their semantics may be too confusing for an average ethtool -S user.
Putting all the information in debugfs seems like a better idea.
Hi Jakub,
On Wed, Sep 08, 2021 at 08:08:43AM -0700, Jakub Kicinski wrote:
I can't really disagree on the aforementioned stats being confusing. However at some point we'll want to add more useful page_pool stats (e.g the percentage of the page/page fragments that are hitting the recycling path). Would it still be 'ok' to have info split across ethtool and debugfs?
Regards /Ilias
On Wed, 8 Sep 2021 18:26:35 +0300 Ilias Apalodimas wrote:
Possibly. We'll also see what Alex L comes up with for XDP stats. Maybe we can arrive at a netlink API for standard things (broken record).
You said percentage - even tho I personally don't like it - there is a small precedent of ethtool -S containing non-counter information (IOW not monotonically increasing event counters), e.g. some vendors rammed PCI link quality in there. So if all else fails ethtool -S should be fine.
On 08/09/2021 17.57, Jakub Kicinski wrote:
I agree with Ilias, that we ought-to add some page_pool stats. *BUT* ONLY if this doesn't hurt performance!!!
We have explained before, how this is possible, e.g. by keeping consumer vs. producer counters on separate cache-lines (internally in page_pool struct and likely on per CPU for returning pages). Then the drivers ethtool functions can request the page_pool to fillout a driver provided stats area, such that the collection and aggregation of counters are not on the fast-path.
I definitely don't want to see pages_state_hold_cnt and pages_state_release_cnt being exposed directly. These were carefully designed to not hurt performance. An inflight counter can be deducted by above ethtool-driver step and presented to userspace.
Notice that while developing page_pool, I've been using tracepoints and bpftrace scripts to inspect the behavior and internals of page_pool. See[1] and I've even written a page leak detector[2].
In principle you could write a bpftrace tool that extract stats, the same way. But I would only recommend doing this for devel phase, because these tracepoints do add some overhead. Originally I wanted to push people to use this for stats, but I've realized that not having these stats easy available is annoying ;-)
-Jesper
[1] https://github.com/xdp-project/xdp-project/tree/master/areas/mem/bpftrace [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/bpftrace/pa...
On Wed, Sep 08, 2021 at 08:57:23AM -0700, Jakub Kicinski wrote:
Yea percentage may have been the wrong example. I agree that having absolute numbers (all allocated pages and recycled pages) is a better option. To be honest keeping the 'weird' stats in debugfs seems sane, the pages_state_hold_cnt/pages_state_release_cnt are only going to be needed during debug.
Thanks /Ilias
Hi,
On Fri, Aug 06, 2021 at 10:46:22AM +0800, Yunsheng Lin wrote:
This commit is giving me some trouble, but I haven't managed to pinpoint the exact problem.
Symptoms are: * A page gets unmapped twice from page_pool_release_page(). The second time, dma-iommu.c warns about the empty PTE [1] * The rx ring still accesses the page after the first unmap, causing SMMU translation faults [2] * That leads to APEI errors and reset of the device, at which time page_pool_inflight() complains about "Negative(-x) inflight packet-pages".
After some debugging, it looks like the page gets released three times instead of two:
(1) first in page_pool_drain_frag():
page_pool_alloc_frag+0x1fc/0x248 hns3_alloc_and_map_buffer+0x30/0x170 hns3_nic_alloc_rx_buffers+0x9c/0x170 hns3_clean_rx_ring+0x854/0x950 hns3_nic_common_poll+0xa0/0x218 __napi_poll+0x38/0x1b0 net_rx_action+0xe8/0x248 __do_softirq+0x120/0x284
(2) Then later by page_pool_return_skb_page(), which (I guess) unmaps the page:
page_pool_put_page+0x214/0x308 page_pool_return_skb_page+0x48/0x60 skb_release_data+0x168/0x188 skb_release_all+0x28/0x38 kfree_skb+0x30/0x90 packet_rcv+0x4c/0x410 __netif_receive_skb_list_core+0x1f4/0x218 netif_receive_skb_list_internal+0x18c/0x2a8
(3) And finally, soon after, by clean_rx_ring() which causes pp_frag_count underflow (seen after removing the optimization in page_pool_atomic_sub_frag_count_return):
page_pool_put_page+0x2a0/0x308 page_pool_put_full_page hns3_alloc_skb hns3_handle_rx_bd hns3_clean_rx_ring+0x744/0x950 hns3_nic_common_poll+0xa0/0x218 __napi_poll+0x38/0x1b0 net_rx_action+0xe8/0x248
So I'm guessing (2) happens too early while the RX ring is still using the page, but I don't know more. I'd be happy to add more debug and to test fixes if you have any suggestions.
Thanks, Jean
[1] ------------[ cut here ]------------ WARNING: CPU: 71 PID: 0 at drivers/iommu/dma-iommu.c:848 iommu_dma_unmap_page+0xbc/0xd8 Modules linked in: fuse overlay ipmi_si hisi_hpre hisi_zip ecdh_generic hisi_trng_v2 ecc ipmi_d> CPU: 71 PID: 0 Comm: swapper/71 Not tainted 5.16.0-g3813c61fbaad #22 Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B133.01 03/25/2021 pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : iommu_dma_unmap_page+0xbc/0xd8 lr : iommu_dma_unmap_page+0x38/0xd8 sp : ffff800010abb8d0 x29: ffff800010abb8d0 x28: ffff20200ee80000 x27: 0000000000000042 x26: ffff20201a7ed800 x25: ffff20200be7a5c0 x24: 0000000000000002 x23: 0000000000000020 x22: 0000000000001000 x21: 0000000000000000 x20: 0000000000000002 x19: ffff002086b730c8 x18: 0000000000000001 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000008 x11: 000000000000ffff x10: 0000000000000001 x9 : 0000000000000004 x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff202006274800 x5 : 0000000000000009 x4 : 0000000000000001 x3 : 000000000000001e x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: iommu_dma_unmap_page+0xbc/0xd8 dma_unmap_page_attrs+0x30/0x1d0 page_pool_release_page+0x40/0x88 page_pool_return_page+0x18/0x80 page_pool_put_page+0x248/0x288 hns3_clean_rx_ring+0x744/0x950 hns3_nic_common_poll+0xa0/0x218 __napi_poll+0x38/0x1b0 net_rx_action+0xe8/0x248 __do_softirq+0x120/0x284 irq_exit_rcu+0xe0/0x100 el1_interrupt+0x3c/0x88 el1h_64_irq_handler+0x18/0x28 el1h_64_irq+0x78/0x7c arch_cpu_idle+0x18/0x28 default_idle_call+0x20/0x68 do_idle+0x214/0x260 cpu_startup_entry+0x28/0x70 secondary_start_kernel+0x160/0x170 __secondary_switched+0x90/0x94 ---[ end trace 432d1737b4b96ed9 ]---
(please ignore the kernel version, I can reproduce this with v5.14 and v5.17-rc1, and bisected to this commit.)
[2] arm-smmu-v3 arm-smmu-v3.6.auto: event 0x10 received: arm-smmu-v3 arm-smmu-v3.6.auto: 0x0000bd0000000010 arm-smmu-v3 arm-smmu-v3.6.auto: 0x000012000000007c arm-smmu-v3 arm-smmu-v3.6.auto: 0x00000000ff905800 arm-smmu-v3 arm-smmu-v3.6.auto: 0x00000000ff905000
On 2022/1/26 22:30, Jean-Philippe Brucker wrote:
Hi, Thanks for reporting the problem.
We also hit a similiar problem during internal CI testing, but was not able to trigger it manually, so was not able to find the root case yet.
Is your test case more likely to trigger the problem?
If the reference counting or pp_frag_count of the page is manipulated correctly, I think step 2&3 does not have any dependency between each other.
fixes if you have any suggestions.
My initial thinking is to track if the reference counting or pp_frag_count of the page is manipulated correctly.
Perhaps using the newly added reference counting tracking infrastructure? Will look into how to use the reference counting tracking infrastructure for the above problem.
On Fri, Jan 28, 2022 at 12:00:35PM +0800, Yunsheng Lin wrote:
The problem shows up shortly after boot for me, when I'm not doing anything special, just ssh'ing into the server. I did manage to trigger it faster with a "netperf -T TCP_MAERTS" job. Maybe I have something enabled in my config that makes it easier to trigger? Attached the .config to this reply, but I think it corresponds pretty much to debian's config.
It looks like pp_frag_count is dropped too many times: after (1), pp_frag_count only has 1 ref, so (2) drops it to 0 and (3) results in underflow. I turned page_pool_atomic_sub_frag_count_return() into "atomic_long_sub_return(nr, &page->pp_frag_count)" to make sure (the atomic_long_read() bit normally hides this). Wasn't entirely sure if this is expected behavior, though.
Thanks, Jean
On 2022/1/28 17:21, Jean-Philippe Brucker wrote:
I tried the above config, unfortunately I was still not able to trigger the problem.
Are you true the above 1~3 step is happening for the same page? If it is the same page, there must be something wrong here.
Normally there are 1024 BD for a rx ring:
BD_0 BD_1 BD_2 BD_3 BD_4 .... BD_1020 BD_1021 BD_1022 BD_1023 ^ ^ head tail
Suppose head is manipulated by driver, and tail is manipulated by hw.
driver allocates buffer for BD pointed by head, as the frag page recycling is introduced in this patch, the BD_0 and BD_1's buffer may point to the same page(4K page size, and each BD only need 2k Buffer. hw dma the data to the buffer pointed by tail when packet is received.
so step 1 Normally happen for the BD pointed by head, and step 2 & 3 Normally happen for the BD pointed by tail. And Normally there are at least (1024 - RCB_NOF_ALLOC_RX_BUFF_ONCE) BD between head and tail, so it is unlikely that head and tail's BD buffer points to the same page.
On Sat, Jan 29, 2022 at 04:44:34PM +0800, Yunsheng Lin wrote:
Yes they happen on the same page. What I did was save the backtrace of each call to page_pool_atomic_sub_frag_count_return() and, when an underflow error happens on a page, print out the history of that page only.
My report was not right, though, I forgot to save the backtrace for pp_frag_count==0. There's actually two refs on the page. It goes like this:
(1) T-1535, drop BIAS_MAX - 2, pp_frag_count now 2 page_pool_alloc_frag+0x128/0x240 hns3_alloc_and_map_buffer+0x30/0x170 hns3_nic_alloc_rx_buffers+0x9c/0x170 hns3_clean_rx_ring+0x864/0x960 hns3_nic_common_poll+0xa0/0x218 __napi_poll+0x38/0x188 net_rx_action+0xe8/0x248 __do_softirq+0x120/0x284
(2) T-4, drop 1, pp_frag_count now 1 page_pool_put_page+0x98/0x338 page_pool_return_skb_page+0x48/0x60 skb_release_data+0x170/0x190 skb_release_all+0x28/0x38 kfree_skb_reason+0x30/0x90 packet_rcv+0x58/0x430 __netif_receive_skb_list_core+0x1f4/0x218 netif_receive_skb_list_internal+0x18c/0x2a8
(3) T-1, drop 1, pp_frag_count now 0 page_pool_put_page+0x98/0x338 page_pool_return_skb_page+0x48/0x60 skb_release_data+0x170/0x190 skb_release_all+0x28/0x38 __kfree_skb+0x18/0x30 __sk_defer_free_flush+0x44/0x58 tcp_recvmsg+0x94/0x1b8 inet_recvmsg+0x50/0x128
(4) T, drop 1, pp_frag_count now -1 (underflow) page_pool_put_page+0x2d0/0x338 hns3_clean_rx_ring+0x74c/0x960 hns3_nic_common_poll+0xa0/0x218 __napi_poll+0x38/0x188 net_rx_action+0xe8/0x248
I think a new page is allocated at step 1, no? The driver calls page_pool_alloc_frag() when refilling the rx ring, and since the current pool->frag_page (P1) is still used by BD_0 and BD_1, then page_pool_drain_frag() drops (BIAS_MAX - 2) references and page_pool_alloc_frag() replaces frag_page with a new page, P2. Later, head points to BD_1, the driver can drop the remaining 2 references to P1 in steps 2 and 3, and P1 can be unmapped and freed/recycled
What I don't get is which of steps 2, 3 and 4 is the wrong one. Could be 2 or 3 because the device is evidently still doing DMA to the page after it's released, but it could also be that the driver doesn't properly clear the BD in which case step 4 is wrong. I'll try to find out which fragment gets dropped twice.
Thanks, Jean
On 2022/2/3 17:48, Jean-Philippe Brucker wrote:
Yes. For most of the case, there should be two steps of the 2/3/4 steps, when there is extra step in the above calltrace, it may mean the page_count() is 2 instead of 1, if that is the case, __skb_frag_ref() may be called for a page from page pool((page->pp_magic & ~0x3UL) == PP_SIGNATURE)), which is not supposed to happen.
When there are more than two steps for the freeing side, the only case I know about the skb cloning and expanding case, which is fixed by the below commit: 2cc3aeb5eccc (skbuff: Fix a potential race while recycling page_pool packets)
Maybe there are other case we missed?
Hi,
Sorry for the delay, I had to focus on other issues.
On Mon, Feb 07, 2022 at 10:54:40AM +0800, Yunsheng Lin wrote:
Yes it's something similar, I found the problem in skb_try_coalesce(). I sent a possible fix, we can continue the discussion there:
https://lore.kernel.org/netdev/20220324172913.26293-1-jean-philippe@linaro.o...
Thanks, Jean
On 10/08/2021 16.01, Jakub Kicinski wrote:
I had hoped to see more acks / reviewed-by before this got applied. E.g. from MM-people as this patchset changes struct page and page_pool (that I'm marked as maintainer of). And I would have appreciated an reviewed-by credit to/from Alexander as he did a lot of work in the RFC patchset for the split-page tricks.
p.s. I just returned from vacation today, and have not had time to review, sorry.
--Jesper
(relevant struct page changes for MM-people to review)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 52bbd2b..7f8ee09 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -103,11 +103,19 @@ struct page { unsigned long pp_magic; struct page_pool *pp; unsigned long _pp_mapping_pad; - /** - * @dma_addr: might require a 64-bit value on - * 32-bit architectures. - */ - unsigned long dma_addr[2]; + unsigned long dma_addr; + union { + /** + * dma_addr_upper: might require a 64-bit + * value on 32-bit architectures. + */ + unsigned long dma_addr_upper; + /** + * For frag page support, not supported in + * 32-bit architectures with 64-bit DMA. + */ + atomic_long_t pp_frag_count; + }; }; struct { /* slab, slob and slub */
On Tue, 10 Aug 2021 16:23:52 +0200 Jesper Dangaard Brouer wrote:
Sorry, it was on the list for days and there were 7 or so prior versions, I thought it was ripe. If possible, a note that review will come would be useful.
And I would have appreciated an reviewed-by credit to/from Alexander as he did a lot of work in the RFC patchset for the split-page tricks.
I asked him off-list, he said something I interpreted as "code is okay, but the review tag is not coming".
On Tue, Aug 10, 2021 at 7:43 AM Jakub Kicinski kuba@kernel.org wrote:
Yeah, I ran out of feedback a revision or two ago and just haven't had a chance to go through and add my reviewed by. If you want feel free to add my reviewed by for the set.
Reviewed-by: Alexander Duyck alexanderduyck@fb.com
On 2021/8/10 23:09, Alexander Duyck wrote:
Yeah, the credit goes to Ilias, Matteo, Matthew too, the patchset from them paves the path for supporting the skb frag page recycling.
Yeah, thanks for the time and patient for reviewing this patchset.
By the way, I am still trying to implement the tx recycling mentioned in the other thread, which seems more controversial than rx recycling as tx recycling may touch the tcp/ip and socket layer. So it would be good have your opinion about that idea or implemention too:)