Patch 1: support non-split page when PP_FLAG_PAGE_FRAG is set.
Patch 2: avoid calling compound_head() for skb frag page.
Patch 3: keep track of pp page when __skb_frag_ref() is called.
V2: add patch 2, adjust the commit log according to the discussion in V1, and fix a compiler error reported by the kernel test robot.
Yunsheng Lin (3):
  page_pool: support non-split page with PP_FLAG_PAGE_FRAG
  page_pool: avoid calling compound_head() for skb frag page
  skbuff: keep track of pp page when __skb_frag_ref() is called
 include/linux/skbuff.h  | 40 ++++++++++++++++++++++++++++++++++++----
 include/net/page_pool.h | 34 +++++++++++++++++++++++++++++++++-
 net/core/page_pool.c    | 30 +++++++++---------------------
 3 files changed, 78 insertions(+), 26 deletions(-)
Currently when PP_FLAG_PAGE_FRAG is set, the caller is not expected to call page_pool_alloc_pages() directly, because of the PP_FLAG_PAGE_FRAG check in __page_pool_put_page().

This patch removes the above check to enable non-split page support when PP_FLAG_PAGE_FRAG is set.
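As a usage illustration, here is a minimal sketch only (the rx_alloc_buffer() helper, the PAGE_SIZE/2 threshold and the GFP flags are assumptions, not part of this patch) of how a driver could then serve both cases from one PP_FLAG_PAGE_FRAG pool:

/* Hypothetical driver helper: with this patch, a pool created with
 * PP_FLAG_PAGE_FRAG may hand out whole (non-split) pages as well,
 * since page_pool_alloc_pages() now initializes the frag count to 1.
 */
static struct page *rx_alloc_buffer(struct page_pool *pool,
                                    unsigned int size,
                                    unsigned int *offset)
{
        if (size > PAGE_SIZE / 2) {
                /* Large buffer: take a whole, non-split page. */
                *offset = 0;
                return page_pool_alloc_pages(pool, GFP_ATOMIC);
        }

        /* Small buffer: split a page into frags as before. */
        return page_pool_alloc_frag(pool, offset, size, GFP_ATOMIC);
}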
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/net/page_pool.h |  6 ++++++
 net/core/page_pool.c    | 12 +++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index a4082406a003..2ad0706566c5 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -238,6 +238,9 @@ static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 
 static inline void page_pool_set_frag_count(struct page *page, long nr)
 {
+	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+		return;
+
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
@@ -246,6 +249,9 @@ static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 {
 	long ret;
 
+	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+		return 0;
+
 	/* As suggested by Alexander, atomic_long_read() may cover up the
 	 * reference count errors, so avoid calling atomic_long_read() in
 	 * the cases of freeing or draining the page_frags, where we would
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1a6978427d6c..ba9f14deacf4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -313,11 +313,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 
 	/* Fast-path: Get a page from cache */
 	page = __page_pool_get_cached(pool);
-	if (page)
-		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (!page)
+		page = __page_pool_alloc_pages_slow(pool, gfp);
+
+	if (likely(page))
+		page_pool_set_frag_count(page, 1);
+
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -426,8 +429,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 		     unsigned int dma_sync_size, bool allow_direct)
 {
 	/* It is not the last user for the page frag case */
-	if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
-	    page_pool_atomic_sub_frag_count_return(page, 1))
+	if (page_pool_atomic_sub_frag_count_return(page, 1))
 		return NULL;
/* This allocator is optimized for the XDP mode that uses
As the page in a skb frag is always a head page, make sure skb_pp_recycle() passes a head page too, so that the compound_head() call can be dropped for the skb frag page case.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/skbuff.h | 2 +-
 net/core/page_pool.c   | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6bdb0db3e825..35eebc2310a5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4722,7 +4722,7 @@ static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
 {
 	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
 		return false;
-	return page_pool_return_skb_page(virt_to_page(data));
+	return page_pool_return_skb_page(virt_to_head_page(data));
 }
 
 #endif /* __KERNEL__ */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ba9f14deacf4..09d7b8614ef5 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -740,8 +740,6 @@ bool page_pool_return_skb_page(struct page *page)
 {
 	struct page_pool *pp;
 
-	page = compound_head(page);
-
 	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
 	 * in order to preserve any existing bits, such as bit 0 for the
 	 * head page of compound page and bit 1 for pfmemalloc page, so
skb->pp_recycle and page->pp_magic may not be enough to track whether a frag page comes from a page pool after __skb_frag_ref() has been called, mostly because of a data race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets").
There is also the clone-and-expand-head case, which may lose track of whether a frag page is from a page pool or not.
And not being able to keep track of a pp page may cause problems for the skb_split() case in tso_fragment() too. Suppose a skb has 3 frag pages, all coming from a page pool, and it is split into skb1 and skb2:
skb1: first frag page + first half of second frag page
skb2: second half of second frag page + third frag page
How do we set the skb->pp_recycle of skb1 and skb2?
1. If we set both of them to 1, then we may have a similar race to the one in the above commit for the second frag page.
2. If we set only one of them to 1, then we may have a resource leak, as both the first frag page and the third frag page are indeed from a page pool.
So increment the frag count when __skb_frag_ref() is called, and only use page->pp_magic to indicate if a frag page is from page pool, to avoid the above data race.
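A simplified model of the resulting accounting (a sketch only, not the kernel code; the helpers below stand in for page_pool_atomic_inc_frag_count() and page_pool_atomic_sub_frag_count_return()):

/* A pp page starts with pp_frag_count == 1 at allocation time. */
static void frag_ref(struct page *page)
{
        /* __skb_frag_ref() on a pp page: no get_page(), just count. */
        atomic_long_inc(&page->pp_frag_count);
}

static bool frag_unref(struct page *page)
{
        /* True when this was the last user and the page may recycle. */
        return atomic_long_dec_return(&page->pp_frag_count) == 0;
}

In the skb_split() example above, the shared second frag page then sits at pp_frag_count == 2, and whichever of skb1/skb2 is freed last returns it to the pool, regardless of how skb->pp_recycle is set.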
For 32 bit systems with 64 bit dma, we preserve the original behavior, as the frag count field is used to store the dma address there and so cannot track how many times __skb_frag_ref() is called on a frag page.

We still use both skb->pp_recycle and page->pp_magic to decide whether the head page of a skb is from a page pool or not.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/skbuff.h  | 40 ++++++++++++++++++++++++++++++++++++----
 include/net/page_pool.h | 28 +++++++++++++++++++++++++++-
 net/core/page_pool.c    | 16 ++--------------
 3 files changed, 65 insertions(+), 19 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 35eebc2310a5..4d975ab27078 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3073,7 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	struct page *page = skb_frag_page(frag);
+
+#ifdef CONFIG_PAGE_POOL
+	if (page_pool_is_pp_page(page)) {
+		page_pool_atomic_inc_frag_count(page);
+		return;
+	}
+#endif
+
+	get_page(page);
 }
 
 /**
@@ -3088,6 +3097,22 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
 	__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
 }
 
+/**
+ * skb_frag_is_pp_page - decide if a page is recyclable.
+ * @page: frag page
+ * @recycle: skb->pp_recycle
+ *
+ * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
+ * also used to decide if a page can be recycled to the page
+ * pool.
+ */
+static inline bool skb_frag_is_pp_page(struct page *page,
+				       bool recycle)
+{
+	return page_pool_is_pp_page(page) ||
+	       (recycle && __page_pool_is_pp_page(page));
+}
+
 /**
  * __skb_frag_unref - release a reference on a paged fragment.
  * @frag: the paged fragment
@@ -3101,8 +3126,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	struct page *page = skb_frag_page(frag);
 
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page))
+	if (skb_frag_is_pp_page(page, recycle)) {
+		page_pool_return_skb_page(page);
 		return;
+	}
 #endif
 	put_page(page);
 }
@@ -4720,9 +4747,14 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
 
 static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
 {
-	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
+	struct page *page = virt_to_head_page(data);
+
+	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
+	    !__page_pool_is_pp_page(page))
 		return false;
-	return page_pool_return_skb_page(virt_to_head_page(data));
+
+	page_pool_return_skb_page(page);
+	return true;
 }
 
 #endif /* __KERNEL__ */
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2ad0706566c5..eb103d86f453 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
 	return pool->p.dma_dir;
 }
 
-bool page_pool_return_skb_page(struct page *page);
+void page_pool_return_skb_page(struct page *page);
 
 struct page_pool *page_pool_create(const struct page_pool_params *params);
 
@@ -244,6 +244,32 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
+static inline void page_pool_atomic_inc_frag_count(struct page *page)
+{
+	atomic_long_inc(&page->pp_frag_count);
+}
+
+static inline bool __page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* For systems with the same dma addr as the bus addr, we can use
+	 * page->pp_magic to indicate a pp page uniquely.
+	 */
+	return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+		__page_pool_is_pp_page(page);
+}
+
 static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 							  long nr)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 09d7b8614ef5..3a419871d4bc 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,7 +24,7 @@
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
-#define BIAS_MAX	LONG_MAX
+#define BIAS_MAX	(LONG_MAX / 2)
 
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
@@ -736,20 +736,10 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);
 
-bool page_pool_return_skb_page(struct page *page)
+void page_pool_return_skb_page(struct page *page)
 {
 	struct page_pool *pp;
 
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
-
 	pp = page->pp;
 
 	/* Driver set this to memory recycling info. Reset it on recycle.
@@ -758,7 +748,5 @@ bool page_pool_return_skb_page(struct page *page)
 	 * 'flipped' fragment being in use or not.
 	 */
 	page_pool_put_full_page(pp, page, false);
-
-	return true;
 }
 EXPORT_SYMBOL(page_pool_return_skb_page);
On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin linyunsheng@huawei.com wrote:
skb->pp_recycle and page->pp_magic may not be enough to track whether a frag page comes from a page pool after __skb_frag_ref() has been called, mostly because of a data race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets").
I'm not sure how this comment actually applies. It is an issue that was fixed. If anything my concern is that this change will introduce new races instead of fixing any existing ones.
There is also the clone-and-expand-head case, which may lose track of whether a frag page is from a page pool or not.
Can you explain how? If there is such a case we should fix it instead of trying to introduce new features to address it. This seems more like a performance optimization rather than a fix.
And not being able to keep track of a pp page may cause problems for the skb_split() case in tso_fragment() too. Suppose a skb has 3 frag pages, all coming from a page pool, and it is split into skb1 and skb2:
skb1: first frag page + first half of second frag page
skb2: second half of second frag page + third frag page
How do we set the skb->pp_recycle of skb1 and skb2?
- If we set both of them to 1, then we may have a similar race as the above commit for second frag page.
- If we set only one of them to 1, then we may have resource leaking problem as both first frag page and third frag page are indeed from page pool.
The question I would have is in the above cases how is skb->pp_recycle being set on the second buffer? Is it already coming that way? If so, maybe you should special case the __skb_frag_ref when you know you are loading a recycling skb instead of just assuming you want to do it automatically.
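One way to read that suggestion (purely illustrative; this signature change is not part of the patch) is to make the pp-aware reference opt-in at call sites that know they are copying frags from a recycling skb:

static inline void __skb_frag_ref(skb_frag_t *frag, bool recycle)
{
        struct page *page = skb_frag_page(frag);

#ifdef CONFIG_PAGE_POOL
        /* Take the pp frag count only when the caller says the
         * source skb is a recycling one.
         */
        if (recycle && page_pool_is_pp_page(page)) {
                page_pool_atomic_inc_frag_count(page);
                return;
        }
#endif
        get_page(page);
}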
So increment the frag count when __skb_frag_ref() is called, and only use page->pp_magic to indicate if a frag page is from page pool, to avoid the above data race.
This assumes the page is only going to be used by the network stack. My concern is what happens when the page is pulled out of the skb and used for example in storage?
For 32 bit systems with 64 bit dma, we preserve the original behavior, as the frag count field is used to store the dma address there and so cannot track how many times __skb_frag_ref() is called on a frag page.

We still use both skb->pp_recycle and page->pp_magic to decide whether the head page of a skb is from a page pool or not.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
 include/linux/skbuff.h  | 40 ++++++++++++++++++++++++++++++++++++----
 include/net/page_pool.h | 28 +++++++++++++++++++++++++++-
 net/core/page_pool.c    | 16 ++--------------
 3 files changed, 65 insertions(+), 19 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 35eebc2310a5..4d975ab27078 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3073,7 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	struct page *page = skb_frag_page(frag);
+
+#ifdef CONFIG_PAGE_POOL
+	if (page_pool_is_pp_page(page)) {
+		page_pool_atomic_inc_frag_count(page);
+		return;
+	}
+#endif
+
+	get_page(page);
 }
 
 /**
@@ -3088,6 +3097,22 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
 	__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
 }
+/**
+ * skb_frag_is_pp_page - decide if a page is recyclable.
+ * @page: frag page
+ * @recycle: skb->pp_recycle
+ *
+ * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
+ * also used to decide if a page can be recycled to the page
+ * pool.
+ */
+static inline bool skb_frag_is_pp_page(struct page *page,
+				       bool recycle)
+{
+	return page_pool_is_pp_page(page) ||
+	       (recycle && __page_pool_is_pp_page(page));
+}
The logic for this check is ugly. You are essentially calling __page_pool_is_pp_page again if it fails the first check. It would probably make more sense to rearrange things and just call (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With that the check of recycle could be dropped entirely if frag count is valid to use, and in the non-fragcount case it reverts back to the original check.
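A sketch of that suggested rearrangement (names taken from the patch under review; untested):

static inline bool skb_frag_is_pp_page(struct page *page, bool recycle)
{
        /* When the frag count is usable, recycle no longer matters;
         * in the non-fragcount case this reverts to the original
         * recycle && signature check.
         */
        return (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT || recycle) &&
               __page_pool_is_pp_page(page);
}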
 /**
  * __skb_frag_unref - release a reference on a paged fragment.
  * @frag: the paged fragment
@@ -3101,8 +3126,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	struct page *page = skb_frag_page(frag);
 
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page))
+	if (skb_frag_is_pp_page(page, recycle)) {
+		page_pool_return_skb_page(page);
 		return;
+	}
 #endif
 	put_page(page);
 }
@@ -4720,9 +4747,14 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
 static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
 {
-	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
+	struct page *page = virt_to_head_page(data);
+
+	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
+	    !__page_pool_is_pp_page(page))
 		return false;
-	return page_pool_return_skb_page(virt_to_head_page(data));
+
+	page_pool_return_skb_page(page);
+	return true;
 }
 
 #endif /* __KERNEL__ */
As I recall the virt_to_head_page isn't necessarily a cheap call as it can lead to quite a bit of pointer chasing and a few extra math steps to do the virt to page conversion. I would be curious how much extra overhead this is adding to the non-fragcount or non-recycling case.
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2ad0706566c5..eb103d86f453 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
 	return pool->p.dma_dir;
 }
 
-bool page_pool_return_skb_page(struct page *page);
+void page_pool_return_skb_page(struct page *page);
 
 struct page_pool *page_pool_create(const struct page_pool_params *params);
 
@@ -244,6 +244,32 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
+static inline void page_pool_atomic_inc_frag_count(struct page *page)
+{
+	atomic_long_inc(&page->pp_frag_count);
+}
Your function name is almost as long as the function itself. Maybe you don't need it?
+static inline bool __page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* For systems with the same dma addr as the bus addr, we can use
+	 * page->pp_magic to indicate a pp page uniquely.
+	 */
+	return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+		__page_pool_is_pp_page(page);
+}
We should really change the name of the #define. I keep reading it as we are using the PP_FRAG_COUNT, not that it is already in use. Maybe we should look at something like PP_FRAG_COUNT_VALID and just invert the logic for it.
Also this function naming is really confusing. You don't have to have the frag count to be a page pool page. Maybe this should be something like page_pool_is_pp_frag_page.
 static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 							  long nr)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 09d7b8614ef5..3a419871d4bc 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,7 +24,7 @@
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
-#define BIAS_MAX	LONG_MAX
+#define BIAS_MAX	(LONG_MAX / 2)
 
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
I still think this would be better as a separate patch calling out the fact that you are changing the value with the plan to support incrementing it in the future.
@@ -736,20 +736,10 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);
 
-bool page_pool_return_skb_page(struct page *page)
+void page_pool_return_skb_page(struct page *page)
 {
 	struct page_pool *pp;
 
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
-
 	pp = page->pp;
 
 	/* Driver set this to memory recycling info. Reset it on recycle.
@@ -758,7 +748,5 @@ bool page_pool_return_skb_page(struct page *page)
 	 * 'flipped' fragment being in use or not.
 	 */
 	page_pool_put_full_page(pp, page, false);
-
-	return true;
 }
 EXPORT_SYMBOL(page_pool_return_skb_page);
--
2.33.0
On 2021/9/15 8:59, Alexander Duyck wrote:
On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin linyunsheng@huawei.com wrote:
skb->pp_recycle and page->pp_magic may not be enough to track whether a frag page comes from a page pool after __skb_frag_ref() has been called, mostly because of a data race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets").
I'm not sure how this comment actually applies. It is an issue that was fixed. If anything my concern is that this change will introduce new races instead of fixing any existing ones.
My initial thinking about adding the above comment is to emphasize that we might clear cloned skb's pp_recycle when doing head expanding, and page pool might need to give up on that page if that cloned skb is the last one to be freed.
There is also the clone-and-expand-head case, which may lose track of whether a frag page is from a page pool or not.
Can you explain how? If there is such a case we should fix it instead of trying to introduce new features to address it. This seems more like a performance optimization rather than a fix.
Yes, I consider it an optimization too, that's why I am targeting net-next.

Even for the below skb_split() case in tso_fragment(), I am not sure how an rx pp page can go through the tcp stack yet.
And not being able to keep track of a pp page may cause problems for the skb_split() case in tso_fragment() too. Suppose a skb has 3 frag pages, all coming from a page pool, and it is split into skb1 and skb2:
skb1: first frag page + first half of second frag page
skb2: second half of second frag page + third frag page
How do we set the skb->pp_recycle of skb1 and skb2?
- If we set both of them to 1, then we may have a similar race as the above commit for second frag page.
- If we set only one of them to 1, then we may have resource leaking problem as both first frag page and third frag page are indeed from page pool.
The question I would have is in the above cases how is skb->pp_recycle being set on the second buffer? Is it already coming that way? If so,
In the skb_split() implementation, it takes skb and skb1 as input: skb has pp_recycle set, and skb1 is newly allocated without pp_recycle set. After skb_split(), some payload of skb has been split off to skb1, so how to set the pp_recycle of skb1 seems tricky here.
maybe you should special case the __skb_frag_ref when you know you are loading a recycling skb instead of just assuming you want to do it automatically.
I am not sure what does "special case" and "automatically" means here. One way I can think of is to avoid doing the __skb_frag_ref, and allocate a new frag for skb1 and do memcpying if there is a sharing frag between skb and skb1 when splitting. But it seems the allocating a new frag and memcpying seems much heavier doing the __skb_frag_ref.
So increment the frag count when __skb_frag_ref() is called, and only use page->pp_magic to indicate if a frag page is from page pool, to avoid the above data race.
This assumes the page is only going to be used by the network stack. My concern is what happens when the page is pulled out of the skb and used for example in storage?
As I understand it, the above case works as before: if the storage is holding a reference to that page, the page pool will give up on that page, by checking "page_ref_count(page) == 1" when the last user from the network stack is done with the pp page (determined by checking page->pp_frag_count).
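Roughly, the "give up" path being referred to looks like this (a simplified sketch loosely modelled on __page_pool_put_page(), not a verbatim copy):

static struct page *pp_try_recycle(struct page_pool *pool,
                                   struct page *page)
{
        /* Not the last network-stack user of this frag page yet. */
        if (page_pool_atomic_sub_frag_count_return(page, 1))
                return NULL;

        if (page_ref_count(page) != 1 || page_is_pfmemalloc(page)) {
                /* Someone else (e.g. storage) still holds a reference,
                 * so the pool gives up on this page.
                 */
                page_pool_release_page(pool, page);
                put_page(page);
                return NULL;
        }

        return page;    /* sole owner: safe to recycle */
}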
For 32 bit systems with 64 bit dma, we preserve the original behavior, as the frag count field is used to store the dma address there and so cannot track how many times __skb_frag_ref() is called on a frag page.

We still use both skb->pp_recycle and page->pp_magic to decide whether the head page of a skb is from a page pool or not.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
 include/linux/skbuff.h  | 40 ++++++++++++++++++++++++++++++++++++----
 include/net/page_pool.h | 28 +++++++++++++++++++++++++++-
 net/core/page_pool.c    | 16 ++--------------
 3 files changed, 65 insertions(+), 19 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 35eebc2310a5..4d975ab27078 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3073,7 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	struct page *page = skb_frag_page(frag);
+
+#ifdef CONFIG_PAGE_POOL
+	if (page_pool_is_pp_page(page)) {
+		page_pool_atomic_inc_frag_count(page);
+		return;
+	}
+#endif
+
+	get_page(page);
 }
 
 /**
@@ -3088,6 +3097,22 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f)
 	__skb_frag_ref(&skb_shinfo(skb)->frags[f]);
 }
+/**
+ * skb_frag_is_pp_page - decide if a page is recyclable.
+ * @page: frag page
+ * @recycle: skb->pp_recycle
+ *
+ * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
+ * also used to decide if a page can be recycled to the page
+ * pool.
+ */
+static inline bool skb_frag_is_pp_page(struct page *page,
+				       bool recycle)
+{
+	return page_pool_is_pp_page(page) ||
+	       (recycle && __page_pool_is_pp_page(page));
+}
The logic for this check is ugly. You are essentially calling __page_pool_is_pp_page again if it fails the first check. It would probably make more sense to rearrange things and just call (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With that the check of recycle could be dropped entirely if frag count is valid to use, and in the non-fragcount case it reverts back to the original check.
The reason I did not do that is I kind of did not want to use the DMA_USE_PP_FRAG_COUNT outside of page pool. I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above is considered harmless:)
The 32 bit systems with 64 bit dma really seem a burden here; as mentioned by Ilias [1], there seems to be no such system using page pool, so we might as well consider disabling page pool for such systems?
Ilias, Jesper, what do you think?
1. http://lkml.iu.edu/hypermail/linux/kernel/2107.1/06321.html
 /**
  * __skb_frag_unref - release a reference on a paged fragment.
  * @frag: the paged fragment
@@ -3101,8 +3126,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	struct page *page = skb_frag_page(frag);
 
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page))
+	if (skb_frag_is_pp_page(page, recycle)) {
+		page_pool_return_skb_page(page);
 		return;
+	}
 #endif
 	put_page(page);
 }
@@ -4720,9 +4747,14 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
 static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
 {
-	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
+	struct page *page = virt_to_head_page(data);
+
+	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
+	    !__page_pool_is_pp_page(page))
 		return false;
-	return page_pool_return_skb_page(virt_to_head_page(data));
+
+	page_pool_return_skb_page(page);
+	return true;
 }
 
 #endif /* __KERNEL__ */
As I recall the virt_to_head_page isn't necessarily a cheap call as it can lead to quite a bit of pointer chasing and a few extra math steps to do the virt to page conversion. I would be curious how much extra overhead this is adding to the non-fragcount or non-recycling case.
As the page pool can only deal with a head_page (the "page_ref_count(page) == 1" check in __page_pool_put_page() seems to require a head_page), and skb_free_frag() in skb_free_head() seems to operate on the head_page anyway, I assume the overhead is minimal?
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2ad0706566c5..eb103d86f453 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
 	return pool->p.dma_dir;
 }
 
-bool page_pool_return_skb_page(struct page *page);
+void page_pool_return_skb_page(struct page *page);
 
 struct page_pool *page_pool_create(const struct page_pool_params *params);
 
@@ -244,6 +244,32 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
+static inline void page_pool_atomic_inc_frag_count(struct page *page)
+{
+	atomic_long_inc(&page->pp_frag_count);
+}
Your function name is almost as long as the function itself. Maybe you don't need it?
It is about avoiding exposing the pp_frag_count outside of the page pool. It is not needed if the above is not an issue:)
+static inline bool __page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* For systems with the same dma addr as the bus addr, we can use
+	 * page->pp_magic to indicate a pp page uniquely.
+	 */
+	return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+		__page_pool_is_pp_page(page);
+}
We should really change the name of the #define. I keep reading it as we are using the PP_FRAG_COUNT, not that it is already in use. Maybe we should look at something like PP_FRAG_COUNT_VALID and just invert the logic for it.
Yes, Jesper seems to have a similar confusion. It seems better to remove that macro completely if 32 bit systems with 64 bit dma turn out not to exist at all?
Also this function naming is really confusing. You don't have to have the frag count to be a page pool page. Maybe this should be something like page_pool_is_pp_frag_page.
I am not sure what the extra *frag* means here.
 static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
 							  long nr)
 {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 09d7b8614ef5..3a419871d4bc 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,7 +24,7 @@
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
-#define BIAS_MAX	LONG_MAX
+#define BIAS_MAX	(LONG_MAX / 2)
 
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
I still think this would be better as a separate patch calling out the fact that you are changing the value with the plan to support incrementing it in the future.
Sure.
@@ -736,20 +736,10 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);
 
-bool page_pool_return_skb_page(struct page *page)
+void page_pool_return_skb_page(struct page *page)
 {
 	struct page_pool *pp;
 
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
-
 	pp = page->pp;
 
 	/* Driver set this to memory recycling info. Reset it on recycle.
@@ -758,7 +748,5 @@ bool page_pool_return_skb_page(struct page *page)
 	 * 'flipped' fragment being in use or not.
 	 */
 	page_pool_put_full_page(pp, page, false);
-
-	return true;
 }
 EXPORT_SYMBOL(page_pool_return_skb_page);
--
2.33.0
Hi Yunsheng, Alexander,
On Wed, Sep 15, 2021 at 05:07:08PM +0800, Yunsheng Lin wrote:
On 2021/9/15 8:59, Alexander Duyck wrote:
On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin linyunsheng@huawei.com wrote:
skb->pp_recycle and page->pp_magic may not be enough to track whether a frag page comes from a page pool after __skb_frag_ref() has been called, mostly because of a data race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets").
I'm not sure how this comment actually applies. It is an issue that was fixed. If anything my concern is that this change will introduce new races instead of fixing any existing ones.
My initial thinking about adding the above comment is to emphasize that we might clear cloned skb's pp_recycle when doing head expanding, and page pool might need to give up on that page if that cloned skb is the last one to be freed.
There is also the clone-and-expand-head case, which may lose track of whether a frag page is from a page pool or not.
Can you explain how? If there is such a case we should fix it instead of trying to introduce new features to address it. This seems more like a performance optimization rather than a fix.
Yes, I consider it an optimization too, that's why I am targeting net-next.

Even for the below skb_split() case in tso_fragment(), I am not sure how an rx pp page can go through the tcp stack yet.

I am a bit confused :). We don't have that problem *now*, right? This will appear if we try to pull in your patches on using page pool and recycling for Tx, where TSO and skb_split are used?

I'll be honest, when I came up with the recycling idea for page pool, I never intended to support Tx. I agree with Alexander here: if people want to use it on Tx and think there's value, we might need to go back to the drawing board and see what I've missed. It's still early and there's a handful of drivers using it, so it will be less painful now.

The pp_recycle_bit was introduced to make the checking faster, instead of getting stuff into cache and checking the page signature. If that ends up being counterproductive, we could just replace the entire logic with the frag count and the page signature, couldn't we? In that case we should be very cautious and measure potential regression on the standard path.
But in general, I'd be happier if we only had a simple logic in our testing for the pages we have to recycle. Debugging and understanding this otherwise will end up being a mess.
And not being able to keep track of a pp page may cause problems for the skb_split() case in tso_fragment() too. Suppose a skb has 3 frag pages, all coming from a page pool, and it is split into skb1 and skb2:
skb1: first frag page + first half of second frag page
skb2: second half of second frag page + third frag page
How do we set the skb->pp_recycle of skb1 and skb2?
- If we set both of them to 1, then we may have a similar race as the above commit for second frag page.
- If we set only one of them to 1, then we may have resource leaking problem as both first frag page and third frag page are indeed from page pool.
The question I would have is in the above cases how is skb->pp_recycle being set on the second buffer? Is it already coming that way? If so,
In the skb_split() implementation, it takes skb and skb1 as input: skb has pp_recycle set, and skb1 is newly allocated without pp_recycle set. After skb_split(), some payload of skb has been split off to skb1, so how to set the pp_recycle of skb1 seems tricky here.
maybe you should special case the __skb_frag_ref when you know you are loading a recycling skb instead of just assuming you want to do it automatically.
I am not sure what does "special case" and "automatically" means here.
The 'special case' is an skb_split() where one or more of the frags ends up being split, right?

One way I can think of is to avoid doing the __skb_frag_ref(), and instead allocate a new frag for skb1 and memcpy the data when there is a frag shared between skb and skb1 during the split. But allocating a new frag and memcpying seem much heavier than doing the __skb_frag_ref().
We could also accept we'd only recycle some of the frags. If the frag is 'split', we could just unmap it and let the networking stack free it eventually. But that would introduce some kind of dependency between the core networking code and page_pool, which we've tried to avoid so far, but more importantly it will slow down the entire path for drivers that don't use page_pool ...
So increment the frag count when __skb_frag_ref() is called, and only use page->pp_magic to indicate if a frag page is from page pool, to avoid the above data race.
This assumes the page is only going to be used by the network stack. My concern is what happens when the page is pulled out of the skb and used for example in storage?
As I understand it, the above case works as before: if the storage is holding a reference to that page, the page pool will give up on that page, by checking "page_ref_count(page) == 1" when the last user from the network stack is done with the pp page (determined by checking page->pp_frag_count).
For 32 bit systems with 64 bit dma, we preserve the original behavior, as the frag count field is used to store the dma address there and so cannot track how many times __skb_frag_ref() is called on a frag page.

We still use both skb->pp_recycle and page->pp_magic to decide whether the head page of a skb is from a page pool or not.
[...]
+/**
+ * skb_frag_is_pp_page - decide if a page is recyclable.
+ * @page: frag page
+ * @recycle: skb->pp_recycle
+ *
+ * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
+ * also used to decide if a page can be recycled to the page
+ * pool.
+ */
+static inline bool skb_frag_is_pp_page(struct page *page,
+				       bool recycle)
+{
+	return page_pool_is_pp_page(page) ||
+	       (recycle && __page_pool_is_pp_page(page));
+}
The logic for this check is ugly. You are essentially calling __page_pool_is_pp_page again if it fails the first check. It would probably make more sense to rearrange things and just call (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With that the check of recycle could be dropped entirely if frag count is valid to use, and in the non-fragcount case it reverts back to the original check.
The reason I did not do that is I kind of did not want to use the DMA_USE_PP_FRAG_COUNT outside of page pool. I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above is considered harmless:)
The 32 bit systems with 64 bit dma really seem a burden here; as mentioned by Ilias [1], there seems to be no such system using page pool, so we might as well consider disabling page pool for such systems?
Ilias, Jesper, what do you think?
Well I can't really disagree with myself too much :). I still think we are carrying a lot of code and complexity for systems that don't exist.
 /**
  * __skb_frag_unref - release a reference on a paged fragment.
  * @frag: the paged fragment
@@ -3101,8 +3126,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	struct page *page = skb_frag_page(frag);
 
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && page_pool_return_skb_page(page))
+	if (skb_frag_is_pp_page(page, recycle)) {
+		page_pool_return_skb_page(page);
 		return;
+	}
 #endif
 	put_page(page);
 }
@@ -4720,9 +4747,14 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
 static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
 {
-	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
+	struct page *page = virt_to_head_page(data);
+
+	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle ||
+	    !__page_pool_is_pp_page(page))
 		return false;
-	return page_pool_return_skb_page(virt_to_head_page(data));
+
+	page_pool_return_skb_page(page);
+	return true;
 }
 
 #endif /* __KERNEL__ */
As I recall the virt_to_head_page isn't necessarily a cheap call as it can lead to quite a bit of pointer chasing and a few extra math steps to do the virt to page conversion. I would be curious how much extra overhead this is adding to the non-fragcount or non-recycling case.
As the page pool can only deal with a head_page (the "page_ref_count(page) == 1" check in __page_pool_put_page() seems to require a head_page), and skb_free_frag() in skb_free_head() seems to operate on the head_page anyway, I assume the overhead is minimal?
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 2ad0706566c5..eb103d86f453 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool)
 	return pool->p.dma_dir;
 }
 
-bool page_pool_return_skb_page(struct page *page);
+void page_pool_return_skb_page(struct page *page);
 
 struct page_pool *page_pool_create(const struct page_pool_params *params);
 
@@ -244,6 +244,32 @@ static inline void page_pool_set_frag_count(struct page *page, long nr)
 	atomic_long_set(&page->pp_frag_count, nr);
 }
 
+static inline void page_pool_atomic_inc_frag_count(struct page *page)
+{
+	atomic_long_inc(&page->pp_frag_count);
+}
Your function name is almost as long as the function itself. Maybe you don't need it?
It is about avoiding exposing the pp_frag_count outside of the page pool. It is not needed if the above is not an issue:)
+static inline bool __page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* For systems with the same dma addr as the bus addr, we can use
+	 * page->pp_magic to indicate a pp page uniquely.
+	 */
+	return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+		__page_pool_is_pp_page(page);
+}
We should really change the name of the #define. I keep reading it as we are using the PP_FRAG_COUNT, not that it is already in use. Maybe we should look at something like PP_FRAG_COUNT_VALID and just invert the logic for it.
Yes, Jesper seems to have a similar confusion.

+1

It seems better to remove that macro completely if 32 bit systems with 64 bit dma turn out not to exist at all?
Also this function naming is really confusing. You don't have to have the frag count to be a page pool page. Maybe this should be something like page_pool_is_pp_frag_page.
[...]
Regards /Ilias
On 15/09/2021 14.56, Ilias Apalodimas wrote:
Hi Yunsheng, Alexander,
On Wed, Sep 15, 2021 at 05:07:08PM +0800, Yunsheng Lin wrote:
On 2021/9/15 8:59, Alexander Duyck wrote:
On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin linyunsheng@huawei.com wrote:
skb->pp_recycle and page->pp_magic may not be enough to track whether a frag page comes from a page pool after __skb_frag_ref() has been called, mostly because of a data race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets").
I'm not sure how this comment actually applies. It is an issue that was fixed. If anything my concern is that this change will introduce new races instead of fixing any existing ones.
My initial thinking about adding the above comment is to emphasize that we might clear cloned skb's pp_recycle when doing head expanding, and page pool might need to give up on that page if that cloned skb is the last one to be freed.
There is also the clone-and-expand-head case, which may lose track of whether a frag page is from a page pool or not.
Can you explain how? If there is such a case we should fix it instead of trying to introduce new features to address it. This seems more like a performance optimization rather than a fix.
Yes, I consider it an optimization too, that's why I am targeting net-next.

Even for the below skb_split() case in tso_fragment(), I am not sure how an rx pp page can go through the tcp stack yet.

I am a bit confused :). We don't have that problem *now*, right? This will appear if we try to pull in your patches on using page pool and recycling for Tx, where TSO and skb_split are used?

I'll be honest, when I came up with the recycling idea for page pool, I never intended to support Tx. I agree with Alexander here: if people want to use it on Tx and think there's value, we might need to go back to the drawing board and see what I've missed. It's still early and there's a handful of drivers using it, so it will be less painful now.

I agree, page_pool is NOT designed or intended for TX support. E.g. it doesn't make sense to allocate a page_pool instance per socket, as the backing memory structures for page_pool are too much. As the number of RX-queues is more limited, it was deemed okay that we use a page_pool per RX-queue, which sacrifices some memory to gain speed.
The pp_recycle_bit was introduced to make the checking faster, instead of getting stuff into cache and checking the page signature. If that ends up being counterproductive, we could just replace the entire logic with the frag count and the page signature, couldn't we? In that case we should be very cautious and measure potential regression on the standard path.
+1
But in general, I'd be happier if we only had a simple logic in our testing for the pages we have to recycle. Debugging and understanding this otherwise will end up being a mess.
[...]
For 32 bit systems with 64 bit dma, we preserve the original behavior, as the frag count field is used to store the dma address there and so cannot track how many times __skb_frag_ref() is called on a frag page.

We still use both skb->pp_recycle and page->pp_magic to decide whether the head page of a skb is from a page pool or not.
[...]
+/**
+ * skb_frag_is_pp_page - decide if a page is recyclable.
+ * @page: frag page
+ * @recycle: skb->pp_recycle
+ *
+ * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
+ * also used to decide if a page can be recycled to the page
+ * pool.
+ */
+static inline bool skb_frag_is_pp_page(struct page *page,
+				       bool recycle)
+{
+	return page_pool_is_pp_page(page) ||
+	       (recycle && __page_pool_is_pp_page(page));
+}
The logic for this check is ugly. You are essentially calling __page_pool_is_pp_page again if it fails the first check. It would probably make more sense to rearrange things and just call (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With that the check of recycle could be dropped entirely if frag count is valid to use, and in the non-fragcount case it reverts back to the original check.
The reason I did not do that is I kind of did not want to use the DMA_USE_PP_FRAG_COUNT outside of page pool. I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above is considered harmless:)
The 32 bit systems with 64 bit dma really seem a burden here; as mentioned by Ilias [1], there seems to be no such system using page pool, so we might as well consider disabling page pool for such systems?
Ilias, Jesper, what do you think?
Well I can't really disagree with myself too much :). I still think we are carrying a lot of code and complexity for systems that don't exist.
I would be fine with rejecting such systems at page_pool setup time. Meaning that NIC drivers using page_pool for DMA-mapping, getting compiled on 32-bit systems and needing 64-bit DMA-mappings, will have their call to page_pool_create() fail (with something else than -EINVAL please). If drivers really want to work on such systems, they have to implement their own DMA-mapping fallback tracking outside page_pool. Meaning it is only the DMA-mapping tracking part of page_pool that they cannot use.
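Such a rejection could look roughly like the sketch below at pool-init time (illustrative only; -EOPNOTSUPP is just a placeholder for the "something else than -EINVAL"):

static int page_pool_init(struct page_pool *pool,
                          const struct page_pool_params *params)
{
        /* ... existing parameter validation ... */

        /* 32-bit system needing 64-bit DMA mappings: the DMA address
         * would overlap page->pp_frag_count, so refuse to provide
         * DMA-mapping support at all.
         */
        if ((params->flags & PP_FLAG_DMA_MAP) &&
            PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
                return -EOPNOTSUPP;

        /* ... rest of the existing init ... */
        return 0;
}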
[...]
+static inline bool __page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* For systems with the same dma addr as the bus addr, we can use
+	 * page->pp_magic to indicate a pp page uniquely.
+	 */
+	return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+		__page_pool_is_pp_page(page);
+}
We should really change the name of the #define. I keep reading it as we are using the PP_FRAG_COUNT, not that it is already in use. Maybe we should look at something like PP_FRAG_COUNT_VALID and just invert the logic for it.
Yes, Jesper seems to have a similar confusion.

+1

+1

It seems better to remove that macro completely if 32 bit systems with 64 bit dma turn out not to exist at all?
Also this function naming is really confusing. You don't have to have the frag count to be a page pool page. Maybe this should be something like page_pool_is_pp_frag_page.
[...]
Regards /Ilias
--Jesper
On 2021/9/15 23:42, Jesper Dangaard Brouer wrote:
On 15/09/2021 14.56, Ilias Apalodimas wrote:
Hi Yunsheng, Alexander,
On Wed, Sep 15, 2021 at 05:07:08PM +0800, Yunsheng Lin wrote:
On 2021/9/15 8:59, Alexander Duyck wrote:
On Tue, Sep 14, 2021 at 5:12 AM Yunsheng Lin linyunsheng@huawei.com wrote:
skb->pp_recycle and page->pp_magic may not be enough to track whether a frag page comes from a page pool after __skb_frag_ref() has been called, mostly because of a data race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets").
I'm not sure how this comment actually applies. It is an issue that was fixed. If anything my concern is that this change will introduce new races instead of fixing any existing ones.
My initial thinking about adding the above comment is to emphasize that we might clear cloned skb's pp_recycle when doing head expanding, and page pool might need to give up on that page if that cloned skb is the last one to be freed.
There is also the clone-and-expand-head case, which may lose track of whether a frag page is from a page pool or not.
Can you explain how? If there is such a case we should fix it instead of trying to introduce new features to address it. This seems more like a performance optimization rather than a fix.
Yes, I consider it an optimization too, that's why I am targeting net-next.

Even for the below skb_split() case in tso_fragment(), I am not sure how an rx pp page can go through the tcp stack yet.

I am a bit confused :). We don't have that problem *now*, right? This will appear if we try to pull in your patches on using page pool and recycling for Tx, where TSO and skb_split are used?

As I understand it, the problem might exist even without tx recycling, because a skb from the wire could theoretically be passed down to the tcp stack and retransmitted back to the wire. As I am not able to set up a configuration to verify and test it, and the handling seems tricky, I am targeting the net-next branch instead of the net branch.

I'll be honest, when I came up with the recycling idea for page pool, I never intended to support Tx. I agree with Alexander here: if people want to use it on Tx and think there's value, we might need to go back to the drawing board and see what I've missed. It's still early and there's a handful of drivers using it, so it will be less painful now.

Yes, we also need to prototype it to see if there is something missing on the drawing board and how much improvement we get from that:)

I agree, page_pool is NOT designed or intended for TX support. E.g. it doesn't make sense to allocate a page_pool instance per socket, as the backing memory structures for page_pool are too much. As the number of RX-queues is more limited, it was deemed okay that we use a page_pool per RX-queue, which sacrifices some memory to gain speed.

As mentioned before, Tx recycling is based on a page_pool instance per socket; it shares the page_pool instance with rx.
Anyway, based on feedback from edumazet and dsahern, I am still trying to see if the page pool is meaningful for tx.
The pp_recycle_bit was introduced to make the checking faster, instead of getting stuff into cache and checking the page signature. If that ends up being counterproductive, we could just replace the entire logic with the frag count and the page signature, couldn't we? In that case we should be very cautious and measure potential regression on the standard path.

+1

I am not sure "pp_recycle_bit was introduced to make the checking faster" is valid. The size of "struct page" is only about 9 words (36/72 bytes), which mostly fits within one cache line, and both the standard path and the recycle path touch "struct page", so the overhead of checking the signature seems minimal.
I agree that we need to be cautious and measure potential regression on the standard path.
Another way is to use bit 0 of the frag->bv_page ptr to indicate whether a frag page is from a page pool.
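For reference, a rough sketch of that bv_page bit-0 idea (hypothetical helpers, not proposed code; struct page pointers are at least word-aligned, so bit 0 is free to carry a flag):

static inline void skb_frag_mark_pp(skb_frag_t *frag)
{
        frag->bv_page = (struct page *)((unsigned long)frag->bv_page | 1UL);
}

static inline bool skb_frag_from_pp(const skb_frag_t *frag)
{
        return (unsigned long)frag->bv_page & 1UL;
}

static inline struct page *skb_frag_page_masked(const skb_frag_t *frag)
{
        /* Every page accessor would have to mask bit 0 off. */
        return (struct page *)((unsigned long)frag->bv_page & ~1UL);
}

The cost of this scheme is that every skb_frag_page()-style accessor gains a mask operation.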
But in general, I'd be happier if we only had a simple logic in our testing for the pages we have to recycle. Debugging and understanding this otherwise will end up being a mess.
[...]
For 32 bit systems with 64 bit dma, we preserve the original behavior, as the frag count field is used to store the dma address there and so cannot track how many times __skb_frag_ref() is called on a frag page.

We still use both skb->pp_recycle and page->pp_magic to decide whether the head page of a skb is from a page pool or not.
[...]
+/**
+ * skb_frag_is_pp_page - decide if a page is recyclable.
+ * @page: frag page
+ * @recycle: skb->pp_recycle
+ *
+ * For 32 bit systems with 64 bit dma, the skb->pp_recycle is
+ * also used to decide if a page can be recycled to the page
+ * pool.
+ */
+static inline bool skb_frag_is_pp_page(struct page *page,
+				       bool recycle)
+{
+	return page_pool_is_pp_page(page) ||
+	       (recycle && __page_pool_is_pp_page(page));
+}
The logic for this check is ugly. You are essentially calling __page_pool_is_pp_page again if it fails the first check. It would probably make more sense to rearrange things and just call (!DMA_USE_PP_FRAG_COUNT || recycle) && __page_pool_is_pp_page(). With that the check of recycle could be dropped entirely if frag count is valid to use, and in the non-fragcount case it reverts back to the original check.
The reason I did not do that is I kind of did not want to use the DMA_USE_PP_FRAG_COUNT outside of page pool. I can use DMA_USE_PP_FRAG_COUNT directly in skbuff.h if the above is considered harmless:)
The 32 bit systems with 64 bit dma really seem a burden here; as mentioned by Ilias [1], there seems to be no such system using page pool, so we might as well consider disabling page pool for such systems?
Ilias, Jesper, what do you think?
Well I can't really disagree with myself too much :). I still think we are carrying a lot of code and complexity for systems that don't exist.
I would be fine with rejecting such systems at page_pool setup time. Meaning that NIC drivers using page_pool for DMA-mapping, getting compiled on 32-bit systems and needing 64-bit DMA-mappings, will have their call to page_pool_create() fail (with something else than -EINVAL please). If drivers really want to work on such systems, they have to implement their own DMA-mapping fallback tracking outside page_pool. Meaning it is only the DMA-mapping tracking part of page_pool that they cannot use.

Ok, will send out a patch to reject DMA-mapping support for such systems.
[...]
+static inline bool __page_pool_is_pp_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	/* For systems with the same dma addr as the bus addr, we can use
+	 * page->pp_magic to indicate a pp page uniquely.
+	 */
+	return !PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
+		__page_pool_is_pp_page(page);
+}
We should really change the name of the #define. I keep reading it as we are using the PP_FRAG_COUNT, not that it is already in use. Maybe we should look at something like PP_FRAG_COUNT_VALID and just invert the logic for it.
Yes, Jesper seems to have a similar confusion.

+1

+1

It seems better to remove that macro completely if 32 bit systems with 64 bit dma turn out not to exist at all?
Also this function naming is really confusing. You don't have to have the frag count to be a page pool page. Maybe this should be something like page_pool_is_pp_frag_page.
[...]
Regards /Ilias
--Jesper
[...]

appear if we try to pull in your patches on using page pool and recycling for Tx where TSO and skb_split are used?
As I understand it, the problem might exist even without tx recycling, because a skb from the wire could theoretically be passed down to the tcp stack and retransmitted back to the wire. As I am not able to set up a configuration to verify and test it, and the handling seems tricky, I am targeting the net-next branch instead of the net branch.

I'll be honest, when I came up with the recycling idea for page pool, I never intended to support Tx. I agree with Alexander here: if people want to use it on Tx and think there's value, we might need to go back to the drawing board and see what I've missed. It's still early and there's a handful of drivers using it, so it will be less painful now.

Yes, we also need to prototype it to see if there is something missing on the drawing board and how much improvement we get from that:)

I agree, page_pool is NOT designed or intended for TX support. E.g. it doesn't make sense to allocate a page_pool instance per socket, as the backing memory structures for page_pool are too much. As the number of RX-queues is more limited, it was deemed okay that we use a page_pool per RX-queue, which sacrifices some memory to gain speed.

As mentioned before, Tx recycling is based on a page_pool instance per socket; it shares the page_pool instance with rx.
Anyway, based on feedback from edumazet and dsahern, I am still trying to see if the page pool is meaningful for tx.
The pp_recycle_bit was introduced to make the checking faster, instead of getting stuff into cache and checking the page signature. If that ends up being counterproductive, we could just replace the entire logic with the frag count and the page signature, couldn't we? In that case we should be very cautious and measure potential regression on the standard path.
+1
I am not sure "pp_recycle_bit was introduced to make the checking faster" is a valid. The size of "struct page" is only about 9 words(36/72 bytes), which is mostly to be in the same cache line, and both standard path and recycle path have been touching the "struct page", so it seems the overhead for checking signature seems minimal.
I agree that we need to be cautious and measure potential regression on the standard path.
well, pp_recycle is on the same cache line boundary as the head_frag we need to decide on recycling. After that we start checking page signatures etc, which means the default release path remains mostly unaffected.
I guess what you are saying here, is that 'struct page' is going to be accessed eventually by the default network path, so there won't be any noticeable performance hit? What about the other usecases we have for pp_recycle right now? __skb_frag_unref() in skb_shift() or skb_try_coalesce() (the latter can probably be removed tbh).
Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag page is from page pool.
Instead of the 'struct page' signature? And the pp_recycle bit will continue to exist?
Right now the 'naive' explanation on the recycling decision is something like:
if (pp_recycle)                 <--- recycling bit is set
    (check page signature)      <--- signature matches page pool
        (check fragment refcnt) <--- if frags are enabled and this is the last consumer
            recycle
If we can prove the performance is unaffected when we eliminate the first if, then obviously we should remove it. I'll try running that test here and see, but keep in mind I am only testing on a 1Gbit interface. Any chance we can get measurements on beefier hardware using hns3?
But in general, I'd be happier if we only had a simple logic in our testing for the pages we have to recycle. Debugging and understanding this otherwise will end up being a mess.
[...]
Regards /Ilias
On 2021/9/16 16:44, Ilias Apalodimas wrote:
[...]
I guess what you are saying here, is that 'struct page' is going to be accessed eventually by the default network path, so there won't be any noticeable performance hit? What about the other usecases we have
Yes.
for pp_recycle right now? __skb_frag_unref() in skb_shift() or skb_try_coalesce() (the latter can probably be removed tbh).
If we decide to go with an accurate indicator of a pp page, we just need to make sure the network stack uses __skb_frag_unref() and __skb_frag_ref() to put and get a page frag; the indicator checking need only be done in __skb_frag_unref() and __skb_frag_ref(), so skb_shift() and skb_try_coalesce() should be fine too.
Another way is to use the bit 0 of frag->bv_page ptr to indicate if a frag page is from page pool.
Instead of the 'struct page' signature? And the pp_recycle bit will continue to exist?
The pp_recycle bit might only exist, or only be used, for the head page of the skb. Bit 0 of the frag->bv_page ptr can be used to indicate a frag page uniquely. Doing a memcpy of the shinfo or "*fragto = *fragfrom" automatically passes the indicator to the new shinfo before a __skb_frag_ref() is done, and __skb_frag_ref() will increment the _refcount or pp_frag_count according to bit 0 of frag->bv_page.
By the way, I also prototyped the above idea, and it seems to work well too.
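Roughly, the tagging side of that prototype could look like the below (a sketch only; the helper names and the SKB_FRAG_PP_BIT mask are illustrative, not what the prototype actually uses):

/* Bit 0 of frag->bv_page marks a page-pool frag page. */
#define SKB_FRAG_PP_BIT	0x1UL

static inline void __skb_frag_set_pp_page(skb_frag_t *frag, struct page *page)
{
	frag->bv_page = (struct page *)((unsigned long)page | SKB_FRAG_PP_BIT);
}

static inline bool __skb_frag_is_pp_page(const skb_frag_t *frag)
{
	return (unsigned long)frag->bv_page & SKB_FRAG_PP_BIT;
}

static inline struct page *__skb_frag_real_page(const skb_frag_t *frag)
{
	return (struct page *)((unsigned long)frag->bv_page & ~SKB_FRAG_PP_BIT);
}

Since memcpying the shinfo or doing "*fragto = *fragfrom" copies bv_page verbatim, the mark travels with the frag for free.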
[...]
If we can prove the performance is unaffected when we eliminate the first if, then obviously we should remove it. I'll try running that test here and see, but keep in mind I am only testing on a 1Gbit interface. Any chance we can get measurements on beefier hardware using hns3?
Sure, I will try it. As this kind of performance overhead is small, any particular performance testcase in mind?
On Thu, Sep 16, 2021 at 05:33:39PM +0800, Yunsheng Lin wrote:
On 2021/9/16 16:44, Ilias Apalodimas wrote:
[...]
I guess what you are saying here, is that 'struct page' is going to be accessed eventually by the default network path, so there won't be any noticeable performance hit? What about the other usecases we have
Yes.
In that case you'd need to call virt_to_head_page() early though, get it and then compare the signature. I guess that's avoidable by using frag->bv_page for the fragments?
[...]
The pp_recycle bit might only exist, or only be used, for the head page of the skb. Bit 0 of the frag->bv_page ptr can be used to indicate a frag page uniquely. Doing a memcpy of the shinfo or "*fragto = *fragfrom" automatically passes the indicator to the new shinfo before a __skb_frag_ref() is done, and __skb_frag_ref() will increment the _refcount or pp_frag_count according to bit 0 of frag->bv_page.
By the way, I also prototyped the above idea, and it seems to work well too.
As long as no one else touches this, it's just another way of identifying a page_pool allocated page. But are we gaining by that? Not using virt_to_head_page() as stated above? But in that case you still need to keep pp_recycle around.
[...]
If we can prove the performance is unaffected when we eliminate the first if, then obviously we should remove it. I'll try running that test here and see, but keep in mind I am only testing on a 1Gbit interface. Any chance we can get measurements on beefier hardware using hns3?
Sure, I will try it. As this kind of performance overhead is small, any particular performance testcase in mind?
'eliminate the first if' wasn't accurate. I meant switch the first if and check the struct page signature instead. That would be the best solution imho. We effectively have a single rule to check if a packet comes from page_pool or not.
You can start by sending a lot of packets and dropping those immediately. That should put enough stress on the receive path and the allocators and it should give us a rough idea.
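For example, a trivial XDP program that drops every packet should stress the rx path and the allocators enough for a rough idea (sketch of one possible setup; attach with something like "ip link set dev eth0 xdp obj xdp_drop.o sec xdp"):

/* xdp_drop.c: drop everything as early as possible. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";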
Regards /Ilias
On 2021/9/16 18:38, Ilias Apalodimas wrote:
[...]
In that case you'd need to call virt_to_head_page() early though, get it and then compare the signature. I guess that's avoidable by using frag->bv_page for the fragments?
If a page of a skb frag is from page pool, it seems frag->bv_page always points to the head page of a compound page, so calling virt_to_head_page() does not seem necessary.
Bit 0 of frag->bv_page is a different kind of indicator for a pp page; it is better not to confuse it with the page signature way. Using bit 0 may give us a free word in 'struct page' if we manage to use skb->pp_recycle to indicate the head page of the skb uniquely, meaning page->pp_magic could be used for a future feature.
[...]
As long as no one else touches this, it's just another way of identifying a page_pool allocated page. But are we gaining by that? Not using virt_to_head_page() as stated above? But in that case you still need to keep pp_recycle around.
No, we do not need the pp_recycle bit, as long as we make sure __skb_frag_ref() is called after memcpying the shinfo or doing "*fragto = *fragfrom".
[...]
'eliminate the first if' wasn't accurate. I meant switch the first if and check the struct page signature instead. That would be the best solution imho. We effectively have a single rule to check if a packet comes from page_pool or not.
I am not sure what does "switch " means here, if the page signature can indicate a pp page uniquely, the "if (pp_recycle)" checking can be removed.
On 2021/9/16 19:04, Yunsheng Lin wrote:
[...]
No, we do not need the pp_recycle bit, as long as we make sure __skb_frag_ref() is called after memcpying the shinfo or doing "*fragto = *fragfrom".
Actually, it seems we do not need to ensure __skb_frag_ref() is called after memcpying the shinfo or doing "*fragto = *fragfrom".
Just make sure bit 0 of frag->bv_page is passed to the new frag->bv_page (by memcpying the shinfo or doing "*fragto = *fragfrom"); __skb_frag_ref() and __skb_frag_unref() will check bit 0 of frag->bv_page to update the _refcount or pp_frag_count accordingly, roughly as in the sketch below.
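The ref/unref pair would then dispatch on that bit along these lines (a sketch under the bit-0 assumption above; __skb_frag_is_pp_page()/__skb_frag_real_page() are the illustrative helpers from the earlier sketch, and the pp-side calls are assumptions rather than the actual prototype):

static inline void __skb_frag_ref(skb_frag_t *frag)
{
	struct page *page = __skb_frag_real_page(frag);

	if (__skb_frag_is_pp_page(frag))
		atomic_long_inc(&page->pp_frag_count);
	else
		get_page(page);
}

static inline void __skb_frag_unref(skb_frag_t *frag)
{
	struct page *page = __skb_frag_real_page(frag);

	if (__skb_frag_is_pp_page(frag))
		page_pool_put_full_page(page->pp, page, false);
	else
		put_page(page);
}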
On Thu, Sep 16, 2021 at 07:04:54PM +0800, Yunsheng Lin wrote:
[...]
If a page of a skb frag is from page pool, it seems frag->bv_page always points to the head page of a compound page, so calling virt_to_head_page() does not seem necessary.
I was mostly referring to the skb head here, and how you would trigger the recycling path.
I think we are talking about different things here. One idea is to use the last bit of frag->bv_page to identify fragments allocated from page_pool, which is done today with the signature.
The signature however exists in the head page, so my question was, can we get rid of that without having a performance penalty?
IOW in skb_free_head() can we replace:
if (skb_pp_recycle(skb, head)) with if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) and get rid of the 'bool recycle' argument in __skb_frag_unref()?
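Spelled out, that would be something like the below in skb_free_head() (a sketch; whether the bare signature check is safe for every head page is exactly the open question here):

static void skb_free_head(struct sk_buff *skb)
{
	unsigned char *head = skb->head;

	if (skb->head_frag) {
		struct page *page = virt_to_head_page(head);

		/* Decide purely on the page signature, no skb->pp_recycle. */
		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE &&
		    page_pool_return_skb_page(page))
			return;
		skb_free_frag(head);
	} else {
		kfree(head);
	}
}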
[...]
No, we do not need the pp_recycle bit, as long as we make sure __skb_frag_ref() is called after memcpying the shinfo or doing "*fragto = *fragfrom".
But we'll have to keep it for the skb head in this case.
Regards /Ilias
On 2021/9/16 19:57, Ilias Apalodimas wrote:
[...]
The signature however exists in the head page, so my question was, can we get rid of that without having a performance penalty?
As both the skb frag and head pages are eventually operated on via the head page of a compound page (if it is a compound page) in the normal case too, maybe we can refactor the code to get the head page of a compound page before the signature checking, without doing a second virt_to_head_page() or compound_head() call?
IOW in skb_free_head() can we replace:
if (skb_pp_recycle(skb, head)) with if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) and get rid of the 'bool recycle' argument in __skb_frag_unref()?
For the frag page of a skb, it seems ok to get rid of the 'bool recycle' argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref() are called symmetrically to put/get a page.
For the head page of a skb, we might need to make sure the head page passed to __build_skb_around() meets the below condition: do pp_frag_count incrementing instead of _refcount incrementing when the head page is not newly allocated and it is from page pool. It seems hard to audit that?
[...]
But we'll have to keep it for the skb head in this case.
As above, I have not really looked into the skb head case:)
Hi Yunsheng,
[...]
As both the skb frag and head pages are eventually operated on via the head page of a compound page (if it is a compound page) in the normal case too, maybe we can refactor the code to get the head page of a compound page before the signature checking, without doing a second virt_to_head_page() or compound_head() call?
Yea that's doable, but my concern is different here. If we do that the standard network stack, even for drivers that don't use page_pool, will have to do a virt_to_head_page() -> check signature, to decide if it has to try recycling the packet. That's the performance part I am worried about, since it happens for every packet.
[...]
For the head page of a skb, we might need to make sure the head page passed to __build_skb_around() meets the below condition: do pp_frag_count incrementing instead of _refcount incrementing when the head page is not newly allocated and it is from page pool. It seems hard to audit that?
Yea, that seems a bit weird at least to me, and I am not sure it's the only place we'll have to go and do that.
[...]
As above, I have not really looked into the skb head case:)
Let me take a step back here, because I think we drifted a bit. The page signature was introduced in order to be able to identify skb fragments. The problem was that you couldn't rely on the pp_recycle bit of the skb head, since fragments could come from anywhere. So you use the skb bit as a hint for skb frags, and you eventually decide using the page signature.
So we got 3 options (anything I've missed?):
- Try to remove the pp_recycle bit, since the page signature is enough for the skb head and fragments. That in my opinion is the cleanest option, as long as we can prove there's no performance hit on the standard network path.
- Replace the page signature with frag->bv_page bit0. In that case we still have to keep the pp_recycle bit, but we do have an 'easier' indication that a skb frag comes from page_pool. That's still pretty safe, since you now have unique identifiers for the skb and page fragments and you can be sure of their origin (page pool or not). What I am missing here is what we get out of this? I think the advantage is not having to call virt_to_head_page() for frags?
- Keep all of them(?) and use frag->bv_page bit0 similarly to the pp_recycle bit? I don't see much value in this one, I am just keeping it here for completeness.
Thanks /Ilias
On 2021/9/17 14:38, Ilias Apalodimas wrote:
[...]
Yea that's doable, but my concern is different here. If we do that the standard network stack, even for drivers that don't use page_pool, will have to do a virt_to_head_page() -> check signature, to decide if it has to try recycling the packet. That's the performance part I am worried about, since it happens for every packet.
Yes, there is theoretically a performance penalty for virt_to_head_page() or compound_head(); I will do more testing if we decide to go with the signature checking.
[...]
Yea, that seems a bit weird at least to me, and I am not sure it's the only place we'll have to go and do that.
Yes, that is why I avoid changing the behavior of the head page for a skb. In other words, maybe we should not track whether the head page of a skb is a pp page when the page's _refcount is incremented during its network stack journey, and just treat it as a normal page?
[...]
Replace the page signature with frag->bv_page bit0. In that case we still have to keep the pp_recycle bit, but we do have an 'easier' indication that a skb frag comes from page_pool. That's still pretty safe, since you now have unique identifiers for the skb and page fragments and you can be sure of their origin (page pool or not). What I am missing here is what we get out of this? I think the advantage is not having to call virt_to_head_page() for frags?
Not using the signature would free a word of space in struct page for a future feature?
- Keep all of them(?) and use frag->bv_page bit0 similarly to the pp_recycle bit? I don't see much value in this one, I am just keeping it here for completeness.
For safety and performance reasons: 1. Maybe we should move the pp_recycle bit from "struct sk_buff" to "struct skb_shared_info", and use it to only indicate whether the head page of a skb is from page pool.
2. Bit 0 of frag->bv_page is used to indicate whether the frag page of a skb is from page pool, and __skb_frag_unref() and __skb_frag_ref() are modified to keep track of it.
3. For safety or debugging reasons, keep the page signature for now, and put a page-signature WARN_ON checking in page pool to catch any misbehaviour (a sketch follows below)?
If no bug shows up later, maybe we can free the page signature space for other usage?
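For point 3, the checking could be as simple as something like this in the page-pool free path (a sketch; the helper name and exact placement are assumptions):

static bool page_pool_check_signature(const struct page *page)
{
	/* Mask bit 0 (compound head) and bit 1 (pfmemalloc), as on the
	 * allocation side.
	 */
	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
}

/* e.g. at the top of __page_pool_put_page(): */
	WARN_ONCE(!page_pool_check_signature(page),
		  "page_pool: page returned without pp signature\n");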
Thanks /Ilias
In that case you'd need to call virt_to_head_page() early though, get it
[...]
and then compare the signature. I guess that's avoidable by using frag->bv_page for the fragments?
If a page of a skb frag is from the page pool, it seems frag->bv_page always points to the head page of a compound page, so calling virt_to_head_page() does not seem necessary.
I was mostly referring to the skb head here and how would you trigger the recycling path.
I think we are talking about different things here. One idea is to use the last bit of frag->bv_page to identify fragments allocated from page_pool, which is done today with the signature.
The signature however exists in the head page, so my question was: can we get rid of that without a performance penalty?
As both the skb frag and head pages are eventually operated on via the head page of a compound page (if it is a compound page) in the normal case too, maybe we can refactor the code to get the head page of a compound page before the signature checking, without doing a second virt_to_head_page() or compound_head() call?
Yea that's doable, but my concern is different here. If we do that the standard network stack, even for drivers that don't use page_pool, will have to do a virt_to_head_page() -> check signature, to decide if it has to try recycling the packet. That's the performance part I am worried about, since it happens for every packet.
Yes, there is a theoretical performance penalty for virt_to_head_page() or compound_head(); I will do more testing if we decide to go with the signature checking.
Can we check this somehow? I can send a patch for this, but my testing is limited to 1Gbit for the recycling. I can find 25/100Gbit interfaces for the 'normal' path.
IOW in skb_free_head() can we replace:

	if (skb_pp_recycle(skb, head))

with

	if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)

and get rid of the 'bool recycle' argument in __skb_frag_unref()?
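Roughly, I mean something like the below sketch (assuming page_pool_return_skb_page() keeps its current semantics):

static void skb_free_head(struct sk_buff *skb)
{
	unsigned char *head = skb->head;

	if (skb->head_frag) {
		struct page *page = virt_to_head_page(head);

		/* decide on recycling purely from the page signature,
		 * without consulting skb->pp_recycle
		 */
		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE &&
		    page_pool_return_skb_page(page))
			return;
		skb_free_frag(head);
	} else {
		kfree(head);
	}
}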
For the frag page of a skb, it seems OK to get rid of the 'bool recycle' argument in __skb_frag_unref(), as __skb_frag_unref() and __skb_frag_ref() are called symmetrically to put/get a page.
For the head page of a skb, we might need to make sure the head page passed to __build_skb_around() meets the below condition: do pp_frag_count incrementing instead of _refcount incrementing when the head page is not newly allocated and it is from the page pool. It seems hard to audit that?
Yea, that seems a bit weird at least to me, and I am not sure it's the only place we'll have to go and do that.
Yes, that is why I avoided changing the behavior of the head page of a skb. In other words, maybe we should not track whether the head page of a skb is a pp page when the page's _refcount is incremented during its network stack journey, and just treat it as a normal page?
I am not sure I understand this.
Bit 0 of frag->bv_page is a different way of indicating a pp page; it is better we do not confuse it with the page signature way. Using bit 0 may give us a free word in 'struct page', if we manage to use skb->pp_recycle to indicate the head page of the skb uniquely, meaning page->pp_magic can be used for a future feature.
> Not using the signature will free up a word in struct page for future
> features?
Yea, that's another thing we gain, but I am not sure how useful this is going to turn out.
> Maybe we should move the pp_recycle bit from "struct sk_buff" to
> "struct skb_shared_info", and use it only to indicate whether the head
> page of a skb is from the page pool.
What's the safety or performance we gain out of this? The only performance gain I can think of is the dirty cache line of the recycle bit we set to 0. If we do move it to skb_shared_info we'll have to make sure it's on the same cacheline as the ones we already change.
Yea that's essentially identical to (2) but we move the pp_recycle on the skb_shared_info. I'd really prefer getting rid of the pp_recycle entirely, since it's the cleanest thing we can do in my head. If we ever need an extra 4/8 bytes in the future, we can always go back and implement this.
Alexander/Jesper any additional thoughts?
Regards /Ilias
On 2021/9/17 23:01, Ilias Apalodimas wrote:
> Can we check this somehow? I can send a patch for this, but my testing
> is limited to 1Gbit for the recycling. I can find 25/100Gbit interfaces
> for the 'normal' path.
I have done the signature checking for the frag page of a skb; I am not able to see a noticeable change between patched (with this patch) and unpatched for the small packet drop test case (performance is about 34Mpps).
As the hns3 driver does not use the build_skb() API, I am not able to test the signature checking penalty for the head page of a skb; any chance to do the testing for the head page of a skb on your side?
> I am not sure I understand this.
I was saying to only treat the head page of a skb as a pp page when it is newly allocated from the page pool; if that page is reference-counted to build another head page for another skb later, just treat it as a normal page.
> What's the safety or performance we gain out of this?
The safety is that we still have two ways to indicate a pp page: the pp_recycle bit in "struct skb_shared_info" or frag->bv_page bit0 tells whether we want to treat a page as a pp page, and the page signature checking is used to tell whether we set those bits correctly?
> The only performance I can think of is the dirty cache line of the
> recycle bit we set to 0. If we do move it to skb_shared_info we'll have
> to make sure it's on the same cacheline as the ones we already change.
Yes, when we move the pp_recycle bit to skb_shared_info, that bit is only set once, and we seem not to need to worry about skb cloning or expanding, as the bit is part of the head page (the shinfo is part of the head page).
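e.g. something like the below; the shinfo field is hypothetical:

/* set once when the skb is first built around a page pool page, and
 * inherited by clones/expansions for free since the shinfo lives in
 * the head page, unlike skb->pp_recycle which must be cleared on
 * those paths today
 */
static inline void skb_mark_pp_head(struct sk_buff *skb)
{
	skb_shinfo(skb)->pp_recycle = 1;
}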
> Yea that's essentially identical to (2) but we move the pp_recycle on
> the skb_shared_info. I'd really prefer getting rid of the pp_recycle
> entirely,
When also removing the pp_recycle bit for the head page of a skb, it seems a little risky, as we are not sure when a not-newly-allocated pp page will be passed to __build_skb_around() to build the head page?
[...]

> As the hns3 driver does not use the build_skb() API, I am not able to
> test the signature checking penalty for the head page of a skb; any
> chance to do the testing for the head page of a skb on your side?
Yea I think I'll see what I can do. I wasn't expecting any regression on the recycling path, since we do eventually call virt_to_head_page(), but it's always good to confirm. Thanks for testing it.
> I was saying to only treat the head page of a skb as a pp page when it
> is newly allocated from the page pool; if that page is
> reference-counted to build another head page for another skb later,
> just treat it as a normal page.
But the problem here is that a cloned/expanded SKB could trigger a race when freeing the fragments. That's why we reset the pp_recycle bit if there are still references to the frags. What does 'normal' page mean here? We'll have to at least unmap the DMA part.
> The safety is that we still have two ways to indicate a pp page: the
> pp_recycle bit in "struct skb_shared_info" or frag->bv_page bit0 tells
> whether we want to treat a page as a pp page, and the page signature
> checking is used to tell whether we set those bits correctly?
Yea but in the long run we'll want the page signature. So that's basically (2) once we do that.
> When also removing the pp_recycle bit for the head page of a skb, it
> seems a little risky, as we are not sure when a not-newly-allocated pp
> page will be passed to __build_skb_around() to build the head page?
Removing pp_recycle is only safe if we keep the page signature. I was suggesting we follow (1) first before starting to move things around.
Thanks /Ilias
On 2021/9/18 17:23, Ilias Apalodimas wrote:
[...]
> But the problem here is that a cloned/expanded SKB could trigger a race
> when freeing the fragments. That's why we reset the pp_recycle bit if
> there are still references to the frags. What does 'normal' page mean
> here? We'll have to at least unmap the DMA part.
'normal' page means a non-pp page here. Maybe forget the above.
I have read the code related to head page handling for a skb, and it seems the NAPI_GRO_FREE_STOLEN_HEAD and skb_head_frag_to_page_desc() cases are just fine as they are now, when the page signature is used to uniquely identify a pp page for the head page of a skb?
> Removing pp_recycle is only safe if we keep the page signature. I was
> suggesting we follow (1) first before starting to move things around.
I suppose (1) means the below, right:
> 1. Try to remove the pp_recycle bit, since the page signature is enough
> for the skb head and fragments. That in my opinion is the cleanest
> option, as long as we can prove there's no performance hit on the
> standard network path.
It seems doable, if my above analysis of head page handling for a skb does not miss anything.
On Wed, Sep 15, 2021 at 7:05 PM Yunsheng Lin linyunsheng@huawei.com wrote:
> As mentioned before, Tx recycling is based on a page_pool instance per
> socket; it shares the page_pool instance with rx.
>
> Anyway, based on feedback from edumazet and dsahern, I am still trying
> to see if the page pool is meaningful for tx.
It is not for the generic Linux TCP stack, but perhaps for benchmarks.
Unless you dedicate one TX/RX pair per TCP socket ?
Most high performance TCP flows are using zerocopy, I am really not sure why we would need to 'optimize' the path that is wasting cpu cycles doing user->kernel copies anyway, at the cost of insane complexity.
On 2021/9/18 1:15, Eric Dumazet wrote:
> It is not for the generic Linux TCP stack, but perhaps for benchmarks.
I am not sure I understand what the above means. Did you mean tx recycling only benefits benchmark tools such as iperf/netperf, but not real usecases?
> Unless you dedicate one TX/RX pair per TCP socket ?
A TX/RX pair for a netdev queue, or a TX/RX pair for the recycling pool?
As for the TX/RX pair for a netdev queue, I am not dedicating one TX/RX netdev queue pair per TCP socket.
As for the TX/RX pair for the recycling pool, my initial thinking is that each NAPI/socket context has a 'struct pp_alloc_cache', which provides a last-in-first-out, lockless mini pool specific to that NAPI/socket context, plus a central locked 'struct ptr_ring' pool, based on a queue, shared by all the NAPI/socket mini pools; when a NAPI/socket context's mini pool is empty or full, it can refill some pages from the central pool or flush some pages to it.
I am not sure yet whether the locked central pool is needed, or whether the page pool's 'struct ptr_ring' is the right choice for it.
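Very roughly, the refill/flush part of that idea might look like the below; all names are hypothetical, and DMA unmapping and the like are ignored:

struct ctx_page_cache {
	u32 count;
	struct page *cache[PP_ALLOC_CACHE_SIZE];	/* lockless, owner only */
	struct ptr_ring *central;			/* shared, locked */
};

static struct page *ctx_page_get(struct ctx_page_cache *c)
{
	if (!c->count) {
		struct page *page;

		/* refill a batch from the central pool */
		while (c->count < PP_ALLOC_CACHE_SIZE / 2 &&
		       (page = ptr_ring_consume(c->central)))
			c->cache[c->count++] = page;

		if (!c->count)
			return NULL;	/* caller falls back to page allocator */
	}

	return c->cache[--c->count];
}

static void ctx_page_put(struct ctx_page_cache *c, struct page *page)
{
	if (c->count == PP_ALLOC_CACHE_SIZE) {
		/* flush half of the cache to the central pool */
		while (c->count > PP_ALLOC_CACHE_SIZE / 2) {
			struct page *p = c->cache[--c->count];

			if (ptr_ring_produce(c->central, p))
				put_page(p);	/* central pool full, free it */
		}
	}

	c->cache[c->count++] = page;
}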
> Most high performance TCP flows are using zerocopy, I am really not
> sure why we would need to 'optimize' the path that is wasting cpu
> cycles doing user->kernel copies anyway, at the cost of insane
> complexity.
To my understanding, zerocopy is mostly about big packets and the non-IOMMU case.
As for complexity, I am not yet convinced that it is that complex, as it mostly uses the existing infrastructure to support tx recycling.
The point is that most skbs are freed in NAPI or socket context; it seems we could utilize that to do batch allocating/freeing of skb/page_frag, or to reuse skb/page_frag/dma mappings, to avoid (IO/CPU) TLB misses, cache misses, spinlock overhead and dma mapping overhead.