Patch 1: disable dma mapping support for 32-bit arch with 64-bit DMA.
Patch 2: support non-split page when PP_FLAG_PAGE_FRAG is set.
Patch 3: avoid calling compound_head() for skb frag page.
Patch 4-7: use pp_magic to identify a pp page uniquely.
V3:
1. add patch 1/4/6/7.
2. use pp_magic to identify pp page uniquely too.
3. avoid unnecessary compound_head() calling.
V2: add patch 2, adjust the commit log according to the discussion in V1, and fix a compiler error reported by the kernel test robot.
Yunsheng Lin (7):
  page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA
  page_pool: support non-split page with PP_FLAG_PAGE_FRAG
  page_pool: avoid calling compound_head() for skb frag page
  page_pool: change BIAS_MAX to support incrementing
  skbuff: keep track of pp page when __skb_frag_ref() is called
  skbuff: only use pp_magic identifier for an skb's head page
  skbuff: remove unused skb->pp_recycle
 .../net/ethernet/hisilicon/hns3/hns3_enet.c |  6 ---
 drivers/net/ethernet/marvell/mvneta.c       |  2 -
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c |  4 +-
 drivers/net/ethernet/marvell/sky2.c         |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c  |  2 +-
 drivers/net/ethernet/ti/cpsw.c              |  2 -
 drivers/net/ethernet/ti/cpsw_new.c          |  2 -
 include/linux/mm_types.h                    | 13 +-----
 include/linux/skbuff.h                      | 39 ++++++++----------
 include/net/page_pool.h                     | 31 ++++++++------
 net/core/page_pool.c                        | 40 +++++++------------
 net/core/skbuff.c                           | 36 ++++++-----------
 net/tls/tls_device.c                        |  2 +-
 13 files changed, 67 insertions(+), 114 deletions(-)
As 32-bit arches with 64-bit DMA seem to be rare these days, page pool is carrying a lot of code and complexity for systems that possibly don't exist.
So disable dma mapping support for such systems; if drivers really want to work on such systems, they have to implement their own DMA-mapping fallback tracking outside page_pool.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
---
 include/linux/mm_types.h | 13 +------------
 include/net/page_pool.h  | 12 +-----------
 net/core/page_pool.c     | 10 ++++++----
 3 files changed, 8 insertions(+), 27 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f8ee09c711f..436e0946d691 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -104,18 +104,7 @@ struct page {
 			struct page_pool *pp;
 			unsigned long _pp_mapping_pad;
 			unsigned long dma_addr;
-			union {
-				/**
-				 * dma_addr_upper: might require a 64-bit
-				 * value on 32-bit architectures.
-				 */
-				unsigned long dma_addr_upper;
-				/**
-				 * For frag page support, not supported in
-				 * 32-bit architectures with 64-bit DMA.
-				 */
-				atomic_long_t pp_frag_count;
-			};
+			atomic_long_t pp_frag_count;
 		};
 		struct {	/* slab, slob and slub */
 			union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index a4082406a003..3855f069627f 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -216,24 +216,14 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 	page_pool_put_full_page(pool, page, true);
 }
 
-#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT	\
-		(sizeof(dma_addr_t) > sizeof(unsigned long))
-
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr;
-
-	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
-		ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
-
-	return ret;
+	return page->dma_addr;
 }
 
 static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
 	page->dma_addr = addr;
-	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
-		page->dma_addr_upper = upper_32_bits(addr);
 }
 
 static inline void page_pool_set_frag_count(struct page *page, long nr)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1a6978427d6c..a65bd7972e37 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool,
 	 * which is the XDP_TX use-case.
 	 */
 	if (pool->p.flags & PP_FLAG_DMA_MAP) {
+		/* DMA-mapping is not supported on 32-bit systems with
+		 * 64-bit DMA mapping.
+		 */
+		if (sizeof(dma_addr_t) > sizeof(unsigned long))
+			return -EINVAL;
+
 		if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
 		    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
 			return -EINVAL;
@@ -69,10 +75,6 @@ static int page_pool_init(struct page_pool *pool,
 	 */
 	}
 
-	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
-	    pool->p.flags & PP_FLAG_PAGE_FRAG)
-		return -EINVAL;
-
 	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
 		return -ENOMEM;
 
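Later in this thread Ilias notes that a driver which really must run on such a system can simply drop PP_FLAG_DMA_MAP and handle the mappings itself. Below is a minimal sketch of what that fallback could look like on the driver side; the foo_* names and the do_own_dma_map field are hypothetical and not part of this series, only the page_pool API calls are real.

#include <linux/device.h>
#include <net/page_pool.h>

/* hypothetical driver-private rx state, not from this series */
struct foo_rx_ring {
	struct page_pool *page_pool;
	bool do_own_dma_map;
};

static int foo_create_rx_pool(struct foo_rx_ring *ring, struct device *dev)
{
	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP,
		.order		= 0,
		.pool_size	= 1024,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	ring->page_pool = page_pool_create(&pp_params);
	if (!IS_ERR(ring->page_pool)) {
		ring->do_own_dma_map = false;
		return 0;
	}

	/* e.g. 32-bit arch with 64-bit dma_addr_t: retry without DMA
	 * mapping support and let the driver map/unmap the pages itself.
	 */
	pp_params.flags &= ~PP_FLAG_DMA_MAP;
	ring->page_pool = page_pool_create(&pp_params);
	if (IS_ERR(ring->page_pool))
		return PTR_ERR(ring->page_pool);

	ring->do_own_dma_map = true;
	return 0;
}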
(+cc Matthew) but this looks safe to me.
On Wed, 22 Sept 2021 at 12:43, Yunsheng Lin linyunsheng@huawei.com wrote:
As the 32-bit arch with 64-bit DMA seems to rare those days, and page pool is carrying a lot of code and complexity for systems that possibly don't exist.
So disable dma mapping support for such systems, if drivers really want to work on such systems, they have to implement their own DMA-mapping fallback tracking outside page_pool.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
include/linux/mm_types.h | 13 +------------ include/net/page_pool.h | 12 +----------- net/core/page_pool.c | 10 ++++++---- 3 files changed, 8 insertions(+), 27 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 7f8ee09c711f..436e0946d691 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -104,18 +104,7 @@ struct page { struct page_pool *pp; unsigned long _pp_mapping_pad; unsigned long dma_addr;
union {
/**
* dma_addr_upper: might require a 64-bit
* value on 32-bit architectures.
*/
unsigned long dma_addr_upper;
/**
* For frag page support, not supported in
* 32-bit architectures with 64-bit DMA.
*/
atomic_long_t pp_frag_count;
};
atomic_long_t pp_frag_count; }; struct { /* slab, slob and slub */ union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h index a4082406a003..3855f069627f 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -216,24 +216,14 @@ static inline void page_pool_recycle_direct(struct page_pool *pool, page_pool_put_full_page(pool, page, true); }
-#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \
(sizeof(dma_addr_t) > sizeof(unsigned long))
static inline dma_addr_t page_pool_get_dma_addr(struct page *page) {
dma_addr_t ret = page->dma_addr;
if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
return ret;
return page->dma_addr;
}
static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr) { page->dma_addr = addr;
if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
page->dma_addr_upper = upper_32_bits(addr);
}
static inline void page_pool_set_frag_count(struct page *page, long nr) diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1a6978427d6c..a65bd7972e37 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, * which is the XDP_TX use-case. */ if (pool->p.flags & PP_FLAG_DMA_MAP) {
/* DMA-mapping is not supported on 32-bit systems with
* 64-bit DMA mapping.
*/
if (sizeof(dma_addr_t) > sizeof(unsigned long))
return -EINVAL;
if ((pool->p.dma_dir != DMA_FROM_DEVICE) && (pool->p.dma_dir != DMA_BIDIRECTIONAL)) return -EINVAL;
@@ -69,10 +75,6 @@ static int page_pool_init(struct page_pool *pool, */ }
if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
pool->p.flags & PP_FLAG_PAGE_FRAG)
return -EINVAL;
if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) return -ENOMEM;
-- 2.33.0
Reviewed-by: Ilias Apalodimas ilias.apalodimas@linaro.org
On 22/09/2021 11.41, Yunsheng Lin wrote:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1a6978427d6c..a65bd7972e37 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, * which is the XDP_TX use-case. */ if (pool->p.flags & PP_FLAG_DMA_MAP) {
/* DMA-mapping is not supported on 32-bit systems with
* 64-bit DMA mapping.
*/
if (sizeof(dma_addr_t) > sizeof(unsigned long))
return -EINVAL;
As I said before, can we please use another error code than EINVAL? We should give drivers a chance/ability to detect this error and, e.g., fall back to doing DMA mappings inside the driver instead.
I suggest using EOPNOTSUPP 95 (Operation not supported).
-Jesper
Hi Jesper,
On Thu, 23 Sept 2021 at 12:33, Jesper Dangaard Brouer jbrouer@redhat.com wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1a6978427d6c..a65bd7972e37 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, * which is the XDP_TX use-case. */ if (pool->p.flags & PP_FLAG_DMA_MAP) {
/* DMA-mapping is not supported on 32-bit systems with
* 64-bit DMA mapping.
*/
if (sizeof(dma_addr_t) > sizeof(unsigned long))
return -EINVAL;
As I said before, can we please use another error than EINVAL. We should give drivers a chance/ability to detect this error, and e.g. fallback to doing DMA mappings inside driver instead.
I suggest using EOPNOTSUPP 95 (Operation not supported).
I am fine with both. In any case though, the aforementioned driver can just remove PP_FLAG_DMA_MAP and do its own mappings.
Regards /Ilias
-Jesper
On 2021/9/23 18:02, Ilias Apalodimas wrote:
Hi Jesper,
On Thu, 23 Sept 2021 at 12:33, Jesper Dangaard Brouer jbrouer@redhat.com wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1a6978427d6c..a65bd7972e37 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, * which is the XDP_TX use-case. */ if (pool->p.flags & PP_FLAG_DMA_MAP) {
/* DMA-mapping is not supported on 32-bit systems with
* 64-bit DMA mapping.
*/
if (sizeof(dma_addr_t) > sizeof(unsigned long))
return -EINVAL;
As I said before, can we please use another error than EINVAL. We should give drivers a chance/ability to detect this error, and e.g. fallback to doing DMA mappings inside driver instead.
I suggest using EOPNOTSUPP 95 (Operation not supported).
Will change it to EOPNOTSUPP, thanks.
I am fine with both. In any case though the aforementioned driver can just remove PP_FLAG_DMA_MAP and do it's own mappings.
Regards /Ilias
-Jesper
On Thu, Sep 23, 2021 at 07:13:11PM +0800, Yunsheng Lin wrote:
On 2021/9/23 18:02, Ilias Apalodimas wrote:
Hi Jesper,
On Thu, 23 Sept 2021 at 12:33, Jesper Dangaard Brouer jbrouer@redhat.com wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1a6978427d6c..a65bd7972e37 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, * which is the XDP_TX use-case. */ if (pool->p.flags & PP_FLAG_DMA_MAP) {
/* DMA-mapping is not supported on 32-bit systems with
* 64-bit DMA mapping.
*/
if (sizeof(dma_addr_t) > sizeof(unsigned long))
return -EINVAL;
As I said before, can we please use another error than EINVAL. We should give drivers a chance/ability to detect this error, and e.g. fallback to doing DMA mappings inside driver instead.
I suggest using EOPNOTSUPP 95 (Operation not supported).
Will change it to EOPNOTSUPP, thanks.
Mind sending this one separately (and you can keep my reviewed-by)? It fits nicely on its own, and since I am not sure about the rest of the changes yet, it would be nice to get this one in.
Cheers /Ilias
I am fine with both. In any case though the aforementioned driver can just remove PP_FLAG_DMA_MAP and do it's own mappings.
Regards /Ilias
-Jesper
On 2021/9/23 21:07, Ilias Apalodimas wrote:
On Thu, Sep 23, 2021 at 07:13:11PM +0800, Yunsheng Lin wrote:
On 2021/9/23 18:02, Ilias Apalodimas wrote:
Hi Jesper,
On Thu, 23 Sept 2021 at 12:33, Jesper Dangaard Brouer jbrouer@redhat.com wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1a6978427d6c..a65bd7972e37 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, * which is the XDP_TX use-case. */ if (pool->p.flags & PP_FLAG_DMA_MAP) {
/* DMA-mapping is not supported on 32-bit systems with
* 64-bit DMA mapping.
*/
if (sizeof(dma_addr_t) > sizeof(unsigned long))
return -EINVAL;
As I said before, can we please use another error than EINVAL. We should give drivers a chance/ability to detect this error, and e.g. fallback to doing DMA mappings inside driver instead.
I suggest using EOPNOTSUPP 95 (Operation not supported).
Will change it to EOPNOTSUPP, thanks.
Mind sending this one separately (and you can keep my reviewed-by). It fits nicely on it's own and since I am not sure about the rest of the changes yet, it would be nice to get this one in.
I am not sure sending this one separately really makes sense, as it is mainly used to make supporting the "keep track of pp page when __skb_frag_ref() is called" in patch 5 easier.
Cheers /Ilias
I am fine with both. In any case though the aforementioned driver can just remove PP_FLAG_DMA_MAP and do it's own mappings.
Regards /Ilias
-Jesper
On Fri, 24 Sept 2021 at 10:04, Yunsheng Lin linyunsheng@huawei.com wrote:
On 2021/9/23 21:07, Ilias Apalodimas wrote:
On Thu, Sep 23, 2021 at 07:13:11PM +0800, Yunsheng Lin wrote:
On 2021/9/23 18:02, Ilias Apalodimas wrote:
Hi Jesper,
On Thu, 23 Sept 2021 at 12:33, Jesper Dangaard Brouer jbrouer@redhat.com wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote:
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 1a6978427d6c..a65bd7972e37 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, * which is the XDP_TX use-case. */ if (pool->p.flags & PP_FLAG_DMA_MAP) {
/* DMA-mapping is not supported on 32-bit systems with
* 64-bit DMA mapping.
*/
if (sizeof(dma_addr_t) > sizeof(unsigned long))
return -EINVAL;
As I said before, can we please use another error than EINVAL. We should give drivers a chance/ability to detect this error, and e.g. fallback to doing DMA mappings inside driver instead.
I suggest using EOPNOTSUPP 95 (Operation not supported).
Will change it to EOPNOTSUPP, thanks.
Mind sending this one separately (and you can keep my reviewed-by). It fits nicely on it's own and since I am not sure about the rest of the changes yet, it would be nice to get this one in.
I am not sure sending this one separately really makes sense, as it is mainly used to make supporting the "keep track of pp page when __skb_frag_ref() is called" in patch 5 easier.
It rips out support for devices that are 32-bit and have 64-bit DMA and makes the whole code easier to follow. I thought we agreed on removing the support for those devices regardless, didn't we?
Regards /Ilias
On 2021/9/24 15:25, Ilias Apalodimas wrote:
On Fri, 24 Sept 2021 at 10:04, Yunsheng Lin linyunsheng@huawei.com wrote:
On 2021/9/23 21:07, Ilias Apalodimas wrote:
On Thu, Sep 23, 2021 at 07:13:11PM +0800, Yunsheng Lin wrote:
On 2021/9/23 18:02, Ilias Apalodimas wrote:
Hi Jesper,
On Thu, 23 Sept 2021 at 12:33, Jesper Dangaard Brouer jbrouer@redhat.com wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote: > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 1a6978427d6c..a65bd7972e37 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -49,6 +49,12 @@ static int page_pool_init(struct page_pool *pool, > * which is the XDP_TX use-case. > */ > if (pool->p.flags & PP_FLAG_DMA_MAP) { > + /* DMA-mapping is not supported on 32-bit systems with > + * 64-bit DMA mapping. > + */ > + if (sizeof(dma_addr_t) > sizeof(unsigned long)) > + return -EINVAL;
As I said before, can we please use another error than EINVAL. We should give drivers a chance/ability to detect this error, and e.g. fallback to doing DMA mappings inside driver instead.
I suggest using EOPNOTSUPP 95 (Operation not supported).
Will change it to EOPNOTSUPP, thanks.
Mind sending this one separately (and you can keep my reviewed-by). It fits nicely on it's own and since I am not sure about the rest of the changes yet, it would be nice to get this one in.
I am not sure sending this one separately really makes sense, as it is mainly used to make supporting the "keep track of pp page when __skb_frag_ref() is called" in patch 5 easier.
It rips out support for devices that are 32bit and have 64bit dma and make the whole code easier to follow. I thought we agreed on removing the support for those devices regardless didn't we?
I am actually not convinced about the code around PAGE_POOL_DMA_USE_PP_FRAG_COUNT (maybe the name is somewhat confusing) as it is now, but it is after adding patch 5, and it seems we are also not handling the skb_split() case in tso_fragment() for 32-bit arches with 64-bit DMA if we still keep the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro.
Regards /Ilias .
Currently when PP_FLAG_PAGE_FRAG is set, the caller is not expected to call page_pool_alloc_pages() directly because of the PP_FLAG_PAGE_FRAG checking in __page_pool_put_page().
The patch removes the above checking to enable non-split page support when PP_FLAG_PAGE_FRAG is set.
Reviewed-by: Alexander Duyck alexanderduyck@fb.com
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
---
 net/core/page_pool.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a65bd7972e37..f7e71dcb6a2e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -315,11 +315,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 
 	/* Fast-path: Get a page from cache */
 	page = __page_pool_get_cached(pool);
-	if (page)
-		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (!page)
+		page = __page_pool_alloc_pages_slow(pool, gfp);
+
+	if (likely(page))
+		page_pool_set_frag_count(page, 1);
+
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -428,8 +431,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 		     unsigned int dma_sync_size, bool allow_direct)
 {
 	/* It is not the last user for the page frag case */
-	if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
-	    page_pool_atomic_sub_frag_count_return(page, 1))
+	if (page_pool_atomic_sub_frag_count_return(page, 1))
 		return NULL;
 
 	/* This allocator is optimized for the XDP mode that uses
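As an aside for readers, a rough driver-side illustration of the two allocation styles that can now coexist on a PP_FLAG_PAGE_FRAG pool after this change; the helper names, buffer sizes and GFP flags below are made up for the example, only page_pool_alloc_frag() and page_pool_alloc_pages() are the real API.

#include <net/page_pool.h>

/* split-page path: several rx buffers share one page and the pool tracks
 * the sharing via page->pp_frag_count
 */
static struct page *example_rx_alloc_frag(struct page_pool *pool,
					  unsigned int *offset)
{
	return page_pool_alloc_frag(pool, offset, 2048, GFP_ATOMIC);
}

/* non-split path: with this patch a whole page also starts with
 * pp_frag_count == 1, so freeing it goes through the same
 * page_pool_atomic_sub_frag_count_return() check as the frag case
 */
static struct page *example_rx_alloc_page(struct page_pool *pool)
{
	return page_pool_alloc_pages(pool, GFP_ATOMIC);
}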
On 22/09/2021 11.41, Yunsheng Lin wrote:
Currently when PP_FLAG_PAGE_FRAG is set, the caller is not expected to call page_pool_alloc_pages() directly because of the PP_FLAG_PAGE_FRAG checking in __page_pool_put_page().
The patch removes the above checking to enable non-split page support when PP_FLAG_PAGE_FRAG is set.
Reviewed-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
net/core/page_pool.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index a65bd7972e37..f7e71dcb6a2e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -315,11 +315,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
/* Fast-path: Get a page from cache */ page = __page_pool_get_cached(pool);
if (page)
return page;
/* Slow-path: cache empty, do real allocation */
page = __page_pool_alloc_pages_slow(pool, gfp);
- if (!page)
page = __page_pool_alloc_pages_slow(pool, gfp);
- if (likely(page))
page_pool_set_frag_count(page, 1);
I really don't like that you add one atomic_long_set operation per page alloc call. This is a fast-path for XDP use-cases, which you are ignoring as your driver doesn't implement XDP.
As I cannot ask you to run XDP benchmarks, I fortunately have some page_pool specific microbenchmarks you can run instead.
I will ask you to provide before and after results from running these benchmarks [1] and [2].
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/benc...
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/benc...
How to use these modules is documented here [3]: [3] https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-pro...
return page; } EXPORT_SYMBOL(page_pool_alloc_pages); @@ -428,8 +431,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, unsigned int dma_sync_size, bool allow_direct) { /* It is not the last user for the page frag case */
- if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
page_pool_atomic_sub_frag_count_return(page, 1))
- if (page_pool_atomic_sub_frag_count_return(page, 1)) return NULL;
This adds an atomic_long_read, even when PP_FLAG_PAGE_FRAG is not set.
/* This allocator is optimized for the XDP mode that uses
On 2021/9/23 20:08, Jesper Dangaard Brouer wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote:
Currently when PP_FLAG_PAGE_FRAG is set, the caller is not expected to call page_pool_alloc_pages() directly because of the PP_FLAG_PAGE_FRAG checking in __page_pool_put_page().
The patch removes the above checking to enable non-split page support when PP_FLAG_PAGE_FRAG is set.
Reviewed-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
net/core/page_pool.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index a65bd7972e37..f7e71dcb6a2e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -315,11 +315,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) /* Fast-path: Get a page from cache */ page = __page_pool_get_cached(pool);
- if (page)
return page; /* Slow-path: cache empty, do real allocation */
- page = __page_pool_alloc_pages_slow(pool, gfp);
- if (!page)
page = __page_pool_alloc_pages_slow(pool, gfp);
- if (likely(page))
page_pool_set_frag_count(page, 1);
I really don't like that you add one atomic_long_set operation per page alloc call. This is a fast-path for XDP use-cases, which you are ignoring as you drivers doesn't implement XDP.
As I cannot ask you to run XDP benchmarks, I fortunately have some page_pool specific microbenchmarks you can run instead.
I will ask you to provide before and after results from running these benchmarks [1] and [2].
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/benc...
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/benc...
How to use these module is documented here[3]: [3] https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-pro...
Will run these benchmarks to see if there is any noticeable performance overhead here, thanks for the benchmarks.
return page;
} EXPORT_SYMBOL(page_pool_alloc_pages); @@ -428,8 +431,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, unsigned int dma_sync_size, bool allow_direct) { /* It is not the last user for the page frag case */
- if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
page_pool_atomic_sub_frag_count_return(page, 1))
- if (page_pool_atomic_sub_frag_count_return(page, 1)) return NULL;
This adds an atomic_long_read, even when PP_FLAG_PAGE_FRAG is not set.
The point here is to have consistent handling for both the PP_FLAG_PAGE_FRAG and non-PP_FLAG_PAGE_FRAG cases in the following patch.
As page->_refcount is accessed in the "page_ref_count(page) == 1" check in __page_pool_put_page(), and page->pp_frag_count is most likely in the same cache line as page->_refcount, I am not expecting a noticeable overhead here.
Anyway, will use the above benchmarks as an example to verify it.
/* This allocator is optimized for the XDP mode that uses
On 2021/9/24 15:23, Yunsheng Lin wrote:
On 2021/9/23 20:08, Jesper Dangaard Brouer wrote:
On 22/09/2021 11.41, Yunsheng Lin wrote:
Currently when PP_FLAG_PAGE_FRAG is set, the caller is not expected to call page_pool_alloc_pages() directly because of the PP_FLAG_PAGE_FRAG checking in __page_pool_put_page().
The patch removes the above checking to enable non-split page support when PP_FLAG_PAGE_FRAG is set.
Reviewed-by: Alexander Duyck alexanderduyck@fb.com Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
net/core/page_pool.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index a65bd7972e37..f7e71dcb6a2e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -315,11 +315,14 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) /* Fast-path: Get a page from cache */ page = __page_pool_get_cached(pool);
- if (page)
return page; /* Slow-path: cache empty, do real allocation */
- page = __page_pool_alloc_pages_slow(pool, gfp);
- if (!page)
page = __page_pool_alloc_pages_slow(pool, gfp);
- if (likely(page))
page_pool_set_frag_count(page, 1);
I really don't like that you add one atomic_long_set operation per page alloc call. This is a fast-path for XDP use-cases, which you are ignoring as you drivers doesn't implement XDP.
As I cannot ask you to run XDP benchmarks, I fortunately have some page_pool specific microbenchmarks you can run instead.
I will ask you to provide before and after results from running these benchmarks [1] and [2].
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/benc...
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/benc...
How to use these module is documented here[3]: [3] https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-pro...
Will running these benchmarks to see if any performance overhead noticable here, thanks for the benchmarks.
You are right, there is a noticeable overhead for the bench_page_pool_cross_cpu test case above, possibly due to the cache bouncing caused by page_pool_set_frag_count().
As mentioned by Ilias, mlx5 uses page pool and also does the recycling internally, so handling the page frag tracking consistently for both the PP_FLAG_PAGE_FRAG and non-PP_FLAG_PAGE_FRAG cases will break the mlx5 driver.
So I will drop this patch for now.
return page;
} EXPORT_SYMBOL(page_pool_alloc_pages); @@ -428,8 +431,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, unsigned int dma_sync_size, bool allow_direct) { /* It is not the last user for the page frag case */
- if (pool->p.flags & PP_FLAG_PAGE_FRAG &&
page_pool_atomic_sub_frag_count_return(page, 1))
- if (page_pool_atomic_sub_frag_count_return(page, 1)) return NULL;
This adds an atomic_long_read, even when PP_FLAG_PAGE_FRAG is not set.
The point here is to have consistent handling for both PP_FLAG_PAGE_FRAG and non-PP_FLAG_PAGE_FRAG case in the following patch.
As the page->_refcount is accessed in "page_ref_count(page) == 1" checking in __page_pool_put_page(), and page->pp_frag_count is most likely in the same cache line as the page->_refcount, So I am not expecting a noticable overhead here.
Anyway, will use the above benchmarks as an example to verify it.
/* This allocator is optimized for the XDP mode that uses
As the pp page for an skb frag is always a head page, make sure skb_pp_recycle() passes a head page to avoid calling compound_head() for the skb frag page case.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
---
 include/linux/skbuff.h | 2 +-
 net/core/page_pool.c   | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6bdb0db3e825..35eebc2310a5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4722,7 +4722,7 @@ static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
 {
 	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
 		return false;
-	return page_pool_return_skb_page(virt_to_page(data));
+	return page_pool_return_skb_page(virt_to_head_page(data));
 }
 
 #endif /* __KERNEL__ */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f7e71dcb6a2e..357fb53343a0 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -742,8 +742,6 @@ bool page_pool_return_skb_page(struct page *page)
 {
 	struct page_pool *pp;
 
-	page = compound_head(page);
-
 	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
 	 * in order to preserve any existing bits, such as bit 0 for the
 	 * head page of compound page and bit 1 for pfmemalloc page, so
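For context, virt_to_head_page() effectively folds the compound_head() step into the address-to-page lookup, which is why doing it once at the caller lets page_pool_return_skb_page() drop its own compound_head() call. A rough sketch of that relationship (not the actual mm.h implementation):

#include <linux/mm.h>

static inline struct page *example_virt_to_head_page(const void *addr)
{
	struct page *page = virt_to_page(addr);

	/* virt_to_head_page() behaves roughly like this: resolve the page
	 * and then walk to the head of the compound page in one go
	 */
	return compound_head(page);
}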
On Wed, Sep 22, 2021 at 05:41:27PM +0800, Yunsheng Lin wrote:
As the pp page for a skb frag is always a head page, so make sure skb_pp_recycle() passes a head page to avoid calling compound_head() for skb frag page case.
Doesn't that rely mostly on the driver (i.e. what's passed in skb_frag_set_page())? None of the current netstack code assumes bv_page is the head page of a compound page. Since our page_pool allocator will allocate compound pages for order > 0, why should we rely on it?
Thanks /Ilias
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
include/linux/skbuff.h | 2 +- net/core/page_pool.c | 2 -- 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6bdb0db3e825..35eebc2310a5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4722,7 +4722,7 @@ static inline bool skb_pp_recycle(struct sk_buff *skb, void *data) { if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false;
- return page_pool_return_skb_page(virt_to_page(data));
- return page_pool_return_skb_page(virt_to_head_page(data));
}
#endif /* __KERNEL__ */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f7e71dcb6a2e..357fb53343a0 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -742,8 +742,6 @@ bool page_pool_return_skb_page(struct page *page) { struct page_pool *pp;
- page = compound_head(page);
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- in order to preserve any existing bits, such as bit 0 for the
- head page of compound page and bit 1 for pfmemalloc page, so
-- 2.33.0
On 2021/9/23 16:33, Ilias Apalodimas wrote:
On Wed, Sep 22, 2021 at 05:41:27PM +0800, Yunsheng Lin wrote:
As the pp page for a skb frag is always a head page, so make sure skb_pp_recycle() passes a head page to avoid calling compound_head() for skb frag page case.
Doesn't that rely on the driver mostly (i.e what's passed in skb_frag_set_page() ? None of the current netstack code assumes bv_page is the head page of a compound page. Since our page_pool allocator can will allocate compound pages for order > 0, why should we rely on it ?
As the page pool alloc function returns a 'struct page *' to the caller, which is the head page of a compound page for order > 0, I assume the caller will pass that to skb_frag_set_page().
For a non-pp page, I assume it is ok whether the page is a head page or a tail page, as pp_magic for both of them is not set with PP_SIGNATURE.
Or should we play safe here, and do the trick as skb_free_head() does in patch 6?
Thanks /Ilias
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
include/linux/skbuff.h | 2 +- net/core/page_pool.c | 2 -- 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6bdb0db3e825..35eebc2310a5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4722,7 +4722,7 @@ static inline bool skb_pp_recycle(struct sk_buff *skb, void *data) { if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false;
- return page_pool_return_skb_page(virt_to_page(data));
- return page_pool_return_skb_page(virt_to_head_page(data));
}
#endif /* __KERNEL__ */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f7e71dcb6a2e..357fb53343a0 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -742,8 +742,6 @@ bool page_pool_return_skb_page(struct page *page) { struct page_pool *pp;
- page = compound_head(page);
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- in order to preserve any existing bits, such as bit 0 for the
- head page of compound page and bit 1 for pfmemalloc page, so
-- 2.33.0
On Thu, 23 Sept 2021 at 14:24, Yunsheng Lin linyunsheng@huawei.com wrote:
On 2021/9/23 16:33, Ilias Apalodimas wrote:
On Wed, Sep 22, 2021 at 05:41:27PM +0800, Yunsheng Lin wrote:
As the pp page for a skb frag is always a head page, so make sure skb_pp_recycle() passes a head page to avoid calling compound_head() for skb frag page case.
Doesn't that rely on the driver mostly (i.e what's passed in skb_frag_set_page() ? None of the current netstack code assumes bv_page is the head page of a compound page. Since our page_pool allocator can will allocate compound pages for order > 0, why should we rely on it ?
As the page pool alloc function return 'struct page *' to the caller, which is the head page of a compound pages for order > 0, so I assume the caller will pass that to skb_frag_set_page().
Yea that's exactly the assumption I was afraid of. Sure not passing the head page might seem weird atm and the assumption stands, but the point is we shouldn't blow up the entire network stack if someone does that eventually.
For non-pp page, I assume it is ok whether the page is a head page or tail page, as the pp_magic for both of them are not set with PP_SIGNATURE.
Yea that's true, although we removed the checking for coalescing recyclable and non-recyclable SKBs, the next patch first checks the signature before trying to do anything with the skb.
Or should we play safe here, and do the trick as skb_free_head() does in patch 6?
I don't think the &1 will even be measurable, so I'd suggest just dropping this and playing it safe?
Cheers /Ilias
Thanks /Ilias
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
include/linux/skbuff.h | 2 +- net/core/page_pool.c | 2 -- 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6bdb0db3e825..35eebc2310a5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4722,7 +4722,7 @@ static inline bool skb_pp_recycle(struct sk_buff *skb, void *data) { if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false;
- return page_pool_return_skb_page(virt_to_page(data));
- return page_pool_return_skb_page(virt_to_head_page(data));
}
#endif /* __KERNEL__ */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f7e71dcb6a2e..357fb53343a0 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -742,8 +742,6 @@ bool page_pool_return_skb_page(struct page *page) { struct page_pool *pp;
- page = compound_head(page);
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation * in order to preserve any existing bits, such as bit 0 for the * head page of compound page and bit 1 for pfmemalloc page, so
-- 2.33.0
On 2021/9/23 19:47, Ilias Apalodimas wrote:
On Thu, 23 Sept 2021 at 14:24, Yunsheng Lin linyunsheng@huawei.com wrote:
On 2021/9/23 16:33, Ilias Apalodimas wrote:
On Wed, Sep 22, 2021 at 05:41:27PM +0800, Yunsheng Lin wrote:
As the pp page for a skb frag is always a head page, so make sure skb_pp_recycle() passes a head page to avoid calling compound_head() for skb frag page case.
Doesn't that rely on the driver mostly (i.e what's passed in skb_frag_set_page() ? None of the current netstack code assumes bv_page is the head page of a compound page. Since our page_pool allocator can will allocate compound pages for order > 0, why should we rely on it ?
As the page pool alloc function return 'struct page *' to the caller, which is the head page of a compound pages for order > 0, so I assume the caller will pass that to skb_frag_set_page().
Yea that's exactly the assumption I was afraid of. Sure not passing the head page might seem weird atm and the assumption stands, but the point is we shouldn't blow up the entire network stack if someone does that eventually.
For non-pp page, I assume it is ok whether the page is a head page or tail page, as the pp_magic for both of them are not set with PP_SIGNATURE.
Yea that's true, although we removed the checking for coalescing recyclable and non-recyclable SKBs, the next patch first checks the signature before trying to do anything with the skb.
Or should we play safe here, and do the trick as skb_free_head() does in patch 6?
I don't think the &1 will even be measurable, so I'd suggest just dropping this and play safe?
I am not sure what '&1' means above.
The one thing I am not sure about with the trick done in patch 6 is whether __page_frag_cache_drain() is the right API to use here; I used it because it is the only API that expects a head page.
Cheers /Ilias
Thanks /Ilias
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
include/linux/skbuff.h | 2 +- net/core/page_pool.c | 2 -- 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6bdb0db3e825..35eebc2310a5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4722,7 +4722,7 @@ static inline bool skb_pp_recycle(struct sk_buff *skb, void *data) { if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false;
- return page_pool_return_skb_page(virt_to_page(data));
- return page_pool_return_skb_page(virt_to_head_page(data));
}
#endif /* __KERNEL__ */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f7e71dcb6a2e..357fb53343a0 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -742,8 +742,6 @@ bool page_pool_return_skb_page(struct page *page) { struct page_pool *pp;
- page = compound_head(page);
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation * in order to preserve any existing bits, such as bit 0 for the * head page of compound page and bit 1 for pfmemalloc page, so
-- 2.33.0
On Fri, 24 Sept 2021 at 10:33, Yunsheng Lin linyunsheng@huawei.com wrote:
On 2021/9/23 19:47, Ilias Apalodimas wrote:
On Thu, 23 Sept 2021 at 14:24, Yunsheng Lin linyunsheng@huawei.com wrote:
On 2021/9/23 16:33, Ilias Apalodimas wrote:
On Wed, Sep 22, 2021 at 05:41:27PM +0800, Yunsheng Lin wrote:
As the pp page for a skb frag is always a head page, so make sure skb_pp_recycle() passes a head page to avoid calling compound_head() for skb frag page case.
Doesn't that rely on the driver mostly (i.e what's passed in skb_frag_set_page() ? None of the current netstack code assumes bv_page is the head page of a compound page. Since our page_pool allocator can will allocate compound pages for order > 0, why should we rely on it ?
As the page pool alloc function return 'struct page *' to the caller, which is the head page of a compound pages for order > 0, so I assume the caller will pass that to skb_frag_set_page().
Yea that's exactly the assumption I was afraid of. Sure not passing the head page might seem weird atm and the assumption stands, but the point is we shouldn't blow up the entire network stack if someone does that eventually.
For non-pp page, I assume it is ok whether the page is a head page or tail page, as the pp_magic for both of them are not set with PP_SIGNATURE.
Yea that's true, although we removed the checking for coalescing recyclable and non-recyclable SKBs, the next patch first checks the signature before trying to do anything with the skb.
Or should we play safe here, and do the trick as skb_free_head() does in patch 6?
I don't think the &1 will even be measurable, so I'd suggest just dropping this and play safe?
I am not sure what does '&1' mean above.
I meant the check compound_head() is doing before deciding on the head page.
The one thing I am not sure about the trick done in patch 6 is that if __page_frag_cache_drain() is right API to use here, I used it because it is the only API that is expecting a head page.
Yea seemed a bit funny to me in the first place, until I figured out what exactly it was doing.
Regards /Ilias
Cheers /Ilias
Thanks /Ilias
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
include/linux/skbuff.h | 2 +- net/core/page_pool.c | 2 -- 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6bdb0db3e825..35eebc2310a5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -4722,7 +4722,7 @@ static inline bool skb_pp_recycle(struct sk_buff *skb, void *data) { if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false;
- return page_pool_return_skb_page(virt_to_page(data));
- return page_pool_return_skb_page(virt_to_head_page(data));
}
#endif /* __KERNEL__ */ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f7e71dcb6a2e..357fb53343a0 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -742,8 +742,6 @@ bool page_pool_return_skb_page(struct page *page) { struct page_pool *pp;
- page = compound_head(page);
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation * in order to preserve any existing bits, such as bit 0 for the * head page of compound page and bit 1 for pfmemalloc page, so
-- 2.33.0
As page->pp_frag_count needs incrementing for pp page tracking support, change BIAS_MAX to (LONG_MAX / 2) to avoid overflow.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
---
 net/core/page_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 357fb53343a0..e9516477f9d2 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -24,7 +24,7 @@
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
-#define BIAS_MAX	LONG_MAX
+#define BIAS_MAX	(LONG_MAX / 2)
 
 static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params)
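A short illustration of the headroom argument, under the assumption that the pool has charged the full bias up front and __skb_frag_ref() (patch 5) keeps adding references on top; the macro and helper below are purely illustrative and not part of the patch.

#include <linux/limits.h>

/* illustrative only: with BIAS_MAX at LONG_MAX / 2, the other half of the
 * long range is left for increments done by __skb_frag_ref(), so
 * atomic_long_inc() cannot wrap past LONG_MAX in practice
 */
#define EXAMPLE_BIAS_MAX	(LONG_MAX / 2)

static bool example_increment_is_safe(long current_count, long extra_refs)
{
	return extra_refs <= LONG_MAX - current_count;
}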
skb->pp_recycle and page->pp_magic may not be enough to track whether a frag page is from the page pool after __skb_frag_ref() has been called, mostly because of a data race, see: commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets").
There may also be clone-and-expand-head cases that lose track of whether a frag page is from the page pool or not.
Not being able to keep track of pp pages may cause problems for the skb_split() case in tso_fragment() too. Suppose an skb has 3 frag pages, all coming from a page pool, and is split into skb1 and skb2:
skb1: first frag page + first half of second frag page
skb2: second half of second frag page + third frag page
How do we set skb->pp_recycle for skb1 and skb2?
1. If we set both of them to 1, then we may have a similar race as in the above commit for the second frag page.
2. If we set only one of them to 1, then we may have a resource leak, as both the first frag page and the third frag page are indeed from the page pool.
So increment the frag count when __skb_frag_ref() is called, and only use page->pp_magic to indicate whether a frag page is from the page pool, to avoid the above data race.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
---
 drivers/net/ethernet/marvell/sky2.c        |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  2 +-
 include/linux/skbuff.h                     | 30 +++++++++++++++++-----
 include/net/page_pool.h                    | 19 +++++++++++++-
 net/core/page_pool.c                       | 14 +---------
 net/core/skbuff.c                          |  4 +--
 net/tls/tls_device.c                       |  2 +-
 7 files changed, 48 insertions(+), 25 deletions(-)
diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c index e9fc74e54b22..91b62f468a26 100644 --- a/drivers/net/ethernet/marvell/sky2.c +++ b/drivers/net/ethernet/marvell/sky2.c @@ -2503,7 +2503,7 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
if (length == 0) { /* don't need this page */ - __skb_frag_unref(frag, false); + __skb_frag_unref(frag); --skb_shinfo(skb)->nr_frags; } else { size = min(length, (unsigned) PAGE_SIZE); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 7f6d3b82c29b..ce31b1fd554f 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -526,7 +526,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, fail: while (nr > 0) { nr--; - __skb_frag_unref(skb_shinfo(skb)->frags + nr, false); + __skb_frag_unref(skb_shinfo(skb)->frags + nr); } return 0; } diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 35eebc2310a5..a2d3b6fe0c32 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3073,7 +3073,16 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag) */ static inline void __skb_frag_ref(skb_frag_t *frag) { - get_page(skb_frag_page(frag)); + struct page *page = skb_frag_page(frag); + +#ifdef CONFIG_PAGE_POOL + if (page_pool_is_pp_page(page)) { + page_pool_atomic_inc_frag_count(page); + return; + } +#endif + + get_page(page); }
/** @@ -3091,18 +3100,19 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f) /** * __skb_frag_unref - release a reference on a paged fragment. * @frag: the paged fragment - * @recycle: recycle the page if allocated via page_pool * * Releases a reference on the paged fragment @frag * or recycles the page via the page_pool API. */ -static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle) +static inline void __skb_frag_unref(skb_frag_t *frag) { struct page *page = skb_frag_page(frag);
#ifdef CONFIG_PAGE_POOL - if (recycle && page_pool_return_skb_page(page)) + if (page_pool_is_pp_page(page)) { + page_pool_return_skb_page(page); return; + } #endif put_page(page); } @@ -3116,7 +3126,7 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle) */ static inline void skb_frag_unref(struct sk_buff *skb, int f) { - __skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle); + __skb_frag_unref(&skb_shinfo(skb)->frags[f]); }
/** @@ -4720,9 +4730,17 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
static inline bool skb_pp_recycle(struct sk_buff *skb, void *data) { + struct page *page; + if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle) return false; - return page_pool_return_skb_page(virt_to_head_page(data)); + + page = virt_to_head_page(data); + if (!page_pool_is_pp_page(page)) + return false; + + page_pool_return_skb_page(page); + return true; }
#endif /* __KERNEL__ */ diff --git a/include/net/page_pool.h b/include/net/page_pool.h index 3855f069627f..f9738da37d9f 100644 --- a/include/net/page_pool.h +++ b/include/net/page_pool.h @@ -164,7 +164,7 @@ inline enum dma_data_direction page_pool_get_dma_dir(struct page_pool *pool) return pool->p.dma_dir; }
-bool page_pool_return_skb_page(struct page *page); +void page_pool_return_skb_page(struct page *page);
struct page_pool *page_pool_create(const struct page_pool_params *params);
@@ -231,6 +231,23 @@ static inline void page_pool_set_frag_count(struct page *page, long nr) atomic_long_set(&page->pp_frag_count, nr); }
+static inline void page_pool_atomic_inc_frag_count(struct page *page) +{ + atomic_long_inc(&page->pp_frag_count); +} + +static inline bool page_pool_is_pp_page(struct page *page) +{ + /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation + * in order to preserve any existing bits, such as bit 0 for the + * head page of compound page and bit 1 for pfmemalloc page, so + * mask those bits for freeing side when doing below checking, + * and page_is_pfmemalloc() is checked in __page_pool_put_page() + * to avoid recycling the pfmemalloc page. + */ + return (page->pp_magic & ~0x3UL) == PP_SIGNATURE; +} + static inline long page_pool_atomic_sub_frag_count_return(struct page *page, long nr) { diff --git a/net/core/page_pool.c b/net/core/page_pool.c index e9516477f9d2..96a28accce0e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -738,20 +738,10 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid) } EXPORT_SYMBOL(page_pool_update_nid);
-bool page_pool_return_skb_page(struct page *page) +void page_pool_return_skb_page(struct page *page) { struct page_pool *pp;
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation - * in order to preserve any existing bits, such as bit 0 for the - * head page of compound page and bit 1 for pfmemalloc page, so - * mask those bits for freeing side when doing below checking, - * and page_is_pfmemalloc() is checked in __page_pool_put_page() - * to avoid recycling the pfmemalloc page. - */ - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) - return false; - pp = page->pp;
/* Driver set this to memory recycling info. Reset it on recycle. @@ -760,7 +750,5 @@ bool page_pool_return_skb_page(struct page *page) * 'flipped' fragment being in use or not. */ page_pool_put_full_page(pp, page, false); - - return true; } EXPORT_SYMBOL(page_pool_return_skb_page); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 2170bea2c7de..db8af3eff255 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -668,7 +668,7 @@ static void skb_release_data(struct sk_buff *skb) skb_zcopy_clear(skb, true);
for (i = 0; i < shinfo->nr_frags; i++) - __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle); + __skb_frag_unref(&shinfo->frags[i]);
if (shinfo->frag_list) kfree_skb_list(shinfo->frag_list); @@ -3563,7 +3563,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen) fragto = &skb_shinfo(tgt)->frags[merge];
skb_frag_size_add(fragto, skb_frag_size(fragfrom)); - __skb_frag_unref(fragfrom, skb->pp_recycle); + __skb_frag_unref(fragfrom); }
/* Reposition in the original skb */ diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index b932469ee69c..bd9f1567aa39 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -128,7 +128,7 @@ static void destroy_record(struct tls_record_info *record) int i;
for (i = 0; i < record->num_frags; i++) - __skb_frag_unref(&record->frags[i], false); + __skb_frag_unref(&record->frags[i]); kfree(record); }
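To make the skb_split() scenario from the commit message concrete: skb_split() takes an extra reference on the straddling frag via __skb_frag_ref(), which with this patch simply bumps pp_frag_count on that page, and each later __skb_frag_unref() does the mirror operation so only the last user hands the page back to its pool. The function below is an illustrative recap of the new unref logic, not an addition to the patch.

#include <linux/skbuff.h>
#include <net/page_pool.h>

static void example_unref_pp_frag(skb_frag_t *frag)
{
	struct page *page = skb_frag_page(frag);

	if (page_pool_is_pp_page(page)) {
		/* page_pool_return_skb_page() ends up in
		 * page_pool_atomic_sub_frag_count_return(page, 1) and only
		 * recycles or unmaps the page when the count hits zero, so
		 * neither the double-recycle race nor the leak described
		 * above can happen.
		 */
		page_pool_return_skb_page(page);
		return;
	}

	put_page(page);
}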
As pp_magic is used to identify a pp page for an skb's frag page in the previous patch, do similar handling for the head page of an skb too.
And it seems head-to-frag conversion for an skb during GRO and GSO processing does not need handling when using pp_magic to identify the pp page for an skb's head page and frag page; see NAPI_GRO_FREE_STOLEN_HEAD for GRO in skb_gro_receive() and skb_head_frag_to_page_desc() for GSO in skb_segment().
As pp_magic only exists in the head page of a compound page, and the freeing of an skb's head page eventually operates on the head page of a compound page for both pp and non-pp pages, use virt_to_head_page() and __page_frag_cache_drain() in skb_free_head() to avoid the unnecessary virt_to_head_page() call in page_frag_free().
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
---
 include/linux/skbuff.h | 15 ---------------
 net/core/skbuff.c      | 11 +++++++++--
 2 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a2d3b6fe0c32..b77ee060b64d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4728,20 +4728,5 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
 }
 #endif
 
-static inline bool skb_pp_recycle(struct sk_buff *skb, void *data)
-{
-	struct page *page;
-
-	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
-		return false;
-
-	page = virt_to_head_page(data);
-	if (!page_pool_is_pp_page(page))
-		return false;
-
-	page_pool_return_skb_page(page);
-	return true;
-}
-
 #endif /* __KERNEL__ */
 #endif /* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index db8af3eff255..3718898da499 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -647,9 +647,16 @@ static void skb_free_head(struct sk_buff *skb)
 	unsigned char *head = skb->head;
 
 	if (skb->head_frag) {
-		if (skb_pp_recycle(skb, head))
+		struct page *page = virt_to_head_page(head);
+
+#ifdef CONFIG_PAGE_POOL
+		if (page_pool_is_pp_page(page)) {
+			page_pool_return_skb_page(page);
 			return;
-		skb_free_frag(head);
+		}
+#endif
+
+		__page_frag_cache_drain(page, 1);
 	} else {
 		kfree(head);
 	}
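On the __page_frag_cache_drain() choice that also comes up earlier in the thread: for a head page holding a single reference it behaves like page_frag_free() minus the extra virt_to_head_page() lookup, which is exactly what skb_free_head() wants to avoid here. A simplified sketch (the real helpers also deal with the pfmemalloc and compound details):

#include <linux/gfp.h>
#include <linux/mm.h>

static void example_free_head_frag(struct page *head_page)
{
	/* roughly the same effect as page_frag_free(page_address(head_page))
	 * for a single reference, without redoing the virt_to_head_page()
	 * lookup that skb_free_head() already performed
	 */
	__page_frag_cache_drain(head_page, 1);
}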
As we now use pp_magic to identify the pp page for the head and frag pages of an skb, skb->pp_recycle is no longer used, so remove it.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
---
 .../net/ethernet/hisilicon/hns3/hns3_enet.c |  6 ------
 drivers/net/ethernet/marvell/mvneta.c       |  2 --
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c |  4 +---
 drivers/net/ethernet/ti/cpsw.c              |  2 --
 drivers/net/ethernet/ti/cpsw_new.c          |  2 --
 include/linux/skbuff.h                      | 12 +----------
 net/core/skbuff.c                           | 21 +-------------------
 7 files changed, 3 insertions(+), 46 deletions(-)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 22af3d6ce178..5331e0f2cee4 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -3864,9 +3864,6 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length, return 0; }
- if (ring->page_pool) - skb_mark_for_recycle(skb); - u64_stats_update_begin(&ring->syncp); ring->stats.seg_pkt_cnt++; u64_stats_update_end(&ring->syncp); @@ -3906,9 +3903,6 @@ static int hns3_add_frag(struct hns3_enet_ring *ring) return -ENXIO; }
- if (ring->page_pool) - skb_mark_for_recycle(new_skb); - ring->frag_num = 0;
if (ring->tail_skb) { diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 9d460a270601..c852e0dd6d38 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -2327,8 +2327,6 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool, if (!skb) return ERR_PTR(-ENOMEM);
- skb_mark_for_recycle(skb); - skb_reserve(skb, xdp->data - xdp->data_hard_start); skb_put(skb, xdp->data_end - xdp->data); skb->ip_summed = mvneta_rx_csum(pp, desc_status); diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c index d5c92e43f89e..bacae115c6c6 100644 --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c @@ -3994,9 +3994,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi, goto err_drop_frame; }
- if (pp) - skb_mark_for_recycle(skb); - else + if (!pp) dma_unmap_single_attrs(dev->dev.parent, dma_addr, bm_pool->buf_size, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC); diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index 66f7ddd9b1f9..2fb5a4545b8b 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -430,8 +430,6 @@ static void cpsw_rx_handler(void *token, int len, int status) cpts_rx_timestamp(cpsw->cpts, skb); skb->protocol = eth_type_trans(skb, ndev);
- /* mark skb for recycling */ - skb_mark_for_recycle(skb); netif_receive_skb(skb);
ndev->stats.rx_bytes += len; diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c index 7968f24d99c8..1e74d484852d 100644 --- a/drivers/net/ethernet/ti/cpsw_new.c +++ b/drivers/net/ethernet/ti/cpsw_new.c @@ -374,8 +374,6 @@ static void cpsw_rx_handler(void *token, int len, int status) cpts_rx_timestamp(cpsw->cpts, skb); skb->protocol = eth_type_trans(skb, ndev);
- /* mark skb for recycling */ - skb_mark_for_recycle(skb); netif_receive_skb(skb);
ndev->stats.rx_bytes += len; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index b77ee060b64d..d4bb0e160fef 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -668,8 +668,6 @@ typedef unsigned char *sk_buff_data_t; * @head_frag: skb was allocated from page fragments, * not allocated by kmalloc() or vmalloc(). * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves - * @pp_recycle: mark the packet for recycling instead of freeing (implies - * page_pool support on driver) * @active_extensions: active extensions (skb_ext_id types) * @ndisc_nodetype: router type (from link layer) * @ooo_okay: allow the mapping of a socket to a queue to be changed @@ -795,8 +793,7 @@ struct sk_buff { fclone:2, peeked:1, head_frag:1, - pfmemalloc:1, - pp_recycle:1; /* page_pool recycle indicator */ + pfmemalloc:1; #ifdef CONFIG_SKB_EXTENSIONS __u8 active_extensions; #endif @@ -4721,12 +4718,5 @@ static inline u64 skb_get_kcov_handle(struct sk_buff *skb) #endif }
-#ifdef CONFIG_PAGE_POOL -static inline void skb_mark_for_recycle(struct sk_buff *skb) -{ - skb->pp_recycle = 1; -} -#endif - #endif /* __KERNEL__ */ #endif /* _LINUX_SKBUFF_H */ diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3718898da499..85ae59f4349a 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -670,7 +670,7 @@ static void skb_release_data(struct sk_buff *skb) if (skb->cloned && atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1, &shinfo->dataref)) - goto exit; + return;
skb_zcopy_clear(skb, true);
@@ -681,17 +681,6 @@ static void skb_release_data(struct sk_buff *skb) kfree_skb_list(shinfo->frag_list);
skb_free_head(skb); -exit: - /* When we clone an SKB we copy the reycling bit. The pp_recycle - * bit is only set on the head though, so in order to avoid races - * while trying to recycle fragments on __skb_frag_unref() we need - * to make one SKB responsible for triggering the recycle path. - * So disable the recycling bit if an SKB is cloned and we have - * additional references to to the fragmented part of the SKB. - * Eventually the last SKB will have the recycling bit set and it's - * dataref set to 0, which will trigger the recycling - */ - skb->pp_recycle = 0; }
/* @@ -1073,7 +1062,6 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb) n->nohdr = 0; n->peeked = 0; C(pfmemalloc); - C(pp_recycle); n->destructor = NULL; C(tail); C(end); @@ -5368,13 +5356,6 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, if (skb_cloned(to)) return false;
- /* The page pool signature of struct page will eventually figure out - * which pages can be recycled or not but for now let's prohibit slab - * allocated and page_pool allocated SKBs from being coalesced. - */ - if (to->pp_recycle != from->pp_recycle) - return false; - if (len <= skb_tailroom(to)) { if (len) BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
Hi Yunsheng,
On Wed, Sep 22, 2021 at 05:41:24PM +0800, Yunsheng Lin wrote:
Patch 1: disable dma mapping support for 32-bit arch with 64-bit DMA. Patch 2: support non-split page when PP_FLAG_PAGE_FRAG is set. patch 3: avoid calling compound_head() for skb frag page Patch 4-7: use pp_magic to identify pp page uniquely.
There's some subtle changes in this patchset that might affect XDP.
What I forgot when I proposed removing the recycling bit, is that it also serves as an 'opt-in' mechanism for drivers that want to use page_pool but do the recycling internally. With that removed we need to make sure nothing bad happens to them. In theory the page refcnt for mlx5 specifically will be elevated, so we'll just end up unmapping the buffer. Arguably we could add a similar mechanism internally into page pool, which would allow us to enable and disable recycling, but that's an extra if per packet allocation and I don't know if we want that on the XDP case. A few numbers pre/post patch for XDP would help, but iirc hns3 doesn't have XDP support yet?
It's plumbers week so I'll do some testing starting Monday.
Thanks /Ilias
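For reference, a simplified (and deliberately not the exact upstream code) view of what the recycle path does when it meets a pp page whose refcount a driver-internal cache has elevated; this is the "we'll just end up unmapping the buffer" case mentioned above.

#include <linux/mm.h>
#include <net/page_pool.h>

static void example_return_pp_page(struct page_pool *pool, struct page *page)
{
	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
		/* sole owner: the page can go back into the pool's cache */
		page_pool_recycle_direct(pool, page);
		return;
	}

	/* someone else (e.g. the driver's own cache) still holds the page:
	 * unmap it and drop our reference instead of recycling
	 */
	page_pool_release_page(pool, page);
	put_page(page);
}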
V3: 1. add patch 1/4/6/7. 2. use pp_magic to identify pp page uniquely too. 3. avoid unnecessary compound_head() calling.
V2: add patch 2, adjust the commit log accroding to the discussion in V1, and fix a compiler error reported by kernel test robot.
Yunsheng Lin (7): page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA page_pool: support non-split page with PP_FLAG_PAGE_FRAG pool_pool: avoid calling compound_head() for skb frag page page_pool: change BIAS_MAX to support incrementing skbuff: keep track of pp page when __skb_frag_ref() is called skbuff: only use pp_magic identifier for a skb' head page skbuff: remove unused skb->pp_recycle
.../net/ethernet/hisilicon/hns3/hns3_enet.c | 6 --- drivers/net/ethernet/marvell/mvneta.c | 2 - .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 4 +- drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- drivers/net/ethernet/ti/cpsw.c | 2 - drivers/net/ethernet/ti/cpsw_new.c | 2 - include/linux/mm_types.h | 13 +----- include/linux/skbuff.h | 39 ++++++++---------- include/net/page_pool.h | 31 ++++++++------ net/core/page_pool.c | 40 +++++++------------ net/core/skbuff.c | 36 ++++++----------- net/tls/tls_device.c | 2 +- 13 files changed, 67 insertions(+), 114 deletions(-)
-- 2.33.0
On 2021/9/23 15:07, Ilias Apalodimas wrote:
Hi Yunsheng,
On Wed, Sep 22, 2021 at 05:41:24PM +0800, Yunsheng Lin wrote:
Patch 1: disable dma mapping support for 32-bit arch with 64-bit DMA. Patch 2: support non-split page when PP_FLAG_PAGE_FRAG is set. patch 3: avoid calling compound_head() for skb frag page Patch 4-7: use pp_magic to identify pp page uniquely.
There's some subtle changes in this patchset that might affect XDP.
What I forgot when I proposed removing the recycling bit, is that it also serves as an 'opt-in' mechanism for drivers that want to use page_pool but do the recycling internally. With that removed we need to make sure nothing bad happens to them. In theory the page refcnt for mlx5
It seems odd that mlx5 is adding its own page cache on top of page pool; is it about supporting both "struct sk_buff" and "struct xdp_buff" for the same queue?
specifically will be elevated, so we'll just end up unmapping the buffer. Arguably we could add a similar mechanism internally into page pool, which would allow us to enable and disable recycling, but that's an extra if per packet allocation and I don't know if we want that on the XDP case.
Or we could change mlx5e_rx_cache_get() to check for "page->pp_frag_count == 1" too, and adjust mlx5e_page_release() accordingly?
A few numbers pre/post patch for XDP would help, but iirc hns3 doesn't have XDP support yet?
You are right, hns3 doesn't have XDP support yet.
It's plumbers week so I'll do some testing starting Monday.
Thanks /Ilias
V3: 1. add patch 1/4/6/7. 2. use pp_magic to identify pp page uniquely too. 3. avoid unnecessary compound_head() calling.
V2: add patch 2, adjust the commit log accroding to the discussion in V1, and fix a compiler error reported by kernel test robot.
Yunsheng Lin (7): page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA page_pool: support non-split page with PP_FLAG_PAGE_FRAG pool_pool: avoid calling compound_head() for skb frag page page_pool: change BIAS_MAX to support incrementing skbuff: keep track of pp page when __skb_frag_ref() is called skbuff: only use pp_magic identifier for a skb' head page skbuff: remove unused skb->pp_recycle
.../net/ethernet/hisilicon/hns3/hns3_enet.c | 6 --- drivers/net/ethernet/marvell/mvneta.c | 2 - .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 4 +- drivers/net/ethernet/marvell/sky2.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- drivers/net/ethernet/ti/cpsw.c | 2 - drivers/net/ethernet/ti/cpsw_new.c | 2 - include/linux/mm_types.h | 13 +----- include/linux/skbuff.h | 39 ++++++++---------- include/net/page_pool.h | 31 ++++++++------ net/core/page_pool.c | 40 +++++++------------ net/core/skbuff.c | 36 ++++++----------- net/tls/tls_device.c | 2 +- 13 files changed, 67 insertions(+), 114 deletions(-)
-- 2.33.0