[PATCH OLK-6.6 0/6] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
Track DMA-mapped pages and unmap them when destroying the pool Dong Chenchen (1): page_pool: Fix kabi broken in struct page_pool Jakub Kicinski (1): net: page_pool: factor out releasing DMA from releasing the page Jesper Dangaard Brouer (1): mm/page_pool: catch page_pool memory leaks Toke Høiland-Jørgensen (3): page_pool: Move pp_magic check into helper functions page_pool: Track DMA-mapped pages and unmap them when destroying the pool page_pool: Fix PP_MAGIC_MASK to avoid crashing on some 32-bit arches .../net/ethernet/mellanox/mlx5/core/en/xdp.c | 4 +- include/linux/mm.h | 68 ++++++++ include/linux/poison.h | 4 + include/net/page_pool/types.h | 6 +- mm/page_alloc.c | 3 + net/core/page_pool.c | 148 +++++++++++++++--- net/core/skbuff.c | 9 +- net/core/xdp.c | 4 +- 8 files changed, 215 insertions(+), 31 deletions(-) -- 2.25.1
Feedback: The patch(es) you have sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/19350 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/7RS...
From: Jesper Dangaard Brouer <hawk@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit dba1b8a7ab6853a84bf3afdbeac1c2f2370d3444 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPKD5 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Pages belonging to a page_pool (PP) instance must be freed through the PP APIs in order to correctly release any DMA mappings and release the refcount on the DMA device when freeing the PP instance. When PP releases a page (page_pool_release_page), the page->pp_magic value is cleared. This patch detects a leaked PP page in free_page_is_bad() via the unexpected state of the page->pp_magic value still being PP_SIGNATURE. We choose to report and treat it as a bad page. It would be possible to release the page by returning it to the PP instance, as the page->pp pointer is likely still valid. Notice this code is only activated when CONFIG_PAGE_POOL is enabled together with either CONFIG_DEBUG_VM or the boot cmdline debug_pagealloc=on. Reduced example output of a leak with PP_SIGNATURE = dead000000000040: BUG: Bad page state in process swapper/4 pfn:141fa6 page:000000006dbf8062 refcount:0 mapcount:0 mapping:0000000000000000 index:0x141fa6000 pfn:0x141fa6 flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 002fffff80000000 dead000000000040 ffff88814888a000 0000000000000000 raw: 0000000141fa6000 0000000000000001 00000000ffffffff 0000000000000000 page dumped because: page_pool leak [...] Call Trace: <IRQ> dump_stack_lvl+0x32/0x50 bad_page+0x70/0xf0 free_unref_page_prepare+0x263/0x430 free_unref_page+0x34/0x130 mlx5e_free_rx_mpwqe+0x190/0x1c0 [mlx5_core] mlx5e_post_rx_mpwqes+0x1ac/0x280 [mlx5_core] mlx5e_napi_poll+0x12b/0x710 [mlx5_core] ? skb_free_head+0x4f/0x90 __napi_poll+0x2b/0x1c0 net_rx_action+0x27b/0x360 The advantage is that the Call Trace points directly to the function leaking the PP page, which in this case is a deliberate bug introduced into the mlx5 driver to test this code change. Currently PP periodically printks the warning "stalled pool shutdown" in page_pool_release_retry(), which cannot be directly correlated to a leak and might just as well be a false positive due to SKBs being stuck on a socket for an extended period. After this patch we should be able to remove this printk. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> --- mm/page_alloc.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ca6c8d356c98..8edede20ad51 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -877,6 +877,9 @@ static inline bool page_expected_state(struct page *page, page_ref_count(page) | #ifdef CONFIG_MEMCG page->memcg_data | +#endif +#ifdef CONFIG_PAGE_POOL + ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) | #endif (page->flags & check_flags))) return false; @@ -903,6 +906,10 @@ static const char *page_bad_reason(struct page *page, unsigned long flags) #ifdef CONFIG_MEMCG if (unlikely(page->memcg_data)) bad_reason = "page still charged to cgroup"; +#endif +#ifdef CONFIG_PAGE_POOL + if (unlikely((page->pp_magic & ~0x3UL) == PP_SIGNATURE)) + bad_reason = "page_pool leak"; #endif return bad_reason; } -- 2.25.1
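To make the failure mode concrete: the new check fires when a page that still belongs to a page_pool is handed to the regular page allocator instead of being returned through the PP APIs. Below is a minimal sketch of the two paths; the struct and function names (drv_rx_buf, drv_free_rx_buf*) are hypothetical and not taken from the mlx5 driver, only the page_pool calls are real.

#include <linux/mm.h>
#include <net/page_pool/helpers.h>

struct drv_rx_buf {
	struct page *page;
};

/* Correct path: return the page to its pool, which clears pp_magic and
 * eventually releases the DMA mapping.
 */
static void drv_free_rx_buf(struct drv_rx_buf *buf)
{
	page_pool_put_full_page(buf->page->pp, buf->page, false);
}

/* Leaky path: the page reaches the page allocator with pp_magic still set
 * to PP_SIGNATURE, so the free path now reports "page_pool leak" with a
 * Call Trace pointing at this function.
 */
static void drv_free_rx_buf_leaky(struct drv_rx_buf *buf)
{
	put_page(buf->page);
}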
From: Toke Høiland-Jørgensen <toke@redhat.com> mainline inclusion from mainline-v6.16-rc1 commit cd3c93167da0e760b5819246eae7a4ea30fd014b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPKD5 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Since we are about to stash some more information into the pp_magic field, let's move the magic signature checks into a pair of helper functions so it can be changed in one place. Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redh... Signed-off-by: Jakub Kicinski <kuba@kernel.org> Conflicts: include/linux/mm.h net/core/netmem_priv.h net/core/skbuff.c net/core/xdp.c [commits 5796d3967c09 and 239e9a90c887 are not backported, leading to conflicts in mm.h; commits 8ab79ed50cf1 and f7dc3248dcfb are not backported, leading to conflicts in other files] Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> --- .../net/ethernet/mellanox/mlx5/core/en/xdp.c | 4 ++-- include/linux/mm.h | 20 +++++++++++++++++++ mm/page_alloc.c | 8 ++------ net/core/skbuff.c | 9 +-------- net/core/xdp.c | 4 ++-- 5 files changed, 27 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c index b723ff5e5249..2ef779180099 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c @@ -662,8 +662,8 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq, xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo); page = xdpi.page.page; - /* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) - * as we know this is a page_pool page. + /* No need to check page_pool_page_is_pp() as we + * know this is a page_pool page. */ page_pool_recycle_direct(page->pp, page); } while (++n < num); diff --git a/include/linux/mm.h b/include/linux/mm.h index 8999dcf606fa..cdf1c47c872b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4355,4 +4355,24 @@ static inline bool vma_is_peer_shared(struct vm_area_struct *vma) } #endif +/* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is + * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for + * the head page of compound page and bit 1 for pfmemalloc page. + * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling + * the pfmemalloc page.
+ */ +#define PP_MAGIC_MASK ~0x3UL + +#ifdef CONFIG_PAGE_POOL +static inline bool page_pool_page_is_pp(struct page *page) +{ + return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE; +} +#else +static inline bool page_pool_page_is_pp(struct page *page) +{ + return false; +} +#endif + #endif /* _LINUX_MM_H */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8edede20ad51..a11cf7843bf6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -878,9 +878,7 @@ static inline bool page_expected_state(struct page *page, #ifdef CONFIG_MEMCG page->memcg_data | #endif -#ifdef CONFIG_PAGE_POOL - ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) | -#endif + page_pool_page_is_pp(page) | (page->flags & check_flags))) return false; @@ -907,10 +905,8 @@ static const char *page_bad_reason(struct page *page, unsigned long flags) if (unlikely(page->memcg_data)) bad_reason = "page still charged to cgroup"; #endif -#ifdef CONFIG_PAGE_POOL - if (unlikely((page->pp_magic & ~0x3UL) == PP_SIGNATURE)) + if (unlikely(page_pool_page_is_pp(page))) bad_reason = "page_pool leak"; -#endif return bad_reason; } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index ea5e9d46d4c4..1d5958c1aeed 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -899,14 +899,7 @@ bool napi_pp_put_page(struct page *page, bool napi_safe) page = compound_head(page); - /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation - * in order to preserve any existing bits, such as bit 0 for the - * head page of compound page and bit 1 for pfmemalloc page, so - * mask those bits for freeing side when doing below checking, - * and page_is_pfmemalloc() is checked in __page_pool_put_page() - * to avoid recycling the pfmemalloc page. - */ - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) + if (unlikely(!page_pool_page_is_pp(page))) return false; pp = page->pp; diff --git a/net/core/xdp.c b/net/core/xdp.c index 5ee3f8f165e5..d816456329c9 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -380,8 +380,8 @@ void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct, page = virt_to_head_page(data); if (napi_direct && xdp_return_frame_no_direct()) napi_direct = false; - /* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) - * as mem->type knows this a page_pool page + /* No need to check page_pool_page_is_pp() as mem->type + * knows this a page_pool page */ page_pool_put_full_page(page->pp, page, napi_direct); break; -- 2.25.1
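The reason the helper masks the two low bits before comparing is that they may legitimately be set on a live page_pool page. A tiny userspace sketch (not kernel code; it assumes PP_SIGNATURE is 0x40, i.e. POISON_POINTER_DELTA == 0) showing that the check is insensitive to those bits:

#include <assert.h>
#include <stdio.h>

#define PP_SIGNATURE  0x40UL   /* 0x40 + POISON_POINTER_DELTA, delta assumed 0 here */
#define PP_MAGIC_MASK (~0x3UL)

int main(void)
{
	/* Bit 0 (compound head) and/or bit 1 (pfmemalloc) may be set on top
	 * of the signature; masking them keeps the identification stable.
	 */
	unsigned long pp_magic = PP_SIGNATURE | 0x2UL;

	assert((pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE);
	printf("pp_magic=%#lx is still recognised as a page_pool page\n", pp_magic);
	return 0;
}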
From: Jakub Kicinski <kuba@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit c3f687d8dfeb33cffbb8f47c30002babfc4895d2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPKD5 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... ------------------------------------------ Releasing the DMA mapping will be useful for other types of pages, so factor it out. Make sure compiler inlines it, to avoid any regressions. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Conflicts: net/core/page_pool.c [commit 7aee8429eedd is not backport] Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> --- net/core/page_pool.c | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/net/core/page_pool.c b/net/core/page_pool.c index cb7238043a33..7e3404ee40d9 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -506,21 +506,16 @@ static s32 page_pool_inflight(struct page_pool *pool) return inflight; } -/* Disconnects a page (from a page_pool). API users can have a need - * to disconnect a page (from a page_pool), to allow it to be used as - * a regular page (that will eventually be returned to the normal - * page-allocator via put_page). - */ -static void page_pool_return_page(struct page_pool *pool, struct page *page) +static __always_inline +void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) { dma_addr_t dma; - int count; if (!(pool->p.flags & PP_FLAG_DMA_MAP)) /* Always account for inflight pages, even if we didn't * map them */ - goto skip_dma_unmap; + return; dma = page_pool_get_dma_addr(page); @@ -529,7 +524,19 @@ static void page_pool_return_page(struct page_pool *pool, struct page *page) PAGE_SIZE << pool->p.order, pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); page_pool_set_dma_addr(page, 0); -skip_dma_unmap: +} + +/* Disconnects a page (from a page_pool). API users can have a need + * to disconnect a page (from a page_pool), to allow it to be used as + * a regular page (that will eventually be returned to the normal + * page-allocator via put_page). + */ +void page_pool_return_page(struct page_pool *pool, struct page *page) +{ + int count; + + __page_pool_release_page_dma(pool, page); + page_pool_clear_pp_info(page); /* This may be the last page returned, releasing the pool, so -- 2.25.1
From: Toke Høiland-Jørgensen <toke@redhat.com> mainline inclusion from mainline-v6.16-rc1 commit ee62ce7a1d909ccba0399680a03c2dee83bcae95 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPKD5 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... ------------------------------------------ When enabling DMA mapping in page_pool, pages are kept DMA mapped until they are released from the pool, to avoid the overhead of re-mapping the pages every time they are used. This causes resource leaks and/or crashes when there are pages still outstanding while the device is torn down, because page_pool will attempt an unmap through a non-existent DMA device on the subsequent page return. To fix this, implement a simple tracking of outstanding DMA-mapped pages in page pool using an xarray. This was first suggested by Mina [0], and turns out to be fairly straightforward: we simply store pointers to pages directly in the xarray with xa_alloc() when they are first DMA mapped, and remove them from the array on unmap. Then, when a page pool is torn down, it can simply walk the xarray and unmap all pages still present there before returning, which also allows us to get rid of the get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional synchronisation is needed, as a page will only ever be unmapped once. To avoid having to walk the entire xarray on unmap to find the page reference, we stash the ID assigned by xa_alloc() into the page structure itself, using the upper bits of the pp_magic field. This requires a couple of defines to avoid conflicting with the POISON_POINTER_DELTA define, but this is all evaluated at compile-time, so does not affect run-time performance. The bitmap calculations in this patch give the following number of bits for different architectures: - 23 bits on 32-bit architectures - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE) - 32 bits on other 64-bit architectures Stashing a value into the unused bits of pp_magic does have the effect that it can make the value stored there lie outside the unmappable range (as governed by the mmap_min_addr sysctl), for architectures that don't define ILLEGAL_POINTER_VALUE. This means that if one of the pointers that is aliased to the pp_magic field (such as page->lru.next) is dereferenced while the page is owned by page_pool, that could lead to a dereference into userspace, which is a security concern. The risk of this is mitigated by the fact that (a) we always clear pp_magic before releasing a page from page_pool, and (b) this would need a use-after-free bug for struct page, which can have many other risks since page->lru.next is used as a generic list pointer in multiple places in the kernel. As such, with this patch we take the position that this risk is negligible in practice. For more discussion, see [1]. Since all the tracking added in this patch is performed on DMA map/unmap, no additional code is needed in the fast path, meaning the performance overhead of this tracking is negligible there. A micro-benchmark shows that the total overhead of the tracking itself is about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap [2]). Since this cost is only paid on DMA map and unmap, it seems like an acceptable cost to fix the late unmap issue. Further optimisation can narrow the cases where this cost is paid (for instance by eliding the tracking when DMA map/unmap is a no-op).
The extra memory needed to track the pages is neatly encapsulated inside xarray, which uses the 'struct xa_node' structure to track items. This structure is 576 bytes long, with slots for 64 items, meaning that a full node incurs only 9 bytes of overhead per slot it tracks (in practice, it probably won't be this efficient, but in any case it should be an acceptable overhead). [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJc... [1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com [2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com Reported-by: Yonglong Liu <liuyonglong@huawei.com> Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code") Suggested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Qiuling Ren <qren@redhat.com> Tested-by: Yuying Ma <yuma@redhat.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redh... Signed-off-by: Jakub Kicinski <kuba@kernel.org> Conflicts: include/linux/mm.h include/net/page_pool/types.h include/linux/poison.h net/core/netmem_priv.h net/core/page_pool.c [commits 5796d3967c09, 239e9a90c887 and 1a251f52cfdc are not backported, leading to conflicts in mm.h. Commits 8ab79ed50cf1, 57afb4830157, 4dec64c52e24 (use netmem) and 403f11ac9ab7 (add dma_sync) are not backported, leading to conflicts in other files] Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> --- include/linux/mm.h | 52 +++++++++++++++++++-- include/linux/poison.h | 4 ++ include/net/page_pool/types.h | 6 +++ net/core/page_pool.c | 86 +++++++++++++++++++++++++++++++---- 4 files changed, 135 insertions(+), 13 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index cdf1c47c872b..346a19e9749d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4355,13 +4355,57 @@ static inline bool vma_is_peer_shared(struct vm_area_struct *vma) } #endif +/* + * DMA mapping IDs for page_pool + * + * When DMA-mapping a page, page_pool allocates an ID (from an xarray) and + * stashes it in the upper bits of page->pp_magic. We always want to be able to + * unambiguously identify page pool pages (using page_pool_page_is_pp()). Non-PP + * pages can have arbitrary kernel pointers stored in the same field as pp_magic + * (since it overlaps with page->lru.next), so we must ensure that we cannot + * mistake a valid kernel pointer with any of the values we write into this + * field. + * + * On architectures that set POISON_POINTER_DELTA, this is already ensured, + * since this value becomes part of PP_SIGNATURE; meaning we can just use the + * space between the PP_SIGNATURE value (without POISON_POINTER_DELTA), and the + * lowest bits of POISON_POINTER_DELTA. On arches where POISON_POINTER_DELTA is + * 0, we make sure that we leave the two topmost bits empty, as that guarantees + * we won't mistake a valid kernel pointer for a value we set, regardless of the + * VMSPLIT setting.
+ * + * Altogether, this means that the number of bits available is constrained by + * the size of an unsigned long (at the upper end, subtracting two bits per the + * above), and the definition of PP_SIGNATURE (with or without + * POISON_POINTER_DELTA). + */ +#define PP_DMA_INDEX_SHIFT (1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA)) + +/* Hardcode the MIN macro in the below commit: + * 1a251f52cfdc ("minmax: make generic MIN() and MAX() macros available everywhere") + */ +#define PP_MIN(x, y) ((x) < (y) ? (x) : (y)) + +#if POISON_POINTER_DELTA > 0 +/* PP_SIGNATURE includes POISON_POINTER_DELTA, so limit the size of the DMA + * index to not overlap with that if set + */ +#define PP_DMA_INDEX_BITS PP_MIN(32, __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT) +#else +/* Always leave out the topmost two; see above. */ +#define PP_DMA_INDEX_BITS PP_MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 2) +#endif + +#define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \ + PP_DMA_INDEX_SHIFT) + /* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for - * the head page of compound page and bit 1 for pfmemalloc page. - * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling - * the pfmemalloc page. + * the head page of compound page and bit 1 for pfmemalloc page, as well as the + * bits used for the DMA index. page_is_pfmemalloc() is checked in + * __page_pool_put_page() to avoid recycling the pfmemalloc page. */ -#define PP_MAGIC_MASK ~0x3UL +#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL) #ifdef CONFIG_PAGE_POOL static inline bool page_pool_page_is_pp(struct page *page) diff --git a/include/linux/poison.h b/include/linux/poison.h index 851a855d3868..96f09b600af2 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -79,6 +79,10 @@ #define KEY_DESTROY 0xbd /********** net/core/page_pool.c **********/ +/* + * page_pool uses additional free bits within this value to store data, see the + * definition of PP_DMA_INDEX_MASK in mm.h + */ #define PP_SIGNATURE (0x40 + POISON_POINTER_DELTA) /********** net/core/skbuff.c **********/ diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index b5b6d0438c38..8c796de47b21 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -6,6 +6,7 @@ #include <linux/dma-direction.h> #include <linux/ptr_ring.h> #include <linux/kabi.h> +#include <linux/xarray.h> #define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA * map/unmap @@ -23,6 +24,9 @@ PP_FLAG_DMA_SYNC_DEV |\ PP_FLAG_PAGE_FRAG) +/* Index limit to stay within PP_DMA_INDEX_BITS for DMA indices */ +#define PP_DMA_INDEX_LIMIT XA_LIMIT(1, BIT(PP_DMA_INDEX_BITS) - 1) + /* * Fast allocation side cache array/stack * @@ -168,6 +172,8 @@ struct page_pool { */ struct ptr_ring ring; + struct xarray dma_mapped; + #ifdef CONFIG_PAGE_POOL_STATS /* recycle stats are per-cpu to avoid locking */ struct page_pool_recycle_stats __percpu *recycle_stats; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 7e3404ee40d9..f55aa48f528b 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -233,8 +233,7 @@ static int page_pool_init(struct page_pool *pool, /* Driver calling page_pool_create() also call page_pool_destroy() */ refcount_set(&pool->user_cnt, 1); - if (pool->p.flags & PP_FLAG_DMA_MAP) - get_device(pool->p.dev); + xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1); return 0; } @@ -347,9 +346,24 @@ static 
void page_pool_dma_sync_for_device(struct page_pool *pool, pool->p.dma_dir); } -static bool page_pool_dma_map(struct page_pool *pool, struct page *page) +static unsigned long page_get_dma_index(struct page *page) +{ + return (page->pp_magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT; +} + +static void page_set_dma_index(struct page *page, unsigned long id) +{ + unsigned long magic; + + magic = page->pp_magic | (id << PP_DMA_INDEX_SHIFT); + page->pp_magic = magic; +} + +static bool page_pool_dma_map(struct page_pool *pool, struct page *page, gfp_t gfp) { dma_addr_t dma; + int err; + u32 id; /* Setup DMA mapping: use 'struct page' area for storing DMA-addr * since dma_addr_t can be either 32 or 64 bits and does not always fit @@ -363,7 +377,23 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) if (dma_mapping_error(pool->p.dev, dma)) return false; + if (in_softirq()) + err = xa_alloc(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, + gfp); + else + err = xa_alloc_bh(&pool->dma_mapped, &id, page, + PP_DMA_INDEX_LIMIT, gfp); + + if (err) { + WARN_ONCE(err != -ENOMEM, "couldn't track DMA mapping, please report to netdev@"); + dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order, + pool->p.dma_dir, + DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); + return false; + } + page_pool_set_dma_addr(page, dma); + page_set_dma_index(page, id); if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) page_pool_dma_sync_for_device(pool, page, pool->p.max_len); @@ -397,7 +427,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool, return NULL; if ((pool->p.flags & PP_FLAG_DMA_MAP) && - unlikely(!page_pool_dma_map(pool, page))) { + unlikely(!page_pool_dma_map(pool, page, gfp))) { put_page(page); return NULL; } @@ -444,7 +474,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, for (i = 0; i < nr_pages; i++) { page = pool->alloc.cache[i]; if ((pp_flags & PP_FLAG_DMA_MAP) && - unlikely(!page_pool_dma_map(pool, page))) { + unlikely(!page_pool_dma_map(pool, page, gfp))) { put_page(page); continue; } @@ -509,6 +539,8 @@ static s32 page_pool_inflight(struct page_pool *pool) static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) { + struct page *old; + unsigned long id; dma_addr_t dma; if (!(pool->p.flags & PP_FLAG_DMA_MAP)) @@ -517,6 +549,18 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) */ return; + id = page_get_dma_index(page); + if (!id) + return; + + if (in_softirq()) + old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0); + else + old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0); + + if (old != page) + return; + dma = page_pool_get_dma_addr(page); /* When page is unmapped, it cannot be returned to our pool */ @@ -524,6 +568,7 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) PAGE_SIZE << pool->p.order, pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); page_pool_set_dma_addr(page, 0); + page_set_dma_index(page, 0); } /* Disconnects a page (from a page_pool). 
API users can have a need @@ -609,9 +654,13 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) { /* Read barrier done in page_ref_count / READ_ONCE */ - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) { + /* re-check under rcu_read_lock() to sync with page_pool_scrub() */ + rcu_read_lock(); page_pool_dma_sync_for_device(pool, page, dma_sync_size); + rcu_read_unlock(); + } if (allow_direct && in_softirq() && page_pool_recycle_in_cache(page, pool)) @@ -813,8 +862,7 @@ static void page_pool_free(struct page_pool *pool) ptr_ring_cleanup(&pool->ring, NULL); - if (pool->p.flags & PP_FLAG_DMA_MAP) - put_device(pool->p.dev); + xa_destroy(&pool->dma_mapped); #ifdef CONFIG_PAGE_POOL_STATS free_percpu(pool->recycle_stats); @@ -841,8 +889,28 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool) static void page_pool_scrub(struct page_pool *pool) { + unsigned long id; + void *ptr; + page_pool_empty_alloc_cache_once(pool); - pool->destroy_cnt++; + if (!pool->destroy_cnt++ && pool->p.flags & PP_FLAG_DMA_MAP) { + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) { + /* Disable page_pool_dma_sync_for_device() */ + pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV; + + /* Make sure all concurrent returns that may see the old + * value of dma_sync (and thus perform a sync) have + * finished before doing the unmapping below. Skip the + * wait if the device doesn't actually need syncing, or + * if there are no outstanding mapped pages. + */ + if (!xa_empty(&pool->dma_mapped)) + synchronize_net(); + } + + xa_for_each(&pool->dma_mapped, id, ptr) + __page_pool_release_page_dma(pool, ptr); + } /* No more consumers should exist, but producers could still * be in-flight. -- 2.25.1
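The per-architecture bit counts quoted in the commit message can be reproduced with a small userspace sketch of the PP_DMA_INDEX arithmetic (not kernel code; the POISON_POINTER_DELTA values below are the usual ILLEGAL_POINTER_VALUE settings for x86-64-style 64-bit and PPC64, assumed here for illustration, with 0 standing in for a 32-bit arch):

#include <stdio.h>

static int fls_ul(unsigned long long x) { int n = -1; while (x) { x >>= 1; n++; } return n; }
static int ffs_ul(unsigned long long x) { int n = 0; while (!(x & 1)) { x >>= 1; n++; } return n; }
static int min_int(int a, int b) { return a < b ? a : b; }

static void report(const char *arch, unsigned long long delta, int bits_per_long)
{
	unsigned long long pp_signature = 0x40ULL + delta;
	/* PP_DMA_INDEX_SHIFT = 1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA): always 7 */
	int shift = 1 + fls_ul(pp_signature - delta);
	int bits;

	if (delta)
		bits = min_int(32, ffs_ul(delta) - shift);
	else
		bits = min_int(32, bits_per_long - shift - 2);

	printf("%-22s shift=%d  DMA index bits=%d\n", arch, shift, bits);
}

int main(void)
{
	report("64-bit (x86-64 etc.)", 0xdead000000000000ULL, 64); /* -> 32 bits */
	report("PPC64",                0x5deadbeef0000000ULL, 64); /* -> 21 bits */
	report("32-bit, delta == 0",   0,                     32); /* -> 23 bits */
	return 0;
}

Running this prints 32, 21 and 23 bits respectively, matching the numbers in the commit message (the x86-64 delta is also visible in the PP_SIGNATURE value dead000000000040 from the bad-page report in patch 1).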
From: Toke Høiland-Jørgensen <toke@redhat.com> mainline inclusion from mainline-v6.18-rc1 commit 95920c2ed02bde551ab654e9749c2ca7bc3100e0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPKD5 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... ------------------------------------------ Helge reported that the introduction of PP_MAGIC_MASK led to crashes on boot on his 32-bit parisc machine. The cause of this is that the mask is set too wide, so page_pool_page_is_pp() incurs false positives which crash the machine. Just disabling the check in page_pool_page_is_pp() will lead to the page_pool code itself malfunctioning; so instead of doing this, this patch changes the define for PP_DMA_INDEX_BITS to avoid mistaking arbitrary kernel pointers for page_pool-tagged pages. The fix relies on the kernel pointers that alias with the pp_magic field always being above PAGE_OFFSET. With this assumption, we can use the lowest bit of the value of PAGE_OFFSET as the upper bound of the PP_DMA_INDEX_MASK, which should avoid the false positives. Because we cannot rely on PAGE_OFFSET always being a compile-time constant, nor on it always being >0, we fall back to disabling the dma_index storage when there are not enough bits available. This leaves us in the situation we were in before the patch in the Fixes tag, but only on a subset of architecture configurations. This seems to be the best we can do until the transition to page types is complete for page_pool pages. v2: - Make sure there are at least 8 bits available and that the PAGE_OFFSET bit calculation doesn't wrap Link: https://lore.kernel.org/all/aMNJMFa5fDalFmtn@p100/ Fixes: ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them when destroying the pool") Cc: stable@vger.kernel.org # 6.15+ Tested-by: Helge Deller <deller@gmx.de> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Helge Deller <deller@gmx.de> Link: https://patch.msgid.link/20250930114331.675412-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Conflicts: include/linux/mm.h net/core/page_pool.c [commit 1a251f52cfdc, which adds the generic MIN() macro, and commits 8ab79ed50cf1, 57afb4830157, 4dec64c52e24, which introduce netmem, are not merged, leading to conflicts] Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> --- include/linux/mm.h | 22 ++++++++------ net/core/page_pool.c | 72 ++++++++++++++++++++++++++++++-------------- 2 files changed, 62 insertions(+), 32 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 346a19e9749d..86909d01bf2c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4370,14 +4370,13 @@ static inline bool vma_is_peer_shared(struct vm_area_struct *vma) * since this value becomes part of PP_SIGNATURE; meaning we can just use the * space between the PP_SIGNATURE value (without POISON_POINTER_DELTA), and the * lowest bits of POISON_POINTER_DELTA. On arches where POISON_POINTER_DELTA is - * 0, we make sure that we leave the two topmost bits empty, as that guarantees - * we won't mistake a valid kernel pointer for a value we set, regardless of the - * VMSPLIT setting. + * 0, we use the lowest bit of PAGE_OFFSET as the boundary if that value is + * known at compile-time. * - * Altogether, this means that the number of bits available is constrained by - * the size of an unsigned long (at the upper end, subtracting two bits per the - * above), and the definition of PP_SIGNATURE (with or without - * POISON_POINTER_DELTA).
+ * If the value of PAGE_OFFSET is not known at compile time, or if it is too + * small to leave at least 8 bits available above PP_SIGNATURE, we define the + * number of bits to be 0, which turns off the DMA index tracking altogether + * (see page_pool_register_dma_index()). */ #define PP_DMA_INDEX_SHIFT (1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA)) @@ -4392,8 +4391,13 @@ static inline bool vma_is_peer_shared(struct vm_area_struct *vma) */ #define PP_DMA_INDEX_BITS PP_MIN(32, __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT) #else -/* Always leave out the topmost two; see above. */ -#define PP_DMA_INDEX_BITS PP_MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 2) +/* Use the lowest bit of PAGE_OFFSET if there's at least 8 bits available; see above */ +#define PP_DMA_INDEX_MIN_OFFSET (1 << (PP_DMA_INDEX_SHIFT + 8)) +#define PP_DMA_INDEX_BITS ((__builtin_constant_p(PAGE_OFFSET) && \ + PAGE_OFFSET >= PP_DMA_INDEX_MIN_OFFSET && \ + !(PAGE_OFFSET & (PP_DMA_INDEX_MIN_OFFSET - 1))) ? \ + PP_MIN(32, __ffs(PAGE_OFFSET) - PP_DMA_INDEX_SHIFT) : 0) + #endif #define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \ diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f55aa48f528b..f63687fcd40e 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -359,11 +359,57 @@ static void page_set_dma_index(struct page *page, unsigned long id) page->pp_magic = magic; } +static int page_pool_register_dma_index(struct page_pool *pool, + struct page *page, gfp_t gfp) +{ + int err = 0; + u32 id; + + if (unlikely(!PP_DMA_INDEX_BITS)) + goto out; + + if (in_softirq()) + err = xa_alloc(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp); + else + err = xa_alloc_bh(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp); + if (err) { + WARN_ONCE(err != -ENOMEM, "couldn't track DMA mapping, please report to netdev@"); + goto out; + } + + page_set_dma_index(page, id); +out: + return err; +} + +static int page_pool_release_dma_index(struct page_pool *pool, + struct page *page) +{ + struct page *old; + unsigned long id; + + if (unlikely(!PP_DMA_INDEX_BITS)) + return 0; + + id = page_get_dma_index(page); + if (!id) + return -1; + + if (in_softirq()) + old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0); + else + old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0); + if (old != page) + return -1; + + page_set_dma_index(page, 0); + return 0; +} + static bool page_pool_dma_map(struct page_pool *pool, struct page *page, gfp_t gfp) { dma_addr_t dma; int err; - u32 id; /* Setup DMA mapping: use 'struct page' area for storing DMA-addr * since dma_addr_t can be either 32 or 64 bits and does not always fit @@ -377,15 +423,8 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page, gfp_t g if (dma_mapping_error(pool->p.dev, dma)) return false; - if (in_softirq()) - err = xa_alloc(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, - gfp); - else - err = xa_alloc_bh(&pool->dma_mapped, &id, page, - PP_DMA_INDEX_LIMIT, gfp); - + err = page_pool_register_dma_index(pool, page, gfp); if (err) { - WARN_ONCE(err != -ENOMEM, "couldn't track DMA mapping, please report to netdev@"); dma_unmap_page_attrs(pool->p.dev, dma, PAGE_SIZE << pool->p.order, pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); @@ -393,7 +432,6 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page, gfp_t g } page_pool_set_dma_addr(page, dma); - page_set_dma_index(page, id); if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) page_pool_dma_sync_for_device(pool, page, pool->p.max_len); 
@@ -539,8 +577,6 @@ static s32 page_pool_inflight(struct page_pool *pool) static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) { - struct page *old; - unsigned long id; dma_addr_t dma; if (!(pool->p.flags & PP_FLAG_DMA_MAP)) @@ -549,16 +585,7 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) */ return; - id = page_get_dma_index(page); - if (!id) - return; - - if (in_softirq()) - old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0); - else - old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0); - - if (old != page) + if (page_pool_release_dma_index(pool, page)) return; dma = page_pool_get_dma_addr(page); @@ -568,7 +595,6 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) PAGE_SIZE << pool->p.order, pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING); page_pool_set_dma_addr(page, 0); - page_set_dma_index(page, 0); } /* Disconnects a page (from a page_pool). API users can have a need -- 2.25.1
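A small userspace sketch of the new fallback rule (not kernel code; the PAGE_OFFSET values are assumptions for illustration, 0xC0000000 being the common 32-bit VMSPLIT_3G layout). It mirrors only the value checks; the __builtin_constant_p(PAGE_OFFSET) condition in the real macro has no runtime equivalent here:

#include <stdio.h>

#define PP_DMA_INDEX_SHIFT      7                                  /* 1 + __fls(0x40) */
#define PP_DMA_INDEX_MIN_OFFSET (1UL << (PP_DMA_INDEX_SHIFT + 8))  /* 0x8000 */

static int index_bits(unsigned long page_offset)
{
	/* Require at least 8 usable bits above PP_SIGNATURE and an aligned
	 * PAGE_OFFSET; otherwise the DMA index tracking is disabled (0 bits).
	 */
	if (page_offset < PP_DMA_INDEX_MIN_OFFSET ||
	    (page_offset & (PP_DMA_INDEX_MIN_OFFSET - 1)))
		return 0;

	int bits = __builtin_ctzl(page_offset) - PP_DMA_INDEX_SHIFT;
	return bits > 32 ? 32 : bits;
}

int main(void)
{
	printf("PAGE_OFFSET=0xC0000000 -> %d bits\n", index_bits(0xC0000000UL)); /* 23 */
	printf("PAGE_OFFSET=0x1000     -> %d bits\n", index_bits(0x1000UL));     /* 0  */
	return 0;
}

With 0 bits, page_pool_register_dma_index() and page_pool_release_dma_index() become no-ops, which is exactly the "disable the dma_index storage" fallback described above.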
Offering: HULK hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPKD5 -------------------------------- Commit ee62ce7a1d90 added the dma_mapped xarray to struct page_pool, which breaks KABI. Switch to a pointer stored in a KABI reserve slot to fix the issue. Fixes: ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them when destroying the pool") Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> --- include/net/page_pool/types.h | 4 +--- net/core/page_pool.c | 27 +++++++++++++++++++-------- 2 files changed, 20 insertions(+), 11 deletions(-) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index 8c796de47b21..d6e171e79009 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -172,8 +172,6 @@ struct page_pool { */ struct ptr_ring ring; - struct xarray dma_mapped; - #ifdef CONFIG_PAGE_POOL_STATS /* recycle stats are per-cpu to avoid locking */ struct page_pool_recycle_stats __percpu *recycle_stats; @@ -188,7 +186,7 @@ struct page_pool { u64 destroy_cnt; - KABI_RESERVE(1) + KABI_USE(1, struct xarray *dma_mapped) KABI_RESERVE(2) }; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index f63687fcd40e..22a01281ef51 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -233,7 +233,17 @@ static int page_pool_init(struct page_pool *pool, /* Driver calling page_pool_create() also call page_pool_destroy() */ refcount_set(&pool->user_cnt, 1); - xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1); + pool->dma_mapped = kmalloc(sizeof(*pool->dma_mapped), GFP_KERNEL); + if (!pool->dma_mapped) { +#ifdef CONFIG_PAGE_POOL_STATS + free_percpu(pool->recycle_stats); +#endif + ptr_ring_cleanup(&pool->ring, NULL); + + return -ENOMEM; + } + + xa_init_flags(pool->dma_mapped, XA_FLAGS_ALLOC1); return 0; } @@ -369,9 +379,9 @@ static int page_pool_register_dma_index(struct page_pool *pool, goto out; if (in_softirq()) - err = xa_alloc(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp); + err = xa_alloc(pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp); else - err = xa_alloc_bh(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp); + err = xa_alloc_bh(pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp); if (err) { WARN_ONCE(err != -ENOMEM, "couldn't track DMA mapping, please report to netdev@"); goto out; @@ -396,9 +406,9 @@ static int page_pool_release_dma_index(struct page_pool *pool, return -1; if (in_softirq()) - old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0); + old = xa_cmpxchg(pool->dma_mapped, id, page, NULL, 0); else - old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0); + old = xa_cmpxchg_bh(pool->dma_mapped, id, page, NULL, 0); if (old != page) return -1; @@ -888,7 +898,8 @@ static void page_pool_free(struct page_pool *pool) ptr_ring_cleanup(&pool->ring, NULL); - xa_destroy(&pool->dma_mapped); + xa_destroy(pool->dma_mapped); + kfree(pool->dma_mapped); #ifdef CONFIG_PAGE_POOL_STATS free_percpu(pool->recycle_stats); @@ -930,11 +941,11 @@ static void page_pool_scrub(struct page_pool *pool) * wait if the device doesn't actually need syncing, or * if there are no outstanding mapped pages. */ - if (!xa_empty(&pool->dma_mapped)) + if (!xa_empty(pool->dma_mapped)) synchronize_net(); } - xa_for_each(&pool->dma_mapped, id, ptr) + xa_for_each(pool->dma_mapped, id, ptr) __page_pool_release_page_dma(pool, ptr); } -- 2.25.1
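The KABI trick itself is generic: keep the structure's size and member offsets unchanged by replacing the embedded member with a pointer parked in an already-reserved slot, and allocate the real state separately at init time. A stand-alone userspace sketch of the pattern (illustrative names only, not the openEuler KABI_RESERVE/KABI_USE macros or page_pool code):

#include <stdlib.h>

/* v1 of the struct shipped with spare slots; v2 reuses one slot to point at
 * newly added state, so sizeof(struct pool) and all existing offsets stay
 * stable for out-of-tree consumers built against v1.
 */
struct pool_extra {
	unsigned long next_dma_index;   /* stand-in for the new xarray state */
};

struct pool {
	long ring[8];                   /* pre-existing members ... */
	void *reserved1;                /* was a reserved slot; now points at extra */
	void *reserved2;
};

static int pool_init(struct pool *p)
{
	struct pool_extra *extra = calloc(1, sizeof(*extra));

	if (!extra)
		return -1;              /* mirrors page_pool_init()'s -ENOMEM path */
	p->reserved1 = extra;
	return 0;
}

static void pool_free(struct pool *p)
{
	free(p->reserved1);             /* mirrors xa_destroy() + kfree() */
	p->reserved1 = NULL;
}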