[PATCH v3 OLK-6.6 0/4] Fix CVE-2025-21696
Fix CVE-2025-21696.

Changes since v2:
- Add a debug fix patch for the second patch.

David Hildenbrand (1):
  mm/mremap: fix WARN with uffd that has remap events disabled

Lokesh Gidra (1):
  userfaultfd: move userfaultfd_ctx struct to header file

Ryan Roberts (1):
  mm: clear uffd-wp PTE/PMD state on mremap()

Ze Zuo (1):
  userfaultfd: fix kabi breakage due to struct userfaultfd_ctx

 fs/userfaultfd.c              | 39 -------------------
 include/linux/userfaultfd_k.h | 53 +++++++++++++++++++++++++
 mm/huge_memory.c              | 12 ++++++
 mm/hugetlb.c                  | 14 ++++++-
 mm/mremap.c                   | 73 ++++++++++++++++++++++++++---------
 5 files changed, 132 insertions(+), 59 deletions(-)

-- 
2.25.1
From: Lokesh Gidra <lokeshgidra@google.com>

mainline inclusion
from mainline-v6.9-rc1
commit f91e6b41dd11daffb138e3afdb4804aefc3d4e1b
category: bugfix
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/784
CVE: CVE-2025-21696

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Patch series "per-vma locks in userfaultfd", v7.

Performing userfaultfd operations (like copy/move etc.) in critical
section of mmap_lock (read-mode) causes significant contention on the
lock when operations requiring the lock in write-mode are taking place
concurrently. We can use per-vma locks instead to significantly reduce
the contention issue.

Android runtime's Garbage Collector uses userfaultfd for concurrent
compaction. mmap-lock contention during compaction potentially causes
jittery experience for the user. During one such reproducible scenario,
we observed the following improvements with this patch-set:

- Wall clock time of compaction phase came down from ~3s to <500ms
- Uninterruptible sleep time (across all threads in the process) was
  ~10ms (none in mmap_lock) during compaction, instead of >20s

This patch (of 4):

Move the struct to userfaultfd_k.h to be accessible from
mm/userfaultfd.c. There are no other changes in the struct.

This is required to prepare for using per-vma locks in userfaultfd
operations.

Link: https://lkml.kernel.org/r/20240215182756.3448972-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20240215182756.3448972-2-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ze Zuo <zuoze1@huawei.com>
---
 fs/userfaultfd.c              | 39 -----------------------------
 include/linux/userfaultfd_k.h | 39 +++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index fe6d2ff45160..67902e073f06 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -52,45 +52,6 @@ static struct ctl_table vm_userfaultfd_table[] = {
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
-/*
- * Start with fault_pending_wqh and fault_wqh so they're more likely
- * to be in the same cacheline.
- *
- * Locking order:
- *	fd_wqh.lock
- *		fault_pending_wqh.lock
- *			fault_wqh.lock
- *				event_wqh.lock
- *
- * To avoid deadlocks, IRQs must be disabled when taking any of the above locks,
- * since fd_wqh.lock is taken by aio_poll() while it's holding a lock that's
- * also taken in IRQ context.
- */
-struct userfaultfd_ctx {
-	/* waitqueue head for the pending (i.e. not read) userfaults */
-	wait_queue_head_t fault_pending_wqh;
-	/* waitqueue head for the userfaults */
-	wait_queue_head_t fault_wqh;
-	/* waitqueue head for the pseudo fd to wakeup poll/read */
-	wait_queue_head_t fd_wqh;
-	/* waitqueue head for events */
-	wait_queue_head_t event_wqh;
-	/* a refile sequence protected by fault_pending_wqh lock */
-	seqcount_spinlock_t refile_seq;
-	/* pseudo fd refcounting */
-	refcount_t refcount;
-	/* userfaultfd syscall flags */
-	unsigned int flags;
-	/* features requested from the userspace */
-	unsigned int features;
-	/* released */
-	bool released;
-	/* memory mappings are changing because of non-cooperative event */
-	atomic_t mmap_changing;
-	/* mm with one ore more vmas attached to this userfaultfd_ctx */
-	struct mm_struct *mm;
-};
-
 struct userfaultfd_fork_ctx {
 	struct userfaultfd_ctx *orig;
 	struct userfaultfd_ctx *new;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index aa8e3725f103..9c3f06fd68c0 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,6 +36,45 @@
 #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
+/*
+ * Start with fault_pending_wqh and fault_wqh so they're more likely
+ * to be in the same cacheline.
+ *
+ * Locking order:
+ *	fd_wqh.lock
+ *		fault_pending_wqh.lock
+ *			fault_wqh.lock
+ *				event_wqh.lock
+ *
+ * To avoid deadlocks, IRQs must be disabled when taking any of the above locks,
+ * since fd_wqh.lock is taken by aio_poll() while it's holding a lock that's
+ * also taken in IRQ context.
+ */
+struct userfaultfd_ctx {
+	/* waitqueue head for the pending (i.e. not read) userfaults */
+	wait_queue_head_t fault_pending_wqh;
+	/* waitqueue head for the userfaults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* waitqueue head for events */
+	wait_queue_head_t event_wqh;
+	/* a refile sequence protected by fault_pending_wqh lock */
+	seqcount_spinlock_t refile_seq;
+	/* pseudo fd refcounting */
+	refcount_t refcount;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* features requested from the userspace */
+	unsigned int features;
+	/* released */
+	bool released;
+	/* memory mappings are changing because of non-cooperative event */
+	atomic_t mmap_changing;
+	/* mm with one ore more vmas attached to this userfaultfd_ctx */
+	struct mm_struct *mm;
+};
+
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
 /* A combined operation mode + behavior flags. */
-- 
2.25.1
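To illustrate what the move enables: with struct userfaultfd_ctx visible
through userfaultfd_k.h, mm code can test context state directly instead of
calling through fs/userfaultfd.c. A minimal sketch, assuming a hypothetical
helper name (this helper is not part of the series):

/*
 * Illustrative only: the struct definition is now visible outside
 * fs/userfaultfd.c, so mm code can dereference the per-vma context.
 * uffd_ctx_mmap_changing() is a hypothetical name for this sketch.
 */
#include <linux/userfaultfd_k.h>

static inline bool uffd_ctx_mmap_changing(struct vm_area_struct *vma)
{
	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;

	/* mmap_changing is raised during non-cooperative events */
	return ctx && atomic_read(&ctx->mmap_changing) != 0;
}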
From: Ryan Roberts <ryan.roberts@arm.com>

mainline inclusion
from mainline-v6.13
commit 0cef0bb836e3cfe00f08f9606c72abd72fe78ca3
category: bugfix
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/784
CVE: CVE-2025-21696

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When mremap()ing a memory region previously registered with userfaultfd
as write-protected but without UFFD_FEATURE_EVENT_REMAP, an
inconsistency in flag clearing leads to a mismatch between the vma flags
(which have uffd-wp cleared) and the pte/pmd flags (which do not have
uffd-wp cleared). This mismatch causes a subsequent mprotect(PROT_WRITE)
to trigger a warning in page_table_check_pte_flags() due to setting the
pte to writable while uffd-wp is still set.

Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any
such mremap() so that the values are consistent with the existing
clearing of VM_UFFD_WP. Be careful to clear the logical flag regardless
of its physical form: a PTE bit, a swap PTE bit, or a PTE marker. Cover
PTE, huge PMD and hugetlb paths.

Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com
Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.co...
Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	include/linux/userfaultfd_k.h
	mm/hugetlb.c
	mm/mremap.c
[Context conflicts for the second file, conflicts due to merge commit
f822a9a81a31 ("mm: optimize mremap() by PTE batching").]
Signed-off-by: Ze Zuo <zuoze1@huawei.com>
---
 include/linux/userfaultfd_k.h | 12 ++++++++++++
 mm/huge_memory.c              | 12 ++++++++++++
 mm/hugetlb.c                  | 14 +++++++++++++-
 mm/mremap.c                   | 32 +++++++++++++++++++++++++++++++-
 4 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 9c3f06fd68c0..791af86a89cc 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -223,6 +223,13 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	    vma_is_shmem(vma);
 }
 
+static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
+{
+	struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
+
+	return uffd_ctx && (uffd_ctx->features & UFFD_FEATURE_EVENT_REMAP) == 0;
+}
+
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 void dup_userfaultfd_fail(struct list_head *);
@@ -346,6 +353,11 @@ static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2bd84157869e..c5ca0fc25789 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2549,6 +2549,16 @@ static pmd_t move_soft_dirty_pmd(pmd_t pmd)
 	return pmd;
 }
 
+static pmd_t clear_uffd_wp_pmd(pmd_t pmd)
+{
+	if (pmd_present(pmd))
+		pmd = pmd_clear_uffd_wp(pmd);
+	else if (is_swap_pmd(pmd))
+		pmd = pmd_swp_clear_uffd_wp(pmd);
+
+	return pmd;
+}
+
 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
 {
@@ -2587,6 +2597,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
 		}
 		pmd = move_soft_dirty_pmd(pmd);
+		if (vma_has_uffd_without_event_remap(vma))
+			pmd = clear_uffd_wp_pmd(pmd);
 		set_pmd_at(mm, new_addr, new_pmd, pmd);
 		if (force_flush)
 			flush_pmd_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1a98b599c0b0..214a8ab4cc4c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5657,6 +5657,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 			  unsigned long new_addr, pte_t *src_pte, pte_t *dst_pte,
 			  unsigned long sz)
 {
+	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *src_ptl, *dst_ptl;
@@ -5673,7 +5674,18 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
 	pte = huge_ptep_get_and_clear(mm, old_addr, src_pte, sz);
-	set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
+
+	if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
+		huge_pte_clear(mm, new_addr, dst_pte, sz);
+	else {
+		if (need_clear_uffd_wp) {
+			if (pte_present(pte))
+				pte = huge_pte_clear_uffd_wp(pte);
+			else if (is_swap_pte(pte))
+				pte = pte_swp_clear_uffd_wp(pte);
+		}
+		set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
+	}
 
 	if (src_ptl != dst_ptl)
 		spin_unlock(src_ptl);
diff --git a/mm/mremap.c b/mm/mremap.c
index ae842afdaea1..98674ed21b62 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -161,6 +161,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		struct vm_area_struct *new_vma, pmd_t *new_pmd,
 		unsigned long new_addr, bool need_rmap_locks)
 {
+	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_ptep, *new_ptep;
 	pte_t old_pte, pte;
@@ -239,7 +240,18 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
-		set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
+
+		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
+			pte_clear(mm, new_addr, new_ptep);
+		else {
+			if (need_clear_uffd_wp) {
+				if (pte_present(pte))
+					pte = pte_clear_uffd_wp(pte);
+				else if (is_swap_pte(pte))
+					pte = pte_swp_clear_uffd_wp(pte);
+			}
+			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
+		}
 	}
 
 	arch_leave_lazy_mmu_mode();
@@ -301,6 +313,15 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 	if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
 		return false;
 
+	/* If this pmd belongs to a uffd vma with remap events disabled, we need
+	 * to ensure that the uffd-wp state is cleared from all pgtables. This
+	 * means recursing into lower page tables in move_page_tables(), and we
+	 * can reuse the existing code if we simply treat the entry as "not
+	 * moved".
+	 */
+	if (vma_has_uffd_without_event_remap(vma))
+		return false;
+
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_lock prevents deadlock.
@@ -356,6 +377,15 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 	if (WARN_ON_ONCE(!pud_none(*new_pud)))
 		return false;
 
+	/* If this pud belongs to a uffd vma with remap events disabled, we need
+	 * to ensure that the uffd-wp state is cleared from all pgtables. This
+	 * means recursing into lower page tables in move_page_tables(), and we
+	 * can reuse the existing code if we simply treat the entry as "not
+	 * moved".
+	 */
+	if (vma_has_uffd_without_event_remap(vma))
+		return false;
+
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_lock prevents deadlock.
-- 
2.25.1
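The failing sequence described in the changelog can be outlined from
userspace. A minimal sketch, with error handling omitted; it assumes a kernel
with CONFIG_PAGE_TABLE_CHECK and sufficient privileges for uffd-wp, and the
destination address handling is simplified, so treat it as an outline of the
trigger rather than a guaranteed reproducer:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	/* Note: UFFD_FEATURE_EVENT_REMAP deliberately not requested. */
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	ioctl(uffd, UFFDIO_API, &api);

	char *mem = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mem[0] = 1;	/* populate so a real pte exists */

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)mem, .len = psz },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)mem, .len = psz },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,	/* sets uffd-wp on the pte */
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* Reserve a destination and force the mapping to actually move;
	 * VM_UFFD_WP is dropped from the new vma, but on unfixed kernels
	 * the uffd-wp pte bit travels with the pte. */
	char *dst = mmap(NULL, psz, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *new_mem = mremap(mem, psz, psz,
			       MREMAP_MAYMOVE | MREMAP_FIXED, dst);

	/* Making the pte writable while uffd-wp is still set is what
	 * page_table_check_pte_flags() warns about. */
	mprotect(new_mem, psz, PROT_READ | PROT_WRITE);
	return 0;
}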
From: David Hildenbrand <david@redhat.com>

mainline inclusion
from mainline-v6.17-rc3
commit 772e5b4a5e8360743645b9a466842d16092c4f94
category: bugfix
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/784
CVE: CVE-2025-21696

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

------------------

Registering userfaultfd on a VMA that spans at least one PMD and then
mremap()'ing that VMA can trigger a WARN when recovering from a failed
page table move due to a page table allocation error.

The code ends up doing the right thing (recurse, avoiding moving actual
page tables), but triggering that WARN is unpleasant:

WARNING: CPU: 2 PID: 6133 at mm/mremap.c:357 move_normal_pmd mm/mremap.c:357 [inline]
WARNING: CPU: 2 PID: 6133 at mm/mremap.c:357 move_pgt_entry mm/mremap.c:595 [inline]
WARNING: CPU: 2 PID: 6133 at mm/mremap.c:357 move_page_tables+0x3832/0x44a0 mm/mremap.c:852
Modules linked in:
CPU: 2 UID: 0 PID: 6133 Comm: syz.0.19 Not tainted 6.17.0-rc1-syzkaller-00004-g53e760d89498 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:move_normal_pmd mm/mremap.c:357 [inline]
RIP: 0010:move_pgt_entry mm/mremap.c:595 [inline]
RIP: 0010:move_page_tables+0x3832/0x44a0 mm/mremap.c:852
Code: ...
RSP: 0018:ffffc900037a76d8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000032930007 RCX: ffffffff820c6645
RDX: ffff88802e56a440 RSI: ffffffff820c7201 RDI: 0000000000000007
RBP: ffff888037728fc0 R08: 0000000000000007 R09: 0000000000000000
R10: 0000000032930007 R11: 0000000000000000 R12: 0000000000000000
R13: ffffc900037a79a8 R14: 0000000000000001 R15: dffffc0000000000
FS:  000055556316a500(0000) GS:ffff8880d68bc000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b30863fff CR3: 0000000050171000 CR4: 0000000000352ef0
Call Trace:
 <TASK>
 copy_vma_and_data+0x468/0x790 mm/mremap.c:1215
 move_vma+0x548/0x1780 mm/mremap.c:1282
 mremap_to+0x1b7/0x450 mm/mremap.c:1406
 do_mremap+0xfad/0x1f80 mm/mremap.c:1921
 __do_sys_mremap+0x119/0x170 mm/mremap.c:1977
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0x4c0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f00d0b8ebe9
Code: ...
RSP: 002b:00007ffe5ea5ee98 EFLAGS: 00000246 ORIG_RAX: 0000000000000019
RAX: ffffffffffffffda RBX: 00007f00d0db5fa0 RCX: 00007f00d0b8ebe9
RDX: 0000000000400000 RSI: 0000000000c00000 RDI: 0000200000000000
RBP: 00007ffe5ea5eef0 R08: 0000200000c00000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f00d0db5fa0 R14: 00007f00d0db5fa0 R15: 0000000000000005
 </TASK>

The underlying issue is that we recurse during the original page table
move, but not during the recovery move.

Fix it by checking for both VMAs and performing the check before the
pmd_none() sanity check.

Add a new helper where we perform+document that check for the PMD and
PUD level.

Thanks to Harry for bisecting.

Link: https://lkml.kernel.org/r/20250818175358.1184757-1-david@redhat.com
Fixes: 0cef0bb836e3 ("mm: clear uffd-wp PTE/PMD state on mremap()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: syzbot+4d9a13f0797c46a29e42@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/689bb893.050a0220.7f033.013a.GAE@google.com
Tested-by: Harry Yoo <harry.yoo@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	mm/mremap.c
[The mainline patch is based on pagetable_move_control, introducing this
structure that relies on numerous pre-patches. Here, we directly adapt
according to the code logic.]
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Ze Zuo <zuoze1@huawei.com>
---
 mm/mremap.c | 77 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 41 insertions(+), 36 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 98674ed21b62..36ead7184d8e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -276,9 +276,29 @@ static inline bool arch_supports_page_table_move(void)
 }
 #endif
 
+static inline bool uffd_supports_page_table_move(struct vm_area_struct *vma,
+						 struct vm_area_struct *new_vma)
+{
+	/*
+	 * If we are moving a VMA that has uffd-wp registered but with remap
+	 * events disabled (new VMA will not be registered with uffd), we need
+	 * to ensure that the uffd-wp state is cleared from all pgtables. This
+	 * means recursing into lower page tables in move_page_tables().
+	 *
+	 * We might get called with VMAs reversed when recovering from a failed
+	 * page table move. In that case, the "old"-but-actually-"originally
+	 * new" VMA during recovery will not have a uffd context. Recursing
+	 * into lower page tables during the original move but not during the
+	 * recovery move will cause trouble, because we run into
+	 * already-existing page tables. So check both VMAs.
+	 */
+	return !vma_has_uffd_without_event_remap(vma) &&
+	       !vma_has_uffd_without_event_remap(new_vma);
+}
+
 #ifdef CONFIG_HAVE_MOVE_PMD
-static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
-		unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
+static bool move_normal_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
+		unsigned long old_addr, unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
 {
 	spinlock_t *old_ptl, *new_ptl;
 	struct mm_struct *mm = vma->vm_mm;
@@ -287,6 +307,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 
 	if (!arch_supports_page_table_move())
 		return false;
+	if (!uffd_supports_page_table_move(vma, new_vma))
+		return false;
 	/*
 	 * The destination pmd shouldn't be established, free_pgtables()
 	 * should have released it.
@@ -313,15 +335,6 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 	if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
 		return false;
 
-	/* If this pmd belongs to a uffd vma with remap events disabled, we need
-	 * to ensure that the uffd-wp state is cleared from all pgtables. This
-	 * means recursing into lower page tables in move_page_tables(), and we
-	 * can reuse the existing code if we simply treat the entry as "not
-	 * moved".
-	 */
-	if (vma_has_uffd_without_event_remap(vma))
-		return false;
-
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_lock prevents deadlock.
@@ -352,17 +365,16 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 	return res;
 }
 #else
-static inline bool move_normal_pmd(struct vm_area_struct *vma,
-		unsigned long old_addr, unsigned long new_addr, pmd_t *old_pmd,
-		pmd_t *new_pmd)
+static inline bool move_normal_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
+		unsigned long old_addr, unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
 {
 	return false;
 }
 #endif
 
 #if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_HAVE_MOVE_PUD)
-static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
-		unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+static bool move_normal_pud(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
+		unsigned long old_addr, unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
 	spinlock_t *old_ptl, *new_ptl;
 	struct mm_struct *mm = vma->vm_mm;
@@ -370,6 +382,8 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 
 	if (!arch_supports_page_table_move())
 		return false;
+	if (!uffd_supports_page_table_move(vma, new_vma))
+		return false;
 	/*
 	 * The destination pud shouldn't be established, free_pgtables()
 	 * should have released it.
@@ -377,15 +391,6 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 	if (WARN_ON_ONCE(!pud_none(*new_pud)))
 		return false;
 
-	/* If this pud belongs to a uffd vma with remap events disabled, we need
-	 * to ensure that the uffd-wp state is cleared from all pgtables. This
-	 * means recursing into lower page tables in move_page_tables(), and we
-	 * can reuse the existing code if we simply treat the entry as "not
-	 * moved".
-	 */
-	if (vma_has_uffd_without_event_remap(vma))
-		return false;
-
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_lock prevents deadlock.
@@ -410,9 +415,8 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 	return true;
 }
 #else
-static inline bool move_normal_pud(struct vm_area_struct *vma,
-		unsigned long old_addr, unsigned long new_addr, pud_t *old_pud,
-		pud_t *new_pud)
+static inline bool move_normal_pud(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
+		unsigned long old_addr, unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
 	return false;
 }
@@ -518,8 +522,9 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry,
  * pgt_entry.  Returns true if the move was successful, else false.
  */
 static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
-			unsigned long old_addr, unsigned long new_addr,
-			void *old_entry, void *new_entry, bool need_rmap_locks)
+			struct vm_area_struct *new_vma, unsigned long old_addr,
+			unsigned long new_addr, void *old_entry, void *new_entry,
+			bool need_rmap_locks)
 {
 	bool moved = false;
 
@@ -529,11 +534,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 
 	switch (entry) {
 	case NORMAL_PMD:
-		moved = move_normal_pmd(vma, old_addr, new_addr, old_entry,
+		moved = move_normal_pmd(vma, new_vma, old_addr, new_addr, old_entry,
 					new_entry);
 		break;
 	case NORMAL_PUD:
-		moved = move_normal_pud(vma, old_addr, new_addr, old_entry,
+		moved = move_normal_pud(vma, new_vma, old_addr, new_addr, old_entry,
 					new_entry);
 		break;
 	case HPAGE_PMD:
@@ -598,14 +603,14 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			break;
 		if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
 			if (extent == HPAGE_PUD_SIZE) {
-				move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
+				move_pgt_entry(HPAGE_PUD, vma, new_vma, old_addr, new_addr,
 					       old_pud, new_pud, need_rmap_locks);
 				/* We ignore and continue on error? */
 				continue;
 			}
 		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
-			if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
+			if (move_pgt_entry(NORMAL_PUD, vma, new_vma, old_addr, new_addr,
 					   old_pud, new_pud, true))
 				continue;
 		}
@@ -621,7 +626,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
 		    pmd_devmap(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE &&
-			    move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
+			    move_pgt_entry(HPAGE_PMD, vma, new_vma, old_addr, new_addr,
 					   old_pmd, new_pmd, need_rmap_locks))
 				continue;
 			split_huge_pmd(vma, old_pmd, old_addr);
@@ -631,7 +636,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		 * If the extent is PMD-sized, try to speed the move by
 		 * moving at the PMD level if possible.
 		 */
-		if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
+		if (move_pgt_entry(NORMAL_PMD, vma, new_vma, old_addr, new_addr,
 				   old_pmd, new_pmd, true))
 			continue;
 	}
-- 
2.25.1
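The asymmetry is easiest to see at the call site. A simplified paraphrase of
the move/undo pattern in move_vma() (not verbatim kernel code): on failure,
the same move_page_tables() runs again with the two VMAs' roles reversed, so
any per-entry predicate must answer identically in both directions, which is
why the new helper checks both VMAs.

/*
 * Simplified paraphrase of the recovery pattern in mm/mremap.c,
 * shown as a fragment for illustration only.
 */
moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
			     need_rmap_locks);
if (moved_len < old_len) {
	/*
	 * Undo the partial move: new_vma is now the source. The source in
	 * this direction has no uffd context, so a one-sided check in
	 * move_normal_pmd() would take the PMD fast path here even though
	 * it recursed on the way in, and then trip over the page tables
	 * that recursion already created; hence the WARN.
	 */
	move_page_tables(new_vma, new_addr, vma, old_addr, moved_len, true);
}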
hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/784

--------------------------------

Move the userfaultfd_ctx structure to the header file. Since this new
structure is introduced into the userfaultfd_k.h header file, it changes
the kernel Application Binary Interface (kABI) of some exported
interfaces that reference this header. To avoid this issue, use the
__GENKSYMS__ macro to handle it.

Fixes: f91e6b41dd11 ("userfaultfd: move userfaultfd_ctx struct to header file")
Signed-off-by: Ze Zuo <zuoze1@huawei.com>
---
 include/linux/userfaultfd_k.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 791af86a89cc..a9aaf07bc44f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -50,6 +50,7 @@
  * since fd_wqh.lock is taken by aio_poll() while it's holding a lock that's
  * also taken in IRQ context.
  */
+#ifndef __GENKSYMS__
 struct userfaultfd_ctx {
 	/* waitqueue head for the pending (i.e. not read) userfaults */
 	wait_queue_head_t fault_pending_wqh;
@@ -74,6 +75,7 @@ struct userfaultfd_ctx {
 	/* mm with one ore more vmas attached to this userfaultfd_ctx */
 	struct mm_struct *mm;
 };
+#endif
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
-- 
2.25.1
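For context on the technique, a generic illustration (not part of the patch,
hypothetical struct name): genksyms runs the preprocessor with __GENKSYMS__
defined while computing exported-symbol CRCs, so a definition wrapped in
#ifndef __GENKSYMS__ stays out of the CRC calculation and the kABI checksums
are unchanged, while the compiler proper, which never defines __GENKSYMS__,
still sees the definition.

/* Hidden from genksyms CRC computation, visible to the real compile. */
#ifndef __GENKSYMS__
struct example_new_type {
	int field;
};
#endif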
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing
list has been converted to a pull request successfully!
Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/20705
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/OOM...