Most of these are large folio bugfixes; one is about fault-around and one about vmalloc, plus two small code optimizations.
Baolin Wang (1):
      mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics

Barry Song (1):
      mm: prohibit the last subpage from reusing the entire large folio

John Hubbard (1):
      mm/memory.c: do_numa_page(): remove a redundant page table read

Kefeng Wang (2):
      mm: memory: fix shift-out-of-bounds in fault_around_bytes_set
      Revert "mm: support multi-size THP numa balancing"

Matthew Wilcox (1):
      mm: simplify thp_vma_allowable_order

Uladzislau Rezki (Sony) (1):
      mm: vmalloc: bail out early in find_vmap_area() if vmap is not init

Zi Yan (1):
      mm/huge_memory: skip invalid debugfs new_order input for folio split
 fs/proc/task_mmu.c      |  4 +-
 include/linux/huge_mm.h | 29 ++++++------
 mm/huge_memory.c        | 20 +++++++--
 mm/khugepaged.c         | 16 +++---
 mm/memory.c             | 97 +++++++++++++++--------------
 mm/mprotect.c           |  3 +-
 mm/vmalloc.c            |  3 ++
 7 files changed, 80 insertions(+), 92 deletions(-)
mainline inclusion
from mainline-v6.9-rc1
commit 5aa598a72eafcf05239519646ec88638c8894dba
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
-------------------------------------------------
rounddown_pow_of_two(0) is undefined, so val = 0 is not allowed in fault_around_bytes_set(): it leads to a shift-out-of-bounds,
UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13
shift exponent 4294967295 is too large for 64-bit type 'long unsigned int'
CPU: 7 PID: 107 Comm: sh Not tainted 6.8.0-rc6-next-20240301 #294
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
 dump_backtrace+0x94/0xec
 show_stack+0x18/0x24
 dump_stack_lvl+0x78/0x90
 dump_stack+0x18/0x24
 ubsan_epilogue+0x10/0x44
 __ubsan_handle_shift_out_of_bounds+0x98/0x134
 fault_around_bytes_set+0xa4/0xb0
 simple_attr_write_xsigned.isra.0+0xe4/0x1ac
 simple_attr_write+0x18/0x24
 debugfs_attr_write+0x4c/0x98
 vfs_write+0xd0/0x4b0
 ksys_write+0x6c/0xfc
 __arm64_sys_write+0x1c/0x28
 invoke_syscall+0x44/0x104
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x34/0xdc
 el0t_64_sync_handler+0xc0/0xc4
 el0t_64_sync+0x190/0x194
---[ end trace ]---
Fix it by setting the minimum val to PAGE_SIZE.
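To illustrate the undefined behaviour outside the kernel, here is a minimal userspace sketch; rounddown_pow_of_two_sketch() is a hypothetical stand-in for the kernel's ilog2()-based helper, not the real implementation:

#include <stdio.h>

/*
 * Hypothetical mirror of rounddown_pow_of_two(): 1UL << ilog2(n).
 * For n == 0 the computed log is -1, so the shift exponent becomes
 * (unsigned)-1 == 4294967295, matching the UBSAN report above.
 */
static unsigned long rounddown_pow_of_two_sketch(unsigned long n)
{
	int log = (int)(8 * sizeof(n)) - 1 - __builtin_clzl(n);

	return 1UL << log;	/* undefined when n == 0 */
}

int main(void)
{
	printf("%lu\n", rounddown_pow_of_two_sketch(4096));	/* prints 4096 */
	/* rounddown_pow_of_two_sketch(0) would trigger the shift UB */
	return 0;
}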
Link: https://lkml.kernel.org/r/20240302064312.2358924-1-wangkefeng.wang@huawei.co...
Fixes: 53d36a56d8c4 ("mm: prefer fault_around_pages to fault_around_bytes")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: Yue Sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/all/CAEkJfYPim6DQqW1GqCiHLdh2-eweqk1fGyXqs3JM+8e1qGg...
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 5aa598a72eafcf05239519646ec88638c8894dba)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index fa4d1b499511..5a0e935712d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4807,7 +4807,8 @@ static int fault_around_bytes_set(void *data, u64 val)
 	 * The minimum value is 1 page, however this results in no fault-around
 	 * at all. See should_fault_around().
 	 */
-	fault_around_pages = max(rounddown_pow_of_two(val) >> PAGE_SHIFT, 1UL);
+	val = max(val, PAGE_SIZE);
+	fault_around_pages = rounddown_pow_of_two(val) >> PAGE_SHIFT;

 	return 0;
 }
From: John Hubbard <jhubbard@nvidia.com>
mainline inclusion
from mainline-v6.9-rc1
commit 6c1b748ebf27befffec83b77ca1960bf70ed6ac9
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
-------------------------------------------------
do_numa_page() is reading from the same page table entry, twice, while holding the page table lock: once while checking that the pte hasn't changed, and again in order to modify the pte.
Instead, just read the pte once, and save it in the same old_pte variable that already exists. This has no effect on behavior, other than to provide a tiny potential improvement to performance, by avoiding the redundant memory read (which the compiler cannot elide, due to READ_ONCE()).
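For background, a minimal standalone sketch of why the compiler must keep both loads; the simplified READ_ONCE() below is an illustration, not the kernel's full definition:

#include <stdint.h>

/*
 * Simplified READ_ONCE(): the volatile access forces each use to be a
 * real load, so the compiler may not merge the two reads below even
 * though nothing can legally change between them under the lock.
 */
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

uint64_t read_twice(const uint64_t *pte)
{
	uint64_t a = READ_ONCE(*pte);	/* first memory read */
	uint64_t b = READ_ONCE(*pte);	/* second read, emitted again */

	return a & b;
}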
Also improve the associated comments nearby.
Link: https://lkml.kernel.org/r/20240228034151.459370-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 6c1b748ebf27befffec83b77ca1960bf70ed6ac9)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 5a0e935712d4..48d08cdc1a1c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5132,18 +5132,18 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	int flags = 0, nr_pages;

 	/*
-	 * The "pte" at this point cannot be used safely without
-	 * validation through pte_unmap_same(). It's of NUMA type but
-	 * the pfn may be screwed if the read is non atomic.
+	 * The pte cannot be used safely until we verify, while holding the page
+	 * table lock, that its contents have not changed during fault handling.
 	 */
 	spin_lock(vmf->ptl);
-	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+	/* Read the live PTE from the page tables: */
+	old_pte = ptep_get(vmf->pte);
+
+	if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		goto out;
 	}

-	/* Get the normal PTE */
-	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);

 	/*
From: Barry Song <v-songbaohua@oppo.com>
mainline inclusion
from mainline-v6.9-rc1
commit cd197c3a2040100fd8668b33e72b07d4790b39d7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
-------------------------------------------------
In a Copy-on-Write (CoW) scenario, the last subpage will reuse the entire large folio, resulting in the waste of (nr_pages - 1) pages. This wasted memory remains allocated until it is either unmapped or memory reclamation occurs.
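As a rough worked example (assuming 4KiB base pages and the 1024KiB mTHP used below): each such folio holds 256 subpages, so letting the last CoW'd subpage keep the whole folio strands 255 * 4KiB, about 1020KiB, per folio; across a 1GiB buffer that adds up to roughly an extra 1GiB, consistent with the 2GiB vs 1GiB totals measured below.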
The following small program can serve as evidence of this behavior
main()
{
#define SIZE 1024 * 1024 * 1024UL
	void *p = malloc(SIZE);

	memset(p, 0x11, SIZE);
	if (fork() == 0)
		_exit(0);
	memset(p, 0x12, SIZE);
	printf("done\n");
	while(1);
}
For example, using a 1024KiB mTHP by:
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled
(1) w/o the patch, it takes 2GiB,
Before running the test program,
/ # free -m
                total        used        free      shared  buff/cache   available
Mem:             5754          84        5692           0          17        5669
Swap:               0           0           0
/ # /a.out &
/ # done
After running the test program,
/ # free -m
                total        used        free      shared  buff/cache   available
Mem:             5754        2149        3627           0          19        3605
Swap:               0           0           0
(2) w/ the patch, it takes 1GiB only,
Before running the test program,
/ # free -m
                total        used        free      shared  buff/cache   available
Mem:             5754          89        5687           0          17        5664
Swap:               0           0           0
/ # /a.out &
/ # done
After running the test program,
/ # free -m
                total        used        free      shared  buff/cache   available
Mem:             5754        1122        4655           0          17        4632
Swap:               0           0           0
This patch migrates the last subpage to a small folio and immediately returns the large folio to the system. It benefits both memory availability and anti-fragmentation.
Link: https://lkml.kernel.org/r/20240308092721.144735-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit cd197c3a2040100fd8668b33e72b07d4790b39d7)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/memory.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 48d08cdc1a1c..9d264c303972 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3532,6 +3532,16 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 static bool wp_can_reuse_anon_folio(struct folio *folio,
 				    struct vm_area_struct *vma)
 {
+	/*
+	 * We could currently only reuse a subpage of a large folio if no
+	 * other subpages of the large folios are still mapped. However,
+	 * let's just consistently not reuse subpages even if we could
+	 * reuse in that scenario, and give back a large folio a bit
+	 * sooner.
+	 */
+	if (folio_test_large(folio))
+		return false;
+
 	/*
 	 * We have to verify under folio lock: these early checks are
 	 * just an optimization to avoid locking the folio and freeing
From: Zi Yan <ziy@nvidia.com>
mainline inclusion
from mainline-v6.9-rc1
commit 2394aef616cf38fbf2e797c6845ccd35d76ce256
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
-------------------------------------------------
A user can pass an arbitrary new_order via debugfs for the folio split test. Although a new_order check was added to split_huge_page_to_list_to_order() in the prior commit, these two additional checks avoid unnecessary folio locking and split_folio_to_order() calls.
Link: https://lkml.kernel.org/r/20240307181854.138928-2-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-mm/7dda9283-b437-4cf8-ab0d-83c330deb9c0@moroto...
Cc: David Hildenbrand <david@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 2394aef616cf38fbf2e797c6845ccd35d76ce256)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/huge_memory.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b1eda738509c..fe812b3e38f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3604,6 +3604,9 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!is_transparent_hugepage(folio))
 			goto next;

+		if (new_order >= folio_order(folio))
+			goto next;
+
 		total++;
 		/*
 		 * For folios with private, split_huge_page_to_list_to_order()
@@ -3671,6 +3674,9 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		total++;
 		nr_pages = folio_nr_pages(folio);

+		if (new_order >= folio_order(folio))
+			goto next;
+
 		if (!folio_trylock(folio))
 			goto next;
From: Baolin Wang <baolin.wang@linux.alibaba.com>
mainline inclusion
from mainline-v6.9-rc2
commit 835c3a25aa373d486514e4e0f5a7450ea82ae489
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
-------------------------------------------------
Now the mTHP can also be split or added into the deferred list, so add folio_test_pmd_mappable() validation for PMD mapped THP, to avoid confusion with PMD mapped THP related statistics.
[baolin.wang@linux.alibaba.com: check THP earlier in case folio is split, per Lance]
Link: https://lkml.kernel.org/r/b99f8cb14bc85fdb6ab43721d1331cb5ebed2581.171377104...
Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.171169485...
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 835c3a25aa373d486514e4e0f5a7450ea82ae489)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/huge_memory.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fe812b3e38f7..a18138e8ebe2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3155,6 +3155,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
+	bool is_thp = folio_test_pmd_mappable(folio);
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
@@ -3333,7 +3334,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	i_mmap_unlock_read(mapping);
 out:
 	xas_destroy(&xas);
-	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
+	if (is_thp)
+		count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	return ret;
 }

@@ -3395,7 +3397,8 @@ void deferred_split_folio(struct folio *folio)

 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (list_empty(&folio->_deferred_list)) {
-		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+		if (folio_test_pmd_mappable(folio))
+			count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
 		ds_queue->split_queue_len++;
 #ifdef CONFIG_MEMCG
From: Matthew Wilcox <willy@infradead.org>
mainline inclusion
from mainline-v6.9-rc2
commit e0ffb29bc54d86b9ab10ebafc66eb1b7229e0cd7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
-------------------------------------------------
Combine the three boolean arguments into one flags argument for readability.
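As an illustrative before/after sketch (using the TVA_* names introduced below; the values shown match the page-fault call site in this patch):

/* Old style: three positional booleans, easy to misorder at call sites. */
orders = thp_vma_allowable_orders(vma, vma->vm_flags,
				  /* smaps */ false, /* in_pf */ true,
				  /* enforce_sysfs */ true, orders);

/* New style: one self-describing flags word. */
orders = thp_vma_allowable_orders(vma, vma->vm_flags,
				  TVA_IN_PF | TVA_ENFORCE_SYSFS, orders);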
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit e0ffb29bc54d86b9ab10ebafc66eb1b7229e0cd7)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 fs/proc/task_mmu.c      |  4 ++--
 include/linux/huge_mm.h | 29 +++++++++++++++--------------
 mm/huge_memory.c        |  7 +++++--
 mm/khugepaged.c         | 16 +++++++---------
 mm/memory.c             | 10 ++++++----
 5 files changed, 35 insertions(+), 31 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 639eb70634f4..900ed13ba87d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -869,8 +869,8 @@ static int show_smap(struct seq_file *m, void *v)
 	__show_smap(m, &mss, false);

 	seq_printf(m, "THPeligible:    %8u\n",
-		   !!thp_vma_allowable_orders(vma, vma->vm_flags, true, false,
-					      true, THP_ORDERS_ALL));
+		   !!thp_vma_allowable_orders(vma, vma->vm_flags,
+					      TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));

 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 58ce4efb2019..18f7cfb7fca4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -89,8 +89,12 @@ extern struct kobj_attribute shmem_enabled_attr;
  */
 #define THP_ORDERS_ALL	(THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)

-#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
-	(!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order)))
+#define TVA_SMAPS		(1 << 0)	/* Will be used for procfs */
+#define TVA_IN_PF		(1 << 1)	/* Page fault handler */
+#define TVA_ENFORCE_SYSFS	(1 << 2)	/* Obey sysfs configuration */
+
+#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
+	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
@@ -216,17 +220,15 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 }

 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
-					 unsigned long vm_flags, bool smaps,
-					 bool in_pf, bool enforce_sysfs,
+					 unsigned long vm_flags,
+					 unsigned long tva_flags,
 					 unsigned long orders);

 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
  * @vm_flags: use these vm_flags instead of vma->vm_flags
- * @smaps: whether answer will be used for smaps file
- * @in_pf: whether answer will be used by page fault handler
- * @enforce_sysfs: whether sysfs config should be taken into account
+ * @tva_flags: Which TVA flags to honour
  * @orders: bitfield of all orders to consider
  *
  * Calculates the intersection of the requested hugepage orders and the allowed
@@ -239,12 +241,12 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
  */
 static inline
 unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
-				       unsigned long vm_flags, bool smaps,
-				       bool in_pf, bool enforce_sysfs,
+				       unsigned long vm_flags,
+				       unsigned long tva_flags,
 				       unsigned long orders)
 {
 	/* Optimization to check if required orders are enabled early. */
-	if (enforce_sysfs && vma_is_anonymous(vma)) {
+	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
 		unsigned long mask = READ_ONCE(huge_anon_orders_always);

 		if (vm_flags & VM_HUGEPAGE)
@@ -258,8 +260,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 		return 0;
 	}

-	return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
-					  enforce_sysfs, orders);
+	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }

 enum mthp_stat_item {
@@ -437,8 +438,8 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 }

 static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
-					unsigned long vm_flags, bool smaps,
-					bool in_pf, bool enforce_sysfs,
+					unsigned long vm_flags,
+					unsigned long tva_flags,
 					unsigned long orders)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a18138e8ebe2..eddb7984610d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -77,10 +77,13 @@ unsigned long huge_anon_orders_inherit __read_mostly;
 unsigned long huge_pcp_allow_orders __read_mostly;

 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
-					 unsigned long vm_flags, bool smaps,
-					 bool in_pf, bool enforce_sysfs,
+					 unsigned long vm_flags,
+					 unsigned long tva_flags,
 					 unsigned long orders)
 {
+	bool smaps = tva_flags & TVA_SMAPS;
+	bool in_pf = tva_flags & TVA_IN_PF;
+	bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
 	/* Check the intersection of requested and supported orders. */
 	orders &= vma_is_anonymous(vma) ?
 			THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5f999528ec30..fa787464662f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -459,7 +459,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_flags_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, false, false, true,
+		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
 					    PMD_ORDER))
 			__khugepaged_enter(vma->vm_mm);
 	}
@@ -925,6 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   struct collapse_control *cc)
 {
 	struct vm_area_struct *vma;
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;

 	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
@@ -935,8 +936,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,

 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false,
-				     cc->is_khugepaged, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1527,8 +1527,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 	 * analogously elide sysfs THP settings here.
 	 */
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, false,
-				     PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
 		return SCAN_VMA_CHECK;

 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2403,8 +2402,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false,
-					     true, PMD_ORDER)) {
+		if (!thp_vma_allowable_order(vma, vma->vm_flags,
+					TVA_ENFORCE_SYSFS, PMD_ORDER)) {
 skip:
 			progress++;
 			continue;
@@ -2741,8 +2740,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,

 	*prev = vma;

-	if (!thp_vma_allowable_order(vma, vma->vm_flags, false, false, false,
-				     PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
 		return -EINVAL;

 	if (task_in_dynamic_pool(current))
diff --git a/mm/memory.c b/mm/memory.c
index 9d264c303972..f52e52426da5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4343,8 +4343,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	 * for this vma. Then filter out the orders that can't be allocated over
 	 * the faulting address and still be fully contained in the vma.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
-					  BIT(PMD_ORDER) - 1);
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);

 	if (!orders)
@@ -5445,7 +5445,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
retry_pud:
 	if (pud_none(*vmf.pud) &&
-	    thp_vma_allowable_order(vma, vm_flags, false, true, true, PUD_ORDER) &&
+	    thp_vma_allowable_order(vma, vm_flags,
+				TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER) &&
 	    !task_in_dynamic_pool(current)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
@@ -5480,7 +5481,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		goto retry_pud;

 	if (pmd_none(*vmf.pmd) &&
-	    thp_vma_allowable_order(vma, vm_flags, false, true, true, PMD_ORDER) &&
+	    thp_vma_allowable_order(vma, vm_flags,
+				TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER) &&
 	    !task_in_dynamic_pool(current)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
--------------------------------
This reverts commit bb6adb65697a4870e33281e1dfb36147a8c953bc.
An issue with it was found on arm64, so revert it for now.
Fixes: bb6adb65697a ("mm: support multi-size THP numa balancing")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/memory.c   | 62 ++++++++++-----------------------------------------
 mm/mprotect.c |  3 +--
 2 files changed, 13 insertions(+), 52 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index f52e52426da5..a8f0df59aca1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5082,51 +5082,17 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
 }

 static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
-					unsigned long fault_addr, pte_t *fault_pte,
 					bool writable)
 {
 	pte_t pte, old_pte;

-	old_pte = ptep_modify_prot_start(vma, fault_addr, fault_pte);
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
 	if (writable)
 		pte = pte_mkwrite(pte, vma);
-	ptep_modify_prot_commit(vma, fault_addr, fault_pte, old_pte, pte);
-	update_mmu_cache_range(vmf, vma, fault_addr, fault_pte, 1);
-}
-
-static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
-				       struct folio *folio, pte_t fault_pte,
-				       bool ignore_writable, bool pte_write_upgrade)
-{
-	int nr = pte_pfn(fault_pte) - folio_pfn(folio);
-	unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start);
-	unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE, vma->vm_end);
-	pte_t *start_ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
-	unsigned long addr;
-
-	/* Restore all PTEs' mapping of the large folio */
-	for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) {
-		pte_t ptent = ptep_get(start_ptep);
-		bool writable = false;
-
-		if (!pte_present(ptent) || !pte_protnone(ptent))
-			continue;
-
-		if (pfn_folio(pte_pfn(ptent)) != folio)
-			continue;
-
-		if (!ignore_writable) {
-			ptent = pte_modify(ptent, vma->vm_page_prot);
-			writable = pte_write(ptent);
-			if (!writable && pte_write_upgrade &&
-			    can_change_pte_writable(vma, addr, ptent))
-				writable = true;
-		}
-
-		numa_rebuild_single_mapping(vmf, vma, addr, start_ptep, writable);
-	}
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
+	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
 }

 static vm_fault_t do_numa_page(struct vm_fault *vmf)
@@ -5134,12 +5100,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio = NULL;
 	int nid = NUMA_NO_NODE;
-	bool writable = false, ignore_writable = false;
-	bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
+	bool writable = false;
 	int last_cpupid;
 	int target_nid;
 	pte_t pte, old_pte;
-	int flags = 0, nr_pages;
+	int flags = 0;

 	/*
 	 * The pte cannot be used safely until we verify, while holding the page
 	 * table lock, that its contents have not changed during fault handling.
 	 */
@@ -5161,7 +5126,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * is only valid while holding the PT lock.
 	 */
 	writable = pte_write(pte);
-	if (!writable && pte_write_upgrade &&
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
 	    can_change_pte_writable(vma, vmf->address, pte))
 		writable = true;

@@ -5169,6 +5134,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (!folio || folio_is_zone_device(folio))
 		goto out_map;

+	/* TODO: handle PTE-mapped THP */
+	if (folio_test_large(folio))
+		goto out_map;
+
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
 	 * much anyway since they can be in shared cache state. This misses
@@ -5188,7 +5157,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_SHARED;

 	nid = folio_nid(folio);
-	nr_pages = folio_nr_pages(folio);
 	/*
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time. So use default value.
@@ -5205,7 +5173,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	}
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	writable = false;
-	ignore_writable = true;

 	/* Migrate to the requested node */
 	if (migrate_misplaced_folio(folio, vma, target_nid)) {
@@ -5226,19 +5193,14 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)

 out:
 	if (nid != NUMA_NO_NODE)
-		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+		task_numa_fault(last_cpupid, nid, 1, flags);
 	return 0;
 out_map:
 	/*
 	 * Make it present again, depending on how arch implements
 	 * non-accessible ptes, some can allow access by kernel mode.
 	 */
-	if (folio && folio_test_large(folio))
-		numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable,
-					   pte_write_upgrade);
-	else
-		numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
-					    writable);
+	numa_rebuild_single_mapping(vmf, vma, writable);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	goto out;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b360577be4f8..f121c46f6e4c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -129,8 +129,7 @@ static long change_pte_range(struct mmu_gather *tlb,

 				/* Also skip shared copy-on-write pages */
 				if (is_cow_mapping(vma->vm_flags) &&
-				    (folio_maybe_dma_pinned(folio) ||
-				     folio_likely_mapped_shared(folio)))
+				    folio_ref_count(folio) != 1)
 					continue;

 				/*
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-v6.9-rc3
commit 4ed91fa9177b236b73a271f11a333a98f076eb63
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9S4Z4
CVE: NA
-------------------------------------------------
During boot, the s390 system triggers "spinlock bad magic" messages if spinlock debugging is enabled:
[    0.465445] BUG: spinlock bad magic on CPU#0, swapper/0
[    0.465490]  lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[    0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[    0.466270] Call Trace:
[    0.466470]  [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8
[    0.466516]  [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108
[    0.466545]  [<000000000042146c>] find_vmap_area+0x6c/0x108
[    0.466572]  [<000000000042175a>] find_vm_area+0x22/0x40
[    0.466597]  [<000000000012f152>] __set_memory+0x132/0x150
[    0.466624]  [<0000000001cc0398>] vmem_map_init+0x40/0x118
[    0.466651]  [<0000000001cc0092>] paging_init+0x22/0x68
[    0.466677]  [<0000000001cbbed2>] setup_arch+0x52a/0x708
[    0.466702]  [<0000000001cb6140>] start_kernel+0x80/0x5c8
[    0.466727]  [<0000000000100036>] startup_continue+0x36/0x40
It happens because such a system tries to access some vmap areas before the vmalloc initialization has been done:
[    0.465490]  lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[    0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[    0.466270] Call Trace:
[    0.466470]  dump_stack_lvl (lib/dump_stack.c:117)
[    0.466516]  do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115)
[    0.466545]  find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364)
[    0.466572]  find_vm_area (mm/vmalloc.c:3150)
[    0.466597]  __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393)
[    0.466624]  vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660)
[    0.466651]  paging_init (arch/s390/mm/init.c:97)
[    0.466677]  setup_arch (arch/s390/kernel/setup.c:972)
[    0.466702]  start_kernel (init/main.c:899)
[    0.466727]  startup_continue (arch/s390/kernel/head64.S:35)
[    0.466811] INFO: lockdep is turned off.
...
[    0.718250] vmalloc init - busy lock init 0000000002871860
[    0.718328] vmalloc init - busy lock init 00000000028731b8
Some background: it worked before because the lock in question was statically defined and initialized. As of now, the locks and data structures are initialized in the vmalloc_init() function.
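The difference in miniature (a sketch using the stock spinlock helpers, not the vmalloc code itself):

/* Statically initialized: valid from the very first instruction. */
static DEFINE_SPINLOCK(early_usable_lock);

/* Runtime initialized: invalid (magic == 0) until init runs. */
static spinlock_t late_lock;

static void subsystem_init(void)
{
	spin_lock_init(&late_lock);	/* only now is late_lock usable */
}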
To address that issue, check whether the "vmap_initialized" variable is set; if it is not, find_vmap_area() bails out on entry and returns NULL.
Link: https://lkml.kernel.org/r/20240323141544.4150-1-urezki@gmail.com
Fixes: 72210662c5a2 ("mm: vmalloc: offload free_vmap_area_lock lock")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4ed91fa9177b236b73a271f11a333a98f076eb63)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/vmalloc.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e6058942a084..6320c0dfba0d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2343,6 +2343,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 	struct vmap_area *va;
 	int i, j;

+	if (unlikely(!vmap_initialized))
+		return NULL;
+
 	/*
 	 * An addr_to_node_id(addr) converts an address to a node index
 	 * where a VA is located. If VA spans several zones and passed
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7997 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/D...