Patch [8] has a context conflict.
David Hildenbrand (1):
      mm: don't install PMD mappings when THPs are disabled by the hw/process/vma

Jingbo Xu (2):
      mm: fix arithmetic for bdi min_ratio
      mm: fix arithmetic for max_prop_frac when setting max_ratio

Kefeng Wang (1):
      mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()

Liu Shixin (4):
      mm: kmemleak: split __create_object into two functions
      mm: kmemleak: use mem_pool_free() to free object
      mm: kmemleak: add __find_and_remove_object()
      mm/kmemleak: fix partially freeing unknown object warning

Xueshi Hu (1):
      mm/hugetlb: fix nodes huge page allocation when there are surplus pages
 include/linux/huge_mm.h |  18 ++++++
 mm/huge_memory.c        |  13 +----
 mm/hugetlb.c            |   4 +-
 mm/kmemleak.c           | 126 ++++++++++++++++++++++++++++------------
 mm/memory.c             |   9 +++
 mm/page-writeback.c     |   4 +-
 mm/shmem.c              |   7 +--
 7 files changed, 124 insertions(+), 57 deletions(-)
From: Xueshi Hu <xueshi.hu@smartx.com>
mainline inclusion
from mainline-v6.7-rc1
commit b72b3c9c34c825c81d205241c5f822fc7835923f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In set_max_huge_pages(), the local variable "count" is used to record persistent_huge_pages(), but when it comes to per-node huge page allocation, its semantics silently change to nr_huge_pages. When surplus huge pages exist and the interface under /sys/devices/system/node/node*/hugepages is used to change the huge page pool size, this difference results in the allocation of an unexpected number of huge pages.
Steps to reproduce the bug:
Starting with:
                     Node 0     Node 1     Total
HugePages_Total        0.00       0.00      0.00
HugePages_Free         0.00       0.00      0.00
HugePages_Surp         0.00       0.00      0.00
Create 100 huge pages in Node 0 and consume them, then set Node 0's nr_hugepages to 0.
yields:
                     Node 0     Node 1     Total
HugePages_Total      200.00       0.00    200.00
HugePages_Free         0.00       0.00      0.00
HugePages_Surp       200.00       0.00    200.00
Write 100 to Node 1's nr_hugepages:

echo 100 > /sys/devices/system/node/node1/\
hugepages/hugepages-2048kB/nr_hugepages
gets:
                     Node 0     Node 1     Total
HugePages_Total      200.00     400.00    600.00
HugePages_Free         0.00     400.00    400.00
HugePages_Surp       200.00       0.00    200.00
The kernel is expected to create only 100 huge pages, but it creates 200.
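As a hedged walkthrough of the arithmetic (counter values inferred from the reproducer output above, which is in MB, so 200.00 MB of 2 MB pages means 100 pages):

	/*
	 * State before writing 100 to node 1:
	 *   h->nr_huge_pages              = 100   (all surplus, on node 0)
	 *   h->surplus_huge_pages         = 100
	 *   h->nr_huge_pages_node[1]      = 0
	 *   h->surplus_huge_pages_node[1] = 0
	 *
	 * Old arithmetic (mixes total and persistent counts):
	 *   count = 100 + h->nr_huge_pages - h->nr_huge_pages_node[1]
	 *         = 100 + 100 - 0 = 200   -> 200 pages allocated on node 1
	 *
	 * Fixed arithmetic (persistent counts only):
	 *   count = 100 + persistent_huge_pages(h)                   (= 0)
	 *               - (h->nr_huge_pages_node[1] -
	 *                  h->surplus_huge_pages_node[1])            (= 0)
	 *         = 100                   -> 100 pages, as requested
	 */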
Link: https://lkml.kernel.org/r/20230829033343.467779-1-xueshi.hu@smartx.com
Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes")
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/hugetlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d53d52652f3..a80b431a8f72 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3696,7 +3696,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (nid != NUMA_NO_NODE) {
 		unsigned long old_count = count;
 
-		count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+		count += persistent_huge_pages(h) -
+			 (h->nr_huge_pages_node[nid] -
+			  h->surplus_huge_pages_node[nid]);
 		/*
 		 * User may have specified a large count value which caused the
 		 * above calculation to overflow. In this case, they wanted
mainline inclusion
from mainline-v6.7-rc1
commit 0edd7b58293340d26380d7510fd4fc08e5902511
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
__create_object() consists of two parts: the first allocates a kmemleak object and initializes it; the second inserts it into the object tree. The function takes kmemleak_lock, but only the second part actually needs the lock.

Split it into two functions: __alloc_object(), which only allocates a kmemleak object, and __link_object(), which initializes the object and inserts it into the object tree. Use kmemleak_lock to protect __link_object() only. This also prepares for a later patch in this series, which needs to allocate objects up front and link them while already holding kmemleak_lock.
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20231018102952.3339837-5-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/kmemleak.c | 61 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 21 deletions(-)
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 54c2c90d3abc..a95a0331f93c 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -623,25 +623,15 @@ static noinline depot_stack_handle_t set_track_prepare(void)
 	return trace_handle;
 }
 
-/*
- * Create the metadata (struct kmemleak_object) corresponding to an allocated
- * memory block and add it to the object_list and object_tree_root (or
- * object_phys_tree_root).
- */
-static void __create_object(unsigned long ptr, size_t size,
-			    int min_count, gfp_t gfp, bool is_phys)
+static struct kmemleak_object *__alloc_object(gfp_t gfp)
 {
-	unsigned long flags;
-	struct kmemleak_object *object, *parent;
-	struct rb_node **link, *rb_parent;
-	unsigned long untagged_ptr;
-	unsigned long untagged_objp;
+	struct kmemleak_object *object;
 
 	object = mem_pool_alloc(gfp);
 	if (!object) {
 		pr_warn("Cannot allocate a kmemleak_object structure\n");
 		kmemleak_disable();
-		return;
+		return NULL;
 	}
 
 	INIT_LIST_HEAD(&object->object_list);
@@ -649,13 +639,8 @@ static void __create_object(unsigned long ptr, size_t size,
 	INIT_HLIST_HEAD(&object->area_list);
 	raw_spin_lock_init(&object->lock);
 	atomic_set(&object->use_count, 1);
-	object->flags = OBJECT_ALLOCATED | (is_phys ? OBJECT_PHYS : 0);
-	object->pointer = ptr;
-	object->size = kfence_ksize((void *)ptr) ?: size;
 	object->excess_ref = 0;
-	object->min_count = min_count;
 	object->count = 0;			/* white color initially */
-	object->jiffies = jiffies;
 	object->checksum = 0;
 	object->del_state = 0;
 
@@ -680,7 +665,23 @@ static void __create_object(unsigned long ptr, size_t size,
 	/* kernel backtrace */
 	object->trace_handle = set_track_prepare();
 
-	raw_spin_lock_irqsave(&kmemleak_lock, flags);
+	return object;
+}
+
+static void __link_object(struct kmemleak_object *object, unsigned long ptr,
+			  size_t size, int min_count, bool is_phys)
+{
+
+	struct kmemleak_object *parent;
+	struct rb_node **link, *rb_parent;
+	unsigned long untagged_ptr;
+	unsigned long untagged_objp;
+
+	object->flags = OBJECT_ALLOCATED | (is_phys ? OBJECT_PHYS : 0);
+	object->pointer = ptr;
+	object->size = kfence_ksize((void *)ptr) ?: size;
+	object->min_count = min_count;
+	object->jiffies = jiffies;
 
 	untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr);
 	/*
@@ -711,14 +712,32 @@ static void __create_object(unsigned long ptr, size_t size,
 			 */
 			dump_object_info(parent);
 			kmem_cache_free(object_cache, object);
-			goto out;
+			return;
 		}
 	}
 	rb_link_node(&object->rb_node, rb_parent, link);
 	rb_insert_color(&object->rb_node, is_phys ? &object_phys_tree_root :
 					  &object_tree_root);
 	list_add_tail_rcu(&object->object_list, &object_list);
-out:
+}
+
+/*
+ * Create the metadata (struct kmemleak_object) corresponding to an allocated
+ * memory block and add it to the object_list and object_tree_root (or
+ * object_phys_tree_root).
+ */
+static void __create_object(unsigned long ptr, size_t size,
+			    int min_count, gfp_t gfp, bool is_phys)
+{
+	struct kmemleak_object *object;
+	unsigned long flags;
+
+	object = __alloc_object(gfp);
+	if (!object)
+		return;
+
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
+	__link_object(object, ptr, size, min_count, is_phys);
 	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 }
mainline inclusion
from mainline-v6.7-rc1
commit 2e1d47385f98aa0fa7d71aeee32d26cc6f8493e0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The kmemleak object is allocated by mem_pool_alloc(), which may take it either from the slab or from the static mem_pool[] array, so it is not appropriate to free it with kmem_cache_free(); use mem_pool_free() instead.
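For context, mem_pool_free() in mainline mm/kmemleak.c roughly does the following; it only falls back to kmem_cache_free() when the object did not come from the static pool (a simplified sketch, not the verbatim source):

	static void mem_pool_free(struct kmemleak_object *object)
	{
		unsigned long flags;

		/* object came from the slab cache: free it back to the slab */
		if (object < mem_pool || object >= mem_pool + ARRAY_SIZE(mem_pool)) {
			kmem_cache_free(object_cache, object);
			return;
		}

		/* object came from mem_pool[]: put it back on the free list */
		raw_spin_lock_irqsave(&kmemleak_lock, flags);
		list_add(&object->object_list, &mem_pool_free_list);
		raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
	}

Calling kmem_cache_free() on a mem_pool[] entry would hand static memory to the slab allocator, hence the fix below.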
Link: https://lkml.kernel.org/r/20231018102952.3339837-6-liushixin2@huawei.com
Fixes: 0647398a8c7b ("mm: kmemleak: simple memory allocation pool for kmemleak objects")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/kmemleak.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index a95a0331f93c..4155b399bee4 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -668,8 +668,8 @@ static struct kmemleak_object *__alloc_object(gfp_t gfp)
 	return object;
 }
 
-static void __link_object(struct kmemleak_object *object, unsigned long ptr,
-			  size_t size, int min_count, bool is_phys)
+static int __link_object(struct kmemleak_object *object, unsigned long ptr,
+			 size_t size, int min_count, bool is_phys)
 {
 
 	struct kmemleak_object *parent;
@@ -711,14 +711,15 @@ static void __link_object(struct kmemleak_object *object, unsigned long ptr,
 			 * be freed while the kmemleak_lock is held.
 			 */
 			dump_object_info(parent);
-			kmem_cache_free(object_cache, object);
-			return;
+			return -EEXIST;
 		}
 	}
 	rb_link_node(&object->rb_node, rb_parent, link);
 	rb_insert_color(&object->rb_node, is_phys ? &object_phys_tree_root :
 					  &object_tree_root);
 	list_add_tail_rcu(&object->object_list, &object_list);
+
+	return 0;
 }
 
 /*
@@ -731,14 +732,17 @@ static void __create_object(unsigned long ptr, size_t size,
 {
 	struct kmemleak_object *object;
 	unsigned long flags;
+	int ret;
 
 	object = __alloc_object(gfp);
 	if (!object)
 		return;
 
 	raw_spin_lock_irqsave(&kmemleak_lock, flags);
-	__link_object(object, ptr, size, min_count, is_phys);
+	ret = __link_object(object, ptr, size, min_count, is_phys);
 	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
+	if (ret)
+		mem_pool_free(object);
 }
 
 /* Create kmemleak object which allocated with virtual address. */
mainline inclusion
from mainline-v6.7-rc1
commit 858a195b9330d0bd36f59e8652f693a1beb7da2f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Add a new __find_and_remove_object() that does not take kmemleak_lock itself; this is in preparation for the next patch.
Link: https://lkml.kernel.org/r/20231018102952.3339837-7-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/kmemleak.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 4155b399bee4..39ea73364282 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -583,6 +583,19 @@ static void __remove_object(struct kmemleak_object *object)
 	object->del_state |= DELSTATE_REMOVED;
 }
 
+static struct kmemleak_object *__find_and_remove_object(unsigned long ptr,
+							int alias,
+							bool is_phys)
+{
+	struct kmemleak_object *object;
+
+	object = __lookup_object(ptr, alias, is_phys);
+	if (object)
+		__remove_object(object);
+
+	return object;
+}
+
 /*
  * Look up an object in the object search tree and remove it from both
  * object_tree_root (or object_phys_tree_root) and object_list. The
@@ -596,9 +609,7 @@ static struct kmemleak_object *find_and_remove_object(unsigned long ptr, int ali
 	struct kmemleak_object *object;
 
 	raw_spin_lock_irqsave(&kmemleak_lock, flags);
-	object = __lookup_object(ptr, alias, is_phys);
-	if (object)
-		__remove_object(object);
+	object = __find_and_remove_object(ptr, alias, is_phys);
 	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 
 	return object;
mainline inclusion
from mainline-v6.7-rc1
commit 5e4fc577db2514fee3cc2729415d0e2589d5d056
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
delete_object_part() can be called by multiple callers at the same time. If an object is found and removed by one caller, another caller that then tries to find it fails and returns directly. The object is still recorded by kmemleak even though it has already been freed to the buddy allocator. With DEBUG on, kmemleak reports the following warning:
kmemleak: Partially freeing unknown object at 0xa1af86000 (size 4096)
CPU: 0 PID: 742 Comm: test_huge Not tainted 6.6.0-rc3kmemleak+ #54
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x37/0x50
 kmemleak_free_part_phys+0x50/0x60
 hugetlb_vmemmap_optimize+0x172/0x290
 ? __pfx_vmemmap_remap_pte+0x10/0x10
 __prep_new_hugetlb_folio+0xe/0x30
 prep_new_hugetlb_folio.isra.0+0xe/0x40
 alloc_fresh_hugetlb_folio+0xc3/0xd0
 alloc_surplus_hugetlb_folio.constprop.0+0x6e/0xd0
 hugetlb_acct_memory.part.0+0xe6/0x2a0
 hugetlb_reserve_pages+0x110/0x2c0
 hugetlbfs_file_mmap+0x11d/0x1b0
 mmap_region+0x248/0x9a0
 ? hugetlb_get_unmapped_area+0x15c/0x2d0
 do_mmap+0x38b/0x580
 vm_mmap_pgoff+0xe6/0x190
 ksys_mmap_pgoff+0x18a/0x1f0
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Expand __create_object() and move __alloc_object() to the beginning. Then use kmemleak_lock to protect __find_and_remove_object() and __link_object() as a whole, which guarantees that all objects are processed sequentially.
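To illustrate the window being closed, a rough sketch of the racy sequence (an illustration of the description above, not part of the original changelog):

	/*
	 * CPU A                               CPU B
	 * delete_object_part(ptr, size)
	 *   find_and_remove_object() -> obj
	 *   (kmemleak_lock dropped)
	 *                                      delete_object_part() on part of
	 *                                      obj's not-yet-relinked remnant:
	 *                                        find_and_remove_object() -> NULL
	 *                                        -> "Partially freeing unknown
	 *                                           object" warning; that range
	 *                                           stays recorded although it
	 *                                           was freed to buddy
	 *   __create_object() re-inserts the
	 *   left/right remnants, too late
	 */

After this patch, the remnant objects are pre-allocated and __find_and_remove_object() plus __link_object() run under a single kmemleak_lock critical section, so no caller can observe the intermediate state.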
Link: https://lkml.kernel.org/r/20231018102952.3339837-8-liushixin2@huawei.com
Fixes: 53238a60dd4a ("kmemleak: Allow partial freeing of memory blocks")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/kmemleak.c | 42 +++++++++++++++++++++++++++++++-----------
 1 file changed, 31 insertions(+), 11 deletions(-)
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 39ea73364282..7d210dd74381 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -816,16 +816,25 @@ static void delete_object_full(unsigned long ptr)
  */
 static void delete_object_part(unsigned long ptr, size_t size, bool is_phys)
 {
-	struct kmemleak_object *object;
-	unsigned long start, end;
+	struct kmemleak_object *object, *object_l, *object_r;
+	unsigned long start, end, flags;
+
+	object_l = __alloc_object(GFP_KERNEL);
+	if (!object_l)
+		return;
 
-	object = find_and_remove_object(ptr, 1, is_phys);
+	object_r = __alloc_object(GFP_KERNEL);
+	if (!object_r)
+		goto out;
+
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
+	object = __find_and_remove_object(ptr, 1, is_phys);
 	if (!object) {
 #ifdef DEBUG
 		kmemleak_warn("Partially freeing unknown object at 0x%08lx (size %zu)\n",
 			      ptr, size);
 #endif
-		return;
+		goto unlock;
 	}
 
 	/*
@@ -835,14 +844,25 @@ static void delete_object_part(unsigned long ptr, size_t size, bool is_phys)
 	 */
 	start = object->pointer;
 	end = object->pointer + object->size;
-	if (ptr > start)
-		__create_object(start, ptr - start, object->min_count,
-				GFP_KERNEL, is_phys);
-	if (ptr + size < end)
-		__create_object(ptr + size, end - ptr - size, object->min_count,
-				GFP_KERNEL, is_phys);
+	if ((ptr > start) &&
+	    !__link_object(object_l, start, ptr - start,
+			   object->min_count, is_phys))
+		object_l = NULL;
+	if ((ptr + size < end) &&
+	    !__link_object(object_r, ptr + size, end - ptr - size,
+			   object->min_count, is_phys))
+		object_r = NULL;
+
+unlock:
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
+	if (object)
+		__delete_object(object);
 
-	__delete_object(object);
+out:
+	if (object_l)
+		mem_pool_free(object_l);
+	if (object_r)
+		mem_pool_free(object_r);
 }
 
 static void __paint_it(struct kmemleak_object *object, int color)
From: Jingbo Xu <jefflexu@linux.alibaba.com>
mainline inclusion
from mainline-v6.7
commit e0646b7590084a5bf3b056d3ad871d9379d2c25a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Since bdi->min_ratio is now expressed in parts per million, fix the wrong arithmetic. Otherwise setting any reasonable min_ratio fails with -EINVAL, because the code tries to set min_ratio to (min_ratio * BDI_RATIO_SCALE) in percentage units, which exceeds 100% anyway.
# cat /sys/class/bdi/253:0/min_ratio
0
# cat /sys/class/bdi/253:0/max_ratio
100
# echo 1 > /sys/class/bdi/253:0/min_ratio
-bash: echo: write error: Invalid argument
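Walking through the failing write (the call chain and BDI_RATIO_SCALE == 10000 are assumptions based on mainline mm/backing-dev.c and mm/page-writeback.c):

	/*
	 * echo 1 > /sys/class/bdi/253:0/min_ratio
	 *   -> bdi_set_min_ratio(bdi, 1)
	 *        passes 1 * BDI_RATIO_SCALE = 10000 ppm (i.e. 1%) down
	 *   -> __bdi_set_min_ratio(bdi, 10000)
	 *        buggy extra scaling: 10000 * BDI_RATIO_SCALE = 100000000 ppm
	 *        (i.e. 10000%), which is greater than bdi->max_ratio
	 *        (100% == 1000000 ppm), so the write fails with -EINVAL
	 */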
Link: https://lkml.kernel.org/r/20231219142508.86265-2-jefflexu@linux.alibaba.com
Fixes: 8021fb3232f2 ("mm: split off __bdi_set_min_ratio() function")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/page-writeback.c | 1 -
 1 file changed, 1 deletion(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e632ec9b6421..4e44a0e80e25 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -714,7 +714,6 @@ static int __bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ra
 
 	if (min_ratio > 100 * BDI_RATIO_SCALE)
 		return -EINVAL;
-	min_ratio *= BDI_RATIO_SCALE;
 
 	spin_lock_bh(&bdi_lock);
 	if (min_ratio > bdi->max_ratio) {
From: Jingbo Xu <jefflexu@linux.alibaba.com>
mainline inclusion
from mainline-v6.7
commit fa151a39a6879144b587f35c0dfcc15e1be9450f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Since bdi->max_ratio is now expressed in parts per million, fix the wrong arithmetic for max_prop_frac when setting max_ratio. Otherwise the miscalculated max_prop_frac will affect the incrementing of the writeout completion count when max_ratio is not 100%.
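A worked example of the broken scaling (assuming BDI_RATIO_SCALE == 10000 and FPROP_FRAC_BASE == 1 << 10, as in mainline):

	/*
	 * Setting max_ratio to 50% stores bdi->max_ratio = 50 * 10000 = 500000.
	 *
	 * Old:   max_prop_frac = FPROP_FRAC_BASE * 500000 / 100
	 *                      = 5000 * FPROP_FRAC_BASE    (far beyond 100%)
	 * Fixed: max_prop_frac = FPROP_FRAC_BASE * 500000 / (100 * 10000)
	 *                      = FPROP_FRAC_BASE / 2       (correctly 50%)
	 */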
Link: https://lkml.kernel.org/r/20231219142508.86265-3-jefflexu@linux.alibaba.com
Fixes: efc3e6ad53ea ("mm: split off __bdi_set_max_ratio() function")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/page-writeback.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4e44a0e80e25..903933ed0d56 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -750,7 +750,8 @@ static int __bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ra
 		ret = -EINVAL;
 	} else {
 		bdi->max_ratio = max_ratio;
-		bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) / 100;
+		bdi->max_prop_frac = (FPROP_FRAC_BASE * max_ratio) /
+					(100 * BDI_RATIO_SCALE);
 	}
 	spin_unlock_bh(&bdi_lock);
From: Kefeng Wang <wangkefeng.wang@huawei.com>
mainline inclusion
from mainline-v6.12-rc4
commit 963756aac1f011d904ddd9548ae82286d3a91f96
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: don't install PMD mappings when THPs are disabled by the hw/process/vma".
During testing, it was found that we can get PMD mappings in processes where THP (and more precisely, PMD mappings) are supposed to be disabled. While it works as expected for anon+shmem, the pagecache is the problematic bit.
For s390 KVM this currently means that a VM backed by a file located on filesystem with large folio support can crash when KVM tries accessing the problematic page, because the readahead logic might decide to use a PMD-sized THP and faulting it into the page tables will install a PMD mapping, something that s390 KVM cannot tolerate.
This might also be a problem with HW that does not support PMD mappings, but I did not try reproducing it.
Fix it by respecting the ways to disable THPs when deciding whether we can install a PMD mapping. khugepaged should already be taking care of not collapsing if THPs are effectively disabled for the hw/process/vma.
This patch (of 2):
Add vma_thp_disabled() and thp_disabled_by_hw() helpers to be shared by shmem_allowable_huge_orders() and __thp_vma_allowable_orders().
[david@redhat.com: rename to vma_thp_disabled(), split out thp_disabled_by_hw()]
Link: https://lkml.kernel.org/r/20241011102445.934409-2-david@redhat.com
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Leo Fu <bfu@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Boqiao Fu <bfu@redhat.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	include/linux/huge_mm.h
[ Context conflicts. ]
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 include/linux/huge_mm.h | 18 ++++++++++++++++++
 mm/huge_memory.c        | 13 +------------
 mm/shmem.c              |  7 +------
 3 files changed, 20 insertions(+), 18 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 27bd7ec3a546..02910e742bb1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -330,6 +330,24 @@ struct thpsize {
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_FILE_MTHP_FLAG))
 
+static inline bool vma_thp_disabled(struct vm_area_struct *vma,
+		unsigned long vm_flags)
+{
+	/*
+	 * Explicitly disabled through madvise or prctl, or some
+	 * architectures may disable THP for some mappings, for
+	 * example, s390 kvm.
+	 */
+	return (vm_flags & VM_NOHUGEPAGE) ||
+	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
+}
+
+static inline bool thp_disabled_by_hw(void)
+{
+	/* If the hardware/firmware marked hugepage support disabled. */
+	return transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED);
+}
+
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3c58e95a33eb..3df13ed40214 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -105,18 +105,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if (!vma->vm_mm)		/* vdso */
 		return 0;
 
-	/*
-	 * Explicitly disabled through madvise or prctl, or some
-	 * architectures may disable THP for some mappings, for
-	 * example, s390 kvm.
-	 * */
-	if ((vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return 0;
-	/*
-	 * If the hardware/firmware marked hugepage support disabled.
-	 */
-	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
 		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/shmem.c b/mm/shmem.c
index 65553342a16f..f9e48c353a9d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1686,12 +1686,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	loff_t i_size;
 	int order;
 
-	if (vma && ((vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
-		return 0;
-
-	/* If the hardware/firmware marked hugepage support disabled. */
-	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
 		return 0;
 
 	global_huge = shmem_huge_global_enabled(inode, index, write_end,
From: David Hildenbrand <david@redhat.com>
mainline inclusion
from mainline-v6.12-rc4
commit 2b0f922323ccfa76219bcaacd35cd50aeaa13592
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IBE0E0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We (or rather, readahead logic :) ) might be allocating a THP in the pagecache and then try mapping it into a process that explicitly disabled THP: we might end up installing PMD mappings.
This is a problem for s390x KVM, which explicitly remaps all PMD-mapped THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before starting the VM.
For example, starting a VM backed on a file system with large folios supported makes the VM crash when the VM tries accessing such a mapping using KVM.
Is it also a problem when the HW disabled THP using TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case without X86_FEATURE_PSE.
In the future, we might be able to do better on s390x and only disallow PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED really wants. For now, fix it by essentially performing the same check as would be done in __thp_vma_allowable_orders() or in shmem code, where this works as expected, and disallow PMD mappings, making us fallback to PTE mappings.
Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Leo Fu <bfu@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/memory.c | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index badaf096a344..d275c2ae2723 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4687,6 +4687,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	pmd_t entry;
 	vm_fault_t ret = VM_FAULT_FALLBACK;
 
+	/*
+	 * It is too late to allocate a small folio, we already have a large
+	 * folio in the pagecache: especially s390 KVM cannot tolerate any
+	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
+	 * PMD mappings if THPs are disabled.
+	 */
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
+		return ret;
+
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;
FeedBack: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/14336
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/E...