From: ZhangPeng <zhangpeng362@huawei.com>
Backport page fault and fork optimizations, including folio adding and splitting optimization, removal of unnecessary uffd_wp checks, and anon/CoW/shared fault optimization.

We can get a 3% performance improvement for lmbench fork_proc and 1.5% for lmbench page_fault.
Kairui Song (5):
  mm/filemap: return early if failed to allocate memory for split
  mm/filemap: clean up hugetlb exclusion code
  lib/xarray: introduce a new helper xas_get_order
  lib/xarray: introduce a new helper xas_get_order
  mm/filemap: optimize filemap folio adding
Kefeng Wang (2):
  mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp()
  mm: swapfile: check usable swap device in __folio_throttle_swaprate()
 include/linux/xarray.h |  6 +++
 lib/test_xarray.c      | 93 ++++++++++++++++++++++++++++++++++++++++++
 lib/xarray.c           | 49 ++++++++++++++--------
 mm/filemap.c           | 74 +++++++++++++++++++++------------
 mm/memory.c            | 10 ++---
 mm/swapfile.c          | 13 ++++-
 6 files changed, 194 insertions(+), 51 deletions(-)
From: Kefeng Wang <wangkefeng.wang@huawei.com>
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9JAY9
CVE: NA
Reference: https://lore.kernel.org/linux-mm/20240422030039.3293568-1-wangkefeng.wang@hu...
--------------------------------
Add a userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the unnecessary FAULT_FLAG_ORIG_PTE_VALID check/pte_marker_entry_uffd_wp() call in most page faults. Note that vmf_orig_pte_uffd_wp() is not inlined in either kernel version; the difference is shown below in the perf data,
perf report -i perf.data.before | grep vmf
    0.17%     0.13%  lat_pagefault  [kernel.kallsyms]   [k] vmf_orig_pte_uffd_wp.part.0.isra.0

perf report -i perf.data.after  | grep vmf
lat_pagefault -W 5 -N 5 /tmp/XXX

latency              before       after        diff
average(8 tests)     0.262675     0.2600375    -0.0026375
Although the gain is small, uffd_wp is a newer feature relative to older kernels; when the vma is not registered with UFFD_WP, avoid executing the new logic. Also add the __always_inline attribute to vmf_orig_pte_uffd_wp(), which lets set_pte_range() check only the VM_UFFD_WP flag without a function call. In addition, call vmf_orig_pte_uffd_wp() directly in do_anonymous_page() and set_pte_range() to save a local uffd_wp variable.
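For reference, a sketch of the resulting helper after this patch (assembled from the hunk below; the final pte_marker_uffd_wp() return is taken from the surrounding upstream function, not from this hunk):

static __always_inline bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
{
        /* Cheap vma-flag test first: most vmas have no UFFD_WP registration. */
        if (!userfaultfd_wp(vmf->vma))
                return false;
        /* Only then inspect the (possibly invalid) original pte. */
        if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
                return false;

        return pte_marker_uffd_wp(vmf->orig_pte);
}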
Link: https://lkml.kernel.org/r/20240422030039.3293568-1-wangkefeng.wang@huawei.co...
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 mm/memory.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2c317f517d1d..833e4d076f01 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -113,8 +113,10 @@ static bool vmf_pte_changed(struct vm_fault *vmf);
  * Return true if the original pte was a uffd-wp pte marker (so the pte was
  * wr-protected).
  */
-static bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
+static __always_inline bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
 {
+        if (!userfaultfd_wp(vmf->vma))
+                return false;
         if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
                 return false;
@@ -4366,7 +4368,6 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
  */
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
-        bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
         struct vm_area_struct *vma = vmf->vma;
         unsigned long addr = vmf->address;
         struct folio *folio;
@@ -4464,7 +4465,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
                 folio_add_new_anon_rmap(folio, vma, addr);
                 folio_add_lru_vma(folio, vma);
 setpte:
-        if (uffd_wp)
+        if (vmf_orig_pte_uffd_wp(vmf))
                 entry = pte_mkuffd_wp(entry);
         set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
@@ -4640,7 +4641,6 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
                 struct page *page, unsigned int nr, unsigned long addr)
 {
         struct vm_area_struct *vma = vmf->vma;
-        bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
         bool write = vmf->flags & FAULT_FLAG_WRITE;
         bool prefault = in_range(vmf->address, addr, nr * PAGE_SIZE);
         pte_t entry;
@@ -4655,7 +4655,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,

         if (write)
                 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-        if (unlikely(uffd_wp))
+        if (unlikely(vmf_orig_pte_uffd_wp(vmf)))
                 entry = pte_mkuffd_wp(entry);
         /* copy-on-write page */
         add_reliable_folio_counter(folio, vma->vm_mm, nr);
From: Kairui Song <kasong@tencent.com>
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9JAY9
CVE: NA
Reference: https://lore.kernel.org/linux-mm/20240416071722.45997-1-ryncsn@gmail.com/#t
--------------------------------
Patch series "mm/filemap: optimize folio adding and splitting", v4.
Currently, at least 3 tree walks are needed to add a folio to the filemap if the folio was previously evicted: one to get the order of the current slot, one for the ranged conflict check, and one more to retrieve the order again. If a split is needed, even more walks are required.

This series merges these walks and speeds up filemap_add_folio(); I see a 7.5% - 12.5% performance gain in a fio stress test.

Instead of doing multiple tree walks, do one optimistic range check with the lock held, and exit if we raced with another insertion. If a shadow entry exists, check it with the new xas_get_order helper before releasing the lock, to avoid a redundant tree walk for its order.
Drop the lock and do the allocation only if a split is needed.
In the best case, it only needs to walk the tree once. If it needs to allocate and split, three walks are issued (one for the first ranged conflict check and order retrieval, one for the second check after allocation, and one for the insert after the split).
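As a sketch, this gives the insert loop in __filemap_add_folio() the following shape (condensed from the mm/filemap.c hunk in the last patch of this series; error handling, stats updates and the hugetlb path are omitted here):

        for (;;) {
                int order = -1, split_order = 0;

                xas_lock_irq(&xas);
                xas_for_each_conflict(&xas, entry) {
                        old = entry;
                        /* a larger entry is the first and only entry iterated */
                        if (order == -1)
                                order = xas_get_order(&xas);
                }

                if (old && order > folio_order(folio)) {
                        if (!alloced_order) {
                                /* can't allocate under the lock: remember and retry */
                                split_order = order;
                                goto unlock;
                        }
                        xas_split(&xas, old, order);
                        xas_reset(&xas);
                }

                xas_store(&xas, folio);
unlock:
                xas_unlock_irq(&xas);

                if (split_order) {
                        /* allocate split nodes without the lock held, then retry */
                        xas_split_alloc(&xas, old, split_order, gfp);
                        if (xas_error(&xas))
                                break;
                        alloced_order = split_order;
                        xas_reset(&xas);
                        continue;
                }
                if (!xas_nomem(&xas, gfp))
                        break;
        }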
Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine=mmap --rw=randread --time_based \
  --ramp_time=30s --runtime=5m --group_reporting
Before:
bw (  MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops        : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
After (+7.3%):
bw (  MiB/s): min=  493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops        : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
Test result with THP (do a THP randread then switch to 4K pages, in the hope that it issues a lot of splitting):
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine=mmap -thp=1 --readonly \
  --rw=randread --time_based --ramp_time=30s --runtime=10m \
  --group_reporting

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine=mmap \
  --rw=randread --time_based --runtime=5s --group_reporting
Before:
bw (  KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops        : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976
READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
After (+12.5%):
bw (  KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops        : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
This patch (of 4):
xas_split_alloc() could fail with -ENOMEM, and in such a case it should abort early instead of continuing and failing the xas_split() below.
Link: https://lkml.kernel.org/r/20240416071722.45997-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-2-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 mm/filemap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index e8a7bd88bf90..27d46001072d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -876,9 +876,12 @@ noinline int __filemap_add_folio(struct address_space *mapping,
                 unsigned int order = xa_get_order(xas.xa, xas.xa_index);
                 void *entry, *old = NULL;

-                if (order > folio_order(folio))
+                if (order > folio_order(folio)) {
                         xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
                                         order, gfp);
+                        if (xas_error(&xas))
+                                goto error;
+                }
                 xas_lock_irq(&xas);
                 xas_for_each_conflict(&xas, entry) {
                         old = entry;
From: Kairui Song <kasong@tencent.com>
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9JAY9
CVE: NA
Reference: https://lore.kernel.org/linux-mm/20240416071722.45997-1-ryncsn@gmail.com/#t
--------------------------------
__filemap_add_folio only has two callers: one never passes a hugetlb folio, and one always does. So move the hugetlb-related cgroup charging out of it to make the code cleaner.
Link: https://lkml.kernel.org/r/20240415171857.19244-3-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
        mm/filemap.c

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 mm/filemap.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 27d46001072d..1ab3d88eb2f8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -849,20 +849,12 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 {
         XA_STATE(xas, &mapping->i_pages, index);
         int huge = folio_test_hugetlb(folio);
-        bool charged = false;
-        long nr = 1;
+        long nr;

         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
         VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
         mapping_set_update(&xas, mapping);

-        if (!huge) {
-                int error = mem_cgroup_charge(folio, NULL, gfp);
-                if (error)
-                        return error;
-                charged = true;
-        }
-
         VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
         xas_set_order(&xas, index, folio_order(folio));
         nr = folio_nr_pages(folio);
@@ -927,8 +919,6 @@ noinline int __filemap_add_folio(struct address_space *mapping,
         trace_mm_filemap_add_to_page_cache(folio);
         return 0;
 error:
-        if (charged)
-                mem_cgroup_uncharge(folio);
         folio->mapping = NULL;
         /* Leave page->index set: truncation relies upon it */
         folio_put_refs(folio, nr);
@@ -942,11 +932,16 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
         void *shadow = NULL;
         int ret;

+        ret = mem_cgroup_charge(folio, NULL, gfp);
+        if (ret)
+                return ret;
+
         __folio_set_locked(folio);
         ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
-        if (unlikely(ret))
+        if (unlikely(ret)) {
+                mem_cgroup_uncharge(folio);
                 __folio_clear_locked(folio);
-        else {
+        } else {
                 /*
                  * The folio might have been evicted from cache only
                  * recently, in which case it should be activated like
From: Kairui Song <kasong@tencent.com>
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9JAY9
CVE: NA
Reference: https://lore.kernel.org/linux-mm/20240416071722.45997-1-ryncsn@gmail.com/#t
--------------------------------
It can be used after xas_load() to check the order of the loaded entry. Compared to xa_get_order(), it saves an XA_STATE and avoids a rewalk.

Also add a new test for xas_get_order(); to make the test work, we have to export xas_get_order() with EXPORT_SYMBOL_GPL.
Also fix a sparse warning by checking the slot value with xa_entry instead of accessing it directly, as suggested by Matthew Wilcox.
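As a usage sketch, the pattern the helper enables (this mirrors the reworked xa_get_order() body in the hunk below) is:

        XA_STATE(xas, xa, index);
        void *entry;
        int order = 0;

        rcu_read_lock();
        entry = xas_load(&xas);
        if (entry)
                order = xas_get_order(&xas);    /* reuses the walk xas_load() just did */
        rcu_read_unlock();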
Link: https://lkml.kernel.org/r/20240415171857.19244-4-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
        lib/test_xarray.c

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 include/linux/xarray.h |  6 ++++++
 lib/test_xarray.c      | 34 +++++++++++++++++++++++++++++
 lib/xarray.c           | 49 ++++++++++++++++++++++++++----------------
 3 files changed, 71 insertions(+), 18 deletions(-)
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 89ec31f1f023..c3f54d2eaf36 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1551,6 +1551,7 @@ void xas_create_range(struct xa_state *);

 #ifdef CONFIG_XARRAY_MULTI
 int xa_get_order(struct xarray *, unsigned long index);
+int xas_get_order(struct xa_state *xas);
 void xas_split(struct xa_state *, void *entry, unsigned int order);
 void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
 #else
@@ -1559,6 +1560,11 @@ static inline int xa_get_order(struct xarray *xa, unsigned long index)
         return 0;
 }

+static inline int xas_get_order(struct xa_state *xas)
+{
+        return 0;
+}
+
 static inline void xas_split(struct xa_state *xas, void *entry,
                 unsigned int order)
 {
diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index e77d4856442c..2e229012920b 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1756,6 +1756,39 @@ static noinline void check_get_order(struct xarray *xa)
         }
 }

+static noinline void check_xas_get_order(struct xarray *xa)
+{
+        XA_STATE(xas, xa, 0);
+
+        unsigned int max_order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 20 : 1;
+        unsigned int order;
+        unsigned long i, j;
+
+        for (order = 0; order < max_order; order++) {
+                for (i = 0; i < 10; i++) {
+                        xas_set_order(&xas, i << order, order);
+                        do {
+                                xas_lock(&xas);
+                                xas_store(&xas, xa_mk_value(i));
+                                xas_unlock(&xas);
+                        } while (xas_nomem(&xas, GFP_KERNEL));
+
+                        for (j = i << order; j < (i + 1) << order; j++) {
+                                xas_set_order(&xas, j, 0);
+                                rcu_read_lock();
+                                xas_load(&xas);
+                                XA_BUG_ON(xa, xas_get_order(&xas) != order);
+                                rcu_read_unlock();
+                        }
+
+                        xas_lock(&xas);
+                        xas_set_order(&xas, i << order, order);
+                        xas_store(&xas, NULL);
+                        xas_unlock(&xas);
+                }
+        }
+}
+
 static noinline void check_destroy(struct xarray *xa)
 {
         unsigned long index;
@@ -1805,6 +1838,7 @@ static int xarray_checks(void)
         check_reserve(&xa0);
         check_multi_store(&array);
         check_get_order(&array);
+        check_xas_get_order(&array);
         check_xa_alloc();
         check_find(&array);
         check_find_entry(&array);
diff --git a/lib/xarray.c b/lib/xarray.c
index f49053882354..84faa433cd24 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1751,39 +1751,52 @@ void *xa_store_range(struct xarray *xa, unsigned long first,
 EXPORT_SYMBOL(xa_store_range);

 /**
- * xa_get_order() - Get the order of an entry.
- * @xa: XArray.
- * @index: Index of the entry.
+ * xas_get_order() - Get the order of an loaded entry after xas_load.
+ * @xas: XArray operation state.
+ *
+ * Called after xas_load, the xas should not be in an error state.
  *
  * Return: A number between 0 and 63 indicating the order of the entry.
  */
-int xa_get_order(struct xarray *xa, unsigned long index)
+int xas_get_order(struct xa_state *xas)
 {
-        XA_STATE(xas, xa, index);
-        void *entry;
         int order = 0;

-        rcu_read_lock();
-        entry = xas_load(&xas);
-
-        if (!entry)
-                goto unlock;
-
-        if (!xas.xa_node)
-                goto unlock;
+        if (!xas->xa_node)
+                return 0;

         for (;;) {
-                unsigned int slot = xas.xa_offset + (1 << order);
+                unsigned int slot = xas->xa_offset + (1 << order);

                 if (slot >= XA_CHUNK_SIZE)
                         break;
-                if (!xa_is_sibling(xas.xa_node->slots[slot]))
+                if (!xa_is_sibling(xas->xa_node->slots[slot]))
                         break;
                 order++;
         }

-        order += xas.xa_node->shift;
-unlock:
+        order += xas->xa_node->shift;
+        return order;
+}
+EXPORT_SYMBOL_GPL(xas_get_order);
+
+/**
+ * xa_get_order() - Get the order of an entry.
+ * @xa: XArray.
+ * @index: Index of the entry.
+ *
+ * Return: A number between 0 and 63 indicating the order of the entry.
+ */
+int xa_get_order(struct xarray *xa, unsigned long index)
+{
+        XA_STATE(xas, xa, index);
+        int order = 0;
+        void *entry;
+
+        rcu_read_lock();
+        entry = xas_load(&xas);
+        if (entry)
+                order = xas_get_order(&xas);
         rcu_read_unlock();

         return order;
From: Kairui Song <kasong@tencent.com>
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9JAY9
CVE: NA
Reference: https://lore.kernel.org/linux-mm/20240416071722.45997-1-ryncsn@gmail.com/#t
--------------------------------
Simplify the comment and fix a sparse warning, per Matthew Wilcox.
Link: https://lkml.kernel.org/r/20240416071722.45997-4-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 lib/xarray.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/xarray.c b/lib/xarray.c
index 84faa433cd24..1c87d871cacf 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1751,7 +1751,7 @@ void *xa_store_range(struct xarray *xa, unsigned long first,
 EXPORT_SYMBOL(xa_store_range);

 /**
- * xas_get_order() - Get the order of an loaded entry after xas_load.
+ * xas_get_order() - Get the order of an entry.
  * @xas: XArray operation state.
  *
  * Called after xas_load, the xas should not be in an error state.
@@ -1770,7 +1770,7 @@ int xas_get_order(struct xa_state *xas)

                 if (slot >= XA_CHUNK_SIZE)
                         break;
-                if (!xa_is_sibling(xas->xa_node->slots[slot]))
+                if (!xa_is_sibling(xa_entry(xas->xa, xas->xa_node, slot)))
                         break;
                 order++;
         }
From: Kairui Song <kasong@tencent.com>
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9JAY9
CVE: NA
Reference: https://lore.kernel.org/linux-mm/20240416071722.45997-1-ryncsn@gmail.com/#t
--------------------------------
Instead of doing multiple tree walks, do one optimistic range check with the lock held, and exit if we raced with another insertion. If a shadow entry exists, check it with the new xas_get_order helper before releasing the lock, to avoid a redundant tree walk for its order.
Drop the lock and do the allocation only if a split is needed.
In the best case, it only needs to walk the tree once. If it needs to allocate and split, three walks are issued (one for the first ranged conflict check and order retrieval, one for the second check after allocation, and one for the insert after the split).
Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine=mmap --rw=randread --time_based \
  --ramp_time=30s --runtime=5m --group_reporting
Before:
bw (  MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops        : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
After (+7.3%):
bw (  MiB/s): min=  493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops        : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
Test result with THP (do a THP randread then switch to 4K pages, in the hope that it issues a lot of splitting):
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine=mmap -thp=1 --readonly \
  --rw=randread --time_based --ramp_time=30s --runtime=10m \
  --group_reporting

fio -name=cached --numjobs=16 --filename=/mnt/test.img \
  --buffered=1 --ioengine=mmap \
  --rw=randread --time_based --runtime=5s --group_reporting
Before:
bw (  KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops        : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976
READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
After (+12.5%):
bw (  KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops        : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
Link: https://lkml.kernel.org/r/20240415171857.19244-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
        lib/test_xarray.c
        mm/filemap.c

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 lib/test_xarray.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/filemap.c      | 56 ++++++++++++++++++++++++++++++++------------
 2 files changed, 100 insertions(+), 15 deletions(-)
diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 2e229012920b..542926da61a3 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1789,6 +1789,64 @@ static noinline void check_xas_get_order(struct xarray *xa)
         }
 }

+static noinline void check_xas_conflict_get_order(struct xarray *xa)
+{
+        XA_STATE(xas, xa, 0);
+
+        void *entry;
+        int only_once;
+        unsigned int max_order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 20 : 1;
+        unsigned int order;
+        unsigned long i, j, k;
+
+        for (order = 0; order < max_order; order++) {
+                for (i = 0; i < 10; i++) {
+                        xas_set_order(&xas, i << order, order);
+                        do {
+                                xas_lock(&xas);
+                                xas_store(&xas, xa_mk_value(i));
+                                xas_unlock(&xas);
+                        } while (xas_nomem(&xas, GFP_KERNEL));
+
+                        /*
+                         * Ensure xas_get_order works with xas_for_each_conflict.
+                         */
+                        j = i << order;
+                        for (k = 0; k < order; k++) {
+                                only_once = 0;
+                                xas_set_order(&xas, j + (1 << k), k);
+                                xas_lock(&xas);
+                                xas_for_each_conflict(&xas, entry) {
+                                        XA_BUG_ON(xa, entry != xa_mk_value(i));
+                                        XA_BUG_ON(xa, xas_get_order(&xas) != order);
+                                        only_once++;
+                                }
+                                XA_BUG_ON(xa, only_once != 1);
+                                xas_unlock(&xas);
+                        }
+
+                        if (order < max_order - 1) {
+                                only_once = 0;
+                                xas_set_order(&xas, (i & ~1UL) << order, order + 1);
+                                xas_lock(&xas);
+                                xas_for_each_conflict(&xas, entry) {
+                                        XA_BUG_ON(xa, entry != xa_mk_value(i));
+                                        XA_BUG_ON(xa, xas_get_order(&xas) != order);
+                                        only_once++;
+                                }
+                                XA_BUG_ON(xa, only_once != 1);
+                                xas_unlock(&xas);
+                        }
+
+                        xas_set_order(&xas, i << order, order);
+                        xas_lock(&xas);
+                        xas_store(&xas, NULL);
+                        xas_unlock(&xas);
+                }
+        }
+}
+
+
 static noinline void check_destroy(struct xarray *xa)
 {
         unsigned long index;
@@ -1839,6 +1897,7 @@ static int xarray_checks(void)
         check_multi_store(&array);
         check_get_order(&array);
         check_xas_get_order(&array);
+        check_xas_conflict_get_order(&array);
         check_xa_alloc();
         check_find(&array);
         check_find_entry(&array);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1ab3d88eb2f8..bb873391a939 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -848,7 +848,9 @@ noinline int __filemap_add_folio(struct address_space *mapping,
                 struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
 {
         XA_STATE(xas, &mapping->i_pages, index);
-        int huge = folio_test_hugetlb(folio);
+        void *alloced_shadow = NULL;
+        int alloced_order = 0;
+        bool huge;
         long nr;

         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -857,6 +859,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,

         VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
         xas_set_order(&xas, index, folio_order(folio));
+        huge = folio_test_hugetlb(folio);
         nr = folio_nr_pages(folio);

         gfp &= GFP_RECLAIM_MASK;
@@ -864,16 +867,10 @@ noinline int __filemap_add_folio(struct address_space *mapping,
         folio->mapping = mapping;
         folio->index = xas.xa_index;

-        do {
-                unsigned int order = xa_get_order(xas.xa, xas.xa_index);
+        for (;;) {
+                int order = -1, split_order = 0;
                 void *entry, *old = NULL;

-                if (order > folio_order(folio)) {
-                        xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
-                                        order, gfp);
-                        if (xas_error(&xas))
-                                goto error;
-                }
                 xas_lock_irq(&xas);
                 xas_for_each_conflict(&xas, entry) {
                         old = entry;
@@ -881,19 +878,33 @@ noinline int __filemap_add_folio(struct address_space *mapping,
                                 xas_set_err(&xas, -EEXIST);
                                 goto unlock;
                         }
+                        /*
+                         * If a larger entry exists,
+                         * it will be the first and only entry iterated.
+                         */
+                        if (order == -1)
+                                order = xas_get_order(&xas);
+                }
+
+                /* entry may have changed before we re-acquire the lock */
+                if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
+                        xas_destroy(&xas);
+                        alloced_order = 0;
                 }

                 if (old) {
-                        if (shadowp)
-                                *shadowp = old;
-                        /* entry may have been split before we acquired lock */
-                        order = xa_get_order(xas.xa, xas.xa_index);
-                        if (order > folio_order(folio)) {
+                        if (order > 0 && order > folio_order(folio)) {
                                 /* How to handle large swap entries? */
                                 BUG_ON(shmem_mapping(mapping));
+                                if (!alloced_order) {
+                                        split_order = order;
+                                        goto unlock;
+                                }
                                 xas_split(&xas, old, order);
                                 xas_reset(&xas);
                         }
+                        if (shadowp)
+                                *shadowp = old;
                 }

                 xas_store(&xas, folio);
@@ -909,9 +920,24 @@ noinline int __filemap_add_folio(struct address_space *mapping,
                         __lruvec_stat_mod_folio(folio, NR_FILE_THPS, nr);
                 }
+
 unlock:
                 xas_unlock_irq(&xas);
-        } while (xas_nomem(&xas, gfp));
+
+                /* split needed, alloc here and retry. */
+                if (split_order) {
+                        xas_split_alloc(&xas, old, split_order, gfp);
+                        if (xas_error(&xas))
+                                goto error;
+                        alloced_shadow = old;
+                        alloced_order = split_order;
+                        xas_reset(&xas);
+                        continue;
+                }
+
+                if (!xas_nomem(&xas, gfp))
+                        break;
+        }

         if (xas_error(&xas))
                 goto error;
From: Kefeng Wang <wangkefeng.wang@huawei.com>
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9JAY9
CVE: NA
Reference: https://lore.kernel.org/linux-mm/20240418135644.2736748-1-wangkefeng.wang@hu...
--------------------------------
Skip blk_cgroup_congested() if there is no usable swap device, since no swapin/out will occur; thereby we avoid taking the swap_lock. The difference is shown below in the perf data of a CoW page fault:
perf report -g -i perf.data.swapoff | egrep "blk_cgroup_congested|__folio_throttle_swaprate"
   1.01%  0.16%  page_fault2_pro  [kernel.kallsyms]  [k] __folio_throttle_swaprate
   0.83%  0.80%  page_fault2_pro  [kernel.kallsyms]  [k] blk_cgroup_congested

perf report -g -i perf.data.swapon | egrep "blk_cgroup_congested|__folio_throttle_swaprate"
   0.15%  0.15%  page_fault2_pro  [kernel.kallsyms]  [k] __folio_throttle_swaprate
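For reference, the resulting check ordering in __folio_throttle_swaprate() (assembled from the hunk below; the lockless __has_usable_swap() test runs before the more expensive blk_cgroup_congested() call):

        if (!(gfp & __GFP_IO))
                return;

        /* No usable swap device: no swapin/out can occur, skip the cgroup check. */
        if (!__has_usable_swap())
                return;

        if (!blk_cgroup_congested())
                return;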
Link: https://lkml.kernel.org/r/20240418135644.2736748-1-wangkefeng.wang@huawei.co...
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 mm/swapfile.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 91f48df47952..3312f2422b36 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2465,13 +2465,17 @@ static void reinsert_swap_info(struct swap_info_struct *p)
         spin_unlock(&swap_lock);
 }

+static bool __has_usable_swap(void)
+{
+        return !plist_head_empty(&swap_active_head);
+}
+
 bool has_usable_swap(void)
 {
-        bool ret = true;
+        bool ret;

         spin_lock(&swap_lock);
-        if (plist_head_empty(&swap_active_head))
-                ret = false;
+        ret = __has_usable_swap();
         spin_unlock(&swap_lock);
         return ret;
 }
@@ -3728,6 +3732,9 @@ void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
         if (!(gfp & __GFP_IO))
                 return;

+        if (!__has_usable_swap())
+                return;
+
         if (!blk_cgroup_congested())
                 return;
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/6626 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/K...