Barry Song (4):
  mm: add per-order mTHP anon_fault_alloc and anon_fault_fallback counters
  mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters
  mm: add docs for per-order mTHP counters and transhuge_page ABI
  mm: correct the docs for thp_fault_alloc and thp_fault_fallback

Kefeng Wang (4):
  mm: filemap: make mTHP configurable for exec mapping
  mm: huge_memory: add folio_get_unmapped_area()
  mm: huge_memory: add thp mapping align control
  mm: add control to allow specified high-order pages stored on PCP list

Matthew Wilcox (Oracle) (3):
  mm: remove inc/dec lruvec page state functions
  mm/khugepaged: use a folio more in collapse_file()
  mm/memcontrol: remove __mod_lruvec_page_state()

Ryan Roberts (1):
  mm/filemap: Allow arch to request folio size for exec memory
 .../sysfs-kernel-mm-transparent-hugepage   |  18 ++
 Documentation/admin-guide/mm/transhuge.rst |  58 +++-
 arch/arm64/include/asm/pgtable.h           |  12 +
 include/linux/gfp.h                        |   1 +
 include/linux/huge_mm.h                    |  27 ++
 include/linux/pgtable.h                    |  12 +
 include/linux/vmstat.h                     |  60 ++---
 mm/filemap.c                               |  42 +++
 mm/huge_memory.c                           | 252 +++++++++++++++++-
 mm/khugepaged.c                            |  16 +-
 mm/memcontrol.c                            |   9 +-
 mm/memory.c                                |   5 +
 mm/page_alloc.c                            |  18 +-
 mm/page_io.c                               |   1 +
 mm/vmscan.c                                |   3 +
 15 files changed, 460 insertions(+), 74 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion
from mainline-v6.8-rc1
commit e435ca87882167dda78776ce4bd6eb2094eb864b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
-------------------------------------------------
Patch series "Remove some lruvec page accounting functions", v2.
Some functions are now unused; remove them. Make __mod_lruvec_page_state() unused and then remove it.
This patch (of 6):
All callers of these have been converted to their folio equivalents.
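For context, the conversion that left these wrappers unused follows a simple pattern; the sketch below is illustrative only (the stat item and call site are invented), using the folio helpers that already exist in vmstat.h:

	/* before: page-based wrapper (being removed) */
	__inc_lruvec_page_state(page, NR_FILE_PAGES);

	/* after: operate on the folio directly */
	struct folio *folio = page_folio(page);

	__lruvec_stat_add_folio(folio, NR_FILE_PAGES);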
Link: https://lkml.kernel.org/r/20231228085748.1083901-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231228085748.1083901-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: Hyeonggon Yoo 42.hyeyoo@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit e435ca87882167dda78776ce4bd6eb2094eb864b) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/vmstat.h | 24 ------------------------ 1 file changed, 24 deletions(-)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index fed855bae6d8..147ae73e0ee7 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -597,18 +597,6 @@ static inline void mod_lruvec_page_state(struct page *page,
#endif /* CONFIG_MEMCG */
-static inline void __inc_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - __mod_lruvec_page_state(page, idx, 1); -} - -static inline void __dec_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - __mod_lruvec_page_state(page, idx, -1); -} - static inline void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx, int val) { @@ -627,18 +615,6 @@ static inline void __lruvec_stat_sub_folio(struct folio *folio, __lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio)); }
-static inline void inc_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - mod_lruvec_page_state(page, idx, 1); -} - -static inline void dec_lruvec_page_state(struct page *page, - enum node_stat_item idx) -{ - mod_lruvec_page_state(page, idx, -1); -} - static inline void lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx, int val) {
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion
from mainline-v6.8-rc1
commit b54d60b18e850561e8bdb4264ae740676c3b7658
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
-------------------------------------------------
This function is not yet fully converted to the folio API, but this removes a few uses of old APIs.
Link: https://lkml.kernel.org/r/20231228085748.1083901-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Cc: Hyeonggon Yoo 42.hyeyoo@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit b54d60b18e850561e8bdb4264ae740676c3b7658) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/khugepaged.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 9a9f1ebe6dd7..7d329e9eeec8 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2147,23 +2147,23 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, xas_lock_irq(&xas); }
- nr = thp_nr_pages(hpage); + folio = page_folio(hpage); + nr = folio_nr_pages(folio); if (is_shmem) - __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr); + __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr); else - __mod_lruvec_page_state(hpage, NR_FILE_THPS, nr); + __lruvec_stat_mod_folio(folio, NR_FILE_THPS, nr);
if (nr_none) { - __mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none); + __lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr_none); /* nr_none is always 0 for non-shmem. */ - __mod_lruvec_page_state(hpage, NR_SHMEM, nr_none); + __lruvec_stat_mod_folio(folio, NR_SHMEM, nr_none); }
/* * Mark hpage as uptodate before inserting it into the page cache so * that it isn't mistaken for an fallocated but unwritten page. */ - folio = page_folio(hpage); folio_mark_uptodate(folio); folio_ref_add(folio, HPAGE_PMD_NR - 1);
@@ -2173,7 +2173,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
/* Join all the small entries into a single multi-index entry. */ xas_set_order(&xas, start, HPAGE_PMD_ORDER); - xas_store(&xas, hpage); + xas_store(&xas, folio); WARN_ON_ONCE(xas_error(&xas)); xas_unlock_irq(&xas);
@@ -2184,7 +2184,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr, retract_page_tables(mapping, start); if (cc && !cc->is_khugepaged) result = SCAN_PTE_MAPPED_HUGEPAGE; - unlock_page(hpage); + folio_unlock(folio);
/* * The collapse has succeeded, so free the old pages.
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion
from mainline-v6.8-rc1
commit c701123bd68bf1cc3bc167b4f597cb1f4995c39c
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
-------------------------------------------------
There are no more callers of __mod_lruvec_page_state(), so convert the implementation to __lruvec_stat_mod_folio(), removing two calls to compound_head() (one explicit, one hidden inside page_memcg()).
Link: https://lkml.kernel.org/r/20231228085748.1083901-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: Zi Yan ziy@nvidia.com Acked-by: Shakeel Butt shakeelb@google.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Cc: Hyeonggon Yoo 42.hyeyoo@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit c701123bd68bf1cc3bc167b4f597cb1f4995c39c) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/vmstat.h | 36 ++++++++++++++++++------------------ mm/memcontrol.c | 9 ++++----- 2 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 147ae73e0ee7..343906a98d6e 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -556,19 +556,25 @@ static inline void mod_lruvec_state(struct lruvec *lruvec, local_irq_restore(flags); }
-void __mod_lruvec_page_state(struct page *page, +void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx, int val);
-static inline void mod_lruvec_page_state(struct page *page, +static inline void lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx, int val) { unsigned long flags;
local_irq_save(flags); - __mod_lruvec_page_state(page, idx, val); + __lruvec_stat_mod_folio(folio, idx, val); local_irq_restore(flags); }
+static inline void mod_lruvec_page_state(struct page *page, + enum node_stat_item idx, int val) +{ + lruvec_stat_mod_folio(page_folio(page), idx, val); +} + #else
static inline void __mod_lruvec_state(struct lruvec *lruvec, @@ -583,10 +589,16 @@ static inline void mod_lruvec_state(struct lruvec *lruvec, mod_node_page_state(lruvec_pgdat(lruvec), idx, val); }
-static inline void __mod_lruvec_page_state(struct page *page, - enum node_stat_item idx, int val) +static inline void __lruvec_stat_mod_folio(struct folio *folio, + enum node_stat_item idx, int val) { - __mod_node_page_state(page_pgdat(page), idx, val); + __mod_node_page_state(folio_pgdat(folio), idx, val); +} + +static inline void lruvec_stat_mod_folio(struct folio *folio, + enum node_stat_item idx, int val) +{ + mod_node_page_state(folio_pgdat(folio), idx, val); }
static inline void mod_lruvec_page_state(struct page *page, @@ -597,12 +609,6 @@ static inline void mod_lruvec_page_state(struct page *page,
#endif /* CONFIG_MEMCG */
-static inline void __lruvec_stat_mod_folio(struct folio *folio, - enum node_stat_item idx, int val) -{ - __mod_lruvec_page_state(&folio->page, idx, val); -} - static inline void __lruvec_stat_add_folio(struct folio *folio, enum node_stat_item idx) { @@ -615,12 +621,6 @@ static inline void __lruvec_stat_sub_folio(struct folio *folio, __lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio)); }
-static inline void lruvec_stat_mod_folio(struct folio *folio, - enum node_stat_item idx, int val) -{ - mod_lruvec_page_state(&folio->page, idx, val); -} - static inline void lruvec_stat_add_folio(struct folio *folio, enum node_stat_item idx) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f1cf73835cba..ff64d2c36749 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -897,16 +897,15 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, __mod_memcg_lruvec_state(lruvec, idx, val); }
-void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, +void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx, int val) { - struct page *head = compound_head(page); /* rmap on tail pages */ struct mem_cgroup *memcg; - pg_data_t *pgdat = page_pgdat(page); + pg_data_t *pgdat = folio_pgdat(folio); struct lruvec *lruvec;
rcu_read_lock(); - memcg = page_memcg(head); + memcg = folio_memcg(folio); /* Untracked pages have no memcg, no lruvec. Update only the node */ if (!memcg) { rcu_read_unlock(); @@ -918,7 +917,7 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, __mod_lruvec_state(lruvec, idx, val); rcu_read_unlock(); } -EXPORT_SYMBOL(__mod_lruvec_page_state); +EXPORT_SYMBOL(__lruvec_stat_mod_folio);
void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) {
From: Ryan Roberts ryan.roberts@arm.com
maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
Reference: https://lore.kernel.org/linux-mm/Zc6zD4zfsrkRlHe6@casper.infradead.org
-------------------------------------------------
Change the readahead config so that if it is being requested for an executable mapping, do a synchronous read of an arch-specified size in a naturally aligned manner.
On arm64, if memory is physically contiguous and naturally aligned to the "contpte" size, we can use contpte mappings, which improves utilization of the TLB. When paired with the "multi-size THP" changes, this works well to reduce dTLB pressure. However, iTLB pressure is still high because executable mappings have a low likelihood of being in the required folio size and mapping alignment, even when the filesystem supports readahead into large folios (e.g. XFS).

The reason for the low likelihood is that the current readahead algorithm starts with an order-2 folio and increases the folio order by 2 every time the readahead mark is hit. But most executable memory is faulted in fairly randomly, so the readahead mark is rarely hit and most executable folios remain order-2. This has been observed empirically and confirmed in discussion with a GNU linker expert; in general, the linker does nothing to group temporally accessed text together spatially. Additionally, with the current read-around approach there are no alignment guarantees between the file and the folio. This is insufficient for arm64's contpte mapping requirement (order-4 for 4K base pages).

So it seems reasonable to special-case the read(ahead) logic for executable mappings. The trade-off is a performance improvement (due to more efficient storage of the translations in the iTLB) vs potential read amplification (due to reading too much data around the fault which won't be used), and the latter is independent of base page size. I've chosen a 64K folio size for arm64, which benefits both the 4K and 16K base page size configs and shouldn't lead to any further read amplification since the old read-around path was (usually) reading blocks of 128K (with the last 32K being async).
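For illustration, opting in is a one-line exercise per architecture: define arch_wants_exec_folio_order() in the arch's asm/pgtable.h (the generic fallback returns -1, which keeps the existing read-around behaviour). The snippet below is a hypothetical example for some other architecture, not part of this series; arm64's real definition is in the diff further down:

	/*
	 * Hypothetical opt-in for another arch: ask the page cache to read
	 * exec memory in naturally aligned 64K folios.
	 */
	#define arch_wants_exec_folio_order()	ilog2(SZ_64K >> PAGE_SHIFT)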
Performance Benchmarking ------------------------
The below shows kernel compilation and speedometer javascript benchmarks on Ampere Altra arm64 system. (The contpte patch series is applied in the baseline).
First, confirmation that this patch causes more memory to be contained in 64K folios (this is for all file-backed memory so includes non-executable too):
| File-backed folios      | Speedometer     | Kernel Compile  |
| by size as percentage   |-----------------|-----------------|
| of all mapped file mem  | before | after  | before | after  |
|=========================|========|========|========|========|
|file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
|file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
|file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
|file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
|file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
|file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
|file-thp-partial         |     0% |     0% |     0% |     0% |
|-------------------------|--------|--------|--------|--------|
|file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
The above shows that for both use cases, the amount of file memory backed by 16K folios reduces and the amount backed by 64K folios increases significantly. And the amount of memory that is contpte-mapped significantly increases (last line).
And this is reflected in performance improvement:
Kernel Compilation (smaller is faster):

| kernel   | real-time | kern-time | user-time | peak memory |
|----------|-----------|-----------|-----------|-------------|
| before   |      0.0% |      0.0% |      0.0% |        0.0% |
| after    |     -1.6% |     -2.1% |     -1.7% |        0.0% |

Speedometer (bigger is faster):

| kernel   | runs_per_min | peak memory |
|----------|--------------|-------------|
| before   |         0.0% |        0.0% |
| after    |         1.3% |        1.0% |
Both benchmarks show a ~1.5% improvement once the patch is applied.
Alternatives ------------
I considered (and rejected for now - but I anticipate this patch will stimulate discussion around what the best approach is) alternative approaches:
- Expose a global user-controlled knob to set the preferred folio size; this would move policy to user space and allow (e.g.) setting it to PMD-size for even better iTLB utilization. But this would add ABI, and I prefer to start with the simplest approach first. It also has the downside that a change wouldn't apply to memory already in the page cache that is in active use (e.g. libc), so we wouldn't get the same level of utilization as for something that is fixed from boot.
- Add a per-vma attribute to allow user space to specify preferred folio size for memory faulted from the range. (we've talked about such a control in the context of mTHP). The dynamic loader would then be responsible for adding the annotations. Again this feels like something that could be added later if value was demonstrated.
- Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size. This would still require dynamic linker involvement, but would additionally necessitate a copy, and all memory in the range would be synchronously faulted in, adding to application load time. It would work for filesystems that don't support large folios, though.
Signed-off-by: Ryan Roberts ryan.roberts@arm.com --- arch/arm64/include/asm/pgtable.h | 12 ++++++++++++ include/linux/pgtable.h | 12 ++++++++++++ mm/filemap.c | 19 +++++++++++++++++++ 3 files changed, 43 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 07948fe59b9d..8d68d00de0a4 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1147,6 +1147,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf, */ #define arch_wants_old_prefaulted_pte cpu_has_hw_af
+/* + * Request exec memory is read into pagecache in at least 64K folios. The + * trade-off here is performance improvement due to storing translations more + * effciently in the iTLB vs the potential for read amplification due to reading + * data from disk that won't be used. The latter is independent of base page + * size, so we set a page-size independent block size of 64K. This size can be + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry), + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in + * use. + */ +#define arch_wants_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT) + static inline bool pud_sect_supported(void) { return PAGE_SIZE == SZ_4K; diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index ecc561d49d5b..a0fafb8e7005 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -435,6 +435,18 @@ static inline bool arch_has_hw_pte_young(void) } #endif
+#ifndef arch_wants_exec_folio_order +/* + * Returns preferred minimum folio order for executable file-backed memory. Must + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no + * preference and mm will not special-case executable memory in the pagecache. + */ +static inline int arch_wants_exec_folio_order(void) +{ + return -1; +} +#endif + #ifndef arch_check_zapped_pte static inline void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte) diff --git a/mm/filemap.c b/mm/filemap.c index a274d2c5e232..b7881f1db472 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3197,6 +3197,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) } #endif
+ /* + * Allow arch to request a preferred minimum folio order for executable + * memory. This can often be beneficial to performance if (e.g.) arm64 + * can contpte-map the folio. Executable memory rarely benefits from + * read-ahead anyway, due to its random access nature. + */ + if (vm_flags & VM_EXEC) { + int order = arch_wants_exec_folio_order(); + + if (order >= 0) { + fpin = maybe_unlock_mmap_for_io(vmf, fpin); + ra->size = 1UL << order; + ra->async_size = 0; + ractl._index &= ~((unsigned long)ra->size - 1); + page_cache_ra_order(&ractl, ra, order); + return fpin; + } + } + /* If we don't want any read-ahead, don't bother */ if (vm_flags & VM_RAND_READ) return fpin;
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
-------------------------------------------------
Let's only enable large folio orders for exec mappings when the mapping has large folio support, and add a new sysfs interface to make this configurable.
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 8 +-- include/linux/huge_mm.h | 1 + mm/filemap.c | 27 +++++++++- mm/huge_memory.c | 62 +++++++++++++++++----- 4 files changed, 79 insertions(+), 19 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 936da10c5260..22f0e0009371 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -203,11 +203,13 @@ PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
The kernel tries to use huge, PMD-mappable page on read page fault for -file exec mapping if CONFIG_READ_ONLY_THP_FOR_FS enabled. It's possible -to enabled the feature by writing 1 or disablt by writing 0:: +if CONFIG_READ_ONLY_THP_FOR_FS enabled, or try non-PMD size page(eg, +64K arm64) for file exec mapping, BIT0 for PMD THP, BIT1 for mTHP. It's +possible to enable/disable it by configurate the corresponding bit::
- echo 0x0 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled echo 0x1 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled + echo 0x2 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled + echo 0x3 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled
khugepaged will be automatically started when one or more hugepage sizes are enabled (either by directly setting "always" or "madvise", diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index abf2340a2d18..896d7870ecd9 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -51,6 +51,7 @@ enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG, TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG, TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG, + TRANSPARENT_HUGEPAGE_FILE_EXEC_MTHP_FLAG, };
struct kobject; diff --git a/mm/filemap.c b/mm/filemap.c index b7881f1db472..d3c813429bf2 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -46,6 +46,7 @@ #include <linux/pipe_fs_i.h> #include <linux/splice.h> #include <linux/huge_mm.h> +#include <linux/pgtable.h> #include <asm/pgalloc.h> #include <asm/tlbflush.h> #include "internal.h" @@ -3141,6 +3142,10 @@ static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio, (transparent_hugepage_flags & \ (1<<TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG))
+#define file_exec_mthp_enabled() \ + (transparent_hugepage_flags & \ + (1<<TRANSPARENT_HUGEPAGE_FILE_EXEC_MTHP_FLAG)) + static inline void try_enable_file_exec_thp(struct vm_area_struct *vma, unsigned long *vm_flags, struct file *file) @@ -3157,6 +3162,24 @@ static inline void try_enable_file_exec_thp(struct vm_area_struct *vma, if (file_exec_thp_enabled()) hugepage_madvise(vma, vm_flags, MADV_HUGEPAGE); } + +static inline bool file_exec_can_enable_mthp(struct address_space *mapping, + unsigned long vm_flags) +{ +#ifndef arch_wants_exec_folio_order + return false; +#endif + if (!is_exec_mapping(vm_flags)) + return false; + + if (!mapping_large_folio_support(mapping)) + return false; + + if (!file_exec_mthp_enabled()) + return false; + + return true; +} #endif
/* @@ -3195,7 +3218,6 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER); return fpin; } -#endif
/* * Allow arch to request a preferred minimum folio order for executable @@ -3203,7 +3225,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) * can contpte-map the folio. Executable memory rarely benefits from * read-ahead anyway, due to its random access nature. */ - if (vm_flags & VM_EXEC) { + if (file_exec_can_enable_mthp(mapping, vm_flags)) { int order = arch_wants_exec_folio_order();
if (order >= 0) { @@ -3215,6 +3237,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) return fpin; } } +#endif
/* If we don't want any read-ahead, don't bother */ if (vm_flags & VM_RAND_READ) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0c61e7c7c2c1..4301e0fb6f3f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -426,31 +426,67 @@ static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size);
#ifdef CONFIG_READ_ONLY_THP_FOR_FS +#define FILE_EXEC_THP_ENABLE BIT(0) +#else +#define FILE_EXEC_THP_ENABLE 0 +#endif + +#define FILE_EXEC_MTHP_ENABLE BIT(1) +#define FILE_EXEC_THP_ALL (FILE_EXEC_THP_ENABLE | FILE_EXEC_MTHP_ENABLE) + +static void thp_exec_enabled_set(enum transparent_hugepage_flag flag, + bool enable) +{ + if (enable) + set_bit(flag, &transparent_hugepage_flags); + else + clear_bit(flag, &transparent_hugepage_flags); +} + static ssize_t thp_exec_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return single_hugepage_flag_show(kobj, attr, buf, - TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG); + unsigned long val = 0; + +#ifdef CONFIG_READ_ONLY_THP_FOR_FS + if (test_bit(TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG, + &transparent_hugepage_flags)) + val |= FILE_EXEC_THP_ENABLE; +#endif + + if (test_bit(TRANSPARENT_HUGEPAGE_FILE_EXEC_MTHP_FLAG, + &transparent_hugepage_flags)) + val |= FILE_EXEC_MTHP_ENABLE; + + return sysfs_emit(buf, "0x%lx\n", val); } static ssize_t thp_exec_enabled_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { - size_t ret = single_hugepage_flag_store(kobj, attr, buf, count, - TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG); - if (ret > 0) { - int err = start_stop_khugepaged(); + unsigned long val; + int ret;
- if (err) - ret = err; - } + ret = kstrtoul(buf, 16, &val); + if (ret < 0) + return ret; + if (val & ~FILE_EXEC_THP_ALL) + return -EINVAL;
- return ret; +#ifdef CONFIG_READ_ONLY_THP_FOR_FS + thp_exec_enabled_set(TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG, + val & FILE_EXEC_THP_ENABLE); + ret = start_stop_khugepaged(); + if (ret) + return ret; +#endif + thp_exec_enabled_set(TRANSPARENT_HUGEPAGE_FILE_EXEC_MTHP_FLAG, + val & FILE_EXEC_MTHP_ENABLE); + + return count; } static struct kobj_attribute thp_exec_enabled_attr = __ATTR_RW(thp_exec_enabled);
-#endif - static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, @@ -459,9 +495,7 @@ static struct attribute *hugepage_attr[] = { #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif -#ifdef CONFIG_READ_ONLY_THP_FOR_FS &thp_exec_enabled_attr.attr, -#endif NULL, };
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
-------------------------------------------------
mmap() already supports aligning the mapped address to the PMD size. Add folio_get_unmapped_area() to align to other page sizes as well, e.g. 64K on arm64, which increases the probability of getting large folios of that specific order (that is, try our best to allocate 64K folios) and may further reduce TLB misses. This is experimental.
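As a rough sketch of the alignment math (assuming an anonymous mapping with pgoff == 0; the real __thp_get_unmapped_area() also folds in the file offset so file-backed mappings stay congruent with the file; filp, len, order and flags come from folio_get_unmapped_area()'s arguments in the diff below):

	unsigned long size = PAGE_SIZE << order;	/* e.g. 64K for order 4 on 4K pages */
	unsigned long len_pad = len + size;		/* over-allocate so we can round up */
	unsigned long addr;

	addr = current->mm->get_unmapped_area(filp, 0, len_pad, 0, flags);
	addr += (-addr) & (size - 1);			/* round up to a size-aligned start */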
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/huge_memory.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4301e0fb6f3f..3e6ec8a60765 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -887,6 +887,52 @@ static unsigned long __thp_get_unmapped_area(struct file *filp, return ret; }
+static bool file_mapping_align_enabled(struct file *filp) +{ + struct address_space *mapping; + + if (!filp) + return false; + + mapping = filp->f_mapping; + if (!mapping || !mapping_large_folio_support(mapping)) + return false; + + return true; +} + +static bool anon_mapping_align_enabled(int order) +{ + unsigned long mask; + + mask = READ_ONCE(huge_anon_orders_always) | + READ_ONCE(huge_anon_orders_madvise); + + if (hugepage_global_enabled()) + mask |= READ_ONCE(huge_anon_orders_inherit); + + mask = BIT(order) & mask; + if (!mask) + return false; + + return true; +} + +static unsigned long folio_get_unmapped_area(struct file *filp, unsigned long addr, + unsigned long len, unsigned long pgoff, unsigned long flags) +{ + int order = arch_wants_exec_folio_order(); + + if (order < 0) + return 0; + + if (file_mapping_align_enabled(filp) || + (!filp && anon_mapping_align_enabled(order))) + return __thp_get_unmapped_area(filp, addr, len, pgoff, flags, + PAGE_SIZE << order); + return 0; +} + unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { @@ -897,6 +943,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, if (ret) return ret;
+ ret = folio_get_unmapped_area(filp, addr, len, off, flags); + if (ret) + return ret; + return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags); } EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
-------------------------------------------------
Since folio_get_unmapped_area() is experimental, add a sysfs interface to enable/disable it. It is disabled by default and can be enabled via /sys/kernel/mm/transparent_hugepage/thp_mapping_align.
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 9 +++ include/linux/huge_mm.h | 2 + mm/huge_memory.c | 67 ++++++++++++++++++++-- 3 files changed, 72 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 22f0e0009371..e52cd57bb512 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -211,6 +211,15 @@ possible to enable/disable it by configurate the corresponding bit:: echo 0x2 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled echo 0x3 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled
+The kernel could try to enable other larger size mappings align other +than THP size, eg, 64K on arm64, BIT0 for file mapping, BIT1 for anon +mapping, it is disabled by default, and could enable this feature by +writing the corresponding bit to 1:: + + echo 0x1 >/sys/kernel/mm/transparent_hugepage/thp_mapping_align + echo 0x2 >/sys/kernel/mm/transparent_hugepage/thp_mapping_align + echo 0x3 >/sys/kernel/mm/transparent_hugepage/thp_mapping_align + khugepaged will be automatically started when one or more hugepage sizes are enabled (either by directly setting "always" or "madvise", or by setting "inherit" while the top-level enabled is set to "always" diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 896d7870ecd9..8fdf17e80359 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -52,6 +52,8 @@ enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG, TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG, TRANSPARENT_HUGEPAGE_FILE_EXEC_MTHP_FLAG, + TRANSPARENT_HUGEPAGE_FILE_MAPPING_ALIGN_FLAG, + TRANSPARENT_HUGEPAGE_ANON_MAPPING_ALIGN_FLAG, };
struct kobject; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3e6ec8a60765..9f1100dfee66 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -434,8 +434,7 @@ static struct kobj_attribute hpage_pmd_size_attr = #define FILE_EXEC_MTHP_ENABLE BIT(1) #define FILE_EXEC_THP_ALL (FILE_EXEC_THP_ENABLE | FILE_EXEC_MTHP_ENABLE)
-static void thp_exec_enabled_set(enum transparent_hugepage_flag flag, - bool enable) +static void thp_flag_set(enum transparent_hugepage_flag flag, bool enable) { if (enable) set_bit(flag, &transparent_hugepage_flags); @@ -473,20 +472,61 @@ static ssize_t thp_exec_enabled_store(struct kobject *kobj, return -EINVAL;
#ifdef CONFIG_READ_ONLY_THP_FOR_FS - thp_exec_enabled_set(TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG, - val & FILE_EXEC_THP_ENABLE); + thp_flag_set(TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG, + val & FILE_EXEC_THP_ENABLE); ret = start_stop_khugepaged(); if (ret) return ret; #endif - thp_exec_enabled_set(TRANSPARENT_HUGEPAGE_FILE_EXEC_MTHP_FLAG, - val & FILE_EXEC_MTHP_ENABLE); + thp_flag_set(TRANSPARENT_HUGEPAGE_FILE_EXEC_MTHP_FLAG, + val & FILE_EXEC_MTHP_ENABLE);
return count; } static struct kobj_attribute thp_exec_enabled_attr = __ATTR_RW(thp_exec_enabled);
+#define FILE_MAPPING_ALIGN BIT(0) +#define ANON_MAPPING_ALIGN BIT(1) +#define THP_MAPPING_ALIGN_ALL (FILE_MAPPING_ALIGN | ANON_MAPPING_ALIGN) + +static ssize_t thp_mapping_align_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + unsigned long val = 0; + + if (test_bit(TRANSPARENT_HUGEPAGE_FILE_MAPPING_ALIGN_FLAG, + &transparent_hugepage_flags)) + val |= FILE_MAPPING_ALIGN; + + if (test_bit(TRANSPARENT_HUGEPAGE_ANON_MAPPING_ALIGN_FLAG, + &transparent_hugepage_flags)) + val |= ANON_MAPPING_ALIGN; + + return sysfs_emit(buf, "0x%lx\n", val); +} +static ssize_t thp_mapping_align_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + unsigned long val; + int ret; + + ret = kstrtoul(buf, 16, &val); + if (ret < 0) + return ret; + if (val & ~THP_MAPPING_ALIGN_ALL) + return -EINVAL; + + thp_flag_set(TRANSPARENT_HUGEPAGE_FILE_MAPPING_ALIGN_FLAG, + val & FILE_MAPPING_ALIGN); + thp_flag_set(TRANSPARENT_HUGEPAGE_ANON_MAPPING_ALIGN_FLAG, + val & ANON_MAPPING_ALIGN); + + return count; +} +static struct kobj_attribute thp_mapping_align_attr = + __ATTR_RW(thp_mapping_align); + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, @@ -496,6 +536,7 @@ static struct attribute *hugepage_attr[] = { &shmem_enabled_attr.attr, #endif &thp_exec_enabled_attr.attr, + &thp_mapping_align_attr.attr, NULL, };
@@ -887,10 +928,21 @@ static unsigned long __thp_get_unmapped_area(struct file *filp, return ret; }
+#define thp_file_mapping_align_enabled() \ + (transparent_hugepage_flags & \ + (1<<TRANSPARENT_HUGEPAGE_FILE_MAPPING_ALIGN_FLAG)) + +#define thp_anon_mapping_align_enabled() \ + (transparent_hugepage_flags & \ + (1<<TRANSPARENT_HUGEPAGE_ANON_MAPPING_ALIGN_FLAG)) + static bool file_mapping_align_enabled(struct file *filp) { struct address_space *mapping;
+ if (!thp_file_mapping_align_enabled()) + return false; + if (!filp) return false;
@@ -905,6 +957,9 @@ static bool anon_mapping_align_enabled(int order) { unsigned long mask;
+ if (!thp_anon_mapping_align_enabled()) + return 0; + mask = READ_ONCE(huge_anon_orders_always) | READ_ONCE(huge_anon_orders_madvise);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
-------------------------------------------------
Storing high-order pages on the PCP list is not always a win, so it is disabled by default for high orders other than PMD_ORDER.

Add a new control, pcp_allow_high_order, to let the user enable/disable storing pages of the specified high order (only order 4 for now) on the PCP list. Note that all pages on the pcplists are drained when the feature is disabled.
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 9 +++++++ include/linux/gfp.h | 1 + include/linux/huge_mm.h | 1 + mm/huge_memory.c | 31 ++++++++++++++++++++++ mm/page_alloc.c | 18 ++++++++++++- 5 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index e52cd57bb512..0dc0b4dab621 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -220,6 +220,15 @@ writing the corresponding bit to 1:: echo 0x2 >/sys/kernel/mm/transparent_hugepage/thp_mapping_align echo 0x3 >/sys/kernel/mm/transparent_hugepage/thp_mapping_align
+The kernel could enable high-orders(greated than PAGE_ALLOC_COSTLY_ORDER, only +support order 4 for now) be stored on PCP lists(except PMD order), which could +reduce the zone lock contention when allocate hige-order pages frequently. It +is possible to enable order 4 pages stored on PCP lists by writing 4 or disable +it back by writing 0:: + + echo 0 >/sys/kernel/mm/transparent_hugepage/pcp_allow_high_order + echo 4 >/sys/kernel/mm/transparent_hugepage/pcp_allow_high_order + khugepaged will be automatically started when one or more hugepage sizes are enabled (either by directly setting "always" or "madvise", or by setting "inherit" while the top-level enabled is set to "always" diff --git a/include/linux/gfp.h b/include/linux/gfp.h index b18b7e3758be..b2d4f45a866b 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -335,6 +335,7 @@ extern void page_frag_free(void *addr);
void page_alloc_init_cpuhp(void); int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp); +void drain_all_zone_pages(void); void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); void drain_all_pages(struct zone *zone); void drain_local_pages(struct zone *zone); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 8fdf17e80359..056f6918eeed 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -104,6 +104,7 @@ extern unsigned long transparent_hugepage_flags; extern unsigned long huge_anon_orders_always; extern unsigned long huge_anon_orders_madvise; extern unsigned long huge_anon_orders_inherit; +extern unsigned long huge_pcp_allow_orders;
static inline bool hugepage_global_enabled(void) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9f1100dfee66..508155fe9830 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -74,6 +74,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL; unsigned long huge_anon_orders_always __read_mostly; unsigned long huge_anon_orders_madvise __read_mostly; unsigned long huge_anon_orders_inherit __read_mostly; +unsigned long huge_pcp_allow_orders __read_mostly;
unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, unsigned long vm_flags, bool smaps, @@ -417,6 +418,35 @@ static ssize_t use_zero_page_store(struct kobject *kobj, } static struct kobj_attribute use_zero_page_attr = __ATTR_RW(use_zero_page);
+static ssize_t pcp_allow_high_order_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%lu\n", READ_ONCE(huge_pcp_allow_orders)); +} +static ssize_t pcp_allow_high_order_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + unsigned long value; + int ret; + + ret = kstrtoul(buf, 10, &value); + if (ret < 0) + return ret; + + /* Only enable order 4 now, 0 is to disable it */ + if (value != 0 && value != (PAGE_ALLOC_COSTLY_ORDER + 1)) + return -EINVAL; + + if (value == 0) + drain_all_zone_pages(); + + WRITE_ONCE(huge_pcp_allow_orders, value); + + return count; +} +static struct kobj_attribute pcp_allow_high_order_attr = + __ATTR_RW(pcp_allow_high_order); + static ssize_t hpage_pmd_size_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -531,6 +561,7 @@ static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, &use_zero_page_attr.attr, + &pcp_allow_high_order_attr.attr, &hpage_pmd_size_attr.attr, #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4652dc453964..f225f412e71d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -528,7 +528,7 @@ static void bad_page(struct page *page, const char *reason) static inline unsigned int order_to_pindex(int migratetype, int order) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (order > PAGE_ALLOC_COSTLY_ORDER) { + if (order > PAGE_ALLOC_COSTLY_ORDER + 1) { VM_BUG_ON(order != HPAGE_PMD_ORDER); return NR_LOWORDER_PCP_LISTS; } @@ -560,6 +560,8 @@ static inline bool pcp_allowed_order(unsigned int order) #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (order == HPAGE_PMD_ORDER) return true; + if (order == READ_ONCE(huge_pcp_allow_orders)) + return true; #endif return false; } @@ -6829,6 +6831,20 @@ void zone_pcp_reset(struct zone *zone) } }
+void drain_all_zone_pages(void) +{ + struct zone *zone; + + mutex_lock(&pcp_batch_high_lock); + for_each_populated_zone(zone) + __zone_set_pageset_high_and_batch(zone, 0, 0, 1); + __drain_all_pages(NULL, true); + for_each_populated_zone(zone) + __zone_set_pageset_high_and_batch(zone, zone->pageset_high_min, + zone->pageset_high_max, zone->pageset_batch); + mutex_unlock(&pcp_batch_high_lock); +} + #ifdef CONFIG_MEMORY_HOTREMOVE /* * All pages in the range must be in a single zone, must not contain holes,
From: Barry Song v-songbaohua@oppo.com
next inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: add per-order mTHP alloc and swpout counters", v6.
The patchset introduces a framework to facilitate mTHP counters, starting with the allocation and swap-out counters. Currently, only five new nodes are appended to the stats directory for each mTHP size.
/sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats
	anon_fault_alloc
	anon_fault_fallback
	anon_fault_fallback_charge
	anon_swpout
	anon_swpout_fallback
These nodes are crucial for us to monitor the fragmentation levels of both the buddy system and the swap partitions. In the future, we may consider adding additional nodes for further insights.
This patch (of 4):
Profiling a system blindly with mTHP has become challenging due to the lack of visibility into its operations. Presenting the success rate of mTHP allocations appears to be a pressing need.
Recently, I've been experiencing significant difficulty debugging performance improvements and regressions without these figures. It's crucial for us to understand the true effectiveness of mTHP in real-world scenarios, especially in systems with fragmented memory.
This patch establishes the framework for per-order mTHP counters. It begins by introducing the anon_fault_alloc and anon_fault_fallback counters. Additionally, to maintain consistency with thp_fault_fallback_charge in /proc/vmstat, this patch also tracks anon_fault_fallback_charge when mem_cgroup_charge fails for mTHP. Incorporating additional counters should now be straightforward as well.
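With the framework in place, wiring up an additional counter is a three-step exercise; a hypothetical example (an invented swap-in counter, not part of this series) would look roughly like:

	/* 1) add an item to enum mthp_stat_item in huge_mm.h */
	MTHP_STAT_ANON_SWPIN,		/* hypothetical new item */

	/* 2) expose it per size in huge_memory.c and list it in stats_attrs[] */
	DEFINE_MTHP_STAT_ATTR(anon_swpin, MTHP_STAT_ANON_SWPIN);

	/* 3) bump it from the relevant code path */
	count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN);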
Link: https://lkml.kernel.org/r/20240412114858.407208-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-2-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Acked-by: David Hildenbrand david@redhat.com Cc: Chris Li chrisl@kernel.org Cc: Domenico Cerasuolo cerasuolodomenico@gmail.com Cc: Kairui Song kasong@tencent.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Suren Baghdasaryan surenb@google.com Cc: Yosry Ahmed yosryahmed@google.com Cc: Yu Zhao yuzhao@google.com Cc: Jonathan Corbet corbet@lwn.net Signed-off-by: Andrew Morton akpm@linux-foundation.org --- include/linux/huge_mm.h | 21 +++++++++++++++++ mm/huge_memory.c | 52 +++++++++++++++++++++++++++++++++++++++++ mm/memory.c | 5 ++++ 3 files changed, 78 insertions(+)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 056f6918eeed..0e87d2ebb541 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -261,6 +261,27 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, enforce_sysfs, orders); }
+enum mthp_stat_item { + MTHP_STAT_ANON_FAULT_ALLOC, + MTHP_STAT_ANON_FAULT_FALLBACK, + MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, + __MTHP_STAT_COUNT +}; + +struct mthp_stat { + unsigned long stats[ilog2(MAX_PTRS_PER_PTE) + 1][__MTHP_STAT_COUNT]; +}; + +DECLARE_PER_CPU(struct mthp_stat, mthp_stats); + +static inline void count_mthp_stat(int order, enum mthp_stat_item item) +{ + if (order <= 0 || order > PMD_ORDER) + return; + + this_cpu_inc(mthp_stats.stats[order][item]); +} + #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 508155fe9830..d3980c66b0fc 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -660,6 +660,48 @@ static const struct kobj_type thpsize_ktype = { .sysfs_ops = &kobj_sysfs_ops, };
+DEFINE_PER_CPU(struct mthp_stat, mthp_stats) = {{{0}}}; + +static unsigned long sum_mthp_stat(int order, enum mthp_stat_item item) +{ + unsigned long sum = 0; + int cpu; + + for_each_possible_cpu(cpu) { + struct mthp_stat *this = &per_cpu(mthp_stats, cpu); + + sum += this->stats[order][item]; + } + + return sum; +} + +#define DEFINE_MTHP_STAT_ATTR(_name, _index) \ +static ssize_t _name##_show(struct kobject *kobj, \ + struct kobj_attribute *attr, char *buf) \ +{ \ + int order = to_thpsize(kobj)->order; \ + \ + return sysfs_emit(buf, "%lu\n", sum_mthp_stat(order, _index)); \ +} \ +static struct kobj_attribute _name##_attr = __ATTR_RO(_name) + +DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC); +DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); +DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); + +static struct attribute *stats_attrs[] = { + &anon_fault_alloc_attr.attr, + &anon_fault_fallback_attr.attr, + &anon_fault_fallback_charge_attr.attr, + NULL, +}; + +static struct attribute_group stats_attr_group = { + .name = "stats", + .attrs = stats_attrs, +}; + static struct thpsize *thpsize_create(int order, struct kobject *parent) { unsigned long size = (PAGE_SIZE << order) / SZ_1K; @@ -683,6 +725,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent) return ERR_PTR(ret); }
+ ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group); + if (ret) { + kobject_put(&thpsize->kobj); + return ERR_PTR(ret); + } + thpsize->order = order; return thpsize; } @@ -1052,6 +1100,8 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, folio_put(folio); count_vm_event(THP_FAULT_FALLBACK); count_vm_event(THP_FAULT_FALLBACK_CHARGE); + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK); + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); return VM_FAULT_FALLBACK; } folio_throttle_swaprate(folio, gfp); @@ -1102,6 +1152,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, mm_inc_nr_ptes(vma->vm_mm); spin_unlock(vmf->ptl); count_vm_event(THP_FAULT_ALLOC); + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC); count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC); }
@@ -1222,6 +1273,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true); if (unlikely(!folio)) { count_vm_event(THP_FAULT_FALLBACK); + count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_FALLBACK); return VM_FAULT_FALLBACK; } return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp); diff --git a/mm/memory.c b/mm/memory.c index 64e1fd144d93..4ef917a182f9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4349,6 +4349,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) folio = vma_alloc_folio(gfp, order, vma, addr, true); if (folio) { if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) { + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); folio_put(folio); goto next; } @@ -4357,6 +4358,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) return folio; } next: + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK); order = next_order(&orders, order); }
@@ -4465,6 +4467,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
folio_ref_add(folio, nr_pages - 1); add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC); +#endif add_reliable_folio_counter(folio, vma->vm_mm, nr_pages); folio_add_new_anon_rmap(folio, vma, addr); folio_add_lru_vma(folio, vma);
From: Barry Song v-songbaohua@oppo.com
next inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This helps to show how fragmented the swapfile is, i.e. what proportion of large folios we were able to swap out without splitting. So far, we only support non-split swapout for anon memory, with the possibility of expanding to shmem in the future. So, we add the "anon" prefix to the counter names.
Link: https://lkml.kernel.org/r/20240412114858.407208-3-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Acked-by: David Hildenbrand david@redhat.com Cc: Chris Li chrisl@kernel.org Cc: Domenico Cerasuolo cerasuolodomenico@gmail.com Cc: Kairui Song kasong@tencent.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Suren Baghdasaryan surenb@google.com Cc: Yosry Ahmed yosryahmed@google.com Cc: Yu Zhao yuzhao@google.com Cc: Jonathan Corbet corbet@lwn.net Signed-off-by: Andrew Morton akpm@linux-foundation.org --- include/linux/huge_mm.h | 2 ++ mm/huge_memory.c | 4 ++++ mm/page_io.c | 1 + mm/vmscan.c | 3 +++ 4 files changed, 10 insertions(+)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0e87d2ebb541..3e5f7064e2de 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -265,6 +265,8 @@ enum mthp_stat_item { MTHP_STAT_ANON_FAULT_ALLOC, MTHP_STAT_ANON_FAULT_FALLBACK, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, + MTHP_STAT_ANON_SWPOUT, + MTHP_STAT_ANON_SWPOUT_FALLBACK, __MTHP_STAT_COUNT };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d3980c66b0fc..763bb25e4f99 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -689,11 +689,15 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name) DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); +DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT); +DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
static struct attribute *stats_attrs[] = { &anon_fault_alloc_attr.attr, &anon_fault_fallback_attr.attr, &anon_fault_fallback_charge_attr.attr, + &anon_swpout_attr.attr, + &anon_swpout_fallback_attr.attr, NULL, };
diff --git a/mm/page_io.c b/mm/page_io.c index ea8d57b9b3ae..80e49e536d37 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -212,6 +212,7 @@ static inline void count_swpout_vm_event(struct folio *folio) count_memcg_folio_events(folio, THP_SWPOUT, 1); count_vm_event(THP_SWPOUT); } + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPOUT); #endif count_vm_events(PSWPOUT, folio_nr_pages(folio)); } diff --git a/mm/vmscan.c b/mm/vmscan.c index d8f2b571562c..34614bb7062d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1911,6 +1911,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, goto activate_locked; } if (!add_to_swap(folio)) { + int __maybe_unused order = folio_order(folio); + if (!folio_test_large(folio)) goto activate_locked_split; /* Fallback to swap normal pages */ @@ -1922,6 +1924,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, THP_SWPOUT_FALLBACK, 1); count_vm_event(THP_SWPOUT_FALLBACK); } + count_mthp_stat(order, MTHP_STAT_ANON_SWPOUT_FALLBACK); #endif if (!add_to_swap(folio)) goto activate_locked_split;
From: Barry Song v-songbaohua@oppo.com
next inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This patch includes documentation for the mTHP counters and an ABI file for sysfs-kernel-mm-transparent-hugepage, which appears to have been missing for some time.
[v-songbaohua@oppo.com: fix the name and unexpected indentation] Link: https://lkml.kernel.org/r/20240415054538.17071-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240412114858.407208-4-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: David Hildenbrand david@redhat.com Cc: Chris Li chrisl@kernel.org Cc: Domenico Cerasuolo cerasuolodomenico@gmail.com Cc: Kairui Song kasong@tencent.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Suren Baghdasaryan surenb@google.com Cc: Yosry Ahmed yosryahmed@google.com Cc: Yu Zhao yuzhao@google.com Cc: Jonathan Corbet corbet@lwn.net Signed-off-by: Andrew Morton akpm@linux-foundation.org --- .../sysfs-kernel-mm-transparent-hugepage | 18 ++++++++++++ Documentation/admin-guide/mm/transhuge.rst | 28 +++++++++++++++++++ 2 files changed, 46 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage b/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage new file mode 100644 index 000000000000..7bfbb9cc2c11 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage @@ -0,0 +1,18 @@ +What: /sys/kernel/mm/transparent_hugepage/ +Date: April 2024 +Contact: Linux memory management mailing list linux-mm@kvack.org +Description: + /sys/kernel/mm/transparent_hugepage/ contains a number of files and + subdirectories, + + - defrag + - enabled + - hpage_pmd_size + - khugepaged + - shmem_enabled + - use_zero_page + - subdirectories of the form hugepages-<size>kB, where <size> + is the page size of the hugepages supported by the kernel/CPU + combination. + + See Documentation/admin-guide/mm/transhuge.rst for details. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 0dc0b4dab621..dbbb4709bc78 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -474,6 +474,34 @@ thp_swpout_fallback Usually because failed to allocate some continuous swap space for the huge page.
+In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, There are +also individual counters for each huge page size, which can be utilized to +monitor the system's effectiveness in providing huge pages for usage. Each +counter has its own corresponding file. + +anon_fault_alloc + is incremented every time a huge page is successfully + allocated and charged to handle a page fault. + +anon_fault_fallback + is incremented if a page fault fails to allocate or charge + a huge page and instead falls back to using huge pages with + lower orders or small pages. + +anon_fault_fallback_charge + is incremented if a page fault fails to charge a huge page and + instead falls back to using huge pages with lower orders or + small pages even though the allocation was successful. + +anon_swpout + is incremented every time a huge page is swapped out in one + piece without splitting. + +anon_swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help
From: Barry Song v-songbaohua@oppo.com
next inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9Q9DF
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The documentation does not align with the code. In __do_huge_pmd_anonymous_page(), THP_FAULT_FALLBACK is incremented when mem_cgroup_charge() fails, despite the allocation succeeding, whereas THP_FAULT_ALLOC is only incremented after a successful charge.
Link: https://lkml.kernel.org/r/20240412114858.407208-5-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: David Hildenbrand david@redhat.com Cc: Chris Li chrisl@kernel.org Cc: Domenico Cerasuolo cerasuolodomenico@gmail.com Cc: Kairui Song kasong@tencent.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Suren Baghdasaryan surenb@google.com Cc: Yosry Ahmed yosryahmed@google.com Cc: Yu Zhao yuzhao@google.com Cc: Jonathan Corbet corbet@lwn.net Signed-off-by: Andrew Morton akpm@linux-foundation.org --- Documentation/admin-guide/mm/transhuge.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index dbbb4709bc78..54046f487f15 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -396,7 +396,7 @@ monitor how successfully the system is providing huge pages for use.
thp_fault_alloc is incremented every time a huge page is successfully - allocated to handle a page fault. + allocated and charged to handle a page fault.
thp_collapse_alloc is incremented by khugepaged when it has found @@ -404,7 +404,7 @@ thp_collapse_alloc successfully allocated a new huge page to store the data.
thp_fault_fallback - is incremented if a page fault fails to allocate + is incremented if a page fault fails to allocate or charge a huge page and instead falls back to using small pages.
thp_fault_fallback_charge
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7530 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/N...