From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-6.8-rc1 commit 7dc7c5ef6463111991002f24c0aea08afe86f2cc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I98AW9 CVE: NA
-------------------------------------------------
Patch series "Multi-size THP for anonymous memory", v9.
A series to implement multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP" and "large anonymous folios").
The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults:
1) Since SW (the kernel) is dealing with larger chunks of memory than base pages, there are efficiency savings to be had; fewer page faults, batched PTE and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel overhead. This should benefit all architectures. 2) Since we are now mapping physically contiguous chunks of memory, we can take advantage of HW TLB compression techniques. A reduction in TLB pressure speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
This version incorporates David's feedback on the core patches (#3, #4) and adds some RB and TB tags (see change log for details).
By default, the existing behaviour (and performance) is maintained. The user must explicitly enable multi-size THP to see the performance benefit. This is done via a new sysfs interface (as recommended by David Hildenbrand - thanks to David for the suggestion)! This interface is inspired by the existing per-hugepage-size sysfs interface used by hugetlb, provides full backwards compatibility with the existing PMD-size THP interface, and provides a base for future extensibility. See [9] for detailed discussion of the interface.
This series is based on mm-unstable (715b67adf4c8).
Prerequisites =============
I'm removing this section on the basis that I don't believe what we were previously calling prerequisites are really prerequisites anymore. We originally defined them when mTHP was a compile-time feature. There is now a runtime control to opt-in to mTHP; when disabled, correctness and performance are as before. When enabled, the code is still correct/robust, but in the absence of the one remaining item (compaction) there may be a performance impact in some corners. See the old list in the v8 cover letter at [8]. And a longer explanation of my thinking here [10].
SUMMARY: I don't think we should hold this series up, waiting for the items on the prerequisites list. I believe this series should be ready now so hopefully can be added to mm-unstable for some testing, then fingers crossed for v6.8.
Testing =======
The series includes patches for mm selftests to enlighten the cow and khugepaged tests to explicitly test with multi-size THP, in the same way that PMD-sized THP is tested. The new tests all pass, and no regressions are observed in the mm selftest suite. I've also run my usual kernel compilation and java script benchmarks without any issues.
Refer to my performance numbers posted with v6 [6]. (These are for multi-size THP only - they do not include the arm64 contpte follow-on series).
John Hubbard at Nvidia has indicated dramatic 10x performance improvements for some workloads at [11]. (Observed using v6 of this series as well as the arm64 contpte series).
Kefeng Wang at Huawei has also indicated he sees improvements at [12] although there are some latency regressions also.
I've also checked that there is no regression in the write fault path when mTHP is disabled using a microbenchmark. I ran it for a baseline kernel, as well as v8 and v9. I repeated on Ampere Altra (bare metal) and Apple M2 (VM):
| | m2 vm | altra | |--------------|---------------------|---------------------| | kernel | mean | std_rel | mean | std_rel | |--------------|----------|----------|----------|----------| | baseline | 0.000% | 0.341% | 0.000% | 3.581% | | anonfolio-v8 | 0.005% | 0.272% | 5.068% | 1.128% | | anonfolio-v9 | -0.013% | 0.442% | 0.107% | 1.788% |
There is no measurable difference on M2, but altra has a slow down in v8 which is fixed in v9 by moving the THP order check to be inline within thp_vma_allowable_orders(), as suggested by David.
This patch (of 10):
In preparation for the introduction of anonymous multi-size THP, we would like to be able to split them when they have unmapped subpages, in order to free those unused pages under memory pressure. So remove the artificial requirement that the large folio needed to be at least PMD-sized.
Link: https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20231207161211.2374093-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Yu Zhao yuzhao@google.com Reviewed-by: Yin Fengwei fengwei.yin@intel.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Barry Song v-songbaohua@oppo.com Tested-by: Kefeng Wang wangkefeng.wang@huawei.com Tested-by: John Hubbard jhubbard@nvidia.com Cc: Alistair Popple apopple@nvidia.com Cc: Anshuman Khandual anshuman.khandual@arm.com Cc: Catalin Marinas catalin.marinas@arm.com Cc: David Rientjes rientjes@google.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Itaru Kitayama itaru.kitayama@gmail.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 7dc7c5ef6463111991002f24c0aea08afe86f2cc) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/rmap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c index 2984dbffd391..25656f22d444 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1488,11 +1488,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, __lruvec_stat_mod_folio(folio, idx, -nr);
/* - * Queue anon THP for deferred split if at least one + * Queue anon large folio for deferred split if at least one * page of the folio is unmapped and at least one page * is still mapped. */ - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) + if (folio_test_large(folio) && folio_test_anon(folio)) if (!compound || nr < nr_pmdmapped) deferred_split_folio(folio); }