This series optimizes fork() and unmap/zap() with PTE-mapped THP.
Catalin Marinas (1):
  arm64: Mark the 'addr' argument to set_ptes() and __set_pte_at() as unused
David Hildenbrand (24):
  arm/pgtable: define PFN_PTE_SHIFT
  nios2/pgtable: define PFN_PTE_SHIFT
  powerpc/pgtable: define PFN_PTE_SHIFT
  riscv/pgtable: define PFN_PTE_SHIFT
  s390/pgtable: define PFN_PTE_SHIFT
  sparc/pgtable: define PFN_PTE_SHIFT
  mm/pgtable: make pte_next_pfn() independent of set_ptes()
  arm/mm: use pte_next_pfn() in set_ptes()
  powerpc/mm: use pte_next_pfn() in set_ptes()
  mm/memory: factor out copying the actual PTE in copy_present_pte()
  mm/memory: pass PTE to copy_present_pte()
  mm/memory: optimize fork() with PTE-mapped THP
  mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  mm/memory: ignore writable bit in folio_pte_batch()
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() separately
  mm/memory: further separate anon and pagecache folio handling in zap_present_pte()
  mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()
  mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  mm/mmu_gather: add tlb_remove_tlb_entries()
  mm/mmu_gather: add __tlb_remove_folio_pages()
  mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing
  mm/memory: optimize unmap/zap with PTE-mapped THP
Kefeng Wang (7):
  s390: use pfn_swap_entry_folio() in ptep_zap_swap_entry()
  mm: use pfn_swap_entry_folio() in __split_huge_pmd_locked()
  mm: use pfn_swap_entry_to_folio() in zap_huge_pmd()
  mm: use pfn_swap_entry_folio() in copy_nonpresent_pte()
  mm: convert to should_zap_page() to should_zap_folio()
  mm: convert mm_counter() to take a folio
  mm: convert mm_counter_file() to take a folio
Matthew Wilcox (Oracle) (2):
  mm: add pfn_swap_entry_folio()
  mprotect: use pfn_swap_entry_folio
Peter Xu (1):
  mm/memory: fix missing pte marker for !page on pte zaps
Ryan Roberts (2):
  arm64/mm: Hoist synchronization out of set_ptes() loop
  arm64/mm: make set_ptes() robust when OAs cross 48-bit boundary
 arch/arm/include/asm/pgtable.h      |   2 +
 arch/arm/mm/mmu.c                   |   2 +-
 arch/arm64/include/asm/mte.h        |   4 +-
 arch/arm64/include/asm/pgtable.h    |  58 ++--
 arch/arm64/kernel/mte.c             |   4 +-
 arch/nios2/include/asm/pgtable.h    |   2 +
 arch/powerpc/include/asm/pgtable.h  |   2 +
 arch/powerpc/include/asm/tlb.h      |   2 +
 arch/powerpc/mm/pgtable.c           |   5 +-
 arch/riscv/include/asm/pgtable.h    |   2 +
 arch/s390/include/asm/pgtable.h     |   2 +
 arch/s390/include/asm/tlb.h         |  30 +-
 arch/s390/mm/pgtable.c              |   4 +-
 arch/sparc/include/asm/pgtable_64.h |   2 +
 include/asm-generic/tlb.h           |  44 ++-
 include/linux/mm.h                  |  12 +-
 include/linux/mm_types.h            |  37 ++-
 include/linux/pgtable.h             | 103 ++++++-
 include/linux/swapops.h             |  13 +
 kernel/events/uprobes.c             |   2 +-
 mm/filemap.c                        |   2 +-
 mm/huge_memory.c                    |  23 +-
 mm/khugepaged.c                     |   4 +-
 mm/memory.c                         | 421 ++++++++++++++++++++--------
 mm/mmu_gather.c                     | 111 ++++++--
 mm/mprotect.c                       |   4 +-
 mm/rmap.c                           |  10 +-
 mm/swap.c                           |  12 +-
 mm/swap_state.c                     |  15 +-
 mm/userfaultfd.c                    |   2 +-
 30 files changed, 718 insertions(+), 218 deletions(-)
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion from mainline-v6.9-rc1 commit 5662400a9ac03f38ef3b84e4ff9a640a4604bef9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Patch series "mm: convert mm counter to take a folio", v3.
Make sure all mm_counter() and mm_counter_file() callers have a folio, then convert mm counter functions to take a folio, which saves some compound_head() calls.
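For orientation, the end state of the two counter helpers after this series (assembled from the include/linux/mm.h hunks in the later patches) looks roughly like this:

  /* Optimized variant when the folio is already known not to be anon. */
  static inline int mm_counter_file(struct folio *folio)
  {
          if (folio_test_swapbacked(folio))
                  return MM_SHMEMPAGES;
          return MM_FILEPAGES;
  }

  /* Pick the RSS counter that a mapped folio is accounted against. */
  static inline int mm_counter(struct folio *folio)
  {
          if (folio_test_anon(folio))
                  return MM_ANONPAGES;
          return mm_counter_file(folio);
  }

folio_test_anon()/folio_test_swapbacked() operate on the folio directly, so the compound_head() hidden in PageAnon()/PageSwapBacked() disappears from every caller.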
This patch (of 10):
Thanks to the compound_head() hidden inside PageLocked(), this saves a call to compound_head() over calling page_folio(pfn_swap_entry_to_page()).
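In caller terms the change looks roughly like this (a condensed illustration of the two call paths; the helper itself is added in the swapops.h hunk below):

  /* Before: PageLocked() inside pfn_swap_entry_to_page() resolves
   * compound_head(), and page_folio() resolves the head page again. */
  struct folio *folio = page_folio(pfn_swap_entry_to_page(entry));

  /* After: pfn_folio() + folio_test_locked() work on the folio directly,
   * so the head page is resolved only once. */
  struct folio *folio = pfn_swap_entry_folio(entry);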
Link: https://lkml.kernel.org/r/20240111152429.3374566-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240111152429.3374566-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 5662400a9ac03f38ef3b84e4ff9a640a4604bef9) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/swapops.h | 13 +++++++++++++ mm/filemap.c | 2 +- mm/huge_memory.c | 2 +- 3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 4a7e53612fdb..6038d4c87ddc 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -484,6 +484,19 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) return p; }
+static inline struct folio *pfn_swap_entry_folio(swp_entry_t entry) +{ + struct folio *folio = pfn_folio(swp_offset_pfn(entry)); + + /* + * Any use of migration entries may only occur while the + * corresponding folio is locked + */ + BUG_ON(is_migration_entry(entry) && !folio_test_locked(folio)); + + return folio; +} + /* * A pfn swap entry is a special type of swap entry that always has a pfn stored * in the swap offset. They are used to represent unaddressable device memory diff --git a/mm/filemap.c b/mm/filemap.c index 12d73aa8487d..94c9f36b17d8 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1369,7 +1369,7 @@ void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl) unsigned long pflags = 0; bool in_thrashing; wait_queue_head_t *q; - struct folio *folio = page_folio(pfn_swap_entry_to_page(entry)); + struct folio *folio = pfn_swap_entry_folio(entry);
q = folio_waitqueue(folio); if (!folio_test_uptodate(folio) && folio_test_workingset(folio)) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2d325daab411..b3ee48dc71ff 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2011,7 +2011,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION if (is_swap_pmd(*pmd)) { swp_entry_t entry = pmd_to_swp_entry(*pmd); - struct folio *folio = page_folio(pfn_swap_entry_to_page(entry)); + struct folio *folio = pfn_swap_entry_folio(entry); pmd_t newpmd;
VM_BUG_ON(!is_pmd_migration_entry(*pmd));
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion from mainline-v6.9-rc1 commit f2d571b0b207087442d1c3fca5189ee1cb34648e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We only want to know whether the folio is anonymous, so use pfn_swap_entry_folio() and save a call to compound_head().
Link: https://lkml.kernel.org/r/20240111152429.3374566-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit f2d571b0b207087442d1c3fca5189ee1cb34648e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/mprotect.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c index b51f90eae9fb..f121c46f6e4c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -198,13 +198,13 @@ static long change_pte_range(struct mmu_gather *tlb, pte_t newpte;
if (is_writable_migration_entry(entry)) { - struct page *page = pfn_swap_entry_to_page(entry); + struct folio *folio = pfn_swap_entry_folio(entry);
/* * A protection check is difficult so * just be safe and disable write */ - if (PageAnon(page)) + if (folio_test_anon(folio)) entry = make_readable_exclusive_migration_entry( swp_offset(entry)); else
mainline inclusion from mainline-v6.9-rc1 commit 0601ac883a814930c3a38d39a115fdc05179d886 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Call pfn_swap_entry_folio() in ptep_zap_swap_entry() as preparation for converting mm counter functions to take a folio.
Link: https://lkml.kernel.org/r/20240111152429.3374566-5-willy@infradead.org Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 0601ac883a814930c3a38d39a115fdc05179d886) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/s390/mm/pgtable.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 5cb92941540b..24e7be76f71d 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -730,9 +730,9 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) if (!non_swap_entry(entry)) dec_mm_counter(mm, MM_SWAPENTS); else if (is_migration_entry(entry)) { - struct page *page = pfn_swap_entry_to_page(entry); + struct folio *folio = pfn_swap_entry_folio(entry);
- dec_mm_counter(mm, mm_counter(page)); + dec_mm_counter(mm, mm_counter(&folio->page)); } free_swap_and_cache(entry); }
mainline inclusion from mainline-v6.9-rc1 commit 439992ff4637ad5042ca8ee1f659fae24890de3e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Call pfn_swap_entry_folio() in __split_huge_pmd_locked() as preparation for converting mm counter functions to take a folio.
Link: https://lkml.kernel.org/r/20240111152429.3374566-6-willy@infradead.org Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 439992ff4637ad5042ca8ee1f659fae24890de3e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/huge_memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b3ee48dc71ff..7ba8f4bc7bb0 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2287,7 +2287,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, swp_entry_t entry;
entry = pmd_to_swp_entry(old_pmd); - page = pfn_swap_entry_to_page(entry); + folio = pfn_swap_entry_folio(entry); } else { page = pmd_page(old_pmd); folio = page_folio(page); @@ -2299,7 +2299,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, folio_remove_rmap_pmd(folio, page, vma); folio_put(folio); } - add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); + add_mm_counter(mm, mm_counter_file(&folio->page), -HPAGE_PMD_NR); return; }
mainline inclusion from mainline-v6.9-rc1 commit 0103b27a6b826729dc1500d013e53ebed48980b3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Call pfn_swap_entry_folio() in zap_huge_pmd() as preparation for converting mm counter functions to take a folio. Saves a call to compound_head() embedded inside PageAnon().
Link: https://lkml.kernel.org/r/20240111152429.3374566-7-willy@infradead.org Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 0103b27a6b826729dc1500d013e53ebed48980b3) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/huge_memory.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7ba8f4bc7bb0..7f6aabdaf37d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1870,13 +1870,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); } else { - struct page *page = NULL; + struct folio *folio = NULL; int flush_needed = 1;
if (pmd_present(orig_pmd)) { - page = pmd_page(orig_pmd); + struct page *page = pmd_page(orig_pmd); + + folio = page_folio(page); add_reliable_page_counter(page, tlb->mm, -HPAGE_PMD_NR); - folio_remove_rmap_pmd(page_folio(page), page, vma); + folio_remove_rmap_pmd(folio, page, vma); VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); VM_BUG_ON_PAGE(!PageHead(page), page); } else if (thp_migration_supported()) { @@ -1884,23 +1886,24 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
VM_BUG_ON(!is_pmd_migration_entry(orig_pmd)); entry = pmd_to_swp_entry(orig_pmd); - page = pfn_swap_entry_to_page(entry); + folio = pfn_swap_entry_folio(entry); flush_needed = 0; } else WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
- if (PageAnon(page)) { + if (folio_test_anon(folio)) { zap_deposited_table(tlb->mm, pmd); add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); } else { if (arch_needs_pgtable_deposit()) zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR); + add_mm_counter(tlb->mm, mm_counter_file(&folio->page), + -HPAGE_PMD_NR); }
spin_unlock(ptl); if (flush_needed) - tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE); + tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE); } return 1; }
mainline inclusion from mainline-v6.9-rc1 commit 530c2a0da0b440bec4af3dae5bd7110f77962e9b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Call pfn_swap_entry_folio() as preparation for converting mm counter functions to take a folio.
Link: https://lkml.kernel.org/r/20240111152429.3374566-8-willy@infradead.org Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 530c2a0da0b440bec4af3dae5bd7110f77962e9b) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 031ff37a91fb..ba94174d2d8a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -811,9 +811,9 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, } rss[MM_SWAPENTS]++; } else if (is_migration_entry(entry)) { - page = pfn_swap_entry_to_page(entry); + folio = pfn_swap_entry_folio(entry);
- rss[mm_counter(page)]++; + rss[mm_counter(&folio->page)]++;
if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) {
mainline inclusion from mainline-v6.9-rc1 commit eabafaaa957553142cdafc8ae804fb679e5a5f5e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Make should_zap_page() take a folio and rename it to should_zap_folio() as preparation for converting mm counter functions to take a folio. Saves a call to compound_head() hidden inside PageAnon().
[wangkefeng.wang@huawei.com: fix used-uninitialized warning] Link: https://lkml.kernel.org/r/962a7993-fce9-4de8-85cd-25e290f25736@huawei.com Link: https://lkml.kernel.org/r/20240111152429.3374566-9-willy@infradead.org Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit eabafaaa957553142cdafc8ae804fb679e5a5f5e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index ba94174d2d8a..3bb5b9543771 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1370,19 +1370,20 @@ static inline bool should_zap_cows(struct zap_details *details) return details->even_cows; }
-/* Decides whether we should zap this page with the page pointer specified */ -static inline bool should_zap_page(struct zap_details *details, struct page *page) +/* Decides whether we should zap this folio with the folio pointer specified */ +static inline bool should_zap_folio(struct zap_details *details, + struct folio *folio) { - /* If we can make a decision without *page.. */ + /* If we can make a decision without *folio.. */ if (should_zap_cows(details)) return true;
- /* E.g. the caller passes NULL for the case of a zero page */ - if (!page) + /* E.g. the caller passes NULL for the case of a zero folio */ + if (!folio) return true;
- /* Otherwise we should only zap non-anon pages */ - return !PageAnon(page); + /* Otherwise we should only zap non-anon folios */ + return !folio_test_anon(folio); }
static inline bool zap_drop_file_uffd_wp(struct zap_details *details) @@ -1435,7 +1436,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, arch_enter_lazy_mmu_mode(); do { pte_t ptent = ptep_get(pte); - struct folio *folio; + struct folio *folio = NULL; struct page *page;
if (pte_none(ptent)) @@ -1448,7 +1449,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, unsigned int delay_rmap;
page = vm_normal_page(vma, addr, ptent); - if (unlikely(!should_zap_page(details, page))) + if (page) + folio = page_folio(page); + + if (unlikely(!should_zap_folio(details, folio))) continue; ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); @@ -1461,7 +1465,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, continue; }
- folio = page_folio(page); delay_rmap = 0; if (!folio_test_anon(folio)) { if (pte_dirty(ptent)) { @@ -1494,7 +1497,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, is_device_exclusive_entry(entry)) { page = pfn_swap_entry_to_page(entry); folio = page_folio(page); - if (unlikely(!should_zap_page(details, page))) + if (unlikely(!should_zap_folio(details, folio))) continue; /* * Both device private/exclusive mappings should only @@ -1516,10 +1519,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (unlikely(!free_swap_and_cache(entry))) print_bad_pte(vma, addr, ptent, NULL); } else if (is_migration_entry(entry)) { - page = pfn_swap_entry_to_page(entry); - if (!should_zap_page(details, page)) + folio = pfn_swap_entry_folio(entry); + if (!should_zap_folio(details, folio)) continue; - rss[mm_counter(page)]--; + rss[mm_counter(&folio->page)]--; } else if (pte_marker_entry_uffd_wp(entry)) { /* * For anon: always drop the marker; for file: only
mainline inclusion from mainline-v6.9-rc1 commit a23f517b0e1554467b0eb3bc1ebcb4d626217302 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Now all callers of mm_counter() have a folio, convert mm_counter() to take a folio. Saves a call to compound_head() hidden inside PageAnon().
Link: https://lkml.kernel.org/r/20240111152429.3374566-10-willy@infradead.org Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit a23f517b0e1554467b0eb3bc1ebcb4d626217302) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/memory.c mm/rmap.c --- arch/s390/mm/pgtable.c | 2 +- include/linux/mm.h | 6 +++--- mm/memory.c | 10 +++++----- mm/rmap.c | 8 ++++---- mm/userfaultfd.c | 2 +- 5 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 24e7be76f71d..66d4c227c098 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -732,7 +732,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) else if (is_migration_entry(entry)) { struct folio *folio = pfn_swap_entry_folio(entry);
- dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(folio)); } free_swap_and_cache(entry); } diff --git a/include/linux/mm.h b/include/linux/mm.h index 3452aa356a71..aefa9c1ae3c5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2615,11 +2615,11 @@ static inline int mm_counter_file(struct page *page) return MM_FILEPAGES; }
-static inline int mm_counter(struct page *page) +static inline int mm_counter(struct folio *folio) { - if (PageAnon(page)) + if (folio_test_anon(folio)) return MM_ANONPAGES; - return mm_counter_file(page); + return mm_counter_file(&folio->page); }
static inline unsigned long get_mm_rss(struct mm_struct *mm) diff --git a/mm/memory.c b/mm/memory.c index 3bb5b9543771..663b098bdf6e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -813,7 +813,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, } else if (is_migration_entry(entry)) { folio = pfn_swap_entry_folio(entry);
- rss[mm_counter(&folio->page)]++; + rss[mm_counter(folio)]++;
if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) { @@ -845,7 +845,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, * keep things as they are. */ folio_get(folio); - rss[mm_counter(page)]++; + rss[mm_counter(folio)]++; /* Cannot fail as these pages cannot get pinned. */ folio_try_dup_anon_rmap_pte(folio, page, src_vma);
@@ -1477,7 +1477,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (pte_young(ptent) && likely(vma_has_recency(vma))) folio_mark_accessed(folio); } - rss[mm_counter(page)]--; + rss[mm_counter(folio)]--; add_reliable_page_counter(page, mm, -1); if (!delay_rmap) { folio_remove_rmap_pte(folio, page, vma); @@ -1506,7 +1506,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * see zap_install_uffd_wp_if_needed(). */ WARN_ON_ONCE(!vma_is_anonymous(vma)); - rss[mm_counter(page)]--; + rss[mm_counter(folio)]--; add_reliable_page_counter(page, mm, -1); if (is_device_private_entry(entry)) folio_remove_rmap_pte(folio, page, vma); @@ -1522,7 +1522,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, folio = pfn_swap_entry_folio(entry); if (!should_zap_folio(details, folio)) continue; - rss[mm_counter(&folio->page)]--; + rss[mm_counter(folio)]--; } else if (pte_marker_entry_uffd_wp(entry)) { /* * For anon: always drop the marker; for file: only diff --git a/mm/rmap.c b/mm/rmap.c index 73f2b3a33158..db6a2e4bafc2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1750,7 +1750,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, set_huge_pte_at(mm, address, pvmw.pte, pteval, hsz); } else { - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(folio)); add_reliable_page_counter(&folio->page, mm, -1); set_pte_at(mm, address, pvmw.pte, pteval); } @@ -1766,7 +1766,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. */ - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(folio)); add_reliable_page_counter(&folio->page, mm, -1); } else if (folio_test_anon(folio)) { swp_entry_t entry = page_swap_entry(subpage); @@ -2156,7 +2156,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, set_huge_pte_at(mm, address, pvmw.pte, pteval, hsz); } else { - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(folio)); add_reliable_page_counter(&folio->page, mm, -1); set_pte_at(mm, address, pvmw.pte, pteval); } @@ -2172,7 +2172,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. */ - dec_mm_counter(mm, mm_counter(&folio->page)); + dec_mm_counter(mm, mm_counter(folio)); add_reliable_page_counter(&folio->page, mm, -1); } else { swp_entry_t entry; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index f8449f506af2..ef7f6348c7ec 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -125,7 +125,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, * Must happen after rmap, as mm_counter() checks mapping (via * PageAnon()), which is set by __page_set_anon_rmap(). */ - inc_mm_counter(dst_mm, mm_counter(page)); + inc_mm_counter(dst_mm, mm_counter(folio));
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
mainline inclusion from mainline-v6.9-rc1 commit 6b27cc6c66abf0f0b091a95ca1ad4e0fc68c11fd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Now all callers of mm_counter_file() have a folio, convert mm_counter_file() to take a folio. Saves a call to compound_head() hidden inside PageSwapBacked().
Link: https://lkml.kernel.org/r/20240111152429.3374566-11-willy@infradead.org Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 6b27cc6c66abf0f0b091a95ca1ad4e0fc68c11fd) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/memory.c mm/rmap.c --- include/linux/mm.h | 8 ++++---- kernel/events/uprobes.c | 2 +- mm/huge_memory.c | 4 ++-- mm/khugepaged.c | 4 ++-- mm/memory.c | 10 +++++----- mm/rmap.c | 2 +- 6 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index aefa9c1ae3c5..46c7b073824c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2607,10 +2607,10 @@ static inline void dec_mm_counter(struct mm_struct *mm, int member) mm_trace_rss_stat(mm, member); }
-/* Optimized variant when page is already known not to be PageAnon */ -static inline int mm_counter_file(struct page *page) +/* Optimized variant when folio is already known not to be anon */ +static inline int mm_counter_file(struct folio *folio) { - if (PageSwapBacked(page)) + if (folio_test_swapbacked(folio)) return MM_SHMEMPAGES; return MM_FILEPAGES; } @@ -2619,7 +2619,7 @@ static inline int mm_counter(struct folio *folio) { if (folio_test_anon(folio)) return MM_ANONPAGES; - return mm_counter_file(&folio->page); + return mm_counter_file(folio); }
static inline unsigned long get_mm_rss(struct mm_struct *mm) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 3103cd259383..9f8d9baa7a2f 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -189,7 +189,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, dec_mm_counter(mm, MM_ANONPAGES);
if (!folio_test_anon(old_folio)) { - dec_mm_counter(mm, mm_counter_file(old_page)); + dec_mm_counter(mm, mm_counter_file(old_folio)); inc_mm_counter(mm, MM_ANONPAGES); }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7f6aabdaf37d..2073693b3aa7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1897,7 +1897,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, } else { if (arch_needs_pgtable_deposit()) zap_deposited_table(tlb->mm, pmd); - add_mm_counter(tlb->mm, mm_counter_file(&folio->page), + add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PMD_NR); }
@@ -2302,7 +2302,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, folio_remove_rmap_pmd(folio, page, vma); folio_put(folio); } - add_mm_counter(mm, mm_counter_file(&folio->page), -HPAGE_PMD_NR); + add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR); return; }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 89b47a6e24af..0b8636537383 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1652,7 +1652,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, /* step 3: set proper refcount and mm_counters. */ if (nr_ptes) { folio_ref_sub(folio, nr_ptes); - add_mm_counter(mm, mm_counter_file(&folio->page), -nr_ptes); + add_mm_counter(mm, mm_counter_file(folio), -nr_ptes); }
/* step 4: remove empty page table */ @@ -1683,7 +1683,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, if (nr_ptes) { flush_tlb_mm(mm); folio_ref_sub(folio, nr_ptes); - add_mm_counter(mm, mm_counter_file(&folio->page), -nr_ptes); + add_mm_counter(mm, mm_counter_file(folio), -nr_ptes); } if (start_pte) pte_unmap_unlock(start_pte, ptl); diff --git a/mm/memory.c b/mm/memory.c index 663b098bdf6e..e7a959688bdc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -971,7 +971,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } else if (page) { folio_get(folio); folio_dup_file_rmap_pte(folio, page); - rss[mm_counter_file(page)]++; + rss[mm_counter_file(folio)]++; add_reliable_folio_counter(folio, dst_vma->vm_mm, 1); }
@@ -1875,7 +1875,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, return -EBUSY; /* Ok, finally just insert the thing.. */ folio_get(folio); - inc_mm_counter(vma->vm_mm, mm_counter_file(page)); + inc_mm_counter(vma->vm_mm, mm_counter_file(folio)); folio_add_file_rmap_pte(folio, page, vma); set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot)); return 0; @@ -3184,7 +3184,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { if (old_folio) { if (!folio_test_anon(old_folio)) { - dec_mm_counter(mm, mm_counter_file(&old_folio->page)); + dec_mm_counter(mm, mm_counter_file(old_folio)); inc_mm_counter(mm, MM_ANONPAGES); } add_reliable_folio_counter(old_folio, mm, -1); @@ -4480,7 +4480,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) if (write) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR); + add_mm_counter(vma->vm_mm, mm_counter_file(folio), HPAGE_PMD_NR); add_reliable_page_counter(page, vma->vm_mm, HPAGE_PMD_NR); folio_add_file_rmap_pmd(folio, page, vma);
@@ -4545,7 +4545,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio, folio_add_new_anon_rmap(folio, vma, addr); folio_add_lru_vma(folio, vma); } else { - add_mm_counter(vma->vm_mm, mm_counter_file(page), nr); + add_mm_counter(vma->vm_mm, mm_counter_file(folio), nr); folio_add_file_rmap_ptes(folio, page, nr, vma); } set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr); diff --git a/mm/rmap.c b/mm/rmap.c index db6a2e4bafc2..88345e743c4f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1877,7 +1877,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * * See Documentation/mm/mmu_notifier.rst */ - dec_mm_counter(mm, mm_counter_file(&folio->page)); + dec_mm_counter(mm, mm_counter_file(folio)); add_reliable_folio_counter(folio, mm, -1); } discard:
From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-v6.7-rc1 commit 3425cec42c3ce0f65fe74e412756b567b152e61d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
set_ptes() sets a physically contiguous block of memory (which all belongs to the same folio) to a contiguous block of ptes. The arm64 implementation of this previously just looped, operating on each individual pte. But the __sync_icache_dcache() and mte_sync_tags() operations can both be hoisted out of the loop so that they are performed once for the contiguous set of pages (which may be less than the whole folio). This should result in minor performance gains.
__sync_icache_dcache() already acts on the whole folio, and sets a flag in the folio so that it skips duplicate calls. But by hoisting the call, all the pte testing is done only once.
mte_sync_tags() operates on each individual page with its own loop. But by passing the number of pages explicitly, we can rely solely on its loop and do the checks only once. This approach also makes it robust for the future: rather than assuming that if the head page of a compound page is being mapped then the whole compound page is being mapped, we explicitly know how many pages are being mapped. The old assumption may not continue to hold once the "anonymous large folios" feature is merged.
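The resulting arm64 code (condensed from the pgtable.h hunk below, comments trimmed) looks roughly like this; the cache/tag maintenance runs once per call, and only the pte store remains in the loop:

  /* Hoisted out of the loop: done once for the whole contiguous range. */
  static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
  {
          if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
                  __sync_icache_dcache(pte);

          if (system_supports_mte() && pte_access_permitted(pte, false) &&
              !pte_special(pte) && pte_tagged(pte))
                  mte_sync_tags(pte, nr_pages);
  }

  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pte, unsigned int nr)
  {
          page_table_check_ptes_set(mm, ptep, pte, nr);
          __sync_cache_and_tags(pte, nr);         /* once per batch */

          for (;;) {                              /* per-pte work only */
                  __check_safe_pte_update(mm, ptep, pte);
                  set_pte(ptep, pte);
                  if (--nr == 0)
                          break;
                  ptep++;
                  addr += PAGE_SIZE;
                  pte_val(pte) += PAGE_SIZE;
          }
  }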
Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Steven Price steven.price@arm.com Link: https://lore.kernel.org/r/20231005140730.2191134-1-ryan.roberts@arm.com Signed-off-by: Catalin Marinas catalin.marinas@arm.com (cherry picked from commit 3425cec42c3ce0f65fe74e412756b567b152e61d) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/arm64/include/asm/mte.h | 4 ++-- arch/arm64/include/asm/pgtable.h | 27 +++++++++++++++++---------- arch/arm64/kernel/mte.c | 4 ++-- 3 files changed, 21 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index d0f6d87865bc..9cdded082dd4 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -90,7 +90,7 @@ static inline bool try_page_mte_tagging(struct page *page) }
void mte_zero_clear_page_tags(void *addr); -void mte_sync_tags(pte_t pte); +void mte_sync_tags(pte_t pte, unsigned int nr_pages); void mte_copy_page_tags(void *kto, const void *kfrom); int mte_copy_mc_page_tags(void *kto, const void *kfrom); void mte_thread_init_user(void); @@ -123,7 +123,7 @@ static inline bool try_page_mte_tagging(struct page *page) static inline void mte_zero_clear_page_tags(void *addr) { } -static inline void mte_sync_tags(pte_t pte) +static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages) { } static inline void mte_copy_page_tags(void *kto, const void *kfrom) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 07bdf5dd8ebe..5e98a29fd867 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -325,8 +325,7 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep, __func__, pte_val(old_pte), pte_val(pte)); }
-static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte) +static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages) { if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte)) __sync_icache_dcache(pte); @@ -339,20 +338,18 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, */ if (system_supports_mte() && pte_access_permitted(pte, false) && !pte_special(pte) && pte_tagged(pte)) - mte_sync_tags(pte); - - __check_safe_pte_update(mm, ptep, pte); - - set_pte(ptep, pte); + mte_sync_tags(pte, nr_pages); }
static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr) { page_table_check_ptes_set(mm, ptep, pte, nr); + __sync_cache_and_tags(pte, nr);
for (;;) { - __set_pte_at(mm, addr, ptep, pte); + __check_safe_pte_update(mm, ptep, pte); + set_pte(ptep, pte); if (--nr == 0) break; ptep++; @@ -531,18 +528,28 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) #define pud_pfn(pud) ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT) #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
+static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte, unsigned int nr) +{ + __sync_cache_and_tags(pte, nr); + __check_safe_pte_update(mm, ptep, pte); + set_pte(ptep, pte); +} + static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd) { page_table_check_pmd_set(mm, pmdp, pmd); - return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd)); + return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd), + PMD_SIZE >> PAGE_SHIFT); }
static inline void set_pud_at(struct mm_struct *mm, unsigned long addr, pud_t *pudp, pud_t pud) { page_table_check_pud_set(mm, pudp, pud); - return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud)); + return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud), + PUD_SIZE >> PAGE_SHIFT); }
#define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d)) diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c index 4edecaac8f91..2fb5e7a7a4d5 100644 --- a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -35,10 +35,10 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode); EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode); #endif
-void mte_sync_tags(pte_t pte) +void mte_sync_tags(pte_t pte, unsigned int nr_pages) { struct page *page = pte_page(pte); - long i, nr_pages = compound_nr(page); + unsigned int i;
/* if PG_mte_tagged is set, tags have already been initialised */ for (i = 0; i < nr_pages; i++, page++) {
From: Catalin Marinas catalin.marinas@arm.com
mainline inclusion from mainline-v6.7-rc1 commit dba2ff4922b3cf573c25c3886e869258a6076030 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
This argument is not used by the arm64 implementation. Mark it as __always_unused and also remove the unnecessary 'addr' increment in set_ptes().
Signed-off-by: Catalin Marinas catalin.marinas@arm.com Reported-by: kernel test robot lkp@intel.com Closes: https://lore.kernel.org/oe-kbuild-all/202310140531.BQQwt3NQ-lkp@intel.com/ Cc: Will Deacon will@kernel.org Tested-by: Ryan Roberts ryan.roberts@arm.com Link: https://lore.kernel.org/r/ZS6EvMiJ0QF5INkv@arm.com (cherry picked from commit dba2ff4922b3cf573c25c3886e869258a6076030) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/arm64/include/asm/pgtable.h | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 5e98a29fd867..79ce70fbb751 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -341,8 +341,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages) mte_sync_tags(pte, nr_pages); }
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte, unsigned int nr) +static inline void set_ptes(struct mm_struct *mm, + unsigned long __always_unused addr, + pte_t *ptep, pte_t pte, unsigned int nr) { page_table_check_ptes_set(mm, ptep, pte, nr); __sync_cache_and_tags(pte, nr); @@ -353,7 +354,6 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr, if (--nr == 0) break; ptep++; - addr += PAGE_SIZE; pte_val(pte) += PAGE_SIZE; } } @@ -528,7 +528,8 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) #define pud_pfn(pud) ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT) #define pfn_pud(pfn,prot) __pud(__phys_to_pud_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
-static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, +static inline void __set_pte_at(struct mm_struct *mm, + unsigned long __always_unused addr, pte_t *ptep, pte_t pte, unsigned int nr) { __sync_cache_and_tags(pte, nr);
From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-v6.9-rc1 commit 6e8f588708971e0626f5be808e8c4b6cdb86eb0b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Patch series "mm/memory: optimize fork() with PTE-mapped THP", v3.
Now that the rmap overhaul[1], which provides a clean interface for rmap batching, is upstream, let's implement PTE batching during fork when processing PTE-mapped THPs.
This series is partially based on Ryan's previous work[2] to implement cont-pte support on arm64, but it's a complete rewrite based on [1] to optimize all architectures independently of any such PTE bits, and to use the new rmap batching functions that simplify the code and prepare for further rmap accounting changes.
We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch and (c) perform batch PTE setting/updates.
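As a rough illustration of the batching idea (a simplified, hypothetical helper, not the exact folio_pte_batch() added later in this series):

  /*
   * Count how many consecutive PTEs map consecutive pfns, i.e. consecutive
   * pages of the same large folio.  The caller limits max_nr so the run
   * cannot extend past the end of the folio; the real helper additionally
   * ignores the writable, dirty, accessed and soft-dirty bits when
   * comparing.
   */
  static int pte_batch_len(pte_t *ptep, pte_t pte, int max_nr)
  {
          pte_t expected = pte_next_pfn(pte);
          int nr = 1;

          while (nr < max_nr) {
                  if (!pte_same(ptep_get(ptep + nr), expected))
                          break;
                  expected = pte_next_pfn(expected);
                  nr++;
          }
          return nr;
  }

Each detected run is then handled as one unit: a single folio reference adjustment for (a), a single call to the batched rmap helper for (b), and a batched PTE install/update for (c).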
While this series should be beneficial for adding cont-pte support on ARM64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount.
Independent of all that, this series results in a speedup during fork with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for fork() (shorter is better):
Folio Size | v6.8-rc1 | New      | Change
-------------------------------------------
4KiB       | 0.014328 | 0.014035 |  -2%
16KiB      | 0.014263 | 0.01196  | -16%
32KiB      | 0.014334 | 0.01094  | -24%
64KiB      | 0.014046 | 0.010444 | -26%
128KiB     | 0.014011 | 0.010063 | -28%
256KiB     | 0.013993 | 0.009938 | -29%
512KiB     | 0.013983 | 0.00985  | -30%
1024KiB    | 0.013986 | 0.00982  | -30%
2048KiB    | 0.014305 | 0.010076 | -30%
Note that these numbers are even better than the ones from v1 (verified over multiple reboots), even though there were only minimal code changes. Well, I removed a pte_mkclean() call for anon folios, maybe that also plays a role.
But my experience is that fork() is extremely sensitive to code size, inlining, ... so I suspect that on other architectures we will rather see a change of around -20% instead of -30%, and that it will be easy to "lose" some of that speedup through subtle code changes in the future.
Next up is PTE batching when unmapping. Only tested on x86-64. Compile-tested on most other architectures.
[1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
This patch (of 15):
Since the high bits [51:48] of an OA are not stored contiguously in the PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE to the pte to get the pte with the next pfn. This works until the pfn crosses the 48-bit boundary, at which point we overflow into the upper attributes.
Of course one could argue (and Matthew Wilcox has :) that we will never see a folio cross this boundary because we only allow naturally aligned power-of-2 allocations, so this would require a half-petabyte folio. So it's only a theoretical bug. But it's better that the code is robust regardless.
As part of the fix, I've implemented pte_next_pfn() as an opt-in core-mm interface. It is now available to the core-mm, which will need it shortly to support forthcoming fork()-batching optimizations.
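For comparison, the generic helper in include/linux/pgtable.h is roughly the following; it simply bumps the pfn field and is only correct while the pfn occupies one contiguous bit field in the pte, which is exactly what does not hold for arm64 OAs above 48 bits, hence the arm64-specific version in the hunk below:

  #ifndef pte_next_pfn
  static inline pte_t pte_next_pfn(pte_t pte)
  {
          /* Only valid while the pfn field cannot overflow into other bits. */
          return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
  }
  #endif

The arm64 override instead goes through pfn_pte()/pte_pgprot(), so the non-contiguous high OA bits are re-encoded correctly.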
Link: https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240129124649.189745-2-david@redhat.com Fixes: 4a169d61c2ed ("arm64: implement the new page table range API") Closes: https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43ad01@arm.co... Signed-off-by: Ryan Roberts ryan.roberts@arm.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Catalin Marinas catalin.marinas@arm.com Reviewed-by: David Hildenbrand david@redhat.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Alexandre Ghiti alexghiti@rivosinc.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 6e8f588708971e0626f5be808e8c4b6cdb86eb0b) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/arm64/include/asm/pgtable.h | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 79ce70fbb751..52d0b0a763f1 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -341,6 +341,22 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages) mte_sync_tags(pte, nr_pages); }
+/* + * Select all bits except the pfn + */ +static inline pgprot_t pte_pgprot(pte_t pte) +{ + unsigned long pfn = pte_pfn(pte); + + return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte)); +} + +#define pte_next_pfn pte_next_pfn +static inline pte_t pte_next_pfn(pte_t pte) +{ + return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte)); +} + static inline void set_ptes(struct mm_struct *mm, unsigned long __always_unused addr, pte_t *ptep, pte_t pte, unsigned int nr) @@ -354,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm, if (--nr == 0) break; ptep++; - pte_val(pte) += PAGE_SIZE; + pte = pte_next_pfn(pte); } } #define set_ptes set_ptes @@ -433,16 +449,6 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte) return clear_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE)); }
-/* - * Select all bits except the pfn - */ -static inline pgprot_t pte_pgprot(pte_t pte) -{ - unsigned long pfn = pte_pfn(pte); - - return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte)); -} - #ifdef CONFIG_NUMA_BALANCING /* * See the comment in include/linux/pgtable.h
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 12b884f2e09ab42d3879a3e2c703e7157691013c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn().
Link: https://lkml.kernel.org/r/20240129124649.189745-3-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 12b884f2e09ab42d3879a3e2c703e7157691013c) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/arm/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index 93ffc943c87d..e8c8629b3eb5 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -208,6 +208,8 @@ static inline void __sync_icache_dcache(pte_t pteval) extern void __sync_icache_dcache(pte_t pteval); #endif
+#define PFN_PTE_SHIFT PAGE_SHIFT + void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval, unsigned int nr); #define set_ptes set_ptes
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 3a6a6c3fbda8f50fc9f0e5fede8a0f70abdea033 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn().
Link: https://lkml.kernel.org/r/20240129124649.189745-4-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 3a6a6c3fbda8f50fc9f0e5fede8a0f70abdea033) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/nios2/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h index 5144506dfa69..d052dfcbe8d3 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval) *ptep = pteval; }
+#define PFN_PTE_SHIFT 0 + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr) {
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit f7dc4d689e6fafe3d8424f600b924f2d59d1a3cf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn().
Link: https://lkml.kernel.org/r/20240129124649.189745-5-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Christophe Leroy christophe.leroy@csgroup.eu Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit f7dc4d689e6fafe3d8424f600b924f2d59d1a3cf) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/powerpc/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index d0ee46de248e..db2fe941e4c8 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -41,6 +41,8 @@ struct mm_struct;
#ifndef __ASSEMBLY__
+#define PFN_PTE_SHIFT PTE_RPN_SHIFT + void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr); #define set_ptes set_ptes
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 57c254b2fb31f0160829f4bf1cb993a9e9c302a8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn().
Link: https://lkml.kernel.org/r/20240129124649.189745-6-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Alexandre Ghiti alexghiti@rivosinc.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 57c254b2fb31f0160829f4bf1cb993a9e9c302a8) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/riscv/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index c00bd5377db9..0a6088bffa00 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -526,6 +526,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval) set_pte(ptep, pteval); }
+#define PFN_PTE_SHIFT _PAGE_PFN_SHIFT + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pteval, unsigned int nr) {
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 4555ac8b3c16f67f74c04ff71ce8c4a8fcee973a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn().
Link: https://lkml.kernel.org/r/20240129124649.189745-7-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 4555ac8b3c16f67f74c04ff71ce8c4a8fcee973a) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/s390/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index fb3ee7758b76..41855e8058ff 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1314,6 +1314,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot); #define pgprot_writethrough pgprot_writethrough pgprot_t pgprot_writethrough(pgprot_t prot);
+#define PFN_PTE_SHIFT PAGE_SHIFT + /* * Set multiple PTEs to consecutive pages with a single call. All PTEs * are within the same folio, PMD and VMA.
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit ce7a9de353da053e55a68e2441196114547e38d0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn().
Link: https://lkml.kernel.org/r/20240129124649.189745-8-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit ce7a9de353da053e55a68e2441196114547e38d0) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/sparc/include/asm/pgtable_64.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 5e41033bf4ca..be9bcc50e4cb 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -928,6 +928,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT); }
+#define PFN_PTE_SHIFT PAGE_SHIFT + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr) {
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 6cdfa1d5d5d8285108495c33588c48cdda81b647 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's provide pte_next_pfn(), independently of set_ptes(). This allows for using the generic pte_next_pfn() version in some arch-specific set_ptes() implementations, and prepares for reusing pte_next_pfn() in other contexts.
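For reference, the generic fallback now sits next to (but no longer inside) the set_ptes() fallback and is essentially the following sketch; it only relies on the per-arch PFN_PTE_SHIFT definitions added earlier in this series:

/*
 * Sketch of the generic fallback in include/linux/pgtable.h: advance a PTE
 * value to the next PFN by adding one PFN step to the raw PTE value.
 */
#ifndef pte_next_pfn
static inline pte_t pte_next_pfn(pte_t pte)
{
        return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
}
#endif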
Link: https://lkml.kernel.org/r/20240129124649.189745-9-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Christophe Leroy christophe.leroy@csgroup.eu Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 6cdfa1d5d5d8285108495c33588c48cdda81b647) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/pgtable.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index af7639c3b0a3..b5ce7ee512d0 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -205,7 +205,6 @@ static inline int pmd_young(pmd_t pmd) #define arch_flush_lazy_mmu_mode() do {} while (0) #endif
-#ifndef set_ptes
#ifndef pte_next_pfn static inline pte_t pte_next_pfn(pte_t pte) @@ -214,6 +213,7 @@ static inline pte_t pte_next_pfn(pte_t pte) } #endif
+#ifndef set_ptes /** * set_ptes - Map consecutive pages to a contiguous range of addresses. * @mm: Address space to map the pages into.
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit e5ea320aec811c0e5cddefda17052579e0306415 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's use our handy helper now that it's available on all archs.
Link: https://lkml.kernel.org/r/20240129124649.189745-10-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit e5ea320aec811c0e5cddefda17052579e0306415) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/arm/mm/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index 674ed71573a8..c24e29c0b9a4 100644 --- a/arch/arm/mm/mmu.c +++ b/arch/arm/mm/mmu.c @@ -1814,6 +1814,6 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, if (--nr == 0) break; ptep++; - pte_val(pteval) += PAGE_SIZE; + pteval = pte_next_pfn(pteval); } }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 802cc2ab33b0d8a013c216ca7f4caa9034bfc257 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's use our handy new helper. Note that the implementation is slightly different, but shouldn't really make a difference in practice.
Link: https://lkml.kernel.org/r/20240129124649.189745-11-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Christophe Leroy christophe.leroy@csgroup.eu Tested-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 802cc2ab33b0d8a013c216ca7f4caa9034bfc257) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/powerpc/mm/pgtable.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 4d69bfb9bc11..79b7b35c4899 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, break; ptep++; addr += PAGE_SIZE; - /* - * increment the pfn. - */ - pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte))); + pte = pte_next_pfn(pte); } }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 23ed190868a65525b8941370630fbb215f12ebe8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's prepare for further changes.
Link: https://lkml.kernel.org/r/20240129124649.189745-12-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 23ed190868a65525b8941370630fbb215f12ebe8) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 63 ++++++++++++++++++++++++++++------------------------- 1 file changed, 33 insertions(+), 30 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index e7a959688bdc..7f1bd12589e7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -935,6 +935,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma return 0; }
+static inline void __copy_present_pte(struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, + pte_t pte, unsigned long addr) +{ + struct mm_struct *src_mm = src_vma->vm_mm; + + /* If it's a COW mapping, write protect it both processes. */ + if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { + ptep_set_wrprotect(src_mm, addr, src_pte); + pte = pte_wrprotect(pte); + } + + /* If it's a shared mapping, mark it clean in the child. */ + if (src_vma->vm_flags & VM_SHARED) + pte = pte_mkclean(pte); + pte = pte_mkold(pte); + + if (!userfaultfd_wp(dst_vma)) + pte = pte_clear_uffd_wp(pte); + + set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); +} + /* * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page * is required to copy this pte. @@ -944,23 +967,23 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, struct folio **prealloc) { - struct mm_struct *src_mm = src_vma->vm_mm; - unsigned long vm_flags = src_vma->vm_flags; pte_t pte = ptep_get(src_pte); struct page *page; struct folio *folio;
page = vm_normal_page(src_vma, addr, pte); - if (page) - folio = page_folio(page); - if (page && folio_test_anon(folio)) { + if (unlikely(!page)) + goto copy_pte; + + folio = page_folio(page); + folio_get(folio); + if (folio_test_anon(folio)) { /* * If this page may have been pinned by the parent process, * copy the page immediately for the child so that we'll always * guarantee the pinned page won't be randomly replaced in the * future. */ - folio_get(folio); if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) { /* Page may be pinned, we have to copy. */ folio_put(folio); @@ -968,35 +991,15 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, addr, rss, prealloc, page); } rss[MM_ANONPAGES]++; - } else if (page) { - folio_get(folio); + VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio); + } else { folio_dup_file_rmap_pte(folio, page); rss[mm_counter_file(folio)]++; add_reliable_folio_counter(folio, dst_vma->vm_mm, 1); }
- /* - * If it's a COW mapping, write protect it both - * in the parent and the child - */ - if (is_cow_mapping(vm_flags) && pte_write(pte)) { - ptep_set_wrprotect(src_mm, addr, src_pte); - pte = pte_wrprotect(pte); - } - VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page)); - - /* - * If it's a shared mapping, mark it clean in - * the child - */ - if (vm_flags & VM_SHARED) - pte = pte_mkclean(pte); - pte = pte_mkold(pte); - - if (!userfaultfd_wp(dst_vma)) - pte = pte_clear_uffd_wp(pte); - - set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); +copy_pte: + __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr); return 0; }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 53723298ba436830fdf0744c19b57b2a18f44041 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We already read it, let's just forward it.
This patch is based on work by Ryan Roberts.
[david@redhat.com: fix the hmm "exclusive_cow" selftest] Link: https://lkml.kernel.org/r/13f296b8-e882-47fd-b939-c2141dc28717@redhat.com Link: https://lkml.kernel.org/r/20240129124649.189745-13-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 53723298ba436830fdf0744c19b57b2a18f44041) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 7f1bd12589e7..cd7b189b9220 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -964,10 +964,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma, */ static inline int copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, - pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss, - struct folio **prealloc) + pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr, + int *rss, struct folio **prealloc) { - pte_t pte = ptep_get(src_pte); struct page *page; struct folio *folio;
@@ -1095,6 +1094,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, progress += 8; continue; } + ptent = ptep_get(src_pte); + VM_WARN_ON_ONCE(!pte_present(ptent));
/* * Device exclusive entry restored, continue by copying @@ -1104,7 +1105,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } /* copy_present_pte() will clear `*prealloc' if consumed */ ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, - addr, rss, &prealloc); + ptent, addr, rss, &prealloc); /* * If we need a pre-allocated page for this pte, drop the * locks, allocate, and try again.
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit f8d937761d65c87e9987b88ea7beb7bddc333a0e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio, and all other PTE bits besides the PFNs are equal.
We will optimize folio_pte_batch() separately, to ignore selected PTE bits. This patch is based on work by Ryan Roberts.
Use __always_inline for __copy_present_ptes() and keep the handling for single PTEs completely separate from the multi-PTE case: we really want the compiler to optimize for the single-PTE case with small folios, to not degrade performance.
Note that PTE batching will never exceed a single page table and will always stay within VMA boundaries.
Further, processing PTE-mapped THP that may be pinned and have PageAnonExclusive set on at least one subpage should work as expected, but there is room for improvement: We will repeatedly (1) detect a PTE batch, (2) detect that we have to copy a page, and (3) fall back and allocate a single page to copy a single page. For now we won't care, as pinned pages are a corner case, and we should rather look into maintaining only a single PageAnonExclusive bit for large folios.
Link: https://lkml.kernel.org/r/20240129124649.189745-14-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Mike Rapoport (IBM) rppt@kernel.org Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit f8d937761d65c87e9987b88ea7beb7bddc333a0e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/pgtable.h | 31 +++++++++++ mm/memory.c | 112 +++++++++++++++++++++++++++++++++------- 2 files changed, 124 insertions(+), 19 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index b5ce7ee512d0..28d59a6da257 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -622,6 +622,37 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres } #endif
+#ifndef wrprotect_ptes +/** + * wrprotect_ptes - Write-protect PTEs that map consecutive pages of the same + * folio. + * @mm: Address space the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to write-protect. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over ptep_set_wrprotect(). + * + * Note that PTE bits in the PTE range besides the PFN can differ. For example, + * some PTEs might be write-protected. + * + * Context: The caller holds the page table lock. The PTEs map consecutive + * pages that belong to the same folio. The PTEs are all in the same PMD. + */ +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, unsigned int nr) +{ + for (;;) { + ptep_set_wrprotect(mm, addr, ptep); + if (--nr == 0) + break; + ptep++; + addr += PAGE_SIZE; + } +} +#endif + /* * On some architectures hardware does not set page access bit when accessing * memory page, it is responsibility of software setting this bit. It brings diff --git a/mm/memory.c b/mm/memory.c index cd7b189b9220..4cb1b895cfa0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -935,15 +935,15 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma return 0; }
-static inline void __copy_present_pte(struct vm_area_struct *dst_vma, +static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, - pte_t pte, unsigned long addr) + pte_t pte, unsigned long addr, int nr) { struct mm_struct *src_mm = src_vma->vm_mm;
/* If it's a COW mapping, write protect it both processes. */ if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { - ptep_set_wrprotect(src_mm, addr, src_pte); + wrprotect_ptes(src_mm, addr, src_pte, nr); pte = pte_wrprotect(pte); }
@@ -955,26 +955,93 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma, if (!userfaultfd_wp(dst_vma)) pte = pte_clear_uffd_wp(pte);
- set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); + set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); +} + +/* + * Detect a PTE batch: consecutive (present) PTEs that map consecutive + * pages of the same folio. + * + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. + */ +static inline int folio_pte_batch(struct folio *folio, unsigned long addr, + pte_t *start_ptep, pte_t pte, int max_nr) +{ + unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); + const pte_t *end_ptep = start_ptep + max_nr; + pte_t expected_pte = pte_next_pfn(pte); + pte_t *ptep = start_ptep + 1; + + VM_WARN_ON_FOLIO(!pte_present(pte), folio); + + while (ptep != end_ptep) { + pte = ptep_get(ptep); + + if (!pte_same(pte, expected_pte)) + break; + + /* + * Stop immediately once we reached the end of the folio. In + * corner cases the next PFN might fall into a different + * folio. + */ + if (pte_pfn(pte) == folio_end_pfn) + break; + + expected_pte = pte_next_pfn(expected_pte); + ptep++; + } + + return ptep - start_ptep; }
/* - * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page - * is required to copy this pte. + * Copy one present PTE, trying to batch-process subsequent PTEs that map + * consecutive pages of the same folio by copying them as well. + * + * Returns -EAGAIN if one preallocated page is required to copy the next PTE. + * Otherwise, returns the number of copied PTEs (at least 1). */ static inline int -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr, - int *rss, struct folio **prealloc) + int max_nr, int *rss, struct folio **prealloc) { struct page *page; struct folio *folio; + int err, nr;
page = vm_normal_page(src_vma, addr, pte); if (unlikely(!page)) goto copy_pte;
folio = page_folio(page); + + /* + * If we likely have to copy, just don't bother with batching. Make + * sure that the common "small folio" case is as fast as possible + * by keeping the batching logic separate. + */ + if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) { + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr); + folio_ref_add(folio, nr); + if (folio_test_anon(folio)) { + if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, + nr, src_vma))) { + folio_ref_sub(folio, nr); + return -EAGAIN; + } + rss[MM_ANONPAGES] += nr; + VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio); + } else { + folio_dup_file_rmap_ptes(folio, page, nr); + rss[mm_counter_file(folio)] += nr; + } + __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, + addr, nr); + return nr; + } + folio_get(folio); if (folio_test_anon(folio)) { /* @@ -986,8 +1053,9 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) { /* Page may be pinned, we have to copy. */ folio_put(folio); - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte, - addr, rss, prealloc, page); + err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte, + addr, rss, prealloc, page); + return err ? err : 1; } rss[MM_ANONPAGES]++; VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio); @@ -998,8 +1066,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, }
copy_pte: - __copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr); - return 0; + __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, 1); + return 1; }
static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm, @@ -1031,10 +1099,11 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_t *src_pte, *dst_pte; pte_t ptent; spinlock_t *src_ptl, *dst_ptl; - int progress, ret = 0; + int progress, max_nr, ret = 0; int rss[NR_MM_COUNTERS]; swp_entry_t entry = (swp_entry_t){0}; struct folio *prealloc = NULL; + int nr;
again: progress = 0; @@ -1065,6 +1134,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, arch_enter_lazy_mmu_mode();
do { + nr = 1; + /* * We are holding two locks at this point - either of them * could generate latencies in another task on another CPU. @@ -1103,9 +1174,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, */ WARN_ON_ONCE(ret != -ENOENT); } - /* copy_present_pte() will clear `*prealloc' if consumed */ - ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, - ptent, addr, rss, &prealloc); + /* copy_present_ptes() will clear `*prealloc' if consumed */ + max_nr = (end - addr) / PAGE_SIZE; + ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, + ptent, addr, max_nr, rss, &prealloc); /* * If we need a pre-allocated page for this pte, drop the * locks, allocate, and try again. @@ -1122,8 +1194,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, folio_put(prealloc); prealloc = NULL; } - progress += 8; - } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); + nr = ret; + progress += 8 * nr; + } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr, + addr != end);
arch_leave_lazy_mmu_mode(); pte_unmap_unlock(orig_src_pte, src_ptl); @@ -1144,7 +1218,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, prealloc = page_copy_prealloc(src_mm, src_vma, addr); if (!prealloc) return -ENOMEM; - } else if (ret) { + } else if (ret < 0) { VM_WARN_ON_ONCE(1); }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 25365e10699aa0e320345d019194fbea9f37a4ae category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's always ignore the accessed/young bit: we'll always mark the PTE as old in our child process during fork, and upcoming users will similarly not care.
Ignore the dirty bit only if we don't want to duplicate the dirty bit into the child process during fork. Maybe, we could just set all PTEs in the child dirty if any PTE is dirty. For now, let's keep the behavior unchanged, this can be optimized later if required.
Ignore the soft-dirty bit only if the bit doesn't have any meaning in the src vma, and similarly won't have any in the copied dst vma.
For now, we won't bother with the uffd-wp bit.
Link: https://lkml.kernel.org/r/20240129124649.189745-15-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Mike Rapoport (IBM) rppt@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 25365e10699aa0e320345d019194fbea9f37a4ae) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 36 +++++++++++++++++++++++++++++++----- 1 file changed, 31 insertions(+), 5 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 4cb1b895cfa0..6609185178ff 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -958,24 +958,44 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma, set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); }
+/* Flags for folio_pte_batch(). */ +typedef int __bitwise fpb_t; + +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */ +#define FPB_IGNORE_DIRTY ((__force fpb_t)BIT(0)) + +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */ +#define FPB_IGNORE_SOFT_DIRTY ((__force fpb_t)BIT(1)) + +static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) +{ + if (flags & FPB_IGNORE_DIRTY) + pte = pte_mkclean(pte); + if (likely(flags & FPB_IGNORE_SOFT_DIRTY)) + pte = pte_clear_soft_dirty(pte); + return pte_mkold(pte); +} + /* * Detect a PTE batch: consecutive (present) PTEs that map consecutive * pages of the same folio. * - * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN. + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, + * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit + * (with FPB_IGNORE_SOFT_DIRTY). */ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, - pte_t *start_ptep, pte_t pte, int max_nr) + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags) { unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); const pte_t *end_ptep = start_ptep + max_nr; - pte_t expected_pte = pte_next_pfn(pte); + pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags); pte_t *ptep = start_ptep + 1;
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
while (ptep != end_ptep) { - pte = ptep_get(ptep); + pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);
if (!pte_same(pte, expected_pte)) break; @@ -1009,6 +1029,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma { struct page *page; struct folio *folio; + fpb_t flags = 0; int err, nr;
page = vm_normal_page(src_vma, addr, pte); @@ -1023,7 +1044,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma * by keeping the batching logic separate. */ if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) { - nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr); + if (src_vma->vm_flags & VM_SHARED) + flags |= FPB_IGNORE_DIRTY; + if (!vma_soft_dirty_enabled(src_vma)) + flags |= FPB_IGNORE_SOFT_DIRTY; + + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags); folio_ref_add(folio, nr); if (folio_test_anon(folio)) { if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit d7c0e5f722ab229153c22efc836bf220479bdce6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
... and conditionally return to the caller if any PTE except the first one is writable. fork() has to make sure to properly write-protect in case any PTE is writable. Other users (e.g., page unmapping) are expected to not care.
Link: https://lkml.kernel.org/r/20240129124649.189745-16-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Albert Ou aou@eecs.berkeley.edu Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Alexandre Ghiti alexghiti@rivosinc.com Cc: Aneesh Kumar K.V aneesh.kumar@kernel.org Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: David S. Miller davem@davemloft.net Cc: Dinh Nguyen dinguyen@kernel.org Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Naveen N. Rao naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Palmer Dabbelt palmer@dabbelt.com Cc: Paul Walmsley paul.walmsley@sifive.com Cc: Russell King (Oracle) linux@armlinux.org.uk Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Mike Rapoport (IBM) rppt@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit d7c0e5f722ab229153c22efc836bf220479bdce6) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 6609185178ff..378ff4a0df7a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -973,7 +973,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) pte = pte_mkclean(pte); if (likely(flags & FPB_IGNORE_SOFT_DIRTY)) pte = pte_clear_soft_dirty(pte); - return pte_mkold(pte); + return pte_wrprotect(pte_mkold(pte)); }
/* @@ -981,21 +981,32 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * pages of the same folio. * * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN, - * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit - * (with FPB_IGNORE_SOFT_DIRTY). + * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and + * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY). + * + * If "any_writable" is set, it will indicate if any other PTE besides the + * first (given) PTE is writable. */ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, - pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags) + pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags, + bool *any_writable) { unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); const pte_t *end_ptep = start_ptep + max_nr; pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags); pte_t *ptep = start_ptep + 1; + bool writable; + + if (any_writable) + *any_writable = false;
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
while (ptep != end_ptep) { - pte = __pte_batch_clear_ignored(ptep_get(ptep), flags); + pte = ptep_get(ptep); + if (any_writable) + writable = !!pte_write(pte); + pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_same(pte, expected_pte)) break; @@ -1008,6 +1019,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, if (pte_pfn(pte) == folio_end_pfn) break;
+ if (any_writable) + *any_writable |= writable; + expected_pte = pte_next_pfn(expected_pte); ptep++; } @@ -1029,6 +1043,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma { struct page *page; struct folio *folio; + bool any_writable; fpb_t flags = 0; int err, nr;
@@ -1049,7 +1064,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma if (!vma_soft_dirty_enabled(src_vma)) flags |= FPB_IGNORE_SOFT_DIRTY;
- nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags); + nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags, + &any_writable); folio_ref_add(folio, nr); if (folio_test_anon(folio)) { if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, @@ -1063,6 +1079,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma folio_dup_file_rmap_ptes(folio, page, nr); rss[mm_counter_file(folio)] += nr; } + if (any_writable) + pte = pte_mkwrite(pte, src_vma); __copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, nr); return nr;
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 789753e17c4d6593932f07e40b740373123296a6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Patch series "mm/memory: optimize unmap/zap with PTE-mapped THP", v3.
This series is based on [1]. Similar to what we did with fork(), let's implement PTE batching during unmap/zap when processing PTE-mapped THPs.
We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch, (c) perform batch PTE setting/updates and (d) perform TLB entry removal once per batch.
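As a rough illustration of (a)-(d), the batched zap path introduced by the later patches in this series ends up with approximately the following shape (conceptual sketch only, not the exact upstream code; the helper names are the ones added later in this series and should be treated as approximate here):

        /* Conceptual sketch: zap 'nr' present PTEs that map one large folio. */
        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags, NULL);

        clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);     /* (c) batched PTE clearing */
        tlb_remove_tlb_entries(tlb, pte, nr, addr);          /* (d) TLB entry removal once per batch */
        folio_remove_rmap_ptes(folio, page, nr, vma);        /* (b) one rmap call per batch */
        /* (a) references for the whole batch are handled in bulk when the gathered pages are freed */
        __tlb_remove_folio_pages(tlb, page, nr, delay_rmap);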
Ryan was previously working on this in the context of cont-pte for arm64; the latest iteration [2] focused on arm64 with cont-pte only. This series implements the optimization for all architectures, independent of such PTE bits, teaches the MMU gather/TLB code to be fully aware of such batches of pages belonging to the same large folio, and makes use of our new rmap batching function when removing the rmap.
To achieve that, we have to enlighten MMU gather / page freeing code (i.e., everything that consumes encoded_page) to process unmapping of consecutive pages that all belong to the same large folio. I'm being very careful to not degrade order-0 performance, and it looks like I managed to achieve that.
While this series should -- similar to [1] -- be beneficial for adding cont-pte support on arm64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount.
Independent of all that, this series results in a speedup during munmap() and similar unmapping (process teardown, MADV_DONTNEED on larger ranges) with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for munmap() in seconds (shorter is better):
Folio Size | mm-unstable |      New | Change
---------------------------------------------
      4KiB |    0.058110 | 0.057715 |   - 1%
     16KiB |    0.044198 | 0.035469 |   -20%
     32KiB |    0.034216 | 0.023522 |   -31%
     64KiB |    0.029207 | 0.018434 |   -37%
    128KiB |    0.026579 | 0.014026 |   -47%
    256KiB |    0.025130 | 0.011756 |   -53%
    512KiB |    0.024292 | 0.010703 |   -56%
   1024KiB |    0.023812 | 0.010294 |   -57%
   2048KiB |    0.023785 | 0.009910 |   -58%
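A userspace microbenchmark of the kind measured above can be as small as the sketch below (illustrative only; this is not the harness used for the numbers above, and configuring the THP/mTHP folio size via the sysfs knobs is assumed to happen separately):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE (1UL << 30)        /* 1 GiB */

int main(void)
{
        struct timespec t0, t1;
        char *p;

        /* Populate a 1 GiB anonymous VMA; with (m)THP enabled it is backed by large folios. */
        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        madvise(p, SIZE, MADV_HUGEPAGE);
        memset(p, 1, SIZE);                     /* fault in the whole range */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        munmap(p, SIZE);                        /* the operation this series speeds up */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("munmap: %.6f s\n", (t1.tv_sec - t0.tv_sec) +
               (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}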
[1] https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
This patch (of 10):
Let's prepare for further changes by factoring out processing of present PTEs.
Link: https://lkml.kernel.org/r/20240214204435.167852-1-david@redhat.com Link: https://lkml.kernel.org/r/20240214204435.167852-2-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: linuxppc-dev@lists.ozlabs.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 789753e17c4d6593932f07e40b740373123296a6) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/memory.c --- mm/memory.c | 96 ++++++++++++++++++++++++++++++----------------------- 1 file changed, 54 insertions(+), 42 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 378ff4a0df7a..44dc030ae8a9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1535,13 +1535,62 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); }
+static inline void zap_present_pte(struct mmu_gather *tlb, + struct vm_area_struct *vma, pte_t *pte, pte_t ptent, + unsigned long addr, struct zap_details *details, + int *rss, bool *force_flush, bool *force_break) +{ + struct mm_struct *mm = tlb->mm; + struct folio *folio = NULL; + bool delay_rmap = false; + struct page *page; + + page = vm_normal_page(vma, addr, ptent); + if (page) + folio = page_folio(page); + + if (unlikely(!should_zap_folio(details, folio))) + return; + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + arch_check_zapped_pte(vma, ptent); + tlb_remove_tlb_entry(tlb, pte, addr); + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + if (unlikely(!page)) { + ksm_might_unmap_zero_page(mm, ptent); + return; + } + + if (!folio_test_anon(folio)) { + if (pte_dirty(ptent)) { + folio_mark_dirty(folio); + if (tlb_delay_rmap(tlb)) { + delay_rmap = true; + *force_flush = true; + } + } + if (pte_young(ptent) && likely(vma_has_recency(vma))) + folio_mark_accessed(folio); + } + rss[mm_counter(folio)]--; + add_reliable_page_counter(page, mm, -1); + if (!delay_rmap) { + folio_remove_rmap_pte(folio, page, vma); + if (unlikely(page_mapcount(page) < 0)) + print_bad_pte(vma, addr, ptent, page); + } + if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) { + *force_flush = true; + *force_break = true; + } +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, struct zap_details *details) { + bool force_flush = false, force_break = false; struct mm_struct *mm = tlb->mm; - int force_flush = 0; int rss[NR_MM_COUNTERS]; spinlock_t *ptl; pte_t *start_pte; @@ -1558,7 +1607,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, arch_enter_lazy_mmu_mode(); do { pte_t ptent = ptep_get(pte); - struct folio *folio = NULL; + struct folio *folio; struct page *page;
if (pte_none(ptent)) @@ -1568,46 +1617,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, break;
if (pte_present(ptent)) { - unsigned int delay_rmap; - - page = vm_normal_page(vma, addr, ptent); - if (page) - folio = page_folio(page); - - if (unlikely(!should_zap_folio(details, folio))) - continue; - ptent = ptep_get_and_clear_full(mm, addr, pte, - tlb->fullmm); - arch_check_zapped_pte(vma, ptent); - tlb_remove_tlb_entry(tlb, pte, addr); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, - ptent); - if (unlikely(!page)) { - ksm_might_unmap_zero_page(mm, ptent); - continue; - } - - delay_rmap = 0; - if (!folio_test_anon(folio)) { - if (pte_dirty(ptent)) { - folio_mark_dirty(folio); - if (tlb_delay_rmap(tlb)) { - delay_rmap = 1; - force_flush = 1; - } - } - if (pte_young(ptent) && likely(vma_has_recency(vma))) - folio_mark_accessed(folio); - } - rss[mm_counter(folio)]--; - add_reliable_page_counter(page, mm, -1); - if (!delay_rmap) { - folio_remove_rmap_pte(folio, page, vma); - if (unlikely(page_mapcount(page) < 0)) - print_bad_pte(vma, addr, ptent, page); - } - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) { - force_flush = 1; + zap_present_pte(tlb, vma, pte, ptent, addr, details, + rss, &force_flush, &force_break); + if (unlikely(force_break)) { addr += PAGE_SIZE; break; }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 0cf18e839f64fff9a58569cc9a596bf97310e044 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We don't need uptodate accessed/dirty bits, so in theory we could replace ptep_get_and_clear_full() by an optimized ptep_clear_full() function. Let's rely on the provided pte.
Further, there is no scenario where we would have to insert uffd-wp markers when zapping something that is not a normal page (i.e., zeropage). Add a sanity check to make sure this remains true.
should_zap_folio() no longer has to handle NULL pointers. This change replaces 2/3 "!page/!folio" checks by a single "!page" one.
Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to detect shadow stack entries. But for shadow stack entries, the HW dirty bit (in combination with non-writable PTEs) is set by software. So for the arch_check_zapped_pte() check, we don't have to sync against HW setting the HW dirty bit concurrently, it is always set.
Link: https://lkml.kernel.org/r/20240214204435.167852-3-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 0cf18e839f64fff9a58569cc9a596bf97310e044) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 44dc030ae8a9..7ac5855b6e1c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1500,10 +1500,6 @@ static inline bool should_zap_folio(struct zap_details *details, if (should_zap_cows(details)) return true;
- /* E.g. the caller passes NULL for the case of a zero folio */ - if (!folio) - return true; - /* Otherwise we should only zap non-anon folios */ return !folio_test_anon(folio); } @@ -1541,24 +1537,28 @@ static inline void zap_present_pte(struct mmu_gather *tlb, int *rss, bool *force_flush, bool *force_break) { struct mm_struct *mm = tlb->mm; - struct folio *folio = NULL; bool delay_rmap = false; + struct folio *folio; struct page *page;
page = vm_normal_page(vma, addr, ptent); - if (page) - folio = page_folio(page); + if (!page) { + /* We don't need up-to-date accessed/dirty bits. */ + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + arch_check_zapped_pte(vma, ptent); + tlb_remove_tlb_entry(tlb, pte, addr); + VM_WARN_ON_ONCE(userfaultfd_wp(vma)); + ksm_might_unmap_zero_page(mm, ptent); + return; + }
+ folio = page_folio(page); if (unlikely(!should_zap_folio(details, folio))) return; ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entry(tlb, pte, addr); zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); - if (unlikely(!page)) { - ksm_might_unmap_zero_page(mm, ptent); - return; - }
if (!folio_test_anon(folio)) { if (pte_dirty(ptent)) {
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit d11838ed63ee842fc9ef335b9f3aee3aa26f2ab5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We don't need up-to-date accessed-dirty information for anon folios and can simply work with the ptent we already have. Also, we know the RSS counter we want to update.
We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() + zap_install_uffd_wp_if_needed() after updating the folio and RSS.
While at it, only call zap_install_uffd_wp_if_needed() if there is even any chance that pte_install_uffd_wp_if_needed() would do *something*. That is, just don't bother if uffd-wp does not apply.
Link: https://lkml.kernel.org/r/20240214204435.167852-4-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit d11838ed63ee842fc9ef335b9f3aee3aa26f2ab5) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/memory.c --- mm/memory.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 7ac5855b6e1c..a5c788739876 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1555,12 +1555,9 @@ static inline void zap_present_pte(struct mmu_gather *tlb, folio = page_folio(page); if (unlikely(!should_zap_folio(details, folio))) return; - ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); - arch_check_zapped_pte(vma, ptent); - tlb_remove_tlb_entry(tlb, pte, addr); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
if (!folio_test_anon(folio)) { + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); if (pte_dirty(ptent)) { folio_mark_dirty(folio); if (tlb_delay_rmap(tlb)) { @@ -1570,9 +1567,19 @@ static inline void zap_present_pte(struct mmu_gather *tlb, } if (pte_young(ptent) && likely(vma_has_recency(vma))) folio_mark_accessed(folio); + rss[mm_counter(folio)]--; + add_reliable_page_counter(page, mm, -1); + } else { + /* We don't need up-to-date accessed/dirty bits. */ + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + rss[MM_ANONPAGES]--; } - rss[mm_counter(folio)]--; - add_reliable_page_counter(page, mm, -1); + + arch_check_zapped_pte(vma, ptent); + tlb_remove_tlb_entry(tlb, pte, addr); + if (unlikely(userfaultfd_pte_wp(vma, ptent))) + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + if (!delay_rmap) { folio_remove_rmap_pte(folio, page, vma); if (unlikely(page_mapcount(page) < 0))
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 2b42a7e531509577bd822aece610cd6d0dbf0dd7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's prepare for further changes by factoring it out into a separate function.
Link: https://lkml.kernel.org/r/20240214204435.167852-5-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 2b42a7e531509577bd822aece610cd6d0dbf0dd7) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 53 ++++++++++++++++++++++++++++++++--------------------- 1 file changed, 32 insertions(+), 21 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index a5c788739876..e4208f5302fe 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1531,30 +1531,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); }
-static inline void zap_present_pte(struct mmu_gather *tlb, - struct vm_area_struct *vma, pte_t *pte, pte_t ptent, - unsigned long addr, struct zap_details *details, - int *rss, bool *force_flush, bool *force_break) +static inline void zap_present_folio_pte(struct mmu_gather *tlb, + struct vm_area_struct *vma, struct folio *folio, + struct page *page, pte_t *pte, pte_t ptent, unsigned long addr, + struct zap_details *details, int *rss, bool *force_flush, + bool *force_break) { struct mm_struct *mm = tlb->mm; bool delay_rmap = false; - struct folio *folio; - struct page *page; - - page = vm_normal_page(vma, addr, ptent); - if (!page) { - /* We don't need up-to-date accessed/dirty bits. */ - ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); - arch_check_zapped_pte(vma, ptent); - tlb_remove_tlb_entry(tlb, pte, addr); - VM_WARN_ON_ONCE(userfaultfd_wp(vma)); - ksm_might_unmap_zero_page(mm, ptent); - return; - } - - folio = page_folio(page); - if (unlikely(!should_zap_folio(details, folio))) - return;
if (!folio_test_anon(folio)) { ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); @@ -1591,6 +1575,33 @@ static inline void zap_present_pte(struct mmu_gather *tlb, } }
+static inline void zap_present_pte(struct mmu_gather *tlb, + struct vm_area_struct *vma, pte_t *pte, pte_t ptent, + unsigned long addr, struct zap_details *details, + int *rss, bool *force_flush, bool *force_break) +{ + struct mm_struct *mm = tlb->mm; + struct folio *folio; + struct page *page; + + page = vm_normal_page(vma, addr, ptent); + if (!page) { + /* We don't need up-to-date accessed/dirty bits. */ + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + arch_check_zapped_pte(vma, ptent); + tlb_remove_tlb_entry(tlb, pte, addr); + VM_WARN_ON_ONCE(userfaultfd_wp(vma)); + ksm_might_unmap_zero_page(mm, ptent); + return; + } + + folio = page_folio(page); + if (unlikely(!should_zap_folio(details, folio))) + return; + zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details, + rss, force_flush, force_break); +} + static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end,
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit c30d6bc8d0153630e600e8f67ba88c670d9e1b0c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
We have two bits available in the encoded page pointer to store additional information. Currently, we use one bit to request delay of the rmap removal until after a TLB flush.
We want to make use of the remaining bit internally for batching of multiple pages of the same folio, specifying that the next encoded page pointer in an array is actually "nr_pages". So pass page + delay_rmap flag instead of an encoded page, to handle the encoding internally.
Link: https://lkml.kernel.org/r/20240214204435.167852-6-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit c30d6bc8d0153630e600e8f67ba88c670d9e1b0c) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/s390/include/asm/tlb.h | 13 ++++++------- include/asm-generic/tlb.h | 12 ++++++------ mm/mmu_gather.c | 7 ++++--- 3 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 383b1f91442c..1eb1df478e0c 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -25,8 +25,7 @@ void __tlb_remove_table(void *_table); static inline void tlb_flush(struct mmu_gather *tlb); static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct encoded_page *page, - int page_size); + struct page *page, bool delay_rmap, int page_size);
#define tlb_flush tlb_flush #define pte_free_tlb pte_free_tlb @@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page * has already been freed, so just do free_page_and_swap_cache. * - * s390 doesn't delay rmap removal, so there is nothing encoded in - * the page pointer. + * s390 doesn't delay rmap removal. */ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct encoded_page *page, - int page_size) + struct page *page, bool delay_rmap, int page_size) { - free_page_and_swap_cache(encoded_page_ptr(page)); + VM_WARN_ON_ONCE(delay_rmap); + + free_page_and_swap_cache(page); return false; }
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 1a3191e844b7..c092b54d06b9 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -261,9 +261,8 @@ struct mmu_gather_batch { */ #define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)
-extern bool __tlb_remove_page_size(struct mmu_gather *tlb, - struct encoded_page *page, - int page_size); +extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, + bool delay_rmap, int page_size);
#ifdef CONFIG_SMP /* @@ -465,13 +464,14 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) static inline void tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size) { - if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size)) + if (__tlb_remove_page_size(tlb, page, false, page_size)) tlb_flush_mmu(tlb); }
-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags) +static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, + struct page *page, bool delay_rmap) { - return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE); + return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE); }
/* tlb_remove_page diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 604ddf08affe..ac733d81b112 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb) tlb->local.next = NULL; }
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size) +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, + bool delay_rmap, int page_size) { struct mmu_gather_batch *batch;
@@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, i * Add the page and check if we are full. If so * force a flush. */ - batch->encoded_pages[batch->nr++] = page; + batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap); if (batch->nr == batch->max) { if (!tlb_next_batch(tlb)) return true; batch = tlb->active; } - VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page)); + VM_BUG_ON_PAGE(batch->nr > batch->max, page);
return false; }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit da510964c095cb5e070800ef38752c453d2aa71d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Nowadays, encoded pages are only used in mmu_gather handling. Let's update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.
If encoded page pointers were ever used in another context again, we'd likely want to change the defines to reflect their context (e.g., ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.
This is a preparation for using the remaining spare bit to indicate that the next item in an array of encoded pages is a "nr_pages" argument and not an encoded page.
Link: https://lkml.kernel.org/r/20240214204435.167852-7-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit da510964c095cb5e070800ef38752c453d2aa71d) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/mm_types.h | 17 +++++++++++------ mm/mmu_gather.c | 5 +++-- 2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5d5ee85a0721..1aee23c28330 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -224,8 +224,8 @@ struct page { * * An 'encoded_page' pointer is a pointer to a regular 'struct page', but * with the low bits of the pointer indicating extra context-dependent - * information. Not super-common, but happens in mmu_gather and mlock - * handling, and this acts as a type system check on that use. + * information. Only used in mmu_gather handling, and this acts as a type + * system check on that use. * * We only really have two guaranteed bits in general, although you could * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE) @@ -234,21 +234,26 @@ struct page { * Use the supplied helper functions to endcode/decode the pointer and bits. */ struct encoded_page; -#define ENCODE_PAGE_BITS 3ul + +#define ENCODED_PAGE_BITS 3ul + +/* Perform rmap removal after we have flushed the TLB. */ +#define ENCODED_PAGE_BIT_DELAY_RMAP 1ul + static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags) { - BUILD_BUG_ON(flags > ENCODE_PAGE_BITS); + BUILD_BUG_ON(flags > ENCODED_PAGE_BITS); return (struct encoded_page *)(flags | (unsigned long)page); }
static inline unsigned long encoded_page_flags(struct encoded_page *page) { - return ENCODE_PAGE_BITS & (unsigned long)page; + return ENCODED_PAGE_BITS & (unsigned long)page; }
static inline struct page *encoded_page_ptr(struct encoded_page *page) { - return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page); + return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page); }
/* diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index ac733d81b112..6540c99c6758 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_ for (int i = 0; i < batch->nr; i++) { struct encoded_page *enc = batch->encoded_pages[i];
- if (encoded_page_flags(enc)) { + if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) { struct page *page = encoded_page_ptr(enc); folio_remove_rmap_pte(page_folio(page), page, vma); } @@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb) bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, bool delay_rmap, int page_size) { + int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0; struct mmu_gather_batch *batch;
VM_BUG_ON(!tlb->end); @@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, * Add the page and check if we are full. If so * force a flush. */ - batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap); + batch->encoded_pages[batch->nr++] = encode_page(page, flags); if (batch->nr == batch->max) { if (!tlb_next_batch(tlb)) return true;
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 4d5bf0b6183f79ea361dd506365d2a471270735c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Let's add a helper that lets us batch-process multiple consecutive PTEs.
Note that the loop will get optimized out on all architectures except on powerpc. We have to add an early define of __tlb_remove_tlb_entry() on ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a macro).
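For illustration only, a standalone C sketch of the pattern the new helper follows -- extend one pending flush range to cover all nr pages, then run the per-entry hook for each PTE. The struct and function names below are invented and do not mirror the kernel API:

#include <stdio.h>

#define PAGE_SIZE 4096ul

struct gather {
	unsigned long start, end;   /* pending flush range */
	unsigned int entries;       /* how many per-entry hooks ran */
};

static void flush_pte_range(struct gather *g, unsigned long addr, unsigned long size)
{
	if (g->start > addr)
		g->start = addr;
	if (g->end < addr + size)
		g->end = addr + size;
}

static void remove_tlb_entry_hook(struct gather *g, unsigned long addr)
{
	(void)addr;
	g->entries++;               /* an arch hook would record this PTE */
}

static void remove_tlb_entries(struct gather *g, unsigned int nr, unsigned long addr)
{
	/* One range covering all nr pages ... */
	flush_pte_range(g, addr, PAGE_SIZE * nr);
	/* ... then the per-entry hook for each PTE. */
	for (;;) {
		remove_tlb_entry_hook(g, addr);
		if (--nr == 0)
			break;
		addr += PAGE_SIZE;
	}
}

int main(void)
{
	struct gather g = { .start = ~0ul, .end = 0, .entries = 0 };

	remove_tlb_entries(&g, 4, 0x10000);
	printf("flush [%#lx, %#lx), %u entries\n", g.start, g.end, g.entries);
	return 0;
}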
[arnd@kernel.org: change __tlb_remove_tlb_entry() to an inline function] Link: https://lkml.kernel.org/r/20240221154549.2026073-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20240214204435.167852-8-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Signed-off-by: Arnd Bergmann arnd@arndb.de Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 4d5bf0b6183f79ea361dd506365d2a471270735c) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/powerpc/include/asm/tlb.h | 2 ++ include/asm-generic/tlb.h | 24 +++++++++++++++++++++++- 2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index b3de6102a907..1ca7d4c4b90d 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -19,6 +19,8 @@
#include <linux/pagemap.h>
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, + unsigned long address); #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry
#define tlb_flush tlb_flush diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index c092b54d06b9..50ddc0ed7ff0 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -595,7 +595,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb, }
#ifndef __tlb_remove_tlb_entry -#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0) +static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, unsigned long address) +{ +} #endif
/** @@ -611,6 +613,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb, __tlb_remove_tlb_entry(tlb, ptep, address); \ } while (0)
+/** + * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for + * later tlb invalidation. + * + * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple + * consecutive ptes instead of only a single one. + */ +static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb, + pte_t *ptep, unsigned int nr, unsigned long address) +{ + tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr); + for (;;) { + __tlb_remove_tlb_entry(tlb, ptep, address); + if (--nr == 0) + break; + ptep++; + address += PAGE_SIZE; + } +} + #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \ do { \ unsigned long _sz = huge_page_size(h); \
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit d7f861b9c43aadbe384ab1382d2e76750bedc91e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Add __tlb_remove_folio_pages(), which will remove multiple consecutive pages that belong to the same large folio, instead of only a single page. We'll be using this function when optimizing unmapping/zapping of large folios that are mapped by PTEs.
We're using the remaining spare bit in an encoded_page to indicate that the next encoded page in an array actually contains a shifted "nr_pages". Teach the swap/freeing code about putting multiple folio references, and the delayed rmap handling about removing page ranges of a folio.
This extension allows for still gathering almost as many small folios as we used to (-1, because we have to prepare for a possibly bigger next entry), but still allows for gathering consecutive pages that belong to the same large folio.
Note that we don't pass the folio pointer, because it is not required for now. Further, we don't support page_size != PAGE_SIZE; it won't be required for simple PTE batching.
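A minimal userspace sketch of the array encoding described above, using invented names (enc_t, BIT_NR_NEXT, fake_page); it only demonstrates the "flag bit means the next slot holds nr_pages" convention, not the real mmu_gather code:

#include <stdint.h>
#include <stdio.h>

#define TAG_MASK    3ul
#define BIT_NR_NEXT 2ul   /* "the next array slot holds nr_pages << 2" */

typedef uintptr_t enc_t;

struct fake_page { int id; };

static enc_t enc_page(struct fake_page *p, unsigned long flags)
{
	return (uintptr_t)p | flags;
}

static enc_t enc_nr(unsigned long nr)
{
	return nr << 2;
}

static void walk(const enc_t *slots, unsigned int n)
{
	for (unsigned int i = 0; i < n; i++) {
		struct fake_page *p = (struct fake_page *)(slots[i] & ~TAG_MASK);
		unsigned long nr = 1;

		if (slots[i] & BIT_NR_NEXT)
			nr = slots[++i] >> 2;   /* consume the extra slot */
		printf("page %d spans %lu page(s)\n", p->id, nr);
	}
}

int main(void)
{
	struct fake_page a = { 1 }, b = { 2 };
	enc_t slots[3] = {
		enc_page(&a, 0),              /* single small-folio page */
		enc_page(&b, BIT_NR_NEXT),    /* first page of a large folio... */
		enc_nr(16),                   /* ...spanning 16 pages */
	};

	walk(slots, 3);
	return 0;
}

This also makes the "-1" above concrete: a batch must always keep one slot free, in case the next entry turns out to need a page slot plus an nr_pages slot.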
We have to provide a separate s390 implementation, but it's fairly straightforward.
Another, more invasive and likely more expensive, approach would be to use folio+range or a PFN range instead of page+nr_pages. But, we should do that consistently for the whole mmu_gather. For now, let's keep it simple and add "nr_pages" only.
Note that it is now possible to gather significantly more pages: in the past, we were able to gather ~10000 pages; now we can also gather ~5000 folio fragments that span multiple pages, since each fragment consumes two encoded slots. A folio fragment on x86-64 can span up to 512 pages (2 MiB THP) and on arm64 with 64k, in theory, 8192 pages (512 MiB THP). Gathering more memory is not considered something we should worry about, especially because these are already corner cases.
While we can gather more total memory, we won't free more folio fragments. As long as page freeing time primarily only depends on the number of involved folios, there is no effective change for !preempt configurations. However, we'll adjust tlb_batch_pages_flush() separately to handle corner cases where page freeing time grows proportionally with the actual memory size.
Link: https://lkml.kernel.org/r/20240214204435.167852-9-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit d7f861b9c43aadbe384ab1382d2e76750bedc91e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- arch/s390/include/asm/tlb.h | 17 +++++++++++ include/asm-generic/tlb.h | 8 +++++ include/linux/mm_types.h | 20 ++++++++++++ mm/mmu_gather.c | 61 +++++++++++++++++++++++++++++++------ mm/swap.c | 12 ++++++-- mm/swap_state.c | 15 +++++++-- 6 files changed, 119 insertions(+), 14 deletions(-)
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h index 1eb1df478e0c..b76c8f028bad 100644 --- a/arch/s390/include/asm/tlb.h +++ b/arch/s390/include/asm/tlb.h @@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table); static inline void tlb_flush(struct mmu_gather *tlb); static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, bool delay_rmap, int page_size); +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb, + struct page *page, unsigned int nr_pages, bool delay_rmap);
#define tlb_flush tlb_flush #define pte_free_tlb pte_free_tlb @@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb, return false; }
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb, + struct page *page, unsigned int nr_pages, bool delay_rmap) +{ + struct encoded_page *encoded_pages[] = { + encode_page(page, ENCODED_PAGE_BIT_NR_PAGES_NEXT), + encode_nr_pages(nr_pages), + }; + + VM_WARN_ON_ONCE(delay_rmap); + VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1)); + + free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages)); + return false; +} + static inline void tlb_flush(struct mmu_gather *tlb) { __tlb_flush_mm_lazy(tlb->mm); diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 50ddc0ed7ff0..22384baee10e 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -70,6 +70,7 @@ * * - tlb_remove_page() / __tlb_remove_page() * - tlb_remove_page_size() / __tlb_remove_page_size() + * - __tlb_remove_folio_pages() * * __tlb_remove_page_size() is the basic primitive that queues a page for * freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a @@ -79,6 +80,11 @@ * tlb_remove_page() and tlb_remove_page_size() imply the call to * tlb_flush_mmu() when required and has no return value. * + * __tlb_remove_folio_pages() is similar to __tlb_remove_page(), however, + * instead of removing a single page, remove the given number of consecutive + * pages that are all part of the same (large) folio: just like calling + * __tlb_remove_page() on each page individually. + * * - tlb_change_page_size() * * call before __tlb_remove_page*() to set the current page-size; implies a @@ -263,6 +269,8 @@ struct mmu_gather_batch {
extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, bool delay_rmap, int page_size); +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page, + unsigned int nr_pages, bool delay_rmap);
#ifdef CONFIG_SMP /* diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 1aee23c28330..aa17e8c500ce 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -240,6 +240,15 @@ struct encoded_page; /* Perform rmap removal after we have flushed the TLB. */ #define ENCODED_PAGE_BIT_DELAY_RMAP 1ul
+/* + * The next item in an encoded_page array is the "nr_pages" argument, specifying + * the number of consecutive pages starting from this page, that all belong to + * the same folio. For example, "nr_pages" corresponds to the number of folio + * references that must be dropped. If this bit is not set, "nr_pages" is + * implicitly 1. + */ +#define ENCODED_PAGE_BIT_NR_PAGES_NEXT 2ul + static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags) { BUILD_BUG_ON(flags > ENCODED_PAGE_BITS); @@ -256,6 +265,17 @@ static inline struct page *encoded_page_ptr(struct encoded_page *page) return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page); }
+static __always_inline struct encoded_page *encode_nr_pages(unsigned long nr) +{ + VM_WARN_ON_ONCE((nr << 2) >> 2 != nr); + return (struct encoded_page *)(nr << 2); +} + +static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page) +{ + return ((unsigned long)page) >> 2; +} + /* * A swap entry has to fit into a "unsigned long", as the entry is hidden * in the "index" field of the swapper address space. diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index 6540c99c6758..d175c0f1e2c8 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -50,12 +50,21 @@ static bool tlb_next_batch(struct mmu_gather *tlb) #ifdef CONFIG_SMP static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma) { + struct encoded_page **pages = batch->encoded_pages; + for (int i = 0; i < batch->nr; i++) { - struct encoded_page *enc = batch->encoded_pages[i]; + struct encoded_page *enc = pages[i];
if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) { struct page *page = encoded_page_ptr(enc); - folio_remove_rmap_pte(page_folio(page), page, vma); + unsigned int nr_pages = 1; + + if (unlikely(encoded_page_flags(enc) & + ENCODED_PAGE_BIT_NR_PAGES_NEXT)) + nr_pages = encoded_nr_pages(pages[++i]); + + folio_remove_rmap_ptes(page_folio(page), page, nr_pages, + vma); } } } @@ -89,18 +98,26 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb) for (batch = &tlb->local; batch && batch->nr; batch = batch->next) { struct encoded_page **pages = batch->encoded_pages;
- do { + while (batch->nr) { /* * limit free batch count when PAGE_SIZE > 4K */ unsigned int nr = min(512U, batch->nr);
+ /* + * Make sure we cover page + nr_pages, and don't leave + * nr_pages behind when capping the number of entries. + */ + if (unlikely(encoded_page_flags(pages[nr - 1]) & + ENCODED_PAGE_BIT_NR_PAGES_NEXT)) + nr++; + free_pages_and_swap_cache(pages, nr); pages += nr; batch->nr -= nr;
cond_resched(); - } while (batch->nr); + } } tlb->active = &tlb->local; } @@ -116,8 +133,9 @@ static void tlb_batch_list_free(struct mmu_gather *tlb) tlb->local.next = NULL; }
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, - bool delay_rmap, int page_size) +static bool __tlb_remove_folio_pages_size(struct mmu_gather *tlb, + struct page *page, unsigned int nr_pages, bool delay_rmap, + int page_size) { int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0; struct mmu_gather_batch *batch; @@ -126,6 +144,8 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
#ifdef CONFIG_MMU_GATHER_PAGE_SIZE VM_WARN_ON(tlb->page_size != page_size); + VM_WARN_ON_ONCE(nr_pages != 1 && page_size != PAGE_SIZE); + VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1)); #endif
batch = tlb->active; @@ -133,17 +153,40 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, * Add the page and check if we are full. If so * force a flush. */ - batch->encoded_pages[batch->nr++] = encode_page(page, flags); - if (batch->nr == batch->max) { + if (likely(nr_pages == 1)) { + batch->encoded_pages[batch->nr++] = encode_page(page, flags); + } else { + flags |= ENCODED_PAGE_BIT_NR_PAGES_NEXT; + batch->encoded_pages[batch->nr++] = encode_page(page, flags); + batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages); + } + /* + * Make sure that we can always add another "page" + "nr_pages", + * requiring two entries instead of only a single one. + */ + if (batch->nr >= batch->max - 1) { if (!tlb_next_batch(tlb)) return true; batch = tlb->active; } - VM_BUG_ON_PAGE(batch->nr > batch->max, page); + VM_BUG_ON_PAGE(batch->nr > batch->max - 1, page);
return false; }
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page, + unsigned int nr_pages, bool delay_rmap) +{ + return __tlb_remove_folio_pages_size(tlb, page, nr_pages, delay_rmap, + PAGE_SIZE); +} + +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, + bool delay_rmap, int page_size) +{ + return __tlb_remove_folio_pages_size(tlb, page, 1, delay_rmap, page_size); +} + #endif /* MMU_GATHER_NO_GATHER */
#ifdef CONFIG_MMU_GATHER_TABLE_FREE diff --git a/mm/swap.c b/mm/swap.c index cd8f0150ba3a..e5380d732c0d 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -967,11 +967,17 @@ void release_pages(release_pages_arg arg, int nr) unsigned int lock_batch;
for (i = 0; i < nr; i++) { + unsigned int nr_refs = 1; struct folio *folio;
/* Turn any of the argument types into a folio */ folio = page_folio(encoded_page_ptr(encoded[i]));
+ /* Is our next entry actually "nr_pages" -> "nr_refs" ? */ + if (unlikely(encoded_page_flags(encoded[i]) & + ENCODED_PAGE_BIT_NR_PAGES_NEXT)) + nr_refs = encoded_nr_pages(encoded[++i]); + /* * Make sure the IRQ-safe lock-holding time does not get * excessive with a continuous string of pages from the @@ -990,14 +996,14 @@ void release_pages(release_pages_arg arg, int nr) unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - if (put_devmap_managed_page(&folio->page)) + if (put_devmap_managed_page_refs(&folio->page, nr_refs)) continue; - if (folio_put_testzero(folio)) + if (folio_ref_sub_and_test(folio, nr_refs)) free_zone_device_page(&folio->page); continue; }
- if (!folio_put_testzero(folio)) + if (!folio_ref_sub_and_test(folio, nr_refs)) continue;
if (folio_test_large(folio)) { diff --git a/mm/swap_state.c b/mm/swap_state.c index ddb3a65e5c6e..d0636532d1ab 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -311,8 +311,19 @@ void free_page_and_swap_cache(struct page *page) void free_pages_and_swap_cache(struct encoded_page **pages, int nr) { lru_add_drain(); - for (int i = 0; i < nr; i++) - free_swap_cache(encoded_page_ptr(pages[i])); + for (int i = 0; i < nr; i++) { + struct page *page = encoded_page_ptr(pages[i]); + + /* + * Skip over the "nr_pages" entry. It's sufficient to call + * free_swap_cache() only once per folio. + */ + if (unlikely(encoded_page_flags(pages[i]) & + ENCODED_PAGE_BIT_NR_PAGES_NEXT)) + i++; + + free_swap_cache(page); + } release_pages(pages, nr); }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit e61abd4490684de379b4a2ef1be2dbde39ac1ced category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now up to 256 folio fragments that span more than one page, before we conditionally reschedule.
It's a pain that we have to handle cond_resched() in tlb_batch_pages_flush() manually and cannot simply handle it in release_pages() -- release_pages() can be called from atomic context. Well, in a perfect world we wouldn't have to make our code more complicated at all.
With page poisoning and init_on_free, we might now run into soft lockups when we free a lot of rather large folio fragments, because page freeing time then depends on the actual memory size we are freeing instead of on the number of folios that are involved.
In the absolute (unlikely) worst case, on arm64 with 64k we will be able to free up to 256 folio fragments that each span 512 MiB: zeroing out 128 GiB does sound like it might take a while. But instead of ignoring this unlikely case, let's just handle it.
So, let's teach tlb_batch_pages_flush() that there are some configurations where page freeing is horribly slow, and let's reschedule more frequently -- similar to what we effectively did before we had large folio fragments in there. Avoid yet another loop over all encoded pages in the common case by handling that case separately.
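A simplified, standalone sketch of the two batching policies (cap by folio count when freeing is cheap, cap by covered page count when page poisoning / init_on_free make freeing expensive); names and numbers are illustrative, not the kernel code:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NR_FOLIOS_PER_FREE 512u

/* Every entry stands for one folio fragment spanning nr_pages pages. */
struct fragment { unsigned int nr_pages; };

static unsigned int pick_batch(const struct fragment *f, unsigned int nr,
			       bool expensive_free)
{
	unsigned int i, pages = 0;

	if (!expensive_free)
		return nr < MAX_NR_FOLIOS_PER_FREE ? nr : MAX_NR_FOLIOS_PER_FREE;

	/* Expensive freeing: stop once we covered ~512 pages worth of memory. */
	for (i = 0; i < nr && pages < MAX_NR_FOLIOS_PER_FREE; i++)
		pages += f[i].nr_pages;
	return i;
}

int main(void)
{
	struct fragment frags[4] = { {1}, {512}, {1}, {512} };

	printf("cheap: free %u fragments per round\n", pick_batch(frags, 4, false));
	printf("expensive: free %u fragments per round\n", pick_batch(frags, 4, true));
	return 0;
}

With this sample input, the cheap path frees all four fragments in one round, while the expensive path stops after two because they already cover more than 512 pages.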
Note that with page poisoning/zeroing, we might now end up freeing only a single folio fragment at a time that might exceed the old 512 pages limit: but if we cannot even free a single MAX_ORDER page on a system without running into soft lockups, something else is already completely bogus. Freeing a PMD-mapped THP would similarly cause trouble.
In theory, we might even free 511 order-0 pages + a single MAX_ORDER page, effectively having to zero out 8703 pages on arm64 with 64k, translating to ~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups, 544 MiB is unlikely to result in soft lockups, so we won't care about that for the time being.
In the future, we might want to detect if handling cond_resched() is required at all, and just not do any of that with full preemption enabled.
Link: https://lkml.kernel.org/r/20240214204435.167852-10-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit e61abd4490684de379b4a2ef1be2dbde39ac1ced) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/mmu_gather.c | 58 ++++++++++++++++++++++++++++++++++++------------- 1 file changed, 43 insertions(+), 15 deletions(-)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c index d175c0f1e2c8..99b3e9408aa0 100644 --- a/mm/mmu_gather.c +++ b/mm/mmu_gather.c @@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) } #endif
-static void tlb_batch_pages_flush(struct mmu_gather *tlb) -{ - struct mmu_gather_batch *batch; +/* + * We might end up freeing a lot of pages. Reschedule on a regular + * basis to avoid soft lockups in configurations without full + * preemption enabled. The magic number of 512 folios seems to work. + */ +#define MAX_NR_FOLIOS_PER_FREE 512
- for (batch = &tlb->local; batch && batch->nr; batch = batch->next) { - struct encoded_page **pages = batch->encoded_pages; +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch) +{ + struct encoded_page **pages = batch->encoded_pages; + unsigned int nr, nr_pages;
- while (batch->nr) { - /* - * limit free batch count when PAGE_SIZE > 4K - */ - unsigned int nr = min(512U, batch->nr); + while (batch->nr) { + if (!page_poisoning_enabled_static() && !want_init_on_free()) { + nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
/* * Make sure we cover page + nr_pages, and don't leave @@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb) if (unlikely(encoded_page_flags(pages[nr - 1]) & ENCODED_PAGE_BIT_NR_PAGES_NEXT)) nr++; + } else { + /* + * With page poisoning and init_on_free, the time it + * takes to free memory grows proportionally with the + * actual memory size. Therefore, limit based on the + * actual memory size and not the number of involved + * folios. + */ + for (nr = 0, nr_pages = 0; + nr < batch->nr && nr_pages < MAX_NR_FOLIOS_PER_FREE; + nr++) { + if (unlikely(encoded_page_flags(pages[nr]) & + ENCODED_PAGE_BIT_NR_PAGES_NEXT)) + nr_pages += encoded_nr_pages(pages[++nr]); + else + nr_pages++; + } + }
- free_pages_and_swap_cache(pages, nr); - pages += nr; - batch->nr -= nr; + free_pages_and_swap_cache(pages, nr); + pages += nr; + batch->nr -= nr;
- cond_resched(); - } + cond_resched(); } +} + +static void tlb_batch_pages_flush(struct mmu_gather *tlb) +{ + struct mmu_gather_batch *batch; + + for (batch = &tlb->local; batch && batch->nr; batch = batch->next) + __tlb_batch_free_encoded_pages(batch); tlb->active = &tlb->local; }
From: David Hildenbrand david@redhat.com
mainline inclusion from mainline-v6.9-rc1 commit 10ebac4f95e7a9951c453d6c66d9beb5a35db338 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Similar to how we optimized fork(), let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio.
Most infrastructure we need for batching (mmu gather, rmap) is already there. We only have to add get_and_clear_full_ptes() and clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to process a PTE range.
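As a rough userspace model of what get_and_clear_full_ptes() does with the dirty/accessed bits (plain structs stand in for real PTEs; this only illustrates the merging, it is not the kernel helper):

#include <stdbool.h>
#include <stdio.h>

struct soft_pte {
	unsigned long pfn;
	bool present, dirty, young;
};

static struct soft_pte get_and_clear(struct soft_pte *p)
{
	struct soft_pte old = *p;

	p->present = p->dirty = p->young = false;
	return old;
}

static struct soft_pte get_and_clear_ptes(struct soft_pte *ptep, unsigned int nr)
{
	struct soft_pte pte = get_and_clear(ptep);

	while (--nr) {
		struct soft_pte tmp = get_and_clear(++ptep);

		/* Collect dirty/accessed information from the whole run. */
		pte.dirty |= tmp.dirty;
		pte.young |= tmp.young;
	}
	return pte;
}

int main(void)
{
	struct soft_pte run[3] = {
		{ .pfn = 100, .present = true },
		{ .pfn = 101, .present = true, .dirty = true },
		{ .pfn = 102, .present = true, .young = true },
	};
	struct soft_pte pte = get_and_clear_ptes(run, 3);

	printf("pfn=%lu dirty=%d young=%d\n", pte.pfn, pte.dirty, pte.young);
	return 0;
}

clear_full_ptes() is the same loop without the merging, for callers that do not care about the collected bits.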
We won't bother sanity-checking the mapcount of all subpages, but only check the mapcount of the first subpage we process. If there is a real problem hiding somewhere, we can trigger it simply by using small folios, or when we zap single pages of a large folio. Ideally, we'd have that check in the rmap code (including for delayed rmap), but then we could not print the PTE. Let's keep it simple for now. If we ever have a cheap folio_mapcount(), we might just want to check for underflows there.
To keep small folios as fast as possible, force inlining of a specialized variant using __always_inline with nr=1.
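A tiny standalone example of that inlining trick, assuming the GCC/Clang always_inline attribute; the function names are made up and only show how a forced-inline nr=1 wrapper lets the compiler drop the loop:

#include <stdio.h>

#define __always_inline inline __attribute__((__always_inline__))

static inline unsigned long clear_entries(unsigned long *p, unsigned int nr)
{
	unsigned long old = 0;

	for (unsigned int i = 0; i < nr; i++) {
		old |= p[i];
		p[i] = 0;
	}
	return old;
}

/* Small-folio fast path: the compiler sees nr == 1 and folds the loop away. */
static __always_inline unsigned long clear_entry(unsigned long *p)
{
	return clear_entries(p, 1);
}

int main(void)
{
	unsigned long e[2] = { 0x5, 0x9 };
	unsigned long first = clear_entry(&e[0]);
	unsigned long both = clear_entries(e, 2);   /* e[0] already cleared */

	printf("first=%lx both=%lx\n", first, both);
	return 0;
}

The zap code uses the same idea: the small-folio path calls the shared, always-inlined helper with a compile-time nr of 1 so the batching loop disappears.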
Link: https://lkml.kernel.org/r/20240214204435.167852-11-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Alexander Gordeev agordeev@linux.ibm.com Cc: Aneesh Kumar K.V aneesh.kumar@linux.ibm.com Cc: Arnd Bergmann arnd@arndb.de Cc: Catalin Marinas catalin.marinas@arm.com Cc: Christian Borntraeger borntraeger@linux.ibm.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Heiko Carstens hca@linux.ibm.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Michal Hocko mhocko@suse.com Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com Cc: Nicholas Piggin npiggin@gmail.com Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Sven Schnelle svens@linux.ibm.com Cc: Vasily Gorbik gor@linux.ibm.com Cc: Will Deacon will@kernel.org Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 10ebac4f95e7a9951c453d6c66d9beb5a35db338) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Conflicts: mm/memory.c --- include/linux/pgtable.h | 70 ++++++++++++++++++++++++++++++ mm/memory.c | 94 +++++++++++++++++++++++++++++------------ 2 files changed, 137 insertions(+), 27 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 28d59a6da257..8b9e4afe2e35 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -552,6 +552,76 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, } #endif
+#ifndef get_and_clear_full_ptes +/** + * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of + * the same folio, collecting dirty/accessed bits. + * @mm: Address space the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to clear. + * @full: Whether we are clearing a full mm. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the + * returned PTE. + * + * Note that PTE bits in the PTE range besides the PFN can differ. For example, + * some PTEs might be write-protected. + * + * Context: The caller holds the page table lock. The PTEs map consecutive + * pages that belong to the same folio. The PTEs are all in the same PMD. + */ +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm, + unsigned long addr, pte_t *ptep, unsigned int nr, int full) +{ + pte_t pte, tmp_pte; + + pte = ptep_get_and_clear_full(mm, addr, ptep, full); + while (--nr) { + ptep++; + addr += PAGE_SIZE; + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full); + if (pte_dirty(tmp_pte)) + pte = pte_mkdirty(pte); + if (pte_young(tmp_pte)) + pte = pte_mkyoung(pte); + } + return pte; +} +#endif + +#ifndef clear_full_ptes +/** + * clear_full_ptes - Clear present PTEs that map consecutive pages of the same + * folio. + * @mm: Address space the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to clear. + * @full: Whether we are clearing a full mm. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over ptep_get_and_clear_full(). + * + * Note that PTE bits in the PTE range besides the PFN can differ. For example, + * some PTEs might be write-protected. + * + * Context: The caller holds the page table lock. The PTEs map consecutive + * pages that belong to the same folio. The PTEs are all in the same PMD. + */ +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, unsigned int nr, int full) +{ + for (;;) { + ptep_get_and_clear_full(mm, addr, ptep, full); + if (--nr == 0) + break; + ptep++; + addr += PAGE_SIZE; + } +} +#endif
/* * If two threads concurrently fault at the same page, the thread that diff --git a/mm/memory.c b/mm/memory.c index e4208f5302fe..fa4122b8b9f3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1518,7 +1518,7 @@ static inline bool zap_drop_file_uffd_wp(struct zap_details *details) */ static inline void zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, - unsigned long addr, pte_t *pte, + unsigned long addr, pte_t *pte, int nr, struct zap_details *details, pte_t pteval) { /* Zap on anonymous always means dropping everything */ @@ -1528,20 +1528,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (zap_drop_file_uffd_wp(details)) return;
- pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); + for (;;) { + /* the PFN in the PTE is irrelevant. */ + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval); + if (--nr == 0) + break; + pte++; + addr += PAGE_SIZE; + } }
-static inline void zap_present_folio_pte(struct mmu_gather *tlb, +static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, struct folio *folio, - struct page *page, pte_t *pte, pte_t ptent, unsigned long addr, - struct zap_details *details, int *rss, bool *force_flush, - bool *force_break) + struct page *page, pte_t *pte, pte_t ptent, unsigned int nr, + unsigned long addr, struct zap_details *details, int *rss, + bool *force_flush, bool *force_break) { struct mm_struct *mm = tlb->mm; bool delay_rmap = false;
if (!folio_test_anon(folio)) { - ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); + ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm); if (pte_dirty(ptent)) { folio_mark_dirty(folio); if (tlb_delay_rmap(tlb)) { @@ -1551,38 +1558,51 @@ static inline void zap_present_folio_pte(struct mmu_gather *tlb, } if (pte_young(ptent) && likely(vma_has_recency(vma))) folio_mark_accessed(folio); - rss[mm_counter(folio)]--; - add_reliable_page_counter(page, mm, -1); + rss[mm_counter(folio)] -= nr; + add_reliable_page_counter(page, mm, -nr); } else { /* We don't need up-to-date accessed/dirty bits. */ - ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); - rss[MM_ANONPAGES]--; + clear_full_ptes(mm, addr, pte, nr, tlb->fullmm); + rss[MM_ANONPAGES] -= nr; }
+ /* Checking a single PTE in a batch is sufficient. */ arch_check_zapped_pte(vma, ptent); - tlb_remove_tlb_entry(tlb, pte, addr); + tlb_remove_tlb_entries(tlb, pte, nr, addr); if (unlikely(userfaultfd_pte_wp(vma, ptent))) - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, + ptent);
if (!delay_rmap) { - folio_remove_rmap_pte(folio, page, vma); + folio_remove_rmap_ptes(folio, page, nr, vma); + + /* Only sanity-check the first page in a batch. */ if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); } - if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) { + if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) { *force_flush = true; *force_break = true; } }
-static inline void zap_present_pte(struct mmu_gather *tlb, +/* + * Zap or skip at least one present PTE, trying to batch-process subsequent + * PTEs that map consecutive pages of the same folio. + * + * Returns the number of processed (skipped or zapped) PTEs (at least 1). + */ +static inline int zap_present_ptes(struct mmu_gather *tlb, struct vm_area_struct *vma, pte_t *pte, pte_t ptent, - unsigned long addr, struct zap_details *details, - int *rss, bool *force_flush, bool *force_break) + unsigned int max_nr, unsigned long addr, + struct zap_details *details, int *rss, bool *force_flush, + bool *force_break) { + const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY; struct mm_struct *mm = tlb->mm; struct folio *folio; struct page *page; + int nr;
page = vm_normal_page(vma, addr, ptent); if (!page) { @@ -1592,14 +1612,29 @@ static inline void zap_present_pte(struct mmu_gather *tlb, tlb_remove_tlb_entry(tlb, pte, addr); VM_WARN_ON_ONCE(userfaultfd_wp(vma)); ksm_might_unmap_zero_page(mm, ptent); - return; + return 1; }
folio = page_folio(page); if (unlikely(!should_zap_folio(details, folio))) - return; - zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details, - rss, force_flush, force_break); + return 1; + + /* + * Make sure that the common "small folio" case is as fast as possible + * by keeping the batching logic separate. + */ + if (unlikely(folio_test_large(folio) && max_nr != 1)) { + nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags, + NULL); + + zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr, + addr, details, rss, force_flush, + force_break); + return nr; + } + zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr, + details, rss, force_flush, force_break); + return 1; }
static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -1614,6 +1649,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pte_t *start_pte; pte_t *pte; swp_entry_t entry; + int nr;
tlb_change_page_size(tlb, PAGE_SIZE); init_rss_vec(rss); @@ -1627,7 +1663,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pte_t ptent = ptep_get(pte); struct folio *folio; struct page *page; + int max_nr;
+ nr = 1; if (pte_none(ptent)) continue;
@@ -1635,10 +1673,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, break;
if (pte_present(ptent)) { - zap_present_pte(tlb, vma, pte, ptent, addr, details, - rss, &force_flush, &force_break); + max_nr = (end - addr) / PAGE_SIZE; + nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr, + addr, details, rss, &force_flush, + &force_break); if (unlikely(force_break)) { - addr += PAGE_SIZE; + addr += nr * PAGE_SIZE; break; } continue; @@ -1695,8 +1735,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, WARN_ON_ONCE(1); } pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent); - } while (pte++, addr += PAGE_SIZE, addr != end); + zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); + } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
add_mm_rss_vec(mm, rss); arch_leave_lazy_mmu_mode();
From: Peter Xu peterx@redhat.com
mainline inclusion from mainline-v6.9-rc2 commit f8572367eaff6739e3bc238ba93b86cd7881c0ff category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHB4 CVE: NA
-------------------------------------------------
Commit 0cf18e839f64 of the large folio zap work broke uffd-wp: mm's uffd unit test "wp-unpopulated" now triggers the WARN_ON_ONCE() in zap_present_ptes().
The WARN_ON_ONCE() asserts that a VMA cannot be registered with userfaultfd-wp if it contains a !normal page, but that is actually possible. One example is an anonymous vma: register it with uffd-wp, then read anywhere in it to install a zero page; zapping that page should then trigger the warning.
What's more, removing that WARN_ON_ONCE() may not be enough either, because we should also not rely on "whether it's a normal page" to decide whether a pte marker is needed. For example, one can register wr-protect over some DAX regions to track writes when UFFD_FEATURE_WP_ASYNC is enabled, in which case a devmap PTE can have page==NULL but we may still want to keep the marker around.
Link: https://lkml.kernel.org/r/20240313213107.235067-1-peterx@redhat.com Fixes: 0cf18e839f64 ("mm/memory: handle !page case in zap_present_pte() separately") Signed-off-by: Peter Xu peterx@redhat.com Acked-by: David Hildenbrand david@redhat.com Cc: Muhammad Usama Anjum usama.anjum@collabora.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit f8572367eaff6739e3bc238ba93b86cd7881c0ff) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/memory.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c index fa4122b8b9f3..6a81a75f3884 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1610,7 +1610,9 @@ static inline int zap_present_ptes(struct mmu_gather *tlb, ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entry(tlb, pte, addr); - VM_WARN_ON_ONCE(userfaultfd_wp(vma)); + if (userfaultfd_pte_wp(vma, ptent)) + zap_install_uffd_wp_if_needed(vma, addr, pte, 1, + details, ptent); ksm_might_unmap_zero_page(mm, ptent); return 1; }
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/5656 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/4...