Folio feature:
1. Patch series "Split a folio to any lower order folios", v5.
2. arm64: mm: swap: support THP_SWAP on hardware with MTE
3. Patch series "support multi-size THP numa balancing", v2.
4. Patch series "Swap-out mTHP without splitting", v7.
Other improvements and bugfixes:
1. mm: fix draining remote pageset
2. Patch series "Address some contpte nits".
3. Patch series "mm: page_alloc: fixes for high atomic reserve calculations", v3.
4. Patch series "Fix I/O high when memory almost met memcg limit", v2.
5. mm, oom: dump_tasks add rss detailed information printing
6. mm: ratelimit stat flush from workingset shrinker
7. mm: madvise: pageout: ignore references rather than clearing young
8. mm: use memalloc_nofs_save() in page_cache_ra_order()
9. mm/vmalloc: fix return value of vb_alloc if size is 0
10. mm: remove struct page from get_shadow_from_swap_cache
Baolin Wang (2):
      mm: factor out the numa mapping rebuilding into a new helper
      mm: support multi-size THP numa balancing

Barry Song (5):
      mm: madvise: pageout: ignore references rather than clearing young
      arm64: mm: swap: support THP_SWAP on hardware with MTE
      madvise:madvise_cold_or_pageout_pte_range(): allow split while folio_estimated_sharers = 0
      mm: hold PTL from the first PTE while reclaiming a large folio
      mm: alloc_anon_folio: avoid doing vma_thp_gfp_mask in fallback cases

Charan Teja Kalla (2):
      mm: page_alloc: correct high atomic reserve calculations
      mm: page_alloc: enforce minimum zone size to do high atomic reserves

David Hildenbrand (1):
      mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()

Hailong.Liu (1):
      mm/vmalloc: fix return value of vb_alloc if size is 0

Huang Ying (1):
      mm: fix draining remote pageset

Jiexun Wang (1):
      mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()

John Hubbard (1):
      huge_memory.c: document huge page splitting rules more thoroughly

Kefeng Wang (2):
      mm: use memalloc_nofs_save() in page_cache_ra_order()
      mm: huge_memory: use more folio api in __split_huge_page_tail()

Liu Shixin (2):
      mm/readahead: break read-ahead loop if filemap_add_folio return -ENOMEM
      mm/filemap: don't decrease mmap_miss when folio has workingset flag

Matthew Wilcox (Oracle) (4):
      mm: support order-1 folios in the page cache
      XArray: set the marks correctly when splitting an entry
      mm: remove struct page from get_shadow_from_swap_cache
      mm: remove PageAnonExclusive assertions in unuse_pte()

Muhammad Usama Anjum (2):
      selftests/mm: split_huge_page_test: conform test to TAP format output
      selftests: mm: fix unused and uninitialized variable warning

Ryan Roberts (9):
      arm64/mm: export contpte symbols only to GPL users
      arm64/mm: improve comment in contpte_ptep_get_lockless()
      mm: swap: remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
      mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
      mm: swap: simplify struct percpu_cluster
      mm: swap: update get_swap_pages() to take folio order
      mm: swap: allow storage of all mTHP orders
      mm: vmscan: avoid split during shrink_folio_list()
      mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD

Sergey Senozhatsky (1):
      mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range()

Shakeel Butt (1):
      mm: ratelimit stat flush from workingset shrinker

Yong Wang (1):
      mm, oom:dump_tasks add rss detailed information printing

Zi Yan (9):
      mm/huge_memory: only split PMD mapping when necessary in unmap_folio()
      mm/memcg: use order instead of nr in split_page_memcg()
      mm/page_owner: use order instead of nr in split_page_owner()
      mm: memcg: make memcg huge page split support any order split
      mm: page_owner: add support for splitting to any order in split page_owner
      mm: thp: split huge page to any lower order pages
      mm: huge_memory: enable debugfs to split huge pages to any order
      mm/huge_memory: check new folio order when split a folio
      mm/migrate: split source folio if it is on deferred split list
 arch/arm64/include/asm/pgtable.h           |  19 +-
 arch/arm64/mm/contpte.c                    |  46 +--
 arch/arm64/mm/mteswap.c                    |  45 +++
 fs/proc/etmem_swap.c                       |   2 +-
 include/linux/huge_mm.h                    |  31 +-
 include/linux/memcontrol.h                 |   4 +-
 include/linux/mm.h                         |  48 ++-
 include/linux/page_owner.h                 |  14 +-
 include/linux/pgtable.h                    |  61 +++-
 include/linux/swap.h                       |  40 ++-
 lib/test_xarray.c                          |  14 +-
 lib/xarray.c                               |  23 +-
 mm/damon/paddr.c                           |   2 +-
 mm/etmem.c                                 |   2 +-
 mm/filemap.c                               |  16 +-
 mm/huge_memory.c                           | 206 ++++++++---
 mm/internal.h                              |  93 ++++-
 mm/madvise.c                               | 123 ++++---
 mm/memcontrol.c                            |  10 +-
 mm/memory.c                                | 100 ++++--
 mm/mempolicy.c                             |  14 +-
 mm/migrate.c                               |  31 +-
 mm/mprotect.c                              |   3 +-
 mm/oom_kill.c                              |   7 +-
 mm/page_alloc.c                            |  16 +-
 mm/page_io.c                               |   2 +-
 mm/page_owner.c                            |   6 +-
 mm/readahead.c                             |  15 +-
 mm/shmem.c                                 |   2 +-
 mm/swap_slots.c                            |   8 +-
 mm/swap_state.c                            |   8 +-
 mm/swapfile.c                              | 331 ++++++++++--------
 mm/vmalloc.c                               |   2 +-
 mm/vmscan.c                                |  46 ++-
 mm/vmstat.c                                |   4 +-
 mm/workingset.c                            |   2 +-
 tools/testing/selftests/mm/run_vmtests.sh  |  22 +-
 .../selftests/mm/split_huge_page_test.c    | 323 ++++++++++++-----
 38 files changed, 1207 insertions(+), 534 deletions(-)
From: Huang Ying ying.huang@intel.com
mainline inclusion from mainline-v6.7-rc1 commit fa8c4f9a665bd0cf5a475327aea4fd179e896216 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If there is no memory allocation/freeing in the PCP (Per-CPU Pageset) of a remote zone (zone in remote NUMA node) after some time (3 seconds for now), the pages of the PCP of the remote zone will be drained to avoid memory wastage.
This behavior was introduced in the commit 4ae7c03943fc ("[PATCH] Periodically drain non local pagesets") and the commit 4037d452202e ("Move remote node draining out of slab allocators")
But, after commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8"), the vmstat updater worker which is used to drain the PCP of remote zones may not be re-queued while we are waiting for the timeout (pcp->expire != 0) if there are no vmstat changes on this CPU, for example when the CPU goes idle or runs user-space-only workloads. This may cause the pages of a remote zone to be kept in the PCP of this CPU for a long time, so that page reclaim in the remote zone may be triggered prematurely. This isn't a severe problem in practice, because the PCP of the remote zone will be drained if some memory is allocated/freed again on this CPU, and the PCP will eventually be drained during direct reclaim if necessary.

Anyway, the problem still deserves a fix: guarantee that the vmstat updater worker is always re-queued while we are waiting for the timeout. In effect, this restores the original behavior before commit 7cc36bbddde5.

The bug can be reproduced by allocating/freeing pages from a remote zone and then going idle, as follows; the patch fixes it.
- Run some workloads, use `numactl` to bind CPU to node 0 and memory to node 1. So the PCP of the CPU on node 0 for zone on node 1 will be filled.
- After workloads finish, idle for 60s
- Check /proc/zoneinfo
With the original kernel, the number of pages in the PCP of the CPU on node 0 for the zone on node 1 is non-zero after idling. With the patched kernel, it becomes 0. That is, we avoid keeping pages in the remote PCP while the CPU is idle.
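For verification, a small user-space sketch (it assumes the usual /proc/zoneinfo layout and is not part of the patch) that dumps the per-CPU pageset "count:" lines of the node 1 zones; with the patched kernel they should read 0 after the idle period:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

int main(void)
{
	FILE *f = fopen("/proc/zoneinfo", "r");
	char line[256];
	bool remote = false;

	if (!f) {
		perror("/proc/zoneinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* A zone section starts with e.g. "Node 1, zone   Normal". */
		if (!strncmp(line, "Node", 4))
			remote = (strstr(line, "Node 1,") != NULL);
		/* Per-CPU pageset counters are reported on "count:" lines. */
		if (remote && strstr(line, "count:"))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}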
Link: https://lkml.kernel.org/r/20231007062356.187621-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20230811090819.60845-1-ying.huang@intel.com Fixes: 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8") Signed-off-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: Christoph Lameter cl@linux.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Vlastimil Babka vbabka@suse.cz Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/vmstat.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6bed5bcb8208..9a0c40d804e6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -855,8 +855,10 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 				continue;
 			}
 
-			if (__this_cpu_dec_return(pcp->expire))
+			if (__this_cpu_dec_return(pcp->expire)) {
+				changes++;
 				continue;
+			}
 
 			if (__this_cpu_read(pcp->count)) {
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
From: Charan Teja Kalla quic_charante@quicinc.com
mainline inclusion from mainline-v6.8-rc1 commit d68e39fc45f70e35eb74df2128d315c1d91e4dc4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: page_alloc: fixes for high atomic reserve caluculations", v3.
The state of the system where the issue was exposed, as shown in the oom kill logs:
[  295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB local_pcp:492kB free_cma:0kB
[  295.998656] lowmem_reserve[]: 0 32
[  295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH) 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB
From the above, it is seen that ~16MB of memory is reserved for high atomic reserves, against the expectation of 1% reserves; this is fixed in the 1st patch.
Don't reserve the high atomic page blocks if 1% of zone memory size is below a pageblock size.
This patch (of 2):
reserve_highatomic_pageblock() aims to reserve 1% of the managed pages of a zone, to be used for high-order atomic allocations.

It uses the below calculation to reserve:

static void reserve_highatomic_pageblock(struct page *page, ....)
{
   .......
   max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;

   if (zone->nr_reserved_highatomic >= max_managed)
       goto out;

   zone->nr_reserved_highatomic += pageblock_nr_pages;
   set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
   move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);

out:
   ....
}
Since 1% of the zone managed pages count is always added to pageblock_nr_pages, the minimum effectively turns into 2 pageblocks, because nr_reserved_highatomic is incremented/decremented in pageblock-sized steps.
Encountered a system (actually a VM running on the Linux kernel) with the below zone configuration:

Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB
The existing calculation makes it reserve 8MB (with a pageblock size of 4MB), i.e. ~16% of the zone's managed memory. Reserving such a large amount of memory can easily exert memory pressure on the system and may lead to unnecessary reclaim until the high atomic reserves are unreserved.
Since high atomic reserves are managed in pageblock-size granules (MIGRATE_HIGHATOMIC is set per pageblock), fix the calculation so that the minimum is one pageblock and the maximum is approximately 1% of the zone's managed pages.
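As a worked example (a user-space sketch, not kernel code), plugging the zone above into the old and new formulas with 4KB base pages and the 4MB pageblock mentioned earlier (pageblock_nr_pages = 1024):

#include <stdio.h>

#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long managed = 49224 / 4;		/* managed:49224kB -> 12306 pages */
	unsigned long pageblock_nr_pages = 1024;	/* 4MB / 4KB */

	/*
	 * Old formula: 123 + 1024 = 1147 pages, which lets a second
	 * pageblock be reserved, i.e. 8MB (~16% of the zone).
	 */
	printf("old max_managed = %lu pages\n",
	       managed / 100 + pageblock_nr_pages);

	/*
	 * New formula: ALIGN(123, 1024) = 1024 pages, i.e. at most one
	 * 4MB pageblock.
	 */
	printf("new max_managed = %lu pages\n",
	       ALIGN(managed / 100, pageblock_nr_pages));
	return 0;
}

The next patch in the series goes further and skips the reservation entirely when 1% of the zone (123 pages here) is below a pageblock.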
Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.170082141... Signed-off-by: Charan Teja Kalla quic_charante@quicinc.com Acked-by: Mel Gorman mgorman@techsingularity.net Acked-by: David Rientjes rientjes@google.com Cc: David Hildenbrand david@redhat.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@suse.com Cc: Pavankumar Kondeti quic_pkondeti@quicinc.com Cc: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e92509d4a5d6..e911f79473bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1912,10 +1912,11 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	unsigned long max_managed, flags;
 
 	/*
-	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
+	 * The number reserved as: minimum is 1 pageblock, maximum is
+	 * roughly 1% of a zone.
 	 * Check is race-prone but harmless.
 	 */
-	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
+	max_managed = ALIGN((zone_managed_pages(zone) / 100), pageblock_nr_pages);
 	if (zone->nr_reserved_highatomic >= max_managed)
 		return;
From: Charan Teja Kalla quic_charante@quicinc.com
mainline inclusion from mainline-v6.8-rc1 commit 9cd20f3fe045af95a8fe7a12328b21bfd2f3b8bf category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Highatomic reserves are set to roughly 1% of the zone at maximum and one pageblock size at minimum. Encountered a system with the below configuration:

Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB
On such systems, even a single reserved pageblock makes the highatomic reserves ~8% of the zone memory (a 4MB pageblock out of ~48MB managed). This high value can easily exert pressure on the zone.
Per discussion with Michal and Mel, it is not very useful to reserve memory for highatomic allocations on such small systems [1]. Since the minimum size for high atomic reserves is always going to be one pageblock, if 1% of the zone's managed pages falls below a pageblock size, don't reserve memory for high atomic allocations at all. Thanks to Michal for this suggestion [2].
Since no memory is then being reserved for high atomic allocations, this patch can be reverted if the respective allocation failures are seen.
[1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingula... [2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/
Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.170082141... Signed-off-by: Charan Teja Kalla quic_charante@quicinc.com Acked-by: David Rientjes rientjes@google.com Cc: David Hildenbrand david@redhat.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Pavankumar Kondeti quic_pkondeti@quicinc.com Cc: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e911f79473bb..3f8c134adb5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1913,9 +1913,12 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 
 	/*
 	 * The number reserved as: minimum is 1 pageblock, maximum is
-	 * roughly 1% of a zone.
+	 * roughly 1% of a zone. But if 1% of a zone falls below a
+	 * pageblock size, then don't reserve any pageblocks.
 	 * Check is race-prone but harmless.
 	 */
+	if ((zone_managed_pages(zone) / 100) < pageblock_nr_pages)
+		return;
 	max_managed = ALIGN((zone_managed_pages(zone) / 100), pageblock_nr_pages);
 	if (zone->nr_reserved_highatomic >= max_managed)
 		return;
From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-v6.9-rc1 commit 912609e96cd728766373d84903f12a6d836de518 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "Address some contpte nits".
These 2 patches address some nits raised by Catalin late in the review cycle for my contpte series [1].
[1] https://lore.kernel.org/linux-mm/20240215103205.2607016-1-ryan.roberts@arm.c...
This patch (of 2):
The contpte symbols must be exported since some of the public inline ptep_* APIs are called from modules and these inlines now call the contpte functions. Originally they were exported as EXPORT_SYMBOL() for fear of breaking out-of-tree modules. But we subsequently concluded that EXPORT_SYMBOL_GPL() should be safe since these functions are deeply core mm routines, and any module operating at this level is not going to be able to survive on EXPORT_SYMBOL alone.
Link: https://lkml.kernel.org/r/20240226120321.1055731-1-ryan.roberts@arm.com Link: https://lore.kernel.org/linux-mm/f9fc2b31-11cb-4969-8961-9c89fea41b74@nvidia... Link: https://lkml.kernel.org/r/20240226120321.1055731-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Acked-by: David Hildenbrand david@redhat.com Acked-by: Catalin Marinas catalin.marinas@arm.com Cc: John Hubbard jhubbard@nvidia.com Cc: Mark Rutland mark.rutland@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- arch/arm64/mm/contpte.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c index 16788f07716d..be0a226c4ff9 100644 --- a/arch/arm64/mm/contpte.c +++ b/arch/arm64/mm/contpte.c @@ -135,7 +135,7 @@ void __contpte_try_fold(struct mm_struct *mm, unsigned long addr, pte = pte_mkcont(pte); contpte_convert(mm, addr, orig_ptep, pte); } -EXPORT_SYMBOL(__contpte_try_fold); +EXPORT_SYMBOL_GPL(__contpte_try_fold);
void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) @@ -150,7 +150,7 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, pte = pte_mknoncont(pte); contpte_convert(mm, addr, ptep, pte); } -EXPORT_SYMBOL(__contpte_try_unfold); +EXPORT_SYMBOL_GPL(__contpte_try_unfold);
pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte) { @@ -178,7 +178,7 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
return orig_pte; } -EXPORT_SYMBOL(contpte_ptep_get); +EXPORT_SYMBOL_GPL(contpte_ptep_get);
pte_t contpte_ptep_get_lockless(pte_t *orig_ptep) { @@ -231,7 +231,7 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
return orig_pte; } -EXPORT_SYMBOL(contpte_ptep_get_lockless); +EXPORT_SYMBOL_GPL(contpte_ptep_get_lockless);
void contpte_set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte, unsigned int nr) @@ -274,7 +274,7 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
} while (addr != end); } -EXPORT_SYMBOL(contpte_set_ptes); +EXPORT_SYMBOL_GPL(contpte_set_ptes);
void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned int nr, int full) @@ -282,7 +282,7 @@ void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr, contpte_try_unfold_partial(mm, addr, ptep, nr); __clear_full_ptes(mm, addr, ptep, nr, full); } -EXPORT_SYMBOL(contpte_clear_full_ptes); +EXPORT_SYMBOL_GPL(contpte_clear_full_ptes);
pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, @@ -291,7 +291,7 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm, contpte_try_unfold_partial(mm, addr, ptep, nr); return __get_and_clear_full_ptes(mm, addr, ptep, nr, full); } -EXPORT_SYMBOL(contpte_get_and_clear_full_ptes); +EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) @@ -316,7 +316,7 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
return young; } -EXPORT_SYMBOL(contpte_ptep_test_and_clear_young); +EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
int contpte_ptep_clear_flush_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) @@ -337,7 +337,7 @@ int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
return young; } -EXPORT_SYMBOL(contpte_ptep_clear_flush_young); +EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned int nr) @@ -355,7 +355,7 @@ void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr, contpte_try_unfold_partial(mm, addr, ptep, nr); __wrprotect_ptes(mm, addr, ptep, nr); } -EXPORT_SYMBOL(contpte_wrprotect_ptes); +EXPORT_SYMBOL_GPL(contpte_wrprotect_ptes);
int contpte_ptep_set_access_flags(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, @@ -401,4 +401,4 @@ int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
return 1; } -EXPORT_SYMBOL(contpte_ptep_set_access_flags); +EXPORT_SYMBOL_GPL(contpte_ptep_set_access_flags);
From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-v6.9-rc1 commit 94c18d5f7e0d612ce3fb9cb4aa8cfb1308d57a0a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Make clear the atomicity/consistency requirements of the API and how we achieve them.
Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/ Link: https://lkml.kernel.org/r/20240226120321.1055731-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Acked-by: David Hildenbrand david@redhat.com Reviewed-by: Catalin Marinas catalin.marinas@arm.com Cc: John Hubbard jhubbard@nvidia.com Cc: Mark Rutland mark.rutland@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- arch/arm64/mm/contpte.c | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c index be0a226c4ff9..1b64b4c3f8bf 100644 --- a/arch/arm64/mm/contpte.c +++ b/arch/arm64/mm/contpte.c @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get); pte_t contpte_ptep_get_lockless(pte_t *orig_ptep) { /* - * Gather access/dirty bits, which may be populated in any of the ptes - * of the contig range. We may not be holding the PTL, so any contiguous - * range may be unfolded/modified/refolded under our feet. Therefore we - * ensure we read a _consistent_ contpte range by checking that all ptes - * in the range are valid and have CONT_PTE set, that all pfns are - * contiguous and that all pgprots are the same (ignoring access/dirty). - * If we find a pte that is not consistent, then we must be racing with - * an update so start again. If the target pte does not have CONT_PTE - * set then that is considered consistent on its own because it is not - * part of a contpte range. + * The ptep_get_lockless() API requires us to read and return *orig_ptep + * so that it is self-consistent, without the PTL held, so we may be + * racing with other threads modifying the pte. Usually a READ_ONCE() + * would suffice, but for the contpte case, we also need to gather the + * access and dirty bits from across all ptes in the contiguous block, + * and we can't read all of those neighbouring ptes atomically, so any + * contiguous range may be unfolded/modified/refolded under our feet. + * Therefore we ensure we read a _consistent_ contpte range by checking + * that all ptes in the range are valid and have CONT_PTE set, that all + * pfns are contiguous and that all pgprots are the same (ignoring + * access/dirty). If we find a pte that is not consistent, then we must + * be racing with an update so start again. If the target pte does not + * have CONT_PTE set then that is considered consistent on its own + * because it is not part of a contpte range. */
pgprot_t orig_prot;
From: Yong Wang wang.yong12@zte.com.cn
mainline inclusion from mainline-v6.8-rc1 commit 27873192ac5938bfa9d27348d79b931e5b438ba6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When the system is under OOM, it prints out the RSS information of each process, but we don't know the breakdown into rss_anon, rss_file and rss_shmem.

Distinguishing the memory occupied by anonymous mappings, file mappings and shmem can help us identify the root cause of the OOM.

So this patch adds the RSS details, following /proc/<pid>/status [1]. It can help us know more about process memory usage.
Example of oom including the new rss_* fields:

[ 1630.902466] Tasks state (memory values in pages):
[ 1630.902870] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 1630.903619] [ 149] 0 149 486 288 0 288 0 36864 0 0 ash
[ 1630.904210] [ 156] 0 156 153531 153345 153345 0 0 1269760 0 0 mm_test
[1] commit 8cee852ec53f ("mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status").
Link: https://lkml.kernel.org/r/202311231840181856667@zte.com.cn Signed-off-by: Yong Wang wang.yong12@zte.com.cn Reviewed-by: Yang Yang yang.yang29@zte.com.cn Cc: Hugh Dickins hughd@google.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Xuexin Jiang jiang.xuexin@zte.com.cn Cc: Michal Hocko mhocko@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/oom_kill.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 3da29ddfea1a..dc932c583747 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -450,10 +450,11 @@ static int dump_task(struct task_struct *p, void *arg) return 0; }
- pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n", + pr_info("[%7d] %5d %5d %8lu %8lu %8lu %8lu %9lu %8ld %8lu %5hd %s\n", task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - mm_pgtables_bytes(task->mm), + get_mm_counter(task->mm, MM_ANONPAGES), get_mm_counter(task->mm, MM_FILEPAGES), + get_mm_counter(task->mm, MM_SHMEMPAGES), mm_pgtables_bytes(task->mm), get_mm_counter(task->mm, MM_SWAPENTS), task->signal->oom_score_adj, task->comm); task_unlock(task); @@ -474,7 +475,7 @@ static int dump_task(struct task_struct *p, void *arg) static void dump_tasks(struct oom_control *oc) { pr_info("Tasks state (memory values in pages):\n"); - pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n"); + pr_info("[ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name\n");
if (is_memcg_oom(oc)) { mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v6.8-rc1 commit d4a5b369ad6d8aae552752ff438dddde653a72ec category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
One of our workloads (Postgres 14 + sysbench OLTP) regressed on a newer upstream kernel and, on further investigation, it seems the cause is the always-synchronous rstat flush in count_shadow_nodes() added by commit f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats"). On further inspection, it seems we don't really need accurate stats in this function, as it was already approximating the number of appropriate shadow entries to keep for maintaining the refault information. Since there is already a 2-second periodic rstat flush, we don't need exact stats here. Let's ratelimit the rstat flush in this code path.
Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats") Signed-off-by: Shakeel Butt shakeelb@google.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Yosry Ahmed yosryahmed@google.com Cc: Yu Zhao yuzhao@google.com Cc: Michal Hocko mhocko@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Muchun Song songmuchun@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/workingset.c [ Context conflicts with commit 7d7ef0a4686a. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/workingset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 2559a1f2fc1c..9110957bec5b 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats_ratelimited();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
From: Barry Song v-songbaohua@oppo.com
mainline inclusion from mainline-v6.9-rc1 commit 2864f3d0f5831a50253befc5d4583868268b7153 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
While doing MADV_PAGEOUT, the current code clears the PTE young bit so that vmscan won't see the young flag and the reclamation of the madvised folios can go ahead. It seems we can achieve the same by directly ignoring references, which lets us remove the TLB flush in madvise and the rmap overhead in vmscan.

Regarding the side effect: in the original code, if a parallel thread accesses the madvised memory while another thread is doing the madvise, the folios get a chance to be re-activated by vmscan (though the window is quite small since the PTEs are checked immediately after their young bits are cleared). With this patch, they will still be reclaimed. But doing PAGEOUT and accessing the same memory at the same time is a fairly silly, DoS-like behaviour, so we probably don't need to care; arguably, ignoring the new access during that small window is even better.

For DAMON's DAMOS_PAGEOUT based on physical address regions, we keep its behaviour as is, since a physical address might be mapped by multiple processes. MADV_PAGEOUT based on virtual addresses is actually much more aggressive on reclamation. To leave paddr's DAMOS_PAGEOUT untouched, we simply pass ignore_references as false to reclaim_pages().

The microbenchmark below shows a 6% reduction in MADV_PAGEOUT latency:
#include <sys/mman.h>

#define PGSIZE 4096

int main(void)
{
	int i;
#define SIZE 512*1024*1024
	volatile long *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	for (i = 0; i < SIZE/sizeof(long); i += PGSIZE / sizeof(long))
		p[i] = 0x11;

	madvise((void *)p, SIZE, MADV_PAGEOUT);
	return 0;
}
w/o patch                           w/ patch
root@10:~# time ./a.out             root@10:~# time ./a.out
real    0m49.634s                   real    0m46.334s
user    0m0.637s                    user    0m0.648s
sys     0m47.434s                   sys     0m44.265s
Link: https://lkml.kernel.org/r/20240226005739.24350-1-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Acked-by: Minchan Kim minchan@kernel.org Cc: SeongJae Park sj@kernel.org Cc: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/vmscan.c mm/etmem.c include/linux/swap.h fs/proc/etmem_swap.c [ Adapt reclaim_pages() and reclaim_folio_list() used in etmem. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- fs/proc/etmem_swap.c | 2 +- include/linux/swap.h | 5 +++-- mm/damon/paddr.c | 2 +- mm/etmem.c | 2 +- mm/madvise.c | 8 ++++---- mm/vmscan.c | 12 +++++++----- 6 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/fs/proc/etmem_swap.c b/fs/proc/etmem_swap.c index b4a35da9ac3d..20ac09d67d33 100644 --- a/fs/proc/etmem_swap.c +++ b/fs/proc/etmem_swap.c @@ -72,7 +72,7 @@ static ssize_t swap_pages_write(struct file *file, const char __user *buf, }
if (!list_empty(&pagelist)) - reclaim_pages(&pagelist); + reclaim_pages(&pagelist, false);
ret = count; kfree(data_ptr_res); diff --git a/include/linux/swap.h b/include/linux/swap.h index e818f53cbc31..cbcb767b79e3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -420,8 +420,9 @@ extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); extern unsigned int reclaim_folio_list(struct list_head *folio_list, - struct pglist_data *pgdat); -extern unsigned long reclaim_pages(struct list_head *folio_list); + struct pglist_data *pgdat, + bool ignore_references); +extern unsigned long reclaim_pages(struct list_head *folio_list, bool ignore_references);
#define MEMCG_RECLAIM_MAY_SWAP (1 << 1) #define MEMCG_RECLAIM_PROACTIVE (1 << 2) diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 909db25efb35..21d31580d1a4 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -250,7 +250,7 @@ static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s) put_folio: folio_put(folio); } - applied = reclaim_pages(&folio_list); + applied = reclaim_pages(&folio_list, false); cond_resched(); return applied * PAGE_SIZE; } diff --git a/mm/etmem.c b/mm/etmem.c index 5accf8e0bbdf..a1b2db374fdb 100644 --- a/mm/etmem.c +++ b/mm/etmem.c @@ -248,7 +248,7 @@ int do_swapcache_reclaim(unsigned long *swapcache_watermark, /* Reclaim all the swapcache we have scanned */ for_each_node_state(nid, N_MEMORY) { cond_resched(); - reclaim_folio_list(&swapcache_list[nid], NODE_DATA(nid)); + reclaim_folio_list(&swapcache_list[nid], NODE_DATA(nid), false); }
/* Put pack all the pages that are not reclaimed by shrink_folio_list */ diff --git a/mm/madvise.c b/mm/madvise.c index a3c509cf2bc9..2b821024a38e 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -429,7 +429,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, return 0; }
- if (pmd_young(orig_pmd)) { + if (!pageout && pmd_young(orig_pmd)) { pmdp_invalidate(vma, addr, pmd); orig_pmd = pmd_mkold(orig_pmd);
@@ -453,7 +453,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, huge_unlock: spin_unlock(ptl); if (pageout) - reclaim_pages(&folio_list); + reclaim_pages(&folio_list, true); return 0; }
@@ -522,7 +522,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
- if (pte_young(ptent)) { + if (!pageout && pte_young(ptent)) { ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); ptent = pte_mkold(ptent); @@ -556,7 +556,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, pte_unmap_unlock(start_pte, ptl); } if (pageout) - reclaim_pages(&folio_list); + reclaim_pages(&folio_list, true); cond_resched();
return 0; diff --git a/mm/vmscan.c b/mm/vmscan.c index 95a845905624..e1aa8bd796e9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2792,7 +2792,8 @@ static void shrink_active_list(unsigned long nr_to_scan, }
unsigned int reclaim_folio_list(struct list_head *folio_list, - struct pglist_data *pgdat) + struct pglist_data *pgdat, + bool ignore_references) { struct reclaim_stat dummy_stat; unsigned int nr_reclaimed; @@ -2805,7 +2806,7 @@ unsigned int reclaim_folio_list(struct list_head *folio_list, .no_demotion = 1, };
- nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false); + nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, ignore_references); while (!list_empty(folio_list)) { folio = lru_to_folio(folio_list); list_del(&folio->lru); @@ -2815,7 +2816,7 @@ unsigned int reclaim_folio_list(struct list_head *folio_list, return nr_reclaimed; }
-unsigned long reclaim_pages(struct list_head *folio_list) +unsigned long reclaim_pages(struct list_head *folio_list, bool ignore_references) { int nid; unsigned int nr_reclaimed = 0; @@ -2837,11 +2838,12 @@ unsigned long reclaim_pages(struct list_head *folio_list) continue; }
- nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid)); + nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid), + ignore_references); nid = folio_nid(lru_to_folio(folio_list)); } while (!list_empty(folio_list));
- nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid)); + nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid), ignore_references);
memalloc_noreclaim_restore(noreclaim_flag);
From: Kefeng Wang wangkefeng.wang@huawei.com
mainline inclusion from mainline-v6.9 commit 30153e4466647a17eebfced13eede5cbe4290e69 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
See commit f2c817bed58d ("mm: use memalloc_nofs_save in readahead path"): ensure that page_cache_ra_order() does not attempt to reclaim file-backed pages either, or it leads to a deadlock. The issue was found when testing ext4 large folios.
INFO: task DataXceiver for:7494 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:DataXceiver for state:D stack:0 pid:7494 ppid:1 flags:0x00000200
Call trace:
 __switch_to+0x14c/0x240
 __schedule+0x82c/0xdd0
 schedule+0x58/0xf0
 io_schedule+0x24/0xa0
 __folio_lock+0x130/0x300
 migrate_pages_batch+0x378/0x918
 migrate_pages+0x350/0x700
 compact_zone+0x63c/0xb38
 compact_zone_order+0xc0/0x118
 try_to_compact_pages+0xb0/0x280
 __alloc_pages_direct_compact+0x98/0x248
 __alloc_pages+0x510/0x1110
 alloc_pages+0x9c/0x130
 folio_alloc+0x20/0x78
 filemap_alloc_folio+0x8c/0x1b0
 page_cache_ra_order+0x174/0x308
 ondemand_readahead+0x1c8/0x2b8
 page_cache_async_ra+0x68/0xb8
 filemap_readahead.isra.0+0x64/0xa8
 filemap_get_pages+0x3fc/0x5b0
 filemap_splice_read+0xf4/0x280
 ext4_file_splice_read+0x2c/0x48 [ext4]
 vfs_splice_read.part.0+0xa8/0x118
 splice_direct_to_actor+0xbc/0x288
 do_splice_direct+0x9c/0x108
 do_sendfile+0x328/0x468
 __arm64_sys_sendfile64+0x8c/0x148
 invoke_syscall+0x4c/0x118
 el0_svc_common.constprop.0+0xc8/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x4c/0x1f8
 el0t_64_sync_handler+0xc0/0xc8
 el0t_64_sync+0x188/0x190
Link: https://lkml.kernel.org/r/20240426112938.124740-1-wangkefeng.wang@huawei.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Zhang Yi yi.zhang@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/readahead.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/readahead.c b/mm/readahead.c index 63c7320ba464..f79eb574055c 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -493,6 +493,7 @@ void page_cache_ra_order(struct readahead_control *ractl, pgoff_t index = readahead_index(ractl); pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT; pgoff_t mark = index + ra->size - ra->async_size; + unsigned int nofs; int err = 0; gfp_t gfp = readahead_gfp_mask(mapping);
@@ -509,6 +510,8 @@ void page_cache_ra_order(struct readahead_control *ractl, new_order--; }
+ /* See comment in page_cache_ra_unbounded() */ + nofs = memalloc_nofs_save(); filemap_invalidate_lock_shared(mapping);
if (unlikely(!mapping_large_folio_support(mapping))) { @@ -541,6 +544,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
read_pages(ractl); filemap_invalidate_unlock_shared(mapping); + memalloc_nofs_restore(nofs);
/* * If there were already pages in the page cache, then we may have
From: "Hailong.Liu" hailong.liu@oppo.com
mainline inclusion from mainline-v6.9 commit ac0476e8ca2e4125c0886d7d8d418b8e7cb17139 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
vm_map_ram() uses IS_ERR() to validate the return value of vb_alloc(). If vm_map_ram(page, 0, 0) is executed, vb_alloc(0, GFP_KERNEL) returns NULL, which IS_ERR() does not catch, and this eventually leads to a kernel panic in vmap_pages_range_noflush(). To resolve this issue, return ERR_PTR(-EINVAL) if the size is 0.
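As a quick user-space illustration of the original problem (the ERR_PTR()/IS_ERR() macros are re-implemented locally here and vb_alloc_stub() is a stand-in, not the kernel function):

#include <stdio.h>

/* Local copies of the kernel's ERR_PTR()/IS_ERR() idiom. */
#define MAX_ERRNO	4095
#define ERR_PTR(err)	((void *)(long)(err))
#define IS_ERR(ptr)	((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)

static void *vb_alloc_stub(unsigned long size)
{
	if (size == 0)
		return NULL;	/* old behaviour; the fix returns ERR_PTR(-EINVAL) */
	return (void *)0x1000;	/* pretend a mapping was allocated */
}

int main(void)
{
	void *vaddr = vb_alloc_stub(0);

	if (IS_ERR(vaddr))
		printf("error return caught: %ld\n", (long)vaddr);
	else	/* NULL ends up here and is treated as a valid address */
		printf("not an error, vaddr = %p\n", vaddr);
	return 0;
}

Because IS_ERR(NULL) is false, the old NULL return was passed on as if it were a valid mapping, while ERR_PTR(-EINVAL) is caught by the existing check.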
Link: https://lkml.kernel.org/r/20240426024149.21176-1-hailong.liu@oppo.com Reviewed-by: Barry Song baohua@kernel.org Reviewed-by: Uladzislau Rezki (Sony) urezki@gmail.com Signed-off-by: Hailong.Liu hailong.liu@oppo.com Reviewed-by: Christoph Hellwig hch@lst.de Cc: Lorenzo Stoakes lstoakes@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e8a66a0dcb57..e6058942a084 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2694,7 +2694,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask) * get_order(0) returns funny result. Just warn and terminate * early. */ - return NULL; + return ERR_PTR(-EINVAL); } order = get_order(size);
From: Kefeng Wang wangkefeng.wang@huawei.com
mainline inclusion from mainline-v6.8-rc1 commit b75427691f4a1fd3625d1f4c016832b8c75c2326 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Use more folio APIs to save six compound_head() calls in __split_huge_page_tail().
Link: https://lkml.kernel.org/r/20231110033324.2455523-5-wangkefeng.wang@huawei.co... Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org [ Dep-of: c010d47f107f ("mm: thp: split huge page to any lower order pages"). ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/huge_memory.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6c277b55544c..b41f3d607c67 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2725,13 +2725,13 @@ static void __split_huge_page_tail(struct folio *folio, int tail, clear_compound_head(page_tail);
/* Finally unfreeze refcount. Additional reference from page cache. */ - page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) || - PageSwapCache(head))); + page_ref_unfreeze(page_tail, 1 + (!folio_test_anon(folio) || + folio_test_swapcache(folio)));
- if (page_is_young(head)) - set_page_young(page_tail); - if (page_is_idle(head)) - set_page_idle(page_tail); + if (folio_test_young(folio)) + folio_set_young(new_folio); + if (folio_test_idle(folio)) + folio_set_idle(new_folio);
folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
From: Muhammad Usama Anjum usama.anjum@collabora.com
mainline inclusion from mainline-v6.9-rc1 commit 735887041a456152d31099c9a8cc24a2f6fecec3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Conform the layout, informational and status messages to TAP. No functional change is intended other than the layout of output messages.
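For reference, a rough sketch of the TAP-style output the converted test should produce (the exact wording comes from the ksft_* helpers; the counts and messages here are illustrative):

TAP version 13
1..3
ok 1 Split huge pages successful
ok 2 Split PTE-mapped huge pages successful
# Please check dmesg for more information
ok 3 File-backed THP split test done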
Link: https://lkml.kernel.org/r/20240202113119.2047740-9-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org [ Dep-of: fc4d182316bd ("mm: huge_memory: enable debugfs to split huge pages to any order"). ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- .../selftests/mm/split_huge_page_test.c | 161 ++++++++---------- 1 file changed, 69 insertions(+), 92 deletions(-)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index dff3be23488b..c00b85b5ce08 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -17,6 +17,7 @@ #include <malloc.h> #include <stdbool.h> #include "vm_util.h" +#include "../kselftest.h"
uint64_t pagesize; unsigned int pageshift; @@ -50,21 +51,19 @@ int is_backed_by_thp(char *vaddr, int pagemap_file, int kpageflags_file) return 0; }
-static int write_file(const char *path, const char *buf, size_t buflen) +static void write_file(const char *path, const char *buf, size_t buflen) { int fd; ssize_t numwritten;
fd = open(path, O_WRONLY); if (fd == -1) - return 0; + ksft_exit_fail_msg("%s open failed: %s\n", path, strerror(errno));
numwritten = write(fd, buf, buflen - 1); close(fd); if (numwritten < 1) - return 0; - - return (unsigned int) numwritten; + ksft_exit_fail_msg("Write failed\n"); }
static void write_debugfs(const char *fmt, ...) @@ -77,15 +76,10 @@ static void write_debugfs(const char *fmt, ...) ret = vsnprintf(input, INPUT_MAX, fmt, argp); va_end(argp);
- if (ret >= INPUT_MAX) { - printf("%s: Debugfs input is too long\n", __func__); - exit(EXIT_FAILURE); - } + if (ret >= INPUT_MAX) + ksft_exit_fail_msg("%s: Debugfs input is too long\n", __func__);
- if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) { - perror(SPLIT_DEBUGFS); - exit(EXIT_FAILURE); - } + write_file(SPLIT_DEBUGFS, input, ret + 1); }
void split_pmd_thp(void) @@ -95,39 +89,30 @@ void split_pmd_thp(void) size_t i;
one_page = memalign(pmd_pagesize, len); - - if (!one_page) { - printf("Fail to allocate memory\n"); - exit(EXIT_FAILURE); - } + if (!one_page) + ksft_exit_fail_msg("Fail to allocate memory: %s\n", strerror(errno));
madvise(one_page, len, MADV_HUGEPAGE);
for (i = 0; i < len; i++) one_page[i] = (char)i;
- if (!check_huge_anon(one_page, 4, pmd_pagesize)) { - printf("No THP is allocated\n"); - exit(EXIT_FAILURE); - } + if (!check_huge_anon(one_page, 4, pmd_pagesize)) + ksft_exit_fail_msg("No THP is allocated\n");
/* split all THPs */ write_debugfs(PID_FMT, getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
for (i = 0; i < len; i++) - if (one_page[i] != (char)i) { - printf("%ld byte corrupted\n", i); - exit(EXIT_FAILURE); - } + if (one_page[i] != (char)i) + ksft_exit_fail_msg("%ld byte corrupted\n", i);
- if (!check_huge_anon(one_page, 0, pmd_pagesize)) { - printf("Still AnonHugePages not split\n"); - exit(EXIT_FAILURE); - } + if (!check_huge_anon(one_page, 0, pmd_pagesize)) + ksft_exit_fail_msg("Still AnonHugePages not split\n");
- printf("Split huge pages successful\n"); + ksft_test_result_pass("Split huge pages successful\n"); free(one_page); }
@@ -143,36 +128,29 @@ void split_pte_mapped_thp(void) int pagemap_fd; int kpageflags_fd;
- if (snprintf(pagemap_proc, 255, pagemap_template, getpid()) < 0) { - perror("get pagemap proc error"); - exit(EXIT_FAILURE); - } - pagemap_fd = open(pagemap_proc, O_RDONLY); + if (snprintf(pagemap_proc, 255, pagemap_template, getpid()) < 0) + ksft_exit_fail_msg("get pagemap proc error: %s\n", strerror(errno));
- if (pagemap_fd == -1) { - perror("read pagemap:"); - exit(EXIT_FAILURE); - } + pagemap_fd = open(pagemap_proc, O_RDONLY); + if (pagemap_fd == -1) + ksft_exit_fail_msg("read pagemap: %s\n", strerror(errno));
kpageflags_fd = open(kpageflags_proc, O_RDONLY); - - if (kpageflags_fd == -1) { - perror("read kpageflags:"); - exit(EXIT_FAILURE); - } + if (kpageflags_fd == -1) + ksft_exit_fail_msg("read kpageflags: %s\n", strerror(errno));
one_page = mmap((void *)(1UL << 30), len, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + if (one_page == MAP_FAILED) + ksft_exit_fail_msg("Fail to allocate memory: %s\n", strerror(errno));
madvise(one_page, len, MADV_HUGEPAGE);
for (i = 0; i < len; i++) one_page[i] = (char)i;
- if (!check_huge_anon(one_page, 4, pmd_pagesize)) { - printf("No THP is allocated\n"); - exit(EXIT_FAILURE); - } + if (!check_huge_anon(one_page, 4, pmd_pagesize)) + ksft_exit_fail_msg("No THP is allocated\n");
/* remap the first pagesize of first THP */ pte_mapped = mremap(one_page, pagesize, pagesize, MREMAP_MAYMOVE); @@ -183,10 +161,8 @@ void split_pte_mapped_thp(void) pagesize, pagesize, MREMAP_MAYMOVE|MREMAP_FIXED, pte_mapped + pagesize * i); - if (pte_mapped2 == (char *)-1) { - perror("mremap failed"); - exit(EXIT_FAILURE); - } + if (pte_mapped2 == MAP_FAILED) + ksft_exit_fail_msg("mremap failed: %s\n", strerror(errno)); }
/* smap does not show THPs after mremap, use kpageflags instead */ @@ -196,10 +172,8 @@ void split_pte_mapped_thp(void) is_backed_by_thp(&pte_mapped[i], pagemap_fd, kpageflags_fd)) thp_size++;
- if (thp_size != 4) { - printf("Some THPs are missing during mremap\n"); - exit(EXIT_FAILURE); - } + if (thp_size != 4) + ksft_exit_fail_msg("Some THPs are missing during mremap\n");
/* split all remapped THPs */ write_debugfs(PID_FMT, getpid(), (uint64_t)pte_mapped, @@ -208,21 +182,18 @@ void split_pte_mapped_thp(void) /* smap does not show THPs after mremap, use kpageflags instead */ thp_size = 0; for (i = 0; i < pagesize * 4; i++) { - if (pte_mapped[i] != (char)i) { - printf("%ld byte corrupted\n", i); - exit(EXIT_FAILURE); - } + if (pte_mapped[i] != (char)i) + ksft_exit_fail_msg("%ld byte corrupted\n", i); + if (i % pagesize == 0 && is_backed_by_thp(&pte_mapped[i], pagemap_fd, kpageflags_fd)) thp_size++; }
- if (thp_size) { - printf("Still %ld THPs not split\n", thp_size); - exit(EXIT_FAILURE); - } + if (thp_size) + ksft_exit_fail_msg("Still %ld THPs not split\n", thp_size);
- printf("Split PTE-mapped huge pages successful\n"); + ksft_test_result_pass("Split PTE-mapped huge pages successful\n"); munmap(one_page, len); close(pagemap_fd); close(kpageflags_fd); @@ -238,24 +209,21 @@ void split_file_backed_thp(void) char testfile[INPUT_MAX]; uint64_t pgoff_start = 0, pgoff_end = 1024;
- printf("Please enable pr_debug in split_huge_pages_in_file() if you need more info.\n"); + ksft_print_msg("Please enable pr_debug in split_huge_pages_in_file() for more info.\n");
status = mount("tmpfs", tmpfs_loc, "tmpfs", 0, "huge=always,size=4m");
- if (status) { - printf("Unable to create a tmpfs for testing\n"); - exit(EXIT_FAILURE); - } + if (status) + ksft_exit_fail_msg("Unable to create a tmpfs for testing\n");
status = snprintf(testfile, INPUT_MAX, "%s/thp_file", tmpfs_loc); if (status >= INPUT_MAX) { - printf("Fail to create file-backed THP split testing file\n"); - goto cleanup; + ksft_exit_fail_msg("Fail to create file-backed THP split testing file\n"); }
fd = open(testfile, O_CREAT|O_WRONLY, 0664); if (fd == -1) { - perror("Cannot open testing file\n"); + ksft_perror("Cannot open testing file"); goto cleanup; }
@@ -264,7 +232,7 @@ void split_file_backed_thp(void) close(fd);
if (num_written < 1) { - printf("Fail to write data to testing file\n"); + ksft_perror("Fail to write data to testing file"); goto cleanup; }
@@ -272,42 +240,51 @@ void split_file_backed_thp(void) write_debugfs(PATH_FMT, testfile, pgoff_start, pgoff_end);
status = unlink(testfile); - if (status) - perror("Cannot remove testing file\n"); + if (status) { + ksft_perror("Cannot remove testing file"); + goto cleanup; + }
-cleanup: status = umount(tmpfs_loc); if (status) { - printf("Unable to umount %s\n", tmpfs_loc); - exit(EXIT_FAILURE); + rmdir(tmpfs_loc); + ksft_exit_fail_msg("Unable to umount %s\n", tmpfs_loc); } + status = rmdir(tmpfs_loc); - if (status) { - perror("cannot remove tmp dir"); - exit(EXIT_FAILURE); - } + if (status) + ksft_exit_fail_msg("cannot remove tmp dir: %s\n", strerror(errno));
- printf("file-backed THP split test done, please check dmesg for more information\n"); + ksft_print_msg("Please check dmesg for more information\n"); + ksft_test_result_pass("File-backed THP split test done\n"); + return; + +cleanup: + umount(tmpfs_loc); + rmdir(tmpfs_loc); + ksft_exit_fail_msg("Error occurred\n"); }
int main(int argc, char **argv) { + ksft_print_header(); + if (geteuid() != 0) { - printf("Please run the benchmark as root\n"); - exit(EXIT_FAILURE); + ksft_print_msg("Please run the benchmark as root\n"); + ksft_finished(); }
+ ksft_set_plan(3); + pagesize = getpagesize(); pageshift = ffs(pagesize) - 1; pmd_pagesize = read_pmd_pagesize(); - if (!pmd_pagesize) { - printf("Reading PMD pagesize failed\n"); - exit(EXIT_FAILURE); - } + if (!pmd_pagesize) + ksft_exit_fail_msg("Reading PMD pagesize failed\n");
split_pmd_thp(); split_pte_mapped_thp(); split_file_backed_thp();
- return 0; + ksft_finished(); }
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit 319a624ec2b79db7a0b0a2a2a61e3aa5c96eabfc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "Split a folio to any lower order folios", v5.
File folios support any order and multi-size THP has been upstreamed [1], so both file and anonymous folios can have order >0. Currently, split_huge_page() only splits a huge page to order-0 pages, but splitting to orders higher than 0 might better utilize large folios, if done properly. In addition, Large Block Size support in XFS would benefit from it during truncate [2]. This patchset adds support for splitting a large folio to any lower-order folios.
In addition to this implementation of split_huge_page_to_list_to_order(), a possible optimization could be splitting a large folio to arbitrary smaller folios instead of a single order. As both Hugh and Ryan pointed out [3,5], splitting to a single order might not be optimal: an order-9 folio might be better split into 1 order-8, 1 order-7, ..., 1 order-1, and 2 order-0 folios, depending on subsequent folio operations. This is left as future work.
[1] https://lore.kernel.org/all/20231207161211.2374093-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20240226094936.2677493-1-kernel@pankajragha... [3] https://lore.kernel.org/linux-mm/9dd96da-efa2-5123-20d4-4992136ef3ad@google.... [4] https://lore.kernel.org/linux-mm/cbb1d6a0-66dd-47d0-8733-f836fe050374@arm.co... [5] https://lore.kernel.org/linux-mm/20240213215520.1048625-1-zi.yan@sent.com/
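As a rough sketch of the interface this series adds (based on this cover letter; the function itself lands in the "mm: thp: split huge page to any lower order pages" patch, so treat the exact signature as an assumption rather than a given), a caller that wants order-4 pieces rather than base pages would do something like:

/*
 * Illustrative kernel-style snippet only, not taken from the series.
 * new_order == 0 keeps the historical split_huge_page() behaviour.
 */
static int example_split_to_order_4(struct folio *folio)
{
	return split_huge_page_to_list_to_order(&folio->page, NULL, 4);
}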
This patch (of 8):
Now that multi-size THP support has been added, not all THPs are PMD-mapped; thus, during a huge page split, there is no need to always split the PMD mapping in unmap_folio(). Make it conditional.
Link: https://lkml.kernel.org/r/20240226205534.1603748-1-zi.yan@sent.com Link: https://lkml.kernel.org/r/20240226205534.1603748-2-zi.yan@sent.com Signed-off-by: Zi Yan ziy@nvidia.com Reviewed-by: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: Matthew Wilcox willy@infradead.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/huge_memory.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b41f3d607c67..21ec24a6658b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2594,11 +2594,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
static void unmap_folio(struct folio *folio) { - enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | - TTU_SYNC | TTU_BATCH_FLUSH; + enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SYNC | + TTU_BATCH_FLUSH;
VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
+ if (folio_test_pmd_mappable(folio)) + ttu_flags |= TTU_SPLIT_HUGE_PMD; + /* * Anon pages need migration entries to preserve them, but file * pages can simply be left unmapped, then faulted back on demand.
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion from mainline-v6.9-rc1 commit 8897277acfef7f70fdecc054073bea2542fc7a1b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Folios of order 1 have no space to store the deferred list. This is not a problem for the page cache as file-backed folios are never placed on the deferred list. All we need to do is prevent the core MM from touching the deferred list for order 1 folios and remove the code which prevented us from allocating order 1 folios.
Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google... Link: https://lkml.kernel.org/r/20240226205534.1603748-3-zi.yan@sent.com Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Zi Yan ziy@nvidia.com Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/filemap.c | 2 -- mm/huge_memory.c | 19 +++++++++++++++---- mm/internal.h | 3 +-- mm/readahead.c | 3 --- 4 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c index 058d79840bc7..7bbe26aa9cc8 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1940,8 +1940,6 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, gfp_t alloc_gfp = gfp;
err = -ENOMEM; - if (order == 1) - order = 0; if (order > 0) alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; folio = filemap_alloc_folio(alloc_gfp, order); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 21ec24a6658b..700bb6146600 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -792,8 +792,10 @@ struct deferred_split *get_deferred_split_queue(struct folio *folio)
void folio_prep_large_rmappable(struct folio *folio) { - VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); - INIT_LIST_HEAD(&folio->_deferred_list); + if (!folio || !folio_test_large(folio)) + return; + if (folio_order(folio) > 1) + INIT_LIST_HEAD(&folio->_deferred_list); folio_set_large_rmappable(folio); }
@@ -2981,7 +2983,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) /* Prevent deferred_split_scan() touching ->_refcount */ spin_lock(&ds_queue->split_queue_lock); if (folio_ref_freeze(folio, 1 + extra_pins)) { - if (!list_empty(&folio->_deferred_list)) { + if (folio_order(folio) > 1 && + !list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; list_del(&folio->_deferred_list); } @@ -3032,6 +3035,9 @@ void folio_undo_large_rmappable(struct folio *folio) struct deferred_split *ds_queue; unsigned long flags;
+ if (folio_order(folio) <= 1) + return; + /* * At this point, there is no one trying to add the folio to * deferred_list. If folio is not in deferred_list, it's safe @@ -3057,7 +3063,12 @@ void deferred_split_folio(struct folio *folio) #endif unsigned long flags;
- VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio); + /* + * Order 1 folios have no space for a deferred list, but we also + * won't waste much memory by not adding them to the deferred list. + */ + if (folio_order(folio) <= 1) + return;
/* * The try_to_unmap() in page reclaim path might reach here too, diff --git a/mm/internal.h b/mm/internal.h index 28085201b863..f2e7972d4a3e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -512,8 +512,7 @@ static inline struct folio *page_rmappable_folio(struct page *page) { struct folio *folio = (struct folio *)page;
- if (folio && folio_order(folio) > 1) - folio_prep_large_rmappable(folio); + folio_prep_large_rmappable(folio); return folio; }
diff --git a/mm/readahead.c b/mm/readahead.c index f79eb574055c..5464f0bee669 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -528,9 +528,6 @@ void page_cache_ra_order(struct readahead_control *ractl, /* Don't allocate pages past EOF */ while (index + (1UL << order) - 1 > limit) order--; - /* THP machinery does not support order-1 */ - if (order == 1) - order = 0; err = ra_alloc_folio(ractl, index, mark, order, gfp); if (err) break;
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit 502003bb76b83649cd4ff7f701987ac5cf43bc4b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We do not have non-power-of-two pages; using nr is error-prone if nr is not a power of two. Use the page order instead.
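As a small stand-alone illustration of why passing the order is the safer interface (hypothetical user-space demo, not kernel code): the page count is always derived as 1 << order, so an impossible, non-power-of-two count can no longer be expressed by a caller.

#include <stdio.h>

int main(void)
{
	/* nr is always derivable from the order ... */
	for (int order = 0; order <= 4; order++)
		printf("order %d -> %u pages\n", order, 1u << order);
	/*
	 * ... but an arbitrary nr such as 12 corresponds to no valid order,
	 * which is exactly the caller mistake the order-based interface rules out.
	 */
	return 0;
}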
Link: https://lkml.kernel.org/r/20240226205534.1603748-4-zi.yan@sent.com Signed-off-by: Zi Yan ziy@nvidia.com Acked-by: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: "Matthew Wilcox (Oracle)" willy@infradead.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/memcontrol.h | 4 ++-- mm/huge_memory.c | 5 +++-- mm/memcontrol.c | 3 ++- mm/page_alloc.c | 4 ++-- 4 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 53adaf4e39be..830d4a67fe9b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1278,7 +1278,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, rcu_read_unlock(); }
-void split_page_memcg(struct page *head, unsigned int nr); +void split_page_memcg(struct page *head, int order);
unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, @@ -1742,7 +1742,7 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { }
-static inline void split_page_memcg(struct page *head, unsigned int nr) +static inline void split_page_memcg(struct page *head, int order) { }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 700bb6146600..25c5cd2ce224 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2756,11 +2756,12 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct lruvec *lruvec; struct address_space *swap_cache = NULL; unsigned long offset = 0; - unsigned int nr = thp_nr_pages(head); int i, nr_dropped = 0; + int order = folio_order(folio); + unsigned int nr = 1 << order;
/* complete memcg works before add pages to LRU */ - split_page_memcg(head, nr); + split_page_memcg(head, order);
if (folio_test_anon(folio) && folio_test_swapcache(folio)) { offset = swp_offset(folio->swap); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 70223fc704c0..9768adeb2110 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3657,11 +3657,12 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) /* * Because page_memcg(head) is not set on tails, set it now. */ -void split_page_memcg(struct page *head, unsigned int nr) +void split_page_memcg(struct page *head, int order) { struct folio *folio = page_folio(head); struct mem_cgroup *memcg = folio_memcg(folio); int i; + unsigned int nr = 1 << order;
if (mem_cgroup_disabled() || !memcg) return; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3f8c134adb5f..aabb76ccb035 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2661,7 +2661,7 @@ void split_page(struct page *page, unsigned int order) for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); split_page_owner(page, 1 << order); - split_page_memcg(page, 1 << order); + split_page_memcg(page, order); } EXPORT_SYMBOL_GPL(split_page);
@@ -5055,7 +5055,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, struct page *last = page + nr;
split_page_owner(page, 1 << order); - split_page_memcg(page, 1 << order); + split_page_memcg(page, order); while (page < --last) set_page_refcounted(last);
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit 9a581c12cddb06696fe4811239934fcde57ceb91 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We do not have non-power-of-two pages; using nr is error-prone if nr is not a power of two. Use the page order instead.
Link: https://lkml.kernel.org/r/20240226205534.1603748-5-zi.yan@sent.com Signed-off-by: Zi Yan ziy@nvidia.com Acked-by: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: "Matthew Wilcox (Oracle)" willy@infradead.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/page_owner.h | 9 ++++----- mm/huge_memory.c | 2 +- mm/page_alloc.c | 4 ++-- mm/page_owner.c | 3 ++- 4 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 119a0c9d2a8b..2b39c8e19d98 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -11,7 +11,7 @@ extern struct page_ext_operations page_owner_ops; extern void __reset_page_owner(struct page *page, unsigned short order); extern void __set_page_owner(struct page *page, unsigned short order, gfp_t gfp_mask); -extern void __split_page_owner(struct page *page, unsigned int nr); +extern void __split_page_owner(struct page *page, int order); extern void __folio_copy_owner(struct folio *newfolio, struct folio *old); extern void __set_page_owner_migrate_reason(struct page *page, int reason); extern void __dump_page_owner(const struct page *page); @@ -31,10 +31,10 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); }
-static inline void split_page_owner(struct page *page, unsigned int nr) +static inline void split_page_owner(struct page *page, int order) { if (static_branch_unlikely(&page_owner_inited)) - __split_page_owner(page, nr); + __split_page_owner(page, order); } static inline void folio_copy_owner(struct folio *newfolio, struct folio *old) { @@ -59,8 +59,7 @@ static inline void set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask) { } -static inline void split_page_owner(struct page *page, - unsigned short order) +static inline void split_page_owner(struct page *page, int order) { } static inline void folio_copy_owner(struct folio *newfolio, struct folio *folio) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 25c5cd2ce224..96620dec3311 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2800,7 +2800,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, unlock_page_lruvec(lruvec); /* Caller disabled irqs, so they are still disabled here */
- split_page_owner(head, nr); + split_page_owner(head, order);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aabb76ccb035..f241b2fdc09f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2660,7 +2660,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); - split_page_owner(page, 1 << order); + split_page_owner(page, order); split_page_memcg(page, order); } EXPORT_SYMBOL_GPL(split_page); @@ -5054,7 +5054,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, struct page *page = virt_to_page((void *)addr); struct page *last = page + nr;
- split_page_owner(page, 1 << order); + split_page_owner(page, order); split_page_memcg(page, order); while (page < --last) set_page_refcounted(last); diff --git a/mm/page_owner.c b/mm/page_owner.c index e7eba7688881..125113b9f729 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -215,11 +215,12 @@ void __set_page_owner_migrate_reason(struct page *page, int reason) page_ext_put(page_ext); }
-void __split_page_owner(struct page *page, unsigned int nr) +void __split_page_owner(struct page *page, int order) { int i; struct page_ext *page_ext = page_ext_get(page); struct page_owner *page_owner; + unsigned int nr = 1 << order;
if (unlikely(!page_ext)) return;
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit b8791381d7edae3706edde207f52d9e483ed400c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
It sets memcg information for the pages after the split. A new parameter, new_order, is added to specify the order of the subpages in the new page; it is always 0 for now. This prepares for upcoming changes that support splitting a huge page to any lower order.
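A quick worked example of the arithmetic the new interface enables (hypothetical stand-alone demo; the computation mirrors the split_page_memcg() loop in the diff below): splitting an order-9 page into order-2 pieces only needs memcg_data copied to the head of each new piece, plus one extra memcg reference per additional piece.

#include <stdio.h>

int main(void)
{
	int old_order = 9, new_order = 2;
	unsigned int old_nr = 1u << old_order;	/* 512 subpages */
	unsigned int new_nr = 1u << new_order;	/* each new piece covers 4 of them */

	/* Heads that get memcg_data copied sit at offsets 4, 8, ..., 508. */
	printf("new folios: %u\n", old_nr / new_nr);		/* 128 */
	printf("extra memcg refs: %u\n", old_nr / new_nr - 1);	/* 127 */
	return 0;
}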
Link: https://lkml.kernel.org/r/20240226205534.1603748-6-zi.yan@sent.com Signed-off-by: Zi Yan ziy@nvidia.com Acked-by: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: "Matthew Wilcox (Oracle)" willy@infradead.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/memcontrol.h | 4 ++-- mm/huge_memory.c | 2 +- mm/memcontrol.c | 11 ++++++----- mm/page_alloc.c | 4 ++-- 4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 830d4a67fe9b..8c199fe368c2 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1278,7 +1278,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, rcu_read_unlock(); }
-void split_page_memcg(struct page *head, int order); +void split_page_memcg(struct page *head, int old_order, int new_order);
unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, gfp_t gfp_mask, @@ -1742,7 +1742,7 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) { }
-static inline void split_page_memcg(struct page *head, int order) +static inline void split_page_memcg(struct page *head, int old_order, int new_order) { }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 96620dec3311..9f7b29f3b10c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2761,7 +2761,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, unsigned int nr = 1 << order;
/* complete memcg works before add pages to LRU */ - split_page_memcg(head, order); + split_page_memcg(head, order, 0);
if (folio_test_anon(folio) && folio_test_swapcache(folio)) { offset = swp_offset(folio->swap); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9768adeb2110..f1cf73835cba 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3657,23 +3657,24 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) /* * Because page_memcg(head) is not set on tails, set it now. */ -void split_page_memcg(struct page *head, int order) +void split_page_memcg(struct page *head, int old_order, int new_order) { struct folio *folio = page_folio(head); struct mem_cgroup *memcg = folio_memcg(folio); int i; - unsigned int nr = 1 << order; + unsigned int old_nr = 1 << old_order; + unsigned int new_nr = 1 << new_order;
if (mem_cgroup_disabled() || !memcg) return;
- for (i = 1; i < nr; i++) + for (i = new_nr; i < old_nr; i += new_nr) folio_page(folio, i)->memcg_data = folio->memcg_data;
if (folio_memcg_kmem(folio)) - obj_cgroup_get_many(__folio_objcg(folio), nr - 1); + obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1); else - css_get_many(&memcg->css, nr - 1); + css_get_many(&memcg->css, old_nr / new_nr - 1); }
#ifdef CONFIG_SWAP diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f241b2fdc09f..fc3286da1eec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2661,7 +2661,7 @@ void split_page(struct page *page, unsigned int order) for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); split_page_owner(page, order); - split_page_memcg(page, order); + split_page_memcg(page, order, 0); } EXPORT_SYMBOL_GPL(split_page);
@@ -5055,7 +5055,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, struct page *last = page + nr;
split_page_owner(page, order); - split_page_memcg(page, order); + split_page_memcg(page, order, 0); while (page < --last) set_page_refcounted(last);
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit 46d44d09d24c5b451d6ab8f0fca5a40f651e3837 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
It adds a new_order parameter so that the new page order can be recorded in page owner. This prepares for upcoming changes that support splitting a huge page to any lower order.
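As a concrete illustration of the updated __split_page_owner() loop in the diff below (numbers chosen only as an example): splitting an order-9 compound page with new_order 2 walks all 512 page_ext entries and records order 2 in each of them, keeping page_owner reporting consistent with the 128 order-2 pages that now exist, instead of unconditionally recording order 0.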
Link: https://lkml.kernel.org/r/20240226205534.1603748-7-zi.yan@sent.com Signed-off-by: Zi Yan ziy@nvidia.com Cc: David Hildenbrand david@redhat.com Acked-by: David Hildenbrand david@redhat.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: "Matthew Wilcox (Oracle)" willy@infradead.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/page_owner.h | 13 ++++++++----- mm/huge_memory.c | 2 +- mm/page_alloc.c | 4 ++-- mm/page_owner.c | 7 +++---- 4 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 2b39c8e19d98..debdc25f08b9 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -11,7 +11,8 @@ extern struct page_ext_operations page_owner_ops; extern void __reset_page_owner(struct page *page, unsigned short order); extern void __set_page_owner(struct page *page, unsigned short order, gfp_t gfp_mask); -extern void __split_page_owner(struct page *page, int order); +extern void __split_page_owner(struct page *page, int old_order, + int new_order); extern void __folio_copy_owner(struct folio *newfolio, struct folio *old); extern void __set_page_owner_migrate_reason(struct page *page, int reason); extern void __dump_page_owner(const struct page *page); @@ -31,10 +32,11 @@ static inline void set_page_owner(struct page *page, __set_page_owner(page, order, gfp_mask); }
-static inline void split_page_owner(struct page *page, int order) +static inline void split_page_owner(struct page *page, int old_order, + int new_order) { if (static_branch_unlikely(&page_owner_inited)) - __split_page_owner(page, order); + __split_page_owner(page, old_order, new_order); } static inline void folio_copy_owner(struct folio *newfolio, struct folio *old) { @@ -56,10 +58,11 @@ static inline void reset_page_owner(struct page *page, unsigned short order) { } static inline void set_page_owner(struct page *page, - unsigned int order, gfp_t gfp_mask) + unsigned short order, gfp_t gfp_mask) { } -static inline void split_page_owner(struct page *page, int order) +static inline void split_page_owner(struct page *page, int old_order, + int new_order) { } static inline void folio_copy_owner(struct folio *newfolio, struct folio *folio) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9f7b29f3b10c..b56bbb88bec9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2800,7 +2800,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, unlock_page_lruvec(lruvec); /* Caller disabled irqs, so they are still disabled here */
- split_page_owner(head, order); + split_page_owner(head, order, 0);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fc3286da1eec..95bd8f6f7889 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2660,7 +2660,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); - split_page_owner(page, order); + split_page_owner(page, order, 0); split_page_memcg(page, order, 0); } EXPORT_SYMBOL_GPL(split_page); @@ -5054,7 +5054,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, struct page *page = virt_to_page((void *)addr); struct page *last = page + nr;
- split_page_owner(page, order); + split_page_owner(page, order, 0); split_page_memcg(page, order, 0); while (page < --last) set_page_refcounted(last); diff --git a/mm/page_owner.c b/mm/page_owner.c index 125113b9f729..41993f140f1b 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -215,19 +215,18 @@ void __set_page_owner_migrate_reason(struct page *page, int reason) page_ext_put(page_ext); }
-void __split_page_owner(struct page *page, int order) +void __split_page_owner(struct page *page, int old_order, int new_order) { int i; struct page_ext *page_ext = page_ext_get(page); struct page_owner *page_owner; - unsigned int nr = 1 << order;
if (unlikely(!page_ext)) return;
- for (i = 0; i < nr; i++) { + for (i = 0; i < (1 << old_order); i++) { page_owner = get_page_owner(page_ext); - page_owner->order = 0; + page_owner->order = new_order; page_ext = page_ext_next(page_ext); } page_ext_put(page_ext);
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit c010d47f107f609b9f4d6a103b6dfc53889049e9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
To split a THP to pages of any lower order, we need to re-form THPs from the subpages at the given order and add page refcounts based on the new page order. We also need to reinitialize page_deferred_list after removing the page from the split_queue; otherwise a subsequent split will see list corruption when it checks the page_deferred_list again.
Note: anonymous order-1 folios are not supported because _deferred_list, which is used by partially mapped folios, is stored in subpage 2 and an order-1 folio only has subpages 0 and 1. File-backed order-1 folios are fine, since they do not use _deferred_list.
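A worked example of the new tail-walking step in __split_huge_page() (hypothetical stand-alone demo; the loop bounds mirror the diff below): splitting an order-9 folio to order 3 prepares an order-3 tail folio at every eighth subpage, walking from the back of the folio towards the head, and the head itself is then reduced to order 3 with folio_set_order().

#include <stdio.h>

int main(void)
{
	int order = 9, new_order = 3;
	unsigned int nr = 1u << order;		/* 512 subpages */
	unsigned int new_nr = 1u << new_order;	/* each new folio covers 8 of them */

	/* Offsets visited by the split loop: 504, 496, ..., 8; offset 0 stays the head. */
	for (unsigned int i = nr - new_nr; i >= new_nr; i -= new_nr)
		printf("prep order-%d tail folio at subpage %u\n", new_order, i);
	return 0;
}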
[ziy@nvidia.com: fixup per discussion with Ryan] Link: https://lkml.kernel.org/r/494F48CD-1F0F-4CAD-884E-6D48F40AF990@nvidia.com Link: https://lkml.kernel.org/r/20240226205534.1603748-8-zi.yan@sent.com Signed-off-by: Zi Yan ziy@nvidia.com Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: "Matthew Wilcox (Oracle)" willy@infradead.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/huge_mm.h | 21 +++++--- mm/huge_memory.c | 107 +++++++++++++++++++++++++++++++--------- 2 files changed, 96 insertions(+), 32 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 885888e38e26..4e604758e58a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -266,10 +266,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
void folio_prep_large_rmappable(struct folio *folio); bool can_split_folio(struct folio *folio, int *pextra_pins); -int split_huge_page_to_list(struct page *page, struct list_head *list); +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order); static inline int split_huge_page(struct page *page) { - return split_huge_page_to_list(page, NULL); + return split_huge_page_to_list_to_order(page, NULL, 0); } void deferred_split_folio(struct folio *folio);
@@ -423,7 +424,8 @@ can_split_folio(struct folio *folio, int *pextra_pins) return false; } static inline int -split_huge_page_to_list(struct page *page, struct list_head *list) +split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order) { return 0; } @@ -520,17 +522,20 @@ static inline bool thp_migration_supported(void) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline int split_folio_to_list(struct folio *folio, - struct list_head *list) +static inline int split_folio_to_list_to_order(struct folio *folio, + struct list_head *list, int new_order) { - return split_huge_page_to_list(&folio->page, list); + return split_huge_page_to_list_to_order(&folio->page, list, new_order); }
-static inline int split_folio(struct folio *folio) +static inline int split_folio_to_order(struct folio *folio, int new_order) { - return split_folio_to_list(folio, NULL); + return split_folio_to_list_to_order(folio, NULL, new_order); }
+#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0) +#define split_folio(f) split_folio_to_order(f, 0) + /* * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to * limitations in the implementation like arm64 MTE can override this to diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b56bbb88bec9..e8723d79a19c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2637,7 +2637,6 @@ static void lru_add_page_tail(struct page *head, struct page *tail, struct lruvec *lruvec, struct list_head *list) { VM_BUG_ON_PAGE(!PageHead(head), head); - VM_BUG_ON_PAGE(PageCompound(tail), head); VM_BUG_ON_PAGE(PageLRU(tail), head); lockdep_assert_held(&lruvec->lru_lock);
@@ -2658,7 +2657,8 @@ static void lru_add_page_tail(struct page *head, struct page *tail, }
static void __split_huge_page_tail(struct folio *folio, int tail, - struct lruvec *lruvec, struct list_head *list) + struct lruvec *lruvec, struct list_head *list, + unsigned int new_order) { struct page *head = &folio->page; struct page *page_tail = head + tail; @@ -2728,10 +2728,15 @@ static void __split_huge_page_tail(struct folio *folio, int tail, * which needs correct compound_head(). */ clear_compound_head(page_tail); + if (new_order) { + prep_compound_page(page_tail, new_order); + folio_prep_large_rmappable(new_folio); + }
/* Finally unfreeze refcount. Additional reference from page cache. */ - page_ref_unfreeze(page_tail, 1 + (!folio_test_anon(folio) || - folio_test_swapcache(folio))); + page_ref_unfreeze(page_tail, + 1 + ((!folio_test_anon(folio) || folio_test_swapcache(folio)) ? + folio_nr_pages(new_folio) : 0));
if (folio_test_young(folio)) folio_set_young(new_folio); @@ -2749,7 +2754,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail, }
static void __split_huge_page(struct page *page, struct list_head *list, - pgoff_t end) + pgoff_t end, unsigned int new_order) { struct folio *folio = page_folio(page); struct page *head = &folio->page; @@ -2757,11 +2762,12 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; int i, nr_dropped = 0; + unsigned int new_nr = 1 << new_order; int order = folio_order(folio); unsigned int nr = 1 << order;
/* complete memcg works before add pages to LRU */ - split_page_memcg(head, order, 0); + split_page_memcg(head, order, new_order);
if (folio_test_anon(folio) && folio_test_swapcache(folio)) { offset = swp_offset(folio->swap); @@ -2774,8 +2780,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
ClearPageHasHWPoisoned(head);
- for (i = nr - 1; i >= 1; i--) { - __split_huge_page_tail(folio, i, lruvec, list); + for (i = nr - new_nr; i >= new_nr; i -= new_nr) { + __split_huge_page_tail(folio, i, lruvec, list, new_order); /* Some pages can be beyond EOF: drop them from page cache */ if (head[i].index >= end) { struct folio *tail = page_folio(head + i); @@ -2796,24 +2802,30 @@ static void __split_huge_page(struct page *page, struct list_head *list, } }
- ClearPageCompound(head); + if (!new_order) + ClearPageCompound(head); + else { + struct folio *new_folio = (struct folio *)head; + + folio_set_order(new_folio, new_order); + } unlock_page_lruvec(lruvec); /* Caller disabled irqs, so they are still disabled here */
- split_page_owner(head, order, 0); + split_page_owner(head, order, new_order);
/* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { /* Additional pin to swap cache */ if (PageSwapCache(head)) { - page_ref_add(head, 2); + page_ref_add(head, 1 + new_nr); xa_unlock(&swap_cache->i_pages); } else { page_ref_inc(head); } } else { /* Additional pin to page cache */ - page_ref_add(head, 2); + page_ref_add(head, 1 + new_nr); xa_unlock(&head->mapping->i_pages); } local_irq_enable(); @@ -2825,7 +2837,15 @@ static void __split_huge_page(struct page *page, struct list_head *list, if (folio_test_swapcache(folio)) split_swap_cluster(folio->swap);
- for (i = 0; i < nr; i++) { + /* + * set page to its compound_head when split to non order-0 pages, so + * we can skip unlocking it below, since PG_locked is transferred to + * the compound_head of the page and the caller will unlock it. + */ + if (new_order) + page = compound_head(page); + + for (i = 0; i < nr; i += new_nr) { struct page *subpage = head + i; if (subpage == page) continue; @@ -2859,29 +2879,36 @@ bool can_split_folio(struct folio *folio, int *pextra_pins) }
/* - * This function splits huge page into normal pages. @page can point to any - * subpage of huge page to split. Split doesn't change the position of @page. + * This function splits huge page into pages in @new_order. @page can point to + * any subpage of huge page to split. Split doesn't change the position of + * @page. + * + * NOTE: order-1 anonymous folio is not supported because _deferred_list, + * which is used by partially mapped folios, is stored in subpage 2 and an + * order-1 folio only has subpage 0 and 1. File-backed order-1 folios are OK, + * since they do not use _deferred_list. * * Only caller must hold pin on the @page, otherwise split fails with -EBUSY. * The huge page must be locked. * * If @list is null, tail pages will be added to LRU list, otherwise, to @list. * - * Both head page and tail pages will inherit mapping, flags, and so on from - * the hugepage. + * Pages in new_order will inherit mapping, flags, and so on from the hugepage. * - * GUP pin and PG_locked transferred to @page. Rest subpages can be freed if - * they are not mapped. + * GUP pin and PG_locked transferred to @page or the compound page @page belongs + * to. Rest subpages can be freed if they are not mapped. * * Returns 0 if the hugepage is split successfully. * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under * us. */ -int split_huge_page_to_list(struct page *page, struct list_head *list) +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, + unsigned int new_order) { struct folio *folio = page_folio(page); struct deferred_split *ds_queue = get_deferred_split_queue(folio); - XA_STATE(xas, &folio->mapping->i_pages, folio->index); + /* reset xarray order to new order after split */ + XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int extra_pins, ret; @@ -2891,6 +2918,31 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
+ /* Cannot split anonymous THP to order-1 */ + if (new_order == 1 && folio_test_anon(folio)) { + VM_WARN_ONCE(1, "Cannot split to order-1 folio"); + return -EINVAL; + } + + if (new_order) { + /* Only swapping a whole PMD-mapped folio is supported */ + if (folio_test_swapcache(folio)) + return -EINVAL; + /* Split shmem folio to non-zero order not supported */ + if (shmem_mapping(folio->mapping)) { + VM_WARN_ONCE(1, + "Cannot split shmem folio to non-0 order"); + return -EINVAL; + } + /* No split if the file system does not support large folio */ + if (!mapping_large_folio_support(folio->mapping)) { + VM_WARN_ONCE(1, + "Cannot split file folio to non-0 order"); + return -EINVAL; + } + } + + is_hzp = is_huge_zero_page(&folio->page); if (is_hzp) { pr_warn_ratelimited("Called split_huge_page for huge zero page\n"); @@ -2987,14 +3039,21 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (folio_order(folio) > 1 && !list_empty(&folio->_deferred_list)) { ds_queue->split_queue_len--; - list_del(&folio->_deferred_list); + /* + * Reinitialize page_deferred_list after removing the + * page from the split_queue, otherwise a subsequent + * split will see list corruption when checking the + * page_deferred_list. + */ + list_del_init(&folio->_deferred_list); } spin_unlock(&ds_queue->split_queue_lock); if (mapping) { int nr = folio_nr_pages(folio);
xas_split(&xas, folio, folio_order(folio)); - if (folio_test_pmd_mappable(folio)) { + if (folio_test_pmd_mappable(folio) && + new_order < HPAGE_PMD_ORDER) { if (folio_test_swapbacked(folio)) { __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr); @@ -3006,7 +3065,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) } }
- __split_huge_page(page, list, end); + __split_huge_page(page, list, end, new_order); ret = 0; } else { spin_unlock(&ds_queue->split_queue_lock);
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit fc4d182316bd5309b4066fd9ef21529ea397a7d4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
It is used to test split_huge_page_to_list_to_order() for pagecache THPs. Also add test cases for split_huge_page_to_list_to_order() via debugfs.
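With this change the debugfs input gains an optional trailing new_order field; when it is omitted the interface keeps splitting to order 0 as before. A minimal usage sketch (the pid and address range below are placeholders; the format string matches PID_FMT used by the selftest):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/debug/split_huge_pages", O_WRONLY);

	if (fd < 0)
		return 1;
	/* Ask for the THPs in this range of pid 1234 to be split to order-4 folios. */
	dprintf(fd, "%d,0x%lx,0x%lx,%d", 1234, 0x7f0000000000UL, 0x7f0000200000UL, 4);
	close(fd);
	return 0;
}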
[ziy@nvidia.com: fix issue discovered with NFS] Link: https://lkml.kernel.org/r/262E4DAA-4A78-4328-B745-1355AE356A07@nvidia.com Link: https://lkml.kernel.org/r/20240226205534.1603748-9-zi.yan@sent.com Signed-off-by: Zi Yan ziy@nvidia.com Tested-by: Aishwarya TCV aishwarya.tcv@arm.com Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: "Matthew Wilcox (Oracle)" willy@infradead.org Cc: Michal Koutny mkoutny@suse.com Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zach O'Keefe zokeefe@google.com Cc: Aishwarya TCV aishwarya.tcv@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/huge_memory.c | 34 ++-- tools/testing/selftests/mm/run_vmtests.sh | 22 ++- .../selftests/mm/split_huge_page_test.c | 168 +++++++++++++++++- 3 files changed, 205 insertions(+), 19 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e8723d79a19c..3223a501b9f7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3294,7 +3294,7 @@ static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma) }
static int split_huge_pages_pid(int pid, unsigned long vaddr_start, - unsigned long vaddr_end) + unsigned long vaddr_end, unsigned int new_order) { int ret = 0; struct task_struct *task; @@ -3358,13 +3358,19 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, goto next;
total++; - if (!can_split_folio(folio, NULL)) + /* + * For folios with private, split_huge_page_to_list_to_order() + * will try to drop it before split and then check if the folio + * can be split or not. So skip the check here. + */ + if (!folio_test_private(folio) && + !can_split_folio(folio, NULL)) goto next;
if (!folio_trylock(folio)) goto next;
- if (!split_folio(folio)) + if (!split_folio_to_order(folio, new_order)) split++;
folio_unlock(folio); @@ -3382,7 +3388,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, }
static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, - pgoff_t off_end) + pgoff_t off_end, unsigned int new_order) { struct filename *file; struct file *candidate; @@ -3421,7 +3427,7 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, if (!folio_trylock(folio)) goto next;
- if (!split_folio(folio)) + if (!split_folio_to_order(folio, new_order)) split++;
folio_unlock(folio); @@ -3446,10 +3452,14 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf, { static DEFINE_MUTEX(split_debug_mutex); ssize_t ret; - /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end */ + /* + * hold pid, start_vaddr, end_vaddr, new_order or + * file_path, off_start, off_end, new_order + */ char input_buf[MAX_INPUT_BUF_SZ]; int pid; unsigned long vaddr_start, vaddr_end; + unsigned int new_order = 0;
ret = mutex_lock_interruptible(&split_debug_mutex); if (ret) @@ -3478,29 +3488,29 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf, goto out; }
- ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end); - if (ret != 2) { + ret = sscanf(buf, "0x%lx,0x%lx,%d", &off_start, &off_end, &new_order); + if (ret != 2 && ret != 3) { ret = -EINVAL; goto out; } - ret = split_huge_pages_in_file(file_path, off_start, off_end); + ret = split_huge_pages_in_file(file_path, off_start, off_end, new_order); if (!ret) ret = input_len;
goto out; }
- ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end); + ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start, &vaddr_end, &new_order); if (ret == 1 && pid == 1) { split_huge_pages_all(); ret = strlen(input_buf); goto out; - } else if (ret != 3) { + } else if (ret != 3 && ret != 4) { ret = -EINVAL; goto out; }
- ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end); + ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end, new_order); if (!ret) ret = strlen(input_buf); out: diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index 171c929826d5..198c4db57c9e 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -351,7 +351,27 @@ CATEGORY="thp" run_test ./khugepaged -s 2
CATEGORY="thp" run_test ./transhuge-stress -d 20
-CATEGORY="thp" run_test ./split_huge_page_test +# Try to create XFS if not provided +if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then + if test_selected "thp"; then + if grep xfs /proc/filesystems &>/dev/null; then + XFS_IMG=$(mktemp /tmp/xfs_img_XXXXXX) + SPLIT_HUGE_PAGE_TEST_XFS_PATH=$(mktemp -d /tmp/xfs_dir_XXXXXX) + truncate -s 314572800 ${XFS_IMG} + mkfs.xfs -q ${XFS_IMG} + mount -o loop ${XFS_IMG} ${SPLIT_HUGE_PAGE_TEST_XFS_PATH} + MOUNTED_XFS=1 + fi + fi +fi + +CATEGORY="thp" run_test ./split_huge_page_test ${SPLIT_HUGE_PAGE_TEST_XFS_PATH} + +if [ -n "${MOUNTED_XFS}" ]; then + umount ${SPLIT_HUGE_PAGE_TEST_XFS_PATH} + rmdir ${SPLIT_HUGE_PAGE_TEST_XFS_PATH} + rm -f ${XFS_IMG} +fi
CATEGORY="migration" run_test ./migration
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index c00b85b5ce08..6c988bd2f335 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -16,6 +16,7 @@ #include <sys/mount.h> #include <malloc.h> #include <stdbool.h> +#include <time.h> #include "vm_util.h" #include "../kselftest.h"
@@ -24,10 +25,11 @@ unsigned int pageshift; uint64_t pmd_pagesize;
#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages" +#define SMAP_PATH "/proc/self/smaps" #define INPUT_MAX 80
-#define PID_FMT "%d,0x%lx,0x%lx" -#define PATH_FMT "%s,0x%lx,0x%lx" +#define PID_FMT "%d,0x%lx,0x%lx,%d" +#define PATH_FMT "%s,0x%lx,0x%lx,%d"
#define PFN_MASK ((1UL<<55)-1) #define KPF_THP (1UL<<22) @@ -102,7 +104,7 @@ void split_pmd_thp(void)
/* split all THPs */ write_debugfs(PID_FMT, getpid(), (uint64_t)one_page, - (uint64_t)one_page + len); + (uint64_t)one_page + len, 0);
for (i = 0; i < len; i++) if (one_page[i] != (char)i) @@ -177,7 +179,7 @@ void split_pte_mapped_thp(void)
/* split all remapped THPs */ write_debugfs(PID_FMT, getpid(), (uint64_t)pte_mapped, - (uint64_t)pte_mapped + pagesize * 4); + (uint64_t)pte_mapped + pagesize * 4, 0);
/* smap does not show THPs after mremap, use kpageflags instead */ thp_size = 0; @@ -237,7 +239,7 @@ void split_file_backed_thp(void) }
/* split the file-backed THP */ - write_debugfs(PATH_FMT, testfile, pgoff_start, pgoff_end); + write_debugfs(PATH_FMT, testfile, pgoff_start, pgoff_end, 0);
status = unlink(testfile); if (status) { @@ -265,8 +267,149 @@ void split_file_backed_thp(void) ksft_exit_fail_msg("Error occurred\n"); }
+bool prepare_thp_fs(const char *xfs_path, char *thp_fs_template, + const char **thp_fs_loc) +{ + if (xfs_path) { + *thp_fs_loc = xfs_path; + return false; + } + + *thp_fs_loc = mkdtemp(thp_fs_template); + + if (!*thp_fs_loc) + ksft_exit_fail_msg("cannot create temp folder\n"); + + return true; +} + +void cleanup_thp_fs(const char *thp_fs_loc, bool created_tmp) +{ + int status; + + if (!created_tmp) + return; + + status = rmdir(thp_fs_loc); + if (status) + ksft_exit_fail_msg("cannot remove tmp dir: %s\n", + strerror(errno)); +} + +int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, + char **addr) +{ + size_t i; + int dummy; + + srand(time(NULL)); + + *fd = open(testfile, O_CREAT | O_RDWR, 0664); + if (*fd == -1) + ksft_exit_fail_msg("Failed to create a file at %s\n", testfile); + + for (i = 0; i < fd_size; i++) { + unsigned char byte = (unsigned char)i; + + write(*fd, &byte, sizeof(byte)); + } + close(*fd); + sync(); + *fd = open("/proc/sys/vm/drop_caches", O_WRONLY); + if (*fd == -1) { + ksft_perror("open drop_caches"); + goto err_out_unlink; + } + if (write(*fd, "3", 1) != 1) { + ksft_perror("write to drop_caches"); + goto err_out_unlink; + } + close(*fd); + + *fd = open(testfile, O_RDWR); + if (*fd == -1) { + ksft_perror("Failed to open testfile\n"); + goto err_out_unlink; + } + + *addr = mmap(NULL, fd_size, PROT_READ|PROT_WRITE, MAP_SHARED, *fd, 0); + if (*addr == (char *)-1) { + ksft_perror("cannot mmap"); + goto err_out_close; + } + madvise(*addr, fd_size, MADV_HUGEPAGE); + + for (size_t i = 0; i < fd_size; i++) + dummy += *(*addr + i); + + if (!check_huge_file(*addr, fd_size / pmd_pagesize, pmd_pagesize)) { + ksft_print_msg("No large pagecache folio generated, please provide a filesystem supporting large folio\n"); + munmap(*addr, fd_size); + close(*fd); + unlink(testfile); + ksft_test_result_skip("Pagecache folio split skipped\n"); + return -2; + } + return 0; +err_out_close: + close(*fd); +err_out_unlink: + unlink(testfile); + ksft_exit_fail_msg("Failed to create large pagecache folios\n"); + return -1; +} + +void split_thp_in_pagecache_to_order(size_t fd_size, int order, const char *fs_loc) +{ + int fd; + char *addr; + size_t i; + char testfile[INPUT_MAX]; + int err = 0; + + err = snprintf(testfile, INPUT_MAX, "%s/test", fs_loc); + + if (err < 0) + ksft_exit_fail_msg("cannot generate right test file name\n"); + + err = create_pagecache_thp_and_fd(testfile, fd_size, &fd, &addr); + if (err) + return; + err = 0; + + write_debugfs(PID_FMT, getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order); + + for (i = 0; i < fd_size; i++) + if (*(addr + i) != (char)i) { + ksft_print_msg("%lu byte corrupted in the file\n", i); + err = EXIT_FAILURE; + goto out; + } + + if (!check_huge_file(addr, 0, pmd_pagesize)) { + ksft_print_msg("Still FilePmdMapped not split\n"); + err = EXIT_FAILURE; + goto out; + } + +out: + munmap(addr, fd_size); + close(fd); + unlink(testfile); + if (err) + ksft_exit_fail_msg("Split PMD-mapped pagecache folio to order %d failed\n", order); + ksft_test_result_pass("Split PMD-mapped pagecache folio to order %d passed\n", order); +} + int main(int argc, char **argv) { + int i; + size_t fd_size; + char *optional_xfs_path = NULL; + char fs_loc_template[] = "/tmp/thp_fs_XXXXXX"; + const char *fs_loc; + bool created_tmp; + ksft_print_header();
if (geteuid() != 0) { @@ -274,7 +417,10 @@ int main(int argc, char **argv) ksft_finished(); }
- ksft_set_plan(3); + if (argc > 1) + optional_xfs_path = argv[1]; + + ksft_set_plan(3+9);
pagesize = getpagesize(); pageshift = ffs(pagesize) - 1; @@ -282,9 +428,19 @@ int main(int argc, char **argv) if (!pmd_pagesize) ksft_exit_fail_msg("Reading PMD pagesize failed\n");
+ fd_size = 2 * pmd_pagesize; + split_pmd_thp(); split_pte_mapped_thp(); split_file_backed_thp();
+ created_tmp = prepare_thp_fs(optional_xfs_path, fs_loc_template, + &fs_loc); + for (i = 8; i >= 0; i--) + split_thp_in_pagecache_to_order(fd_size, i, fs_loc); + cleanup_thp_fs(fs_loc, created_tmp); + ksft_finished(); + + return 0; }
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.9-rc1 commit 1412ecb3d256e5d23c6bd150e110cef6e3736844 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A folio can only be split into lower orders.
Since there are no new_order checks in debugfs, any new_order can be passed via debugfs into split_huge_page_to_list_to_order().
Check new_order to make sure it is smaller than the input folio order.
Link: https://lkml.kernel.org/r/20240307181854.138928-1-zi.yan@sent.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan ziy@nvidia.com Reported-by: Dan Carpenter dan.carpenter@linaro.org Closes: https://lore.kernel.org/linux-mm/7dda9283-b437-4cf8-ab0d-83c330deb9c0@moroto... Cc: David Hildenbrand david@redhat.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/huge_memory.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3223a501b9f7..fd3ff0016ca4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2918,6 +2918,9 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
+ if (new_order >= folio_order(folio)) + return -EINVAL; + /* Cannot split anonymous THP to order-1 */ if (new_order == 1 && folio_test_anon(folio)) { VM_WARN_ONCE(1, "Cannot split to order-1 folio");
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion from mainline-v6.9 commit 2a0774c2886d25f4d2987cd3e3813d16bf96f34f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If we created a new node to replace an entry which had search marks set, we were setting the search mark on every entry in that node. That works fine when we're splitting to order 0, but when splitting to a larger order, we must not set the search marks on the sibling entries.
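A concrete illustration (orders chosen only as an example; the invariant is the one the updated check_split_1() below asserts): splitting a marked order-9 entry into order-3 entries must leave exactly 1 << (9 - 3) = 64 marked entries, one per new order-3 entry. Each new entry occupies one canonical slot plus seven sibling slots, and only the canonical slot may carry the mark, so the new node_mark_slots() marks every (sibs + 1)-th slot of the child node instead of marking them all.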
Link: https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Reported-by: Luis Chamberlain mcgrof@kernel.org Link: https://lore.kernel.org/r/ZjFGCOYk3FK_zVy3@bombadil.infradead.org Tested-by: Luis Chamberlain mcgrof@kernel.org Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- lib/test_xarray.c | 14 +++++++++++++- lib/xarray.c | 23 +++++++++++++++++++---- 2 files changed, 32 insertions(+), 5 deletions(-)
diff --git a/lib/test_xarray.c b/lib/test_xarray.c index 542926da61a3..5c6953c4822f 100644 --- a/lib/test_xarray.c +++ b/lib/test_xarray.c @@ -1555,9 +1555,11 @@ static void check_split_1(struct xarray *xa, unsigned long index, unsigned int order, unsigned int new_order) { XA_STATE_ORDER(xas, xa, index, new_order); - unsigned int i; + unsigned int i, found; + void *entry;
xa_store_order(xa, index, order, xa, GFP_KERNEL); + xa_set_mark(xa, index, XA_MARK_1);
xas_split_alloc(&xas, xa, order, GFP_KERNEL); xas_lock(&xas); @@ -1574,6 +1576,16 @@ static void check_split_1(struct xarray *xa, unsigned long index, xa_set_mark(xa, index, XA_MARK_0); XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0));
+ xas_set_order(&xas, index, 0); + found = 0; + rcu_read_lock(); + xas_for_each_marked(&xas, entry, ULONG_MAX, XA_MARK_1) { + found++; + XA_BUG_ON(xa, xa_is_internal(entry)); + } + rcu_read_unlock(); + XA_BUG_ON(xa, found != 1 << (order - new_order)); + xa_destroy(xa); }
diff --git a/lib/xarray.c b/lib/xarray.c index 1c87d871cacf..32d4bac8c94c 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -970,8 +970,22 @@ static unsigned int node_get_marks(struct xa_node *node, unsigned int offset) return marks; }
+static inline void node_mark_slots(struct xa_node *node, unsigned int sibs, + xa_mark_t mark) +{ + int i; + + if (sibs == 0) + node_mark_all(node, mark); + else { + for (i = 0; i < XA_CHUNK_SIZE; i += sibs + 1) + node_set_mark(node, i, mark); + } +} + static void node_set_marks(struct xa_node *node, unsigned int offset, - struct xa_node *child, unsigned int marks) + struct xa_node *child, unsigned int sibs, + unsigned int marks) { xa_mark_t mark = XA_MARK_0;
@@ -979,7 +993,7 @@ static void node_set_marks(struct xa_node *node, unsigned int offset, if (marks & (1 << (__force unsigned int)mark)) { node_set_mark(node, offset, mark); if (child) - node_mark_all(child, mark); + node_mark_slots(child, sibs, mark); } if (mark == XA_MARK_MAX) break; @@ -1078,7 +1092,8 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order) child->nr_values = xa_is_value(entry) ? XA_CHUNK_SIZE : 0; RCU_INIT_POINTER(child->parent, node); - node_set_marks(node, offset, child, marks); + node_set_marks(node, offset, child, xas->xa_sibs, + marks); rcu_assign_pointer(node->slots[offset], xa_mk_node(child)); if (xa_is_value(curr)) @@ -1087,7 +1102,7 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order) } else { unsigned int canon = offset - xas->xa_sibs;
- node_set_marks(node, canon, NULL, marks); + node_set_marks(node, canon, NULL, 0, marks); rcu_assign_pointer(node->slots[canon], entry); while (offset > canon) rcu_assign_pointer(node->slots[offset--],
From: Muhammad Usama Anjum usama.anjum@collabora.com
mainline inclusion from mainline-v6.9-rc6 commit 6db7412c142006985a15765785cf6c0c0dd75374 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Fix the warnings by initializing the variable and marking it as unused. The warnings were caught using clang.
split_huge_page_test.c:303:6: warning: variable 'dummy' set but not used [-Wunused-but-set-variable]
  303 | int dummy;
      |     ^
split_huge_page_test.c:343:3: warning: variable 'dummy' is uninitialized when used here [-Wuninitialized]
  343 | dummy += *(*addr + i);
      | ^~~~~
split_huge_page_test.c:303:11: note: initialize the variable 'dummy' to silence this warning
  303 | int dummy;
      |           ^
      |            = 0
2 warnings generated.
Link: https://lkml.kernel.org/r/20240416162658.3353622-1-usama.anjum@collabora.com Fixes: fc4d182316bd ("mm: huge_memory: enable debugfs to split huge pages to any order") Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com Reviewed-by: Zi Yan ziy@nvidia.com Cc: Bill Wendling morbo@google.com Cc: Justin Stitt justinstitt@google.com Cc: Muhammad Usama Anjum usama.anjum@collabora.com Cc: Nathan Chancellor nathan@kernel.org Cc: Nick Desaulniers ndesaulniers@google.com Cc: Shuah Khan shuah@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- tools/testing/selftests/mm/split_huge_page_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index 6c988bd2f335..d3c7f5fb3e7b 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -300,7 +300,7 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, char **addr) { size_t i; - int dummy; + int __attribute__((unused)) dummy = 0;
srand(time(NULL));
From: "Matthew Wilcox (Oracle)" willy@infradead.org
next inclusion from next-20240510 commit 4c773a44257f1f8a88ebc2c1bf67799bb4d6f409 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We don't actually use any parts of struct page; all we do is check the value of the pointer. So give the pointer the appropriate name & type.
Link: https://lkml.kernel.org/r/20240402201659.918308-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/swap_state.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c index d0636532d1ab..40b84dc47974 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -72,11 +72,11 @@ void *get_shadow_from_swap_cache(swp_entry_t entry) { struct address_space *address_space = swap_address_space(entry); pgoff_t idx = swp_offset(entry); - struct page *page; + void *shadow;
- page = xa_load(&address_space->i_pages, idx); - if (xa_is_value(page)) - return page; + shadow = xa_load(&address_space->i_pages, idx); + if (xa_is_value(shadow)) + return shadow; return NULL; }
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion from mainline-v6.8-rc1 commit 8d294a8c6393afbde59cf14a0e8413df4b206698 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The page in question is either freshly allocated or known to be in the swap cache; these assertions are not particularly useful.
Link: https://lkml.kernel.org/r/20231212164813.2540119-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org [ Dep-of: f238b8c33c67 ("arm64: mm: swap: support THP_SWAP on hardware with MTE"). ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/swapfile.c | 4 ---- 1 file changed, 4 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c index dff84e2e219f..75d5e6977d0d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1887,10 +1887,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, */ arch_swap_restore(entry, page_folio(page));
- /* See do_swap_page() */ - BUG_ON(!PageAnon(page) && PageMappedToDisk(page)); - BUG_ON(PageAnon(page) && PageAnonExclusive(page)); - dec_mm_counter(vma->vm_mm, MM_SWAPENTS); inc_mm_counter(vma->vm_mm, MM_ANONPAGES); add_reliable_page_counter(page, vma->vm_mm, 1);
From: Barry Song v-songbaohua@oppo.com
next inclusion from next-20240510 commit f238b8c33c6738f146bbfbb09b78870ea157c2b7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up THP_SWAP on ARM64, but it doesn't enable THP_SWAP on hardware with MTE, as the MTE code works on the assumption that tag save/restore always handles a folio with only one page.
This limitation should be removed as more and more ARM64 SoCs have this feature, and the co-existence of MTE and THP_SWAP is becoming increasingly important.
This patch makes MTE tag saving support large folios, so we no longer need to split large folios into base pages for swap-out on ARM64 SoCs with MTE.
arch_prepare_to_swap() should take a folio rather than a page as its parameter because we support THP swap-out as a whole. It saves tags for all pages in a large folio.
As we now restore tags based on the folio, in arch_swap_restore() we may incur some extra loops and early exits while refaulting a large folio that is still in the swapcache in do_swap_page(). If a large folio has nr pages, do_swap_page() only sets the PTE of the particular page that caused the fault. Thus do_swap_page() runs nr times, and each time arch_swap_restore() loops nr times over the subpages of the folio, so the algorithmic complexity becomes O(nr^2).
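To make the cost concrete (an illustrative worst case, not a measurement): for a PMD-sized folio on a 4K-page kernel, nr = 512, so while the folio sits in the swapcache, refaulting every subpage runs do_swap_page() 512 times, and each call makes arch_swap_restore() loop over all 512 subpages, i.e. up to 512 * 512 = 262144 tag-restore iterations, most of which exit early because the tags have already been restored.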
Once we support mapping large folios in do_swap_page(), the extra loops and early exits will decrease, though they will not be removed completely, as a large folio might get partially tagged in corner cases such as: 1. a large folio in the swapcache can be partially unmapped, so the MTE tags for the unmapped pages will be invalidated; 2. users might use mprotect() to set MTE tags on part of a large folio.
arch_thp_swp_supported() is dropped since ARM64 MTE was the only user that needed it.
Link: https://lkml.kernel.org/r/20240322114136.61386-2-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Reviewed-by: Steven Price steven.price@arm.com Acked-by: Chris Li chrisl@kernel.org Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will@kernel.org Cc: Mark Rutland mark.rutland@arm.com Cc: David Hildenbrand david@redhat.com Cc: Kemeng Shi shikemeng@huaweicloud.com Cc: "Matthew Wilcox (Oracle)" willy@infradead.org Cc: Anshuman Khandual anshuman.khandual@arm.com Cc: Peter Collingbourne pcc@google.com Cc: Yosry Ahmed yosryahmed@google.com Cc: Peter Xu peterx@redhat.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: "Mike Rapoport (IBM)" rppt@kernel.org Cc: Hugh Dickins hughd@google.com Cc: "Aneesh Kumar K.V" aneesh.kumar@linux.ibm.com Cc: Rick Edgecombe rick.p.edgecombe@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/swap_slots.c mm/swapfile.c [ Contexts conflicts with commit 2ba9c9b9104b and f00f48436c78. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- arch/arm64/include/asm/pgtable.h | 19 ++------------ arch/arm64/mm/mteswap.c | 45 ++++++++++++++++++++++++++++++++ include/linux/huge_mm.h | 12 --------- include/linux/pgtable.h | 2 +- mm/internal.h | 14 ++++++++++ mm/memory.c | 2 +- mm/page_io.c | 2 +- mm/shmem.c | 2 +- mm/swap_slots.c | 2 +- mm/swapfile.c | 2 +- 10 files changed, 67 insertions(+), 35 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index b72196ed5a76..07948fe59b9d 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -45,12 +45,6 @@ __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1) #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline bool arch_thp_swp_supported(void) -{ - return !system_supports_mte(); -} -#define arch_thp_swp_supported arch_thp_swp_supported - /* * Outside of a few very special situations (e.g. hibernation), we always * use broadcast TLB invalidation instructions, therefore a spurious page @@ -1095,12 +1089,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma, #ifdef CONFIG_ARM64_MTE
#define __HAVE_ARCH_PREPARE_TO_SWAP -static inline int arch_prepare_to_swap(struct page *page) -{ - if (system_supports_mte()) - return mte_save_tags(page); - return 0; -} +extern int arch_prepare_to_swap(struct folio *folio);
#define __HAVE_ARCH_SWAP_INVALIDATE static inline void arch_swap_invalidate_page(int type, pgoff_t offset) @@ -1116,11 +1105,7 @@ static inline void arch_swap_invalidate_area(int type) }
#define __HAVE_ARCH_SWAP_RESTORE -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) -{ - if (system_supports_mte()) - mte_restore_tags(entry, &folio->page); -} +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
#endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c index a31833e3ddc5..63e8d72f202a 100644 --- a/arch/arm64/mm/mteswap.c +++ b/arch/arm64/mm/mteswap.c @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset) mte_free_tag_storage(tags); }
+static inline void __mte_invalidate_tags(struct page *page) +{ + swp_entry_t entry = page_swap_entry(page); + + mte_invalidate_tags(swp_type(entry), swp_offset(entry)); +} + void mte_invalidate_tags_area(int type) { swp_entry_t entry = swp_entry(type, 0); @@ -83,3 +90,41 @@ void mte_invalidate_tags_area(int type) } xa_unlock(&mte_pages); } + +int arch_prepare_to_swap(struct folio *folio) +{ + long i, nr; + int err; + + if (!system_supports_mte()) + return 0; + + nr = folio_nr_pages(folio); + + for (i = 0; i < nr; i++) { + err = mte_save_tags(folio_page(folio, i)); + if (err) + goto out; + } + return 0; + +out: + while (i--) + __mte_invalidate_tags(folio_page(folio, i)); + return err; +} + +void arch_swap_restore(swp_entry_t entry, struct folio *folio) +{ + long i, nr; + + if (!system_supports_mte()) + return; + + nr = folio_nr_pages(folio); + + for (i = 0; i < nr; i++) { + mte_restore_tags(entry, folio_page(folio, i)); + entry.val++; + } +} diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 4e604758e58a..abf2340a2d18 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -536,16 +536,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order) #define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0) #define split_folio(f) split_folio_to_order(f, 0)
-/* - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to - * limitations in the implementation like arm64 MTE can override this to - * false - */ -#ifndef arch_thp_swp_supported -static inline bool arch_thp_swp_supported(void) -{ - return true; -} -#endif - #endif /* _LINUX_HUGE_MM_H */ diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 87cd26a48038..288dd11b46e5 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1024,7 +1024,7 @@ static inline int arch_unmap_one(struct mm_struct *mm, * prototypes must be defined in the arch-specific asm/pgtable.h file. */ #ifndef __HAVE_ARCH_PREPARE_TO_SWAP -static inline int arch_prepare_to_swap(struct page *page) +static inline int arch_prepare_to_swap(struct folio *folio) { return 0; } diff --git a/mm/internal.h b/mm/internal.h index f2e7972d4a3e..3cd5867bba43 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -76,6 +76,20 @@ static inline int folio_nr_pages_mapped(struct folio *folio) return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED; }
+/* + * Retrieve the first entry of a folio based on a provided entry within the + * folio. We cannot rely on folio->swap as there is no guarantee that it has + * been initialized. Used for calling arch_swap_restore() + */ +static inline swp_entry_t folio_swap(swp_entry_t entry, struct folio *folio) +{ + swp_entry_t swap = { + .val = ALIGN_DOWN(entry.val, folio_nr_pages(folio)), + }; + + return swap; +} + static inline void *folio_raw_mapping(struct folio *folio) { unsigned long mapping = (unsigned long)folio->mapping; diff --git a/mm/memory.c b/mm/memory.c index 1bcbc3697a2f..1b3309b50357 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4175,7 +4175,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * when reading from swap. This metadata may be indexed by swap entry * so this must be called before swap_free(). */ - arch_swap_restore(entry, folio); + arch_swap_restore(folio_swap(entry, folio), folio);
/* * Remove the swap entry and conditionally try to free up the swapcache. diff --git a/mm/page_io.c b/mm/page_io.c index af65db0be731..ea8d57b9b3ae 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) * Arch code may have to preserve more data than just the page * contents, e.g. memory tags. */ - ret = arch_prepare_to_swap(&folio->page); + ret = arch_prepare_to_swap(folio); if (ret) { folio_mark_dirty(folio); folio_unlock(folio); diff --git a/mm/shmem.c b/mm/shmem.c index 0a39dbdcfb2e..a7550982a13d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1893,7 +1893,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, * Some architectures may have to restore extra metadata to the * folio after reading from swap. */ - arch_swap_restore(swap, folio); + arch_swap_restore(folio_swap(swap, folio), folio);
if (shmem_should_replace_folio(folio, gfp)) { error = shmem_replace_folio(&folio, gfp, info, index); diff --git a/mm/swap_slots.c b/mm/swap_slots.c index c7781364fa50..9971cdf30e20 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -434,7 +434,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
if (folio_test_large(folio)) { - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) + if (IS_ENABLED(CONFIG_THP_SWAP)) get_swap_pages(1, &entry, folio_nr_pages(folio), type); goto out; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 75d5e6977d0d..961862f4aa57 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1885,7 +1885,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, * when reading from swap. This metadata may be indexed by swap entry * so this must be called before swap_free(). */ - arch_swap_restore(entry, page_folio(page)); + arch_swap_restore(folio_swap(entry, folio), page_folio(page));
dec_mm_counter(vma->vm_mm, MM_SWAPENTS); inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
next inclusion from next-20240510 commit 0fd44ab213bcfb26c47eedaa0985e4b5dbf0a494 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "Fix I/O high when memory almost met memcg limit", v2.
Recently, when installing a package in a docker container that had almost reached its memory limit, the installer was unresponsive for more than 15 minutes. During this period, I/O stayed high (~1G/s) and affected the whole machine. I've constructed a use case as follows:
1. create a docker:
$ cat test.sh
#!/bin/bash
docker rm centos7 --force
docker create --name centos7 --memory 4G --memory-swap 6G centos:7 /usr/sbin/init
docker start centos7
sleep 1
docker cp ./alloc_page centos7:/
docker cp ./reproduce.sh centos7:/
docker exec -it centos7 /bin/bash
2. try to reproduce the problem in the docker:
$ cat reproduce.sh
#!/bin/bash

while true; do
	flag=$(ps -ef | grep -v grep | grep alloc_page | wc -l)
	if [ "$flag" -eq 0 ]; then
		/alloc_page &
	fi

	sleep 30

	start_time=$(date +%s)
	yum install -y expect > /dev/null 2>&1
	end_time=$(date +%s)
	elapsed_time=$((end_time - start_time))

	echo "$elapsed_time seconds"
	yum remove -y expect > /dev/null 2>&1
done
$ cat alloc_page.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define SIZE 1*1024*1024 //1M

int main()
{
	void *addr = NULL;
	int i;

	for (i = 0; i < 1024 * 6 - 50; i++) {
		addr = (void *)malloc(SIZE);
		if (!addr)
			return -1;

		memset(addr, 0, SIZE);
	}
	sleep(99999);
	return 0;
}
We found that this problem is caused by a lot of meaningless read-ahead. Since the docker has almost reached its memory limit, pages are reclaimed immediately after read-ahead and then read ahead again immediately. The program executes slowly and wastes a lot of I/O resources.
These two patches aim to break the read-ahead in the above scenario.
[1] https://lore.kernel.org/linux-mm/c2f4a2fa-3bde-72ce-66f5-db81a373fdbc@huawei... [2] https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/ [3] https://lore.kernel.org/all/20240201173130.frpaqpy7iyzias5j@quack3/
This patch (of 2):
When filemap_add_folio() returns -ENOMEM, break the read-ahead loop, like what is already done when filemap_alloc_folio() fails.
Link: https://lkml.kernel.org/r/20240322093555.226789-1-liushixin2@huawei.com Link: https://lkml.kernel.org/r/20240322093555.226789-2-liushixin2@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Signed-off-by: Jinjiang Tu tujinjiang@huawei.com Reviewed-by: Jan Kara jack@suse.cz Cc: Al Viro viro@ZenIV.linux.org.uk Cc: Christian Brauner brauner@kernel.org Cc: Liu Shixin liushixin2@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/readahead.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c index 5464f0bee669..689e003951fe 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -231,6 +231,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, */ for (i = 0; i < nr_to_read; i++) { struct folio *folio = xa_load(&mapping->i_pages, index + i); + int ret;
if (folio && !xa_is_value(folio)) { /* @@ -250,9 +251,12 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, folio = filemap_alloc_folio(gfp_mask, 0); if (!folio) break; - if (filemap_add_folio(mapping, folio, index + i, - gfp_mask) < 0) { + + ret = filemap_add_folio(mapping, folio, index + i, gfp_mask); + if (ret < 0) { folio_put(folio); + if (ret == -ENOMEM) + break; read_pages(ractl); ractl->_index++; i = ractl->_index + ractl->_nr_pages - index - 1;
next inclusion from next-20240510 commit 5c46d5319bde73075c5f2cefd555426848eb506f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If there are too many folios that were recently evicted in a file, then they will probably continue to be evicted. In such a situation, there is no positive effect in reading ahead this file; it is only a waste of IO.
mmap_miss is increased in do_sync_mmap_readahead() and decreased in both do_async_mmap_readahead() and filemap_map_pages(). In order to skip read-ahead in the above scenario, mmap_miss has to increase beyond MMAP_LOTSAMISS. This can be done by not decreasing mmap_miss when the folio has the workingset flag. The async path does not need the same treatment because, in the above scenario, it is hard to reach the async path at all.
[liushixin2@huawei.com: add comments] Link: https://lkml.kernel.org/r/20240326065026.1910584-1-liushixin2@huawei.com Link: https://lkml.kernel.org/r/20240322093555.226789-3-liushixin2@huawei.com Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Jan Kara jack@suse.cz Cc: Al Viro viro@ZenIV.linux.org.uk Cc: Christian Brauner brauner@kernel.org Cc: Jinjiang Tu tujinjiang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/filemap.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c index 7bbe26aa9cc8..a274d2c5e232 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3527,7 +3527,15 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf, if (PageHWPoison(page + count)) goto skip;
- (*mmap_miss)++; + /* + * If there are too many folios that are recently evicted + * in a file, they will probably continue to be evicted. + * In such situation, read-ahead is only a waste of IO. + * Don't decrease mmap_miss in this scenario to make sure + * we can stop read-ahead. + */ + if (!folio_test_workingset(folio)) + (*mmap_miss)++;
/* * NOTE: If there're PTE markers, we'll leave them to be @@ -3578,7 +3586,9 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf, if (PageHWPoison(page)) return ret;
- (*mmap_miss)++; + /* See comment of filemap_map_folio_range() */ + if (!folio_test_workingset(folio)) + (*mmap_miss)++;
/* * NOTE: If there're PTE markers, we'll leave them to be
From: Barry Song v-songbaohua@oppo.com
mainline inclusion from mainline-v6.9-rc1 commit cc864ebba5f612ce2960e7e09322a193e8fda0d7 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The purpose is to stop splitting large folios whose mapcount is 2 or above. Folios with an estimated sharers count of 0 are still perfectly good candidates, in fact even better ones than those with an estimated sharers count of 1.
Consider a pte-mapped large folio with 16 subpages: if we unmap subpages 1-15, the current code will split the folio and reclaim it when madvise works on this folio; but if we unmap subpage 0, we will keep this folio and break out. This is weird.
For pmd-mapped large folios, we can still use "= 1" as the condition, as we anyway have the entire mapping for it. So this patch doesn't change the condition for pmd-mapped large folios. This also explains why we had been using "= 1" for both pmd-mapped and pte-mapped large folios before commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios"): in the past, we used the mapcount of the specific subpage, and since that subpage had its PTE present, its mapcount couldn't be 0.
The problem can be quite easily reproduced by writing a small program that unmaps the first subpage of a pte-mapped large folio versus unmapping any subpage other than the first one.
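A minimal sketch of such a reproducer (hypothetical code, not part of this patch; it assumes a kernel with 64KiB anonymous mTHP enabled and takes the index of the subpage to unmap as argv[1]):

	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_PAGEOUT
	#define MADV_PAGEOUT 21
	#endif

	#define FOLIO_SZ (64UL * 1024)
	#define PAGE_SZ  (4UL * 1024)

	int main(int argc, char **argv)
	{
		/* Over-map and align so the region can be backed by one 64KiB folio. */
		char *raw = mmap(NULL, 2 * FOLIO_SZ, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* Subpage to unmap: 0 hits the "keep and break" case, 1..15 do not. */
		size_t idx = argc > 1 ? strtoul(argv[1], NULL, 0) : 0;
		char *buf;

		if (raw == MAP_FAILED)
			return 1;
		buf = (char *)(((uintptr_t)raw + FOLIO_SZ - 1) & ~(FOLIO_SZ - 1));

		memset(buf, 1, FOLIO_SZ);				/* fault in the large folio */
		madvise(buf + idx * PAGE_SZ, PAGE_SZ, MADV_DONTNEED);	/* partially unmap it */
		madvise(buf, FOLIO_SZ, MADV_PAGEOUT);			/* run the sharers check */
		return 0;
	}

Comparing the two cases (e.g. via smaps or the kernel's THP/mTHP split statistics, depending on kernel version) then shows whether the folio was split and reclaimed or kept.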
Link: https://lkml.kernel.org/r/20240221085036.105621-1-21cnbao@gmail.com Fixes: 2f406263e3e9 ("madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check") Signed-off-by: Barry Song v-songbaohua@oppo.com Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yin Fengwei fengwei.yin@intel.com Cc: Yu Zhao yuzhao@google.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox willy@infradead.org Cc: Minchan Kim minchan@kernel.org Cc: Yang Shi shy828301@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org [ Dep-of: ebb34f78d72c ("mm: convert folio_estimated_sharers() to folio_likely_mapped_shared(). ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/madvise.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c index 2b821024a38e..8df46ebffdea 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -485,7 +485,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err;
- if (folio_estimated_sharers(folio) != 1) + if (folio_estimated_sharers(folio) > 1) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break;
From: Barry Song v-songbaohua@oppo.com
next inclusion from next-20240510 commit 73bc32875ee9b1881dd780308c6793fe463fe803 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE modifications preceded by pte clear. While iterating over PTEs of a large folio, it only starts acquiring PTL from the first valid (present) PTE. PTE modifications can temporarily set PTEs to pte_none. Consequently, the initial PTEs of a large folio might be skipped in try_to_unmap_one().
For example, for an anon folio, if we skip PTE0, we may have PTE0 which is still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after try_to_unmap_one().
So the folio will still be mapped; it fails to be reclaimed and is put back to the LRU in this round.
This also breaks PTE optimizations such as CONT-PTE on this large folio and may lead to an accidental folio_split() afterwards. And since part of the PTEs are now swap entries, accessing those parts will introduce the overhead of do_swap_page(). Although the kernel can withstand all of the above issues, the situation still seems quite awkward and warrants making it more ideal.
The same race also occurs with small folios, but they have only one PTE, thus, it won't be possible for them to be partially unmapped.
This patch holds PTL from PTE0, allowing us to avoid reading PTE values that are in the process of being transformed. With stable PTE values, we can ensure that this large folio is either completely reclaimed or that all PTEs remain untouched in this round.
A corner case is that if we hold PTL from PTE0 and most initial PTEs have been really unmapped before that, we may increase the duration of holding PTL. Thus we only apply this optimization to folios which are still entirely mapped (not in deferred_split list).
[akpm@linux-foundation.org: rewrap comment, per Matthew] Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Acked-by: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Chris Li chrisl@kernel.org Cc: Chuanhua Han hanchuanhua@oppo.com Cc: Gao Xiang xiang@kernel.org Cc: Huang, Ying ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/vmscan.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c index e1aa8bd796e9..206be0d30d57 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1954,6 +1954,20 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (folio_test_pmd_mappable(folio)) flags |= TTU_SPLIT_HUGE_PMD; + /* + * Without TTU_SYNC, try_to_unmap will only begin to + * hold PTL from the first present PTE within a large + * folio. Some initial PTEs might be skipped due to + * races with parallel PTE writes in which PTEs can be + * cleared temporarily before being written new present + * values. This will lead to a large folio is still + * mapped while some subpages have been partially + * unmapped after try_to_unmap; TTU_SYNC helps + * try_to_unmap acquire PTL from the first PTE, + * eliminating the influence of temporary PTE values. + */ + if (folio_test_large(folio) && list_empty(&folio->_deferred_list)) + flags |= TTU_SYNC;
try_to_unmap(folio, flags); if (folio_mapped(folio)) {
From: Zi Yan ziy@nvidia.com
next inclusion from next-20240510 commit 7262f208ca681385d133844be8a58d9b4ca185f7 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If the source folio is on deferred split list, it is likely some subpages are not used. Split it before migration to avoid migrating unused subpages.
Commit 616b8371539a6 ("mm: thp: enable thp migration in generic path") did not check if a THP is on deferred split list before migration, thus, the destination THP is never put on deferred split list even if the source THP might be. The opportunity of reclaiming free pages in a partially mapped THP during deferred list scanning is lost, but no other harmful consequence is present[1].
[1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia...
[zi.yan@sent.com: fix an error in migrate_misplaced_folio()] Link: https://lkml.kernel.org/r/20240326150031.569387-1-zi.yan@sent.com Link: https://lkml.kernel.org/r/20240322193304.522496-1-zi.yan@sent.com Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path") Signed-off-by: Zi Yan ziy@nvidia.com Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com Acked-by: David Hildenbrand david@redhat.com Cc: Huang, Ying ying.huang@intel.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Ryan Roberts ryan.roberts@arm.com Cc: SeongJae Park sj@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/migrate.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+)
diff --git a/mm/migrate.c b/mm/migrate.c index cd2a24c6a745..936ed28f90c8 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1653,6 +1653,29 @@ static int migrate_pages_batch(struct list_head *from,
cond_resched();
+ /* + * The rare folio on the deferred split list should + * be split now. It should not count as a failure. + * Only check it without removing it from the list. + * Since the folio can be on deferred_split_scan() + * local list and removing it can cause the local list + * corruption. Folio split process below can handle it + * with the help of folio_ref_freeze(). + * + * nr_pages > 2 is needed to avoid checking order-1 + * page cache folios. They exist, in contrast to + * non-existent order-1 anonymous folios, and do not + * use _deferred_list. + */ + if (nr_pages > 2 && + !list_empty(&folio->_deferred_list)) { + if (try_split_folio(folio, split_folios) == 0) { + stats->nr_thp_split += is_thp; + stats->nr_split++; + continue; + } + } + /* * Large folio migration might be unsupported or * the allocation might be failed so we should retry
From: David Hildenbrand david@redhat.com
next inclusion from next-20240510 commit ebb34f78d72c2320620ba6d55cb22a52949047a1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Callers of folio_estimated_sharers() only care about "mapped shared vs. mapped exclusively", not the exact estimate of sharers. Let's consolidate and unify the condition users are checking. While at it clarify the semantics and extend the discussion on the fuzziness.
Use the "likely mapped shared" terminology to better express what the (adjusted) function actually checks.
Whether a partially-mappable folio is more likely to not be partially mapped than partially mapped is debatable. In the future, we might be able to improve our estimate for partially-mappable folios, though.
Note that we will now consistently detect "mapped shared" only if the first subpage is actually mapped multiple times. When the first subpage is not mapped, we will consistently detect it as "mapped exclusively". This change should currently only affect the usage in madvise_free_pte_range() and queue_folios_pte_range() for large folios: if the first page was already unmapped, we would have skipped the folio.
[david@redhat.com: folio_likely_mapped_shared() kerneldoc fixup] Link: https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com Link: https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Khalid Aziz khalid.aziz@oracle.com Acked-by: Barry Song v-songbaohua@oppo.com Reviewed-by: Vishal Moola (Oracle) vishal.moola@gmail.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/mempolicy.c [ Context conflicts due to lack of commit 1cb5d11a370f. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/mm.h | 48 ++++++++++++++++++++++++++++++++++++---------- mm/huge_memory.c | 2 +- mm/madvise.c | 6 +++--- mm/memory.c | 2 +- mm/mempolicy.c | 14 ++++++-------- mm/migrate.c | 8 ++++---- 6 files changed, 53 insertions(+), 27 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 83519e1cfc95..f86fd573a4a1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2160,21 +2160,49 @@ static inline size_t folio_size(struct folio *folio) }
/** - * folio_estimated_sharers - Estimate the number of sharers of a folio. + * folio_likely_mapped_shared - Estimate if the folio is mapped into the page + * tables of more than one MM * @folio: The folio. * - * folio_estimated_sharers() aims to serve as a function to efficiently - * estimate the number of processes sharing a folio. This is done by - * looking at the precise mapcount of the first subpage in the folio, and - * assuming the other subpages are the same. This may not be true for large - * folios. If you want exact mapcounts for exact calculations, look at - * page_mapcount() or folio_total_mapcount(). + * This function checks if the folio is currently mapped into more than one + * MM ("mapped shared"), or if the folio is only mapped into a single MM + * ("mapped exclusively"). * - * Return: The estimated number of processes sharing a folio. + * As precise information is not easily available for all folios, this function + * estimates the number of MMs ("sharers") that are currently mapping a folio + * using the number of times the first page of the folio is currently mapped + * into page tables. + * + * For small anonymous folios (except KSM folios) and anonymous hugetlb folios, + * the return value will be exactly correct, because they can only be mapped + * at most once into an MM, and they cannot be partially mapped. + * + * For other folios, the result can be fuzzy: + * #. For partially-mappable large folios (THP), the return value can wrongly + * indicate "mapped exclusively" (false negative) when the folio is + * only partially mapped into at least one MM. + * #. For pagecache folios (including hugetlb), the return value can wrongly + * indicate "mapped shared" (false positive) when two VMAs in the same MM + * cover the same file range. + * #. For (small) KSM folios, the return value can wrongly indicate "mapped + * shared" (false negative), when the folio is mapped multiple times into + * the same MM. + * + * Further, this function only considers current page table mappings that + * are tracked using the folio mapcount(s). + * + * This function does not consider: + * #. If the folio might get mapped in the (near) future (e.g., swapcache, + * pagecache, temporary unmapping for migration). + * #. If the folio is mapped differently (VM_PFNMAP). + * #. If hugetlb page table sharing applies. Callers might want to check + * hugetlb_pmd_shared(). + * + * Return: Whether the folio is estimated to be mapped into more than one MM. */ -static inline int folio_estimated_sharers(struct folio *folio) +static inline bool folio_likely_mapped_shared(struct folio *folio) { - return page_mapcount(folio_page(folio, 0)); + return page_mapcount(folio_page(folio, 0)) > 1; }
#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE diff --git a/mm/huge_memory.c b/mm/huge_memory.c index fd3ff0016ca4..5e8374aa8b59 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1833,7 +1833,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this folio, we couldn't discard * the folio unless they all do MADV_FREE so let's skip the folio. */ - if (folio_estimated_sharers(folio) != 1) + if (folio_likely_mapped_shared(folio)) goto out;
if (!folio_trylock(folio)) diff --git a/mm/madvise.c b/mm/madvise.c index 8df46ebffdea..e6570812f723 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -409,7 +409,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, folio = pfn_folio(pmd_pfn(orig_pmd));
/* Do not interfere with other mappings of this folio */ - if (folio_estimated_sharers(folio) != 1) + if (folio_likely_mapped_shared(folio)) goto huge_unlock;
if (pageout_anon_only_filter && !folio_test_anon(folio)) @@ -485,7 +485,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (folio_test_large(folio)) { int err;
- if (folio_estimated_sharers(folio) > 1) + if (folio_likely_mapped_shared(folio)) break; if (pageout_anon_only_filter && !folio_test_anon(folio)) break; @@ -704,7 +704,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (folio_test_large(folio)) { int err;
- if (folio_estimated_sharers(folio) != 1) + if (folio_likely_mapped_shared(folio)) break; if (!folio_trylock(folio)) break; diff --git a/mm/memory.c b/mm/memory.c index 1b3309b50357..2a0280384b67 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5098,7 +5098,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * Flag if the folio is shared between multiple address spaces. This * is later used when determining whether to group tasks together */ - if (folio_estimated_sharers(folio) > 1 && (vma->vm_flags & VM_SHARED)) + if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED)) flags |= TNF_SHARED;
nid = folio_nid(folio); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index f28f4c277099..a80f99751904 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -608,12 +608,11 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, * With MPOL_MF_MOVE, we try to migrate only unshared folios. If it * is shared it is likely not worth migrating. * - * To check if the folio is shared, ideally we want to make sure - * every page is mapped to the same process. Doing that is very - * expensive, so check the estimated mapcount of the folio instead. + * See folio_likely_mapped_shared() on possible imprecision when we + * cannot easily detect if a folio is shared. */ if (flags & (MPOL_MF_MOVE_ALL) || - (flags & MPOL_MF_MOVE && folio_estimated_sharers(folio) == 1 && + (flags & MPOL_MF_MOVE && !folio_likely_mapped_shared(folio) && !hugetlb_pmd_shared(pte))) { if (!isolate_hugetlb(folio, qp->pagelist) && (flags & MPOL_MF_STRICT)) @@ -1040,11 +1039,10 @@ static int migrate_folio_add(struct folio *folio, struct list_head *foliolist, * We try to migrate only unshared folios. If it is shared it * is likely not worth migrating. * - * To check if the folio is shared, ideally we want to make sure - * every page is mapped to the same process. Doing that is very - * expensive, so check the estimated mapcount of the folio instead. + * See folio_likely_mapped_shared() on possible imprecision when we + * cannot easily detect if a folio is shared. */ - if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) { + if ((flags & MPOL_MF_MOVE_ALL) || !folio_likely_mapped_shared(folio)) { if (folio_isolate_lru(folio)) { list_add_tail(&folio->lru, foliolist); node_stat_mod_folio(folio, diff --git a/mm/migrate.c b/mm/migrate.c index 936ed28f90c8..141509ec9a48 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2599,11 +2599,11 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, /* * Don't migrate file folios that are mapped in multiple processes * with execute permissions as they are probably shared libraries. - * To check if the folio is shared, ideally we want to make sure - * every page is mapped to the same process. Doing that is very - * expensive, so check the estimated mapcount of the folio instead. + * + * See folio_likely_mapped_shared() on possible imprecision when we + * cannot easily detect if a folio is shared. */ - if (folio_estimated_sharers(folio) != 1 && folio_is_file_lru(folio) && + if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) && (vma->vm_flags & VM_EXEC)) goto out;
From: John Hubbard jhubbard@nvidia.com
next inclusion from next-20240510 commit a8353dc98f3ae570297e5e25cc05fc7d6b7f0e7b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
1. Add information about the behavior of huge page splitting, with respect to page/folio refcounts, and gup/pup pins.
2. Update and clarify the existing documentation, to compensate for the ravages of time and code change.
Link: https://lkml.kernel.org/r/20240325044452.217463-1-jhubbard@nvidia.com Signed-off-by: John Hubbard jhubbard@nvidia.com Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: David Hildenbrand david@redhat.com Cc: Matthew Wilcox willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/huge_memory.c | 42 +++++++++++++++++++++++++++--------------- 1 file changed, 27 insertions(+), 15 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5e8374aa8b59..a548caef1c2f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2879,28 +2879,40 @@ bool can_split_folio(struct folio *folio, int *pextra_pins) }
/* - * This function splits huge page into pages in @new_order. @page can point to - * any subpage of huge page to split. Split doesn't change the position of - * @page. + * This function splits a large folio into smaller folios of order @new_order. + * @page can point to any page of the large folio to split. The split operation + * does not change the position of @page. * - * NOTE: order-1 anonymous folio is not supported because _deferred_list, - * which is used by partially mapped folios, is stored in subpage 2 and an - * order-1 folio only has subpage 0 and 1. File-backed order-1 folios are OK, - * since they do not use _deferred_list. + * Prerequisites: * - * Only caller must hold pin on the @page, otherwise split fails with -EBUSY. - * The huge page must be locked. + * 1) The caller must hold a reference on the @page's owning folio, also known + * as the large folio. + * + * 2) The large folio must be locked. + * + * 3) The folio must not be pinned. Any unexpected folio references, including + * GUP pins, will result in the folio not getting split; instead, the caller + * will receive an -EBUSY. + * + * 4) @new_order > 1, usually. Splitting to order-1 anonymous folios is not + * supported for non-file-backed folios, because folio->_deferred_list, which + * is used by partially mapped folios, is stored in subpage 2, but an order-1 + * folio only has subpages 0 and 1. File-backed order-1 folios are supported, + * since they do not use _deferred_list. + * + * After splitting, the caller's folio reference will be transferred to @page, + * resulting in a raised refcount of @page after this call. The other pages may + * be freed if they are not mapped. * * If @list is null, tail pages will be added to LRU list, otherwise, to @list. * - * Pages in new_order will inherit mapping, flags, and so on from the hugepage. + * Pages in @new_order will inherit the mapping, flags, and so on from the + * huge page. * - * GUP pin and PG_locked transferred to @page or the compound page @page belongs - * to. Rest subpages can be freed if they are not mapped. + * Returns 0 if the huge page was split successfully. * - * Returns 0 if the hugepage is split successfully. - * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under - * us. + * Returns -EBUSY if @page's folio is pinned, or if the anon_vma disappeared + * from under us. */ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, unsigned int new_order)
From: Barry Song v-songbaohua@oppo.com
next inclusion from next-20240510 commit 68dbcf4899f357c707c8bd42326533d729e8e8bb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Fallback rates surpassing 90% have been observed on phones utilizing 64KiB CONT-PTE mTHP. In these scenarios, when one out of every 16 PTEs fails to allocate large folios, the remaining 15 PTEs fallback. Consequently, invoking vma_thp_gfp_mask seems redundant in such cases. Furthermore, abstaining from its use can also contribute to improved code readability.
Link: https://lkml.kernel.org/r/20240329073750.20012-1-21cnbao@gmail.com Signed-off-by: Barry Song v-songbaohua@oppo.com Reviewed-by: Zi Yan ziy@nvidia.com Acked-by: Yu Zhao yuzhao@google.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: John Hubbard jhubbard@nvidia.com Cc: David Hildenbrand david@redhat.com Cc: Alistair Popple apopple@nvidia.com Cc: Anshuman Khandual anshuman.khandual@arm.com Cc: Catalin Marinas catalin.marinas@arm.com Cc: David Rientjes rientjes@google.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Hugh Dickins hughd@google.com Cc: Itaru Kitayama itaru.kitayama@gmail.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Luis Chamberlain mcgrof@kernel.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Yang Shi shy828301@gmail.com Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memory.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c index 2a0280384b67..dcb94f8eefb2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4338,6 +4338,9 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
pte_unmap(pte);
+ if (!orders) + goto fallback; + /* Try allocating the highest of the remaining orders. */ gfp = vma_thp_gfp_mask(vma); while (orders) {
From: Baolin Wang baolin.wang@linux.alibaba.com
next inclusion from next-20240510 commit 6b0ed7b3c77547d2308983a26db11a0d14a60ace category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "support multi-size THP numa balancing", v2.
This patchset tries to support mTHP numa balancing, as a simple solution to start, the NUMA balancing algorithm for mTHP will follow the THP strategy as the basic support. Please find details in each patch.
This patch (of 2):
To support large folio's numa balancing, factor out the numa mapping rebuilding into a new helper as a preparation.
Link: https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.171168306... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: David Hildenbrand david@redhat.com Cc: John Hubbard jhubbard@nvidia.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Ryan Roberts ryan.roberts@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memory.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index dcb94f8eefb2..33edf6a888b3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5043,6 +5043,20 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma, return mpol_misplaced(folio, vma, addr); }
+static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_struct *vma, + bool writable) +{ + pte_t pte, old_pte; + + old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte); + pte = pte_modify(old_pte, vma->vm_page_prot); + pte = pte_mkyoung(pte); + if (writable) + pte = pte_mkwrite(pte, vma); + ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); + update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); +} + static vm_fault_t do_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -5148,13 +5162,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * Make it present again, depending on how arch implements * non-accessible ptes, some can allow access by kernel mode. */ - old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte); - pte = pte_modify(old_pte, vma->vm_page_prot); - pte = pte_mkyoung(pte); - if (writable) - pte = pte_mkwrite(pte, vma); - ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); + numa_rebuild_single_mapping(vmf, vma, writable); pte_unmap_unlock(vmf->pte, vmf->ptl); goto out; }
From: Baolin Wang baolin.wang@linux.alibaba.com
next inclusion from next-20240510 commit d2136d749d76af980b3accd72704eea4eab625bd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now the anonymous page allocation already supports multi-size THP (mTHP), but the numa balancing still prohibits mTHP migration even though it is an exclusive mapping, which is unreasonable.
Allow scanning mTHP: Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages") skips NUMA page migration for shared CoW pages to avoid shared data segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses") changed to use page_count() to avoid migrating GUP pages, which also skips mTHP numa scanning. Theoretically, we could use folio_maybe_dma_pinned() to detect the GUP issue; although there is still a GUP race, that issue seems to have been resolved by commit 80d47f5de5e3. Meanwhile, use folio_likely_mapped_shared() to skip shared CoW pages, though this is not a precise sharers count. To check if the folio is shared, ideally we want to make sure every page is mapped by the same process, but doing that seems expensive, and using the estimated mapcount seems to work when running the autonuma benchmark.
Allow migrating mTHP: As mentioned in the previous thread[1], large folios (including THP) are more susceptible to false sharing issues among threads than 4K base pages, leading to pages ping-ponging back and forth during numa balancing, which is currently not easy to resolve. Therefore, as a start to support mTHP numa balancing, we can follow the PMD-mapped THP's strategy: we reuse the 2-stage filter in should_numa_migrate_memory() to check if the mTHP is being heavily contended among threads (by checking the CPU id and pid of the last access) to avoid false sharing to some degree. Thus, we restore all PTE maps upon the first hint page fault of a large folio, following the PMD-mapped THP's strategy. In the future, we can continue to optimize the NUMA balancing algorithm to avoid the false sharing issue with large folios as much as possible.
Performance data:
Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
Base: 2024-03-25 mm-unstable branch
Enable mTHP to run autonuma-benchmark

mTHP:16K
                        Base        Patched
numa01                  224.70      143.48
numa01_THREAD_ALLOC     118.05       47.43
numa02                   13.45        9.29
numa02_SMT               14.80        7.50

mTHP:64K
                        Base        Patched
numa01                  216.15      114.40
numa01_THREAD_ALLOC     115.35       47.41
numa02                   13.24        9.25
numa02_SMT               14.67        7.34

mTHP:128K
                        Base        Patched
numa01                  205.13      144.45
numa01_THREAD_ALLOC     112.93       41.88
numa02                   13.16        9.18
numa02_SMT               14.81        7.49
[1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity....
[baolin.wang@linux.alibaba.com: v3] Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.171213295... Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.171168306... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: David Hildenbrand david@redhat.com Cc: John Hubbard jhubbard@nvidia.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Ryan Roberts ryan.roberts@arm.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memory.c | 62 +++++++++++++++++++++++++++++++++++++++++---------- mm/mprotect.c | 3 ++- 2 files changed, 52 insertions(+), 13 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c index 33edf6a888b3..2c7577d5e7b7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5044,17 +5044,51 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma, }
static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_struct *vma, + unsigned long fault_addr, pte_t *fault_pte, bool writable) { pte_t pte, old_pte;
- old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte); + old_pte = ptep_modify_prot_start(vma, fault_addr, fault_pte); pte = pte_modify(old_pte, vma->vm_page_prot); pte = pte_mkyoung(pte); if (writable) pte = pte_mkwrite(pte, vma); - ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); + ptep_modify_prot_commit(vma, fault_addr, fault_pte, old_pte, pte); + update_mmu_cache_range(vmf, vma, fault_addr, fault_pte, 1); +} + +static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_struct *vma, + struct folio *folio, pte_t fault_pte, + bool ignore_writable, bool pte_write_upgrade) +{ + int nr = pte_pfn(fault_pte) - folio_pfn(folio); + unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start); + unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE, vma->vm_end); + pte_t *start_ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE; + unsigned long addr; + + /* Restore all PTEs' mapping of the large folio */ + for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) { + pte_t ptent = ptep_get(start_ptep); + bool writable = false; + + if (!pte_present(ptent) || !pte_protnone(ptent)) + continue; + + if (pfn_folio(pte_pfn(ptent)) != folio) + continue; + + if (!ignore_writable) { + ptent = pte_modify(ptent, vma->vm_page_prot); + writable = pte_write(ptent); + if (!writable && pte_write_upgrade && + can_change_pte_writable(vma, addr, ptent)) + writable = true; + } + + numa_rebuild_single_mapping(vmf, vma, addr, start_ptep, writable); + } }
static vm_fault_t do_numa_page(struct vm_fault *vmf) @@ -5062,11 +5096,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct folio *folio = NULL; int nid = NUMA_NO_NODE; - bool writable = false; + bool writable = false, ignore_writable = false; + bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma); int last_cpupid; int target_nid; pte_t pte, old_pte; - int flags = 0; + int flags = 0, nr_pages;
/* * The "pte" at this point cannot be used safely without @@ -5088,7 +5123,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) * is only valid while holding the PT lock. */ writable = pte_write(pte); - if (!writable && vma_wants_manual_pte_write_upgrade(vma) && + if (!writable && pte_write_upgrade && can_change_pte_writable(vma, vmf->address, pte)) writable = true;
@@ -5096,10 +5131,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) if (!folio || folio_is_zone_device(folio)) goto out_map;
- /* TODO: handle PTE-mapped THP */ - if (folio_test_large(folio)) - goto out_map; - /* * Avoid grouping on RO pages in general. RO pages shouldn't hurt as * much anyway since they can be in shared cache state. This misses @@ -5119,6 +5150,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) flags |= TNF_SHARED;
nid = folio_nid(folio); + nr_pages = folio_nr_pages(folio); /* * For memory tiering mode, cpupid of slow memory page is used * to record page access time. So use default value. @@ -5135,6 +5167,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf) } pte_unmap_unlock(vmf->pte, vmf->ptl); writable = false; + ignore_writable = true;
/* Migrate to the requested node */ if (migrate_misplaced_folio(folio, vma, target_nid)) { @@ -5155,14 +5188,19 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
out: if (nid != NUMA_NO_NODE) - task_numa_fault(last_cpupid, nid, 1, flags); + task_numa_fault(last_cpupid, nid, nr_pages, flags); return 0; out_map: /* * Make it present again, depending on how arch implements * non-accessible ptes, some can allow access by kernel mode. */ - numa_rebuild_single_mapping(vmf, vma, writable); + if (folio && folio_test_large(folio)) + numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable, + pte_write_upgrade); + else + numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte, + writable); pte_unmap_unlock(vmf->pte, vmf->ptl); goto out; } diff --git a/mm/mprotect.c b/mm/mprotect.c index f121c46f6e4c..b360577be4f8 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
/* Also skip shared copy-on-write pages */ if (is_cow_mapping(vma->vm_flags) && - folio_ref_count(folio) != 1) + (folio_maybe_dma_pinned(folio) || + folio_likely_mapped_shared(folio))) continue;
/*
From: Jiexun Wang wangjiexun@tinylab.org
mainline inclusion from mainline-v6.7-rc5 commit b2f557a21bc8fffdcd65794eda8a854e024999f3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
I conducted real-time testing and observed that madvise_cold_or_pageout_pte_range() causes significant latency under memory pressure, which can be effectively reduced by adding cond_resched() within the loop.
I tested on the LicheePi 4A board using Cylictest for latency testing and Ftrace for latency tracing. The board uses TH1520 processor and has a memory size of 8GB. The kernel version is 6.5.0 with the PREEMPT_RT patch applied.
The script I tested is as follows:
echo wakeup_rt > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/tracing_on
echo 0 > /sys/kernel/tracing/tracing_max_latency
stress-ng --vm 8 --vm-bytes 2G &
cyclictest --mlockall --smp --priority=99 --distance=0 --duration=30m
echo 0 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace
The tracing results before modification are as follows:
  stress-n-206   3dn.h512    2us :  206:120:R   + [003]  196:  0:R cyclictest
  stress-n-206   3dn.h512    7us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup
 => ttwu_do_activate
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => riscv_timer_interrupt
 => handle_percpu_devid_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => do_irq
  stress-n-206   3dn.h512    9us#: 0
  stress-n-206   3d...3..  2544us : __schedule
  stress-n-206   3d...3..  2545us :  206:120:R ==> [003]  196:  0:R cyclictest
  stress-n-206   3d...3..  2551us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup_sched_switch
 => __schedule
 => preempt_schedule
 => migrate_enable
 => rt_spin_unlock
 => madvise_cold_or_pageout_pte_range
 => walk_pgd_range
 => __walk_page_range
 => walk_page_range
 => madvise_pageout
 => madvise_vma_behavior
 => do_madvise
 => sys_madvise
 => do_trap_ecall_u
 => ret_from_exception
The tracing results after modification are as follows:
  stress-n-232   0dn.h413    1us+:  232:120:R   + [000]  217:  0:R cyclictest
  stress-n-232   0dn.h413   12us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup
 => ttwu_do_activate
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => riscv_timer_interrupt
 => handle_percpu_devid_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => do_irq
  stress-n-232   0dn.h413   19us#: 0
  stress-n-232   0d...3..  1671us : __schedule
  stress-n-232   0d...3..  1676us+:  232:120:R ==> [000]  217:  0:R cyclictest
  stress-n-232   0d...3..  1687us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup_sched_switch
 => __schedule
 => preempt_schedule
 => migrate_enable
 => free_unref_page_list
 => release_pages
 => free_pages_and_swap_cache
 => tlb_batch_pages_flush
 => tlb_flush_mmu
 => unmap_page_range
 => unmap_vmas
 => unmap_region
 => do_vmi_align_munmap.constprop.0
 => do_vmi_munmap
 => __vm_munmap
 => sys_munmap
 => do_trap_ecall_u
 => ret_from_exception
After the modification, the cause of maximum latency is no longer madvise_cold_or_pageout_pte_range(), so this modification can reduce the latency caused by madvise_cold_or_pageout_pte_range().
Currently the madvise_cold_or_pageout_pte_range() function exhibits significant latency under memory pressure, which can be effectively reduced by adding cond_resched() within the loop.
When the batch_count reaches SWAP_CLUSTER_MAX, we reschedule the task to ensure fairness and avoid long lock holding times.
Link: https://lkml.kernel.org/r/85363861af65fac66c7a98c251906afc0d9c8098.169529104... Signed-off-by: Jiexun Wang wangjiexun@tinylab.org Cc: Zhangjin Wu falcon@tinylab.org Signed-off-by: Andrew Morton akpm@linux-foundation.org [ Dep-of: 3931b871c493 ("mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD"). ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/madvise.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
diff --git a/mm/madvise.c b/mm/madvise.c index e6570812f723..30b218d658a8 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -379,6 +379,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, struct folio *folio = NULL; LIST_HEAD(folio_list); bool pageout_anon_only_filter; + unsigned int batch_count = 0;
if (fatal_signal_pending(current)) return -EINTR; @@ -460,6 +461,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, regular_folio: #endif tlb_change_page_size(tlb, PAGE_SIZE); +restart: start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); if (!start_pte) return 0; @@ -468,6 +470,15 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, for (; addr < end; pte++, addr += PAGE_SIZE) { ptent = ptep_get(pte);
+ if (++batch_count == SWAP_CLUSTER_MAX) { + batch_count = 0; + if (need_resched()) { + pte_unmap_unlock(start_pte, ptl); + cond_resched(); + goto restart; + } + } + if (pte_none(ptent)) continue;
From: Sergey Senozhatsky senozhatsky@chromium.org
mainline inclusion from mainline-v6.8-rc4 commit 4c2da3188b848d33c26d7f0f8b14f3150331c923 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We need to leave lazy MMU mode before unlocking.
Link: https://lkml.kernel.org/r/20240126032608.355899-1-senozhatsky@chromium.org Fixes: b2f557a21bc8 ("mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()") Signed-off-by: Sergey Senozhatsky senozhatsky@chromium.org Cc: Jiexun Wang wangjiexun@tinylab.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/madvise.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/mm/madvise.c b/mm/madvise.c index 30b218d658a8..a1ef759be8fe 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -473,6 +473,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (++batch_count == SWAP_CLUSTER_MAX) { batch_count = 0; if (need_resched()) { + arch_leave_lazy_mmu_mode(); pte_unmap_unlock(start_pte, ptl); cond_resched(); goto restart;
From: Ryan Roberts ryan.roberts@arm.com
next inclusion from next-20240510 commit d7d0d389ff90644546ffcb8e15ea3ccaf6138958 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "Swap-out mTHP without splitting", v7.
This series adds support for swapping out multi-size THP (mTHP) without needing to first split the large folio via split_huge_page_to_list_to_order(). It closely follows the approach already used to swap-out PMD-sized THP.
There are a couple of reasons for swapping out mTHP without splitting:
- Performance: It is expensive to split a large folio and under extreme memory pressure some workloads regressed performance when using 64K mTHP vs 4K small folios because of this extra cost in the swap-out path. This series not only eliminates the regression but makes it faster to swap out 64K mTHP vs 4K small folios.
- Memory fragmentation avoidance: If we can avoid splitting a large folio, memory is less likely to become fragmented, making it easier to re-allocate a large folio in the future.
- Performance: Enables a separate series [7] to swap-in whole mTHPs, which means we won't lose the TLB-efficiency benefits of mTHP once the memory has been through a swap cycle.
I've done what I thought was the smallest change possible, and as a result, this approach is only employed when the swap is backed by a non-rotating block device (just as PMD-sized THP is supported today). Discussion against the RFC concluded that this is sufficient.
Performance Testing ===================
I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The VM is set up with a 35G block ram device as the swap device and the test is run from inside a memcg limited to 40G memory. I've then run `usemem` from vm-scalability with 70 processes, each allocating and writing 1G of memory. I've repeated everything 6 times and taken the mean performance improvement relative to 4K page baseline:
| alloc size | baseline                | + this series           |
|            | mm-unstable (~v6.9-rc1) |                         |
|:-----------|------------------------:|------------------------:|
| 4K Page    |                    0.0% |                    1.3% |
| 64K THP    |                  -13.6% |                   46.3% |
| 2M THP     |                   91.4% |                   89.6% |
So with this change, the 64K swap performance goes from a 14% regression to a 46% improvement. While 2M shows a small regression I'm confident that this is just noise.
[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.c... [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.c... [3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.co... [4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.c... [5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.c... [6] https://lore.kernel.org/linux-mm/20240403114032.1162100-1-ryan.roberts@arm.c... [7] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ [8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6F... [9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat...
This patch (of 7):
As preparation for supporting small-sized THP in the swap-out path, without first needing to split to order-0, remove the CLUSTER_FLAG_HUGE, which, when present, always implies PMD-sized THP (the same as the cluster size).
The only use of the flag was to determine whether a swap entry refers to a single page or a PMD-sized THP in swap_page_trans_huge_swapped(). Instead of relying on the flag, we now pass in order, which originates from the folio's order. This allows the logic to work for folios of any order.
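To make the order-aware check concrete, here is a minimal userspace sketch under hypothetical names (toy_folio_swapped(), a plain swap_map array); it is not the kernel function. Swap entries for a folio are naturally aligned, so rounding the entry's offset down to a 1 << order boundary and scanning that many swap-map slots answers "is any page of this folio still swapped?" for any order; with order == 0 it degenerates to checking a single entry.

#include <stdbool.h>

static bool toy_folio_swapped(const unsigned char *swap_map,
			      unsigned long roffset, int order)
{
	unsigned long nr_pages = 1UL << order;
	unsigned long offset = roffset & ~(nr_pages - 1);	/* round_down */

	for (unsigned long i = 0; i < nr_pages; i++) {
		/* Toy check; the kernel also masks out non-count bits
		 * (SWAP_HAS_CACHE etc.) via swap_count(). */
		if (swap_map[offset + i])
			return true;
	}
	return false;
}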
The one snag is that one of the swap_page_trans_huge_swapped() call sites does not have the folio. But it was only being called there to shortcut a call to __try_to_reclaim_swap() in some cases. __try_to_reclaim_swap() gets the folio and (via some other functions) calls swap_page_trans_huge_swapped(). So I've removed the problematic call site and believe the new logic should be functionally equivalent.
That said, removing the fast path means that we will take a reference and trylock a large folio much more often, which we would like to avoid. The next patch will solve this.
Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster() which used to be called during folio splitting, since split_swap_cluster()'s only job was to remove the flag.
Link: https://lkml.kernel.org/r/20240408183946.2991168-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Acked-by: Chris Li chrisl@kernel.org Acked-by: David Hildenbrand david@redhat.com Cc: Barry Song 21cnbao@gmail.com Cc: Gao Xiang xiang@kernel.org Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Barry Song v-songbaohua@oppo.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/swap.h | 10 ---------- mm/huge_memory.c | 3 --- mm/swapfile.c | 47 ++++++++------------------------------------ 3 files changed, 8 insertions(+), 52 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h index cbcb767b79e3..4a2a8095d7ca 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -271,7 +271,6 @@ struct swap_cluster_info { }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ -#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
/* * We assign a cluster to each CPU, so each CPU can allocate swap entry from @@ -630,15 +629,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis, } #endif /* CONFIG_SWAP */
-#ifdef CONFIG_THP_SWAP -extern int split_swap_cluster(swp_entry_t entry); -#else -static inline int split_swap_cluster(swp_entry_t entry) -{ - return 0; -} -#endif - #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a548caef1c2f..0c61e7c7c2c1 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2834,9 +2834,6 @@ static void __split_huge_page(struct page *page, struct list_head *list, shmem_uncharge(head->mapping->host, nr_dropped); remap_page(folio, nr);
- if (folio_test_swapcache(folio)) - split_swap_cluster(folio->swap); - /* * set page to its compound_head when split to non order-0 pages, so * we can skip unlocking it below, since PG_locked is transferred to diff --git a/mm/swapfile.c b/mm/swapfile.c index 961862f4aa57..25ff4f0dd2c2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -342,18 +342,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info) info->data = 0; }
-static inline bool cluster_is_huge(struct swap_cluster_info *info) -{ - if (IS_ENABLED(CONFIG_THP_SWAP)) - return info->flags & CLUSTER_FLAG_HUGE; - return false; -} - -static inline void cluster_clear_huge(struct swap_cluster_info *info) -{ - info->flags &= ~CLUSTER_FLAG_HUGE; -} - static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, unsigned long offset) { @@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) offset = idx * SWAPFILE_CLUSTER; ci = lock_cluster(si, offset); alloc_cluster(si, idx); - cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE); + cluster_set_count(ci, SWAPFILE_CLUSTER);
memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); unlock_cluster(ci); @@ -1449,7 +1437,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
ci = lock_cluster_or_swap_info(si, offset); if (size == SWAPFILE_CLUSTER) { - VM_BUG_ON(!cluster_is_huge(ci)); map = si->swap_map + offset; for (i = 0; i < SWAPFILE_CLUSTER; i++) { val = map[i]; @@ -1457,7 +1444,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) if (val == SWAP_HAS_CACHE) free_entries++; } - cluster_clear_huge(ci); if (free_entries == SWAPFILE_CLUSTER) { unlock_cluster_or_swap_info(si, ci); spin_lock(&si->lock); @@ -1479,23 +1465,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) unlock_cluster_or_swap_info(si, ci); }
-#ifdef CONFIG_THP_SWAP -int split_swap_cluster(swp_entry_t entry) -{ - struct swap_info_struct *si; - struct swap_cluster_info *ci; - unsigned long offset = swp_offset(entry); - - si = _swap_info_get(entry); - if (!si) - return -EBUSY; - ci = lock_cluster(si, offset); - cluster_clear_huge(ci); - unlock_cluster(ci); - return 0; -} -#endif - static int swp_entry_cmp(const void *ent1, const void *ent2) { const swp_entry_t *e1 = ent1, *e2 = ent2; @@ -1603,22 +1572,23 @@ int swp_swapcount(swp_entry_t entry) }
static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, - swp_entry_t entry) + swp_entry_t entry, int order) { struct swap_cluster_info *ci; unsigned char *map = si->swap_map; + unsigned int nr_pages = 1 << order; unsigned long roffset = swp_offset(entry); - unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER); + unsigned long offset = round_down(roffset, nr_pages); int i; bool ret = false;
ci = lock_cluster_or_swap_info(si, offset); - if (!ci || !cluster_is_huge(ci)) { + if (!ci || nr_pages == 1) { if (swap_count(map[roffset])) ret = true; goto unlock_out; } - for (i = 0; i < SWAPFILE_CLUSTER; i++) { + for (i = 0; i < nr_pages; i++) { if (swap_count(map[offset + i])) { ret = true; break; @@ -1640,7 +1610,7 @@ static bool folio_swapped(struct folio *folio) if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio))) return swap_swapcount(si, entry) != 0;
- return swap_page_trans_huge_swapped(si, entry); + return swap_page_trans_huge_swapped(si, entry, folio_order(folio)); }
/** @@ -1706,8 +1676,7 @@ int free_swap_and_cache(swp_entry_t entry) }
count = __swap_entry_free(p, entry); - if (count == SWAP_HAS_CACHE && - !swap_page_trans_huge_swapped(p, entry)) + if (count == SWAP_HAS_CACHE) __try_to_reclaim_swap(p, swp_offset(entry), TTRS_UNMAPPED | TTRS_FULL); put_swap_device(p);
From: Ryan Roberts ryan.roberts@arm.com
next inclusion from next-20240510 commit a62fb92ac12ed39df4930dca599a3b427552882a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc.
Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case.
Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range().
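For intuition, the gathering step can be modelled in a few lines of plain C; the names below (toy_swap_batch(), offsets[]) are hypothetical and this is not the kernel helper. Starting from the first entry, it counts how many of the following entries carry consecutive swap offsets, capped at the caller-supplied maximum. The kernel's swap_pte_batch() additionally requires the swap type and swp pte bits to match and never scans past the end of a page table.

#include <stddef.h>

/* Toy model: offsets[] stands in for the swap offsets encoded in consecutive
 * non-present PTEs; max_nr is the number of remaining entries in the range. */
static size_t toy_swap_batch(const unsigned long *offsets, size_t max_nr)
{
	size_t nr = 1;

	while (nr < max_nr && offsets[nr] == offsets[0] + nr)
		nr++;
	return nr;
}

A caller then frees all nr entries with a single free_swap_and_cache_nr() call and clears the nr PTEs with a single clear_not_present_full_ptes() call, instead of doing both once per PTE.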
While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one.
[ryan.roberts@arm.com: fix a build warning on parisc] Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Acked-by: David Hildenbrand david@redhat.com Cc: Barry Song 21cnbao@gmail.com Cc: Barry Song v-songbaohua@oppo.com Cc: Chris Li chrisl@kernel.org Cc: Gao Xiang xiang@kernel.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/pgtable.h | 29 ++++++++++++ include/linux/swap.h | 12 +++-- mm/internal.h | 64 +++++++++++++++++++++++++++ mm/madvise.c | 12 +++-- mm/memory.c | 13 +++--- mm/swapfile.c | 97 +++++++++++++++++++++++++++++++++-------- 6 files changed, 196 insertions(+), 31 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 288dd11b46e5..218921a74c0a 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -680,6 +680,35 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm, } #endif
+#ifndef clear_not_present_full_ptes +/** + * clear_not_present_full_ptes - Clear multiple not present PTEs which are + * consecutive in the pgtable. + * @mm: Address space the ptes represent. + * @addr: Address of the first pte. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to clear. + * @full: Whether we are clearing a full mm. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over pte_clear_not_present_full(). + * + * Context: The caller holds the page table lock. The PTEs are all not present. + * The PTEs are all in the same PMD. + */ +static inline void clear_not_present_full_ptes(struct mm_struct *mm, + unsigned long addr, pte_t *ptep, unsigned int nr, int full) +{ + for (;;) { + pte_clear_not_present_full(mm, addr, ptep, full); + if (--nr == 0) + break; + ptep++; + addr += PAGE_SIZE; + } +} +#endif + #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH extern pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, diff --git a/include/linux/swap.h b/include/linux/swap.h index 4a2a8095d7ca..90d546caa0ec 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -502,7 +502,7 @@ extern int swap_duplicate(swp_entry_t); extern int swapcache_prepare(swp_entry_t); extern void swap_free(swp_entry_t); extern void swapcache_free_entries(swp_entry_t *entries, int n); -extern int free_swap_and_cache(swp_entry_t); +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); int swap_type_of(dev_t device, sector_t offset); int find_first_swap(dev_t *device); extern unsigned int count_swap_pages(int, int); @@ -560,8 +560,9 @@ static inline void put_swap_device(struct swap_info_struct *si) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr));
-/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */ -#define free_swap_and_cache(e) is_pfn_swap_entry(e) +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr) +{ +}
static inline void free_swap_cache(struct page *page) { @@ -629,6 +630,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis, } #endif /* CONFIG_SWAP */
+static inline void free_swap_and_cache(swp_entry_t entry) +{ + free_swap_and_cache_nr(entry, 1); +} + #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/mm/internal.h b/mm/internal.h index 3cd5867bba43..a96811881760 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -11,6 +11,8 @@ #include <linux/mm.h> #include <linux/pagemap.h> #include <linux/rmap.h> +#include <linux/swap.h> +#include <linux/swapops.h> #include <linux/tracepoint-defs.h>
struct folio_batch; @@ -188,6 +190,68 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
return min(ptep - start_ptep, max_nr); } + +/** + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte. + * @pte: The initial pte state; is_swap_pte(pte) must be true and + * non_swap_entry() must be false. + * + * Increments the swap offset, while maintaining all other fields, including + * swap type, and any swp pte bits. The resulting pte is returned. + */ +static inline pte_t pte_next_swp_offset(pte_t pte) +{ + swp_entry_t entry = pte_to_swp_entry(pte); + pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), + (swp_offset(entry) + 1))); + + if (pte_swp_soft_dirty(pte)) + new = pte_swp_mksoft_dirty(new); + if (pte_swp_exclusive(pte)) + new = pte_swp_mkexclusive(new); + if (pte_swp_uffd_wp(pte)) + new = pte_swp_mkuffd_wp(new); + + return new; +} + +/** + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries + * @start_ptep: Page table pointer for the first entry. + * @max_nr: The maximum number of table entries to consider. + * @pte: Page table entry for the first entry. + * + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs + * containing swap entries all with consecutive offsets and targeting the same + * swap type, all with matching swp pte bits. + * + * max_nr must be at least one and must be limited by the caller so scanning + * cannot exceed a single page table. + * + * Return: the number of table entries in the batch. + */ +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) +{ + pte_t expected_pte = pte_next_swp_offset(pte); + const pte_t *end_ptep = start_ptep + max_nr; + pte_t *ptep = start_ptep + 1; + + VM_WARN_ON(max_nr < 1); + VM_WARN_ON(!is_swap_pte(pte)); + VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte))); + + while (ptep < end_ptep) { + pte = ptep_get(ptep); + + if (!pte_same(pte, expected_pte)) + break; + + expected_pte = pte_next_swp_offset(expected_pte); + ptep++; + } + + return ptep - start_ptep; +} #endif /* CONFIG_MMU */
void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio, diff --git a/mm/madvise.c b/mm/madvise.c index a1ef759be8fe..ce491a4f6f87 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -667,6 +667,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, struct folio *folio; int nr_swap = 0; unsigned long next; + int nr, max_nr;
next = pmd_addr_end(addr, end); if (pmd_trans_huge(*pmd)) @@ -679,7 +680,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, return 0; flush_tlb_batched_pending(mm); arch_enter_lazy_mmu_mode(); - for (; addr != end; pte++, addr += PAGE_SIZE) { + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { + nr = 1; ptent = ptep_get(pte);
if (pte_none(ptent)) @@ -694,9 +696,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
entry = pte_to_swp_entry(ptent); if (!non_swap_entry(entry)) { - nr_swap--; - free_swap_and_cache(entry); - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); + max_nr = (end - addr) / PAGE_SIZE; + nr = swap_pte_batch(pte, max_nr, ptent); + nr_swap -= nr; + free_swap_and_cache_nr(entry, nr); + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); diff --git a/mm/memory.c b/mm/memory.c index 2c7577d5e7b7..11d97d28aa37 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1642,12 +1642,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, folio_remove_rmap_pte(folio, page, vma); folio_put(folio); } else if (!non_swap_entry(entry)) { - /* Genuine swap entry, hence a private anon page */ + max_nr = (end - addr) / PAGE_SIZE; + nr = swap_pte_batch(pte, max_nr, ptent); + /* Genuine swap entries, hence a private anon pages */ if (!should_zap_cows(details)) continue; - rss[MM_SWAPENTS]--; - if (unlikely(!free_swap_and_cache(entry))) - print_bad_pte(vma, addr, ptent, NULL); + rss[MM_SWAPENTS] -= nr; + free_swap_and_cache_nr(entry, nr); } else if (is_migration_entry(entry)) { folio = pfn_swap_entry_folio(entry); if (!should_zap_folio(details, folio)) @@ -1672,8 +1673,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, /* We should have covered all the swap entry types */ WARN_ON_ONCE(1); } - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent); + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm); + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent); } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
add_mm_rss_vec(mm, rss); diff --git a/mm/swapfile.c b/mm/swapfile.c index 25ff4f0dd2c2..beaef7521469 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -129,7 +129,11 @@ static inline unsigned char swap_count(unsigned char ent) /* Reclaim the swap entry if swap is getting full*/ #define TTRS_FULL 0x4
-/* returns 1 if swap entry is freed */ +/* + * returns number of pages in the folio that backs the swap entry. If positive, + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no + * folio was associated with the swap entry. + */ static int __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset, unsigned long flags) { @@ -154,6 +158,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, ret = folio_free_swap(folio); folio_unlock(folio); } + ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio); folio_put(folio); return ret; } @@ -889,7 +894,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si, swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); spin_lock(&si->lock); /* entry was freed successfully, try to use this again */ - if (swap_was_freed) + if (swap_was_freed > 0) goto checks; goto scan; /* check next one */ } @@ -1656,32 +1661,88 @@ bool folio_free_swap(struct folio *folio) return true; }
-/* - * Free the swap entry like above, but also try to - * free the page cache entry if it is the last user. +/** + * free_swap_and_cache_nr() - Release reference on range of swap entries and + * reclaim their cache if no more references remain. + * @entry: First entry of range. + * @nr: Number of entries in range. + * + * For each swap entry in the contiguous range, release a reference. If any swap + * entries become free, try to reclaim their underlying folios, if present. The + * offset range is defined by [entry.offset, entry.offset + nr). */ -int free_swap_and_cache(swp_entry_t entry) +void free_swap_and_cache_nr(swp_entry_t entry, int nr) { - struct swap_info_struct *p; + const unsigned long start_offset = swp_offset(entry); + const unsigned long end_offset = start_offset + nr; + unsigned int type = swp_type(entry); + struct swap_info_struct *si; + bool any_only_cache = false; + unsigned long offset; unsigned char count;
if (non_swap_entry(entry)) - return 1; + return;
- p = get_swap_device(entry); - if (p) { - if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) { - put_swap_device(p); - return 0; + si = get_swap_device(entry); + if (!si) + return; + + if (WARN_ON(end_offset > si->max)) + goto out; + + /* + * First free all entries in the range. + */ + for (offset = start_offset; offset < end_offset; offset++) { + if (data_race(si->swap_map[offset])) { + count = __swap_entry_free(si, swp_entry(type, offset)); + if (count == SWAP_HAS_CACHE) + any_only_cache = true; + } else { + WARN_ON_ONCE(1); } + } + + /* + * Short-circuit the below loop if none of the entries had their + * reference drop to zero. + */ + if (!any_only_cache) + goto out;
- count = __swap_entry_free(p, entry); - if (count == SWAP_HAS_CACHE) - __try_to_reclaim_swap(p, swp_offset(entry), + /* + * Now go back over the range trying to reclaim the swap cache. This is + * more efficient for large folios because we will only try to reclaim + * the swap once per folio in the common case. If we do + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the + * latter will get a reference and lock the folio for every individual + * page but will only succeed once the swap slot for every subpage is + * zero. + */ + for (offset = start_offset; offset < end_offset; offset += nr) { + nr = 1; + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { + /* + * Folios are always naturally aligned in swap so + * advance forward to the next boundary. Zero means no + * folio was found for the swap entry, so advance by 1 + * in this case. Negative value means folio was found + * but could not be reclaimed. Here we can still advance + * to the next boundary. + */ + nr = __try_to_reclaim_swap(si, offset, TTRS_UNMAPPED | TTRS_FULL); - put_swap_device(p); + if (nr == 0) + nr = 1; + else if (nr < 0) + nr = -nr; + nr = ALIGN(offset + 1, nr) - offset; + } } - return p != NULL; + +out: + put_swap_device(si); }
#ifdef CONFIG_HIBERNATION
From: Ryan Roberts ryan.roberts@arm.com
next inclusion from next-20240510 commit 14c62da21b2b865f4fc0c49edd74ed7299927d35 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
struct percpu_cluster stores the index of the cpu's current cluster and the offset of the next entry that will be allocated for that cpu. These two pieces of information are redundant because the cluster index is just (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the cluster index is that the structure used for it also has a flag to indicate "no cluster". However, this data structure also contains a spin lock, which is never used in this context; as a side effect, the code copies the spinlock_t structure, which is questionable coding practice in my view.
So let's clean this up and store only the next offset, and use a sentinel value (SWAP_NEXT_INVALID) to indicate "no cluster". SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen legitimately: the first page in the swap file is the swap header, which is always marked bad to prevent it from being allocated as an entry. This also prevents the cluster to which it belongs from being marked free, so it will never appear on the free list.
This change saves 16 bytes per cpu. And given we are shortly going to extend this mechanism to be per-cpu-AND-per-order, we will end up saving 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the system.
Link: https://lkml.kernel.org/r/20240408183946.2991168-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: Barry Song 21cnbao@gmail.com Cc: Barry Song v-songbaohua@oppo.com Cc: Chris Li chrisl@kernel.org Cc: David Hildenbrand david@redhat.com Cc: Gao Xiang xiang@kernel.org Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/swap.h | 9 ++++++++- mm/swapfile.c | 22 +++++++++++----------- 2 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h index 90d546caa0ec..bb5de1f0eeb9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -272,13 +272,20 @@ struct swap_cluster_info { #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+/* + * The first page in the swap file is the swap header, which is always marked + * bad to prevent it from being allocated as an entry. This also prevents the + * cluster to which it belongs being marked free. Therefore 0 is safe to use as + * a sentinel to indicate next is not valid in percpu_cluster. + */ +#define SWAP_NEXT_INVALID 0 + /* * We assign a cluster to each CPU, so each CPU can allocate swap entry from * its own cluster and swapout sequentially. The purpose is to optimize swapout * throughput. */ struct percpu_cluster { - struct swap_cluster_info index; /* Current cluster index */ unsigned int next; /* Likely next allocation offset */ };
diff --git a/mm/swapfile.c b/mm/swapfile.c index beaef7521469..ae9660cf3676 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -608,7 +608,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, return false;
percpu_cluster = this_cpu_ptr(si->percpu_cluster); - cluster_set_null(&percpu_cluster->index); + percpu_cluster->next = SWAP_NEXT_INVALID; return true; }
@@ -621,14 +621,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, { struct percpu_cluster *cluster; struct swap_cluster_info *ci; - unsigned long tmp, max; + unsigned int tmp, max;
new_cluster: cluster = this_cpu_ptr(si->percpu_cluster); - if (cluster_is_null(&cluster->index)) { + tmp = cluster->next; + if (tmp == SWAP_NEXT_INVALID) { if (!cluster_list_empty(&si->free_clusters)) { - cluster->index = si->free_clusters.head; - cluster->next = cluster_next(&cluster->index) * + tmp = cluster_next(&si->free_clusters.head) * SWAPFILE_CLUSTER; } else if (!cluster_list_empty(&si->discard_clusters)) { /* @@ -648,9 +648,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, * Other CPUs can use our cluster if they can't find a free cluster, * check if there is still free entry in the cluster */ - tmp = cluster->next; - max = min_t(unsigned long, si->max, - (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER); + max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); if (tmp < max) { ci = lock_cluster(si, tmp); while (tmp < max) { @@ -661,12 +659,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, unlock_cluster(ci); } if (tmp >= max) { - cluster_set_null(&cluster->index); + cluster->next = SWAP_NEXT_INVALID; goto new_cluster; } - cluster->next = tmp + 1; *offset = tmp; *scan_base = tmp; + tmp += 1; + cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID; return true; }
@@ -3244,8 +3243,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) } for_each_possible_cpu(cpu) { struct percpu_cluster *cluster; + cluster = per_cpu_ptr(p->percpu_cluster, cpu); - cluster_set_null(&cluster->index); + cluster->next = SWAP_NEXT_INVALID; } } else { atomic_inc(&nr_rotate_swap);
From: Ryan Roberts ryan.roberts@arm.com
next inclusion from next-20240510 commit 9faaa0f8168bfcd81469b0724b25ba3093097a08 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We are about to allow swap storage of any mTHP size. To prepare for that, let's change get_swap_pages() to take a folio order parameter instead of nr_pages. This makes the interface self-documenting; a power-of-2 number of pages must be provided. We will also need the order internally so this simplifies accessing it.
Link: https://lkml.kernel.org/r/20240408183946.2991168-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: David Hildenbrand david@redhat.com Cc: Barry Song 21cnbao@gmail.com Cc: Barry Song v-songbaohua@oppo.com Cc: Chris Li chrisl@kernel.org Cc: Gao Xiang xiang@kernel.org Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: include/linux/swap.h mm/swap_slots.c mm/swapfile.c [ Context conflicts with commit c08dff4db9ac. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/swap.h | 2 +- mm/swap_slots.c | 6 +++--- mm/swapfile.c | 13 +++++++------ 3 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h index bb5de1f0eeb9..0f9abd180f0a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -501,7 +501,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio); bool folio_free_swap(struct folio *folio); void put_swap_folio(struct folio *folio, swp_entry_t entry); extern swp_entry_t get_swap_page_of_type(int); -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size, +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order, int type); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 9971cdf30e20..7af3b93d4c8c 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -385,7 +385,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache, int type) cache->cur = 0; if (swap_slot_cache_active) cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, - cache->slots, 1, type); + cache->slots, 0, type);
return cache->nr; } @@ -435,7 +435,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
if (folio_test_large(folio)) { if (IS_ENABLED(CONFIG_THP_SWAP)) - get_swap_pages(1, &entry, folio_nr_pages(folio), type); + get_swap_pages(1, &entry, folio_order(folio), type); goto out; }
@@ -467,7 +467,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio) goto out; }
- get_swap_pages(1, &entry, 1, type); + get_swap_pages(1, &entry, 0, type); out: if (mem_cgroup_try_charge_swap(folio, entry)) { put_swap_folio(folio, entry); diff --git a/mm/swapfile.c b/mm/swapfile.c index ae9660cf3676..7ec8a071537c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -277,15 +277,15 @@ static void discard_swap_cluster(struct swap_info_struct *si, #ifdef CONFIG_THP_SWAP #define SWAPFILE_CLUSTER HPAGE_PMD_NR
-#define swap_entry_size(size) (size) +#define swap_entry_order(order) (order) #else #define SWAPFILE_CLUSTER 256
/* - * Define swap_entry_size() as constant to let compiler to optimize + * Define swap_entry_order() as constant to let compiler to optimize * out some code if !CONFIG_THP_SWAP */ -#define swap_entry_size(size) 1 +#define swap_entry_order(order) 0 #endif #define LATENCY_LIMIT 256
@@ -1120,10 +1120,11 @@ static inline bool should_skip_swap_type(int swap_type, int type) } #endif
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size, +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order, int type) { - unsigned long size = swap_entry_size(entry_size); + int order = swap_entry_order(entry_order); + unsigned long size = 1 << order; struct swap_info_struct *si, *next; long avail_pgs; int n_ret = 0; @@ -1433,7 +1434,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) unsigned char *map; unsigned int i, free_entries = 0; unsigned char val; - int size = swap_entry_size(folio_nr_pages(folio)); + int size = 1 << swap_entry_order(folio_order(folio));
si = _swap_info_get(entry); if (!si)
From: Ryan Roberts ryan.roberts@arm.com
next inclusion from next-20240510 commit 845982eb264bc64b0c3242ace217fb574f56a299 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out.
Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support.
Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP.
The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocating individual entries (as per the existing PMD-sized THP fallback).
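The alignment argument can be sketched in plain userspace C; the names below (toy_scan_cluster(), TOY_CLUSTER_SIZE) are hypothetical, and the real scan_swap_map_try_ssd_cluster() additionally handles locking, per-cpu state and the free/discard cluster lists. Because the scan starts at a cluster boundary and always advances by 1 << order, every run it finds is naturally aligned, so a cluster used for one order never accumulates padding between entries.

#include <stdbool.h>

#define TOY_CLUSTER_SIZE 512	/* e.g. SWAPFILE_CLUSTER with 4K pages, 2M PMD */

static bool toy_range_empty(const unsigned char *swap_map,
			    unsigned int start, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++) {
		if (swap_map[start + i])
			return false;
	}
	return true;
}

/* Return the offset of a free, naturally aligned run of 1 << order slots
 * inside the cluster starting at cluster_start, or -1 if the cluster has
 * no such run left for that order. */
static int toy_scan_cluster(const unsigned char *swap_map,
			    unsigned int cluster_start, int order)
{
	unsigned int nr = 1U << order;

	for (unsigned int off = cluster_start;
	     off + nr <= cluster_start + TOY_CLUSTER_SIZE; off += nr) {
		if (toy_range_empty(swap_map, off, nr))
			return (int)off;
	}
	return -1;
}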
The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleaving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations, so we follow the same pattern.
As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves.
This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes.
Link: https://lkml.kernel.org/r/20240408183946.2991168-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: Barry Song 21cnbao@gmail.com Cc: Barry Song v-songbaohua@oppo.com Cc: Chris Li chrisl@kernel.org Cc: David Hildenbrand david@redhat.com Cc: Gao Xiang xiang@kernel.org Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/swap.h | 8 ++- mm/swapfile.c | 162 ++++++++++++++++++++++++------------------- 2 files changed, 98 insertions(+), 72 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h index 0f9abd180f0a..13cd68b5f5e2 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -280,13 +280,19 @@ struct swap_cluster_info { */ #define SWAP_NEXT_INVALID 0
+#ifdef CONFIG_THP_SWAP +#define SWAP_NR_ORDERS (PMD_ORDER + 1) +#else +#define SWAP_NR_ORDERS 1 +#endif + /* * We assign a cluster to each CPU, so each CPU can allocate swap entry from * its own cluster and swapout sequentially. The purpose is to optimize swapout * throughput. */ struct percpu_cluster { - unsigned int next; /* Likely next allocation offset */ + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ };
struct swap_cluster_list { diff --git a/mm/swapfile.c b/mm/swapfile.c index 7ec8a071537c..ddb50283f2f1 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -550,10 +550,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
/* * The cluster corresponding to page_nr will be used. The cluster will be - * removed from free cluster list and its usage counter will be increased. + * removed from free cluster list and its usage counter will be increased by + * count. */ -static void inc_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr) +static void add_cluster_info_page(struct swap_info_struct *p, + struct swap_cluster_info *cluster_info, unsigned long page_nr, + unsigned long count) { unsigned long idx = page_nr / SWAPFILE_CLUSTER;
@@ -562,9 +564,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p, if (cluster_is_free(&cluster_info[idx])) alloc_cluster(p, idx);
- VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER); + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) + 1); + cluster_count(&cluster_info[idx]) + count); +} + +/* + * The cluster corresponding to page_nr will be used. The cluster will be + * removed from free cluster list and its usage counter will be increased by 1. + */ +static void inc_cluster_info_page(struct swap_info_struct *p, + struct swap_cluster_info *cluster_info, unsigned long page_nr) +{ + add_cluster_info_page(p, cluster_info, page_nr, 1); }
/* @@ -594,7 +606,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p, */ static bool scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, - unsigned long offset) + unsigned long offset, int order) { struct percpu_cluster *percpu_cluster; bool conflict; @@ -608,24 +620,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, return false;
percpu_cluster = this_cpu_ptr(si->percpu_cluster); - percpu_cluster->next = SWAP_NEXT_INVALID; + percpu_cluster->next[order] = SWAP_NEXT_INVALID; + return true; +} + +static inline bool swap_range_empty(char *swap_map, unsigned int start, + unsigned int nr_pages) +{ + unsigned int i; + + for (i = 0; i < nr_pages; i++) { + if (swap_map[start + i]) + return false; + } + return true; }
/* - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This - * might involve allocating a new cluster for current CPU too. + * Try to get swap entries with specified order from current cpu's swap entry + * pool (a cluster). This might involve allocating a new cluster for current CPU + * too. */ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, - unsigned long *offset, unsigned long *scan_base) + unsigned long *offset, unsigned long *scan_base, int order) { + unsigned int nr_pages = 1 << order; struct percpu_cluster *cluster; struct swap_cluster_info *ci; unsigned int tmp, max;
new_cluster: cluster = this_cpu_ptr(si->percpu_cluster); - tmp = cluster->next; + tmp = cluster->next[order]; if (tmp == SWAP_NEXT_INVALID) { if (!cluster_list_empty(&si->free_clusters)) { tmp = cluster_next(&si->free_clusters.head) * @@ -646,26 +673,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
/* * Other CPUs can use our cluster if they can't find a free cluster, - * check if there is still free entry in the cluster + * check if there is still free entry in the cluster, maintaining + * natural alignment. */ max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); if (tmp < max) { ci = lock_cluster(si, tmp); while (tmp < max) { - if (!si->swap_map[tmp]) + if (swap_range_empty(si->swap_map, tmp, nr_pages)) break; - tmp++; + tmp += nr_pages; } unlock_cluster(ci); } if (tmp >= max) { - cluster->next = SWAP_NEXT_INVALID; + cluster->next[order] = SWAP_NEXT_INVALID; goto new_cluster; } *offset = tmp; *scan_base = tmp; - tmp += 1; - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID; + tmp += nr_pages; + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID; return true; }
@@ -790,13 +818,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
static int scan_swap_map_slots(struct swap_info_struct *si, unsigned char usage, int nr, - swp_entry_t slots[]) + swp_entry_t slots[], int order) { struct swap_cluster_info *ci; unsigned long offset; unsigned long scan_base; unsigned long last_in_cluster = 0; int latency_ration = LATENCY_LIMIT; + unsigned int nr_pages = 1 << order; int n_ret = 0; bool scanned_many = false;
@@ -811,6 +840,25 @@ static int scan_swap_map_slots(struct swap_info_struct *si, * And we let swap pages go all over an SSD partition. Hugh */
+ if (order > 0) { + /* + * Should not even be attempting large allocations when huge + * page swap is disabled. Warn and fail the allocation. + */ + if (!IS_ENABLED(CONFIG_THP_SWAP) || + nr_pages > SWAPFILE_CLUSTER) { + VM_WARN_ON_ONCE(1); + return 0; + } + + /* + * Swapfile is not block device or not using clusters so unable + * to allocate large entries. + */ + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info) + return 0; + } + si->flags += SWP_SCANNING; /* * Use percpu scan base for SSD to reduce lock contention on @@ -825,8 +873,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
/* SSD algorithm */ if (si->cluster_info) { - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { + if (order > 0) + goto no_page; goto scan; + } } else if (unlikely(!si->cluster_nr--)) { if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { si->cluster_nr = SWAPFILE_CLUSTER - 1; @@ -868,13 +919,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
checks: if (si->cluster_info) { - while (scan_swap_map_ssd_cluster_conflict(si, offset)) { + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { /* take a break if we already got some slots */ if (n_ret) goto done; if (!scan_swap_map_try_ssd_cluster(si, &offset, - &scan_base)) + &scan_base, order)) { + if (order > 0) + goto no_page; goto scan; + } } } if (!(si->flags & SWP_WRITEOK)) @@ -905,11 +959,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si, else goto done; } - WRITE_ONCE(si->swap_map[offset], usage); - inc_cluster_info_page(si, si->cluster_info, offset); + memset(si->swap_map + offset, usage, nr_pages); + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); unlock_cluster(ci);
- swap_range_alloc(si, offset, 1); + swap_range_alloc(si, offset, nr_pages); slots[n_ret++] = swp_entry(si->type, offset);
/* got enough slots or reach max slots? */ @@ -930,8 +984,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
/* try to get more slots in cluster */ if (si->cluster_info) { - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) goto checks; + if (order > 0) + goto done; } else if (si->cluster_nr && !si->swap_map[++offset]) { /* non-ssd case, still more slots in cluster? */ --si->cluster_nr; @@ -958,11 +1014,13 @@ static int scan_swap_map_slots(struct swap_info_struct *si, }
done: - set_cluster_next(si, offset + 1); + if (order == 0) + set_cluster_next(si, offset + 1); si->flags -= SWP_SCANNING; return n_ret;
scan: + VM_WARN_ON(order > 0); spin_unlock(&si->lock); while (++offset <= READ_ONCE(si->highest_bit)) { if (unlikely(--latency_ration < 0)) { @@ -991,38 +1049,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, return n_ret; }
-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) -{ - unsigned long idx; - struct swap_cluster_info *ci; - unsigned long offset; - - /* - * Should not even be attempting cluster allocations when huge - * page swap is disabled. Warn and fail the allocation. - */ - if (!IS_ENABLED(CONFIG_THP_SWAP)) { - VM_WARN_ON_ONCE(1); - return 0; - } - - if (cluster_list_empty(&si->free_clusters)) - return 0; - - idx = cluster_list_first(&si->free_clusters); - offset = idx * SWAPFILE_CLUSTER; - ci = lock_cluster(si, offset); - alloc_cluster(si, idx); - cluster_set_count(ci, SWAPFILE_CLUSTER); - - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); - unlock_cluster(ci); - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); - *slot = swp_entry(si->type, offset); - - return 1; -} - static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) { unsigned long offset = idx * SWAPFILE_CLUSTER; @@ -1130,9 +1156,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order, int n_ret = 0; int node;
- /* Only single cluster request supported */ - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER); - spin_lock(&swap_avail_lock);
avail_pgs = get_avail_pages(size, type); @@ -1173,14 +1196,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order, spin_unlock(&si->lock); goto nextsi; } - if (size == SWAPFILE_CLUSTER) { - if (si->flags & SWP_BLKDEV) - n_ret = swap_alloc_cluster(si, swp_entries); - } else - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, - n_goal, swp_entries); + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, + n_goal, swp_entries, order); spin_unlock(&si->lock); - if (n_ret || size == SWAPFILE_CLUSTER) + if (n_ret || size > 1) goto check_out; cond_resched();
@@ -1757,7 +1776,7 @@ swp_entry_t get_swap_page_of_type(int type)
/* This is called for allocating swap entry, not cache */ spin_lock(&si->lock); - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0)) atomic_long_dec(&nr_swap_pages); spin_unlock(&si->lock); fail: @@ -3208,7 +3227,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) p->flags |= SWP_SYNCHRONOUS_IO;
if (p->bdev && bdev_nonrot(p->bdev)) { - int cpu; + int cpu, i; unsigned long ci, nr_cluster;
p->flags |= SWP_SOLIDSTATE; @@ -3246,7 +3265,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) struct percpu_cluster *cluster;
cluster = per_cpu_ptr(p->percpu_cluster, cpu); - cluster->next = SWAP_NEXT_INVALID; + for (i = 0; i < SWAP_NR_ORDERS; i++) + cluster->next[i] = SWAP_NEXT_INVALID; } } else { atomic_inc(&nr_rotate_swap);
From: Ryan Roberts ryan.roberts@arm.com
next inclusion from next-20240510 commit 5ed890ce5147855c5360affd5e5419ed68a54100 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that swap supports storing all mTHP sizes, avoid splitting large folios before swap-out. This benefits performance of the swap-out path by eliding split_folio_to_list(), which is expensive, and also sets us up for swapping in large folios in a future series.
If the folio is partially mapped, we continue to split it since we want to avoid the extra IO overhead and storage of writing out pages unnecessarily.
THP_SWPOUT and THP_SWPOUT_FALLBACK counters should continue to count events only for PMD-mappable folios to avoid user confusion. THP_SWPOUT already has the appropriate guard. Add a guard for THP_SWPOUT_FALLBACK. It may be appropriate to add per-size counters in future.
Link: https://lkml.kernel.org/r/20240408183946.2991168-7-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Barry Song v-songbaohua@oppo.com Cc: Barry Song 21cnbao@gmail.com Cc: Chris Li chrisl@kernel.org Cc: Gao Xiang xiang@kernel.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/vmscan.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c index 206be0d30d57..7e9cbb2bc454 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1903,25 +1903,25 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (!can_split_folio(folio, NULL)) goto activate_locked; /* - * Split folios without a PMD map right - * away. Chances are some or all of the - * tail pages can be freed without IO. + * Split partially mapped folios right away. + * We can free the unmapped pages without IO. */ - if (!folio_entire_mapcount(folio) && - split_folio_to_list(folio, - folio_list)) + if (data_race(!list_empty(&folio->_deferred_list)) && + split_folio_to_list(folio, folio_list)) goto activate_locked; } if (!add_to_swap(folio)) { if (!folio_test_large(folio)) goto activate_locked_split; /* Fallback to swap normal pages */ - if (split_folio_to_list(folio, - folio_list)) + if (split_folio_to_list(folio, folio_list)) goto activate_locked; #ifdef CONFIG_TRANSPARENT_HUGEPAGE - count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1); - count_vm_event(THP_SWPOUT_FALLBACK); + if (nr_pages >= HPAGE_PMD_NR) { + count_memcg_folio_events(folio, + THP_SWPOUT_FALLBACK, 1); + count_vm_event(THP_SWPOUT_FALLBACK); + } #endif if (!add_to_swap(folio)) goto activate_locked_split;
From: Ryan Roberts ryan.roberts@arm.com
next inclusion from next-20240510 commit 3931b871c4936c00c4e27c469056d8da47a3493f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OCYO CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series).
Folios that are not fully mapped in the target range are still split, but note that the behavior is changed so that if the split fails for any reason (folio locked, shared, etc.) we now leave it as is, move to the next pte in the range, and continue work on the subsequent folios. Previously any failure of this sort would cause the entire operation to give up, and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would likely have led to wasted opportunities.
While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function. This is more efficient than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place.
Link: https://lkml.kernel.org/r/20240408183946.2991168-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Barry Song v-songbaohua@oppo.com Acked-by: David Hildenbrand david@redhat.com Cc: Barry Song 21cnbao@gmail.com Cc: Chris Li chrisl@kernel.org Cc: Gao Xiang xiang@kernel.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@suse.com Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/pgtable.h | 30 ++++++++++++++ mm/internal.h | 12 +++++- mm/madvise.c | 87 +++++++++++++++++++++++------------------ mm/memory.c | 4 +- 4 files changed, 92 insertions(+), 41 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 218921a74c0a..ecc561d49d5b 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -333,6 +333,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, } #endif
+#ifndef mkold_ptes +/** + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old. + * @vma: VMA the pages are mapped into. + * @addr: Address the first page is mapped at. + * @ptep: Page table pointer for the first entry. + * @nr: Number of entries to mark old. + * + * May be overridden by the architecture; otherwise, implemented as a simple + * loop over ptep_test_and_clear_young(). + * + * Note that PTE bits in the PTE range besides the PFN can differ. For example, + * some PTEs might be write-protected. + * + * Context: The caller holds the page table lock. The PTEs map consecutive + * pages that belong to the same folio. The PTEs are all in the same PMD. + */ +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr, + pte_t *ptep, unsigned int nr) +{ + for (;;) { + ptep_test_and_clear_young(vma, addr, ptep); + if (--nr == 0) + break; + ptep++; + addr += PAGE_SIZE; + } +} +#endif + #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma, diff --git a/mm/internal.h b/mm/internal.h index a96811881760..6451747b7160 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -129,6 +129,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) * @flags: Flags to modify the PTE batch semantics. * @any_writable: Optional pointer to indicate whether any entry except the * first one is writable. + * @any_young: Optional pointer to indicate whether any entry except the + * first one is young. * * Detect a PTE batch: consecutive (present) PTEs that map consecutive * pages of the same large folio. @@ -144,16 +146,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags) */ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags, - bool *any_writable) + bool *any_writable, bool *any_young) { unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio); const pte_t *end_ptep = start_ptep + max_nr; pte_t expected_pte, *ptep; - bool writable; + bool writable, young; int nr;
if (any_writable) *any_writable = false; + if (any_young) + *any_young = false;
VM_WARN_ON_FOLIO(!pte_present(pte), folio); VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio); @@ -167,6 +171,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, pte = ptep_get(ptep); if (any_writable) writable = !!pte_write(pte); + if (any_young) + young = !!pte_young(pte); pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_same(pte, expected_pte)) @@ -182,6 +188,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
if (any_writable) *any_writable |= writable; + if (any_young) + *any_young |= young;
 
                 nr = pte_batch_hint(ptep, pte);
                 expected_pte = pte_advance_pfn(expected_pte, nr);
diff --git a/mm/madvise.c b/mm/madvise.c
index ce491a4f6f87..57ad2c766aa6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -380,6 +380,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
         LIST_HEAD(folio_list);
         bool pageout_anon_only_filter;
         unsigned int batch_count = 0;
+        int nr;
 
         if (fatal_signal_pending(current))
                 return -EINTR;
@@ -467,7 +468,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
                 return 0;
         flush_tlb_batched_pending(mm);
         arch_enter_lazy_mmu_mode();
-        for (; addr < end; pte++, addr += PAGE_SIZE) {
+        for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
+                nr = 1;
                 ptent = ptep_get(pte);
 
                 if (++batch_count == SWAP_CLUSTER_MAX) {
@@ -491,55 +493,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
                         continue;
 
                 /*
-                 * Creating a THP page is expensive so split it only if we
-                 * are sure it's worth. Split it if we are only owner.
+                 * If we encounter a large folio, only split it if it is not
+                 * fully mapped within the range we are operating on. Otherwise
+                 * leave it as is so that it can be swapped out whole. If we
+                 * fail to split a folio, leave it in place and advance to the
+                 * next pte in the range.
                  */
                 if (folio_test_large(folio)) {
-                        int err;
-
-                        if (folio_likely_mapped_shared(folio))
-                                break;
-                        if (pageout_anon_only_filter && !folio_test_anon(folio))
-                                break;
-                        if (!folio_trylock(folio))
-                                break;
-                        folio_get(folio);
-                        arch_leave_lazy_mmu_mode();
-                        pte_unmap_unlock(start_pte, ptl);
-                        start_pte = NULL;
-                        err = split_folio(folio);
-                        folio_unlock(folio);
-                        folio_put(folio);
-                        if (err)
-                                break;
-                        start_pte = pte =
-                                pte_offset_map_lock(mm, pmd, addr, &ptl);
-                        if (!start_pte)
-                                break;
-                        arch_enter_lazy_mmu_mode();
-                        pte--;
-                        addr -= PAGE_SIZE;
-                        continue;
+                        const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+                                                FPB_IGNORE_SOFT_DIRTY;
+                        int max_nr = (end - addr) / PAGE_SIZE;
+                        bool any_young;
+
+                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
+                                             fpb_flags, NULL, &any_young);
+                        if (any_young)
+                                ptent = pte_mkyoung(ptent);
+
+                        if (nr < folio_nr_pages(folio)) {
+                                int err;
+
+                                if (folio_likely_mapped_shared(folio))
+                                        continue;
+                                if (pageout_anon_only_filter && !folio_test_anon(folio))
+                                        continue;
+                                if (!folio_trylock(folio))
+                                        continue;
+                                folio_get(folio);
+                                arch_leave_lazy_mmu_mode();
+                                pte_unmap_unlock(start_pte, ptl);
+                                start_pte = NULL;
+                                err = split_folio(folio);
+                                folio_unlock(folio);
+                                folio_put(folio);
+                                start_pte = pte =
+                                        pte_offset_map_lock(mm, pmd, addr, &ptl);
+                                if (!start_pte)
+                                        break;
+                                arch_enter_lazy_mmu_mode();
+                                if (!err)
+                                        nr = 0;
+                                continue;
+                        }
                 }
 
                 /*
                  * Do not interfere with other mappings of this folio and
-                 * non-LRU folio.
+                 * non-LRU folio. If we have a large folio at this point, we
+                 * know it is fully mapped so if its mapcount is the same as its
+                 * number of pages, it must be exclusive.
                  */
-                if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
+                if (!folio_test_lru(folio) ||
+                    folio_mapcount(folio) != folio_nr_pages(folio))
                         continue;
 
                 if (pageout_anon_only_filter && !folio_test_anon(folio))
                         continue;
 
-                VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
                 if (!pageout && pte_young(ptent)) {
-                        ptent = ptep_get_and_clear_full(mm, addr, pte,
-                                                        tlb->fullmm);
-                        ptent = pte_mkold(ptent);
-                        set_pte_at(mm, addr, pte, ptent);
-                        tlb_remove_tlb_entry(tlb, pte, addr);
+                        mkold_ptes(vma, addr, pte, nr);
+                        tlb_remove_tlb_entries(tlb, pte, nr, addr);
                 }
 
                 /*
diff --git a/mm/memory.c b/mm/memory.c
index 11d97d28aa37..64e1fd144d93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -996,7 +996,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
                 flags |= FPB_IGNORE_SOFT_DIRTY;
 
         nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
-                             &any_writable);
+                             &any_writable, NULL);
         folio_ref_add(folio, nr);
         if (folio_test_anon(folio)) {
                 if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1563,7 +1563,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
          */
         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
-                                     NULL);
+                                     NULL, NULL);
                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
                                        addr, details, rss, force_flush,
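The reworked madvise_cold_or_pageout_pte_range() loop above is easiest to follow as a shape: the cursor advances by the detected batch length nr on every iteration, a large folio that is fully covered by the range is kept whole, and a partially covered one is split (or left in place if the split fails) before the scan continues. The sketch below models only that control flow in user space; struct mock_folio, batch_len() and the page-to-folio mapping are invented stand-ins, not the kernel APIs used in the patch.

/* User-space model of the reworked MADV_PAGEOUT/MADV_COLD loop shape:
 * advance by the detected batch length, and only "split" large folios that
 * are not fully covered by the range being operated on.
 * Every name here is an illustrative stand-in, not a kernel symbol.
 */
#include <stdio.h>

#define RANGE 10        /* length of the operated-on range, in pages */

struct mock_folio {
        int nr_pages;   /* size of the "folio" in pages */
};

/* Assume address addr maps page (addr % f->nr_pages) of some folio, so a
 * batch runs to the end of the folio or the end of the range, whichever
 * comes first. */
static int batch_len(int addr, int end, const struct mock_folio *f)
{
        int in_folio = f->nr_pages - (addr % f->nr_pages);
        int in_range = end - addr;

        return in_folio < in_range ? in_folio : in_range;
}

int main(void)
{
        struct mock_folio folio = { .nr_pages = 4 };
        int addr, nr;

        for (addr = 0; addr < RANGE; addr += nr) {
                nr = batch_len(addr, RANGE, &folio);

                if (nr < folio.nr_pages) {
                        /* Partially covered large folio: the kernel would try
                         * to split it here and rescan; the model just notes it. */
                        printf("addr %d: partial batch of %d -> would split\n",
                               addr, nr);
                } else {
                        /* Fully covered: mark the whole batch old / queue it
                         * for pageout without splitting. */
                        printf("addr %d: full batch of %d -> keep whole\n",
                               addr, nr);
                }
        }
        return 0;
}

Run over a 10-page range of 4-page folios, this prints two full batches and one partial batch at the tail, which mirrors why the patch only bothers to split when a folio straddles the end of the range it is operating on.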
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7195 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/X...