Some folio bugfixes, plus the multi-size THP NUMA balancing patch that one of the fixes depends on.
Andrew Bresticker (1):
  mm/memory: don't require head page for do_set_pmd()

Baolin Wang (2):
  mm: support multi-size THP numa balancing
  mm: shmem: fix getting incorrect lruvec when replacing a shmem folio

Hugh Dickins (1):
  mm/migrate: fix kernel BUG at mm/compaction.c:2761!

Kefeng Wang (1):
  mm: fix possible OOB in numa_rebuild_large_mapping()

Ran Xiaokai (1):
  mm: huge_memory: fix misused mapping_large_folio_support() for anon folios

Zi Yan (1):
  mm/rmap: do not add fully unmapped large folio to deferred split list

 include/linux/pagemap.h |  4 +++
 mm/huge_memory.c        | 28 +++++++++-------
 mm/memcontrol.c         |  3 +-
 mm/memory.c             | 71 +++++++++++++++++++++++++++++++++--------
 mm/migrate.c            |  8 ++++-
 mm/mprotect.c           |  3 +-
 mm/rmap.c               | 13 ++++++--
 mm/shmem.c              |  2 +-
 8 files changed, 100 insertions(+), 32 deletions(-)
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/9331
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/H...
From: Zi Yan <ziy@nvidia.com>
mainline inclusion
from mainline-v6.10-rc1
commit 7491f3f34891ef8baf5418f6856af91b58f7d200
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7H2V
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In __folio_remove_rmap(), a large folio is added to the deferred split list if any page in the folio loses its final mapping. But it is possible that the folio is fully unmapped, in which case adding it to the deferred split list is unnecessary.
For PMD-mapped THPs, that was not really an issue, because removing the last PMD mapping in the absence of PTE mappings would not have added the folio to the deferred split queue.
However, for PTE-mapped THPs, which are now more prominent due to mTHP, they are always added to the deferred split queue. One side effect is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be unintentionally increased, making it look like there are many partially mapped folios -- although the whole folio is fully unmapped stepwise.
Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs where possible starting from commit b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped folio is unmapped in one go and can avoid being added to deferred split list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be noise when we cannot batch-unmap a complete PTE-mapped folio in one go -- or where this type of batching is not implemented yet, e.g., migration.
To avoid the unnecessary addition, folio->_nr_pages_mapped is checked to tell whether the whole folio is unmapped. If the folio is already on the deferred split list, it will be skipped, too.

Note: commit 98046944a159 ("mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics") tried to exclude mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not fix the above issue. A fully unmapped PTE-mapped order-9 THP was still added to the deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is 512 (non-zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() the order-9 folio is folio_test_pmd_mappable().
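For illustration, the new rule can be condensed into a stand-alone C model (plain userspace code with made-up helpers, not the kernel implementation below): a large anon folio is queued only when some pages just lost their last mapping while others remain mapped.

#include <stdbool.h>
#include <stdio.h>

/* PTE level: 'nr' pages just lost their last mapping, 'mapped' remain mapped */
static bool partially_mapped_pte(int nr, int mapped)
{
    return nr && mapped;        /* mirrors: nr && atomic_read(mapped) */
}

/* PMD level: the PMD mapping went away but fewer than nr_pmdmapped pages did */
static bool partially_mapped_pmd(int nr, int nr_pmdmapped)
{
    return nr < nr_pmdmapped;
}

int main(void)
{
    /* batch-unmapping all 512 PTEs of an order-9 folio in one go */
    printf("full PTE unmap -> queue? %d\n", partially_mapped_pte(512, 0));
    /* unmapping 1 PTE while 511 stay mapped */
    printf("partial unmap  -> queue? %d\n", partially_mapped_pte(1, 511));
    /* whole PMD mapping removed cleanly */
    printf("full PMD unmap -> queue? %d\n", partially_mapped_pmd(512, 512));
    return 0;
}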
Link: https://lkml.kernel.org/r/20240502132852.862138-1-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
    mm/rmap.c
[ Context conflicts. ]
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/rmap.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 27f8881be2ad..d13003244e6a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1473,6 +1473,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 {
     atomic_t *mapped = &folio->_nr_pages_mapped;
     int last, nr = 0, nr_pmdmapped = 0;
+    bool partially_mapped = false;
     enum node_stat_item idx;
 
     __folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1489,6 +1490,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
             if (last)
                 nr++;
         } while (page++, --nr_pages > 0);
+
+        partially_mapped = nr && atomic_read(mapped);
         break;
     case RMAP_LEVEL_PMD:
         last = atomic_add_negative(-1, &folio->_entire_mapcount);
@@ -1505,6 +1508,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
                 nr = 0;
             }
         }
+
+        partially_mapped = nr < nr_pmdmapped;
         break;
     }
 
@@ -1525,10 +1530,12 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
      * Queue anon large folio for deferred split if at least one
      * page of the folio is unmapped and at least one page
      * is still mapped.
+     *
+     * Check partially_mapped first to ensure it is a large folio.
      */
-    if (folio_test_large(folio) && folio_test_anon(folio))
-        if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
-            deferred_split_folio(folio);
+    if (folio_test_anon(folio) && partially_mapped &&
+        list_empty(&folio->_deferred_list))
+        deferred_split_folio(folio);
 }
 
 /*
From: Baolin Wang <baolin.wang@linux.alibaba.com>
mainline inclusion
from mainline-v6.10-rc1
commit d2136d749d76af980b3accd72704eea4eab625bd
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7H2V
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Anonymous page allocation already supports multi-size THP (mTHP), but NUMA balancing still prohibits mTHP migration even when it is an exclusive mapping, which is unreasonable.
Allow scanning mTHP:
Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages") skips NUMA migration of shared CoW pages to avoid migrating shared data segments. In addition, commit 80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses") changed to use page_count() to avoid migrating GUP pages, which also skips mTHP NUMA scanning. Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP case; although there is still a GUP race, that issue seems to have been resolved by commit 80d47f5de5e3. Meanwhile, use folio_likely_mapped_shared() to skip shared CoW pages, even though it is not a precise sharer count. To check whether the folio is shared, ideally we would make sure every page is mapped by the same process, but doing that seems expensive, and the estimated mapcount works well enough when running the autonuma benchmark.
Allow migrating mTHP:
As mentioned in the previous thread [1], large folios (including THP) are more susceptible to false-sharing issues among threads than 4K base pages, leading to pages ping-ponging back and forth during NUMA balancing, which is currently not easy to resolve. Therefore, as a start for mTHP NUMA balancing, we can follow the PMD-mapped THP strategy: reuse the 2-stage filter in should_numa_migrate_memory() to check whether the mTHP is being heavily contended among threads (by checking the CPU id and pid of the last access), which avoids false sharing to some degree. Accordingly, all PTE mappings of a large folio are restored upon the first hint page fault, again following the PMD-mapped THP strategy. In the future, we can continue to optimize the NUMA balancing algorithm to avoid the false-sharing issue with large folios as much as possible.
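As a rough stand-alone model of the range computation (plain C; the helper name and numbers are made up, and the clamping here only considers the VMA bounds, as this patch does), the first hint fault restores every PTE of the folio like so:

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* nr: index of the faulting page within the folio, nr_pages: folio size */
static void folio_restore_range(unsigned long fault_addr, unsigned long nr,
                                unsigned long nr_pages,
                                unsigned long vm_start, unsigned long vm_end,
                                unsigned long *start, unsigned long *end)
{
    unsigned long first = fault_addr - nr * PAGE_SIZE;
    unsigned long last = fault_addr + (nr_pages - nr) * PAGE_SIZE;

    *start = first > vm_start ? first : vm_start;   /* max(first, vm_start) */
    *end = last < vm_end ? last : vm_end;           /* min(last, vm_end) */
}

int main(void)
{
    unsigned long start, end;

    /* fault on page 5 of a 16-page (64K) folio mapped at 0x10000 */
    folio_restore_range(0x15000, 5, 16, 0x0, 0x100000, &start, &end);
    printf("restore PTEs for [%#lx, %#lx)\n", start, end);  /* [0x10000, 0x20000) */
    return 0;
}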
Performance data:
Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
Base: 2024-03-25 mm-unstable branch
Enable mTHP to run autonuma-benchmark
mTHP:16K              Base      Patched
numa01                224.70    143.48
numa01_THREAD_ALLOC   118.05     47.43
numa02                 13.45      9.29
numa02_SMT             14.80      7.50

mTHP:64K              Base      Patched
numa01                216.15    114.40
numa01_THREAD_ALLOC   115.35     47.41
numa02                 13.24      9.25
numa02_SMT             14.67      7.34

mTHP:128K             Base      Patched
numa01                205.13    144.45
numa01_THREAD_ALLOC   112.93     41.88
numa02                 13.16      9.18
numa02_SMT             14.81      7.49
[1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity....
[baolin.wang@linux.alibaba.com: v3]
  Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.171213295...
Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.171168306...
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/memory.c   | 62 +++++++++++++++++++++++++++++++++++++++++----------
 mm/mprotect.c |  3 ++-
 2 files changed, 52 insertions(+), 13 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index a8f0df59aca1..f52e52426da5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5082,17 +5082,51 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
 }
 
 static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
+                    unsigned long fault_addr, pte_t *fault_pte,
                     bool writable)
 {
     pte_t pte, old_pte;
 
-    old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+    old_pte = ptep_modify_prot_start(vma, fault_addr, fault_pte);
     pte = pte_modify(old_pte, vma->vm_page_prot);
     pte = pte_mkyoung(pte);
     if (writable)
         pte = pte_mkwrite(pte, vma);
-    ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-    update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+    ptep_modify_prot_commit(vma, fault_addr, fault_pte, old_pte, pte);
+    update_mmu_cache_range(vmf, vma, fault_addr, fault_pte, 1);
+}
+
+static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
+                       struct folio *folio, pte_t fault_pte,
+                       bool ignore_writable, bool pte_write_upgrade)
+{
+    int nr = pte_pfn(fault_pte) - folio_pfn(folio);
+    unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start);
+    unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE, vma->vm_end);
+    pte_t *start_ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
+    unsigned long addr;
+
+    /* Restore all PTEs' mapping of the large folio */
+    for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) {
+        pte_t ptent = ptep_get(start_ptep);
+        bool writable = false;
+
+        if (!pte_present(ptent) || !pte_protnone(ptent))
+            continue;
+
+        if (pfn_folio(pte_pfn(ptent)) != folio)
+            continue;
+
+        if (!ignore_writable) {
+            ptent = pte_modify(ptent, vma->vm_page_prot);
+            writable = pte_write(ptent);
+            if (!writable && pte_write_upgrade &&
+                can_change_pte_writable(vma, addr, ptent))
+                writable = true;
+        }
+
+        numa_rebuild_single_mapping(vmf, vma, addr, start_ptep, writable);
+    }
 }
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
@@ -5100,11 +5134,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
     struct vm_area_struct *vma = vmf->vma;
     struct folio *folio = NULL;
     int nid = NUMA_NO_NODE;
-    bool writable = false;
+    bool writable = false, ignore_writable = false;
+    bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
     int last_cpupid;
     int target_nid;
     pte_t pte, old_pte;
-    int flags = 0;
+    int flags = 0, nr_pages;
 
     /*
      * The pte cannot be used safely until we verify, while holding the page
@@ -5126,7 +5161,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
      * is only valid while holding the PT lock.
      */
     writable = pte_write(pte);
-    if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+    if (!writable && pte_write_upgrade &&
         can_change_pte_writable(vma, vmf->address, pte))
         writable = true;
 
@@ -5134,10 +5169,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
     if (!folio || folio_is_zone_device(folio))
         goto out_map;
 
-    /* TODO: handle PTE-mapped THP */
-    if (folio_test_large(folio))
-        goto out_map;
-
     /*
      * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
      * much anyway since they can be in shared cache state. This misses
@@ -5157,6 +5188,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
         flags |= TNF_SHARED;
 
     nid = folio_nid(folio);
+    nr_pages = folio_nr_pages(folio);
     /*
      * For memory tiering mode, cpupid of slow memory page is used
      * to record page access time. So use default value.
@@ -5173,6 +5205,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
     }
     pte_unmap_unlock(vmf->pte, vmf->ptl);
     writable = false;
+    ignore_writable = true;
 
     /* Migrate to the requested node */
     if (migrate_misplaced_folio(folio, vma, target_nid)) {
@@ -5193,14 +5226,19 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 out:
     if (nid != NUMA_NO_NODE)
-        task_numa_fault(last_cpupid, nid, 1, flags);
+        task_numa_fault(last_cpupid, nid, nr_pages, flags);
     return 0;
 out_map:
     /*
      * Make it present again, depending on how arch implements
      * non-accessible ptes, some can allow access by kernel mode.
      */
-    numa_rebuild_single_mapping(vmf, vma, writable);
+    if (folio && folio_test_large(folio))
+        numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable,
+                       pte_write_upgrade);
+    else
+        numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
+                        writable);
     pte_unmap_unlock(vmf->pte, vmf->ptl);
     goto out;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f121c46f6e4c..b360577be4f8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
 
                 /* Also skip shared copy-on-write pages */
                 if (is_cow_mapping(vma->vm_flags) &&
-                    folio_ref_count(folio) != 1)
+                    (folio_maybe_dma_pinned(folio) ||
+                     folio_likely_mapped_shared(folio)))
                     continue;
 
                 /*
From: Kefeng Wang <wangkefeng.wang@huawei.com>
mainline inclusion
from mainline-v6.10-rc5
commit cfdd12b48202398a879e8bc4e7fa023f4d473f62
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7H2V
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
During the page fault, the large folio is mapped at a virtual address aligned to the folio size (which is no greater than PMD_SIZE), i.e., 'addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE)' in do_anonymous_page(). But after mremap(), the virtual address only requires PAGE_SIZE alignment. The PTEs are also moved to the new location in move_page_tables(), so traversing the new PTEs in numa_rebuild_large_mapping() can hit the following issue,
Unable to handle kernel paging request at virtual address 00000a80c021a788
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00002040341a6000
[00000a80c021a788] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 0000000096000004 [#1] SMP
...
CPU: 76 PID: 15187 Comm: git Kdump: loaded Tainted: G    W    6.10.0-rc2+ #209
Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 1.79 08/21/2021
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : numa_rebuild_large_mapping+0x338/0x638
lr : numa_rebuild_large_mapping+0x320/0x638
sp : ffff8000b41c3b00
x29: ffff8000b41c3b30 x28: ffff8000812a0000 x27: 00000000000a8000
x26: 00000000000000a8 x25: 0010000000000001 x24: ffff20401c7170f0
x23: 0000ffff33a1e000 x22: 0000ffff33a76000 x21: ffff20400869eca0
x20: 0000ffff33976000 x19: 00000000000000a8 x18: ffffffffffffffff
x17: 0000000000000000 x16: 0000000000000020 x15: ffff8000b41c36a8
x14: 0000000000000000 x13: 205d373831353154 x12: 5b5d333331363732
x11: 000000000011ff78 x10: 000000000011ff10 x9 : ffff800080273f30
x8 : 000000320400869e x7 : c0000000ffffd87f x6 : 00000000001e6ba8
x5 : ffff206f3fb5af88 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : fffffdffc0000000 x0 : 00000a80c021a780
Call trace:
 numa_rebuild_large_mapping+0x338/0x638
 do_numa_page+0x3e4/0x4e0
 handle_pte_fault+0x1bc/0x238
 __handle_mm_fault+0x20c/0x400
 handle_mm_fault+0xa8/0x288
 do_page_fault+0x124/0x498
 do_translation_fault+0x54/0x80
 do_mem_abort+0x4c/0xa8
 el0_da+0x40/0x110
 el0t_64_sync_handler+0xe4/0x158
 el0t_64_sync+0x188/0x190
Fix it by keeping the start and end not only within the VMA range, but also within the page table range.
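A stand-alone model of the fixed clamping (plain C with made-up constants and addresses, not the kernel code): the walked range must stay inside both the VMA and the PMD-sized page table that the faulting PTE belongs to.

#include <stdio.h>

#define PAGE_SIZE   4096UL
#define PMD_SIZE    (512 * PAGE_SIZE)
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

static unsigned long max3(unsigned long a, unsigned long b, unsigned long c)
{
    unsigned long m = a > b ? a : b;
    return m > c ? m : c;
}

static unsigned long min3(unsigned long a, unsigned long b, unsigned long c)
{
    unsigned long m = a < b ? a : b;
    return m < c ? m : c;
}

int main(void)
{
    /* 64K folio whose first page was remapped just below a 2M boundary */
    unsigned long addr = 0x3ff000;      /* faulting address */
    unsigned long nr = 0;               /* faulting page is page 0 of the folio */
    unsigned long folio_bytes = 16 * PAGE_SIZE;
    unsigned long vm_start = 0x200000, vm_end = 0x600000;

    unsigned long addr_start = addr - nr * PAGE_SIZE;
    unsigned long pt_start = ALIGN_DOWN(addr, PMD_SIZE);

    unsigned long start = max3(addr_start, pt_start, vm_start);
    unsigned long end = min3(addr_start + folio_bytes, pt_start + PMD_SIZE, vm_end);

    /* stops at the 2M boundary instead of running off the page table */
    printf("walk PTEs for [%#lx, %#lx) only\n", start, end);
    return 0;
}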
Link: https://lkml.kernel.org/r/20240612122822.4033433-1-wangkefeng.wang@huawei.co...
Fixes: d2136d749d76 ("mm: support multi-size THP numa balancing")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/memory.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index f52e52426da5..83c3650b533b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5101,10 +5101,16 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
                        bool ignore_writable, bool pte_write_upgrade)
 {
     int nr = pte_pfn(fault_pte) - folio_pfn(folio);
-    unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start);
-    unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE, vma->vm_end);
-    pte_t *start_ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
-    unsigned long addr;
+    unsigned long start, end, addr = vmf->address;
+    unsigned long addr_start = addr - (nr << PAGE_SHIFT);
+    unsigned long pt_start = ALIGN_DOWN(addr, PMD_SIZE);
+    pte_t *start_ptep;
+
+    /* Stay within the VMA and within the page table. */
+    start = max3(addr_start, pt_start, vma->vm_start);
+    end = min3(addr_start + folio_size(folio), pt_start + PMD_SIZE,
+           vma->vm_end);
+    start_ptep = vmf->pte - ((addr - start) >> PAGE_SHIFT);
 
     /* Restore all PTEs' mapping of the large folio */
     for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) {
From: Baolin Wang <baolin.wang@linux.alibaba.com>
mainline inclusion
from mainline-v6.10-rc5
commit 9094b4a1c76cfe84b906cc152bab34d4ba26fa5c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7H2V
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When testing shmem swapin, I encountered the warning below on my machine. The reason is that replacing an old shmem folio with a new one causes mem_cgroup_migrate() to clear the old folio's memcg data. As a result, the old folio cannot get the correct memcg's lruvec needed to remove itself from the LRU list when it is being freed. This could lead to possible serious problems, such as LRU list crashes due to holding the wrong LRU lock, and incorrect LRU statistics.
To fix this issue, fall back to using mem_cgroup_replace_folio() to replace the old shmem folio.
[ 5241.100311] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5d9960
[ 5241.100317] head: order:4 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 5241.100319] flags: 0x17fffe0000040068(uptodate|lru|head|swapbacked|node=0|zone=2|lastcpupid=0x3ffff)
[ 5241.100323] raw: 17fffe0000040068 fffffdffd6687948 fffffdffd69ae008 0000000000000000
[ 5241.100325] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 5241.100326] head: 17fffe0000040068 fffffdffd6687948 fffffdffd69ae008 0000000000000000
[ 5241.100327] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 5241.100328] head: 17fffe0000000204 fffffdffd6665801 ffffffffffffffff 0000000000000000
[ 5241.100329] head: 0000000a00000010 0000000000000000 00000000ffffffff 0000000000000000
[ 5241.100330] page dumped because: VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled())
[ 5241.100338] ------------[ cut here ]------------
[ 5241.100339] WARNING: CPU: 19 PID: 78402 at include/linux/memcontrol.h:775 folio_lruvec_lock_irqsave+0x140/0x150
[...]
[ 5241.100374] pc : folio_lruvec_lock_irqsave+0x140/0x150
[ 5241.100375] lr : folio_lruvec_lock_irqsave+0x138/0x150
[ 5241.100376] sp : ffff80008b38b930
[...]
[ 5241.100398] Call trace:
[ 5241.100399]  folio_lruvec_lock_irqsave+0x140/0x150
[ 5241.100401]  __page_cache_release+0x90/0x300
[ 5241.100404]  __folio_put+0x50/0x108
[ 5241.100406]  shmem_replace_folio+0x1b4/0x240
[ 5241.100409]  shmem_swapin_folio+0x314/0x528
[ 5241.100411]  shmem_get_folio_gfp+0x3b4/0x930
[ 5241.100412]  shmem_fault+0x74/0x160
[ 5241.100414]  __do_fault+0x40/0x218
[ 5241.100417]  do_shared_fault+0x34/0x1b0
[ 5241.100419]  do_fault+0x40/0x168
[ 5241.100420]  handle_pte_fault+0x80/0x228
[ 5241.100422]  __handle_mm_fault+0x1c4/0x440
[ 5241.100424]  handle_mm_fault+0x60/0x1f0
[ 5241.100426]  do_page_fault+0x120/0x488
[ 5241.100429]  do_translation_fault+0x4c/0x68
[ 5241.100431]  do_mem_abort+0x48/0xa0
[ 5241.100434]  el0_da+0x38/0xc0
[ 5241.100436]  el0t_64_sync_handler+0x68/0xc0
[ 5241.100437]  el0t_64_sync+0x14c/0x150
[ 5241.100439] ---[ end trace 0000000000000000 ]---
[baolin.wang@linux.alibaba.com: remove less helpful comments, per Matthew]
  Link: https://lkml.kernel.org/r/ccad3fe1375b468ebca3227b6b729f3eaf9d8046.171842319...
Link: https://lkml.kernel.org/r/3c11000dd6c1df83015a8321a859e9775ebbc23e.171826611...
Fixes: 85ce2c517ade ("memcontrol: only transfer the memcg data for migration")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
    mm/shmem.c
[ Context conflicts with mem_reliable feature. ]
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/memcontrol.c | 3 +--
 mm/shmem.c      | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d97410e3ec0e..fcf08f3dc53f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -8606,8 +8606,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
  * @new: Replacement folio.
  *
  * Charge @new as a replacement folio for @old. @old will
- * be uncharged upon free. This is only used by the page cache
- * (in replace_page_cache_folio()).
+ * be uncharged upon free.
  *
  * Both folios must be locked, @new->mapping must be set up.
  */
diff --git a/mm/shmem.c b/mm/shmem.c
index cf27e1785f80..079f47192bdb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1769,7 +1769,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
     xa_lock_irq(&swap_mapping->i_pages);
     error = shmem_replace_entry(swap_mapping, swap_index, old, new);
     if (!error) {
-        mem_cgroup_migrate(old, new);
+        mem_cgroup_replace_folio(old, new);
         __lruvec_stat_mod_folio(new, NR_FILE_PAGES, 1);
         __lruvec_stat_mod_folio(new, NR_SHMEM, 1);
         shmem_reliable_folio_add(new, 1);
From: Ran Xiaokai <ran.xiaokai@zte.com.cn>
mainline inclusion
from mainline-v6.10-rc5
commit 6a50c9b512f7734bc356f4bd47885a6f7c98491a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7H2V
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When I ran a large-folio split test, the WARNING "[ 5059.122759][ T166] Cannot split file folio to non-0 order" was triggered. But the test cases only cover anonymous folios, while mapping_large_folio_support() is only meaningful for page cache folios.

In split_huge_page_to_list_to_order(), the folio passed to mapping_large_folio_support() may be an anonymous folio; the folio_test_anon() check is missing, so splitting the anonymous THP fails. The same applies to shmem_mapping(), so we'd better add a check for both. The shmem_mapping() call in __split_huge_page() is not affected, though: for anonymous folios the end parameter is set to -1, so (head[i].index >= end) is always false and shmem_mapping() is never called.

Also add a VM_WARN_ONCE() in mapping_large_folio_support() for anon mappings, so we can detect the wrong use more easily.

THP folios may exist in the page cache even if the file system doesn't support large folios: when CONFIG_TRANSPARENT_HUGEPAGE is enabled, khugepaged will try to collapse read-only file-backed pages to THP, but the mapping does not actually support multi-order large folios properly.

Using /sys/kernel/debug/split_huge_pages to verify this: with this patch, a large anon THP is successfully split and the warning is gone.
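As a stand-alone model of the corrected ordering (plain C; the struct and flags are invented for the example, and the swapcache and CONFIG_READ_ONLY_THP_FOR_FS details are omitted), anonymous folios are decided first and the mapping is only consulted for page cache folios:

#include <stdbool.h>
#include <stdio.h>

struct toy_folio {
    bool anon;                      /* anonymous vs page-cache folio */
    bool shmem;                     /* backed by shmem */
    bool mapping_supports_large;    /* file system supports large folios */
};

/* return true if splitting to new_order is allowed */
static bool can_split(const struct toy_folio *f, int new_order)
{
    if (f->anon)                    /* anon THP: only order-1 is forbidden */
        return new_order != 1;
    if (!new_order)                 /* splitting to order-0 is always fine */
        return true;
    if (f->shmem)                   /* shmem: no non-zero-order split */
        return false;
    return f->mapping_supports_large;
}

int main(void)
{
    struct toy_folio anon = { .anon = true };
    struct toy_folio file = { .anon = false, .mapping_supports_large = false };

    printf("anon THP -> order 4: %d\n", can_split(&anon, 4));  /* 1: no longer blocked by the file check */
    printf("anon THP -> order 1: %d\n", can_split(&anon, 1));  /* 0 */
    printf("file THP -> order 2: %d\n", can_split(&file, 2));  /* 0 */
    return 0;
}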
Link: https://lkml.kernel.org/r/202406071740485174hcFl7jRxncsHDtI-Pz-o@zte.com.cn
Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages")
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
    mm/huge_memory.c
[ Context conflicts. ]
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 include/linux/pagemap.h |  4 ++++
 mm/huge_memory.c        | 28 +++++++++++++++++-----------
 2 files changed, 21 insertions(+), 11 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 26a60ff9cfed..d2dceb45746c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -364,6 +364,10 @@ static inline void mapping_clear_large_folios(struct address_space *mapping)
  */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
+    /* AS_LARGE_FOLIO_SUPPORT is only reasonable for pagecache folios */
+    VM_WARN_ONCE((unsigned long)mapping & PAGE_MAPPING_ANON,
+            "Anonymous mapping always supports large folio");
+
     return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
         test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eddb7984610d..a6f087186630 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3169,30 +3169,36 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
     if (new_order >= folio_order(folio))
         return -EINVAL;
 
-    /* Cannot split anonymous THP to order-1 */
-    if (new_order == 1 && folio_test_anon(folio)) {
-        VM_WARN_ONCE(1, "Cannot split to order-1 folio");
-        return -EINVAL;
-    }
-
-    if (new_order) {
-        /* Only swapping a whole PMD-mapped folio is supported */
-        if (folio_test_swapcache(folio))
+    if (folio_test_anon(folio)) {
+        /* order-1 is not supported for anonymous THP. */
+        if (new_order == 1) {
+            VM_WARN_ONCE(1, "Cannot split to order-1 folio");
             return -EINVAL;
+        }
+    } else if (new_order) {
         /* Split shmem folio to non-zero order not supported */
         if (shmem_mapping(folio->mapping)) {
             VM_WARN_ONCE(1,
                 "Cannot split shmem folio to non-0 order");
             return -EINVAL;
         }
-        /* No split if the file system does not support large folio */
-        if (!mapping_large_folio_support(folio->mapping)) {
+        /*
+         * No split if the file system does not support large folio.
+         * Note that we might still have THPs in such mappings due to
+         * CONFIG_READ_ONLY_THP_FOR_FS. But in that case, the mapping
+         * does not actually support large folios properly.
+         */
+        if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+            !mapping_large_folio_support(folio->mapping)) {
             VM_WARN_ONCE(1,
                 "Cannot split file folio to non-0 order");
             return -EINVAL;
         }
     }
+    /* Only swapping a whole PMD-mapped folio is supported */
+    if (folio_test_swapcache(folio) && new_order)
+        return -EINVAL;
 
     is_hzp = is_huge_zero_page(&folio->page);
     if (is_hzp) {
From: Hugh Dickins <hughd@google.com>
mainline inclusion
from mainline-v6.10-rc5
commit 8e279f970b5cb0628f856b6735e2e47b4da9f76e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7H2V
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
I hit the VM_BUG_ON(!list_empty(&cc->migratepages)) in compact_zone(); and if DEBUG_VM were off, then pages would be lost on a local list.
Our convention is that if migrate_pages() reports complete success (0), then the migratepages list will be empty; but if it reports an error or some pages remaining, then its caller must putback_movable_pages().
There's a new case in which migrate_pages() has been reporting complete success, but returning with pages left on the migratepages list: when migrate_pages_batch() successfully split a folio on the deferred list, but then the "Failure isn't counted" call does not dispose of them all.
Since that block is expecting the large folio to have been counted as 1 failure already, and since the return code is later adjusted to success whenever the returned list is found empty, the simple way to fix this safely is to count splitting the deferred folio as "a failure".
Link: https://lkml.kernel.org/r/46c948b4-4dd8-6e03-4c7b-ce4e81cfa536@google.com
Fixes: 7262f208ca68 ("mm/migrate: split source folio if it is on deferred split list")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/migrate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 141509ec9a48..78c5b4aaf60d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1655,7 +1655,12 @@ static int migrate_pages_batch(struct list_head *from,
 
             /*
              * The rare folio on the deferred split list should
-             * be split now. It should not count as a failure.
+             * be split now. It should not count as a failure:
+             * but increment nr_failed because, without doing so,
+             * migrate_pages() may report success with (split but
+             * unmigrated) pages still on its fromlist; whereas it
+             * always reports success when its fromlist is empty.
+             *
              * Only check it without removing it from the list.
              * Since the folio can be on deferred_split_scan()
              * local list and removing it can cause the local list
@@ -1670,6 +1675,7 @@ static int migrate_pages_batch(struct list_head *from,
             if (nr_pages > 2 &&
                 !list_empty(&folio->_deferred_list)) {
                 if (try_split_folio(folio, split_folios) == 0) {
+                    nr_failed++;
                     stats->nr_thp_split += is_thp;
                     stats->nr_split++;
                     continue;
From: Andrew Bresticker <abrestic@rivosinc.com>
next inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7H2V
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The requirement that the head page be passed to do_set_pmd() was added in commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the finish_fault() and filemap_map_pages() paths if the page to be inserted is anything but the head page for an otherwise suitable vma and pmd-sized page.
Matthew said:
: We're going to stop using PMDs to map large folios unless the fault is : within the first 4KiB of the PMD. No idea how many workloads that : affects, but it only needs to be backported as far as v6.8, so we may : as well backport it.
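A stand-alone model of why the head-page requirement is unnecessary (plain C; the addresses are made up): a fault on any subpage of a PMD-sized folio resolves to the same PMD-aligned address, so do_set_pmd() can derive the head page itself.

#include <stdio.h>

#define PAGE_SHIFT  12
#define PMD_ORDER   9
#define PMD_SIZE    (1UL << (PAGE_SHIFT + PMD_ORDER))

int main(void)
{
    unsigned long folio_start = 0x7f0000200000UL;   /* head page address, PMD aligned */
    unsigned long fault_addr = folio_start + 37 * (1UL << PAGE_SHIFT);

    unsigned long haddr = fault_addr & ~(PMD_SIZE - 1);

    /* the PMD covers the whole folio, beginning at the head page */
    printf("fault at %#lx -> PMD maps [%#lx, %#lx)\n",
           fault_addr, haddr, haddr + PMD_SIZE);
    return 0;
}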
Link: https://lkml.kernel.org/r/20240611153216.2794513-1-abrestic@rivosinc.com
Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 83c3650b533b..fe10342687d0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4609,8 +4609,9 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
     if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
         return ret;
 
-    if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
+    if (folio_order(folio) != HPAGE_PMD_ORDER)
         return ret;
+    page = &folio->page;
 
     /*
      * Just backoff if any subpage of a THP is corrupted otherwise