David Hildenbrand (1):
  mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()

Hugh Dickins (1):
  mm: shmem: fix ShmemHugePages at swapout

Kalesh Singh (1):
  mm: Respect mmap hint address when aligning for THP

Matthew Wilcox (Oracle) (2):
  mm: open-code page_folio() in dump_page()
  mm: open-code PageTail in folio_flags() and const_folio_flags()

 include/linux/page-flags.h |  2 +-
 mm/debug.c                 |  7 +++++--
 mm/filemap.c               |  4 ++--
 mm/mmap.c                  |  3 ++-
 mm/shmem.c                 | 22 ++++++++++++----------
 5 files changed, 22 insertions(+), 16 deletions(-)
From: "Matthew Wilcox (Oracle)" willy@infradead.org
maillist inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IB1RRY
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?...
-------------------------------------------
page_folio() calls page_fixed_fake_head() which will misidentify this page as being a fake head and load off the end of 'precise'. We may have a pointer to a fake head, but that's OK because it contains the right information for dump_page().
gcc-15 is smart enough to catch this with -Warray-bounds:
In function 'page_fixed_fake_head',
    inlined from '_compound_head' at ../include/linux/page-flags.h:251:24,
    inlined from '__dump_page' at ../mm/debug.c:123:11:
../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is outside array bounds of 'struct page[1]' [-Warray-bounds=]
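For reference, the open-coded test below relies on the encoding of struct page's compound_head field: bit 0 is set in tail pages and the remaining bits hold the head page's address, so the head can be recovered without any fake-head probing of neighbouring struct pages. A minimal userspace sketch of that encoding (mock types, not the kernel's struct page):

#include <stdio.h>
#include <stdint.h>

struct mock_page {
	uintptr_t compound_head;	/* bit 0 set => this is a tail page */
};

/* Recover the head page, mirroring the compound_head bit-0 convention. */
static struct mock_page *mock_head(struct mock_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct mock_page *)(head - 1);	/* tail: strip the tag */
	return page;					/* head or order-0 page */
}

int main(void)
{
	struct mock_page pages[4] = { { 0 } };

	/* Mark pages[1..3] as tails of the compound page headed by pages[0]. */
	for (int i = 1; i < 4; i++)
		pages[i].compound_head = (uintptr_t)&pages[0] | 1;

	printf("head of pages[2]: %p (expect %p)\n",
	       (void *)mock_head(&pages[2]), (void *)&pages[0]);
	return 0;
}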
Link: https://lkml.kernel.org/r/20241125201721.2963278-2-willy@infradead.org
Fixes: fae7d834c43c ("mm: add __dump_folio()")
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
Reported-by: Kees Cook kees@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Jinjiang Tu tujinjiang@huawei.com
---
 mm/debug.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/debug.c b/mm/debug.c
index 0dd516a6640e..df3949e49eb3 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -111,19 +111,22 @@ static void __dump_page(const struct page *page)
 {
 	struct folio *foliop, folio;
 	struct page precise;
+	unsigned long head;
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long idx, nr_pages = 1;
 	int loops = 5;
 
 again:
 	memcpy(&precise, page, sizeof(*page));
-	foliop = page_folio(&precise);
-	if (foliop == (struct folio *)&precise) {
+	head = precise.compound_head;
+	if ((head & 1) == 0) {
+		foliop = (struct folio *)&precise;
 		idx = 0;
 		if (!folio_test_large(foliop))
 			goto dump;
 		foliop = (struct folio *)page;
 	} else {
+		foliop = (struct folio *)(head - 1);
 		idx = folio_page_idx(foliop, page);
 	}
 
From: Kalesh Singh kaleshsingh@google.com
maillist inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IB1RRY
Reference: https://lore.kernel.org/lkml/20241118214650.3667577-1-kaleshsingh@google.com...
-------------------------------------------
Commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") updated __get_unmapped_area() to align the start address for the VMA to a PMD boundary if CONFIG_TRANSPARENT_HUGEPAGE=y.
It does this by effectively looking up a region of size request_size + PMD_SIZE and aligning the start up to a PMD boundary.
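A simplified model of that padded lookup, assuming 2 MiB PMDs (illustrative userspace arithmetic only, not the kernel's implementation):

#include <stdio.h>

#define PMD_SIZE	(2UL << 20)	/* assume 2 MiB PMDs for illustration */

/* Round addr up to the next PMD boundary. */
static unsigned long pmd_align_up(unsigned long addr)
{
	return (addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
}

int main(void)
{
	unsigned long len = 0x400000;			/* requested mapping size */
	unsigned long gap_start = 0x555555554000UL;	/* start of a free gap */

	/* The alignment path searches for a gap of len + PMD_SIZE ... */
	unsigned long search_len = len + PMD_SIZE;
	/* ... then shifts the start up to a PMD boundary inside that gap. */
	unsigned long aligned_start = pmd_align_up(gap_start);

	printf("search for 0x%lx bytes, map at 0x%lx\n", search_len, aligned_start);
	return 0;
}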
Commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") opted out of this for 32bit due to regressions in mmap base randomization.
Commit d4148aeab412 ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes") restricted this to only mmap sizes that are multiples of the PMD_SIZE due to reported regressions in some performance benchmarks -- which seemed mostly due to the reduced spatial locality of related mappings due to the forced PMD-alignment.
Another unintended side effect has emerged: When a user specifies an mmap hint address, the THP alignment logic modifies the behavior, potentially ignoring the hint even if a sufficiently large gap exists at the requested hint location.
Example Scenario:
Consider the following simplified virtual address (VA) space:
...
    0x200000-0x400000 --- VMA A
    0x400000-0x600000 --- Hole
    0x600000-0x800000 --- VMA B
...
A call to mmap() with hint=0x400000 and len=0x200000 behaves differently:
- Before THP alignment: The requested region (size 0x200000) fits into the gap at 0x400000, so the hint is respected.
- After alignment: The logic searches for a region of size 0x400000 (len + PMD_SIZE) starting at 0x400000. This search fails due to the mapping at 0x600000 (VMA B), and the hint is ignored, falling back to arch_get_unmapped_area[_topdown]().
In general, the hint is effectively ignored if there is any existing mapping in the following range:
[mmap_hint + mmap_size, mmap_hint + mmap_size + PMD_SIZE)
This changes the semantics of the mmap hint from "Respect the hint if a sufficiently large gap exists at the requested location" to "Respect the hint only if an additional PMD-sized gap exists beyond the requested size".
This has performance implications for allocators that allocate their heap using mmap but try to keep it "as contiguous as possible" by using the end of the existing heap as the address hint. With the new behavior it's more likely to get a much less contiguous heap, adding extra fragmentation and performance overhead.
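For illustration, a hypothetical userspace snippet of the allocator pattern described above (not taken from any real allocator): the hint passed to mmap() is the end of the current heap, and respecting it keeps the heap contiguous.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t chunk = 0x200000;

	/* Initial heap chunk, placed wherever the kernel chooses. */
	void *heap = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (heap == MAP_FAILED)
		return 1;

	/* Grow the heap, hinting at the end of the existing mapping. */
	void *hint = (char *)heap + chunk;
	void *more = mmap(hint, chunk, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (more == MAP_FAILED)
		return 1;

	printf("heap %p, next chunk %p (%scontiguous)\n",
	       heap, more, more == hint ? "" : "not ");
	return 0;
}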
To restore the expected behavior, don't use thp_get_unmapped_area_vmflags() for anonymous mappings when the user provides a hint address.
Note: as Yang Shi pointed out, the issue still remains for filesystems that use thp_get_unmapped_area() for their get_unmapped_area() op. It is unclear which workloads would regress if we ignore THP alignment when the hint address is provided for such file-backed mappings, so this fix will be handled separately.
Cc: Andrew Morton akpm@linux-foundation.org
Cc: Vlastimil Babka vbabka@suse.cz
Cc: Yang Shi yang@os.amperecomputing.com
Cc: Rik van Riel riel@surriel.com
Cc: Ryan Roberts ryan.roberts@arm.com
Cc: Suren Baghdasaryan surenb@google.com
Cc: Minchan Kim minchan@kernel.org
Cc: Hans Boehm hboehm@google.com
Cc: Lokesh Gidra lokeshgidra@google.com
Cc: stable@vger.kernel.org
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Signed-off-by: Kalesh Singh kaleshsingh@google.com
Reviewed-by: Rik van Riel riel@surriel.com
Reviewed-by: Vlastimil Babka vbabka@suse.cz
Conflicts:
	mm/mmap.c
[Context conflicts.]
Signed-off-by: Jinjiang Tu tujinjiang@huawei.com
---
 mm/mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index dfa3d2bfe289..9fbc02eec7ba 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1900,7 +1900,8 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		 * so use shmem's get_unmapped_area in case it can be huge.
 		 */
 		get_area = shmem_get_unmapped_area;
-	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+		   && !addr /* no hint */) {
 		/* Ensures that larger anonymous mappings are THP aligned. */
 		get_area = thp_get_unmapped_area;
 	}
From: Hugh Dickins hughd@google.com
maillist inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IB9L9C
Reference: https://lore.kernel.org/lkml/5ba477c8-a569-70b5-923e-09ab221af45b@google.com...
-------------------------------------------
/proc/meminfo ShmemHugePages has been showing overlarge amounts (more than Shmem) after swapping out THPs: we forgot to update NR_SHMEM_THPS.
Add shmem_update_stats(), to avoid repetition, and risk of making that mistake again: the call from shmem_delete_from_page_cache() is the bugfix; the call from shmem_replace_folio() is reassuring, but not really a bugfix (replace corrects misplaced swapin readahead, but huge swapin readahead would be a mistake).
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Hugh Dickins hughd@google.com
Cc: stable@vger.kernel.org
Conflicts:
	mm/shmem.c
[Context conflicts.]
Signed-off-by: Jinjiang Tu tujinjiang@huawei.com
---
 mm/shmem.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1440f17c5d02..5b39db2f6cfe 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -778,6 +778,14 @@ static bool shmem_huge_global_enabled(struct inode *inode, pgoff_t index,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static void shmem_update_stats(struct folio *folio, int nr_pages)
+{
+	if (folio_test_pmd_mappable(folio))
+		__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr_pages);
+	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+	__lruvec_stat_mod_folio(folio, NR_SHMEM, nr_pages);
+}
+
 /*
  * Somewhat like filemap_add_folio, but error if expected item has gone.
  */
@@ -812,10 +820,7 @@ static int shmem_add_to_page_cache(struct folio *folio,
 	xas_store(&xas, folio);
 	if (xas_error(&xas))
 		goto unlock;
-	if (folio_test_pmd_mappable(folio))
-		__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr);
-	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
-	__lruvec_stat_mod_folio(folio, NR_SHMEM, nr);
+	shmem_update_stats(folio, nr);
 	shmem_reliable_folio_add(folio, nr);
 	mapping->nrpages += nr;
 unlock:
@@ -844,8 +849,7 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
 	error = shmem_replace_entry(mapping, folio->index, folio, radswap);
 	folio->mapping = NULL;
 	mapping->nrpages -= nr;
-	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
-	__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
+	shmem_update_stats(folio, -nr);
 	shmem_reliable_folio_add(folio, -nr);
 	xa_unlock_irq(&mapping->i_pages);
 	folio_put_refs(folio, nr);
@@ -1977,11 +1981,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
 	}
 	if (!error) {
 		mem_cgroup_replace_folio(old, new);
-		__lruvec_stat_mod_folio(new, NR_FILE_PAGES, nr_pages);
-		__lruvec_stat_mod_folio(new, NR_SHMEM, nr_pages);
+		shmem_update_stats(new, nr_pages);
 		shmem_reliable_folio_add(new, nr_pages);
-		__lruvec_stat_mod_folio(old, NR_FILE_PAGES, -nr_pages);
-		__lruvec_stat_mod_folio(old, NR_SHMEM, -nr_pages);
+		shmem_update_stats(old, -nr_pages);
 		shmem_reliable_folio_add(old, -nr_pages);
 	}
 	xa_unlock_irq(&swap_mapping->i_pages);
From: David Hildenbrand david@redhat.com
maillist inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IB9L9C
Reference: https://lore.kernel.org/all/20241129125303.4033164-1-david@redhat.com/
-------------------------------------------
The folio can get freed + buddy-merged + reallocated in the meantime, resulting in us calling folio_test_locked() possibly on a tail page.
This makes const_folio_flags() trigger its VM_BUG_ON_PGFLAGS() when stumbling over the tail page.
Could this result in other issues? Doesn't look like it. False positives and false negatives don't really matter, because this folio would get skipped either way when detecting that they have been reallocated in the meantime.
Fix it by performing the folio_test_locked() check after grabbing a reference. If this ever becomes a real problem, we could add a special helper that racily checks if the bit is set even on tail pages ... but let's hope that's not required so we can just handle it more cleanly: work on the folio after we hold a reference.
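The ordering is the usual "pin first, then inspect" lookup pattern; a library-agnostic sketch with mock types (not the kernel's folio API):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct object {
	atomic_int refcount;	/* 0 means the object is being freed */
	bool locked;
};

/* Take a reference only if the object is still alive (refcount > 0). */
static bool object_try_get(struct object *obj)
{
	int ref = atomic_load(&obj->refcount);

	while (ref > 0) {
		if (atomic_compare_exchange_weak(&obj->refcount, &ref, ref + 1))
			return true;
	}
	return false;
}

static void object_put(struct object *obj)
{
	atomic_fetch_sub(&obj->refcount, 1);
}

int main(void)
{
	struct object obj = { .refcount = 1, .locked = false };

	if (object_try_get(&obj)) {	/* 1. pin the object first */
		if (!obj.locked)	/* 2. only then inspect its state */
			printf("object usable\n");
		object_put(&obj);
	}
	return 0;
}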
Do we really need the folio_test_locked() check if we are going to trylock briefly after? Well, we can at least avoid a xas_reload().
It's a bit unclear which exact change introduced that issue. Likely, ever since we made PG_locked obey the PF_NO_TAIL policy it could have been triggered in some way.
Link: https://lkml.kernel.org/r/20241129125303.4033164-1-david@redhat.com
Fixes: 48c935ad88f5 ("page-flags: define PG_locked behavior on compound pages")
Signed-off-by: David Hildenbrand david@redhat.com
Reported-by: syzbot+9f9a7f73fb079b2387a6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/674184c9.050a0220.1cc393.0001.GAE@google.com/
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: "Matthew Wilcox (Oracle)" willy@infradead.org
Cc: Hillf Danton hdanton@sina.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Jinjiang Tu tujinjiang@huawei.com
---
 mm/filemap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index edc223834cde..c4d431b5e26e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3545,10 +3545,10 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
 			continue;
 		if (xa_is_value(folio))
 			continue;
-		if (folio_test_locked(folio))
-			continue;
 		if (!folio_try_get(folio))
 			continue;
+		if (folio_test_locked(folio))
+			goto skip;
 		/* Has the page moved or been split? */
 		if (unlikely(folio != xas_reload(xas)))
 			goto skip;
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion
from mainline-v6.13-rc2
commit 4de22b2a6a7477d84d9a01eb6b62a9117309d722
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IB9L9C
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------
It is unsafe to call PageTail() in dump_page() as page_is_fake_head() will almost certainly return true when called on a head page that is copied to the stack. That will cause the VM_BUG_ON_PGFLAGS() in const_folio_flags() to trigger when it shouldn't. Fortunately, we don't need to call PageTail() here; it's fine to have a pointer to a virtual alias of the page's flag word rather than the real page's flag word.
Link: https://lkml.kernel.org/r/20241125201721.2963278-1-willy@infradead.org
Fixes: fae7d834c43c ("mm: add __dump_folio()")
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Kees Cook kees@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Conflicts:
	include/linux/page-flags.h
[Conflicts because ce3467af6bded1c0018ca67ea1599f45fbb8100b isn't merged.]
Signed-off-by: Jinjiang Tu tujinjiang@huawei.com
---
 include/linux/page-flags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7a67d997eece..6878081cbbc1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -312,7 +312,7 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
 {
 	struct page *page = &folio->page;
 
-	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+	VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
 	VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags), page);
 	return &page[n].flags;
 }
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/14178 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/Y...