1. hugetlb: remove hugetlb special casing in filemap.c
2. thp:
   1) align larger anonymous mappings on THP boundaries
   2) batch tlb flush when splitting a pte-mapped THP
   3) try madvise huge for file exec mappings
Baolin Wang (1):
      mm: huge_memory: batch tlb flush when splitting a pte-mapped THP

Fangrui Song (1):
      mm: remove VM_EXEC requirement for THP eligibility

Kefeng Wang (1):
      mm: filemap: try to enable THP for exec mapping

Lance Yang (2):
      mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check
      mm/khugepaged: keep mm in mm_slot without MMF_DISABLE_THP check

Lorenzo Stoakes (1):
      mm/filemap: clarify filemap_fault() comments for not uptodate case

Rik van Riel (1):
      mm: align larger anonymous mappings on THP boundaries

Ryan Roberts (1):
      mm: thp_get_unmapped_area must honour topdown preference

Sidhartha Kumar (3):
      mm/filemap: remove hugetlb special casing in filemap.c
      mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI
      fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling

Yang Shi (3):
      mm: mmap: map MAP_STACK to VM_NOHUGEPAGE
      mm: huge_memory: don't force huge page alignment on 32 bit
      mm: mmap: no need to call khugepaged_enter_vma() for stack
 Documentation/admin-guide/mm/transhuge.rst |  7 ++
 fs/Kconfig                                 |  1 +
 fs/hugetlbfs/inode.c                       | 37 +++++-----
 include/linux/huge_mm.h                    |  2 +-
 include/linux/hugetlb.h                    | 12 ++++
 include/linux/mman.h                       |  1 +
 include/linux/pagemap.h                    | 32 +--------
 mm/filemap.c                               | 80 ++++++++++++++--------
 mm/huge_memory.c                           | 47 ++++++++++++-
 mm/hugetlb.c                               | 32 ++-------
 mm/khugepaged.c                            | 16 +++--
 mm/memory-failure.c                        |  2 +-
 mm/migrate.c                               |  6 +-
 mm/mmap.c                                  | 11 +--
 14 files changed, 165 insertions(+), 121 deletions(-)
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion
from mainline-v6.8-rc1
commit 3027c6f8eb9d3857aef08f401aeb7de715410525
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
I can observe an obvious tlb flush hotspot when splitting a pte-mapped THP on my ARM64 server, and the distribution of this hotspot is as follows:
- 16.85% split_huge_page_to_list
   + 7.80% down_write
   - 7.49% try_to_migrate
      - 7.48% rmap_walk_anon
           7.23% ptep_clear_flush
   + 1.52% __split_huge_page
The reason is that the split_huge_page_to_list() will build migration entries for each subpage of a pte-mapped Anon THP by try_to_migrate(), or unmap for file THP, and it will clear and tlb flush for each subpage's pte. Moreover, the split_huge_page_to_list() will set TTU_SPLIT_HUGE_PMD flag to ensure the THP is already a pte-mapped THP before splitting it to some normal pages.
Actually, there is no need to flush tlb for each subpage immediately, instead we can batch tlb flush for the pte-mapped THP to improve the performance.
After this patch, we can see that the batch tlb flush noticeably improves the latency when running thpscale.
                             k6.5-base                 patched
Amean     fault-both-1      1071.17 (   0.00%)      901.83 *  15.81%*
Amean     fault-both-3      2386.08 (   0.00%)     1865.32 *  21.82%*
Amean     fault-both-5      2851.10 (   0.00%)     2273.84 *  20.25%*
Amean     fault-both-7      3679.91 (   0.00%)     2881.66 *  21.69%*
Amean     fault-both-12     5916.66 (   0.00%)     4369.55 *  26.15%*
Amean     fault-both-18     7981.36 (   0.00%)     6303.57 *  21.02%*
Amean     fault-both-24    10950.79 (   0.00%)     8752.56 *  20.07%*
Amean     fault-both-30    14077.35 (   0.00%)    10170.01 *  27.76%*
Amean     fault-both-32    13061.57 (   0.00%)    11630.08 *  10.96%*
Link: https://lkml.kernel.org/r/431d9fb6823036369dcb1d3b2f63732f01df21a7.169848826... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: Yang Shi shy828301@gmail.com Reviewed-by: Alistair Popple apopple@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 3027c6f8eb9d3857aef08f401aeb7de715410525) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/huge_memory.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cd1eb575f15d..4ce2ed00eaa2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2556,7 +2556,7 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_folio(struct folio *folio) { enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | - TTU_SYNC; + TTU_SYNC | TTU_BATCH_FLUSH;
VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
@@ -2569,6 +2569,8 @@ static void unmap_folio(struct folio *folio) try_to_migrate(folio, ttu_flags); else try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK); + + try_to_unmap_flush(); }
static void remap_page(struct folio *folio, unsigned long nr)
From: Lorenzo Stoakes lstoakes@gmail.com
mainline inclusion
from mainline-v6.7-rc1
commit 6facf36ee49675ea952713a7a1488e9f191e7eec
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
The existing comments in filemap_fault() suggest that, after either a minor fault has occurred and filemap_get_folio() found a folio in the page cache, or a major fault arose and __filemap_get_folio(FGP_CREATE...) did the job (having relied on do_sync_mmap_readahead() or filemap_read_folio() to read in the folio), the only possible reason it could not be uptodate is because of an error.
This is not so, as if, for instance, the fault occurred within a VMA which had the VM_RAND_READ flag set (via madvise() with the MADV_RANDOM flag specified), this would cause even synchronous readahead to fail to read in the folio.
I confirmed this by dropping page caches and faulting in memory madvise()'d this way, observing that this code path was reached on each occasion.
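A quick reproducer sketch of that scenario (not part of the patch; the file path and the 16MB length are made up for illustration), using the standard mmap()/madvise() APIs:

/*
 * Sketch: fault in a file mapping that has been marked MADV_RANDOM.
 * With the page cache dropped first (echo 3 > /proc/sys/vm/drop_caches),
 * synchronous readahead is skipped and filemap_fault() reaches the
 * "not uptodate" path without any I/O error.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 16UL << 20;                  /* illustrative size */
        int fd = open("/tmp/testfile", O_RDONLY); /* illustrative path */
        unsigned long sum = 0;
        char *p;

        if (fd < 0)
                return 1;
        p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        madvise(p, len, MADV_RANDOM);   /* sets VM_RAND_READ on the VMA */
        for (size_t i = 0; i < len; i += 4096)
                sum += p[i];            /* each access takes a page fault */
        printf("%lu\n", sum);
        munmap(p, len);
        close(fd);
        return 0;
}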
Clarify the comments to include this case, and additionally update the comment recently added around the invalidate lock logic to make it clear the comment explicitly refers to the minor fault case.
In addition, while we're here, refer to folios rather than pages.
[lstoakes@gmail.com: correct identation as per Christopher's feedback] Link: https://lkml.kernel.org/r/2c7014c0-6343-4e76-8697-3f84f54350bd@lucifer.local Link: https://lkml.kernel.org/r/20230930231029.88196-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes lstoakes@gmail.com Reviewed-by: Jan Kara jack@suse.cz Reviewed-by: Christoph Hellwig hch@lst.de Cc: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 6facf36ee49675ea952713a7a1488e9f191e7eec) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/filemap.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c index bf4c07bded63..1022bce6bcd5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3319,21 +3319,28 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio);
/* - * We have a locked page in the page cache, now we need to check - * that it's up-to-date. If not, it is going to be due to an error. + * We have a locked folio in the page cache, now we need to check + * that it's up-to-date. If not, it is going to be due to an error, + * or because readahead was otherwise unable to retrieve it. */ if (unlikely(!folio_test_uptodate(folio))) { /* - * The page was in cache and uptodate and now it is not. - * Strange but possible since we didn't hold the page lock all - * the time. Let's drop everything get the invalidate lock and - * try again. + * If the invalidate lock is not held, the folio was in cache + * and uptodate and now it is not. Strange but possible since we + * didn't hold the page lock all the time. Let's drop + * everything, get the invalidate lock and try again. */ if (!mapping_locked) { folio_unlock(folio); folio_put(folio); goto retry_find; } + + /* + * OK, the folio is really not uptodate. This can be because the + * VMA has the VM_RAND_READ flag set, or because an error + * arose. Let's read it in directly. + */ goto page_not_uptodate; }
From: Sidhartha Kumar sidhartha.kumar@oracle.com
mainline inclusion
from mainline-v6.7-rc1
commit a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
Remove special cased hugetlb handling code within the page cache by changing the granularity of ->index to the base page size rather than the huge page size. The motivation of this patch is to reduce complexity within the filemap code while also increasing performance by removing branches that are evaluated on every page cache lookup.
To support the change in index, new wrappers for hugetlb page cache interactions are added. These wrappers perform the conversion to a linear index which is now expected by the page cache for huge pages.
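As a quick illustration of the new index convention (a sketch with example values, not code taken from the patch): with 2MB huge pages on a 4KB base-page kernel, huge_page_order(h) is 9, so hugetlb page-cache offsets scale by 512 in each direction, mirroring the filemap_lock_hugetlb_folio() and hugetlb_add_to_page_cache() conversions in the diff below:

/* Illustrative index conversion between hugetlb units and base-page units. */
#include <stdio.h>

int main(void)
{
        unsigned int huge_page_order = 9;   /* 2MB huge page, 4KB base page */
        unsigned long hugetlb_idx = 3;      /* third huge page in the file */

        /* hugetlb index -> base-page (linear) index stored in the xarray */
        unsigned long linear_idx = hugetlb_idx << huge_page_order;  /* 1536 */

        /* folio->index (base pages) -> hugetlb index, e.g. for fault mutex hashing */
        unsigned long back = linear_idx >> huge_page_order;         /* 3 */

        printf("linear=%lu hugetlb=%lu\n", linear_idx, back);
        return 0;
}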
========================= PERFORMANCE ======================================
Perf was used to check the performance differences after the patch. Overall the performance is similar to mainline, with a small additional overhead in __filemap_add_folio() and hugetlb_add_to_page_cache(). This comes from the extra work in xa_load() and xa_store(), since the xarray now uses more entries to store hugetlb folios in the page cache.
Timing
aarch64
    2MB Page Size
        6.5-rc3 + this patch:
            [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
            real    1m49.568s
            user    0m0.000s
            sys     1m49.461s

        6.5-rc3:
            [root]# time fallocate -l 700GB test.txt
            real    1m47.495s
            user    0m0.000s
            sys     1m47.370s

    1GB Page Size
        6.5-rc3 + this patch:
            [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
            real    1m47.024s
            user    0m0.000s
            sys     1m46.921s

        6.5-rc3:
            [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
            real    1m44.551s
            user    0m0.000s
            sys     1m44.438s

x86
    2MB Page Size
        6.5-rc3 + this patch:
            [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
            real    0m22.383s
            user    0m0.000s
            sys     0m22.255s

        6.5-rc3:
            [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
            real    0m22.735s
            user    0m0.038s
            sys     0m22.567s

    1GB Page Size
        6.5-rc3 + this patch:
            [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
            real    0m25.786s
            user    0m0.001s
            sys     0m25.589s

        6.5-rc3:
            [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
            real    0m33.454s
            user    0m0.001s
            sys     0m33.193s
aarch64: workload - fallocate a 700GB file backed by huge pages
6.5-rc3 + this patch:
2MB Page Size:
    --100.00%--__arm64_sys_fallocate
               ksys_fallocate
               vfs_fallocate
               hugetlbfs_fallocate
               |
               |--95.04%--__pi_clear_page
               |
               |--3.57%--clear_huge_page
               |          |
               |          |--2.63%--rcu_all_qs
               |          |
               |           --0.91%--__cond_resched
               |
                --0.67%--__cond_resched

    0.17%  0.00%   0  fallocate  [kernel.vmlinux]   [k] hugetlb_add_to_page_cache
    0.14%  0.10%  11  fallocate  [kernel.vmlinux]   [k] __filemap_add_folio

6.5-rc3
2MB Page Size:
    --100.00%--__arm64_sys_fallocate
               ksys_fallocate
               vfs_fallocate
               hugetlbfs_fallocate
               |
               |--94.91%--__pi_clear_page
               |
               |--4.11%--clear_huge_page
               |          |
               |          |--3.00%--rcu_all_qs
               |          |
               |           --1.10%--__cond_resched
               |
                --0.59%--__cond_resched

    0.08%  0.01%   1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
    0.05%  0.03%   3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
x86 workload - fallocate a 100GB file backed by huge pages
6.5-rc3 + this patch:
2MB Page Size:
    hugetlbfs_fallocate
    |
    --99.57%--clear_huge_page
              |
              |--98.47%--clear_page_erms
              |
               --0.53%--asm_sysvec_apic_timer_interrupt

    0.04%  0.04%  1  fallocate  [kernel.kallsyms]  [k] xa_load
    0.04%  0.00%  0  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
    0.04%  0.00%  0  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
    0.04%  0.00%  0  fallocate  [kernel.kallsyms]  [k] xas_store

6.5-rc3
2MB Page Size:
    --99.93%--__x64_sys_fallocate
              vfs_fallocate
              hugetlbfs_fallocate
              |
              --99.38%--clear_huge_page
                        |
                        |--98.40%--clear_page_erms
                        |
                         --0.59%--__cond_resched

    0.03%  0.03%  1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
========================= TESTING ======================================
This patch passes libhugetlbfs tests and LTP hugetlb tests
**********  TEST SUMMARY
*                        2M
*                        32-bit   64-bit
*     Total testcases:      110      113
*             Skipped:        0        0
*                PASS:      107      113
*                FAIL:        0        0
*    Killed by signal:        3        0
*   Bad configuration:        0        0
*       Expected FAIL:        0        0
*     Unexpected PASS:        0        0
*    Test not present:        0        0
* Strange test result:        0        0
**********
Done executing testcases. LTP Version: 20220527-178-g2761a81c4
page migration was also tested using Mike Kravetz's test program.[8]
[dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug] Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mounta... Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar sidhartha.kumar@oracle.com Signed-off-by: Dan Carpenter dan.carpenter@linaro.org Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Muchun Song songmuchun@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- fs/hugetlbfs/inode.c | 37 +++++++++++++++++++------------------ include/linux/hugetlb.h | 12 ++++++++++++ include/linux/pagemap.h | 32 ++------------------------------ mm/filemap.c | 34 ++++++++++------------------------ mm/hugetlb.c | 32 ++++++-------------------------- mm/migrate.c | 6 +++--- 6 files changed, 52 insertions(+), 101 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 5bc54ff64816..42accb940306 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -346,7 +346,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to) ssize_t retval = 0;
while (iov_iter_count(to)) { - struct page *page; + struct folio *folio; size_t nr, copied, want;
/* nr is the maximum number of bytes to copy from this page */ @@ -364,18 +364,18 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to) } nr = nr - offset;
- /* Find the page */ - page = find_lock_page(mapping, index); - if (unlikely(page == NULL)) { + /* Find the folio */ + folio = filemap_lock_hugetlb_folio(h, mapping, index); + if (IS_ERR(folio)) { /* * We have a HOLE, zero out the user-buffer for the * length of the hole or request. */ copied = iov_iter_zero(nr, to); } else { - unlock_page(page); + folio_unlock(folio);
- if (!PageHWPoison(page)) + if (!folio_test_has_hwpoisoned(folio)) want = nr; else { /* @@ -383,19 +383,19 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to) * touching the 1st raw HWPOISON subpage after * offset. */ - want = adjust_range_hwpoison(page, offset, nr); + want = adjust_range_hwpoison(&folio->page, offset, nr); if (want == 0) { - put_page(page); + folio_put(folio); retval = -EIO; break; } }
/* - * We have the page, copy it to user space buffer. + * We have the folio, copy it to user space buffer. */ - copied = copy_page_to_iter(page, offset, want, to); - put_page(page); + copied = copy_folio_to_iter(folio, offset, want, to); + folio_put(folio); } offset += copied; retval += copied; @@ -673,21 +673,20 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, { struct hstate *h = hstate_inode(inode); struct address_space *mapping = &inode->i_data; - const pgoff_t start = lstart >> huge_page_shift(h); - const pgoff_t end = lend >> huge_page_shift(h); + const pgoff_t end = lend >> PAGE_SHIFT; struct folio_batch fbatch; pgoff_t next, index; int i, freed = 0; bool truncate_op = (lend == LLONG_MAX);
folio_batch_init(&fbatch); - next = start; + next = lstart >> PAGE_SHIFT; while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) { for (i = 0; i < folio_batch_count(&fbatch); ++i) { struct folio *folio = fbatch.folios[i]; u32 hash = 0;
- index = folio->index; + index = folio->index >> huge_page_order(h); hash = hugetlb_fault_mutex_hash(mapping, index); mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -705,7 +704,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, }
if (truncate_op) - (void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed); + (void)hugetlb_unreserve_pages(inode, + lstart >> huge_page_shift(h), + LONG_MAX, freed); }
static void hugetlbfs_evict_inode(struct inode *inode) @@ -753,7 +754,7 @@ static void hugetlbfs_zero_partial_page(struct hstate *h, pgoff_t idx = start >> huge_page_shift(h); struct folio *folio;
- folio = filemap_lock_folio(mapping, idx); + folio = filemap_lock_hugetlb_folio(h, mapping, idx); if (IS_ERR(folio)) return;
@@ -898,7 +899,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, mutex_lock(&hugetlb_fault_mutex_table[hash]);
/* See if already present in mapping to avoid alloc/free */ - folio = filemap_get_folio(mapping, index); + folio = filemap_get_folio(mapping, index << huge_page_order(h)); if (!IS_ERR(folio)) { folio_put(folio); mutex_unlock(&hugetlb_fault_mutex_table[hash]); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index f582d53492a8..b9a78a93859a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -901,6 +901,12 @@ static inline unsigned int blocks_per_huge_page(struct hstate *h) return huge_page_size(h) / 512; }
+static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h, + struct address_space *mapping, pgoff_t idx) +{ + return filemap_lock_folio(mapping, idx << huge_page_order(h)); +} + #include <asm/hugetlb.h>
#ifndef is_hugepage_only_range @@ -1097,6 +1103,12 @@ static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio return NULL; }
+static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h, + struct address_space *mapping, pgoff_t idx) +{ + return NULL; +} + static inline int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list) { diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index d418f1c563ee..40403c365814 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -807,9 +807,6 @@ static inline pgoff_t folio_next_index(struct folio *folio) */ static inline struct page *folio_file_page(struct folio *folio, pgoff_t index) { - /* HugeTLBfs indexes the page cache in units of hpage_size */ - if (folio_test_hugetlb(folio)) - return &folio->page; return folio_page(folio, index & (folio_nr_pages(folio) - 1)); }
@@ -825,9 +822,6 @@ static inline struct page *folio_file_page(struct folio *folio, pgoff_t index) */ static inline bool folio_contains(struct folio *folio, pgoff_t index) { - /* HugeTLBfs indexes the page cache in units of hpage_size */ - if (folio_test_hugetlb(folio)) - return folio->index == index; return index - folio_index(folio) < folio_nr_pages(folio); }
@@ -885,10 +879,9 @@ static inline struct folio *read_mapping_folio(struct address_space *mapping, }
/* - * Get index of the page within radix-tree (but not for hugetlb pages). - * (TODO: remove once hugetlb pages will have ->index in PAGE_SIZE) + * Get the offset in PAGE_SIZE (even for hugetlb pages). */ -static inline pgoff_t page_to_index(struct page *page) +static inline pgoff_t page_to_pgoff(struct page *page) { struct page *head;
@@ -903,19 +896,6 @@ static inline pgoff_t page_to_index(struct page *page) return head->index + page - head; }
-extern pgoff_t hugetlb_basepage_index(struct page *page); - -/* - * Get the offset in PAGE_SIZE (even for hugetlb pages). - * (TODO: hugetlb pages should have ->index in PAGE_SIZE) - */ -static inline pgoff_t page_to_pgoff(struct page *page) -{ - if (unlikely(PageHuge(page))) - return hugetlb_basepage_index(page); - return page_to_index(page); -} - /* * Return byte-offset into filesystem object for page. */ @@ -952,24 +932,16 @@ static inline loff_t folio_file_pos(struct folio *folio)
/* * Get the offset in PAGE_SIZE (even for hugetlb folios). - * (TODO: hugetlb folios should have ->index in PAGE_SIZE) */ static inline pgoff_t folio_pgoff(struct folio *folio) { - if (unlikely(folio_test_hugetlb(folio))) - return hugetlb_basepage_index(&folio->page); return folio->index; }
-extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma, - unsigned long address); - static inline pgoff_t linear_page_index(struct vm_area_struct *vma, unsigned long address) { pgoff_t pgoff; - if (unlikely(is_vm_hugetlb_page(vma))) - return linear_hugepage_index(vma, address); pgoff = (address - vma->vm_start) >> PAGE_SHIFT; pgoff += vma->vm_pgoff; return pgoff; diff --git a/mm/filemap.c b/mm/filemap.c index 1022bce6bcd5..96a49748772c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -131,11 +131,8 @@ static void page_cache_delete(struct address_space *mapping,
mapping_set_update(&xas, mapping);
- /* hugetlb pages are represented by a single entry in the xarray */ - if (!folio_test_hugetlb(folio)) { - xas_set_order(&xas, folio->index, folio_order(folio)); - nr = folio_nr_pages(folio); - } + xas_set_order(&xas, folio->index, folio_order(folio)); + nr = folio_nr_pages(folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -235,7 +232,7 @@ void filemap_free_folio(struct address_space *mapping, struct folio *folio) if (free_folio) free_folio(folio);
- if (folio_test_large(folio) && !folio_test_hugetlb(folio)) + if (folio_test_large(folio)) refs = folio_nr_pages(folio); folio_put_refs(folio, refs); } @@ -860,14 +857,15 @@ noinline int __filemap_add_folio(struct address_space *mapping,
if (!huge) { int error = mem_cgroup_charge(folio, NULL, gfp); - VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); if (error) return error; charged = true; - xas_set_order(&xas, index, folio_order(folio)); - nr = folio_nr_pages(folio); }
+ VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); + xas_set_order(&xas, index, folio_order(folio)); + nr = folio_nr_pages(folio); + gfp &= GFP_RECLAIM_MASK; folio_ref_add(folio, nr); folio->mapping = mapping; @@ -2028,7 +2026,7 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start, int idx = folio_batch_count(fbatch) - 1;
folio = fbatch->folios[idx]; - if (!xa_is_value(folio) && !folio_test_hugetlb(folio)) + if (!xa_is_value(folio)) nr = folio_nr_pages(folio); *start = indices[idx] + nr; } @@ -2092,7 +2090,7 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start, int idx = folio_batch_count(fbatch) - 1;
folio = fbatch->folios[idx]; - if (!xa_is_value(folio) && !folio_test_hugetlb(folio)) + if (!xa_is_value(folio)) nr = folio_nr_pages(folio); *start = indices[idx] + nr; } @@ -2133,9 +2131,6 @@ unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start, continue; if (!folio_batch_add(fbatch, folio)) { unsigned long nr = folio_nr_pages(folio); - - if (folio_test_hugetlb(folio)) - nr = 1; *start = folio->index + nr; goto out; } @@ -2201,9 +2196,6 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
if (!folio_batch_add(fbatch, folio)) { nr = folio_nr_pages(folio); - - if (folio_test_hugetlb(folio)) - nr = 1; *start = folio->index + nr; goto out; } @@ -2220,10 +2212,7 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
if (nr) { folio = fbatch->folios[nr - 1]; - if (folio_test_hugetlb(folio)) - *start = folio->index + 1; - else - *start = folio_next_index(folio); + *start = folio->index + folio_nr_pages(folio); } out: rcu_read_unlock(); @@ -2261,9 +2250,6 @@ unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t *start, continue; if (!folio_batch_add(fbatch, folio)) { unsigned long nr = folio_nr_pages(folio); - - if (folio_test_hugetlb(folio)) - nr = 1; *start = folio->index + nr; goto out; } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index fa4aebb1c121..2e6b210ae430 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -994,7 +994,7 @@ static long region_count(struct resv_map *resv, long f, long t)
/* * Convert the address within this vma to the page offset within - * the mapping, in pagecache page units; huge pages here. + * the mapping, huge page units here. */ static pgoff_t vma_hugecache_offset(struct hstate *h, struct vm_area_struct *vma, unsigned long address) @@ -1003,13 +1003,6 @@ static pgoff_t vma_hugecache_offset(struct hstate *h, (vma->vm_pgoff >> huge_page_order(h)); }
-pgoff_t linear_hugepage_index(struct vm_area_struct *vma, - unsigned long address) -{ - return vma_hugecache_offset(hstate_vma(vma), vma, address); -} -EXPORT_SYMBOL_GPL(linear_hugepage_index); - /** * vma_kernel_pagesize - Page size granularity for this VMA. * @vma: The user mapping. @@ -2127,20 +2120,6 @@ struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage) return NULL; }
-pgoff_t hugetlb_basepage_index(struct page *page) -{ - struct page *page_head = compound_head(page); - pgoff_t index = page_index(page_head); - unsigned long compound_idx; - - if (compound_order(page_head) > MAX_ORDER) - compound_idx = page_to_pfn(page) - page_to_pfn(page_head); - else - compound_idx = page - page_head; - - return (index << compound_order(page_head)) + compound_idx; -} - static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h, gfp_t gfp_mask, int nid, nodemask_t *nmask, nodemask_t *node_alloc_noretry) @@ -5974,7 +5953,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h, struct vm_area_struct *vma, unsigned long address) { struct address_space *mapping = vma->vm_file->f_mapping; - pgoff_t idx = vma_hugecache_offset(h, vma, address); + pgoff_t idx = linear_page_index(vma, address); struct folio *folio;
folio = filemap_get_folio(mapping, idx); @@ -5991,6 +5970,7 @@ int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping struct hstate *h = hstate_inode(inode); int err;
+ idx <<= huge_page_order(h); __folio_set_locked(folio); err = __filemap_add_folio(mapping, folio, idx, GFP_KERNEL, NULL);
@@ -6098,7 +6078,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, * before we get page_table_lock. */ new_folio = false; - folio = filemap_lock_folio(mapping, idx); + folio = filemap_lock_hugetlb_folio(h, mapping, idx); if (IS_ERR(folio)) { size = i_size_read(mapping->host) >> huge_page_shift(h); if (idx >= size) @@ -6407,7 +6387,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, /* Just decrements count, does not deallocate */ vma_end_reservation(h, vma, haddr);
- pagecache_folio = filemap_lock_folio(mapping, idx); + pagecache_folio = filemap_lock_hugetlb_folio(h, mapping, idx); if (IS_ERR(pagecache_folio)) pagecache_folio = NULL; } @@ -6540,7 +6520,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
if (is_continue) { ret = -EFAULT; - folio = filemap_lock_folio(mapping, idx); + folio = filemap_lock_hugetlb_folio(h, mapping, idx); if (IS_ERR(folio)) goto out; folio_in_pagecache = true; diff --git a/mm/migrate.c b/mm/migrate.c index a861eca9fd47..bdcf75f2b320 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -537,7 +537,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, int expected_count;
xas_lock_irq(&xas); - expected_count = 2 + folio_has_private(src); + expected_count = folio_expected_refs(mapping, src); if (!folio_ref_freeze(src, expected_count)) { xas_unlock_irq(&xas); return -EAGAIN; @@ -546,11 +546,11 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, dst->index = src->index; dst->mapping = src->mapping;
- folio_get(dst); + folio_ref_add(dst, folio_nr_pages(dst));
xas_store(&xas, dst);
- folio_ref_unfreeze(src, expected_count - 1); + folio_ref_unfreeze(src, expected_count - folio_nr_pages(src));
xas_unlock_irq(&xas);
From: Sidhartha Kumar sidhartha.kumar@oracle.com
mainline inclusion
from mainline-v6.7-rc5
commit 4a3ef6be03e6700037fc20e63aa5ffd972e435ca
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
After commit a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c"), hugetlb pages are stored in the page cache at base-page-sized indexes. This leads to multi-index stores in the xarray, which are only supported with CONFIG_XARRAY_MULTI. The other page cache user of multi-index stores, THP, selects XARRAY_MULTI. Have CONFIG_HUGETLB_PAGE follow this behavior as well to avoid the BUG() with a CONFIG_HUGETLB_PAGE && !CONFIG_XARRAY_MULTI config.
Link: https://lkml.kernel.org/r/20231204183234.348697-1-sidhartha.kumar@oracle.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Sidhartha Kumar sidhartha.kumar@oracle.com Reported-by: Al Viro viro@zeniv.linux.org.uk Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Muchun Song muchun.song@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 4a3ef6be03e6700037fc20e63aa5ffd972e435ca) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- fs/Kconfig | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/Kconfig b/fs/Kconfig index 7b0a2226b714..cfd412d770a9 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -280,6 +280,7 @@ config HUGETLBFS
config HUGETLB_PAGE def_bool HUGETLBFS + select XARRAY_MULTI
config HUGETLB_PAGE_OPTIMIZE_VMEMMAP def_bool HUGETLB_PAGE
From: Sidhartha Kumar sidhartha.kumar@oracle.com
mainline inclusion
from mainline-v6.8-rc3
commit 19d3e221807772f8443e565234a6fdc5a2b09d26
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
has_extra_refcount() assumes that the page cache adds a refcount of 1 and subtracts this in the extra_pins case. Commit a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") modifies __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases (including hugetlb), where nr is the number of pages in the folio. We should adjust the number of references coming from the page cache by subtracting the number of pages rather than 1.
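A worked example of that arithmetic (illustrative numbers only, not code from the patch): for a poisoned 2MB hugetlb folio on a 4KB base-page kernel, the page cache now holds folio_nr_pages() = 512 references, so the extra_pins adjustment must subtract 512 rather than 1 or the folio is falsely reported as still referenced:

/* Sketch of the expected-refcount math for a 2MB hugetlb folio in the page cache. */
#include <stdio.h>

int main(void)
{
        int folio_nr_pages = 512;               /* 2MB folio, 4KB base pages */
        int page_count = 1 + folio_nr_pages;    /* one base ref + page cache refs */
        int count = page_count - 1;             /* has_extra_refcount() baseline */

        int old_leftover = count - 1;               /* old: assumes 1 cache ref -> 511 */
        int new_leftover = count - folio_nr_pages;  /* fixed: subtract nr pages -> 0 */

        /* old_leftover > 0 triggers a spurious "still referenced" error;
         * new_leftover == 0 lets hwpoison handling proceed. */
        printf("old=%d new=%d\n", old_leftover, new_leftover);
        return 0;
}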
In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong flag as, in the hugetlb case, memory-failure code calls folio_test_set_hwpoison() to indicate poison. folio_test_hwpoison() is the correct function to test for that flag.
After these fixes, the hugetlb hwpoison read selftest passes all cases.
Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Sidhartha Kumar sidhartha.kumar@oracle.com Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.co... Reported-by: Muhammad Usama Anjum usama.anjum@collabora.com Tested-by: Muhammad Usama Anjum usama.anjum@collabora.com Acked-by: Miaohe Lin linmiaohe@huawei.com Acked-by: Muchun Song muchun.song@linux.dev Cc: James Houghton jthoughton@google.com Cc: Jiaqi Yan jiaqiyan@google.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Naoya Horiguchi naoya.horiguchi@nec.com Cc: stable@vger.kernel.org [6.7+] Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 19d3e221807772f8443e565234a6fdc5a2b09d26) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- fs/hugetlbfs/inode.c | 2 +- mm/memory-failure.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 42accb940306..c4f3c5d631f8 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -375,7 +375,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to) } else { folio_unlock(folio);
- if (!folio_test_has_hwpoisoned(folio)) + if (!folio_test_hwpoison(folio)) want = nr; else { /* diff --git a/mm/memory-failure.c b/mm/memory-failure.c index c9372bae3524..50ef4ff84a74 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -978,7 +978,7 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p, int count = page_count(p) - 1;
if (extra_pins) - count -= 1; + count -= folio_nr_pages(page_folio(p));
if (count > 0) { pr_err("%#lx: %s still referenced by %d users\n",
From: Rik van Riel riel@surriel.com
mainline inclusion
from mainline-v6.7
commit efa7df3e3bb5da8e6abbe37727417f32a37fba47
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
Align larger anonymous memory mappings on THP boundaries by going through thp_get_unmapped_area if THPs are enabled for the current process.
With this patch, larger anonymous mappings are now THP aligned. When a malloc library allocates a 2MB or larger arena, that arena can now be mapped with THPs right from the start, which can result in better TLB hit rates and execution time.
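A rough way to observe the effect from user space (a sketch, not part of the patch; the 16MB length is an arbitrary example): with THP enabled for the process, a large anonymous mmap() is now expected to come back aligned to the 2MB PMD size on x86_64/arm64 with 4KB base pages:

/* Sketch: check whether a large anonymous mapping is PMD-aligned. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 16UL << 20;        /* arbitrary example size */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        printf("addr %p aligned to 2MB: %s\n", p,
               ((unsigned long)p & ((2UL << 20) - 1)) ? "no" : "yes");
        munmap(p, len);
        return 0;
}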
Link: https://lkml.kernel.org/r/20220809142457.4751229f@imladris.surriel.com Link: https://lkml.kernel.org/r/20231214223423.1133074-1-yang@os.amperecomputing.c... Signed-off-by: Rik van Riel riel@surriel.com Reviewed-by: Yang Shi shy828301@gmail.com Cc: Matthew Wilcox willy@infradead.org Cc: Christopher Lameter cl@linux.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit efa7df3e3bb5da8e6abbe37727417f32a37fba47) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/mmap.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c index 8b5295653475..6cd1ccb79193 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1850,6 +1850,9 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, */ pgoff = 0; get_area = shmem_get_unmapped_area; + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { + /* Ensures that larger anonymous mappings are THP aligned. */ + get_area = thp_get_unmapped_area; }
addr = get_area(file, addr, len, pgoff, flags);
From: Yang Shi yang@os.amperecomputing.com
mainline inclusion
from mainline-v6.8-rc3
commit c4608d1bf7c6536d1a3d233eb21e50678681564e
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") incurred a regression for the stress-ng pthread benchmark [1]. This is because THP gets allocated for pthread's stack area much more often than before. Pthread's stack area is allocated by mmap without the VM_GROWSDOWN or VM_GROWSUP flag, so the kernel can't tell whether it is a stack area or not.
The MAP_STACK flag is used to mark the stack area, but it is a no-op on Linux. Map MAP_STACK to VM_NOHUGEPAGE to prevent THP from being allocated for such stack areas.
With this change the stack area looks like:
fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
Size:               8192 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  12 kB
Pss:                  12 kB
Pss_Dirty:            12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Anonymous:            12 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me ac nh
The "nh" flag is set.
[1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com...
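For reference, a minimal sketch (not from the patch) of how such a stack mapping is created; pthread libraries do essentially this, and with the change the resulting VMA carries VM_NOHUGEPAGE (the "nh" flag above):

/* Sketch: an anonymous MAP_STACK mapping like the one created for pthread
 * stacks. The 8MB size matches the smaps dump above and is illustrative. */
#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
        size_t len = 8UL << 20;
        void *stack = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

        return stack == MAP_FAILED;
}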
Link: https://lkml.kernel.org/r/20231221065943.2803551-2-shy828301@gmail.com Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Yang Shi yang@os.amperecomputing.com Reported-by: kernel test robot oliver.sang@intel.com Tested-by: Oliver Sang oliver.sang@intel.com Reviewed-by: Yin Fengwei fengwei.yin@intel.com Cc: Rik van Riel riel@surriel.com Cc: Matthew Wilcox willy@infradead.org Cc: Christopher Lameter cl@linux.com Cc: Huang, Ying ying.huang@intel.com Cc: stable@vger.kerenl.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit c4608d1bf7c6536d1a3d233eb21e50678681564e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/mman.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/include/linux/mman.h b/include/linux/mman.h index 40d94411d492..dc7048824be8 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags) return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) | _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) | _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) | + _calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) | arch_calc_vm_flag_bits(flags); }
From: Yang Shi yang@os.amperecomputing.com
mainline inclusion
from mainline-v6.8-rc3
commit 4ef9ad19e17676b9ef071309bc62020e2373705d
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") caused two issues [1] [2] reported on 32-bit systems and compat userspace.
It doesn't make much sense to force huge page alignment on a 32-bit system due to the constrained virtual address space.
[1] https://lore.kernel.org/linux-mm/d0a136a0-4a31-46bc-adf4-2db109a61672@kernel... [2] https://lore.kernel.org/linux-mm/CAJuCfpHXLdQy1a2B6xN2d7quTYwg2OoZseYPZTRpU0...
Link: https://lkml.kernel.org/r/20240118180505.2914778-1-shy828301@gmail.com Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Yang Shi yang@os.amperecomputing.com Reported-by: Jiri Slaby jirislaby@kernel.org Reported-by: Suren Baghdasaryan surenb@google.com Tested-by: Jiri Slaby jirislaby@kernel.org Tested-by: Suren Baghdasaryan surenb@google.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: Rik van Riel riel@surriel.com Cc: Christopher Lameter cl@linux.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 4ef9ad19e17676b9ef071309bc62020e2373705d) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/huge_memory.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4ce2ed00eaa2..3c4189b1a97e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -37,6 +37,7 @@ #include <linux/page_owner.h> #include <linux/sched/sysctl.h> #include <linux/memory-tiers.h> +#include <linux/compat.h>
#include <asm/tlb.h> #include <asm/pgalloc.h> @@ -784,6 +785,9 @@ static unsigned long __thp_get_unmapped_area(struct file *filp, loff_t off_align = round_up(off, size); unsigned long len_pad, ret;
+ if (IS_ENABLED(CONFIG_32BIT) || in_compat_syscall()) + return 0; + if (off_end <= off_align || (off_end - off_align) < size) return 0;
From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion
from mainline-v6.8-rc3
commit 96204e15310c218fd9355bdcacd02fed1d18070e
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
The addition of commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") caused the "virtual_address_range" mm selftest to start failing on arm64. Let's fix that regression.
There were 2 visible problems when running the test; 1) it takes much longer to execute, and 2) the test fails. Both are related:
The (first part of the) test allocates as many 1GB anonymous blocks as it can in the low 256TB of address space, passing NULL as the addr hint to mmap. Before the faulty patch, all allocations were abutted and contained in a single, merged VMA. However, after this patch, each allocation is in its own VMA, and there is a 2M gap between each VMA. This causes the 2 problems in the test: 1) mmap becomes MUCH slower because there are so many VMAs to check to find a new 1G gap. 2) mmap fails once it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then causes a subsequent calloc() to fail, which causes the test to fail.
The problem is that arm64 (unlike x86) selects ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates len+2M then always aligns to the bottom of the discovered gap. That causes the 2M hole.
Fix this by detecting cases where we can still achieve the alignment goal when moved to the top of the allocated area, if configured to prefer top-down allocation.
While we are at it, fix thp_get_unmapped_area's use of pgoff, which should always be zero for anonymous mappings. Prior to the faulty change, while it was possible for user space to pass in pgoff!=0, the old mm->get_unmapped_area() handler would not use it. thp_get_unmapped_area() does use it, so let's explicitly zero it before calling the handler. This should also be the correct behavior for arches that define their own get_unmapped_area() handler.
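A worked example of the alignment arithmetic (illustrative addresses only, not code from the patch): with size = 2MB and pgoff forced to 0 for an anonymous mapping, the padded search asks for len + 2MB; if the top-down allocator returns a gap whose bottom is already 2MB-aligned (off_sub == 0), returning ret + size keeps the mapping at the top of the gap instead of leaving a 2MB hole below the previous VMA:

/* Sketch of the __thp_get_unmapped_area() adjustment, topdown case. */
#include <stdio.h>

int main(void)
{
        unsigned long size = 2UL << 20;         /* PMD size */
        unsigned long off = 0;                  /* anonymous: pgoff is zero */
        unsigned long ret = 0x7f0000000000UL;   /* bottom of found gap, 2MB aligned */
        unsigned long off_sub = (off - ret) & (size - 1);       /* 0 here */

        unsigned long old_addr = ret + off_sub;  /* old: bottom of gap -> 2MB hole above */
        unsigned long new_addr = off_sub ? ret + off_sub : ret + size;  /* topdown fix */

        printf("old %#lx new %#lx\n", old_addr, new_addr);
        return 0;
}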
Link: https://lkml.kernel.org/r/20240123171420.3970220-1-ryan.roberts@arm.com Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Closes: https://lore.kernel.org/linux-mm/1e8f5ac7-54ce-433a-ae53-81522b2320e1@arm.co... Signed-off-by: Ryan Roberts ryan.roberts@arm.com Reviewed-by: Yang Shi shy828301@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Rik van Riel riel@surriel.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 96204e15310c218fd9355bdcacd02fed1d18070e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/huge_memory.c | 10 ++++++++-- mm/mmap.c | 6 ++++-- 2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 3c4189b1a97e..87edef93df10 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -783,7 +783,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp, { loff_t off_end = off + len; loff_t off_align = round_up(off, size); - unsigned long len_pad, ret; + unsigned long len_pad, ret, off_sub;
if (IS_ENABLED(CONFIG_32BIT) || in_compat_syscall()) return 0; @@ -812,7 +812,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp, if (ret == addr) return addr;
- ret += (off - ret) & (size - 1); + off_sub = (off - ret) & (size - 1); + + if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown && + !off_sub) + return ret + size; + + ret += off_sub; return ret; }
diff --git a/mm/mmap.c b/mm/mmap.c index 6cd1ccb79193..b8c5486dfb25 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1846,15 +1846,17 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, /* * mmap_region() will call shmem_zero_setup() to create a file, * so use shmem's get_unmapped_area in case it can be huge. - * do_mmap() will clear pgoff, so match alignment. */ - pgoff = 0; get_area = shmem_get_unmapped_area; } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { /* Ensures that larger anonymous mappings are THP aligned. */ get_area = thp_get_unmapped_area; }
+ /* Always treat pgoff as zero for anonymous memory. */ + if (!file) + pgoff = 0; + addr = get_area(file, addr, len, pgoff, flags); if (IS_ERR_VALUE(addr)) return addr;
From: Fangrui Song maskray@google.com
mainline inclusion
from mainline-v6.8-rc1
commit 7fbb5e188248c50f737720825da1864ce42536d1
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
Commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()") introduced the VM_EXEC requirement, which is not strictly needed.
lld's default --rosegment option and GNU ld's -z separate-code option (default on Linux/x86 since binutils 2.31) create a read-only PT_LOAD segment without the PF_X flag, which should be eligible for THP.
Certain architectures support medium and large code models, where .lrodata may be placed in a separate read-only PT_LOAD segment, which should be eligible for THP as well.
Link: https://lkml.kernel.org/r/20231220054123.1266001-1-maskray@google.com Signed-off-by: Fangrui Song maskray@google.com Acked-by: Yang Shi shy828301@gmail.com Cc: Miaohe Lin linmiaohe@huawei.com Cc: Song Liu songliubraving@fb.com Cc: Matthew Wilcox willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 7fbb5e188248c50f737720825da1864ce42536d1) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- include/linux/huge_mm.h | 1 - 1 file changed, 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index fa7a38a30fc6..5adb86af35fc 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -206,7 +206,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) inode = vma->vm_file->f_inode;
return (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) && - (vma->vm_flags & VM_EXEC) && !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); }
From: Yang Shi yang@os.amperecomputing.com
mainline inclusion
from mainline-v6.9-rc1
commit 05976a42b327d4f5a529a5e55cb8bfc2fa0bcca1
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
We avoid allocating THP for the temporary stack; even though khugepaged_enter_vma() is called for stack VMAs, it actually returns false. So there is no need to call it in the first place at all.
Link: https://lkml.kernel.org/r/20231221065943.2803551-1-shy828301@gmail.com Signed-off-by: Yang Shi yang@os.amperecomputing.com Reviewed-by: Yin Fengwei fengwei.yin@intel.com Cc: Christopher Lameter cl@linux.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Rik van Riel riel@surriel.com Cc: kernel test robot oliver.sang@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 05976a42b327d4f5a529a5e55cb8bfc2fa0bcca1) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/mmap.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c index b8c5486dfb25..6cd5d3f607e5 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2069,7 +2069,6 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address) } } anon_vma_unlock_write(vma->anon_vma); - khugepaged_enter_vma(vma, vma->vm_flags); mas_destroy(&mas); validate_mm(mm); return error; @@ -2163,7 +2162,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address) } } anon_vma_unlock_write(vma->anon_vma); - khugepaged_enter_vma(vma, vma->vm_flags); mas_destroy(&mas); validate_mm(mm); return error;
From: Lance Yang ioworker0@gmail.com
mainline inclusion
from mainline-v6.9-rc1
commit 879c6000e191b61b97e17bce44c4564ee42eb612
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
khugepaged scans the entire address space in the background for each given mm, looking for opportunities to merge sequences of basic pages into huge pages. However, when an mm is inserted into the mm_slots list and the MMF_DISABLE_THP flag is set later, this scanning becomes unnecessary for that mm and can be skipped to avoid redundant operations, especially in scenarios with a large address space.
On an Intel Core i5 CPU, the time taken by khugepaged to scan the address space of the process, which has been set with the MMF_DISABLE_THP flag after being added to the mm_slots list, is as follows (shorter is better):
VMA Count |   Old   |   New   |  Change
-----------------------------------------
    50    |  23us   |   9us   |  -60.9%
   100    |  32us   |   9us   |  -71.9%
   200    |  44us   |   9us   |  -79.5%
   400    |  75us   |   9us   |  -88.0%
   800    |  98us   |   9us   |  -90.8%
Once the count of VMAs for the process exceeds pages_to_scan, khugepaged needs to wait for scan_sleep_millisecs ms before scanning the next process. IMO, unnecessary scans could actually be skipped with a very inexpensive mm->flags check in this case.
This commit introduces a check before each scanning process to test the MMF_DISABLE_THP flag for the given mm; if the flag is set, the scanning process is bypassed, thereby improving the efficiency of khugepaged.
This optimization is not a correctness issue but rather an enhancement to save expensive checks on each VMA when userspace cannot prctl itself before spawning into the new process.
On some servers within our company, we deploy a daemon responsible for monitoring and updating local applications. Some applications prefer not to use THP, so the daemon calls prctl to disable THP before fork/exec. Conversely, for other applications, the daemon calls prctl to enable THP before fork/exec.
Ideally, the daemon should invoke prctl after the fork, but its current implementation follows the described approach. In the Go standard library, there is no direct encapsulation of the fork system call; instead, fork and execve are combined into one through syscall.ForkExec.
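For context, a minimal sketch (not from the patch) of the prctl call such a daemon issues before fork/exec; this is what causes MMF_DISABLE_THP to be set on the resulting mm:

/* Sketch: disable THP for the current process via prctl before fork/exec,
 * as the daemon described above does. */
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif

int main(void)
{
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                perror("prctl");
        /* fork()/exec() of the THP-averse application would follow here;
         * the child inherits the THP-disabled state. */
        return 0;
}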
Link: https://lkml.kernel.org/r/20240129054551.57728-1-ioworker0@gmail.com Signed-off-by: Lance Yang ioworker0@gmail.com Acked-by: David Hildenbrand david@redhat.com Cc: Michal Hocko mhocko@suse.com Cc: Minchan Kim minchan@kernel.org Cc: Muchun Song songmuchun@bytedance.com Cc: Peter Xu peterx@redhat.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 879c6000e191b61b97e17bce44c4564ee42eb612) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/khugepaged.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 0b8636537383..6dcf399fa65c 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -416,6 +416,12 @@ static inline int hpage_collapse_test_exit(struct mm_struct *mm) return atomic_read(&mm->mm_users) == 0; }
+static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm) +{ + return hpage_collapse_test_exit(mm) || + test_bit(MMF_DISABLE_THP, &mm->flags); +} + void __khugepaged_enter(struct mm_struct *mm) { struct khugepaged_mm_slot *mm_slot; @@ -1439,7 +1445,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
lockdep_assert_held(&khugepaged_mm_lock);
- if (hpage_collapse_test_exit(mm)) { + if (hpage_collapse_test_exit_or_disable(mm)) { /* free mm_slot */ hash_del(&slot->hash); list_del(&slot->mm_node); @@ -2387,7 +2393,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, goto breakouterloop_mmap_lock;
progress++; - if (unlikely(hpage_collapse_test_exit(mm))) + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) goto breakouterloop;
vma_iter_init(&vmi, mm, khugepaged_scan.address); @@ -2395,7 +2401,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, unsigned long hstart, hend;
cond_resched(); - if (unlikely(hpage_collapse_test_exit(mm))) { + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) { progress++; break; } @@ -2417,7 +2423,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, bool mmap_locked = true;
cond_resched(); - if (unlikely(hpage_collapse_test_exit(mm))) + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) goto breakouterloop;
VM_BUG_ON(khugepaged_scan.address < hstart || @@ -2435,7 +2441,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, fput(file); if (*result == SCAN_PTE_MAPPED_HUGEPAGE) { mmap_read_lock(mm); - if (hpage_collapse_test_exit(mm)) + if (hpage_collapse_test_exit_or_disable(mm)) goto breakouterloop; *result = collapse_pte_mapped_thp(mm, khugepaged_scan.address, false); @@ -2477,7 +2483,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, * Release the current mm_slot if this mm is about to die, or * if we scanned all vmas of this mm. */ - if (hpage_collapse_test_exit(mm) || !vma) { + if (hpage_collapse_test_exit_or_disable(mm) || !vma) { /* * Make sure that if mm_users is reaching zero while * khugepaged runs here, khugepaged_exit will find
From: Lance Yang ioworker0@gmail.com
mainline inclusion
from mainline-v6.9-rc1
commit 5dad604809c5acc546ec74057498db1623f1c408
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
Previously, we removed the mm from mm_slot and dropped mm_count if the MMF_DISABLE_THP flag was set. However, we didn't re-add the mm back after clearing the MMF_DISABLE_THP flag. Additionally, we add a check for the MMF_DISABLE_THP flag in hugepage_vma_revalidate().
Link: https://lkml.kernel.org/r/20240227035135.54593-1-ioworker0@gmail.com Fixes: 879c6000e191 ("mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check") Signed-off-by: Lance Yang ioworker0@gmail.com Suggested-by: Yang Shi shy828301@gmail.com Reviewed-by: Yang Shi shy828301@gmail.com Reviewed-by: David Hildenbrand david@redhat.com Cc: Michal Hocko mhocko@suse.com Cc: Minchan Kim minchan@kernel.org Cc: Muchun Song songmuchun@bytedance.com Cc: Peter Xu peterx@redhat.com Cc: Zach O'Keefe zokeefe@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 5dad604809c5acc546ec74057498db1623f1c408) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- mm/khugepaged.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 6dcf399fa65c..9a9f1ebe6dd7 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -928,7 +928,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, { struct vm_area_struct *vma;
- if (unlikely(hpage_collapse_test_exit(mm))) + if (unlikely(hpage_collapse_test_exit_or_disable(mm))) return SCAN_ANY_PROCESS;
*vmap = vma = find_vma(mm, address); @@ -1445,7 +1445,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
lockdep_assert_held(&khugepaged_mm_lock);
- if (hpage_collapse_test_exit_or_disable(mm)) { + if (hpage_collapse_test_exit(mm)) { /* free mm_slot */ hash_del(&slot->hash); list_del(&slot->mm_node); @@ -2483,7 +2483,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, * Release the current mm_slot if this mm is about to die, or * if we scanned all vmas of this mm. */ - if (hpage_collapse_test_exit_or_disable(mm) || !vma) { + if (hpage_collapse_test_exit(mm) || !vma) { /* * Make sure that if mm_users is reaching zero while * khugepaged runs here, khugepaged_exit will find
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9H84X
CVE: NA
-------------------------------------------------
Let do_sync_mmap_readahead() try to enable THP for exec mappings, which could reduce iTLB misses for large applications with a big text segment.
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 7 ++++++ include/linux/huge_mm.h | 1 + mm/filemap.c | 27 ++++++++++++++++++++ mm/huge_memory.c | 29 ++++++++++++++++++++++ 4 files changed, 64 insertions(+)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 04eb45a2f940..936da10c5260 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -202,6 +202,13 @@ PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
+The kernel tries to use huge, PMD-mappable page on read page fault for +file exec mapping if CONFIG_READ_ONLY_THP_FOR_FS enabled. It's possible +to enabled the feature by writing 1 or disablt by writing 0:: + + echo 0x0 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled + echo 0x1 >/sys/kernel/mm/transparent_hugepage/thp_exec_enabled + khugepaged will be automatically started when one or more hugepage sizes are enabled (either by directly setting "always" or "madvise", or by setting "inherit" while the top-level enabled is set to "always" diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 5adb86af35fc..885888e38e26 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -50,6 +50,7 @@ enum transparent_hugepage_flag { TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG, TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG, + TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG, };
struct kobject; diff --git a/mm/filemap.c b/mm/filemap.c index 96a49748772c..e8a7bd88bf90 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -45,6 +45,7 @@ #include <linux/migrate.h> #include <linux/pipe_fs_i.h> #include <linux/splice.h> +#include <linux/huge_mm.h> #include <asm/pgalloc.h> #include <asm/tlbflush.h> #include "internal.h" @@ -3113,6 +3114,29 @@ static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio, return 1; }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE +#define file_exec_thp_enabled() \ + (transparent_hugepage_flags & \ + (1<<TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG)) + +static inline void try_enable_file_exec_thp(struct vm_area_struct *vma, + unsigned long *vm_flags, + struct file *file) +{ + if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) + return; + + if (!is_exec_mapping(*vm_flags)) + return; + + if (file->f_op->get_unmapped_area != thp_get_unmapped_area) + return; + + if (file_exec_thp_enabled()) + hugepage_madvise(vma, vm_flags, MADV_HUGEPAGE); +} +#endif + /* * Synchronous readahead happens when we don't even find a page in the page * cache at all. We don't want to perform IO under the mmap sem, so if we have @@ -3131,6 +3155,9 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) unsigned int mmap_miss;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE + /* Try enable thp for exec mapping by default */ + try_enable_file_exec_thp(vmf->vma, &vm_flags, file); + /* Use the readahead code, even if readahead is disabled */ if (vm_flags & VM_HUGEPAGE) { fpin = maybe_unlock_mmap_for_io(vmf, fpin); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 87edef93df10..6c277b55544c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -425,6 +425,32 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size);
+#ifdef CONFIG_READ_ONLY_THP_FOR_FS +static ssize_t thp_exec_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return single_hugepage_flag_show(kobj, attr, buf, + TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG); +} +static ssize_t thp_exec_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + size_t ret = single_hugepage_flag_store(kobj, attr, buf, count, + TRANSPARENT_HUGEPAGE_FILE_EXEC_THP_FLAG); + if (ret > 0) { + int err = start_stop_khugepaged(); + + if (err) + ret = err; + } + + return ret; +} +static struct kobj_attribute thp_exec_enabled_attr = + __ATTR_RW(thp_exec_enabled); + +#endif + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, @@ -432,6 +458,9 @@ static struct attribute *hugepage_attr[] = { &hpage_pmd_size_attr.attr, #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, +#endif +#ifdef CONFIG_READ_ONLY_THP_FOR_FS + &thp_exec_enabled_attr.attr, #endif NULL, };
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/6190 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/W...