Baolin Wang (1): mm: page_alloc: allowing mTHP compaction to capture the freed page directly
Gao Xiang (1): mm/migrate: fix deadlock in migrate_pages_batch() on large folios
Gavin Shan (2): mm/huge_memory: avoid PMD-size page cache if needed mm/readahead: limit page cache size in page_cache_ra_order()
Hugh Dickins (2): mm: fix crashes from deferred split racing folio migration mm: simplify folio_migrate_mapping()
Kefeng Wang (1): mm: refactor folio_undo_large_rmappable()
Liu Shixin (1): mm/huge_memory: fix comment errors of thp_mapping_align
Matthew Wilcox (Oracle) (1): filemap: Convert generic_perform_write() to support large folios
Pankaj Raghav (1): readahead: use ilog2 instead of a while loop in page_cache_ra_order()
Peter Xu (1): mm/migrate: putback split folios when numa hint migration fails
Ran Xiaokai (1): mm/huge_memory: mark racy access onhuge_anon_orders_always
Ryan Roberts (1): mm: fix khugepaged activation policy
Yajun Deng (1): mm/mmap: simplify vma link and unlink
Zi Yan (1): mm/migrate: make migrate_pages_batch() stats consistent
linke li (3): mm/swapfile: mark racy access on si->highest_bit mm/slub: mark racy access on slab->freelist mm/slub: mark racy accesses on slab->slabs
Documentation/admin-guide/mm/transhuge.rst | 15 ++++---- include/linux/huge_mm.h | 24 +++++------- mm/filemap.c | 40 ++++++++++++-------- mm/huge_memory.c | 32 +++++++++------- mm/internal.h | 17 ++++++++- mm/khugepaged.c | 33 ++++++++++++---- mm/memcontrol.c | 11 ------ mm/migrate.c | 42 +++++++++++++++------ mm/mmap.c | 44 ++++++++++------------ mm/page_alloc.c | 9 +++-- mm/readahead.c | 10 ++--- mm/slub.c | 8 ++-- mm/swap.c | 8 +--- mm/swapfile.c | 2 +- mm/vmscan.c | 8 +--- 15 files changed, 168 insertions(+), 135 deletions(-)
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.10-rc6 commit c6408250703530187cc6250dcd702d12a71c44f5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
As Ying pointed out in [1], stats->nr_thp_failed needs to be updated to avoid stats inconsistency between MIGRATE_SYNC and MIGRATE_ASYNC when calling migrate_pages_batch().
Because if not, when migrate_pages_batch() is called via migrate_pages(MIGRATE_ASYNC), nr_thp_failed will not be increased and when migrate_pages_batch() is called via migrate_pages(MIGRATE_SYNC*), nr_thp_failed will be increase in migrate_pages_sync() by stats->nr_thp_failed += astats.nr_thp_split.
[1] https://lore.kernel.org/linux-mm/87msnq7key.fsf@yhuang6-desk2.ccr.corp.intel...
Link: https://lkml.kernel.org/r/20240620012712.19804-1-zi.yan@sent.com Link: https://lkml.kernel.org/r/20240618134151.29214-1-zi.yan@sent.com Fixes: 7262f208ca68 ("mm/migrate: split source folio if it is on deferred split list") Signed-off-by: Zi Yan ziy@nvidia.com Suggested-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Yang Shi shy828301@gmail.com Cc: Yin Fengwei fengwei.yin@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/migrate.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/mm/migrate.c b/mm/migrate.c index b5d9d8feacfa..d7d789c7a939 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1660,6 +1660,10 @@ static int migrate_pages_batch(struct list_head *from, * migrate_pages() may report success with (split but * unmigrated) pages still on its fromlist; whereas it * always reports success when its fromlist is empty. + * stats->nr_thp_failed should be increased too, + * otherwise stats inconsistency will happen when + * migrate_pages_batch is called via migrate_pages() + * with MIGRATE_SYNC and MIGRATE_ASYNC. * * Only check it without removing it from the list. * Since the folio can be on deferred_split_scan() @@ -1676,6 +1680,7 @@ static int migrate_pages_batch(struct list_head *from, !list_empty(&folio->_deferred_list)) { if (try_split_folio(folio, split_folios) == 0) { nr_failed++; + stats->nr_thp_failed += is_thp; stats->nr_thp_split += is_thp; stats->nr_split++; continue;
From: Peter Xu peterx@redhat.com
mainline inclusion from mainline-v6.11-rc1 commit 6e49019db5f7a09a9c0e8ac4d108e656c3f8e583 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This issue is not from any report yet, but by code observation only.
This is yet another fix besides Hugh's patch [1] but on relevant code path, where eager split of folio can happen if the folio is already on deferred list during a folio migration.
Here the issue is NUMA path (migrate_misplaced_folio()) may start to encounter such folio split now even with MR_NUMA_MISPLACED hint applied. Then when migrate_pages() didn't migrate all the folios, it's possible the split small folios be put onto the list instead of the original folio. Then putting back only the head page won't be enough.
Fix it by putting back all the folios on the list.
[1] https://lore.kernel.org/all/46c948b4-4dd8-6e03-4c7b-ce4e81cfa536@google.com/
[akpm@linux-foundation.org: remove now unused local `nr_pages'] Link: https://lkml.kernel.org/r/20240708215537.2630610-1-peterx@redhat.com Fixes: 7262f208ca68 ("mm/migrate: split source folio if it is on deferred split list") Signed-off-by: Peter Xu peterx@redhat.com Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: Yang Shi shy828301@gmail.com Cc: Hugh Dickins hughd@google.com Cc: Huang Ying ying.huang@intel.com Cc: David Hildenbrand david@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/migrate.c [ Context conflict. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/migrate.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c index d7d789c7a939..01e04e47699c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2605,7 +2605,6 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, int nr_remaining; unsigned int nr_succeeded; LIST_HEAD(migratepages); - int nr_pages = folio_nr_pages(folio);
/* * Don't migrate file folios that are mapped in multiple processes @@ -2634,12 +2633,8 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma, NULL, node, MIGRATE_ASYNC, MR_NUMA_MISPLACED, &nr_succeeded); if (nr_remaining) { - if (!list_empty(&migratepages)) { - list_del(&folio->lru); - node_stat_mod_folio(folio, NR_ISOLATED_ANON + - folio_is_file_lru(folio), -nr_pages); - folio_putback_lru(folio); - } + if (!list_empty(&migratepages)) + putback_movable_pages(&migratepages); isolated = 0; } if (nr_succeeded) {
From: Ran Xiaokai ran.xiaokai@zte.com.cn
mainline inclusion from mainline-v6.11-rc1 commit 7f83bf14603ef41a44dc907594d749a283e22c37 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
huge_anon_orders_always is accessed lockless, it is better to use the READ_ONCE() wrapper. This is not fixing any visible bug, hopefully this can cease some KCSAN complains in the future. Also do that for huge_anon_orders_madvise.
Link: https://lkml.kernel.org/r/20240515104754889HqrahFPePOIE1UlANHVAh@zte.com.cn Signed-off-by: Ran Xiaokai ran.xiaokai@zte.com.cn Acked-by: David Hildenbrand david@redhat.com Reviewed-by: Lu Zhongjun lu.zhongjun@zte.com.cn Reviewed-by: xu xin xu.xin16@zte.com.cn Cc: Yang Yang yang.yang29@zte.com.cn Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/huge_mm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0da01df3b283..548f68913c1b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -132,8 +132,8 @@ static inline bool hugepage_flags_enabled(void) * So we don't need to look at huge_anon_orders_inherit. */ return hugepage_global_enabled() || - huge_anon_orders_always || - huge_anon_orders_madvise; + READ_ONCE(huge_anon_orders_always) || + READ_ONCE(huge_anon_orders_madvise); }
static inline int highest_order(unsigned long orders)
From: Ryan Roberts ryan.roberts@arm.com
mainline inclusion from mainline-v6.11-rc1 commit 00f58104202c472e487f0866fbd38832523fd4f9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Since the introduction of mTHP, the docuementation has stated that khugepaged would be enabled when any mTHP size is enabled, and disabled when all mTHP sizes are disabled. There are 2 problems with this; 1. this is not what was implemented by the code and 2. this is not the desirable behavior.
Desirable behavior is for khugepaged to be enabled when any PMD-sized THP is enabled, anon or file. (Note that file THP is still controlled by the top-level control so we must always consider that, as well as the PMD-size mTHP control for anon). khugepaged only supports collapsing to PMD-sized THP so there is no value in enabling it when PMD-sized THP is disabled. So let's change the code and documentation to reflect this policy.
Further, per-size enabled control modification events were not previously forwarded to khugepaged to give it an opportunity to start or stop. Consequently the following was resulting in khugepaged eroneously not being activated:
echo never > /sys/kernel/mm/transparent_hugepage/enabled echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
[ryan.roberts@arm.com: v3] Link: https://lkml.kernel.org/r/20240705102849.2479686-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240705102849.2479686-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240704091051.2411934-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts ryan.roberts@arm.com Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface") Closes: https://lore.kernel.org/linux-mm/7a0bbe69-1e3d-4263-b206-da007791a5c4@redhat... Acked-by: David Hildenbrand david@redhat.com Cc: Baolin Wang baolin.wang@linux.alibaba.com Cc: Barry Song baohua@kernel.org Cc: Jonathan Corbet corbet@lwn.net Cc: Lance Yang ioworker0@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: Documentation/admin-guide/mm/transhuge.rst [ Context conflict. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 11 ++++---- include/linux/huge_mm.h | 12 -------- mm/huge_memory.c | 7 +++++ mm/khugepaged.c | 33 +++++++++++++++++----- 4 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index f9d692f049f6..5cf126c94f92 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -232,12 +232,11 @@ it back by writing 0:: echo 0 >/sys/kernel/mm/transparent_hugepage/pcp_allow_high_order echo 4 >/sys/kernel/mm/transparent_hugepage/pcp_allow_high_order
-khugepaged will be automatically started when one or more hugepage -sizes are enabled (either by directly setting "always" or "madvise", -or by setting "inherit" while the top-level enabled is set to "always" -or "madvise"), and it'll be automatically shutdown when the last -hugepage size is disabled (either by directly setting "never", or by -setting "inherit" while the top-level enabled is set to "never"). +khugepaged will be automatically started when PMD-sized THP is enabled +(either of the per-size anon control or the top-level control are set +to "always" or "madvise"), and it'll be automatically shutdown when +PMD-sized THP is disabled (when both the per-size anon control and the +top-level control are "never")
Khugepaged controls ------------------- diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 548f68913c1b..5ee1867520cc 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -124,18 +124,6 @@ static inline bool hugepage_global_always(void) (1<<TRANSPARENT_HUGEPAGE_FLAG); }
-static inline bool hugepage_flags_enabled(void) -{ - /* - * We cover both the anon and the file-backed case here; we must return - * true if globally enabled, even when all anon sizes are set to never. - * So we don't need to look at huge_anon_orders_inherit. - */ - return hugepage_global_enabled() || - READ_ONCE(huge_anon_orders_always) || - READ_ONCE(huge_anon_orders_madvise); -} - static inline int highest_order(unsigned long orders) { return fls_long(orders) - 1; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d743502c70f0..6178b28ddbcb 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -652,6 +652,13 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj, } else ret = -EINVAL;
+ if (ret > 0) { + int err; + + err = start_stop_khugepaged(); + if (err) + ret = err; + } return ret; }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index bc1aaf5b99ed..d13033eb7eaa 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -422,6 +422,26 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm) test_bit(MMF_DISABLE_THP, &mm->flags); }
+static bool hugepage_pmd_enabled(void) +{ + /* + * We cover both the anon and the file-backed case here; file-backed + * hugepages, when configured in, are determined by the global control. + * Anon pmd-sized hugepages are determined by the pmd-size control. + */ + if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && + hugepage_global_enabled()) + return true; + if (test_bit(PMD_ORDER, &huge_anon_orders_always)) + return true; + if (test_bit(PMD_ORDER, &huge_anon_orders_madvise)) + return true; + if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) && + hugepage_global_enabled()) + return true; + return false; +} + void __khugepaged_enter(struct mm_struct *mm) { struct khugepaged_mm_slot *mm_slot; @@ -458,7 +478,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, unsigned long vm_flags) { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && - hugepage_flags_enabled()) { + hugepage_pmd_enabled()) { if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS, PMD_ORDER)) __khugepaged_enter(vma->vm_mm); @@ -2505,8 +2525,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void) { - return !list_empty(&khugepaged_scan.mm_head) && - hugepage_flags_enabled(); + return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled(); }
static int khugepaged_wait_event(void) @@ -2579,7 +2598,7 @@ static void khugepaged_wait_work(void) return; }
- if (hugepage_flags_enabled()) + if (hugepage_pmd_enabled()) wait_event_freezable(khugepaged_wait, khugepaged_wait_event()); }
@@ -2610,7 +2629,7 @@ static void set_recommended_min_free_kbytes(void) int nr_zones = 0; unsigned long recommended_min;
- if (!hugepage_flags_enabled()) { + if (!hugepage_pmd_enabled()) { calculate_min_free_kbytes(); goto update_wmarks; } @@ -2660,7 +2679,7 @@ int start_stop_khugepaged(void) int err = 0;
mutex_lock(&khugepaged_mutex); - if (hugepage_flags_enabled()) { + if (hugepage_pmd_enabled()) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL, "khugepaged"); @@ -2686,7 +2705,7 @@ int start_stop_khugepaged(void) void khugepaged_min_free_kbytes_update(void) { mutex_lock(&khugepaged_mutex); - if (hugepage_flags_enabled() && khugepaged_thread) + if (hugepage_pmd_enabled() && khugepaged_thread) set_recommended_min_free_kbytes(); mutex_unlock(&khugepaged_mutex); }
From: Gao Xiang hsiangkao@linux.alibaba.com
mainline inclusion from mainline-v6.11-rc4 commit 2e6506e1c4eed2676a8412231046f31e10e240da category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently, migrate_pages_batch() can lock multiple locked folios with an arbitrary order. Although folio_trylock() is used to avoid deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly") mentioned, it seems try_split_folio() is still missing.
It was found by compaction stress test when I explicitly enable EROFS compressed files to use large folios, which case I cannot reproduce with the same workload if large folio support is off (current mainline). Typically, filesystem reads (with locked file-backed folios) could use another bdev/meta inode to load some other I/Os (e.g. inode extent metadata or caching compressed data), so the locking order will be:
file-backed folios (A) bdev/meta folios (B)
The following calltrace shows the deadlock: Thread 1 takes (B) lock and tries to take folio (A) lock Thread 2 takes (A) lock and tries to take folio (B) lock
[Thread 1] INFO: task stress:1824 blocked for more than 30 seconds. Tainted: G OE 6.10.0-rc7+ #6 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack:0 pid:1824 tgid:1824 ppid:1822 flags:0x0000000c Call trace: __switch_to+0xec/0x138 __schedule+0x43c/0xcb0 schedule+0x54/0x198 io_schedule+0x44/0x70 folio_wait_bit_common+0x184/0x3f8 <-- folio mapping ffff00036d69cb18 index 996 (**) __folio_lock+0x24/0x38 migrate_pages_batch+0x77c/0xea0 // try_split_folio (mm/migrate.c:1486:2) // migrate_pages_batch (mm/migrate.c:1734:16) <--- LIST_HEAD(unmap_folios) has .. folio mapping 0xffff0000d184f1d8 index 1711; (*) folio mapping 0xffff0000d184f1d8 index 1712; .. migrate_pages+0xb28/0xe90 compact_zone+0xa08/0x10f0 compact_node+0x9c/0x180 sysctl_compaction_handler+0x8c/0x118 proc_sys_call_handler+0x1a8/0x280 proc_sys_write+0x1c/0x30 vfs_write+0x240/0x380 ksys_write+0x78/0x118 __arm64_sys_write+0x24/0x38 invoke_syscall+0x78/0x108 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x148 el0t_64_sync_handler+0x100/0x130 el0t_64_sync+0x190/0x198
[Thread 2] INFO: task stress:1825 blocked for more than 30 seconds. Tainted: G OE 6.10.0-rc7+ #6 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack:0 pid:1825 tgid:1825 ppid:1822 flags:0x0000000c Call trace: __switch_to+0xec/0x138 __schedule+0x43c/0xcb0 schedule+0x54/0x198 io_schedule+0x44/0x70 folio_wait_bit_common+0x184/0x3f8 <-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*) __folio_lock+0x24/0x38 z_erofs_runqueue+0x384/0x9c0 [erofs] z_erofs_readahead+0x21c/0x350 [erofs] <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**) read_pages+0x74/0x328 page_cache_ra_order+0x26c/0x348 ondemand_readahead+0x1c0/0x3a0 page_cache_sync_ra+0x9c/0xc0 filemap_get_pages+0xc4/0x708 filemap_read+0x104/0x3a8 generic_file_read_iter+0x4c/0x150 vfs_read+0x27c/0x330 ksys_pread64+0x84/0xd0 __arm64_sys_pread64+0x28/0x40 invoke_syscall+0x78/0x108 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x148 el0t_64_sync_handler+0x100/0x130 el0t_64_sync+0x190/0x198
Link: https://lkml.kernel.org/r/20240729021306.398286-1-hsiangkao@linux.alibaba.co... Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: Gao Xiang hsiangkao@linux.alibaba.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Acked-by: David Hildenbrand david@redhat.com Cc: Matthew Wilcox willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/migrate.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c index 01e04e47699c..c091060585a3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1480,11 +1480,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio, return rc; }
-static inline int try_split_folio(struct folio *folio, struct list_head *split_folios) +static inline int try_split_folio(struct folio *folio, struct list_head *split_folios, + enum migrate_mode mode) { int rc;
- folio_lock(folio); + if (mode == MIGRATE_ASYNC) { + if (!folio_trylock(folio)) + return -EAGAIN; + } else { + folio_lock(folio); + } rc = split_folio_to_list(folio, split_folios); folio_unlock(folio); if (!rc) @@ -1678,7 +1684,7 @@ static int migrate_pages_batch(struct list_head *from, */ if (nr_pages > 2 && !list_empty(&folio->_deferred_list)) { - if (try_split_folio(folio, split_folios) == 0) { + if (!try_split_folio(folio, split_folios, mode)) { nr_failed++; stats->nr_thp_failed += is_thp; stats->nr_thp_split += is_thp; @@ -1700,7 +1706,7 @@ static int migrate_pages_batch(struct list_head *from, if (!thp_migration_supported() && is_thp) { nr_failed++; stats->nr_thp_failed++; - if (!try_split_folio(folio, split_folios)) { + if (!try_split_folio(folio, split_folios, mode)) { stats->nr_thp_split++; stats->nr_split++; continue; @@ -1732,7 +1738,7 @@ static int migrate_pages_batch(struct list_head *from, stats->nr_thp_failed += is_thp; /* Large folio NUMA faulting doesn't split to retry. */ if (is_large && !nosplit) { - int ret = try_split_folio(folio, split_folios); + int ret = try_split_folio(folio, split_folios, mode);
if (!ret) { stats->nr_thp_split += is_thp;
From: Baolin Wang baolin.wang@linux.alibaba.com
mainline inclusion from mainline-v6.10-rc1 commit 231f8c7127e37edcd4d9e3f87e0f9fcf0e90d902 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently, compaction_capture() does not allow lower-order allocations to directly capture the movable free pages, even though lower-order allocations might also be requesting movable pages, that can lead to more compaction scanning. And, with the enablement of mTHP, such situations will become more common.
Thus allowing lower-order (mTHP) allocations of movable page types directly capture the movable free pages can avoid unnecessary compaction scanning, meanwhile that won't pollute the movable pageblock. With testing 1M mTHP compaction, it can be seen that compaction scanning is significantly reduced.
mm-unstable patched Ops Compaction pages isolated 116598741.00 120946702.00 Ops Compaction migrate scanned 1764870054.00 1488621550.00 Ops Compaction free scanned 7707879039.00 4986299318.00 Ops Compact scan efficiency 22.90 29.85 Ops Compaction cost 73797.69 72933.48
Link: https://lkml.kernel.org/r/8118a5d66a034736a48433beddaca60ed78577c4.171289232... Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: Zi Yan ziy@nvidia.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Ryan Roberts ryan.roberts@arm.com Cc: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 22addac04e98..c4bec13e2702 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -621,12 +621,14 @@ compaction_capture(struct capture_control *capc, struct page *page, return false;
/* - * Do not let lower order allocations pollute a movable pageblock. + * Do not let lower order allocations pollute a movable pageblock + * unless compaction is also requesting movable pages. * This might let an unmovable request use a reclaimable pageblock * and vice-versa but no more than normal fallback logic which can * have trouble finding a high-order free page. */ - if (order < pageblock_order && migratetype == MIGRATE_MOVABLE) + if (order < pageblock_order && migratetype == MIGRATE_MOVABLE && + capc->cc->migratetype != MIGRATE_MOVABLE) return false;
capc->page = page;
From: Hugh Dickins hughd@google.com
mainline inclusion from mainline-v6.10 commit be9581ea8c058d81154251cb0695987098996cad category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on flags when freeing, yet the flags shown are not bad: PG_locked had been set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN symptoms implying double free by deferred split and large folio migration.
6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") was right to fix the memcg-dependent locking broken in 85ce2c517ade ("memcontrol: only transfer the memcg data for migration"), but missed a subtlety of deferred_split_scan(): it moves folios to its own local list to work on them without split_queue_lock, during which time folio->_deferred_list is not empty, but even the "right" lock does nothing to secure the folio and the list it is on.
Fortunately, deferred_split_scan() is careful to use folio_try_get(): so folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable() while the old folio's reference count is temporarily frozen to 0 - adding such a freeze in the !mapping case too (originally, folio lock and unmapping and no swap cache left an anon folio unreachable, so no freezing was needed there: but the deferred split queue offers a way to reach it).
Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration") Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: Barry Song baohua@kernel.org Cc: David Hildenbrand david@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Nhat Pham nphamcs@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Zi Yan ziy@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/memcontrol.c | 11 ----------- mm/migrate.c | 13 +++++++++++++ 2 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fcf08f3dc53f..ff22aeac06a4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -8684,17 +8684,6 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
/* Transfer the charge and the css ref */ commit_charge(new, memcg); - /* - * If the old folio is a large folio and is in the split queue, it needs - * to be removed from the split queue now, in case getting an incorrect - * split queue in destroy_large_folio() after the memcg of the old folio - * is cleared. - * - * In addition, the old folio is about to be freed after migration, so - * removing from the split queue a bit earlier seems reasonable. - */ - if (folio_test_large(old) && folio_test_large_rmappable(old)) - folio_undo_large_rmappable(old); old->memcg_data = 0; }
diff --git a/mm/migrate.c b/mm/migrate.c index c091060585a3..371a5119a527 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -413,6 +413,15 @@ int folio_migrate_mapping(struct address_space *mapping, if (folio_ref_count(folio) != expected_count) return -EAGAIN;
+ /* Take off deferred split queue while frozen and memcg set */ + if (folio_test_large(folio) && + folio_test_large_rmappable(folio)) { + if (!folio_ref_freeze(folio, expected_count)) + return -EAGAIN; + folio_undo_large_rmappable(folio); + folio_ref_unfreeze(folio, expected_count); + } + /* No turning back from here */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; @@ -431,6 +440,10 @@ int folio_migrate_mapping(struct address_space *mapping, return -EAGAIN; }
+ /* Take off deferred split queue while frozen and memcg set */ + if (folio_test_large(folio) && folio_test_large_rmappable(folio)) + folio_undo_large_rmappable(folio); + /* * Now we know that no one else is looking at the folio: * no turning back from here.
From: Kefeng Wang wangkefeng.wang@huawei.com
mainline inclusion from mainline-v6.11-rc1 commit 593a10dabe08dcf93259fce2badd8dc2528859a8 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Folios of order <= 1 are not in deferred list, the check of order is added into folio_undo_large_rmappable() from commit 8897277acfef ("mm: support order-1 folios in the page cache"), but there is a repeated check for small folio (order 0) during each call of the folio_undo_large_rmappable(), so only keep folio_order() check inside the function.
In addition, move all the checks into header file to save a function call for non-large-rmappable or empty deferred_list folio.
Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Lance Yang ioworker0@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Michal Hocko mhocko@kernel.org Cc: Muchun Song muchun.song@linux.dev Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Shakeel Butt shakeel.butt@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/page_alloc.c [ Context conflicts. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/huge_memory.c | 13 +------------ mm/internal.h | 17 ++++++++++++++++- mm/page_alloc.c | 3 +-- mm/swap.c | 8 ++------ mm/vmscan.c | 8 ++------ 5 files changed, 22 insertions(+), 27 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6178b28ddbcb..1f659287fbb3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3355,22 +3355,11 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, return ret; }
-void folio_undo_large_rmappable(struct folio *folio) +void __folio_undo_large_rmappable(struct folio *folio) { struct deferred_split *ds_queue; unsigned long flags;
- if (folio_order(folio) <= 1) - return; - - /* - * At this point, there is no one trying to add the folio to - * deferred_list. If folio is not in deferred_list, it's safe - * to check without acquiring the split_queue_lock. - */ - if (data_race(list_empty(&folio->_deferred_list))) - return; - ds_queue = get_deferred_split_queue(folio); spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(&folio->_deferred_list)) { diff --git a/mm/internal.h b/mm/internal.h index 0ecbaa392054..7db2957ef3a0 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -593,7 +593,22 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) #endif }
-void folio_undo_large_rmappable(struct folio *folio); +void __folio_undo_large_rmappable(struct folio *folio); +static inline void folio_undo_large_rmappable(struct folio *folio) +{ + if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio)) + return; + + /* + * At this point, there is no one trying to add the folio to + * deferred_list. If folio is not in deferred_list, it's safe + * to check without acquiring the split_queue_lock. + */ + if (data_race(list_empty(&folio->_deferred_list))) + return; + + __folio_undo_large_rmappable(folio); +}
static inline struct folio *page_rmappable_folio(struct page *page) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c4bec13e2702..52382ba24232 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2568,8 +2568,7 @@ void free_unref_folios(struct folio_batch *folios) continue; }
- if (order > 0 && folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); if (!free_unref_page_prepare(&folio->page, pfn, order)) continue;
diff --git a/mm/swap.c b/mm/swap.c index 1c9e8f70d6b5..358bf8494062 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -123,8 +123,7 @@ void __folio_put(struct folio *folio) }
page_cache_release(folio); - if (folio_test_large(folio) && folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); mem_cgroup_uncharge(folio); free_unref_page(&folio->page, folio_order(folio)); } @@ -999,10 +998,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs) free_huge_folio(folio); continue; } - if (folio_test_large(folio) && - folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); - + folio_undo_large_rmappable(folio); __page_cache_release(folio, &lruvec, &flags);
if (j != i) diff --git a/mm/vmscan.c b/mm/vmscan.c index 37019d51c31d..3337907ae5e9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2135,9 +2135,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, */ nr_reclaimed += nr_pages;
- if (folio_test_large(folio) && - folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); if (folio_batch_add(&free_folios, folio) == 0) { mem_cgroup_uncharge_folios(&free_folios); try_to_unmap_flush(); @@ -2545,9 +2543,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec, if (unlikely(folio_put_testzero(folio))) { __folio_clear_lru_flags(folio);
- if (folio_test_large(folio) && - folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio); if (folio_batch_add(&free_folios, folio) == 0) { spin_unlock_irq(&lruvec->lru_lock); mem_cgroup_uncharge_folios(&free_folios);
From: Hugh Dickins hughd@google.com
mainline inclusion from mainline-v6.11-rc1 commit a5ea521250afdf3d70c72970660f44aebf56ea19 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Now that folio_undo_large_rmappable() is an inline function checking order and large_rmappable for itself (and __folio_undo_large_rmappable() is now declared even when CONFIG_TRANASPARENT_HUGEPAGE is off) there is no need for folio_migrate_mapping() to check large and large_rmappable first (in the mapping case when it has had to freeze anyway).
Link: https://lkml.kernel.org/r/68feee73-050e-8e98-7a3a-abf78738d92c@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Zi Yan ziy@nvidia.com Cc: Baolin Wang baolin.wang@linux.alibaba.com Cc: Barry Song baohua@kernel.org Cc: David Hildenbrand david@redhat.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Nhat Pham nphamcs@gmail.com Cc: Yang Shi shy828301@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/migrate.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c index 371a5119a527..f2f3f3cf3fe2 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -441,8 +441,7 @@ int folio_migrate_mapping(struct address_space *mapping, }
/* Take off deferred split queue while frozen and memcg set */ - if (folio_test_large(folio) && folio_test_large_rmappable(folio)) - folio_undo_large_rmappable(folio); + folio_undo_large_rmappable(folio);
/* * Now we know that no one else is looking at the folio:
From: Gavin Shan gshan@redhat.com
mainline inclusion from mainline-v6.11-rc1 commit d659b715e94ac039803d7601505d3473393fc0be category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
xarray can't support arbitrary page cache size. the largest and supported page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However, it's possible to have 512MB page cache in the huge memory's collapsing path on ARM64 system whose base page size is 64KB. 512MB page cache is breaking the limitation and a warning is raised when the xarray entry is split as shown in the following example.
[root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize KernelPageSize: 64 kB [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c : int main(int argc, char **argv) { const char *filename = TEST_XFS_FILENAME; int fd = 0; void *buf = (void *)-1, *p; int pgsize = getpagesize(); int ret = 0;
if (pgsize != 0x10000) { fprintf(stdout, "System with 64KB base page size is required!\n"); return -EPERM; }
system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb"); system("echo 1 > /proc/sys/vm/drop_caches");
/* Open the xfs file */ fd = open(filename, O_RDONLY); assert(fd > 0);
/* Create VMA */ buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0); assert(buf != (void *)-1); fprintf(stdout, "mapped buffer at 0x%p\n", buf);
/* Populate VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ); assert(ret == 0);
/* Collapse VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE); if (ret) { fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno); goto out; }
/* Split xarray entry. Write permission is needed */ munmap(buf, TEST_MEM_SIZE); buf = (void *)-1; close(fd); fd = open(filename, O_RDWR); assert(fd > 0); fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, TEST_MEM_SIZE - pgsize, pgsize); out: if (buf != (void *)-1) munmap(buf, TEST_MEM_SIZE); if (fd > 0) close(fd);
return ret; }
[root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test [root@dhcp-10-26-1-207 ~]# /tmp/test ------------[ cut here ]------------ WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \ xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \ sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x780 sp : ffff8000ac32f660 x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0 x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000 x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8 x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x780 truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2f0 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180
Fix it by correcting the supported page cache orders, different sets for DAX and other files. With it corrected, 512MB page cache becomes disallowed on all non-DAX files on ARM64 system where the base page size is 64KB. After this patch is applied, the test program fails with error -EINVAL returned from __thp_vma_allowable_orders() and the madvise() system call to collapse the page caches.
Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Gavin Shan gshan@redhat.com Acked-by: David Hildenbrand david@redhat.com Reviewed-by: Ryan Roberts ryan.roberts@arm.com Acked-by: Zi Yan ziy@nvidia.com Cc: Baolin Wang baolin.wang@linux.alibaba.com Cc: Barry Song baohua@kernel.org Cc: Don Dutile ddutile@redhat.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Peter Xu peterx@redhat.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: William Kucharski william.kucharski@oracle.com Cc: stable@vger.kernel.org [5.17+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/huge_mm.h | 12 +++++++++--- mm/huge_memory.c | 12 ++++++++++-- 2 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 5ee1867520cc..c016cb753b55 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -80,14 +80,20 @@ extern struct kobj_attribute shmem_enabled_attr; #define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
/* - * Mask of all large folio orders supported for file THP. + * Mask of all large folio orders supported for file THP. Folios in a DAX + * file is never split and the MAX_PAGECACHE_ORDER limit does not apply to + * it. */ -#define THP_ORDERS_ALL_FILE (BIT(PMD_ORDER) | BIT(PUD_ORDER)) +#define THP_ORDERS_ALL_FILE_DAX \ + (BIT(PMD_ORDER) | BIT(PUD_ORDER)) +#define THP_ORDERS_ALL_FILE_DEFAULT \ + ((BIT(MAX_PAGECACHE_ORDER + 1) - 1) & ~BIT(0))
/* * Mask of all large folio orders supported for THP. */ -#define THP_ORDERS_ALL (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE) +#define THP_ORDERS_ALL \ + (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE_DAX | THP_ORDERS_ALL_FILE_DEFAULT)
#define TVA_SMAPS (1 << 0) /* Will be used for procfs */ #define TVA_IN_PF (1 << 1) /* Page fault handler */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1f659287fbb3..16d8ed7f46bd 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -84,9 +84,17 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, bool smaps = tva_flags & TVA_SMAPS; bool in_pf = tva_flags & TVA_IN_PF; bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS; + unsigned long supported_orders; + /* Check the intersection of requested and supported orders. */ - orders &= vma_is_anonymous(vma) ? - THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE; + if (vma_is_anonymous(vma)) + supported_orders = THP_ORDERS_ALL_ANON; + else if (vma_is_dax(vma)) + supported_orders = THP_ORDERS_ALL_FILE_DAX; + else + supported_orders = THP_ORDERS_ALL_FILE_DEFAULT; + + orders &= supported_orders; if (!orders) return 0;
From: "Matthew Wilcox (Oracle)" willy@infradead.org
mainline inclusion from mainline-v6.11-rc1 commit 9aac777aaf9459786bc8463e6cbfc7e7e1abd1f9 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Modelled after the loop in iomap_write_iter(), copy larger chunks from userspace if the filesystem has created large folios.
[hch: use mapping_max_folio_size to keep supporting file systems that do not support large folios]
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Christoph Hellwig hch@lst.de Tested-by: Shaun Tancheff shaun.tancheff@hpe.com Tested-by: Sagi Grimberg sagi@grimberg.me Signed-off-by: Anna Schumaker Anna.Schumaker@Netapp.com Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/filemap.c | 40 +++++++++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c index 9338f805cc4c..eb96ddf00ba8 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -4038,21 +4038,24 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i) loff_t pos = iocb->ki_pos; struct address_space *mapping = file->f_mapping; const struct address_space_operations *a_ops = mapping->a_ops; + size_t chunk = mapping_max_folio_size(mapping); long status = 0; ssize_t written = 0;
do { struct page *page; - unsigned long offset; /* Offset into pagecache page */ - unsigned long bytes; /* Bytes to write to page */ + struct folio *folio; + size_t offset; /* Offset into folio */ + size_t bytes; /* Bytes to write to folio */ size_t copied; /* Bytes copied from user */ void *fsdata = NULL;
- offset = (pos & (PAGE_SIZE - 1)); - bytes = min_t(unsigned long, PAGE_SIZE - offset, - iov_iter_count(i)); + bytes = iov_iter_count(i); +retry: + offset = pos & (chunk - 1); + bytes = min(chunk - offset, bytes); + balance_dirty_pages_ratelimited(mapping);
-again: /* * Bring in the user page that we will copy from _first_. * Otherwise there's a nasty deadlock on copying from the @@ -4074,11 +4077,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i) if (unlikely(status < 0)) break;
+ folio = page_folio(page); + offset = offset_in_folio(folio, pos); + if (bytes > folio_size(folio) - offset) + bytes = folio_size(folio) - offset; + if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_dcache_folio(folio);
- copied = copy_page_from_iter_atomic(page, offset, bytes, i); - flush_dcache_page(page); + copied = copy_folio_from_iter_atomic(folio, offset, bytes, i); + flush_dcache_folio(folio);
status = a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata); @@ -4096,14 +4104,16 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i) * halfway through, might be a race with munmap, * might be severe memory pressure. */ - if (copied) + if (chunk > PAGE_SIZE) + chunk /= 2; + if (copied) { bytes = copied; - goto again; + goto retry; + } + } else { + pos += status; + written += status; } - pos += status; - written += status; - - balance_dirty_pages_ratelimited(mapping); } while (iov_iter_count(i));
if (!written)
From: Pankaj Raghav p.raghav@samsung.com
mainline inclusion from mainline-v6.9-rc1 commit e03c16fb4af1dfc615a4e1f51be0d5fe5840b904 category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A while loop is used to adjust the new_order to be lower than the ra->size. ilog2 could be used to do the same instead of using a loop.
ilog2 typically resolves to a bit scan reverse instruction. This is particularly useful when ra->size is smaller than the 2^new_order as it resolves in one instruction instead of looping to find the new_order.
No functional changes.
Link: https://lkml.kernel.org/r/20240115102523.2336742-1-kernel@pankajraghav.com Signed-off-by: Pankaj Raghav p.raghav@samsung.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/readahead.c [ Context conflicts. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/readahead.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c index 689e003951fe..27a4dff182e0 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -508,10 +508,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
if (new_order < MAX_PAGECACHE_ORDER) { new_order += 2; - if (new_order > MAX_PAGECACHE_ORDER) - new_order = MAX_PAGECACHE_ORDER; - while ((1 << new_order) > ra->size) - new_order--; + new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order); + new_order = min_t(unsigned int, new_order, ilog2(ra->size)); }
/* See comment in page_cache_ra_unbounded() */
From: Gavin Shan gshan@redhat.com
mainline inclusion from mainline-v6.10 commit 1f789a45c3f1aa77531db21768fca70b66c0eeb1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In page_cache_ra_order(), the maximal order of the page cache to be allocated shouldn't be larger than MAX_PAGECACHE_ORDER. Otherwise, it's possible the large page cache can't be supported by xarray when the corresponding xarray entry is split.
For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size is 64KB. The PMD-sized page cache can't be supported by xarray.
Link: https://lkml.kernel.org/r/20240627003953.1262512-3-gshan@redhat.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Gavin Shan gshan@redhat.com Acked-by: David Hildenbrand david@redhat.com Cc: Darrick J. Wong djwong@kernel.org Cc: Don Dutile ddutile@redhat.com Cc: Hugh Dickins hughd@google.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Ryan Roberts ryan.roberts@arm.com Cc: William Kucharski william.kucharski@oracle.com Cc: Zhenyu Zhang zhenyzha@redhat.com Cc: stable@vger.kernel.org [5.18+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/readahead.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c index 27a4dff182e0..a8911f7c161a 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -506,11 +506,11 @@ void page_cache_ra_order(struct readahead_control *ractl,
limit = min(limit, index + ra->size - 1);
- if (new_order < MAX_PAGECACHE_ORDER) { + if (new_order < MAX_PAGECACHE_ORDER) new_order += 2; - new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order); - new_order = min_t(unsigned int, new_order, ilog2(ra->size)); - } + + new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order); + new_order = min_t(unsigned int, new_order, ilog2(ra->size));
/* See comment in page_cache_ra_unbounded() */ nofs = memalloc_nofs_save();
From: Yajun Deng yajun.deng@linux.dev
mainline inclusion from mainline-v6.9-rc1 commit 30afc8c34290184c023fa79136ce5f8813fc73da category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The file parameter in the __remove_shared_vm_struct is no longer used, remove it.
These functions vma_link() and mmap_region() have some of the same code, introduce vma_link_file() helper function to simplify the code.
Link: https://lkml.kernel.org/r/20240110084622.2425927-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng yajun.deng@linux.dev Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/mmap.c [ Context conflicts with commit 4f4042f1e777 and e8e17ee90eaf ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/mmap.c | 44 +++++++++++++++++++------------------------- 1 file changed, 19 insertions(+), 25 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c index 1d8def3db125..27ba0bb1acde 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -106,7 +106,7 @@ void vma_set_page_prot(struct vm_area_struct *vma) * Requires inode->i_mapping->i_mmap_rwsem */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, - struct file *file, struct address_space *mapping) + struct address_space *mapping) { if (vma->vm_flags & VM_SHARED) mapping_unmap_writable(mapping); @@ -127,7 +127,7 @@ void unlink_file_vma(struct vm_area_struct *vma) if (file) { struct address_space *mapping = file->f_mapping; i_mmap_lock_write(mapping); - __remove_shared_vm_struct(vma, file, mapping); + __remove_shared_vm_struct(vma, mapping); i_mmap_unlock_write(mapping); } } @@ -394,26 +394,30 @@ static void __vma_link_file(struct vm_area_struct *vma, flush_dcache_mmap_unlock(mapping); }
+static void vma_link_file(struct vm_area_struct *vma) +{ + struct file *file = vma->vm_file; + struct address_space *mapping; + + if (file) { + mapping = file->f_mapping; + i_mmap_lock_write(mapping); + __vma_link_file(vma, mapping); + i_mmap_unlock_write(mapping); + } +} + static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma) { VMA_ITERATOR(vmi, mm, 0); - struct address_space *mapping = NULL;
vma_iter_config(&vmi, vma->vm_start, vma->vm_end); if (vma_iter_prealloc(&vmi, vma)) return -ENOMEM;
vma_start_write(vma); - vma_iter_store(&vmi, vma); - - if (vma->vm_file) { - mapping = vma->vm_file->f_mapping; - i_mmap_lock_write(mapping); - __vma_link_file(vma, mapping); - i_mmap_unlock_write(mapping); - } - + vma_link_file(vma); mm->map_count++; validate_mm(mm); return 0; @@ -521,10 +525,9 @@ static inline void vma_complete(struct vma_prepare *vp, }
if (vp->remove && vp->file) { - __remove_shared_vm_struct(vp->remove, vp->file, vp->mapping); + __remove_shared_vm_struct(vp->remove, vp->mapping); if (vp->remove2) - __remove_shared_vm_struct(vp->remove2, vp->file, - vp->mapping); + __remove_shared_vm_struct(vp->remove2, vp->mapping); } else if (vp->insert) { /* * split_vma has split insert from vma, and needs @@ -2875,16 +2878,7 @@ static unsigned long __mmap_region(struct mm_struct *mm, vma_start_write(vma); vma_iter_store(&vmi, vma); mm->map_count++; - if (vma->vm_file) { - i_mmap_lock_write(vma->vm_file->f_mapping); - if (vma->vm_flags & VM_SHARED) - mapping_allow_writable(vma->vm_file->f_mapping); - - flush_dcache_mmap_lock(vma->vm_file->f_mapping); - vma_interval_tree_insert(vma, &vma->vm_file->f_mapping->i_mmap); - flush_dcache_mmap_unlock(vma->vm_file->f_mapping); - i_mmap_unlock_write(vma->vm_file->f_mapping); - } + vma_link_file(vma);
/* * vma_merge() calls khugepaged_enter_vma() either, the below
From: linke li lilinke99@qq.com
mainline inclusion from mainline-v6.10-rc1 commit 5ee9562c586cd4ca9402b3636157abdd58ab7978 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In scan_swap_map_slots(), si->highest_bit can by changed by swap_range_alloc() concurrently. All reads on si->highest_bit except one is either protected by lock or read using READ_ONCE. So mark the one racy read on si->highest_bit as benign using READ_ONCE.
This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races.
Link: https://lkml.kernel.org/r/tencent_912BC3E8B0291DA4A0028AB424076375DA07@qq.co... Signed-off-by: linke li lilinke99@qq.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c index 744e5c8bd66b..941a98e7ed39 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -896,7 +896,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si, last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
/* Locate the first empty (unaligned) cluster */ - for (; last_in_cluster <= si->highest_bit; offset++) { + for (; last_in_cluster <= READ_ONCE(si->highest_bit); offset++) { if (si->swap_map[offset]) last_in_cluster = offset + SWAPFILE_CLUSTER; else if (offset == last_in_cluster) {
From: linke li lilinke99@qq.com
mainline inclusion from mainline-v6.10-rc1 commit 844776cb65a77ef27bfba2220e285940b714ae4e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In deactivate_slab(), slab->freelist can be changed concurrently. Mark data race on slab->freelist as benign using READ_ONCE.
This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races.
Signed-off-by: linke li lilinke99@qq.com Signed-off-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c index 7fcd18261c1e..6594bd801b6b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2490,7 +2490,7 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab, struct slab new; struct slab old;
- if (slab->freelist) { + if (READ_ONCE(slab->freelist)) { stat(s, DEACTIVATE_REMOTE_FREES); tail = DEACTIVATE_TO_TAIL; }
From: linke li lilinke99@qq.com
mainline inclusion from mainline-v6.10-rc1 commit 87654cf7a9865c0be256d67229b7354125d7498e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The reads of slab->slabs are racy because it may be changed by put_cpu_partial concurrently. In slabs_cpu_partial_show() and show_slab_objects(), slab->slabs is only used for showing information.
Data-racy reads from shared variables that are used only for diagnostic purposes should typically use data_race(), since it is normally not a problem if the values are off by a little.
This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races.
Signed-off-by: linke li lilinke99@qq.com Reviewed-by: Chengming Zhou chengming.zhou@linux.dev Signed-off-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/slub.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c index 6594bd801b6b..bcbfd720b574 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -5448,7 +5448,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s, else if (flags & SO_OBJECTS) WARN_ON_ONCE(1); else - x = slab->slabs; + x = data_race(slab->slabs); total += x; nodes[node] += x; } @@ -5653,7 +5653,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf) slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
if (slab) - slabs += slab->slabs; + slabs += data_race(slab->slabs); } #endif
@@ -5667,7 +5667,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
slab = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu)); if (slab) { - slabs = READ_ONCE(slab->slabs); + slabs = data_race(slab->slabs); objects = (slabs * oo_objects(s->oo)) / 2; len += sysfs_emit_at(buf, len, " C%d=%d(%d)", cpu, objects, slabs);
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAHY3K
--------------------------------
The 2M anonymous mapping is dependent on BIT2.
Fixes: 08f7407a9f04 ("mm: add thp anon pmd size mapping align control") Signed-off-by: Liu Shixin liushixin2@huawei.com --- Documentation/admin-guide/mm/transhuge.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 5cf126c94f92..b6e5ba22176a 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -213,8 +213,8 @@ possible to enable/disable it by configurate the corresponding bit::
The kernel could try to enable mappings for different sizes, eg, 64K on arm64, BIT0 for file mapping, BIT1 for anonymous mapping, and THP size -page, BIT3 for anonymous mapping, where 64K anonymous mapping for arm64 -is dependent on BIT3 being turned on, the above feature are disabled by +page, BIT2 for anonymous mapping, where 2M anonymous mapping for arm64 +is dependent on BIT2 being turned on, the above feature are disabled by default, and could enable the above feature by writing the corresponding bit to 1::
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/11035 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/J...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/11035 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/J...