Huan Yang (1): mm: page_alloc: simpify page del and expand
Johannes Weiner (11): mm: page_alloc: remove pcppage migratetype caching mm: page_alloc: optimize free_unref_folios() mm: page_alloc: fix up block types when merging compatible blocks mm: page_alloc: move free pages when converting block during isolation mm: page_alloc: fix move_freepages_block() range error mm: page_alloc: fix freelist movement during block conversion mm: page_alloc: close migratetype race between freeing and stealing mm: page_isolation: prepare for hygienic freelists mm: page_alloc: consolidate free page accounting mm: page_alloc: batch vmstat updates in expand() mm: page_alloc: fix highatomic typing in multi-block buddies
Kefeng Wang (1): mm: remove migration for HugePage in isolate_single_pageblock()
Kemeng Shi (2): mm/page_alloc: remove unnecessary check in break_down_buddy_pages mm/page_alloc: remove unnecessary next_page in break_down_buddy_pages
Vlastimil Babka (1): mm: page_alloc: change move_freepages() to __move_freepages_block()
Yajun Deng (1): mm: page_alloc: simplify __free_pages_ok()
Yu Zhao (1): mm/page_alloc: keep track of free highatomic
Zi Yan (1): mm: page_alloc: set migratetype inside move_freepages()
include/linux/mm.h | 18 +- include/linux/mmzone.h | 1 + include/linux/page-isolation.h | 5 +- include/linux/vmstat.h | 8 - mm/debug_page_alloc.c | 12 +- mm/internal.h | 9 - mm/page_alloc.c | 735 +++++++++++++++++++-------------- mm/page_isolation.c | 135 ++---- 8 files changed, 470 insertions(+), 453 deletions(-)
From: Kemeng Shi shikemeng@huaweicloud.com
mainline inclusion from mainline-v6.7-rc1 commit 27e0db3c21aaf1422980e64b77956e15b839306f category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "Two minor cleanups to break_down_buddy_pages", v2.
Two minor cleanups to break_down_buddy_pages.
This patch (of 2):
1. We always have target in range started with next_page and full free range started with current_buddy.
2. The last split range size is 1 << low and low should be >= 0, then size >= 1. So page + size != page is always true (because size > 0). As summary, current_page will not equal to target page.
Link: https://lkml.kernel.org/r/20230927103514.98281-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20230927103514.98281-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi shikemeng@huaweicloud.com Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oscar Salvador osalvador@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 36cd38df0614..fb6008b30b48 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6955,10 +6955,8 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page, if (set_page_guard(zone, current_buddy, high, migratetype)) continue;
- if (current_buddy != target) { - add_to_free_list(current_buddy, zone, high, migratetype); - set_buddy_order(current_buddy, high); - } + add_to_free_list(current_buddy, zone, high, migratetype); + set_buddy_order(current_buddy, high); } }
From: Kemeng Shi shikemeng@huaweicloud.com
mainline inclusion from mainline-v6.7-rc1 commit 0dfca313a009c83e2ad44b3719dc1222df6c6db5 category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The next_page is only used to forward page in case target is in second half range. Move forward page directly to remove unnecessary next_page.
Link: https://lkml.kernel.org/r/20230927103514.98281-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi shikemeng@huaweicloud.com Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oscar Salvador osalvador@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fb6008b30b48..3cc5d5c7826e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6937,20 +6937,18 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page, int migratetype) { unsigned long size = 1 << high; - struct page *current_buddy, *next_page; + struct page *current_buddy;
while (high > low) { high--; size >>= 1;
if (target >= &page[size]) { - next_page = page + size; current_buddy = page; + page = page + size; } else { - next_page = page; current_buddy = page + size; } - page = next_page;
if (set_page_guard(zone, current_buddy, high, migratetype)) continue;
From: Yajun Deng yajun.deng@linux.dev
mainline inclusion from mainline-v6.8-rc1 commit 250ae189d98290d0539b4f9b8c4703e0bf24f9d3 category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There is redundant code in __free_pages_ok(). Use free_one_page() simplify it.
Link: https://lkml.kernel.org/r/20231216030503.2126130-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng yajun.deng@linux.dev Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3cc5d5c7826e..ff0940ab0fe6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1258,7 +1258,6 @@ static void free_one_page(struct zone *zone, static void __free_pages_ok(struct page *page, unsigned int order, fpi_t fpi_flags) { - unsigned long flags; int migratetype; unsigned long pfn = page_to_pfn(page); struct zone *zone = page_zone(page); @@ -1273,13 +1272,7 @@ static void __free_pages_ok(struct page *page, unsigned int order, */ migratetype = get_pfnblock_migratetype(page, pfn);
- spin_lock_irqsave(&zone->lock, flags); - if (unlikely(has_isolate_pageblock(zone) || - is_migrate_isolate(migratetype))) { - migratetype = get_pfnblock_migratetype(page, pfn); - } - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); - spin_unlock_irqrestore(&zone->lock, flags); + free_one_page(zone, page, pfn, order, migratetype, fpi_flags);
__count_vm_events(PGFREE, 1 << order); }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit 17edeb5d3f761c20fd28f6002f5a9faa53c0a0d8 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm: page_alloc: freelist migratetype hygiene", v4.
The page allocator's mobility grouping is intended to keep unmovable pages separate from reclaimable/compactable ones to allow on-demand defragmentation for higher-order allocations and huge pages.
Currently, there are several places where accidental type mixing occurs: an allocation asks for a page of a certain migratetype and receives another. This ruins pageblocks for compaction, which in turn makes allocating huge pages more expensive and less reliable.
The series addresses those causes. The last patch adds type checks on all freelist movements to prevent new violations being introduced.
The benefits can be seen in a mixed workload that stresses the machine with a memcache-type workload and a kernel build job while periodically attempting to allocate batches of THP. The following data is aggregated over 50 consecutive defconfig builds:
VANILLA PATCHED Hugealloc Time mean 165843.93 ( +0.00%) 113025.88 ( -31.85%) Hugealloc Time stddev 158957.35 ( +0.00%) 114716.07 ( -27.83%) Kbuild Real time 310.24 ( +0.00%) 300.73 ( -3.06%) Kbuild User time 1271.13 ( +0.00%) 1259.42 ( -0.92%) Kbuild System time 582.02 ( +0.00%) 559.79 ( -3.81%) THP fault alloc 30585.14 ( +0.00%) 40853.62 ( +33.57%) THP fault fallback 36626.46 ( +0.00%) 26357.62 ( -28.04%) THP fault fail rate % 54.49 ( +0.00%) 39.22 ( -27.53%) Pagealloc fallback 1328.00 ( +0.00%) 1.00 ( -99.85%) Pagealloc type mismatch 181009.50 ( +0.00%) 0.00 ( -100.00%) Direct compact stall 434.56 ( +0.00%) 257.66 ( -40.61%) Direct compact fail 421.70 ( +0.00%) 249.94 ( -40.63%) Direct compact success 12.86 ( +0.00%) 7.72 ( -37.09%) Direct compact success rate % 2.86 ( +0.00%) 2.82 ( -0.96%) Compact daemon scanned migrate 3370059.62 ( +0.00%) 3612054.76 ( +7.18%) Compact daemon scanned free 7718439.20 ( +0.00%) 5386385.02 ( -30.21%) Compact direct scanned migrate 309248.62 ( +0.00%) 176721.04 ( -42.85%) Compact direct scanned free 433582.84 ( +0.00%) 315727.66 ( -27.18%) Compact migrate scanned daemon % 91.20 ( +0.00%) 94.48 ( +3.56%) Compact free scanned daemon % 94.58 ( +0.00%) 94.42 ( -0.16%) Compact total migrate scanned 3679308.24 ( +0.00%) 3788775.80 ( +2.98%) Compact total free scanned 8152022.04 ( +0.00%) 5702112.68 ( -30.05%) Alloc stall 872.04 ( +0.00%) 5156.12 ( +490.71%) Pages kswapd scanned 510645.86 ( +0.00%) 3394.94 ( -99.33%) Pages kswapd reclaimed 134811.62 ( +0.00%) 2701.26 ( -98.00%) Pages direct scanned 99546.06 ( +0.00%) 376407.52 ( +278.12%) Pages direct reclaimed 62123.40 ( +0.00%) 289535.70 ( +366.06%) Pages total scanned 610191.92 ( +0.00%) 379802.46 ( -37.76%) Pages scanned kswapd % 76.36 ( +0.00%) 0.10 ( -98.58%) Swap out 12057.54 ( +0.00%) 15022.98 ( +24.59%) Swap in 209.16 ( +0.00%) 256.48 ( +22.52%) File refaults 17701.64 ( +0.00%) 11765.40 ( -33.53%)
Huge page success rate is higher, allocation latencies are shorter and more predictable.
Stealing (fallback) rate is drastically reduced. Notably, while the vanilla kernel keeps doing fallbacks on an ongoing basis, the patched kernel enters a steady state once the distribution of block types is adequate for the workload. Steals over 50 runs:
VANILLA PATCHED 1504.0 227.0 1557.0 6.0 1391.0 13.0 1080.0 26.0 1057.0 40.0 1156.0 6.0 805.0 46.0 736.0 20.0 1747.0 2.0 1699.0 34.0 1269.0 13.0 1858.0 12.0 907.0 4.0 727.0 2.0 563.0 2.0 3094.0 2.0 10211.0 3.0 2621.0 1.0 5508.0 2.0 1060.0 2.0 538.0 3.0 5773.0 2.0 2199.0 0.0 3781.0 2.0 1387.0 1.0 4977.0 0.0 2865.0 1.0 1814.0 1.0 3739.0 1.0 6857.0 0.0 382.0 0.0 407.0 1.0 3784.0 0.0 297.0 0.0 298.0 0.0 6636.0 0.0 4188.0 0.0 242.0 0.0 9960.0 0.0 5816.0 0.0 354.0 0.0 287.0 0.0 261.0 0.0 140.0 1.0 2065.0 0.0 312.0 0.0 331.0 0.0 164.0 0.0 465.0 1.0 219.0 0.0
Type mismatches are down too. Those count every time an allocation request asks for one migratetype and gets another. This can still occur minimally in the patched kernel due to non-stealing fallbacks, but it's quite rare and follows the pattern of overall fallbacks - once the block type distribution settles, mismatches cease as well:
VANILLA: PATCHED: 182602.0 268.0 135794.0 20.0 88619.0 19.0 95973.0 0.0 129590.0 0.0 129298.0 0.0 147134.0 0.0 230854.0 0.0 239709.0 0.0 137670.0 0.0 132430.0 0.0 65712.0 0.0 57901.0 0.0 67506.0 0.0 63565.0 4.0 34806.0 0.0 42962.0 0.0 32406.0 0.0 38668.0 0.0 61356.0 0.0 57800.0 0.0 41435.0 0.0 83456.0 0.0 65048.0 0.0 28955.0 0.0 47597.0 0.0 75117.0 0.0 55564.0 0.0 38280.0 0.0 52404.0 0.0 26264.0 0.0 37538.0 0.0 19671.0 0.0 30936.0 0.0 26933.0 0.0 16962.0 0.0 44554.0 0.0 46352.0 0.0 24995.0 0.0 35152.0 0.0 12823.0 0.0 21583.0 0.0 18129.0 0.0 31693.0 0.0 28745.0 0.0 33308.0 0.0 31114.0 0.0 35034.0 0.0 12111.0 0.0 24885.0 0.0
Compaction work is markedly reduced despite much better THP rates.
In the vanilla kernel, reclaim seems to have been driven primarily by watermark boosting that happens as a result of fallbacks. With those all but eliminated, watermarks average lower and kswapd does less work. The uptick in direct reclaim is because THP requests have to fend for themselves more often - which is intended policy right now. Aggregate reclaim activity is lowered significantly, though.
This patch (of 10):
The idea behind the cache is to save get_pageblock_migratetype() lookups during bulk freeing. A microbenchmark suggests this isn't helping, though. The pcp migratetype can get stale, which means that bulk freeing has an extra branch to check if the pageblock was isolated while on the pcp.
While the variance overlaps, the cache write and the branch seem to make this a net negative. The following test allocates and frees batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
Before: 8,668.48 msec task-clock # 99.735 CPUs utilized ( +- 2.90% ) 19 context-switches # 4.341 /sec ( +- 3.24% ) 0 cpu-migrations # 0.000 /sec 17,440 page-faults # 3.984 K/sec ( +- 2.90% ) 41,758,692,473 cycles # 9.541 GHz ( +- 2.90% ) 126,201,294,231 instructions # 5.98 insn per cycle ( +- 2.90% ) 25,348,098,335 branches # 5.791 G/sec ( +- 2.90% ) 33,436,921 branch-misses # 0.26% of all branches ( +- 2.90% )
0.0869148 +- 0.0000302 seconds time elapsed ( +- 0.03% )
After: 8,444.81 msec task-clock # 99.726 CPUs utilized ( +- 2.90% ) 22 context-switches # 5.160 /sec ( +- 3.23% ) 0 cpu-migrations # 0.000 /sec 17,443 page-faults # 4.091 K/sec ( +- 2.90% ) 40,616,738,355 cycles # 9.527 GHz ( +- 2.90% ) 126,383,351,792 instructions # 6.16 insn per cycle ( +- 2.90% ) 25,224,985,153 branches # 5.917 G/sec ( +- 2.90% ) 32,236,793 branch-misses # 0.25% of all branches ( +- 2.90% )
0.0846799 +- 0.0000412 seconds time elapsed ( +- 0.05% )
A side effect is that this also ensures that pages whose pageblock gets stolen while on the pcplist end up on the right freelist and we don't perform potentially type-incompatible buddy merges (or skip merges when we shouldn't), which is likely beneficial to long-term fragmentation management, although the effects would be harder to measure. Settle for simpler and faster code as justification here.
Link: https://lkml.kernel.org/r/20240320180429.678181-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20240320180429.678181-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Zi Yan ziy@nvidia.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Acked-by: Mel Gorman mgorman@techsingularity.net Tested-by: "Huang, Ying" ying.huang@intel.com Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/page_alloc.c [ Context conflicts with commit 62b208c4859c and ae577de78c12. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 66 +++++++++++-------------------------------------- 1 file changed, 14 insertions(+), 52 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ff0940ab0fe6..3d4932cd2332 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -207,24 +207,6 @@ EXPORT_SYMBOL(node_states);
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
-/* - * A cached value of the page's pageblock's migratetype, used when the page is - * put on a pcplist. Used to avoid the pageblock migratetype lookup when - * freeing from pcplists in most cases, at the cost of possibly becoming stale. - * Also the migratetype set in the page does not necessarily match the pcplist - * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any - * other index - this ensures that it will be put on the correct CMA freelist. - */ -static inline int get_pcppage_migratetype(struct page *page) -{ - return page->index; -} - -static inline void set_pcppage_migratetype(struct page *page, int migratetype) -{ - page->index = migratetype; -} - #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE unsigned int pageblock_order __read_mostly; #endif @@ -1186,7 +1168,6 @@ static void free_pcppages_bulk(struct zone *zone, int count, { unsigned long flags; unsigned int order; - bool isolated_pageblocks; struct page *page;
/* @@ -1199,7 +1180,6 @@ static void free_pcppages_bulk(struct zone *zone, int count, pindex = pindex - 1;
spin_lock_irqsave(&zone->lock, flags); - isolated_pageblocks = has_isolate_pageblock(zone);
while (count > 0) { struct list_head *list; @@ -1215,23 +1195,19 @@ static void free_pcppages_bulk(struct zone *zone, int count, order = pindex_to_order(pindex); nr_pages = 1 << order; do { + unsigned long pfn; int mt;
page = list_last_entry(list, struct page, pcp_list); - mt = get_pcppage_migratetype(page); + pfn = page_to_pfn(page); + mt = get_pfnblock_migratetype(page, pfn);
/* must delete to avoid corrupting pcp list */ list_del(&page->pcp_list); count -= nr_pages; pcp->count -= nr_pages;
- /* MIGRATE_ISOLATE page should not go to pcplists */ - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); - /* Pageblock could have been isolated meanwhile */ - if (unlikely(isolated_pageblocks)) - mt = get_pageblock_migratetype(page); - - __free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE); + __free_one_page(page, pfn, zone, order, mt, FPI_NONE); trace_mm_page_pcpu_drain(page, order, mt); } while (count > 0 && !list_empty(list)); } @@ -1591,7 +1567,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, continue; del_page_from_free_list(page, zone, current_order); expand(zone, page, order, current_order, migratetype); - set_pcppage_migratetype(page, migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES); @@ -2162,7 +2137,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, * pages are ordered properly. */ list_add_tail(&page->pcp_list, list); - if (is_migrate_cma(get_pcppage_migratetype(page))) + if (is_migrate_cma(get_pageblock_migratetype(page))) __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -(1 << order)); } @@ -2362,19 +2337,6 @@ void drain_all_pages(struct zone *zone) __drain_all_pages(zone, false); }
-static bool free_unref_page_prepare(struct page *page, unsigned long pfn, - unsigned int order) -{ - int migratetype; - - if (!free_pages_prepare(page, order)) - return false; - - migratetype = get_pfnblock_migratetype(page, pfn); - set_pcppage_migratetype(page, migratetype); - return true; -} - static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high) { int min_nr_free, max_nr_free; @@ -2517,7 +2479,7 @@ void free_unref_page(struct page *page, unsigned int order) return; }
- if (!free_unref_page_prepare(page, pfn, order)) + if (!free_pages_prepare(page, order)) return;
/* @@ -2527,7 +2489,7 @@ void free_unref_page(struct page *page, unsigned int order) * get those areas back if necessary. Otherwise, we may have to free * excessively into the page allocator */ - migratetype = pcpmigratetype = get_pcppage_migratetype(page); + migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn); if (unlikely(migratetype >= MIGRATE_PCPTYPES)) { if (unlikely(is_migrate_isolate(migratetype))) { free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE); @@ -2570,14 +2532,14 @@ void free_unref_folios(struct folio_batch *folios) }
folio_undo_large_rmappable(folio); - if (!free_unref_page_prepare(&folio->page, pfn, order)) + if (!free_pages_prepare(&folio->page, order)) continue;
/* * Free isolated folios and orders not handled on the PCP * directly to the allocator, see comment in free_unref_page. */ - migratetype = get_pcppage_migratetype(&folio->page); + migratetype = get_pfnblock_migratetype(&folio->page, pfn); if (!pcp_allowed_order(order) || is_migrate_isolate(migratetype)) { free_one_page(folio_zone(folio), &folio->page, pfn, @@ -2594,10 +2556,11 @@ void free_unref_folios(struct folio_batch *folios) for (i = 0; i < folios->nr; i++) { struct folio *folio = folios->folios[i]; struct zone *zone = folio_zone(folio); + unsigned long pfn = folio_pfn(folio); unsigned int order = (unsigned long)folio->private;
folio->private = NULL; - migratetype = get_pcppage_migratetype(&folio->page); + migratetype = get_pfnblock_migratetype(&folio->page, pfn);
/* Different zone requires a different pcp lock */ if (zone != locked_zone) { @@ -2614,9 +2577,8 @@ void free_unref_folios(struct folio_batch *folios) pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (unlikely(!pcp)) { pcp_trylock_finish(UP_flags); - free_one_page(zone, &folio->page, - folio_pfn(folio), order, - migratetype, FPI_NONE); + free_one_page(zone, &folio->page, pfn, + order, migratetype, FPI_NONE); locked_zone = NULL; continue; } @@ -2785,7 +2747,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, } } __mod_zone_freepage_state(zone, -(1 << order), - get_pcppage_migratetype(page)); + get_pageblock_migratetype(page)); spin_unlock_irqrestore(&zone->lock, flags); } while (check_new_pages(page, order));
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit 9cbe97bad5cd75b5b493734bd2695febb8e95281 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Move direct freeing of isolated pages to the lock-breaking block in the second loop. This saves an unnecessary migratetype reassessment.
Minor comment and local variable scoping cleanups.
Link: https://lkml.kernel.org/r/20240320180429.678181-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Suggested-by: Vlastimil Babka vbabka@suse.cz Tested-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/page_alloc.c [ Context conflict with commit ae577de78c12. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 32 +++++++++++++++++++++++--------- 1 file changed, 23 insertions(+), 9 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3d4932cd2332..dad180df69da 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2518,7 +2518,7 @@ void free_unref_folios(struct folio_batch *folios) unsigned long __maybe_unused UP_flags; struct per_cpu_pages *pcp = NULL; struct zone *locked_zone = NULL; - int i, j, migratetype; + int i, j;
/* Prepare folios for freeing */ for (i = 0, j = 0; i < folios->nr; i++) { @@ -2534,14 +2534,15 @@ void free_unref_folios(struct folio_batch *folios) folio_undo_large_rmappable(folio); if (!free_pages_prepare(&folio->page, order)) continue; - /* - * Free isolated folios and orders not handled on the PCP - * directly to the allocator, see comment in free_unref_page. + * Free orders not handled on the PCP directly to the + * allocator. */ - migratetype = get_pfnblock_migratetype(&folio->page, pfn); - if (!pcp_allowed_order(order) || - is_migrate_isolate(migratetype)) { + if (!pcp_allowed_order(order)) { + int migratetype; + + migratetype = get_pfnblock_migratetype(&folio->page, + pfn); free_one_page(folio_zone(folio), &folio->page, pfn, order, migratetype, FPI_NONE); continue; @@ -2558,15 +2559,29 @@ void free_unref_folios(struct folio_batch *folios) struct zone *zone = folio_zone(folio); unsigned long pfn = folio_pfn(folio); unsigned int order = (unsigned long)folio->private; + int migratetype;
folio->private = NULL; migratetype = get_pfnblock_migratetype(&folio->page, pfn);
/* Different zone requires a different pcp lock */ - if (zone != locked_zone) { + if (zone != locked_zone || + is_migrate_isolate(migratetype)) { if (pcp) { pcp_spin_unlock(pcp); pcp_trylock_finish(UP_flags); + locked_zone = NULL; + pcp = NULL; + } + + /* + * Free isolated pages directly to the + * allocator, see comment in free_unref_page. + */ + if (is_migrate_isolate(migratetype)) { + free_one_page(zone, &folio->page, pfn, + order, migratetype, FPI_NONE); + continue; }
/* @@ -2579,7 +2594,6 @@ void free_unref_folios(struct folio_batch *folios) pcp_trylock_finish(UP_flags); free_one_page(zone, &folio->page, pfn, order, migratetype, FPI_NONE); - locked_zone = NULL; continue; } locked_zone = zone;
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit e6cf9e1c4cde8a53385423ecb8ca581097f42e02 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The buddy allocator coalesces compatible blocks during freeing, but it doesn't update the types of the subblocks to match. When an allocation later breaks the chunk down again, its pieces will be put on freelists of the wrong type. This encourages incompatible page mixing (ask for one type, get another), and thus long-term fragmentation.
Update the subblocks when merging a larger chunk, such that a later expand() will maintain freelist type hygiene.
Link: https://lkml.kernel.org/r/20240320180429.678181-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Acked-by: Mel Gorman mgorman@techsingularity.net Tested-by: "Huang, Ying" ying.huang@intel.com Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dad180df69da..3d7e0f110868 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -779,10 +779,17 @@ static inline void __free_one_page(struct page *page, */ int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
- if (migratetype != buddy_mt - && (!migratetype_is_mergeable(migratetype) || - !migratetype_is_mergeable(buddy_mt))) - goto done_merging; + if (migratetype != buddy_mt) { + if (!migratetype_is_mergeable(migratetype) || + !migratetype_is_mergeable(buddy_mt)) + goto done_merging; + /* + * Match buddy type. This ensures that + * an expand() down the line puts the + * sub-blocks on the right freelists. + */ + set_pageblock_migratetype(buddy, migratetype); + } }
/*
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit b54ccd3c6bacbc571f7e61797fb5ff9fe3861413 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When claiming a block during compaction isolation, move any remaining free pages to the correct freelists as well, instead of stranding them on the wrong list. Otherwise, this encourages incompatible page mixing down the line, and thus long-term fragmentation.
Link: https://lkml.kernel.org/r/20240320180429.678181-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Zi Yan ziy@nvidia.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Acked-by: Mel Gorman mgorman@techsingularity.net Tested-by: "Huang, Ying" ying.huang@intel.com Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3d7e0f110868..7e3593318813 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2681,9 +2681,12 @@ int __isolate_free_page(struct page *page, unsigned int order) * Only change normal pageblocks (i.e., they can merge * with others) */ - if (migratetype_is_mergeable(mt)) + if (migratetype_is_mergeable(mt)) { set_pageblock_migratetype(page, MIGRATE_MOVABLE); + move_freepages_block(zone, page, + MIGRATE_MOVABLE, NULL); + } } }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit 2dd482ba627de15d67f0c0ed445133c8ae9b201b category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When a block is partially outside the zone of the cursor page, the function cuts the range to the pivot page instead of the zone start. This can leave large parts of the block behind, which encourages incompatible page mixing down the line (ask for one type, get another), and thus long-term fragmentation.
This triggers reliably on the first block in the DMA zone, whose start_pfn is 1. The block is stolen, but everything before the pivot page (which was often hundreds of pages) is left on the old list.
Link: https://lkml.kernel.org/r/20240320180429.678181-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Vlastimil Babka vbabka@suse.cz Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e3593318813..a101f5f550dc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1661,9 +1661,15 @@ int move_freepages_block(struct zone *zone, struct page *page, start_pfn = pageblock_start_pfn(pfn); end_pfn = pageblock_end_pfn(pfn) - 1;
- /* Do not cross zone boundaries */ + /* + * The caller only has the lock for @zone, don't touch ranges + * that straddle into other zones. While we could move part of + * the range that's inside the zone, this call is usually + * accompanied by other operations such as migratetype updates + * which also should be locked. + */ if (!zone_spans_pfn(zone, start_pfn)) - start_pfn = pfn; + return 0; if (!zone_spans_pfn(zone, end_pfn)) return 0;
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit c0cd6f557b9090525d288806cccbc73440ac235a category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently, page block type conversion during fallbacks, atomic reservations and isolation can strand various amounts of free pages on incorrect freelists.
For example, fallback stealing moves free pages in the block to the new type's freelists, but then may not actually claim the block for that type if there aren't enough compatible pages already allocated.
In all cases, free page moving might fail if the block straddles more than one zone, in which case no free pages are moved at all, but the block type is changed anyway.
This is detrimental to type hygiene on the freelists. It encourages incompatible page mixing down the line (ask for one type, get another) and thus contributes to long-term fragmentation.
Split the process into a proper transaction: check first if conversion will happen, then try to move the free pages, and only if that was successful convert the block to the new type.
[baolin.wang@linux.alibaba.com: fix allocation failures with CONFIG_CMA] Link: https://lkml.kernel.org/r/a97697e0-45b0-4f71-b087-fdc7a1d43c0e@linux.alibaba... Link: https://lkml.kernel.org/r/20240320180429.678181-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Tested-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/page-isolation.h | 3 +- mm/page_alloc.c | 174 ++++++++++++++++++++------------- mm/page_isolation.c | 22 +++-- 3 files changed, 121 insertions(+), 78 deletions(-)
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 4ac34392823a..8550b3c91480 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -34,8 +34,7 @@ static inline bool is_migrate_isolate(int migratetype) #define REPORT_FAILURE 0x2
void set_pageblock_migratetype(struct page *page, int migratetype); -int move_freepages_block(struct zone *zone, struct page *page, - int migratetype, int *num_movable); +int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, int migratetype, int flags, gfp_t gfp_flags); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a101f5f550dc..ba85db6cf987 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1612,9 +1612,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, * Note that start_page and end_pages are not aligned on a pageblock * boundary. If alignment is required, use move_freepages_block() */ -static int move_freepages(struct zone *zone, - unsigned long start_pfn, unsigned long end_pfn, - int migratetype, int *num_movable) +static int move_freepages(struct zone *zone, unsigned long start_pfn, + unsigned long end_pfn, int migratetype) { struct page *page; unsigned long pfn; @@ -1624,14 +1623,6 @@ static int move_freepages(struct zone *zone, for (pfn = start_pfn; pfn <= end_pfn;) { page = pfn_to_page(pfn); if (!PageBuddy(page)) { - /* - * We assume that pages that could be isolated for - * migration are movable. But we don't actually try - * isolating, as that would be expensive. - */ - if (num_movable && - (PageLRU(page) || __PageMovable(page))) - (*num_movable)++; pfn++; continue; } @@ -1649,17 +1640,16 @@ static int move_freepages(struct zone *zone, return pages_moved; }
-int move_freepages_block(struct zone *zone, struct page *page, - int migratetype, int *num_movable) +static bool prep_move_freepages_block(struct zone *zone, struct page *page, + unsigned long *start_pfn, + unsigned long *end_pfn, + int *num_free, int *num_movable) { - unsigned long start_pfn, end_pfn, pfn; - - if (num_movable) - *num_movable = 0; + unsigned long pfn, start, end;
pfn = page_to_pfn(page); - start_pfn = pageblock_start_pfn(pfn); - end_pfn = pageblock_end_pfn(pfn) - 1; + start = pageblock_start_pfn(pfn); + end = pageblock_end_pfn(pfn) - 1;
/* * The caller only has the lock for @zone, don't touch ranges @@ -1668,13 +1658,50 @@ int move_freepages_block(struct zone *zone, struct page *page, * accompanied by other operations such as migratetype updates * which also should be locked. */ - if (!zone_spans_pfn(zone, start_pfn)) - return 0; - if (!zone_spans_pfn(zone, end_pfn)) - return 0; + if (!zone_spans_pfn(zone, start)) + return false; + if (!zone_spans_pfn(zone, end)) + return false; + + *start_pfn = start; + *end_pfn = end; + + if (num_free) { + *num_free = 0; + *num_movable = 0; + for (pfn = start; pfn <= end;) { + page = pfn_to_page(pfn); + if (PageBuddy(page)) { + int nr = 1 << buddy_order(page); + + *num_free += nr; + pfn += nr; + continue; + } + /* + * We assume that pages that could be isolated for + * migration are movable. But we don't actually try + * isolating, as that would be expensive. + */ + if (PageLRU(page) || __PageMovable(page)) + (*num_movable)++; + pfn++; + } + } + + return true; +} + +int move_freepages_block(struct zone *zone, struct page *page, + int migratetype) +{ + unsigned long start_pfn, end_pfn; + + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, + NULL, NULL)) + return -1;
- return move_freepages(zone, start_pfn, end_pfn, migratetype, - num_movable); + return move_freepages(zone, start_pfn, end_pfn, migratetype); }
static void change_pageblock_range(struct page *pageblock_page, @@ -1759,33 +1786,37 @@ static inline bool boost_watermark(struct zone *zone) }
/* - * This function implements actual steal behaviour. If order is large enough, - * we can steal whole pageblock. If not, we first move freepages in this - * pageblock to our migratetype and determine how many already-allocated pages - * are there in the pageblock with a compatible migratetype. If at least half - * of pages are free or compatible, we can change migratetype of the pageblock - * itself, so pages freed in the future will be put on the correct free list. + * This function implements actual steal behaviour. If order is large enough, we + * can claim the whole pageblock for the requested migratetype. If not, we check + * the pageblock for constituent pages; if at least half of the pages are free + * or compatible, we can still claim the whole block, so pages freed in the + * future will be put on the correct free list. Otherwise, we isolate exactly + * the order we need from the fallback block and leave its migratetype alone. */ -static void steal_suitable_fallback(struct zone *zone, struct page *page, - unsigned int alloc_flags, int start_type, bool whole_block) +static struct page * +steal_suitable_fallback(struct zone *zone, struct page *page, + int current_order, int order, int start_type, + unsigned int alloc_flags, bool whole_block) { - unsigned int current_order = buddy_order(page); int free_pages, movable_pages, alike_pages; - int old_block_type; + unsigned long start_pfn, end_pfn; + int block_type;
- old_block_type = get_pageblock_migratetype(page); + block_type = get_pageblock_migratetype(page);
/* * This can happen due to races and we want to prevent broken * highatomic accounting. */ - if (is_migrate_highatomic(old_block_type)) + if (is_migrate_highatomic(block_type)) goto single_page;
/* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { + del_page_from_free_list(page, zone, current_order); change_pageblock_range(page, current_order, start_type); - goto single_page; + expand(zone, page, order, current_order, start_type); + return page; }
/* @@ -1800,10 +1831,9 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, if (!whole_block) goto single_page;
- free_pages = move_freepages_block(zone, page, start_type, - &movable_pages); /* moving whole block can fail due to zone boundary conditions */ - if (!free_pages) + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, + &free_pages, &movable_pages)) goto single_page;
/* @@ -1821,7 +1851,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * vice versa, be conservative since we can't distinguish the * exact migratetype of non-movable pages. */ - if (old_block_type == MIGRATE_MOVABLE) + if (block_type == MIGRATE_MOVABLE) alike_pages = pageblock_nr_pages - (free_pages + movable_pages); else @@ -1832,13 +1862,16 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * compatible migratability as our allocation, claim the whole block. */ if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || - page_group_by_mobility_disabled) + page_group_by_mobility_disabled) { + move_freepages(zone, start_pfn, end_pfn, start_type); set_pageblock_migratetype(page, start_type); - - return; + return __rmqueue_smallest(zone, order, start_type); + }
single_page: - move_to_free_list(page, zone, current_order, start_type); + del_page_from_free_list(page, zone, current_order); + expand(zone, page, order, current_order, block_type); + return page; }
/* @@ -1906,9 +1939,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) mt = get_pageblock_migratetype(page); /* Only reserve normal pageblocks (i.e., they can merge with others) */ if (migratetype_is_mergeable(mt)) { - zone->nr_reserved_highatomic += pageblock_nr_pages; - set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); - move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL); + if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) { + set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); + zone->nr_reserved_highatomic += pageblock_nr_pages; + } }
out_unlock: @@ -1933,7 +1967,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, struct zone *zone; struct page *page; int order; - bool ret; + int ret;
for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx, ac->nodemask) { @@ -1982,10 +2016,14 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * of pageblocks that cannot be completely freed * may increase. */ + ret = move_freepages_block(zone, page, ac->migratetype); + /* + * Reserving this block already succeeded, so this should + * not fail on zone boundaries. + */ + WARN_ON_ONCE(ret == -1); set_pageblock_migratetype(page, ac->migratetype); - ret = move_freepages_block(zone, page, ac->migratetype, - NULL); - if (ret) { + if (ret > 0) { spin_unlock_irqrestore(&zone->lock, flags); return ret; } @@ -2006,7 +2044,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * deviation from the rest of this file, to make the for loop * condition simpler. */ -static __always_inline bool +static __always_inline struct page * __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) { @@ -2053,7 +2091,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, goto do_steal; }
- return false; + return NULL;
find_smallest: for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { @@ -2073,14 +2111,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, do_steal: page = get_page_from_free_area(area, fallback_mt);
- steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, - can_steal); + /* take off list, maybe claim block, expand remainder */ + page = steal_suitable_fallback(zone, page, current_order, order, + start_migratetype, alloc_flags, can_steal);
trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt);
- return true; - + return page; }
/* @@ -2107,15 +2145,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, return page; } } -retry: + page = __rmqueue_smallest(zone, order, migratetype); if (unlikely(!page)) { if (alloc_flags & ALLOC_CMA) page = __rmqueue_cma_fallback(zone, order);
- if (!page && __rmqueue_fallback(zone, order, migratetype, - alloc_flags)) - goto retry; + if (!page) + page = __rmqueue_fallback(zone, order, migratetype, + alloc_flags); } return page; } @@ -2687,12 +2725,10 @@ int __isolate_free_page(struct page *page, unsigned int order) * Only change normal pageblocks (i.e., they can merge * with others) */ - if (migratetype_is_mergeable(mt)) { - set_pageblock_migratetype(page, - MIGRATE_MOVABLE); - move_freepages_block(zone, page, - MIGRATE_MOVABLE, NULL); - } + if (migratetype_is_mergeable(mt) && + move_freepages_block(zone, page, + MIGRATE_MOVABLE) != -1) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); } }
diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 03381be87b28..b27ed476f80e 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -179,15 +179,18 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end, migratetype, isol_flags); if (!unmovable) { - unsigned long nr_pages; + int nr_pages; int mt = get_pageblock_migratetype(page);
+ nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE); + /* Block spans zone boundaries? */ + if (nr_pages == -1) { + spin_unlock_irqrestore(&zone->lock, flags); + return -EBUSY; + } + __mod_zone_freepage_state(zone, -nr_pages, mt); set_pageblock_migratetype(page, MIGRATE_ISOLATE); zone->nr_isolate_pageblock++; - nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE, - NULL); - - __mod_zone_freepage_state(zone, -nr_pages, mt); spin_unlock_irqrestore(&zone->lock, flags); return 0; } @@ -207,7 +210,7 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ static void unset_migratetype_isolate(struct page *page, int migratetype) { struct zone *zone; - unsigned long flags, nr_pages; + unsigned long flags; bool isolated_page = false; unsigned int order; struct page *buddy; @@ -253,7 +256,12 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) * allocation. */ if (!isolated_page) { - nr_pages = move_freepages_block(zone, page, migratetype, NULL); + int nr_pages = move_freepages_block(zone, page, migratetype); + /* + * Isolating this block already succeeded, so this + * should not fail on zone boundaries. + */ + WARN_ON_ONCE(nr_pages == -1); __mod_zone_freepage_state(zone, nr_pages, migratetype); } set_pageblock_migratetype(page, migratetype);
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit 55612e80e722ac554cc5e80df05555b4f8d40c37 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
There are three freeing paths that read the page's migratetype optimistically before grabbing the zone lock. When this races with block stealing, those pages go on the wrong freelist.
The paths in question are: - when freeing >costly orders that aren't THP - when freeing pages to the buddy upon pcp lock contention - when freeing pages that are isolated - when freeing pages initially during boot - when freeing the remainder in alloc_pages_exact() - when "accepting" unaccepted VM host memory before first use - when freeing pages during unpoisoning
None of these are so hot that they would need this optimization at the cost of hampering defrag efforts. Especially when contrasted with the fact that the most common buddy freeing path - free_pcppages_bulk - is checking the migratetype under the zone->lock just fine.
In addition, isolated pages need to look up the migratetype under the lock anyway, which adds branches to the locked section, and results in a double lookup when the pages are in fact isolated.
Move the lookups into the lock.
Link: https://lkml.kernel.org/r/20240320180429.678181-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reported-by: Vlastimil Babka vbabka@suse.cz Reviewed-by: Vlastimil Babka vbabka@suse.cz Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/page_alloc.c [ Context conflict with commit 2ae116c3257d. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 52 ++++++++++++++++++------------------------------- 1 file changed, 19 insertions(+), 33 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ba85db6cf987..cae51fd0b7b2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1222,18 +1222,15 @@ static void free_pcppages_bulk(struct zone *zone, int count, spin_unlock_irqrestore(&zone->lock, flags); }
-static void free_one_page(struct zone *zone, - struct page *page, unsigned long pfn, - unsigned int order, - int migratetype, fpi_t fpi_flags) +static void free_one_page(struct zone *zone, struct page *page, + unsigned long pfn, unsigned int order, + fpi_t fpi_flags) { unsigned long flags; + int migratetype;
spin_lock_irqsave(&zone->lock, flags); - if (unlikely(has_isolate_pageblock(zone) || - is_migrate_isolate(migratetype))) { - migratetype = get_pfnblock_migratetype(page, pfn); - } + migratetype = get_pfnblock_migratetype(page, pfn); __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); spin_unlock_irqrestore(&zone->lock, flags); } @@ -1241,21 +1238,13 @@ static void free_one_page(struct zone *zone, static void __free_pages_ok(struct page *page, unsigned int order, fpi_t fpi_flags) { - int migratetype; unsigned long pfn = page_to_pfn(page); struct zone *zone = page_zone(page);
if (!free_pages_prepare(page, order)) return;
- /* - * Calling get_pfnblock_migratetype() without spin_lock_irqsave() here - * is used to avoid calling get_pfnblock_migratetype() under the lock. - * This will reduce the lock holding time. - */ - migratetype = get_pfnblock_migratetype(page, pfn); - - free_one_page(zone, page, pfn, order, migratetype, fpi_flags); + free_one_page(zone, page, pfn, order, fpi_flags);
__count_vm_events(PGFREE, 1 << order); } @@ -2518,7 +2507,7 @@ void free_unref_page(struct page *page, unsigned int order) struct per_cpu_pages *pcp; struct zone *zone; unsigned long pfn = page_to_pfn(page); - int migratetype, pcpmigratetype; + int migratetype;
if (page_from_dynamic_pool(page)) { dynamic_pool_free_page(page); @@ -2540,23 +2529,23 @@ void free_unref_page(struct page *page, unsigned int order) * get those areas back if necessary. Otherwise, we may have to free * excessively into the page allocator */ - migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn); + migratetype = get_pfnblock_migratetype(page, pfn); if (unlikely(migratetype >= MIGRATE_PCPTYPES)) { if (unlikely(is_migrate_isolate(migratetype))) { - free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE); + free_one_page(page_zone(page), page, pfn, order, FPI_NONE); return; } - pcpmigratetype = MIGRATE_MOVABLE; + migratetype = MIGRATE_MOVABLE; }
zone = page_zone(page); pcp_trylock_prepare(UP_flags); pcp = pcp_spin_trylock(zone->per_cpu_pageset); if (pcp) { - free_unref_page_commit(zone, pcp, page, pcpmigratetype, order); + free_unref_page_commit(zone, pcp, page, migratetype, order); pcp_spin_unlock(pcp); } else { - free_one_page(zone, page, pfn, order, migratetype, FPI_NONE); + free_one_page(zone, page, pfn, order, FPI_NONE); } pcp_trylock_finish(UP_flags); } @@ -2590,12 +2579,8 @@ void free_unref_folios(struct folio_batch *folios) * allocator. */ if (!pcp_allowed_order(order)) { - int migratetype; - - migratetype = get_pfnblock_migratetype(&folio->page, - pfn); - free_one_page(folio_zone(folio), &folio->page, pfn, - order, migratetype, FPI_NONE); + free_one_page(folio_zone(folio), &folio->page, + pfn, order, FPI_NONE); continue; } folio->private = (void *)(unsigned long)order; @@ -2631,7 +2616,7 @@ void free_unref_folios(struct folio_batch *folios) */ if (is_migrate_isolate(migratetype)) { free_one_page(zone, &folio->page, pfn, - order, migratetype, FPI_NONE); + order, FPI_NONE); continue; }
@@ -2644,7 +2629,7 @@ void free_unref_folios(struct folio_batch *folios) if (unlikely(!pcp)) { pcp_trylock_finish(UP_flags); free_one_page(zone, &folio->page, pfn, - order, migratetype, FPI_NONE); + order, FPI_NONE); continue; } locked_zone = zone; @@ -7022,13 +7007,14 @@ bool take_page_off_buddy(struct page *page) bool put_page_back_buddy(struct page *page) { struct zone *zone = page_zone(page); - unsigned long pfn = page_to_pfn(page); unsigned long flags; - int migratetype = get_pfnblock_migratetype(page, pfn); bool ret = false;
spin_lock_irqsave(&zone->lock, flags); if (put_page_testzero(page)) { + unsigned long pfn = page_to_pfn(page); + int migratetype = get_pfnblock_migratetype(page, pfn); + ClearPageHWPoisonTakenOff(page); __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE); if (TestClearPageHWPoison(page)) {
From: Zi Yan ziy@nvidia.com
mainline inclusion from mainline-v6.10-rc1 commit f37c0f6876a8eabe1477c87860460bc181f6cdbb category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This avoids changing migratetype after move_freepages() or move_freepages_block(), which is error prone. It also prepares for upcoming changes to fix move_freepages() not moving free pages partially in the range.
Link: https://lkml.kernel.org/r/20240320180429.678181-9-hannes@cmpxchg.org Signed-off-by: Zi Yan ziy@nvidia.com Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Vlastimil Babka vbabka@suse.cz Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Mel Gorman mgorman@techsingularity.net Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 27 +++++++++++++-------------- mm/page_isolation.c | 7 +++---- 2 files changed, 16 insertions(+), 18 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cae51fd0b7b2..6f59b8e73daa 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1597,9 +1597,8 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, #endif
/* - * Move the free pages in a range to the freelist tail of the requested type. - * Note that start_page and end_pages are not aligned on a pageblock - * boundary. If alignment is required, use move_freepages_block() + * Change the type of a block and move all its free pages to that + * type's freelist. */ static int move_freepages(struct zone *zone, unsigned long start_pfn, unsigned long end_pfn, int migratetype) @@ -1609,6 +1608,9 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, unsigned int order; int pages_moved = 0;
+ VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1)); + VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn); + for (pfn = start_pfn; pfn <= end_pfn;) { page = pfn_to_page(pfn); if (!PageBuddy(page)) { @@ -1626,6 +1628,8 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, pages_moved += 1 << order; }
+ set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype); + return pages_moved; }
@@ -1853,7 +1857,6 @@ steal_suitable_fallback(struct zone *zone, struct page *page, if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { move_freepages(zone, start_pfn, end_pfn, start_type); - set_pageblock_migratetype(page, start_type); return __rmqueue_smallest(zone, order, start_type); }
@@ -1927,12 +1930,10 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) /* Yoink! */ mt = get_pageblock_migratetype(page); /* Only reserve normal pageblocks (i.e., they can merge with others) */ - if (migratetype_is_mergeable(mt)) { - if (move_freepages_block(zone, page, MIGRATE_HIGHATOMIC) != -1) { - set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); + if (migratetype_is_mergeable(mt)) + if (move_freepages_block(zone, page, + MIGRATE_HIGHATOMIC) != -1) zone->nr_reserved_highatomic += pageblock_nr_pages; - } - }
out_unlock: spin_unlock_irqrestore(&zone->lock, flags); @@ -2011,7 +2012,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * not fail on zone boundaries. */ WARN_ON_ONCE(ret == -1); - set_pageblock_migratetype(page, ac->migratetype); if (ret > 0) { spin_unlock_irqrestore(&zone->lock, flags); return ret; @@ -2710,10 +2710,9 @@ int __isolate_free_page(struct page *page, unsigned int order) * Only change normal pageblocks (i.e., they can merge * with others) */ - if (migratetype_is_mergeable(mt) && - move_freepages_block(zone, page, - MIGRATE_MOVABLE) != -1) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); + if (migratetype_is_mergeable(mt)) + move_freepages_block(zone, page, + MIGRATE_MOVABLE); } }
diff --git a/mm/page_isolation.c b/mm/page_isolation.c index b27ed476f80e..8fc4f9491417 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -189,7 +189,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ return -EBUSY; } __mod_zone_freepage_state(zone, -nr_pages, mt); - set_pageblock_migratetype(page, MIGRATE_ISOLATE); zone->nr_isolate_pageblock++; spin_unlock_irqrestore(&zone->lock, flags); return 0; @@ -263,10 +262,10 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) */ WARN_ON_ONCE(nr_pages == -1); __mod_zone_freepage_state(zone, nr_pages, migratetype); - } - set_pageblock_migratetype(page, migratetype); - if (isolated_page) + } else { + set_pageblock_migratetype(page, migratetype); __putback_isolated_page(page, order, migratetype); + } zone->nr_isolate_pageblock--; out: spin_unlock_irqrestore(&zone->lock, flags);
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit fd919a85cd55be5d00a6a7372071f44c8eafb825 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Page isolation currently sets MIGRATE_ISOLATE on a block, then drops zone->lock and scans the block for straddling buddies to split up. Because this happens non-atomically wrt the page allocator, it's possible for allocations to get a buddy whose first block is a regular pcp migratetype but whose tail is isolated. This means that in certain cases memory can still be allocated after isolation. It will also trigger the freelist type hygiene warnings in subsequent patches.
start_isolate_page_range() isolate_single_pageblock() set_migratetype_isolate(tail) lock zone->lock move_freepages_block(tail) // nop set_pageblock_migratetype(tail) unlock zone->lock __rmqueue_smallest() del_page_from_freelist(head) expand(head, head_mt) WARN(head_mt != tail_mt) start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES) for (pfn = start_pfn, pfn < end_pfn) if (PageBuddy()) split_free_page(head)
Introduce a variant of move_freepages_block() provided by the allocator specifically for page isolation; it moves free pages, converts the block, and handles the splitting of straddling buddies while holding zone->lock.
The allocator knows that pageblocks and buddies are always naturally aligned, which means that buddies can only straddle blocks if they're actually >pageblock_order. This means the search-and-split part can be simplified compared to what page isolation used to do.
Also tighten up the page isolation code around the expectations of which pages can be large, and how they are freed.
Based on extensive discussions with and invaluable input from Zi Yan.
[hannes@cmpxchg.org: work around older gcc warning] Link: https://lkml.kernel.org/r/20240321142426.GB777580@cmpxchg.org Link: https://lkml.kernel.org/r/20240320180429.678181-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reviewed-by: Vlastimil Babka vbabka@suse.cz Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/internal.h mm/page_alloc.c mm/page_isolation.c [ Context conflict due to miss MAX_PAGE_ORDER. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/page-isolation.h | 4 +- mm/internal.h | 4 - mm/page_alloc.c | 204 +++++++++++++++++++-------------- mm/page_isolation.c | 106 ++++++----------- 4 files changed, 155 insertions(+), 163 deletions(-)
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 8550b3c91480..c16db0067090 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -34,7 +34,9 @@ static inline bool is_migrate_isolate(int migratetype) #define REPORT_FAILURE 0x2
void set_pageblock_migratetype(struct page *page, int migratetype); -int move_freepages_block(struct zone *zone, struct page *page, int migratetype); + +bool move_freepages_block_isolate(struct zone *zone, struct page *page, + int migratetype);
int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, int migratetype, int flags, gfp_t gfp_flags); diff --git a/mm/internal.h b/mm/internal.h index 0478e5dab55b..de564608dfa6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -693,10 +693,6 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align, void memmap_init_range(unsigned long, int, unsigned long, unsigned long, unsigned long, enum meminit_context, struct vmem_altmap *, int);
- -int split_free_page(struct page *free_page, - unsigned int order, unsigned long split_pfn_offset); - #if defined CONFIG_COMPACTION || defined CONFIG_CMA
#define MAX_PAGE_ORDER MAX_ORDER diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6f59b8e73daa..3bc1502e42cf 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -826,64 +826,6 @@ static inline void __free_one_page(struct page *page, page_reporting_notify_free(order); }
-/** - * split_free_page() -- split a free page at split_pfn_offset - * @free_page: the original free page - * @order: the order of the page - * @split_pfn_offset: split offset within the page - * - * Return -ENOENT if the free page is changed, otherwise 0 - * - * It is used when the free page crosses two pageblocks with different migratetypes - * at split_pfn_offset within the page. The split free page will be put into - * separate migratetype lists afterwards. Otherwise, the function achieves - * nothing. - */ -int split_free_page(struct page *free_page, - unsigned int order, unsigned long split_pfn_offset) -{ - struct zone *zone = page_zone(free_page); - unsigned long free_page_pfn = page_to_pfn(free_page); - unsigned long pfn; - unsigned long flags; - int free_page_order; - int mt; - int ret = 0; - - if (split_pfn_offset == 0) - return ret; - - spin_lock_irqsave(&zone->lock, flags); - - if (!PageBuddy(free_page) || buddy_order(free_page) != order) { - ret = -ENOENT; - goto out; - } - - mt = get_pfnblock_migratetype(free_page, free_page_pfn); - if (likely(!is_migrate_isolate(mt))) - __mod_zone_freepage_state(zone, -(1UL << order), mt); - - del_page_from_free_list(free_page, zone, order); - for (pfn = free_page_pfn; - pfn < free_page_pfn + (1UL << order);) { - int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn); - - free_page_order = min_t(unsigned int, - pfn ? __ffs(pfn) : order, - __fls(split_pfn_offset)); - __free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order, - mt, FPI_NONE); - pfn += 1UL << free_page_order; - split_pfn_offset -= (1UL << free_page_order); - /* we have done the first part, now switch to second part */ - if (split_pfn_offset == 0) - split_pfn_offset = (1UL << order) - (pfn - free_page_pfn); - } -out: - spin_unlock_irqrestore(&zone->lock, flags); - return ret; -} /* * A bad page could be due to a number of fields. Instead of multiple branches, * try and check multiple fields with one check. The caller must do a detailed @@ -1685,8 +1627,8 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, return true; }
-int move_freepages_block(struct zone *zone, struct page *page, - int migratetype) +static int move_freepages_block(struct zone *zone, struct page *page, + int migratetype) { unsigned long start_pfn, end_pfn;
@@ -1697,6 +1639,123 @@ int move_freepages_block(struct zone *zone, struct page *page, return move_freepages(zone, start_pfn, end_pfn, migratetype); }
+#ifdef CONFIG_MEMORY_ISOLATION +/* Look for a buddy that straddles start_pfn */ +static unsigned long find_large_buddy(unsigned long start_pfn) +{ + int order = 0; + struct page *page; + unsigned long pfn = start_pfn; + + while (!PageBuddy(page = pfn_to_page(pfn))) { + /* Nothing found */ + if (++order > MAX_PAGE_ORDER) + return start_pfn; + pfn &= ~0UL << order; + } + + /* + * Found a preceding buddy, but does it straddle? + */ + if (pfn + (1 << buddy_order(page)) > start_pfn) + return pfn; + + /* Nothing found */ + return start_pfn; +} + +/* Split a multi-block free page into its individual pageblocks */ +static void split_large_buddy(struct zone *zone, struct page *page, + unsigned long pfn, int order) +{ + unsigned long end_pfn = pfn + (1 << order); + + VM_WARN_ON_ONCE(order <= pageblock_order); + VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1)); + + /* Caller removed page from freelist, buddy info cleared! */ + VM_WARN_ON_ONCE(PageBuddy(page)); + + while (pfn != end_pfn) { + int mt = get_pfnblock_migratetype(page, pfn); + + __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE); + pfn += pageblock_nr_pages; + page = pfn_to_page(pfn); + } +} + +/** + * move_freepages_block_isolate - move free pages in block for page isolation + * @zone: the zone + * @page: the pageblock page + * @migratetype: migratetype to set on the pageblock + * + * This is similar to move_freepages_block(), but handles the special + * case encountered in page isolation, where the block of interest + * might be part of a larger buddy spanning multiple pageblocks. + * + * Unlike the regular page allocator path, which moves pages while + * stealing buddies off the freelist, page isolation is interested in + * arbitrary pfn ranges that may have overlapping buddies on both ends. + * + * This function handles that. Straddling buddies are split into + * individual pageblocks. Only the block of interest is moved. + * + * Returns %true if pages could be moved, %false otherwise. + */ +bool move_freepages_block_isolate(struct zone *zone, struct page *page, + int migratetype) +{ + unsigned long start_pfn, end_pfn, pfn; + int nr_moved, mt; + + if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, + NULL, NULL)) + return false; + + /* No splits needed if buddies can't span multiple blocks */ + if (pageblock_order == MAX_PAGE_ORDER) + goto move; + + /* We're a tail block in a larger buddy */ + pfn = find_large_buddy(start_pfn); + if (pfn != start_pfn) { + struct page *buddy = pfn_to_page(pfn); + int order = buddy_order(buddy); + int mt = get_pfnblock_migratetype(buddy, pfn); + + if (!is_migrate_isolate(mt)) + __mod_zone_freepage_state(zone, -(1UL << order), mt); + del_page_from_free_list(buddy, zone, order); + set_pageblock_migratetype(page, migratetype); + split_large_buddy(zone, buddy, pfn, order); + return true; + } + + /* We're the starting block of a larger buddy */ + if (PageBuddy(page) && buddy_order(page) > pageblock_order) { + int mt = get_pfnblock_migratetype(page, pfn); + int order = buddy_order(page); + + if (!is_migrate_isolate(mt)) + __mod_zone_freepage_state(zone, -(1UL << order), mt); + del_page_from_free_list(page, zone, order); + set_pageblock_migratetype(page, migratetype); + split_large_buddy(zone, page, pfn, order); + return true; + } +move: + mt = get_pfnblock_migratetype(page, start_pfn); + nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype); + if (!is_migrate_isolate(mt)) + __mod_zone_freepage_state(zone, -nr_moved, mt); + else if (!is_migrate_isolate(migratetype)) + __mod_zone_freepage_state(zone, nr_moved, migratetype); + return true; +} +#endif /* CONFIG_MEMORY_ISOLATION */ + static void change_pageblock_range(struct page *pageblock_page, int start_order, int migratetype) { @@ -6575,7 +6634,6 @@ int alloc_contig_range(unsigned long start, unsigned long end, unsigned migratetype, gfp_t gfp_mask) { unsigned long outer_start, outer_end; - int order; int ret = 0;
struct compact_control cc = { @@ -6648,29 +6706,7 @@ int alloc_contig_range(unsigned long start, unsigned long end, * We don't have to hold zone->lock here because the pages are * isolated thus they won't get removed from buddy. */ - - order = 0; - outer_start = start; - while (!PageBuddy(pfn_to_page(outer_start))) { - if (++order > MAX_ORDER) { - outer_start = start; - break; - } - outer_start &= ~0UL << order; - } - - if (outer_start != start) { - order = buddy_order(pfn_to_page(outer_start)); - - /* - * outer_start page could be small order buddy page and - * it doesn't include start page. Adjust outer_start - * in this case to report failed page properly - * on tracepoint in test_pages_isolated() - */ - if (outer_start + (1UL << order) <= start) - outer_start = start; - } + outer_start = find_large_buddy(start);
/* Make sure the range is really isolated. */ if (test_pages_isolated(outer_start, end, 0)) { diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 8fc4f9491417..b3aae89ed226 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -179,16 +179,10 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_ unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end, migratetype, isol_flags); if (!unmovable) { - int nr_pages; - int mt = get_pageblock_migratetype(page); - - nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE); - /* Block spans zone boundaries? */ - if (nr_pages == -1) { + if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) { spin_unlock_irqrestore(&zone->lock, flags); return -EBUSY; } - __mod_zone_freepage_state(zone, -nr_pages, mt); zone->nr_isolate_pageblock++; spin_unlock_irqrestore(&zone->lock, flags); return 0; @@ -255,13 +249,11 @@ static void unset_migratetype_isolate(struct page *page, int migratetype) * allocation. */ if (!isolated_page) { - int nr_pages = move_freepages_block(zone, page, migratetype); /* * Isolating this block already succeeded, so this * should not fail on zone boundaries. */ - WARN_ON_ONCE(nr_pages == -1); - __mod_zone_freepage_state(zone, nr_pages, migratetype); + WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype)); } else { set_pageblock_migratetype(page, migratetype); __putback_isolated_page(page, order, migratetype); @@ -377,26 +369,29 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
VM_BUG_ON(!page); pfn = page_to_pfn(page); - /* - * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any - * free pages in [start_pfn, boundary_pfn), its head page will - * always be in the range. - */ + if (PageBuddy(page)) { int order = buddy_order(page);
- if (pfn + (1UL << order) > boundary_pfn) { - /* free page changed before split, check it again */ - if (split_free_page(page, order, boundary_pfn - pfn)) - continue; - } + /* move_freepages_block_isolate() handled this */ + VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn);
pfn += 1UL << order; continue; } + /* - * migrate compound pages then let the free page handling code - * above do the rest. If migration is not possible, just fail. + * If a compound page is straddling our block, attempt + * to migrate it out of the way. + * + * We don't have to worry about this creating a large + * free page that straddles into our block: gigantic + * pages are freed as order-0 chunks, and LRU pages + * (currently) do not exceed pageblock_order. + * + * The block of interest has already been marked + * MIGRATE_ISOLATE above, so when migration is done it + * will free its pages onto the correct freelists. */ if (PageCompound(page)) { struct page *head = compound_head(page); @@ -407,16 +402,10 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, pfn = head_pfn + nr_pages; continue; } + #if defined CONFIG_COMPACTION || defined CONFIG_CMA - /* - * hugetlb, lru compound (THP), and movable compound pages - * can be migrated. Otherwise, fail the isolation. - */ - if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) { - int order; - unsigned long outer_pfn; + if (PageHuge(page)) { int page_mt = get_pageblock_migratetype(page); - bool isolate_page = !is_migrate_isolate_page(page); struct compact_control cc = { .nr_migratepages = 0, .order = -1, @@ -429,56 +418,25 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, }; INIT_LIST_HEAD(&cc.migratepages);
- /* - * XXX: mark the page as MIGRATE_ISOLATE so that - * no one else can grab the freed page after migration. - * Ideally, the page should be freed as two separate - * pages to be added into separate migratetype free - * lists. - */ - if (isolate_page) { - ret = set_migratetype_isolate(page, page_mt, - flags, head_pfn, head_pfn + nr_pages); - if (ret) - goto failed; - } - ret = __alloc_contig_migrate_range(&cc, head_pfn, head_pfn + nr_pages, page_mt); - - /* - * restore the page's migratetype so that it can - * be split into separate migratetype free lists - * later. - */ - if (isolate_page) - unset_migratetype_isolate(page, page_mt); - if (ret) goto failed; - /* - * reset pfn to the head of the free page, so - * that the free page handling code above can split - * the free page to the right migratetype list. - * - * head_pfn is not used here as a hugetlb page order - * can be bigger than MAX_ORDER, but after it is - * freed, the free page order is not. Use pfn within - * the range to find the head of the free page. - */ - order = 0; - outer_pfn = pfn; - while (!PageBuddy(pfn_to_page(outer_pfn))) { - /* stop if we cannot find the free page */ - if (++order > MAX_ORDER) - goto failed; - outer_pfn &= ~0UL << order; - } - pfn = outer_pfn; + pfn = head_pfn + nr_pages; continue; - } else + } + + /* + * These pages are movable too, but they're + * not expected to exceed pageblock_order. + * + * Let us know when they do, so we can add + * proper free and split handling for them. + */ + VM_WARN_ON_ONCE_PAGE(PageLRU(page), page); + VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page); #endif - goto failed; + goto failed; }
pfn++;
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit e0932b6c1f942fa747258e152cdce0d0b2b5be5c category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Free page accounting currently happens a bit too high up the call stack, where it has to deal with guard pages, compaction capturing, block stealing and even page isolation. This is subtle and fragile, and makes it difficult to hack on the code.
Now that type violations on the freelists have been fixed, push the accounting down to where pages enter and leave the freelist.
[hannes@cmpxchg.org: undo unrelated drive-by line wrap] Link: https://lkml.kernel.org/r/20240327185736.GA7597@cmpxchg.org [hannes@cmpxchg.org: remove unused page parameter from account_freepages()] Link: https://lkml.kernel.org/r/20240327185831.GB7597@cmpxchg.org [baolin.wang@linux.alibaba.com: fix free page accounting] Link: https://lkml.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.171264844... [andriy.shevchenko@linux.intel.com: avoid defining unused function] Link: https://lkml.kernel.org/r/20240423161506.2637177-1-andriy.shevchenko@linux.i... Link: https://lkml.kernel.org/r/20240320180429.678181-11-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andy Shevchenko andriy.shevchenko@linux.intel.com Signed-off-by: Baolin Wang baolin.wang@linux.alibaba.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Tested-by: Baolin Wang baolin.wang@linux.alibaba.com Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Zi Yan ziy@nvidia.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Conflicts: mm/page_alloc.c [ Context conflicts due to miss MAX_PAGE_ORDER. ] Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/mm.h | 18 ++-- include/linux/vmstat.h | 8 -- mm/debug_page_alloc.c | 12 +-- mm/internal.h | 5 -- mm/page_alloc.c | 192 +++++++++++++++++++++++------------------ 5 files changed, 118 insertions(+), 117 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2e6ef9532fc3..b6dcdaafc592 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3819,24 +3819,22 @@ static inline bool page_is_guard(struct page *page) return PageGuard(page); }
-bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype); +bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order); static inline bool set_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) + unsigned int order) { if (!debug_guardpage_enabled()) return false; - return __set_page_guard(zone, page, order, migratetype); + return __set_page_guard(zone, page, order); }
-void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype); +void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order); static inline void clear_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) + unsigned int order) { if (!debug_guardpage_enabled()) return; - __clear_page_guard(zone, page, order, migratetype); + __clear_page_guard(zone, page, order); }
#else /* CONFIG_DEBUG_PAGEALLOC */ @@ -3846,9 +3844,9 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; } static inline bool debug_guardpage_enabled(void) { return false; } static inline bool page_is_guard(struct page *page) { return false; } static inline bool set_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) { return false; } + unsigned int order) { return false; } static inline void clear_page_guard(struct zone *zone, struct page *page, - unsigned int order, int migratetype) {} + unsigned int order) {} #endif /* CONFIG_DEBUG_PAGEALLOC */
#ifdef __HAVE_ARCH_GATE_AREA diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 343906a98d6e..735eae6e272c 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -487,14 +487,6 @@ static inline void node_stat_sub_folio(struct folio *folio, mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio)); }
-static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages, - int migratetype) -{ - __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages); - if (is_migrate_cma(migratetype)) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages); -} - extern const char * const vmstat_text[];
static inline const char *zone_stat_name(enum zone_stat_item item) diff --git a/mm/debug_page_alloc.c b/mm/debug_page_alloc.c index f9d145730fd1..03a810927d0a 100644 --- a/mm/debug_page_alloc.c +++ b/mm/debug_page_alloc.c @@ -32,8 +32,7 @@ static int __init debug_guardpage_minorder_setup(char *buf) } early_param("debug_guardpage_minorder", debug_guardpage_minorder_setup);
-bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype) +bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order) { if (order >= debug_guardpage_minorder()) return false; @@ -41,19 +40,12 @@ bool __set_page_guard(struct zone *zone, struct page *page, unsigned int order, __SetPageGuard(page); INIT_LIST_HEAD(&page->buddy_list); set_page_private(page, order); - /* Guard pages are not available for any usage */ - if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, -(1 << order), migratetype);
return true; }
-void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order, - int migratetype) +void __clear_page_guard(struct zone *zone, struct page *page, unsigned int order) { __ClearPageGuard(page); - set_page_private(page, 0); - if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, (1 << order), migratetype); } diff --git a/mm/internal.h b/mm/internal.h index de564608dfa6..8742aafde387 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1171,11 +1171,6 @@ static inline bool is_migrate_highatomic(enum migratetype migratetype) return migratetype == MIGRATE_HIGHATOMIC; }
-static inline bool is_migrate_highatomic_page(struct page *page) -{ - return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC; -} - void setup_zone_pageset(struct zone *zone);
struct migration_target_control { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3bc1502e42cf..d662bbdf2e91 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -636,23 +636,33 @@ compaction_capture(struct capture_control *capc, struct page *page, } #endif /* CONFIG_COMPACTION */
-/* Used for pages not on another list */ -static inline void add_to_free_list(struct page *page, struct zone *zone, - unsigned int order, int migratetype) +static inline void account_freepages(struct zone *zone, int nr_pages, + int migratetype) { - struct free_area *area = &zone->free_area[order]; + if (is_migrate_isolate(migratetype)) + return;
- list_add(&page->buddy_list, &area->free_list[migratetype]); - area->nr_free++; + __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages); + + if (is_migrate_cma(migratetype)) + __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages); }
/* Used for pages not on another list */ -static inline void add_to_free_list_tail(struct page *page, struct zone *zone, - unsigned int order, int migratetype) +static inline void __add_to_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype, + bool tail) { struct free_area *area = &zone->free_area[order];
- list_add_tail(&page->buddy_list, &area->free_list[migratetype]); + VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype, + "page type is %lu, passed migratetype is %d (nr=%d)\n", + get_pageblock_migratetype(page), migratetype, 1 << order); + + if (tail) + list_add_tail(&page->buddy_list, &area->free_list[migratetype]); + else + list_add(&page->buddy_list, &area->free_list[migratetype]); area->nr_free++; }
@@ -662,16 +672,28 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone, * allocation again (e.g., optimization for memory onlining). */ static inline void move_to_free_list(struct page *page, struct zone *zone, - unsigned int order, int migratetype) + unsigned int order, int old_mt, int new_mt) { struct free_area *area = &zone->free_area[order];
- list_move_tail(&page->buddy_list, &area->free_list[migratetype]); + /* Free page moving can fail, so it happens before the type update */ + VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt, + "page type is %lu, passed migratetype is %d (nr=%d)\n", + get_pageblock_migratetype(page), old_mt, 1 << order); + + list_move_tail(&page->buddy_list, &area->free_list[new_mt]); + + account_freepages(zone, -(1 << order), old_mt); + account_freepages(zone, 1 << order, new_mt); }
-static inline void del_page_from_free_list(struct page *page, struct zone *zone, - unsigned int order) +static inline void __del_page_from_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype) { + VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype, + "page type is %lu, passed migratetype is %d (nr=%d)\n", + get_pageblock_migratetype(page), migratetype, 1 << order); + /* clear reported state and update reported page count */ if (page_reported(page)) __ClearPageReported(page); @@ -682,6 +704,13 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone, zone->free_area[order].nr_free--; }
+static inline void del_page_from_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype) +{ + __del_page_from_free_list(page, zone, order, migratetype); + account_freepages(zone, -(1 << order), migratetype); +} + static inline struct page *get_page_from_free_area(struct free_area *area, int migratetype) { @@ -753,16 +782,16 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
VM_BUG_ON(migratetype == -1); - if (likely(!is_migrate_isolate(migratetype))) - __mod_zone_freepage_state(zone, 1 << order, migratetype); - VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page);
+ account_freepages(zone, 1 << order, migratetype); + while (order < MAX_ORDER) { + int buddy_mt = migratetype; + if (compaction_capture(capc, page, order, migratetype)) { - __mod_zone_freepage_state(zone, -(1 << order), - migratetype); + account_freepages(zone, -(1 << order), migratetype); return; }
@@ -777,19 +806,12 @@ static inline void __free_one_page(struct page *page, * pageblock isolation could cause incorrect freepage or CMA * accounting or HIGHATOMIC accounting. */ - int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn); + buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
- if (migratetype != buddy_mt) { - if (!migratetype_is_mergeable(migratetype) || - !migratetype_is_mergeable(buddy_mt)) - goto done_merging; - /* - * Match buddy type. This ensures that - * an expand() down the line puts the - * sub-blocks on the right freelists. - */ - set_pageblock_migratetype(buddy, migratetype); - } + if (migratetype != buddy_mt && + (!migratetype_is_mergeable(migratetype) || + !migratetype_is_mergeable(buddy_mt))) + goto done_merging; }
/* @@ -797,9 +819,19 @@ static inline void __free_one_page(struct page *page, * merge with it and move up one order. */ if (page_is_guard(buddy)) - clear_page_guard(zone, buddy, order, migratetype); + clear_page_guard(zone, buddy, order); else - del_page_from_free_list(buddy, zone, order); + __del_page_from_free_list(buddy, zone, order, buddy_mt); + + if (unlikely(buddy_mt != migratetype)) { + /* + * Match buddy type. This ensures that an + * expand() down the line puts the sub-blocks + * on the right freelists. + */ + set_pageblock_migratetype(buddy, migratetype); + } + combined_pfn = buddy_pfn & pfn; page = page + (combined_pfn - pfn); pfn = combined_pfn; @@ -816,10 +848,7 @@ static inline void __free_one_page(struct page *page, else to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
- if (to_tail) - add_to_free_list_tail(page, zone, order, migratetype); - else - add_to_free_list(page, zone, order, migratetype); + __add_to_free_list(page, zone, order, migratetype, to_tail);
/* Notify page reporting subsystem of freed page */ if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY)) @@ -1309,10 +1338,10 @@ static inline void expand(struct zone *zone, struct page *page, * Corresponding page table entries will not be touched, * pages will stay not present in virtual address space */ - if (set_page_guard(zone, &page[size], high, migratetype)) + if (set_page_guard(zone, &page[size], high)) continue;
- add_to_free_list(&page[size], zone, high, migratetype); + add_to_free_list(&page[size], zone, high, migratetype, false); set_buddy_order(&page[size], high); } } @@ -1503,7 +1532,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_list(page, zone, current_order); + del_page_from_free_list(page, zone, current_order, migratetype); expand(zone, page, order, current_order, migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && @@ -1543,7 +1572,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, * type's freelist. */ static int move_freepages(struct zone *zone, unsigned long start_pfn, - unsigned long end_pfn, int migratetype) + unsigned long end_pfn, int old_mt, int new_mt) { struct page *page; unsigned long pfn; @@ -1565,12 +1594,14 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, VM_BUG_ON_PAGE(page_zone(page) != zone, page);
order = buddy_order(page); - move_to_free_list(page, zone, order, migratetype); + + move_to_free_list(page, zone, order, old_mt, new_mt); + pfn += 1 << order; pages_moved += 1 << order; }
- set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype); + set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
return pages_moved; } @@ -1628,7 +1659,7 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, }
static int move_freepages_block(struct zone *zone, struct page *page, - int migratetype) + int old_mt, int new_mt) { unsigned long start_pfn, end_pfn;
@@ -1636,7 +1667,7 @@ static int move_freepages_block(struct zone *zone, struct page *page, NULL, NULL)) return -1;
- return move_freepages(zone, start_pfn, end_pfn, migratetype); + return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt); }
#ifdef CONFIG_MEMORY_ISOLATION @@ -1708,7 +1739,6 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, int migratetype) { unsigned long start_pfn, end_pfn, pfn; - int nr_moved, mt;
if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL)) @@ -1723,11 +1753,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, if (pfn != start_pfn) { struct page *buddy = pfn_to_page(pfn); int order = buddy_order(buddy); - int mt = get_pfnblock_migratetype(buddy, pfn);
- if (!is_migrate_isolate(mt)) - __mod_zone_freepage_state(zone, -(1UL << order), mt); - del_page_from_free_list(buddy, zone, order); + del_page_from_free_list(buddy, zone, order, + get_pfnblock_migratetype(buddy, pfn)); set_pageblock_migratetype(page, migratetype); split_large_buddy(zone, buddy, pfn, order); return true; @@ -1735,23 +1763,17 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
/* We're the starting block of a larger buddy */ if (PageBuddy(page) && buddy_order(page) > pageblock_order) { - int mt = get_pfnblock_migratetype(page, pfn); int order = buddy_order(page);
- if (!is_migrate_isolate(mt)) - __mod_zone_freepage_state(zone, -(1UL << order), mt); - del_page_from_free_list(page, zone, order); + del_page_from_free_list(page, zone, order, + get_pfnblock_migratetype(page, pfn)); set_pageblock_migratetype(page, migratetype); split_large_buddy(zone, page, pfn, order); return true; } move: - mt = get_pfnblock_migratetype(page, start_pfn); - nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype); - if (!is_migrate_isolate(mt)) - __mod_zone_freepage_state(zone, -nr_moved, mt); - else if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, nr_moved, migratetype); + move_freepages(zone, start_pfn, end_pfn, + get_pfnblock_migratetype(page, start_pfn), migratetype); return true; } #endif /* CONFIG_MEMORY_ISOLATION */ @@ -1865,7 +1887,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
/* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { - del_page_from_free_list(page, zone, current_order); + del_page_from_free_list(page, zone, current_order, block_type); change_pageblock_range(page, current_order, start_type); expand(zone, page, order, current_order, start_type); return page; @@ -1915,12 +1937,12 @@ steal_suitable_fallback(struct zone *zone, struct page *page, */ if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { - move_freepages(zone, start_pfn, end_pfn, start_type); + move_freepages(zone, start_pfn, end_pfn, block_type, start_type); return __rmqueue_smallest(zone, order, start_type); }
single_page: - del_page_from_free_list(page, zone, current_order); + del_page_from_free_list(page, zone, current_order, block_type); expand(zone, page, order, current_order, block_type); return page; } @@ -1990,7 +2012,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) mt = get_pageblock_migratetype(page); /* Only reserve normal pageblocks (i.e., they can merge with others) */ if (migratetype_is_mergeable(mt)) - if (move_freepages_block(zone, page, + if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) != -1) zone->nr_reserved_highatomic += pageblock_nr_pages;
@@ -2031,11 +2053,13 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, spin_lock_irqsave(&zone->lock, flags); for (order = 0; order < NR_PAGE_ORDERS; order++) { struct free_area *area = &(zone->free_area[order]); + int mt;
page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC); if (!page) continue;
+ mt = get_pageblock_migratetype(page); /* * In page freeing path, migratetype change is racy so * we can counter several free pages in a pageblock @@ -2043,7 +2067,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * from highatomic to ac->migratetype. So we should * adjust the count once. */ - if (is_migrate_highatomic_page(page)) { + if (is_migrate_highatomic(mt)) { /* * It should never happen but changes to * locking could inadvertently allow a per-cpu @@ -2065,7 +2089,8 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * of pageblocks that cannot be completely freed * may increase. */ - ret = move_freepages_block(zone, page, ac->migratetype); + ret = move_freepages_block(zone, page, mt, + ac->migratetype); /* * Reserving this block already succeeded, so this should * not fail on zone boundaries. @@ -2236,12 +2261,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, * pages are ordered properly. */ list_add_tail(&page->pcp_list, list); - if (is_migrate_cma(get_pageblock_migratetype(page))) - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, - -(1 << order)); } - - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order)); spin_unlock_irqrestore(&zone->lock, flags);
return i; @@ -2751,11 +2771,9 @@ int __isolate_free_page(struct page *page, unsigned int order) watermark = zone->_watermark[WMARK_MIN] + (1UL << order); if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA)) return 0; - - __mod_zone_freepage_state(zone, -(1UL << order), mt); }
- del_page_from_free_list(page, zone, order); + del_page_from_free_list(page, zone, order, mt);
/* * Set the pageblock if the isolated page is at least half of a @@ -2770,7 +2788,7 @@ int __isolate_free_page(struct page *page, unsigned int order) * with others) */ if (migratetype_is_mergeable(mt)) - move_freepages_block(zone, page, + move_freepages_block(zone, page, mt, MIGRATE_MOVABLE); } } @@ -2855,8 +2873,6 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, return NULL; } } - __mod_zone_freepage_state(zone, -(1 << order), - get_pageblock_migratetype(page)); spin_unlock_irqrestore(&zone->lock, flags); } while (check_new_pages(page, order));
@@ -6940,8 +6956,9 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
BUG_ON(page_count(page)); BUG_ON(!PageBuddy(page)); + VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE); order = buddy_order(page); - del_page_from_free_list(page, zone, order); + del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE); pfn += (1 << order); } spin_unlock_irqrestore(&zone->lock, flags); @@ -6969,6 +6986,14 @@ bool is_free_buddy_page(struct page *page) EXPORT_SYMBOL(is_free_buddy_page);
#ifdef CONFIG_MEMORY_FAILURE +static inline void add_to_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype, + bool tail) +{ + __add_to_free_list(page, zone, order, migratetype, tail); + account_freepages(zone, 1 << order, migratetype); +} + /* * Break down a higher-order page in sub-pages, and keep our target out of * buddy allocator. @@ -6991,10 +7016,10 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page, current_buddy = page + size; }
- if (set_page_guard(zone, current_buddy, high, migratetype)) + if (set_page_guard(zone, current_buddy, high)) continue;
- add_to_free_list(current_buddy, zone, high, migratetype); + add_to_free_list(current_buddy, zone, high, migratetype, false); set_buddy_order(current_buddy, high); } } @@ -7020,12 +7045,11 @@ bool take_page_off_buddy(struct page *page) int migratetype = get_pfnblock_migratetype(page_head, pfn_head);
- del_page_from_free_list(page_head, zone, page_order); + del_page_from_free_list(page_head, zone, page_order, + migratetype); break_down_buddy_pages(zone, page_head, page, 0, page_order, migratetype); SetPageHWPoisonTakenOff(page); - if (!is_migrate_isolate(migratetype)) - __mod_zone_freepage_state(zone, -1, migratetype); ret = true; break; } @@ -7130,7 +7154,7 @@ static bool try_to_accept_memory_one(struct zone *zone) list_del(&page->lru); last = list_empty(&zone->unaccepted_pages);
- __mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); + account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); spin_unlock_irqrestore(&zone->lock, flags);
@@ -7188,7 +7212,7 @@ static bool __free_unaccepted(struct page *page) spin_lock_irqsave(&zone->lock, flags); first = list_empty(&zone->unaccepted_pages); list_add_tail(&page->lru, &zone->unaccepted_pages); - __mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); + account_freepages(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES); spin_unlock_irqrestore(&zone->lock, flags);
From: Vlastimil Babka vbabka@suse.cz
mainline inclusion from mainline-v6.10-rc1 commit e1f42a577f63647dadf1abe4583053c03d6be045 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The function is now supposed to be called only on a single pageblock and checks start_pfn and end_pfn accordingly. Rename it to make this more obvious and drop the end_pfn parameter which can be determined trivially and none of the callers use it for anything else.
Also make the (now internal) end_pfn exclusive, which is more common.
Link: https://lkml.kernel.org/r/81b1d642-2ec0-49f5-89fc-19a3828419ff@suse.cz Signed-off-by: Vlastimil Babka vbabka@suse.cz Reviewed-by: Zi Yan ziy@nvidia.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Mel Gorman mgorman@techsingularity.net Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 43 ++++++++++++++++++++----------------------- 1 file changed, 20 insertions(+), 23 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d662bbdf2e91..7270d665fc53 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1571,18 +1571,18 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, * Change the type of a block and move all its free pages to that * type's freelist. */ -static int move_freepages(struct zone *zone, unsigned long start_pfn, - unsigned long end_pfn, int old_mt, int new_mt) +static int __move_freepages_block(struct zone *zone, unsigned long start_pfn, + int old_mt, int new_mt) { struct page *page; - unsigned long pfn; + unsigned long pfn, end_pfn; unsigned int order; int pages_moved = 0;
VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1)); - VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn); + end_pfn = pageblock_end_pfn(start_pfn);
- for (pfn = start_pfn; pfn <= end_pfn;) { + for (pfn = start_pfn; pfn < end_pfn;) { page = pfn_to_page(pfn); if (!PageBuddy(page)) { pfn++; @@ -1608,14 +1608,13 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
static bool prep_move_freepages_block(struct zone *zone, struct page *page, unsigned long *start_pfn, - unsigned long *end_pfn, int *num_free, int *num_movable) { unsigned long pfn, start, end;
pfn = page_to_pfn(page); start = pageblock_start_pfn(pfn); - end = pageblock_end_pfn(pfn) - 1; + end = pageblock_end_pfn(pfn);
/* * The caller only has the lock for @zone, don't touch ranges @@ -1626,16 +1625,15 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, */ if (!zone_spans_pfn(zone, start)) return false; - if (!zone_spans_pfn(zone, end)) + if (!zone_spans_pfn(zone, end - 1)) return false;
*start_pfn = start; - *end_pfn = end;
if (num_free) { *num_free = 0; *num_movable = 0; - for (pfn = start; pfn <= end;) { + for (pfn = start; pfn < end;) { page = pfn_to_page(pfn); if (PageBuddy(page)) { int nr = 1 << buddy_order(page); @@ -1661,13 +1659,12 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, static int move_freepages_block(struct zone *zone, struct page *page, int old_mt, int new_mt) { - unsigned long start_pfn, end_pfn; + unsigned long start_pfn;
- if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, - NULL, NULL)) + if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return -1;
- return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt); + return __move_freepages_block(zone, start_pfn, old_mt, new_mt); }
#ifdef CONFIG_MEMORY_ISOLATION @@ -1738,10 +1735,9 @@ static void split_large_buddy(struct zone *zone, struct page *page, bool move_freepages_block_isolate(struct zone *zone, struct page *page, int migratetype) { - unsigned long start_pfn, end_pfn, pfn; + unsigned long start_pfn, pfn;
- if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, - NULL, NULL)) + if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return false;
/* No splits needed if buddies can't span multiple blocks */ @@ -1772,8 +1768,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, return true; } move: - move_freepages(zone, start_pfn, end_pfn, - get_pfnblock_migratetype(page, start_pfn), migratetype); + __move_freepages_block(zone, start_pfn, + get_pfnblock_migratetype(page, start_pfn), + migratetype); return true; } #endif /* CONFIG_MEMORY_ISOLATION */ @@ -1873,7 +1870,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, unsigned int alloc_flags, bool whole_block) { int free_pages, movable_pages, alike_pages; - unsigned long start_pfn, end_pfn; + unsigned long start_pfn; int block_type;
block_type = get_pageblock_migratetype(page); @@ -1906,8 +1903,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page;
/* moving whole block can fail due to zone boundary conditions */ - if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, - &free_pages, &movable_pages)) + if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages, + &movable_pages)) goto single_page;
/* @@ -1937,7 +1934,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, */ if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { - move_freepages(zone, start_pfn, end_pfn, block_type, start_type); + __move_freepages_block(zone, start_pfn, block_type, start_type); return __rmqueue_smallest(zone, order, start_type); }
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc1 commit 883dd161e9a83e188487debc562b1928917a4b39 category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
expand() currently updates vmstat for every subpage. This is unnecessary, since they're all of the same zone and migratetype.
Count added pages locally, then do a single vmstat update.
Link: https://lkml.kernel.org/r/20240327190111.GC7597@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Suggested-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7270d665fc53..4e2ec54b6a7f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1326,6 +1326,7 @@ static inline void expand(struct zone *zone, struct page *page, int low, int high, int migratetype) { unsigned long size = 1 << high; + unsigned long nr_added = 0;
while (high > low) { high--; @@ -1341,9 +1342,11 @@ static inline void expand(struct zone *zone, struct page *page, if (set_page_guard(zone, &page[size], high)) continue;
- add_to_free_list(&page[size], zone, high, migratetype, false); + __add_to_free_list(&page[size], zone, high, migratetype, false); set_buddy_order(&page[size], high); + nr_added += size; } + account_freepages(zone, nr_added, migratetype); }
static void check_new_page_bad(struct page *page)
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v6.10-rc3 commit 7cc5a5d65011983952a9c62f170f5b79e24b1239 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Christoph reports a page allocator splat triggered by xfstests:
generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30 XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273 XFS (nvme0n1): Ending clean mount XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850 XFS (nvme1n1): Ending clean mount XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850 XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9 XFS (nvme1n1): Ending clean mount ------------[ cut here ]------------ page type is 3, passed migratetype is 1 (nr=512) WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0 Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:expand+0x1c5/0x1f0 Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3 RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082 RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000 RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001 R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009 R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000 FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x7b/0x120 ? expand+0x1c5/0x1f0 ? report_bug+0x191/0x1c0 ? handle_bug+0x3c/0x80 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? expand+0x1c5/0x1f0 ? expand+0x1c5/0x1f0 __rmqueue_pcplist+0x3a9/0x730 get_page_from_freelist+0x7a0/0xf00 __alloc_pages_noprof+0x153/0x2e0 __folio_alloc_noprof+0x10/0xa0 __filemap_get_folio+0x16b/0x370 iomap_write_begin+0x496/0x680
While trying to service a movable allocation (page type 1), the page allocator runs into a two-pageblock buddy on the movable freelist whose second block is typed as highatomic (page type 3).
This inconsistency is caused by the highatomic reservation system operating on single pageblocks, while MAX_ORDER can be bigger than that - in this configuration, pageblock_order is 9 while MAX_PAGE_ORDER is 10. The test case is observed to make several adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which marks the surrounding block as highatomic. Upon freeing, the blocks merge into an order-10 buddy. When the highatomic pool is drained later on, this order-10 buddy gets moved back to the movable list, but only the first pageblock is marked movable again. A subsequent expand() of this buddy warns about the tail being of a different type.
This is a long-standing bug that's surfaced by the recent block type warnings added to the allocator. The consequences seem mostly benign, it just results in odd behavior: the highatomic tail blocks are not properly drained, instead they end up on the movable list first, then go back to the highatomic list after an alloc-free cycle.
To fix this, make the highatomic reservation code aware that allocations/buddies can be larger than a pageblock.
While it's an old quirk, the recently added type consistency warnings seem to be the most prominent consequence of it. Set the Fixes: tag accordingly to highlight this backporting dependency.
Link: https://lkml.kernel.org/r/20240530114203.GA1222079@cmpxchg.org Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting") Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reported-by: Christoph Hellwig hch@infradead.org Reviewed-by: Zi Yan ziy@nvidia.com Tested-by: Christoph Hellwig hch@lst.de Cc: Andy Shevchenko andriy.shevchenko@linux.intel.com Cc: Baolin Wang baolin.wang@linux.alibaba.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 50 +++++++++++++++++++++++++++++++++---------------- 1 file changed, 34 insertions(+), 16 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4e2ec54b6a7f..53534165b5ab 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1982,10 +1982,12 @@ int find_suitable_fallback(struct free_area *area, unsigned int order, }
/* - * Reserve a pageblock for exclusive use of high-order atomic allocations if - * there are no empty page blocks that contain a page with a suitable order + * Reserve the pageblock(s) surrounding an allocation request for + * exclusive use of high-order atomic allocations if there are no + * empty page blocks that contain a page with a suitable order */ -static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) +static void reserve_highatomic_pageblock(struct page *page, int order, + struct zone *zone) { int mt; unsigned long max_managed, flags; @@ -2011,10 +2013,17 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) /* Yoink! */ mt = get_pageblock_migratetype(page); /* Only reserve normal pageblocks (i.e., they can merge with others) */ - if (migratetype_is_mergeable(mt)) - if (move_freepages_block(zone, page, mt, - MIGRATE_HIGHATOMIC) != -1) - zone->nr_reserved_highatomic += pageblock_nr_pages; + if (!migratetype_is_mergeable(mt)) + goto out_unlock; + + if (order < pageblock_order) { + if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1) + goto out_unlock; + zone->nr_reserved_highatomic += pageblock_nr_pages; + } else { + change_pageblock_range(page, order, MIGRATE_HIGHATOMIC); + zone->nr_reserved_highatomic += 1 << order; + }
out_unlock: spin_unlock_irqrestore(&zone->lock, flags); @@ -2026,7 +2035,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone) * intense memory pressure but failed atomic allocations should be easier * to recover from than an OOM. * - * If @force is true, try to unreserve a pageblock even though highatomic + * If @force is true, try to unreserve pageblocks even though highatomic * pageblock is exhausted. */ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, @@ -2068,6 +2077,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * adjust the count once. */ if (is_migrate_highatomic(mt)) { + unsigned long size; /* * It should never happen but changes to * locking could inadvertently allow a per-cpu @@ -2075,9 +2085,9 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * while unreserving so be safe and watch for * underflows. */ - zone->nr_reserved_highatomic -= min( - pageblock_nr_pages, - zone->nr_reserved_highatomic); + size = max(pageblock_nr_pages, 1UL << order); + size = min(size, zone->nr_reserved_highatomic); + zone->nr_reserved_highatomic -= size; }
/* @@ -2089,11 +2099,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * of pageblocks that cannot be completely freed * may increase. */ - ret = move_freepages_block(zone, page, mt, - ac->migratetype); + if (order < pageblock_order) + ret = move_freepages_block(zone, page, mt, + ac->migratetype); + else { + move_to_free_list(page, zone, order, mt, + ac->migratetype); + change_pageblock_range(page, order, + ac->migratetype); + ret = 1; + } /* - * Reserving this block already succeeded, so this should - * not fail on zone boundaries. + * Reserving the block(s) already succeeded, + * so this should not fail on zone boundaries. */ WARN_ON_ONCE(ret == -1); if (ret > 0) { @@ -3440,7 +3458,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * if the pageblock should be reserved for the future */ if (unlikely(alloc_flags & ALLOC_HIGHATOMIC)) - reserve_highatomic_pageblock(page, zone); + reserve_highatomic_pageblock(page, order, zone);
return page; } else {
From: Yu Zhao yuzhao@google.com
mainline inclusion from mainline-v6.12-rc7 commit c928807f6f6b6d595a7e199591ae297c81de3aeb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
OOM kills due to vastly overestimated free highatomic reserves were observed:
... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ... Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ... Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
The second line above shows that the OOM kill was due to the following condition:
free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
And the third line shows there were no free pages in any MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type 'H'. Therefore __zone_watermark_unusable_free() underestimated the usable free memory by over 1GB, which resulted in the unnecessary OOM kill above.
The comments in __zone_watermark_unusable_free() warns about the potential risk, i.e.,
If the caller does not have rights to reserves below the min watermark then subtract the high-atomic reserves. This will over-estimate the size of the atomic reserve but it avoids a search.
However, it is possible to keep track of free pages in reserved highatomic pageblocks with a new per-zone counter nr_free_highatomic protected by the zone lock, to avoid a search when calculating the usable free memory. And the cost would be minimal, i.e., simple arithmetics in the highatomic alloc/free/move paths.
Note that since nr_free_highatomic can be relatively small, using a per-cpu counter might cause too much drift and defeat its purpose, in addition to the extra memory overhead.
Dependson e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting") - see [1]
[akpm@linux-foundation.org: s/if/else if/, per Johannes, stealth whitespace tweak] Link: https://lkml.kernel.org/r/20241028182653.3420139-1-yuzhao@google.com Link: https://lkml.kernel.org/r/0d0ddb33-fcdc-43e2-801f-0c1df2031afb@suse.cz [1] Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: Yu Zhao yuzhao@google.com Reported-by: Link Lin linkl@google.com Acked-by: David Rientjes rientjes@google.com Acked-by: Vlastimil Babka vbabka@suse.cz Acked-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 10 +++++++--- 2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3cee238de7c8..18bee72ebc71 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -865,6 +865,7 @@ struct zone { unsigned long watermark_boost;
unsigned long nr_reserved_highatomic; + unsigned long nr_free_highatomic;
/* * We don't know if the memory that we're going to allocate will be diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 53534165b5ab..e786cdd98bea 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -639,6 +639,8 @@ compaction_capture(struct capture_control *capc, struct page *page, static inline void account_freepages(struct zone *zone, int nr_pages, int migratetype) { + lockdep_assert_held(&zone->lock); + if (is_migrate_isolate(migratetype)) return;
@@ -646,6 +648,9 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
if (is_migrate_cma(migratetype)) __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages); + else if (is_migrate_highatomic(migratetype)) + WRITE_ONCE(zone->nr_free_highatomic, + zone->nr_free_highatomic + nr_pages); }
/* Used for pages not on another list */ @@ -3072,11 +3077,10 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
/* * If the caller does not have rights to reserves below the min - * watermark then subtract the high-atomic reserves. This will - * over-estimate the size of the atomic reserve but it avoids a search. + * watermark then subtract the free pages reserved for highatomic. */ if (likely(!(alloc_flags & ALLOC_RESERVES))) - unusable_free += z->nr_reserved_highatomic; + unusable_free += READ_ONCE(z->nr_free_highatomic);
#ifdef CONFIG_CMA /* If allocation can't use CMA areas don't use free CMA pages */
From: Kefeng Wang wangkefeng.wang@huawei.com
mainline inclusion from mainline-v6.12-rc1 commit cd5f3193b432cd70cc1c19aba790300dd11ae934 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The gigantic page size may larger than memory block size, so memory offline always fails in this case after commit b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity"),
offline_pages start_isolate_page_range start_isolate_page_range(isolate_before=true) isolate [isolate_start, isolate_start + pageblock_nr_pages) start_isolate_page_range(isolate_before=false) isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock __alloc_contig_migrate_range isolate_migratepages_range isolate_migratepages_block isolate_or_dissolve_huge_page if (hstate_is_gigantic(h)) return -ENOMEM;
[ 15.815756] memory offlining [mem 0x3c0000000-0x3c7ffffff] failed due to failure to isolate range
Gigantic PageHuge is bigger than a pageblock, but since it is freed as order-0 pages, its pageblocks after being freed will get to the right free list. There is no need to have special handling code for them in start_isolate_page_range(). For both alloc_contig_range() and memory offline cases, the migration code after start_isolate_page_range() will be able to migrate gigantic PageHuge when possible. Let's clean up start_isolate_page_range() and fix the aforementioned memory offline failure issue all together.
Let's clean up start_isolate_page_range() and fix the aforementioned memory offline failure issue all together.
Link: https://lkml.kernel.org/r/20240820032630.1894770-1-wangkefeng.wang@huawei.co... Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Acked-by: David Hildenbrand david@redhat.com Acked-by: Zi Yan ziy@nvidia.com Cc: Matthew Wilcox willy@infradead.org Cc: Oscar Salvador osalvador@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_isolation.c | 28 +++------------------------- 1 file changed, 3 insertions(+), 25 deletions(-)
diff --git a/mm/page_isolation.c b/mm/page_isolation.c index b3aae89ed226..cf7f1922fc3e 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -398,30 +398,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, unsigned long head_pfn = page_to_pfn(head); unsigned long nr_pages = compound_nr(head);
- if (head_pfn + nr_pages <= boundary_pfn) { - pfn = head_pfn + nr_pages; - continue; - } - -#if defined CONFIG_COMPACTION || defined CONFIG_CMA - if (PageHuge(page)) { - int page_mt = get_pageblock_migratetype(page); - struct compact_control cc = { - .nr_migratepages = 0, - .order = -1, - .zone = page_zone(pfn_to_page(head_pfn)), - .mode = MIGRATE_SYNC, - .ignore_skip_hint = true, - .no_set_skip_hint = true, - .gfp_mask = gfp_flags, - .alloc_contig = true, - }; - INIT_LIST_HEAD(&cc.migratepages); - - ret = __alloc_contig_migrate_range(&cc, head_pfn, - head_pfn + nr_pages, page_mt); - if (ret) - goto failed; + if (head_pfn + nr_pages <= boundary_pfn || + PageHuge(page)) { pfn = head_pfn + nr_pages; continue; } @@ -435,7 +413,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags, */ VM_WARN_ON_ONCE_PAGE(PageLRU(page), page); VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page); -#endif + goto failed; }
From: Huan Yang link@vivo.com
mainline inclusion from mainline-v6.12-rc1 commit 94deaf69dcd33462c61fa8cabb0883e3085a1046 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IBC4T0
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When page del from buddy and need expand, it will account free_pages in zone's migratetype.
The current way is to subtract the page number of the current order when deleting, and then add it back when expanding.
This is unnecessary, as when migrating the same type, we can directly record the difference between the high-order pages and the expand added, and then subtract it directly.
This patch merge that, only when del and expand done, then account free_pages.
Link: https://lkml.kernel.org/r/20240826064048.187790-1-link@vivo.com Signed-off-by: Huan Yang link@vivo.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Liu Shixin liushixin2@huawei.com --- mm/page_alloc.c | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e786cdd98bea..7734245d7870 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1327,11 +1327,11 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn, * * -- nyc */ -static inline void expand(struct zone *zone, struct page *page, - int low, int high, int migratetype) +static inline unsigned int expand(struct zone *zone, struct page *page, int low, + int high, int migratetype) { - unsigned long size = 1 << high; - unsigned long nr_added = 0; + unsigned int size = 1 << high; + unsigned int nr_added = 0;
while (high > low) { high--; @@ -1351,7 +1351,19 @@ static inline void expand(struct zone *zone, struct page *page, set_buddy_order(&page[size], high); nr_added += size; } - account_freepages(zone, nr_added, migratetype); + + return nr_added; +} + +static __always_inline void page_del_and_expand(struct zone *zone, + struct page *page, int low, + int high, int migratetype) +{ + int nr_pages = 1 << high; + + __del_page_from_free_list(page, zone, high, migratetype); + nr_pages -= expand(zone, page, low, high, migratetype); + account_freepages(zone, -nr_pages, migratetype); }
static void check_new_page_bad(struct page *page) @@ -1540,8 +1552,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_list(page, zone, current_order, migratetype); - expand(zone, page, order, current_order, migratetype); + + page_del_and_expand(zone, page, order, current_order, + migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES); @@ -1892,9 +1905,12 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
/* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { + unsigned int nr_added; + del_page_from_free_list(page, zone, current_order, block_type); change_pageblock_range(page, current_order, start_type); - expand(zone, page, order, current_order, start_type); + nr_added = expand(zone, page, order, current_order, start_type); + account_freepages(zone, nr_added, start_type); return page; }
@@ -1947,8 +1963,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, }
single_page: - del_page_from_free_list(page, zone, current_order, block_type); - expand(zone, page, order, current_order, block_type); + page_del_and_expand(zone, page, order, current_order, block_type); return page; }
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/14226 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/4...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/14226 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/4...