This series moves the boot-time initialization of tail struct pages of a gigantic page to later on in the boot. Only the HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages are initialized at the start. If HVO is successful, then no more tail struct pages need to be initialized. For a 1G hugepage, this series avoids the initialization of 262144 - 63 = 262081 struct pages per hugepage.
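To make those numbers concrete, here is a minimal standalone userspace sketch of the arithmetic (not part of the patches), assuming 4 KiB base pages and a 64-byte struct page, and using the fact that HUGETLB_VMEMMAP_RESERVE_SIZE is PAGE_SIZE:

#include <stdio.h>

int main(void)
{
	unsigned long base_page_size = 4096;		/* assumed 4 KiB base pages */
	unsigned long struct_page_size = 64;		/* assumed sizeof(struct page) */
	unsigned long gigantic_size = 1UL << 30;	/* one 1G hugepage */

	/* struct pages backing one 1G hugepage */
	unsigned long total = gigantic_size / base_page_size;		/* 262144 */

	/* struct pages kept when HVO succeeds: one vmemmap page worth */
	unsigned long kept = base_page_size / struct_page_size;	/* 64 = head + 63 tails */

	/* tail struct pages whose boot-time initialization is skipped */
	printf("skipped: %lu\n", total - (kept - 1));			/* 262144 - 63 = 262081 */
	return 0;
}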
When tested on a 512G system (allocating 500 1G hugepages), the kexec-boot times with DEFERRED_STRUCT_PAGE_INIT enabled are:
- with patches, HVO enabled: 1.32 seconds
- with patches, HVO disabled: 2.15 seconds
- without patches, HVO enabled: 3.90 seconds
- without patches, HVO disabled: 3.58 seconds
This represents an approximately 70% reduction in boot time and will significantly reduce server downtime when using a large number of gigantic pages.
Usama Arif (4):
  mm: hugetlb_vmemmap: use nid of the head page to reallocate it
  memblock: pass memblock_type to memblock_setclr_flag
  memblock: introduce MEMBLOCK_RSRV_NOINIT flag
  mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO
 include/linux/memblock.h |  9 ++++++
 mm/hugetlb.c             | 68 ++++++++++++++++++++++++++++++++++------
 mm/hugetlb_vmemmap.c     |  4 +--
 mm/hugetlb_vmemmap.h     |  9 +++---
 mm/internal.h            |  3 ++
 mm/memblock.c            | 49 +++++++++++++++++++++--------
 mm/mm_init.c             |  2 +-
 7 files changed, 115 insertions(+), 29 deletions(-)
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/3044 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S...
From: Usama Arif <usama.arif@bytedance.com>
mainline inclusion
from mainline-v6.7-rc1
commit a9e34ea1f62c1e359ed197ec01d056c25edb2b61
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8JXRC
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------
Patch series "mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO", v5.
This series moves the boot-time initialization of tail struct pages of a gigantic page to later on in the boot. Only the HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages are initialized at the start. If HVO is successful, then no more tail struct pages need to be initialized. For a 1G hugepage, this series avoids the initialization of 262144 - 63 = 262081 struct pages per hugepage.
When tested on a 512G system (allocating 500 1G hugepages), the kexec-boot times with DEFERRED_STRUCT_PAGE_INIT enabled are:
- with patches, HVO enabled: 1.32 seconds
- with patches, HVO disabled: 2.15 seconds
- without patches, HVO enabled: 3.90 seconds
- without patches, HVO disabled: 3.58 seconds
This represents an approximately 70% reduction in boot time and will significantly reduce server downtime when using a large number of gigantic pages.
This patch (of 4):
If tail page prep and initialization are skipped, then the "start" page will not contain the correct nid. Use the nid from the first vmemmap page instead.
Link: https://lkml.kernel.org/r/20230913105401.519709-1-usama.arif@bytedance.com
Link: https://lkml.kernel.org/r/20230913105401.519709-2-usama.arif@bytedance.com
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/hugetlb_vmemmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4b9734777f69..56989b228acd 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -318,7 +318,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
 	};
-	int nid = page_to_nid((struct page *)start);
+	int nid = page_to_nid((struct page *)reuse);
 	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
 			 __GFP_NOWARN;
 
From: Usama Arif <usama.arif@bytedance.com>
mainline inclusion
from mainline-v6.7-rc1
commit ee8d2071ef52d83a2ac4f8a474fafb2aea91766d
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8JXRC
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------
This allows setting flags on both memblock types and is in preparation for setting flags (e.g. to not initialize struct pages) on reserved memory regions.
[usama.arif@bytedance.com: add missing argument definition]
Link: https://lkml.kernel.org/r/20230918090657.220463-1-usama.arif@bytedance.com
Link: https://lkml.kernel.org/r/20230913105401.519709-3-usama.arif@bytedance.com
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/memblock.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index 913b2520a9a0..b978cda96cf0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -892,6 +892,7 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
 
 /**
  * memblock_setclr_flag - set or clear flag for a memory region
+ * @type: memblock type to set/clear flag for
  * @base: base address of the region
  * @size: size of the region
  * @set: set or clear the flag
@@ -901,10 +902,9 @@ int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
  *
  * Return: 0 on success, -errno on failure.
  */
-static int __init_memblock memblock_setclr_flag(phys_addr_t base,
-                                phys_addr_t size, int set, int flag)
+static int __init_memblock memblock_setclr_flag(struct memblock_type *type,
+                                phys_addr_t base, phys_addr_t size, int set, int flag)
 {
-        struct memblock_type *type = &memblock.memory;
         int i, ret, start_rgn, end_rgn;
 
         ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
@@ -933,7 +933,7 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base,
  */
 int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
 {
-        return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
+        return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_HOTPLUG);
 }
 
 /**
@@ -945,7 +945,7 @@ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
  */
 int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
 {
-        return memblock_setclr_flag(base, size, 0, MEMBLOCK_HOTPLUG);
+        return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_HOTPLUG);
 }
 
 /**
@@ -962,7 +962,7 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
 
         system_has_some_mirror = true;
 
-        return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
+        return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_MIRROR);
 }
 
 /**
@@ -982,7 +982,7 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
  */
 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
 {
-        return memblock_setclr_flag(base, size, 1, MEMBLOCK_NOMAP);
+        return memblock_setclr_flag(&memblock.memory, base, size, 1, MEMBLOCK_NOMAP);
 }
 
 /**
@@ -994,7 +994,7 @@ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
  */
 int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 {
-        return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
+        return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_NOMAP);
 }
 
 static bool should_skip_region(struct memblock_type *type,
From: Usama Arif <usama.arif@bytedance.com>
mainline inclusion
from mainline-v6.7-rc1
commit 77e6c43e137c130138c3fbadc847351a83c4befe
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8JXRC
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------
For reserved memory regions marked with this flag, reserve_bootmem_region is not called during memmap_init_reserved_pages. This can be used to avoid initializing struct pages for regions that won't need them, e.g. hugepages with HugeTLB Vmemmap Optimization (HVO) enabled.
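As a usage illustration only (the helper name below is made up for the example; the memblock_reserved_mark_noinit() call itself mirrors what __alloc_bootmem_huge_page() does in the last patch of this series), a caller keeps the head page's struct page initialized and marks the rest of the reserved gigantic-page range as noinit:

/*
 * Sketch only: mark everything after the head page's struct page as
 * MEMBLOCK_RSRV_NOINIT, so memmap_init_reserved_pages() skips it.
 * "m" is the bootmem-allocated gigantic page and "h" its hstate, as in
 * __alloc_bootmem_huge_page() in the final patch of this series.
 */
static void __init mark_gigantic_tail_noinit(void *m, struct hstate *h)
{
	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
				      huge_page_size(h) - PAGE_SIZE);
}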
Link: https://lkml.kernel.org/r/20230913105401.519709-4-usama.arif@bytedance.com
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 include/linux/memblock.h |  9 +++++++++
 mm/memblock.c            | 33 ++++++++++++++++++++++++++++-----
 2 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 1c1072e3ca06..ae3bde302f70 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -40,6 +40,8 @@ extern unsigned long long max_possible_pfn;
  * via a driver, and never indicated in the firmware-provided memory map as
  * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
  * kernel resource tree.
+ * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
+ * not initialized (only for reserved regions).
  */
 enum memblock_flags {
         MEMBLOCK_NONE           = 0x0,  /* No special request */
@@ -47,6 +49,7 @@ enum memblock_flags {
         MEMBLOCK_MIRROR         = 0x2,  /* mirrored region */
         MEMBLOCK_NOMAP          = 0x4,  /* don't add to kernel direct mapping */
         MEMBLOCK_DRIVER_MANAGED = 0x8,  /* always detected via a driver */
+        MEMBLOCK_RSRV_NOINIT    = 0x10, /* don't initialize struct pages */
 };
 
 /**
@@ -125,6 +128,7 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size);
 
 void memblock_free_all(void);
 void memblock_free(void *ptr, size_t size);
@@ -259,6 +263,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
         return m->flags & MEMBLOCK_NOMAP;
 }
 
+static inline bool memblock_is_reserved_noinit(struct memblock_region *m)
+{
+        return m->flags & MEMBLOCK_RSRV_NOINIT;
+}
+
 static inline bool memblock_is_driver_managed(struct memblock_region *m)
 {
         return m->flags & MEMBLOCK_DRIVER_MANAGED;
diff --git a/mm/memblock.c b/mm/memblock.c
index b978cda96cf0..fd492e5bbdbc 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -997,6 +997,24 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
         return memblock_setclr_flag(&memblock.memory, base, size, 0, MEMBLOCK_NOMAP);
 }
 
+/**
+ * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
+ * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
+ * for this region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * struct pages will not be initialized for reserved memory regions marked with
+ * %MEMBLOCK_RSRV_NOINIT.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t size)
+{
+        return memblock_setclr_flag(&memblock.reserved, base, size, 1,
+                                    MEMBLOCK_RSRV_NOINIT);
+}
+
 static bool should_skip_region(struct memblock_type *type,
                                struct memblock_region *m,
                                int nid, int flags)
@@ -2113,13 +2131,18 @@ static void __init memmap_init_reserved_pages(void)
                 memblock_set_node(start, end, &memblock.reserved, nid);
         }
 
-        /* initialize struct pages for the reserved regions */
+        /*
+         * initialize struct pages for reserved regions that don't have
+         * the MEMBLOCK_RSRV_NOINIT flag set
+         */
         for_each_reserved_mem_region(region) {
-                nid = memblock_get_region_node(region);
-                start = region->base;
-                end = start + region->size;
+                if (!memblock_is_reserved_noinit(region)) {
+                        nid = memblock_get_region_node(region);
+                        start = region->base;
+                        end = start + region->size;
 
-                reserve_bootmem_region(start, end, nid);
+                        reserve_bootmem_region(start, end, nid);
+                }
         }
 }
From: Usama Arif <usama.arif@bytedance.com>
mainline inclusion
from mainline-v6.7-rc1
commit fde1c4ecf91640e5a95ec36b71ec2e8ec379ce40
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8JXRC
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------
The new boot flow when it comes to initialization of gigantic pages is as follows:
- At boot time, for a gigantic page during __alloc_bootmem_huge_page, the region after the first struct page is marked as noinit.
- This results in only the first struct page being initialized in reserve_bootmem_region. As the tail struct pages are not initialized at this point, there can be a significant saving in boot time if HVO succeeds later on.
- Later on in the boot, the head page is prepped and the first HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages are initialized.
- HVO is attempted. If it is not successful, then the rest of the tail struct pages are initialized. If it is successful, no more tail struct pages need to be initialized, saving significant boot time.
The WARN_ON for an increased ref count in gather_bootmem_prealloc was changed to a VM_BUG_ON. This is OK as there should be no speculative references this early in the boot process. The VM_BUG_ONs are there just in case such code is introduced.
[akpm@linux-foundation.org: make it nicer for 80 cols]
Link: https://lkml.kernel.org/r/20230913105401.519709-5-usama.arif@bytedance.com
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
 mm/hugetlb.c         | 68 ++++++++++++++++++++++++++++++++++++++------
 mm/hugetlb_vmemmap.c |  2 +-
 mm/hugetlb_vmemmap.h |  9 +++---
 mm/internal.h        |  3 ++
 mm/mm_init.c         |  2 +-
 5 files changed, 69 insertions(+), 15 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1301ba7b2c9a..50c3bfcef60c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3196,6 +3196,16 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
         }
 
 found:
+
+        /*
+         * Only initialize the head struct page in memmap_init_reserved_pages,
+         * rest of the struct pages will be initialized by the HugeTLB
+         * subsystem itself.
+         * The head struct page is used to get folio information by the HugeTLB
+         * subsystem like zone id and node id.
+         */
+        memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
+                huge_page_size(h) - PAGE_SIZE);
         /* Put them into a private list first because mem_map is not up yet */
         INIT_LIST_HEAD(&m->list);
         list_add(&m->list, &huge_boot_pages);
@@ -3203,6 +3213,43 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
         return 1;
 }
 
+/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
+static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
+                                        unsigned long start_page_number,
+                                        unsigned long end_page_number)
+{
+        enum zone_type zone = zone_idx(folio_zone(folio));
+        int nid = folio_nid(folio);
+        unsigned long head_pfn = folio_pfn(folio);
+        unsigned long pfn, end_pfn = head_pfn + end_page_number;
+        int ret;
+
+        for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
+                struct page *page = pfn_to_page(pfn);
+
+                __init_single_page(page, pfn, zone, nid);
+                prep_compound_tail((struct page *)folio, pfn - head_pfn);
+                ret = page_ref_freeze(page, 1);
+                VM_BUG_ON(!ret);
+        }
+}
+
+static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
+                                              struct hstate *h,
+                                              unsigned long nr_pages)
+{
+        int ret;
+
+        /* Prepare folio head */
+        __folio_clear_reserved(folio);
+        __folio_set_head(folio);
+        ret = page_ref_freeze(&folio->page, 1);
+        VM_BUG_ON(!ret);
+        /* Initialize the necessary tail struct pages */
+        hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
+        prep_compound_head((struct page *)folio, huge_page_order(h));
+}
+
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_ORDER) pages.
@@ -3213,19 +3260,21 @@ static void __init gather_bootmem_prealloc(void)
 
         list_for_each_entry(m, &huge_boot_pages, list) {
                 struct page *page = virt_to_page(m);
-                struct folio *folio = page_folio(page);
+                struct folio *folio = (void *)page;
                 struct hstate *h = m->hstate;
 
                 VM_BUG_ON(!hstate_is_gigantic(h));
                 WARN_ON(folio_ref_count(folio) != 1);
-                if (prep_compound_gigantic_folio(folio, huge_page_order(h))) {
-                        WARN_ON(folio_test_reserved(folio));
-                        prep_new_hugetlb_folio(h, folio, folio_nid(folio));
-                        free_huge_folio(folio); /* add to the hugepage allocator */
-                } else {
-                        /* VERY unlikely inflated ref count on a tail page */
-                        free_gigantic_folio(folio, huge_page_order(h));
-                }
+
+                hugetlb_folio_init_vmemmap(folio, h,
+                                           HUGETLB_VMEMMAP_RESERVE_PAGES);
+                prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+                /* If HVO fails, initialize all tail struct pages */
+                if (!HPageVmemmapOptimized(&folio->page))
+                        hugetlb_folio_init_tail_vmemmap(folio,
+                                        HUGETLB_VMEMMAP_RESERVE_PAGES,
+                                        pages_per_huge_page(h));
+                free_huge_folio(folio); /* add to the hugepage allocator */
 
                 /*
                  * We need to restore the 'stolen' pages to totalram_pages
@@ -3236,6 +3285,7 @@ static void __init gather_bootmem_prealloc(void)
                 cond_resched();
         }
 }
+
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 {
         unsigned long i;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 56989b228acd..b5a8587a0bad 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -586,7 +586,7 @@ static int __init hugetlb_vmemmap_init(void)
         const struct hstate *h;
 
         /* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
-        BUILD_BUG_ON(__NR_USED_SUBPAGE * sizeof(struct page) > HUGETLB_VMEMMAP_RESERVE_SIZE);
+        BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
 
         for_each_hstate(h) {
                 if (hugetlb_vmemmap_optimizable(h)) {
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 25bd0e002431..4573899855d7 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -10,15 +10,16 @@
 #define _LINUX_HUGETLB_VMEMMAP_H
 #include <linux/hugetlb.h>
 
-#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
-void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
-
 /*
  * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
  * Documentation/vm/vmemmap_dedup.rst.
  */
 #define HUGETLB_VMEMMAP_RESERVE_SIZE    PAGE_SIZE
+#define HUGETLB_VMEMMAP_RESERVE_PAGES   (HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page))
+
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
 {
diff --git a/mm/internal.h b/mm/internal.h
index 30cf724ddbce..085ec240ce24 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1154,4 +1154,7 @@ struct vma_prepare {
         struct vm_area_struct *remove;
         struct vm_area_struct *remove2;
 };
+
+void __meminit __init_single_page(struct page *page, unsigned long pfn,
+                                  unsigned long zone, int nid);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..fed4370b02e1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -555,7 +555,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
         node_states[N_MEMORY] = saved_node_state;
 }
 
-static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+void __meminit __init_single_page(struct page *page, unsigned long pfn,
                                         unsigned long zone, int nid)
 {
         mm_zero_struct_page(page);