The percpu embedded first chunk allocator is the first choice, but it can fail on ARM64 when NUMA is enabled with CONFIG_KASAN=y.
Let's implement the page mapping percpu first chunk allocator as a fallback to the embed allocator to make the system more robust.
Also fix a crash that occurs when both NEED_PER_CPU_PAGE_FIRST_CHUNK and KASAN_VMALLOC are enabled.
With this patch set applied, the ARM64 machine boots and works normally.
Kefeng Wang (3):
  vmalloc: choose a better start address in vm_area_register_early()
  arm64: support page mapping percpu first chunk allocator
  kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC

 arch/arm64/Kconfig         |  4 ++
 arch/arm64/mm/kasan_init.c | 16 ++++++++
 arch/arm64/mm/numa.c       | 84 +++++++++++++++++++++++++++++++++-----
 include/linux/kasan.h      | 10 ++++-
 mm/kasan/common.c          |  5 +++
 mm/vmalloc.c               | 19 ++++++---
 6 files changed, 120 insertions(+), 18 deletions(-)
From: Kefeng Wang <wangkefeng.wang@huawei.com>

mainline inclusion
from mainline-v5.16-rc1
commit 0eb68437a7f9dfef9c218873310c66c714f2fa99
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB2BDP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
The percpu embedded first chunk allocator is the first choice, but it can fail on ARM64, e.g.:

  percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
  percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
  percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000

after which we can hit
WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838
and the system cannot boot successfully.
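
As a rough illustration of why the messages above appear: to my understanding, the embed allocator in mm/percpu.c warns when the largest distance between per-CPU units exceeds 75% of the vmalloc space. The following is a minimal user-space sketch of that comparison, not the kernel code; the function name and the use of plain 64-bit integers are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the distance check (hypothetical helper, not a kernel API):
 * the embed allocator is treated as unusable when the span between the
 * lowest and highest per-CPU unit exceeds 75% of the vmalloc space.
 */
static bool embed_distance_too_large(uint64_t max_distance,
				     uint64_t vmalloc_total)
{
	/* Divide before multiplying so the product cannot overflow. */
	return max_distance > vmalloc_total / 4 * 3;
}
```

Feeding the three max_distance/vmalloc space pairs from the log into this helper returns true for each, which matches the warnings shown.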
Let's implement the page mapping percpu first chunk allocator as a fallback to the embed allocator to make the system more robust.
Also fix a crash that occurs when both NEED_PER_CPU_PAGE_FIRST_CHUNK and KASAN_VMALLOC are enabled.
Tested on ARM64 qemu with cmdline "percpu_alloc=page".
This patch (of 3):
Some fixed locations in the vmalloc area are reserved on ARM (see iotable_init()) and ARM64 (see map_kernel()). However, pcpu_page_first_chunk() calls vm_area_register_early(), which chooses VMALLOC_START as the start address of the vmap area. That address can conflict with the reserved locations above and trigger a BUG_ON in vm_area_add_early().
Let's choose a suitable start address by traversing the vmlist.
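
The traversal can be pictured as a first-fit search over an address-sorted list. Below is a minimal user-space sketch under stated assumptions: struct area, align_up() and register_early() are hypothetical stand-ins for struct vm_struct, ALIGN() and vm_area_register_early(), and the function returns an error where the kernel would BUG_ON. Unlike the patch, the sketch also checks cur->addr >= addr before subtracting, to avoid unsigned wrap-around in the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct vm_struct. */
struct area {
	struct area *next;
	uintptr_t addr;
	size_t size;
};

/* Round x up to a multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t x, uintptr_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Insert vm into the address-sorted list at the first gap of at least
 * vm->size bytes at or above start. Returns 0 on success, or -1 if the
 * area would run past end.
 */
static int register_early(struct area **list, struct area *vm,
			  uintptr_t start, uintptr_t end, uintptr_t align)
{
	uintptr_t addr = align_up(start, align);
	struct area *cur, **p;

	assert(align && (align & (align - 1)) == 0);

	for (p = list; (cur = *p) != NULL; p = &cur->next) {
		/* Stop at the first gap that is large enough. */
		if (cur->addr >= addr && cur->addr - addr >= vm->size)
			break;
		/* Otherwise continue just past the current area. */
		addr = align_up(cur->addr + cur->size, align);
	}

	if (vm->size > end - start || addr > end - vm->size)
		return -1;

	vm->addr = addr;
	vm->next = *p;
	*p = vm;
	return 0;
}
```

With areas already present at 0x1000 and 0x4000 (each 0x1000 bytes), a new 0x2000-byte area lands at 0x2000, in the gap between them, rather than at the start of the range.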
Link: https://lkml.kernel.org/r/20210910053354.26721-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210910053354.26721-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
---
 mm/vmalloc.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6d802924d9e8..6de2ffbe925f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2283,15 +2283,21 @@ void __init vm_area_add_early(struct vm_struct *vm)
  */
 void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 {
-	static size_t vm_init_off __initdata;
-	unsigned long addr;
+	unsigned long addr = ALIGN(VMALLOC_START, align);
+	struct vm_struct *cur, **p;

-	addr = ALIGN(VMALLOC_START + vm_init_off, align);
-	vm_init_off = PFN_ALIGN(addr + vm->size) - VMALLOC_START;
+	BUG_ON(vmap_initialized);

-	vm->addr = (void *)addr;
+	for (p = &vmlist; (cur = *p) != NULL; p = &cur->next) {
+		if ((unsigned long)cur->addr - addr >= vm->size)
+			break;
+		addr = ALIGN((unsigned long)cur->addr + cur->size, align);
+	}

-	vm_area_add_early(vm);
+	BUG_ON(addr > VMALLOC_END - vm->size);
+	vm->addr = (void *)addr;
+	vm->next = *p;
+	*p = vm;
 }

 static void vmap_init_free_space(void)
From: Kefeng Wang <wangkefeng.wang@huawei.com>

mainline inclusion
from mainline-v5.16-rc1
commit 09cea6195073ee1d0f076d907d9249045757245d
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB2BDP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
The percpu embedded first chunk allocator is the first choice, but it can fail on ARM64, e.g.:

  percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
  percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
  percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000

after which we can hit

WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838

and the system cannot boot successfully.

Let's implement the page mapping percpu first chunk allocator as a fallback to the embed allocator to make the system more robust.
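
The fallback order can be summarised with a small user-space sketch. It is only an outline of the control flow, assuming callbacks that return 0 on success or a negative errno; enum fc_kind, first_chunk_fn, setup_first_chunk() and the three stub allocators are hypothetical names introduced for the example, not kernel interfaces.

```c
#include <errno.h>

/* Hypothetical mirror of the allocator selected via percpu_alloc=. */
enum fc_kind { FC_AUTO, FC_EMBED, FC_PAGE };

/* Hypothetical allocator callback: returns 0 or a negative errno. */
typedef int (*first_chunk_fn)(void);

/*
 * Try the embed allocator unless the page allocator was requested
 * explicitly, then fall back to the page allocator on failure.
 */
static int setup_first_chunk(enum fc_kind chosen,
			     first_chunk_fn embed, first_chunk_fn page)
{
	int rc = -EINVAL;

	if (chosen != FC_PAGE)
		rc = embed();
	if (rc < 0)
		rc = page();
	return rc;	/* the kernel panics if this is still negative */
}

/* Stub allocators used to exercise the flow; they only count calls. */
static int embed_calls, page_calls;

static int embed_ok(void)      { embed_calls++; return 0; }
static int embed_too_far(void) { embed_calls++; return -EINVAL; }
static int page_ok(void)       { page_calls++; return 0; }
```

Calling setup_first_chunk(FC_AUTO, embed_too_far, page_ok) exercises the fallback path, while FC_PAGE skips the embed attempt entirely, which corresponds to booting with "percpu_alloc=page".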
Link: https://lkml.kernel.org/r/20210910053354.26721-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	arch/arm64/mm/numa.c
[OLK-5.10 has not merged mainline commit
ae3c107cd8bea82cb7cb427d9c5d305b8ce72216 ("numa: Move numa
implementation to common code"), so drivers/base/arch_numa.c does not
exist. Move pcpu_populate_pte() and the modification of
setup_per_cpu_areas() to arch/arm64/mm/numa.c. In addition, mainline
commit 09cea6195073 ("arm64: support page mapping percpu first chunk
allocator") causes ABI breakage. Fix it by moving the
"#include <asm/pgalloc.h>" statement after the
"#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK".]
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
---
 arch/arm64/Kconfig   |  4 +++
 arch/arm64/mm/numa.c | 84 ++++++++++++++++++++++++++++++++++++++------
 2 files changed, 77 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c57dfa47937f..2591707024d4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1196,6 +1196,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
 	def_bool y
 	depends on NUMA

+config NEED_PER_CPU_PAGE_FIRST_CHUNK
+	def_bool y
+	depends on NUMA
+
 source "kernel/Kconfig.hz"

 config ARCH_SUPPORTS_DEBUG_PAGEALLOC
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index dd72f25452c1..99a746e14f2b 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -342,23 +342,85 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
 	memblock_free_early(__pa(ptr), size);
 }

+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+#include <asm/pgalloc.h>
+
+static void __init pcpu_populate_pte(unsigned long addr)
+{
+	pgd_t *pgd = pgd_offset_k(addr);
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d)) {
+		pud_t *new;
+
+		new = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+		if (!new)
+			goto err_alloc;
+		p4d_populate(&init_mm, p4d, new);
+	}
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud)) {
+		pmd_t *new;
+
+		new = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+		if (!new)
+			goto err_alloc;
+		pud_populate(&init_mm, pud, new);
+	}
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd)) {
+		pte_t *new;
+
+		new = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+		if (!new)
+			goto err_alloc;
+		pmd_populate_kernel(&init_mm, pmd, new);
+	}
+
+	return;
+
+err_alloc:
+	panic("%s: Failed to allocate %lu bytes align=%lx from=%lx\n",
+	      __func__, PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
+}
+#endif
+
 void __init setup_per_cpu_areas(void)
 {
 	unsigned long delta;
 	unsigned int cpu;
-	int rc;
-
-	/*
-	 * Always reserve area for module percpu variables. That's
-	 * what the legacy allocator did.
-	 */
-	rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
-				    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
-				    pcpu_cpu_distance,
-				    pcpu_fc_alloc, pcpu_fc_free);
+	int rc = -EINVAL;
+
+	if (pcpu_chosen_fc != PCPU_FC_PAGE) {
+		/*
+		 * Always reserve area for module percpu variables. That's
+		 * what the legacy allocator did.
+		 */
+		rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
+					    pcpu_cpu_distance,
+					    pcpu_fc_alloc, pcpu_fc_free);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
 	if (rc < 0)
-		panic("Failed to initialize percpu areas.");
+		pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
+			pcpu_fc_names[pcpu_chosen_fc], rc);
+#endif
+	}

+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+	if (rc < 0)
+		rc = pcpu_page_first_chunk(PERCPU_MODULE_RESERVE,
+					   pcpu_fc_alloc,
+					   pcpu_fc_free,
+					   pcpu_populate_pte);
+#endif
+	if (rc < 0)
+		panic("Failed to initialize percpu areas (err=%d).", rc);
 	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
 	for_each_possible_cpu(cpu)
 		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
From: Kefeng Wang <wangkefeng.wang@huawei.com>

mainline inclusion
from mainline-v5.16-rc1
commit 3252b1d8309ea42bc6329d9341072ecf1c9505c0
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/IB2BDP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:
  Unable to handle kernel paging request at virtual address ffff7000028f2000
  ...
  swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
  [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
  Internal error: Oops: 96000007 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
  Hardware name: linux,dummy-virt (DT)
  pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
  pc : kasan_check_range+0x90/0x1a0
  lr : memcpy+0x88/0xf4
  sp : ffff80001378fe20
  ...
  Call trace:
   kasan_check_range+0x90/0x1a0
   pcpu_page_first_chunk+0x3f0/0x568
   setup_per_cpu_areas+0xb8/0x184
   start_kernel+0x8c/0x328
The vm area used in vm_area_register_early() has no KASAN shadow memory. Let's add a new kasan_populate_early_vm_area_shadow() function that populates the shadow memory for the vm area to fix the issue.
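
To make the shadow range concrete, here is a minimal user-space sketch of the address arithmetic. It assumes generic KASAN, where one shadow byte covers 8 bytes of memory; the shadow offset and page size are passed in as plain parameters because their real values depend on the kernel configuration, and struct shadow_range, mem_to_shadow() and early_vm_shadow() are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SHADOW_SCALE_SHIFT 3	/* generic KASAN: 1 shadow byte per 8 bytes */

struct shadow_range {
	uint64_t start;
	uint64_t end;
};

/* Map a memory address to its shadow address. */
static uint64_t mem_to_shadow(uint64_t addr, uint64_t shadow_offset)
{
	return (addr >> SHADOW_SCALE_SHIFT) + shadow_offset;
}

/*
 * Compute the page-aligned shadow range that has to be populated for
 * the area [start, start + size): the start is rounded down and the
 * end is rounded up to a page boundary.
 */
static struct shadow_range early_vm_shadow(uint64_t start, uint64_t size,
					   uint64_t shadow_offset,
					   uint64_t page_size)
{
	struct shadow_range r;

	assert(page_size && (page_size & (page_size - 1)) == 0);

	r.start = mem_to_shadow(start, shadow_offset) & ~(page_size - 1);
	r.end = (mem_to_shadow(start + size, shadow_offset) + page_size - 1)
		& ~(page_size - 1);
	return r;
}
```

Even a small area needs at least one full shadow page, which is why the range is widened to page boundaries before it is populated.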
[wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
  Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.co...
Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Marco Elver <elver@google.com> [KASAN]
Acked-by: Andrey Konovalov <andreyknvl@gmail.com> [KASAN]
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	include/linux/kasan.h
[Because OLK-5.10 does not have mm/kasan/shadow.c, move
"void __init __weak kasan_populate_early_vm_area_shadow(void *start,
unsigned long size)" to mm/kasan/common.c]
Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
---
 arch/arm64/mm/kasan_init.c | 16 ++++++++++++++++
 include/linux/kasan.h      | 10 +++++++++-
 mm/kasan/common.c          |  5 +++++
 mm/vmalloc.c               |  1 +
 4 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 02051d4074c4..952807291615 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -208,6 +208,22 @@ static void __init clear_pgds(unsigned long start,
 		set_pgd(pgd_offset_k(start), __pgd(0));
 }

+#ifdef CONFIG_KASAN_VMALLOC
+void __init kasan_populate_early_vm_area_shadow(void *start, unsigned long size)
+{
+	unsigned long shadow_start, shadow_end;
+
+	if (!is_vmalloc_or_module_addr(start))
+		return;
+
+	shadow_start = (unsigned long)kasan_mem_to_shadow(start);
+	shadow_start = ALIGN_DOWN(shadow_start, PAGE_SIZE);
+	shadow_end = (unsigned long)kasan_mem_to_shadow(start + size);
+	shadow_end = ALIGN(shadow_end, PAGE_SIZE);
+	kasan_map_populate(shadow_start, shadow_end, NUMA_NO_NODE);
+}
+#endif
+
 void __init kasan_init(void)
 {
 	u64 kimg_shadow_start, kimg_shadow_end;
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index c0b976dd138b..894bbeaceb05 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -217,7 +217,10 @@ void kasan_unpoison_vmalloc(const void *start, unsigned long size);
 void kasan_release_vmalloc(unsigned long start, unsigned long end,
 			   unsigned long free_region_start,
 			   unsigned long free_region_end);
-#else
+
+void kasan_populate_early_vm_area_shadow(void *start, unsigned long size);
+
+#else /* CONFIG_KASAN_VMALLOC */
 static inline int kasan_populate_vmalloc(unsigned long start,
 					unsigned long size)
 {
@@ -232,6 +235,11 @@ static inline void kasan_release_vmalloc(unsigned long start,
 					 unsigned long end,
 					 unsigned long free_region_start,
 					 unsigned long free_region_end) {}
+
+static inline void kasan_populate_early_vm_area_shadow(void *start,
+						       unsigned long size)
+{ }
+
 #endif

 #ifdef CONFIG_KASAN
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 592eeba0a787..1958d7d64d1d 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -997,4 +997,9 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
 				       (unsigned long)shadow_end);
 	}
 }
+
+void __init __weak kasan_populate_early_vm_area_shadow(void *start,
+						       unsigned long size)
+{ }
+
 #endif
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6de2ffbe925f..4a2c6ce0ad56 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2298,6 +2298,7 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 	vm->addr = (void *)addr;
 	vm->next = *p;
 	*p = vm;
+	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
 }

 static void vmap_init_free_space(void)
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/13212
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/B...