From: ZhangPeng zhangpeng362@huawei.com
This patch set combines two upstream series: "Mitigate a vmap lock contention" and "mm/vmalloc: lock contention optimization under multi-threading".
In high-concurrency scenarios such as "./test_vmalloc.sh run_test_mask=7 nr_threads=64", performance improves by more than 10x: the run time drops from 18m45.268s to 1m45.066s.
Alexei Starovoitov (3):
  mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
  mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
  mm: Introduce vmap_page_range() to map pages in PCI address space

Baoquan He (3):
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  mm/vmalloc: remove vmap_area_list
  mm/vmalloc.c: optimize to reduce arguments of alloc_vmap_area()

Kefeng Wang (1):
  mm: vmalloc: use old shrinker register

Uladzislau Rezki (Sony) (13):
  mm: vmalloc: add va_alloc() helper
  mm: vmalloc: rename adjust_va_to_fit_type() function
  mm: vmalloc: move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: remove global vmap_area_root rb-tree
  mm: vmalloc: remove global purge_vmap_area_root rb-tree
  mm: vmalloc: offload free_vmap_area_lock lock
  mm: vmalloc: add a scan area of VA only once
  mm: vmalloc: support multiple nodes in vread_iter
  mm: vmalloc: support multiple nodes in vmallocinfo
  mm: vmalloc: set nr_nodes based on CPUs in a system
  mm: vmalloc: add a shrinker to drain vmap pools
  mm: vmalloc: improve description of vmap node layer
  mm: vmalloc: refactor vmalloc_dump_obj() function

rulinhuang (1):
  mm/vmalloc: eliminated the lock contention from twice to once
 .../admin-guide/kdump/vmcoreinfo.rst |    8 +-
 arch/arm/mm/ioremap.c                |    8 +-
 arch/arm64/kernel/crash_core.c       |    1 -
 arch/loongarch/kernel/setup.c        |    2 +-
 arch/mips/loongson64/init.c          |    2 +-
 arch/powerpc/kernel/isa-bridge.c     |    4 +-
 arch/riscv/kernel/crash_core.c       |    1 -
 drivers/pci/pci.c                    |    4 +-
 include/linux/io.h                   |    7 +
 include/linux/vmalloc.h              |    6 +-
 kernel/crash_core.c                  |    4 +-
 kernel/kallsyms_selftest.c           |    1 -
 mm/nommu.c                           |    2 -
 mm/vmalloc.c                         | 1202 +++++++++++++----
 14 files changed, 936 insertions(+), 316 deletions(-)
From: Alexei Starovoitov ast@kernel.org
mainline inclusion
from mainline-v6.9-rc1
commit 3e49a866c9dcbd8173e4f3e491293619a9e81fa4
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
There are various users of the get_vm_area() + ioremap_page_range() APIs. Enforce that the area was requested from get_vm_area() with the VM_IOREMAP type and that the range passed to ioremap_page_range() matches the created vm_area, to avoid accidentally ioremap-ing into the wrong address range.
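As a hedged illustration (not part of this patch; the function name, `phys`, `size` and `prot` are assumed inputs), a caller that satisfies the new checks requests the area with VM_IOREMAP and maps exactly the vm_area's range:

	/* Illustrative sketch only; needs <linux/vmalloc.h> and <linux/io.h>. */
	static void __iomem *example_ioremap(phys_addr_t phys, size_t size, pgprot_t prot)
	{
		struct vm_struct *area;
		unsigned long start, end;

		area = get_vm_area(size, VM_IOREMAP);	/* anything else now fails with -EINVAL */
		if (!area)
			return NULL;

		start = (unsigned long)area->addr;
		end = start + get_vm_area_size(area);	/* must cover the vm_area exactly, else -ERANGE */

		if (ioremap_page_range(start, end, phys, prot)) {
			free_vm_area(area);
			return NULL;
		}

		return (void __iomem *)area->addr;
	}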
Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Andrii Nakryiko andrii@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de
Link: https://lore.kernel.org/bpf/20240305030516.41519-2-alexei.starovoitov@gmail....
(cherry picked from commit 3e49a866c9dcbd8173e4f3e491293619a9e81fa4)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 mm/vmalloc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 31761425d7a8..7043cebddf22 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -307,8 +307,21 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end, int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot) { + struct vm_struct *area; int err;
+ area = find_vm_area((void *)addr); + if (!area || !(area->flags & VM_IOREMAP)) { + WARN_ONCE(1, "vm_area at addr %lx is not marked as VM_IOREMAP\n", addr); + return -EINVAL; + } + if (addr != (unsigned long)area->addr || + (void *)end != area->addr + get_vm_area_size(area)) { + WARN_ONCE(1, "ioremap request [%lx,%lx) doesn't match vm_area [%lx, %lx)\n", + addr, end, (long)area->addr, + (long)area->addr + get_vm_area_size(area)); + return -ERANGE; + } err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot), ioremap_max_page_shift); flush_cache_vmap(addr, end);
From: Baoquan He bhe@redhat.com
mainline inclusion
from mainline-6.7-rc1
commit ca6c2ce1b481996ae5b16e4589d3c0dd61899fa8
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
LKP reported a smatch warning as below:
===================
smatch warnings:
mm/vmalloc.c:3689 vread_iter() error: we previously assumed 'vm' could be null (see line 3667)

......
06c8994626d1b7 @3667  size = vm ? get_vm_area_size(vm) : va_size(va);
......
06c8994626d1b7 @3689  else if (!(vm->flags & VM_IOREMAP))
                                  ^^^^^^^^^
Unchecked dereference
=====================
This is not a runtime bug because the possible null 'vm' in the pointed place could only happen when flags == VMAP_BLOCK. However, the case 'flags == VMAP_BLOCK' should never happen and has been detected with WARN_ON. Please check vm_map_ram() implementation and the earlier checking in vread_iter() at below:
~~~~~~~~~~~~~~~~~~~~~~~~~~
	/*
	 * VMAP_BLOCK indicates a sub-type of vm_map_ram area, need
	 * be set together with VMAP_RAM.
	 */
	WARN_ON(flags == VMAP_BLOCK);

	if (!vm && !flags)
		continue;
~~~~~~~~~~~~~~~~~~~~~~~~~~
So add a check for 'vm' being NULL before dereferencing it in vread_iter(). This mutes the smatch complaint.
Link: https://lkml.kernel.org/r/ZTCURc8ZQE+KrTvS@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/ZS/2k6DIMd0tZRgK@MiWiFi-R3L-srv
Signed-off-by: Baoquan He bhe@redhat.com
Reported-by: kernel test robot lkp@intel.com
Reported-by: Dan Carpenter dan.carpenter@linaro.org
Closes: https://lore.kernel.org/r/202310171600.WCrsOwFj-lkp@intel.com/
Cc: Lorenzo Stoakes lstoakes@gmail.com
Cc: Philip Li philip.li@intel.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
(cherry picked from commit ca6c2ce1b481996ae5b16e4589d3c0dd61899fa8)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 mm/vmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 7043cebddf22..5df166d7426e 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3827,7 +3827,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
if (flags & VMAP_RAM) copied = vmap_ram_vread_iter(iter, addr, n, flags); - else if (!(vm->flags & VM_IOREMAP)) + else if (!(vm && (vm->flags & VM_IOREMAP))) copied = aligned_vread_iter(iter, addr, n); else /* IOREMAP area is treated as memory hole */ copied = zero_iter(iter, n);
From: Alexei Starovoitov ast@kernel.org
mainline inclusion
from mainline-6.9-rc1
commit e6f798225a31485e47a6e4f6aa07ee9fdf80c2cb
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
vmap/vmalloc APIs are used to map a set of pages into contiguous kernel virtual space.
get_vm_area() with an appropriate flag is used to request an area of the kernel address range. It's used for the vmalloc, vmap, ioremap and xen use cases:
- the vmalloc use case dominates the usage; such vm areas have the VM_ALLOC flag;
- the areas created by the vmap() function should be tagged with VM_MAP;
- ioremap areas are tagged with VM_IOREMAP.
BPF would like to extend the vmap API to implement a lazily-populated sparse, yet contiguous kernel virtual space. Introduce the VM_SPARSE flag and the vm_area_map_pages(area, start, end, pages) API to map a set of pages within a given area. It has the same sanity checks as vmap() does. It also checks that the area was created by get_vm_area() with the VM_SPARSE flag, which identifies such areas in /proc/vmallocinfo and makes reads through /proc/kcore return zero pages.
The next commits will introduce bpf_arena, which is a sparsely populated shared memory region between a bpf program and a user space process. It will map privately-managed pages into a sparse vm area with the following steps:
	// request virtual memory region during bpf prog verification
	area = get_vm_area(area_size, VM_SPARSE);

	// on demand
	vm_area_map_pages(area, kaddr, kend, pages);
	vm_area_unmap_pages(area, kaddr, kend);

	// after bpf program is detached and unloaded
	free_vm_area(area);
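A hedged sketch of the on-demand step above (the helper name and the way the page is obtained are illustrative, not part of this series): a single page is mapped at a page-aligned offset inside a VM_SPARSE area.

	/* Illustrative helper: map one page at page-aligned offset 'off' of a sparse area. */
	static int sparse_area_map_one_page(struct vm_struct *area, unsigned long off,
					    struct page *page)
	{
		unsigned long start = (unsigned long)area->addr + off;

		/*
		 * vm_area_map_pages() re-checks that the area carries VM_SPARSE and
		 * that [start, start + PAGE_SIZE) lies within the area, so a bad
		 * offset fails with -ERANGE instead of touching other mappings.
		 */
		return vm_area_map_pages(area, start, start + PAGE_SIZE, &page);
	}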
Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Andrii Nakryiko andrii@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Pasha Tatashin pasha.tatashin@soleen.com
Link: https://lore.kernel.org/bpf/20240305030516.41519-3-alexei.starovoitov@gmail....
(cherry picked from commit e6f798225a31485e47a6e4f6aa07ee9fdf80c2cb)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 include/linux/vmalloc.h |  5 ++++
 mm/vmalloc.c            | 59 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index e7db501b7602..447be1bbc1a6 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -35,6 +35,7 @@ struct iov_iter; /* in uio.h */ #else #define VM_DEFER_KMEMLEAK 0 #endif +#define VM_SPARSE 0x00001000 /* sparse vm_area. not all pages are present. */
#ifdef CONFIG_EXTEND_HUGEPAGE_MAPPING #define VM_HUGE_PAGES 0x00004000 /* vmalloc hugepage mapping only */ @@ -250,6 +251,10 @@ static inline bool is_vm_area_hugepages(const void *addr) }
#ifdef CONFIG_MMU +int vm_area_map_pages(struct vm_struct *area, unsigned long start, + unsigned long end, struct page **pages); +void vm_area_unmap_pages(struct vm_struct *area, unsigned long start, + unsigned long end); void vunmap_range(unsigned long addr, unsigned long end); static inline void set_vm_flush_reset_perms(void *addr) { diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 5df166d7426e..e9472f0480e4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end, return err; }
+static int check_sparse_vm_area(struct vm_struct *area, unsigned long start, + unsigned long end) +{ + might_sleep(); + if (WARN_ON_ONCE(area->flags & VM_FLUSH_RESET_PERMS)) + return -EINVAL; + if (WARN_ON_ONCE(area->flags & VM_NO_GUARD)) + return -EINVAL; + if (WARN_ON_ONCE(!(area->flags & VM_SPARSE))) + return -EINVAL; + if ((end - start) >> PAGE_SHIFT > totalram_pages()) + return -E2BIG; + if (start < (unsigned long)area->addr || + (void *)end > area->addr + get_vm_area_size(area)) + return -ERANGE; + return 0; +} + +/** + * vm_area_map_pages - map pages inside given sparse vm_area + * @area: vm_area + * @start: start address inside vm_area + * @end: end address inside vm_area + * @pages: pages to map (always PAGE_SIZE pages) + */ +int vm_area_map_pages(struct vm_struct *area, unsigned long start, + unsigned long end, struct page **pages) +{ + int err; + + err = check_sparse_vm_area(area, start, end); + if (err) + return err; + + return vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT); +} + +/** + * vm_area_unmap_pages - unmap pages inside given sparse vm_area + * @area: vm_area + * @start: start address inside vm_area + * @end: end address inside vm_area + */ +void vm_area_unmap_pages(struct vm_struct *area, unsigned long start, + unsigned long end) +{ + if (check_sparse_vm_area(area, start, end)) + return; + + vunmap_range(start, end); +} + int is_vmalloc_or_module_addr(const void *x) { /* @@ -3827,9 +3879,9 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
if (flags & VMAP_RAM) copied = vmap_ram_vread_iter(iter, addr, n, flags); - else if (!(vm && (vm->flags & VM_IOREMAP))) + else if (!(vm && (vm->flags & (VM_IOREMAP | VM_SPARSE)))) copied = aligned_vread_iter(iter, addr, n); - else /* IOREMAP area is treated as memory hole */ + else /* IOREMAP | SPARSE area is treated as memory hole */ copied = zero_iter(iter, n);
addr += copied; @@ -4618,6 +4670,9 @@ static int s_show(struct seq_file *m, void *p) if (v->flags & VM_IOREMAP) seq_puts(m, " ioremap");
+ if (v->flags & VM_SPARSE) + seq_puts(m, " sparse"); + if (v->flags & VM_ALLOC) seq_puts(m, " vmalloc");
From: Alexei Starovoitov ast@kernel.org
mainline inclusion
from mainline-6.9-rc1
commit d7bca9199a27b8690ae1c71dc11f825154af7234
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
ioremap_page_range() should be used only for ranges within the vmalloc range, which are allocated by get_vm_area(). PCI has a "resource" allocator that manages the PCI_IOBASE, IO_SPACE_LIMIT address range, hence introduce vmap_page_range() to be used exclusively to map pages in the PCI address space.
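A sketch of the intended call pattern (mirroring the pci_remap_iospace() hunks below; `res` and `phys_addr` are assumed inputs): the PCI I/O window is a fixed, resource-managed range rather than a get_vm_area() allocation, so it is mapped with vmap_page_range() directly and never hits the find_vm_area() checks.

	/* Sketch: map a PCI I/O resource into the fixed PCI_IOBASE window. */
	static int example_remap_iospace(const struct resource *res, phys_addr_t phys_addr)
	{
		unsigned long vaddr = (unsigned long)PCI_IOBASE + res->start;

		/* No vm_struct lookup: this range is owned by the PCI "resource" allocator. */
		return vmap_page_range(vaddr, vaddr + resource_size(res), phys_addr,
				       pgprot_device(PAGE_KERNEL));
	}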
Fixes: 3e49a866c9dc ("mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.")
Reported-by: Miguel Ojeda ojeda@kernel.org
Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Reviewed-by: Christoph Hellwig hch@lst.de
Tested-by: Miguel Ojeda ojeda@kernel.org
Link: https://lore.kernel.org/bpf/CANiq72ka4rir+RTN2FQoT=Vvprp_Ao-CvoYEkSNqtSY+RZj...
(cherry picked from commit d7bca9199a27b8690ae1c71dc11f825154af7234)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 arch/arm/mm/ioremap.c            |  8 ++++----
 arch/loongarch/kernel/setup.c    |  2 +-
 arch/mips/loongson64/init.c      |  2 +-
 arch/powerpc/kernel/isa-bridge.c |  4 ++--
 drivers/pci/pci.c                |  4 ++--
 include/linux/io.h               |  7 +++++++
 mm/vmalloc.c                     | 23 +++++++++++++--------
 7 files changed, 32 insertions(+), 18 deletions(-)
diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c index 2129070065c3..794cfea9f9d4 100644 --- a/arch/arm/mm/ioremap.c +++ b/arch/arm/mm/ioremap.c @@ -110,8 +110,8 @@ void __init add_static_vm_early(struct static_vm *svm) int ioremap_page(unsigned long virt, unsigned long phys, const struct mem_type *mtype) { - return ioremap_page_range(virt, virt + PAGE_SIZE, phys, - __pgprot(mtype->prot_pte)); + return vmap_page_range(virt, virt + PAGE_SIZE, phys, + __pgprot(mtype->prot_pte)); } EXPORT_SYMBOL(ioremap_page);
@@ -466,8 +466,8 @@ int pci_remap_iospace(const struct resource *res, phys_addr_t phys_addr) if (res->end > IO_SPACE_LIMIT) return -EINVAL;
- return ioremap_page_range(vaddr, vaddr + resource_size(res), phys_addr, - __pgprot(get_mem_type(pci_ioremap_mem_type)->prot_pte)); + return vmap_page_range(vaddr, vaddr + resource_size(res), phys_addr, + __pgprot(get_mem_type(pci_ioremap_mem_type)->prot_pte)); } EXPORT_SYMBOL(pci_remap_iospace);
diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c index 4bd592c5019d..bfe1f05b279d 100644 --- a/arch/loongarch/kernel/setup.c +++ b/arch/loongarch/kernel/setup.c @@ -591,7 +591,7 @@ static int __init add_legacy_isa_io(struct fwnode_handle *fwnode, }
vaddr = (unsigned long)(PCI_IOBASE + range->io_start); - ioremap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL)); + vmap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL));
return 0; } diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c index f25caa6aa9d3..ce9bb5405e44 100644 --- a/arch/mips/loongson64/init.c +++ b/arch/mips/loongson64/init.c @@ -177,7 +177,7 @@ static int __init add_legacy_isa_io(struct fwnode_handle *fwnode, resource_size_
vaddr = PCI_IOBASE + range->io_start;
- ioremap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL)); + vmap_page_range(vaddr, vaddr + size, hw_start, pgprot_device(PAGE_KERNEL));
return 0; } diff --git a/arch/powerpc/kernel/isa-bridge.c b/arch/powerpc/kernel/isa-bridge.c index 48e0eaf1ad61..5c064485197a 100644 --- a/arch/powerpc/kernel/isa-bridge.c +++ b/arch/powerpc/kernel/isa-bridge.c @@ -46,8 +46,8 @@ static void remap_isa_base(phys_addr_t pa, unsigned long size) WARN_ON_ONCE(size & ~PAGE_MASK);
if (slab_is_available()) { - if (ioremap_page_range(ISA_IO_BASE, ISA_IO_BASE + size, pa, - pgprot_noncached(PAGE_KERNEL))) + if (vmap_page_range(ISA_IO_BASE, ISA_IO_BASE + size, pa, + pgprot_noncached(PAGE_KERNEL))) vunmap_range(ISA_IO_BASE, ISA_IO_BASE + size); } else { early_ioremap_range(ISA_IO_BASE, pa, size, diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 4e2714908a85..3cb879698424 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4279,8 +4279,8 @@ int pci_remap_iospace(const struct resource *res, phys_addr_t phys_addr) if (res->end > IO_SPACE_LIMIT) return -EINVAL;
- return ioremap_page_range(vaddr, vaddr + resource_size(res), phys_addr, - pgprot_device(PAGE_KERNEL)); + return vmap_page_range(vaddr, vaddr + resource_size(res), phys_addr, + pgprot_device(PAGE_KERNEL)); #else /* * This architecture does not have memory mapped I/O space, diff --git a/include/linux/io.h b/include/linux/io.h index 7304f2a69960..235ba7d80a8f 100644 --- a/include/linux/io.h +++ b/include/linux/io.h @@ -23,12 +23,19 @@ void __iowrite64_copy(void __iomem *to, const void *from, size_t count); #ifdef CONFIG_MMU int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot); +int vmap_page_range(unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot); #else static inline int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot) { return 0; } +static inline int vmap_page_range(unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) +{ + return 0; +} #endif
/* diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e9472f0480e4..f5ac73a90d6d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -304,11 +304,24 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end, return err; }
+int vmap_page_range(unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) +{ + int err; + + err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot), + ioremap_max_page_shift); + flush_cache_vmap(addr, end); + if (!err) + err = kmsan_ioremap_page_range(addr, end, phys_addr, prot, + ioremap_max_page_shift); + return err; +} + int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot) { struct vm_struct *area; - int err;
area = find_vm_area((void *)addr); if (!area || !(area->flags & VM_IOREMAP)) { @@ -322,13 +335,7 @@ int ioremap_page_range(unsigned long addr, unsigned long end, (long)area->addr + get_vm_area_size(area)); return -ERANGE; } - err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot), - ioremap_max_page_shift); - flush_cache_vmap(addr, end); - if (!err) - err = kmsan_ioremap_page_range(addr, end, phys_addr, prot, - ioremap_max_page_shift); - return err; + return vmap_page_range(addr, end, phys_addr, prot); }
static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-6.9-rc1
commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Patch series "Mitigate a vmap lock contention", v3.
1. Motivation
- Offload global vmap locks, making them scale with the number of CPUs;
- If possible and there is an agreement, we can remove the "Per cpu kva allocator" to make the vmap code simpler;
- There were complaints from XFS folk that a vmalloc might be contended on their workloads.
2. Design (high-level overview)
We introduce an effective vmap node logic. A node behaves as an independent entity and serves an allocation request directly (if possible) from its pool. That way it bypasses the global vmap space, which is protected by its own lock.
Access to the pools is serialized by CPUs. The number of nodes is equal to the number of CPUs in a system. Please note the upper threshold is bound to 128 nodes.
Pools are size-segregated and populated based on system demand. The maximum alloc request that can be stored into a segregated storage is 256 pages. The lazy drain path decays a pool by 25% as a first step and, as a second step, repopulates it with freshly freed VAs for reuse instead of returning them into the global space.
When a VA is obtained (alloc path), it is stored in a separate node. The va->va_start address is converted into the node where it should be placed and reside. Doing so balances VAs across the nodes, and as a result access becomes scalable. The addr_to_node() function does the address-to-node conversion.
The vmap space is divided into segments of a fixed size of 16 pages. That way any address can be associated with a segment number. The number of segments is equal to num_possible_cpus(), but not greater than 128. The numbering starts from 0. See below how an address is converted:
	static inline unsigned int
	addr_to_node_id(unsigned long addr)
	{
		return (addr / zone_size) % nr_nodes;
	}
On the free path, a VA can easily be found by converting its "va_start" address to the node it resides in. It is moved from the "busy" data structure to the "lazy" one. Later on, as noted earlier, the lazy kworker decays each node pool and populates it with fresh incoming VAs. Please note, a VA is returned to the node that did the alloc request.
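A simplified sketch of the routing described above, using the names introduced later in this series (vmap_nodes, nr_vmap_nodes, vmap_zone_size and the per-node busy/lazy rb_list); it illustrates the idea and is not a verbatim copy of the final code:

	/* Which node owns a given kernel virtual address. */
	static inline struct vmap_node *addr_to_node(unsigned long addr)
	{
		return &vmap_nodes[(addr / vmap_zone_size) % nr_vmap_nodes];
	}

	/*
	 * Free-path sketch: the VA is unlinked from its node's "busy" tree
	 * and handed over to the same node's "lazy" tree for deferred purging.
	 */
	static void example_defer_free(struct vmap_area *va)
	{
		struct vmap_node *vn = addr_to_node(va->va_start);

		spin_lock(&vn->busy.lock);
		unlink_va(va, &vn->busy.root);
		spin_unlock(&vn->busy.lock);

		spin_lock(&vn->lazy.lock);
		merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
		spin_unlock(&vn->lazy.lock);
	}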
3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
<default perf>
 94.41%  0.89%  [kernel]        [k] _raw_spin_lock
 93.35% 93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
 76.13%  0.28%  [kernel]        [k] __vmalloc_node_range
 72.96%  0.81%  [kernel]        [k] alloc_vmap_area
 56.94%  0.00%  [kernel]        [k] __get_vm_area_node
 41.95%  0.00%  [kernel]        [k] vmalloc
 37.15%  0.01%  [test_vmalloc]  [k] full_fit_alloc_test
 35.17%  0.00%  [kernel]        [k] ret_from_fork_asm
 35.17%  0.00%  [kernel]        [k] ret_from_fork
 35.17%  0.00%  [kernel]        [k] kthread
 35.08%  0.00%  [test_vmalloc]  [k] test_func
 34.45%  0.00%  [test_vmalloc]  [k] fix_size_alloc_test
 28.09%  0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
 23.53%  0.25%  [kernel]        [k] vfree.part.0
 21.72%  0.00%  [kernel]        [k] remove_vm_area
 20.08%  0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%  0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
 82.32%  0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
 63.36%  0.02%  [kernel]        [k] vmalloc
 63.34%  2.64%  [kernel]        [k] __vmalloc_node_range
 30.42%  4.46%  [kernel]        [k] vfree.part.0
 28.98%  2.51%  [kernel]        [k] __alloc_pages_bulk
 27.28%  0.19%  [kernel]        [k] __get_vm_area_node
 26.13%  1.50%  [kernel]        [k] alloc_vmap_area
 21.72% 21.67%  [kernel]        [k] clear_page_rep
 19.51%  2.43%  [kernel]        [k] _raw_spin_lock
 16.61% 16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
 13.40%  2.07%  [kernel]        [k] free_unref_page
 10.62%  0.01%  [kernel]        [k] remove_vm_area
  9.02%  8.73%  [kernel]        [k] insert_vmap_area
  8.94%  0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%  0.00%  [kernel]        [k] ret_from_fork
  8.94%  0.00%  [kernel]        [k] kthread
  8.29%  0.00%  [test_vmalloc]  [k] test_func
  7.81%  0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%  4.73%  [kernel]        [k] purge_vmap_node
  4.47%  2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>
This confirms that native_queued_spin_lock_slowpath goes down to 16.51% from 93.07%.
The throughput is ~12x higher:
urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$
This patch (of 11):
Currently the __alloc_vmap_area() function contains open-coded logic that finds and adjusts a VA based on the allocation request.
Introduce a va_alloc() helper that only adjusts the found VA. There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20240102184633.748113-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20240102184633.748113-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com
Reviewed-by: Baoquan He bhe@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Lorenzo Stoakes lstoakes@gmail.com
Cc: Dave Chinner david@fromorbit.com
Cc: Joel Fernandes (Google) joel@joelfernandes.org
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com
Cc: Paul E. McKenney paulmck@kernel.org
Cc: Kazuhito Hagio k-hagio-ab@nec.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
(cherry picked from commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 mm/vmalloc.c | 41 ++++++++++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 13 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f5ac73a90d6d..01ed7a3c17a9 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1553,6 +1553,32 @@ adjust_va_to_fit_type(struct rb_root *root, struct list_head *head, return 0; }
+static unsigned long +va_alloc(struct vmap_area *va, + struct rb_root *root, struct list_head *head, + unsigned long size, unsigned long align, + unsigned long vstart, unsigned long vend) +{ + unsigned long nva_start_addr; + int ret; + + if (va->va_start > vstart) + nva_start_addr = ALIGN(va->va_start, align); + else + nva_start_addr = ALIGN(vstart, align); + + /* Check the "vend" restriction. */ + if (nva_start_addr + size > vend) + return vend; + + /* Update the free vmap_area. */ + ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size); + if (WARN_ON_ONCE(ret)) + return vend; + + return nva_start_addr; +} + /* * Returns a start address of the newly allocated area, if success. * Otherwise a vend is returned that indicates failure. @@ -1565,7 +1591,6 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, bool adjust_search_size = true; unsigned long nva_start_addr; struct vmap_area *va; - int ret;
/* * Do not adjust when: @@ -1583,18 +1608,8 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, if (unlikely(!va)) return vend;
- if (va->va_start > vstart) - nva_start_addr = ALIGN(va->va_start, align); - else - nva_start_addr = ALIGN(vstart, align); - - /* Check the "vend" restriction. */ - if (nva_start_addr + size > vend) - return vend; - - /* Update the free vmap_area. */ - ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size); - if (WARN_ON_ONCE(ret)) + nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend); + if (nva_start_addr == vend) return vend;
#if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-6.9-rc1
commit 5b75b8e1b9040b34f43809b1948eaa4e83e39112
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
This patch renames the adjust_va_to_fit_type() function to va_clip() which is shorter and more expressive.
There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20240102184633.748113-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com
Reviewed-by: Baoquan He bhe@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Lorenzo Stoakes lstoakes@gmail.com
Cc: Dave Chinner david@fromorbit.com
Cc: Joel Fernandes (Google) joel@joelfernandes.org
Cc: Kazuhito Hagio k-hagio-ab@nec.com
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com
Cc: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
(cherry picked from commit 5b75b8e1b9040b34f43809b1948eaa4e83e39112)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 mm/vmalloc.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 01ed7a3c17a9..bc7505584667 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1454,9 +1454,9 @@ classify_va_fit_type(struct vmap_area *va, }
static __always_inline int -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head, - struct vmap_area *va, unsigned long nva_start_addr, - unsigned long size) +va_clip(struct rb_root *root, struct list_head *head, + struct vmap_area *va, unsigned long nva_start_addr, + unsigned long size) { struct vmap_area *lva = NULL; enum fit_type type = classify_va_fit_type(va, nva_start_addr, size); @@ -1572,7 +1572,7 @@ va_alloc(struct vmap_area *va, return vend;
/* Update the free vmap_area. */ - ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size); + ret = va_clip(root, head, va, nva_start_addr, size); if (WARN_ON_ONCE(ret)) return vend;
@@ -4232,9 +4232,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, /* It is a BUG(), but trigger recovery instead. */ goto recovery;
- ret = adjust_va_to_fit_type(&free_vmap_area_root, - &free_vmap_area_list, - va, start, size); + ret = va_clip(&free_vmap_area_root, + &free_vmap_area_list, va, start, size); if (WARN_ON_ONCE(unlikely(ret))) /* It is a BUG(), but trigger recovery instead. */ goto recovery;
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-6.9-rc1
commit 7fa8cee003166ef6db0bba70d610dbf173543811
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
vmap_init_free_space() is a function that sets up the vmap space and is considered part of the initialization phase. Since the main entry point, vmalloc_init(), has been moved down in vmalloc.c, it makes sense to follow the pattern.
There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20240102184633.748113-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com
Reviewed-by: Baoquan He bhe@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Lorenzo Stoakes lstoakes@gmail.com
Cc: Dave Chinner david@fromorbit.com
Cc: Joel Fernandes (Google) joel@joelfernandes.org
Cc: Kazuhito Hagio k-hagio-ab@nec.com
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com
Cc: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
(cherry picked from commit 7fa8cee003166ef6db0bba70d610dbf173543811)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 mm/vmalloc.c | 82 ++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 41 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index bc7505584667..59cd3f855df7 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2584,47 +2584,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align) kasan_populate_early_vm_area_shadow(vm->addr, vm->size); }
-static void vmap_init_free_space(void) -{ - unsigned long vmap_start = 1; - const unsigned long vmap_end = ULONG_MAX; - struct vmap_area *busy, *free; - - /* - * B F B B B F - * -|-----|.....|-----|-----|-----|.....|- - * | The KVA space | - * |<--------------------------------->| - */ - list_for_each_entry(busy, &vmap_area_list, list) { - if (busy->va_start - vmap_start > 0) { - free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); - if (!WARN_ON_ONCE(!free)) { - free->va_start = vmap_start; - free->va_end = busy->va_start; - - insert_vmap_area_augment(free, NULL, - &free_vmap_area_root, - &free_vmap_area_list); - } - } - - vmap_start = busy->va_end; - } - - if (vmap_end - vmap_start > 0) { - free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); - if (!WARN_ON_ONCE(!free)) { - free->va_start = vmap_start; - free->va_end = vmap_end; - - insert_vmap_area_augment(free, NULL, - &free_vmap_area_root, - &free_vmap_area_list); - } - } -} - static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { @@ -4743,6 +4702,47 @@ module_init(proc_vmalloc_init);
#endif
+static void vmap_init_free_space(void) +{ + unsigned long vmap_start = 1; + const unsigned long vmap_end = ULONG_MAX; + struct vmap_area *busy, *free; + + /* + * B F B B B F + * -|-----|.....|-----|-----|-----|.....|- + * | The KVA space | + * |<--------------------------------->| + */ + list_for_each_entry(busy, &vmap_area_list, list) { + if (busy->va_start - vmap_start > 0) { + free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); + if (!WARN_ON_ONCE(!free)) { + free->va_start = vmap_start; + free->va_end = busy->va_start; + + insert_vmap_area_augment(free, NULL, + &free_vmap_area_root, + &free_vmap_area_list); + } + } + + vmap_start = busy->va_end; + } + + if (vmap_end - vmap_start > 0) { + free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); + if (!WARN_ON_ONCE(!free)) { + free->va_start = vmap_start; + free->va_end = vmap_end; + + insert_vmap_area_augment(free, NULL, + &free_vmap_area_root, + &free_vmap_area_list); + } + } +} + void __init vmalloc_init(void) { struct vmap_area *va;
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-6.9-rc1
commit d093602919ad5908532142a048539800fa94a0d1
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Store allocated objects in separate nodes. The va->va_start address is converted into the node where the VA should be placed and reside. The addr_to_node() function is used to do the address conversion and determine the node that contains a VA.
Such an approach balances VAs across the nodes, and as a result access becomes scalable. The number of nodes in a system depends on the number of CPUs.
Please note:
1. As of now, allocated VAs are bound to node 0. It means the patch does not make any difference compared with the current behavior;
2. The global vmap_area_lock and vmap_area_root are removed as there is no need for them anymore. The vmap_area_list is still kept and is _empty_. It is exported for kexec only;
3. The vmallocinfo and vread() have to be reworked to be able to handle multiple nodes.
[urezki@gmail.com: mark vmap_init_free_space() with __init tag]
Link: https://lkml.kernel.org/r/20240111132628.299644-1-urezki@gmail.com
[urezki@gmail.com: fix a wrong value passed to __find_vmap_area()]
Link: https://lkml.kernel.org/r/20240111121104.180993-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20240102184633.748113-5-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com
Reviewed-by: Baoquan He bhe@redhat.com
Reviewed-by: Christoph Hellwig hch@lst.de
Reviewed-by: Lorenzo Stoakes lstoakes@gmail.com
Reviewed-by: Anshuman Khandual anshuman.khandual@arm.com
Cc: Dave Chinner david@fromorbit.com
Cc: Joel Fernandes (Google) joel@joelfernandes.org
Cc: Kazuhito Hagio k-hagio-ab@nec.com
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com
Cc: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
(cherry picked from commit d093602919ad5908532142a048539800fa94a0d1)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 mm/vmalloc.c | 242 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 174 insertions(+), 68 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 59cd3f855df7..5856db42e1c4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -800,11 +800,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn); #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
-static DEFINE_SPINLOCK(vmap_area_lock); static DEFINE_SPINLOCK(free_vmap_area_lock); /* Export for kexec only */ LIST_HEAD(vmap_area_list); -static struct rb_root vmap_area_root = RB_ROOT; static bool vmap_initialized __read_mostly;
static struct rb_root purge_vmap_area_root = RB_ROOT; @@ -844,6 +842,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT; */ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
+/* + * An effective vmap-node logic. Users make use of nodes instead + * of a global heap. It allows to balance an access and mitigate + * contention. + */ +struct rb_list { + struct rb_root root; + struct list_head head; + spinlock_t lock; +}; + +static struct vmap_node { + /* Bookkeeping data of this node. */ + struct rb_list busy; +} single; + +static struct vmap_node *vmap_nodes = &single; +static __read_mostly unsigned int nr_vmap_nodes = 1; +static __read_mostly unsigned int vmap_zone_size = 1; + +static inline unsigned int +addr_to_node_id(unsigned long addr) +{ + return (addr / vmap_zone_size) % nr_vmap_nodes; +} + +static inline struct vmap_node * +addr_to_node(unsigned long addr) +{ + return &vmap_nodes[addr_to_node_id(addr)]; +} + static __always_inline unsigned long va_size(struct vmap_area *va) { @@ -875,10 +905,11 @@ unsigned long vmalloc_nr_pages(void) }
/* Look up the first VA which satisfies addr < va_end, NULL if none. */ -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) +static struct vmap_area * +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) { struct vmap_area *va = NULL; - struct rb_node *n = vmap_area_root.rb_node; + struct rb_node *n = root->rb_node;
addr = (unsigned long)kasan_reset_tag((void *)addr);
@@ -1624,12 +1655,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, */ static void free_vmap_area(struct vmap_area *va) { + struct vmap_node *vn = addr_to_node(va->va_start); + /* * Remove from the busy tree/list. */ - spin_lock(&vmap_area_lock); - unlink_va(va, &vmap_area_root); - spin_unlock(&vmap_area_lock); + spin_lock(&vn->busy.lock); + unlink_va(va, &vn->busy.root); + spin_unlock(&vn->busy.lock);
/* * Insert/Merge it back to the free tree/list. @@ -1672,6 +1705,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, int node, gfp_t gfp_mask, unsigned long va_flags) { + struct vmap_node *vn; struct vmap_area *va; unsigned long freed; unsigned long addr; @@ -1717,9 +1751,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va->vm = NULL; va->flags = va_flags;
- spin_lock(&vmap_area_lock); - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); - spin_unlock(&vmap_area_lock); + vn = addr_to_node(va->va_start); + + spin_lock(&vn->busy.lock); + insert_vmap_area(va, &vn->busy.root, &vn->busy.head); + spin_unlock(&vn->busy.lock);
BUG_ON(!IS_ALIGNED(va->va_start, align)); BUG_ON(va->va_start < vstart); @@ -1943,26 +1979,61 @@ static void free_unmap_vmap_area(struct vmap_area *va)
struct vmap_area *find_vmap_area(unsigned long addr) { + struct vmap_node *vn; struct vmap_area *va; + int i, j;
- spin_lock(&vmap_area_lock); - va = __find_vmap_area(addr, &vmap_area_root); - spin_unlock(&vmap_area_lock); + /* + * An addr_to_node_id(addr) converts an address to a node index + * where a VA is located. If VA spans several zones and passed + * addr is not the same as va->va_start, what is not common, we + * may need to scan an extra nodes. See an example: + * + * <--va--> + * -|-----|-----|-----|-----|- + * 1 2 0 1 + * + * VA resides in node 1 whereas it spans 1 and 2. If passed + * addr is within a second node we should do extra work. We + * should mention that it is rare and is a corner case from + * the other hand it has to be covered. + */ + i = j = addr_to_node_id(addr); + do { + vn = &vmap_nodes[i];
- return va; + spin_lock(&vn->busy.lock); + va = __find_vmap_area(addr, &vn->busy.root); + spin_unlock(&vn->busy.lock); + + if (va) + return va; + } while ((i = (i + 1) % nr_vmap_nodes) != j); + + return NULL; }
static struct vmap_area *find_unlink_vmap_area(unsigned long addr) { + struct vmap_node *vn; struct vmap_area *va; + int i, j;
- spin_lock(&vmap_area_lock); - va = __find_vmap_area(addr, &vmap_area_root); - if (va) - unlink_va(va, &vmap_area_root); - spin_unlock(&vmap_area_lock); + i = j = addr_to_node_id(addr); + do { + vn = &vmap_nodes[i];
- return va; + spin_lock(&vn->busy.lock); + va = __find_vmap_area(addr, &vn->busy.root); + if (va) + unlink_va(va, &vn->busy.root); + spin_unlock(&vn->busy.lock); + + if (va) + return va; + } while ((i = (i + 1) % nr_vmap_nodes) != j); + + return NULL; }
/*** Per cpu kva allocator ***/ @@ -2164,6 +2235,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
static void free_vmap_block(struct vmap_block *vb) { + struct vmap_node *vn; struct vmap_block *tmp; struct xarray *xa;
@@ -2171,9 +2243,10 @@ static void free_vmap_block(struct vmap_block *vb) tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start)); BUG_ON(tmp != vb);
- spin_lock(&vmap_area_lock); - unlink_va(vb->va, &vmap_area_root); - spin_unlock(&vmap_area_lock); + vn = addr_to_node(vb->va->va_start); + spin_lock(&vn->busy.lock); + unlink_va(vb->va, &vn->busy.root); + spin_unlock(&vn->busy.lock);
free_vmap_area_noflush(vb->va); kfree_rcu(vb, rcu_head); @@ -2597,9 +2670,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { - spin_lock(&vmap_area_lock); + struct vmap_node *vn = addr_to_node(va->va_start); + + spin_lock(&vn->busy.lock); setup_vmalloc_vm_locked(vm, va, flags, caller); - spin_unlock(&vmap_area_lock); + spin_unlock(&vn->busy.lock); }
static void clear_vm_uninitialized_flag(struct vm_struct *vm) @@ -3792,6 +3867,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr, */ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) { + struct vmap_node *vn; struct vmap_area *va; struct vm_struct *vm; char *vaddr; @@ -3805,8 +3881,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
remains = count;
- spin_lock(&vmap_area_lock); - va = find_vmap_area_exceed_addr((unsigned long)addr); + /* Hooked to node_0 so far. */ + vn = addr_to_node(0); + spin_lock(&vn->busy.lock); + + va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root); if (!va) goto finished_zero;
@@ -3814,7 +3893,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) if ((unsigned long)addr + remains <= va->va_start) goto finished_zero;
- list_for_each_entry_from(va, &vmap_area_list, list) { + list_for_each_entry_from(va, &vn->busy.head, list) { size_t copied;
if (remains == 0) @@ -3873,12 +3952,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) }
finished_zero: - spin_unlock(&vmap_area_lock); + spin_unlock(&vn->busy.lock); /* zero-fill memory holes */ return count - remains + zero_iter(iter, remains); finished: /* Nothing remains, or We couldn't copy/zero everything. */ - spin_unlock(&vmap_area_lock); + spin_unlock(&vn->busy.lock);
return count - remains; } @@ -4212,14 +4291,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, }
/* insert all vm's */ - spin_lock(&vmap_area_lock); for (area = 0; area < nr_vms; area++) { - insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list); + struct vmap_node *vn = addr_to_node(vas[area]->va_start);
+ spin_lock(&vn->busy.lock); + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, pcpu_get_vm_areas); + spin_unlock(&vn->busy.lock); } - spin_unlock(&vmap_area_lock);
/* * Mark allocated areas as accessible. Do it now as a best-effort @@ -4330,31 +4410,32 @@ bool vmalloc_dump_obj(void *object) { void *objp = (void *)PAGE_ALIGN((unsigned long)object); const void *caller; - struct vm_struct *vm; struct vmap_area *va; + struct vmap_node *vn; unsigned long addr; unsigned int nr_pages; + bool success = false;
- if (!spin_trylock(&vmap_area_lock)) - return false; - va = __find_vmap_area((unsigned long)objp, &vmap_area_root); - if (!va) { - spin_unlock(&vmap_area_lock); - return false; - } + vn = addr_to_node((unsigned long)objp);
- vm = va->vm; - if (!vm) { - spin_unlock(&vmap_area_lock); - return false; + if (spin_trylock(&vn->busy.lock)) { + va = __find_vmap_area((unsigned long)objp, &vn->busy.root); + + if (va && va->vm) { + addr = (unsigned long)va->vm->addr; + caller = va->vm->caller; + nr_pages = va->vm->nr_pages; + success = true; + } + + spin_unlock(&vn->busy.lock); } - addr = (unsigned long)vm->addr; - caller = vm->caller; - nr_pages = vm->nr_pages; - spin_unlock(&vmap_area_lock); - pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", - nr_pages, addr, caller); - return true; + + if (success) + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", + nr_pages, addr, caller); + + return success; } #endif
@@ -4558,25 +4639,26 @@ EXPORT_SYMBOL(remap_vmalloc_hugepage_range);
#ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) - __acquires(&vmap_purge_lock) - __acquires(&vmap_area_lock) { + struct vmap_node *vn = addr_to_node(0); + mutex_lock(&vmap_purge_lock); - spin_lock(&vmap_area_lock); + spin_lock(&vn->busy.lock);
- return seq_list_start(&vmap_area_list, *pos); + return seq_list_start(&vn->busy.head, *pos); }
static void *s_next(struct seq_file *m, void *p, loff_t *pos) { - return seq_list_next(p, &vmap_area_list, pos); + struct vmap_node *vn = addr_to_node(0); + return seq_list_next(p, &vn->busy.head, pos); }
static void s_stop(struct seq_file *m, void *p) - __releases(&vmap_area_lock) - __releases(&vmap_purge_lock) { - spin_unlock(&vmap_area_lock); + struct vmap_node *vn = addr_to_node(0); + + spin_unlock(&vn->busy.lock); mutex_unlock(&vmap_purge_lock); }
@@ -4619,9 +4701,11 @@ static void show_purge_info(struct seq_file *m)
static int s_show(struct seq_file *m, void *p) { + struct vmap_node *vn; struct vmap_area *va; struct vm_struct *v;
+ vn = addr_to_node(0); va = list_entry(p, struct vmap_area, list);
if (!va->vm) { @@ -4675,7 +4759,7 @@ static int s_show(struct seq_file *m, void *p) * As a final step, dump "unpurged" areas. */ final: - if (list_is_last(&va->list, &vmap_area_list)) + if (list_is_last(&va->list, &vn->busy.head)) show_purge_info(m);
return 0; @@ -4702,11 +4786,12 @@ module_init(proc_vmalloc_init);
#endif
-static void vmap_init_free_space(void) +static void __init vmap_init_free_space(void) { unsigned long vmap_start = 1; const unsigned long vmap_end = ULONG_MAX; - struct vmap_area *busy, *free; + struct vmap_area *free; + struct vm_struct *busy;
/* * B F B B B F @@ -4714,12 +4799,12 @@ static void vmap_init_free_space(void) * | The KVA space | * |<--------------------------------->| */ - list_for_each_entry(busy, &vmap_area_list, list) { - if (busy->va_start - vmap_start > 0) { + for (busy = vmlist; busy; busy = busy->next) { + if ((unsigned long) busy->addr - vmap_start > 0) { free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); if (!WARN_ON_ONCE(!free)) { free->va_start = vmap_start; - free->va_end = busy->va_start; + free->va_end = (unsigned long) busy->addr;
insert_vmap_area_augment(free, NULL, &free_vmap_area_root, @@ -4727,7 +4812,7 @@ static void vmap_init_free_space(void) } }
- vmap_start = busy->va_end; + vmap_start = (unsigned long) busy->addr + busy->size; }
if (vmap_end - vmap_start > 0) { @@ -4743,9 +4828,23 @@ static void vmap_init_free_space(void) } }
+static void vmap_init_nodes(void) +{ + struct vmap_node *vn; + int i; + + for (i = 0; i < nr_vmap_nodes; i++) { + vn = &vmap_nodes[i]; + vn->busy.root = RB_ROOT; + INIT_LIST_HEAD(&vn->busy.head); + spin_lock_init(&vn->busy.lock); + } +} + void __init vmalloc_init(void) { struct vmap_area *va; + struct vmap_node *vn; struct vm_struct *tmp; int i;
@@ -4767,6 +4866,11 @@ void __init vmalloc_init(void) xa_init(&vbq->vmap_blocks); }
+ /* + * Setup nodes before importing vmlist. + */ + vmap_init_nodes(); + /* Import existing vmlist entries. */ for (tmp = vmlist; tmp; tmp = tmp->next) { va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); @@ -4776,7 +4880,9 @@ void __init vmalloc_init(void) va->va_start = (unsigned long)tmp->addr; va->va_end = va->va_start + tmp->size; va->vm = tmp; - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); + + vn = addr_to_node(va->va_start); + insert_vmap_area(va, &vn->busy.root, &vn->busy.head); }
/*
From: Baoquan He bhe@redhat.com
mainline inclusion
from mainline-6.9-rc1
commit 55c49fee57af99f3c663e69dedc5b85e691bbe50
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Earlier, vmap_area_list was exported to vmcoreinfo so that makedumpfile could get the base address of the vmalloc area. Now vmap_area_list is empty, so export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list.
[urezki@gmail.com: fix a warning in the crash_save_vmcoreinfo_init()]
Link: https://lkml.kernel.org/r/20240111192329.449189-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20240102184633.748113-6-urezki@gmail.com
Signed-off-by: Baoquan He bhe@redhat.com
Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com
Acked-by: Lorenzo Stoakes lstoakes@gmail.com
Cc: Christoph Hellwig hch@lst.de
Cc: Dave Chinner david@fromorbit.com
Cc: Joel Fernandes (Google) joel@joelfernandes.org
Cc: Kazuhito Hagio k-hagio-ab@nec.com
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com
Cc: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
(cherry picked from commit 55c49fee57af99f3c663e69dedc5b85e691bbe50)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
 arch/arm64/kernel/crash_core.c                 | 1 -
 arch/riscv/kernel/crash_core.c                 | 1 -
 include/linux/vmalloc.h                        | 1 -
 kernel/crash_core.c                            | 4 +---
 kernel/kallsyms_selftest.c                     | 1 -
 mm/nommu.c                                     | 2 --
 mm/vmalloc.c                                   | 2 --
 8 files changed, 5 insertions(+), 15 deletions(-)
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 599e8d3bcbc3..c11bd4b1ceb1 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates the kernel start address. Used to convert a virtual address from the direct kernel map to a physical address.
-vmap_area_list --------------- +VMALLOC_START +-------------
-Stores the virtual area list. makedumpfile gets the vmalloc start value -from this variable and its value is necessary for vmalloc translation. +Stores the base address of vmalloc area. makedumpfile gets this value +since is necessary for vmalloc translation.
mem_map ------- diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c index 66cde752cd74..2a24199a9b81 100644 --- a/arch/arm64/kernel/crash_core.c +++ b/arch/arm64/kernel/crash_core.c @@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void) /* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */ vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR); vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END); - vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START); vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END); vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START); vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END); diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c index 8706736fd4e2..d18d529fd9b9 100644 --- a/arch/riscv/kernel/crash_core.c +++ b/arch/riscv/kernel/crash_core.c @@ -8,7 +8,6 @@ void arch_crash_save_vmcoreinfo(void) VMCOREINFO_NUMBER(phys_ram_base);
vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET); - vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START); vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END); #ifdef CONFIG_MMU VMCOREINFO_NUMBER(VA_BITS); diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 447be1bbc1a6..333f5a67d171 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -276,7 +276,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count); /* * Internals. Don't use.. */ -extern struct list_head vmap_area_list; extern __init void vm_area_add_early(struct vm_struct *vm); extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 2f675ef045d4..05d1d7168c5b 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -617,7 +617,7 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir); #endif VMCOREINFO_SYMBOL(_stext); - VMCOREINFO_SYMBOL(vmap_area_list); + vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", (unsigned long) VMALLOC_START);
#ifndef CONFIG_NUMA VMCOREINFO_SYMBOL(mem_map); @@ -658,8 +658,6 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_OFFSET(free_area, free_list); VMCOREINFO_OFFSET(list_head, next); VMCOREINFO_OFFSET(list_head, prev); - VMCOREINFO_OFFSET(vmap_area, va_start); - VMCOREINFO_OFFSET(vmap_area, list); VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1); log_buf_vmcoreinfo_setup(); VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES); diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c index b4cac76ea5e9..8a689b4ff4f9 100644 --- a/kernel/kallsyms_selftest.c +++ b/kernel/kallsyms_selftest.c @@ -89,7 +89,6 @@ static struct test_item test_items[] = { ITEM_DATA(kallsyms_test_var_data_static), ITEM_DATA(kallsyms_test_var_bss), ITEM_DATA(kallsyms_test_var_data), - ITEM_DATA(vmap_area_list), #endif };
diff --git a/mm/nommu.c b/mm/nommu.c index 7f9e9e5a0e12..8c6686176ebd 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL(follow_pfn);
-LIST_HEAD(vmap_area_list); - void vfree(const void *addr) { kfree(addr); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 5856db42e1c4..beb5a6e6ebde 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -801,8 +801,6 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
static DEFINE_SPINLOCK(free_vmap_area_lock); -/* Export for kexec only */ -LIST_HEAD(vmap_area_list); static bool vmap_initialized __read_mostly;
static struct rb_root purge_vmap_area_root = RB_ROOT;
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-6.9-rc1
commit 282631cb2447318e2a55b41a665dbe8571c46d70
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Similar to a busy VA, a lazily-freed area is stored in the node it belongs to. Such an approach does not require any global locking primitive; instead, access becomes scalable, which mitigates contention.
This patch removes the global purge lock, the global purge tree and the global purge list.
Link: https://lkml.kernel.org/r/20240102184633.748113-7-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com
Reviewed-by: Baoquan He bhe@redhat.com
Cc: Christoph Hellwig hch@lst.de
Cc: Dave Chinner david@fromorbit.com
Cc: Joel Fernandes (Google) joel@joelfernandes.org
Cc: Kazuhito Hagio k-hagio-ab@nec.com
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Lorenzo Stoakes lstoakes@gmail.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com
Cc: Paul E. McKenney paulmck@kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
(cherry picked from commit 282631cb2447318e2a55b41a665dbe8571c46d70)
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
---
 mm/vmalloc.c | 135 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 82 insertions(+), 53 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index beb5a6e6ebde..adf239513496 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -803,10 +803,6 @@ EXPORT_SYMBOL(vmalloc_to_pfn); static DEFINE_SPINLOCK(free_vmap_area_lock); static bool vmap_initialized __read_mostly;
-static struct rb_root purge_vmap_area_root = RB_ROOT; -static LIST_HEAD(purge_vmap_area_list); -static DEFINE_SPINLOCK(purge_vmap_area_lock); - /* * This kmem_cache is used for vmap_area objects. Instead of * allocating from slab we reuse an object from this cache to @@ -854,6 +850,12 @@ struct rb_list { static struct vmap_node { /* Bookkeeping data of this node. */ struct rb_list busy; + struct rb_list lazy; + + /* + * Ready-to-free areas. + */ + struct list_head purge_list; } single;
static struct vmap_node *vmap_nodes = &single; @@ -1838,40 +1840,22 @@ static DEFINE_MUTEX(vmap_purge_lock);
/* for per-CPU blocks */ static void purge_fragmented_blocks_allcpus(void); +static cpumask_t purge_nodes;
/* * Purges all lazily-freed vmap areas. */ -static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) +static unsigned long +purge_vmap_node(struct vmap_node *vn) { - unsigned long resched_threshold; - unsigned int num_purged_areas = 0; - struct list_head local_purge_list; + unsigned long num_purged_areas = 0; struct vmap_area *va, *n_va;
- lockdep_assert_held(&vmap_purge_lock); - - spin_lock(&purge_vmap_area_lock); - purge_vmap_area_root = RB_ROOT; - list_replace_init(&purge_vmap_area_list, &local_purge_list); - spin_unlock(&purge_vmap_area_lock); - - if (unlikely(list_empty(&local_purge_list))) - goto out; - - start = min(start, - list_first_entry(&local_purge_list, - struct vmap_area, list)->va_start); - - end = max(end, - list_last_entry(&local_purge_list, - struct vmap_area, list)->va_end); - - flush_tlb_kernel_range(start, end); - resched_threshold = lazy_max_pages() << 1; + if (list_empty(&vn->purge_list)) + return 0;
spin_lock(&free_vmap_area_lock); - list_for_each_entry_safe(va, n_va, &local_purge_list, list) { + list_for_each_entry_safe(va, n_va, &vn->purge_list, list) { unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; unsigned long orig_start = va->va_start; unsigned long orig_end = va->va_end; @@ -1893,13 +1877,55 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
atomic_long_sub(nr, &vmap_lazy_nr); num_purged_areas++; - - if (atomic_long_read(&vmap_lazy_nr) < resched_threshold) - cond_resched_lock(&free_vmap_area_lock); } spin_unlock(&free_vmap_area_lock);
-out: + return num_purged_areas; +} + +/* + * Purges all lazily-freed vmap areas. + */ +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) +{ + unsigned long num_purged_areas = 0; + struct vmap_node *vn; + int i; + + lockdep_assert_held(&vmap_purge_lock); + purge_nodes = CPU_MASK_NONE; + + for (i = 0; i < nr_vmap_nodes; i++) { + vn = &vmap_nodes[i]; + + INIT_LIST_HEAD(&vn->purge_list); + + if (RB_EMPTY_ROOT(&vn->lazy.root)) + continue; + + spin_lock(&vn->lazy.lock); + WRITE_ONCE(vn->lazy.root.rb_node, NULL); + list_replace_init(&vn->lazy.head, &vn->purge_list); + spin_unlock(&vn->lazy.lock); + + start = min(start, list_first_entry(&vn->purge_list, + struct vmap_area, list)->va_start); + + end = max(end, list_last_entry(&vn->purge_list, + struct vmap_area, list)->va_end); + + cpumask_set_cpu(i, &purge_nodes); + } + + if (cpumask_weight(&purge_nodes) > 0) { + flush_tlb_kernel_range(start, end); + + for_each_cpu(i, &purge_nodes) { + vn = &nodes[i]; + num_purged_areas += purge_vmap_node(vn); + } + } + trace_purge_vmap_area_lazy(start, end, num_purged_areas); return num_purged_areas > 0; } @@ -1918,16 +1944,9 @@ static void reclaim_and_purge_vmap_areas(void)
static void drain_vmap_area_work(struct work_struct *work) { - unsigned long nr_lazy; - - do { - mutex_lock(&vmap_purge_lock); - __purge_vmap_area_lazy(ULONG_MAX, 0); - mutex_unlock(&vmap_purge_lock); - - /* Recheck if further work is required. */ - nr_lazy = atomic_long_read(&vmap_lazy_nr); - } while (nr_lazy > lazy_max_pages()); + mutex_lock(&vmap_purge_lock); + __purge_vmap_area_lazy(ULONG_MAX, 0); + mutex_unlock(&vmap_purge_lock); }
/* @@ -1937,6 +1956,7 @@ static void drain_vmap_area_work(struct work_struct *work) */ static void free_vmap_area_noflush(struct vmap_area *va) { + struct vmap_node *vn = addr_to_node(va->va_start); unsigned long nr_lazy_max = lazy_max_pages(); unsigned long va_start = va->va_start; unsigned long nr_lazy; @@ -1950,10 +1970,9 @@ static void free_vmap_area_noflush(struct vmap_area *va) /* * Merge or place it to the purge tree/list. */ - spin_lock(&purge_vmap_area_lock); - merge_or_add_vmap_area(va, - &purge_vmap_area_root, &purge_vmap_area_list); - spin_unlock(&purge_vmap_area_lock); + spin_lock(&vn->lazy.lock); + merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head); + spin_unlock(&vn->lazy.lock);
trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
@@ -4686,15 +4705,21 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)
static void show_purge_info(struct seq_file *m) { + struct vmap_node *vn; struct vmap_area *va; + int i;
- spin_lock(&purge_vmap_area_lock); - list_for_each_entry(va, &purge_vmap_area_list, list) { - seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n", - (void *)va->va_start, (void *)va->va_end, - va->va_end - va->va_start); + for (i = 0; i < nr_vmap_nodes; i++) { + vn = &vmap_nodes[i]; + + spin_lock(&vn->lazy.lock); + list_for_each_entry(va, &vn->lazy.head, list) { + seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n", + (void *)va->va_start, (void *)va->va_end, + va->va_end - va->va_start); + } + spin_unlock(&vn->lazy.lock); } - spin_unlock(&purge_vmap_area_lock); }
static int s_show(struct seq_file *m, void *p) @@ -4836,6 +4861,10 @@ static void vmap_init_nodes(void) vn->busy.root = RB_ROOT; INIT_LIST_HEAD(&vn->busy.head); spin_lock_init(&vn->busy.lock); + + vn->lazy.root = RB_ROOT; + INIT_LIST_HEAD(&vn->lazy.head); + spin_lock_init(&vn->lazy.lock); } }
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 72210662c5a2b6005f6daea7fe293a0dc573e1a5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Concurrent access to a global vmap space is a bottleneck. We can simulate high contention by running the vmalloc test suite.

To address it, introduce an effective vmap node logic. Each node behaves as an independent entity. When a node is accessed, it serves a request directly (if possible) from its pool.

This model has size-based pools for requests, i.e. pools are size-segregated and populated based on real demand. The maximum object size that a pool can handle is set to 256 pages.

This technique reduces the pressure on the global vmap lock.
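A rough sketch of the size-to-pool mapping described above (illustrative only; size_to_pool_idx() is a simplified stand-in for the size_to_va_pool() helper added below): the pool index is derived from the request size in pages, and anything above 256 pages falls back to the global path.

#include <stdio.h>

#define PAGE_SIZE          4096UL
#define MAX_VA_SIZE_PAGES  256      /* pools cover sizes up to 1M */

/*
 * Map an allocation size to a pool index: index = pages - 1,
 * or -1 if the size is too large for the fast per-node pools.
 */
static int size_to_pool_idx(unsigned long size)
{
	unsigned long idx = (size - 1) / PAGE_SIZE;

	return idx < MAX_VA_SIZE_PAGES ? (int)idx : -1;
}

int main(void)
{
	printf("4K   -> pool %d\n", size_to_pool_idx(4096));        /* 0 */
	printf("8K   -> pool %d\n", size_to_pool_idx(8192));        /* 1 */
	printf("1M   -> pool %d\n", size_to_pool_idx(256 * 4096));  /* 255 */
	printf("2M   -> pool %d\n", size_to_pool_idx(512 * 4096));  /* -1, global path */
	return 0;
}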
Link: https://lkml.kernel.org/r/20240102184633.748113-8-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Dave Chinner david@fromorbit.com Cc: Joel Fernandes (Google) joel@joelfernandes.org Cc: Kazuhito Hagio k-hagio-ab@nec.com Cc: Liam R. Howlett Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Cc: Paul E. McKenney paulmck@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 72210662c5a2b6005f6daea7fe293a0dc573e1a5) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 387 +++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 342 insertions(+), 45 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index adf239513496..ea14123e49e5 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -847,7 +847,22 @@ struct rb_list { spinlock_t lock; };
+struct vmap_pool { + struct list_head head; + unsigned long len; +}; + +/* + * A fast size storage contains VAs up to 1M size. + */ +#define MAX_VA_SIZE_PAGES 256 + static struct vmap_node { + /* Simple size segregated storage. */ + struct vmap_pool pool[MAX_VA_SIZE_PAGES]; + spinlock_t pool_lock; + bool skip_populate; + /* Bookkeeping data of this node. */ struct rb_list busy; struct rb_list lazy; @@ -856,6 +871,8 @@ static struct vmap_node { * Ready-to-free areas. */ struct list_head purge_list; + struct work_struct purge_work; + unsigned long nr_purged; } single;
static struct vmap_node *vmap_nodes = &single; @@ -874,6 +891,61 @@ addr_to_node(unsigned long addr) return &vmap_nodes[addr_to_node_id(addr)]; }
+static inline struct vmap_node * +id_to_node(unsigned int id) +{ + return &vmap_nodes[id % nr_vmap_nodes]; +} + +/* + * We use the value 0 to represent "no node", that is why + * an encoded value will be the node-id incremented by 1. + * It is always greater then 0. A valid node_id which can + * be encoded is [0:nr_vmap_nodes - 1]. If a passed node_id + * is not valid 0 is returned. + */ +static unsigned int +encode_vn_id(unsigned int node_id) +{ + /* Can store U8_MAX [0:254] nodes. */ + if (node_id < nr_vmap_nodes) + return (node_id + 1) << BITS_PER_BYTE; + + /* Warn and no node encoded. */ + WARN_ONCE(1, "Encode wrong node id (%u)\n", node_id); + return 0; +} + +/* + * Returns an encoded node-id, the valid range is within + * [0:nr_vmap_nodes-1] values. Otherwise nr_vmap_nodes is + * returned if extracted data is wrong. + */ +static unsigned int +decode_vn_id(unsigned int val) +{ + unsigned int node_id = (val >> BITS_PER_BYTE) - 1; + + /* Can store U8_MAX [0:254] nodes. */ + if (node_id < nr_vmap_nodes) + return node_id; + + /* If it was _not_ zero, warn. */ + WARN_ONCE(node_id != UINT_MAX, + "Decode wrong node id (%d)\n", node_id); + + return nr_vmap_nodes; +} + +static bool +is_vn_id_valid(unsigned int node_id) +{ + if (node_id < nr_vmap_nodes) + return true; + + return false; +} + static __always_inline unsigned long va_size(struct vmap_area *va) { @@ -1695,6 +1767,104 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node) kmem_cache_free(vmap_area_cachep, va); }
+static struct vmap_pool * +size_to_va_pool(struct vmap_node *vn, unsigned long size) +{ + unsigned int idx = (size - 1) / PAGE_SIZE; + + if (idx < MAX_VA_SIZE_PAGES) + return &vn->pool[idx]; + + return NULL; +} + +static bool +node_pool_add_va(struct vmap_node *n, struct vmap_area *va) +{ + struct vmap_pool *vp; + + vp = size_to_va_pool(n, va_size(va)); + if (!vp) + return false; + + spin_lock(&n->pool_lock); + list_add(&va->list, &vp->head); + WRITE_ONCE(vp->len, vp->len + 1); + spin_unlock(&n->pool_lock); + + return true; +} + +static struct vmap_area * +node_pool_del_va(struct vmap_node *vn, unsigned long size, + unsigned long align, unsigned long vstart, + unsigned long vend) +{ + struct vmap_area *va = NULL; + struct vmap_pool *vp; + int err = 0; + + vp = size_to_va_pool(vn, size); + if (!vp || list_empty(&vp->head)) + return NULL; + + spin_lock(&vn->pool_lock); + if (!list_empty(&vp->head)) { + va = list_first_entry(&vp->head, struct vmap_area, list); + + if (IS_ALIGNED(va->va_start, align)) { + /* + * Do some sanity check and emit a warning + * if one of below checks detects an error. + */ + err |= (va_size(va) != size); + err |= (va->va_start < vstart); + err |= (va->va_end > vend); + + if (!WARN_ON_ONCE(err)) { + list_del_init(&va->list); + WRITE_ONCE(vp->len, vp->len - 1); + } else { + va = NULL; + } + } else { + list_move_tail(&va->list, &vp->head); + va = NULL; + } + } + spin_unlock(&vn->pool_lock); + + return va; +} + +static struct vmap_area * +node_alloc(unsigned long size, unsigned long align, + unsigned long vstart, unsigned long vend, + unsigned long *addr, unsigned int *vn_id) +{ + struct vmap_area *va; + + *vn_id = 0; + *addr = vend; + + /* + * Fallback to a global heap if not vmalloc or there + * is only one node. + */ + if (vstart != VMALLOC_START || vend != VMALLOC_END || + nr_vmap_nodes == 1) + return NULL; + + *vn_id = raw_smp_processor_id() % nr_vmap_nodes; + va = node_pool_del_va(id_to_node(*vn_id), size, align, vstart, vend); + *vn_id = encode_vn_id(*vn_id); + + if (va) + *addr = va->va_start; + + return va; +} + /* * Allocate a region of KVA of the specified size and alignment, within the * vstart and vend. @@ -1709,6 +1879,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, struct vmap_area *va; unsigned long freed; unsigned long addr; + unsigned int vn_id; int purged = 0; int ret;
@@ -1719,11 +1890,23 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, return ERR_PTR(-EBUSY);
might_sleep(); - gfp_mask = gfp_mask & GFP_RECLAIM_MASK;
- va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); - if (unlikely(!va)) - return ERR_PTR(-ENOMEM); + /* + * If a VA is obtained from a global heap(if it fails here) + * it is anyway marked with this "vn_id" so it is returned + * to this pool's node later. Such way gives a possibility + * to populate pools based on users demand. + * + * On success a ready to go VA is returned. + */ + va = node_alloc(size, align, vstart, vend, &addr, &vn_id); + if (!va) { + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; + + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); + if (unlikely(!va)) + return ERR_PTR(-ENOMEM); + }
/* * Only scan the relevant parts containing pointers to other objects @@ -1732,10 +1915,12 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);
retry: - preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); - addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, - size, align, vstart, vend); - spin_unlock(&free_vmap_area_lock); + if (addr == vend) { + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, + size, align, vstart, vend); + spin_unlock(&free_vmap_area_lock); + }
trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);
@@ -1749,7 +1934,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va->va_start = addr; va->va_end = addr + size; va->vm = NULL; - va->flags = va_flags; + va->flags = (va_flags | vn_id);
vn = addr_to_node(va->va_start);
@@ -1842,63 +2027,135 @@ static DEFINE_MUTEX(vmap_purge_lock); static void purge_fragmented_blocks_allcpus(void); static cpumask_t purge_nodes;
-/* - * Purges all lazily-freed vmap areas. - */ -static unsigned long -purge_vmap_node(struct vmap_node *vn) +static void +reclaim_list_global(struct list_head *head) { - unsigned long num_purged_areas = 0; - struct vmap_area *va, *n_va; + struct vmap_area *va, *n;
- if (list_empty(&vn->purge_list)) - return 0; + if (list_empty(head)) + return;
spin_lock(&free_vmap_area_lock); + list_for_each_entry_safe(va, n, head, list) + merge_or_add_vmap_area_augment(va, + &free_vmap_area_root, &free_vmap_area_list); + spin_unlock(&free_vmap_area_lock); +} + +static void +decay_va_pool_node(struct vmap_node *vn, bool full_decay) +{ + struct vmap_area *va, *nva; + struct list_head decay_list; + struct rb_root decay_root; + unsigned long n_decay; + int i; + + decay_root = RB_ROOT; + INIT_LIST_HEAD(&decay_list); + + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { + struct list_head tmp_list; + + if (list_empty(&vn->pool[i].head)) + continue; + + INIT_LIST_HEAD(&tmp_list); + + /* Detach the pool, so no-one can access it. */ + spin_lock(&vn->pool_lock); + list_replace_init(&vn->pool[i].head, &tmp_list); + spin_unlock(&vn->pool_lock); + + if (full_decay) + WRITE_ONCE(vn->pool[i].len, 0); + + /* Decay a pool by ~25% out of left objects. */ + n_decay = vn->pool[i].len >> 2; + + list_for_each_entry_safe(va, nva, &tmp_list, list) { + list_del_init(&va->list); + merge_or_add_vmap_area(va, &decay_root, &decay_list); + + if (!full_decay) { + WRITE_ONCE(vn->pool[i].len, vn->pool[i].len - 1); + + if (!--n_decay) + break; + } + } + + /* Attach the pool back if it has been partly decayed. */ + if (!full_decay && !list_empty(&tmp_list)) { + spin_lock(&vn->pool_lock); + list_replace_init(&tmp_list, &vn->pool[i].head); + spin_unlock(&vn->pool_lock); + } + } + + reclaim_list_global(&decay_list); +} + +static void purge_vmap_node(struct work_struct *work) +{ + struct vmap_node *vn = container_of(work, + struct vmap_node, purge_work); + struct vmap_area *va, *n_va; + LIST_HEAD(local_list); + + vn->nr_purged = 0; + list_for_each_entry_safe(va, n_va, &vn->purge_list, list) { unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; unsigned long orig_start = va->va_start; unsigned long orig_end = va->va_end; + unsigned int vn_id = decode_vn_id(va->flags);
- /* - * Finally insert or merge lazily-freed area. It is - * detached and there is no need to "unlink" it from - * anything. - */ - va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root, - &free_vmap_area_list); - - if (!va) - continue; + list_del_init(&va->list);
if (is_vmalloc_or_module_addr((void *)orig_start)) kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end);
atomic_long_sub(nr, &vmap_lazy_nr); - num_purged_areas++; + vn->nr_purged++; + + if (is_vn_id_valid(vn_id) && !vn->skip_populate) + if (node_pool_add_va(vn, va)) + continue; + + /* Go back to global. */ + list_add(&va->list, &local_list); } - spin_unlock(&free_vmap_area_lock);
- return num_purged_areas; + reclaim_list_global(&local_list); }
/* * Purges all lazily-freed vmap areas. */ -static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, + bool full_pool_decay) { - unsigned long num_purged_areas = 0; + unsigned long nr_purged_areas = 0; + unsigned int nr_purge_helpers; + unsigned int nr_purge_nodes; struct vmap_node *vn; int i;
lockdep_assert_held(&vmap_purge_lock); + + /* + * Use cpumask to mark which node has to be processed. + */ purge_nodes = CPU_MASK_NONE;
for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i];
INIT_LIST_HEAD(&vn->purge_list); + vn->skip_populate = full_pool_decay; + decay_va_pool_node(vn, full_pool_decay);
if (RB_EMPTY_ROOT(&vn->lazy.root)) continue; @@ -1917,17 +2174,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) cpumask_set_cpu(i, &purge_nodes); }
- if (cpumask_weight(&purge_nodes) > 0) { + nr_purge_nodes = cpumask_weight(&purge_nodes); + if (nr_purge_nodes > 0) { flush_tlb_kernel_range(start, end);
+ /* One extra worker is per a lazy_max_pages() full set minus one. */ + nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages(); + nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1; + for_each_cpu(i, &purge_nodes) { - vn = &nodes[i]; - num_purged_areas += purge_vmap_node(vn); + vn = &vmap_nodes[i]; + + if (nr_purge_helpers > 0) { + INIT_WORK(&vn->purge_work, purge_vmap_node); + + if (cpumask_test_cpu(i, cpu_online_mask)) + schedule_work_on(i, &vn->purge_work); + else + schedule_work(&vn->purge_work); + + nr_purge_helpers--; + } else { + vn->purge_work.func = NULL; + purge_vmap_node(&vn->purge_work); + nr_purged_areas += vn->nr_purged; + } + } + + for_each_cpu(i, &purge_nodes) { + vn = &vmap_nodes[i]; + + if (vn->purge_work.func) { + flush_work(&vn->purge_work); + nr_purged_areas += vn->nr_purged; + } } }
- trace_purge_vmap_area_lazy(start, end, num_purged_areas); - return num_purged_areas > 0; + trace_purge_vmap_area_lazy(start, end, nr_purged_areas); + return nr_purged_areas > 0; }
/* @@ -1938,14 +2223,14 @@ static void reclaim_and_purge_vmap_areas(void) { mutex_lock(&vmap_purge_lock); purge_fragmented_blocks_allcpus(); - __purge_vmap_area_lazy(ULONG_MAX, 0); + __purge_vmap_area_lazy(ULONG_MAX, 0, true); mutex_unlock(&vmap_purge_lock); }
static void drain_vmap_area_work(struct work_struct *work) { mutex_lock(&vmap_purge_lock); - __purge_vmap_area_lazy(ULONG_MAX, 0); + __purge_vmap_area_lazy(ULONG_MAX, 0, false); mutex_unlock(&vmap_purge_lock); }
@@ -1956,9 +2241,10 @@ static void drain_vmap_area_work(struct work_struct *work) */ static void free_vmap_area_noflush(struct vmap_area *va) { - struct vmap_node *vn = addr_to_node(va->va_start); unsigned long nr_lazy_max = lazy_max_pages(); unsigned long va_start = va->va_start; + unsigned int vn_id = decode_vn_id(va->flags); + struct vmap_node *vn; unsigned long nr_lazy;
if (WARN_ON_ONCE(!list_empty(&va->list))) @@ -1968,10 +2254,14 @@ static void free_vmap_area_noflush(struct vmap_area *va) PAGE_SHIFT, &vmap_lazy_nr);
/* - * Merge or place it to the purge tree/list. + * If it was request by a certain node we would like to + * return it to that node, i.e. its pool for later reuse. */ + vn = is_vn_id_valid(vn_id) ? + id_to_node(vn_id):addr_to_node(va->va_start); + spin_lock(&vn->lazy.lock); - merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head); + insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head); spin_unlock(&vn->lazy.lock);
trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max); @@ -2480,7 +2770,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) } free_purged_blocks(&purge_list);
- if (!__purge_vmap_area_lazy(start, end) && flush) + if (!__purge_vmap_area_lazy(start, end, false) && flush) flush_tlb_kernel_range(start, end); mutex_unlock(&vmap_purge_lock); } @@ -4854,7 +5144,7 @@ static void __init vmap_init_free_space(void) static void vmap_init_nodes(void) { struct vmap_node *vn; - int i; + int i, j;
for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; @@ -4865,6 +5155,13 @@ static void vmap_init_nodes(void) vn->lazy.root = RB_ROOT; INIT_LIST_HEAD(&vn->lazy.head); spin_lock_init(&vn->lazy.lock); + + for (j = 0; j < MAX_VA_SIZE_PAGES; j++) { + INIT_LIST_HEAD(&vn->pool[j].head); + WRITE_ONCE(vn->pool[j].len, 0); + } + + spin_lock_init(&vn->pool_lock); } }
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Invoke the kmemleak_scan_area() function only for newly allocated objects, to add a scan area within that object. There is no reason to add the same scan area (a pointer to the beginning of, or inside, the object) several times. If a VA is obtained from the cache, its scan area has already been associated.
Link: https://lkml.kernel.org/r/20240202190628.47806-1-urezki@gmail.com Fixes: 7db166b4aa0d ("mm: vmalloc: offload free_vmap_area_lock lock") Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Reviewed-by: Lorenzo Stoakes lstoakes@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Dave Chinner david@fromorbit.com Cc: Joel Fernandes (Google) joel@joelfernandes.org Cc: Kazuhito Hagio k-hagio-ab@nec.com Cc: Liam R. Howlett Liam.Howlett@oracle.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Cc: Paul E. McKenney paulmck@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ea14123e49e5..536505ab7634 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1906,13 +1906,13 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); if (unlikely(!va)) return ERR_PTR(-ENOMEM); - }
- /* - * Only scan the relevant parts containing pointers to other objects - * to avoid false negatives. - */ - kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); + /* + * Only scan the relevant parts containing pointers to other objects + * to avoid false negatives. + */ + kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); + }
retry: if (addr == vend) {
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 53becf32aec1c8049b854f0c31a11df5ed75df6f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Extend vread_iter() to be able to perform sequential reading of VAs which are spread among multiple nodes, so that data read over /dev/kmem correctly reflects the vmalloc memory layout.
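A minimal sketch of the cross-node lookup idea (illustrative userspace code, not the kernel implementation): each read step picks, over all nodes, the VA with the lowest start address that still satisfies addr < va_end, then continues from that VA's end.

#include <stdio.h>

struct va { unsigned long start, end; };

/* Two toy "nodes", each holding its own set of busy areas. */
static struct va node0[] = { { 0x1000, 0x2000 }, { 0x9000, 0xa000 } };
static struct va node1[] = { { 0x4000, 0x6000 } };

static struct va *nodes[]  = { node0, node1 };
static int nodes_len[]     = { 2, 1 };

/* Find the VA with the lowest start that still satisfies addr < end. */
static struct va *find_exceed_addr(unsigned long addr)
{
	struct va *best = NULL;

	for (int n = 0; n < 2; n++)
		for (int i = 0; i < nodes_len[n]; i++) {
			struct va *v = &nodes[n][i];

			if (addr < v->end && (!best || v->start < best->start))
				best = v;
		}
	return best;
}

int main(void)
{
	unsigned long addr = 0;
	struct va *v;

	/* Walk all areas in address order, regardless of owning node. */
	while ((v = find_exceed_addr(addr))) {
		printf("read [%#lx, %#lx)\n", v->start, v->end);
		addr = v->end;
	}
	return 0;
}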
Link: https://lkml.kernel.org/r/20240102184633.748113-9-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Reviewed-by: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Dave Chinner david@fromorbit.com Cc: Joel Fernandes (Google) joel@joelfernandes.org Cc: Kazuhito Hagio k-hagio-ab@nec.com Cc: Liam R. Howlett Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Cc: Paul E. McKenney paulmck@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 53becf32aec1c8049b854f0c31a11df5ed75df6f) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 53 insertions(+), 14 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 536505ab7634..5db9bb48ef7d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -978,7 +978,7 @@ unsigned long vmalloc_nr_pages(void)
/* Look up the first VA which satisfies addr < va_end, NULL if none. */ static struct vmap_area * -find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) +__find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) { struct vmap_area *va = NULL; struct rb_node *n = root->rb_node; @@ -1002,6 +1002,41 @@ find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) return va; }
+/* + * Returns a node where a first VA, that satisfies addr < va_end, resides. + * If success, a node is locked. A user is responsible to unlock it when a + * VA is no longer needed to be accessed. + * + * Returns NULL if nothing found. + */ +static struct vmap_node * +find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va) +{ + struct vmap_node *vn, *va_node = NULL; + struct vmap_area *va_lowest; + int i; + + for (i = 0; i < nr_vmap_nodes; i++) { + vn = &vmap_nodes[i]; + + spin_lock(&vn->busy.lock); + va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root); + if (va_lowest) { + if (!va_node || va_lowest->va_start < (*va)->va_start) { + if (va_node) + spin_unlock(&va_node->busy.lock); + + *va = va_lowest; + va_node = vn; + continue; + } + } + spin_unlock(&vn->busy.lock); + } + + return va_node; +} + static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root) { struct rb_node *n = root->rb_node; @@ -4179,6 +4214,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) struct vm_struct *vm; char *vaddr; size_t n, size, flags, remains; + unsigned long next;
addr = kasan_reset_tag(addr);
@@ -4188,19 +4224,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
remains = count;
- /* Hooked to node_0 so far. */ - vn = addr_to_node(0); - spin_lock(&vn->busy.lock); - - va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root); - if (!va) + vn = find_vmap_area_exceed_addr_lock((unsigned long) addr, &va); + if (!vn) goto finished_zero;
/* no intersects with alive vmap_area */ if ((unsigned long)addr + remains <= va->va_start) goto finished_zero;
- list_for_each_entry_from(va, &vn->busy.head, list) { + do { size_t copied;
if (remains == 0) @@ -4215,10 +4247,10 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) WARN_ON(flags == VMAP_BLOCK);
if (!vm && !flags) - continue; + goto next_va;
if (vm && (vm->flags & VM_UNINITIALIZED)) - continue; + goto next_va;
/* Pair with smp_wmb() in clear_vm_uninitialized_flag() */ smp_rmb(); @@ -4227,7 +4259,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) size = vm ? get_vm_area_size(vm) : va_size(va);
if (addr >= vaddr + size) - continue; + goto next_va;
if (addr < vaddr) { size_t to_zero = min_t(size_t, vaddr - addr, remains); @@ -4256,15 +4288,22 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
if (copied != n) goto finished; - } + + next_va: + next = va->va_end; + spin_unlock(&vn->busy.lock); + } while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));
finished_zero: - spin_unlock(&vn->busy.lock); + if (vn) + spin_unlock(&vn->busy.lock); + /* zero-fill memory holes */ return count - remains + zero_iter(iter, remains); finished: /* Nothing remains, or We couldn't copy/zero everything. */ - spin_unlock(&vn->busy.lock); + if (vn) + spin_unlock(&vn->busy.lock);
return count - remains; }
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 8e1d743f2c2671aa54f6f91a2b33823f92512870 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Allocated areas are spread among the nodes; this implies that scanning has to be performed on each node individually in order to dump all existing VAs.
Link: https://lkml.kernel.org/r/20240102184633.748113-10-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Dave Chinner david@fromorbit.com Cc: Joel Fernandes (Google) joel@joelfernandes.org Cc: Kazuhito Hagio k-hagio-ab@nec.com Cc: Liam R. Howlett Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Cc: Paul E. McKenney paulmck@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 8e1d743f2c2671aa54f6f91a2b33823f92512870) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 124 ++++++++++++++++++++------------------------------- 1 file changed, 49 insertions(+), 75 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 5db9bb48ef7d..10b3e5a3b7ad 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -4984,30 +4984,6 @@ EXPORT_SYMBOL(remap_vmalloc_hugepage_range); #endif
#ifdef CONFIG_PROC_FS -static void *s_start(struct seq_file *m, loff_t *pos) -{ - struct vmap_node *vn = addr_to_node(0); - - mutex_lock(&vmap_purge_lock); - spin_lock(&vn->busy.lock); - - return seq_list_start(&vn->busy.head, *pos); -} - -static void *s_next(struct seq_file *m, void *p, loff_t *pos) -{ - struct vmap_node *vn = addr_to_node(0); - return seq_list_next(p, &vn->busy.head, pos); -} - -static void s_stop(struct seq_file *m, void *p) -{ - struct vmap_node *vn = addr_to_node(0); - - spin_unlock(&vn->busy.lock); - mutex_unlock(&vmap_purge_lock); -} - static void show_numa_info(struct seq_file *m, struct vm_struct *v) { if (IS_ENABLED(CONFIG_NUMA)) { @@ -5051,87 +5027,85 @@ static void show_purge_info(struct seq_file *m) } }
-static int s_show(struct seq_file *m, void *p) +static int vmalloc_info_show(struct seq_file *m, void *p) { struct vmap_node *vn; struct vmap_area *va; struct vm_struct *v; + int i;
- vn = addr_to_node(0); - va = list_entry(p, struct vmap_area, list); + for (i = 0; i < nr_vmap_nodes; i++) { + vn = &vmap_nodes[i];
- if (!va->vm) { - if (va->flags & VMAP_RAM) - seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n", - (void *)va->va_start, (void *)va->va_end, - va->va_end - va->va_start); + spin_lock(&vn->busy.lock); + list_for_each_entry(va, &vn->busy.head, list) { + if (!va->vm) { + if (va->flags & VMAP_RAM) + seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n", + (void *)va->va_start, (void *)va->va_end, + va->va_end - va->va_start);
- goto final; - } + continue; + }
- v = va->vm; + v = va->vm;
- seq_printf(m, "0x%pK-0x%pK %7ld", - v->addr, v->addr + v->size, v->size); + seq_printf(m, "0x%pK-0x%pK %7ld", + v->addr, v->addr + v->size, v->size);
- if (v->caller) - seq_printf(m, " %pS", v->caller); + if (v->caller) + seq_printf(m, " %pS", v->caller);
- if (v->nr_pages) - seq_printf(m, " pages=%d", v->nr_pages); + if (v->nr_pages) + seq_printf(m, " pages=%d", v->nr_pages);
- if (v->phys_addr) - seq_printf(m, " phys=%pa", &v->phys_addr); + if (v->phys_addr) + seq_printf(m, " phys=%pa", &v->phys_addr);
- if (v->flags & VM_IOREMAP) - seq_puts(m, " ioremap"); + if (v->flags & VM_IOREMAP) + seq_puts(m, " ioremap");
- if (v->flags & VM_SPARSE) - seq_puts(m, " sparse"); + if (v->flags & VM_SPARSE) + seq_puts(m, " sparse");
- if (v->flags & VM_ALLOC) - seq_puts(m, " vmalloc"); + if (v->flags & VM_ALLOC) + seq_puts(m, " vmalloc");
- if (v->flags & VM_MAP) - seq_puts(m, " vmap"); + if (v->flags & VM_MAP) + seq_puts(m, " vmap");
- if (v->flags & VM_USERMAP) - seq_puts(m, " user"); + if (v->flags & VM_USERMAP) + seq_puts(m, " user");
- if (v->flags & VM_DMA_COHERENT) - seq_puts(m, " dma-coherent"); + if (v->flags & VM_DMA_COHERENT) + seq_puts(m, " dma-coherent");
- if (is_vmalloc_addr(v->pages)) - seq_puts(m, " vpages"); + if (is_vmalloc_addr(v->pages)) + seq_puts(m, " vpages");
- show_numa_info(m, v); - seq_putc(m, '\n'); + show_numa_info(m, v); + seq_putc(m, '\n'); + } + spin_unlock(&vn->busy.lock); + }
/* * As a final step, dump "unpurged" areas. */ -final: - if (list_is_last(&va->list, &vn->busy.head)) - show_purge_info(m); - + show_purge_info(m); return 0; }
-static const struct seq_operations vmalloc_op = { - .start = s_start, - .next = s_next, - .stop = s_stop, - .show = s_show, -}; - static int __init proc_vmalloc_init(void) { + void *priv_data = NULL; + if (IS_ENABLED(CONFIG_NUMA)) - proc_create_seq_private("vmallocinfo", 0400, NULL, - &vmalloc_op, - nr_node_ids * sizeof(unsigned int), NULL); - else - proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op); + priv_data = kmalloc(nr_node_ids * sizeof(unsigned int), GFP_KERNEL); + + proc_create_single_data("vmallocinfo", + 0400, NULL, vmalloc_info_show, priv_data); + return 0; } module_init(proc_vmalloc_init);
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 8f33a2ff307248c3e55a7696f60b3658b28edb57 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
The number of nodes used in the alloc/free paths is set based on num_possible_cpus() in a system. Please note that the upper limit is fixed and corresponds to 128 nodes.

For 32-bit or single-core systems, access to the global vmap heap is not balanced across nodes. Such small systems do not suffer from lock contention due to the low number of CPUs; in that case nr_nodes is equal to 1.
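A small sketch of the sizing policy and the address-to-node mapping (illustrative only; the values mirror the patch below, while the helpers are simplified stand-ins for vmap_init_nodes() and addr_to_node_id()):

#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned int  nr_vmap_nodes  = 1;
static unsigned long vmap_zone_size = 1;

/* Sizing policy: one node per possible CPU, capped at 128. */
static void init_nodes(unsigned int num_possible_cpus)
{
	unsigned int n = num_possible_cpus;

	if (n < 1)
		n = 1;
	if (n > 128)
		n = 128;

	if (n > 1) {
		nr_vmap_nodes = n;
		vmap_zone_size = 16 * PAGE_SIZE;	/* 16-page node partition */
	}
}

/* Address-to-node mapping over fixed-size zones, round-robin style. */
static unsigned int addr_to_node_id(unsigned long addr)
{
	return (addr / vmap_zone_size) % nr_vmap_nodes;
}

int main(void)
{
	init_nodes(64);
	printf("nr_nodes=%u zone=%lu bytes\n", nr_vmap_nodes, vmap_zone_size);
	printf("addr 0x10000 -> node %u\n", addr_to_node_id(0x10000));
	return 0;
}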
Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
 94.41%  0.89%  [kernel]        [k] _raw_spin_lock
 93.35% 93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
 76.13%  0.28%  [kernel]        [k] __vmalloc_node_range
 72.96%  0.81%  [kernel]        [k] alloc_vmap_area
 56.94%  0.00%  [kernel]        [k] __get_vm_area_node
 41.95%  0.00%  [kernel]        [k] vmalloc
 37.15%  0.01%  [test_vmalloc]  [k] full_fit_alloc_test
 35.17%  0.00%  [kernel]        [k] ret_from_fork_asm
 35.17%  0.00%  [kernel]        [k] ret_from_fork
 35.17%  0.00%  [kernel]        [k] kthread
 35.08%  0.00%  [test_vmalloc]  [k] test_func
 34.45%  0.00%  [test_vmalloc]  [k] fix_size_alloc_test
 28.09%  0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
 23.53%  0.25%  [kernel]        [k] vfree.part.0
 21.72%  0.00%  [kernel]        [k] remove_vm_area
 20.08%  0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%  0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
 82.32%  0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
 63.36%  0.02%  [kernel]        [k] vmalloc
 63.34%  2.64%  [kernel]        [k] __vmalloc_node_range
 30.42%  4.46%  [kernel]        [k] vfree.part.0
 28.98%  2.51%  [kernel]        [k] __alloc_pages_bulk
 27.28%  0.19%  [kernel]        [k] __get_vm_area_node
 26.13%  1.50%  [kernel]        [k] alloc_vmap_area
 21.72% 21.67%  [kernel]        [k] clear_page_rep
 19.51%  2.43%  [kernel]        [k] _raw_spin_lock
 16.61% 16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
 13.40%  2.07%  [kernel]        [k] free_unref_page
 10.62%  0.01%  [kernel]        [k] remove_vm_area
  9.02%  8.73%  [kernel]        [k] insert_vmap_area
  8.94%  0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%  0.00%  [kernel]        [k] ret_from_fork
  8.94%  0.00%  [kernel]        [k] kthread
  8.29%  0.00%  [test_vmalloc]  [k] test_func
  7.81%  0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%  4.73%  [kernel]        [k] purge_vmap_node
  4.47%  2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>
This confirms that native_queued_spin_lock_slowpath drops from 93.07% to 16.51%.
The throughput is ~12x higher:
urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$
Link: https://lkml.kernel.org/r/20240102184633.748113-11-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Dave Chinner david@fromorbit.com Cc: Joel Fernandes (Google) joel@joelfernandes.org Cc: Kazuhito Hagio k-hagio-ab@nec.com Cc: Liam R. Howlett Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Cc: Paul E. McKenney paulmck@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 8f33a2ff307248c3e55a7696f60b3658b28edb57) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 10b3e5a3b7ad..0bbd5eff03f4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -5157,10 +5157,27 @@ static void __init vmap_init_free_space(void) static void vmap_init_nodes(void) { struct vmap_node *vn; - int i, j; + int i, n; + +#if BITS_PER_LONG == 64 + /* A high threshold of max nodes is fixed and bound to 128. */ + n = clamp_t(unsigned int, num_possible_cpus(), 1, 128); + + if (n > 1) { + vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN); + if (vn) { + /* Node partition is 16 pages. */ + vmap_zone_size = (1 << 4) * PAGE_SIZE; + nr_vmap_nodes = n; + vmap_nodes = vn; + } else { + pr_err("Failed to allocate an array. Disable a node layer\n"); + } + } +#endif
- for (i = 0; i < nr_vmap_nodes; i++) { - vn = &vmap_nodes[i]; + for (n = 0; n < nr_vmap_nodes; n++) { + vn = &vmap_nodes[n]; vn->busy.root = RB_ROOT; INIT_LIST_HEAD(&vn->busy.head); spin_lock_init(&vn->busy.lock); @@ -5169,9 +5186,9 @@ static void vmap_init_nodes(void) INIT_LIST_HEAD(&vn->lazy.head); spin_lock_init(&vn->lazy.lock);
- for (j = 0; j < MAX_VA_SIZE_PAGES; j++) { - INIT_LIST_HEAD(&vn->pool[j].head); - WRITE_ONCE(vn->pool[j].len, 0); + for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { + INIT_LIST_HEAD(&vn->pool[i].head); + WRITE_ONCE(vn->pool[i].len, 0); }
spin_lock_init(&vn->pool_lock);
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 7679ba6b36dbb300b757b672d6a32a606499e14b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
The added shrinker is used to return cached VAs back to the global vmap space when the system enters a low-memory state.
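As a rough illustration of the shrinker semantics (a userspace sketch, not the kernel shrinker API): the count step reports how many cached VAs sit in the per-node pools, while the scan step fully decays every node's pools so the cached areas can go back to the global space.

#include <stdio.h>

#define NR_NODES 4
#define NR_POOLS 8

/* Cached object counters per node/pool; stand-ins for vn->pool[i].len. */
static unsigned long pool_len[NR_NODES][NR_POOLS];

/* "count_objects": how much could be reclaimed right now. */
static unsigned long shrink_count(void)
{
	unsigned long count = 0;

	for (int n = 0; n < NR_NODES; n++)
		for (int p = 0; p < NR_POOLS; p++)
			count += pool_len[n][p];
	return count;
}

/* "scan_objects": drain all pools back to the global space. */
static unsigned long shrink_scan(void)
{
	unsigned long freed = 0;

	for (int n = 0; n < NR_NODES; n++)
		for (int p = 0; p < NR_POOLS; p++) {
			freed += pool_len[n][p];
			pool_len[n][p] = 0;	/* full decay of this pool */
		}
	return freed;
}

int main(void)
{
	pool_len[0][1] = 3;
	pool_len[2][5] = 7;

	printf("cached: %lu\n", shrink_count());	/* 10 */
	printf("freed:  %lu\n", shrink_scan());		/* 10 */
	printf("cached: %lu\n", shrink_count());	/* 0  */
	return 0;
}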
Link: https://lkml.kernel.org/r/20240102184633.748113-12-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Cc: Kazuhito Hagio k-hagio-ab@nec.com Cc: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@lst.de Cc: Dave Chinner david@fromorbit.com Cc: Joel Fernandes (Google) joel@joelfernandes.org Cc: Liam R. Howlett Liam.Howlett@oracle.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Cc: Paul E. McKenney paulmck@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 7679ba6b36dbb300b757b672d6a32a606499e14b) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 0bbd5eff03f4..598eec85610b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -5195,8 +5195,37 @@ static void vmap_init_nodes(void) } }
+static unsigned long +vmap_node_shrink_count(struct shrinker *shrink, struct shrink_control *sc) +{ + unsigned long count; + struct vmap_node *vn; + int i, j; + + for (count = 0, i = 0; i < nr_vmap_nodes; i++) { + vn = &vmap_nodes[i]; + + for (j = 0; j < MAX_VA_SIZE_PAGES; j++) + count += READ_ONCE(vn->pool[j].len); + } + + return count ? count : SHRINK_EMPTY; +} + +static unsigned long +vmap_node_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) +{ + int i; + + for (i = 0; i < nr_vmap_nodes; i++) + decay_va_pool_node(&vmap_nodes[i], true); + + return SHRINK_STOP; +} + void __init vmalloc_init(void) { + struct shrinker *vmap_node_shrinker; struct vmap_area *va; struct vmap_node *vn; struct vm_struct *tmp; @@ -5244,4 +5273,14 @@ void __init vmalloc_init(void) */ vmap_init_free_space(); vmap_initialized = true; + + vmap_node_shrinker = shrinker_alloc(0, "vmap-node"); + if (!vmap_node_shrinker) { + pr_err("Failed to allocate vmap-node shrinker!\n"); + return; + } + + vmap_node_shrinker->count_objects = vmap_node_shrink_count; + vmap_node_shrinker->scan_objects = vmap_node_shrink_scan; + shrinker_register(vmap_node_shrinker); }
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 15e02a39fb6b43f37100563c6a246252d5d1e6da category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
This patch adds extra explanation of the recently added vmap node layer, based on community feedback. No functional change.
Link: https://lkml.kernel.org/r/20240124180920.50725-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Reviewed-by: Lorenzo Stoakes lstoakes@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@infradead.org Cc: Dave Chinner david@fromorbit.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 15e02a39fb6b43f37100563c6a246252d5d1e6da) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 60 ++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 46 insertions(+), 14 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 598eec85610b..f8187faa0017 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -837,9 +837,10 @@ static struct rb_root free_vmap_area_root = RB_ROOT; static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
/* - * An effective vmap-node logic. Users make use of nodes instead - * of a global heap. It allows to balance an access and mitigate - * contention. + * This structure defines a single, solid model where a list and + * rb-tree are part of one entity protected by the lock. Nodes are + * sorted in ascending order, thus for O(1) access to left/right + * neighbors a list is used as well as for sequential traversal. */ struct rb_list { struct rb_root root; @@ -847,16 +848,23 @@ struct rb_list { spinlock_t lock; };
+/* + * A fast size storage contains VAs up to 1M size. A pool consists + * of linked between each other ready to go VAs of certain sizes. + * An index in the pool-array corresponds to number of pages + 1. + */ +#define MAX_VA_SIZE_PAGES 256 + struct vmap_pool { struct list_head head; unsigned long len; };
/* - * A fast size storage contains VAs up to 1M size. + * An effective vmap-node logic. Users make use of nodes instead + * of a global heap. It allows to balance an access and mitigate + * contention. */ -#define MAX_VA_SIZE_PAGES 256 - static struct vmap_node { /* Simple size segregated storage. */ struct vmap_pool pool[MAX_VA_SIZE_PAGES]; @@ -875,6 +883,11 @@ static struct vmap_node { unsigned long nr_purged; } single;
+/* + * Initial setup consists of one single node, i.e. a balancing + * is fully disabled. Later on, after vmap is initialized these + * parameters are updated based on a system capacity. + */ static struct vmap_node *vmap_nodes = &single; static __read_mostly unsigned int nr_vmap_nodes = 1; static __read_mostly unsigned int vmap_zone_size = 1; @@ -2120,7 +2133,12 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay) } }
- /* Attach the pool back if it has been partly decayed. */ + /* + * Attach the pool back if it has been partly decayed. + * Please note, it is supposed that nobody(other contexts) + * can populate the pool therefore a simple list replace + * operation takes place here. + */ if (!full_decay && !list_empty(&tmp_list)) { spin_lock(&vn->pool_lock); list_replace_init(&tmp_list, &vn->pool[i].head); @@ -2329,16 +2347,14 @@ struct vmap_area *find_vmap_area(unsigned long addr) * An addr_to_node_id(addr) converts an address to a node index * where a VA is located. If VA spans several zones and passed * addr is not the same as va->va_start, what is not common, we - * may need to scan an extra nodes. See an example: + * may need to scan extra nodes. See an example: * - * <--va--> + * <----va----> * -|-----|-----|-----|-----|- * 1 2 0 1 * - * VA resides in node 1 whereas it spans 1 and 2. If passed - * addr is within a second node we should do extra work. We - * should mention that it is rare and is a corner case from - * the other hand it has to be covered. + * VA resides in node 1 whereas it spans 1, 2 an 0. If passed + * addr is within 2 or 0 nodes we should do extra work. */ i = j = addr_to_node_id(addr); do { @@ -2361,6 +2377,9 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr) struct vmap_area *va; int i, j;
+ /* + * Check the comment in the find_vmap_area() about the loop. + */ i = j = addr_to_node_id(addr); do { vn = &vmap_nodes[i]; @@ -5160,7 +5179,20 @@ static void vmap_init_nodes(void) int i, n;
#if BITS_PER_LONG == 64 - /* A high threshold of max nodes is fixed and bound to 128. */ + /* + * A high threshold of max nodes is fixed and bound to 128, + * thus a scale factor is 1 for systems where number of cores + * are less or equal to specified threshold. + * + * As for NUMA-aware notes. For bigger systems, for example + * NUMA with multi-sockets, where we can end-up with thousands + * of cores in total, a "sub-numa-clustering" should be added. + * + * In this case a NUMA domain is considered as a single entity + * with dedicated sub-nodes in it which describe one group or + * set of cores. Therefore a per-domain purging is supposed to + * be added as well as a per-domain balancing. + */ n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
if (n > 1) {
From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion from mainline-6.9-rc1 commit 8be4d46e12af32342569840d958272dbb3be3f4c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
This patch simplifies the function in question by removing the extra stack variable "objp" and returning to an early-exit approach when spin_trylock() fails or the VA is not found.
Link: https://lkml.kernel.org/r/20240124180920.50725-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) urezki@gmail.com Reviewed-by: Lorenzo Stoakes lstoakes@gmail.com Cc: Baoquan He bhe@redhat.com Cc: Christoph Hellwig hch@infradead.org Cc: Dave Chinner david@fromorbit.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Oleksiy Avramchenko oleksiy.avramchenko@sony.com Signed-off-by: Andrew Morton akpm@linux-foundation.org (cherry picked from commit 8be4d46e12af32342569840d958272dbb3be3f4c) Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f8187faa0017..54d5c9699336 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -4773,34 +4773,35 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) #ifdef CONFIG_PRINTK bool vmalloc_dump_obj(void *object) { - void *objp = (void *)PAGE_ALIGN((unsigned long)object); const void *caller; + struct vm_struct *vm; struct vmap_area *va; struct vmap_node *vn; unsigned long addr; unsigned int nr_pages; - bool success = false; - - vn = addr_to_node((unsigned long)objp);
- if (spin_trylock(&vn->busy.lock)) { - va = __find_vmap_area((unsigned long)objp, &vn->busy.root); + addr = PAGE_ALIGN((unsigned long) object); + vn = addr_to_node(addr);
- if (va && va->vm) { - addr = (unsigned long)va->vm->addr; - caller = va->vm->caller; - nr_pages = va->vm->nr_pages; - success = true; - } + if (!spin_trylock(&vn->busy.lock)) + return false;
+ va = __find_vmap_area(addr, &vn->busy.root); + if (!va || !va->vm) { spin_unlock(&vn->busy.lock); + return false; }
- if (success) - pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", - nr_pages, addr, caller); + vm = va->vm; + addr = (unsigned long) vm->addr; + caller = vm->caller; + nr_pages = vm->nr_pages; + spin_unlock(&vn->busy.lock);
- return success; + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", + nr_pages, addr, caller); + + return true; } #endif
From: rulinhuang rulin.huang@intel.com
maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://lore.kernel.org/all/20240307021440.64967-1-rulin.huang@intel.com/
--------------------------------
When allocating a new memory area where the mapping address range is known, it is observed that the vmap_node->busy.lock is acquired twice.
The first acquisition occurs in the alloc_vmap_area() function when inserting the vm area into the vm mapping red-black tree. The second acquisition occurs in the setup_vmalloc_vm() function when updating the properties of the vm, such as flags and address, etc.
Combine these two operations in alloc_vmap_area(), which improves scalability when the vmap_node->busy.lock is contended. By doing so, the lock only needs to be acquired once instead of twice.
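A simplified sketch of the locking pattern before and after this change (illustrative userspace code with made-up helpers, not the kernel functions):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t busy_lock = PTHREAD_MUTEX_INITIALIZER;

struct area { unsigned long start, size, flags; int inserted; };

static void insert_area(struct area *a) { a->inserted = 1; }
static void setup_area(struct area *a, unsigned long flags) { a->flags = flags; }

/* Old pattern: the busy lock is taken twice for one allocation. */
static void alloc_then_setup_twice(struct area *a, unsigned long flags)
{
	pthread_mutex_lock(&busy_lock);
	insert_area(a);
	pthread_mutex_unlock(&busy_lock);

	pthread_mutex_lock(&busy_lock);
	setup_area(a, flags);
	pthread_mutex_unlock(&busy_lock);
}

/* New pattern: setup happens in the same critical section as the insert. */
static void alloc_and_setup_once(struct area *a, unsigned long flags)
{
	pthread_mutex_lock(&busy_lock);
	insert_area(a);
	setup_area(a, flags);
	pthread_mutex_unlock(&busy_lock);
}

int main(void)
{
	struct area a = { 0x1000, 0x2000, 0, 0 };

	alloc_then_setup_twice(&a, 0x1);
	printf("old: inserted=%d flags=%#lx\n", a.inserted, a.flags);

	alloc_and_setup_once(&a, 0x2);
	printf("new: inserted=%d flags=%#lx\n", a.inserted, a.flags);
	return 0;
}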
With the above change, tested on an Intel Sapphire Rapids platform (224 vCPUs), a 4% performance improvement is observed with stress-ng/pthread (https://github.com/ColinIanKing/stress-ng), which stress-tests thread creation.
Link: https://lkml.kernel.org/r/20240307021440.64967-1-rulin.huang@intel.com Co-developed-by: "Chen, Tim C" tim.c.chen@intel.com Signed-off-by: "Chen, Tim C" tim.c.chen@intel.com Co-developed-by: "King, Colin" colin.king@intel.com Signed-off-by: "King, Colin" colin.king@intel.com Signed-off-by: rulinhuang rulin.huang@intel.com Reviewed-by: Baoquan He bhe@redhat.com Reviewed-by: Uladzislau Rezki (Sony) urezki@gmail.com Cc: Tim Chen tim.c.chen@linux.intel.com Cc: Christoph Hellwig hch@infradead.org Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Wangyang Guo wangyang.guo@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 50 ++++++++++++++++++++++---------------------------- 1 file changed, 22 insertions(+), 28 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 54d5c9699336..b81d757fdc0d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1913,15 +1913,26 @@ node_alloc(unsigned long size, unsigned long align, return va; }
+static inline void setup_vmalloc_vm(struct vm_struct *vm, + struct vmap_area *va, unsigned long flags, const void *caller) +{ + vm->flags = flags; + vm->addr = (void *)va->va_start; + vm->size = va->va_end - va->va_start; + vm->caller = caller; + va->vm = vm; +} + /* * Allocate a region of KVA of the specified size and alignment, within the - * vstart and vend. + * vstart and vend. If vm is passed in, the two will also be bound. */ static struct vmap_area *alloc_vmap_area(unsigned long size, unsigned long align, unsigned long vstart, unsigned long vend, int node, gfp_t gfp_mask, - unsigned long va_flags) + unsigned long va_flags, struct vm_struct *vm, + unsigned long flags, const void *caller) { struct vmap_node *vn; struct vmap_area *va; @@ -1984,6 +1995,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va->vm = NULL; va->flags = (va_flags | vn_id);
+ if (vm) + setup_vmalloc_vm(vm, va, flags, caller); + vn = addr_to_node(va->va_start);
spin_lock(&vn->busy.lock); @@ -2558,7 +2572,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE, VMALLOC_START, VMALLOC_END, node, gfp_mask, - VMAP_RAM|VMAP_BLOCK); + VMAP_RAM|VMAP_BLOCK, NULL, + 0, NULL); if (IS_ERR(va)) { kfree(vb); return ERR_CAST(va); @@ -2915,7 +2930,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node) struct vmap_area *va; va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START, VMALLOC_END, - node, GFP_KERNEL, VMAP_RAM); + node, GFP_KERNEL, VMAP_RAM, + NULL, 0, NULL); if (IS_ERR(va)) return NULL;
@@ -3018,26 +3034,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align) kasan_populate_early_vm_area_shadow(vm->addr, vm->size); }
-static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, - struct vmap_area *va, unsigned long flags, const void *caller) -{ - vm->flags = flags; - vm->addr = (void *)va->va_start; - vm->size = va->va_end - va->va_start; - vm->caller = caller; - va->vm = vm; -} - -static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, - unsigned long flags, const void *caller) -{ - struct vmap_node *vn = addr_to_node(va->va_start); - - spin_lock(&vn->busy.lock); - setup_vmalloc_vm_locked(vm, va, flags, caller); - spin_unlock(&vn->busy.lock); -} - static void clear_vm_uninitialized_flag(struct vm_struct *vm) { /* @@ -3074,14 +3070,12 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, if (!(flags & VM_NO_GUARD)) size += PAGE_SIZE;
- va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0); + va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area, flags, caller); if (IS_ERR(va)) { kfree(area); return NULL; }
- setup_vmalloc_vm(area, va, flags, caller); - /* * Mark pages for non-VM_ALLOC mappings as accessible. Do it now as a * best-effort approach, as they can be mapped outside of vmalloc code. @@ -4661,7 +4655,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
spin_lock(&vn->busy.lock); insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); - setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, + setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC, pcpu_get_vm_areas); spin_unlock(&vn->busy.lock); }
From: Baoquan He bhe@redhat.com
maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
Reference: https://lore.kernel.org/all/20240309044454.648888-1-bhe@redhat.com/
--------------------------------
When called from __get_vm_area_node(), open code the field assignments of 'struct vm_struct *vm' and move the vm->flags and vm->caller assignments into __get_vm_area_node(); the passed-in arguments 'flags' and 'caller' can then be removed.

This reduces the number of arguments passed to alloc_vmap_area().
Link: https://lkml.kernel.org/r/20240309044454.648888-1-bhe@redhat.com Signed-off-by: Baoquan He bhe@redhat.com Reviewed-by: Uladzislau Rezki (Sony) urezki@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index b81d757fdc0d..ce8e920a6400 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1931,8 +1931,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, unsigned long align, unsigned long vstart, unsigned long vend, int node, gfp_t gfp_mask, - unsigned long va_flags, struct vm_struct *vm, - unsigned long flags, const void *caller) + unsigned long va_flags, struct vm_struct *vm) { struct vmap_node *vn; struct vmap_area *va; @@ -1995,8 +1994,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va->vm = NULL; va->flags = (va_flags | vn_id);
- if (vm) - setup_vmalloc_vm(vm, va, flags, caller); + if (vm) { + vm->addr = (void *)va->va_start; + vm->size = va->va_end - va->va_start; + va->vm = vm; + }
vn = addr_to_node(va->va_start);
@@ -2572,8 +2574,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE, VMALLOC_START, VMALLOC_END, node, gfp_mask, - VMAP_RAM|VMAP_BLOCK, NULL, - 0, NULL); + VMAP_RAM|VMAP_BLOCK, NULL); if (IS_ERR(va)) { kfree(vb); return ERR_CAST(va); @@ -2931,7 +2932,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node) va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START, VMALLOC_END, node, GFP_KERNEL, VMAP_RAM, - NULL, 0, NULL); + NULL); if (IS_ERR(va)) return NULL;
@@ -3070,7 +3071,10 @@ static struct vm_struct *__get_vm_area_node(unsigned long size, if (!(flags & VM_NO_GUARD)) size += PAGE_SIZE;
- va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area, flags, caller); + area->flags = flags; + area->caller = caller; + + va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area); if (IS_ERR(va)) { kfree(area); return NULL;
From: Kefeng Wang wangkefeng.wang@huawei.com
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1 CVE: NA
--------------------------------
The shrinker_alloc()/shrinker_register() interface used by the vmalloc lock contention series is not available in OLK-6.6, so the patches conflict. Fix this by defining a static struct shrinker and registering it with the old register_shrinker() interface.
Signed-off-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: ZhangPeng zhangpeng362@huawei.com --- mm/vmalloc.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ce8e920a6400..c90d9309f80c 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -5254,9 +5254,13 @@ vmap_node_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) return SHRINK_STOP; }
+static struct shrinker vmap_node_shrinker = { + .count_objects = vmap_node_shrink_count, + .scan_objects = vmap_node_shrink_scan, +}; + void __init vmalloc_init(void) { - struct shrinker *vmap_node_shrinker; struct vmap_area *va; struct vmap_node *vn; struct vm_struct *tmp; @@ -5305,13 +5309,6 @@ void __init vmalloc_init(void) vmap_init_free_space(); vmap_initialized = true;
- vmap_node_shrinker = shrinker_alloc(0, "vmap-node"); - if (!vmap_node_shrinker) { + if (register_shrinker(&vmap_node_shrinker, "vmap-node")) pr_err("Failed to allocate vmap-node shrinker!\n"); - return; - } - - vmap_node_shrinker->count_objects = vmap_node_shrink_count; - vmap_node_shrinker->scan_objects = vmap_node_shrink_scan; - shrinker_register(vmap_node_shrinker); }
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/5627 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/K...