From: "Uladzislau Rezki (Sony)" urezki@gmail.com
mainline inclusion
from mainline-v6.9-rc1
commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I9CHG1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Patch series "Mitigate a vmap lock contention", v3.
1. Motivation
- Offload the global vmap lock, making it scale with the number of CPUs;
- If possible, and if there is agreement, we can remove the "Per cpu kva allocator" to make the vmap code simpler;
- There were complaints from XFS folks that vmalloc might be contended on their workloads.
2. Design (high-level overview)
We introduce an effective vmap node logic. A node behaves as an independent entity and serves an allocation request directly (if possible) from its pool. That way it bypasses the global vmap space, which is protected by its own lock.
Access to the pools is serialized per CPU. The number of nodes equals the number of CPUs in the system. Please note, the upper threshold is bound to 128 nodes.
Pools are size-segregated and populated based on system demand. The maximum allocation request that can be stored in a segregated storage is 256 pages. The lazy drain path decays a pool by 25% as a first step, and as a second step repopulates it with freshly freed VAs for reuse, instead of returning them to the global space.
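For illustration only, the arithmetic of that decay step could look like the minimal sketch below; the structure and names are assumptions made for this example, not the series' actual code:

	/* Illustrative sketch of one 25% decay step of a node pool. */
	struct pool_sketch {
		unsigned long len;	/* number of cached VAs in the pool */
	};

	static unsigned long
	decay_pool_sketch(struct pool_sketch *p)
	{
		unsigned long to_drop = p->len >> 2;	/* 25% of the pool */

		p->len -= to_drop;
		return to_drop;	/* the caller releases these VAs */
	}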
When a VA is obtained (alloc path), it is stored in one of the nodes: the va->va_start address is converted to the node where it should be placed and reside. Doing so balances VAs across the nodes, and as a result access becomes scalable. The addr_to_node() function performs the conversion of an address to its node.
The vmap space is divided into fixed-size segments of 16 pages each. That way, any address can be associated with a segment number. The number of segments equals num_possible_cpus(), but is not greater than 128. Numbering starts from 0. See below how an address is converted:

	static inline unsigned int
	addr_to_node_id(unsigned long addr)
	{
		return (addr / zone_size) % nr_nodes;
	}
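To see the round-robin effect of this conversion, the small userspace program below runs consecutive 16-page segments through the same formula; the zone_size and nr_nodes values are assumptions picked for the example (4 KiB pages, 4 nodes), not taken from the patch:

	#include <stdio.h>

	int main(void)
	{
		unsigned long zone_size = 16UL * 4096;	/* 16 pages of 4 KiB */
		unsigned int nr_nodes = 4;		/* e.g. a 4-CPU system */
		unsigned long addr;

		/* Consecutive segments map to nodes 0,1,2,3,0,1,2,3... */
		for (addr = 0; addr < 8 * zone_size; addr += zone_size)
			printf("addr %#lx -> node %lu\n",
			       addr, (addr / zone_size) % nr_nodes);

		return 0;
	}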
On the free path, a VA is easily found by converting its "va_start" address to the node where it resides. It is moved from the "busy" data structure to the "lazy" data structure. Later on, as noted earlier, the lazy kworker decays each node pool and repopulates it with fresh incoming VAs. Please note, a VA is returned to the node that served the alloc request.
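A minimal sketch of that free path is shown below; the vmap_node layout, the lazy list and the helper name are hypothetical stand-ins for the series' actual identifiers:

	/*
	 * Hedged sketch: hand a freed VA back to the node that is
	 * responsible for its address range, not to a global list.
	 */
	struct vmap_node {
		spinlock_t lazy_lock;		/* protects this node only */
		struct list_head lazy_head;	/* VAs awaiting the purge kworker */
	};

	static void
	free_vmap_area_sketch(struct vmap_area *va, struct vmap_node *nodes,
			      unsigned long zone_size, unsigned int nr_nodes)
	{
		/* Same conversion as addr_to_node_id() above. */
		struct vmap_node *vn =
			&nodes[(va->va_start / zone_size) % nr_nodes];

		spin_lock(&vn->lazy_lock);
		/* The VA was already unlinked from the node's "busy" tree. */
		list_add_tail(&va->list, &vn->lazy_head);
		spin_unlock(&vn->lazy_lock);
	}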
3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
<default perf> 94.41% 0.89% [kernel] [k] _raw_spin_lock 93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath 76.13% 0.28% [kernel] [k] __vmalloc_node_range 72.96% 0.81% [kernel] [k] alloc_vmap_area 56.94% 0.00% [kernel] [k] __get_vm_area_node 41.95% 0.00% [kernel] [k] vmalloc 37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test 35.17% 0.00% [kernel] [k] ret_from_fork_asm 35.17% 0.00% [kernel] [k] ret_from_fork 35.17% 0.00% [kernel] [k] kthread 35.08% 0.00% [test_vmalloc] [k] test_func 34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test 28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test 23.53% 0.25% [kernel] [k] vfree.part.0 21.72% 0.00% [kernel] [k] remove_vm_area 20.08% 0.21% [kernel] [k] find_unlink_vmap_area 2.34% 0.61% [kernel] [k] free_vmap_area_noflush <default perf> vs <patch-series perf> 82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test 63.36% 0.02% [kernel] [k] vmalloc 63.34% 2.64% [kernel] [k] __vmalloc_node_range 30.42% 4.46% [kernel] [k] vfree.part.0 28.98% 2.51% [kernel] [k] __alloc_pages_bulk 27.28% 0.19% [kernel] [k] __get_vm_area_node 26.13% 1.50% [kernel] [k] alloc_vmap_area 21.72% 21.67% [kernel] [k] clear_page_rep 19.51% 2.43% [kernel] [k] _raw_spin_lock 16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath 13.40% 2.07% [kernel] [k] free_unref_page 10.62% 0.01% [kernel] [k] remove_vm_area 9.02% 8.73% [kernel] [k] insert_vmap_area 8.94% 0.00% [kernel] [k] ret_from_fork_asm 8.94% 0.00% [kernel] [k] ret_from_fork 8.94% 0.00% [kernel] [k] kthread 8.29% 0.00% [test_vmalloc] [k] test_func 7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test 5.30% 4.73% [kernel] [k] purge_vmap_node 4.47% 2.65% [kernel] [k] free_vmap_area_noflush <patch-series perf>
The comparison confirms that native_queued_spin_lock_slowpath goes down from 93.07% to 16.51%.
The throughput is ~12x higher (10m51s vs 0m51s of wall time):
urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$
This patch (of 11):
Currently the __alloc_vmap_area() function contains open-coded logic that finds and adjusts a VA based on an allocation request.
Introduce a va_alloc() helper that only adjusts a found VA. There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20240102184633.748113-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20240102184633.748113-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc)
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
---
 mm/vmalloc.c | 41 ++++++++++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 13 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f5ac73a90d6d..01ed7a3c17a9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1553,6 +1553,32 @@ adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
 	return 0;
 }
 
+static unsigned long
+va_alloc(struct vmap_area *va,
+		struct rb_root *root, struct list_head *head,
+		unsigned long size, unsigned long align,
+		unsigned long vstart, unsigned long vend)
+{
+	unsigned long nva_start_addr;
+	int ret;
+
+	if (va->va_start > vstart)
+		nva_start_addr = ALIGN(va->va_start, align);
+	else
+		nva_start_addr = ALIGN(vstart, align);
+
+	/* Check the "vend" restriction. */
+	if (nva_start_addr + size > vend)
+		return vend;
+
+	/* Update the free vmap_area. */
+	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
+	if (WARN_ON_ONCE(ret))
+		return vend;
+
+	return nva_start_addr;
+}
+
 /*
  * Returns a start address of the newly allocated area, if success.
  * Otherwise a vend is returned that indicates failure.
@@ -1565,7 +1591,6 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
 	bool adjust_search_size = true;
 	unsigned long nva_start_addr;
 	struct vmap_area *va;
-	int ret;
 
 	/*
 	 * Do not adjust when:
@@ -1583,18 +1608,8 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
 	if (unlikely(!va))
 		return vend;
 
-	if (va->va_start > vstart)
-		nva_start_addr = ALIGN(va->va_start, align);
-	else
-		nva_start_addr = ALIGN(vstart, align);
-
-	/* Check the "vend" restriction. */
-	if (nva_start_addr + size > vend)
-		return vend;
-
-	/* Update the free vmap_area. */
-	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
-	if (WARN_ON_ONCE(ret))
+	nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend);
+	if (nva_start_addr == vend)
 		return vend;
 
 #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK