hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9MSNE?from=project-issue
CVE: NA
--------------------------------
The high-frequency function cpu_util_without() is only called when a task
is woken up or created for the first time. In this scenario, performance
can be improved by simplifying the function.
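In effect, the simplified function skips removing @p's own contribution
(as the generic cpu_util() path would) and directly returns
min(max(util_avg, util_est), capacity_orig) of the CPU's cfs_rq; on the
wakeup/fork path @p is not expected to be contributing yet, so this is
considered an acceptable approximation (see the diff below).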
Here are the detailed test results of unixbench.
Command: ./Run -c 1 -i 3
Without Patch
------------------------------------------------------------------------
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   41898849.1   3590.3
Double-Precision Whetstone                       55.0       4426.3    804.8
Execl Throughput                                 43.0       2828.8    657.9
File Copy 1024 bufsize 2000 maxblocks          3960.0     837180.0   2114.1
File Copy 256 bufsize 500 maxblocks            1655.0     256669.0   1550.9
File Copy 4096 bufsize 8000 maxblocks          5800.0    2264169.0   3903.7
Pipe Throughput                               12440.0    1101364.7    885.3
Pipe-based Context Switching                   4000.0     136573.4    341.4
Process Creation                                126.0       6031.7    478.7
Shell Scripts (1 concurrent)                     42.4       5875.9   1385.8
Shell Scripts (8 concurrent)                      6.0       2567.1   4278.5
System Call Overhead                          15000.0    1065481.3    710.3
                                                                    ========
System Benchmarks Index Score                                         1252.0
With Patch
------------------------------------------------------------------------
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   41832459.9   3584.6
Double-Precision Whetstone                       55.0       4426.6    804.8
Execl Throughput                                 43.0       2675.8    622.3
File Copy 1024 bufsize 2000 maxblocks          3960.0     919862.0   2322.9
File Copy 256 bufsize 500 maxblocks            1655.0     274966.0   1661.4
File Copy 4096 bufsize 8000 maxblocks          5800.0    2350539.0   4052.7
Pipe Throughput                               12440.0    1182284.3    950.4
Pipe-based Context Switching                   4000.0     155034.4    387.6
Process Creation                                126.0       6371.9    505.7
Shell Scripts (1 concurrent)                     42.4       5797.9   1367.4
Shell Scripts (8 concurrent)                      6.0       2576.7   4294.4
System Call Overhead                          15000.0    1128173.1    752.1
                                                                    ========
System Benchmarks Index Score                                         1299.1
After the lmbench test, we can get a 0.8% ~ 6.5% performance improvement
for lmbench fork_proc/exec_proc/shell_proc.
The test results are as follows:
                  base     base+this patch
fork_proc        457ms      427ms (6.5%)
exec_proc       2008ms     1991ms (0.8%)
shell_proc      3062ms     2985ms (2.5%)
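(The percentages above are the relative time reduction, e.g.
(457 - 427) / 457 ≈ 6.5% for fork_proc.)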
Signed-off-by: Zhang Qiao <zhangqiao22(a)huawei.com>
Signed-off-by: Li Zetao <lizetao1(a)huawei.com>
---
kernel/sched/fair.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b0cb2f090da..010dbf2047e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8530,13 +8530,23 @@ unsigned long cpu_util_cfs_boost(int cpu)
* utilization of the specified task, whenever the task is currently
* contributing to the CPU utilization.
*/
-static unsigned long cpu_util_without(int cpu, struct task_struct *p)
+static inline unsigned long cpu_util_without(int cpu, struct task_struct *p)
{
- /* Task has no contribution or is new */
- if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
- p = NULL;
+ struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+ unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
+ /*
+ * As an optimization, @p's own contribution is not removed here:
+ * cpu_util_without() is only called on the wakeup/fork path, where
+ * @p is not expected to be contributing to @cpu's util_avg, so the
+ * rq utilization (boosted by util_est) is used directly.
+ */
+ if (sched_feat(UTIL_EST)) {
+ unsigned long util_est;
+ util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ util = max(util, util_est);
+ }
- return cpu_util(cpu, p, -1, 0);
+ return min(util, capacity_orig_of(cpu));
}
/*
--
2.34.1
From: Peng Zhang <zhangpeng.00(a)bytedance.com>
mainline inclusion
from mainline-v6.8-rc1
commit 7e552dcd803f4ff60165271c573ab2e38d15769f
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9N4V1
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
----------------------------------------------------------------------
The last range stored in a maple tree is typically quite large. By checking
whether it exceeds the sum of the remaining ranges in that node, it is
possible to avoid checking all the other gaps.
Running the maple tree test suite in user mode almost always results in a
near 100% hit rate for this optimization.
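In mas_leaf_max_gap() terms, pivots[max_piv] - mas->min is the total span
covered by the remaining slots of the leaf, so once the trailing gap exceeds
that value no other per-slot gap can be larger and the scan can return early
(see the added check below).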
Link: https://lkml.kernel.org/r/20231215074632.82045-1-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00(a)bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Li Zetao <lizetao1(a)huawei.com>
---
lib/maple_tree.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 98b3ded67d06..da227397e4d8 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1547,6 +1547,9 @@ static unsigned long mas_leaf_max_gap(struct ma_state *mas)
gap = ULONG_MAX - pivots[max_piv];
if (gap > max_gap)
max_gap = gap;
+
+ if (max_gap > pivots[max_piv] - mas->min)
+ return max_gap;
}
for (; i <= max_piv; i++) {
--
2.34.1
From: Dave Airlie <airlied(a)redhat.com>
stable inclusion
from stable-v5.10.214
commit 13d76b2f443dc371842916dd8768009ff1594716
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9LK8M
CVE: CVE-2024-26984
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id…
--------------------------------
commit fff1386cc889d8fb4089d285f883f8cba62d82ce upstream.
Running a lot of VK CTS in parallel against nouveau, once every
few hours you might see something like this crash.
BUG: kernel NULL pointer dereference, address: 0000000000000008
PGD 8000000114e6e067 P4D 8000000114e6e067 PUD 109046067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 7 PID: 53891 Comm: deqp-vk Not tainted 6.8.0-rc6+ #27
Hardware name: Gigabyte Technology Co., Ltd. Z390 I AORUS PRO WIFI/Z390 I AORUS PRO WIFI-CF, BIOS F8 11/05/2021
RIP: 0010:gp100_vmm_pgt_mem+0xe3/0x180 [nouveau]
Code: c7 48 01 c8 49 89 45 58 85 d2 0f 84 95 00 00 00 41 0f b7 46 12 49 8b 7e 08 89 da 42 8d 2c f8 48 8b 47 08 41 83 c7 01 48 89 ee <48> 8b 40 08 ff d0 0f 1f 00 49 8b 7e 08 48 89 d9 48 8d 75 04 48 c1
RSP: 0000:ffffac20c5857838 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00000000004d8001 RCX: 0000000000000001
RDX: 00000000004d8001 RSI: 00000000000006d8 RDI: ffffa07afe332180
RBP: 00000000000006d8 R08: ffffac20c5857ad0 R09: 0000000000ffff10
R10: 0000000000000001 R11: ffffa07af27e2de0 R12: 000000000000001c
R13: ffffac20c5857ad0 R14: ffffa07a96fe9040 R15: 000000000000001c
FS: 00007fe395eed7c0(0000) GS:ffffa07e2c980000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000011febe001 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
...
? gp100_vmm_pgt_mem+0xe3/0x180 [nouveau]
? gp100_vmm_pgt_mem+0x37/0x180 [nouveau]
nvkm_vmm_iter+0x351/0xa20 [nouveau]
? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
? __lock_acquire+0x3ed/0x2170
? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
nvkm_vmm_ptes_get_map+0xc2/0x100 [nouveau]
? __pfx_nvkm_vmm_ref_ptes+0x10/0x10 [nouveau]
? __pfx_gp100_vmm_pgt_mem+0x10/0x10 [nouveau]
nvkm_vmm_map_locked+0x224/0x3a0 [nouveau]
Adding any sort of useful debug usually makes it go away, so I hand
wrote the function in a line, and debugged the asm.
Every so often pt->memory->ptrs is NULL. This ptrs ptr is set in
the nv50_instobj_acquire called from nvkm_kmap.
If Thread A and Thread B both get to nv50_instobj_acquire around
the same time, and Thread A hits the refcount_set line, and in
lockstep thread B succeeds at refcount_inc_not_zero, there is a
chance the ptrs value won't have been stored since refcount_set
is unordered. Force a memory barrier here, I picked smp_mb, since
we want it on all CPUs and it's write followed by a read.
v2: use paired smp_rmb/smp_wmb.
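In barrier terms, the intended pairing looks roughly like this (a simplified
sketch of the publish/consume pattern, not the literal nouveau code; obj and
use() are placeholders):
  /* publish side (slow path, under the lock) */
  obj->ptrs = &nv50_instobj_slow;   /* or nv50_instobj_fast */
  smp_wmb();                        /* order the ptrs store before the refcount store */
  refcount_set(&obj->maps, 1);
  /* fast path */
  if (refcount_inc_not_zero(&obj->maps)) {
          smp_rmb();                /* pairs with the smp_wmb() above */
          use(obj->ptrs);           /* guaranteed to observe the ptrs store */
  }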
Cc: <stable(a)vger.kernel.org>
Fixes: be55287aa5ba ("drm/nouveau/imem/nv50: embed nvkm_instobj directly into nv04_instobj")
Signed-off-by: Dave Airlie <airlied(a)redhat.com>
Signed-off-by: Danilo Krummrich <dakr(a)redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240411011510.2546857-1-airl…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Yipeng Zou <zouyipeng(a)huawei.com>
---
drivers/gpu/drm/nouveau/nvkm/subdev/instmem/nv50.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/nv50.c b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/nv50.c
index db48a1daca0c..f8ca79eaa7f7 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/nv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/nv50.c
@@ -221,8 +221,11 @@ nv50_instobj_acquire(struct nvkm_memory *memory)
void __iomem *map = NULL;
/* Already mapped? */
- if (refcount_inc_not_zero(&iobj->maps))
+ if (refcount_inc_not_zero(&iobj->maps)) {
+ /* read barrier matches the wmb on refcount set */
+ smp_rmb();
return iobj->map;
+ }
/* Take the lock, and re-check that another thread hasn't
* already mapped the object in the meantime.
@@ -249,6 +252,8 @@ nv50_instobj_acquire(struct nvkm_memory *memory)
iobj->base.memory.ptrs = &nv50_instobj_fast;
else
iobj->base.memory.ptrs = &nv50_instobj_slow;
+ /* barrier to ensure the ptrs are written before refcount is set */
+ smp_wmb();
refcount_set(&iobj->maps, 1);
}
--
2.34.1