Backport patches from OLK-5.10 to stabilize the performance of unixbench context1.
Wang ShaoBo (1):
  timekeeping: Avoiding false sharing in field access of tk_core
Zheng Zengkai (1):
  openeuler_defconfig: Enable CONFIG_ARCH_LLC_128_LINE_SIZE for Hisilicon platforms
 arch/arm64/Kconfig                     | 9 +++++++++
 arch/arm64/configs/openeuler_defconfig | 1 +
 arch/arm64/include/asm/cache.h         | 6 ++++++
 kernel/time/timekeeping.c              | 7 +++++++
 4 files changed, 23 insertions(+)
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7411 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/2...
From: Wang ShaoBo <bobo.shaobowang@huawei.com>
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9PMP7
CVE: NA
---------------------------
We detected a performance deterioration when running Unixbench, and used bisection to locate the offending patch 7e66740ad725 ("MPAM / ACPI: Refactoring MPAM init process and set MPAM ACPI as entrance"). Comparing the two commits df5defd901ff ("KVM: X86: MMU: Use the correct inherited permissions to get shadow page") and ac4dbb7554ef ("ACPI 6.x: Add definitions for MPAM table") we get the following test results:
CMD: ./Run -c xx context1

RESULT:
+-------------UnixBench context1-----------+
+---------+--------------+-----------------+
+         + ac4dbb7554ef +  df5defd901ff   +
+---------+--------------+-----------------+
+  Cores  +    Score     +      Score      +
+---------+--------------+-----------------+
+    1    +    522.8     +      535.7      +
+---------+--------------+-----------------+
+   24    +   11231.5    +     12111.2     +
+---------+--------------+-----------------+
+   48    +    8535.1    +      8745.1     +
+---------+--------------+-----------------+
+   72    +   10821.9    +     10343.8     +
+---------+--------------+-----------------+
+   96    +   15238.5    +     42947.8     +
+---------+--------------+-----------------+
We also found a significant difference in the latency samples when using the perf tool:
HEAD:ac4dbb7554ef                                          HEAD:df5defd901ff

45.18%  [kernel]  [k] ktime_get_coarse_real_ts64    ->    1.78%  [kernel]  [k] ktime_get_coarse_real_ts64
        ...
        65.87 │ dmb ishld    // smp_rmb()
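The hot path here is the seqcount retry loop in the coarse clock reader: the dmb ishld above is the smp_rmb() issued when re-checking tk_core.seq. For context only, the reader in kernel/time/timekeeping.c looks roughly like this (abridged; no change is being made to it):

void ktime_get_coarse_real_ts64(struct timespec64 *ts)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	unsigned int seq;

	do {
		/* read tk_core.seq before copying the time fields */
		seq = read_seqcount_begin(&tk_core.seq);

		/* copies tk->xtime_sec and tk->tkr_mono.xtime_nsec */
		*ts = tk_xtime(tk);

		/* read_seqcount_retry() issues smp_rmb(), the dmb ishld above */
	} while (read_seqcount_retry(&tk_core.seq, seq));
}

So every call reads tk_core.seq and tk_core.timekeeper.tkr_mono, and any store into the same cache line by the timekeeping writer forces these reads to miss.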
Through ftrace we obtained the call trace and counted the calls to ktime_get_coarse_real_ts64, which frequently accesses tk_core->seq and tk_core->timekeeper->tkr_mono:
- 48.86% [kernel]  [k] ktime_get_coarse_real_ts64
   - 5.76% ktime_get_coarse_real_ts64      # about 111437657 times per 10 seconds
   - 14.70% __audit_syscall_entry
        syscall_trace_enter
        el0_svc_common
        el0_svc_handler
      + el0_svc
   - 2.85% current_time
So this may be a performance degradation caused by interference between accesses to different fields. We compared the .bss and .data sections of the two versions:
HEAD:ac4dbb7554ef
 `-> ffff00000962e680 l     O .bss   0000000000000110 tk_core
     ffff000009355680 l     O .data  0000000000000078 tk_fast_mono
     ffff0000093557a0 l     O .data  0000000000000090 dummy_clock
     ffff000009355700 l     O .data  0000000000000078 tk_fast_raw
     ffff000009355778 l     O .data  0000000000000028 timekeeping_syscore_ops
     ffff00000962e640 l     O .bss   0000000000000008 cycles_at_suspend

HEAD:df5defd901ff
 `-> ffff00000957dbc0 l     O .bss   0000000000000110 tk_core
     ffff0000092b4e80 l     O .data  0000000000000078 tk_fast_mono
     ffff0000092b4fa0 l     O .data  0000000000000090 dummy_clock
     ffff0000092b4f00 l     O .data  0000000000000078 tk_fast_raw
     ffff0000092b4f78 l     O .data  0000000000000028 timekeeping_syscore_ops
     ffff00000957db80 l     O .bss   0000000000000008 cycles_at_suspend
Comparing the two versions' tk_core addresses: in ac4dbb7554ef, ffff00000962e680 is 128-byte aligned, while in df5defd901ff it is only 64-byte aligned, so the memory layout of tk_core relative to cache lines has changed subtly:
HEAD:ac4dbb7554ef
 `->                 |<---------former 64 Bytes--------->|<------------latter 64 Bytes------------>|
 0xffff00000957dbc0_>|<-seq 8Bytes->|<-tkr_mono 56Bytes->|<-tkr_raw 56Bytes->|<-xtime_sec 8Bytes->|
 0xffff00000957dc00_>...

HEAD:df5defd901ff
 `->                 |<------former 64 Bytes---->|<--------latter 64 Bytes----------->|
 0xffff00000962e680_>|<-Other variables 64Bytes->|<-seq 8Bytes->|<-tkr_mono 56Bytes->|
 0xffff00000962e6c0_>..
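To make the two diagrams concrete, the following throwaway userspace sketch (illustrative only, not part of the patch; the field sizes are the ones listed above) prints which 128-byte LLC line each field falls into when the struct starts at offset 0 of a line, as in ac4dbb7554ef, versus offset 64, as in df5defd901ff:

#include <stdio.h>

int main(void)
{
	const char *name[] = { "seq", "tkr_mono", "tkr_raw", "xtime_sec" };
	const unsigned long size[] = { 8, 56, 56, 8 };
	const unsigned long base[] = { 0, 64 };	/* start offset within a 128-byte line */

	for (int b = 0; b < 2; b++) {
		unsigned long off = base[b];

		printf("tk_core at +%lu into a 128-byte line:\n", base[b]);
		for (int i = 0; i < 4; i++) {
			printf("  %-9s -> line %lu..%lu\n", name[i],
			       off / 128, (off + size[i] - 1) / 128);
			off += size[i];
		}
	}
	return 0;
}

With a 128-byte-aligned base all four fields land in line 0; with the 64-byte offset, seq and tkr_mono stay in line 0 while tkr_raw and xtime_sec move to line 1.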
We verified that the tkr_raw and xtime_sec fields interfere strongly with the seq and tkr_mono fields because of frequent load/store operations; this causes the well-known false sharing.
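The cost of such false sharing is easy to reproduce in isolation. The demo below is purely illustrative and unrelated to the kernel sources (assume it is built with gcc -O2 -pthread): two threads each bump their own counter, and removing the padding puts both counters into one cache line, which noticeably increases the run time on typical SMP machines:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* With the padding, a and b live in different 128-byte lines. */
struct counters {
	volatile unsigned long a;
	char pad[128];			/* remove to provoke false sharing */
	volatile unsigned long b;
};

static struct counters c __attribute__((__aligned__(128)));

static void *bump_a(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < ITERS; i++)
		c.a++;
	return NULL;
}

static void *bump_b(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < ITERS; i++)
		c.b++;
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	struct timespec s, e;

	clock_gettime(CLOCK_MONOTONIC, &s);
	pthread_create(&t1, NULL, bump_a, NULL);
	pthread_create(&t2, NULL, bump_b, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	clock_gettime(CLOCK_MONOTONIC, &e);

	printf("elapsed: %.2fs\n",
	       (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9);
	return 0;
}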
We add a 64-byte padding field to tk_core, reserved for any future use, and keep tk_core 128-byte aligned. This prevents the stored layout of tk_core from changing in this way. With this solution, the layout of tk_core always looks like this:
crash> struct -o tk_core_t
struct tk_core_t {
    [0] u64 padding[8];
   [64] seqcount_t seq;
   [72] struct timekeeper timekeeper;
}
SIZE: 336

crash> struct -o timekeeper
struct timekeeper {
    [0] struct tk_read_base tkr_mono;
   [56] struct tk_read_base tkr_raw;
  [112] u64 xtime_sec;
  [120] unsigned long ktime_sec;
  ...
}
SIZE: 264
After applying our solution:
+---------+--------------+
+         + Our solution +
+---------+--------------+
+  Cores  +    Score     +
+---------+--------------+
+    1    +    548.9     +
+---------+--------------+
+   24    +   11018.3    +
+---------+--------------+
+   48    +    8938.2    +
+---------+--------------+
+   72    +   14610.7    +
+---------+--------------+
+   96    +   40811.7    +
+---------+--------------+
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/Kconfig             | 9 +++++++++
 arch/arm64/include/asm/cache.h | 6 ++++++
 kernel/time/timekeeping.c      | 7 +++++++
 3 files changed, 22 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fc56e4e30e29..5576ddf63622 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1606,6 +1606,15 @@ config HW_PERF_EVENTS
 	def_bool y
 	depends on ARM_PMU
 
+config ARCH_LLC_128_LINE_SIZE
+	bool "Force 128 bytes alignment for fitting LLC cacheline"
+	depends on ARM64
+	default y
+	help
+	  As specific machine's LLC cacheline size may be up to
+	  128 bytes, gaining performance improvement from fitting
+	  128 Bytes LLC cache aligned.
+
 # Supported by clang >= 7.0 or GCC >= 12.0.0
 config CC_HAVE_SHADOW_CALL_STACK
 	def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18)
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 1613779be63a..8455df351ef8 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -8,6 +8,12 @@
 #define L1_CACHE_SHIFT		(6)
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
 
+#ifdef CONFIG_ARCH_LLC_128_LINE_SIZE
+#ifndef ____cacheline_aligned_128
+#define ____cacheline_aligned_128 __attribute__((__aligned__(128)))
+#endif
+#endif
+
 #define CLIDR_LOUU_SHIFT	27
 #define CLIDR_LOC_SHIFT		24
 #define CLIDR_LOUIS_SHIFT	21
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8aab7ed41490..3511fc008a85 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -48,9 +48,16 @@ DEFINE_RAW_SPINLOCK(timekeeper_lock);
  * cache line.
  */
 static struct {
+#ifdef CONFIG_ARCH_LLC_128_LINE_SIZE
+	u64			padding[8];
+#endif
 	seqcount_raw_spinlock_t	seq;
 	struct timekeeper	timekeeper;
+#ifdef CONFIG_ARCH_LLC_128_LINE_SIZE
+} tk_core ____cacheline_aligned_128 = {
+#else
 } tk_core ____cacheline_aligned = {
+#endif
 	.seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_core.seq, &timekeeper_lock),
 };
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I9PMP7
CVE: NA
---------------------------
On some Hisilicon platforms, such as Kunpeng 920, the L3 cacheline size is 128 bytes. Enable CONFIG_ARCH_LLC_128_LINE_SIZE for these Hisilicon platforms by default.
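For reference, the LLC line size of a given machine can be read from the cacheinfo sysfs attributes. A small sketch (assumptions: index3 is the L3/LLC on the target platform and the attribute is exposed there; both may differ on other systems):

#include <stdio.h>

int main(void)
{
	/* assumed path: index3 is usually the L3 cache on Kunpeng-class parts */
	const char *path =
		"/sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size";
	FILE *f = fopen(path, "r");
	int line_size;

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &line_size) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("LLC coherency_line_size: %d bytes\n", line_size);
	return 0;
}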
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/arm64/Kconfig                     | 4 ++--
 arch/arm64/configs/openeuler_defconfig | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5576ddf63622..2a875546bdc7 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1608,8 +1608,8 @@ config HW_PERF_EVENTS
 
 config ARCH_LLC_128_LINE_SIZE
 	bool "Force 128 bytes alignment for fitting LLC cacheline"
-	depends on ARM64
-	default y
+	depends on ARCH_HISI
+	default n
 	help
 	  As specific machine's LLC cacheline size may be up to
 	  128 bytes, gaining performance improvement from fitting
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index 4932a6be9d6f..6a73a3fd2994 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -477,6 +477,7 @@ CONFIG_HZ=250
 CONFIG_SCHED_HRTICK=y
 CONFIG_ARCH_SPARSEMEM_ENABLE=y
 CONFIG_HW_PERF_EVENTS=y
+CONFIG_ARCH_LLC_128_LINE_SIZE=y
 CONFIG_CC_HAVE_SHADOW_CALL_STACK=y
 CONFIG_PARAVIRT=y
 CONFIG_PARAVIRT_SCHED=y