hulk inclusion category: performance bugzilla: https://gitee.com/openeuler/kernel/issues/IBAZIJ
--------------------------------
While tuning file copy performance of unixbench 5.1.3, an obvious performance gap was found after an unrelated patch ("KVM: arm64: Exclude mdcr_el2_host from kvm_vcpu_arch") was merged. After some debugging, it was confirmed that the different cacheline alignments of the per_cpu variable osq_node lead to different performance:
System.map of good performance case:
ffff800081a50dc0 D runqueues
ffff800081a51e80 d qos_overload_timer
ffff800081a51f00 d qos_throttled_cfs_rq
ffff800081a51f40 d osq_node    <- osq_node is 64Byte aligned
ffff800081a51fc0 d qnodes
ffff800081a52040 d rcu_data
System.map of bad performance case:
ffff800081a51000 D runqueues
ffff800081a520c0 d qos_overload_timer
ffff800081a62140 d qos_throttled_cfs_rq
ffff800081a62180 d osq_node    <- osq_node is 128Byte aligned
ffff800081a62200 d qnodes
ffff800081a62280 d rcu_data
Adjust the preceding per_cpu variable qos_throttled_cfs_rq to be 128-Byte cacheline aligned; struct osq_node then becomes 64-Byte cacheline aligned, achieving a better performance score on the file copy testcase.
Before this patch:
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0 9014327065.8 772435.9
Double-Precision Whetstone                       55.0    1773200.2 322400.0
Execl Throughput                                 43.0      25330.3   5890.8
File Copy 1024 bufsize 2000 maxblocks          3960.0     500211.0   1263.2
File Copy 256 bufsize 500 maxblocks            1655.0     135793.0    820.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    2033821.0   3506.6
Pipe Throughput                               12440.0  307115565.6 246877.5
Pipe-based Context Switching                   4000.0   26449665.0  66124.2
Process Creation                                126.0      67528.1   5359.4
Shell Scripts (1 concurrent)                     42.4     103709.4  24459.8
Shell Scripts (8 concurrent)                      6.0      13968.7  23281.2
System Call Overhead                          15000.0   14497214.3   9664.8
                                                                   ========
System Benchmarks Index Score                                      19236.3
After this patch:
System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0 9014326929.3 772435.9
Double-Precision Whetstone                       55.0    1768022.0 321458.5
Execl Throughput                                 43.0      25340.4   5893.1
File Copy 1024 bufsize 2000 maxblocks          3960.0     603479.0   1523.9
File Copy 256 bufsize 500 maxblocks            1655.0     150355.0    908.5
File Copy 4096 bufsize 8000 maxblocks          5800.0    2157456.0   3719.8
Pipe Throughput                               12440.0  298863938.1 240244.3
Pipe-based Context Switching                   4000.0   31548980.3  78872.5
Process Creation                                126.0      64479.9   5117.5
Shell Scripts (1 concurrent)                     42.4     108471.0  25582.8
Shell Scripts (8 concurrent)                      6.0      14539.2  24232.0
System Call Overhead                          15000.0   12485789.2   8323.9
                                                                   ========
System Benchmarks Index Score                                      19862.6
Note: if the relative positions of the per_cpu variables qos_throttled_cfs_rq and osq_node change, this workaround should be adjusted as well.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fadc59328e3b..be1d35549144 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -145,7 +145,7 @@ int __weak arch_asym_cpu_priority(int cpu)
 #ifdef CONFIG_QOS_SCHED
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct list_head, qos_throttled_cfs_rq);
+static DEFINE_PER_CPU_SECTION(struct list_head, qos_throttled_cfs_rq, PER_CPU_SHARED_ALIGNED_SECTION) __attribute__((__aligned__(128)));
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct hrtimer, qos_overload_timer);
 static DEFINE_PER_CPU(int, qos_cpu_overload);
 unsigned int sysctl_overload_detect_period = 5000;  /* in ms */