Align the per_cpu variable osq_node to a 64-byte cacheline boundary to optimize performance.
Zheng Zengkai (1):
  Align per_cpu osq_node to 64 Byte size cacheline

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/14166 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/D...
hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/IBAZIJ
--------------------------------
While tuning the file copy performance of unixbench 5.1.3, an obvious performance gap was found after an unrelated patch ("KVM: arm64: Exclude mdcr_el2_host from kvm_vcpu_arch") was merged. Debugging confirmed that different cacheline alignments of the per_cpu variable osq_node lead to different performance:
System.map of good performance case:
  ffff800081a50dc0 D runqueues
  ffff800081a51e80 d qos_overload_timer
  ffff800081a51f00 d qos_throttled_cfs_rq
  ffff800081a51f40 d osq_node    <- osq_node is 64Byte aligned
  ffff800081a51fc0 d qnodes
  ffff800081a52040 d rcu_data
System.map of bad performance case:
  ffff800081a51000 D runqueues
  ffff800081a520c0 d qos_overload_timer
  ffff800081a62140 d qos_throttled_cfs_rq
  ffff800081a62180 d osq_node    <- osq_node is 128Byte aligned
  ffff800081a62200 d qnodes
  ffff800081a62280 d rcu_data
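The difference between the two listings is where osq_node falls inside a 128-byte window. A minimal sketch of that check follows; the helper names offset_in_window() and is_aligned() are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: offset of addr inside an alignment window.
 * align must be a non-zero power of two (for example 64 or 128).
 */
uint64_t offset_in_window(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & (align - 1);
}

/* Hypothetical helper: true if addr is a multiple of align. */
bool is_aligned(uint64_t addr, uint64_t align)
{
    return offset_in_window(addr, align) == 0;
}
```

With the addresses above, the good case (0xffff800081a51f40) is 64-byte aligned but sits at offset 64 of its 128-byte window, while the bad case (0xffff800081a62180) sits at offset 0, that is, it is 128-byte aligned.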
Align the preceding per_cpu variable qos_throttled_cfs_rq to 128 bytes, so that struct osq_node, which directly follows it, is 64-byte cacheline aligned. This achieves a better performance score in the file copy test cases.
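Why aligning the preceding variable pins the position of the next one can be illustrated in user space. This is only a sketch under stated assumptions: the struct types are hypothetical stand-ins for the kernel types, the two variables are modelled as adjacent struct members (as they are adjacent in System.map), and the 64-byte alignment on osq_node stands in for the cacheline alignment of the shared-aligned per_cpu section. It relies on the GCC/Clang aligned attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel definitions differ. */
struct list_head_sketch { void *next, *prev; };
struct osq_node_sketch { void *next, *prev; int locked; int cpu; };

/* Model of the two adjacent per_cpu variables after the patch. */
struct percpu_layout_sketch {
    /* Forced to a 128-byte boundary, as the patch does. */
    struct list_head_sketch qos_throttled_cfs_rq __attribute__((aligned(128)));
    /* Assumed 64-byte cacheline alignment of the following variable. */
    struct osq_node_sketch osq_node __attribute__((aligned(64)));
};

/* osq_node always lands 64 bytes past a 128-byte boundary. */
static_assert(offsetof(struct percpu_layout_sketch, osq_node) % 128 == 64,
              "osq_node should sit at offset 64 of a 128-byte window");

size_t osq_node_offset(void)
{
    return offsetof(struct percpu_layout_sketch, osq_node);
}
```

In this model osq_node can never be 128-byte aligned as long as it directly follows qos_throttled_cfs_rq, which matches the good System.map layout.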
Before this patch:

System Benchmarks Index Values          BASELINE        RESULT     INDEX
Dhrystone 2 using register variables    116700.0  9014327065.8  772435.9
Double-Precision Whetstone                  55.0     1773200.2  322400.0
Execl Throughput                            43.0       25330.3    5890.8
File Copy 1024 bufsize 2000 maxblocks     3960.0      500211.0    1263.2
File Copy 256 bufsize 500 maxblocks       1655.0      135793.0     820.5
File Copy 4096 bufsize 8000 maxblocks     5800.0     2033821.0    3506.6
Pipe Throughput                          12440.0   307115565.6  246877.5
Pipe-based Context Switching              4000.0    26449665.0   66124.2
Process Creation                           126.0       67528.1    5359.4
Shell Scripts (1 concurrent)                42.4      103709.4   24459.8
Shell Scripts (8 concurrent)                 6.0       13968.7   23281.2
System Call Overhead                     15000.0    14497214.3    9664.8
                                                                ========
System Benchmarks Index Score                                    19236.3
After this patch:

System Benchmarks Index Values          BASELINE        RESULT     INDEX
Dhrystone 2 using register variables    116700.0  9014326929.3  772435.9
Double-Precision Whetstone                  55.0     1768022.0  321458.5
Execl Throughput                            43.0       25340.4    5893.1
File Copy 1024 bufsize 2000 maxblocks     3960.0      603479.0    1523.9
File Copy 256 bufsize 500 maxblocks       1655.0      150355.0     908.5
File Copy 4096 bufsize 8000 maxblocks     5800.0     2157456.0    3719.8
Pipe Throughput                          12440.0   298863938.1  240244.3
Pipe-based Context Switching              4000.0    31548980.3   78872.5
Process Creation                           126.0       64479.9    5117.5
Shell Scripts (1 concurrent)                42.4      108471.0   25582.8
Shell Scripts (8 concurrent)                 6.0       14539.2   24232.0
System Call Overhead                     15000.0    12485789.2    8323.9
                                                                ========
System Benchmarks Index Score                                    19862.6
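For reading the two result sets: each INDEX value in the tables equals RESULT / BASELINE * 10, which can be checked against any row. The sketch below reproduces that arithmetic and the relative change between runs; the function names are hypothetical and are not part of unixbench.

```c
#include <assert.h>

/* Per-test index as it appears in the tables: RESULT / BASELINE * 10. */
double unixbench_index(double baseline, double result)
{
    assert(baseline > 0.0);
    return result / baseline * 10.0;
}

/* Relative change from before to after, in percent. */
double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For example, percent_change(1263.2, 1523.9) gives roughly a 20.6% gain for "File Copy 1024 bufsize 2000 maxblocks", and percent_change(19236.3, 19862.6) gives roughly 3.3% for the overall index score.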
Note: If the relative position of the per_cpu variables qos_throttled_cfs_rq and osq_node changes, this workaround must be adjusted as well.
Signed-off-by: Yicong Yang yangyicong@hisilicon.com
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fadc59328e3b..be1d35549144 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -145,7 +145,7 @@ int __weak arch_asym_cpu_priority(int cpu)
 #ifdef CONFIG_QOS_SCHED
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct list_head, qos_throttled_cfs_rq);
+static DEFINE_PER_CPU_SECTION(struct list_head, qos_throttled_cfs_rq, PER_CPU_SHARED_ALIGNED_SECTION) __attribute__((__aligned__(128)));
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct hrtimer, qos_overload_timer);
 static DEFINE_PER_CPU(int, qos_cpu_overload);
 unsigned int sysctl_overload_detect_period = 5000; /* in ms */