[PATCH OLK-6.6 v10 0/5] SMT QOS
Cloud Service Providers deploy Best-Effort (BE) and Latency Sensitive (LS)
tasks on the same physical core to maximize resource utilization. We
observe that the LS task needs more cycles to complete the same workload
due to microarchitectural resource contention. This feature controls the
instruction throughput of the BE task into the pipeline, so that the LS
task running on the other SMT sibling can occupy more uarch resources and
reach a better IPC.

The first patch splits out QOS_LEVEL from QOS_SCHED so that SMT QoS can
reuse it.

The test results on 920G:

+----------------------------+----------+-------------+---------+
|                            | Baseline | Co-location | SMT QoS |
|                            |          | baseline    |         |
| sched_wfi_timeout_us       |    \     |      \      |   50    |
| sched_smt_offline_util_pct |    \     |      \      |   50    |
| P99                        |  0.201   |    0.292    |  0.212  |
| CPU utilization            |  29.3%   |   67.00%    | 63.30%  |
| P99 regression percentage  |    \     |   45.27%    |  5.47%  |
+----------------------------+----------+-------------+---------+

Changes in v10:
- Decouple the QOS_SCHED_SMT_EXPELLER and QOS_SCHED scheduling logic from
  SMT_QOS.
- Some cleanup.

Changes in v9:
- Also use the SMT sibling CPU utilization NUMA-level watermark for
  offline task CPU selection, to improve 920G CPU utilization.

Changes in v8:
- Use the SMT sibling CPU utilization NUMA-level watermark instead of the
  src CPU offline-task migrate watermark on every load balance.

Changes in v7:
- Fix the prefer_cpu and select_cpus save/restore.
- Extract the can_smt_qos_migrate_task() helper from can_migrate_task().
- Update the names.
- Update the arch code and remove the pmu code.
- Add some comments.

Changes in v6:
- Rename QOS_LABEL to QOS_LEVEL.
- Rename USER_WFXT to SMT_QOS.
- Rename TAG_PULL to SMT_TAG_PULL.
- Adjust the select cpu code which depends on QOS_SCHED_DYNAMIC_AFFINITY.
- Move distributing offline tasks to SMT sibling cores based on the
  configured proportion into a separate patch.
- Use cpumask_t for even_cpu_mask.
- Move smt_throttle to the arch patch.
- pmu_smt_update_status() -> smt_update_qos_level().
- >> smt_task_imbalance instead of / 1000.
- Remove the limit for odd -> even load balance, which solves the problem
  of one NUMA node's load being very high.
- Rebased on the newest OLK-6.6 code.
- Fix the build issue when QOS_LEVEL is selected but CGROUP_SCHED or
  CFS_BANDWIDTH is not.

Jinjie Ruan (5):
  sched: Split out QOS_LEVEL from QOS_SCHED for reuse
  sched: Add qos_sched_enabled() helper for future expansion
  sched/fair: Add SMT QoS sched core code
  arm64: Add arch code for SMT QoS
  config: Enable SMT_QOS

 arch/arm64/Kconfig.turbo               |  17 ++
 arch/arm64/configs/openeuler_defconfig |   1 +
 arch/arm64/include/asm/cpufeature.h    |   5 +
 arch/arm64/include/asm/xint.h          |  15 ++
 arch/arm64/kernel/Makefile             |   1 +
 arch/arm64/kernel/entry-common.c       |  16 ++
 arch/arm64/kernel/entry.S              |  14 +-
 arch/arm64/kernel/smp.c                |  23 ++
 arch/arm64/kernel/smt_qos.c            |  84 +++++++
 arch/arm64/kernel/xcall/entry.S        |  78 +++++++
 drivers/irqchip/irq-gic-v3.c           |  43 ++++
 include/linux/sched.h                  |  20 ++
 init/Kconfig                           |   5 +
 kernel/sched/core.c                    |  20 +-
 kernel/sched/fair.c                    | 289 ++++++++++++++++++++++++-
 kernel/sched/features.h                |   4 +
 kernel/sched/sched.h                   |  18 +-
 17 files changed, 618 insertions(+), 35 deletions(-)
 create mode 100644 arch/arm64/include/asm/xint.h
 create mode 100644 arch/arm64/kernel/smt_qos.c

--
2.34.1
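For reference, the P99 regression percentages in the table above follow
directly from the P99 rows, measured against the standalone baseline:

  regression = (P99_mixed - P99_baseline) / P99_baseline
  co-location baseline: (0.292 - 0.201) / 0.201 ~= 45.27%
  SMT QoS:              (0.212 - 0.201) / 0.201 ~=  5.47%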
hulk inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8929

----------------------------------------

Refactor QOS_SCHED by decoupling the tag-related logic into a new
sub-config called "QOS_LEVEL". This new config implements cgroup tagging
and tag propagation, allowing it to be reused by "SMT QoS".

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 include/linux/sched.h | 20 ++++++++++++++++++++
 init/Kconfig          |  5 +++++
 kernel/sched/core.c   | 20 +++++++++++---------
 kernel/sched/fair.c   |  4 ++--
 kernel/sched/sched.h  | 18 ++----------------
 5 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f5c80c372f74..ba35e2265893 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2679,4 +2679,24 @@ static inline bool smart_grid_used(void)
 	return false;
 }
 #endif
+
+#ifdef CONFIG_QOS_LEVEL
+#ifdef CONFIG_QOS_SCHED_MULTILEVEL
+enum task_qos_level {
+	QOS_LEVEL_OFFLINE_EX = -2,
+	QOS_LEVEL_OFFLINE = -1,
+	QOS_LEVEL_ONLINE = 0,
+	QOS_LEVEL_HIGH = 1,
+	QOS_LEVEL_HIGH_EX = 2
+};
+#else
+enum task_qos_level {
+	QOS_LEVEL_OFFLINE = -1,
+	QOS_LEVEL_ONLINE = 0,
+};
+#endif
+
+DECLARE_PER_CPU_ALIGNED(int, qos_smt_status);
+#endif
+
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index b577cdeec6e5..a50e9c8a8cab 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1096,11 +1096,16 @@ menuconfig CGROUP_SCHED
 	  tasks.

 if CGROUP_SCHED
+config QOS_LEVEL
+	bool
+	depends on CGROUP_SCHED && CFS_BANDWIDTH
+
 config QOS_SCHED
 	bool "Qos task scheduling"
 	depends on CGROUP_SCHED
 	depends on CFS_BANDWIDTH
 	depends on SMP
+	select QOS_LEVEL
 	default n
 	help
 	  This option enable qos scheduler, and support co-location online
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d071de3ffa5..db9e41f600b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -121,7 +121,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 static void sched_change_qos_group(struct task_struct *tsk, struct task_group *tg);
 #endif

@@ -4881,7 +4881,7 @@ void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 				  struct task_group, css);
 		tg = autogroup_task_group(p, tg);
 		p->sched_task_group = tg;
-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 		tg_qos = tg;
 #endif
 	}
@@ -4892,7 +4892,7 @@ void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 	 * so use __set_task_cpu().
 	 */
 	__set_task_cpu(p, smp_processor_id());
-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 	sched_change_qos_group(p, tg_qos);
 #endif

@@ -7850,7 +7850,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
 change:

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 	/*
 	 * If the scheduling policy of an offline task is set to a policy
 	 * other than SCHED_IDLE, the online task preemption and cpu resource
@@ -10524,7 +10524,7 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 static inline int alloc_qos_sched_group(struct task_group *tg,
 					struct task_group *parent)
 {
@@ -10551,7 +10551,9 @@ static void sched_change_qos_group(struct task_struct *tsk, struct task_group *t
 			__setscheduler_prio(tsk, normal_prio(tsk));
 	}
 }
+#endif

+#ifdef CONFIG_QOS_SCHED
 struct offline_args {
 	struct work_struct work;
 	struct task_struct *p;
@@ -10642,7 +10644,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_fair_sched_group(tg, parent))
 		goto err;

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 	if (!alloc_qos_sched_group(tg, parent))
 		goto err;
 #endif
@@ -10737,7 +10739,7 @@ static void sched_change_group(struct task_struct *tsk, struct task_group *group
 {
 	tsk->sched_task_group = group;

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 	sched_change_qos_group(tsk, group);
 #endif

@@ -11655,7 +11657,7 @@ static int cpu_rebuild_affinity_domain_u64(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_QOS_SCHED_SMART_GRID */

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 static int tg_change_scheduler(struct task_group *tg, void *data)
 {
 	int policy;
@@ -11988,7 +11990,7 @@ static struct cftype cpu_legacy_files[] = {
 		.write = cpu_uclamp_max_write,
 	},
 #endif
-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 	{
 		.name = "qos_level",
 		.flags = CFTYPE_NOT_ON_ROOT,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38ee4c9c79bf..5af2793adae8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -165,8 +165,8 @@ static bool qos_smt_expelled(int this_cpu);
 static bool is_offline_task(struct task_struct *p);
 #endif

-#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER
-static DEFINE_PER_CPU(int, qos_smt_status);
+#ifdef CONFIG_QOS_LEVEL
+DEFINE_PER_CPU_ALIGNED(int, qos_smt_status);
 #endif

 #ifdef CONFIG_QOS_SCHED_PRIO_LB
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e9a60a6295e4..bba4812d9e21 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -459,7 +459,7 @@ struct task_group {

 	struct cfs_bandwidth cfs_bandwidth;

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 	long qos_level;
 #endif

@@ -1583,20 +1583,6 @@ do {						\
 } while (0)

 #ifdef CONFIG_QOS_SCHED
-#ifdef CONFIG_QOS_SCHED_MULTILEVEL
-enum task_qos_level {
-	QOS_LEVEL_OFFLINE_EX = -2,
-	QOS_LEVEL_OFFLINE = -1,
-	QOS_LEVEL_ONLINE = 0,
-	QOS_LEVEL_HIGH = 1,
-	QOS_LEVEL_HIGH_EX = 2
-};
-#else
-enum task_qos_level {
-	QOS_LEVEL_OFFLINE = -1,
-	QOS_LEVEL_ONLINE = 0,
-};
-#endif
 void init_qos_hrtimer(int cpu);
 #endif

@@ -3483,7 +3469,7 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 }
 #endif

-#ifdef CONFIG_QOS_SCHED
+#ifdef CONFIG_QOS_LEVEL
 static inline int qos_idle_policy(int policy)
 {
 	return policy == QOS_LEVEL_OFFLINE;
--
2.34.1
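With the tag logic split out, a consumer of CONFIG_QOS_LEVEL can classify
a task from its cgroup tag without pulling in the rest of the QOS_SCHED
machinery. A minimal sketch (the helper name is hypothetical; the SMT QoS
patch later in this series open-codes the same comparison via
task_group(p)->qos_level):

static inline bool qos_task_is_offline(struct task_struct *p)
{
	/*
	 * Hypothetical helper: < QOS_LEVEL_ONLINE covers QOS_LEVEL_OFFLINE
	 * and, with QOS_SCHED_MULTILEVEL, QOS_LEVEL_OFFLINE_EX as well.
	 */
	return task_group(p)->qos_level < QOS_LEVEL_ONLINE;
}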
hulk inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8929

----------------------------------------

Add the qos_sched_enabled() helper to decouple the QOS_SCHED_SMT_EXPELLER
and QOS_SCHED scheduling logic, providing a hook for future extensions.

No functional changes.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 kernel/sched/fair.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5af2793adae8..250ef9a069c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9321,6 +9321,13 @@ static int wake_soft_domain(struct task_struct *p, int target)
 }
 #endif

+#ifdef CONFIG_QOS_SCHED
+static __always_inline bool qos_sched_enabled(void)
+{
+	return true;
+}
+#endif
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -9589,7 +9596,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;

 #ifdef CONFIG_QOS_SCHED
-	if (unlikely(is_offline_task(curr) && !is_offline_task(p)))
+	if (qos_sched_enabled() && unlikely(is_offline_task(curr) && !is_offline_task(p)))
 		goto preempt;
 #endif

@@ -9829,6 +9836,9 @@ static int unthrottle_qos_cfs_rqs(int cpu)

 static bool check_qos_cfs_rq(struct cfs_rq *cfs_rq)
 {
+	if (!qos_sched_enabled())
+		return false;
+
 	if (unlikely(__this_cpu_read(qos_cpu_overload)))
 		return false;

@@ -9933,6 +9943,9 @@ static void start_qos_hrtimer(int cpu)
 	ktime_t time;
 	struct hrtimer *hrtimer = &(per_cpu(qos_overload_timer, cpu));

+	if (!qos_sched_enabled())
+		return;
+
 	time = ktime_add_ms(hrtimer->base->get_time(), (u64)sysctl_overload_detect_period);
 	hrtimer_set_expires(hrtimer, time);
 	hrtimer_start_expires(hrtimer, HRTIMER_MODE_ABS_PINNED);
@@ -9942,6 +9955,9 @@ void init_qos_hrtimer(int cpu)
 {
 	struct hrtimer *hrtimer = &(per_cpu(qos_overload_timer, cpu));

+	if (!qos_sched_enabled())
+		return;
+
 	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	hrtimer->function = qos_overload_timer_handler;
 }
@@ -9953,6 +9969,9 @@ void init_qos_hrtimer(int cpu)
  */
 static void qos_schedule_throttle(struct task_struct *p)
 {
+	if (!qos_sched_enabled())
+		return;
+
 	if (unlikely(current->flags & PF_KTHREAD))
 		return;

@@ -10009,7 +10028,7 @@ static bool qos_sched_idle_cpu(int this_cpu)

 static bool qos_smt_expelled(int this_cpu)
 {
-	if (!static_branch_likely(&qos_smt_expell_switch))
+	if (!static_branch_likely(&qos_smt_expell_switch) || !qos_sched_enabled())
 		return false;

 	/*
@@ -10068,7 +10087,7 @@ static void qos_smt_send_ipi(int this_cpu)

 static void qos_smt_expel(int this_cpu, struct task_struct *p)
 {
-	if (!static_branch_likely(&qos_smt_expell_switch))
+	if (!static_branch_likely(&qos_smt_expell_switch) || !qos_sched_enabled())
 		return;

 	if (qos_smt_update_status(p))
@@ -10077,7 +10096,7 @@ static void qos_smt_expel(int this_cpu, struct task_struct *p)

 static inline bool qos_smt_enabled(void)
 {
-	if (!static_branch_likely(&qos_smt_expell_switch))
+	if (!static_branch_likely(&qos_smt_expell_switch) || !qos_sched_enabled())
 		return false;

 	if (!sched_smt_active())
@@ -10200,7 +10219,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	if (!prev || prev->sched_class != &fair_sched_class) {
 #ifdef CONFIG_QOS_SCHED
-		if (cfs_rq->idle_h_nr_running != 0 && rq->online)
+		if (qos_sched_enabled() && cfs_rq->idle_h_nr_running != 0 && rq->online)
 			goto qos_simple;
 		else
 #endif
--
2.34.1
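For context, patch 3 in this series gives this hook a real condition, so
after the full series is applied the helper effectively reads:

#ifdef CONFIG_QOS_SCHED
static __always_inline bool qos_sched_enabled(void)
{
#ifdef CONFIG_SMT_QOS
	/*
	 * QOS_SCHED throttling and SMT expelling yield to SMT QoS
	 * whenever the SMT_TAG_PULL scheduler feature is enabled.
	 */
	if (sched_feat(SMT_TAG_PULL))
		return false;
#endif
	return true;
}
#endif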
hulk inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8929

----------------------------------------

Reuse QOS_LEVEL to distinguish between online and offline tasks, and
reuse QOS_SCHED_DYNAMIC_AFFINITY to select the master SMT CPU for online
tasks.

Sample the CPU utilization of all slave SMT cores within a NUMA node when
collecting load balancing statistics, then select the CPU for offline
tasks and distribute them to SMT sibling cores based on the target SMT
sibling CPU utilization watermark:

+--------+                 +--------+
|        | online/offline  |        |
|  CPU0  |<--------------->|  CPU2  |
|        |        |        |        |
+--------+        |        +--------+
    |             |            |
    | offline     | offline    | offline
    \/            |            \/
+--------+        |        +---------+
|        |        \/       |         |
|  CPU1  |<--------------->|  CPU3   |
|        |     offline     |         |
+--------+                 +---------+

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 arch/arm64/Kconfig.turbo |  17 ++++
 kernel/sched/fair.c      | 212 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h  |   4 +
 3 files changed, 233 insertions(+)

diff --git a/arch/arm64/Kconfig.turbo b/arch/arm64/Kconfig.turbo
index 778ea1025c2c..aa0af04cb2ab 100644
--- a/arch/arm64/Kconfig.turbo
+++ b/arch/arm64/Kconfig.turbo
@@ -84,4 +84,21 @@ config DYNAMIC_XCALL
 	  and a kernel module which provides customized implementation.

+config SMT_QOS
+	bool "Support userspace timer/wfi to reduce intra-core contention"
+	depends on SCHED_SMT
+	depends on FAST_IRQ
+	depends on QOS_SCHED_DYNAMIC_AFFINITY
+	depends on CFS_BANDWIDTH && CGROUP_SCHED
+	select QOS_LEVEL
+	default y
+	help
+	  Cloud Service Providers deploy Best-Effort and Latency Sensitive
+	  tasks on the same physical core to maximize resource utilization.
+	  We observe that the LS task needs more cycles to complete the same
+	  workload due to uarch resource contention. This feature controls
+	  the instruction throughput of the BE task into the pipeline, so
+	  that the LS task running on the other SMT sibling can occupy more
+	  uarch resources and reach a better IPC.
+
 endmenu # "Turbo features selection"
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 250ef9a069c2..3e9f0b8070b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9321,9 +9321,179 @@ static int wake_soft_domain(struct task_struct *p, int target)
 }
 #endif

+#ifdef CONFIG_SMT_QOS
+static DEFINE_PER_CPU_ALIGNED(cpumask_t, smt_prefer_cpus);
+static unsigned long numa_smt_util[MAX_NUMNODES];
+/*
+ * Target SMT sibling CPU utilization watermark.
+ * Default range: 0-100.
+ */
+static unsigned int sched_smt_offline_util_pct = 50;
+static cpumask_t master_smt_cpumask;
+static cpumask_t slave_smt_cpumask;
+
+static struct ctl_table smt_util_pct_sysctl_table[] = {
+	{
+		.procname	= "sched_smt_offline_util_pct",
+		.data		= &sched_smt_offline_util_pct,
+		.maxlen		= sizeof(sched_smt_offline_util_pct),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_HUNDRED,
+	},
+	{}
+};
+
+static int __init sched_init_smt_qos(void)
+{
+	int cpu;
+
+	if (!sched_smt_active())
+		return 0;
+
+	register_sysctl_init("kernel", smt_util_pct_sysctl_table);
+
+	cpumask_copy(&master_smt_cpumask, cpu_possible_mask);
+	for_each_possible_cpu(cpu) {
+		if (cpu != cpumask_first(cpu_smt_mask(cpu)))
+			cpumask_clear_cpu(cpu, &master_smt_cpumask);
+	}
+
+	cpumask_andnot(&slave_smt_cpumask, cpu_possible_mask, &master_smt_cpumask);
+	pr_info("Master SMT mask: %*pbl\n", cpumask_pr_args(&master_smt_cpumask));
+	pr_info("Slave SMT mask: %*pbl\n", cpumask_pr_args(&slave_smt_cpumask));
+
+	return 0;
+}
+late_initcall(sched_init_smt_qos);
+
+static __always_inline bool smt_qos_enabled(void)
+{
+	return sched_smt_active() && sched_feat(SMT_TAG_PULL);
+}
+
+static inline void smt_qos_set_task_select_cpus(struct task_struct *p,
+						const cpumask_t **backup_select_cpus,
+						int *idlest_cpu, int prev_cpu)
+{
+	cpumask_t *prefer_cpus = this_cpu_ptr(&smt_prefer_cpus);
+	cpumask_t *prefer_cpumask = &master_smt_cpumask;
+
+	if (!smt_qos_enabled())
+		return;
+
+	if (task_group(p)->qos_level < QOS_LEVEL_ONLINE) {
+		unsigned long smt_util = numa_smt_util[cpu_to_node(prev_cpu)];
+
+		if (smt_util < sched_smt_offline_util_pct)
+			prefer_cpumask = &slave_smt_cpumask;
+	}
+
+	if (*idlest_cpu != -1 && !cpumask_test_cpu(*idlest_cpu, prefer_cpumask))
+		*idlest_cpu = -1;
+
+	cpumask_copy(prefer_cpus, task_prefer_cpus(p));
+	if (cpumask_empty(prefer_cpus))
+		cpumask_and(prefer_cpus, p->cpus_ptr, prefer_cpumask);
+	else
+		cpumask_and(prefer_cpus, prefer_cpus, prefer_cpumask);
+
+	*backup_select_cpus = p->select_cpus;
+	p->select_cpus = prefer_cpus;
+}
+
+static inline void smt_qos_restore_task_select_cpus(struct task_struct *p,
+						    const cpumask_t *backup_select_cpus)
+{
+	if (!smt_qos_enabled())
+		return;
+
+	p->select_cpus = backup_select_cpus;
+}
+
+static inline void smt_qos_update_qos_level(int cpu, struct task_struct *p)
+{
+	int new_status;
+
+	if (!smt_qos_enabled())
+		return;
+
+	new_status = p ? task_group(p)->qos_level : QOS_LEVEL_OFFLINE;
+
+	if (likely(new_status == __this_cpu_read(qos_smt_status)))
+		return;
+
+	__this_cpu_write(qos_smt_status, new_status);
+}
+
+static inline bool is_slave_to_master(int src_cpu, int dst_cpu)
+{
+	return !cpumask_test_cpu(src_cpu, &master_smt_cpumask) &&
+	       cpumask_test_cpu(dst_cpu, &master_smt_cpumask);
+}
+
+static inline bool smt_qos_should_not_busiest(int src_cpu, int dst_cpu)
+{
+	if (!smt_qos_enabled())
+		return 0;
+
+	/*
+	 * Migration of tasks from SMT siblings to
+	 * the primary SMT CPU is restricted.
+	 */
+	return is_slave_to_master(src_cpu, dst_cpu);
+}
+
+static inline bool smt_qos_can_migrate_task(struct task_struct *p, int src_cpu,
+					    int dst_cpu)
+{
+	if (!smt_qos_enabled())
+		return 1;
+
+	/*
+	 * Only offline tasks are allowed to be migrated from
+	 * primary SMT CPUs to SMT siblings.
+	 */
+	if (cpumask_test_cpu(src_cpu, &master_smt_cpumask) &&
+	    !cpumask_test_cpu(dst_cpu, &master_smt_cpumask)) {
+		unsigned long smt_util;

+		if (task_group(p)->qos_level >= QOS_LEVEL_ONLINE)
+			return 0;
+
+		smt_util = numa_smt_util[cpu_to_node(dst_cpu)];
+		if (smt_util >= sched_smt_offline_util_pct)
+			return 0;
+	}
+
+	/*
+	 * Migration of tasks from SMT siblings to
+	 * the primary SMT CPU is restricted.
+	 */
+	return !is_slave_to_master(src_cpu, dst_cpu);
+}
+
+static inline void smt_qos_update_sd_ld_stats(struct sched_domain *sd, int dst_cpu,
+					      unsigned long total_smt_capacity,
+					      unsigned long total_smt_util)
+{
+	if (!smt_qos_enabled() || !total_smt_capacity)
+		return;
+
+	if (!(sd->flags & SD_NUMA) && (sd->parent && (sd->parent->flags & SD_NUMA)))
+		numa_smt_util[cpu_to_node(dst_cpu)] = (total_smt_util * 100) / total_smt_capacity;
+}
+#endif /* CONFIG_SMT_QOS */
+
 #ifdef CONFIG_QOS_SCHED
 static __always_inline bool qos_sched_enabled(void)
 {
+#ifdef CONFIG_SMT_QOS
+	if (sched_feat(SMT_TAG_PULL))
+		return false;
+#endif
+
 	return true;
 }
 #endif
@@ -9356,6 +9526,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	struct sched_migrate_ctx ctx;
 	int ret;
 #endif
+#ifdef CONFIG_SMT_QOS
+	const cpumask_t *backup_select_cpus;
+#endif

 	time = schedstat_start_time();

@@ -9367,6 +9540,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY
 	set_task_select_cpus(p, &idlest_cpu, sd_flag);
 #endif
+#ifdef CONFIG_SMT_QOS
+	smt_qos_set_task_select_cpus(p, &backup_select_cpus, &idlest_cpu, prev_cpu);
+#endif

 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
@@ -9461,6 +9637,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		schedstat_inc(p->stats.nr_wakeups_force_preferred_cpus);
 	}
 #endif
+
+#ifdef CONFIG_SMT_QOS
+	smt_qos_restore_task_select_cpus(p, backup_select_cpus);
+#endif
 	return new_cpu;
 }

@@ -10377,6 +10557,9 @@ done: __maybe_unused;
 	qos_smt_expel(this_cpu, p);
 #endif

+#ifdef CONFIG_SMT_QOS
+	smt_qos_update_qos_level(rq->cpu, p);
+#endif
 	return p;

 idle:
@@ -10436,6 +10619,10 @@ done: __maybe_unused;
 	qos_smt_expel(this_cpu, NULL);
 #endif

+#ifdef CONFIG_SMT_QOS
+	smt_qos_update_qos_level(rq->cpu, NULL);
+#endif
+
 	return NULL;
 }

@@ -10862,6 +11049,11 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	}
 #endif

+#ifdef CONFIG_SMT_QOS
+	if (!smt_qos_can_migrate_task(p, env->src_cpu, env->dst_cpu))
+		return 0;
+#endif
+
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
@@ -11494,6 +11686,10 @@ struct sd_lb_stats {
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
+#ifdef CONFIG_SMT_QOS
+	unsigned long total_smt_util;	/* Total utilization of all groups in sd */
+	unsigned long total_smt_capacity; /* Total capacity of all groups in sd */
+#endif
 };

 static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
@@ -11924,6 +12120,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_util += cpu_util_cfs(i);
 		sgs->group_runnable += cpu_runnable(rq);
 		sgs->sum_h_nr_running += rq->cfs.h_nr_running;
+#ifdef CONFIG_SMT_QOS
+		if (sched_smt_active() && !cpumask_test_cpu(i, &master_smt_cpumask)) {
+			sds->total_smt_util += cpu_util_cfs(i);
+			sds->total_smt_capacity += capacity_orig_of(i);
+		}
+#endif

 		nr_running = rq->nr_running;
 		sgs->sum_nr_running += nr_running;
@@ -12658,6 +12860,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}

 	update_idle_cpu_scan(env, sum_util);
+
+#ifdef CONFIG_SMT_QOS
+	smt_qos_update_sd_ld_stats(env->sd, env->dst_cpu, sds->total_smt_capacity,
+				   sds->total_smt_util);
+#endif
 }

 /**
@@ -13052,6 +13259,11 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (!nr_running)
 			continue;

+#ifdef CONFIG_SMT_QOS
+		if (smt_qos_should_not_busiest(i, env->dst_cpu))
+			continue;
+#endif
+
 		capacity = capacity_of(i);

 		/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c9ad8e72ecd0..446d136654d9 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -130,3 +130,7 @@ SCHED_FEAT(SOFT_QUOTA, false)
 #endif

 SCHED_FEAT(WA_SMT, false)
+
+#ifdef CONFIG_SMT_QOS
+SCHED_FEAT(SMT_TAG_PULL, false)
+#endif
--
2.34.1
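To make the watermark concrete: smt_qos_update_sd_ld_stats() caches, per
NUMA node, the slave-SMT utilization as a percentage of slave-SMT
capacity, and both wakeup placement and load balance compare it against
sched_smt_offline_util_pct (a sysctl, default 50; the whole path is also
gated on the SMT_TAG_PULL scheduler feature, default false). A worked
sketch with illustrative numbers, assuming capacity_orig_of() is 1024 per
CPU:

	/* Illustrative only: 4 slave SMT CPUs, capacity 1024 each. */
	unsigned long total_smt_capacity = 4 * 1024;	/* 4096 */
	unsigned long total_smt_util = 1638;		/* summed cpu_util_cfs() */
	unsigned long pct = total_smt_util * 100 / total_smt_capacity;	/* 39 */

	/*
	 * 39 < sched_smt_offline_util_pct (50), so offline wakeups on this
	 * node keep preferring the slave mask and master-to-slave offline
	 * migration stays allowed; at or above the watermark, both stop.
	 */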
hulk inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8929

----------------------------------------

SMT QoS leverages lightweight dedicated IPIs to expedite WFI sleep for
offline tasks, ensuring they are promptly preempted upon online task
arrival.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 arch/arm64/include/asm/cpufeature.h |  5 ++
 arch/arm64/include/asm/xint.h       | 15 ++++++
 arch/arm64/kernel/Makefile          |  1 +
 arch/arm64/kernel/entry-common.c    | 16 ++++++
 arch/arm64/kernel/entry.S           | 14 +++--
 arch/arm64/kernel/smp.c             | 23 ++++++++
 arch/arm64/kernel/smt_qos.c         | 84 +++++++++++++++++++++++++++++
 arch/arm64/kernel/xcall/entry.S     | 78 +++++++++++++++++++++++++++
 drivers/irqchip/irq-gic-v3.c        | 43 +++++++++++++++
 kernel/sched/fair.c                 | 44 +++++++++++++++
 10 files changed, 320 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/include/asm/xint.h
 create mode 100644 arch/arm64/kernel/smt_qos.c

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 6f73a51d2422..fed81fd3baf2 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -851,6 +851,11 @@ static __always_inline bool system_uses_xcall_xint(void)
 	       cpus_have_const_cap(ARM64_HAS_HW_XCALL_XINT);
 }

+static __always_inline bool system_uses_xint(void)
+{
+	return IS_ENABLED(CONFIG_FAST_IRQ) && cpus_have_const_cap(ARM64_HAS_XINT);
+}
+
 static __always_inline bool system_uses_irq_prio_masking(void)
 {
 	return IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) &&
diff --git a/arch/arm64/include/asm/xint.h b/arch/arm64/include/asm/xint.h
new file mode 100644
index 000000000000..00ab27b327fa
--- /dev/null
+++ b/arch/arm64/include/asm/xint.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_XINT_H
+#define __ASM_XINT_H
+
+#define NR_IPI_USER	7	// SGI
+
+#ifndef __ASSEMBLY__
+#include <linux/topology.h>
+
+extern void gic_handle_irq_noack(struct pt_regs *regs);
+extern void gic_handle_nmi_noack(struct pt_regs *regs);
+extern void arch_smp_send_ipi_user(int cpu);
+extern bool should_restrict(void);
+#endif /* __ASSEMBLY__ */
+#endif /* __ASM_XINT_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 300bfcb8a890..b61c72715daa 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -80,6 +80,7 @@ obj-y					+= vdso-wrap.o
 obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
 obj-$(CONFIG_ARM64_ILP32)		+= vdso-ilp32/
 obj-$(CONFIG_FAST_SYSCALL)		+= xcall/
+obj-$(CONFIG_SMT_QOS)			+= smt_qos.o
 obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)	+= patch-scs.o
 obj-$(CONFIG_IPI_AS_NMI)		+= ipi_nmi.o
 obj-$(CONFIG_HISI_VIRTCCA_GUEST)	+= virtcca_cvm_guest.o virtcca_cvm_tsi.o
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index c72993bb4563..55f416ea0303 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -26,6 +26,7 @@
 #include <asm/stacktrace.h>
 #include <asm/sysreg.h>
 #include <asm/system_misc.h>
+#include <asm/xint.h>

 /*
  * Handle IRQ/context state management when entering from kernel mode.
@@ -945,6 +946,21 @@ static void noinstr __el0_irq_handler_common(struct pt_regs *regs)
 	el0_interrupt(regs, ISR_EL1_IS, handle_arch_irq, handle_arch_nmi_irq);
 }

+#ifdef CONFIG_FAST_IRQ
+DECLARE_PER_CPU(u32, cpu_iar);
+/*
+ * The generic exception handler for SPIs and LPIs taken from EL0 on
+ * early CPUs before 920G. Most of the code comes from el0_interrupt(),
+ * except that it passes irqnr to the GIC driver via a per-cpu variable
+ * because the IRQ Ack is completed in entry code.
+ */
+asmlinkage void noinstr el0t_64_acked_irq_handler(struct pt_regs *regs, u32 irqnr)
+{
+	this_cpu_write(cpu_iar, irqnr);
+	el0_interrupt(regs, ISR_EL1_IS, gic_handle_irq_noack, gic_handle_nmi_noack);
+}
+#endif /* CONFIG_FAST_IRQ */
+
 asmlinkage void noinstr el0t_64_irq_handler(struct pt_regs *regs)
 {
 	__el0_irq_handler_common(regs);
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 039ec8d40899..d0518a31b689 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -554,7 +554,7 @@ tsk	.req	x28		// current thread_info

 	.text

-#ifdef CONFIG_FAST_SYSCALL
+#if defined(CONFIG_FAST_SYSCALL) || defined(CONFIG_FAST_IRQ)
 #include "xcall/entry.S"
 #endif

@@ -579,8 +579,12 @@ SYM_CODE_START(vectors)
 	sync_ventry				// Synchronous 64-bit EL0
 #else
 	kernel_ventry	0, t, 64, sync		// Synchronous 64-bit EL0
-#endif
+#endif /* CONFIG_FAST_SYSCALL */
+#ifdef CONFIG_FAST_IRQ
+	irq_ventry				// XINT 64-bit EL0
+#else
 	kernel_ventry	0, t, 64, irq		// IRQ 64-bit EL0
+#endif
 	kernel_ventry	0, t, 64, fiq		// FIQ 64-bit EL0
 	kernel_ventry	0, t, 64, error		// Error 64-bit EL0

@@ -607,8 +611,12 @@ SYM_CODE_START(vectors_xcall_xint)
 	sync_ventry				// Synchronous 64-bit EL0
 #else
 	kernel_ventry	0, t, 64, sync		// Synchronous 64-bit EL0
-#endif
+#endif /* CONFIG_FAST_SYSCALL */
+#ifdef CONFIG_FAST_IRQ
+	irq_ventry				// XINT 64-bit EL0
+#else
 	kernel_ventry	0, t, 64, irq		// IRQ 64-bit EL0
+#endif
 	kernel_ventry	0, t, 64, fiq		// FIQ 64-bit EL0
 	kernel_ventry	0, t, 64, error		// Error 64-bit EL0
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index dfdc7b2b3c3f..e0f450aea847 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -55,6 +55,7 @@
 #include <asm/tlbflush.h>
 #include <asm/ptrace.h>
 #include <asm/virt.h>
+#include <asm/xint.h>

 #include <trace/events/ipi.h>

@@ -78,6 +79,9 @@ enum ipi_msg_type {
 	IPI_TIMER,
 	IPI_IRQ_WORK,
 	IPI_WAKEUP,
+#ifdef CONFIG_SMT_QOS
+	IPI_USER,
+#endif
 	NR_IPI
 };

@@ -806,6 +810,9 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
 	[IPI_TIMER]	= "Timer broadcast interrupts",
 	[IPI_IRQ_WORK]	= "IRQ work interrupts",
 	[IPI_WAKEUP]	= "CPU wake-up interrupts",
+#ifdef CONFIG_SMT_QOS
+	[IPI_USER]	= "Userspace IPI",
+#endif
 };

 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr);
@@ -961,6 +968,11 @@ static void do_handle_IPI(int ipinr)
 			  cpu);
 		break;
 #endif
+#ifdef CONFIG_SMT_QOS
+	case IPI_USER:
+		/* Do nothing */
+		break;
+#endif

 	default:
 		pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr);
@@ -1062,6 +1074,17 @@ void tick_broadcast(const struct cpumask *mask)
 }
 #endif

+#ifdef CONFIG_SMT_QOS
+void arch_smp_send_ipi_user(int cpu)
+{
+	struct irq_desc *desc = ipi_desc[IPI_USER];
+	struct irq_data *data = irq_desc_get_irq_data(desc);
+	struct irq_chip *chip = irq_data_get_irq_chip(data);
+
+	chip->ipi_send_mask(data, cpumask_of(cpu));
+}
+#endif
+
 /*
  * The number of CPUs online, not counting this CPU (which may not be
  * fully online and so not counted in num_online_cpus()).
diff --git a/arch/arm64/kernel/smt_qos.c b/arch/arm64/kernel/smt_qos.c
new file mode 100644
index 000000000000..4e97d1fea7b7
--- /dev/null
+++ b/arch/arm64/kernel/smt_qos.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) "smt_qos: " fmt
+
+#include <linux/module.h>
+
+#include <asm/arch_gicv3.h>
+#include <asm/cpuidle.h>
+#include <asm/daifflags.h>
+#include <asm/timex.h>
+#include <asm/xint.h>
+
+#include <vdso/time64.h>
+
+static unsigned int sysctl_sched_wfi_timeout = 50;
+static DEFINE_STATIC_KEY_TRUE(split_mode);
+
+static void irq_complete(u32 irqnr)
+{
+	if (static_branch_likely(&split_mode))
+		write_gicreg(irqnr, ICC_EOIR1_EL1);
+	isb();
+}
+
+static void irq_deactive(u32 irqnr)
+{
+	if (static_branch_likely(&split_mode)) {
+		gic_write_dir(irqnr);
+	} else {
+		write_gicreg(irqnr, ICC_EOIR1_EL1);
+		isb();
+	}
+}
+
+static __always_inline void throttle_offline(void)
+{
+	cycles_t start, end;
+	u64 delta_us = 0;
+
+	local_daif_restore(DAIF_PROCCTX);
+
+	start = get_cycles();
+	while (delta_us < sysctl_sched_wfi_timeout && should_restrict()) {
+		cpu_do_idle();
+		end = get_cycles();
+		delta_us = (end - start) * USEC_PER_SEC / arch_timer_get_cntfrq();
+	}
+
+	local_daif_mask();
+}
+
+asmlinkage void el0_xint_ipi_handler(struct pt_regs *regs)
+{
+	irq_complete(NR_IPI_USER);
+	irq_deactive(NR_IPI_USER);
+	throttle_offline();
+}
+
+static struct ctl_table sched_wfi_timeout_sysctl_table[] = {
+	{
+		.procname	= "sched_wfi_timeout_us",
+		.data		= &sysctl_sched_wfi_timeout,
+		.maxlen		= sizeof(sysctl_sched_wfi_timeout),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE_THOUSAND,
+	},
+	{}
+};
+
+static int __init xint_init(void)
+{
+	if (!system_uses_xint())
+		return 0;
+
+	register_sysctl_init("kernel", sched_wfi_timeout_sysctl_table);
+
+	if (!is_hyp_mode_available())
+		static_branch_disable(&split_mode);
+
+	pr_info("GIC split mode enabled: %d\n", static_key_enabled(&split_mode));
+	return 0;
+}
+module_init(xint_init);
diff --git a/arch/arm64/kernel/xcall/entry.S b/arch/arm64/kernel/xcall/entry.S
index d5ed68db1547..eb5994352bd1 100644
--- a/arch/arm64/kernel/xcall/entry.S
+++ b/arch/arm64/kernel/xcall/entry.S
@@ -291,3 +291,81 @@ alternative_else_nop_endif
 	br	x20
 .org .Lventry_start\@ + 128	// Did we overflow the ventry slot?
 	.endm
+
+#ifdef CONFIG_FAST_IRQ
+#include <asm/xint.h>
+
+SYM_CODE_START_LOCAL(el0t_64_irq_entry)
+	ldp	x20, x21, [sp, #16 * 10]
+	kernel_entry 0, 64
+	mov	x0, sp
+	ldr	x1, [sp, #(S_SYSCALLNO - 8)]
+	bl	el0t_64_acked_irq_handler
+	b	ret_to_user
+SYM_CODE_END(el0t_64_irq_entry)
+
+#ifdef CONFIG_SMT_QOS
+SYM_CODE_START_LOCAL(el0_xint_ipi)
+	ldp	x20, x21, [sp, #16 * 10]
+	hw_xcall_save_base_regs
+	mov	x0, sp
+	bl	el0_xint_ipi_handler
+	hw_xcall_restore_base_regs
+SYM_CODE_END(el0_xint_ipi)
+#endif
+
+SYM_CODE_START_LOCAL(el0t_64_irq_table)
+	/* Add more of SGIs or PPIs handled in el0 here */
+	.rept	NR_IPI_USER
+	.word	el0t_64_irq_table - el0t_64_irq_entry
+	.endr
+#ifdef CONFIG_SMT_QOS
+	.word	el0t_64_irq_table - el0_xint_ipi
+#else
+	.word	el0t_64_irq_table - el0t_64_irq_entry
+#endif
+	.rept	31 - NR_IPI_USER
+	.word	el0t_64_irq_table - el0t_64_irq_entry
+	.endr
+SYM_CODE_END(el0t_64_irq_table)
+
+	.macro irq_ventry
+	.align 7
+.Lventry_start\@:
+	/*
+	 * This must be the first instruction of the EL0 vector entries. It is
+	 * skipped by the trampoline vectors, to trigger the cleanup.
+	 */
+	b	.Lskip_tramp_vectors_cleanup\@
+	mrs	x30, tpidrro_el0
+	msr	tpidrro_el0, xzr
+.Lskip_tramp_vectors_cleanup\@:
+	sub	sp, sp, #PT_REGS_SIZE
+alternative_if_not ARM64_HAS_XINT
+	b	el0t_64_irq
+alternative_else_nop_endif
+	stp	x20, x21, [sp, #16 * 10]
+alternative_if ARM64_USES_NMI
+	mrs	x21, isr_el1
+	tbz	x21, ISR_EL1_IS_SHIFT, 0f
+	mrs_s	x21, SYS_ICC_NMIAR1_EL1
+	dsb	sy
+	b	1f
+alternative_else_nop_endif
+0:
+	mrs	x21, icc_iar1_el1
+	dsb	sy
+1:
+	/* Save irqnr for use later */
+	str	x21, [sp, #(S_SYSCALLNO - 8)]
+	/* All SPI and LPI back to kernel native entry */
+	cmp	x21, 32
+	b.ge	el0t_64_irq_entry
+	/* Using jump table for different SGIs and PPIs */
+	adr	x20, el0t_64_irq_table
+	ldr	w21, [x20, x21, lsl #2]
+	sub	x20, x20, x21
+	br	x20
+.org .Lventry_start\@ + 128	// Did we overflow the ventry slot?
+	.endm
+#endif /* CONFIG_FAST_IRQ */
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 849d2e0db4fd..83f836609c13 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1003,6 +1003,49 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs
 }

 #ifdef CONFIG_FAST_IRQ
+#include <asm/xint.h>
+
+/*
+ * Since the IRQ is taken from EL0 and the IRQ Ack is completed in entry
+ * code, there is no need to read IAR here. Most of the code comes from
+ * __gic_handle_irq_from_irqson().
+ */
+DEFINE_PER_CPU(u32, cpu_iar);
+asmlinkage void __exception_irq_entry gic_handle_irq_noack(struct pt_regs *regs)
+{
+	bool is_nmi;
+	u32 irqnr;
+
+	irqnr = this_cpu_read(cpu_iar);
+
+	is_nmi = gic_rpr_is_nmi_prio();
+
+	if (is_nmi) {
+		nmi_enter();
+		__gic_handle_nmi(irqnr, regs);
+		nmi_exit();
+	}
+
+	if (gic_prio_masking_enabled()) {
+		gic_pmr_mask_irqs();
+		gic_arch_enable_irqs();
+	} else if (has_v3_3_nmi()) {
+#ifdef CONFIG_ARM64_NMI
+		_allint_clear();
+#endif
+	}
+
+	if (!is_nmi)
+		__gic_handle_irq(irqnr, regs);
+}
+
+asmlinkage void __exception_irq_entry gic_handle_nmi_noack(struct pt_regs *regs)
+{
+	u32 irqnr = this_cpu_read(cpu_iar);
+
+	__gic_handle_nmi(irqnr, regs);
+}
+
 DECLARE_BITMAP(irqnr_xint_map, 1024);

 static bool can_set_xint(unsigned int hwirq)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9f0b8070b8..d79892871835 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9322,6 +9322,8 @@ static int wake_soft_domain(struct task_struct *p, int target)
 #endif

 #ifdef CONFIG_SMT_QOS
+#include <asm/xint.h>
+
 static DEFINE_PER_CPU_ALIGNED(cpumask_t, smt_prefer_cpus);
 static unsigned long numa_smt_util[MAX_NUMNODES];
 /*
@@ -9412,6 +9414,45 @@ static inline void smt_qos_restore_task_select_cpus(struct task_struct *p,
 	p->select_cpus = backup_select_cpus;
 }

+bool should_restrict(void)
+{
+	int this_cpu = smp_processor_id();
+	int cpu;
+
+	if (idle_cpu(this_cpu))
+		return false;
+
+	for_each_cpu(cpu, cpu_smt_mask(this_cpu)) {
+		if (cpu == this_cpu)
+			continue;
+
+		/* SMT master CPU is idle, need not throttle */
+		if (idle_cpu(cpu))
+			return false;
+
+		/* SMT master CPU has finished online task */
+		if (per_cpu(qos_smt_status, cpu) < QOS_LEVEL_ONLINE)
+			return false;
+	}
+
+	return true;
+}
+
+static void send_ipi_throttle_smt(int this_cpu)
+{
+	int cpu;
+
+	if (!system_uses_xint())
+		return;
+
+	for_each_cpu(cpu, cpu_smt_mask(this_cpu)) {
+		if (cpu == this_cpu)
+			continue;
+
+		arch_smp_send_ipi_user(cpu);
+	}
+}
+
 static inline void smt_qos_update_qos_level(int cpu, struct task_struct *p)
 {
 	int new_status;
@@ -9425,6 +9466,9 @@ static inline void smt_qos_update_qos_level(int cpu, struct task_struct *p)
 		return;

 	__this_cpu_write(qos_smt_status, new_status);
+
+	if (cpumask_test_cpu(cpu, &master_smt_cpumask))
+		send_ipi_throttle_smt(cpu);
 }

 static inline bool is_slave_to_master(int src_cpu, int dst_cpu)
--
2.34.1
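The runtime flow, as wired up above: when a master SMT CPU picks an
online task, smt_qos_update_qos_level() notices the status change and
send_ipi_throttle_smt() raises the dedicated IPI on the siblings; the
sibling takes it through the XINT EL0 vector into el0_xint_ipi_handler(),
and throttle_offline() then parks the CPU in WFI until
sched_wfi_timeout_us expires or should_restrict() reports the pressure is
gone. A sanity check on the cycle-to-microsecond conversion in that loop,
assuming a hypothetical 100 MHz generic timer:

	/* Illustrative only: conversion used by the WFI throttle loop. */
	u64 cntfrq = 100000000;	/* arch_timer_get_cntfrq(), assumed */
	u64 cycles = 5000;	/* get_cycles() delta since loop entry */
	u64 delta_us = cycles * USEC_PER_SEC / cntfrq;	/* = 50 us */

	/*
	 * 50 us equals the sysctl_sched_wfi_timeout default, so with this
	 * counter the loop would give up after roughly 5000 cycles of WFI
	 * even if should_restrict() stays true.
	 */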
hulk inclusion
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8929

----------------------------------------

Enable SMT_QOS in openeuler_defconfig by default.

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 arch/arm64/configs/openeuler_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index e6d3b9b6788b..0d71c6c54d1d 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -402,6 +402,7 @@ CONFIG_DEBUG_FEATURE_BYPASS=y
 CONFIG_SECURITY_FEATURE_BYPASS=y
 CONFIG_ACTLR_XCALL_XINT=y
 CONFIG_DYNAMIC_XCALL=y
+CONFIG_SMT_QOS=y
 # end of Turbo features selection

 #
--
2.34.1
FeedBack:
The patch(es) which you have sent to kernel@openeuler.org mailing list has
been converted to a pull request successfully!
Pull request link:
https://atomgit.com/openeuler/kernel/merge_requests/22236
Mailing list address:
https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/4AH...