From 7aff1567992a546edf0e702989981f574be71127 Mon Sep 17 00:00:00 2001
From: Yipeng Zou
Date: Thu, 11 May 2023 21:34:44 +0000
Subject: [PATCH] cpufreq: introduce cpufreq_zone
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Yipeng Zou

sched/fair: introduce EAS+ wakeup task selection

Signed-off-by: Chen Jiahao

sched: Supports separate load balance between cold and hot partitions

Support separate load balance; the active zone within the warm zone
scales in or out depending on the CPU load.

Signed-off-by: Ruan Jinjie
Signed-off-by: Yipeng Zou

sched: Introduce smart grid scheduling strategy for cfs

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
CVE: NA

----------------------------------------

We want to dynamically expand or shrink the affinity range of tasks
based on the CPU topology level while meeting the minimum resource
requirements of the tasks.

We divide several levels of affinity domains according to sched domains:

    level4 * SOCKET  [                                ]
    level3 * DIE     [                        ]
    level2 * MC      [           ] [          ]
    level1 * SMT     [    ] [    ] [    ] [    ]
    level0 * CPU      0  1   2  3   4  5   6  7

Whether users tend to choose power saving or performance affects the
strategy for adjusting affinity. When the power saving mode is selected,
we choose a more appropriate affinity based on the energy model to reduce
power consumption, while considering the QOS of resources such as CPU and
memory consumption. For instance, if the current task CPU load is less
than required, smart grid judges, according to the energy model, whether
or not to aggregate tasks together into a smaller range.

The main difference from EAS is that we pay more attention to the impact
on power consumption of mechanisms such as cpuidle and DVFS, and we
classify tasks to reduce interference and ensure resource QOS in each
divided unit, which is more suitable for general-purpose workloads on
non-heterogeneous CPUs.

     --------        --------        --------
    | group0 |      | group1 |      | group2 |
     --------        --------        --------
        |                |               |
        v                |               v
     --------------------+------    -----------------
    |  DIE0            --v---   |  |      DIE1       |
    |                 | MC1  |  |  |                 |
    |                  ------   |  |                 |
     ---------------------------    -----------------

We regularly count the resource satisfaction of the groups and adjust
the affinity; scheduling balance and memory migration will be considered
based on memory location to better meet the resource requirements.

Signed-off-by: Hui Tang
Signed-off-by: Wang ShaoBo
Reviewed-by: Chen Hui
Reviewed-by: Zhang Qiao
Signed-off-by: Zhang Changzhong

sched: smart grid: init sched_grid_qos structure on QOS purpose

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
CVE: NA

----------------------------------------

As smart grid scheduling (SGS) may shrink resources and affect task QOS,
we provide methods for evaluating task QOS in the divided grid. We mainly
focus on the following two aspects:

1. Evaluate whether resources (such as CPU or memory) meet our demand
2. Ensure the least impact when working with the cpufreq and cpuidle
   governors

To tackle these questions, we have summarized several sampling methods
that obtain tasks' characteristics while reducing scheduling noise as
much as possible:

1. We detect the key factors that determine how sensitive a process is
   to cpufreq or cpuidle adjustment, and use them to guide the
   cpufreq/cpuidle governor
2. We dynamically monitor process memory bandwidth and adjust memory
   allocation to minimize cross-remote memory access
3.
We provide a variety of load tracking mechanisms to adapt to different types of task's load change --------------------------------- ----------------- | class A | | class B | | -------- -------- | | -------- | | | group0 | | group1 | |---| | group2 | |----------+ | -------- -------- | | -------- | | | CPU/memory sensitive type | | balance type | | ----------------+---------------- --------+-------- | v v | (target cpufreq) ------------------------------------------------------- | (sensitivity) | Not satisfied with QOS? | | --------------------------+---------------------------- | v v ------------------------------------------------------- ---------------- | expand or shrink resource |<--| energy model | ----------------------------+-------------------------- ---------------- v | ----------- ----------- ------------ v | | | | | | --------------- | GRID0 +--------+ GRID1 +--------+ GRID2 |<-- | governor | | | | | | | --------------- ----------- ----------- ------------ \ | / \ ------------------- / | pages migration | ------------------- We will introduce the energy model in the follow-up implementation, and consider the dynamic affinity adjustment between each divided grid in the runtime. Signed-off-by: Wang ShaoBo Reviewed-by: Kefeng Wang Reviewed-by: Xie XiuQi Signed-off-by: Zhang Changzhong sched: Add static key to reduce noise hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7A718 -------------------------------- Add static key to reduce noise when not enable dynamic affinity. There are better performance in some case, such for lmbench. Fixes: 243865da2684 ("cpuset: Introduce new interface for scheduler ...") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: fix smart grid usage count hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7D98G CVE: NA ---------------------------------------- smart_grid_usage_dec() should called when free taskgroup if the mode is auto. Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: fix WARN found by deadlock detect hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0 CVE: NA ---------------------------------------- The WARNING report when run: echo 1 > /sys/fs/cgroup/cpu/cpu.dynamic_affinity_mode [ 147.276757] WARNING: CPU: 5 PID: 1770 at kernel/cpu.c:326 \ lockdep_assert_cpus_held+0xac/0xd0 [ 147.279670] Kernel panic - not syncing: panic_on_warn set ... [ 147.279670] [ 147.282211] CPU: 5 PID: 1770 Comm: bash Kdump: loaded Not tainted 4.19 [ 147.284796] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996).. [ 147.290963] Call Trace: [ 147.292459] dump_stack+0xc6/0x11e [ 147.294295] ? lockdep_assert_cpus_held+0xa0/0xd0 [ 147.296876] panic+0x1d6/0x46b [ 147.298591] ? refcount_error_report+0x2a5/0x2a5 [ 147.301131] ? kmsg_dump_rewind_nolock+0xde/0xde [ 147.303738] ? sched_clock_cpu+0x18/0x1b0 [ 147.305943] ? __warn+0x1d1/0x210 [ 147.307831] ? lockdep_assert_cpus_held+0xac/0xd0 [ 147.310469] __warn+0x1ec/0x210 [ 147.312271] ? lockdep_assert_cpus_held+0xac/0xd0 [ 147.314838] report_bug+0x1ee/0x2b0 [ 147.316798] fixup_bug.part.4+0x37/0x80 [ 147.318946] do_error_trap+0x21c/0x260 [ 147.321062] ? fixup_bug.part.4+0x80/0x80 [ 147.323253] ? check_preemption_disabled+0x34/0x1f0 [ 147.324886] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 147.326277] ? lockdep_hardirqs_off+0x1cb/0x2b0 [ 147.327505] ? error_entry+0x9a/0x130 [ 147.328523] ? trace_hardirqs_off_caller+0x59/0x1a0 [ 147.329844] ? 
trace_hardirqs_off_thunk+0x1a/0x1c [ 147.331124] invalid_op+0x14/0x20 [ 147.332057] ? vprintk_func+0x68/0x1a0 [ 147.333082] ? lockdep_assert_cpus_held+0xac/0xd0 [ 147.334355] ? lockdep_assert_cpus_held+0xac/0xd0 [ 147.335624] ? static_key_slow_inc_cpuslocked+0x5a/0x230 [ 147.337079] ? tg_set_dynamic_affinity_mode+0x4f/0x70 [ 147.338444] ? cgroup_file_write+0x471/0x6a0 [ 147.339604] ? cgroup_css.part.4+0x100/0x100 [ 147.340782] ? cgroup_css.part.4+0x100/0x100 [ 147.341943] ? kernfs_fop_write+0x2af/0x430 [ 147.343083] ? kernfs_vma_page_mkwrite+0x230/0x230 [ 147.344401] ? __vfs_write+0xef/0x680 [ 147.345404] ? kernel_read+0x110/0x110 Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: Fix possible deadlock in tg_set_dynamic_affinity_mode hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGD0 CVE: NA ---------------------------------------- Deadlock occurs in two situations as follows: The first case: tg_set_dynamic_affinity_mode --- raw_spin_lock_irq(&auto_affi->lock); ->start_auto_affintiy --- trigger timer ->tg_update_task_prefer_cpus >css_task_inter_next ->raw_spin_unlock_irq hr_timer_run_queues ->sched_auto_affi_period_timer --- try spin lock (&auto_affi->lock) The second case as follows: [ 291.470810] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 291.472715] rcu: 1-...0: (0 ticks this GP) idle=a6a/1/0x4000000000000002 softirq=78516/78516 fqs=5249 [ 291.475268] rcu: (detected by 6, t=21006 jiffies, g=202169, q=9862) [ 291.477038] Sending NMI from CPU 6 to CPUs 1: [ 291.481268] NMI backtrace for cpu 1 [ 291.481273] CPU: 1 PID: 1923 Comm: sh Kdump: loaded Not tainted 4.19.90+ #150 [ 291.481278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 [ 291.481281] RIP: 0010:queued_spin_lock_slowpath+0x136/0x9a0 [ 291.481289] Code: c0 74 3f 49 89 dd 48 89 dd 48 b8 00 00 00 00 00 fc ff df 49 c1 ed 03 83 e5 07 49 01 c5 83 c5 03 48 83 05 c4 66 b9 05 01 f3 90 <41> 0f b6 45 00 40 38 c5 7c 08 84 c0 0f 85 ad 07 00 00 0 [ 291.481292] RSP: 0018:ffff88801de87cd8 EFLAGS: 00000002 [ 291.481297] RAX: 0000000000000101 RBX: ffff888001be0a28 RCX: ffffffffb8090f7d [ 291.481301] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888001be0a28 [ 291.481304] RBP: 0000000000000003 R08: ffffed100037c146 R09: ffffed100037c146 [ 291.481307] R10: 000000001106b143 R11: ffffed100037c145 R12: 1ffff11003bd0f9c [ 291.481311] R13: ffffed100037c145 R14: fffffbfff7a38dee R15: dffffc0000000000 [ 291.481315] FS: 00007fac4f306740(0000) GS:ffff88801de80000(0000) knlGS:0000000000000000 [ 291.481318] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 291.481321] CR2: 00007fac4f4bb650 CR3: 00000000046b6000 CR4: 00000000000006e0 [ 291.481323] Call Trace: [ 291.481324] [ 291.481326] ? osq_unlock+0x2a0/0x2a0 [ 291.481329] ? check_preemption_disabled+0x4c/0x290 [ 291.481331] ? rcu_accelerate_cbs+0x33/0xed0 [ 291.481333] _raw_spin_lock_irqsave+0x83/0xa0 [ 291.481336] sched_auto_affi_period_timer+0x251/0x820 [ 291.481338] ? __remove_hrtimer+0x151/0x200 [ 291.481340] __hrtimer_run_queues+0x39d/0xa50 [ 291.481343] ? tg_update_affinity_domain_down+0x460/0x460 [ 291.481345] ? enqueue_hrtimer+0x2e0/0x2e0 [ 291.481348] ? 
ktime_get_update_offsets_now+0x1d7/0x2c0 [ 291.481350] hrtimer_run_queues+0x243/0x470 [ 291.481352] run_local_timers+0x5e/0x150 [ 291.481354] update_process_times+0x36/0xb0 [ 291.481357] tick_sched_handle.isra.4+0x7c/0x180 [ 291.481359] tick_nohz_handler+0xd1/0x1d0 [ 291.481365] smp_apic_timer_interrupt+0x12c/0x4e0 [ 291.481368] apic_timer_interrupt+0xf/0x20 [ 291.481370] [ 291.481372] ? smp_call_function_many+0x68c/0x840 [ 291.481375] ? smp_call_function_many+0x6ab/0x840 [ 291.481377] ? arch_unregister_cpu+0x60/0x60 [ 291.481379] ? native_set_fixmap+0x100/0x180 [ 291.481381] ? arch_unregister_cpu+0x60/0x60 [ 291.481384] ? set_task_select_cpus+0x116/0x940 [ 291.481386] ? smp_call_function+0x53/0xc0 [ 291.481388] ? arch_unregister_cpu+0x60/0x60 [ 291.481390] ? on_each_cpu+0x49/0xf0 [ 291.481393] ? set_task_select_cpus+0x115/0x940 [ 291.481395] ? text_poke_bp+0xff/0x180 [ 291.481397] ? poke_int3_handler+0xc0/0xc0 [ 291.481400] ? __set_prefer_cpus_ptr.constprop.4+0x1cd/0x900 [ 291.481402] ? hrtick+0x1b0/0x1b0 [ 291.481404] ? set_task_select_cpus+0x115/0x940 [ 291.481407] ? __jump_label_transform.isra.0+0x3a1/0x470 [ 291.481409] ? kernel_init+0x280/0x280 [ 291.481411] ? kasan_check_read+0x1d/0x30 [ 291.481413] ? mutex_lock+0x96/0x100 [ 291.481415] ? __mutex_lock_slowpath+0x30/0x30 [ 291.481418] ? arch_jump_label_transform+0x52/0x80 [ 291.481420] ? set_task_select_cpus+0x115/0x940 [ 291.481422] ? __jump_label_update+0x1a1/0x1e0 [ 291.481424] ? jump_label_update+0x2ee/0x3b0 [ 291.481427] ? static_key_slow_inc_cpuslocked+0x1c8/0x2d0 [ 291.481430] ? start_auto_affinity+0x190/0x200 [ 291.481432] ? tg_set_dynamic_affinity_mode+0xad/0xf0 [ 291.481435] ? cpu_affinity_mode_write_u64+0x22/0x30 [ 291.481437] ? cgroup_file_write+0x46f/0x660 [ 291.481439] ? cgroup_init_cftypes+0x300/0x300 [ 291.481441] ? __mutex_lock_slowpath+0x30/0x30 Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: Fix negative count for jump label hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7DA63 CVE: NA -------------------------------- Add mutex lock to prevent negative count for jump label. [28612.530675] ------------[ cut here ]------------ [28612.532708] jump label: negative count! [28612.535031] WARNING: CPU: 4 PID: 3899 at kernel/jump_label.c:202 __static_key_slow_dec_cpuslocked+0x204/0x240 [28612.538216] Kernel panic - not syncing: panic_on_warn set ... [28612.538216] [28612.540487] CPU: 4 PID: 3899 Comm: sh Kdump: loaded Not tainted [28612.542788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) [28612.546455] Call Trace: [28612.547339] dump_stack+0xc6/0x11e [28612.548546] ? __static_key_slow_dec_cpuslocked+0x200/0x240 [28612.550352] panic+0x1d6/0x46b [28612.551375] ? refcount_error_report+0x2a5/0x2a5 [28612.552915] ? kmsg_dump_rewind_nolock+0xde/0xde [28612.554358] ? sched_clock_cpu+0x18/0x1b0 [28612.555699] ? __warn+0x1d1/0x210 [28612.556799] ? __static_key_slow_dec_cpuslocked+0x204/0x240 [28612.558548] __warn+0x1ec/0x210 [28612.559621] ? __static_key_slow_dec_cpuslocked+0x204/0x240 [28612.561536] report_bug+0x1ee/0x2b0 [28612.562706] fixup_bug.part.4+0x37/0x80 [28612.563937] do_error_trap+0x21c/0x260 [28612.565109] ? fixup_bug.part.4+0x80/0x80 [28612.566453] ? check_preemption_disabled+0x34/0x1f0 [28612.567991] ? trace_hardirqs_off_thunk+0x1a/0x1c [28612.569534] ? lockdep_hardirqs_off+0x1cb/0x2b0 [28612.570993] ? error_entry+0x9a/0x130 [28612.572138] ? trace_hardirqs_off_caller+0x59/0x1a0 [28612.573710] ? 
trace_hardirqs_off_thunk+0x1a/0x1c [28612.575232] invalid_op+0x14/0x20 [root@lo[ca2lh8ost6 12.576387] ? vprintk_func+0x68/0x1a0 [28612.577827] ? __static_key_slow_dec_cpuslocked+0x204/0x240 smartg[ri2d]8# 612.579662] ? __static_key_slow_dec_cpuslocked+0x204/0x240 [28612.581781] ? static_key_disable+0x30/0x30 [28612.583248] ? s tatic_key_slow_dec+0x57/0x90 [28612.584997] ? tg_set_dynamic_affinity_mode+0x42/0x70 [28612.586714] ? cgroup_file_write+0x471/0x6a0 [28612.588162] ? cgroup_css.part.4+0x100/0x100 [28612.589579] ? cgroup_css.part.4+0x100/0x100 [28612.591031] ? kernfs_fop_write+0x2af/0x430 [28612.592625] ? kernfs_vma_page_mkwrite+0x230/0x230 [28612.594274] ? __vfs_write+0xef/0x680 [28612.595590] ? kernel_read+0x110/0x110 ea8612.596899] ? check_preemption_disabled+0x3mkd4ir/: 0canxno1t fcr0 Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched/rt: Fix possible warn when push_rt_task hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7DX9Y CVE: NA ------------------------------- A warn may be triggered during reboot, as follows: reboot ->kernel_restart ->machine_restart ->smp_send_stop --- ipi handler set_cpu_online(cpu, false) balance_callback -> __balance_callback ->push_rt_task -> find_lock_lowest_rq <从vec->mask获取的rq> -> find_lowest_rq -> cpupri_find -> cpupri_find_fitness -> __cpupri_find [cpumask_and(..., vec->mask)] -> set_task_cpu(next_task, lowest_rq->cpu) --- WARN_ON(!oneline(cpu) So add !cpu_online(lowest_rq->cpu) check before set_task_cpu(). The fix does not completely fix the problem, since cpu_online_mask may be cleared after check. Fixes: 4ff9083b8a9a8 ("sched/core: WARN() when migrating to an offline CPU") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Reviewed-by: Chen Hui Signed-off-by: Yongqiang Liu sched: Fix timer storm for smart grid hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7DSX6 CVE: NA ------------------------------- Timer storm may be triggered if !cpumask_weight(ad->domains[i]) which is set in cpu offline. Fixes: 713cfd2684fa ("sched: Introduce smart grid scheduling strategy for cfs") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: fix dereference NULL pointers hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7EA1X CVE: NA ------------------------------- tg->auto_affinity is NULL if init_auto_affinity() failed. So add checking for tg->auto_affinity before derefrence. Fixes: 713cfd2684fa ("sched: Introduce smart grid scheduling strategy for cfs") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: Fix memory leak on error branch hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7EBNA CVE: NA ------------------------------- Fix memory leak on error branch for smart grid. Fixes: 713cfd2684fa ("sched: Introduce smart grid scheduling strategy for cfs") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: clear credit count in error branch hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7EBSH CVE: NA ------------------------------- Clear credit count if sched_prefer_cpus_fork failed. 
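For reference, the change this log describes is the kernel/fork.c hunk further
below: the dynamic-affinity error path is redirected so the counts already
taken earlier in copy_process() are unwound when sched_prefer_cpus_fork()
fails.

    #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY
    	retval = sched_prefer_cpus_fork(p, current->prefer_cpus);
    	if (retval)
    		goto bad_fork_cleanup_count;	/* was: goto bad_fork_free */
    #endif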
Fixes: 243865da2684 ("cpuset: Introduce new interface for scheduler dynamic affinity") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: Adjust few parameters range for smart grid hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7EEF3 CVE: NA ------------------------------- Adjust few parameters range for smart grid. Fixes: 713cfd2684fa ("sched: Introduce smart grid scheduling strategy for cfs") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: Delete redundant updates to p->prefer_cpus hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7F7KV CVE: NA ------------------------------- Delete redundant updates to p->prefer_cpus when smart grid used. Add missed check for p->prefer_cpus when !CONFIG_QOS_SCHED_SMART_GRID. Fixes: 21e5d85e205f ("sched: Fix possible deadlock in tg_set_dynamic_affinity_mode") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong sched: Fix memory leak for smart grid hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7FBJM CVE: NA ---------------------------------------- Free ad->domains_orig[] in 'free_affinity_domains', otherwise the memory will leak. Fixes: 713cfd2684fa ("sched: Introduce smart grid scheduling strategy for cfs") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong config: enable CONFIG_QOS_SCHED_SMART_GRID by default hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7G6SW CVE: NA -------------------------------- set config CONFIG_QOS_SCHED_SMART_GRID default value. Signed-off-by: Wang ShaoBo Reviewed-by: Wei Li Reviewed-by: Xie XiuQi Reviewed-by: Chao Liu Signed-off-by: Zhang Changzhong sched: Fix null pointer derefrence for sd->span hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7HFZV CVE: NA ---------------------------------------- There may be NULL pointer derefrence when hotplug running and creating taskgroup concurrently. sched_autogroup_create_attach -> sched_create_group -> alloc_fair_sched_group -> init_auto_affinity -> init_affinity_domains -> cpumask_copy(xx, sched_domain_span(tmp)) { tmp may be free due rcu lock missing } { hotplug will rebuild sched domain } sched_cpu_activate -> build_sched_domains -> cpuset_cpu_active -> partition_sched_domains -> build_sched_domains -> cpu_attach_domain -> destroy_sched_domains -> call_rcu(&sd->rcu, destroy_sched_domains_rcu) So sd should be protect with rcu lock in entire critical zone. [ 599.811593] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 600.112821] pc : init_affinity_domains+0xf4/0x200 [ 600.125918] lr : init_affinity_domains+0xd4/0x200 [ 600.331355] Call trace: [ 600.338734] init_affinity_domains+0xf4/0x200 [ 600.347955] init_auto_affinity+0x78/0xc0 [ 600.356622] alloc_fair_sched_group+0xd8/0x210 [ 600.365594] sched_create_group+0x48/0xc0 [ 600.373970] sched_autogroup_create_attach+0x54/0x190 [ 600.383311] ksys_setsid+0x110/0x130 [ 600.391014] __arm64_sys_setsid+0x18/0x24 [ 600.399156] el0_svc_common+0x118/0x170 [ 600.406818] el0_svc_handler+0x3c/0x80 [ 600.414188] el0_svc+0x8/0x640 [ 600.420719] Code: b40002c0 9104e002 f9402061 a9401444 (a9001424) [ 600.430504] SMP: stopping secondary CPUs [ 600.441751] Starting crashdump kernel... 
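The shape of the fix, simplified from the init_affinity_domains() hunk in
kernel/sched/fair.c below: the whole sched-domain walk, including the cpumask
copies, now sits inside one RCU read-side section, so a concurrent domain
rebuild cannot free the domains underneath it.

    rcu_read_lock();
    cpu = cpumask_first_and(cpu_active_mask,
    			    housekeeping_cpumask(HK_FLAG_DOMAIN));
    for_each_domain(cpu, tmp) {
    	/* sched_domain_span(tmp) is only valid inside this RCU section */
    	cpumask_copy(ad->domains[i], sched_domain_span(tmp));
    	i++;
    }
    rcu_read_unlock();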
Fixes: 713cfd2684fa ("sched: Introduce smart grid scheduling strategy for cfs") Signed-off-by: Hui Tang Reviewed-by: Zhang Qiao Signed-off-by: Zhang Changzhong Revert "cpufreq: introduce cpufreq_zone" This reverts commit 1fb8a6fea2e62c3cb83789bc15de2fd339ecb58f. Revert "sched/fair: introduce EAS+ wakeup task selection" This reverts commit fc7c571ed0470c0902df6a31bdfd7d971882311d. Revert "sched: Supports separate load balance between cold and hot partitions" This reverts commit 36b5aa93b319ea661b0bc2305dd7452224f30d6a. sched: introduce affinity domain type Signed-off-by: Yipeng Zou smart_gird: decoupling with cpufreq governor Signed-off-by: Yipeng Zou smart_grid: remove warm cpu api There is no need to maintain warm cpu api. Signed-off-by: Yipeng Zou support key thread to migrate into hot zone Signed-off-by: Ruan Jinjie --- arch/arm64/configs/openeuler_defconfig | 1 + drivers/cpufreq/cpufreq_ondemand.c | 5 +- fs/exec.c | 4 + fs/proc/array.c | 13 + include/linux/cgroup.h | 1 + include/linux/sched.h | 29 ++ include/linux/sched/grid_qos.h | 119 ++++++ include/linux/sched/sysctl.h | 4 + init/Kconfig | 13 + kernel/cgroup/cgroup-v1.c | 59 +++ kernel/fork.c | 15 +- kernel/sched/Makefile | 1 + kernel/sched/core.c | 294 +++++++++++++- kernel/sched/cpufreq.c | 9 +- kernel/sched/cpufreq_schedutil.c | 5 +- kernel/sched/fair.c | 505 ++++++++++++++++++++++++- kernel/sched/grid/Makefile | 2 + kernel/sched/grid/internal.h | 6 + kernel/sched/grid/power.c | 27 ++ kernel/sched/grid/qos.c | 228 +++++++++++ kernel/sched/grid/stat.c | 32 ++ kernel/sched/rt.c | 3 + kernel/sched/sched.h | 53 +++ kernel/sysctl.c | 13 +- mm/mempolicy.c | 12 +- 25 files changed, 1427 insertions(+), 26 deletions(-) create mode 100644 include/linux/sched/grid_qos.h create mode 100644 kernel/sched/grid/Makefile create mode 100644 kernel/sched/grid/internal.h create mode 100644 kernel/sched/grid/power.c create mode 100644 kernel/sched/grid/qos.c create mode 100644 kernel/sched/grid/stat.c diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index eb4ee0522446..2f5291820e7a 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -142,6 +142,7 @@ CONFIG_CGROUP_SCHED=y CONFIG_QOS_SCHED=y CONFIG_QOS_SCHED_MULTILEVEL=y CONFIG_QOS_SCHED_DYNAMIC_AFFINITY=y +CONFIG_QOS_SCHED_SMART_GRID=y CONFIG_QOS_SCHED_SMT_EXPELLER=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_QOS_SCHED_PRIO_LB=y diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index ac361a8b1d3b..6b95a7c68fa5 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -14,7 +14,7 @@ #include #include #include - +#include #include "cpufreq_ondemand.h" /* On-demand governor macros */ @@ -142,7 +142,8 @@ static void od_update(struct cpufreq_policy *policy) dbs_info->freq_lo = 0; /* Check for frequency increase */ - if (load > dbs_data->up_threshold) { + if (load > dbs_data->up_threshold || + cpumask_test_cpu(policy->cpu, sched_grid_global_qos_get_hot_cpumasks())) { /* If switching to max speed, apply sampling_down_factor */ if (policy->cur < policy->max) policy_dbs->rate_mult = dbs_data->sampling_down_factor; diff --git a/fs/exec.c b/fs/exec.c index 981b3ac90c44..bd5414da6492 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1201,6 +1201,7 @@ char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk) } EXPORT_SYMBOL_GPL(__get_task_comm); +void transfer_one_task(struct task_struct *tsk); /* * These functions flushes out all traces of the currently 
running executable * so that a new one can be started @@ -1212,6 +1213,9 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec) trace_task_rename(tsk, buf); strlcpy(tsk->comm, buf, sizeof(tsk->comm)); task_unlock(tsk); + + transfer_one_task(tsk); + perf_event_comm(tsk, exec); } diff --git a/fs/proc/array.c b/fs/proc/array.c index 18a4588c35be..989f7602035c 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -389,6 +389,16 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task) cpumask_pr_args(task->cpus_ptr)); } +#ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY +static void task_cpus_preferred(struct seq_file *m, struct task_struct *task) +{ + seq_printf(m, "Cpus_preferred:\t%*pb\n", + cpumask_pr_args(task->prefer_cpus)); + seq_printf(m, "Cpus_preferred_list:\t%*pbl\n", + cpumask_pr_args(task->prefer_cpus)); +} +#endif + static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm) { seq_put_decimal_ull(m, "CoreDumping:\t", !!mm->core_state); @@ -427,6 +437,9 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); +#ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY + task_cpus_preferred(m, task); +#endif return 0; } diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 0dcf260a4c1c..3137ca0a03e4 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -110,6 +110,7 @@ struct cgroup *cgroup_get_from_fd(int fd); int cgroup_attach_task_all(struct task_struct *from, struct task_struct *); int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from); +int cgroup_transfer_one_task(struct task_struct *tsk, struct cgroup *to, struct cgroup *from); int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts); diff --git a/include/linux/sched.h b/include/linux/sched.h index 3aae225f98a7..76e76992d280 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1442,7 +1442,16 @@ struct task_struct { KABI_RESERVE(10) KABI_RESERVE(11) #endif + +#if !defined(__GENKSYMS__) +#if defined(CONFIG_QOS_SCHED_SMART_GRID) + struct sched_grid_qos *grid_qos; +#else + KABI_RESERVE(12) +#endif +#else KABI_RESERVE(12) +#endif KABI_RESERVE(13) KABI_RESERVE(14) KABI_RESERVE(15) @@ -2230,6 +2239,26 @@ void sched_prefer_cpus_free(struct task_struct *p); void dynamic_affinity_enable(void); #endif +#ifdef CONFIG_QOS_SCHED_SMART_GRID + +enum sched_grid_global_qos_type { + SCHED_GRID_GLOBAL_QOS_TYPE_WARM = 0, + SCHED_GRID_GLOBAL_QOS_TYPE_HOT, + SCHED_GRID_GLOBAL_QOS_TYPE_NR +}; + +extern struct static_key __smart_grid_used; +static inline bool smart_grid_used(void) +{ + return static_key_false(&__smart_grid_used); +} +#else +static inline bool smart_grid_used(void) +{ + return false; +} +#endif + #ifdef CONFIG_BPF_SCHED extern void sched_settag(struct task_struct *tsk, s64 tag); diff --git a/include/linux/sched/grid_qos.h b/include/linux/sched/grid_qos.h new file mode 100644 index 000000000000..7abdbee8fe7a --- /dev/null +++ b/include/linux/sched/grid_qos.h @@ -0,0 +1,119 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_SCHED_GRID_QOS_H +#define _LINUX_SCHED_GRID_QOS_H +#include +#include + +#ifdef CONFIG_QOS_SCHED_SMART_GRID +enum sched_grid_qos_class { + SCHED_GRID_QOS_CLASS_LEVEL_1 = 0, + SCHED_GRID_QOS_CLASS_LEVEL_2 = 1, + SCHED_GRID_QOS_CLASS_LEVEL_3 = 2, + SCHED_GRID_QOS_CLASS_LEVEL_4 = 3, + SCHED_GRID_QOS_CLASS_LEVEL_5 = 4, + 
SCHED_GRID_QOS_CLASS_LEVEL_6 = 5, + SCHED_GRID_QOS_CLASS_LEVEL_7 = 6, + SCHED_GRID_QOS_CLASS_LEVEL_8 = 7, + SCHED_GRID_QOS_CLASS_LEVEL_NR +}; + +enum { + SCHED_GRID_QOS_IPS_INDEX = 0, + SCHED_GRID_QOS_MEMBOUND_RATIO_INDEX = 1, + SCHED_GRID_QOS_MEMBANDWIDTH_INDEX = 2, + SCHED_GRID_QOS_SAMPLE_NR +}; + +#define SCHED_GRID_QOS_RING_BUFFER_MAXLEN 100 + +struct sched_grid_qos_ring_buffer { + u64 vecs[SCHED_GRID_QOS_RING_BUFFER_MAXLEN]; + unsigned int head; + void (*push)(u64 *data, int stepsize, + struct sched_grid_qos_ring_buffer *ring_buffer); +}; + +struct sched_grid_qos_sample { + const char *name; + int index; + int sample_bypass; + int sample_times; + struct sched_grid_qos_ring_buffer ring_buffer; + u64 pred_target[MAX_NUMNODES]; + void (*cal_target)(int stepsize, + struct sched_grid_qos_ring_buffer *ring_buffer); + + int account_ready; + int (*start)(void *arg); + int (*account)(void *arg); +}; + +struct sched_grid_qos_stat { + enum sched_grid_qos_class class_lvl; + int (*set_class_lvl)(struct sched_grid_qos_stat *qos_stat); + struct sched_grid_qos_sample sample[SCHED_GRID_QOS_SAMPLE_NR]; +}; + +struct sched_grid_qos_power { + int cpufreq_sense_ratio; + int target_cpufreq; + int cstate_sense_ratio; +}; + +struct sched_grid_qos_affinity { + nodemask_t mem_preferred_node_mask; + const struct cpumask *prefer_cpus; +}; + +struct task_struct; +struct sched_grid_qos { + struct sched_grid_qos_stat stat; + struct sched_grid_qos_power power; + struct sched_grid_qos_affinity affinity; + + int (*affinity_set)(struct task_struct *p); +}; + +static inline int sched_qos_affinity_set(struct task_struct *p) +{ + return p->grid_qos->affinity_set(p); +} + +int sched_grid_qos_fork(struct task_struct *p, struct task_struct *orig); +void sched_grid_qos_free(struct task_struct *p); + +int sched_grid_preferred_interleave_nid(struct mempolicy *policy); +int sched_grid_preferred_nid(int preferred_nid, nodemask_t *nodemask); + +struct auto_affinity; + +struct sched_grid_global_qos { + raw_spinlock_t lock; + enum sched_grid_global_qos_type type; + cpumask_var_t cpus; + struct list_head af_list_head; +}; + +int __init sched_grid_global_qos_init(void); +int sched_grid_global_qos_update(enum sched_grid_global_qos_type sgs_type, bool is_locked); +int sched_grid_global_qos_add_af(enum sched_grid_global_qos_type sgs_type, struct auto_affinity *af); +int sched_grid_global_qos_del_af(enum sched_grid_global_qos_type sgs_type, struct auto_affinity *af); +struct cpumask* sched_grid_global_qos_get_hot_cpumasks(void); +#else +static inline int +sched_grid_preferred_interleave_nid(struct mempolicy *policy) +{ + return NUMA_NO_NODE; +} +static inline int +sched_grid_preferred_nid(int preferred_nid, nodemask_t *nodemask) +{ + return preferred_nid; +} + +static inline int sched_qos_affinity_set(struct task_struct *p) +{ + return 0; +} +#endif +#endif diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 4d6bbc0934c9..31c4a84ce3df 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -35,6 +35,10 @@ extern unsigned int sysctl_sched_child_runs_first; extern int sysctl_sched_util_low_pct; #endif +#ifdef CONFIG_QOS_SCHED_SMART_GRID +extern int sysctl_affinity_adjust_delay_ms; +#endif + enum sched_tunable_scaling { SCHED_TUNABLESCALING_NONE, SCHED_TUNABLESCALING_LOG, diff --git a/init/Kconfig b/init/Kconfig index b7fbf5b9bdf2..174d970e644f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1065,6 +1065,19 @@ config UCLAMP_TASK_GROUP If in doubt, say N. 
+config QOS_SCHED_SMART_GRID + bool "qos smart grid scheduler" + depends on FAIR_GROUP_SCHED && QOS_SCHED_DYNAMIC_AFFINITY + default n + help + This feature is used for power consumption tuning in server scenario. + This can be divided into the following aspects: + 1. User interface, manage user needs. + 2. Collect tasks' features to ensure key tasks' QOS. + 3. Weaken the influence the impact of CPU frequency and cpuidle + adjustment on tasks. + 4. Docking EAS (Energy Aware Scheduling) model. + config CGROUP_PIDS bool "PIDs controller" help diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 647d0891cff6..6304ee6b3a36 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -81,6 +81,65 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) } EXPORT_SYMBOL_GPL(cgroup_attach_task_all); +/** + * cgroup_trasnsfer_one_task - move one task from one cgroup to another + * @to: cgroup to which the tasks will be moved + * @from: cgroup in which the tasks currently reside + * + * Locking rules between cgroup_post_fork() and the migration path + * guarantee that, if a task is forking while being migrated, the new child + * is guaranteed to be either visible in the source cgroup after the + * parent's migration is complete or put into the target cgroup. No task + * can slip out of migration through forking. + */ +int cgroup_transfer_one_task(struct task_struct *tsk, struct cgroup *to, struct cgroup *from) +{ + DEFINE_CGROUP_MGCTX(mgctx); + struct cgrp_cset_link *link; + int ret; + + if (cgroup_on_dfl(to)) + return -EINVAL; + + ret = cgroup_migrate_vet_dst(to); + if (ret) + return ret; + + mutex_lock(&cgroup_mutex); + + percpu_down_write(&cgroup_threadgroup_rwsem); + + /* all tasks in @from are being moved, all csets are source */ + spin_lock_irq(&css_set_lock); + list_for_each_entry(link, &from->cset_links, cset_link) + cgroup_migrate_add_src(link->cset, to, &mgctx); + spin_unlock_irq(&css_set_lock); + + ret = cgroup_migrate_prepare_dst(&mgctx); + if (ret) + goto out_err; + + /* + * Migrate tasks one-by-one until @from is empty. This fails iff + * ->can_attach() fails. 
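+	 * In this single-task variant only @tsk itself is migrated to @to;
+	 * @from is not drained.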
+ */ + if (tsk) + get_task_struct(tsk); + + if (tsk) { + ret = cgroup_migrate(tsk, false, &mgctx); + if (!ret) + TRACE_CGROUP_PATH(transfer_tasks, to, tsk, false); + put_task_struct(tsk); + } +out_err: + cgroup_migrate_finish(&mgctx); + percpu_up_write(&cgroup_threadgroup_rwsem); + mutex_unlock(&cgroup_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(cgroup_transfer_one_task); + /** * cgroup_trasnsfer_tasks - move tasks from one cgroup to another * @to: cgroup to which the tasks will be moved diff --git a/kernel/fork.c b/kernel/fork.c index 6592d68d98ce..abab329d74e9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -97,7 +97,9 @@ #include #include #include - +#ifdef CONFIG_QOS_SCHED_SMART_GRID +#include +#endif #include #include #include @@ -470,6 +472,9 @@ void free_task(struct task_struct *tsk) free_kthread_struct(tsk); #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY sched_prefer_cpus_free(tsk); +#endif +#ifdef CONFIG_QOS_SCHED_SMART_GRID + sched_grid_qos_free(tsk); #endif free_task_struct(tsk); } @@ -2057,7 +2062,7 @@ static __latent_entropy struct task_struct *copy_process( #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY retval = sched_prefer_cpus_fork(p, current->prefer_cpus); if (retval) - goto bad_fork_free; + goto bad_fork_cleanup_count; #endif lockdep_assert_irqs_enabled(); @@ -2077,6 +2082,12 @@ static __latent_entropy struct task_struct *copy_process( if (retval < 0) goto bad_fork_free; +#ifdef CONFIG_QOS_SCHED_SMART_GRID + retval = sched_grid_qos_fork(p, current); + if (retval) + goto bad_fork_cleanup_count; +#endif + /* * If multiple threads are within copy_process(), then this check * triggers too late. This doesn't hurt, the check is only there diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 6f3106774d05..a6fe0ee09917 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -39,3 +39,4 @@ obj-$(CONFIG_PSI) += psi.o obj-$(CONFIG_SCHED_CORE) += core_sched.o obj-$(CONFIG_BPF_SCHED) += bpf_sched.o obj-$(CONFIG_BPF_SCHED) += bpf_topology.o +obj-$(CONFIG_QOS_SCHED_SMART_GRID) += grid/ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 454bca0c9c6b..e36a49664d52 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -23,7 +23,7 @@ #include "../workqueue_internal.h" #include "../../io_uring/io-wq.h" #include "../smpboot.h" - +#include #include "pelt.h" #include "smp.h" @@ -7996,6 +7996,7 @@ int sched_cpu_activate(unsigned int cpu) static_branch_inc_cpuslocked(&sched_smt_present); #endif set_cpu_active(cpu, true); + tg_update_affinity_domains(cpu, 1); if (sched_smp_initialized) { sched_domains_numa_masks_set(cpu); @@ -8054,6 +8055,7 @@ int sched_cpu_deactivate(unsigned int cpu) return ret; } sched_domains_numa_masks_clear(cpu); + tg_update_affinity_domains(cpu, 0); return 0; } @@ -8100,6 +8102,8 @@ int sched_cpu_dying(unsigned int cpu) } #endif +int __init sched_grid_global_qos_init(void); + void __init sched_init_smp(void) { sched_init_numa(); @@ -8123,6 +8127,9 @@ void __init sched_init_smp(void) init_sched_dl_class(); sched_smp_initialized = true; + + sched_grid_global_qos_init(); + init_auto_affinity(&root_task_group); } static int __init migration_init(void) @@ -9452,6 +9459,260 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css, } #endif /* CONFIG_RT_GROUP_SCHED */ +#ifdef CONFIG_QOS_SCHED_SMART_GRID +int tg_set_dynamic_affinity_mode(struct task_group *tg, u64 mode) +{ + struct auto_affinity *auto_affi = tg->auto_affinity; + + if (unlikely(!auto_affi)) + return -EPERM; + + /* auto mode*/ + if (mode == 1) { + 
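+		/* arms auto_affi->period_timer; see start_auto_affinity() */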
start_auto_affinity(auto_affi); + } else if (mode == 0) { + stop_auto_affinity(auto_affi); + } else { + return -EINVAL; + } + + return 0; +} + +static u64 cpu_affinity_mode_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct task_group *tg = css_tg(css); + + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + return tg->auto_affinity->mode; +} + +static int cpu_affinity_mode_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 mode) +{ + return tg_set_dynamic_affinity_mode(css_tg(css), mode); +} + +void transfer_one_task(struct task_struct *tsk) +{ + int i = 0; + extern int comm_count; + struct task_group *tg = task_group(tsk); + extern struct auto_affinity *hot_auto_affi; + int ret; + + if (smart_grid_used()) { + if (unlikely(!hot_auto_affi)) + return; + + raw_spin_lock_irq(&tg->auto_affinity->lock); + raw_spin_lock_irq(&hot_auto_affi->tg->auto_affinity->lock); + for (i = 0; i < comm_count; i++) { + if (!strcmp(tsk->comm, tg->auto_affinity->except_comm[i])) { + ret = cgroup_transfer_one_task(tsk, hot_auto_affi->tg->css.cgroup, tg->css.cgroup); + break; + } + } + raw_spin_unlock_irq(&hot_auto_affi->tg->auto_affinity->lock); + raw_spin_unlock_irq(&tg->auto_affinity->lock); + } +} + +int comm_count = 0; +EXPORT_SYMBOL(comm_count); + +static ssize_t except_process_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct task_group *tg = css_tg(of_css(of)); + struct auto_affinity *auto_affi; + int i = 0; + + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + auto_affi = tg->auto_affinity; + raw_spin_lock_irq(&auto_affi->lock); + + if (strlen(buf) == 1) { + for (i = 0; i < comm_count; i++) + memset((void *)auto_affi->except_comm[i], 0, TASK_COMM_LEN); + comm_count = 0; + raw_spin_unlock_irq(&tg->auto_affinity->lock); + return nbytes; + } + + if (comm_count >= EXCEPT_MAX) { + raw_spin_unlock_irq(&tg->auto_affinity->lock); + return -EPERM; + } + + memset((void *)auto_affi->except_comm[comm_count], 0, TASK_COMM_LEN); + memcpy(auto_affi->except_comm[comm_count], buf, nbytes-1); + comm_count++; + + raw_spin_unlock_irq(&tg->auto_affinity->lock); + return nbytes; +} + +static int except_process_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + struct auto_affinity *auto_affi; + int i; + + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + auto_affi = tg->auto_affinity; + + for (i = 0; i < comm_count; i++) { + if (i == comm_count - 1) + seq_printf(sf, "%s", auto_affi->except_comm[i]); + else + seq_printf(sf, "%s, ", auto_affi->except_comm[i]); + } + seq_printf(sf, "\n"); + + return 0; +} + +int tg_set_affinity_period(struct task_group *tg, u64 period_ms) +{ + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + if (!period_ms || period_ms > U64_MAX / NSEC_PER_MSEC) + return -EINVAL; + + raw_spin_lock_irq(&tg->auto_affinity->lock); + tg->auto_affinity->period = ms_to_ktime(period_ms); + raw_spin_unlock_irq(&tg->auto_affinity->lock); + return 0; +} + +u64 tg_get_affinity_period(struct task_group *tg) +{ + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + return ktime_to_ms(tg->auto_affinity->period); +} + +static int cpu_affinity_period_write_uint(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 period) +{ + return tg_set_affinity_period(css_tg(css), period); +} + +static u64 cpu_affinity_period_read_uint(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return tg_get_affinity_period(css_tg(css)); +} + +static int cpu_affinity_domain_mask_write_u64(struct 
cgroup_subsys_state *css, + struct cftype *cftype, + u64 mask) +{ + struct task_group *tg = css_tg(css); + struct affinity_domain *ad; + u16 full; + + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + ad = &tg->auto_affinity->ad; + full = (1 << ad->dcount) - 1; + if (mask > full) + return -EINVAL; + + raw_spin_lock_irq(&tg->auto_affinity->lock); + ad->domain_mask = mask; + raw_spin_unlock_irq(&tg->auto_affinity->lock); + return 0; +} + +static u64 cpu_affinity_domain_mask_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct task_group *tg = css_tg(css); + + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + return tg->auto_affinity->ad.domain_mask; +} + +struct auto_affinity *hot_auto_affi; +EXPORT_SYMBOL_GPL(hot_auto_affi); + +static int cpu_affinity_domain_type_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, + u64 type) +{ + struct task_group *tg = css_tg(css); + struct affinity_domain *ad; + + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + ad = &tg->auto_affinity->ad; + if (type >= SCHED_GRID_GLOBAL_QOS_TYPE_NR) + return -EINVAL; + + raw_spin_lock_irq(&tg->auto_affinity->lock); + sched_grid_global_qos_del_af(ad->domain_type, tg->auto_affinity); + ad->domain_type = type; + + if (!hot_auto_affi && ad->domain_type == SCHED_GRID_GLOBAL_QOS_TYPE_HOT) + hot_auto_affi = tg->auto_affinity; + + sched_grid_global_qos_add_af(ad->domain_type, tg->auto_affinity); + raw_spin_unlock_irq(&tg->auto_affinity->lock); + return 0; +} + +static u64 cpu_affinity_domain_type_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct task_group *tg = css_tg(css); + + if (unlikely(!tg->auto_affinity)) + return -EPERM; + + return tg->auto_affinity->ad.domain_type; +} + +static int cpu_affinity_stat_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + struct auto_affinity *auto_affi = tg->auto_affinity; + struct affinity_domain *ad; + int i; + + if (unlikely(!auto_affi)) + return -EPERM; + + ad = &auto_affi->ad; + seq_printf(sf, "period_active %d\n", auto_affi->period_active); + seq_printf(sf, "dcount %d\n", ad->dcount); + seq_printf(sf, "domain_mask 0x%x\n", ad->domain_mask); + seq_printf(sf, "curr_level %d\n", ad->curr_level); + seq_printf(sf, "domain_type %s\n", ad->domain_type == SCHED_GRID_GLOBAL_QOS_TYPE_HOT ? 
"hot" : "warm"); + seq_printf(sf, "global hot %*pbl\n", cpumask_pr_args(sched_grid_global_qos_get_hot_cpumasks())); + for (i = 0; i < ad->dcount; i++) + seq_printf(sf, "sd_level %d, cpu list %*pbl, stay_cnt %llu\n", + i, cpumask_pr_args(ad->domains[i]), + schedstat_val(ad->stay_cnt[i])); + + return 0; +} +#endif /* CONFIG_QOS_SCHED_SMART_GRID */ + #ifdef CONFIG_QOS_SCHED static int tg_change_scheduler(struct task_group *tg, void *data) { @@ -9605,6 +9866,37 @@ static struct cftype cpu_legacy_files[] = { .write_u64 = cpu_shares_write_u64, }, #endif +#ifdef CONFIG_QOS_SCHED_SMART_GRID + { + .name = "dynamic_affinity_mode", + .read_u64 = cpu_affinity_mode_read_u64, + .write_u64 = cpu_affinity_mode_write_u64, + }, + { + .name = "affinity_period_ms", + .read_u64 = cpu_affinity_period_read_uint, + .write_u64 = cpu_affinity_period_write_uint, + }, + { + .name = "affinity_domain_mask", + .read_u64 = cpu_affinity_domain_mask_read_u64, + .write_u64 = cpu_affinity_domain_mask_write_u64, + }, + { + .name = "affinity_domain_type", + .read_u64 = cpu_affinity_domain_type_read_u64, + .write_u64 = cpu_affinity_domain_type_write_u64, + }, + { + .name = "affinity_except_process", + .write = except_process_write, + .seq_show = except_process_show, + }, + { + .name = "affinity_stat", + .seq_show = cpu_affinity_stat_show, + }, +#endif #ifdef CONFIG_CFS_BANDWIDTH { .name = "cfs_quota_us", diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c index 7c2fe50fd76d..5ca1fbbc24c2 100644 --- a/kernel/sched/cpufreq.c +++ b/kernel/sched/cpufreq.c @@ -6,7 +6,7 @@ * Author: Rafael J. Wysocki */ #include - +#include #include "sched.h" DEFINE_PER_CPU(struct update_util_data __rcu *, cpufreq_update_util_data); @@ -68,10 +68,13 @@ EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook); * - the local and remote CPUs share @policy, * - dvfs_possible_from_any_cpu is set in @policy and the local CPU is not going * offline (in which case it is not expected to run cpufreq updates any more). + * - when cpufreq_zone enable, cpu was hot and it's freq not equle max khz or cpu was warm. 
*/ bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy) { - return cpumask_test_cpu(smp_processor_id(), policy->cpus) || + return (!cpumask_test_cpu(policy->cpu, sched_grid_global_qos_get_hot_cpumasks()) || + policy->cur != policy->max) && + (cpumask_test_cpu(smp_processor_id(), policy->cpus) || (policy->dvfs_possible_from_any_cpu && - rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data))); + rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)))); } diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 5e39da0ae086..9ab6d55f909f 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -9,7 +9,7 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include "sched.h" - +#include #include #include @@ -165,6 +165,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, freq = map_util_freq(util, freq, max); + if (cpumask_test_cpu(policy->cpu, sched_grid_global_qos_get_hot_cpumasks())) + freq = sg_policy->policy->cpuinfo.max_freq; + if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update) return sg_policy->next_freq; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ff209d25c21c..791beb80a6e3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -28,6 +28,7 @@ #include #include #endif +#include #include /* @@ -5624,6 +5625,462 @@ static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {} #endif /* CONFIG_CFS_BANDWIDTH */ +#ifdef CONFIG_QOS_SCHED_SMART_GRID +#define AUTO_AFFINITY_DEFAULT_PERIOD_MS 2000 +#define IS_DOMAIN_SET(level, mask) ((1 << (level)) & (mask)) + +static DEFINE_MUTEX(smart_grid_used_mutex); + +static inline unsigned long cpu_util(int cpu); +static unsigned long capacity_of(int cpu); +static int sched_idle_cpu(int cpu); +static unsigned long cpu_runnable(struct rq *rq); +static inline bool prefer_cpus_valid(struct task_struct *p); + +int sysctl_affinity_adjust_delay_ms = 5000; + +struct static_key __smart_grid_used; + +static void smart_grid_usage_inc(void) +{ + static_key_slow_inc(&__smart_grid_used); +} + +static void smart_grid_usage_dec(void) +{ + static_key_slow_dec(&__smart_grid_used); +} + +static inline struct cpumask *task_prefer_cpus(struct task_struct *p) +{ + struct affinity_domain *ad; + + if (!smart_grid_used()) + return p->prefer_cpus; + + if (task_group(p)->auto_affinity->mode == 0) + return p->cpus_ptr; + + ad = &task_group(p)->auto_affinity->ad; + return ad->domains[ad->curr_level]; +} + +static inline int dynamic_affinity_mode(struct task_struct *p) +{ + if (!prefer_cpus_valid(p)) + return -1; + + if (smart_grid_used()) + return task_group(p)->auto_affinity->mode == 0 ? 
-1 : 1; + + return 0; +} + +static void affinity_domain_up(struct task_group *tg) +{ + struct affinity_domain *ad = &tg->auto_affinity->ad; + u16 level = ad->curr_level; + + if (ad->curr_level >= ad->dcount - 1) + return; + + while (level < ad->dcount) { + if (IS_DOMAIN_SET(level + 1, ad->domain_mask) && + cpumask_weight(ad->domains[level + 1]) > 0) { + ad->curr_level = level + 1; + sched_grid_global_qos_update(ad->domain_type, false); + return; + } + level++; + } +} + +static void affinity_domain_down(struct task_group *tg) +{ + struct affinity_domain *ad = &tg->auto_affinity->ad; + u16 level = ad->curr_level; + + if (ad->curr_level <= 0) + return; + + while (level > 0) { + if (!cpumask_weight(ad->domains[level - 1])) + return; + + if (IS_DOMAIN_SET(level - 1, ad->domain_mask)) { + ad->curr_level = level - 1; + sched_grid_global_qos_update(ad->domain_type, false); + return; + } + level--; + } +} + +static enum hrtimer_restart sched_auto_affi_period_timer(struct hrtimer *timer) +{ + struct auto_affinity *auto_affi = + container_of(timer, struct auto_affinity, period_timer); + struct task_group *tg = auto_affi->tg; + struct affinity_domain *ad = &auto_affi->ad; + struct cpumask *span = ad->domains[ad->curr_level]; + unsigned long util_avg_sum = 0; + unsigned long tg_capacity = 0; + unsigned long flags; + int cpu; + + for_each_cpu(cpu, span) { + util_avg_sum += cpu_util(cpu); + tg_capacity += capacity_of(cpu); + } + + raw_spin_lock_irqsave(&auto_affi->lock, flags); + if (util_avg_sum * 100 >= tg_capacity * sysctl_sched_util_low_pct) { + affinity_domain_up(tg); + } else if (util_avg_sum * 100 < tg_capacity * + sysctl_sched_util_low_pct / 2) { + affinity_domain_down(tg); + } + + schedstat_inc(ad->stay_cnt[ad->curr_level]); + + hrtimer_forward_now(timer, auto_affi->period); + raw_spin_unlock_irqrestore(&auto_affi->lock, flags); + return HRTIMER_RESTART; +} + +static int tg_update_affinity_domain_down(struct task_group *tg, void *data) +{ + struct auto_affinity *auto_affi = tg->auto_affinity; + struct affinity_domain *ad; + int *cpu_state = data; + unsigned long flags; + int i; + + if (!auto_affi) + return 0; + + ad = &tg->auto_affinity->ad; + raw_spin_lock_irqsave(&auto_affi->lock, flags); + + for (i = 0; i < ad->dcount; i++) { + if (!cpumask_test_cpu(cpu_state[0], ad->domains_orig[i])) + continue; + + /* online */ + if (cpu_state[1]) { + cpumask_set_cpu(cpu_state[0], ad->domains[i]); + } else { + cpumask_clear_cpu(cpu_state[0], ad->domains[i]); + if (!cpumask_weight(ad->domains[i])) + affinity_domain_up(tg); + } + + } + raw_spin_unlock_irqrestore(&auto_affi->lock, flags); + + return 0; +} + +void tg_update_affinity_domains(int cpu, int online) +{ + int cpu_state[2]; + + cpu_state[0] = cpu; + cpu_state[1] = online; + + rcu_read_lock(); + walk_tg_tree(tg_update_affinity_domain_down, tg_nop, cpu_state); + rcu_read_unlock(); +} + +void start_auto_affinity(struct auto_affinity *auto_affi) +{ + ktime_t delay_ms; + + mutex_lock(&smart_grid_used_mutex); + raw_spin_lock_irq(&auto_affi->lock); + if (auto_affi->period_active == 1) { + raw_spin_unlock_irq(&auto_affi->lock); + mutex_unlock(&smart_grid_used_mutex); + return; + } + + auto_affi->period_active = 1; + auto_affi->mode = 1; + delay_ms = ms_to_ktime(sysctl_affinity_adjust_delay_ms); + hrtimer_forward_now(&auto_affi->period_timer, delay_ms); + hrtimer_start_expires(&auto_affi->period_timer, + HRTIMER_MODE_ABS_PINNED); + raw_spin_unlock_irq(&auto_affi->lock); + + smart_grid_usage_inc(); + mutex_unlock(&smart_grid_used_mutex); +} + +void 
stop_auto_affinity(struct auto_affinity *auto_affi) +{ + struct affinity_domain *ad = &auto_affi->ad; + + mutex_lock(&smart_grid_used_mutex); + raw_spin_lock_irq(&auto_affi->lock); + if (auto_affi->period_active == 0) { + raw_spin_unlock_irq(&auto_affi->lock); + mutex_unlock(&smart_grid_used_mutex); + return; + } + + hrtimer_cancel(&auto_affi->period_timer); + auto_affi->period_active = 0; + auto_affi->mode = 0; + ad->curr_level = ad->dcount > 0 ? ad->dcount - 1 : 0; + raw_spin_unlock_irq(&auto_affi->lock); + + smart_grid_usage_dec(); + mutex_unlock(&smart_grid_used_mutex); +} + +static struct sched_group *sd_find_idlest_group(struct sched_domain *sd) +{ + struct sched_group *idlest = NULL, *group = sd->groups; + unsigned long min_runnable_load = ULONG_MAX; + unsigned long min_avg_load = ULONG_MAX; + int imbalance_scale = 100 + (sd->imbalance_pct-100)/2; + unsigned long imbalance = scale_load_down(NICE_0_LOAD) * + (sd->imbalance_pct-100) / 100; + + do { + unsigned long load, avg_load, runnable_load; + int i; + + avg_load = 0; + runnable_load = 0; + + for_each_cpu(i, sched_group_span(group)) { + load = cpu_runnable(cpu_rq(i)); + runnable_load += load; + avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs); + } + + avg_load = (avg_load * SCHED_CAPACITY_SCALE) / + group->sgc->capacity; + runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / + group->sgc->capacity; + + if (min_runnable_load > (runnable_load + imbalance)) { + min_runnable_load = runnable_load; + min_avg_load = avg_load; + idlest = group; + } else if ((runnable_load < (min_runnable_load + imbalance)) && + (100*min_avg_load > imbalance_scale*avg_load)) { + min_avg_load = avg_load; + idlest = group; + } + } while (group = group->next, group != sd->groups); + + return idlest ? idlest : group; +} + +static int group_find_idlest_cpu(struct sched_group *group) +{ + int least_loaded_cpu = cpumask_first(sched_group_span(group)); + unsigned long load, min_load = ULONG_MAX; + unsigned int min_exit_latency = UINT_MAX; + u64 latest_idle_timestamp = 0; + int shallowest_idle_cpu = -1; + int i; + + if (group->group_weight == 1) + return least_loaded_cpu; + + for_each_cpu(i, sched_group_span(group)) { + if (sched_idle_cpu(i)) + return i; + + if (available_idle_cpu(i)) { + struct rq *rq = cpu_rq(i); + struct cpuidle_state *idle = idle_get_state(rq); + + if (idle && idle->exit_latency < min_exit_latency) { + min_exit_latency = idle->exit_latency; + latest_idle_timestamp = rq->idle_stamp; + shallowest_idle_cpu = i; + } else if ((!idle || + idle->exit_latency == min_exit_latency) && + rq->idle_stamp > latest_idle_timestamp) { + latest_idle_timestamp = rq->idle_stamp; + shallowest_idle_cpu = i; + } + } else if (shallowest_idle_cpu == -1) { + load = cpu_runnable(cpu_rq(i)); + if (load < min_load) { + min_load = load; + least_loaded_cpu = i; + } + } + } + + return shallowest_idle_cpu != -1 ? 
shallowest_idle_cpu : + least_loaded_cpu; +} + +void free_affinity_domains(struct affinity_domain *ad) +{ + int i; + + for (i = 0; i < AD_LEVEL_MAX; i++) { + kfree(ad->domains[i]); + kfree(ad->domains_orig[i]); + ad->domains[i] = NULL; + ad->domains_orig[i] = NULL; + } + ad->dcount = 0; +} + +static int init_affinity_domains_orig(struct affinity_domain *ad) +{ + int i, j; + + for (i = 0; i < ad->dcount; i++) { + ad->domains_orig[i] = kmalloc(sizeof(cpumask_t), GFP_KERNEL); + if (!ad->domains_orig[i]) + goto err; + + cpumask_copy(ad->domains_orig[i], ad->domains[i]); + } + + return 0; +err: + for (j = 0; j < i; j++) { + kfree(ad->domains_orig[j]); + ad->domains_orig[j] = NULL; + } + return -ENOMEM; +} + +static int init_affinity_domains(struct affinity_domain *ad) +{ + struct sched_domain *sd = NULL, *tmp; + struct sched_group *idlest = NULL; + int ret = -ENOMEM; + int dcount = 0; + int i = 0; + int cpu; + + for (i = 0; i < AD_LEVEL_MAX; i++) { + ad->domains[i] = kmalloc(sizeof(cpumask_t), GFP_KERNEL); + if (!ad->domains[i]) + goto err; + } + + rcu_read_lock(); + cpu = cpumask_first_and(cpu_active_mask, + housekeeping_cpumask(HK_FLAG_DOMAIN)); + for_each_domain(cpu, tmp) { + sd = tmp; + dcount++; + } + + if (!sd || dcount > AD_LEVEL_MAX) { + rcu_read_unlock(); + ret = -EINVAL; + goto err; + } + + idlest = sd_find_idlest_group(sd); + cpu = group_find_idlest_cpu(idlest); + i = 0; + for_each_domain(cpu, tmp) { + cpumask_copy(ad->domains[i], sched_domain_span(tmp)); + __schedstat_set(ad->stay_cnt[i], 0); + i++; + } + rcu_read_unlock(); + + ad->dcount = dcount; + ad->curr_level = ad->dcount > 0 ? ad->dcount - 1 : 0; + ad->domain_mask = (1 << ad->dcount) - 1; + ad->domain_type = AFFINITY_DOAMIN_TYPE_DEFAULT; + + ret = init_affinity_domains_orig(ad); + if (ret) + goto err; + + return 0; +err: + free_affinity_domains(ad); + return ret; +} + +int init_auto_affinity(struct task_group *tg) +{ + struct auto_affinity *auto_affi; + int ret; + + auto_affi = kzalloc(sizeof(*auto_affi), GFP_KERNEL); + if (!auto_affi) + return -ENOMEM; + + raw_spin_lock_init(&auto_affi->lock); + auto_affi->mode = 0; + auto_affi->period_active = 0; + auto_affi->period = ms_to_ktime(AUTO_AFFINITY_DEFAULT_PERIOD_MS); + hrtimer_init(&auto_affi->period_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_ABS_PINNED); + auto_affi->period_timer.function = sched_auto_affi_period_timer; + + ret = init_affinity_domains(&auto_affi->ad); + if (ret) { + kfree(auto_affi); + if (ret == -EINVAL) + ret = 0; + return ret; + } + + auto_affi->tg = tg; + tg->auto_affinity = auto_affi; + INIT_LIST_HEAD(&auto_affi->af_list); + sched_grid_global_qos_add_af(auto_affi->ad.domain_type, auto_affi); + return 0; +} + +static void destroy_auto_affinity(struct task_group *tg) +{ + struct auto_affinity *auto_affi = tg->auto_affinity; + + if (unlikely(!auto_affi)) + return; + + if (auto_affi->period_active) + smart_grid_usage_dec(); + + hrtimer_cancel(&auto_affi->period_timer); + sched_grid_global_qos_del_af(auto_affi->ad.domain_type, auto_affi); + free_affinity_domains(&auto_affi->ad); + + kfree(tg->auto_affinity); + tg->auto_affinity = NULL; +} +#else +static void destroy_auto_affinity(struct task_group *tg) {} + +#ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY +static inline bool prefer_cpus_valid(struct task_struct *p); + +static inline struct cpumask *task_prefer_cpus(struct task_struct *p) +{ + return p->prefer_cpus; +} + +static inline int dynamic_affinity_mode(struct task_struct *p) +{ + if (!prefer_cpus_valid(p)) + return -1; + + return 0; +} +#endif +#endif + 
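The period timer above widens or narrows the current affinity domain with a
simple two-threshold hysteresis around sysctl_sched_util_low_pct (85 in this
tree). As a reading aid only, not code from this patch, here is a minimal
standalone sketch of that decision; the numbers in main() are purely
illustrative.

    #include <stdio.h>

    #define UTIL_LOW_PCT 85	/* default of sysctl_sched_util_low_pct */

    /*
     * Mimics sched_auto_affi_period_timer(): util_avg_sum and tg_capacity
     * are the sums of cpu_util() and capacity_of() over the current domain.
     * Returns +1 to widen the domain (affinity_domain_up), -1 to narrow it
     * (affinity_domain_down), 0 to stay at the current level.
     */
    static int domain_adjust(unsigned long util_avg_sum,
    			 unsigned long tg_capacity)
    {
    	if (util_avg_sum * 100 >= tg_capacity * UTIL_LOW_PCT)
    		return +1;
    	if (util_avg_sum * 100 < tg_capacity * UTIL_LOW_PCT / 2)
    		return -1;
    	return 0;
    }

    int main(void)
    {
    	printf("%d\n", domain_adjust(900, 1024));	/* busy  -> +1 (widen)  */
    	printf("%d\n", domain_adjust(300, 1024));	/* light -> -1 (narrow) */
    	printf("%d\n", domain_adjust(600, 1024));	/* mid   ->  0 (stay)   */
    	return 0;
    }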
/************************************************** * CFS operations on tasks: */ @@ -7097,19 +7554,17 @@ int sysctl_sched_util_low_pct = 85; static inline bool prefer_cpus_valid(struct task_struct *p) { - if (!dynamic_affinity_used()) - return false; + struct cpumask *prefer_cpus = task_prefer_cpus(p); - return p->prefer_cpus && - !cpumask_empty(p->prefer_cpus) && - !cpumask_equal(p->prefer_cpus, p->cpus_ptr) && - cpumask_subset(p->prefer_cpus, p->cpus_ptr); + return !cpumask_empty(prefer_cpus) && + !cpumask_equal(prefer_cpus, p->cpus_ptr) && + cpumask_subset(prefer_cpus, p->cpus_ptr); } /* * set_task_select_cpus: select the cpu range for task * @p: the task whose available cpu range will to set * @idlest_cpu: the cpu which is the idlest in prefer cpus * * If sum of 'util_avg' among 'preferred_cpus' lower than the percentage * 'sysctl_sched_util_low_pct' of 'preferred_cpus' capacity, select @@ -7127,13 +7582,23 @@ static void set_task_select_cpus(struct task_struct *p, int *idlest_cpu, long min_util = INT_MIN; struct task_group *tg; long spare; - int cpu; + int cpu, mode; - p->select_cpus = p->cpus_ptr; - if (!prefer_cpus_valid(p)) + rcu_read_lock(); + mode = dynamic_affinity_mode(p); + if (mode == -1) { + rcu_read_unlock(); + return; + } else if (mode == 1) { + p->select_cpus = task_prefer_cpus(p); + if (idlest_cpu) + *idlest_cpu = cpumask_first(p->select_cpus); + sched_qos_affinity_set(p); + rcu_read_unlock(); return; + } - rcu_read_lock(); + /* manual mode */ tg = task_group(p); for_each_cpu(cpu, p->prefer_cpus) { if (unlikely(!tg->se[cpu])) @@ -7203,12 +7668,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f time = schedstat_start_time(); /* - * required for stable ->cpus_allowed + * required for stable ->cpus_ptr */ lockdep_assert_held(&p->pi_lock); #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY - set_task_select_cpus(p, &idlest_cpu, sd_flag); + p->select_cpus = p->cpus_ptr; + if (dynamic_affinity_used() || smart_grid_used()) + set_task_select_cpus(p, &idlest_cpu, sd_flag); #endif if (sd_flag & SD_BALANCE_WAKE) { @@ -8770,7 +9237,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) return 0; #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY - set_task_select_cpus(p, NULL, 0); + p->select_cpus = p->cpus_ptr; + if (dynamic_affinity_used() || smart_grid_used()) + set_task_select_cpus(p, NULL, 0); + if (!cpumask_test_cpu(env->dst_cpu, p->select_cpus)) { #else if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) { @@ -12663,6 +13133,7 @@ void free_fair_sched_group(struct task_group *tg) int i; destroy_cfs_bandwidth(tg_cfs_bandwidth(tg)); + destroy_auto_affinity(tg); for_each_possible_cpu(i) { #ifdef CONFIG_QOS_SCHED @@ -12683,7 +13154,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) { struct sched_entity *se; struct cfs_rq *cfs_rq; - int i; + int i, ret; tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL); if (!tg->cfs_rq) @@ -12695,6 +13166,9 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) tg->shares = NICE_0_LOAD; init_cfs_bandwidth(tg_cfs_bandwidth(tg)); + ret = init_auto_affinity(tg); + if (ret) + goto err; for_each_possible_cpu(i) { cfs_rq = kzalloc_node(sizeof(struct cfs_rq), @@ -12717,6 +13191,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent) err_free_rq: kfree(cfs_rq); err: + destroy_auto_affinity(tg); return 0; } diff --git a/kernel/sched/grid/Makefile
b/kernel/sched/grid/Makefile new file mode 100644 index 000000000000..82f2a09c3c30 --- /dev/null +++ b/kernel/sched/grid/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_QOS_SCHED_SMART_GRID) += qos.o power.o stat.o diff --git a/kernel/sched/grid/internal.h b/kernel/sched/grid/internal.h new file mode 100644 index 000000000000..743f72aaffbf --- /dev/null +++ b/kernel/sched/grid/internal.h @@ -0,0 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_SCHED_SMART_GRID_INTERNAL_H +#define _LINUX_SCHED_SMART_GRID_INTERNAL_H +void qos_power_init(struct sched_grid_qos_power *power); +void qos_stat_init(struct sched_grid_qos_stat *stat); +#endif diff --git a/kernel/sched/grid/power.c b/kernel/sched/grid/power.c new file mode 100644 index 000000000000..f916cd3801ad --- /dev/null +++ b/kernel/sched/grid/power.c @@ -0,0 +1,27 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Common code for QOS-aware smart grid Scheduling + * + * Copyright (C) 2023-2024 Huawei Technologies Co., Ltd + * + * Author: Wang Shaobo + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + */ +#include +#include "internal.h" + +void qos_power_init(struct sched_grid_qos_power *power) +{ + power->cpufreq_sense_ratio = 0; + power->target_cpufreq = 0; + power->cstate_sense_ratio = 0; +} diff --git a/kernel/sched/grid/qos.c b/kernel/sched/grid/qos.c new file mode 100644 index 000000000000..dd87dcf07e3b --- /dev/null +++ b/kernel/sched/grid/qos.c @@ -0,0 +1,228 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Common code for Smart Grid Scheduling + * + * Copyright (C) 2023-2024 Huawei Technologies Co., Ltd + * + * Author: Wang Shaobo + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + */ +#include +#include +#include +#include +#include +#include "internal.h" +#include <../kernel/sched/sched.h> + +static inline int qos_affinity_set(struct task_struct *p) +{ + int n; + struct sched_grid_qos_affinity *affinity = &p->grid_qos->affinity; + + if (likely(affinity->prefer_cpus == p->select_cpus)) + return 0; + + /* + * We want the memory allocation to be as close to the CPU + * as possible, and adjust after getting memory bandwidth usage. 
+ */ + for (n = 0; n < nr_node_ids; n++) { + if (cpumask_intersects(cpumask_of_node(n), p->select_cpus)) + node_set(n, affinity->mem_preferred_node_mask); + else + node_clear(n, affinity->mem_preferred_node_mask); + } + + affinity->prefer_cpus = p->select_cpus; + return 0; +} + +int sched_grid_qos_fork(struct task_struct *p, struct task_struct *orig) +{ + struct sched_grid_qos *qos; + + qos = kzalloc(sizeof(*qos), GFP_KERNEL); + if (!qos) + return -ENOMEM; + + qos_power_init(&qos->power); + qos_stat_init(&qos->stat); + + nodes_clear(qos->affinity.mem_preferred_node_mask); + if (likely(orig->grid_qos)) + qos->affinity = orig->grid_qos->affinity; + qos->affinity_set = qos_affinity_set; + p->grid_qos = qos; + + return 0; +} + +void sched_grid_qos_free(struct task_struct *p) +{ + kfree(p->grid_qos); + p->grid_qos = NULL; +} + +/* dynamically select a more appropriate preferred interleave nid for the process */ +int sched_grid_preferred_interleave_nid(struct mempolicy *policy) +{ + nodemask_t nmask; + unsigned int next; + struct task_struct *me = current; + nodemask_t *preferred_nmask = NULL; + + if (likely(me->grid_qos)) + preferred_nmask = + &me->grid_qos->affinity.mem_preferred_node_mask; + + if (!preferred_nmask || !policy) + return NUMA_NO_NODE; + + if (nodes_equal(policy->v.nodes, *preferred_nmask)) + return NUMA_NO_NODE; + /* + * We perceive the actual consumption of memory bandwidth + * in each node and post a preferred interleave nid in + * a more appropriate range. + */ + nodes_and(nmask, policy->v.nodes, *preferred_nmask); + if (nodes_empty(nmask)) + return NUMA_NO_NODE; + + next = next_node_in(me->il_prev, nmask); + if (next < MAX_NUMNODES) + me->il_prev = next; + return next; +} + +/* dynamically select a more appropriate preferred nid for the process */ +int sched_grid_preferred_nid(int preferred_nid, nodemask_t *nodemask) +{ + int nd = preferred_nid; + nodemask_t nmask, ndmask; + nodemask_t *preferred_nmask = NULL; + + if (likely(current->grid_qos)) + preferred_nmask = + &current->grid_qos->affinity.mem_preferred_node_mask; + + if (!preferred_nmask) + return preferred_nid; + + /* + * We perceive the actual consumption of memory bandwidth + * in each node and post a preferred nid in a more appropriate + * range. + */ + nmask = *preferred_nmask; + if (nodemask) { + if (nodes_equal(*nodemask, nmask)) + return preferred_nid; + + nodes_and(nmask, nmask, *nodemask); + } + + if (node_isset(preferred_nid, nmask)) + return preferred_nid; + + /* + * We prefer the numa node we're running on. If nodemask does not + * limit us, we select a preferred nid from the preferred range, + * or from the restricted range if not.
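Stripped of the cpumask/nodemask helpers, the node handling in qos_affinity_set() and sched_grid_preferred_nid() reduces to two steps: keep only the NUMA nodes whose CPUs intersect the task's selected CPU set, then prefer the node the task is running on if it survived the intersection. A minimal user-space sketch, with plain bit masks standing in for the kernel types and an invented two-node layout:

#include <stdio.h>

int main(void)
{
	unsigned int node_cpus[2] = { 0x0f, 0xf0 };	/* CPUs owned by node 0 and 1 */
	unsigned int select_cpus = 0x0c;		/* task's selected CPUs */
	unsigned int preferred_nodes = 0;
	int local_node = 0;				/* node the task runs on */
	int nid = -1;					/* -1 stands in for NUMA_NO_NODE */

	/* keep only nodes whose CPUs intersect the selected CPU set */
	for (int n = 0; n < 2; n++)
		if (node_cpus[n] & select_cpus)
			preferred_nodes |= 1u << n;

	if (preferred_nodes & (1u << local_node))
		nid = local_node;			/* local node is allowed */
	else if (preferred_nodes)
		nid = __builtin_ctz(preferred_nodes);	/* first allowed node */

	printf("preferred node mask 0x%x, chosen nid %d\n", preferred_nodes, nid);
	return 0;
}

Falling back to the first remaining node mirrors the first_node() calls in the code above.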
+ */ + init_nodemask_of_node(&ndmask, numa_node_id()); + nodes_and(ndmask, nmask, ndmask); + if (!nodes_empty(ndmask)) + nd = first_node(ndmask); + else if (!nodes_empty(nmask)) + nd = first_node(nmask); + + return nd; +} + +static struct sched_grid_global_qos sgs_global_qos[SCHED_GRID_GLOBAL_QOS_TYPE_NR]; + +int __init sched_grid_global_qos_init(void) +{ + int index; + + for (index = 0; index < SCHED_GRID_GLOBAL_QOS_TYPE_NR; index++) { + if (!zalloc_cpumask_var(&sgs_global_qos[index].cpus, GFP_KERNEL)) + BUG_ON(1); + + raw_spin_lock_init(&sgs_global_qos[index].lock); + sgs_global_qos[index].type = index; + INIT_LIST_HEAD(&sgs_global_qos[index].af_list_head); + } + + return 0; +} + +int sched_grid_global_qos_update(enum sched_grid_global_qos_type sgs_type, bool is_locked) +{ + struct list_head *pos; + struct auto_affinity *af_pos; + + if (sgs_type >= SCHED_GRID_GLOBAL_QOS_TYPE_NR) + return -1; + + if (!is_locked) + raw_spin_lock_irq(&sgs_global_qos[sgs_type].lock); + + cpumask_clear(sgs_global_qos[sgs_type].cpus); + + list_for_each(pos, &sgs_global_qos[sgs_type].af_list_head) { + af_pos = list_entry(pos, struct auto_affinity, af_list); + cpumask_or(sgs_global_qos[sgs_type].cpus, + sgs_global_qos[sgs_type].cpus, + af_pos->ad.domains[af_pos->ad.curr_level]); + } + + cpumask_andnot(sgs_global_qos[SCHED_GRID_GLOBAL_QOS_TYPE_WARM].cpus, + sgs_global_qos[SCHED_GRID_GLOBAL_QOS_TYPE_WARM].cpus, + sgs_global_qos[SCHED_GRID_GLOBAL_QOS_TYPE_HOT].cpus); + + if (!is_locked) + raw_spin_unlock_irq(&sgs_global_qos[sgs_type].lock); + + return 0; +} + +int sched_grid_global_qos_add_af(enum sched_grid_global_qos_type sgs_type, struct auto_affinity *af) +{ + if (sgs_type >= SCHED_GRID_GLOBAL_QOS_TYPE_NR || af == NULL) + return -1; + + raw_spin_lock_irq(&sgs_global_qos[sgs_type].lock); + list_add_tail(&af->af_list, &sgs_global_qos[sgs_type].af_list_head); + sched_grid_global_qos_update(sgs_type, true); + raw_spin_unlock_irq(&sgs_global_qos[sgs_type].lock); + return 0; +} + +int sched_grid_global_qos_del_af(enum sched_grid_global_qos_type sgs_type, struct auto_affinity *af) +{ + if (sgs_type >= SCHED_GRID_GLOBAL_QOS_TYPE_NR || af == NULL) + return -1; + + raw_spin_lock_irq(&sgs_global_qos[sgs_type].lock); + list_del(&af->af_list); + sched_grid_global_qos_update(sgs_type, true); + raw_spin_unlock_irq(&sgs_global_qos[sgs_type].lock); + return 0; +} + +struct cpumask* sched_grid_global_qos_get_hot_cpumasks(void) +{ + return &sgs_global_qos[SCHED_GRID_GLOBAL_QOS_TYPE_HOT].cpus[0]; +} diff --git a/kernel/sched/grid/stat.c b/kernel/sched/grid/stat.c new file mode 100644 index 000000000000..b40c75145608 --- /dev/null +++ b/kernel/sched/grid/stat.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Common code for QOS-aware smart grid Scheduling + * + * Copyright (C) 2023-2024 Huawei Technologies Co., Ltd + * + * Author: Wang Shaobo + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
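The zone bookkeeping in sched_grid_global_qos_update() can be summarised in a few lines: a zone's CPU set is the union of its members' current-level affinity domains, and the warm zone is then trimmed so it never overlaps the hot zone. A user-space sketch with plain bit masks (the member spans are invented):

#include <stdio.h>

int main(void)
{
	unsigned int hot_members[]  = { 0x03, 0x0c };	/* hot-zone member domains */
	unsigned int warm_members[] = { 0x0f, 0xf0 };	/* warm-zone member domains */
	unsigned int hot = 0, warm = 0;

	/* each zone is the union of its members' current-level domains */
	for (int i = 0; i < 2; i++)
		hot |= hot_members[i];
	for (int i = 0; i < 2; i++)
		warm |= warm_members[i];

	warm &= ~hot;		/* the warm zone must exclude hot CPUs */

	printf("hot=0x%02x warm=0x%02x\n", hot, warm);
	return 0;
}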
+ * + */ +#include +#include "internal.h" + +void qos_stat_init(struct sched_grid_qos_stat *stat) +{ + stat->sample[SCHED_GRID_QOS_IPS_INDEX].name = "ips"; + stat->sample[SCHED_GRID_QOS_IPS_INDEX].index = SCHED_GRID_QOS_IPS_INDEX; + stat->sample[SCHED_GRID_QOS_MEMBOUND_RATIO_INDEX].name = "membound_ratio"; + stat->sample[SCHED_GRID_QOS_MEMBOUND_RATIO_INDEX].index = + SCHED_GRID_QOS_MEMBOUND_RATIO_INDEX; + stat->sample[SCHED_GRID_QOS_MEMBANDWIDTH_INDEX].name = "memband_width"; + stat->sample[SCHED_GRID_QOS_MEMBANDWIDTH_INDEX].index = + SCHED_GRID_QOS_MEMBANDWIDTH_INDEX; +} diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 0f349d8d076d..ca868c04ff24 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1941,6 +1941,9 @@ static int push_rt_task(struct rq *rq) goto retry; } + if (unlikely(!cpu_online(lowest_rq->cpu))) + goto out; + deactivate_task(rq, next_task, 0); set_task_cpu(next_task, lowest_rq->cpu); activate_task(lowest_rq, next_task, 0); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 27fd1240ac85..4b950bc235d2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -400,6 +400,40 @@ struct cfs_bandwidth { #endif }; + +#ifdef CONFIG_QOS_SCHED_SMART_GRID +#define AD_LEVEL_MAX 8 +#define EXCEPT_MAX 8 + +#define AFFINITY_DOAMIN_TYPE_DEFAULT SCHED_GRID_GLOBAL_QOS_TYPE_WARM + +struct affinity_domain { + int dcount; + int curr_level; + u32 domain_mask; + enum sched_grid_global_qos_type domain_type; +#ifdef CONFIG_SCHEDSTATS + u64 stay_cnt[AD_LEVEL_MAX]; +#endif + struct cpumask *domains[AD_LEVEL_MAX]; + struct cpumask *domains_orig[AD_LEVEL_MAX]; +}; +#endif + +struct auto_affinity { +#ifdef CONFIG_QOS_SCHED_SMART_GRID + raw_spinlock_t lock; + u64 mode; + ktime_t period; + struct hrtimer period_timer; + int period_active; + struct affinity_domain ad; + struct task_group *tg; + struct list_head af_list; + char except_comm[EXCEPT_MAX][TASK_COMM_LEN]; +#endif +}; + /* Task group related information */ struct task_group { struct cgroup_subsys_state css; @@ -460,7 +494,11 @@ struct task_group { #else KABI_RESERVE(1) #endif +#if defined(CONFIG_QOS_SCHED_SMART_GRID) && !defined(__GENKSYMS__) + struct auto_affinity *auto_affinity; +#else KABI_RESERVE(2) +#endif KABI_RESERVE(3) KABI_RESERVE(4) }; @@ -533,6 +571,21 @@ extern void sched_offline_group(struct task_group *tg); extern void sched_move_task(struct task_struct *tsk); +#ifdef CONFIG_QOS_SCHED_SMART_GRID +extern void start_auto_affinity(struct auto_affinity *auto_affi); +extern void stop_auto_affinity(struct auto_affinity *auto_affi); +extern int init_auto_affinity(struct task_group *tg); +extern void tg_update_affinity_domains(int cpu, int online); + +#else +static inline int init_auto_affinity(struct task_group *tg) +{ + return 0; +} + +static inline void tg_update_affinity_domains(int cpu, int online) {} +#endif + #ifdef CONFIG_FAIR_GROUP_SCHED extern int sched_group_set_shares(struct task_group *tg, unsigned long shares); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e22228e55afc..c31c43f3d468 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -125,7 +125,7 @@ static int one_thousand = 1000; #ifdef CONFIG_PRINTK static int ten_thousand = 10000; #endif -#ifdef CONFIG_QOS_SCHED +#if defined(CONFIG_QOS_SCHED) || defined(CONFIG_QOS_SCHED_SMART_GRID) static int hundred_thousand = 100000; #endif #ifdef CONFIG_PERF_EVENTS @@ -2748,6 +2748,17 @@ static struct ctl_table kern_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = &one_hundred, }, +#endif +#ifdef CONFIG_QOS_SCHED_SMART_GRID + { + .procname = 
"affinity_adjust_delay_ms", + .data = &sysctl_affinity_adjust_delay_ms, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &hundred_thousand, + }, #endif { } }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index e2927e81c738..5ea194795d92 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -76,6 +76,7 @@ #include #include #include +#include #include #include #include @@ -2235,7 +2236,14 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, if (pol->mode == MPOL_INTERLEAVE) { unsigned nid; - nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); + if (smart_grid_used()) { + nid = sched_grid_preferred_interleave_nid(pol); + nid = (nid == NUMA_NO_NODE) ? + interleave_nid(pol, vma, addr, PAGE_SHIFT + order) : nid; + } else { + nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); + } + mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); goto out; @@ -2282,6 +2290,8 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, nmask = policy_nodemask(gfp, pol); preferred_nid = policy_node(gfp, pol, node); + if (smart_grid_used()) + preferred_nid = sched_grid_preferred_nid(preferred_nid, nmask); page = __alloc_pages(gfp, order, preferred_nid, nmask); mark_vma_cdm(nmask, page, vma); mpol_cond_put(pol); -- 2.34.1