[PATCH OLK-6.6 000/451] sched_ext v1
Aboorva Devarajan (2): sched_ext: Documentation: Remove mentions of scx_bpf_switch_all sched: Pass correct scheduling policy to __setscheduler_class Alan Maguire (16): kbuild,bpf: Switch to using --btf_features for pahole v1.26 and later kbuild, bpf: Use test-ge check for v1.25-only pahole libbpf: Add btf__distill_base() creating split BTF with distilled base BTF selftests/bpf: Test distilled base, split BTF generation libbpf: Split BTF relocation selftests/bpf: Extend distilled BTF tests to cover BTF relocation resolve_btfids: Handle presence of .BTF.base section libbpf: BTF relocation followup fixing naming, loop logic module, bpf: Store BTF base pointer in struct module libbpf: Split field iter code into its own file kernel libbpf,bpf: Share BTF relocate-related code with kernel kbuild,bpf: Add module-specific pahole flags for distilled base BTF selftests/bpf: Add kfunc_call test for simple dtor in bpf_testmod bpf: fix build when CONFIG_DEBUG_INFO_BTF[_MODULES] is undefined libbpf: Fix error handling in btf__distill_base() libbpf: Fix license for btf_relocate.c Alexander Lobakin (1): bitops: make BYTES_TO_BITS() treewide-available Alexei Starovoitov (2): s390/bpf: Fix indirect trampoline generation bpf: Introduce "volatile compare" macros Andrea Righi (33): sched_ext: fix typo in set_weight() description sched_ext: add CONFIG_DEBUG_INFO_BTF dependency sched_ext: Provide a sysfs enable_seq counter sched_ext: improve WAKE_SYNC behavior for default idle CPU selection sched_ext: Clarify ops.select_cpu() for single-CPU tasks sched_ext: Introduce LLC awareness to the default idle selection policy sched_ext: Introduce NUMA awareness to the default idle selection policy sched_ext: Do not enable LLC/NUMA optimizations when domains overlap sched_ext: Fix incorrect use of bitwise AND MAINTAINERS: add self as reviewer for sched_ext sched_ext: idle: Refresh idle masks during idle-to-idle transitions sched_ext: Use the NUMA scheduling domain for NUMA optimizations sched_ext: idle: use assign_cpu() to update the idle cpumask sched_ext: idle: clarify comments sched_ext: idle: introduce check_builtin_idle_enabled() helper sched_ext: idle: small CPU iteration refactoring sched_ext: update scx_bpf_dsq_insert() doc for SCX_DSQ_LOCAL_ON sched_ext: Include remaining task time slice in error state dump sched_ext: Include task weight in the error state dump selftests/sched_ext: Fix enum resolution tools/sched_ext: Add helper to check task migration state sched_ext: selftests/dsp_local_on: Fix selftest on UP systems sched_ext: Fix lock imbalance in dispatch_to_local_dsq() selftests/sched_ext: Fix exit selftest hang on UP sched_ext: Move built-in idle CPU selection policy to a separate file sched_ext: Track currently locked rq sched_ext: Make scx_locked_rq() inline sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() sched/ext: Fix invalid task state transitions on class switch sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c sched_ext: Remove duplicate BTF_ID_FLAGS definitions sched_ext: Fix rq lock state in hotplug ops sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl() Andrii Nakryiko (14): bpf: Emit global subprog name in verifier logs bpf: Validate global subprogs lazily selftests/bpf: Add lazy global subprog validation tests libbpf: Add btf__new_split() API that was declared but not implemented bpf: move sleepable flag from bpf_prog_aux to bpf_prog libbpf: Add BTF field iterator libbpf: Make use of BTF field iterator in BPF linker code libbpf: Make use of BTF field iterator in BTF handling code bpftool: Use BTF field iterator in btfgen libbpf: Remove callback-based type/string BTF field visitor helpers bpf: extract iterator argument type and name validation logic bpf: allow passing struct bpf_iter_<type> as kfunc arguments selftests/bpf: test passing iterator to a kfunc selftests/bpf: validate eliminated global subprog is not freplaceable Arnaldo Carvalho de Melo (1): tools include UAPI: Sync linux/sched.h copy with the kernel sources Atul Kumar Pant (1): sched_ext: Fixes typos in comments Benjamin Tissoires (1): bpf: introduce in_sleepable() helper Bitao Hu (4): genirq: Convert kstat_irqs to a struct genirq: Provide a snapshot mechanism for interrupt statistics watchdog/softlockup: Low-overhead detection of interrupt storm watchdog/softlockup: Report the most frequent interrupts Björn Töpel (1): selftests: sched_ext: Add sched_ext as proper selftest target Breno Leitao (3): rhashtable: Fix potential deadlock by moving schedule_work outside lock sched_ext: Use kvzalloc for large exit_dump allocation sched/ext: Prevent update_locked_rq() calls with NULL rq Changwoo Min (12): sched_ext: Clarify sched_ext_ops table for userland scheduler sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl() MAINTAINERS: add me as reviewer for sched_ext sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass() sched_ext: Relocate scx_enabled() related code sched_ext: Implement scx_bpf_now() sched_ext: Add scx_bpf_now() for BPF scheduler sched_ext: Add time helpers for BPF schedulers sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now() sched_ext: Use time helpers in BPF schedulers sched_ext: Fix incorrect time delta calculation in time_delta() sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers Cheng-Yang Chou (1): sched_ext: Always use SMP versions in kernel/sched/ext.c Christian Brauner (1): file: add take_fd() cleanup helper Christian Loehle (1): sched/fair: Remove stale FREQUENCY_UTIL comment Christophe Leroy (2): bpf: Remove arch_unprotect_bpf_trampoline() bpf: Check return from set_memory_rox() Chuyi Zhou (15): cgroup: Prepare for using css_task_iter_*() in BPF bpf: Introduce css_task open-coded iterator kfuncs bpf: Introduce task open coded iterator kfuncs bpf: Introduce css open-coded iterator kfuncs bpf: teach the verifier to enforce css_iter and task_iter in RCU CS bpf: Let bpf_iter_task_new accept null task ptr selftests/bpf: rename bpf_iter_task.c to bpf_iter_tasks.c selftests/bpf: Add tests for open-coded task and css iter bpf: Relax allowlist for css_task iter selftests/bpf: Add tests for css_task iter combining with cgroup iter selftests/bpf: Add test for using css_task iter in sleepable progs bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly sched_ext: Fix the incorrect bpf_list kfunc API in common.bpf.h. sched_ext: Use SCX_CALL_OP_TASK in task_tick_scx Colin Ian King (1): sched_ext: Fix spelling mistake: "intead" -> "instead" Daniel Xu (3): bpf: btf: Support flags for BTF_SET8 sets bpf: btf: Add BTF_KFUNCS_START/END macro pair bpf: treewide: Annotate BPF kfuncs in BTF Dave Marchevsky (6): bpf: Don't explicitly emit BTF for struct btf_iter_num selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Add tests for open-coded task_vma iter bpf: Add __bpf_kfunc_{start,end}_defs macros bpf: Add __bpf_hook_{start,end} macros David Vernet (15): bpf: Add ability to pin bpf timer to calling CPU selftests/bpf: Test pinning bpf timer to a core sched_ext: Implement runnable task stall watchdog sched_ext: Print sched_ext info when dumping stack sched_ext: Implement SCX_KICK_WAIT sched_ext: Implement sched_ext_ops.cpu_acquire/release() sched_ext: Add selftests bpf: Load vmlinux btf for any struct_ops map sched_ext: Make scx_bpf_cpuperf_set() @cpu arg signed scx: Allow calling sleepable kfuncs from BPF_PROG_TYPE_SYSCALL scx/selftests: Verify we can call create_dsq from prog_run sched_ext: Remove unnecessary cpu_relax() scx: Fix exit selftest to use custom DSQ scx: Fix raciness in scx_ops_bypass() scx: Fix maximal BPF selftest prog Dawei Li (1): genirq: Deduplicate interrupt descriptor initialization Devaansh Kumar (1): sched_ext: selftests: Fix grammar in tests description Devaansh-Kumar (1): sched_ext: Documentation: Update instructions for running example schedulers Eduard Zingerman (2): libbpf: Make btf_parse_elf process .BTF.base transparently selftests/bpf: Check if distilled base inherits source endianness Geliang Tang (1): bpf, btf: Check btf for register_bpf_struct_ops Henry Huang (2): sched_ext: initialize kit->cursor.flags sched_ext: keep running prev when prev->scx.slice != 0 Herbert Xu (1): rhashtable: Fix rhashtable_try_insert test Honglei Wang (3): sched_ext: use correct function name in pick_task_scx() warning message sched_ext: Add __weak to fix the build errors sched_ext: switch class when preempted by higher priority scheduler Hongyan Xia (1): sched/ext: Add BPF function to fetch rq Hou Tao (7): bpf: Free dynamically allocated bits in bpf_iter_bits_destroy() bpf: Add bpf_mem_alloc_check_size() helper bpf: Check the validity of nr_words in bpf_iter_bits_new() bpf: Use __u64 to save the bits in bits iterator selftests/bpf: Add three test cases for bits_iter selftests/bpf: Use -4095 as the bad address for bits iterator selftests/bpf: Export map_update_retriable() Ihor Solodrai (2): selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ selftests/sched_ext: fix build after renames in sched_ext API Ilpo Järvinen (1): <linux/cleanup.h>: Allow the passing of both iomem and non-iomem pointers to no_free_ptr() Ingo Molnar (3): sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair() sched/fair: Rename check_preempt_curr() to wakeup_preempt() Jake Hillion (2): sched_ext: create_dsq: Return -EEXIST on duplicate request sched_ext: Drop kfuncs marked for removal in 6.15 Jiapeng Chong (1): sched_ext: Fixes incorrect type in bpf_scx_init() Jiayuan Chen (1): selftests/bpf: Fixes for test_maps test Kui-Feng Lee (29): bpf: refactory struct_ops type initialization to a function. bpf: get type information with BTF_ID_LIST bpf, net: introduce bpf_struct_ops_desc. bpf: add struct_ops_tab to btf. bpf: make struct_ops_map support btfs other than btf_vmlinux. bpf: pass btf object id in bpf_map_info. bpf: lookup struct_ops types from a given module BTF. bpf: pass attached BTF to the bpf_struct_ops subsystem bpf: hold module refcnt in bpf_struct_ops map creation and prog verification. bpf: validate value_type bpf, net: switch to dynamic registration libbpf: Find correct module BTFs for struct_ops maps and progs. bpf: export btf_ctx_access to modules. selftests/bpf: test case for register_bpf_struct_ops(). bpf: Fix error checks against bpf_get_btf_vmlinux(). bpf: Remove an unnecessary check. selftests/bpf: Suppress warning message of an unused variable. bpf: add btf pointer to struct bpf_ctx_arg_aux. bpf: Move __kfunc_param_match_suffix() to btf.c. bpf: Create argument information for nullable arguments. selftests/bpf: Test PTR_MAYBE_NULL arguments of struct_ops operators. libbpf: Set btf_value_type_id of struct bpf_map for struct_ops. libbpf: Convert st_ops->data to shadow type. bpftool: Generated shadow variables for struct_ops maps. bpftool: Add an example for struct_ops map and shadow type. selftests/bpf: Test if shadow types work correctly. bpf, net: validate struct_ops when updating value. bpf: struct_ops supports more than one page for trampolines. selftests/bpf: Test struct_ops maps with a large number of struct_ops program. Kumar Kartikeya Dwivedi (4): bpf: Allow calling static subprogs while holding a bpf_spin_lock selftests/bpf: Add test for static subprog call in lock cs bpf: Transfer RCU lock state between subprog calls selftests/bpf: Add tests for RCU lock transfer between subprogs Liang Jie (1): sched_ext: Use sizeof_field for key_len in dsq_hash_params Luo Gengkun (7): bpf: Fix kabi-breakage for bpf_func_info_aux bpf: Fix kabi-breakage for bpf_tramp_image bpf: Fix kabi for bpf_attr bpf_verifier: Fix kabi for bpf_verifier_env bpf: Fix kabi for bpf_ctx_arg_aux bpf: Fix kabi for bpf_prog_aux and bpf_prog selftests/bpf: modify test_loader that didn't support running bpf_prog_type_syscall programs Manu Bretelle (1): sched_ext: define missing cfi stubs for sched_ext Martin KaFai Lau (5): libbpf: Ensure undefined bpf_attr field stays 0 bpf: Remove unnecessary err < 0 check in bpf_struct_ops_map_update_elem bpf: Fix a crash when btf_parse_base() returns an error pointer bpf: Reject struct_ops registration that uses module ptr and the module btf_id is missing bpf: Use kallsyms to find the function name of a struct_ops's stub function Masahiro Yamada (1): kbuild: avoid too many execution of scripts/pahole-flags.sh Matthieu Baerts (1): bpf: fix compilation error without CGROUPS Peter Zijlstra (26): cfi: Flip headers x86/cfi,bpf: Fix BPF JIT call x86/cfi,bpf: Fix bpf_callback_t CFI x86/cfi,bpf: Fix bpf_struct_ops CFI cfi: Add CFI_NOSEAL() bpf: Fix dtor CFI cleanup: Make no_free_ptr() __must_check sched: Simplify set_user_nice() sched: Simplify syscalls sched: Simplify sched_{set,get}affinity() sched: Simplify yield_to() sched: Simplify sched_rr_get_interval() sched: Simplify sched_move_task() sched: Misc cleanups sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity sched: Allow sched_class::dequeue_task() to fail sched: Unify runtime accounting across classes sched: Use set_next_task(.first) where required sched/fair: Cleanup pick_task_fair() vs throttle sched/fair: Cleanup pick_task_fair()'s curr sched/fair: Unify pick_{,next_}_task_fair() sched: Fixup set_next_task() implementations sched: Split up put_prev_task_balance() sched: Rework pick_next_task() sched: Combine the last put_prev_task() and the first set_next_task() sched: Add put_prev_task(.next) Pu Lehui (8): riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops bpf: Fix kabi breakage in struct module riscv, bpf: Fix out-of-bounds issue when preparing trampoline image selftests/bpf: Fix btf leak on new btf alloc failure in btf_distill test libbpf: Fix return zero when elf_begin failed libbpf: Fix incorrect traversal end type ID when marking BTF_IS_EMBEDDED selftests/bpf: Add distilled BTF test about marking BTF_IS_EMBEDDED selftests/bpf: Add file_read_pattern to gitignore Randy Dunlap (1): sched_ext: fix kernel-doc warnings Shizhao Chen (1): sched_ext: Add option -l in selftest runner to list all available tests Song Liu (8): bpf: Charge modmem for struct_ops trampoline bpf: Let bpf_prog_pack_free handle any pointer bpf: Adjust argument names of arch_prepare_bpf_trampoline() bpf: Add helpers for trampoline image management bpf, x86: Adjust arch_prepare_bpf_trampoline return value bpf: Add arch_bpf_trampoline_size() bpf: Use arch_bpf_trampoline_size x86, bpf: Use bpf_prog_pack for bpf trampoline T.J. Mercier (1): bpf, docs: Fix broken link to renamed bpf_iter_task_vmas.c Tejun Heo (152): sched: Restructure sched_class order sanity checks in sched_init() sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork() sched: Add sched_class->reweight_task() sched: Add sched_class->switching_to() and expose check_class_changing/changed() sched: Factor out cgroup weight conversion functions sched: Factor out update_other_load_avgs() from __update_blocked_others() sched: Add normal_policy() sched_ext: Add boilerplate for extensible scheduler class sched_ext: Implement BPF extensible scheduler class sched_ext: Add scx_simple and scx_example_qmap example schedulers sched_ext: Add sysrq-S which disables the BPF scheduler sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT sched_ext: Print debug dump after an error exit tools/sched_ext: Add scx_show_state.py sched_ext: Implement scx_bpf_kick_cpu() and task preemption support sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU sched_ext: Make watchdog handle ops.dispatch() looping stall sched_ext: Add task state tracking operations sched_ext: Implement tickless support sched_ext: Track tasks that are subjects of the in-flight SCX operation sched_ext: Implement sched_ext_ops.cpu_online/offline() sched_ext: Bypass BPF scheduler while PM events are in progress sched_ext: Implement core-sched support sched_ext: Add vtime-ordered priority queue to dispatch_q's sched_ext: Documentation: scheduler: Document extensible scheduler class sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class() cpufreq_schedutil: Refactor sugov_cpu_is_busy() sched_ext: Add cpuperf support sched_ext: Drop tools_clean target from the top-level Makefile sched_ext: Swap argument positions in kcalloc() call to avoid compiler warning sched, sched_ext: Simplify dl_prio() case handling in sched_fork() sched_ext: Account for idle policy when setting p->scx.weight in scx_ops_enable_task() sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect sched_ext: Minor cleanups in kernel/sched/ext.h sched, sched_ext: Open code for_balance_class_range() sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h sched_ext: Take out ->priq and ->flags from scx_dsq_node sched_ext: Implement DSQ iterator sched_ext/scx_qmap: Add an example usage of DSQ iterator sched_ext: Reimplement scx_bpf_reenqueue_local() sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated sched: Move struct balance_callback definition upward sched_ext: Unpin and repin rq lock from balance_scx() sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT] sched_ext: Allow p->scx.disallow only while loading sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick() sched_ext: Add scx_enabled() test to @start_class promotion in put_prev_task_balance() sched_ext: Use update_curr_common() in update_curr_scx() sched_ext: Simplify UP support by enabling sched_class->balance() in UP sched_ext: Improve comment on idle_sched_class exception in scx_task_iter_next_locked() sched_ext: Make task_can_run_on_remote_rq() use common task_allowed_on_cpu() sched_ext: Fix unsafe list iteration in process_ddsp_deferred_locals() sched_ext: Make scx_rq_online() also test cpu_active() in addition to SCX_RQ_ONLINE sched_ext: Improve logging around enable/disable sched_ext: Don't use double locking to migrate tasks across CPUs scx_central: Fix smatch checker warning sched_ext: Add missing cfi stub for ops.tick sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq() sched_ext: Use sched_clock_cpu() instead of rq_clock_task() in touch_core_sched() sched_ext: Don't call put_prev_task_scx() before picking the next task sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP sched_ext: Unify regular and core-sched pick task paths sched_ext: Relocate functions in kernel/sched/ext.c sched_ext: Remove switch_class_scx() sched_ext: Remove sched_class->switch_class() sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable sched: Expose css_tg() sched: Make cpu_shares_read_u64() use tg_weight() sched: Introduce CONFIG_GROUP_SCHED_WEIGHT sched_ext: Add cgroup support sched_ext: Add a cgroup scheduler which uses flattened hierarchy sched_ext: Temporarily work around pick_task_scx() being called without balance_scx() sched_ext: Add missing static to scx_has_op[] sched_ext: Add missing static to scx_dump_data sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate sched_ext: Refactor consume_remote_task() sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling sched_ext: Restructure dispatch_to_local_dsq() sched_ext: Reorder args for consume_local/remote_task() sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() sched_ext: Move consume_local_task() upward sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() sched_ext: Compact struct bpf_iter_scx_dsq_kern sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() scx_qmap: Implement highpri boosting sched_ext: Synchronize bypass state changes with rq lock sched_ext: Don't trigger ops.quiescent/runnable() on migrations sched_ext: Fix build when !CONFIG_STACKTRACE sched_ext: Build fix for !CONFIG_SMP sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycle tools/sched_ext: Receive misc updates from SCX repo scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() sched_ext: Relocate find_user_dsq() sched_ext: Split the global DSQ per NUMA node sched_ext: Use shorter slice while bypassing sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable() sched_ext: Remove SCX_OPS_PREPPING sched_ext: Initialize in bypass mode sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable() sched_ext: Enable scx_ops_init_task() separately sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online() sched_ext: Decouple locks in scx_ops_disable_workfn() sched_ext: Decouple locks in scx_ops_enable() sched_ext: Improve error reporting during loading sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init() sched/core: Make select_task_rq() take the pointer to wake_flags instead of value sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED Revert "sched_ext: Use shorter slice while bypassing" sched_ext: Start schedulers with consistent p->scx.slice values sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl() sched_ext: bypass mode shouldn't depend on ops.select_cpu() sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers sched_ext: Don't hold scx_tasks_lock for too long sched_ext: Make cast_mask() inline sched_ext: Fix enq_last_no_enq_fails selftest sched_ext: Add a missing newline at the end of an error message sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx() sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST sched_ext: Factor out move_task_between_dsqs() from scx_dispatch_from_dsq() sched_ext: Rename CFI stubs to names that are recognized by BPF sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags sched_ext: Avoid live-locking bypass mode switching sched_ext: Enable the ops breather and eject BPF scheduler on softlockup sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]() sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local() sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*() sched_ext: Fix invalid irq restore in scx_ops_bypass() sched_ext: Fix dsq_local_on selftest tools/sched_ext: Receive updates from SCX repo sched_ext: selftests/dsp_local_on: Fix sporadic failures sched_ext: Fix incorrect autogroup migration detection sched_ext: Implement auto local dispatching of migration disabled tasks sched_ext: Fix migration disabled handling in targeted dispatches sched_ext: Fix incorrect assumption about migration disabled tasks in task_can_run_on_remote_rq() sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance() sched_ext: Implement SCX_OPS_ALLOW_QUEUED_WAKEUP sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator sched_ext: Make scx_group_set_weight() always update tg->scx.weight sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group() sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCU sched_ext: Don't kick CPUs running higher classes sched_ext: Use SCX_TASK_READY test instead of tryget_task_struct() during class switch tools/sched_ext: Sync with scx repo Thomas Gleixner (1): sched/ext: Remove sched_fork() hack Thorsten Blum (1): sched_ext: Use str_enabled_disabled() helper in update_selcpu_topology() Tianchen Ding (1): sched_ext: Use btf_ids to resolve task_struct Tony Ambardar (1): libbpf: Ensure new BTF objects inherit input endianness Vincent Guittot (2): sched/cpufreq: Rework schedutil governor performance estimation sched/fair: Fix sched_can_stop_tick() for fair tasks Vishal Chourasia (2): sched_ext: Add __weak markers to BPF helper function decalarations sched_ext: Fix function pointer type mismatches in BPF selftests Wenyu Huang (1): sched/doc: Update documentation after renames and synchronize Chinese version Yafang Shao (2): bpf: Add bits iterator selftests/bpf: Add selftest for bits iter Yipeng Zou (1): sched_ext: Allow dequeue_task_scx to fail Yiwei Lin (1): sched/fair: Remove unused 'curr' argument from pick_next_entity() Yu Liao (2): sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT sched: Add dummy version of sched_group_set_idle() Yury Norov (1): cpumask: introduce assign_cpu() macro Zhang Qiao (3): sched_ext: Remove redundant p->nr_cpus_allowed checker sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED sched/ext: Use tg_cgroup() to elieminate duplicate code Zhao Mengmeng (1): sched_ext: Replace scx_next_task_picked() with switch_class() in comment Zicheng Qu (17): sched: Fix kabi for reweight_task in struct sched_class sched/syscalls: Fix kabi for EXPORT_SYMBOL moved from core.c to syscalls.c sched: Fix kabi for switching_to in struct sched_class sched/fair: Fix kabi for check_preempt_curr and wakeup_preempt in struct sched_class sched: Fix kabi for dequeue_task in struct sched_class sched_ext: Fix kabi for scx in struct task_struct sched_ext: Fix kabi for switch_class in struct sched_class sched: Fix kabi for exec_max in struct sched_statistics sched_ext: Fix kabi for balance in struct sched_class sched_ext: Fix kabi for header in kernel/sched/sched.h sched: Fix kabi pick_task in struct sched_class sched: Fix kabi for put_prev_task in struct sched_class sched_ext: Fix kabi for scx_flags and scx_weight in struct task_group sched: Fix kabi for int idle in struct task_group sched: Add __setscheduler_class() for sched_ext genirq: Fix kabi for kstat_irqs in struct irq_desc sched_ext: Enable and disable sched_ext configs Zqiang (1): sched_ext: Fix unsafe locking in the scx_dump_state() guanjing (1): sched_ext: fix application of sizeof to pointer Documentation/bpf/bpf_iterators.rst | 2 +- Documentation/bpf/kfuncs.rst | 14 +- Documentation/scheduler/index.rst | 1 + Documentation/scheduler/sched-design-CFS.rst | 8 +- Documentation/scheduler/sched-ext.rst | 325 + .../zh_CN/scheduler/sched-design-CFS.rst | 8 +- MAINTAINERS | 16 +- Makefile | 4 +- arch/arm64/configs/openeuler_defconfig | 3 + arch/arm64/kernel/bpf-rvi.c | 4 +- arch/arm64/net/bpf_jit_comp.c | 55 +- arch/mips/dec/setup.c | 2 +- arch/parisc/kernel/smp.c | 2 +- arch/powerpc/kvm/book3s_hv_rm_xics.c | 2 +- arch/riscv/include/asm/cfi.h | 3 +- arch/riscv/kernel/cfi.c | 2 +- arch/riscv/net/bpf_jit_comp64.c | 48 +- arch/s390/net/bpf_jit_comp.c | 59 +- arch/x86/configs/openeuler_defconfig | 2 + arch/x86/include/asm/cfi.h | 126 +- arch/x86/kernel/alternative.c | 87 +- arch/x86/kernel/cfi.c | 4 +- arch/x86/net/bpf_jit_comp.c | 261 +- block/blk-cgroup.c | 4 +- drivers/hid/bpf/hid_bpf_dispatch.c | 12 +- drivers/tty/sysrq.c | 1 + fs/proc/stat.c | 4 +- include/asm-generic/Kbuild | 1 + include/asm-generic/cfi.h | 5 + include/asm-generic/vmlinux.lds.h | 1 + include/linux/bitops.h | 2 + include/linux/bpf.h | 130 +- include/linux/bpf_mem_alloc.h | 3 + include/linux/bpf_verifier.h | 21 +- include/linux/btf.h | 105 + include/linux/btf_ids.h | 21 +- include/linux/cfi.h | 12 + include/linux/cgroup.h | 14 +- include/linux/cleanup.h | 42 +- include/linux/cpumask.h | 41 +- include/linux/energy_model.h | 1 - include/linux/file.h | 20 + include/linux/filter.h | 2 +- include/linux/irqdesc.h | 17 +- include/linux/kernel_stat.h | 8 + include/linux/module.h | 8 +- include/linux/sched.h | 8 +- include/linux/sched/ext.h | 216 + include/linux/sched/task.h | 8 +- include/trace/events/sched_ext.h | 32 + include/uapi/linux/bpf.h | 16 +- include/uapi/linux/sched.h | 1 + init/Kconfig | 10 + init/init_task.c | 12 + kernel/Kconfig.preempt | 27 +- kernel/bpf-rvi/common_kfuncs.c | 4 +- kernel/bpf/Makefile | 8 +- kernel/bpf/bpf_iter.c | 12 +- kernel/bpf/bpf_struct_ops.c | 745 +- kernel/bpf/bpf_struct_ops_types.h | 12 - kernel/bpf/btf.c | 431 +- kernel/bpf/cgroup_iter.c | 65 +- kernel/bpf/core.c | 76 +- kernel/bpf/cpumask.c | 18 +- kernel/bpf/dispatcher.c | 7 +- kernel/bpf/helpers.c | 202 +- kernel/bpf/map_iter.c | 10 +- kernel/bpf/memalloc.c | 14 +- kernel/bpf/syscall.c | 12 +- kernel/bpf/task_iter.c | 242 +- kernel/bpf/trampoline.c | 99 +- kernel/bpf/verifier.c | 317 +- kernel/cgroup/cgroup.c | 18 +- kernel/cgroup/cpuset.c | 4 +- kernel/cgroup/rstat.c | 13 +- kernel/events/core.c | 2 +- kernel/fork.c | 17 +- kernel/irq/Kconfig | 4 + kernel/irq/internals.h | 2 +- kernel/irq/irqdesc.c | 144 +- kernel/irq/proc.c | 5 +- kernel/module/main.c | 5 +- kernel/sched/autogroup.c | 4 +- kernel/sched/bpf_sched.c | 8 +- kernel/sched/build_policy.c | 13 + kernel/sched/core.c | 2492 +----- kernel/sched/cpuacct.c | 4 +- kernel/sched/cpufreq_schedutil.c | 83 +- kernel/sched/deadline.c | 175 +- kernel/sched/debug.c | 3 + kernel/sched/ext.c | 7155 +++++++++++++++++ kernel/sched/ext.h | 119 + kernel/sched/ext_idle.c | 755 ++ kernel/sched/ext_idle.h | 39 + kernel/sched/fair.c | 306 +- kernel/sched/idle.c | 31 +- kernel/sched/rt.c | 40 +- kernel/sched/sched.h | 473 +- kernel/sched/stop_task.c | 35 +- kernel/sched/syscalls.c | 1713 ++++ kernel/trace/bpf_trace.c | 12 +- kernel/trace/trace_probe.c | 2 - kernel/watchdog.c | 223 +- lib/Kconfig.debug | 14 + lib/dump_stack.c | 1 + lib/rhashtable.c | 12 +- net/bpf/bpf_dummy_struct_ops.c | 72 +- net/bpf/test_run.c | 30 +- net/core/filter.c | 33 +- net/core/xdp.c | 10 +- net/ipv4/bpf_tcp_ca.c | 93 +- net/ipv4/fou_bpf.c | 10 +- net/ipv4/tcp_bbr.c | 4 +- net/ipv4/tcp_cong.c | 6 +- net/ipv4/tcp_cubic.c | 4 +- net/ipv4/tcp_dctcp.c | 4 +- net/netfilter/nf_conntrack_bpf.c | 10 +- net/netfilter/nf_nat_bpf.c | 10 +- net/socket.c | 8 +- net/xfrm/xfrm_interface_bpf.c | 10 +- scripts/Makefile.btf | 33 + scripts/Makefile.modfinal | 2 +- scripts/gdb/linux/interrupts.py | 6 +- scripts/pahole-flags.sh | 30 - tools/Makefile | 10 +- .../bpf/bpftool/Documentation/bpftool-gen.rst | 58 +- tools/bpf/bpftool/gen.c | 253 +- tools/bpf/resolve_btfids/main.c | 8 + tools/include/linux/bitops.h | 2 + tools/include/uapi/linux/bpf.h | 14 +- tools/include/uapi/linux/sched.h | 1 + tools/lib/bpf/Build | 2 +- tools/lib/bpf/bpf.c | 4 +- tools/lib/bpf/bpf.h | 4 +- tools/lib/bpf/btf.c | 704 +- tools/lib/bpf/btf.h | 36 + tools/lib/bpf/btf_iter.c | 177 + tools/lib/bpf/btf_relocate.c | 519 ++ tools/lib/bpf/libbpf.c | 97 +- tools/lib/bpf/libbpf.map | 4 +- tools/lib/bpf/libbpf_internal.h | 29 +- tools/lib/bpf/libbpf_probes.c | 1 + tools/lib/bpf/linker.c | 58 +- tools/perf/util/probe-finder.c | 4 +- tools/sched_ext/.gitignore | 2 + tools/sched_ext/Makefile | 246 + tools/sched_ext/README.md | 270 + .../sched_ext/include/bpf-compat/gnu/stubs.h | 11 + tools/sched_ext/include/scx/common.bpf.h | 647 ++ tools/sched_ext/include/scx/common.h | 81 + tools/sched_ext/include/scx/compat.bpf.h | 143 + tools/sched_ext/include/scx/compat.h | 187 + .../sched_ext/include/scx/enums.autogen.bpf.h | 105 + tools/sched_ext/include/scx/enums.autogen.h | 41 + tools/sched_ext/include/scx/enums.bpf.h | 12 + tools/sched_ext/include/scx/enums.h | 27 + tools/sched_ext/include/scx/user_exit_info.h | 118 + tools/sched_ext/scx_central.bpf.c | 356 + tools/sched_ext/scx_central.c | 145 + tools/sched_ext/scx_flatcg.bpf.c | 954 +++ tools/sched_ext/scx_flatcg.c | 234 + tools/sched_ext/scx_flatcg.h | 51 + tools/sched_ext/scx_qmap.bpf.c | 827 ++ tools/sched_ext/scx_qmap.c | 155 + tools/sched_ext/scx_show_state.py | 42 + tools/sched_ext/scx_simple.bpf.c | 151 + tools/sched_ext/scx_simple.c | 107 + tools/testing/selftests/Makefile | 9 +- tools/testing/selftests/bpf/.gitignore | 1 + .../testing/selftests/bpf/bpf_experimental.h | 96 + .../selftests/bpf/bpf_testmod/bpf_testmod.c | 160 +- .../selftests/bpf/bpf_testmod/bpf_testmod.h | 61 + .../bpf/bpf_testmod/bpf_testmod_kfunc.h | 9 + .../selftests/bpf/prog_tests/bpf_iter.c | 44 +- .../selftests/bpf/prog_tests/btf_distill.c | 692 ++ .../selftests/bpf/prog_tests/cgroup_iter.c | 33 + .../bpf/prog_tests/global_func_dead_code.c | 60 + .../testing/selftests/bpf/prog_tests/iters.c | 209 + .../selftests/bpf/prog_tests/kfunc_call.c | 1 + .../selftests/bpf/prog_tests/rcu_read_lock.c | 6 + .../selftests/bpf/prog_tests/spin_lock.c | 2 + .../prog_tests/test_struct_ops_maybe_null.c | 46 + .../bpf/prog_tests/test_struct_ops_module.c | 86 + .../prog_tests/test_struct_ops_multi_pages.c | 30 + .../testing/selftests/bpf/prog_tests/timer.c | 4 + .../selftests/bpf/prog_tests/verifier.c | 4 + ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 .../{bpf_iter_task.c => bpf_iter_tasks.c} | 0 .../bpf/progs/freplace_dead_global_func.c | 11 + tools/testing/selftests/bpf/progs/iters_css.c | 72 + .../selftests/bpf/progs/iters_css_task.c | 102 + .../testing/selftests/bpf/progs/iters_task.c | 41 + .../selftests/bpf/progs/iters_task_failure.c | 105 + .../selftests/bpf/progs/iters_task_vma.c | 43 + .../selftests/bpf/progs/iters_testmod_seq.c | 50 + .../selftests/bpf/progs/kfunc_call_test.c | 37 + .../selftests/bpf/progs/rcu_read_lock.c | 120 + .../bpf/progs/struct_ops_maybe_null.c | 29 + .../bpf/progs/struct_ops_maybe_null_fail.c | 24 + .../selftests/bpf/progs/struct_ops_module.c | 37 + .../bpf/progs/struct_ops_multi_pages.c | 102 + .../selftests/bpf/progs/test_global_func12.c | 4 +- .../selftests/bpf/progs/test_spin_lock.c | 65 + .../selftests/bpf/progs/test_spin_lock_fail.c | 44 + tools/testing/selftests/bpf/progs/timer.c | 63 +- .../selftests/bpf/progs/verifier_bits_iter.c | 232 + .../bpf/progs/verifier_global_subprogs.c | 101 + .../selftests/bpf/progs/verifier_spin_lock.c | 2 +- .../bpf/progs/verifier_subprog_precision.c | 4 +- tools/testing/selftests/bpf/test_loader.c | 10 +- tools/testing/selftests/bpf/test_maps.c | 18 +- tools/testing/selftests/bpf/test_maps.h | 5 + tools/testing/selftests/sched_ext/.gitignore | 6 + tools/testing/selftests/sched_ext/Makefile | 211 + tools/testing/selftests/sched_ext/config | 9 + .../selftests/sched_ext/create_dsq.bpf.c | 58 + .../testing/selftests/sched_ext/create_dsq.c | 57 + .../sched_ext/ddsp_bogus_dsq_fail.bpf.c | 42 + .../selftests/sched_ext/ddsp_bogus_dsq_fail.c | 60 + .../sched_ext/ddsp_vtimelocal_fail.bpf.c | 39 + .../sched_ext/ddsp_vtimelocal_fail.c | 59 + .../selftests/sched_ext/dsp_local_on.bpf.c | 68 + .../selftests/sched_ext/dsp_local_on.c | 60 + .../sched_ext/enq_last_no_enq_fails.bpf.c | 29 + .../sched_ext/enq_last_no_enq_fails.c | 64 + .../sched_ext/enq_select_cpu_fails.bpf.c | 43 + .../sched_ext/enq_select_cpu_fails.c | 61 + tools/testing/selftests/sched_ext/exit.bpf.c | 86 + tools/testing/selftests/sched_ext/exit.c | 64 + tools/testing/selftests/sched_ext/exit_test.h | 20 + .../testing/selftests/sched_ext/hotplug.bpf.c | 61 + tools/testing/selftests/sched_ext/hotplug.c | 170 + .../selftests/sched_ext/hotplug_test.h | 15 + .../sched_ext/init_enable_count.bpf.c | 53 + .../selftests/sched_ext/init_enable_count.c | 157 + .../testing/selftests/sched_ext/maximal.bpf.c | 166 + tools/testing/selftests/sched_ext/maximal.c | 54 + .../selftests/sched_ext/maybe_null.bpf.c | 36 + .../testing/selftests/sched_ext/maybe_null.c | 49 + .../sched_ext/maybe_null_fail_dsp.bpf.c | 25 + .../sched_ext/maybe_null_fail_yld.bpf.c | 28 + .../testing/selftests/sched_ext/minimal.bpf.c | 21 + tools/testing/selftests/sched_ext/minimal.c | 58 + .../selftests/sched_ext/prog_run.bpf.c | 33 + tools/testing/selftests/sched_ext/prog_run.c | 78 + .../testing/selftests/sched_ext/reload_loop.c | 74 + tools/testing/selftests/sched_ext/runner.c | 212 + tools/testing/selftests/sched_ext/scx_test.h | 131 + .../selftests/sched_ext/select_cpu_dfl.bpf.c | 40 + .../selftests/sched_ext/select_cpu_dfl.c | 75 + .../sched_ext/select_cpu_dfl_nodispatch.bpf.c | 89 + .../sched_ext/select_cpu_dfl_nodispatch.c | 75 + .../sched_ext/select_cpu_dispatch.bpf.c | 41 + .../selftests/sched_ext/select_cpu_dispatch.c | 73 + .../select_cpu_dispatch_bad_dsq.bpf.c | 37 + .../sched_ext/select_cpu_dispatch_bad_dsq.c | 59 + .../select_cpu_dispatch_dbl_dsp.bpf.c | 38 + .../sched_ext/select_cpu_dispatch_dbl_dsp.c | 59 + .../sched_ext/select_cpu_vtime.bpf.c | 92 + .../selftests/sched_ext/select_cpu_vtime.c | 62 + .../selftests/sched_ext/test_example.c | 49 + tools/testing/selftests/sched_ext/util.c | 71 + tools/testing/selftests/sched_ext/util.h | 13 + 263 files changed, 27732 insertions(+), 3787 deletions(-) create mode 100644 Documentation/scheduler/sched-ext.rst create mode 100644 include/asm-generic/cfi.h create mode 100644 include/linux/sched/ext.h create mode 100644 include/trace/events/sched_ext.h delete mode 100644 kernel/bpf/bpf_struct_ops_types.h create mode 100644 kernel/sched/ext.c create mode 100644 kernel/sched/ext.h create mode 100644 kernel/sched/ext_idle.c create mode 100644 kernel/sched/ext_idle.h create mode 100644 kernel/sched/syscalls.c create mode 100644 scripts/Makefile.btf delete mode 100755 scripts/pahole-flags.sh create mode 100644 tools/lib/bpf/btf_iter.c create mode 100644 tools/lib/bpf/btf_relocate.c create mode 100644 tools/sched_ext/.gitignore create mode 100644 tools/sched_ext/Makefile create mode 100644 tools/sched_ext/README.md create mode 100644 tools/sched_ext/include/bpf-compat/gnu/stubs.h create mode 100644 tools/sched_ext/include/scx/common.bpf.h create mode 100644 tools/sched_ext/include/scx/common.h create mode 100644 tools/sched_ext/include/scx/compat.bpf.h create mode 100644 tools/sched_ext/include/scx/compat.h create mode 100644 tools/sched_ext/include/scx/enums.autogen.bpf.h create mode 100644 tools/sched_ext/include/scx/enums.autogen.h create mode 100644 tools/sched_ext/include/scx/enums.bpf.h create mode 100644 tools/sched_ext/include/scx/enums.h create mode 100644 tools/sched_ext/include/scx/user_exit_info.h create mode 100644 tools/sched_ext/scx_central.bpf.c create mode 100644 tools/sched_ext/scx_central.c create mode 100644 tools/sched_ext/scx_flatcg.bpf.c create mode 100644 tools/sched_ext/scx_flatcg.c create mode 100644 tools/sched_ext/scx_flatcg.h create mode 100644 tools/sched_ext/scx_qmap.bpf.c create mode 100644 tools/sched_ext/scx_qmap.c create mode 100644 tools/sched_ext/scx_show_state.py create mode 100644 tools/sched_ext/scx_simple.bpf.c create mode 100644 tools/sched_ext/scx_simple.c create mode 100644 tools/testing/selftests/bpf/prog_tests/btf_distill.c create mode 100644 tools/testing/selftests/bpf/prog_tests/global_func_dead_code.c create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_maybe_null.c create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_multi_pages.c rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) rename tools/testing/selftests/bpf/progs/{bpf_iter_task.c => bpf_iter_tasks.c} (100%) create mode 100644 tools/testing/selftests/bpf/progs/freplace_dead_global_func.c create mode 100644 tools/testing/selftests/bpf/progs/iters_css.c create mode 100644 tools/testing/selftests/bpf/progs/iters_css_task.c create mode 100644 tools/testing/selftests/bpf/progs/iters_task.c create mode 100644 tools/testing/selftests/bpf/progs/iters_task_failure.c create mode 100644 tools/testing/selftests/bpf/progs/iters_task_vma.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_maybe_null.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_maybe_null_fail.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_module.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_multi_pages.c create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c create mode 100644 tools/testing/selftests/bpf/progs/verifier_global_subprogs.c create mode 100644 tools/testing/selftests/sched_ext/.gitignore create mode 100644 tools/testing/selftests/sched_ext/Makefile create mode 100644 tools/testing/selftests/sched_ext/config create mode 100644 tools/testing/selftests/sched_ext/create_dsq.bpf.c create mode 100644 tools/testing/selftests/sched_ext/create_dsq.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c create mode 100644 tools/testing/selftests/sched_ext/dsp_local_on.bpf.c create mode 100644 tools/testing/selftests/sched_ext/dsp_local_on.c create mode 100644 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c create mode 100644 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c create mode 100644 tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c create mode 100644 tools/testing/selftests/sched_ext/enq_select_cpu_fails.c create mode 100644 tools/testing/selftests/sched_ext/exit.bpf.c create mode 100644 tools/testing/selftests/sched_ext/exit.c create mode 100644 tools/testing/selftests/sched_ext/exit_test.h create mode 100644 tools/testing/selftests/sched_ext/hotplug.bpf.c create mode 100644 tools/testing/selftests/sched_ext/hotplug.c create mode 100644 tools/testing/selftests/sched_ext/hotplug_test.h create mode 100644 tools/testing/selftests/sched_ext/init_enable_count.bpf.c create mode 100644 tools/testing/selftests/sched_ext/init_enable_count.c create mode 100644 tools/testing/selftests/sched_ext/maximal.bpf.c create mode 100644 tools/testing/selftests/sched_ext/maximal.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null.bpf.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c create mode 100644 tools/testing/selftests/sched_ext/minimal.bpf.c create mode 100644 tools/testing/selftests/sched_ext/minimal.c create mode 100644 tools/testing/selftests/sched_ext/prog_run.bpf.c create mode 100644 tools/testing/selftests/sched_ext/prog_run.c create mode 100644 tools/testing/selftests/sched_ext/reload_loop.c create mode 100644 tools/testing/selftests/sched_ext/runner.c create mode 100644 tools/testing/selftests/sched_ext/scx_test.h create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_vtime.c create mode 100644 tools/testing/selftests/sched_ext/test_example.c create mode 100644 tools/testing/selftests/sched_ext/util.c create mode 100644 tools/testing/selftests/sched_ext/util.h -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.7-rc1 commit 5c04433daf9ed8b28d4900112be1fd19e1786b25 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Current code charges modmem for regular trampoline, but not for struct_ops trampoline. Add bpf_jit_[charge|uncharge]_modmem() to struct_ops so the trampoline is charged in both cases. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914222542.2986059-1-song@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index fdc3e8705a3c..db6176fb64dc 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -615,7 +615,10 @@ static void __bpf_struct_ops_map_free(struct bpf_map *map) if (st_map->links) bpf_struct_ops_map_put_progs(st_map); bpf_map_area_free(st_map->links); - bpf_jit_free_exec(st_map->image); + if (st_map->image) { + bpf_jit_free_exec(st_map->image); + bpf_jit_uncharge_modmem(PAGE_SIZE); + } bpf_map_area_free(st_map->uvalue); bpf_map_area_free(st_map); } @@ -657,6 +660,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) struct bpf_struct_ops_map *st_map; const struct btf_type *t, *vt; struct bpf_map *map; + int ret; st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); if (!st_ops) @@ -681,12 +685,27 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) st_map->st_ops = st_ops; map = &st_map->map; + ret = bpf_jit_charge_modmem(PAGE_SIZE); + if (ret) { + __bpf_struct_ops_map_free(map); + return ERR_PTR(ret); + } + + st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); + if (!st_map->image) { + /* __bpf_struct_ops_map_free() uses st_map->image as flag + * for "charged or not". In this case, we need to unchange + * here. + */ + bpf_jit_uncharge_modmem(PAGE_SIZE); + __bpf_struct_ops_map_free(map); + return ERR_PTR(-ENOMEM); + } st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); st_map->links = bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_links *), NUMA_NO_NODE); - st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); - if (!st_map->uvalue || !st_map->links || !st_map->image) { + if (!st_map->uvalue || !st_map->links) { __bpf_struct_ops_map_free(map); return ERR_PTR(-ENOMEM); } @@ -907,4 +926,3 @@ int bpf_struct_ops_link_create(union bpf_attr *attr) kfree(link); return err; } - -- 2.34.1
From: David Vernet <void@manifault.com> mainline inclusion from mainline-v6.7-rc1 commit d6247ecb6c1e17d7a33317090627f5bfe563cbb2 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- BPF supports creating high resolution timers using bpf_timer_* helper functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which specifies that the timeout should be interpreted as absolute time. It would also be useful to be able to pin that timer to a core. For example, if you wanted to make a subset of cores run without timer interrupts, and only have the timer be invoked on a single core. This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag. When specified, the HRTIMER_MODE_PINNED flag is passed to hrtimer_start(). A subsequent patch will update selftests to validate. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/uapi/linux/bpf.h | 4 ++++ kernel/bpf/helpers.c | 5 ++++- tools/include/uapi/linux/bpf.h | 4 ++++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 5aca37b62f79..31e5b8061fe5 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5124,6 +5124,8 @@ union bpf_attr { * **BPF_F_TIMER_ABS** * Start the timer in absolute expire value instead of the * default relative one. + * **BPF_F_TIMER_CPU_PIN** + * Timer will be pinned to the CPU of the caller. * * Return * 0 on success. @@ -7348,9 +7350,11 @@ struct bpf_core_relo { * Flags to control bpf_timer_start() behaviour. * - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is * relative to current time. + * - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller. */ enum { BPF_F_TIMER_ABS = (1ULL << 0), + BPF_F_TIMER_CPU_PIN = (1ULL << 1), }; /* BPF numbers iterator state */ diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index f40085f4de31..c889a9dee6c6 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1363,7 +1363,7 @@ BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, fla if (in_nmi()) return -EOPNOTSUPP; - if (flags > BPF_F_TIMER_ABS) + if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN)) return -EINVAL; __bpf_spin_lock_irqsave(&timer->lock); t = timer->timer; @@ -1377,6 +1377,9 @@ BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, fla else mode = HRTIMER_MODE_REL_SOFT; + if (flags & BPF_F_TIMER_CPU_PIN) + mode |= HRTIMER_MODE_PINNED; + hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode); out: __bpf_spin_unlock_irqrestore(&timer->lock); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 337fc55cd500..7be085c034f7 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5124,6 +5124,8 @@ union bpf_attr { * **BPF_F_TIMER_ABS** * Start the timer in absolute expire value instead of the * default relative one. + * **BPF_F_TIMER_CPU_PIN** + * Timer will be pinned to the CPU of the caller. * * Return * 0 on success. @@ -7351,9 +7353,11 @@ struct bpf_core_relo { * Flags to control bpf_timer_start() behaviour. * - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is * relative to current time. + * - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller. */ enum { BPF_F_TIMER_ABS = (1ULL << 0), + BPF_F_TIMER_CPU_PIN = (1ULL << 1), }; /* BPF numbers iterator state */ -- 2.34.1
From: David Vernet <void@manifault.com> mainline inclusion from mainline-v6.7-rc1 commit 0d7ae06860753bb30b3731302b994da071120d00 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Now that we support pinning a BPF timer to the current core, we should test it with some selftests. This patch adds two new testcases to the timer suite, which verifies that a BPF timer both with and without BPF_F_TIMER_ABS, can be pinned to the calling core with BPF_F_TIMER_CPU_PIN. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20231004162339.200702-3-void@manifault.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../testing/selftests/bpf/prog_tests/timer.c | 4 ++ tools/testing/selftests/bpf/progs/timer.c | 63 ++++++++++++++++++- 2 files changed, 66 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c index ce2c61d62fc6..760ad96b4be0 100644 --- a/tools/testing/selftests/bpf/prog_tests/timer.c +++ b/tools/testing/selftests/bpf/prog_tests/timer.c @@ -15,6 +15,7 @@ static int timer(struct timer *timer_skel) ASSERT_EQ(timer_skel->data->callback_check, 52, "callback_check1"); ASSERT_EQ(timer_skel->data->callback2_check, 52, "callback2_check1"); + ASSERT_EQ(timer_skel->bss->pinned_callback_check, 0, "pinned_callback_check1"); prog_fd = bpf_program__fd(timer_skel->progs.test1); err = bpf_prog_test_run_opts(prog_fd, &topts); @@ -33,6 +34,9 @@ static int timer(struct timer *timer_skel) /* check that timer_cb3() was executed twice */ ASSERT_EQ(timer_skel->bss->abs_data, 12, "abs_data"); + /* check that timer_cb_pinned() was executed twice */ + ASSERT_EQ(timer_skel->bss->pinned_callback_check, 2, "pinned_callback_check"); + /* check that there were no errors in timer execution */ ASSERT_EQ(timer_skel->bss->err, 0, "err"); diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c index 9a16d95213e1..8b946c8188c6 100644 --- a/tools/testing/selftests/bpf/progs/timer.c +++ b/tools/testing/selftests/bpf/progs/timer.c @@ -51,7 +51,7 @@ struct { __uint(max_entries, 1); __type(key, int); __type(value, struct elem); -} abs_timer SEC(".maps"); +} abs_timer SEC(".maps"), soft_timer_pinned SEC(".maps"), abs_timer_pinned SEC(".maps"); __u64 bss_data; __u64 abs_data; @@ -59,6 +59,8 @@ __u64 err; __u64 ok; __u64 callback_check = 52; __u64 callback2_check = 52; +__u64 pinned_callback_check; +__s32 pinned_cpu; #define ARRAY 1 #define HTAB 2 @@ -329,3 +331,62 @@ int BPF_PROG2(test3, int, a) return 0; } + +/* callback for pinned timer */ +static int timer_cb_pinned(void *map, int *key, struct bpf_timer *timer) +{ + __s32 cpu = bpf_get_smp_processor_id(); + + if (cpu != pinned_cpu) + err |= 16384; + + pinned_callback_check++; + return 0; +} + +static void test_pinned_timer(bool soft) +{ + int key = 0; + void *map; + struct bpf_timer *timer; + __u64 flags = BPF_F_TIMER_CPU_PIN; + __u64 start_time; + + if (soft) { + map = &soft_timer_pinned; + start_time = 0; + } else { + map = &abs_timer_pinned; + start_time = bpf_ktime_get_boot_ns(); + flags |= BPF_F_TIMER_ABS; + } + + timer = bpf_map_lookup_elem(map, &key); + if (timer) { + if (bpf_timer_init(timer, map, CLOCK_BOOTTIME) != 0) + err |= 4096; + bpf_timer_set_callback(timer, timer_cb_pinned); + pinned_cpu = bpf_get_smp_processor_id(); + bpf_timer_start(timer, start_time + 1000, flags); + } else { + err |= 8192; + } +} + +SEC("fentry/bpf_fentry_test4") +int BPF_PROG2(test4, int, a) +{ + bpf_printk("test4"); + test_pinned_timer(true); + + return 0; +} + +SEC("fentry/bpf_fentry_test5") +int BPF_PROG2(test5, int, a) +{ + bpf_printk("test5"); + test_pinned_timer(false); + + return 0; +} -- 2.34.1
From: Dave Marchevsky <davemarchevsky@fb.com> mainline inclusion from mainline-v6.7-rc1 commit f10ca5da5bd71e5cefed7995e75a7c873ce3816e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Commit 6018e1f407cc ("bpf: implement numbers iterator") added the BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num doesn't exist, so only a forward declaration is emitted in BTF: FWD 'btf_iter_num' fwd_kind=struct That commit was probably hoping to ensure that struct bpf_iter_num is emitted in vmlinux BTF. A previous version of this patch changed the line to emit the correct type, but Yonghong confirmed that it would definitely be emitted regardless in [0], so this patch simply removes the line. This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't causing any issues that I noticed, aside from mild confusion when I looked through the code. [0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-2-davemarchevsky@fb.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_iter.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 96856f130cbf..833faa04461b 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -793,8 +793,6 @@ __bpf_kfunc int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) BUILD_BUG_ON(sizeof(struct bpf_iter_num_kern) != sizeof(struct bpf_iter_num)); BUILD_BUG_ON(__alignof__(struct bpf_iter_num_kern) != __alignof__(struct bpf_iter_num)); - BTF_TYPE_EMIT(struct btf_iter_num); - /* start == end is legit, it's an empty range and we'll just get NULL * on first (and any subsequent) bpf_iter_num_next() call */ -- 2.34.1
From: Dave Marchevsky <davemarchevsky@fb.com> mainline inclusion from mainline-v6.7-rc1 commit 45b38941c81f16bb2e9b0115f03e164a3576ea8b category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Further patches in this series will add a struct bpf_iter_task_vma, which will result in a name collision with the selftest prog renamed in this patch. Rename the selftest to avoid the collision. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-3-davemarchevsky@fb.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++++++++---------- ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 2 files changed, 13 insertions(+), 13 deletions(-) rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index f141e278b16f..caea3c500cc0 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -10,7 +10,7 @@ #include "bpf_iter_task.skel.h" #include "bpf_iter_task_stack.skel.h" #include "bpf_iter_task_file.skel.h" -#include "bpf_iter_task_vma.skel.h" +#include "bpf_iter_task_vmas.skel.h" #include "bpf_iter_task_btf.skel.h" #include "bpf_iter_tcp4.skel.h" #include "bpf_iter_tcp6.skel.h" @@ -1401,19 +1401,19 @@ static void str_strip_first_line(char *str) static void test_task_vma_common(struct bpf_iter_attach_opts *opts) { int err, iter_fd = -1, proc_maps_fd = -1; - struct bpf_iter_task_vma *skel; + struct bpf_iter_task_vmas *skel; int len, read_size = 4; char maps_path[64]; - skel = bpf_iter_task_vma__open(); - if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vma__open")) + skel = bpf_iter_task_vmas__open(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vmas__open")) return; skel->bss->pid = getpid(); skel->bss->one_task = opts ? 1 : 0; - err = bpf_iter_task_vma__load(skel); - if (!ASSERT_OK(err, "bpf_iter_task_vma__load")) + err = bpf_iter_task_vmas__load(skel); + if (!ASSERT_OK(err, "bpf_iter_task_vmas__load")) goto out; skel->links.proc_maps = bpf_program__attach_iter( @@ -1464,25 +1464,25 @@ static void test_task_vma_common(struct bpf_iter_attach_opts *opts) out: close(proc_maps_fd); close(iter_fd); - bpf_iter_task_vma__destroy(skel); + bpf_iter_task_vmas__destroy(skel); } static void test_task_vma_dead_task(void) { - struct bpf_iter_task_vma *skel; + struct bpf_iter_task_vmas *skel; int wstatus, child_pid = -1; time_t start_tm, cur_tm; int err, iter_fd = -1; int wait_sec = 3; - skel = bpf_iter_task_vma__open(); - if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vma__open")) + skel = bpf_iter_task_vmas__open(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vmas__open")) return; skel->bss->pid = getpid(); - err = bpf_iter_task_vma__load(skel); - if (!ASSERT_OK(err, "bpf_iter_task_vma__load")) + err = bpf_iter_task_vmas__load(skel); + if (!ASSERT_OK(err, "bpf_iter_task_vmas__load")) goto out; skel->links.proc_maps = bpf_program__attach_iter( @@ -1535,7 +1535,7 @@ static void test_task_vma_dead_task(void) out: waitpid(child_pid, &wstatus, 0); close(iter_fd); - bpf_iter_task_vma__destroy(skel); + bpf_iter_task_vmas__destroy(skel); } void test_bpf_sockmap_map_iter_fd(void) diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c similarity index 100% rename from tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c rename to tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c -- 2.34.1
From: Dave Marchevsky <davemarchevsky@fb.com> mainline inclusion from mainline-v6.7-rc1 commit 4ac4546821584736798aaa9e97da9f6eaf689ea3 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task_vma in open-coded iterator style. BPF programs can use these kfuncs directly or through bpf_for_each macro for natural-looking iteration of all task vmas. The implementation borrows heavily from bpf_find_vma helper's locking - differing only in that it holds the mmap_read lock for all iterations while the helper only executes its provided callback on a maximum of 1 vma. Aside from locking, struct vma_iterator and vma_next do all the heavy lifting. A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the only field in struct bpf_iter_task_vma. This is because the inner data struct contains a struct vma_iterator (not ptr), whose size is likely to change under us. If bpf_iter_task_vma_kern contained vma_iterator directly such a change would require change in opaque bpf_iter_task_vma struct's size. So better to allocate vma_iterator using BPF allocator, and since that alloc must already succeed, might as well allocate all iter fields, thereby freezing struct bpf_iter_task_vma size. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 3 ++ kernel/bpf/task_iter.c | 91 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 94 insertions(+) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index c889a9dee6c6..fed10499f218 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2671,6 +2671,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW | KF_RCU) +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_dynptr_adjust) BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index f7ef58090c7d..210128790e6f 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -7,7 +7,9 @@ #include <linux/fs.h> #include <linux/fdtable.h> #include <linux/filter.h> +#include <linux/bpf_mem_alloc.h> #include <linux/btf_ids.h> +#include <linux/mm_types.h> #include "mmap_unlock_work.h" static const char * const iter_task_type_names[] = { @@ -823,6 +825,95 @@ const struct bpf_func_proto bpf_find_vma_proto = { .arg5_type = ARG_ANYTHING, }; +struct bpf_iter_task_vma_kern_data { + struct task_struct *task; + struct mm_struct *mm; + struct mmap_unlock_irq_work *work; + struct vma_iterator vmi; +}; + +struct bpf_iter_task_vma { + /* opaque iterator state; having __u64 here allows to preserve correct + * alignment requirements in vmlinux.h, generated from BTF + */ + __u64 __opaque[1]; +} __attribute__((aligned(8))); + +/* Non-opaque version of bpf_iter_task_vma */ +struct bpf_iter_task_vma_kern { + struct bpf_iter_task_vma_kern_data *data; +} __attribute__((aligned(8))); + +__diag_push(); +__diag_ignore_all("-Wmissing-prototypes", + "Global functions as their definitions will be in vmlinux BTF"); + +__bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, + struct task_struct *task, u64 addr) +{ + struct bpf_iter_task_vma_kern *kit = (void *)it; + bool irq_work_busy = false; + int err; + + BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); + + /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized + * before, so non-NULL kit->data doesn't point to previously + * bpf_mem_alloc'd bpf_iter_task_vma_kern_data + */ + kit->data = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_iter_task_vma_kern_data)); + if (!kit->data) + return -ENOMEM; + + kit->data->task = get_task_struct(task); + kit->data->mm = task->mm; + if (!kit->data->mm) { + err = -ENOENT; + goto err_cleanup_iter; + } + + /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ + irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); + if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { + err = -EBUSY; + goto err_cleanup_iter; + } + + vma_iter_init(&kit->data->vmi, kit->data->mm, addr); + return 0; + +err_cleanup_iter: + if (kit->data->task) + put_task_struct(kit->data->task); + bpf_mem_free(&bpf_global_ma, kit->data); + /* NULL kit->data signals failed bpf_iter_task_vma initialization */ + kit->data = NULL; + return err; +} + +__bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) +{ + struct bpf_iter_task_vma_kern *kit = (void *)it; + + if (!kit->data) /* bpf_iter_task_vma_new failed */ + return NULL; + return vma_next(&kit->data->vmi); +} + +__bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) +{ + struct bpf_iter_task_vma_kern *kit = (void *)it; + + if (kit->data) { + bpf_mmap_unlock_mm(kit->data->work, kit->data->mm); + put_task_struct(kit->data->task); + bpf_mem_free(&bpf_global_ma, kit->data); + } +} + +__diag_pop(); + DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work); static void do_mmap_read_unlock(struct irq_work *entry) -- 2.34.1
From: Dave Marchevsky <davemarchevsky@fb.com> mainline inclusion from mainline-v6.7-rc1 commit e0e1a7a5fc377d54bd792c6368a375d41fc316ef category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The open-coded task_vma iter added earlier in this series allows for natural iteration over a task's vmas using existing open-coded iter infrastructure, specifically bpf_for_each. This patch adds a test demonstrating this pattern and validating correctness. The vma->vm_start and vma->vm_end addresses of the first 1000 vmas are recorded and compared to /proc/PID/maps output. As expected, both see the same vmas and addresses - with the exception of the [vsyscall] vma - which is explained in a comment in the prog_tests program. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-5-davemarchevsky@fb.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../testing/selftests/bpf/bpf_experimental.h | 8 +++ .../testing/selftests/bpf/prog_tests/iters.c | 58 +++++++++++++++++++ .../selftests/bpf/progs/iters_task_vma.c | 43 ++++++++++++++ 3 files changed, 109 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/iters_task_vma.c diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h index 4494eaa9937e..f7d32c0e1480 100644 --- a/tools/testing/selftests/bpf/bpf_experimental.h +++ b/tools/testing/selftests/bpf/bpf_experimental.h @@ -159,6 +159,14 @@ extern void *bpf_percpu_obj_new_impl(__u64 local_type_id, void *meta) __ksym; */ extern void bpf_percpu_obj_drop_impl(void *kptr, void *meta) __ksym; +struct bpf_iter_task_vma; + +extern int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, + struct task_struct *task, + unsigned long addr) __ksym; +extern struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) __ksym; +extern void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __ksym; + /* Convenience macro to wrap over bpf_obj_drop_impl */ #define bpf_percpu_obj_drop(kptr) bpf_percpu_obj_drop_impl(kptr, NULL) diff --git a/tools/testing/selftests/bpf/prog_tests/iters.c b/tools/testing/selftests/bpf/prog_tests/iters.c index 10804ae5ae97..b696873c5455 100644 --- a/tools/testing/selftests/bpf/prog_tests/iters.c +++ b/tools/testing/selftests/bpf/prog_tests/iters.c @@ -8,6 +8,7 @@ #include "iters_looping.skel.h" #include "iters_num.skel.h" #include "iters_testmod_seq.skel.h" +#include "iters_task_vma.skel.h" static void subtest_num_iters(void) { @@ -90,6 +91,61 @@ static void subtest_testmod_seq_iters(void) iters_testmod_seq__destroy(skel); } +static void subtest_task_vma_iters(void) +{ + unsigned long start, end, bpf_iter_start, bpf_iter_end; + struct iters_task_vma *skel; + char rest_of_line[1000]; + unsigned int seen; + FILE *f = NULL; + int err; + + skel = iters_task_vma__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_open_and_load")) + return; + + skel->bss->target_pid = getpid(); + + err = iters_task_vma__attach(skel); + if (!ASSERT_OK(err, "skel_attach")) + goto cleanup; + + getpgid(skel->bss->target_pid); + iters_task_vma__detach(skel); + + if (!ASSERT_GT(skel->bss->vmas_seen, 0, "vmas_seen_gt_zero")) + goto cleanup; + + f = fopen("/proc/self/maps", "r"); + if (!ASSERT_OK_PTR(f, "proc_maps_fopen")) + goto cleanup; + + seen = 0; + while (fscanf(f, "%lx-%lx %[^\n]\n", &start, &end, rest_of_line) == 3) { + /* [vsyscall] vma isn't _really_ part of task->mm vmas. + * /proc/PID/maps returns it when out of vmas - see get_gate_vma + * calls in fs/proc/task_mmu.c + */ + if (strstr(rest_of_line, "[vsyscall]")) + continue; + + bpf_iter_start = skel->bss->vm_ranges[seen].vm_start; + bpf_iter_end = skel->bss->vm_ranges[seen].vm_end; + + ASSERT_EQ(bpf_iter_start, start, "vma->vm_start match"); + ASSERT_EQ(bpf_iter_end, end, "vma->vm_end match"); + seen++; + } + + if (!ASSERT_EQ(skel->bss->vmas_seen, seen, "vmas_seen_eq")) + goto cleanup; + +cleanup: + if (f) + fclose(f); + iters_task_vma__destroy(skel); +} + void test_iters(void) { RUN_TESTS(iters_state_safety); @@ -103,4 +159,6 @@ void test_iters(void) subtest_num_iters(); if (test__start_subtest("testmod_seq")) subtest_testmod_seq_iters(); + if (test__start_subtest("task_vma")) + subtest_task_vma_iters(); } diff --git a/tools/testing/selftests/bpf/progs/iters_task_vma.c b/tools/testing/selftests/bpf/progs/iters_task_vma.c new file mode 100644 index 000000000000..44edecfdfaee --- /dev/null +++ b/tools/testing/selftests/bpf/progs/iters_task_vma.c @@ -0,0 +1,43 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */ + +#include "vmlinux.h" +#include "bpf_experimental.h" +#include <bpf/bpf_helpers.h> +#include "bpf_misc.h" + +pid_t target_pid = 0; +unsigned int vmas_seen = 0; + +struct { + __u64 vm_start; + __u64 vm_end; +} vm_ranges[1000]; + +SEC("raw_tp/sys_enter") +int iter_task_vma_for_each(const void *ctx) +{ + struct task_struct *task = bpf_get_current_task_btf(); + struct vm_area_struct *vma; + unsigned int seen = 0; + + if (task->pid != target_pid) + return 0; + + if (vmas_seen) + return 0; + + bpf_for_each(task_vma, vma, task, 0) { + if (seen >= 1000) + break; + + vm_ranges[seen].vm_start = vma->vm_start; + vm_ranges[seen].vm_end = vma->vm_end; + seen++; + } + + vmas_seen = seen; + return 0; +} + +char _license[] SEC("license") = "GPL"; -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit 6da88306811b40a207c94c9da9faf07bdb20776e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This patch makes some preparations for using css_task_iter_*() in BPF Program. 1. Flags CSS_TASK_ITER_* are #define-s and it's not easy for bpf prog to use them. Convert them to enum so bpf prog can take them from vmlinux.h. 2. In the next patch we will add css_task_iter_*() in common kfuncs which is not safe. Since css_task_iter_*() does spin_unlock_irq() which might screw up irq flags depending on the context where bpf prog is running. So we should use irqsave/irqrestore here and the switching is harmless. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-2-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/cgroup.h | 12 +++++------- kernel/cgroup/cgroup.c | 18 ++++++++++++------ 2 files changed, 17 insertions(+), 13 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index e0be4e60aba8..8148e2f65ab7 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -41,13 +41,11 @@ struct kernel_clone_args; #define CGROUP_WEIGHT_DFL 100 #define CGROUP_WEIGHT_MAX 10000 -/* walk only threadgroup leaders */ -#define CSS_TASK_ITER_PROCS (1U << 0) -/* walk all threaded css_sets in the domain */ -#define CSS_TASK_ITER_THREADED (1U << 1) - -/* internal flags */ -#define CSS_TASK_ITER_SKIPPED (1U << 16) +enum { + CSS_TASK_ITER_PROCS = (1U << 0), /* walk only threadgroup leaders */ + CSS_TASK_ITER_THREADED = (1U << 1), /* walk all threaded css_sets in the domain */ + CSS_TASK_ITER_SKIPPED = (1U << 16), /* internal flags */ +}; /* a css_task_iter should be treated as an opaque object */ struct css_task_iter { diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 17521bc192ee..d0a298cdf532 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5079,9 +5079,11 @@ static void css_task_iter_advance(struct css_task_iter *it) void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags, struct css_task_iter *it) { + unsigned long irqflags; + memset(it, 0, sizeof(*it)); - spin_lock_irq(&css_set_lock); + spin_lock_irqsave(&css_set_lock, irqflags); it->ss = css->ss; it->flags = flags; @@ -5095,7 +5097,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags, css_task_iter_advance(it); - spin_unlock_irq(&css_set_lock); + spin_unlock_irqrestore(&css_set_lock, irqflags); } /** @@ -5108,12 +5110,14 @@ void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags, */ struct task_struct *css_task_iter_next(struct css_task_iter *it) { + unsigned long irqflags; + if (it->cur_task) { put_task_struct(it->cur_task); it->cur_task = NULL; } - spin_lock_irq(&css_set_lock); + spin_lock_irqsave(&css_set_lock, irqflags); /* @it may be half-advanced by skips, finish advancing */ if (it->flags & CSS_TASK_ITER_SKIPPED) @@ -5126,7 +5130,7 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it) css_task_iter_advance(it); } - spin_unlock_irq(&css_set_lock); + spin_unlock_irqrestore(&css_set_lock, irqflags); return it->cur_task; } @@ -5139,11 +5143,13 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it) */ void css_task_iter_end(struct css_task_iter *it) { + unsigned long irqflags; + if (it->cur_cset) { - spin_lock_irq(&css_set_lock); + spin_lock_irqsave(&css_set_lock, irqflags); list_del(&it->iters_node); put_css_set_locked(it->cur_cset); - spin_unlock_irq(&css_set_lock); + spin_unlock_irqrestore(&css_set_lock, irqflags); } if (it->cur_dcset) -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit 9c66dc94b62aef23300f05f63404afb8990920b4 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This patch adds kfuncs bpf_iter_css_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css_task in open-coded iterator style. These kfuncs actually wrapps css_task_iter_{start,next, end}. BPF programs can use these kfuncs through bpf_for_each macro for iteration of all tasks under a css. css_task_iter_*() would try to get the global spin-lock *css_set_lock*, so the bpf side has to be careful in where it allows to use this iter. Currently we only allow it in bpf_lsm and bpf iter-s. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-3-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: kernel/bpf/verifier.c tools/testing/selftests/bpf/bpf_experimental.h [Fix conflicts because of missing some kfuncs.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 3 + kernel/bpf/task_iter.c | 58 +++++++++++++++++++ kernel/bpf/verifier.c | 23 ++++++++ .../testing/selftests/bpf/bpf_experimental.h | 8 +++ 4 files changed, 92 insertions(+) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index fed10499f218..a698e70c0250 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2674,6 +2674,9 @@ BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW | KF_RCU) BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_iter_css_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_iter_css_task_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_dynptr_adjust) BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 210128790e6f..0a4e497dd6cc 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -914,6 +914,64 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __diag_pop(); +struct bpf_iter_css_task { + __u64 __opaque[1]; +} __attribute__((aligned(8))); + +struct bpf_iter_css_task_kern { + struct css_task_iter *css_it; +} __attribute__((aligned(8))); + +__diag_push(); +__diag_ignore_all("-Wmissing-prototypes", + "Global functions as their definitions will be in vmlinux BTF"); + +__bpf_kfunc int bpf_iter_css_task_new(struct bpf_iter_css_task *it, + struct cgroup_subsys_state *css, unsigned int flags) +{ + struct bpf_iter_css_task_kern *kit = (void *)it; + + BUILD_BUG_ON(sizeof(struct bpf_iter_css_task_kern) != sizeof(struct bpf_iter_css_task)); + BUILD_BUG_ON(__alignof__(struct bpf_iter_css_task_kern) != + __alignof__(struct bpf_iter_css_task)); + kit->css_it = NULL; + switch (flags) { + case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED: + case CSS_TASK_ITER_PROCS: + case 0: + break; + default: + return -EINVAL; + } + + kit->css_it = bpf_mem_alloc(&bpf_global_ma, sizeof(struct css_task_iter)); + if (!kit->css_it) + return -ENOMEM; + css_task_iter_start(css, flags, kit->css_it); + return 0; +} + +__bpf_kfunc struct task_struct *bpf_iter_css_task_next(struct bpf_iter_css_task *it) +{ + struct bpf_iter_css_task_kern *kit = (void *)it; + + if (!kit->css_it) + return NULL; + return css_task_iter_next(kit->css_it); +} + +__bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it) +{ + struct bpf_iter_css_task_kern *kit = (void *)it; + + if (!kit->css_it) + return; + css_task_iter_end(kit->css_it); + bpf_mem_free(&bpf_global_ma, kit->css_it); +} + +__diag_pop(); + DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work); static void do_mmap_read_unlock(struct irq_work *entry) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9ff0704cef3d..0d9394f99f9c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -10823,6 +10823,7 @@ enum special_kfunc_type { KF_bpf_handle_ingress_ptype, KF_bpf_handle_egress_ptype, #endif + KF_bpf_iter_css_task_new, }; BTF_SET_START(special_kfunc_set) @@ -10851,6 +10852,7 @@ BTF_ID(func, bpf_get_skb_ethhdr) BTF_ID(func, bpf_handle_ingress_ptype) BTF_ID(func, bpf_handle_egress_ptype) #endif +BTF_ID(func, bpf_iter_css_task_new) BTF_SET_END(special_kfunc_set) BTF_ID_LIST(special_kfunc_list) @@ -10881,6 +10883,7 @@ BTF_ID(func, bpf_get_skb_ethhdr) BTF_ID(func, bpf_handle_ingress_ptype) BTF_ID(func, bpf_handle_egress_ptype) #endif +BTF_ID(func, bpf_iter_css_task_new) static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta) { @@ -11405,6 +11408,20 @@ static int process_kf_arg_ptr_to_rbtree_node(struct bpf_verifier_env *env, &meta->arg_rbtree_root.field); } +static bool check_css_task_iter_allowlist(struct bpf_verifier_env *env) +{ + enum bpf_prog_type prog_type = resolve_prog_type(env->prog); + + switch (prog_type) { + case BPF_PROG_TYPE_LSM: + return true; + case BPF_TRACE_ITER: + return env->prog->aux->sleepable; + default: + return false; + } +} + static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_arg_meta *meta, int insn_idx) { @@ -11646,6 +11663,12 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ break; } case KF_ARG_PTR_TO_ITER: + if (meta->func_id == special_kfunc_list[KF_bpf_iter_css_task_new]) { + if (!check_css_task_iter_allowlist(env)) { + verbose(env, "css_task_iter is only allowed in bpf_lsm and bpf iter-s\n"); + return -EINVAL; + } + } ret = process_iter_arg(env, regno, insn_idx, meta); if (ret < 0) return ret; diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h index f7d32c0e1480..1d9dc535062f 100644 --- a/tools/testing/selftests/bpf/bpf_experimental.h +++ b/tools/testing/selftests/bpf/bpf_experimental.h @@ -170,4 +170,12 @@ extern void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __ksym; /* Convenience macro to wrap over bpf_obj_drop_impl */ #define bpf_percpu_obj_drop(kptr) bpf_percpu_obj_drop_impl(kptr, NULL) +struct bpf_iter_css_task; +struct cgroup_subsys_state; +extern int bpf_iter_css_task_new(struct bpf_iter_css_task *it, + struct cgroup_subsys_state *css, unsigned int flags) __weak __ksym; +extern struct task_struct *bpf_iter_css_task_next(struct bpf_iter_css_task *it) __weak __ksym; +extern void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it) __weak __ksym; + + #endif -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit c68a78ffe2cb4207f64fd0f4262818c728c67be0 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task in open-coded iterator style. BPF programs can use these kfuncs or through bpf_for_each macro to iterate all processes in the system. The API design keep consistent with SEC("iter/task"). bpf_iter_task_new() accepts a specific task and iterating type which allows: 1. iterating all process in the system (BPF_TASK_ITER_ALL_PROCS) 2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS) 3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS) Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Link: https://lore.kernel.org/r/20231018061746.111364-4-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 3 + kernel/bpf/task_iter.c | 90 +++++++++++++++++++ .../testing/selftests/bpf/bpf_experimental.h | 5 ++ 3 files changed, 98 insertions(+) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a698e70c0250..e6c44a165a85 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2677,6 +2677,9 @@ BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_iter_css_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_iter_css_task_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_iter_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_iter_task_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_task_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_dynptr_adjust) BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 0a4e497dd6cc..5e9a290f868b 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -972,6 +972,96 @@ __bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it) __diag_pop(); +struct bpf_iter_task { + __u64 __opaque[3]; +} __attribute__((aligned(8))); + +struct bpf_iter_task_kern { + struct task_struct *task; + struct task_struct *pos; + unsigned int flags; +} __attribute__((aligned(8))); + +enum { + /* all process in the system */ + BPF_TASK_ITER_ALL_PROCS, + /* all threads in the system */ + BPF_TASK_ITER_ALL_THREADS, + /* all threads of a specific process */ + BPF_TASK_ITER_PROC_THREADS +}; + +__diag_push(); +__diag_ignore_all("-Wmissing-prototypes", + "Global functions as their definitions will be in vmlinux BTF"); + +__bpf_kfunc int bpf_iter_task_new(struct bpf_iter_task *it, + struct task_struct *task, unsigned int flags) +{ + struct bpf_iter_task_kern *kit = (void *)it; + + BUILD_BUG_ON(sizeof(struct bpf_iter_task_kern) > sizeof(struct bpf_iter_task)); + BUILD_BUG_ON(__alignof__(struct bpf_iter_task_kern) != + __alignof__(struct bpf_iter_task)); + + kit->task = kit->pos = NULL; + switch (flags) { + case BPF_TASK_ITER_ALL_THREADS: + case BPF_TASK_ITER_ALL_PROCS: + case BPF_TASK_ITER_PROC_THREADS: + break; + default: + return -EINVAL; + } + + if (flags == BPF_TASK_ITER_PROC_THREADS) + kit->task = task; + else + kit->task = &init_task; + kit->pos = kit->task; + kit->flags = flags; + return 0; +} + +__bpf_kfunc struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) +{ + struct bpf_iter_task_kern *kit = (void *)it; + struct task_struct *pos; + unsigned int flags; + + flags = kit->flags; + pos = kit->pos; + + if (!pos) + return pos; + + if (flags == BPF_TASK_ITER_ALL_PROCS) + goto get_next_task; + + kit->pos = next_thread(kit->pos); + if (kit->pos == kit->task) { + if (flags == BPF_TASK_ITER_PROC_THREADS) { + kit->pos = NULL; + return pos; + } + } else + return pos; + +get_next_task: + kit->pos = next_task(kit->pos); + kit->task = kit->pos; + if (kit->pos == &init_task) + kit->pos = NULL; + + return pos; +} + +__bpf_kfunc void bpf_iter_task_destroy(struct bpf_iter_task *it) +{ +} + +__diag_pop(); + DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work); static void do_mmap_read_unlock(struct irq_work *entry) diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h index 1d9dc535062f..2a67e2819f01 100644 --- a/tools/testing/selftests/bpf/bpf_experimental.h +++ b/tools/testing/selftests/bpf/bpf_experimental.h @@ -177,5 +177,10 @@ extern int bpf_iter_css_task_new(struct bpf_iter_css_task *it, extern struct task_struct *bpf_iter_css_task_next(struct bpf_iter_css_task *it) __weak __ksym; extern void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it) __weak __ksym; +struct bpf_iter_task; +extern int bpf_iter_task_new(struct bpf_iter_task *it, + struct task_struct *task, unsigned int flags) __weak __ksym; +extern struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) __weak __ksym; +extern void bpf_iter_task_destroy(struct bpf_iter_task *it) __weak __ksym; #endif -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit 7251d0905e7518bcb990c8e9a3615b1bb23c78f2 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This Patch adds kfuncs bpf_iter_css_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css in open-coded iterator style. These kfuncs actually wrapps css_next_descendant_{pre, post}. css_iter can be used to: 1) iterating a sepcific cgroup tree with pre/post/up order 2) iterating cgroup_subsystem in BPF Prog, like for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in kernel. The API design is consistent with cgroup_iter. bpf_iter_css_new accepts parameters defining iteration order and starting css. Here we also reuse BPF_CGROUP_ITER_DESCENDANTS_PRE, BPF_CGROUP_ITER_DESCENDANTS_POST, BPF_CGROUP_ITER_ANCESTORS_UP enums. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-5-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/cgroup_iter.c | 65 +++++++++++++++++++ kernel/bpf/helpers.c | 3 + .../testing/selftests/bpf/bpf_experimental.h | 6 ++ 3 files changed, 74 insertions(+) diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c index 810378f04fbc..209e5135f9fb 100644 --- a/kernel/bpf/cgroup_iter.c +++ b/kernel/bpf/cgroup_iter.c @@ -294,3 +294,68 @@ static int __init bpf_cgroup_iter_init(void) } late_initcall(bpf_cgroup_iter_init); + +struct bpf_iter_css { + __u64 __opaque[3]; +} __attribute__((aligned(8))); + +struct bpf_iter_css_kern { + struct cgroup_subsys_state *start; + struct cgroup_subsys_state *pos; + unsigned int flags; +} __attribute__((aligned(8))); + +__diag_push(); +__diag_ignore_all("-Wmissing-prototypes", + "Global functions as their definitions will be in vmlinux BTF"); + +__bpf_kfunc int bpf_iter_css_new(struct bpf_iter_css *it, + struct cgroup_subsys_state *start, unsigned int flags) +{ + struct bpf_iter_css_kern *kit = (void *)it; + + BUILD_BUG_ON(sizeof(struct bpf_iter_css_kern) > sizeof(struct bpf_iter_css)); + BUILD_BUG_ON(__alignof__(struct bpf_iter_css_kern) != __alignof__(struct bpf_iter_css)); + + kit->start = NULL; + switch (flags) { + case BPF_CGROUP_ITER_DESCENDANTS_PRE: + case BPF_CGROUP_ITER_DESCENDANTS_POST: + case BPF_CGROUP_ITER_ANCESTORS_UP: + break; + default: + return -EINVAL; + } + + kit->start = start; + kit->pos = NULL; + kit->flags = flags; + return 0; +} + +__bpf_kfunc struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) +{ + struct bpf_iter_css_kern *kit = (void *)it; + + if (!kit->start) + return NULL; + + switch (kit->flags) { + case BPF_CGROUP_ITER_DESCENDANTS_PRE: + kit->pos = css_next_descendant_pre(kit->pos, kit->start); + break; + case BPF_CGROUP_ITER_DESCENDANTS_POST: + kit->pos = css_next_descendant_post(kit->pos, kit->start); + break; + case BPF_CGROUP_ITER_ANCESTORS_UP: + kit->pos = kit->pos ? kit->pos->parent : kit->start; + } + + return kit->pos; +} + +__bpf_kfunc void bpf_iter_css_destroy(struct bpf_iter_css *it) +{ +} + +__diag_pop(); \ No newline at end of file diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index e6c44a165a85..a9f9ac4d04ca 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2680,6 +2680,9 @@ BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_iter_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_iter_task_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_task_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_iter_css_new, KF_ITER_NEW | KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_iter_css_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_css_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_dynptr_adjust) BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h index 2a67e2819f01..facf41fec72f 100644 --- a/tools/testing/selftests/bpf/bpf_experimental.h +++ b/tools/testing/selftests/bpf/bpf_experimental.h @@ -183,4 +183,10 @@ extern int bpf_iter_task_new(struct bpf_iter_task *it, extern struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) __weak __ksym; extern void bpf_iter_task_destroy(struct bpf_iter_task *it) __weak __ksym; +struct bpf_iter_css; +extern int bpf_iter_css_new(struct bpf_iter_css *it, + struct cgroup_subsys_state *start, unsigned int flags) __weak __ksym; +extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym; +extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym; + #endif -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit dfab99df147b0d364f0c199f832ff2aedfb2265a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- css_iter and task_iter should be used in rcu section. Specifically, in sleepable progs explicit bpf_rcu_read_lock() is needed before use these iters. In normal bpf progs that have implicit rcu_read_lock(), it's OK to use them directly. This patch adds a new a KF flag KF_RCU_PROTECTED for bpf_iter_task_new and bpf_iter_css_new. It means the kfunc should be used in RCU CS. We check whether we are in rcu cs before we want to invoke this kfunc. If the rcu protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU. Once user do rcu_unlock during the iteration, state MEM_RCU of regs would be cleared. is_iter_reg_valid_init() will reject if reg->type is UNTRUSTED. It is worth noting that currently, bpf_rcu_read_unlock does not clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg only considers STACK_SPILL. This patch also let bpf_for_each_spilled_reg search STACK_ITER. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-6-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: include/linux/bpf_verifier.h [Fix conflicts because of kabi.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf_verifier.h | 19 ++++++++------ include/linux/btf.h | 1 + kernel/bpf/helpers.c | 4 +-- kernel/bpf/verifier.c | 50 ++++++++++++++++++++++++++++-------- 4 files changed, 53 insertions(+), 21 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index b7ff22ed0fd7..e1d213f2585d 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -473,19 +473,18 @@ struct bpf_verifier_state { KABI_RESERVE(4) }; -#define bpf_get_spilled_reg(slot, frame) \ +#define bpf_get_spilled_reg(slot, frame, mask) \ (((slot < frame->allocated_stack / BPF_REG_SIZE) && \ - (frame->stack[slot].slot_type[0] == STACK_SPILL)) \ + ((1 << frame->stack[slot].slot_type[0]) & (mask))) \ ? &frame->stack[slot].spilled_ptr : NULL) /* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. */ -#define bpf_for_each_spilled_reg(iter, frame, reg) \ - for (iter = 0, reg = bpf_get_spilled_reg(iter, frame); \ +#define bpf_for_each_spilled_reg(iter, frame, reg, mask) \ + for (iter = 0, reg = bpf_get_spilled_reg(iter, frame, mask); \ iter < frame->allocated_stack / BPF_REG_SIZE; \ - iter++, reg = bpf_get_spilled_reg(iter, frame)) + iter++, reg = bpf_get_spilled_reg(iter, frame, mask)) -/* Invoke __expr over regsiters in __vst, setting __state and __reg */ -#define bpf_for_each_reg_in_vstate(__vst, __state, __reg, __expr) \ +#define bpf_for_each_reg_in_vstate_mask(__vst, __state, __reg, __mask, __expr) \ ({ \ struct bpf_verifier_state *___vstate = __vst; \ int ___i, ___j; \ @@ -497,7 +496,7 @@ struct bpf_verifier_state { __reg = &___regs[___j]; \ (void)(__expr); \ } \ - bpf_for_each_spilled_reg(___j, __state, __reg) { \ + bpf_for_each_spilled_reg(___j, __state, __reg, __mask) { \ if (!__reg) \ continue; \ (void)(__expr); \ @@ -505,6 +504,10 @@ struct bpf_verifier_state { } \ }) +/* Invoke __expr over regsiters in __vst, setting __state and __reg */ +#define bpf_for_each_reg_in_vstate(__vst, __state, __reg, __expr) \ + bpf_for_each_reg_in_vstate_mask(__vst, __state, __reg, 1 << STACK_SPILL, __expr) + /* linked list of verifier states used to prune search */ struct bpf_verifier_state_list { struct bpf_verifier_state state; diff --git a/include/linux/btf.h b/include/linux/btf.h index 928113a80a95..c2231c64d60b 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -74,6 +74,7 @@ #define KF_ITER_NEW (1 << 8) /* kfunc implements BPF iter constructor */ #define KF_ITER_NEXT (1 << 9) /* kfunc implements BPF iter next method */ #define KF_ITER_DESTROY (1 << 10) /* kfunc implements BPF iter destructor */ +#define KF_RCU_PROTECTED (1 << 11) /* kfunc should be protected by rcu cs when they are invoked */ /* * Tag marking a kernel function as a kfunc. This is meant to minimize the diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a9f9ac4d04ca..112588659104 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2677,10 +2677,10 @@ BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_iter_css_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_iter_css_task_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY) -BTF_ID_FLAGS(func, bpf_iter_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_iter_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_RCU_PROTECTED) BTF_ID_FLAGS(func, bpf_iter_task_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_task_destroy, KF_ITER_DESTROY) -BTF_ID_FLAGS(func, bpf_iter_css_new, KF_ITER_NEW | KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_iter_css_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_RCU_PROTECTED) BTF_ID_FLAGS(func, bpf_iter_css_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_css_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_dynptr_adjust) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0d9394f99f9c..bb3db6da2db6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1191,7 +1191,12 @@ static bool is_dynptr_type_expected(struct bpf_verifier_env *env, struct bpf_reg static void __mark_reg_known_zero(struct bpf_reg_state *reg); +static bool in_rcu_cs(struct bpf_verifier_env *env); + +static bool is_kfunc_rcu_protected(struct bpf_kfunc_call_arg_meta *meta); + static int mark_stack_slots_iter(struct bpf_verifier_env *env, + struct bpf_kfunc_call_arg_meta *meta, struct bpf_reg_state *reg, int insn_idx, struct btf *btf, u32 btf_id, int nr_slots) { @@ -1212,6 +1217,12 @@ static int mark_stack_slots_iter(struct bpf_verifier_env *env, __mark_reg_known_zero(st); st->type = PTR_TO_STACK; /* we don't have dedicated reg type */ + if (is_kfunc_rcu_protected(meta)) { + if (in_rcu_cs(env)) + st->type |= MEM_RCU; + else + st->type |= PTR_UNTRUSTED; + } st->live |= REG_LIVE_WRITTEN; st->ref_obj_id = i == 0 ? id : 0; st->iter.btf = btf; @@ -1286,7 +1297,7 @@ static bool is_iter_reg_valid_uninit(struct bpf_verifier_env *env, return true; } -static bool is_iter_reg_valid_init(struct bpf_verifier_env *env, struct bpf_reg_state *reg, +static int is_iter_reg_valid_init(struct bpf_verifier_env *env, struct bpf_reg_state *reg, struct btf *btf, u32 btf_id, int nr_slots) { struct bpf_func_state *state = func(env, reg); @@ -1294,26 +1305,28 @@ static bool is_iter_reg_valid_init(struct bpf_verifier_env *env, struct bpf_reg_ spi = iter_get_spi(env, reg, nr_slots); if (spi < 0) - return false; + return -EINVAL; for (i = 0; i < nr_slots; i++) { struct bpf_stack_state *slot = &state->stack[spi - i]; struct bpf_reg_state *st = &slot->spilled_ptr; + if (st->type & PTR_UNTRUSTED) + return -EPROTO; /* only main (first) slot has ref_obj_id set */ if (i == 0 && !st->ref_obj_id) - return false; + return -EINVAL; if (i != 0 && st->ref_obj_id) - return false; + return -EINVAL; if (st->iter.btf != btf || st->iter.btf_id != btf_id) - return false; + return -EINVAL; for (j = 0; j < BPF_REG_SIZE; j++) if (slot->slot_type[j] != STACK_ITER) - return false; + return -EINVAL; } - return true; + return 0; } /* Check if given stack slot is "special": @@ -7841,15 +7854,24 @@ static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_id return err; } - err = mark_stack_slots_iter(env, reg, insn_idx, meta->btf, btf_id, nr_slots); + err = mark_stack_slots_iter(env, meta, reg, insn_idx, meta->btf, btf_id, nr_slots); if (err) return err; } else { /* iter_next() or iter_destroy() expect initialized iter state*/ - if (!is_iter_reg_valid_init(env, reg, meta->btf, btf_id, nr_slots)) { + err = is_iter_reg_valid_init(env, reg, meta->btf, btf_id, nr_slots); + switch (err) { + case 0: + break; + case -EINVAL: verbose(env, "expected an initialized iter_%s as arg #%d\n", iter_type_str(meta->btf, btf_id), regno); - return -EINVAL; + return err; + case -EPROTO: + verbose(env, "expected an RCU CS when using %s\n", meta->func_name); + return err; + default: + return err; } spi = iter_get_spi(env, reg, nr_slots); @@ -10577,6 +10599,11 @@ static bool is_kfunc_rcu(struct bpf_kfunc_call_arg_meta *meta) return meta->kfunc_flags & KF_RCU; } +static bool is_kfunc_rcu_protected(struct bpf_kfunc_call_arg_meta *meta) +{ + return meta->kfunc_flags & KF_RCU_PROTECTED; +} + static bool __kfunc_param_match_suffix(const struct btf *btf, const struct btf_param *arg, const char *suffix) @@ -11964,6 +11991,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, if (env->cur_state->active_rcu_lock) { struct bpf_func_state *state; struct bpf_reg_state *reg; + u32 clear_mask = (1 << STACK_SPILL) | (1 << STACK_ITER); if (in_rbtree_lock_required_cb(env) && (rcu_lock || rcu_unlock)) { verbose(env, "Calling bpf_rcu_read_{lock,unlock} in unnecessary rbtree callback\n"); @@ -11974,7 +12002,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, verbose(env, "nested rcu read lock (kernel function %s)\n", func_name); return -EINVAL; } else if (rcu_unlock) { - bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({ + bpf_for_each_reg_in_vstate_mask(env->cur_state, state, reg, clear_mask, ({ if (reg->type & MEM_RCU) { reg->type &= ~(MEM_RCU | PTR_MAYBE_NULL); reg->type |= PTR_UNTRUSTED; -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit cb3ecf7915a1d7ce5304402f4d8616d9fa5193f7 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- When using task_iter to iterate all threads of a specific task, we enforce that the user must pass a valid task pointer to ensure safety. However, when iterating all threads/process in the system, BPF verifier still require a valid ptr instead of "nullable" pointer, even though it's pointless, which is a kind of surprising from usability standpoint. It would be nice if we could let that kfunc accept a explicit null pointer when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer when using BPF_TASK_ITER_THREAD. Given a trival kfunc: __bpf_kfunc void FN(struct TYPE_A *obj); BPF Prog would reject a nullptr for obj. The error info is: "arg#x pointer type xx xx must point to scalar, or struct with scalar" reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and the btf type of ref_t is not scalar or scalar_struct which leads to the rejection of get_kfunc_ptr_arg_type. This patch add "__nullable" annotation: __bpf_kfunc void FN(struct TYPE_A *obj__nullable); Here __nullable indicates obj can be optional, user can pass a explicit nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we will detect whether the current arg is optional and register is null, If so, return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next arg in check_kfunc_args(). Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-7-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/task_iter.c | 7 +++++-- kernel/bpf/verifier.c | 13 ++++++++++++- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 5e9a290f868b..a663c3d11b62 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -996,7 +996,7 @@ __diag_ignore_all("-Wmissing-prototypes", "Global functions as their definitions will be in vmlinux BTF"); __bpf_kfunc int bpf_iter_task_new(struct bpf_iter_task *it, - struct task_struct *task, unsigned int flags) + struct task_struct *task__nullable, unsigned int flags) { struct bpf_iter_task_kern *kit = (void *)it; @@ -1008,14 +1008,17 @@ __bpf_kfunc int bpf_iter_task_new(struct bpf_iter_task *it, switch (flags) { case BPF_TASK_ITER_ALL_THREADS: case BPF_TASK_ITER_ALL_PROCS: + break; case BPF_TASK_ITER_PROC_THREADS: + if (!task__nullable) + return -EINVAL; break; default: return -EINVAL; } if (flags == BPF_TASK_ITER_PROC_THREADS) - kit->task = task; + kit->task = task__nullable; else kit->task = &init_task; kit->pos = kit->task; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index bb3db6da2db6..0e7ed6b38547 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -10678,6 +10678,11 @@ static bool is_kfunc_arg_refcounted_kptr(const struct btf *btf, const struct btf return __kfunc_param_match_suffix(btf, arg, "__refcounted_kptr"); } +static bool is_kfunc_arg_nullable(const struct btf *btf, const struct btf_param *arg) +{ + return __kfunc_param_match_suffix(btf, arg, "__nullable"); +} + static bool is_kfunc_arg_scalar_with_name(const struct btf *btf, const struct btf_param *arg, const char *name) @@ -10820,6 +10825,7 @@ enum kfunc_ptr_arg_type { KF_ARG_PTR_TO_CALLBACK, KF_ARG_PTR_TO_RB_ROOT, KF_ARG_PTR_TO_RB_NODE, + KF_ARG_PTR_TO_NULL, }; enum special_kfunc_type { @@ -10991,6 +10997,8 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env, if (is_kfunc_arg_callback(env, meta->btf, &args[argno])) return KF_ARG_PTR_TO_CALLBACK; + if (is_kfunc_arg_nullable(meta->btf, &args[argno]) && register_is_null(reg)) + return KF_ARG_PTR_TO_NULL; if (argno + 1 < nargs && (is_kfunc_arg_mem_size(meta->btf, &args[argno + 1], ®s[regno + 1]) || @@ -11535,7 +11543,8 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ } if ((is_kfunc_trusted_args(meta) || is_kfunc_rcu(meta)) && - (register_is_null(reg) || type_may_be_null(reg->type))) { + (register_is_null(reg) || type_may_be_null(reg->type)) && + !is_kfunc_arg_nullable(meta->btf, &args[i])) { verbose(env, "Possibly NULL pointer passed to trusted arg%d\n", i); return -EACCES; } @@ -11560,6 +11569,8 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ return kf_arg_type; switch (kf_arg_type) { + case KF_ARG_PTR_TO_NULL: + continue; case KF_ARG_PTR_TO_ALLOC_BTF_ID: case KF_ARG_PTR_TO_BTF_ID: if (!is_kfunc_trusted_args(meta) && !is_kfunc_rcu(meta)) -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit ddab78cbb52f81f7f7598482602342955b2ff8b8 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The newly-added struct bpf_iter_task has a name collision with a selftest for the seq_file task iter's bpf skel, so the selftests/bpf/progs file is renamed in order to avoid the collision. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-8-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/bpf_iter.c | 18 +++++++++--------- .../{bpf_iter_task.c => bpf_iter_tasks.c} | 0 2 files changed, 9 insertions(+), 9 deletions(-) rename tools/testing/selftests/bpf/progs/{bpf_iter_task.c => bpf_iter_tasks.c} (100%) diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index caea3c500cc0..2cacc8fa9697 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -7,7 +7,7 @@ #include "bpf_iter_ipv6_route.skel.h" #include "bpf_iter_netlink.skel.h" #include "bpf_iter_bpf_map.skel.h" -#include "bpf_iter_task.skel.h" +#include "bpf_iter_tasks.skel.h" #include "bpf_iter_task_stack.skel.h" #include "bpf_iter_task_file.skel.h" #include "bpf_iter_task_vmas.skel.h" @@ -215,12 +215,12 @@ static void *do_nothing_wait(void *arg) static void test_task_common_nocheck(struct bpf_iter_attach_opts *opts, int *num_unknown, int *num_known) { - struct bpf_iter_task *skel; + struct bpf_iter_tasks *skel; pthread_t thread_id; void *ret; - skel = bpf_iter_task__open_and_load(); - if (!ASSERT_OK_PTR(skel, "bpf_iter_task__open_and_load")) + skel = bpf_iter_tasks__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_tasks__open_and_load")) return; ASSERT_OK(pthread_mutex_lock(&do_nothing_mutex), "pthread_mutex_lock"); @@ -239,7 +239,7 @@ static void test_task_common_nocheck(struct bpf_iter_attach_opts *opts, ASSERT_FALSE(pthread_join(thread_id, &ret) || ret != NULL, "pthread_join"); - bpf_iter_task__destroy(skel); + bpf_iter_tasks__destroy(skel); } static void test_task_common(struct bpf_iter_attach_opts *opts, int num_unknown, int num_known) @@ -307,10 +307,10 @@ static void test_task_pidfd(void) static void test_task_sleepable(void) { - struct bpf_iter_task *skel; + struct bpf_iter_tasks *skel; - skel = bpf_iter_task__open_and_load(); - if (!ASSERT_OK_PTR(skel, "bpf_iter_task__open_and_load")) + skel = bpf_iter_tasks__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_tasks__open_and_load")) return; do_dummy_read(skel->progs.dump_task_sleepable); @@ -320,7 +320,7 @@ static void test_task_sleepable(void) ASSERT_GT(skel->bss->num_success_copy_from_user_task, 0, "num_success_copy_from_user_task"); - bpf_iter_task__destroy(skel); + bpf_iter_tasks__destroy(skel); } static void test_task_stack(void) diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task.c b/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c similarity index 100% rename from tools/testing/selftests/bpf/progs/bpf_iter_task.c rename to tools/testing/selftests/bpf/progs/bpf_iter_tasks.c -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit 130e0f7af9fc7388b90fb016ca13ff3840c48d4a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This patch adds 4 subtests to demonstrate these patterns and validating correctness. subtest1: 1) We use task_iter to iterate all process in the system and search for the current process with a given pid. 2) We create some threads in current process context, and use BPF_TASK_ITER_PROC_THREADS to iterate all threads of current process. As expected, we would find all the threads of current process. 3) We create some threads and use BPF_TASK_ITER_ALL_THREADS to iterate all threads in the system. As expected, we would find all the threads which was created. subtest2: We create a cgroup and add the current task to the cgroup. In the BPF program, we would use bpf_for_each(css_task, task, css) to iterate all tasks under the cgroup. As expected, we would find the current process. subtest3: 1) We create a cgroup tree. In the BPF program, we use bpf_for_each(css, pos, root, XXX) to iterate all descendant under the root with pre and post order. As expected, we would find all descendant and the last iterating cgroup in post-order is root cgroup, the first iterating cgroup in pre-order is root cgroup. 2) We wse BPF_CGROUP_ITER_ANCESTORS_UP to traverse the cgroup tree starting from leaf and root separately, and record the height. The diff of the hights would be the total tree-high - 1. subtest4: Add some failure testcase when using css_task, task and css iters, e.g, unlock when using task-iters to iterate tasks. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Link: https://lore.kernel.org/r/20231018061746.111364-9-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../testing/selftests/bpf/prog_tests/iters.c | 150 ++++++++++++++++++ tools/testing/selftests/bpf/progs/iters_css.c | 72 +++++++++ .../selftests/bpf/progs/iters_css_task.c | 47 ++++++ .../testing/selftests/bpf/progs/iters_task.c | 41 +++++ .../selftests/bpf/progs/iters_task_failure.c | 105 ++++++++++++ 5 files changed, 415 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/iters_css.c create mode 100644 tools/testing/selftests/bpf/progs/iters_css_task.c create mode 100644 tools/testing/selftests/bpf/progs/iters_task.c create mode 100644 tools/testing/selftests/bpf/progs/iters_task_failure.c diff --git a/tools/testing/selftests/bpf/prog_tests/iters.c b/tools/testing/selftests/bpf/prog_tests/iters.c index b696873c5455..c2425791c923 100644 --- a/tools/testing/selftests/bpf/prog_tests/iters.c +++ b/tools/testing/selftests/bpf/prog_tests/iters.c @@ -1,7 +1,14 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */ +#include <sys/syscall.h> +#include <sys/mman.h> +#include <sys/wait.h> +#include <unistd.h> +#include <malloc.h> +#include <stdlib.h> #include <test_progs.h> +#include "cgroup_helpers.h" #include "iters.skel.h" #include "iters_state_safety.skel.h" @@ -9,6 +16,10 @@ #include "iters_num.skel.h" #include "iters_testmod_seq.skel.h" #include "iters_task_vma.skel.h" +#include "iters_task.skel.h" +#include "iters_css_task.skel.h" +#include "iters_css.skel.h" +#include "iters_task_failure.skel.h" static void subtest_num_iters(void) { @@ -146,6 +157,138 @@ static void subtest_task_vma_iters(void) iters_task_vma__destroy(skel); } +static pthread_mutex_t do_nothing_mutex; + +static void *do_nothing_wait(void *arg) +{ + pthread_mutex_lock(&do_nothing_mutex); + pthread_mutex_unlock(&do_nothing_mutex); + + pthread_exit(arg); +} + +#define thread_num 2 + +static void subtest_task_iters(void) +{ + struct iters_task *skel = NULL; + pthread_t thread_ids[thread_num]; + void *ret; + int err; + + skel = iters_task__open_and_load(); + if (!ASSERT_OK_PTR(skel, "open_and_load")) + goto cleanup; + skel->bss->target_pid = getpid(); + err = iters_task__attach(skel); + if (!ASSERT_OK(err, "iters_task__attach")) + goto cleanup; + pthread_mutex_lock(&do_nothing_mutex); + for (int i = 0; i < thread_num; i++) + ASSERT_OK(pthread_create(&thread_ids[i], NULL, &do_nothing_wait, NULL), + "pthread_create"); + + syscall(SYS_getpgid); + iters_task__detach(skel); + ASSERT_EQ(skel->bss->procs_cnt, 1, "procs_cnt"); + ASSERT_EQ(skel->bss->threads_cnt, thread_num + 1, "threads_cnt"); + ASSERT_EQ(skel->bss->proc_threads_cnt, thread_num + 1, "proc_threads_cnt"); + pthread_mutex_unlock(&do_nothing_mutex); + for (int i = 0; i < thread_num; i++) + ASSERT_OK(pthread_join(thread_ids[i], &ret), "pthread_join"); +cleanup: + iters_task__destroy(skel); +} + +extern int stack_mprotect(void); + +static void subtest_css_task_iters(void) +{ + struct iters_css_task *skel = NULL; + int err, cg_fd, cg_id; + const char *cgrp_path = "/cg1"; + + err = setup_cgroup_environment(); + if (!ASSERT_OK(err, "setup_cgroup_environment")) + goto cleanup; + cg_fd = create_and_get_cgroup(cgrp_path); + if (!ASSERT_GE(cg_fd, 0, "create_and_get_cgroup")) + goto cleanup; + cg_id = get_cgroup_id(cgrp_path); + err = join_cgroup(cgrp_path); + if (!ASSERT_OK(err, "join_cgroup")) + goto cleanup; + + skel = iters_css_task__open_and_load(); + if (!ASSERT_OK_PTR(skel, "open_and_load")) + goto cleanup; + + skel->bss->target_pid = getpid(); + skel->bss->cg_id = cg_id; + err = iters_css_task__attach(skel); + if (!ASSERT_OK(err, "iters_task__attach")) + goto cleanup; + err = stack_mprotect(); + if (!ASSERT_EQ(err, -1, "stack_mprotect") || + !ASSERT_EQ(errno, EPERM, "stack_mprotect")) + goto cleanup; + iters_css_task__detach(skel); + ASSERT_EQ(skel->bss->css_task_cnt, 1, "css_task_cnt"); + +cleanup: + cleanup_cgroup_environment(); + iters_css_task__destroy(skel); +} + +static void subtest_css_iters(void) +{ + struct iters_css *skel = NULL; + struct { + const char *path; + int fd; + } cgs[] = { + { "/cg1" }, + { "/cg1/cg2" }, + { "/cg1/cg2/cg3" }, + { "/cg1/cg2/cg3/cg4" }, + }; + int err, cg_nr = ARRAY_SIZE(cgs); + int i; + + err = setup_cgroup_environment(); + if (!ASSERT_OK(err, "setup_cgroup_environment")) + goto cleanup; + for (i = 0; i < cg_nr; i++) { + cgs[i].fd = create_and_get_cgroup(cgs[i].path); + if (!ASSERT_GE(cgs[i].fd, 0, "create_and_get_cgroup")) + goto cleanup; + } + + skel = iters_css__open_and_load(); + if (!ASSERT_OK_PTR(skel, "open_and_load")) + goto cleanup; + + skel->bss->target_pid = getpid(); + skel->bss->root_cg_id = get_cgroup_id(cgs[0].path); + skel->bss->leaf_cg_id = get_cgroup_id(cgs[cg_nr - 1].path); + err = iters_css__attach(skel); + + if (!ASSERT_OK(err, "iters_task__attach")) + goto cleanup; + + syscall(SYS_getpgid); + ASSERT_EQ(skel->bss->pre_order_cnt, cg_nr, "pre_order_cnt"); + ASSERT_EQ(skel->bss->first_cg_id, get_cgroup_id(cgs[0].path), "first_cg_id"); + + ASSERT_EQ(skel->bss->post_order_cnt, cg_nr, "post_order_cnt"); + ASSERT_EQ(skel->bss->last_cg_id, get_cgroup_id(cgs[0].path), "last_cg_id"); + ASSERT_EQ(skel->bss->tree_high, cg_nr - 1, "tree_high"); + iters_css__detach(skel); +cleanup: + cleanup_cgroup_environment(); + iters_css__destroy(skel); +} + void test_iters(void) { RUN_TESTS(iters_state_safety); @@ -161,4 +304,11 @@ void test_iters(void) subtest_testmod_seq_iters(); if (test__start_subtest("task_vma")) subtest_task_vma_iters(); + if (test__start_subtest("task")) + subtest_task_iters(); + if (test__start_subtest("css_task")) + subtest_css_task_iters(); + if (test__start_subtest("css")) + subtest_css_iters(); + RUN_TESTS(iters_task_failure); } diff --git a/tools/testing/selftests/bpf/progs/iters_css.c b/tools/testing/selftests/bpf/progs/iters_css.c new file mode 100644 index 000000000000..ec1f6c2f590b --- /dev/null +++ b/tools/testing/selftests/bpf/progs/iters_css.c @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bpf_misc.h" +#include "bpf_experimental.h" + +char _license[] SEC("license") = "GPL"; + +pid_t target_pid; +u64 root_cg_id, leaf_cg_id; +u64 first_cg_id, last_cg_id; + +int pre_order_cnt, post_order_cnt, tree_high; + +struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; +void bpf_cgroup_release(struct cgroup *p) __ksym; +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; + +SEC("fentry.s/" SYS_PREFIX "sys_getpgid") +int iter_css_for_each(const void *ctx) +{ + struct task_struct *cur_task = bpf_get_current_task_btf(); + struct cgroup_subsys_state *root_css, *leaf_css, *pos; + struct cgroup *root_cgrp, *leaf_cgrp, *cur_cgrp; + + if (cur_task->pid != target_pid) + return 0; + + root_cgrp = bpf_cgroup_from_id(root_cg_id); + + if (!root_cgrp) + return 0; + + leaf_cgrp = bpf_cgroup_from_id(leaf_cg_id); + + if (!leaf_cgrp) { + bpf_cgroup_release(root_cgrp); + return 0; + } + root_css = &root_cgrp->self; + leaf_css = &leaf_cgrp->self; + pre_order_cnt = post_order_cnt = tree_high = 0; + first_cg_id = last_cg_id = 0; + + bpf_rcu_read_lock(); + bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_POST) { + cur_cgrp = pos->cgroup; + post_order_cnt++; + last_cg_id = cur_cgrp->kn->id; + } + + bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_PRE) { + cur_cgrp = pos->cgroup; + pre_order_cnt++; + if (!first_cg_id) + first_cg_id = cur_cgrp->kn->id; + } + + bpf_for_each(css, pos, leaf_css, BPF_CGROUP_ITER_ANCESTORS_UP) + tree_high++; + + bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_ANCESTORS_UP) + tree_high--; + bpf_rcu_read_unlock(); + bpf_cgroup_release(root_cgrp); + bpf_cgroup_release(leaf_cgrp); + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/iters_css_task.c b/tools/testing/selftests/bpf/progs/iters_css_task.c new file mode 100644 index 000000000000..5089ce384a1c --- /dev/null +++ b/tools/testing/selftests/bpf/progs/iters_css_task.c @@ -0,0 +1,47 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */ + +#include "vmlinux.h" +#include <errno.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bpf_misc.h" +#include "bpf_experimental.h" + +char _license[] SEC("license") = "GPL"; + +struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; +void bpf_cgroup_release(struct cgroup *p) __ksym; + +pid_t target_pid; +int css_task_cnt; +u64 cg_id; + +SEC("lsm/file_mprotect") +int BPF_PROG(iter_css_task_for_each, struct vm_area_struct *vma, + unsigned long reqprot, unsigned long prot, int ret) +{ + struct task_struct *cur_task = bpf_get_current_task_btf(); + struct cgroup_subsys_state *css; + struct task_struct *task; + struct cgroup *cgrp; + + if (cur_task->pid != target_pid) + return ret; + + cgrp = bpf_cgroup_from_id(cg_id); + + if (!cgrp) + return -EPERM; + + css = &cgrp->self; + css_task_cnt = 0; + + bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS) + if (task->pid == target_pid) + css_task_cnt++; + + bpf_cgroup_release(cgrp); + + return -EPERM; +} diff --git a/tools/testing/selftests/bpf/progs/iters_task.c b/tools/testing/selftests/bpf/progs/iters_task.c new file mode 100644 index 000000000000..c9b4055cd410 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/iters_task.c @@ -0,0 +1,41 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bpf_misc.h" +#include "bpf_experimental.h" + +char _license[] SEC("license") = "GPL"; + +pid_t target_pid; +int procs_cnt, threads_cnt, proc_threads_cnt; + +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; + +SEC("fentry.s/" SYS_PREFIX "sys_getpgid") +int iter_task_for_each_sleep(void *ctx) +{ + struct task_struct *cur_task = bpf_get_current_task_btf(); + struct task_struct *pos; + + if (cur_task->pid != target_pid) + return 0; + procs_cnt = threads_cnt = proc_threads_cnt = 0; + + bpf_rcu_read_lock(); + bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_PROCS) + if (pos->pid == target_pid) + procs_cnt++; + + bpf_for_each(task, pos, cur_task, BPF_TASK_ITER_PROC_THREADS) + proc_threads_cnt++; + + bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_THREADS) + if (pos->tgid == target_pid) + threads_cnt++; + bpf_rcu_read_unlock(); + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/iters_task_failure.c b/tools/testing/selftests/bpf/progs/iters_task_failure.c new file mode 100644 index 000000000000..c3bf96a67dba --- /dev/null +++ b/tools/testing/selftests/bpf/progs/iters_task_failure.c @@ -0,0 +1,105 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bpf_misc.h" +#include "bpf_experimental.h" + +char _license[] SEC("license") = "GPL"; + +struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; +void bpf_cgroup_release(struct cgroup *p) __ksym; +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +__failure __msg("expected an RCU CS when using bpf_iter_task_next") +int BPF_PROG(iter_tasks_without_lock) +{ + struct task_struct *pos; + + bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_PROCS) { + + } + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +__failure __msg("expected an RCU CS when using bpf_iter_css_next") +int BPF_PROG(iter_css_without_lock) +{ + u64 cg_id = bpf_get_current_cgroup_id(); + struct cgroup *cgrp = bpf_cgroup_from_id(cg_id); + struct cgroup_subsys_state *root_css, *pos; + + if (!cgrp) + return 0; + root_css = &cgrp->self; + + bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_POST) { + + } + bpf_cgroup_release(cgrp); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +__failure __msg("expected an RCU CS when using bpf_iter_task_next") +int BPF_PROG(iter_tasks_lock_and_unlock) +{ + struct task_struct *pos; + + bpf_rcu_read_lock(); + bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_PROCS) { + bpf_rcu_read_unlock(); + + bpf_rcu_read_lock(); + } + bpf_rcu_read_unlock(); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +__failure __msg("expected an RCU CS when using bpf_iter_css_next") +int BPF_PROG(iter_css_lock_and_unlock) +{ + u64 cg_id = bpf_get_current_cgroup_id(); + struct cgroup *cgrp = bpf_cgroup_from_id(cg_id); + struct cgroup_subsys_state *root_css, *pos; + + if (!cgrp) + return 0; + root_css = &cgrp->self; + + bpf_rcu_read_lock(); + bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_POST) { + bpf_rcu_read_unlock(); + + bpf_rcu_read_lock(); + } + bpf_rcu_read_unlock(); + bpf_cgroup_release(cgrp); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +__failure __msg("css_task_iter is only allowed in bpf_lsm and bpf iter-s") +int BPF_PROG(iter_css_task_for_each) +{ + u64 cg_id = bpf_get_current_cgroup_id(); + struct cgroup *cgrp = bpf_cgroup_from_id(cg_id); + struct cgroup_subsys_state *css; + struct task_struct *task; + + if (cgrp == NULL) + return 0; + css = &cgrp->self; + + bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS) { + + } + bpf_cgroup_release(cgrp); + return 0; +} -- 2.34.1
From: Masahiro Yamada <masahiroy@kernel.org> mainline inclusion from mainline-v6.7-rc1 commit 72d091846de935e0942a8a0f1fe24ff739d85d76 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- scripts/pahole-flags.sh is executed so many times. You can confirm it, as follows: $ cat <<EOF >> scripts/pahole-flags.sh
echo "scripts/pahole-flags.sh was executed" >&2 EOF
$ make -s scripts/pahole-flags.sh was executed scripts/pahole-flags.sh was executed scripts/pahole-flags.sh was executed scripts/pahole-flags.sh was executed scripts/pahole-flags.sh was executed [ lots of repeated lines... ] This scripts is executed more than 20 times during the kernel build because PAHOLE_FLAGS is a recursively expanded variable and exported to sub-processes. With GNU Make >= 4.4, it is executed more than 60 times because exported variables are also passed to other $(shell ) invocations. Without careful coding, it is known to cause an exponential fork explosion. [1] The use of $(shell ) in an exported recursive variable is likely wrong because $(shell ) is always evaluated due to the 'export' keyword, and the evaluation can occur multiple times by the nature of recursive variables. Convert the shell script to a Makefile, which is included only when CONFIG_DEBUG_INFO_BTF=y. [1]: https://savannah.gnu.org/bugs/index.php?64746 Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Nicolas Schier <n.schier@avm.de> Tested-by: Miguel Ojeda <ojeda@kernel.org> Acked-by: Miguel Ojeda <ojeda@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- MAINTAINERS | 2 +- Makefile | 4 +--- scripts/Makefile.btf | 19 +++++++++++++++++++ scripts/pahole-flags.sh | 30 ------------------------------ 4 files changed, 21 insertions(+), 34 deletions(-) create mode 100644 scripts/Makefile.btf delete mode 100755 scripts/pahole-flags.sh diff --git a/MAINTAINERS b/MAINTAINERS index 0bc3cd1ffc25..30523a5769c2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3774,7 +3774,7 @@ F: net/sched/act_bpf.c F: net/sched/cls_bpf.c F: samples/bpf/ F: scripts/bpf_doc.py -F: scripts/pahole-flags.sh +F: scripts/Makefile.btf F: scripts/pahole-version.sh F: tools/bpf/ F: tools/lib/bpf/ diff --git a/Makefile b/Makefile index 7c4955cf5f71..b62b39c18643 100644 --- a/Makefile +++ b/Makefile @@ -522,8 +522,6 @@ LZ4 = lz4 XZ = xz ZSTD = zstd -PAHOLE_FLAGS = $(shell PAHOLE=$(PAHOLE) $(srctree)/scripts/pahole-flags.sh) - CHECKFLAGS := -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ \ -Wbitwise -Wno-return-void -Wno-unknown-attribute $(CF) NOSTDINC_FLAGS := @@ -614,7 +612,6 @@ export KBUILD_RUSTFLAGS RUSTFLAGS_KERNEL RUSTFLAGS_MODULE export KBUILD_AFLAGS AFLAGS_KERNEL AFLAGS_MODULE export KBUILD_AFLAGS_MODULE KBUILD_CFLAGS_MODULE KBUILD_RUSTFLAGS_MODULE KBUILD_LDFLAGS_MODULE export KBUILD_AFLAGS_KERNEL KBUILD_CFLAGS_KERNEL KBUILD_RUSTFLAGS_KERNEL -export PAHOLE_FLAGS # Files to ignore in find ... statements @@ -1015,6 +1012,7 @@ KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=) # include additional Makefiles when needed include-y := scripts/Makefile.extrawarn include-$(CONFIG_DEBUG_INFO) += scripts/Makefile.debug +include-$(CONFIG_DEBUG_INFO_BTF)+= scripts/Makefile.btf include-$(CONFIG_KASAN) += scripts/Makefile.kasan include-$(CONFIG_KCSAN) += scripts/Makefile.kcsan include-$(CONFIG_KMSAN) += scripts/Makefile.kmsan diff --git a/scripts/Makefile.btf b/scripts/Makefile.btf new file mode 100644 index 000000000000..82377e470aed --- /dev/null +++ b/scripts/Makefile.btf @@ -0,0 +1,19 @@ +# SPDX-License-Identifier: GPL-2.0 + +pahole-ver := $(CONFIG_PAHOLE_VERSION) +pahole-flags-y := + +# pahole 1.18 through 1.21 can't handle zero-sized per-CPU vars +ifeq ($(call test-le, $(pahole-ver), 121),y) +pahole-flags-$(call test-ge, $(pahole-ver), 118) += --skip_encoding_btf_vars +endif + +pahole-flags-$(call test-ge, $(pahole-ver), 121) += --btf_gen_floats + +pahole-flags-$(call test-ge, $(pahole-ver), 122) += -j + +pahole-flags-$(CONFIG_PAHOLE_HAS_LANG_EXCLUDE) += --lang_exclude=rust + +pahole-flags-$(call test-ge, $(pahole-ver), 125) += --skip_encoding_btf_inconsistent_proto --btf_gen_optimized + +export PAHOLE_FLAGS := $(pahole-flags-y) diff --git a/scripts/pahole-flags.sh b/scripts/pahole-flags.sh deleted file mode 100755 index 728d55190d97..000000000000 --- a/scripts/pahole-flags.sh +++ /dev/null @@ -1,30 +0,0 @@ -#!/bin/sh -# SPDX-License-Identifier: GPL-2.0 - -extra_paholeopt= - -if ! [ -x "$(command -v ${PAHOLE})" ]; then - exit 0 -fi - -pahole_ver=$($(dirname $0)/pahole-version.sh ${PAHOLE}) - -if [ "${pahole_ver}" -ge "118" ] && [ "${pahole_ver}" -le "121" ]; then - # pahole 1.18 through 1.21 can't handle zero-sized per-CPU vars - extra_paholeopt="${extra_paholeopt} --skip_encoding_btf_vars" -fi -if [ "${pahole_ver}" -ge "121" ]; then - extra_paholeopt="${extra_paholeopt} --btf_gen_floats" -fi -if [ "${pahole_ver}" -ge "122" ]; then - extra_paholeopt="${extra_paholeopt} -j" -fi -if [ "${pahole_ver}" -ge "124" ]; then - # see PAHOLE_HAS_LANG_EXCLUDE - extra_paholeopt="${extra_paholeopt} --lang_exclude=rust" -fi -if [ "${pahole_ver}" -ge "125" ]; then - extra_paholeopt="${extra_paholeopt} --skip_encoding_btf_inconsistent_proto --btf_gen_optimized" -fi - -echo ${extra_paholeopt} -- 2.34.1
From: Matthieu Baerts <matttbe@kernel.org> mainline inclusion from mainline-v6.7-rc1 commit 05670f81d1287c40ec861186e4c4e3401013e7fb category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Our MPTCP CI complained [1] -- and KBuild too -- that it was no longer possible to build the kernel without CONFIG_CGROUPS: kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_new': kernel/bpf/task_iter.c:919:14: error: 'CSS_TASK_ITER_PROCS' undeclared (first use in this function) 919 | case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED: | ^~~~~~~~~~~~~~~~~~~ kernel/bpf/task_iter.c:919:14: note: each undeclared identifier is reported only once for each function it appears in kernel/bpf/task_iter.c:919:36: error: 'CSS_TASK_ITER_THREADED' undeclared (first use in this function) 919 | case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED: | ^~~~~~~~~~~~~~~~~~~~~~ kernel/bpf/task_iter.c:927:60: error: invalid application of 'sizeof' to incomplete type 'struct css_task_iter' 927 | kit->css_it = bpf_mem_alloc(&bpf_global_ma, sizeof(struct css_task_iter)); | ^~~~~~ kernel/bpf/task_iter.c:930:9: error: implicit declaration of function 'css_task_iter_start'; did you mean 'task_seq_start'? [-Werror=implicit-function-declaration] 930 | css_task_iter_start(css, flags, kit->css_it); | ^~~~~~~~~~~~~~~~~~~ | task_seq_start kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_next': kernel/bpf/task_iter.c:940:16: error: implicit declaration of function 'css_task_iter_next'; did you mean 'class_dev_iter_next'? [-Werror=implicit-function-declaration] 940 | return css_task_iter_next(kit->css_it); | ^~~~~~~~~~~~~~~~~~ | class_dev_iter_next kernel/bpf/task_iter.c:940:16: error: returning 'int' from a function with return type 'struct task_struct *' makes pointer from integer without a cast [-Werror=int-conversion] 940 | return css_task_iter_next(kit->css_it); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_destroy': kernel/bpf/task_iter.c:949:9: error: implicit declaration of function 'css_task_iter_end' [-Werror=implicit-function-declaration] 949 | css_task_iter_end(kit->css_it); | ^~~~~~~~~~~~~~~~~ This patch simply surrounds with a #ifdef the new code requiring CGroups support. It seems enough for the compiler and this is similar to bpf_iter_css_{new,next,destroy}() functions where no other #ifdef have been added in kernel/bpf/helpers.c and in the selftests. Fixes: 9c66dc94b62a ("bpf: Introduce css_task open-coded iterator kfuncs") Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/6665206927 Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310260528.aHWgVFqq-lkp@intel.com/ Signed-off-by: Matthieu Baerts <matttbe@kernel.org> [ added missing ifdefs for BTF_ID cgroup definitions ] Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231101181601.1493271-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: kernel/bpf/verifier.c [Fix conflict because missing some kfuncs.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 8 +++++--- kernel/bpf/task_iter.c | 4 ++++ kernel/bpf/verifier.c | 8 ++++++++ 3 files changed, 17 insertions(+), 3 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 112588659104..73e1e0b71ef3 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2674,15 +2674,17 @@ BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW | KF_RCU) BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY) +#ifdef CONFIG_CGROUPS BTF_ID_FLAGS(func, bpf_iter_css_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_iter_css_task_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY) -BTF_ID_FLAGS(func, bpf_iter_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_RCU_PROTECTED) -BTF_ID_FLAGS(func, bpf_iter_task_next, KF_ITER_NEXT | KF_RET_NULL) -BTF_ID_FLAGS(func, bpf_iter_task_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_iter_css_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_RCU_PROTECTED) BTF_ID_FLAGS(func, bpf_iter_css_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_css_destroy, KF_ITER_DESTROY) +#endif +BTF_ID_FLAGS(func, bpf_iter_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_RCU_PROTECTED) +BTF_ID_FLAGS(func, bpf_iter_task_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_task_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_dynptr_adjust) BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index a663c3d11b62..e0793906bf13 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -914,6 +914,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) __diag_pop(); +#ifdef CONFIG_CGROUPS + struct bpf_iter_css_task { __u64 __opaque[1]; } __attribute__((aligned(8))); @@ -972,6 +974,8 @@ __bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it) __diag_pop(); +#endif /* CONFIG_CGROUPS */ + struct bpf_iter_task { __u64 __opaque[3]; } __attribute__((aligned(8))); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0e7ed6b38547..5cfdff54ad9b 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5420,7 +5420,9 @@ static bool in_rcu_cs(struct bpf_verifier_env *env) /* Once GCC supports btf_type_tag the following mechanism will be replaced with tag check */ BTF_SET_START(rcu_protected_types) BTF_ID(struct, prog_test_ref_kfunc) +#ifdef CONFIG_CGROUPS BTF_ID(struct, cgroup) +#endif BTF_ID(struct, bpf_cpumask) BTF_ID(struct, task_struct) BTF_SET_END(rcu_protected_types) @@ -10885,7 +10887,9 @@ BTF_ID(func, bpf_get_skb_ethhdr) BTF_ID(func, bpf_handle_ingress_ptype) BTF_ID(func, bpf_handle_egress_ptype) #endif +#ifdef CONFIG_CGROUPS BTF_ID(func, bpf_iter_css_task_new) +#endif BTF_SET_END(special_kfunc_set) BTF_ID_LIST(special_kfunc_list) @@ -10916,7 +10920,11 @@ BTF_ID(func, bpf_get_skb_ethhdr) BTF_ID(func, bpf_handle_ingress_ptype) BTF_ID(func, bpf_handle_egress_ptype) #endif +#ifdef CONFIG_CGROUPS BTF_ID(func, bpf_iter_css_task_new) +#else +BTF_ID_UNUSED +#endif static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta) { -- 2.34.1
From: Dave Marchevsky <davemarchevsky@fb.com> mainline inclusion from mainline-v6.7-rc1 commit 391145ba2accc48b596f3d438af1a6255b62a555 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- BPF kfuncs are meant to be called from BPF programs. Accordingly, most kfuncs are not called from anywhere in the kernel, which the -Wmissing-prototypes warning is unhappy about. We've peppered __diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are defined in the codebase to suppress this warning. This patch adds two macros meant to bound one or many kfunc definitions. All existing kfunc definitions which use these __diag calls to suppress -Wmissing-prototypes are migrated to use the newly-introduced macros. A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the __bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier version of this patch [0] and another recent mailing list thread [1]. In the future we might need to ignore different warnings or do other kfunc-specific things. This change will make it easier to make such modifications for all kfunc defs. [0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1... [1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: kernel/bpf/helpers.c net/core/filter.c net/core/xdp.c [Fix conflicts because of missing some kfuncs.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Documentation/bpf/kfuncs.rst | 6 ++---- include/linux/btf.h | 9 +++++++++ kernel/bpf/bpf_iter.c | 6 ++---- kernel/bpf/cgroup_iter.c | 6 ++---- kernel/bpf/cpumask.c | 6 ++---- kernel/bpf/helpers.c | 6 ++---- kernel/bpf/map_iter.c | 6 ++---- kernel/bpf/task_iter.c | 18 ++++++------------ kernel/trace/bpf_trace.c | 6 ++---- net/bpf/test_run.c | 7 +++---- net/core/filter.c | 13 ++++--------- net/core/xdp.c | 6 ++---- net/ipv4/fou_bpf.c | 6 ++---- net/netfilter/nf_conntrack_bpf.c | 6 ++---- net/netfilter/nf_nat_bpf.c | 6 ++---- net/xfrm/xfrm_interface_bpf.c | 6 ++---- 16 files changed, 46 insertions(+), 73 deletions(-) diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst index 0d2647fb358d..723408e399ab 100644 --- a/Documentation/bpf/kfuncs.rst +++ b/Documentation/bpf/kfuncs.rst @@ -37,16 +37,14 @@ prototype in a header for the wrapper kfunc. An example is given below:: /* Disables missing prototype warnings */ - __diag_push(); - __diag_ignore_all("-Wmissing-prototypes", - "Global kfuncs as their definitions will be in BTF"); + __bpf_kfunc_start_defs(); __bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr) { return find_get_task_by_vpid(nr); } - __diag_pop(); + __bpf_kfunc_end_defs(); A wrapper kfunc is often needed when we need to annotate parameters of the kfunc. Otherwise one may directly make the kfunc visible to the BPF program by diff --git a/include/linux/btf.h b/include/linux/btf.h index c2231c64d60b..dc5ce962f600 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -84,6 +84,15 @@ */ #define __bpf_kfunc __used noinline +#define __bpf_kfunc_start_defs() \ + __diag_push(); \ + __diag_ignore_all("-Wmissing-declarations", \ + "Global kfuncs as their definitions will be in BTF");\ + __diag_ignore_all("-Wmissing-prototypes", \ + "Global kfuncs as their definitions will be in BTF") + +#define __bpf_kfunc_end_defs() __diag_pop() + /* * Return the name of the passed struct, if exists, or halt the build if for * example the structure gets renamed. In this way, developers have to revisit diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 833faa04461b..0fae79164187 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -782,9 +782,7 @@ struct bpf_iter_num_kern { int end; /* final value, exclusive */ } __aligned(8); -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc int bpf_iter_num_new(struct bpf_iter_num *it, int start, int end) { @@ -843,4 +841,4 @@ __bpf_kfunc void bpf_iter_num_destroy(struct bpf_iter_num *it) s->cur = s->end = 0; } -__diag_pop(); +__bpf_kfunc_end_defs(); diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c index 209e5135f9fb..d1b5c5618dd7 100644 --- a/kernel/bpf/cgroup_iter.c +++ b/kernel/bpf/cgroup_iter.c @@ -305,9 +305,7 @@ struct bpf_iter_css_kern { unsigned int flags; } __attribute__((aligned(8))); -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc int bpf_iter_css_new(struct bpf_iter_css *it, struct cgroup_subsys_state *start, unsigned int flags) @@ -358,4 +356,4 @@ __bpf_kfunc void bpf_iter_css_destroy(struct bpf_iter_css *it) { } -__diag_pop(); \ No newline at end of file +__bpf_kfunc_end_defs(); diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c index 6983af8e093c..e01c741e54e7 100644 --- a/kernel/bpf/cpumask.c +++ b/kernel/bpf/cpumask.c @@ -34,9 +34,7 @@ static bool cpu_valid(u32 cpu) return cpu < nr_cpu_ids; } -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global kfuncs as their definitions will be in BTF"); +__bpf_kfunc_start_defs(); /** * bpf_cpumask_create() - Create a mutable BPF cpumask. @@ -407,7 +405,7 @@ __bpf_kfunc u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, return cpumask_any_and_distribute(src1, src2); } -__diag_pop(); +__bpf_kfunc_end_defs(); BTF_SET8_START(cpumask_kfunc_btf_ids) BTF_ID_FLAGS(func, bpf_cpumask_create, KF_ACQUIRE | KF_RET_NULL) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 73e1e0b71ef3..a17fee069161 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2033,9 +2033,7 @@ void bpf_rb_root_free(const struct btf_field *field, void *rb_root, } } -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc void *bpf_obj_new_impl(u64 local_type_id__k, void *meta__ign) { @@ -2615,7 +2613,7 @@ __bpf_kfunc void bpf_rcu_read_unlock(void) rcu_read_unlock(); } -__diag_pop(); +__bpf_kfunc_end_defs(); BTF_SET8_START(generic_btf_ids) #ifdef CONFIG_KEXEC_CORE diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c index 6fc9dae9edc8..6abd7c5df4b3 100644 --- a/kernel/bpf/map_iter.c +++ b/kernel/bpf/map_iter.c @@ -193,9 +193,7 @@ static int __init bpf_map_iter_init(void) late_initcall(bpf_map_iter_init); -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc s64 bpf_map_sum_elem_count(const struct bpf_map *map) { @@ -213,7 +211,7 @@ __bpf_kfunc s64 bpf_map_sum_elem_count(const struct bpf_map *map) return ret; } -__diag_pop(); +__bpf_kfunc_end_defs(); BTF_SET8_START(bpf_map_iter_kfunc_ids) BTF_ID_FLAGS(func, bpf_map_sum_elem_count, KF_TRUSTED_ARGS) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index e0793906bf13..ea8d2cf3f52c 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -844,9 +844,7 @@ struct bpf_iter_task_vma_kern { struct bpf_iter_task_vma_kern_data *data; } __attribute__((aligned(8))); -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, struct task_struct *task, u64 addr) @@ -912,7 +910,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) } } -__diag_pop(); +__bpf_kfunc_end_defs(); #ifdef CONFIG_CGROUPS @@ -924,9 +922,7 @@ struct bpf_iter_css_task_kern { struct css_task_iter *css_it; } __attribute__((aligned(8))); -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc int bpf_iter_css_task_new(struct bpf_iter_css_task *it, struct cgroup_subsys_state *css, unsigned int flags) @@ -972,7 +968,7 @@ __bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it) bpf_mem_free(&bpf_global_ma, kit->css_it); } -__diag_pop(); +__bpf_kfunc_end_defs(); #endif /* CONFIG_CGROUPS */ @@ -995,9 +991,7 @@ enum { BPF_TASK_ITER_PROC_THREADS }; -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc int bpf_iter_task_new(struct bpf_iter_task *it, struct task_struct *task__nullable, unsigned int flags) @@ -1067,7 +1061,7 @@ __bpf_kfunc void bpf_iter_task_destroy(struct bpf_iter_task *it) { } -__diag_pop(); +__bpf_kfunc_end_defs(); DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work); diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 50556b239023..357c7febd8ba 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1256,9 +1256,7 @@ static const struct bpf_func_proto bpf_get_func_arg_cnt_proto = { }; #ifdef CONFIG_KEYS -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "kfuncs which will be used in BPF programs"); +__bpf_kfunc_start_defs(); /** * bpf_lookup_user_key - lookup a key by its serial @@ -1408,7 +1406,7 @@ __bpf_kfunc int bpf_verify_pkcs7_signature(struct bpf_dynptr_kern *data_ptr, } #endif /* CONFIG_SYSTEM_DATA_VERIFICATION */ -__diag_pop(); +__bpf_kfunc_end_defs(); BTF_SET8_START(key_sig_kfunc_set) BTF_ID_FLAGS(func, bpf_lookup_user_key, KF_ACQUIRE | KF_RET_NULL | KF_SLEEPABLE) diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 3a1c82b797a3..6ccb767169fc 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -504,9 +504,8 @@ static int bpf_test_finish(const union bpf_attr *kattr, * architecture dependent calling conventions. 7+ can be supported in the * future. */ -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); + __bpf_kfunc int bpf_fentry_test1(int a) { return a + 1; @@ -606,7 +605,7 @@ __bpf_kfunc void bpf_kfunc_call_memb_release(struct prog_test_member *p) { } -__diag_pop(); +__bpf_kfunc_end_defs(); BTF_SET8_START(bpf_test_modify_return_ids) BTF_ID_FLAGS(func, bpf_modify_return_test) diff --git a/net/core/filter.c b/net/core/filter.c index f91de361383a..fe54ad9b0e8f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -12103,9 +12103,7 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id) return func; } -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); __bpf_kfunc int bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr_kern *ptr__uninit) { @@ -12258,7 +12256,7 @@ __bpf_kfunc void bpf_handle_egress_ptype(struct __sk_buff *skb_ctx) rcu_read_unlock(); } #endif -__diag_pop(); +__bpf_kfunc_end_defs(); int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags, struct bpf_dynptr_kern *ptr__uninit) @@ -12342,10 +12340,7 @@ static int __init bpf_kfunc_init(void) } late_initcall(bpf_kfunc_init); -/* Disables missing prototype warnings */ -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); /* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code. * @@ -12379,7 +12374,7 @@ __bpf_kfunc int bpf_sock_destroy(struct sock_common *sock) return sk->sk_prot->diag_destroy(sk, ECONNABORTED); } -__diag_pop() +__bpf_kfunc_end_defs(); BTF_SET8_START(bpf_sk_iter_kfunc_ids) BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS) diff --git a/net/core/xdp.c b/net/core/xdp.c index 5ee3f8f165e5..1642222e350b 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -692,9 +692,7 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf) return nxdpf; } -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in vmlinux BTF"); +__bpf_kfunc_start_defs(); /** * bpf_xdp_metadata_rx_timestamp - Read XDP frame RX timestamp. @@ -734,7 +732,7 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash, return -EOPNOTSUPP; } -__diag_pop(); +__bpf_kfunc_end_defs(); BTF_SET8_START(xdp_metadata_kfunc_ids) #define XDP_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, KF_TRUSTED_ARGS) diff --git a/net/ipv4/fou_bpf.c b/net/ipv4/fou_bpf.c index 3760a14b6b57..4da03bf45c9b 100644 --- a/net/ipv4/fou_bpf.c +++ b/net/ipv4/fou_bpf.c @@ -22,9 +22,7 @@ enum bpf_fou_encap_type { FOU_BPF_ENCAP_GUE, }; -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in BTF"); +__bpf_kfunc_start_defs(); /* bpf_skb_set_fou_encap - Set FOU encap parameters * @@ -100,7 +98,7 @@ __bpf_kfunc int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx, return 0; } -__diag_pop() +__bpf_kfunc_end_defs(); BTF_SET8_START(fou_kfunc_set) BTF_ID_FLAGS(func, bpf_skb_set_fou_encap) diff --git a/net/netfilter/nf_conntrack_bpf.c b/net/netfilter/nf_conntrack_bpf.c index b21799d468d2..475358ec8212 100644 --- a/net/netfilter/nf_conntrack_bpf.c +++ b/net/netfilter/nf_conntrack_bpf.c @@ -230,9 +230,7 @@ static int _nf_conntrack_btf_struct_access(struct bpf_verifier_log *log, return 0; } -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in nf_conntrack BTF"); +__bpf_kfunc_start_defs(); /* bpf_xdp_ct_alloc - Allocate a new CT entry * @@ -467,7 +465,7 @@ __bpf_kfunc int bpf_ct_change_status(struct nf_conn *nfct, u32 status) return nf_ct_change_status_common(nfct, status); } -__diag_pop() +__bpf_kfunc_end_defs(); BTF_SET8_START(nf_ct_kfunc_set) BTF_ID_FLAGS(func, bpf_xdp_ct_alloc, KF_ACQUIRE | KF_RET_NULL) diff --git a/net/netfilter/nf_nat_bpf.c b/net/netfilter/nf_nat_bpf.c index 141ee7783223..6e3b2f58855f 100644 --- a/net/netfilter/nf_nat_bpf.c +++ b/net/netfilter/nf_nat_bpf.c @@ -12,9 +12,7 @@ #include <net/netfilter/nf_conntrack_core.h> #include <net/netfilter/nf_nat.h> -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in nf_nat BTF"); +__bpf_kfunc_start_defs(); /* bpf_ct_set_nat_info - Set source or destination nat address * @@ -54,7 +52,7 @@ __bpf_kfunc int bpf_ct_set_nat_info(struct nf_conn___init *nfct, return nf_nat_setup_info(ct, &range, manip) == NF_DROP ? -ENOMEM : 0; } -__diag_pop() +__bpf_kfunc_end_defs(); BTF_SET8_START(nf_nat_kfunc_set) BTF_ID_FLAGS(func, bpf_ct_set_nat_info, KF_TRUSTED_ARGS) diff --git a/net/xfrm/xfrm_interface_bpf.c b/net/xfrm/xfrm_interface_bpf.c index d74f3fd20f2b..7d5e920141e9 100644 --- a/net/xfrm/xfrm_interface_bpf.c +++ b/net/xfrm/xfrm_interface_bpf.c @@ -27,9 +27,7 @@ struct bpf_xfrm_info { int link; }; -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in xfrm_interface BTF"); +__bpf_kfunc_start_defs(); /* bpf_skb_get_xfrm_info - Get XFRM metadata * @@ -93,7 +91,7 @@ __bpf_kfunc int bpf_skb_set_xfrm_info(struct __sk_buff *skb_ctx, const struct bp return 0; } -__diag_pop() +__bpf_kfunc_end_defs(); BTF_SET8_START(xfrm_ifc_kfunc_set) BTF_ID_FLAGS(func, bpf_skb_get_xfrm_info) -- 2.34.1
From: Dave Marchevsky <davemarchevsky@fb.com> mainline inclusion from mainline-v6.7-rc1 commit 15fb6f2b6c4c3c129adc2412ae12ec15e60a6adb category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Not all uses of __diag_ignore_all(...) in BPF-related code in order to suppress warnings are wrapping kfunc definitions. Some "hook point" definitions - small functions meant to be used as attach points for fentry and similar BPF progs - need to suppress -Wmissing-declarations. We could use __bpf_kfunc_{start,end}_defs added in the previous patch in such cases, but this might be confusing to someone unfamiliar with BPF internals. Instead, this patch adds __bpf_hook_{start,end} macros, currently having the same effect as __bpf_kfunc_{start,end}_defs, then uses them to suppress warnings for two hook points in the kernel itself and some bpf_testmod hook points as well. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Cc: Yafang Shao <laoar.shao@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20231031215625.2343848-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf.h | 2 ++ kernel/cgroup/rstat.c | 9 +++------ net/socket.c | 8 ++------ tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 6 ++---- 4 files changed, 9 insertions(+), 16 deletions(-) diff --git a/include/linux/btf.h b/include/linux/btf.h index dc5ce962f600..59d404e22814 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -92,6 +92,8 @@ "Global kfuncs as their definitions will be in BTF") #define __bpf_kfunc_end_defs() __diag_pop() +#define __bpf_hook_start() __bpf_kfunc_start_defs() +#define __bpf_hook_end() __bpf_kfunc_end_defs() /* * Return the name of the passed struct, if exists, or halt the build if for diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 329c10c084a2..c6d6bec9bcce 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -156,19 +156,16 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, * optimize away the callsite. Therefore, __weak is needed to ensure that the * call is still emitted, by telling the compiler that we don't know what the * function might eventually be. - * - * __diag_* below are needed to dismiss the missing prototype warning. */ -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "kfuncs which will be used in BPF programs"); + +__bpf_hook_start(); __weak noinline void bpf_rstat_flush(struct cgroup *cgrp, struct cgroup *parent, int cpu) { } -__diag_pop(); +__bpf_hook_end(); /* see cgroup_rstat_flush() */ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) diff --git a/net/socket.c b/net/socket.c index 23c08b0d2899..15b696635e43 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1696,20 +1696,16 @@ struct file *__sys_socket_file(int family, int type, int protocol) * Therefore, __weak is needed to ensure that the call is still * emitted, by telling the compiler that we don't know what the * function might eventually be. - * - * __diag_* below are needed to dismiss the missing prototype warning. */ -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "A fmod_ret entry point for BPF programs"); +__bpf_hook_start(); __weak noinline int update_socket_protocol(int family, int type, int protocol) { return protocol; } -__diag_pop(); +__bpf_hook_end(); int __sys_socket(int family, int type, int protocol) { diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 2e8adf059fa3..1b8589143bd3 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -40,9 +40,7 @@ struct bpf_testmod_struct_arg_4 { int b; }; -__diag_push(); -__diag_ignore_all("-Wmissing-prototypes", - "Global functions as their definitions will be in bpf_testmod.ko BTF"); +__bpf_hook_start(); noinline int bpf_testmod_test_struct_arg_1(struct bpf_testmod_struct_arg_2 a, int b, int c) { @@ -332,7 +330,7 @@ noinline int bpf_fentry_shadow_test(int a) } EXPORT_SYMBOL_GPL(bpf_fentry_shadow_test); -__diag_pop(); +__bpf_hook_end(); static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = { .attr = { .name = "bpf_testmod", .mode = 0666, }, -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit 3091b667498b0a212e760e1033e5f9b8c33a948f category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The newly added open-coded css_task iter would try to hold the global css_set_lock in bpf_iter_css_task_new, so the bpf side has to be careful in where it allows to use this iter. The mainly concern is dead locking on css_set_lock. check_css_task_iter_allowlist() in verifier enforced css_task can only be used in bpf_lsm hooks and sleepable bpf_iter. This patch relax the allowlist for css_task iter. Any lsm and any iter (even non-sleepable) and any sleepable are safe since they would not hold the css_set_lock before entering BPF progs context. This patch also fixes the misused BPF_TRACE_ITER in check_css_task_iter_allowlist which compared bpf_prog_type with bpf_attach_type. Fixes: 9c66dc94b62ae ("bpf: Introduce css_task open-coded iterator kfuncs") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231031050438.93297-2-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/verifier.c | 16 ++++++++++++---- .../selftests/bpf/progs/iters_task_failure.c | 4 ++-- 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 5cfdff54ad9b..2e03b5bad9b4 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -11451,6 +11451,12 @@ static int process_kf_arg_ptr_to_rbtree_node(struct bpf_verifier_env *env, &meta->arg_rbtree_root.field); } +/* + * css_task iter allowlist is needed to avoid dead locking on css_set_lock. + * LSM hooks and iters (both sleepable and non-sleepable) are safe. + * Any sleepable progs are also safe since bpf_check_attach_target() enforce + * them can only be attached to some specific hook points. + */ static bool check_css_task_iter_allowlist(struct bpf_verifier_env *env) { enum bpf_prog_type prog_type = resolve_prog_type(env->prog); @@ -11458,10 +11464,12 @@ static bool check_css_task_iter_allowlist(struct bpf_verifier_env *env) switch (prog_type) { case BPF_PROG_TYPE_LSM: return true; - case BPF_TRACE_ITER: - return env->prog->aux->sleepable; + case BPF_PROG_TYPE_TRACING: + if (env->prog->expected_attach_type == BPF_TRACE_ITER) + return true; + fallthrough; default: - return false; + return env->prog->aux->sleepable; } } @@ -11711,7 +11719,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_ case KF_ARG_PTR_TO_ITER: if (meta->func_id == special_kfunc_list[KF_bpf_iter_css_task_new]) { if (!check_css_task_iter_allowlist(env)) { - verbose(env, "css_task_iter is only allowed in bpf_lsm and bpf iter-s\n"); + verbose(env, "css_task_iter is only allowed in bpf_lsm, bpf_iter and sleepable progs\n"); return -EINVAL; } } diff --git a/tools/testing/selftests/bpf/progs/iters_task_failure.c b/tools/testing/selftests/bpf/progs/iters_task_failure.c index c3bf96a67dba..6b1588d70652 100644 --- a/tools/testing/selftests/bpf/progs/iters_task_failure.c +++ b/tools/testing/selftests/bpf/progs/iters_task_failure.c @@ -84,8 +84,8 @@ int BPF_PROG(iter_css_lock_and_unlock) return 0; } -SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") -__failure __msg("css_task_iter is only allowed in bpf_lsm and bpf iter-s") +SEC("?fentry/" SYS_PREFIX "sys_getpgid") +__failure __msg("css_task_iter is only allowed in bpf_lsm, bpf_iter and sleepable progs") int BPF_PROG(iter_css_task_for_each) { u64 cg_id = bpf_get_current_cgroup_id(); -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit f49843afde6771ef6ed5d021eacafacfc98a58bf category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This patch adds a test which demonstrates how css_task iter can be combined with cgroup iter and it won't cause deadlock, though cgroup iter is not sleepable. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231031050438.93297-3-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/cgroup_iter.c | 33 ++++++++++++++ .../selftests/bpf/progs/iters_css_task.c | 44 +++++++++++++++++++ 2 files changed, 77 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c index e02feb5fae97..574d9a0cdc8e 100644 --- a/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter.c @@ -4,6 +4,7 @@ #include <test_progs.h> #include <bpf/libbpf.h> #include <bpf/btf.h> +#include "iters_css_task.skel.h" #include "cgroup_iter.skel.h" #include "cgroup_helpers.h" @@ -263,6 +264,35 @@ static void test_walk_dead_self_only(struct cgroup_iter *skel) close(cgrp_fd); } +static void test_walk_self_only_css_task(void) +{ + struct iters_css_task *skel; + int err; + + skel = iters_css_task__open(); + if (!ASSERT_OK_PTR(skel, "skel_open")) + return; + + bpf_program__set_autoload(skel->progs.cgroup_id_printer, true); + + err = iters_css_task__load(skel); + if (!ASSERT_OK(err, "skel_load")) + goto cleanup; + + err = join_cgroup(cg_path[CHILD2]); + if (!ASSERT_OK(err, "join_cgroup")) + goto cleanup; + + skel->bss->target_pid = getpid(); + snprintf(expected_output, sizeof(expected_output), + PROLOGUE "%8llu\n" EPILOGUE, cg_id[CHILD2]); + read_from_cgroup_iter(skel->progs.cgroup_id_printer, cg_fd[CHILD2], + BPF_CGROUP_ITER_SELF_ONLY, "test_walk_self_only_css_task"); + ASSERT_EQ(skel->bss->css_task_cnt, 1, "css_task_cnt"); +cleanup: + iters_css_task__destroy(skel); +} + void test_cgroup_iter(void) { struct cgroup_iter *skel = NULL; @@ -293,6 +323,9 @@ void test_cgroup_iter(void) test_walk_self_only(skel); if (test__start_subtest("cgroup_iter__dead_self_only")) test_walk_dead_self_only(skel); + if (test__start_subtest("cgroup_iter__self_only_css_task")) + test_walk_self_only_css_task(); + out: cgroup_iter__destroy(skel); cleanup_cgroups(); diff --git a/tools/testing/selftests/bpf/progs/iters_css_task.c b/tools/testing/selftests/bpf/progs/iters_css_task.c index 5089ce384a1c..384ff806990f 100644 --- a/tools/testing/selftests/bpf/progs/iters_css_task.c +++ b/tools/testing/selftests/bpf/progs/iters_css_task.c @@ -10,6 +10,7 @@ char _license[] SEC("license") = "GPL"; +struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym; struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; void bpf_cgroup_release(struct cgroup *p) __ksym; @@ -45,3 +46,46 @@ int BPF_PROG(iter_css_task_for_each, struct vm_area_struct *vma, return -EPERM; } + +static inline u64 cgroup_id(struct cgroup *cgrp) +{ + return cgrp->kn->id; +} + +SEC("?iter/cgroup") +int cgroup_id_printer(struct bpf_iter__cgroup *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + struct cgroup *cgrp, *acquired; + struct cgroup_subsys_state *css; + struct task_struct *task; + u64 cgrp_id; + + cgrp = ctx->cgroup; + + /* epilogue */ + if (cgrp == NULL) { + BPF_SEQ_PRINTF(seq, "epilogue\n"); + return 0; + } + + /* prologue */ + if (ctx->meta->seq_num == 0) + BPF_SEQ_PRINTF(seq, "prologue\n"); + + cgrp_id = cgroup_id(cgrp); + + BPF_SEQ_PRINTF(seq, "%8llu\n", cgrp_id); + + acquired = bpf_cgroup_from_id(cgrp_id); + if (!acquired) + return 0; + css = &acquired->self; + css_task_cnt = 0; + bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS) { + if (task->pid == target_pid) + css_task_cnt++; + } + bpf_cgroup_release(acquired); + return 0; +} -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit d8234d47c4aa494d789b85562fa90e837b4575f9 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This Patch add a test to prove css_task iter can be used in normal sleepable progs. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231031050438.93297-4-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../testing/selftests/bpf/prog_tests/iters.c | 1 + .../selftests/bpf/progs/iters_css_task.c | 19 +++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/iters.c b/tools/testing/selftests/bpf/prog_tests/iters.c index c2425791c923..bf84d4a1d9ae 100644 --- a/tools/testing/selftests/bpf/prog_tests/iters.c +++ b/tools/testing/selftests/bpf/prog_tests/iters.c @@ -294,6 +294,7 @@ void test_iters(void) RUN_TESTS(iters_state_safety); RUN_TESTS(iters_looping); RUN_TESTS(iters); + RUN_TESTS(iters_css_task); if (env.has_testmod) RUN_TESTS(iters_testmod_seq); diff --git a/tools/testing/selftests/bpf/progs/iters_css_task.c b/tools/testing/selftests/bpf/progs/iters_css_task.c index 384ff806990f..e180aa1b1d52 100644 --- a/tools/testing/selftests/bpf/progs/iters_css_task.c +++ b/tools/testing/selftests/bpf/progs/iters_css_task.c @@ -89,3 +89,22 @@ int cgroup_id_printer(struct bpf_iter__cgroup *ctx) bpf_cgroup_release(acquired); return 0; } + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +int BPF_PROG(iter_css_task_for_each_sleep) +{ + u64 cgrp_id = bpf_get_current_cgroup_id(); + struct cgroup *cgrp = bpf_cgroup_from_id(cgrp_id); + struct cgroup_subsys_state *css; + struct task_struct *task; + + if (cgrp == NULL) + return 0; + css = &cgrp->self; + + bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS) { + + } + bpf_cgroup_release(cgrp); + return 0; +} -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit 0de4f50de25af79c2a46db55d70cdbd8f985c6d1 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task) in verifier.c wanted to teach BPF verifier that bpf_iter__task -> task is a trusted ptr. But it doesn't work well. The reason is, bpf_iter__task -> task would go through btf_ctx_access() which enforces the reg_type of 'task' is ctx_arg_info->reg_type, and in task_iter.c, we actually explicitly declare that the ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL. Actually we have a previous case like this[1] where PTR_TRUSTED is added to the arg flag for map_iter. This patch sets ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED in task_reg_info. Similarly, bpf_cgroup_reg_info -> cgroup is also PTR_TRUSTED since we are under the protection of cgroup_mutex and we would check cgroup_is_dead() in __cgroup_iter_seq_show(). This patch is to improve the user experience of the newly introduced bpf_iter_css_task kfunc before hitting the mainline. The Fixes tag is pointing to the commit introduced the bpf_iter_css_task kfunc. Link[1]:https://lore.kernel.org/all/20230706133932.45883-3-aspsk@isovalent.com/ Fixes: 9c66dc94b62a ("bpf: Introduce css_task open-coded iterator kfuncs") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231107132204.912120-2-zhouchuyi@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/cgroup_iter.c | 2 +- kernel/bpf/task_iter.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c index d1b5c5618dd7..f04a468cf6a7 100644 --- a/kernel/bpf/cgroup_iter.c +++ b/kernel/bpf/cgroup_iter.c @@ -282,7 +282,7 @@ static struct bpf_iter_reg bpf_cgroup_reg_info = { .ctx_arg_info_size = 1, .ctx_arg_info = { { offsetof(struct bpf_iter__cgroup, cgroup), - PTR_TO_BTF_ID_OR_NULL }, + PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED }, }, .seq_info = &cgroup_iter_seq_info, }; diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index ea8d2cf3f52c..145c506899e4 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -726,7 +726,7 @@ static struct bpf_iter_reg task_reg_info = { .ctx_arg_info_size = 1, .ctx_arg_info = { { offsetof(struct bpf_iter__task, task), - PTR_TO_BTF_ID_OR_NULL }, + PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED }, }, .seq_info = &task_seq_info, .fill_link_info = bpf_iter_fill_link_info, -- 2.34.1
From: Chuyi Zhou <zhouchuyi@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit 3c5864ba9cf912ff9809f315d28f296f21563cce category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Commit f49843afde (selftests/bpf: Add tests for css_task iter combining with cgroup iter) added a test which demonstrates how css_task iter can be combined with cgroup iter. That test used bpf_cgroup_from_id() to convert bpf_iter__cgroup->cgroup to a trusted ptr which is pointless now, since with the previous fix, we can get a trusted cgroup directly from bpf_iter__cgroup. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231107132204.912120-3-zhouchuyi@bytedance.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../testing/selftests/bpf/progs/iters_css_task.c | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/iters_css_task.c b/tools/testing/selftests/bpf/progs/iters_css_task.c index e180aa1b1d52..9ac758649cb8 100644 --- a/tools/testing/selftests/bpf/progs/iters_css_task.c +++ b/tools/testing/selftests/bpf/progs/iters_css_task.c @@ -56,12 +56,9 @@ SEC("?iter/cgroup") int cgroup_id_printer(struct bpf_iter__cgroup *ctx) { struct seq_file *seq = ctx->meta->seq; - struct cgroup *cgrp, *acquired; + struct cgroup *cgrp = ctx->cgroup; struct cgroup_subsys_state *css; struct task_struct *task; - u64 cgrp_id; - - cgrp = ctx->cgroup; /* epilogue */ if (cgrp == NULL) { @@ -73,20 +70,15 @@ int cgroup_id_printer(struct bpf_iter__cgroup *ctx) if (ctx->meta->seq_num == 0) BPF_SEQ_PRINTF(seq, "prologue\n"); - cgrp_id = cgroup_id(cgrp); - - BPF_SEQ_PRINTF(seq, "%8llu\n", cgrp_id); + BPF_SEQ_PRINTF(seq, "%8llu\n", cgroup_id(cgrp)); - acquired = bpf_cgroup_from_id(cgrp_id); - if (!acquired) - return 0; - css = &acquired->self; + css = &cgrp->self; css_task_cnt = 0; bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS) { if (task->pid == target_pid) css_task_cnt++; } - bpf_cgroup_release(acquired); + return 0; } -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 491dd8edecbc5027ee317f3f1e7e9800fb66d88f category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- We have the name, instead of emitting just func#N to identify global subprog, augment verifier log messages with actual function name to make it more user-friendly. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231124035937.403208-2-andrii@kernel.org Conflicts: kernel/bpf/verifier.c [Fix conflicts because of cve patches and missing previous patches.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/verifier.c | 35 ++++++++++++++++++++++++----------- 1 file changed, 24 insertions(+), 11 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 2e03b5bad9b4..747c77b7fc75 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -339,6 +339,11 @@ struct bpf_kfunc_call_arg_meta { struct btf *btf_vmlinux; +static const char *btf_type_name(const struct btf *btf, u32 id) +{ + return btf_name_by_offset(btf, btf_type_by_id(btf, id)->name_off); +} + static DEFINE_MUTEX(bpf_verifier_lock); static const struct bpf_line_info * @@ -503,6 +508,17 @@ static bool subprog_is_global(const struct bpf_verifier_env *env, int subprog) return aux && aux[subprog].linkage == BTF_FUNC_GLOBAL; } +static const char *subprog_name(const struct bpf_verifier_env *env, int subprog) +{ + struct bpf_func_info *info; + + if (!env->prog->aux->func_info) + return ""; + + info = &env->prog->aux->func_info[subprog]; + return btf_type_name(env->prog->aux->btf, info->type_id); +} + static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg) { return btf_record_has_field(reg_btf_record(reg), BPF_SPIN_LOCK); @@ -748,11 +764,6 @@ static int iter_get_spi(struct bpf_verifier_env *env, struct bpf_reg_state *reg, return stack_slot_obj_get_spi(env, reg, "iter", nr_slots); } -static const char *btf_type_name(const struct btf *btf, u32 id) -{ - return btf_name_by_offset(btf, btf_type_by_id(btf, id)->name_off); -} - static const char *dynptr_type_str(enum bpf_dynptr_type type) { switch (type) { @@ -9472,13 +9483,16 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, if (err == -EFAULT) return err; if (subprog_is_global(env, subprog)) { + const char *sub_name = subprog_name(env, subprog); + if (err) { - verbose(env, "Caller passes invalid args into func#%d\n", subprog); + verbose(env, "Caller passes invalid args into func#%d ('%s')\n", + subprog, sub_name); return err; } - if (env->log.level & BPF_LOG_LEVEL) - verbose(env, "Func#%d is global and valid. Skipping.\n", subprog); + verbose(env, "Func#%d ('%s') is global and assumed valid.\n", + subprog, sub_name); if (env->subprog_info[subprog].changes_pkt_data) clear_all_pkt_pointers(env); clear_caller_saved_regs(env, caller->regs); @@ -20117,9 +20131,8 @@ static int do_check_subprogs(struct bpf_verifier_env *env) if (ret) { return ret; } else if (env->log.level & BPF_LOG_LEVEL) { - verbose(env, - "Func#%d is safe for any args that match its prototype\n", - i); + verbose(env, "Func#%d ('%s') is safe for any args that match its prototype\n", + i, subprog_name(env, i)); } } return 0; -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 2afae08c9dcb8ac648414277cec70c2fe6a34d9e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Slightly change BPF verifier logic around eagerness and order of global subprog validation. Instead of going over every global subprog eagerly and validating it before main (entry) BPF program is verified, turn it around. Validate main program first, mark subprogs that were called from main program for later verification, but otherwise assume it is valid. Afterwards, go over marked global subprogs and validate those, potentially marking some more global functions as being called. Continue this process until all (transitively) callable global subprogs are validated. It's a BFS traversal at its heart and will always converge. This is an important change because it allows to feature-gate some subprograms that might not be verifiable on some older kernel, depending on supported set of features. E.g., at some point, global functions were allowed to accept a pointer to memory, which size is identified by user-provided type. Unfortunately, older kernels don't support this feature. With BPF CO-RE approach, the natural way would be to still compile BPF object file once and guard calls to this global subprog with some CO-RE check or using .rodata variables. That's what people do to guard usage of new helpers or kfuncs, and any other new BPF-side feature that might be missing on old kernels. That's currently impossible to do with global subprogs, unfortunately, because they are eagerly and unconditionally validated. This patch set aims to change this, so that in the future when global funcs gain new features, those can be guarded using BPF CO-RE techniques in the same fashion as any other new kernel feature. Two selftests had to be adjusted in sync with these changes. test_global_func12 relied on eager global subprog validation failing before main program failure is detected (unknown return value). Fix by making sure that main program is always valid. verifier_subprog_precision's parent_stack_slot_precise subtest relied on verifier checkpointing heuristic to do a checkpoint at instruction #5, but that's no longer true because we don't have enough jumps validated before reaching insn #5 due to global subprogs being validated later. Other than that, no changes, as one would expect. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231124035937.403208-3-andrii@kernel.org Conflicts: kernel/bpf/verifier.c [Fix conflicts because of missing pre-patches BPF exceptions and cve.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 2 + kernel/bpf/verifier.c | 43 ++++++++++++++++--- .../selftests/bpf/progs/test_global_func12.c | 4 +- .../bpf/progs/verifier_subprog_precision.c | 4 +- 4 files changed, 43 insertions(+), 10 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index d6498fbda50a..3c8853d9c255 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1434,6 +1434,8 @@ static inline bool bpf_prog_has_trampoline(const struct bpf_prog *prog) struct bpf_func_info_aux { u16 linkage; bool unreliable; + bool called : 1; + bool verified : 1; }; enum bpf_jit_poke_reason { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 747c77b7fc75..8c810642d24d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -519,6 +519,11 @@ static const char *subprog_name(const struct bpf_verifier_env *env, int subprog) return btf_type_name(env->prog->aux->btf, info->type_id); } +static struct bpf_func_info_aux *subprog_aux(const struct bpf_verifier_env *env, int subprog) +{ + return &env->prog->aux->func_info_aux[subprog]; +} + static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg) { return btf_record_has_field(reg_btf_record(reg), BPF_SPIN_LOCK); @@ -9495,6 +9500,8 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, subprog, sub_name); if (env->subprog_info[subprog].changes_pkt_data) clear_all_pkt_pointers(env); + /* mark global subprog for verifying after main prog */ + subprog_aux(env, subprog)->called = true; clear_caller_saved_regs(env, caller->regs); /* All global functions return a 64-bit SCALAR_VALUE */ @@ -20097,8 +20104,11 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog) return ret; } -/* Verify all global functions in a BPF program one by one based on their BTF. - * All global functions must pass verification. Otherwise the whole program is rejected. +/* Lazily verify all global functions based on their BTF, if they are called + * from main BPF program or any of subprograms transitively. + * BPF global subprogs called from dead code are not validated. + * All callable global functions must pass verification. + * Otherwise the whole program is rejected. * Consider: * int bar(int); * int foo(int f) @@ -20117,13 +20127,20 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog) static int do_check_subprogs(struct bpf_verifier_env *env) { struct bpf_prog_aux *aux = env->prog->aux; - int i, ret; + struct bpf_func_info_aux *sub_aux; + int i, ret, new_cnt; if (!aux->func_info) return 0; +again: + new_cnt = 0; for (i = 1; i < env->subprog_cnt; i++) { - if (aux->func_info_aux[i].linkage != BTF_FUNC_GLOBAL) + if (!subprog_is_global(env, i)) + continue; + + sub_aux = subprog_aux(env, i); + if (!sub_aux->called || sub_aux->verified) continue; env->insn_idx = env->subprog_info[i].start; WARN_ON_ONCE(env->insn_idx == 0); @@ -20134,7 +20151,21 @@ static int do_check_subprogs(struct bpf_verifier_env *env) verbose(env, "Func#%d ('%s') is safe for any args that match its prototype\n", i, subprog_name(env, i)); } + + /* We verified new global subprog, it might have called some + * more global subprogs that we haven't verified yet, so we + * need to do another pass over subprogs to verify those. + */ + sub_aux->verified = true; + new_cnt++; } + + /* We can't loop forever as we verify at least one global subprog on + * each pass. + */ + if (new_cnt) + goto again; + return 0; } @@ -20833,8 +20864,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3 if (ret) goto skip_full_check; - ret = do_check_subprogs(env); - ret = ret ?: do_check_main(env); + ret = do_check_main(env); + ret = ret ?: do_check_subprogs(env); if (ret == 0 && bpf_prog_is_offloaded(env->prog->aux)) ret = bpf_prog_offload_finalize(env); diff --git a/tools/testing/selftests/bpf/progs/test_global_func12.c b/tools/testing/selftests/bpf/progs/test_global_func12.c index 7f159d83c6f6..6e03d42519a6 100644 --- a/tools/testing/selftests/bpf/progs/test_global_func12.c +++ b/tools/testing/selftests/bpf/progs/test_global_func12.c @@ -19,5 +19,7 @@ int global_func12(struct __sk_buff *skb) { const struct S s = {.x = skb->len }; - return foo(&s); + foo(&s); + + return 1; } diff --git a/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c b/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c index 4b8b0f45d17d..83d187de7f6b 100644 --- a/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c +++ b/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c @@ -370,12 +370,10 @@ __naked int parent_stack_slot_precise(void) SEC("?raw_tp") __success __log_level(2) __msg("9: (0f) r1 += r6") -__msg("mark_precise: frame0: last_idx 9 first_idx 6") +__msg("mark_precise: frame0: last_idx 9 first_idx 0") __msg("mark_precise: frame0: regs=r6 stack= before 8: (bf) r1 = r7") __msg("mark_precise: frame0: regs=r6 stack= before 7: (27) r6 *= 4") __msg("mark_precise: frame0: regs=r6 stack= before 6: (79) r6 = *(u64 *)(r10 -8)") -__msg("mark_precise: frame0: parent state regs= stack=-8:") -__msg("mark_precise: frame0: last_idx 5 first_idx 0") __msg("mark_precise: frame0: regs= stack=-8 before 5: (85) call pc+6") __msg("mark_precise: frame0: regs= stack=-8 before 4: (b7) r1 = 0") __msg("mark_precise: frame0: regs= stack=-8 before 3: (7b) *(u64 *)(r10 -8) = r6") -- 2.34.1
From: Luo Gengkun <luogengkun2@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- Fix kabi-breakage by using KABI_FILL_HOLE because there is some hole in bpf_func_info_aux. Fixes: a88fed0a5443 ("bpf: Validate global subprogs lazily") Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 3c8853d9c255..511168c6dc2f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1434,8 +1434,8 @@ static inline bool bpf_prog_has_trampoline(const struct bpf_prog *prog) struct bpf_func_info_aux { u16 linkage; bool unreliable; - bool called : 1; - bool verified : 1; + KABI_FILL_HOLE(bool called : 1) + KABI_FILL_HOLE(bool verified : 1) }; enum bpf_jit_poke_reason { -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit e8a339b5235e294f29153149ea7cf26a9a87dbea category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Add a few test that validate BPF verifier's lazy approach to validating global subprogs. We check that global subprogs that are called transitively through another global subprog is validated. We also check that invalid global subprog is not validated, if it's not called from the main program. And we also check that main program is always validated first, before any of the subprogs. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231124035937.403208-4-andrii@kernel.org Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/verifier.c | 2 + .../bpf/progs/verifier_global_subprogs.c | 92 +++++++++++++++++++ 2 files changed, 94 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/verifier_global_subprogs.c diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c index 5cfa7a6316b6..8d746642cbd7 100644 --- a/tools/testing/selftests/bpf/prog_tests/verifier.c +++ b/tools/testing/selftests/bpf/prog_tests/verifier.c @@ -25,6 +25,7 @@ #include "verifier_direct_stack_access_wraparound.skel.h" #include "verifier_div0.skel.h" #include "verifier_div_overflow.skel.h" +#include "verifier_global_subprogs.skel.h" #include "verifier_gotol.skel.h" #include "verifier_helper_access_var_len.skel.h" #include "verifier_helper_packet_access.skel.h" @@ -134,6 +135,7 @@ void test_verifier_direct_packet_access(void) { RUN(verifier_direct_packet_acces void test_verifier_direct_stack_access_wraparound(void) { RUN(verifier_direct_stack_access_wraparound); } void test_verifier_div0(void) { RUN(verifier_div0); } void test_verifier_div_overflow(void) { RUN(verifier_div_overflow); } +void test_verifier_global_subprogs(void) { RUN(verifier_global_subprogs); } void test_verifier_gotol(void) { RUN(verifier_gotol); } void test_verifier_helper_access_var_len(void) { RUN(verifier_helper_access_var_len); } void test_verifier_helper_packet_access(void) { RUN(verifier_helper_packet_access); } diff --git a/tools/testing/selftests/bpf/progs/verifier_global_subprogs.c b/tools/testing/selftests/bpf/progs/verifier_global_subprogs.c new file mode 100644 index 000000000000..a0a5efd1caa1 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/verifier_global_subprogs.c @@ -0,0 +1,92 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */ + +#include <stdbool.h> +#include <errno.h> +#include <string.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include "bpf_misc.h" + +int arr[1]; +int unkn_idx; + +__noinline long global_bad(void) +{ + return arr[unkn_idx]; /* BOOM */ +} + +__noinline long global_good(void) +{ + return arr[0]; +} + +__noinline long global_calls_bad(void) +{ + return global_good() + global_bad() /* does BOOM indirectly */; +} + +__noinline long global_calls_good_only(void) +{ + return global_good(); +} + +SEC("?raw_tp") +__success __log_level(2) +/* main prog is validated completely first */ +__msg("('global_calls_good_only') is global and assumed valid.") +__msg("1: (95) exit") +/* eventually global_good() is transitively validated as well */ +__msg("Validating global_good() func") +__msg("('global_good') is safe for any args that match its prototype") +int chained_global_func_calls_success(void) +{ + return global_calls_good_only(); +} + +SEC("?raw_tp") +__failure __log_level(2) +/* main prog validated successfully first */ +__msg("1: (95) exit") +/* eventually we validate global_bad() and fail */ +__msg("Validating global_bad() func") +__msg("math between map_value pointer and register") /* BOOM */ +int chained_global_func_calls_bad(void) +{ + return global_calls_bad(); +} + +/* do out of bounds access forcing verifier to fail verification if this + * global func is called + */ +__noinline int global_unsupp(const int *mem) +{ + if (!mem) + return 0; + return mem[100]; /* BOOM */ +} + +const volatile bool skip_unsupp_global = true; + +SEC("?raw_tp") +__success +int guarded_unsupp_global_called(void) +{ + if (!skip_unsupp_global) + return global_unsupp(NULL); + return 0; +} + +SEC("?raw_tp") +__failure __log_level(2) +__msg("Func#1 ('global_unsupp') is global and assumed valid.") +__msg("Validating global_unsupp() func#1...") +__msg("value is outside of the allowed memory range") +int unguarded_unsupp_global_called(void) +{ + int x = 0; + + return global_unsupp(&x); +} + +char _license[] SEC("license") = "GPL"; -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit f08a1c658257c73697a819c4ded3a84b6f0ead74 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Currently, bpf_prog_pack_free only can only free pointer to struct bpf_binary_header, which is not flexible. Add a size argument to bpf_prog_pack_free so that it can handle any pointer. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/filter.h | 2 +- kernel/bpf/core.c | 21 ++++++++++----------- kernel/bpf/dispatcher.c | 5 +---- 3 files changed, 12 insertions(+), 16 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index e55b625257bd..8908c5b1c951 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1057,7 +1057,7 @@ struct bpf_binary_header * bpf_jit_binary_pack_hdr(const struct bpf_prog *fp); void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns); -void bpf_prog_pack_free(struct bpf_binary_header *hdr); +void bpf_prog_pack_free(void *ptr, u32 size); static inline bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp) { diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 72a7503a59d0..d0436f78c4b7 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -942,20 +942,20 @@ void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns) return ptr; } -void bpf_prog_pack_free(struct bpf_binary_header *hdr) +void bpf_prog_pack_free(void *ptr, u32 size) { struct bpf_prog_pack *pack = NULL, *tmp; unsigned int nbits; unsigned long pos; mutex_lock(&pack_mutex); - if (hdr->size > BPF_PROG_PACK_SIZE) { - bpf_jit_free_exec(hdr); + if (size > BPF_PROG_PACK_SIZE) { + bpf_jit_free_exec(ptr); goto out; } list_for_each_entry(tmp, &pack_list, list) { - if ((void *)hdr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > (void *)hdr) { + if (ptr >= tmp->ptr && (tmp->ptr + BPF_PROG_PACK_SIZE) > ptr) { pack = tmp; break; } @@ -964,10 +964,10 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr) if (WARN_ONCE(!pack, "bpf_prog_pack bug\n")) goto out; - nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size); - pos = ((unsigned long)hdr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT; + nbits = BPF_PROG_SIZE_TO_NBITS(size); + pos = ((unsigned long)ptr - (unsigned long)pack->ptr) >> BPF_PROG_CHUNK_SHIFT; - WARN_ONCE(bpf_arch_text_invalidate(hdr, hdr->size), + WARN_ONCE(bpf_arch_text_invalidate(ptr, size), "bpf_prog_pack bug: missing bpf_arch_text_invalidate?\n"); bitmap_clear(pack->bitmap, pos, nbits); @@ -1114,8 +1114,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr, *rw_header = kvmalloc(size, GFP_KERNEL); if (!*rw_header) { - bpf_arch_text_copy(&ro_header->size, &size, sizeof(size)); - bpf_prog_pack_free(ro_header); + bpf_prog_pack_free(ro_header, size); bpf_jit_uncharge_modmem(size); return NULL; } @@ -1146,7 +1145,7 @@ int bpf_jit_binary_pack_finalize(struct bpf_prog *prog, kvfree(rw_header); if (IS_ERR(ptr)) { - bpf_prog_pack_free(ro_header); + bpf_prog_pack_free(ro_header, ro_header->size); return PTR_ERR(ptr); } return 0; @@ -1167,7 +1166,7 @@ void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header, { u32 size = ro_header->size; - bpf_prog_pack_free(ro_header); + bpf_prog_pack_free(ro_header, size); kvfree(rw_header); bpf_jit_uncharge_modmem(size); } diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c index fa3e9225aedc..56760fc10e78 100644 --- a/kernel/bpf/dispatcher.c +++ b/kernel/bpf/dispatcher.c @@ -150,10 +150,7 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, goto out; d->rw_image = bpf_jit_alloc_exec(PAGE_SIZE); if (!d->rw_image) { - u32 size = PAGE_SIZE; - - bpf_arch_text_copy(d->image, &size, sizeof(size)); - bpf_prog_pack_free((struct bpf_binary_header *)d->image); + bpf_prog_pack_free(d->image, PAGE_SIZE); d->image = NULL; goto out; } -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 7a3d9a159b178e87306a6e989071ed9a114a1a31 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- We are using "im" for "struct bpf_tramp_image" and "tr" for "struct bpf_trampoline" in most of the code base. The only exception is the prototype and fallback version of arch_prepare_bpf_trampoline(). Update them to match the rest of the code base. We mix "orig_call" and "func_addr" for the argument in different versions of arch_prepare_bpf_trampoline(). s/orig_call/func_addr/g so they match. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/arm64/net/bpf_jit_comp.c | 10 +++++----- include/linux/bpf.h | 4 ++-- kernel/bpf/trampoline.c | 4 ++-- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 2494a5c96489..50d97e94fe63 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -1890,7 +1890,7 @@ static bool is_struct_ops_tramp(const struct bpf_tramp_links *fentry_links) * */ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, - struct bpf_tramp_links *tlinks, void *orig_call, + struct bpf_tramp_links *tlinks, void *func_addr, int nregs, u32 flags) { int i; @@ -1992,7 +1992,7 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, if (flags & BPF_TRAMP_F_IP_ARG) { /* save ip address of the traced function */ - emit_addr_mov_i64(A64_R(10), (const u64)orig_call, ctx); + emit_addr_mov_i64(A64_R(10), (const u64)func_addr, ctx); emit(A64_STR64I(A64_R(10), A64_SP, ip_off), ctx); } @@ -2108,7 +2108,7 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, - void *orig_call) + void *func_addr) { int i, ret; int nregs = m->nr_args; @@ -2129,7 +2129,7 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, if (nregs > 8) return -ENOTSUPP; - ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nregs, flags); + ret = prepare_trampoline(&ctx, im, tlinks, func_addr, nregs, flags); if (ret < 0) return ret; @@ -2140,7 +2140,7 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, ctx.idx = 0; jit_fill_hole(image, (unsigned int)(image_end - image)); - ret = prepare_trampoline(&ctx, im, tlinks, orig_call, nregs, flags); + ret = prepare_trampoline(&ctx, im, tlinks, func_addr, nregs, flags); if (ret > 0 && validate_code(&ctx) < 0) ret = -EINVAL; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 511168c6dc2f..5642ba7d6150 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1151,10 +1151,10 @@ struct bpf_tramp_run_ctx; * fexit = a set of program to run after original function */ struct bpf_tramp_image; -int arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end, +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, - void *orig_call); + void *func_addr); u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog, struct bpf_tramp_run_ctx *run_ctx); void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start, diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index bb842decab0d..99c7221a8933 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -1060,10 +1060,10 @@ bpf_trampoline_exit_t bpf_trampoline_exit(const struct bpf_prog *prog) } int __weak -arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end, +arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, - void *orig_call) + void *func_addr) { return -ENOTSUPP; } -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 82583daa2efc2e336962b231a46bad03a280b3e0 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- As BPF trampoline of different archs moves from bpf_jit_[alloc|free]_exec() to bpf_prog_pack_[alloc|free](), we need to use different _alloc, _free for different archs during the transition. Add the following helpers for this transition: void *arch_alloc_bpf_trampoline(unsigned int size); void arch_free_bpf_trampoline(void *image, unsigned int size); void arch_protect_bpf_trampoline(void *image, unsigned int size); void arch_unprotect_bpf_trampoline(void *image, unsigned int size); The fallback version of these helpers require size <= PAGE_SIZE, but they are only called with size == PAGE_SIZE. They will be called with size < PAGE_SIZE when arch_bpf_trampoline_size() helper is introduced later. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-4-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 5 ++++ kernel/bpf/bpf_struct_ops.c | 12 ++++----- kernel/bpf/trampoline.c | 46 ++++++++++++++++++++++++++++------ net/bpf/bpf_dummy_struct_ops.c | 7 +++--- 4 files changed, 52 insertions(+), 18 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5642ba7d6150..689dd882c791 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1155,6 +1155,11 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, void *func_addr); +void *arch_alloc_bpf_trampoline(unsigned int size); +void arch_free_bpf_trampoline(void *image, unsigned int size); +void arch_protect_bpf_trampoline(void *image, unsigned int size); +void arch_unprotect_bpf_trampoline(void *image, unsigned int size); + u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog, struct bpf_tramp_run_ctx *run_ctx); void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start, diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index db6176fb64dc..e9e95879bce2 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -515,7 +515,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (err) goto reset_unlock; } - set_memory_rox((long)st_map->image, 1); + arch_protect_bpf_trampoline(st_map->image, PAGE_SIZE); /* Let bpf_link handle registration & unregistration. * * Pair with smp_load_acquire() during lookup_elem(). @@ -524,7 +524,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, goto unlock; } - set_memory_rox((long)st_map->image, 1); + arch_protect_bpf_trampoline(st_map->image, PAGE_SIZE); err = st_ops->reg(kdata); if (likely(!err)) { /* This refcnt increment on the map here after @@ -547,8 +547,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, * there was a race in registering the struct_ops (under the same name) to * a sub-system through different struct_ops's maps. */ - set_memory_nx((long)st_map->image, 1); - set_memory_rw((long)st_map->image, 1); + arch_unprotect_bpf_trampoline(st_map->image, PAGE_SIZE); reset_unlock: bpf_struct_ops_map_put_progs(st_map); @@ -616,7 +615,7 @@ static void __bpf_struct_ops_map_free(struct bpf_map *map) bpf_struct_ops_map_put_progs(st_map); bpf_map_area_free(st_map->links); if (st_map->image) { - bpf_jit_free_exec(st_map->image); + arch_free_bpf_trampoline(st_map->image, PAGE_SIZE); bpf_jit_uncharge_modmem(PAGE_SIZE); } bpf_map_area_free(st_map->uvalue); @@ -691,7 +690,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) return ERR_PTR(ret); } - st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); + st_map->image = arch_alloc_bpf_trampoline(PAGE_SIZE); if (!st_map->image) { /* __bpf_struct_ops_map_free() uses st_map->image as flag * for "charged or not". In this case, we need to unchange @@ -711,7 +710,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) } mutex_init(&st_map->lock); - set_vm_flush_reset_perms(st_map->image); bpf_map_init_from_attr(map, attr); return map; diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 99c7221a8933..f4b0df32a071 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -254,7 +254,7 @@ bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total, bool *ip_a static void bpf_tramp_image_free(struct bpf_tramp_image *im) { bpf_image_ksym_del(&im->ksym); - bpf_jit_free_exec(im->image); + arch_free_bpf_trampoline(im->image, PAGE_SIZE); bpf_jit_uncharge_modmem(PAGE_SIZE); percpu_ref_exit(&im->pcref); kfree_rcu(im, rcu); @@ -365,10 +365,9 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key) goto out_free_im; err = -ENOMEM; - im->image = image = bpf_jit_alloc_exec(PAGE_SIZE); + im->image = image = arch_alloc_bpf_trampoline(PAGE_SIZE); if (!image) goto out_uncharge; - set_vm_flush_reset_perms(image); err = percpu_ref_init(&im->pcref, __bpf_tramp_image_release, 0, GFP_KERNEL); if (err) @@ -381,7 +380,7 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key) return im; out_free_image: - bpf_jit_free_exec(im->image); + arch_free_bpf_trampoline(im->image, PAGE_SIZE); out_uncharge: bpf_jit_uncharge_modmem(PAGE_SIZE); out_free_im: @@ -444,7 +443,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut if (err < 0) goto out_free; - set_memory_rox((long)im->image, 1); + arch_protect_bpf_trampoline(im->image, PAGE_SIZE); WARN_ON(tr->cur_image && total == 0); if (tr->cur_image) @@ -461,8 +460,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut * trampoline again, and retry register. */ /* reset im->image memory attr for arch_prepare_bpf_trampoline */ - set_memory_nx((long)im->image, 1); - set_memory_rw((long)im->image, 1); + arch_unprotect_bpf_trampoline(im->image, PAGE_SIZE); goto again; } #endif @@ -1068,6 +1066,40 @@ arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image return -ENOTSUPP; } +void * __weak arch_alloc_bpf_trampoline(unsigned int size) +{ + void *image; + + if (WARN_ON_ONCE(size > PAGE_SIZE)) + return NULL; + image = bpf_jit_alloc_exec(PAGE_SIZE); + if (image) + set_vm_flush_reset_perms(image); + return image; +} + +void __weak arch_free_bpf_trampoline(void *image, unsigned int size) +{ + WARN_ON_ONCE(size > PAGE_SIZE); + /* bpf_jit_free_exec doesn't need "size", but + * bpf_prog_pack_free() needs it. + */ + bpf_jit_free_exec(image); +} + +void __weak arch_protect_bpf_trampoline(void *image, unsigned int size) +{ + WARN_ON_ONCE(size > PAGE_SIZE); + set_memory_rox((long)image, 1); +} + +void __weak arch_unprotect_bpf_trampoline(void *image, unsigned int size) +{ + WARN_ON_ONCE(size > PAGE_SIZE); + set_memory_nx((long)image, 1); + set_memory_rw((long)image, 1); +} + static int __init init_trampolines(void) { int i; diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c index 5918d1b32e19..2748f9d77b18 100644 --- a/net/bpf/bpf_dummy_struct_ops.c +++ b/net/bpf/bpf_dummy_struct_ops.c @@ -101,12 +101,11 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, goto out; } - image = bpf_jit_alloc_exec(PAGE_SIZE); + image = arch_alloc_bpf_trampoline(PAGE_SIZE); if (!image) { err = -ENOMEM; goto out; } - set_vm_flush_reset_perms(image); link = kzalloc(sizeof(*link), GFP_USER); if (!link) { @@ -124,7 +123,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, if (err < 0) goto out; - set_memory_rox((long)image, 1); + arch_protect_bpf_trampoline(image, PAGE_SIZE); prog_ret = dummy_ops_call_op(image, args); err = dummy_ops_copy_args(args); @@ -134,7 +133,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, err = -EFAULT; out: kfree(args); - bpf_jit_free_exec(image); + arch_free_bpf_trampoline(image, PAGE_SIZE); if (link) bpf_link_put(&link->link); kfree(tlinks); -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 38b8b58ae776bf748bd1bd7a24c3fd1d10f76f45 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- x86's implementation of arch_prepare_bpf_trampoline() requires BPF_INSN_SAFETY buffer space between end of program and image_end. OTOH, the return value does not include BPF_INSN_SAFETY. This doesn't cause any real issue at the moment. However, "image" of size retval is not enough for arch_prepare_bpf_trampoline(). This will cause confusion when we introduce a new helper arch_bpf_trampoline_size(). To avoid future confusion, adjust the return value to include BPF_INSN_SAFETY. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-5-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/net/bpf_jit_comp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index fca7da66067e..6216072a9b64 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -2697,7 +2697,7 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i ret = -EFAULT; goto cleanup; } - ret = prog - (u8 *)image; + ret = prog - (u8 *)image + BPF_INSN_SAFETY; cleanup: kfree(branches); -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 96d1b7c081c0c96cbe8901045f4ff15a2e9974a2 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This helper will be used to calculate the size of the trampoline before allocating the memory. arch_prepare_bpf_trampoline() for arm64 and riscv64 can use arch_bpf_trampoline_size() to check the trampoline fits in the image. OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process twice, so it cannot use arch_bpf_trampoline_size(). Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/arm64/net/bpf_jit_comp.c | 56 ++++++++++++++++++++++++--------- arch/riscv/net/bpf_jit_comp64.c | 22 ++++++++++--- arch/s390/net/bpf_jit_comp.c | 56 ++++++++++++++++++++------------- arch/x86/net/bpf_jit_comp.c | 40 ++++++++++++++++++++--- include/linux/bpf.h | 2 ++ kernel/bpf/trampoline.c | 6 ++++ 6 files changed, 136 insertions(+), 46 deletions(-) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 50d97e94fe63..098c057a0dd0 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -2105,18 +2105,10 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im, return ctx->idx; } -int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, - void *image_end, const struct btf_func_model *m, - u32 flags, struct bpf_tramp_links *tlinks, - void *func_addr) +static int btf_func_model_nregs(const struct btf_func_model *m) { - int i, ret; int nregs = m->nr_args; - int max_insns = ((long)image_end - (long)image) / AARCH64_INSN_SIZE; - struct jit_ctx ctx = { - .image = NULL, - .idx = 0, - }; + int i; /* extra registers needed for struct argument */ for (i = 0; i < MAX_BPF_FUNC_ARGS; i++) { @@ -2125,19 +2117,53 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, nregs += (m->arg_size[i] + 7) / 8 - 1; } + return nregs; +} + +int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, void *func_addr) +{ + struct jit_ctx ctx = { + .image = NULL, + .idx = 0, + }; + struct bpf_tramp_image im; + int nregs, ret; + + nregs = btf_func_model_nregs(m); /* the first 8 registers are used for arguments */ if (nregs > 8) return -ENOTSUPP; - ret = prepare_trampoline(&ctx, im, tlinks, func_addr, nregs, flags); + ret = prepare_trampoline(&ctx, &im, tlinks, func_addr, nregs, flags); if (ret < 0) return ret; - if (ret > max_insns) - return -EFBIG; + return ret < 0 ? ret : ret * AARCH64_INSN_SIZE; +} - ctx.image = image; - ctx.idx = 0; +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, + void *image_end, const struct btf_func_model *m, + u32 flags, struct bpf_tramp_links *tlinks, + void *func_addr) +{ + int ret, nregs; + struct jit_ctx ctx = { + .image = image, + .idx = 0, + }; + + nregs = btf_func_model_nregs(m); + /* the first 8 registers are used for arguments */ + if (nregs > 8) + return -ENOTSUPP; + + ret = arch_bpf_trampoline_size(m, flags, tlinks, func_addr); + if (ret < 0) + return ret; + + if (ret > ((long)image_end - (long)image)) + return -EFBIG; jit_fill_hole(image, (unsigned int)(image_end - image)); ret = prepare_trampoline(&ctx, im, tlinks, func_addr, nregs, flags); diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c index 5426dc2697f9..4f046eb2b5cf 100644 --- a/arch/riscv/net/bpf_jit_comp64.c +++ b/arch/riscv/net/bpf_jit_comp64.c @@ -1029,6 +1029,21 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, return ret; } +int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, void *func_addr) +{ + struct bpf_tramp_image im; + struct rv_jit_context ctx; + int ret; + + ctx.ninsns = 0; + ctx.insns = NULL; + ctx.ro_insns = NULL; + ret = __arch_prepare_bpf_trampoline(&im, m, tlinks, func_addr, flags, &ctx); + + return ret < 0 ? ret : ninsns_rvoff(ctx.ninsns); +} + int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, @@ -1037,14 +1052,11 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, int ret; struct rv_jit_context ctx; - ctx.ninsns = 0; - ctx.insns = NULL; - ctx.ro_insns = NULL; - ret = __arch_prepare_bpf_trampoline(im, m, tlinks, func_addr, flags, &ctx); + ret = arch_bpf_trampoline_size(im, m, flags, tlinks, func_addr); if (ret < 0) return ret; - if (ninsns_rvoff(ret) > (long)image_end - (long)image) + if (ret > (long)image_end - (long)image) return -EFBIG; ctx.ninsns = 0; diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c index 56143957def4..1f22da0ec420 100644 --- a/arch/s390/net/bpf_jit_comp.c +++ b/arch/s390/net/bpf_jit_comp.c @@ -2539,6 +2539,21 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, return 0; } +int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, void *orig_call) +{ + struct bpf_tramp_image im; + struct bpf_tramp_jit tjit; + int ret; + + memset(&tjit, 0, sizeof(tjit)); + + ret = __arch_prepare_bpf_trampoline(&im, &tjit, m, flags, + tlinks, orig_call); + + return ret < 0 ? ret : tjit.common.prg; +} + int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, @@ -2546,30 +2561,27 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, { struct bpf_tramp_jit tjit; int ret; - int i; - for (i = 0; i < 2; i++) { - if (i == 0) { - /* Compute offsets, check whether the code fits. */ - memset(&tjit, 0, sizeof(tjit)); - } else { - /* Generate the code. */ - tjit.common.prg = 0; - tjit.common.prg_buf = image; - } - ret = __arch_prepare_bpf_trampoline(im, &tjit, m, flags, - tlinks, func_addr); - if (ret < 0) - return ret; - if (tjit.common.prg > (char *)image_end - (char *)image) - /* - * Use the same error code as for exceeding - * BPF_MAX_TRAMP_LINKS. - */ - return -E2BIG; - } + /* Compute offsets, check whether the code fits. */ + memset(&tjit, 0, sizeof(tjit)); + ret = __arch_prepare_bpf_trampoline(im, &tjit, m, flags, + tlinks, func_addr); + + if (ret < 0) + return ret; + if (tjit.common.prg > (char *)image_end - (char *)image) + /* + * Use the same error code as for exceeding + * BPF_MAX_TRAMP_LINKS. + */ + return -E2BIG; + + tjit.common.prg = 0; + tjit.common.prg_buf = image; + ret = __arch_prepare_bpf_trampoline(im, &tjit, m, flags, + tlinks, func_addr); - return tjit.common.prg; + return ret < 0 ? ret : tjit.common.prg; } bool bpf_jit_supports_subprog_tailcalls(void) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 6216072a9b64..887606a47a88 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -2448,10 +2448,10 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, * add rsp, 8 // skip eth_type_trans's frame * ret // return to its caller */ -int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, - const struct btf_func_model *m, u32 flags, - struct bpf_tramp_links *tlinks, - void *func_addr) +static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, + const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, + void *func_addr) { int i, ret, nr_regs = m->nr_args, stack_size = 0; int regs_off, nregs_off, ip_off, run_ctx_off, arg_stack_off, rbx_off; @@ -2704,6 +2704,38 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i return ret; } +int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, + const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, + void *func_addr) +{ + return __arch_prepare_bpf_trampoline(im, image, image_end, m, flags, tlinks, func_addr); +} + +int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, void *func_addr) +{ + struct bpf_tramp_image im; + void *image; + int ret; + + /* Allocate a temporary buffer for __arch_prepare_bpf_trampoline(). + * This will NOT cause fragmentation in direct map, as we do not + * call set_memory_*() on this buffer. + * + * We cannot use kvmalloc here, because we need image to be in + * module memory range. + */ + image = bpf_jit_alloc_exec(PAGE_SIZE); + if (!image) + return -ENOMEM; + + ret = __arch_prepare_bpf_trampoline(&im, image, image + PAGE_SIZE, m, flags, + tlinks, func_addr); + bpf_jit_free_exec(image); + return ret; +} + static int emit_bpf_dispatcher(u8 **pprog, int a, int b, s64 *progs, u8 *image, u8 *buf) { u8 *jg_reloc, *prog = *pprog; diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 689dd882c791..32a3aa04f4b5 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1159,6 +1159,8 @@ void *arch_alloc_bpf_trampoline(unsigned int size); void arch_free_bpf_trampoline(void *image, unsigned int size); void arch_protect_bpf_trampoline(void *image, unsigned int size); void arch_unprotect_bpf_trampoline(void *image, unsigned int size); +int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, void *func_addr); u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog, struct bpf_tramp_run_ctx *run_ctx); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index f4b0df32a071..1540072cd7fa 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -1100,6 +1100,12 @@ void __weak arch_unprotect_bpf_trampoline(void *image, unsigned int size) set_memory_rw((long)image, 1); } +int __weak arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, + struct bpf_tramp_links *tlinks, void *func_addr) +{ + return -ENOTSUPP; +} + static int __init init_trampolines(void) { int i; -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 26ef208c209a0e6eed8942a5d191b39dccfa6e38 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Instead of blindly allocating PAGE_SIZE for each trampoline, check the size of the trampoline with arch_bpf_trampoline_size(). This size is saved in bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback arch_alloc_bpf_trampoline() still allocates a whole page because we need to use set_memory_* to protect the memory. struct_ops trampoline still uses a whole page for multiple trampolines. With this size check at caller (regular trampoline and struct_ops trampoline), remove arch_bpf_trampoline_size() from arch_prepare_bpf_trampoline() in archs. Also, update bpf_image_ksym_add() to handle symbol of different sizes. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: kernel/bpf/trampoline.c [Fix conflict because lts patch c6ef788c2ca6 is merged.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/arm64/net/bpf_jit_comp.c | 7 ----- arch/riscv/net/bpf_jit_comp64.c | 7 ----- include/linux/bpf.h | 3 +- kernel/bpf/bpf_struct_ops.c | 7 +++++ kernel/bpf/dispatcher.c | 2 +- kernel/bpf/trampoline.c | 54 ++++++++++++++++++++------------- 6 files changed, 43 insertions(+), 37 deletions(-) diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 098c057a0dd0..057828e9b3f8 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -2158,13 +2158,6 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, if (nregs > 8) return -ENOTSUPP; - ret = arch_bpf_trampoline_size(m, flags, tlinks, func_addr); - if (ret < 0) - return ret; - - if (ret > ((long)image_end - (long)image)) - return -EFBIG; - jit_fill_hole(image, (unsigned int)(image_end - image)); ret = prepare_trampoline(&ctx, im, tlinks, func_addr, nregs, flags); diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c index 4f046eb2b5cf..4a1a61ae14d6 100644 --- a/arch/riscv/net/bpf_jit_comp64.c +++ b/arch/riscv/net/bpf_jit_comp64.c @@ -1052,13 +1052,6 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, int ret; struct rv_jit_context ctx; - ret = arch_bpf_trampoline_size(im, m, flags, tlinks, func_addr); - if (ret < 0) - return ret; - - if (ret > (long)image_end - (long)image) - return -EFBIG; - ctx.ninsns = 0; /* * The bpf_int_jit_compile() uses a RW buffer (ctx.insns) to write the diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 32a3aa04f4b5..461e9451c3e4 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1194,6 +1194,7 @@ enum bpf_tramp_prog_type { struct bpf_tramp_image { void *image; + int size; struct bpf_ksym ksym; struct percpu_ref pcref; void *ip_after_call; @@ -1395,7 +1396,7 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, struct bpf_prog *to); /* Called only from JIT-enabled code, so there's no need for stubs. */ -void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym); +void bpf_image_ksym_add(void *data, unsigned int size, struct bpf_ksym *ksym); void bpf_image_ksym_del(struct bpf_ksym *ksym); void bpf_ksym_add(struct bpf_ksym *ksym); void bpf_ksym_del(struct bpf_ksym *ksym); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index e9e95879bce2..4d53c53fc5aa 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -355,6 +355,7 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, void *image, void *image_end) { u32 flags; + int size; tlinks[BPF_TRAMP_FENTRY].links[0] = link; tlinks[BPF_TRAMP_FENTRY].nr_links = 1; @@ -362,6 +363,12 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, * and it must be used alone. */ flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0; + + size = arch_bpf_trampoline_size(model, flags, tlinks, NULL); + if (size < 0) + return size; + if (size > (unsigned long)image_end - (unsigned long)image) + return -E2BIG; return arch_prepare_bpf_trampoline(NULL, image, image_end, model, flags, tlinks, NULL); } diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c index 56760fc10e78..70fb82bf1637 100644 --- a/kernel/bpf/dispatcher.c +++ b/kernel/bpf/dispatcher.c @@ -154,7 +154,7 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, d->image = NULL; goto out; } - bpf_image_ksym_add(d->image, &d->ksym); + bpf_image_ksym_add(d->image, PAGE_SIZE, &d->ksym); } prev_num_progs = d->num_progs; diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 1540072cd7fa..ba4280057f49 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -115,10 +115,10 @@ bool bpf_prog_has_trampoline(const struct bpf_prog *prog) (ptype == BPF_PROG_TYPE_LSM && eatype == BPF_LSM_MAC); } -void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym) +void bpf_image_ksym_add(void *data, unsigned int size, struct bpf_ksym *ksym) { ksym->start = (unsigned long) data; - ksym->end = ksym->start + PAGE_SIZE; + ksym->end = ksym->start + size; bpf_ksym_add(ksym); perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, ksym->start, PAGE_SIZE, false, ksym->name); @@ -254,8 +254,8 @@ bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total, bool *ip_a static void bpf_tramp_image_free(struct bpf_tramp_image *im) { bpf_image_ksym_del(&im->ksym); - arch_free_bpf_trampoline(im->image, PAGE_SIZE); - bpf_jit_uncharge_modmem(PAGE_SIZE); + arch_free_bpf_trampoline(im->image, im->size); + bpf_jit_uncharge_modmem(im->size); percpu_ref_exit(&im->pcref); kfree_rcu(im, rcu); } @@ -349,7 +349,7 @@ static void bpf_tramp_image_put(struct bpf_tramp_image *im) call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks); } -static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key) +static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, int size) { struct bpf_tramp_image *im; struct bpf_ksym *ksym; @@ -360,12 +360,13 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key) if (!im) goto out; - err = bpf_jit_charge_modmem(PAGE_SIZE); + err = bpf_jit_charge_modmem(size); if (err) goto out_free_im; + im->size = size; err = -ENOMEM; - im->image = image = arch_alloc_bpf_trampoline(PAGE_SIZE); + im->image = image = arch_alloc_bpf_trampoline(size); if (!image) goto out_uncharge; @@ -376,13 +377,13 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key) ksym = &im->ksym; INIT_LIST_HEAD_RCU(&ksym->lnode); snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu", key); - bpf_image_ksym_add(image, ksym); + bpf_image_ksym_add(image, size, ksym); return im; out_free_image: - arch_free_bpf_trampoline(im->image, PAGE_SIZE); + arch_free_bpf_trampoline(im->image, im->size); out_uncharge: - bpf_jit_uncharge_modmem(PAGE_SIZE); + bpf_jit_uncharge_modmem(size); out_free_im: kfree(im); out: @@ -395,7 +396,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut struct bpf_tramp_links *tlinks; u32 orig_flags = tr->flags; bool ip_arg = false; - int err, total; + int err, total, size; tlinks = bpf_trampoline_get_progs(tr, &total, &ip_arg); if (IS_ERR(tlinks)) @@ -408,12 +409,6 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut goto out; } - im = bpf_tramp_image_alloc(tr->key); - if (IS_ERR(im)) { - err = PTR_ERR(im); - goto out; - } - /* clear all bits except SHARE_IPMODIFY and TAIL_CALL_CTX */ tr->flags &= (BPF_TRAMP_F_SHARE_IPMODIFY | BPF_TRAMP_F_TAIL_CALL_CTX); @@ -437,13 +432,31 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut tr->flags |= BPF_TRAMP_F_ORIG_STACK; #endif - err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE, + size = arch_bpf_trampoline_size(&tr->func.model, tr->flags, + tlinks, tr->func.addr); + if (size < 0) { + err = size; + goto out; + } + + if (size > PAGE_SIZE) { + err = -E2BIG; + goto out; + } + + im = bpf_tramp_image_alloc(tr->key, size); + if (IS_ERR(im)) { + err = PTR_ERR(im); + goto out; + } + + err = arch_prepare_bpf_trampoline(im, im->image, im->image + size, &tr->func.model, tr->flags, tlinks, tr->func.addr); if (err < 0) goto out_free; - arch_protect_bpf_trampoline(im->image, PAGE_SIZE); + arch_protect_bpf_trampoline(im->image, im->size); WARN_ON(tr->cur_image && total == 0); if (tr->cur_image) @@ -459,8 +472,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut * BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the * trampoline again, and retry register. */ - /* reset im->image memory attr for arch_prepare_bpf_trampoline */ - arch_unprotect_bpf_trampoline(im->image, PAGE_SIZE); + bpf_tramp_image_free(im); goto again; } #endif -- 2.34.1
From: Luo Gengkun <luogengkun2@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- Fix kabi-breakage by using KABI_USE. Fixes: 7c2f256d4f9a ("bpf: Use arch_bpf_trampoline_size") Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 461e9451c3e4..533148ccb549 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1194,7 +1194,6 @@ enum bpf_tramp_prog_type { struct bpf_tramp_image { void *image; - int size; struct bpf_ksym ksym; struct percpu_ref pcref; void *ip_after_call; @@ -1204,7 +1203,7 @@ struct bpf_tramp_image { struct work_struct work; }; - KABI_RESERVE(1) + KABI_USE(1, int size) KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) -- 2.34.1
From: Song Liu <song@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 3ba026fca8786161b0c4d75be396e61d6816e0a1 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- There are three major changes here: 1. Add arch_[alloc|free]_bpf_trampoline based on bpf_prog_pack; 2. Let arch_prepare_bpf_trampoline handle ROX input image, this requires arch_prepare_bpf_trampoline allocating a temporary RW buffer; 3. Update __arch_prepare_bpf_trampoline() to handle a RW buffer (rw_image) and a ROX buffer (image). This part is similar to the image/rw_image logic in bpf_int_jit_compile(). Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-8-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/net/bpf_jit_comp.c | 98 +++++++++++++++++++++++++++---------- 1 file changed, 72 insertions(+), 26 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 887606a47a88..a4f84b7e89b8 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -2224,7 +2224,8 @@ static void restore_regs(const struct btf_func_model *m, u8 **prog, static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, struct bpf_tramp_link *l, int stack_size, - int run_ctx_off, bool save_ret) + int run_ctx_off, bool save_ret, + void *image, void *rw_image) { u8 *prog = *pprog; u8 *jmp_insn; @@ -2252,7 +2253,7 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, else EMIT4(0x48, 0x8D, 0x75, -run_ctx_off); - if (emit_rsb_call(&prog, bpf_trampoline_enter(p), prog)) + if (emit_rsb_call(&prog, bpf_trampoline_enter(p), image + (prog - (u8 *)rw_image))) return -EINVAL; /* remember prog start time returned by __bpf_prog_enter */ emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); @@ -2276,7 +2277,7 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, (long) p->insnsi >> 32, (u32) (long) p->insnsi); /* call JITed bpf program or interpreter */ - if (emit_rsb_call(&prog, p->bpf_func, prog)) + if (emit_rsb_call(&prog, p->bpf_func, image + (prog - (u8 *)rw_image))) return -EINVAL; /* @@ -2303,7 +2304,7 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, EMIT3_off32(0x48, 0x8D, 0x95, -run_ctx_off); else EMIT4(0x48, 0x8D, 0x55, -run_ctx_off); - if (emit_rsb_call(&prog, bpf_trampoline_exit(p), prog)) + if (emit_rsb_call(&prog, bpf_trampoline_exit(p), image + (prog - (u8 *)rw_image))) return -EINVAL; *pprog = prog; @@ -2338,14 +2339,15 @@ static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, struct bpf_tramp_links *tl, int stack_size, - int run_ctx_off, bool save_ret) + int run_ctx_off, bool save_ret, + void *image, void *rw_image) { int i; u8 *prog = *pprog; for (i = 0; i < tl->nr_links; i++) { if (invoke_bpf_prog(m, &prog, tl->links[i], stack_size, - run_ctx_off, save_ret)) + run_ctx_off, save_ret, image, rw_image)) return -EINVAL; } *pprog = prog; @@ -2354,7 +2356,8 @@ static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, struct bpf_tramp_links *tl, int stack_size, - int run_ctx_off, u8 **branches) + int run_ctx_off, u8 **branches, + void *image, void *rw_image) { u8 *prog = *pprog; int i; @@ -2365,7 +2368,8 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, emit_mov_imm32(&prog, false, BPF_REG_0, 0); emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); for (i = 0; i < tl->nr_links; i++) { - if (invoke_bpf_prog(m, &prog, tl->links[i], stack_size, run_ctx_off, true)) + if (invoke_bpf_prog(m, &prog, tl->links[i], stack_size, run_ctx_off, true, + image, rw_image)) return -EINVAL; /* mod_ret prog stored return value into [rbp - 8]. Emit: @@ -2448,7 +2452,8 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, * add rsp, 8 // skip eth_type_trans's frame * ret // return to its caller */ -static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, +static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_image, + void *rw_image_end, void *image, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, void *func_addr) @@ -2547,7 +2552,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image orig_call += X86_PATCH_SIZE; } - prog = image; + prog = rw_image; EMIT_ENDBR(); /* @@ -2589,7 +2594,8 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image if (flags & BPF_TRAMP_F_CALL_ORIG) { /* arg1: mov rdi, im */ emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im); - if (emit_rsb_call(&prog, __bpf_tramp_enter, prog)) { + if (emit_rsb_call(&prog, __bpf_tramp_enter, + image + (prog - (u8 *)rw_image))) { ret = -EINVAL; goto cleanup; } @@ -2597,7 +2603,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image if (fentry->nr_links) if (invoke_bpf(m, &prog, fentry, regs_off, run_ctx_off, - flags & BPF_TRAMP_F_RET_FENTRY_RET)) + flags & BPF_TRAMP_F_RET_FENTRY_RET, image, rw_image)) return -EINVAL; if (fmod_ret->nr_links) { @@ -2607,7 +2613,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image return -ENOMEM; if (invoke_bpf_mod_ret(m, &prog, fmod_ret, regs_off, - run_ctx_off, branches)) { + run_ctx_off, branches, image, rw_image)) { ret = -EINVAL; goto cleanup; } @@ -2628,14 +2634,14 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image EMIT2(0xff, 0xd3); /* call *rbx */ } else { /* call original function */ - if (emit_rsb_call(&prog, orig_call, prog)) { + if (emit_rsb_call(&prog, orig_call, image + (prog - (u8 *)rw_image))) { ret = -EINVAL; goto cleanup; } } /* remember return value in a stack for bpf prog to access */ emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); - im->ip_after_call = prog; + im->ip_after_call = image + (prog - (u8 *)rw_image); memcpy(prog, x86_nops[5], X86_PATCH_SIZE); prog += X86_PATCH_SIZE; } @@ -2651,12 +2657,13 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image * aligned address of do_fexit. */ for (i = 0; i < fmod_ret->nr_links; i++) - emit_cond_near_jump(&branches[i], prog, branches[i], - X86_JNE); + emit_cond_near_jump(&branches[i], image + (prog - (u8 *)rw_image), + image + (branches[i] - (u8 *)rw_image), X86_JNE); } if (fexit->nr_links) - if (invoke_bpf(m, &prog, fexit, regs_off, run_ctx_off, false)) { + if (invoke_bpf(m, &prog, fexit, regs_off, run_ctx_off, + false, image, rw_image)) { ret = -EINVAL; goto cleanup; } @@ -2669,10 +2676,10 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image * restored to R0. */ if (flags & BPF_TRAMP_F_CALL_ORIG) { - im->ip_epilogue = prog; + im->ip_epilogue = image + (prog - (u8 *)rw_image); /* arg1: mov rdi, im */ emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im); - if (emit_rsb_call(&prog, __bpf_tramp_exit, prog)) { + if (emit_rsb_call(&prog, __bpf_tramp_exit, image + (prog - (u8 *)rw_image))) { ret = -EINVAL; goto cleanup; } @@ -2691,25 +2698,64 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image if (flags & BPF_TRAMP_F_SKIP_FRAME) /* skip our return address and return to parent */ EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */ - emit_return(&prog, prog); + emit_return(&prog, image + (prog - (u8 *)rw_image)); /* Make sure the trampoline generation logic doesn't overflow */ - if (WARN_ON_ONCE(prog > (u8 *)image_end - BPF_INSN_SAFETY)) { + if (WARN_ON_ONCE(prog > (u8 *)rw_image_end - BPF_INSN_SAFETY)) { ret = -EFAULT; goto cleanup; } - ret = prog - (u8 *)image + BPF_INSN_SAFETY; + ret = prog - (u8 *)rw_image + BPF_INSN_SAFETY; cleanup: kfree(branches); return ret; } +void *arch_alloc_bpf_trampoline(unsigned int size) +{ + return bpf_prog_pack_alloc(size, jit_fill_hole); +} + +void arch_free_bpf_trampoline(void *image, unsigned int size) +{ + bpf_prog_pack_free(image, size); +} + +void arch_protect_bpf_trampoline(void *image, unsigned int size) +{ +} + +void arch_unprotect_bpf_trampoline(void *image, unsigned int size) +{ +} + int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, void *func_addr) { - return __arch_prepare_bpf_trampoline(im, image, image_end, m, flags, tlinks, func_addr); + void *rw_image, *tmp; + int ret; + u32 size = image_end - image; + + /* rw_image doesn't need to be in module memory range, so we can + * use kvmalloc. + */ + rw_image = kvmalloc(size, GFP_KERNEL); + if (!rw_image) + return -ENOMEM; + + ret = __arch_prepare_bpf_trampoline(im, rw_image, rw_image + size, image, m, + flags, tlinks, func_addr); + if (ret < 0) + goto out; + + tmp = bpf_arch_text_copy(image, rw_image, size); + if (IS_ERR(tmp)) + ret = PTR_ERR(tmp); +out: + kvfree(rw_image); + return ret; } int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, @@ -2730,8 +2776,8 @@ int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, if (!image) return -ENOMEM; - ret = __arch_prepare_bpf_trampoline(&im, image, image + PAGE_SIZE, m, flags, - tlinks, func_addr); + ret = __arch_prepare_bpf_trampoline(&im, image, image + PAGE_SIZE, image, + m, flags, tlinks, func_addr); bpf_jit_free_exec(image); return ret; } -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit 4382159696c9af67ee047ed55f2dbf05480f52f6 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Normal include order is that linux/foo.h should include asm/foo.h, CFI has it the wrong way around. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20231215092707.231038174@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/riscv/include/asm/cfi.h | 3 ++- arch/riscv/kernel/cfi.c | 2 +- arch/x86/include/asm/cfi.h | 3 ++- arch/x86/kernel/cfi.c | 4 ++-- include/asm-generic/Kbuild | 1 + include/asm-generic/cfi.h | 5 +++++ include/linux/cfi.h | 1 + 7 files changed, 14 insertions(+), 5 deletions(-) create mode 100644 include/asm-generic/cfi.h diff --git a/arch/riscv/include/asm/cfi.h b/arch/riscv/include/asm/cfi.h index 56bf9d69d5e3..8f7a62257044 100644 --- a/arch/riscv/include/asm/cfi.h +++ b/arch/riscv/include/asm/cfi.h @@ -7,8 +7,9 @@ * * Copyright (C) 2023 Google LLC */ +#include <linux/bug.h> -#include <linux/cfi.h> +struct pt_regs; #ifdef CONFIG_CFI_CLANG enum bug_trap_type handle_cfi_failure(struct pt_regs *regs); diff --git a/arch/riscv/kernel/cfi.c b/arch/riscv/kernel/cfi.c index 820158d7a291..6ec9dbd7292e 100644 --- a/arch/riscv/kernel/cfi.c +++ b/arch/riscv/kernel/cfi.c @@ -4,7 +4,7 @@ * * Copyright (C) 2023 Google LLC */ -#include <asm/cfi.h> +#include <linux/cfi.h> #include <asm/insn.h> /* diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h index 58dacd90daef..2a494643089d 100644 --- a/arch/x86/include/asm/cfi.h +++ b/arch/x86/include/asm/cfi.h @@ -7,8 +7,9 @@ * * Copyright (C) 2022 Google LLC */ +#include <linux/bug.h> -#include <linux/cfi.h> +struct pt_regs; #ifdef CONFIG_CFI_CLANG enum bug_trap_type handle_cfi_failure(struct pt_regs *regs); diff --git a/arch/x86/kernel/cfi.c b/arch/x86/kernel/cfi.c index 8674a5c0c031..e6bf78fac146 100644 --- a/arch/x86/kernel/cfi.c +++ b/arch/x86/kernel/cfi.c @@ -4,10 +4,10 @@ * * Copyright (C) 2022 Google LLC */ -#include <asm/cfi.h> +#include <linux/string.h> +#include <linux/cfi.h> #include <asm/insn.h> #include <asm/insn-eval.h> -#include <linux/string.h> /* * Returns the target address and the expected type when regs->ip points diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild index 941be574bbe0..ca2be8eaba5e 100644 --- a/include/asm-generic/Kbuild +++ b/include/asm-generic/Kbuild @@ -11,6 +11,7 @@ mandatory-y += bitops.h mandatory-y += bug.h mandatory-y += bugs.h mandatory-y += cacheflush.h +mandatory-y += cfi.h mandatory-y += checksum.h mandatory-y += compat.h mandatory-y += current.h diff --git a/include/asm-generic/cfi.h b/include/asm-generic/cfi.h new file mode 100644 index 000000000000..41fac3537bf9 --- /dev/null +++ b/include/asm-generic/cfi.h @@ -0,0 +1,5 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_GENERIC_CFI_H +#define __ASM_GENERIC_CFI_H + +#endif /* __ASM_GENERIC_CFI_H */ diff --git a/include/linux/cfi.h b/include/linux/cfi.h index 3552ec82b725..2309d74e77e6 100644 --- a/include/linux/cfi.h +++ b/include/linux/cfi.h @@ -9,6 +9,7 @@ #include <linux/bug.h> #include <linux/module.h> +#include <asm/cfi.h> #ifdef CONFIG_CFI_CLANG enum bug_trap_type report_cfi_failure(struct pt_regs *regs, unsigned long addr, -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit 4f9087f16651aca4a5f32da840a53f6660f0579a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The current BPF call convention is __nocfi, except when it calls !JIT things, then it calls regular C functions. It so happens that with FineIBT the __nocfi and C calling conventions are incompatible. Specifically __nocfi will call at func+0, while FineIBT will have endbr-poison there, which is not a valid indirect target. Causing #CP. Notably this only triggers on IBT enabled hardware, which is probably why this hasn't been reported (also, most people will have JIT on anyway). Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for x86. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: arch/x86/net/bpf_jit_comp.c include/linux/bpf.h kernel/bpf/core.c [Fix conflict because of kabi and cev patches db775a8ebf140, and pre-patches fd5d27b70188 is not existed.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/include/asm/cfi.h | 110 ++++++++++++++++++++++++++++++++++ arch/x86/kernel/alternative.c | 47 ++++++++++++--- arch/x86/net/bpf_jit_comp.c | 82 +++++++++++++++++++++++-- include/linux/bpf.h | 13 +++- include/linux/cfi.h | 7 +++ kernel/bpf/core.c | 25 ++++++++ 6 files changed, 270 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h index 2a494643089d..7a7b0b823a98 100644 --- a/arch/x86/include/asm/cfi.h +++ b/arch/x86/include/asm/cfi.h @@ -9,15 +9,125 @@ */ #include <linux/bug.h> +/* + * An overview of the various calling conventions... + * + * Traditional: + * + * foo: + * ... code here ... + * ret + * + * direct caller: + * call foo + * + * indirect caller: + * lea foo(%rip), %r11 + * ... + * call *%r11 + * + * + * IBT: + * + * foo: + * endbr64 + * ... code here ... + * ret + * + * direct caller: + * call foo / call foo+4 + * + * indirect caller: + * lea foo(%rip), %r11 + * ... + * call *%r11 + * + * + * kCFI: + * + * __cfi_foo: + * movl $0x12345678, %eax + * # 11 nops when CONFIG_CALL_PADDING + * foo: + * endbr64 # when IBT + * ... code here ... + * ret + * + * direct call: + * call foo # / call foo+4 when IBT + * + * indirect call: + * lea foo(%rip), %r11 + * ... + * movl $(-0x12345678), %r10d + * addl -4(%r11), %r10d # -15 when CONFIG_CALL_PADDING + * jz 1f + * ud2 + * 1:call *%r11 + * + * + * FineIBT (builds as kCFI + CALL_PADDING + IBT + RETPOLINE and runtime patches into): + * + * __cfi_foo: + * endbr64 + * subl 0x12345678, %r10d + * jz foo + * ud2 + * nop + * foo: + * osp nop3 # was endbr64 + * ... code here ... + * ret + * + * direct caller: + * call foo / call foo+4 + * + * indirect caller: + * lea foo(%rip), %r11 + * ... + * movl $0x12345678, %r10d + * subl $16, %r11 + * nop4 + * call *%r11 + * + */ +enum cfi_mode { + CFI_DEFAULT, /* FineIBT if hardware has IBT, otherwise kCFI */ + CFI_OFF, /* Taditional / IBT depending on .config */ + CFI_KCFI, /* Optionally CALL_PADDING, IBT, RETPOLINE */ + CFI_FINEIBT, /* see arch/x86/kernel/alternative.c */ +}; + +extern enum cfi_mode cfi_mode; + struct pt_regs; #ifdef CONFIG_CFI_CLANG enum bug_trap_type handle_cfi_failure(struct pt_regs *regs); +#define __bpfcall +extern u32 cfi_bpf_hash; + +static inline int cfi_get_offset(void) +{ + switch (cfi_mode) { + case CFI_FINEIBT: + return 16; + case CFI_KCFI: + if (IS_ENABLED(CONFIG_CALL_PADDING)) + return 16; + return 5; + default: + return 0; + } +} +#define cfi_get_offset cfi_get_offset + #else static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs) { return BUG_TRAP_TYPE_NONE; } +#define cfi_bpf_hash 0U #endif /* CONFIG_CFI_CLANG */ #endif /* _ASM_X86_CFI_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index aae7456ece07..811f4a1f0fdf 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -30,6 +30,7 @@ #include <asm/fixmap.h> #include <asm/paravirt.h> #include <asm/asm-prototypes.h> +#include <asm/cfi.h> int __read_mostly alternatives_patched; @@ -842,15 +843,43 @@ void __init_or_module apply_seal_endbr(s32 *start, s32 *end) { } #endif /* CONFIG_X86_KERNEL_IBT */ #ifdef CONFIG_FINEIBT +#define __CFI_DEFAULT CFI_DEFAULT +#elif defined(CONFIG_CFI_CLANG) +#define __CFI_DEFAULT CFI_KCFI +#else +#define __CFI_DEFAULT CFI_OFF +#endif -enum cfi_mode { - CFI_DEFAULT, - CFI_OFF, - CFI_KCFI, - CFI_FINEIBT, -}; +enum cfi_mode cfi_mode __ro_after_init = __CFI_DEFAULT; + +#ifdef CONFIG_CFI_CLANG +struct bpf_insn; + +/* Must match bpf_func_t / DEFINE_BPF_PROG_RUN() */ +extern unsigned int __bpf_prog_runX(const void *ctx, + const struct bpf_insn *insn); + +/* + * Force a reference to the external symbol so the compiler generates + * __kcfi_typid. + */ +__ADDRESSABLE(__bpf_prog_runX); + +/* u32 __ro_after_init cfi_bpf_hash = __kcfi_typeid___bpf_prog_runX; */ +asm ( +" .pushsection .data..ro_after_init,\"aw\",@progbits \n" +" .type cfi_bpf_hash,@object \n" +" .globl cfi_bpf_hash \n" +" .p2align 2, 0x0 \n" +"cfi_bpf_hash: \n" +" .long __kcfi_typeid___bpf_prog_runX \n" +" .size cfi_bpf_hash, 4 \n" +" .popsection \n" +); +#endif + +#ifdef CONFIG_FINEIBT -static enum cfi_mode cfi_mode __ro_after_init = CFI_DEFAULT; static bool cfi_rand __ro_after_init = true; static u32 cfi_seed __ro_after_init; @@ -1159,8 +1188,10 @@ static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline, goto err; if (cfi_rand) { - if (builtin) + if (builtin) { cfi_seed = get_random_u32(); + cfi_bpf_hash = cfi_rehash(cfi_bpf_hash); + } ret = cfi_rand_preamble(start_cfi, end_cfi); if (ret) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index a4f84b7e89b8..6798237a5c83 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -16,6 +16,7 @@ #include <asm/set_memory.h> #include <asm/nospec-branch.h> #include <asm/text-patching.h> +#include <asm/cfi.h> static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len) { @@ -50,9 +51,11 @@ static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len) do { EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0) #ifdef CONFIG_X86_KERNEL_IBT -#define EMIT_ENDBR() EMIT(gen_endbr(), 4) +#define EMIT_ENDBR() EMIT(gen_endbr(), 4) +#define EMIT_ENDBR_POISON() EMIT(gen_endbr_poison(), 4) #else #define EMIT_ENDBR() +#define EMIT_ENDBR_POISON() #endif static bool is_imm8(int value) @@ -337,6 +340,69 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used) *pprog = prog; } +/* + * Emit the various CFI preambles, see asm/cfi.h and the comments about FineIBT + * in arch/x86/kernel/alternative.c + */ + +static void emit_fineibt(u8 **pprog) +{ + u8 *prog = *pprog; + + EMIT_ENDBR(); + EMIT3_off32(0x41, 0x81, 0xea, cfi_bpf_hash); /* subl $hash, %r10d */ + EMIT2(0x74, 0x07); /* jz.d8 +7 */ + EMIT2(0x0f, 0x0b); /* ud2 */ + EMIT1(0x90); /* nop */ + EMIT_ENDBR_POISON(); + + *pprog = prog; +} + +static void emit_kcfi(u8 **pprog) +{ + u8 *prog = *pprog; + + EMIT1_off32(0xb8, cfi_bpf_hash); /* movl $hash, %eax */ +#ifdef CONFIG_CALL_PADDING + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); +#endif + EMIT_ENDBR(); + + *pprog = prog; +} + +static void emit_cfi(u8 **pprog) +{ + u8 *prog = *pprog; + + switch (cfi_mode) { + case CFI_FINEIBT: + emit_fineibt(&prog); + break; + + case CFI_KCFI: + emit_kcfi(&prog); + break; + + default: + EMIT_ENDBR(); + break; + } + + *pprog = prog; +} + /* * Emit x86-64 prologue code for BPF program. * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes @@ -347,10 +413,10 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, { u8 *prog = *pprog; + emit_cfi(&prog); /* BPF trampoline can be made to work without these nops, * but let's waste 5 bytes for now and optimize later */ - EMIT_ENDBR(); memcpy(prog, x86_nops[5], X86_PATCH_SIZE); prog += X86_PATCH_SIZE; if (!ebpf_from_cbpf) { @@ -3039,9 +3105,16 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) jit_data->header = header; jit_data->rw_header = rw_header; } - prog->bpf_func = (void *)image; + /* + * ctx.prog_offset is used when CFI preambles put code *before* + * the function. See emit_cfi(). For FineIBT specifically this code + * can also be executed and bpf_prog_kallsyms_add() will + * generate an additional symbol to cover this, hence also + * decrement proglen. + */ + prog->bpf_func = (void *)image + cfi_get_offset(); prog->jited = 1; - prog->jited_len = proglen; + prog->jited_len = proglen - cfi_get_offset(); } else { prog = orig_prog; } @@ -3096,6 +3169,7 @@ void bpf_jit_free(struct bpf_prog *prog) kvfree(jit_data->addrs); kfree(jit_data); } + prog->bpf_func = (void *)prog->bpf_func - cfi_get_offset(); hdr = bpf_jit_binary_pack_hdr(prog); bpf_jit_binary_pack_free(hdr, NULL); WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog)); diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 533148ccb549..2839780ba1b5 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -30,6 +30,7 @@ #include <linux/static_call.h> #include <linux/memcontrol.h> #include <linux/kabi.h> +#include <linux/cfi.h> struct bpf_verifier_env; struct bpf_verifier_log; @@ -1278,7 +1279,12 @@ struct bpf_dispatcher { KABI_RESERVE(4) }; -static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func( +#ifndef __bpfcall +#define __bpfcall __nocfi +#endif + +static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func( + const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func) @@ -1372,7 +1378,7 @@ int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_func #define DEFINE_BPF_DISPATCHER(name) \ __BPF_DISPATCHER_SC(name); \ - noinline __nocfi unsigned int bpf_dispatcher_##name##_func( \ + noinline __bpfcall unsigned int bpf_dispatcher_##name##_func( \ const void *ctx, \ const struct bpf_insn *insnsi, \ bpf_func_t bpf_func) \ @@ -1523,6 +1529,9 @@ struct bpf_prog_aux { struct bpf_kfunc_desc_tab *kfunc_tab; struct bpf_kfunc_btf_tab *kfunc_btf_tab; u32 size_poke_tab; +#ifdef CONFIG_FINEIBT + struct bpf_ksym ksym_prefix; +#endif struct bpf_ksym ksym; const struct bpf_prog_ops *ops; struct bpf_map **used_maps; diff --git a/include/linux/cfi.h b/include/linux/cfi.h index 2309d74e77e6..1ed2d96c0cfc 100644 --- a/include/linux/cfi.h +++ b/include/linux/cfi.h @@ -11,6 +11,13 @@ #include <linux/module.h> #include <asm/cfi.h> +#ifndef cfi_get_offset +static inline int cfi_get_offset(void) +{ + return 0; +} +#endif + #ifdef CONFIG_CFI_CLANG enum bug_trap_type report_cfi_failure(struct pt_regs *regs, unsigned long addr, unsigned long *target, u32 type); diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index d0436f78c4b7..50f305a31415 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -121,6 +121,9 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag #endif INIT_LIST_HEAD_RCU(&fp->aux->ksym.lnode); +#ifdef CONFIG_FINEIBT + INIT_LIST_HEAD_RCU(&fp->aux->ksym_prefix.lnode); +#endif mutex_init(&fp->aux->used_maps_mutex); mutex_init(&fp->aux->ext_mutex); mutex_init(&fp->aux->dst_mutex); @@ -692,6 +695,23 @@ void bpf_prog_kallsyms_add(struct bpf_prog *fp) fp->aux->ksym.prog = true; bpf_ksym_add(&fp->aux->ksym); + +#ifdef CONFIG_FINEIBT + /* + * When FineIBT, code in the __cfi_foo() symbols can get executed + * and hence unwinder needs help. + */ + if (cfi_mode != CFI_FINEIBT) + return; + + snprintf(fp->aux->ksym_prefix.name, KSYM_NAME_LEN, + "__cfi_%s", fp->aux->ksym.name); + + fp->aux->ksym_prefix.start = (unsigned long) fp->bpf_func - 16; + fp->aux->ksym_prefix.end = (unsigned long) fp->bpf_func; + + bpf_ksym_add(&fp->aux->ksym_prefix); +#endif } void bpf_prog_kallsyms_del(struct bpf_prog *fp) @@ -700,6 +720,11 @@ void bpf_prog_kallsyms_del(struct bpf_prog *fp) return; bpf_ksym_del(&fp->aux->ksym); +#ifdef CONFIG_FINEIBT + if (cfi_mode != CFI_FINEIBT) + return; + bpf_ksym_del(&fp->aux->ksym_prefix); +#endif } static struct bpf_ksym *bpf_ksym_find(unsigned long addr) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit e72d88d18df4e03c80e64c2535f70c64f1dc6fc1 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Where the main BPF program is expected to match bpf_func_t, sub-programs are expected to match bpf_callback_t. This fixes things like: tools/testing/selftests/bpf/progs/bloom_filter_bench.c: bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0); Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.451956710@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/include/asm/cfi.h | 2 ++ arch/x86/kernel/alternative.c | 18 ++++++++++++++++++ arch/x86/net/bpf_jit_comp.c | 18 ++++++++++-------- 3 files changed, 30 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h index 7a7b0b823a98..8779abd217b7 100644 --- a/arch/x86/include/asm/cfi.h +++ b/arch/x86/include/asm/cfi.h @@ -106,6 +106,7 @@ struct pt_regs; enum bug_trap_type handle_cfi_failure(struct pt_regs *regs); #define __bpfcall extern u32 cfi_bpf_hash; +extern u32 cfi_bpf_subprog_hash; static inline int cfi_get_offset(void) { @@ -128,6 +129,7 @@ static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs) return BUG_TRAP_TYPE_NONE; } #define cfi_bpf_hash 0U +#define cfi_bpf_subprog_hash 0U #endif /* CONFIG_CFI_CLANG */ #endif /* _ASM_X86_CFI_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 811f4a1f0fdf..287065ad3feb 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -876,6 +876,23 @@ asm ( " .size cfi_bpf_hash, 4 \n" " .popsection \n" ); + +/* Must match bpf_callback_t */ +extern u64 __bpf_callback_fn(u64, u64, u64, u64, u64); + +__ADDRESSABLE(__bpf_callback_fn); + +/* u32 __ro_after_init cfi_bpf_subprog_hash = __kcfi_typeid___bpf_callback_fn; */ +asm ( +" .pushsection .data..ro_after_init,\"aw\",@progbits \n" +" .type cfi_bpf_subprog_hash,@object \n" +" .globl cfi_bpf_subprog_hash \n" +" .p2align 2, 0x0 \n" +"cfi_bpf_subprog_hash: \n" +" .long __kcfi_typeid___bpf_callback_fn \n" +" .size cfi_bpf_subprog_hash, 4 \n" +" .popsection \n" +); #endif #ifdef CONFIG_FINEIBT @@ -1191,6 +1208,7 @@ static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline, if (builtin) { cfi_seed = get_random_u32(); cfi_bpf_hash = cfi_rehash(cfi_bpf_hash); + cfi_bpf_subprog_hash = cfi_rehash(cfi_bpf_subprog_hash); } ret = cfi_rand_preamble(start_cfi, end_cfi); diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 6798237a5c83..35b04d0d88e5 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -345,12 +345,13 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used) * in arch/x86/kernel/alternative.c */ -static void emit_fineibt(u8 **pprog) +static void emit_fineibt(u8 **pprog, bool is_subprog) { + u32 hash = is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash; u8 *prog = *pprog; EMIT_ENDBR(); - EMIT3_off32(0x41, 0x81, 0xea, cfi_bpf_hash); /* subl $hash, %r10d */ + EMIT3_off32(0x41, 0x81, 0xea, hash); /* subl $hash, %r10d */ EMIT2(0x74, 0x07); /* jz.d8 +7 */ EMIT2(0x0f, 0x0b); /* ud2 */ EMIT1(0x90); /* nop */ @@ -359,11 +360,12 @@ static void emit_fineibt(u8 **pprog) *pprog = prog; } -static void emit_kcfi(u8 **pprog) +static void emit_kcfi(u8 **pprog, bool is_subprog) { + u32 hash = is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash; u8 *prog = *pprog; - EMIT1_off32(0xb8, cfi_bpf_hash); /* movl $hash, %eax */ + EMIT1_off32(0xb8, hash); /* movl $hash, %eax */ #ifdef CONFIG_CALL_PADDING EMIT1(0x90); EMIT1(0x90); @@ -382,17 +384,17 @@ static void emit_kcfi(u8 **pprog) *pprog = prog; } -static void emit_cfi(u8 **pprog) +static void emit_cfi(u8 **pprog, bool is_subprog) { u8 *prog = *pprog; switch (cfi_mode) { case CFI_FINEIBT: - emit_fineibt(&prog); + emit_fineibt(&prog, is_subprog); break; case CFI_KCFI: - emit_kcfi(&prog); + emit_kcfi(&prog, is_subprog); break; default: @@ -413,7 +415,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, { u8 *prog = *pprog; - emit_cfi(&prog); + emit_cfi(&prog, is_subprog); /* BPF trampoline can be made to work without these nops, * but let's waste 5 bytes for now and optimize later */ -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit 2cd3e3772e41377f32d6eea643e0590774e9187c category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- BPF struct_ops uses __arch_prepare_bpf_trampoline() to write trampolines for indirect function calls. These tramplines much have matching CFI. In order to obtain the correct CFI hash for the various methods, add a matching structure that contains stub functions, the compiler will generate correct CFI which we can pilfer for the trampolines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: include/linux/bpf.h [Fix conflicts because of kabi.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/include/asm/cfi.h | 6 +++ arch/x86/kernel/alternative.c | 22 +++++++++++ arch/x86/net/bpf_jit_comp.c | 66 ++++++++++++++++++++------------ include/linux/bpf.h | 13 +++++++ kernel/bpf/bpf_struct_ops.c | 16 ++++---- net/bpf/bpf_dummy_struct_ops.c | 31 ++++++++++++++- net/ipv4/bpf_tcp_ca.c | 69 ++++++++++++++++++++++++++++++++++ 7 files changed, 191 insertions(+), 32 deletions(-) diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h index 8779abd217b7..1a50b2cd4713 100644 --- a/arch/x86/include/asm/cfi.h +++ b/arch/x86/include/asm/cfi.h @@ -123,6 +123,8 @@ static inline int cfi_get_offset(void) } #define cfi_get_offset cfi_get_offset +extern u32 cfi_get_func_hash(void *func); + #else static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs) { @@ -130,6 +132,10 @@ static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs) } #define cfi_bpf_hash 0U #define cfi_bpf_subprog_hash 0U +static inline u32 cfi_get_func_hash(void *func) +{ + return 0; +} #endif /* CONFIG_CFI_CLANG */ #endif /* _ASM_X86_CFI_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 287065ad3feb..183d42302243 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -893,6 +893,28 @@ asm ( " .size cfi_bpf_subprog_hash, 4 \n" " .popsection \n" ); + +u32 cfi_get_func_hash(void *func) +{ + u32 hash; + + func -= cfi_get_offset(); + switch (cfi_mode) { + case CFI_FINEIBT: + func += 7; + break; + case CFI_KCFI: + func += 1; + break; + default: + return 0; + } + + if (get_kernel_nofault(hash, func)) + return 0; + + return hash; +} #endif #ifdef CONFIG_FINEIBT diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 35b04d0d88e5..5c020dcec31b 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -345,9 +345,8 @@ static void pop_callee_regs(u8 **pprog, bool *callee_regs_used) * in arch/x86/kernel/alternative.c */ -static void emit_fineibt(u8 **pprog, bool is_subprog) +static void emit_fineibt(u8 **pprog, u32 hash) { - u32 hash = is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash; u8 *prog = *pprog; EMIT_ENDBR(); @@ -360,9 +359,8 @@ static void emit_fineibt(u8 **pprog, bool is_subprog) *pprog = prog; } -static void emit_kcfi(u8 **pprog, bool is_subprog) +static void emit_kcfi(u8 **pprog, u32 hash) { - u32 hash = is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash; u8 *prog = *pprog; EMIT1_off32(0xb8, hash); /* movl $hash, %eax */ @@ -384,17 +382,17 @@ static void emit_kcfi(u8 **pprog, bool is_subprog) *pprog = prog; } -static void emit_cfi(u8 **pprog, bool is_subprog) +static void emit_cfi(u8 **pprog, u32 hash) { u8 *prog = *pprog; switch (cfi_mode) { case CFI_FINEIBT: - emit_fineibt(&prog, is_subprog); + emit_fineibt(&prog, hash); break; case CFI_KCFI: - emit_kcfi(&prog, is_subprog); + emit_kcfi(&prog, hash); break; default: @@ -415,7 +413,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, { u8 *prog = *pprog; - emit_cfi(&prog, is_subprog); + emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash); /* BPF trampoline can be made to work without these nops, * but let's waste 5 bytes for now and optimize later */ @@ -2536,10 +2534,19 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im u8 *prog; bool save_ret; + /* + * F_INDIRECT is only compatible with F_RET_FENTRY_RET, it is + * explicitly incompatible with F_CALL_ORIG | F_SKIP_FRAME | F_IP_ARG + * because @func_addr. + */ + WARN_ON_ONCE((flags & BPF_TRAMP_F_INDIRECT) && + (flags & ~(BPF_TRAMP_F_INDIRECT | BPF_TRAMP_F_RET_FENTRY_RET))); + /* extra registers for struct arguments */ - for (i = 0; i < m->nr_args; i++) + for (i = 0; i < m->nr_args; i++) { if (m->arg_flags[i] & BTF_FMODEL_STRUCT_ARG) nr_regs += (m->arg_size[i] + 7) / 8 - 1; + } /* x86-64 supports up to MAX_BPF_FUNC_ARGS arguments. 1-6 * are passed through regs, the remains are through stack. @@ -2622,20 +2629,27 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im prog = rw_image; - EMIT_ENDBR(); - /* - * This is the direct-call trampoline, as such it needs accounting - * for the __fentry__ call. - */ - x86_call_depth_emit_accounting(&prog, NULL); + if (flags & BPF_TRAMP_F_INDIRECT) { + /* + * Indirect call for bpf_struct_ops + */ + emit_cfi(&prog, cfi_get_func_hash(func_addr)); + } else { + /* + * Direct-call fentry stub, as such it needs accounting for the + * __fentry__ call. + */ + x86_call_depth_emit_accounting(&prog, NULL); + } EMIT1(0x55); /* push rbp */ EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */ - if (!is_imm8(stack_size)) + if (!is_imm8(stack_size)) { /* sub rsp, stack_size */ EMIT3_off32(0x48, 0x81, 0xEC, stack_size); - else + } else { /* sub rsp, stack_size */ EMIT4(0x48, 0x83, 0xEC, stack_size); + } if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) EMIT1(0x50); /* push rax */ /* mov QWORD PTR [rbp - rbx_off], rbx */ @@ -2669,10 +2683,11 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im } } - if (fentry->nr_links) + if (fentry->nr_links) { if (invoke_bpf(m, &prog, fentry, regs_off, run_ctx_off, flags & BPF_TRAMP_F_RET_FENTRY_RET, image, rw_image)) return -EINVAL; + } if (fmod_ret->nr_links) { branches = kcalloc(fmod_ret->nr_links, sizeof(u8 *), @@ -2691,11 +2706,12 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im restore_regs(m, &prog, regs_off); save_args(m, &prog, arg_stack_off, true); - if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) + if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) { /* Before calling the original function, restore the * tail_call_cnt from stack to rax. */ RESTORE_TAIL_CALL_CNT(stack_size); + } if (flags & BPF_TRAMP_F_ORIG_STACK) { emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8); @@ -2724,17 +2740,19 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im /* Update the branches saved in invoke_bpf_mod_ret with the * aligned address of do_fexit. */ - for (i = 0; i < fmod_ret->nr_links; i++) + for (i = 0; i < fmod_ret->nr_links; i++) { emit_cond_near_jump(&branches[i], image + (prog - (u8 *)rw_image), image + (branches[i] - (u8 *)rw_image), X86_JNE); + } } - if (fexit->nr_links) + if (fexit->nr_links) { if (invoke_bpf(m, &prog, fexit, regs_off, run_ctx_off, false, image, rw_image)) { ret = -EINVAL; goto cleanup; } + } if (flags & BPF_TRAMP_F_RESTORE_REGS) restore_regs(m, &prog, regs_off); @@ -2751,11 +2769,12 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im ret = -EINVAL; goto cleanup; } - } else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) + } else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) { /* Before running the original function, restore the * tail_call_cnt from stack to rax. */ RESTORE_TAIL_CALL_CNT(stack_size); + } /* restore return value of orig_call or fentry prog back into RAX */ if (save_ret) @@ -2763,9 +2782,10 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off); EMIT1(0xC9); /* leave */ - if (flags & BPF_TRAMP_F_SKIP_FRAME) + if (flags & BPF_TRAMP_F_SKIP_FRAME) { /* skip our return address and return to parent */ EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */ + } emit_return(&prog, image + (prog - (u8 *)rw_image)); /* Make sure the trampoline generation logic doesn't overflow */ if (WARN_ON_ONCE(prog > (u8 *)rw_image_end - BPF_INSN_SAFETY)) { diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 2839780ba1b5..3bb9b329340f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1113,6 +1113,17 @@ struct btf_func_model { */ #define BPF_TRAMP_F_TAIL_CALL_CTX BIT(7) +/* + * Indicate the trampoline should be suitable to receive indirect calls; + * without this indirectly calling the generated code can result in #UD/#CP, + * depending on the CFI options. + * + * Used by bpf_struct_ops. + * + * Incompatible with FENTRY usage, overloads @func_addr argument. + */ +#define BPF_TRAMP_F_INDIRECT BIT(8) + /* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50 * bytes on x86. */ @@ -1787,6 +1798,7 @@ struct bpf_struct_ops { struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS]; u32 type_id; u32 value_id; + void *cfi_stubs; KABI_RESERVE(1) KABI_RESERVE(2) @@ -1805,6 +1817,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key, int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, struct bpf_tramp_link *link, const struct btf_func_model *model, + void *stub_func, void *image, void *image_end); static inline bool bpf_try_module_get(const void *data, struct module *owner) { diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 4d53c53fc5aa..02068bd0e4d9 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -352,17 +352,16 @@ const struct bpf_link_ops bpf_struct_ops_link_lops = { int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, struct bpf_tramp_link *link, const struct btf_func_model *model, - void *image, void *image_end) + void *stub_func, void *image, void *image_end) { - u32 flags; + u32 flags = BPF_TRAMP_F_INDIRECT; int size; tlinks[BPF_TRAMP_FENTRY].links[0] = link; tlinks[BPF_TRAMP_FENTRY].nr_links = 1; - /* BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops, - * and it must be used alone. - */ - flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0; + + if (model->ret_size > 0) + flags |= BPF_TRAMP_F_RET_FENTRY_RET; size = arch_bpf_trampoline_size(model, flags, tlinks, NULL); if (size < 0) @@ -370,7 +369,7 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, if (size > (unsigned long)image_end - (unsigned long)image) return -E2BIG; return arch_prepare_bpf_trampoline(NULL, image, image_end, - model, flags, tlinks, NULL); + model, flags, tlinks, stub_func); } static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, @@ -504,11 +503,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, err = bpf_struct_ops_prepare_trampoline(tlinks, link, &st_ops->func_models[i], + *(void **)(st_ops->cfi_stubs + moff), image, image_end); if (err < 0) goto reset_unlock; - *(void **)(kdata + moff) = image; + *(void **)(kdata + moff) = image + cfi_get_offset(); image += err; /* put prog_id to udata */ diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c index 2748f9d77b18..8906f7bdf4a9 100644 --- a/net/bpf/bpf_dummy_struct_ops.c +++ b/net/bpf/bpf_dummy_struct_ops.c @@ -12,6 +12,11 @@ extern struct bpf_struct_ops bpf_bpf_dummy_ops; /* A common type for test_N with return value in bpf_dummy_ops */ typedef int (*dummy_ops_test_ret_fn)(struct bpf_dummy_ops_state *state, ...); +static int dummy_ops_test_ret_function(struct bpf_dummy_ops_state *state, ...) +{ + return 0; +} + struct bpf_dummy_ops_test_args { u64 args[MAX_BPF_FUNC_ARGS]; struct bpf_dummy_ops_state state; @@ -62,7 +67,7 @@ static int dummy_ops_copy_args(struct bpf_dummy_ops_test_args *args) static int dummy_ops_call_op(void *image, struct bpf_dummy_ops_test_args *args) { - dummy_ops_test_ret_fn test = (void *)image; + dummy_ops_test_ret_fn test = (void *)image + cfi_get_offset(); struct bpf_dummy_ops_state *state = NULL; /* state needs to be NULL if args[0] is 0 */ @@ -119,6 +124,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, op_idx = prog->expected_attach_type; err = bpf_struct_ops_prepare_trampoline(tlinks, link, &st_ops->func_models[op_idx], + &dummy_ops_test_ret_function, image, image + PAGE_SIZE); if (err < 0) goto out; @@ -219,6 +225,28 @@ static void bpf_dummy_unreg(void *kdata) { } +static int bpf_dummy_test_1(struct bpf_dummy_ops_state *cb) +{ + return 0; +} + +static int bpf_dummy_test_2(struct bpf_dummy_ops_state *cb, int a1, unsigned short a2, + char a3, unsigned long a4) +{ + return 0; +} + +static int bpf_dummy_test_sleepable(struct bpf_dummy_ops_state *cb) +{ + return 0; +} + +static struct bpf_dummy_ops __bpf_bpf_dummy_ops = { + .test_1 = bpf_dummy_test_1, + .test_2 = bpf_dummy_test_2, + .test_sleepable = bpf_dummy_test_sleepable, +}; + struct bpf_struct_ops bpf_bpf_dummy_ops = { .verifier_ops = &bpf_dummy_verifier_ops, .init = bpf_dummy_init, @@ -227,4 +255,5 @@ struct bpf_struct_ops bpf_bpf_dummy_ops = { .reg = bpf_dummy_reg, .unreg = bpf_dummy_unreg, .name = "bpf_dummy_ops", + .cfi_stubs = &__bpf_bpf_dummy_ops, }; diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index 39dcccf0f174..ae8b15e6896f 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -271,6 +271,74 @@ static int bpf_tcp_ca_validate(void *kdata) return tcp_validate_congestion_control(kdata); } +static u32 bpf_tcp_ca_ssthresh(struct sock *sk) +{ + return 0; +} + +static void bpf_tcp_ca_cong_avoid(struct sock *sk, u32 ack, u32 acked) +{ +} + +static void bpf_tcp_ca_set_state(struct sock *sk, u8 new_state) +{ +} + +static void bpf_tcp_ca_cwnd_event(struct sock *sk, enum tcp_ca_event ev) +{ +} + +static void bpf_tcp_ca_in_ack_event(struct sock *sk, u32 flags) +{ +} + +static void bpf_tcp_ca_pkts_acked(struct sock *sk, const struct ack_sample *sample) +{ +} + +static u32 bpf_tcp_ca_min_tso_segs(struct sock *sk) +{ + return 0; +} + +static void bpf_tcp_ca_cong_control(struct sock *sk, const struct rate_sample *rs) +{ +} + +static u32 bpf_tcp_ca_undo_cwnd(struct sock *sk) +{ + return 0; +} + +static u32 bpf_tcp_ca_sndbuf_expand(struct sock *sk) +{ + return 0; +} + +static void __bpf_tcp_ca_init(struct sock *sk) +{ +} + +static void __bpf_tcp_ca_release(struct sock *sk) +{ +} + +static struct tcp_congestion_ops __bpf_ops_tcp_congestion_ops = { + .ssthresh = bpf_tcp_ca_ssthresh, + .cong_avoid = bpf_tcp_ca_cong_avoid, + .set_state = bpf_tcp_ca_set_state, + .cwnd_event = bpf_tcp_ca_cwnd_event, + .in_ack_event = bpf_tcp_ca_in_ack_event, + .pkts_acked = bpf_tcp_ca_pkts_acked, + .min_tso_segs = bpf_tcp_ca_min_tso_segs, + .cong_control = bpf_tcp_ca_cong_control, + .undo_cwnd = bpf_tcp_ca_undo_cwnd, + .sndbuf_expand = bpf_tcp_ca_sndbuf_expand, + + .init = __bpf_tcp_ca_init, + .release = __bpf_tcp_ca_release, +}; + struct bpf_struct_ops bpf_tcp_congestion_ops = { .verifier_ops = &bpf_tcp_ca_verifier_ops, .reg = bpf_tcp_ca_reg, @@ -281,6 +349,7 @@ struct bpf_struct_ops bpf_tcp_congestion_ops = { .init = bpf_tcp_ca_init, .validate = bpf_tcp_ca_validate, .name = "tcp_congestion_ops", + .cfi_stubs = &__bpf_ops_tcp_congestion_ops, }; static int __init bpf_tcp_ca_kfunc_init(void) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit e9d13b9d2f99ccf7afeab490d97eaa5ac9846598 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Add a CFI_NOSEAL() helper to mark functions that need to retain their CFI information, despite not otherwise leaking their address. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.669401084@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/include/asm/cfi.h | 5 +++++ include/linux/cfi.h | 4 ++++ 2 files changed, 9 insertions(+) diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h index 1a50b2cd4713..7cd752557905 100644 --- a/arch/x86/include/asm/cfi.h +++ b/arch/x86/include/asm/cfi.h @@ -8,6 +8,7 @@ * Copyright (C) 2022 Google LLC */ #include <linux/bug.h> +#include <asm/ibt.h> /* * An overview of the various calling conventions... @@ -138,4 +139,8 @@ static inline u32 cfi_get_func_hash(void *func) } #endif /* CONFIG_CFI_CLANG */ +#if HAS_KERNEL_IBT == 1 +#define CFI_NOSEAL(x) asm(IBT_NOSEAL(__stringify(x))) +#endif + #endif /* _ASM_X86_CFI_H */ diff --git a/include/linux/cfi.h b/include/linux/cfi.h index 1ed2d96c0cfc..f0df518e11dd 100644 --- a/include/linux/cfi.h +++ b/include/linux/cfi.h @@ -46,4 +46,8 @@ static inline void module_cfi_finalize(const Elf_Ehdr *hdr, #endif /* CONFIG_ARCH_USES_CFI_TRAPS */ #endif /* CONFIG_MODULES */ +#ifndef CFI_NOSEAL +#define CFI_NOSEAL(x) +#endif + #endif /* _LINUX_CFI_H */ -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit e4c00339891c074c76f626ac82981963cbba5332 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Ensure the various dtor functions match their prototype and retain their CFI signatures, since they don't have their address taken, they are prone to not getting CFI, making them impossible to call indirectly. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.799451071@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/cpumask.c | 8 +++++++- kernel/bpf/helpers.c | 16 ++++++++++++++-- net/bpf/test_run.c | 15 +++++++++++++-- 3 files changed, 34 insertions(+), 5 deletions(-) diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c index e01c741e54e7..6acecc8ebd61 100644 --- a/kernel/bpf/cpumask.c +++ b/kernel/bpf/cpumask.c @@ -96,6 +96,12 @@ __bpf_kfunc void bpf_cpumask_release(struct bpf_cpumask *cpumask) migrate_enable(); } +__bpf_kfunc void bpf_cpumask_release_dtor(void *cpumask) +{ + bpf_cpumask_release(cpumask); +} +CFI_NOSEAL(bpf_cpumask_release_dtor); + /** * bpf_cpumask_first() - Get the index of the first nonzero bit in the cpumask. * @cpumask: The cpumask being queried. @@ -441,7 +447,7 @@ static const struct btf_kfunc_id_set cpumask_kfunc_set = { BTF_ID_LIST(cpumask_dtor_ids) BTF_ID(struct, bpf_cpumask) -BTF_ID(func, bpf_cpumask_release) +BTF_ID(func, bpf_cpumask_release_dtor) static int __init cpumask_kfunc_init(void) { diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a17fee069161..0edec71be0a7 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2270,6 +2270,12 @@ __bpf_kfunc void bpf_task_release(struct task_struct *p) put_task_struct_rcu_user(p); } +__bpf_kfunc void bpf_task_release_dtor(void *p) +{ + put_task_struct_rcu_user(p); +} +CFI_NOSEAL(bpf_task_release_dtor); + #ifdef CONFIG_CGROUPS /** * bpf_cgroup_acquire - Acquire a reference to a cgroup. A cgroup acquired by @@ -2294,6 +2300,12 @@ __bpf_kfunc void bpf_cgroup_release(struct cgroup *cgrp) cgroup_put(cgrp); } +__bpf_kfunc void bpf_cgroup_release_dtor(void *cgrp) +{ + cgroup_put(cgrp); +} +CFI_NOSEAL(bpf_cgroup_release_dtor); + /** * bpf_cgroup_ancestor - Perform a lookup on an entry in a cgroup's ancestor * array. A cgroup returned by this kfunc which is not subsequently stored in a @@ -2653,10 +2665,10 @@ static const struct btf_kfunc_id_set generic_kfunc_set = { BTF_ID_LIST(generic_dtor_ids) BTF_ID(struct, task_struct) -BTF_ID(func, bpf_task_release) +BTF_ID(func, bpf_task_release_dtor) #ifdef CONFIG_CGROUPS BTF_ID(struct, cgroup) -BTF_ID(func, bpf_cgroup_release) +BTF_ID(func, bpf_cgroup_release_dtor) #endif BTF_SET8_START(common_btf_ids) diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 6ccb767169fc..63a875c638b6 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -601,10 +601,21 @@ __bpf_kfunc void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) refcount_dec(&p->cnt); } +__bpf_kfunc void bpf_kfunc_call_test_release_dtor(void *p) +{ + bpf_kfunc_call_test_release(p); +} +CFI_NOSEAL(bpf_kfunc_call_test_release_dtor); + __bpf_kfunc void bpf_kfunc_call_memb_release(struct prog_test_member *p) { } +__bpf_kfunc void bpf_kfunc_call_memb_release_dtor(void *p) +{ +} +CFI_NOSEAL(bpf_kfunc_call_memb_release_dtor); + __bpf_kfunc_end_defs(); BTF_SET8_START(bpf_test_modify_return_ids) @@ -1698,9 +1709,9 @@ static const struct btf_kfunc_id_set bpf_prog_test_kfunc_set = { BTF_ID_LIST(bpf_prog_test_dtor_kfunc_ids) BTF_ID(struct, prog_test_ref_kfunc) -BTF_ID(func, bpf_kfunc_call_test_release) +BTF_ID(func, bpf_kfunc_call_test_release_dtor) BTF_ID(struct, prog_test_member) -BTF_ID(func, bpf_kfunc_call_memb_release) +BTF_ID(func, bpf_kfunc_call_memb_release_dtor) static int __init bpf_prog_test_run_init(void) { -- 2.34.1
From: Alexei Starovoitov <ast@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit 0c970ed2f87c058fe3ddeb4d7d8f64f72cf41d7a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The func_addr used to be NULL for indirect trampolines used by struct_ops. Now func_addr is a valid function pointer. Hence use BPF_TRAMP_F_INDIRECT flag to detect such condition. Fixes: 2cd3e3772e41 ("x86/cfi,bpf: Fix bpf_struct_ops CFI") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/bpf/20231216004549.78355-1-alexei.starovoitov@gmail.... Conflicts: tools/testing/selftests/bpf/DENYLIST.s390x [Fix conflict because context diff.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/s390/net/bpf_jit_comp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c index 1f22da0ec420..28df56b06f5a 100644 --- a/arch/s390/net/bpf_jit_comp.c +++ b/arch/s390/net/bpf_jit_comp.c @@ -2258,7 +2258,8 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, return -ENOTSUPP; /* Return to %r14, since func_addr and %r0 are not available. */ - if (!func_addr && !(flags & BPF_TRAMP_F_ORIG_STACK)) + if ((!func_addr && !(flags & BPF_TRAMP_F_ORIG_STACK)) || + (flags & BPF_TRAMP_F_INDIRECT)) flags |= BPF_TRAMP_F_SKIP_FRAME; /* -- 2.34.1
From: Alexei Starovoitov <ast@kernel.org> mainline inclusion from mainline-v6.8-rc1 commit a8b242d77bd72556b7a9d8be779f7d27b95ba73c category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Compilers optimize conditional operators at will, but often bpf programmers want to force compilers to keep the same operator in asm as it's written in C. Introduce bpf_cmp_likely/unlikely(var1, conditional_op, var2) macros that can be used as: - if (seen >= 1000) + if (bpf_cmp_unlikely(seen, >=, 1000)) The macros take advantage of BPF assembly that is C like. The macros check the sign of variable 'seen' and emits either signed or unsigned compare. For example: int a; bpf_cmp_unlikely(a, >, 0) will be translated to 'if rX s> 0 goto' in BPF assembly. unsigned int a; bpf_cmp_unlikely(a, >, 0) will be translated to 'if rX > 0 goto' in BPF assembly. C type conversions coupled with comparison operator are tricky. int i = -1; unsigned int j = 1; if (i < j) // this is false. long i = -1; unsigned int j = 1; if (i < j) // this is true. Make sure BPF program is compiled with -Wsign-compare then the macros will catch the mistake. The macros check LHS (left hand side) only to figure out the sign of compare. 'if 0 < rX goto' is not allowed in the assembly, so the users have to use a variable on LHS anyway. The patch updates few tests to demonstrate the use of the macros. The macro allows to use BPF_JSET in C code, since LLVM doesn't generate it at present. For example: if (i & j) compiles into r0 &= r1; if r0 == 0 goto while if (bpf_cmp_unlikely(i, &, j)) compiles into if r0 & r1 goto Note that the macros has to be careful with RHS assembly predicate. Since: u64 __rhs = 1ull << 42; asm goto("if r0 < %[rhs] goto +1" :: [rhs] "ri" (__rhs)); LLVM will silently truncate 64-bit constant into s32 imm. Note that [lhs] "r"((short)LHS) the type cast is a workaround for LLVM issue. When LHS is exactly 32-bit LLVM emits redundant <<=32, >>=32 to zero upper 32-bits. When LHS is 64 or 16 or 8-bit variable there are no shifts. When LHS is 32-bit the (u64) cast doesn't help. Hence use (short) cast. It does _not_ truncate the variable before it's assigned to a register. Traditional likely()/unlikely() macros that use __builtin_expect(!!(x), 1 or 0) have no effect on these macros, hence macros implement the logic manually. bpf_cmp_unlikely() macro preserves compare operator as-is while bpf_cmp_likely() macro flips the compare. Consider two cases: A. for() { if (foo >= 10) { bar += foo; } other code; } B. for() { if (foo >= 10) break; other code; } It's ok to use either bpf_cmp_likely or bpf_cmp_unlikely macros in both cases, but consider that 'break' is effectively 'goto out_of_the_loop'. Hence it's better to use bpf_cmp_unlikely in the B case. While 'bar += foo' is better to keep as 'fallthrough' == likely code path in the A case. When it's written as: A. for() { if (bpf_cmp_likely(foo, >=, 10)) { bar += foo; } other code; } B. for() { if (bpf_cmp_unlikely(foo, >=, 10)) break; other code; } The assembly will look like: A. for() { if r1 < 10 goto L1; bar += foo; L1: other code; } B. for() { if r1 >= 10 goto L2; other code; } L2: The bpf_cmp_likely vs bpf_cmp_unlikely changes basic block layout, hence it will greatly influence the verification process. The number of processed instructions will be different, since the verifier walks the fallthrough first. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20231226191148.48536-3-alexei.starovoitov@gmail.... Conflicts: tools/testing/selftests/bpf/prog_tests/exceptions.c tools/testing/selftests/bpf/bpf_experimental.h tools/testing/selftests/bpf/progs/iters_task_vma.c [exceptions.c is no exist, just ignore it. And fix other simple conflicts.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../testing/selftests/bpf/bpf_experimental.h | 69 +++++++++++++++++++ .../selftests/bpf/progs/iters_task_vma.c | 2 +- 2 files changed, 70 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h index facf41fec72f..17127c2f928a 100644 --- a/tools/testing/selftests/bpf/bpf_experimental.h +++ b/tools/testing/selftests/bpf/bpf_experimental.h @@ -189,4 +189,73 @@ extern int bpf_iter_css_new(struct bpf_iter_css *it, extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym; extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym; +#define __cmp_cannot_be_signed(x) \ + __builtin_strcmp(#x, "==") == 0 || __builtin_strcmp(#x, "!=") == 0 || \ + __builtin_strcmp(#x, "&") == 0 + +#define __is_signed_type(type) (((type)(-1)) < (type)1) + +#define __bpf_cmp(LHS, OP, SIGN, PRED, RHS, DEFAULT) \ + ({ \ + __label__ l_true; \ + bool ret = DEFAULT; \ + asm volatile goto("if %[lhs] " SIGN #OP " %[rhs] goto %l[l_true]" \ + :: [lhs] "r"((short)LHS), [rhs] PRED (RHS) :: l_true); \ + ret = !DEFAULT; \ +l_true: \ + ret; \ + }) + +/* C type conversions coupled with comparison operator are tricky. + * Make sure BPF program is compiled with -Wsign-compare then + * __lhs OP __rhs below will catch the mistake. + * Be aware that we check only __lhs to figure out the sign of compare. + */ +#define _bpf_cmp(LHS, OP, RHS, NOFLIP) \ + ({ \ + typeof(LHS) __lhs = (LHS); \ + typeof(RHS) __rhs = (RHS); \ + bool ret; \ + _Static_assert(sizeof(&(LHS)), "1st argument must be an lvalue expression"); \ + (void)(__lhs OP __rhs); \ + if (__cmp_cannot_be_signed(OP) || !__is_signed_type(typeof(__lhs))) { \ + if (sizeof(__rhs) == 8) \ + ret = __bpf_cmp(__lhs, OP, "", "r", __rhs, NOFLIP); \ + else \ + ret = __bpf_cmp(__lhs, OP, "", "i", __rhs, NOFLIP); \ + } else { \ + if (sizeof(__rhs) == 8) \ + ret = __bpf_cmp(__lhs, OP, "s", "r", __rhs, NOFLIP); \ + else \ + ret = __bpf_cmp(__lhs, OP, "s", "i", __rhs, NOFLIP); \ + } \ + ret; \ + }) + +#ifndef bpf_cmp_unlikely +#define bpf_cmp_unlikely(LHS, OP, RHS) _bpf_cmp(LHS, OP, RHS, true) +#endif + +#ifndef bpf_cmp_likely +#define bpf_cmp_likely(LHS, OP, RHS) \ + ({ \ + bool ret; \ + if (__builtin_strcmp(#OP, "==") == 0) \ + ret = _bpf_cmp(LHS, !=, RHS, false); \ + else if (__builtin_strcmp(#OP, "!=") == 0) \ + ret = _bpf_cmp(LHS, ==, RHS, false); \ + else if (__builtin_strcmp(#OP, "<=") == 0) \ + ret = _bpf_cmp(LHS, >, RHS, false); \ + else if (__builtin_strcmp(#OP, "<") == 0) \ + ret = _bpf_cmp(LHS, >=, RHS, false); \ + else if (__builtin_strcmp(#OP, ">") == 0) \ + ret = _bpf_cmp(LHS, <=, RHS, false); \ + else if (__builtin_strcmp(#OP, ">=") == 0) \ + ret = _bpf_cmp(LHS, <, RHS, false); \ + else \ + (void) "bug"; \ + ret; \ + }) +#endif + #endif diff --git a/tools/testing/selftests/bpf/progs/iters_task_vma.c b/tools/testing/selftests/bpf/progs/iters_task_vma.c index 44edecfdfaee..dc0c3691dcc2 100644 --- a/tools/testing/selftests/bpf/progs/iters_task_vma.c +++ b/tools/testing/selftests/bpf/progs/iters_task_vma.c @@ -28,7 +28,7 @@ int iter_task_vma_for_each(const void *ctx) return 0; bpf_for_each(task_vma, vma, task, 0) { - if (seen >= 1000) + if (bpf_cmp_unlikely(seen, >=, 1000)) break; vm_ranges[seen].vm_start = vma->vm_start; -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> mainline inclusion from mainline-v6.8-rc2 commit 1732ebc4a26181c8f116c7639db99754b313edc8 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- We encountered a kernel crash triggered by the bpf_tcp_ca testcase as show below: Unable to handle kernel paging request at virtual address ff60000088554500 Oops [#1] ... CPU: 3 PID: 458 Comm: test_progs Tainted: G OE 6.8.0-rc1-kselftest_plain #1 Hardware name: riscv-virtio,qemu (DT) epc : 0xff60000088554500 ra : tcp_ack+0x288/0x1232 epc : ff60000088554500 ra : ffffffff80cc7166 sp : ff2000000117ba50 gp : ffffffff82587b60 tp : ff60000087be0040 t0 : ff60000088554500 t1 : ffffffff801ed24e t2 : 0000000000000000 s0 : ff2000000117bbc0 s1 : 0000000000000500 a0 : ff20000000691000 a1 : 0000000000000018 a2 : 0000000000000001 a3 : ff60000087be03a0 a4 : 0000000000000000 a5 : 0000000000000000 a6 : 0000000000000021 a7 : ffffffff8263f880 s2 : 000000004ac3c13b s3 : 000000004ac3c13a s4 : 0000000000008200 s5 : 0000000000000001 s6 : 0000000000000104 s7 : ff2000000117bb00 s8 : ff600000885544c0 s9 : 0000000000000000 s10: ff60000086ff0b80 s11: 000055557983a9c0 t3 : 0000000000000000 t4 : 000000000000ffc4 t5 : ffffffff8154f170 t6 : 0000000000000030 status: 0000000200000120 badaddr: ff60000088554500 cause: 000000000000000c Code: c796 67d7 0000 0000 0052 0002 c13b 4ac3 0000 0000 (0001) 0000 ---[ end trace 0000000000000000 ]--- The reason is that commit 2cd3e3772e41 ("x86/cfi,bpf: Fix bpf_struct_ops CFI") changes the func_addr of arch_prepare_bpf_trampoline in struct_ops from NULL to non-NULL, while we use func_addr on RV64 to differentiate between struct_ops and regular trampoline. When the struct_ops testcase is triggered, it emits wrong prologue and epilogue, and lead to unpredictable issues. After commit 2cd3e3772e41, we can use BPF_TRAMP_F_INDIRECT to distinguish them as it always be set in struct_ops. Fixes: 2cd3e3772e41 ("x86/cfi,bpf: Fix bpf_struct_ops CFI") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20240123023207.1917284-1-pulehui@huaweicloud.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/riscv/net/bpf_jit_comp64.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c index 4a1a61ae14d6..4ae7e6e7c061 100644 --- a/arch/riscv/net/bpf_jit_comp64.c +++ b/arch/riscv/net/bpf_jit_comp64.c @@ -796,6 +796,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY]; struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT]; struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN]; + bool is_struct_ops = flags & BPF_TRAMP_F_INDIRECT; void *orig_call = func_addr; bool save_ret; u32 insn; @@ -878,7 +879,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, stack_size = round_up(stack_size, 16); - if (func_addr) { + if (!is_struct_ops) { /* For the trampoline called from function entry, * the frame of traced function and the frame of * trampoline need to be considered. @@ -998,7 +999,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, emit_ld(RV_REG_S1, -sreg_off, RV_REG_FP, ctx); - if (func_addr) { + if (!is_struct_ops) { /* trampoline called from function entry */ emit_ld(RV_REG_T0, stack_size - 8, RV_REG_SP, ctx); emit_ld(RV_REG_FP, stack_size - 16, RV_REG_SP, ctx); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 3b1f89e747cd4b24244f2798a35d28815b744303 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Move the majority of the code to bpf_struct_ops_init_one(), which can then be utilized for the initialization of newly registered dynamically allocated struct_ops types in the following patches. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf.h | 1 + kernel/bpf/bpf_struct_ops.c | 157 +++++++++++++++++++----------------- kernel/bpf/btf.c | 5 ++ 3 files changed, 89 insertions(+), 74 deletions(-) diff --git a/include/linux/btf.h b/include/linux/btf.h index 59d404e22814..1d852dad7473 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -137,6 +137,7 @@ struct btf_struct_metas { extern const struct file_operations btf_fops; +const char *btf_get_name(const struct btf *btf); void btf_get(struct btf *btf); void btf_put(struct btf *btf); int btf_new_fd(const union bpf_attr *attr, bpfptr_t uattr, u32 uattr_sz); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 02068bd0e4d9..96cba76f4ac3 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -110,102 +110,111 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = { static const struct btf_type *module_type; -void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) +static void bpf_struct_ops_init_one(struct bpf_struct_ops *st_ops, + struct btf *btf, + struct bpf_verifier_log *log) { - s32 type_id, value_id, module_id; const struct btf_member *member; - struct bpf_struct_ops *st_ops; const struct btf_type *t; + s32 type_id, value_id; char value_name[128]; const char *mname; - u32 i, j; + int i; - /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */ -#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name); -#include "bpf_struct_ops_types.h" -#undef BPF_STRUCT_OPS_TYPE + if (strlen(st_ops->name) + VALUE_PREFIX_LEN >= + sizeof(value_name)) { + pr_warn("struct_ops name %s is too long\n", + st_ops->name); + return; + } + sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name); - module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT); - if (module_id < 0) { - pr_warn("Cannot find struct module in btf_vmlinux\n"); + value_id = btf_find_by_name_kind(btf, value_name, + BTF_KIND_STRUCT); + if (value_id < 0) { + pr_warn("Cannot find struct %s in %s\n", + value_name, btf_get_name(btf)); return; } - module_type = btf_type_by_id(btf, module_id); - for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { - st_ops = bpf_struct_ops[i]; + type_id = btf_find_by_name_kind(btf, st_ops->name, + BTF_KIND_STRUCT); + if (type_id < 0) { + pr_warn("Cannot find struct %s in %s\n", + st_ops->name, btf_get_name(btf)); + return; + } + t = btf_type_by_id(btf, type_id); + if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) { + pr_warn("Cannot support #%u members in struct %s\n", + btf_type_vlen(t), st_ops->name); + return; + } - if (strlen(st_ops->name) + VALUE_PREFIX_LEN >= - sizeof(value_name)) { - pr_warn("struct_ops name %s is too long\n", + for_each_member(i, t, member) { + const struct btf_type *func_proto; + + mname = btf_name_by_offset(btf, member->name_off); + if (!*mname) { + pr_warn("anon member in struct %s is not supported\n", st_ops->name); - continue; + break; } - sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name); - value_id = btf_find_by_name_kind(btf, value_name, - BTF_KIND_STRUCT); - if (value_id < 0) { - pr_warn("Cannot find struct %s in btf_vmlinux\n", - value_name); - continue; + if (__btf_member_bitfield_size(t, member)) { + pr_warn("bit field member %s in struct %s is not supported\n", + mname, st_ops->name); + break; } - type_id = btf_find_by_name_kind(btf, st_ops->name, - BTF_KIND_STRUCT); - if (type_id < 0) { - pr_warn("Cannot find struct %s in btf_vmlinux\n", - st_ops->name); - continue; - } - t = btf_type_by_id(btf, type_id); - if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) { - pr_warn("Cannot support #%u members in struct %s\n", - btf_type_vlen(t), st_ops->name); - continue; + func_proto = btf_type_resolve_func_ptr(btf, + member->type, + NULL); + if (func_proto && + btf_distill_func_proto(log, btf, + func_proto, mname, + &st_ops->func_models[i])) { + pr_warn("Error in parsing func ptr %s in struct %s\n", + mname, st_ops->name); + break; } + } - for_each_member(j, t, member) { - const struct btf_type *func_proto; + if (i == btf_type_vlen(t)) { + if (st_ops->init(btf)) { + pr_warn("Error in init bpf_struct_ops %s\n", + st_ops->name); + } else { + st_ops->type_id = type_id; + st_ops->type = t; + st_ops->value_id = value_id; + st_ops->value_type = btf_type_by_id(btf, + value_id); + } + } +} - mname = btf_name_by_offset(btf, member->name_off); - if (!*mname) { - pr_warn("anon member in struct %s is not supported\n", - st_ops->name); - break; - } +void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) +{ + struct bpf_struct_ops *st_ops; + s32 module_id; + u32 i; - if (__btf_member_bitfield_size(t, member)) { - pr_warn("bit field member %s in struct %s is not supported\n", - mname, st_ops->name); - break; - } + /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */ +#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name); +#include "bpf_struct_ops_types.h" +#undef BPF_STRUCT_OPS_TYPE - func_proto = btf_type_resolve_func_ptr(btf, - member->type, - NULL); - if (func_proto && - btf_distill_func_proto(log, btf, - func_proto, mname, - &st_ops->func_models[j])) { - pr_warn("Error in parsing func ptr %s in struct %s\n", - mname, st_ops->name); - break; - } - } + module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT); + if (module_id < 0) { + pr_warn("Cannot find struct module in %s\n", btf_get_name(btf)); + return; + } + module_type = btf_type_by_id(btf, module_id); - if (j == btf_type_vlen(t)) { - if (st_ops->init(btf)) { - pr_warn("Error in init bpf_struct_ops %s\n", - st_ops->name); - } else { - st_ops->type_id = type_id; - st_ops->type = t; - st_ops->value_id = value_id; - st_ops->value_type = btf_type_by_id(btf, - value_id); - } - } + for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { + st_ops = bpf_struct_ops[i]; + bpf_struct_ops_init_one(st_ops, btf, log); } } diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 5d1413e95185..6bc791cc4807 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1720,6 +1720,11 @@ static void btf_free_rcu(struct rcu_head *rcu) btf_free(btf); } +const char *btf_get_name(const struct btf *btf) +{ + return btf->name; +} + void btf_get(struct btf *btf) { refcount_inc(&btf->refcnt); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 95678395386d45fa0a075d2e7a6866326a469d76 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Get ready to remove bpf_struct_ops_init() in the future. By using BTF_ID_LIST, it is possible to gather type information while building instead of runtime. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-3-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_struct_ops.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 96cba76f4ac3..5b3ebcb435d0 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -108,7 +108,12 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = { #endif }; -static const struct btf_type *module_type; +BTF_ID_LIST(st_ops_ids) +BTF_ID(struct, module) + +enum { + IDX_MODULE_ID, +}; static void bpf_struct_ops_init_one(struct bpf_struct_ops *st_ops, struct btf *btf, @@ -197,7 +202,6 @@ static void bpf_struct_ops_init_one(struct bpf_struct_ops *st_ops, void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) { struct bpf_struct_ops *st_ops; - s32 module_id; u32 i; /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */ @@ -205,13 +209,6 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) #include "bpf_struct_ops_types.h" #undef BPF_STRUCT_OPS_TYPE - module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT); - if (module_id < 0) { - pr_warn("Cannot find struct module in %s\n", btf_get_name(btf)); - return; - } - module_type = btf_type_by_id(btf, module_id); - for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { st_ops = bpf_struct_ops[i]; bpf_struct_ops_init_one(st_ops, btf, log); @@ -387,6 +384,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; const struct bpf_struct_ops *st_ops = st_map->st_ops; struct bpf_struct_ops_value *uvalue, *kvalue; + const struct btf_type *module_type; const struct btf_member *member; const struct btf_type *t = st_ops->type; struct bpf_tramp_links *tlinks; @@ -434,6 +432,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, image = st_map->image; image_end = st_map->image + PAGE_SIZE; + module_type = btf_type_by_id(btf_vmlinux, st_ops_ids[IDX_MODULE_ID]); for_each_member(i, t, member) { const struct btf_type *mtype, *ptype; struct bpf_prog *prog; -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 4c5763ed996a61b51d721d0968d0df957826ea49 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Move some of members of bpf_struct_ops to bpf_struct_ops_desc. type_id is unavailabe in bpf_struct_ops anymore. Modules should get it from the btf received by kmod's init function. Cc: netdev@vger.kernel.org Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-4-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Conflicts: include/linux/bpf.h [Fix conflicts because of kabi.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 17 +++++--- kernel/bpf/bpf_struct_ops.c | 80 +++++++++++++++++----------------- kernel/bpf/verifier.c | 8 ++-- net/bpf/bpf_dummy_struct_ops.c | 11 ++++- net/ipv4/bpf_tcp_ca.c | 8 +++- 5 files changed, 75 insertions(+), 49 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 3bb9b329340f..562e52f0f982 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1792,13 +1792,11 @@ struct bpf_struct_ops { void (*unreg)(void *kdata); int (*update)(void *kdata, void *old_kdata); int (*validate)(void *kdata); - const struct btf_type *type; - const struct btf_type *value_type; + void *cfi_stubs; const char *name; struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS]; u32 type_id; u32 value_id; - void *cfi_stubs; KABI_RESERVE(1) KABI_RESERVE(2) @@ -1806,9 +1804,18 @@ struct bpf_struct_ops { KABI_RESERVE(4) }; +struct bpf_struct_ops_desc { + struct bpf_struct_ops *st_ops; + + const struct btf_type *type; + const struct btf_type *value_type; + u32 type_id; + u32 value_id; +}; + #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) #define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA)) -const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id); +const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id); void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log); bool bpf_struct_ops_get(const void *kdata); void bpf_struct_ops_put(const void *kdata); @@ -1852,7 +1859,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); #endif #else -static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) +static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id) { return NULL; } diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 5b3ebcb435d0..9774f7824e8b 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -32,7 +32,7 @@ struct bpf_struct_ops_value { struct bpf_struct_ops_map { struct bpf_map map; struct rcu_head rcu; - const struct bpf_struct_ops *st_ops; + const struct bpf_struct_ops_desc *st_ops_desc; /* protect map_update */ struct mutex lock; /* link has all the bpf_links that is populated @@ -92,9 +92,9 @@ enum { __NR_BPF_STRUCT_OPS_TYPE, }; -static struct bpf_struct_ops * const bpf_struct_ops[] = { +static struct bpf_struct_ops_desc bpf_struct_ops[] = { #define BPF_STRUCT_OPS_TYPE(_name) \ - [BPF_STRUCT_OPS_TYPE_##_name] = &bpf_##_name, + [BPF_STRUCT_OPS_TYPE_##_name] = { .st_ops = &bpf_##_name }, #include "bpf_struct_ops_types.h" #undef BPF_STRUCT_OPS_TYPE }; @@ -115,10 +115,11 @@ enum { IDX_MODULE_ID, }; -static void bpf_struct_ops_init_one(struct bpf_struct_ops *st_ops, - struct btf *btf, - struct bpf_verifier_log *log) +static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, + struct btf *btf, + struct bpf_verifier_log *log) { + struct bpf_struct_ops *st_ops = st_ops_desc->st_ops; const struct btf_member *member; const struct btf_type *t; s32 type_id, value_id; @@ -190,18 +191,18 @@ static void bpf_struct_ops_init_one(struct bpf_struct_ops *st_ops, pr_warn("Error in init bpf_struct_ops %s\n", st_ops->name); } else { - st_ops->type_id = type_id; - st_ops->type = t; - st_ops->value_id = value_id; - st_ops->value_type = btf_type_by_id(btf, - value_id); + st_ops_desc->type_id = type_id; + st_ops_desc->type = t; + st_ops_desc->value_id = value_id; + st_ops_desc->value_type = btf_type_by_id(btf, + value_id); } } } void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) { - struct bpf_struct_ops *st_ops; + struct bpf_struct_ops_desc *st_ops_desc; u32 i; /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */ @@ -210,14 +211,14 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) #undef BPF_STRUCT_OPS_TYPE for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { - st_ops = bpf_struct_ops[i]; - bpf_struct_ops_init_one(st_ops, btf, log); + st_ops_desc = &bpf_struct_ops[i]; + bpf_struct_ops_desc_init(st_ops_desc, btf, log); } } extern struct btf *btf_vmlinux; -static const struct bpf_struct_ops * +static const struct bpf_struct_ops_desc * bpf_struct_ops_find_value(u32 value_id) { unsigned int i; @@ -226,14 +227,14 @@ bpf_struct_ops_find_value(u32 value_id) return NULL; for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { - if (bpf_struct_ops[i]->value_id == value_id) - return bpf_struct_ops[i]; + if (bpf_struct_ops[i].value_id == value_id) + return &bpf_struct_ops[i]; } return NULL; } -const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) +const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id) { unsigned int i; @@ -241,8 +242,8 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) return NULL; for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { - if (bpf_struct_ops[i]->type_id == type_id) - return bpf_struct_ops[i]; + if (bpf_struct_ops[i].type_id == type_id) + return &bpf_struct_ops[i]; } return NULL; @@ -302,7 +303,7 @@ static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key) static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map) { - const struct btf_type *t = st_map->st_ops->type; + const struct btf_type *t = st_map->st_ops_desc->type; u32 i; for (i = 0; i < btf_type_vlen(t); i++) { @@ -382,11 +383,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags) { struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; - const struct bpf_struct_ops *st_ops = st_map->st_ops; + const struct bpf_struct_ops_desc *st_ops_desc = st_map->st_ops_desc; + const struct bpf_struct_ops *st_ops = st_ops_desc->st_ops; struct bpf_struct_ops_value *uvalue, *kvalue; const struct btf_type *module_type; const struct btf_member *member; - const struct btf_type *t = st_ops->type; + const struct btf_type *t = st_ops_desc->type; struct bpf_tramp_links *tlinks; void *udata, *kdata; int prog_fd, err; @@ -399,7 +401,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (*(u32 *)key != 0) return -E2BIG; - err = check_zero_holes(st_ops->value_type, value); + err = check_zero_holes(st_ops_desc->value_type, value); if (err) return err; @@ -492,7 +494,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, } if (prog->type != BPF_PROG_TYPE_STRUCT_OPS || - prog->aux->attach_btf_id != st_ops->type_id || + prog->aux->attach_btf_id != st_ops_desc->type_id || prog->expected_attach_type != i) { bpf_prog_put(prog); err = -EINVAL; @@ -588,7 +590,7 @@ static long bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key) BPF_STRUCT_OPS_STATE_TOBEFREE); switch (prev_state) { case BPF_STRUCT_OPS_STATE_INUSE: - st_map->st_ops->unreg(&st_map->kvalue.data); + st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data); bpf_map_put(map); return 0; case BPF_STRUCT_OPS_STATE_TOBEFREE: @@ -669,22 +671,22 @@ static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr) static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) { - const struct bpf_struct_ops *st_ops; + const struct bpf_struct_ops_desc *st_ops_desc; size_t st_map_size; struct bpf_struct_ops_map *st_map; const struct btf_type *t, *vt; struct bpf_map *map; int ret; - st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); - if (!st_ops) + st_ops_desc = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); + if (!st_ops_desc) return ERR_PTR(-ENOTSUPP); - vt = st_ops->value_type; + vt = st_ops_desc->value_type; if (attr->value_size != vt->size) return ERR_PTR(-EINVAL); - t = st_ops->type; + t = st_ops_desc->type; st_map_size = sizeof(*st_map) + /* kvalue stores the @@ -696,7 +698,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) if (!st_map) return ERR_PTR(-ENOMEM); - st_map->st_ops = st_ops; + st_map->st_ops_desc = st_ops_desc; map = &st_map->map; ret = bpf_jit_charge_modmem(PAGE_SIZE); @@ -733,8 +735,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) static u64 bpf_struct_ops_map_mem_usage(const struct bpf_map *map) { struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; - const struct bpf_struct_ops *st_ops = st_map->st_ops; - const struct btf_type *vt = st_ops->value_type; + const struct bpf_struct_ops_desc *st_ops_desc = st_map->st_ops_desc; + const struct btf_type *vt = st_ops_desc->value_type; u64 usage; usage = sizeof(*st_map) + @@ -808,7 +810,7 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link) /* st_link->map can be NULL if * bpf_struct_ops_link_create() fails to register. */ - st_map->st_ops->unreg(&st_map->kvalue.data); + st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data); bpf_map_put(&st_map->map); } kfree(st_link); @@ -855,7 +857,7 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map if (!bpf_struct_ops_valid_to_reg(new_map)) return -EINVAL; - if (!st_map->st_ops->update) + if (!st_map->st_ops_desc->st_ops->update) return -EOPNOTSUPP; mutex_lock(&update_mutex); @@ -868,12 +870,12 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map old_st_map = container_of(old_map, struct bpf_struct_ops_map, map); /* The new and old struct_ops must be the same type. */ - if (st_map->st_ops != old_st_map->st_ops) { + if (st_map->st_ops_desc != old_st_map->st_ops_desc) { err = -EINVAL; goto err_out; } - err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data); + err = st_map->st_ops_desc->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data); if (err) goto err_out; @@ -924,7 +926,7 @@ int bpf_struct_ops_link_create(union bpf_attr *attr) if (err) goto err_out; - err = st_map->st_ops->reg(st_map->kvalue.data); + err = st_map->st_ops_desc->st_ops->reg(st_map->kvalue.data); if (err) { bpf_link_cleanup(&link_primer); link = NULL; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8c810642d24d..24f05aa247e8 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20208,6 +20208,7 @@ static void print_verification_stats(struct bpf_verifier_env *env) static int check_struct_ops_btf_id(struct bpf_verifier_env *env) { const struct btf_type *t, *func_proto; + const struct bpf_struct_ops_desc *st_ops_desc; const struct bpf_struct_ops *st_ops; const struct btf_member *member; struct bpf_prog *prog = env->prog; @@ -20220,14 +20221,15 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) } btf_id = prog->aux->attach_btf_id; - st_ops = bpf_struct_ops_find(btf_id); - if (!st_ops) { + st_ops_desc = bpf_struct_ops_find(btf_id); + if (!st_ops_desc) { verbose(env, "attach_btf_id %u is not a supported struct\n", btf_id); return -ENOTSUPP; } + st_ops = st_ops_desc->st_ops; - t = st_ops->type; + t = st_ops_desc->type; member_idx = prog->expected_attach_type; if (member_idx >= btf_type_vlen(t)) { verbose(env, "attach to invalid member idx %u of struct %s\n", diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c index 8906f7bdf4a9..ba2c58dba2da 100644 --- a/net/bpf/bpf_dummy_struct_ops.c +++ b/net/bpf/bpf_dummy_struct_ops.c @@ -22,6 +22,8 @@ struct bpf_dummy_ops_test_args { struct bpf_dummy_ops_state state; }; +static struct btf *bpf_dummy_ops_btf; + static struct bpf_dummy_ops_test_args * dummy_ops_init_args(const union bpf_attr *kattr, unsigned int nr) { @@ -90,9 +92,15 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, void *image = NULL; unsigned int op_idx; int prog_ret; + s32 type_id; int err; - if (prog->aux->attach_btf_id != st_ops->type_id) + type_id = btf_find_by_name_kind(bpf_dummy_ops_btf, + bpf_bpf_dummy_ops.name, + BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + if (prog->aux->attach_btf_id != type_id) return -EOPNOTSUPP; func_proto = prog->aux->attach_func_proto; @@ -148,6 +156,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, static int bpf_dummy_init(struct btf *btf) { + bpf_dummy_ops_btf = btf; return 0; } diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index ae8b15e6896f..dffd8828079b 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -20,6 +20,7 @@ static u32 unsupported_ops[] = { static const struct btf_type *tcp_sock_type; static u32 tcp_sock_id, sock_id; +static const struct btf_type *tcp_congestion_ops_type; static int bpf_tcp_ca_init(struct btf *btf) { @@ -36,6 +37,11 @@ static int bpf_tcp_ca_init(struct btf *btf) tcp_sock_id = type_id; tcp_sock_type = btf_type_by_id(btf, tcp_sock_id); + type_id = btf_find_by_name_kind(btf, "tcp_congestion_ops", BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + tcp_congestion_ops_type = btf_type_by_id(btf, type_id); + return 0; } @@ -149,7 +155,7 @@ static u32 prog_ops_moff(const struct bpf_prog *prog) u32 midx; midx = prog->expected_attach_type; - t = bpf_tcp_congestion_ops.type; + t = tcp_congestion_ops_type; m = &btf_type_member(t)[midx]; return __btf_member_bit_offset(t, m) / 8; -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit e61995111a76633376419d1bccede8696e94e6e5 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Maintain a registry of registered struct_ops types in the per-btf (module) struct_ops_tab. This registry allows for easy lookup of struct_ops types that are registered by a specific module. It is a preparation work for supporting kernel module struct_ops in a latter patch. Each struct_ops will be registered under its own kernel module btf and will be stored in the newly added btf->struct_ops_tab. The bpf verifier and bpf syscall (e.g. prog and map cmd) can find the struct_ops and its btf type/size/id... information from btf->struct_ops_tab. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-5-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/btf.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 6bc791cc4807..4ecdd08aec7f 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -245,6 +245,12 @@ struct btf_id_dtor_kfunc_tab { struct btf_id_dtor_kfunc dtors[]; }; +struct btf_struct_ops_tab { + u32 cnt; + u32 capacity; + struct bpf_struct_ops_desc ops[]; +}; + struct btf { void *data; struct btf_type **types; @@ -262,6 +268,7 @@ struct btf { struct btf_kfunc_set_tab *kfunc_set_tab; struct btf_id_dtor_kfunc_tab *dtor_kfunc_tab; struct btf_struct_metas *struct_meta_tab; + struct btf_struct_ops_tab *struct_ops_tab; /* split BTF support */ struct btf *base_btf; @@ -1701,11 +1708,20 @@ static void btf_free_struct_meta_tab(struct btf *btf) btf->struct_meta_tab = NULL; } +static void btf_free_struct_ops_tab(struct btf *btf) +{ + struct btf_struct_ops_tab *tab = btf->struct_ops_tab; + + kfree(tab); + btf->struct_ops_tab = NULL; +} + static void btf_free(struct btf *btf) { btf_free_struct_meta_tab(btf); btf_free_dtor_kfunc_tab(btf); btf_free_kfunc_set_tab(btf); + btf_free_struct_ops_tab(btf); kvfree(btf->types); kvfree(btf->resolved_sizes); kvfree(btf->resolved_ids); @@ -8878,3 +8894,42 @@ bool btf_type_ids_nocast_alias(struct bpf_verifier_log *log, return !strncmp(reg_name, arg_name, cmp_len); } + +static int +btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops) +{ + struct btf_struct_ops_tab *tab, *new_tab; + int i; + + tab = btf->struct_ops_tab; + if (!tab) { + tab = kzalloc(offsetof(struct btf_struct_ops_tab, ops[4]), + GFP_KERNEL); + if (!tab) + return -ENOMEM; + tab->capacity = 4; + btf->struct_ops_tab = tab; + } + + for (i = 0; i < tab->cnt; i++) + if (tab->ops[i].st_ops == st_ops) + return -EEXIST; + + if (tab->cnt == tab->capacity) { + new_tab = krealloc(tab, + offsetof(struct btf_struct_ops_tab, + ops[tab->capacity * 2]), + GFP_KERNEL); + if (!new_tab) + return -ENOMEM; + tab = new_tab; + tab->capacity *= 2; + btf->struct_ops_tab = tab; + } + + tab->ops[btf->struct_ops_tab->cnt].st_ops = st_ops; + + btf->struct_ops_tab->cnt++; + + return 0; +} -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 47f4f657acd5d04c78c5c5ac7022cba9ce3b4a7d category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Once new struct_ops can be registered from modules, btf_vmlinux is no longer the only btf that struct_ops_map would face. st_map should remember what btf it should use to get type information. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-6-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_struct_ops.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 9774f7824e8b..5ddcca4c4fba 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -46,6 +46,8 @@ struct bpf_struct_ops_map { * "links[]". */ void *image; + /* The owner moduler's btf. */ + struct btf *btf; /* uvalue->data stores the kernel struct * (e.g. tcp_congestion_ops) that is more useful * to userspace than the kvalue. For example, @@ -314,7 +316,7 @@ static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map) } } -static int check_zero_holes(const struct btf_type *t, void *data) +static int check_zero_holes(const struct btf *btf, const struct btf_type *t, void *data) { const struct btf_member *member; u32 i, moff, msize, prev_mend = 0; @@ -326,8 +328,8 @@ static int check_zero_holes(const struct btf_type *t, void *data) memchr_inv(data + prev_mend, 0, moff - prev_mend)) return -EINVAL; - mtype = btf_type_by_id(btf_vmlinux, member->type); - mtype = btf_resolve_size(btf_vmlinux, mtype, &msize); + mtype = btf_type_by_id(btf, member->type); + mtype = btf_resolve_size(btf, mtype, &msize); if (IS_ERR(mtype)) return PTR_ERR(mtype); prev_mend = moff + msize; @@ -401,12 +403,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (*(u32 *)key != 0) return -E2BIG; - err = check_zero_holes(st_ops_desc->value_type, value); + err = check_zero_holes(st_map->btf, st_ops_desc->value_type, value); if (err) return err; uvalue = value; - err = check_zero_holes(t, uvalue->data); + err = check_zero_holes(st_map->btf, t, uvalue->data); if (err) return err; @@ -442,7 +444,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, u32 moff; moff = __btf_member_bit_offset(t, member) / 8; - ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL); + ptype = btf_type_resolve_ptr(st_map->btf, member->type, NULL); if (ptype == module_type) { if (*(void **)(udata + moff)) goto reset_unlock; @@ -467,8 +469,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (!ptype || !btf_type_is_func_proto(ptype)) { u32 msize; - mtype = btf_type_by_id(btf_vmlinux, member->type); - mtype = btf_resolve_size(btf_vmlinux, mtype, &msize); + mtype = btf_type_by_id(st_map->btf, member->type); + mtype = btf_resolve_size(st_map->btf, mtype, &msize); if (IS_ERR(mtype)) { err = PTR_ERR(mtype); goto reset_unlock; @@ -607,6 +609,7 @@ static long bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key) static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key, struct seq_file *m) { + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; void *value; int err; @@ -616,7 +619,8 @@ static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key, err = bpf_struct_ops_map_sys_lookup_elem(map, key, value); if (!err) { - btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id, + btf_type_seq_show(st_map->btf, + map->btf_vmlinux_value_type_id, value, m); seq_puts(m, "\n"); } @@ -726,6 +730,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) return ERR_PTR(-ENOMEM); } + st_map->btf = btf_vmlinux; + mutex_init(&st_map->lock); bpf_map_init_from_attr(map, attr); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 1338b93346587a2a6ac79bbcf55ef5b357745573 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Include btf object id (btf_obj_id) in bpf_map_info so that tools (ex: bpftools struct_ops dump) know the correct btf from the kernel to look up type information of struct_ops types. Since struct_ops types can be defined and registered in a module. The type information of a struct_ops type are defined in the btf of the module defining it. The userspace tools need to know which btf is for the module defining a struct_ops type. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-7-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 4 ++++ include/uapi/linux/bpf.h | 2 +- kernel/bpf/bpf_struct_ops.c | 7 +++++++ kernel/bpf/syscall.c | 2 ++ tools/include/uapi/linux/bpf.h | 2 +- 5 files changed, 15 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 562e52f0f982..565077321d8c 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1858,6 +1858,7 @@ struct bpf_dummy_ops { int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); #endif +void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map); #else static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id) { @@ -1885,6 +1886,9 @@ static inline int bpf_struct_ops_link_create(union bpf_attr *attr) { return -EOPNOTSUPP; } +static inline void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map) +{ +} #endif diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 31e5b8061fe5..c33161c17663 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -6485,7 +6485,7 @@ struct bpf_map_info { __u32 btf_id; __u32 btf_key_type_id; __u32 btf_value_type_id; - __u32 :32; /* alignment pad */ + __u32 btf_vmlinux_id; __u64 map_extra; } __attribute__((aligned(8))); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 5ddcca4c4fba..5e98af4fc2e2 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -947,3 +947,10 @@ int bpf_struct_ops_link_create(union bpf_attr *attr) kfree(link); return err; } + +void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map) +{ + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; + + info->btf_vmlinux_id = btf_obj_id(st_map->btf); +} diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 81ab8e889c7d..0d6d3e6d673c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -4767,6 +4767,8 @@ static int bpf_map_get_info_by_fd(struct file *file, info.btf_value_type_id = map->btf_value_type_id; } info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id; + if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) + bpf_map_struct_ops_info_fill(&info, map); if (bpf_map_is_offloaded(map)) { err = bpf_map_offload_info_fill(&info, map); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 7be085c034f7..0d27f3fd97c0 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -6488,7 +6488,7 @@ struct bpf_map_info { __u32 btf_id; __u32 btf_key_type_id; __u32 btf_value_type_id; - __u32 :32; /* alignment pad */ + __u32 btf_vmlinux_id; __u64 map_extra; } __attribute__((aligned(8))); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 689423db3bda2244c24db8a64de4cdb37be1de41 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This is a preparation for searching for struct_ops types from a specified module. BTF is always btf_vmlinux now. This patch passes a pointer of BTF to bpf_struct_ops_find_value() and bpf_struct_ops_find(). Once the new registration API of struct_ops types is used, other BTFs besides btf_vmlinux can also be passed to them. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-8-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 4 ++-- kernel/bpf/bpf_struct_ops.c | 11 ++++++----- kernel/bpf/verifier.c | 2 +- 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 565077321d8c..8720049084c7 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1815,7 +1815,7 @@ struct bpf_struct_ops_desc { #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) #define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA)) -const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id); +const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id); void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log); bool bpf_struct_ops_get(const void *kdata); void bpf_struct_ops_put(const void *kdata); @@ -1860,7 +1860,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, #endif void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map); #else -static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id) +static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id) { return NULL; } diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 5e98af4fc2e2..7505f515aac3 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -221,11 +221,11 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) extern struct btf *btf_vmlinux; static const struct bpf_struct_ops_desc * -bpf_struct_ops_find_value(u32 value_id) +bpf_struct_ops_find_value(struct btf *btf, u32 value_id) { unsigned int i; - if (!value_id || !btf_vmlinux) + if (!value_id || !btf) return NULL; for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { @@ -236,11 +236,12 @@ bpf_struct_ops_find_value(u32 value_id) return NULL; } -const struct bpf_struct_ops_desc *bpf_struct_ops_find(u32 type_id) +const struct bpf_struct_ops_desc * +bpf_struct_ops_find(struct btf *btf, u32 type_id) { unsigned int i; - if (!type_id || !btf_vmlinux) + if (!type_id || !btf) return NULL; for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { @@ -682,7 +683,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) struct bpf_map *map; int ret; - st_ops_desc = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); + st_ops_desc = bpf_struct_ops_find_value(btf_vmlinux, attr->btf_vmlinux_value_type_id); if (!st_ops_desc) return ERR_PTR(-ENOTSUPP); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 24f05aa247e8..305083374194 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20221,7 +20221,7 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) } btf_id = prog->aux->attach_btf_id; - st_ops_desc = bpf_struct_ops_find(btf_id); + st_ops_desc = bpf_struct_ops_find(btf_vmlinux, btf_id); if (!st_ops_desc) { verbose(env, "attach_btf_id %u is not a supported struct\n", btf_id); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit fcc2c1fb0651477c8ed78a3a293c175ccd70697a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Pass the fd of a btf from the userspace to the bpf() syscall, and then convert the fd into a btf. The btf is generated from the module that defines the target BPF struct_ops type. In order to inform the kernel about the module that defines the target struct_ops type, the userspace program needs to provide a btf fd for the respective module's btf. This btf contains essential information on the types defined within the module, including the target struct_ops type. A btf fd must be provided to the kernel for struct_ops maps and for the bpf programs attached to those maps. In the case of the bpf programs, the attach_btf_obj_fd parameter is passed as part of the bpf_attr and is converted into a btf. This btf is then stored in the prog->aux->attach_btf field. Here, it just let the verifier access attach_btf directly. In the case of struct_ops maps, a btf fd is passed as value_type_btf_obj_fd of bpf_attr. The bpf_struct_ops_map_alloc() function converts the fd to a btf and stores it as st_map->btf. A flag BPF_F_VTYPE_BTF_OBJ_FD is added for map_flags to indicate that the value of value_type_btf_obj_fd is set. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-9-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/uapi/linux/bpf.h | 8 +++++ kernel/bpf/bpf_struct_ops.c | 65 ++++++++++++++++++++++++---------- kernel/bpf/syscall.c | 2 +- kernel/bpf/verifier.c | 9 +++-- tools/include/uapi/linux/bpf.h | 8 +++++ 5 files changed, 70 insertions(+), 22 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c33161c17663..da549108042a 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1336,6 +1336,9 @@ enum { /* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */ BPF_F_PATH_FD = (1U << 14), + +/* Flag for value_type_btf_obj_fd, the fd is available */ + BPF_F_VTYPE_BTF_OBJ_FD = (1U << 15), }; /* Flags for BPF_PROG_QUERY. */ @@ -1409,6 +1412,11 @@ union bpf_attr { * to using 5 hash functions). */ __u64 map_extra; + + __s32 value_type_btf_obj_fd; /* fd pointing to a BTF + * type data for + * btf_vmlinux_value_type_id. + */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 7505f515aac3..3b8d689ece5d 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -641,6 +641,7 @@ static void __bpf_struct_ops_map_free(struct bpf_map *map) bpf_jit_uncharge_modmem(PAGE_SIZE); } bpf_map_area_free(st_map->uvalue); + btf_put(st_map->btf); bpf_map_area_free(st_map); } @@ -669,7 +670,8 @@ static void bpf_struct_ops_map_free(struct bpf_map *map) static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr) { if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 || - (attr->map_flags & ~BPF_F_LINK) || !attr->btf_vmlinux_value_type_id) + (attr->map_flags & ~(BPF_F_LINK | BPF_F_VTYPE_BTF_OBJ_FD)) || + !attr->btf_vmlinux_value_type_id) return -EINVAL; return 0; } @@ -681,15 +683,36 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) struct bpf_struct_ops_map *st_map; const struct btf_type *t, *vt; struct bpf_map *map; + struct btf *btf; int ret; - st_ops_desc = bpf_struct_ops_find_value(btf_vmlinux, attr->btf_vmlinux_value_type_id); - if (!st_ops_desc) - return ERR_PTR(-ENOTSUPP); + if (attr->map_flags & BPF_F_VTYPE_BTF_OBJ_FD) { + /* The map holds btf for its whole life time. */ + btf = btf_get_by_fd(attr->value_type_btf_obj_fd); + if (IS_ERR(btf)) + return ERR_CAST(btf); + if (!btf_is_module(btf)) { + btf_put(btf); + return ERR_PTR(-EINVAL); + } + } else { + btf = bpf_get_btf_vmlinux(); + if (IS_ERR(btf)) + return ERR_CAST(btf); + btf_get(btf); + } + + st_ops_desc = bpf_struct_ops_find_value(btf, attr->btf_vmlinux_value_type_id); + if (!st_ops_desc) { + ret = -ENOTSUPP; + goto errout; + } vt = st_ops_desc->value_type; - if (attr->value_size != vt->size) - return ERR_PTR(-EINVAL); + if (attr->value_size != vt->size) { + ret = -EINVAL; + goto errout; + } t = st_ops_desc->type; @@ -700,17 +723,17 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) (vt->size - sizeof(struct bpf_struct_ops_value)); st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); - if (!st_map) - return ERR_PTR(-ENOMEM); + if (!st_map) { + ret = -ENOMEM; + goto errout; + } st_map->st_ops_desc = st_ops_desc; map = &st_map->map; ret = bpf_jit_charge_modmem(PAGE_SIZE); - if (ret) { - __bpf_struct_ops_map_free(map); - return ERR_PTR(ret); - } + if (ret) + goto errout_free; st_map->image = arch_alloc_bpf_trampoline(PAGE_SIZE); if (!st_map->image) { @@ -719,24 +742,30 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) * here. */ bpf_jit_uncharge_modmem(PAGE_SIZE); - __bpf_struct_ops_map_free(map); - return ERR_PTR(-ENOMEM); + ret = -ENOMEM; + goto errout_free; } st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); st_map->links = bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_links *), NUMA_NO_NODE); if (!st_map->uvalue || !st_map->links) { - __bpf_struct_ops_map_free(map); - return ERR_PTR(-ENOMEM); + ret = -ENOMEM; + goto errout_free; } - - st_map->btf = btf_vmlinux; + st_map->btf = btf; mutex_init(&st_map->lock); bpf_map_init_from_attr(map, attr); return map; + +errout_free: + __bpf_struct_ops_map_free(map); +errout: + btf_put(btf); + + return ERR_PTR(ret); } static u64 bpf_struct_ops_map_mem_usage(const struct bpf_map *map) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 0d6d3e6d673c..55beb18bc371 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1135,7 +1135,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, return ret; } -#define BPF_MAP_CREATE_LAST_FIELD map_extra +#define BPF_MAP_CREATE_LAST_FIELD value_type_btf_obj_fd /* called via syscall */ static int map_create(union bpf_attr *attr) { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 305083374194..4db679be64b8 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20213,6 +20213,7 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) const struct btf_member *member; struct bpf_prog *prog = env->prog; u32 btf_id, member_idx; + struct btf *btf; const char *mname; if (!prog->gpl_compatible) { @@ -20220,8 +20221,10 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) return -EINVAL; } + btf = prog->aux->attach_btf ?: bpf_get_btf_vmlinux(); + btf_id = prog->aux->attach_btf_id; - st_ops_desc = bpf_struct_ops_find(btf_vmlinux, btf_id); + st_ops_desc = bpf_struct_ops_find(btf, btf_id); if (!st_ops_desc) { verbose(env, "attach_btf_id %u is not a supported struct\n", btf_id); @@ -20238,8 +20241,8 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) } member = &btf_type_member(t)[member_idx]; - mname = btf_name_by_offset(btf_vmlinux, member->name_off); - func_proto = btf_type_resolve_func_ptr(btf_vmlinux, member->type, + mname = btf_name_by_offset(btf, member->name_off); + func_proto = btf_type_resolve_func_ptr(btf, member->type, NULL); if (!func_proto) { verbose(env, "attach to invalid member %s(@idx %u) of struct %s\n", diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 0d27f3fd97c0..7b9d03897c37 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1336,6 +1336,9 @@ enum { /* Get path from provided FD in BPF_OBJ_PIN/BPF_OBJ_GET commands */ BPF_F_PATH_FD = (1U << 14), + +/* Flag for value_type_btf_obj_fd, the fd is available */ + BPF_F_VTYPE_BTF_OBJ_FD = (1U << 15), }; /* Flags for BPF_PROG_QUERY. */ @@ -1409,6 +1412,11 @@ union bpf_attr { * to using 5 hash functions). */ __u64 map_extra; + + __s32 value_type_btf_obj_fd; /* fd pointing to a BTF + * type data for + * btf_vmlinux_value_type_id. + */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ -- 2.34.1
From: Luo Gengkun <luogengkun2@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- Fix kabi-breakage by using KABI related MARCO because the uapi cannot include kabi.h. And because bpf_attr is a union and pahole show that there are another structure bigger than the modified one. Fixes: 526d613854fe ("bpf: pass attached BTF to the bpf_struct_ops subsystem") Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/uapi/linux/bpf.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index da549108042a..fc59c9ee3988 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1413,10 +1413,12 @@ union bpf_attr { */ __u64 map_extra; +#if !defined(__GENKSYMS__) __s32 value_type_btf_obj_fd; /* fd pointing to a BTF * type data for * btf_vmlinux_value_type_id. */ +#endif }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit e3f87fdfed7b770dd7066b02262b12747881e76d category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- To ensure that a module remains accessible whenever a struct_ops object of a struct_ops type provided by the module is still in use. struct bpf_struct_ops_map doesn't hold a refcnt to btf anymore since a module will hold a refcnt to it's btf already. But, struct_ops programs are different. They hold their associated btf, not the module since they need only btf to assure their types (signatures). However, verifier holds the refcnt of the associated module of a struct_ops type temporarily when verify a struct_ops prog. Verifier needs the help from the verifier operators (struct bpf_verifier_ops) provided by the owner module to verify data access of a prog, provide information, and generate code. This patch also add a count of links (links_cnt) to bpf_struct_ops_map. It avoids bpf_struct_ops_map_put_progs() from accessing btf after calling module_put() in bpf_struct_ops_map_free(). Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-10-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Conflicts: include/linux/bpf.h [Fix conflicts because of kabi.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 1 + include/linux/bpf_verifier.h | 1 + kernel/bpf/bpf_struct_ops.c | 29 +++++++++++++++++++++++------ kernel/bpf/verifier.c | 11 +++++++++++ 4 files changed, 36 insertions(+), 6 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 8720049084c7..a4a6bf41abce 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1794,6 +1794,7 @@ struct bpf_struct_ops { int (*validate)(void *kdata); void *cfi_stubs; const char *name; + struct module *owner; struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS]; u32 type_id; u32 value_id; diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index e1d213f2585d..badc370db9f9 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -681,6 +681,7 @@ struct bpf_verifier_env { u32 prev_insn_idx; struct bpf_prog *prog; /* eBPF program being verified */ const struct bpf_verifier_ops *ops; + struct module *attach_btf_mod; /* The owner module of prog->aux->attach_btf */ struct bpf_verifier_stack_elem *head; /* stack of verifier states to be processed */ int stack_size; /* number of states to be processed */ bool strict_alignment; /* perform strict pointer alignment checks */ diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 3b8d689ece5d..02216a8d9265 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -40,6 +40,7 @@ struct bpf_struct_ops_map { * (in kvalue.data). */ struct bpf_link **links; + u32 links_cnt; /* image is a page that has all the trampolines * that stores the func args before calling the bpf_prog. * A PAGE_SIZE "image" is enough to store all trampoline for @@ -306,10 +307,9 @@ static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key) static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map) { - const struct btf_type *t = st_map->st_ops_desc->type; u32 i; - for (i = 0; i < btf_type_vlen(t); i++) { + for (i = 0; i < st_map->links_cnt; i++) { if (st_map->links[i]) { bpf_link_put(st_map->links[i]); st_map->links[i] = NULL; @@ -641,12 +641,20 @@ static void __bpf_struct_ops_map_free(struct bpf_map *map) bpf_jit_uncharge_modmem(PAGE_SIZE); } bpf_map_area_free(st_map->uvalue); - btf_put(st_map->btf); bpf_map_area_free(st_map); } static void bpf_struct_ops_map_free(struct bpf_map *map) { + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; + + /* st_ops->owner was acquired during map_alloc to implicitly holds + * the btf's refcnt. The acquire was only done when btf_is_module() + * st_map->btf cannot be NULL here. + */ + if (btf_is_module(st_map->btf)) + module_put(st_map->st_ops_desc->st_ops->owner); + /* The struct_ops's function may switch to another struct_ops. * * For example, bpf_tcp_cc_x->init() may switch to @@ -682,6 +690,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) size_t st_map_size; struct bpf_struct_ops_map *st_map; const struct btf_type *t, *vt; + struct module *mod = NULL; struct bpf_map *map; struct btf *btf; int ret; @@ -695,11 +704,18 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) btf_put(btf); return ERR_PTR(-EINVAL); } + + mod = btf_try_get_module(btf); + /* mod holds a refcnt to btf. We don't need an extra refcnt + * here. + */ + btf_put(btf); + if (!mod) + return ERR_PTR(-EINVAL); } else { btf = bpf_get_btf_vmlinux(); if (IS_ERR(btf)) return ERR_CAST(btf); - btf_get(btf); } st_ops_desc = bpf_struct_ops_find_value(btf, attr->btf_vmlinux_value_type_id); @@ -746,8 +762,9 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) goto errout_free; } st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); + st_map->links_cnt = btf_type_vlen(t); st_map->links = - bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_links *), + bpf_map_area_alloc(st_map->links_cnt * sizeof(struct bpf_links *), NUMA_NO_NODE); if (!st_map->uvalue || !st_map->links) { ret = -ENOMEM; @@ -763,7 +780,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) errout_free: __bpf_struct_ops_map_free(map); errout: - btf_put(btf); + module_put(mod); return ERR_PTR(ret); } diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 4db679be64b8..349beb1fdf75 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20222,6 +20222,15 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) } btf = prog->aux->attach_btf ?: bpf_get_btf_vmlinux(); + if (btf_is_module(btf)) { + /* Make sure st_ops is valid through the lifetime of env */ + env->attach_btf_mod = btf_try_get_module(btf); + if (!env->attach_btf_mod) { + verbose(env, "struct_ops module %s is not found\n", + btf_get_name(btf)); + return -ENOTSUPP; + } + } btf_id = prog->aux->attach_btf_id; st_ops_desc = bpf_struct_ops_find(btf, btf_id); @@ -20989,6 +20998,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3 env->prog->expected_attach_type = 0; *prog = env->prog; + + module_put(env->attach_btf_mod); err_unlock: if (!is_priv) mutex_unlock(&bpf_verifier_lock); -- 2.34.1
From: Luo Gengkun <luogengkun2@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- Fix kabi-breakage by using KABI_USE. Fixes: 55cc81adffd9 ("bpf: hold module refcnt in bpf_struct_ops map creation and prog verification.") Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf_verifier.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index badc370db9f9..eeef504067e6 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -681,7 +681,6 @@ struct bpf_verifier_env { u32 prev_insn_idx; struct bpf_prog *prog; /* eBPF program being verified */ const struct bpf_verifier_ops *ops; - struct module *attach_btf_mod; /* The owner module of prog->aux->attach_btf */ struct bpf_verifier_stack_elem *head; /* stack of verifier states to be processed */ int stack_size; /* number of states to be processed */ bool strict_alignment; /* perform strict pointer alignment checks */ @@ -749,7 +748,7 @@ struct bpf_verifier_env { char tmp_str_buf[TMP_STR_BUF_LEN]; KABI_USE(1, struct bpf_jmp_history_entry *cur_hist_ent) - KABI_RESERVE(2) + KABI_USE(2, struct module *attach_btf_mod) KABI_RESERVE(3) KABI_RESERVE(4) KABI_RESERVE(5) -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 612d087d4ba54cef47946e22e5dabad762dd7ed5 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- A value_type should consist of three components: refcnt, state, and data. refcnt and state has been move to struct bpf_struct_ops_common_value to make it easier to check the value type. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-11-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 12 +++++ kernel/bpf/bpf_struct_ops.c | 93 ++++++++++++++++++++++++------------- 2 files changed, 72 insertions(+), 33 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index a4a6bf41abce..6eba14c2eb93 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1814,6 +1814,18 @@ struct bpf_struct_ops_desc { u32 value_id; }; +enum bpf_struct_ops_state { + BPF_STRUCT_OPS_STATE_INIT, + BPF_STRUCT_OPS_STATE_INUSE, + BPF_STRUCT_OPS_STATE_TOBEFREE, + BPF_STRUCT_OPS_STATE_READY, +}; + +struct bpf_struct_ops_common_value { + refcount_t refcnt; + enum bpf_struct_ops_state state; +}; + #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) #define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA)) const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 02216a8d9265..30ab34fab0f8 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -13,19 +13,8 @@ #include <linux/btf_ids.h> #include <linux/rcupdate_wait.h> -enum bpf_struct_ops_state { - BPF_STRUCT_OPS_STATE_INIT, - BPF_STRUCT_OPS_STATE_INUSE, - BPF_STRUCT_OPS_STATE_TOBEFREE, - BPF_STRUCT_OPS_STATE_READY, -}; - -#define BPF_STRUCT_OPS_COMMON_VALUE \ - refcount_t refcnt; \ - enum bpf_struct_ops_state state - struct bpf_struct_ops_value { - BPF_STRUCT_OPS_COMMON_VALUE; + struct bpf_struct_ops_common_value common; char data[] ____cacheline_aligned_in_smp; }; @@ -81,8 +70,8 @@ static DEFINE_MUTEX(update_mutex); #define BPF_STRUCT_OPS_TYPE(_name) \ extern struct bpf_struct_ops bpf_##_name; \ \ -struct bpf_struct_ops_##_name { \ - BPF_STRUCT_OPS_COMMON_VALUE; \ +struct bpf_struct_ops_##_name { \ + struct bpf_struct_ops_common_value common; \ struct _name data ____cacheline_aligned_in_smp; \ }; #include "bpf_struct_ops_types.h" @@ -113,11 +102,49 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = { BTF_ID_LIST(st_ops_ids) BTF_ID(struct, module) +BTF_ID(struct, bpf_struct_ops_common_value) enum { IDX_MODULE_ID, + IDX_ST_OPS_COMMON_VALUE_ID, }; +extern struct btf *btf_vmlinux; + +static bool is_valid_value_type(struct btf *btf, s32 value_id, + const struct btf_type *type, + const char *value_name) +{ + const struct btf_type *common_value_type; + const struct btf_member *member; + const struct btf_type *vt, *mt; + + vt = btf_type_by_id(btf, value_id); + if (btf_vlen(vt) != 2) { + pr_warn("The number of %s's members should be 2, but we get %d\n", + value_name, btf_vlen(vt)); + return false; + } + member = btf_type_member(vt); + mt = btf_type_by_id(btf, member->type); + common_value_type = btf_type_by_id(btf_vmlinux, + st_ops_ids[IDX_ST_OPS_COMMON_VALUE_ID]); + if (mt != common_value_type) { + pr_warn("The first member of %s should be bpf_struct_ops_common_value\n", + value_name); + return false; + } + member++; + mt = btf_type_by_id(btf, member->type); + if (mt != type) { + pr_warn("The second member of %s should be %s\n", + value_name, btf_name_by_offset(btf, type->name_off)); + return false; + } + + return true; +} + static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, struct btf *btf, struct bpf_verifier_log *log) @@ -138,14 +165,6 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, } sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name); - value_id = btf_find_by_name_kind(btf, value_name, - BTF_KIND_STRUCT); - if (value_id < 0) { - pr_warn("Cannot find struct %s in %s\n", - value_name, btf_get_name(btf)); - return; - } - type_id = btf_find_by_name_kind(btf, st_ops->name, BTF_KIND_STRUCT); if (type_id < 0) { @@ -160,6 +179,16 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, return; } + value_id = btf_find_by_name_kind(btf, value_name, + BTF_KIND_STRUCT); + if (value_id < 0) { + pr_warn("Cannot find struct %s in %s\n", + value_name, btf_get_name(btf)); + return; + } + if (!is_valid_value_type(btf, value_id, t, value_name)) + return; + for_each_member(i, t, member) { const struct btf_type *func_proto; @@ -219,8 +248,6 @@ void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) } } -extern struct btf *btf_vmlinux; - static const struct bpf_struct_ops_desc * bpf_struct_ops_find_value(struct btf *btf, u32 value_id) { @@ -276,7 +303,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key, kvalue = &st_map->kvalue; /* Pair with smp_store_release() during map_update */ - state = smp_load_acquire(&kvalue->state); + state = smp_load_acquire(&kvalue->common.state); if (state == BPF_STRUCT_OPS_STATE_INIT) { memset(value, 0, map->value_size); return 0; @@ -287,7 +314,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key, */ uvalue = value; memcpy(uvalue, st_map->uvalue, map->value_size); - uvalue->state = state; + uvalue->common.state = state; /* This value offers the user space a general estimate of how * many sockets are still utilizing this struct_ops for TCP @@ -295,7 +322,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key, * should sufficiently meet our present goals. */ refcnt = atomic64_read(&map->refcnt) - atomic64_read(&map->usercnt); - refcount_set(&uvalue->refcnt, max_t(s64, refcnt, 0)); + refcount_set(&uvalue->common.refcnt, max_t(s64, refcnt, 0)); return 0; } @@ -413,7 +440,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (err) return err; - if (uvalue->state || refcount_read(&uvalue->refcnt)) + if (uvalue->common.state || refcount_read(&uvalue->common.refcnt)) return -EINVAL; tlinks = kcalloc(BPF_TRAMP_MAX, sizeof(*tlinks), GFP_KERNEL); @@ -425,7 +452,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, mutex_lock(&st_map->lock); - if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) { + if (kvalue->common.state != BPF_STRUCT_OPS_STATE_INIT) { err = -EBUSY; goto unlock; } @@ -540,7 +567,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, * * Pair with smp_load_acquire() during lookup_elem(). */ - smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_READY); + smp_store_release(&kvalue->common.state, BPF_STRUCT_OPS_STATE_READY); goto unlock; } @@ -558,7 +585,7 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, * It ensures the above udata updates (e.g. prog->aux->id) * can be seen once BPF_STRUCT_OPS_STATE_INUSE is set. */ - smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE); + smp_store_release(&kvalue->common.state, BPF_STRUCT_OPS_STATE_INUSE); goto unlock; } @@ -588,7 +615,7 @@ static long bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key) if (st_map->map.map_flags & BPF_F_LINK) return -EOPNOTSUPP; - prev_state = cmpxchg(&st_map->kvalue.state, + prev_state = cmpxchg(&st_map->kvalue.common.state, BPF_STRUCT_OPS_STATE_INUSE, BPF_STRUCT_OPS_STATE_TOBEFREE); switch (prev_state) { @@ -848,7 +875,7 @@ static bool bpf_struct_ops_valid_to_reg(struct bpf_map *map) return map->map_type == BPF_MAP_TYPE_STRUCT_OPS && map->map_flags & BPF_F_LINK && /* Pair with smp_store_release() during map_update */ - smp_load_acquire(&st_map->kvalue.state) == BPF_STRUCT_OPS_STATE_READY; + smp_load_acquire(&st_map->kvalue.common.state) == BPF_STRUCT_OPS_STATE_READY; } static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link) -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit f6be98d19985411ca1f3d53413d94d5b7f41c200 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Replace the static list of struct_ops types with per-btf struct_ops_tab to enable dynamic registration. Both bpf_dummy_ops and bpf_tcp_ca now utilize the registration function instead of being listed in bpf_struct_ops_types.h. Cc: netdev@vger.kernel.org Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-12-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 27 +++++--- include/linux/btf.h | 12 ++++ kernel/bpf/bpf_struct_ops.c | 100 ++++-------------------------- kernel/bpf/bpf_struct_ops_types.h | 12 ---- kernel/bpf/btf.c | 86 +++++++++++++++++++++++-- net/bpf/bpf_dummy_struct_ops.c | 11 +++- net/ipv4/bpf_tcp_ca.c | 12 +++- 7 files changed, 142 insertions(+), 118 deletions(-) delete mode 100644 kernel/bpf/bpf_struct_ops_types.h diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6eba14c2eb93..0624b33be7a2 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1827,9 +1827,20 @@ struct bpf_struct_ops_common_value { }; #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) +/* This macro helps developer to register a struct_ops type and generate + * type information correctly. Developers should use this macro to register + * a struct_ops type instead of calling __register_bpf_struct_ops() directly. + */ +#define register_bpf_struct_ops(st_ops, type) \ + ({ \ + struct bpf_struct_ops_##type { \ + struct bpf_struct_ops_common_value common; \ + struct type data ____cacheline_aligned_in_smp; \ + }; \ + BTF_TYPE_EMIT(struct bpf_struct_ops_##type); \ + __register_bpf_struct_ops(st_ops); \ + }) #define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA)) -const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id); -void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log); bool bpf_struct_ops_get(const void *kdata); void bpf_struct_ops_put(const void *kdata); int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key, @@ -1871,16 +1882,12 @@ struct bpf_dummy_ops { int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); #endif +int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, + struct btf *btf, + struct bpf_verifier_log *log); void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map); #else -static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id) -{ - return NULL; -} -static inline void bpf_struct_ops_init(struct btf *btf, - struct bpf_verifier_log *log) -{ -} +#define register_bpf_struct_ops(st_ops, type) ({ (void *)(st_ops); 0; }) static inline bool bpf_try_module_get(const void *data, struct module *owner) { return try_module_get(owner); diff --git a/include/linux/btf.h b/include/linux/btf.h index 1d852dad7473..79f34103ab5d 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -497,6 +497,18 @@ static inline void *btf_id_set8_contains(const struct btf_id_set8 *set, u32 id) struct bpf_verifier_log; +#if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) +struct bpf_struct_ops; +int __register_bpf_struct_ops(struct bpf_struct_ops *st_ops); +const struct bpf_struct_ops_desc *bpf_struct_ops_find_value(struct btf *btf, u32 value_id); +const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id); +#else +static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id) +{ + return NULL; +} +#endif + #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); const char *btf_name_by_offset(const struct btf *btf, u32 offset); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 30ab34fab0f8..defc052e4622 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -62,35 +62,6 @@ static DEFINE_MUTEX(update_mutex); #define VALUE_PREFIX "bpf_struct_ops_" #define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1) -/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is - * the map's value exposed to the userspace and its btf-type-id is - * stored at the map->btf_vmlinux_value_type_id. - * - */ -#define BPF_STRUCT_OPS_TYPE(_name) \ -extern struct bpf_struct_ops bpf_##_name; \ - \ -struct bpf_struct_ops_##_name { \ - struct bpf_struct_ops_common_value common; \ - struct _name data ____cacheline_aligned_in_smp; \ -}; -#include "bpf_struct_ops_types.h" -#undef BPF_STRUCT_OPS_TYPE - -enum { -#define BPF_STRUCT_OPS_TYPE(_name) BPF_STRUCT_OPS_TYPE_##_name, -#include "bpf_struct_ops_types.h" -#undef BPF_STRUCT_OPS_TYPE - __NR_BPF_STRUCT_OPS_TYPE, -}; - -static struct bpf_struct_ops_desc bpf_struct_ops[] = { -#define BPF_STRUCT_OPS_TYPE(_name) \ - [BPF_STRUCT_OPS_TYPE_##_name] = { .st_ops = &bpf_##_name }, -#include "bpf_struct_ops_types.h" -#undef BPF_STRUCT_OPS_TYPE -}; - const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = { }; @@ -145,9 +116,9 @@ static bool is_valid_value_type(struct btf *btf, s32 value_id, return true; } -static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, - struct btf *btf, - struct bpf_verifier_log *log) +int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, + struct btf *btf, + struct bpf_verifier_log *log) { struct bpf_struct_ops *st_ops = st_ops_desc->st_ops; const struct btf_member *member; @@ -161,7 +132,7 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, sizeof(value_name)) { pr_warn("struct_ops name %s is too long\n", st_ops->name); - return; + return -EINVAL; } sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name); @@ -170,13 +141,13 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, if (type_id < 0) { pr_warn("Cannot find struct %s in %s\n", st_ops->name, btf_get_name(btf)); - return; + return -EINVAL; } t = btf_type_by_id(btf, type_id); if (btf_type_vlen(t) > BPF_STRUCT_OPS_MAX_NR_MEMBERS) { pr_warn("Cannot support #%u members in struct %s\n", btf_type_vlen(t), st_ops->name); - return; + return -EINVAL; } value_id = btf_find_by_name_kind(btf, value_name, @@ -184,10 +155,10 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, if (value_id < 0) { pr_warn("Cannot find struct %s in %s\n", value_name, btf_get_name(btf)); - return; + return -EINVAL; } if (!is_valid_value_type(btf, value_id, t, value_name)) - return; + return -EINVAL; for_each_member(i, t, member) { const struct btf_type *func_proto; @@ -196,13 +167,13 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, if (!*mname) { pr_warn("anon member in struct %s is not supported\n", st_ops->name); - break; + return -EOPNOTSUPP; } if (__btf_member_bitfield_size(t, member)) { pr_warn("bit field member %s in struct %s is not supported\n", mname, st_ops->name); - break; + return -EOPNOTSUPP; } func_proto = btf_type_resolve_func_ptr(btf, @@ -214,7 +185,7 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, &st_ops->func_models[i])) { pr_warn("Error in parsing func ptr %s in struct %s\n", mname, st_ops->name); - break; + return -EINVAL; } } @@ -222,6 +193,7 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, if (st_ops->init(btf)) { pr_warn("Error in init bpf_struct_ops %s\n", st_ops->name); + return -EINVAL; } else { st_ops_desc->type_id = type_id; st_ops_desc->type = t; @@ -230,54 +202,8 @@ static void bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, value_id); } } -} -void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log) -{ - struct bpf_struct_ops_desc *st_ops_desc; - u32 i; - - /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */ -#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name); -#include "bpf_struct_ops_types.h" -#undef BPF_STRUCT_OPS_TYPE - - for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { - st_ops_desc = &bpf_struct_ops[i]; - bpf_struct_ops_desc_init(st_ops_desc, btf, log); - } -} - -static const struct bpf_struct_ops_desc * -bpf_struct_ops_find_value(struct btf *btf, u32 value_id) -{ - unsigned int i; - - if (!value_id || !btf) - return NULL; - - for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { - if (bpf_struct_ops[i].value_id == value_id) - return &bpf_struct_ops[i]; - } - - return NULL; -} - -const struct bpf_struct_ops_desc * -bpf_struct_ops_find(struct btf *btf, u32 type_id) -{ - unsigned int i; - - if (!type_id || !btf) - return NULL; - - for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { - if (bpf_struct_ops[i].type_id == type_id) - return &bpf_struct_ops[i]; - } - - return NULL; + return 0; } static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key, diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h deleted file mode 100644 index 5678a9ddf817..000000000000 --- a/kernel/bpf/bpf_struct_ops_types.h +++ /dev/null @@ -1,12 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* internal file - do not include directly */ - -#ifdef CONFIG_BPF_JIT -#ifdef CONFIG_NET -BPF_STRUCT_OPS_TYPE(bpf_dummy_ops) -#endif -#ifdef CONFIG_INET -#include <net/tcp.h> -BPF_STRUCT_OPS_TYPE(tcp_congestion_ops) -#endif -#endif diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 4ecdd08aec7f..1cd7e8fbcc57 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -19,6 +19,7 @@ #include <linux/bpf_verifier.h> #include <linux/btf.h> #include <linux/btf_ids.h> +#include <linux/bpf.h> #include <linux/bpf_lsm.h> #include <linux/skmsg.h> #include <linux/perf_event.h> @@ -5801,8 +5802,6 @@ struct btf *btf_parse_vmlinux(void) /* btf_parse_vmlinux() runs under bpf_verifier_lock */ bpf_ctx_convert.t = btf_type_by_id(btf, bpf_ctx_convert_btf_id[0]); - bpf_struct_ops_init(btf, log); - refcount_set(&btf->refcnt, 1); err = btf_alloc_id(btf); @@ -8895,11 +8894,13 @@ bool btf_type_ids_nocast_alias(struct bpf_verifier_log *log, return !strncmp(reg_name, arg_name, cmp_len); } +#ifdef CONFIG_BPF_JIT static int -btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops) +btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops, + struct bpf_verifier_log *log) { struct btf_struct_ops_tab *tab, *new_tab; - int i; + int i, err; tab = btf->struct_ops_tab; if (!tab) { @@ -8929,7 +8930,84 @@ btf_add_struct_ops(struct btf *btf, struct bpf_struct_ops *st_ops) tab->ops[btf->struct_ops_tab->cnt].st_ops = st_ops; + err = bpf_struct_ops_desc_init(&tab->ops[btf->struct_ops_tab->cnt], btf, log); + if (err) + return err; + btf->struct_ops_tab->cnt++; return 0; } + +const struct bpf_struct_ops_desc * +bpf_struct_ops_find_value(struct btf *btf, u32 value_id) +{ + const struct bpf_struct_ops_desc *st_ops_list; + unsigned int i; + u32 cnt; + + if (!value_id) + return NULL; + if (!btf->struct_ops_tab) + return NULL; + + cnt = btf->struct_ops_tab->cnt; + st_ops_list = btf->struct_ops_tab->ops; + for (i = 0; i < cnt; i++) { + if (st_ops_list[i].value_id == value_id) + return &st_ops_list[i]; + } + + return NULL; +} + +const struct bpf_struct_ops_desc * +bpf_struct_ops_find(struct btf *btf, u32 type_id) +{ + const struct bpf_struct_ops_desc *st_ops_list; + unsigned int i; + u32 cnt; + + if (!type_id) + return NULL; + if (!btf->struct_ops_tab) + return NULL; + + cnt = btf->struct_ops_tab->cnt; + st_ops_list = btf->struct_ops_tab->ops; + for (i = 0; i < cnt; i++) { + if (st_ops_list[i].type_id == type_id) + return &st_ops_list[i]; + } + + return NULL; +} + +int __register_bpf_struct_ops(struct bpf_struct_ops *st_ops) +{ + struct bpf_verifier_log *log; + struct btf *btf; + int err = 0; + + btf = btf_get_module_btf(st_ops->owner); + if (!btf) + return -EINVAL; + + log = kzalloc(sizeof(*log), GFP_KERNEL | __GFP_NOWARN); + if (!log) { + err = -ENOMEM; + goto errout; + } + + log->level = BPF_LOG_KERNEL; + + err = btf_add_struct_ops(btf, st_ops, log); + +errout: + kfree(log); + btf_put(btf); + + return err; +} +EXPORT_SYMBOL_GPL(__register_bpf_struct_ops); +#endif diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c index ba2c58dba2da..02de71719aed 100644 --- a/net/bpf/bpf_dummy_struct_ops.c +++ b/net/bpf/bpf_dummy_struct_ops.c @@ -7,7 +7,7 @@ #include <linux/bpf.h> #include <linux/btf.h> -extern struct bpf_struct_ops bpf_bpf_dummy_ops; +static struct bpf_struct_ops bpf_bpf_dummy_ops; /* A common type for test_N with return value in bpf_dummy_ops */ typedef int (*dummy_ops_test_ret_fn)(struct bpf_dummy_ops_state *state, ...); @@ -256,7 +256,7 @@ static struct bpf_dummy_ops __bpf_bpf_dummy_ops = { .test_sleepable = bpf_dummy_test_sleepable, }; -struct bpf_struct_ops bpf_bpf_dummy_ops = { +static struct bpf_struct_ops bpf_bpf_dummy_ops = { .verifier_ops = &bpf_dummy_verifier_ops, .init = bpf_dummy_init, .check_member = bpf_dummy_ops_check_member, @@ -265,4 +265,11 @@ struct bpf_struct_ops bpf_bpf_dummy_ops = { .unreg = bpf_dummy_unreg, .name = "bpf_dummy_ops", .cfi_stubs = &__bpf_bpf_dummy_ops, + .owner = THIS_MODULE, }; + +static int __init bpf_dummy_struct_ops_init(void) +{ + return register_bpf_struct_ops(&bpf_bpf_dummy_ops, bpf_dummy_ops); +} +late_initcall(bpf_dummy_struct_ops_init); diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index dffd8828079b..8e7716256d3c 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -12,7 +12,7 @@ #include <net/bpf_sk_storage.h> /* "extern" is to avoid sparse warning. It is only used in bpf_struct_ops.c. */ -extern struct bpf_struct_ops bpf_tcp_congestion_ops; +static struct bpf_struct_ops bpf_tcp_congestion_ops; static u32 unsupported_ops[] = { offsetof(struct tcp_congestion_ops, get_info), @@ -345,7 +345,7 @@ static struct tcp_congestion_ops __bpf_ops_tcp_congestion_ops = { .release = __bpf_tcp_ca_release, }; -struct bpf_struct_ops bpf_tcp_congestion_ops = { +static struct bpf_struct_ops bpf_tcp_congestion_ops = { .verifier_ops = &bpf_tcp_ca_verifier_ops, .reg = bpf_tcp_ca_reg, .unreg = bpf_tcp_ca_unreg, @@ -356,10 +356,16 @@ struct bpf_struct_ops bpf_tcp_congestion_ops = { .validate = bpf_tcp_ca_validate, .name = "tcp_congestion_ops", .cfi_stubs = &__bpf_ops_tcp_congestion_ops, + .owner = THIS_MODULE, }; static int __init bpf_tcp_ca_kfunc_init(void) { - return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_tcp_ca_kfunc_set); + int ret; + + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_tcp_ca_kfunc_set); + ret = ret ?: register_bpf_struct_ops(&bpf_tcp_congestion_ops, tcp_congestion_ops); + + return ret; } late_initcall(bpf_tcp_ca_kfunc_init); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 9e926acda0c2e21bca431a1818665ddcd6939755 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Locate the module BTFs for struct_ops maps and progs and pass them to the kernel. This ensures that the kernel correctly resolves type IDs from the appropriate module BTFs. For the map of a struct_ops object, the FD of the module BTF is set to bpf_map to keep a reference to the module BTF. The FD is passed to the kernel as value_type_btf_obj_fd when the struct_ops object is loaded. For a bpf_struct_ops prog, attach_btf_obj_fd of bpf_prog is the FD of a module BTF in the kernel. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240119225005.668602-13-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/bpf.c | 4 +++- tools/lib/bpf/bpf.h | 4 +++- tools/lib/bpf/libbpf.c | 41 ++++++++++++++++++++++++++--------- tools/lib/bpf/libbpf_probes.c | 1 + 4 files changed, 38 insertions(+), 12 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index a5e21d9baf61..6a8b495b53f5 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -171,7 +171,8 @@ int bpf_map_create(enum bpf_map_type map_type, __u32 max_entries, const struct bpf_map_create_opts *opts) { - const size_t attr_sz = offsetofend(union bpf_attr, map_extra); + const size_t attr_sz = offsetofend(union bpf_attr, + value_type_btf_obj_fd); union bpf_attr attr; int fd; @@ -193,6 +194,7 @@ int bpf_map_create(enum bpf_map_type map_type, attr.btf_key_type_id = OPTS_GET(opts, btf_key_type_id, 0); attr.btf_value_type_id = OPTS_GET(opts, btf_value_type_id, 0); attr.btf_vmlinux_value_type_id = OPTS_GET(opts, btf_vmlinux_value_type_id, 0); + attr.value_type_btf_obj_fd = OPTS_GET(opts, value_type_btf_obj_fd, -1); attr.inner_map_fd = OPTS_GET(opts, inner_map_fd, 0); attr.map_flags = OPTS_GET(opts, map_flags, 0); diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 107fef748868..db0ff8ade19a 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -51,8 +51,10 @@ struct bpf_map_create_opts { __u32 numa_node; __u32 map_ifindex; + __s32 value_type_btf_obj_fd; + size_t:0; }; -#define bpf_map_create_opts__last_field map_ifindex +#define bpf_map_create_opts__last_field value_type_btf_obj_fd LIBBPF_API int bpf_map_create(enum bpf_map_type map_type, const char *map_name, diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index d55a75b48bbe..072f72c1b40c 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -523,6 +523,7 @@ struct bpf_map { struct bpf_map_def def; __u32 numa_node; __u32 btf_var_idx; + int mod_btf_fd; __u32 btf_key_type_id; __u32 btf_value_type_id; __u32 btf_vmlinux_value_type_id; @@ -923,22 +924,29 @@ find_member_by_name(const struct btf *btf, const struct btf_type *t, return NULL; } +static int find_ksym_btf_id(struct bpf_object *obj, const char *ksym_name, + __u16 kind, struct btf **res_btf, + struct module_btf **res_mod_btf); + #define STRUCT_OPS_VALUE_PREFIX "bpf_struct_ops_" static int find_btf_by_prefix_kind(const struct btf *btf, const char *prefix, const char *name, __u32 kind); static int -find_struct_ops_kern_types(const struct btf *btf, const char *tname, +find_struct_ops_kern_types(struct bpf_object *obj, const char *tname, + struct module_btf **mod_btf, const struct btf_type **type, __u32 *type_id, const struct btf_type **vtype, __u32 *vtype_id, const struct btf_member **data_member) { const struct btf_type *kern_type, *kern_vtype; const struct btf_member *kern_data_member; + struct btf *btf; __s32 kern_vtype_id, kern_type_id; __u32 i; - kern_type_id = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT); + kern_type_id = find_ksym_btf_id(obj, tname, BTF_KIND_STRUCT, + &btf, mod_btf); if (kern_type_id < 0) { pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n", tname); @@ -992,14 +1000,16 @@ static bool bpf_map__is_struct_ops(const struct bpf_map *map) } /* Init the map's fields that depend on kern_btf */ -static int bpf_map__init_kern_struct_ops(struct bpf_map *map, - const struct btf *btf, - const struct btf *kern_btf) +static int bpf_map__init_kern_struct_ops(struct bpf_map *map) { const struct btf_member *member, *kern_member, *kern_data_member; const struct btf_type *type, *kern_type, *kern_vtype; __u32 i, kern_type_id, kern_vtype_id, kern_data_off; + struct bpf_object *obj = map->obj; + const struct btf *btf = obj->btf; struct bpf_struct_ops *st_ops; + const struct btf *kern_btf; + struct module_btf *mod_btf; void *data, *kern_data; const char *tname; int err; @@ -1007,16 +1017,19 @@ static int bpf_map__init_kern_struct_ops(struct bpf_map *map, st_ops = map->st_ops; type = st_ops->type; tname = st_ops->tname; - err = find_struct_ops_kern_types(kern_btf, tname, + err = find_struct_ops_kern_types(obj, tname, &mod_btf, &kern_type, &kern_type_id, &kern_vtype, &kern_vtype_id, &kern_data_member); if (err) return err; + kern_btf = mod_btf ? mod_btf->btf : obj->btf_vmlinux; + pr_debug("struct_ops init_kern %s: type_id:%u kern_type_id:%u kern_vtype_id:%u\n", map->name, st_ops->type_id, kern_type_id, kern_vtype_id); + map->mod_btf_fd = mod_btf ? mod_btf->fd : -1; map->def.value_size = kern_vtype->size; map->btf_vmlinux_value_type_id = kern_vtype_id; @@ -1092,6 +1105,8 @@ static int bpf_map__init_kern_struct_ops(struct bpf_map *map, return -ENOTSUP; } + if (mod_btf) + prog->attach_btf_obj_fd = mod_btf->fd; prog->attach_btf_id = kern_type_id; prog->expected_attach_type = kern_member_idx; @@ -1134,8 +1149,7 @@ static int bpf_object__init_kern_struct_ops_maps(struct bpf_object *obj) if (!bpf_map__is_struct_ops(map)) continue; - err = bpf_map__init_kern_struct_ops(map, obj->btf, - obj->btf_vmlinux); + err = bpf_map__init_kern_struct_ops(map); if (err) return err; } @@ -5129,8 +5143,13 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b create_attr.numa_node = map->numa_node; create_attr.map_extra = map->map_extra; - if (bpf_map__is_struct_ops(map)) + if (bpf_map__is_struct_ops(map)) { create_attr.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id; + if (map->mod_btf_fd >= 0) { + create_attr.value_type_btf_obj_fd = map->mod_btf_fd; + create_attr.map_flags |= BPF_F_VTYPE_BTF_OBJ_FD; + } + } if (obj->btf && btf__fd(obj->btf) >= 0) { create_attr.btf_fd = btf__fd(obj->btf); @@ -9452,7 +9471,9 @@ static int libbpf_find_attach_btf_id(struct bpf_program *prog, const char *attac *btf_obj_fd = 0; *btf_type_id = 1; } else { - err = find_kernel_btf_id(prog->obj, attach_name, attach_type, btf_obj_fd, btf_type_id); + err = find_kernel_btf_id(prog->obj, attach_name, + attach_type, btf_obj_fd, + btf_type_id); } if (err) { pr_warn("prog '%s': failed to find kernel BTF type ID of '%s': %d\n", diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index 9c4db90b92b6..98373d126d9d 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -326,6 +326,7 @@ static int probe_map_create(enum bpf_map_type map_type) case BPF_MAP_TYPE_STRUCT_OPS: /* we'll get -ENOTSUPP for invalid BTF type ID for struct_ops */ opts.btf_vmlinux_value_type_id = 1; + opts.value_type_btf_obj_fd = -1; exp_err = -524; /* -ENOTSUPP */ break; case BPF_MAP_TYPE_BLOOM_FILTER: -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 7c81c2490c73e614c6d48e4f339f4f224140b565 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The module requires the use of btf_ctx_access() to invoke bpf_tracing_btf_ctx_access() from a module. This function is valuable for implementing validation functions that ensure proper access to ctx. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-14-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Conflicts: kernel/bpf/btf.c [Fix context conflicts.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/btf.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 1cd7e8fbcc57..db3e74481f65 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -6400,6 +6400,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, } return true; } +EXPORT_SYMBOL_GPL(btf_ctx_access); enum bpf_struct_walk_result { /* < 0 error */ -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 0253e0590e2dc46996534371d56b5297099aed4e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Create a new struct_ops type called bpf_testmod_ops within the bpf_testmod module. When a struct_ops object is registered, the bpf_testmod module will invoke test_2 from the module. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240119225005.668602-15-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 66 ++++++++++++++++ .../selftests/bpf/bpf_testmod/bpf_testmod.h | 5 ++ .../bpf/prog_tests/test_struct_ops_module.c | 75 +++++++++++++++++++ .../selftests/bpf/progs/struct_ops_module.c | 30 ++++++++ 4 files changed, 176 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_module.c diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 1b8589143bd3..ca863f436978 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2020 Facebook */ +#include <linux/bpf.h> #include <linux/btf.h> #include <linux/btf_ids.h> #include <linux/delay.h> @@ -516,11 +517,75 @@ BTF_ID_FLAGS(func, bpf_kfunc_call_test_static_unused_arg) BTF_ID_FLAGS(func, bpf_kfunc_call_test_offset) BTF_SET8_END(bpf_testmod_check_kfunc_ids) +static int bpf_testmod_ops_init(struct btf *btf) +{ + return 0; +} + +static bool bpf_testmod_ops_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + return bpf_tracing_btf_ctx_access(off, size, type, prog, info); +} + +static int bpf_testmod_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + return 0; +} + static const struct btf_kfunc_id_set bpf_testmod_kfunc_set = { .owner = THIS_MODULE, .set = &bpf_testmod_check_kfunc_ids, }; +static const struct bpf_verifier_ops bpf_testmod_verifier_ops = { + .is_valid_access = bpf_testmod_ops_is_valid_access, +}; + +static int bpf_dummy_reg(void *kdata) +{ + struct bpf_testmod_ops *ops = kdata; + int r; + + r = ops->test_2(4, 3); + + return 0; +} + +static void bpf_dummy_unreg(void *kdata) +{ +} + +static int bpf_testmod_test_1(void) +{ + return 0; +} + +static int bpf_testmod_test_2(int a, int b) +{ + return 0; +} + +static struct bpf_testmod_ops __bpf_testmod_ops = { + .test_1 = bpf_testmod_test_1, + .test_2 = bpf_testmod_test_2, +}; + +struct bpf_struct_ops bpf_bpf_testmod_ops = { + .verifier_ops = &bpf_testmod_verifier_ops, + .init = bpf_testmod_ops_init, + .init_member = bpf_testmod_ops_init_member, + .reg = bpf_dummy_reg, + .unreg = bpf_dummy_unreg, + .cfi_stubs = &__bpf_testmod_ops, + .name = "bpf_testmod_ops", + .owner = THIS_MODULE, +}; + extern int bpf_fentry_test1(int a); static int bpf_testmod_init(void) @@ -531,6 +596,7 @@ static int bpf_testmod_init(void) ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_testmod_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_testmod_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_testmod_kfunc_set); + ret = ret ?: register_bpf_struct_ops(&bpf_bpf_testmod_ops, bpf_testmod_ops); if (ret < 0) return ret; if (bpf_fentry_test1(0) < 0) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h index f32793efe095..ca5435751c79 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h @@ -28,4 +28,9 @@ struct bpf_iter_testmod_seq { int cnt; }; +struct bpf_testmod_ops { + int (*test_1)(void); + int (*test_2)(int a, int b); +}; + #endif /* _BPF_TESTMOD_H */ diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c new file mode 100644 index 000000000000..8d833f0c7580 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c @@ -0,0 +1,75 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include <test_progs.h> +#include <time.h> + +#include "struct_ops_module.skel.h" + +static void check_map_info(struct bpf_map_info *info) +{ + struct bpf_btf_info btf_info; + char btf_name[256]; + u32 btf_info_len = sizeof(btf_info); + int err, fd; + + fd = bpf_btf_get_fd_by_id(info->btf_vmlinux_id); + if (!ASSERT_GE(fd, 0, "get_value_type_btf_obj_fd")) + return; + + memset(&btf_info, 0, sizeof(btf_info)); + btf_info.name = ptr_to_u64(btf_name); + btf_info.name_len = sizeof(btf_name); + err = bpf_btf_get_info_by_fd(fd, &btf_info, &btf_info_len); + if (!ASSERT_OK(err, "get_value_type_btf_obj_info")) + goto cleanup; + + if (!ASSERT_EQ(strcmp(btf_name, "bpf_testmod"), 0, "get_value_type_btf_obj_name")) + goto cleanup; + +cleanup: + close(fd); +} + +static void test_struct_ops_load(void) +{ + DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts); + struct struct_ops_module *skel; + struct bpf_map_info info = {}; + struct bpf_link *link; + int err; + u32 len; + + skel = struct_ops_module__open_opts(&opts); + if (!ASSERT_OK_PTR(skel, "struct_ops_module_open")) + return; + + err = struct_ops_module__load(skel); + if (!ASSERT_OK(err, "struct_ops_module_load")) + goto cleanup; + + len = sizeof(info); + err = bpf_map_get_info_by_fd(bpf_map__fd(skel->maps.testmod_1), &info, + &len); + if (!ASSERT_OK(err, "bpf_map_get_info_by_fd")) + goto cleanup; + + link = bpf_map__attach_struct_ops(skel->maps.testmod_1); + ASSERT_OK_PTR(link, "attach_test_mod_1"); + + /* test_2() will be called from bpf_dummy_reg() in bpf_testmod.c */ + ASSERT_EQ(skel->bss->test_2_result, 7, "test_2_result"); + + bpf_link__destroy(link); + + check_map_info(&info); + +cleanup: + struct_ops_module__destroy(skel); +} + +void serial_test_struct_ops_module(void) +{ + if (test__start_subtest("test_struct_ops_load")) + test_struct_ops_load(); +} + diff --git a/tools/testing/selftests/bpf/progs/struct_ops_module.c b/tools/testing/selftests/bpf/progs/struct_ops_module.c new file mode 100644 index 000000000000..e44ac55195ca --- /dev/null +++ b/tools/testing/selftests/bpf/progs/struct_ops_module.c @@ -0,0 +1,30 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "../bpf_testmod/bpf_testmod.h" + +char _license[] SEC("license") = "GPL"; + +int test_2_result = 0; + +SEC("struct_ops/test_1") +int BPF_PROG(test_1) +{ + return 0xdeadbeef; +} + +SEC("struct_ops/test_2") +int BPF_PROG(test_2, int a, int b) +{ + test_2_result = a + b; + return a + b; +} + +SEC(".struct_ops.link") +struct bpf_testmod_ops testmod_1 = { + .test_1 = (void *)test_1, + .test_2 = (void *)test_2, +}; + -- 2.34.1
From: Martin KaFai Lau <martin.lau@kernel.org> mainline inclusion from mainline-v6.9-rc1 commit c9f115564561af63db662791e9a35fcf1dfefd2a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The commit 9e926acda0c2 ("libbpf: Find correct module BTFs for struct_ops maps and progs.") sets a newly added field (value_type_btf_obj_fd) to -1 in libbpf when the caller of the libbpf's bpf_map_create did not define this field by passing a NULL "opts" or passing in a "opts" that does not cover this new field. OPT_HAS(opts, field) is used to decide if the field is defined or not: ((opts) && opts->sz >= offsetofend(typeof(*(opts)), field)) Once OPTS_HAS decided the field is not defined, that field should be set to 0. For this particular new field (value_type_btf_obj_fd), its corresponding map_flags "BPF_F_VTYPE_BTF_OBJ_FD" is not set. Thus, the kernel does not treat it as an fd field. Fixes: 9e926acda0c2 ("libbpf: Find correct module BTFs for struct_ops maps and progs.") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240124224418.2905133-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/bpf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 6a8b495b53f5..04ccbd5b2d0d 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -194,7 +194,7 @@ int bpf_map_create(enum bpf_map_type map_type, attr.btf_key_type_id = OPTS_GET(opts, btf_key_type_id, 0); attr.btf_value_type_id = OPTS_GET(opts, btf_value_type_id, 0); attr.btf_vmlinux_value_type_id = OPTS_GET(opts, btf_vmlinux_value_type_id, 0); - attr.value_type_btf_obj_fd = OPTS_GET(opts, value_type_btf_obj_fd, -1); + attr.value_type_btf_obj_fd = OPTS_GET(opts, value_type_btf_obj_fd, 0); attr.inner_map_fd = OPTS_GET(opts, inner_map_fd, 0); attr.map_flags = OPTS_GET(opts, map_flags, 0); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit e6be8cd5d3cf54ccd0ae66027d6f4697b15f4c3e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- In bpf_struct_ops_map_alloc, it needs to check for NULL in the returned pointer of bpf_get_btf_vmlinux() when CONFIG_DEBUG_INFO_BTF is not set. ENOTSUPP is used to preserve the same behavior before the struct_ops kmod support. In the function check_struct_ops_btf_id(), instead of redoing the bpf_get_btf_vmlinux() that has already been done in syscall.c, the fix here is to check for prog->aux->attach_btf_id. BPF_PROG_TYPE_STRUCT_OPS must require attach_btf_id and syscall.c guarantees a valid attach_btf as long as attach_btf_id is set. When attach_btf_id is not set, this patch returns -ENOTSUPP because it is what the selftest in test_libbpf_probe_prog_types() and libbpf_probes.c are expecting for feature probing purpose. Changes from v1: - Remove an unnecessary NULL check in check_struct_ops_btf_id() Reported-by: syzbot+88f0aafe5f950d7489d7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/00000000000040d68a060fc8db8c@google.com/ Reported-by: syzbot+1336f3d4b10bcda75b89@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/00000000000026353b060fc21c07@google.com/ Fixes: fcc2c1fb0651 ("bpf: pass attached BTF to the bpf_struct_ops subsystem") Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240126023113.1379504-1-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_struct_ops.c | 2 ++ kernel/bpf/verifier.c | 5 ++++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index defc052e4622..0decd862dfe0 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -669,6 +669,8 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) btf = bpf_get_btf_vmlinux(); if (IS_ERR(btf)) return ERR_CAST(btf); + if (!btf) + return ERR_PTR(-ENOTSUPP); } st_ops_desc = bpf_struct_ops_find_value(btf, attr->btf_vmlinux_value_type_id); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 349beb1fdf75..167b9c6515c2 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20221,7 +20221,10 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) return -EINVAL; } - btf = prog->aux->attach_btf ?: bpf_get_btf_vmlinux(); + if (!prog->aux->attach_btf_id) + return -ENOTSUPP; + + btf = prog->aux->attach_btf; if (btf_is_module(btf)) { /* Make sure st_ops is valid through the lifetime of env */ env->attach_btf_mod = btf_try_get_module(btf); -- 2.34.1
From: Daniel Xu <dxu@dxuuu.xyz> mainline inclusion from mainline-v6.9-rc1 commit 79b47344bbc5a693a92ed6b2b09dac59254bfac8 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This commit adds support for flags on BTF_SET8s. struct btf_id_set8 already supported 32 bits worth of flags, but was only used for alignment purposes before. We now use these bits to encode flags. The first use case is tagging kfunc sets with a flag so that pahole can recognize which BTF_ID_FLAGS(func, ..) are actual kfuncs. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/7bb152ec76d6c2c930daec88e995bf18484a5ebb.170649139... Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf_ids.h | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h index a9cb10b0e2e9..dca09b7f21dc 100644 --- a/include/linux/btf_ids.h +++ b/include/linux/btf_ids.h @@ -21,6 +21,7 @@ struct btf_id_set8 { #include <linux/compiler.h> /* for __PASTE */ #include <linux/compiler_attributes.h> /* for __maybe_unused */ +#include <linux/stringify.h> /* * Following macros help to define lists of BTF IDs placed @@ -183,17 +184,18 @@ extern struct btf_id_set name; * .word (1 << 3) | (1 << 1) | (1 << 2) * */ -#define __BTF_SET8_START(name, scope) \ +#define __BTF_SET8_START(name, scope, flags) \ +__BTF_ID_LIST(name, local) \ asm( \ ".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ "." #scope " __BTF_ID__set8__" #name "; \n" \ "__BTF_ID__set8__" #name ":; \n" \ -".zero 8 \n" \ +".zero 4 \n" \ +".long " __stringify(flags) "\n" \ ".popsection; \n"); #define BTF_SET8_START(name) \ -__BTF_ID_LIST(name, local) \ -__BTF_SET8_START(name, local) +__BTF_SET8_START(name, local, 0) #define BTF_SET8_END(name) \ asm( \ -- 2.34.1
From: Daniel Xu <dxu@dxuuu.xyz> mainline inclusion from mainline-v6.9-rc1 commit a05e90427ef6706f59188b379ad6366b9d298bc5 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This macro pair is functionally equivalent to BTF_SET8_START/END, except with BTF_SET8_KFUNCS flag set in the btf_id_set8 flags field. The next commit will codemod all kfunc set8s to this new variant such that all kfuncs are tagged as such in .BTF_ids section. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/d536c57c7c2af428686853cc7396b7a44faa53b7.170649139... Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf_ids.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h index dca09b7f21dc..e24aabfe8ecc 100644 --- a/include/linux/btf_ids.h +++ b/include/linux/btf_ids.h @@ -8,6 +8,9 @@ struct btf_id_set { u32 ids[]; }; +/* This flag implies BTF_SET8 holds kfunc(s) */ +#define BTF_SET8_KFUNCS (1 << 0) + struct btf_id_set8 { u32 cnt; u32 flags; @@ -204,6 +207,12 @@ asm( \ ".popsection; \n"); \ extern struct btf_id_set8 name; +#define BTF_KFUNCS_START(name) \ +__BTF_SET8_START(name, local, BTF_SET8_KFUNCS) + +#define BTF_KFUNCS_END(name) \ +BTF_SET8_END(name) + #else #define BTF_ID_LIST(name) static u32 __maybe_unused name[64]; @@ -218,6 +227,8 @@ extern struct btf_id_set8 name; #define BTF_SET_END(name) #define BTF_SET8_START(name) static struct btf_id_set8 __maybe_unused name = { 0 }; #define BTF_SET8_END(name) +#define BTF_KFUNCS_START(name) static struct btf_id_set8 __maybe_unused name = { .flags = BTF_SET8_KFUNCS }; +#define BTF_KFUNCS_END(name) #endif /* CONFIG_DEBUG_INFO_BTF */ -- 2.34.1
From: Daniel Xu <dxu@dxuuu.xyz> mainline inclusion from mainline-v6.9-rc1 commit 6f3189f38a3e995232e028a4c341164c4aca1b20 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows: 1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs 2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_KFUNCS sets 3. At runtime, vmlinux or module BTF is made available in sysfs 4. At runtime, bpftool (or similar) can look at provided BTF and generate appropriate prototypes for functions with "bpf_kfunc" tag To ensure future kfunc are similarly tagged, we now also return error inside kfunc registration for untagged kfuncs. For vmlinux kfuncs, we also WARN(), as initcall machinery does not handle errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.170649139... Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: fs/verity/measure.c kernel/bpf/cpumask.c kernel/bpf/helpers.c kernel/cgroup/rstat.c kernel/trace/bpf_trace.c net/core/filter.c net/core/xdp.c tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c arch/arm64/kernel/bpf-rvi.c block/blk-cgroup.c drivers/hid/bpf/hid_bpf_dispatch.c fs/proc/stat.c kernel/bpf-rvi/common_kfuncs.c kernel/cgroup/cpuset.c kernel/sched/bpf_sched.c kernel/sched/cpuacct.c [Fix conflicts because of missing some kfuns or context diff. And USE KFUNCS for self-dev kfuncs.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Documentation/bpf/kfuncs.rst | 8 ++++---- arch/arm64/kernel/bpf-rvi.c | 4 ++-- block/blk-cgroup.c | 4 ++-- drivers/hid/bpf/hid_bpf_dispatch.c | 12 +++++------ fs/proc/stat.c | 4 ++-- kernel/bpf-rvi/common_kfuncs.c | 4 ++-- kernel/bpf/btf.c | 8 ++++++++ kernel/bpf/cpumask.c | 4 ++-- kernel/bpf/helpers.c | 8 ++++---- kernel/bpf/map_iter.c | 4 ++-- kernel/cgroup/cpuset.c | 4 ++-- kernel/cgroup/rstat.c | 4 ++-- kernel/sched/bpf_sched.c | 8 ++++---- kernel/sched/cpuacct.c | 4 ++-- kernel/trace/bpf_trace.c | 4 ++-- net/bpf/test_run.c | 8 ++++---- net/core/filter.c | 20 +++++++++---------- net/core/xdp.c | 4 ++-- net/ipv4/bpf_tcp_ca.c | 4 ++-- net/ipv4/fou_bpf.c | 4 ++-- net/ipv4/tcp_bbr.c | 4 ++-- net/ipv4/tcp_cubic.c | 4 ++-- net/ipv4/tcp_dctcp.c | 4 ++-- net/netfilter/nf_conntrack_bpf.c | 4 ++-- net/netfilter/nf_nat_bpf.c | 4 ++-- net/xfrm/xfrm_interface_bpf.c | 4 ++-- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 8 ++++---- 27 files changed, 82 insertions(+), 74 deletions(-) diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst index 723408e399ab..18920610ab7c 100644 --- a/Documentation/bpf/kfuncs.rst +++ b/Documentation/bpf/kfuncs.rst @@ -153,10 +153,10 @@ In addition to kfuncs' arguments, verifier may need more information about the type of kfunc(s) being registered with the BPF subsystem. To do so, we define flags on a set of kfuncs as follows:: - BTF_SET8_START(bpf_task_set) + BTF_KFUNCS_START(bpf_task_set) BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE) - BTF_SET8_END(bpf_task_set) + BTF_KFUNCS_END(bpf_task_set) This set encodes the BTF ID of each kfunc listed above, and encodes the flags along with it. Ofcourse, it is also allowed to specify no flags. @@ -323,10 +323,10 @@ Once the kfunc is prepared for use, the final step to making it visible is registering it with the BPF subsystem. Registration is done per BPF program type. An example is shown below:: - BTF_SET8_START(bpf_task_set) + BTF_KFUNCS_START(bpf_task_set) BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE) - BTF_SET8_END(bpf_task_set) + BTF_KFUNCS_END(bpf_task_set) static const struct btf_kfunc_id_set bpf_task_kfunc_set = { .owner = THIS_MODULE, diff --git a/arch/arm64/kernel/bpf-rvi.c b/arch/arm64/kernel/bpf-rvi.c index 1ff460b46c85..026822aab85e 100644 --- a/arch/arm64/kernel/bpf-rvi.c +++ b/arch/arm64/kernel/bpf-rvi.c @@ -45,10 +45,10 @@ __bpf_kfunc const char *bpf_arch_flags(enum arch_flags_type t, int i) } } -BTF_SET8_START(bpf_arm64_kfunc_ids) +BTF_KFUNCS_START(bpf_arm64_kfunc_ids) BTF_ID_FLAGS(func, bpf_arm64_cpu_have_feature, KF_RCU) BTF_ID_FLAGS(func, bpf_arch_flags) -BTF_SET8_END(bpf_arm64_kfunc_ids) +BTF_KFUNCS_END(bpf_arm64_kfunc_ids) static const struct btf_kfunc_id_set bpf_arm64_kfunc_set = { .owner = THIS_MODULE, diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index eadf78bdcc40..ed387265f996 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -2342,9 +2342,9 @@ __bpf_kfunc void bpf_blkcg_get_dev_iostat(struct blkcg *blkcg, int major, int mi rcu_read_unlock(); } -BTF_SET8_START(bpf_blkcg_kfunc_ids) +BTF_KFUNCS_START(bpf_blkcg_kfunc_ids) BTF_ID_FLAGS(func, bpf_blkcg_get_dev_iostat) -BTF_SET8_END(bpf_blkcg_kfunc_ids) +BTF_KFUNCS_END(bpf_blkcg_kfunc_ids) static const struct btf_kfunc_id_set bpf_blkcg_kfunc_set = { .owner = THIS_MODULE, diff --git a/drivers/hid/bpf/hid_bpf_dispatch.c b/drivers/hid/bpf/hid_bpf_dispatch.c index 7903c8638e81..5fa53d0bea58 100644 --- a/drivers/hid/bpf/hid_bpf_dispatch.c +++ b/drivers/hid/bpf/hid_bpf_dispatch.c @@ -172,9 +172,9 @@ hid_bpf_get_data(struct hid_bpf_ctx *ctx, unsigned int offset, const size_t rdwr * The following set contains all functions we agree BPF programs * can use. */ -BTF_SET8_START(hid_bpf_kfunc_ids) +BTF_KFUNCS_START(hid_bpf_kfunc_ids) BTF_ID_FLAGS(func, hid_bpf_get_data, KF_RET_NULL) -BTF_SET8_END(hid_bpf_kfunc_ids) +BTF_KFUNCS_END(hid_bpf_kfunc_ids) static const struct btf_kfunc_id_set hid_bpf_kfunc_set = { .owner = THIS_MODULE, @@ -467,11 +467,11 @@ hid_bpf_hw_request(struct hid_bpf_ctx *ctx, __u8 *buf, size_t buf__sz, } /* our HID-BPF entrypoints */ -BTF_SET8_START(hid_bpf_fmodret_ids) +BTF_KFUNCS_START(hid_bpf_fmodret_ids) BTF_ID_FLAGS(func, hid_bpf_device_event) BTF_ID_FLAGS(func, hid_bpf_rdesc_fixup) BTF_ID_FLAGS(func, __hid_bpf_tail_call) -BTF_SET8_END(hid_bpf_fmodret_ids) +BTF_KFUNCS_END(hid_bpf_fmodret_ids) static const struct btf_kfunc_id_set hid_bpf_fmodret_set = { .owner = THIS_MODULE, @@ -479,12 +479,12 @@ static const struct btf_kfunc_id_set hid_bpf_fmodret_set = { }; /* for syscall HID-BPF */ -BTF_SET8_START(hid_bpf_syscall_kfunc_ids) +BTF_KFUNCS_START(hid_bpf_syscall_kfunc_ids) BTF_ID_FLAGS(func, hid_bpf_attach_prog) BTF_ID_FLAGS(func, hid_bpf_allocate_context, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, hid_bpf_release_context, KF_RELEASE) BTF_ID_FLAGS(func, hid_bpf_hw_request) -BTF_SET8_END(hid_bpf_syscall_kfunc_ids) +BTF_KFUNCS_END(hid_bpf_syscall_kfunc_ids) static const struct btf_kfunc_id_set hid_bpf_syscall_kfunc_set = { .owner = THIS_MODULE, diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 4757e5b1be38..dc21a2c9a25d 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -239,11 +239,11 @@ __bpf_kfunc void bpf_show_all_irqs(struct seq_file *p) show_all_irqs(p); } -BTF_SET8_START(bpf_proc_stat_kfunc_ids) +BTF_KFUNCS_START(bpf_proc_stat_kfunc_ids) BTF_ID_FLAGS(func, bpf_get_idle_time) BTF_ID_FLAGS(func, bpf_get_iowait_time) BTF_ID_FLAGS(func, bpf_show_all_irqs) -BTF_SET8_END(bpf_proc_stat_kfunc_ids) +BTF_KFUNCS_END(bpf_proc_stat_kfunc_ids) static const struct btf_kfunc_id_set bpf_proc_stat_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/bpf-rvi/common_kfuncs.c b/kernel/bpf-rvi/common_kfuncs.c index b0f33fd0201d..d790a7b65da6 100644 --- a/kernel/bpf-rvi/common_kfuncs.c +++ b/kernel/bpf-rvi/common_kfuncs.c @@ -280,7 +280,7 @@ __bpf_kfunc void bpf_x86_direct_pages(unsigned long *p) } #endif -BTF_SET8_START(bpf_common_kfuncs_ids) +BTF_KFUNCS_START(bpf_common_kfuncs_ids) BTF_ID_FLAGS(func, bpf_mem_cgroup_from_task, KF_RET_NULL | KF_RCU) BTF_ID_FLAGS(func, bpf_task_active_pid_ns, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_pidns_nr_tasks) @@ -309,7 +309,7 @@ BTF_ID_FLAGS(func, bpf_mem_committed) BTF_ID_FLAGS(func, bpf_mem_vmalloc_used) BTF_ID_FLAGS(func, bpf_mem_vmalloc_total) BTF_ID_FLAGS(func, bpf_x86_direct_pages, KF_TRUSTED_ARGS) -BTF_SET8_END(bpf_common_kfuncs_ids) +BTF_KFUNCS_END(bpf_common_kfuncs_ids) static const struct btf_kfunc_id_set bpf_common_kfuncs_set = { .owner = THIS_MODULE, diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index db3e74481f65..2f8cfd27bd9e 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -8230,6 +8230,14 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type, { enum btf_kfunc_hook hook; + /* All kfuncs need to be tagged as such in BTF. + * WARN() for initcall registrations that do not check errors. + */ + if (!(kset->set->flags & BTF_SET8_KFUNCS)) { + WARN_ON(!kset->owner); + return -EINVAL; + } + hook = bpf_prog_type_to_kfunc_hook(prog_type); return __register_btf_kfunc_id_set(hook, kset); } diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c index 6acecc8ebd61..317f74895370 100644 --- a/kernel/bpf/cpumask.c +++ b/kernel/bpf/cpumask.c @@ -413,7 +413,7 @@ __bpf_kfunc u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, __bpf_kfunc_end_defs(); -BTF_SET8_START(cpumask_kfunc_btf_ids) +BTF_KFUNCS_START(cpumask_kfunc_btf_ids) BTF_ID_FLAGS(func, bpf_cpumask_create, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_cpumask_release, KF_RELEASE) BTF_ID_FLAGS(func, bpf_cpumask_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS) @@ -438,7 +438,7 @@ BTF_ID_FLAGS(func, bpf_cpumask_full, KF_RCU) BTF_ID_FLAGS(func, bpf_cpumask_copy, KF_RCU) BTF_ID_FLAGS(func, bpf_cpumask_any_distribute, KF_RCU) BTF_ID_FLAGS(func, bpf_cpumask_any_and_distribute, KF_RCU) -BTF_SET8_END(cpumask_kfunc_btf_ids) +BTF_KFUNCS_END(cpumask_kfunc_btf_ids) static const struct btf_kfunc_id_set cpumask_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 0edec71be0a7..7b79779126a0 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2627,7 +2627,7 @@ __bpf_kfunc void bpf_rcu_read_unlock(void) __bpf_kfunc_end_defs(); -BTF_SET8_START(generic_btf_ids) +BTF_KFUNCS_START(generic_btf_ids) #ifdef CONFIG_KEXEC_CORE BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE) #endif @@ -2655,7 +2655,7 @@ BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL) #ifdef CONFIG_BPF_RVI BTF_ID_FLAGS(func, bpf_current_level1_reaper, KF_ACQUIRE | KF_RET_NULL) #endif -BTF_SET8_END(generic_btf_ids) +BTF_KFUNCS_END(generic_btf_ids) static const struct btf_kfunc_id_set generic_kfunc_set = { .owner = THIS_MODULE, @@ -2671,7 +2671,7 @@ BTF_ID(struct, cgroup) BTF_ID(func, bpf_cgroup_release_dtor) #endif -BTF_SET8_START(common_btf_ids) +BTF_KFUNCS_START(common_btf_ids) BTF_ID_FLAGS(func, bpf_cast_to_kern_ctx) BTF_ID_FLAGS(func, bpf_rdonly_cast) BTF_ID_FLAGS(func, bpf_rcu_read_lock) @@ -2700,7 +2700,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) BTF_ID_FLAGS(func, bpf_dynptr_size) BTF_ID_FLAGS(func, bpf_dynptr_clone) -BTF_SET8_END(common_btf_ids) +BTF_KFUNCS_END(common_btf_ids) static const struct btf_kfunc_id_set common_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c index 6abd7c5df4b3..9575314f40a6 100644 --- a/kernel/bpf/map_iter.c +++ b/kernel/bpf/map_iter.c @@ -213,9 +213,9 @@ __bpf_kfunc s64 bpf_map_sum_elem_count(const struct bpf_map *map) __bpf_kfunc_end_defs(); -BTF_SET8_START(bpf_map_iter_kfunc_ids) +BTF_KFUNCS_START(bpf_map_iter_kfunc_ids) BTF_ID_FLAGS(func, bpf_map_sum_elem_count, KF_TRUSTED_ARGS) -BTF_SET8_END(bpf_map_iter_kfunc_ids) +BTF_KFUNCS_END(bpf_map_iter_kfunc_ids) static const struct btf_kfunc_id_set bpf_map_iter_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index dd8d502fec45..5222bf813c91 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -5250,10 +5250,10 @@ __bpf_kfunc unsigned int bpf_cpumask_weight(struct cpumask *pmask) return cpumask_weight(pmask); } -BTF_SET8_START(bpf_cpuset_kfunc_ids) +BTF_KFUNCS_START(bpf_cpuset_kfunc_ids) BTF_ID_FLAGS(func, bpf_cpuset_from_task, KF_RET_NULL | KF_RCU) BTF_ID_FLAGS(func, bpf_cpumask_weight) -BTF_SET8_END(bpf_cpuset_kfunc_ids) +BTF_KFUNCS_END(bpf_cpuset_kfunc_ids) static const struct btf_kfunc_id_set bpf_cpuset_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index c6d6bec9bcce..b589868cc85c 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -534,13 +534,13 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq) } /* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */ -BTF_SET8_START(bpf_rstat_kfunc_ids) +BTF_KFUNCS_START(bpf_rstat_kfunc_ids) BTF_ID_FLAGS(func, cgroup_rstat_updated) BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE) #if defined(CONFIG_BPF_RVI) && !defined(CONFIG_PREEMPT_RT) BTF_ID_FLAGS(func, cgroup_rstat_flush_atomic) #endif -BTF_SET8_END(bpf_rstat_kfunc_ids) +BTF_KFUNCS_END(bpf_rstat_kfunc_ids) static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c index 6849a58439b2..1039f20198aa 100644 --- a/kernel/sched/bpf_sched.c +++ b/kernel/sched/bpf_sched.c @@ -167,12 +167,12 @@ __bpf_kfunc s32 bpf_sched_cpu_stats_of(int cpuid, __diag_pop(); -BTF_SET8_START(sched_cpustats_kfunc_btf_ids) +BTF_KFUNCS_START(sched_cpustats_kfunc_btf_ids) BTF_ID_FLAGS(func, bpf_sched_cpustats_create, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_sched_cpustats_release, KF_RELEASE) BTF_ID_FLAGS(func, bpf_sched_cpustats_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_sched_cpu_stats_of, KF_RCU) -BTF_SET8_END(sched_cpustats_kfunc_btf_ids) +BTF_KFUNCS_END(sched_cpustats_kfunc_btf_ids) static const struct btf_kfunc_id_set cpustats_kfunc_set = { .owner = THIS_MODULE, @@ -221,12 +221,12 @@ __bpf_kfunc int bpf_sched_set_task_prefer_nid(struct task_struct *task, int nid) return set_prefer_cpus_ptr(task, cpumask_of_node(nid)); } -BTF_SET8_START(sched_task_kfunc_btf_ids) +BTF_KFUNCS_START(sched_task_kfunc_btf_ids) BTF_ID_FLAGS(func, bpf_sched_entity_is_task) BTF_ID_FLAGS(func, bpf_sched_entity_to_task) BTF_ID_FLAGS(func, bpf_sched_tag_of_entity) BTF_ID_FLAGS(func, bpf_sched_set_task_prefer_nid) -BTF_SET8_END(sched_task_kfunc_btf_ids) +BTF_KFUNCS_END(sched_task_kfunc_btf_ids) static const struct btf_kfunc_id_set sched_task_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 801f3df9734c..c479a4b9ce77 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -427,10 +427,10 @@ __bpf_kfunc void bpf_cpuacct_kcpustat_cpu_fetch(struct kernel_cpustat *dst, memcpy(dst, per_cpu_ptr(ca->cpustat, cpu), sizeof(struct kernel_cpustat)); } -BTF_SET8_START(bpf_cpuacct_kfunc_ids) +BTF_KFUNCS_START(bpf_cpuacct_kfunc_ids) BTF_ID_FLAGS(func, bpf_task_ca_cpuusage) BTF_ID_FLAGS(func, bpf_cpuacct_kcpustat_cpu_fetch) -BTF_SET8_END(bpf_cpuacct_kfunc_ids) +BTF_KFUNCS_END(bpf_cpuacct_kfunc_ids) static const struct btf_kfunc_id_set bpf_cpuacct_kfunc_set = { .owner = THIS_MODULE, diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 357c7febd8ba..fe21203dbc31 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1408,14 +1408,14 @@ __bpf_kfunc int bpf_verify_pkcs7_signature(struct bpf_dynptr_kern *data_ptr, __bpf_kfunc_end_defs(); -BTF_SET8_START(key_sig_kfunc_set) +BTF_KFUNCS_START(key_sig_kfunc_set) BTF_ID_FLAGS(func, bpf_lookup_user_key, KF_ACQUIRE | KF_RET_NULL | KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_lookup_system_key, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_key_put, KF_RELEASE) #ifdef CONFIG_SYSTEM_DATA_VERIFICATION BTF_ID_FLAGS(func, bpf_verify_pkcs7_signature, KF_SLEEPABLE) #endif -BTF_SET8_END(key_sig_kfunc_set) +BTF_KFUNCS_END(key_sig_kfunc_set) static const struct btf_kfunc_id_set bpf_key_sig_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 63a875c638b6..a343143c7888 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -618,21 +618,21 @@ CFI_NOSEAL(bpf_kfunc_call_memb_release_dtor); __bpf_kfunc_end_defs(); -BTF_SET8_START(bpf_test_modify_return_ids) +BTF_KFUNCS_START(bpf_test_modify_return_ids) BTF_ID_FLAGS(func, bpf_modify_return_test) BTF_ID_FLAGS(func, bpf_modify_return_test2) BTF_ID_FLAGS(func, bpf_fentry_test1, KF_SLEEPABLE) -BTF_SET8_END(bpf_test_modify_return_ids) +BTF_KFUNCS_END(bpf_test_modify_return_ids) static const struct btf_kfunc_id_set bpf_test_modify_return_set = { .owner = THIS_MODULE, .set = &bpf_test_modify_return_ids, }; -BTF_SET8_START(test_sk_check_kfunc_ids) +BTF_KFUNCS_START(test_sk_check_kfunc_ids) BTF_ID_FLAGS(func, bpf_kfunc_call_test_release, KF_RELEASE) BTF_ID_FLAGS(func, bpf_kfunc_call_memb_release, KF_RELEASE) -BTF_SET8_END(test_sk_check_kfunc_ids) +BTF_KFUNCS_END(test_sk_check_kfunc_ids) static void *bpf_test_init(const union bpf_attr *kattr, u32 user_size, u32 size, u32 headroom, u32 tailroom) diff --git a/net/core/filter.c b/net/core/filter.c index fe54ad9b0e8f..d5a88790db2f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -12272,27 +12272,27 @@ int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags, return 0; } -BTF_SET8_START(bpf_kfunc_check_set_skb) +BTF_KFUNCS_START(bpf_kfunc_check_set_skb) BTF_ID_FLAGS(func, bpf_dynptr_from_skb) -BTF_SET8_END(bpf_kfunc_check_set_skb) +BTF_KFUNCS_END(bpf_kfunc_check_set_skb) -BTF_SET8_START(bpf_kfunc_check_set_xdp) +BTF_KFUNCS_START(bpf_kfunc_check_set_xdp) BTF_ID_FLAGS(func, bpf_dynptr_from_xdp) -BTF_SET8_END(bpf_kfunc_check_set_xdp) +BTF_KFUNCS_END(bpf_kfunc_check_set_xdp) -BTF_SET8_START(bpf_kfunc_check_set_sock_addr) +BTF_KFUNCS_START(bpf_kfunc_check_set_sock_addr) BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path) -BTF_SET8_END(bpf_kfunc_check_set_sock_addr) +BTF_KFUNCS_END(bpf_kfunc_check_set_sock_addr) #ifdef CONFIG_HISOCK -BTF_SET8_START(bpf_kfunc_check_set_hisock) +BTF_KFUNCS_START(bpf_kfunc_check_set_hisock) BTF_ID_FLAGS(func, bpf_set_ingress_dst) BTF_ID_FLAGS(func, bpf_get_skb_ethhdr) BTF_ID_FLAGS(func, bpf_set_ingress_dev) BTF_ID_FLAGS(func, bpf_set_egress_dev) BTF_ID_FLAGS(func, bpf_handle_ingress_ptype) BTF_ID_FLAGS(func, bpf_handle_egress_ptype) -BTF_SET8_END(bpf_kfunc_check_set_hisock) +BTF_KFUNCS_END(bpf_kfunc_check_set_hisock) #endif static const struct btf_kfunc_id_set bpf_kfunc_set_skb = { @@ -12376,9 +12376,9 @@ __bpf_kfunc int bpf_sock_destroy(struct sock_common *sock) __bpf_kfunc_end_defs(); -BTF_SET8_START(bpf_sk_iter_kfunc_ids) +BTF_KFUNCS_START(bpf_sk_iter_kfunc_ids) BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS) -BTF_SET8_END(bpf_sk_iter_kfunc_ids) +BTF_KFUNCS_END(bpf_sk_iter_kfunc_ids) static int tracing_iter_filter(const struct bpf_prog *prog, u32 kfunc_id) { diff --git a/net/core/xdp.c b/net/core/xdp.c index 1642222e350b..39738e00e732 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -734,11 +734,11 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash, __bpf_kfunc_end_defs(); -BTF_SET8_START(xdp_metadata_kfunc_ids) +BTF_KFUNCS_START(xdp_metadata_kfunc_ids) #define XDP_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, KF_TRUSTED_ARGS) XDP_METADATA_KFUNC_xxx #undef XDP_METADATA_KFUNC -BTF_SET8_END(xdp_metadata_kfunc_ids) +BTF_KFUNCS_END(xdp_metadata_kfunc_ids) static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index 8e7716256d3c..5d83d7b058fa 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -201,13 +201,13 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id, } } -BTF_SET8_START(bpf_tcp_ca_check_kfunc_ids) +BTF_KFUNCS_START(bpf_tcp_ca_check_kfunc_ids) BTF_ID_FLAGS(func, tcp_reno_ssthresh) BTF_ID_FLAGS(func, tcp_reno_cong_avoid) BTF_ID_FLAGS(func, tcp_reno_undo_cwnd) BTF_ID_FLAGS(func, tcp_slow_start) BTF_ID_FLAGS(func, tcp_cong_avoid_ai) -BTF_SET8_END(bpf_tcp_ca_check_kfunc_ids) +BTF_KFUNCS_END(bpf_tcp_ca_check_kfunc_ids) static const struct btf_kfunc_id_set bpf_tcp_ca_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/ipv4/fou_bpf.c b/net/ipv4/fou_bpf.c index 4da03bf45c9b..06e5572f296f 100644 --- a/net/ipv4/fou_bpf.c +++ b/net/ipv4/fou_bpf.c @@ -100,10 +100,10 @@ __bpf_kfunc int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx, __bpf_kfunc_end_defs(); -BTF_SET8_START(fou_kfunc_set) +BTF_KFUNCS_START(fou_kfunc_set) BTF_ID_FLAGS(func, bpf_skb_set_fou_encap) BTF_ID_FLAGS(func, bpf_skb_get_fou_encap) -BTF_SET8_END(fou_kfunc_set) +BTF_KFUNCS_END(fou_kfunc_set) static const struct btf_kfunc_id_set fou_bpf_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 146792cd26fe..56bb7a9621ab 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -1154,7 +1154,7 @@ static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = { .set_state = bbr_set_state, }; -BTF_SET8_START(tcp_bbr_check_kfunc_ids) +BTF_KFUNCS_START(tcp_bbr_check_kfunc_ids) #ifdef CONFIG_X86 #ifdef CONFIG_DYNAMIC_FTRACE BTF_ID_FLAGS(func, bbr_init) @@ -1167,7 +1167,7 @@ BTF_ID_FLAGS(func, bbr_min_tso_segs) BTF_ID_FLAGS(func, bbr_set_state) #endif #endif -BTF_SET8_END(tcp_bbr_check_kfunc_ids) +BTF_KFUNCS_END(tcp_bbr_check_kfunc_ids) static const struct btf_kfunc_id_set tcp_bbr_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c index 5ff7be13deb6..103e93f7a9f1 100644 --- a/net/ipv4/tcp_cubic.c +++ b/net/ipv4/tcp_cubic.c @@ -487,7 +487,7 @@ static struct tcp_congestion_ops cubictcp __read_mostly = { .name = "cubic", }; -BTF_SET8_START(tcp_cubic_check_kfunc_ids) +BTF_KFUNCS_START(tcp_cubic_check_kfunc_ids) #ifdef CONFIG_X86 #ifdef CONFIG_DYNAMIC_FTRACE BTF_ID_FLAGS(func, cubictcp_init) @@ -498,7 +498,7 @@ BTF_ID_FLAGS(func, cubictcp_cwnd_event) BTF_ID_FLAGS(func, cubictcp_acked) #endif #endif -BTF_SET8_END(tcp_cubic_check_kfunc_ids) +BTF_KFUNCS_END(tcp_cubic_check_kfunc_ids) static const struct btf_kfunc_id_set tcp_cubic_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c index 8ad62713b0ba..b004280855f8 100644 --- a/net/ipv4/tcp_dctcp.c +++ b/net/ipv4/tcp_dctcp.c @@ -271,7 +271,7 @@ static struct tcp_congestion_ops dctcp_reno __read_mostly = { .name = "dctcp-reno", }; -BTF_SET8_START(tcp_dctcp_check_kfunc_ids) +BTF_KFUNCS_START(tcp_dctcp_check_kfunc_ids) #ifdef CONFIG_X86 #ifdef CONFIG_DYNAMIC_FTRACE BTF_ID_FLAGS(func, dctcp_init) @@ -282,7 +282,7 @@ BTF_ID_FLAGS(func, dctcp_cwnd_undo) BTF_ID_FLAGS(func, dctcp_state) #endif #endif -BTF_SET8_END(tcp_dctcp_check_kfunc_ids) +BTF_KFUNCS_END(tcp_dctcp_check_kfunc_ids) static const struct btf_kfunc_id_set tcp_dctcp_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/netfilter/nf_conntrack_bpf.c b/net/netfilter/nf_conntrack_bpf.c index 475358ec8212..d2492d050fe6 100644 --- a/net/netfilter/nf_conntrack_bpf.c +++ b/net/netfilter/nf_conntrack_bpf.c @@ -467,7 +467,7 @@ __bpf_kfunc int bpf_ct_change_status(struct nf_conn *nfct, u32 status) __bpf_kfunc_end_defs(); -BTF_SET8_START(nf_ct_kfunc_set) +BTF_KFUNCS_START(nf_ct_kfunc_set) BTF_ID_FLAGS(func, bpf_xdp_ct_alloc, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_xdp_ct_lookup, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_skb_ct_alloc, KF_ACQUIRE | KF_RET_NULL) @@ -478,7 +478,7 @@ BTF_ID_FLAGS(func, bpf_ct_set_timeout, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_ct_change_timeout, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_ct_set_status, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_ct_change_status, KF_TRUSTED_ARGS) -BTF_SET8_END(nf_ct_kfunc_set) +BTF_KFUNCS_END(nf_ct_kfunc_set) static const struct btf_kfunc_id_set nf_conntrack_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/netfilter/nf_nat_bpf.c b/net/netfilter/nf_nat_bpf.c index 6e3b2f58855f..481be15609b1 100644 --- a/net/netfilter/nf_nat_bpf.c +++ b/net/netfilter/nf_nat_bpf.c @@ -54,9 +54,9 @@ __bpf_kfunc int bpf_ct_set_nat_info(struct nf_conn___init *nfct, __bpf_kfunc_end_defs(); -BTF_SET8_START(nf_nat_kfunc_set) +BTF_KFUNCS_START(nf_nat_kfunc_set) BTF_ID_FLAGS(func, bpf_ct_set_nat_info, KF_TRUSTED_ARGS) -BTF_SET8_END(nf_nat_kfunc_set) +BTF_KFUNCS_END(nf_nat_kfunc_set) static const struct btf_kfunc_id_set nf_bpf_nat_kfunc_set = { .owner = THIS_MODULE, diff --git a/net/xfrm/xfrm_interface_bpf.c b/net/xfrm/xfrm_interface_bpf.c index 7d5e920141e9..5ea15037ebd1 100644 --- a/net/xfrm/xfrm_interface_bpf.c +++ b/net/xfrm/xfrm_interface_bpf.c @@ -93,10 +93,10 @@ __bpf_kfunc int bpf_skb_set_xfrm_info(struct __sk_buff *skb_ctx, const struct bp __bpf_kfunc_end_defs(); -BTF_SET8_START(xfrm_ifc_kfunc_set) +BTF_KFUNCS_START(xfrm_ifc_kfunc_set) BTF_ID_FLAGS(func, bpf_skb_get_xfrm_info) BTF_ID_FLAGS(func, bpf_skb_set_xfrm_info) -BTF_SET8_END(xfrm_ifc_kfunc_set) +BTF_KFUNCS_END(xfrm_ifc_kfunc_set) static const struct btf_kfunc_id_set xfrm_interface_kfunc_set = { .owner = THIS_MODULE, diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index ca863f436978..b4133a1b7a72 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -339,11 +339,11 @@ static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = { .write = bpf_testmod_test_write, }; -BTF_SET8_START(bpf_testmod_common_kfunc_ids) +BTF_KFUNCS_START(bpf_testmod_common_kfunc_ids) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_destroy, KF_ITER_DESTROY) -BTF_SET8_END(bpf_testmod_common_kfunc_ids) +BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids) static const struct btf_kfunc_id_set bpf_testmod_common_kfunc_set = { .owner = THIS_MODULE, @@ -489,7 +489,7 @@ __bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused return arg; } -BTF_SET8_START(bpf_testmod_check_kfunc_ids) +BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids) BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc) BTF_ID_FLAGS(func, bpf_kfunc_call_test1) BTF_ID_FLAGS(func, bpf_kfunc_call_test2) @@ -515,7 +515,7 @@ BTF_ID_FLAGS(func, bpf_kfunc_call_test_ref, KF_TRUSTED_ARGS | KF_RCU) BTF_ID_FLAGS(func, bpf_kfunc_call_test_destructive, KF_DESTRUCTIVE) BTF_ID_FLAGS(func, bpf_kfunc_call_test_static_unused_arg) BTF_ID_FLAGS(func, bpf_kfunc_call_test_offset) -BTF_SET8_END(bpf_testmod_check_kfunc_ids) +BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids) static int bpf_testmod_ops_init(struct btf *btf) { -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.9-rc1 commit c81a8ab196b5083d5109a51585fcc24fa2055a77 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Seems like original commit adding split BTF support intended to add btf__new_split() API, and even declared it in libbpf.map, but never added (trivial) implementation. Fix this. Fixes: ba451366bf44 ("libbpf: Implement basic split BTF support") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240201172027.604869-4-andrii@kernel.org Conflicts: tools/lib/bpf/libbpf.map [The conflicts were due to some minor issue] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 5 +++++ tools/lib/bpf/libbpf.map | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 2e9f28cece3f..2fe479fa0c77 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -919,6 +919,11 @@ struct btf *btf__new(const void *data, __u32 size) return libbpf_ptr(btf_new(data, size, NULL)); } +struct btf *btf__new_split(const void *data, __u32 size, struct btf *base_btf) +{ + return libbpf_ptr(btf_new(data, size, base_btf)); +} + static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, struct btf_ext **btf_ext) { diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index 228ab00a5e69..f802a77d10b6 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -246,7 +246,6 @@ LIBBPF_0.3.0 { btf__parse_raw_split; btf__parse_split; btf__new_empty_split; - btf__new_split; ring_buffer__epoll_fd; } LIBBPF_0.2.0; @@ -395,6 +394,7 @@ LIBBPF_1.2.0 { LIBBPF_1.3.0 { global: + btf__new_split; bpf_obj_pin_opts; bpf_object__unpin; bpf_prog_detach_opts; -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit df9705eaa0bad034dad0f73386ff82f5c4dd7e24 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The "i" here is always equal to "btf_type_vlen(t)" since the "for_each_member()" loop never breaks. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_struct_ops.c | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 0decd862dfe0..f98f580de77a 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -189,20 +189,17 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, } } - if (i == btf_type_vlen(t)) { - if (st_ops->init(btf)) { - pr_warn("Error in init bpf_struct_ops %s\n", - st_ops->name); - return -EINVAL; - } else { - st_ops_desc->type_id = type_id; - st_ops_desc->type = t; - st_ops_desc->value_id = value_id; - st_ops_desc->value_type = btf_type_by_id(btf, - value_id); - } + if (st_ops->init(btf)) { + pr_warn("Error in init bpf_struct_ops %s\n", + st_ops->name); + return -EINVAL; } + st_ops_desc->type_id = type_id; + st_ops_desc->type = t; + st_ops_desc->value_id = value_id; + st_ops_desc->value_type = btf_type_by_id(btf, value_id); + return 0; } -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 169e650069647325496e5a65408ff301874c8e01 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- "r" is used to receive the return value of test_2 in bpf_testmod.c, but it is not actually used. So, we remove "r" and change the return type to "void". Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401300557.z5vzn8FM-lkp@intel.com/ Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240204061204.1864529-1-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 6 ++---- tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h | 2 +- tools/testing/selftests/bpf/progs/struct_ops_module.c | 3 +-- 3 files changed, 4 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index b4133a1b7a72..2e1b5d8f0fa2 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -549,9 +549,8 @@ static const struct bpf_verifier_ops bpf_testmod_verifier_ops = { static int bpf_dummy_reg(void *kdata) { struct bpf_testmod_ops *ops = kdata; - int r; - r = ops->test_2(4, 3); + ops->test_2(4, 3); return 0; } @@ -565,9 +564,8 @@ static int bpf_testmod_test_1(void) return 0; } -static int bpf_testmod_test_2(int a, int b) +static void bpf_testmod_test_2(int a, int b) { - return 0; } static struct bpf_testmod_ops __bpf_testmod_ops = { diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h index ca5435751c79..537beca42896 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h @@ -30,7 +30,7 @@ struct bpf_iter_testmod_seq { struct bpf_testmod_ops { int (*test_1)(void); - int (*test_2)(int a, int b); + void (*test_2)(int a, int b); }; #endif /* _BPF_TESTMOD_H */ diff --git a/tools/testing/selftests/bpf/progs/struct_ops_module.c b/tools/testing/selftests/bpf/progs/struct_ops_module.c index e44ac55195ca..b78746b3cef3 100644 --- a/tools/testing/selftests/bpf/progs/struct_ops_module.c +++ b/tools/testing/selftests/bpf/progs/struct_ops_module.c @@ -16,10 +16,9 @@ int BPF_PROG(test_1) } SEC("struct_ops/test_2") -int BPF_PROG(test_2, int a, int b) +void BPF_PROG(test_2, int a, int b) { test_2_result = a + b; - return a + b; } SEC(".struct_ops.link") -- 2.34.1
From: Kumar Kartikeya Dwivedi <memxor@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit a44b1334aadd82203f661adb9adb41e53ad0e8d1 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Currently, calling any helpers, kfuncs, or subprogs except the graph data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock is not allowed. One of the original motivations of this decision was to force the BPF programmer's hand into keeping the bpf_spin_lock critical section small, and to ensure the execution time of the program does not increase due to lock waiting times. In addition to this, some of the helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock. However, when it comes to subprog calls, atleast for static subprogs, the verifier is able to explore their instructions during verification. Therefore, it is similar in effect to having the same code inlined into the critical section. Hence, not allowing static subprog calls in the bpf_spin_lock critical section is mostly an annoyance that needs to be worked around, without providing any tangible benefit. Unlike static subprog calls, global subprog calls are not safe to permit within the critical section, as the verifier does not explore them during verification, therefore whether the same lock will be taken again, or unlocked, cannot be ascertained. Therefore, allow calling static subprogs within a bpf_spin_lock critical section, and only reject it in case the subprog linkage is global. Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: kernel/bpf/verifier.c [Fix context conflicts because of missing BPF exceptions.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/verifier.c | 11 ++++++++--- .../testing/selftests/bpf/progs/verifier_spin_lock.c | 2 +- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 167b9c6515c2..2aa16265169e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -9490,6 +9490,13 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, if (subprog_is_global(env, subprog)) { const char *sub_name = subprog_name(env, subprog); + /* Only global subprogs cannot be called with a lock held. */ + if (env->cur_state->active_lock.ptr) { + verbose(env, "global function calls are not allowed while holding a lock,\n" + "use static function instead\n"); + return -EINVAL; + } + if (err) { verbose(env, "Caller passes invalid args into func#%d ('%s')\n", subprog, sub_name); @@ -17528,7 +17535,6 @@ static int do_check(struct bpf_verifier_env *env) if (env->cur_state->active_lock.ptr) { if ((insn->src_reg == BPF_REG_0 && insn->imm != BPF_FUNC_spin_unlock) || - (insn->src_reg == BPF_PSEUDO_CALL) || (insn->src_reg == BPF_PSEUDO_KFUNC_CALL && (insn->off != 0 || !is_bpf_graph_api_kfunc(insn->imm)))) { verbose(env, "function calls are not allowed while holding a lock\n"); @@ -17571,8 +17577,7 @@ static int do_check(struct bpf_verifier_env *env) return -EINVAL; } - if (env->cur_state->active_lock.ptr && - !in_rbtree_lock_required_cb(env)) { + if (env->cur_state->active_lock.ptr && !env->cur_state->curframe) { verbose(env, "bpf_spin_unlock is missing\n"); return -EINVAL; } diff --git a/tools/testing/selftests/bpf/progs/verifier_spin_lock.c b/tools/testing/selftests/bpf/progs/verifier_spin_lock.c index 9c1aa69650f8..fb316c080c84 100644 --- a/tools/testing/selftests/bpf/progs/verifier_spin_lock.c +++ b/tools/testing/selftests/bpf/progs/verifier_spin_lock.c @@ -330,7 +330,7 @@ l1_%=: r7 = r0; \ SEC("cgroup/skb") __description("spin_lock: test10 lock in subprog without unlock") -__failure __msg("unlock is missing") +__success __failure_unpriv __msg_unpriv("") __naked void lock_in_subprog_without_unlock(void) { -- 2.34.1
From: Kumar Kartikeya Dwivedi <memxor@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit e8699c4ff85baedcf40f33db816cc487cee39397 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Add selftests for static subprog calls within bpf_spin_lock critical section, and ensure we still reject global subprog calls. Also test the case where a subprog call will unlock the caller's held lock, or the caller will unlock a lock taken by a subprog call, ensuring correct transfer of lock state across frames on exit. Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240204222349.938118-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/spin_lock.c | 2 + .../selftests/bpf/progs/test_spin_lock.c | 65 +++++++++++++++++++ .../selftests/bpf/progs/test_spin_lock_fail.c | 44 +++++++++++++ 3 files changed, 111 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/spin_lock.c b/tools/testing/selftests/bpf/prog_tests/spin_lock.c index f29c08d93beb..11814d6b81e4 100644 --- a/tools/testing/selftests/bpf/prog_tests/spin_lock.c +++ b/tools/testing/selftests/bpf/prog_tests/spin_lock.c @@ -48,6 +48,8 @@ static struct { { "lock_id_mismatch_innermapval_kptr", "bpf_spin_unlock of different lock" }, { "lock_id_mismatch_innermapval_global", "bpf_spin_unlock of different lock" }, { "lock_id_mismatch_innermapval_mapval", "bpf_spin_unlock of different lock" }, + { "lock_global_subprog_call1", "global function calls are not allowed while holding a lock" }, + { "lock_global_subprog_call2", "global function calls are not allowed while holding a lock" }, }; static int match_regex(const char *pattern, const char *string) diff --git a/tools/testing/selftests/bpf/progs/test_spin_lock.c b/tools/testing/selftests/bpf/progs/test_spin_lock.c index b2440a0ff422..d8d77bdffd3d 100644 --- a/tools/testing/selftests/bpf/progs/test_spin_lock.c +++ b/tools/testing/selftests/bpf/progs/test_spin_lock.c @@ -101,4 +101,69 @@ int bpf_spin_lock_test(struct __sk_buff *skb) err: return err; } + +struct bpf_spin_lock lockA __hidden SEC(".data.A"); + +__noinline +static int static_subprog(struct __sk_buff *ctx) +{ + volatile int ret = 0; + + if (ctx->protocol) + return ret; + return ret + ctx->len; +} + +__noinline +static int static_subprog_lock(struct __sk_buff *ctx) +{ + volatile int ret = 0; + + ret = static_subprog(ctx); + bpf_spin_lock(&lockA); + return ret + ctx->len; +} + +__noinline +static int static_subprog_unlock(struct __sk_buff *ctx) +{ + volatile int ret = 0; + + ret = static_subprog(ctx); + bpf_spin_unlock(&lockA); + return ret + ctx->len; +} + +SEC("tc") +int lock_static_subprog_call(struct __sk_buff *ctx) +{ + int ret = 0; + + bpf_spin_lock(&lockA); + if (ctx->mark == 42) + ret = static_subprog(ctx); + bpf_spin_unlock(&lockA); + return ret; +} + +SEC("tc") +int lock_static_subprog_lock(struct __sk_buff *ctx) +{ + int ret = 0; + + ret = static_subprog_lock(ctx); + bpf_spin_unlock(&lockA); + return ret; +} + +SEC("tc") +int lock_static_subprog_unlock(struct __sk_buff *ctx) +{ + int ret = 0; + + bpf_spin_lock(&lockA); + ret = static_subprog_unlock(ctx); + return ret; +} + char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c b/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c index 293ac1049d38..1c8b678e2e9a 100644 --- a/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c +++ b/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c @@ -201,4 +201,48 @@ CHECK(innermapval_mapval, &iv->lock, &v->lock); #undef CHECK +__noinline +int global_subprog(struct __sk_buff *ctx) +{ + volatile int ret = 0; + + if (ctx->protocol) + ret += ctx->protocol; + return ret + ctx->mark; +} + +__noinline +static int static_subprog_call_global(struct __sk_buff *ctx) +{ + volatile int ret = 0; + + if (ctx->protocol) + return ret; + return ret + ctx->len + global_subprog(ctx); +} + +SEC("?tc") +int lock_global_subprog_call1(struct __sk_buff *ctx) +{ + int ret = 0; + + bpf_spin_lock(&lockA); + if (ctx->mark == 42) + ret = global_subprog(ctx); + bpf_spin_unlock(&lockA); + return ret; +} + +SEC("?tc") +int lock_global_subprog_call2(struct __sk_buff *ctx) +{ + int ret = 0; + + bpf_spin_lock(&lockA); + if (ctx->mark == 42) + ret = static_subprog_call_global(ctx); + bpf_spin_unlock(&lockA); + return ret; +} + char _license[] SEC("license") = "GPL"; -- 2.34.1
From: Kumar Kartikeya Dwivedi <memxor@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 6fceea0fa59f6786a2847a4cae409117624e8b58 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Allow transferring an imbalanced RCU lock state between subprog calls during verification. This allows patterns where a subprog call returns with an RCU lock held, or a subprog call releases an RCU lock held by the caller. Currently, the verifier would end up complaining if the RCU lock is not released when processing an exit from a subprog, which is non-ideal if its execution is supposed to be enclosed in an RCU read section of the caller. Instead, simply only check whether we are processing exit for frame#0 and do not complain on an active RCU lock otherwise. We only need to update the check when processing BPF_EXIT insn, as copy_verifier_state is already set up to do the right thing. Suggested-by: David Vernet <void@manifault.com> Tested-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/verifier.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 2aa16265169e..0b7d82a68002 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -17582,8 +17582,7 @@ static int do_check(struct bpf_verifier_env *env) return -EINVAL; } - if (env->cur_state->active_rcu_lock && - !in_rbtree_lock_required_cb(env)) { + if (env->cur_state->active_rcu_lock && !env->cur_state->curframe) { verbose(env, "bpf_rcu_read_unlock is missing\n"); return -EINVAL; } -- 2.34.1
From: Kumar Kartikeya Dwivedi <memxor@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 8be6a0147af314fd60db9da2158cd737dc6394a7 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Add selftests covering the following cases: - A static or global subprog called from within a RCU read section works - A static subprog taking an RCU read lock which is released in caller works - A static subprog releasing the caller's RCU read lock works Global subprogs that leave the lock in an imbalanced state will not work, as they are verified separately, so ensure those cases fail as well. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20240205055646.1112186-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/rcu_read_lock.c | 6 + .../selftests/bpf/progs/rcu_read_lock.c | 120 ++++++++++++++++++ 2 files changed, 126 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/rcu_read_lock.c b/tools/testing/selftests/bpf/prog_tests/rcu_read_lock.c index 3f1f58d3a729..a1f7e7378a64 100644 --- a/tools/testing/selftests/bpf/prog_tests/rcu_read_lock.c +++ b/tools/testing/selftests/bpf/prog_tests/rcu_read_lock.c @@ -29,6 +29,10 @@ static void test_success(void) bpf_program__set_autoload(skel->progs.non_sleepable_1, true); bpf_program__set_autoload(skel->progs.non_sleepable_2, true); bpf_program__set_autoload(skel->progs.task_trusted_non_rcuptr, true); + bpf_program__set_autoload(skel->progs.rcu_read_lock_subprog, true); + bpf_program__set_autoload(skel->progs.rcu_read_lock_global_subprog, true); + bpf_program__set_autoload(skel->progs.rcu_read_lock_subprog_lock, true); + bpf_program__set_autoload(skel->progs.rcu_read_lock_subprog_unlock, true); err = rcu_read_lock__load(skel); if (!ASSERT_OK(err, "skel_load")) goto out; @@ -75,6 +79,8 @@ static const char * const inproper_region_tests[] = { "inproper_sleepable_helper", "inproper_sleepable_kfunc", "nested_rcu_region", + "rcu_read_lock_global_subprog_lock", + "rcu_read_lock_global_subprog_unlock", }; static void test_inproper_region(void) diff --git a/tools/testing/selftests/bpf/progs/rcu_read_lock.c b/tools/testing/selftests/bpf/progs/rcu_read_lock.c index 14fb01437fb8..ab3a532b7dd6 100644 --- a/tools/testing/selftests/bpf/progs/rcu_read_lock.c +++ b/tools/testing/selftests/bpf/progs/rcu_read_lock.c @@ -319,3 +319,123 @@ int cross_rcu_region(void *ctx) bpf_rcu_read_unlock(); return 0; } + +__noinline +static int static_subprog(void *ctx) +{ + volatile int ret = 0; + + if (bpf_get_prandom_u32()) + return ret + 42; + return ret + bpf_get_prandom_u32(); +} + +__noinline +int global_subprog(u64 a) +{ + volatile int ret = a; + + return ret + static_subprog(NULL); +} + +__noinline +static int static_subprog_lock(void *ctx) +{ + volatile int ret = 0; + + bpf_rcu_read_lock(); + if (bpf_get_prandom_u32()) + return ret + 42; + return ret + bpf_get_prandom_u32(); +} + +__noinline +int global_subprog_lock(u64 a) +{ + volatile int ret = a; + + return ret + static_subprog_lock(NULL); +} + +__noinline +static int static_subprog_unlock(void *ctx) +{ + volatile int ret = 0; + + bpf_rcu_read_unlock(); + if (bpf_get_prandom_u32()) + return ret + 42; + return ret + bpf_get_prandom_u32(); +} + +__noinline +int global_subprog_unlock(u64 a) +{ + volatile int ret = a; + + return ret + static_subprog_unlock(NULL); +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +int rcu_read_lock_subprog(void *ctx) +{ + volatile int ret = 0; + + bpf_rcu_read_lock(); + if (bpf_get_prandom_u32()) + ret += static_subprog(ctx); + bpf_rcu_read_unlock(); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +int rcu_read_lock_global_subprog(void *ctx) +{ + volatile int ret = 0; + + bpf_rcu_read_lock(); + if (bpf_get_prandom_u32()) + ret += global_subprog(ret); + bpf_rcu_read_unlock(); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +int rcu_read_lock_subprog_lock(void *ctx) +{ + volatile int ret = 0; + + ret += static_subprog_lock(ctx); + bpf_rcu_read_unlock(); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +int rcu_read_lock_global_subprog_lock(void *ctx) +{ + volatile int ret = 0; + + ret += global_subprog_lock(ret); + bpf_rcu_read_unlock(); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +int rcu_read_lock_subprog_unlock(void *ctx) +{ + volatile int ret = 0; + + bpf_rcu_read_lock(); + ret += static_subprog_unlock(ctx); + return 0; +} + +SEC("?fentry.s/" SYS_PREFIX "sys_getpgid") +int rcu_read_lock_global_subprog_unlock(void *ctx) +{ + volatile int ret = 0; + + bpf_rcu_read_lock(); + ret += global_subprog_unlock(ret); + return 0; +} -- 2.34.1
From: Geliang Tang <tanggeliang@kylinos.cn> mainline inclusion from mainline-v6.9-rc1 commit 947e56f82fd783a1ec1c9359b20b5699d09cae14 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Similar to the handling in the functions __register_btf_kfunc_id_set() and register_btf_id_dtor_kfuncs(), this patch uses the newly added helper check_btf_kconfigs() to handle module with its btf section stripped. While at it, the patch also adds the missed IS_ERR() check to fix the commit f6be98d19985 ("bpf, net: switch to dynamic registration") Fixes: f6be98d19985 ("bpf, net: switch to dynamic registration") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Link: https://lore.kernel.org/r/69082b9835463fe36f9e354bddf2d0a97df39c2b.170737330... Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/btf.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 2f8cfd27bd9e..ddde130786f7 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -9000,7 +9000,9 @@ int __register_bpf_struct_ops(struct bpf_struct_ops *st_ops) btf = btf_get_module_btf(st_ops->owner); if (!btf) - return -EINVAL; + return check_btf_kconfigs(st_ops->owner, "struct_ops"); + if (IS_ERR(btf)) + return PTR_ERR(btf); log = kzalloc(sizeof(*log), GFP_KERNEL | __GFP_NOWARN); if (!log) { -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 77c0208e199ccb0986fb3612f2409c8cdcb036ad category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Enable the providers to use types defined in a module instead of in the kernel (btf_vmlinux). Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 1 + kernel/bpf/btf.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0624b33be7a2..986fef28f5b4 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1488,6 +1488,7 @@ struct bpf_jit_poke_descriptor { struct bpf_ctx_arg_aux { u32 offset; enum bpf_reg_type reg_type; + struct btf *btf; u32 btf_id; }; diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index ddde130786f7..6fad9bb2281e 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -6302,7 +6302,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, } info->reg_type = ctx_arg_info->reg_type; - info->btf = btf_vmlinux; + info->btf = ctx_arg_info->btf ? : btf_vmlinux; info->btf_id = ctx_arg_info->btf_id; return true; } -- 2.34.1
From: Luo Gengkun <luogengkun2@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- Fix kabi-breakage by using KABI_BROKEN_INSERT. Fixes: 0d52d64ea93d ("bpf: add btf pointer to struct bpf_ctx_arg_aux.") Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 986fef28f5b4..5313a8a42420 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1488,8 +1488,8 @@ struct bpf_jit_poke_descriptor { struct bpf_ctx_arg_aux { u32 offset; enum bpf_reg_type reg_type; - struct btf *btf; u32 btf_id; + KABI_BROKEN_INSERT(struct btf *btf) }; struct btf_mod_pair { -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 6115a0aeef01aef152ad7738393aad11422bfb82 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Move __kfunc_param_match_suffix() to btf.c and rename it as btf_param_match_suffix(). It can be reused by bpf_struct_ops later. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-3-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Conflicts: kernel/bpf/verifier.c [Fix conflicts because of missing some funcs.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf.h | 4 ++++ kernel/bpf/btf.c | 18 ++++++++++++++++++ kernel/bpf/verifier.c | 36 +++++++++--------------------------- 3 files changed, 31 insertions(+), 27 deletions(-) diff --git a/include/linux/btf.h b/include/linux/btf.h index 79f34103ab5d..c9e3e0786648 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -495,6 +495,10 @@ static inline void *btf_id_set8_contains(const struct btf_id_set8 *set, u32 id) return bsearch(&id, set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func); } +bool btf_param_match_suffix(const struct btf *btf, + const struct btf_param *arg, + const char *suffix); + struct bpf_verifier_log; #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 6fad9bb2281e..aa02f1dc9029 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -9022,3 +9022,21 @@ int __register_bpf_struct_ops(struct bpf_struct_ops *st_ops) } EXPORT_SYMBOL_GPL(__register_bpf_struct_ops); #endif + +bool btf_param_match_suffix(const struct btf *btf, + const struct btf_param *arg, + const char *suffix) +{ + int suffix_len = strlen(suffix), len; + const char *param_name; + + /* In the future, this can be ported to use BTF tagging */ + param_name = btf_name_by_offset(btf, arg->name_off); + if (str_is_empty(param_name)) + return false; + len = strlen(param_name); + if (len <= suffix_len) + return false; + param_name += len - suffix_len; + return !strncmp(param_name, suffix, suffix_len); +} diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0b7d82a68002..d3cd3d121450 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -10634,24 +10634,6 @@ static bool is_kfunc_rcu_protected(struct bpf_kfunc_call_arg_meta *meta) return meta->kfunc_flags & KF_RCU_PROTECTED; } -static bool __kfunc_param_match_suffix(const struct btf *btf, - const struct btf_param *arg, - const char *suffix) -{ - int suffix_len = strlen(suffix), len; - const char *param_name; - - /* In the future, this can be ported to use BTF tagging */ - param_name = btf_name_by_offset(btf, arg->name_off); - if (str_is_empty(param_name)) - return false; - len = strlen(param_name); - if (len < suffix_len) - return false; - param_name += len - suffix_len; - return !strncmp(param_name, suffix, suffix_len); -} - static bool is_kfunc_arg_mem_size(const struct btf *btf, const struct btf_param *arg, const struct bpf_reg_state *reg) @@ -10662,7 +10644,7 @@ static bool is_kfunc_arg_mem_size(const struct btf *btf, if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE) return false; - return __kfunc_param_match_suffix(btf, arg, "__sz"); + return btf_param_match_suffix(btf, arg, "__sz"); } static bool is_kfunc_arg_const_mem_size(const struct btf *btf, @@ -10675,42 +10657,42 @@ static bool is_kfunc_arg_const_mem_size(const struct btf *btf, if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE) return false; - return __kfunc_param_match_suffix(btf, arg, "__szk"); + return btf_param_match_suffix(btf, arg, "__szk"); } static bool is_kfunc_arg_optional(const struct btf *btf, const struct btf_param *arg) { - return __kfunc_param_match_suffix(btf, arg, "__opt"); + return btf_param_match_suffix(btf, arg, "__opt"); } static bool is_kfunc_arg_constant(const struct btf *btf, const struct btf_param *arg) { - return __kfunc_param_match_suffix(btf, arg, "__k"); + return btf_param_match_suffix(btf, arg, "__k"); } static bool is_kfunc_arg_ignore(const struct btf *btf, const struct btf_param *arg) { - return __kfunc_param_match_suffix(btf, arg, "__ign"); + return btf_param_match_suffix(btf, arg, "__ign"); } static bool is_kfunc_arg_alloc_obj(const struct btf *btf, const struct btf_param *arg) { - return __kfunc_param_match_suffix(btf, arg, "__alloc"); + return btf_param_match_suffix(btf, arg, "__alloc"); } static bool is_kfunc_arg_uninit(const struct btf *btf, const struct btf_param *arg) { - return __kfunc_param_match_suffix(btf, arg, "__uninit"); + return btf_param_match_suffix(btf, arg, "__uninit"); } static bool is_kfunc_arg_refcounted_kptr(const struct btf *btf, const struct btf_param *arg) { - return __kfunc_param_match_suffix(btf, arg, "__refcounted_kptr"); + return btf_param_match_suffix(btf, arg, "__refcounted_kptr"); } static bool is_kfunc_arg_nullable(const struct btf *btf, const struct btf_param *arg) { - return __kfunc_param_match_suffix(btf, arg, "__nullable"); + return btf_param_match_suffix(btf, arg, "__nullable"); } static bool is_kfunc_arg_scalar_with_name(const struct btf *btf, -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 1611603537a4b88cec7993f32b70c03113801a46 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Collect argument information from the type information of stub functions to mark arguments of BPF struct_ops programs with PTR_MAYBE_NULL if they are nullable. A nullable argument is annotated by suffixing "__nullable" at the argument name of stub function. For nullable arguments, this patch sets a struct bpf_ctx_arg_aux to label their reg_type with PTR_TO_BTF_ID | PTR_TRUSTED | PTR_MAYBE_NULL. This makes the verifier to check programs and ensure that they properly check the pointer. The programs should check if the pointer is null before accessing the pointed memory. The implementer of a struct_ops type should annotate the arguments that can be null. The implementer should define a stub function (empty) as a placeholder for each defined operator. The name of a stub function should be in the pattern "<st_op_type>__<operator name>". For example, for test_maybe_null of struct bpf_testmod_ops, it's stub function name should be "bpf_testmod_ops__test_maybe_null". You mark an argument nullable by suffixing the argument name with "__nullable" at the stub function. Since we already has stub functions for kCFI, we just reuse these stub functions with the naming convention mentioned earlier. These stub functions with the naming convention is only required if there are nullable arguments to annotate. For functions having not nullable arguments, stub functions are not necessary for the purpose of this patch. This patch will prepare a list of struct bpf_ctx_arg_aux, aka arg_info, for each member field of a struct_ops type. "arg_info" will be assigned to "prog->aux->ctx_arg_info" of BPF struct_ops programs in check_struct_ops_btf_id() so that it can be used by btf_ctx_access() later to set reg_type properly for the verifier. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-4-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Conflicts: include/linux/bpf.h kernel/bpf/btf.c [Fix conflicts because of kabi and context diff.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 21 ++++ include/linux/btf.h | 2 + kernel/bpf/bpf_struct_ops.c | 213 ++++++++++++++++++++++++++++++++++-- kernel/bpf/btf.c | 27 +++++ kernel/bpf/verifier.c | 6 + 5 files changed, 257 insertions(+), 12 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5313a8a42420..6bb262b615ff 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1806,6 +1806,19 @@ struct bpf_struct_ops { KABI_RESERVE(4) }; +/* Every member of a struct_ops type has an instance even a member is not + * an operator (function pointer). The "info" field will be assigned to + * prog->aux->ctx_arg_info of BPF struct_ops programs to provide the + * argument information required by the verifier to verify the program. + * + * btf_ctx_access() will lookup prog->aux->ctx_arg_info to find the + * corresponding entry for an given argument. + */ +struct bpf_struct_ops_arg_info { + struct bpf_ctx_arg_aux *info; + u32 cnt; +}; + struct bpf_struct_ops_desc { struct bpf_struct_ops *st_ops; @@ -1813,6 +1826,9 @@ struct bpf_struct_ops_desc { const struct btf_type *value_type; u32 type_id; u32 value_id; + + /* Collection of argument information for each member */ + struct bpf_struct_ops_arg_info *arg_info; }; enum bpf_struct_ops_state { @@ -1887,6 +1903,7 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, struct btf *btf, struct bpf_verifier_log *log); void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map); +void bpf_struct_ops_desc_release(struct bpf_struct_ops_desc *st_ops_desc); #else #define register_bpf_struct_ops(st_ops, type) ({ (void *)(st_ops); 0; }) static inline bool bpf_try_module_get(const void *data, struct module *owner) @@ -1911,6 +1928,10 @@ static inline void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struc { } +static inline void bpf_struct_ops_desc_release(struct bpf_struct_ops_desc *st_ops_desc) +{ +} + #endif #if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_BPF_LSM) diff --git a/include/linux/btf.h b/include/linux/btf.h index c9e3e0786648..91c55f350357 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -498,6 +498,8 @@ static inline void *btf_id_set8_contains(const struct btf_id_set8 *set, u32 id) bool btf_param_match_suffix(const struct btf *btf, const struct btf_param *arg, const char *suffix); +int btf_ctx_arg_offset(const struct btf *btf, const struct btf_type *func_proto, + u32 arg_no); struct bpf_verifier_log; diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index f98f580de77a..0d7be97a2411 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -116,17 +116,183 @@ static bool is_valid_value_type(struct btf *btf, s32 value_id, return true; } +#define MAYBE_NULL_SUFFIX "__nullable" +#define MAX_STUB_NAME 128 + +/* Return the type info of a stub function, if it exists. + * + * The name of a stub function is made up of the name of the struct_ops and + * the name of the function pointer member, separated by "__". For example, + * if the struct_ops type is named "foo_ops" and the function pointer + * member is named "bar", the stub function name would be "foo_ops__bar". + */ +static const struct btf_type * +find_stub_func_proto(const struct btf *btf, const char *st_op_name, + const char *member_name) +{ + char stub_func_name[MAX_STUB_NAME]; + const struct btf_type *func_type; + s32 btf_id; + int cp; + + cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s", + st_op_name, member_name); + if (cp >= MAX_STUB_NAME) { + pr_warn("Stub function name too long\n"); + return NULL; + } + btf_id = btf_find_by_name_kind(btf, stub_func_name, BTF_KIND_FUNC); + if (btf_id < 0) + return NULL; + func_type = btf_type_by_id(btf, btf_id); + if (!func_type) + return NULL; + + return btf_type_by_id(btf, func_type->type); /* FUNC_PROTO */ +} + +/* Prepare argument info for every nullable argument of a member of a + * struct_ops type. + * + * Initialize a struct bpf_struct_ops_arg_info according to type info of + * the arguments of a stub function. (Check kCFI for more information about + * stub functions.) + * + * Each member in the struct_ops type has a struct bpf_struct_ops_arg_info + * to provide an array of struct bpf_ctx_arg_aux, which in turn provides + * the information that used by the verifier to check the arguments of the + * BPF struct_ops program assigned to the member. Here, we only care about + * the arguments that are marked as __nullable. + * + * The array of struct bpf_ctx_arg_aux is eventually assigned to + * prog->aux->ctx_arg_info of BPF struct_ops programs and passed to the + * verifier. (See check_struct_ops_btf_id()) + * + * arg_info->info will be the list of struct bpf_ctx_arg_aux if success. If + * fails, it will be kept untouched. + */ +static int prepare_arg_info(struct btf *btf, + const char *st_ops_name, + const char *member_name, + const struct btf_type *func_proto, + struct bpf_struct_ops_arg_info *arg_info) +{ + const struct btf_type *stub_func_proto, *pointed_type; + const struct btf_param *stub_args, *args; + struct bpf_ctx_arg_aux *info, *info_buf; + u32 nargs, arg_no, info_cnt = 0; + u32 arg_btf_id; + int offset; + + stub_func_proto = find_stub_func_proto(btf, st_ops_name, member_name); + if (!stub_func_proto) + return 0; + + /* Check if the number of arguments of the stub function is the same + * as the number of arguments of the function pointer. + */ + nargs = btf_type_vlen(func_proto); + if (nargs != btf_type_vlen(stub_func_proto)) { + pr_warn("the number of arguments of the stub function %s__%s does not match the number of arguments of the member %s of struct %s\n", + st_ops_name, member_name, member_name, st_ops_name); + return -EINVAL; + } + + if (!nargs) + return 0; + + args = btf_params(func_proto); + stub_args = btf_params(stub_func_proto); + + info_buf = kcalloc(nargs, sizeof(*info_buf), GFP_KERNEL); + if (!info_buf) + return -ENOMEM; + + /* Prepare info for every nullable argument */ + info = info_buf; + for (arg_no = 0; arg_no < nargs; arg_no++) { + /* Skip arguments that is not suffixed with + * "__nullable". + */ + if (!btf_param_match_suffix(btf, &stub_args[arg_no], + MAYBE_NULL_SUFFIX)) + continue; + + /* Should be a pointer to struct */ + pointed_type = btf_type_resolve_ptr(btf, + args[arg_no].type, + &arg_btf_id); + if (!pointed_type || + !btf_type_is_struct(pointed_type)) { + pr_warn("stub function %s__%s has %s tagging to an unsupported type\n", + st_ops_name, member_name, MAYBE_NULL_SUFFIX); + goto err_out; + } + + offset = btf_ctx_arg_offset(btf, func_proto, arg_no); + if (offset < 0) { + pr_warn("stub function %s__%s has an invalid trampoline ctx offset for arg#%u\n", + st_ops_name, member_name, arg_no); + goto err_out; + } + + if (args[arg_no].type != stub_args[arg_no].type) { + pr_warn("arg#%u type in stub function %s__%s does not match with its original func_proto\n", + arg_no, st_ops_name, member_name); + goto err_out; + } + + /* Fill the information of the new argument */ + info->reg_type = + PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL; + info->btf_id = arg_btf_id; + info->btf = btf; + info->offset = offset; + + info++; + info_cnt++; + } + + if (info_cnt) { + arg_info->info = info_buf; + arg_info->cnt = info_cnt; + } else { + kfree(info_buf); + } + + return 0; + +err_out: + kfree(info_buf); + + return -EINVAL; +} + +/* Clean up the arg_info in a struct bpf_struct_ops_desc. */ +void bpf_struct_ops_desc_release(struct bpf_struct_ops_desc *st_ops_desc) +{ + struct bpf_struct_ops_arg_info *arg_info; + int i; + + arg_info = st_ops_desc->arg_info; + for (i = 0; i < btf_type_vlen(st_ops_desc->type); i++) + kfree(arg_info[i].info); + + kfree(arg_info); +} + int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, struct btf *btf, struct bpf_verifier_log *log) { struct bpf_struct_ops *st_ops = st_ops_desc->st_ops; + struct bpf_struct_ops_arg_info *arg_info; const struct btf_member *member; const struct btf_type *t; s32 type_id, value_id; char value_name[128]; const char *mname; - int i; + int i, err; if (strlen(st_ops->name) + VALUE_PREFIX_LEN >= sizeof(value_name)) { @@ -160,6 +326,17 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, if (!is_valid_value_type(btf, value_id, t, value_name)) return -EINVAL; + arg_info = kcalloc(btf_type_vlen(t), sizeof(*arg_info), + GFP_KERNEL); + if (!arg_info) + return -ENOMEM; + + st_ops_desc->arg_info = arg_info; + st_ops_desc->type = t; + st_ops_desc->type_id = type_id; + st_ops_desc->value_id = value_id; + st_ops_desc->value_type = btf_type_by_id(btf, value_id); + for_each_member(i, t, member) { const struct btf_type *func_proto; @@ -167,40 +344,52 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, if (!*mname) { pr_warn("anon member in struct %s is not supported\n", st_ops->name); - return -EOPNOTSUPP; + err = -EOPNOTSUPP; + goto errout; } if (__btf_member_bitfield_size(t, member)) { pr_warn("bit field member %s in struct %s is not supported\n", mname, st_ops->name); - return -EOPNOTSUPP; + err = -EOPNOTSUPP; + goto errout; } func_proto = btf_type_resolve_func_ptr(btf, member->type, NULL); - if (func_proto && - btf_distill_func_proto(log, btf, + if (!func_proto) + continue; + + if (btf_distill_func_proto(log, btf, func_proto, mname, &st_ops->func_models[i])) { pr_warn("Error in parsing func ptr %s in struct %s\n", mname, st_ops->name); - return -EINVAL; + err = -EINVAL; + goto errout; } + + err = prepare_arg_info(btf, st_ops->name, mname, + func_proto, + arg_info + i); + if (err) + goto errout; } if (st_ops->init(btf)) { pr_warn("Error in init bpf_struct_ops %s\n", st_ops->name); - return -EINVAL; + err = -EINVAL; + goto errout; } - st_ops_desc->type_id = type_id; - st_ops_desc->type = t; - st_ops_desc->value_id = value_id; - st_ops_desc->value_type = btf_type_by_id(btf, value_id); - return 0; + +errout: + bpf_struct_ops_desc_release(st_ops_desc); + + return err; } static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key, diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index aa02f1dc9029..ed7d054d6e68 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -1712,6 +1712,13 @@ static void btf_free_struct_meta_tab(struct btf *btf) static void btf_free_struct_ops_tab(struct btf *btf) { struct btf_struct_ops_tab *tab = btf->struct_ops_tab; + u32 i; + + if (!tab) + return; + + for (i = 0; i < tab->cnt; i++) + bpf_struct_ops_desc_release(&tab->ops[i]); kfree(tab); btf->struct_ops_tab = NULL; @@ -6153,6 +6160,26 @@ static const struct bpf_raw_tp_null_args raw_tp_null_args[] = { { "amdtp_packet", 0x100 }, }; +int btf_ctx_arg_offset(const struct btf *btf, const struct btf_type *func_proto, + u32 arg_no) +{ + const struct btf_param *args; + const struct btf_type *t; + int off = 0, i; + u32 sz; + + args = btf_params(func_proto); + for (i = 0; i < arg_no; i++) { + t = btf_type_by_id(btf, args[i].type); + t = btf_resolve_size(btf, t, &sz); + if (IS_ERR(t)) + return PTR_ERR(t); + off += roundup(sz, 8); + } + + return off; +} + bool btf_ctx_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, struct bpf_insn_access_aux *info) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index d3cd3d121450..f6cac4dba261 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20258,6 +20258,12 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) } } + /* btf_ctx_access() used this to provide argument type info */ + prog->aux->ctx_arg_info = + st_ops_desc->arg_info[member_idx].info; + prog->aux->ctx_arg_info_size = + st_ops_desc->arg_info[member_idx].cnt; + prog->aux->attach_func_proto = func_proto; prog->aux->attach_func_name = mname; env->ops = st_ops->verifier_ops; -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 00f239eccf461a6403b3c16e767d04f3954cae98 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Test if the verifier verifies nullable pointer arguments correctly for BPF struct_ops programs. "test_maybe_null" in struct bpf_testmod_ops is the operator defined for the test cases here. A BPF program should check a pointer for NULL beforehand to access the value pointed by the nullable pointer arguments, or the verifier should reject the programs. The test here includes two parts; the programs checking pointers properly and the programs not checking pointers beforehand. The test checks if the verifier accepts the programs checking properly and rejects the programs not checking at all. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-5-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 13 +++++- .../selftests/bpf/bpf_testmod/bpf_testmod.h | 4 ++ .../prog_tests/test_struct_ops_maybe_null.c | 46 +++++++++++++++++++ .../bpf/progs/struct_ops_maybe_null.c | 29 ++++++++++++ .../bpf/progs/struct_ops_maybe_null_fail.c | 24 ++++++++++ 5 files changed, 115 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_maybe_null.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_maybe_null.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_maybe_null_fail.c diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 2e1b5d8f0fa2..79238005edd1 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -550,7 +550,11 @@ static int bpf_dummy_reg(void *kdata) { struct bpf_testmod_ops *ops = kdata; - ops->test_2(4, 3); + /* Some test cases (ex. struct_ops_maybe_null) may not have test_2 + * initialized, so we need to check for NULL. + */ + if (ops->test_2) + ops->test_2(4, 3); return 0; } @@ -568,9 +572,16 @@ static void bpf_testmod_test_2(int a, int b) { } +static int bpf_testmod_ops__test_maybe_null(int dummy, + struct task_struct *task__nullable) +{ + return 0; +} + static struct bpf_testmod_ops __bpf_testmod_ops = { .test_1 = bpf_testmod_test_1, .test_2 = bpf_testmod_test_2, + .test_maybe_null = bpf_testmod_ops__test_maybe_null, }; struct bpf_struct_ops bpf_bpf_testmod_ops = { diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h index 537beca42896..c3b0cf788f9f 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h @@ -5,6 +5,8 @@ #include <linux/types.h> +struct task_struct; + struct bpf_testmod_test_read_ctx { char *buf; loff_t off; @@ -31,6 +33,8 @@ struct bpf_iter_testmod_seq { struct bpf_testmod_ops { int (*test_1)(void); void (*test_2)(int a, int b); + /* Used to test nullable arguments. */ + int (*test_maybe_null)(int dummy, struct task_struct *task); }; #endif /* _BPF_TESTMOD_H */ diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_maybe_null.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_maybe_null.c new file mode 100644 index 000000000000..01dc2613c8a5 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_maybe_null.c @@ -0,0 +1,46 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include <test_progs.h> + +#include "struct_ops_maybe_null.skel.h" +#include "struct_ops_maybe_null_fail.skel.h" + +/* Test that the verifier accepts a program that access a nullable pointer + * with a proper check. + */ +static void maybe_null(void) +{ + struct struct_ops_maybe_null *skel; + + skel = struct_ops_maybe_null__open_and_load(); + if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load")) + return; + + struct_ops_maybe_null__destroy(skel); +} + +/* Test that the verifier rejects a program that access a nullable pointer + * without a check beforehand. + */ +static void maybe_null_fail(void) +{ + struct struct_ops_maybe_null_fail *skel; + + skel = struct_ops_maybe_null_fail__open_and_load(); + if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load")) + return; + + struct_ops_maybe_null_fail__destroy(skel); +} + +void test_struct_ops_maybe_null(void) +{ + /* The verifier verifies the programs at load time, so testing both + * programs in the same compile-unit is complicated. We run them in + * separate objects to simplify the testing. + */ + if (test__start_subtest("maybe_null")) + maybe_null(); + if (test__start_subtest("maybe_null_fail")) + maybe_null_fail(); +} diff --git a/tools/testing/selftests/bpf/progs/struct_ops_maybe_null.c b/tools/testing/selftests/bpf/progs/struct_ops_maybe_null.c new file mode 100644 index 000000000000..b450f72e744a --- /dev/null +++ b/tools/testing/selftests/bpf/progs/struct_ops_maybe_null.c @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include <vmlinux.h> +#include <bpf/bpf_tracing.h> +#include "../bpf_testmod/bpf_testmod.h" + +char _license[] SEC("license") = "GPL"; + +pid_t tgid = 0; + +/* This is a test BPF program that uses struct_ops to access an argument + * that may be NULL. This is a test for the verifier to ensure that it can + * rip PTR_MAYBE_NULL correctly. + */ +SEC("struct_ops/test_maybe_null") +int BPF_PROG(test_maybe_null, int dummy, + struct task_struct *task) +{ + if (task) + tgid = task->tgid; + + return 0; +} + +SEC(".struct_ops.link") +struct bpf_testmod_ops testmod_1 = { + .test_maybe_null = (void *)test_maybe_null, +}; + diff --git a/tools/testing/selftests/bpf/progs/struct_ops_maybe_null_fail.c b/tools/testing/selftests/bpf/progs/struct_ops_maybe_null_fail.c new file mode 100644 index 000000000000..6283099ec383 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/struct_ops_maybe_null_fail.c @@ -0,0 +1,24 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include <vmlinux.h> +#include <bpf/bpf_tracing.h> +#include "../bpf_testmod/bpf_testmod.h" + +char _license[] SEC("license") = "GPL"; + +pid_t tgid = 0; + +SEC("struct_ops/test_maybe_null_struct_ptr") +int BPF_PROG(test_maybe_null_struct_ptr, int dummy, + struct task_struct *task) +{ + tgid = task->tgid; + + return 0; +} + +SEC(".struct_ops.link") +struct bpf_testmod_ops testmod_struct_ptr = { + .test_maybe_null = (void *)test_maybe_null_struct_ptr, +}; + -- 2.34.1
From: Benjamin Tissoires <bentiss@kernel.org> mainline inclusion from mainline-v6.9-rc1 commit dfe6625df48ec54c6dc9b86d361f26962d09de88 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- No code change, but it'll allow to have only one place to change everything when we add in_sleepable in cur_state. Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/20240221-hid-bpf-sleepable-v3-2-1fb378ca6301@kerne... Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/verifier.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f6cac4dba261..ab1e3e057a87 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5423,6 +5423,11 @@ static int map_kptr_match_type(struct bpf_verifier_env *env, return -EINVAL; } +static bool in_sleepable(struct bpf_verifier_env *env) +{ + return env->prog->aux->sleepable; +} + /* The non-sleepable programs and sleepable programs with explicit bpf_rcu_read_lock() * can dereference RCU protected pointers and result is PTR_TRUSTED. */ @@ -5430,7 +5435,7 @@ static bool in_rcu_cs(struct bpf_verifier_env *env) { return env->cur_state->active_rcu_lock || env->cur_state->active_lock.ptr || - !env->prog->aux->sleepable; + !in_sleepable(env); } /* Once GCC supports btf_type_tag the following mechanism will be replaced with tag check */ @@ -10154,7 +10159,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn return -EINVAL; } - if (!env->prog->aux->sleepable && fn->might_sleep) { + if (!in_sleepable(env) && fn->might_sleep) { verbose(env, "helper call might sleep in a non-sleepable prog\n"); return -EINVAL; } @@ -10184,7 +10189,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn return -EINVAL; } - if (env->prog->aux->sleepable && is_storage_get_function(func_id)) + if (in_sleepable(env) && is_storage_get_function(func_id)) env->insn_aux_data[insn_idx].storage_get_func_atomic = true; } @@ -11479,7 +11484,7 @@ static bool check_css_task_iter_allowlist(struct bpf_verifier_env *env) return true; fallthrough; default: - return env->prog->aux->sleepable; + return in_sleepable(env); } } @@ -12002,7 +12007,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, } sleepable = is_kfunc_sleepable(&meta); - if (sleepable && !env->prog->aux->sleepable) { + if (sleepable && !in_sleepable(env)) { verbose(env, "program must be sleepable to call sleepable kfunc %s\n", func_name); return -EACCES; } @@ -19582,7 +19587,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env) } if (is_storage_get_function(insn->imm)) { - if (!env->prog->aux->sleepable || + if (!in_sleepable(env) || env->insn_aux_data[i + delta].storage_get_func_atomic) insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_ATOMIC); else -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 3644d285462a60c80ac225d508fcfe705640d2b4 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- For a struct_ops map, btf_value_type_id is the type ID of it's struct type. This value is required by bpftool to generate skeleton including pointers of shadow types. The code generator gets the type ID from bpf_map__btf_value_type_id() in order to get the type information of the struct type of a map. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240229064523.2091270-2-thinker.li@gmail.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/libbpf.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 072f72c1b40c..dc3aecf1f070 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1214,6 +1214,7 @@ static int init_struct_ops_maps(struct bpf_object *obj, const char *sec_name, map->name = strdup(var_name); if (!map->name) return -ENOMEM; + map->btf_value_type_id = type_id; map->def.type = BPF_MAP_TYPE_STRUCT_OPS; map->def.key_size = sizeof(int); @@ -5193,6 +5194,10 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b create_attr.btf_value_type_id = 0; map->btf_key_type_id = 0; map->btf_value_type_id = 0; + break; + case BPF_MAP_TYPE_STRUCT_OPS: + create_attr.btf_value_type_id = 0; + break; default: break; } -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 69e4a9d2b3f5adf5af4feeab0a9f505da971265a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Convert st_ops->data to the shadow type of the struct_ops map. The shadow type of a struct_ops type is a variant of the original struct type providing a way to access/change the values in the maps of the struct_ops type. bpf_map__initial_value() will return st_ops->data for struct_ops types. The skeleton is going to use it as the pointer to the shadow type of the original struct type. One of the main differences between the original struct type and the shadow type is that all function pointers of the shadow type are converted to pointers of struct bpf_program. Users can replace these bpf_program pointers with other BPF programs. The st_ops->progs[] will be updated before updating the value of a map to reflect the changes made by users. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240229064523.2091270-3-thinker.li@gmail.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/libbpf.c | 40 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index dc3aecf1f070..5f8f0dd589e7 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -999,6 +999,19 @@ static bool bpf_map__is_struct_ops(const struct bpf_map *map) return map->def.type == BPF_MAP_TYPE_STRUCT_OPS; } +static bool is_valid_st_ops_program(struct bpf_object *obj, + const struct bpf_program *prog) +{ + int i; + + for (i = 0; i < obj->nr_programs; i++) { + if (&obj->programs[i] == prog) + return prog->type == BPF_PROG_TYPE_STRUCT_OPS; + } + + return false; +} + /* Init the map's fields that depend on kern_btf */ static int bpf_map__init_kern_struct_ops(struct bpf_map *map) { @@ -1087,9 +1100,16 @@ static int bpf_map__init_kern_struct_ops(struct bpf_map *map) if (btf_is_ptr(mtype)) { struct bpf_program *prog; - prog = st_ops->progs[i]; + /* Update the value from the shadow type */ + prog = *(void **)mdata; + st_ops->progs[i] = prog; if (!prog) continue; + if (!is_valid_st_ops_program(obj, prog)) { + pr_warn("struct_ops init_kern %s: member %s is not a struct_ops program\n", + map->name, mname); + return -ENOTSUP; + } kern_mtype = skip_mods_and_typedefs(kern_btf, kern_mtype->type, @@ -9173,7 +9193,9 @@ static struct bpf_map *find_struct_ops_map_by_offset(struct bpf_object *obj, return NULL; } -/* Collect the reloc from ELF and populate the st_ops->progs[] */ +/* Collect the reloc from ELF, populate the st_ops->progs[], and update + * st_ops->data for shadow type. + */ static int bpf_object__collect_st_ops_relos(struct bpf_object *obj, Elf64_Shdr *shdr, Elf_Data *data) { @@ -9287,6 +9309,14 @@ static int bpf_object__collect_st_ops_relos(struct bpf_object *obj, } st_ops->progs[member_idx] = prog; + + /* st_ops->data will be exposed to users, being returned by + * bpf_map__initial_value() as a pointer to the shadow + * type. All function pointers in the original struct type + * should be converted to a pointer to struct bpf_program + * in the shadow type. + */ + *((struct bpf_program **)(st_ops->data + moff)) = prog; } return 0; @@ -9743,6 +9773,12 @@ int bpf_map__set_initial_value(struct bpf_map *map, void *bpf_map__initial_value(struct bpf_map *map, size_t *psize) { + if (bpf_map__is_struct_ops(map)) { + if (psize) + *psize = map->def.value_size; + return map->st_ops->data; + } + if (!map->mmaped) return NULL; *psize = map->def.value_size; -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit a7b0fa352eafef95bd0d736ca94965d3f884ad18 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Declares and defines a pointer of the shadow type for each struct_ops map. The code generator will create an anonymous struct type as the shadow type for each struct_ops map. The shadow type is translated from the original struct type of the map. The user of the skeleton use pointers of them to access the values of struct_ops maps. However, shadow types only supports certain types of fields, including scalar types and function pointers. Any fields of unsupported types are translated into an array of characters to occupy the space of the original field. Function pointers are translated into pointers of the struct bpf_program. Additionally, padding fields are generated to occupy the space between two consecutive fields. The pointers of shadow types of struct_osp maps are initialized when *__open_opts() in skeletons are called. For a map called FOO, the user can access it through the pointer at skel->struct_ops.FOO. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20240229064523.2091270-4-thinker.li@gmail.com Conflicts: tools/bpf/bpftool/gen.c [Fix conflict because of context diff.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/bpf/bpftool/gen.c | 237 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 236 insertions(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c index 882bf8e6e70e..34515fbe9414 100644 --- a/tools/bpf/bpftool/gen.c +++ b/tools/bpf/bpftool/gen.c @@ -54,6 +54,20 @@ static bool str_has_suffix(const char *str, const char *suffix) return true; } +static const struct btf_type * +resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id) +{ + const struct btf_type *t; + + t = skip_mods_and_typedefs(btf, id, NULL); + if (!btf_is_ptr(t)) + return NULL; + + t = skip_mods_and_typedefs(btf, t->type, res_id); + + return btf_is_func_proto(t) ? t : NULL; +} + static void get_obj_name(char *name, const char *file) { /* Using basename() GNU version which doesn't modify arg. */ @@ -903,6 +917,207 @@ codegen_progs_skeleton(struct bpf_object *obj, size_t prog_cnt, bool populate_li } } +static int walk_st_ops_shadow_vars(struct btf *btf, const char *ident, + const struct btf_type *map_type, __u32 map_type_id) +{ + LIBBPF_OPTS(btf_dump_emit_type_decl_opts, opts, .indent_level = 3); + const struct btf_type *member_type; + __u32 offset, next_offset = 0; + const struct btf_member *m; + struct btf_dump *d = NULL; + const char *member_name; + __u32 member_type_id; + int i, err = 0, n; + int size; + + d = btf_dump__new(btf, codegen_btf_dump_printf, NULL, NULL); + if (!d) + return -errno; + + n = btf_vlen(map_type); + for (i = 0, m = btf_members(map_type); i < n; i++, m++) { + member_type = skip_mods_and_typedefs(btf, m->type, &member_type_id); + member_name = btf__name_by_offset(btf, m->name_off); + + offset = m->offset / 8; + if (next_offset < offset) + printf("\t\t\tchar __padding_%d[%d];\n", i, offset - next_offset); + + switch (btf_kind(member_type)) { + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_ENUM: + case BTF_KIND_ENUM64: + /* scalar type */ + printf("\t\t\t"); + opts.field_name = member_name; + err = btf_dump__emit_type_decl(d, member_type_id, &opts); + if (err) { + p_err("Failed to emit type declaration for %s: %d", member_name, err); + goto out; + } + printf(";\n"); + + size = btf__resolve_size(btf, member_type_id); + if (size < 0) { + p_err("Failed to resolve size of %s: %d\n", member_name, size); + err = size; + goto out; + } + + next_offset = offset + size; + break; + + case BTF_KIND_PTR: + if (resolve_func_ptr(btf, m->type, NULL)) { + /* Function pointer */ + printf("\t\t\tstruct bpf_program *%s;\n", member_name); + + next_offset = offset + sizeof(void *); + break; + } + /* All pointer types are unsupported except for + * function pointers. + */ + fallthrough; + + default: + /* Unsupported types + * + * Types other than scalar types and function + * pointers are currently not supported in order to + * prevent conflicts in the generated code caused + * by multiple definitions. For instance, if the + * struct type FOO is used in a struct_ops map, + * bpftool has to generate definitions for FOO, + * which may result in conflicts if FOO is defined + * in different skeleton files. + */ + size = btf__resolve_size(btf, member_type_id); + if (size < 0) { + p_err("Failed to resolve size of %s: %d\n", member_name, size); + err = size; + goto out; + } + printf("\t\t\tchar __unsupported_%d[%d];\n", i, size); + + next_offset = offset + size; + break; + } + } + + /* Cannot fail since it must be a struct type */ + size = btf__resolve_size(btf, map_type_id); + if (next_offset < (__u32)size) + printf("\t\t\tchar __padding_end[%d];\n", size - next_offset); + +out: + btf_dump__free(d); + + return err; +} + +/* Generate the pointer of the shadow type for a struct_ops map. + * + * This function adds a pointer of the shadow type for a struct_ops map. + * The members of a struct_ops map can be exported through a pointer to a + * shadow type. The user can access these members through the pointer. + * + * A shadow type includes not all members, only members of some types. + * They are scalar types and function pointers. The function pointers are + * translated to the pointer of the struct bpf_program. The scalar types + * are translated to the original type without any modifiers. + * + * Unsupported types will be translated to a char array to occupy the same + * space as the original field, being renamed as __unsupported_*. The user + * should treat these fields as opaque data. + */ +static int gen_st_ops_shadow_type(const char *obj_name, struct btf *btf, const char *ident, + const struct bpf_map *map) +{ + const struct btf_type *map_type; + const char *type_name; + __u32 map_type_id; + int err; + + map_type_id = bpf_map__btf_value_type_id(map); + if (map_type_id == 0) + return -EINVAL; + map_type = btf__type_by_id(btf, map_type_id); + if (!map_type) + return -EINVAL; + + type_name = btf__name_by_offset(btf, map_type->name_off); + + printf("\t\tstruct %s__%s__%s {\n", obj_name, ident, type_name); + + err = walk_st_ops_shadow_vars(btf, ident, map_type, map_type_id); + if (err) + return err; + + printf("\t\t} *%s;\n", ident); + + return 0; +} + +static int gen_st_ops_shadow(const char *obj_name, struct btf *btf, struct bpf_object *obj) +{ + int err, st_ops_cnt = 0; + struct bpf_map *map; + char ident[256]; + + if (!btf) + return 0; + + /* Generate the pointers to shadow types of + * struct_ops maps. + */ + bpf_object__for_each_map(map, obj) { + if (bpf_map__type(map) != BPF_MAP_TYPE_STRUCT_OPS) + continue; + if (!get_map_ident(map, ident, sizeof(ident))) + continue; + + if (st_ops_cnt == 0) /* first struct_ops map */ + printf("\tstruct {\n"); + st_ops_cnt++; + + err = gen_st_ops_shadow_type(obj_name, btf, ident, map); + if (err) + return err; + } + + if (st_ops_cnt) + printf("\t} struct_ops;\n"); + + return 0; +} + +/* Generate the code to initialize the pointers of shadow types. */ +static void gen_st_ops_shadow_init(struct btf *btf, struct bpf_object *obj) +{ + struct bpf_map *map; + char ident[256]; + + if (!btf) + return; + + /* Initialize the pointers to_ops shadow types of + * struct_ops maps. + */ + bpf_object__for_each_map(map, obj) { + if (bpf_map__type(map) != BPF_MAP_TYPE_STRUCT_OPS) + continue; + if (!get_map_ident(map, ident, sizeof(ident))) + continue; + codegen("\ + \n\ + obj->struct_ops.%1$s = bpf_map__initial_value(obj->maps.%1$s, NULL);\n\ + \n\ + ", ident); + } +} + static int do_skeleton(int argc, char **argv) { char header_guard[MAX_OBJ_NAME_LEN + sizeof("__SKEL_H__")]; @@ -1046,6 +1261,11 @@ static int do_skeleton(int argc, char **argv) printf("\t} maps;\n"); } + btf = bpf_object__btf(obj); + err = gen_st_ops_shadow(obj_name, btf, obj); + if (err) + goto out; + if (prog_cnt) { printf("\tstruct {\n"); bpf_object__for_each_program(prog, obj) { @@ -1069,7 +1289,6 @@ static int do_skeleton(int argc, char **argv) printf("\t} links;\n"); } - btf = bpf_object__btf(obj); if (btf) { err = codegen_datasecs(obj, obj_name); if (err) @@ -1127,6 +1346,12 @@ static int do_skeleton(int argc, char **argv) if (err) \n\ goto err_out; \n\ \n\ + ", obj_name); + + gen_st_ops_shadow_init(btf, obj); + + codegen("\ + \n\ return obj; \n\ err_out: \n\ %1$s__destroy(obj); \n\ @@ -1436,6 +1661,10 @@ static int do_subskeleton(int argc, char **argv) printf("\t} maps;\n"); } + err = gen_st_ops_shadow(obj_name, btf, obj); + if (err) + goto out; + if (prog_cnt) { printf("\tstruct {\n"); bpf_object__for_each_program(prog, obj) { @@ -1547,6 +1776,12 @@ static int do_subskeleton(int argc, char **argv) if (err) \n\ goto err; \n\ \n\ + "); + + gen_st_ops_shadow_init(btf, obj); + + codegen("\ + \n\ return obj; \n\ err: \n\ %1$s__destroy(obj); \n\ -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit f2e81192e07e87897ff1296c96775eceea8f582a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The example in bpftool-gen.8 explains how to use the pointer of the shadow type to change the value of a field of a struct_ops map. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20240229064523.2091270-5-thinker.li@gmail.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../bpf/bpftool/Documentation/bpftool-gen.rst | 58 +++++++++++++++++-- 1 file changed, 52 insertions(+), 6 deletions(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool-gen.rst b/tools/bpf/bpftool/Documentation/bpftool-gen.rst index 5006e724d1bc..5e60825818dd 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-gen.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-gen.rst @@ -257,18 +257,48 @@ EXAMPLES return 0; } -This is example BPF application with two BPF programs and a mix of BPF maps -and global variables. Source code is split across two source code files. +**$ cat example3.bpf.c** + +:: + + #include <linux/ptrace.h> + #include <linux/bpf.h> + #include <bpf/bpf_helpers.h> + /* This header file is provided by the bpf_testmod module. */ + #include "bpf_testmod.h" + + int test_2_result = 0; + + /* bpf_Testmod.ko calls this function, passing a "4" + * and testmod_map->data. + */ + SEC("struct_ops/test_2") + void BPF_PROG(test_2, int a, int b) + { + test_2_result = a + b; + } + + SEC(".struct_ops") + struct bpf_testmod_ops testmod_map = { + .test_2 = (void *)test_2, + .data = 0x1, + }; + +This is example BPF application with three BPF programs and a mix of BPF +maps and global variables. Source code is split across three source code +files. **$ clang --target=bpf -g example1.bpf.c -o example1.bpf.o** **$ clang --target=bpf -g example2.bpf.c -o example2.bpf.o** -**$ bpftool gen object example.bpf.o example1.bpf.o example2.bpf.o** +**$ clang --target=bpf -g example3.bpf.c -o example3.bpf.o** + +**$ bpftool gen object example.bpf.o example1.bpf.o example2.bpf.o example3.bpf.o** -This set of commands compiles *example1.bpf.c* and *example2.bpf.c* -individually and then statically links respective object files into the final -BPF ELF object file *example.bpf.o*. +This set of commands compiles *example1.bpf.c*, *example2.bpf.c* and +*example3.bpf.c* individually and then statically links respective object +files into the final BPF ELF object file *example.bpf.o*. **$ bpftool gen skeleton example.bpf.o name example | tee example.skel.h** @@ -291,7 +321,15 @@ BPF ELF object file *example.bpf.o*. struct bpf_map *data; struct bpf_map *bss; struct bpf_map *my_map; + struct bpf_map *testmod_map; } maps; + struct { + struct example__testmod_map__bpf_testmod_ops { + const struct bpf_program *test_1; + const struct bpf_program *test_2; + int data; + } *testmod_map; + } struct_ops; struct { struct bpf_program *handle_sys_enter; struct bpf_program *handle_sys_exit; @@ -304,6 +342,7 @@ BPF ELF object file *example.bpf.o*. struct { int x; } data; + int test_2_result; } *bss; struct example__data { _Bool global_flag; @@ -342,10 +381,16 @@ BPF ELF object file *example.bpf.o*. skel->rodata->param1 = 128; + /* Change the value through the pointer of shadow type */ + skel->struct_ops.testmod_map->data = 13; + err = example__load(skel); if (err) goto cleanup; + /* The result of the function test_2() */ + printf("test_2_result: %d\n", skel->bss->test_2_result); + err = example__attach(skel); if (err) goto cleanup; @@ -372,6 +417,7 @@ BPF ELF object file *example.bpf.o*. :: + test_2_result: 17 my_map name: my_map sys_enter prog FD: 8 my_static_var: 7 -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 0623e73317940d052216fb6eef4efd55a0a7f602 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Change the values of fields, including scalar types and function pointers, and check if the struct_ops map works as expected. The test changes the field "test_2" of "testmod_1" from the pointer to test_2() to pointer to test_3() and the field "data" to 13. The function test_2() and test_3() both compute a new value for "test_2_result", but in different way. By checking the value of "test_2_result", it ensures the struct_ops map works as expected with changes through shadow types. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240229064523.2091270-6-thinker.li@gmail.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 11 ++++++++++- .../selftests/bpf/bpf_testmod/bpf_testmod.h | 8 ++++++++ .../bpf/prog_tests/test_struct_ops_module.c | 19 +++++++++++++++---- .../selftests/bpf/progs/struct_ops_module.c | 8 ++++++++ 4 files changed, 41 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 79238005edd1..2c5c64950781 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -534,6 +534,15 @@ static int bpf_testmod_ops_init_member(const struct btf_type *t, const struct btf_member *member, void *kdata, const void *udata) { + if (member->offset == offsetof(struct bpf_testmod_ops, data) * 8) { + /* For data fields, this function has to copy it and return + * 1 to indicate that the data has been handled by the + * struct_ops type, or the verifier will reject the map if + * the value of the data field is not zero. + */ + ((struct bpf_testmod_ops *)kdata)->data = ((struct bpf_testmod_ops *)udata)->data; + return 1; + } return 0; } @@ -554,7 +563,7 @@ static int bpf_dummy_reg(void *kdata) * initialized, so we need to check for NULL. */ if (ops->test_2) - ops->test_2(4, 3); + ops->test_2(4, ops->data); return 0; } diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h index c3b0cf788f9f..971458acfac3 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h @@ -35,6 +35,14 @@ struct bpf_testmod_ops { void (*test_2)(int a, int b); /* Used to test nullable arguments. */ int (*test_maybe_null)(int dummy, struct task_struct *task); + + /* The following fields are used to test shadow copies. */ + char onebyte; + struct { + int a; + int b; + } unsupported; + int data; }; #endif /* _BPF_TESTMOD_H */ diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c index 8d833f0c7580..7d6facf46ebb 100644 --- a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c +++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c @@ -32,17 +32,23 @@ static void check_map_info(struct bpf_map_info *info) static void test_struct_ops_load(void) { - DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts); struct struct_ops_module *skel; struct bpf_map_info info = {}; struct bpf_link *link; int err; u32 len; - skel = struct_ops_module__open_opts(&opts); + skel = struct_ops_module__open(); if (!ASSERT_OK_PTR(skel, "struct_ops_module_open")) return; + skel->struct_ops.testmod_1->data = 13; + skel->struct_ops.testmod_1->test_2 = skel->progs.test_3; + /* Since test_2() is not being used, it should be disabled from + * auto-loading, or it will fail to load. + */ + bpf_program__set_autoload(skel->progs.test_2, false); + err = struct_ops_module__load(skel); if (!ASSERT_OK(err, "struct_ops_module_load")) goto cleanup; @@ -56,8 +62,13 @@ static void test_struct_ops_load(void) link = bpf_map__attach_struct_ops(skel->maps.testmod_1); ASSERT_OK_PTR(link, "attach_test_mod_1"); - /* test_2() will be called from bpf_dummy_reg() in bpf_testmod.c */ - ASSERT_EQ(skel->bss->test_2_result, 7, "test_2_result"); + /* test_3() will be called from bpf_dummy_reg() in bpf_testmod.c + * + * In bpf_testmod.c it will pass 4 and 13 (the value of data) to + * .test_2. So, the value of test_2_result should be 20 (4 + 13 + + * 3). + */ + ASSERT_EQ(skel->bss->test_2_result, 20, "check_shadow_variables"); bpf_link__destroy(link); diff --git a/tools/testing/selftests/bpf/progs/struct_ops_module.c b/tools/testing/selftests/bpf/progs/struct_ops_module.c index b78746b3cef3..25952fa09348 100644 --- a/tools/testing/selftests/bpf/progs/struct_ops_module.c +++ b/tools/testing/selftests/bpf/progs/struct_ops_module.c @@ -21,9 +21,17 @@ void BPF_PROG(test_2, int a, int b) test_2_result = a + b; } +SEC("struct_ops/test_3") +int BPF_PROG(test_3, int a, int b) +{ + test_2_result = a + b + 3; + return a + b + 3; +} + SEC(".struct_ops.link") struct bpf_testmod_ops testmod_1 = { .test_1 = (void *)test_1, .test_2 = (void *)test_2, + .data = 0x1, }; -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 73e4f9e615d7b99f39663d4722dc73e8fa5db5f9 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Perform all validations when updating values of struct_ops maps. Doing validation in st_ops->reg() and st_ops->update() is not necessary anymore. However, tcp_register_congestion_control() has been called in various places. It still needs to do validations. Cc: netdev@vger.kernel.org Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240224223418.526631-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_struct_ops.c | 11 ++++++----- net/ipv4/tcp_cong.c | 6 +----- 2 files changed, 7 insertions(+), 10 deletions(-) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 0d7be97a2411..c244ed5114fd 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -667,13 +667,14 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, *(unsigned long *)(udata + moff) = prog->aux->id; } + if (st_ops->validate) { + err = st_ops->validate(kdata); + if (err) + goto reset_unlock; + } + if (st_map->map.map_flags & BPF_F_LINK) { err = 0; - if (st_ops->validate) { - err = st_ops->validate(kdata); - if (err) - goto reset_unlock; - } arch_protect_bpf_trampoline(st_map->image, PAGE_SIZE); /* Let bpf_link handle registration & unregistration. * diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index 95dbb2799be4..48617d99abb0 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -145,11 +145,7 @@ EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control); int tcp_update_congestion_control(struct tcp_congestion_ops *ca, struct tcp_congestion_ops *old_ca) { struct tcp_congestion_ops *existing; - int ret; - - ret = tcp_validate_congestion_control(ca); - if (ret) - return ret; + int ret = 0; ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name)); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 187e2af05abe6bf80581490239c449456627d17a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The BPF struct_ops previously only allowed one page of trampolines. Each function pointer of a struct_ops is implemented by a struct_ops bpf program. Each struct_ops bpf program requires a trampoline. The following selftest patch shows each page can hold a little more than 20 trampolines. While one page is more than enough for the tcp-cc usecase, the sched_ext use case shows that one page is not always enough and hits the one page limit. This patch overcomes the one page limit by allocating another page when needed and it is limited to a total of MAX_IMAGE_PAGES (8) pages which is more than enough for reasonable usages. The variable st_map->image has been changed to st_map->image_pages, and its type has been changed to an array of pointers to pages. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240224223418.526631-3-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 4 +- kernel/bpf/bpf_struct_ops.c | 130 ++++++++++++++++++++++----------- net/bpf/bpf_dummy_struct_ops.c | 12 +-- 3 files changed, 96 insertions(+), 50 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6bb262b615ff..1c57b5ac2678 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1866,7 +1866,9 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, struct bpf_tramp_link *link, const struct btf_func_model *model, void *stub_func, - void *image, void *image_end); + void **image, u32 *image_off, + bool allow_alloc); +void bpf_struct_ops_image_free(void *image); static inline bool bpf_try_module_get(const void *data, struct module *owner) { if (owner == BPF_MODULE_OWNER) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index c244ed5114fd..cb6deeead404 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -18,6 +18,8 @@ struct bpf_struct_ops_value { char data[] ____cacheline_aligned_in_smp; }; +#define MAX_TRAMP_IMAGE_PAGES 8 + struct bpf_struct_ops_map { struct bpf_map map; struct rcu_head rcu; @@ -30,12 +32,11 @@ struct bpf_struct_ops_map { */ struct bpf_link **links; u32 links_cnt; - /* image is a page that has all the trampolines + u32 image_pages_cnt; + /* image_pages is an array of pages that has all the trampolines * that stores the func args before calling the bpf_prog. - * A PAGE_SIZE "image" is enough to store all trampoline for - * "links[]". */ - void *image; + void *image_pages[MAX_TRAMP_IMAGE_PAGES]; /* The owner moduler's btf. */ struct btf *btf; /* uvalue->data stores the kernel struct @@ -116,6 +117,31 @@ static bool is_valid_value_type(struct btf *btf, s32 value_id, return true; } +static void *bpf_struct_ops_image_alloc(void) +{ + void *image; + int err; + + err = bpf_jit_charge_modmem(PAGE_SIZE); + if (err) + return ERR_PTR(err); + image = arch_alloc_bpf_trampoline(PAGE_SIZE); + if (!image) { + bpf_jit_uncharge_modmem(PAGE_SIZE); + return ERR_PTR(-ENOMEM); + } + + return image; +} + +void bpf_struct_ops_image_free(void *image) +{ + if (image) { + arch_free_bpf_trampoline(image, PAGE_SIZE); + bpf_jit_uncharge_modmem(PAGE_SIZE); + } +} + #define MAYBE_NULL_SUFFIX "__nullable" #define MAX_STUB_NAME 128 @@ -456,6 +482,15 @@ static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map) } } +static void bpf_struct_ops_map_free_image(struct bpf_struct_ops_map *st_map) +{ + int i; + + for (i = 0; i < st_map->image_pages_cnt; i++) + bpf_struct_ops_image_free(st_map->image_pages[i]); + st_map->image_pages_cnt = 0; +} + static int check_zero_holes(const struct btf *btf, const struct btf_type *t, void *data) { const struct btf_member *member; @@ -501,9 +536,12 @@ const struct bpf_link_ops bpf_struct_ops_link_lops = { int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, struct bpf_tramp_link *link, const struct btf_func_model *model, - void *stub_func, void *image, void *image_end) + void *stub_func, + void **_image, u32 *_image_off, + bool allow_alloc) { - u32 flags = BPF_TRAMP_F_INDIRECT; + u32 image_off = *_image_off, flags = BPF_TRAMP_F_INDIRECT; + void *image = *_image; int size; tlinks[BPF_TRAMP_FENTRY].links[0] = link; @@ -513,12 +551,32 @@ int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, flags |= BPF_TRAMP_F_RET_FENTRY_RET; size = arch_bpf_trampoline_size(model, flags, tlinks, NULL); - if (size < 0) - return size; - if (size > (unsigned long)image_end - (unsigned long)image) - return -E2BIG; - return arch_prepare_bpf_trampoline(NULL, image, image_end, + if (size <= 0) + return size ? : -EFAULT; + + /* Allocate image buffer if necessary */ + if (!image || size > PAGE_SIZE - image_off) { + if (!allow_alloc) + return -E2BIG; + + image = bpf_struct_ops_image_alloc(); + if (IS_ERR(image)) + return PTR_ERR(image); + image_off = 0; + } + + size = arch_prepare_bpf_trampoline(NULL, image + image_off, + image + PAGE_SIZE, model, flags, tlinks, stub_func); + if (size <= 0) { + if (image != *_image) + bpf_struct_ops_image_free(image); + return size ? : -EFAULT; + } + + *_image = image; + *_image_off = image_off + size; + return 0; } static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, @@ -534,8 +592,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, struct bpf_tramp_links *tlinks; void *udata, *kdata; int prog_fd, err; - void *image, *image_end; - u32 i; + u32 i, trampoline_start, image_off = 0; + void *cur_image = NULL, *image = NULL; if (flags) return -EINVAL; @@ -573,8 +631,6 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, udata = &uvalue->data; kdata = &kvalue->data; - image = st_map->image; - image_end = st_map->image + PAGE_SIZE; module_type = btf_type_by_id(btf_vmlinux, st_ops_ids[IDX_MODULE_ID]); for_each_member(i, t, member) { @@ -653,15 +709,24 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, &bpf_struct_ops_link_lops, prog); st_map->links[i] = &link->link; + trampoline_start = image_off; err = bpf_struct_ops_prepare_trampoline(tlinks, link, - &st_ops->func_models[i], - *(void **)(st_ops->cfi_stubs + moff), - image, image_end); + &st_ops->func_models[i], + *(void **)(st_ops->cfi_stubs + moff), + &image, &image_off, + st_map->image_pages_cnt < MAX_TRAMP_IMAGE_PAGES); + if (err) + goto reset_unlock; + + if (cur_image != image) { + st_map->image_pages[st_map->image_pages_cnt++] = image; + cur_image = image; + trampoline_start = 0; + } if (err < 0) goto reset_unlock; - *(void **)(kdata + moff) = image + cfi_get_offset(); - image += err; + *(void **)(kdata + moff) = image + trampoline_start + cfi_get_offset(); /* put prog_id to udata */ *(unsigned long *)(udata + moff) = prog->aux->id; @@ -672,10 +737,11 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (err) goto reset_unlock; } + for (i = 0; i < st_map->image_pages_cnt; i++) + arch_protect_bpf_trampoline(st_map->image_pages[i], PAGE_SIZE); if (st_map->map.map_flags & BPF_F_LINK) { err = 0; - arch_protect_bpf_trampoline(st_map->image, PAGE_SIZE); /* Let bpf_link handle registration & unregistration. * * Pair with smp_load_acquire() during lookup_elem(). @@ -684,7 +750,6 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, goto unlock; } - arch_protect_bpf_trampoline(st_map->image, PAGE_SIZE); err = st_ops->reg(kdata); if (likely(!err)) { /* This refcnt increment on the map here after @@ -707,9 +772,9 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, * there was a race in registering the struct_ops (under the same name) to * a sub-system through different struct_ops's maps. */ - arch_unprotect_bpf_trampoline(st_map->image, PAGE_SIZE); reset_unlock: + bpf_struct_ops_map_free_image(st_map); bpf_struct_ops_map_put_progs(st_map); memset(uvalue, 0, map->value_size); memset(kvalue, 0, map->value_size); @@ -776,10 +841,7 @@ static void __bpf_struct_ops_map_free(struct bpf_map *map) if (st_map->links) bpf_struct_ops_map_put_progs(st_map); bpf_map_area_free(st_map->links); - if (st_map->image) { - arch_free_bpf_trampoline(st_map->image, PAGE_SIZE); - bpf_jit_uncharge_modmem(PAGE_SIZE); - } + bpf_struct_ops_map_free_image(st_map); bpf_map_area_free(st_map->uvalue); bpf_map_area_free(st_map); } @@ -889,20 +951,6 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) st_map->st_ops_desc = st_ops_desc; map = &st_map->map; - ret = bpf_jit_charge_modmem(PAGE_SIZE); - if (ret) - goto errout_free; - - st_map->image = arch_alloc_bpf_trampoline(PAGE_SIZE); - if (!st_map->image) { - /* __bpf_struct_ops_map_free() uses st_map->image as flag - * for "charged or not". In this case, we need to unchange - * here. - */ - bpf_jit_uncharge_modmem(PAGE_SIZE); - ret = -ENOMEM; - goto errout_free; - } st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); st_map->links_cnt = btf_type_vlen(t); st_map->links = diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c index 02de71719aed..1b5f812e6972 100644 --- a/net/bpf/bpf_dummy_struct_ops.c +++ b/net/bpf/bpf_dummy_struct_ops.c @@ -91,6 +91,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, struct bpf_tramp_link *link = NULL; void *image = NULL; unsigned int op_idx; + u32 image_off = 0; int prog_ret; s32 type_id; int err; @@ -114,12 +115,6 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, goto out; } - image = arch_alloc_bpf_trampoline(PAGE_SIZE); - if (!image) { - err = -ENOMEM; - goto out; - } - link = kzalloc(sizeof(*link), GFP_USER); if (!link) { err = -ENOMEM; @@ -133,7 +128,8 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, err = bpf_struct_ops_prepare_trampoline(tlinks, link, &st_ops->func_models[op_idx], &dummy_ops_test_ret_function, - image, image + PAGE_SIZE); + &image, &image_off, + true); if (err < 0) goto out; @@ -147,7 +143,7 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, err = -EFAULT; out: kfree(args); - arch_free_bpf_trampoline(image, PAGE_SIZE); + bpf_struct_ops_image_free(image); if (link) bpf_link_put(&link->link); kfree(tlinks); -- 2.34.1
From: Kui-Feng Lee <thinker.li@gmail.com> mainline inclusion from mainline-v6.9-rc1 commit 93bc28d859e57f1a654d3b63600d14c85c5630a4 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Create and load a struct_ops map with a large number of struct_ops programs to generate trampolines taking a size over multiple pages. The map includes 40 programs. Their trampolines takes 6.6k+, more than 1.5 pages, on x86. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240224223418.526631-4-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/bpf_testmod/bpf_testmod.h | 44 ++++++++ .../prog_tests/test_struct_ops_multi_pages.c | 30 ++++++ .../bpf/progs/struct_ops_multi_pages.c | 102 ++++++++++++++++++ 3 files changed, 176 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_multi_pages.c create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_multi_pages.c diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h index 971458acfac3..622d24e29d35 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h @@ -43,6 +43,50 @@ struct bpf_testmod_ops { int b; } unsupported; int data; + + /* The following pointers are used to test the maps having multiple + * pages of trampolines. + */ + int (*tramp_1)(int value); + int (*tramp_2)(int value); + int (*tramp_3)(int value); + int (*tramp_4)(int value); + int (*tramp_5)(int value); + int (*tramp_6)(int value); + int (*tramp_7)(int value); + int (*tramp_8)(int value); + int (*tramp_9)(int value); + int (*tramp_10)(int value); + int (*tramp_11)(int value); + int (*tramp_12)(int value); + int (*tramp_13)(int value); + int (*tramp_14)(int value); + int (*tramp_15)(int value); + int (*tramp_16)(int value); + int (*tramp_17)(int value); + int (*tramp_18)(int value); + int (*tramp_19)(int value); + int (*tramp_20)(int value); + int (*tramp_21)(int value); + int (*tramp_22)(int value); + int (*tramp_23)(int value); + int (*tramp_24)(int value); + int (*tramp_25)(int value); + int (*tramp_26)(int value); + int (*tramp_27)(int value); + int (*tramp_28)(int value); + int (*tramp_29)(int value); + int (*tramp_30)(int value); + int (*tramp_31)(int value); + int (*tramp_32)(int value); + int (*tramp_33)(int value); + int (*tramp_34)(int value); + int (*tramp_35)(int value); + int (*tramp_36)(int value); + int (*tramp_37)(int value); + int (*tramp_38)(int value); + int (*tramp_39)(int value); + int (*tramp_40)(int value); }; #endif /* _BPF_TESTMOD_H */ diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_multi_pages.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_multi_pages.c new file mode 100644 index 000000000000..645d32b5160c --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_multi_pages.c @@ -0,0 +1,30 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include <test_progs.h> + +#include "struct_ops_multi_pages.skel.h" + +static void do_struct_ops_multi_pages(void) +{ + struct struct_ops_multi_pages *skel; + struct bpf_link *link; + + /* The size of all trampolines of skel->maps.multi_pages should be + * over 1 page (at least for x86). + */ + skel = struct_ops_multi_pages__open_and_load(); + if (!ASSERT_OK_PTR(skel, "struct_ops_multi_pages_open_and_load")) + return; + + link = bpf_map__attach_struct_ops(skel->maps.multi_pages); + ASSERT_OK_PTR(link, "attach_multi_pages"); + + bpf_link__destroy(link); + struct_ops_multi_pages__destroy(skel); +} + +void test_struct_ops_multi_pages(void) +{ + if (test__start_subtest("multi_pages")) + do_struct_ops_multi_pages(); +} diff --git a/tools/testing/selftests/bpf/progs/struct_ops_multi_pages.c b/tools/testing/selftests/bpf/progs/struct_ops_multi_pages.c new file mode 100644 index 000000000000..9efcc6e4d356 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/struct_ops_multi_pages.c @@ -0,0 +1,102 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "../bpf_testmod/bpf_testmod.h" + +char _license[] SEC("license") = "GPL"; + +#define TRAMP(x) \ + SEC("struct_ops/tramp_" #x) \ + int BPF_PROG(tramp_ ## x, int a) \ + { \ + return a; \ + } + +TRAMP(1) +TRAMP(2) +TRAMP(3) +TRAMP(4) +TRAMP(5) +TRAMP(6) +TRAMP(7) +TRAMP(8) +TRAMP(9) +TRAMP(10) +TRAMP(11) +TRAMP(12) +TRAMP(13) +TRAMP(14) +TRAMP(15) +TRAMP(16) +TRAMP(17) +TRAMP(18) +TRAMP(19) +TRAMP(20) +TRAMP(21) +TRAMP(22) +TRAMP(23) +TRAMP(24) +TRAMP(25) +TRAMP(26) +TRAMP(27) +TRAMP(28) +TRAMP(29) +TRAMP(30) +TRAMP(31) +TRAMP(32) +TRAMP(33) +TRAMP(34) +TRAMP(35) +TRAMP(36) +TRAMP(37) +TRAMP(38) +TRAMP(39) +TRAMP(40) + +#define F_TRAMP(x) .tramp_ ## x = (void *)tramp_ ## x + +SEC(".struct_ops.link") +struct bpf_testmod_ops multi_pages = { + F_TRAMP(1), + F_TRAMP(2), + F_TRAMP(3), + F_TRAMP(4), + F_TRAMP(5), + F_TRAMP(6), + F_TRAMP(7), + F_TRAMP(8), + F_TRAMP(9), + F_TRAMP(10), + F_TRAMP(11), + F_TRAMP(12), + F_TRAMP(13), + F_TRAMP(14), + F_TRAMP(15), + F_TRAMP(16), + F_TRAMP(17), + F_TRAMP(18), + F_TRAMP(19), + F_TRAMP(20), + F_TRAMP(21), + F_TRAMP(22), + F_TRAMP(23), + F_TRAMP(24), + F_TRAMP(25), + F_TRAMP(26), + F_TRAMP(27), + F_TRAMP(28), + F_TRAMP(29), + F_TRAMP(30), + F_TRAMP(31), + F_TRAMP(32), + F_TRAMP(33), + F_TRAMP(34), + F_TRAMP(35), + F_TRAMP(36), + F_TRAMP(37), + F_TRAMP(38), + F_TRAMP(39), + F_TRAMP(40), +}; -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.9-rc1 commit 66c8473135c62f478301a0e5b3012f203562dfa6 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- prog->aux->sleepable is checked very frequently as part of (some) BPF program run hot paths. So this extra aux indirection seems wasteful and on busy systems might cause unnecessary memory cache misses. Let's move sleepable flag into prog itself to eliminate unnecessary pointer dereference. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Message-ID: <20240309004739.2961431-1-andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: include/linux/bpf.h. kernel/bpf/syscall.c kernel/trace/bpf_trace.c [Fix conflicts because of context diff and CVE patches 12775e7694d47.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 8 ++++---- kernel/bpf/bpf_iter.c | 4 ++-- kernel/bpf/core.c | 2 +- kernel/bpf/syscall.c | 8 ++++---- kernel/bpf/trampoline.c | 4 ++-- kernel/bpf/verifier.c | 12 ++++++------ kernel/events/core.c | 2 +- kernel/trace/bpf_trace.c | 2 +- net/bpf/bpf_dummy_struct_ops.c | 2 +- 9 files changed, 22 insertions(+), 22 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 1c57b5ac2678..61955afeb201 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1528,7 +1528,6 @@ struct bpf_prog_aux { bool offload_requested; /* Program is bound and offloaded to the netdev. */ bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */ bool func_proto_unreliable; - bool sleepable; bool tail_call_reachable; bool xdp_has_frags; /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */ @@ -1622,7 +1621,8 @@ struct bpf_prog { enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */ call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */ call_get_func_ip:1, /* Do we call get_func_ip() */ - tstamp_type_access:1; /* Accessed __sk_buff->tstamp_type */ + tstamp_type_access:1, /* Accessed __sk_buff->tstamp_type */ + sleepable:1; /* BPF program is sleepable */ enum bpf_prog_type type; /* Type of BPF program */ enum bpf_attach_type expected_attach_type; /* For some prog types */ u32 len; /* Number of filter blocks */ @@ -2227,14 +2227,14 @@ bpf_prog_run_array_uprobe(const struct bpf_prog_array *array, old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx); item = &array->items[0]; while ((prog = READ_ONCE(item->prog))) { - if (!prog->aux->sleepable) + if (!prog->sleepable) rcu_read_lock(); run_ctx.bpf_cookie = item->bpf_cookie; ret &= run_prog(prog, ctx); item++; - if (!prog->aux->sleepable) + if (!prog->sleepable) rcu_read_unlock(); } bpf_reset_run_ctx(old_run_ctx); diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 0fae79164187..112581cf97e7 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -548,7 +548,7 @@ int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr, return -ENOENT; /* Only allow sleepable program for resched-able iterator */ - if (prog->aux->sleepable && !bpf_iter_target_support_resched(tinfo)) + if (prog->sleepable && !bpf_iter_target_support_resched(tinfo)) return -EINVAL; link = kzalloc(sizeof(*link), GFP_USER | __GFP_NOWARN); @@ -697,7 +697,7 @@ int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx) struct bpf_run_ctx run_ctx, *old_run_ctx; int ret; - if (prog->aux->sleepable) { + if (prog->sleepable) { rcu_read_lock_trace(); migrate_disable(); might_fault(); diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 50f305a31415..55e2b667b823 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2757,7 +2757,7 @@ void __bpf_free_used_maps(struct bpf_prog_aux *aux, bool sleepable; u32 i; - sleepable = aux->sleepable; + sleepable = aux->prog->sleepable; for (i = 0; i < len; i++) { map = used_maps[i]; if (map->ops->map_poke_untrack) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 55beb18bc371..da12f1ca8399 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2165,7 +2165,7 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred) btf_put(prog->aux->attach_btf); if (deferred) { - if (prog->aux->sleepable) + if (prog->sleepable) call_rcu_tasks_trace(&prog->aux->rcu, __bpf_prog_put_rcu); else call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); @@ -2712,11 +2712,11 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size) } prog->expected_attach_type = attr->expected_attach_type; + prog->sleepable = !!(attr->prog_flags & BPF_F_SLEEPABLE); prog->aux->attach_btf = attach_btf; prog->aux->attach_btf_id = attr->attach_btf_id; prog->aux->dst_prog = dst_prog; prog->aux->dev_bound = !!attr->prog_ifindex; - prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE; prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS; err = security_bpf_prog_alloc(prog->aux); @@ -2937,7 +2937,7 @@ static void bpf_link_free(struct bpf_link *link) bpf_link_free_id(link->id); if (link->prog) { - sleepable = link->prog->aux->sleepable; + sleepable = link->prog->sleepable; /* detach BPF program, clean up used resources */ ops->release(link); } @@ -5477,7 +5477,7 @@ static int bpf_prog_bind_map(union bpf_attr *attr) /* The bpf program will not access the bpf map, but for the sake of * simplicity, increase sleepable_refcnt for sleepable program as well. */ - if (prog->aux->sleepable) + if (prog->sleepable) atomic64_inc(&map->sleepable_refcnt); memcpy(used_maps_new, used_maps_old, sizeof(used_maps_old[0]) * prog->aux->used_map_cnt); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index ba4280057f49..afa5543cbcb1 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -1041,7 +1041,7 @@ void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr) bpf_trampoline_enter_t bpf_trampoline_enter(const struct bpf_prog *prog) { - bool sleepable = prog->aux->sleepable; + bool sleepable = prog->sleepable; if (bpf_prog_check_recur(prog)) return sleepable ? __bpf_prog_enter_sleepable_recur : @@ -1056,7 +1056,7 @@ bpf_trampoline_enter_t bpf_trampoline_enter(const struct bpf_prog *prog) bpf_trampoline_exit_t bpf_trampoline_exit(const struct bpf_prog *prog) { - bool sleepable = prog->aux->sleepable; + bool sleepable = prog->sleepable; if (bpf_prog_check_recur(prog)) return sleepable ? __bpf_prog_exit_sleepable_recur : diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ab1e3e057a87..94a9c03a1d53 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5425,7 +5425,7 @@ static int map_kptr_match_type(struct bpf_verifier_env *env, static bool in_sleepable(struct bpf_verifier_env *env) { - return env->prog->aux->sleepable; + return env->prog->sleepable; } /* The non-sleepable programs and sleepable programs with explicit bpf_rcu_read_lock() @@ -17873,7 +17873,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, return -EINVAL; } - if (prog->aux->sleepable) + if (prog->sleepable) switch (map->map_type) { case BPF_MAP_TYPE_HASH: case BPF_MAP_TYPE_LRU_HASH: @@ -18057,7 +18057,7 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env) return -E2BIG; } - if (env->prog->aux->sleepable) + if (env->prog->sleepable) atomic64_inc(&map->sleepable_refcnt); /* hold the map. If the program is rejected by verifier, * the map will be released by release_maps() or it @@ -20525,7 +20525,7 @@ int bpf_check_attach_target(struct bpf_verifier_log *log, } } - if (prog->aux->sleepable) { + if (prog->sleepable) { ret = -EINVAL; switch (prog->type) { case BPF_PROG_TYPE_TRACING: @@ -20663,14 +20663,14 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) u64 key; if (prog->type == BPF_PROG_TYPE_SYSCALL) { - if (prog->aux->sleepable) + if (prog->sleepable) /* attach_btf_id checked to be zero already */ return 0; verbose(env, "Syscall programs can only be sleepable\n"); return -EINVAL; } - if (prog->aux->sleepable && !can_be_sleepable(prog)) { + if (prog->sleepable && !can_be_sleepable(prog)) { verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, and struct_ops programs can be sleepable\n"); return -EINVAL; } diff --git a/kernel/events/core.c b/kernel/events/core.c index cdd34d6e3dd4..1c9c47207b1c 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10672,7 +10672,7 @@ static int __perf_event_set_bpf_prog(struct perf_event *event, (is_syscall_tp && prog->type != BPF_PROG_TYPE_TRACEPOINT)) return -EINVAL; - if (prog->type == BPF_PROG_TYPE_KPROBE && prog->aux->sleepable && !is_uprobe) + if (prog->type == BPF_PROG_TYPE_KPROBE && prog->sleepable && !is_uprobe) /* only uprobe programs are allowed to be sleepable */ return -EINVAL; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index fe21203dbc31..af5436c68339 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -3111,7 +3111,7 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe, .uprobe = uprobe, }; struct bpf_prog *prog = link->link.prog; - bool sleepable = prog->aux->sleepable; + bool sleepable = prog->sleepable; struct bpf_run_ctx *old_run_ctx; if (link->task && current->mm != link->task->mm) diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c index 1b5f812e6972..de33dc1b0daa 100644 --- a/net/bpf/bpf_dummy_struct_ops.c +++ b/net/bpf/bpf_dummy_struct_ops.c @@ -174,7 +174,7 @@ static int bpf_dummy_ops_check_member(const struct btf_type *t, case offsetof(struct bpf_dummy_ops, test_sleepable): break; default: - if (prog->aux->sleepable) + if (prog->sleepable) return -EINVAL; } -- 2.34.1
From: Luo Gengkun <luogengkun2@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- Fix kabi-breakage by using KABI_FILL_HOLE and KABI_DEPRECATED. Fixes: 62b1bcfc7f25 ("bpf: move sleepable flag from bpf_prog_aux to bpf_prog") Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 61955afeb201..2a5a9aaaf6bd 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1528,6 +1528,7 @@ struct bpf_prog_aux { bool offload_requested; /* Program is bound and offloaded to the netdev. */ bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */ bool func_proto_unreliable; + KABI_DEPRECATE(bool, sleepable) bool tail_call_reachable; bool xdp_has_frags; /* BTF_KIND_FUNC_PROTO for valid attach_btf_id */ @@ -1621,13 +1622,13 @@ struct bpf_prog { enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */ call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */ call_get_func_ip:1, /* Do we call get_func_ip() */ - tstamp_type_access:1, /* Accessed __sk_buff->tstamp_type */ - sleepable:1; /* BPF program is sleepable */ + tstamp_type_access:1; /* Accessed __sk_buff->tstamp_type */ enum bpf_prog_type type; /* Type of BPF program */ enum bpf_attach_type expected_attach_type; /* For some prog types */ u32 len; /* Number of filter blocks */ u32 jited_len; /* Size of jited insns in bytes */ u8 tag[BPF_TAG_SIZE]; + KABI_FILL_HOLE(u8 sleepable:1) /* BPF program is sleepable */ struct bpf_prog_stats __percpu *stats; int __percpu *active; unsigned int (*bpf_func)(const void *ctx, -- 2.34.1
From: Martin KaFai Lau <martin.lau@kernel.org> mainline inclusion from mainline-v6.10-rc1 commit 7f3edd0c72c3f7214f8f28495f2e6466348eb128 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- There is a "if (err)" check earlier, so the "if (err < 0)" check that this patch removing is unnecessary. It was my overlook when making adjustments to the bpf_struct_ops_prepare_trampoline() such that the caller does not have to worry about the new page when the function returns error. Fixes: 187e2af05abe ("bpf: struct_ops supports more than one page for trampolines.") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20240315192112.2825039-1-martin.lau@linux.dev Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/bpf_struct_ops.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index cb6deeead404..d9607514b380 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -723,8 +723,6 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, cur_image = image; trampoline_start = 0; } - if (err < 0) - goto reset_unlock; *(void **)(kdata + moff) = image + trampoline_start + cfi_get_offset(); -- 2.34.1
From: Christophe Leroy <christophe.leroy@csgroup.eu> mainline inclusion from mainline-v6.10-rc1 commit e3362acd796789dc0562eb1a3937007b0beb0c5b category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Last user of arch_unprotect_bpf_trampoline() was removed by commit 187e2af05abe ("bpf: struct_ops supports more than one page for trampolines.") Remove arch_unprotect_bpf_trampoline() Reported-by: Daniel Borkmann <daniel@iogearbox.net> Fixes: 187e2af05abe ("bpf: struct_ops supports more than one page for trampolines.") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/r/42c635bb54d3af91db0f9b85d724c7c290069f67.171057435... Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Conflicts: arch/arm64/net/bpf_jit_comp.c [Fix conflicts because of 96b0f5addc7a ("arm64, bpf: Use bpf_prog_pack for arm64 bpf trampoline") is not merged.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/net/bpf_jit_comp.c | 4 ---- include/linux/bpf.h | 1 - kernel/bpf/trampoline.c | 7 ------- 3 files changed, 12 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 5c020dcec31b..132c9ec9d926 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -2813,10 +2813,6 @@ void arch_protect_bpf_trampoline(void *image, unsigned int size) { } -void arch_unprotect_bpf_trampoline(void *image, unsigned int size) -{ -} - int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 2a5a9aaaf6bd..c67235157b35 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1170,7 +1170,6 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i void *arch_alloc_bpf_trampoline(unsigned int size); void arch_free_bpf_trampoline(void *image, unsigned int size); void arch_protect_bpf_trampoline(void *image, unsigned int size); -void arch_unprotect_bpf_trampoline(void *image, unsigned int size); int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, void *func_addr); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index afa5543cbcb1..d62643031605 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -1105,13 +1105,6 @@ void __weak arch_protect_bpf_trampoline(void *image, unsigned int size) set_memory_rox((long)image, 1); } -void __weak arch_unprotect_bpf_trampoline(void *image, unsigned int size) -{ - WARN_ON_ONCE(size > PAGE_SIZE); - set_memory_nx((long)image, 1); - set_memory_rw((long)image, 1); -} - int __weak arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, void *func_addr) { -- 2.34.1
From: Christophe Leroy <christophe.leroy@csgroup.eu> mainline inclusion from mainline-v6.10-rc1 commit c733239f8f530872a1f80d8c45dcafbaff368737 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- arch_protect_bpf_trampoline() and alloc_new_pack() call set_memory_rox() which can fail, leading to unprotected memory. Take into account return from set_memory_rox() function and add __must_check flag to arch_protect_bpf_trampoline(). Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/fe1c163c83767fde5cab31d209a4a6be3ddb3a73.171057435... Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Conflicts: arch/arm64/net/bpf_jit_comp.c [Fix conflicts because of 96b0f5addc7a ("arm64, bpf: Use bpf_prog_pack for arm64 bpf trampoline") is not merged.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/x86/net/bpf_jit_comp.c | 3 ++- include/linux/bpf.h | 2 +- kernel/bpf/bpf_struct_ops.c | 8 ++++++-- kernel/bpf/core.c | 28 +++++++++++++++++++++------- kernel/bpf/trampoline.c | 8 +++++--- net/bpf/bpf_dummy_struct_ops.c | 4 +++- 6 files changed, 38 insertions(+), 15 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 132c9ec9d926..a07c2224d839 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -2809,8 +2809,9 @@ void arch_free_bpf_trampoline(void *image, unsigned int size) bpf_prog_pack_free(image, size); } -void arch_protect_bpf_trampoline(void *image, unsigned int size) +int arch_protect_bpf_trampoline(void *image, unsigned int size) { + return 0; } int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end, diff --git a/include/linux/bpf.h b/include/linux/bpf.h index c67235157b35..0f39e0e77d1e 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1169,7 +1169,7 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i void *func_addr); void *arch_alloc_bpf_trampoline(unsigned int size); void arch_free_bpf_trampoline(void *image, unsigned int size); -void arch_protect_bpf_trampoline(void *image, unsigned int size); +int __must_check arch_protect_bpf_trampoline(void *image, unsigned int size); int arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, struct bpf_tramp_links *tlinks, void *func_addr); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index d9607514b380..1409a844183c 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -735,8 +735,12 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (err) goto reset_unlock; } - for (i = 0; i < st_map->image_pages_cnt; i++) - arch_protect_bpf_trampoline(st_map->image_pages[i], PAGE_SIZE); + for (i = 0; i < st_map->image_pages_cnt; i++) { + err = arch_protect_bpf_trampoline(st_map->image_pages[i], + PAGE_SIZE); + if (err) + goto reset_unlock; + } if (st_map->map.map_flags & BPF_F_LINK) { err = 0; diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 55e2b667b823..13649240750a 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -908,23 +908,30 @@ static LIST_HEAD(pack_list); static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_insns) { struct bpf_prog_pack *pack; + int err; pack = kzalloc(struct_size(pack, bitmap, BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)), GFP_KERNEL); if (!pack) return NULL; pack->ptr = bpf_jit_alloc_exec(BPF_PROG_PACK_SIZE); - if (!pack->ptr) { - kfree(pack); - return NULL; - } + if (!pack->ptr) + goto out; bpf_fill_ill_insns(pack->ptr, BPF_PROG_PACK_SIZE); bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE); - list_add_tail(&pack->list, &pack_list); set_vm_flush_reset_perms(pack->ptr); - set_memory_rox((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE); + err = set_memory_rox((unsigned long)pack->ptr, + BPF_PROG_PACK_SIZE / PAGE_SIZE); + if (err) + goto out; + list_add_tail(&pack->list, &pack_list); return pack; + +out: + bpf_jit_free_exec(pack->ptr); + kfree(pack); + return NULL; } void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns) @@ -939,9 +946,16 @@ void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns) size = round_up(size, PAGE_SIZE); ptr = bpf_jit_alloc_exec(size); if (ptr) { + int err; + bpf_fill_ill_insns(ptr, size); set_vm_flush_reset_perms(ptr); - set_memory_rox((unsigned long)ptr, size / PAGE_SIZE); + err = set_memory_rox((unsigned long)ptr, + size / PAGE_SIZE); + if (err) { + bpf_jit_free_exec(ptr); + ptr = NULL; + } } goto out; } diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index d62643031605..f28e060f0989 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -456,7 +456,9 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mut if (err < 0) goto out_free; - arch_protect_bpf_trampoline(im->image, im->size); + err = arch_protect_bpf_trampoline(im->image, im->size); + if (err) + goto out_free; WARN_ON(tr->cur_image && total == 0); if (tr->cur_image) @@ -1099,10 +1101,10 @@ void __weak arch_free_bpf_trampoline(void *image, unsigned int size) bpf_jit_free_exec(image); } -void __weak arch_protect_bpf_trampoline(void *image, unsigned int size) +int __weak arch_protect_bpf_trampoline(void *image, unsigned int size) { WARN_ON_ONCE(size > PAGE_SIZE); - set_memory_rox((long)image, 1); + return set_memory_rox((long)image, 1); } int __weak arch_bpf_trampoline_size(const struct btf_func_model *m, u32 flags, diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c index de33dc1b0daa..25b75844891a 100644 --- a/net/bpf/bpf_dummy_struct_ops.c +++ b/net/bpf/bpf_dummy_struct_ops.c @@ -133,7 +133,9 @@ int bpf_struct_ops_test_run(struct bpf_prog *prog, const union bpf_attr *kattr, if (err < 0) goto out; - arch_protect_bpf_trampoline(image, PAGE_SIZE); + err = arch_protect_bpf_trampoline(image, PAGE_SIZE); + if (err) + goto out; prog_ret = dummy_ops_call_op(image, args); err = dummy_ops_copy_args(args); -- 2.34.1
From: Alexander Lobakin <aleksander.lobakin@intel.com> mainline inclusion from mainline-v6.10-rc1 commit 7d8296b250f2eed73f1758607926d4d258dea5d4 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Avoid open-coding that simple expression each time by moving BYTES_TO_BITS() from the probes code to <linux/bitops.h> to export it to the rest of the kernel. Simplify the macro while at it. `BITS_PER_LONG / sizeof(long)` always equals to %BITS_PER_BYTE, regardless of the target architecture. Do the same for the tools ecosystem as well (incl. its version of bitops.h). The previous implementation had its implicit type of long, while the new one is int, so adjust the format literal accordingly in the perf code. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Acked-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bitops.h | 2 ++ kernel/trace/trace_probe.c | 2 -- tools/include/linux/bitops.h | 2 ++ tools/perf/util/probe-finder.c | 4 +--- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/bitops.h b/include/linux/bitops.h index b2342eebc8d2..07729a31198e 100644 --- a/include/linux/bitops.h +++ b/include/linux/bitops.h @@ -20,6 +20,8 @@ #define BITS_TO_U32(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u32)) #define BITS_TO_BYTES(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(char)) +#define BYTES_TO_BITS(nb) ((nb) * BITS_PER_BYTE) + extern unsigned int __sw_hweight8(unsigned int w); extern unsigned int __sw_hweight16(unsigned int w); extern unsigned int __sw_hweight32(unsigned int w); diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 1500c5c060b1..b168d6dd0773 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -1067,8 +1067,6 @@ parse_probe_arg(char *arg, const struct fetch_type *type, return ret; } -#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long)) - /* Bitfield type needs to be parsed into a fetch function */ static int __parse_bitfield_probe_arg(const char *bf, const struct fetch_type *t, diff --git a/tools/include/linux/bitops.h b/tools/include/linux/bitops.h index 7319f6ced108..272f15d0e434 100644 --- a/tools/include/linux/bitops.h +++ b/tools/include/linux/bitops.h @@ -20,6 +20,8 @@ #define BITS_TO_U32(nr) DIV_ROUND_UP(nr, BITS_PER_TYPE(u32)) #define BITS_TO_BYTES(nr) DIV_ROUND_UP(nr, BITS_PER_TYPE(char)) +#define BYTES_TO_BITS(nb) ((nb) * BITS_PER_BYTE) + extern unsigned int __sw_hweight8(unsigned int w); extern unsigned int __sw_hweight16(unsigned int w); extern unsigned int __sw_hweight32(unsigned int w); diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c index c0c8d7f9514b..f9c396fe81dc 100644 --- a/tools/perf/util/probe-finder.c +++ b/tools/perf/util/probe-finder.c @@ -304,8 +304,6 @@ static int convert_variable_location(Dwarf_Die *vr_die, Dwarf_Addr addr, return ret2; } -#define BYTES_TO_BITS(nb) ((nb) * BITS_PER_LONG / sizeof(long)) - static int convert_variable_type(Dwarf_Die *vr_die, struct probe_trace_arg *tvar, const char *cast, bool user_access) @@ -335,7 +333,7 @@ static int convert_variable_type(Dwarf_Die *vr_die, total = dwarf_bytesize(vr_die); if (boffs < 0 || total < 0) return -ENOENT; - ret = snprintf(buf, 16, "b%d@%d/%zd", bsize, boffs, + ret = snprintf(buf, 16, "b%d@%d/%d", bsize, boffs, BYTES_TO_BITS(total)); goto formatted; } -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.10-rc1 commit fcd1ed89a0439c45e1336bd9649485c44b7597c7 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The btf_features list can be used for pahole v1.26 and later - it is useful because if a feature is not yet implemented it will not exit with a failure message. This will allow us to add feature requests to the pahole options without having to check pahole versions in future; if the version of pahole supports the feature it will be added. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240507135514.490467-1-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- scripts/Makefile.btf | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/scripts/Makefile.btf b/scripts/Makefile.btf index 82377e470aed..2d6e5ed9081e 100644 --- a/scripts/Makefile.btf +++ b/scripts/Makefile.btf @@ -3,6 +3,8 @@ pahole-ver := $(CONFIG_PAHOLE_VERSION) pahole-flags-y := +ifeq ($(call test-le, $(pahole-ver), 125),y) + # pahole 1.18 through 1.21 can't handle zero-sized per-CPU vars ifeq ($(call test-le, $(pahole-ver), 121),y) pahole-flags-$(call test-ge, $(pahole-ver), 118) += --skip_encoding_btf_vars @@ -12,8 +14,17 @@ pahole-flags-$(call test-ge, $(pahole-ver), 121) += --btf_gen_floats pahole-flags-$(call test-ge, $(pahole-ver), 122) += -j -pahole-flags-$(CONFIG_PAHOLE_HAS_LANG_EXCLUDE) += --lang_exclude=rust +ifeq ($(pahole-ver), 125) +pahole-flags-y += --skip_encoding_btf_inconsistent_proto --btf_gen_optimized +endif + +else -pahole-flags-$(call test-ge, $(pahole-ver), 125) += --skip_encoding_btf_inconsistent_proto --btf_gen_optimized +# Switch to using --btf_features for v1.26 and later. +pahole-flags-$(call test-ge, $(pahole-ver), 126) = -j --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func + +endif + +pahole-flags-$(CONFIG_PAHOLE_HAS_LANG_EXCLUDE) += --lang_exclude=rust export PAHOLE_FLAGS := $(pahole-flags-y) -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 34021caef79f76e70ac31247d321ecd0683c4939 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- There is no need to set the pahole v1.25-only flags in an "ifeq" version clause; we are already in a <= v1.25 branch of "ifeq", so that combined with a "test-ge" v1.25 ensures the flags will be applied for v1.25 only. Suggested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240514162716.2448265-1-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- scripts/Makefile.btf | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/scripts/Makefile.btf b/scripts/Makefile.btf index 2d6e5ed9081e..bca8a8f26ea4 100644 --- a/scripts/Makefile.btf +++ b/scripts/Makefile.btf @@ -14,9 +14,7 @@ pahole-flags-$(call test-ge, $(pahole-ver), 121) += --btf_gen_floats pahole-flags-$(call test-ge, $(pahole-ver), 122) += -j -ifeq ($(pahole-ver), 125) -pahole-flags-y += --skip_encoding_btf_inconsistent_proto --btf_gen_optimized -endif +pahole-flags-$(call test-ge, $(pahole-ver), 125) += --skip_encoding_btf_inconsistent_proto --btf_gen_optimized else -- 2.34.1
From: Yafang Shao <laoar.shao@gmail.com> mainline inclusion from mainline-v6.11-rc1 commit 4665415975b0827e9646cab91c61d02a6b364d59 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Add three new kfuncs for the bits iterator: - bpf_iter_bits_new Initialize a new bits iterator for a given memory area. Due to the limitation of bpf memalloc, the max number of words (8-byte units) that can be iterated over is limited to (4096 / 8). - bpf_iter_bits_next Get the next bit in a bpf_iter_bits - bpf_iter_bits_destroy Destroy a bpf_iter_bits The bits iterator facilitates the iteration of the bits of a memory area, such as cpumask. It can be used in any context and on any address. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240517023034.48138-2-laoar.shao@gmail.com Conflicts: kernel/bpf/helpers.c [Fix conflicts because of context diff.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 119 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 7b79779126a0..364498c643de 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2625,6 +2625,122 @@ __bpf_kfunc void bpf_rcu_read_unlock(void) rcu_read_unlock(); } +struct bpf_iter_bits { + __u64 __opaque[2]; +} __aligned(8); + +struct bpf_iter_bits_kern { + union { + unsigned long *bits; + unsigned long bits_copy; + }; + u32 nr_bits; + int bit; +} __aligned(8); + +/** + * bpf_iter_bits_new() - Initialize a new bits iterator for a given memory area + * @it: The new bpf_iter_bits to be created + * @unsafe_ptr__ign: A pointer pointing to a memory area to be iterated over + * @nr_words: The size of the specified memory area, measured in 8-byte units. + * Due to the limitation of memalloc, it can't be greater than 512. + * + * This function initializes a new bpf_iter_bits structure for iterating over + * a memory area which is specified by the @unsafe_ptr__ign and @nr_words. It + * copies the data of the memory area to the newly created bpf_iter_bits @it for + * subsequent iteration operations. + * + * On success, 0 is returned. On failure, ERR is returned. + */ +__bpf_kfunc int +bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_words) +{ + struct bpf_iter_bits_kern *kit = (void *)it; + u32 nr_bytes = nr_words * sizeof(u64); + u32 nr_bits = BYTES_TO_BITS(nr_bytes); + int err; + + BUILD_BUG_ON(sizeof(struct bpf_iter_bits_kern) != sizeof(struct bpf_iter_bits)); + BUILD_BUG_ON(__alignof__(struct bpf_iter_bits_kern) != + __alignof__(struct bpf_iter_bits)); + + kit->nr_bits = 0; + kit->bits_copy = 0; + kit->bit = -1; + + if (!unsafe_ptr__ign || !nr_words) + return -EINVAL; + + /* Optimization for u64 mask */ + if (nr_bits == 64) { + err = bpf_probe_read_kernel_common(&kit->bits_copy, nr_bytes, unsafe_ptr__ign); + if (err) + return -EFAULT; + + kit->nr_bits = nr_bits; + return 0; + } + + /* Fallback to memalloc */ + kit->bits = bpf_mem_alloc(&bpf_global_ma, nr_bytes); + if (!kit->bits) + return -ENOMEM; + + err = bpf_probe_read_kernel_common(kit->bits, nr_bytes, unsafe_ptr__ign); + if (err) { + bpf_mem_free(&bpf_global_ma, kit->bits); + return err; + } + + kit->nr_bits = nr_bits; + return 0; +} + +/** + * bpf_iter_bits_next() - Get the next bit in a bpf_iter_bits + * @it: The bpf_iter_bits to be checked + * + * This function returns a pointer to a number representing the value of the + * next bit in the bits. + * + * If there are no further bits available, it returns NULL. + */ +__bpf_kfunc int *bpf_iter_bits_next(struct bpf_iter_bits *it) +{ + struct bpf_iter_bits_kern *kit = (void *)it; + u32 nr_bits = kit->nr_bits; + const unsigned long *bits; + int bit; + + if (nr_bits == 0) + return NULL; + + bits = nr_bits == 64 ? &kit->bits_copy : kit->bits; + bit = find_next_bit(bits, nr_bits, kit->bit + 1); + if (bit >= nr_bits) { + kit->nr_bits = 0; + return NULL; + } + + kit->bit = bit; + return &kit->bit; +} + +/** + * bpf_iter_bits_destroy() - Destroy a bpf_iter_bits + * @it: The bpf_iter_bits to be destroyed + * + * Destroy the resource associated with the bpf_iter_bits. + */ +__bpf_kfunc void bpf_iter_bits_destroy(struct bpf_iter_bits *it) +{ + struct bpf_iter_bits_kern *kit = (void *)it; + + if (kit->nr_bits <= 64) + return; + bpf_mem_free(&bpf_global_ma, kit->bits); +} + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(generic_btf_ids) @@ -2700,6 +2816,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_null) BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly) BTF_ID_FLAGS(func, bpf_dynptr_size) BTF_ID_FLAGS(func, bpf_dynptr_clone) +BTF_ID_FLAGS(func, bpf_iter_bits_new, KF_ITER_NEW) +BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY) BTF_KFUNCS_END(common_btf_ids) static const struct btf_kfunc_id_set common_kfunc_set = { -- 2.34.1
From: Yafang Shao <laoar.shao@gmail.com> mainline inclusion from mainline-v6.11-rc1 commit 6ba7acdb93b4ecb554d5838fca3f5f0fcf9fff14 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Add test cases for the bits iter: - Positive cases - Bit mask representing a single word (8-byte unit) - Bit mask representing data spanning more than one word - The index of the set bit - Nagative cases - bpf_iter_bits_destroy() is required after calling bpf_iter_bits_new() - bpf_iter_bits_destroy() can only destroy an initialized iter - bpf_iter_bits_next() must use an initialized iter - Bit mask representing zero words - Bit mask representing fewer words than expected - Case for ENOMEM - Case for NULL pointer Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240517023034.48138-3-laoar.shao@gmail.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/verifier.c | 2 + .../selftests/bpf/progs/verifier_bits_iter.c | 153 ++++++++++++++++++ 2 files changed, 155 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c index 8d746642cbd7..6de0aeebbb4a 100644 --- a/tools/testing/selftests/bpf/prog_tests/verifier.c +++ b/tools/testing/selftests/bpf/prog_tests/verifier.c @@ -79,6 +79,7 @@ #include "verifier_xadd.skel.h" #include "verifier_xdp.skel.h" #include "verifier_xdp_direct_packet_access.skel.h" +#include "verifier_bits_iter.skel.h" #define MAX_ENTRIES 11 @@ -188,6 +189,7 @@ void test_verifier_var_off(void) { RUN(verifier_var_off); } void test_verifier_xadd(void) { RUN(verifier_xadd); } void test_verifier_xdp(void) { RUN(verifier_xdp); } void test_verifier_xdp_direct_packet_access(void) { RUN(verifier_xdp_direct_packet_access); } +void test_verifier_bits_iter(void) { RUN(verifier_bits_iter); } static int init_test_val_map(struct bpf_object *obj, char *map_name) { diff --git a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c new file mode 100644 index 000000000000..716113c2bce2 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c @@ -0,0 +1,153 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2024 Yafang Shao <laoar.shao@gmail.com> */ + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +#include "bpf_misc.h" +#include "task_kfunc_common.h" + +char _license[] SEC("license") = "GPL"; + +int bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, + u32 nr_bits) __ksym __weak; +int *bpf_iter_bits_next(struct bpf_iter_bits *it) __ksym __weak; +void bpf_iter_bits_destroy(struct bpf_iter_bits *it) __ksym __weak; + +SEC("iter.s/cgroup") +__description("bits iter without destroy") +__failure __msg("Unreleased reference") +int BPF_PROG(no_destroy, struct bpf_iter_meta *meta, struct cgroup *cgrp) +{ + struct bpf_iter_bits it; + u64 data = 1; + + bpf_iter_bits_new(&it, &data, 1); + bpf_iter_bits_next(&it); + return 0; +} + +SEC("iter/cgroup") +__description("uninitialized iter in ->next()") +__failure __msg("expected an initialized iter_bits as arg #1") +int BPF_PROG(next_uninit, struct bpf_iter_meta *meta, struct cgroup *cgrp) +{ + struct bpf_iter_bits *it = NULL; + + bpf_iter_bits_next(it); + return 0; +} + +SEC("iter/cgroup") +__description("uninitialized iter in ->destroy()") +__failure __msg("expected an initialized iter_bits as arg #1") +int BPF_PROG(destroy_uninit, struct bpf_iter_meta *meta, struct cgroup *cgrp) +{ + struct bpf_iter_bits it = {}; + + bpf_iter_bits_destroy(&it); + return 0; +} + +SEC("syscall") +__description("null pointer") +__success __retval(0) +int null_pointer(void) +{ + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, NULL, 1) + nr++; + return nr; +} + +SEC("syscall") +__description("bits copy") +__success __retval(10) +int bits_copy(void) +{ + u64 data = 0xf7310UL; /* 4 + 3 + 2 + 1 + 0*/ + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, &data, 1) + nr++; + return nr; +} + +SEC("syscall") +__description("bits memalloc") +__success __retval(64) +int bits_memalloc(void) +{ + u64 data[2]; + int nr = 0; + int *bit; + + __builtin_memset(&data, 0xf0, sizeof(data)); /* 4 * 16 */ + bpf_for_each(bits, bit, &data[0], sizeof(data) / sizeof(u64)) + nr++; + return nr; +} + +SEC("syscall") +__description("bit index") +__success __retval(8) +int bit_index(void) +{ + u64 data = 0x100; + int bit_idx = 0; + int *bit; + + bpf_for_each(bits, bit, &data, 1) { + if (*bit == 0) + continue; + bit_idx = *bit; + } + return bit_idx; +} + +SEC("syscall") +__description("bits nomem") +__success __retval(0) +int bits_nomem(void) +{ + u64 data[4]; + int nr = 0; + int *bit; + + __builtin_memset(&data, 0xff, sizeof(data)); + bpf_for_each(bits, bit, &data[0], 513) /* Be greater than 512 */ + nr++; + return nr; +} + +SEC("syscall") +__description("fewer words") +__success __retval(1) +int fewer_words(void) +{ + u64 data[2] = {0x1, 0xff}; + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, &data[0], 1) + nr++; + return nr; +} + +SEC("syscall") +__description("zero words") +__success __retval(0) +int zero_words(void) +{ + u64 data[2] = {0x1, 0xff}; + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, &data[0], 0) + nr++; + return nr; +} -- 2.34.1
From: Luo Gengkun <luogengkun2@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- The original commit is came from 80a4129fcf20 ("selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages"). This commit fix test_loader that didn't support running bpf_prog_type_syscall programs. Fixes: 54b0c6cdfd3b ("selftests/bpf: Add selftest for bits iter") Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/testing/selftests/bpf/test_loader.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/test_loader.c b/tools/testing/selftests/bpf/test_loader.c index b4edd8454934..5cbd0ab1a4c9 100644 --- a/tools/testing/selftests/bpf/test_loader.c +++ b/tools/testing/selftests/bpf/test_loader.c @@ -480,7 +480,7 @@ static bool is_unpriv_capable_map(struct bpf_map *map) } } -static int do_prog_test_run(int fd_prog, int *retval) +static int do_prog_test_run(int fd_prog, int *retval, bool empty_opts) { __u8 tmp_out[TEST_DATA_LEN << 2] = {}; __u8 tmp_in[TEST_DATA_LEN] = {}; @@ -493,6 +493,11 @@ static int do_prog_test_run(int fd_prog, int *retval) .repeat = 1, ); + if (empty_opts) { + memset(&topts, 0, sizeof(struct bpf_test_run_opts)); + topts.sz = sizeof(struct bpf_test_run_opts); + } + err = bpf_prog_test_run_opts(fd_prog, &topts); saved_errno = errno; @@ -625,7 +630,8 @@ void run_subtest(struct test_loader *tester, } } - do_prog_test_run(bpf_program__fd(tprog), &retval); + do_prog_test_run(bpf_program__fd(tprog), &retval, + bpf_program__type(tprog) == BPF_PROG_TYPE_SYSCALL ? true : false); if (retval != subspec->retval && subspec->retval != POINTER_VALUE) { PRINT_FAIL("Unexpected retval: %d != %d\n", retval, subspec->retval); goto tobj_cleanup; -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.11-rc1 commit 68153bb2fffbe59804370e514482f95c4b2053ff category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Implement iterator-based type ID and string offset BTF field iterator. This is used extensively in BTF-handling code and BPF linker code for various sanity checks, rewriting IDs/offsets, etc. Currently this is implemented as visitor pattern calling custom callbacks, which makes the logic (especially in simple cases) unnecessarily obscure and harder to follow. Having equivalent functionality using iterator pattern makes for simpler to understand and maintain code. As we add more code for BTF processing logic in libbpf, it's best to switch to iterator pattern before adding more callback-based code. The idea for iterator-based implementation is to record offsets of necessary fields within fixed btf_type parts (which should be iterated just once), and, for kinds that have multiple members (based on vlen field), record where in each member necessary fields are located. Generic iteration code then just keeps track of last offset that was returned and handles N members correctly. Return type is just u32 pointer, where NULL is returned when all relevant fields were already iterated. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-2-andrii@kernel.org Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 162 ++++++++++++++++++++++++++++++++ tools/lib/bpf/libbpf_internal.h | 24 +++++ 2 files changed, 186 insertions(+) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 2fe479fa0c77..daf28445b9f0 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -4965,6 +4965,168 @@ int btf_type_visit_str_offs(struct btf_type *t, str_off_visit_fn visit, void *ct return 0; } +int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, enum btf_field_iter_kind iter_kind) +{ + it->p = NULL; + it->m_idx = -1; + it->off_idx = 0; + it->vlen = 0; + + switch (iter_kind) { + case BTF_FIELD_ITER_IDS: + switch (btf_kind(t)) { + case BTF_KIND_UNKN: + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_ENUM: + case BTF_KIND_ENUM64: + it->desc = (struct btf_field_desc) {}; + break; + case BTF_KIND_FWD: + case BTF_KIND_CONST: + case BTF_KIND_VOLATILE: + case BTF_KIND_RESTRICT: + case BTF_KIND_PTR: + case BTF_KIND_TYPEDEF: + case BTF_KIND_FUNC: + case BTF_KIND_VAR: + case BTF_KIND_DECL_TAG: + case BTF_KIND_TYPE_TAG: + it->desc = (struct btf_field_desc) { 1, {offsetof(struct btf_type, type)} }; + break; + case BTF_KIND_ARRAY: + it->desc = (struct btf_field_desc) { + 2, {sizeof(struct btf_type) + offsetof(struct btf_array, type), + sizeof(struct btf_type) + offsetof(struct btf_array, index_type)} + }; + break; + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + it->desc = (struct btf_field_desc) { + 0, {}, + sizeof(struct btf_member), + 1, {offsetof(struct btf_member, type)} + }; + break; + case BTF_KIND_FUNC_PROTO: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, type)}, + sizeof(struct btf_param), + 1, {offsetof(struct btf_param, type)} + }; + break; + case BTF_KIND_DATASEC: + it->desc = (struct btf_field_desc) { + 0, {}, + sizeof(struct btf_var_secinfo), + 1, {offsetof(struct btf_var_secinfo, type)} + }; + break; + default: + return -EINVAL; + } + break; + case BTF_FIELD_ITER_STRS: + switch (btf_kind(t)) { + case BTF_KIND_UNKN: + it->desc = (struct btf_field_desc) {}; + break; + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_FWD: + case BTF_KIND_ARRAY: + case BTF_KIND_CONST: + case BTF_KIND_VOLATILE: + case BTF_KIND_RESTRICT: + case BTF_KIND_PTR: + case BTF_KIND_TYPEDEF: + case BTF_KIND_FUNC: + case BTF_KIND_VAR: + case BTF_KIND_DECL_TAG: + case BTF_KIND_TYPE_TAG: + case BTF_KIND_DATASEC: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)} + }; + break; + case BTF_KIND_ENUM: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_enum), + 1, {offsetof(struct btf_enum, name_off)} + }; + break; + case BTF_KIND_ENUM64: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_enum64), + 1, {offsetof(struct btf_enum64, name_off)} + }; + break; + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_member), + 1, {offsetof(struct btf_member, name_off)} + }; + break; + case BTF_KIND_FUNC_PROTO: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_param), + 1, {offsetof(struct btf_param, name_off)} + }; + break; + default: + return -EINVAL; + } + break; + default: + return -EINVAL; + } + + if (it->desc.m_sz) + it->vlen = btf_vlen(t); + + it->p = t; + return 0; +} + +__u32 *btf_field_iter_next(struct btf_field_iter *it) +{ + if (!it->p) + return NULL; + + if (it->m_idx < 0) { + if (it->off_idx < it->desc.t_cnt) + return it->p + it->desc.t_offs[it->off_idx++]; + /* move to per-member iteration */ + it->m_idx = 0; + it->p += sizeof(struct btf_type); + it->off_idx = 0; + } + + /* if type doesn't have members, stop */ + if (it->desc.m_sz == 0) { + it->p = NULL; + return NULL; + } + + if (it->off_idx >= it->desc.m_cnt) { + /* exhausted this member's fields, go to the next member */ + it->m_idx++; + it->p += it->desc.m_sz; + it->off_idx = 0; + } + + if (it->m_idx < it->vlen) + return it->p + it->desc.m_offs[it->off_idx++]; + + it->p = NULL; + return NULL; +} + int btf_ext_visit_type_ids(struct btf_ext *btf_ext, type_id_visit_fn visit, void *ctx) { const struct btf_ext_info *seg; diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 57dec645d687..193a04a1e0d2 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -486,6 +486,30 @@ struct bpf_line_info_min { __u32 line_col; }; +enum btf_field_iter_kind { + BTF_FIELD_ITER_IDS, + BTF_FIELD_ITER_STRS, +}; + +struct btf_field_desc { + /* once-per-type offsets */ + int t_cnt, t_offs[2]; + /* member struct size, or zero, if no members */ + int m_sz; + /* repeated per-member offsets */ + int m_cnt, m_offs[1]; +}; + +struct btf_field_iter { + struct btf_field_desc desc; + void *p; + int m_idx; + int off_idx; + int vlen; +}; + +int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, enum btf_field_iter_kind iter_kind); +__u32 *btf_field_iter_next(struct btf_field_iter *it); typedef int (*type_id_visit_fn)(__u32 *type_id, void *ctx); typedef int (*str_off_visit_fn)(__u32 *str_off, void *ctx); -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.11-rc1 commit 2bce2c1cb2f0acbf619737a10575f99df0c43984 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Switch all BPF linker code dealing with iterating BTF type ID and string offset fields to new btf_field_iter facilities. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-3-andrii@kernel.org Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 4 +-- tools/lib/bpf/libbpf_internal.h | 4 +-- tools/lib/bpf/linker.c | 58 ++++++++++++++++++++------------- 3 files changed, 40 insertions(+), 26 deletions(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index daf28445b9f0..64e680331e83 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -5099,7 +5099,7 @@ __u32 *btf_field_iter_next(struct btf_field_iter *it) return NULL; if (it->m_idx < 0) { - if (it->off_idx < it->desc.t_cnt) + if (it->off_idx < it->desc.t_off_cnt) return it->p + it->desc.t_offs[it->off_idx++]; /* move to per-member iteration */ it->m_idx = 0; @@ -5113,7 +5113,7 @@ __u32 *btf_field_iter_next(struct btf_field_iter *it) return NULL; } - if (it->off_idx >= it->desc.m_cnt) { + if (it->off_idx >= it->desc.m_off_cnt) { /* exhausted this member's fields, go to the next member */ it->m_idx++; it->p += it->desc.m_sz; diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 193a04a1e0d2..32a961e7d389 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -493,11 +493,11 @@ enum btf_field_iter_kind { struct btf_field_desc { /* once-per-type offsets */ - int t_cnt, t_offs[2]; + int t_off_cnt, t_offs[2]; /* member struct size, or zero, if no members */ int m_sz; /* repeated per-member offsets */ - int m_cnt, m_offs[1]; + int m_off_cnt, m_offs[1]; }; struct btf_field_iter { diff --git a/tools/lib/bpf/linker.c b/tools/lib/bpf/linker.c index e1b3136643aa..7a2c13db6f9d 100644 --- a/tools/lib/bpf/linker.c +++ b/tools/lib/bpf/linker.c @@ -934,19 +934,33 @@ static int check_btf_str_off(__u32 *str_off, void *ctx) static int linker_sanity_check_btf(struct src_obj *obj) { struct btf_type *t; - int i, n, err = 0; + int i, n, err; if (!obj->btf) return 0; n = btf__type_cnt(obj->btf); for (i = 1; i < n; i++) { + struct btf_field_iter it; + __u32 *type_id, *str_off; + t = btf_type_by_id(obj->btf, i); - err = err ?: btf_type_visit_type_ids(t, check_btf_type_id, obj->btf); - err = err ?: btf_type_visit_str_offs(t, check_btf_str_off, obj->btf); + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_IDS); if (err) return err; + while ((type_id = btf_field_iter_next(&it))) { + if (*type_id >= n) + return -EINVAL; + } + + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_STRS); + if (err) + return err; + while ((str_off = btf_field_iter_next(&it))) { + if (!btf__str_by_offset(obj->btf, *str_off)) + return -EINVAL; + } } return 0; @@ -2218,26 +2232,10 @@ static int linker_fixup_btf(struct src_obj *obj) return 0; } -static int remap_type_id(__u32 *type_id, void *ctx) -{ - int *id_map = ctx; - int new_id = id_map[*type_id]; - - /* Error out if the type wasn't remapped. Ignore VOID which stays VOID. */ - if (new_id == 0 && *type_id != 0) { - pr_warn("failed to find new ID mapping for original BTF type ID %u\n", *type_id); - return -EINVAL; - } - - *type_id = id_map[*type_id]; - - return 0; -} - static int linker_append_btf(struct bpf_linker *linker, struct src_obj *obj) { const struct btf_type *t; - int i, j, n, start_id, id; + int i, j, n, start_id, id, err; const char *name; if (!obj->btf) @@ -2308,9 +2306,25 @@ static int linker_append_btf(struct bpf_linker *linker, struct src_obj *obj) n = btf__type_cnt(linker->btf); for (i = start_id; i < n; i++) { struct btf_type *dst_t = btf_type_by_id(linker->btf, i); + struct btf_field_iter it; + __u32 *type_id; - if (btf_type_visit_type_ids(dst_t, remap_type_id, obj->btf_type_map)) - return -EINVAL; + err = btf_field_iter_init(&it, dst_t, BTF_FIELD_ITER_IDS); + if (err) + return err; + + while ((type_id = btf_field_iter_next(&it))) { + int new_id = obj->btf_type_map[*type_id]; + + /* Error out if the type wasn't remapped. Ignore VOID which stays VOID. */ + if (new_id == 0 && *type_id != 0) { + pr_warn("failed to find new ID mapping for original BTF type ID %u\n", + *type_id); + return -EINVAL; + } + + *type_id = obj->btf_type_map[*type_id]; + } } /* Rewrite VAR/FUNC underlying types (i.e., FUNC's FUNC_PROTO and VAR's -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.11-rc1 commit c2641123696b572a3b059e1b45777317ba9f9086 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Use new BTF field iterator logic to replace all the callback-based visitor calls. There is still a .BTF.ext callback-based visitor APIs that should be converted, which will happens as a follow up. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-4-andrii@kernel.org Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 76 ++++++++++++++++++++++++++++++++------------- 1 file changed, 54 insertions(+), 22 deletions(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 64e680331e83..08b0c8efb29e 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -1573,9 +1573,8 @@ struct btf_pipe { struct hashmap *str_off_map; /* map string offsets from src to dst */ }; -static int btf_rewrite_str(__u32 *str_off, void *ctx) +static int btf_rewrite_str(struct btf_pipe *p, __u32 *str_off) { - struct btf_pipe *p = ctx; long mapped_off; int off, err; @@ -1608,7 +1607,9 @@ static int btf_rewrite_str(__u32 *str_off, void *ctx) int btf__add_type(struct btf *btf, const struct btf *src_btf, const struct btf_type *src_type) { struct btf_pipe p = { .src = src_btf, .dst = btf }; + struct btf_field_iter it; struct btf_type *t; + __u32 *str_off; int sz, err; sz = btf_type_size(src_type); @@ -1625,26 +1626,17 @@ int btf__add_type(struct btf *btf, const struct btf *src_btf, const struct btf_t memcpy(t, src_type, sz); - err = btf_type_visit_str_offs(t, btf_rewrite_str, &p); + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_STRS); if (err) return libbpf_err(err); - return btf_commit_type(btf, sz); -} - -static int btf_rewrite_type_ids(__u32 *type_id, void *ctx) -{ - struct btf *btf = ctx; - - if (!*type_id) /* nothing to do for VOID references */ - return 0; + while ((str_off = btf_field_iter_next(&it))) { + err = btf_rewrite_str(&p, str_off); + if (err) + return libbpf_err(err); + } - /* we haven't updated btf's type count yet, so - * btf->start_id + btf->nr_types - 1 is the type ID offset we should - * add to all newly added BTF types - */ - *type_id += btf->start_id + btf->nr_types - 1; - return 0; + return btf_commit_type(btf, sz); } static size_t btf_dedup_identity_hash_fn(long key, void *ctx); @@ -1692,6 +1684,9 @@ int btf__add_btf(struct btf *btf, const struct btf *src_btf) memcpy(t, src_btf->types_data, data_sz); for (i = 0; i < cnt; i++) { + struct btf_field_iter it; + __u32 *type_id, *str_off; + sz = btf_type_size(t); if (sz < 0) { /* unlikely, has to be corrupted src_btf */ @@ -1703,15 +1698,31 @@ int btf__add_btf(struct btf *btf, const struct btf *src_btf) *off = t - btf->types_data; /* add, dedup, and remap strings referenced by this BTF type */ - err = btf_type_visit_str_offs(t, btf_rewrite_str, &p); + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_STRS); if (err) goto err_out; + while ((str_off = btf_field_iter_next(&it))) { + err = btf_rewrite_str(&p, str_off); + if (err) + goto err_out; + } /* remap all type IDs referenced from this BTF type */ - err = btf_type_visit_type_ids(t, btf_rewrite_type_ids, btf); + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_IDS); if (err) goto err_out; + while ((type_id = btf_field_iter_next(&it))) { + if (!*type_id) /* nothing to do for VOID references */ + continue; + + /* we haven't updated btf's type count yet, so + * btf->start_id + btf->nr_types - 1 is the type ID offset we should + * add to all newly added BTF types + */ + *type_id += btf->start_id + btf->nr_types - 1; + } + /* go to next type data and type offset index entry */ t += sz; off++; @@ -3283,11 +3294,19 @@ static int btf_for_each_str_off(struct btf_dedup *d, str_off_visit_fn fn, void * int i, r; for (i = 0; i < d->btf->nr_types; i++) { + struct btf_field_iter it; struct btf_type *t = btf_type_by_id(d->btf, d->btf->start_id + i); + __u32 *str_off; - r = btf_type_visit_str_offs(t, fn, ctx); + r = btf_field_iter_init(&it, t, BTF_FIELD_ITER_STRS); if (r) return r; + + while ((str_off = btf_field_iter_next(&it))) { + r = fn(str_off, ctx); + if (r) + return r; + } } if (!d->btf_ext) @@ -4765,10 +4784,23 @@ static int btf_dedup_remap_types(struct btf_dedup *d) for (i = 0; i < d->btf->nr_types; i++) { struct btf_type *t = btf_type_by_id(d->btf, d->btf->start_id + i); + struct btf_field_iter it; + __u32 *type_id; - r = btf_type_visit_type_ids(t, btf_dedup_remap_type_id, d); + r = btf_field_iter_init(&it, t, BTF_FIELD_ITER_IDS); if (r) return r; + + while ((type_id = btf_field_iter_next(&it))) { + __u32 resolved_id, new_id; + + resolved_id = resolve_type_id(d, *type_id); + new_id = d->hypot_map[resolved_id]; + if (new_id > BTF_MAX_NR_TYPES) + return -EINVAL; + + *type_id = new_id; + } } if (!d->btf_ext) -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.11-rc1 commit e1a8630291fde2a0edac2955e3df48587dac9906 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Switch bpftool's code which is using libbpf-internal btf_type_visit_type_ids() helper to new btf_field_iter functionality. This makes bpftool code simpler, but also unblocks removing libbpf's btf_type_visit_type_ids() helper completely. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Quentin Monnet <qmo@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-5-andrii@kernel.org Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/bpf/bpftool/gen.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c index 34515fbe9414..c2005feda012 100644 --- a/tools/bpf/bpftool/gen.c +++ b/tools/bpf/bpftool/gen.c @@ -2359,15 +2359,6 @@ static int btfgen_record_obj(struct btfgen_info *info, const char *obj_path) return err; } -static int btfgen_remap_id(__u32 *type_id, void *ctx) -{ - unsigned int *ids = ctx; - - *type_id = ids[*type_id]; - - return 0; -} - /* Generate BTF from relocation information previously recorded */ static struct btf *btfgen_get_btf(struct btfgen_info *info) { @@ -2447,10 +2438,15 @@ static struct btf *btfgen_get_btf(struct btfgen_info *info) /* second pass: fix up type ids */ for (i = 1; i < btf__type_cnt(btf_new); i++) { struct btf_type *btf_type = (struct btf_type *) btf__type_by_id(btf_new, i); + struct btf_field_iter it; + __u32 *type_id; - err = btf_type_visit_type_ids(btf_type, btfgen_remap_id, ids); + err = btf_field_iter_init(&it, btf_type, BTF_FIELD_ITER_IDS); if (err) goto err_out; + + while ((type_id = btf_field_iter_next(&it))) + *type_id = ids[*type_id]; } free(ids); -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.11-rc1 commit 072088704433f75dacf9e33179dd7a81f0a238d4 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Now that all libbpf/bpftool code switched to btf_field_iter, remove btf_type_visit_type_ids() and btf_type_visit_str_offs() callback-based helpers as not needed anymore. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240605001629.4061937-6-andrii@kernel.org Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 130 -------------------------------- tools/lib/bpf/libbpf_internal.h | 2 - 2 files changed, 132 deletions(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 08b0c8efb29e..d97991fead93 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -4867,136 +4867,6 @@ struct btf *btf__load_module_btf(const char *module_name, struct btf *vmlinux_bt return btf__parse_split(path, vmlinux_btf); } -int btf_type_visit_type_ids(struct btf_type *t, type_id_visit_fn visit, void *ctx) -{ - int i, n, err; - - switch (btf_kind(t)) { - case BTF_KIND_INT: - case BTF_KIND_FLOAT: - case BTF_KIND_ENUM: - case BTF_KIND_ENUM64: - return 0; - - case BTF_KIND_FWD: - case BTF_KIND_CONST: - case BTF_KIND_VOLATILE: - case BTF_KIND_RESTRICT: - case BTF_KIND_PTR: - case BTF_KIND_TYPEDEF: - case BTF_KIND_FUNC: - case BTF_KIND_VAR: - case BTF_KIND_DECL_TAG: - case BTF_KIND_TYPE_TAG: - return visit(&t->type, ctx); - - case BTF_KIND_ARRAY: { - struct btf_array *a = btf_array(t); - - err = visit(&a->type, ctx); - err = err ?: visit(&a->index_type, ctx); - return err; - } - - case BTF_KIND_STRUCT: - case BTF_KIND_UNION: { - struct btf_member *m = btf_members(t); - - for (i = 0, n = btf_vlen(t); i < n; i++, m++) { - err = visit(&m->type, ctx); - if (err) - return err; - } - return 0; - } - - case BTF_KIND_FUNC_PROTO: { - struct btf_param *m = btf_params(t); - - err = visit(&t->type, ctx); - if (err) - return err; - for (i = 0, n = btf_vlen(t); i < n; i++, m++) { - err = visit(&m->type, ctx); - if (err) - return err; - } - return 0; - } - - case BTF_KIND_DATASEC: { - struct btf_var_secinfo *m = btf_var_secinfos(t); - - for (i = 0, n = btf_vlen(t); i < n; i++, m++) { - err = visit(&m->type, ctx); - if (err) - return err; - } - return 0; - } - - default: - return -EINVAL; - } -} - -int btf_type_visit_str_offs(struct btf_type *t, str_off_visit_fn visit, void *ctx) -{ - int i, n, err; - - err = visit(&t->name_off, ctx); - if (err) - return err; - - switch (btf_kind(t)) { - case BTF_KIND_STRUCT: - case BTF_KIND_UNION: { - struct btf_member *m = btf_members(t); - - for (i = 0, n = btf_vlen(t); i < n; i++, m++) { - err = visit(&m->name_off, ctx); - if (err) - return err; - } - break; - } - case BTF_KIND_ENUM: { - struct btf_enum *m = btf_enum(t); - - for (i = 0, n = btf_vlen(t); i < n; i++, m++) { - err = visit(&m->name_off, ctx); - if (err) - return err; - } - break; - } - case BTF_KIND_ENUM64: { - struct btf_enum64 *m = btf_enum64(t); - - for (i = 0, n = btf_vlen(t); i < n; i++, m++) { - err = visit(&m->name_off, ctx); - if (err) - return err; - } - break; - } - case BTF_KIND_FUNC_PROTO: { - struct btf_param *m = btf_params(t); - - for (i = 0, n = btf_vlen(t); i < n; i++, m++) { - err = visit(&m->name_off, ctx); - if (err) - return err; - } - break; - } - default: - break; - } - - return 0; -} - int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, enum btf_field_iter_kind iter_kind) { it->p = NULL; diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 32a961e7d389..2f594b4b732f 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -513,8 +513,6 @@ __u32 *btf_field_iter_next(struct btf_field_iter *it); typedef int (*type_id_visit_fn)(__u32 *type_id, void *ctx); typedef int (*str_off_visit_fn)(__u32 *str_off, void *ctx); -int btf_type_visit_type_ids(struct btf_type *t, type_id_visit_fn visit, void *ctx); -int btf_type_visit_str_offs(struct btf_type *t, str_off_visit_fn visit, void *ctx); int btf_ext_visit_type_ids(struct btf_ext *btf_ext, type_id_visit_fn visit, void *ctx); int btf_ext_visit_str_offs(struct btf_ext *btf_ext, str_off_visit_fn visit, void *ctx); __s32 btf__find_by_name_kind_own(const struct btf *btf, const char *type_name, -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 58e185a0dc359a6c1c9eff348d7badfc9f722159 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- To support more robust split BTF, adding supplemental context for the base BTF type ids that split BTF refers to is required. Without such references, a simple shuffling of base BTF type ids (without any other significant change) invalidates the split BTF. Here the attempt is made to store additional context to make split BTF more robust. This context comes in the form of distilled base BTF providing minimal information (name and - in some cases - size) for base INTs, FLOATs, STRUCTs, UNIONs, ENUMs and ENUM64s along with modified split BTF that points at that base and contains any additional types needed (such as TYPEDEF, PTR and anonymous STRUCT/UNION declarations). This information constitutes the minimal BTF representation needed to disambiguate or remove split BTF references to base BTF. The rules are as follows: - INT, FLOAT, FWD are recorded in full. - if a named base BTF STRUCT or UNION is referred to from split BTF, it will be encoded as a zero-member sized STRUCT/UNION (preserving size for later relocation checks). Only base BTF STRUCT/UNIONs that are either embedded in split BTF STRUCT/UNIONs or that have multiple STRUCT/UNION instances of the same name will _need_ size checks at relocation time, but as it is possible a different set of types will be duplicates in the later to-be-resolved base BTF, we preserve size information for all named STRUCT/UNIONs. - if an ENUM[64] is named, a ENUM forward representation (an ENUM with no values) of the same size is used. - in all other cases, the type is added to the new split BTF. Avoiding struct/union/enum/enum64 expansion is important to keep the distilled base BTF representation to a minimum size. When successful, new representations of the distilled base BTF and new split BTF that refers to it are returned. Both need to be freed by the caller. So to take a simple example, with split BTF with a type referring to "struct sk_buff", we will generate distilled base BTF with a 0-member STRUCT sk_buff of the appropriate size, and the split BTF will refer to it instead. Tools like pahole can utilize such split BTF to populate the .BTF section (split BTF) and an additional .BTF.base section. Then when the split BTF is loaded, the distilled base BTF can be used to relocate split BTF to reference the current (and possibly changed) base BTF. So for example if "struct sk_buff" was id 502 when the split BTF was originally generated, we can use the distilled base BTF to see that id 502 refers to a "struct sk_buff" and replace instances of id 502 with the current (relocated) base BTF sk_buff type id. Distilled base BTF is small; when building a kernel with all modules using distilled base BTF as a test, overall module size grew by only 5.3Mb total across ~2700 modules. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-2-alan.maguire@oracle.com Conflicts: tools/lib/bpf/libbpf.map [The conflicts were due to btf__distill_base is LIBBPF_1.5.0 API, but current version of libbpf is LIBBPF_1.3.0] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 319 ++++++++++++++++++++++++++++++++++++++- tools/lib/bpf/btf.h | 21 +++ tools/lib/bpf/libbpf.map | 1 + 3 files changed, 335 insertions(+), 6 deletions(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index d97991fead93..5ef5544578ab 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -1604,9 +1604,8 @@ static int btf_rewrite_str(struct btf_pipe *p, __u32 *str_off) return 0; } -int btf__add_type(struct btf *btf, const struct btf *src_btf, const struct btf_type *src_type) +static int btf_add_type(struct btf_pipe *p, const struct btf_type *src_type) { - struct btf_pipe p = { .src = src_btf, .dst = btf }; struct btf_field_iter it; struct btf_type *t; __u32 *str_off; @@ -1617,10 +1616,10 @@ int btf__add_type(struct btf *btf, const struct btf *src_btf, const struct btf_t return libbpf_err(sz); /* deconstruct BTF, if necessary, and invalidate raw_data */ - if (btf_ensure_modifiable(btf)) + if (btf_ensure_modifiable(p->dst)) return libbpf_err(-ENOMEM); - t = btf_add_type_mem(btf, sz); + t = btf_add_type_mem(p->dst, sz); if (!t) return libbpf_err(-ENOMEM); @@ -1631,12 +1630,19 @@ int btf__add_type(struct btf *btf, const struct btf *src_btf, const struct btf_t return libbpf_err(err); while ((str_off = btf_field_iter_next(&it))) { - err = btf_rewrite_str(&p, str_off); + err = btf_rewrite_str(p, str_off); if (err) return libbpf_err(err); } - return btf_commit_type(btf, sz); + return btf_commit_type(p->dst, sz); +} + +int btf__add_type(struct btf *btf, const struct btf *src_btf, const struct btf_type *src_type) +{ + struct btf_pipe p = { .src = src_btf, .dst = btf }; + + return btf_add_type(&p, src_type); } static size_t btf_dedup_identity_hash_fn(long key, void *ctx); @@ -5108,3 +5114,304 @@ int btf_ext_visit_str_offs(struct btf_ext *btf_ext, str_off_visit_fn visit, void return 0; } + +struct btf_distill { + struct btf_pipe pipe; + int *id_map; + unsigned int split_start_id; + unsigned int split_start_str; + int diff_id; +}; + +static int btf_add_distilled_type_ids(struct btf_distill *dist, __u32 i) +{ + struct btf_type *split_t = btf_type_by_id(dist->pipe.src, i); + struct btf_field_iter it; + __u32 *id; + int err; + + err = btf_field_iter_init(&it, split_t, BTF_FIELD_ITER_IDS); + if (err) + return err; + while ((id = btf_field_iter_next(&it))) { + struct btf_type *base_t; + + if (!*id) + continue; + /* split BTF id, not needed */ + if (*id >= dist->split_start_id) + continue; + /* already added ? */ + if (dist->id_map[*id] > 0) + continue; + + /* only a subset of base BTF types should be referenced from + * split BTF; ensure nothing unexpected is referenced. + */ + base_t = btf_type_by_id(dist->pipe.src, *id); + switch (btf_kind(base_t)) { + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_FWD: + case BTF_KIND_ARRAY: + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + case BTF_KIND_TYPEDEF: + case BTF_KIND_ENUM: + case BTF_KIND_ENUM64: + case BTF_KIND_PTR: + case BTF_KIND_CONST: + case BTF_KIND_RESTRICT: + case BTF_KIND_VOLATILE: + case BTF_KIND_FUNC_PROTO: + case BTF_KIND_TYPE_TAG: + dist->id_map[*id] = *id; + break; + default: + pr_warn("unexpected reference to base type[%u] of kind [%u] when creating distilled base BTF.\n", + *id, btf_kind(base_t)); + return -EINVAL; + } + /* If a base type is used, ensure types it refers to are + * marked as used also; so for example if we find a PTR to INT + * we need both the PTR and INT. + * + * The only exception is named struct/unions, since distilled + * base BTF composite types have no members. + */ + if (btf_is_composite(base_t) && base_t->name_off) + continue; + err = btf_add_distilled_type_ids(dist, *id); + if (err) + return err; + } + return 0; +} + +static int btf_add_distilled_types(struct btf_distill *dist) +{ + bool adding_to_base = dist->pipe.dst->start_id == 1; + int id = btf__type_cnt(dist->pipe.dst); + struct btf_type *t; + int i, err = 0; + + + /* Add types for each of the required references to either distilled + * base or split BTF, depending on type characteristics. + */ + for (i = 1; i < dist->split_start_id; i++) { + const char *name; + int kind; + + if (!dist->id_map[i]) + continue; + t = btf_type_by_id(dist->pipe.src, i); + kind = btf_kind(t); + name = btf__name_by_offset(dist->pipe.src, t->name_off); + + switch (kind) { + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_FWD: + /* Named int, float, fwd are added to base. */ + if (!adding_to_base) + continue; + err = btf_add_type(&dist->pipe, t); + break; + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + /* Named struct/union are added to base as 0-vlen + * struct/union of same size. Anonymous struct/unions + * are added to split BTF as-is. + */ + if (adding_to_base) { + if (!t->name_off) + continue; + err = btf_add_composite(dist->pipe.dst, kind, name, t->size); + } else { + if (t->name_off) + continue; + err = btf_add_type(&dist->pipe, t); + } + break; + case BTF_KIND_ENUM: + case BTF_KIND_ENUM64: + /* Named enum[64]s are added to base as a sized + * enum; relocation will match with appropriately-named + * and sized enum or enum64. + * + * Anonymous enums are added to split BTF as-is. + */ + if (adding_to_base) { + if (!t->name_off) + continue; + err = btf__add_enum(dist->pipe.dst, name, t->size); + } else { + if (t->name_off) + continue; + err = btf_add_type(&dist->pipe, t); + } + break; + case BTF_KIND_ARRAY: + case BTF_KIND_TYPEDEF: + case BTF_KIND_PTR: + case BTF_KIND_CONST: + case BTF_KIND_RESTRICT: + case BTF_KIND_VOLATILE: + case BTF_KIND_FUNC_PROTO: + case BTF_KIND_TYPE_TAG: + /* All other types are added to split BTF. */ + if (adding_to_base) + continue; + err = btf_add_type(&dist->pipe, t); + break; + default: + pr_warn("unexpected kind when adding base type '%s'[%u] of kind [%u] to distilled base BTF.\n", + name, i, kind); + return -EINVAL; + + } + if (err < 0) + break; + dist->id_map[i] = id++; + } + return err; +} + +/* Split BTF ids without a mapping will be shifted downwards since distilled + * base BTF is smaller than the original base BTF. For those that have a + * mapping (either to base or updated split BTF), update the id based on + * that mapping. + */ +static int btf_update_distilled_type_ids(struct btf_distill *dist, __u32 i) +{ + struct btf_type *t = btf_type_by_id(dist->pipe.dst, i); + struct btf_field_iter it; + __u32 *id; + int err; + + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_IDS); + if (err) + return err; + while ((id = btf_field_iter_next(&it))) { + if (dist->id_map[*id]) + *id = dist->id_map[*id]; + else if (*id >= dist->split_start_id) + *id -= dist->diff_id; + } + return 0; +} + +/* Create updated split BTF with distilled base BTF; distilled base BTF + * consists of BTF information required to clarify the types that split + * BTF refers to, omitting unneeded details. Specifically it will contain + * base types and memberless definitions of named structs, unions and enumerated + * types. Associated reference types like pointers, arrays and anonymous + * structs, unions and enumerated types will be added to split BTF. + * Size is recorded for named struct/unions to help guide matching to the + * target base BTF during later relocation. + * + * The only case where structs, unions or enumerated types are fully represented + * is when they are anonymous; in such cases, the anonymous type is added to + * split BTF in full. + * + * We return newly-created split BTF where the split BTF refers to a newly-created + * distilled base BTF. Both must be freed separately by the caller. + */ +int btf__distill_base(const struct btf *src_btf, struct btf **new_base_btf, + struct btf **new_split_btf) +{ + struct btf *new_base = NULL, *new_split = NULL; + const struct btf *old_base; + unsigned int n = btf__type_cnt(src_btf); + struct btf_distill dist = {}; + struct btf_type *t; + int i, err = 0; + + /* src BTF must be split BTF. */ + old_base = btf__base_btf(src_btf); + if (!new_base_btf || !new_split_btf || !old_base) + return libbpf_err(-EINVAL); + + new_base = btf__new_empty(); + if (!new_base) + return libbpf_err(-ENOMEM); + dist.id_map = calloc(n, sizeof(*dist.id_map)); + if (!dist.id_map) { + err = -ENOMEM; + goto done; + } + dist.pipe.src = src_btf; + dist.pipe.dst = new_base; + dist.pipe.str_off_map = hashmap__new(btf_dedup_identity_hash_fn, btf_dedup_equal_fn, NULL); + if (IS_ERR(dist.pipe.str_off_map)) { + err = -ENOMEM; + goto done; + } + dist.split_start_id = btf__type_cnt(old_base); + dist.split_start_str = old_base->hdr->str_len; + + /* Pass over src split BTF; generate the list of base BTF type ids it + * references; these will constitute our distilled BTF set to be + * distributed over base and split BTF as appropriate. + */ + for (i = src_btf->start_id; i < n; i++) { + err = btf_add_distilled_type_ids(&dist, i); + if (err < 0) + goto done; + } + /* Next add types for each of the required references to base BTF and split BTF + * in turn. + */ + err = btf_add_distilled_types(&dist); + if (err < 0) + goto done; + + /* Create new split BTF with distilled base BTF as its base; the final + * state is split BTF with distilled base BTF that represents enough + * about its base references to allow it to be relocated with the base + * BTF available. + */ + new_split = btf__new_empty_split(new_base); + if (!new_split_btf) { + err = -errno; + goto done; + } + dist.pipe.dst = new_split; + /* First add all split types */ + for (i = src_btf->start_id; i < n; i++) { + t = btf_type_by_id(src_btf, i); + err = btf_add_type(&dist.pipe, t); + if (err < 0) + goto done; + } + /* Now add distilled types to split BTF that are not added to base. */ + err = btf_add_distilled_types(&dist); + if (err < 0) + goto done; + + /* All split BTF ids will be shifted downwards since there are less base + * BTF ids in distilled base BTF. + */ + dist.diff_id = dist.split_start_id - btf__type_cnt(new_base); + + n = btf__type_cnt(new_split); + /* Now update base/split BTF ids. */ + for (i = 1; i < n; i++) { + err = btf_update_distilled_type_ids(&dist, i); + if (err < 0) + break; + } +done: + free(dist.id_map); + hashmap__free(dist.pipe.str_off_map); + if (err) { + btf__free(new_split); + btf__free(new_base); + return libbpf_err(err); + } + *new_base_btf = new_base; + *new_split_btf = new_split; + + return 0; +} diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index 8e6880d91c84..cb08ee9a5a10 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -107,6 +107,27 @@ LIBBPF_API struct btf *btf__new_empty(void); */ LIBBPF_API struct btf *btf__new_empty_split(struct btf *base_btf); +/** + * @brief **btf__distill_base()** creates new versions of the split BTF + * *src_btf* and its base BTF. The new base BTF will only contain the types + * needed to improve robustness of the split BTF to small changes in base BTF. + * When that split BTF is loaded against a (possibly changed) base, this + * distilled base BTF will help update references to that (possibly changed) + * base BTF. + * + * Both the new split and its associated new base BTF must be freed by + * the caller. + * + * If successful, 0 is returned and **new_base_btf** and **new_split_btf** + * will point at new base/split BTF. Both the new split and its associated + * new base BTF must be freed by the caller. + * + * A negative value is returned on error and the thread-local `errno` variable + * is set to the error code as well. + */ +LIBBPF_API int btf__distill_base(const struct btf *src_btf, struct btf **new_base_btf, + struct btf **new_split_btf); + LIBBPF_API struct btf *btf__parse(const char *path, struct btf_ext **btf_ext); LIBBPF_API struct btf *btf__parse_split(const char *path, struct btf *base_btf); LIBBPF_API struct btf *btf__parse_elf(const char *path, struct btf_ext **btf_ext); diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index f802a77d10b6..acf4e94bebc9 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -395,6 +395,7 @@ LIBBPF_1.2.0 { LIBBPF_1.3.0 { global: btf__new_split; + btf__distill_base; bpf_obj_pin_opts; bpf_object__unpin; bpf_prog_detach_opts; -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit eb20e727c4343ad591cff2bef243590c77f62cf1 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Test generation of split+distilled base BTF, ensuring that - named base BTF STRUCTs and UNIONs are represented as 0-vlen sized STRUCT/UNIONs - named ENUM[64]s are represented as 0-vlen named ENUM[64]s - anonymous struct/unions are represented in full in split BTF - anonymous enums are represented in full in split BTF - types unreferenced from split BTF are not present in distilled base BTF Also test that with vmlinux BTF and split BTF based upon it, we only represent needed base types referenced from split BTF in distilled base. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-3-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/btf_distill.c | 274 ++++++++++++++++++ 1 file changed, 274 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/btf_distill.c diff --git a/tools/testing/selftests/bpf/prog_tests/btf_distill.c b/tools/testing/selftests/bpf/prog_tests/btf_distill.c new file mode 100644 index 000000000000..5c3a38747962 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/btf_distill.c @@ -0,0 +1,274 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024, Oracle and/or its affiliates. */ + +#include <test_progs.h> +#include <bpf/btf.h> +#include "btf_helpers.h" + +/* Fabricate base, split BTF with references to base types needed; then create + * split BTF with distilled base BTF and ensure expectations are met: + * - only referenced base types from split BTF are present + * - struct/union/enum are represented as empty unless anonymous, when they + * are represented in full in split BTF + */ +static void test_distilled_base(void) +{ + struct btf *btf1 = NULL, *btf2 = NULL, *btf3 = NULL, *btf4 = NULL; + + btf1 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf1, "empty_main_btf")) + return; + + btf__add_int(btf1, "int", 4, BTF_INT_SIGNED); /* [1] int */ + btf__add_ptr(btf1, 1); /* [2] ptr to int */ + btf__add_struct(btf1, "s1", 8); /* [3] struct s1 { */ + btf__add_field(btf1, "f1", 2, 0, 0); /* int *f1; */ + /* } */ + btf__add_struct(btf1, "", 12); /* [4] struct { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + btf__add_field(btf1, "f2", 3, 32, 0); /* struct s1 f2; */ + /* } */ + btf__add_int(btf1, "unsigned int", 4, 0); /* [5] unsigned int */ + btf__add_union(btf1, "u1", 12); /* [6] union u1 { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + btf__add_field(btf1, "f2", 2, 0, 0); /* int *f2; */ + /* } */ + btf__add_union(btf1, "", 4); /* [7] union { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + /* } */ + btf__add_enum(btf1, "e1", 4); /* [8] enum e1 { */ + btf__add_enum_value(btf1, "v1", 1); /* v1 = 1; */ + /* } */ + btf__add_enum(btf1, "", 4); /* [9] enum { */ + btf__add_enum_value(btf1, "av1", 2); /* av1 = 2; */ + /* } */ + btf__add_enum64(btf1, "e641", 8, true); /* [10] enum64 { */ + btf__add_enum64_value(btf1, "v1", 1024); /* v1 = 1024; */ + /* } */ + btf__add_enum64(btf1, "", 8, true); /* [11] enum64 { */ + btf__add_enum64_value(btf1, "v1", 1025); /* v1 = 1025; */ + /* } */ + btf__add_struct(btf1, "unneeded", 4); /* [12] struct unneeded { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + /* } */ + btf__add_struct(btf1, "embedded", 4); /* [13] struct embedded { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + /* } */ + btf__add_func_proto(btf1, 1); /* [14] int (*)(int *p1); */ + btf__add_func_param(btf1, "p1", 1); + + btf__add_array(btf1, 1, 1, 3); /* [15] int [3]; */ + + btf__add_struct(btf1, "from_proto", 4); /* [16] struct from_proto { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + /* } */ + btf__add_union(btf1, "u1", 4); /* [17] union u1 { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + /* } */ + VALIDATE_RAW_BTF( + btf1, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] PTR '(anon)' type_id=1", + "[3] STRUCT 's1' size=8 vlen=1\n" + "\t'f1' type_id=2 bits_offset=0", + "[4] STRUCT '(anon)' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=3 bits_offset=32", + "[5] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)", + "[6] UNION 'u1' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=2 bits_offset=0", + "[7] UNION '(anon)' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[8] ENUM 'e1' encoding=UNSIGNED size=4 vlen=1\n" + "\t'v1' val=1", + "[9] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=1\n" + "\t'av1' val=2", + "[10] ENUM64 'e641' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1024", + "[11] ENUM64 '(anon)' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1025", + "[12] STRUCT 'unneeded' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[13] STRUCT 'embedded' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[14] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=1", + "[15] ARRAY '(anon)' type_id=1 index_type_id=1 nr_elems=3", + "[16] STRUCT 'from_proto' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[17] UNION 'u1' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0"); + + btf2 = btf__new_empty_split(btf1); + if (!ASSERT_OK_PTR(btf2, "empty_split_btf")) + goto cleanup; + + btf__add_ptr(btf2, 3); /* [18] ptr to struct s1 */ + /* add ptr to struct anon */ + btf__add_ptr(btf2, 4); /* [19] ptr to struct (anon) */ + btf__add_const(btf2, 6); /* [20] const union u1 */ + btf__add_restrict(btf2, 7); /* [21] restrict union (anon) */ + btf__add_volatile(btf2, 8); /* [22] volatile enum e1 */ + btf__add_typedef(btf2, "et", 9); /* [23] typedef enum (anon) */ + btf__add_const(btf2, 10); /* [24] const enum64 e641 */ + btf__add_ptr(btf2, 11); /* [25] restrict enum64 (anon) */ + btf__add_struct(btf2, "with_embedded", 4); /* [26] struct with_embedded { */ + btf__add_field(btf2, "f1", 13, 0, 0); /* struct embedded f1; */ + /* } */ + btf__add_func(btf2, "fn", BTF_FUNC_STATIC, 14); /* [27] int fn(int p1); */ + btf__add_typedef(btf2, "arraytype", 15); /* [28] typedef int[3] foo; */ + btf__add_func_proto(btf2, 1); /* [29] int (*)(struct from proto p1); */ + btf__add_func_param(btf2, "p1", 16); + + VALIDATE_RAW_BTF( + btf2, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] PTR '(anon)' type_id=1", + "[3] STRUCT 's1' size=8 vlen=1\n" + "\t'f1' type_id=2 bits_offset=0", + "[4] STRUCT '(anon)' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=3 bits_offset=32", + "[5] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)", + "[6] UNION 'u1' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=2 bits_offset=0", + "[7] UNION '(anon)' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[8] ENUM 'e1' encoding=UNSIGNED size=4 vlen=1\n" + "\t'v1' val=1", + "[9] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=1\n" + "\t'av1' val=2", + "[10] ENUM64 'e641' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1024", + "[11] ENUM64 '(anon)' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1025", + "[12] STRUCT 'unneeded' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[13] STRUCT 'embedded' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[14] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=1", + "[15] ARRAY '(anon)' type_id=1 index_type_id=1 nr_elems=3", + "[16] STRUCT 'from_proto' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[17] UNION 'u1' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[18] PTR '(anon)' type_id=3", + "[19] PTR '(anon)' type_id=4", + "[20] CONST '(anon)' type_id=6", + "[21] RESTRICT '(anon)' type_id=7", + "[22] VOLATILE '(anon)' type_id=8", + "[23] TYPEDEF 'et' type_id=9", + "[24] CONST '(anon)' type_id=10", + "[25] PTR '(anon)' type_id=11", + "[26] STRUCT 'with_embedded' size=4 vlen=1\n" + "\t'f1' type_id=13 bits_offset=0", + "[27] FUNC 'fn' type_id=14 linkage=static", + "[28] TYPEDEF 'arraytype' type_id=15", + "[29] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=16"); + + if (!ASSERT_EQ(0, btf__distill_base(btf2, &btf3, &btf4), + "distilled_base") || + !ASSERT_OK_PTR(btf3, "distilled_base") || + !ASSERT_OK_PTR(btf4, "distilled_split") || + !ASSERT_EQ(8, btf__type_cnt(btf3), "distilled_base_type_cnt")) + goto cleanup; + + VALIDATE_RAW_BTF( + btf4, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] STRUCT 's1' size=8 vlen=0", + "[3] UNION 'u1' size=12 vlen=0", + "[4] ENUM 'e1' encoding=UNSIGNED size=4 vlen=0", + "[5] ENUM 'e641' encoding=UNSIGNED size=8 vlen=0", + "[6] STRUCT 'embedded' size=4 vlen=0", + "[7] STRUCT 'from_proto' size=4 vlen=0", + /* split BTF; these types should match split BTF above from 17-28, with + * updated type id references + */ + "[8] PTR '(anon)' type_id=2", + "[9] PTR '(anon)' type_id=20", + "[10] CONST '(anon)' type_id=3", + "[11] RESTRICT '(anon)' type_id=21", + "[12] VOLATILE '(anon)' type_id=4", + "[13] TYPEDEF 'et' type_id=22", + "[14] CONST '(anon)' type_id=5", + "[15] PTR '(anon)' type_id=23", + "[16] STRUCT 'with_embedded' size=4 vlen=1\n" + "\t'f1' type_id=6 bits_offset=0", + "[17] FUNC 'fn' type_id=24 linkage=static", + "[18] TYPEDEF 'arraytype' type_id=25", + "[19] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=7", + /* split BTF types added from original base BTF below */ + "[20] STRUCT '(anon)' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=2 bits_offset=32", + "[21] UNION '(anon)' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[22] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=1\n" + "\t'av1' val=2", + "[23] ENUM64 '(anon)' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1025", + "[24] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=1", + "[25] ARRAY '(anon)' type_id=1 index_type_id=1 nr_elems=3"); + +cleanup: + btf__free(btf4); + btf__free(btf3); + btf__free(btf2); + btf__free(btf1); +} + +/* create split reference BTF from vmlinux + split BTF with a few type references; + * ensure the resultant split reference BTF is as expected, containing only types + * needed to disambiguate references from split BTF. + */ +static void test_distilled_base_vmlinux(void) +{ + struct btf *split_btf = NULL, *vmlinux_btf = btf__load_vmlinux_btf(); + struct btf *split_dist = NULL, *base_dist = NULL; + __s32 int_id, myint_id; + + if (!ASSERT_OK_PTR(vmlinux_btf, "load_vmlinux")) + return; + int_id = btf__find_by_name_kind(vmlinux_btf, "int", BTF_KIND_INT); + if (!ASSERT_GT(int_id, 0, "find_int")) + goto cleanup; + split_btf = btf__new_empty_split(vmlinux_btf); + if (!ASSERT_OK_PTR(split_btf, "new_split")) + goto cleanup; + myint_id = btf__add_typedef(split_btf, "myint", int_id); + btf__add_ptr(split_btf, myint_id); + + if (!ASSERT_EQ(btf__distill_base(split_btf, &base_dist, &split_dist), 0, + "distill_vmlinux_base")) + goto cleanup; + + if (!ASSERT_OK_PTR(split_dist, "split_distilled") || + !ASSERT_OK_PTR(base_dist, "base_dist")) + goto cleanup; + VALIDATE_RAW_BTF( + split_dist, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] TYPEDEF 'myint' type_id=1", + "[3] PTR '(anon)' type_id=2"); + +cleanup: + btf__free(split_dist); + btf__free(base_dist); + btf__free(split_btf); + btf__free(vmlinux_btf); +} + +void test_btf_distill(void) +{ + if (test__start_subtest("distilled_base")) + test_distilled_base(); + if (test__start_subtest("distilled_base_vmlinux")) + test_distilled_base_vmlinux(); +} -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 19e00c897d5031bed969dd79af28e899e038009f category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Map distilled base BTF type ids referenced in split BTF and their references to the base BTF passed in, and if the mapping succeeds, reparent the split BTF to the base BTF. Relocation is done by first verifying that distilled base BTF only consists of named INT, FLOAT, ENUM, FWD, STRUCT and UNION kinds; then we sort these to speed lookups. Once sorted, the base BTF is iterated, and for each relevant kind we check for an equivalent in distilled base BTF. When found, the mapping from distilled -> base BTF id and string offset is recorded. In establishing mappings, we need to ensure we check STRUCT/UNION size when the STRUCT/UNION is embedded in a split BTF STRUCT/UNION, and when duplicate names exist for the same STRUCT/UNION. Otherwise size is ignored in matching STRUCT/UNIONs. Once all mappings are established, we can update type ids and string offsets in split BTF and reparent it to the new base. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-4-alan.maguire@oracle.com Conflicts: tools/lib/bpf/Build tools/lib/bpf/libbpf.map [The conflicts were due to btf__relocate is LIBBPF_1.5.0 API, but current version of libbpf is LIBBPF_1.3.0] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/Build | 2 +- tools/lib/bpf/btf.c | 17 ++ tools/lib/bpf/btf.h | 14 + tools/lib/bpf/btf_relocate.c | 506 ++++++++++++++++++++++++++++++++ tools/lib/bpf/libbpf.map | 1 + tools/lib/bpf/libbpf_internal.h | 3 + 6 files changed, 542 insertions(+), 1 deletion(-) create mode 100644 tools/lib/bpf/btf_relocate.c diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build index 2d0c282c8588..18c006c4be71 100644 --- a/tools/lib/bpf/Build +++ b/tools/lib/bpf/Build @@ -1,4 +1,4 @@ libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \ netlink.o bpf_prog_linfo.o libbpf_probes.o hashmap.o \ btf_dump.o ringbuf.o strset.o linker.o gen_loader.o relo_core.o \ - usdt.o zip.o elf.o + usdt.o zip.o elf.o btf_relocate.o diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 5ef5544578ab..0efb0a645e3d 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -5415,3 +5415,20 @@ int btf__distill_base(const struct btf *src_btf, struct btf **new_base_btf, return 0; } + +const struct btf_header *btf_header(const struct btf *btf) +{ + return btf->hdr; +} + +void btf_set_base_btf(struct btf *btf, const struct btf *base_btf) +{ + btf->base_btf = (struct btf *)base_btf; + btf->start_id = btf__type_cnt(base_btf); + btf->start_str_off = base_btf->hdr->str_len; +} + +int btf__relocate(struct btf *btf, const struct btf *base_btf) +{ + return libbpf_err(btf_relocate(btf, base_btf, NULL)); +} diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index cb08ee9a5a10..8a93120b7385 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -252,6 +252,20 @@ struct btf_dedup_opts { LIBBPF_API int btf__dedup(struct btf *btf, const struct btf_dedup_opts *opts); +/** + * @brief **btf__relocate()** will check the split BTF *btf* for references + * to base BTF kinds, and verify those references are compatible with + * *base_btf*; if they are, *btf* is adjusted such that is re-parented to + * *base_btf* and type ids and strings are adjusted to accommodate this. + * + * If successful, 0 is returned and **btf** now has **base_btf** as its + * base. + * + * A negative value is returned on error and the thread-local `errno` variable + * is set to the error code as well. + */ +LIBBPF_API int btf__relocate(struct btf *btf, const struct btf *base_btf); + struct btf_dump; struct btf_dump_opts { diff --git a/tools/lib/bpf/btf_relocate.c b/tools/lib/bpf/btf_relocate.c new file mode 100644 index 000000000000..eabb8755f662 --- /dev/null +++ b/tools/lib/bpf/btf_relocate.c @@ -0,0 +1,506 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024, Oracle and/or its affiliates. */ + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include "btf.h" +#include "bpf.h" +#include "libbpf.h" +#include "libbpf_internal.h" + +struct btf; + +struct btf_relocate { + struct btf *btf; + const struct btf *base_btf; + const struct btf *dist_base_btf; + unsigned int nr_base_types; + unsigned int nr_split_types; + unsigned int nr_dist_base_types; + int dist_str_len; + int base_str_len; + __u32 *id_map; + __u32 *str_map; +}; + +/* Set temporarily in relocation id_map if distilled base struct/union is + * embedded in a split BTF struct/union; in such a case, size information must + * match between distilled base BTF and base BTF representation of type. + */ +#define BTF_IS_EMBEDDED ((__u32)-1) + +/* <name, size, id> triple used in sorting/searching distilled base BTF. */ +struct btf_name_info { + const char *name; + /* set when search requires a size match */ + int needs_size:1, + size:31; + __u32 id; +}; + +static int btf_relocate_rewrite_type_id(struct btf_relocate *r, __u32 i) +{ + struct btf_type *t = btf_type_by_id(r->btf, i); + struct btf_field_iter it; + __u32 *id; + int err; + + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_IDS); + if (err) + return err; + + while ((id = btf_field_iter_next(&it))) + *id = r->id_map[*id]; + return 0; +} + +/* Simple string comparison used for sorting within BTF, since all distilled + * types are named. If strings match, and size is non-zero for both elements + * fall back to using size for ordering. + */ +static int cmp_btf_name_size(const void *n1, const void *n2) +{ + const struct btf_name_info *ni1 = n1; + const struct btf_name_info *ni2 = n2; + int name_diff = strcmp(ni1->name, ni2->name); + + if (!name_diff && ni1->needs_size && ni2->needs_size) + return ni2->size - ni1->size; + return name_diff; +} + +/* Binary search with a small twist; find leftmost element that matches + * so that we can then iterate through all exact matches. So for example + * searching { "a", "bb", "bb", "c" } we would always match on the + * leftmost "bb". + */ +static struct btf_name_info *search_btf_name_size(struct btf_name_info *key, + struct btf_name_info *vals, + int nelems) +{ + struct btf_name_info *ret = NULL; + int high = nelems - 1; + int low = 0; + + while (low <= high) { + int mid = (low + high)/2; + struct btf_name_info *val = &vals[mid]; + int diff = cmp_btf_name_size(key, val); + + if (diff == 0) + ret = val; + /* even if found, keep searching for leftmost match */ + if (diff <= 0) + high = mid - 1; + else + low = mid + 1; + } + return ret; +} + +/* If a member of a split BTF struct/union refers to a base BTF + * struct/union, mark that struct/union id temporarily in the id_map + * with BTF_IS_EMBEDDED. Members can be const/restrict/volatile/typedef + * reference types, but if a pointer is encountered, the type is no longer + * considered embedded. + */ +static int btf_mark_embedded_composite_type_ids(struct btf_relocate *r, __u32 i) +{ + struct btf_type *t = btf_type_by_id(r->btf, i); + struct btf_field_iter it; + __u32 *id; + int err; + + if (!btf_is_composite(t)) + return 0; + + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_IDS); + if (err) + return err; + + while ((id = btf_field_iter_next(&it))) { + __u32 next_id = *id; + + while (next_id) { + t = btf_type_by_id(r->btf, next_id); + switch (btf_kind(t)) { + case BTF_KIND_CONST: + case BTF_KIND_RESTRICT: + case BTF_KIND_VOLATILE: + case BTF_KIND_TYPEDEF: + case BTF_KIND_TYPE_TAG: + next_id = t->type; + break; + case BTF_KIND_ARRAY: { + struct btf_array *a = btf_array(t); + + next_id = a->type; + break; + } + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + if (next_id < r->nr_dist_base_types) + r->id_map[next_id] = BTF_IS_EMBEDDED; + next_id = 0; + break; + default: + next_id = 0; + break; + } + } + } + + return 0; +} + +/* Build a map from distilled base BTF ids to base BTF ids. To do so, iterate + * through base BTF looking up distilled type (using binary search) equivalents. + */ +static int btf_relocate_map_distilled_base(struct btf_relocate *r) +{ + struct btf_name_info *dist_base_info_sorted, *dist_base_info_sorted_end; + struct btf_type *base_t, *dist_t; + __u8 *base_name_cnt = NULL; + int err = 0; + __u32 id; + + /* generate a sort index array of name/type ids sorted by name for + * distilled base BTF to speed name-based lookups. + */ + dist_base_info_sorted = calloc(r->nr_dist_base_types, sizeof(*dist_base_info_sorted)); + if (!dist_base_info_sorted) { + err = -ENOMEM; + goto done; + } + dist_base_info_sorted_end = dist_base_info_sorted + r->nr_dist_base_types; + for (id = 0; id < r->nr_dist_base_types; id++) { + dist_t = btf_type_by_id(r->dist_base_btf, id); + dist_base_info_sorted[id].name = btf__name_by_offset(r->dist_base_btf, + dist_t->name_off); + dist_base_info_sorted[id].id = id; + dist_base_info_sorted[id].size = dist_t->size; + dist_base_info_sorted[id].needs_size = true; + } + qsort(dist_base_info_sorted, r->nr_dist_base_types, sizeof(*dist_base_info_sorted), + cmp_btf_name_size); + + /* Mark distilled base struct/union members of split BTF structs/unions + * in id_map with BTF_IS_EMBEDDED; this signals that these types + * need to match both name and size, otherwise embeddding the base + * struct/union in the split type is invalid. + */ + for (id = r->nr_dist_base_types; id < r->nr_split_types; id++) { + err = btf_mark_embedded_composite_type_ids(r, id); + if (err) + goto done; + } + + /* Collect name counts for composite types in base BTF. If multiple + * instances of a struct/union of the same name exist, we need to use + * size to determine which to map to since name alone is ambiguous. + */ + base_name_cnt = calloc(r->base_str_len, sizeof(*base_name_cnt)); + if (!base_name_cnt) { + err = -ENOMEM; + goto done; + } + for (id = 1; id < r->nr_base_types; id++) { + base_t = btf_type_by_id(r->base_btf, id); + if (!btf_is_composite(base_t) || !base_t->name_off) + continue; + if (base_name_cnt[base_t->name_off] < 255) + base_name_cnt[base_t->name_off]++; + } + + /* Now search base BTF for matching distilled base BTF types. */ + for (id = 1; id < r->nr_base_types; id++) { + struct btf_name_info *dist_name_info, *dist_name_info_next = NULL; + struct btf_name_info base_name_info = {}; + int dist_kind, base_kind; + + base_t = btf_type_by_id(r->base_btf, id); + /* distilled base consists of named types only. */ + if (!base_t->name_off) + continue; + base_kind = btf_kind(base_t); + base_name_info.id = id; + base_name_info.name = btf__name_by_offset(r->base_btf, base_t->name_off); + switch (base_kind) { + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_ENUM: + case BTF_KIND_ENUM64: + /* These types should match both name and size */ + base_name_info.needs_size = true; + base_name_info.size = base_t->size; + break; + case BTF_KIND_FWD: + /* No size considerations for fwds. */ + break; + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + /* Size only needs to be used for struct/union if there + * are multiple types in base BTF with the same name. + * If there are multiple _distilled_ types with the same + * name (a very unlikely scenario), that doesn't matter + * unless corresponding _base_ types to match them are + * missing. + */ + base_name_info.needs_size = base_name_cnt[base_t->name_off] > 1; + base_name_info.size = base_t->size; + break; + default: + continue; + } + /* iterate over all matching distilled base types */ + for (dist_name_info = search_btf_name_size(&base_name_info, dist_base_info_sorted, + r->nr_dist_base_types); + dist_name_info != NULL; dist_name_info = dist_name_info_next) { + /* Are there more distilled matches to process after + * this one? + */ + dist_name_info_next = dist_name_info + 1; + if (dist_name_info_next >= dist_base_info_sorted_end || + cmp_btf_name_size(&base_name_info, dist_name_info_next)) + dist_name_info_next = NULL; + + if (!dist_name_info->id || dist_name_info->id > r->nr_dist_base_types) { + pr_warn("base BTF id [%d] maps to invalid distilled base BTF id [%d]\n", + id, dist_name_info->id); + err = -EINVAL; + goto done; + } + dist_t = btf_type_by_id(r->dist_base_btf, dist_name_info->id); + dist_kind = btf_kind(dist_t); + + /* Validate that the found distilled type is compatible. + * Do not error out on mismatch as another match may + * occur for an identically-named type. + */ + switch (dist_kind) { + case BTF_KIND_FWD: + switch (base_kind) { + case BTF_KIND_FWD: + if (btf_kflag(dist_t) != btf_kflag(base_t)) + continue; + break; + case BTF_KIND_STRUCT: + if (btf_kflag(base_t)) + continue; + break; + case BTF_KIND_UNION: + if (!btf_kflag(base_t)) + continue; + break; + default: + continue; + } + break; + case BTF_KIND_INT: + if (dist_kind != base_kind || + btf_int_encoding(base_t) != btf_int_encoding(dist_t)) + continue; + break; + case BTF_KIND_FLOAT: + if (dist_kind != base_kind) + continue; + break; + case BTF_KIND_ENUM: + /* ENUM and ENUM64 are encoded as sized ENUM in + * distilled base BTF. + */ + if (base_kind != dist_kind && base_kind != BTF_KIND_ENUM64) + continue; + break; + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + /* size verification is required for embedded + * struct/unions. + */ + if (r->id_map[dist_name_info->id] == BTF_IS_EMBEDDED && + base_t->size != dist_t->size) + continue; + break; + default: + continue; + } + if (r->id_map[dist_name_info->id] && + r->id_map[dist_name_info->id] != BTF_IS_EMBEDDED) { + /* we already have a match; this tells us that + * multiple base types of the same name + * have the same size, since for cases where + * multiple types have the same name we match + * on name and size. In this case, we have + * no way of determining which to relocate + * to in base BTF, so error out. + */ + pr_warn("distilled base BTF type '%s' [%u], size %u has multiple candidates of the same size (ids [%u, %u]) in base BTF\n", + base_name_info.name, dist_name_info->id, + base_t->size, id, r->id_map[dist_name_info->id]); + err = -EINVAL; + goto done; + } + /* map id and name */ + r->id_map[dist_name_info->id] = id; + r->str_map[dist_t->name_off] = base_t->name_off; + } + } + /* ensure all distilled BTF ids now have a mapping... */ + for (id = 1; id < r->nr_dist_base_types; id++) { + const char *name; + + if (r->id_map[id] && r->id_map[id] != BTF_IS_EMBEDDED) + continue; + dist_t = btf_type_by_id(r->dist_base_btf, id); + name = btf__name_by_offset(r->dist_base_btf, dist_t->name_off); + pr_warn("distilled base BTF type '%s' [%d] is not mapped to base BTF id\n", + name, id); + err = -EINVAL; + break; + } +done: + free(base_name_cnt); + free(dist_base_info_sorted); + return err; +} + +/* distilled base should only have named int/float/enum/fwd/struct/union types. */ +static int btf_relocate_validate_distilled_base(struct btf_relocate *r) +{ + unsigned int i; + + for (i = 1; i < r->nr_dist_base_types; i++) { + struct btf_type *t = btf_type_by_id(r->dist_base_btf, i); + int kind = btf_kind(t); + + switch (kind) { + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_ENUM: + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + case BTF_KIND_FWD: + if (t->name_off) + break; + pr_warn("type [%d], kind [%d] is invalid for distilled base BTF; it is anonymous\n", + i, kind); + return -EINVAL; + default: + pr_warn("type [%d] in distilled based BTF has unexpected kind [%d]\n", + i, kind); + return -EINVAL; + } + } + return 0; +} + +static int btf_relocate_rewrite_strs(struct btf_relocate *r, __u32 i) +{ + struct btf_type *t = btf_type_by_id(r->btf, i); + struct btf_field_iter it; + __u32 *str_off; + int off, err; + + err = btf_field_iter_init(&it, t, BTF_FIELD_ITER_STRS); + if (err) + return err; + + while ((str_off = btf_field_iter_next(&it))) { + if (!*str_off) + continue; + if (*str_off >= r->dist_str_len) { + *str_off += r->base_str_len - r->dist_str_len; + } else { + off = r->str_map[*str_off]; + if (!off) { + pr_warn("string '%s' [offset %u] is not mapped to base BTF", + btf__str_by_offset(r->btf, off), *str_off); + return -ENOENT; + } + *str_off = off; + } + } + return 0; +} + +/* If successful, output of relocation is updated BTF with base BTF pointing + * at base_btf, and type ids, strings adjusted accordingly. + */ +int btf_relocate(struct btf *btf, const struct btf *base_btf, __u32 **id_map) +{ + unsigned int nr_types = btf__type_cnt(btf); + const struct btf_header *dist_base_hdr; + const struct btf_header *base_hdr; + struct btf_relocate r = {}; + int err = 0; + __u32 id, i; + + r.dist_base_btf = btf__base_btf(btf); + if (!base_btf || r.dist_base_btf == base_btf) + return -EINVAL; + + r.nr_dist_base_types = btf__type_cnt(r.dist_base_btf); + r.nr_base_types = btf__type_cnt(base_btf); + r.nr_split_types = nr_types - r.nr_dist_base_types; + r.btf = btf; + r.base_btf = base_btf; + + r.id_map = calloc(nr_types, sizeof(*r.id_map)); + r.str_map = calloc(btf_header(r.dist_base_btf)->str_len, sizeof(*r.str_map)); + dist_base_hdr = btf_header(r.dist_base_btf); + base_hdr = btf_header(r.base_btf); + r.dist_str_len = dist_base_hdr->str_len; + r.base_str_len = base_hdr->str_len; + if (!r.id_map || !r.str_map) { + err = -ENOMEM; + goto err_out; + } + + err = btf_relocate_validate_distilled_base(&r); + if (err) + goto err_out; + + /* Split BTF ids need to be adjusted as base and distilled base + * have different numbers of types, changing the start id of split + * BTF. + */ + for (id = r.nr_dist_base_types; id < nr_types; id++) + r.id_map[id] = id + r.nr_base_types - r.nr_dist_base_types; + + /* Build a map from distilled base ids to actual base BTF ids; it is used + * to update split BTF id references. Also build a str_map mapping from + * distilled base BTF names to base BTF names. + */ + err = btf_relocate_map_distilled_base(&r); + if (err) + goto err_out; + + /* Next, rewrite type ids in split BTF, replacing split ids with updated + * ids based on number of types in base BTF, and base ids with + * relocated ids from base_btf. + */ + for (i = 0, id = r.nr_dist_base_types; i < r.nr_split_types; i++, id++) { + err = btf_relocate_rewrite_type_id(&r, id); + if (err) + goto err_out; + } + /* String offsets now need to be updated using the str_map. */ + for (i = 0; i < r.nr_split_types; i++) { + err = btf_relocate_rewrite_strs(&r, i + r.nr_dist_base_types); + if (err) + goto err_out; + } + /* Finally reset base BTF to be base_btf */ + btf_set_base_btf(btf, base_btf); + + if (id_map) { + *id_map = r.id_map; + r.id_map = NULL; + } +err_out: + free(r.id_map); + free(r.str_map); + return err; +} diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index acf4e94bebc9..14476d734601 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -396,6 +396,7 @@ LIBBPF_1.3.0 { global: btf__new_split; btf__distill_base; + btf__relocate; bpf_obj_pin_opts; bpf_object__unpin; bpf_prog_detach_opts; diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 2f594b4b732f..e88f49ea64ef 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -233,6 +233,9 @@ struct btf_type; struct btf_type *btf_type_by_id(const struct btf *btf, __u32 type_id); const char *btf_kind_str(const struct btf_type *t); const struct btf_type *skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id); +const struct btf_header *btf_header(const struct btf *btf); +void btf_set_base_btf(struct btf *btf, const struct btf *base_btf); +int btf_relocate(struct btf *btf, const struct btf *base_btf, __u32 **id_map); static inline enum btf_func_linkage btf_func_linkage(const struct btf_type *t) { -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit affdeb50616b190c3236cc2bf116e1b931a43be2 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Ensure relocated BTF looks as expected; in this case identical to original split BTF, with a few duplicate anonymous types added to split BTF by the relocation process. Also add relocation tests for edge cases like missing type in base BTF and multiple types of the same name. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-5-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/btf_distill.c | 278 ++++++++++++++++++ 1 file changed, 278 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_distill.c b/tools/testing/selftests/bpf/prog_tests/btf_distill.c index 5c3a38747962..bfbe795823a2 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_distill.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_distill.c @@ -217,7 +217,277 @@ static void test_distilled_base(void) "\t'p1' type_id=1", "[25] ARRAY '(anon)' type_id=1 index_type_id=1 nr_elems=3"); + if (!ASSERT_EQ(btf__relocate(btf4, btf1), 0, "relocate_split")) + goto cleanup; + + VALIDATE_RAW_BTF( + btf4, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] PTR '(anon)' type_id=1", + "[3] STRUCT 's1' size=8 vlen=1\n" + "\t'f1' type_id=2 bits_offset=0", + "[4] STRUCT '(anon)' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=3 bits_offset=32", + "[5] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)", + "[6] UNION 'u1' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=2 bits_offset=0", + "[7] UNION '(anon)' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[8] ENUM 'e1' encoding=UNSIGNED size=4 vlen=1\n" + "\t'v1' val=1", + "[9] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=1\n" + "\t'av1' val=2", + "[10] ENUM64 'e641' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1024", + "[11] ENUM64 '(anon)' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1025", + "[12] STRUCT 'unneeded' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[13] STRUCT 'embedded' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[14] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=1", + "[15] ARRAY '(anon)' type_id=1 index_type_id=1 nr_elems=3", + "[16] STRUCT 'from_proto' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[17] UNION 'u1' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[18] PTR '(anon)' type_id=3", + "[19] PTR '(anon)' type_id=30", + "[20] CONST '(anon)' type_id=6", + "[21] RESTRICT '(anon)' type_id=31", + "[22] VOLATILE '(anon)' type_id=8", + "[23] TYPEDEF 'et' type_id=32", + "[24] CONST '(anon)' type_id=10", + "[25] PTR '(anon)' type_id=33", + "[26] STRUCT 'with_embedded' size=4 vlen=1\n" + "\t'f1' type_id=13 bits_offset=0", + "[27] FUNC 'fn' type_id=34 linkage=static", + "[28] TYPEDEF 'arraytype' type_id=35", + "[29] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=16", + /* below here are (duplicate) anon base types added by distill + * process to split BTF. + */ + "[30] STRUCT '(anon)' size=12 vlen=2\n" + "\t'f1' type_id=1 bits_offset=0\n" + "\t'f2' type_id=3 bits_offset=32", + "[31] UNION '(anon)' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[32] ENUM '(anon)' encoding=UNSIGNED size=4 vlen=1\n" + "\t'av1' val=2", + "[33] ENUM64 '(anon)' encoding=SIGNED size=8 vlen=1\n" + "\t'v1' val=1025", + "[34] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1\n" + "\t'p1' type_id=1", + "[35] ARRAY '(anon)' type_id=1 index_type_id=1 nr_elems=3"); + +cleanup: + btf__free(btf4); + btf__free(btf3); + btf__free(btf2); + btf__free(btf1); +} + +/* ensure we can cope with multiple types with the same name in + * distilled base BTF. In this case because sizes are different, + * we can still disambiguate them. + */ +static void test_distilled_base_multi(void) +{ + struct btf *btf1 = NULL, *btf2 = NULL, *btf3 = NULL, *btf4 = NULL; + + btf1 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf1, "empty_main_btf")) + return; + btf__add_int(btf1, "int", 4, BTF_INT_SIGNED); /* [1] int */ + btf__add_int(btf1, "int", 8, BTF_INT_SIGNED); /* [2] int */ + VALIDATE_RAW_BTF( + btf1, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED"); + btf2 = btf__new_empty_split(btf1); + if (!ASSERT_OK_PTR(btf2, "empty_split_btf")) + goto cleanup; + btf__add_ptr(btf2, 1); + btf__add_const(btf2, 2); + VALIDATE_RAW_BTF( + btf2, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED", + "[3] PTR '(anon)' type_id=1", + "[4] CONST '(anon)' type_id=2"); + if (!ASSERT_EQ(0, btf__distill_base(btf2, &btf3, &btf4), + "distilled_base") || + !ASSERT_OK_PTR(btf3, "distilled_base") || + !ASSERT_OK_PTR(btf4, "distilled_split") || + !ASSERT_EQ(3, btf__type_cnt(btf3), "distilled_base_type_cnt")) + goto cleanup; + VALIDATE_RAW_BTF( + btf3, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED"); + if (!ASSERT_EQ(btf__relocate(btf4, btf1), 0, "relocate_split")) + goto cleanup; + + VALIDATE_RAW_BTF( + btf4, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED", + "[3] PTR '(anon)' type_id=1", + "[4] CONST '(anon)' type_id=2"); + +cleanup: + btf__free(btf4); + btf__free(btf3); + btf__free(btf2); + btf__free(btf1); +} + +/* If a needed type is not present in the base BTF we wish to relocate + * with, btf__relocate() should error our. + */ +static void test_distilled_base_missing_err(void) +{ + struct btf *btf1 = NULL, *btf2 = NULL, *btf3 = NULL, *btf4 = NULL, *btf5 = NULL; + + btf1 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf1, "empty_main_btf")) + return; + btf__add_int(btf1, "int", 4, BTF_INT_SIGNED); /* [1] int */ + btf__add_int(btf1, "int", 8, BTF_INT_SIGNED); /* [2] int */ + VALIDATE_RAW_BTF( + btf1, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED"); + btf2 = btf__new_empty_split(btf1); + if (!ASSERT_OK_PTR(btf2, "empty_split_btf")) + goto cleanup; + btf__add_ptr(btf2, 1); + btf__add_const(btf2, 2); + VALIDATE_RAW_BTF( + btf2, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED", + "[3] PTR '(anon)' type_id=1", + "[4] CONST '(anon)' type_id=2"); + if (!ASSERT_EQ(0, btf__distill_base(btf2, &btf3, &btf4), + "distilled_base") || + !ASSERT_OK_PTR(btf3, "distilled_base") || + !ASSERT_OK_PTR(btf4, "distilled_split") || + !ASSERT_EQ(3, btf__type_cnt(btf3), "distilled_base_type_cnt")) + goto cleanup; + VALIDATE_RAW_BTF( + btf3, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED"); + btf5 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf5, "empty_reloc_btf")) + return; + btf__add_int(btf5, "int", 4, BTF_INT_SIGNED); /* [1] int */ + VALIDATE_RAW_BTF( + btf5, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); + ASSERT_EQ(btf__relocate(btf4, btf5), -EINVAL, "relocate_split"); + +cleanup: + btf__free(btf5); + btf__free(btf4); + btf__free(btf3); + btf__free(btf2); + btf__free(btf1); +} + +/* With 2 types of same size in distilled base BTF, relocation should + * fail as we have no means to choose between them. + */ +static void test_distilled_base_multi_err(void) +{ + struct btf *btf1 = NULL, *btf2 = NULL, *btf3 = NULL, *btf4 = NULL; + + btf1 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf1, "empty_main_btf")) + return; + btf__add_int(btf1, "int", 4, BTF_INT_SIGNED); /* [1] int */ + btf__add_int(btf1, "int", 4, BTF_INT_SIGNED); /* [2] int */ + VALIDATE_RAW_BTF( + btf1, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); + btf2 = btf__new_empty_split(btf1); + if (!ASSERT_OK_PTR(btf2, "empty_split_btf")) + goto cleanup; + btf__add_ptr(btf2, 1); + btf__add_const(btf2, 2); + VALIDATE_RAW_BTF( + btf2, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[3] PTR '(anon)' type_id=1", + "[4] CONST '(anon)' type_id=2"); + if (!ASSERT_EQ(0, btf__distill_base(btf2, &btf3, &btf4), + "distilled_base") || + !ASSERT_OK_PTR(btf3, "distilled_base") || + !ASSERT_OK_PTR(btf4, "distilled_split") || + !ASSERT_EQ(3, btf__type_cnt(btf3), "distilled_base_type_cnt")) + goto cleanup; + VALIDATE_RAW_BTF( + btf3, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); + ASSERT_EQ(btf__relocate(btf4, btf1), -EINVAL, "relocate_split"); +cleanup: + btf__free(btf4); + btf__free(btf3); + btf__free(btf2); + btf__free(btf1); +} + +/* With 2 types of same size in base BTF, relocation should + * fail as we have no means to choose between them. + */ +static void test_distilled_base_multi_err2(void) +{ + struct btf *btf1 = NULL, *btf2 = NULL, *btf3 = NULL, *btf4 = NULL, *btf5 = NULL; + + btf1 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf1, "empty_main_btf")) + return; + btf__add_int(btf1, "int", 4, BTF_INT_SIGNED); /* [1] int */ + VALIDATE_RAW_BTF( + btf1, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); + btf2 = btf__new_empty_split(btf1); + if (!ASSERT_OK_PTR(btf2, "empty_split_btf")) + goto cleanup; + btf__add_ptr(btf2, 1); + VALIDATE_RAW_BTF( + btf2, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] PTR '(anon)' type_id=1"); + if (!ASSERT_EQ(0, btf__distill_base(btf2, &btf3, &btf4), + "distilled_base") || + !ASSERT_OK_PTR(btf3, "distilled_base") || + !ASSERT_OK_PTR(btf4, "distilled_split") || + !ASSERT_EQ(2, btf__type_cnt(btf3), "distilled_base_type_cnt")) + goto cleanup; + VALIDATE_RAW_BTF( + btf3, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); + btf5 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf5, "empty_reloc_btf")) + return; + btf__add_int(btf5, "int", 4, BTF_INT_SIGNED); /* [1] int */ + btf__add_int(btf5, "int", 4, BTF_INT_SIGNED); /* [2] int */ + VALIDATE_RAW_BTF( + btf5, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); + ASSERT_EQ(btf__relocate(btf4, btf5), -EINVAL, "relocate_split"); cleanup: + btf__free(btf5); btf__free(btf4); btf__free(btf3); btf__free(btf2); @@ -269,6 +539,14 @@ void test_btf_distill(void) { if (test__start_subtest("distilled_base")) test_distilled_base(); + if (test__start_subtest("distilled_base_multi")) + test_distilled_base_multi(); + if (test__start_subtest("distilled_base_missing_err")) + test_distilled_base_missing_err(); + if (test__start_subtest("distilled_base_multi_err")) + test_distilled_base_multi_err(); + if (test__start_subtest("distilled_base_multi_err2")) + test_distilled_base_multi_err2(); if (test__start_subtest("distilled_base_vmlinux")) test_distilled_base_vmlinux(); } -- 2.34.1
From: Eduard Zingerman <eddyz87@gmail.com> mainline inclusion from mainline-v6.11-rc1 commit c86f180ffc993975fed5907a869fc9b1555d0cfb category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Update btf_parse_elf() to check if .BTF.base section is present. The logic is as follows: if .BTF.base section exists: distilled_base := btf_new(.BTF.base) if distilled_base: btf := btf_new(.BTF, .base_btf=distilled_base) if base_btf: btf_relocate(btf, base_btf) else: btf := btf_new(.BTF) return btf In other words: - if .BTF.base section exists, load BTF from it and use it as a base for .BTF load; - if base_btf is specified and .BTF.base section exist, relocate newly loaded .BTF against base_btf. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240613095014.357981-6-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 164 +++++++++++++++++++++++++++++--------------- tools/lib/bpf/btf.h | 1 + 2 files changed, 111 insertions(+), 54 deletions(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 0efb0a645e3d..cd74b6a3ec99 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -116,6 +116,9 @@ struct btf { /* whether strings are already deduplicated */ bool strs_deduped; + /* whether base_btf should be freed in btf_free for this instance */ + bool owns_base; + /* BTF object FD, if loaded into kernel */ int fd; @@ -810,6 +813,8 @@ void btf__free(struct btf *btf) free(btf->raw_data); free(btf->raw_data_swapped); free(btf->type_offs); + if (btf->owns_base) + btf__free(btf->base_btf); free(btf); } @@ -924,53 +929,38 @@ struct btf *btf__new_split(const void *data, __u32 size, struct btf *base_btf) return libbpf_ptr(btf_new(data, size, base_btf)); } -static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, - struct btf_ext **btf_ext) +struct btf_elf_secs { + Elf_Data *btf_data; + Elf_Data *btf_ext_data; + Elf_Data *btf_base_data; +}; + +static int btf_find_elf_sections(Elf *elf, const char *path, struct btf_elf_secs *secs) { - Elf_Data *btf_data = NULL, *btf_ext_data = NULL; - int err = 0, fd = -1, idx = 0; - struct btf *btf = NULL; Elf_Scn *scn = NULL; - Elf *elf = NULL; + Elf_Data *data; GElf_Ehdr ehdr; size_t shstrndx; + int idx = 0; - if (elf_version(EV_CURRENT) == EV_NONE) { - pr_warn("failed to init libelf for %s\n", path); - return ERR_PTR(-LIBBPF_ERRNO__LIBELF); - } - - fd = open(path, O_RDONLY | O_CLOEXEC); - if (fd < 0) { - err = -errno; - pr_warn("failed to open %s: %s\n", path, strerror(errno)); - return ERR_PTR(err); - } - - err = -LIBBPF_ERRNO__FORMAT; - - elf = elf_begin(fd, ELF_C_READ, NULL); - if (!elf) { - pr_warn("failed to open %s as ELF file\n", path); - goto done; - } if (!gelf_getehdr(elf, &ehdr)) { pr_warn("failed to get EHDR from %s\n", path); - goto done; + goto err; } if (elf_getshdrstrndx(elf, &shstrndx)) { pr_warn("failed to get section names section index for %s\n", path); - goto done; + goto err; } if (!elf_rawdata(elf_getscn(elf, shstrndx), NULL)) { pr_warn("failed to get e_shstrndx from %s\n", path); - goto done; + goto err; } while ((scn = elf_nextscn(elf, scn)) != NULL) { + Elf_Data **field; GElf_Shdr sh; char *name; @@ -978,42 +968,102 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, if (gelf_getshdr(scn, &sh) != &sh) { pr_warn("failed to get section(%d) header from %s\n", idx, path); - goto done; + goto err; } name = elf_strptr(elf, shstrndx, sh.sh_name); if (!name) { pr_warn("failed to get section(%d) name from %s\n", idx, path); - goto done; + goto err; } - if (strcmp(name, BTF_ELF_SEC) == 0) { - btf_data = elf_getdata(scn, 0); - if (!btf_data) { - pr_warn("failed to get section(%d, %s) data from %s\n", - idx, name, path); - goto done; - } - continue; - } else if (btf_ext && strcmp(name, BTF_EXT_ELF_SEC) == 0) { - btf_ext_data = elf_getdata(scn, 0); - if (!btf_ext_data) { - pr_warn("failed to get section(%d, %s) data from %s\n", - idx, name, path); - goto done; - } + + if (strcmp(name, BTF_ELF_SEC) == 0) + field = &secs->btf_data; + else if (strcmp(name, BTF_EXT_ELF_SEC) == 0) + field = &secs->btf_ext_data; + else if (strcmp(name, BTF_BASE_ELF_SEC) == 0) + field = &secs->btf_base_data; + else continue; + + data = elf_getdata(scn, 0); + if (!data) { + pr_warn("failed to get section(%d, %s) data from %s\n", + idx, name, path); + goto err; } + *field = data; + } + + return 0; + +err: + return -LIBBPF_ERRNO__FORMAT; +} + +static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, + struct btf_ext **btf_ext) +{ + struct btf_elf_secs secs = {}; + struct btf *dist_base_btf = NULL; + struct btf *btf = NULL; + int err = 0, fd = -1; + Elf *elf = NULL; + + if (elf_version(EV_CURRENT) == EV_NONE) { + pr_warn("failed to init libelf for %s\n", path); + return ERR_PTR(-LIBBPF_ERRNO__LIBELF); + } + + fd = open(path, O_RDONLY | O_CLOEXEC); + if (fd < 0) { + err = -errno; + pr_warn("failed to open %s: %s\n", path, strerror(errno)); + return ERR_PTR(err); } - if (!btf_data) { + elf = elf_begin(fd, ELF_C_READ, NULL); + if (!elf) { + pr_warn("failed to open %s as ELF file\n", path); + goto done; + } + + err = btf_find_elf_sections(elf, path, &secs); + if (err) + goto done; + + if (!secs.btf_data) { pr_warn("failed to find '%s' ELF section in %s\n", BTF_ELF_SEC, path); err = -ENODATA; goto done; } - btf = btf_new(btf_data->d_buf, btf_data->d_size, base_btf); - err = libbpf_get_error(btf); - if (err) + + if (secs.btf_base_data) { + dist_base_btf = btf_new(secs.btf_base_data->d_buf, secs.btf_base_data->d_size, + NULL); + if (IS_ERR(dist_base_btf)) { + err = PTR_ERR(dist_base_btf); + dist_base_btf = NULL; + goto done; + } + } + + btf = btf_new(secs.btf_data->d_buf, secs.btf_data->d_size, + dist_base_btf ?: base_btf); + if (IS_ERR(btf)) { + err = PTR_ERR(btf); goto done; + } + if (dist_base_btf && base_btf) { + err = btf__relocate(btf, base_btf); + if (err) + goto done; + btf__free(dist_base_btf); + dist_base_btf = NULL; + } + + if (dist_base_btf) + btf->owns_base = true; switch (gelf_getclass(elf)) { case ELFCLASS32: @@ -1027,11 +1077,12 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, break; } - if (btf_ext && btf_ext_data) { - *btf_ext = btf_ext__new(btf_ext_data->d_buf, btf_ext_data->d_size); - err = libbpf_get_error(*btf_ext); - if (err) + if (btf_ext && secs.btf_ext_data) { + *btf_ext = btf_ext__new(secs.btf_ext_data->d_buf, secs.btf_ext_data->d_size); + if (IS_ERR(*btf_ext)) { + err = PTR_ERR(*btf_ext); goto done; + } } else if (btf_ext) { *btf_ext = NULL; } @@ -1045,6 +1096,7 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, if (btf_ext) btf_ext__free(*btf_ext); + btf__free(dist_base_btf); btf__free(btf); return ERR_PTR(err); @@ -5430,5 +5482,9 @@ void btf_set_base_btf(struct btf *btf, const struct btf *base_btf) int btf__relocate(struct btf *btf, const struct btf *base_btf) { - return libbpf_err(btf_relocate(btf, base_btf, NULL)); + int err = btf_relocate(btf, base_btf, NULL); + + if (!err) + btf->owns_base = false; + return libbpf_err(err); } diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index 8a93120b7385..b68d216837a9 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -18,6 +18,7 @@ extern "C" { #define BTF_ELF_SEC ".BTF" #define BTF_EXT_ELF_SEC ".BTF.ext" +#define BTF_BASE_ELF_SEC ".BTF.base" #define MAPS_ELF_SEC ".maps" struct btf; -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 6ba77385f386053cea2a1cad33717de74a26db4e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Now that btf_parse_elf() handles .BTF.base section presence, we need to ensure that resolve_btfids uses .BTF.base when present rather than the vmlinux base BTF passed in via the -B option. Detect .BTF.base section presence and unset the base BTF path to ensure that BTF ELF parsing will do the right thing. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240613095014.357981-7-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/bpf/resolve_btfids/main.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/tools/bpf/resolve_btfids/main.c b/tools/bpf/resolve_btfids/main.c index b3edc239fe56..d54aaa0619df 100644 --- a/tools/bpf/resolve_btfids/main.c +++ b/tools/bpf/resolve_btfids/main.c @@ -409,6 +409,14 @@ static int elf_collect(struct object *obj) obj->efile.idlist = data; obj->efile.idlist_shndx = idx; obj->efile.idlist_addr = sh.sh_addr; + } else if (!strcmp(name, BTF_BASE_ELF_SEC)) { + /* If a .BTF.base section is found, do not resolve + * BTF ids relative to vmlinux; resolve relative + * to the .BTF.base section instead. btf__parse_split() + * will take care of this once the base BTF it is + * passed is NULL. + */ + obj->base_btf_path = NULL; } if (compressed_section_fix(elf, scn, &sh)) -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit d1cf840854bb603c0718a011bc993f69f2df014e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Use less verbose names in BTF relocation code and fix off-by-one error and typo in btf_relocate.c. Simplify loop over matching distilled types, moving from assigning a _next value in loop body to moving match check conditions into the guard. Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240620091733.1967885-2-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf_relocate.c | 72 ++++++++++++++++-------------------- 1 file changed, 31 insertions(+), 41 deletions(-) diff --git a/tools/lib/bpf/btf_relocate.c b/tools/lib/bpf/btf_relocate.c index eabb8755f662..23a41fb03e0d 100644 --- a/tools/lib/bpf/btf_relocate.c +++ b/tools/lib/bpf/btf_relocate.c @@ -160,7 +160,7 @@ static int btf_mark_embedded_composite_type_ids(struct btf_relocate *r, __u32 i) */ static int btf_relocate_map_distilled_base(struct btf_relocate *r) { - struct btf_name_info *dist_base_info_sorted, *dist_base_info_sorted_end; + struct btf_name_info *info, *info_end; struct btf_type *base_t, *dist_t; __u8 *base_name_cnt = NULL; int err = 0; @@ -169,26 +169,24 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) /* generate a sort index array of name/type ids sorted by name for * distilled base BTF to speed name-based lookups. */ - dist_base_info_sorted = calloc(r->nr_dist_base_types, sizeof(*dist_base_info_sorted)); - if (!dist_base_info_sorted) { + info = calloc(r->nr_dist_base_types, sizeof(*info)); + if (!info) { err = -ENOMEM; goto done; } - dist_base_info_sorted_end = dist_base_info_sorted + r->nr_dist_base_types; + info_end = info + r->nr_dist_base_types; for (id = 0; id < r->nr_dist_base_types; id++) { dist_t = btf_type_by_id(r->dist_base_btf, id); - dist_base_info_sorted[id].name = btf__name_by_offset(r->dist_base_btf, - dist_t->name_off); - dist_base_info_sorted[id].id = id; - dist_base_info_sorted[id].size = dist_t->size; - dist_base_info_sorted[id].needs_size = true; + info[id].name = btf__name_by_offset(r->dist_base_btf, dist_t->name_off); + info[id].id = id; + info[id].size = dist_t->size; + info[id].needs_size = true; } - qsort(dist_base_info_sorted, r->nr_dist_base_types, sizeof(*dist_base_info_sorted), - cmp_btf_name_size); + qsort(info, r->nr_dist_base_types, sizeof(*info), cmp_btf_name_size); /* Mark distilled base struct/union members of split BTF structs/unions * in id_map with BTF_IS_EMBEDDED; this signals that these types - * need to match both name and size, otherwise embeddding the base + * need to match both name and size, otherwise embedding the base * struct/union in the split type is invalid. */ for (id = r->nr_dist_base_types; id < r->nr_split_types; id++) { @@ -216,8 +214,7 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) /* Now search base BTF for matching distilled base BTF types. */ for (id = 1; id < r->nr_base_types; id++) { - struct btf_name_info *dist_name_info, *dist_name_info_next = NULL; - struct btf_name_info base_name_info = {}; + struct btf_name_info *dist_info, base_info = {}; int dist_kind, base_kind; base_t = btf_type_by_id(r->base_btf, id); @@ -225,16 +222,16 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) if (!base_t->name_off) continue; base_kind = btf_kind(base_t); - base_name_info.id = id; - base_name_info.name = btf__name_by_offset(r->base_btf, base_t->name_off); + base_info.id = id; + base_info.name = btf__name_by_offset(r->base_btf, base_t->name_off); switch (base_kind) { case BTF_KIND_INT: case BTF_KIND_FLOAT: case BTF_KIND_ENUM: case BTF_KIND_ENUM64: /* These types should match both name and size */ - base_name_info.needs_size = true; - base_name_info.size = base_t->size; + base_info.needs_size = true; + base_info.size = base_t->size; break; case BTF_KIND_FWD: /* No size considerations for fwds. */ @@ -248,31 +245,24 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) * unless corresponding _base_ types to match them are * missing. */ - base_name_info.needs_size = base_name_cnt[base_t->name_off] > 1; - base_name_info.size = base_t->size; + base_info.needs_size = base_name_cnt[base_t->name_off] > 1; + base_info.size = base_t->size; break; default: continue; } /* iterate over all matching distilled base types */ - for (dist_name_info = search_btf_name_size(&base_name_info, dist_base_info_sorted, - r->nr_dist_base_types); - dist_name_info != NULL; dist_name_info = dist_name_info_next) { - /* Are there more distilled matches to process after - * this one? - */ - dist_name_info_next = dist_name_info + 1; - if (dist_name_info_next >= dist_base_info_sorted_end || - cmp_btf_name_size(&base_name_info, dist_name_info_next)) - dist_name_info_next = NULL; - - if (!dist_name_info->id || dist_name_info->id > r->nr_dist_base_types) { + for (dist_info = search_btf_name_size(&base_info, info, r->nr_dist_base_types); + dist_info != NULL && dist_info < info_end && + cmp_btf_name_size(&base_info, dist_info) == 0; + dist_info++) { + if (!dist_info->id || dist_info->id >= r->nr_dist_base_types) { pr_warn("base BTF id [%d] maps to invalid distilled base BTF id [%d]\n", - id, dist_name_info->id); + id, dist_info->id); err = -EINVAL; goto done; } - dist_t = btf_type_by_id(r->dist_base_btf, dist_name_info->id); + dist_t = btf_type_by_id(r->dist_base_btf, dist_info->id); dist_kind = btf_kind(dist_t); /* Validate that the found distilled type is compatible. @@ -319,15 +309,15 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) /* size verification is required for embedded * struct/unions. */ - if (r->id_map[dist_name_info->id] == BTF_IS_EMBEDDED && + if (r->id_map[dist_info->id] == BTF_IS_EMBEDDED && base_t->size != dist_t->size) continue; break; default: continue; } - if (r->id_map[dist_name_info->id] && - r->id_map[dist_name_info->id] != BTF_IS_EMBEDDED) { + if (r->id_map[dist_info->id] && + r->id_map[dist_info->id] != BTF_IS_EMBEDDED) { /* we already have a match; this tells us that * multiple base types of the same name * have the same size, since for cases where @@ -337,13 +327,13 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) * to in base BTF, so error out. */ pr_warn("distilled base BTF type '%s' [%u], size %u has multiple candidates of the same size (ids [%u, %u]) in base BTF\n", - base_name_info.name, dist_name_info->id, - base_t->size, id, r->id_map[dist_name_info->id]); + base_info.name, dist_info->id, + base_t->size, id, r->id_map[dist_info->id]); err = -EINVAL; goto done; } /* map id and name */ - r->id_map[dist_name_info->id] = id; + r->id_map[dist_info->id] = id; r->str_map[dist_t->name_off] = base_t->name_off; } } @@ -362,7 +352,7 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) } done: free(base_name_cnt); - free(dist_base_info_sorted); + free(info); return err; } -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit d4e48e3dd45017abdd69a19285d197de897ef44f category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- ...as this will allow split BTF modules with a base BTF representation (rather than the full vmlinux BTF at time of BTF encoding) to resolve their references to kernel types in a way that is more resilient to small changes in kernel types. This will allow modules that are not built every time the kernel is to provide more resilient BTF, rather than have it invalidated every time BTF ids for core kernel types change. Fields are ordered to avoid holes in struct module. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240620091733.1967885-3-alan.maguire@oracle.com Conflicts: include/linux/module.h [The conflicts were due to some minor issue.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/module.h | 2 ++ kernel/module/main.c | 5 ++++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/module.h b/include/linux/module.h index 63edb1d830d3..96f1a88b69c7 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -517,7 +517,9 @@ struct module { #endif #ifdef CONFIG_DEBUG_INFO_BTF_MODULES unsigned int btf_data_size; + unsigned int btf_base_data_size; void *btf_data; + void *btf_base_data; #else KABI_DEPRECATE(unsigned int, btf_data_size) KABI_DEPRECATE(void *, btf_data) diff --git a/kernel/module/main.c b/kernel/module/main.c index a221b102f0e3..5bb35a5fe2d0 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -2164,6 +2164,8 @@ static int find_module_sections(struct module *mod, struct load_info *info) #endif #ifdef CONFIG_DEBUG_INFO_BTF_MODULES mod->btf_data = any_section_objs(info, ".BTF", 1, &mod->btf_data_size); + mod->btf_base_data = any_section_objs(info, ".BTF.base", 1, + &mod->btf_base_data_size); #endif #ifdef CONFIG_JUMP_LABEL mod->jump_entries = section_objs(info, "__jump_table", @@ -2598,8 +2600,9 @@ static noinline int do_init_module(struct module *mod) } #ifdef CONFIG_DEBUG_INFO_BTF_MODULES - /* .BTF is not SHF_ALLOC and will get removed, so sanitize pointer */ + /* .BTF is not SHF_ALLOC and will get removed, so sanitize pointers */ mod->btf_data = NULL; + mod->btf_base_data = NULL; #endif /* * We want to free module_init, but be aware that kallsyms may be -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> hulk inclusion category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- When backporting resilient split BTF, we need to add the members `unsigned int btf_base_data_size` and `void *btf_base_data` to the struct module. According to the layout of the struct module shown below, we can use KABI_FILE_HOLE to fill `unsigned int btf_base_data_size` and use the reserved space with KABI_USE to handle the rest. Additionally, since CONFIG_DEBUG_INFO_BTF_MODULES has been re-enabled, we can remove the unused KABI_DEPRECATE. struct module { ... /* --- cacheline 17 boundary (1088 bytes) --- */ unsigned int btf_data_size; /* 1088 4 */ /* XXX 4 bytes hole, try to pack */ void * btf_data; /* 1096 8 */ struct jump_entry * jump_entries; /* 1104 8 */ ... } Fixes: de93c21fa0f7 ("[Backport] module, bpf: Store BTF base pointer in struct module") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/module.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index 96f1a88b69c7..af0d31ecd36c 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -517,12 +517,8 @@ struct module { #endif #ifdef CONFIG_DEBUG_INFO_BTF_MODULES unsigned int btf_data_size; - unsigned int btf_base_data_size; + KABI_FILL_HOLE(unsigned int btf_base_data_size) void *btf_data; - void *btf_base_data; -#else - KABI_DEPRECATE(unsigned int, btf_data_size) - KABI_DEPRECATE(void *, btf_data) #endif #ifdef CONFIG_JUMP_LABEL struct jump_entry *jump_entries; @@ -609,7 +605,11 @@ struct module { #ifdef CONFIG_DYNAMIC_DEBUG_CORE struct _ddebug_info dyndbg_info; #endif +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES + KABI_USE(1, void *btf_base_data) +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit e7ac331b30555cf1a0826784a346f36dbf800451 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This will allow it to be shared with the kernel. No functional change. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240620091733.1967885-4-alan.maguire@oracle.com Conflicts: tools/lib/bpf/Build [The conflicts were due to some minor issue.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/Build | 2 +- tools/lib/bpf/btf.c | 162 ------------------------------------- tools/lib/bpf/btf_iter.c | 169 +++++++++++++++++++++++++++++++++++++++ 3 files changed, 170 insertions(+), 163 deletions(-) create mode 100644 tools/lib/bpf/btf_iter.c diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build index 18c006c4be71..41dae3ca2685 100644 --- a/tools/lib/bpf/Build +++ b/tools/lib/bpf/Build @@ -1,4 +1,4 @@ libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o \ netlink.o bpf_prog_linfo.o libbpf_probes.o hashmap.o \ btf_dump.o ringbuf.o strset.o linker.o gen_loader.o relo_core.o \ - usdt.o zip.o elf.o btf_relocate.o + usdt.o zip.o elf.o btf_iter.o btf_relocate.o diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index cd74b6a3ec99..db8cfaeba6f7 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -4925,168 +4925,6 @@ struct btf *btf__load_module_btf(const char *module_name, struct btf *vmlinux_bt return btf__parse_split(path, vmlinux_btf); } -int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, enum btf_field_iter_kind iter_kind) -{ - it->p = NULL; - it->m_idx = -1; - it->off_idx = 0; - it->vlen = 0; - - switch (iter_kind) { - case BTF_FIELD_ITER_IDS: - switch (btf_kind(t)) { - case BTF_KIND_UNKN: - case BTF_KIND_INT: - case BTF_KIND_FLOAT: - case BTF_KIND_ENUM: - case BTF_KIND_ENUM64: - it->desc = (struct btf_field_desc) {}; - break; - case BTF_KIND_FWD: - case BTF_KIND_CONST: - case BTF_KIND_VOLATILE: - case BTF_KIND_RESTRICT: - case BTF_KIND_PTR: - case BTF_KIND_TYPEDEF: - case BTF_KIND_FUNC: - case BTF_KIND_VAR: - case BTF_KIND_DECL_TAG: - case BTF_KIND_TYPE_TAG: - it->desc = (struct btf_field_desc) { 1, {offsetof(struct btf_type, type)} }; - break; - case BTF_KIND_ARRAY: - it->desc = (struct btf_field_desc) { - 2, {sizeof(struct btf_type) + offsetof(struct btf_array, type), - sizeof(struct btf_type) + offsetof(struct btf_array, index_type)} - }; - break; - case BTF_KIND_STRUCT: - case BTF_KIND_UNION: - it->desc = (struct btf_field_desc) { - 0, {}, - sizeof(struct btf_member), - 1, {offsetof(struct btf_member, type)} - }; - break; - case BTF_KIND_FUNC_PROTO: - it->desc = (struct btf_field_desc) { - 1, {offsetof(struct btf_type, type)}, - sizeof(struct btf_param), - 1, {offsetof(struct btf_param, type)} - }; - break; - case BTF_KIND_DATASEC: - it->desc = (struct btf_field_desc) { - 0, {}, - sizeof(struct btf_var_secinfo), - 1, {offsetof(struct btf_var_secinfo, type)} - }; - break; - default: - return -EINVAL; - } - break; - case BTF_FIELD_ITER_STRS: - switch (btf_kind(t)) { - case BTF_KIND_UNKN: - it->desc = (struct btf_field_desc) {}; - break; - case BTF_KIND_INT: - case BTF_KIND_FLOAT: - case BTF_KIND_FWD: - case BTF_KIND_ARRAY: - case BTF_KIND_CONST: - case BTF_KIND_VOLATILE: - case BTF_KIND_RESTRICT: - case BTF_KIND_PTR: - case BTF_KIND_TYPEDEF: - case BTF_KIND_FUNC: - case BTF_KIND_VAR: - case BTF_KIND_DECL_TAG: - case BTF_KIND_TYPE_TAG: - case BTF_KIND_DATASEC: - it->desc = (struct btf_field_desc) { - 1, {offsetof(struct btf_type, name_off)} - }; - break; - case BTF_KIND_ENUM: - it->desc = (struct btf_field_desc) { - 1, {offsetof(struct btf_type, name_off)}, - sizeof(struct btf_enum), - 1, {offsetof(struct btf_enum, name_off)} - }; - break; - case BTF_KIND_ENUM64: - it->desc = (struct btf_field_desc) { - 1, {offsetof(struct btf_type, name_off)}, - sizeof(struct btf_enum64), - 1, {offsetof(struct btf_enum64, name_off)} - }; - break; - case BTF_KIND_STRUCT: - case BTF_KIND_UNION: - it->desc = (struct btf_field_desc) { - 1, {offsetof(struct btf_type, name_off)}, - sizeof(struct btf_member), - 1, {offsetof(struct btf_member, name_off)} - }; - break; - case BTF_KIND_FUNC_PROTO: - it->desc = (struct btf_field_desc) { - 1, {offsetof(struct btf_type, name_off)}, - sizeof(struct btf_param), - 1, {offsetof(struct btf_param, name_off)} - }; - break; - default: - return -EINVAL; - } - break; - default: - return -EINVAL; - } - - if (it->desc.m_sz) - it->vlen = btf_vlen(t); - - it->p = t; - return 0; -} - -__u32 *btf_field_iter_next(struct btf_field_iter *it) -{ - if (!it->p) - return NULL; - - if (it->m_idx < 0) { - if (it->off_idx < it->desc.t_off_cnt) - return it->p + it->desc.t_offs[it->off_idx++]; - /* move to per-member iteration */ - it->m_idx = 0; - it->p += sizeof(struct btf_type); - it->off_idx = 0; - } - - /* if type doesn't have members, stop */ - if (it->desc.m_sz == 0) { - it->p = NULL; - return NULL; - } - - if (it->off_idx >= it->desc.m_off_cnt) { - /* exhausted this member's fields, go to the next member */ - it->m_idx++; - it->p += it->desc.m_sz; - it->off_idx = 0; - } - - if (it->m_idx < it->vlen) - return it->p + it->desc.m_offs[it->off_idx++]; - - it->p = NULL; - return NULL; -} - int btf_ext_visit_type_ids(struct btf_ext *btf_ext, type_id_visit_fn visit, void *ctx) { const struct btf_ext_info *seg; diff --git a/tools/lib/bpf/btf_iter.c b/tools/lib/bpf/btf_iter.c new file mode 100644 index 000000000000..c308aa60285d --- /dev/null +++ b/tools/lib/bpf/btf_iter.c @@ -0,0 +1,169 @@ +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) +/* Copyright (c) 2021 Facebook */ +/* Copyright (c) 2024, Oracle and/or its affiliates. */ + +#include "btf.h" +#include "libbpf_internal.h" + +int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, + enum btf_field_iter_kind iter_kind) +{ + it->p = NULL; + it->m_idx = -1; + it->off_idx = 0; + it->vlen = 0; + + switch (iter_kind) { + case BTF_FIELD_ITER_IDS: + switch (btf_kind(t)) { + case BTF_KIND_UNKN: + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_ENUM: + case BTF_KIND_ENUM64: + it->desc = (struct btf_field_desc) {}; + break; + case BTF_KIND_FWD: + case BTF_KIND_CONST: + case BTF_KIND_VOLATILE: + case BTF_KIND_RESTRICT: + case BTF_KIND_PTR: + case BTF_KIND_TYPEDEF: + case BTF_KIND_FUNC: + case BTF_KIND_VAR: + case BTF_KIND_DECL_TAG: + case BTF_KIND_TYPE_TAG: + it->desc = (struct btf_field_desc) { 1, {offsetof(struct btf_type, type)} }; + break; + case BTF_KIND_ARRAY: + it->desc = (struct btf_field_desc) { + 2, {sizeof(struct btf_type) + offsetof(struct btf_array, type), + sizeof(struct btf_type) + offsetof(struct btf_array, index_type)} + }; + break; + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + it->desc = (struct btf_field_desc) { + 0, {}, + sizeof(struct btf_member), + 1, {offsetof(struct btf_member, type)} + }; + break; + case BTF_KIND_FUNC_PROTO: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, type)}, + sizeof(struct btf_param), + 1, {offsetof(struct btf_param, type)} + }; + break; + case BTF_KIND_DATASEC: + it->desc = (struct btf_field_desc) { + 0, {}, + sizeof(struct btf_var_secinfo), + 1, {offsetof(struct btf_var_secinfo, type)} + }; + break; + default: + return -EINVAL; + } + break; + case BTF_FIELD_ITER_STRS: + switch (btf_kind(t)) { + case BTF_KIND_UNKN: + it->desc = (struct btf_field_desc) {}; + break; + case BTF_KIND_INT: + case BTF_KIND_FLOAT: + case BTF_KIND_FWD: + case BTF_KIND_ARRAY: + case BTF_KIND_CONST: + case BTF_KIND_VOLATILE: + case BTF_KIND_RESTRICT: + case BTF_KIND_PTR: + case BTF_KIND_TYPEDEF: + case BTF_KIND_FUNC: + case BTF_KIND_VAR: + case BTF_KIND_DECL_TAG: + case BTF_KIND_TYPE_TAG: + case BTF_KIND_DATASEC: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)} + }; + break; + case BTF_KIND_ENUM: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_enum), + 1, {offsetof(struct btf_enum, name_off)} + }; + break; + case BTF_KIND_ENUM64: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_enum64), + 1, {offsetof(struct btf_enum64, name_off)} + }; + break; + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_member), + 1, {offsetof(struct btf_member, name_off)} + }; + break; + case BTF_KIND_FUNC_PROTO: + it->desc = (struct btf_field_desc) { + 1, {offsetof(struct btf_type, name_off)}, + sizeof(struct btf_param), + 1, {offsetof(struct btf_param, name_off)} + }; + break; + default: + return -EINVAL; + } + break; + default: + return -EINVAL; + } + + if (it->desc.m_sz) + it->vlen = btf_vlen(t); + + it->p = t; + return 0; +} + +__u32 *btf_field_iter_next(struct btf_field_iter *it) +{ + if (!it->p) + return NULL; + + if (it->m_idx < 0) { + if (it->off_idx < it->desc.t_off_cnt) + return it->p + it->desc.t_offs[it->off_idx++]; + /* move to per-member iteration */ + it->m_idx = 0; + it->p += sizeof(struct btf_type); + it->off_idx = 0; + } + + /* if type doesn't have members, stop */ + if (it->desc.m_sz == 0) { + it->p = NULL; + return NULL; + } + + if (it->off_idx >= it->desc.m_off_cnt) { + /* exhausted this member's fields, go to the next member */ + it->m_idx++; + it->p += it->desc.m_sz; + it->off_idx = 0; + } + + if (it->m_idx < it->vlen) + return it->p + it->desc.m_offs[it->off_idx++]; + + it->p = NULL; + return NULL; +} -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 8646db238997df36c6ad71a9d7e0b52ceee221b2 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Share relocation implementation with the kernel. As part of this, we also need the type/string iteration functions so also share btf_iter.c file. Relocation code in kernel and userspace is identical save for the impementation of the reparenting of split BTF to the relocated base BTF and retrieval of the BTF header from "struct btf"; these small functions need separate user-space and kernel implementations for the separate "struct btf"s they operate upon. One other wrinkle on the kernel side is we have to map .BTF.ids in modules as they were generated with the type ids used at BTF encoding time. btf_relocate() optionally returns an array mapping from old BTF ids to relocated ids, so we use that to fix up these references where needed for kfuncs. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240620091733.1967885-5-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf.h | 64 +++++++++++++ kernel/bpf/Makefile | 8 +- kernel/bpf/btf.c | 176 ++++++++++++++++++++++++----------- tools/lib/bpf/btf_iter.c | 8 ++ tools/lib/bpf/btf_relocate.c | 23 +++++ 5 files changed, 226 insertions(+), 53 deletions(-) diff --git a/include/linux/btf.h b/include/linux/btf.h index 91c55f350357..f477f7e23ca3 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -140,6 +140,7 @@ extern const struct file_operations btf_fops; const char *btf_get_name(const struct btf *btf); void btf_get(struct btf *btf); void btf_put(struct btf *btf); +const struct btf_header *btf_header(const struct btf *btf); int btf_new_fd(const union bpf_attr *attr, bpfptr_t uattr, u32 uattr_sz); struct btf *btf_get_by_fd(int fd); int btf_get_info_by_fd(const struct btf *btf, @@ -212,8 +213,10 @@ int btf_get_fd_by_id(u32 id); u32 btf_obj_id(const struct btf *btf); bool btf_is_kernel(const struct btf *btf); bool btf_is_module(const struct btf *btf); +bool btf_is_vmlinux(const struct btf *btf); struct module *btf_try_get_module(const struct btf *btf); u32 btf_nr_types(const struct btf *btf); +struct btf *btf_base_btf(const struct btf *btf); bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, const struct btf_member *m, u32 expected_offset, u32 expected_size); @@ -339,6 +342,11 @@ static inline u8 btf_int_offset(const struct btf_type *t) return BTF_INT_OFFSET(*(u32 *)(t + 1)); } +static inline __u8 btf_int_bits(const struct btf_type *t) +{ + return BTF_INT_BITS(*(__u32 *)(t + 1)); +} + static inline bool btf_type_is_scalar(const struct btf_type *t) { return btf_type_is_int(t) || btf_type_is_enum(t); @@ -478,6 +486,11 @@ static inline struct btf_param *btf_params(const struct btf_type *t) return (struct btf_param *)(t + 1); } +static inline struct btf_decl_tag *btf_decl_tag(const struct btf_type *t) +{ + return (struct btf_decl_tag *)(t + 1); +} + static inline int btf_id_cmp_func(const void *a, const void *b) { const int *pa = a, *pb = b; @@ -515,9 +528,38 @@ static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf * } #endif +enum btf_field_iter_kind { + BTF_FIELD_ITER_IDS, + BTF_FIELD_ITER_STRS, +}; + +struct btf_field_desc { + /* once-per-type offsets */ + int t_off_cnt, t_offs[2]; + /* member struct size, or zero, if no members */ + int m_sz; + /* repeated per-member offsets */ + int m_off_cnt, m_offs[1]; +}; + +struct btf_field_iter { + struct btf_field_desc desc; + void *p; + int m_idx; + int off_idx; + int vlen; +}; + #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); +void btf_set_base_btf(struct btf *btf, const struct btf *base_btf); +int btf_relocate(struct btf *btf, const struct btf *base_btf, __u32 **map_ids); +int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, + enum btf_field_iter_kind iter_kind); +__u32 *btf_field_iter_next(struct btf_field_iter *it); + const char *btf_name_by_offset(const struct btf *btf, u32 offset); +const char *btf_str_by_offset(const struct btf *btf, u32 offset); struct btf *btf_parse_vmlinux(void); struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog); u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id, @@ -544,6 +586,28 @@ static inline const struct btf_type *btf_type_by_id(const struct btf *btf, { return NULL; } + +static inline void btf_set_base_btf(struct btf *btf, const struct btf *base_btf) +{ +} + +static inline int btf_relocate(void *log, struct btf *btf, const struct btf *base_btf, + __u32 **map_ids) +{ + return -EOPNOTSUPP; +} + +static inline int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, + enum btf_field_iter_kind iter_kind) +{ + return -EOPNOTSUPP; +} + +static inline __u32 *btf_field_iter_next(struct btf_field_iter *it) +{ + return NULL; +} + static inline const char *btf_name_by_offset(const struct btf *btf, u32 offset) { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index f526b7573e97..f286815b5b7a 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -44,5 +44,11 @@ endif obj-$(CONFIG_BPF_PRELOAD) += preload/ obj-$(CONFIG_BPF_SYSCALL) += relo_core.o -$(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE +obj-$(CONFIG_BPF_SYSCALL) += btf_iter.o +obj-$(CONFIG_BPF_SYSCALL) += btf_relocate.o + +# Some source files are common to libbpf. +vpath %.c $(srctree)/kernel/bpf:$(srctree)/tools/lib/bpf + +$(obj)/%.o: %.c FORCE $(call if_changed_rule,cc_o_c) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index ed7d054d6e68..95dea7f2ffe4 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -277,6 +277,7 @@ struct btf { u32 start_str_off; /* first string offset (0 for base BTF) */ char name[MODULE_NAME_LEN]; bool kernel_btf; + __u32 *base_id_map; /* map from distilled base BTF -> vmlinux BTF ids */ }; enum verifier_phase { @@ -533,6 +534,11 @@ static bool btf_type_is_decl_tag_target(const struct btf_type *t) btf_type_is_var(t) || btf_type_is_typedef(t); } +bool btf_is_vmlinux(const struct btf *btf) +{ + return btf->kernel_btf && !btf->base_btf; +} + u32 btf_nr_types(const struct btf *btf) { u32 total = 0; @@ -775,7 +781,7 @@ static bool __btf_name_char_ok(char c, bool first) return true; } -static const char *btf_str_by_offset(const struct btf *btf, u32 offset) +const char *btf_str_by_offset(const struct btf *btf, u32 offset) { while (offset < btf->start_str_off) btf = btf->base_btf; @@ -1668,14 +1674,8 @@ static void btf_free_kfunc_set_tab(struct btf *btf) if (!tab) return; - /* For module BTF, we directly assign the sets being registered, so - * there is nothing to free except kfunc_set_tab. - */ - if (btf_is_module(btf)) - goto free_tab; for (hook = 0; hook < ARRAY_SIZE(tab->sets); hook++) kfree(tab->sets[hook]); -free_tab: kfree(tab); btf->kfunc_set_tab = NULL; } @@ -1733,7 +1733,12 @@ static void btf_free(struct btf *btf) kvfree(btf->types); kvfree(btf->resolved_sizes); kvfree(btf->resolved_ids); - kvfree(btf->data); + /* vmlinux does not allocate btf->data, it simply points it at + * __start_BTF. + */ + if (!btf_is_vmlinux(btf)) + kvfree(btf->data); + kvfree(btf->base_id_map); kfree(btf); } @@ -1762,6 +1767,23 @@ void btf_put(struct btf *btf) } } +struct btf *btf_base_btf(const struct btf *btf) +{ + return btf->base_btf; +} + +const struct btf_header *btf_header(const struct btf *btf) +{ + return &btf->hdr; +} + +void btf_set_base_btf(struct btf *btf, const struct btf *base_btf) +{ + btf->base_btf = (struct btf *)base_btf; + btf->start_id = btf_nr_types(base_btf); + btf->start_str_off = base_btf->hdr.str_len; +} + static int env_resolve_init(struct btf_verifier_env *env) { struct btf *btf = env->btf; @@ -5759,23 +5781,15 @@ int get_kern_ctx_btf_id(struct bpf_verifier_log *log, enum bpf_prog_type prog_ty BTF_ID_LIST(bpf_ctx_convert_btf_id) BTF_ID(struct, bpf_ctx_convert) -struct btf *btf_parse_vmlinux(void) +static struct btf *btf_parse_base(struct btf_verifier_env *env, const char *name, + void *data, unsigned int data_size) { - struct btf_verifier_env *env = NULL; - struct bpf_verifier_log *log; struct btf *btf = NULL; int err; if (!IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) return ERR_PTR(-ENOENT); - env = kzalloc(sizeof(*env), GFP_KERNEL | __GFP_NOWARN); - if (!env) - return ERR_PTR(-ENOMEM); - - log = &env->log; - log->level = BPF_LOG_KERNEL; - btf = kzalloc(sizeof(*btf), GFP_KERNEL | __GFP_NOWARN); if (!btf) { err = -ENOMEM; @@ -5783,10 +5797,10 @@ struct btf *btf_parse_vmlinux(void) } env->btf = btf; - btf->data = __start_BTF; - btf->data_size = __stop_BTF - __start_BTF; + btf->data = data; + btf->data_size = data_size; btf->kernel_btf = true; - snprintf(btf->name, sizeof(btf->name), "vmlinux"); + snprintf(btf->name, sizeof(btf->name), "%s", name); err = btf_parse_hdr(env); if (err) @@ -5806,20 +5820,11 @@ struct btf *btf_parse_vmlinux(void) if (err) goto errout; - /* btf_parse_vmlinux() runs under bpf_verifier_lock */ - bpf_ctx_convert.t = btf_type_by_id(btf, bpf_ctx_convert_btf_id[0]); - refcount_set(&btf->refcnt, 1); - err = btf_alloc_id(btf); - if (err) - goto errout; - - btf_verifier_env_free(env); return btf; errout: - btf_verifier_env_free(env); if (btf) { kvfree(btf->types); kfree(btf); @@ -5827,19 +5832,61 @@ struct btf *btf_parse_vmlinux(void) return ERR_PTR(err); } +struct btf *btf_parse_vmlinux(void) +{ + struct btf_verifier_env *env = NULL; + struct bpf_verifier_log *log; + struct btf *btf; + int err; + + env = kzalloc(sizeof(*env), GFP_KERNEL | __GFP_NOWARN); + if (!env) + return ERR_PTR(-ENOMEM); + + log = &env->log; + log->level = BPF_LOG_KERNEL; + btf = btf_parse_base(env, "vmlinux", __start_BTF, __stop_BTF - __start_BTF); + if (IS_ERR(btf)) + goto err_out; + + /* btf_parse_vmlinux() runs under bpf_verifier_lock */ + bpf_ctx_convert.t = btf_type_by_id(btf, bpf_ctx_convert_btf_id[0]); + err = btf_alloc_id(btf); + if (err) { + btf_free(btf); + btf = ERR_PTR(err); + } +err_out: + btf_verifier_env_free(env); + return btf; +} + #ifdef CONFIG_DEBUG_INFO_BTF_MODULES -static struct btf *btf_parse_module(const char *module_name, const void *data, unsigned int data_size) +/* If .BTF_ids section was created with distilled base BTF, both base and + * split BTF ids will need to be mapped to actual base/split ids for + * BTF now that it has been relocated. + */ +static __u32 btf_relocate_id(const struct btf *btf, __u32 id) +{ + if (!btf->base_btf || !btf->base_id_map) + return id; + return btf->base_id_map[id]; +} + +static struct btf *btf_parse_module(const char *module_name, const void *data, + unsigned int data_size, void *base_data, + unsigned int base_data_size) { + struct btf *btf = NULL, *vmlinux_btf, *base_btf = NULL; struct btf_verifier_env *env = NULL; struct bpf_verifier_log *log; - struct btf *btf = NULL, *base_btf; - int err; + int err = 0; - base_btf = bpf_get_btf_vmlinux(); - if (IS_ERR(base_btf)) - return base_btf; - if (!base_btf) + vmlinux_btf = bpf_get_btf_vmlinux(); + if (IS_ERR(vmlinux_btf)) + return vmlinux_btf; + if (!vmlinux_btf) return ERR_PTR(-EINVAL); env = kzalloc(sizeof(*env), GFP_KERNEL | __GFP_NOWARN); @@ -5849,6 +5896,16 @@ static struct btf *btf_parse_module(const char *module_name, const void *data, u log = &env->log; log->level = BPF_LOG_KERNEL; + if (base_data) { + base_btf = btf_parse_base(env, ".BTF.base", base_data, base_data_size); + if (IS_ERR(base_btf)) { + err = PTR_ERR(base_btf); + goto errout; + } + } else { + base_btf = vmlinux_btf; + } + btf = kzalloc(sizeof(*btf), GFP_KERNEL | __GFP_NOWARN); if (!btf) { err = -ENOMEM; @@ -5888,12 +5945,22 @@ static struct btf *btf_parse_module(const char *module_name, const void *data, u if (err) goto errout; + if (base_btf != vmlinux_btf) { + err = btf_relocate(btf, vmlinux_btf, &btf->base_id_map); + if (err) + goto errout; + btf_free(base_btf); + base_btf = vmlinux_btf; + } + btf_verifier_env_free(env); refcount_set(&btf->refcnt, 1); return btf; errout: btf_verifier_env_free(env); + if (base_btf != vmlinux_btf) + btf_free(base_btf); if (btf) { kvfree(btf->data); kvfree(btf->types); @@ -7682,7 +7749,8 @@ static int btf_module_notify(struct notifier_block *nb, unsigned long op, err = -ENOMEM; goto out; } - btf = btf_parse_module(mod->name, mod->btf_data, mod->btf_data_size); + btf = btf_parse_module(mod->name, mod->btf_data, mod->btf_data_size, + mod->btf_base_data, mod->btf_base_data_size); if (IS_ERR(btf)) { kfree(btf_mod); if (!IS_ENABLED(CONFIG_MODULE_ALLOW_BTF_MISMATCH)) { @@ -8013,7 +8081,7 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, bool add_filter = !!kset->filter; struct btf_kfunc_set_tab *tab; struct btf_id_set8 *set; - u32 set_cnt; + u32 set_cnt, i; int ret; if (hook >= BTF_KFUNC_HOOK_MAX) { @@ -8059,21 +8127,15 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, goto end; } - /* We don't need to allocate, concatenate, and sort module sets, because - * only one is allowed per hook. Hence, we can directly assign the - * pointer and return. - */ - if (!vmlinux_set) { - tab->sets[hook] = add_set; - goto do_add_filter; - } - /* In case of vmlinux sets, there may be more than one set being * registered per hook. To create a unified set, we allocate a new set * and concatenate all individual sets being registered. While each set * is individually sorted, they may become unsorted when concatenated, * hence re-sorting the final set again is required to make binary * searching the set using btf_id_set8_contains function work. + * + * For module sets, we need to allocate as we may need to relocate + * BTF ids. */ set_cnt = set ? set->cnt : 0; @@ -8103,11 +8165,14 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook, /* Concatenate the two sets */ memcpy(set->pairs + set->cnt, add_set->pairs, add_set->cnt * sizeof(set->pairs[0])); + /* Now that the set is copied, update with relocated BTF ids */ + for (i = set->cnt; i < set->cnt + add_set->cnt; i++) + set->pairs[i].id = btf_relocate_id(btf, set->pairs[i].id); + set->cnt += add_set->cnt; sort(set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func, NULL); -do_add_filter: if (add_filter) { hook_filter = &tab->hook_filters[hook]; hook_filter->filters[hook_filter->nr_filters++] = kset->filter; @@ -8238,7 +8303,7 @@ static int __register_btf_kfunc_id_set(enum btf_kfunc_hook hook, return PTR_ERR(btf); for (i = 0; i < kset->set->cnt; i++) { - ret = btf_check_kfunc_protos(btf, kset->set->pairs[i].id, + ret = btf_check_kfunc_protos(btf, btf_relocate_id(btf, kset->set->pairs[i].id), kset->set->pairs[i].flags); if (ret) goto err_out; @@ -8302,7 +8367,7 @@ static int btf_check_dtor_kfuncs(struct btf *btf, const struct btf_id_dtor_kfunc u32 nr_args, i; for (i = 0; i < cnt; i++) { - dtor_btf_id = dtors[i].kfunc_btf_id; + dtor_btf_id = btf_relocate_id(btf, dtors[i].kfunc_btf_id); dtor_func = btf_type_by_id(btf, dtor_btf_id); if (!dtor_func || !btf_type_is_func(dtor_func)) @@ -8337,7 +8402,7 @@ int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_c { struct btf_id_dtor_kfunc_tab *tab; struct btf *btf; - u32 tab_cnt; + u32 tab_cnt, i; int ret; btf = btf_get_module_btf(owner); @@ -8388,6 +8453,13 @@ int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_c btf->dtor_kfunc_tab = tab; memcpy(tab->dtors + tab->cnt, dtors, add_cnt * sizeof(tab->dtors[0])); + + /* remap BTF ids based on BTF relocation (if any) */ + for (i = tab_cnt; i < tab_cnt + add_cnt; i++) { + tab->dtors[i].btf_id = btf_relocate_id(btf, tab->dtors[i].btf_id); + tab->dtors[i].kfunc_btf_id = btf_relocate_id(btf, tab->dtors[i].kfunc_btf_id); + } + tab->cnt += add_cnt; sort(tab->dtors, tab->cnt, sizeof(tab->dtors[0]), btf_id_cmp_func, NULL); diff --git a/tools/lib/bpf/btf_iter.c b/tools/lib/bpf/btf_iter.c index c308aa60285d..9a6c822c2294 100644 --- a/tools/lib/bpf/btf_iter.c +++ b/tools/lib/bpf/btf_iter.c @@ -2,8 +2,16 @@ /* Copyright (c) 2021 Facebook */ /* Copyright (c) 2024, Oracle and/or its affiliates. */ +#ifdef __KERNEL__ +#include <linux/bpf.h> +#include <linux/btf.h> + +#define btf_var_secinfos(t) (struct btf_var_secinfo *)btf_type_var_secinfo(t) + +#else #include "btf.h" #include "libbpf_internal.h" +#endif int btf_field_iter_init(struct btf_field_iter *it, struct btf_type *t, enum btf_field_iter_kind iter_kind) diff --git a/tools/lib/bpf/btf_relocate.c b/tools/lib/bpf/btf_relocate.c index 23a41fb03e0d..2281dbbafa11 100644 --- a/tools/lib/bpf/btf_relocate.c +++ b/tools/lib/bpf/btf_relocate.c @@ -5,11 +5,34 @@ #define _GNU_SOURCE #endif +#ifdef __KERNEL__ +#include <linux/bpf.h> +#include <linux/bsearch.h> +#include <linux/btf.h> +#include <linux/sort.h> +#include <linux/string.h> +#include <linux/bpf_verifier.h> + +#define btf_type_by_id (struct btf_type *)btf_type_by_id +#define btf__type_cnt btf_nr_types +#define btf__base_btf btf_base_btf +#define btf__name_by_offset btf_name_by_offset +#define btf__str_by_offset btf_str_by_offset +#define btf_kflag btf_type_kflag + +#define calloc(nmemb, sz) kvcalloc(nmemb, sz, GFP_KERNEL | __GFP_NOWARN) +#define free(ptr) kvfree(ptr) +#define qsort(base, num, sz, cmp) sort(base, num, sz, cmp, NULL) + +#else + #include "btf.h" #include "bpf.h" #include "libbpf.h" #include "libbpf_internal.h" +#endif /* __KERNEL__ */ + struct btf; struct btf_relocate { -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 46fb0b62ea29c0dbcb3e44f1d67aafe79bc6e045 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Support creation of module BTF along with distilled base BTF; the latter is stored in a .BTF.base ELF section and supplements split BTF references to base BTF with information about base types, allowing for later relocation of split BTF with a (possibly changed) base. resolve_btfids detects the presence of a .BTF.base section and will use it instead of the base BTF it is passed in BTF id resolution. Modules will be built with a distilled .BTF.base section for external module build, i.e. make -C. -M=path2/module ...while in-tree module build as part of a normal kernel build will not generate distilled base BTF; this is because in-tree modules change with the kernel and do not require BTF relocation for the running vmlinux. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240620091733.1967885-6-alan.maguire@oracle.com Conflicts: scripts/Makefile.btf [The conflicts were due to not merge commit ebb79e96f1ea.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- scripts/Makefile.btf | 5 +++++ scripts/Makefile.modfinal | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/scripts/Makefile.btf b/scripts/Makefile.btf index bca8a8f26ea4..191b4903e864 100644 --- a/scripts/Makefile.btf +++ b/scripts/Makefile.btf @@ -21,8 +21,13 @@ else # Switch to using --btf_features for v1.26 and later. pahole-flags-$(call test-ge, $(pahole-ver), 126) = -j --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func +ifneq ($(KBUILD_EXTMOD),) +module-pahole-flags-$(call test-ge, $(pahole-ver), 126) += --btf_features=distilled_base +endif + endif pahole-flags-$(CONFIG_PAHOLE_HAS_LANG_EXCLUDE) += --lang_exclude=rust export PAHOLE_FLAGS := $(pahole-flags-y) +export MODULE_PAHOLE_FLAGS := $(module-pahole-flags-y) diff --git a/scripts/Makefile.modfinal b/scripts/Makefile.modfinal index 1979913aff68..4e07b3501568 100644 --- a/scripts/Makefile.modfinal +++ b/scripts/Makefile.modfinal @@ -42,7 +42,7 @@ quiet_cmd_btf_ko = BTF [M] $@ if [ ! -f vmlinux ]; then \ printf "Skipping BTF generation for %s due to unavailability of vmlinux\n" $@ 1>&2; \ else \ - LLVM_OBJCOPY="$(OBJCOPY)" $(PAHOLE) -J $(PAHOLE_FLAGS) --btf_base vmlinux $@; \ + LLVM_OBJCOPY="$(OBJCOPY)" $(PAHOLE) -J $(PAHOLE_FLAGS) $(MODULE_PAHOLE_FLAGS) --btf_base vmlinux $@; \ $(RESOLVE_BTFIDS) -b vmlinux $@; \ fi; -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 47a8cf0c5b3f6769b9d558301735c75119a0a165 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- add simple kfuncs to create/destroy a context type to bpf_testmod, register them and add a kfunc_call test to use them. This provides test coverage for registration of dtor kfuncs from modules. By transferring the context pointer to a map value as a __kptr we also trigger the map-based dtor cleanup logic, improving test coverage. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240620091733.1967885-7-alan.maguire@oracle.com Conflicts: tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h [The conflicts were due to some minor issue.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 46 +++++++++++++++++++ .../bpf/bpf_testmod/bpf_testmod_kfunc.h | 9 ++++ .../selftests/bpf/prog_tests/kfunc_call.c | 1 + .../selftests/bpf/progs/kfunc_call_test.c | 37 +++++++++++++++ 4 files changed, 93 insertions(+) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 2c5c64950781..700bba39a22c 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -138,6 +138,37 @@ __bpf_kfunc void bpf_iter_testmod_seq_destroy(struct bpf_iter_testmod_seq *it) it->cnt = 0; } +__bpf_kfunc struct bpf_testmod_ctx * +bpf_testmod_ctx_create(int *err) +{ + struct bpf_testmod_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_ATOMIC); + if (!ctx) { + *err = -ENOMEM; + return NULL; + } + refcount_set(&ctx->usage, 1); + + return ctx; +} + +static void testmod_free_cb(struct rcu_head *head) +{ + struct bpf_testmod_ctx *ctx; + + ctx = container_of(head, struct bpf_testmod_ctx, rcu); + kfree(ctx); +} + +__bpf_kfunc void bpf_testmod_ctx_release(struct bpf_testmod_ctx *ctx) +{ + if (!ctx) + return; + if (refcount_dec_and_test(&ctx->usage)) + call_rcu(&ctx->rcu, testmod_free_cb); +} + struct bpf_testmod_btf_type_tag_1 { int a; }; @@ -343,8 +374,14 @@ BTF_KFUNCS_START(bpf_testmod_common_kfunc_ids) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_testmod_ctx_create, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_testmod_ctx_release, KF_RELEASE) BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids) +BTF_ID_LIST(bpf_testmod_dtor_ids) +BTF_ID(struct, bpf_testmod_ctx) +BTF_ID(func, bpf_testmod_ctx_release) + static const struct btf_kfunc_id_set bpf_testmod_common_kfunc_set = { .owner = THIS_MODULE, .set = &bpf_testmod_common_kfunc_ids, @@ -608,6 +645,12 @@ extern int bpf_fentry_test1(int a); static int bpf_testmod_init(void) { + const struct btf_id_dtor_kfunc bpf_testmod_dtors[] = { + { + .btf_id = bpf_testmod_dtor_ids[0], + .kfunc_btf_id = bpf_testmod_dtor_ids[1] + }, + }; int ret; ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &bpf_testmod_common_kfunc_set); @@ -615,6 +658,9 @@ static int bpf_testmod_init(void) ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_testmod_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_testmod_kfunc_set); ret = ret ?: register_bpf_struct_ops(&bpf_bpf_testmod_ops, bpf_testmod_ops); + ret = ret ?: register_btf_id_dtor_kfuncs(bpf_testmod_dtors, + ARRAY_SIZE(bpf_testmod_dtors), + THIS_MODULE); if (ret < 0) return ret; if (bpf_fentry_test1(0) < 0) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h index f5c5b1375c24..9e0e6821539a 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h @@ -64,6 +64,11 @@ struct prog_test_fail3 { char arr2[]; }; +struct bpf_testmod_ctx { + struct callback_head rcu; + refcount_t usage; +}; + struct prog_test_ref_kfunc * bpf_kfunc_call_test_acquire(unsigned long *scalar_ptr) __ksym; void bpf_kfunc_call_test_release(struct prog_test_ref_kfunc *p) __ksym; @@ -104,4 +109,8 @@ void bpf_kfunc_call_test_fail1(struct prog_test_fail1 *p); void bpf_kfunc_call_test_fail2(struct prog_test_fail2 *p); void bpf_kfunc_call_test_fail3(struct prog_test_fail3 *p); void bpf_kfunc_call_test_mem_len_fail1(void *mem, int len); + +struct bpf_testmod_ctx *bpf_testmod_ctx_create(int *err) __ksym; +void bpf_testmod_ctx_release(struct bpf_testmod_ctx *ctx) __ksym; + #endif /* _BPF_TESTMOD_KFUNC_H */ diff --git a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c index 2eb71559713c..5b743212292f 100644 --- a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c +++ b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c @@ -78,6 +78,7 @@ static struct kfunc_test_params kfunc_tests[] = { SYSCALL_TEST(kfunc_syscall_test, 0), SYSCALL_NULL_CTX_TEST(kfunc_syscall_test_null, 0), TC_TEST(kfunc_call_test_static_unused_arg, 0), + TC_TEST(kfunc_call_ctx, 0), }; struct syscall_test_args { diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test.c b/tools/testing/selftests/bpf/progs/kfunc_call_test.c index cf68d1e48a0f..f502f755f567 100644 --- a/tools/testing/selftests/bpf/progs/kfunc_call_test.c +++ b/tools/testing/selftests/bpf/progs/kfunc_call_test.c @@ -177,4 +177,41 @@ int kfunc_call_test_static_unused_arg(struct __sk_buff *skb) return actual != expected ? -1 : 0; } +struct ctx_val { + struct bpf_testmod_ctx __kptr *ctx; +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, int); + __type(value, struct ctx_val); +} ctx_map SEC(".maps"); + +SEC("tc") +int kfunc_call_ctx(struct __sk_buff *skb) +{ + struct bpf_testmod_ctx *ctx; + int err = 0; + + ctx = bpf_testmod_ctx_create(&err); + if (!ctx && !err) + err = -1; + if (ctx) { + int key = 0; + struct ctx_val *ctx_val = bpf_map_lookup_elem(&ctx_map, &key); + + /* Transfer ctx to map to be freed via implicit dtor call + * on cleanup. + */ + if (ctx_val) + ctx = bpf_kptr_xchg(&ctx_val->ctx, ctx); + if (ctx) { + bpf_testmod_ctx_release(ctx); + err = -1; + } + } + return err; +} + char _license[] SEC("license") = "GPL"; -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 5a532459aa919d055d822d8db4ea2c5c8d511568 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Kernel test robot reports that kernel build fails with resilient split BTF changes. Examining the associated config and code we see that btf_relocate_id() is defined under CONFIG_DEBUG_INFO_BTF_MODULES. Moving it outside the #ifdef solves the issue. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202406221742.d2srFLVI-lkp@intel.com/ Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/r/20240623135224.27981-1-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/btf.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 95dea7f2ffe4..3fe2b93b50bc 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -5861,8 +5861,6 @@ struct btf *btf_parse_vmlinux(void) return btf; } -#ifdef CONFIG_DEBUG_INFO_BTF_MODULES - /* If .BTF_ids section was created with distilled base BTF, both base and * split BTF ids will need to be mapped to actual base/split ids for * BTF now that it has been relocated. @@ -5874,6 +5872,8 @@ static __u32 btf_relocate_id(const struct btf *btf, __u32 id) return btf->base_id_map[id]; } +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES + static struct btf *btf_parse_module(const char *module_name, const void *data, unsigned int data_size, void *base_data, unsigned int base_data_size) -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.11-rc1 commit 5b747c23f17d791e08fdf4baa7e14b704625518c category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Coverity points out that after calling btf__new_empty_split() the wrong value is checked for error. Fixes: 58e185a0dc35 ("libbpf: Add btf__distill_base() creating split BTF with distilled base BTF") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240629100058.2866763-1-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index db8cfaeba6f7..ed64b5c207b8 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -5263,7 +5263,7 @@ int btf__distill_base(const struct btf *src_btf, struct btf **new_base_btf, * BTF available. */ new_split = btf__new_empty_split(new_base); - if (!new_split_btf) { + if (!new_split) { err = -errno; goto done; } -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> mainline inclusion from mainline-v6.11-rc1 commit 9f1e16fb1fc9826001c69e0551d51fbbcd2d74e9 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- We get the size of the trampoline image during the dry run phase and allocate memory based on that size. The allocated image will then be populated with instructions during the real patch phase. But after commit 26ef208c209a ("bpf: Use arch_bpf_trampoline_size"), the `im` argument is inconsistent in the dry run and real patch phase. This may cause emit_imm in RV64 to generate a different number of instructions when generating the 'im' address, potentially causing out-of-bounds issues. Let's emit the maximum number of instructions for the "im" address during dry run to fix this problem. Fixes: 26ef208c209a ("bpf: Use arch_bpf_trampoline_size") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240622030437.3973492-3-pulehui@huaweicloud.com Conflicts: arch/riscv/net/bpf_jit_comp64.c [Fix conflict because of context diff.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- arch/riscv/net/bpf_jit_comp64.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c index 4ae7e6e7c061..a509fab76c84 100644 --- a/arch/riscv/net/bpf_jit_comp64.c +++ b/arch/riscv/net/bpf_jit_comp64.c @@ -15,6 +15,8 @@ #define RV_FENTRY_NINSNS 2 #define RV_FENTRY_NBYTES (RV_FENTRY_NINSNS * 4) +/* imm that allows emit_imm to emit max count insns */ +#define RV_MAX_COUNT_IMM 0x7FFF7FF7FF7FF7FF #define RV_REG_TCC RV_REG_A6 #define RV_REG_TCC_SAVED RV_REG_S6 /* Store A6 in S6 if program do calls */ @@ -922,7 +924,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, orig_call += RV_FENTRY_NINSNS * 4; if (flags & BPF_TRAMP_F_CALL_ORIG) { - emit_imm(RV_REG_A0, (const s64)im, ctx); + emit_imm(RV_REG_A0, ctx->insns ? (const s64)im : RV_MAX_COUNT_IMM, ctx); ret = emit_call((const u64)__bpf_tramp_enter, true, ctx); if (ret) return ret; @@ -983,7 +985,7 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, if (flags & BPF_TRAMP_F_CALL_ORIG) { im->ip_epilogue = ctx->insns + ctx->ninsns; - emit_imm(RV_REG_A0, (const s64)im, ctx); + emit_imm(RV_REG_A0, ctx->insns ? (const s64)im : RV_MAX_COUNT_IMM, ctx); ret = emit_call((const u64)__bpf_tramp_exit, true, ctx); if (ret) goto out; @@ -1052,6 +1054,7 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, { int ret; struct rv_jit_context ctx; + u32 size = image_end - image; ctx.ninsns = 0; /* @@ -1065,11 +1068,16 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, ctx.ro_insns = image; ret = __arch_prepare_bpf_trampoline(im, m, tlinks, func_addr, flags, &ctx); if (ret < 0) - return ret; + goto out; - bpf_flush_icache(ctx.insns, ctx.insns + ctx.ninsns); + if (WARN_ON(size < ninsns_rvoff(ctx.ninsns))) { + ret = -E2BIG; + goto out; + } - return ninsns_rvoff(ret); + bpf_flush_icache(image, image_end); +out: + return ret < 0 ? ret : size; } int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx, -- 2.34.1
From: Alan Maguire <alan.maguire@oracle.com> mainline inclusion from mainline-v6.12-rc1 commit 4a4c013d3385b0db85dc361203dc42ff048b6fd6 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- License should be // SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) ...as with other libbpf files. Fixes: 19e00c897d50 ("libbpf: Split BTF relocation") Reported-by: Neill Kapron <nkapron@google.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240810093504.2111134-1-alan.maguire@oracle.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf_relocate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/btf_relocate.c b/tools/lib/bpf/btf_relocate.c index 2281dbbafa11..6c9a16115bf0 100644 --- a/tools/lib/bpf/btf_relocate.c +++ b/tools/lib/bpf/btf_relocate.c @@ -1,4 +1,4 @@ -// SPDX-License-Identifier: GPL-2.0 +// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) /* Copyright (c) 2024, Oracle and/or its affiliates. */ #ifndef _GNU_SOURCE -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 496ddd19a0fad22f250fc7a7b7a8000155418934 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Verifier enforces that all iterator structs are named `bpf_iter_<name>` and that whenever iterator is passed to a kfunc it's passed as a valid PTR -> STRUCT chain (with potentially const modifiers in between). We'll need this check for upcoming changes, so instead of duplicating the logic, extract it into a helper function. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240808232230.2848712-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf.h | 5 +++++ kernel/bpf/btf.c | 50 ++++++++++++++++++++++++++++++++------------- 2 files changed, 41 insertions(+), 14 deletions(-) diff --git a/include/linux/btf.h b/include/linux/btf.h index f477f7e23ca3..3dad546d37aa 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -580,6 +580,7 @@ btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf, int get_kern_ctx_btf_id(struct bpf_verifier_log *log, enum bpf_prog_type prog_type); bool btf_types_are_same(const struct btf *btf1, u32 id1, const struct btf *btf2, u32 id2); +int btf_check_iter_arg(struct btf *btf, const struct btf_type *func, int arg_idx); #else static inline const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) @@ -654,6 +655,10 @@ static inline bool btf_types_are_same(const struct btf *btf1, u32 id1, { return false; } +static inline int btf_check_iter_arg(struct btf *btf, const struct btf_type *func, int arg_idx) +{ + return -EOPNOTSUPP; +} #endif static inline bool btf_type_is_struct_ptr(struct btf *btf, const struct btf_type *t) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 3fe2b93b50bc..a3918dd19331 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -7968,15 +7968,44 @@ BTF_ID_LIST_GLOBAL(btf_tracing_ids, MAX_BTF_TRACING_TYPE) BTF_TRACING_TYPE_xxx #undef BTF_TRACING_TYPE +/* Validate well-formedness of iter argument type. + * On success, return positive BTF ID of iter state's STRUCT type. + * On error, negative error is returned. + */ +int btf_check_iter_arg(struct btf *btf, const struct btf_type *func, int arg_idx) +{ + const struct btf_param *arg; + const struct btf_type *t; + const char *name; + int btf_id; + + if (btf_type_vlen(func) <= arg_idx) + return -EINVAL; + + arg = &btf_params(func)[arg_idx]; + t = btf_type_skip_modifiers(btf, arg->type, NULL); + if (!t || !btf_type_is_ptr(t)) + return -EINVAL; + t = btf_type_skip_modifiers(btf, t->type, &btf_id); + if (!t || !__btf_type_is_struct(t)) + return -EINVAL; + + name = btf_name_by_offset(btf, t->name_off); + if (!name || strncmp(name, ITER_PREFIX, sizeof(ITER_PREFIX) - 1)) + return -EINVAL; + + return btf_id; +} + static int btf_check_iter_kfuncs(struct btf *btf, const char *func_name, const struct btf_type *func, u32 func_flags) { u32 flags = func_flags & (KF_ITER_NEW | KF_ITER_NEXT | KF_ITER_DESTROY); - const char *name, *sfx, *iter_name; - const struct btf_param *arg; + const char *sfx, *iter_name; const struct btf_type *t; char exp_name[128]; u32 nr_args; + int btf_id; /* exactly one of KF_ITER_{NEW,NEXT,DESTROY} can be set */ if (!flags || (flags & (flags - 1))) @@ -7987,28 +8016,21 @@ static int btf_check_iter_kfuncs(struct btf *btf, const char *func_name, if (nr_args < 1) return -EINVAL; - arg = &btf_params(func)[0]; - t = btf_type_skip_modifiers(btf, arg->type, NULL); - if (!t || !btf_type_is_ptr(t)) - return -EINVAL; - t = btf_type_skip_modifiers(btf, t->type, NULL); - if (!t || !__btf_type_is_struct(t)) - return -EINVAL; - - name = btf_name_by_offset(btf, t->name_off); - if (!name || strncmp(name, ITER_PREFIX, sizeof(ITER_PREFIX) - 1)) - return -EINVAL; + btf_id = btf_check_iter_arg(btf, func, 0); + if (btf_id < 0) + return btf_id; /* sizeof(struct bpf_iter_<type>) should be a multiple of 8 to * fit nicely in stack slots */ + t = btf_type_by_id(btf, btf_id); if (t->size == 0 || (t->size % 8)) return -EINVAL; /* validate bpf_iter_<type>_{new,next,destroy}(struct bpf_iter_<type> *) * naming pattern */ - iter_name = name + sizeof(ITER_PREFIX) - 1; + iter_name = btf_name_by_offset(btf, t->name_off) + sizeof(ITER_PREFIX) - 1; if (flags & KF_ITER_NEW) sfx = "new"; else if (flags & KF_ITER_NEXT) -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit baebe9aaba1e59e34cd1fe6455bb4c3029ad3bc1 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- There are potentially useful cases where a specific iterator type might need to be passed into some kfunc. So, in addition to existing bpf_iter_<type>_{new,next,destroy}() kfuncs, allow to pass iterator pointer to any kfunc. We employ "__iter" naming suffix for arguments that are meant to accept iterators. We also enforce that they accept PTR -> STRUCT btf_iter_<type> type chain and point to a valid initialized on-the-stack iterator state. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240808232230.2848712-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/verifier.c | 35 ++++++++++++++++++++++++----------- 1 file changed, 24 insertions(+), 11 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 94a9c03a1d53..c2c561f57418 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -7839,12 +7839,17 @@ static bool is_iter_destroy_kfunc(struct bpf_kfunc_call_arg_meta *meta) return meta->kfunc_flags & KF_ITER_DESTROY; } -static bool is_kfunc_arg_iter(struct bpf_kfunc_call_arg_meta *meta, int arg) +static bool is_kfunc_arg_iter(struct bpf_kfunc_call_arg_meta *meta, int arg_idx, + const struct btf_param *arg) { /* btf_check_iter_kfuncs() guarantees that first argument of any iter * kfunc is iter state pointer */ - return arg == 0 && is_iter_kfunc(meta); + if (is_iter_kfunc(meta)) + return arg_idx == 0; + + /* iter passed as an argument to a generic kfunc */ + return btf_param_match_suffix(meta->btf, arg, "__iter"); } static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_idx, @@ -7852,14 +7857,20 @@ static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_id { struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno]; const struct btf_type *t; - const struct btf_param *arg; - int spi, err, i, nr_slots; - u32 btf_id; + int spi, err, i, nr_slots, btf_id; - /* btf_check_iter_kfuncs() ensures we don't need to validate anything here */ - arg = &btf_params(meta->func_proto)[0]; - t = btf_type_skip_modifiers(meta->btf, arg->type, NULL); /* PTR */ - t = btf_type_skip_modifiers(meta->btf, t->type, &btf_id); /* STRUCT */ + /* For iter_{new,next,destroy} functions, btf_check_iter_kfuncs() + * ensures struct convention, so we wouldn't need to do any BTF + * validation here. But given iter state can be passed as a parameter + * to any kfunc, if arg has "__iter" suffix, we need to be a bit more + * conservative here. + */ + btf_id = btf_check_iter_arg(meta->btf, meta->func_proto, regno - 1); + if (btf_id < 0) { + verbose(env, "expected valid iter pointer as arg #%d\n", regno); + return -EINVAL; + } + t = btf_type_by_id(meta->btf, btf_id); nr_slots = t->size / BPF_REG_SIZE; if (is_iter_new_kfunc(meta)) { @@ -7881,7 +7892,9 @@ static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_id if (err) return err; } else { - /* iter_next() or iter_destroy() expect initialized iter state*/ + /* iter_next() or iter_destroy(), as well as any kfunc + * accepting iter argument, expect initialized iter state + */ err = is_iter_reg_valid_init(env, reg, meta->btf, btf_id, nr_slots); switch (err) { case 0: @@ -10993,7 +11006,7 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env, if (is_kfunc_arg_dynptr(meta->btf, &args[argno])) return KF_ARG_PTR_TO_DYNPTR; - if (is_kfunc_arg_iter(meta, argno)) + if (is_kfunc_arg_iter(meta, argno, &args[argno])) return KF_ARG_PTR_TO_ITER; if (is_kfunc_arg_list_head(meta->btf, &args[argno])) -- 2.34.1
From: Andrii Nakryiko <andrii@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit b0cd726f9a8279b33faa5b4b0011df14f331fb33 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Define BPF iterator "getter" kfunc, which accepts iterator pointer as one of the arguments. Make sure that argument passed doesn't have to be the very first argument (unlike new-next-destroy combo). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240808232230.2848712-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c [Fix conflicts caused by context diff.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 16 ++++-- .../selftests/bpf/progs/iters_testmod_seq.c | 50 +++++++++++++++++++ 2 files changed, 62 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 700bba39a22c..72d26adba834 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -112,13 +112,12 @@ bpf_testmod_test_mod_kfunc(int i) __bpf_kfunc int bpf_iter_testmod_seq_new(struct bpf_iter_testmod_seq *it, s64 value, int cnt) { - if (cnt < 0) { - it->cnt = 0; + it->cnt = cnt; + + if (cnt < 0) return -EINVAL; - } it->value = value; - it->cnt = cnt; return 0; } @@ -133,6 +132,14 @@ __bpf_kfunc s64 *bpf_iter_testmod_seq_next(struct bpf_iter_testmod_seq* it) return &it->value; } +__bpf_kfunc s64 bpf_iter_testmod_seq_value(int val, struct bpf_iter_testmod_seq* it__iter) +{ + if (it__iter->cnt < 0) + return 0; + + return val + it__iter->value; +} + __bpf_kfunc void bpf_iter_testmod_seq_destroy(struct bpf_iter_testmod_seq *it) { it->cnt = 0; @@ -376,6 +383,7 @@ BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, bpf_testmod_ctx_create, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_testmod_ctx_release, KF_RELEASE) +BTF_ID_FLAGS(func, bpf_iter_testmod_seq_value) BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids) BTF_ID_LIST(bpf_testmod_dtor_ids) diff --git a/tools/testing/selftests/bpf/progs/iters_testmod_seq.c b/tools/testing/selftests/bpf/progs/iters_testmod_seq.c index 3873fb6c292a..4a176e6aede8 100644 --- a/tools/testing/selftests/bpf/progs/iters_testmod_seq.c +++ b/tools/testing/selftests/bpf/progs/iters_testmod_seq.c @@ -12,6 +12,7 @@ struct bpf_iter_testmod_seq { extern int bpf_iter_testmod_seq_new(struct bpf_iter_testmod_seq *it, s64 value, int cnt) __ksym; extern s64 *bpf_iter_testmod_seq_next(struct bpf_iter_testmod_seq *it) __ksym; +extern s64 bpf_iter_testmod_seq_value(int blah, struct bpf_iter_testmod_seq *it) __ksym; extern void bpf_iter_testmod_seq_destroy(struct bpf_iter_testmod_seq *it) __ksym; const volatile __s64 exp_empty = 0 + 1; @@ -76,4 +77,53 @@ int testmod_seq_truncated(const void *ctx) return 0; } +SEC("?raw_tp") +__failure +__msg("expected an initialized iter_testmod_seq as arg #2") +int testmod_seq_getter_before_bad(const void *ctx) +{ + struct bpf_iter_testmod_seq it; + + return bpf_iter_testmod_seq_value(0, &it); +} + +SEC("?raw_tp") +__failure +__msg("expected an initialized iter_testmod_seq as arg #2") +int testmod_seq_getter_after_bad(const void *ctx) +{ + struct bpf_iter_testmod_seq it; + s64 sum = 0, *v; + + bpf_iter_testmod_seq_new(&it, 100, 100); + + while ((v = bpf_iter_testmod_seq_next(&it))) { + sum += *v; + } + + bpf_iter_testmod_seq_destroy(&it); + + return sum + bpf_iter_testmod_seq_value(0, &it); +} + +SEC("?socket") +__success __retval(1000000) +int testmod_seq_getter_good(const void *ctx) +{ + struct bpf_iter_testmod_seq it; + s64 sum = 0, *v; + + bpf_iter_testmod_seq_new(&it, 100, 100); + + while ((v = bpf_iter_testmod_seq_next(&it))) { + sum += *v; + } + + sum *= bpf_iter_testmod_seq_value(0, &it); + + bpf_iter_testmod_seq_destroy(&it); + + return sum; +} + char _license[] SEC("license") = "GPL"; -- 2.34.1
From: Martin KaFai Lau <martin.lau@kernel.org> mainline inclusion from mainline-v6.11-rc7 commit b408473ea01b2e499d23503e2bf898416da9d7ac category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The pointer returned by btf_parse_base could be an error pointer. IS_ERR() check is needed before calling btf_free(base_btf). Fixes: 8646db238997 ("libbpf,bpf: Share BTF relocate-related code with kernel") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240830012214.1646005-1-martin.lau@linux.dev Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/btf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index a3918dd19331..3b6479ea0eb9 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -5959,7 +5959,7 @@ static struct btf *btf_parse_module(const char *module_name, const void *data, errout: btf_verifier_env_free(env); - if (base_btf != vmlinux_btf) + if (!IS_ERR(base_btf) && base_btf != vmlinux_btf) btf_free(base_btf); if (btf) { kvfree(btf->data); -- 2.34.1
From: Tony Ambardar <tony.ambardar@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit da18bfa59d403040d8bcba1284285916fe9e4425 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- New split BTF needs to preserve base's endianness. Similarly, when creating a distilled BTF, we need to preserve original endianness. Fix by updating libbpf's btf__distill_base() and btf_new_empty() to retain the byte order of any source BTF objects when creating new ones. Fixes: ba451366bf44 ("libbpf: Implement basic split BTF support") Fixes: 58e185a0dc35 ("libbpf: Add btf__distill_base() creating split BTF with distilled base BTF") Reported-by: Song Liu <song@kernel.org> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/6358db36c5f68b07873a0a5be2d062b1af5ea5f8.camel@g... Link: https://lore.kernel.org/bpf/20240830095150.278881-1-tony.ambardar@gmail.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index ed64b5c207b8..f433a5282042 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -837,6 +837,7 @@ static struct btf *btf_new_empty(struct btf *base_btf) btf->base_btf = base_btf; btf->start_id = btf__type_cnt(base_btf); btf->start_str_off = base_btf->hdr->str_len; + btf->swapped_endian = base_btf->swapped_endian; } /* +1 for empty string at offset 0 */ @@ -5226,6 +5227,9 @@ int btf__distill_base(const struct btf *src_btf, struct btf **new_base_btf, new_base = btf__new_empty(); if (!new_base) return libbpf_err(-ENOMEM); + + btf__set_endianness(new_base, btf__endianness(src_btf)); + dist.id_map = calloc(n, sizeof(*dist.id_map)); if (!dist.id_map) { err = -ENOMEM; -- 2.34.1
From: Eduard Zingerman <eddyz87@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit 181b0d1af5d5b2ba8996fa715382cbfbfef46b6e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Create a BTF with endianness different from host, make a distilled base/split BTF pair from it, dump as raw bytes, import again and verify that endianness is preserved. Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240830173406.1581007-1-eddyz87@gmail.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/btf_distill.c | 68 +++++++++++++++++++ 1 file changed, 68 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_distill.c b/tools/testing/selftests/bpf/prog_tests/btf_distill.c index bfbe795823a2..ca84726d5ac1 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_distill.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_distill.c @@ -535,6 +535,72 @@ static void test_distilled_base_vmlinux(void) btf__free(vmlinux_btf); } +/* Split and new base BTFs should inherit endianness from source BTF. */ +static void test_distilled_endianness(void) +{ + struct btf *base = NULL, *split = NULL, *new_base = NULL, *new_split = NULL; + struct btf *new_base1 = NULL, *new_split1 = NULL; + enum btf_endianness inverse_endianness; + const void *raw_data; + __u32 size; + + base = btf__new_empty(); + if (!ASSERT_OK_PTR(base, "empty_main_btf")) + return; + inverse_endianness = btf__endianness(base) == BTF_LITTLE_ENDIAN ? BTF_BIG_ENDIAN + : BTF_LITTLE_ENDIAN; + btf__set_endianness(base, inverse_endianness); + btf__add_int(base, "int", 4, BTF_INT_SIGNED); /* [1] int */ + VALIDATE_RAW_BTF( + base, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); + split = btf__new_empty_split(base); + if (!ASSERT_OK_PTR(split, "empty_split_btf")) + goto cleanup; + btf__add_ptr(split, 1); + VALIDATE_RAW_BTF( + split, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] PTR '(anon)' type_id=1"); + if (!ASSERT_EQ(0, btf__distill_base(split, &new_base, &new_split), + "distilled_base") || + !ASSERT_OK_PTR(new_base, "distilled_base") || + !ASSERT_OK_PTR(new_split, "distilled_split") || + !ASSERT_EQ(2, btf__type_cnt(new_base), "distilled_base_type_cnt")) + goto cleanup; + VALIDATE_RAW_BTF( + new_split, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] PTR '(anon)' type_id=1"); + + raw_data = btf__raw_data(new_base, &size); + if (!ASSERT_OK_PTR(raw_data, "btf__raw_data #1")) + goto cleanup; + new_base1 = btf__new(raw_data, size); + if (!ASSERT_OK_PTR(new_base1, "new_base1 = btf__new()")) + goto cleanup; + raw_data = btf__raw_data(new_split, &size); + if (!ASSERT_OK_PTR(raw_data, "btf__raw_data #2")) + goto cleanup; + new_split1 = btf__new_split(raw_data, size, new_base1); + if (!ASSERT_OK_PTR(new_split1, "new_split1 = btf__new()")) + goto cleanup; + + ASSERT_EQ(btf__endianness(new_base1), inverse_endianness, "new_base1 endianness"); + ASSERT_EQ(btf__endianness(new_split1), inverse_endianness, "new_split1 endianness"); + VALIDATE_RAW_BTF( + new_split1, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] PTR '(anon)' type_id=1"); +cleanup: + btf__free(new_split1); + btf__free(new_base1); + btf__free(new_split); + btf__free(new_base); + btf__free(split); + btf__free(base); +} + void test_btf_distill(void) { if (test__start_subtest("distilled_base")) @@ -549,4 +615,6 @@ void test_btf_distill(void) test_distilled_base_multi_err2(); if (test__start_subtest("distilled_base_vmlinux")) test_distilled_base_vmlinux(); + if (test__start_subtest("distilled_endianness")) + test_distilled_endianness(); } -- 2.34.1
From: Hou Tao <houtao1@huawei.com> mainline inclusion from mainline-v6.12-rc6 commit 101ccfbabf4738041273ce64e2b116cf440dea13 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- bpf_iter_bits_destroy() uses "kit->nr_bits <= 64" to check whether the bits are dynamically allocated. However, the check is incorrect and may cause a kmemleak as shown below: unreferenced object 0xffff88812628c8c0 (size 32): comm "swapper/0", pid 1, jiffies 4294727320 hex dump (first 32 bytes): b0 c1 55 f5 81 88 ff ff f0 f0 f0 f0 f0 f0 f0 f0 ..U........... f0 f0 f0 f0 f0 f0 f0 f0 00 00 00 00 00 00 00 00 .............. backtrace (crc 781e32cc): [<00000000c452b4ab>] kmemleak_alloc+0x4b/0x80 [<0000000004e09f80>] __kmalloc_node_noprof+0x480/0x5c0 [<00000000597124d6>] __alloc.isra.0+0x89/0xb0 [<000000004ebfffcd>] alloc_bulk+0x2af/0x720 [<00000000d9c10145>] prefill_mem_cache+0x7f/0xb0 [<00000000ff9738ff>] bpf_mem_alloc_init+0x3e2/0x610 [<000000008b616eac>] bpf_global_ma_init+0x19/0x30 [<00000000fc473efc>] do_one_initcall+0xd3/0x3c0 [<00000000ec81498c>] kernel_init_freeable+0x66a/0x940 [<00000000b119f72f>] kernel_init+0x20/0x160 [<00000000f11ac9a7>] ret_from_fork+0x3c/0x70 [<0000000004671da4>] ret_from_fork_asm+0x1a/0x30 That is because nr_bits will be set as zero in bpf_iter_bits_next() after all bits have been iterated. Fix the issue by setting kit->bit to kit->nr_bits instead of setting kit->nr_bits to zero when the iteration completes in bpf_iter_bits_next(). In addition, use "!nr_bits || bits >= nr_bits" to check whether the iteration is complete and still use "nr_bits > 64" to indicate whether bits are dynamically allocated. The "!nr_bits" check is necessary because bpf_iter_bits_new() may fail before setting kit->nr_bits, and this condition will stop the iteration early instead of accessing the zeroed or freed kit->bits. Considering the initial value of kit->bits is -1 and the type of kit->nr_bits is unsigned int, change the type of kit->nr_bits to int. The potential overflow problem will be handled in the following patch. Fixes: 4665415975b0 ("bpf: Add bits iterator") Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 364498c643de..a6897551899a 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2634,7 +2634,7 @@ struct bpf_iter_bits_kern { unsigned long *bits; unsigned long bits_copy; }; - u32 nr_bits; + int nr_bits; int bit; } __aligned(8); @@ -2708,17 +2708,16 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w __bpf_kfunc int *bpf_iter_bits_next(struct bpf_iter_bits *it) { struct bpf_iter_bits_kern *kit = (void *)it; - u32 nr_bits = kit->nr_bits; + int bit = kit->bit, nr_bits = kit->nr_bits; const unsigned long *bits; - int bit; - if (nr_bits == 0) + if (!nr_bits || bit >= nr_bits) return NULL; bits = nr_bits == 64 ? &kit->bits_copy : kit->bits; - bit = find_next_bit(bits, nr_bits, kit->bit + 1); + bit = find_next_bit(bits, nr_bits, bit + 1); if (bit >= nr_bits) { - kit->nr_bits = 0; + kit->bit = bit; return NULL; } -- 2.34.1
From: Hou Tao <houtao1@huawei.com> mainline inclusion from mainline-v6.12-rc6 commit 62a898b07b83f6f407003d8a70f0827a5af08a59 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Introduce bpf_mem_alloc_check_size() to check whether the allocation size exceeds the limitation for the kmalloc-equivalent allocator. The upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than non-percpu allocation, so a percpu argument is added to the helper. The helper will be used in the following patch to check whether the size parameter passed to bpf_mem_alloc() is too big. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Conflicts: include/linux/bpf_mem_alloc.h [Fix conflicts caused by context diff.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/bpf_mem_alloc.h | 3 +++ kernel/bpf/memalloc.c | 14 +++++++++++++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index bb1223b21308..50ab9929b3d0 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -25,6 +25,9 @@ struct bpf_mem_alloc { int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); +/* Check the allocation size for kmalloc equivalent allocator */ +int bpf_mem_alloc_check_size(bool percpu, size_t size); + /* kmalloc/kfree equivalent: */ void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size); void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr); diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 85f9501ff6e6..38fe720d0f87 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -35,6 +35,8 @@ */ #define LLIST_NODE_SZ sizeof(struct llist_node) +#define BPF_MEM_ALLOC_SIZE_MAX 4096 + /* similar to kmalloc, but sizeof == 8 bucket is gone */ static u8 size_index[24] __ro_after_init = { 3, /* 8 */ @@ -65,7 +67,7 @@ static u8 size_index[24] __ro_after_init = { static int bpf_mem_cache_idx(size_t size) { - if (!size || size > 4096) + if (!size || size > BPF_MEM_ALLOC_SIZE_MAX) return -1; if (size <= 192) @@ -930,3 +932,13 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags) return !ret ? NULL : ret + LLIST_NODE_SZ; } + +int bpf_mem_alloc_check_size(bool percpu, size_t size) +{ + /* The size of percpu allocation doesn't have LLIST_NODE_SZ overhead */ + if ((percpu && size > BPF_MEM_ALLOC_SIZE_MAX) || + (!percpu && size > BPF_MEM_ALLOC_SIZE_MAX - LLIST_NODE_SZ)) + return -E2BIG; + + return 0; +} -- 2.34.1
From: Hou Tao <houtao1@huawei.com> mainline inclusion from mainline-v6.12-rc6 commit 393397fbdcad7396639d7077c33f86169184ba99 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Check the validity of nr_words in bpf_iter_bits_new(). Without this check, when multiplication overflow occurs for nr_bits (e.g., when nr_words = 0x0400-0001, nr_bits becomes 64), stack corruption may occur due to bpf_probe_read_kernel_common(..., nr_bytes = 0x2000-0008). Fix it by limiting the maximum value of nr_words to 511. The value is derived from the current implementation of BPF memory allocator. To ensure compatibility if the BPF memory allocator's size limitation changes in the future, use the helper bpf_mem_alloc_check_size() to check whether nr_bytes is too larger. And return -E2BIG instead of -ENOMEM for oversized nr_bytes. Fixes: 4665415975b0 ("bpf: Add bits iterator") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a6897551899a..d80ea34b133b 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2629,6 +2629,8 @@ struct bpf_iter_bits { __u64 __opaque[2]; } __aligned(8); +#define BITS_ITER_NR_WORDS_MAX 511 + struct bpf_iter_bits_kern { union { unsigned long *bits; @@ -2643,7 +2645,8 @@ struct bpf_iter_bits_kern { * @it: The new bpf_iter_bits to be created * @unsafe_ptr__ign: A pointer pointing to a memory area to be iterated over * @nr_words: The size of the specified memory area, measured in 8-byte units. - * Due to the limitation of memalloc, it can't be greater than 512. + * The maximum value of @nr_words is @BITS_ITER_NR_WORDS_MAX. This limit may be + * further reduced by the BPF memory allocator implementation. * * This function initializes a new bpf_iter_bits structure for iterating over * a memory area which is specified by the @unsafe_ptr__ign and @nr_words. It @@ -2670,6 +2673,8 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w if (!unsafe_ptr__ign || !nr_words) return -EINVAL; + if (nr_words > BITS_ITER_NR_WORDS_MAX) + return -E2BIG; /* Optimization for u64 mask */ if (nr_bits == 64) { @@ -2681,6 +2686,9 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w return 0; } + if (bpf_mem_alloc_check_size(false, nr_bytes)) + return -E2BIG; + /* Fallback to memalloc */ kit->bits = bpf_mem_alloc(&bpf_global_ma, nr_bytes); if (!kit->bits) -- 2.34.1
From: Hou Tao <houtao1@huawei.com> mainline inclusion from mainline-v6.12-rc6 commit e1339383675063ae4760d81ffe13a79981841b8d category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- On 32-bit hosts (e.g., arm32), when a bpf program passes a u64 to bpf_iter_bits_new(), bpf_iter_bits_new() will use bits_copy to store the content of the u64. However, bits_copy is only 4 bytes, leading to stack corruption. The straightforward solution would be to replace u64 with unsigned long in bpf_iter_bits_new(). However, this introduces confusion and problems for 32-bit hosts because the size of ulong in bpf program is 8 bytes, but it is treated as 4-bytes after passed to bpf_iter_bits_new(). Fix it by changing the type of both bits and bit_count from unsigned long to u64. However, the change is not enough. The main reason is that bpf_iter_bits_next() uses find_next_bit() to find the next bit and the pointer passed to find_next_bit() is an unsigned long pointer instead of a u64 pointer. For 32-bit little-endian host, it is fine but it is not the case for 32-bit big-endian host. Because under 32-bit big-endian host, the first iterated unsigned long will be the bits 32-63 of the u64 instead of the expected bits 0-31. Therefore, in addition to changing the type, swap the two unsigned longs within the u64 for 32-bit big-endian host. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/bpf/helpers.c | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index d80ea34b133b..c6894e40fb02 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2633,13 +2633,36 @@ struct bpf_iter_bits { struct bpf_iter_bits_kern { union { - unsigned long *bits; - unsigned long bits_copy; + __u64 *bits; + __u64 bits_copy; }; int nr_bits; int bit; } __aligned(8); +/* On 64-bit hosts, unsigned long and u64 have the same size, so passing + * a u64 pointer and an unsigned long pointer to find_next_bit() will + * return the same result, as both point to the same 8-byte area. + * + * For 32-bit little-endian hosts, using a u64 pointer or unsigned long + * pointer also makes no difference. This is because the first iterated + * unsigned long is composed of bits 0-31 of the u64 and the second unsigned + * long is composed of bits 32-63 of the u64. + * + * However, for 32-bit big-endian hosts, this is not the case. The first + * iterated unsigned long will be bits 32-63 of the u64, so swap these two + * ulong values within the u64. + */ +static void swap_ulong_in_u64(u64 *bits, unsigned int nr) +{ +#if (BITS_PER_LONG == 32) && defined(__BIG_ENDIAN) + unsigned int i; + + for (i = 0; i < nr; i++) + bits[i] = (bits[i] >> 32) | ((u64)(u32)bits[i] << 32); +#endif +} + /** * bpf_iter_bits_new() - Initialize a new bits iterator for a given memory area * @it: The new bpf_iter_bits to be created @@ -2682,6 +2705,8 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w if (err) return -EFAULT; + swap_ulong_in_u64(&kit->bits_copy, nr_words); + kit->nr_bits = nr_bits; return 0; } @@ -2700,6 +2725,8 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w return err; } + swap_ulong_in_u64(kit->bits, nr_words); + kit->nr_bits = nr_bits; return 0; } @@ -2717,7 +2744,7 @@ __bpf_kfunc int *bpf_iter_bits_next(struct bpf_iter_bits *it) { struct bpf_iter_bits_kern *kit = (void *)it; int bit = kit->bit, nr_bits = kit->nr_bits; - const unsigned long *bits; + const void *bits; if (!nr_bits || bit >= nr_bits) return NULL; -- 2.34.1
From: Hou Tao <houtao1@huawei.com> mainline inclusion from mainline-v6.12-rc6 commit ebafc1e535db19505aec3b94a4a641fe735a2eac category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Add more test cases for bits iterator: (1) huge word test Verify the multiplication overflow of nr_bits in bits_iter. Without the overflow check, when nr_words is 67108865, nr_bits becomes 64, causing bpf_probe_read_kernel_common() to corrupt the stack. (2) max word test Verify correct handling of maximum nr_words value (511). (3) bad word test Verify early termination of bits iteration when bits iterator initialization fails. Also rename bits_nomem to bits_too_big to better reflect its purpose. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/progs/verifier_bits_iter.c | 61 ++++++++++++++++++- 1 file changed, 58 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c index 716113c2bce2..e1421d846ac8 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c +++ b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c @@ -15,6 +15,8 @@ int bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, int *bpf_iter_bits_next(struct bpf_iter_bits *it) __ksym __weak; void bpf_iter_bits_destroy(struct bpf_iter_bits *it) __ksym __weak; +u64 bits_array[511] = {}; + SEC("iter.s/cgroup") __description("bits iter without destroy") __failure __msg("Unreleased reference") @@ -110,16 +112,16 @@ int bit_index(void) } SEC("syscall") -__description("bits nomem") +__description("bits too big") __success __retval(0) -int bits_nomem(void) +int bits_too_big(void) { u64 data[4]; int nr = 0; int *bit; __builtin_memset(&data, 0xff, sizeof(data)); - bpf_for_each(bits, bit, &data[0], 513) /* Be greater than 512 */ + bpf_for_each(bits, bit, &data[0], 512) /* Be greater than 511 */ nr++; return nr; } @@ -151,3 +153,56 @@ int zero_words(void) nr++; return nr; } + +SEC("syscall") +__description("huge words") +__success __retval(0) +int huge_words(void) +{ + u64 data[8] = {0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1}; + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, &data[0], 67108865) + nr++; + return nr; +} + +SEC("syscall") +__description("max words") +__success __retval(4) +int max_words(void) +{ + volatile int nr = 0; + int *bit; + + bits_array[0] = (1ULL << 63) | 1U; + bits_array[510] = (1ULL << 33) | (1ULL << 32); + + bpf_for_each(bits, bit, bits_array, 511) { + if (nr == 0 && *bit != 0) + break; + if (nr == 2 && *bit != 32672) + break; + nr++; + } + return nr; +} + +SEC("syscall") +__description("bad words") +__success __retval(0) +int bad_words(void) +{ + void *bad_addr = (void *)(3UL << 30); + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, bad_addr, 1) + nr++; + + bpf_for_each(bits, bit, bad_addr, 4) + nr++; + + return nr; +} -- 2.34.1
From: Hou Tao <houtao1@huawei.com> mainline inclusion from mainline-v6.12 commit 6801cf7890f2ed8fcc14859b47501f8ee7a58ec7 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- As reported by Byeonguk, the bad_words test in verifier_bits_iter.c occasionally fails on s390 host. Quoting Ilya's explanation: s390 kernel runs in a completely separate address space, there is no user/kernel split at TASK_SIZE. The same address may be valid in both the kernel and the user address spaces, there is no way to tell by looking at it. The config option related to this property is ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. Also, unfortunately, 0 is a valid address in the s390 kernel address space. Fix the issue by using -4095 as the bad address for bits iterator, as suggested by Ilya. Verify that bpf_iter_bits_new() returns -EINVAL for NULL address and -EFAULT for bad address. Fixes: ebafc1e535db ("selftests/bpf: Add three test cases for bits_iter") Reported-by: Byeonguk Jeong <jungbu2855@gmail.com> Closes: https://lore.kernel.org/bpf/ZycSXwjH4UTvx-Cn@ub22/ Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20241105043057.3371482-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/progs/verifier_bits_iter.c | 32 ++++++++++++++++--- 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c index e1421d846ac8..b6b018baed46 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c +++ b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c @@ -57,9 +57,15 @@ __description("null pointer") __success __retval(0) int null_pointer(void) { - int nr = 0; + struct bpf_iter_bits iter; + int err, nr = 0; int *bit; + err = bpf_iter_bits_new(&iter, NULL, 1); + bpf_iter_bits_destroy(&iter); + if (err != -EINVAL) + return 1; + bpf_for_each(bits, bit, NULL, 1) nr++; return nr; @@ -194,15 +200,33 @@ __description("bad words") __success __retval(0) int bad_words(void) { - void *bad_addr = (void *)(3UL << 30); - int nr = 0; + void *bad_addr = (void *)-4095; + struct bpf_iter_bits iter; + volatile int nr; int *bit; + int err; + + err = bpf_iter_bits_new(&iter, bad_addr, 1); + bpf_iter_bits_destroy(&iter); + if (err != -EFAULT) + return 1; + nr = 0; bpf_for_each(bits, bit, bad_addr, 1) nr++; + if (nr != 0) + return 2; + err = bpf_iter_bits_new(&iter, bad_addr, 4); + bpf_iter_bits_destroy(&iter); + if (err != -EFAULT) + return 3; + + nr = 0; bpf_for_each(bits, bit, bad_addr, 4) nr++; + if (nr != 0) + return 4; - return nr; + return 0; } -- 2.34.1
From: Martin KaFai Lau <martin.lau@kernel.org> mainline inclusion from mainline-v6.14-rc1 commit 96ea081ed52bf077cad6d00153b6fba68e510767 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- There is a UAF report in the bpf_struct_ops when CONFIG_MODULES=n. In particular, the report is on tcp_congestion_ops that has a "struct module *owner" member. For struct_ops that has a "struct module *owner" member, it can be extended either by the regular kernel module or by the bpf_struct_ops. bpf_try_module_get() will be used to do the refcounting and different refcount is done based on the owner pointer. When CONFIG_MODULES=n, the btf_id of the "struct module" is missing: WARN: resolve_btfids: unresolved symbol module Thus, the bpf_try_module_get() cannot do the correct refcounting. Not all subsystem's struct_ops requires the "struct module *owner" member. e.g. the recent sched_ext_ops. This patch is to disable bpf_struct_ops registration if the struct_ops has the "struct module *" member and the "struct module" btf_id is missing. The btf_type_is_fwd() helper is moved to the btf.h header file for this test. This has happened since the beginning of bpf_struct_ops which has gone through many changes. The Fixes tag is set to a recent commit that this patch can apply cleanly. Considering CONFIG_MODULES=n is not common and the age of the issue, targeting for bpf-next also. Fixes: 1611603537a4 ("bpf: Create argument information for nullable arguments.") Reported-by: Robert Morris <rtm@csail.mit.edu> Closes: https://lore.kernel.org/bpf/74665.1733669976@localhost/ Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241220201818.127152-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/btf.h | 5 +++++ kernel/bpf/bpf_struct_ops.c | 21 +++++++++++++++++++++ kernel/bpf/btf.c | 5 ----- 3 files changed, 26 insertions(+), 5 deletions(-) diff --git a/include/linux/btf.h b/include/linux/btf.h index 3dad546d37aa..b597008ece90 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -352,6 +352,11 @@ static inline bool btf_type_is_scalar(const struct btf_type *t) return btf_type_is_int(t) || btf_type_is_enum(t); } +static inline bool btf_type_is_fwd(const struct btf_type *t) +{ + return BTF_INFO_KIND(t->info) == BTF_KIND_FWD; +} + static inline bool btf_type_is_typedef(const struct btf_type *t) { return BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF; diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 1409a844183c..32143bc7f73b 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -307,6 +307,20 @@ void bpf_struct_ops_desc_release(struct bpf_struct_ops_desc *st_ops_desc) kfree(arg_info); } +static bool is_module_member(const struct btf *btf, u32 id) +{ + const struct btf_type *t; + + t = btf_type_resolve_ptr(btf, id, NULL); + if (!t) + return false; + + if (!__btf_type_is_struct(t) && !btf_type_is_fwd(t)) + return false; + + return !strcmp(btf_name_by_offset(btf, t->name_off), "module"); +} + int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, struct btf *btf, struct bpf_verifier_log *log) @@ -381,6 +395,13 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc, goto errout; } + if (!st_ops_ids[IDX_MODULE_ID] && is_module_member(btf, member->type)) { + pr_warn("'struct module' btf id not found. Is CONFIG_MODULES enabled? bpf_struct_ops '%s' needs module support.\n", + st_ops->name); + err = -EOPNOTSUPP; + goto errout; + } + func_proto = btf_type_resolve_func_ptr(btf, member->type, NULL); diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 3b6479ea0eb9..1c5ed5d4b591 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -501,11 +501,6 @@ bool btf_type_is_void(const struct btf_type *t) return t == &btf_void; } -static bool btf_type_is_fwd(const struct btf_type *t) -{ - return BTF_INFO_KIND(t->info) == BTF_KIND_FWD; -} - static bool btf_type_is_datasec(const struct btf_type *t) { return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC; -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> mainline inclusion from mainline-v6.14-rc1 commit 4a04cb326a6c7f9a2c066f8c2ca78a5a9b87ddab category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- Fix btf leak on new btf alloc failure in btf_distill test. Fixes: affdeb50616b ("selftests/bpf: Extend distilled BTF tests to cover BTF relocation") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-1-pulehui@huaweicloud.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/testing/selftests/bpf/prog_tests/btf_distill.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_distill.c b/tools/testing/selftests/bpf/prog_tests/btf_distill.c index ca84726d5ac1..b72b966df77b 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_distill.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_distill.c @@ -385,7 +385,7 @@ static void test_distilled_base_missing_err(void) "[2] INT 'int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED"); btf5 = btf__new_empty(); if (!ASSERT_OK_PTR(btf5, "empty_reloc_btf")) - return; + goto cleanup; btf__add_int(btf5, "int", 4, BTF_INT_SIGNED); /* [1] int */ VALIDATE_RAW_BTF( btf5, @@ -478,7 +478,7 @@ static void test_distilled_base_multi_err2(void) "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED"); btf5 = btf__new_empty(); if (!ASSERT_OK_PTR(btf5, "empty_reloc_btf")) - return; + goto cleanup; btf__add_int(btf5, "int", 4, BTF_INT_SIGNED); /* [1] int */ btf__add_int(btf5, "int", 4, BTF_INT_SIGNED); /* [2] int */ VALIDATE_RAW_BTF( -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> mainline inclusion from mainline-v6.14-rc1 commit 5436a54332c19df0acbef2b87cbf9f7cba56f2dd category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- The error number of elf_begin is omitted when encapsulating the btf_find_elf_sections function. Fixes: c86f180ffc99 ("libbpf: Make btf_parse_elf process .BTF.base transparently") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-2-pulehui@huaweicloud.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index f433a5282042..4ed135131bcb 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -1025,6 +1025,7 @@ static struct btf *btf_parse_elf(const char *path, struct btf *base_btf, elf = elf_begin(fd, ELF_C_READ, NULL); if (!elf) { + err = -LIBBPF_ERRNO__FORMAT; pr_warn("failed to open %s as ELF file\n", path); goto done; } -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> mainline inclusion from mainline-v6.14-rc1 commit 5ca681a86ef93369685cb63f71994f4cf7303e7c category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- When redirecting the split BTF to the vmlinux base BTF, we need to mark the distilled base struct/union members of split BTF structs/unions in id_map with BTF_IS_EMBEDDED. This indicates that these types must match both name and size later. Therefore, we need to traverse the entire split BTF, which involves traversing type IDs from nr_dist_base_types to nr_types. However, the current implementation uses an incorrect traversal end type ID, so let's correct it. Fixes: 19e00c897d50 ("libbpf: Split BTF relocation") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-3-pulehui@huaweicloud.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/btf_relocate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/btf_relocate.c b/tools/lib/bpf/btf_relocate.c index 6c9a16115bf0..363f3cdd9981 100644 --- a/tools/lib/bpf/btf_relocate.c +++ b/tools/lib/bpf/btf_relocate.c @@ -212,7 +212,7 @@ static int btf_relocate_map_distilled_base(struct btf_relocate *r) * need to match both name and size, otherwise embedding the base * struct/union in the split type is invalid. */ - for (id = r->nr_dist_base_types; id < r->nr_split_types; id++) { + for (id = r->nr_dist_base_types; id < r->nr_dist_base_types + r->nr_split_types; id++) { err = btf_mark_embedded_composite_type_ids(r, id); if (err) goto done; -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> mainline inclusion from mainline-v6.14-rc1 commit 556a399406635566413f9c71b134d5d287b25b29 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- When redirecting the split BTF to the vmlinux base BTF, we need to mark the distilled base struct/union members of split BTF structs/unions in id_map with BTF_IS_EMBEDDED. This indicates that these types must match both name and size later. So if a needed composite type, which is the member of composite type in the split BTF, has a different size in the base BTF we wish to relocate with, btf__relocate() should error out. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250115100241.4171581-4-pulehui@huaweicloud.com Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- .../selftests/bpf/prog_tests/btf_distill.c | 72 +++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_distill.c b/tools/testing/selftests/bpf/prog_tests/btf_distill.c index b72b966df77b..fb67ae195a73 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_distill.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_distill.c @@ -601,6 +601,76 @@ static void test_distilled_endianness(void) btf__free(base); } +/* If a needed composite type, which is the member of composite type + * in the split BTF, has a different size in the base BTF we wish to + * relocate with, btf__relocate() should error out. + */ +static void test_distilled_base_embedded_err(void) +{ + struct btf *btf1 = NULL, *btf2 = NULL, *btf3 = NULL, *btf4 = NULL, *btf5 = NULL; + + btf1 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf1, "empty_main_btf")) + return; + + btf__add_int(btf1, "int", 4, BTF_INT_SIGNED); /* [1] int */ + btf__add_struct(btf1, "s1", 4); /* [2] struct s1 { */ + btf__add_field(btf1, "f1", 1, 0, 0); /* int f1; */ + /* } */ + VALIDATE_RAW_BTF( + btf1, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] STRUCT 's1' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0"); + + btf2 = btf__new_empty_split(btf1); + if (!ASSERT_OK_PTR(btf2, "empty_split_btf")) + goto cleanup; + + btf__add_struct(btf2, "with_embedded", 8); /* [3] struct with_embedded { */ + btf__add_field(btf2, "e1", 2, 0, 0); /* struct s1 e1; */ + /* } */ + + VALIDATE_RAW_BTF( + btf2, + "[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED", + "[2] STRUCT 's1' size=4 vlen=1\n" + "\t'f1' type_id=1 bits_offset=0", + "[3] STRUCT 'with_embedded' size=8 vlen=1\n" + "\t'e1' type_id=2 bits_offset=0"); + + if (!ASSERT_EQ(0, btf__distill_base(btf2, &btf3, &btf4), + "distilled_base") || + !ASSERT_OK_PTR(btf3, "distilled_base") || + !ASSERT_OK_PTR(btf4, "distilled_split") || + !ASSERT_EQ(2, btf__type_cnt(btf3), "distilled_base_type_cnt")) + goto cleanup; + + VALIDATE_RAW_BTF( + btf4, + "[1] STRUCT 's1' size=4 vlen=0", + "[2] STRUCT 'with_embedded' size=8 vlen=1\n" + "\t'e1' type_id=1 bits_offset=0"); + + btf5 = btf__new_empty(); + if (!ASSERT_OK_PTR(btf5, "empty_reloc_btf")) + goto cleanup; + + btf__add_int(btf5, "int", 4, BTF_INT_SIGNED); /* [1] int */ + /* struct with the same name but different size */ + btf__add_struct(btf5, "s1", 8); /* [2] struct s1 { */ + btf__add_field(btf5, "f1", 1, 0, 0); /* int f1; */ + /* } */ + + ASSERT_EQ(btf__relocate(btf4, btf5), -EINVAL, "relocate_split"); +cleanup: + btf__free(btf5); + btf__free(btf4); + btf__free(btf3); + btf__free(btf2); + btf__free(btf1); +} + void test_btf_distill(void) { if (test__start_subtest("distilled_base")) @@ -613,6 +683,8 @@ void test_btf_distill(void) test_distilled_base_multi_err(); if (test__start_subtest("distilled_base_multi_err2")) test_distilled_base_multi_err2(); + if (test__start_subtest("distilled_base_embedded_err")) + test_distilled_base_embedded_err(); if (test__start_subtest("distilled_base_vmlinux")) test_distilled_base_vmlinux(); if (test__start_subtest("distilled_endianness")) -- 2.34.1
From: Pu Lehui <pulehui@huawei.com> hulk inclusion category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA -------------------------------- When bpf selftests is compiled, there will always be one more file_read_pattern file in git status. The reason is that commit 9c9387a5464a ("selftests/bpf: add demo for file read pattern detection") did not add it to the .gitignore file. Fixes: 9c9387a5464a ("selftests/bpf: add demo for file read pattern detection") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/testing/selftests/bpf/.gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index f1aebabfb017..729d305aa53c 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -52,3 +52,4 @@ xdp_redirect_multi xdp_synproxy xdp_hw_metadata xdp_features +file_read_pattern -- 2.34.1
From: "T.J. Mercier" <tjmercier@google.com> mainline inclusion from mainline-v6.15-rc1 commit dc438a9bc7610f23cca29a21bb5588b3c09d37f0 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8335 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... ---------------------------------------------------------------------- This file was renamed from bpf_iter_task_vma.c. Fixes: 45b38941c81f ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20250304204520.201115-1-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Documentation/bpf/bpf_iterators.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst index 07433915aa41..7f514cb6b052 100644 --- a/Documentation/bpf/bpf_iterators.rst +++ b/Documentation/bpf/bpf_iterators.rst @@ -86,7 +86,7 @@ following steps: The following are a few examples of selftest BPF iterator programs: * `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_ -* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_ +* `bpf_iter_task_vmas.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c>`_ * `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_ Let us look at ``bpf_iter_task_file.c``, which runs in kernel space: -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit df268382adc1f7aa3ad92f7de71b70395f24e4e7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Currently, sched_init() checks that the sched_class'es are in the expected order by testing each adjacency which is a bit brittle and makes it cumbersome to add optional sched_class'es. Instead, let's verify whether they're in the expected order using sched_class_above() which is what matters. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: David Vernet <dvernet@meta.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 61d377045ebd..84a8d262f00e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10050,12 +10050,12 @@ void __init sched_init(void) int i; /* Make sure the linker didn't screw up */ - BUG_ON(&idle_sched_class != &fair_sched_class + 1 || - &fair_sched_class != &rt_sched_class + 1 || - &rt_sched_class != &dl_sched_class + 1); #ifdef CONFIG_SMP - BUG_ON(&dl_sched_class != &stop_sched_class + 1); + BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class)); #endif + BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class)); + BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class)); + BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class)); wait_bit_init(); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 304b3f2bc07ba7c74a1ebc7917a8f0549e6b8d54 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- A new BPF extensible sched_class will need more control over the forking process. It wants to be able to fail from sched_cgroup_fork() after the new task's sched_task_group is initialized so that the loaded BPF program can prepare the task with its cgroup association is established and reject fork if e.g. allocation fails. Allow sched_cgroup_fork() to fail by making it return int instead of void and adding sched_cancel_fork() to undo sched_fork() in the error path. sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any behavior changes. v2: Patch description updated to detail the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/task.h | 3 ++- kernel/fork.c | 15 ++++++++++----- kernel/sched/core.c | 8 +++++++- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index a23af225c898..03d35e3eda3c 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -61,7 +61,8 @@ extern asmlinkage void schedule_tail(struct task_struct *prev); extern void init_idle(struct task_struct *idle, int cpu); extern int sched_fork(unsigned long clone_flags, struct task_struct *p); -extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs); +extern int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs); +extern void sched_cancel_fork(struct task_struct *p); extern void sched_post_fork(struct task_struct *p); extern void sched_dead(struct task_struct *p); diff --git a/kernel/fork.c b/kernel/fork.c index 69b9ced58ff1..59197037d846 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2619,7 +2619,7 @@ __latent_entropy struct task_struct *copy_process( retval = perf_event_init_task(p, clone_flags); if (retval) - goto bad_fork_cleanup_policy; + goto bad_fork_sched_cancel_fork; retval = audit_alloc(p); if (retval) goto bad_fork_cleanup_perf; @@ -2751,7 +2751,9 @@ __latent_entropy struct task_struct *copy_process( * cgroup specific, it unconditionally needs to place the task on a * runqueue. */ - sched_cgroup_fork(p, args); + retval = sched_cgroup_fork(p, args); + if (retval) + goto bad_fork_cancel_cgroup; /* * From this point on we must avoid any synchronous user-space @@ -2797,13 +2799,13 @@ __latent_entropy struct task_struct *copy_process( /* Don't start children in a dying pid namespace */ if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) { retval = -ENOMEM; - goto bad_fork_cancel_cgroup; + goto bad_fork_core_free; } /* Let kill terminate clone/fork in the middle */ if (fatal_signal_pending(current)) { retval = -EINTR; - goto bad_fork_cancel_cgroup; + goto bad_fork_core_free; } /* No more failure paths after this point. */ @@ -2879,10 +2881,11 @@ __latent_entropy struct task_struct *copy_process( return p; -bad_fork_cancel_cgroup: +bad_fork_core_free: sched_core_free(p); spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); +bad_fork_cancel_cgroup: cgroup_cancel_fork(p, args); bad_fork_put_pidfd: if (clone_flags & CLONE_PIDFD) { @@ -2921,6 +2924,8 @@ __latent_entropy struct task_struct *copy_process( audit_free(p); bad_fork_cleanup_perf: perf_event_free_task(p); +bad_fork_sched_cancel_fork: + sched_cancel_fork(p); bad_fork_cleanup_policy: lockdep_free_task(p); #ifdef CONFIG_NUMA diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 84a8d262f00e..e32e0d9db41e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4864,7 +4864,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) return 0; } -void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) +int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) { unsigned long flags; struct task_group *tg_qos __maybe_unused = NULL; @@ -4899,6 +4899,12 @@ void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) if (p->sched_class->task_fork) p->sched_class->task_fork(p); raw_spin_unlock_irqrestore(&p->pi_lock, flags); + + return 0; +} + +void sched_cancel_fork(struct task_struct *p) +{ } void sched_post_fork(struct task_struct *p) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e83edbf88f18087b5b349c7ff701b030dd53dff3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Currently, during a task weight change, sched core directly calls reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper sched_class operation instead. CFS's reweight_task() is renamed to reweight_task_fair() and now called through sched_class. While it turns a direct call into an indirect one, set_load_weight() isn't called from a hot path and this change shouldn't cause any noticeable difference. This will be used to implement reweight_task for a new BPF extensible sched_class so that it can keep its cached task weight up-to-date. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Conflicts: kernel/sched/core.c kernel/sched/fair.c kernel/sched/sched.h [The correct patch order should be 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()"), e83edbf88f18 ("sched: Add sched_class->reweight_task()"), d32960528702 ("sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks"), but current patch is e83edbf88f18 ("sched: Add sched_class->reweight_task()"), and the incorrect order made some conflicts.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 4 ++-- kernel/sched/fair.c | 3 ++- kernel/sched/sched.h | 4 ++-- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e32e0d9db41e..d12fbe45b9be 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1333,8 +1333,8 @@ static void set_load_weight(struct task_struct *p, bool update_load) * SCHED_OTHER tasks have to update their load when changing their * weight */ - if (update_load && p->sched_class == &fair_sched_class) - reweight_task(p, &lw); + if (update_load && p->sched_class->reweight_task) + p->sched_class->reweight_task(task_rq(p), p, &lw); else p->se.load = lw; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ad30bb800961..d8baa27af572 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3975,7 +3975,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, } } -void reweight_task(struct task_struct *p, const struct load_weight *lw) +static void reweight_task_fair(struct rq *rq, struct task_struct *p, const struct load_weight *lw) { struct sched_entity *se = &p->se; struct cfs_rq *cfs_rq = cfs_rq_of(se); @@ -15538,6 +15538,7 @@ DEFINE_SCHED_CLASS(fair) = { .task_tick = task_tick_fair, .task_fork = task_fork_fair, + .reweight_task = reweight_task_fair, .prio_changed = prio_changed_fair, .switched_from = switched_from_fair, .switched_to = switched_to_fair, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2cbbf75a3e92..797d345d7cec 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2486,6 +2486,8 @@ struct sched_class { */ void (*switched_from)(struct rq *this_rq, struct task_struct *task); void (*switched_to) (struct rq *this_rq, struct task_struct *task); + void (*reweight_task)(struct rq *this_rq, struct task_struct *task, + const struct load_weight *lw); void (*prio_changed) (struct rq *this_rq, struct task_struct *task, int oldprio); @@ -2645,8 +2647,6 @@ extern void init_sched_dl_class(void); extern void init_sched_rt_class(void); extern void init_sched_fair_class(void); -extern void reweight_task(struct task_struct *p, const struct load_weight *lw); - extern void resched_curr(struct rq *rq); extern void resched_cpu(int cpu); -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK -------------------------------- sched: Fix kabi for the method pointer reweight_task in struct sched_class. Fixes: e5850c18ee6b ("sched: Add sched_class->reweight_task()") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 797d345d7cec..fbff97e9a3d3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2486,8 +2486,6 @@ struct sched_class { */ void (*switched_from)(struct rq *this_rq, struct task_struct *task); void (*switched_to) (struct rq *this_rq, struct task_struct *task); - void (*reweight_task)(struct rq *this_rq, struct task_struct *task, - const struct load_weight *lw); void (*prio_changed) (struct rq *this_rq, struct task_struct *task, int oldprio); @@ -2503,7 +2501,8 @@ struct sched_class { #if defined(CONFIG_SCHED_CORE) || defined(CONFIG_CGROUP_IFS) int (*task_is_throttled)(struct task_struct *p, int cpu); #endif - KABI_RESERVE(1) + KABI_USE(1, void (*reweight_task)(struct rq *this_rq, + struct task_struct *task, const struct load_weight *lw)) KABI_RESERVE(2) }; -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit 85be6d842447067ce76047a14d4258c96fd33b7b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- recent discussion brought about the realization that it makes sense for no_free_ptr() to have __must_check semantics in order to avoid leaking the resource. Additionally, add a few comments to clarify why/how things work. All credit to Linus on how to combine __must_check and the stmt-expression. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20230816103102.GF980931@hirez.programming.kicks-as... Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/cleanup.h | 39 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-) diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h index 64b8600eb8c0..f5edf32a7e58 100644 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@ -7,8 +7,9 @@ /* * DEFINE_FREE(name, type, free): * simple helper macro that defines the required wrapper for a __free() - * based cleanup function. @free is an expression using '_T' to access - * the variable. + * based cleanup function. @free is an expression using '_T' to access the + * variable. @free should typically include a NULL test before calling a + * function, see the example below. * * __free(name): * variable attribute to add a scoped based cleanup to the variable. @@ -17,6 +18,9 @@ * like a non-atomic xchg(var, NULL), such that the cleanup function will * be inhibited -- provided it sanely deals with a NULL value. * + * NOTE: this has __must_check semantics so that it is harder to accidentally + * leak the resource. + * * return_ptr(p): * returns p while inhibiting the __free(). * @@ -24,6 +28,8 @@ * * DEFINE_FREE(kfree, void *, if (_T) kfree(_T)) * + * void *alloc_obj(...) + * { * struct obj *p __free(kfree) = kmalloc(...); * if (!p) * return NULL; @@ -32,6 +38,24 @@ * return NULL; * * return_ptr(p); + * } + * + * NOTE: the DEFINE_FREE()'s @free expression includes a NULL test even though + * kfree() is fine to be called with a NULL value. This is on purpose. This way + * the compiler sees the end of our alloc_obj() function as: + * + * tmp = p; + * p = NULL; + * if (p) + * kfree(p); + * return tmp; + * + * And through the magic of value-propagation and dead-code-elimination, it + * eliminates the actual cleanup call and compiles into: + * + * return p; + * + * Without the NULL test it turns into a mess and the compiler can't help us. */ #define DEFINE_FREE(_name, _type, _free) \ @@ -39,8 +63,17 @@ #define __free(_name) __cleanup(__free_##_name) +#define __get_and_null_ptr(p) \ + ({ __auto_type __ptr = &(p); \ + __auto_type __val = *__ptr; \ + *__ptr = NULL; __val; }) + +static inline __must_check +const volatile void * __must_check_fn(const volatile void *val) +{ return val; } + #define no_free_ptr(p) \ - ({ __auto_type __ptr = (p); (p) = NULL; __ptr; }) + ((typeof(p)) __must_check_fn(__get_and_null_ptr(p))) #define return_ptr(p) return no_free_ptr(p) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit 94b548a15e8ec47dfbf6925bdfb64bb5657dce0c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 13 ++++++------- kernel/sched/sched.h | 5 +++++ 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d12fbe45b9be..a4d83c7fdf98 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7280,9 +7280,8 @@ static inline int rt_effective_prio(struct task_struct *p, int prio) void set_user_nice(struct task_struct *p, long nice) { bool queued, running; - int old_prio; - struct rq_flags rf; struct rq *rq; + int old_prio; if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE) return; @@ -7290,7 +7289,9 @@ void set_user_nice(struct task_struct *p, long nice) * We have to be careful, if called from sys_setpriority(), * the task might be in the middle of scheduling on another CPU. */ - rq = task_rq_lock(p, &rf); + CLASS(task_rq_lock, rq_guard)(p); + rq = rq_guard.rq; + update_rq_clock(rq); /* @@ -7301,8 +7302,9 @@ void set_user_nice(struct task_struct *p, long nice) */ if (task_has_dl_policy(p) || task_has_rt_policy(p)) { p->static_prio = NICE_TO_PRIO(nice); - goto out_unlock; + return; } + queued = task_on_rq_queued(p); running = task_current(rq, p); if (queued) @@ -7325,9 +7327,6 @@ void set_user_nice(struct task_struct *p, long nice) * lowered its priority, then reschedule its CPU: */ p->sched_class->prio_changed(rq, p, old_prio); - -out_unlock: - task_rq_unlock(rq, p, &rf); } EXPORT_SYMBOL(set_user_nice); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index fbff97e9a3d3..3288a59bae62 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1855,6 +1855,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf) raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); } +DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct, + _T->rq = task_rq_lock(_T->lock, &_T->rf), + task_rq_unlock(_T->rq, _T->lock, &_T->rf), + struct rq *rq; struct rq_flags rf) + static inline void rq_lock_irqsave(struct rq *rq, struct rq_flags *rf) __acquires(rq->lock) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit febe162d4d9158cf2b5d48fdd440db7bb55dd622 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 154 +++++++++++++++++++------------------------- 1 file changed, 68 insertions(+), 86 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a4d83c7fdf98..cfd774cb622e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7599,6 +7599,21 @@ static struct task_struct *find_process_by_pid(pid_t pid) return pid ? find_task_by_vpid(pid) : current; } +static struct task_struct *find_get_task(pid_t pid) +{ + struct task_struct *p; + guard(rcu)(); + + p = find_process_by_pid(pid); + if (likely(p)) + get_task_struct(p); + + return p; +} + +DEFINE_CLASS(find_get_task, struct task_struct *, if (_T) put_task_struct(_T), + find_get_task(pid), pid_t pid) + /* * sched_setparam() passes in -1 for its policy, to let the functions * it calls know not to change it. @@ -7644,14 +7659,11 @@ static void __setscheduler_params(struct task_struct *p, static bool check_same_owner(struct task_struct *p) { const struct cred *cred = current_cred(), *pcred; - bool match; + guard(rcu)(); - rcu_read_lock(); pcred = __task_cred(p); - match = (uid_eq(cred->euid, pcred->euid) || - uid_eq(cred->euid, pcred->uid)); - rcu_read_unlock(); - return match; + return (uid_eq(cred->euid, pcred->euid) || + uid_eq(cred->euid, pcred->uid)); } /* @@ -8075,27 +8087,17 @@ static int do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param) { struct sched_param lparam; - struct task_struct *p; - int retval; if (!param || pid < 0) return -EINVAL; if (copy_from_user(&lparam, param, sizeof(struct sched_param))) return -EFAULT; - rcu_read_lock(); - retval = -ESRCH; - p = find_process_by_pid(pid); - if (likely(p)) - get_task_struct(p); - rcu_read_unlock(); - - if (likely(p)) { - retval = sched_setscheduler(p, policy, &lparam); - put_task_struct(p); - } + CLASS(find_get_task, p)(pid); + if (!p) + return -ESRCH; - return retval; + return sched_setscheduler(p, policy, &lparam); } /* @@ -8191,7 +8193,6 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr, unsigned int, flags) { struct sched_attr attr; - struct task_struct *p; int retval; if (!uattr || pid < 0 || flags) @@ -8206,21 +8207,14 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr, if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY) attr.sched_policy = SETPARAM_POLICY; - rcu_read_lock(); - retval = -ESRCH; - p = find_process_by_pid(pid); - if (likely(p)) - get_task_struct(p); - rcu_read_unlock(); + CLASS(find_get_task, p)(pid); + if (!p) + return -ESRCH; - if (likely(p)) { - if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) - get_params(p, &attr); - retval = sched_setattr(p, &attr); - put_task_struct(p); - } + if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) + get_params(p, &attr); - return retval; + return sched_setattr(p, &attr); } /** @@ -8238,16 +8232,17 @@ SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid) if (pid < 0) return -EINVAL; - retval = -ESRCH; - rcu_read_lock(); + guard(rcu)(); p = find_process_by_pid(pid); - if (p) { - retval = security_task_getscheduler(p); - if (!retval) - retval = p->policy - | (p->sched_reset_on_fork ? SCHED_RESET_ON_FORK : 0); + if (!p) + return -ESRCH; + + retval = security_task_getscheduler(p); + if (!retval) { + retval = p->policy; + if (p->sched_reset_on_fork) + retval |= SCHED_RESET_ON_FORK; } - rcu_read_unlock(); return retval; } @@ -8268,30 +8263,23 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param) if (!param || pid < 0) return -EINVAL; - rcu_read_lock(); - p = find_process_by_pid(pid); - retval = -ESRCH; - if (!p) - goto out_unlock; + scoped_guard (rcu) { + p = find_process_by_pid(pid); + if (!p) + return -ESRCH; - retval = security_task_getscheduler(p); - if (retval) - goto out_unlock; + retval = security_task_getscheduler(p); + if (retval) + return retval; - if (task_has_rt_policy(p)) - lp.sched_priority = p->rt_priority; - rcu_read_unlock(); + if (task_has_rt_policy(p)) + lp.sched_priority = p->rt_priority; + } /* * This one might sleep, we cannot do it with a spinlock held ... */ - retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0; - - return retval; - -out_unlock: - rcu_read_unlock(); - return retval; + return copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0; } /* @@ -8351,39 +8339,33 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, usize < SCHED_ATTR_SIZE_VER0 || flags) return -EINVAL; - rcu_read_lock(); - p = find_process_by_pid(pid); - retval = -ESRCH; - if (!p) - goto out_unlock; + scoped_guard (rcu) { + p = find_process_by_pid(pid); + if (!p) + return -ESRCH; - retval = security_task_getscheduler(p); - if (retval) - goto out_unlock; + retval = security_task_getscheduler(p); + if (retval) + return retval; - kattr.sched_policy = p->policy; - if (p->sched_reset_on_fork) - kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK; - get_params(p, &kattr); - kattr.sched_flags &= SCHED_FLAG_ALL; + kattr.sched_policy = p->policy; + if (p->sched_reset_on_fork) + kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK; + get_params(p, &kattr); + kattr.sched_flags &= SCHED_FLAG_ALL; #ifdef CONFIG_UCLAMP_TASK - /* - * This could race with another potential updater, but this is fine - * because it'll correctly read the old or the new value. We don't need - * to guarantee who wins the race as long as it doesn't return garbage. - */ - kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value; - kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value; + /* + * This could race with another potential updater, but this is fine + * because it'll correctly read the old or the new value. We don't need + * to guarantee who wins the race as long as it doesn't return garbage. + */ + kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value; + kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value; #endif - - rcu_read_unlock(); + } return sched_attr_copy_to_user(uattr, &kattr, usize); - -out_unlock: - rcu_read_unlock(); - return retval; } #ifdef CONFIG_SMP -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit 92c2ec5bc1081e6bbbe172bcfb1a566ad7b4f809 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 53 ++++++++++++--------------------------------- 1 file changed, 14 insertions(+), 39 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index cfd774cb622e..cf96ba4ddbae 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8460,39 +8460,24 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) { struct affinity_context ac; struct cpumask *user_mask; - struct task_struct *p; int retval; - rcu_read_lock(); - - p = find_process_by_pid(pid); - if (!p) { - rcu_read_unlock(); + CLASS(find_get_task, p)(pid); + if (!p) return -ESRCH; - } - - /* Prevent p going away */ - get_task_struct(p); - rcu_read_unlock(); - if (p->flags & PF_NO_SETAFFINITY) { - retval = -EINVAL; - goto out_put_task; - } + if (p->flags & PF_NO_SETAFFINITY) + return -EINVAL; if (!check_same_owner(p)) { - rcu_read_lock(); - if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) { - rcu_read_unlock(); - retval = -EPERM; - goto out_put_task; - } - rcu_read_unlock(); + guard(rcu)(); + if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) + return -EPERM; } retval = security_task_setscheduler(p); if (retval) - goto out_put_task; + return retval; /* * With non-SMP configs, user_cpus_ptr/user_mask isn't used and @@ -8502,8 +8487,7 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) if (user_mask) { cpumask_copy(user_mask, in_mask); } else if (IS_ENABLED(CONFIG_SMP)) { - retval = -ENOMEM; - goto out_put_task; + return -ENOMEM; } ac = (struct affinity_context){ @@ -8515,8 +8499,6 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) retval = __sched_setaffinity(p, &ac); kfree(ac.user_mask); -out_put_task: - put_task_struct(p); return retval; } @@ -8558,28 +8540,21 @@ SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len, long sched_getaffinity(pid_t pid, struct cpumask *mask) { struct task_struct *p; - unsigned long flags; int retval; - rcu_read_lock(); - - retval = -ESRCH; + guard(rcu)(); p = find_process_by_pid(pid); if (!p) - goto out_unlock; + return -ESRCH; retval = security_task_getscheduler(p); if (retval) - goto out_unlock; + return retval; - raw_spin_lock_irqsave(&p->pi_lock, flags); + guard(raw_spinlock_irqsave)(&p->pi_lock); cpumask_and(mask, &p->cpus_mask, cpu_active_mask); - raw_spin_unlock_irqrestore(&p->pi_lock, flags); -out_unlock: - rcu_read_unlock(); - - return retval; + return 0; } /** -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit 7a50f76674f8b6f4f30a1cec954179f10e20110c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Conflicts: kernel/sched/core.c [Conflicts with 002c90dd0cec ("sched: Fix race between yield_to() and try_to_wake_up()"), so refer to the mainline to solve this conflict in raw_spin_lock_irqsave mode.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 67 ++++++++++++++++++++------------------------- 1 file changed, 29 insertions(+), 38 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index cf96ba4ddbae..43a43b742fed 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9001,55 +9001,46 @@ int __sched yield_to(struct task_struct *p, bool preempt) { struct task_struct *curr = current; struct rq *rq, *p_rq; - unsigned long flags; int yielded = 0; - raw_spin_lock_irqsave(&p->pi_lock, flags); - rq = this_rq(); + scoped_guard (raw_spinlock_irqsave, &p->pi_lock) { + rq = this_rq(); again: - p_rq = task_rq(p); - /* - * If we're the only runnable task on the rq and target rq also - * has only one task, there's absolutely no point in yielding. - */ - if (rq->nr_running == 1 && p_rq->nr_running == 1) { - yielded = -ESRCH; - goto out_irq; - } + p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding. + */ + if (rq->nr_running == 1 && p_rq->nr_running == 1) + return -ESRCH; - double_rq_lock(rq, p_rq); - if (task_rq(p) != p_rq) { - double_rq_unlock(rq, p_rq); - goto again; - } + guard(double_rq_lock)(rq, p_rq); + if (task_rq(p) != p_rq) + goto again; - if (!curr->sched_class->yield_to_task) - goto out_unlock; + if (!curr->sched_class->yield_to_task) + return 0; - if (curr->sched_class != p->sched_class) - goto out_unlock; + if (curr->sched_class != p->sched_class) + return 0; - if (task_on_cpu(p_rq, p) || !task_is_running(p)) - goto out_unlock; + if (task_on_cpu(p_rq, p) || !task_is_running(p)) + return 0; - yielded = curr->sched_class->yield_to_task(rq, p); - if (yielded) { - schedstat_inc(rq->yld_count); - /* - * Make p's CPU reschedule; pick_next_entity takes care of - * fairness. - */ - if (preempt && rq != p_rq) - resched_curr(p_rq); + yielded = curr->sched_class->yield_to_task(rq, p); + if (yielded) { + schedstat_inc(rq->yld_count); + /* + * Make p's CPU reschedule; pick_next_entity + * takes care of fairness. + */ + if (preempt && rq != p_rq) + resched_curr(p_rq); + } } -out_unlock: - double_rq_unlock(rq, p_rq); -out_irq: - raw_spin_unlock_irqrestore(&p->pi_lock, flags); - - if (yielded > 0) + if (yielded) schedule(); return yielded; -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit af7c5763f5e8bc1b3f827354a283ccaf6a8c8098 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 36 ++++++++++++++---------------------- 1 file changed, 14 insertions(+), 22 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 43a43b742fed..61023f4325ff 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9143,38 +9143,30 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy) static int sched_rr_get_interval(pid_t pid, struct timespec64 *t) { - struct task_struct *p; - unsigned int time_slice; - struct rq_flags rf; - struct rq *rq; + unsigned int time_slice = 0; int retval; if (pid < 0) return -EINVAL; - retval = -ESRCH; - rcu_read_lock(); - p = find_process_by_pid(pid); - if (!p) - goto out_unlock; + scoped_guard (rcu) { + struct task_struct *p = find_process_by_pid(pid); + if (!p) + return -ESRCH; - retval = security_task_getscheduler(p); - if (retval) - goto out_unlock; + retval = security_task_getscheduler(p); + if (retval) + return retval; - rq = task_rq_lock(p, &rf); - time_slice = 0; - if (p->sched_class->get_rr_interval) - time_slice = p->sched_class->get_rr_interval(rq, p); - task_rq_unlock(rq, p, &rf); + scoped_guard (task_rq_lock, p) { + struct rq *rq = scope.rq; + if (p->sched_class->get_rr_interval) + time_slice = p->sched_class->get_rr_interval(rq, p); + } + } - rcu_read_unlock(); jiffies_to_timespec64(time_slice, t); return 0; - -out_unlock: - rcu_read_unlock(); - return retval; } /** -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit fa614b4feb5a246474ac71b45e520a8ddefc809c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 61023f4325ff..4d8b447525fd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10684,17 +10684,18 @@ void sched_move_task(struct task_struct *tsk) int queued, running, queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; struct task_group *group; - struct rq_flags rf; struct rq *rq; - rq = task_rq_lock(tsk, &rf); + CLASS(task_rq_lock, rq_guard)(tsk); + rq = rq_guard.rq; + /* * Esp. with SCHED_AUTOGROUP enabled it is possible to get superfluous * group changes. */ group = sched_get_task_group(tsk); if (group == tsk->sched_task_group) - goto unlock; + return; update_rq_clock(rq); @@ -10719,9 +10720,6 @@ void sched_move_task(struct task_struct *tsk) */ resched_curr(rq); } - -unlock: - task_rq_unlock(rq, tsk, &rf); } static inline struct task_group *css_tg(struct cgroup_subsys_state *css) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.7-rc1 commit 0e34600ac9317dbe5f0a7bfaa3d7187d757572ed category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IDC9YK Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Random remaining guard use... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Conflicts: kernel/sched/core.c [Fix context conflicts.] Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 167 +++++++++++++++++--------------------------- 1 file changed, 65 insertions(+), 102 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4d8b447525fd..d5d972185d95 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1491,16 +1491,12 @@ static void __uclamp_update_util_min_rt_default(struct task_struct *p) static void uclamp_update_util_min_rt_default(struct task_struct *p) { - struct rq_flags rf; - struct rq *rq; - if (!rt_task(p)) return; /* Protect updates to p->uclamp_* */ - rq = task_rq_lock(p, &rf); + guard(task_rq_lock)(p); __uclamp_update_util_min_rt_default(p); - task_rq_unlock(rq, p, &rf); } static inline struct uclamp_se @@ -1796,9 +1792,8 @@ static void uclamp_update_root_tg(void) uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], sysctl_sched_uclamp_util_max, false); - rcu_read_lock(); + guard(rcu)(); cpu_util_update_eff(&root_task_group.css); - rcu_read_unlock(); } #else static void uclamp_update_root_tg(void) { } @@ -1825,10 +1820,9 @@ static void uclamp_sync_util_min_rt_default(void) smp_mb__after_spinlock(); read_unlock(&tasklist_lock); - rcu_read_lock(); + guard(rcu)(); for_each_process_thread(g, p) uclamp_update_util_min_rt_default(p); - rcu_read_unlock(); } static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write, @@ -2259,17 +2253,14 @@ int __task_state_match(struct task_struct *p, unsigned int state) static __always_inline int task_state_match(struct task_struct *p, unsigned int state) { - int match; - +#ifdef CONFIG_PREEMPT_RT /* * Serialize against current_save_and_set_rtlock_wait_state() and * current_restore_rtlock_saved_state(), and __refrigerator(). */ - raw_spin_lock_irq(&p->pi_lock); - match = __task_state_match(p, state); - raw_spin_unlock_irq(&p->pi_lock); - - return match; + guard(raw_spinlock_irq)(&p->pi_lock); +#endif + return __task_state_match(p, state); } /* @@ -2423,10 +2414,9 @@ void migrate_disable(void) return; } - preempt_disable(); + guard(preempt)(); this_rq()->nr_pinned++; p->migration_disabled = 1; - preempt_enable(); } EXPORT_SYMBOL_GPL(migrate_disable); @@ -2450,7 +2440,7 @@ void migrate_enable(void) * Ensure stop_task runs either before or after this, and that * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule(). */ - preempt_disable(); + guard(preempt)(); if (p->cpus_ptr != &p->cpus_mask) __set_cpus_allowed_ptr(p, &ac); /* @@ -2461,7 +2451,6 @@ void migrate_enable(void) barrier(); p->migration_disabled = 0; this_rq()->nr_pinned--; - preempt_enable(); } EXPORT_SYMBOL_GPL(migrate_enable); @@ -3533,13 +3522,11 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p, */ void kick_process(struct task_struct *p) { - int cpu; + guard(preempt)(); + int cpu = task_cpu(p); - preempt_disable(); - cpu = task_cpu(p); if ((cpu != smp_processor_id()) && task_curr(p)) smp_send_reschedule(cpu); - preempt_enable(); } EXPORT_SYMBOL_GPL(kick_process); @@ -6458,8 +6445,9 @@ static void sched_core_balance(struct rq *rq) struct sched_domain *sd; int cpu = cpu_of(rq); - preempt_disable(); - rcu_read_lock(); + guard(preempt)(); + guard(rcu)(); + raw_spin_rq_unlock_irq(rq); for_each_domain(cpu, sd) { if (need_resched()) @@ -6469,8 +6457,6 @@ static void sched_core_balance(struct rq *rq) break; } raw_spin_rq_lock_irq(rq); - rcu_read_unlock(); - preempt_enable(); } static DEFINE_PER_CPU(struct balance_callback, core_balance_head); @@ -8371,8 +8357,6 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, #ifdef CONFIG_SMP int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask) { - int ret = 0; - /* * If the task isn't a deadline task or admission control is * disabled then we don't care about affinity changes. @@ -8386,11 +8370,11 @@ int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask) * tasks allowed to run on all the CPUs in the task's * root_domain. */ - rcu_read_lock(); + guard(rcu)(); if (!cpumask_subset(task_rq(p)->rd->span, mask)) - ret = -EBUSY; - rcu_read_unlock(); - return ret; + return -EBUSY; + + return 0; } #endif @@ -10756,11 +10740,9 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css) #ifdef CONFIG_UCLAMP_TASK_GROUP /* Propagate the effective uclamp value for the new group */ - mutex_lock(&uclamp_mutex); - rcu_read_lock(); + guard(mutex)(&uclamp_mutex); + guard(rcu)(); cpu_util_update_eff(css); - rcu_read_unlock(); - mutex_unlock(&uclamp_mutex); #endif return 0; @@ -10919,8 +10901,8 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf, static_branch_enable(&sched_uclamp_used); - mutex_lock(&uclamp_mutex); - rcu_read_lock(); + guard(mutex)(&uclamp_mutex); + guard(rcu)(); tg = css_tg(of_css(of)); if (tg->uclamp_req[clamp_id].value != req.util) @@ -10935,9 +10917,6 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf, /* Update effective clamps to track the most restrictive value */ cpu_util_update_eff(of_css(of)); - rcu_read_unlock(); - mutex_unlock(&uclamp_mutex); - return nbytes; } @@ -10963,10 +10942,10 @@ static inline void cpu_uclamp_print(struct seq_file *sf, u64 percent; u32 rem; - rcu_read_lock(); - tg = css_tg(seq_css(sf)); - util_clamp = tg->uclamp_req[clamp_id].value; - rcu_read_unlock(); + scoped_guard (rcu) { + tg = css_tg(seq_css(sf)); + util_clamp = tg->uclamp_req[clamp_id].value; + } if (util_clamp == SCHED_CAPACITY_SCALE) { seq_puts(sf, "max\n"); @@ -11296,7 +11275,6 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data) static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota) { - int ret; struct cfs_schedulable_data data = { .tg = tg, .period = period, @@ -11308,11 +11286,8 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota) do_div(data.quota, NSEC_PER_USEC); } - rcu_read_lock(); - ret = walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data); - rcu_read_unlock(); - - return ret; + guard(rcu)(); + return walk_tg_tree(tg_cfs_schedulable_down, tg_nop, &data); } static int cpu_cfs_stat_show(struct seq_file *sf, void *v) @@ -12497,14 +12472,12 @@ int __sched_mm_cid_migrate_from_fetch_cid(struct rq *src_rq, * are not the last task to be migrated from this cpu for this mm, so * there is no need to move src_cid to the destination cpu. */ - rcu_read_lock(); + guard(rcu)(); src_task = rcu_dereference(src_rq->curr); if (READ_ONCE(src_task->mm_cid_active) && src_task->mm == mm) { - rcu_read_unlock(); t->last_mm_cid = -1; return -1; } - rcu_read_unlock(); return src_cid; } @@ -12548,18 +12521,17 @@ int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq, * the lazy-put flag, this task will be responsible for transitioning * from lazy-put flag set to MM_CID_UNSET. */ - rcu_read_lock(); - src_task = rcu_dereference(src_rq->curr); - if (READ_ONCE(src_task->mm_cid_active) && src_task->mm == mm) { - rcu_read_unlock(); - /* - * We observed an active task for this mm, there is therefore - * no point in moving this cid to the destination cpu. - */ - t->last_mm_cid = -1; - return -1; + scoped_guard (rcu) { + src_task = rcu_dereference(src_rq->curr); + if (READ_ONCE(src_task->mm_cid_active) && src_task->mm == mm) { + /* + * We observed an active task for this mm, there is therefore + * no point in moving this cid to the destination cpu. + */ + t->last_mm_cid = -1; + return -1; + } } - rcu_read_unlock(); /* * The src_cid is unused, so it can be unset. @@ -12632,7 +12604,6 @@ static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_ { struct rq *rq = cpu_rq(cpu); struct task_struct *t; - unsigned long flags; int cid, lazy_cid; cid = READ_ONCE(pcpu_cid->cid); @@ -12667,23 +12638,21 @@ static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_ * the lazy-put flag, that task will be responsible for transitioning * from lazy-put flag set to MM_CID_UNSET. */ - rcu_read_lock(); - t = rcu_dereference(rq->curr); - if (READ_ONCE(t->mm_cid_active) && t->mm == mm) { - rcu_read_unlock(); - return; + scoped_guard (rcu) { + t = rcu_dereference(rq->curr); + if (READ_ONCE(t->mm_cid_active) && t->mm == mm) + return; } - rcu_read_unlock(); /* * The cid is unused, so it can be unset. * Disable interrupts to keep the window of cid ownership without rq * lock small. */ - local_irq_save(flags); - if (try_cmpxchg(&pcpu_cid->cid, &lazy_cid, MM_CID_UNSET)) - __mm_cid_put(mm, cid); - local_irq_restore(flags); + scoped_guard (irqsave) { + if (try_cmpxchg(&pcpu_cid->cid, &lazy_cid, MM_CID_UNSET)) + __mm_cid_put(mm, cid); + } } static void sched_mm_cid_remote_clear_old(struct mm_struct *mm, int cpu) @@ -12705,14 +12674,13 @@ static void sched_mm_cid_remote_clear_old(struct mm_struct *mm, int cpu) * snapshot associated with this cid if an active task using the mm is * observed on this rq. */ - rcu_read_lock(); - curr = rcu_dereference(rq->curr); - if (READ_ONCE(curr->mm_cid_active) && curr->mm == mm) { - WRITE_ONCE(pcpu_cid->time, rq_clock); - rcu_read_unlock(); - return; + scoped_guard (rcu) { + curr = rcu_dereference(rq->curr); + if (READ_ONCE(curr->mm_cid_active) && curr->mm == mm) { + WRITE_ONCE(pcpu_cid->time, rq_clock); + return; + } } - rcu_read_unlock(); if (rq_clock < pcpu_cid->time + SCHED_MM_CID_PERIOD_NS) return; @@ -12808,7 +12776,6 @@ void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) void sched_mm_cid_exit_signals(struct task_struct *t) { struct mm_struct *mm = t->mm; - struct rq_flags rf; struct rq *rq; if (!mm) @@ -12816,7 +12783,7 @@ void sched_mm_cid_exit_signals(struct task_struct *t) preempt_disable(); rq = this_rq(); - rq_lock_irqsave(rq, &rf); + guard(rq_lock_irqsave)(rq); preempt_enable_no_resched(); /* holding spinlock */ WRITE_ONCE(t->mm_cid_active, 0); /* @@ -12826,13 +12793,11 @@ void sched_mm_cid_exit_signals(struct task_struct *t) smp_mb(); mm_cid_put(mm); t->last_mm_cid = t->mm_cid = -1; - rq_unlock_irqrestore(rq, &rf); } void sched_mm_cid_before_execve(struct task_struct *t) { struct mm_struct *mm = t->mm; - struct rq_flags rf; struct rq *rq; if (!mm) @@ -12840,7 +12805,7 @@ void sched_mm_cid_before_execve(struct task_struct *t) preempt_disable(); rq = this_rq(); - rq_lock_irqsave(rq, &rf); + guard(rq_lock_irqsave)(rq); preempt_enable_no_resched(); /* holding spinlock */ WRITE_ONCE(t->mm_cid_active, 0); /* @@ -12850,13 +12815,11 @@ void sched_mm_cid_before_execve(struct task_struct *t) smp_mb(); mm_cid_put(mm); t->last_mm_cid = t->mm_cid = -1; - rq_unlock_irqrestore(rq, &rf); } void sched_mm_cid_after_execve(struct task_struct *t) { struct mm_struct *mm = t->mm; - struct rq_flags rf; struct rq *rq; if (!mm) @@ -12864,16 +12827,16 @@ void sched_mm_cid_after_execve(struct task_struct *t) preempt_disable(); rq = this_rq(); - rq_lock_irqsave(rq, &rf); - preempt_enable_no_resched(); /* holding spinlock */ - WRITE_ONCE(t->mm_cid_active, 1); - /* - * Store t->mm_cid_active before loading per-mm/cpu cid. - * Matches barrier in sched_mm_cid_remote_clear_old(). - */ - smp_mb(); - t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm); - rq_unlock_irqrestore(rq, &rf); + scoped_guard (rq_lock_irqsave, rq) { + preempt_enable_no_resched(); /* holding spinlock */ + WRITE_ONCE(t->mm_cid_active, 1); + /* + * Store t->mm_cid_active before loading per-mm/cpu cid. + * Matches barrier in sched_mm_cid_remote_clear_old(). + */ + smp_mb(); + t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm); + } rseq_set_notify_resume(t); } -- 2.34.1
From: Vincent Guittot <vincent.guittot@linaro.org> mainline inclusion from mainline-v6.8-rc1 commit 9c0b4bb7f6303c9c4e2e34984c46f5a86478f84d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The current method to take into account uclamp hints when estimating the target frequency can end in a situation where the selected target frequency is finally higher than uclamp hints, whereas there are no real needs. Such cases mainly happen because we are currently mixing the traditional scheduler utilization signal with the uclamp performance hints. By adding these 2 metrics, we loose an important information when it comes to select the target frequency, and we have to make some assumptions which can't fit all cases. Rework the interface between the scheduler and schedutil governor in order to propagate all information down to the cpufreq governor. effective_cpu_util() interface changes and now returns the actual utilization of the CPU with 2 optional inputs: - The minimum performance for this CPU; typically the capacity to handle the deadline task and the interrupt pressure. But also uclamp_min request when available. - The maximum targeting performance for this CPU which reflects the maximum level that we would like to not exceed. By default it will be the CPU capacity but can be reduced because of some performance hints set with uclamp. The value can be lower than actual utilization and/or min performance level. A new sugov_effective_cpu_perf() interface is also available to compute the final performance level that is targeted for the CPU, after applying some cpufreq headroom and taking into account all inputs. With these 2 functions, schedutil is now able to decide when it must go above uclamp hints. It now also has a generic way to get the min performance level. The dependency between energy model and cpufreq governor and its headroom policy doesn't exist anymore. eenv_pd_max_util() asks schedutil for the targeted performance after applying the impact of the waking task. [ mingo: Refined the changelog & C comments. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Conflicts: kernel/sched/sched.h kernel/sched/cpufreq_schedutil.c [Only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/energy_model.h | 1 - kernel/sched/core.c | 90 ++++++++++++++------------------ kernel/sched/cpufreq_schedutil.c | 35 +++++++++---- kernel/sched/fair.c | 22 ++++++-- kernel/sched/sched.h | 24 +++------ 5 files changed, 89 insertions(+), 83 deletions(-) diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h index 784efdba6703..3b1128cc5b1f 100644 --- a/include/linux/energy_model.h +++ b/include/linux/energy_model.h @@ -248,7 +248,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, scale_cpu = arch_scale_cpu_capacity(cpu); ref_freq = arch_scale_freq_ref(cpu); - max_util = map_util_perf(max_util); max_util = min(max_util, allowed_cpu_cap); freq = map_util_freq(max_util, ref_freq, scale_cpu); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d5d972185d95..ecc8f7d2a9bc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7481,18 +7481,13 @@ int sched_core_idle_cpu(int cpu) * required to meet deadlines. */ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, - enum cpu_util_type type, - struct task_struct *p) + unsigned long *min, + unsigned long *max) { - unsigned long dl_util, util, irq, max; + unsigned long util, irq, scale; struct rq *rq = cpu_rq(cpu); - max = arch_scale_cpu_capacity(cpu); - - if (!uclamp_is_used() && - type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) { - return max; - } + scale = arch_scale_cpu_capacity(cpu); /* * Early check to see if IRQ/steal time saturates the CPU, can be @@ -7500,45 +7495,49 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, * update_irq_load_avg(). */ irq = cpu_util_irq(rq); - if (unlikely(irq >= max)) - return max; + if (unlikely(irq >= scale)) { + if (min) + *min = scale; + if (max) + *max = scale; + return scale; + } + + if (min) { + /* + * The minimum utilization returns the highest level between: + * - the computed DL bandwidth needed with the IRQ pressure which + * steals time to the deadline task. + * - The minimum performance requirement for CFS and/or RT. + */ + *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN)); + + /* + * When an RT task is runnable and uclamp is not used, we must + * ensure that the task will run at maximum compute capacity. + */ + if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt)) + *min = max(*min, scale); + } /* * Because the time spend on RT/DL tasks is visible as 'lost' time to * CFS tasks and we use the same metric to track the effective * utilization (PELT windows are synchronized) we can directly add them * to obtain the CPU's actual utilization. - * - * CFS and RT utilization can be boosted or capped, depending on - * utilization clamp constraints requested by currently RUNNABLE - * tasks. - * When there are no CFS RUNNABLE tasks, clamps are released and - * frequency will be gracefully reduced with the utilization decay. */ util = util_cfs + cpu_util_rt(rq); - if (type == FREQUENCY_UTIL) - util = uclamp_rq_util_with(rq, util, p); - - dl_util = cpu_util_dl(rq); + util += cpu_util_dl(rq); /* - * For frequency selection we do not make cpu_util_dl() a permanent part - * of this sum because we want to use cpu_bw_dl() later on, but we need - * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such - * that we select f_max when there is no idle time. - * - * NOTE: numerical errors or stop class might cause us to not quite hit - * saturation when we should -- something for later. + * The maximum hint is a soft bandwidth requirement, which can be lower + * than the actual utilization because of uclamp_max requirements. */ - if (util + dl_util >= max) - return max; + if (max) + *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX)); - /* - * OTOH, for energy computation we need the estimated running time, so - * include util_dl and ignore dl_bw. - */ - if (type == ENERGY_UTIL) - util += dl_util; + if (util >= scale) + return scale; /* * There is still idle time; further improve the number by using the @@ -7549,28 +7548,15 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, * U' = irq + --------- * U * max */ - util = scale_irq_capacity(util, irq, max); + util = scale_irq_capacity(util, irq, scale); util += irq; - /* - * Bandwidth required by DEADLINE must always be granted while, for - * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism - * to gracefully reduce the frequency when no tasks show up for longer - * periods of time. - * - * Ideally we would like to set bw_dl as min/guaranteed freq and util + - * bw_dl as requested freq. However, cpufreq is not yet ready for such - * an interface. So, we only do the latter for now. - */ - if (type == FREQUENCY_UTIL) - util += cpu_bw_dl(rq); - - return min(max, util); + return min(scale, util); } unsigned long sched_cpu_util(int cpu) { - return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL); + return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL); } #endif /* CONFIG_SMP */ diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index dd7bfdc144b7..819ec1ccc08c 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -47,7 +47,7 @@ struct sugov_cpu { u64 last_update; unsigned long util; - unsigned long bw_dl; + unsigned long bw_min; /* The field below is for single-CPU policies only: */ #ifdef CONFIG_NO_HZ_COMMON @@ -192,7 +192,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, unsigned int freq; freq = get_capacity_ref_freq(policy); - util = map_util_perf(util); freq = map_util_freq(util, freq, max); if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update) @@ -202,14 +201,30 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, return cpufreq_driver_resolve_freq(policy, freq); } +unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual, + unsigned long min, + unsigned long max) +{ + /* Add dvfs headroom to actual utilization */ + actual = map_util_perf(actual); + /* Actually we don't need to target the max performance */ + if (actual < max) + max = actual; + + /* + * Ensure at least minimum performance while providing more compute + * capacity when possible. + */ + return max(min, max); +} + static void sugov_get_util(struct sugov_cpu *sg_cpu) { - unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu); - struct rq *rq = cpu_rq(sg_cpu->cpu); + unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu); - sg_cpu->bw_dl = cpu_bw_dl(rq); - sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util, - FREQUENCY_UTIL, NULL); + util = effective_cpu_util(sg_cpu->cpu, util, &min, &max); + sg_cpu->bw_min = min; + sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max); } /** @@ -355,7 +370,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } */ static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu) { - if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl) + if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min) WRITE_ONCE(sg_cpu->sg_policy->limits_changed, true); } @@ -456,8 +471,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util) sg_cpu->util = prev_util; - cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl), - map_util_perf(sg_cpu->util), max_cap); + cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min, + sg_cpu->util, max_cap); sg_cpu->sg_policy->last_freq_update_time = time; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d8baa27af572..4ccadaa44b0c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8833,7 +8833,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv, for_each_cpu(cpu, pd_cpus) { unsigned long util = cpu_util(cpu, p, -1, 0); - busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL); + busy_time += effective_cpu_util(cpu, util, NULL, NULL); } eenv->pd_busy_time = min(eenv->pd_cap, busy_time); @@ -8856,7 +8856,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus, for_each_cpu(cpu, pd_cpus) { struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL; unsigned long util = cpu_util(cpu, p, dst_cpu, 1); - unsigned long eff_util; + unsigned long eff_util, min, max; /* * Performance domain frequency: utilization clamping @@ -8865,7 +8865,23 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus, * NOTE: in case RT tasks are running, by default the * FREQUENCY_UTIL's utilization can be max OPP. */ - eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk); + eff_util = effective_cpu_util(cpu, util, &min, &max); + + /* Task's uclamp can modify min and max value */ + if (tsk && uclamp_is_used()) { + min = max(min, uclamp_eff_value(p, UCLAMP_MIN)); + + /* + * If there is no active max uclamp constraint, + * directly use task's one, otherwise keep max. + */ + if (uclamp_rq_is_idle(cpu_rq(cpu))) + max = uclamp_eff_value(p, UCLAMP_MAX); + else + max = max(max, uclamp_eff_value(p, UCLAMP_MAX)); + } + + eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max); max_util = max(max_util, eff_util); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3288a59bae62..e8c2c098992b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3200,24 +3200,14 @@ static inline unsigned long capacity_orig_of(int cpu) return cpu_rq(cpu)->cpu_capacity_orig; } -/** - * enum cpu_util_type - CPU utilization type - * @FREQUENCY_UTIL: Utilization used to select frequency - * @ENERGY_UTIL: Utilization used during energy calculation - * - * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time - * need to be aggregated differently depending on the usage made of them. This - * enum is used within effective_cpu_util() to differentiate the types of - * utilization expected by the callers, and adjust the aggregation accordingly. - */ -enum cpu_util_type { - FREQUENCY_UTIL, - ENERGY_UTIL, -}; - unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, - enum cpu_util_type type, - struct task_struct *p); + unsigned long *min, + unsigned long *max); + +unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual, + unsigned long min, + unsigned long max); + /* * Verify the fitness of task @p to run on @cpu taking into account the -- 2.34.1
From: Ingo Molnar <mingo@kernel.org> mainline inclusion from mainline-v6.11-rc1 commit 04746ed80bcf3130951ed4d5c1bc5b0bcabdde22 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- core.c has become rather large, move most scheduler syscall related functionality into a separate file, syscalls.c. This is about ~15% of core.c's raw linecount. Move the alloc_user_cpus_ptr(), __rt_effective_prio(), rt_effective_prio(), uclamp_none(), uclamp_se_set() and uclamp_bucket_id() inlines to kernel/sched/sched.h. Internally export the __sched_setscheduler(), __sched_setaffinity(), __setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(), check_class_changed(), splice_balance_callbacks() and balance_callbacks() methods to better facilitate this. Move the new file's build to sched_policy.c, because it fits there semantically, but also because it's the smallest of the 4 build units under an allmodconfig build: -rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i -rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i -rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i -rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i This better balances build time for scheduler subsystem rebuilds. I build-tested this new file as a standalone syscalls.o file for a bit, to make sure all the encapsulations & abstractions are robust. Also update/add my copyright notices to these files. Build time measurements: # -Before/+After: kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null Performance counter stats for 'm kernel/sched/built-in.a' (5 runs): - 71,938,508,607 cycles ( +- 0.17% ) + 71,992,916,493 cycles ( +- 0.22% ) - 106,214,780,964 instructions # 1.48 insn per cycle ( +- 0.01% ) + 105,450,231,154 instructions # 1.46 insn per cycle ( +- 0.01% ) - 5,878,232,620 ns duration_time ( +- 0.38% ) + 5,290,085,069 ns duration_time ( +- 0.21% ) - 5.8782 +- 0.0221 seconds time elapsed ( +- 0.38% ) + 5.2901 +- 0.0111 seconds time elapsed ( +- 0.21% ) Build time improvement of -11.1% (duration_time) is expected: the parallel build time of the scheduler subsystem is determined by the largest, slowest to build object file, which is kernel/sched/core.o. By moving ~15% of its complexity into another build unit, we reduced build time by -11%. Measured cycles spent on building is within its ~0.2% stddev noise envelope. The -0.7% reduction in instructions spent on building the scheduler is statistically reliable and somewhat surprising - I can only speculate: maybe compilers aren't that efficient at building & optimizing 10+ KLOC files (core.c), and it's an overall win to balance the linecount a bit. Anyway, this might be a data point that suggests that reducing the linecount of our largest files will improve not just code readability and maintainability, but might also improve build times a bit. Code generation got a bit worse, by 0.5kb text on an x86 defconfig build: # -Before/+After: kepler:~/tip> size vmlinux text data bss dec hex filename -26475475 10439178 1740804 38655457 24dd5e1 vmlinux +26476003 10439178 1740804 38655985 24dd7f1 vmlinux kepler:~/tip> size kernel/sched/built-in.a text data bss dec hex filename - 76056 30025 489 106570 1a04a kernel/sched/core.o (ex kernel/sched/built-in.a) + 63452 29453 489 93394 16cd2 kernel/sched/core.o (ex kernel/sched/built-in.a) 44299 2181 104 46584 b5f8 kernel/sched/fair.o (ex kernel/sched/built-in.a) - 42764 3424 120 46308 b4e4 kernel/sched/build_policy.o (ex kernel/sched/built-in.a) + 55651 4044 120 59815 e9a7 kernel/sched/build_policy.o (ex kernel/sched/built-in.a) 44866 12655 2192 59713 e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a) 44866 12655 2192 59713 e941 kernel/sched/build_utility.o (ex kernel/sched/built-in.a) This is primarily due to the extra functions exported, and the size gets exaggerated somewhat by __pfx CFI function padding: ffffffff810cc710 <__pfx_enqueue_task>: ffffffff810cc710: 90 nop ffffffff810cc711: 90 nop ffffffff810cc712: 90 nop ffffffff810cc713: 90 nop ffffffff810cc714: 90 nop ffffffff810cc715: 90 nop ffffffff810cc716: 90 nop ffffffff810cc717: 90 nop ffffffff810cc718: 90 nop ffffffff810cc719: 90 nop ffffffff810cc71a: 90 nop ffffffff810cc71b: 90 nop ffffffff810cc71c: 90 nop ffffffff810cc71d: 90 nop ffffffff810cc71e: 90 nop ffffffff810cc71f: 90 nop AFAICS the cost is primarily not to core.o and fair.o though (which contain most performance sensitive scheduler functions), only to syscalls.o that get called with much lower frequency - so I think this is an acceptable trade-off for better code separation. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org Conflicts: kernel/sched/core.c kernel/sched/sched.h kernel/sched/build_policy.c [The code move from core.c to syscalls.c, this version has additional patches more than the mainline one, 1621f4654db7 ("sched: Introduce qos scheduler for co-location"), so move it together to the new file syscalls.c. Fix compile error for the code move from sched/core.c to sched/syscalls.c in 04746ed80bcf ("sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c"). As for other conflicts, only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/build_policy.c | 1 + kernel/sched/core.c | 1973 ++--------------------------------- kernel/sched/sched.h | 134 ++- kernel/sched/syscalls.c | 1693 ++++++++++++++++++++++++++++++ 4 files changed, 1923 insertions(+), 1878 deletions(-) create mode 100644 kernel/sched/syscalls.c diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c index 6aff6482b1e7..e954328cc093 100644 --- a/kernel/sched/build_policy.c +++ b/kernel/sched/build_policy.c @@ -55,3 +55,4 @@ #ifdef CONFIG_SCHED_SOFT_DOMAIN #include "soft_domain.c" #endif +#include "syscalls.c" diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ecc8f7d2a9bc..d28790f1b7ca 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2,9 +2,10 @@ /* * kernel/sched/core.c * - * Core kernel scheduler code and related syscalls + * Core kernel CPU scheduler code * * Copyright (C) 1991-2002 Linus Torvalds + * Copyright (C) 1998-2024 Ingo Molnar, Red Hat */ #include <linux/highmem.h> #include <linux/hrtimer_api.h> @@ -1316,7 +1317,7 @@ int tg_nop(struct task_group *tg, void *data) } #endif -static void set_load_weight(struct task_struct *p, bool update_load) +void set_load_weight(struct task_struct *p, bool update_load) { int prio = p->static_prio - MAX_RT_PRIO; struct load_weight lw; @@ -1373,7 +1374,7 @@ static unsigned int __maybe_unused sysctl_sched_uclamp_util_max = SCHED_CAPACITY * This knob will not override the system default sched_util_clamp_min defined * above. */ -static unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE; +unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE; /* All clamps are required to be less or equal than these values */ static struct uclamp_se uclamp_default[UCLAMP_CNT]; @@ -1398,32 +1399,6 @@ static struct uclamp_se uclamp_default[UCLAMP_CNT]; */ DEFINE_STATIC_KEY_FALSE(sched_uclamp_used); -/* Integer rounded range for each bucket */ -#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS) - -#define for_each_clamp_id(clamp_id) \ - for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++) - -static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) -{ - return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1); -} - -static inline unsigned int uclamp_none(enum uclamp_id clamp_id) -{ - if (clamp_id == UCLAMP_MIN) - return 0; - return SCHED_CAPACITY_SCALE; -} - -static inline void uclamp_se_set(struct uclamp_se *uc_se, - unsigned int value, bool user_defined) -{ - uc_se->value = value; - uc_se->bucket_id = uclamp_bucket_id(value); - uc_se->user_defined = user_defined; -} - static inline unsigned int uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id, unsigned int clamp_value) @@ -1889,107 +1864,6 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write, #endif #endif -static int uclamp_validate(struct task_struct *p, - const struct sched_attr *attr) -{ - int util_min = p->uclamp_req[UCLAMP_MIN].value; - int util_max = p->uclamp_req[UCLAMP_MAX].value; - - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { - util_min = attr->sched_util_min; - - if (util_min + 1 > SCHED_CAPACITY_SCALE + 1) - return -EINVAL; - } - - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { - util_max = attr->sched_util_max; - - if (util_max + 1 > SCHED_CAPACITY_SCALE + 1) - return -EINVAL; - } - - if (util_min != -1 && util_max != -1 && util_min > util_max) - return -EINVAL; - - /* - * We have valid uclamp attributes; make sure uclamp is enabled. - * - * We need to do that here, because enabling static branches is a - * blocking operation which obviously cannot be done while holding - * scheduler locks. - */ - static_branch_enable(&sched_uclamp_used); - - return 0; -} - -static bool uclamp_reset(const struct sched_attr *attr, - enum uclamp_id clamp_id, - struct uclamp_se *uc_se) -{ - /* Reset on sched class change for a non user-defined clamp value. */ - if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)) && - !uc_se->user_defined) - return true; - - /* Reset on sched_util_{min,max} == -1. */ - if (clamp_id == UCLAMP_MIN && - attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN && - attr->sched_util_min == -1) { - return true; - } - - if (clamp_id == UCLAMP_MAX && - attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX && - attr->sched_util_max == -1) { - return true; - } - - return false; -} - -static void __setscheduler_uclamp(struct task_struct *p, - const struct sched_attr *attr) -{ - enum uclamp_id clamp_id; - - for_each_clamp_id(clamp_id) { - struct uclamp_se *uc_se = &p->uclamp_req[clamp_id]; - unsigned int value; - - if (!uclamp_reset(attr, clamp_id, uc_se)) - continue; - - /* - * RT by default have a 100% boost value that could be modified - * at runtime. - */ - if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) - value = sysctl_sched_uclamp_util_min_rt_default; - else - value = uclamp_none(clamp_id); - - uclamp_se_set(uc_se, value, false); - - } - - if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) - return; - - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN && - attr->sched_util_min != -1) { - uclamp_se_set(&p->uclamp_req[UCLAMP_MIN], - attr->sched_util_min, true); - } - - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX && - attr->sched_util_max != -1) { - uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], - attr->sched_util_max, true); - } -} - static void uclamp_fork(struct task_struct *p) { enum uclamp_id clamp_id; @@ -2057,13 +1931,6 @@ static void __init init_uclamp(void) #else /* CONFIG_UCLAMP_TASK */ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { } static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { } -static inline int uclamp_validate(struct task_struct *p, - const struct sched_attr *attr) -{ - return -EOPNOTSUPP; -} -static void __setscheduler_uclamp(struct task_struct *p, - const struct sched_attr *attr) { } static inline void uclamp_fork(struct task_struct *p) { } static inline void uclamp_post_fork(struct task_struct *p) { } static inline void init_uclamp(void) { } @@ -2093,7 +1960,7 @@ unsigned long get_wchan(struct task_struct *p) return ip; } -static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) +void enqueue_task(struct rq *rq, struct task_struct *p, int flags) { if (!(flags & ENQUEUE_NOCLOCK)) update_rq_clock(rq); @@ -2110,7 +1977,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) sched_core_enqueue(rq, p); } -static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags) +void dequeue_task(struct rq *rq, struct task_struct *p, int flags) { if (sched_core_enabled(rq)) sched_core_dequeue(rq, p, flags); @@ -2146,52 +2013,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags) dequeue_task(rq, p, flags); } -static inline int __normal_prio(int policy, int rt_prio, int nice) -{ - int prio; - - if (dl_policy(policy)) - prio = MAX_DL_PRIO - 1; - else if (rt_policy(policy)) - prio = MAX_RT_PRIO - 1 - rt_prio; - else - prio = NICE_TO_PRIO(nice); - - return prio; -} - -/* - * Calculate the expected normal priority: i.e. priority - * without taking RT-inheritance into account. Might be - * boosted by interactivity modifiers. Changes upon fork, - * setprio syscalls, and whenever the interactivity - * estimator recalculates. - */ -static inline int normal_prio(struct task_struct *p) -{ - return __normal_prio(p->policy, p->rt_priority, PRIO_TO_NICE(p->static_prio)); -} - -/* - * Calculate the current priority, i.e. the priority - * taken into account by the scheduler. This value might - * be boosted by RT tasks, or might be boosted by - * interactivity modifiers. Will be RT if the task got - * RT-boosted. If not then it returns p->normal_prio. - */ -static int effective_prio(struct task_struct *p) -{ - p->normal_prio = normal_prio(p); - /* - * If we are RT tasks or we were boosted to RT priority, - * keep the priority unchanged. Otherwise, update priority - * to the normal priority: - */ - if (!rt_prio(p->prio)) - return p->normal_prio; - return p->prio; -} - /** * task_curr - is this task currently executing on a CPU? * @p: the task in question. @@ -2210,9 +2031,9 @@ inline int task_curr(const struct task_struct *p) * this means any call to check_class_changed() must be followed by a call to * balance_callback(). */ -static inline void check_class_changed(struct rq *rq, struct task_struct *p, - const struct sched_class *prev_class, - int oldprio) +void check_class_changed(struct rq *rq, struct task_struct *p, + const struct sched_class *prev_class, + int oldprio) { if (prev_class != p->sched_class) { if (prev_class->switched_from) @@ -2383,9 +2204,6 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state static void __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx); -static int __set_cpus_allowed_ptr(struct task_struct *p, - struct affinity_context *ctx); - static void migrate_disable_switch(struct rq *rq, struct task_struct *p) { struct affinity_context ac = { @@ -2819,16 +2637,6 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) kfree_rcu((union cpumask_rcuhead *)ac.user_mask, rcu); } -static cpumask_t *alloc_user_cpus_ptr(int node) -{ - /* - * See do_set_cpus_allowed() above for the rcu_head usage. - */ - int size = max_t(int, cpumask_size(), sizeof(struct rcu_head)); - - return kmalloc_node(size, GFP_KERNEL, node); -} - int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src, int node) { @@ -3197,8 +3005,7 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p, * task must not exit() & deallocate itself prematurely. The * call is not atomic; no spinlocks may be held. */ -static int __set_cpus_allowed_ptr(struct task_struct *p, - struct affinity_context *ctx) +int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx) { struct rq_flags rf; struct rq *rq; @@ -3317,9 +3124,6 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p) free_cpumask_var(new_mask); } -static int -__sched_setaffinity(struct task_struct *p, struct affinity_context *ctx); - /* * Restore the affinity of a task @p which was previously restricted by a * call to force_compatible_cpus_allowed_ptr(). @@ -3699,12 +3503,6 @@ void sched_set_stop_task(int cpu, struct task_struct *stop) #else /* CONFIG_SMP */ -static inline int __set_cpus_allowed_ptr(struct task_struct *p, - struct affinity_context *ctx) -{ - return set_cpus_allowed_ptr(p, ctx->new_mask); -} - static inline void migrate_disable_switch(struct rq *rq, struct task_struct *p) { } static inline bool rq_has_pinned_tasks(struct rq *rq) @@ -3712,11 +3510,6 @@ static inline bool rq_has_pinned_tasks(struct rq *rq) return false; } -static inline cpumask_t *alloc_user_cpus_ptr(int node) -{ - return NULL; -} - #endif /* !CONFIG_SMP */ static void @@ -5145,7 +4938,7 @@ __splice_balance_callbacks(struct rq *rq, bool split) return head; } -static inline struct balance_callback *splice_balance_callbacks(struct rq *rq) +struct balance_callback *splice_balance_callbacks(struct rq *rq) { return __splice_balance_callbacks(rq, true); } @@ -5155,7 +4948,7 @@ static void __balance_callbacks(struct rq *rq) do_balance_callbacks(rq, __splice_balance_callbacks(rq, false)); } -static inline void balance_callbacks(struct rq *rq, struct balance_callback *head) +void balance_callbacks(struct rq *rq, struct balance_callback *head) { unsigned long flags; @@ -5172,15 +4965,6 @@ static inline void __balance_callbacks(struct rq *rq) { } -static inline struct balance_callback *splice_balance_callbacks(struct rq *rq) -{ - return NULL; -} - -static inline void balance_callbacks(struct rq *rq, struct balance_callback *head) -{ -} - #endif static inline void @@ -7099,7 +6883,7 @@ int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flag } EXPORT_SYMBOL(default_wake_function); -static void __setscheduler_prio(struct task_struct *p, int prio) +void __setscheduler_prio(struct task_struct *p, int prio) { if (dl_prio(prio)) p->sched_class = &dl_sched_class; @@ -7113,21 +6897,6 @@ static void __setscheduler_prio(struct task_struct *p, int prio) #ifdef CONFIG_RT_MUTEXES -static inline int __rt_effective_prio(struct task_struct *pi_task, int prio) -{ - if (pi_task) - prio = min(prio, pi_task->prio); - - return prio; -} - -static inline int rt_effective_prio(struct task_struct *p, int prio) -{ - struct task_struct *pi_task = rt_mutex_get_top_task(p); - - return __rt_effective_prio(pi_task, prio); -} - /* * rt_mutex_setprio - set the current priority of a task * @p: task to boost @@ -7256,1454 +7025,117 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) preempt_enable(); } -#else -static inline int rt_effective_prio(struct task_struct *p, int prio) -{ - return prio; -} #endif -void set_user_nice(struct task_struct *p, long nice) +#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) +int __sched __cond_resched(void) { - bool queued, running; - struct rq *rq; - int old_prio; - - if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE) - return; - /* - * We have to be careful, if called from sys_setpriority(), - * the task might be in the middle of scheduling on another CPU. - */ - CLASS(task_rq_lock, rq_guard)(p); - rq = rq_guard.rq; - - update_rq_clock(rq); - - /* - * The RT priorities are set via sched_setscheduler(), but we still - * allow the 'normal' nice value to be set - but as expected - * it won't have any effect on scheduling until the task is - * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR: - */ - if (task_has_dl_policy(p) || task_has_rt_policy(p)) { - p->static_prio = NICE_TO_PRIO(nice); - return; + if (should_resched(0) && !irqs_disabled()) { + preempt_schedule_common(); + return 1; } - - queued = task_on_rq_queued(p); - running = task_current(rq, p); - if (queued) - dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK); - if (running) - put_prev_task(rq, p); - - p->static_prio = NICE_TO_PRIO(nice); - set_load_weight(p, true); - old_prio = p->prio; - p->prio = effective_prio(p); - - if (queued) - enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK); - if (running) - set_next_task(rq, p); - /* - * If the task increased its priority or is running and - * lowered its priority, then reschedule its CPU: + * In preemptible kernels, ->rcu_read_lock_nesting tells the tick + * whether the current CPU is in an RCU read-side critical section, + * so the tick can report quiescent states even for CPUs looping + * in kernel context. In contrast, in non-preemptible kernels, + * RCU readers leave no in-memory hints, which means that CPU-bound + * processes executing in kernel context might never report an + * RCU quiescent state. Therefore, the following code causes + * cond_resched() to report a quiescent state, but only when RCU + * is in urgent need of one. */ - p->sched_class->prio_changed(rq, p, old_prio); +#ifndef CONFIG_PREEMPT_RCU + rcu_all_qs(); +#endif + return 0; } -EXPORT_SYMBOL(set_user_nice); +EXPORT_SYMBOL(__cond_resched); +#endif -/* - * is_nice_reduction - check if nice value is an actual reduction - * - * Similar to can_nice() but does not perform a capability check. - * - * @p: task - * @nice: nice value - */ -static bool is_nice_reduction(const struct task_struct *p, const int nice) -{ - /* Convert nice value [19,-20] to rlimit style value [1,40]: */ - int nice_rlim = nice_to_rlimit(nice); +#ifdef CONFIG_PREEMPT_DYNAMIC +#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL) +#define cond_resched_dynamic_enabled __cond_resched +#define cond_resched_dynamic_disabled ((void *)&__static_call_return0) +DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched); +EXPORT_STATIC_CALL_TRAMP(cond_resched); - return (nice_rlim <= task_rlimit(p, RLIMIT_NICE)); +#define might_resched_dynamic_enabled __cond_resched +#define might_resched_dynamic_disabled ((void *)&__static_call_return0) +DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched); +EXPORT_STATIC_CALL_TRAMP(might_resched); +#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY) +static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched); +int __sched dynamic_cond_resched(void) +{ + klp_sched_try_switch(); + if (!static_branch_unlikely(&sk_dynamic_cond_resched)) + return 0; + return __cond_resched(); } +EXPORT_SYMBOL(dynamic_cond_resched); -/* - * can_nice - check if a task can reduce its nice value - * @p: task - * @nice: nice value - */ -int can_nice(const struct task_struct *p, const int nice) +static DEFINE_STATIC_KEY_FALSE(sk_dynamic_might_resched); +int __sched dynamic_might_resched(void) { - return is_nice_reduction(p, nice) || capable(CAP_SYS_NICE); + if (!static_branch_unlikely(&sk_dynamic_might_resched)) + return 0; + return __cond_resched(); } - -#ifdef __ARCH_WANT_SYS_NICE +EXPORT_SYMBOL(dynamic_might_resched); +#endif +#endif /* - * sys_nice - change the priority of the current process. - * @increment: priority increment + * __cond_resched_lock() - if a reschedule is pending, drop the given lock, + * call schedule, and on return reacquire the lock. * - * sys_setpriority is a more generic, but much slower function that - * does similar things. + * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level + * operations here to prevent schedule() from being called twice (once via + * spin_unlock(), once by hand). */ -SYSCALL_DEFINE1(nice, int, increment) +int __cond_resched_lock(spinlock_t *lock) { - long nice, retval; - - /* - * Setpriority might change our priority at the same moment. - * We don't have to worry. Conceptually one call occurs first - * and we have a single winner. - */ - increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH); - nice = task_nice(current) + increment; - - nice = clamp_val(nice, MIN_NICE, MAX_NICE); - if (increment < 0 && !can_nice(current, nice)) - return -EPERM; + int resched = should_resched(PREEMPT_LOCK_OFFSET); + int ret = 0; - retval = security_task_setnice(current, nice); - if (retval) - return retval; + lockdep_assert_held(lock); - set_user_nice(current, nice); - return 0; + if (spin_needbreak(lock) || resched) { + spin_unlock(lock); + if (!_cond_resched()) + cpu_relax(); + ret = 1; + spin_lock(lock); + } + return ret; } +EXPORT_SYMBOL(__cond_resched_lock); -#endif - -/** - * task_prio - return the priority value of a given task. - * @p: the task in question. - * - * Return: The priority value as seen by users in /proc. - * - * sched policy return value kernel prio user prio/nice - * - * normal, batch, idle [0 ... 39] [100 ... 139] 0/[-20 ... 19] - * fifo, rr [-2 ... -100] [98 ... 0] [1 ... 99] - * deadline -101 -1 0 - */ -int task_prio(const struct task_struct *p) +int __cond_resched_rwlock_read(rwlock_t *lock) { - return p->prio - MAX_RT_PRIO; + int resched = should_resched(PREEMPT_LOCK_OFFSET); + int ret = 0; + + lockdep_assert_held_read(lock); + + if (rwlock_needbreak(lock) || resched) { + read_unlock(lock); + if (!_cond_resched()) + cpu_relax(); + ret = 1; + read_lock(lock); + } + return ret; } +EXPORT_SYMBOL(__cond_resched_rwlock_read); -/** - * idle_cpu - is a given CPU idle currently? - * @cpu: the processor in question. - * - * Return: 1 if the CPU is currently idle. 0 otherwise. - */ -int idle_cpu(int cpu) +int __cond_resched_rwlock_write(rwlock_t *lock) { - struct rq *rq = cpu_rq(cpu); + int resched = should_resched(PREEMPT_LOCK_OFFSET); + int ret = 0; - if (rq->curr != rq->idle) - return 0; - - if (rq->nr_running) - return 0; - -#ifdef CONFIG_SMP - if (rq->ttwu_pending) - return 0; -#endif - - return 1; -} - -/** - * available_idle_cpu - is a given CPU idle for enqueuing work. - * @cpu: the CPU in question. - * - * Return: 1 if the CPU is currently idle. 0 otherwise. - */ -int available_idle_cpu(int cpu) -{ - if (!idle_cpu(cpu)) - return 0; - - if (vcpu_is_preempted(cpu)) - return 0; - - return 1; -} - -/** - * idle_task - return the idle task for a given CPU. - * @cpu: the processor in question. - * - * Return: The idle task for the CPU @cpu. - */ -struct task_struct *idle_task(int cpu) -{ - return cpu_rq(cpu)->idle; -} - -#ifdef CONFIG_SCHED_CORE -int sched_core_idle_cpu(int cpu) -{ - struct rq *rq = cpu_rq(cpu); - - if (sched_core_enabled(rq) && rq->curr == rq->idle) - return 1; - - return idle_cpu(cpu); -} - -#endif - -#ifdef CONFIG_SMP -/* - * This function computes an effective utilization for the given CPU, to be - * used for frequency selection given the linear relation: f = u * f_max. - * - * The scheduler tracks the following metrics: - * - * cpu_util_{cfs,rt,dl,irq}() - * cpu_bw_dl() - * - * Where the cfs,rt and dl util numbers are tracked with the same metric and - * synchronized windows and are thus directly comparable. - * - * The cfs,rt,dl utilization are the running times measured with rq->clock_task - * which excludes things like IRQ and steal-time. These latter are then accrued - * in the irq utilization. - * - * The DL bandwidth number otoh is not a measured metric but a value computed - * based on the task model parameters and gives the minimal utilization - * required to meet deadlines. - */ -unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, - unsigned long *min, - unsigned long *max) -{ - unsigned long util, irq, scale; - struct rq *rq = cpu_rq(cpu); - - scale = arch_scale_cpu_capacity(cpu); - - /* - * Early check to see if IRQ/steal time saturates the CPU, can be - * because of inaccuracies in how we track these -- see - * update_irq_load_avg(). - */ - irq = cpu_util_irq(rq); - if (unlikely(irq >= scale)) { - if (min) - *min = scale; - if (max) - *max = scale; - return scale; - } - - if (min) { - /* - * The minimum utilization returns the highest level between: - * - the computed DL bandwidth needed with the IRQ pressure which - * steals time to the deadline task. - * - The minimum performance requirement for CFS and/or RT. - */ - *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN)); - - /* - * When an RT task is runnable and uclamp is not used, we must - * ensure that the task will run at maximum compute capacity. - */ - if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt)) - *min = max(*min, scale); - } - - /* - * Because the time spend on RT/DL tasks is visible as 'lost' time to - * CFS tasks and we use the same metric to track the effective - * utilization (PELT windows are synchronized) we can directly add them - * to obtain the CPU's actual utilization. - */ - util = util_cfs + cpu_util_rt(rq); - util += cpu_util_dl(rq); - - /* - * The maximum hint is a soft bandwidth requirement, which can be lower - * than the actual utilization because of uclamp_max requirements. - */ - if (max) - *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX)); - - if (util >= scale) - return scale; - - /* - * There is still idle time; further improve the number by using the - * irq metric. Because IRQ/steal time is hidden from the task clock we - * need to scale the task numbers: - * - * max - irq - * U' = irq + --------- * U - * max - */ - util = scale_irq_capacity(util, irq, scale); - util += irq; - - return min(scale, util); -} - -unsigned long sched_cpu_util(int cpu) -{ - return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL); -} -#endif /* CONFIG_SMP */ - -/** - * find_process_by_pid - find a process with a matching PID value. - * @pid: the pid in question. - * - * The task of @pid, if found. %NULL otherwise. - */ -static struct task_struct *find_process_by_pid(pid_t pid) -{ - return pid ? find_task_by_vpid(pid) : current; -} - -static struct task_struct *find_get_task(pid_t pid) -{ - struct task_struct *p; - guard(rcu)(); - - p = find_process_by_pid(pid); - if (likely(p)) - get_task_struct(p); - - return p; -} - -DEFINE_CLASS(find_get_task, struct task_struct *, if (_T) put_task_struct(_T), - find_get_task(pid), pid_t pid) - -/* - * sched_setparam() passes in -1 for its policy, to let the functions - * it calls know not to change it. - */ -#define SETPARAM_POLICY -1 - -static void __setscheduler_params(struct task_struct *p, - const struct sched_attr *attr) -{ - int policy = attr->sched_policy; - - if (policy == SETPARAM_POLICY) - policy = p->policy; - - p->policy = policy; - - if (dl_policy(policy)) - __setparam_dl(p, attr); - else if (fair_policy(policy)) - p->static_prio = NICE_TO_PRIO(attr->sched_nice); - - /* rt-policy tasks do not have a timerslack */ - if (task_is_realtime(p)) { - p->timer_slack_ns = 0; - } else if (p->timer_slack_ns == 0) { - /* when switching back to non-rt policy, restore timerslack */ - p->timer_slack_ns = p->default_timer_slack_ns; - } - - /* - * __sched_setscheduler() ensures attr->sched_priority == 0 when - * !rt_policy. Always setting this ensures that things like - * getparam()/getattr() don't report silly values for !rt tasks. - */ - p->rt_priority = attr->sched_priority; - p->normal_prio = normal_prio(p); - set_load_weight(p, true); -} - -/* - * Check the target process has a UID that matches the current process's: - */ -static bool check_same_owner(struct task_struct *p) -{ - const struct cred *cred = current_cred(), *pcred; - guard(rcu)(); - - pcred = __task_cred(p); - return (uid_eq(cred->euid, pcred->euid) || - uid_eq(cred->euid, pcred->uid)); -} - -/* - * Allow unprivileged RT tasks to decrease priority. - * Only issue a capable test if needed and only once to avoid an audit - * event on permitted non-privileged operations: - */ -static int user_check_sched_setscheduler(struct task_struct *p, - const struct sched_attr *attr, - int policy, int reset_on_fork) -{ - if (fair_policy(policy)) { - if (attr->sched_nice < task_nice(p) && - !is_nice_reduction(p, attr->sched_nice)) - goto req_priv; - } - - if (rt_policy(policy)) { - unsigned long rlim_rtprio = task_rlimit(p, RLIMIT_RTPRIO); - - /* Can't set/change the rt policy: */ - if (policy != p->policy && !rlim_rtprio) - goto req_priv; - - /* Can't increase priority: */ - if (attr->sched_priority > p->rt_priority && - attr->sched_priority > rlim_rtprio) - goto req_priv; - } - - /* - * Can't set/change SCHED_DEADLINE policy at all for now - * (safest behavior); in the future we would like to allow - * unprivileged DL tasks to increase their relative deadline - * or reduce their runtime (both ways reducing utilization) - */ - if (dl_policy(policy)) - goto req_priv; - - /* - * Treat SCHED_IDLE as nice 20. Only allow a switch to - * SCHED_NORMAL if the RLIMIT_NICE would normally permit it. - */ - if (task_has_idle_policy(p) && !idle_policy(policy)) { - if (!is_nice_reduction(p, task_nice(p))) - goto req_priv; - } - - /* Can't change other user's priorities: */ - if (!check_same_owner(p)) - goto req_priv; - - /* Normal users shall not reset the sched_reset_on_fork flag: */ - if (p->sched_reset_on_fork && !reset_on_fork) - goto req_priv; - - return 0; - -req_priv: - if (!capable(CAP_SYS_NICE)) - return -EPERM; - - return 0; -} - -static int __sched_setscheduler(struct task_struct *p, - const struct sched_attr *attr, - bool user, bool pi) -{ - int oldpolicy = -1, policy = attr->sched_policy; - int retval, oldprio, newprio, queued, running; - const struct sched_class *prev_class; - struct balance_callback *head; - struct rq_flags rf; - int reset_on_fork; - int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; - struct rq *rq; - bool cpuset_locked = false; - - /* The pi code expects interrupts enabled */ - BUG_ON(pi && in_interrupt()); -recheck: - /* Double check policy once rq lock held: */ - if (policy < 0) { - reset_on_fork = p->sched_reset_on_fork; - policy = oldpolicy = p->policy; - } else { - reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK); - - if (!valid_policy(policy)) - return -EINVAL; - } - - if (attr->sched_flags & ~(SCHED_FLAG_ALL | SCHED_FLAG_SUGOV)) - return -EINVAL; - - /* - * Valid priorities for SCHED_FIFO and SCHED_RR are - * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL, - * SCHED_BATCH and SCHED_IDLE is 0. - */ - if (attr->sched_priority > MAX_RT_PRIO-1) - return -EINVAL; - if ((dl_policy(policy) && !__checkparam_dl(attr)) || - (rt_policy(policy) != (attr->sched_priority != 0))) - return -EINVAL; - - if (user) { - retval = user_check_sched_setscheduler(p, attr, policy, reset_on_fork); - if (retval) - return retval; - - if (attr->sched_flags & SCHED_FLAG_SUGOV) - return -EINVAL; - - retval = security_task_setscheduler(p); - if (retval) - return retval; - } - - /* Update task specific "requested" clamps */ - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) { - retval = uclamp_validate(p, attr); - if (retval) - return retval; - } - - /* - * SCHED_DEADLINE bandwidth accounting relies on stable cpusets - * information. - */ - if (dl_policy(policy) || dl_policy(p->policy)) { - cpuset_locked = true; - cpuset_lock(); - } - - /* - * Make sure no PI-waiters arrive (or leave) while we are - * changing the priority of the task: - * - * To be able to change p->policy safely, the appropriate - * runqueue lock must be held. - */ - rq = task_rq_lock(p, &rf); - update_rq_clock(rq); - - /* - * Changing the policy of the stop threads its a very bad idea: - */ - if (p == rq->stop) { - retval = -EINVAL; - goto unlock; - } - - /* - * If not changing anything there's no need to proceed further, - * but store a possible modification of reset_on_fork. - */ - if (unlikely(policy == p->policy)) { - if (fair_policy(policy) && attr->sched_nice != task_nice(p)) - goto change; - if (rt_policy(policy) && attr->sched_priority != p->rt_priority) - goto change; - if (dl_policy(policy) && dl_param_changed(p, attr)) - goto change; - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) - goto change; - - p->sched_reset_on_fork = reset_on_fork; - retval = 0; - goto unlock; - } -change: - -#ifdef CONFIG_QOS_SCHED - /* - * If the scheduling policy of an offline task is set to a policy - * other than SCHED_IDLE, the online task preemption and cpu resource - * isolation will be invalid, so return -EINVAL in this case. - */ - if (unlikely(is_offline_level(task_group(p)->qos_level) && !idle_policy(policy))) { - retval = -EINVAL; - goto unlock; - } -#endif - - if (user) { -#ifdef CONFIG_RT_GROUP_SCHED - /* - * Do not allow realtime tasks into groups that have no runtime - * assigned. - */ - if (rt_bandwidth_enabled() && rt_policy(policy) && - task_group(p)->rt_bandwidth.rt_runtime == 0 && - !task_group_is_autogroup(task_group(p))) { - retval = -EPERM; - goto unlock; - } -#endif -#ifdef CONFIG_SMP - if (dl_bandwidth_enabled() && dl_policy(policy) && - !(attr->sched_flags & SCHED_FLAG_SUGOV)) { - cpumask_t *span = rq->rd->span; - - /* - * Don't allow tasks with an affinity mask smaller than - * the entire root_domain to become SCHED_DEADLINE. We - * will also fail if there's no bandwidth available. - */ - if (!cpumask_subset(span, p->cpus_ptr) || - rq->rd->dl_bw.bw == 0) { - retval = -EPERM; - goto unlock; - } - } -#endif - } - - /* Re-check policy now with rq lock held: */ - if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) { - policy = oldpolicy = -1; - task_rq_unlock(rq, p, &rf); - if (cpuset_locked) - cpuset_unlock(); - goto recheck; - } - - /* - * If setscheduling to SCHED_DEADLINE (or changing the parameters - * of a SCHED_DEADLINE task) we need to check if enough bandwidth - * is available. - */ - if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) { - retval = -EBUSY; - goto unlock; - } - - p->sched_reset_on_fork = reset_on_fork; - oldprio = p->prio; - - newprio = __normal_prio(policy, attr->sched_priority, attr->sched_nice); - if (pi) { - /* - * Take priority boosted tasks into account. If the new - * effective priority is unchanged, we just store the new - * normal parameters and do not touch the scheduler class and - * the runqueue. This will be done when the task deboost - * itself. - */ - newprio = rt_effective_prio(p, newprio); - if (newprio == oldprio) - queue_flags &= ~DEQUEUE_MOVE; - } - - queued = task_on_rq_queued(p); - running = task_current(rq, p); - if (queued) - dequeue_task(rq, p, queue_flags); - if (running) - put_prev_task(rq, p); - - prev_class = p->sched_class; - - if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) { - __setscheduler_params(p, attr); - __setscheduler_prio(p, newprio); - } - __setscheduler_uclamp(p, attr); - - if (queued) { - /* - * We enqueue to tail when the priority of a task is - * increased (user space view). - */ - if (oldprio < p->prio) - queue_flags |= ENQUEUE_HEAD; - - enqueue_task(rq, p, queue_flags); - } - if (running) - set_next_task(rq, p); - - check_class_changed(rq, p, prev_class, oldprio); - - /* Avoid rq from going away on us: */ - preempt_disable(); - head = splice_balance_callbacks(rq); - task_rq_unlock(rq, p, &rf); - - if (pi) { - if (cpuset_locked) - cpuset_unlock(); - rt_mutex_adjust_pi(p); - } - - /* Run balance callbacks after we've adjusted the PI chain: */ - balance_callbacks(rq, head); - preempt_enable(); - - return 0; - -unlock: - task_rq_unlock(rq, p, &rf); - if (cpuset_locked) - cpuset_unlock(); - return retval; -} - -static int _sched_setscheduler(struct task_struct *p, int policy, - const struct sched_param *param, bool check) -{ - struct sched_attr attr = { - .sched_policy = policy, - .sched_priority = param->sched_priority, - .sched_nice = PRIO_TO_NICE(p->static_prio), - }; - - /* Fixup the legacy SCHED_RESET_ON_FORK hack. */ - if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) { - attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK; - policy &= ~SCHED_RESET_ON_FORK; - attr.sched_policy = policy; - } - - return __sched_setscheduler(p, &attr, check, true); -} -/** - * sched_setscheduler - change the scheduling policy and/or RT priority of a thread. - * @p: the task in question. - * @policy: new policy. - * @param: structure containing the new RT priority. - * - * Use sched_set_fifo(), read its comment. - * - * Return: 0 on success. An error code otherwise. - * - * NOTE that the task may be already dead. - */ -int sched_setscheduler(struct task_struct *p, int policy, - const struct sched_param *param) -{ - return _sched_setscheduler(p, policy, param, true); -} - -int sched_setattr(struct task_struct *p, const struct sched_attr *attr) -{ - return __sched_setscheduler(p, attr, true, true); -} - -int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr) -{ - return __sched_setscheduler(p, attr, false, true); -} -EXPORT_SYMBOL_GPL(sched_setattr_nocheck); - -/** - * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace. - * @p: the task in question. - * @policy: new policy. - * @param: structure containing the new RT priority. - * - * Just like sched_setscheduler, only don't bother checking if the - * current context has permission. For example, this is needed in - * stop_machine(): we create temporary high priority worker threads, - * but our caller might not have that capability. - * - * Return: 0 on success. An error code otherwise. - */ -int sched_setscheduler_nocheck(struct task_struct *p, int policy, - const struct sched_param *param) -{ - return _sched_setscheduler(p, policy, param, false); -} - -/* - * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally - * incapable of resource management, which is the one thing an OS really should - * be doing. - * - * This is of course the reason it is limited to privileged users only. - * - * Worse still; it is fundamentally impossible to compose static priority - * workloads. You cannot take two correctly working static prio workloads - * and smash them together and still expect them to work. - * - * For this reason 'all' FIFO tasks the kernel creates are basically at: - * - * MAX_RT_PRIO / 2 - * - * The administrator _MUST_ configure the system, the kernel simply doesn't - * know enough information to make a sensible choice. - */ -void sched_set_fifo(struct task_struct *p) -{ - struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 }; - WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0); -} -EXPORT_SYMBOL_GPL(sched_set_fifo); - -/* - * For when you don't much care about FIFO, but want to be above SCHED_NORMAL. - */ -void sched_set_fifo_low(struct task_struct *p) -{ - struct sched_param sp = { .sched_priority = 1 }; - WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0); -} -EXPORT_SYMBOL_GPL(sched_set_fifo_low); - -void sched_set_normal(struct task_struct *p, int nice) -{ - struct sched_attr attr = { - .sched_policy = SCHED_NORMAL, - .sched_nice = nice, - }; - WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0); -} -EXPORT_SYMBOL_GPL(sched_set_normal); - -static int -do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param) -{ - struct sched_param lparam; - - if (!param || pid < 0) - return -EINVAL; - if (copy_from_user(&lparam, param, sizeof(struct sched_param))) - return -EFAULT; - - CLASS(find_get_task, p)(pid); - if (!p) - return -ESRCH; - - return sched_setscheduler(p, policy, &lparam); -} - -/* - * Mimics kernel/events/core.c perf_copy_attr(). - */ -static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr) -{ - u32 size; - int ret; - - /* Zero the full structure, so that a short copy will be nice: */ - memset(attr, 0, sizeof(*attr)); - - ret = get_user(size, &uattr->size); - if (ret) - return ret; - - /* ABI compatibility quirk: */ - if (!size) - size = SCHED_ATTR_SIZE_VER0; - if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE) - goto err_size; - - ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size); - if (ret) { - if (ret == -E2BIG) - goto err_size; - return ret; - } - - if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) && - size < SCHED_ATTR_SIZE_VER1) - return -EINVAL; - - /* - * XXX: Do we want to be lenient like existing syscalls; or do we want - * to be strict and return an error on out-of-bounds values? - */ - attr->sched_nice = clamp(attr->sched_nice, MIN_NICE, MAX_NICE); - - return 0; - -err_size: - put_user(sizeof(*attr), &uattr->size); - return -E2BIG; -} - -static void get_params(struct task_struct *p, struct sched_attr *attr) -{ - if (task_has_dl_policy(p)) - __getparam_dl(p, attr); - else if (task_has_rt_policy(p)) - attr->sched_priority = p->rt_priority; - else - attr->sched_nice = task_nice(p); -} - -/** - * sys_sched_setscheduler - set/change the scheduler policy and RT priority - * @pid: the pid in question. - * @policy: new policy. - * @param: structure containing the new RT priority. - * - * Return: 0 on success. An error code otherwise. - */ -SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param) -{ - if (policy < 0) - return -EINVAL; - - return do_sched_setscheduler(pid, policy, param); -} - -/** - * sys_sched_setparam - set/change the RT priority of a thread - * @pid: the pid in question. - * @param: structure containing the new RT priority. - * - * Return: 0 on success. An error code otherwise. - */ -SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param) -{ - return do_sched_setscheduler(pid, SETPARAM_POLICY, param); -} - -/** - * sys_sched_setattr - same as above, but with extended sched_attr - * @pid: the pid in question. - * @uattr: structure containing the extended parameters. - * @flags: for future extension. - */ -SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr, - unsigned int, flags) -{ - struct sched_attr attr; - int retval; - - if (!uattr || pid < 0 || flags) - return -EINVAL; - - retval = sched_copy_attr(uattr, &attr); - if (retval) - return retval; - - if ((int)attr.sched_policy < 0) - return -EINVAL; - if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY) - attr.sched_policy = SETPARAM_POLICY; - - CLASS(find_get_task, p)(pid); - if (!p) - return -ESRCH; - - if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) - get_params(p, &attr); - - return sched_setattr(p, &attr); -} - -/** - * sys_sched_getscheduler - get the policy (scheduling class) of a thread - * @pid: the pid in question. - * - * Return: On success, the policy of the thread. Otherwise, a negative error - * code. - */ -SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid) -{ - struct task_struct *p; - int retval; - - if (pid < 0) - return -EINVAL; - - guard(rcu)(); - p = find_process_by_pid(pid); - if (!p) - return -ESRCH; - - retval = security_task_getscheduler(p); - if (!retval) { - retval = p->policy; - if (p->sched_reset_on_fork) - retval |= SCHED_RESET_ON_FORK; - } - return retval; -} - -/** - * sys_sched_getparam - get the RT priority of a thread - * @pid: the pid in question. - * @param: structure containing the RT priority. - * - * Return: On success, 0 and the RT priority is in @param. Otherwise, an error - * code. - */ -SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param) -{ - struct sched_param lp = { .sched_priority = 0 }; - struct task_struct *p; - int retval; - - if (!param || pid < 0) - return -EINVAL; - - scoped_guard (rcu) { - p = find_process_by_pid(pid); - if (!p) - return -ESRCH; - - retval = security_task_getscheduler(p); - if (retval) - return retval; - - if (task_has_rt_policy(p)) - lp.sched_priority = p->rt_priority; - } - - /* - * This one might sleep, we cannot do it with a spinlock held ... - */ - return copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0; -} - -/* - * Copy the kernel size attribute structure (which might be larger - * than what user-space knows about) to user-space. - * - * Note that all cases are valid: user-space buffer can be larger or - * smaller than the kernel-space buffer. The usual case is that both - * have the same size. - */ -static int -sched_attr_copy_to_user(struct sched_attr __user *uattr, - struct sched_attr *kattr, - unsigned int usize) -{ - unsigned int ksize = sizeof(*kattr); - - if (!access_ok(uattr, usize)) - return -EFAULT; - - /* - * sched_getattr() ABI forwards and backwards compatibility: - * - * If usize == ksize then we just copy everything to user-space and all is good. - * - * If usize < ksize then we only copy as much as user-space has space for, - * this keeps ABI compatibility as well. We skip the rest. - * - * If usize > ksize then user-space is using a newer version of the ABI, - * which part the kernel doesn't know about. Just ignore it - tooling can - * detect the kernel's knowledge of attributes from the attr->size value - * which is set to ksize in this case. - */ - kattr->size = min(usize, ksize); - - if (copy_to_user(uattr, kattr, kattr->size)) - return -EFAULT; - - return 0; -} - -/** - * sys_sched_getattr - similar to sched_getparam, but with sched_attr - * @pid: the pid in question. - * @uattr: structure containing the extended parameters. - * @usize: sizeof(attr) for fwd/bwd comp. - * @flags: for future extension. - */ -SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, - unsigned int, usize, unsigned int, flags) -{ - struct sched_attr kattr = { }; - struct task_struct *p; - int retval; - - if (!uattr || pid < 0 || usize > PAGE_SIZE || - usize < SCHED_ATTR_SIZE_VER0 || flags) - return -EINVAL; - - scoped_guard (rcu) { - p = find_process_by_pid(pid); - if (!p) - return -ESRCH; - - retval = security_task_getscheduler(p); - if (retval) - return retval; - - kattr.sched_policy = p->policy; - if (p->sched_reset_on_fork) - kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK; - get_params(p, &kattr); - kattr.sched_flags &= SCHED_FLAG_ALL; - -#ifdef CONFIG_UCLAMP_TASK - /* - * This could race with another potential updater, but this is fine - * because it'll correctly read the old or the new value. We don't need - * to guarantee who wins the race as long as it doesn't return garbage. - */ - kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value; - kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value; -#endif - } - - return sched_attr_copy_to_user(uattr, &kattr, usize); -} - -#ifdef CONFIG_SMP -int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask) -{ - /* - * If the task isn't a deadline task or admission control is - * disabled then we don't care about affinity changes. - */ - if (!task_has_dl_policy(p) || !dl_bandwidth_enabled()) - return 0; - - /* - * Since bandwidth control happens on root_domain basis, - * if admission test is enabled, we only admit -deadline - * tasks allowed to run on all the CPUs in the task's - * root_domain. - */ - guard(rcu)(); - if (!cpumask_subset(task_rq(p)->rd->span, mask)) - return -EBUSY; - - return 0; -} -#endif - -static int -__sched_setaffinity(struct task_struct *p, struct affinity_context *ctx) -{ - int retval; - cpumask_var_t cpus_allowed, new_mask; - - if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) - return -ENOMEM; - - if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) { - retval = -ENOMEM; - goto out_free_cpus_allowed; - } - - cpuset_cpus_allowed(p, cpus_allowed); - cpumask_and(new_mask, ctx->new_mask, cpus_allowed); - - ctx->new_mask = new_mask; - ctx->flags |= SCA_CHECK; - - retval = dl_task_check_affinity(p, new_mask); - if (retval) - goto out_free_new_mask; - - retval = __set_cpus_allowed_ptr(p, ctx); - if (retval) - goto out_free_new_mask; - - cpuset_cpus_allowed(p, cpus_allowed); - if (!cpumask_subset(new_mask, cpus_allowed)) { - /* - * We must have raced with a concurrent cpuset update. - * Just reset the cpumask to the cpuset's cpus_allowed. - */ - cpumask_copy(new_mask, cpus_allowed); - - /* - * If SCA_USER is set, a 2nd call to __set_cpus_allowed_ptr() - * will restore the previous user_cpus_ptr value. - * - * In the unlikely event a previous user_cpus_ptr exists, - * we need to further restrict the mask to what is allowed - * by that old user_cpus_ptr. - */ - if (unlikely((ctx->flags & SCA_USER) && ctx->user_mask)) { - bool empty = !cpumask_and(new_mask, new_mask, - ctx->user_mask); - - if (empty) - cpumask_copy(new_mask, cpus_allowed); - } - __set_cpus_allowed_ptr(p, ctx); - retval = -EINVAL; - } - -out_free_new_mask: - free_cpumask_var(new_mask); -out_free_cpus_allowed: - free_cpumask_var(cpus_allowed); - return retval; -} - -long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) -{ - struct affinity_context ac; - struct cpumask *user_mask; - int retval; - - CLASS(find_get_task, p)(pid); - if (!p) - return -ESRCH; - - if (p->flags & PF_NO_SETAFFINITY) - return -EINVAL; - - if (!check_same_owner(p)) { - guard(rcu)(); - if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) - return -EPERM; - } - - retval = security_task_setscheduler(p); - if (retval) - return retval; - - /* - * With non-SMP configs, user_cpus_ptr/user_mask isn't used and - * alloc_user_cpus_ptr() returns NULL. - */ - user_mask = alloc_user_cpus_ptr(NUMA_NO_NODE); - if (user_mask) { - cpumask_copy(user_mask, in_mask); - } else if (IS_ENABLED(CONFIG_SMP)) { - return -ENOMEM; - } - - ac = (struct affinity_context){ - .new_mask = in_mask, - .user_mask = user_mask, - .flags = SCA_USER, - }; - - retval = __sched_setaffinity(p, &ac); - kfree(ac.user_mask); - - return retval; -} - -static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len, - struct cpumask *new_mask) -{ - if (len < cpumask_size()) - cpumask_clear(new_mask); - else if (len > cpumask_size()) - len = cpumask_size(); - - return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0; -} - -/** - * sys_sched_setaffinity - set the CPU affinity of a process - * @pid: pid of the process - * @len: length in bytes of the bitmask pointed to by user_mask_ptr - * @user_mask_ptr: user-space pointer to the new CPU mask - * - * Return: 0 on success. An error code otherwise. - */ -SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len, - unsigned long __user *, user_mask_ptr) -{ - cpumask_var_t new_mask; - int retval; - - if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) - return -ENOMEM; - - retval = get_user_cpu_mask(user_mask_ptr, len, new_mask); - if (retval == 0) - retval = sched_setaffinity(pid, new_mask); - free_cpumask_var(new_mask); - return retval; -} - -long sched_getaffinity(pid_t pid, struct cpumask *mask) -{ - struct task_struct *p; - int retval; - - guard(rcu)(); - p = find_process_by_pid(pid); - if (!p) - return -ESRCH; - - retval = security_task_getscheduler(p); - if (retval) - return retval; - - guard(raw_spinlock_irqsave)(&p->pi_lock); - cpumask_and(mask, &p->cpus_mask, cpu_active_mask); - - return 0; -} - -/** - * sys_sched_getaffinity - get the CPU affinity of a process - * @pid: pid of the process - * @len: length in bytes of the bitmask pointed to by user_mask_ptr - * @user_mask_ptr: user-space pointer to hold the current CPU mask - * - * Return: size of CPU mask copied to user_mask_ptr on success. An - * error code otherwise. - */ -SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, - unsigned long __user *, user_mask_ptr) -{ - int ret; - cpumask_var_t mask; - - if ((len * BITS_PER_BYTE) < nr_cpu_ids) - return -EINVAL; - if (len & (sizeof(unsigned long)-1)) - return -EINVAL; - - if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) - return -ENOMEM; - - ret = sched_getaffinity(pid, mask); - if (ret == 0) { - unsigned int retlen = min(len, cpumask_size()); - - if (copy_to_user(user_mask_ptr, cpumask_bits(mask), retlen)) - ret = -EFAULT; - else - ret = retlen; - } - free_cpumask_var(mask); - - return ret; -} - -static void do_sched_yield(void) -{ - struct rq_flags rf; - struct rq *rq; - - rq = this_rq_lock_irq(&rf); - - schedstat_inc(rq->yld_count); - current->sched_class->yield_task(rq); - - preempt_disable(); - rq_unlock_irq(rq, &rf); - sched_preempt_enable_no_resched(); - - schedule(); -} - -/** - * sys_sched_yield - yield the current processor to other threads. - * - * This function yields the current CPU to other tasks. If there are no - * other threads running on this CPU then this function will return. - * - * Return: 0. - */ -SYSCALL_DEFINE0(sched_yield) -{ - do_sched_yield(); - return 0; -} - -#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) -int __sched __cond_resched(void) -{ - if (should_resched(0) && !irqs_disabled()) { - preempt_schedule_common(); - return 1; - } - /* - * In preemptible kernels, ->rcu_read_lock_nesting tells the tick - * whether the current CPU is in an RCU read-side critical section, - * so the tick can report quiescent states even for CPUs looping - * in kernel context. In contrast, in non-preemptible kernels, - * RCU readers leave no in-memory hints, which means that CPU-bound - * processes executing in kernel context might never report an - * RCU quiescent state. Therefore, the following code causes - * cond_resched() to report a quiescent state, but only when RCU - * is in urgent need of one. - */ -#ifndef CONFIG_PREEMPT_RCU - rcu_all_qs(); -#endif - return 0; -} -EXPORT_SYMBOL(__cond_resched); -#endif - -#ifdef CONFIG_PREEMPT_DYNAMIC -#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL) -#define cond_resched_dynamic_enabled __cond_resched -#define cond_resched_dynamic_disabled ((void *)&__static_call_return0) -DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched); -EXPORT_STATIC_CALL_TRAMP(cond_resched); - -#define might_resched_dynamic_enabled __cond_resched -#define might_resched_dynamic_disabled ((void *)&__static_call_return0) -DEFINE_STATIC_CALL_RET0(might_resched, __cond_resched); -EXPORT_STATIC_CALL_TRAMP(might_resched); -#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY) -static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched); -int __sched dynamic_cond_resched(void) -{ - klp_sched_try_switch(); - if (!static_branch_unlikely(&sk_dynamic_cond_resched)) - return 0; - return __cond_resched(); -} -EXPORT_SYMBOL(dynamic_cond_resched); - -static DEFINE_STATIC_KEY_FALSE(sk_dynamic_might_resched); -int __sched dynamic_might_resched(void) -{ - if (!static_branch_unlikely(&sk_dynamic_might_resched)) - return 0; - return __cond_resched(); -} -EXPORT_SYMBOL(dynamic_might_resched); -#endif -#endif - -/* - * __cond_resched_lock() - if a reschedule is pending, drop the given lock, - * call schedule, and on return reacquire the lock. - * - * This works OK both with and without CONFIG_PREEMPTION. We do strange low-level - * operations here to prevent schedule() from being called twice (once via - * spin_unlock(), once by hand). - */ -int __cond_resched_lock(spinlock_t *lock) -{ - int resched = should_resched(PREEMPT_LOCK_OFFSET); - int ret = 0; - - lockdep_assert_held(lock); - - if (spin_needbreak(lock) || resched) { - spin_unlock(lock); - if (!_cond_resched()) - cpu_relax(); - ret = 1; - spin_lock(lock); - } - return ret; -} -EXPORT_SYMBOL(__cond_resched_lock); - -int __cond_resched_rwlock_read(rwlock_t *lock) -{ - int resched = should_resched(PREEMPT_LOCK_OFFSET); - int ret = 0; - - lockdep_assert_held_read(lock); - - if (rwlock_needbreak(lock) || resched) { - read_unlock(lock); - if (!_cond_resched()) - cpu_relax(); - ret = 1; - read_lock(lock); - } - return ret; -} -EXPORT_SYMBOL(__cond_resched_rwlock_read); - -int __cond_resched_rwlock_write(rwlock_t *lock) -{ - int resched = should_resched(PREEMPT_LOCK_OFFSET); - int ret = 0; - - lockdep_assert_held_write(lock); + lockdep_assert_held_write(lock); if (rwlock_needbreak(lock) || resched) { write_unlock(lock); @@ -8923,100 +7355,6 @@ static inline void preempt_dynamic_init(void) { } #endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */ -/** - * yield - yield the current processor to other threads. - * - * Do not ever use this function, there's a 99% chance you're doing it wrong. - * - * The scheduler is at all times free to pick the calling task as the most - * eligible task to run, if removing the yield() call from your code breaks - * it, it's already broken. - * - * Typical broken usage is: - * - * while (!event) - * yield(); - * - * where one assumes that yield() will let 'the other' process run that will - * make event true. If the current task is a SCHED_FIFO task that will never - * happen. Never use yield() as a progress guarantee!! - * - * If you want to use yield() to wait for something, use wait_event(). - * If you want to use yield() to be 'nice' for others, use cond_resched(). - * If you still want to use yield(), do not! - */ -void __sched yield(void) -{ - set_current_state(TASK_RUNNING); - do_sched_yield(); -} -EXPORT_SYMBOL(yield); - -/** - * yield_to - yield the current processor to another thread in - * your thread group, or accelerate that thread toward the - * processor it's on. - * @p: target task - * @preempt: whether task preemption is allowed or not - * - * It's the caller's job to ensure that the target task struct - * can't go away on us before we can do any checks. - * - * Return: - * true (>0) if we indeed boosted the target task. - * false (0) if we failed to boost the target. - * -ESRCH if there's no task to yield to. - */ -int __sched yield_to(struct task_struct *p, bool preempt) -{ - struct task_struct *curr = current; - struct rq *rq, *p_rq; - int yielded = 0; - - scoped_guard (raw_spinlock_irqsave, &p->pi_lock) { - rq = this_rq(); - -again: - p_rq = task_rq(p); - /* - * If we're the only runnable task on the rq and target rq also - * has only one task, there's absolutely no point in yielding. - */ - if (rq->nr_running == 1 && p_rq->nr_running == 1) - return -ESRCH; - - guard(double_rq_lock)(rq, p_rq); - if (task_rq(p) != p_rq) - goto again; - - if (!curr->sched_class->yield_to_task) - return 0; - - if (curr->sched_class != p->sched_class) - return 0; - - if (task_on_cpu(p_rq, p) || !task_is_running(p)) - return 0; - - yielded = curr->sched_class->yield_to_task(rq, p); - if (yielded) { - schedstat_inc(rq->yld_count); - /* - * Make p's CPU reschedule; pick_next_entity - * takes care of fairness. - */ - if (preempt && rq != p_rq) - resched_curr(p_rq); - } - } - - if (yielded) - schedule(); - - return yielded; -} -EXPORT_SYMBOL_GPL(yield_to); - int io_schedule_prepare(void) { int old_iowait = current->in_iowait; @@ -9058,123 +7396,6 @@ void __sched io_schedule(void) } EXPORT_SYMBOL(io_schedule); -/** - * sys_sched_get_priority_max - return maximum RT priority. - * @policy: scheduling class. - * - * Return: On success, this syscall returns the maximum - * rt_priority that can be used by a given scheduling class. - * On failure, a negative error code is returned. - */ -SYSCALL_DEFINE1(sched_get_priority_max, int, policy) -{ - int ret = -EINVAL; - - switch (policy) { - case SCHED_FIFO: - case SCHED_RR: - ret = MAX_RT_PRIO-1; - break; - case SCHED_DEADLINE: - case SCHED_NORMAL: - case SCHED_BATCH: - case SCHED_IDLE: - ret = 0; - break; - } - return ret; -} - -/** - * sys_sched_get_priority_min - return minimum RT priority. - * @policy: scheduling class. - * - * Return: On success, this syscall returns the minimum - * rt_priority that can be used by a given scheduling class. - * On failure, a negative error code is returned. - */ -SYSCALL_DEFINE1(sched_get_priority_min, int, policy) -{ - int ret = -EINVAL; - - switch (policy) { - case SCHED_FIFO: - case SCHED_RR: - ret = 1; - break; - case SCHED_DEADLINE: - case SCHED_NORMAL: - case SCHED_BATCH: - case SCHED_IDLE: - ret = 0; - } - return ret; -} - -static int sched_rr_get_interval(pid_t pid, struct timespec64 *t) -{ - unsigned int time_slice = 0; - int retval; - - if (pid < 0) - return -EINVAL; - - scoped_guard (rcu) { - struct task_struct *p = find_process_by_pid(pid); - if (!p) - return -ESRCH; - - retval = security_task_getscheduler(p); - if (retval) - return retval; - - scoped_guard (task_rq_lock, p) { - struct rq *rq = scope.rq; - if (p->sched_class->get_rr_interval) - time_slice = p->sched_class->get_rr_interval(rq, p); - } - } - - jiffies_to_timespec64(time_slice, t); - return 0; -} - -/** - * sys_sched_rr_get_interval - return the default timeslice of a process. - * @pid: pid of the process. - * @interval: userspace pointer to the timeslice value. - * - * this syscall writes the default timeslice value of a given process - * into the user-space timespec buffer. A value of '0' means infinity. - * - * Return: On success, 0 and the timeslice is in @interval. Otherwise, - * an error code. - */ -SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid, - struct __kernel_timespec __user *, interval) -{ - struct timespec64 t; - int retval = sched_rr_get_interval(pid, &t); - - if (retval == 0) - retval = put_timespec64(&t, interval); - - return retval; -} - -#ifdef CONFIG_COMPAT_32BIT_TIME -SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid, - struct old_timespec32 __user *, interval) -{ - struct timespec64 t; - int retval = sched_rr_get_interval(pid, &t); - - if (retval == 0) - retval = put_old_timespec32(&t, interval); - return retval; -} -#endif - void sched_show_task(struct task_struct *p) { unsigned long free = 0; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e8c2c098992b..d77442a46bf6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2590,8 +2590,19 @@ extern void update_group_capacity(struct sched_domain *sd, int cpu); extern void trigger_load_balance(struct rq *rq); +extern int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx); extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx); +static inline cpumask_t *alloc_user_cpus_ptr(int node) +{ + /* + * See do_set_cpus_allowed() above for the rcu_head usage. + */ + int size = max_t(int, cpumask_size(), sizeof(struct rcu_head)); + + return kmalloc_node(size, GFP_KERNEL, node); +} + static inline struct task_struct *get_push_task(struct rq *rq) { struct task_struct *p = rq->curr; @@ -2613,7 +2624,20 @@ static inline struct task_struct *get_push_task(struct rq *rq) extern int push_cpu_stop(void *arg); -#endif +#else /* !CONFIG_SMP: */ + +static inline int __set_cpus_allowed_ptr(struct task_struct *p, + struct affinity_context *ctx) +{ + return set_cpus_allowed_ptr(p, ctx->new_mask); +} + +static inline cpumask_t *alloc_user_cpus_ptr(int node) +{ + return NULL; +} + +#endif /* !CONFIG_SMP */ #ifdef CONFIG_CPU_IDLE static inline void idle_set_state(struct rq *rq, @@ -3344,6 +3368,36 @@ static inline bool uclamp_is_used(void) { return static_branch_likely(&sched_uclamp_used); } + +#define for_each_clamp_id(clamp_id) \ + for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++) + +extern unsigned int sysctl_sched_uclamp_util_min_rt_default; + + +static inline unsigned int uclamp_none(enum uclamp_id clamp_id) +{ + if (clamp_id == UCLAMP_MIN) + return 0; + return SCHED_CAPACITY_SCALE; +} + +/* Integer rounded range for each bucket */ +#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS) + +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value) +{ + return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1); +} + +static inline void uclamp_se_set(struct uclamp_se *uc_se, + unsigned int value, bool user_defined) +{ + uc_se->value = value; + uc_se->bucket_id = uclamp_bucket_id(value); + uc_se->user_defined = user_defined; +} + #else /* CONFIG_UCLAMP_TASK */ static inline unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id) @@ -3386,6 +3440,7 @@ static inline bool uclamp_rq_is_idle(struct rq *rq) { return false; } + #endif /* CONFIG_UCLAMP_TASK */ #ifdef CONFIG_HAVE_SCHED_AVG_IRQ @@ -3771,7 +3826,82 @@ extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se); #ifdef CONFIG_BPF_SCHED bool bpf_sched_is_cpu_allowed(struct task_struct *p, int cpu); -#endif +#endif /* CONFIG_BPF_SCHED */ +#ifdef CONFIG_RT_MUTEXES +static inline int __rt_effective_prio(struct task_struct *pi_task, int prio) +{ + if (pi_task) + prio = min(prio, pi_task->prio); + + return prio; +} + +static inline int rt_effective_prio(struct task_struct *p, int prio) +{ + struct task_struct *pi_task = rt_mutex_get_top_task(p); + + return __rt_effective_prio(pi_task, prio); +} +#else +static inline int rt_effective_prio(struct task_struct *p, int prio) +{ + return prio; +} +#endif /* CONFIG_RT_MUTEXES */ + +static inline int __normal_prio(int policy, int rt_prio, int nice) +{ + int prio; + + if (dl_policy(policy)) + prio = MAX_DL_PRIO - 1; + else if (rt_policy(policy)) + prio = MAX_RT_PRIO - 1 - rt_prio; + else + prio = NICE_TO_PRIO(nice); + + return prio; +} + +/* + * Calculate the expected normal priority: i.e. priority + * without taking RT-inheritance into account. Might be + * boosted by interactivity modifiers. Changes upon fork, + * setprio syscalls, and whenever the interactivity + * estimator recalculates. + */ +static inline int normal_prio(struct task_struct *p) +{ + return __normal_prio(p->policy, p->rt_priority, PRIO_TO_NICE(p->static_prio)); +} + +extern int __sched_setscheduler(struct task_struct *p, const struct sched_attr *attr, bool user, bool pi); +extern int __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx); +extern void __setscheduler_prio(struct task_struct *p, int prio); +extern void __setscheduler_params(struct task_struct *p, const struct sched_attr *attr); +extern void set_load_weight(struct task_struct *p, bool update_load); +extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags); +extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags); + +extern void check_class_changed(struct rq *rq, struct task_struct *p, + const struct sched_class *prev_class, + int oldprio); + +#ifdef CONFIG_SMP +extern struct balance_callback *splice_balance_callbacks(struct rq *rq); +extern void balance_callbacks(struct rq *rq, struct balance_callback *head); +#else + +static inline struct balance_callback *splice_balance_callbacks(struct rq *rq) +{ + return NULL; +} + +static inline void balance_callbacks(struct rq *rq, struct balance_callback *head) +{ +} + +#endif /* CONFIG_SMP */ #ifdef CONFIG_SCHED_SOFT_DOMAIN void build_soft_domain(void); diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c new file mode 100644 index 000000000000..5f4eaecb90cd --- /dev/null +++ b/kernel/sched/syscalls.c @@ -0,0 +1,1693 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * kernel/sched/syscalls.c + * + * Core kernel scheduler syscalls related code + * + * Copyright (C) 1991-2002 Linus Torvalds + * Copyright (C) 1998-2024 Ingo Molnar, Red Hat + */ +#include <linux/sched.h> +#include <linux/cpuset.h> +#include <linux/sched/debug.h> + +#include <uapi/linux/sched/types.h> + +#include "sched.h" +#include "autogroup.h" + +/* + * Calculate the current priority, i.e. the priority + * taken into account by the scheduler. This value might + * be boosted by RT tasks, or might be boosted by + * interactivity modifiers. Will be RT if the task got + * RT-boosted. If not then it returns p->normal_prio. + */ +static int effective_prio(struct task_struct *p) +{ + p->normal_prio = normal_prio(p); + /* + * If we are RT tasks or we were boosted to RT priority, + * keep the priority unchanged. Otherwise, update priority + * to the normal priority: + */ + if (!rt_prio(p->prio)) + return p->normal_prio; + return p->prio; +} + +void set_user_nice(struct task_struct *p, long nice) +{ + bool queued, running; + struct rq *rq; + int old_prio; + + if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE) + return; + /* + * We have to be careful, if called from sys_setpriority(), + * the task might be in the middle of scheduling on another CPU. + */ + CLASS(task_rq_lock, rq_guard)(p); + rq = rq_guard.rq; + + update_rq_clock(rq); + + /* + * The RT priorities are set via sched_setscheduler(), but we still + * allow the 'normal' nice value to be set - but as expected + * it won't have any effect on scheduling until the task is + * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR: + */ + if (task_has_dl_policy(p) || task_has_rt_policy(p)) { + p->static_prio = NICE_TO_PRIO(nice); + return; + } + + queued = task_on_rq_queued(p); + running = task_current(rq, p); + if (queued) + dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK); + if (running) + put_prev_task(rq, p); + + p->static_prio = NICE_TO_PRIO(nice); + set_load_weight(p, true); + old_prio = p->prio; + p->prio = effective_prio(p); + + if (queued) + enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK); + if (running) + set_next_task(rq, p); + + /* + * If the task increased its priority or is running and + * lowered its priority, then reschedule its CPU: + */ + p->sched_class->prio_changed(rq, p, old_prio); +} +EXPORT_SYMBOL(set_user_nice); + +/* + * is_nice_reduction - check if nice value is an actual reduction + * + * Similar to can_nice() but does not perform a capability check. + * + * @p: task + * @nice: nice value + */ +static bool is_nice_reduction(const struct task_struct *p, const int nice) +{ + /* Convert nice value [19,-20] to rlimit style value [1,40]: */ + int nice_rlim = nice_to_rlimit(nice); + + return (nice_rlim <= task_rlimit(p, RLIMIT_NICE)); +} + +/* + * can_nice - check if a task can reduce its nice value + * @p: task + * @nice: nice value + */ +int can_nice(const struct task_struct *p, const int nice) +{ + return is_nice_reduction(p, nice) || capable(CAP_SYS_NICE); +} + +#ifdef __ARCH_WANT_SYS_NICE + +/* + * sys_nice - change the priority of the current process. + * @increment: priority increment + * + * sys_setpriority is a more generic, but much slower function that + * does similar things. + */ +SYSCALL_DEFINE1(nice, int, increment) +{ + long nice, retval; + + /* + * Setpriority might change our priority at the same moment. + * We don't have to worry. Conceptually one call occurs first + * and we have a single winner. + */ + increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH); + nice = task_nice(current) + increment; + + nice = clamp_val(nice, MIN_NICE, MAX_NICE); + if (increment < 0 && !can_nice(current, nice)) + return -EPERM; + + retval = security_task_setnice(current, nice); + if (retval) + return retval; + + set_user_nice(current, nice); + return 0; +} + +#endif + +/** + * task_prio - return the priority value of a given task. + * @p: the task in question. + * + * Return: The priority value as seen by users in /proc. + * + * sched policy return value kernel prio user prio/nice + * + * normal, batch, idle [0 ... 39] [100 ... 139] 0/[-20 ... 19] + * fifo, rr [-2 ... -100] [98 ... 0] [1 ... 99] + * deadline -101 -1 0 + */ +int task_prio(const struct task_struct *p) +{ + return p->prio - MAX_RT_PRIO; +} + +/** + * idle_cpu - is a given CPU idle currently? + * @cpu: the processor in question. + * + * Return: 1 if the CPU is currently idle. 0 otherwise. + */ +int idle_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + if (rq->curr != rq->idle) + return 0; + + if (rq->nr_running) + return 0; + +#ifdef CONFIG_SMP + if (rq->ttwu_pending) + return 0; +#endif + + return 1; +} + +/** + * available_idle_cpu - is a given CPU idle for enqueuing work. + * @cpu: the CPU in question. + * + * Return: 1 if the CPU is currently idle. 0 otherwise. + */ +int available_idle_cpu(int cpu) +{ + if (!idle_cpu(cpu)) + return 0; + + if (vcpu_is_preempted(cpu)) + return 0; + + return 1; +} + +/** + * idle_task - return the idle task for a given CPU. + * @cpu: the processor in question. + * + * Return: The idle task for the CPU @cpu. + */ +struct task_struct *idle_task(int cpu) +{ + return cpu_rq(cpu)->idle; +} + +#ifdef CONFIG_SCHED_CORE +int sched_core_idle_cpu(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + if (sched_core_enabled(rq) && rq->curr == rq->idle) + return 1; + + return idle_cpu(cpu); +} + +#endif + +#ifdef CONFIG_SMP +/* + * This function computes an effective utilization for the given CPU, to be + * used for frequency selection given the linear relation: f = u * f_max. + * + * The scheduler tracks the following metrics: + * + * cpu_util_{cfs,rt,dl,irq}() + * cpu_bw_dl() + * + * Where the cfs,rt and dl util numbers are tracked with the same metric and + * synchronized windows and are thus directly comparable. + * + * The cfs,rt,dl utilization are the running times measured with rq->clock_task + * which excludes things like IRQ and steal-time. These latter are then accrued + * in the irq utilization. + * + * The DL bandwidth number otoh is not a measured metric but a value computed + * based on the task model parameters and gives the minimal utilization + * required to meet deadlines. + */ +unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, + unsigned long *min, + unsigned long *max) +{ + unsigned long util, irq, scale; + struct rq *rq = cpu_rq(cpu); + + scale = arch_scale_cpu_capacity(cpu); + + /* + * Early check to see if IRQ/steal time saturates the CPU, can be + * because of inaccuracies in how we track these -- see + * update_irq_load_avg(). + */ + irq = cpu_util_irq(rq); + if (unlikely(irq >= scale)) { + if (min) + *min = scale; + if (max) + *max = scale; + return scale; + } + + if (min) { + /* + * The minimum utilization returns the highest level between: + * - the computed DL bandwidth needed with the IRQ pressure which + * steals time to the deadline task. + * - The minimum performance requirement for CFS and/or RT. + */ + *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN)); + + /* + * When an RT task is runnable and uclamp is not used, we must + * ensure that the task will run at maximum compute capacity. + */ + if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt)) + *min = max(*min, scale); + } + + /* + * Because the time spend on RT/DL tasks is visible as 'lost' time to + * CFS tasks and we use the same metric to track the effective + * utilization (PELT windows are synchronized) we can directly add them + * to obtain the CPU's actual utilization. + */ + util = util_cfs + cpu_util_rt(rq); + util += cpu_util_dl(rq); + + /* + * The maximum hint is a soft bandwidth requirement, which can be lower + * than the actual utilization because of uclamp_max requirements. + */ + if (max) + *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX)); + + if (util >= scale) + return scale; + + /* + * There is still idle time; further improve the number by using the + * irq metric. Because IRQ/steal time is hidden from the task clock we + * need to scale the task numbers: + * + * max - irq + * U' = irq + --------- * U + * max + */ + util = scale_irq_capacity(util, irq, scale); + util += irq; + + return min(scale, util); +} + +unsigned long sched_cpu_util(int cpu) +{ + return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL); +} +#endif /* CONFIG_SMP */ + +/** + * find_process_by_pid - find a process with a matching PID value. + * @pid: the pid in question. + * + * The task of @pid, if found. %NULL otherwise. + */ +static struct task_struct *find_process_by_pid(pid_t pid) +{ + return pid ? find_task_by_vpid(pid) : current; +} + +static struct task_struct *find_get_task(pid_t pid) +{ + struct task_struct *p; + guard(rcu)(); + + p = find_process_by_pid(pid); + if (likely(p)) + get_task_struct(p); + + return p; +} + +DEFINE_CLASS(find_get_task, struct task_struct *, if (_T) put_task_struct(_T), + find_get_task(pid), pid_t pid) + +/* + * sched_setparam() passes in -1 for its policy, to let the functions + * it calls know not to change it. + */ +#define SETPARAM_POLICY -1 + +void __setscheduler_params(struct task_struct *p, + const struct sched_attr *attr) +{ + int policy = attr->sched_policy; + + if (policy == SETPARAM_POLICY) + policy = p->policy; + + p->policy = policy; + + if (dl_policy(policy)) + __setparam_dl(p, attr); + else if (fair_policy(policy)) + p->static_prio = NICE_TO_PRIO(attr->sched_nice); + + /* rt-policy tasks do not have a timerslack */ + if (task_is_realtime(p)) { + p->timer_slack_ns = 0; + } else if (p->timer_slack_ns == 0) { + /* when switching back to non-rt policy, restore timerslack */ + p->timer_slack_ns = p->default_timer_slack_ns; + } + + /* + * __sched_setscheduler() ensures attr->sched_priority == 0 when + * !rt_policy. Always setting this ensures that things like + * getparam()/getattr() don't report silly values for !rt tasks. + */ + p->rt_priority = attr->sched_priority; + p->normal_prio = normal_prio(p); + set_load_weight(p, true); +} + +/* + * Check the target process has a UID that matches the current process's: + */ +static bool check_same_owner(struct task_struct *p) +{ + const struct cred *cred = current_cred(), *pcred; + guard(rcu)(); + + pcred = __task_cred(p); + return (uid_eq(cred->euid, pcred->euid) || + uid_eq(cred->euid, pcred->uid)); +} + +#ifdef CONFIG_UCLAMP_TASK + +static int uclamp_validate(struct task_struct *p, + const struct sched_attr *attr) +{ + int util_min = p->uclamp_req[UCLAMP_MIN].value; + int util_max = p->uclamp_req[UCLAMP_MAX].value; + + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) { + util_min = attr->sched_util_min; + + if (util_min + 1 > SCHED_CAPACITY_SCALE + 1) + return -EINVAL; + } + + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) { + util_max = attr->sched_util_max; + + if (util_max + 1 > SCHED_CAPACITY_SCALE + 1) + return -EINVAL; + } + + if (util_min != -1 && util_max != -1 && util_min > util_max) + return -EINVAL; + + /* + * We have valid uclamp attributes; make sure uclamp is enabled. + * + * We need to do that here, because enabling static branches is a + * blocking operation which obviously cannot be done while holding + * scheduler locks. + */ + static_branch_enable(&sched_uclamp_used); + + return 0; +} + +static bool uclamp_reset(const struct sched_attr *attr, + enum uclamp_id clamp_id, + struct uclamp_se *uc_se) +{ + /* Reset on sched class change for a non user-defined clamp value. */ + if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)) && + !uc_se->user_defined) + return true; + + /* Reset on sched_util_{min,max} == -1. */ + if (clamp_id == UCLAMP_MIN && + attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN && + attr->sched_util_min == -1) { + return true; + } + + if (clamp_id == UCLAMP_MAX && + attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX && + attr->sched_util_max == -1) { + return true; + } + + return false; +} + +static void __setscheduler_uclamp(struct task_struct *p, + const struct sched_attr *attr) +{ + enum uclamp_id clamp_id; + + for_each_clamp_id(clamp_id) { + struct uclamp_se *uc_se = &p->uclamp_req[clamp_id]; + unsigned int value; + + if (!uclamp_reset(attr, clamp_id, uc_se)) + continue; + + /* + * RT by default have a 100% boost value that could be modified + * at runtime. + */ + if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) + value = sysctl_sched_uclamp_util_min_rt_default; + else + value = uclamp_none(clamp_id); + + uclamp_se_set(uc_se, value, false); + + } + + if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) + return; + + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN && + attr->sched_util_min != -1) { + uclamp_se_set(&p->uclamp_req[UCLAMP_MIN], + attr->sched_util_min, true); + } + + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX && + attr->sched_util_max != -1) { + uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], + attr->sched_util_max, true); + } +} + +#else /* !CONFIG_UCLAMP_TASK: */ + +static inline int uclamp_validate(struct task_struct *p, + const struct sched_attr *attr) +{ + return -EOPNOTSUPP; +} +static void __setscheduler_uclamp(struct task_struct *p, + const struct sched_attr *attr) { } +#endif + +/* + * Allow unprivileged RT tasks to decrease priority. + * Only issue a capable test if needed and only once to avoid an audit + * event on permitted non-privileged operations: + */ +static int user_check_sched_setscheduler(struct task_struct *p, + const struct sched_attr *attr, + int policy, int reset_on_fork) +{ + if (fair_policy(policy)) { + if (attr->sched_nice < task_nice(p) && + !is_nice_reduction(p, attr->sched_nice)) + goto req_priv; + } + + if (rt_policy(policy)) { + unsigned long rlim_rtprio = task_rlimit(p, RLIMIT_RTPRIO); + + /* Can't set/change the rt policy: */ + if (policy != p->policy && !rlim_rtprio) + goto req_priv; + + /* Can't increase priority: */ + if (attr->sched_priority > p->rt_priority && + attr->sched_priority > rlim_rtprio) + goto req_priv; + } + + /* + * Can't set/change SCHED_DEADLINE policy at all for now + * (safest behavior); in the future we would like to allow + * unprivileged DL tasks to increase their relative deadline + * or reduce their runtime (both ways reducing utilization) + */ + if (dl_policy(policy)) + goto req_priv; + + /* + * Treat SCHED_IDLE as nice 20. Only allow a switch to + * SCHED_NORMAL if the RLIMIT_NICE would normally permit it. + */ + if (task_has_idle_policy(p) && !idle_policy(policy)) { + if (!is_nice_reduction(p, task_nice(p))) + goto req_priv; + } + + /* Can't change other user's priorities: */ + if (!check_same_owner(p)) + goto req_priv; + + /* Normal users shall not reset the sched_reset_on_fork flag: */ + if (p->sched_reset_on_fork && !reset_on_fork) + goto req_priv; + + return 0; + +req_priv: + if (!capable(CAP_SYS_NICE)) + return -EPERM; + + return 0; +} + +int __sched_setscheduler(struct task_struct *p, + const struct sched_attr *attr, + bool user, bool pi) +{ + int oldpolicy = -1, policy = attr->sched_policy; + int retval, oldprio, newprio, queued, running; + const struct sched_class *prev_class; + struct balance_callback *head; + struct rq_flags rf; + int reset_on_fork; + int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; + struct rq *rq; + bool cpuset_locked = false; + + /* The pi code expects interrupts enabled */ + BUG_ON(pi && in_interrupt()); +recheck: + /* Double check policy once rq lock held: */ + if (policy < 0) { + reset_on_fork = p->sched_reset_on_fork; + policy = oldpolicy = p->policy; + } else { + reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK); + + if (!valid_policy(policy)) + return -EINVAL; + } + + if (attr->sched_flags & ~(SCHED_FLAG_ALL | SCHED_FLAG_SUGOV)) + return -EINVAL; + + /* + * Valid priorities for SCHED_FIFO and SCHED_RR are + * 1..MAX_RT_PRIO-1, valid priority for SCHED_NORMAL, + * SCHED_BATCH and SCHED_IDLE is 0. + */ + if (attr->sched_priority > MAX_RT_PRIO-1) + return -EINVAL; + if ((dl_policy(policy) && !__checkparam_dl(attr)) || + (rt_policy(policy) != (attr->sched_priority != 0))) + return -EINVAL; + + if (user) { + retval = user_check_sched_setscheduler(p, attr, policy, reset_on_fork); + if (retval) + return retval; + + if (attr->sched_flags & SCHED_FLAG_SUGOV) + return -EINVAL; + + retval = security_task_setscheduler(p); + if (retval) + return retval; + } + + /* Update task specific "requested" clamps */ + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) { + retval = uclamp_validate(p, attr); + if (retval) + return retval; + } + + /* + * SCHED_DEADLINE bandwidth accounting relies on stable cpusets + * information. + */ + if (dl_policy(policy) || dl_policy(p->policy)) { + cpuset_locked = true; + cpuset_lock(); + } + + /* + * Make sure no PI-waiters arrive (or leave) while we are + * changing the priority of the task: + * + * To be able to change p->policy safely, the appropriate + * runqueue lock must be held. + */ + rq = task_rq_lock(p, &rf); + update_rq_clock(rq); + + /* + * Changing the policy of the stop threads its a very bad idea: + */ + if (p == rq->stop) { + retval = -EINVAL; + goto unlock; + } + + /* + * If not changing anything there's no need to proceed further, + * but store a possible modification of reset_on_fork. + */ + if (unlikely(policy == p->policy)) { + if (fair_policy(policy) && attr->sched_nice != task_nice(p)) + goto change; + if (rt_policy(policy) && attr->sched_priority != p->rt_priority) + goto change; + if (dl_policy(policy) && dl_param_changed(p, attr)) + goto change; + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) + goto change; + + p->sched_reset_on_fork = reset_on_fork; + retval = 0; + goto unlock; + } +change: + +#ifdef CONFIG_QOS_SCHED + /* + * If the scheduling policy of an offline task is set to a policy + * other than SCHED_IDLE, the online task preemption and cpu resource + * isolation will be invalid, so return -EINVAL in this case. + */ + if (unlikely(is_offline_level(task_group(p)->qos_level) && !idle_policy(policy))) { + retval = -EINVAL; + goto unlock; + } +#endif + + if (user) { +#ifdef CONFIG_RT_GROUP_SCHED + /* + * Do not allow realtime tasks into groups that have no runtime + * assigned. + */ + if (rt_bandwidth_enabled() && rt_policy(policy) && + task_group(p)->rt_bandwidth.rt_runtime == 0 && + !task_group_is_autogroup(task_group(p))) { + retval = -EPERM; + goto unlock; + } +#endif +#ifdef CONFIG_SMP + if (dl_bandwidth_enabled() && dl_policy(policy) && + !(attr->sched_flags & SCHED_FLAG_SUGOV)) { + cpumask_t *span = rq->rd->span; + + /* + * Don't allow tasks with an affinity mask smaller than + * the entire root_domain to become SCHED_DEADLINE. We + * will also fail if there's no bandwidth available. + */ + if (!cpumask_subset(span, p->cpus_ptr) || + rq->rd->dl_bw.bw == 0) { + retval = -EPERM; + goto unlock; + } + } +#endif + } + + /* Re-check policy now with rq lock held: */ + if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) { + policy = oldpolicy = -1; + task_rq_unlock(rq, p, &rf); + if (cpuset_locked) + cpuset_unlock(); + goto recheck; + } + + /* + * If setscheduling to SCHED_DEADLINE (or changing the parameters + * of a SCHED_DEADLINE task) we need to check if enough bandwidth + * is available. + */ + if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) { + retval = -EBUSY; + goto unlock; + } + + p->sched_reset_on_fork = reset_on_fork; + oldprio = p->prio; + + newprio = __normal_prio(policy, attr->sched_priority, attr->sched_nice); + if (pi) { + /* + * Take priority boosted tasks into account. If the new + * effective priority is unchanged, we just store the new + * normal parameters and do not touch the scheduler class and + * the runqueue. This will be done when the task deboost + * itself. + */ + newprio = rt_effective_prio(p, newprio); + if (newprio == oldprio) + queue_flags &= ~DEQUEUE_MOVE; + } + + queued = task_on_rq_queued(p); + running = task_current(rq, p); + if (queued) + dequeue_task(rq, p, queue_flags); + if (running) + put_prev_task(rq, p); + + prev_class = p->sched_class; + + if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) { + __setscheduler_params(p, attr); + __setscheduler_prio(p, newprio); + } + __setscheduler_uclamp(p, attr); + + if (queued) { + /* + * We enqueue to tail when the priority of a task is + * increased (user space view). + */ + if (oldprio < p->prio) + queue_flags |= ENQUEUE_HEAD; + + enqueue_task(rq, p, queue_flags); + } + if (running) + set_next_task(rq, p); + + check_class_changed(rq, p, prev_class, oldprio); + + /* Avoid rq from going away on us: */ + preempt_disable(); + head = splice_balance_callbacks(rq); + task_rq_unlock(rq, p, &rf); + + if (pi) { + if (cpuset_locked) + cpuset_unlock(); + rt_mutex_adjust_pi(p); + } + + /* Run balance callbacks after we've adjusted the PI chain: */ + balance_callbacks(rq, head); + preempt_enable(); + + return 0; + +unlock: + task_rq_unlock(rq, p, &rf); + if (cpuset_locked) + cpuset_unlock(); + return retval; +} + +static int _sched_setscheduler(struct task_struct *p, int policy, + const struct sched_param *param, bool check) +{ + struct sched_attr attr = { + .sched_policy = policy, + .sched_priority = param->sched_priority, + .sched_nice = PRIO_TO_NICE(p->static_prio), + }; + + /* Fixup the legacy SCHED_RESET_ON_FORK hack. */ + if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) { + attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK; + policy &= ~SCHED_RESET_ON_FORK; + attr.sched_policy = policy; + } + + return __sched_setscheduler(p, &attr, check, true); +} +/** + * sched_setscheduler - change the scheduling policy and/or RT priority of a thread. + * @p: the task in question. + * @policy: new policy. + * @param: structure containing the new RT priority. + * + * Use sched_set_fifo(), read its comment. + * + * Return: 0 on success. An error code otherwise. + * + * NOTE that the task may be already dead. + */ +int sched_setscheduler(struct task_struct *p, int policy, + const struct sched_param *param) +{ + return _sched_setscheduler(p, policy, param, true); +} + +int sched_setattr(struct task_struct *p, const struct sched_attr *attr) +{ + return __sched_setscheduler(p, attr, true, true); +} + +int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr) +{ + return __sched_setscheduler(p, attr, false, true); +} +EXPORT_SYMBOL_GPL(sched_setattr_nocheck); + +/** + * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace. + * @p: the task in question. + * @policy: new policy. + * @param: structure containing the new RT priority. + * + * Just like sched_setscheduler, only don't bother checking if the + * current context has permission. For example, this is needed in + * stop_machine(): we create temporary high priority worker threads, + * but our caller might not have that capability. + * + * Return: 0 on success. An error code otherwise. + */ +int sched_setscheduler_nocheck(struct task_struct *p, int policy, + const struct sched_param *param) +{ + return _sched_setscheduler(p, policy, param, false); +} + +/* + * SCHED_FIFO is a broken scheduler model; that is, it is fundamentally + * incapable of resource management, which is the one thing an OS really should + * be doing. + * + * This is of course the reason it is limited to privileged users only. + * + * Worse still; it is fundamentally impossible to compose static priority + * workloads. You cannot take two correctly working static prio workloads + * and smash them together and still expect them to work. + * + * For this reason 'all' FIFO tasks the kernel creates are basically at: + * + * MAX_RT_PRIO / 2 + * + * The administrator _MUST_ configure the system, the kernel simply doesn't + * know enough information to make a sensible choice. + */ +void sched_set_fifo(struct task_struct *p) +{ + struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 }; + WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0); +} +EXPORT_SYMBOL_GPL(sched_set_fifo); + +/* + * For when you don't much care about FIFO, but want to be above SCHED_NORMAL. + */ +void sched_set_fifo_low(struct task_struct *p) +{ + struct sched_param sp = { .sched_priority = 1 }; + WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0); +} +EXPORT_SYMBOL_GPL(sched_set_fifo_low); + +void sched_set_normal(struct task_struct *p, int nice) +{ + struct sched_attr attr = { + .sched_policy = SCHED_NORMAL, + .sched_nice = nice, + }; + WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0); +} +EXPORT_SYMBOL_GPL(sched_set_normal); + +static int +do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param) +{ + struct sched_param lparam; + + if (!param || pid < 0) + return -EINVAL; + if (copy_from_user(&lparam, param, sizeof(struct sched_param))) + return -EFAULT; + + CLASS(find_get_task, p)(pid); + if (!p) + return -ESRCH; + + return sched_setscheduler(p, policy, &lparam); +} + +/* + * Mimics kernel/events/core.c perf_copy_attr(). + */ +static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *attr) +{ + u32 size; + int ret; + + /* Zero the full structure, so that a short copy will be nice: */ + memset(attr, 0, sizeof(*attr)); + + ret = get_user(size, &uattr->size); + if (ret) + return ret; + + /* ABI compatibility quirk: */ + if (!size) + size = SCHED_ATTR_SIZE_VER0; + if (size < SCHED_ATTR_SIZE_VER0 || size > PAGE_SIZE) + goto err_size; + + ret = copy_struct_from_user(attr, sizeof(*attr), uattr, size); + if (ret) { + if (ret == -E2BIG) + goto err_size; + return ret; + } + + if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) && + size < SCHED_ATTR_SIZE_VER1) + return -EINVAL; + + /* + * XXX: Do we want to be lenient like existing syscalls; or do we want + * to be strict and return an error on out-of-bounds values? + */ + attr->sched_nice = clamp(attr->sched_nice, MIN_NICE, MAX_NICE); + + return 0; + +err_size: + put_user(sizeof(*attr), &uattr->size); + return -E2BIG; +} + +static void get_params(struct task_struct *p, struct sched_attr *attr) +{ + if (task_has_dl_policy(p)) + __getparam_dl(p, attr); + else if (task_has_rt_policy(p)) + attr->sched_priority = p->rt_priority; + else + attr->sched_nice = task_nice(p); +} + +/** + * sys_sched_setscheduler - set/change the scheduler policy and RT priority + * @pid: the pid in question. + * @policy: new policy. + * @param: structure containing the new RT priority. + * + * Return: 0 on success. An error code otherwise. + */ +SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param) +{ + if (policy < 0) + return -EINVAL; + + return do_sched_setscheduler(pid, policy, param); +} + +/** + * sys_sched_setparam - set/change the RT priority of a thread + * @pid: the pid in question. + * @param: structure containing the new RT priority. + * + * Return: 0 on success. An error code otherwise. + */ +SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param) +{ + return do_sched_setscheduler(pid, SETPARAM_POLICY, param); +} + +/** + * sys_sched_setattr - same as above, but with extended sched_attr + * @pid: the pid in question. + * @uattr: structure containing the extended parameters. + * @flags: for future extension. + */ +SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr, + unsigned int, flags) +{ + struct sched_attr attr; + int retval; + + if (!uattr || pid < 0 || flags) + return -EINVAL; + + retval = sched_copy_attr(uattr, &attr); + if (retval) + return retval; + + if ((int)attr.sched_policy < 0) + return -EINVAL; + if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY) + attr.sched_policy = SETPARAM_POLICY; + + CLASS(find_get_task, p)(pid); + if (!p) + return -ESRCH; + + if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) + get_params(p, &attr); + + return sched_setattr(p, &attr); +} + +/** + * sys_sched_getscheduler - get the policy (scheduling class) of a thread + * @pid: the pid in question. + * + * Return: On success, the policy of the thread. Otherwise, a negative error + * code. + */ +SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid) +{ + struct task_struct *p; + int retval; + + if (pid < 0) + return -EINVAL; + + guard(rcu)(); + p = find_process_by_pid(pid); + if (!p) + return -ESRCH; + + retval = security_task_getscheduler(p); + if (!retval) { + retval = p->policy; + if (p->sched_reset_on_fork) + retval |= SCHED_RESET_ON_FORK; + } + return retval; +} + +/** + * sys_sched_getparam - get the RT priority of a thread + * @pid: the pid in question. + * @param: structure containing the RT priority. + * + * Return: On success, 0 and the RT priority is in @param. Otherwise, an error + * code. + */ +SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param) +{ + struct sched_param lp = { .sched_priority = 0 }; + struct task_struct *p; + int retval; + + if (!param || pid < 0) + return -EINVAL; + + scoped_guard (rcu) { + p = find_process_by_pid(pid); + if (!p) + return -ESRCH; + + retval = security_task_getscheduler(p); + if (retval) + return retval; + + if (task_has_rt_policy(p)) + lp.sched_priority = p->rt_priority; + } + + /* + * This one might sleep, we cannot do it with a spinlock held ... + */ + return copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0; +} + +/* + * Copy the kernel size attribute structure (which might be larger + * than what user-space knows about) to user-space. + * + * Note that all cases are valid: user-space buffer can be larger or + * smaller than the kernel-space buffer. The usual case is that both + * have the same size. + */ +static int +sched_attr_copy_to_user(struct sched_attr __user *uattr, + struct sched_attr *kattr, + unsigned int usize) +{ + unsigned int ksize = sizeof(*kattr); + + if (!access_ok(uattr, usize)) + return -EFAULT; + + /* + * sched_getattr() ABI forwards and backwards compatibility: + * + * If usize == ksize then we just copy everything to user-space and all is good. + * + * If usize < ksize then we only copy as much as user-space has space for, + * this keeps ABI compatibility as well. We skip the rest. + * + * If usize > ksize then user-space is using a newer version of the ABI, + * which part the kernel doesn't know about. Just ignore it - tooling can + * detect the kernel's knowledge of attributes from the attr->size value + * which is set to ksize in this case. + */ + kattr->size = min(usize, ksize); + + if (copy_to_user(uattr, kattr, kattr->size)) + return -EFAULT; + + return 0; +} + +/** + * sys_sched_getattr - similar to sched_getparam, but with sched_attr + * @pid: the pid in question. + * @uattr: structure containing the extended parameters. + * @usize: sizeof(attr) for fwd/bwd comp. + * @flags: for future extension. + */ +SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, + unsigned int, usize, unsigned int, flags) +{ + struct sched_attr kattr = { }; + struct task_struct *p; + int retval; + + if (!uattr || pid < 0 || usize > PAGE_SIZE || + usize < SCHED_ATTR_SIZE_VER0 || flags) + return -EINVAL; + + scoped_guard (rcu) { + p = find_process_by_pid(pid); + if (!p) + return -ESRCH; + + retval = security_task_getscheduler(p); + if (retval) + return retval; + + kattr.sched_policy = p->policy; + if (p->sched_reset_on_fork) + kattr.sched_flags |= SCHED_FLAG_RESET_ON_FORK; + get_params(p, &kattr); + kattr.sched_flags &= SCHED_FLAG_ALL; + +#ifdef CONFIG_UCLAMP_TASK + /* + * This could race with another potential updater, but this is fine + * because it'll correctly read the old or the new value. We don't need + * to guarantee who wins the race as long as it doesn't return garbage. + */ + kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value; + kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value; +#endif + } + + return sched_attr_copy_to_user(uattr, &kattr, usize); +} + +#ifdef CONFIG_SMP +int dl_task_check_affinity(struct task_struct *p, const struct cpumask *mask) +{ + /* + * If the task isn't a deadline task or admission control is + * disabled then we don't care about affinity changes. + */ + if (!task_has_dl_policy(p) || !dl_bandwidth_enabled()) + return 0; + + /* + * Since bandwidth control happens on root_domain basis, + * if admission test is enabled, we only admit -deadline + * tasks allowed to run on all the CPUs in the task's + * root_domain. + */ + guard(rcu)(); + if (!cpumask_subset(task_rq(p)->rd->span, mask)) + return -EBUSY; + + return 0; +} +#endif /* CONFIG_SMP */ + +int __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx) +{ + int retval; + cpumask_var_t cpus_allowed, new_mask; + + if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) + return -ENOMEM; + + if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) { + retval = -ENOMEM; + goto out_free_cpus_allowed; + } + + cpuset_cpus_allowed(p, cpus_allowed); + cpumask_and(new_mask, ctx->new_mask, cpus_allowed); + + ctx->new_mask = new_mask; + ctx->flags |= SCA_CHECK; + + retval = dl_task_check_affinity(p, new_mask); + if (retval) + goto out_free_new_mask; + + retval = __set_cpus_allowed_ptr(p, ctx); + if (retval) + goto out_free_new_mask; + + cpuset_cpus_allowed(p, cpus_allowed); + if (!cpumask_subset(new_mask, cpus_allowed)) { + /* + * We must have raced with a concurrent cpuset update. + * Just reset the cpumask to the cpuset's cpus_allowed. + */ + cpumask_copy(new_mask, cpus_allowed); + + /* + * If SCA_USER is set, a 2nd call to __set_cpus_allowed_ptr() + * will restore the previous user_cpus_ptr value. + * + * In the unlikely event a previous user_cpus_ptr exists, + * we need to further restrict the mask to what is allowed + * by that old user_cpus_ptr. + */ + if (unlikely((ctx->flags & SCA_USER) && ctx->user_mask)) { + bool empty = !cpumask_and(new_mask, new_mask, + ctx->user_mask); + + if (empty) + cpumask_copy(new_mask, cpus_allowed); + } + __set_cpus_allowed_ptr(p, ctx); + retval = -EINVAL; + } + +out_free_new_mask: + free_cpumask_var(new_mask); +out_free_cpus_allowed: + free_cpumask_var(cpus_allowed); + return retval; +} + +long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) +{ + struct affinity_context ac; + struct cpumask *user_mask; + int retval; + + CLASS(find_get_task, p)(pid); + if (!p) + return -ESRCH; + + if (p->flags & PF_NO_SETAFFINITY) + return -EINVAL; + + if (!check_same_owner(p)) { + guard(rcu)(); + if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) + return -EPERM; + } + + retval = security_task_setscheduler(p); + if (retval) + return retval; + + /* + * With non-SMP configs, user_cpus_ptr/user_mask isn't used and + * alloc_user_cpus_ptr() returns NULL. + */ + user_mask = alloc_user_cpus_ptr(NUMA_NO_NODE); + if (user_mask) { + cpumask_copy(user_mask, in_mask); + } else if (IS_ENABLED(CONFIG_SMP)) { + return -ENOMEM; + } + + ac = (struct affinity_context){ + .new_mask = in_mask, + .user_mask = user_mask, + .flags = SCA_USER, + }; + + retval = __sched_setaffinity(p, &ac); + kfree(ac.user_mask); + + return retval; +} + +static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len, + struct cpumask *new_mask) +{ + if (len < cpumask_size()) + cpumask_clear(new_mask); + else if (len > cpumask_size()) + len = cpumask_size(); + + return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0; +} + +/** + * sys_sched_setaffinity - set the CPU affinity of a process + * @pid: pid of the process + * @len: length in bytes of the bitmask pointed to by user_mask_ptr + * @user_mask_ptr: user-space pointer to the new CPU mask + * + * Return: 0 on success. An error code otherwise. + */ +SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len, + unsigned long __user *, user_mask_ptr) +{ + cpumask_var_t new_mask; + int retval; + + if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) + return -ENOMEM; + + retval = get_user_cpu_mask(user_mask_ptr, len, new_mask); + if (retval == 0) + retval = sched_setaffinity(pid, new_mask); + free_cpumask_var(new_mask); + return retval; +} + +long sched_getaffinity(pid_t pid, struct cpumask *mask) +{ + struct task_struct *p; + int retval; + + guard(rcu)(); + p = find_process_by_pid(pid); + if (!p) + return -ESRCH; + + retval = security_task_getscheduler(p); + if (retval) + return retval; + + guard(raw_spinlock_irqsave)(&p->pi_lock); + cpumask_and(mask, &p->cpus_mask, cpu_active_mask); + + return 0; +} + +/** + * sys_sched_getaffinity - get the CPU affinity of a process + * @pid: pid of the process + * @len: length in bytes of the bitmask pointed to by user_mask_ptr + * @user_mask_ptr: user-space pointer to hold the current CPU mask + * + * Return: size of CPU mask copied to user_mask_ptr on success. An + * error code otherwise. + */ +SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, + unsigned long __user *, user_mask_ptr) +{ + int ret; + cpumask_var_t mask; + + if ((len * BITS_PER_BYTE) < nr_cpu_ids) + return -EINVAL; + if (len & (sizeof(unsigned long)-1)) + return -EINVAL; + + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) + return -ENOMEM; + + ret = sched_getaffinity(pid, mask); + if (ret == 0) { + unsigned int retlen = min(len, cpumask_size()); + + if (copy_to_user(user_mask_ptr, cpumask_bits(mask), retlen)) + ret = -EFAULT; + else + ret = retlen; + } + free_cpumask_var(mask); + + return ret; +} + +static void do_sched_yield(void) +{ + struct rq_flags rf; + struct rq *rq; + + rq = this_rq_lock_irq(&rf); + + schedstat_inc(rq->yld_count); + current->sched_class->yield_task(rq); + + preempt_disable(); + rq_unlock_irq(rq, &rf); + sched_preempt_enable_no_resched(); + + schedule(); +} + +/** + * sys_sched_yield - yield the current processor to other threads. + * + * This function yields the current CPU to other tasks. If there are no + * other threads running on this CPU then this function will return. + * + * Return: 0. + */ +SYSCALL_DEFINE0(sched_yield) +{ + do_sched_yield(); + return 0; +} + +/** + * yield - yield the current processor to other threads. + * + * Do not ever use this function, there's a 99% chance you're doing it wrong. + * + * The scheduler is at all times free to pick the calling task as the most + * eligible task to run, if removing the yield() call from your code breaks + * it, it's already broken. + * + * Typical broken usage is: + * + * while (!event) + * yield(); + * + * where one assumes that yield() will let 'the other' process run that will + * make event true. If the current task is a SCHED_FIFO task that will never + * happen. Never use yield() as a progress guarantee!! + * + * If you want to use yield() to wait for something, use wait_event(). + * If you want to use yield() to be 'nice' for others, use cond_resched(). + * If you still want to use yield(), do not! + */ +void __sched yield(void) +{ + set_current_state(TASK_RUNNING); + do_sched_yield(); +} +EXPORT_SYMBOL(yield); + +/** + * yield_to - yield the current processor to another thread in + * your thread group, or accelerate that thread toward the + * processor it's on. + * @p: target task + * @preempt: whether task preemption is allowed or not + * + * It's the caller's job to ensure that the target task struct + * can't go away on us before we can do any checks. + * + * Return: + * true (>0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. + */ +int __sched yield_to(struct task_struct *p, bool preempt) +{ + struct task_struct *curr = current; + struct rq *rq, *p_rq; + int yielded = 0; + + scoped_guard (raw_spinlock_irqsave, &p->pi_lock) { + rq = this_rq(); + +again: + p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding. + */ + if (rq->nr_running == 1 && p_rq->nr_running == 1) + return -ESRCH; + + guard(double_rq_lock)(rq, p_rq); + if (task_rq(p) != p_rq) + goto again; + + if (!curr->sched_class->yield_to_task) + return 0; + + if (curr->sched_class != p->sched_class) + return 0; + + if (task_on_cpu(p_rq, p) || !task_is_running(p)) + return 0; + + yielded = curr->sched_class->yield_to_task(rq, p); + if (yielded) { + schedstat_inc(rq->yld_count); + /* + * Make p's CPU reschedule; pick_next_entity + * takes care of fairness. + */ + if (preempt && rq != p_rq) + resched_curr(p_rq); + } + } + + if (yielded) + schedule(); + + return yielded; +} +EXPORT_SYMBOL_GPL(yield_to); + +/** + * sys_sched_get_priority_max - return maximum RT priority. + * @policy: scheduling class. + * + * Return: On success, this syscall returns the maximum + * rt_priority that can be used by a given scheduling class. + * On failure, a negative error code is returned. + */ +SYSCALL_DEFINE1(sched_get_priority_max, int, policy) +{ + int ret = -EINVAL; + + switch (policy) { + case SCHED_FIFO: + case SCHED_RR: + ret = MAX_RT_PRIO-1; + break; + case SCHED_DEADLINE: + case SCHED_NORMAL: + case SCHED_BATCH: + case SCHED_IDLE: + ret = 0; + break; + } + return ret; +} + +/** + * sys_sched_get_priority_min - return minimum RT priority. + * @policy: scheduling class. + * + * Return: On success, this syscall returns the minimum + * rt_priority that can be used by a given scheduling class. + * On failure, a negative error code is returned. + */ +SYSCALL_DEFINE1(sched_get_priority_min, int, policy) +{ + int ret = -EINVAL; + + switch (policy) { + case SCHED_FIFO: + case SCHED_RR: + ret = 1; + break; + case SCHED_DEADLINE: + case SCHED_NORMAL: + case SCHED_BATCH: + case SCHED_IDLE: + ret = 0; + } + return ret; +} + +static int sched_rr_get_interval(pid_t pid, struct timespec64 *t) +{ + unsigned int time_slice = 0; + int retval; + + if (pid < 0) + return -EINVAL; + + scoped_guard (rcu) { + struct task_struct *p = find_process_by_pid(pid); + if (!p) + return -ESRCH; + + retval = security_task_getscheduler(p); + if (retval) + return retval; + + scoped_guard (task_rq_lock, p) { + struct rq *rq = scope.rq; + if (p->sched_class->get_rr_interval) + time_slice = p->sched_class->get_rr_interval(rq, p); + } + } + + jiffies_to_timespec64(time_slice, t); + return 0; +} + +/** + * sys_sched_rr_get_interval - return the default timeslice of a process. + * @pid: pid of the process. + * @interval: userspace pointer to the timeslice value. + * + * this syscall writes the default timeslice value of a given process + * into the user-space timespec buffer. A value of '0' means infinity. + * + * Return: On success, 0 and the timeslice is in @interval. Otherwise, + * an error code. + */ +SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid, + struct __kernel_timespec __user *, interval) +{ + struct timespec64 t; + int retval = sched_rr_get_interval(pid, &t); + + if (retval == 0) + retval = put_timespec64(&t, interval); + + return retval; +} + +#ifdef CONFIG_COMPAT_32BIT_TIME +SYSCALL_DEFINE2(sched_rr_get_interval_time32, pid_t, pid, + struct old_timespec32 __user *, interval) +{ + struct timespec64 t; + int retval = sched_rr_get_interval(pid, &t); + + if (retval == 0) + retval = put_old_timespec32(&t, interval); + return retval; +} +#endif -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- The mainline patch 04746ed80bcf ("sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c") moves some EXPORT_SYMBOL from core.c to syscalls.c, resulting in the KABI broken, so still keep the EXPORT_SYMBOL in core.c and leave the implementation in syscalls.c. Fixes: eab09c7d2a15 ("sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 17 +++++++++++++++++ kernel/sched/syscalls.c | 6 ------ 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d28790f1b7ca..379b8266b439 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -100,6 +100,23 @@ #include <linux/sched/grid_qos.h> +/* + * Fix kabi for 04746ed80bcf, + * and still keep EXPORT_SYMBOL them in core.c instead of syscalls.c + */ +extern void sched_set_fifo_low(struct task_struct *p); +EXPORT_SYMBOL_GPL(sched_set_fifo_low); +extern int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr); +EXPORT_SYMBOL_GPL(sched_setattr_nocheck); +extern int __sched yield_to(struct task_struct *p, bool preempt); +EXPORT_SYMBOL_GPL(yield_to); +extern void set_user_nice(struct task_struct *p, long nice); +EXPORT_SYMBOL(set_user_nice); +extern void sched_set_fifo(struct task_struct *p); +EXPORT_SYMBOL_GPL(sched_set_fifo); +extern void sched_set_normal(struct task_struct *p, int nice); +EXPORT_SYMBOL_GPL(sched_set_normal); + EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu); EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask); diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 5f4eaecb90cd..a0efcc8661e5 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -87,7 +87,6 @@ void set_user_nice(struct task_struct *p, long nice) */ p->sched_class->prio_changed(rq, p, old_prio); } -EXPORT_SYMBOL(set_user_nice); /* * is_nice_reduction - check if nice value is an actual reduction @@ -876,7 +875,6 @@ int sched_setattr_nocheck(struct task_struct *p, const struct sched_attr *attr) { return __sched_setscheduler(p, attr, false, true); } -EXPORT_SYMBOL_GPL(sched_setattr_nocheck); /** * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace. @@ -920,7 +918,6 @@ void sched_set_fifo(struct task_struct *p) struct sched_param sp = { .sched_priority = MAX_RT_PRIO / 2 }; WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0); } -EXPORT_SYMBOL_GPL(sched_set_fifo); /* * For when you don't much care about FIFO, but want to be above SCHED_NORMAL. @@ -930,7 +927,6 @@ void sched_set_fifo_low(struct task_struct *p) struct sched_param sp = { .sched_priority = 1 }; WARN_ON_ONCE(sched_setscheduler_nocheck(p, SCHED_FIFO, &sp) != 0); } -EXPORT_SYMBOL_GPL(sched_set_fifo_low); void sched_set_normal(struct task_struct *p, int nice) { @@ -940,7 +936,6 @@ void sched_set_normal(struct task_struct *p, int nice) }; WARN_ON_ONCE(sched_setattr_nocheck(p, &attr) != 0); } -EXPORT_SYMBOL_GPL(sched_set_normal); static int do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param) @@ -1573,7 +1568,6 @@ int __sched yield_to(struct task_struct *p, bool preempt) return yielded; } -EXPORT_SYMBOL_GPL(yield_to); /** * sys_sched_get_priority_max - return maximum RT priority. -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit d8c7bc2e209170ff78b88b85258546aa6e337af9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- When a task switches to a new sched_class, the prev and new classes are notified through ->switched_from() and ->switched_to(), respectively, after the switching is done. A new BPF extensible sched_class will have callbacks that allow the BPF scheduler to keep track of relevant task states (like priority and cpumask). Those callbacks aren't called while a task is on a different sched_class. When a task comes back, we wanna tell the BPF progs the up-to-date state before the task gets enqueued, so we need a hook which is called before the switching is committed. This patch adds ->switching_to() which is called during sched_class switch through check_class_changing() before the task is restored. Also, this patch exposes check_class_changing/changed() in kernel/sched/sched.h. They will be used by the new BPF extensible sched_class to implement implicit sched_class switching which is used e.g. when falling back to CFS when the BPF scheduler fails or unloads. This is a prep patch and doesn't cause any behavior changes. The new operation and exposed functions aren't used yet. v3: Refreshed on top of tip:sched/core. v2: Improve patch description w/ details on planned use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 12 ++++++++++++ kernel/sched/sched.h | 3 +++ kernel/sched/syscalls.c | 1 + 3 files changed, 16 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 379b8266b439..1e3c01aa2520 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2041,6 +2041,17 @@ inline int task_curr(const struct task_struct *p) return cpu_curr(task_cpu(p)) == p; } +/* + * ->switching_to() is called with the pi_lock and rq_lock held and must not + * mess with locking. + */ +void check_class_changing(struct rq *rq, struct task_struct *p, + const struct sched_class *prev_class) +{ + if (prev_class != p->sched_class && p->sched_class->switching_to) + p->sched_class->switching_to(rq, p); +} + /* * switched_from, switched_to and prio_changed must _NOT_ drop rq->lock, * use the balance_callback list if you want balancing. @@ -7025,6 +7036,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) } __setscheduler_prio(p, prio); + check_class_changing(rq, p, prev_class); if (queued) enqueue_task(rq, p, queue_flag); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d77442a46bf6..14ff791b3d7e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2489,6 +2489,7 @@ struct sched_class { * cannot assume the switched_from/switched_to pair is serialized by * rq->lock. They are however serialized by p->pi_lock. */ + void (*switching_to) (struct rq *this_rq, struct task_struct *task); void (*switched_from)(struct rq *this_rq, struct task_struct *task); void (*switched_to) (struct rq *this_rq, struct task_struct *task); void (*prio_changed) (struct rq *this_rq, struct task_struct *task, @@ -3883,6 +3884,8 @@ extern void set_load_weight(struct task_struct *p, bool update_load); extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags); extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags); +extern void check_class_changing(struct rq *rq, struct task_struct *p, + const struct sched_class *prev_class); extern void check_class_changed(struct rq *rq, struct task_struct *p, const struct sched_class *prev_class, int oldprio); diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index a0efcc8661e5..f3b3acec7a3d 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -790,6 +790,7 @@ int __sched_setscheduler(struct task_struct *p, __setscheduler_prio(p, newprio); } __setscheduler_uclamp(p, attr); + check_class_changing(rq, p, prev_class); if (queued) { /* -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched: Fix kabi for the method pointer switching_to in struct sched_class. Fixes: 268c45531283 ("sched: Add sched_class->switching_to() and expose check_class_changing/changed()") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 14ff791b3d7e..93371bdf6205 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2489,7 +2489,6 @@ struct sched_class { * cannot assume the switched_from/switched_to pair is serialized by * rq->lock. They are however serialized by p->pi_lock. */ - void (*switching_to) (struct rq *this_rq, struct task_struct *task); void (*switched_from)(struct rq *this_rq, struct task_struct *task); void (*switched_to) (struct rq *this_rq, struct task_struct *task); void (*prio_changed) (struct rq *this_rq, struct task_struct *task, @@ -2509,7 +2508,7 @@ struct sched_class { #endif KABI_USE(1, void (*reweight_task)(struct rq *this_rq, struct task_struct *task, const struct load_weight *lw)) - KABI_RESERVE(2) + KABI_USE(2, void (*switching_to) (struct rq *this_rq, struct task_struct *task)) }; static inline void put_prev_task(struct rq *rq, struct task_struct *prev) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 4f9c7ca851044273df5d67e00ca0b4d0476a48f6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Factor out sched_weight_from/to_cgroup() which convert between scheduler shares and cgroup weight. No functional change. The factored out functions will be used by a new BPF extensible sched_class so that the weights can be exposed to the BPF programs in a way which is consistent cgroup weights and easier to interpret. The weight conversions will be used regardless of cgroup usage. It's just borrowing the cgroup weight range as it's more intuitive. CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that the conversion helpers can always be defined. v2: The helpers are now defined regardless of COFNIG_CGROUPS. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/cgroup.h | 4 ++-- kernel/sched/core.c | 28 +++++++++++++--------------- kernel/sched/sched.h | 18 ++++++++++++++++++ 3 files changed, 33 insertions(+), 17 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 8148e2f65ab7..b9f749a466e1 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -30,8 +30,6 @@ struct kernel_clone_args; -#ifdef CONFIG_CGROUPS - /* * All weight knobs on the default hierarchy should use the following min, * default and max values. The default value is the logarithmic center of @@ -41,6 +39,8 @@ struct kernel_clone_args; #define CGROUP_WEIGHT_DFL 100 #define CGROUP_WEIGHT_MAX 10000 +#ifdef CONFIG_CGROUPS + enum { CSS_TASK_ITER_PROCS = (1U << 0), /* walk only threadgroup leaders */ CSS_TASK_ITER_THREADED = (1U << 1), /* walk all threaded css_sets in the domain */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1e3c01aa2520..521cf7b466b4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10183,29 +10183,27 @@ static int cpu_local_stat_show(struct seq_file *sf, } #ifdef CONFIG_FAIR_GROUP_SCHED + +static unsigned long tg_weight(struct task_group *tg) +{ + return scale_load_down(tg->shares); +} + static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { - struct task_group *tg = css_tg(css); - u64 weight = scale_load_down(tg->shares); - - return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024); + return sched_weight_to_cgroup(tg_weight(css_tg(css))); } static int cpu_weight_write_u64(struct cgroup_subsys_state *css, - struct cftype *cft, u64 weight) + struct cftype *cft, u64 cgrp_weight) { - /* - * cgroup weight knobs should use the common MIN, DFL and MAX - * values which are 1, 100 and 10000 respectively. While it loses - * a bit of range on both ends, it maps pretty well onto the shares - * value used by scheduler and the round-trip conversions preserve - * the original value over the entire range. - */ - if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX) + unsigned long weight; + + if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX) return -ERANGE; - weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL); + weight = sched_weight_from_cgroup(cgrp_weight); return sched_group_set_shares(css_tg(css), scale_load(weight)); } @@ -10213,7 +10211,7 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css, static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) { - unsigned long weight = scale_load_down(css_tg(css)->shares); + unsigned long weight = tg_weight(css_tg(css)); int last_delta = INT_MAX; int prio, delta; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 93371bdf6205..233caa0c1047 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -239,6 +239,24 @@ static inline void update_avg(u64 *avg, u64 sample) #define shr_bound(val, shift) \ (val >> min_t(typeof(shift), shift, BITS_PER_TYPE(typeof(val)) - 1)) +/* + * cgroup weight knobs should use the common MIN, DFL and MAX values which are + * 1, 100 and 10000 respectively. While it loses a bit of range on both ends, it + * maps pretty well onto the shares value used by scheduler and the round-trip + * conversions preserve the original value over the entire range. + */ +static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight) +{ + return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL); +} + +static inline unsigned long sched_weight_to_cgroup(unsigned long weight) +{ + return clamp_t(unsigned long, + DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024), + CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX); +} + /* * !! For sched_setattr_nocheck() (kernel) only !! * -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 96fd6c65efc652e9054163e6d3cf254b9e5b93d2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- RT, DL, thermal and irq load and utilization metrics need to be decayed and updated periodically and before consumption to keep the numbers reasonable. This is currently done from __update_blocked_others() as a part of the fair class load balance path. Let's factor it out to update_other_load_avgs(). Pure refactor. No functional changes. This will be used by the new BPF extensible scheduling class to ensure that the above metrics are properly maintained. v2: Refreshed on top of tip:sched/core. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Conflicts: kernel/sched/fair.c kernel/sched/sched.h [use the logic in __update_blocked_others of this patch and fix compile err for update_other_load_avgs. Fix the compile error for 'update_other_load_avgs()' in kernel/sched/syscalls.c. Because the mainline patch 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") uses 'arch_scale_hw_pressure()' and 'update_hw_load_avg()' inside 'update_other_load_avgs()', but current version does not have them, so use other functions instead.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 16 +++------------- kernel/sched/sched.h | 7 ++++++- kernel/sched/syscalls.c | 19 +++++++++++++++++++ 3 files changed, 28 insertions(+), 14 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4ccadaa44b0c..317dbe9ed70f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -11289,28 +11289,18 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) { static bool __update_blocked_others(struct rq *rq, bool *done) { - const struct sched_class *curr_class; - u64 now = rq_clock_pelt(rq); - unsigned long thermal_pressure; - bool decayed; + bool updated; /* * update_load_avg() can call cpufreq_update_util(). Make sure that RT, * DL and IRQ signals have been updated before updating CFS. */ - curr_class = rq->curr->sched_class; - - thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); - - decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | - update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | - update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) | - update_irq_load_avg(rq, 0); + updated = update_other_load_avgs(rq); if (others_have_blocked(rq)) *done = false; - return decayed; + return updated; } #ifdef CONFIG_FAIR_GROUP_SCHED diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 233caa0c1047..f7d956e04211 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3242,6 +3242,8 @@ static inline unsigned long capacity_orig_of(int cpu) return cpu_rq(cpu)->cpu_capacity_orig; } +bool update_other_load_avgs(struct rq *rq); + unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, unsigned long *min, unsigned long *max); @@ -3284,7 +3286,10 @@ static inline unsigned long cpu_util_rt(struct rq *rq) { return READ_ONCE(rq->avg_rt.util_avg); } -#endif + +#else /* !CONFIG_SMP */ +static inline bool update_other_load_avgs(struct rq *rq) { return false; } +#endif /* CONFIG_SMP */ #ifdef CONFIG_UCLAMP_TASK unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id); diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index f3b3acec7a3d..ebf339797d8b 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -232,6 +232,25 @@ int sched_core_idle_cpu(int cpu) #endif #ifdef CONFIG_SMP +/* + * Load avg and utiliztion metrics need to be updated periodically and before + * consumption. This function updates the metrics for all subsystems except for + * the fair class. @rq must be locked and have its clock updated. + */ +bool update_other_load_avgs(struct rq *rq) +{ + u64 now = rq_clock_pelt(rq); + const struct sched_class *curr_class = rq->curr->sched_class; + unsigned long hw_pressure = arch_scale_thermal_pressure(cpu_of(rq)); + + lockdep_assert_rq_held(rq); + + return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | + update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | + update_thermal_load_avg(now, rq, hw_pressure) | + update_irq_load_avg(rq, 0); +} + /* * This function computes an effective utilization for the given CPU, to be * used for frequency selection given the linear relation: f = u * f_max. -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 2c8d046d5d51691550da023976815e7a64151636 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- A new BPF extensible sched_class will need to dynamically change how a task picks its sched_class. For example, if the loaded BPF scheduler progs fail, the tasks will be forced back on CFS even if the task's policy is set to the new sched_class. To support such mapping, add normal_policy() which wraps testing for %SCHED_NORMAL. This doesn't cause any behavior changes. v2: Update the description with more details on the expected use. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Conflicts: kernel/sched/fair.c kernel/sched/sched.h [Only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 2 +- kernel/sched/sched.h | 8 +++++++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 317dbe9ed70f..ea183905a145 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9602,7 +9602,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_ /* * BATCH and IDLE tasks do not preempt others. */ - if (unlikely(p->policy != SCHED_NORMAL)) + if (unlikely(!normal_policy(p->policy))) return; cfs_rq = cfs_rq_of(se); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f7d956e04211..0fe32c94c4e6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -189,9 +189,15 @@ static inline int idle_policy(int policy) { return policy == SCHED_IDLE; } + +static inline int normal_policy(int policy) +{ + return policy == SCHED_NORMAL; +} + static inline int fair_policy(int policy) { - return policy == SCHED_NORMAL || policy == SCHED_BATCH; + return normal_policy(policy) || policy == SCHED_BATCH; } static inline int rt_policy(int policy) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit a7a9fc549293168c1edbc8433d5d4fbbe1171630 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- This adds dummy implementations of sched_ext interfaces which interact with the scheduler core and hook them in the correct places. As they're all dummies, this doesn't cause any behavior changes. This is split out to help reviewing. v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Conflicts: kernel/sched/core.c kernel/sched/sched.h [Only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 12 ++++++++++++ kernel/fork.c | 2 ++ kernel/sched/core.c | 32 ++++++++++++++++++++++++-------- kernel/sched/ext.h | 24 ++++++++++++++++++++++++ kernel/sched/idle.c | 2 ++ kernel/sched/sched.h | 1 + 6 files changed, 65 insertions(+), 8 deletions(-) create mode 100644 include/linux/sched/ext.h create mode 100644 kernel/sched/ext.h diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h new file mode 100644 index 000000000000..a05dfcf533b0 --- /dev/null +++ b/include/linux/sched/ext.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_SCHED_EXT_H +#define _LINUX_SCHED_EXT_H + +#ifdef CONFIG_SCHED_CLASS_EXT +#error "NOT IMPLEMENTED YET" +#else /* !CONFIG_SCHED_CLASS_EXT */ + +static inline void sched_ext_free(struct task_struct *p) {} + +#endif /* CONFIG_SCHED_CLASS_EXT */ +#endif /* _LINUX_SCHED_EXT_H */ diff --git a/kernel/fork.c b/kernel/fork.c index 59197037d846..2b8331fd6bf0 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -23,6 +23,7 @@ #include <linux/sched/task.h> #include <linux/sched/task_stack.h> #include <linux/sched/cputime.h> +#include <linux/sched/ext.h> #include <linux/seq_file.h> #include <linux/rtmutex.h> #include <linux/init.h> @@ -1031,6 +1032,7 @@ void __put_task_struct(struct task_struct *tsk) WARN_ON(refcount_read(&tsk->usage)); WARN_ON(tsk == current); + sched_ext_free(tsk); io_uring_free(tsk); cgroup_free(tsk); task_numa_free(tsk, true); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 521cf7b466b4..8b2d1be2bf39 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4611,6 +4611,8 @@ late_initcall(sched_core_sysctl_init); */ int sched_fork(unsigned long clone_flags, struct task_struct *p) { + int ret; + __sched_fork(clone_flags, p); /* * We mark the process as NEW here. This guarantees that @@ -4647,12 +4649,16 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) p->sched_reset_on_fork = 0; } - if (dl_prio(p->prio)) - return -EAGAIN; - else if (rt_prio(p->prio)) + scx_pre_fork(p); + + if (dl_prio(p->prio)) { + ret = -EAGAIN; + goto out_cancel; + } else if (rt_prio(p->prio)) { p->sched_class = &rt_sched_class; - else + } else { p->sched_class = &fair_sched_class; + } init_entity_runnable_average(&p->se); @@ -4670,6 +4676,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) RB_CLEAR_NODE(&p->pushable_dl_tasks); #endif return 0; + +out_cancel: + scx_cancel_fork(p); + return ret; } int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) @@ -4708,16 +4718,18 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) p->sched_class->task_fork(p); raw_spin_unlock_irqrestore(&p->pi_lock, flags); - return 0; + return scx_fork(p); } void sched_cancel_fork(struct task_struct *p) { + scx_cancel_fork(p); } void sched_post_fork(struct task_struct *p) { uclamp_post_fork(p); + scx_post_fork(p); } unsigned long to_ratio(u64 period, u64 runtime) @@ -5863,7 +5875,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, * We can terminate the balance pass as soon as we know there is * a runnable task of @class priority or higher. */ - for_class_range(class, prev->sched_class, &idle_sched_class) { + for_balance_class_range(class, prev->sched_class, &idle_sched_class) { if (class->balance(rq, prev, rf)) break; } @@ -5881,6 +5893,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) const struct sched_class *class; struct task_struct *p; + if (scx_enabled()) + goto restart; + /* * Optimization: we know that if all tasks are in the fair class we can * call that function directly, but only if the @prev task wasn't of a @@ -5906,7 +5921,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) restart: put_prev_task_balance(rq, prev, rf); - for_each_class(class) { + for_each_active_class(class) { p = class->pick_next_task(rq); if (p) return p; @@ -5939,7 +5954,7 @@ static inline struct task_struct *pick_task(struct rq *rq) const struct sched_class *class; struct task_struct *p; - for_each_class(class) { + for_each_active_class(class) { p = class->pick_task(rq); if (p) return p; @@ -8404,6 +8419,7 @@ void __init sched_init(void) balance_push_set(smp_processor_id(), false); #endif init_sched_fair_class(); + init_sched_ext_class(); psi_init(); diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h new file mode 100644 index 000000000000..6a93c4825339 --- /dev/null +++ b/kernel/sched/ext.h @@ -0,0 +1,24 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifdef CONFIG_SCHED_CLASS_EXT +#error "NOT IMPLEMENTED YET" +#else /* CONFIG_SCHED_CLASS_EXT */ + +#define scx_enabled() false + +static inline void scx_pre_fork(struct task_struct *p) {} +static inline int scx_fork(struct task_struct *p) { return 0; } +static inline void scx_post_fork(struct task_struct *p) {} +static inline void scx_cancel_fork(struct task_struct *p) {} +static inline void init_sched_ext_class(void) {} + +#define for_each_active_class for_each_class +#define for_balance_class_range for_class_range + +#endif /* CONFIG_SCHED_CLASS_EXT */ + +#if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP) +#error "NOT IMPLEMENTED YET" +#else +static inline void scx_update_idle(struct rq *rq, bool idle) {} +#endif diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 86618f71918f..d0a197a953c3 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -448,11 +448,13 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl static void put_prev_task_idle(struct rq *rq, struct task_struct *prev) { + scx_update_idle(rq, false); } static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first) { update_idle_core(rq); + scx_update_idle(rq, true); schedstat_inc(rq->sched_goidle); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0fe32c94c4e6..f078bdf0bafa 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3962,5 +3962,6 @@ static inline int destroy_soft_domain(struct task_group *tg) } #endif +#include "ext.h" #endif /* _KERNEL_SCHED_SCHED_H */ -- 2.34.1
From: Ingo Molnar <mingo@kernel.org> mainline inclusion from mainline-v6.7-rc1 commit 82845683ca6a15fe8c7912c6264bb0e84ec6f5fb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Other scheduling classes already postfix their similar methods with the class name. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ea183905a145..e82505f979a4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9532,7 +9532,7 @@ static void set_next_buddy(struct sched_entity *se) /* * Preempt the current task with a newly woken task if needed: */ -static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) +static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) { struct task_struct *curr = rq->curr; struct sched_entity *se = &curr->se, *pse = &p->se; @@ -15522,7 +15522,7 @@ DEFINE_SCHED_CLASS(fair) = { .yield_task = yield_task_fair, .yield_to_task = yield_to_task_fair, - .check_preempt_curr = check_preempt_wakeup, + .check_preempt_curr = check_preempt_wakeup_fair, .pick_next_task = __pick_next_task_fair, .put_prev_task = put_prev_task_fair, -- 2.34.1
From: Ingo Molnar <mingo@kernel.org> mainline inclusion from mainline-v6.7-rc1 commit e23edc86b09df655bf8963bbcb16647adc787395 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The name is a bit opaque - make it clear that this is about wakeup preemption. Also rename the ->check_preempt_curr() methods similarly. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Conflicts: kernel/sched/core.c [Conflicts with 4c89400decb9 ("interference: Add sched wakelat interference track support"), so accept both.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 14 +++++++------- kernel/sched/deadline.c | 10 +++++----- kernel/sched/fair.c | 10 +++++----- kernel/sched/idle.c | 4 ++-- kernel/sched/rt.c | 6 +++--- kernel/sched/sched.h | 4 ++-- kernel/sched/stop_task.c | 4 ++-- 7 files changed, 26 insertions(+), 26 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8b2d1be2bf39..a2493d05980c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2072,10 +2072,10 @@ void check_class_changed(struct rq *rq, struct task_struct *p, p->sched_class->prio_changed(rq, p, oldprio); } -void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) +void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags) { if (p->sched_class == rq->curr->sched_class) - rq->curr->sched_class->check_preempt_curr(rq, p, flags); + rq->curr->sched_class->wakeup_preempt(rq, p, flags); else if (sched_class_above(p->sched_class, rq->curr->sched_class)) resched_curr(rq); @@ -2375,7 +2375,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf, rq_lock(rq, rf); WARN_ON_ONCE(task_cpu(p) != new_cpu); activate_task(rq, p, 0); - check_preempt_curr(rq, p, 0); + wakeup_preempt(rq, p, 0); return rq; } @@ -3247,7 +3247,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu) deactivate_task(src_rq, p, 0); set_task_cpu(p, cpu); activate_task(dst_rq, p, 0); - check_preempt_curr(dst_rq, p, 0); + wakeup_preempt(dst_rq, p, 0); rq_unpin_lock(dst_rq, &drf); rq_unpin_lock(src_rq, &srf); @@ -3610,7 +3610,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, } activate_task(rq, p, en_flags); - check_preempt_curr(rq, p, wake_flags); + wakeup_preempt(rq, p, wake_flags); ttwu_do_wakeup(p); @@ -3686,7 +3686,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags) * it should preempt the task that is current now. */ update_rq_clock(rq); - check_preempt_curr(rq, p, wake_flags); + wakeup_preempt(rq, p, wake_flags); } ttwu_do_wakeup(p); ret = 1; @@ -4782,7 +4782,7 @@ void wake_up_new_task(struct task_struct *p) activate_task(rq, p, ENQUEUE_NOCLOCK); trace_sched_wakeup_new(p); ifs_task_waking(p); - check_preempt_curr(rq, p, WF_FORK); + wakeup_preempt(rq, p, WF_FORK); #ifdef CONFIG_SMP if (p->sched_class->task_woken) { /* diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 9a2327abbd99..185e44b6c811 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -763,7 +763,7 @@ static inline void deadline_queue_pull_task(struct rq *rq) static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags); static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags); -static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags); +static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags); static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se, struct rq *rq) @@ -1175,7 +1175,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer) enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); if (dl_task(rq->curr)) - check_preempt_curr_dl(rq, p, 0); + wakeup_preempt_dl(rq, p, 0); else resched_curr(rq); @@ -1944,7 +1944,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf) * Only called when both the current and waking task are -deadline * tasks. */ -static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, +static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags) { if (dl_entity_preempt(&p->dl, &rq->curr->dl)) { @@ -2684,7 +2684,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p) deadline_queue_push_tasks(rq); #endif if (dl_task(rq->curr)) - check_preempt_curr_dl(rq, p, 0); + wakeup_preempt_dl(rq, p, 0); else resched_curr(rq); } else { @@ -2753,7 +2753,7 @@ DEFINE_SCHED_CLASS(dl) = { .dequeue_task = dequeue_task_dl, .yield_task = yield_task_dl, - .check_preempt_curr = check_preempt_curr_dl, + .wakeup_preempt = wakeup_preempt_dl, .pick_next_task = pick_next_task_dl, .put_prev_task = put_prev_task_dl, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e82505f979a4..6125cd954c26 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9546,7 +9546,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int /* * This is possible from callers such as attach_tasks(), in which we - * unconditionally check_preempt_curr() after an enqueue (which may have + * unconditionally wakeup_preempt() after an enqueue (which may have * lead to a throttle). This both saves work and prevents false * next-buddy nomination below. */ @@ -11199,7 +11199,7 @@ static void attach_task(struct rq *rq, struct task_struct *p) WARN_ON_ONCE(task_rq(p) != rq); activate_task(rq, p, ENQUEUE_NOCLOCK); - check_preempt_curr(rq, p, 0); + wakeup_preempt(rq, p, 0); } /* @@ -15032,7 +15032,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio) if (p->prio > oldprio) resched_curr(rq); } else - check_preempt_curr(rq, p, 0); + wakeup_preempt(rq, p, 0); } #ifdef CONFIG_FAIR_GROUP_SCHED @@ -15134,7 +15134,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p) if (task_current(rq, p)) resched_curr(rq); else - check_preempt_curr(rq, p, 0); + wakeup_preempt(rq, p, 0); } } @@ -15522,7 +15522,7 @@ DEFINE_SCHED_CLASS(fair) = { .yield_task = yield_task_fair, .yield_to_task = yield_to_task_fair, - .check_preempt_curr = check_preempt_wakeup_fair, + .wakeup_preempt = check_preempt_wakeup_fair, .pick_next_task = __pick_next_task_fair, .put_prev_task = put_prev_task_fair, diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index d0a197a953c3..b9a69749d5b5 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -441,7 +441,7 @@ balance_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) /* * Idle tasks are unconditionally rescheduled: */ -static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int flags) +static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags) { resched_curr(rq); } @@ -531,7 +531,7 @@ DEFINE_SCHED_CLASS(idle) = { /* dequeue is not valid, we print a debug message there: */ .dequeue_task = dequeue_task_idle, - .check_preempt_curr = check_preempt_curr_idle, + .wakeup_preempt = wakeup_preempt_idle, .pick_next_task = pick_next_task_idle, .put_prev_task = put_prev_task_idle, diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 48d53bf0161b..f660e0f5fc94 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -957,7 +957,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun) /* * When we're idle and a woken (rt) task is - * throttled check_preempt_curr() will set + * throttled wakeup_preempt() will set * skip_update and the time between the wakeup * and this unthrottle will get accounted as * 'runtime'. @@ -1719,7 +1719,7 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf) /* * Preempt the current task with a newly woken task if needed: */ -static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags) +static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags) { if (p->prio < rq->curr->prio) { resched_curr(rq); @@ -2716,7 +2716,7 @@ DEFINE_SCHED_CLASS(rt) = { .dequeue_task = dequeue_task_rt, .yield_task = yield_task_rt, - .check_preempt_curr = check_preempt_curr_rt, + .wakeup_preempt = wakeup_preempt_rt, .pick_next_task = pick_next_task_rt, .put_prev_task = put_prev_task_rt, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f078bdf0bafa..877fd26c8ab6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2479,7 +2479,7 @@ struct sched_class { void (*yield_task) (struct rq *rq); bool (*yield_to_task)(struct rq *rq, struct task_struct *p); - void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags); + void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags); struct task_struct *(*pick_next_task)(struct rq *rq); @@ -2778,7 +2778,7 @@ static inline void sub_nr_running(struct rq *rq, unsigned count) extern void activate_task(struct rq *rq, struct task_struct *p, int flags); extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags); -extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags); +extern void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags); #ifdef CONFIG_PREEMPT_RT #define SCHED_NR_MIGRATE_BREAK 8 diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 85590599b4d6..6cf7304e6449 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -23,7 +23,7 @@ balance_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) #endif /* CONFIG_SMP */ static void -check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags) +wakeup_preempt_stop(struct rq *rq, struct task_struct *p, int flags) { /* we're never preempted */ } @@ -120,7 +120,7 @@ DEFINE_SCHED_CLASS(stop) = { .dequeue_task = dequeue_task_stop, .yield_task = yield_task_stop, - .check_preempt_curr = check_preempt_curr_stop, + .wakeup_preempt = wakeup_preempt_stop, .pick_next_task = pick_next_task_stop, .put_prev_task = put_prev_task_stop, -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched/fair: Fix kabi for check_preempt_curr and wakeup_preempt in struct sched_class. Fixes: 14bf8c5744af ("sched/fair: Rename check_preempt_curr() to wakeup_preempt()") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 877fd26c8ab6..63ee0adae6a7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2479,7 +2479,8 @@ struct sched_class { void (*yield_task) (struct rq *rq); bool (*yield_to_task)(struct rq *rq, struct task_struct *p); - void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags); + KABI_REPLACE(void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags), + void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags)) struct task_struct *(*pick_next_task)(struct rq *rq); -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit 2f7a0f58948d8231236e2facecc500f1930fb996 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- In preparation of introducing !task sched_dl_entity; move the bandwidth accounting into {en.de}queue_dl_entity(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.169909515... Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/deadline.c | 130 ++++++++++++++++++++++------------------ kernel/sched/sched.h | 6 ++ 2 files changed, 78 insertions(+), 58 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 185e44b6c811..dff757fbdaab 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -389,12 +389,12 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw) * up, and checks if the task is still in the "ACTIVE non contending" * state or not (in the second case, it updates running_bw). */ -static void task_non_contending(struct task_struct *p) +static void task_non_contending(struct sched_dl_entity *dl_se) { - struct sched_dl_entity *dl_se = &p->dl; struct hrtimer *timer = &dl_se->inactive_timer; struct dl_rq *dl_rq = dl_rq_of_se(dl_se); struct rq *rq = rq_of_dl_rq(dl_rq); + struct task_struct *p = dl_task_of(dl_se); s64 zerolag_time; /* @@ -426,13 +426,14 @@ static void task_non_contending(struct task_struct *p) if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) { if (dl_task(p)) sub_running_bw(dl_se, dl_rq); + if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) { struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); if (READ_ONCE(p->__state) == TASK_DEAD) - sub_rq_bw(&p->dl, &rq->dl); + sub_rq_bw(dl_se, &rq->dl); raw_spin_lock(&dl_b->lock); - __dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p))); + __dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p))); raw_spin_unlock(&dl_b->lock); __dl_clear_params(p); } @@ -1634,6 +1635,41 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags) update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags); + /* + * Check if a constrained deadline task was activated + * after the deadline but before the next period. + * If that is the case, the task will be throttled and + * the replenishment timer will be set to the next period. + */ + if (!dl_se->dl_throttled && !dl_is_implicit(dl_se)) + dl_check_constrained_dl(dl_se); + + if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) { + struct dl_rq *dl_rq = dl_rq_of_se(dl_se); + + add_rq_bw(dl_se, dl_rq); + add_running_bw(dl_se, dl_rq); + } + + /* + * If p is throttled, we do not enqueue it. In fact, if it exhausted + * its budget it needs a replenishment and, since it now is on + * its rq, the bandwidth timer callback (which clearly has not + * run yet) will take care of this. + * However, the active utilization does not depend on the fact + * that the task is on the runqueue or not (but depends on the + * task's state - in GRUB parlance, "inactive" vs "active contending"). + * In other words, even if a task is throttled its utilization must + * be counted in the active utilization; hence, we need to call + * add_running_bw(). + */ + if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) { + if (flags & ENQUEUE_WAKEUP) + task_contending(dl_se, flags); + + return; + } + /* * If this is a wakeup or a new instance, the scheduling * parameters of the task might need updating. Otherwise, @@ -1654,9 +1690,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags) __enqueue_dl_entity(dl_se); } -static void dequeue_dl_entity(struct sched_dl_entity *dl_se) +static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags) { __dequeue_dl_entity(dl_se); + + if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) { + struct dl_rq *dl_rq = dl_rq_of_se(dl_se); + + sub_running_bw(dl_se, dl_rq); + sub_rq_bw(dl_se, dl_rq); + } + + /* + * This check allows to start the inactive timer (or to immediately + * decrease the active utilization, if needed) in two cases: + * when the task blocks and when it is terminating + * (p->state == TASK_DEAD). We can handle the two cases in the same + * way, because from GRUB's point of view the same thing is happening + * (the task moves from "active contending" to "active non contending" + * or "inactive") + */ + if (flags & DEQUEUE_SLEEP) + task_non_contending(dl_se); } static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags) @@ -1705,76 +1760,35 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags) return; } - /* - * Check if a constrained deadline task was activated - * after the deadline but before the next period. - * If that is the case, the task will be throttled and - * the replenishment timer will be set to the next period. - */ - if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl)) - dl_check_constrained_dl(&p->dl); - - if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) { - add_rq_bw(&p->dl, &rq->dl); - add_running_bw(&p->dl, &rq->dl); - } - - /* - * If p is throttled, we do not enqueue it. In fact, if it exhausted - * its budget it needs a replenishment and, since it now is on - * its rq, the bandwidth timer callback (which clearly has not - * run yet) will take care of this. - * However, the active utilization does not depend on the fact - * that the task is on the runqueue or not (but depends on the - * task's state - in GRUB parlance, "inactive" vs "active contending"). - * In other words, even if a task is throttled its utilization must - * be counted in the active utilization; hence, we need to call - * add_running_bw(). - */ - if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) { - if (flags & ENQUEUE_WAKEUP) - task_contending(&p->dl, flags); - - return; - } - check_schedstat_required(); update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl); + if (p->on_rq == TASK_ON_RQ_MIGRATING) + flags |= ENQUEUE_MIGRATING; + enqueue_dl_entity(&p->dl, flags); - if (!task_current(rq, p) && p->nr_cpus_allowed > 1) + if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1) enqueue_pushable_dl_task(rq, p); } static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { update_stats_dequeue_dl(&rq->dl, &p->dl, flags); - dequeue_dl_entity(&p->dl); - dequeue_pushable_dl_task(rq, p); + dequeue_dl_entity(&p->dl, flags); + + if (!p->dl.dl_throttled) + dequeue_pushable_dl_task(rq, p); } static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { update_curr_dl(rq); - __dequeue_task_dl(rq, p, flags); - if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) { - sub_running_bw(&p->dl, &rq->dl); - sub_rq_bw(&p->dl, &rq->dl); - } + if (p->on_rq == TASK_ON_RQ_MIGRATING) + flags |= DEQUEUE_MIGRATING; - /* - * This check allows to start the inactive timer (or to immediately - * decrease the active utilization, if needed) in two cases: - * when the task blocks and when it is terminating - * (p->state == TASK_DEAD). We can handle the two cases in the same - * way, because from GRUB's point of view the same thing is happening - * (the task moves from "active contending" to "active non contending" - * or "inactive") - */ - if (flags & DEQUEUE_SLEEP) - task_non_contending(p); + __dequeue_task_dl(rq, p, flags); } /* @@ -2617,7 +2631,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p) * will reset the task parameters. */ if (task_on_rq_queued(p) && p->dl.dl_runtime) - task_non_contending(p); + task_non_contending(&p->dl); /* * In case a task is setscheduled out from SCHED_DEADLINE we need to diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 63ee0adae6a7..baff0888bb70 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2435,6 +2435,10 @@ extern const u32 sched_prio_to_wmult[40]; * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location * in the runqueue. * + * NOCLOCK - skip the update_rq_clock() (avoids double updates) + * + * MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE) + * * ENQUEUE_HEAD - place at front of runqueue (tail if not specified) * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline) * ENQUEUE_MIGRATED - the task was migrated during wakeup @@ -2445,6 +2449,7 @@ extern const u32 sched_prio_to_wmult[40]; #define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */ #define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */ #define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */ +#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */ #define ENQUEUE_WAKEUP 0x01 #define ENQUEUE_RESTORE 0x02 @@ -2459,6 +2464,7 @@ extern const u32 sched_prio_to_wmult[40]; #define ENQUEUE_MIGRATED 0x00 #endif #define ENQUEUE_INITIAL 0x80 +#define ENQUEUE_MIGRATING 0x100 #define RETRY_TASK ((void *)-1UL) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 863ccdbb918a77e3f011571f943020bf7f0b114b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Change the function signature of sched_class::dequeue_task() to return a boolean, allowing future patches to 'fail' dequeue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org Conflicts: kernel/sched/deadline.c [Only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 7 +++++-- kernel/sched/deadline.c | 4 +++- kernel/sched/fair.c | 4 +++- kernel/sched/idle.c | 3 ++- kernel/sched/rt.c | 4 +++- kernel/sched/sched.h | 4 ++-- kernel/sched/stop_task.c | 3 ++- 7 files changed, 20 insertions(+), 9 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a2493d05980c..faef3245490f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1994,7 +1994,10 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags) sched_core_enqueue(rq, p); } -void dequeue_task(struct rq *rq, struct task_struct *p, int flags) +/* + * Must only return false when DEQUEUE_SLEEP. + */ +inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags) { if (sched_core_enabled(rq)) sched_core_dequeue(rq, p, flags); @@ -2008,7 +2011,7 @@ void dequeue_task(struct rq *rq, struct task_struct *p, int flags) } uclamp_rq_dec(rq, p); - p->sched_class->dequeue_task(rq, p, flags); + return p->sched_class->dequeue_task(rq, p, flags); } void activate_task(struct rq *rq, struct task_struct *p, int flags) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index dff757fbdaab..ebbdc3dae343 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1781,7 +1781,7 @@ static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) dequeue_pushable_dl_task(rq, p); } -static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) +static bool dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { update_curr_dl(rq); @@ -1789,6 +1789,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) flags |= DEQUEUE_MIGRATING; __dequeue_task_dl(rq, p, flags); + + return true; } /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6125cd954c26..812abf3347e2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7645,7 +7645,7 @@ static void set_next_buddy(struct sched_entity *se); * decreased. We remove the task from the rbtree and * update the fair scheduling stats: */ -static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) +static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se; @@ -7724,6 +7724,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) dequeue_throttle: util_est_update(&rq->cfs, p, task_sleep); hrtick_update(rq); + + return true; } #ifdef CONFIG_SMP diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index b9a69749d5b5..8c3ae98a16f0 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -485,13 +485,14 @@ struct task_struct *pick_next_task_idle(struct rq *rq) * It is not legal to sleep in the idle task - print a warning * message if some code attempts to do it: */ -static void +static bool dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags) { raw_spin_rq_unlock_irq(rq); printk(KERN_ERR "bad: scheduling from the idle thread!\n"); dump_stack(); raw_spin_rq_lock_irq(rq); + return true; } /* diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index f660e0f5fc94..b6eea8784978 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1552,7 +1552,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags) enqueue_pushable_task(rq, p); } -static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags) +static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags) { struct sched_rt_entity *rt_se = &p->rt; @@ -1560,6 +1560,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags) dequeue_rt_entity(rt_se, flags); dequeue_pushable_task(rq, p); + + return true; } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index baff0888bb70..891c7e3b0da7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2481,7 +2481,7 @@ struct sched_class { #endif void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); - void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); + bool (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); void (*yield_task) (struct rq *rq); bool (*yield_to_task)(struct rq *rq, struct task_struct *p); @@ -3917,7 +3917,7 @@ extern void __setscheduler_prio(struct task_struct *p, int prio); extern void __setscheduler_params(struct task_struct *p, const struct sched_attr *attr); extern void set_load_weight(struct task_struct *p, bool update_load); extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags); -extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags); +extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags); extern void check_class_changing(struct rq *rq, struct task_struct *p, const struct sched_class *prev_class); diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 6cf7304e6449..52d8164e935c 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -57,10 +57,11 @@ enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags) add_nr_running(rq, 1); } -static void +static bool dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags) { sub_nr_running(rq, 1); + return true; } static void yield_task_stop(struct rq *rq) -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched_ext: Fix kabi for dequeue_task in struct sched_class. Fixes: 88150ba51eca ("sched: Allow sched_class::dequeue_task() to fail") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 891c7e3b0da7..57f8fada633a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2481,7 +2481,8 @@ struct sched_class { #endif void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); - bool (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); + KABI_REPLACE(void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags), + bool (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags)) void (*yield_task) (struct rq *rq); bool (*yield_to_task)(struct rq *rq, struct task_struct *p); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit f0e1a0643a59bf1f922fa209cec86a170b784f3f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following: 1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies. 2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers. 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments. sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the more simple and ergonomic struct sched_ext_ops callbacks. For more detailed discussion on the motivations and overview, please refer to the cover letter. Later patches will also add several example schedulers and documentation. This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support. include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The followings are worth noting: - Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx". - In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided. - A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext. - To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq(). - sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks. - A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq(). - One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq. - Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn(). - When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths. v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed. - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free. - Consolidated per-cpu variable usages and other cleanups. v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group. - SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch. - ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext. - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change. - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param. - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ which makes tasks to execute in order while other CPUs stay idle which made some hackbench numbers really bad. Fixed. - The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs. - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future. - scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility. - exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs. - The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE. - Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files. - Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs. v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64. - Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields. - Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops. - Rename "type" to "kind" in scx_exit_info to make it easier to use on languages in which "type" is a reserved keyword. - Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE. - Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt(). v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is because upstream is adaopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly. - task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX. - scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask. - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores. - ops.enable() now sees up-to-date p->scx.weight value. - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call. - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler. v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight. - move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags. - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them. - The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations. v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_taks_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details. - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK(). - Unused all_dsqs list removed. This was a left-over from previous iterations. - p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe. - BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately. - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1. - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support. - Add MAINTAINERS entries and other misc changes. Signed-off-by: Tejun Heo <tj@kernel.org> Co-authored-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Andrea Righi <andrea.righi@canonical.com> Conflicts: include/linux/sched.h kernel/sched/core.c kernel/sched/sched.h kernel/sched/build_policy.c kernel/sched/ext.c [For core.c, the difference is that the mainline removes `sched_balance_trigger(rq)`, but we removes `trigger_load_balance(rq)` instead. For sched_ext: Fix the compile err for ext.c. For other files, only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- MAINTAINERS | 13 + include/asm-generic/vmlinux.lds.h | 1 + include/linux/sched.h | 5 + include/linux/sched/ext.h | 141 +- include/uapi/linux/sched.h | 1 + init/init_task.c | 11 + kernel/Kconfig.preempt | 25 +- kernel/sched/build_policy.c | 9 + kernel/sched/core.c | 66 +- kernel/sched/debug.c | 3 + kernel/sched/ext.c | 4260 +++++++++++++++++++++++++++++ kernel/sched/ext.h | 73 +- kernel/sched/sched.h | 17 + kernel/sched/syscalls.c | 2 + 14 files changed, 4620 insertions(+), 7 deletions(-) create mode 100644 kernel/sched/ext.c diff --git a/MAINTAINERS b/MAINTAINERS index 30523a5769c2..ce07060ea476 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19266,6 +19266,19 @@ F: include/linux/wait.h F: include/uapi/linux/sched.h F: kernel/sched/ +SCHEDULER - SCHED_EXT +R: Tejun Heo <tj@kernel.org> +R: David Vernet <void@manifault.com> +L: linux-kernel@vger.kernel.org +S: Maintained +W: https://github.com/sched-ext/scx +T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git +F: include/linux/sched/ext.h +F: kernel/sched/ext.h +F: kernel/sched/ext.c +F: tools/sched_ext/ +F: tools/testing/selftests/sched_ext + SCSI LIBSAS SUBSYSTEM R: John Garry <john.g.garry@oracle.com> R: Jason Yan <yanaijie@huawei.com> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index bbbecc8fda8e..ffe49234a418 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -131,6 +131,7 @@ *(__dl_sched_class) \ *(__rt_sched_class) \ *(__fair_sched_class) \ + *(__ext_sched_class) \ *(__idle_sched_class) \ __sched_class_lowest = .; diff --git a/include/linux/sched.h b/include/linux/sched.h index f5c80c372f74..a9c29f1de7fe 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -77,6 +77,8 @@ struct task_delay_info; struct task_group; struct user_event_mm; +#include <linux/sched/ext.h> + /* * Task state bitmask. NOTE! These bits are also * encoded in fs/proc/array.c: get_task_state(). @@ -837,6 +839,9 @@ struct task_struct { struct sched_entity se; struct sched_rt_entity rt; struct sched_dl_entity dl; +#ifdef CONFIG_SCHED_CLASS_EXT + struct sched_ext_entity scx; +#endif const struct sched_class *sched_class; #ifdef CONFIG_SCHED_CORE diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index a05dfcf533b0..c1530a7992cc 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -1,9 +1,148 @@ /* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ #ifndef _LINUX_SCHED_EXT_H #define _LINUX_SCHED_EXT_H #ifdef CONFIG_SCHED_CLASS_EXT -#error "NOT IMPLEMENTED YET" + +#include <linux/llist.h> +#include <linux/rhashtable-types.h> + +enum scx_public_consts { + SCX_OPS_NAME_LEN = 128, + + SCX_SLICE_DFL = 20 * 1000000, /* 20ms */ +}; + +/* + * DSQ (dispatch queue) IDs are 64bit of the format: + * + * Bits: [63] [62 .. 0] + * [ B] [ ID ] + * + * B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs + * ID: 63 bit ID + * + * Built-in IDs: + * + * Bits: [63] [62] [61..32] [31 .. 0] + * [ 1] [ L] [ R ] [ V ] + * + * 1: 1 for built-in DSQs. + * L: 1 for LOCAL_ON DSQ IDs, 0 for others + * V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value. + */ +enum scx_dsq_id_flags { + SCX_DSQ_FLAG_BUILTIN = 1LLU << 63, + SCX_DSQ_FLAG_LOCAL_ON = 1LLU << 62, + + SCX_DSQ_INVALID = SCX_DSQ_FLAG_BUILTIN | 0, + SCX_DSQ_GLOBAL = SCX_DSQ_FLAG_BUILTIN | 1, + SCX_DSQ_LOCAL = SCX_DSQ_FLAG_BUILTIN | 2, + SCX_DSQ_LOCAL_ON = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON, + SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU, +}; + +/* + * Dispatch queue (dsq) is a simple FIFO which is used to buffer between the + * scheduler core and the BPF scheduler. See the documentation for more details. + */ +struct scx_dispatch_q { + raw_spinlock_t lock; + struct list_head list; /* tasks in dispatch order */ + u32 nr; + u64 id; + struct rhash_head hash_node; + struct llist_node free_node; + struct rcu_head rcu; +}; + +/* scx_entity.flags */ +enum scx_ent_flags { + SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ + SCX_TASK_BAL_KEEP = 1 << 1, /* balance decided to keep current */ + SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ + SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ + + SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ + SCX_TASK_STATE_BITS = 2, + SCX_TASK_STATE_MASK = ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT, + + SCX_TASK_CURSOR = 1 << 31, /* iteration cursor, not a task */ +}; + +/* scx_entity.flags & SCX_TASK_STATE_MASK */ +enum scx_task_state { + SCX_TASK_NONE, /* ops.init_task() not called yet */ + SCX_TASK_INIT, /* ops.init_task() succeeded, but task can be cancelled */ + SCX_TASK_READY, /* fully initialized, but not in sched_ext */ + SCX_TASK_ENABLED, /* fully initialized and in sched_ext */ + + SCX_TASK_NR_STATES, +}; + +/* + * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from + * everywhere and the following bits track which kfunc sets are currently + * allowed for %current. This simple per-task tracking works because SCX ops + * nest in a limited way. BPF will likely implement a way to allow and disallow + * kfuncs depending on the calling context which will replace this manual + * mechanism. See scx_kf_allow(). + */ +enum scx_kf_mask { + SCX_KF_UNLOCKED = 0, /* not sleepable, not rq locked */ + /* all non-sleepables may be nested inside SLEEPABLE */ + SCX_KF_SLEEPABLE = 1 << 0, /* sleepable init operations */ + /* ops.dequeue (in REST) may be nested inside DISPATCH */ + SCX_KF_DISPATCH = 1 << 2, /* ops.dispatch() */ + SCX_KF_ENQUEUE = 1 << 3, /* ops.enqueue() and ops.select_cpu() */ + SCX_KF_SELECT_CPU = 1 << 4, /* ops.select_cpu() */ + SCX_KF_REST = 1 << 5, /* other rq-locked operations */ + + __SCX_KF_RQ_LOCKED = SCX_KF_DISPATCH | + SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, +}; + +/* + * The following is embedded in task_struct and contains all fields necessary + * for a task to be scheduled by SCX. + */ +struct sched_ext_entity { + struct scx_dispatch_q *dsq; + struct list_head dsq_node; + u32 flags; /* protected by rq lock */ + u32 weight; + s32 sticky_cpu; + s32 holding_cpu; + u32 kf_mask; /* see scx_kf_mask above */ + atomic_long_t ops_state; + + struct list_head runnable_node; /* rq->scx.runnable_list */ + + u64 ddsp_dsq_id; + u64 ddsp_enq_flags; + + /* BPF scheduler modifiable fields */ + + /* + * Runtime budget in nsecs. This is usually set through + * scx_bpf_dispatch() but can also be modified directly by the BPF + * scheduler. Automatically decreased by SCX as the task executes. On + * depletion, a scheduling event is triggered. + */ + u64 slice; + + /* cold fields */ + /* must be the last field, see init_scx_entity() */ + struct list_head tasks_node; +}; + +void sched_ext_free(struct task_struct *p); + #else /* !CONFIG_SCHED_CLASS_EXT */ static inline void sched_ext_free(struct task_struct *p) {} diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 3bac0a8ceab2..359a14cc76a4 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -118,6 +118,7 @@ struct clone_args { /* SCHED_ISO: reserved but not implemented yet */ #define SCHED_IDLE 5 #define SCHED_DEADLINE 6 +#define SCHED_EXT 7 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */ #define SCHED_RESET_ON_FORK 0x40000000 diff --git a/init/init_task.c b/init/init_task.c index 9fd44251b8b7..df8193e73c47 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -6,6 +6,7 @@ #include <linux/sched/sysctl.h> #include <linux/sched/rt.h> #include <linux/sched/task.h> +#include <linux/sched/ext.h> #include <linux/init.h> #include <linux/fs.h> #include <linux/mm.h> @@ -111,6 +112,16 @@ struct task_struct init_task #endif #ifdef CONFIG_CGROUP_SCHED .sched_task_group = &root_task_group, +#endif +#ifdef CONFIG_SCHED_CLASS_EXT + .scx = { + .dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node), + .sticky_cpu = -1, + .holding_cpu = -1, + .runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node), + .ddsp_dsq_id = SCX_DSQ_INVALID, + .slice = SCX_SLICE_DFL, + }, #endif .ptraced = LIST_HEAD_INIT(init_task.ptraced), .ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry), diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index dc2a630f200a..1ac534c1940e 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -133,4 +133,27 @@ config SCHED_CORE which is the likely usage by Linux distributions, there should be no measurable impact on performance. - +config SCHED_CLASS_EXT + bool "Extensible Scheduling Class" + depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE + help + This option enables a new scheduler class sched_ext (SCX), which + allows scheduling policies to be implemented as BPF programs to + achieve the following: + + - Ease of experimentation and exploration: Enabling rapid + iteration of new scheduling policies. + - Customization: Building application-specific schedulers which + implement policies that are not applicable to general-purpose + schedulers. + - Rapid scheduler deployments: Non-disruptive swap outs of + scheduling policies in production environments. + + sched_ext leverages BPF struct_ops feature to define a structure + which exports function callbacks and flags to BPF programs that + wish to implement scheduling policies. The struct_ops structure + exported by sched_ext is struct sched_ext_ops, and is conceptually + similar to struct sched_class. + + For more information: + https://github.com/sched-ext/scx diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c index e954328cc093..a0f9faaca01c 100644 --- a/kernel/sched/build_policy.c +++ b/kernel/sched/build_policy.c @@ -21,13 +21,18 @@ #include <linux/cpuidle.h> #include <linux/jiffies.h> +#include <linux/kobject.h> #include <linux/livepatch.h> +#include <linux/pm.h> #include <linux/psi.h> +#include <linux/rhashtable.h> +#include <linux/seq_buf.h> #include <linux/seqlock_api.h> #include <linux/slab.h> #include <linux/suspend.h> #include <linux/tsacct_kern.h> #include <linux/vtime.h> +#include <linux/percpu-rwsem.h> #include <uapi/linux/sched/types.h> @@ -55,4 +60,8 @@ #ifdef CONFIG_SCHED_SOFT_DOMAIN #include "soft_domain.c" #endif +#ifdef CONFIG_SCHED_CLASS_EXT +# include "ext.c" +#endif + #include "syscalls.c" diff --git a/kernel/sched/core.c b/kernel/sched/core.c index faef3245490f..57828b83372e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3803,6 +3803,15 @@ bool cpus_share_resources(int this_cpu, int that_cpu) static inline bool ttwu_queue_cond(struct task_struct *p, int cpu) { + /* + * The BPF scheduler may depend on select_task_rq() being invoked during + * wakeups. In addition, @p may end up executing on a different CPU + * regardless of what happens in the wakeup path making the ttwu_queue + * optimization less meaningful. Skip if on SCX. + */ + if (task_on_scx(p)) + return false; + /* * Do not complicate things with the async wake_list while the CPU is * in hotplug state. @@ -4376,6 +4385,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->rt.on_rq = 0; p->rt.on_list = 0; +#ifdef CONFIG_SCHED_CLASS_EXT + init_scx_entity(&p->scx); +#endif + #ifdef CONFIG_PREEMPT_NOTIFIERS INIT_HLIST_HEAD(&p->preempt_notifiers); #endif @@ -4659,6 +4672,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) goto out_cancel; } else if (rt_prio(p->prio)) { p->sched_class = &rt_sched_class; +#ifdef CONFIG_SCHED_CLASS_EXT + } else if (task_should_scx(p)) { + p->sched_class = &ext_sched_class; +#endif } else { p->sched_class = &fair_sched_class; } @@ -5576,8 +5593,10 @@ void scheduler_tick(void) wq_worker_tick(curr); #ifdef CONFIG_SMP - rq->idle_balance = idle_cpu(cpu); - trigger_load_balance(rq); + if (!scx_switched_all()) { + rq->idle_balance = idle_cpu(cpu); + trigger_load_balance(rq); + } #endif } @@ -6935,6 +6954,10 @@ void __setscheduler_prio(struct task_struct *p, int prio) p->sched_class = &dl_sched_class; else if (rt_prio(prio)) p->sched_class = &rt_sched_class; +#ifdef CONFIG_SCHED_CLASS_EXT + else if (task_should_scx(p)) + p->sched_class = &ext_sched_class; +#endif else p->sched_class = &fair_sched_class; @@ -8239,6 +8262,10 @@ void __init sched_init(void) BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class)); BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class)); BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class)); +#ifdef CONFIG_SCHED_CLASS_EXT + BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class)); + BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class)); +#endif wait_bit_init(); @@ -11099,3 +11126,38 @@ void sched_mm_cid_fork(struct task_struct *t) t->mm_cid_active = 1; } #endif + +#ifdef CONFIG_SCHED_CLASS_EXT +void sched_deq_and_put_task(struct task_struct *p, int queue_flags, + struct sched_enq_and_set_ctx *ctx) +{ + struct rq *rq = task_rq(p); + + lockdep_assert_rq_held(rq); + + *ctx = (struct sched_enq_and_set_ctx){ + .p = p, + .queue_flags = queue_flags, + .queued = task_on_rq_queued(p), + .running = task_current(rq, p), + }; + + update_rq_clock(rq); + if (ctx->queued) + dequeue_task(rq, p, queue_flags | DEQUEUE_NOCLOCK); + if (ctx->running) + put_prev_task(rq, p); +} + +void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx) +{ + struct rq *rq = task_rq(ctx->p); + + lockdep_assert_rq_held(rq); + + if (ctx->queued) + enqueue_task(rq, ctx->p, ctx->queue_flags | ENQUEUE_NOCLOCK); + if (ctx->running) + set_next_task(rq, ctx->p); +} +#endif /* CONFIG_SCHED_CLASS_EXT */ diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 31e71294a949..bba1526d43d2 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -1151,6 +1151,9 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, } else if (fair_policy(p->policy)) { P(se.slice); } +#ifdef CONFIG_SCHED_CLASS_EXT + __PS("ext.enabled", task_on_scx(p)); +#endif #undef PN_SCHEDSTAT #undef P_SCHEDSTAT diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c new file mode 100644 index 000000000000..c4dbce5b6f5d --- /dev/null +++ b/kernel/sched/ext.c @@ -0,0 +1,4260 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#include <linux/bpf_verifier.h> +#include <linux/bpf.h> +#include <linux/btf.h> + +#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) + +enum scx_consts { + SCX_DSP_DFL_MAX_BATCH = 32, + + SCX_EXIT_BT_LEN = 64, + SCX_EXIT_MSG_LEN = 1024, +}; + +enum scx_exit_kind { + SCX_EXIT_NONE, + SCX_EXIT_DONE, + + SCX_EXIT_UNREG = 64, /* user-space initiated unregistration */ + SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */ + SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */ + + SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */ + SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */ +}; + +/* + * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is + * being disabled. + */ +struct scx_exit_info { + /* %SCX_EXIT_* - broad category of the exit reason */ + enum scx_exit_kind kind; + + /* exit code if gracefully exiting */ + s64 exit_code; + + /* textual representation of the above */ + const char *reason; + + /* backtrace if exiting due to an error */ + unsigned long *bt; + u32 bt_len; + + /* informational message */ + char *msg; +}; + +/* sched_ext_ops.flags */ +enum scx_ops_flags { + /* + * Keep built-in idle tracking even if ops.update_idle() is implemented. + */ + SCX_OPS_KEEP_BUILTIN_IDLE = 1LLU << 0, + + /* + * By default, if there are no other task to run on the CPU, ext core + * keeps running the current task even after its slice expires. If this + * flag is specified, such tasks are passed to ops.enqueue() with + * %SCX_ENQ_LAST. See the comment above %SCX_ENQ_LAST for more info. + */ + SCX_OPS_ENQ_LAST = 1LLU << 1, + + /* + * An exiting task may schedule after PF_EXITING is set. In such cases, + * bpf_task_from_pid() may not be able to find the task and if the BPF + * scheduler depends on pid lookup for dispatching, the task will be + * lost leading to various issues including RCU grace period stalls. + * + * To mask this problem, by default, unhashed tasks are automatically + * dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't + * depend on pid lookups and wants to handle these tasks directly, the + * following flag can be used. + */ + SCX_OPS_ENQ_EXITING = 1LLU << 2, + + /* + * If set, only tasks with policy set to SCHED_EXT are attached to + * sched_ext. If clear, SCHED_NORMAL tasks are also included. + */ + SCX_OPS_SWITCH_PARTIAL = 1LLU << 3, + + SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | + SCX_OPS_ENQ_LAST | + SCX_OPS_ENQ_EXITING | + SCX_OPS_SWITCH_PARTIAL, +}; + +/* argument container for ops.init_task() */ +struct scx_init_task_args { + /* + * Set if ops.init_task() is being invoked on the fork path, as opposed + * to the scheduler transition path. + */ + bool fork; +}; + +/* argument container for ops.exit_task() */ +struct scx_exit_task_args { + /* Whether the task exited before running on sched_ext. */ + bool cancelled; +}; + +/** + * struct sched_ext_ops - Operation table for BPF scheduler implementation + * + * Userland can implement an arbitrary scheduling policy by implementing and + * loading operations in this table. + */ +struct sched_ext_ops { + /** + * select_cpu - Pick the target CPU for a task which is being woken up + * @p: task being woken up + * @prev_cpu: the cpu @p was on before sleeping + * @wake_flags: SCX_WAKE_* + * + * Decision made here isn't final. @p may be moved to any CPU while it + * is getting dispatched for execution later. However, as @p is not on + * the rq at this point, getting the eventual execution CPU right here + * saves a small bit of overhead down the line. + * + * If an idle CPU is returned, the CPU is kicked and will try to + * dispatch. While an explicit custom mechanism can be added, + * select_cpu() serves as the default way to wake up idle CPUs. + * + * @p may be dispatched directly by calling scx_bpf_dispatch(). If @p + * is dispatched, the ops.enqueue() callback will be skipped. Finally, + * if @p is dispatched to SCX_DSQ_LOCAL, it will be dispatched to the + * local DSQ of whatever CPU is returned by this callback. + */ + s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags); + + /** + * enqueue - Enqueue a task on the BPF scheduler + * @p: task being enqueued + * @enq_flags: %SCX_ENQ_* + * + * @p is ready to run. Dispatch directly by calling scx_bpf_dispatch() + * or enqueue on the BPF scheduler. If not directly dispatched, the bpf + * scheduler owns @p and if it fails to dispatch @p, the task will + * stall. + * + * If @p was dispatched from ops.select_cpu(), this callback is + * skipped. + */ + void (*enqueue)(struct task_struct *p, u64 enq_flags); + + /** + * dequeue - Remove a task from the BPF scheduler + * @p: task being dequeued + * @deq_flags: %SCX_DEQ_* + * + * Remove @p from the BPF scheduler. This is usually called to isolate + * the task while updating its scheduling properties (e.g. priority). + * + * The ext core keeps track of whether the BPF side owns a given task or + * not and can gracefully ignore spurious dispatches from BPF side, + * which makes it safe to not implement this method. However, depending + * on the scheduling logic, this can lead to confusing behaviors - e.g. + * scheduling position not being updated across a priority change. + */ + void (*dequeue)(struct task_struct *p, u64 deq_flags); + + /** + * dispatch - Dispatch tasks from the BPF scheduler and/or consume DSQs + * @cpu: CPU to dispatch tasks for + * @prev: previous task being switched out + * + * Called when a CPU's local dsq is empty. The operation should dispatch + * one or more tasks from the BPF scheduler into the DSQs using + * scx_bpf_dispatch() and/or consume user DSQs into the local DSQ using + * scx_bpf_consume(). + * + * The maximum number of times scx_bpf_dispatch() can be called without + * an intervening scx_bpf_consume() is specified by + * ops.dispatch_max_batch. See the comments on top of the two functions + * for more details. + * + * When not %NULL, @prev is an SCX task with its slice depleted. If + * @prev is still runnable as indicated by set %SCX_TASK_QUEUED in + * @prev->scx.flags, it is not enqueued yet and will be enqueued after + * ops.dispatch() returns. To keep executing @prev, return without + * dispatching or consuming any tasks. Also see %SCX_OPS_ENQ_LAST. + */ + void (*dispatch)(s32 cpu, struct task_struct *prev); + + /** + * tick - Periodic tick + * @p: task running currently + * + * This operation is called every 1/HZ seconds on CPUs which are + * executing an SCX task. Setting @p->scx.slice to 0 will trigger an + * immediate dispatch cycle on the CPU. + */ + void (*tick)(struct task_struct *p); + + /** + * yield - Yield CPU + * @from: yielding task + * @to: optional yield target task + * + * If @to is NULL, @from is yielding the CPU to other runnable tasks. + * The BPF scheduler should ensure that other available tasks are + * dispatched before the yielding task. Return value is ignored in this + * case. + * + * If @to is not-NULL, @from wants to yield the CPU to @to. If the bpf + * scheduler can implement the request, return %true; otherwise, %false. + */ + bool (*yield)(struct task_struct *from, struct task_struct *to); + + /** + * set_weight - Set task weight + * @p: task to set weight for + * @weight: new eight [1..10000] + * + * Update @p's weight to @weight. + */ + void (*set_weight)(struct task_struct *p, u32 weight); + + /** + * set_cpumask - Set CPU affinity + * @p: task to set CPU affinity for + * @cpumask: cpumask of cpus that @p can run on + * + * Update @p's CPU affinity to @cpumask. + */ + void (*set_cpumask)(struct task_struct *p, + const struct cpumask *cpumask); + + /** + * update_idle - Update the idle state of a CPU + * @cpu: CPU to udpate the idle state for + * @idle: whether entering or exiting the idle state + * + * This operation is called when @rq's CPU goes or leaves the idle + * state. By default, implementing this operation disables the built-in + * idle CPU tracking and the following helpers become unavailable: + * + * - scx_bpf_select_cpu_dfl() + * - scx_bpf_test_and_clear_cpu_idle() + * - scx_bpf_pick_idle_cpu() + * + * The user also must implement ops.select_cpu() as the default + * implementation relies on scx_bpf_select_cpu_dfl(). + * + * Specify the %SCX_OPS_KEEP_BUILTIN_IDLE flag to keep the built-in idle + * tracking. + */ + void (*update_idle)(s32 cpu, bool idle); + + /** + * init_task - Initialize a task to run in a BPF scheduler + * @p: task to initialize for BPF scheduling + * @args: init arguments, see the struct definition + * + * Either we're loading a BPF scheduler or a new task is being forked. + * Initialize @p for BPF scheduling. This operation may block and can + * be used for allocations, and is called exactly once for a task. + * + * Return 0 for success, -errno for failure. An error return while + * loading will abort loading of the BPF scheduler. During a fork, it + * will abort that specific fork. + */ + s32 (*init_task)(struct task_struct *p, struct scx_init_task_args *args); + + /** + * exit_task - Exit a previously-running task from the system + * @p: task to exit + * + * @p is exiting or the BPF scheduler is being unloaded. Perform any + * necessary cleanup for @p. + */ + void (*exit_task)(struct task_struct *p, struct scx_exit_task_args *args); + + /** + * enable - Enable BPF scheduling for a task + * @p: task to enable BPF scheduling for + * + * Enable @p for BPF scheduling. enable() is called on @p any time it + * enters SCX, and is always paired with a matching disable(). + */ + void (*enable)(struct task_struct *p); + + /** + * disable - Disable BPF scheduling for a task + * @p: task to disable BPF scheduling for + * + * @p is exiting, leaving SCX or the BPF scheduler is being unloaded. + * Disable BPF scheduling for @p. A disable() call is always matched + * with a prior enable() call. + */ + void (*disable)(struct task_struct *p); + + /* + * All online ops must come before ops.init(). + */ + + /** + * init - Initialize the BPF scheduler + */ + s32 (*init)(void); + + /** + * exit - Clean up after the BPF scheduler + * @info: Exit info + */ + void (*exit)(struct scx_exit_info *info); + + /** + * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch + */ + u32 dispatch_max_batch; + + /** + * flags - %SCX_OPS_* flags + */ + u64 flags; + + /** + * name - BPF scheduler's name + * + * Must be a non-zero valid BPF object name including only isalnum(), + * '_' and '.' chars. Shows up in kernel.sched_ext_ops sysctl while the + * BPF scheduler is enabled. + */ + char name[SCX_OPS_NAME_LEN]; +}; + +enum scx_opi { + SCX_OPI_BEGIN = 0, + SCX_OPI_NORMAL_BEGIN = 0, + SCX_OPI_NORMAL_END = SCX_OP_IDX(init), + SCX_OPI_END = SCX_OP_IDX(init), +}; + +enum scx_wake_flags { + /* expose select WF_* flags as enums */ + SCX_WAKE_FORK = WF_FORK, + SCX_WAKE_TTWU = WF_TTWU, + SCX_WAKE_SYNC = WF_SYNC, +}; + +enum scx_enq_flags { + /* expose select ENQUEUE_* flags as enums */ + SCX_ENQ_WAKEUP = ENQUEUE_WAKEUP, + SCX_ENQ_HEAD = ENQUEUE_HEAD, + + /* high 32bits are SCX specific */ + + /* + * The task being enqueued is the only task available for the cpu. By + * default, ext core keeps executing such tasks but when + * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with the + * %SCX_ENQ_LAST flag set. + * + * If the BPF scheduler wants to continue executing the task, + * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately. + * If the task gets queued on a different dsq or the BPF side, the BPF + * scheduler is responsible for triggering a follow-up scheduling event. + * Otherwise, Execution may stall. + */ + SCX_ENQ_LAST = 1LLU << 41, + + /* high 8 bits are internal */ + __SCX_ENQ_INTERNAL_MASK = 0xffLLU << 56, + + SCX_ENQ_CLEAR_OPSS = 1LLU << 56, +}; + +enum scx_deq_flags { + /* expose select DEQUEUE_* flags as enums */ + SCX_DEQ_SLEEP = DEQUEUE_SLEEP, +}; + +enum scx_pick_idle_cpu_flags { + SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */ +}; + +enum scx_ops_enable_state { + SCX_OPS_PREPPING, + SCX_OPS_ENABLING, + SCX_OPS_ENABLED, + SCX_OPS_DISABLING, + SCX_OPS_DISABLED, +}; + +static const char *scx_ops_enable_state_str[] = { + [SCX_OPS_PREPPING] = "prepping", + [SCX_OPS_ENABLING] = "enabling", + [SCX_OPS_ENABLED] = "enabled", + [SCX_OPS_DISABLING] = "disabling", + [SCX_OPS_DISABLED] = "disabled", +}; + +/* + * sched_ext_entity->ops_state + * + * Used to track the task ownership between the SCX core and the BPF scheduler. + * State transitions look as follows: + * + * NONE -> QUEUEING -> QUEUED -> DISPATCHING + * ^ | | + * | v v + * \-------------------------------/ + * + * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call + * sites for explanations on the conditions being waited upon and why they are + * safe. Transitions out of them into NONE or QUEUED must store_release and the + * waiters should load_acquire. + * + * Tracking scx_ops_state enables sched_ext core to reliably determine whether + * any given task can be dispatched by the BPF scheduler at all times and thus + * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler + * to try to dispatch any task anytime regardless of its state as the SCX core + * can safely reject invalid dispatches. + */ +enum scx_ops_state { + SCX_OPSS_NONE, /* owned by the SCX core */ + SCX_OPSS_QUEUEING, /* in transit to the BPF scheduler */ + SCX_OPSS_QUEUED, /* owned by the BPF scheduler */ + SCX_OPSS_DISPATCHING, /* in transit back to the SCX core */ + + /* + * QSEQ brands each QUEUED instance so that, when dispatch races + * dequeue/requeue, the dispatcher can tell whether it still has a claim + * on the task being dispatched. + * + * As some 32bit archs can't do 64bit store_release/load_acquire, + * p->scx.ops_state is atomic_long_t which leaves 30 bits for QSEQ on + * 32bit machines. The dispatch race window QSEQ protects is very narrow + * and runs with IRQ disabled. 30 bits should be sufficient. + */ + SCX_OPSS_QSEQ_SHIFT = 2, +}; + +/* Use macros to ensure that the type is unsigned long for the masks */ +#define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1) +#define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK) + +/* + * During exit, a task may schedule after losing its PIDs. When disabling the + * BPF scheduler, we need to be able to iterate tasks in every state to + * guarantee system safety. Maintain a dedicated task list which contains every + * task between its fork and eventual free. + */ +static DEFINE_SPINLOCK(scx_tasks_lock); +static LIST_HEAD(scx_tasks); + +/* ops enable/disable */ +static struct kthread_worker *scx_ops_helper; +static DEFINE_MUTEX(scx_ops_enable_mutex); +DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled); +DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem); +static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED); +static atomic_t scx_ops_bypass_depth = ATOMIC_INIT(0); +static bool scx_switching_all; +DEFINE_STATIC_KEY_FALSE(__scx_switched_all); + +static struct sched_ext_ops scx_ops; +static bool scx_warned_zero_slice; + +static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last); +static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting); +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); + +struct static_key_false scx_has_op[SCX_OPI_END] = + { [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT }; + +static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE); +static struct scx_exit_info *scx_exit_info; + +/* idle tracking */ +#ifdef CONFIG_SMP +#ifdef CONFIG_CPUMASK_OFFSTACK +#define CL_ALIGNED_IF_ONSTACK +#else +#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp +#endif + +static struct { + cpumask_var_t cpu; + cpumask_var_t smt; +} idle_masks CL_ALIGNED_IF_ONSTACK; + +#endif /* CONFIG_SMP */ + +/* + * Direct dispatch marker. + * + * Non-NULL values are used for direct dispatch from enqueue path. A valid + * pointer points to the task currently being enqueued. An ERR_PTR value is used + * to indicate that direct dispatch has already happened. + */ +static DEFINE_PER_CPU(struct task_struct *, direct_dispatch_task); + +/* dispatch queues */ +static struct scx_dispatch_q __cacheline_aligned_in_smp scx_dsq_global; + +static const struct rhashtable_params dsq_hash_params = { + .key_len = 8, + .key_offset = offsetof(struct scx_dispatch_q, id), + .head_offset = offsetof(struct scx_dispatch_q, hash_node), +}; + +static struct rhashtable dsq_hash; +static LLIST_HEAD(dsqs_to_free); + +/* dispatch buf */ +struct scx_dsp_buf_ent { + struct task_struct *task; + unsigned long qseq; + u64 dsq_id; + u64 enq_flags; +}; + +static u32 scx_dsp_max_batch; + +struct scx_dsp_ctx { + struct rq *rq; + struct rq_flags *rf; + u32 cursor; + u32 nr_tasks; + struct scx_dsp_buf_ent buf[]; +}; + +static struct scx_dsp_ctx __percpu *scx_dsp_ctx; + +/* string formatting from BPF */ +struct scx_bstr_buf { + u64 data[MAX_BPRINTF_VARARGS]; + char line[SCX_EXIT_MSG_LEN]; +}; + +static DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock); +static struct scx_bstr_buf scx_exit_bstr_buf; + +/* /sys/kernel/sched_ext interface */ +static struct kset *scx_kset; +static struct kobject *scx_root_kobj; + +static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, + s64 exit_code, + const char *fmt, ...); + +#define scx_ops_error_kind(err, fmt, args...) \ + scx_ops_exit_kind((err), 0, fmt, ##args) + +#define scx_ops_exit(code, fmt, args...) \ + scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args) + +#define scx_ops_error(fmt, args...) \ + scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args) + +#define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)]) + +/* if the highest set bit is N, return a mask with bits [N+1, 31] set */ +static u32 higher_bits(u32 flags) +{ + return ~((1 << fls(flags)) - 1); +} + +/* return the mask with only the highest bit set */ +static u32 highest_bit(u32 flags) +{ + int bit = fls(flags); + return ((u64)1 << bit) >> 1; +} + +/* + * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX + * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate + * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check + * whether it's running from an allowed context. + * + * @mask is constant, always inline to cull the mask calculations. + */ +static __always_inline void scx_kf_allow(u32 mask) +{ + /* nesting is allowed only in increasing scx_kf_mask order */ + WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask, + "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n", + current->scx.kf_mask, mask); + current->scx.kf_mask |= mask; + barrier(); +} + +static void scx_kf_disallow(u32 mask) +{ + barrier(); + current->scx.kf_mask &= ~mask; +} + +#define SCX_CALL_OP(mask, op, args...) \ +do { \ + if (mask) { \ + scx_kf_allow(mask); \ + scx_ops.op(args); \ + scx_kf_disallow(mask); \ + } else { \ + scx_ops.op(args); \ + } \ +} while (0) + +#define SCX_CALL_OP_RET(mask, op, args...) \ +({ \ + __typeof__(scx_ops.op(args)) __ret; \ + if (mask) { \ + scx_kf_allow(mask); \ + __ret = scx_ops.op(args); \ + scx_kf_disallow(mask); \ + } else { \ + __ret = scx_ops.op(args); \ + } \ + __ret; \ +}) + +/* @mask is constant, always inline to cull unnecessary branches */ +static __always_inline bool scx_kf_allowed(u32 mask) +{ + if (unlikely(!(current->scx.kf_mask & mask))) { + scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x", + mask, current->scx.kf_mask); + return false; + } + + if (unlikely((mask & SCX_KF_SLEEPABLE) && in_interrupt())) { + scx_ops_error("sleepable kfunc called from non-sleepable context"); + return false; + } + + /* + * Enforce nesting boundaries. e.g. A kfunc which can be called from + * DISPATCH must not be called if we're running DEQUEUE which is nested + * inside ops.dispatch(). We don't need to check the SCX_KF_SLEEPABLE + * boundary thanks to the above in_interrupt() check. + */ + if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && + (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { + scx_ops_error("dispatch kfunc called from a nested operation"); + return false; + } + + return true; +} + + +/* + * SCX task iterator. + */ +struct scx_task_iter { + struct sched_ext_entity cursor; + struct task_struct *locked; + struct rq *rq; + struct rq_flags rf; +}; + +/** + * scx_task_iter_init - Initialize a task iterator + * @iter: iterator to init + * + * Initialize @iter. Must be called with scx_tasks_lock held. Once initialized, + * @iter must eventually be exited with scx_task_iter_exit(). + * + * scx_tasks_lock may be released between this and the first next() call or + * between any two next() calls. If scx_tasks_lock is released between two + * next() calls, the caller is responsible for ensuring that the task being + * iterated remains accessible either through RCU read lock or obtaining a + * reference count. + * + * All tasks which existed when the iteration started are guaranteed to be + * visited as long as they still exist. + */ +static void scx_task_iter_init(struct scx_task_iter *iter) +{ + lockdep_assert_held(&scx_tasks_lock); + + iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR }; + list_add(&iter->cursor.tasks_node, &scx_tasks); + iter->locked = NULL; +} + +/** + * scx_task_iter_rq_unlock - Unlock rq locked by a task iterator + * @iter: iterator to unlock rq for + * + * If @iter is in the middle of a locked iteration, it may be locking the rq of + * the task currently being visited. Unlock the rq if so. This function can be + * safely called anytime during an iteration. + * + * Returns %true if the rq @iter was locking is unlocked. %false if @iter was + * not locking an rq. + */ +static bool scx_task_iter_rq_unlock(struct scx_task_iter *iter) +{ + if (iter->locked) { + task_rq_unlock(iter->rq, iter->locked, &iter->rf); + iter->locked = NULL; + return true; + } else { + return false; + } +} + +/** + * scx_task_iter_exit - Exit a task iterator + * @iter: iterator to exit + * + * Exit a previously initialized @iter. Must be called with scx_tasks_lock held. + * If the iterator holds a task's rq lock, that rq lock is released. See + * scx_task_iter_init() for details. + */ +static void scx_task_iter_exit(struct scx_task_iter *iter) +{ + lockdep_assert_held(&scx_tasks_lock); + + scx_task_iter_rq_unlock(iter); + list_del_init(&iter->cursor.tasks_node); +} + +/** + * scx_task_iter_next - Next task + * @iter: iterator to walk + * + * Visit the next task. See scx_task_iter_init() for details. + */ +static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter) +{ + struct list_head *cursor = &iter->cursor.tasks_node; + struct sched_ext_entity *pos; + + lockdep_assert_held(&scx_tasks_lock); + + list_for_each_entry(pos, cursor, tasks_node) { + if (&pos->tasks_node == &scx_tasks) + return NULL; + if (!(pos->flags & SCX_TASK_CURSOR)) { + list_move(cursor, &pos->tasks_node); + return container_of(pos, struct task_struct, scx); + } + } + + /* can't happen, should always terminate at scx_tasks above */ + BUG(); +} + +/** + * scx_task_iter_next_locked - Next non-idle task with its rq locked + * @iter: iterator to walk + * @include_dead: Whether we should include dead tasks in the iteration + * + * Visit the non-idle task with its rq lock held. Allows callers to specify + * whether they would like to filter out dead tasks. See scx_task_iter_init() + * for details. + */ +static struct task_struct * +scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead) +{ + struct task_struct *p; +retry: + scx_task_iter_rq_unlock(iter); + + while ((p = scx_task_iter_next(iter))) { + /* + * is_idle_task() tests %PF_IDLE which may not be set for CPUs + * which haven't yet been onlined. Test sched_class directly. + */ + if (p->sched_class != &idle_sched_class) + break; + } + if (!p) + return NULL; + + iter->rq = task_rq_lock(p, &iter->rf); + iter->locked = p; + + /* + * If we see %TASK_DEAD, @p already disabled preemption, is about to do + * the final __schedule(), won't ever need to be scheduled again and can + * thus be safely ignored. If we don't see %TASK_DEAD, @p can't enter + * the final __schedle() while we're locking its rq and thus will stay + * alive until the rq is unlocked. + */ + if (!include_dead && READ_ONCE(p->__state) == TASK_DEAD) + goto retry; + + return p; +} + +static enum scx_ops_enable_state scx_ops_enable_state(void) +{ + return atomic_read(&scx_ops_enable_state_var); +} + +static enum scx_ops_enable_state +scx_ops_set_enable_state(enum scx_ops_enable_state to) +{ + return atomic_xchg(&scx_ops_enable_state_var, to); +} + +static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to, + enum scx_ops_enable_state from) +{ + int from_v = from; + + return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to); +} + +static bool scx_ops_bypassing(void) +{ + return unlikely(atomic_read(&scx_ops_bypass_depth)); +} + +/** + * wait_ops_state - Busy-wait the specified ops state to end + * @p: target task + * @opss: state to wait the end of + * + * Busy-wait for @p to transition out of @opss. This can only be used when the + * state part of @opss is %SCX_QUEUEING or %SCX_DISPATCHING. This function also + * has load_acquire semantics to ensure that the caller can see the updates made + * in the enqueueing and dispatching paths. + */ +static void wait_ops_state(struct task_struct *p, unsigned long opss) +{ + do { + cpu_relax(); + } while (atomic_long_read_acquire(&p->scx.ops_state) == opss); +} + +/** + * ops_cpu_valid - Verify a cpu number + * @cpu: cpu number which came from a BPF ops + * @where: extra information reported on error + * + * @cpu is a cpu number which came from the BPF scheduler and can be any value. + * Verify that it is in range and one of the possible cpus. If invalid, trigger + * an ops error. + */ +static bool ops_cpu_valid(s32 cpu, const char *where) +{ + if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) { + return true; + } else { + scx_ops_error("invalid CPU %d%s%s", cpu, + where ? " " : "", where ?: ""); + return false; + } +} + +/** + * ops_sanitize_err - Sanitize a -errno value + * @ops_name: operation to blame on failure + * @err: -errno value to sanitize + * + * Verify @err is a valid -errno. If not, trigger scx_ops_error() and return + * -%EPROTO. This is necessary because returning a rogue -errno up the chain can + * cause misbehaviors. For an example, a large negative return from + * ops.init_task() triggers an oops when passed up the call chain because the + * value fails IS_ERR() test after being encoded with ERR_PTR() and then is + * handled as a pointer. + */ +static int ops_sanitize_err(const char *ops_name, s32 err) +{ + if (err < 0 && err >= -MAX_ERRNO) + return err; + + scx_ops_error("ops.%s() returned an invalid errno %d", ops_name, err); + return -EPROTO; +} + +static void update_curr_scx(struct rq *rq) +{ + struct task_struct *curr = rq->curr; + u64 now = rq_clock_task(rq); + u64 delta_exec; + + if (time_before_eq64(now, curr->se.exec_start)) + return; + + delta_exec = now - curr->se.exec_start; + curr->se.exec_start = now; + curr->se.sum_exec_runtime += delta_exec; + account_group_exec_runtime(curr, delta_exec); + cgroup_account_cputime(curr, delta_exec); + + curr->scx.slice -= min(curr->scx.slice, delta_exec); +} + +static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) +{ + /* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */ + WRITE_ONCE(dsq->nr, dsq->nr + delta); +} + +static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, + u64 enq_flags) +{ + bool is_local = dsq->id == SCX_DSQ_LOCAL; + + WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node)); + + if (!is_local) { + raw_spin_lock(&dsq->lock); + if (unlikely(dsq->id == SCX_DSQ_INVALID)) { + scx_ops_error("attempting to dispatch to a destroyed dsq"); + /* fall back to the global dsq */ + raw_spin_unlock(&dsq->lock); + dsq = &scx_dsq_global; + raw_spin_lock(&dsq->lock); + } + } + + if (enq_flags & SCX_ENQ_HEAD) + list_add(&p->scx.dsq_node, &dsq->list); + else + list_add_tail(&p->scx.dsq_node, &dsq->list); + + dsq_mod_nr(dsq, 1); + p->scx.dsq = dsq; + + /* + * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the + * direct dispatch path, but we clear them here because the direct + * dispatch verdict may be overridden on the enqueue path during e.g. + * bypass. + */ + p->scx.ddsp_dsq_id = SCX_DSQ_INVALID; + p->scx.ddsp_enq_flags = 0; + + /* + * We're transitioning out of QUEUEING or DISPATCHING. store_release to + * match waiters' load_acquire. + */ + if (enq_flags & SCX_ENQ_CLEAR_OPSS) + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); + + if (is_local) { + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); + + if (sched_class_above(&ext_sched_class, rq->curr->sched_class)) + resched_curr(rq); + } else { + raw_spin_unlock(&dsq->lock); + } +} + +static void dispatch_dequeue(struct rq *rq, struct task_struct *p) +{ + struct scx_dispatch_q *dsq = p->scx.dsq; + bool is_local = dsq == &rq->scx.local_dsq; + + if (!dsq) { + WARN_ON_ONCE(!list_empty(&p->scx.dsq_node)); + /* + * When dispatching directly from the BPF scheduler to a local + * DSQ, the task isn't associated with any DSQ but + * @p->scx.holding_cpu may be set under the protection of + * %SCX_OPSS_DISPATCHING. + */ + if (p->scx.holding_cpu >= 0) + p->scx.holding_cpu = -1; + return; + } + + if (!is_local) + raw_spin_lock(&dsq->lock); + + /* + * Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_node + * can't change underneath us. + */ + if (p->scx.holding_cpu < 0) { + /* @p must still be on @dsq, dequeue */ + WARN_ON_ONCE(list_empty(&p->scx.dsq_node)); + list_del_init(&p->scx.dsq_node); + dsq_mod_nr(dsq, -1); + } else { + /* + * We're racing against dispatch_to_local_dsq() which already + * removed @p from @dsq and set @p->scx.holding_cpu. Clear the + * holding_cpu which tells dispatch_to_local_dsq() that it lost + * the race. + */ + WARN_ON_ONCE(!list_empty(&p->scx.dsq_node)); + p->scx.holding_cpu = -1; + } + p->scx.dsq = NULL; + + if (!is_local) + raw_spin_unlock(&dsq->lock); +} + +static struct scx_dispatch_q *find_user_dsq(u64 dsq_id) +{ + return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params); +} + +static struct scx_dispatch_q *find_non_local_dsq(u64 dsq_id) +{ + lockdep_assert(rcu_read_lock_any_held()); + + if (dsq_id == SCX_DSQ_GLOBAL) + return &scx_dsq_global; + else + return find_user_dsq(dsq_id); +} + +static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id, + struct task_struct *p) +{ + struct scx_dispatch_q *dsq; + + if (dsq_id == SCX_DSQ_LOCAL) + return &rq->scx.local_dsq; + + dsq = find_non_local_dsq(dsq_id); + if (unlikely(!dsq)) { + scx_ops_error("non-existent DSQ 0x%llx for %s[%d]", + dsq_id, p->comm, p->pid); + return &scx_dsq_global; + } + + return dsq; +} + +static void mark_direct_dispatch(struct task_struct *ddsp_task, + struct task_struct *p, u64 dsq_id, + u64 enq_flags) +{ + /* + * Mark that dispatch already happened from ops.select_cpu() or + * ops.enqueue() by spoiling direct_dispatch_task with a non-NULL value + * which can never match a valid task pointer. + */ + __this_cpu_write(direct_dispatch_task, ERR_PTR(-ESRCH)); + + /* @p must match the task on the enqueue path */ + if (unlikely(p != ddsp_task)) { + if (IS_ERR(ddsp_task)) + scx_ops_error("%s[%d] already direct-dispatched", + p->comm, p->pid); + else + scx_ops_error("scheduling for %s[%d] but trying to direct-dispatch %s[%d]", + ddsp_task->comm, ddsp_task->pid, + p->comm, p->pid); + return; + } + + /* + * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because + * dispatching to the local DSQ of a different CPU requires unlocking + * the current rq which isn't allowed in the enqueue path. Use + * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL. + */ + if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) { + scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch"); + return; + } + + WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID); + WARN_ON_ONCE(p->scx.ddsp_enq_flags); + + p->scx.ddsp_dsq_id = dsq_id; + p->scx.ddsp_enq_flags = enq_flags; +} + +static void direct_dispatch(struct task_struct *p, u64 enq_flags) +{ + struct scx_dispatch_q *dsq; + + enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); + dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p); + dispatch_enqueue(dsq, p, enq_flags); +} + +static bool scx_rq_online(struct rq *rq) +{ +#ifdef CONFIG_SMP + return likely(rq->online); +#else + return true; +#endif +} + +static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, + int sticky_cpu) +{ + struct task_struct **ddsp_taskp; + unsigned long qseq; + + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED)); + + /* rq migration */ + if (sticky_cpu == cpu_of(rq)) + goto local_norefill; + + if (!scx_rq_online(rq)) + goto local; + + if (scx_ops_bypassing()) { + if (enq_flags & SCX_ENQ_LAST) + goto local; + else + goto global; + } + + if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) + goto direct; + + /* see %SCX_OPS_ENQ_EXITING */ + if (!static_branch_unlikely(&scx_ops_enq_exiting) && + unlikely(p->flags & PF_EXITING)) + goto local; + + /* see %SCX_OPS_ENQ_LAST */ + if (!static_branch_unlikely(&scx_ops_enq_last) && + (enq_flags & SCX_ENQ_LAST)) + goto local; + + if (!SCX_HAS_OP(enqueue)) + goto global; + + /* DSQ bypass didn't trigger, enqueue on the BPF scheduler */ + qseq = rq->scx.ops_qseq++ << SCX_OPSS_QSEQ_SHIFT; + + WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); + atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); + + ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); + WARN_ON_ONCE(*ddsp_taskp); + *ddsp_taskp = p; + + SCX_CALL_OP(SCX_KF_ENQUEUE, enqueue, p, enq_flags); + + *ddsp_taskp = NULL; + if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) + goto direct; + + /* + * If not directly dispatched, QUEUEING isn't clear yet and dispatch or + * dequeue may be waiting. The store_release matches their load_acquire. + */ + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq); + return; + +direct: + direct_dispatch(p, enq_flags); + return; + +local: + p->scx.slice = SCX_SLICE_DFL; +local_norefill: + dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags); + return; + +global: + p->scx.slice = SCX_SLICE_DFL; + dispatch_enqueue(&scx_dsq_global, p, enq_flags); +} + +static bool task_runnable(const struct task_struct *p) +{ + return !list_empty(&p->scx.runnable_node); +} + +static void set_task_runnable(struct rq *rq, struct task_struct *p) +{ + lockdep_assert_rq_held(rq); + + /* + * list_add_tail() must be used. scx_ops_bypass() depends on tasks being + * appened to the runnable_list. + */ + list_add_tail(&p->scx.runnable_node, &rq->scx.runnable_list); +} + +static void clr_task_runnable(struct task_struct *p) +{ + list_del_init(&p->scx.runnable_node); +} + +static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags) +{ + int sticky_cpu = p->scx.sticky_cpu; + + enq_flags |= rq->scx.extra_enq_flags; + + if (sticky_cpu >= 0) + p->scx.sticky_cpu = -1; + + /* + * Restoring a running task will be immediately followed by + * set_next_task_scx() which expects the task to not be on the BPF + * scheduler as tasks can only start running through local DSQs. Force + * direct-dispatch into the local DSQ by setting the sticky_cpu. + */ + if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p)) + sticky_cpu = cpu_of(rq); + + if (p->scx.flags & SCX_TASK_QUEUED) { + WARN_ON_ONCE(!task_runnable(p)); + return; + } + + set_task_runnable(rq, p); + p->scx.flags |= SCX_TASK_QUEUED; + rq->scx.nr_running++; + add_nr_running(rq, 1); + + do_enqueue_task(rq, p, enq_flags, sticky_cpu); +} + +static void ops_dequeue(struct task_struct *p, u64 deq_flags) +{ + unsigned long opss; + + clr_task_runnable(p); + + /* acquire ensures that we see the preceding updates on QUEUED */ + opss = atomic_long_read_acquire(&p->scx.ops_state); + + switch (opss & SCX_OPSS_STATE_MASK) { + case SCX_OPSS_NONE: + break; + case SCX_OPSS_QUEUEING: + /* + * QUEUEING is started and finished while holding @p's rq lock. + * As we're holding the rq lock now, we shouldn't see QUEUEING. + */ + BUG(); + case SCX_OPSS_QUEUED: + if (SCX_HAS_OP(dequeue)) + SCX_CALL_OP(SCX_KF_REST, dequeue, p, deq_flags); + + if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, + SCX_OPSS_NONE)) + break; + fallthrough; + case SCX_OPSS_DISPATCHING: + /* + * If @p is being dispatched from the BPF scheduler to a DSQ, + * wait for the transfer to complete so that @p doesn't get + * added to its DSQ after dequeueing is complete. + * + * As we're waiting on DISPATCHING with the rq locked, the + * dispatching side shouldn't try to lock the rq while + * DISPATCHING is set. See dispatch_to_local_dsq(). + * + * DISPATCHING shouldn't have qseq set and control can reach + * here with NONE @opss from the above QUEUED case block. + * Explicitly wait on %SCX_OPSS_DISPATCHING instead of @opss. + */ + wait_ops_state(p, SCX_OPSS_DISPATCHING); + BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); + break; + } +} + +static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags) +{ + if (!(p->scx.flags & SCX_TASK_QUEUED)) { + WARN_ON_ONCE(task_runnable(p)); + return; + } + + ops_dequeue(p, deq_flags); + + if (deq_flags & SCX_DEQ_SLEEP) + p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; + else + p->scx.flags &= ~SCX_TASK_DEQD_FOR_SLEEP; + + p->scx.flags &= ~SCX_TASK_QUEUED; + rq->scx.nr_running--; + sub_nr_running(rq, 1); + + dispatch_dequeue(rq, p); +} + +static void yield_task_scx(struct rq *rq) +{ + struct task_struct *p = rq->curr; + + if (SCX_HAS_OP(yield)) + SCX_CALL_OP_RET(SCX_KF_REST, yield, p, NULL); + else + p->scx.slice = 0; +} + +static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) +{ + struct task_struct *from = rq->curr; + + if (SCX_HAS_OP(yield)) + return SCX_CALL_OP_RET(SCX_KF_REST, yield, from, to); + else + return false; +} + +#ifdef CONFIG_SMP +/** + * move_task_to_local_dsq - Move a task from a different rq to a local DSQ + * @rq: rq to move the task into, currently locked + * @p: task to move + * @enq_flags: %SCX_ENQ_* + * + * Move @p which is currently on a different rq to @rq's local DSQ. The caller + * must: + * + * 1. Start with exclusive access to @p either through its DSQ lock or + * %SCX_OPSS_DISPATCHING flag. + * + * 2. Set @p->scx.holding_cpu to raw_smp_processor_id(). + * + * 3. Remember task_rq(@p). Release the exclusive access so that we don't + * deadlock with dequeue. + * + * 4. Lock @rq and the task_rq from #3. + * + * 5. Call this function. + * + * Returns %true if @p was successfully moved. %false after racing dequeue and + * losing. + */ +static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p, + u64 enq_flags) +{ + struct rq *task_rq; + + lockdep_assert_rq_held(rq); + + /* + * If dequeue got to @p while we were trying to lock both rq's, it'd + * have cleared @p->scx.holding_cpu to -1. While other cpus may have + * updated it to different values afterwards, as this operation can't be + * preempted or recurse, @p->scx.holding_cpu can never become + * raw_smp_processor_id() again before we're done. Thus, we can tell + * whether we lost to dequeue by testing whether @p->scx.holding_cpu is + * still raw_smp_processor_id(). + * + * See dispatch_dequeue() for the counterpart. + */ + if (unlikely(p->scx.holding_cpu != raw_smp_processor_id())) + return false; + + /* @p->rq couldn't have changed if we're still the holding cpu */ + task_rq = task_rq(p); + lockdep_assert_rq_held(task_rq); + + WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr)); + deactivate_task(task_rq, p, 0); + set_task_cpu(p, cpu_of(rq)); + p->scx.sticky_cpu = cpu_of(rq); + + /* + * We want to pass scx-specific enq_flags but activate_task() will + * truncate the upper 32 bit. As we own @rq, we can pass them through + * @rq->scx.extra_enq_flags instead. + */ + WARN_ON_ONCE(rq->scx.extra_enq_flags); + rq->scx.extra_enq_flags = enq_flags; + activate_task(rq, p, 0); + rq->scx.extra_enq_flags = 0; + + return true; +} + +/** + * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked + * @rq: current rq which is locked + * @rf: rq_flags to use when unlocking @rq + * @src_rq: rq to move task from + * @dst_rq: rq to move task to + * + * We're holding @rq lock and trying to dispatch a task from @src_rq to + * @dst_rq's local DSQ and thus need to lock both @src_rq and @dst_rq. Whether + * @rq stays locked isn't important as long as the state is restored after + * dispatch_to_local_dsq_unlock(). + */ +static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf, + struct rq *src_rq, struct rq *dst_rq) +{ + rq_unpin_lock(rq, rf); + + if (src_rq == dst_rq) { + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(dst_rq); + } else if (rq == src_rq) { + double_lock_balance(rq, dst_rq); + rq_repin_lock(rq, rf); + } else if (rq == dst_rq) { + double_lock_balance(rq, src_rq); + rq_repin_lock(rq, rf); + } else { + raw_spin_rq_unlock(rq); + double_rq_lock(src_rq, dst_rq); + } +} + +/** + * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock() + * @rq: current rq which is locked + * @rf: rq_flags to use when unlocking @rq + * @src_rq: rq to move task from + * @dst_rq: rq to move task to + * + * Unlock @src_rq and @dst_rq and ensure that @rq is locked on return. + */ +static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf, + struct rq *src_rq, struct rq *dst_rq) +{ + if (src_rq == dst_rq) { + raw_spin_rq_unlock(dst_rq); + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + } else if (rq == src_rq) { + double_unlock_balance(rq, dst_rq); + } else if (rq == dst_rq) { + double_unlock_balance(rq, src_rq); + } else { + double_rq_unlock(src_rq, dst_rq); + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + } +} +#endif /* CONFIG_SMP */ + +static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq, + struct task_struct *p) +{ + lockdep_assert_held(&dsq->lock); /* released on return */ + + /* @dsq is locked and @p is on this rq */ + WARN_ON_ONCE(p->scx.holding_cpu >= 0); + list_move_tail(&p->scx.dsq_node, &rq->scx.local_dsq.list); + dsq_mod_nr(dsq, -1); + dsq_mod_nr(&rq->scx.local_dsq, 1); + p->scx.dsq = &rq->scx.local_dsq; + raw_spin_unlock(&dsq->lock); +} + +#ifdef CONFIG_SMP +/* + * Similar to kernel/sched/core.c::is_cpu_allowed() but we're testing whether @p + * can be pulled to @rq. + */ +static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) +{ + int cpu = cpu_of(rq); + + if (!cpumask_test_cpu(cpu, p->cpus_ptr)) + return false; + if (unlikely(is_migration_disabled(p))) + return false; + if (!(p->flags & PF_KTHREAD) && unlikely(!task_cpu_possible(cpu, p))) + return false; + if (!scx_rq_online(rq)) + return false; + return true; +} + +static bool consume_remote_task(struct rq *rq, struct rq_flags *rf, + struct scx_dispatch_q *dsq, + struct task_struct *p, struct rq *task_rq) +{ + bool moved = false; + + lockdep_assert_held(&dsq->lock); /* released on return */ + + /* + * @dsq is locked and @p is on a remote rq. @p is currently protected by + * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab + * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the + * rq lock or fail, do a little dancing from our side. See + * move_task_to_local_dsq(). + */ + WARN_ON_ONCE(p->scx.holding_cpu >= 0); + list_del_init(&p->scx.dsq_node); + dsq_mod_nr(dsq, -1); + p->scx.holding_cpu = raw_smp_processor_id(); + raw_spin_unlock(&dsq->lock); + + rq_unpin_lock(rq, rf); + double_lock_balance(rq, task_rq); + rq_repin_lock(rq, rf); + + moved = move_task_to_local_dsq(rq, p, 0); + + double_unlock_balance(rq, task_rq); + + return moved; +} +#else /* CONFIG_SMP */ +static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) { return false; } +static bool consume_remote_task(struct rq *rq, struct rq_flags *rf, + struct scx_dispatch_q *dsq, + struct task_struct *p, struct rq *task_rq) { return false; } +#endif /* CONFIG_SMP */ + +static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf, + struct scx_dispatch_q *dsq) +{ + struct task_struct *p; +retry: + if (list_empty(&dsq->list)) + return false; + + raw_spin_lock(&dsq->lock); + + list_for_each_entry(p, &dsq->list, scx.dsq_node) { + struct rq *task_rq = task_rq(p); + + if (rq == task_rq) { + consume_local_task(rq, dsq, p); + return true; + } + + if (task_can_run_on_remote_rq(p, rq)) { + if (likely(consume_remote_task(rq, rf, dsq, p, task_rq))) + return true; + goto retry; + } + } + + raw_spin_unlock(&dsq->lock); + return false; +} + +enum dispatch_to_local_dsq_ret { + DTL_DISPATCHED, /* successfully dispatched */ + DTL_LOST, /* lost race to dequeue */ + DTL_NOT_LOCAL, /* destination is not a local DSQ */ + DTL_INVALID, /* invalid local dsq_id */ +}; + +/** + * dispatch_to_local_dsq - Dispatch a task to a local dsq + * @rq: current rq which is locked + * @rf: rq_flags to use when unlocking @rq + * @dsq_id: destination dsq ID + * @p: task to dispatch + * @enq_flags: %SCX_ENQ_* + * + * We're holding @rq lock and want to dispatch @p to the local DSQ identified by + * @dsq_id. This function performs all the synchronization dancing needed + * because local DSQs are protected with rq locks. + * + * The caller must have exclusive ownership of @p (e.g. through + * %SCX_OPSS_DISPATCHING). + */ +static enum dispatch_to_local_dsq_ret +dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id, + struct task_struct *p, u64 enq_flags) +{ + struct rq *src_rq = task_rq(p); + struct rq *dst_rq; + + /* + * We're synchronized against dequeue through DISPATCHING. As @p can't + * be dequeued, its task_rq and cpus_allowed are stable too. + */ + if (dsq_id == SCX_DSQ_LOCAL) { + dst_rq = rq; + } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { + s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; + + if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict")) + return DTL_INVALID; + dst_rq = cpu_rq(cpu); + } else { + return DTL_NOT_LOCAL; + } + + /* if dispatching to @rq that @p is already on, no lock dancing needed */ + if (rq == src_rq && rq == dst_rq) { + dispatch_enqueue(&dst_rq->scx.local_dsq, p, + enq_flags | SCX_ENQ_CLEAR_OPSS); + return DTL_DISPATCHED; + } + +#ifdef CONFIG_SMP + if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) { + struct rq *locked_dst_rq = dst_rq; + bool dsp; + + /* + * @p is on a possibly remote @src_rq which we need to lock to + * move the task. If dequeue is in progress, it'd be locking + * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq + * lock while holding DISPATCHING. + * + * As DISPATCHING guarantees that @p is wholly ours, we can + * pretend that we're moving from a DSQ and use the same + * mechanism - mark the task under transfer with holding_cpu, + * release DISPATCHING and then follow the same protocol. + */ + p->scx.holding_cpu = raw_smp_processor_id(); + + /* store_release ensures that dequeue sees the above */ + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); + + dispatch_to_local_dsq_lock(rq, rf, src_rq, locked_dst_rq); + + /* + * We don't require the BPF scheduler to avoid dispatching to + * offline CPUs mostly for convenience but also because CPUs can + * go offline between scx_bpf_dispatch() calls and here. If @p + * is destined to an offline CPU, queue it on its current CPU + * instead, which should always be safe. As this is an allowed + * behavior, don't trigger an ops error. + */ + if (!scx_rq_online(dst_rq)) + dst_rq = src_rq; + + if (src_rq == dst_rq) { + /* + * As @p is staying on the same rq, there's no need to + * go through the full deactivate/activate cycle. + * Optimize by abbreviating the operations in + * move_task_to_local_dsq(). + */ + dsp = p->scx.holding_cpu == raw_smp_processor_id(); + if (likely(dsp)) { + p->scx.holding_cpu = -1; + dispatch_enqueue(&dst_rq->scx.local_dsq, p, + enq_flags); + } + } else { + dsp = move_task_to_local_dsq(dst_rq, p, enq_flags); + } + + /* if the destination CPU is idle, wake it up */ + if (dsp && sched_class_above(p->sched_class, + dst_rq->curr->sched_class)) + resched_curr(dst_rq); + + dispatch_to_local_dsq_unlock(rq, rf, src_rq, locked_dst_rq); + + return dsp ? DTL_DISPATCHED : DTL_LOST; + } +#endif /* CONFIG_SMP */ + + scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]", + cpu_of(dst_rq), p->comm, p->pid); + return DTL_INVALID; +} + +/** + * finish_dispatch - Asynchronously finish dispatching a task + * @rq: current rq which is locked + * @rf: rq_flags to use when unlocking @rq + * @p: task to finish dispatching + * @qseq_at_dispatch: qseq when @p started getting dispatched + * @dsq_id: destination DSQ ID + * @enq_flags: %SCX_ENQ_* + * + * Dispatching to local DSQs may need to wait for queueing to complete or + * require rq lock dancing. As we don't wanna do either while inside + * ops.dispatch() to avoid locking order inversion, we split dispatching into + * two parts. scx_bpf_dispatch() which is called by ops.dispatch() records the + * task and its qseq. Once ops.dispatch() returns, this function is called to + * finish up. + * + * There is no guarantee that @p is still valid for dispatching or even that it + * was valid in the first place. Make sure that the task is still owned by the + * BPF scheduler and claim the ownership before dispatching. + */ +static void finish_dispatch(struct rq *rq, struct rq_flags *rf, + struct task_struct *p, + unsigned long qseq_at_dispatch, + u64 dsq_id, u64 enq_flags) +{ + struct scx_dispatch_q *dsq; + unsigned long opss; + +retry: + /* + * No need for _acquire here. @p is accessed only after a successful + * try_cmpxchg to DISPATCHING. + */ + opss = atomic_long_read(&p->scx.ops_state); + + switch (opss & SCX_OPSS_STATE_MASK) { + case SCX_OPSS_DISPATCHING: + case SCX_OPSS_NONE: + /* someone else already got to it */ + return; + case SCX_OPSS_QUEUED: + /* + * If qseq doesn't match, @p has gone through at least one + * dispatch/dequeue and re-enqueue cycle between + * scx_bpf_dispatch() and here and we have no claim on it. + */ + if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch) + return; + + /* + * While we know @p is accessible, we don't yet have a claim on + * it - the BPF scheduler is allowed to dispatch tasks + * spuriously and there can be a racing dequeue attempt. Let's + * claim @p by atomically transitioning it from QUEUED to + * DISPATCHING. + */ + if (likely(atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, + SCX_OPSS_DISPATCHING))) + break; + goto retry; + case SCX_OPSS_QUEUEING: + /* + * do_enqueue_task() is in the process of transferring the task + * to the BPF scheduler while holding @p's rq lock. As we aren't + * holding any kernel or BPF resource that the enqueue path may + * depend upon, it's safe to wait. + */ + wait_ops_state(p, opss); + goto retry; + } + + BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); + + switch (dispatch_to_local_dsq(rq, rf, dsq_id, p, enq_flags)) { + case DTL_DISPATCHED: + break; + case DTL_LOST: + break; + case DTL_INVALID: + dsq_id = SCX_DSQ_GLOBAL; + fallthrough; + case DTL_NOT_LOCAL: + dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()), + dsq_id, p); + dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); + break; + } +} + +static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf) +{ + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); + u32 u; + + for (u = 0; u < dspc->cursor; u++) { + struct scx_dsp_buf_ent *ent = &dspc->buf[u]; + + finish_dispatch(rq, rf, ent->task, ent->qseq, ent->dsq_id, + ent->enq_flags); + } + + dspc->nr_tasks += dspc->cursor; + dspc->cursor = 0; +} + +static int balance_scx(struct rq *rq, struct task_struct *prev, + struct rq_flags *rf) +{ + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); + bool prev_on_scx = prev->sched_class == &ext_sched_class; + + lockdep_assert_rq_held(rq); + + if (prev_on_scx) { + WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP); + update_curr_scx(rq); + + /* + * If @prev is runnable & has slice left, it has priority and + * fetching more just increases latency for the fetched tasks. + * Tell put_prev_task_scx() to put @prev on local_dsq. + * + * See scx_ops_disable_workfn() for the explanation on the + * bypassing test. + */ + if ((prev->scx.flags & SCX_TASK_QUEUED) && + prev->scx.slice && !scx_ops_bypassing()) { + prev->scx.flags |= SCX_TASK_BAL_KEEP; + return 1; + } + } + + /* if there already are tasks to run, nothing to do */ + if (rq->scx.local_dsq.nr) + return 1; + + if (consume_dispatch_q(rq, rf, &scx_dsq_global)) + return 1; + + if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq)) + return 0; + + dspc->rq = rq; + dspc->rf = rf; + + /* + * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, + * the local DSQ might still end up empty after a successful + * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() + * produced some tasks, retry. The BPF scheduler may depend on this + * looping behavior to simplify its implementation. + */ + do { + dspc->nr_tasks = 0; + + SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq), + prev_on_scx ? prev : NULL); + + flush_dispatch_buf(rq, rf); + + if (rq->scx.local_dsq.nr) + return 1; + if (consume_dispatch_q(rq, rf, &scx_dsq_global)) + return 1; + } while (dspc->nr_tasks); + + return 0; +} + +static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) +{ + if (p->scx.flags & SCX_TASK_QUEUED) { + WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); + dispatch_dequeue(rq, p); + } + + p->se.exec_start = rq_clock_task(rq); + + clr_task_runnable(p); +} + +static void put_prev_task_scx(struct rq *rq, struct task_struct *p) +{ +#ifndef CONFIG_SMP + /* + * UP workaround. + * + * Because SCX may transfer tasks across CPUs during dispatch, dispatch + * is performed from its balance operation which isn't called in UP. + * Let's work around by calling it from the operations which come right + * after. + * + * 1. If the prev task is on SCX, pick_next_task() calls + * .put_prev_task() right after. As .put_prev_task() is also called + * from other places, we need to distinguish the calls which can be + * done by looking at the previous task's state - if still queued or + * dequeued with %SCX_DEQ_SLEEP, the caller must be pick_next_task(). + * This case is handled here. + * + * 2. If the prev task is not on SCX, the first following call into SCX + * will be .pick_next_task(), which is covered by calling + * balance_scx() from pick_next_task_scx(). + * + * Note that we can't merge the first case into the second as + * balance_scx() must be called before the previous SCX task goes + * through put_prev_task_scx(). + * + * As UP doesn't transfer tasks around, balance_scx() doesn't need @rf. + * Pass in %NULL. + */ + if (p->scx.flags & (SCX_TASK_QUEUED | SCX_TASK_DEQD_FOR_SLEEP)) + balance_scx(rq, p, NULL); +#endif + + update_curr_scx(rq); + + /* + * If we're being called from put_prev_task_balance(), balance_scx() may + * have decided that @p should keep running. + */ + if (p->scx.flags & SCX_TASK_BAL_KEEP) { + p->scx.flags &= ~SCX_TASK_BAL_KEEP; + set_task_runnable(rq, p); + dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD); + return; + } + + if (p->scx.flags & SCX_TASK_QUEUED) { + set_task_runnable(rq, p); + + /* + * If @p has slice left and balance_scx() didn't tag it for + * keeping, @p is getting preempted by a higher priority + * scheduler class. Leave it at the head of the local DSQ. + */ + if (p->scx.slice && !scx_ops_bypassing()) { + dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD); + return; + } + + /* + * If we're in the pick_next_task path, balance_scx() should + * have already populated the local DSQ if there are any other + * available tasks. If empty, tell ops.enqueue() that @p is the + * only one available for this cpu. ops.enqueue() should put it + * on the local DSQ so that the subsequent pick_next_task_scx() + * can find the task unless it wants to trigger a separate + * follow-up scheduling event. + */ + if (list_empty(&rq->scx.local_dsq.list)) + do_enqueue_task(rq, p, SCX_ENQ_LAST, -1); + else + do_enqueue_task(rq, p, 0, -1); + } +} + +static struct task_struct *first_local_task(struct rq *rq) +{ + return list_first_entry_or_null(&rq->scx.local_dsq.list, + struct task_struct, scx.dsq_node); +} + +static struct task_struct *pick_next_task_scx(struct rq *rq) +{ + struct task_struct *p; + +#ifndef CONFIG_SMP + /* UP workaround - see the comment at the head of put_prev_task_scx() */ + if (unlikely(rq->curr->sched_class != &ext_sched_class)) + balance_scx(rq, rq->curr, NULL); +#endif + + p = first_local_task(rq); + if (!p) + return NULL; + + set_next_task_scx(rq, p, true); + + if (unlikely(!p->scx.slice)) { + if (!scx_ops_bypassing() && !scx_warned_zero_slice) { + printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n", + p->comm, p->pid); + scx_warned_zero_slice = true; + } + p->scx.slice = SCX_SLICE_DFL; + } + + return p; +} + +#ifdef CONFIG_SMP + +static bool test_and_clear_cpu_idle(int cpu) +{ +#ifdef CONFIG_SCHED_SMT + /* + * SMT mask should be cleared whether we can claim @cpu or not. The SMT + * cluster is not wholly idle either way. This also prevents + * scx_pick_idle_cpu() from getting caught in an infinite loop. + */ + if (sched_smt_active()) { + const struct cpumask *smt = cpu_smt_mask(cpu); + + /* + * If offline, @cpu is not its own sibling and + * scx_pick_idle_cpu() can get caught in an infinite loop as + * @cpu is never cleared from idle_masks.smt. Ensure that @cpu + * is eventually cleared. + */ + if (cpumask_intersects(smt, idle_masks.smt)) + cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); + else if (cpumask_test_cpu(cpu, idle_masks.smt)) + __cpumask_clear_cpu(cpu, idle_masks.smt); + } +#endif + return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); +} + +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) +{ + int cpu; + +retry: + if (sched_smt_active()) { + cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); + if (cpu < nr_cpu_ids) + goto found; + + if (flags & SCX_PICK_IDLE_CORE) + return -EBUSY; + } + + cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); + if (cpu >= nr_cpu_ids) + return -EBUSY; + +found: + if (test_and_clear_cpu_idle(cpu)) + return cpu; + else + goto retry; +} + +static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, + u64 wake_flags, bool *found) +{ + s32 cpu; + + *found = false; + + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return prev_cpu; + } + + /* + * If WAKE_SYNC, the waker's local DSQ is empty, and the system is + * under utilized, wake up @p to the local DSQ of the waker. Checking + * only for an empty local DSQ is insufficient as it could give the + * wakee an unfair advantage when the system is oversaturated. + * Checking only for the presence of idle CPUs is also insufficient as + * the local DSQ of the waker could have tasks piled up on it even if + * there is an idle core elsewhere on the system. + */ + cpu = smp_processor_id(); + if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 && + !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING) && + cpu_rq(cpu)->scx.local_dsq.nr == 0) { + if (cpumask_test_cpu(cpu, p->cpus_ptr)) + goto cpu_found; + } + + if (p->nr_cpus_allowed == 1) { + if (test_and_clear_cpu_idle(prev_cpu)) { + cpu = prev_cpu; + goto cpu_found; + } else { + return prev_cpu; + } + } + + /* + * If CPU has SMT, any wholly idle CPU is likely a better pick than + * partially idle @prev_cpu. + */ + if (sched_smt_active()) { + if (cpumask_test_cpu(prev_cpu, idle_masks.smt) && + test_and_clear_cpu_idle(prev_cpu)) { + cpu = prev_cpu; + goto cpu_found; + } + + cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE); + if (cpu >= 0) + goto cpu_found; + } + + if (test_and_clear_cpu_idle(prev_cpu)) { + cpu = prev_cpu; + goto cpu_found; + } + + cpu = scx_pick_idle_cpu(p->cpus_ptr, 0); + if (cpu >= 0) + goto cpu_found; + + return prev_cpu; + +cpu_found: + *found = true; + return cpu; +} + +static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags) +{ + /* + * sched_exec() calls with %WF_EXEC when @p is about to exec(2) as it + * can be a good migration opportunity with low cache and memory + * footprint. Returning a CPU different than @prev_cpu triggers + * immediate rq migration. However, for SCX, as the current rq + * association doesn't dictate where the task is going to run, this + * doesn't fit well. If necessary, we can later add a dedicated method + * which can decide to preempt self to force it through the regular + * scheduling path. + */ + if (unlikely(wake_flags & WF_EXEC)) + return prev_cpu; + + if (SCX_HAS_OP(select_cpu)) { + s32 cpu; + struct task_struct **ddsp_taskp; + + ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); + WARN_ON_ONCE(*ddsp_taskp); + *ddsp_taskp = p; + + cpu = SCX_CALL_OP_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, + select_cpu, p, prev_cpu, wake_flags); + *ddsp_taskp = NULL; + if (ops_cpu_valid(cpu, "from ops.select_cpu()")) + return cpu; + else + return prev_cpu; + } else { + bool found; + s32 cpu; + + cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found); + if (found) { + p->scx.slice = SCX_SLICE_DFL; + p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL; + } + return cpu; + } +} + +static void set_cpus_allowed_scx(struct task_struct *p, + struct affinity_context *ac) +{ + set_cpus_allowed_common(p, ac); + + /* + * The effective cpumask is stored in @p->cpus_ptr which may temporarily + * differ from the configured one in @p->cpus_mask. Always tell the bpf + * scheduler the effective one. + * + * Fine-grained memory write control is enforced by BPF making the const + * designation pointless. Cast it away when calling the operation. + */ + if (SCX_HAS_OP(set_cpumask)) + SCX_CALL_OP(SCX_KF_REST, set_cpumask, p, + (struct cpumask *)p->cpus_ptr); +} + +static void reset_idle_masks(void) +{ + /* + * Consider all online cpus idle. Should converge to the actual state + * quickly. + */ + cpumask_copy(idle_masks.cpu, cpu_online_mask); + cpumask_copy(idle_masks.smt, cpu_online_mask); +} + +void __scx_update_idle(struct rq *rq, bool idle) +{ + int cpu = cpu_of(rq); + + if (SCX_HAS_OP(update_idle)) { + SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle); + if (!static_branch_unlikely(&scx_builtin_idle_enabled)) + return; + } + + if (idle) + cpumask_set_cpu(cpu, idle_masks.cpu); + else + cpumask_clear_cpu(cpu, idle_masks.cpu); + +#ifdef CONFIG_SCHED_SMT + if (sched_smt_active()) { + const struct cpumask *smt = cpu_smt_mask(cpu); + + if (idle) { + /* + * idle_masks.smt handling is racy but that's fine as + * it's only for optimization and self-correcting. + */ + for_each_cpu(cpu, smt) { + if (!cpumask_test_cpu(cpu, idle_masks.cpu)) + return; + } + cpumask_or(idle_masks.smt, idle_masks.smt, smt); + } else { + cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); + } + } +#endif +} + +#else /* CONFIG_SMP */ + +static bool test_and_clear_cpu_idle(int cpu) { return false; } +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; } +static void reset_idle_masks(void) {} + +#endif /* CONFIG_SMP */ + +static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) +{ + update_other_load_avgs(rq); + update_curr_scx(rq); + + /* + * While bypassing, always resched as we can't trust the slice + * management. + */ + if (scx_ops_bypassing()) + curr->scx.slice = 0; + else if (SCX_HAS_OP(tick)) + SCX_CALL_OP(SCX_KF_REST, tick, curr); + + if (!curr->scx.slice) + resched_curr(rq); +} + +static enum scx_task_state scx_get_task_state(const struct task_struct *p) +{ + return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT; +} + +static void scx_set_task_state(struct task_struct *p, enum scx_task_state state) +{ + enum scx_task_state prev_state = scx_get_task_state(p); + bool warn = false; + + BUILD_BUG_ON(SCX_TASK_NR_STATES > (1 << SCX_TASK_STATE_BITS)); + + switch (state) { + case SCX_TASK_NONE: + break; + case SCX_TASK_INIT: + warn = prev_state != SCX_TASK_NONE; + break; + case SCX_TASK_READY: + warn = prev_state == SCX_TASK_NONE; + break; + case SCX_TASK_ENABLED: + warn = prev_state != SCX_TASK_READY; + break; + default: + warn = true; + return; + } + + WARN_ONCE(warn, "sched_ext: Invalid task state transition %d -> %d for %s[%d]", + prev_state, state, p->comm, p->pid); + + p->scx.flags &= ~SCX_TASK_STATE_MASK; + p->scx.flags |= state << SCX_TASK_STATE_SHIFT; +} + +static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool fork) +{ + int ret; + + if (SCX_HAS_OP(init_task)) { + struct scx_init_task_args args = { + .fork = fork, + }; + + ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, init_task, p, &args); + if (unlikely(ret)) { + ret = ops_sanitize_err("init_task", ret); + return ret; + } + } + + scx_set_task_state(p, SCX_TASK_INIT); + + return 0; +} + +static void set_task_scx_weight(struct task_struct *p) +{ + u32 weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO]; + + p->scx.weight = sched_weight_to_cgroup(weight); +} + +static void scx_ops_enable_task(struct task_struct *p) +{ + lockdep_assert_rq_held(task_rq(p)); + + /* + * Set the weight before calling ops.enable() so that the scheduler + * doesn't see a stale value if they inspect the task struct. + */ + set_task_scx_weight(p); + if (SCX_HAS_OP(enable)) + SCX_CALL_OP(SCX_KF_REST, enable, p); + scx_set_task_state(p, SCX_TASK_ENABLED); + + if (SCX_HAS_OP(set_weight)) + SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight); +} + +static void scx_ops_disable_task(struct task_struct *p) +{ + lockdep_assert_rq_held(task_rq(p)); + WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED); + + if (SCX_HAS_OP(disable)) + SCX_CALL_OP(SCX_KF_REST, disable, p); + scx_set_task_state(p, SCX_TASK_READY); +} + +static void scx_ops_exit_task(struct task_struct *p) +{ + struct scx_exit_task_args args = { + .cancelled = false, + }; + + lockdep_assert_rq_held(task_rq(p)); + + switch (scx_get_task_state(p)) { + case SCX_TASK_NONE: + return; + case SCX_TASK_INIT: + args.cancelled = true; + break; + case SCX_TASK_READY: + break; + case SCX_TASK_ENABLED: + scx_ops_disable_task(p); + break; + default: + WARN_ON_ONCE(true); + return; + } + + if (SCX_HAS_OP(exit_task)) + SCX_CALL_OP(SCX_KF_REST, exit_task, p, &args); + scx_set_task_state(p, SCX_TASK_NONE); +} + +void init_scx_entity(struct sched_ext_entity *scx) +{ + /* + * init_idle() calls this function again after fork sequence is + * complete. Don't touch ->tasks_node as it's already linked. + */ + memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node)); + + INIT_LIST_HEAD(&scx->dsq_node); + scx->sticky_cpu = -1; + scx->holding_cpu = -1; + INIT_LIST_HEAD(&scx->runnable_node); + scx->ddsp_dsq_id = SCX_DSQ_INVALID; + scx->slice = SCX_SLICE_DFL; +} + +void scx_pre_fork(struct task_struct *p) +{ + /* + * BPF scheduler enable/disable paths want to be able to iterate and + * update all tasks which can become complex when racing forks. As + * enable/disable are very cold paths, let's use a percpu_rwsem to + * exclude forks. + */ + percpu_down_read(&scx_fork_rwsem); +} + +int scx_fork(struct task_struct *p) +{ + percpu_rwsem_assert_held(&scx_fork_rwsem); + + if (scx_enabled()) + return scx_ops_init_task(p, task_group(p), true); + else + return 0; +} + +void scx_post_fork(struct task_struct *p) +{ + if (scx_enabled()) { + scx_set_task_state(p, SCX_TASK_READY); + + /* + * Enable the task immediately if it's running on sched_ext. + * Otherwise, it'll be enabled in switching_to_scx() if and + * when it's ever configured to run with a SCHED_EXT policy. + */ + if (p->sched_class == &ext_sched_class) { + struct rq_flags rf; + struct rq *rq; + + rq = task_rq_lock(p, &rf); + scx_ops_enable_task(p); + task_rq_unlock(rq, p, &rf); + } + } + + spin_lock_irq(&scx_tasks_lock); + list_add_tail(&p->scx.tasks_node, &scx_tasks); + spin_unlock_irq(&scx_tasks_lock); + + percpu_up_read(&scx_fork_rwsem); +} + +void scx_cancel_fork(struct task_struct *p) +{ + if (scx_enabled()) { + struct rq *rq; + struct rq_flags rf; + + rq = task_rq_lock(p, &rf); + WARN_ON_ONCE(scx_get_task_state(p) >= SCX_TASK_READY); + scx_ops_exit_task(p); + task_rq_unlock(rq, p, &rf); + } + + percpu_up_read(&scx_fork_rwsem); +} + +void sched_ext_free(struct task_struct *p) +{ + unsigned long flags; + + spin_lock_irqsave(&scx_tasks_lock, flags); + list_del_init(&p->scx.tasks_node); + spin_unlock_irqrestore(&scx_tasks_lock, flags); + + /* + * @p is off scx_tasks and wholly ours. scx_ops_enable()'s READY -> + * ENABLED transitions can't race us. Disable ops for @p. + */ + if (scx_get_task_state(p) != SCX_TASK_NONE) { + struct rq_flags rf; + struct rq *rq; + + rq = task_rq_lock(p, &rf); + scx_ops_exit_task(p); + task_rq_unlock(rq, p, &rf); + } +} + +static void reweight_task_scx(struct rq *rq, struct task_struct *p, const struct load_weight *lw) +{ + lockdep_assert_rq_held(task_rq(p)); + + set_task_scx_weight(p); + if (SCX_HAS_OP(set_weight)) + SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight); +} + +static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio) +{ +} + +static void switching_to_scx(struct rq *rq, struct task_struct *p) +{ + scx_ops_enable_task(p); + + /* + * set_cpus_allowed_scx() is not called while @p is associated with a + * different scheduler class. Keep the BPF scheduler up-to-date. + */ + if (SCX_HAS_OP(set_cpumask)) + SCX_CALL_OP(SCX_KF_REST, set_cpumask, p, + (struct cpumask *)p->cpus_ptr); +} + +static void switched_from_scx(struct rq *rq, struct task_struct *p) +{ + scx_ops_disable_task(p); +} + +static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} +static void switched_to_scx(struct rq *rq, struct task_struct *p) {} + +/* + * Omitted operations: + * + * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task + * isn't tied to the CPU at that point. + * + * - migrate_task_rq: Unnecessary as task to cpu mapping is transient. + * + * - task_fork/dead: We need fork/dead notifications for all tasks regardless of + * their current sched_class. Call them directly from sched core instead. + * + * - task_woken: Unnecessary. + */ +DEFINE_SCHED_CLASS(ext) = { + .enqueue_task = enqueue_task_scx, + .dequeue_task = dequeue_task_scx, + .yield_task = yield_task_scx, + .yield_to_task = yield_to_task_scx, + + .wakeup_preempt = wakeup_preempt_scx, + + .pick_next_task = pick_next_task_scx, + + .put_prev_task = put_prev_task_scx, + .set_next_task = set_next_task_scx, + +#ifdef CONFIG_SMP + .balance = balance_scx, + .select_task_rq = select_task_rq_scx, + .set_cpus_allowed = set_cpus_allowed_scx, +#endif + + .task_tick = task_tick_scx, + + .switching_to = switching_to_scx, + .switched_from = switched_from_scx, + .switched_to = switched_to_scx, + .reweight_task = reweight_task_scx, + .prio_changed = prio_changed_scx, + + .update_curr = update_curr_scx, + +#ifdef CONFIG_UCLAMP_TASK + .uclamp_enabled = 0, +#endif +}; + +static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id) +{ + memset(dsq, 0, sizeof(*dsq)); + + raw_spin_lock_init(&dsq->lock); + INIT_LIST_HEAD(&dsq->list); + dsq->id = dsq_id; +} + +static struct scx_dispatch_q *create_dsq(u64 dsq_id, int node) +{ + struct scx_dispatch_q *dsq; + int ret; + + if (dsq_id & SCX_DSQ_FLAG_BUILTIN) + return ERR_PTR(-EINVAL); + + dsq = kmalloc_node(sizeof(*dsq), GFP_KERNEL, node); + if (!dsq) + return ERR_PTR(-ENOMEM); + + init_dsq(dsq, dsq_id); + + ret = rhashtable_insert_fast(&dsq_hash, &dsq->hash_node, + dsq_hash_params); + if (ret) { + kfree(dsq); + return ERR_PTR(ret); + } + return dsq; +} + +static void free_dsq_irq_workfn(struct irq_work *irq_work) +{ + struct llist_node *to_free = llist_del_all(&dsqs_to_free); + struct scx_dispatch_q *dsq, *tmp_dsq; + + llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node) + kfree_rcu(dsq, rcu); +} + +static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn); + +static void destroy_dsq(u64 dsq_id) +{ + struct scx_dispatch_q *dsq; + unsigned long flags; + + rcu_read_lock(); + + dsq = find_user_dsq(dsq_id); + if (!dsq) + goto out_unlock_rcu; + + raw_spin_lock_irqsave(&dsq->lock, flags); + + if (dsq->nr) { + scx_ops_error("attempting to destroy in-use dsq 0x%016llx (nr=%u)", + dsq->id, dsq->nr); + goto out_unlock_dsq; + } + + if (rhashtable_remove_fast(&dsq_hash, &dsq->hash_node, dsq_hash_params)) + goto out_unlock_dsq; + + /* + * Mark dead by invalidating ->id to prevent dispatch_enqueue() from + * queueing more tasks. As this function can be called from anywhere, + * freeing is bounced through an irq work to avoid nesting RCU + * operations inside scheduler locks. + */ + dsq->id = SCX_DSQ_INVALID; + llist_add(&dsq->free_node, &dsqs_to_free); + irq_work_queue(&free_dsq_irq_work); + +out_unlock_dsq: + raw_spin_unlock_irqrestore(&dsq->lock, flags); +out_unlock_rcu: + rcu_read_unlock(); +} + + +/******************************************************************************** + * Sysfs interface and ops enable/disable. + */ + +#define SCX_ATTR(_name) \ + static struct kobj_attribute scx_attr_##_name = { \ + .attr = { .name = __stringify(_name), .mode = 0444 }, \ + .show = scx_attr_##_name##_show, \ + } + +static ssize_t scx_attr_state_show(struct kobject *kobj, + struct kobj_attribute *ka, char *buf) +{ + return sysfs_emit(buf, "%s\n", + scx_ops_enable_state_str[scx_ops_enable_state()]); +} +SCX_ATTR(state); + +static ssize_t scx_attr_switch_all_show(struct kobject *kobj, + struct kobj_attribute *ka, char *buf) +{ + return sysfs_emit(buf, "%d\n", READ_ONCE(scx_switching_all)); +} +SCX_ATTR(switch_all); + +static struct attribute *scx_global_attrs[] = { + &scx_attr_state.attr, + &scx_attr_switch_all.attr, + NULL, +}; + +static const struct attribute_group scx_global_attr_group = { + .attrs = scx_global_attrs, +}; + +static void scx_kobj_release(struct kobject *kobj) +{ + kfree(kobj); +} + +static ssize_t scx_attr_ops_show(struct kobject *kobj, + struct kobj_attribute *ka, char *buf) +{ + return sysfs_emit(buf, "%s\n", scx_ops.name); +} +SCX_ATTR(ops); + +static struct attribute *scx_sched_attrs[] = { + &scx_attr_ops.attr, + NULL, +}; +ATTRIBUTE_GROUPS(scx_sched); + +static const struct kobj_type scx_ktype = { + .release = scx_kobj_release, + .sysfs_ops = &kobj_sysfs_ops, + .default_groups = scx_sched_groups, +}; + +static int scx_uevent(const struct kobject *kobj, struct kobj_uevent_env *env) +{ + return add_uevent_var(env, "SCXOPS=%s", scx_ops.name); +} + +static const struct kset_uevent_ops scx_uevent_ops = { + .uevent = scx_uevent, +}; + +/* + * Used by sched_fork() and __setscheduler_prio() to pick the matching + * sched_class. dl/rt are already handled. + */ +bool task_should_scx(struct task_struct *p) +{ + if (!scx_enabled() || + unlikely(scx_ops_enable_state() == SCX_OPS_DISABLING)) + return false; + if (READ_ONCE(scx_switching_all)) + return true; + return p->policy == SCHED_EXT; +} + +/** + * scx_ops_bypass - [Un]bypass scx_ops and guarantee forward progress + * + * Bypassing guarantees that all runnable tasks make forward progress without + * trusting the BPF scheduler. We can't grab any mutexes or rwsems as they might + * be held by tasks that the BPF scheduler is forgetting to run, which + * unfortunately also excludes toggling the static branches. + * + * Let's work around by overriding a couple ops and modifying behaviors based on + * the DISABLING state and then cycling the queued tasks through dequeue/enqueue + * to force global FIFO scheduling. + * + * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order. + * + * b. ops.dispatch() is ignored. + * + * c. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value can't be + * trusted. Whenever a tick triggers, the running task is rotated to the tail + * of the queue. + * + * d. pick_next_task() suppresses zero slice warning. + */ +static void scx_ops_bypass(bool bypass) +{ + int depth, cpu; + + if (bypass) { + depth = atomic_inc_return(&scx_ops_bypass_depth); + WARN_ON_ONCE(depth <= 0); + if (depth != 1) + return; + } else { + depth = atomic_dec_return(&scx_ops_bypass_depth); + WARN_ON_ONCE(depth < 0); + if (depth != 0) + return; + } + + /* + * We need to guarantee that no tasks are on the BPF scheduler while + * bypassing. Either we see enabled or the enable path sees the + * increased bypass_depth before moving tasks to SCX. + */ + if (!scx_enabled()) + return; + + /* + * No task property is changing. We just need to make sure all currently + * queued tasks are re-queued according to the new scx_ops_bypassing() + * state. As an optimization, walk each rq's runnable_list instead of + * the scx_tasks list. + * + * This function can't trust the scheduler and thus can't use + * cpus_read_lock(). Walk all possible CPUs instead of online. + */ + for_each_possible_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); + struct rq_flags rf; + struct task_struct *p, *n; + + rq_lock_irqsave(rq, &rf); + + /* + * The use of list_for_each_entry_safe_reverse() is required + * because each task is going to be removed from and added back + * to the runnable_list during iteration. Because they're added + * to the tail of the list, safe reverse iteration can still + * visit all nodes. + */ + list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list, + scx.runnable_node) { + struct sched_enq_and_set_ctx ctx; + + /* cycling deq/enq is enough, see the function comment */ + sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx); + sched_enq_and_set_task(&ctx); + } + + rq_unlock_irqrestore(rq, &rf); + } +} + +static void free_exit_info(struct scx_exit_info *ei) +{ + kfree(ei->msg); + kfree(ei->bt); + kfree(ei); +} + +static struct scx_exit_info *alloc_exit_info(void) +{ + struct scx_exit_info *ei; + + ei = kzalloc(sizeof(*ei), GFP_KERNEL); + if (!ei) + return NULL; + + ei->bt = kcalloc(sizeof(ei->bt[0]), SCX_EXIT_BT_LEN, GFP_KERNEL); + ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL); + + if (!ei->bt || !ei->msg) { + free_exit_info(ei); + return NULL; + } + + return ei; +} + +static const char *scx_exit_reason(enum scx_exit_kind kind) +{ + switch (kind) { + case SCX_EXIT_UNREG: + return "Scheduler unregistered from user space"; + case SCX_EXIT_UNREG_BPF: + return "Scheduler unregistered from BPF"; + case SCX_EXIT_UNREG_KERN: + return "Scheduler unregistered from the main kernel"; + case SCX_EXIT_ERROR: + return "runtime error"; + case SCX_EXIT_ERROR_BPF: + return "scx_bpf_error"; + default: + return "<UNKNOWN>"; + } +} + +static void scx_ops_disable_workfn(struct kthread_work *work) +{ + struct scx_exit_info *ei = scx_exit_info; + struct scx_task_iter sti; + struct task_struct *p; + struct rhashtable_iter rht_iter; + struct scx_dispatch_q *dsq; + int i, kind; + + kind = atomic_read(&scx_exit_kind); + while (true) { + /* + * NONE indicates that a new scx_ops has been registered since + * disable was scheduled - don't kill the new ops. DONE + * indicates that the ops has already been disabled. + */ + if (kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE) + return; + if (atomic_try_cmpxchg(&scx_exit_kind, &kind, SCX_EXIT_DONE)) + break; + } + ei->kind = kind; + ei->reason = scx_exit_reason(ei->kind); + + /* guarantee forward progress by bypassing scx_ops */ + scx_ops_bypass(true); + + switch (scx_ops_set_enable_state(SCX_OPS_DISABLING)) { + case SCX_OPS_DISABLING: + WARN_ONCE(true, "sched_ext: duplicate disabling instance?"); + break; + case SCX_OPS_DISABLED: + pr_warn("sched_ext: ops error detected without ops (%s)\n", + scx_exit_info->msg); + WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) != + SCX_OPS_DISABLING); + goto done; + default: + break; + } + + /* + * Here, every runnable task is guaranteed to make forward progress and + * we can safely use blocking synchronization constructs. Actually + * disable ops. + */ + mutex_lock(&scx_ops_enable_mutex); + + static_branch_disable(&__scx_switched_all); + WRITE_ONCE(scx_switching_all, false); + + /* + * Avoid racing against fork. See scx_ops_enable() for explanation on + * the locking order. + */ + percpu_down_write(&scx_fork_rwsem); + cpus_read_lock(); + + spin_lock_irq(&scx_tasks_lock); + scx_task_iter_init(&sti); + /* + * Invoke scx_ops_exit_task() on all non-idle tasks, including + * TASK_DEAD tasks. Because dead tasks may have a nonzero refcount, + * we may not have invoked sched_ext_free() on them by the time a + * scheduler is disabled. We must therefore exit the task here, or we'd + * fail to invoke ops.exit_task(), as the scheduler will have been + * unloaded by the time the task is subsequently exited on the + * sched_ext_free() path. + */ + while ((p = scx_task_iter_next_locked(&sti, true))) { + const struct sched_class *old_class = p->sched_class; + struct sched_enq_and_set_ctx ctx; + + if (READ_ONCE(p->__state) != TASK_DEAD) { + sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, + &ctx); + + p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL); + __setscheduler_prio(p, p->prio); + check_class_changing(task_rq(p), p, old_class); + + sched_enq_and_set_task(&ctx); + + check_class_changed(task_rq(p), p, old_class, p->prio); + } + scx_ops_exit_task(p); + } + scx_task_iter_exit(&sti); + spin_unlock_irq(&scx_tasks_lock); + + /* no task is on scx, turn off all the switches and flush in-progress calls */ + static_branch_disable_cpuslocked(&__scx_ops_enabled); + for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++) + static_branch_disable_cpuslocked(&scx_has_op[i]); + static_branch_disable_cpuslocked(&scx_ops_enq_last); + static_branch_disable_cpuslocked(&scx_ops_enq_exiting); + static_branch_disable_cpuslocked(&scx_builtin_idle_enabled); + synchronize_rcu(); + + cpus_read_unlock(); + percpu_up_write(&scx_fork_rwsem); + + if (ei->kind >= SCX_EXIT_ERROR) { + printk(KERN_ERR "sched_ext: BPF scheduler \"%s\" errored, disabling\n", scx_ops.name); + + if (ei->msg[0] == '\0') + printk(KERN_ERR "sched_ext: %s\n", ei->reason); + else + printk(KERN_ERR "sched_ext: %s (%s)\n", ei->reason, ei->msg); + + stack_trace_print(ei->bt, ei->bt_len, 2); + } + + if (scx_ops.exit) + SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei); + + /* + * Delete the kobject from the hierarchy eagerly in addition to just + * dropping a reference. Otherwise, if the object is deleted + * asynchronously, sysfs could observe an object of the same name still + * in the hierarchy when another scheduler is loaded. + */ + kobject_del(scx_root_kobj); + kobject_put(scx_root_kobj); + scx_root_kobj = NULL; + + memset(&scx_ops, 0, sizeof(scx_ops)); + + rhashtable_walk_enter(&dsq_hash, &rht_iter); + do { + rhashtable_walk_start(&rht_iter); + + while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq)) + destroy_dsq(dsq->id); + + rhashtable_walk_stop(&rht_iter); + } while (dsq == ERR_PTR(-EAGAIN)); + rhashtable_walk_exit(&rht_iter); + + free_percpu(scx_dsp_ctx); + scx_dsp_ctx = NULL; + scx_dsp_max_batch = 0; + + free_exit_info(scx_exit_info); + scx_exit_info = NULL; + + mutex_unlock(&scx_ops_enable_mutex); + + WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) != + SCX_OPS_DISABLING); +done: + scx_ops_bypass(false); +} + +static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn); + +static void schedule_scx_ops_disable_work(void) +{ + struct kthread_worker *helper = READ_ONCE(scx_ops_helper); + + /* + * We may be called spuriously before the first bpf_sched_ext_reg(). If + * scx_ops_helper isn't set up yet, there's nothing to do. + */ + if (helper) + kthread_queue_work(helper, &scx_ops_disable_work); +} + +static void scx_ops_disable(enum scx_exit_kind kind) +{ + int none = SCX_EXIT_NONE; + + if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)) + kind = SCX_EXIT_ERROR; + + atomic_try_cmpxchg(&scx_exit_kind, &none, kind); + + schedule_scx_ops_disable_work(); +} + +static void scx_ops_error_irq_workfn(struct irq_work *irq_work) +{ + schedule_scx_ops_disable_work(); +} + +static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn); + +static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, + s64 exit_code, + const char *fmt, ...) +{ + struct scx_exit_info *ei = scx_exit_info; + int none = SCX_EXIT_NONE; + va_list args; + + if (!atomic_try_cmpxchg(&scx_exit_kind, &none, kind)) + return; + + ei->exit_code = exit_code; + + if (kind >= SCX_EXIT_ERROR) + ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); + + va_start(args, fmt); + vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args); + va_end(args); + + irq_work_queue(&scx_ops_error_irq_work); +} + +static struct kthread_worker *scx_create_rt_helper(const char *name) +{ + struct kthread_worker *helper; + + helper = kthread_create_worker(0, name); + if (helper) + sched_set_fifo(helper->task); + return helper; +} + +static int validate_ops(const struct sched_ext_ops *ops) +{ + /* + * It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the + * ops.enqueue() callback isn't implemented. + */ + if ((ops->flags & SCX_OPS_ENQ_LAST) && !ops->enqueue) { + scx_ops_error("SCX_OPS_ENQ_LAST requires ops.enqueue() to be implemented"); + return -EINVAL; + } + + return 0; +} + +static int scx_ops_enable(struct sched_ext_ops *ops) +{ + struct scx_task_iter sti; + struct task_struct *p; + int i, ret; + + mutex_lock(&scx_ops_enable_mutex); + + if (!scx_ops_helper) { + WRITE_ONCE(scx_ops_helper, + scx_create_rt_helper("sched_ext_ops_helper")); + if (!scx_ops_helper) { + ret = -ENOMEM; + goto err_unlock; + } + } + + if (scx_ops_enable_state() != SCX_OPS_DISABLED) { + ret = -EBUSY; + goto err_unlock; + } + + scx_root_kobj = kzalloc(sizeof(*scx_root_kobj), GFP_KERNEL); + if (!scx_root_kobj) { + ret = -ENOMEM; + goto err_unlock; + } + + scx_root_kobj->kset = scx_kset; + ret = kobject_init_and_add(scx_root_kobj, &scx_ktype, NULL, "root"); + if (ret < 0) + goto err; + + scx_exit_info = alloc_exit_info(); + if (!scx_exit_info) { + ret = -ENOMEM; + goto err_del; + } + + /* + * Set scx_ops, transition to PREPPING and clear exit info to arm the + * disable path. Failure triggers full disabling from here on. + */ + scx_ops = *ops; + + WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_PREPPING) != + SCX_OPS_DISABLED); + + atomic_set(&scx_exit_kind, SCX_EXIT_NONE); + scx_warned_zero_slice = false; + + /* + * Keep CPUs stable during enable so that the BPF scheduler can track + * online CPUs by watching ->on/offline_cpu() after ->init(). + */ + cpus_read_lock(); + + if (scx_ops.init) { + ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, init); + if (ret) { + ret = ops_sanitize_err("init", ret); + goto err_disable_unlock_cpus; + } + } + + cpus_read_unlock(); + + ret = validate_ops(ops); + if (ret) + goto err_disable; + + WARN_ON_ONCE(scx_dsp_ctx); + scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH; + scx_dsp_ctx = __alloc_percpu(struct_size_t(struct scx_dsp_ctx, buf, + scx_dsp_max_batch), + __alignof__(struct scx_dsp_ctx)); + if (!scx_dsp_ctx) { + ret = -ENOMEM; + goto err_disable; + } + + /* + * Lock out forks before opening the floodgate so that they don't wander + * into the operations prematurely. + * + * We don't need to keep the CPUs stable but grab cpus_read_lock() to + * ease future locking changes for cgroup suport. + * + * Note that cpu_hotplug_lock must nest inside scx_fork_rwsem due to the + * following dependency chain: + * + * scx_fork_rwsem --> pernet_ops_rwsem --> cpu_hotplug_lock + */ + percpu_down_write(&scx_fork_rwsem); + cpus_read_lock(); + + for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++) + if (((void (**)(void))ops)[i]) + static_branch_enable_cpuslocked(&scx_has_op[i]); + + if (ops->flags & SCX_OPS_ENQ_LAST) + static_branch_enable_cpuslocked(&scx_ops_enq_last); + + if (ops->flags & SCX_OPS_ENQ_EXITING) + static_branch_enable_cpuslocked(&scx_ops_enq_exiting); + + if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) { + reset_idle_masks(); + static_branch_enable_cpuslocked(&scx_builtin_idle_enabled); + } else { + static_branch_disable_cpuslocked(&scx_builtin_idle_enabled); + } + + static_branch_enable_cpuslocked(&__scx_ops_enabled); + + /* + * Enable ops for every task. Fork is excluded by scx_fork_rwsem + * preventing new tasks from being added. No need to exclude tasks + * leaving as sched_ext_free() can handle both prepped and enabled + * tasks. Prep all tasks first and then enable them with preemption + * disabled. + */ + spin_lock_irq(&scx_tasks_lock); + + scx_task_iter_init(&sti); + while ((p = scx_task_iter_next_locked(&sti, false))) { + get_task_struct(p); + scx_task_iter_rq_unlock(&sti); + spin_unlock_irq(&scx_tasks_lock); + + ret = scx_ops_init_task(p, task_group(p), false); + if (ret) { + put_task_struct(p); + spin_lock_irq(&scx_tasks_lock); + scx_task_iter_exit(&sti); + spin_unlock_irq(&scx_tasks_lock); + pr_err("sched_ext: ops.init_task() failed (%d) for %s[%d] while loading\n", + ret, p->comm, p->pid); + goto err_disable_unlock_all; + } + + put_task_struct(p); + spin_lock_irq(&scx_tasks_lock); + } + scx_task_iter_exit(&sti); + + /* + * All tasks are prepped but are still ops-disabled. Ensure that + * %current can't be scheduled out and switch everyone. + * preempt_disable() is necessary because we can't guarantee that + * %current won't be starved if scheduled out while switching. + */ + preempt_disable(); + + /* + * From here on, the disable path must assume that tasks have ops + * enabled and need to be recovered. + * + * Transition to ENABLING fails iff the BPF scheduler has already + * triggered scx_bpf_error(). Returning an error code here would lose + * the recorded error information. Exit indicating success so that the + * error is notified through ops.exit() with all the details. + */ + if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLING, SCX_OPS_PREPPING)) { + preempt_enable(); + spin_unlock_irq(&scx_tasks_lock); + WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE); + ret = 0; + goto err_disable_unlock_all; + } + + /* + * We're fully committed and can't fail. The PREPPED -> ENABLED + * transitions here are synchronized against sched_ext_free() through + * scx_tasks_lock. + */ + WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL)); + + scx_task_iter_init(&sti); + while ((p = scx_task_iter_next_locked(&sti, false))) { + const struct sched_class *old_class = p->sched_class; + struct sched_enq_and_set_ctx ctx; + + sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx); + + scx_set_task_state(p, SCX_TASK_READY); + __setscheduler_prio(p, p->prio); + check_class_changing(task_rq(p), p, old_class); + + sched_enq_and_set_task(&ctx); + + check_class_changed(task_rq(p), p, old_class, p->prio); + } + scx_task_iter_exit(&sti); + + spin_unlock_irq(&scx_tasks_lock); + preempt_enable(); + cpus_read_unlock(); + percpu_up_write(&scx_fork_rwsem); + + /* see above ENABLING transition for the explanation on exiting with 0 */ + if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) { + WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE); + ret = 0; + goto err_disable; + } + + if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL)) + static_branch_enable(&__scx_switched_all); + + kobject_uevent(scx_root_kobj, KOBJ_ADD); + mutex_unlock(&scx_ops_enable_mutex); + + return 0; + +err_del: + kobject_del(scx_root_kobj); +err: + kobject_put(scx_root_kobj); + scx_root_kobj = NULL; + if (scx_exit_info) { + free_exit_info(scx_exit_info); + scx_exit_info = NULL; + } +err_unlock: + mutex_unlock(&scx_ops_enable_mutex); + return ret; + +err_disable_unlock_all: + percpu_up_write(&scx_fork_rwsem); +err_disable_unlock_cpus: + cpus_read_unlock(); +err_disable: + mutex_unlock(&scx_ops_enable_mutex); + /* must be fully disabled before returning */ + scx_ops_disable(SCX_EXIT_ERROR); + kthread_flush_work(&scx_ops_disable_work); + return ret; +} + + +/******************************************************************************** + * bpf_struct_ops plumbing. + */ +#include <linux/bpf_verifier.h> +#include <linux/bpf.h> +#include <linux/btf.h> + +extern struct btf *btf_vmlinux; +static const struct btf_type *task_struct_type; +static u32 task_struct_type_id; + +static bool set_arg_maybe_null(const char *op, int arg_n, int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + struct btf *btf = bpf_get_btf_vmlinux(); + const struct bpf_struct_ops_desc *st_ops_desc; + const struct btf_member *member; + const struct btf_type *t; + u32 btf_id, member_idx; + const char *mname; + + /* struct_ops op args are all sequential, 64-bit numbers */ + if (off != arg_n * sizeof(__u64)) + return false; + + /* btf_id should be the type id of struct sched_ext_ops */ + btf_id = prog->aux->attach_btf_id; + st_ops_desc = bpf_struct_ops_find(btf, btf_id); + if (!st_ops_desc) + return false; + + /* BTF type of struct sched_ext_ops */ + t = st_ops_desc->type; + + member_idx = prog->expected_attach_type; + if (member_idx >= btf_type_vlen(t)) + return false; + + /* + * Get the member name of this struct_ops program, which corresponds to + * a field in struct sched_ext_ops. For example, the member name of the + * dispatch struct_ops program (callback) is "dispatch". + */ + member = &btf_type_member(t)[member_idx]; + mname = btf_name_by_offset(btf_vmlinux, member->name_off); + + if (!strcmp(mname, op)) { + /* + * The value is a pointer to a type (struct task_struct) given + * by a BTF ID (PTR_TO_BTF_ID). It is trusted (PTR_TRUSTED), + * however, can be a NULL (PTR_MAYBE_NULL). The BPF program + * should check the pointer to make sure it is not NULL before + * using it, or the verifier will reject the program. + * + * Longer term, this is something that should be addressed by + * BTF, and be fully contained within the verifier. + */ + info->reg_type = PTR_MAYBE_NULL | PTR_TO_BTF_ID | PTR_TRUSTED; + info->btf = btf_vmlinux; + info->btf_id = task_struct_type_id; + + return true; + } + + return false; +} + +static bool bpf_scx_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + if (type != BPF_READ) + return false; + if (set_arg_maybe_null("dispatch", 1, off, size, type, prog, info) || + set_arg_maybe_null("yield", 1, off, size, type, prog, info)) + return true; + if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS) + return false; + if (off % size != 0) + return false; + + return btf_ctx_access(off, size, type, prog, info); +} + +static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log, + const struct bpf_reg_state *reg, int off, + int size) +{ + const struct btf_type *t; + + t = btf_type_by_id(reg->btf, reg->btf_id); + if (t == task_struct_type) { + if (off >= offsetof(struct task_struct, scx.slice) && + off + size <= offsetofend(struct task_struct, scx.slice)) + return SCALAR_VALUE; + } + + return -EACCES; +} + +static const struct bpf_func_proto * +bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_task_storage_get: + return &bpf_task_storage_get_proto; + case BPF_FUNC_task_storage_delete: + return &bpf_task_storage_delete_proto; + default: + return bpf_base_func_proto(func_id); + } +} + +static const struct bpf_verifier_ops bpf_scx_verifier_ops = { + .get_func_proto = bpf_scx_get_func_proto, + .is_valid_access = bpf_scx_is_valid_access, + .btf_struct_access = bpf_scx_btf_struct_access, +}; + +static int bpf_scx_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + const struct sched_ext_ops *uops = udata; + struct sched_ext_ops *ops = kdata; + u32 moff = __btf_member_bit_offset(t, member) / 8; + int ret; + + switch (moff) { + case offsetof(struct sched_ext_ops, dispatch_max_batch): + if (*(u32 *)(udata + moff) > INT_MAX) + return -E2BIG; + ops->dispatch_max_batch = *(u32 *)(udata + moff); + return 1; + case offsetof(struct sched_ext_ops, flags): + if (*(u64 *)(udata + moff) & ~SCX_OPS_ALL_FLAGS) + return -EINVAL; + ops->flags = *(u64 *)(udata + moff); + return 1; + case offsetof(struct sched_ext_ops, name): + ret = bpf_obj_name_cpy(ops->name, uops->name, + sizeof(ops->name)); + if (ret < 0) + return ret; + if (ret == 0) + return -EINVAL; + return 1; + } + + return 0; +} + +static int bpf_scx_check_member(const struct btf_type *t, + const struct btf_member *member, + const struct bpf_prog *prog) +{ + u32 moff = __btf_member_bit_offset(t, member) / 8; + + switch (moff) { + case offsetof(struct sched_ext_ops, init_task): + case offsetof(struct sched_ext_ops, init): + case offsetof(struct sched_ext_ops, exit): + break; + default: + if (prog->sleepable) + return -EINVAL; + } + + return 0; +} + +static int bpf_scx_reg(void *kdata) +{ + return scx_ops_enable(kdata); +} + +static void bpf_scx_unreg(void *kdata) +{ + scx_ops_disable(SCX_EXIT_UNREG); + kthread_flush_work(&scx_ops_disable_work); +} + +static int bpf_scx_init(struct btf *btf) +{ + u32 type_id; + + type_id = btf_find_by_name_kind(btf, "task_struct", BTF_KIND_STRUCT); + if (type_id < 0) + return -EINVAL; + task_struct_type = btf_type_by_id(btf, type_id); + task_struct_type_id = type_id; + + return 0; +} + +static int bpf_scx_update(void *kdata, void *old_kdata) +{ + /* + * sched_ext does not support updating the actively-loaded BPF + * scheduler, as registering a BPF scheduler can always fail if the + * scheduler returns an error code for e.g. ops.init(), ops.init_task(), + * etc. Similarly, we can always race with unregistration happening + * elsewhere, such as with sysrq. + */ + return -EOPNOTSUPP; +} + +static int bpf_scx_validate(void *kdata) +{ + return 0; +} + +static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags) { return -EINVAL; } +static void enqueue_stub(struct task_struct *p, u64 enq_flags) {} +static void dequeue_stub(struct task_struct *p, u64 enq_flags) {} +static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {} +static bool yield_stub(struct task_struct *from, struct task_struct *to) { return false; } +static void set_weight_stub(struct task_struct *p, u32 weight) {} +static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {} +static void update_idle_stub(s32 cpu, bool idle) {} +static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args) { return -EINVAL; } +static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {} +static void enable_stub(struct task_struct *p) {} +static void disable_stub(struct task_struct *p) {} +static s32 init_stub(void) { return -EINVAL; } +static void exit_stub(struct scx_exit_info *info) {} + +static struct sched_ext_ops __bpf_ops_sched_ext_ops = { + .select_cpu = select_cpu_stub, + .enqueue = enqueue_stub, + .dequeue = dequeue_stub, + .dispatch = dispatch_stub, + .yield = yield_stub, + .set_weight = set_weight_stub, + .set_cpumask = set_cpumask_stub, + .update_idle = update_idle_stub, + .init_task = init_task_stub, + .exit_task = exit_task_stub, + .enable = enable_stub, + .disable = disable_stub, + .init = init_stub, + .exit = exit_stub, +}; + +static struct bpf_struct_ops bpf_sched_ext_ops = { + .verifier_ops = &bpf_scx_verifier_ops, + .reg = bpf_scx_reg, + .unreg = bpf_scx_unreg, + .check_member = bpf_scx_check_member, + .init_member = bpf_scx_init_member, + .init = bpf_scx_init, + .update = bpf_scx_update, + .validate = bpf_scx_validate, + .name = "sched_ext_ops", + .owner = THIS_MODULE, + .cfi_stubs = &__bpf_ops_sched_ext_ops +}; + + +/******************************************************************************** + * System integration and init. + */ + +void __init init_sched_ext_class(void) +{ + s32 cpu, v; + + /* + * The following is to prevent the compiler from optimizing out the enum + * definitions so that BPF scheduler implementations can use them + * through the generated vmlinux.h. + */ + WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP); + + BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params)); + init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL); +#ifdef CONFIG_SMP + BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); + BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); +#endif + for_each_possible_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); + + init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); + INIT_LIST_HEAD(&rq->scx.runnable_list); + } +} + + +/******************************************************************************** + * Helpers that can be called from the BPF scheduler. + */ +#include <linux/btf_ids.h> + +__bpf_kfunc_start_defs(); + +/** + * scx_bpf_create_dsq - Create a custom DSQ + * @dsq_id: DSQ to create + * @node: NUMA node to allocate from + * + * Create a custom DSQ identified by @dsq_id. Can be called from ops.init() and + * ops.init_task(). + */ +__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) +{ + if (!scx_kf_allowed(SCX_KF_SLEEPABLE)) + return -EINVAL; + + if (unlikely(node >= (int)nr_node_ids || + (node < 0 && node != NUMA_NO_NODE))) + return -EINVAL; + return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node)); +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(scx_kfunc_ids_sleepable) +BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) +BTF_KFUNCS_END(scx_kfunc_ids_sleepable) + +static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_sleepable, +}; + +__bpf_kfunc_start_defs(); + +/** + * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu() + * @p: task_struct to select a CPU for + * @prev_cpu: CPU @p was on previously + * @wake_flags: %SCX_WAKE_* flags + * @is_idle: out parameter indicating whether the returned CPU is idle + * + * Can only be called from ops.select_cpu() if the built-in CPU selection is + * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set. + * @p, @prev_cpu and @wake_flags match ops.select_cpu(). + * + * Returns the picked CPU with *@is_idle indicating whether the picked CPU is + * currently idle and thus a good candidate for direct dispatching. + */ +__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, + u64 wake_flags, bool *is_idle) +{ + if (!scx_kf_allowed(SCX_KF_SELECT_CPU)) { + *is_idle = false; + return prev_cpu; + } +#ifdef CONFIG_SMP + return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle); +#else + *is_idle = false; + return prev_cpu; +#endif +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(scx_kfunc_ids_select_cpu) +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU) +BTF_KFUNCS_END(scx_kfunc_ids_select_cpu) + +static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_select_cpu, +}; + +static bool scx_dispatch_preamble(struct task_struct *p, u64 enq_flags) +{ + if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH)) + return false; + + lockdep_assert_irqs_disabled(); + + if (unlikely(!p)) { + scx_ops_error("called with NULL task"); + return false; + } + + if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) { + scx_ops_error("invalid enq_flags 0x%llx", enq_flags); + return false; + } + + return true; +} + +static void scx_dispatch_commit(struct task_struct *p, u64 dsq_id, u64 enq_flags) +{ + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); + struct task_struct *ddsp_task; + + ddsp_task = __this_cpu_read(direct_dispatch_task); + if (ddsp_task) { + mark_direct_dispatch(ddsp_task, p, dsq_id, enq_flags); + return; + } + + if (unlikely(dspc->cursor >= scx_dsp_max_batch)) { + scx_ops_error("dispatch buffer overflow"); + return; + } + + dspc->buf[dspc->cursor++] = (struct scx_dsp_buf_ent){ + .task = p, + .qseq = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK, + .dsq_id = dsq_id, + .enq_flags = enq_flags, + }; +} + +__bpf_kfunc_start_defs(); + +/** + * scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ + * @p: task_struct to dispatch + * @dsq_id: DSQ to dispatch to + * @slice: duration @p can run for in nsecs + * @enq_flags: SCX_ENQ_* + * + * Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe + * to call this function spuriously. Can be called from ops.enqueue(), + * ops.select_cpu(), and ops.dispatch(). + * + * When called from ops.select_cpu() or ops.enqueue(), it's for direct dispatch + * and @p must match the task being enqueued. Also, %SCX_DSQ_LOCAL_ON can't be + * used to target the local DSQ of a CPU other than the enqueueing one. Use + * ops.select_cpu() to be on the target CPU in the first place. + * + * When called from ops.select_cpu(), @enq_flags and @dsp_id are stored, and @p + * will be directly dispatched to the corresponding dispatch queue after + * ops.select_cpu() returns. If @p is dispatched to SCX_DSQ_LOCAL, it will be + * dispatched to the local DSQ of the CPU returned by ops.select_cpu(). + * @enq_flags are OR'd with the enqueue flags on the enqueue path before the + * task is dispatched. + * + * When called from ops.dispatch(), there are no restrictions on @p or @dsq_id + * and this function can be called upto ops.dispatch_max_batch times to dispatch + * multiple tasks. scx_bpf_dispatch_nr_slots() returns the number of the + * remaining slots. scx_bpf_consume() flushes the batch and resets the counter. + * + * This function doesn't have any locking restrictions and may be called under + * BPF locks (in the future when BPF introduces more flexible locking). + * + * @p is allowed to run for @slice. The scheduling path is triggered on slice + * exhaustion. If zero, the current residual slice is maintained. + */ +__bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, + u64 enq_flags) +{ + if (!scx_dispatch_preamble(p, enq_flags)) + return; + + if (slice) + p->scx.slice = slice; + else + p->scx.slice = p->scx.slice ?: 1; + + scx_dispatch_commit(p, dsq_id, enq_flags); +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch) +BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU) +BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) + +static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_enqueue_dispatch, +}; + +__bpf_kfunc_start_defs(); + +/** + * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots + * + * Can only be called from ops.dispatch(). + */ +__bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void) +{ + if (!scx_kf_allowed(SCX_KF_DISPATCH)) + return 0; + + return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor); +} + +/** + * scx_bpf_dispatch_cancel - Cancel the latest dispatch + * + * Cancel the latest dispatch. Can be called multiple times to cancel further + * dispatches. Can only be called from ops.dispatch(). + */ +__bpf_kfunc void scx_bpf_dispatch_cancel(void) +{ + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); + + if (!scx_kf_allowed(SCX_KF_DISPATCH)) + return; + + if (dspc->cursor > 0) + dspc->cursor--; + else + scx_ops_error("dispatch buffer underflow"); +} + +/** + * scx_bpf_consume - Transfer a task from a DSQ to the current CPU's local DSQ + * @dsq_id: DSQ to consume + * + * Consume a task from the non-local DSQ identified by @dsq_id and transfer it + * to the current CPU's local DSQ for execution. Can only be called from + * ops.dispatch(). + * + * This function flushes the in-flight dispatches from scx_bpf_dispatch() before + * trying to consume the specified DSQ. It may also grab rq locks and thus can't + * be called under any BPF locks. + * + * Returns %true if a task has been consumed, %false if there isn't any task to + * consume. + */ +__bpf_kfunc bool scx_bpf_consume(u64 dsq_id) +{ + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); + struct scx_dispatch_q *dsq; + + if (!scx_kf_allowed(SCX_KF_DISPATCH)) + return false; + + flush_dispatch_buf(dspc->rq, dspc->rf); + + dsq = find_non_local_dsq(dsq_id); + if (unlikely(!dsq)) { + scx_ops_error("invalid DSQ ID 0x%016llx", dsq_id); + return false; + } + + if (consume_dispatch_q(dspc->rq, dspc->rf, dsq)) { + /* + * A successfully consumed task can be dequeued before it starts + * running while the CPU is trying to migrate other dispatched + * tasks. Bump nr_tasks to tell balance_scx() to retry on empty + * local DSQ. + */ + dspc->nr_tasks++; + return true; + } else { + return false; + } +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(scx_kfunc_ids_dispatch) +BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots) +BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel) +BTF_ID_FLAGS(func, scx_bpf_consume) +BTF_KFUNCS_END(scx_kfunc_ids_dispatch) + +static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_dispatch, +}; + +__bpf_kfunc_start_defs(); + +/** + * scx_bpf_dsq_nr_queued - Return the number of queued tasks + * @dsq_id: id of the DSQ + * + * Return the number of tasks in the DSQ matching @dsq_id. If not found, + * -%ENOENT is returned. + */ +__bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id) +{ + struct scx_dispatch_q *dsq; + s32 ret; + + preempt_disable(); + + if (dsq_id == SCX_DSQ_LOCAL) { + ret = READ_ONCE(this_rq()->scx.local_dsq.nr); + goto out; + } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { + s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; + + if (ops_cpu_valid(cpu, NULL)) { + ret = READ_ONCE(cpu_rq(cpu)->scx.local_dsq.nr); + goto out; + } + } else { + dsq = find_non_local_dsq(dsq_id); + if (dsq) { + ret = READ_ONCE(dsq->nr); + goto out; + } + } + ret = -ENOENT; +out: + preempt_enable(); + return ret; +} + +/** + * scx_bpf_destroy_dsq - Destroy a custom DSQ + * @dsq_id: DSQ to destroy + * + * Destroy the custom DSQ identified by @dsq_id. Only DSQs created with + * scx_bpf_create_dsq() can be destroyed. The caller must ensure that the DSQ is + * empty and no further tasks are dispatched to it. Ignored if called on a DSQ + * which doesn't exist. Can be called from any online scx_ops operations. + */ +__bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id) +{ + destroy_dsq(dsq_id); +} + +__bpf_kfunc_end_defs(); + +static s32 __bstr_format(u64 *data_buf, char *line_buf, size_t line_size, + char *fmt, unsigned long long *data, u32 data__sz) +{ + struct bpf_bprintf_data bprintf_data = { .get_bin_args = true }; + s32 ret; + + if (data__sz % 8 || data__sz > MAX_BPRINTF_VARARGS * 8 || + (data__sz && !data)) { + scx_ops_error("invalid data=%p and data__sz=%u", + (void *)data, data__sz); + return -EINVAL; + } + + ret = copy_from_kernel_nofault(data_buf, data, data__sz); + if (ret < 0) { + scx_ops_error("failed to read data fields (%d)", ret); + return ret; + } + + ret = bpf_bprintf_prepare(fmt, UINT_MAX, data_buf, data__sz / 8, + &bprintf_data); + if (ret < 0) { + scx_ops_error("format preparation failed (%d)", ret); + return ret; + } + + ret = bstr_printf(line_buf, line_size, fmt, + bprintf_data.bin_args); + bpf_bprintf_cleanup(&bprintf_data); + if (ret < 0) { + scx_ops_error("(\"%s\", %p, %u) failed to format", + fmt, data, data__sz); + return ret; + } + + return ret; +} + +static s32 bstr_format(struct scx_bstr_buf *buf, + char *fmt, unsigned long long *data, u32 data__sz) +{ + return __bstr_format(buf->data, buf->line, sizeof(buf->line), + fmt, data, data__sz); +} + +__bpf_kfunc_start_defs(); + +/** + * scx_bpf_exit_bstr - Gracefully exit the BPF scheduler. + * @exit_code: Exit value to pass to user space via struct scx_exit_info. + * @fmt: error message format string + * @data: format string parameters packaged using ___bpf_fill() macro + * @data__sz: @data len, must end in '__sz' for the verifier + * + * Indicate that the BPF scheduler wants to exit gracefully, and initiate ops + * disabling. + */ +__bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt, + unsigned long long *data, u32 data__sz) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); + if (bstr_format(&scx_exit_bstr_buf, fmt, data, data__sz) >= 0) + scx_ops_exit_kind(SCX_EXIT_UNREG_BPF, exit_code, "%s", + scx_exit_bstr_buf.line); + raw_spin_unlock_irqrestore(&scx_exit_bstr_buf_lock, flags); +} + +/** + * scx_bpf_error_bstr - Indicate fatal error + * @fmt: error message format string + * @data: format string parameters packaged using ___bpf_fill() macro + * @data__sz: @data len, must end in '__sz' for the verifier + * + * Indicate that the BPF scheduler encountered a fatal error and initiate ops + * disabling. + */ +__bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, + u32 data__sz) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); + if (bstr_format(&scx_exit_bstr_buf, fmt, data, data__sz) >= 0) + scx_ops_exit_kind(SCX_EXIT_ERROR_BPF, 0, "%s", + scx_exit_bstr_buf.line); + raw_spin_unlock_irqrestore(&scx_exit_bstr_buf_lock, flags); +} + +/** + * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs + * + * All valid CPU IDs in the system are smaller than the returned value. + */ +__bpf_kfunc u32 scx_bpf_nr_cpu_ids(void) +{ + return nr_cpu_ids; +} + +/** + * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask + */ +__bpf_kfunc const struct cpumask *scx_bpf_get_possible_cpumask(void) +{ + return cpu_possible_mask; +} + +/** + * scx_bpf_get_online_cpumask - Get a referenced kptr to cpu_online_mask + */ +__bpf_kfunc const struct cpumask *scx_bpf_get_online_cpumask(void) +{ + return cpu_online_mask; +} + +/** + * scx_bpf_put_cpumask - Release a possible/online cpumask + * @cpumask: cpumask to release + */ +__bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) +{ + /* + * Empty function body because we aren't actually acquiring or releasing + * a reference to a global cpumask, which is read-only in the caller and + * is never released. The acquire / release semantics here are just used + * to make the cpumask is a trusted pointer in the caller. + */ +} + +/** + * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking + * per-CPU cpumask. + * + * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + */ +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) +{ + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return cpu_none_mask; + } + +#ifdef CONFIG_SMP + return idle_masks.cpu; +#else + return cpu_none_mask; +#endif +} + +/** + * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, + * per-physical-core cpumask. Can be used to determine if an entire physical + * core is free. + * + * Returns NULL if idle tracking is not enabled, or running on a UP kernel. + */ +__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) +{ + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return cpu_none_mask; + } + +#ifdef CONFIG_SMP + if (sched_smt_active()) + return idle_masks.smt; + else + return idle_masks.cpu; +#else + return cpu_none_mask; +#endif +} + +/** + * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to + * either the percpu, or SMT idle-tracking cpumask. + */ +__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask) +{ + /* + * Empty function body because we aren't actually acquiring or releasing + * a reference to a global idle cpumask, which is read-only in the + * caller and is never released. The acquire / release semantics here + * are just used to make the cpumask a trusted pointer in the caller. + */ +} + +/** + * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state + * @cpu: cpu to test and clear idle for + * + * Returns %true if @cpu was idle and its idle state was successfully cleared. + * %false otherwise. + * + * Unavailable if ops.update_idle() is implemented and + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set. + */ +__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) +{ + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return false; + } + + if (ops_cpu_valid(cpu, NULL)) + return test_and_clear_cpu_idle(cpu); + else + return false; +} + +/** + * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu + * @cpus_allowed: Allowed cpumask + * @flags: %SCX_PICK_IDLE_CPU_* flags + * + * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu + * number on success. -%EBUSY if no matching cpu was found. + * + * Idle CPU tracking may race against CPU scheduling state transitions. For + * example, this function may return -%EBUSY as CPUs are transitioning into the + * idle state. If the caller then assumes that there will be dispatch events on + * the CPUs as they were all busy, the scheduler may end up stalling with CPUs + * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and + * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch + * event in the near future. + * + * Unavailable if ops.update_idle() is implemented and + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set. + */ +__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, + u64 flags) +{ + if (!static_branch_likely(&scx_builtin_idle_enabled)) { + scx_ops_error("built-in idle tracking is disabled"); + return -EBUSY; + } + + return scx_pick_idle_cpu(cpus_allowed, flags); +} + +/** + * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU + * @cpus_allowed: Allowed cpumask + * @flags: %SCX_PICK_IDLE_CPU_* flags + * + * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any + * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu + * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is + * empty. + * + * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not + * set, this function can't tell which CPUs are idle and will always pick any + * CPU. + */ +__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, + u64 flags) +{ + s32 cpu; + + if (static_branch_likely(&scx_builtin_idle_enabled)) { + cpu = scx_pick_idle_cpu(cpus_allowed, flags); + if (cpu >= 0) + return cpu; + } + + cpu = cpumask_any_distribute(cpus_allowed); + if (cpu < nr_cpu_ids) + return cpu; + else + return -EBUSY; +} + +/** + * scx_bpf_task_running - Is task currently running? + * @p: task of interest + */ +__bpf_kfunc bool scx_bpf_task_running(const struct task_struct *p) +{ + return task_rq(p)->curr == p; +} + +/** + * scx_bpf_task_cpu - CPU a task is currently associated with + * @p: task of interest + */ +__bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p) +{ + return task_cpu(p); +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(scx_kfunc_ids_any) +BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) +BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) +BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) +BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE) +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE) +BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE) +BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle) +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU) +BTF_KFUNCS_END(scx_kfunc_ids_any) + +static const struct btf_kfunc_id_set scx_kfunc_set_any = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_any, +}; + +static int __init scx_init(void) +{ + int ret; + + /* + * kfunc registration can't be done from init_sched_ext_class() as + * register_btf_kfunc_id_set() needs most of the system to be up. + * + * Some kfuncs are context-sensitive and can only be called from + * specific SCX ops. They are grouped into BTF sets accordingly. + * Unfortunately, BPF currently doesn't have a way of enforcing such + * restrictions. Eventually, the verifier should be able to enforce + * them. For now, register them the same and make each kfunc explicitly + * check using scx_kf_allowed(). + */ + if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &scx_kfunc_set_sleepable)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &scx_kfunc_set_select_cpu)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &scx_kfunc_set_enqueue_dispatch)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &scx_kfunc_set_dispatch)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &scx_kfunc_set_any)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &scx_kfunc_set_any)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, + &scx_kfunc_set_any))) { + pr_err("sched_ext: Failed to register kfunc sets (%d)\n", ret); + return ret; + } + + ret = register_bpf_struct_ops(&bpf_sched_ext_ops, sched_ext_ops); + if (ret) { + pr_err("sched_ext: Failed to register struct_ops (%d)\n", ret); + return ret; + } + + scx_kset = kset_create_and_add("sched_ext", &scx_uevent_ops, kernel_kobj); + if (!scx_kset) { + pr_err("sched_ext: Failed to create /sys/kernel/sched_ext\n"); + return -ENOMEM; + } + + ret = sysfs_create_group(&scx_kset->kobj, &scx_global_attr_group); + if (ret < 0) { + pr_err("sched_ext: Failed to add global attributes\n"); + return ret; + } + + return 0; +} +__initcall(scx_init); diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 6a93c4825339..9c5a2d928281 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -1,15 +1,76 @@ /* SPDX-License-Identifier: GPL-2.0 */ - +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ #ifdef CONFIG_SCHED_CLASS_EXT -#error "NOT IMPLEMENTED YET" + +struct sched_enq_and_set_ctx { + struct task_struct *p; + int queue_flags; + bool queued; + bool running; +}; + +void sched_deq_and_put_task(struct task_struct *p, int queue_flags, + struct sched_enq_and_set_ctx *ctx); +void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx); + +extern const struct sched_class ext_sched_class; + +DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled); +DECLARE_STATIC_KEY_FALSE(__scx_switched_all); +#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled) +#define scx_switched_all() static_branch_unlikely(&__scx_switched_all) + +static inline bool task_on_scx(const struct task_struct *p) +{ + return scx_enabled() && p->sched_class == &ext_sched_class; +} + +void init_scx_entity(struct sched_ext_entity *scx); +void scx_pre_fork(struct task_struct *p); +int scx_fork(struct task_struct *p); +void scx_post_fork(struct task_struct *p); +void scx_cancel_fork(struct task_struct *p); +bool task_should_scx(struct task_struct *p); +void init_sched_ext_class(void); + +static inline const struct sched_class *next_active_class(const struct sched_class *class) +{ + class++; + if (scx_switched_all() && class == &fair_sched_class) + class++; + if (!scx_enabled() && class == &ext_sched_class) + class++; + return class; +} + +#define for_active_class_range(class, _from, _to) \ + for (class = (_from); class != (_to); class = next_active_class(class)) + +#define for_each_active_class(class) \ + for_active_class_range(class, __sched_class_highest, __sched_class_lowest) + +/* + * SCX requires a balance() call before every pick_next_task() call including + * when waking up from idle. + */ +#define for_balance_class_range(class, prev_class, end_class) \ + for_active_class_range(class, (prev_class) > &ext_sched_class ? \ + &ext_sched_class : (prev_class), (end_class)) + #else /* CONFIG_SCHED_CLASS_EXT */ #define scx_enabled() false +#define scx_switched_all() false static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} static inline void scx_cancel_fork(struct task_struct *p) {} +static inline bool task_on_scx(const struct task_struct *p) { return false; } static inline void init_sched_ext_class(void) {} #define for_each_active_class for_each_class @@ -18,7 +79,13 @@ static inline void init_sched_ext_class(void) {} #endif /* CONFIG_SCHED_CLASS_EXT */ #if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP) -#error "NOT IMPLEMENTED YET" +void __scx_update_idle(struct rq *rq, bool idle); + +static inline void scx_update_idle(struct rq *rq, bool idle) +{ + if (scx_enabled()) + __scx_update_idle(rq, idle); +} #else static inline void scx_update_idle(struct rq *rq, bool idle) {} #endif diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 57f8fada633a..ea9941093b96 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -192,6 +192,10 @@ static inline int idle_policy(int policy) static inline int normal_policy(int policy) { +#ifdef CONFIG_SCHED_CLASS_EXT + if (policy == SCHED_EXT) + return true; +#endif return policy == SCHED_NORMAL; } @@ -819,6 +823,16 @@ struct cfs_rq { KABI_RESERVE(8) }; +#ifdef CONFIG_SCHED_CLASS_EXT +struct scx_rq { + struct scx_dispatch_q local_dsq; + struct list_head runnable_list; /* runnable tasks on this rq */ + unsigned long ops_qseq; + u64 extra_enq_flags; /* see move_task_to_local_dsq() */ + u32 nr_running; +}; +#endif /* CONFIG_SCHED_CLASS_EXT */ + static inline int rt_bandwidth_enabled(void) { return sysctl_sched_rt_runtime >= 0; @@ -1173,6 +1187,9 @@ struct rq { #ifdef CONFIG_SCHED_STEAL struct sparsemask *cfs_overload_cpus; #endif +#ifdef CONFIG_SCHED_CLASS_EXT + struct scx_rq scx; +#endif #ifdef CONFIG_FAIR_GROUP_SCHED /* list of leaf cfs_rq on this CPU: */ diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index ebf339797d8b..76e9f70f3287 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -1610,6 +1610,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy) case SCHED_NORMAL: case SCHED_BATCH: case SCHED_IDLE: + case SCHED_EXT: ret = 0; break; } @@ -1637,6 +1638,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy) case SCHED_NORMAL: case SCHED_BATCH: case SCHED_IDLE: + case SCHED_EXT: ret = 0; } return ret; -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched_ext: Fix kabi for struct sched_ext_entity scx in struct task_struct. Fixes: 12fe9a293477 ("sched_ext: Implement BPF extensible scheduler class") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched.h | 7 ++++--- kernel/sched/sched.h | 6 +++--- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index a9c29f1de7fe..6b1f6dc28ec6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -839,9 +839,6 @@ struct task_struct { struct sched_entity se; struct sched_rt_entity rt; struct sched_dl_entity dl; -#ifdef CONFIG_SCHED_CLASS_EXT - struct sched_ext_entity scx; -#endif const struct sched_class *sched_class; #ifdef CONFIG_SCHED_CORE @@ -1648,6 +1645,10 @@ struct task_struct { KABI_RESERVE(16) KABI_AUX_PTR(task_struct) +#if !defined(__GENKSYMS__) && defined(CONFIG_SCHED_CLASS_EXT) + KABI_BROKEN_INSERT(struct sched_ext_entity scx) +#endif + /* CPU-specific state of this task: */ struct thread_struct thread; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ea9941093b96..26f8fde4d8cf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1187,9 +1187,6 @@ struct rq { #ifdef CONFIG_SCHED_STEAL struct sparsemask *cfs_overload_cpus; #endif -#ifdef CONFIG_SCHED_CLASS_EXT - struct scx_rq scx; -#endif #ifdef CONFIG_FAIR_GROUP_SCHED /* list of leaf cfs_rq on this CPU: */ @@ -1380,6 +1377,9 @@ struct rq { KABI_RESERVE(6) KABI_RESERVE(7) KABI_RESERVE(8) +#if defined(CONFIG_SCHED_CLASS_EXT) + KABI_EXTEND(struct scx_rq scx) +#endif }; #ifdef CONFIG_FAIR_GROUP_SCHED -- 2.34.1
From: Yipeng Zou <zouyipeng@huawei.com> mainline inclusion from mainline-v6.12-rc1 commit 9ad2861b773d7da1cf3a5a04a4e8f1aa7d092bbb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Since dequeue_task() allowed to fail, there is a compile error: kernel/sched/ext.c:3630:19: error: initialization of ‘bool (*)(struct rq*, struct task_struct *, int)’ {aka ‘_Bool (*)(struct rq *, struct task_struct *, int)’} from incompatible pointer type ‘void (*)(struct rq*, struct task_struct *, int)’ 3630 | .dequeue_task = dequeue_task_scx, | ^~~~~~~~~~~~~~~~ Allow dequeue_task_scx to fail too. Fixes: 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index c4dbce5b6f5d..79c7fd88db24 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1262,11 +1262,11 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags) } } -static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags) +static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags) { if (!(p->scx.flags & SCX_TASK_QUEUED)) { WARN_ON_ONCE(task_runnable(p)); - return; + return true; } ops_dequeue(p, deq_flags); @@ -1281,6 +1281,7 @@ static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags sub_nr_running(rq, 1); dispatch_dequeue(rq, p); + return true; } static void yield_task_scx(struct rq *rq) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 2a52ca7c98960aafb0eca9ef96b2d0c932171357 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Add two simple example BPF schedulers - simple and qmap. * simple: In terms of scheduling, it behaves identical to not having any operation implemented at all. The two operations it implements are only to improve visibility and exit handling. On certain homogeneous configurations, this actually can perform pretty well. * qmap: A fixed five level priority scheduler to demonstrate queueing PIDs on BPF maps for scheduling. While not very practical, this is useful as a simple example and will be used to demonstrate different features. v7: - Compat helpers stripped out in prepartion of upstreaming as the upstreamed patchset will be the baselinfe. Utility macros that can be used to implement compat features are kept. - Explicitly disable map autoattach on struct_ops to avoid trying to attach twice while maintaining compatbility with older libbpf. v6: - Common header files reorganized and cleaned up. Compat helpers are added to demonstrate how schedulers can maintain backward compatibility with older kernels while making use of newly added features. - simple_select_cpu() added to keep track of the number of local dispatches. This is needed because the default ops.select_cpu() implementation is updated to dispatch directly and won't call ops.enqueue(). - Updated to reflect the sched_ext API changes. Switching all tasks is the default behavior now and scx_qmap supports partial switching when `-p` is specified. - tools/sched_ext/Kconfig dropped. This will be included in the doc instead. v5: - Improve Makefile. Build artifects are now collected into a separate dir which change be changed. Install and help targets are added and clean actually cleans everything. - MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR() and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss. - Add scx_common.h which provides common utilities to user code such as SCX_BUG[_ON]() and RESIZE_ARRAY(). - Use SCX_BUG[_ON]() to simplify error handling. v4: - Dropped _example prefix from scheduler names. v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit to ease later additions. Comment updates. - Added declarations for BPF inline iterators. In the future, hopefully, these will be consolidated into a generic BPF header so that they don't need to be replicated here. v2: - Updated with the generic BPF cpumask helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Conflicts: Makefile [Conflicts with f0b057292e10 ("tools: fix annoying "mkdir -p ..." logs when building tools in parallel"), so only cherry-pick the main part of this mainline patch.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Makefile | 8 +- tools/Makefile | 10 +- tools/sched_ext/.gitignore | 2 + tools/sched_ext/Makefile | 246 ++++++++++++ .../sched_ext/include/bpf-compat/gnu/stubs.h | 11 + tools/sched_ext/include/scx/common.bpf.h | 379 ++++++++++++++++++ tools/sched_ext/include/scx/common.h | 75 ++++ tools/sched_ext/include/scx/compat.bpf.h | 28 ++ tools/sched_ext/include/scx/compat.h | 153 +++++++ tools/sched_ext/include/scx/user_exit_info.h | 64 +++ tools/sched_ext/scx_qmap.bpf.c | 264 ++++++++++++ tools/sched_ext/scx_qmap.c | 99 +++++ tools/sched_ext/scx_simple.bpf.c | 63 +++ tools/sched_ext/scx_simple.c | 99 +++++ 14 files changed, 1499 insertions(+), 2 deletions(-) create mode 100644 tools/sched_ext/.gitignore create mode 100644 tools/sched_ext/Makefile create mode 100644 tools/sched_ext/include/bpf-compat/gnu/stubs.h create mode 100644 tools/sched_ext/include/scx/common.bpf.h create mode 100644 tools/sched_ext/include/scx/common.h create mode 100644 tools/sched_ext/include/scx/compat.bpf.h create mode 100644 tools/sched_ext/include/scx/compat.h create mode 100644 tools/sched_ext/include/scx/user_exit_info.h create mode 100644 tools/sched_ext/scx_qmap.bpf.c create mode 100644 tools/sched_ext/scx_qmap.c create mode 100644 tools/sched_ext/scx_simple.bpf.c create mode 100644 tools/sched_ext/scx_simple.c diff --git a/Makefile b/Makefile index b62b39c18643..96583a016426 100644 --- a/Makefile +++ b/Makefile @@ -1358,6 +1358,12 @@ ifneq ($(wildcard $(resolve_btfids_O)),) $(Q)$(MAKE) -sC $(srctree)/tools/bpf/resolve_btfids O=$(resolve_btfids_O) clean endif +tools-clean-targets := sched_ext +PHONY += $(tools-clean-targets) +$(tools-clean-targets): + $(Q)$(MAKE) -sC tools $@_clean +tools_clean: $(tools-clean-targets) + tools/: FORCE $(Q)mkdir -p $(objtree)/tools $(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/ @@ -1522,7 +1528,7 @@ PHONY += $(mrproper-dirs) mrproper $(mrproper-dirs): $(Q)$(MAKE) $(clean)=$(patsubst _mrproper_%,%,$@) -mrproper: clean $(mrproper-dirs) +mrproper: clean $(mrproper-dirs) tools_clean $(call cmd,rmfiles) @find . $(RCS_FIND_IGNORE) \ \( -name '*.rmeta' \) \ diff --git a/tools/Makefile b/tools/Makefile index 153817f0cc17..f4ae5a42a41a 100644 --- a/tools/Makefile +++ b/tools/Makefile @@ -30,6 +30,7 @@ help: @echo ' pci - PCI tools' @echo ' perf - Linux performance measurement and analysis tool' @echo ' selftests - various kernel selftests' + @echo ' sched_ext - sched_ext example schedulers' @echo ' bootconfig - boot config tool' @echo ' spi - spi tools' @echo ' tmon - thermal monitoring and tuning tool' @@ -93,6 +94,9 @@ perf: FORCE $(Q)mkdir -p $(PERF_O) . $(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir= +sched_ext: FORCE + $(call descend,sched_ext) + selftests: FORCE $(call descend,testing/$@) @@ -188,6 +192,9 @@ perf_clean: $(Q)mkdir -p $(PERF_O) . $(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir= clean +sched_ext_clean: + $(call descend,sched_ext,clean) + selftests_clean: $(call descend,testing/$(@:_clean=),clean) @@ -217,6 +224,7 @@ clean: acpi_clean cgroup_clean counter_clean cpupower_clean hv_clean firewire_cl mm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \ freefall_clean build_clean libbpf_clean libsubcmd_clean \ gpio_clean objtool_clean leds_clean wmi_clean pci_clean firmware_clean debugging_clean \ - intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean + intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean \ + sched_ext_clean .PHONY: FORCE diff --git a/tools/sched_ext/.gitignore b/tools/sched_ext/.gitignore new file mode 100644 index 000000000000..d6264fe1c8cd --- /dev/null +++ b/tools/sched_ext/.gitignore @@ -0,0 +1,2 @@ +tools/ +build/ diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile new file mode 100644 index 000000000000..626782a21375 --- /dev/null +++ b/tools/sched_ext/Makefile @@ -0,0 +1,246 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (c) 2022 Meta Platforms, Inc. and affiliates. +include ../build/Build.include +include ../scripts/Makefile.arch +include ../scripts/Makefile.include + +all: all_targets + +ifneq ($(LLVM),) +ifneq ($(filter %/,$(LLVM)),) +LLVM_PREFIX := $(LLVM) +else ifneq ($(filter -%,$(LLVM)),) +LLVM_SUFFIX := $(LLVM) +endif + +CLANG_TARGET_FLAGS_arm := arm-linux-gnueabi +CLANG_TARGET_FLAGS_arm64 := aarch64-linux-gnu +CLANG_TARGET_FLAGS_hexagon := hexagon-linux-musl +CLANG_TARGET_FLAGS_m68k := m68k-linux-gnu +CLANG_TARGET_FLAGS_mips := mipsel-linux-gnu +CLANG_TARGET_FLAGS_powerpc := powerpc64le-linux-gnu +CLANG_TARGET_FLAGS_riscv := riscv64-linux-gnu +CLANG_TARGET_FLAGS_s390 := s390x-linux-gnu +CLANG_TARGET_FLAGS_x86 := x86_64-linux-gnu +CLANG_TARGET_FLAGS := $(CLANG_TARGET_FLAGS_$(ARCH)) + +ifeq ($(CROSS_COMPILE),) +ifeq ($(CLANG_TARGET_FLAGS),) +$(error Specify CROSS_COMPILE or add '--target=' option to lib.mk) +else +CLANG_FLAGS += --target=$(CLANG_TARGET_FLAGS) +endif # CLANG_TARGET_FLAGS +else +CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%)) +endif # CROSS_COMPILE + +CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as +else +CC := $(CROSS_COMPILE)gcc +endif # LLVM + +CURDIR := $(abspath .) +TOOLSDIR := $(abspath ..) +LIBDIR := $(TOOLSDIR)/lib +BPFDIR := $(LIBDIR)/bpf +TOOLSINCDIR := $(TOOLSDIR)/include +BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool +APIDIR := $(TOOLSINCDIR)/uapi +GENDIR := $(abspath ../../include/generated) +GENHDR := $(GENDIR)/autoconf.h + +ifeq ($(O),) +OUTPUT_DIR := $(CURDIR)/build +else +OUTPUT_DIR := $(O)/build +endif # O +OBJ_DIR := $(OUTPUT_DIR)/obj +INCLUDE_DIR := $(OUTPUT_DIR)/include +BPFOBJ_DIR := $(OBJ_DIR)/libbpf +SCXOBJ_DIR := $(OBJ_DIR)/sched_ext +BINDIR := $(OUTPUT_DIR)/bin +BPFOBJ := $(BPFOBJ_DIR)/libbpf.a +ifneq ($(CROSS_COMPILE),) +HOST_BUILD_DIR := $(OBJ_DIR)/host +HOST_OUTPUT_DIR := host-tools +HOST_INCLUDE_DIR := $(HOST_OUTPUT_DIR)/include +else +HOST_BUILD_DIR := $(OBJ_DIR) +HOST_OUTPUT_DIR := $(OUTPUT_DIR) +HOST_INCLUDE_DIR := $(INCLUDE_DIR) +endif +HOST_BPFOBJ := $(HOST_BUILD_DIR)/libbpf/libbpf.a +RESOLVE_BTFIDS := $(HOST_BUILD_DIR)/resolve_btfids/resolve_btfids +DEFAULT_BPFTOOL := $(HOST_OUTPUT_DIR)/sbin/bpftool + +VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \ + $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \ + ../../vmlinux \ + /sys/kernel/btf/vmlinux \ + /boot/vmlinux-$(shell uname -r) +VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS)))) +ifeq ($(VMLINUX_BTF),) +$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)") +endif + +BPFTOOL ?= $(DEFAULT_BPFTOOL) + +ifneq ($(wildcard $(GENHDR)),) + GENFLAGS := -DHAVE_GENHDR +endif + +CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \ + -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \ + -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include + +# Silence some warnings when compiled with clang +ifneq ($(LLVM),) +CFLAGS += -Wno-unused-command-line-argument +endif + +LDFLAGS = -lelf -lz -lpthread + +IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \ + grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__') + +# Get Clang's default includes on this system, as opposed to those seen by +# '-target bpf'. This fixes "missing" files on some architectures/distros, +# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. +define get_sys_includes +$(shell $(1) -v -E - </dev/null 2>&1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \ +$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}') +endef + +BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \ + $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \ + -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat \ + -I$(INCLUDE_DIR) -I$(APIDIR) \ + -I../../include \ + $(call get_sys_includes,$(CLANG)) \ + -Wall -Wno-compare-distinct-pointer-types \ + -O2 -mcpu=v3 + +# sort removes libbpf duplicates when not cross-building +MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \ + $(HOST_BUILD_DIR)/bpftool $(HOST_BUILD_DIR)/resolve_btfids \ + $(INCLUDE_DIR) $(SCXOBJ_DIR) $(BINDIR)) + +$(MAKE_DIRS): + $(call msg,MKDIR,,$@) + $(Q)mkdir -p $@ + +$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \ + $(APIDIR)/linux/bpf.h \ + | $(OBJ_DIR)/libbpf + $(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/ \ + EXTRA_CFLAGS='-g -O0 -fPIC' \ + DESTDIR=$(OUTPUT_DIR) prefix= all install_headers + +$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \ + $(HOST_BPFOBJ) | $(HOST_BUILD_DIR)/bpftool + $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \ + ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) \ + EXTRA_CFLAGS='-g -O0' \ + OUTPUT=$(HOST_BUILD_DIR)/bpftool/ \ + LIBBPF_OUTPUT=$(HOST_BUILD_DIR)/libbpf/ \ + LIBBPF_DESTDIR=$(HOST_OUTPUT_DIR)/ \ + prefix= DESTDIR=$(HOST_OUTPUT_DIR)/ install-bin + +$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR) +ifeq ($(VMLINUX_H),) + $(call msg,GEN,,$@) + $(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@ +else + $(call msg,CP,,$@) + $(Q)cp "$(VMLINUX_H)" $@ +endif + +$(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h include/scx/*.h \ + | $(BPFOBJ) $(SCXOBJ_DIR) + $(call msg,CLNG-BPF,,$(notdir $@)) + $(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@ + +$(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL) + $(eval sched=$(notdir $@)) + $(call msg,GEN-SKEL,,$(sched)) + $(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $< + $(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o) + $(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o) + $(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o) + $(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@ + $(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h) + +SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR) + +c-sched-targets = scx_simple scx_qmap + +$(addprefix $(BINDIR)/,$(c-sched-targets)): \ + $(BINDIR)/%: \ + $(filter-out %.bpf.c,%.c) \ + $(INCLUDE_DIR)/%.bpf.skel.h \ + $(SCX_COMMON_DEPS) + $(eval sched=$(notdir $@)) + $(CC) $(CFLAGS) -c $(sched).c -o $(SCXOBJ_DIR)/$(sched).o + $(CC) -o $@ $(SCXOBJ_DIR)/$(sched).o $(HOST_BPFOBJ) $(LDFLAGS) + +$(c-sched-targets): %: $(BINDIR)/% + +install: all + $(Q)mkdir -p $(DESTDIR)/usr/local/bin/ + $(Q)cp $(BINDIR)/* $(DESTDIR)/usr/local/bin/ + +clean: + rm -rf $(OUTPUT_DIR) $(HOST_OUTPUT_DIR) + rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h + rm -f $(c-sched-targets) + +help: + @echo 'Building targets' + @echo '================' + @echo '' + @echo ' all - Compile all schedulers' + @echo '' + @echo 'Alternatively, you may compile individual schedulers:' + @echo '' + @printf ' %s\n' $(c-sched-targets) + @echo '' + @echo 'For any scheduler build target, you may specify an alternative' + @echo 'build output path with the O= environment variable. For example:' + @echo '' + @echo ' O=/tmp/sched_ext make all' + @echo '' + @echo 'will compile all schedulers, and emit the build artifacts to' + @echo '/tmp/sched_ext/build.' + @echo '' + @echo '' + @echo 'Installing targets' + @echo '==================' + @echo '' + @echo ' install - Compile and install all schedulers to /usr/bin.' + @echo ' You may specify the DESTDIR= environment variable' + @echo ' to indicate a prefix for /usr/bin. For example:' + @echo '' + @echo ' DESTDIR=/tmp/sched_ext make install' + @echo '' + @echo ' will build the schedulers in CWD/build, and' + @echo ' install the schedulers to /tmp/sched_ext/usr/bin.' + @echo '' + @echo '' + @echo 'Cleaning targets' + @echo '================' + @echo '' + @echo ' clean - Remove all generated files' + +all_targets: $(c-sched-targets) + +.PHONY: all all_targets $(c-sched-targets) clean help + +# delete failed targets +.DELETE_ON_ERROR: + +# keep intermediate (.bpf.skel.h, .bpf.o, etc) targets +.SECONDARY: diff --git a/tools/sched_ext/include/bpf-compat/gnu/stubs.h b/tools/sched_ext/include/bpf-compat/gnu/stubs.h new file mode 100644 index 000000000000..ad7d139ce907 --- /dev/null +++ b/tools/sched_ext/include/bpf-compat/gnu/stubs.h @@ -0,0 +1,11 @@ +/* + * Dummy gnu/stubs.h. clang can end up including /usr/include/gnu/stubs.h when + * compiling BPF files although its content doesn't play any role. The file in + * turn includes stubs-64.h or stubs-32.h depending on whether __x86_64__ is + * defined. When compiling a BPF source, __x86_64__ isn't set and thus + * stubs-32.h is selected. However, the file is not there if the system doesn't + * have 32bit glibc devel package installed leading to a build failure. + * + * The problem is worked around by making this file available in the include + * search paths before the system one when building BPF. + */ diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h new file mode 100644 index 000000000000..833fe1bdccf9 --- /dev/null +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -0,0 +1,379 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#ifndef __SCX_COMMON_BPF_H +#define __SCX_COMMON_BPF_H + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include <asm-generic/errno.h> +#include "user_exit_info.h" + +#define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */ +#define PF_KTHREAD 0x00200000 /* I am a kernel thread */ +#define PF_EXITING 0x00000004 +#define CLOCK_MONOTONIC 1 + +/* + * Earlier versions of clang/pahole lost upper 32bits in 64bit enums which can + * lead to really confusing misbehaviors. Let's trigger a build failure. + */ +static inline void ___vmlinux_h_sanity_check___(void) +{ + _Static_assert(SCX_DSQ_FLAG_BUILTIN, + "bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole"); +} + +s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym; +s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym; +void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym; +u32 scx_bpf_dispatch_nr_slots(void) __ksym; +void scx_bpf_dispatch_cancel(void) __ksym; +bool scx_bpf_consume(u64 dsq_id) __ksym; +s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym; +void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; +void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak; +void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym; +u32 scx_bpf_nr_cpu_ids(void) __ksym __weak; +const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak; +const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak; +void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak; +const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym; +const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym; +void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym; +bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym; +s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; +s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; +bool scx_bpf_task_running(const struct task_struct *p) __ksym; +s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym; + +static inline __attribute__((format(printf, 1, 2))) +void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {} + +/* + * Helper macro for initializing the fmt and variadic argument inputs to both + * bstr exit kfuncs. Callers to this function should use ___fmt and ___param to + * refer to the initialized list of inputs to the bstr kfunc. + */ +#define scx_bpf_bstr_preamble(fmt, args...) \ + static char ___fmt[] = fmt; \ + /* \ + * Note that __param[] must have at least one \ + * element to keep the verifier happy. \ + */ \ + unsigned long long ___param[___bpf_narg(args) ?: 1] = {}; \ + \ + _Pragma("GCC diagnostic push") \ + _Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \ + ___bpf_fill(___param, args); \ + _Pragma("GCC diagnostic pop") \ + +/* + * scx_bpf_exit() wraps the scx_bpf_exit_bstr() kfunc with variadic arguments + * instead of an array of u64. Using this macro will cause the scheduler to + * exit cleanly with the specified exit code being passed to user space. + */ +#define scx_bpf_exit(code, fmt, args...) \ +({ \ + scx_bpf_bstr_preamble(fmt, args) \ + scx_bpf_exit_bstr(code, ___fmt, ___param, sizeof(___param)); \ + ___scx_bpf_bstr_format_checker(fmt, ##args); \ +}) + +/* + * scx_bpf_error() wraps the scx_bpf_error_bstr() kfunc with variadic arguments + * instead of an array of u64. Invoking this macro will cause the scheduler to + * exit in an erroneous state, with diagnostic information being passed to the + * user. + */ +#define scx_bpf_error(fmt, args...) \ +({ \ + scx_bpf_bstr_preamble(fmt, args) \ + scx_bpf_error_bstr(___fmt, ___param, sizeof(___param)); \ + ___scx_bpf_bstr_format_checker(fmt, ##args); \ +}) + +#define BPF_STRUCT_OPS(name, args...) \ +SEC("struct_ops/"#name) \ +BPF_PROG(name, ##args) + +#define BPF_STRUCT_OPS_SLEEPABLE(name, args...) \ +SEC("struct_ops.s/"#name) \ +BPF_PROG(name, ##args) + +/** + * RESIZABLE_ARRAY - Generates annotations for an array that may be resized + * @elfsec: the data section of the BPF program in which to place the array + * @arr: the name of the array + * + * libbpf has an API for setting map value sizes. Since data sections (i.e. + * bss, data, rodata) themselves are maps, a data section can be resized. If + * a data section has an array as its last element, the BTF info for that + * array will be adjusted so that length of the array is extended to meet the + * new length of the data section. This macro annotates an array to have an + * element count of one with the assumption that this array can be resized + * within the userspace program. It also annotates the section specifier so + * this array exists in a custom sub data section which can be resized + * independently. + * + * See RESIZE_ARRAY() for the userspace convenience macro for resizing an + * array declared with RESIZABLE_ARRAY(). + */ +#define RESIZABLE_ARRAY(elfsec, arr) arr[1] SEC("."#elfsec"."#arr) + +/** + * MEMBER_VPTR - Obtain the verified pointer to a struct or array member + * @base: struct or array to index + * @member: dereferenced member (e.g. .field, [idx0][idx1], .field[idx0] ...) + * + * The verifier often gets confused by the instruction sequence the compiler + * generates for indexing struct fields or arrays. This macro forces the + * compiler to generate a code sequence which first calculates the byte offset, + * checks it against the struct or array size and add that byte offset to + * generate the pointer to the member to help the verifier. + * + * Ideally, we want to abort if the calculated offset is out-of-bounds. However, + * BPF currently doesn't support abort, so evaluate to %NULL instead. The caller + * must check for %NULL and take appropriate action to appease the verifier. To + * avoid confusing the verifier, it's best to check for %NULL and dereference + * immediately. + * + * vptr = MEMBER_VPTR(my_array, [i][j]); + * if (!vptr) + * return error; + * *vptr = new_value; + * + * sizeof(@base) should encompass the memory area to be accessed and thus can't + * be a pointer to the area. Use `MEMBER_VPTR(*ptr, .member)` instead of + * `MEMBER_VPTR(ptr, ->member)`. + */ +#define MEMBER_VPTR(base, member) (typeof((base) member) *) \ +({ \ + u64 __base = (u64)&(base); \ + u64 __addr = (u64)&((base) member) - __base; \ + _Static_assert(sizeof(base) >= sizeof((base) member), \ + "@base is smaller than @member, is @base a pointer?"); \ + asm volatile ( \ + "if %0 <= %[max] goto +2\n" \ + "%0 = 0\n" \ + "goto +1\n" \ + "%0 += %1\n" \ + : "+r"(__addr) \ + : "r"(__base), \ + [max]"i"(sizeof(base) - sizeof((base) member))); \ + __addr; \ +}) + +/** + * ARRAY_ELEM_PTR - Obtain the verified pointer to an array element + * @arr: array to index into + * @i: array index + * @n: number of elements in array + * + * Similar to MEMBER_VPTR() but is intended for use with arrays where the + * element count needs to be explicit. + * It can be used in cases where a global array is defined with an initial + * size but is intended to be be resized before loading the BPF program. + * Without this version of the macro, MEMBER_VPTR() will use the compile time + * size of the array to compute the max, which will result in rejection by + * the verifier. + */ +#define ARRAY_ELEM_PTR(arr, i, n) (typeof(arr[i]) *) \ +({ \ + u64 __base = (u64)arr; \ + u64 __addr = (u64)&(arr[i]) - __base; \ + asm volatile ( \ + "if %0 <= %[max] goto +2\n" \ + "%0 = 0\n" \ + "goto +1\n" \ + "%0 += %1\n" \ + : "+r"(__addr) \ + : "r"(__base), \ + [max]"r"(sizeof(arr[0]) * ((n) - 1))); \ + __addr; \ +}) + + +/* + * BPF declarations and helpers + */ + +/* list and rbtree */ +#define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node))) +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8))) + +void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym; +void bpf_obj_drop_impl(void *kptr, void *meta) __ksym; + +#define bpf_obj_new(type) ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL)) +#define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL) + +void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node) __ksym; +void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node) __ksym; +struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym; +struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym; +struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root, + struct bpf_rb_node *node) __ksym; +int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node, + bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b), + void *meta, __u64 off) __ksym; +#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0) + +struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym; + +void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym; +#define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL) + +/* task */ +struct task_struct *bpf_task_from_pid(s32 pid) __ksym; +struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym; +void bpf_task_release(struct task_struct *p) __ksym; + +/* cgroup */ +struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym; +void bpf_cgroup_release(struct cgroup *cgrp) __ksym; +struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; + +/* css iteration */ +struct bpf_iter_css; +struct cgroup_subsys_state; +extern int bpf_iter_css_new(struct bpf_iter_css *it, + struct cgroup_subsys_state *start, + unsigned int flags) __weak __ksym; +extern struct cgroup_subsys_state * +bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym; +extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym; + +/* cpumask */ +struct bpf_cpumask *bpf_cpumask_create(void) __ksym; +struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask) __ksym; +void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym; +u32 bpf_cpumask_first(const struct cpumask *cpumask) __ksym; +u32 bpf_cpumask_first_zero(const struct cpumask *cpumask) __ksym; +void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; +void bpf_cpumask_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; +bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym; +bool bpf_cpumask_test_and_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; +bool bpf_cpumask_test_and_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; +void bpf_cpumask_setall(struct bpf_cpumask *cpumask) __ksym; +void bpf_cpumask_clear(struct bpf_cpumask *cpumask) __ksym; +bool bpf_cpumask_and(struct bpf_cpumask *dst, const struct cpumask *src1, + const struct cpumask *src2) __ksym; +void bpf_cpumask_or(struct bpf_cpumask *dst, const struct cpumask *src1, + const struct cpumask *src2) __ksym; +void bpf_cpumask_xor(struct bpf_cpumask *dst, const struct cpumask *src1, + const struct cpumask *src2) __ksym; +bool bpf_cpumask_equal(const struct cpumask *src1, const struct cpumask *src2) __ksym; +bool bpf_cpumask_intersects(const struct cpumask *src1, const struct cpumask *src2) __ksym; +bool bpf_cpumask_subset(const struct cpumask *src1, const struct cpumask *src2) __ksym; +bool bpf_cpumask_empty(const struct cpumask *cpumask) __ksym; +bool bpf_cpumask_full(const struct cpumask *cpumask) __ksym; +void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym; +u32 bpf_cpumask_any_distribute(const struct cpumask *cpumask) __ksym; +u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, + const struct cpumask *src2) __ksym; + +/* rcu */ +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; + + +/* + * Other helpers + */ + +/* useful compiler attributes */ +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) +#define __maybe_unused __attribute__((__unused__)) + +/* + * READ/WRITE_ONCE() are from kernel (include/asm-generic/rwonce.h). They + * prevent compiler from caching, redoing or reordering reads or writes. + */ +typedef __u8 __attribute__((__may_alias__)) __u8_alias_t; +typedef __u16 __attribute__((__may_alias__)) __u16_alias_t; +typedef __u32 __attribute__((__may_alias__)) __u32_alias_t; +typedef __u64 __attribute__((__may_alias__)) __u64_alias_t; + +static __always_inline void __read_once_size(const volatile void *p, void *res, int size) +{ + switch (size) { + case 1: *(__u8_alias_t *) res = *(volatile __u8_alias_t *) p; break; + case 2: *(__u16_alias_t *) res = *(volatile __u16_alias_t *) p; break; + case 4: *(__u32_alias_t *) res = *(volatile __u32_alias_t *) p; break; + case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break; + default: + barrier(); + __builtin_memcpy((void *)res, (const void *)p, size); + barrier(); + } +} + +static __always_inline void __write_once_size(volatile void *p, void *res, int size) +{ + switch (size) { + case 1: *(volatile __u8_alias_t *) p = *(__u8_alias_t *) res; break; + case 2: *(volatile __u16_alias_t *) p = *(__u16_alias_t *) res; break; + case 4: *(volatile __u32_alias_t *) p = *(__u32_alias_t *) res; break; + case 8: *(volatile __u64_alias_t *) p = *(__u64_alias_t *) res; break; + default: + barrier(); + __builtin_memcpy((void *)p, (const void *)res, size); + barrier(); + } +} + +#define READ_ONCE(x) \ +({ \ + union { typeof(x) __val; char __c[1]; } __u = \ + { .__c = { 0 } }; \ + __read_once_size(&(x), __u.__c, sizeof(x)); \ + __u.__val; \ +}) + +#define WRITE_ONCE(x, val) \ +({ \ + union { typeof(x) __val; char __c[1]; } __u = \ + { .__val = (val) }; \ + __write_once_size(&(x), __u.__c, sizeof(x)); \ + __u.__val; \ +}) + +/* + * log2_u32 - Compute the base 2 logarithm of a 32-bit exponential value. + * @v: The value for which we're computing the base 2 logarithm. + */ +static inline u32 log2_u32(u32 v) +{ + u32 r; + u32 shift; + + r = (v > 0xFFFF) << 4; v >>= r; + shift = (v > 0xFF) << 3; v >>= shift; r |= shift; + shift = (v > 0xF) << 2; v >>= shift; r |= shift; + shift = (v > 0x3) << 1; v >>= shift; r |= shift; + r |= (v >> 1); + return r; +} + +/* + * log2_u64 - Compute the base 2 logarithm of a 64-bit exponential value. + * @v: The value for which we're computing the base 2 logarithm. + */ +static inline u32 log2_u64(u64 v) +{ + u32 hi = v >> 32; + if (hi) + return log2_u32(hi) + 32 + 1; + else + return log2_u32(v) + 1; +} + +#include "compat.bpf.h" + +#endif /* __SCX_COMMON_BPF_H */ diff --git a/tools/sched_ext/include/scx/common.h b/tools/sched_ext/include/scx/common.h new file mode 100644 index 000000000000..5b0f90152152 --- /dev/null +++ b/tools/sched_ext/include/scx/common.h @@ -0,0 +1,75 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + */ +#ifndef __SCHED_EXT_COMMON_H +#define __SCHED_EXT_COMMON_H + +#ifdef __KERNEL__ +#error "Should not be included by BPF programs" +#endif + +#include <stdarg.h> +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <errno.h> + +typedef uint8_t u8; +typedef uint16_t u16; +typedef uint32_t u32; +typedef uint64_t u64; +typedef int8_t s8; +typedef int16_t s16; +typedef int32_t s32; +typedef int64_t s64; + +#define SCX_BUG(__fmt, ...) \ + do { \ + fprintf(stderr, "[SCX_BUG] %s:%d", __FILE__, __LINE__); \ + if (errno) \ + fprintf(stderr, " (%s)\n", strerror(errno)); \ + else \ + fprintf(stderr, "\n"); \ + fprintf(stderr, __fmt __VA_OPT__(,) __VA_ARGS__); \ + fprintf(stderr, "\n"); \ + \ + exit(EXIT_FAILURE); \ + } while (0) + +#define SCX_BUG_ON(__cond, __fmt, ...) \ + do { \ + if (__cond) \ + SCX_BUG((__fmt) __VA_OPT__(,) __VA_ARGS__); \ + } while (0) + +/** + * RESIZE_ARRAY - Convenience macro for resizing a BPF array + * @__skel: the skeleton containing the array + * @elfsec: the data section of the BPF program in which the array exists + * @arr: the name of the array + * @n: the desired array element count + * + * For BPF arrays declared with RESIZABLE_ARRAY(), this macro performs two + * operations. It resizes the map which corresponds to the custom data + * section that contains the target array. As a side effect, the BTF info for + * the array is adjusted so that the array length is sized to cover the new + * data section size. The second operation is reassigning the skeleton pointer + * for that custom data section so that it points to the newly memory mapped + * region. + */ +#define RESIZE_ARRAY(__skel, elfsec, arr, n) \ + do { \ + size_t __sz; \ + bpf_map__set_value_size((__skel)->maps.elfsec##_##arr, \ + sizeof((__skel)->elfsec##_##arr->arr[0]) * (n)); \ + (__skel)->elfsec##_##arr = \ + bpf_map__initial_value((__skel)->maps.elfsec##_##arr, &__sz); \ + } while (0) + +#include "user_exit_info.h" +#include "compat.h" + +#endif /* __SCHED_EXT_COMMON_H */ diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h new file mode 100644 index 000000000000..3d2fe1208900 --- /dev/null +++ b/tools/sched_ext/include/scx/compat.bpf.h @@ -0,0 +1,28 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#ifndef __SCX_COMPAT_BPF_H +#define __SCX_COMPAT_BPF_H + +#define __COMPAT_ENUM_OR_ZERO(__type, __ent) \ +({ \ + __type __ret = 0; \ + if (bpf_core_enum_value_exists(__type, __ent)) \ + __ret = __ent; \ + __ret; \ +}) + +/* + * Define sched_ext_ops. This may be expanded to define multiple variants for + * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH(). + */ +#define SCX_OPS_DEFINE(__name, ...) \ + SEC(".struct_ops.link") \ + struct sched_ext_ops __name = { \ + __VA_ARGS__, \ + }; + +#endif /* __SCX_COMPAT_BPF_H */ diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h new file mode 100644 index 000000000000..a7fdaf8a858e --- /dev/null +++ b/tools/sched_ext/include/scx/compat.h @@ -0,0 +1,153 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#ifndef __SCX_COMPAT_H +#define __SCX_COMPAT_H + +#include <bpf/btf.h> + +struct btf *__COMPAT_vmlinux_btf __attribute__((weak)); + +static inline void __COMPAT_load_vmlinux_btf(void) +{ + if (!__COMPAT_vmlinux_btf) { + __COMPAT_vmlinux_btf = btf__load_vmlinux_btf(); + SCX_BUG_ON(!__COMPAT_vmlinux_btf, "btf__load_vmlinux_btf()"); + } +} + +static inline bool __COMPAT_read_enum(const char *type, const char *name, u64 *v) +{ + const struct btf_type *t; + const char *n; + s32 tid; + int i; + + __COMPAT_load_vmlinux_btf(); + + tid = btf__find_by_name(__COMPAT_vmlinux_btf, type); + if (tid < 0) + return false; + + t = btf__type_by_id(__COMPAT_vmlinux_btf, tid); + SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid); + + if (btf_is_enum(t)) { + struct btf_enum *e = btf_enum(t); + + for (i = 0; i < BTF_INFO_VLEN(t->info); i++) { + n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off); + SCX_BUG_ON(!n, "btf__name_by_offset()"); + if (!strcmp(n, name)) { + *v = e[i].val; + return true; + } + } + } else if (btf_is_enum64(t)) { + struct btf_enum64 *e = btf_enum64(t); + + for (i = 0; i < BTF_INFO_VLEN(t->info); i++) { + n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off); + SCX_BUG_ON(!n, "btf__name_by_offset()"); + if (!strcmp(n, name)) { + *v = btf_enum64_value(&e[i]); + return true; + } + } + } + + return false; +} + +#define __COMPAT_ENUM_OR_ZERO(__type, __ent) \ +({ \ + u64 __val = 0; \ + __COMPAT_read_enum(__type, __ent, &__val); \ + __val; \ +}) + +static inline bool __COMPAT_has_ksym(const char *ksym) +{ + __COMPAT_load_vmlinux_btf(); + return btf__find_by_name(__COMPAT_vmlinux_btf, ksym) >= 0; +} + +static inline bool __COMPAT_struct_has_field(const char *type, const char *field) +{ + const struct btf_type *t; + const struct btf_member *m; + const char *n; + s32 tid; + int i; + + __COMPAT_load_vmlinux_btf(); + tid = btf__find_by_name_kind(__COMPAT_vmlinux_btf, type, BTF_KIND_STRUCT); + if (tid < 0) + return false; + + t = btf__type_by_id(__COMPAT_vmlinux_btf, tid); + SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid); + + m = btf_members(t); + + for (i = 0; i < BTF_INFO_VLEN(t->info); i++) { + n = btf__name_by_offset(__COMPAT_vmlinux_btf, m[i].name_off); + SCX_BUG_ON(!n, "btf__name_by_offset()"); + if (!strcmp(n, field)) + return true; + } + + return false; +} + +#define SCX_OPS_SWITCH_PARTIAL \ + __COMPAT_ENUM_OR_ZERO("scx_ops_flags", "SCX_OPS_SWITCH_PARTIAL") + +/* + * struct sched_ext_ops can change over time. If compat.bpf.h::SCX_OPS_DEFINE() + * is used to define ops and compat.h::SCX_OPS_LOAD/ATTACH() are used to load + * and attach it, backward compatibility is automatically maintained where + * reasonable. + */ +#define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \ + struct __scx_name *__skel; \ + \ + __skel = __scx_name##__open(); \ + SCX_BUG_ON(!__skel, "Could not open " #__scx_name); \ + __skel; \ +}) + +#define SCX_OPS_LOAD(__skel, __ops_name, __scx_name) ({ \ + SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel"); \ +}) + +/* + * New versions of bpftool now emit additional link placeholders for BPF maps, + * and set up BPF skeleton in such a way that libbpf will auto-attach BPF maps + * automatically, assumming libbpf is recent enough (v1.5+). Old libbpf will do + * nothing with those links and won't attempt to auto-attach maps. + * + * To maintain compatibility with older libbpf while avoiding trying to attach + * twice, disable the autoattach feature on newer libbpf. + */ +#if LIBBPF_MAJOR_VERSION > 1 || \ + (LIBBPF_MAJOR_VERSION == 1 && LIBBPF_MINOR_VERSION >= 5) +#define __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name) \ + bpf_map__set_autoattach((__skel)->maps.__ops_name, false) +#else +#define __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name) do {} while (0) +#endif + +#define SCX_OPS_ATTACH(__skel, __ops_name, __scx_name) ({ \ + struct bpf_link *__link; \ + __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name); \ + SCX_BUG_ON(__scx_name##__attach((__skel)), "Failed to attach skel"); \ + __link = bpf_map__attach_struct_ops((__skel)->maps.__ops_name); \ + SCX_BUG_ON(!__link, "Failed to attach struct_ops"); \ + __link; \ +}) + +#endif /* __SCX_COMPAT_H */ diff --git a/tools/sched_ext/include/scx/user_exit_info.h b/tools/sched_ext/include/scx/user_exit_info.h new file mode 100644 index 000000000000..8c3b7fac4d05 --- /dev/null +++ b/tools/sched_ext/include/scx/user_exit_info.h @@ -0,0 +1,64 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Define struct user_exit_info which is shared between BPF and userspace parts + * to communicate exit status and other information. + * + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#ifndef __USER_EXIT_INFO_H +#define __USER_EXIT_INFO_H + +enum uei_sizes { + UEI_REASON_LEN = 128, + UEI_MSG_LEN = 1024, +}; + +struct user_exit_info { + int kind; + s64 exit_code; + char reason[UEI_REASON_LEN]; + char msg[UEI_MSG_LEN]; +}; + +#ifdef __bpf__ + +#include "vmlinux.h" +#include <bpf/bpf_core_read.h> + +#define UEI_DEFINE(__name) \ + struct user_exit_info __name SEC(".data") + +#define UEI_RECORD(__uei_name, __ei) ({ \ + bpf_probe_read_kernel_str(__uei_name.reason, \ + sizeof(__uei_name.reason), (__ei)->reason); \ + bpf_probe_read_kernel_str(__uei_name.msg, \ + sizeof(__uei_name.msg), (__ei)->msg); \ + if (bpf_core_field_exists((__ei)->exit_code)) \ + __uei_name.exit_code = (__ei)->exit_code; \ + /* use __sync to force memory barrier */ \ + __sync_val_compare_and_swap(&__uei_name.kind, __uei_name.kind, \ + (__ei)->kind); \ +}) + +#else /* !__bpf__ */ + +#include <stdio.h> +#include <stdbool.h> + +#define UEI_EXITED(__skel, __uei_name) ({ \ + /* use __sync to force memory barrier */ \ + __sync_val_compare_and_swap(&(__skel)->data->__uei_name.kind, -1, -1); \ +}) + +#define UEI_REPORT(__skel, __uei_name) ({ \ + struct user_exit_info *__uei = &(__skel)->data->__uei_name; \ + fprintf(stderr, "EXIT: %s", __uei->reason); \ + if (__uei->msg[0] != '\0') \ + fprintf(stderr, " (%s)", __uei->msg); \ + fputs("\n", stderr); \ +}) + +#endif /* __bpf__ */ +#endif /* __USER_EXIT_INFO_H */ diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c new file mode 100644 index 000000000000..976a2693da71 --- /dev/null +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -0,0 +1,264 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A simple five-level FIFO queue scheduler. + * + * There are five FIFOs implemented using BPF_MAP_TYPE_QUEUE. A task gets + * assigned to one depending on its compound weight. Each CPU round robins + * through the FIFOs and dispatches more from FIFOs with higher indices - 1 from + * queue0, 2 from queue1, 4 from queue2 and so on. + * + * This scheduler demonstrates: + * + * - BPF-side queueing using PIDs. + * - Sleepable per-task storage allocation using ops.prep_enable(). + * + * This scheduler is primarily for demonstration and testing of sched_ext + * features and unlikely to be useful for actual workloads. + * + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#include <scx/common.bpf.h> + +enum consts { + ONE_SEC_IN_NS = 1000000000, + SHARED_DSQ = 0, +}; + +char _license[] SEC("license") = "GPL"; + +const volatile u64 slice_ns = SCX_SLICE_DFL; +const volatile u32 dsp_batch; + +u32 test_error_cnt; + +UEI_DEFINE(uei); + +struct qmap { + __uint(type, BPF_MAP_TYPE_QUEUE); + __uint(max_entries, 4096); + __type(value, u32); +} queue0 SEC(".maps"), + queue1 SEC(".maps"), + queue2 SEC(".maps"), + queue3 SEC(".maps"), + queue4 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); + __uint(max_entries, 5); + __type(key, int); + __array(values, struct qmap); +} queue_arr SEC(".maps") = { + .values = { + [0] = &queue0, + [1] = &queue1, + [2] = &queue2, + [3] = &queue3, + [4] = &queue4, + }, +}; + +/* Per-task scheduling context */ +struct task_ctx { + bool force_local; /* Dispatch directly to local_dsq */ +}; + +struct { + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct task_ctx); +} task_ctx_stor SEC(".maps"); + +struct cpu_ctx { + u64 dsp_idx; /* dispatch index */ + u64 dsp_cnt; /* remaining count */ +}; + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(max_entries, 1); + __type(key, u32); + __type(value, struct cpu_ctx); +} cpu_ctx_stor SEC(".maps"); + +/* Statistics */ +u64 nr_enqueued, nr_dispatched, nr_dequeued; + +s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + struct task_ctx *tctx; + s32 cpu; + + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); + if (!tctx) { + scx_bpf_error("task_ctx lookup failed"); + return -ESRCH; + } + + if (p->nr_cpus_allowed == 1 || + scx_bpf_test_and_clear_cpu_idle(prev_cpu)) { + tctx->force_local = true; + return prev_cpu; + } + + cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); + if (cpu >= 0) + return cpu; + + return prev_cpu; +} + +static int weight_to_idx(u32 weight) +{ + /* Coarsely map the compound weight to a FIFO. */ + if (weight <= 25) + return 0; + else if (weight <= 50) + return 1; + else if (weight < 200) + return 2; + else if (weight < 400) + return 3; + else + return 4; +} + +void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) +{ + struct task_ctx *tctx; + u32 pid = p->pid; + int idx = weight_to_idx(p->scx.weight); + void *ring; + + if (test_error_cnt && !--test_error_cnt) + scx_bpf_error("test triggering error"); + + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); + if (!tctx) { + scx_bpf_error("task_ctx lookup failed"); + return; + } + + /* Is select_cpu() is telling us to enqueue locally? */ + if (tctx->force_local) { + tctx->force_local = false; + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags); + return; + } + + ring = bpf_map_lookup_elem(&queue_arr, &idx); + if (!ring) { + scx_bpf_error("failed to find ring %d", idx); + return; + } + + /* Queue on the selected FIFO. If the FIFO overflows, punt to global. */ + if (bpf_map_push_elem(ring, &pid, 0)) { + scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, enq_flags); + return; + } + + __sync_fetch_and_add(&nr_enqueued, 1); +} + +/* + * The BPF queue map doesn't support removal and sched_ext can handle spurious + * dispatches. qmap_dequeue() is only used to collect statistics. + */ +void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags) +{ + __sync_fetch_and_add(&nr_dequeued, 1); +} + +void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) +{ + struct task_struct *p; + struct cpu_ctx *cpuc; + u32 zero = 0, batch = dsp_batch ?: 1; + void *fifo; + s32 i, pid; + + if (scx_bpf_consume(SHARED_DSQ)) + return; + + if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) { + scx_bpf_error("failed to look up cpu_ctx"); + return; + } + + for (i = 0; i < 5; i++) { + /* Advance the dispatch cursor and pick the fifo. */ + if (!cpuc->dsp_cnt) { + cpuc->dsp_idx = (cpuc->dsp_idx + 1) % 5; + cpuc->dsp_cnt = 1 << cpuc->dsp_idx; + } + + fifo = bpf_map_lookup_elem(&queue_arr, &cpuc->dsp_idx); + if (!fifo) { + scx_bpf_error("failed to find ring %llu", cpuc->dsp_idx); + return; + } + + /* Dispatch or advance. */ + bpf_repeat(BPF_MAX_LOOPS) { + if (bpf_map_pop_elem(fifo, &pid)) + break; + + p = bpf_task_from_pid(pid); + if (!p) + continue; + + __sync_fetch_and_add(&nr_dispatched, 1); + scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0); + bpf_task_release(p); + batch--; + cpuc->dsp_cnt--; + if (!batch || !scx_bpf_dispatch_nr_slots()) { + scx_bpf_consume(SHARED_DSQ); + return; + } + if (!cpuc->dsp_cnt) + break; + } + + cpuc->dsp_cnt = 0; + } +} + +s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + /* + * @p is new. Let's ensure that its task_ctx is available. We can sleep + * in this function and the following will automatically use GFP_KERNEL. + */ + if (bpf_task_storage_get(&task_ctx_stor, p, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE)) + return 0; + else + return -ENOMEM; +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) +{ + return scx_bpf_create_dsq(SHARED_DSQ, -1); +} + +void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SCX_OPS_DEFINE(qmap_ops, + .select_cpu = (void *)qmap_select_cpu, + .enqueue = (void *)qmap_enqueue, + .dequeue = (void *)qmap_dequeue, + .dispatch = (void *)qmap_dispatch, + .init_task = (void *)qmap_init_task, + .init = (void *)qmap_init, + .exit = (void *)qmap_exit, + .name = "qmap"); diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c new file mode 100644 index 000000000000..7c84ade7ecfb --- /dev/null +++ b/tools/sched_ext/scx_qmap.c @@ -0,0 +1,99 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <inttypes.h> +#include <signal.h> +#include <libgen.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include "scx_qmap.bpf.skel.h" + +const char help_fmt[] = +"A simple five-level FIFO queue sched_ext scheduler.\n" +"\n" +"See the top-level comment in .bpf.c for more details.\n" +"\n" +"Usage: %s [-s SLICE_US] [-e COUNT] [-b COUNT] [-p] [-v]\n" +"\n" +" -s SLICE_US Override slice duration\n" +" -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" +" -b COUNT Dispatch upto COUNT tasks together\n" +" -p Switch only tasks on SCHED_EXT policy intead of all\n" +" -v Print libbpf debug messages\n" +" -h Display this help and exit\n"; + +static bool verbose; +static volatile int exit_req; + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !verbose) + return 0; + return vfprintf(stderr, format, args); +} + +static void sigint_handler(int dummy) +{ + exit_req = 1; +} + +int main(int argc, char **argv) +{ + struct scx_qmap *skel; + struct bpf_link *link; + int opt; + + libbpf_set_print(libbpf_print_fn); + signal(SIGINT, sigint_handler); + signal(SIGTERM, sigint_handler); + + skel = SCX_OPS_OPEN(qmap_ops, scx_qmap); + + while ((opt = getopt(argc, argv, "s:e:b:pvh")) != -1) { + switch (opt) { + case 's': + skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; + break; + case 'e': + skel->bss->test_error_cnt = strtoul(optarg, NULL, 0); + break; + case 'b': + skel->rodata->dsp_batch = strtoul(optarg, NULL, 0); + break; + case 'p': + skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL; + break; + case 'v': + verbose = true; + break; + default: + fprintf(stderr, help_fmt, basename(argv[0])); + return opt != 'h'; + } + } + + SCX_OPS_LOAD(skel, qmap_ops, scx_qmap); + link = SCX_OPS_ATTACH(skel, qmap_ops, scx_qmap); + + while (!exit_req && !UEI_EXITED(skel, uei)) { + long nr_enqueued = skel->bss->nr_enqueued; + long nr_dispatched = skel->bss->nr_dispatched; + + printf("stats : enq=%lu dsp=%lu delta=%ld deq=%"PRIu64"\n", + nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, + skel->bss->nr_dequeued); + fflush(stdout); + sleep(1); + } + + bpf_link__destroy(link); + UEI_REPORT(skel, uei); + scx_qmap__destroy(skel); + return 0; +} diff --git a/tools/sched_ext/scx_simple.bpf.c b/tools/sched_ext/scx_simple.bpf.c new file mode 100644 index 000000000000..6bb13a3c801b --- /dev/null +++ b/tools/sched_ext/scx_simple.bpf.c @@ -0,0 +1,63 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A simple scheduler. + * + * A simple global FIFO scheduler. It also demonstrates the following niceties. + * + * - Statistics tracking how many tasks are queued to local and global dsq's. + * - Termination notification for userspace. + * + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(u32)); + __uint(value_size, sizeof(u64)); + __uint(max_entries, 2); /* [local, global] */ +} stats SEC(".maps"); + +static void stat_inc(u32 idx) +{ + u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx); + if (cnt_p) + (*cnt_p)++; +} + +s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) +{ + bool is_idle = false; + s32 cpu; + + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle); + if (is_idle) { + stat_inc(0); /* count local queueing */ + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); + } + + return cpu; +} + +void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) +{ + stat_inc(1); /* count global queueing */ + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); +} + +void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SCX_OPS_DEFINE(simple_ops, + .select_cpu = (void *)simple_select_cpu, + .enqueue = (void *)simple_enqueue, + .exit = (void *)simple_exit, + .name = "simple"); diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c new file mode 100644 index 000000000000..789ac62fea8e --- /dev/null +++ b/tools/sched_ext/scx_simple.c @@ -0,0 +1,99 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#include <stdio.h> +#include <unistd.h> +#include <signal.h> +#include <libgen.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include "scx_simple.bpf.skel.h" + +const char help_fmt[] = +"A simple sched_ext scheduler.\n" +"\n" +"See the top-level comment in .bpf.c for more details.\n" +"\n" +"Usage: %s [-v]\n" +"\n" +" -v Print libbpf debug messages\n" +" -h Display this help and exit\n"; + +static bool verbose; +static volatile int exit_req; + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !verbose) + return 0; + return vfprintf(stderr, format, args); +} + +static void sigint_handler(int simple) +{ + exit_req = 1; +} + +static void read_stats(struct scx_simple *skel, __u64 *stats) +{ + int nr_cpus = libbpf_num_possible_cpus(); + __u64 cnts[2][nr_cpus]; + __u32 idx; + + memset(stats, 0, sizeof(stats[0]) * 2); + + for (idx = 0; idx < 2; idx++) { + int ret, cpu; + + ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats), + &idx, cnts[idx]); + if (ret < 0) + continue; + for (cpu = 0; cpu < nr_cpus; cpu++) + stats[idx] += cnts[idx][cpu]; + } +} + +int main(int argc, char **argv) +{ + struct scx_simple *skel; + struct bpf_link *link; + __u32 opt; + + libbpf_set_print(libbpf_print_fn); + signal(SIGINT, sigint_handler); + signal(SIGTERM, sigint_handler); + + skel = SCX_OPS_OPEN(simple_ops, scx_simple); + + while ((opt = getopt(argc, argv, "vh")) != -1) { + switch (opt) { + case 'v': + verbose = true; + break; + default: + fprintf(stderr, help_fmt, basename(argv[0])); + return opt != 'h'; + } + } + + SCX_OPS_LOAD(skel, simple_ops, scx_simple); + link = SCX_OPS_ATTACH(skel, simple_ops, scx_simple); + + while (!exit_req && !UEI_EXITED(skel, uei)) { + __u64 stats[2]; + + read_stats(skel, stats); + printf("local=%llu global=%llu\n", stats[0], stats[1]); + fflush(stdout); + sleep(1); + } + + bpf_link__destroy(link); + UEI_REPORT(skel, uei); + scx_simple__destroy(skel); + return 0; +} -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 79e104400fc3b94a2d04c65d222a0726b3ae2382 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- This enables the admin to abort the BPF scheduler and revert to CFS anytime. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Conflicts: drivers/tty/sysrq.c [Only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- drivers/tty/sysrq.c | 1 + kernel/sched/build_policy.c | 1 + kernel/sched/ext.c | 20 ++++++++++++++++++++ 3 files changed, 22 insertions(+) diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index 02c0a7b935b2..de447b3f061d 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -520,6 +520,7 @@ static const struct sysrq_key_op *sysrq_key_table[62] = { NULL, /* P */ NULL, /* Q */ NULL, /* R */ + /* S: May be registered by sched_ext for resetting */ NULL, /* S */ NULL, /* T */ NULL, /* U */ diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c index a0f9faaca01c..4ae066f08cc9 100644 --- a/kernel/sched/build_policy.c +++ b/kernel/sched/build_policy.c @@ -32,6 +32,7 @@ #include <linux/suspend.h> #include <linux/tsacct_kern.h> #include <linux/vtime.h> +#include <linux/sysrq.h> #include <linux/percpu-rwsem.h> #include <uapi/linux/sched/types.h> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 79c7fd88db24..b8584b1ba80a 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -24,6 +24,7 @@ enum scx_exit_kind { SCX_EXIT_UNREG = 64, /* user-space initiated unregistration */ SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */ SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */ + SCX_EXIT_SYSRQ, /* requested by 'S' sysrq */ SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */ SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */ @@ -2781,6 +2782,8 @@ static const char *scx_exit_reason(enum scx_exit_kind kind) return "Scheduler unregistered from BPF"; case SCX_EXIT_UNREG_KERN: return "Scheduler unregistered from the main kernel"; + case SCX_EXIT_SYSRQ: + return "disabled by sysrq-S"; case SCX_EXIT_ERROR: return "runtime error"; case SCX_EXIT_ERROR_BPF: @@ -3531,6 +3534,21 @@ static struct bpf_struct_ops bpf_sched_ext_ops = { * System integration and init. */ +static void sysrq_handle_sched_ext_reset(u8 key) +{ + if (scx_ops_helper) + scx_ops_disable(SCX_EXIT_SYSRQ); + else + pr_info("sched_ext: BPF scheduler not yet used\n"); +} + +static const struct sysrq_key_op sysrq_sched_ext_reset_op = { + .handler = sysrq_handle_sched_ext_reset, + .help_msg = "reset-sched-ext(S)", + .action_msg = "Disable sched_ext and revert all tasks to CFS", + .enable_mask = SYSRQ_ENABLE_RTNICE, +}; + void __init init_sched_ext_class(void) { s32 cpu, v; @@ -3554,6 +3572,8 @@ void __init init_sched_ext_class(void) init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); INIT_LIST_HEAD(&rq->scx.runnable_list); } + + register_sysrq_key('S', &sysrq_sched_ext_reset_op); } -- 2.34.1
From: David Vernet <dvernet@meta.com> mainline inclusion from mainline-v6.12-rc1 commit 8a010b81b3a50b033fc3cddc613517abda586cbe category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The most common and critical way that a BPF scheduler can misbehave is by failing to run runnable tasks for too long. This patch implements a watchdog. * All tasks record when they become runnable. * A watchdog work periodically scans all runnable tasks. If any task has stayed runnable for too long, the BPF scheduler is aborted. * scheduler_tick() monitors whether the watchdog itself is stuck. If so, the BPF scheduler is aborted. Because the watchdog only scans the tasks which are currently runnable and usually very infrequently, the overhead should be negligible. scx_qmap is updated so that it can be told to stall user and/or kernel tasks. A detected task stall looks like the following: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s) scx_check_timeout_workfn+0x10e/0x1b0 process_one_work+0x287/0x560 worker_thread+0x234/0x420 kthread+0xe9/0x100 ret_from_fork+0x1f/0x30 A detected watchdog stall: sched_ext: BPF scheduler "qmap" errored, disabling sched_ext: runnable task stall (watchdog failed to check in for 5.001s) scheduler_tick+0x2eb/0x340 update_process_times+0x7a/0x90 tick_sched_timer+0xd8/0x130 __hrtimer_run_queues+0x178/0x3b0 hrtimer_interrupt+0xfc/0x390 __sysvec_apic_timer_interrupt+0xb7/0x2b0 sysvec_apic_timer_interrupt+0x90/0xb0 asm_sysvec_apic_timer_interrupt+0x1b/0x20 default_idle+0x14/0x20 arch_cpu_idle+0xf/0x20 default_idle_call+0x50/0x90 do_idle+0xe8/0x240 cpu_startup_entry+0x1d/0x20 kernel_init+0x0/0x190 start_kernel+0x0/0x392 start_kernel+0x324/0x392 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x104/0x109 secondary_startup_64_no_verify+0xce/0xdb Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to inline scx_notify_sched_tick(). v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was being called before forward progress was guaranteed and thus could lead to system lockup. Relocated. - While enabling, it was comparing msecs against jiffies without conversion leading to spurious load failures on lower HZ kernels. Fixed. - runnable list management is now used by core bypass logic and moved to the patch implementing sched_ext core. v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without conversion leading to spurious load failures in lower HZ kernels. Fixed. v2: - Julia Lawall noticed that the watchdog code was mixing msecs and jiffies. Fix by using jiffies for everything. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 1 + init/init_task.c | 1 + kernel/sched/core.c | 1 + kernel/sched/ext.c | 130 ++++++++++++++++++++++++++++++++- kernel/sched/ext.h | 2 + tools/sched_ext/scx_qmap.bpf.c | 12 +++ tools/sched_ext/scx_qmap.c | 12 ++- 7 files changed, 153 insertions(+), 6 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index c1530a7992cc..96031252436f 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -122,6 +122,7 @@ struct sched_ext_entity { atomic_long_t ops_state; struct list_head runnable_node; /* rq->scx.runnable_list */ + unsigned long runnable_at; u64 ddsp_dsq_id; u64 ddsp_enq_flags; diff --git a/init/init_task.c b/init/init_task.c index df8193e73c47..b86c0a995136 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -119,6 +119,7 @@ struct task_struct init_task .sticky_cpu = -1, .holding_cpu = -1, .runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node), + .runnable_at = INITIAL_JIFFIES, .ddsp_dsq_id = SCX_DSQ_INVALID, .slice = SCX_SLICE_DFL, }, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 57828b83372e..74cc29e195ec 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5581,6 +5581,7 @@ void scheduler_tick(void) calc_global_load_tick(rq); sched_core_tick(rq); task_tick_mm_cid(rq, curr); + scx_tick(rq); rq_unlock(rq, &rf); diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index b8584b1ba80a..17e8590d6c88 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -12,6 +12,7 @@ enum scx_consts { SCX_DSP_DFL_MAX_BATCH = 32, + SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ, SCX_EXIT_BT_LEN = 64, SCX_EXIT_MSG_LEN = 1024, @@ -28,6 +29,7 @@ enum scx_exit_kind { SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */ SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */ + SCX_EXIT_ERROR_STALL, /* watchdog detected stalled runnable tasks */ }; /* @@ -323,6 +325,15 @@ struct sched_ext_ops { */ u64 flags; + /** + * timeout_ms - The maximum amount of time, in milliseconds, that a + * runnable task should be able to wait before being scheduled. The + * maximum timeout may not exceed the default timeout of 30 seconds. + * + * Defaults to the maximum allowed timeout value of 30 seconds. + */ + u32 timeout_ms; + /** * name - BPF scheduler's name * @@ -476,6 +487,23 @@ struct static_key_false scx_has_op[SCX_OPI_END] = static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE); static struct scx_exit_info *scx_exit_info; +/* + * The maximum amount of time in jiffies that a task may be runnable without + * being scheduled on a CPU. If this timeout is exceeded, it will trigger + * scx_ops_error(). + */ +static unsigned long scx_watchdog_timeout; + +/* + * The last time the delayed work was run. This delayed work relies on + * ksoftirqd being able to run to service timer interrupts, so it's possible + * that this work itself could get wedged. To account for this, we check that + * it's not stalled in the timer tick, and trigger an error if it is. + */ +static unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES; + +static struct delayed_work scx_watchdog_work; + /* idle tracking */ #ifdef CONFIG_SMP #ifdef CONFIG_CPUMASK_OFFSTACK @@ -1174,6 +1202,11 @@ static void set_task_runnable(struct rq *rq, struct task_struct *p) { lockdep_assert_rq_held(rq); + if (p->scx.flags & SCX_TASK_RESET_RUNNABLE_AT) { + p->scx.runnable_at = jiffies; + p->scx.flags &= ~SCX_TASK_RESET_RUNNABLE_AT; + } + /* * list_add_tail() must be used. scx_ops_bypass() depends on tasks being * appened to the runnable_list. @@ -1181,9 +1214,11 @@ static void set_task_runnable(struct rq *rq, struct task_struct *p) list_add_tail(&p->scx.runnable_node, &rq->scx.runnable_list); } -static void clr_task_runnable(struct task_struct *p) +static void clr_task_runnable(struct task_struct *p, bool reset_runnable_at) { list_del_init(&p->scx.runnable_node); + if (reset_runnable_at) + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; } static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags) @@ -1221,7 +1256,8 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags) { unsigned long opss; - clr_task_runnable(p); + /* dequeue is always temporary, don't reset runnable_at */ + clr_task_runnable(p, false); /* acquire ensures that we see the preceding updates on QUEUED */ opss = atomic_long_read_acquire(&p->scx.ops_state); @@ -1831,7 +1867,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) p->se.exec_start = rq_clock_task(rq); - clr_task_runnable(p); + clr_task_runnable(p, true); } static void put_prev_task_scx(struct rq *rq, struct task_struct *p) @@ -2181,9 +2217,71 @@ static void reset_idle_masks(void) {} #endif /* CONFIG_SMP */ -static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) +static bool check_rq_for_timeouts(struct rq *rq) +{ + struct task_struct *p; + struct rq_flags rf; + bool timed_out = false; + + rq_lock_irqsave(rq, &rf); + list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) { + unsigned long last_runnable = p->scx.runnable_at; + + if (unlikely(time_after(jiffies, + last_runnable + scx_watchdog_timeout))) { + u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable); + + scx_ops_error_kind(SCX_EXIT_ERROR_STALL, + "%s[%d] failed to run for %u.%03us", + p->comm, p->pid, + dur_ms / 1000, dur_ms % 1000); + timed_out = true; + break; + } + } + rq_unlock_irqrestore(rq, &rf); + + return timed_out; +} + +static void scx_watchdog_workfn(struct work_struct *work) +{ + int cpu; + + WRITE_ONCE(scx_watchdog_timestamp, jiffies); + + for_each_online_cpu(cpu) { + if (unlikely(check_rq_for_timeouts(cpu_rq(cpu)))) + break; + + cond_resched(); + } + queue_delayed_work(system_unbound_wq, to_delayed_work(work), + scx_watchdog_timeout / 2); +} + +void scx_tick(struct rq *rq) { + unsigned long last_check; + + if (!scx_enabled()) + return; + + last_check = READ_ONCE(scx_watchdog_timestamp); + if (unlikely(time_after(jiffies, + last_check + READ_ONCE(scx_watchdog_timeout)))) { + u32 dur_ms = jiffies_to_msecs(jiffies - last_check); + + scx_ops_error_kind(SCX_EXIT_ERROR_STALL, + "watchdog failed to check in for %u.%03us", + dur_ms / 1000, dur_ms % 1000); + } + update_other_load_avgs(rq); +} + +static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) +{ update_curr_scx(rq); /* @@ -2253,6 +2351,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool scx_set_task_state(p, SCX_TASK_INIT); + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; return 0; } @@ -2331,6 +2430,7 @@ void init_scx_entity(struct sched_ext_entity *scx) scx->sticky_cpu = -1; scx->holding_cpu = -1; INIT_LIST_HEAD(&scx->runnable_node); + scx->runnable_at = jiffies; scx->ddsp_dsq_id = SCX_DSQ_INVALID; scx->slice = SCX_SLICE_DFL; } @@ -2788,6 +2888,8 @@ static const char *scx_exit_reason(enum scx_exit_kind kind) return "runtime error"; case SCX_EXIT_ERROR_BPF: return "scx_bpf_error"; + case SCX_EXIT_ERROR_STALL: + return "runnable task stall"; default: return "<UNKNOWN>"; } @@ -2909,6 +3011,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work) if (scx_ops.exit) SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei); + cancel_delayed_work_sync(&scx_watchdog_work); + /* * Delete the kobject from the hierarchy eagerly in addition to just * dropping a reference. Otherwise, if the object is deleted @@ -3031,6 +3135,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) { struct scx_task_iter sti; struct task_struct *p; + unsigned long timeout; int i, ret; mutex_lock(&scx_ops_enable_mutex); @@ -3108,6 +3213,16 @@ static int scx_ops_enable(struct sched_ext_ops *ops) goto err_disable; } + if (ops->timeout_ms) + timeout = msecs_to_jiffies(ops->timeout_ms); + else + timeout = SCX_WATCHDOG_MAX_TIMEOUT; + + WRITE_ONCE(scx_watchdog_timeout, timeout); + WRITE_ONCE(scx_watchdog_timestamp, jiffies); + queue_delayed_work(system_unbound_wq, &scx_watchdog_work, + scx_watchdog_timeout / 2); + /* * Lock out forks before opening the floodgate so that they don't wander * into the operations prematurely. @@ -3418,6 +3533,12 @@ static int bpf_scx_init_member(const struct btf_type *t, if (ret == 0) return -EINVAL; return 1; + case offsetof(struct sched_ext_ops, timeout_ms): + if (msecs_to_jiffies(*(u32 *)(udata + moff)) > + SCX_WATCHDOG_MAX_TIMEOUT) + return -E2BIG; + ops->timeout_ms = *(u32 *)(udata + moff); + return 1; } return 0; @@ -3574,6 +3695,7 @@ void __init init_sched_ext_class(void) } register_sysrq_key('S', &sysrq_sched_ext_reset_op); + INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn); } diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 9c5a2d928281..56fcdb0b2c05 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -29,6 +29,7 @@ static inline bool task_on_scx(const struct task_struct *p) return scx_enabled() && p->sched_class == &ext_sched_class; } +void scx_tick(struct rq *rq); void init_scx_entity(struct sched_ext_entity *scx); void scx_pre_fork(struct task_struct *p); int scx_fork(struct task_struct *p); @@ -66,6 +67,7 @@ static inline const struct sched_class *next_active_class(const struct sched_cla #define scx_enabled() false #define scx_switched_all() false +static inline void scx_tick(struct rq *rq) {} static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 976a2693da71..8beae08dfdc7 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -29,6 +29,8 @@ enum consts { char _license[] SEC("license") = "GPL"; const volatile u64 slice_ns = SCX_SLICE_DFL; +const volatile u32 stall_user_nth; +const volatile u32 stall_kernel_nth; const volatile u32 dsp_batch; u32 test_error_cnt; @@ -129,11 +131,20 @@ static int weight_to_idx(u32 weight) void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) { + static u32 user_cnt, kernel_cnt; struct task_ctx *tctx; u32 pid = p->pid; int idx = weight_to_idx(p->scx.weight); void *ring; + if (p->flags & PF_KTHREAD) { + if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth)) + return; + } else { + if (stall_user_nth && !(++user_cnt % stall_user_nth)) + return; + } + if (test_error_cnt && !--test_error_cnt) scx_bpf_error("test triggering error"); @@ -261,4 +272,5 @@ SCX_OPS_DEFINE(qmap_ops, .init_task = (void *)qmap_init_task, .init = (void *)qmap_init, .exit = (void *)qmap_exit, + .timeout_ms = 5000U, .name = "qmap"); diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index 7c84ade7ecfb..6e9e9726cd62 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -19,10 +19,12 @@ const char help_fmt[] = "\n" "See the top-level comment in .bpf.c for more details.\n" "\n" -"Usage: %s [-s SLICE_US] [-e COUNT] [-b COUNT] [-p] [-v]\n" +"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT] [-p] [-v]\n" "\n" " -s SLICE_US Override slice duration\n" " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" +" -t COUNT Stall every COUNT'th user thread\n" +" -T COUNT Stall every COUNT'th kernel thread\n" " -b COUNT Dispatch upto COUNT tasks together\n" " -p Switch only tasks on SCHED_EXT policy intead of all\n" " -v Print libbpf debug messages\n" @@ -55,7 +57,7 @@ int main(int argc, char **argv) skel = SCX_OPS_OPEN(qmap_ops, scx_qmap); - while ((opt = getopt(argc, argv, "s:e:b:pvh")) != -1) { + while ((opt = getopt(argc, argv, "s:e:t:T:b:pvh")) != -1) { switch (opt) { case 's': skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; @@ -63,6 +65,12 @@ int main(int argc, char **argv) case 'e': skel->bss->test_error_cnt = strtoul(optarg, NULL, 0); break; + case 't': + skel->rodata->stall_user_nth = strtoul(optarg, NULL, 0); + break; + case 'T': + skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0); + break; case 'b': skel->rodata->dsp_batch = strtoul(optarg, NULL, 0); break; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 7bb6f0810ecfb73a9d7a2ca56fb001e0201a6758 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- BPF schedulers might not want to schedule certain tasks - e.g. kernel threads. This patch adds p->scx.disallow which can be set by BPF schedulers in such cases. The field can be changed anytime and setting it in ops.prep_enable() guarantees that the task can never be scheduled by sched_ext. scx_qmap is updated with the -d option to disallow a specific PID: # echo $$ 1092 # grep -E '(policy)|(ext\.enabled)' /proc/self/sched policy : 0 ext.enabled : 0 # ./set-scx 1092 # grep -E '(policy)|(ext\.enabled)' /proc/self/sched policy : 7 ext.enabled : 0 Run "scx_qmap -p -d 1092" in another terminal. # cat /sys/kernel/sched_ext/nr_rejected 1 # grep -E '(policy)|(ext\.enabled)' /proc/self/sched policy : 0 ext.enabled : 0 # ./set-scx 1092 setparam failed for 1092 (Permission denied) - v4: Refreshed on top of tip:sched/core. - v3: Update description to reflect /sys/kernel/sched_ext interface change. - v2: Use atomic_long_t instead of atomic64_t for scx_kick_cpus_pnt_seqs to accommodate 32bit archs. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Barret Rhoden <brho@google.com> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 12 ++++++++ kernel/sched/ext.c | 50 ++++++++++++++++++++++++++++++++++ kernel/sched/ext.h | 2 ++ kernel/sched/syscalls.c | 4 +++ tools/sched_ext/scx_qmap.bpf.c | 4 +++ tools/sched_ext/scx_qmap.c | 11 ++++++-- 6 files changed, 81 insertions(+), 2 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 96031252436f..ea7c501ac819 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -137,6 +137,18 @@ struct sched_ext_entity { */ u64 slice; + /* + * If set, reject future sched_setscheduler(2) calls updating the policy + * to %SCHED_EXT with -%EACCES. + * + * If set from ops.init_task() and the task's policy is already + * %SCHED_EXT, which can happen while the BPF scheduler is being loaded + * or by inhering the parent's policy during fork, the task's policy is + * rejected and forcefully reverted to %SCHED_NORMAL. The number of + * such events are reported through /sys/kernel/debug/sched_ext::nr_rejected. + */ + bool disallow; /* reject switching into SCX */ + /* cold fields */ /* must be the last field, see init_scx_entity() */ struct list_head tasks_node; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 17e8590d6c88..0e9939cf9751 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -487,6 +487,8 @@ struct static_key_false scx_has_op[SCX_OPI_END] = static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE); static struct scx_exit_info *scx_exit_info; +static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0); + /* * The maximum amount of time in jiffies that a task may be runnable without * being scheduled on a CPU. If this timeout is exceeded, it will trigger @@ -2337,6 +2339,8 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool { int ret; + p->scx.disallow = false; + if (SCX_HAS_OP(init_task)) { struct scx_init_task_args args = { .fork = fork, @@ -2351,6 +2355,27 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool scx_set_task_state(p, SCX_TASK_INIT); + if (p->scx.disallow) { + struct rq *rq; + struct rq_flags rf; + + rq = task_rq_lock(p, &rf); + + /* + * We're either in fork or load path and @p->policy will be + * applied right after. Reverting @p->policy here and rejecting + * %SCHED_EXT transitions from scx_check_setscheduler() + * guarantees that if ops.init_task() sets @p->disallow, @p can + * never be in SCX. + */ + if (p->policy == SCHED_EXT) { + p->policy = SCHED_NORMAL; + atomic_long_inc(&scx_nr_rejected); + } + + task_rq_unlock(rq, p, &rf); + } + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; return 0; } @@ -2554,6 +2579,18 @@ static void switched_from_scx(struct rq *rq, struct task_struct *p) static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} static void switched_to_scx(struct rq *rq, struct task_struct *p) {} +int scx_check_setscheduler(struct task_struct *p, int policy) +{ + lockdep_assert_rq_held(task_rq(p)); + + /* if disallow, reject transitioning into SCX */ + if (scx_enabled() && READ_ONCE(p->scx.disallow) && + p->policy != policy && policy == SCHED_EXT) + return -EACCES; + + return 0; +} + /* * Omitted operations: * @@ -2708,9 +2745,17 @@ static ssize_t scx_attr_switch_all_show(struct kobject *kobj, } SCX_ATTR(switch_all); +static ssize_t scx_attr_nr_rejected_show(struct kobject *kobj, + struct kobj_attribute *ka, char *buf) +{ + return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_nr_rejected)); +} +SCX_ATTR(nr_rejected); + static struct attribute *scx_global_attrs[] = { &scx_attr_state.attr, &scx_attr_switch_all.attr, + &scx_attr_nr_rejected.attr, NULL, }; @@ -3183,6 +3228,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops) atomic_set(&scx_exit_kind, SCX_EXIT_NONE); scx_warned_zero_slice = false; + atomic_long_set(&scx_nr_rejected, 0); + /* * Keep CPUs stable during enable so that the BPF scheduler can track * online CPUs by watching ->on/offline_cpu() after ->init(). @@ -3481,6 +3528,9 @@ static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log, if (off >= offsetof(struct task_struct, scx.slice) && off + size <= offsetofend(struct task_struct, scx.slice)) return SCALAR_VALUE; + if (off >= offsetof(struct task_struct, scx.disallow) && + off + size <= offsetofend(struct task_struct, scx.disallow)) + return SCALAR_VALUE; } return -EACCES; diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 56fcdb0b2c05..33a9f7fe5832 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -35,6 +35,7 @@ void scx_pre_fork(struct task_struct *p); int scx_fork(struct task_struct *p); void scx_post_fork(struct task_struct *p); void scx_cancel_fork(struct task_struct *p); +int scx_check_setscheduler(struct task_struct *p, int policy); bool task_should_scx(struct task_struct *p); void init_sched_ext_class(void); @@ -72,6 +73,7 @@ static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} static inline void scx_cancel_fork(struct task_struct *p) {} +static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; } static inline bool task_on_scx(const struct task_struct *p) { return false; } static inline void init_sched_ext_class(void) {} diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 76e9f70f3287..396fb2518c79 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -695,6 +695,10 @@ int __sched_setscheduler(struct task_struct *p, goto unlock; } + retval = scx_check_setscheduler(p, policy); + if (retval) + goto unlock; + /* * If not changing anything there's no need to proceed further, * but store a possible modification of reset_on_fork. diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 8beae08dfdc7..5ff217c4bfa0 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -32,6 +32,7 @@ const volatile u64 slice_ns = SCX_SLICE_DFL; const volatile u32 stall_user_nth; const volatile u32 stall_kernel_nth; const volatile u32 dsp_batch; +const volatile s32 disallow_tgid; u32 test_error_cnt; @@ -243,6 +244,9 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p, struct scx_init_task_args *args) { + if (p->tgid == disallow_tgid) + p->scx.disallow = true; + /* * @p is new. Let's ensure that its task_ctx is available. We can sleep * in this function and the following will automatically use GFP_KERNEL. diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index 6e9e9726cd62..a2614994cfaa 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -19,13 +19,15 @@ const char help_fmt[] = "\n" "See the top-level comment in .bpf.c for more details.\n" "\n" -"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT] [-p] [-v]\n" +"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT]\n" +" [-d PID] [-p] [-v]\n" "\n" " -s SLICE_US Override slice duration\n" " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" " -t COUNT Stall every COUNT'th user thread\n" " -T COUNT Stall every COUNT'th kernel thread\n" " -b COUNT Dispatch upto COUNT tasks together\n" +" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n" " -p Switch only tasks on SCHED_EXT policy intead of all\n" " -v Print libbpf debug messages\n" " -h Display this help and exit\n"; @@ -57,7 +59,7 @@ int main(int argc, char **argv) skel = SCX_OPS_OPEN(qmap_ops, scx_qmap); - while ((opt = getopt(argc, argv, "s:e:t:T:b:pvh")) != -1) { + while ((opt = getopt(argc, argv, "s:e:t:T:b:d:pvh")) != -1) { switch (opt) { case 's': skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; @@ -74,6 +76,11 @@ int main(int argc, char **argv) case 'b': skel->rodata->dsp_batch = strtoul(optarg, NULL, 0); break; + case 'd': + skel->rodata->disallow_tgid = strtol(optarg, NULL, 0); + if (skel->rodata->disallow_tgid < 0) + skel->rodata->disallow_tgid = getpid(); + break; case 'p': skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL; break; -- 2.34.1
From: David Vernet <void@manifault.com> mainline inclusion from mainline-v6.12-rc1 commit 1538e33995eaf3f315cbb5506019b9f913ed8555 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- It would be useful to see what the sched_ext scheduler state is, and what scheduler is running, when we're dumping a task's stack. This patch therefore adds a new print_scx_info() function that's called in the same context as print_worker_info() and print_stop_info(). An example dump follows. BUG: kernel NULL pointer dereference, address: 0000000000000999 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP CPU: 13 PID: 2047 Comm: insmod Tainted: G O 6.6.0-work-10323-gb58d4cae8e99-dirty #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022 Sched_ext: qmap (enabled+all), task: runnable_at=-17ms RIP: 0010:init_module+0x9/0x1000 [test_module] ... v3: - scx_ops_enable_state_str[] definition moved to an earlier patch as it's now used by core implementation. - Convert jiffy delta to msecs using jiffies_to_msecs() instead of multiplying by (HZ / MSEC_PER_SEC). The conversion is implemented in jiffies_delta_msecs(). v2: - We are now using scx_ops_enable_state_str[] outside CONFIG_SCHED_DEBUG. Move it outside of CONFIG_SCHED_DEBUG and to the top. This was reported by Changwoo and Andrea. Signed-off-by: David Vernet <void@manifault.com> Reported-by: Changwoo Min <changwoo@igalia.com> Reported-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 2 ++ kernel/sched/core.c | 1 + kernel/sched/ext.c | 53 +++++++++++++++++++++++++++++++++++++++ lib/dump_stack.c | 1 + 4 files changed, 57 insertions(+) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index ea7c501ac819..85fb5dc725ef 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -155,10 +155,12 @@ struct sched_ext_entity { }; void sched_ext_free(struct task_struct *p); +void print_scx_info(const char *log_lvl, struct task_struct *p); #else /* !CONFIG_SCHED_CLASS_EXT */ static inline void sched_ext_free(struct task_struct *p) {} +static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} #endif /* CONFIG_SCHED_CLASS_EXT */ #endif /* _LINUX_SCHED_EXT_H */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 74cc29e195ec..600b942961c2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7493,6 +7493,7 @@ void sched_show_task(struct task_struct *p) print_worker_info(KERN_INFO, p); print_stop_info(KERN_INFO, p); + print_scx_info(KERN_INFO, p); show_stack(p, NULL, KERN_INFO); put_task_stack(p); } diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 0e9939cf9751..5671f8be0ab7 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -590,6 +590,14 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, #define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)]) +static long jiffies_delta_msecs(unsigned long at, unsigned long now) +{ + if (time_after(at, now)) + return jiffies_to_msecs(at - now); + else + return -(long)jiffies_to_msecs(now - at); +} + /* if the highest set bit is N, return a mask with bits [N+1, 31] set */ static u32 higher_bits(u32 flags) { @@ -3720,6 +3728,51 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = { .enable_mask = SYSRQ_ENABLE_RTNICE, }; +/** + * print_scx_info - print out sched_ext scheduler state + * @log_lvl: the log level to use when printing + * @p: target task + * + * If a sched_ext scheduler is enabled, print the name and state of the + * scheduler. If @p is on sched_ext, print further information about the task. + * + * This function can be safely called on any task as long as the task_struct + * itself is accessible. While safe, this function isn't synchronized and may + * print out mixups or garbages of limited length. + */ +void print_scx_info(const char *log_lvl, struct task_struct *p) +{ + enum scx_ops_enable_state state = scx_ops_enable_state(); + const char *all = READ_ONCE(scx_switching_all) ? "+all" : ""; + char runnable_at_buf[22] = "?"; + struct sched_class *class; + unsigned long runnable_at; + + if (state == SCX_OPS_DISABLED) + return; + + /* + * Carefully check if the task was running on sched_ext, and then + * carefully copy the time it's been runnable, and its state. + */ + if (copy_from_kernel_nofault(&class, &p->sched_class, sizeof(class)) || + class != &ext_sched_class) { + printk("%sSched_ext: %s (%s%s)", log_lvl, scx_ops.name, + scx_ops_enable_state_str[state], all); + return; + } + + if (!copy_from_kernel_nofault(&runnable_at, &p->scx.runnable_at, + sizeof(runnable_at))) + scnprintf(runnable_at_buf, sizeof(runnable_at_buf), "%+ldms", + jiffies_delta_msecs(runnable_at, jiffies)); + + /* print everything onto one line to conserve console space */ + printk("%sSched_ext: %s (%s%s), task: runnable_at=%s", + log_lvl, scx_ops.name, scx_ops_enable_state_str[state], all, + runnable_at_buf); +} + void __init init_sched_ext_class(void) { s32 cpu, v; diff --git a/lib/dump_stack.c b/lib/dump_stack.c index 83471e81501a..6e667c445539 100644 --- a/lib/dump_stack.c +++ b/lib/dump_stack.c @@ -68,6 +68,7 @@ void dump_stack_print_info(const char *log_lvl) print_worker_info(log_lvl, current); print_stop_info(log_lvl, current); + print_scx_info(log_lvl, current); } /** -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 07814a9439a3b03d79a1001614b5bc1cab69bcec category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- If a BPF scheduler triggers an error, the scheduler is aborted and the system is reverted to the built-in scheduler. In the process, a lot of information which may be useful for figuring out what happened can be lost. This patch adds debug dump which captures information which may be useful for debugging including runqueue and runnable thread states at the time of failure. The following shows a debug dump after triggering the watchdog: root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100 stats : enq=1 dsp=0 delta=1 deq=0 stats : enq=90 dsp=90 delta=0 deq=0 stats : enq=156 dsp=156 delta=0 deq=0 stats : enq=218 dsp=218 delta=0 deq=0 stats : enq=255 dsp=255 delta=0 deq=0 stats : enq=271 dsp=271 delta=0 deq=0 stats : enq=284 dsp=284 delta=0 deq=0 stats : enq=293 dsp=293 delta=0 deq=0 DEBUG DUMP ================================================================================ kworker/u32:12[320] triggered exit kind 1026: runnable task stall (stress[1530] failed to run for 6.841s) Backtrace: scx_watchdog_workfn+0x136/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 QMAP FIFO[0]: QMAP FIFO[1]: QMAP FIFO[2]: 1436 QMAP FIFO[3]: QMAP FIFO[4]: CPU states ---------- CPU 0 : nr_run=1 ops_qseq=244 curr=swapper/0[0] class=idle_sched_class QMAP: dsp_idx=1 dsp_cnt=0 R stress[1530] -6841ms scx_state/flags=3/0x1 ops_state/qseq=2/20 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=0 asm_sysvec_apic_timer_interrupt+0x16/0x20 CPU 2 : nr_run=2 ops_qseq=142 curr=swapper/2[0] class=idle_sched_class QMAP: dsp_idx=1 dsp_cnt=0 R sshd[1703] -5905ms scx_state/flags=3/0x9 ops_state/qseq=2/88 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=1 __x64_sys_ppoll+0xf6/0x120 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e R fish[1539] -4141ms scx_state/flags=3/0x9 ops_state/qseq=2/124 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=1 futex_wait+0x60/0xe0 do_futex+0x109/0x180 __x64_sys_futex+0x117/0x190 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e CPU 3 : nr_run=2 ops_qseq=162 curr=kworker/u32:12[320] class=ext_sched_class QMAP: dsp_idx=1 dsp_cnt=0 *R kworker/u32:12[320] +0ms scx_state/flags=3/0xd ops_state/qseq=0/0 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=ff QMAP: force_local=0 scx_dump_state+0x613/0x6f0 scx_ops_error_irq_workfn+0x1f/0x40 irq_work_run_list+0x82/0xd0 irq_work_run+0x14/0x30 __sysvec_irq_work+0x40/0x140 sysvec_irq_work+0x60/0x70 asm_sysvec_irq_work+0x16/0x20 scx_watchdog_workfn+0x15f/0x1c0 process_scheduled_works+0x2b5/0x600 worker_thread+0x269/0x360 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 R kworker/3:2[1436] +0ms scx_state/flags=3/0x9 ops_state/qseq=2/160 sticky/holding_cpu=-1/-1 dsq_id=(n/a) cpus=08 QMAP: force_local=0 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 CPU 7 : nr_run=0 ops_qseq=76 curr=swapper/7[0] class=idle_sched_class ================================================================================ EXIT: runnable task stall (stress[1530] failed to run for 6.841s) It shows that CPU 3 was running the watchdog when it triggered the error condition and the scx_qmap thread has been queued on CPU 0 for over 5 seconds but failed to run. It also prints out scx_qmap specific information - e.g. which tasks are queued on each FIFO and so on using the dump_*() ops. This dump has proved pretty useful for developing and debugging BPF schedulers. Debug dump is generated automatically when the BPF scheduler exits due to an error. The debug buffer used in such cases is determined by sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns the available buffer, the output is truncated and marked accordingly. Debug dump output can also be read through the sched_ext_dump tracepoint. When read through the tracepoint, there is no length limit. SysRq-D can be used to trigger debug dump at any time while a BPF scheduler is loaded. This is non-destructive - the scheduler keeps running afterwards. The output can be read through the sched_ext_dump tracepoint. v2: - The size of exit debug dump buffer can now be customized using sched_ext_ops.exit_dump_len. - sched_ext_ops.dump*() added to enable dumping of BPF scheduler specific information. - Tracpoint output and SysRq-D triggering added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Conflicts: include/trace/events/sched_ext.h [In current version, the assign_str should accept two parameters, so adjust it locally.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/trace/events/sched_ext.h | 32 ++ kernel/sched/ext.c | 421 ++++++++++++++++++- tools/sched_ext/include/scx/common.bpf.h | 12 + tools/sched_ext/include/scx/compat.h | 9 +- tools/sched_ext/include/scx/user_exit_info.h | 19 + tools/sched_ext/scx_qmap.bpf.c | 54 +++ tools/sched_ext/scx_qmap.c | 14 +- tools/sched_ext/scx_simple.c | 2 +- 8 files changed, 555 insertions(+), 8 deletions(-) create mode 100644 include/trace/events/sched_ext.h diff --git a/include/trace/events/sched_ext.h b/include/trace/events/sched_ext.h new file mode 100644 index 000000000000..73e0f7cf2bfa --- /dev/null +++ b/include/trace/events/sched_ext.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM sched_ext + +#if !defined(_TRACE_SCHED_EXT_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_SCHED_EXT_H + +#include <linux/tracepoint.h> + +TRACE_EVENT(sched_ext_dump, + + TP_PROTO(const char *line), + + TP_ARGS(line), + + TP_STRUCT__entry( + __string(line, line) + ), + + TP_fast_assign( + __assign_str(line, line); + ), + + TP_printk("%s", + __get_str(line) + ) +); + +#endif /* _TRACE_SCHED_EXT_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 5671f8be0ab7..22c127c5ced8 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -16,6 +16,7 @@ enum scx_consts { SCX_EXIT_BT_LEN = 64, SCX_EXIT_MSG_LEN = 1024, + SCX_EXIT_DUMP_DFL_LEN = 32768, }; enum scx_exit_kind { @@ -52,6 +53,9 @@ struct scx_exit_info { /* informational message */ char *msg; + + /* debug dump */ + char *dump; }; /* sched_ext_ops.flags */ @@ -109,6 +113,17 @@ struct scx_exit_task_args { bool cancelled; }; +/* + * Informational context provided to dump operations. + */ +struct scx_dump_ctx { + enum scx_exit_kind kind; + s64 exit_code; + const char *reason; + u64 at_ns; + u64 at_jiffies; +}; + /** * struct sched_ext_ops - Operation table for BPF scheduler implementation * @@ -300,6 +315,36 @@ struct sched_ext_ops { */ void (*disable)(struct task_struct *p); + /** + * dump - Dump BPF scheduler state on error + * @ctx: debug dump context + * + * Use scx_bpf_dump() to generate BPF scheduler specific debug dump. + */ + void (*dump)(struct scx_dump_ctx *ctx); + + /** + * dump_cpu - Dump BPF scheduler state for a CPU on error + * @ctx: debug dump context + * @cpu: CPU to generate debug dump for + * @idle: @cpu is currently idle without any runnable tasks + * + * Use scx_bpf_dump() to generate BPF scheduler specific debug dump for + * @cpu. If @idle is %true and this operation doesn't produce any + * output, @cpu is skipped for dump. + */ + void (*dump_cpu)(struct scx_dump_ctx *ctx, s32 cpu, bool idle); + + /** + * dump_task - Dump BPF scheduler state for a runnable task on error + * @ctx: debug dump context + * @p: runnable task to generate debug dump for + * + * Use scx_bpf_dump() to generate BPF scheduler specific debug dump for + * @p. + */ + void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p); + /* * All online ops must come before ops.init(). */ @@ -334,6 +379,12 @@ struct sched_ext_ops { */ u32 timeout_ms; + /** + * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default + * value of 32768 is used. + */ + u32 exit_dump_len; + /** * name - BPF scheduler's name * @@ -571,10 +622,27 @@ struct scx_bstr_buf { static DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock); static struct scx_bstr_buf scx_exit_bstr_buf; +/* ops debug dump */ +struct scx_dump_data { + s32 cpu; + bool first; + s32 cursor; + struct seq_buf *s; + const char *prefix; + struct scx_bstr_buf buf; +}; + +struct scx_dump_data scx_dump_data = { + .cpu = -1, +}; + /* /sys/kernel/sched_ext interface */ static struct kset *scx_kset; static struct kobject *scx_root_kobj; +#define CREATE_TRACE_POINTS +#include <trace/events/sched_ext.h> + static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, s64 exit_code, const char *fmt, ...); @@ -2902,12 +2970,13 @@ static void scx_ops_bypass(bool bypass) static void free_exit_info(struct scx_exit_info *ei) { + kfree(ei->dump); kfree(ei->msg); kfree(ei->bt); kfree(ei); } -static struct scx_exit_info *alloc_exit_info(void) +static struct scx_exit_info *alloc_exit_info(size_t exit_dump_len) { struct scx_exit_info *ei; @@ -2917,8 +2986,9 @@ static struct scx_exit_info *alloc_exit_info(void) ei->bt = kcalloc(sizeof(ei->bt[0]), SCX_EXIT_BT_LEN, GFP_KERNEL); ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL); + ei->dump = kzalloc(exit_dump_len, GFP_KERNEL); - if (!ei->bt || !ei->msg) { + if (!ei->bt || !ei->msg || !ei->dump) { free_exit_info(ei); return NULL; } @@ -3130,8 +3200,274 @@ static void scx_ops_disable(enum scx_exit_kind kind) schedule_scx_ops_disable_work(); } +static void dump_newline(struct seq_buf *s) +{ + trace_sched_ext_dump(""); + + /* @s may be zero sized and seq_buf triggers WARN if so */ + if (s->size) + seq_buf_putc(s, '\n'); +} + +static __printf(2, 3) void dump_line(struct seq_buf *s, const char *fmt, ...) +{ + va_list args; + +#ifdef CONFIG_TRACEPOINTS + if (trace_sched_ext_dump_enabled()) { + /* protected by scx_dump_state()::dump_lock */ + static char line_buf[SCX_EXIT_MSG_LEN]; + + va_start(args, fmt); + vscnprintf(line_buf, sizeof(line_buf), fmt, args); + va_end(args); + + trace_sched_ext_dump(line_buf); + } +#endif + /* @s may be zero sized and seq_buf triggers WARN if so */ + if (s->size) { + va_start(args, fmt); + seq_buf_vprintf(s, fmt, args); + va_end(args); + + seq_buf_putc(s, '\n'); + } +} + +static void dump_stack_trace(struct seq_buf *s, const char *prefix, + const unsigned long *bt, unsigned int len) +{ + unsigned int i; + + for (i = 0; i < len; i++) + dump_line(s, "%s%pS", prefix, (void *)bt[i]); +} + +static void ops_dump_init(struct seq_buf *s, const char *prefix) +{ + struct scx_dump_data *dd = &scx_dump_data; + + lockdep_assert_irqs_disabled(); + + dd->cpu = smp_processor_id(); /* allow scx_bpf_dump() */ + dd->first = true; + dd->cursor = 0; + dd->s = s; + dd->prefix = prefix; +} + +static void ops_dump_flush(void) +{ + struct scx_dump_data *dd = &scx_dump_data; + char *line = dd->buf.line; + + if (!dd->cursor) + return; + + /* + * There's something to flush and this is the first line. Insert a blank + * line to distinguish ops dump. + */ + if (dd->first) { + dump_newline(dd->s); + dd->first = false; + } + + /* + * There may be multiple lines in $line. Scan and emit each line + * separately. + */ + while (true) { + char *end = line; + char c; + + while (*end != '\n' && *end != '\0') + end++; + + /* + * If $line overflowed, it may not have newline at the end. + * Always emit with a newline. + */ + c = *end; + *end = '\0'; + dump_line(dd->s, "%s%s", dd->prefix, line); + if (c == '\0') + break; + + /* move to the next line */ + end++; + if (*end == '\0') + break; + line = end; + } + + dd->cursor = 0; +} + +static void ops_dump_exit(void) +{ + ops_dump_flush(); + scx_dump_data.cpu = -1; +} + +static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, + struct task_struct *p, char marker) +{ + static unsigned long bt[SCX_EXIT_BT_LEN]; + char dsq_id_buf[19] = "(n/a)"; + unsigned long ops_state = atomic_long_read(&p->scx.ops_state); + unsigned int bt_len; + + if (p->scx.dsq) + scnprintf(dsq_id_buf, sizeof(dsq_id_buf), "0x%llx", + (unsigned long long)p->scx.dsq->id); + + dump_newline(s); + dump_line(s, " %c%c %s[%d] %+ldms", + marker, task_state_to_char(p), p->comm, p->pid, + jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies)); + dump_line(s, " scx_state/flags=%u/0x%x ops_state/qseq=%lu/%lu", + scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK, + ops_state & SCX_OPSS_STATE_MASK, + ops_state >> SCX_OPSS_QSEQ_SHIFT); + dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s", + p->scx.sticky_cpu, p->scx.holding_cpu, dsq_id_buf); + dump_line(s, " cpus=%*pb", cpumask_pr_args(p->cpus_ptr)); + + if (SCX_HAS_OP(dump_task)) { + ops_dump_init(s, " "); + SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p); + ops_dump_exit(); + } + + bt_len = stack_trace_save_tsk(p, bt, SCX_EXIT_BT_LEN, 1); + if (bt_len) { + dump_newline(s); + dump_stack_trace(s, " ", bt, bt_len); + } +} + +static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) +{ + static DEFINE_SPINLOCK(dump_lock); + static const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n"; + struct scx_dump_ctx dctx = { + .kind = ei->kind, + .exit_code = ei->exit_code, + .reason = ei->reason, + .at_ns = ktime_get_ns(), + .at_jiffies = jiffies, + }; + struct seq_buf s; + unsigned long flags; + char *buf; + int cpu; + + spin_lock_irqsave(&dump_lock, flags); + + seq_buf_init(&s, ei->dump, dump_len); + + if (ei->kind == SCX_EXIT_NONE) { + dump_line(&s, "Debug dump triggered by %s", ei->reason); + } else { + dump_line(&s, "%s[%d] triggered exit kind %d:", + current->comm, current->pid, ei->kind); + dump_line(&s, " %s (%s)", ei->reason, ei->msg); + dump_newline(&s); + dump_line(&s, "Backtrace:"); + dump_stack_trace(&s, " ", ei->bt, ei->bt_len); + } + + if (SCX_HAS_OP(dump)) { + ops_dump_init(&s, ""); + SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx); + ops_dump_exit(); + } + + dump_newline(&s); + dump_line(&s, "CPU states"); + dump_line(&s, "----------"); + + for_each_possible_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); + struct rq_flags rf; + struct task_struct *p; + struct seq_buf ns; + size_t avail, used; + bool idle; + + rq_lock(rq, &rf); + + idle = list_empty(&rq->scx.runnable_list) && + rq->curr->sched_class == &idle_sched_class; + + if (idle && !SCX_HAS_OP(dump_cpu)) + goto next; + + /* + * We don't yet know whether ops.dump_cpu() will produce output + * and we may want to skip the default CPU dump if it doesn't. + * Use a nested seq_buf to generate the standard dump so that we + * can decide whether to commit later. + */ + avail = seq_buf_get_buf(&s, &buf); + seq_buf_init(&ns, buf, avail); + + dump_newline(&ns); + dump_line(&ns, "CPU %-4d: nr_run=%u ops_qseq=%lu", + cpu, rq->scx.nr_running, rq->scx.ops_qseq); + dump_line(&ns, " curr=%s[%d] class=%ps", + rq->curr->comm, rq->curr->pid, + rq->curr->sched_class); + + used = seq_buf_used(&ns); + if (SCX_HAS_OP(dump_cpu)) { + ops_dump_init(&ns, " "); + SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle); + ops_dump_exit(); + } + + /* + * If idle && nothing generated by ops.dump_cpu(), there's + * nothing interesting. Skip. + */ + if (idle && used == seq_buf_used(&ns)) + goto next; + + /* + * $s may already have overflowed when $ns was created. If so, + * calling commit on it will trigger BUG. + */ + if (avail) { + seq_buf_commit(&s, seq_buf_used(&ns)); + if (seq_buf_has_overflowed(&ns)) + seq_buf_set_overflow(&s); + } + + if (rq->curr->sched_class == &ext_sched_class) + scx_dump_task(&s, &dctx, rq->curr, '*'); + + list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) + scx_dump_task(&s, &dctx, p, ' '); + next: + rq_unlock(rq, &rf); + } + + if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker)) + memcpy(ei->dump + dump_len - sizeof(trunc_marker), + trunc_marker, sizeof(trunc_marker)); + + spin_unlock_irqrestore(&dump_lock, flags); +} + static void scx_ops_error_irq_workfn(struct irq_work *irq_work) { + struct scx_exit_info *ei = scx_exit_info; + + if (ei->kind >= SCX_EXIT_ERROR) + scx_dump_state(ei, scx_ops.exit_dump_len); + schedule_scx_ops_disable_work(); } @@ -3157,6 +3493,13 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args); va_end(args); + /* + * Set ei->kind and ->reason for scx_dump_state(). They'll be set again + * in scx_ops_disable_workfn(). + */ + ei->kind = kind; + ei->reason = scx_exit_reason(ei->kind); + irq_work_queue(&scx_ops_error_irq_work); } @@ -3218,7 +3561,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) if (ret < 0) goto err; - scx_exit_info = alloc_exit_info(); + scx_exit_info = alloc_exit_info(ops->exit_dump_len); if (!scx_exit_info) { ret = -ENOMEM; goto err_del; @@ -3597,6 +3940,10 @@ static int bpf_scx_init_member(const struct btf_type *t, return -E2BIG; ops->timeout_ms = *(u32 *)(udata + moff); return 1; + case offsetof(struct sched_ext_ops, exit_dump_len): + ops->exit_dump_len = + *(u32 *)(udata + moff) ?: SCX_EXIT_DUMP_DFL_LEN; + return 1; } return 0; @@ -3728,6 +4075,21 @@ static const struct sysrq_key_op sysrq_sched_ext_reset_op = { .enable_mask = SYSRQ_ENABLE_RTNICE, }; +static void sysrq_handle_sched_ext_dump(u8 key) +{ + struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" }; + + if (scx_enabled()) + scx_dump_state(&ei, 0); +} + +static const struct sysrq_key_op sysrq_sched_ext_dump_op = { + .handler = sysrq_handle_sched_ext_dump, + .help_msg = "dump-sched-ext(D)", + .action_msg = "Trigger sched_ext debug dump", + .enable_mask = SYSRQ_ENABLE_RTNICE, +}; + /** * print_scx_info - print out sched_ext scheduler state * @log_lvl: the log level to use when printing @@ -3798,6 +4160,7 @@ void __init init_sched_ext_class(void) } register_sysrq_key('S', &sysrq_sched_ext_reset_op); + register_sysrq_key('D', &sysrq_sched_ext_dump_op); INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn); } @@ -4223,6 +4586,57 @@ __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, raw_spin_unlock_irqrestore(&scx_exit_bstr_buf_lock, flags); } +/** + * scx_bpf_dump - Generate extra debug dump specific to the BPF scheduler + * @fmt: format string + * @data: format string parameters packaged using ___bpf_fill() macro + * @data__sz: @data len, must end in '__sz' for the verifier + * + * To be called through scx_bpf_dump() helper from ops.dump(), dump_cpu() and + * dump_task() to generate extra debug dump specific to the BPF scheduler. + * + * The extra dump may be multiple lines. A single line may be split over + * multiple calls. The last line is automatically terminated. + */ +__bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, + u32 data__sz) +{ + struct scx_dump_data *dd = &scx_dump_data; + struct scx_bstr_buf *buf = &dd->buf; + s32 ret; + + if (raw_smp_processor_id() != dd->cpu) { + scx_ops_error("scx_bpf_dump() must only be called from ops.dump() and friends"); + return; + } + + /* append the formatted string to the line buf */ + ret = __bstr_format(buf->data, buf->line + dd->cursor, + sizeof(buf->line) - dd->cursor, fmt, data, data__sz); + if (ret < 0) { + dump_line(dd->s, "%s[!] (\"%s\", %p, %u) failed to format (%d)", + dd->prefix, fmt, data, data__sz, ret); + return; + } + + dd->cursor += ret; + dd->cursor = min_t(s32, dd->cursor, sizeof(buf->line)); + + if (!dd->cursor) + return; + + /* + * If the line buf overflowed or ends in a newline, flush it into the + * dump. This is to allow the caller to generate a single line over + * multiple calls. As ops_dump_flush() can also handle multiple lines in + * the line buf, the only case which can lead to an unexpected + * truncation is when the caller keeps generating newlines in the middle + * instead of the end consecutively. Don't do that. + */ + if (dd->cursor >= sizeof(buf->line) || buf->line[dd->cursor - 1] == '\n') + ops_dump_flush(); +} + /** * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs * @@ -4431,6 +4845,7 @@ BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE) diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 833fe1bdccf9..3ea5cdf58bc7 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -38,6 +38,7 @@ s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym; void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak; void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym; +void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak; u32 scx_bpf_nr_cpu_ids(void) __ksym __weak; const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak; const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak; @@ -97,6 +98,17 @@ void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {} ___scx_bpf_bstr_format_checker(fmt, ##args); \ }) +/* + * scx_bpf_dump() wraps the scx_bpf_dump_bstr() kfunc with variadic arguments + * instead of an array of u64. To be used from ops.dump() and friends. + */ +#define scx_bpf_dump(fmt, args...) \ +({ \ + scx_bpf_bstr_preamble(fmt, args) \ + scx_bpf_dump_bstr(___fmt, ___param, sizeof(___param)); \ + ___scx_bpf_bstr_format_checker(fmt, ##args); \ +}) + #define BPF_STRUCT_OPS(name, args...) \ SEC("struct_ops/"#name) \ BPF_PROG(name, ##args) diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h index a7fdaf8a858e..c58024c980c8 100644 --- a/tools/sched_ext/include/scx/compat.h +++ b/tools/sched_ext/include/scx/compat.h @@ -111,16 +111,23 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field * is used to define ops and compat.h::SCX_OPS_LOAD/ATTACH() are used to load * and attach it, backward compatibility is automatically maintained where * reasonable. + * + * ec7e3b0463e1 ("implement-ops") in https://github.com/sched-ext/sched_ext is + * the current minimum required kernel version. */ #define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \ struct __scx_name *__skel; \ \ + SCX_BUG_ON(!__COMPAT_struct_has_field("sched_ext_ops", "dump"), \ + "sched_ext_ops.dump() missing, kernel too old?"); \ + \ __skel = __scx_name##__open(); \ SCX_BUG_ON(!__skel, "Could not open " #__scx_name); \ __skel; \ }) -#define SCX_OPS_LOAD(__skel, __ops_name, __scx_name) ({ \ +#define SCX_OPS_LOAD(__skel, __ops_name, __scx_name, __uei_name) ({ \ + UEI_SET_SIZE(__skel, __ops_name, __uei_name); \ SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel"); \ }) diff --git a/tools/sched_ext/include/scx/user_exit_info.h b/tools/sched_ext/include/scx/user_exit_info.h index 8c3b7fac4d05..c2ef85c645e1 100644 --- a/tools/sched_ext/include/scx/user_exit_info.h +++ b/tools/sched_ext/include/scx/user_exit_info.h @@ -13,6 +13,7 @@ enum uei_sizes { UEI_REASON_LEN = 128, UEI_MSG_LEN = 1024, + UEI_DUMP_DFL_LEN = 32768, }; struct user_exit_info { @@ -28,6 +29,8 @@ struct user_exit_info { #include <bpf/bpf_core_read.h> #define UEI_DEFINE(__name) \ + char RESIZABLE_ARRAY(data, __name##_dump); \ + const volatile u32 __name##_dump_len; \ struct user_exit_info __name SEC(".data") #define UEI_RECORD(__uei_name, __ei) ({ \ @@ -35,6 +38,8 @@ struct user_exit_info { sizeof(__uei_name.reason), (__ei)->reason); \ bpf_probe_read_kernel_str(__uei_name.msg, \ sizeof(__uei_name.msg), (__ei)->msg); \ + bpf_probe_read_kernel_str(__uei_name##_dump, \ + __uei_name##_dump_len, (__ei)->dump); \ if (bpf_core_field_exists((__ei)->exit_code)) \ __uei_name.exit_code = (__ei)->exit_code; \ /* use __sync to force memory barrier */ \ @@ -47,6 +52,13 @@ struct user_exit_info { #include <stdio.h> #include <stdbool.h> +/* no need to call the following explicitly if SCX_OPS_LOAD() is used */ +#define UEI_SET_SIZE(__skel, __ops_name, __uei_name) ({ \ + u32 __len = (__skel)->struct_ops.__ops_name->exit_dump_len ?: UEI_DUMP_DFL_LEN; \ + (__skel)->rodata->__uei_name##_dump_len = __len; \ + RESIZE_ARRAY((__skel), data, __uei_name##_dump, __len); \ +}) + #define UEI_EXITED(__skel, __uei_name) ({ \ /* use __sync to force memory barrier */ \ __sync_val_compare_and_swap(&(__skel)->data->__uei_name.kind, -1, -1); \ @@ -54,6 +66,13 @@ struct user_exit_info { #define UEI_REPORT(__skel, __uei_name) ({ \ struct user_exit_info *__uei = &(__skel)->data->__uei_name; \ + char *__uei_dump = (__skel)->data_##__uei_name##_dump->__uei_name##_dump; \ + if (__uei_dump[0] != '\0') { \ + fputs("\nDEBUG DUMP\n", stderr); \ + fputs("================================================================================\n\n", stderr); \ + fputs(__uei_dump, stderr); \ + fputs("\n================================================================================\n\n", stderr); \ + } \ fprintf(stderr, "EXIT: %s", __uei->reason); \ if (__uei->msg[0] != '\0') \ fprintf(stderr, " (%s)", __uei->msg); \ diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 5ff217c4bfa0..5b3da28bf042 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -33,6 +33,7 @@ const volatile u32 stall_user_nth; const volatile u32 stall_kernel_nth; const volatile u32 dsp_batch; const volatile s32 disallow_tgid; +const volatile bool suppress_dump; u32 test_error_cnt; @@ -258,6 +259,56 @@ s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p, return -ENOMEM; } +void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx) +{ + s32 i, pid; + + if (suppress_dump) + return; + + bpf_for(i, 0, 5) { + void *fifo; + + if (!(fifo = bpf_map_lookup_elem(&queue_arr, &i))) + return; + + scx_bpf_dump("QMAP FIFO[%d]:", i); + bpf_repeat(4096) { + if (bpf_map_pop_elem(fifo, &pid)) + break; + scx_bpf_dump(" %d", pid); + } + scx_bpf_dump("\n"); + } +} + +void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle) +{ + u32 zero = 0; + struct cpu_ctx *cpuc; + + if (suppress_dump || idle) + return; + if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu))) + return; + + scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu", + cpuc->dsp_idx, cpuc->dsp_cnt); +} + +void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p) +{ + struct task_ctx *taskc; + + if (suppress_dump) + return; + if (!(taskc = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) + return; + + scx_bpf_dump("QMAP: force_local=%d", + taskc->force_local); +} + s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) { return scx_bpf_create_dsq(SHARED_DSQ, -1); @@ -274,6 +325,9 @@ SCX_OPS_DEFINE(qmap_ops, .dequeue = (void *)qmap_dequeue, .dispatch = (void *)qmap_dispatch, .init_task = (void *)qmap_init_task, + .dump = (void *)qmap_dump, + .dump_cpu = (void *)qmap_dump_cpu, + .dump_task = (void *)qmap_dump_task, .init = (void *)qmap_init, .exit = (void *)qmap_exit, .timeout_ms = 5000U, diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index a2614994cfaa..a1123a17581b 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -20,7 +20,7 @@ const char help_fmt[] = "See the top-level comment in .bpf.c for more details.\n" "\n" "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT]\n" -" [-d PID] [-p] [-v]\n" +" [-d PID] [-D LEN] [-p] [-v]\n" "\n" " -s SLICE_US Override slice duration\n" " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" @@ -28,6 +28,8 @@ const char help_fmt[] = " -T COUNT Stall every COUNT'th kernel thread\n" " -b COUNT Dispatch upto COUNT tasks together\n" " -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n" +" -D LEN Set scx_exit_info.dump buffer length\n" +" -S Suppress qmap-specific debug dump\n" " -p Switch only tasks on SCHED_EXT policy intead of all\n" " -v Print libbpf debug messages\n" " -h Display this help and exit\n"; @@ -59,7 +61,7 @@ int main(int argc, char **argv) skel = SCX_OPS_OPEN(qmap_ops, scx_qmap); - while ((opt = getopt(argc, argv, "s:e:t:T:b:d:pvh")) != -1) { + while ((opt = getopt(argc, argv, "s:e:t:T:b:d:D:Spvh")) != -1) { switch (opt) { case 's': skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; @@ -81,6 +83,12 @@ int main(int argc, char **argv) if (skel->rodata->disallow_tgid < 0) skel->rodata->disallow_tgid = getpid(); break; + case 'D': + skel->struct_ops.qmap_ops->exit_dump_len = strtoul(optarg, NULL, 0); + break; + case 'S': + skel->rodata->suppress_dump = true; + break; case 'p': skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL; break; @@ -93,7 +101,7 @@ int main(int argc, char **argv) } } - SCX_OPS_LOAD(skel, qmap_ops, scx_qmap); + SCX_OPS_LOAD(skel, qmap_ops, scx_qmap, uei); link = SCX_OPS_ATTACH(skel, qmap_ops, scx_qmap); while (!exit_req && !UEI_EXITED(skel, uei)) { diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c index 789ac62fea8e..7f500d1d56ac 100644 --- a/tools/sched_ext/scx_simple.c +++ b/tools/sched_ext/scx_simple.c @@ -80,7 +80,7 @@ int main(int argc, char **argv) } } - SCX_OPS_LOAD(skel, simple_ops, scx_simple); + SCX_OPS_LOAD(skel, simple_ops, scx_simple, uei); link = SCX_OPS_ATTACH(skel, simple_ops, scx_simple); while (!exit_req && !UEI_EXITED(skel, uei)) { -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 1c3ae1cb2f2cf245d529fe363ce420ec8028a547 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- There are states which are interesting but don't quite fit the interface exposed under /sys/kernel/sched_ext. Add tools/scx_show_state.py to show them. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/scx_show_state.py | 39 +++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) create mode 100644 tools/sched_ext/scx_show_state.py diff --git a/tools/sched_ext/scx_show_state.py b/tools/sched_ext/scx_show_state.py new file mode 100644 index 000000000000..d457d2a74e1e --- /dev/null +++ b/tools/sched_ext/scx_show_state.py @@ -0,0 +1,39 @@ +#!/usr/bin/env drgn +# +# Copyright (C) 2024 Tejun Heo <tj@kernel.org> +# Copyright (C) 2024 Meta Platforms, Inc. and affiliates. + +desc = """ +This is a drgn script to show the current sched_ext state. +For more info on drgn, visit https://github.com/osandov/drgn. +""" + +import drgn +import sys + +def err(s): + print(s, file=sys.stderr, flush=True) + sys.exit(1) + +def read_int(name): + return int(prog[name].value_()) + +def read_atomic(name): + return prog[name].counter.value_() + +def read_static_key(name): + return prog[name].key.enabled.counter.value_() + +def ops_state_str(state): + return prog['scx_ops_enable_state_str'][state].string_().decode() + +ops = prog['scx_ops'] +enable_state = read_atomic("scx_ops_enable_state_var") + +print(f'ops : {ops.name.string_().decode()}') +print(f'enabled : {read_static_key("__scx_ops_enabled")}') +print(f'switching_all : {read_int("scx_switching_all")}') +print(f'switched_all : {read_static_key("__scx_switched_all")}') +print(f'enable_state : {ops_state_str(enable_state)} ({enable_state})') +print(f'bypass_depth : {read_atomic("scx_ops_bypass_depth")}') +print(f'nr_rejected : {read_atomic("scx_nr_rejected")}') -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 81aae789181b5850d77dfdf74d4b85c63f0705e9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- It's often useful to wake up and/or trigger reschedule on other CPUs. This patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to kick the target CPU into the scheduling path. As a sched_ext task relinquishes its CPU only after its slice is depleted, this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the slice of the target CPU's current task to guarantee that sched_ext's scheduling path runs on the CPU. If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle to guarantee that the target CPU will go through at least one full sched_ext scheduling cycle after the kicking. This can be used to wake up idle CPUs without incurring unnecessary overhead if it isn't currently idle. As a demonstration of how backward compatibility can be supported using BPF CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides __COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or becomes a regular kicking otherwise. This allows schedulers to use the new SCX_KICK_IDLE while maintaining support for older kernels. The plan is to temporarily use compat helpers to ease API updates and drop them after a few kernel releases. v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for schedulers so that they can support kernels without SCX_KICK_IDLE. This is useful as a demonstration of how new feature flags can be added in a backward compatible way. - kick_cpus_irq_workfn() reimplemented so that it touches the pending cpumasks only as necessary to reduce kicking overhead on machines with a lot of CPUs. - tools/sched_ext/include/scx/compat.bpf.h added. v4: - Move example scheduler to its own patch. v3: - Make scx_example_central switch all tasks by default. - Convert to BPF inline iterators. v2: - Julia Lawall reported that scx_example_central can overflow the dispatch buffer and malfunction. As scheduling for other CPUs can't be handled by the automatic retry mechanism, fix by implementing an explicit overflow and retry handling. - Updated to use generic BPF cpumask helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 4 + kernel/sched/ext.c | 225 +++++++++++++++++++++-- kernel/sched/sched.h | 10 + tools/sched_ext/include/scx/common.bpf.h | 1 + 4 files changed, 227 insertions(+), 13 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 85fb5dc725ef..3b2809b980ac 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -134,6 +134,10 @@ struct sched_ext_entity { * scx_bpf_dispatch() but can also be modified directly by the BPF * scheduler. Automatically decreased by SCX as the task executes. On * depletion, a scheduling event is triggered. + * + * This value is cleared to zero if the task is preempted by + * %SCX_KICK_PREEMPT and shouldn't be used to determine how long the + * task ran. Use p->se.sum_exec_runtime instead. */ u64 slice; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 22c127c5ced8..73b8881a5061 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -416,6 +416,14 @@ enum scx_enq_flags { /* high 32bits are SCX specific */ + /* + * Set the following to trigger preemption when calling + * scx_bpf_dispatch() with a local dsq as the target. The slice of the + * current task is cleared to zero and the CPU is kicked into the + * scheduling path. Implies %SCX_ENQ_HEAD. + */ + SCX_ENQ_PREEMPT = 1LLU << 32, + /* * The task being enqueued is the only task available for the cpu. By * default, ext core keeps executing such tasks but when @@ -445,6 +453,24 @@ enum scx_pick_idle_cpu_flags { SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */ }; +enum scx_kick_flags { + /* + * Kick the target CPU if idle. Guarantees that the target CPU goes + * through at least one full scheduling cycle before going idle. If the + * target CPU can be determined to be currently not idle and going to go + * through a scheduling cycle before going idle, noop. + */ + SCX_KICK_IDLE = 1LLU << 0, + + /* + * Preempt the current task and execute the dispatch path. If the + * current task of the target CPU is an SCX task, its ->scx.slice is + * cleared to zero before the scheduling path is invoked so that the + * task expires and the dispatch path is invoked. + */ + SCX_KICK_PREEMPT = 1LLU << 1, +}; + enum scx_ops_enable_state { SCX_OPS_PREPPING, SCX_OPS_ENABLING, @@ -1023,7 +1049,7 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, } } - if (enq_flags & SCX_ENQ_HEAD) + if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) list_add(&p->scx.dsq_node, &dsq->list); else list_add_tail(&p->scx.dsq_node, &dsq->list); @@ -1049,8 +1075,16 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, if (is_local) { struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); + bool preempt = false; + + if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr && + rq->curr->sched_class == &ext_sched_class) { + rq->curr->scx.slice = 0; + preempt = true; + } - if (sched_class_above(&ext_sched_class, rq->curr->sched_class)) + if (preempt || sched_class_above(&ext_sched_class, + rq->curr->sched_class)) resched_curr(rq); } else { raw_spin_unlock(&dsq->lock); @@ -1877,8 +1911,10 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, { struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); bool prev_on_scx = prev->sched_class == &ext_sched_class; + bool has_tasks = false; lockdep_assert_rq_held(rq); + rq->scx.flags |= SCX_RQ_BALANCING; if (prev_on_scx) { WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP); @@ -1895,19 +1931,19 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice && !scx_ops_bypassing()) { prev->scx.flags |= SCX_TASK_BAL_KEEP; - return 1; + goto has_tasks; } } /* if there already are tasks to run, nothing to do */ if (rq->scx.local_dsq.nr) - return 1; + goto has_tasks; if (consume_dispatch_q(rq, rf, &scx_dsq_global)) - return 1; + goto has_tasks; if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq)) - return 0; + goto out; dspc->rq = rq; dspc->rf = rf; @@ -1928,12 +1964,18 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, flush_dispatch_buf(rq, rf); if (rq->scx.local_dsq.nr) - return 1; + goto has_tasks; if (consume_dispatch_q(rq, rf, &scx_dsq_global)) - return 1; + goto has_tasks; } while (dspc->nr_tasks); - return 0; + goto out; + +has_tasks: + has_tasks = true; +out: + rq->scx.flags &= ~SCX_RQ_BALANCING; + return has_tasks; } static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) @@ -2671,7 +2713,8 @@ int scx_check_setscheduler(struct task_struct *p, int policy) * Omitted operations: * * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task - * isn't tied to the CPU at that point. + * isn't tied to the CPU at that point. Preemption is implemented by resetting + * the victim task's slice to 0 and triggering reschedule on the target CPU. * * - migrate_task_rq: Unnecessary as task to cpu mapping is transient. * @@ -2907,6 +2950,9 @@ bool task_should_scx(struct task_struct *p) * of the queue. * * d. pick_next_task() suppresses zero slice warning. + * + * e. scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM + * operations. */ static void scx_ops_bypass(bool bypass) { @@ -3415,11 +3461,21 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) seq_buf_init(&ns, buf, avail); dump_newline(&ns); - dump_line(&ns, "CPU %-4d: nr_run=%u ops_qseq=%lu", - cpu, rq->scx.nr_running, rq->scx.ops_qseq); + dump_line(&ns, "CPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu", + cpu, rq->scx.nr_running, rq->scx.flags, + rq->scx.ops_qseq); dump_line(&ns, " curr=%s[%d] class=%ps", rq->curr->comm, rq->curr->pid, rq->curr->sched_class); + if (!cpumask_empty(rq->scx.cpus_to_kick)) + dump_line(&ns, " cpus_to_kick : %*pb", + cpumask_pr_args(rq->scx.cpus_to_kick)); + if (!cpumask_empty(rq->scx.cpus_to_kick_if_idle)) + dump_line(&ns, " idle_to_kick : %*pb", + cpumask_pr_args(rq->scx.cpus_to_kick_if_idle)); + if (!cpumask_empty(rq->scx.cpus_to_preempt)) + dump_line(&ns, " cpus_to_preempt: %*pb", + cpumask_pr_args(rq->scx.cpus_to_preempt)); used = seq_buf_used(&ns); if (SCX_HAS_OP(dump_cpu)) { @@ -4090,6 +4146,82 @@ static const struct sysrq_key_op sysrq_sched_ext_dump_op = { .enable_mask = SYSRQ_ENABLE_RTNICE, }; +static bool can_skip_idle_kick(struct rq *rq) +{ + lockdep_assert_rq_held(rq); + + /* + * We can skip idle kicking if @rq is going to go through at least one + * full SCX scheduling cycle before going idle. Just checking whether + * curr is not idle is insufficient because we could be racing + * balance_one() trying to pull the next task from a remote rq, which + * may fail, and @rq may become idle afterwards. + * + * The race window is small and we don't and can't guarantee that @rq is + * only kicked while idle anyway. Skip only when sure. + */ + return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_BALANCING); +} + +static void kick_one_cpu(s32 cpu, struct rq *this_rq) +{ + struct rq *rq = cpu_rq(cpu); + struct scx_rq *this_scx = &this_rq->scx; + unsigned long flags; + + raw_spin_rq_lock_irqsave(rq, flags); + + /* + * During CPU hotplug, a CPU may depend on kicking itself to make + * forward progress. Allow kicking self regardless of online state. + */ + if (cpu_online(cpu) || cpu == cpu_of(this_rq)) { + if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) { + if (rq->curr->sched_class == &ext_sched_class) + rq->curr->scx.slice = 0; + cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); + } + + resched_curr(rq); + } else { + cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); + } + + raw_spin_rq_unlock_irqrestore(rq, flags); +} + +static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq) +{ + struct rq *rq = cpu_rq(cpu); + unsigned long flags; + + raw_spin_rq_lock_irqsave(rq, flags); + + if (!can_skip_idle_kick(rq) && + (cpu_online(cpu) || cpu == cpu_of(this_rq))) + resched_curr(rq); + + raw_spin_rq_unlock_irqrestore(rq, flags); +} + +static void kick_cpus_irq_workfn(struct irq_work *irq_work) +{ + struct rq *this_rq = this_rq(); + struct scx_rq *this_scx = &this_rq->scx; + s32 cpu; + + for_each_cpu(cpu, this_scx->cpus_to_kick) { + kick_one_cpu(cpu, this_rq); + cpumask_clear_cpu(cpu, this_scx->cpus_to_kick); + cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle); + } + + for_each_cpu(cpu, this_scx->cpus_to_kick_if_idle) { + kick_one_cpu_if_idle(cpu, this_rq); + cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle); + } +} + /** * print_scx_info - print out sched_ext scheduler state * @log_lvl: the log level to use when printing @@ -4144,7 +4276,7 @@ void __init init_sched_ext_class(void) * definitions so that BPF scheduler implementations can use them * through the generated vmlinux.h. */ - WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP); + WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP | SCX_KICK_PREEMPT); BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params)); init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL); @@ -4157,6 +4289,11 @@ void __init init_sched_ext_class(void) init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); INIT_LIST_HEAD(&rq->scx.runnable_list); + + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL)); + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL)); + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL)); + init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn); } register_sysrq_key('S', &sysrq_sched_ext_reset_op); @@ -4443,6 +4580,67 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { __bpf_kfunc_start_defs(); +/** + * scx_bpf_kick_cpu - Trigger reschedule on a CPU + * @cpu: cpu to kick + * @flags: %SCX_KICK_* flags + * + * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or + * trigger rescheduling on a busy CPU. This can be called from any online + * scx_ops operation and the actual kicking is performed asynchronously through + * an irq work. + */ +__bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) +{ + struct rq *this_rq; + unsigned long irq_flags; + + if (!ops_cpu_valid(cpu, NULL)) + return; + + /* + * While bypassing for PM ops, IRQ handling may not be online which can + * lead to irq_work_queue() malfunction such as infinite busy wait for + * IRQ status update. Suppress kicking. + */ + if (scx_ops_bypassing()) + return; + + local_irq_save(irq_flags); + + this_rq = this_rq(); + + /* + * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid nesting + * rq locks. We can probably be smarter and avoid bouncing if called + * from ops which don't hold a rq lock. + */ + if (flags & SCX_KICK_IDLE) { + struct rq *target_rq = cpu_rq(cpu); + + if (unlikely(flags & SCX_KICK_PREEMPT)) + scx_ops_error("PREEMPT cannot be used with SCX_KICK_IDLE"); + + if (raw_spin_rq_trylock(target_rq)) { + if (can_skip_idle_kick(target_rq)) { + raw_spin_rq_unlock(target_rq); + goto out; + } + raw_spin_rq_unlock(target_rq); + } + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick_if_idle); + } else { + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick); + + if (flags & SCX_KICK_PREEMPT) + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt); + } + + irq_work_queue(&this_rq->scx.kick_cpus_irq_work); +out: + local_irq_restore(irq_flags); +} + /** * scx_bpf_dsq_nr_queued - Return the number of queued tasks * @dsq_id: id of the DSQ @@ -4841,6 +5039,7 @@ __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p) __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_any) +BTF_ID_FLAGS(func, scx_bpf_kick_cpu) BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 26f8fde4d8cf..be6c7e000620 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -824,12 +824,22 @@ struct cfs_rq { }; #ifdef CONFIG_SCHED_CLASS_EXT +/* scx_rq->flags, protected by the rq lock */ +enum scx_rq_flags { + SCX_RQ_BALANCING = 1 << 1, +}; + struct scx_rq { struct scx_dispatch_q local_dsq; struct list_head runnable_list; /* runnable tasks on this rq */ unsigned long ops_qseq; u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; + u32 flags; + cpumask_var_t cpus_to_kick; + cpumask_var_t cpus_to_kick_if_idle; + cpumask_var_t cpus_to_preempt; + struct irq_work kick_cpus_irq_work; }; #endif /* CONFIG_SCHED_CLASS_EXT */ diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 3ea5cdf58bc7..421118bc56ff 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -34,6 +34,7 @@ void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flag u32 scx_bpf_dispatch_nr_slots(void) __ksym; void scx_bpf_dispatch_cancel(void) __ksym; bool scx_bpf_consume(u64 dsq_id) __ksym; +void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym; s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym; void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 037df2a314a0f2b1c6778602909dc57e5f1790f5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- This patch adds a new example scheduler, scx_central, which demonstrates central scheduling where one CPU is responsible for making all scheduling decisions in the system using scx_bpf_kick_cpu(). The central CPU makes scheduling decisions for all CPUs in the system, queues tasks on the appropriate local dsq's and preempts the worker CPUs. The worker CPUs in turn preempt the central CPU when it needs tasks to run. Currently, every CPU depends on its own tick to expire the current task. A follow-up patch implementing tickless support for sched_ext will allow the worker CPUs to go full tickless so that they can run completely undisturbed. v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch buffer if too many are dispatched to the fallback DSQ. - Use the new SCX_KICK_IDLE to wake up non-central CPUs. - Dropped '-p' option. v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/Makefile | 2 +- tools/sched_ext/scx_central.bpf.c | 214 ++++++++++++++++++++++++++++++ tools/sched_ext/scx_central.c | 105 +++++++++++++++ 3 files changed, 320 insertions(+), 1 deletion(-) create mode 100644 tools/sched_ext/scx_central.bpf.c create mode 100644 tools/sched_ext/scx_central.c diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile index 626782a21375..bf7e108f5ae1 100644 --- a/tools/sched_ext/Makefile +++ b/tools/sched_ext/Makefile @@ -176,7 +176,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR) -c-sched-targets = scx_simple scx_qmap +c-sched-targets = scx_simple scx_qmap scx_central $(addprefix $(BINDIR)/,$(c-sched-targets)): \ $(BINDIR)/%: \ diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c new file mode 100644 index 000000000000..428b2262faa3 --- /dev/null +++ b/tools/sched_ext/scx_central.bpf.c @@ -0,0 +1,214 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A central FIFO sched_ext scheduler which demonstrates the followings: + * + * a. Making all scheduling decisions from one CPU: + * + * The central CPU is the only one making scheduling decisions. All other + * CPUs kick the central CPU when they run out of tasks to run. + * + * There is one global BPF queue and the central CPU schedules all CPUs by + * dispatching from the global queue to each CPU's local dsq from dispatch(). + * This isn't the most straightforward. e.g. It'd be easier to bounce + * through per-CPU BPF queues. The current design is chosen to maximally + * utilize and verify various SCX mechanisms such as LOCAL_ON dispatching. + * + * b. Preemption + * + * SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the + * next tasks. + * + * This scheduler is designed to maximize usage of various SCX mechanisms. A + * more practical implementation would likely put the scheduling loop outside + * the central CPU's dispatch() path and add some form of priority mechanism. + * + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +enum { + FALLBACK_DSQ_ID = 0, +}; + +const volatile s32 central_cpu; +const volatile u32 nr_cpu_ids = 1; /* !0 for veristat, set during init */ +const volatile u64 slice_ns = SCX_SLICE_DFL; + +u64 nr_total, nr_locals, nr_queued, nr_lost_pids; +u64 nr_dispatches, nr_mismatches, nr_retries; +u64 nr_overflows; + +UEI_DEFINE(uei); + +struct { + __uint(type, BPF_MAP_TYPE_QUEUE); + __uint(max_entries, 4096); + __type(value, s32); +} central_q SEC(".maps"); + +/* can't use percpu map due to bad lookups */ +bool RESIZABLE_ARRAY(data, cpu_gimme_task); + +s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + /* + * Steer wakeups to the central CPU as much as possible to avoid + * disturbing other CPUs. It's safe to blindly return the central cpu as + * select_cpu() is a hint and if @p can't be on it, the kernel will + * automatically pick a fallback CPU. + */ + return central_cpu; +} + +void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags) +{ + s32 pid = p->pid; + + __sync_fetch_and_add(&nr_total, 1); + + if (bpf_map_push_elem(¢ral_q, &pid, 0)) { + __sync_fetch_and_add(&nr_overflows, 1); + scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags); + return; + } + + __sync_fetch_and_add(&nr_queued, 1); + + if (!scx_bpf_task_running(p)) + scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT); +} + +static bool dispatch_to_cpu(s32 cpu) +{ + struct task_struct *p; + s32 pid; + + bpf_repeat(BPF_MAX_LOOPS) { + if (bpf_map_pop_elem(¢ral_q, &pid)) + break; + + __sync_fetch_and_sub(&nr_queued, 1); + + p = bpf_task_from_pid(pid); + if (!p) { + __sync_fetch_and_add(&nr_lost_pids, 1); + continue; + } + + /* + * If we can't run the task at the top, do the dumb thing and + * bounce it to the fallback dsq. + */ + if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) { + __sync_fetch_and_add(&nr_mismatches, 1); + scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0); + bpf_task_release(p); + /* + * We might run out of dispatch buffer slots if we continue dispatching + * to the fallback DSQ, without dispatching to the local DSQ of the + * target CPU. In such a case, break the loop now as will fail the + * next dispatch operation. + */ + if (!scx_bpf_dispatch_nr_slots()) + break; + continue; + } + + /* dispatch to local and mark that @cpu doesn't need more */ + scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0); + + if (cpu != central_cpu) + scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE); + + bpf_task_release(p); + return true; + } + + return false; +} + +void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) +{ + if (cpu == central_cpu) { + /* dispatch for all other CPUs first */ + __sync_fetch_and_add(&nr_dispatches, 1); + + bpf_for(cpu, 0, nr_cpu_ids) { + bool *gimme; + + if (!scx_bpf_dispatch_nr_slots()) + break; + + /* central's gimme is never set */ + gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); + if (gimme && !*gimme) + continue; + + if (dispatch_to_cpu(cpu)) + *gimme = false; + } + + /* + * Retry if we ran out of dispatch buffer slots as we might have + * skipped some CPUs and also need to dispatch for self. The ext + * core automatically retries if the local dsq is empty but we + * can't rely on that as we're dispatching for other CPUs too. + * Kick self explicitly to retry. + */ + if (!scx_bpf_dispatch_nr_slots()) { + __sync_fetch_and_add(&nr_retries, 1); + scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT); + return; + } + + /* look for a task to run on the central CPU */ + if (scx_bpf_consume(FALLBACK_DSQ_ID)) + return; + dispatch_to_cpu(central_cpu); + } else { + bool *gimme; + + if (scx_bpf_consume(FALLBACK_DSQ_ID)) + return; + + gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); + if (gimme) + *gimme = true; + + /* + * Force dispatch on the scheduling CPU so that it finds a task + * to run for us. + */ + scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT); + } +} + +int BPF_STRUCT_OPS_SLEEPABLE(central_init) +{ + return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1); +} + +void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SCX_OPS_DEFINE(central_ops, + /* + * We are offloading all scheduling decisions to the central CPU + * and thus being the last task on a given CPU doesn't mean + * anything special. Enqueue the last tasks like any other tasks. + */ + .flags = SCX_OPS_ENQ_LAST, + + .select_cpu = (void *)central_select_cpu, + .enqueue = (void *)central_enqueue, + .dispatch = (void *)central_dispatch, + .init = (void *)central_init, + .exit = (void *)central_exit, + .name = "central"); diff --git a/tools/sched_ext/scx_central.c b/tools/sched_ext/scx_central.c new file mode 100644 index 000000000000..5f09fc666a63 --- /dev/null +++ b/tools/sched_ext/scx_central.c @@ -0,0 +1,105 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> + * Copyright (c) 2022 David Vernet <dvernet@meta.com> + */ +#define _GNU_SOURCE +#include <sched.h> +#include <stdio.h> +#include <unistd.h> +#include <inttypes.h> +#include <signal.h> +#include <libgen.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include "scx_central.bpf.skel.h" + +const char help_fmt[] = +"A central FIFO sched_ext scheduler.\n" +"\n" +"See the top-level comment in .bpf.c for more details.\n" +"\n" +"Usage: %s [-s SLICE_US] [-c CPU]\n" +"\n" +" -s SLICE_US Override slice duration\n" +" -c CPU Override the central CPU (default: 0)\n" +" -v Print libbpf debug messages\n" +" -h Display this help and exit\n"; + +static bool verbose; +static volatile int exit_req; + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !verbose) + return 0; + return vfprintf(stderr, format, args); +} + +static void sigint_handler(int dummy) +{ + exit_req = 1; +} + +int main(int argc, char **argv) +{ + struct scx_central *skel; + struct bpf_link *link; + __u64 seq = 0; + __s32 opt; + + libbpf_set_print(libbpf_print_fn); + signal(SIGINT, sigint_handler); + signal(SIGTERM, sigint_handler); + + skel = SCX_OPS_OPEN(central_ops, scx_central); + + skel->rodata->central_cpu = 0; + skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus(); + + while ((opt = getopt(argc, argv, "s:c:pvh")) != -1) { + switch (opt) { + case 's': + skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; + break; + case 'c': + skel->rodata->central_cpu = strtoul(optarg, NULL, 0); + break; + case 'v': + verbose = true; + break; + default: + fprintf(stderr, help_fmt, basename(argv[0])); + return opt != 'h'; + } + } + + /* Resize arrays so their element count is equal to cpu count. */ + RESIZE_ARRAY(skel, data, cpu_gimme_task, skel->rodata->nr_cpu_ids); + + SCX_OPS_LOAD(skel, central_ops, scx_central, uei); + link = SCX_OPS_ATTACH(skel, central_ops, scx_central); + + while (!exit_req && !UEI_EXITED(skel, uei)) { + printf("[SEQ %llu]\n", seq++); + printf("total :%10" PRIu64 " local:%10" PRIu64 " queued:%10" PRIu64 " lost:%10" PRIu64 "\n", + skel->bss->nr_total, + skel->bss->nr_locals, + skel->bss->nr_queued, + skel->bss->nr_lost_pids); + printf(" dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n", + skel->bss->nr_dispatches, + skel->bss->nr_mismatches, + skel->bss->nr_retries); + printf("overflow:%10" PRIu64 "\n", + skel->bss->nr_overflows); + fflush(stdout); + sleep(1); + } + + bpf_link__destroy(link); + UEI_REPORT(skel, uei); + scx_central__destroy(skel); + return 0; +} -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 0922f54fdd15aedb93730eb8cfa0c069cbad4e08 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The dispatch path retries if the local DSQ is still empty after ops.dispatch() either dispatched or consumed a task. This is both out of necessity and for convenience. It has to retry because the dispatch path might lose the tasks to dequeue while the rq lock is released while trying to migrate tasks across CPUs, and the retry mechanism makes ops.dispatch() implementation easier as it only needs to make some forward progress each iteration. However, this makes it possible for ops.dispatch() to stall CPUs by repeatedly dispatching ineligible tasks. If all CPUs are stalled that way, the watchdog or sysrq handler can't run and the system can't be saved. Let's address the issue by breaking out of the dispatch loop after 32 iterations. It is unlikely but not impossible for ops.dispatch() to legitimately go over the iteration limit. We want to come back to the dispatch path in such cases as not doing so risks stalling the CPU by idling with runnable tasks pending. As the previous task is still current in balance_scx(), resched_curr() doesn't do anything - it will just get cleared. Let's instead use scx_kick_bpf() which will trigger reschedule after switching to the next task which will likely be the idle task. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 17 +++++++++++++++++ tools/sched_ext/scx_qmap.bpf.c | 15 +++++++++++++++ tools/sched_ext/scx_qmap.c | 8 ++++++-- 3 files changed, 38 insertions(+), 2 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 73b8881a5061..51447362394f 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -12,6 +12,7 @@ enum scx_consts { SCX_DSP_DFL_MAX_BATCH = 32, + SCX_DSP_MAX_LOOPS = 32, SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ, SCX_EXIT_BT_LEN = 64, @@ -669,6 +670,7 @@ static struct kobject *scx_root_kobj; #define CREATE_TRACE_POINTS #include <trace/events/sched_ext.h> +static void scx_bpf_kick_cpu(s32 cpu, u64 flags); static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, s64 exit_code, const char *fmt, ...); @@ -1911,6 +1913,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, { struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); bool prev_on_scx = prev->sched_class == &ext_sched_class; + int nr_loops = SCX_DSP_MAX_LOOPS; bool has_tasks = false; lockdep_assert_rq_held(rq); @@ -1967,6 +1970,20 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, goto has_tasks; if (consume_dispatch_q(rq, rf, &scx_dsq_global)) goto has_tasks; + + /* + * ops.dispatch() can trap us in this loop by repeatedly + * dispatching ineligible tasks. Break out once in a while to + * allow the watchdog to run. As IRQ can't be enabled in + * balance(), we want to complete this scheduling cycle and then + * start a new one. IOW, we want to call resched_curr() on the + * next, most likely idle, task, not the current one. Use + * scx_bpf_kick_cpu() for deferred kicking. + */ + if (unlikely(!--nr_loops)) { + scx_bpf_kick_cpu(cpu_of(rq), 0); + break; + } } while (dspc->nr_tasks); goto out; diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 5b3da28bf042..879fc9c788e5 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -31,6 +31,7 @@ char _license[] SEC("license") = "GPL"; const volatile u64 slice_ns = SCX_SLICE_DFL; const volatile u32 stall_user_nth; const volatile u32 stall_kernel_nth; +const volatile u32 dsp_inf_loop_after; const volatile u32 dsp_batch; const volatile s32 disallow_tgid; const volatile bool suppress_dump; @@ -198,6 +199,20 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) if (scx_bpf_consume(SHARED_DSQ)) return; + if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) { + /* + * PID 2 should be kthreadd which should mostly be idle and off + * the scheduler. Let's keep dispatching it to force the kernel + * to call this function over and over again. + */ + p = bpf_task_from_pid(2); + if (p) { + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, 0); + bpf_task_release(p); + return; + } + } + if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) { scx_bpf_error("failed to look up cpu_ctx"); return; diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index a1123a17581b..594147a710a8 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -19,13 +19,14 @@ const char help_fmt[] = "\n" "See the top-level comment in .bpf.c for more details.\n" "\n" -"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-b COUNT]\n" +"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n" " [-d PID] [-D LEN] [-p] [-v]\n" "\n" " -s SLICE_US Override slice duration\n" " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" " -t COUNT Stall every COUNT'th user thread\n" " -T COUNT Stall every COUNT'th kernel thread\n" +" -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n" " -b COUNT Dispatch upto COUNT tasks together\n" " -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n" " -D LEN Set scx_exit_info.dump buffer length\n" @@ -61,7 +62,7 @@ int main(int argc, char **argv) skel = SCX_OPS_OPEN(qmap_ops, scx_qmap); - while ((opt = getopt(argc, argv, "s:e:t:T:b:d:D:Spvh")) != -1) { + while ((opt = getopt(argc, argv, "s:e:t:T:l:b:d:D:Spvh")) != -1) { switch (opt) { case 's': skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; @@ -75,6 +76,9 @@ int main(int argc, char **argv) case 'T': skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0); break; + case 'l': + skel->rodata->dsp_inf_loop_after = strtoul(optarg, NULL, 0); + break; case 'b': skel->rodata->dsp_batch = strtoul(optarg, NULL, 0); break; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 1c29f8541e178c590c2b9b66b9681e6ccab84cea category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Being able to track the task runnable and running state transitions are useful for a variety of purposes including latency tracking and load factor calculation. Currently, BPF schedulers don't have a good way of tracking these transitions. Becoming runnable can be determined from ops.enqueue() but becoming quiescent can only be inferred from the lack of subsequent enqueue. Also, as the local dsq can have multiple tasks and some events are handled in the sched_ext core, it's difficult to determine when a given task starts and stops executing. This patch adds sched_ext_ops.runnable(), .running(), .stopping() and .quiescent() operations to track the task runnable and running state transitions. They're mostly self explanatory; however, we want to ensure that running <-> stopping transitions are always contained within runnable <-> quiescent transitions which is a bit different from how the scheduler core behaves. This adds a bit of complication. See the comment in dequeue_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 105 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 105 insertions(+) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 51447362394f..486a76ac0be0 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -218,6 +218,72 @@ struct sched_ext_ops { */ void (*tick)(struct task_struct *p); + /** + * runnable - A task is becoming runnable on its associated CPU + * @p: task becoming runnable + * @enq_flags: %SCX_ENQ_* + * + * This and the following three functions can be used to track a task's + * execution state transitions. A task becomes ->runnable() on a CPU, + * and then goes through one or more ->running() and ->stopping() pairs + * as it runs on the CPU, and eventually becomes ->quiescent() when it's + * done running on the CPU. + * + * @p is becoming runnable on the CPU because it's + * + * - waking up (%SCX_ENQ_WAKEUP) + * - being moved from another CPU + * - being restored after temporarily taken off the queue for an + * attribute change. + * + * This and ->enqueue() are related but not coupled. This operation + * notifies @p's state transition and may not be followed by ->enqueue() + * e.g. when @p is being dispatched to a remote CPU, or when @p is + * being enqueued on a CPU experiencing a hotplug event. Likewise, a + * task may be ->enqueue()'d without being preceded by this operation + * e.g. after exhausting its slice. + */ + void (*runnable)(struct task_struct *p, u64 enq_flags); + + /** + * running - A task is starting to run on its associated CPU + * @p: task starting to run + * + * See ->runnable() for explanation on the task state notifiers. + */ + void (*running)(struct task_struct *p); + + /** + * stopping - A task is stopping execution + * @p: task stopping to run + * @runnable: is task @p still runnable? + * + * See ->runnable() for explanation on the task state notifiers. If + * !@runnable, ->quiescent() will be invoked after this operation + * returns. + */ + void (*stopping)(struct task_struct *p, bool runnable); + + /** + * quiescent - A task is becoming not runnable on its associated CPU + * @p: task becoming not runnable + * @deq_flags: %SCX_DEQ_* + * + * See ->runnable() for explanation on the task state notifiers. + * + * @p is becoming quiescent on the CPU because it's + * + * - sleeping (%SCX_DEQ_SLEEP) + * - being moved to another CPU + * - being temporarily taken off the queue for an attribute change + * (%SCX_DEQ_SAVE) + * + * This and ->dequeue() are related but not coupled. This operation + * notifies @p's state transition and may not be preceded by ->dequeue() + * e.g. when @p is being dispatched to a remote CPU. + */ + void (*quiescent)(struct task_struct *p, u64 deq_flags); + /** * yield - Yield CPU * @from: yielding task @@ -1363,6 +1429,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags rq->scx.nr_running++; add_nr_running(rq, 1); + if (SCX_HAS_OP(runnable)) + SCX_CALL_OP(SCX_KF_REST, runnable, p, enq_flags); + do_enqueue_task(rq, p, enq_flags, sticky_cpu); } @@ -1422,6 +1491,26 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags ops_dequeue(p, deq_flags); + /* + * A currently running task which is going off @rq first gets dequeued + * and then stops running. As we want running <-> stopping transitions + * to be contained within runnable <-> quiescent transitions, trigger + * ->stopping() early here instead of in put_prev_task_scx(). + * + * @p may go through multiple stopping <-> running transitions between + * here and put_prev_task_scx() if task attribute changes occur while + * balance_scx() leaves @rq unlocked. However, they don't contain any + * information meaningful to the BPF scheduler and can be suppressed by + * skipping the callbacks if the task is !QUEUED. + */ + if (SCX_HAS_OP(stopping) && task_current(rq, p)) { + update_curr_scx(rq); + SCX_CALL_OP(SCX_KF_REST, stopping, p, false); + } + + if (SCX_HAS_OP(quiescent)) + SCX_CALL_OP(SCX_KF_REST, quiescent, p, deq_flags); + if (deq_flags & SCX_DEQ_SLEEP) p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; else @@ -2004,6 +2093,10 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) p->se.exec_start = rq_clock_task(rq); + /* see dequeue_task_scx() on why we skip when !QUEUED */ + if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED)) + SCX_CALL_OP(SCX_KF_REST, running, p); + clr_task_runnable(p, true); } @@ -2042,6 +2135,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p) update_curr_scx(rq); + /* see dequeue_task_scx() on why we skip when !QUEUED */ + if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED)) + SCX_CALL_OP(SCX_KF_REST, stopping, p, true); + /* * If we're being called from put_prev_task_balance(), balance_scx() may * have decided that @p should keep running. @@ -4086,6 +4183,10 @@ static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags) static void enqueue_stub(struct task_struct *p, u64 enq_flags) {} static void dequeue_stub(struct task_struct *p, u64 enq_flags) {} static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {} +static void runnable_stub(struct task_struct *p, u64 enq_flags) {} +static void running_stub(struct task_struct *p) {} +static void stopping_stub(struct task_struct *p, bool runnable) {} +static void quiescent_stub(struct task_struct *p, u64 deq_flags) {} static bool yield_stub(struct task_struct *from, struct task_struct *to) { return false; } static void set_weight_stub(struct task_struct *p, u32 weight) {} static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {} @@ -4102,6 +4203,10 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .enqueue = enqueue_stub, .dequeue = dequeue_stub, .dispatch = dispatch_stub, + .runnable = runnable_stub, + .running = running_stub, + .stopping = stopping_stub, + .quiescent = quiescent_stub, .yield = yield_stub, .set_weight = set_weight_stub, .set_cpumask = set_cpumask_stub, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 22a920209ab6aa4f8ec960ed81041643fddeaec6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Allow BPF schedulers to indicate tickless operation by setting p->scx.slice to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into tickless operation. scx_central is updated to use tickless operations for all tasks and instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT and task state tracking added by the previous patches. Currently, there is no way to pin the timer on the central CPU, so it may end up on one of the worker CPUs; however, outside of that, the worker CPUs can go tickless both while running sched_ext tasks and idling. With schbench running, scx_central shows: root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 142024 656 664 449 Local timer interrupts LOC: 161663 663 665 449 Local timer interrupts Without it: root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts LOC: 188778 3142 3793 3993 Local timer interrupts LOC: 198993 5314 6323 6438 Local timer interrupts While scx_central itself is too barebone to be useful as a production scheduler, a more featureful central scheduler can be built using the same approach. Google's experience shows that such an approach can have significant benefits for certain applications such as VM hosting. v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available. v3: Pin the central scheduler's timer on the central_cpu using BPF_F_TIMER_CPU_PIN. v2: Convert to BPF inline iterators. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 1 + kernel/sched/core.c | 11 ++- kernel/sched/ext.c | 52 +++++++++- kernel/sched/ext.h | 2 + kernel/sched/sched.h | 1 + tools/sched_ext/scx_central.bpf.c | 159 ++++++++++++++++++++++++++++-- tools/sched_ext/scx_central.c | 29 +++++- 7 files changed, 242 insertions(+), 13 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 3b2809b980ac..6f1a4977e9f8 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -16,6 +16,7 @@ enum scx_public_consts { SCX_OPS_NAME_LEN = 128, SCX_SLICE_DFL = 20 * 1000000, /* 20ms */ + SCX_SLICE_INF = U64_MAX, /* infinite, implies nohz */ }; /* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 600b942961c2..f2bba949f4b3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1265,11 +1265,14 @@ bool sched_can_stop_tick(struct rq *rq) return true; /* - * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left; - * if there's more than one we need the tick for involuntary - * preemption. + * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks + * left. For CFS, if there's more than one we need the tick for + * involuntary preemption. For SCX, ask. */ - if (rq->nr_running > 1) + if (!scx_switched_all() && rq->nr_running > 1) + return false; + + if (scx_enabled() && !scx_can_stop_tick(rq)) return false; /* diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 486a76ac0be0..c0591f9ec938 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1090,7 +1090,8 @@ static void update_curr_scx(struct rq *rq) account_group_exec_runtime(curr, delta_exec); cgroup_account_cputime(curr, delta_exec); - curr->scx.slice -= min(curr->scx.slice, delta_exec); + if (curr->scx.slice != SCX_SLICE_INF) + curr->scx.slice -= min(curr->scx.slice, delta_exec); } static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) @@ -2098,6 +2099,28 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) SCX_CALL_OP(SCX_KF_REST, running, p); clr_task_runnable(p, true); + + /* + * @p is getting newly scheduled or got kicked after someone updated its + * slice. Refresh whether tick can be stopped. See scx_can_stop_tick(). + */ + if ((p->scx.slice == SCX_SLICE_INF) != + (bool)(rq->scx.flags & SCX_RQ_CAN_STOP_TICK)) { + if (p->scx.slice == SCX_SLICE_INF) + rq->scx.flags |= SCX_RQ_CAN_STOP_TICK; + else + rq->scx.flags &= ~SCX_RQ_CAN_STOP_TICK; + + sched_update_tick_dependency(rq); + + /* + * For now, let's refresh the load_avgs just when transitioning + * in and out of nohz. In the future, we might want to add a + * mechanism which calls the following periodically on + * tick-stopped CPUs. + */ + update_other_load_avgs(rq); + } } static void put_prev_task_scx(struct rq *rq, struct task_struct *p) @@ -2823,6 +2846,26 @@ int scx_check_setscheduler(struct task_struct *p, int policy) return 0; } +#ifdef CONFIG_NO_HZ_FULL +bool scx_can_stop_tick(struct rq *rq) +{ + struct task_struct *p = rq->curr; + + if (scx_ops_bypassing()) + return false; + + if (p->sched_class != &ext_sched_class) + return true; + + /* + * @rq can dispatch from different DSQs, so we can't tell whether it + * needs the tick or not by looking at nr_running. Allow stopping ticks + * iff the BPF scheduler indicated so. See set_next_task_scx(). + */ + return rq->scx.flags & SCX_RQ_CAN_STOP_TICK; +} +#endif + /* * Omitted operations: * @@ -3125,6 +3168,9 @@ static void scx_ops_bypass(bool bypass) } rq_unlock_irqrestore(rq, &rf); + + /* kick to restore ticks */ + resched_cpu(cpu); } } @@ -4581,7 +4627,9 @@ __bpf_kfunc_start_defs(); * BPF locks (in the future when BPF introduces more flexible locking). * * @p is allowed to run for @slice. The scheduling path is triggered on slice - * exhaustion. If zero, the current residual slice is maintained. + * exhaustion. If zero, the current residual slice is maintained. If + * %SCX_SLICE_INF, @p never expires and the BPF scheduler must kick the CPU with + * scx_bpf_kick_cpu() to trigger scheduling. */ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 33a9f7fe5832..6ed946f72489 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -35,6 +35,7 @@ void scx_pre_fork(struct task_struct *p); int scx_fork(struct task_struct *p); void scx_post_fork(struct task_struct *p); void scx_cancel_fork(struct task_struct *p); +bool scx_can_stop_tick(struct rq *rq); int scx_check_setscheduler(struct task_struct *p, int policy); bool task_should_scx(struct task_struct *p); void init_sched_ext_class(void); @@ -73,6 +74,7 @@ static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} static inline void scx_cancel_fork(struct task_struct *p) {} +static inline bool scx_can_stop_tick(struct rq *rq) { return true; } static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; } static inline bool task_on_scx(const struct task_struct *p) { return false; } static inline void init_sched_ext_class(void) {} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index be6c7e000620..faf3ffa95bd4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -827,6 +827,7 @@ struct cfs_rq { /* scx_rq->flags, protected by the rq lock */ enum scx_rq_flags { SCX_RQ_BALANCING = 1 << 1, + SCX_RQ_CAN_STOP_TICK = 1 << 2, }; struct scx_rq { diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c index 428b2262faa3..1d8fd570eaa7 100644 --- a/tools/sched_ext/scx_central.bpf.c +++ b/tools/sched_ext/scx_central.bpf.c @@ -13,7 +13,26 @@ * through per-CPU BPF queues. The current design is chosen to maximally * utilize and verify various SCX mechanisms such as LOCAL_ON dispatching. * - * b. Preemption + * b. Tickless operation + * + * All tasks are dispatched with the infinite slice which allows stopping the + * ticks on CONFIG_NO_HZ_FULL kernels running with the proper nohz_full + * parameter. The tickless operation can be observed through + * /proc/interrupts. + * + * Periodic switching is enforced by a periodic timer checking all CPUs and + * preempting them as necessary. Unfortunately, BPF timer currently doesn't + * have a way to pin to a specific CPU, so the periodic timer isn't pinned to + * the central CPU. + * + * c. Preemption + * + * Kthreads are unconditionally queued to the head of a matching local dsq + * and dispatched with SCX_DSQ_PREEMPT. This ensures that a kthread is always + * prioritized over user threads, which is required for ensuring forward + * progress as e.g. the periodic timer may run on a ksoftirqd and if the + * ksoftirqd gets starved by a user thread, there may not be anything else to + * vacate that user thread. * * SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the * next tasks. @@ -32,14 +51,17 @@ char _license[] SEC("license") = "GPL"; enum { FALLBACK_DSQ_ID = 0, + MS_TO_NS = 1000LLU * 1000, + TIMER_INTERVAL_NS = 1 * MS_TO_NS, }; const volatile s32 central_cpu; const volatile u32 nr_cpu_ids = 1; /* !0 for veristat, set during init */ const volatile u64 slice_ns = SCX_SLICE_DFL; +bool timer_pinned = true; u64 nr_total, nr_locals, nr_queued, nr_lost_pids; -u64 nr_dispatches, nr_mismatches, nr_retries; +u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries; u64 nr_overflows; UEI_DEFINE(uei); @@ -52,6 +74,23 @@ struct { /* can't use percpu map due to bad lookups */ bool RESIZABLE_ARRAY(data, cpu_gimme_task); +u64 RESIZABLE_ARRAY(data, cpu_started_at); + +struct central_timer { + struct bpf_timer timer; +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, u32); + __type(value, struct central_timer); +} central_timer SEC(".maps"); + +static bool vtime_before(u64 a, u64 b) +{ + return (s64)(a - b) < 0; +} s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) @@ -71,9 +110,22 @@ void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags) __sync_fetch_and_add(&nr_total, 1); + /* + * Push per-cpu kthreads at the head of local dsq's and preempt the + * corresponding CPU. This ensures that e.g. ksoftirqd isn't blocked + * behind other threads which is necessary for forward progress + * guarantee as we depend on the BPF timer which may run from ksoftirqd. + */ + if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) { + __sync_fetch_and_add(&nr_locals, 1); + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF, + enq_flags | SCX_ENQ_PREEMPT); + return; + } + if (bpf_map_push_elem(¢ral_q, &pid, 0)) { __sync_fetch_and_add(&nr_overflows, 1); - scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags); + scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, enq_flags); return; } @@ -106,7 +158,7 @@ static bool dispatch_to_cpu(s32 cpu) */ if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) { __sync_fetch_and_add(&nr_mismatches, 1); - scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, 0); + scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, 0); bpf_task_release(p); /* * We might run out of dispatch buffer slots if we continue dispatching @@ -120,7 +172,7 @@ static bool dispatch_to_cpu(s32 cpu) } /* dispatch to local and mark that @cpu doesn't need more */ - scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0); + scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_INF, 0); if (cpu != central_cpu) scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE); @@ -188,9 +240,102 @@ void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) } } +void BPF_STRUCT_OPS(central_running, struct task_struct *p) +{ + s32 cpu = scx_bpf_task_cpu(p); + u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids); + if (started_at) + *started_at = bpf_ktime_get_ns() ?: 1; /* 0 indicates idle */ +} + +void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable) +{ + s32 cpu = scx_bpf_task_cpu(p); + u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids); + if (started_at) + *started_at = 0; +} + +static int central_timerfn(void *map, int *key, struct bpf_timer *timer) +{ + u64 now = bpf_ktime_get_ns(); + u64 nr_to_kick = nr_queued; + s32 i, curr_cpu; + + curr_cpu = bpf_get_smp_processor_id(); + if (timer_pinned && (curr_cpu != central_cpu)) { + scx_bpf_error("Central timer ran on CPU %d, not central CPU %d", + curr_cpu, central_cpu); + return 0; + } + + bpf_for(i, 0, nr_cpu_ids) { + s32 cpu = (nr_timers + i) % nr_cpu_ids; + u64 *started_at; + + if (cpu == central_cpu) + continue; + + /* kick iff the current one exhausted its slice */ + started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids); + if (started_at && *started_at && + vtime_before(now, *started_at + slice_ns)) + continue; + + /* and there's something pending */ + if (scx_bpf_dsq_nr_queued(FALLBACK_DSQ_ID) || + scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu)) + ; + else if (nr_to_kick) + nr_to_kick--; + else + continue; + + scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT); + } + + bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); + __sync_fetch_and_add(&nr_timers, 1); + return 0; +} + int BPF_STRUCT_OPS_SLEEPABLE(central_init) { - return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1); + u32 key = 0; + struct bpf_timer *timer; + int ret; + + ret = scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1); + if (ret) + return ret; + + timer = bpf_map_lookup_elem(¢ral_timer, &key); + if (!timer) + return -ESRCH; + + if (bpf_get_smp_processor_id() != central_cpu) { + scx_bpf_error("init from non-central CPU"); + return -EINVAL; + } + + bpf_timer_init(timer, ¢ral_timer, CLOCK_MONOTONIC); + bpf_timer_set_callback(timer, central_timerfn); + + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); + /* + * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a + * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. + * Retry without the PIN. This would be the perfect use case for + * bpf_core_enum_value_exists() but the enum type doesn't have a name + * and can't be used with bpf_core_enum_value_exists(). Oh well... + */ + if (ret == -EINVAL) { + timer_pinned = false; + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); + } + if (ret) + scx_bpf_error("bpf_timer_start failed (%d)", ret); + return ret; } void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei) @@ -209,6 +354,8 @@ SCX_OPS_DEFINE(central_ops, .select_cpu = (void *)central_select_cpu, .enqueue = (void *)central_enqueue, .dispatch = (void *)central_dispatch, + .running = (void *)central_running, + .stopping = (void *)central_stopping, .init = (void *)central_init, .exit = (void *)central_exit, .name = "central"); diff --git a/tools/sched_ext/scx_central.c b/tools/sched_ext/scx_central.c index 5f09fc666a63..fb3f50886552 100644 --- a/tools/sched_ext/scx_central.c +++ b/tools/sched_ext/scx_central.c @@ -48,6 +48,7 @@ int main(int argc, char **argv) struct bpf_link *link; __u64 seq = 0; __s32 opt; + cpu_set_t *cpuset; libbpf_set_print(libbpf_print_fn); signal(SIGINT, sigint_handler); @@ -77,10 +78,35 @@ int main(int argc, char **argv) /* Resize arrays so their element count is equal to cpu count. */ RESIZE_ARRAY(skel, data, cpu_gimme_task, skel->rodata->nr_cpu_ids); + RESIZE_ARRAY(skel, data, cpu_started_at, skel->rodata->nr_cpu_ids); SCX_OPS_LOAD(skel, central_ops, scx_central, uei); + + /* + * Affinitize the loading thread to the central CPU, as: + * - That's where the BPF timer is first invoked in the BPF program. + * - We probably don't want this user space component to take up a core + * from a task that would benefit from avoiding preemption on one of + * the tickless cores. + * + * Until BPF supports pinning the timer, it's not guaranteed that it + * will always be invoked on the central CPU. In practice, this + * suffices the majority of the time. + */ + cpuset = CPU_ALLOC(skel->rodata->nr_cpu_ids); + SCX_BUG_ON(!cpuset, "Failed to allocate cpuset"); + CPU_ZERO(cpuset); + CPU_SET(skel->rodata->central_cpu, cpuset); + SCX_BUG_ON(sched_setaffinity(0, sizeof(cpuset), cpuset), + "Failed to affinitize to central CPU %d (max %d)", + skel->rodata->central_cpu, skel->rodata->nr_cpu_ids - 1); + CPU_FREE(cpuset); + link = SCX_OPS_ATTACH(skel, central_ops, scx_central); + if (!skel->data->timer_pinned) + printf("WARNING : BPF_F_TIMER_CPU_PIN not available, timer not pinned to central\n"); + while (!exit_req && !UEI_EXITED(skel, uei)) { printf("[SEQ %llu]\n", seq++); printf("total :%10" PRIu64 " local:%10" PRIu64 " queued:%10" PRIu64 " lost:%10" PRIu64 "\n", @@ -88,7 +114,8 @@ int main(int argc, char **argv) skel->bss->nr_locals, skel->bss->nr_queued, skel->bss->nr_lost_pids); - printf(" dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n", + printf("timer :%10" PRIu64 " dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n", + skel->bss->nr_timers, skel->bss->nr_dispatches, skel->bss->nr_mismatches, skel->bss->nr_retries); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 36454023f50b2aadd25b7f47c50b5edc0c59e1c0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- When some SCX operations are in flight, it is known that the subject task's rq lock is held throughout which makes it safe to access certain fields of the task - e.g. its current task_group. We want to add SCX kfunc helpers that can make use of this guarantee - e.g. to help determining the currently associated CPU cgroup from the task's current task_group. As it'd be dangerous call such a helper on a task which isn't rq lock protected, the helper should be able to verify the input task and reject accordingly. This patch adds sched_ext_entity.kf_tasks[] that track the tasks which are currently being operated on by a terminal SCX operation. The new SCX_CALL_OP_[2]TASK[_RET]() can be used when invoking SCX operations which take tasks as arguments and the scx_kf_allowed_on_arg_tasks() can be used by kfunc helpers to verify the input task status. Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking is currently only limited to terminal SCX operations. If needed in the future, this restriction can be removed by moving the tracking to the task side with a couple per-task counters. v2: Updated to reflect the addition of SCX_KF_SELECT_CPU. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 2 + kernel/sched/ext.c | 91 +++++++++++++++++++++++++++++++-------- 2 files changed, 76 insertions(+), 17 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 6f1a4977e9f8..74341dbc6a19 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -106,6 +106,7 @@ enum scx_kf_mask { __SCX_KF_RQ_LOCKED = SCX_KF_DISPATCH | SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, + __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, }; /* @@ -120,6 +121,7 @@ struct sched_ext_entity { s32 sticky_cpu; s32 holding_cpu; u32 kf_mask; /* see scx_kf_mask above */ + struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ atomic_long_t ops_state; struct list_head runnable_node; /* rq->scx.runnable_list */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index c0591f9ec938..3cb46c4c9475 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -821,6 +821,47 @@ do { \ __ret; \ }) +/* + * Some kfuncs are allowed only on the tasks that are subjects of the + * in-progress scx_ops operation for, e.g., locking guarantees. To enforce such + * restrictions, the following SCX_CALL_OP_*() variants should be used when + * invoking scx_ops operations that take task arguments. These can only be used + * for non-nesting operations due to the way the tasks are tracked. + * + * kfuncs which can only operate on such tasks can in turn use + * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on + * the specific task. + */ +#define SCX_CALL_OP_TASK(mask, op, task, args...) \ +do { \ + BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ + current->scx.kf_tasks[0] = task; \ + SCX_CALL_OP(mask, op, task, ##args); \ + current->scx.kf_tasks[0] = NULL; \ +} while (0) + +#define SCX_CALL_OP_TASK_RET(mask, op, task, args...) \ +({ \ + __typeof__(scx_ops.op(task, ##args)) __ret; \ + BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ + current->scx.kf_tasks[0] = task; \ + __ret = SCX_CALL_OP_RET(mask, op, task, ##args); \ + current->scx.kf_tasks[0] = NULL; \ + __ret; \ +}) + +#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...) \ +({ \ + __typeof__(scx_ops.op(task0, task1, ##args)) __ret; \ + BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ + current->scx.kf_tasks[0] = task0; \ + current->scx.kf_tasks[1] = task1; \ + __ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args); \ + current->scx.kf_tasks[0] = NULL; \ + current->scx.kf_tasks[1] = NULL; \ + __ret; \ +}) + /* @mask is constant, always inline to cull unnecessary branches */ static __always_inline bool scx_kf_allowed(u32 mask) { @@ -850,6 +891,22 @@ static __always_inline bool scx_kf_allowed(u32 mask) return true; } +/* see SCX_CALL_OP_TASK() */ +static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask, + struct task_struct *p) +{ + if (!scx_kf_allowed(mask)) + return false; + + if (unlikely((p != current->scx.kf_tasks[0] && + p != current->scx.kf_tasks[1]))) { + scx_ops_error("called on a task not being operated on"); + return false; + } + + return true; +} + /* * SCX task iterator. @@ -1346,7 +1403,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - SCX_CALL_OP(SCX_KF_ENQUEUE, enqueue, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags); *ddsp_taskp = NULL; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) @@ -1431,7 +1488,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags add_nr_running(rq, 1); if (SCX_HAS_OP(runnable)) - SCX_CALL_OP(SCX_KF_REST, runnable, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags); do_enqueue_task(rq, p, enq_flags, sticky_cpu); } @@ -1457,7 +1514,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags) BUG(); case SCX_OPSS_QUEUED: if (SCX_HAS_OP(dequeue)) - SCX_CALL_OP(SCX_KF_REST, dequeue, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags); if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -1506,11 +1563,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags */ if (SCX_HAS_OP(stopping) && task_current(rq, p)) { update_curr_scx(rq); - SCX_CALL_OP(SCX_KF_REST, stopping, p, false); + SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false); } if (SCX_HAS_OP(quiescent)) - SCX_CALL_OP(SCX_KF_REST, quiescent, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags); if (deq_flags & SCX_DEQ_SLEEP) p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; @@ -1530,7 +1587,7 @@ static void yield_task_scx(struct rq *rq) struct task_struct *p = rq->curr; if (SCX_HAS_OP(yield)) - SCX_CALL_OP_RET(SCX_KF_REST, yield, p, NULL); + SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL); else p->scx.slice = 0; } @@ -1540,7 +1597,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) struct task_struct *from = rq->curr; if (SCX_HAS_OP(yield)) - return SCX_CALL_OP_RET(SCX_KF_REST, yield, from, to); + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to); else return false; } @@ -2096,7 +2153,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP(SCX_KF_REST, running, p); + SCX_CALL_OP_TASK(SCX_KF_REST, running, p); clr_task_runnable(p, true); @@ -2160,7 +2217,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p) /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP(SCX_KF_REST, stopping, p, true); + SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true); /* * If we're being called from put_prev_task_balance(), balance_scx() may @@ -2382,8 +2439,8 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - cpu = SCX_CALL_OP_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, - select_cpu, p, prev_cpu, wake_flags); + cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, + select_cpu, p, prev_cpu, wake_flags); *ddsp_taskp = NULL; if (ops_cpu_valid(cpu, "from ops.select_cpu()")) return cpu; @@ -2416,8 +2473,8 @@ static void set_cpus_allowed_scx(struct task_struct *p, * designation pointless. Cast it away when calling the operation. */ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP(SCX_KF_REST, set_cpumask, p, - (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, + (struct cpumask *)p->cpus_ptr); } static void reset_idle_masks(void) @@ -2652,7 +2709,7 @@ static void scx_ops_enable_task(struct task_struct *p) */ set_task_scx_weight(p); if (SCX_HAS_OP(enable)) - SCX_CALL_OP(SCX_KF_REST, enable, p); + SCX_CALL_OP_TASK(SCX_KF_REST, enable, p); scx_set_task_state(p, SCX_TASK_ENABLED); if (SCX_HAS_OP(set_weight)) @@ -2806,7 +2863,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, const struct set_task_scx_weight(p); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); } static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio) @@ -2822,8 +2879,8 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) * different scheduler class. Keep the BPF scheduler up-to-date. */ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP(SCX_KF_REST, set_cpumask, p, - (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, + (struct cpumask *)p->cpus_ptr); } static void switched_from_scx(struct rq *rq, struct task_struct *p) -- 2.34.1
From: David Vernet <dvernet@meta.com> mainline inclusion from mainline-v6.12-rc1 commit 90e55164dad42c6546b698c031697b224a320834 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for the kicked cpu to enter the scheduler. See the following for example usage: https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation. - Include SCX_KICK_WAIT related information in debug dump. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 4 ++- kernel/sched/ext.c | 82 ++++++++++++++++++++++++++++++++++++++++---- kernel/sched/ext.h | 4 +++ kernel/sched/sched.h | 2 ++ 4 files changed, 85 insertions(+), 7 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f2bba949f4b3..d2f00496fa26 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5949,8 +5949,10 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) for_each_active_class(class) { p = class->pick_next_task(rq); - if (p) + if (p) { + scx_next_task_picked(rq, p, class); return p; + } } BUG(); /* The idle class should always have a runnable task. */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 3cb46c4c9475..dfbfacca6722 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -536,6 +536,12 @@ enum scx_kick_flags { * task expires and the dispatch path is invoked. */ SCX_KICK_PREEMPT = 1LLU << 1, + + /* + * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will + * return after the target CPU finishes picking the next task. + */ + SCX_KICK_WAIT = 1LLU << 2, }; enum scx_ops_enable_state { @@ -665,6 +671,9 @@ static struct { #endif /* CONFIG_SMP */ +/* for %SCX_KICK_WAIT */ +static unsigned long __percpu *scx_kick_cpus_pnt_seqs; + /* * Direct dispatch marker. * @@ -2293,6 +2302,23 @@ static struct task_struct *pick_next_task_scx(struct rq *rq) return p; } +void scx_next_task_picked(struct rq *rq, struct task_struct *p, + const struct sched_class *active) +{ + lockdep_assert_rq_held(rq); + + if (!scx_enabled()) + return; +#ifdef CONFIG_SMP + /* + * Pairs with the smp_load_acquire() issued by a CPU in + * kick_cpus_irq_workfn() who is waiting for this CPU to perform a + * resched. + */ + smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); +#endif +} + #ifdef CONFIG_SMP static bool test_and_clear_cpu_idle(int cpu) @@ -3678,9 +3704,9 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) seq_buf_init(&ns, buf, avail); dump_newline(&ns); - dump_line(&ns, "CPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu", + dump_line(&ns, "CPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu pnt_seq=%lu", cpu, rq->scx.nr_running, rq->scx.flags, - rq->scx.ops_qseq); + rq->scx.ops_qseq, rq->scx.pnt_seq); dump_line(&ns, " curr=%s[%d] class=%ps", rq->curr->comm, rq->curr->pid, rq->curr->sched_class); @@ -3693,6 +3719,9 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) if (!cpumask_empty(rq->scx.cpus_to_preempt)) dump_line(&ns, " cpus_to_preempt: %*pb", cpumask_pr_args(rq->scx.cpus_to_preempt)); + if (!cpumask_empty(rq->scx.cpus_to_wait)) + dump_line(&ns, " cpus_to_wait : %*pb", + cpumask_pr_args(rq->scx.cpus_to_wait)); used = seq_buf_used(&ns); if (SCX_HAS_OP(dump_cpu)) { @@ -4388,10 +4417,11 @@ static bool can_skip_idle_kick(struct rq *rq) return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_BALANCING); } -static void kick_one_cpu(s32 cpu, struct rq *this_rq) +static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs) { struct rq *rq = cpu_rq(cpu); struct scx_rq *this_scx = &this_rq->scx; + bool should_wait = false; unsigned long flags; raw_spin_rq_lock_irqsave(rq, flags); @@ -4407,12 +4437,20 @@ static void kick_one_cpu(s32 cpu, struct rq *this_rq) cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); } + if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) { + pseqs[cpu] = rq->scx.pnt_seq; + should_wait = true; + } + resched_curr(rq); } else { cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); } raw_spin_rq_unlock_irqrestore(rq, flags); + + return should_wait; } static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq) @@ -4433,10 +4471,12 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) { struct rq *this_rq = this_rq(); struct scx_rq *this_scx = &this_rq->scx; + unsigned long *pseqs = this_cpu_ptr(scx_kick_cpus_pnt_seqs); + bool should_wait = false; s32 cpu; for_each_cpu(cpu, this_scx->cpus_to_kick) { - kick_one_cpu(cpu, this_rq); + should_wait |= kick_one_cpu(cpu, this_rq, pseqs); cpumask_clear_cpu(cpu, this_scx->cpus_to_kick); cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle); } @@ -4445,6 +4485,28 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) kick_one_cpu_if_idle(cpu, this_rq); cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle); } + + if (!should_wait) + return; + + for_each_cpu(cpu, this_scx->cpus_to_wait) { + unsigned long *wait_pnt_seq = &cpu_rq(cpu)->scx.pnt_seq; + + if (cpu != cpu_of(this_rq)) { + /* + * Pairs with smp_store_release() issued by this CPU in + * scx_next_task_picked() on the resched path. + * + * We busy-wait here to guarantee that no other task can + * be scheduled on our core before the target CPU has + * entered the resched path. + */ + while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) + cpu_relax(); + } + + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); + } } /** @@ -4509,6 +4571,11 @@ void __init init_sched_ext_class(void) BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); #endif + scx_kick_cpus_pnt_seqs = + __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids, + __alignof__(scx_kick_cpus_pnt_seqs[0])); + BUG_ON(!scx_kick_cpus_pnt_seqs); + for_each_possible_cpu(cpu) { struct rq *rq = cpu_rq(cpu); @@ -4518,6 +4585,7 @@ void __init init_sched_ext_class(void) BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL)); + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL)); init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn); } @@ -4845,8 +4913,8 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) if (flags & SCX_KICK_IDLE) { struct rq *target_rq = cpu_rq(cpu); - if (unlikely(flags & SCX_KICK_PREEMPT)) - scx_ops_error("PREEMPT cannot be used with SCX_KICK_IDLE"); + if (unlikely(flags & (SCX_KICK_PREEMPT | SCX_KICK_WAIT))) + scx_ops_error("PREEMPT/WAIT cannot be used with SCX_KICK_IDLE"); if (raw_spin_rq_trylock(target_rq)) { if (can_skip_idle_kick(target_rq)) { @@ -4861,6 +4929,8 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) if (flags & SCX_KICK_PREEMPT) cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt); + if (flags & SCX_KICK_WAIT) + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait); } irq_work_queue(&this_rq->scx.kick_cpus_irq_work); diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 6ed946f72489..0aeb1fda1794 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -29,6 +29,8 @@ static inline bool task_on_scx(const struct task_struct *p) return scx_enabled() && p->sched_class == &ext_sched_class; } +void scx_next_task_picked(struct rq *rq, struct task_struct *p, + const struct sched_class *active); void scx_tick(struct rq *rq); void init_scx_entity(struct sched_ext_entity *scx); void scx_pre_fork(struct task_struct *p); @@ -69,6 +71,8 @@ static inline const struct sched_class *next_active_class(const struct sched_cla #define scx_enabled() false #define scx_switched_all() false +static inline void scx_next_task_picked(struct rq *rq, struct task_struct *p, + const struct sched_class *active) {} static inline void scx_tick(struct rq *rq) {} static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index faf3ffa95bd4..4f0a62b78f7c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -840,6 +840,8 @@ struct scx_rq { cpumask_var_t cpus_to_kick; cpumask_var_t cpus_to_kick_if_idle; cpumask_var_t cpus_to_preempt; + cpumask_var_t cpus_to_wait; + unsigned long pnt_seq; struct irq_work kick_cpus_irq_work; }; #endif /* CONFIG_SCHED_CLASS_EXT */ -- 2.34.1
From: David Vernet <dvernet@meta.com> mainline inclusion from mainline-v6.12-rc1 commit 245254f7081dbe1c8da54675d0e4ddbe74cee61b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Scheduler classes are strictly ordered and when a higher priority class has tasks to run, the lower priority ones lose access to the CPU. Being able to monitor and act on these events are necessary for use cases includling strict core-scheduling and latency management. This patch adds two operations ops.cpu_acquire() and .cpu_release(). The former is invoked when a CPU becomes available to the BPF scheduler and the opposite for the latter. This patch also implements scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger requeueing of all tasks in the local dsq of the CPU so that the tasks can be reassigned to other available CPUs. scx_pair is updated to use .cpu_acquire/release() along with %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU is preempted by a higher priority scheduler class. scx_qmap is updated to use .cpu_acquire/release() to empty the local dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers that want to have a tight control over latency. v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing. v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local() from ops.cpu_release() nested inside ops.init() and other sleepable operations. Signed-off-by: David Vernet <dvernet@meta.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 4 +- kernel/sched/ext.c | 198 ++++++++++++++++++++++- kernel/sched/ext.h | 2 + kernel/sched/sched.h | 1 + tools/sched_ext/include/scx/common.bpf.h | 1 + tools/sched_ext/scx_qmap.bpf.c | 37 ++++- tools/sched_ext/scx_qmap.c | 4 +- 7 files changed, 240 insertions(+), 7 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 74341dbc6a19..21c627337e01 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -98,13 +98,15 @@ enum scx_kf_mask { SCX_KF_UNLOCKED = 0, /* not sleepable, not rq locked */ /* all non-sleepables may be nested inside SLEEPABLE */ SCX_KF_SLEEPABLE = 1 << 0, /* sleepable init operations */ + /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ + SCX_KF_CPU_RELEASE = 1 << 1, /* ops.cpu_release() */ /* ops.dequeue (in REST) may be nested inside DISPATCH */ SCX_KF_DISPATCH = 1 << 2, /* ops.dispatch() */ SCX_KF_ENQUEUE = 1 << 3, /* ops.enqueue() and ops.select_cpu() */ SCX_KF_SELECT_CPU = 1 << 4, /* ops.select_cpu() */ SCX_KF_REST = 1 << 5, /* other rq-locked operations */ - __SCX_KF_RQ_LOCKED = SCX_KF_DISPATCH | + __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, }; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index dfbfacca6722..956d8cfad948 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -114,6 +114,32 @@ struct scx_exit_task_args { bool cancelled; }; +enum scx_cpu_preempt_reason { + /* next task is being scheduled by &sched_class_rt */ + SCX_CPU_PREEMPT_RT, + /* next task is being scheduled by &sched_class_dl */ + SCX_CPU_PREEMPT_DL, + /* next task is being scheduled by &sched_class_stop */ + SCX_CPU_PREEMPT_STOP, + /* unknown reason for SCX being preempted */ + SCX_CPU_PREEMPT_UNKNOWN, +}; + +/* + * Argument container for ops->cpu_acquire(). Currently empty, but may be + * expanded in the future. + */ +struct scx_cpu_acquire_args {}; + +/* argument container for ops->cpu_release() */ +struct scx_cpu_release_args { + /* the reason the CPU was preempted */ + enum scx_cpu_preempt_reason reason; + + /* the task that's going to be scheduled on the CPU */ + struct task_struct *task; +}; + /* * Informational context provided to dump operations. */ @@ -339,6 +365,28 @@ struct sched_ext_ops { */ void (*update_idle)(s32 cpu, bool idle); + /** + * cpu_acquire - A CPU is becoming available to the BPF scheduler + * @cpu: The CPU being acquired by the BPF scheduler. + * @args: Acquire arguments, see the struct definition. + * + * A CPU that was previously released from the BPF scheduler is now once + * again under its control. + */ + void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args); + + /** + * cpu_release - A CPU is taken away from the BPF scheduler + * @cpu: The CPU being released by the BPF scheduler. + * @args: Release arguments, see the struct definition. + * + * The specified CPU is no longer under the control of the BPF + * scheduler. This could be because it was preempted by a higher + * priority sched_class, though there may be other reasons as well. The + * caller should consult @args->reason to determine the cause. + */ + void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args); + /** * init_task - Initialize a task to run in a BPF scheduler * @p: task to initialize for BPF scheduling @@ -491,6 +539,17 @@ enum scx_enq_flags { */ SCX_ENQ_PREEMPT = 1LLU << 32, + /* + * The task being enqueued was previously enqueued on the current CPU's + * %SCX_DSQ_LOCAL, but was removed from it in a call to the + * bpf_scx_reenqueue_local() kfunc. If bpf_scx_reenqueue_local() was + * invoked in a ->cpu_release() callback, and the task is again + * dispatched back to %SCX_LOCAL_DSQ by this current ->enqueue(), the + * task will not be scheduled on the CPU until at least the next invocation + * of the ->cpu_acquire() callback. + */ + SCX_ENQ_REENQ = 1LLU << 40, + /* * The task being enqueued is the only task available for the cpu. By * default, ext core keeps executing such tasks but when @@ -629,6 +688,7 @@ static bool scx_warned_zero_slice; static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last); static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting); +DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt); static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); struct static_key_false scx_has_op[SCX_OPI_END] = @@ -891,6 +951,12 @@ static __always_inline bool scx_kf_allowed(u32 mask) * inside ops.dispatch(). We don't need to check the SCX_KF_SLEEPABLE * boundary thanks to the above in_interrupt() check. */ + if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && + (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { + scx_ops_error("cpu_release kfunc called from a nested operation"); + return false; + } + if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { scx_ops_error("dispatch kfunc called from a nested operation"); @@ -2075,6 +2141,19 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, lockdep_assert_rq_held(rq); rq->scx.flags |= SCX_RQ_BALANCING; + if (static_branch_unlikely(&scx_ops_cpu_preempt) && + unlikely(rq->scx.cpu_released)) { + /* + * If the previous sched_class for the current CPU was not SCX, + * notify the BPF scheduler that it again has control of the + * core. This callback complements ->cpu_release(), which is + * emitted in scx_next_task_picked(). + */ + if (SCX_HAS_OP(cpu_acquire)) + SCX_CALL_OP(0, cpu_acquire, cpu_of(rq), NULL); + rq->scx.cpu_released = false; + } + if (prev_on_scx) { WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP); update_curr_scx(rq); @@ -2082,7 +2161,9 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, /* * If @prev is runnable & has slice left, it has priority and * fetching more just increases latency for the fetched tasks. - * Tell put_prev_task_scx() to put @prev on local_dsq. + * Tell put_prev_task_scx() to put @prev on local_dsq. If the + * BPF scheduler wants to handle this explicitly, it should + * implement ->cpu_released(). * * See scx_ops_disable_workfn() for the explanation on the * bypassing test. @@ -2302,6 +2383,20 @@ static struct task_struct *pick_next_task_scx(struct rq *rq) return p; } +static enum scx_cpu_preempt_reason +preempt_reason_from_class(const struct sched_class *class) +{ +#ifdef CONFIG_SMP + if (class == &stop_sched_class) + return SCX_CPU_PREEMPT_STOP; +#endif + if (class == &dl_sched_class) + return SCX_CPU_PREEMPT_DL; + if (class == &rt_sched_class) + return SCX_CPU_PREEMPT_RT; + return SCX_CPU_PREEMPT_UNKNOWN; +} + void scx_next_task_picked(struct rq *rq, struct task_struct *p, const struct sched_class *active) { @@ -2317,6 +2412,40 @@ void scx_next_task_picked(struct rq *rq, struct task_struct *p, */ smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); #endif + if (!static_branch_unlikely(&scx_ops_cpu_preempt)) + return; + + /* + * The callback is conceptually meant to convey that the CPU is no + * longer under the control of SCX. Therefore, don't invoke the + * callback if the CPU is is staying on SCX, or going idle (in which + * case the SCX scheduler has actively decided not to schedule any + * tasks on the CPU). + */ + if (likely(active >= &ext_sched_class)) + return; + + /* + * At this point we know that SCX was preempted by a higher priority + * sched_class, so invoke the ->cpu_release() callback if we have not + * done so already. We only send the callback once between SCX being + * preempted, and it regaining control of the CPU. + * + * ->cpu_release() complements ->cpu_acquire(), which is emitted the + * next time that balance_scx() is invoked. + */ + if (!rq->scx.cpu_released) { + if (SCX_HAS_OP(cpu_release)) { + struct scx_cpu_release_args args = { + .reason = preempt_reason_from_class(active), + .task = p, + }; + + SCX_CALL_OP(SCX_KF_CPU_RELEASE, + cpu_release, cpu_of(rq), &args); + } + rq->scx.cpu_released = true; + } } #ifdef CONFIG_SMP @@ -3403,6 +3532,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work) static_branch_disable_cpuslocked(&scx_has_op[i]); static_branch_disable_cpuslocked(&scx_ops_enq_last); static_branch_disable_cpuslocked(&scx_ops_enq_exiting); + static_branch_disable_cpuslocked(&scx_ops_cpu_preempt); static_branch_disable_cpuslocked(&scx_builtin_idle_enabled); synchronize_rcu(); @@ -3704,9 +3834,10 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) seq_buf_init(&ns, buf, avail); dump_newline(&ns); - dump_line(&ns, "CPU %-4d: nr_run=%u flags=0x%x ops_qseq=%lu pnt_seq=%lu", + dump_line(&ns, "CPU %-4d: nr_run=%u flags=0x%x cpu_rel=%d ops_qseq=%lu pnt_seq=%lu", cpu, rq->scx.nr_running, rq->scx.flags, - rq->scx.ops_qseq, rq->scx.pnt_seq); + rq->scx.cpu_released, rq->scx.ops_qseq, + rq->scx.pnt_seq); dump_line(&ns, " curr=%s[%d] class=%ps", rq->curr->comm, rq->curr->pid, rq->curr->sched_class); @@ -3947,6 +4078,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops) if (ops->flags & SCX_OPS_ENQ_EXITING) static_branch_enable_cpuslocked(&scx_ops_enq_exiting); + if (scx_ops.cpu_acquire || scx_ops.cpu_release) + static_branch_enable_cpuslocked(&scx_ops_cpu_preempt); if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) { reset_idle_masks(); @@ -4323,6 +4456,8 @@ static bool yield_stub(struct task_struct *from, struct task_struct *to) { retur static void set_weight_stub(struct task_struct *p, u32 weight) {} static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {} static void update_idle_stub(s32 cpu, bool idle) {} +static void cpu_acquire_stub(s32 cpu, struct scx_cpu_acquire_args *args) {} +static void cpu_release_stub(s32 cpu, struct scx_cpu_release_args *args) {} static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args) { return -EINVAL; } static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {} static void enable_stub(struct task_struct *p) {} @@ -4343,6 +4478,8 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .set_weight = set_weight_stub, .set_cpumask = set_cpumask_stub, .update_idle = update_idle_stub, + .cpu_acquire = cpu_acquire_stub, + .cpu_release = cpu_release_stub, .init_task = init_task_stub, .exit_task = exit_task_stub, .enable = enable_stub, @@ -4875,6 +5012,59 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { __bpf_kfunc_start_defs(); +/** + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ + * + * Iterate over all of the tasks currently enqueued on the local DSQ of the + * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of + * processed tasks. Can only be called from ops.cpu_release(). + */ +__bpf_kfunc u32 scx_bpf_reenqueue_local(void) +{ + u32 nr_enqueued, i; + struct rq *rq; + + if (!scx_kf_allowed(SCX_KF_CPU_RELEASE)) + return 0; + + rq = cpu_rq(smp_processor_id()); + lockdep_assert_rq_held(rq); + + /* + * Get the number of tasks on the local DSQ before iterating over it to + * pull off tasks. The enqueue callback below can signal that it wants + * the task to stay on the local DSQ, and we want to prevent the BPF + * scheduler from causing us to loop indefinitely. + */ + nr_enqueued = rq->scx.local_dsq.nr; + for (i = 0; i < nr_enqueued; i++) { + struct task_struct *p; + + p = first_local_task(rq); + WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != + SCX_OPSS_NONE); + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED)); + WARN_ON_ONCE(p->scx.holding_cpu != -1); + dispatch_dequeue(rq, p); + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); + } + + return nr_enqueued; +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(scx_kfunc_ids_cpu_release) +BTF_ID_FLAGS(func, scx_bpf_reenqueue_local) +BTF_KFUNCS_END(scx_kfunc_ids_cpu_release) + +static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_cpu_release, +}; + +__bpf_kfunc_start_defs(); + /** * scx_bpf_kick_cpu - Trigger reschedule on a CPU * @cpu: cpu to kick @@ -5384,6 +5574,8 @@ static int __init scx_init(void) &scx_kfunc_set_enqueue_dispatch)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_dispatch)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &scx_kfunc_set_cpu_release)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_any)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 0aeb1fda1794..4ebd1c2478f1 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -24,6 +24,8 @@ DECLARE_STATIC_KEY_FALSE(__scx_switched_all); #define scx_enabled() static_branch_unlikely(&__scx_ops_enabled) #define scx_switched_all() static_branch_unlikely(&__scx_switched_all) +DECLARE_STATIC_KEY_FALSE(scx_ops_cpu_preempt); + static inline bool task_on_scx(const struct task_struct *p) { return scx_enabled() && p->sched_class == &ext_sched_class; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4f0a62b78f7c..0f9656c202ed 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -837,6 +837,7 @@ struct scx_rq { u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; u32 flags; + bool cpu_released; cpumask_var_t cpus_to_kick; cpumask_var_t cpus_to_kick_if_idle; cpumask_var_t cpus_to_preempt; diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 421118bc56ff..8686f84497db 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -34,6 +34,7 @@ void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flag u32 scx_bpf_dispatch_nr_slots(void) __ksym; void scx_bpf_dispatch_cancel(void) __ksym; bool scx_bpf_consume(u64 dsq_id) __ksym; +u32 scx_bpf_reenqueue_local(void) __ksym; void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym; s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym; void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 879fc9c788e5..4a87377558c8 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -11,6 +11,8 @@ * * - BPF-side queueing using PIDs. * - Sleepable per-task storage allocation using ops.prep_enable(). + * - Using ops.cpu_release() to handle a higher priority scheduling class taking + * the CPU away. * * This scheduler is primarily for demonstration and testing of sched_ext * features and unlikely to be useful for actual workloads. @@ -90,7 +92,7 @@ struct { } cpu_ctx_stor SEC(".maps"); /* Statistics */ -u64 nr_enqueued, nr_dispatched, nr_dequeued; +u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued; s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) @@ -164,6 +166,22 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) return; } + /* + * If the task was re-enqueued due to the CPU being preempted by a + * higher priority scheduling class, just re-enqueue the task directly + * on the global DSQ. As we want another CPU to pick it up, find and + * kick an idle CPU. + */ + if (enq_flags & SCX_ENQ_REENQ) { + s32 cpu; + + scx_bpf_dispatch(p, SHARED_DSQ, 0, enq_flags); + cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); + if (cpu >= 0) + scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE); + return; + } + ring = bpf_map_lookup_elem(&queue_arr, &idx); if (!ring) { scx_bpf_error("failed to find ring %d", idx); @@ -257,6 +275,22 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) } } +void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args) +{ + u32 cnt; + + /* + * Called when @cpu is taken by a higher priority scheduling class. This + * makes @cpu no longer available for executing sched_ext tasks. As we + * don't want the tasks in @cpu's local dsq to sit there until @cpu + * becomes available again, re-enqueue them into the global dsq. See + * %SCX_ENQ_REENQ handling in qmap_enqueue(). + */ + cnt = scx_bpf_reenqueue_local(); + if (cnt) + __sync_fetch_and_add(&nr_reenqueued, cnt); +} + s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p, struct scx_init_task_args *args) { @@ -339,6 +373,7 @@ SCX_OPS_DEFINE(qmap_ops, .enqueue = (void *)qmap_enqueue, .dequeue = (void *)qmap_dequeue, .dispatch = (void *)qmap_dispatch, + .cpu_release = (void *)qmap_cpu_release, .init_task = (void *)qmap_init_task, .dump = (void *)qmap_dump, .dump_cpu = (void *)qmap_dump_cpu, diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index 594147a710a8..2a97421afe9a 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -112,9 +112,9 @@ int main(int argc, char **argv) long nr_enqueued = skel->bss->nr_enqueued; long nr_dispatched = skel->bss->nr_dispatched; - printf("stats : enq=%lu dsp=%lu delta=%ld deq=%"PRIu64"\n", + printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64"\n", nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, - skel->bss->nr_dequeued); + skel->bss->nr_reenqueued, skel->bss->nr_dequeued); fflush(stdout); sleep(1); } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 60c27fb59f6cffa73fc8c60e3a22323c78044576 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Add ops.cpu_online/offline() which are invoked when CPUs come online and offline respectively. As the enqueue path already automatically bypasses tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed to see tasks only on CPUs which are between online() and offline(). If the BPF scheduler doesn't implement ops.cpu_online/offline(), the scheduler is automatically exited with SCX_ECODE_RESTART | SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotpplug support trivially by simply reinitializing and reloading the scheduler. scx_qmap is updated to print out online CPUs on hotplug events. Other schedulers are updated to restart based on ecode. v3: - The previous implementation added @reason to sched_class.rq_on/offline() to distinguish between CPU hotplug events and topology updates. This was buggy and fragile as the methods are skipped if the current state equals the target state. Instead, add scx_rq_[de]activate() which are directly called from sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to sleep which can be useful. - ops.dispatch() could be called on a CPU that the BPF scheduler was told to be offline. The dispatch patch is updated to bypass in such cases. v2: - To accommodate lock ordering change between scx_cgroup_rwsem and cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI block and enabled eariler during scx_ope_enable() so that cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem. - Auto exit with ECODE added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Conflicts: kernel/sched/core.c [Only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 4 + kernel/sched/ext.c | 156 ++++++++++++++++++- kernel/sched/ext.h | 4 + kernel/sched/sched.h | 6 + tools/sched_ext/include/scx/compat.h | 26 ++++ tools/sched_ext/include/scx/user_exit_info.h | 28 ++++ tools/sched_ext/scx_central.c | 9 +- tools/sched_ext/scx_qmap.bpf.c | 57 +++++++ tools/sched_ext/scx_qmap.c | 4 + tools/sched_ext/scx_simple.c | 8 +- 10 files changed, 290 insertions(+), 12 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d2f00496fa26..c19dc2891b98 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8003,6 +8003,8 @@ int sched_cpu_activate(unsigned int cpu) cpuset_cpu_active(); } + scx_rq_activate(rq); + /* * Put the rq online, if not already. This happens: * @@ -8063,6 +8065,8 @@ int sched_cpu_deactivate(unsigned int cpu) } rq_unlock_irqrestore(rq, &rf); + scx_rq_deactivate(rq); + /* * When going down, decrement the number of cores with SMT present. */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 956d8cfad948..f1d1cac9b4f2 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -34,6 +34,29 @@ enum scx_exit_kind { SCX_EXIT_ERROR_STALL, /* watchdog detected stalled runnable tasks */ }; +/* + * An exit code can be specified when exiting with scx_bpf_exit() or + * scx_ops_exit(), corresponding to exit_kind UNREG_BPF and UNREG_KERN + * respectively. The codes are 64bit of the format: + * + * Bits: [63 .. 48 47 .. 32 31 .. 0] + * [ SYS ACT ] [ SYS RSN ] [ USR ] + * + * SYS ACT: System-defined exit actions + * SYS RSN: System-defined exit reasons + * USR : User-defined exit codes and reasons + * + * Using the above, users may communicate intention and context by ORing system + * actions and/or system reasons with a user-defined exit code. + */ +enum scx_exit_code { + /* Reasons */ + SCX_ECODE_RSN_HOTPLUG = 1LLU << 32, + + /* Actions */ + SCX_ECODE_ACT_RESTART = 1LLU << 48, +}; + /* * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is * being disabled. @@ -461,7 +484,29 @@ struct sched_ext_ops { void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p); /* - * All online ops must come before ops.init(). + * All online ops must come before ops.cpu_online(). + */ + + /** + * cpu_online - A CPU became online + * @cpu: CPU which just came up + * + * @cpu just came online. @cpu will not call ops.enqueue() or + * ops.dispatch(), nor run tasks associated with other CPUs beforehand. + */ + void (*cpu_online)(s32 cpu); + + /** + * cpu_offline - A CPU is going offline + * @cpu: CPU which is going offline + * + * @cpu is going offline. @cpu will not call ops.enqueue() or + * ops.dispatch(), nor run tasks associated with other CPUs afterwards. + */ + void (*cpu_offline)(s32 cpu); + + /* + * All CPU hotplug ops must come before ops.init(). */ /** @@ -500,6 +545,15 @@ struct sched_ext_ops { */ u32 exit_dump_len; + /** + * hotplug_seq - A sequence number that may be set by the scheduler to + * detect when a hotplug event has occurred during the loading process. + * If 0, no detection occurs. Otherwise, the scheduler will fail to + * load if the sequence number does not match @scx_hotplug_seq on the + * enable path. + */ + u64 hotplug_seq; + /** * name - BPF scheduler's name * @@ -513,7 +567,9 @@ struct sched_ext_ops { enum scx_opi { SCX_OPI_BEGIN = 0, SCX_OPI_NORMAL_BEGIN = 0, - SCX_OPI_NORMAL_END = SCX_OP_IDX(init), + SCX_OPI_NORMAL_END = SCX_OP_IDX(cpu_online), + SCX_OPI_CPU_HOTPLUG_BEGIN = SCX_OP_IDX(cpu_online), + SCX_OPI_CPU_HOTPLUG_END = SCX_OP_IDX(init), SCX_OPI_END = SCX_OP_IDX(init), }; @@ -698,6 +754,7 @@ static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE); static struct scx_exit_info *scx_exit_info; static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0); +static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0); /* * The maximum amount of time in jiffies that a task may be runnable without @@ -1423,11 +1480,7 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags) static bool scx_rq_online(struct rq *rq) { -#ifdef CONFIG_SMP - return likely(rq->online); -#else - return true; -#endif + return likely(rq->scx.flags & SCX_RQ_ONLINE); } static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, @@ -1442,6 +1495,11 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, if (sticky_cpu == cpu_of(rq)) goto local_norefill; + /* + * If !scx_rq_online(), we already told the BPF scheduler that the CPU + * is offline and are just running the hotplug path. Don't bother the + * BPF scheduler. + */ if (!scx_rq_online(rq)) goto local; @@ -2678,6 +2736,42 @@ void __scx_update_idle(struct rq *rq, bool idle) #endif } +static void handle_hotplug(struct rq *rq, bool online) +{ + int cpu = cpu_of(rq); + + atomic_long_inc(&scx_hotplug_seq); + + if (online && SCX_HAS_OP(cpu_online)) + SCX_CALL_OP(SCX_KF_SLEEPABLE, cpu_online, cpu); + else if (!online && SCX_HAS_OP(cpu_offline)) + SCX_CALL_OP(SCX_KF_SLEEPABLE, cpu_offline, cpu); + else + scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, + "cpu %d going %s, exiting scheduler", cpu, + online ? "online" : "offline"); +} + +void scx_rq_activate(struct rq *rq) +{ + handle_hotplug(rq, true); +} + +void scx_rq_deactivate(struct rq *rq) +{ + handle_hotplug(rq, false); +} + +static void rq_online_scx(struct rq *rq) +{ + rq->scx.flags |= SCX_RQ_ONLINE; +} + +static void rq_offline_scx(struct rq *rq) +{ + rq->scx.flags &= ~SCX_RQ_ONLINE; +} + #else /* CONFIG_SMP */ static bool test_and_clear_cpu_idle(int cpu) { return false; } @@ -3109,6 +3203,9 @@ DEFINE_SCHED_CLASS(ext) = { .balance = balance_scx, .select_task_rq = select_task_rq_scx, .set_cpus_allowed = set_cpus_allowed_scx, + + .rq_online = rq_online_scx, + .rq_offline = rq_offline_scx, #endif .task_tick = task_tick_scx, @@ -3240,10 +3337,18 @@ static ssize_t scx_attr_nr_rejected_show(struct kobject *kobj, } SCX_ATTR(nr_rejected); +static ssize_t scx_attr_hotplug_seq_show(struct kobject *kobj, + struct kobj_attribute *ka, char *buf) +{ + return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_hotplug_seq)); +} +SCX_ATTR(hotplug_seq); + static struct attribute *scx_global_attrs[] = { &scx_attr_state.attr, &scx_attr_switch_all.attr, &scx_attr_nr_rejected.attr, + &scx_attr_hotplug_seq.attr, NULL, }; @@ -3946,6 +4051,25 @@ static struct kthread_worker *scx_create_rt_helper(const char *name) return helper; } +static void check_hotplug_seq(const struct sched_ext_ops *ops) +{ + unsigned long long global_hotplug_seq; + + /* + * If a hotplug event has occurred between when a scheduler was + * initialized, and when we were able to attach, exit and notify user + * space about it. + */ + if (ops->hotplug_seq) { + global_hotplug_seq = atomic_long_read(&scx_hotplug_seq); + if (ops->hotplug_seq != global_hotplug_seq) { + scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, + "expected hotplug seq %llu did not match actual %llu", + ops->hotplug_seq, global_hotplug_seq); + } + } +} + static int validate_ops(const struct sched_ext_ops *ops) { /* @@ -4028,6 +4152,10 @@ static int scx_ops_enable(struct sched_ext_ops *ops) } } + for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++) + if (((void (**)(void))ops)[i]) + static_branch_enable_cpuslocked(&scx_has_op[i]); + cpus_read_unlock(); ret = validate_ops(ops); @@ -4069,6 +4197,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops) percpu_down_write(&scx_fork_rwsem); cpus_read_lock(); + check_hotplug_seq(ops); + for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++) if (((void (**)(void))ops)[i]) static_branch_enable_cpuslocked(&scx_has_op[i]); @@ -4379,6 +4509,9 @@ static int bpf_scx_init_member(const struct btf_type *t, ops->exit_dump_len = *(u32 *)(udata + moff) ?: SCX_EXIT_DUMP_DFL_LEN; return 1; + case offsetof(struct sched_ext_ops, hotplug_seq): + ops->hotplug_seq = *(u64 *)(udata + moff); + return 1; } return 0; @@ -4392,6 +4525,8 @@ static int bpf_scx_check_member(const struct btf_type *t, switch (moff) { case offsetof(struct sched_ext_ops, init_task): + case offsetof(struct sched_ext_ops, cpu_online): + case offsetof(struct sched_ext_ops, cpu_offline): case offsetof(struct sched_ext_ops, init): case offsetof(struct sched_ext_ops, exit): break; @@ -4462,6 +4597,8 @@ static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {} static void enable_stub(struct task_struct *p) {} static void disable_stub(struct task_struct *p) {} +static void cpu_online_stub(s32 cpu) {} +static void cpu_offline_stub(s32 cpu) {} static s32 init_stub(void) { return -EINVAL; } static void exit_stub(struct scx_exit_info *info) {} @@ -4484,6 +4621,8 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .exit_task = exit_task_stub, .enable = enable_stub, .disable = disable_stub, + .cpu_online = cpu_online_stub, + .cpu_offline = cpu_offline_stub, .init = init_stub, .exit = exit_stub, }; @@ -4724,6 +4863,9 @@ void __init init_sched_ext_class(void) BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL)); init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn); + + if (cpu_online(cpu)) + cpu_rq(cpu)->scx.flags |= SCX_RQ_ONLINE; } register_sysrq_key('S', &sysrq_sched_ext_reset_op); diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 4ebd1c2478f1..037f9acdf443 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -40,6 +40,8 @@ int scx_fork(struct task_struct *p); void scx_post_fork(struct task_struct *p); void scx_cancel_fork(struct task_struct *p); bool scx_can_stop_tick(struct rq *rq); +void scx_rq_activate(struct rq *rq); +void scx_rq_deactivate(struct rq *rq); int scx_check_setscheduler(struct task_struct *p, int policy); bool task_should_scx(struct task_struct *p); void init_sched_ext_class(void); @@ -81,6 +83,8 @@ static inline int scx_fork(struct task_struct *p) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} static inline void scx_cancel_fork(struct task_struct *p) {} static inline bool scx_can_stop_tick(struct rq *rq) { return true; } +static inline void scx_rq_activate(struct rq *rq) {} +static inline void scx_rq_deactivate(struct rq *rq) {} static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; } static inline bool task_on_scx(const struct task_struct *p) { return false; } static inline void init_sched_ext_class(void) {} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0f9656c202ed..ea4f78c8b4d9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -826,6 +826,12 @@ struct cfs_rq { #ifdef CONFIG_SCHED_CLASS_EXT /* scx_rq->flags, protected by the rq lock */ enum scx_rq_flags { + /* + * A hotplugged CPU starts scheduling before rq_online_scx(). Track + * ops.cpu_on/offline() state so that ops.enqueue/dispatch() are called + * only while the BPF scheduler considers the CPU to be online. + */ + SCX_RQ_ONLINE = 1 << 0, SCX_RQ_BALANCING = 1 << 1, SCX_RQ_CAN_STOP_TICK = 1 << 2, }; diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h index c58024c980c8..cc56ff9aa252 100644 --- a/tools/sched_ext/include/scx/compat.h +++ b/tools/sched_ext/include/scx/compat.h @@ -8,6 +8,9 @@ #define __SCX_COMPAT_H #include <bpf/btf.h> +#include <fcntl.h> +#include <stdlib.h> +#include <unistd.h> struct btf *__COMPAT_vmlinux_btf __attribute__((weak)); @@ -106,6 +109,28 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field #define SCX_OPS_SWITCH_PARTIAL \ __COMPAT_ENUM_OR_ZERO("scx_ops_flags", "SCX_OPS_SWITCH_PARTIAL") +static inline long scx_hotplug_seq(void) +{ + int fd; + char buf[32]; + ssize_t len; + long val; + + fd = open("/sys/kernel/sched_ext/hotplug_seq", O_RDONLY); + if (fd < 0) + return -ENOENT; + + len = read(fd, buf, sizeof(buf) - 1); + SCX_BUG_ON(len <= 0, "read failed (%ld)", len); + buf[len] = 0; + close(fd); + + val = strtoul(buf, NULL, 10); + SCX_BUG_ON(val < 0, "invalid num hotplug events: %lu", val); + + return val; +} + /* * struct sched_ext_ops can change over time. If compat.bpf.h::SCX_OPS_DEFINE() * is used to define ops and compat.h::SCX_OPS_LOAD/ATTACH() are used to load @@ -123,6 +148,7 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field \ __skel = __scx_name##__open(); \ SCX_BUG_ON(!__skel, "Could not open " #__scx_name); \ + __skel->struct_ops.__ops_name->hotplug_seq = scx_hotplug_seq(); \ __skel; \ }) diff --git a/tools/sched_ext/include/scx/user_exit_info.h b/tools/sched_ext/include/scx/user_exit_info.h index c2ef85c645e1..891693ee604e 100644 --- a/tools/sched_ext/include/scx/user_exit_info.h +++ b/tools/sched_ext/include/scx/user_exit_info.h @@ -77,7 +77,35 @@ struct user_exit_info { if (__uei->msg[0] != '\0') \ fprintf(stderr, " (%s)", __uei->msg); \ fputs("\n", stderr); \ + __uei->exit_code; \ }) +/* + * We can't import vmlinux.h while compiling user C code. Let's duplicate + * scx_exit_code definition. + */ +enum scx_exit_code { + /* Reasons */ + SCX_ECODE_RSN_HOTPLUG = 1LLU << 32, + + /* Actions */ + SCX_ECODE_ACT_RESTART = 1LLU << 48, +}; + +enum uei_ecode_mask { + UEI_ECODE_USER_MASK = ((1LLU << 32) - 1), + UEI_ECODE_SYS_RSN_MASK = ((1LLU << 16) - 1) << 32, + UEI_ECODE_SYS_ACT_MASK = ((1LLU << 16) - 1) << 48, +}; + +/* + * These macro interpret the ecode returned from UEI_REPORT(). + */ +#define UEI_ECODE_USER(__ecode) ((__ecode) & UEI_ECODE_USER_MASK) +#define UEI_ECODE_SYS_RSN(__ecode) ((__ecode) & UEI_ECODE_SYS_RSN_MASK) +#define UEI_ECODE_SYS_ACT(__ecode) ((__ecode) & UEI_ECODE_SYS_ACT_MASK) + +#define UEI_ECODE_RESTART(__ecode) (UEI_ECODE_SYS_ACT((__ecode)) == SCX_ECODE_ACT_RESTART) + #endif /* __bpf__ */ #endif /* __USER_EXIT_INFO_H */ diff --git a/tools/sched_ext/scx_central.c b/tools/sched_ext/scx_central.c index fb3f50886552..21deea320bd7 100644 --- a/tools/sched_ext/scx_central.c +++ b/tools/sched_ext/scx_central.c @@ -46,14 +46,14 @@ int main(int argc, char **argv) { struct scx_central *skel; struct bpf_link *link; - __u64 seq = 0; + __u64 seq = 0, ecode; __s32 opt; cpu_set_t *cpuset; libbpf_set_print(libbpf_print_fn); signal(SIGINT, sigint_handler); signal(SIGTERM, sigint_handler); - +restart: skel = SCX_OPS_OPEN(central_ops, scx_central); skel->rodata->central_cpu = 0; @@ -126,7 +126,10 @@ int main(int argc, char **argv) } bpf_link__destroy(link); - UEI_REPORT(skel, uei); + ecode = UEI_REPORT(skel, uei); scx_central__destroy(skel); + + if (UEI_ECODE_RESTART(ecode)) + goto restart; return 0; } diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 4a87377558c8..619078355bf5 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -358,8 +358,63 @@ void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struc taskc->force_local); } +/* + * Print out the online and possible CPU map using bpf_printk() as a + * demonstration of using the cpumask kfuncs and ops.cpu_on/offline(). + */ +static void print_cpus(void) +{ + const struct cpumask *possible, *online; + s32 cpu; + char buf[128] = "", *p; + int idx; + + possible = scx_bpf_get_possible_cpumask(); + online = scx_bpf_get_online_cpumask(); + + idx = 0; + bpf_for(cpu, 0, scx_bpf_nr_cpu_ids()) { + if (!(p = MEMBER_VPTR(buf, [idx++]))) + break; + if (bpf_cpumask_test_cpu(cpu, online)) + *p++ = 'O'; + else if (bpf_cpumask_test_cpu(cpu, possible)) + *p++ = 'X'; + else + *p++ = ' '; + + if ((cpu & 7) == 7) { + if (!(p = MEMBER_VPTR(buf, [idx++]))) + break; + *p++ = '|'; + } + } + buf[sizeof(buf) - 1] = '\0'; + + scx_bpf_put_cpumask(online); + scx_bpf_put_cpumask(possible); + + bpf_printk("CPUS: |%s", buf); +} + +void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu) +{ + bpf_printk("CPU %d coming online", cpu); + /* @cpu is already online at this point */ + print_cpus(); +} + +void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu) +{ + bpf_printk("CPU %d going offline", cpu); + /* @cpu is still online at this point */ + print_cpus(); +} + s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) { + print_cpus(); + return scx_bpf_create_dsq(SHARED_DSQ, -1); } @@ -378,6 +433,8 @@ SCX_OPS_DEFINE(qmap_ops, .dump = (void *)qmap_dump, .dump_cpu = (void *)qmap_dump_cpu, .dump_task = (void *)qmap_dump_task, + .cpu_online = (void *)qmap_cpu_online, + .cpu_offline = (void *)qmap_cpu_offline, .init = (void *)qmap_init, .exit = (void *)qmap_exit, .timeout_ms = 5000U, diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index 2a97421afe9a..920fb54f9c77 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -122,5 +122,9 @@ int main(int argc, char **argv) bpf_link__destroy(link); UEI_REPORT(skel, uei); scx_qmap__destroy(skel); + /* + * scx_qmap implements ops.cpu_on/offline() and doesn't need to restart + * on CPU hotplug events. + */ return 0; } diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c index 7f500d1d56ac..bead482e1383 100644 --- a/tools/sched_ext/scx_simple.c +++ b/tools/sched_ext/scx_simple.c @@ -62,11 +62,12 @@ int main(int argc, char **argv) struct scx_simple *skel; struct bpf_link *link; __u32 opt; + __u64 ecode; libbpf_set_print(libbpf_print_fn); signal(SIGINT, sigint_handler); signal(SIGTERM, sigint_handler); - +restart: skel = SCX_OPS_OPEN(simple_ops, scx_simple); while ((opt = getopt(argc, argv, "vh")) != -1) { @@ -93,7 +94,10 @@ int main(int argc, char **argv) } bpf_link__destroy(link); - UEI_REPORT(skel, uei); + ecode = UEI_REPORT(skel, uei); scx_simple__destroy(skel); + + if (UEI_ECODE_RESTART(ecode)) + goto restart; return 0; } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 0fd55582ed5b55998ea7df371e25ab40e5c098c3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- PM operations freeze userspace. Some BPF schedulers have active userspace component and may misbehave as expected across PM events. While the system is frozen, nothing too interesting is happening in terms of scheduling and we can get by just fine with the fallback FIFO behavior. Let's make things easier by always bypassing the BPF scheduler while PM events are in progress. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index f1d1cac9b4f2..7df977cb9101 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4830,6 +4830,34 @@ void print_scx_info(const char *log_lvl, struct task_struct *p) runnable_at_buf); } +static int scx_pm_handler(struct notifier_block *nb, unsigned long event, void *ptr) +{ + /* + * SCX schedulers often have userspace components which are sometimes + * involved in critial scheduling paths. PM operations involve freezing + * userspace which can lead to scheduling misbehaviors including stalls. + * Let's bypass while PM operations are in progress. + */ + switch (event) { + case PM_HIBERNATION_PREPARE: + case PM_SUSPEND_PREPARE: + case PM_RESTORE_PREPARE: + scx_ops_bypass(true); + break; + case PM_POST_HIBERNATION: + case PM_POST_SUSPEND: + case PM_POST_RESTORE: + scx_ops_bypass(false); + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block scx_pm_notifier = { + .notifier_call = scx_pm_handler, +}; + void __init init_sched_ext_class(void) { s32 cpu, v; @@ -5734,6 +5762,12 @@ static int __init scx_init(void) return ret; } + ret = register_pm_notifier(&scx_pm_notifier); + if (ret) { + pr_err("sched_ext: Failed to register PM notifier (%d)\n", ret); + return ret; + } + scx_kset = kset_create_and_add("sched_ext", &scx_uevent_ops, kernel_kobj); if (!scx_kset) { pr_err("sched_ext: Failed to create /sys/kernel/sched_ext\n"); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 7b0888b7cc1924b74ce660e02f6211df8dd46e7b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The core-sched support is composed of the following parts: - task_struct->scx.core_sched_at is added. This is a timestamp which can be used to order tasks. Depending on whether the BPF scheduler implements custom ordering, it tracks either global FIFO ordering of all tasks or local-DSQ ordering within the dispatched tasks on a CPU. - prio_less() is updated to call scx_prio_less() when comparing SCX tasks. scx_prio_less() calls ops.core_sched_before() if available or uses the core_sched_at timestamp. For global FIFO ordering, the BPF scheduler doesn't need to do anything. Otherwise, it should implement ops.core_sched_before() which reflects the ordering. - When core-sched is enabled, balance_scx() balances all SMT siblings so that they all have tasks dispatched if necessary before pick_task_scx() is called. pick_task_scx() picks between the current task and the first dispatched task on the local DSQ based on availability and the core_sched_at timestamps. Note that FIFO ordering is expected among the already dispatched tasks whether running or on the local DSQ, so this path always compares core_sched_at instead of calling into ops.core_sched_before(). qmap_core_sched_before() is added to scx_qmap. It scales the distances from the heads of the queues to compare the tasks across different priority queues and seems to behave as expected. v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi. v2: Sched core added the const qualifiers to prio_less task arguments. Explicitly drop them for ops.core_sched_before() task arguments. BPF enforces access control through the verifier, so the qualifier isn't actually operative and only gets in the way when interacting with various helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Reviewed-by: Josh Don <joshdon@google.com> Cc: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 3 + kernel/Kconfig.preempt | 2 +- kernel/sched/core.c | 10 +- kernel/sched/ext.c | 250 +++++++++++++++++++++++++++++++-- kernel/sched/ext.h | 5 + tools/sched_ext/scx_qmap.bpf.c | 91 +++++++++++- tools/sched_ext/scx_qmap.c | 5 +- 7 files changed, 346 insertions(+), 20 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 21c627337e01..3db7b32b2d1d 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -129,6 +129,9 @@ struct sched_ext_entity { struct list_head runnable_node; /* rq->scx.runnable_list */ unsigned long runnable_at; +#ifdef CONFIG_SCHED_CORE + u64 core_sched_at; /* see scx_prio_less() */ +#endif u64 ddsp_dsq_id; u64 ddsp_enq_flags; diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 1ac534c1940e..4c97595dd0bb 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -135,7 +135,7 @@ config SCHED_CORE config SCHED_CLASS_EXT bool "Extensible Scheduling Class" - depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE + depends on BPF_SYSCALL && BPF_JIT help This option enables a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c19dc2891b98..c0b0d2de8a9d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -203,7 +203,10 @@ static inline int __task_prio(const struct task_struct *p) if (p->sched_class == &idle_sched_class) return MAX_RT_PRIO + NICE_WIDTH; /* 140 */ - return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */ + if (task_on_scx(p)) + return MAX_RT_PRIO + MAX_NICE + 1; /* 120, squash ext */ + + return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */ } /* @@ -232,6 +235,11 @@ static inline bool prio_less(const struct task_struct *a, if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */ return cfs_prio_less(a, b, in_fi); +#ifdef CONFIG_SCHED_CLASS_EXT + if (pa == MAX_RT_PRIO + MAX_NICE + 1) /* ext */ + return scx_prio_less(a, b, in_fi); +#endif + return false; } diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 7df977cb9101..86f15b1f356c 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -348,6 +348,24 @@ struct sched_ext_ops { */ bool (*yield)(struct task_struct *from, struct task_struct *to); + /** + * core_sched_before - Task ordering for core-sched + * @a: task A + * @b: task B + * + * Used by core-sched to determine the ordering between two tasks. See + * Documentation/admin-guide/hw-vuln/core-scheduling.rst for details on + * core-sched. + * + * Both @a and @b are runnable and may or may not currently be queued on + * the BPF scheduler. Should return %true if @a should run before @b. + * %false if there's no required ordering or @b should run before @a. + * + * If not specified, the default is ordering them according to when they + * became runnable. + */ + bool (*core_sched_before)(struct task_struct *a, struct task_struct *b); + /** * set_weight - Set task weight * @p: task to set weight for @@ -629,6 +647,14 @@ enum scx_enq_flags { enum scx_deq_flags { /* expose select DEQUEUE_* flags as enums */ SCX_DEQ_SLEEP = DEQUEUE_SLEEP, + + /* high 32bits are SCX specific */ + + /* + * The generic core-sched layer decided to execute the task even though + * it hasn't been dispatched yet. Dequeue from the BPF side. + */ + SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, }; enum scx_pick_idle_cpu_flags { @@ -1264,6 +1290,49 @@ static int ops_sanitize_err(const char *ops_name, s32 err) return -EPROTO; } +/** + * touch_core_sched - Update timestamp used for core-sched task ordering + * @rq: rq to read clock from, must be locked + * @p: task to update the timestamp for + * + * Update @p->scx.core_sched_at timestamp. This is used by scx_prio_less() to + * implement global or local-DSQ FIFO ordering for core-sched. Should be called + * when a task becomes runnable and its turn on the CPU ends (e.g. slice + * exhaustion). + */ +static void touch_core_sched(struct rq *rq, struct task_struct *p) +{ +#ifdef CONFIG_SCHED_CORE + /* + * It's okay to update the timestamp spuriously. Use + * sched_core_disabled() which is cheaper than enabled(). + */ + if (!sched_core_disabled()) + p->scx.core_sched_at = rq_clock_task(rq); +#endif +} + +/** + * touch_core_sched_dispatch - Update core-sched timestamp on dispatch + * @rq: rq to read clock from, must be locked + * @p: task being dispatched + * + * If the BPF scheduler implements custom core-sched ordering via + * ops.core_sched_before(), @p->scx.core_sched_at is used to implement FIFO + * ordering within each local DSQ. This function is called from dispatch paths + * and updates @p->scx.core_sched_at if custom core-sched ordering is in effect. + */ +static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p) +{ + lockdep_assert_rq_held(rq); + assert_clock_updated(rq); + +#ifdef CONFIG_SCHED_CORE + if (SCX_HAS_OP(core_sched_before)) + touch_core_sched(rq, p); +#endif +} + static void update_curr_scx(struct rq *rq) { struct task_struct *curr = rq->curr; @@ -1279,8 +1348,11 @@ static void update_curr_scx(struct rq *rq) account_group_exec_runtime(curr, delta_exec); cgroup_account_cputime(curr, delta_exec); - if (curr->scx.slice != SCX_SLICE_INF) + if (curr->scx.slice != SCX_SLICE_INF) { curr->scx.slice -= min(curr->scx.slice, delta_exec); + if (!curr->scx.slice) + touch_core_sched(rq, curr); + } } static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) @@ -1473,6 +1545,8 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags) { struct scx_dispatch_q *dsq; + touch_core_sched_dispatch(task_rq(p), p); + enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p); dispatch_enqueue(dsq, p, enq_flags); @@ -1554,12 +1628,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, return; local: + /* + * For task-ordering, slice refill must be treated as implying the end + * of the current slice. Otherwise, the longer @p stays on the CPU, the + * higher priority it becomes from scx_prio_less()'s POV. + */ + touch_core_sched(rq, p); p->scx.slice = SCX_SLICE_DFL; local_norefill: dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags); return; global: + touch_core_sched(rq, p); /* see the comment in local: */ p->scx.slice = SCX_SLICE_DFL; dispatch_enqueue(&scx_dsq_global, p, enq_flags); } @@ -1623,6 +1704,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags if (SCX_HAS_OP(runnable)) SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags); + if (enq_flags & SCX_ENQ_WAKEUP) + touch_core_sched(rq, p); + do_enqueue_task(rq, p, enq_flags, sticky_cpu); } @@ -2111,6 +2195,7 @@ static void finish_dispatch(struct rq *rq, struct rq_flags *rf, struct scx_dispatch_q *dsq; unsigned long opss; + touch_core_sched_dispatch(rq, p); retry: /* * No need for _acquire here. @p is accessed only after a successful @@ -2188,8 +2273,8 @@ static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf) dspc->cursor = 0; } -static int balance_scx(struct rq *rq, struct task_struct *prev, - struct rq_flags *rf) +static int balance_one(struct rq *rq, struct task_struct *prev, + struct rq_flags *rf, bool local) { struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); bool prev_on_scx = prev->sched_class == &ext_sched_class; @@ -2213,7 +2298,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, } if (prev_on_scx) { - WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP); + WARN_ON_ONCE(local && (prev->scx.flags & SCX_TASK_BAL_KEEP)); update_curr_scx(rq); /* @@ -2225,10 +2310,16 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, * * See scx_ops_disable_workfn() for the explanation on the * bypassing test. + * + * When balancing a remote CPU for core-sched, there won't be a + * following put_prev_task_scx() call and we don't own + * %SCX_TASK_BAL_KEEP. Instead, pick_task_scx() will test the + * same conditions later and pick @rq->curr accordingly. */ if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice && !scx_ops_bypassing()) { - prev->scx.flags |= SCX_TASK_BAL_KEEP; + if (local) + prev->scx.flags |= SCX_TASK_BAL_KEEP; goto has_tasks; } } @@ -2290,10 +2381,56 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, return has_tasks; } +static int balance_scx(struct rq *rq, struct task_struct *prev, + struct rq_flags *rf) +{ + int ret; + + ret = balance_one(rq, prev, rf, true); + +#ifdef CONFIG_SCHED_SMT + /* + * When core-sched is enabled, this ops.balance() call will be followed + * by put_prev_scx() and pick_task_scx() on this CPU and pick_task_scx() + * on the SMT siblings. Balance the siblings too. + */ + if (sched_core_enabled(rq)) { + const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq)); + int scpu; + + for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) { + struct rq *srq = cpu_rq(scpu); + struct rq_flags srf; + struct task_struct *sprev = srq->curr; + + /* + * While core-scheduling, rq lock is shared among + * siblings but the debug annotations and rq clock + * aren't. Do pinning dance to transfer the ownership. + */ + WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq)); + rq_unpin_lock(rq, rf); + rq_pin_lock(srq, &srf); + + update_rq_clock(srq); + balance_one(srq, sprev, &srf, false); + + rq_unpin_lock(srq, &srf); + rq_repin_lock(rq, rf); + } + } +#endif + return ret; +} + static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) { if (p->scx.flags & SCX_TASK_QUEUED) { - WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); + /* + * Core-sched might decide to execute @p before it is + * dispatched. Call ops_dequeue() to notify the BPF scheduler. + */ + ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC); dispatch_dequeue(rq, p); } @@ -2384,7 +2521,8 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p) /* * If @p has slice left and balance_scx() didn't tag it for * keeping, @p is getting preempted by a higher priority - * scheduler class. Leave it at the head of the local DSQ. + * scheduler class or core-sched forcing a different task. Leave + * it at the head of the local DSQ. */ if (p->scx.slice && !scx_ops_bypassing()) { dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD); @@ -2441,6 +2579,84 @@ static struct task_struct *pick_next_task_scx(struct rq *rq) return p; } +#ifdef CONFIG_SCHED_CORE +/** + * scx_prio_less - Task ordering for core-sched + * @a: task A + * @b: task B + * + * Core-sched is implemented as an additional scheduling layer on top of the + * usual sched_class'es and needs to find out the expected task ordering. For + * SCX, core-sched calls this function to interrogate the task ordering. + * + * Unless overridden by ops.core_sched_before(), @p->scx.core_sched_at is used + * to implement the default task ordering. The older the timestamp, the higher + * prority the task - the global FIFO ordering matching the default scheduling + * behavior. + * + * When ops.core_sched_before() is enabled, @p->scx.core_sched_at is used to + * implement FIFO ordering within each local DSQ. See pick_task_scx(). + */ +bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, + bool in_fi) +{ + /* + * The const qualifiers are dropped from task_struct pointers when + * calling ops.core_sched_before(). Accesses are controlled by the + * verifier. + */ + if (SCX_HAS_OP(core_sched_before) && !scx_ops_bypassing()) + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before, + (struct task_struct *)a, + (struct task_struct *)b); + else + return time_after64(a->scx.core_sched_at, b->scx.core_sched_at); +} + +/** + * pick_task_scx - Pick a candidate task for core-sched + * @rq: rq to pick the candidate task from + * + * Core-sched calls this function on each SMT sibling to determine the next + * tasks to run on the SMT siblings. balance_one() has been called on all + * siblings and put_prev_task_scx() has been called only for the current CPU. + * + * As put_prev_task_scx() hasn't been called on remote CPUs, we can't just look + * at the first task in the local dsq. @rq->curr has to be considered explicitly + * to mimic %SCX_TASK_BAL_KEEP. + */ +static struct task_struct *pick_task_scx(struct rq *rq) +{ + struct task_struct *curr = rq->curr; + struct task_struct *first = first_local_task(rq); + + if (curr->scx.flags & SCX_TASK_QUEUED) { + /* is curr the only runnable task? */ + if (!first) + return curr; + + /* + * Does curr trump first? We can always go by core_sched_at for + * this comparison as it represents global FIFO ordering when + * the default core-sched ordering is used and local-DSQ FIFO + * ordering otherwise. + * + * We can have a task with an earlier timestamp on the DSQ. For + * example, when a current task is preempted by a sibling + * picking a different cookie, the task would be requeued at the + * head of the local DSQ with an earlier timestamp than the + * core-sched picked next task. Besides, the BPF scheduler may + * dispatch any tasks to the local DSQ anytime. + */ + if (curr->scx.slice && time_before64(curr->scx.core_sched_at, + first->scx.core_sched_at)) + return curr; + } + + return first; /* this may be %NULL */ +} +#endif /* CONFIG_SCHED_CORE */ + static enum scx_cpu_preempt_reason preempt_reason_from_class(const struct sched_class *class) { @@ -2848,13 +3064,15 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) update_curr_scx(rq); /* - * While bypassing, always resched as we can't trust the slice - * management. + * While disabling, always resched and refresh core-sched timestamp as + * we can't trust the slice management or ops.core_sched_before(). */ - if (scx_ops_bypassing()) + if (scx_ops_bypassing()) { curr->scx.slice = 0; - else if (SCX_HAS_OP(tick)) + touch_core_sched(rq, curr); + } else if (SCX_HAS_OP(tick)) { SCX_CALL_OP(SCX_KF_REST, tick, curr); + } if (!curr->scx.slice) resched_curr(rq); @@ -3208,6 +3426,10 @@ DEFINE_SCHED_CLASS(ext) = { .rq_offline = rq_offline_scx, #endif +#ifdef CONFIG_SCHED_CORE + .pick_task = pick_task_scx, +#endif + .task_tick = task_tick_scx, .switching_to = switching_to_scx, @@ -3421,12 +3643,14 @@ bool task_should_scx(struct task_struct *p) * * c. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value can't be * trusted. Whenever a tick triggers, the running task is rotated to the tail - * of the queue. + * of the queue with core_sched_at touched. * * d. pick_next_task() suppresses zero slice warning. * * e. scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM * operations. + * + * f. scx_prio_less() reverts to the default core_sched_at order. */ static void scx_ops_bypass(bool bypass) { @@ -4588,6 +4812,7 @@ static void running_stub(struct task_struct *p) {} static void stopping_stub(struct task_struct *p, bool runnable) {} static void quiescent_stub(struct task_struct *p, u64 deq_flags) {} static bool yield_stub(struct task_struct *from, struct task_struct *to) { return false; } +static bool core_sched_before_stub(struct task_struct *a, struct task_struct *b) { return false; } static void set_weight_stub(struct task_struct *p, u32 weight) {} static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {} static void update_idle_stub(s32 cpu, bool idle) {} @@ -4612,6 +4837,7 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .stopping = stopping_stub, .quiescent = quiescent_stub, .yield = yield_stub, + .core_sched_before = core_sched_before_stub, .set_weight = set_weight_stub, .set_cpumask = set_cpumask_stub, .update_idle = update_idle_stub, diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 037f9acdf443..6555878c5da3 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -70,6 +70,11 @@ static inline const struct sched_class *next_active_class(const struct sched_cla for_active_class_range(class, (prev_class) > &ext_sched_class ? \ &ext_sched_class : (prev_class), (end_class)) +#ifdef CONFIG_SCHED_CORE +bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, + bool in_fi); +#endif + #else /* CONFIG_SCHED_CLASS_EXT */ #define scx_enabled() false diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 619078355bf5..c75c70d6a8eb 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -13,6 +13,7 @@ * - Sleepable per-task storage allocation using ops.prep_enable(). * - Using ops.cpu_release() to handle a higher priority scheduling class taking * the CPU away. + * - Core-sched support. * * This scheduler is primarily for demonstration and testing of sched_ext * features and unlikely to be useful for actual workloads. @@ -67,9 +68,21 @@ struct { }, }; +/* + * Per-queue sequence numbers to implement core-sched ordering. + * + * Tail seq is assigned to each queued task and incremented. Head seq tracks the + * sequence number of the latest dispatched task. The distance between the a + * task's seq and the associated queue's head seq is called the queue distance + * and used when comparing two tasks for ordering. See qmap_core_sched_before(). + */ +static u64 core_sched_head_seqs[5]; +static u64 core_sched_tail_seqs[5]; + /* Per-task scheduling context */ struct task_ctx { bool force_local; /* Dispatch directly to local_dsq */ + u64 core_sched_seq; }; struct { @@ -93,6 +106,7 @@ struct { /* Statistics */ u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued; +u64 nr_core_sched_execed; s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) @@ -159,8 +173,18 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) return; } - /* Is select_cpu() is telling us to enqueue locally? */ - if (tctx->force_local) { + /* + * All enqueued tasks must have their core_sched_seq updated for correct + * core-sched ordering, which is why %SCX_OPS_ENQ_LAST is specified in + * qmap_ops.flags. + */ + tctx->core_sched_seq = core_sched_tail_seqs[idx]++; + + /* + * If qmap_select_cpu() is telling us to or this is the last runnable + * task on the CPU, enqueue locally. + */ + if (tctx->force_local || (enq_flags & SCX_ENQ_LAST)) { tctx->force_local = false; scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags); return; @@ -204,6 +228,19 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags) { __sync_fetch_and_add(&nr_dequeued, 1); + if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC) + __sync_fetch_and_add(&nr_core_sched_execed, 1); +} + +static void update_core_sched_head_seq(struct task_struct *p) +{ + struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); + int idx = weight_to_idx(p->scx.weight); + + if (tctx) + core_sched_head_seqs[idx] = tctx->core_sched_seq; + else + scx_bpf_error("task_ctx lookup failed"); } void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) @@ -258,6 +295,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) if (!p) continue; + update_core_sched_head_seq(p); __sync_fetch_and_add(&nr_dispatched, 1); scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0); bpf_task_release(p); @@ -275,6 +313,49 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) } } +/* + * The distance from the head of the queue scaled by the weight of the queue. + * The lower the number, the older the task and the higher the priority. + */ +static s64 task_qdist(struct task_struct *p) +{ + int idx = weight_to_idx(p->scx.weight); + struct task_ctx *tctx; + s64 qdist; + + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); + if (!tctx) { + scx_bpf_error("task_ctx lookup failed"); + return 0; + } + + qdist = tctx->core_sched_seq - core_sched_head_seqs[idx]; + + /* + * As queue index increments, the priority doubles. The queue w/ index 3 + * is dispatched twice more frequently than 2. Reflect the difference by + * scaling qdists accordingly. Note that the shift amount needs to be + * flipped depending on the sign to avoid flipping priority direction. + */ + if (qdist >= 0) + return qdist << (4 - idx); + else + return qdist << idx; +} + +/* + * This is called to determine the task ordering when core-sched is picking + * tasks to execute on SMT siblings and should encode about the same ordering as + * the regular scheduling path. Use the priority-scaled distances from the head + * of the queues to compare the two tasks which should be consistent with the + * dispatch path behavior. + */ +bool BPF_STRUCT_OPS(qmap_core_sched_before, + struct task_struct *a, struct task_struct *b) +{ + return task_qdist(a) > task_qdist(b); +} + void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args) { u32 cnt; @@ -354,8 +435,8 @@ void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struc if (!(taskc = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) return; - scx_bpf_dump("QMAP: force_local=%d", - taskc->force_local); + scx_bpf_dump("QMAP: force_local=%d core_sched_seq=%llu", + taskc->force_local, taskc->core_sched_seq); } /* @@ -428,6 +509,7 @@ SCX_OPS_DEFINE(qmap_ops, .enqueue = (void *)qmap_enqueue, .dequeue = (void *)qmap_dequeue, .dispatch = (void *)qmap_dispatch, + .core_sched_before = (void *)qmap_core_sched_before, .cpu_release = (void *)qmap_cpu_release, .init_task = (void *)qmap_init_task, .dump = (void *)qmap_dump, @@ -437,5 +519,6 @@ SCX_OPS_DEFINE(qmap_ops, .cpu_offline = (void *)qmap_cpu_offline, .init = (void *)qmap_init, .exit = (void *)qmap_exit, + .flags = SCX_OPS_ENQ_LAST, .timeout_ms = 5000U, .name = "qmap"); diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index 920fb54f9c77..bc36ec4f88a7 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -112,9 +112,10 @@ int main(int argc, char **argv) long nr_enqueued = skel->bss->nr_enqueued; long nr_dispatched = skel->bss->nr_dispatched; - printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64"\n", + printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64"\n", nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, - skel->bss->nr_reenqueued, skel->bss->nr_dequeued); + skel->bss->nr_reenqueued, skel->bss->nr_dequeued, + skel->bss->nr_core_sched_execed); fflush(stdout); sleep(1); } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 06e51be3d5e7a07aea5c9012773df8d5de01db6c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Currently, a dsq is always a FIFO. A task which is dispatched earlier gets consumed or executed earlier. While this is sufficient when dsq's are used for simple staging areas for tasks which are ready to execute, it'd make dsq's a lot more useful if they can implement custom ordering. This patch adds a vtime-ordered priority queue to dsq's. When the BPF scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it can specify the vtime tha the task should be inserted at and the task is inserted into the priority queue in the dsq which is ordered according to time_before64() comparison of the vtime values. A DSQ can either be a FIFO or priority queue and automatically switches between the two depending on whether scx_bpf_dispatch() or scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ already has the other type queued is not allowed and triggers an ops error. Built-in DSQs must always be FIFOs. This makes it very easy for the BPF schedulers to implement proper vtime based scheduling within each dsq very easy and efficient at a negligible cost in terms of code complexity and overhead. scx_simple and scx_example_flatcg are updated to default to weighted vtime scheduling (the latter within each cgroup). FIFO scheduling can be selected with -f option. v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes led to unexpected starvations, DSQs now error out if both modes are used at the same time and the built-in DSQs are no longer allowed to be priority queues. - Explicit type struct scx_dsq_node added to contain fields needed to be linked on DSQs. This will be used to implement stateful iterator. - Tasks are now always linked on dsq->list whether the DSQ is in FIFO or PRIQ mode. This confines PRIQ related complexities to the enqueue and dequeue paths. Other paths only need to look at dsq->list. This will also ease implementing BPF iterator. - Print p->scx.dsq_flags in debug dump. v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own p->scx.dsq_flags. The flag is protected with the dsq lock unlike other flags in p->scx.flags. This led to flag corruption in some cases. - Add comments explaining the interaction between using consumption of p->scx.slice to determine vtime progress and yielding. v2: - p->scx.dsq_vtime was not initialized on load or across cgroup migrations leading to some tasks being stalled for extended period of time depending on how saturated the machine is. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 29 ++++- init/init_task.c | 2 +- kernel/sched/ext.c | 156 ++++++++++++++++++++--- tools/sched_ext/include/scx/common.bpf.h | 1 + tools/sched_ext/scx_simple.bpf.c | 97 +++++++++++++- tools/sched_ext/scx_simple.c | 8 +- 6 files changed, 267 insertions(+), 26 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 3db7b32b2d1d..9cee193dab19 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -49,12 +49,15 @@ enum scx_dsq_id_flags { }; /* - * Dispatch queue (dsq) is a simple FIFO which is used to buffer between the - * scheduler core and the BPF scheduler. See the documentation for more details. + * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered + * queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to + * buffer between the scheduler core and the BPF scheduler. See the + * documentation for more details. */ struct scx_dispatch_q { raw_spinlock_t lock; struct list_head list; /* tasks in dispatch order */ + struct rb_root priq; /* used to order by p->scx.dsq_vtime */ u32 nr; u64 id; struct rhash_head hash_node; @@ -86,6 +89,11 @@ enum scx_task_state { SCX_TASK_NR_STATES, }; +/* scx_entity.dsq_flags */ +enum scx_ent_dsq_flags { + SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */ +}; + /* * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from * everywhere and the following bits track which kfunc sets are currently @@ -111,13 +119,19 @@ enum scx_kf_mask { __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, }; +struct scx_dsq_node { + struct list_head list; /* dispatch order */ + struct rb_node priq; /* p->scx.dsq_vtime order */ + u32 flags; /* SCX_TASK_DSQ_* flags */ +}; + /* * The following is embedded in task_struct and contains all fields necessary * for a task to be scheduled by SCX. */ struct sched_ext_entity { struct scx_dispatch_q *dsq; - struct list_head dsq_node; + struct scx_dsq_node dsq_node; /* protected by dsq lock */ u32 flags; /* protected by rq lock */ u32 weight; s32 sticky_cpu; @@ -149,6 +163,15 @@ struct sched_ext_entity { */ u64 slice; + /* + * Used to order tasks when dispatching to the vtime-ordered priority + * queue of a dsq. This is usually set through scx_bpf_dispatch_vtime() + * but can also be modified directly by the BPF scheduler. Modifying it + * while a task is queued on a dsq may mangle the ordering and is not + * recommended. + */ + u64 dsq_vtime; + /* * If set, reject future sched_setscheduler(2) calls updating the policy * to %SCHED_EXT with -%EACCES. diff --git a/init/init_task.c b/init/init_task.c index b86c0a995136..c744b3fe2cd2 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -115,7 +115,7 @@ struct task_struct init_task #endif #ifdef CONFIG_SCHED_CLASS_EXT .scx = { - .dsq_node = LIST_HEAD_INIT(init_task.scx.dsq_node), + .dsq_node.list = LIST_HEAD_INIT(init_task.scx.dsq_node.list), .sticky_cpu = -1, .holding_cpu = -1, .runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node), diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 86f15b1f356c..ab24cf1dd4f6 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -642,6 +642,7 @@ enum scx_enq_flags { __SCX_ENQ_INTERNAL_MASK = 0xffLLU << 56, SCX_ENQ_CLEAR_OPSS = 1LLU << 56, + SCX_ENQ_DSQ_PRIQ = 1LLU << 57, }; enum scx_deq_flags { @@ -1355,6 +1356,17 @@ static void update_curr_scx(struct rq *rq) } } +static bool scx_dsq_priq_less(struct rb_node *node_a, + const struct rb_node *node_b) +{ + const struct task_struct *a = + container_of(node_a, struct task_struct, scx.dsq_node.priq); + const struct task_struct *b = + container_of(node_b, struct task_struct, scx.dsq_node.priq); + + return time_before64(a->scx.dsq_vtime, b->scx.dsq_vtime); +} + static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) { /* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */ @@ -1366,7 +1378,9 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, { bool is_local = dsq->id == SCX_DSQ_LOCAL; - WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node)); + WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node.list)); + WARN_ON_ONCE((p->scx.dsq_node.flags & SCX_TASK_DSQ_ON_PRIQ) || + !RB_EMPTY_NODE(&p->scx.dsq_node.priq)); if (!is_local) { raw_spin_lock(&dsq->lock); @@ -1379,10 +1393,59 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, } } - if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) - list_add(&p->scx.dsq_node, &dsq->list); - else - list_add_tail(&p->scx.dsq_node, &dsq->list); + if (unlikely((dsq->id & SCX_DSQ_FLAG_BUILTIN) && + (enq_flags & SCX_ENQ_DSQ_PRIQ))) { + /* + * SCX_DSQ_LOCAL and SCX_DSQ_GLOBAL DSQs always consume from + * their FIFO queues. To avoid confusion and accidentally + * starving vtime-dispatched tasks by FIFO-dispatched tasks, we + * disallow any internal DSQ from doing vtime ordering of + * tasks. + */ + scx_ops_error("cannot use vtime ordering for built-in DSQs"); + enq_flags &= ~SCX_ENQ_DSQ_PRIQ; + } + + if (enq_flags & SCX_ENQ_DSQ_PRIQ) { + struct rb_node *rbp; + + /* + * A PRIQ DSQ shouldn't be using FIFO enqueueing. As tasks are + * linked to both the rbtree and list on PRIQs, this can only be + * tested easily when adding the first task. + */ + if (unlikely(RB_EMPTY_ROOT(&dsq->priq) && + !list_empty(&dsq->list))) + scx_ops_error("DSQ ID 0x%016llx already had FIFO-enqueued tasks", + dsq->id); + + p->scx.dsq_node.flags |= SCX_TASK_DSQ_ON_PRIQ; + rb_add(&p->scx.dsq_node.priq, &dsq->priq, scx_dsq_priq_less); + + /* + * Find the previous task and insert after it on the list so + * that @dsq->list is vtime ordered. + */ + rbp = rb_prev(&p->scx.dsq_node.priq); + if (rbp) { + struct task_struct *prev = + container_of(rbp, struct task_struct, + scx.dsq_node.priq); + list_add(&p->scx.dsq_node.list, &prev->scx.dsq_node.list); + } else { + list_add(&p->scx.dsq_node.list, &dsq->list); + } + } else { + /* a FIFO DSQ shouldn't be using PRIQ enqueuing */ + if (unlikely(!RB_EMPTY_ROOT(&dsq->priq))) + scx_ops_error("DSQ ID 0x%016llx already had PRIQ-enqueued tasks", + dsq->id); + + if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) + list_add(&p->scx.dsq_node.list, &dsq->list); + else + list_add_tail(&p->scx.dsq_node.list, &dsq->list); + } dsq_mod_nr(dsq, 1); p->scx.dsq = dsq; @@ -1421,13 +1484,30 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, } } +static void task_unlink_from_dsq(struct task_struct *p, + struct scx_dispatch_q *dsq) +{ + if (p->scx.dsq_node.flags & SCX_TASK_DSQ_ON_PRIQ) { + rb_erase(&p->scx.dsq_node.priq, &dsq->priq); + RB_CLEAR_NODE(&p->scx.dsq_node.priq); + p->scx.dsq_node.flags &= ~SCX_TASK_DSQ_ON_PRIQ; + } + + list_del_init(&p->scx.dsq_node.list); +} + +static bool task_linked_on_dsq(struct task_struct *p) +{ + return !list_empty(&p->scx.dsq_node.list); +} + static void dispatch_dequeue(struct rq *rq, struct task_struct *p) { struct scx_dispatch_q *dsq = p->scx.dsq; bool is_local = dsq == &rq->scx.local_dsq; if (!dsq) { - WARN_ON_ONCE(!list_empty(&p->scx.dsq_node)); + WARN_ON_ONCE(task_linked_on_dsq(p)); /* * When dispatching directly from the BPF scheduler to a local * DSQ, the task isn't associated with any DSQ but @@ -1448,8 +1528,8 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p) */ if (p->scx.holding_cpu < 0) { /* @p must still be on @dsq, dequeue */ - WARN_ON_ONCE(list_empty(&p->scx.dsq_node)); - list_del_init(&p->scx.dsq_node); + WARN_ON_ONCE(!task_linked_on_dsq(p)); + task_unlink_from_dsq(p, dsq); dsq_mod_nr(dsq, -1); } else { /* @@ -1458,7 +1538,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p) * holding_cpu which tells dispatch_to_local_dsq() that it lost * the race. */ - WARN_ON_ONCE(!list_empty(&p->scx.dsq_node)); + WARN_ON_ONCE(task_linked_on_dsq(p)); p->scx.holding_cpu = -1; } p->scx.dsq = NULL; @@ -1954,7 +2034,8 @@ static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq, /* @dsq is locked and @p is on this rq */ WARN_ON_ONCE(p->scx.holding_cpu >= 0); - list_move_tail(&p->scx.dsq_node, &rq->scx.local_dsq.list); + task_unlink_from_dsq(p, dsq); + list_add_tail(&p->scx.dsq_node.list, &rq->scx.local_dsq.list); dsq_mod_nr(dsq, -1); dsq_mod_nr(&rq->scx.local_dsq, 1); p->scx.dsq = &rq->scx.local_dsq; @@ -1997,7 +2078,7 @@ static bool consume_remote_task(struct rq *rq, struct rq_flags *rf, * move_task_to_local_dsq(). */ WARN_ON_ONCE(p->scx.holding_cpu >= 0); - list_del_init(&p->scx.dsq_node); + task_unlink_from_dsq(p, dsq); dsq_mod_nr(dsq, -1); p->scx.holding_cpu = raw_smp_processor_id(); raw_spin_unlock(&dsq->lock); @@ -2029,7 +2110,7 @@ static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf, raw_spin_lock(&dsq->lock); - list_for_each_entry(p, &dsq->list, scx.dsq_node) { + list_for_each_entry(p, &dsq->list, scx.dsq_node.list) { struct rq *task_rq = task_rq(p); if (rq == task_rq) { @@ -2548,7 +2629,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p) static struct task_struct *first_local_task(struct rq *rq) { return list_first_entry_or_null(&rq->scx.local_dsq.list, - struct task_struct, scx.dsq_node); + struct task_struct, scx.dsq_node.list); } static struct task_struct *pick_next_task_scx(struct rq *rq) @@ -3230,7 +3311,8 @@ void init_scx_entity(struct sched_ext_entity *scx) */ memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node)); - INIT_LIST_HEAD(&scx->dsq_node); + INIT_LIST_HEAD(&scx->dsq_node.list); + RB_CLEAR_NODE(&scx->dsq_node.priq); scx->sticky_cpu = -1; scx->holding_cpu = -1; INIT_LIST_HEAD(&scx->runnable_node); @@ -4075,12 +4157,13 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, dump_line(s, " %c%c %s[%d] %+ldms", marker, task_state_to_char(p), p->comm, p->pid, jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies)); - dump_line(s, " scx_state/flags=%u/0x%x ops_state/qseq=%lu/%lu", + dump_line(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu", scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK, - ops_state & SCX_OPSS_STATE_MASK, + p->scx.dsq_node.flags, ops_state & SCX_OPSS_STATE_MASK, ops_state >> SCX_OPSS_QSEQ_SHIFT); - dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s", - p->scx.sticky_cpu, p->scx.holding_cpu, dsq_id_buf); + dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s dsq_vtime=%llu", + p->scx.sticky_cpu, p->scx.holding_cpu, dsq_id_buf, + p->scx.dsq_vtime); dump_line(s, " cpus=%*pb", cpumask_pr_args(p->cpus_ptr)); if (SCX_HAS_OP(dump_task)) { @@ -4668,6 +4751,9 @@ static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log, if (off >= offsetof(struct task_struct, scx.slice) && off + size <= offsetofend(struct task_struct, scx.slice)) return SCALAR_VALUE; + if (off >= offsetof(struct task_struct, scx.dsq_vtime) && + off + size <= offsetofend(struct task_struct, scx.dsq_vtime)) + return SCALAR_VALUE; if (off >= offsetof(struct task_struct, scx.disallow) && off + size <= offsetofend(struct task_struct, scx.disallow)) return SCALAR_VALUE; @@ -5303,10 +5389,44 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, scx_dispatch_commit(p, dsq_id, enq_flags); } +/** + * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ + * @p: task_struct to dispatch + * @dsq_id: DSQ to dispatch to + * @slice: duration @p can run for in nsecs + * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ + * @enq_flags: SCX_ENQ_* + * + * Dispatch @p into the vtime priority queue of the DSQ identified by @dsq_id. + * Tasks queued into the priority queue are ordered by @vtime and always + * consumed after the tasks in the FIFO queue. All other aspects are identical + * to scx_bpf_dispatch(). + * + * @vtime ordering is according to time_before64() which considers wrapping. A + * numerically larger vtime may indicate an earlier position in the ordering and + * vice-versa. + */ +__bpf_kfunc void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, + u64 slice, u64 vtime, u64 enq_flags) +{ + if (!scx_dispatch_preamble(p, enq_flags)) + return; + + if (slice) + p->scx.slice = slice; + else + p->scx.slice = p->scx.slice ?: 1; + + p->scx.dsq_vtime = vtime; + + scx_dispatch_commit(p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); +} + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch) BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU) BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 8686f84497db..3fa87084cf17 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -31,6 +31,7 @@ static inline void ___vmlinux_h_sanity_check___(void) s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym; s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym; void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym; +void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym; u32 scx_bpf_dispatch_nr_slots(void) __ksym; void scx_bpf_dispatch_cancel(void) __ksym; bool scx_bpf_consume(u64 dsq_id) __ksym; diff --git a/tools/sched_ext/scx_simple.bpf.c b/tools/sched_ext/scx_simple.bpf.c index 6bb13a3c801b..ed7e8d535fc5 100644 --- a/tools/sched_ext/scx_simple.bpf.c +++ b/tools/sched_ext/scx_simple.bpf.c @@ -2,11 +2,20 @@ /* * A simple scheduler. * - * A simple global FIFO scheduler. It also demonstrates the following niceties. + * By default, it operates as a simple global weighted vtime scheduler and can + * be switched to FIFO scheduling. It also demonstrates the following niceties. * * - Statistics tracking how many tasks are queued to local and global dsq's. * - Termination notification for userspace. * + * While very simple, this scheduler should work reasonably well on CPUs with a + * uniform L3 cache topology. While preemption is not implemented, the fact that + * the scheduling queue is shared across all CPUs means that whatever is at the + * front of the queue is likely to be executed fairly quickly given enough + * number of CPUs. The FIFO scheduling mode may be beneficial to some workloads + * but comes with the usual problems with FIFO scheduling where saturating + * threads can easily drown out interactive ones. + * * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. * Copyright (c) 2022 Tejun Heo <tj@kernel.org> * Copyright (c) 2022 David Vernet <dvernet@meta.com> @@ -15,8 +24,20 @@ char _license[] SEC("license") = "GPL"; +const volatile bool fifo_sched; + +static u64 vtime_now; UEI_DEFINE(uei); +/* + * Built-in DSQs such as SCX_DSQ_GLOBAL cannot be used as priority queues + * (meaning, cannot be dispatched to with scx_bpf_dispatch_vtime()). We + * therefore create a separate DSQ with ID 0 that we dispatch to and consume + * from. If scx_simple only supported global FIFO scheduling, then we could + * just use SCX_DSQ_GLOBAL. + */ +#define SHARED_DSQ 0 + struct { __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); __uint(key_size, sizeof(u32)); @@ -31,6 +52,11 @@ static void stat_inc(u32 idx) (*cnt_p)++; } +static inline bool vtime_before(u64 a, u64 b) +{ + return (s64)(a - b) < 0; +} + s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) { bool is_idle = false; @@ -48,7 +74,69 @@ s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 w void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) { stat_inc(1); /* count global queueing */ - scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + + if (fifo_sched) { + scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags); + } else { + u64 vtime = p->scx.dsq_vtime; + + /* + * Limit the amount of budget that an idling task can accumulate + * to one slice. + */ + if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL)) + vtime = vtime_now - SCX_SLICE_DFL; + + scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime, + enq_flags); + } +} + +void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev) +{ + scx_bpf_consume(SHARED_DSQ); +} + +void BPF_STRUCT_OPS(simple_running, struct task_struct *p) +{ + if (fifo_sched) + return; + + /* + * Global vtime always progresses forward as tasks start executing. The + * test and update can be performed concurrently from multiple CPUs and + * thus racy. Any error should be contained and temporary. Let's just + * live with it. + */ + if (vtime_before(vtime_now, p->scx.dsq_vtime)) + vtime_now = p->scx.dsq_vtime; +} + +void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable) +{ + if (fifo_sched) + return; + + /* + * Scale the execution time by the inverse of the weight and charge. + * + * Note that the default yield implementation yields by setting + * @p->scx.slice to zero and the following would treat the yielding task + * as if it has consumed all its slice. If this penalizes yielding tasks + * too much, determine the execution time by taking explicit timestamps + * instead of depending on @p->scx.slice. + */ + p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; +} + +void BPF_STRUCT_OPS(simple_enable, struct task_struct *p) +{ + p->scx.dsq_vtime = vtime_now; +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) +{ + return scx_bpf_create_dsq(SHARED_DSQ, -1); } void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) @@ -59,5 +147,10 @@ void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) SCX_OPS_DEFINE(simple_ops, .select_cpu = (void *)simple_select_cpu, .enqueue = (void *)simple_enqueue, + .dispatch = (void *)simple_dispatch, + .running = (void *)simple_running, + .stopping = (void *)simple_stopping, + .enable = (void *)simple_enable, + .init = (void *)simple_init, .exit = (void *)simple_exit, .name = "simple"); diff --git a/tools/sched_ext/scx_simple.c b/tools/sched_ext/scx_simple.c index bead482e1383..76d83199545c 100644 --- a/tools/sched_ext/scx_simple.c +++ b/tools/sched_ext/scx_simple.c @@ -17,8 +17,9 @@ const char help_fmt[] = "\n" "See the top-level comment in .bpf.c for more details.\n" "\n" -"Usage: %s [-v]\n" +"Usage: %s [-f] [-v]\n" "\n" +" -f Use FIFO scheduling instead of weighted vtime scheduling\n" " -v Print libbpf debug messages\n" " -h Display this help and exit\n"; @@ -70,8 +71,11 @@ int main(int argc, char **argv) restart: skel = SCX_OPS_OPEN(simple_ops, scx_simple); - while ((opt = getopt(argc, argv, "vh")) != -1) { + while ((opt = getopt(argc, argv, "fvh")) != -1) { switch (opt) { + case 'f': + skel->rodata->fifo_sched = true; + break; case 'v': verbose = true; break; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit fa48e8d2c7b58d242c1db3a09c14f4274e055087 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Add Documentation/scheduler/sched-ext.rst which gives a high-level overview and pointers to the examples. v6: - Add paragraph explaining debug dump. v5: - Updated to reflect /sys/kernel interface change. Kconfig options added. v4: - README improved, reformatted in markdown and renamed to README.md. v3: - Added tools/sched_ext/README. - Dropped _example prefix from scheduler names. v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all of them are addressed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Documentation/scheduler/index.rst | 1 + Documentation/scheduler/sched-ext.rst | 314 ++++++++++++++++++++++++++ include/linux/sched/ext.h | 2 + kernel/Kconfig.preempt | 1 + kernel/sched/ext.c | 2 + kernel/sched/ext.h | 2 + tools/sched_ext/README.md | 258 +++++++++++++++++++++ 7 files changed, 580 insertions(+) create mode 100644 Documentation/scheduler/sched-ext.rst create mode 100644 tools/sched_ext/README.md diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst index 3170747226f6..0b650bb550e6 100644 --- a/Documentation/scheduler/index.rst +++ b/Documentation/scheduler/index.rst @@ -19,6 +19,7 @@ Scheduler sched-nice-design sched-rt-group sched-stats + sched-ext sched-debug text_files diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst new file mode 100644 index 000000000000..497eeaa5ecbe --- /dev/null +++ b/Documentation/scheduler/sched-ext.rst @@ -0,0 +1,314 @@ +========================== +Extensible Scheduler Class +========================== + +sched_ext is a scheduler class whose behavior can be defined by a set of BPF +programs - the BPF scheduler. + +* sched_ext exports a full scheduling interface so that any scheduling + algorithm can be implemented on top. + +* The BPF scheduler can group CPUs however it sees fit and schedule them + together, as tasks aren't tied to specific CPUs at the time of wakeup. + +* The BPF scheduler can be turned on and off dynamically anytime. + +* The system integrity is maintained no matter what the BPF scheduler does. + The default scheduling behavior is restored anytime an error is detected, + a runnable task stalls, or on invoking the SysRq key sequence + :kbd:`SysRq-S`. + +* When the BPF scheduler triggers an error, debug information is dumped to + aid debugging. The debug dump is passed to and printed out by the + scheduler binary. The debug dump can also be accessed through the + `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D` + triggers a debug dump. This doesn't terminate the BPF scheduler and can + only be read through the tracepoint. + +Switching to and from sched_ext +=============================== + +``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and +``tools/sched_ext`` contains the example schedulers. The following config +options should be enabled to use sched_ext: + +.. code-block:: none + + CONFIG_BPF=y + CONFIG_SCHED_CLASS_EXT=y + CONFIG_BPF_SYSCALL=y + CONFIG_BPF_JIT=y + CONFIG_DEBUG_INFO_BTF=y + CONFIG_BPF_JIT_ALWAYS_ON=y + CONFIG_BPF_JIT_DEFAULT_ON=y + CONFIG_PAHOLE_HAS_SPLIT_BTF=y + CONFIG_PAHOLE_HAS_BTF_TAG=y + +sched_ext is used only when the BPF scheduler is loaded and running. + +If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be +treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is +loaded. On load, such tasks will be switched to and scheduled by sched_ext. + +The BPF scheduler can choose to schedule all normal and lower class tasks by +calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this +case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and +``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers, +this mode can be selected with the ``-a`` option. + +Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or +detection of any internal error including stalled runnable tasks aborts the +BPF scheduler and reverts all tasks back to CFS. + +.. code-block:: none + + # make -j16 -C tools/sched_ext + # tools/sched_ext/scx_simple + local=0 global=3 + local=5 global=24 + local=9 global=44 + local=13 global=56 + local=17 global=72 + ^CEXIT: BPF scheduler unregistered + +The current status of the BPF scheduler can be determined as follows: + +.. code-block:: none + + # cat /sys/kernel/sched_ext/state + enabled + # cat /sys/kernel/sched_ext/root/ops + simple + +``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more +detailed information: + +.. code-block:: none + + # tools/sched_ext/scx_show_state.py + ops : simple + enabled : 1 + switching_all : 1 + switched_all : 1 + enable_state : enabled (2) + bypass_depth : 0 + nr_rejected : 0 + +If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can +be determined as follows: + +.. code-block:: none + + # grep ext /proc/self/sched + ext.enabled : 1 + +The Basics +========== + +Userspace can implement an arbitrary BPF scheduler by loading a set of BPF +programs that implement ``struct sched_ext_ops``. The only mandatory field +is ``ops.name`` which must be a valid BPF object name. All operations are +optional. The following modified excerpt is from +``tools/sched/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. + +.. code-block:: c + + /* + * Decide which CPU a task should be migrated to before being + * enqueued (either at wakeup, fork time, or exec time). If an + * idle core is found by the default ops.select_cpu() implementation, + * then dispatch the task directly to SCX_DSQ_LOCAL and skip the + * ops.enqueue() callback. + * + * Note that this implementation has exactly the same behavior as the + * default ops.select_cpu implementation. The behavior of the scheduler + * would be exactly same if the implementation just didn't define the + * simple_select_cpu() struct_ops prog. + */ + s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) + { + s32 cpu; + /* Need to initialize or the BPF verifier will reject the program */ + bool direct = false; + + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct); + + if (direct) + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); + + return cpu; + } + + /* + * Do a direct dispatch of a task to the global DSQ. This ops.enqueue() + * callback will only be invoked if we failed to find a core to dispatch + * to in ops.select_cpu() above. + * + * Note that this implementation has exactly the same behavior as the + * default ops.enqueue implementation, which just dispatches the task + * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly same + * if the implementation just didn't define the simple_enqueue struct_ops + * prog. + */ + void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) + { + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + } + + s32 BPF_STRUCT_OPS(simple_init) + { + /* + * All SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH tasks should + * use sched_ext. + */ + scx_bpf_switch_all(); + return 0; + } + + void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) + { + exit_type = ei->type; + } + + SEC(".struct_ops") + struct sched_ext_ops simple_ops = { + .select_cpu = (void *)simple_select_cpu, + .enqueue = (void *)simple_enqueue, + .init = (void *)simple_init, + .exit = (void *)simple_exit, + .name = "simple", + }; + +Dispatch Queues +--------------- + +To match the impedance between the scheduler core and the BPF scheduler, +sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a +priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``), +and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage +an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and +``scx_bpf_destroy_dsq()``. + +A CPU always executes a task from its local DSQ. A task is "dispatched" to a +DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's +local DSQ. + +When a CPU is looking for the next task to run, if the local DSQ is not +empty, the first task is picked. Otherwise, the CPU tries to consume the +global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()`` +is invoked. + +Scheduling Cycle +---------------- + +The following briefly shows how a waking task is scheduled and executed. + +1. When a task is waking up, ``ops.select_cpu()`` is the first operation + invoked. This serves two purposes. First, CPU selection optimization + hint. Second, waking up the selected CPU if idle. + + The CPU selected by ``ops.select_cpu()`` is an optimization hint and not + binding. The actual decision is made at the last step of scheduling. + However, there is a small performance gain if the CPU + ``ops.select_cpu()`` returns matches the CPU the task eventually runs on. + + A side-effect of selecting a CPU is waking it up from idle. While a BPF + scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, + using ``ops.select_cpu()`` judiciously can be simpler and more efficient. + + A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by + calling ``scx_bpf_dispatch()``. If the task is dispatched to + ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the + local DSQ of whichever CPU is returned from ``ops.select_cpu()``. + Additionally, dispatching directly from ``ops.select_cpu()`` will cause the + ``ops.enqueue()`` callback to be skipped. + + Note that the scheduler core will ignore an invalid CPU selection, for + example, if it's outside the allowed cpumask of the task. + +2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the + task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()`` + can make one of the following decisions: + + * Immediately dispatch the task to either the global or local DSQ by + calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or + ``SCX_DSQ_LOCAL``, respectively. + + * Immediately dispatch the task to a custom DSQ by calling + ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63. + + * Queue the task on the BPF side. + +3. When a CPU is ready to schedule, it first looks at its local DSQ. If + empty, it then looks at the global DSQ. If there still isn't a task to + run, ``ops.dispatch()`` is invoked which can use the following two + functions to populate the local DSQ. + + * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can + be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``, + ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()`` + currently can't be called with BPF locks held, this is being worked on + and will be supported. ``scx_bpf_dispatch()`` schedules dispatching + rather than performing them immediately. There can be up to + ``ops.dispatch_max_batch`` pending tasks. + + * ``scx_bpf_consume()`` tranfers a task from the specified non-local DSQ + to the dispatching DSQ. This function cannot be called with any BPF + locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks + before trying to consume the specified DSQ. + +4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, + the CPU runs the first one. If empty, the following steps are taken: + + * Try to consume the global DSQ. If successful, run the task. + + * If ``ops.dispatch()`` has dispatched any tasks, retry #3. + + * If the previous task is an SCX task and still runnable, keep executing + it (see ``SCX_OPS_ENQ_LAST``). + + * Go idle. + +Note that the BPF scheduler can always choose to dispatch tasks immediately +in ``ops.enqueue()`` as illustrated in the above simple example. If only the +built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as +a task is never queued on the BPF scheduler and both the local and global +DSQs are consumed automatically. + +``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use +``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as +``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue +dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the +function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for +more information. + +Where to Look +============= + +* ``include/linux/sched/ext.h`` defines the core data structures, ops table + and constants. + +* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers. + The functions prefixed with ``scx_bpf_`` can be called from the BPF + scheduler. + +* ``tools/sched_ext/`` hosts example BPF scheduler implementations. + + * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a + custom DSQ. + + * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five + levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``. + +ABI Instability +=============== + +The APIs provided by sched_ext to BPF schedulers programs have no stability +guarantees. This includes the ops table callbacks and constants defined in +``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in +``kernel/sched/ext.c``. + +While we will attempt to provide a relatively stable API surface when +possible, they are subject to change without warning between kernel +versions. diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 9cee193dab19..fe9a67ffe6b1 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -1,5 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst + * * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. * Copyright (c) 2022 Tejun Heo <tj@kernel.org> * Copyright (c) 2022 David Vernet <dvernet@meta.com> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 4c97595dd0bb..5ec2759552f6 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -156,4 +156,5 @@ config SCHED_CLASS_EXT similar to struct sched_class. For more information: + Documentation/scheduler/sched-ext.rst https://github.com/sched-ext/scx diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index ab24cf1dd4f6..bf2c1e516035 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1,5 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst + * * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. * Copyright (c) 2022 Tejun Heo <tj@kernel.org> * Copyright (c) 2022 David Vernet <dvernet@meta.com> diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 6555878c5da3..c41d742b5d62 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -1,5 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst + * * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. * Copyright (c) 2022 Tejun Heo <tj@kernel.org> * Copyright (c) 2022 David Vernet <dvernet@meta.com> diff --git a/tools/sched_ext/README.md b/tools/sched_ext/README.md new file mode 100644 index 000000000000..8efe70cc4363 --- /dev/null +++ b/tools/sched_ext/README.md @@ -0,0 +1,258 @@ +SCHED_EXT EXAMPLE SCHEDULERS +============================ + +# Introduction + +This directory contains a number of example sched_ext schedulers. These +schedulers are meant to provide examples of different types of schedulers +that can be built using sched_ext, and illustrate how various features of +sched_ext can be used. + +Some of the examples are performant, production-ready schedulers. That is, for +the correct workload and with the correct tuning, they may be deployed in a +production environment with acceptable or possibly even improved performance. +Others are just examples that in practice, would not provide acceptable +performance (though they could be improved to get there). + +This README will describe these example schedulers, including describing the +types of workloads or scenarios they're designed to accommodate, and whether or +not they're production ready. For more details on any of these schedulers, +please see the header comment in their .bpf.c file. + + +# Compiling the examples + +There are a few toolchain dependencies for compiling the example schedulers. + +## Toolchain dependencies + +1. clang >= 16.0.0 + +The schedulers are BPF programs, and therefore must be compiled with clang. gcc +is actively working on adding a BPF backend compiler as well, but are still +missing some features such as BTF type tags which are necessary for using +kptrs. + +2. pahole >= 1.25 + +You may need pahole in order to generate BTF from DWARF. + +3. rust >= 1.70.0 + +Rust schedulers uses features present in the rust toolchain >= 1.70.0. You +should be able to use the stable build from rustup, but if that doesn't +work, try using the rustup nightly build. + +There are other requirements as well, such as make, but these are the main / +non-trivial ones. + +## Compiling the kernel + +In order to run a sched_ext scheduler, you'll have to run a kernel compiled +with the patches in this repository, and with a minimum set of necessary +Kconfig options: + +``` +CONFIG_BPF=y +CONFIG_SCHED_CLASS_EXT=y +CONFIG_BPF_SYSCALL=y +CONFIG_BPF_JIT=y +CONFIG_DEBUG_INFO_BTF=y +``` + +It's also recommended that you also include the following Kconfig options: + +``` +CONFIG_BPF_JIT_ALWAYS_ON=y +CONFIG_BPF_JIT_DEFAULT_ON=y +CONFIG_PAHOLE_HAS_SPLIT_BTF=y +CONFIG_PAHOLE_HAS_BTF_TAG=y +``` + +There is a `Kconfig` file in this directory whose contents you can append to +your local `.config` file, as long as there are no conflicts with any existing +options in the file. + +## Getting a vmlinux.h file + +You may notice that most of the example schedulers include a "vmlinux.h" file. +This is a large, auto-generated header file that contains all of the types +defined in some vmlinux binary that was compiled with +[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig +options specified above). + +The header file is created using `bpftool`, by passing it a vmlinux binary +compiled with BTF as follows: + +```bash +$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h +``` + +`bpftool` analyzes all of the BTF encodings in the binary, and produces a +header file that can be included by BPF programs to access those types. For +example, using vmlinux.h allows a scheduler to access fields defined directly +in vmlinux as follows: + +```c +#include "vmlinux.h" +// vmlinux.h is also implicitly included by scx_common.bpf.h. +#include "scx_common.bpf.h" + +/* + * vmlinux.h provides definitions for struct task_struct and + * struct scx_enable_args. + */ +void BPF_STRUCT_OPS(example_enable, struct task_struct *p, + struct scx_enable_args *args) +{ + bpf_printk("Task %s enabled in example scheduler", p->comm); +} + +// vmlinux.h provides the definition for struct sched_ext_ops. +SEC(".struct_ops.link") +struct sched_ext_ops example_ops { + .enable = (void *)example_enable, + .name = "example", +} +``` + +The scheduler build system will generate this vmlinux.h file as part of the +scheduler build pipeline. It looks for a vmlinux file in the following +dependency order: + +1. If the O= environment variable is defined, at `$O/vmlinux` +2. If the KBUILD_OUTPUT= environment variable is defined, at + `$KBUILD_OUTPUT/vmlinux` +3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're + compiling the schedulers) +3. `/sys/kernel/btf/vmlinux` +4. `/boot/vmlinux-$(uname -r)` + +In other words, if you have compiled a kernel in your local repo, its vmlinux +file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of +the kernel you're currently running on. This means that if you're running on a +kernel with sched_ext support, you may not need to compile a local kernel at +all. + +### Aside on CO-RE + +One of the cooler features of BPF is that it supports +[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run +Everywhere). This feature allows you to reference fields inside of structs with +types defined internal to the kernel, and not have to recompile if you load the +BPF program on a different kernel with the field at a different offset. In our +example above, we print out a task name with `p->comm`. CO-RE would perform +relocations for that access when the program is loaded to ensure that it's +referencing the correct offset for the currently running kernel. + +## Compiling the schedulers + +Once you have your toolchain setup, and a vmlinux that can be used to generate +a full vmlinux.h file, you can compile the schedulers using `make`: + +```bash +$ make -j($nproc) +``` + +# Example schedulers + +This directory contains the following example schedulers. These schedulers are +for testing and demonstrating different aspects of sched_ext. While some may be +useful in limited scenarios, they are not intended to be practical. + +For more scheduler implementations, tools and documentation, visit +https://github.com/sched-ext/scx. + +## scx_simple + +A simple scheduler that provides an example of a minimal sched_ext scheduler. +scx_simple can be run in either global weighted vtime mode, or FIFO mode. + +Though very simple, in limited scenarios, this scheduler can perform reasonably +well on single-socket systems with a unified L3 cache. + +## scx_qmap + +Another simple, yet slightly more complex scheduler that provides an example of +a basic weighted FIFO queuing policy. It also provides examples of some common +useful BPF features, such as sleepable per-task storage allocation in the +`ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to +enqueue tasks. It also illustrates how core-sched support could be implemented. + +## scx_central + +A "central" scheduler where scheduling decisions are made from a single CPU. +This scheduler illustrates how scheduling decisions can be dispatched from a +single CPU, allowing other cores to run with infinite slices, without timer +ticks, and without having to incur the overhead of making scheduling decisions. + +The approach demonstrated by this scheduler may be useful for any workload that +benefits from minimizing scheduling overhead and timer ticks. An example of +where this could be particularly useful is running VMs, where running with +infinite slices and no timer ticks allows the VM to avoid unnecessary expensive +vmexits. + + +# Troubleshooting + +There are a number of common issues that you may run into when building the +schedulers. We'll go over some of the common ones here. + +## Build Failures + +### Old version of clang + +``` +error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole + _Static_assert(SCX_DSQ_FLAG_BUILTIN, + ^~~~~~~~~~~~~~~~~~~~ +1 error generated. +``` + +This means you built the kernel or the schedulers with an older version of +clang than what's supported (i.e. older than 16.0.0). To remediate this: + +1. `which clang` to make sure you're using a sufficiently new version of clang. + +2. `make fullclean` in the root path of the repository, and rebuild the kernel + and schedulers. + +3. Rebuild the kernel, and then your example schedulers. + +The schedulers are also cleaned if you invoke `make mrproper` in the root +directory of the tree. + +### Stale kernel build / incomplete vmlinux.h file + +As described above, you'll need a `vmlinux.h` file that was generated from a +vmlinux built with BTF, and with sched_ext support enabled. If you don't, +you'll see errors such as the following which indicate that a type being +referenced in a scheduler is unknown: + +``` +/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info' + +const struct scx_exit_info *ei) + +^ +``` + +In order to resolve this, please follow the steps above in +[Getting a vmlinux.h file](#getting-a-vmlinuxh-file) in order to ensure your +schedulers are using a vmlinux.h file that includes the requisite types. + +## Misc + +### llvm: [OFF] + +You may see the following output when building the schedulers: + +``` +Auto-detecting system features: +... clang-bpf-co-re: [ on ] +... llvm: [ OFF ] +... libcap: [ on ] +... libbfd: [ on ] +``` + +Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore. -- 2.34.1
From: David Vernet <dvernet@meta.com> mainline inclusion from mainline-v6.12-rc1 commit a5db7817af780db6a7f290c79677eff4fd13c5fa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Add basic selftests. Signed-off-by: David Vernet <dvernet@meta.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/testing/selftests/sched_ext/.gitignore | 6 + tools/testing/selftests/sched_ext/Makefile | 218 ++++++++++++++++++ tools/testing/selftests/sched_ext/config | 9 + .../selftests/sched_ext/create_dsq.bpf.c | 58 +++++ .../testing/selftests/sched_ext/create_dsq.c | 57 +++++ .../sched_ext/ddsp_bogus_dsq_fail.bpf.c | 42 ++++ .../selftests/sched_ext/ddsp_bogus_dsq_fail.c | 57 +++++ .../sched_ext/ddsp_vtimelocal_fail.bpf.c | 39 ++++ .../sched_ext/ddsp_vtimelocal_fail.c | 56 +++++ .../selftests/sched_ext/dsp_local_on.bpf.c | 65 ++++++ .../selftests/sched_ext/dsp_local_on.c | 58 +++++ .../sched_ext/enq_last_no_enq_fails.bpf.c | 21 ++ .../sched_ext/enq_last_no_enq_fails.c | 60 +++++ .../sched_ext/enq_select_cpu_fails.bpf.c | 43 ++++ .../sched_ext/enq_select_cpu_fails.c | 61 +++++ tools/testing/selftests/sched_ext/exit.bpf.c | 84 +++++++ tools/testing/selftests/sched_ext/exit.c | 55 +++++ tools/testing/selftests/sched_ext/exit_test.h | 20 ++ .../testing/selftests/sched_ext/hotplug.bpf.c | 61 +++++ tools/testing/selftests/sched_ext/hotplug.c | 168 ++++++++++++++ .../selftests/sched_ext/hotplug_test.h | 15 ++ .../sched_ext/init_enable_count.bpf.c | 53 +++++ .../selftests/sched_ext/init_enable_count.c | 166 +++++++++++++ .../testing/selftests/sched_ext/maximal.bpf.c | 132 +++++++++++ tools/testing/selftests/sched_ext/maximal.c | 51 ++++ .../selftests/sched_ext/maybe_null.bpf.c | 36 +++ .../testing/selftests/sched_ext/maybe_null.c | 49 ++++ .../sched_ext/maybe_null_fail_dsp.bpf.c | 25 ++ .../sched_ext/maybe_null_fail_yld.bpf.c | 28 +++ .../testing/selftests/sched_ext/minimal.bpf.c | 21 ++ tools/testing/selftests/sched_ext/minimal.c | 58 +++++ .../selftests/sched_ext/prog_run.bpf.c | 32 +++ tools/testing/selftests/sched_ext/prog_run.c | 78 +++++++ .../testing/selftests/sched_ext/reload_loop.c | 75 ++++++ tools/testing/selftests/sched_ext/runner.c | 201 ++++++++++++++++ tools/testing/selftests/sched_ext/scx_test.h | 131 +++++++++++ .../selftests/sched_ext/select_cpu_dfl.bpf.c | 40 ++++ .../selftests/sched_ext/select_cpu_dfl.c | 72 ++++++ .../sched_ext/select_cpu_dfl_nodispatch.bpf.c | 89 +++++++ .../sched_ext/select_cpu_dfl_nodispatch.c | 72 ++++++ .../sched_ext/select_cpu_dispatch.bpf.c | 41 ++++ .../selftests/sched_ext/select_cpu_dispatch.c | 70 ++++++ .../select_cpu_dispatch_bad_dsq.bpf.c | 37 +++ .../sched_ext/select_cpu_dispatch_bad_dsq.c | 56 +++++ .../select_cpu_dispatch_dbl_dsp.bpf.c | 38 +++ .../sched_ext/select_cpu_dispatch_dbl_dsp.c | 56 +++++ .../sched_ext/select_cpu_vtime.bpf.c | 92 ++++++++ .../selftests/sched_ext/select_cpu_vtime.c | 59 +++++ .../selftests/sched_ext/test_example.c | 49 ++++ tools/testing/selftests/sched_ext/util.c | 71 ++++++ tools/testing/selftests/sched_ext/util.h | 13 ++ 51 files changed, 3244 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/.gitignore create mode 100644 tools/testing/selftests/sched_ext/Makefile create mode 100644 tools/testing/selftests/sched_ext/config create mode 100644 tools/testing/selftests/sched_ext/create_dsq.bpf.c create mode 100644 tools/testing/selftests/sched_ext/create_dsq.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c create mode 100644 tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c create mode 100644 tools/testing/selftests/sched_ext/dsp_local_on.bpf.c create mode 100644 tools/testing/selftests/sched_ext/dsp_local_on.c create mode 100644 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c create mode 100644 tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c create mode 100644 tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c create mode 100644 tools/testing/selftests/sched_ext/enq_select_cpu_fails.c create mode 100644 tools/testing/selftests/sched_ext/exit.bpf.c create mode 100644 tools/testing/selftests/sched_ext/exit.c create mode 100644 tools/testing/selftests/sched_ext/exit_test.h create mode 100644 tools/testing/selftests/sched_ext/hotplug.bpf.c create mode 100644 tools/testing/selftests/sched_ext/hotplug.c create mode 100644 tools/testing/selftests/sched_ext/hotplug_test.h create mode 100644 tools/testing/selftests/sched_ext/init_enable_count.bpf.c create mode 100644 tools/testing/selftests/sched_ext/init_enable_count.c create mode 100644 tools/testing/selftests/sched_ext/maximal.bpf.c create mode 100644 tools/testing/selftests/sched_ext/maximal.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null.bpf.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c create mode 100644 tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c create mode 100644 tools/testing/selftests/sched_ext/minimal.bpf.c create mode 100644 tools/testing/selftests/sched_ext/minimal.c create mode 100644 tools/testing/selftests/sched_ext/prog_run.bpf.c create mode 100644 tools/testing/selftests/sched_ext/prog_run.c create mode 100644 tools/testing/selftests/sched_ext/reload_loop.c create mode 100644 tools/testing/selftests/sched_ext/runner.c create mode 100644 tools/testing/selftests/sched_ext/scx_test.h create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c create mode 100644 tools/testing/selftests/sched_ext/select_cpu_vtime.c create mode 100644 tools/testing/selftests/sched_ext/test_example.c create mode 100644 tools/testing/selftests/sched_ext/util.c create mode 100644 tools/testing/selftests/sched_ext/util.h diff --git a/tools/testing/selftests/sched_ext/.gitignore b/tools/testing/selftests/sched_ext/.gitignore new file mode 100644 index 000000000000..ae5491a114c0 --- /dev/null +++ b/tools/testing/selftests/sched_ext/.gitignore @@ -0,0 +1,6 @@ +* +!*.c +!*.h +!Makefile +!.gitignore +!config diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile new file mode 100644 index 000000000000..0754a2c110a1 --- /dev/null +++ b/tools/testing/selftests/sched_ext/Makefile @@ -0,0 +1,218 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (c) 2022 Meta Platforms, Inc. and affiliates. +include ../../../build/Build.include +include ../../../scripts/Makefile.arch +include ../../../scripts/Makefile.include +include ../lib.mk + +ifneq ($(LLVM),) +ifneq ($(filter %/,$(LLVM)),) +LLVM_PREFIX := $(LLVM) +else ifneq ($(filter -%,$(LLVM)),) +LLVM_SUFFIX := $(LLVM) +endif + +CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as +else +CC := gcc +endif # LLVM + +ifneq ($(CROSS_COMPILE),) +$(error CROSS_COMPILE not supported for scx selftests) +endif # CROSS_COMPILE + +CURDIR := $(abspath .) +REPOROOT := $(abspath ../../../..) +TOOLSDIR := $(REPOROOT)/tools +LIBDIR := $(TOOLSDIR)/lib +BPFDIR := $(LIBDIR)/bpf +TOOLSINCDIR := $(TOOLSDIR)/include +BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool +APIDIR := $(TOOLSINCDIR)/uapi +GENDIR := $(REPOROOT)/include/generated +GENHDR := $(GENDIR)/autoconf.h +SCXTOOLSDIR := $(TOOLSDIR)/sched_ext +SCXTOOLSINCDIR := $(TOOLSDIR)/sched_ext/include + +OUTPUT_DIR := $(CURDIR)/build +OBJ_DIR := $(OUTPUT_DIR)/obj +INCLUDE_DIR := $(OUTPUT_DIR)/include +BPFOBJ_DIR := $(OBJ_DIR)/libbpf +SCXOBJ_DIR := $(OBJ_DIR)/sched_ext +BPFOBJ := $(BPFOBJ_DIR)/libbpf.a +LIBBPF_OUTPUT := $(OBJ_DIR)/libbpf/libbpf.a +DEFAULT_BPFTOOL := $(OUTPUT_DIR)/sbin/bpftool +HOST_BUILD_DIR := $(OBJ_DIR) +HOST_OUTPUT_DIR := $(OUTPUT_DIR) + +VMLINUX_BTF_PATHS ?= ../../../../vmlinux \ + /sys/kernel/btf/vmlinux \ + /boot/vmlinux-$(shell uname -r) +VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS)))) +ifeq ($(VMLINUX_BTF),) +$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)") +endif + +BPFTOOL ?= $(DEFAULT_BPFTOOL) + +ifneq ($(wildcard $(GENHDR)),) + GENFLAGS := -DHAVE_GENHDR +endif + +CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \ + -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \ + -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include -I$(SCXTOOLSINCDIR) + +# Silence some warnings when compiled with clang +ifneq ($(LLVM),) +CFLAGS += -Wno-unused-command-line-argument +endif + +LDFLAGS = -lelf -lz -lpthread -lzstd + +IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \ + grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__') + +# Get Clang's default includes on this system, as opposed to those seen by +# '-target bpf'. This fixes "missing" files on some architectures/distros, +# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc. +# +# Use '-idirafter': Don't interfere with include mechanics except where the +# build would have failed anyways. +define get_sys_includes +$(shell $(1) -v -E - </dev/null 2>&1 \ + | sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \ +$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}') +endef + +BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \ + $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \ + -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat \ + -I$(INCLUDE_DIR) -I$(APIDIR) -I$(SCXTOOLSINCDIR) \ + -I$(REPOROOT)/include \ + $(call get_sys_includes,$(CLANG)) \ + -Wall -Wno-compare-distinct-pointer-types \ + -Wno-incompatible-function-pointer-types \ + -O2 -mcpu=v3 + +# sort removes libbpf duplicates when not cross-building +MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(OBJ_DIR)/libbpf \ + $(OBJ_DIR)/bpftool $(OBJ_DIR)/resolve_btfids \ + $(INCLUDE_DIR) $(SCXOBJ_DIR)) + +$(MAKE_DIRS): + $(call msg,MKDIR,,$@) + $(Q)mkdir -p $@ + +$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \ + $(APIDIR)/linux/bpf.h \ + | $(OBJ_DIR)/libbpf + $(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/ \ + EXTRA_CFLAGS='-g -O0 -fPIC' \ + DESTDIR=$(OUTPUT_DIR) prefix= all install_headers + +$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \ + $(LIBBPF_OUTPUT) | $(OBJ_DIR)/bpftool + $(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \ + ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) \ + EXTRA_CFLAGS='-g -O0' \ + OUTPUT=$(OBJ_DIR)/bpftool/ \ + LIBBPF_OUTPUT=$(OBJ_DIR)/libbpf/ \ + LIBBPF_DESTDIR=$(OUTPUT_DIR)/ \ + prefix= DESTDIR=$(OUTPUT_DIR)/ install-bin + +$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR) +ifeq ($(VMLINUX_H),) + $(call msg,GEN,,$@) + $(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@ +else + $(call msg,CP,,$@) + $(Q)cp "$(VMLINUX_H)" $@ +endif + +$(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h | $(BPFOBJ) $(SCXOBJ_DIR) + $(call msg,CLNG-BPF,,$(notdir $@)) + $(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@ + +$(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL) | $(INCLUDE_DIR) + $(eval sched=$(notdir $@)) + $(call msg,GEN-SKEL,,$(sched)) + $(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $< + $(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o) + $(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o) + $(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o) + $(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@ + $(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h) + +################ +# C schedulers # +################ + +override define CLEAN + rm -rf $(OUTPUT_DIR) + rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h + rm -f $(TEST_GEN_PROGS) + rm -f runner +endef + +# Every testcase takes all of the BPF progs are dependencies by default. This +# allows testcases to load any BPF scheduler, which is useful for testcases +# that don't need their own prog to run their test. +all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubst %.c,%.skel.h,$(prog))) + +auto-test-targets := \ + create_dsq \ + enq_last_no_enq_fails \ + enq_select_cpu_fails \ + ddsp_bogus_dsq_fail \ + ddsp_vtimelocal_fail \ + dsp_local_on \ + exit \ + hotplug \ + init_enable_count \ + maximal \ + maybe_null \ + minimal \ + prog_run \ + reload_loop \ + select_cpu_dfl \ + select_cpu_dfl_nodispatch \ + select_cpu_dispatch \ + select_cpu_dispatch_bad_dsq \ + select_cpu_dispatch_dbl_dsp \ + select_cpu_vtime \ + test_example \ + +testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) + +$(SCXOBJ_DIR)/runner.o: runner.c | $(SCXOBJ_DIR) + $(CC) $(CFLAGS) -c $< -o $@ + +# Create all of the test targets object files, whose testcase objects will be +# registered into the runner in ELF constructors. +# +# Note that we must do double expansion here in order to support conditionally +# compiling BPF object files only if one is present, as the wildcard Make +# function doesn't support using implicit rules otherwise. +$(testcase-targets): $(SCXOBJ_DIR)/%.o: %.c $(SCXOBJ_DIR)/runner.o $(all_test_bpfprogs) | $(SCXOBJ_DIR) + $(eval test=$(patsubst %.o,%.c,$(notdir $@))) + $(CC) $(CFLAGS) -c $< -o $@ $(SCXOBJ_DIR)/runner.o + +$(SCXOBJ_DIR)/util.o: util.c | $(SCXOBJ_DIR) + $(CC) $(CFLAGS) -c $< -o $@ + +runner: $(SCXOBJ_DIR)/runner.o $(SCXOBJ_DIR)/util.o $(BPFOBJ) $(testcase-targets) + @echo "$(testcase-targets)" + $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS) + +TEST_GEN_PROGS := runner + +all: runner + +.PHONY: all clean help + +.DEFAULT_GOAL := all + +.DELETE_ON_ERROR: + +.SECONDARY: diff --git a/tools/testing/selftests/sched_ext/config b/tools/testing/selftests/sched_ext/config new file mode 100644 index 000000000000..0de9b4ee249d --- /dev/null +++ b/tools/testing/selftests/sched_ext/config @@ -0,0 +1,9 @@ +CONFIG_SCHED_DEBUG=y +CONFIG_SCHED_CLASS_EXT=y +CONFIG_CGROUPS=y +CONFIG_CGROUP_SCHED=y +CONFIG_EXT_GROUP_SCHED=y +CONFIG_BPF=y +CONFIG_BPF_SYSCALL=y +CONFIG_DEBUG_INFO=y +CONFIG_DEBUG_INFO_BTF=y diff --git a/tools/testing/selftests/sched_ext/create_dsq.bpf.c b/tools/testing/selftests/sched_ext/create_dsq.bpf.c new file mode 100644 index 000000000000..23f79ed343f0 --- /dev/null +++ b/tools/testing/selftests/sched_ext/create_dsq.bpf.c @@ -0,0 +1,58 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Create and destroy DSQs in a loop. + * + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +void BPF_STRUCT_OPS(create_dsq_exit_task, struct task_struct *p, + struct scx_exit_task_args *args) +{ + scx_bpf_destroy_dsq(p->pid); +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(create_dsq_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + s32 err; + + err = scx_bpf_create_dsq(p->pid, -1); + if (err) + scx_bpf_error("Failed to create DSQ for %s[%d]", + p->comm, p->pid); + + return err; +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(create_dsq_init) +{ + u32 i; + s32 err; + + bpf_for(i, 0, 1024) { + err = scx_bpf_create_dsq(i, -1); + if (err) { + scx_bpf_error("Failed to create DSQ %d", i); + return 0; + } + } + + bpf_for(i, 0, 1024) { + scx_bpf_destroy_dsq(i); + } + + return 0; +} + +SEC(".struct_ops.link") +struct sched_ext_ops create_dsq_ops = { + .init_task = create_dsq_init_task, + .exit_task = create_dsq_exit_task, + .init = create_dsq_init, + .name = "create_dsq", +}; diff --git a/tools/testing/selftests/sched_ext/create_dsq.c b/tools/testing/selftests/sched_ext/create_dsq.c new file mode 100644 index 000000000000..fa946d9146d4 --- /dev/null +++ b/tools/testing/selftests/sched_ext/create_dsq.c @@ -0,0 +1,57 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "create_dsq.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct create_dsq *skel; + + skel = create_dsq__open_and_load(); + if (!skel) { + SCX_ERR("Failed to open and load skel"); + return SCX_TEST_FAIL; + } + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct create_dsq *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.create_dsq_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + return SCX_TEST_FAIL; + } + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct create_dsq *skel = ctx; + + create_dsq__destroy(skel); +} + +struct scx_test create_dsq = { + .name = "create_dsq", + .description = "Create and destroy a dsq in a loop", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&create_dsq) diff --git a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c new file mode 100644 index 000000000000..e97ad41d354a --- /dev/null +++ b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c @@ -0,0 +1,42 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + */ +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +s32 BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); + + if (cpu >= 0) { + /* + * If we dispatch to a bogus DSQ that will fall back to the + * builtin global DSQ, we fail gracefully. + */ + scx_bpf_dispatch_vtime(p, 0xcafef00d, SCX_SLICE_DFL, + p->scx.dsq_vtime, 0); + return cpu; + } + + return prev_cpu; +} + +void BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops ddsp_bogus_dsq_fail_ops = { + .select_cpu = ddsp_bogus_dsq_fail_select_cpu, + .exit = ddsp_bogus_dsq_fail_exit, + .name = "ddsp_bogus_dsq_fail", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c new file mode 100644 index 000000000000..e65d22f23f3b --- /dev/null +++ b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c @@ -0,0 +1,57 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "ddsp_bogus_dsq_fail.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct ddsp_bogus_dsq_fail *skel; + + skel = ddsp_bogus_dsq_fail__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct ddsp_bogus_dsq_fail *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.ddsp_bogus_dsq_fail_ops); + SCX_FAIL_IF(!link, "Failed to attach struct_ops"); + + sleep(1); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR)); + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct ddsp_bogus_dsq_fail *skel = ctx; + + ddsp_bogus_dsq_fail__destroy(skel); +} + +struct scx_test ddsp_bogus_dsq_fail = { + .name = "ddsp_bogus_dsq_fail", + .description = "Verify we gracefully fail, and fall back to using a " + "built-in DSQ, if we do a direct dispatch to an invalid" + " DSQ in ops.select_cpu()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&ddsp_bogus_dsq_fail) diff --git a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c new file mode 100644 index 000000000000..dde7e7dafbfb --- /dev/null +++ b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c @@ -0,0 +1,39 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + */ +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +s32 BPF_STRUCT_OPS(ddsp_vtimelocal_fail_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); + + if (cpu >= 0) { + /* Shouldn't be allowed to vtime dispatch to a builtin DSQ. */ + scx_bpf_dispatch_vtime(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, + p->scx.dsq_vtime, 0); + return cpu; + } + + return prev_cpu; +} + +void BPF_STRUCT_OPS(ddsp_vtimelocal_fail_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops ddsp_vtimelocal_fail_ops = { + .select_cpu = ddsp_vtimelocal_fail_select_cpu, + .exit = ddsp_vtimelocal_fail_exit, + .name = "ddsp_vtimelocal_fail", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c new file mode 100644 index 000000000000..abafee587cd6 --- /dev/null +++ b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <unistd.h> +#include "ddsp_vtimelocal_fail.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct ddsp_vtimelocal_fail *skel; + + skel = ddsp_vtimelocal_fail__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct ddsp_vtimelocal_fail *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.ddsp_vtimelocal_fail_ops); + SCX_FAIL_IF(!link, "Failed to attach struct_ops"); + + sleep(1); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR)); + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct ddsp_vtimelocal_fail *skel = ctx; + + ddsp_vtimelocal_fail__destroy(skel); +} + +struct scx_test ddsp_vtimelocal_fail = { + .name = "ddsp_vtimelocal_fail", + .description = "Verify we gracefully fail, and fall back to using a " + "built-in DSQ, if we do a direct vtime dispatch to a " + "built-in DSQ from DSQ in ops.select_cpu()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&ddsp_vtimelocal_fail) diff --git a/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c b/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c new file mode 100644 index 000000000000..efb4672decb4 --- /dev/null +++ b/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; +const volatile s32 nr_cpus; + +UEI_DEFINE(uei); + +struct { + __uint(type, BPF_MAP_TYPE_QUEUE); + __uint(max_entries, 8192); + __type(value, s32); +} queue SEC(".maps"); + +s32 BPF_STRUCT_OPS(dsp_local_on_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + return prev_cpu; +} + +void BPF_STRUCT_OPS(dsp_local_on_enqueue, struct task_struct *p, + u64 enq_flags) +{ + s32 pid = p->pid; + + if (bpf_map_push_elem(&queue, &pid, 0)) + scx_bpf_error("Failed to enqueue %s[%d]", p->comm, p->pid); +} + +void BPF_STRUCT_OPS(dsp_local_on_dispatch, s32 cpu, struct task_struct *prev) +{ + s32 pid, target; + struct task_struct *p; + + if (bpf_map_pop_elem(&queue, &pid)) + return; + + p = bpf_task_from_pid(pid); + if (!p) + return; + + target = bpf_get_prandom_u32() % nr_cpus; + + scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_DFL, 0); + bpf_task_release(p); +} + +void BPF_STRUCT_OPS(dsp_local_on_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops dsp_local_on_ops = { + .select_cpu = dsp_local_on_select_cpu, + .enqueue = dsp_local_on_enqueue, + .dispatch = dsp_local_on_dispatch, + .exit = dsp_local_on_exit, + .name = "dsp_local_on", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/dsp_local_on.c b/tools/testing/selftests/sched_ext/dsp_local_on.c new file mode 100644 index 000000000000..472851b56854 --- /dev/null +++ b/tools/testing/selftests/sched_ext/dsp_local_on.c @@ -0,0 +1,58 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <unistd.h> +#include "dsp_local_on.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct dsp_local_on *skel; + + skel = dsp_local_on__open(); + SCX_FAIL_IF(!skel, "Failed to open"); + + skel->rodata->nr_cpus = libbpf_num_possible_cpus(); + SCX_FAIL_IF(dsp_local_on__load(skel), "Failed to load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct dsp_local_on *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.dsp_local_on_ops); + SCX_FAIL_IF(!link, "Failed to attach struct_ops"); + + /* Just sleeping is fine, plenty of scheduling events happening */ + sleep(1); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR)); + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct dsp_local_on *skel = ctx; + + dsp_local_on__destroy(skel); +} + +struct scx_test dsp_local_on = { + .name = "dsp_local_on", + .description = "Verify we can directly dispatch tasks to a local DSQs " + "from osp.dispatch()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&dsp_local_on) diff --git a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c new file mode 100644 index 000000000000..b0b99531d5d5 --- /dev/null +++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates the behavior of direct dispatching with a default + * select_cpu implementation. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +SEC(".struct_ops.link") +struct sched_ext_ops enq_last_no_enq_fails_ops = { + .name = "enq_last_no_enq_fails", + /* Need to define ops.enqueue() with SCX_OPS_ENQ_LAST */ + .flags = SCX_OPS_ENQ_LAST, + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c new file mode 100644 index 000000000000..2a3eda5e2c0b --- /dev/null +++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c @@ -0,0 +1,60 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "enq_last_no_enq_fails.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct enq_last_no_enq_fails *skel; + + skel = enq_last_no_enq_fails__open_and_load(); + if (!skel) { + SCX_ERR("Failed to open and load skel"); + return SCX_TEST_FAIL; + } + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct enq_last_no_enq_fails *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.enq_last_no_enq_fails_ops); + if (link) { + SCX_ERR("Incorrectly succeeded in to attaching scheduler"); + return SCX_TEST_FAIL; + } + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct enq_last_no_enq_fails *skel = ctx; + + enq_last_no_enq_fails__destroy(skel); +} + +struct scx_test enq_last_no_enq_fails = { + .name = "enq_last_no_enq_fails", + .description = "Verify we fail to load a scheduler if we specify " + "the SCX_OPS_ENQ_LAST flag without defining " + "ops.enqueue()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&enq_last_no_enq_fails) diff --git a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c new file mode 100644 index 000000000000..b3dfc1033cd6 --- /dev/null +++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c @@ -0,0 +1,43 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +/* Manually specify the signature until the kfunc is added to the scx repo. */ +s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, + bool *found) __ksym; + +s32 BPF_STRUCT_OPS(enq_select_cpu_fails_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + return prev_cpu; +} + +void BPF_STRUCT_OPS(enq_select_cpu_fails_enqueue, struct task_struct *p, + u64 enq_flags) +{ + /* + * Need to initialize the variable or the verifier will fail to load. + * Improving these semantics is actively being worked on. + */ + bool found = false; + + /* Can only call from ops.select_cpu() */ + scx_bpf_select_cpu_dfl(p, 0, 0, &found); + + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); +} + +SEC(".struct_ops.link") +struct sched_ext_ops enq_select_cpu_fails_ops = { + .select_cpu = enq_select_cpu_fails_select_cpu, + .enqueue = enq_select_cpu_fails_enqueue, + .name = "enq_select_cpu_fails", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c new file mode 100644 index 000000000000..dd1350e5f002 --- /dev/null +++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c @@ -0,0 +1,61 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "enq_select_cpu_fails.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct enq_select_cpu_fails *skel; + + skel = enq_select_cpu_fails__open_and_load(); + if (!skel) { + SCX_ERR("Failed to open and load skel"); + return SCX_TEST_FAIL; + } + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct enq_select_cpu_fails *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.enq_select_cpu_fails_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + return SCX_TEST_FAIL; + } + + sleep(1); + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct enq_select_cpu_fails *skel = ctx; + + enq_select_cpu_fails__destroy(skel); +} + +struct scx_test enq_select_cpu_fails = { + .name = "enq_select_cpu_fails", + .description = "Verify we fail to call scx_bpf_select_cpu_dfl() " + "from ops.enqueue()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&enq_select_cpu_fails) diff --git a/tools/testing/selftests/sched_ext/exit.bpf.c b/tools/testing/selftests/sched_ext/exit.bpf.c new file mode 100644 index 000000000000..ae12ddaac921 --- /dev/null +++ b/tools/testing/selftests/sched_ext/exit.bpf.c @@ -0,0 +1,84 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +#include "exit_test.h" + +const volatile int exit_point; +UEI_DEFINE(uei); + +#define EXIT_CLEANLY() scx_bpf_exit(exit_point, "%d", exit_point) + +s32 BPF_STRUCT_OPS(exit_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + bool found; + + if (exit_point == EXIT_SELECT_CPU) + EXIT_CLEANLY(); + + return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &found); +} + +void BPF_STRUCT_OPS(exit_enqueue, struct task_struct *p, u64 enq_flags) +{ + if (exit_point == EXIT_ENQUEUE) + EXIT_CLEANLY(); + + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); +} + +void BPF_STRUCT_OPS(exit_dispatch, s32 cpu, struct task_struct *p) +{ + if (exit_point == EXIT_DISPATCH) + EXIT_CLEANLY(); + + scx_bpf_consume(SCX_DSQ_GLOBAL); +} + +void BPF_STRUCT_OPS(exit_enable, struct task_struct *p) +{ + if (exit_point == EXIT_ENABLE) + EXIT_CLEANLY(); +} + +s32 BPF_STRUCT_OPS(exit_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + if (exit_point == EXIT_INIT_TASK) + EXIT_CLEANLY(); + + return 0; +} + +void BPF_STRUCT_OPS(exit_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(exit_init) +{ + if (exit_point == EXIT_INIT) + EXIT_CLEANLY(); + + return 0; +} + +SEC(".struct_ops.link") +struct sched_ext_ops exit_ops = { + .select_cpu = exit_select_cpu, + .enqueue = exit_enqueue, + .dispatch = exit_dispatch, + .init_task = exit_init_task, + .enable = exit_enable, + .exit = exit_exit, + .init = exit_init, + .name = "exit", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/exit.c b/tools/testing/selftests/sched_ext/exit.c new file mode 100644 index 000000000000..31bcd06e21cd --- /dev/null +++ b/tools/testing/selftests/sched_ext/exit.c @@ -0,0 +1,55 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <sched.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "exit.bpf.skel.h" +#include "scx_test.h" + +#include "exit_test.h" + +static enum scx_test_status run(void *ctx) +{ + enum exit_test_case tc; + + for (tc = 0; tc < NUM_EXITS; tc++) { + struct exit *skel; + struct bpf_link *link; + char buf[16]; + + skel = exit__open(); + skel->rodata->exit_point = tc; + exit__load(skel); + link = bpf_map__attach_struct_ops(skel->maps.exit_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + exit__destroy(skel); + return SCX_TEST_FAIL; + } + + /* Assumes uei.kind is written last */ + while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE)) + sched_yield(); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF)); + SCX_EQ(skel->data->uei.exit_code, tc); + sprintf(buf, "%d", tc); + SCX_ASSERT(!strcmp(skel->data->uei.msg, buf)); + bpf_link__destroy(link); + exit__destroy(skel); + } + + return SCX_TEST_PASS; +} + +struct scx_test exit_test = { + .name = "exit", + .description = "Verify we can cleanly exit a scheduler in multiple places", + .run = run, +}; +REGISTER_SCX_TEST(&exit_test) diff --git a/tools/testing/selftests/sched_ext/exit_test.h b/tools/testing/selftests/sched_ext/exit_test.h new file mode 100644 index 000000000000..94f0268b9cb8 --- /dev/null +++ b/tools/testing/selftests/sched_ext/exit_test.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ + +#ifndef __EXIT_TEST_H__ +#define __EXIT_TEST_H__ + +enum exit_test_case { + EXIT_SELECT_CPU, + EXIT_ENQUEUE, + EXIT_DISPATCH, + EXIT_ENABLE, + EXIT_INIT_TASK, + EXIT_INIT, + NUM_EXITS, +}; + +#endif // # __EXIT_TEST_H__ diff --git a/tools/testing/selftests/sched_ext/hotplug.bpf.c b/tools/testing/selftests/sched_ext/hotplug.bpf.c new file mode 100644 index 000000000000..8f2601db39f3 --- /dev/null +++ b/tools/testing/selftests/sched_ext/hotplug.bpf.c @@ -0,0 +1,61 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +#include "hotplug_test.h" + +UEI_DEFINE(uei); + +void BPF_STRUCT_OPS(hotplug_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +static void exit_from_hotplug(s32 cpu, bool onlining) +{ + /* + * Ignored, just used to verify that we can invoke blocking kfuncs + * from the hotplug path. + */ + scx_bpf_create_dsq(0, -1); + + s64 code = SCX_ECODE_ACT_RESTART | HOTPLUG_EXIT_RSN; + + if (onlining) + code |= HOTPLUG_ONLINING; + + scx_bpf_exit(code, "hotplug event detected (%d going %s)", cpu, + onlining ? "online" : "offline"); +} + +void BPF_STRUCT_OPS_SLEEPABLE(hotplug_cpu_online, s32 cpu) +{ + exit_from_hotplug(cpu, true); +} + +void BPF_STRUCT_OPS_SLEEPABLE(hotplug_cpu_offline, s32 cpu) +{ + exit_from_hotplug(cpu, false); +} + +SEC(".struct_ops.link") +struct sched_ext_ops hotplug_cb_ops = { + .cpu_online = hotplug_cpu_online, + .cpu_offline = hotplug_cpu_offline, + .exit = hotplug_exit, + .name = "hotplug_cbs", + .timeout_ms = 1000U, +}; + +SEC(".struct_ops.link") +struct sched_ext_ops hotplug_nocb_ops = { + .exit = hotplug_exit, + .name = "hotplug_nocbs", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/hotplug.c b/tools/testing/selftests/sched_ext/hotplug.c new file mode 100644 index 000000000000..87bf220b1bce --- /dev/null +++ b/tools/testing/selftests/sched_ext/hotplug.c @@ -0,0 +1,168 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <sched.h> +#include <scx/common.h> +#include <sched.h> +#include <sys/wait.h> +#include <unistd.h> + +#include "hotplug_test.h" +#include "hotplug.bpf.skel.h" +#include "scx_test.h" +#include "util.h" + +const char *online_path = "/sys/devices/system/cpu/cpu1/online"; + +static bool is_cpu_online(void) +{ + return file_read_long(online_path) > 0; +} + +static void toggle_online_status(bool online) +{ + long val = online ? 1 : 0; + int ret; + + ret = file_write_long(online_path, val); + if (ret != 0) + fprintf(stderr, "Failed to bring CPU %s (%s)", + online ? "online" : "offline", strerror(errno)); +} + +static enum scx_test_status setup(void **ctx) +{ + if (!is_cpu_online()) + return SCX_TEST_SKIP; + + return SCX_TEST_PASS; +} + +static enum scx_test_status test_hotplug(bool onlining, bool cbs_defined) +{ + struct hotplug *skel; + struct bpf_link *link; + long kind, code; + + SCX_ASSERT(is_cpu_online()); + + skel = hotplug__open_and_load(); + SCX_ASSERT(skel); + + /* Testing the offline -> online path, so go offline before starting */ + if (onlining) + toggle_online_status(0); + + if (cbs_defined) { + kind = SCX_KIND_VAL(SCX_EXIT_UNREG_BPF); + code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) | HOTPLUG_EXIT_RSN; + if (onlining) + code |= HOTPLUG_ONLINING; + } else { + kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN); + code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) | + SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG); + } + + if (cbs_defined) + link = bpf_map__attach_struct_ops(skel->maps.hotplug_cb_ops); + else + link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops); + + if (!link) { + SCX_ERR("Failed to attach scheduler"); + hotplug__destroy(skel); + return SCX_TEST_FAIL; + } + + toggle_online_status(onlining ? 1 : 0); + + while (!UEI_EXITED(skel, uei)) + sched_yield(); + + SCX_EQ(skel->data->uei.kind, kind); + SCX_EQ(UEI_REPORT(skel, uei), code); + + if (!onlining) + toggle_online_status(1); + + bpf_link__destroy(link); + hotplug__destroy(skel); + + return SCX_TEST_PASS; +} + +static enum scx_test_status test_hotplug_attach(void) +{ + struct hotplug *skel; + struct bpf_link *link; + enum scx_test_status status = SCX_TEST_PASS; + long kind, code; + + SCX_ASSERT(is_cpu_online()); + SCX_ASSERT(scx_hotplug_seq() > 0); + + skel = SCX_OPS_OPEN(hotplug_nocb_ops, hotplug); + SCX_ASSERT(skel); + + SCX_OPS_LOAD(skel, hotplug_nocb_ops, hotplug, uei); + + /* + * Take the CPU offline to increment the global hotplug seq, which + * should cause attach to fail due to us setting the hotplug seq above + */ + toggle_online_status(0); + link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops); + + toggle_online_status(1); + + SCX_ASSERT(link); + while (!UEI_EXITED(skel, uei)) + sched_yield(); + + kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN); + code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) | + SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG); + SCX_EQ(skel->data->uei.kind, kind); + SCX_EQ(UEI_REPORT(skel, uei), code); + + bpf_link__destroy(link); + hotplug__destroy(skel); + + return status; +} + +static enum scx_test_status run(void *ctx) +{ + +#define HP_TEST(__onlining, __cbs_defined) ({ \ + if (test_hotplug(__onlining, __cbs_defined) != SCX_TEST_PASS) \ + return SCX_TEST_FAIL; \ +}) + + HP_TEST(true, true); + HP_TEST(false, true); + HP_TEST(true, false); + HP_TEST(false, false); + +#undef HP_TEST + + return test_hotplug_attach(); +} + +static void cleanup(void *ctx) +{ + toggle_online_status(1); +} + +struct scx_test hotplug_test = { + .name = "hotplug", + .description = "Verify hotplug behavior", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&hotplug_test) diff --git a/tools/testing/selftests/sched_ext/hotplug_test.h b/tools/testing/selftests/sched_ext/hotplug_test.h new file mode 100644 index 000000000000..73d236f90787 --- /dev/null +++ b/tools/testing/selftests/sched_ext/hotplug_test.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ + +#ifndef __HOTPLUG_TEST_H__ +#define __HOTPLUG_TEST_H__ + +enum hotplug_test_flags { + HOTPLUG_EXIT_RSN = 1LLU << 0, + HOTPLUG_ONLINING = 1LLU << 1, +}; + +#endif // # __HOTPLUG_TEST_H__ diff --git a/tools/testing/selftests/sched_ext/init_enable_count.bpf.c b/tools/testing/selftests/sched_ext/init_enable_count.bpf.c new file mode 100644 index 000000000000..47ea89a626c3 --- /dev/null +++ b/tools/testing/selftests/sched_ext/init_enable_count.bpf.c @@ -0,0 +1,53 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that verifies that we do proper counting of init, enable, etc + * callbacks. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +u64 init_task_cnt, exit_task_cnt, enable_cnt, disable_cnt; +u64 init_fork_cnt, init_transition_cnt; + +s32 BPF_STRUCT_OPS_SLEEPABLE(cnt_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + __sync_fetch_and_add(&init_task_cnt, 1); + + if (args->fork) + __sync_fetch_and_add(&init_fork_cnt, 1); + else + __sync_fetch_and_add(&init_transition_cnt, 1); + + return 0; +} + +void BPF_STRUCT_OPS(cnt_exit_task, struct task_struct *p) +{ + __sync_fetch_and_add(&exit_task_cnt, 1); +} + +void BPF_STRUCT_OPS(cnt_enable, struct task_struct *p) +{ + __sync_fetch_and_add(&enable_cnt, 1); +} + +void BPF_STRUCT_OPS(cnt_disable, struct task_struct *p) +{ + __sync_fetch_and_add(&disable_cnt, 1); +} + +SEC(".struct_ops.link") +struct sched_ext_ops init_enable_count_ops = { + .init_task = cnt_init_task, + .exit_task = cnt_exit_task, + .enable = cnt_enable, + .disable = cnt_disable, + .name = "init_enable_count", +}; diff --git a/tools/testing/selftests/sched_ext/init_enable_count.c b/tools/testing/selftests/sched_ext/init_enable_count.c new file mode 100644 index 000000000000..97d45f1e5597 --- /dev/null +++ b/tools/testing/selftests/sched_ext/init_enable_count.c @@ -0,0 +1,166 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <stdio.h> +#include <unistd.h> +#include <sched.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include "scx_test.h" +#include "init_enable_count.bpf.skel.h" + +#define SCHED_EXT 7 + +static struct init_enable_count * +open_load_prog(bool global) +{ + struct init_enable_count *skel; + + skel = init_enable_count__open(); + SCX_BUG_ON(!skel, "Failed to open skel"); + + if (!global) + skel->struct_ops.init_enable_count_ops->flags |= SCX_OPS_SWITCH_PARTIAL; + + SCX_BUG_ON(init_enable_count__load(skel), "Failed to load skel"); + + return skel; +} + +static enum scx_test_status run_test(bool global) +{ + struct init_enable_count *skel; + struct bpf_link *link; + const u32 num_children = 5, num_pre_forks = 1024; + int ret, i, status; + struct sched_param param = {}; + pid_t pids[num_pre_forks]; + + skel = open_load_prog(global); + + /* + * Fork a bunch of children before we attach the scheduler so that we + * ensure (at least in practical terms) that there are more tasks that + * transition from SCHED_OTHER -> SCHED_EXT than there are tasks that + * take the fork() path either below or in other processes. + */ + for (i = 0; i < num_pre_forks; i++) { + pids[i] = fork(); + SCX_FAIL_IF(pids[i] < 0, "Failed to fork child"); + if (pids[i] == 0) { + sleep(1); + exit(0); + } + } + + link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops); + SCX_FAIL_IF(!link, "Failed to attach struct_ops"); + + for (i = 0; i < num_pre_forks; i++) { + SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i], + "Failed to wait for pre-forked child\n"); + + SCX_FAIL_IF(status != 0, "Pre-forked child %d exited with status %d\n", i, + status); + } + + bpf_link__destroy(link); + SCX_GE(skel->bss->init_task_cnt, num_pre_forks); + SCX_GE(skel->bss->exit_task_cnt, num_pre_forks); + + link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops); + SCX_FAIL_IF(!link, "Failed to attach struct_ops"); + + /* SCHED_EXT children */ + for (i = 0; i < num_children; i++) { + pids[i] = fork(); + SCX_FAIL_IF(pids[i] < 0, "Failed to fork child"); + + if (pids[i] == 0) { + ret = sched_setscheduler(0, SCHED_EXT, ¶m); + SCX_BUG_ON(ret, "Failed to set sched to sched_ext"); + + /* + * Reset to SCHED_OTHER for half of them. Counts for + * everything should still be the same regardless, as + * ops.disable() is invoked even if a task is still on + * SCHED_EXT before it exits. + */ + if (i % 2 == 0) { + ret = sched_setscheduler(0, SCHED_OTHER, ¶m); + SCX_BUG_ON(ret, "Failed to reset sched to normal"); + } + exit(0); + } + } + for (i = 0; i < num_children; i++) { + SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i], + "Failed to wait for SCX child\n"); + + SCX_FAIL_IF(status != 0, "SCX child %d exited with status %d\n", i, + status); + } + + /* SCHED_OTHER children */ + for (i = 0; i < num_children; i++) { + pids[i] = fork(); + if (pids[i] == 0) + exit(0); + } + + for (i = 0; i < num_children; i++) { + SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i], + "Failed to wait for normal child\n"); + + SCX_FAIL_IF(status != 0, "Normal child %d exited with status %d\n", i, + status); + } + + bpf_link__destroy(link); + + SCX_GE(skel->bss->init_task_cnt, 2 * num_children); + SCX_GE(skel->bss->exit_task_cnt, 2 * num_children); + + if (global) { + SCX_GE(skel->bss->enable_cnt, 2 * num_children); + SCX_GE(skel->bss->disable_cnt, 2 * num_children); + } else { + SCX_EQ(skel->bss->enable_cnt, num_children); + SCX_EQ(skel->bss->disable_cnt, num_children); + } + /* + * We forked a ton of tasks before we attached the scheduler above, so + * this should be fine. Technically it could be flaky if a ton of forks + * are happening at the same time in other processes, but that should + * be exceedingly unlikely. + */ + SCX_GT(skel->bss->init_transition_cnt, skel->bss->init_fork_cnt); + SCX_GE(skel->bss->init_fork_cnt, 2 * num_children); + + init_enable_count__destroy(skel); + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + enum scx_test_status status; + + status = run_test(true); + if (status != SCX_TEST_PASS) + return status; + + return run_test(false); +} + +struct scx_test init_enable_count = { + .name = "init_enable_count", + .description = "Verify we do the correct amount of counting of init, " + "enable, etc callbacks.", + .run = run, +}; +REGISTER_SCX_TEST(&init_enable_count) diff --git a/tools/testing/selftests/sched_ext/maximal.bpf.c b/tools/testing/selftests/sched_ext/maximal.bpf.c new file mode 100644 index 000000000000..44612fdaf399 --- /dev/null +++ b/tools/testing/selftests/sched_ext/maximal.bpf.c @@ -0,0 +1,132 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler with every callback defined. + * + * This scheduler defines every callback. + * + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +s32 BPF_STRUCT_OPS(maximal_select_cpu, struct task_struct *p, s32 prev_cpu, + u64 wake_flags) +{ + return prev_cpu; +} + +void BPF_STRUCT_OPS(maximal_enqueue, struct task_struct *p, u64 enq_flags) +{ + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); +} + +void BPF_STRUCT_OPS(maximal_dequeue, struct task_struct *p, u64 deq_flags) +{} + +void BPF_STRUCT_OPS(maximal_dispatch, s32 cpu, struct task_struct *prev) +{ + scx_bpf_consume(SCX_DSQ_GLOBAL); +} + +void BPF_STRUCT_OPS(maximal_runnable, struct task_struct *p, u64 enq_flags) +{} + +void BPF_STRUCT_OPS(maximal_running, struct task_struct *p) +{} + +void BPF_STRUCT_OPS(maximal_stopping, struct task_struct *p, bool runnable) +{} + +void BPF_STRUCT_OPS(maximal_quiescent, struct task_struct *p, u64 deq_flags) +{} + +bool BPF_STRUCT_OPS(maximal_yield, struct task_struct *from, + struct task_struct *to) +{ + return false; +} + +bool BPF_STRUCT_OPS(maximal_core_sched_before, struct task_struct *a, + struct task_struct *b) +{ + return false; +} + +void BPF_STRUCT_OPS(maximal_set_weight, struct task_struct *p, u32 weight) +{} + +void BPF_STRUCT_OPS(maximal_set_cpumask, struct task_struct *p, + const struct cpumask *cpumask) +{} + +void BPF_STRUCT_OPS(maximal_update_idle, s32 cpu, bool idle) +{} + +void BPF_STRUCT_OPS(maximal_cpu_acquire, s32 cpu, + struct scx_cpu_acquire_args *args) +{} + +void BPF_STRUCT_OPS(maximal_cpu_release, s32 cpu, + struct scx_cpu_release_args *args) +{} + +void BPF_STRUCT_OPS(maximal_cpu_online, s32 cpu) +{} + +void BPF_STRUCT_OPS(maximal_cpu_offline, s32 cpu) +{} + +s32 BPF_STRUCT_OPS(maximal_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + return 0; +} + +void BPF_STRUCT_OPS(maximal_enable, struct task_struct *p) +{} + +void BPF_STRUCT_OPS(maximal_exit_task, struct task_struct *p, + struct scx_exit_task_args *args) +{} + +void BPF_STRUCT_OPS(maximal_disable, struct task_struct *p) +{} + +s32 BPF_STRUCT_OPS_SLEEPABLE(maximal_init) +{ + return 0; +} + +void BPF_STRUCT_OPS(maximal_exit, struct scx_exit_info *info) +{} + +SEC(".struct_ops.link") +struct sched_ext_ops maximal_ops = { + .select_cpu = maximal_select_cpu, + .enqueue = maximal_enqueue, + .dequeue = maximal_dequeue, + .dispatch = maximal_dispatch, + .runnable = maximal_runnable, + .running = maximal_running, + .stopping = maximal_stopping, + .quiescent = maximal_quiescent, + .yield = maximal_yield, + .core_sched_before = maximal_core_sched_before, + .set_weight = maximal_set_weight, + .set_cpumask = maximal_set_cpumask, + .update_idle = maximal_update_idle, + .cpu_acquire = maximal_cpu_acquire, + .cpu_release = maximal_cpu_release, + .cpu_online = maximal_cpu_online, + .cpu_offline = maximal_cpu_offline, + .init_task = maximal_init_task, + .enable = maximal_enable, + .exit_task = maximal_exit_task, + .disable = maximal_disable, + .init = maximal_init, + .exit = maximal_exit, + .name = "maximal", +}; diff --git a/tools/testing/selftests/sched_ext/maximal.c b/tools/testing/selftests/sched_ext/maximal.c new file mode 100644 index 000000000000..f38fc973c380 --- /dev/null +++ b/tools/testing/selftests/sched_ext/maximal.c @@ -0,0 +1,51 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "maximal.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct maximal *skel; + + skel = maximal__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct maximal *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.maximal_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct maximal *skel = ctx; + + maximal__destroy(skel); +} + +struct scx_test maximal = { + .name = "maximal", + .description = "Verify we can load a scheduler with every callback defined", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&maximal) diff --git a/tools/testing/selftests/sched_ext/maybe_null.bpf.c b/tools/testing/selftests/sched_ext/maybe_null.bpf.c new file mode 100644 index 000000000000..27d0f386acfb --- /dev/null +++ b/tools/testing/selftests/sched_ext/maybe_null.bpf.c @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +u64 vtime_test; + +void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p) +{} + +void BPF_STRUCT_OPS(maybe_null_success_dispatch, s32 cpu, struct task_struct *p) +{ + if (p != NULL) + vtime_test = p->scx.dsq_vtime; +} + +bool BPF_STRUCT_OPS(maybe_null_success_yield, struct task_struct *from, + struct task_struct *to) +{ + if (to) + bpf_printk("Yielding to %s[%d]", to->comm, to->pid); + + return false; +} + +SEC(".struct_ops.link") +struct sched_ext_ops maybe_null_success = { + .dispatch = maybe_null_success_dispatch, + .yield = maybe_null_success_yield, + .enable = maybe_null_running, + .name = "minimal", +}; diff --git a/tools/testing/selftests/sched_ext/maybe_null.c b/tools/testing/selftests/sched_ext/maybe_null.c new file mode 100644 index 000000000000..31cfafb0cf65 --- /dev/null +++ b/tools/testing/selftests/sched_ext/maybe_null.c @@ -0,0 +1,49 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "maybe_null.bpf.skel.h" +#include "maybe_null_fail_dsp.bpf.skel.h" +#include "maybe_null_fail_yld.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status run(void *ctx) +{ + struct maybe_null *skel; + struct maybe_null_fail_dsp *fail_dsp; + struct maybe_null_fail_yld *fail_yld; + + skel = maybe_null__open_and_load(); + if (!skel) { + SCX_ERR("Failed to open and load maybe_null skel"); + return SCX_TEST_FAIL; + } + maybe_null__destroy(skel); + + fail_dsp = maybe_null_fail_dsp__open_and_load(); + if (fail_dsp) { + maybe_null_fail_dsp__destroy(fail_dsp); + SCX_ERR("Should failed to open and load maybe_null_fail_dsp skel"); + return SCX_TEST_FAIL; + } + + fail_yld = maybe_null_fail_yld__open_and_load(); + if (fail_yld) { + maybe_null_fail_yld__destroy(fail_yld); + SCX_ERR("Should failed to open and load maybe_null_fail_yld skel"); + return SCX_TEST_FAIL; + } + + return SCX_TEST_PASS; +} + +struct scx_test maybe_null = { + .name = "maybe_null", + .description = "Verify if PTR_MAYBE_NULL work for .dispatch", + .run = run, +}; +REGISTER_SCX_TEST(&maybe_null) diff --git a/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c b/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c new file mode 100644 index 000000000000..c0641050271d --- /dev/null +++ b/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +u64 vtime_test; + +void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p) +{} + +void BPF_STRUCT_OPS(maybe_null_fail_dispatch, s32 cpu, struct task_struct *p) +{ + vtime_test = p->scx.dsq_vtime; +} + +SEC(".struct_ops.link") +struct sched_ext_ops maybe_null_fail = { + .dispatch = maybe_null_fail_dispatch, + .enable = maybe_null_running, + .name = "maybe_null_fail_dispatch", +}; diff --git a/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c b/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c new file mode 100644 index 000000000000..3c1740028e3b --- /dev/null +++ b/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c @@ -0,0 +1,28 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +u64 vtime_test; + +void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p) +{} + +bool BPF_STRUCT_OPS(maybe_null_fail_yield, struct task_struct *from, + struct task_struct *to) +{ + bpf_printk("Yielding to %s[%d]", to->comm, to->pid); + + return false; +} + +SEC(".struct_ops.link") +struct sched_ext_ops maybe_null_fail = { + .yield = maybe_null_fail_yield, + .enable = maybe_null_running, + .name = "maybe_null_fail_yield", +}; diff --git a/tools/testing/selftests/sched_ext/minimal.bpf.c b/tools/testing/selftests/sched_ext/minimal.bpf.c new file mode 100644 index 000000000000..6a7eccef0104 --- /dev/null +++ b/tools/testing/selftests/sched_ext/minimal.bpf.c @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A completely minimal scheduler. + * + * This scheduler defines the absolute minimal set of struct sched_ext_ops + * fields: its name. It should _not_ fail to be loaded, and can be used to + * exercise the default scheduling paths in ext.c. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +SEC(".struct_ops.link") +struct sched_ext_ops minimal_ops = { + .name = "minimal", +}; diff --git a/tools/testing/selftests/sched_ext/minimal.c b/tools/testing/selftests/sched_ext/minimal.c new file mode 100644 index 000000000000..6c5db8ebbf8a --- /dev/null +++ b/tools/testing/selftests/sched_ext/minimal.c @@ -0,0 +1,58 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "minimal.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct minimal *skel; + + skel = minimal__open_and_load(); + if (!skel) { + SCX_ERR("Failed to open and load skel"); + return SCX_TEST_FAIL; + } + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct minimal *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.minimal_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + return SCX_TEST_FAIL; + } + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct minimal *skel = ctx; + + minimal__destroy(skel); +} + +struct scx_test minimal = { + .name = "minimal", + .description = "Verify we can load a fully minimal scheduler", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&minimal) diff --git a/tools/testing/selftests/sched_ext/prog_run.bpf.c b/tools/testing/selftests/sched_ext/prog_run.bpf.c new file mode 100644 index 000000000000..fd2c8f12af16 --- /dev/null +++ b/tools/testing/selftests/sched_ext/prog_run.bpf.c @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates that we can invoke sched_ext kfuncs in + * BPF_PROG_TYPE_SYSCALL programs. + * + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ + +#include <scx/common.bpf.h> + +UEI_DEFINE(uei); + +char _license[] SEC("license") = "GPL"; + +SEC("syscall") +int BPF_PROG(prog_run_syscall) +{ + scx_bpf_exit(0xdeadbeef, "Exited from PROG_RUN"); + return 0; +} + +void BPF_STRUCT_OPS(prog_run_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops prog_run_ops = { + .exit = prog_run_exit, + .name = "prog_run", +}; diff --git a/tools/testing/selftests/sched_ext/prog_run.c b/tools/testing/selftests/sched_ext/prog_run.c new file mode 100644 index 000000000000..3cd57ef8daaa --- /dev/null +++ b/tools/testing/selftests/sched_ext/prog_run.c @@ -0,0 +1,78 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <sched.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "prog_run.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct prog_run *skel; + + skel = prog_run__open_and_load(); + if (!skel) { + SCX_ERR("Failed to open and load skel"); + return SCX_TEST_FAIL; + } + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct prog_run *skel = ctx; + struct bpf_link *link; + int prog_fd, err = 0; + + prog_fd = bpf_program__fd(skel->progs.prog_run_syscall); + if (prog_fd < 0) { + SCX_ERR("Failed to get BPF_PROG_RUN prog"); + return SCX_TEST_FAIL; + } + + LIBBPF_OPTS(bpf_test_run_opts, topts); + + link = bpf_map__attach_struct_ops(skel->maps.prog_run_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + close(prog_fd); + return SCX_TEST_FAIL; + } + + err = bpf_prog_test_run_opts(prog_fd, &topts); + SCX_EQ(err, 0); + + /* Assumes uei.kind is written last */ + while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE)) + sched_yield(); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF)); + SCX_EQ(skel->data->uei.exit_code, 0xdeadbeef); + close(prog_fd); + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct prog_run *skel = ctx; + + prog_run__destroy(skel); +} + +struct scx_test prog_run = { + .name = "prog_run", + .description = "Verify we can call into a scheduler with BPF_PROG_RUN, and invoke kfuncs", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&prog_run) diff --git a/tools/testing/selftests/sched_ext/reload_loop.c b/tools/testing/selftests/sched_ext/reload_loop.c new file mode 100644 index 000000000000..5cfba2d6e056 --- /dev/null +++ b/tools/testing/selftests/sched_ext/reload_loop.c @@ -0,0 +1,75 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <pthread.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "maximal.bpf.skel.h" +#include "scx_test.h" + +static struct maximal *skel; +static pthread_t threads[2]; + +bool force_exit = false; + +static enum scx_test_status setup(void **ctx) +{ + skel = maximal__open_and_load(); + if (!skel) { + SCX_ERR("Failed to open and load skel"); + return SCX_TEST_FAIL; + } + + return SCX_TEST_PASS; +} + +static void *do_reload_loop(void *arg) +{ + u32 i; + + for (i = 0; i < 1024 && !force_exit; i++) { + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.maximal_ops); + if (link) + bpf_link__destroy(link); + } + + return NULL; +} + +static enum scx_test_status run(void *ctx) +{ + int err; + void *ret; + + err = pthread_create(&threads[0], NULL, do_reload_loop, NULL); + SCX_FAIL_IF(err, "Failed to create thread 0"); + + err = pthread_create(&threads[1], NULL, do_reload_loop, NULL); + SCX_FAIL_IF(err, "Failed to create thread 1"); + + SCX_FAIL_IF(pthread_join(threads[0], &ret), "thread 0 failed"); + SCX_FAIL_IF(pthread_join(threads[1], &ret), "thread 1 failed"); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + force_exit = true; + maximal__destroy(skel); +} + +struct scx_test reload_loop = { + .name = "reload_loop", + .description = "Stress test loading and unloading schedulers repeatedly in a tight loop", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&reload_loop) diff --git a/tools/testing/selftests/sched_ext/runner.c b/tools/testing/selftests/sched_ext/runner.c new file mode 100644 index 000000000000..eab48c7ff309 --- /dev/null +++ b/tools/testing/selftests/sched_ext/runner.c @@ -0,0 +1,201 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + */ +#include <stdio.h> +#include <unistd.h> +#include <signal.h> +#include <libgen.h> +#include <bpf/bpf.h> +#include "scx_test.h" + +const char help_fmt[] = +"The runner for sched_ext tests.\n" +"\n" +"The runner is statically linked against all testcases, and runs them all serially.\n" +"It's required for the testcases to be serial, as only a single host-wide sched_ext\n" +"scheduler may be loaded at any given time." +"\n" +"Usage: %s [-t TEST] [-h]\n" +"\n" +" -t TEST Only run tests whose name includes this string\n" +" -s Include print output for skipped tests\n" +" -q Don't print the test descriptions during run\n" +" -h Display this help and exit\n"; + +static volatile int exit_req; +static bool quiet, print_skipped; + +#define MAX_SCX_TESTS 2048 + +static struct scx_test __scx_tests[MAX_SCX_TESTS]; +static unsigned __scx_num_tests = 0; + +static void sigint_handler(int simple) +{ + exit_req = 1; +} + +static void print_test_preamble(const struct scx_test *test, bool quiet) +{ + printf("===== START =====\n"); + printf("TEST: %s\n", test->name); + if (!quiet) + printf("DESCRIPTION: %s\n", test->description); + printf("OUTPUT:\n"); +} + +static const char *status_to_result(enum scx_test_status status) +{ + switch (status) { + case SCX_TEST_PASS: + case SCX_TEST_SKIP: + return "ok"; + case SCX_TEST_FAIL: + return "not ok"; + default: + return "<UNKNOWN>"; + } +} + +static void print_test_result(const struct scx_test *test, + enum scx_test_status status, + unsigned int testnum) +{ + const char *result = status_to_result(status); + const char *directive = status == SCX_TEST_SKIP ? "SKIP " : ""; + + printf("%s %u %s # %s\n", result, testnum, test->name, directive); + printf("===== END =====\n"); +} + +static bool should_skip_test(const struct scx_test *test, const char * filter) +{ + return !strstr(test->name, filter); +} + +static enum scx_test_status run_test(const struct scx_test *test) +{ + enum scx_test_status status; + void *context = NULL; + + if (test->setup) { + status = test->setup(&context); + if (status != SCX_TEST_PASS) + return status; + } + + status = test->run(context); + + if (test->cleanup) + test->cleanup(context); + + return status; +} + +static bool test_valid(const struct scx_test *test) +{ + if (!test) { + fprintf(stderr, "NULL test detected\n"); + return false; + } + + if (!test->name) { + fprintf(stderr, + "Test with no name found. Must specify test name.\n"); + return false; + } + + if (!test->description) { + fprintf(stderr, "Test %s requires description.\n", test->name); + return false; + } + + if (!test->run) { + fprintf(stderr, "Test %s has no run() callback\n", test->name); + return false; + } + + return true; +} + +int main(int argc, char **argv) +{ + const char *filter = NULL; + unsigned testnum = 0, i; + unsigned passed = 0, skipped = 0, failed = 0; + int opt; + + signal(SIGINT, sigint_handler); + signal(SIGTERM, sigint_handler); + + libbpf_set_strict_mode(LIBBPF_STRICT_ALL); + + while ((opt = getopt(argc, argv, "qst:h")) != -1) { + switch (opt) { + case 'q': + quiet = true; + break; + case 's': + print_skipped = true; + break; + case 't': + filter = optarg; + break; + default: + fprintf(stderr, help_fmt, basename(argv[0])); + return opt != 'h'; + } + } + + for (i = 0; i < __scx_num_tests; i++) { + enum scx_test_status status; + struct scx_test *test = &__scx_tests[i]; + + if (filter && should_skip_test(test, filter)) { + /* + * Printing the skipped tests and their preambles can + * add a lot of noise to the runner output. Printing + * this is only really useful for CI, so let's skip it + * by default. + */ + if (print_skipped) { + print_test_preamble(test, quiet); + print_test_result(test, SCX_TEST_SKIP, ++testnum); + } + continue; + } + + print_test_preamble(test, quiet); + status = run_test(test); + print_test_result(test, status, ++testnum); + switch (status) { + case SCX_TEST_PASS: + passed++; + break; + case SCX_TEST_SKIP: + skipped++; + break; + case SCX_TEST_FAIL: + failed++; + break; + } + } + printf("\n\n=============================\n\n"); + printf("RESULTS:\n\n"); + printf("PASSED: %u\n", passed); + printf("SKIPPED: %u\n", skipped); + printf("FAILED: %u\n", failed); + + return 0; +} + +void scx_test_register(struct scx_test *test) +{ + SCX_BUG_ON(!test_valid(test), "Invalid test found"); + SCX_BUG_ON(__scx_num_tests >= MAX_SCX_TESTS, "Maximum tests exceeded"); + + __scx_tests[__scx_num_tests++] = *test; +} diff --git a/tools/testing/selftests/sched_ext/scx_test.h b/tools/testing/selftests/sched_ext/scx_test.h new file mode 100644 index 000000000000..90b8d6915bb7 --- /dev/null +++ b/tools/testing/selftests/sched_ext/scx_test.h @@ -0,0 +1,131 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + */ + +#ifndef __SCX_TEST_H__ +#define __SCX_TEST_H__ + +#include <errno.h> +#include <scx/common.h> +#include <scx/compat.h> + +enum scx_test_status { + SCX_TEST_PASS = 0, + SCX_TEST_SKIP, + SCX_TEST_FAIL, +}; + +#define EXIT_KIND(__ent) __COMPAT_ENUM_OR_ZERO("scx_exit_kind", #__ent) + +struct scx_test { + /** + * name - The name of the testcase. + */ + const char *name; + + /** + * description - A description of your testcase: what it tests and is + * meant to validate. + */ + const char *description; + + /* + * setup - Setup the test. + * @ctx: A pointer to a context object that will be passed to run and + * cleanup. + * + * An optional callback that allows a testcase to perform setup for its + * run. A test may return SCX_TEST_SKIP to skip the run. + */ + enum scx_test_status (*setup)(void **ctx); + + /* + * run - Run the test. + * @ctx: Context set in the setup() callback. If @ctx was not set in + * setup(), it is NULL. + * + * The main test. Callers should return one of: + * + * - SCX_TEST_PASS: Test passed + * - SCX_TEST_SKIP: Test should be skipped + * - SCX_TEST_FAIL: Test failed + * + * This callback must be defined. + */ + enum scx_test_status (*run)(void *ctx); + + /* + * cleanup - Perform cleanup following the test + * @ctx: Context set in the setup() callback. If @ctx was not set in + * setup(), it is NULL. + * + * An optional callback that allows a test to perform cleanup after + * being run. This callback is run even if the run() callback returns + * SCX_TEST_SKIP or SCX_TEST_FAIL. It is not run if setup() returns + * SCX_TEST_SKIP or SCX_TEST_FAIL. + */ + void (*cleanup)(void *ctx); +}; + +void scx_test_register(struct scx_test *test); + +#define REGISTER_SCX_TEST(__test) \ + __attribute__((constructor)) \ + static void ___scxregister##__LINE__(void) \ + { \ + scx_test_register(__test); \ + } + +#define SCX_ERR(__fmt, ...) \ + do { \ + fprintf(stderr, "ERR: %s:%d\n", __FILE__, __LINE__); \ + fprintf(stderr, __fmt"\n", ##__VA_ARGS__); \ + } while (0) + +#define SCX_FAIL(__fmt, ...) \ + do { \ + SCX_ERR(__fmt, ##__VA_ARGS__); \ + return SCX_TEST_FAIL; \ + } while (0) + +#define SCX_FAIL_IF(__cond, __fmt, ...) \ + do { \ + if (__cond) \ + SCX_FAIL(__fmt, ##__VA_ARGS__); \ + } while (0) + +#define SCX_GT(_x, _y) SCX_FAIL_IF((_x) <= (_y), "Expected %s > %s (%lu > %lu)", \ + #_x, #_y, (u64)(_x), (u64)(_y)) +#define SCX_GE(_x, _y) SCX_FAIL_IF((_x) < (_y), "Expected %s >= %s (%lu >= %lu)", \ + #_x, #_y, (u64)(_x), (u64)(_y)) +#define SCX_LT(_x, _y) SCX_FAIL_IF((_x) >= (_y), "Expected %s < %s (%lu < %lu)", \ + #_x, #_y, (u64)(_x), (u64)(_y)) +#define SCX_LE(_x, _y) SCX_FAIL_IF((_x) > (_y), "Expected %s <= %s (%lu <= %lu)", \ + #_x, #_y, (u64)(_x), (u64)(_y)) +#define SCX_EQ(_x, _y) SCX_FAIL_IF((_x) != (_y), "Expected %s == %s (%lu == %lu)", \ + #_x, #_y, (u64)(_x), (u64)(_y)) +#define SCX_ASSERT(_x) SCX_FAIL_IF(!(_x), "Expected %s to be true (%lu)", \ + #_x, (u64)(_x)) + +#define SCX_ECODE_VAL(__ecode) ({ \ + u64 __val = 0; \ + bool __found = false; \ + \ + __found = __COMPAT_read_enum("scx_exit_code", #__ecode, &__val); \ + SCX_ASSERT(__found); \ + (s64)__val; \ +}) + +#define SCX_KIND_VAL(__kind) ({ \ + u64 __val = 0; \ + bool __found = false; \ + \ + __found = __COMPAT_read_enum("scx_exit_kind", #__kind, &__val); \ + SCX_ASSERT(__found); \ + __val; \ +}) + +#endif // # __SCX_TEST_H__ diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c new file mode 100644 index 000000000000..2ed2991afafe --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c @@ -0,0 +1,40 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates the behavior of direct dispatching with a default + * select_cpu implementation. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +bool saw_local = false; + +static bool task_is_test(const struct task_struct *p) +{ + return !bpf_strncmp(p->comm, 9, "select_cpu"); +} + +void BPF_STRUCT_OPS(select_cpu_dfl_enqueue, struct task_struct *p, + u64 enq_flags) +{ + const struct cpumask *idle_mask = scx_bpf_get_idle_cpumask(); + + if (task_is_test(p) && + bpf_cpumask_test_cpu(scx_bpf_task_cpu(p), idle_mask)) { + saw_local = true; + } + scx_bpf_put_idle_cpumask(idle_mask); + + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); +} + +SEC(".struct_ops.link") +struct sched_ext_ops select_cpu_dfl_ops = { + .enqueue = select_cpu_dfl_enqueue, + .name = "select_cpu_dfl", +}; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl.c b/tools/testing/selftests/sched_ext/select_cpu_dfl.c new file mode 100644 index 000000000000..a53a40c2d2f0 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dfl.c @@ -0,0 +1,72 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "select_cpu_dfl.bpf.skel.h" +#include "scx_test.h" + +#define NUM_CHILDREN 1028 + +static enum scx_test_status setup(void **ctx) +{ + struct select_cpu_dfl *skel; + + skel = select_cpu_dfl__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct select_cpu_dfl *skel = ctx; + struct bpf_link *link; + pid_t pids[NUM_CHILDREN]; + int i, status; + + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + for (i = 0; i < NUM_CHILDREN; i++) { + pids[i] = fork(); + if (pids[i] == 0) { + sleep(1); + exit(0); + } + } + + for (i = 0; i < NUM_CHILDREN; i++) { + SCX_EQ(waitpid(pids[i], &status, 0), pids[i]); + SCX_EQ(status, 0); + } + + SCX_ASSERT(!skel->bss->saw_local); + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct select_cpu_dfl *skel = ctx; + + select_cpu_dfl__destroy(skel); +} + +struct scx_test select_cpu_dfl = { + .name = "select_cpu_dfl", + .description = "Verify the default ops.select_cpu() dispatches tasks " + "when idles cores are found, and skips ops.enqueue()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&select_cpu_dfl) diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c new file mode 100644 index 000000000000..4bb5abb2d369 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c @@ -0,0 +1,89 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates the behavior of direct dispatching with a default + * select_cpu implementation, and with the SCX_OPS_ENQ_DFL_NO_DISPATCH ops flag + * specified. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +bool saw_local = false; + +/* Per-task scheduling context */ +struct task_ctx { + bool force_local; /* CPU changed by ops.select_cpu() */ +}; + +struct { + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct task_ctx); +} task_ctx_stor SEC(".maps"); + +/* Manually specify the signature until the kfunc is added to the scx repo. */ +s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, + bool *found) __ksym; + +s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + struct task_ctx *tctx; + s32 cpu; + + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); + if (!tctx) { + scx_bpf_error("task_ctx lookup failed"); + return -ESRCH; + } + + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, + &tctx->force_local); + + return cpu; +} + +void BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_enqueue, struct task_struct *p, + u64 enq_flags) +{ + u64 dsq_id = SCX_DSQ_GLOBAL; + struct task_ctx *tctx; + + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); + if (!tctx) { + scx_bpf_error("task_ctx lookup failed"); + return; + } + + if (tctx->force_local) { + dsq_id = SCX_DSQ_LOCAL; + tctx->force_local = false; + saw_local = true; + } + + scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, enq_flags); +} + +s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_init_task, + struct task_struct *p, struct scx_init_task_args *args) +{ + if (bpf_task_storage_get(&task_ctx_stor, p, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE)) + return 0; + else + return -ENOMEM; +} + +SEC(".struct_ops.link") +struct sched_ext_ops select_cpu_dfl_nodispatch_ops = { + .select_cpu = select_cpu_dfl_nodispatch_select_cpu, + .enqueue = select_cpu_dfl_nodispatch_enqueue, + .init_task = select_cpu_dfl_nodispatch_init_task, + .name = "select_cpu_dfl_nodispatch", +}; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c new file mode 100644 index 000000000000..1d85bf4bf3a3 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c @@ -0,0 +1,72 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "select_cpu_dfl_nodispatch.bpf.skel.h" +#include "scx_test.h" + +#define NUM_CHILDREN 1028 + +static enum scx_test_status setup(void **ctx) +{ + struct select_cpu_dfl_nodispatch *skel; + + skel = select_cpu_dfl_nodispatch__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct select_cpu_dfl_nodispatch *skel = ctx; + struct bpf_link *link; + pid_t pids[NUM_CHILDREN]; + int i, status; + + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_nodispatch_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + for (i = 0; i < NUM_CHILDREN; i++) { + pids[i] = fork(); + if (pids[i] == 0) { + sleep(1); + exit(0); + } + } + + for (i = 0; i < NUM_CHILDREN; i++) { + SCX_EQ(waitpid(pids[i], &status, 0), pids[i]); + SCX_EQ(status, 0); + } + + SCX_ASSERT(skel->bss->saw_local); + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct select_cpu_dfl_nodispatch *skel = ctx; + + select_cpu_dfl_nodispatch__destroy(skel); +} + +struct scx_test select_cpu_dfl_nodispatch = { + .name = "select_cpu_dfl_nodispatch", + .description = "Verify behavior of scx_bpf_select_cpu_dfl() in " + "ops.select_cpu()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&select_cpu_dfl_nodispatch) diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c new file mode 100644 index 000000000000..f0b96a4a04b2 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates the behavior of direct dispatching with a default + * select_cpu implementation. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +s32 BPF_STRUCT_OPS(select_cpu_dispatch_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + u64 dsq_id = SCX_DSQ_LOCAL; + s32 cpu = prev_cpu; + + if (scx_bpf_test_and_clear_cpu_idle(cpu)) + goto dispatch; + + cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); + if (cpu >= 0) + goto dispatch; + + dsq_id = SCX_DSQ_GLOBAL; + cpu = prev_cpu; + +dispatch: + scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, 0); + return cpu; +} + +SEC(".struct_ops.link") +struct sched_ext_ops select_cpu_dispatch_ops = { + .select_cpu = select_cpu_dispatch_select_cpu, + .name = "select_cpu_dispatch", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch.c new file mode 100644 index 000000000000..0309ca8785b3 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch.c @@ -0,0 +1,70 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "select_cpu_dispatch.bpf.skel.h" +#include "scx_test.h" + +#define NUM_CHILDREN 1028 + +static enum scx_test_status setup(void **ctx) +{ + struct select_cpu_dispatch *skel; + + skel = select_cpu_dispatch__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct select_cpu_dispatch *skel = ctx; + struct bpf_link *link; + pid_t pids[NUM_CHILDREN]; + int i, status; + + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + for (i = 0; i < NUM_CHILDREN; i++) { + pids[i] = fork(); + if (pids[i] == 0) { + sleep(1); + exit(0); + } + } + + for (i = 0; i < NUM_CHILDREN; i++) { + SCX_EQ(waitpid(pids[i], &status, 0), pids[i]); + SCX_EQ(status, 0); + } + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct select_cpu_dispatch *skel = ctx; + + select_cpu_dispatch__destroy(skel); +} + +struct scx_test select_cpu_dispatch = { + .name = "select_cpu_dispatch", + .description = "Test direct dispatching to built-in DSQs from " + "ops.select_cpu()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&select_cpu_dispatch) diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c new file mode 100644 index 000000000000..7b42ddce0f56 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c @@ -0,0 +1,37 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates the behavior of direct dispatching with a default + * select_cpu implementation. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +s32 BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + /* Dispatching to a random DSQ should fail. */ + scx_bpf_dispatch(p, 0xcafef00d, SCX_SLICE_DFL, 0); + + return prev_cpu; +} + +void BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops select_cpu_dispatch_bad_dsq_ops = { + .select_cpu = select_cpu_dispatch_bad_dsq_select_cpu, + .exit = select_cpu_dispatch_bad_dsq_exit, + .name = "select_cpu_dispatch_bad_dsq", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c new file mode 100644 index 000000000000..47eb6ed7627d --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "select_cpu_dispatch_bad_dsq.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct select_cpu_dispatch_bad_dsq *skel; + + skel = select_cpu_dispatch_bad_dsq__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct select_cpu_dispatch_bad_dsq *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_bad_dsq_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + sleep(1); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR)); + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct select_cpu_dispatch_bad_dsq *skel = ctx; + + select_cpu_dispatch_bad_dsq__destroy(skel); +} + +struct scx_test select_cpu_dispatch_bad_dsq = { + .name = "select_cpu_dispatch_bad_dsq", + .description = "Verify graceful failure if we direct-dispatch to a " + "bogus DSQ in ops.select_cpu()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&select_cpu_dispatch_bad_dsq) diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c new file mode 100644 index 000000000000..653e3dc0b4dc --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c @@ -0,0 +1,38 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates the behavior of direct dispatching with a default + * select_cpu implementation. + * + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +s32 BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + /* Dispatching twice in a row is disallowed. */ + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0); + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0); + + return prev_cpu; +} + +void BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops select_cpu_dispatch_dbl_dsp_ops = { + .select_cpu = select_cpu_dispatch_dbl_dsp_select_cpu, + .exit = select_cpu_dispatch_dbl_dsp_exit, + .name = "select_cpu_dispatch_dbl_dsp", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c new file mode 100644 index 000000000000..48ff028a3c46 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "select_cpu_dispatch_dbl_dsp.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct select_cpu_dispatch_dbl_dsp *skel; + + skel = select_cpu_dispatch_dbl_dsp__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct select_cpu_dispatch_dbl_dsp *skel = ctx; + struct bpf_link *link; + + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_dbl_dsp_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + sleep(1); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR)); + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct select_cpu_dispatch_dbl_dsp *skel = ctx; + + select_cpu_dispatch_dbl_dsp__destroy(skel); +} + +struct scx_test select_cpu_dispatch_dbl_dsp = { + .name = "select_cpu_dispatch_dbl_dsp", + .description = "Verify graceful failure if we dispatch twice to a " + "DSQ in ops.select_cpu()", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&select_cpu_dispatch_dbl_dsp) diff --git a/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c new file mode 100644 index 000000000000..7f3ebf4fc2ea --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c @@ -0,0 +1,92 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A scheduler that validates that enqueue flags are properly stored and + * applied at dispatch time when a task is directly dispatched from + * ops.select_cpu(). We validate this by using scx_bpf_dispatch_vtime(), and + * making the test a very basic vtime scheduler. + * + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +volatile bool consumed; + +static u64 vtime_now; + +#define VTIME_DSQ 0 + +static inline bool vtime_before(u64 a, u64 b) +{ + return (s64)(a - b) < 0; +} + +static inline u64 task_vtime(const struct task_struct *p) +{ + u64 vtime = p->scx.dsq_vtime; + + if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL)) + return vtime_now - SCX_SLICE_DFL; + else + return vtime; +} + +s32 BPF_STRUCT_OPS(select_cpu_vtime_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + s32 cpu; + + cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); + if (cpu >= 0) + goto ddsp; + + cpu = prev_cpu; + scx_bpf_test_and_clear_cpu_idle(cpu); +ddsp: + scx_bpf_dispatch_vtime(p, VTIME_DSQ, SCX_SLICE_DFL, task_vtime(p), 0); + return cpu; +} + +void BPF_STRUCT_OPS(select_cpu_vtime_dispatch, s32 cpu, struct task_struct *p) +{ + if (scx_bpf_consume(VTIME_DSQ)) + consumed = true; +} + +void BPF_STRUCT_OPS(select_cpu_vtime_running, struct task_struct *p) +{ + if (vtime_before(vtime_now, p->scx.dsq_vtime)) + vtime_now = p->scx.dsq_vtime; +} + +void BPF_STRUCT_OPS(select_cpu_vtime_stopping, struct task_struct *p, + bool runnable) +{ + p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; +} + +void BPF_STRUCT_OPS(select_cpu_vtime_enable, struct task_struct *p) +{ + p->scx.dsq_vtime = vtime_now; +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init) +{ + return scx_bpf_create_dsq(VTIME_DSQ, -1); +} + +SEC(".struct_ops.link") +struct sched_ext_ops select_cpu_vtime_ops = { + .select_cpu = select_cpu_vtime_select_cpu, + .dispatch = select_cpu_vtime_dispatch, + .running = select_cpu_vtime_running, + .stopping = select_cpu_vtime_stopping, + .enable = select_cpu_vtime_enable, + .init = select_cpu_vtime_init, + .name = "select_cpu_vtime", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/select_cpu_vtime.c b/tools/testing/selftests/sched_ext/select_cpu_vtime.c new file mode 100644 index 000000000000..b4629c2364f5 --- /dev/null +++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.c @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "select_cpu_vtime.bpf.skel.h" +#include "scx_test.h" + +static enum scx_test_status setup(void **ctx) +{ + struct select_cpu_vtime *skel; + + skel = select_cpu_vtime__open_and_load(); + SCX_FAIL_IF(!skel, "Failed to open and load skel"); + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct select_cpu_vtime *skel = ctx; + struct bpf_link *link; + + SCX_ASSERT(!skel->bss->consumed); + + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_vtime_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + sleep(1); + + SCX_ASSERT(skel->bss->consumed); + + bpf_link__destroy(link); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct select_cpu_vtime *skel = ctx; + + select_cpu_vtime__destroy(skel); +} + +struct scx_test select_cpu_vtime = { + .name = "select_cpu_vtime", + .description = "Test doing direct vtime-dispatching from " + "ops.select_cpu(), to a non-built-in DSQ", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&select_cpu_vtime) diff --git a/tools/testing/selftests/sched_ext/test_example.c b/tools/testing/selftests/sched_ext/test_example.c new file mode 100644 index 000000000000..ce36cdf03cdc --- /dev/null +++ b/tools/testing/selftests/sched_ext/test_example.c @@ -0,0 +1,49 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <bpf/bpf.h> +#include <scx/common.h> +#include "scx_test.h" + +static bool setup_called = false; +static bool run_called = false; +static bool cleanup_called = false; + +static int context = 10; + +static enum scx_test_status setup(void **ctx) +{ + setup_called = true; + *ctx = &context; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + int *arg = ctx; + + SCX_ASSERT(setup_called); + SCX_ASSERT(!run_called && !cleanup_called); + SCX_EQ(*arg, context); + + run_called = true; + return SCX_TEST_PASS; +} + +static void cleanup (void *ctx) +{ + SCX_BUG_ON(!run_called || cleanup_called, "Wrong callbacks invoked"); +} + +struct scx_test example = { + .name = "example", + .description = "Validate the basic function of the test suite itself", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&example) diff --git a/tools/testing/selftests/sched_ext/util.c b/tools/testing/selftests/sched_ext/util.c new file mode 100644 index 000000000000..e47769c91918 --- /dev/null +++ b/tools/testing/selftests/sched_ext/util.c @@ -0,0 +1,71 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <dvernet@meta.com> + */ +#include <errno.h> +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +/* Returns read len on success, or -errno on failure. */ +static ssize_t read_text(const char *path, char *buf, size_t max_len) +{ + ssize_t len; + int fd; + + fd = open(path, O_RDONLY); + if (fd < 0) + return -errno; + + len = read(fd, buf, max_len - 1); + + if (len >= 0) + buf[len] = 0; + + close(fd); + return len < 0 ? -errno : len; +} + +/* Returns written len on success, or -errno on failure. */ +static ssize_t write_text(const char *path, char *buf, ssize_t len) +{ + int fd; + ssize_t written; + + fd = open(path, O_WRONLY | O_APPEND); + if (fd < 0) + return -errno; + + written = write(fd, buf, len); + close(fd); + return written < 0 ? -errno : written; +} + +long file_read_long(const char *path) +{ + char buf[128]; + + + if (read_text(path, buf, sizeof(buf)) <= 0) + return -1; + + return atol(buf); +} + +int file_write_long(const char *path, long val) +{ + char buf[64]; + int ret; + + ret = sprintf(buf, "%lu", val); + if (ret < 0) + return ret; + + if (write_text(path, buf, sizeof(buf)) <= 0) + return -1; + + return 0; +} diff --git a/tools/testing/selftests/sched_ext/util.h b/tools/testing/selftests/sched_ext/util.h new file mode 100644 index 000000000000..bc13dfec1267 --- /dev/null +++ b/tools/testing/selftests/sched_ext/util.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2024 David Vernet <void@manifault.com> + */ + +#ifndef __SCX_TEST_UTIL_H__ +#define __SCX_TEST_UTIL_H__ + +long file_read_long(const char *path); +int file_write_long(const char *path, long val); + +#endif // __SCX_TEST_H__ -- 2.34.1
From: David Vernet <void@manifault.com> mainline inclusion from mainline-v6.8-rc1 commit 8b7b0e5fe47de90ba6c350f9abece589fb637f79 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- In libbpf, when determining whether we need to load vmlinux btf, we're currently (among other things) checking whether there is any struct_ops program present in the object. This works for most realistic struct_ops maps, as a struct_ops map is of course typically composed of one or more struct_ops programs. However, that technically need not be the case. A struct_ops interface could be defined which allows a map to be specified which one or more non-prog fields, and which provides default behavior if no struct_ops progs is actually provided otherwise. For sched_ext, for example, you technically only need to specify the name of the scheduler in the struct_ops map, with the core scheduler logic providing default behavior if no prog is actually specified. If we were to define and try to load such a struct_ops map, we would crash in libbpf when initializing it as obj->btf_vmlinux will be NULL: Reading symbols from minimal... (gdb) r Starting program: minimal_example [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". Program received signal SIGSEGV, Segmentation fault. 0x000055555558308c in btf__type_cnt (btf=0x0) at btf.c:612 612 return btf->start_id + btf->nr_types; (gdb) bt type_name=0x5555555d99e3 "sched_ext_ops", kind=4) at btf.c:914 kind=4) at btf.c:942 type=0x7fffffffe558, type_id=0x7fffffffe548, ... data_member=0x7fffffffe568) at libbpf.c:948 kern_btf=0x0) at libbpf.c:1017 at libbpf.c:8059 So as to account for such bare-bones struct_ops maps, let's update obj_needs_vmlinux_btf() to also iterate over an obj's maps and check whether any of them are struct_ops maps. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20231208061704.400463-1-void@manifault.com Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/lib/bpf/libbpf.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 5f8f0dd589e7..034b4074d690 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -3083,9 +3083,15 @@ static bool prog_needs_vmlinux_btf(struct bpf_program *prog) return false; } +static bool map_needs_vmlinux_btf(struct bpf_map *map) +{ + return bpf_map__is_struct_ops(map); +} + static bool obj_needs_vmlinux_btf(const struct bpf_object *obj) { struct bpf_program *prog; + struct bpf_map *map; int i; /* CO-RE relocations need kernel BTF, only when btf_custom_path @@ -3110,6 +3116,11 @@ static bool obj_needs_vmlinux_btf(const struct bpf_object *obj) return true; } + bpf_object__for_each_map(map, obj) { + if (map_needs_vmlinux_btf(map)) + return true; + } + return false; } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit b999e365c2982dbd50f01fec520215d3c61ea2aa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when a CPU is taken away by a task dispatched from a higher priority sched_class so that the BPF scheduler can, e.g., punt the task[s] which was running or were waiting for the CPU to other CPUs. Replace the sched_ext specific hook scx_next_task_picked() with a new sched_class operation switch_class(). The changes are straightforward and the code looks better afterwards. However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook which is unlikely to be useful to other sched_classes. For further discussion on this subject, please refer to the following: http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw... Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 5 ++++- kernel/sched/ext.c | 20 ++++++++++---------- kernel/sched/ext.h | 4 ---- kernel/sched/sched.h | 2 ++ 4 files changed, 16 insertions(+), 15 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c0b0d2de8a9d..a7166f5c891d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5958,7 +5958,10 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) for_each_active_class(class) { p = class->pick_next_task(rq); if (p) { - scx_next_task_picked(rq, p, class); + const struct sched_class *prev_class = prev->sched_class; + + if (class != prev_class && prev_class->switch_class) + prev_class->switch_class(rq, p); return p; } } diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index bf2c1e516035..12157b842949 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2754,10 +2754,9 @@ preempt_reason_from_class(const struct sched_class *class) return SCX_CPU_PREEMPT_UNKNOWN; } -void scx_next_task_picked(struct rq *rq, struct task_struct *p, - const struct sched_class *active) +static void switch_class_scx(struct rq *rq, struct task_struct *next) { - lockdep_assert_rq_held(rq); + const struct sched_class *next_class = next->sched_class; if (!scx_enabled()) return; @@ -2774,12 +2773,11 @@ void scx_next_task_picked(struct rq *rq, struct task_struct *p, /* * The callback is conceptually meant to convey that the CPU is no - * longer under the control of SCX. Therefore, don't invoke the - * callback if the CPU is is staying on SCX, or going idle (in which - * case the SCX scheduler has actively decided not to schedule any - * tasks on the CPU). + * longer under the control of SCX. Therefore, don't invoke the callback + * if the next class is below SCX (in which case the BPF scheduler has + * actively decided not to schedule any tasks on the CPU). */ - if (likely(active >= &ext_sched_class)) + if (sched_class_above(&ext_sched_class, next_class)) return; /* @@ -2794,8 +2792,8 @@ void scx_next_task_picked(struct rq *rq, struct task_struct *p, if (!rq->scx.cpu_released) { if (SCX_HAS_OP(cpu_release)) { struct scx_cpu_release_args args = { - .reason = preempt_reason_from_class(active), - .task = p, + .reason = preempt_reason_from_class(next_class), + .task = next, }; SCX_CALL_OP(SCX_KF_CPU_RELEASE, @@ -3501,6 +3499,8 @@ DEFINE_SCHED_CLASS(ext) = { .put_prev_task = put_prev_task_scx, .set_next_task = set_next_task_scx, + .switch_class = switch_class_scx, + #ifdef CONFIG_SMP .balance = balance_scx, .select_task_rq = select_task_rq_scx, diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index c41d742b5d62..bf6f2cfa49d5 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -33,8 +33,6 @@ static inline bool task_on_scx(const struct task_struct *p) return scx_enabled() && p->sched_class == &ext_sched_class; } -void scx_next_task_picked(struct rq *rq, struct task_struct *p, - const struct sched_class *active); void scx_tick(struct rq *rq); void init_scx_entity(struct sched_ext_entity *scx); void scx_pre_fork(struct task_struct *p); @@ -82,8 +80,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, #define scx_enabled() false #define scx_switched_all() false -static inline void scx_next_task_picked(struct rq *rq, struct task_struct *p, - const struct sched_class *active) {} static inline void scx_tick(struct rq *rq) {} static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ea4f78c8b4d9..d37bbca234c5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2531,6 +2531,8 @@ struct sched_class { void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); + void (*switch_class)(struct rq *rq, struct task_struct *next); + #ifdef CONFIG_SMP int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags); -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched_ext: Fix kabi for the function attribute switch_class in struct sched_class. Fixes: 408ca56f62c0 ("sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class()") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index d37bbca234c5..6af56b0228e3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2531,8 +2531,6 @@ struct sched_class { void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); - void (*switch_class)(struct rq *rq, struct task_struct *next); - #ifdef CONFIG_SMP int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags); @@ -2580,6 +2578,7 @@ struct sched_class { KABI_USE(1, void (*reweight_task)(struct rq *this_rq, struct task_struct *task, const struct load_weight *lw)) KABI_USE(2, void (*switching_to) (struct rq *this_rq, struct task_struct *task)) + KABI_EXTEND(void (*switch_class)(struct rq *rq, struct task_struct *next)) }; static inline void put_prev_task(struct rq *rq, struct task_struct *prev) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 8988cad8d06eb6097667925d2eb0522850bb0aac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- sugov_cpu_is_busy() is used to avoid decreasing performance level while the CPU is busy and called by sugov_update_single_freq() and sugov_update_single_perf(). Both callers repeat the same pattern to first test for uclamp and then the business. Let's refactor so that the tests aren't repeated. The new helper is named sugov_hold_freq() and tests both the uclamp exception and CPU business. No functional changes. This will make adding more exception conditions easier. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/cpufreq_schedutil.c | 38 +++++++++++++++----------------- 1 file changed, 18 insertions(+), 20 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 819ec1ccc08c..139ab95c49a4 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -352,16 +352,27 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, } #ifdef CONFIG_NO_HZ_COMMON -static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) +static bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { - unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu); - bool ret = idle_calls == sg_cpu->saved_idle_calls; + unsigned long idle_calls; + bool ret; + + /* if capped by uclamp_max, always update to be in compliance */ + if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu))) + return false; + + /* + * Maintain the frequency if the CPU has not been idle recently, as + * reduction is likely to be premature. + */ + idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu); + ret = idle_calls == sg_cpu->saved_idle_calls; sg_cpu->saved_idle_calls = idle_calls; return ret; } #else -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } +static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; } #endif /* CONFIG_NO_HZ_COMMON */ /* @@ -407,14 +418,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time, return; next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap); - /* - * Do not reduce the frequency if the CPU has not been idle - * recently, as the reduction is likely to be premature then. - * - * Except when the rq is capped by uclamp_max. - */ - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) && - sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq && + + if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq && !sg_policy->need_freq_update) { next_f = sg_policy->next_freq; @@ -461,14 +466,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, if (!sugov_update_single_common(sg_cpu, time, max_cap, flags)) return; - /* - * Do not reduce the target performance level if the CPU has not been - * idle recently, as the reduction is likely to be premature then. - * - * Except when the rq is capped by uclamp_max. - */ - if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) && - sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util) + if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util) sg_cpu->util = prev_util; cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit d86adb4fc0655a0867da811d000df75d2a325ef6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- sched_ext currently does not integrate with schedutil. When schedutil is the governor, frequencies are left unregulated and usually get stuck close to the highest performance level from running RT tasks. Add CPU performance monitoring and scaling support by integrating into schedutil. The following kfuncs are added: - scx_bpf_cpuperf_cap(): Query the relative performance capacity of different CPUs in the system. - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU relative to its max performance. - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU. This gives direct control over CPU performance setting to the BPF scheduler. The only changes on the schedutil side are accounting for the utilization factor from sched_ext and disabling frequency holding heuristics as it may not apply well to sched_ext schedulers which may have a lot weaker connection between tasks and their current / last CPU. With cpuperf support added, there is no reason to block uclamp. Enable while at it. A toy implementation of cpuperf is added to scx_qmap as a demonstration of the feature. v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util() to avoid factoring in stale util metric. (Christian) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/cpufreq_schedutil.c | 12 +- kernel/sched/ext.c | 83 ++++++++++++- kernel/sched/ext.h | 9 ++ kernel/sched/sched.h | 1 + tools/sched_ext/include/scx/common.bpf.h | 3 + tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++- tools/sched_ext/scx_qmap.c | 8 ++ 7 files changed, 252 insertions(+), 6 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 139ab95c49a4..2e6600647d6b 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -220,8 +220,10 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual, static void sugov_get_util(struct sugov_cpu *sg_cpu) { - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu); + unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu); + if (!scx_switched_all()) + util += cpu_util_cfs_boost(sg_cpu->cpu); util = effective_cpu_util(sg_cpu->cpu, util, &min, &max); sg_cpu->bw_min = min; sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max); @@ -357,6 +359,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu) unsigned long idle_calls; bool ret; + /* + * The heuristics in this function is for the fair class. For SCX, the + * performance target comes directly from the BPF scheduler. Let's just + * follow it. + */ + if (scx_switched_all()) + return false; + /* if capped by uclamp_max, always update to be in compliance */ if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu))) return false; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 12157b842949..2c52188574ee 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -20,6 +20,8 @@ enum scx_consts { SCX_EXIT_BT_LEN = 64, SCX_EXIT_MSG_LEN = 1024, SCX_EXIT_DUMP_DFL_LEN = 32768, + + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE, }; enum scx_exit_kind { @@ -3525,7 +3527,7 @@ DEFINE_SCHED_CLASS(ext) = { .update_curr = update_curr_scx, #ifdef CONFIG_UCLAMP_TASK - .uclamp_enabled = 0, + .uclamp_enabled = 1, #endif }; @@ -4398,7 +4400,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) struct scx_task_iter sti; struct task_struct *p; unsigned long timeout; - int i, ret; + int i, cpu, ret; mutex_lock(&scx_ops_enable_mutex); @@ -4447,6 +4449,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops) atomic_long_set(&scx_nr_rejected, 0); + for_each_possible_cpu(cpu) + cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE; + /* * Keep CPUs stable during enable so that the BPF scheduler can track * online CPUs by watching ->on/offline_cpu() after ->init(). @@ -5840,6 +5845,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, ops_dump_flush(); } +/** + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU + * @cpu: CPU of interest + * + * Return the maximum relative capacity of @cpu in relation to the most + * performant CPU in the system. The return value is in the range [1, + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). + */ +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) +{ + if (ops_cpu_valid(cpu, NULL)) + return arch_scale_cpu_capacity(cpu); + else + return SCX_CPUPERF_ONE; +} + +/** + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU + * @cpu: CPU of interest + * + * Return the current relative performance of @cpu in relation to its maximum. + * The return value is in the range [1, %SCX_CPUPERF_ONE]. + * + * The current performance level of a CPU in relation to the maximum performance + * available in the system can be calculated as follows: + * + * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE + * + * The result is in the range [1, %SCX_CPUPERF_ONE]. + */ +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) +{ + if (ops_cpu_valid(cpu, NULL)) + return arch_scale_freq_capacity(cpu); + else + return SCX_CPUPERF_ONE; +} + +/** + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU + * @cpu: CPU of interest + * @perf: target performance level [0, %SCX_CPUPERF_ONE] + * @flags: %SCX_CPUPERF_* flags + * + * Set the target performance level of @cpu to @perf. @perf is in linear + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the + * schedutil cpufreq governor chooses the target frequency. + * + * The actual performance level chosen, CPU grouping, and the overhead and + * latency of the operations are dependent on the hardware and cpufreq driver in + * use. Consult hardware and cpufreq documentation for more information. The + * current performance level can be monitored using scx_bpf_cpuperf_cur(). + */ +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf) +{ + if (unlikely(perf > SCX_CPUPERF_ONE)) { + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu); + return; + } + + if (ops_cpu_valid(cpu, NULL)) { + struct rq *rq = cpu_rq(cpu); + + rq->scx.cpuperf_target = perf; + + rcu_read_lock_sched_notrace(); + cpufreq_update_util(cpu_rq(cpu), 0); + rcu_read_unlock_sched_notrace(); + } +} + /** * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs * @@ -6050,6 +6126,9 @@ BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_set) BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE) diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index bf6f2cfa49d5..0a7b9a34b18f 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -46,6 +46,14 @@ int scx_check_setscheduler(struct task_struct *p, int policy); bool task_should_scx(struct task_struct *p); void init_sched_ext_class(void); +static inline u32 scx_cpuperf_target(s32 cpu) +{ + if (scx_enabled()) + return cpu_rq(cpu)->scx.cpuperf_target; + else + return 0; +} + static inline const struct sched_class *next_active_class(const struct sched_class *class) { class++; @@ -85,6 +93,7 @@ static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} static inline void scx_cancel_fork(struct task_struct *p) {} +static inline u32 scx_cpuperf_target(s32 cpu) { return 0; } static inline bool scx_can_stop_tick(struct rq *rq) { return true; } static inline void scx_rq_activate(struct rq *rq) {} static inline void scx_rq_deactivate(struct rq *rq) {} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6af56b0228e3..db7edcf28138 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -843,6 +843,7 @@ struct scx_rq { u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; u32 flags; + u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ bool cpu_released; cpumask_var_t cpus_to_kick; cpumask_var_t cpus_to_kick_if_idle; diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 3fa87084cf17..dbbda0e35c5d 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -42,6 +42,9 @@ void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak; void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym; void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak; +u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak; +u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak; +void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak; u32 scx_bpf_nr_cpu_ids(void) __ksym __weak; const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak; const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak; diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index c75c70d6a8eb..b1d0b09c966e 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -68,6 +68,18 @@ struct { }, }; +/* + * If enabled, CPU performance target is set according to the queue index + * according to the following table. + */ +static const u32 qidx_to_cpuperf_target[] = { + [0] = SCX_CPUPERF_ONE * 0 / 4, + [1] = SCX_CPUPERF_ONE * 1 / 4, + [2] = SCX_CPUPERF_ONE * 2 / 4, + [3] = SCX_CPUPERF_ONE * 3 / 4, + [4] = SCX_CPUPERF_ONE * 4 / 4, +}; + /* * Per-queue sequence numbers to implement core-sched ordering. * @@ -95,6 +107,8 @@ struct { struct cpu_ctx { u64 dsp_idx; /* dispatch index */ u64 dsp_cnt; /* remaining count */ + u32 avg_weight; + u32 cpuperf_target; }; struct { @@ -107,6 +121,8 @@ struct { /* Statistics */ u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued; u64 nr_core_sched_execed; +u32 cpuperf_min, cpuperf_avg, cpuperf_max; +u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max; s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) @@ -313,6 +329,29 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) } } +void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p) +{ + struct cpu_ctx *cpuc; + u32 zero = 0; + int idx; + + if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) { + scx_bpf_error("failed to look up cpu_ctx"); + return; + } + + /* + * Use the running avg of weights to select the target cpuperf level. + * This is a demonstration of the cpuperf feature rather than a + * practical strategy to regulate CPU frequency. + */ + cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4; + idx = weight_to_idx(cpuc->avg_weight); + cpuc->cpuperf_target = qidx_to_cpuperf_target[idx]; + + scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target); +} + /* * The distance from the head of the queue scaled by the weight of the queue. * The lower the number, the older the task and the higher the priority. @@ -422,8 +461,9 @@ void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu))) return; - scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu", - cpuc->dsp_idx, cpuc->dsp_cnt); + scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u", + cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight, + cpuc->cpuperf_target); } void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p) @@ -492,11 +532,106 @@ void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu) print_cpus(); } +struct monitor_timer { + struct bpf_timer timer; +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, u32); + __type(value, struct monitor_timer); +} monitor_timer SEC(".maps"); + +/* + * Print out the min, avg and max performance levels of CPUs every second to + * demonstrate the cpuperf interface. + */ +static void monitor_cpuperf(void) +{ + u32 zero = 0, nr_cpu_ids; + u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0; + u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0; + const struct cpumask *online; + int i, nr_online_cpus = 0; + + nr_cpu_ids = scx_bpf_nr_cpu_ids(); + online = scx_bpf_get_online_cpumask(); + + bpf_for(i, 0, nr_cpu_ids) { + struct cpu_ctx *cpuc; + u32 cap, cur; + + if (!bpf_cpumask_test_cpu(i, online)) + continue; + nr_online_cpus++; + + /* collect the capacity and current cpuperf */ + cap = scx_bpf_cpuperf_cap(i); + cur = scx_bpf_cpuperf_cur(i); + + cur_min = cur < cur_min ? cur : cur_min; + cur_max = cur > cur_max ? cur : cur_max; + + /* + * $cur is relative to $cap. Scale it down accordingly so that + * it's in the same scale as other CPUs and $cur_sum/$cap_sum + * makes sense. + */ + cur_sum += cur * cap / SCX_CPUPERF_ONE; + cap_sum += cap; + + if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) { + scx_bpf_error("failed to look up cpu_ctx"); + goto out; + } + + /* collect target */ + cur = cpuc->cpuperf_target; + target_sum += cur; + target_min = cur < target_min ? cur : target_min; + target_max = cur > target_max ? cur : target_max; + } + + cpuperf_min = cur_min; + cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum; + cpuperf_max = cur_max; + + cpuperf_target_min = target_min; + cpuperf_target_avg = target_sum / nr_online_cpus; + cpuperf_target_max = target_max; +out: + scx_bpf_put_cpumask(online); +} + +static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer) +{ + monitor_cpuperf(); + + bpf_timer_start(timer, ONE_SEC_IN_NS, 0); + return 0; +} + s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) { + u32 key = 0; + struct bpf_timer *timer; + s32 ret; + print_cpus(); - return scx_bpf_create_dsq(SHARED_DSQ, -1); + ret = scx_bpf_create_dsq(SHARED_DSQ, -1); + if (ret) + return ret; + + timer = bpf_map_lookup_elem(&monitor_timer, &key); + if (!timer) + return -ESRCH; + + bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC); + bpf_timer_set_callback(timer, monitor_timerfn); + + return bpf_timer_start(timer, ONE_SEC_IN_NS, 0); } void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei) @@ -509,6 +644,7 @@ SCX_OPS_DEFINE(qmap_ops, .enqueue = (void *)qmap_enqueue, .dequeue = (void *)qmap_dequeue, .dispatch = (void *)qmap_dispatch, + .tick = (void *)qmap_tick, .core_sched_before = (void *)qmap_core_sched_before, .cpu_release = (void *)qmap_cpu_release, .init_task = (void *)qmap_init_task, diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index bc36ec4f88a7..4d41c0cb1dab 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -116,6 +116,14 @@ int main(int argc, char **argv) nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, skel->bss->nr_reenqueued, skel->bss->nr_dequeued, skel->bss->nr_core_sched_execed); + if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur")) + printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n", + skel->bss->cpuperf_min, + skel->bss->cpuperf_avg, + skel->bss->cpuperf_max, + skel->bss->cpuperf_target_min, + skel->bss->cpuperf_target_avg, + skel->bss->cpuperf_target_max); fflush(stdout); sleep(1); } -- 2.34.1
From: David Vernet <void@manifault.com> mainline inclusion from mainline-v6.12-rc1 commit 8a6c6b4b935f1613273f71609b69844babd8ff9e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The scx_bpf_cpuperf_set() kfunc allows a BPF program to set the relative performance target of a specified CPU. Commit d86adb4fc065 ("sched_ext: Add cpuperf support") defined the @cpu argument to be unsigned. Let's update it to be signed to match the norm for the rest of ext.c and the kernel. Note that the kfunc declaration of scx_bpf_cpuperf_set() in the common.bpf.h header in tools/sched_ext already listed the cpu as signed, so this also fixes the build for tools/sched_ext and the sched_ext selftests due to kfunc declarations now being emitted in vmlinux.h based on BTF (thus causing the compiler to error due to observing conflicting types). Fixes: d86adb4fc065 ("sched_ext: Add cpuperf support") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 2c52188574ee..5f1cfb45e22e 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5898,7 +5898,7 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) * use. Consult hardware and cpufreq documentation for more information. The * current performance level can be monitored using scx_bpf_cpuperf_cur(). */ -__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf) +__bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf) { if (unlikely(perf > SCX_CPUPERF_ONE)) { scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit eb4a3b629b4d61e83dc49541185100a297a7e6ba category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- 2a52ca7c9896 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers") added the tools_clean target which is triggered by mrproper. The tools_clean target triggers the sched_ext_clean target in tools/. This unfortunately makes mrproper fail when no BTF enabled kernel image is found: Makefile:83: *** Cannot find a vmlinux for VMLINUX_BTF at any of " ../../vmlinux /sys/kernel/btf/vmlinux/boot/vmlinux-4.15.0-136-generic". Stop. Makefile:192: recipe for target 'sched_ext_clean' failed make[2]: *** [sched_ext_clean] Error 2 Makefile:1361: recipe for target 'sched_ext' failed make[1]: *** [sched_ext] Error 2 Makefile:240: recipe for target '__sub-make' failed make: *** [__sub-make] Error 2 Clean targets shouldn't fail like this but also it's really odd for mrproper to single out and trigger the sched_ext_clean target when no other clean targets under tools/ are triggered. Fix builds by dropping the tools_clean target from the top-level Makefile. The offending Makefile line is shared across BPF targets under tools/. Let's revisit them later. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jon Hunter <jonathanh@nvidia.com> Link: http://lkml.kernel.org/r/ac065f1f-8754-4626-95db-2c9fcf02567b@nvidia.com Fixes: 2a52ca7c9896 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers") Cc: David Vernet <void@manifault.com> Conflicts: Makefile [Conflicts with f0b057292e10 ("tools: fix annoying "mkdir -p ..." logs when building tools in parallel"), so only cherry-pick the main part of this mainline patch.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Makefile | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/Makefile b/Makefile index 96583a016426..b62b39c18643 100644 --- a/Makefile +++ b/Makefile @@ -1358,12 +1358,6 @@ ifneq ($(wildcard $(resolve_btfids_O)),) $(Q)$(MAKE) -sC $(srctree)/tools/bpf/resolve_btfids O=$(resolve_btfids_O) clean endif -tools-clean-targets := sched_ext -PHONY += $(tools-clean-targets) -$(tools-clean-targets): - $(Q)$(MAKE) -sC tools $@_clean -tools_clean: $(tools-clean-targets) - tools/: FORCE $(Q)mkdir -p $(objtree)/tools $(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/ @@ -1528,7 +1522,7 @@ PHONY += $(mrproper-dirs) mrproper $(mrproper-dirs): $(Q)$(MAKE) $(clean)=$(patsubst _mrproper_%,%,$@) -mrproper: clean $(mrproper-dirs) tools_clean +mrproper: clean $(mrproper-dirs) $(call cmd,rmfiles) @find . $(RCS_FIND_IGNORE) \ \( -name '*.rmeta' \) \ -- 2.34.1
From: Colin Ian King <colin.i.king@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit f97dcd0fcf7a95aaf448a9d1a7ed6c95e16dfcdb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- There is a spelling mistake in the help text. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/scx_qmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index 4d41c0cb1dab..e4e3ecffc4cf 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -31,7 +31,7 @@ const char help_fmt[] = " -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n" " -D LEN Set scx_exit_info.dump buffer length\n" " -S Suppress qmap-specific debug dump\n" -" -p Switch only tasks on SCHED_EXT policy intead of all\n" +" -p Switch only tasks on SCHED_EXT policy instead of all\n" " -v Print libbpf debug messages\n" " -h Display this help and exit\n"; -- 2.34.1
From: Andrea Righi <andrea.righi@canonical.com> mainline inclusion from mainline-v6.12-rc1 commit 1ff4f169c9f576a73c529365770d82595c53b6ef category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Correct eight to weight in the description of the .set_weight() operation in sched_ext_ops. Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 5f1cfb45e22e..bbd2b285aef0 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -373,7 +373,7 @@ struct sched_ext_ops { /** * set_weight - Set task weight * @p: task to set weight for - * @weight: new eight [1..10000] + * @weight: new weight [1..10000] * * Update @p's weight to @weight. */ -- 2.34.1
From: Andrea Righi <andrea.righi@canonical.com> mainline inclusion from mainline-v6.12-rc1 commit b5ba2e1a955417e78a6018fb736a14c03df0abcd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Without BTF, attempting to load any sched_ext scheduler will result in an error like the following: libbpf: kernel BTF is missing at '/sys/kernel/btf/vmlinux', was CONFIG_DEBUG_INFO_BTF enabled? This makes sched_ext pretty much unusable, so explicitly depend on CONFIG_DEBUG_INFO_BTF to prevent these issues. Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/Kconfig.preempt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index 5ec2759552f6..de625357e5aa 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -135,7 +135,7 @@ config SCHED_CORE config SCHED_CLASS_EXT bool "Extensible Scheduling Class" - depends on BPF_SYSCALL && BPF_JIT + depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF help This option enables a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit b651d7c39289850b5a0a2c67231dd36117340a2e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- alloc_exit_info() calls kcalloc() but puts in the size of the element as the first argument which triggers the following gcc warning: kernel/sched/ext.c:3815:32: warning: ‘kmalloc_array_noprof’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args] Fix it by swapping the positions of the first two arguments. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: http://lkml.kernel.org/r/ZoG6zreEtQhAUr_2@linux.ibm.com Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index bbd2b285aef0..d843347e83c3 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3817,7 +3817,7 @@ static struct scx_exit_info *alloc_exit_info(size_t exit_dump_len) if (!ei) return NULL; - ei->bt = kcalloc(sizeof(ei->bt[0]), SCX_EXIT_BT_LEN, GFP_KERNEL); + ei->bt = kcalloc(SCX_EXIT_BT_LEN, sizeof(ei->bt[0]), GFP_KERNEL); ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL); ei->dump = kzalloc(exit_dump_len, GFP_KERNEL); -- 2.34.1
From: Aboorva Devarajan <aboorvad@linux.ibm.com> mainline inclusion from mainline-v6.12-rc1 commit 18b2bd03371b64fdb21b31eb48095099d95b56ef category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Updated sched_ext doc to eliminate references to scx_bpf_switch_all, which has been removed in recent sched_ext versions. Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Documentation/scheduler/sched-ext.rst | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index 497eeaa5ecbe..a707d2181a77 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -48,13 +48,16 @@ sched_ext is used only when the BPF scheduler is loaded and running. If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is -loaded. On load, such tasks will be switched to and scheduled by sched_ext. +loaded. -The BPF scheduler can choose to schedule all normal and lower class tasks by -calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this -case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and -``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers, -this mode can be selected with the ``-a`` option. +When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set +in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and +``SCHED_EXT`` tasks are scheduled by sched_ext. + +However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is +set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled +by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and +``SCHED_IDLE`` policies are scheduled by CFS. Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or detection of any internal error including stalled runnable tasks aborts the @@ -109,7 +112,7 @@ Userspace can implement an arbitrary BPF scheduler by loading a set of BPF programs that implement ``struct sched_ext_ops``. The only mandatory field is ``ops.name`` which must be a valid BPF object name. All operations are optional. The following modified excerpt is from -``tools/sched/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. +``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. .. code-block:: c @@ -156,13 +159,12 @@ optional. The following modified excerpt is from scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); } - s32 BPF_STRUCT_OPS(simple_init) + s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) { /* - * All SCHED_OTHER, SCHED_IDLE, and SCHED_BATCH tasks should - * use sched_ext. + * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and + * SCHED_BATCH tasks should use sched_ext. */ - scx_bpf_switch_all(); return 0; } -- 2.34.1
From: Hongyan Xia <hongyan.xia2@arm.com> mainline inclusion from mainline-v6.12-rc1 commit 6203ef73fa5c0358f7960b038628259be1448724 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- rq contains many useful fields to implement a custom scheduler. For example, various clock signals like clock_task and clock_pelt can be used to track load. It also contains stats in other sched_classes, which are useful to drive scheduling decisions in ext. tj: Put the new helper below scx_bpf_task_*() helpers. Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 13 +++++++++++++ tools/sched_ext/include/scx/common.bpf.h | 1 + 2 files changed, 14 insertions(+) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index d843347e83c3..f1db80cda121 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -6117,6 +6117,18 @@ __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p) return task_cpu(p); } +/** + * scx_bpf_cpu_rq - Fetch the rq of a CPU + * @cpu: CPU of the rq + */ +__bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu) +{ + if (!ops_cpu_valid(cpu, NULL)) + return NULL; + + return cpu_rq(cpu); +} + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_any) @@ -6141,6 +6153,7 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_cpu_rq) BTF_KFUNCS_END(scx_kfunc_ids_any) static const struct btf_kfunc_id_set scx_kfunc_set_any = { diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index dbbda0e35c5d..965d20324114 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -57,6 +57,7 @@ s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; bool scx_bpf_task_running(const struct task_struct *p) __ksym; s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym; +struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym; static inline __attribute__((format(printf, 1, 2))) void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {} -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 60564acbef5c43eeb11ecadf6efe17ac255a80b1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- sched_fork() returns with -EAGAIN if dl_prio(@p). a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class") added scx_pre_fork() call before it and then scx_cancel_fork() on the exit path. This is silly as the dl_prio() block can just be moved above the scx_pre_fork() call. Move the dl_prio() block above the scx_pre_fork() call and remove the now unnecessary scx_cancel_fork() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a7166f5c891d..4c18785cae46 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4638,8 +4638,6 @@ late_initcall(sched_core_sysctl_init); */ int sched_fork(unsigned long clone_flags, struct task_struct *p) { - int ret; - __sched_fork(clone_flags, p); /* * We mark the process as NEW here. This guarantees that @@ -4676,12 +4674,12 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) p->sched_reset_on_fork = 0; } + if (dl_prio(p->prio)) + return -EAGAIN; + scx_pre_fork(p); - if (dl_prio(p->prio)) { - ret = -EAGAIN; - goto out_cancel; - } else if (rt_prio(p->prio)) { + if (rt_prio(p->prio)) { p->sched_class = &rt_sched_class; #ifdef CONFIG_SCHED_CLASS_EXT } else if (task_should_scx(p)) { @@ -4707,10 +4705,6 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) RB_CLEAR_NODE(&p->pushable_dl_tasks); #endif return 0; - -out_cancel: - scx_cancel_fork(p); - return ret; } int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e98abd22fbcada509f776229af688b7f74d6cdba category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- When initializing p->scx.weight, scx_ops_enable_task() wasn't considering whether the task is SCHED_IDLE. Update it to use WEIGHT_IDLEPRIO as the source weight for SCHED_IDLE tasks. This leaves reweight_task_scx() the sole user of set_task_scx_weight(). Open code it. @weight is going to be provided by sched core in the future anyway. v2: Use the newly available @lw->weight to set @p->scx.weight in reweight_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index f1db80cda121..21352c70edec 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3242,22 +3242,23 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool return 0; } -static void set_task_scx_weight(struct task_struct *p) -{ - u32 weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO]; - - p->scx.weight = sched_weight_to_cgroup(weight); -} - static void scx_ops_enable_task(struct task_struct *p) { + u32 weight; + lockdep_assert_rq_held(task_rq(p)); /* * Set the weight before calling ops.enable() so that the scheduler * doesn't see a stale value if they inspect the task struct. */ - set_task_scx_weight(p); + if (task_has_idle_policy(p)) + weight = WEIGHT_IDLEPRIO; + else + weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO]; + + p->scx.weight = sched_weight_to_cgroup(weight); + if (SCX_HAS_OP(enable)) SCX_CALL_OP_TASK(SCX_KF_REST, enable, p); scx_set_task_state(p, SCX_TASK_ENABLED); @@ -3412,7 +3413,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, const struct { lockdep_assert_rq_held(task_rq(p)); - set_task_scx_weight(p); + p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); if (SCX_HAS_OP(set_weight)) SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 9f391f94a1730232ad2760202755b2d9baf4688d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- sched_domains regulate the load balancing for sched_classes. A machine can be partitioned into multiple sections that are not load-balanced across using either isolcpus= boot param or cpuset partitions. In such cases, tasks that are in one partition are expected to stay within that partition. cpuset configured partitions are always reflected in each member task's cpumask. As SCX always honors the task cpumasks, the BPF scheduler is automatically in compliance with the configured partitions. However, for isolcpus= domain isolation, the isolated CPUs are simply omitted from the top-level sched_domain[s] without further restrictions on tasks' cpumasks, so, for example, a task currently running in an isolated CPU may have more CPUs in its allowed cpumask while expected to remain on the same CPU. There is no straightforward way to enforce this partitioning preemptively on BPF schedulers and erroring out after a violation can be surprising. isolcpus= domain isolation is being replaced with cpuset partitions anyway, so keep it simple and simply disallow loading a BPF scheduler if isolcpus= domain isolation is in effect. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20240626082342.GY31592@noisy.programming.kicks-ass.... Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/build_policy.c | 1 + kernel/sched/ext.c | 6 ++++++ 2 files changed, 7 insertions(+) diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c index 4ae066f08cc9..07c93846999b 100644 --- a/kernel/sched/build_policy.c +++ b/kernel/sched/build_policy.c @@ -16,6 +16,7 @@ #include <linux/sched/clock.h> #include <linux/sched/cputime.h> #include <linux/sched/hotplug.h> +#include <linux/sched/isolation.h> #include <linux/sched/posix-timers.h> #include <linux/sched/rt.h> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 21352c70edec..6c50565e6b4e 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4403,6 +4403,12 @@ static int scx_ops_enable(struct sched_ext_ops *ops) unsigned long timeout; int i, cpu, ret; + if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN), + cpu_possible_mask)) { + pr_err("sched_ext: Not compatible with \"isolcpus=\" domain isolation"); + return -EINVAL; + } + mutex_lock(&scx_ops_enable_mutex); if (!scx_ops_helper) { -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 6ab228ecc3fd9e0e3cdda32e56e39faac0dc500f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- - scx_ops_cpu_preempt is only used in kernel/sched/ext.c and doesn't need to be global. Make it static. - Relocate task_on_scx() so that the inline functions are located together. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 +- kernel/sched/ext.h | 12 +++++------- 2 files changed, 6 insertions(+), 8 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 6c50565e6b4e..47a6ffaf61e7 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -775,7 +775,7 @@ static bool scx_warned_zero_slice; static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last); static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting); -DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt); +static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt); static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); struct static_key_false scx_has_op[SCX_OPI_END] = diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 0a7b9a34b18f..229007693504 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -26,13 +26,6 @@ DECLARE_STATIC_KEY_FALSE(__scx_switched_all); #define scx_enabled() static_branch_unlikely(&__scx_ops_enabled) #define scx_switched_all() static_branch_unlikely(&__scx_switched_all) -DECLARE_STATIC_KEY_FALSE(scx_ops_cpu_preempt); - -static inline bool task_on_scx(const struct task_struct *p) -{ - return scx_enabled() && p->sched_class == &ext_sched_class; -} - void scx_tick(struct rq *rq); void init_scx_entity(struct sched_ext_entity *scx); void scx_pre_fork(struct task_struct *p); @@ -54,6 +47,11 @@ static inline u32 scx_cpuperf_target(s32 cpu) return 0; } +static inline bool task_on_scx(const struct task_struct *p) +{ + return scx_enabled() && p->sched_class == &ext_sched_class; +} + static inline const struct sched_class *next_active_class(const struct sched_class *class) { class++; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 744d83601ffa11ebbca52c0ec0b039e269d05054 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- For flexibility, sched_ext allows the BPF scheduler to select the CPU to execute a task on at dispatch time so that e.g. a queue can be shared across multiple CPUs. To enable this, the dispatch path is executed from balance() so that a dispatched task can be hot-migrated to its target CPU. This means that sched_ext needs its balance() method invoked before every pick_next_task() even when the CPU is waking up from SCHED_IDLE. for_balance_class_range() defined in kernel/sched/ext.h implements this selective iteration promotion. However, the indirection obfuscates more than helps. Open code the iteration promotion in put_prev_task_balance() and remove for_balance_class_range(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 14 +++++++++++++- kernel/sched/ext.h | 9 --------- 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4c18785cae46..aabea3baed41 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5894,7 +5894,19 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { #ifdef CONFIG_SMP + const struct sched_class *start_class = prev->sched_class; const struct sched_class *class; + +#ifdef CONFIG_SCHED_CLASS_EXT + /* + * SCX requires a balance() call before every pick_next_task() including + * when waking up from SCHED_IDLE. If @start_class is below SCX, start + * from SCX instead. + */ + if (sched_class_above(&ext_sched_class, start_class)) + start_class = &ext_sched_class; +#endif + /* * We must do the balancing pass before put_prev_task(), such * that when we release the rq->lock the task is in the same @@ -5903,7 +5915,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, * We can terminate the balance pass as soon as we know there is * a runnable task of @class priority or higher. */ - for_balance_class_range(class, prev->sched_class, &idle_sched_class) { + for_active_class_range(class, start_class, &idle_sched_class) { if (class->balance(rq, prev, rf)) break; } diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 229007693504..1d7837bdfaba 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -68,14 +68,6 @@ static inline const struct sched_class *next_active_class(const struct sched_cla #define for_each_active_class(class) \ for_active_class_range(class, __sched_class_highest, __sched_class_lowest) -/* - * SCX requires a balance() call before every pick_next_task() call including - * when waking up from idle. - */ -#define for_balance_class_range(class, prev_class, end_class) \ - for_active_class_range(class, (prev_class) > &ext_sched_class ? \ - &ext_sched_class : (prev_class), (end_class)) - #ifdef CONFIG_SCHED_CORE bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, bool in_fi); @@ -100,7 +92,6 @@ static inline bool task_on_scx(const struct task_struct *p) { return false; } static inline void init_sched_ext_class(void) {} #define for_each_active_class for_each_class -#define for_balance_class_range for_class_range #endif /* CONFIG_SCHED_CLASS_EXT */ -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e196c908f92795e76377d2392a16f9fd5d508a61 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- While sched_ext was out of tree, everything sched_ext specific which can be put in kernel/sched/ext.h was put there to ease forward porting. However, kernel/sched/sched.h is the better location for some of them. Relocate. - struct sched_enq_and_set_ctx, sched_deq_and_put_task() and sched_enq_and_set_task(). - scx_enabled() and scx_switched_all(). - for_active_class_range() and for_each_active_class(). sched_class declarations are moved above the class iterators for this. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Conflicts: kernel/sched/sched.h [Conflicts with 645a1ba256ef ("sched: topology: Build soft domain for LLC"), so accept both.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.h | 39 -------------------------- kernel/sched/sched.h | 65 ++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 59 insertions(+), 45 deletions(-) diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 1d7837bdfaba..32d3a51f591a 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -8,24 +8,6 @@ */ #ifdef CONFIG_SCHED_CLASS_EXT -struct sched_enq_and_set_ctx { - struct task_struct *p; - int queue_flags; - bool queued; - bool running; -}; - -void sched_deq_and_put_task(struct task_struct *p, int queue_flags, - struct sched_enq_and_set_ctx *ctx); -void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx); - -extern const struct sched_class ext_sched_class; - -DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled); -DECLARE_STATIC_KEY_FALSE(__scx_switched_all); -#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled) -#define scx_switched_all() static_branch_unlikely(&__scx_switched_all) - void scx_tick(struct rq *rq); void init_scx_entity(struct sched_ext_entity *scx); void scx_pre_fork(struct task_struct *p); @@ -52,22 +34,6 @@ static inline bool task_on_scx(const struct task_struct *p) return scx_enabled() && p->sched_class == &ext_sched_class; } -static inline const struct sched_class *next_active_class(const struct sched_class *class) -{ - class++; - if (scx_switched_all() && class == &fair_sched_class) - class++; - if (!scx_enabled() && class == &ext_sched_class) - class++; - return class; -} - -#define for_active_class_range(class, _from, _to) \ - for (class = (_from); class != (_to); class = next_active_class(class)) - -#define for_each_active_class(class) \ - for_active_class_range(class, __sched_class_highest, __sched_class_lowest) - #ifdef CONFIG_SCHED_CORE bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, bool in_fi); @@ -75,9 +41,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, #else /* CONFIG_SCHED_CLASS_EXT */ -#define scx_enabled() false -#define scx_switched_all() false - static inline void scx_tick(struct rq *rq) {} static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } @@ -91,8 +54,6 @@ static inline int scx_check_setscheduler(struct task_struct *p, int policy) { re static inline bool task_on_scx(const struct task_struct *p) { return false; } static inline void init_sched_ext_class(void) {} -#define for_each_active_class for_each_class - #endif /* CONFIG_SCHED_CLASS_EXT */ #if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index db7edcf28138..8b0aeaa6315f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2613,19 +2613,54 @@ const struct sched_class name##_sched_class \ extern struct sched_class __sched_class_highest[]; extern struct sched_class __sched_class_lowest[]; +extern const struct sched_class stop_sched_class; +extern const struct sched_class dl_sched_class; +extern const struct sched_class rt_sched_class; +extern const struct sched_class fair_sched_class; +extern const struct sched_class idle_sched_class; + +#ifdef CONFIG_SCHED_CLASS_EXT +extern const struct sched_class ext_sched_class; + +DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled); /* SCX BPF scheduler loaded */ +DECLARE_STATIC_KEY_FALSE(__scx_switched_all); /* all fair class tasks on SCX */ + +#define scx_enabled() static_branch_unlikely(&__scx_ops_enabled) +#define scx_switched_all() static_branch_unlikely(&__scx_switched_all) +#else /* !CONFIG_SCHED_CLASS_EXT */ +#define scx_enabled() false +#define scx_switched_all() false +#endif /* !CONFIG_SCHED_CLASS_EXT */ + +/* + * Iterate only active classes. SCX can take over all fair tasks or be + * completely disabled. If the former, skip fair. If the latter, skip SCX. + */ +static inline const struct sched_class *next_active_class(const struct sched_class *class) +{ + class++; +#ifdef CONFIG_SCHED_CLASS_EXT + if (scx_switched_all() && class == &fair_sched_class) + class++; + if (!scx_enabled() && class == &ext_sched_class) + class++; +#endif + return class; +} + #define for_class_range(class, _from, _to) \ for (class = (_from); class < (_to); class++) #define for_each_class(class) \ for_class_range(class, __sched_class_highest, __sched_class_lowest) -#define sched_class_above(_a, _b) ((_a) < (_b)) +#define for_active_class_range(class, _from, _to) \ + for (class = (_from); class != (_to); class = next_active_class(class)) -extern const struct sched_class stop_sched_class; -extern const struct sched_class dl_sched_class; -extern const struct sched_class rt_sched_class; -extern const struct sched_class fair_sched_class; -extern const struct sched_class idle_sched_class; +#define for_each_active_class(class) \ + for_active_class_range(class, __sched_class_highest, __sched_class_lowest) + +#define sched_class_above(_a, _b) ((_a) < (_b)) static inline bool sched_stop_runnable(struct rq *rq) { @@ -4009,6 +4044,24 @@ static inline int destroy_soft_domain(struct task_group *tg) } #endif +#ifdef CONFIG_SCHED_CLASS_EXT +/* + * Used by SCX in the enable/disable paths to move tasks between sched_classes + * and establish invariants. + */ +struct sched_enq_and_set_ctx { + struct task_struct *p; + int queue_flags; + bool queued; + bool running; +}; + +void sched_deq_and_put_task(struct task_struct *p, int queue_flags, + struct sched_enq_and_set_ctx *ctx); +void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx); + +#endif /* CONFIG_SCHED_CLASS_EXT */ + #include "ext.h" #endif /* _KERNEL_SCHED_SCHED_H */ -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit d4af01c3731ff9c6e224d7183f8226a56d72b56c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- struct scx_dsq_node contains two data structure nodes to link the containing task to a DSQ and a flags field that is protected by the lock of the associated DSQ. One reason why they are grouped into a struct is to use the type independently as a cursor node when iterating tasks on a DSQ. However, when iterating, the cursor only needs to be linked on the FIFO list and the rb_node part ends up inflating the size of the iterator data structure unnecessarily making it potentially too expensive to place it on stack. Take ->priq and ->flags out of scx_dsq_node and put them in sched_ext_entity as ->dsq_priq and ->dsq_flags, respectively. scx_dsq_node is renamed to scx_dsq_list_node and the field names are renamed accordingly. This will help implementing DSQ task iterator that can be allocated on stack. No functional change intended. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 10 ++++---- init/init_task.c | 2 +- kernel/sched/ext.c | 54 +++++++++++++++++++-------------------- 3 files changed, 33 insertions(+), 33 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index fe9a67ffe6b1..eb9cfd18a923 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -121,10 +121,8 @@ enum scx_kf_mask { __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, }; -struct scx_dsq_node { - struct list_head list; /* dispatch order */ - struct rb_node priq; /* p->scx.dsq_vtime order */ - u32 flags; /* SCX_TASK_DSQ_* flags */ +struct scx_dsq_list_node { + struct list_head node; }; /* @@ -133,7 +131,9 @@ struct scx_dsq_node { */ struct sched_ext_entity { struct scx_dispatch_q *dsq; - struct scx_dsq_node dsq_node; /* protected by dsq lock */ + struct scx_dsq_list_node dsq_list; /* dispatch order */ + struct rb_node dsq_priq; /* p->scx.dsq_vtime order */ + u32 dsq_flags; /* protected by DSQ lock */ u32 flags; /* protected by rq lock */ u32 weight; s32 sticky_cpu; diff --git a/init/init_task.c b/init/init_task.c index c744b3fe2cd2..7f73a1aa092d 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -115,7 +115,7 @@ struct task_struct init_task #endif #ifdef CONFIG_SCHED_CLASS_EXT .scx = { - .dsq_node.list = LIST_HEAD_INIT(init_task.scx.dsq_node.list), + .dsq_list.node = LIST_HEAD_INIT(init_task.scx.dsq_list.node), .sticky_cpu = -1, .holding_cpu = -1, .runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node), diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 47a6ffaf61e7..12263ccf9aa9 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1364,9 +1364,9 @@ static bool scx_dsq_priq_less(struct rb_node *node_a, const struct rb_node *node_b) { const struct task_struct *a = - container_of(node_a, struct task_struct, scx.dsq_node.priq); + container_of(node_a, struct task_struct, scx.dsq_priq); const struct task_struct *b = - container_of(node_b, struct task_struct, scx.dsq_node.priq); + container_of(node_b, struct task_struct, scx.dsq_priq); return time_before64(a->scx.dsq_vtime, b->scx.dsq_vtime); } @@ -1382,9 +1382,9 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, { bool is_local = dsq->id == SCX_DSQ_LOCAL; - WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node.list)); - WARN_ON_ONCE((p->scx.dsq_node.flags & SCX_TASK_DSQ_ON_PRIQ) || - !RB_EMPTY_NODE(&p->scx.dsq_node.priq)); + WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node)); + WARN_ON_ONCE((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) || + !RB_EMPTY_NODE(&p->scx.dsq_priq)); if (!is_local) { raw_spin_lock(&dsq->lock); @@ -1423,21 +1423,21 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, scx_ops_error("DSQ ID 0x%016llx already had FIFO-enqueued tasks", dsq->id); - p->scx.dsq_node.flags |= SCX_TASK_DSQ_ON_PRIQ; - rb_add(&p->scx.dsq_node.priq, &dsq->priq, scx_dsq_priq_less); + p->scx.dsq_flags |= SCX_TASK_DSQ_ON_PRIQ; + rb_add(&p->scx.dsq_priq, &dsq->priq, scx_dsq_priq_less); /* * Find the previous task and insert after it on the list so * that @dsq->list is vtime ordered. */ - rbp = rb_prev(&p->scx.dsq_node.priq); + rbp = rb_prev(&p->scx.dsq_priq); if (rbp) { struct task_struct *prev = container_of(rbp, struct task_struct, - scx.dsq_node.priq); - list_add(&p->scx.dsq_node.list, &prev->scx.dsq_node.list); + scx.dsq_priq); + list_add(&p->scx.dsq_list.node, &prev->scx.dsq_list.node); } else { - list_add(&p->scx.dsq_node.list, &dsq->list); + list_add(&p->scx.dsq_list.node, &dsq->list); } } else { /* a FIFO DSQ shouldn't be using PRIQ enqueuing */ @@ -1446,9 +1446,9 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, dsq->id); if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) - list_add(&p->scx.dsq_node.list, &dsq->list); + list_add(&p->scx.dsq_list.node, &dsq->list); else - list_add_tail(&p->scx.dsq_node.list, &dsq->list); + list_add_tail(&p->scx.dsq_list.node, &dsq->list); } dsq_mod_nr(dsq, 1); @@ -1491,18 +1491,18 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, static void task_unlink_from_dsq(struct task_struct *p, struct scx_dispatch_q *dsq) { - if (p->scx.dsq_node.flags & SCX_TASK_DSQ_ON_PRIQ) { - rb_erase(&p->scx.dsq_node.priq, &dsq->priq); - RB_CLEAR_NODE(&p->scx.dsq_node.priq); - p->scx.dsq_node.flags &= ~SCX_TASK_DSQ_ON_PRIQ; + if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) { + rb_erase(&p->scx.dsq_priq, &dsq->priq); + RB_CLEAR_NODE(&p->scx.dsq_priq); + p->scx.dsq_flags &= ~SCX_TASK_DSQ_ON_PRIQ; } - list_del_init(&p->scx.dsq_node.list); + list_del_init(&p->scx.dsq_list.node); } static bool task_linked_on_dsq(struct task_struct *p) { - return !list_empty(&p->scx.dsq_node.list); + return !list_empty(&p->scx.dsq_list.node); } static void dispatch_dequeue(struct rq *rq, struct task_struct *p) @@ -1527,8 +1527,8 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p) raw_spin_lock(&dsq->lock); /* - * Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_node - * can't change underneath us. + * Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_* can't + * change underneath us. */ if (p->scx.holding_cpu < 0) { /* @p must still be on @dsq, dequeue */ @@ -2039,7 +2039,7 @@ static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq, /* @dsq is locked and @p is on this rq */ WARN_ON_ONCE(p->scx.holding_cpu >= 0); task_unlink_from_dsq(p, dsq); - list_add_tail(&p->scx.dsq_node.list, &rq->scx.local_dsq.list); + list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list); dsq_mod_nr(dsq, -1); dsq_mod_nr(&rq->scx.local_dsq, 1); p->scx.dsq = &rq->scx.local_dsq; @@ -2114,7 +2114,7 @@ static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf, raw_spin_lock(&dsq->lock); - list_for_each_entry(p, &dsq->list, scx.dsq_node.list) { + list_for_each_entry(p, &dsq->list, scx.dsq_list.node) { struct rq *task_rq = task_rq(p); if (rq == task_rq) { @@ -2633,7 +2633,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p) static struct task_struct *first_local_task(struct rq *rq) { return list_first_entry_or_null(&rq->scx.local_dsq.list, - struct task_struct, scx.dsq_node.list); + struct task_struct, scx.dsq_list.node); } static struct task_struct *pick_next_task_scx(struct rq *rq) @@ -3314,8 +3314,8 @@ void init_scx_entity(struct sched_ext_entity *scx) */ memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node)); - INIT_LIST_HEAD(&scx->dsq_node.list); - RB_CLEAR_NODE(&scx->dsq_node.priq); + INIT_LIST_HEAD(&scx->dsq_list.node); + RB_CLEAR_NODE(&scx->dsq_priq); scx->sticky_cpu = -1; scx->holding_cpu = -1; INIT_LIST_HEAD(&scx->runnable_node); @@ -4164,7 +4164,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies)); dump_line(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu", scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK, - p->scx.dsq_node.flags, ops_state & SCX_OPSS_STATE_MASK, + p->scx.dsq_flags, ops_state & SCX_OPSS_STATE_MASK, ops_state >> SCX_OPSS_QSEQ_SHIFT); dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s dsq_vtime=%llu", p->scx.sticky_cpu, p->scx.holding_cpu, dsq_id_buf, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 650ba21b131ed1f8ee57826b2c6295a3be221132 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- DSQs are very opaque in the consumption path. The BPF scheduler has no way of knowing which tasks are being considered and which is picked. This patch adds BPF DSQ iterator. - Allows iterating tasks queued on a DSQ in the dispatch order or reverse from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs directly. - Has ordering guarantee where only tasks which were already queued when the iteration started are visible and consumable during the iteration. v5: - Add a comment to the naked list_empty(&dsq->list) test in consume_dispatch_q() to explain the reasoning behind the lockless test and by extension why nldsq_next_task() isn't used there. - scx_qmap changes separated into its own patch. v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong type for the last argument (bool rev instead of u64 flags). Fix it. v3: - Alexei pointed out that the iterator is too big to allocate on stack. Added a prep patch to reduce the size of the cursor. Now bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on 64bit. - u32_before() comparison factored out. v2: - scx_bpf_consume_task() is separated out into a separate patch. - DSQ seq and iter flags don't need to be u64. Use u32. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 3 + kernel/sched/ext.c | 192 ++++++++++++++++++++++- tools/sched_ext/include/scx/common.bpf.h | 3 + 3 files changed, 196 insertions(+), 2 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index eb9cfd18a923..593d2f4909dd 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -61,6 +61,7 @@ struct scx_dispatch_q { struct list_head list; /* tasks in dispatch order */ struct rb_root priq; /* used to order by p->scx.dsq_vtime */ u32 nr; + u32 seq; /* used by BPF iter */ u64 id; struct rhash_head hash_node; struct llist_node free_node; @@ -123,6 +124,7 @@ enum scx_kf_mask { struct scx_dsq_list_node { struct list_head node; + bool is_bpf_iter_cursor; }; /* @@ -133,6 +135,7 @@ struct sched_ext_entity { struct scx_dispatch_q *dsq; struct scx_dsq_list_node dsq_list; /* dispatch order */ struct rb_node dsq_priq; /* p->scx.dsq_vtime order */ + u32 dsq_seq; u32 dsq_flags; /* protected by DSQ lock */ u32 flags; /* protected by rq lock */ u32 weight; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 12263ccf9aa9..98c52f96fa0f 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -930,6 +930,11 @@ static u32 highest_bit(u32 flags) return ((u64)1 << bit) >> 1; } +static bool u32_before(u32 a, u32 b) +{ + return (s32)(a - b) < 0; +} + /* * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate @@ -1070,6 +1075,73 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask, return true; } +/** + * nldsq_next_task - Iterate to the next task in a non-local DSQ + * @dsq: user dsq being interated + * @cur: current position, %NULL to start iteration + * @rev: walk backwards + * + * Returns %NULL when iteration is finished. + */ +static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq, + struct task_struct *cur, bool rev) +{ + struct list_head *list_node; + struct scx_dsq_list_node *dsq_lnode; + + lockdep_assert_held(&dsq->lock); + + if (cur) + list_node = &cur->scx.dsq_list.node; + else + list_node = &dsq->list; + + /* find the next task, need to skip BPF iteration cursors */ + do { + if (rev) + list_node = list_node->prev; + else + list_node = list_node->next; + + if (list_node == &dsq->list) + return NULL; + + dsq_lnode = container_of(list_node, struct scx_dsq_list_node, + node); + } while (dsq_lnode->is_bpf_iter_cursor); + + return container_of(dsq_lnode, struct task_struct, scx.dsq_list); +} + +#define nldsq_for_each_task(p, dsq) \ + for ((p) = nldsq_next_task((dsq), NULL, false); (p); \ + (p) = nldsq_next_task((dsq), (p), false)) + + +/* + * BPF DSQ iterator. Tasks in a non-local DSQ can be iterated in [reverse] + * dispatch order. BPF-visible iterator is opaque and larger to allow future + * changes without breaking backward compatibility. Can be used with + * bpf_for_each(). See bpf_iter_scx_dsq_*(). + */ +enum scx_dsq_iter_flags { + /* iterate in the reverse dispatch order */ + SCX_DSQ_ITER_REV = 1U << 0, + + __SCX_DSQ_ITER_ALL_FLAGS = SCX_DSQ_ITER_REV, +}; + +struct bpf_iter_scx_dsq_kern { + struct scx_dsq_list_node cursor; + struct scx_dispatch_q *dsq; + u32 dsq_seq; + u32 flags; +} __attribute__((aligned(8))); + +struct bpf_iter_scx_dsq { + u64 __opaque[6]; +} __attribute__((aligned(8))); + /* * SCX task iterator. @@ -1419,7 +1491,7 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, * tested easily when adding the first task. */ if (unlikely(RB_EMPTY_ROOT(&dsq->priq) && - !list_empty(&dsq->list))) + nldsq_next_task(dsq, NULL, false))) scx_ops_error("DSQ ID 0x%016llx already had FIFO-enqueued tasks", dsq->id); @@ -1451,6 +1523,10 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, list_add_tail(&p->scx.dsq_list.node, &dsq->list); } + /* seq records the order tasks are queued, used by BPF DSQ iterator */ + dsq->seq++; + p->scx.dsq_seq = dsq->seq; + dsq_mod_nr(dsq, 1); p->scx.dsq = dsq; @@ -2109,12 +2185,17 @@ static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf, { struct task_struct *p; retry: + /* + * The caller can't expect to successfully consume a task if the task's + * addition to @dsq isn't guaranteed to be visible somehow. Test + * @dsq->list without locking and skip if it seems empty. + */ if (list_empty(&dsq->list)) return false; raw_spin_lock(&dsq->lock); - list_for_each_entry(p, &dsq->list, scx.dsq_list.node) { + nldsq_for_each_task(p, dsq) { struct rq *task_rq = task_rq(p); if (rq == task_rq) { @@ -5709,6 +5790,110 @@ __bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id) destroy_dsq(dsq_id); } +/** + * bpf_iter_scx_dsq_new - Create a DSQ iterator + * @it: iterator to initialize + * @dsq_id: DSQ to iterate + * @flags: %SCX_DSQ_ITER_* + * + * Initialize BPF iterator @it which can be used with bpf_for_each() to walk + * tasks in the DSQ specified by @dsq_id. Iteration using @it only includes + * tasks which are already queued when this function is invoked. + */ +__bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, + u64 flags) +{ + struct bpf_iter_scx_dsq_kern *kit = (void *)it; + + BUILD_BUG_ON(sizeof(struct bpf_iter_scx_dsq_kern) > + sizeof(struct bpf_iter_scx_dsq)); + BUILD_BUG_ON(__alignof__(struct bpf_iter_scx_dsq_kern) != + __alignof__(struct bpf_iter_scx_dsq)); + + if (flags & ~__SCX_DSQ_ITER_ALL_FLAGS) + return -EINVAL; + + kit->dsq = find_non_local_dsq(dsq_id); + if (!kit->dsq) + return -ENOENT; + + INIT_LIST_HEAD(&kit->cursor.node); + kit->cursor.is_bpf_iter_cursor = true; + kit->dsq_seq = READ_ONCE(kit->dsq->seq); + kit->flags = flags; + + return 0; +} + +/** + * bpf_iter_scx_dsq_next - Progress a DSQ iterator + * @it: iterator to progress + * + * Return the next task. See bpf_iter_scx_dsq_new(). + */ +__bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) +{ + struct bpf_iter_scx_dsq_kern *kit = (void *)it; + bool rev = kit->flags & SCX_DSQ_ITER_REV; + struct task_struct *p; + unsigned long flags; + + if (!kit->dsq) + return NULL; + + raw_spin_lock_irqsave(&kit->dsq->lock, flags); + + if (list_empty(&kit->cursor.node)) + p = NULL; + else + p = container_of(&kit->cursor, struct task_struct, scx.dsq_list); + + /* + * Only tasks which were queued before the iteration started are + * visible. This bounds BPF iterations and guarantees that vtime never + * jumps in the other direction while iterating. + */ + do { + p = nldsq_next_task(kit->dsq, p, rev); + } while (p && unlikely(u32_before(kit->dsq_seq, p->scx.dsq_seq))); + + if (p) { + if (rev) + list_move_tail(&kit->cursor.node, &p->scx.dsq_list.node); + else + list_move(&kit->cursor.node, &p->scx.dsq_list.node); + } else { + list_del_init(&kit->cursor.node); + } + + raw_spin_unlock_irqrestore(&kit->dsq->lock, flags); + + return p; +} + +/** + * bpf_iter_scx_dsq_destroy - Destroy a DSQ iterator + * @it: iterator to destroy + * + * Undo scx_iter_scx_dsq_new(). + */ +__bpf_kfunc void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) +{ + struct bpf_iter_scx_dsq_kern *kit = (void *)it; + + if (!kit->dsq) + return; + + if (!list_empty(&kit->cursor.node)) { + unsigned long flags; + + raw_spin_lock_irqsave(&kit->dsq->lock, flags); + list_del_init(&kit->cursor.node); + raw_spin_unlock_irqrestore(&kit->dsq->lock, flags); + } + kit->dsq = NULL; +} + __bpf_kfunc_end_defs(); static s32 __bstr_format(u64 *data_buf, char *line_buf, size_t line_size, @@ -6142,6 +6327,9 @@ BTF_KFUNCS_START(scx_kfunc_ids_any) BTF_ID_FLAGS(func, scx_bpf_kick_cpu) BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) +BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED) +BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY) BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS) diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 965d20324114..20280df62857 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -39,6 +39,9 @@ u32 scx_bpf_reenqueue_local(void) __ksym; void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym; s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym; void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; +int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, u64 flags) __ksym __weak; +struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) __ksym __weak; +void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) __ksym __weak; void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak; void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym; void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 6fbd643318a1a5f3caea7f94bfe035efbb293ddb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Implement periodic dumping of the shared DSQ to demonstrate the use of the newly added DSQ iterator. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: bpf@vger.kernel.org Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/scx_qmap.bpf.c | 25 +++++++++++++++++++++++++ tools/sched_ext/scx_qmap.c | 8 ++++++-- 2 files changed, 31 insertions(+), 2 deletions(-) diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index b1d0b09c966e..27e35066a602 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -36,6 +36,7 @@ const volatile u32 stall_user_nth; const volatile u32 stall_kernel_nth; const volatile u32 dsp_inf_loop_after; const volatile u32 dsp_batch; +const volatile bool print_shared_dsq; const volatile s32 disallow_tgid; const volatile bool suppress_dump; @@ -604,10 +605,34 @@ static void monitor_cpuperf(void) scx_bpf_put_cpumask(online); } +/* + * Dump the currently queued tasks in the shared DSQ to demonstrate the usage of + * scx_bpf_dsq_nr_queued() and DSQ iterator. Raise the dispatch batch count to + * see meaningful dumps in the trace pipe. + */ +static void dump_shared_dsq(void) +{ + struct task_struct *p; + s32 nr; + + if (!(nr = scx_bpf_dsq_nr_queued(SHARED_DSQ))) + return; + + bpf_printk("Dumping %d tasks in SHARED_DSQ in reverse order", nr); + + bpf_rcu_read_lock(); + bpf_for_each(scx_dsq, p, SHARED_DSQ, SCX_DSQ_ITER_REV) + bpf_printk("%s[%d]", p->comm, p->pid); + bpf_rcu_read_unlock(); +} + static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer) { monitor_cpuperf(); + if (print_shared_dsq) + dump_shared_dsq(); + bpf_timer_start(timer, ONE_SEC_IN_NS, 0); return 0; } diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index e4e3ecffc4cf..304f0488a386 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -20,7 +20,7 @@ const char help_fmt[] = "See the top-level comment in .bpf.c for more details.\n" "\n" "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n" -" [-d PID] [-D LEN] [-p] [-v]\n" +" [-P] [-d PID] [-D LEN] [-p] [-v]\n" "\n" " -s SLICE_US Override slice duration\n" " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" @@ -28,6 +28,7 @@ const char help_fmt[] = " -T COUNT Stall every COUNT'th kernel thread\n" " -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n" " -b COUNT Dispatch upto COUNT tasks together\n" +" -P Print out DSQ content to trace_pipe every second, use with -b\n" " -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n" " -D LEN Set scx_exit_info.dump buffer length\n" " -S Suppress qmap-specific debug dump\n" @@ -62,7 +63,7 @@ int main(int argc, char **argv) skel = SCX_OPS_OPEN(qmap_ops, scx_qmap); - while ((opt = getopt(argc, argv, "s:e:t:T:l:b:d:D:Spvh")) != -1) { + while ((opt = getopt(argc, argv, "s:e:t:T:l:b:Pd:D:Spvh")) != -1) { switch (opt) { case 's': skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; @@ -82,6 +83,9 @@ int main(int argc, char **argv) case 'b': skel->rodata->dsp_batch = strtoul(optarg, NULL, 0); break; + case 'P': + skel->rodata->print_shared_dsq = true; + break; case 'd': skel->rodata->disallow_tgid = strtol(optarg, NULL, 0); if (skel->rodata->disallow_tgid < 0) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit fd0cf516956a0aaa4d899383ee5c2ff191418b5f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same local DSQ, to avoid processing the same tasks repeatedly, it first takes the number of queued tasks and processes the task at the head of the queue that number of times. This is incorrect as a task can be dispatched to the same local DSQ with SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is exhausted and the succeeding tasks won't be processed at all. Fix it by first moving all candidate tasks to a private list and then processing that list. While at it, remove the WARNs. They're rather superflous as later steps will check them anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()") Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 98c52f96fa0f..94c2ca02830b 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5632,8 +5632,10 @@ __bpf_kfunc_start_defs(); */ __bpf_kfunc u32 scx_bpf_reenqueue_local(void) { - u32 nr_enqueued, i; + LIST_HEAD(tasks); + u32 nr_enqueued = 0; struct rq *rq; + struct task_struct *p, *n; if (!scx_kf_allowed(SCX_KF_CPU_RELEASE)) return 0; @@ -5642,22 +5644,20 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void) lockdep_assert_rq_held(rq); /* - * Get the number of tasks on the local DSQ before iterating over it to - * pull off tasks. The enqueue callback below can signal that it wants - * the task to stay on the local DSQ, and we want to prevent the BPF - * scheduler from causing us to loop indefinitely. + * The BPF scheduler may choose to dispatch tasks back to + * @rq->scx.local_dsq. Move all candidate tasks off to a private list + * first to avoid processing the same tasks repeatedly. */ - nr_enqueued = rq->scx.local_dsq.nr; - for (i = 0; i < nr_enqueued; i++) { - struct task_struct *p; - - p = first_local_task(rq); - WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != - SCX_OPSS_NONE); - WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED)); - WARN_ON_ONCE(p->scx.holding_cpu != -1); + list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, + scx.dsq_list.node) { dispatch_dequeue(rq, p); + list_add_tail(&p->scx.dsq_list.node, &tasks); + } + + list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { + list_del_init(&p->scx.dsq_list.node); do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); + nr_enqueued++; } return nr_enqueued; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e7a6395a889a82edb1cdcebc2c66646aca454658 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- When a running task is migrated to another CPU, the stop_task is used to preempt the running task and migrate it. This, expectedly, invokes ops.cpu_release(). If the BPF scheduler then calls scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ including the task which is being migrated. This creates an unnecessary re-enqueue of a task which is about to be deactivated and re-activated for migration anyway. It can also cause confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its allowed CPUs may not agree while migration is pending. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()") Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 94c2ca02830b..64dd59614612 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5650,6 +5650,20 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void) */ list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, scx.dsq_list.node) { + /* + * If @p is being migrated, @p's current CPU may not agree with + * its allowed CPUs and the migration_cpu_stop is about to + * deactivate and re-activate @p anyway. Skip re-enqueueing. + * + * While racing sched property changes may also dequeue and + * re-enqueue a migrating task while its current CPU and allowed + * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to + * the current local DSQ for running tasks and thus are not + * visible to the BPF scheduler. + */ + if (p->migration_pending) + continue; + dispatch_dequeue(rq, p); list_add_tail(&p->scx.dsq_list.node, &tasks); } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit fc283116d008654c8dd6ccb222372cf011d3bb80 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Move struct balance_callback definition upward so that it's visible to class-specific rq struct definitions. This will be used to embed a struct balance_callback in struct scx_rq. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Conflicts: kernel/sched/sched.h [Only pick the part modified by this mainline commit, and ignore the conflict part caused by higher version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 8b0aeaa6315f..7542f22f1b79 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -688,6 +688,11 @@ do { \ # define u64_u32_load(var) u64_u32_load_copy(var, var##_copy) # define u64_u32_store(var, val) u64_u32_store_copy(var, var##_copy, val) +struct balance_callback { + struct balance_callback *next; + void (*func)(struct rq *rq); +}; + /* CFS-related fields in a runqueue */ struct cfs_rq { struct load_weight load; @@ -1154,11 +1159,6 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used); #endif /* CONFIG_UCLAMP_TASK */ struct rq; -struct balance_callback { - struct balance_callback *next; - void (*func)(struct rq *rq); -}; - /* * This is the main, per-CPU runqueue data structure. * -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 3cf78c5d01d61e6a4289c5856198235011f5e84b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- sched_ext often needs to migrate tasks across CPUs right before execution and thus uses the balance path to dispatch tasks from the BPF scheduler. balance_scx() is called with rq locked and pinned but is passed @rf and thus allowed to unpin and unlock. Currently, @rf is passed down the call stack so the rq lock is unpinned just when double locking is needed. This creates unnecessary complications such as having to explicitly manipulate lock pinning for core scheduling. We also want to use dispatch_to_local_dsq_lock() from other paths which are called with rq locked but unpinned. rq lock handling in the dispatch path is straightforward outside the migration implementation and extending the pinning protection down the callstack doesn't add enough meaningful extra protection to justify the extra complexity. Unpin and repin rq lock from the outer balance_scx() and drop @rf passing and lock pinning handling from the inner functions. UP is updated to call balance_one() instead of balance_scx() to avoid adding NULL @rf handling to balance_scx(). AS this makes balance_scx() unused in UP, it's put inside a CONFIG_SMP block. No functional changes intended outside of lock annotation updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 93 +++++++++++++++++----------------------------- 1 file changed, 34 insertions(+), 59 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 64dd59614612..99211cd93704 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -855,7 +855,6 @@ static u32 scx_dsp_max_batch; struct scx_dsp_ctx { struct rq *rq; - struct rq_flags *rf; u32 cursor; u32 nr_tasks; struct scx_dsp_buf_ent buf[]; @@ -2050,7 +2049,6 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p, /** * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked * @rq: current rq which is locked - * @rf: rq_flags to use when unlocking @rq * @src_rq: rq to move task from * @dst_rq: rq to move task to * @@ -2059,20 +2057,16 @@ static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p, * @rq stays locked isn't important as long as the state is restored after * dispatch_to_local_dsq_unlock(). */ -static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf, - struct rq *src_rq, struct rq *dst_rq) +static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq *src_rq, + struct rq *dst_rq) { - rq_unpin_lock(rq, rf); - if (src_rq == dst_rq) { raw_spin_rq_unlock(rq); raw_spin_rq_lock(dst_rq); } else if (rq == src_rq) { double_lock_balance(rq, dst_rq); - rq_repin_lock(rq, rf); } else if (rq == dst_rq) { double_lock_balance(rq, src_rq); - rq_repin_lock(rq, rf); } else { raw_spin_rq_unlock(rq); double_rq_lock(src_rq, dst_rq); @@ -2082,19 +2076,17 @@ static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf, /** * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock() * @rq: current rq which is locked - * @rf: rq_flags to use when unlocking @rq * @src_rq: rq to move task from * @dst_rq: rq to move task to * * Unlock @src_rq and @dst_rq and ensure that @rq is locked on return. */ -static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf, - struct rq *src_rq, struct rq *dst_rq) +static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq *src_rq, + struct rq *dst_rq) { if (src_rq == dst_rq) { raw_spin_rq_unlock(dst_rq); raw_spin_rq_lock(rq); - rq_repin_lock(rq, rf); } else if (rq == src_rq) { double_unlock_balance(rq, dst_rq); } else if (rq == dst_rq) { @@ -2102,7 +2094,6 @@ static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf, } else { double_rq_unlock(src_rq, dst_rq); raw_spin_rq_lock(rq); - rq_repin_lock(rq, rf); } } #endif /* CONFIG_SMP */ @@ -2142,8 +2133,7 @@ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) return true; } -static bool consume_remote_task(struct rq *rq, struct rq_flags *rf, - struct scx_dispatch_q *dsq, +static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { bool moved = false; @@ -2163,9 +2153,7 @@ static bool consume_remote_task(struct rq *rq, struct rq_flags *rf, p->scx.holding_cpu = raw_smp_processor_id(); raw_spin_unlock(&dsq->lock); - rq_unpin_lock(rq, rf); double_lock_balance(rq, task_rq); - rq_repin_lock(rq, rf); moved = move_task_to_local_dsq(rq, p, 0); @@ -2175,13 +2163,11 @@ static bool consume_remote_task(struct rq *rq, struct rq_flags *rf, } #else /* CONFIG_SMP */ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) { return false; } -static bool consume_remote_task(struct rq *rq, struct rq_flags *rf, - struct scx_dispatch_q *dsq, +static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { return false; } #endif /* CONFIG_SMP */ -static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf, - struct scx_dispatch_q *dsq) +static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) { struct task_struct *p; retry: @@ -2204,7 +2190,7 @@ static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf, } if (task_can_run_on_remote_rq(p, rq)) { - if (likely(consume_remote_task(rq, rf, dsq, p, task_rq))) + if (likely(consume_remote_task(rq, dsq, p, task_rq))) return true; goto retry; } @@ -2224,7 +2210,6 @@ enum dispatch_to_local_dsq_ret { /** * dispatch_to_local_dsq - Dispatch a task to a local dsq * @rq: current rq which is locked - * @rf: rq_flags to use when unlocking @rq * @dsq_id: destination dsq ID * @p: task to dispatch * @enq_flags: %SCX_ENQ_* @@ -2237,8 +2222,8 @@ enum dispatch_to_local_dsq_ret { * %SCX_OPSS_DISPATCHING). */ static enum dispatch_to_local_dsq_ret -dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id, - struct task_struct *p, u64 enq_flags) +dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, + u64 enq_flags) { struct rq *src_rq = task_rq(p); struct rq *dst_rq; @@ -2287,7 +2272,7 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id, /* store_release ensures that dequeue sees the above */ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); - dispatch_to_local_dsq_lock(rq, rf, src_rq, locked_dst_rq); + dispatch_to_local_dsq_lock(rq, src_rq, locked_dst_rq); /* * We don't require the BPF scheduler to avoid dispatching to @@ -2322,7 +2307,7 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id, dst_rq->curr->sched_class)) resched_curr(dst_rq); - dispatch_to_local_dsq_unlock(rq, rf, src_rq, locked_dst_rq); + dispatch_to_local_dsq_unlock(rq, src_rq, locked_dst_rq); return dsp ? DTL_DISPATCHED : DTL_LOST; } @@ -2336,7 +2321,6 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id, /** * finish_dispatch - Asynchronously finish dispatching a task * @rq: current rq which is locked - * @rf: rq_flags to use when unlocking @rq * @p: task to finish dispatching * @qseq_at_dispatch: qseq when @p started getting dispatched * @dsq_id: destination DSQ ID @@ -2353,8 +2337,7 @@ dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id, * was valid in the first place. Make sure that the task is still owned by the * BPF scheduler and claim the ownership before dispatching. */ -static void finish_dispatch(struct rq *rq, struct rq_flags *rf, - struct task_struct *p, +static void finish_dispatch(struct rq *rq, struct task_struct *p, unsigned long qseq_at_dispatch, u64 dsq_id, u64 enq_flags) { @@ -2407,7 +2390,7 @@ static void finish_dispatch(struct rq *rq, struct rq_flags *rf, BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); - switch (dispatch_to_local_dsq(rq, rf, dsq_id, p, enq_flags)) { + switch (dispatch_to_local_dsq(rq, dsq_id, p, enq_flags)) { case DTL_DISPATCHED: break; case DTL_LOST: @@ -2423,7 +2406,7 @@ static void finish_dispatch(struct rq *rq, struct rq_flags *rf, } } -static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf) +static void flush_dispatch_buf(struct rq *rq) { struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); u32 u; @@ -2431,7 +2414,7 @@ static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf) for (u = 0; u < dspc->cursor; u++) { struct scx_dsp_buf_ent *ent = &dspc->buf[u]; - finish_dispatch(rq, rf, ent->task, ent->qseq, ent->dsq_id, + finish_dispatch(rq, ent->task, ent->qseq, ent->dsq_id, ent->enq_flags); } @@ -2439,8 +2422,7 @@ static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf) dspc->cursor = 0; } -static int balance_one(struct rq *rq, struct task_struct *prev, - struct rq_flags *rf, bool local) +static int balance_one(struct rq *rq, struct task_struct *prev, bool local) { struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); bool prev_on_scx = prev->sched_class == &ext_sched_class; @@ -2494,14 +2476,13 @@ static int balance_one(struct rq *rq, struct task_struct *prev, if (rq->scx.local_dsq.nr) goto has_tasks; - if (consume_dispatch_q(rq, rf, &scx_dsq_global)) + if (consume_dispatch_q(rq, &scx_dsq_global)) goto has_tasks; if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq)) goto out; dspc->rq = rq; - dspc->rf = rf; /* * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, @@ -2516,11 +2497,11 @@ static int balance_one(struct rq *rq, struct task_struct *prev, SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq), prev_on_scx ? prev : NULL); - flush_dispatch_buf(rq, rf); + flush_dispatch_buf(rq); if (rq->scx.local_dsq.nr) goto has_tasks; - if (consume_dispatch_q(rq, rf, &scx_dsq_global)) + if (consume_dispatch_q(rq, &scx_dsq_global)) goto has_tasks; /* @@ -2547,12 +2528,15 @@ static int balance_one(struct rq *rq, struct task_struct *prev, return has_tasks; } +#ifdef CONFIG_SMP static int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { int ret; - ret = balance_one(rq, prev, rf, true); + rq_unpin_lock(rq, rf); + + ret = balance_one(rq, prev, true); #ifdef CONFIG_SCHED_SMT /* @@ -2566,28 +2550,19 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) { struct rq *srq = cpu_rq(scpu); - struct rq_flags srf; struct task_struct *sprev = srq->curr; - /* - * While core-scheduling, rq lock is shared among - * siblings but the debug annotations and rq clock - * aren't. Do pinning dance to transfer the ownership. - */ WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq)); - rq_unpin_lock(rq, rf); - rq_pin_lock(srq, &srf); - update_rq_clock(srq); - balance_one(srq, sprev, &srf, false); - - rq_unpin_lock(srq, &srf); - rq_repin_lock(rq, rf); + balance_one(srq, sprev, false); } } #endif + rq_repin_lock(rq, rf); + return ret; } +#endif static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) { @@ -2657,11 +2632,11 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p) * balance_scx() must be called before the previous SCX task goes * through put_prev_task_scx(). * - * As UP doesn't transfer tasks around, balance_scx() doesn't need @rf. - * Pass in %NULL. + * @rq is pinned and can't be unlocked. As UP doesn't transfer tasks + * around, balance_one() doesn't need to. */ if (p->scx.flags & (SCX_TASK_QUEUED | SCX_TASK_DEQD_FOR_SLEEP)) - balance_scx(rq, p, NULL); + balance_one(rq, p, true); #endif update_curr_scx(rq); @@ -2724,7 +2699,7 @@ static struct task_struct *pick_next_task_scx(struct rq *rq) #ifndef CONFIG_SMP /* UP workaround - see the comment at the head of put_prev_task_scx() */ if (unlikely(rq->curr->sched_class != &ext_sched_class)) - balance_scx(rq, rq->curr, NULL); + balance_one(rq, rq->curr, true); #endif p = first_local_task(rq); @@ -5586,7 +5561,7 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id) if (!scx_kf_allowed(SCX_KF_DISPATCH)) return false; - flush_dispatch_buf(dspc->rq, dspc->rf); + flush_dispatch_buf(dspc->rq); dsq = find_non_local_dsq(dsq_id); if (unlikely(!dsq)) { @@ -5594,7 +5569,7 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id) return false; } - if (consume_dispatch_q(dspc->rq, dspc->rf, dsq)) { + if (consume_dispatch_q(dspc->rq, dsq)) { /* * A successfully consumed task can be dequeued before it starts * running while the CPU is trying to migrate other dispatched -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit f47a818950dd5e5d16eb6e9c1713bb0bc61649cd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- SCX_RQ_BALANCING is used to mark that the rq is currently in balance(). Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether the rq is currently enqueueing for a wakeup. This will be used to implement direct dispatching to local DSQ of another CPU. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 13 +++++++++---- kernel/sched/sched.h | 6 ++++-- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 99211cd93704..e565d0c17a3f 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1836,6 +1836,9 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags { int sticky_cpu = p->scx.sticky_cpu; + if (enq_flags & ENQUEUE_WAKEUP) + rq->scx.flags |= SCX_RQ_IN_WAKEUP; + enq_flags |= rq->scx.extra_enq_flags; if (sticky_cpu >= 0) @@ -1852,7 +1855,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags if (p->scx.flags & SCX_TASK_QUEUED) { WARN_ON_ONCE(!task_runnable(p)); - return; + goto out; } set_task_runnable(rq, p); @@ -1867,6 +1870,8 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags touch_core_sched(rq, p); do_enqueue_task(rq, p, enq_flags, sticky_cpu); +out: + rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; } static void ops_dequeue(struct task_struct *p, u64 deq_flags) @@ -2430,7 +2435,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) bool has_tasks = false; lockdep_assert_rq_held(rq); - rq->scx.flags |= SCX_RQ_BALANCING; + rq->scx.flags |= SCX_RQ_IN_BALANCE; if (static_branch_unlikely(&scx_ops_cpu_preempt) && unlikely(rq->scx.cpu_released)) { @@ -2524,7 +2529,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) has_tasks: has_tasks = true; out: - rq->scx.flags &= ~SCX_RQ_BALANCING; + rq->scx.flags &= ~SCX_RQ_IN_BALANCE; return has_tasks; } @@ -5072,7 +5077,7 @@ static bool can_skip_idle_kick(struct rq *rq) * The race window is small and we don't and can't guarantee that @rq is * only kicked while idle anyway. Skip only when sure. */ - return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_BALANCING); + return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE); } static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7542f22f1b79..1d3a7576ccce 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -837,8 +837,10 @@ enum scx_rq_flags { * only while the BPF scheduler considers the CPU to be online. */ SCX_RQ_ONLINE = 1 << 0, - SCX_RQ_BALANCING = 1 << 1, - SCX_RQ_CAN_STOP_TICK = 1 << 2, + SCX_RQ_CAN_STOP_TICK = 1 << 1, + + SCX_RQ_IN_WAKEUP = 1 << 16, + SCX_RQ_IN_BALANCE = 1 << 17, }; struct scx_rq { -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 5b26f7b920f76b2b9cc398c252a9e35e44bf5bb9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the local DSQ of any CPU. However, during direct dispatch from ops.select_cpu() and ops.enqueue(), this isn't allowed. This is because dispatching to the local DSQ of a remote CPU requires locking both the task's current and new rq's and such double locking can't be done directly from ops.enqueue(). While waking up a task, as ops.select_cpu() can pick any CPU and both ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still do whatever it wants to do. However, while a task is being enqueued for a different reason, e.g. after its slice expiration, only ops.enqueue() is called and there's no way for the BPF scheduler to directly dispatch to the local DSQ of a remote CPU. This gap in API forces schedulers into work-arounds which are not straightforward or optimal such as skipping direct dispatches in such cases. Implement deferred enqueueing to allow directly dispatching to the local DSQ of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be safely released, the tasks are taken off the list and queued on the target local DSQs using dispatch_to_local_dsq(). v2: - Add missing return after queue_balance_callback() in schedule_deferred(). (David). - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned and thus no longer takes @rf. Updated accordingly. - UP build warning fix. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Andrea Righi <righi.andrea@gmail.com> Acked-by: David Vernet <void@manifault.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Changwoo Min <changwoo@igalia.com> Conflicts: kernel/sched/ext.c [Current version is 'WARN_ON_ONCE(task_linked_on_dsq(p))', but the mainline version is 'WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node))', anyway, this mainline patch remove the WARN_ON_ONCE, and add other logic without conflicts.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 168 ++++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 3 + 2 files changed, 153 insertions(+), 18 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index e565d0c17a3f..567015775378 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -892,6 +892,7 @@ static struct kobject *scx_root_kobj; #define CREATE_TRACE_POINTS #include <trace/events/sched_ext.h> +static void process_ddsp_deferred_locals(struct rq *rq); static void scx_bpf_kick_cpu(s32 cpu, u64 flags); static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, s64 exit_code, @@ -1366,6 +1367,67 @@ static int ops_sanitize_err(const char *ops_name, s32 err) return -EPROTO; } +static void run_deferred(struct rq *rq) +{ + process_ddsp_deferred_locals(rq); +} + +#ifdef CONFIG_SMP +static void deferred_bal_cb_workfn(struct rq *rq) +{ + run_deferred(rq); +} +#endif + +static void deferred_irq_workfn(struct irq_work *irq_work) +{ + struct rq *rq = container_of(irq_work, struct rq, scx.deferred_irq_work); + + raw_spin_rq_lock(rq); + run_deferred(rq); + raw_spin_rq_unlock(rq); +} + +/** + * schedule_deferred - Schedule execution of deferred actions on an rq + * @rq: target rq + * + * Schedule execution of deferred actions on @rq. Must be called with @rq + * locked. Deferred actions are executed with @rq locked but unpinned, and thus + * can unlock @rq to e.g. migrate tasks to other rqs. + */ +static void schedule_deferred(struct rq *rq) +{ + lockdep_assert_rq_held(rq); + +#ifdef CONFIG_SMP + /* + * If in the middle of waking up a task, task_woken_scx() will be called + * afterwards which will then run the deferred actions, no need to + * schedule anything. + */ + if (rq->scx.flags & SCX_RQ_IN_WAKEUP) + return; + + /* + * If in balance, the balance callbacks will be called before rq lock is + * released. Schedule one. + */ + if (rq->scx.flags & SCX_RQ_IN_BALANCE) { + queue_balance_callback(rq, &rq->scx.deferred_bal_cb, + deferred_bal_cb_workfn); + return; + } +#endif + /* + * No scheduler hooks available. Queue an irq work. They are executed on + * IRQ re-enable which may take a bit longer than the scheduler hooks. + * The above WAKEUP and BALANCE paths should cover most of the cases and + * the time to IRQ re-enable shouldn't be long. + */ + irq_work_queue(&rq->scx.deferred_irq_work); +} + /** * touch_core_sched - Update timestamp used for core-sched task ordering * @rq: rq to read clock from, must be locked @@ -1586,7 +1648,13 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p) bool is_local = dsq == &rq->scx.local_dsq; if (!dsq) { - WARN_ON_ONCE(task_linked_on_dsq(p)); + /* + * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals. + * Unlinking is all that's needed to cancel. + */ + if (unlikely(!list_empty(&p->scx.dsq_list.node))) + list_del_init(&p->scx.dsq_list.node); + /* * When dispatching directly from the BPF scheduler to a local * DSQ, the task isn't associated with any DSQ but @@ -1595,6 +1663,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p) */ if (p->scx.holding_cpu >= 0) p->scx.holding_cpu = -1; + return; } @@ -1682,17 +1751,6 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task, return; } - /* - * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because - * dispatching to the local DSQ of a different CPU requires unlocking - * the current rq which isn't allowed in the enqueue path. Use - * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL. - */ - if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) { - scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch"); - return; - } - WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID); WARN_ON_ONCE(p->scx.ddsp_enq_flags); @@ -1702,13 +1760,58 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task, static void direct_dispatch(struct task_struct *p, u64 enq_flags) { + struct rq *rq = task_rq(p); struct scx_dispatch_q *dsq; + u64 dsq_id = p->scx.ddsp_dsq_id; + + touch_core_sched_dispatch(rq, p); + + p->scx.ddsp_enq_flags |= enq_flags; + + /* + * We are in the enqueue path with @rq locked and pinned, and thus can't + * double lock a remote rq and enqueue to its local DSQ. For + * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer + * the enqueue so that it's executed when @rq can be unlocked. + */ + if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { + s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; + unsigned long opss; - touch_core_sched_dispatch(task_rq(p), p); + if (cpu == cpu_of(rq)) { + dsq_id = SCX_DSQ_LOCAL; + goto dispatch; + } + + opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK; + + switch (opss & SCX_OPSS_STATE_MASK) { + case SCX_OPSS_NONE: + break; + case SCX_OPSS_QUEUEING: + /* + * As @p was never passed to the BPF side, _release is + * not strictly necessary. Still do it for consistency. + */ + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); + break; + default: + WARN_ONCE(true, "sched_ext: %s[%d] has invalid ops state 0x%lx in direct_dispatch()", + p->comm, p->pid, opss); + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); + break; + } - enq_flags |= (p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); - dsq = find_dsq_for_dispatch(task_rq(p), p->scx.ddsp_dsq_id, p); - dispatch_enqueue(dsq, p, enq_flags); + WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node)); + list_add_tail(&p->scx.dsq_list.node, + &rq->scx.ddsp_deferred_locals); + schedule_deferred(rq); + return; + } + +dispatch: + dsq = find_dsq_for_dispatch(rq, dsq_id, p); + dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); } static bool scx_rq_online(struct rq *rq) @@ -2611,6 +2714,29 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) } } +static void process_ddsp_deferred_locals(struct rq *rq) +{ + struct task_struct *p, *tmp; + + lockdep_assert_rq_held(rq); + + /* + * Now that @rq can be unlocked, execute the deferred enqueueing of + * tasks directly dispatched to the local DSQs of other CPUs. See + * direct_dispatch(). + */ + list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals, + scx.dsq_list.node) { + s32 ret; + + list_del_init(&p->scx.dsq_list.node); + + ret = dispatch_to_local_dsq(rq, p->scx.ddsp_dsq_id, p, + p->scx.ddsp_enq_flags); + WARN_ON_ONCE(ret == DTL_NOT_LOCAL); + } +} + static void put_prev_task_scx(struct rq *rq, struct task_struct *p) { #ifndef CONFIG_SMP @@ -3032,6 +3158,11 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag } } +static void task_woken_scx(struct rq *rq, struct task_struct *p) +{ + run_deferred(rq); +} + static void set_cpus_allowed_scx(struct task_struct *p, struct affinity_context *ac) { @@ -3547,8 +3678,6 @@ bool scx_can_stop_tick(struct rq *rq) * * - task_fork/dead: We need fork/dead notifications for all tasks regardless of * their current sched_class. Call them directly from sched core instead. - * - * - task_woken: Unnecessary. */ DEFINE_SCHED_CLASS(ext) = { .enqueue_task = enqueue_task_scx, @@ -3568,6 +3697,7 @@ DEFINE_SCHED_CLASS(ext) = { #ifdef CONFIG_SMP .balance = balance_scx, .select_task_rq = select_task_rq_scx, + .task_woken = task_woken_scx, .set_cpus_allowed = set_cpus_allowed_scx, .rq_online = rq_online_scx, @@ -5272,11 +5402,13 @@ void __init init_sched_ext_class(void) init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); INIT_LIST_HEAD(&rq->scx.runnable_list); + INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals); BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL)); BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL)); + init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn); init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn); if (cpu_online(cpu)) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1d3a7576ccce..2601bf00120d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -846,6 +846,7 @@ enum scx_rq_flags { struct scx_rq { struct scx_dispatch_q local_dsq; struct list_head runnable_list; /* runnable tasks on this rq */ + struct list_head ddsp_deferred_locals; /* deferred ddsps from enq */ unsigned long ops_qseq; u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; @@ -857,6 +858,8 @@ struct scx_rq { cpumask_var_t cpus_to_preempt; cpumask_var_t cpus_to_wait; unsigned long pnt_seq; + struct balance_callback deferred_bal_cb; + struct irq_work deferred_irq_work; struct irq_work kick_cpus_irq_work; }; #endif /* CONFIG_SCHED_CLASS_EXT */ -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 1edab907b57d42e2dcf4c16a00185a89209e8700 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Because there was no way to directly dispatch to the local DSQ of a remote CPU from ops.enqueue(), scx_qmap skipped looking for an idle CPU on !wakeup enqueues. This restriction was removed and sched_ext now allows SCX_DSQ_LOCAL_ON verdicts for direct dispatches. Factor out pick_direct_dispatch_cpu() from ops.select_cpu() and use it to direct dispatch from ops.enqueue() on !wakeup enqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/scx_qmap.bpf.c | 39 ++++++++++++++++++++++++++-------- tools/sched_ext/scx_qmap.c | 5 +++-- 2 files changed, 33 insertions(+), 11 deletions(-) diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 27e35066a602..892278f12dce 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -120,11 +120,26 @@ struct { } cpu_ctx_stor SEC(".maps"); /* Statistics */ -u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued; +u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq; u64 nr_core_sched_execed; u32 cpuperf_min, cpuperf_avg, cpuperf_max; u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max; +static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu) +{ + s32 cpu; + + if (p->nr_cpus_allowed == 1 || + scx_bpf_test_and_clear_cpu_idle(prev_cpu)) + return prev_cpu; + + cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); + if (cpu >= 0) + return cpu; + + return -1; +} + s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) { @@ -137,17 +152,14 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, return -ESRCH; } - if (p->nr_cpus_allowed == 1 || - scx_bpf_test_and_clear_cpu_idle(prev_cpu)) { + cpu = pick_direct_dispatch_cpu(p, prev_cpu); + + if (cpu >= 0) { tctx->force_local = true; + return cpu; + } else { return prev_cpu; } - - cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); - if (cpu >= 0) - return cpu; - - return prev_cpu; } static int weight_to_idx(u32 weight) @@ -172,6 +184,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) u32 pid = p->pid; int idx = weight_to_idx(p->scx.weight); void *ring; + s32 cpu; if (p->flags & PF_KTHREAD) { if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth)) @@ -207,6 +220,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) return; } + /* if !WAKEUP, select_cpu() wasn't called, try direct dispatch */ + if (!(enq_flags & SCX_ENQ_WAKEUP) && + (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) { + __sync_fetch_and_add(&nr_ddsp_from_enq, 1); + scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags); + return; + } + /* * If the task was re-enqueued due to the CPU being preempted by a * higher priority scheduling class, just re-enqueue the task directly diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index 304f0488a386..c9ca30d62b2b 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -116,10 +116,11 @@ int main(int argc, char **argv) long nr_enqueued = skel->bss->nr_enqueued; long nr_dispatched = skel->bss->nr_dispatched; - printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64"\n", + printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n", nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, skel->bss->nr_reenqueued, skel->bss->nr_dequeued, - skel->bss->nr_core_sched_execed); + skel->bss->nr_core_sched_execed, + skel->bss->nr_ddsp_from_enq); if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur")) printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n", skel->bss->cpuperf_min, -- 2.34.1
From: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> mainline inclusion from mainline-v6.12-rc1 commit 8bb30798fd6ee79e4041a32ca85b9f70345d8671 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The type_id is defined as u32type, if(type_id<0) is invalid, hence modified its type to s32. ./kernel/sched/ext.c:4958:5-12: WARNING: Unsigned expression compared with zero: type_id < 0. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9523 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 567015775378..9756ca5d414d 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5066,7 +5066,7 @@ static void bpf_scx_unreg(void *kdata) static int bpf_scx_init(struct btf *btf) { - u32 type_id; + s32 type_id; type_id = btf_find_by_name_kind(btf, "task_struct", BTF_KIND_STRUCT); if (type_id < 0) -- 2.34.1
From: David Vernet <void@manifault.com> mainline inclusion from mainline-v6.12-rc1 commit 298dec19bdeb6e33ac220502504d969272b50cf6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- We currently only allow calling sleepable scx kfuncs (i.e. scx_bpf_create_dsq()) from BPF_PROG_TYPE_STRUCT_OPS progs. The idea here was that we'd never have to call scx_bpf_create_dsq() outside of a sched_ext struct_ops callback, but that might not actually be true. For example, a scheduler could do something like the following: 1. Open and load (not yet attach) a scheduler skel 2. Synchronously call into a BPF_PROG_TYPE_SYSCALL prog from user space. For example, to initialize an LLC domain, or some other global, read-only state. 3. Attach the skel, which actually enables the scheduler The advantage of doing this is that it can preclude having to do pretty ugly boilerplate like initializing a read-only, statically sized array of u64[]'s which the kernel consumes literally once at init time to then create struct bpf_cpumask objects which are actually queried at runtime. Doing the above is already possible given that we can invoke core BPF kfuncs, such as bpf_cpumask_create(), from BPF_PROG_TYPE_SYSCALL progs. We already allow many scx kfuncs to be called from BPF_PROG_TYPE_SYSCALL progs (e.g. scx_bpf_kick_cpu()). Let's allow the sleepable kfuncs as well. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 14 ++++++-------- kernel/sched/ext.c | 27 +++++++++++---------------- 2 files changed, 17 insertions(+), 24 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 593d2f4909dd..26e1c33bc844 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -106,16 +106,14 @@ enum scx_ent_dsq_flags { * mechanism. See scx_kf_allow(). */ enum scx_kf_mask { - SCX_KF_UNLOCKED = 0, /* not sleepable, not rq locked */ - /* all non-sleepables may be nested inside SLEEPABLE */ - SCX_KF_SLEEPABLE = 1 << 0, /* sleepable init operations */ + SCX_KF_UNLOCKED = 0, /* sleepable and not rq locked */ /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ - SCX_KF_CPU_RELEASE = 1 << 1, /* ops.cpu_release() */ + SCX_KF_CPU_RELEASE = 1 << 0, /* ops.cpu_release() */ /* ops.dequeue (in REST) may be nested inside DISPATCH */ - SCX_KF_DISPATCH = 1 << 2, /* ops.dispatch() */ - SCX_KF_ENQUEUE = 1 << 3, /* ops.enqueue() and ops.select_cpu() */ - SCX_KF_SELECT_CPU = 1 << 4, /* ops.select_cpu() */ - SCX_KF_REST = 1 << 5, /* other rq-locked operations */ + SCX_KF_DISPATCH = 1 << 1, /* ops.dispatch() */ + SCX_KF_ENQUEUE = 1 << 2, /* ops.enqueue() and ops.select_cpu() */ + SCX_KF_SELECT_CPU = 1 << 3, /* ops.select_cpu() */ + SCX_KF_REST = 1 << 4, /* other rq-locked operations */ __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 9756ca5d414d..59b5e6d3169c 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1033,16 +1033,12 @@ static __always_inline bool scx_kf_allowed(u32 mask) return false; } - if (unlikely((mask & SCX_KF_SLEEPABLE) && in_interrupt())) { - scx_ops_error("sleepable kfunc called from non-sleepable context"); - return false; - } - /* * Enforce nesting boundaries. e.g. A kfunc which can be called from * DISPATCH must not be called if we're running DEQUEUE which is nested - * inside ops.dispatch(). We don't need to check the SCX_KF_SLEEPABLE - * boundary thanks to the above in_interrupt() check. + * inside ops.dispatch(). We don't need to check boundaries for any + * blocking kfuncs as the verifier ensures they're only called from + * sleepable progs. */ if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { @@ -3234,9 +3230,9 @@ static void handle_hotplug(struct rq *rq, bool online) atomic_long_inc(&scx_hotplug_seq); if (online && SCX_HAS_OP(cpu_online)) - SCX_CALL_OP(SCX_KF_SLEEPABLE, cpu_online, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu); else if (!online && SCX_HAS_OP(cpu_offline)) - SCX_CALL_OP(SCX_KF_SLEEPABLE, cpu_offline, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu); else scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, "cpu %d going %s, exiting scheduler", cpu, @@ -3400,7 +3396,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool .fork = fork, }; - ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, init_task, p, &args); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args); if (unlikely(ret)) { ret = ops_sanitize_err("init_task", ret); return ret; @@ -4657,7 +4653,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) cpus_read_lock(); if (scx_ops.init) { - ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, init); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init); if (ret) { ret = ops_sanitize_err("init", ret); goto err_disable_unlock_cpus; @@ -5433,14 +5429,11 @@ __bpf_kfunc_start_defs(); * @dsq_id: DSQ to create * @node: NUMA node to allocate from * - * Create a custom DSQ identified by @dsq_id. Can be called from ops.init() and - * ops.init_task(). + * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable + * scx callback, and any BPF_PROG_TYPE_SYSCALL prog. */ __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) { - if (!scx_kf_allowed(SCX_KF_SLEEPABLE)) - return -EINVAL; - if (unlikely(node >= (int)nr_node_ids || (node < 0 && node != NUMA_NO_NODE))) return -EINVAL; @@ -6499,6 +6492,8 @@ static int __init scx_init(void) */ if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_sleepable)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, + &scx_kfunc_set_sleepable)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, -- 2.34.1
From: David Vernet <void@manifault.com> mainline inclusion from mainline-v6.12-rc1 commit 958b1891846e768584a4d44aed9fabc6f9a445fb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- We already have some testcases verifying that we can call BPF_PROG_TYPE_SYSCALL progs and invoke scx_bpf_exit(). Let's extend that to also call scx_bpf_create_dsq() so we get coverage for that as well. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/testing/selftests/sched_ext/prog_run.bpf.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/sched_ext/prog_run.bpf.c b/tools/testing/selftests/sched_ext/prog_run.bpf.c index fd2c8f12af16..6a4d7c48e3f2 100644 --- a/tools/testing/selftests/sched_ext/prog_run.bpf.c +++ b/tools/testing/selftests/sched_ext/prog_run.bpf.c @@ -16,6 +16,7 @@ char _license[] SEC("license") = "GPL"; SEC("syscall") int BPF_PROG(prog_run_syscall) { + scx_bpf_create_dsq(0, -1); scx_bpf_exit(0xdeadbeef, "Exited from PROG_RUN"); return 0; } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit a2f4b16e736d62892ffd333996a7d682b57f6664 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_dump_task() uses stack_trace_save_tsk() which is only available when CONFIG_STACKTRACE. Make CONFIG_SCHED_CLASS_EXT select CONFIG_STACKTRACE if the support is available and skip capturing stack trace if !CONFIG_STACKTRACE. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407161844.reewQQrR-lkp@intel.com/ Acked-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/Kconfig.preempt | 1 + kernel/sched/ext.c | 4 +++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index de625357e5aa..23034579d772 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -136,6 +136,7 @@ config SCHED_CORE config SCHED_CLASS_EXT bool "Extensible Scheduling Class" depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF + select STACKTRACE if STACKTRACE_SUPPORT help This option enables a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 59b5e6d3169c..7fc8a3831050 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4339,7 +4339,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, static unsigned long bt[SCX_EXIT_BT_LEN]; char dsq_id_buf[19] = "(n/a)"; unsigned long ops_state = atomic_long_read(&p->scx.ops_state); - unsigned int bt_len; + unsigned int bt_len = 0; if (p->scx.dsq) scnprintf(dsq_id_buf, sizeof(dsq_id_buf), "0x%llx", @@ -4364,7 +4364,9 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, ops_dump_exit(); } +#ifdef CONFIG_STACKTRACE bt_len = stack_trace_save_tsk(p, bt, SCX_EXIT_BT_LEN, 1); +#endif if (bt_len) { dump_newline(s); dump_stack_trace(s, " ", bt, bt_len); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e99129e5dbf7ca87233d31ad19348f6ce8627b38 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... --------------------------------
From 1232da7eced620537a78f19c8cf3d4a3508e2419 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Wed, 31 Jul 2024 09:14:52 -1000
p->scx.disallow provides a way for the BPF scheduler to reject certain tasks from attaching. It's currently allowed for both the load and fork paths; however, the latter doesn't actually work as p->sched_class is already set by the time scx_ops_init_task() is called during fork. This is a convenience feature which is mostly useful from the load path anyway. Allow it only from the load path. v2: Trigger scx_ops_error() iff @p->policy == SCHED_EXT to make it a bit easier for the BPF scheduler (David). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Zhangqiao (2012 lab)" <zhangqiao22@huawei.com> Link: http://lkml.kernel.org/r/20240711110720.1285-1-zhangqiao22@huawei.com Fixes: 7bb6f0810ecf ("sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT") Acked-by: David Vernet <void@manifault.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 11 ++++++----- kernel/sched/ext.c | 35 ++++++++++++++++++++--------------- 2 files changed, 26 insertions(+), 20 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 26e1c33bc844..69f68e2121a8 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -179,11 +179,12 @@ struct sched_ext_entity { * If set, reject future sched_setscheduler(2) calls updating the policy * to %SCHED_EXT with -%EACCES. * - * If set from ops.init_task() and the task's policy is already - * %SCHED_EXT, which can happen while the BPF scheduler is being loaded - * or by inhering the parent's policy during fork, the task's policy is - * rejected and forcefully reverted to %SCHED_NORMAL. The number of - * such events are reported through /sys/kernel/debug/sched_ext::nr_rejected. + * Can be set from ops.init_task() while the BPF scheduler is being + * loaded (!scx_init_task_args->fork). If set and the task's policy is + * already %SCHED_EXT, the task's policy is rejected and forcefully + * reverted to %SCHED_NORMAL. The number of such events are reported + * through /sys/kernel/debug/sched_ext::nr_rejected. Setting this flag + * during fork is not allowed. */ bool disallow; /* reject switching into SCX */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 7fc8a3831050..337466b85115 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3406,24 +3406,29 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool scx_set_task_state(p, SCX_TASK_INIT); if (p->scx.disallow) { - struct rq *rq; - struct rq_flags rf; + if (!fork) { + struct rq *rq; + struct rq_flags rf; - rq = task_rq_lock(p, &rf); + rq = task_rq_lock(p, &rf); - /* - * We're either in fork or load path and @p->policy will be - * applied right after. Reverting @p->policy here and rejecting - * %SCHED_EXT transitions from scx_check_setscheduler() - * guarantees that if ops.init_task() sets @p->disallow, @p can - * never be in SCX. - */ - if (p->policy == SCHED_EXT) { - p->policy = SCHED_NORMAL; - atomic_long_inc(&scx_nr_rejected); - } + /* + * We're in the load path and @p->policy will be applied + * right after. Reverting @p->policy here and rejecting + * %SCHED_EXT transitions from scx_check_setscheduler() + * guarantees that if ops.init_task() sets @p->disallow, + * @p can never be in SCX. + */ + if (p->policy == SCHED_EXT) { + p->policy = SCHED_NORMAL; + atomic_long_inc(&scx_nr_rejected); + } - task_rq_unlock(rq, p, &rf); + task_rq_unlock(rq, p, &rf); + } else if (p->policy == SCHED_EXT) { + scx_ops_error("ops.init_task() set task->scx.disallow for %s[%d] during fork", + p->comm, p->pid); + } } p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 11cc374f4643b1be16deab571e034409c6ee7e66 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The way sched_can_stop_tick() used scx_can_stop_tick() was rather confusing and the behavior wasn't ideal when SCX is enabled in partial mode. Simplify it so that: - scx_can_stop_tick() can say no if scx_enabled(). - CFS tests rq->cfs.nr_running > 1 instead of rq->nr_running. This is easier to follow and leads to the correct answer whether SCX is disabled, enabled in partial mode or all tasks are switched to SCX. Peter, note that this is a bit different from your suggestion where sched_can_stop_tick() unconditionally returns scx_can_stop_tick() iff scx_switched_all(). The problem is that in partial mode, tick can be stopped when there is only one SCX task even if the BPF scheduler didn't ask and isn't ready for it. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index aabea3baed41..3901e77ee4f9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1277,10 +1277,10 @@ bool sched_can_stop_tick(struct rq *rq) * left. For CFS, if there's more than one we need the tick for * involuntary preemption. For SCX, ask. */ - if (!scx_switched_all() && rq->nr_running > 1) + if (scx_enabled() && !scx_can_stop_tick(rq)) return false; - if (scx_enabled() && !scx_can_stop_tick(rq)) + if (rq->cfs.nr_running > 1) return false; /* -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit cd0144926836b8405966ca9d00f6425ef822fa4b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- SCX needs its balance() invoked even when waking up from a lower priority sched class (idle) and put_prev_task_balance() thus has the logic to promote @start_class if it's lower than ext_sched_class. This is only needed when SCX is enabled. Add scx_enabled() test to avoid unnecessary overhead when SCX is disabled. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3901e77ee4f9..e3a0c63f7ee1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5903,7 +5903,7 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, * when waking up from SCHED_IDLE. If @start_class is below SCX, start * from SCX instead. */ - if (sched_class_above(&ext_sched_class, start_class)) + if (scx_enabled() && sched_class_above(&ext_sched_class, start_class)) start_class = &ext_sched_class; #endif -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.8-rc1 commit 5d69eca542ee17c618f9a55da52191d5e28b435f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- All classes use sched_entity::exec_start to track runtime and have copies of the exact same code around to compute runtime. Collapse all that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.169909515... Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched.h | 2 +- kernel/sched/deadline.c | 15 +++-------- kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++---------- kernel/sched/rt.c | 15 +++-------- kernel/sched/sched.h | 12 ++------- kernel/sched/stop_task.c | 13 +-------- 6 files changed, 53 insertions(+), 61 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6b1f6dc28ec6..d0618abb6a2f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -535,7 +535,7 @@ struct sched_statistics { u64 block_max; s64 sum_block_runtime; - u64 exec_max; + s64 exec_max; u64 slice_max; u64 nr_migrations_cold; diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index ebbdc3dae343..6ab915483b4b 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1300,9 +1300,8 @@ static void update_curr_dl(struct rq *rq) { struct task_struct *curr = rq->curr; struct sched_dl_entity *dl_se = &curr->dl; - u64 delta_exec, scaled_delta_exec; + s64 delta_exec, scaled_delta_exec; int cpu = cpu_of(rq); - u64 now; if (!dl_task(curr) || !on_dl_rq(dl_se)) return; @@ -1315,21 +1314,13 @@ static void update_curr_dl(struct rq *rq) * natural solution, but the full ramifications of this * approach need further study. */ - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec <= 0)) { + delta_exec = update_curr_common(rq); + if (unlikely(delta_exec <= 0)) { if (unlikely(dl_se->dl_yielded)) goto throttle; return; } - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - - trace_sched_stat_runtime(curr, delta_exec, 0); - - update_current_exec_runtime(curr, now, delta_exec); - if (dl_entity_is_special(dl_se)) return; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 812abf3347e2..448a40805c08 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1309,23 +1309,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq) } #endif /* CONFIG_SMP */ -/* - * Update the current task's runtime statistics. - */ -static void update_curr(struct cfs_rq *cfs_rq) +static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) { - struct sched_entity *curr = cfs_rq->curr; - u64 now = rq_clock_task(rq_of(cfs_rq)); - u64 delta_exec; - - if (unlikely(!curr)) - return; + u64 now = rq_clock_task(rq); + s64 delta_exec; delta_exec = now - curr->exec_start; - if (unlikely((s64)delta_exec <= 0)) - return; + if (unlikely(delta_exec <= 0)) + return delta_exec; curr->exec_start = now; + curr->sum_exec_runtime += delta_exec; if (schedstat_enabled()) { struct sched_statistics *stats; @@ -1335,8 +1329,43 @@ static void update_curr(struct cfs_rq *cfs_rq) max(delta_exec, stats->exec_max)); } - curr->sum_exec_runtime += delta_exec; - schedstat_add(cfs_rq->exec_clock, delta_exec); + return delta_exec; +} + +/* + * Used by other classes to account runtime. + */ +s64 update_curr_common(struct rq *rq) +{ + struct task_struct *curr = rq->curr; + s64 delta_exec; + + delta_exec = update_curr_se(rq, &curr->se); + if (unlikely(delta_exec <= 0)) + return delta_exec; + + trace_sched_stat_runtime(curr, delta_exec, 0); + + account_group_exec_runtime(curr, delta_exec); + cgroup_account_cputime(curr, delta_exec); + + return delta_exec; +} + +/* + * Update the current task's runtime statistics. + */ +static void update_curr(struct cfs_rq *cfs_rq) +{ + struct sched_entity *curr = cfs_rq->curr; + s64 delta_exec; + + if (unlikely(!curr)) + return; + + delta_exec = update_curr_se(rq_of(cfs_rq), curr); + if (unlikely(delta_exec <= 0)) + return; curr->vruntime += calc_delta_fair(delta_exec, curr); update_deadline(cfs_rq, curr); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index b6eea8784978..2bf4e4ba42e6 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1050,24 +1050,15 @@ static void update_curr_rt(struct rq *rq) { struct task_struct *curr = rq->curr; struct sched_rt_entity *rt_se = &curr->rt; - u64 delta_exec; - u64 now; + s64 delta_exec; if (curr->sched_class != &rt_sched_class) return; - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec <= 0)) + delta_exec = update_curr_common(rq); + if (unlikely(delta_exec <= 0)) return; - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - - trace_sched_stat_runtime(curr, delta_exec, 0); - - update_current_exec_runtime(curr, now, delta_exec); - if (!rt_bandwidth_enabled()) return; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2601bf00120d..ea87f0fe61b2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2517,6 +2517,8 @@ struct affinity_context { unsigned int flags; }; +extern s64 update_curr_common(struct rq *rq); + struct sched_class { #ifdef CONFIG_UCLAMP_TASK @@ -3688,16 +3690,6 @@ extern int sched_dynamic_mode(const char *str); extern void sched_dynamic_update(int mode); #endif -static inline void update_current_exec_runtime(struct task_struct *curr, - u64 now, u64 delta_exec) -{ - curr->se.sum_exec_runtime += delta_exec; - account_group_exec_runtime(curr, delta_exec); - - curr->se.exec_start = now; - cgroup_account_cputime(curr, delta_exec); -} - #ifdef CONFIG_SCHED_MM_CID #define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */ diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 52d8164e935c..4cf02074fa9e 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -71,18 +71,7 @@ static void yield_task_stop(struct rq *rq) static void put_prev_task_stop(struct rq *rq, struct task_struct *prev) { - struct task_struct *curr = rq->curr; - u64 now, delta_exec; - - now = rq_clock_task(rq); - delta_exec = now - curr->se.exec_start; - if (unlikely((s64)delta_exec < 0)) - delta_exec = 0; - - schedstat_set(curr->stats.exec_max, - max(curr->stats.exec_max, delta_exec)); - - update_current_exec_runtime(curr, now, delta_exec); + update_curr_common(rq); } /* -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched: Fix kabi for exec_max modified from u64 to s64 in struct sched_statistics. Fixes: 472e07951383 ("sched: Unify runtime accounting across classes") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index d0618abb6a2f..ed51bf4e536a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -535,7 +535,7 @@ struct sched_statistics { u64 block_max; s64 sum_block_runtime; - s64 exec_max; + KABI_REPLACE(u64 exec_max, s64 exec_max) u64 slice_max; u64 nr_migrations_cold; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 7799140b6a1697bf1d6ff80395079633f548f6e7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- update_curr_scx() is open coding runtime updates. Use update_curr_common() instead and avoid unnecessary deviations. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 337466b85115..e5bc3cb2a760 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1470,20 +1470,14 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p) static void update_curr_scx(struct rq *rq) { struct task_struct *curr = rq->curr; - u64 now = rq_clock_task(rq); - u64 delta_exec; + s64 delta_exec; - if (time_before_eq64(now, curr->se.exec_start)) + delta_exec = update_curr_common(rq); + if (unlikely(delta_exec <= 0)) return; - delta_exec = now - curr->se.exec_start; - curr->se.exec_start = now; - curr->se.sum_exec_runtime += delta_exec; - account_group_exec_runtime(curr, delta_exec); - cgroup_account_cputime(curr, delta_exec); - if (curr->scx.slice != SCX_SLICE_INF) { - curr->scx.slice -= min(curr->scx.slice, delta_exec); + curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec); if (!curr->scx.slice) touch_core_sched(rq, curr); } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit a735d43c7f85d112a6aefd72973188d0626e4464 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- On SMP, SCX performs dispatch from sched_class->balance(). As balance() was not available in UP, it instead called the internal balance function from put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is rather nasty. Enabling sched_class->balance() on UP shouldn't cause any meaningful overhead. Enable balance() on UP and drop the ugly workaround. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 4 +--- kernel/sched/ext.c | 41 +---------------------------------------- kernel/sched/sched.h | 2 +- 3 files changed, 3 insertions(+), 44 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e3a0c63f7ee1..51e72f715f98 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5893,7 +5893,6 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt) static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { -#ifdef CONFIG_SMP const struct sched_class *start_class = prev->sched_class; const struct sched_class *class; @@ -5916,10 +5915,9 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, * a runnable task of @class priority or higher. */ for_active_class_range(class, start_class, &idle_sched_class) { - if (class->balance(rq, prev, rf)) + if (class->balance && class->balance(rq, prev, rf)) break; } -#endif put_prev_task(rq, prev); } diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index e5bc3cb2a760..b2633b9c832e 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2626,7 +2626,6 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) return has_tasks; } -#ifdef CONFIG_SMP static int balance_scx(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { @@ -2660,7 +2659,6 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, return ret; } -#endif static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) { @@ -2729,37 +2727,6 @@ static void process_ddsp_deferred_locals(struct rq *rq) static void put_prev_task_scx(struct rq *rq, struct task_struct *p) { -#ifndef CONFIG_SMP - /* - * UP workaround. - * - * Because SCX may transfer tasks across CPUs during dispatch, dispatch - * is performed from its balance operation which isn't called in UP. - * Let's work around by calling it from the operations which come right - * after. - * - * 1. If the prev task is on SCX, pick_next_task() calls - * .put_prev_task() right after. As .put_prev_task() is also called - * from other places, we need to distinguish the calls which can be - * done by looking at the previous task's state - if still queued or - * dequeued with %SCX_DEQ_SLEEP, the caller must be pick_next_task(). - * This case is handled here. - * - * 2. If the prev task is not on SCX, the first following call into SCX - * will be .pick_next_task(), which is covered by calling - * balance_scx() from pick_next_task_scx(). - * - * Note that we can't merge the first case into the second as - * balance_scx() must be called before the previous SCX task goes - * through put_prev_task_scx(). - * - * @rq is pinned and can't be unlocked. As UP doesn't transfer tasks - * around, balance_one() doesn't need to. - */ - if (p->scx.flags & (SCX_TASK_QUEUED | SCX_TASK_DEQD_FOR_SLEEP)) - balance_one(rq, p, true); -#endif - update_curr_scx(rq); /* see dequeue_task_scx() on why we skip when !QUEUED */ @@ -2817,12 +2784,6 @@ static struct task_struct *pick_next_task_scx(struct rq *rq) { struct task_struct *p; -#ifndef CONFIG_SMP - /* UP workaround - see the comment at the head of put_prev_task_scx() */ - if (unlikely(rq->curr->sched_class != &ext_sched_class)) - balance_one(rq, rq->curr, true); -#endif - p = first_local_task(rq); if (!p) return NULL; @@ -3682,6 +3643,7 @@ DEFINE_SCHED_CLASS(ext) = { .wakeup_preempt = wakeup_preempt_scx, + .balance = balance_scx, .pick_next_task = pick_next_task_scx, .put_prev_task = put_prev_task_scx, @@ -3690,7 +3652,6 @@ DEFINE_SCHED_CLASS(ext) = { .switch_class = switch_class_scx, #ifdef CONFIG_SMP - .balance = balance_scx, .select_task_rq = select_task_rq_scx, .task_woken = task_woken_scx, .set_cpus_allowed = set_cpus_allowed_scx, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ea87f0fe61b2..c2594580d981 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2534,13 +2534,13 @@ struct sched_class { KABI_REPLACE(void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags), void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags)) + int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); struct task_struct *(*pick_next_task)(struct rq *rq); void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); #ifdef CONFIG_SMP - int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags); struct task_struct * (*pick_task)(struct rq *rq); -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched_ext: Fix kabi for function attribute balance moved out of the SMP scope in struct sched_class. Fixes: 8837c5d8c08a ("sched_ext: Simplify UP support by enabling sched_class->balance() in UP") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c2594580d981..75cb29d12c71 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2534,13 +2534,14 @@ struct sched_class { KABI_REPLACE(void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags), void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags)) - int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); struct task_struct *(*pick_next_task)(struct rq *rq); void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); #ifdef CONFIG_SMP + KABI_BROKEN_REMOVE(int (*balance)(struct rq *rq, + struct task_struct *prev, struct rq_flags *rf)) int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags); struct task_struct * (*pick_task)(struct rq *rq); @@ -2587,6 +2588,7 @@ struct sched_class { struct task_struct *task, const struct load_weight *lw)) KABI_USE(2, void (*switching_to) (struct rq *this_rq, struct task_struct *task)) KABI_EXTEND(void (*switch_class)(struct rq *rq, struct task_struct *next)) + KABI_EXTEND(int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)) }; static inline void put_prev_task(struct rq *rq, struct task_struct *prev) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 9390a923e109f85b242bf676dc5bc81958d447fa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_task_iter_next_locked() skips tasks whose sched_class is idle_sched_class. While it has a short comment explaining why it's testing the sched_class directly isntead of using is_idle_task(), the comment doesn't sufficiently explain what's going on and why. Improve the comment. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index b2633b9c832e..b22581f99cac 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1256,8 +1256,29 @@ scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead) while ((p = scx_task_iter_next(iter))) { /* - * is_idle_task() tests %PF_IDLE which may not be set for CPUs - * which haven't yet been onlined. Test sched_class directly. + * scx_task_iter is used to prepare and move tasks into SCX + * while loading the BPF scheduler and vice-versa while + * unloading. The init_tasks ("swappers") should be excluded + * from the iteration because: + * + * - It's unsafe to use __setschduler_prio() on an init_task to + * determine the sched_class to use as it won't preserve its + * idle_sched_class. + * + * - ops.init/exit_task() can easily be confused if called with + * init_tasks as they, e.g., share PID 0. + * + * As init_tasks are never scheduled through SCX, they can be + * skipped safely. Note that is_idle_task() which tests %PF_IDLE + * doesn't work here: + * + * - %PF_IDLE may not be set for an init_task whose CPU hasn't + * yet been onlined. + * + * - %PF_IDLE can be set on tasks that are not init_tasks. See + * play_idle_precise() used by CONFIG_IDLE_INJECT. + * + * Test for idle_sched_class as only init_tasks are on it. */ if (p->sched_class != &idle_sched_class) break; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 2c390dda9e03d7936c492224453342d458e9bf98 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are subtle differences. It currently open codes all the tests. This is cumbersome to understand and error-prone in case the intersecting tests need to be updated. Factor out the common part - testing whether the task is allowed on the CPU at all regardless of the CPU state - into task_allowed_on_cpu() and make both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the code is now linked between the two and each contains only the extra tests that differ between them, it's less error-prone when the conditions need to be updated. Also, improve the comment to explain why they are different. v2: Replace accidental "extern inline" with "static inline" (Peter). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com> Conflicts: kernel/sched/sched.h [There should be no conflicts for this mainline patch, but cannot compile successfully, so add the '#include <linux/mmu_context.h>' manually.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 4 ++-- kernel/sched/ext.c | 21 ++++++++++++++++----- kernel/sched/sched.h | 19 +++++++++++++++++++ 3 files changed, 37 insertions(+), 7 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 51e72f715f98..2f20b0a07902 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2326,7 +2326,7 @@ static inline bool rq_has_pinned_tasks(struct rq *rq) static inline bool is_cpu_allowed(struct task_struct *p, int cpu) { /* When not in the task's cpumask, no point in looking further. */ - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) + if (!task_allowed_on_cpu(p, cpu)) return false; /* migrate_disabled() must be allowed to finish. */ @@ -2335,7 +2335,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu) /* Non kernel threads are not allowed during either online or offline. */ if (!(p->flags & PF_KTHREAD)) - return cpu_active(cpu) && task_cpu_possible(cpu, p); + return cpu_active(cpu); /* KTHREAD_IS_PER_CPU is always allowed. */ if (kthread_is_per_cpu(p)) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index b22581f99cac..8dcf6b5999f9 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2234,19 +2234,30 @@ static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq, #ifdef CONFIG_SMP /* - * Similar to kernel/sched/core.c::is_cpu_allowed() but we're testing whether @p - * can be pulled to @rq. + * Similar to kernel/sched/core.c::is_cpu_allowed(). However, there are two + * differences: + * + * - is_cpu_allowed() asks "Can this task run on this CPU?" while + * task_can_run_on_remote_rq() asks "Can the BPF scheduler migrate the task to + * this CPU?". + * + * While migration is disabled, is_cpu_allowed() has to say "yes" as the task + * must be allowed to finish on the CPU that it's currently on regardless of + * the CPU state. However, task_can_run_on_remote_rq() must say "no" as the + * BPF scheduler shouldn't attempt to migrate a task which has migration + * disabled. + * + * - The BPF scheduler is bypassed while the rq is offline and we can always say + * no to the BPF scheduler initiated migrations while offline. */ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) { int cpu = cpu_of(rq); - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) + if (!task_allowed_on_cpu(p, cpu)) return false; if (unlikely(is_migration_disabled(p))) return false; - if (!(p->flags & PF_KTHREAD) && unlikely(!task_cpu_possible(cpu, p))) - return false; if (!scx_rq_online(rq)) return false; return true; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 75cb29d12c71..9cc24f97e285 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -45,6 +45,7 @@ #include <linux/lockdep.h> #include <linux/minmax.h> #include <linux/mm.h> +#include <linux/mmu_context.h> #include <linux/module.h> #include <linux/mutex_api.h> #include <linux/plist.h> @@ -2708,6 +2709,19 @@ extern void trigger_load_balance(struct rq *rq); extern int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx); extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx); +static inline bool task_allowed_on_cpu(struct task_struct *p, int cpu) +{ + /* When not in the task's cpumask, no point in looking further. */ + if (!cpumask_test_cpu(cpu, p->cpus_ptr)) + return false; + + /* Can @cpu run a user thread? */ + if (!(p->flags & PF_KTHREAD) && !task_cpu_possible(cpu, p)) + return false; + + return true; +} + static inline cpumask_t *alloc_user_cpus_ptr(int node) { /* @@ -2741,6 +2755,11 @@ extern int push_cpu_stop(void *arg); #else /* !CONFIG_SMP: */ +static inline bool task_allowed_on_cpu(struct task_struct *p, int cpu) +{ + return true; +} + static inline int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx) { -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched_ext: Fix kabi for add "#include <linux/mmu_context.h>" in kernel/sched/sched.h. Fixes: eabbe530b1b3 ("sched_ext: Make task_can_run_on_remote_rq() use common task_allowed_on_cpu()") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9cc24f97e285..38349ee6a944 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -45,7 +45,6 @@ #include <linux/lockdep.h> #include <linux/minmax.h> #include <linux/mm.h> -#include <linux/mmu_context.h> #include <linux/module.h> #include <linux/mutex_api.h> #include <linux/plist.h> @@ -71,6 +70,9 @@ #include <linux/wait_bit.h> #include <linux/workqueue_api.h> #include <linux/kabi.h> +#ifndef __GENKSYMS__ +#include <linux/mmu_context.h> +#endif #include <trace/events/power.h> #include <trace/events/sched.h> -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 72763ea3d45c7f9fd69b825468afbf4d11c5ffc2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- process_ddsp_deferred_locals() executes deferred direct dispatches to the local DSQs of remote CPUs. It iterates the tasks on rq->scx.ddsp_deferred_locals list, removing and calling dispatch_to_local_dsq() on each. However, the list is protected by the rq lock that can be dropped by dispatch_to_local_dsq() temporarily, so the list can be modified during the iteration, which can lead to oopses and other failures. Fix it by popping from the head of the list instead of iterating the list. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 8dcf6b5999f9..cbb020a1b96b 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2736,17 +2736,19 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) static void process_ddsp_deferred_locals(struct rq *rq) { - struct task_struct *p, *tmp; + struct task_struct *p; lockdep_assert_rq_held(rq); /* * Now that @rq can be unlocked, execute the deferred enqueueing of * tasks directly dispatched to the local DSQs of other CPUs. See - * direct_dispatch(). + * direct_dispatch(). Keep popping from the head instead of using + * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq + * temporarily. */ - list_for_each_entry_safe(p, tmp, &rq->scx.ddsp_deferred_locals, - scx.dsq_list.node) { + while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, + struct task_struct, scx.dsq_list.node))) { s32 ret; list_del_init(&p->scx.dsq_list.node); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 991ef53a4832941c8130008ef35c66ec88c7fa0f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_rq_online() currently only tests SCX_RQ_ONLINE. This isn't fully correct - e.g. consume_dispatch_q() uses task_run_on_remote_rq() which tests scx_rq_online() to see whether the current rq can run the task, and, if so, calls consume_remote_task() to migrate the task to @rq. While the test itself was done while locking @rq, @rq can be temporarily unlocked by consume_remote_task() and nothing prevents SCX_RQ_ONLINE from going offline before the migration takes place. To address the issue, add cpu_active() test to scx_rq_online(). There is a synchronize_rcu() between cpu_active() being cleared and the rq going offline, so if an on-going scheduling operation sees cpu_active(), the associated rq is guaranteed to not go offline until the scheduling operation is complete. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()") Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index cbb020a1b96b..9c752bc7a663 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1827,7 +1827,14 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags) static bool scx_rq_online(struct rq *rq) { - return likely(rq->scx.flags & SCX_RQ_ONLINE); + /* + * Test both cpu_active() and %SCX_RQ_ONLINE. %SCX_RQ_ONLINE indicates + * the online state as seen from the BPF scheduler. cpu_active() test + * guarantees that, if this function returns %true, %SCX_RQ_ONLINE will + * stay set until the current scheduling operation is complete even if + * we aren't locking @rq. + */ + return likely((rq->scx.flags & SCX_RQ_ONLINE) && cpu_active(cpu_of(rq))); } static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 344576fa6a69ce1292ef669c8d50c2088c36dc1e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- sched_ext currently doesn't generate messages when the BPF scheduler is enabled and disabled unless there are errors. It is useful to have paper trail. Improve logging around enable/disable: - Generate info messages on enable and non-error disable. - Update error exit message formatting so that it's consistent with non-error message. Also, prefix ei->msg with the BPF scheduler's name to make it clear where the message is coming from. - Shorten scx_exit_reason() strings for SCX_EXIT_UNREG* for brevity and consistency. v2: Use pr_*() instead of KERN_* consistently. (David) Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Phil Auld <pauld@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 9c752bc7a663..90769d06bbd4 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4022,11 +4022,11 @@ static const char *scx_exit_reason(enum scx_exit_kind kind) { switch (kind) { case SCX_EXIT_UNREG: - return "Scheduler unregistered from user space"; + return "unregistered from user space"; case SCX_EXIT_UNREG_BPF: - return "Scheduler unregistered from BPF"; + return "unregistered from BPF"; case SCX_EXIT_UNREG_KERN: - return "Scheduler unregistered from the main kernel"; + return "unregistered from the main kernel"; case SCX_EXIT_SYSRQ: return "disabled by sysrq-S"; case SCX_EXIT_ERROR: @@ -4144,14 +4144,16 @@ static void scx_ops_disable_workfn(struct kthread_work *work) percpu_up_write(&scx_fork_rwsem); if (ei->kind >= SCX_EXIT_ERROR) { - printk(KERN_ERR "sched_ext: BPF scheduler \"%s\" errored, disabling\n", scx_ops.name); + pr_err("sched_ext: BPF scheduler \"%s\" disabled (%s)\n", + scx_ops.name, ei->reason); - if (ei->msg[0] == '\0') - printk(KERN_ERR "sched_ext: %s\n", ei->reason); - else - printk(KERN_ERR "sched_ext: %s (%s)\n", ei->reason, ei->msg); + if (ei->msg[0] != '\0') + pr_err("sched_ext: %s: %s\n", scx_ops.name, ei->msg); stack_trace_print(ei->bt, ei->bt_len, 2); + } else { + pr_info("sched_ext: BPF scheduler \"%s\" disabled (%s)\n", + scx_ops.name, ei->reason); } if (scx_ops.exit) @@ -4826,6 +4828,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops) if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL)) static_branch_enable(&__scx_switched_all); + pr_info("sched_ext: BPF scheduler \"%s\" enabled%s\n", + scx_ops.name, scx_switched_all() ? "" : " (partial)"); kobject_uevent(scx_root_kobj, KOBJ_ADD); mutex_unlock(&scx_ops_enable_mutex); -- 2.34.1
From: Manu Bretelle <chantr4@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit 33d031ec12105e4e4589dc5f50511a666d6f4b4f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- `__bpf_ops_sched_ext_ops` was missing the initialization of some struct attributes. With https://lore.kernel.org/all/20240722183049.2254692-4-martin.lau@linux.dev/ every single attributes need to be initialized programs (like scx_layered) will fail to load. 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_init not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_exit not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_prep_move not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_move not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_cancel_move not found in kernel, skipping it as it's set to zero 05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_set_weight not found in kernel, skipping it as it's set to zero 05:26:48 [WARN] libbpf: prog 'layered_dump': BPF program load failed: unknown error (-524) 05:26:48 [WARN] libbpf: prog 'layered_dump': -- BEGIN PROG LOAD LOG -- attach to unsupported member dump of struct sched_ext_ops processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 -- END PROG LOAD LOG -- 05:26:48 [WARN] libbpf: prog 'layered_dump': failed to load: -524 05:26:48 [WARN] libbpf: failed to load object 'bpf_bpf' 05:26:48 [WARN] libbpf: failed to load BPF skeleton 'bpf_bpf': -524 Error: Failed to load BPF program Signed-off-by: Manu Bretelle <chantr4@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 90769d06bbd4..60e9b81c0815 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5120,6 +5120,9 @@ static void cpu_online_stub(s32 cpu) {} static void cpu_offline_stub(s32 cpu) {} static s32 init_stub(void) { return -EINVAL; } static void exit_stub(struct scx_exit_info *info) {} +static void dump_stub(struct scx_dump_ctx *ctx) {} +static void dump_cpu_stub(struct scx_dump_ctx *ctx, s32 cpu, bool idle) {} +static void dump_task_stub(struct scx_dump_ctx *ctx, struct task_struct *p) {} static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .select_cpu = select_cpu_stub, @@ -5145,6 +5148,9 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .cpu_offline = cpu_offline_stub, .init = init_stub, .exit = exit_stub, + .dump = dump_stub, + .dump_cpu = dump_cpu_stub, + .dump_task = dump_task_stub, }; static struct bpf_struct_ops bpf_sched_ext_ops = { -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 89909296a51e792f296e52e104a04aed0cb7a9e9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- consume_remote_task() and dispatch_to_local_dsq() use move_task_to_local_dsq() to migrate the task to the target CPU. Currently, move_task_to_local_dsq() expects the caller to lock both the source and destination rq's. While this may save a few lock operations while the rq's are not contended, under contention, the double locking can exacerbate the situation significantly (refer to the linked message below). Update the migration path so that double locking is not used. move_task_to_local_dsq() now expects the caller to be locking the source rq, drops it and then acquires the destination rq lock. Code is simpler this way and, on a 2-way NUMA machine w/ Xeon Gold 6138, 'hackbench 100 thread 5000` shows ~3% improvement with scx_simple. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20240806082716.GP37996@noisy.programming.kicks-ass.... Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 134 ++++++++++++++++----------------------------- 1 file changed, 46 insertions(+), 88 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 60e9b81c0815..f64c370f99aa 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2107,11 +2107,12 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) #ifdef CONFIG_SMP /** * move_task_to_local_dsq - Move a task from a different rq to a local DSQ - * @rq: rq to move the task into, currently locked * @p: task to move * @enq_flags: %SCX_ENQ_* + * @src_rq: rq to move the task from, locked on entry, released on return + * @dst_rq: rq to move the task into, locked on return * - * Move @p which is currently on a different rq to @rq's local DSQ. The caller + * Move @p which is currently on @src_rq to @dst_rq's local DSQ. The caller * must: * * 1. Start with exclusive access to @p either through its DSQ lock or @@ -2119,109 +2120,63 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) * * 2. Set @p->scx.holding_cpu to raw_smp_processor_id(). * - * 3. Remember task_rq(@p). Release the exclusive access so that we don't - * deadlock with dequeue. + * 3. Remember task_rq(@p) as @src_rq. Release the exclusive access so that we + * don't deadlock with dequeue. * - * 4. Lock @rq and the task_rq from #3. + * 4. Lock @src_rq from #3. * * 5. Call this function. * * Returns %true if @p was successfully moved. %false after racing dequeue and - * losing. + * losing. On return, @src_rq is unlocked and @dst_rq is locked. */ -static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p, - u64 enq_flags) +static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags, + struct rq *src_rq, struct rq *dst_rq) { - struct rq *task_rq; - - lockdep_assert_rq_held(rq); + lockdep_assert_rq_held(src_rq); /* - * If dequeue got to @p while we were trying to lock both rq's, it'd - * have cleared @p->scx.holding_cpu to -1. While other cpus may have - * updated it to different values afterwards, as this operation can't be + * If dequeue got to @p while we were trying to lock @src_rq, it'd have + * cleared @p->scx.holding_cpu to -1. While other cpus may have updated + * it to different values afterwards, as this operation can't be * preempted or recurse, @p->scx.holding_cpu can never become * raw_smp_processor_id() again before we're done. Thus, we can tell * whether we lost to dequeue by testing whether @p->scx.holding_cpu is * still raw_smp_processor_id(). * + * @p->rq couldn't have changed if we're still the holding cpu. + * * See dispatch_dequeue() for the counterpart. */ - if (unlikely(p->scx.holding_cpu != raw_smp_processor_id())) + if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()) || + WARN_ON_ONCE(src_rq != task_rq(p))) { + raw_spin_rq_unlock(src_rq); + raw_spin_rq_lock(dst_rq); return false; + } - /* @p->rq couldn't have changed if we're still the holding cpu */ - task_rq = task_rq(p); - lockdep_assert_rq_held(task_rq); + /* the following marks @p MIGRATING which excludes dequeue */ + deactivate_task(src_rq, p, 0); + set_task_cpu(p, cpu_of(dst_rq)); + p->scx.sticky_cpu = cpu_of(dst_rq); - WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr)); - deactivate_task(task_rq, p, 0); - set_task_cpu(p, cpu_of(rq)); - p->scx.sticky_cpu = cpu_of(rq); + raw_spin_rq_unlock(src_rq); + raw_spin_rq_lock(dst_rq); /* * We want to pass scx-specific enq_flags but activate_task() will * truncate the upper 32 bit. As we own @rq, we can pass them through * @rq->scx.extra_enq_flags instead. */ - WARN_ON_ONCE(rq->scx.extra_enq_flags); - rq->scx.extra_enq_flags = enq_flags; - activate_task(rq, p, 0); - rq->scx.extra_enq_flags = 0; + WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)); + WARN_ON_ONCE(dst_rq->scx.extra_enq_flags); + dst_rq->scx.extra_enq_flags = enq_flags; + activate_task(dst_rq, p, 0); + dst_rq->scx.extra_enq_flags = 0; return true; } -/** - * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked - * @rq: current rq which is locked - * @src_rq: rq to move task from - * @dst_rq: rq to move task to - * - * We're holding @rq lock and trying to dispatch a task from @src_rq to - * @dst_rq's local DSQ and thus need to lock both @src_rq and @dst_rq. Whether - * @rq stays locked isn't important as long as the state is restored after - * dispatch_to_local_dsq_unlock(). - */ -static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq *src_rq, - struct rq *dst_rq) -{ - if (src_rq == dst_rq) { - raw_spin_rq_unlock(rq); - raw_spin_rq_lock(dst_rq); - } else if (rq == src_rq) { - double_lock_balance(rq, dst_rq); - } else if (rq == dst_rq) { - double_lock_balance(rq, src_rq); - } else { - raw_spin_rq_unlock(rq); - double_rq_lock(src_rq, dst_rq); - } -} - -/** - * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock() - * @rq: current rq which is locked - * @src_rq: rq to move task from - * @dst_rq: rq to move task to - * - * Unlock @src_rq and @dst_rq and ensure that @rq is locked on return. - */ -static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq *src_rq, - struct rq *dst_rq) -{ - if (src_rq == dst_rq) { - raw_spin_rq_unlock(dst_rq); - raw_spin_rq_lock(rq); - } else if (rq == src_rq) { - double_unlock_balance(rq, dst_rq); - } else if (rq == dst_rq) { - double_unlock_balance(rq, src_rq); - } else { - double_rq_unlock(src_rq, dst_rq); - raw_spin_rq_lock(rq); - } -} #endif /* CONFIG_SMP */ static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq, @@ -2273,8 +2228,6 @@ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { - bool moved = false; - lockdep_assert_held(&dsq->lock); /* released on return */ /* @@ -2290,13 +2243,10 @@ static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, p->scx.holding_cpu = raw_smp_processor_id(); raw_spin_unlock(&dsq->lock); - double_lock_balance(rq, task_rq); - - moved = move_task_to_local_dsq(rq, p, 0); - - double_unlock_balance(rq, task_rq); + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(task_rq); - return moved; + return move_task_to_local_dsq(p, 0, task_rq, rq); } #else /* CONFIG_SMP */ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) { return false; } @@ -2390,7 +2340,6 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, #ifdef CONFIG_SMP if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) { - struct rq *locked_dst_rq = dst_rq; bool dsp; /* @@ -2409,7 +2358,11 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, /* store_release ensures that dequeue sees the above */ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); - dispatch_to_local_dsq_lock(rq, src_rq, locked_dst_rq); + /* switch to @src_rq lock */ + if (rq != src_rq) { + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(src_rq); + } /* * We don't require the BPF scheduler to avoid dispatching to @@ -2436,7 +2389,8 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, enq_flags); } } else { - dsp = move_task_to_local_dsq(dst_rq, p, enq_flags); + dsp = move_task_to_local_dsq(p, enq_flags, + src_rq, dst_rq); } /* if the destination CPU is idle, wake it up */ @@ -2444,7 +2398,11 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, dst_rq->curr->sched_class)) resched_curr(dst_rq); - dispatch_to_local_dsq_unlock(rq, src_rq, locked_dst_rq); + /* switch back to @rq lock */ + if (rq != dst_rq) { + raw_spin_rq_unlock(dst_rq); + raw_spin_rq_lock(rq); + } return dsp ? DTL_DISPATCHED : DTL_LOST; } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 59cfdf3f3349019fbfc986a285afcc3873d155f4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- ARRAY_ELEM_PTR() is an access macro used to help the BPF verifier not confused by offseted memory acceeses by yiedling a valid pointer or NULL in a way that's clear to the verifier. As such, the canonical usage involves checking NULL return from the macro. Note that in many cases, the NULL condition can never happen - they're there just to hint the verifier. In a bpf_loop in scx_central.bpf.c::central_dispatch(), the NULL check was incorrect in that there was another dereference of the pointer in addition to the NULL checked access. This worked as the pointer can never be NULL and the verifier could tell it would never be NULL in this case. However, this still looks wrong and trips smatch: ./tools/sched_ext/scx_central.bpf.c:205 ____central_dispatch() error: we previously assumed 'gimme' could be null (see line 201) ./tools/sched_ext/scx_central.bpf.c 195 196 if (!scx_bpf_dispatch_nr_slots()) 197 break; 198 199 /* central's gimme is never set */ 200 gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); 201 if (gimme && !*gimme) ^^^^^ If gimme is NULL 202 continue; 203 204 if (dispatch_to_cpu(cpu)) --> 205 *gimme = false; Fix the NULL check so that there are no derefs if NULL. This doesn't change actual behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: http://lkml.kernel.org/r/<955e1c3c-ace2-4a1d-b246-15b8196038a3@stanley.mountain> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/scx_central.bpf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/sched_ext/scx_central.bpf.c b/tools/sched_ext/scx_central.bpf.c index 1d8fd570eaa7..8dd8eb73b6b8 100644 --- a/tools/sched_ext/scx_central.bpf.c +++ b/tools/sched_ext/scx_central.bpf.c @@ -198,7 +198,7 @@ void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) /* central's gimme is never set */ gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); - if (gimme && !*gimme) + if (!gimme || !*gimme) continue; if (dispatch_to_cpu(cpu)) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit bf934bed5e2fd81f767d75c05fb95f0333a1b183 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The cfi stub for ops.tick was missing which will fail scheduler loading after pending BPF changes. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index f64c370f99aa..28f2c8980229 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5059,6 +5059,7 @@ static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags) static void enqueue_stub(struct task_struct *p, u64 enq_flags) {} static void dequeue_stub(struct task_struct *p, u64 enq_flags) {} static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {} +static void tick_stub(struct task_struct *p) {} static void runnable_stub(struct task_struct *p, u64 enq_flags) {} static void running_stub(struct task_struct *p) {} static void stopping_stub(struct task_struct *p, bool runnable) {} @@ -5087,6 +5088,7 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .enqueue = enqueue_stub, .dequeue = dequeue_stub, .dispatch = dispatch_stub, + .tick = tick_stub, .runnable = runnable_stub, .running = running_stub, .stopping = stopping_stub, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 0366017e0973a52245787e972c8ac68a88692bba category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- When deciding whether a task can be migrated to a CPU, dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online() tests instead of using task_can_run_on_remote_rq(). This had two problems. - It was missing is_migration_disabled() check and thus could try to migrate a task which shouldn't leading to assertion and scheduling failures. - It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu() and thus failed to consider ISA compatibility. Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq(): - Move scx_ops_error() triggering into task_can_run_on_remote_rq(). - When migration isn't allowed, fall back to the global DSQ instead of the source DSQ by returning DTL_INVALID. This is both simpler and an overall better behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 28f2c8980229..3904688dafd2 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2212,16 +2212,30 @@ static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq, * - The BPF scheduler is bypassed while the rq is offline and we can always say * no to the BPF scheduler initiated migrations while offline. */ -static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) +static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, + bool trigger_error) { int cpu = cpu_of(rq); - if (!task_allowed_on_cpu(p, cpu)) + /* + * We don't require the BPF scheduler to avoid dispatching to offline + * CPUs mostly for convenience but also because CPUs can go offline + * between scx_bpf_dispatch() calls and here. Trigger error iff the + * picked CPU is outside the allowed mask. + */ + if (!task_allowed_on_cpu(p, cpu)) { + if (trigger_error) + scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]", + cpu_of(rq), p->comm, p->pid); return false; + } + if (unlikely(is_migration_disabled(p))) return false; + if (!scx_rq_online(rq)) return false; + return true; } @@ -2249,9 +2263,8 @@ static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, return move_task_to_local_dsq(p, 0, task_rq, rq); } #else /* CONFIG_SMP */ -static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq) { return false; } -static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, - struct task_struct *p, struct rq *task_rq) { return false; } +static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; } +static inline bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { return false; } #endif /* CONFIG_SMP */ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) @@ -2276,7 +2289,7 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) return true; } - if (task_can_run_on_remote_rq(p, rq)) { + if (task_can_run_on_remote_rq(p, rq, false)) { if (likely(consume_remote_task(rq, dsq, p, task_rq))) return true; goto retry; @@ -2339,7 +2352,7 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, } #ifdef CONFIG_SMP - if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) { + if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) { bool dsp; /* @@ -2364,17 +2377,6 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, raw_spin_rq_lock(src_rq); } - /* - * We don't require the BPF scheduler to avoid dispatching to - * offline CPUs mostly for convenience but also because CPUs can - * go offline between scx_bpf_dispatch() calls and here. If @p - * is destined to an offline CPU, queue it on its current CPU - * instead, which should always be safe. As this is an allowed - * behavior, don't trigger an ops error. - */ - if (!scx_rq_online(dst_rq)) - dst_rq = src_rq; - if (src_rq == dst_rq) { /* * As @p is staying on the same rq, there's no need to @@ -2408,8 +2410,6 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, } #endif /* CONFIG_SMP */ - scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]", - cpu_of(dst_rq), p->comm, p->pid); return DTL_INVALID; } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 62607d033bb8dc417c2fd06f37f433d468023e66 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Since 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost function. This is simpler and in line with what other balance functions are doing but it loses control over rq->clock_update_flags which makes assert_clock_udpated() trigger if other CPUs pins the rq lock. The only place this matters is touch_core_sched() which uses the timestamp to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may be better to use per-core dispatch sequence number. v2: Use sched_clock_cpu() instead of ktime_get_ns() per David. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from balance_scx()") Acked-by: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 3904688dafd2..83a00eefe281 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1457,13 +1457,18 @@ static void schedule_deferred(struct rq *rq) */ static void touch_core_sched(struct rq *rq, struct task_struct *p) { + lockdep_assert_rq_held(rq); + #ifdef CONFIG_SCHED_CORE /* * It's okay to update the timestamp spuriously. Use * sched_core_disabled() which is cheaper than enabled(). + * + * As this is used to determine ordering between tasks of sibling CPUs, + * it may be better to use per-core dispatch sequence instead. */ if (!sched_core_disabled()) - p->scx.core_sched_at = rq_clock_task(rq); + p->scx.core_sched_at = sched_clock_cpu(cpu_of(rq)); #endif } @@ -1480,7 +1485,6 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p) static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p) { lockdep_assert_rq_held(rq); - assert_clock_updated(rq); #ifdef CONFIG_SCHED_CORE if (SCX_HAS_OP(core_sched_before)) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 7d2180d9d943d31491d77e336557f33670cfe7fd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Turns out the core_sched bits forgot to use the set_next_task(.first=true) variant. Notably: pick_next_task() := pick_task() + set_next_task(.first = true) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 4 ++-- kernel/sched/sched.h | 4 ++++ 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2f20b0a07902..138224b65397 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6055,7 +6055,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) next = rq->core_pick; if (next != prev) { put_prev_task(rq, prev); - set_next_task(rq, next); + set_next_task_first(rq, next); } rq->core_pick = NULL; @@ -6229,7 +6229,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) } out_set_next: - set_next_task(rq, next); + set_next_task_first(rq, next); out: if (rq->core->core_forceidle_count && next == rq->idle) queue_core_balance(rq); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 38349ee6a944..096b9f8bcbc1 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2605,6 +2605,10 @@ static inline void set_next_task(struct rq *rq, struct task_struct *next) next->sched_class->set_next_task(rq, next, false); } +static inline void set_next_task_first(struct rq *rq, struct task_struct *next) +{ + next->sched_class->set_next_task(rq, next, true); +} /* * Helper to define a sched_class instance; each one is placed in a separate -- 2.34.1
From: Yiwei Lin <s921975628@gmail.com> mainline inclusion from mainline-v6.7-rc1 commit 4c456c9ad334a940e354da1002184bc19f4493ef category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The 'curr' argument of pick_next_entity() has become unused after the EEVDF changes. [ mingo: Updated the changelog. ] Signed-off-by: Yiwei Lin <s921975628@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 448a40805c08..74892db443c5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5680,7 +5680,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) * 4) do not run the "skip" process, if something else is available */ static struct sched_entity * -pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) +pick_next_entity(struct cfs_rq *cfs_rq) { /* * Enabling NEXT_BUDDY will affect latency but not fairness. @@ -10197,7 +10197,7 @@ static struct task_struct *pick_task_fair(struct rq *rq) goto again; } - se = pick_next_entity(cfs_rq, curr); + se = pick_next_entity(cfs_rq); cfs_rq = group_cfs_rq(se); } while (cfs_rq); @@ -10283,7 +10283,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf } } - se = pick_next_entity(cfs_rq, curr); + se = pick_next_entity(cfs_rq); cfs_rq = group_cfs_rq(se); #ifdef CONFIG_QOS_SCHED if (check_qos_cfs_rq(cfs_rq)) { @@ -10333,7 +10333,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf put_prev_task(rq, prev); do { - se = pick_next_entity(cfs_rq, NULL); + se = pick_next_entity(cfs_rq); if (check_qos_cfs_rq(group_cfs_rq(se))) { cfs_rq = &rq->cfs; if (!cfs_rq->nr_running) @@ -10360,7 +10360,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf put_prev_task(rq, prev); do { - se = pick_next_entity(cfs_rq, NULL); + se = pick_next_entity(cfs_rq); set_next_entity(cfs_rq, se); cfs_rq = group_cfs_rq(se); } while (cfs_rq); -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 8e2e13ac6122915bd98315237b0317495e391be0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Per 54d27365cae8 ("sched/fair: Prevent throttling in early pick_next_task_fair()") the reason check_cfs_rq_runtime() is under the 'if (curr)' check is to ensure the (downward) traversal does not result in an empty cfs_rq. But then the pick_task_fair() 'copy' of all this made it restart the traversal anyway, so that seems to solve the issue too. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.501679876@infradead.org Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 74892db443c5..7d84284c7c93 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10192,11 +10192,11 @@ static struct task_struct *pick_task_fair(struct rq *rq) update_curr(cfs_rq); else curr = NULL; - - if (unlikely(check_cfs_rq_runtime(cfs_rq))) - goto again; } + if (unlikely(check_cfs_rq_runtime(cfs_rq))) + goto again; + se = pick_next_entity(cfs_rq); cfs_rq = group_cfs_rq(se); } while (cfs_rq); -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit c97f54fe6d014419e557200ed075cf53b47c5420 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- With 4c456c9ad334 ("sched/fair: Remove unused 'curr' argument from pick_next_entity()") curr is no longer being used, so no point in clearing it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.614707623@infradead.org Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7d84284c7c93..dca564cce201 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10184,15 +10184,9 @@ static struct task_struct *pick_task_fair(struct rq *rq) return NULL; do { - struct sched_entity *curr = cfs_rq->curr; - /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ - if (curr) { - if (curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; - } + if (cfs_rq->curr && cfs_rq->curr->on_rq) + update_curr(cfs_rq); if (unlikely(check_cfs_rq_runtime(cfs_rq))) goto again; -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 3b3dd89b8bb0f03657859c22c86c19224f778638 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Implement pick_next_task_fair() in terms of pick_task_fair() to de-duplicate the pick loop. More importantly, this makes all the pick loops use the state-invariant form, which is useful to introduce further re-try conditions in later patches. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.725062368@infradead.org Conflicts: kernel/sched/fair.c [The conflict is with 926b9b0cd97e ("sched: Throttle qos cfs_rq when current cpu is running online task"), some part are moved from pick_next_task_fair() to pick_task_fair(), but there are addition hulk inclusion inside, so moved them together with this mainline patch.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 79 +++++++++++---------------------------------- 1 file changed, 18 insertions(+), 61 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index dca564cce201..3a3eb2318962 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10172,7 +10172,6 @@ void qos_smt_check_need_resched(void) } #endif -#ifdef CONFIG_SMP static struct task_struct *pick_task_fair(struct rq *rq) { struct sched_entity *se; @@ -10184,7 +10183,7 @@ static struct task_struct *pick_task_fair(struct rq *rq) return NULL; do { - /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ + /* Might not have done put_prev_entity() */ if (cfs_rq->curr && cfs_rq->curr->on_rq) update_curr(cfs_rq); @@ -10193,11 +10192,20 @@ static struct task_struct *pick_task_fair(struct rq *rq) se = pick_next_entity(cfs_rq); cfs_rq = group_cfs_rq(se); +#ifdef CONFIG_QOS_SCHED + if (check_qos_cfs_rq(cfs_rq)) { + cfs_rq = &rq->cfs; + WARN(cfs_rq->nr_running == 0, + "rq->nr_running=%u, cfs_rq->idle_h_nr_running=%u\n", + rq->nr_running, cfs_rq->idle_h_nr_running); + if (unlikely(!cfs_rq->nr_running)) + return NULL; + } +#endif } while (cfs_rq); return task_of(se); } -#endif struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) @@ -10225,8 +10233,10 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf } #endif - if (!sched_fair_runnable(rq)) + p = pick_task_fair(rq); + if (!p) goto idle; + se = &p->se; #ifdef CONFIG_FAIR_GROUP_SCHED if (!prev || prev->sched_class != &fair_sched_class) { @@ -10244,62 +10254,14 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf * * Therefore attempt to avoid putting and setting the entire cgroup * hierarchy, only change the part that actually changes. - */ - - do { - struct sched_entity *curr = cfs_rq->curr; - - /* - * Since we got here without doing put_prev_entity() we also - * have to consider cfs_rq->curr. If it is still a runnable - * entity, update_curr() will update its vruntime, otherwise - * forget we've ever seen it. - */ - if (curr) { - if (curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; - - /* - * This call to check_cfs_rq_runtime() will do the - * throttle and dequeue its entity in the parent(s). - * Therefore the nr_running test will indeed - * be correct. - */ - if (unlikely(check_cfs_rq_runtime(cfs_rq))) { - cfs_rq = &rq->cfs; - - if (!cfs_rq->nr_running) - goto idle; - - goto simple; - } - } - - se = pick_next_entity(cfs_rq); - cfs_rq = group_cfs_rq(se); -#ifdef CONFIG_QOS_SCHED - if (check_qos_cfs_rq(cfs_rq)) { - cfs_rq = &rq->cfs; - WARN(cfs_rq->nr_running == 0, - "rq->nr_running=%u, cfs_rq->idle_h_nr_running=%u\n", - rq->nr_running, cfs_rq->idle_h_nr_running); - if (unlikely(!cfs_rq->nr_running)) - return NULL; - } -#endif - } while (cfs_rq); - - p = task_of(se); - - /* + * * Since we haven't yet done put_prev_entity and if the selected task * is a different task than we started out with, try and touch the * least amount of cfs_rqs. */ if (prev != p) { struct sched_entity *pse = &prev->se; + struct cfs_rq *cfs_rq; while (!(cfs_rq = is_same_group(se, pse))) { int se_depth = se->depth; @@ -10353,13 +10315,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf if (prev) put_prev_task(rq, prev); - do { - se = pick_next_entity(cfs_rq); - set_next_entity(cfs_rq, se); - cfs_rq = group_cfs_rq(se); - } while (cfs_rq); - - p = task_of(se); + for_each_sched_entity(se) + set_next_entity(cfs_rq_of(se), se); done: __maybe_unused; #ifdef CONFIG_SMP -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit dae4320b29f0bbdae93f7c1f6f80b19f109ca0bc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) Turns out, there's still a few things in pick_next_task() that are missing from that combination. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.724111109@infradead.org Conflicts: kernel/sched/fair.c kernel/sched/deadline.c [fair:Plenty of conflicts with 89bf80a4d6d5 ("sched: Introduce priority load balance for qos scheduler"), a3c9f2da0a35 ("sched: Introduce handle priority reversion mechanism"), ceea34dd528c ("sched: Implement the function of qos smt expeller"), fix the conflicts based on code logic. Also, remove the warn because does not contain the delayed queue. Remove the `se->sched_delayed` because 82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'") is not merged. deadline: 63ba8422f876 ("sched/deadline: Introduce deadline servers") is not merged, so ignore the changes in deadline.c] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/fair.c | 79 ++++++++++++++++++++++----------------------- 1 file changed, 39 insertions(+), 40 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3a3eb2318962..f983861d11b2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10207,6 +10207,9 @@ static struct task_struct *pick_task_fair(struct rq *rq) return task_of(se); } +static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first); +static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first); + struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { @@ -10279,9 +10282,11 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf put_prev_entity(cfs_rq, pse); set_next_entity(cfs_rq, se); + + __set_next_task_fair(rq, p, true); } - goto done; + return p; #ifdef CONFIG_QOS_SCHED qos_simple: @@ -10307,45 +10312,15 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf se = parent_entity(se); } - goto done; + __set_next_task_fair(rq, p, true); + return p; #endif simple: #endif if (prev) put_prev_task(rq, prev); - - for_each_sched_entity(se) - set_next_entity(cfs_rq_of(se), se); - -done: __maybe_unused; -#ifdef CONFIG_SMP - /* - * Move the next running task to the front of - * the list, so our cfs_tasks list becomes MRU - * one. - */ -#ifdef CONFIG_QOS_SCHED_PRIO_LB - adjust_rq_cfs_tasks(list_move, rq, &p->se); -#else - list_move(&p->se.group_node, &rq->cfs_tasks); -#endif -#endif - - if (hrtick_enabled_fair(rq)) - hrtick_start_fair(rq, p); - - update_misfit_status(p, rq); - sched_fair_update_stop_tick(rq, p); - -#ifdef CONFIG_QOS_SCHED - qos_schedule_throttle(p); -#endif - -#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER - qos_smt_expel(this_cpu, p); -#endif - + set_next_task_fair(rq, p, true); return p; idle: @@ -15120,12 +15095,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p) } } -/* Account for a task changing its policy or group. - * - * This routine is mostly called to set cfs_rq->curr field when a task - * migrates between groups/classes. - */ -static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) +static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) { struct sched_entity *se = &p->se; @@ -15142,6 +15112,33 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) #endif } #endif + if (!first) + return; + + if (hrtick_enabled_fair(rq)) + hrtick_start_fair(rq, p); + + update_misfit_status(p, rq); + sched_fair_update_stop_tick(rq, p); + +#ifdef CONFIG_QOS_SCHED + qos_schedule_throttle(p); +#endif + +#ifdef CONFIG_QOS_SCHED_SMT_EXPELLER + qos_smt_expel(rq->cpu, p); +#endif +} + +/* + * Account for a task changing its policy or group. + * + * This routine is mostly called to set cfs_rq->curr field when a task + * migrates between groups/classes. + */ +static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) +{ + struct sched_entity *se = &p->se; for_each_sched_entity(se) { struct cfs_rq *cfs_rq = cfs_rq_of(se); @@ -15150,6 +15147,8 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) /* ensure bandwidth has been allocated on our new cfs_rq */ account_cfs_rq_runtime(cfs_rq, 0); } + + __set_next_task_fair(rq, p, first); } void init_cfs_rq(struct cfs_rq *cfs_rq) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 260598f142c34811d226fdde5ab0346b48181439 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- With the goal of pushing put_prev_task() after pick_task() / into pick_next_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.943143811@infradead.org Conflicts: kernel/sched/core.c [In function prev_balance(), the mainline patch does not contain the "put_prev_task(rq, prev);", so just remove it.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 138224b65397..5e8caf11d50a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5890,8 +5890,8 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt) schedstat_inc(this_rq()->sched_count); } -static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, - struct rq_flags *rf) +static void prev_balance(struct rq *rq, struct task_struct *prev, + struct rq_flags *rf) { const struct sched_class *start_class = prev->sched_class; const struct sched_class *class; @@ -5919,7 +5919,6 @@ static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, break; } - put_prev_task(rq, prev); } /* @@ -5957,7 +5956,8 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) } restart: - put_prev_task_balance(rq, prev, rf); + prev_balance(rq, prev, rf); + put_prev_task(rq, prev); for_each_active_class(class) { p = class->pick_next_task(rq); @@ -6062,7 +6062,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) goto out; } - put_prev_task_balance(rq, prev, rf); + prev_balance(rq, prev, rf); + put_prev_task(rq, prev); smt_mask = cpu_smt_mask(cpu); need_sync = !!rq->core->core_cookie; -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit fd03c5b8585562d60f8b597b4332d28f48abfe7d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The current rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) And many classes implement it directly as such. Change things around to make pick_next_task() optional while also changing the definition to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) The reason is that sched_ext would like to have a 'final' call that knows the next task. By placing put_prev_task() right next to set_next_task() (as it already is for sched_core) this becomes trivial. As a bonus, this is a nice cleanup on its own. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org Conflicts: kernel/sched/core.c kernel/sched/deadline.c kernel/sched/fair.c kernel/sched/ext.c kernel/sched/idle.c [conflict details: idle.c: COnflicts with a91091aed1fa ("sched: More flexible use of CPU quota when CPU is idle"), so only pick the part modified by this mainline, and keep the logic of a91091aed1fa. fair.c: Conflicts with hulk inclusion ceea34dd528c ("sched: Implement the function of qos smt expeller"), only pick the part modified by this mainline, and keep the logic of ceea34dd528c. deadline.c: Use the mainline patch pick_task_dl() to replace the pick_next_task_dl(). core.c: Combine the logic in mainline into current version in function __pick_next_task(). ext.c: Add params with the core.c, due to lack of d7b01aef9dbd ("Merge branch 'tip/sched/core' into for-6.12") which is the mainline patch.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 32 +++++++++++++++++++++++--------- kernel/sched/deadline.c | 14 +------------- kernel/sched/ext.c | 9 ++++++++- kernel/sched/fair.c | 11 +++++------ kernel/sched/idle.c | 18 +++--------------- kernel/sched/rt.c | 13 +------------ kernel/sched/sched.h | 16 ++++++++++++---- kernel/sched/stop_task.c | 13 +------------ 8 files changed, 54 insertions(+), 72 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5e8caf11d50a..f24f6a46d693 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5948,8 +5948,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) /* Assume the next prioritized class is idle_sched_class */ if (!p) { + p = pick_task_idle(rq); put_prev_task(rq, prev); - p = pick_next_task_idle(rq); + set_next_task_first(rq, p); } return p; @@ -5957,16 +5958,29 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) restart: prev_balance(rq, prev, rf); - put_prev_task(rq, prev); for_each_active_class(class) { - p = class->pick_next_task(rq); - if (p) { - const struct sched_class *prev_class = prev->sched_class; + if (class->pick_next_task) { + p = class->pick_next_task(rq, prev); + if (p) { + const struct sched_class *prev_class = prev->sched_class; + + if (class != prev_class && prev_class->switch_class) + prev_class->switch_class(rq, p); + return p; + } + } else { + p = class->pick_task(rq); + if (p) { + const struct sched_class *prev_class = prev->sched_class; - if (class != prev_class && prev_class->switch_class) - prev_class->switch_class(rq, p); - return p; + put_prev_task(rq, prev); + set_next_task_first(rq, p); + + if (class != prev_class && prev_class->switch_class) + prev_class->switch_class(rq, p); + return p; + } } } @@ -6063,7 +6077,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) } prev_balance(rq, prev, rf); - put_prev_task(rq, prev); smt_mask = cpu_smt_mask(cpu); need_sync = !!rq->core->core_cookie; @@ -6230,6 +6243,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) } out_set_next: + put_prev_task(rq, prev); set_next_task_first(rq, next); out: if (rq->core->core_forceidle_count && next == rq->idle) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 6ab915483b4b..e1b627da0efa 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2031,17 +2031,6 @@ static struct task_struct *pick_task_dl(struct rq *rq) return p; } -static struct task_struct *pick_next_task_dl(struct rq *rq) -{ - struct task_struct *p; - - p = pick_task_dl(rq); - if (p) - set_next_task_dl(rq, p, true); - - return p; -} - static void put_prev_task_dl(struct rq *rq, struct task_struct *p) { struct sched_dl_entity *dl_se = &p->dl; @@ -2762,13 +2751,12 @@ DEFINE_SCHED_CLASS(dl) = { .wakeup_preempt = wakeup_preempt_dl, - .pick_next_task = pick_next_task_dl, + .pick_task = pick_task_dl, .put_prev_task = put_prev_task_dl, .set_next_task = set_next_task_dl, #ifdef CONFIG_SMP .balance = balance_dl, - .pick_task = pick_task_dl, .select_task_rq = select_task_rq_dl, .migrate_task_rq = migrate_task_rq_dl, .set_cpus_allowed = set_cpus_allowed_dl, diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 83a00eefe281..e611bf6b4950 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2783,14 +2783,21 @@ static struct task_struct *first_local_task(struct rq *rq) struct task_struct, scx.dsq_list.node); } -static struct task_struct *pick_next_task_scx(struct rq *rq) +static struct task_struct *pick_next_task_scx(struct rq *rq, + struct task_struct *prev) { struct task_struct *p; + if (prev->sched_class == &ext_sched_class) + put_prev_task_scx(rq, prev); + p = first_local_task(rq); if (!p) return NULL; + if (prev->sched_class != &ext_sched_class) + prev->sched_class->put_prev_task(rq, prev); + set_next_task_scx(rq, p, true); if (unlikely(!p->scx.slice)) { diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f983861d11b2..88065c9967e6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10242,7 +10242,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf se = &p->se; #ifdef CONFIG_FAIR_GROUP_SCHED - if (!prev || prev->sched_class != &fair_sched_class) { + if (prev->sched_class != &fair_sched_class) { #ifdef CONFIG_QOS_SCHED if (cfs_rq->idle_h_nr_running != 0 && rq->online) goto qos_simple; @@ -10318,8 +10318,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf simple: #endif - if (prev) - put_prev_task(rq, prev); + put_prev_task(rq, prev); set_next_task_fair(rq, p, true); return p; @@ -10383,9 +10382,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf return NULL; } -static struct task_struct *__pick_next_task_fair(struct rq *rq) +static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev) { - return pick_next_task_fair(rq, NULL, NULL); + return pick_next_task_fair(rq, prev, NULL); } /* @@ -15505,13 +15504,13 @@ DEFINE_SCHED_CLASS(fair) = { .wakeup_preempt = check_preempt_wakeup_fair, + .pick_task = pick_task_fair, .pick_next_task = __pick_next_task_fair, .put_prev_task = put_prev_task_fair, .set_next_task = set_next_task_fair, #ifdef CONFIG_SMP .balance = balance_fair, - .pick_task = pick_task_fair, .select_task_rq = select_task_rq_fair, .migrate_task_rq = migrate_task_rq_fair, diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 8c3ae98a16f0..12081077893e 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -458,17 +458,8 @@ static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool fir schedstat_inc(rq->sched_goidle); } -#ifdef CONFIG_SMP -static struct task_struct *pick_task_idle(struct rq *rq) -{ - return rq->idle; -} -#endif - -struct task_struct *pick_next_task_idle(struct rq *rq) +struct task_struct *pick_task_idle(struct rq *rq) { - struct task_struct *next = rq->idle; - #ifdef CONFIG_SCHED_SOFT_QUOTA if (sched_feat(SOFT_QUOTA)) { if (unthrottle_cfs_rq_soft_quota(rq) && rq->cfs.nr_running) @@ -476,9 +467,7 @@ struct task_struct *pick_next_task_idle(struct rq *rq) } #endif - set_next_task_idle(rq, next, true); - - return next; + return rq->idle; } /* @@ -534,13 +523,12 @@ DEFINE_SCHED_CLASS(idle) = { .wakeup_preempt = wakeup_preempt_idle, - .pick_next_task = pick_next_task_idle, + .pick_task = pick_task_idle, .put_prev_task = put_prev_task_idle, .set_next_task = set_next_task_idle, #ifdef CONFIG_SMP .balance = balance_idle, - .pick_task = pick_task_idle, .select_task_rq = select_task_rq_idle, .set_cpus_allowed = set_cpus_allowed_common, #endif diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 2bf4e4ba42e6..62058bf546f0 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1808,16 +1808,6 @@ static struct task_struct *pick_task_rt(struct rq *rq) return p; } -static struct task_struct *pick_next_task_rt(struct rq *rq) -{ - struct task_struct *p = pick_task_rt(rq); - - if (p) - set_next_task_rt(rq, p, true); - - return p; -} - static void put_prev_task_rt(struct rq *rq, struct task_struct *p) { struct sched_rt_entity *rt_se = &p->rt; @@ -2711,13 +2701,12 @@ DEFINE_SCHED_CLASS(rt) = { .wakeup_preempt = wakeup_preempt_rt, - .pick_next_task = pick_next_task_rt, + .pick_task = pick_task_rt, .put_prev_task = put_prev_task_rt, .set_next_task = set_next_task_rt, #ifdef CONFIG_SMP .balance = balance_rt, - .pick_task = pick_task_rt, .select_task_rq = select_task_rq_rt, .set_cpus_allowed = set_cpus_allowed_common, .rq_online = rq_online_rt, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 096b9f8bcbc1..975441271989 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2537,7 +2537,17 @@ struct sched_class { KABI_REPLACE(void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags), void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags)) - struct task_struct *(*pick_next_task)(struct rq *rq); + struct task_struct *(*pick_task)(struct rq *rq); + /* + * Optional! When implemented pick_next_task() should be equivalent to: + * + * next = pick_task(); + * if (next) { + * put_prev_task(prev); + * set_next_task_first(next); + * } + */ + struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev); void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); @@ -2547,8 +2557,6 @@ struct sched_class { struct task_struct *prev, struct rq_flags *rf)) int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags); - struct task_struct * (*pick_task)(struct rq *rq); - void (*migrate_task_rq)(struct task_struct *p, int new_cpu); void (*task_woken)(struct rq *this_rq, struct task_struct *task); @@ -2699,7 +2707,7 @@ static inline bool sched_fair_runnable(struct rq *rq) } extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); -extern struct task_struct *pick_next_task_idle(struct rq *rq); +extern struct task_struct *pick_task_idle(struct rq *rq); #define SCA_CHECK 0x01 #define SCA_MIGRATE_DISABLE 0x02 diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 4cf02074fa9e..0fd5352ff0ce 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -41,16 +41,6 @@ static struct task_struct *pick_task_stop(struct rq *rq) return rq->stop; } -static struct task_struct *pick_next_task_stop(struct rq *rq) -{ - struct task_struct *p = pick_task_stop(rq); - - if (p) - set_next_task_stop(rq, p, true); - - return p; -} - static void enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags) { @@ -112,13 +102,12 @@ DEFINE_SCHED_CLASS(stop) = { .wakeup_preempt = wakeup_preempt_stop, - .pick_next_task = pick_next_task_stop, + .pick_task = pick_task_stop, .put_prev_task = put_prev_task_stop, .set_next_task = set_next_task_stop, #ifdef CONFIG_SMP .balance = balance_stop, - .pick_task = pick_task_stop, .select_task_rq = select_task_rq_stop, .set_cpus_allowed = set_cpus_allowed_common, #endif -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched: Fix kabi for "struct task_struct *(*pick_task)(struct rq *rq)" moved out from CONFIG_SMP in struct sched_class. Fixes: 4892708f983a ("sched: Rework pick_next_task()") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 975441271989..53fc24923f1d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2537,7 +2537,6 @@ struct sched_class { KABI_REPLACE(void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags), void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags)) - struct task_struct *(*pick_task)(struct rq *rq); /* * Optional! When implemented pick_next_task() should be equivalent to: * @@ -2547,7 +2546,8 @@ struct sched_class { * set_next_task_first(next); * } */ - struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev); + KABI_REPLACE(struct task_struct *(*pick_next_task)(struct rq *rq), + struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev)) void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); @@ -2557,6 +2557,8 @@ struct sched_class { struct task_struct *prev, struct rq_flags *rf)) int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags); + KABI_BROKEN_REMOVE(struct task_struct * (*pick_task)(struct rq *rq)) + void (*migrate_task_rq)(struct task_struct *p, int new_cpu); void (*task_woken)(struct rq *this_rq, struct task_struct *task); @@ -2600,6 +2602,7 @@ struct sched_class { KABI_USE(2, void (*switching_to) (struct rq *this_rq, struct task_struct *task)) KABI_EXTEND(void (*switch_class)(struct rq *rq, struct task_struct *next)) KABI_EXTEND(int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)) + KABI_EXTEND(struct task_struct *(*pick_task)(struct rq *rq)) }; static inline void put_prev_task(struct rq *rq, struct task_struct *prev) -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit 436f3eed5c69c1048a5754df6e3dbb291e5cccbd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Ensure the last put_prev_task() and the first set_next_task() always go together. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org Conflicts: kernel/sched/core.c [Combine the logic in mainline into current version in function __pick_next_task().] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 19 +++++-------------- kernel/sched/fair.c | 3 +-- kernel/sched/sched.h | 10 +++++++++- 3 files changed, 15 insertions(+), 17 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f24f6a46d693..f389666ccae6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5949,8 +5949,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) /* Assume the next prioritized class is idle_sched_class */ if (!p) { p = pick_task_idle(rq); - put_prev_task(rq, prev); - set_next_task_first(rq, p); + put_prev_set_next_task(rq, prev, p); } return p; @@ -5974,9 +5973,8 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) if (p) { const struct sched_class *prev_class = prev->sched_class; - put_prev_task(rq, prev); - set_next_task_first(rq, p); - + put_prev_set_next_task(rq, prev, p); + if (class != prev_class && prev_class->switch_class) prev_class->switch_class(rq, p); return p; @@ -6067,13 +6065,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq); next = rq->core_pick; - if (next != prev) { - put_prev_task(rq, prev); - set_next_task_first(rq, next); - } - rq->core_pick = NULL; - goto out; + goto out_set_next; } prev_balance(rq, prev, rf); @@ -6243,9 +6236,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) } out_set_next: - put_prev_task(rq, prev); - set_next_task_first(rq, next); -out: + put_prev_set_next_task(rq, prev, next); if (rq->core->core_forceidle_count && next == rq->idle) queue_core_balance(rq); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 88065c9967e6..4cb872d91ecb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10318,8 +10318,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf simple: #endif - put_prev_task(rq, prev); - set_next_task_fair(rq, p, true); + put_prev_set_next_task(rq, prev, p); return p; idle: diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 53fc24923f1d..a2de66a260f4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2616,8 +2616,16 @@ static inline void set_next_task(struct rq *rq, struct task_struct *next) next->sched_class->set_next_task(rq, next, false); } -static inline void set_next_task_first(struct rq *rq, struct task_struct *next) +static inline void put_prev_set_next_task(struct rq *rq, + struct task_struct *prev, + struct task_struct *next) { + WARN_ON_ONCE(rq->curr != prev); + + if (next == prev) + return; + + prev->sched_class->put_prev_task(rq, prev); next->sched_class->set_next_task(rq, next, true); } -- 2.34.1
From: Peter Zijlstra <peterz@infradead.org> mainline inclusion from mainline-v6.12-rc1 commit b2d70222dbf2a2ff7a972a685d249a5d75afa87f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... ------------------------------- In order to tell the previous sched_class what the next task is, add put_prev_task(.next). Notable SCX will use this to: 1) determine the next task will leave the SCX sched class and push the current task to another CPU if possible. 2) statistics on how often and which other classes preempt it Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org Conflicts: kernel/sched/ext.c kernel/sched/deadline.c [This mainline patch changes the interface put_prev_task, but did not change the implementation put_prev_task_scx, so modify it together. For ext.c, the patch d7b01aef9dbd ("Merge branch 'tip/sched/core' into for-6.12") is not merged in, so add the necessary code here.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/deadline.c | 2 +- kernel/sched/ext.c | 7 ++++--- kernel/sched/fair.c | 2 +- kernel/sched/idle.c | 2 +- kernel/sched/rt.c | 2 +- kernel/sched/sched.h | 6 +++--- kernel/sched/stop_task.c | 2 +- 7 files changed, 12 insertions(+), 11 deletions(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index e1b627da0efa..d55e3105e62d 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2031,7 +2031,7 @@ static struct task_struct *pick_task_dl(struct rq *rq) return p; } -static void put_prev_task_dl(struct rq *rq, struct task_struct *p) +static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_struct *next) { struct sched_dl_entity *dl_se = &p->dl; struct dl_rq *dl_rq = &rq->dl; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index e611bf6b4950..93008e4a1e2e 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2728,7 +2728,8 @@ static void process_ddsp_deferred_locals(struct rq *rq) } } -static void put_prev_task_scx(struct rq *rq, struct task_struct *p) +static void put_prev_task_scx(struct rq *rq, struct task_struct *p, + struct task_struct *next) { update_curr_scx(rq); @@ -2789,14 +2790,14 @@ static struct task_struct *pick_next_task_scx(struct rq *rq, struct task_struct *p; if (prev->sched_class == &ext_sched_class) - put_prev_task_scx(rq, prev); + put_prev_task_scx(rq, prev, NULL); p = first_local_task(rq); if (!p) return NULL; if (prev->sched_class != &ext_sched_class) - prev->sched_class->put_prev_task(rq, prev); + prev->sched_class->put_prev_task(rq, prev, p); set_next_task_scx(rq, p, true); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4cb872d91ecb..a7eb4ad50cb3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10389,7 +10389,7 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_stru /* * Account for a descheduled task: */ -static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) +static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next) { struct sched_entity *se = &prev->se; struct cfs_rq *cfs_rq; diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 12081077893e..26e8c5061cff 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -446,7 +446,7 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags) resched_curr(rq); } -static void put_prev_task_idle(struct rq *rq, struct task_struct *prev) +static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next) { scx_update_idle(rq, false); } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 62058bf546f0..8fcf63588472 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1808,7 +1808,7 @@ static struct task_struct *pick_task_rt(struct rq *rq) return p; } -static void put_prev_task_rt(struct rq *rq, struct task_struct *p) +static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *next) { struct sched_rt_entity *rt_se = &p->rt; struct rt_rq *rt_rq = &rq->rt; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a2de66a260f4..895dab1b2d33 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2549,7 +2549,7 @@ struct sched_class { KABI_REPLACE(struct task_struct *(*pick_next_task)(struct rq *rq), struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev)) - void (*put_prev_task)(struct rq *rq, struct task_struct *p); + void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); #ifdef CONFIG_SMP @@ -2608,7 +2608,7 @@ struct sched_class { static inline void put_prev_task(struct rq *rq, struct task_struct *prev) { WARN_ON_ONCE(rq->curr != prev); - prev->sched_class->put_prev_task(rq, prev); + prev->sched_class->put_prev_task(rq, prev, NULL); } static inline void set_next_task(struct rq *rq, struct task_struct *next) @@ -2625,7 +2625,7 @@ static inline void put_prev_set_next_task(struct rq *rq, if (next == prev) return; - prev->sched_class->put_prev_task(rq, prev); + prev->sched_class->put_prev_task(rq, prev, next); next->sched_class->set_next_task(rq, next, true); } diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 0fd5352ff0ce..058dd42e3d9b 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -59,7 +59,7 @@ static void yield_task_stop(struct rq *rq) BUG(); /* the stop task should never yield, its pointless. */ } -static void put_prev_task_stop(struct rq *rq, struct task_struct *prev) +static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct task_struct *next) { update_curr_common(rq); } -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched: Fix kabi for put_prev_task in struct sched_class. Fixes: afce0d86af3a ("sched: Add put_prev_task(.next)") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 895dab1b2d33..aedb4aba1a29 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2549,7 +2549,9 @@ struct sched_class { KABI_REPLACE(struct task_struct *(*pick_next_task)(struct rq *rq), struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev)) - void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next); + KABI_REPLACE(void (*put_prev_task)(struct rq *rq, struct task_struct *p), + void (*put_prev_task)(struct rq *rq, + struct task_struct *p, struct task_struct *next)) void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); #ifdef CONFIG_SMP -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 7c65ae81ea86a6ed8086e1a5651acd766187f19b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- fd03c5b85855 ("sched: Rework pick_next_task()") changed the definition of pick_next_task() from: pick_next_task() := pick_task() + set_next_task(.first = true) to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) making invoking put_prev_task() pick_next_task()'s responsibility. This reordering allows pick_task() to be shared between regular and core-sched paths and put_prev_task() to know the next task. sched_ext depended on put_prev_task_scx() enqueueing the current task before pick_next_task_scx() is called. While pulling sched/core changes, 70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx() before picking the first task as a workaround. Clean it up and adopt the conventions that other sched classes are following. The operation of keeping running the current task was spread and required the task to be put on the local DSQ before picking: - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still runnable, hasn't exhausted its slice, and thus should keep running. - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the only runnable task. do_enqueue_task() in turn decided whether to use the local DSQ depending on SCX_OPS_ENQ_LAST. Consolidate the logic in balance_one() as it always knows whether it is going to keep the current task. balance_one() now considers all conditions where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell pick_next_task_scx() to keep the current task instead of picking one from the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is updated to pick the current task if SCX_TASK_BAL_KEEP is set. The workaround put_prev_task[_scx]() calls are replaced with put_prev_set_next_task(). This causes two behavior changes observable from the BPF scheduler: - When a task keep running, it no longer goes through enqueue/dequeue cycle and thus ops.stopping/running() transitions. The new behavior is better and all the existing schedulers should be able to handle the new behavior. - The BPF scheduler cannot keep executing the current task by enqueueing SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the BPF scheduler is responsible for resuming execution after each SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling decisions are not made on the local CPU - e.g. central or userspace-driven schedulin - and the new behavior is more logical and shouldn't pose any problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it doesn't fit that well anymore and the last task handling is moved to the end of qmap_dispatch(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 130 ++++++++++++++++----------------- tools/sched_ext/scx_qmap.bpf.c | 22 +++++- 2 files changed, 80 insertions(+), 72 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 93008e4a1e2e..89144e5ea060 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -634,11 +634,8 @@ enum scx_enq_flags { * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with the * %SCX_ENQ_LAST flag set. * - * If the BPF scheduler wants to continue executing the task, - * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately. - * If the task gets queued on a different dsq or the BPF side, the BPF - * scheduler is responsible for triggering a follow-up scheduling event. - * Otherwise, Execution may stall. + * The BPF scheduler is responsible for triggering a follow-up + * scheduling event. Otherwise, Execution may stall. */ SCX_ENQ_LAST = 1LLU << 41, @@ -1861,12 +1858,8 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, if (!scx_rq_online(rq)) goto local; - if (scx_ops_bypassing()) { - if (enq_flags & SCX_ENQ_LAST) - goto local; - else - goto global; - } + if (scx_ops_bypassing()) + goto global; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) goto direct; @@ -1876,11 +1869,6 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, unlikely(p->flags & PF_EXITING)) goto local; - /* see %SCX_OPS_ENQ_LAST */ - if (!static_branch_unlikely(&scx_ops_enq_last) && - (enq_flags & SCX_ENQ_LAST)) - goto local; - if (!SCX_HAS_OP(enqueue)) goto global; @@ -2526,7 +2514,6 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); bool prev_on_scx = prev->sched_class == &ext_sched_class; int nr_loops = SCX_DSP_MAX_LOOPS; - bool has_tasks = false; lockdep_assert_rq_held(rq); rq->scx.flags |= SCX_RQ_IN_BALANCE; @@ -2551,9 +2538,9 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) /* * If @prev is runnable & has slice left, it has priority and * fetching more just increases latency for the fetched tasks. - * Tell put_prev_task_scx() to put @prev on local_dsq. If the - * BPF scheduler wants to handle this explicitly, it should - * implement ->cpu_released(). + * Tell pick_next_task_scx() to keep running @prev. If the BPF + * scheduler wants to handle this explicitly, it should + * implement ->cpu_release(). * * See scx_ops_disable_workfn() for the explanation on the * bypassing test. @@ -2579,7 +2566,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) goto has_tasks; if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq)) - goto out; + goto no_tasks; dspc->rq = rq; @@ -2618,13 +2605,23 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) } } while (dspc->nr_tasks); - goto out; +no_tasks: + /* + * Didn't find another task to run. Keep running @prev unless + * %SCX_OPS_ENQ_LAST is in effect. + */ + if ((prev->scx.flags & SCX_TASK_QUEUED) && + (!static_branch_unlikely(&scx_ops_enq_last) || scx_ops_bypassing())) { + if (local) + prev->scx.flags |= SCX_TASK_BAL_KEEP; + goto has_tasks; + } + rq->scx.flags &= ~SCX_RQ_IN_BALANCE; + return false; has_tasks: - has_tasks = true; -out: rq->scx.flags &= ~SCX_RQ_IN_BALANCE; - return has_tasks; + return true; } static int balance_scx(struct rq *rq, struct task_struct *prev, @@ -2737,25 +2734,16 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED)) SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true); - /* - * If we're being called from put_prev_task_balance(), balance_scx() may - * have decided that @p should keep running. - */ - if (p->scx.flags & SCX_TASK_BAL_KEEP) { + if (p->scx.flags & SCX_TASK_QUEUED) { p->scx.flags &= ~SCX_TASK_BAL_KEEP; - set_task_runnable(rq, p); - dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD); - return; - } - if (p->scx.flags & SCX_TASK_QUEUED) { set_task_runnable(rq, p); /* - * If @p has slice left and balance_scx() didn't tag it for - * keeping, @p is getting preempted by a higher priority - * scheduler class or core-sched forcing a different task. Leave - * it at the head of the local DSQ. + * If @p has slice left and is being put, @p is getting + * preempted by a higher priority scheduler class or core-sched + * forcing a different task. Leave it at the head of the local + * DSQ. */ if (p->scx.slice && !scx_ops_bypassing()) { dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD); @@ -2763,18 +2751,17 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, } /* - * If we're in the pick_next_task path, balance_scx() should - * have already populated the local DSQ if there are any other - * available tasks. If empty, tell ops.enqueue() that @p is the - * only one available for this cpu. ops.enqueue() should put it - * on the local DSQ so that the subsequent pick_next_task_scx() - * can find the task unless it wants to trigger a separate - * follow-up scheduling event. + * If @p is runnable but we're about to enter a lower + * sched_class, %SCX_OPS_ENQ_LAST must be set. Tell + * ops.enqueue() that @p is the only one available for this cpu, + * which should trigger an explicit follow-up scheduling event. */ - if (list_empty(&rq->scx.local_dsq.list)) + if (sched_class_above(&ext_sched_class, next->sched_class)) { + WARN_ON_ONCE(!static_branch_unlikely(&scx_ops_enq_last)); do_enqueue_task(rq, p, SCX_ENQ_LAST, -1); - else + } else { do_enqueue_task(rq, p, 0, -1); + } } } @@ -2789,27 +2776,33 @@ static struct task_struct *pick_next_task_scx(struct rq *rq, { struct task_struct *p; - if (prev->sched_class == &ext_sched_class) - put_prev_task_scx(rq, prev, NULL); - - p = first_local_task(rq); - if (!p) - return NULL; - - if (prev->sched_class != &ext_sched_class) - prev->sched_class->put_prev_task(rq, prev, p); - - set_next_task_scx(rq, p, true); + /* + * If balance_scx() is telling us to keep running @prev, replenish slice + * if necessary and keep running @prev. Otherwise, pop the first one + * from the local DSQ. + */ + if (prev->scx.flags & SCX_TASK_BAL_KEEP) { + prev->scx.flags &= ~SCX_TASK_BAL_KEEP; + p = prev; + if (!p->scx.slice) + p->scx.slice = SCX_SLICE_DFL; + } else { + p = first_local_task(rq); + if (!p) + return NULL; - if (unlikely(!p->scx.slice)) { - if (!scx_ops_bypassing() && !scx_warned_zero_slice) { - printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n", - p->comm, p->pid); - scx_warned_zero_slice = true; + if (unlikely(!p->scx.slice)) { + if (!scx_ops_bypassing() && !scx_warned_zero_slice) { + printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n", + p->comm, p->pid); + scx_warned_zero_slice = true; + } + p->scx.slice = SCX_SLICE_DFL; } - p->scx.slice = SCX_SLICE_DFL; } + put_prev_set_next_task(rq, prev, p); + return p; } @@ -3883,12 +3876,13 @@ bool task_should_scx(struct task_struct *p) * to force global FIFO scheduling. * * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order. + * %SCX_OPS_ENQ_LAST is also ignored. * * b. ops.dispatch() is ignored. * - * c. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value can't be - * trusted. Whenever a tick triggers, the running task is rotated to the tail - * of the queue with core_sched_at touched. + * c. balance_scx() does not set %SCX_TASK_BAL_KEEP on non-zero slice as slice + * can't be trusted. Whenever a tick triggers, the running task is rotated to + * the tail of the queue with core_sched_at touched. * * d. pick_next_task() suppresses zero slice warning. * diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 892278f12dce..7e7f28f4117e 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -205,8 +205,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) /* * All enqueued tasks must have their core_sched_seq updated for correct - * core-sched ordering, which is why %SCX_OPS_ENQ_LAST is specified in - * qmap_ops.flags. + * core-sched ordering. Also, take a look at the end of qmap_dispatch(). */ tctx->core_sched_seq = core_sched_tail_seqs[idx]++; @@ -214,7 +213,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) * If qmap_select_cpu() is telling us to or this is the last runnable * task on the CPU, enqueue locally. */ - if (tctx->force_local || (enq_flags & SCX_ENQ_LAST)) { + if (tctx->force_local) { tctx->force_local = false; scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags); return; @@ -285,6 +284,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) { struct task_struct *p; struct cpu_ctx *cpuc; + struct task_ctx *tctx; u32 zero = 0, batch = dsp_batch ?: 1; void *fifo; s32 i, pid; @@ -349,6 +349,21 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) cpuc->dsp_cnt = 0; } + + /* + * No other tasks. @prev will keep running. Update its core_sched_seq as + * if the task were enqueued and dispatched immediately. + */ + if (prev) { + tctx = bpf_task_storage_get(&task_ctx_stor, prev, 0, 0); + if (!tctx) { + scx_bpf_error("task_ctx lookup failed"); + return; + } + + tctx->core_sched_seq = + core_sched_tail_seqs[weight_to_idx(prev->scx.weight)]++; + } } void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p) @@ -701,6 +716,5 @@ SCX_OPS_DEFINE(qmap_ops, .cpu_offline = (void *)qmap_cpu_offline, .init = (void *)qmap_init, .exit = (void *)qmap_exit, - .flags = SCX_OPS_ENQ_LAST, .timeout_ms = 5000U, .name = "qmap"); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 8b1451f2f723f845c05b8bad3d4c45de284338b5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to keep running the current task. It's not really a task property. Replace it with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for the usage. Also, the existing clearing rule is unnecessarily strict and makes it difficult to use with core-sched. Just clear it on entry to balance_one(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 1 - kernel/sched/ext.c | 20 +++++++++----------- kernel/sched/sched.h | 1 + 3 files changed, 10 insertions(+), 12 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 69f68e2121a8..db2a266113ac 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -71,7 +71,6 @@ struct scx_dispatch_q { /* scx_entity.flags */ enum scx_ent_flags { SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ - SCX_TASK_BAL_KEEP = 1 << 1, /* balance decided to keep current */ SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 89144e5ea060..aac9b75a459f 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2517,6 +2517,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) lockdep_assert_rq_held(rq); rq->scx.flags |= SCX_RQ_IN_BALANCE; + rq->scx.flags &= ~SCX_RQ_BAL_KEEP; if (static_branch_unlikely(&scx_ops_cpu_preempt) && unlikely(rq->scx.cpu_released)) { @@ -2532,7 +2533,6 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) } if (prev_on_scx) { - WARN_ON_ONCE(local && (prev->scx.flags & SCX_TASK_BAL_KEEP)); update_curr_scx(rq); /* @@ -2547,13 +2547,13 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) * * When balancing a remote CPU for core-sched, there won't be a * following put_prev_task_scx() call and we don't own - * %SCX_TASK_BAL_KEEP. Instead, pick_task_scx() will test the - * same conditions later and pick @rq->curr accordingly. + * %SCX_RQ_BAL_KEEP. Instead, pick_task_scx() will test the same + * conditions later and pick @rq->curr accordingly. */ if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice && !scx_ops_bypassing()) { if (local) - prev->scx.flags |= SCX_TASK_BAL_KEEP; + rq->scx.flags |= SCX_RQ_BAL_KEEP; goto has_tasks; } } @@ -2613,7 +2613,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) if ((prev->scx.flags & SCX_TASK_QUEUED) && (!static_branch_unlikely(&scx_ops_enq_last) || scx_ops_bypassing())) { if (local) - prev->scx.flags |= SCX_TASK_BAL_KEEP; + rq->scx.flags |= SCX_RQ_BAL_KEEP; goto has_tasks; } rq->scx.flags &= ~SCX_RQ_IN_BALANCE; @@ -2735,8 +2735,6 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true); if (p->scx.flags & SCX_TASK_QUEUED) { - p->scx.flags &= ~SCX_TASK_BAL_KEEP; - set_task_runnable(rq, p); /* @@ -2781,8 +2779,8 @@ static struct task_struct *pick_next_task_scx(struct rq *rq, * if necessary and keep running @prev. Otherwise, pop the first one * from the local DSQ. */ - if (prev->scx.flags & SCX_TASK_BAL_KEEP) { - prev->scx.flags &= ~SCX_TASK_BAL_KEEP; + if ((rq->scx.flags & SCX_RQ_BAL_KEEP) && + !WARN_ON_ONCE(prev->sched_class != &ext_sched_class)) { p = prev; if (!p->scx.slice) p->scx.slice = SCX_SLICE_DFL; @@ -2850,7 +2848,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, * * As put_prev_task_scx() hasn't been called on remote CPUs, we can't just look * at the first task in the local dsq. @rq->curr has to be considered explicitly - * to mimic %SCX_TASK_BAL_KEEP. + * to mimic %SCX_RQ_BAL_KEEP. */ static struct task_struct *pick_task_scx(struct rq *rq) { @@ -3880,7 +3878,7 @@ bool task_should_scx(struct task_struct *p) * * b. ops.dispatch() is ignored. * - * c. balance_scx() does not set %SCX_TASK_BAL_KEEP on non-zero slice as slice + * c. balance_scx() does not set %SCX_RQ_BAL_KEEP on non-zero slice as slice * can't be trusted. Whenever a tick triggers, the running task is rotated to * the tail of the queue with core_sched_at touched. * diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index aedb4aba1a29..cb7356cf0e4a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -841,6 +841,7 @@ enum scx_rq_flags { */ SCX_RQ_ONLINE = 1 << 0, SCX_RQ_CAN_STOP_TICK = 1 << 1, + SCX_RQ_BAL_KEEP = 1 << 2, /* balance decided to keep current */ SCX_RQ_IN_WAKEUP = 1 << 16, SCX_RQ_IN_BALANCE = 1 << 17, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 753e2836d139b43ab535718c5f17c73c284bb299 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Because the BPF scheduler's dispatch path is invoked from balance(), sched_ext needs to invoke balance_one() on all sibling rq's before picking the next task for core-sched. Before the recent pick_next_task() updates, sched_ext couldn't share pick task between regular and core-sched paths because pick_next_task() depended on put_prev_task() being called on the current task. Tasks currently running on sibling rq's can't be put when one rq is trying to pick the next task, so pick_task_scx() had to have a separate mechanism to pick between a sibling rq's current task and the first task in its local DSQ. However, with the preceding updates, pick_next_task_scx() no longer depends on the current task being put and can compare the current task and the next in line statelessly, and the pick task logic should be shareable between regular and core-sched paths. Unify regular and core-sched pick task paths: - There's no reason to distinguish local and sibling picks anymore. @local is removed from balance_one(). - pick_next_task_scx() is turned into pick_task_scx() by dropping the put_prev_set_next_task() call. - The old pick_task_scx() is dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 78 +++++++--------------------------------------- 1 file changed, 11 insertions(+), 67 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index aac9b75a459f..c617bcb656c9 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2509,7 +2509,7 @@ static void flush_dispatch_buf(struct rq *rq) dspc->cursor = 0; } -static int balance_one(struct rq *rq, struct task_struct *prev, bool local) +static int balance_one(struct rq *rq, struct task_struct *prev) { struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); bool prev_on_scx = prev->sched_class == &ext_sched_class; @@ -2538,22 +2538,16 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) /* * If @prev is runnable & has slice left, it has priority and * fetching more just increases latency for the fetched tasks. - * Tell pick_next_task_scx() to keep running @prev. If the BPF + * Tell pick_task_scx() to keep running @prev. If the BPF * scheduler wants to handle this explicitly, it should * implement ->cpu_release(). * * See scx_ops_disable_workfn() for the explanation on the * bypassing test. - * - * When balancing a remote CPU for core-sched, there won't be a - * following put_prev_task_scx() call and we don't own - * %SCX_RQ_BAL_KEEP. Instead, pick_task_scx() will test the same - * conditions later and pick @rq->curr accordingly. */ if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice && !scx_ops_bypassing()) { - if (local) - rq->scx.flags |= SCX_RQ_BAL_KEEP; + rq->scx.flags |= SCX_RQ_BAL_KEEP; goto has_tasks; } } @@ -2612,8 +2606,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev, bool local) */ if ((prev->scx.flags & SCX_TASK_QUEUED) && (!static_branch_unlikely(&scx_ops_enq_last) || scx_ops_bypassing())) { - if (local) - rq->scx.flags |= SCX_RQ_BAL_KEEP; + rq->scx.flags |= SCX_RQ_BAL_KEEP; goto has_tasks; } rq->scx.flags &= ~SCX_RQ_IN_BALANCE; @@ -2631,13 +2624,13 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, rq_unpin_lock(rq, rf); - ret = balance_one(rq, prev, true); + ret = balance_one(rq, prev); #ifdef CONFIG_SCHED_SMT /* * When core-sched is enabled, this ops.balance() call will be followed - * by put_prev_scx() and pick_task_scx() on this CPU and pick_task_scx() - * on the SMT siblings. Balance the siblings too. + * by pick_task_scx() on this CPU and the SMT siblings. Balance the + * siblings too. */ if (sched_core_enabled(rq)) { const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq)); @@ -2649,7 +2642,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq)); update_rq_clock(srq); - balance_one(srq, sprev, false); + balance_one(srq, sprev); } } #endif @@ -2769,9 +2762,9 @@ static struct task_struct *first_local_task(struct rq *rq) struct task_struct, scx.dsq_list.node); } -static struct task_struct *pick_next_task_scx(struct rq *rq, - struct task_struct *prev) +static struct task_struct *pick_task_scx(struct rq *rq) { + struct task_struct *prev = rq->curr; struct task_struct *p; /* @@ -2799,8 +2792,6 @@ static struct task_struct *pick_next_task_scx(struct rq *rq, } } - put_prev_set_next_task(rq, prev, p); - return p; } @@ -2837,49 +2828,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, else return time_after64(a->scx.core_sched_at, b->scx.core_sched_at); } - -/** - * pick_task_scx - Pick a candidate task for core-sched - * @rq: rq to pick the candidate task from - * - * Core-sched calls this function on each SMT sibling to determine the next - * tasks to run on the SMT siblings. balance_one() has been called on all - * siblings and put_prev_task_scx() has been called only for the current CPU. - * - * As put_prev_task_scx() hasn't been called on remote CPUs, we can't just look - * at the first task in the local dsq. @rq->curr has to be considered explicitly - * to mimic %SCX_RQ_BAL_KEEP. - */ -static struct task_struct *pick_task_scx(struct rq *rq) -{ - struct task_struct *curr = rq->curr; - struct task_struct *first = first_local_task(rq); - - if (curr->scx.flags & SCX_TASK_QUEUED) { - /* is curr the only runnable task? */ - if (!first) - return curr; - - /* - * Does curr trump first? We can always go by core_sched_at for - * this comparison as it represents global FIFO ordering when - * the default core-sched ordering is used and local-DSQ FIFO - * ordering otherwise. - * - * We can have a task with an earlier timestamp on the DSQ. For - * example, when a current task is preempted by a sibling - * picking a different cookie, the task would be requeued at the - * head of the local DSQ with an earlier timestamp than the - * core-sched picked next task. Besides, the BPF scheduler may - * dispatch any tasks to the local DSQ anytime. - */ - if (curr->scx.slice && time_before64(curr->scx.core_sched_at, - first->scx.core_sched_at)) - return curr; - } - - return first; /* this may be %NULL */ -} #endif /* CONFIG_SCHED_CORE */ static enum scx_cpu_preempt_reason @@ -3646,7 +3594,7 @@ DEFINE_SCHED_CLASS(ext) = { .wakeup_preempt = wakeup_preempt_scx, .balance = balance_scx, - .pick_next_task = pick_next_task_scx, + .pick_task = pick_task_scx, .put_prev_task = put_prev_task_scx, .set_next_task = set_next_task_scx, @@ -3662,10 +3610,6 @@ DEFINE_SCHED_CLASS(ext) = { .rq_offline = rq_offline_scx, #endif -#ifdef CONFIG_SCHED_CORE - .pick_task = pick_task_scx, -#endif - .task_tick = task_tick_scx, .switching_to = switching_to_scx, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 65aaf90569ffa283170243576c3982521d6cb193 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Relocate functions to ease the removal of switch_class_scx(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 156 ++++++++++++++++++++++----------------------- 1 file changed, 78 insertions(+), 78 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index c617bcb656c9..c856ec7bab07 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2651,6 +2651,31 @@ static int balance_scx(struct rq *rq, struct task_struct *prev, return ret; } +static void process_ddsp_deferred_locals(struct rq *rq) +{ + struct task_struct *p; + + lockdep_assert_rq_held(rq); + + /* + * Now that @rq can be unlocked, execute the deferred enqueueing of + * tasks directly dispatched to the local DSQs of other CPUs. See + * direct_dispatch(). Keep popping from the head instead of using + * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq + * temporarily. + */ + while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, + struct task_struct, scx.dsq_list.node))) { + s32 ret; + + list_del_init(&p->scx.dsq_list.node); + + ret = dispatch_to_local_dsq(rq, p->scx.ddsp_dsq_id, p, + p->scx.ddsp_enq_flags); + WARN_ON_ONCE(ret == DTL_NOT_LOCAL); + } +} + static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) { if (p->scx.flags & SCX_TASK_QUEUED) { @@ -2693,28 +2718,66 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) } } -static void process_ddsp_deferred_locals(struct rq *rq) +static enum scx_cpu_preempt_reason +preempt_reason_from_class(const struct sched_class *class) { - struct task_struct *p; +#ifdef CONFIG_SMP + if (class == &stop_sched_class) + return SCX_CPU_PREEMPT_STOP; +#endif + if (class == &dl_sched_class) + return SCX_CPU_PREEMPT_DL; + if (class == &rt_sched_class) + return SCX_CPU_PREEMPT_RT; + return SCX_CPU_PREEMPT_UNKNOWN; +} - lockdep_assert_rq_held(rq); +static void switch_class_scx(struct rq *rq, struct task_struct *next) +{ + const struct sched_class *next_class = next->sched_class; + if (!scx_enabled()) + return; +#ifdef CONFIG_SMP /* - * Now that @rq can be unlocked, execute the deferred enqueueing of - * tasks directly dispatched to the local DSQs of other CPUs. See - * direct_dispatch(). Keep popping from the head instead of using - * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq - * temporarily. + * Pairs with the smp_load_acquire() issued by a CPU in + * kick_cpus_irq_workfn() who is waiting for this CPU to perform a + * resched. */ - while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, - struct task_struct, scx.dsq_list.node))) { - s32 ret; + smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); +#endif + if (!static_branch_unlikely(&scx_ops_cpu_preempt)) + return; - list_del_init(&p->scx.dsq_list.node); + /* + * The callback is conceptually meant to convey that the CPU is no + * longer under the control of SCX. Therefore, don't invoke the callback + * if the next class is below SCX (in which case the BPF scheduler has + * actively decided not to schedule any tasks on the CPU). + */ + if (sched_class_above(&ext_sched_class, next_class)) + return; - ret = dispatch_to_local_dsq(rq, p->scx.ddsp_dsq_id, p, - p->scx.ddsp_enq_flags); - WARN_ON_ONCE(ret == DTL_NOT_LOCAL); + /* + * At this point we know that SCX was preempted by a higher priority + * sched_class, so invoke the ->cpu_release() callback if we have not + * done so already. We only send the callback once between SCX being + * preempted, and it regaining control of the CPU. + * + * ->cpu_release() complements ->cpu_acquire(), which is emitted the + * next time that balance_scx() is invoked. + */ + if (!rq->scx.cpu_released) { + if (SCX_HAS_OP(cpu_release)) { + struct scx_cpu_release_args args = { + .reason = preempt_reason_from_class(next_class), + .task = next, + }; + + SCX_CALL_OP(SCX_KF_CPU_RELEASE, + cpu_release, cpu_of(rq), &args); + } + rq->scx.cpu_released = true; } } @@ -2830,69 +2893,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, } #endif /* CONFIG_SCHED_CORE */ -static enum scx_cpu_preempt_reason -preempt_reason_from_class(const struct sched_class *class) -{ -#ifdef CONFIG_SMP - if (class == &stop_sched_class) - return SCX_CPU_PREEMPT_STOP; -#endif - if (class == &dl_sched_class) - return SCX_CPU_PREEMPT_DL; - if (class == &rt_sched_class) - return SCX_CPU_PREEMPT_RT; - return SCX_CPU_PREEMPT_UNKNOWN; -} - -static void switch_class_scx(struct rq *rq, struct task_struct *next) -{ - const struct sched_class *next_class = next->sched_class; - - if (!scx_enabled()) - return; -#ifdef CONFIG_SMP - /* - * Pairs with the smp_load_acquire() issued by a CPU in - * kick_cpus_irq_workfn() who is waiting for this CPU to perform a - * resched. - */ - smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); -#endif - if (!static_branch_unlikely(&scx_ops_cpu_preempt)) - return; - - /* - * The callback is conceptually meant to convey that the CPU is no - * longer under the control of SCX. Therefore, don't invoke the callback - * if the next class is below SCX (in which case the BPF scheduler has - * actively decided not to schedule any tasks on the CPU). - */ - if (sched_class_above(&ext_sched_class, next_class)) - return; - - /* - * At this point we know that SCX was preempted by a higher priority - * sched_class, so invoke the ->cpu_release() callback if we have not - * done so already. We only send the callback once between SCX being - * preempted, and it regaining control of the CPU. - * - * ->cpu_release() complements ->cpu_acquire(), which is emitted the - * next time that balance_scx() is invoked. - */ - if (!rq->scx.cpu_released) { - if (SCX_HAS_OP(cpu_release)) { - struct scx_cpu_release_args args = { - .reason = preempt_reason_from_class(next_class), - .task = next, - }; - - SCX_CALL_OP(SCX_KF_CPU_RELEASE, - cpu_release, cpu_of(rq), &args); - } - rq->scx.cpu_released = true; - } -} - #ifdef CONFIG_SMP static bool test_and_clear_cpu_idle(int cpu) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit f422316d7466da7724f0662b7e282afbbca78e95 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Now that put_prev_task_scx() is called with @next on task switches, there's no reason to use sched_class.switch_class(). Rename switch_class_scx() to switch_class() and call it from put_prev_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index c856ec7bab07..cea5a9d3df80 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2732,12 +2732,10 @@ preempt_reason_from_class(const struct sched_class *class) return SCX_CPU_PREEMPT_UNKNOWN; } -static void switch_class_scx(struct rq *rq, struct task_struct *next) +static void switch_class(struct rq *rq, struct task_struct *next) { const struct sched_class *next_class = next->sched_class; - if (!scx_enabled()) - return; #ifdef CONFIG_SMP /* * Pairs with the smp_load_acquire() issued by a CPU in @@ -2817,6 +2815,9 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, do_enqueue_task(rq, p, 0, -1); } } + + if (next && next->sched_class != &ext_sched_class) + switch_class(rq, next); } static struct task_struct *first_local_task(struct rq *rq) @@ -3599,8 +3600,6 @@ DEFINE_SCHED_CLASS(ext) = { .put_prev_task = put_prev_task_scx, .set_next_task = set_next_task_scx, - .switch_class = switch_class_scx, - #ifdef CONFIG_SMP .select_task_rq = select_task_rq_scx, .task_woken = task_woken_scx, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 37cb049ef8b89e97d735f5cf7e34ef3f85ed6d94 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- With sched_ext converted to use put_prev_task() for class switch detection, there's no user of switch_class() left. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f389666ccae6..57ad03167265 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5961,13 +5961,8 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) for_each_active_class(class) { if (class->pick_next_task) { p = class->pick_next_task(rq, prev); - if (p) { - const struct sched_class *prev_class = prev->sched_class; - - if (class != prev_class && prev_class->switch_class) - prev_class->switch_class(rq, p); + if (p) return p; - } } else { p = class->pick_task(rq); if (p) { -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 61eeb9a90522da89b95a1f02060fd5a2caaae4df category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while calling scx_ops_exit_task() on all tasks including dead ones. This can leave a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent. If another task was in the process of changing the TASK_DEAD task's scheduling class and grabs the rq lock after scx_ops_disable_workfn() is done with the task, the task ends up calling scx_ops_disable_task() on the dead task which is in an inconsistent state triggering a warning: WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160 ... RIP: 0010:scx_ops_disable_task+0x12c/0x160 ... Call Trace: <TASK> check_class_changed+0x2c/0x70 __sched_setscheduler+0x8a0/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f140d70ea5b There is no reason to leave dead tasks on SCX when unloading the BPF scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including the dead ones from SCX. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 24 ++++++++---------------- 1 file changed, 8 insertions(+), 16 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index cea5a9d3df80..98a46b1f8801 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4006,30 +4006,22 @@ static void scx_ops_disable_workfn(struct kthread_work *work) spin_lock_irq(&scx_tasks_lock); scx_task_iter_init(&sti); /* - * Invoke scx_ops_exit_task() on all non-idle tasks, including - * TASK_DEAD tasks. Because dead tasks may have a nonzero refcount, - * we may not have invoked sched_ext_free() on them by the time a - * scheduler is disabled. We must therefore exit the task here, or we'd - * fail to invoke ops.exit_task(), as the scheduler will have been - * unloaded by the time the task is subsequently exited on the - * sched_ext_free() path. + * The BPF scheduler is going away. All tasks including %TASK_DEAD ones + * must be switched out and exited synchronously. */ while ((p = scx_task_iter_next_locked(&sti, true))) { const struct sched_class *old_class = p->sched_class; struct sched_enq_and_set_ctx ctx; - if (READ_ONCE(p->__state) != TASK_DEAD) { - sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, - &ctx); + sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx); - p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL); - __setscheduler_prio(p, p->prio); - check_class_changing(task_rq(p), p, old_class); + p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL); + __setscheduler_prio(p, p->prio); + check_class_changing(task_rq(p), p, old_class); - sched_enq_and_set_task(&ctx); + sched_enq_and_set_task(&ctx); - check_class_changed(task_rq(p), p, old_class, p->prio); - } + check_class_changed(task_rq(p), p, old_class, p->prio); scx_ops_exit_task(p); } scx_task_iter_exit(&sti); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit a8532fac7b5d27b8d62008a89593dccb6f9786ef category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task() on every task. To do this, it does get_task_struct() on each iterated task, drop the lock and then call ops.init_task(). However, a TASK_DEAD task may already have lost all its usage count and be waiting for RCU grace period to be freed. If get_task_struct() is called on such task, use-after-free can happen. To avoid such situations, scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe as they are never going to be scheduled again. Unfortunately, a racing sched_setscheduler(2) can grab the task before the task is unhashed and then continue to e.g. move the task from RT to SCX after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't gone through scx_ops_init_task(), scx_ops_enable_task() called from switching_to_scx() triggers the following warning: sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872] WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0 ... RIP: 0010:scx_ops_enable_task+0x18f/0x1f0 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x84e/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e As in the ops_disable path, it just doesn't seem like a good idea to leave any task in an inconsistent state, even when the task is dead. The root cause is ops_enable not being able to tell reliably whether a task is truly dead (no one else is looking at it and it's about to be freed) and was testing TASK_DEAD instead. Fix it by testing the task's usage count directly. - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all tasks, @include_dead is removed from scx_task_iter_next_locked() along with dead task filtering. - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct() fails. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/task.h | 5 +++++ kernel/sched/ext.c | 30 +++++++++++++----------------- 2 files changed, 18 insertions(+), 17 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 03d35e3eda3c..a4f5fd4774ce 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -118,6 +118,11 @@ static inline struct task_struct *get_task_struct(struct task_struct *t) return t; } +static inline struct task_struct *tryget_task_struct(struct task_struct *t) +{ + return refcount_inc_not_zero(&t->usage) ? t : NULL; +} + extern void __put_task_struct(struct task_struct *t); extern void __put_task_struct_rcu_cb(struct rcu_head *rhp); diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 98a46b1f8801..8a250c072aff 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1244,11 +1244,10 @@ static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter) * whether they would like to filter out dead tasks. See scx_task_iter_init() * for details. */ -static struct task_struct * -scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead) +static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter) { struct task_struct *p; -retry: + scx_task_iter_rq_unlock(iter); while ((p = scx_task_iter_next(iter))) { @@ -1286,16 +1285,6 @@ scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead) iter->rq = task_rq_lock(p, &iter->rf); iter->locked = p; - /* - * If we see %TASK_DEAD, @p already disabled preemption, is about to do - * the final __schedule(), won't ever need to be scheduled again and can - * thus be safely ignored. If we don't see %TASK_DEAD, @p can't enter - * the final __schedle() while we're locking its rq and thus will stay - * alive until the rq is unlocked. - */ - if (!include_dead && READ_ONCE(p->__state) == TASK_DEAD) - goto retry; - return p; } @@ -4009,7 +3998,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work) * The BPF scheduler is going away. All tasks including %TASK_DEAD ones * must be switched out and exited synchronously. */ - while ((p = scx_task_iter_next_locked(&sti, true))) { + while ((p = scx_task_iter_next_locked(&sti))) { const struct sched_class *old_class = p->sched_class; struct sched_enq_and_set_ctx ctx; @@ -4640,8 +4629,15 @@ static int scx_ops_enable(struct sched_ext_ops *ops) spin_lock_irq(&scx_tasks_lock); scx_task_iter_init(&sti); - while ((p = scx_task_iter_next_locked(&sti, false))) { - get_task_struct(p); + while ((p = scx_task_iter_next_locked(&sti))) { + /* + * @p may already be dead, have lost all its usages counts and + * be waiting for RCU grace period before being freed. @p can't + * be initialized for SCX in such cases and should be ignored. + */ + if (!tryget_task_struct(p)) + continue; + scx_task_iter_rq_unlock(&sti); spin_unlock_irq(&scx_tasks_lock); @@ -4694,7 +4690,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL)); scx_task_iter_init(&sti); - while ((p = scx_task_iter_next_locked(&sti, false))) { + while ((p = scx_task_iter_next_locked(&sti))) { const struct sched_class *old_class = p->sched_class; struct sched_enq_and_set_ctx ctx; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 859dc4ec5a4321298d8978cb6eeb3e21a2eebab5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- A new BPF extensible sched_class will use css_tg() in the init and exit paths to visit all task_groups by walking cgroups. v4: __setscheduler_prio() is already exposed. Dropped from this patch. v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup mechanism. v2: Expose SCHED_CHANGE_BLOCK() too and update the description. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 5 ----- kernel/sched/sched.h | 5 +++++ 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 57ad03167265..8106fecb1862 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9015,11 +9015,6 @@ void sched_move_task(struct task_struct *tsk) } } -static inline struct task_group *css_tg(struct cgroup_subsys_state *css) -{ - return css ? container_of(css, struct task_group, css) : NULL; -} - static struct cgroup_subsys_state * cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index cb7356cf0e4a..595e79404202 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -560,6 +560,11 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data) return walk_tg_tree_from(&root_task_group, down, up, data); } +static inline struct task_group *css_tg(struct cgroup_subsys_state *css) +{ + return css ? container_of(css, struct task_group, css) : NULL; +} + extern int tg_nop(struct task_group *tg, void *data); #ifdef CONFIG_BPF_SCHED extern int tg_change_tag(struct task_group *tg, void *data); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 41082c1d1d2bf6c3e989785fd1def0f09cede446 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Move tg_weight() upward and make cpu_shares_read_u64() use it too. This makes the weight retrieval shared between cgroup v1 and v2 paths and will be used to implement cgroup support for sched_ext. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/core.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8106fecb1862..4b8dbe590c7f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9275,6 +9275,11 @@ static int cpu_uclamp_max_show(struct seq_file *sf, void *v) #endif /* CONFIG_UCLAMP_TASK_GROUP */ #ifdef CONFIG_FAIR_GROUP_SCHED +static unsigned long tg_weight(struct task_group *tg) +{ + return scale_load_down(tg->shares); +} + static int cpu_shares_write_u64(struct cgroup_subsys_state *css, struct cftype *cftype, u64 shareval) { @@ -9286,9 +9291,7 @@ static int cpu_shares_write_u64(struct cgroup_subsys_state *css, static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { - struct task_group *tg = css_tg(css); - - return (u64) scale_load_down(tg->shares); + return tg_weight(css_tg(css)); } #ifdef CONFIG_CFS_BANDWIDTH @@ -10252,11 +10255,6 @@ static int cpu_local_stat_show(struct seq_file *sf, #ifdef CONFIG_FAIR_GROUP_SCHED -static unsigned long tg_weight(struct task_group *tg) -{ - return scale_load_down(tg->shares); -} - static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e179e80c5d4fef458c3cbc3ad4ea17c6d42c0446 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or SCX may implement the feature, put the interface code behind the new CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED. This allows either sched class to enable the itnerface code without ading more complex CONFIG tests. When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares() is added to support later CONFIG_CGROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED builds. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Conflicts: init/Kconfig [Conflicts with hulk inclusion, such as 1621f4654db7 ("sched: Introduce qos scheduler for co-location"), d4614d07adbc ("sched: Introduce qos smt expeller for co-location"), 89bf80a4d6d5 ("sched: Introduce priority load balance for qos scheduler"), and 0b620bf6de24 ("sched/fair: Introduce multiple qos level"). Thus, combine the mainline logic into current version.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- init/Kconfig | 4 ++++ kernel/sched/core.c | 14 +++++++------- kernel/sched/sched.h | 4 +++- 3 files changed, 14 insertions(+), 8 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index b076198e39bb..e620e9fb0369 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1135,9 +1135,13 @@ config QOS_SCHED_MULTILEVEL Expand the qos_level to [-2,2] to distinguish the tasks expected to be with extremely high or low priority level. +config GROUP_SCHED_WEIGHT + def_bool n + config FAIR_GROUP_SCHED bool "Group scheduling for SCHED_OTHER" depends on CGROUP_SCHED + select GROUP_SCHED_WEIGHT default CGROUP_SCHED config CFS_BANDWIDTH diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4b8dbe590c7f..0a90ff7dedff 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -9274,7 +9274,7 @@ static int cpu_uclamp_max_show(struct seq_file *sf, void *v) } #endif /* CONFIG_UCLAMP_TASK_GROUP */ -#ifdef CONFIG_FAIR_GROUP_SCHED +#ifdef CONFIG_GROUP_SCHED_WEIGHT static unsigned long tg_weight(struct task_group *tg) { return scale_load_down(tg->shares); @@ -9293,6 +9293,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css, { return tg_weight(css_tg(css)); } +#endif /* CONFIG_GROUP_SCHED_WEIGHT */ #ifdef CONFIG_CFS_BANDWIDTH static DEFINE_MUTEX(cfs_constraints_mutex); @@ -9646,7 +9647,6 @@ static int cpu_cfs_local_stat_show(struct seq_file *sf, void *v) return 0; } #endif /* CONFIG_CFS_BANDWIDTH */ -#endif /* CONFIG_FAIR_GROUP_SCHED */ #ifdef CONFIG_RT_GROUP_SCHED static int cpu_rt_runtime_write(struct cgroup_subsys_state *css, @@ -9674,7 +9674,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css, } #endif /* CONFIG_RT_GROUP_SCHED */ -#ifdef CONFIG_FAIR_GROUP_SCHED +#ifdef CONFIG_GROUP_SCHED_WEIGHT static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -10076,7 +10076,7 @@ static int soft_domain_cpu_list_seq_show(struct seq_file *sf, void *v) #endif static struct cftype cpu_legacy_files[] = { -#ifdef CONFIG_FAIR_GROUP_SCHED +#ifdef CONFIG_GROUP_SCHED_WEIGHT { .name = "shares", .read_u64 = cpu_shares_read_u64, @@ -10253,7 +10253,7 @@ static int cpu_local_stat_show(struct seq_file *sf, return 0; } -#ifdef CONFIG_FAIR_GROUP_SCHED +#ifdef CONFIG_GROUP_SCHED_WEIGHT static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) @@ -10307,7 +10307,7 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css, return sched_group_set_shares(css_tg(css), scale_load(weight)); } -#endif +#endif /* CONFIG_GROUP_SCHED_WEIGHT */ static void __maybe_unused cpu_period_quota_print(struct seq_file *sf, long period, long quota) @@ -10465,7 +10465,7 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of, #endif static struct cftype cpu_files[] = { -#ifdef CONFIG_FAIR_GROUP_SCHED +#ifdef CONFIG_GROUP_SCHED_WEIGHT { .name = "weight", .flags = CFTYPE_NOT_ON_ROOT, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 595e79404202..8184141ea8a9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -529,7 +529,7 @@ struct task_group { KABI_RESERVE(8) }; -#ifdef CONFIG_FAIR_GROUP_SCHED +#ifdef CONFIG_GROUP_SCHED_WEIGHT #define ROOT_TASK_GROUP_LOAD NICE_0_LOAD /* @@ -635,6 +635,8 @@ extern void set_task_rq_fair(struct sched_entity *se, static inline void set_task_rq_fair(struct sched_entity *se, struct cfs_rq *prev, struct cfs_rq *next) { } #endif /* CONFIG_SMP */ +#else /* !CONFIG_FAIR_GROUP_SCHED */ +static inline int sched_group_set_shares(struct task_group *tg, unsigned long shares) { return 0; } #endif /* CONFIG_FAIR_GROUP_SCHED */ #else /* CONFIG_CGROUP_SCHED */ -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 8195136669661fdfe54e9a8923c33b31c92fc1da category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement or implement only subset of cgroup features. The implemented features can be indicated using %SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features that are not implemented, a warning is triggered. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v7: - cgroup interface file visibility toggling is dropped in favor just warning messages. Dynamically changing interface visiblity caused more confusion than helping. v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE. - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED. v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid locking order conflict w/ cpuset. Better documentation around locking. - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations so that its invocations always match ops.cgroup_move() one-to-one. v4: - Example schedulers moved into their own patches. - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: - Make scx_example_pair switch all tasks by default. - Convert to BPF inline iterators. - scx_bpf_task_cgroup() is added to determine the current cgroup from CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. - scx_example_flatcg added. This demonstrates flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: - Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrea Righi <andrea.righi@canonical.com> Conflicts: kernel/sched/core.c [Conflicts with the hulk inclusion a82fbc2607b0 ("sched: programmable: Add a tag for the task group"), e6798742026d ("sched: Fix might sleep in atomic section issue") and 90136ddd778b ("sched: Fix soft domain group memleak"), so only pick the part modified from mainline patch.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 3 + init/Kconfig | 6 + kernel/sched/core.c | 60 +- kernel/sched/ext.c | 519 +++++++++++++++++- kernel/sched/ext.h | 22 + kernel/sched/sched.h | 5 + tools/sched_ext/include/scx/common.bpf.h | 1 + .../testing/selftests/sched_ext/maximal.bpf.c | 32 ++ 8 files changed, 629 insertions(+), 19 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index db2a266113ac..dc646cca4fcf 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -188,6 +188,9 @@ struct sched_ext_entity { bool disallow; /* reject switching into SCX */ /* cold fields */ +#ifdef CONFIG_EXT_GROUP_SCHED + struct cgroup *cgrp_moving_from; +#endif /* must be the last field, see init_scx_entity() */ struct list_head tasks_node; }; diff --git a/init/Kconfig b/init/Kconfig index e620e9fb0369..cfd8543ea2e5 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1166,6 +1166,12 @@ config RT_GROUP_SCHED realtime bandwidth for them. See Documentation/scheduler/sched-rt-group.rst for more information. +config EXT_GROUP_SCHED + bool + depends on SCHED_CLASS_EXT && CGROUP_SCHED + select GROUP_SCHED_WEIGHT + default y + endif #CGROUP_SCHED config QOS_SCHED_DYNAMIC_AFFINITY diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0a90ff7dedff..f69042ff2d36 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8315,6 +8315,9 @@ void __init sched_init(void) root_task_group.shares = ROOT_TASK_GROUP_LOAD; init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL); #endif /* CONFIG_FAIR_GROUP_SCHED */ +#ifdef CONFIG_EXT_GROUP_SCHED + root_task_group.scx_weight = CGROUP_WEIGHT_DFL; +#endif /* CONFIG_EXT_GROUP_SCHED */ #ifdef CONFIG_RT_GROUP_SCHED root_task_group.rt_se = (struct sched_rt_entity **)ptr; ptr += nr_cpu_ids * sizeof(void **); @@ -8862,6 +8865,7 @@ struct task_group *sched_create_group(struct task_group *parent) tg_init_tag(tg, parent); #endif + scx_group_set_weight(tg, CGROUP_WEIGHT_DFL); alloc_uclamp_sched_group(tg, parent); return tg; @@ -9001,6 +9005,7 @@ void sched_move_task(struct task_struct *tsk) put_prev_task(rq, tsk); sched_change_group(tsk, group); + scx_move_task(tsk); if (queued) enqueue_task(rq, tsk, queue_flags); @@ -9038,6 +9043,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css) { struct task_group *tg = css_tg(css); struct task_group *parent = css_tg(css->parent); + int ret; + + ret = scx_tg_online(tg); + if (ret) + return ret; if (parent) sched_online_group(tg, parent); @@ -9058,6 +9068,7 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css) offline_auto_affinity(tg); offline_soft_domain(tg); + scx_tg_offline(tg); } static void cpu_cgroup_css_released(struct cgroup_subsys_state *css) @@ -9077,9 +9088,9 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css) sched_unregister_group(tg); } -#ifdef CONFIG_RT_GROUP_SCHED static int cpu_cgroup_can_attach(struct cgroup_taskset *tset) { +#ifdef CONFIG_RT_GROUP_SCHED struct task_struct *task; struct cgroup_subsys_state *css; @@ -9087,9 +9098,9 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset) if (!sched_rt_can_attach(css_tg(css), task)) return -EINVAL; } - return 0; -} #endif + return scx_cgroup_can_attach(tset); +} static void cpu_cgroup_attach(struct cgroup_taskset *tset) { @@ -9098,6 +9109,13 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset) cgroup_taskset_for_each(task, css, tset) sched_move_task(task); + + scx_cgroup_finish_attach(); +} + +static void cpu_cgroup_cancel_attach(struct cgroup_taskset *tset) +{ + scx_cgroup_cancel_attach(tset); } #ifdef CONFIG_UCLAMP_TASK_GROUP @@ -9277,15 +9295,25 @@ static int cpu_uclamp_max_show(struct seq_file *sf, void *v) #ifdef CONFIG_GROUP_SCHED_WEIGHT static unsigned long tg_weight(struct task_group *tg) { +#ifdef CONFIG_FAIR_GROUP_SCHED return scale_load_down(tg->shares); +#else + return sched_weight_from_cgroup(tg->scx_weight); +#endif } static int cpu_shares_write_u64(struct cgroup_subsys_state *css, struct cftype *cftype, u64 shareval) { + int ret; + if (shareval > scale_load_down(ULONG_MAX)) shareval = MAX_SHARES; - return sched_group_set_shares(css_tg(css), scale_load(shareval)); + ret = sched_group_set_shares(css_tg(css), scale_load(shareval)); + if (!ret) + scx_group_set_weight(css_tg(css), + sched_weight_to_cgroup(shareval)); + return ret; } static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css, @@ -9684,7 +9712,12 @@ static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css, static int cpu_idle_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, s64 idle) { - return sched_group_set_idle(css_tg(css), idle); + int ret; + + ret = sched_group_set_idle(css_tg(css), idle); + if (!ret) + scx_group_set_idle(css_tg(css), idle); + return ret; } #endif @@ -10265,13 +10298,17 @@ static int cpu_weight_write_u64(struct cgroup_subsys_state *css, struct cftype *cft, u64 cgrp_weight) { unsigned long weight; + int ret; if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX) return -ERANGE; weight = sched_weight_from_cgroup(cgrp_weight); - return sched_group_set_shares(css_tg(css), scale_load(weight)); + ret = sched_group_set_shares(css_tg(css), scale_load(weight)); + if (!ret) + scx_group_set_weight(css_tg(css), cgrp_weight); + return ret; } static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css, @@ -10296,7 +10333,7 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css, struct cftype *cft, s64 nice) { unsigned long weight; - int idx; + int idx, ret; if (nice < MIN_NICE || nice > MAX_NICE) return -ERANGE; @@ -10305,7 +10342,11 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css, idx = array_index_nospec(idx, 40); weight = sched_prio_to_weight[idx]; - return sched_group_set_shares(css_tg(css), scale_load(weight)); + ret = sched_group_set_shares(css_tg(css), scale_load(weight)); + if (!ret) + scx_group_set_weight(css_tg(css), + sched_weight_to_cgroup(weight)); + return ret; } #endif /* CONFIG_GROUP_SCHED_WEIGHT */ @@ -10551,10 +10592,9 @@ struct cgroup_subsys cpu_cgrp_subsys = { .css_free = cpu_cgroup_css_free, .css_extra_stat_show = cpu_extra_stat_show, .css_local_stat_show = cpu_local_stat_show, -#ifdef CONFIG_RT_GROUP_SCHED .can_attach = cpu_cgroup_can_attach, -#endif .attach = cpu_cgroup_attach, + .cancel_attach = cpu_cgroup_cancel_attach, .legacy_cftypes = cpu_legacy_files, .dfl_cftypes = cpu_files, .early_init = true, diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 8a250c072aff..8ae97643caec 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -120,10 +120,16 @@ enum scx_ops_flags { */ SCX_OPS_SWITCH_PARTIAL = 1LLU << 3, + /* + * CPU cgroup support flags + */ + SCX_OPS_HAS_CGROUP_WEIGHT = 1LLU << 16, /* cpu.weight */ + SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | SCX_OPS_ENQ_LAST | SCX_OPS_ENQ_EXITING | - SCX_OPS_SWITCH_PARTIAL, + SCX_OPS_SWITCH_PARTIAL | + SCX_OPS_HAS_CGROUP_WEIGHT, }; /* argument container for ops.init_task() */ @@ -133,6 +139,10 @@ struct scx_init_task_args { * to the scheduler transition path. */ bool fork; +#ifdef CONFIG_EXT_GROUP_SCHED + /* the cgroup the task is joining */ + struct cgroup *cgroup; +#endif }; /* argument container for ops.exit_task() */ @@ -141,6 +151,12 @@ struct scx_exit_task_args { bool cancelled; }; +/* argument container for ops->cgroup_init() */ +struct scx_cgroup_init_args { + /* the weight of the cgroup [1..10000] */ + u32 weight; +}; + enum scx_cpu_preempt_reason { /* next task is being scheduled by &sched_class_rt */ SCX_CPU_PREEMPT_RT, @@ -505,6 +521,79 @@ struct sched_ext_ops { */ void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p); +#ifdef CONFIG_EXT_GROUP_SCHED + /** + * cgroup_init - Initialize a cgroup + * @cgrp: cgroup being initialized + * @args: init arguments, see the struct definition + * + * Either the BPF scheduler is being loaded or @cgrp created, initialize + * @cgrp for sched_ext. This operation may block. + * + * Return 0 for success, -errno for failure. An error return while + * loading will abort loading of the BPF scheduler. During cgroup + * creation, it will abort the specific cgroup creation. + */ + s32 (*cgroup_init)(struct cgroup *cgrp, + struct scx_cgroup_init_args *args); + + /** + * cgroup_exit - Exit a cgroup + * @cgrp: cgroup being exited + * + * Either the BPF scheduler is being unloaded or @cgrp destroyed, exit + * @cgrp for sched_ext. This operation my block. + */ + void (*cgroup_exit)(struct cgroup *cgrp); + + /** + * cgroup_prep_move - Prepare a task to be moved to a different cgroup + * @p: task being moved + * @from: cgroup @p is being moved from + * @to: cgroup @p is being moved to + * + * Prepare @p for move from cgroup @from to @to. This operation may + * block and can be used for allocations. + * + * Return 0 for success, -errno for failure. An error return aborts the + * migration. + */ + s32 (*cgroup_prep_move)(struct task_struct *p, + struct cgroup *from, struct cgroup *to); + + /** + * cgroup_move - Commit cgroup move + * @p: task being moved + * @from: cgroup @p is being moved from + * @to: cgroup @p is being moved to + * + * Commit the move. @p is dequeued during this operation. + */ + void (*cgroup_move)(struct task_struct *p, + struct cgroup *from, struct cgroup *to); + + /** + * cgroup_cancel_move - Cancel cgroup move + * @p: task whose cgroup move is being canceled + * @from: cgroup @p was being moved from + * @to: cgroup @p was being moved to + * + * @p was cgroup_prep_move()'d but failed before reaching cgroup_move(). + * Undo the preparation. + */ + void (*cgroup_cancel_move)(struct task_struct *p, + struct cgroup *from, struct cgroup *to); + + /** + * cgroup_set_weight - A cgroup's weight is being changed + * @cgrp: cgroup whose weight is being updated + * @weight: new weight [1..10000] + * + * Update @tg's weight to @weight. + */ + void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight); +#endif /* CONFIG_CGROUPS */ + /* * All online ops must come before ops.cpu_online(). */ @@ -687,6 +776,11 @@ enum scx_kick_flags { SCX_KICK_WAIT = 1LLU << 2, }; +enum scx_tg_flags { + SCX_TG_ONLINE = 1U << 0, + SCX_TG_INITED = 1U << 1, +}; + enum scx_ops_enable_state { SCX_OPS_PREPPING, SCX_OPS_ENABLING, @@ -3244,6 +3338,28 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) resched_curr(rq); } +#ifdef CONFIG_EXT_GROUP_SCHED +static struct cgroup *tg_cgrp(struct task_group *tg) +{ + /* + * If CGROUP_SCHED is disabled, @tg is NULL. If @tg is an autogroup, + * @tg->css.cgroup is NULL. In both cases, @tg can be treated as the + * root cgroup. + */ + if (tg && tg->css.cgroup) + return tg->css.cgroup; + else + return &cgrp_dfl_root.cgrp; +} + +#define SCX_INIT_TASK_ARGS_CGROUP(tg) .cgroup = tg_cgrp(tg), + +#else /* CONFIG_EXT_GROUP_SCHED */ + +#define SCX_INIT_TASK_ARGS_CGROUP(tg) + +#endif /* CONFIG_EXT_GROUP_SCHED */ + static enum scx_task_state scx_get_task_state(const struct task_struct *p) { return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT; @@ -3288,6 +3404,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool if (SCX_HAS_OP(init_task)) { struct scx_init_task_args args = { + SCX_INIT_TASK_ARGS_CGROUP(tg) .fork = fork, }; @@ -3352,7 +3469,7 @@ static void scx_ops_enable_task(struct task_struct *p) scx_set_task_state(p, SCX_TASK_ENABLED); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); } static void scx_ops_disable_task(struct task_struct *p) @@ -3563,6 +3680,219 @@ bool scx_can_stop_tick(struct rq *rq) } #endif +#ifdef CONFIG_EXT_GROUP_SCHED + +DEFINE_STATIC_PERCPU_RWSEM(scx_cgroup_rwsem); +static bool cgroup_warned_missing_weight; +static bool cgroup_warned_missing_idle; + +static void scx_cgroup_warn_missing_weight(struct task_group *tg) +{ + if (scx_ops_enable_state() == SCX_OPS_DISABLED || + cgroup_warned_missing_weight) + return; + + if ((scx_ops.flags & SCX_OPS_HAS_CGROUP_WEIGHT) || !tg->css.parent) + return; + + pr_warn("sched_ext: \"%s\" does not implement cgroup cpu.weight\n", + scx_ops.name); + cgroup_warned_missing_weight = true; +} + +static void scx_cgroup_warn_missing_idle(struct task_group *tg) +{ + if (scx_ops_enable_state() == SCX_OPS_DISABLED || + cgroup_warned_missing_idle) + return; + + if (!tg->idle) + return; + + pr_warn("sched_ext: \"%s\" does not implement cgroup cpu.idle\n", + scx_ops.name); + cgroup_warned_missing_idle = true; +} + +int scx_tg_online(struct task_group *tg) +{ + int ret = 0; + + WARN_ON_ONCE(tg->scx_flags & (SCX_TG_ONLINE | SCX_TG_INITED)); + + percpu_down_read(&scx_cgroup_rwsem); + + scx_cgroup_warn_missing_weight(tg); + + if (SCX_HAS_OP(cgroup_init)) { + struct scx_cgroup_init_args args = { .weight = tg->scx_weight }; + + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + tg->css.cgroup, &args); + if (!ret) + tg->scx_flags |= SCX_TG_ONLINE | SCX_TG_INITED; + else + ret = ops_sanitize_err("cgroup_init", ret); + } else { + tg->scx_flags |= SCX_TG_ONLINE; + } + + percpu_up_read(&scx_cgroup_rwsem); + return ret; +} + +void scx_tg_offline(struct task_group *tg) +{ + WARN_ON_ONCE(!(tg->scx_flags & SCX_TG_ONLINE)); + + percpu_down_read(&scx_cgroup_rwsem); + + if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED)) + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup); + tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); + + percpu_up_read(&scx_cgroup_rwsem); +} + +int scx_cgroup_can_attach(struct cgroup_taskset *tset) +{ + struct cgroup_subsys_state *css; + struct task_struct *p; + int ret; + + /* released in scx_finish/cancel_attach() */ + percpu_down_read(&scx_cgroup_rwsem); + + if (!scx_enabled()) + return 0; + + cgroup_taskset_for_each(p, css, tset) { + struct cgroup *from = tg_cgrp(task_group(p)); + struct cgroup *to = tg_cgrp(css_tg(css)); + + WARN_ON_ONCE(p->scx.cgrp_moving_from); + + /* + * sched_move_task() omits identity migrations. Let's match the + * behavior so that ops.cgroup_prep_move() and ops.cgroup_move() + * always match one-to-one. + */ + if (from == to) + continue; + + if (SCX_HAS_OP(cgroup_prep_move)) { + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move, + p, from, css->cgroup); + if (ret) + goto err; + } + + p->scx.cgrp_moving_from = from; + } + + return 0; + +err: + cgroup_taskset_for_each(p, css, tset) { + if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, + p->scx.cgrp_moving_from, css->cgroup); + p->scx.cgrp_moving_from = NULL; + } + + percpu_up_read(&scx_cgroup_rwsem); + return ops_sanitize_err("cgroup_prep_move", ret); +} + +void scx_move_task(struct task_struct *p) +{ + if (!scx_enabled()) + return; + + /* + * We're called from sched_move_task() which handles both cgroup and + * autogroup moves. Ignore the latter. + * + * Also ignore exiting tasks, because in the exit path tasks transition + * from the autogroup to the root group, so task_group_is_autogroup() + * alone isn't able to catch exiting autogroup tasks. This is safe for + * cgroup_move(), because cgroup migrations never happen for PF_EXITING + * tasks. + */ + if (task_group_is_autogroup(task_group(p)) || (p->flags & PF_EXITING)) + return; + + /* + * @p must have ops.cgroup_prep_move() called on it and thus + * cgrp_moving_from set. + */ + if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) + SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p, + p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); + p->scx.cgrp_moving_from = NULL; +} + +void scx_cgroup_finish_attach(void) +{ + percpu_up_read(&scx_cgroup_rwsem); +} + +void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) +{ + struct cgroup_subsys_state *css; + struct task_struct *p; + + if (!scx_enabled()) + goto out_unlock; + + cgroup_taskset_for_each(p, css, tset) { + if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, + p->scx.cgrp_moving_from, css->cgroup); + p->scx.cgrp_moving_from = NULL; + } +out_unlock: + percpu_up_read(&scx_cgroup_rwsem); +} + +void scx_group_set_weight(struct task_group *tg, unsigned long weight) +{ + percpu_down_read(&scx_cgroup_rwsem); + + if (tg->scx_weight != weight) { + if (SCX_HAS_OP(cgroup_set_weight)) + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight, + tg_cgrp(tg), weight); + tg->scx_weight = weight; + } + + percpu_up_read(&scx_cgroup_rwsem); +} + +void scx_group_set_idle(struct task_group *tg, bool idle) +{ + percpu_down_read(&scx_cgroup_rwsem); + scx_cgroup_warn_missing_idle(tg); + percpu_up_read(&scx_cgroup_rwsem); +} + +static void scx_cgroup_lock(void) +{ + percpu_down_write(&scx_cgroup_rwsem); +} + +static void scx_cgroup_unlock(void) +{ + percpu_up_write(&scx_cgroup_rwsem); +} + +#else /* CONFIG_EXT_GROUP_SCHED */ + +static inline void scx_cgroup_lock(void) {} +static inline void scx_cgroup_unlock(void) {} + +#endif /* CONFIG_EXT_GROUP_SCHED */ + /* * Omitted operations: * @@ -3694,6 +4024,96 @@ static void destroy_dsq(u64 dsq_id) rcu_read_unlock(); } +#ifdef CONFIG_EXT_GROUP_SCHED +static void scx_cgroup_exit(void) +{ + struct cgroup_subsys_state *css; + + percpu_rwsem_assert_held(&scx_cgroup_rwsem); + + /* + * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk + * cgroups and exit all the inited ones, all online cgroups are exited. + */ + rcu_read_lock(); + css_for_each_descendant_post(css, &root_task_group.css) { + struct task_group *tg = css_tg(css); + + if (!(tg->scx_flags & SCX_TG_INITED)) + continue; + tg->scx_flags &= ~SCX_TG_INITED; + + if (!scx_ops.cgroup_exit) + continue; + + if (WARN_ON_ONCE(!css_tryget(css))) + continue; + rcu_read_unlock(); + + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup); + + rcu_read_lock(); + css_put(css); + } + rcu_read_unlock(); +} + +static int scx_cgroup_init(void) +{ + struct cgroup_subsys_state *css; + int ret; + + percpu_rwsem_assert_held(&scx_cgroup_rwsem); + + cgroup_warned_missing_weight = false; + cgroup_warned_missing_idle = false; + + /* + * scx_tg_on/offline() are excluded thorugh scx_cgroup_rwsem. If we walk + * cgroups and init, all online cgroups are initialized. + */ + rcu_read_lock(); + css_for_each_descendant_pre(css, &root_task_group.css) { + struct task_group *tg = css_tg(css); + struct scx_cgroup_init_args args = { .weight = tg->scx_weight }; + + scx_cgroup_warn_missing_weight(tg); + scx_cgroup_warn_missing_idle(tg); + + if ((tg->scx_flags & + (SCX_TG_ONLINE | SCX_TG_INITED)) != SCX_TG_ONLINE) + continue; + + if (!scx_ops.cgroup_init) { + tg->scx_flags |= SCX_TG_INITED; + continue; + } + + if (WARN_ON_ONCE(!css_tryget(css))) + continue; + rcu_read_unlock(); + + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + css->cgroup, &args); + if (ret) { + css_put(css); + return ret; + } + tg->scx_flags |= SCX_TG_INITED; + + rcu_read_lock(); + css_put(css); + } + rcu_read_unlock(); + + return 0; +} + +#else +static void scx_cgroup_exit(void) {} +static int scx_cgroup_init(void) { return 0; } +#endif + /******************************************************************************** * Sysfs interface and ops enable/disable. @@ -3986,11 +4406,12 @@ static void scx_ops_disable_workfn(struct kthread_work *work) WRITE_ONCE(scx_switching_all, false); /* - * Avoid racing against fork. See scx_ops_enable() for explanation on - * the locking order. + * Avoid racing against fork and cgroup changes. See scx_ops_enable() + * for explanation on the locking order. */ percpu_down_write(&scx_fork_rwsem); cpus_read_lock(); + scx_cgroup_lock(); spin_lock_irq(&scx_tasks_lock); scx_task_iter_init(&sti); @@ -4026,6 +4447,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work) static_branch_disable_cpuslocked(&scx_builtin_idle_enabled); synchronize_rcu(); + scx_cgroup_exit(); + + scx_cgroup_unlock(); cpus_read_unlock(); percpu_up_write(&scx_fork_rwsem); @@ -4582,11 +5006,17 @@ static int scx_ops_enable(struct sched_ext_ops *ops) scx_watchdog_timeout / 2); /* - * Lock out forks before opening the floodgate so that they don't wander - * into the operations prematurely. + * Lock out forks, cgroup on/offlining and moves before opening the + * floodgate so that they don't wander into the operations prematurely. + * + * We don't need to keep the CPUs stable but static_branch_*() requires + * cpus_read_lock() and scx_cgroup_rwsem must nest inside + * cpu_hotplug_lock because of the following dependency chain: + * + * cpu_hotplug_lock --> cgroup_threadgroup_rwsem --> scx_cgroup_rwsem * - * We don't need to keep the CPUs stable but grab cpus_read_lock() to - * ease future locking changes for cgroup suport. + * So, we need to do cpus_read_lock() before scx_cgroup_lock() and use + * static_branch_*_cpuslocked(). * * Note that cpu_hotplug_lock must nest inside scx_fork_rwsem due to the * following dependency chain: @@ -4595,6 +5025,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) */ percpu_down_write(&scx_fork_rwsem); cpus_read_lock(); + scx_cgroup_lock(); check_hotplug_seq(ops); @@ -4617,6 +5048,14 @@ static int scx_ops_enable(struct sched_ext_ops *ops) static_branch_disable_cpuslocked(&scx_builtin_idle_enabled); } + /* + * All cgroups should be initialized before letting in tasks. cgroup + * on/offlining and task migrations are already locked out. + */ + ret = scx_cgroup_init(); + if (ret) + goto err_disable_unlock_all; + static_branch_enable_cpuslocked(&__scx_ops_enabled); /* @@ -4708,6 +5147,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) spin_unlock_irq(&scx_tasks_lock); preempt_enable(); + scx_cgroup_unlock(); cpus_read_unlock(); percpu_up_write(&scx_fork_rwsem); @@ -4742,6 +5182,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops) return ret; err_disable_unlock_all: + scx_cgroup_unlock(); percpu_up_write(&scx_fork_rwsem); err_disable_unlock_cpus: cpus_read_unlock(); @@ -4936,6 +5377,11 @@ static int bpf_scx_check_member(const struct btf_type *t, switch (moff) { case offsetof(struct sched_ext_ops, init_task): +#ifdef CONFIG_EXT_GROUP_SCHED + case offsetof(struct sched_ext_ops, cgroup_init): + case offsetof(struct sched_ext_ops, cgroup_exit): + case offsetof(struct sched_ext_ops, cgroup_prep_move): +#endif case offsetof(struct sched_ext_ops, cpu_online): case offsetof(struct sched_ext_ops, cpu_offline): case offsetof(struct sched_ext_ops, init): @@ -5010,6 +5456,14 @@ static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {} static void enable_stub(struct task_struct *p) {} static void disable_stub(struct task_struct *p) {} +#ifdef CONFIG_EXT_GROUP_SCHED +static s32 cgroup_init_stub(struct cgroup *cgrp, struct scx_cgroup_init_args *args) { return -EINVAL; } +static void cgroup_exit_stub(struct cgroup *cgrp) {} +static s32 cgroup_prep_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) { return -EINVAL; } +static void cgroup_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) {} +static void cgroup_cancel_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) {} +static void cgroup_set_weight_stub(struct cgroup *cgrp, u32 weight) {} +#endif static void cpu_online_stub(s32 cpu) {} static void cpu_offline_stub(s32 cpu) {} static s32 init_stub(void) { return -EINVAL; } @@ -5039,6 +5493,14 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = { .exit_task = exit_task_stub, .enable = enable_stub, .disable = disable_stub, +#ifdef CONFIG_EXT_GROUP_SCHED + .cgroup_init = cgroup_init_stub, + .cgroup_exit = cgroup_exit_stub, + .cgroup_prep_move = cgroup_prep_move_stub, + .cgroup_move = cgroup_move_stub, + .cgroup_cancel_move = cgroup_cancel_move_stub, + .cgroup_set_weight = cgroup_set_weight_stub, +#endif .cpu_online = cpu_online_stub, .cpu_offline = cpu_offline_stub, .init = init_stub, @@ -5288,7 +5750,8 @@ void __init init_sched_ext_class(void) * definitions so that BPF scheduler implementations can use them * through the generated vmlinux.h. */ - WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP | SCX_KICK_PREEMPT); + WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP | SCX_KICK_PREEMPT | + SCX_TG_ONLINE); BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params)); init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL); @@ -6348,6 +6811,41 @@ __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu) return cpu_rq(cpu); } +/** + * scx_bpf_task_cgroup - Return the sched cgroup of a task + * @p: task of interest + * + * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with + * from the scheduler's POV. SCX operations should use this function to + * determine @p's current cgroup as, unlike following @p->cgroups, + * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all + * rq-locked operations. Can be called on the parameter tasks of rq-locked + * operations. The restriction guarantees that @p's rq is locked by the caller. + */ +#ifdef CONFIG_CGROUP_SCHED +__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) +{ + struct task_group *tg = p->sched_task_group; + struct cgroup *cgrp = &cgrp_dfl_root.cgrp; + + if (!scx_kf_allowed_on_arg_tasks(__SCX_KF_RQ_LOCKED, p)) + goto out; + + /* + * A task_group may either be a cgroup or an autogroup. In the latter + * case, @tg->css.cgroup is %NULL. A task_group can't become the other + * kind once created. + */ + if (tg && tg->css.cgroup) + cgrp = tg->css.cgroup; + else + cgrp = &cgrp_dfl_root.cgrp; +out: + cgroup_get(cgrp); + return cgrp; +} +#endif + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_any) @@ -6376,6 +6874,9 @@ BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU) BTF_ID_FLAGS(func, scx_bpf_cpu_rq) +#ifdef CONFIG_CGROUP_SCHED +BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE) +#endif BTF_KFUNCS_END(scx_kfunc_ids_any) static const struct btf_kfunc_id_set scx_kfunc_set_any = { diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 32d3a51f591a..246019519231 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -67,3 +67,25 @@ static inline void scx_update_idle(struct rq *rq, bool idle) #else static inline void scx_update_idle(struct rq *rq, bool idle) {} #endif + +#ifdef CONFIG_CGROUP_SCHED +#ifdef CONFIG_EXT_GROUP_SCHED +int scx_tg_online(struct task_group *tg); +void scx_tg_offline(struct task_group *tg); +int scx_cgroup_can_attach(struct cgroup_taskset *tset); +void scx_move_task(struct task_struct *p); +void scx_cgroup_finish_attach(void); +void scx_cgroup_cancel_attach(struct cgroup_taskset *tset); +void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight); +void scx_group_set_idle(struct task_group *tg, bool idle); +#else /* CONFIG_EXT_GROUP_SCHED */ +static inline int scx_tg_online(struct task_group *tg) { return 0; } +static inline void scx_tg_offline(struct task_group *tg) {} +static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; } +static inline void scx_move_task(struct task_struct *p) {} +static inline void scx_cgroup_finish_attach(void) {} +static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {} +static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {} +static inline void scx_group_set_idle(struct task_group *tg, bool idle) {} +#endif /* CONFIG_EXT_GROUP_SCHED */ +#endif /* CONFIG_CGROUP_SCHED */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 8184141ea8a9..176d30c0eaf5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -477,6 +477,11 @@ struct task_group { struct rt_bandwidth rt_bandwidth; #endif +#ifdef CONFIG_EXT_GROUP_SCHED + u32 scx_flags; /* SCX_TG_* */ + u32 scx_weight; +#endif + struct rcu_head rcu; struct list_head list; diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 20280df62857..457462b19966 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -61,6 +61,7 @@ s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; bool scx_bpf_task_running(const struct task_struct *p) __ksym; s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym; struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym; +struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym; static inline __attribute__((format(printf, 1, 2))) void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {} diff --git a/tools/testing/selftests/sched_ext/maximal.bpf.c b/tools/testing/selftests/sched_ext/maximal.bpf.c index 44612fdaf399..00bfa9cb95d3 100644 --- a/tools/testing/selftests/sched_ext/maximal.bpf.c +++ b/tools/testing/selftests/sched_ext/maximal.bpf.c @@ -95,6 +95,32 @@ void BPF_STRUCT_OPS(maximal_exit_task, struct task_struct *p, void BPF_STRUCT_OPS(maximal_disable, struct task_struct *p) {} +s32 BPF_STRUCT_OPS(maximal_cgroup_init, struct cgroup *cgrp, + struct scx_cgroup_init_args *args) +{ + return 0; +} + +void BPF_STRUCT_OPS(maximal_cgroup_exit, struct cgroup *cgrp) +{} + +s32 BPF_STRUCT_OPS(maximal_cgroup_prep_move, struct task_struct *p, + struct cgroup *from, struct cgroup *to) +{ + return 0; +} + +void BPF_STRUCT_OPS(maximal_cgroup_move, struct task_struct *p, + struct cgroup *from, struct cgroup *to) +{} + +void BPF_STRUCT_OPS(maximal_cgroup_cancel_move, struct task_struct *p, + struct cgroup *from, struct cgroup *to) +{} + +void BPF_STRUCT_OPS(maximal_cgroup_set_weight, struct cgroup *cgrp, u32 weight) +{} + s32 BPF_STRUCT_OPS_SLEEPABLE(maximal_init) { return 0; @@ -126,6 +152,12 @@ struct sched_ext_ops maximal_ops = { .enable = maximal_enable, .exit_task = maximal_exit_task, .disable = maximal_disable, + .cgroup_init = maximal_cgroup_init, + .cgroup_exit = maximal_cgroup_exit, + .cgroup_prep_move = maximal_cgroup_prep_move, + .cgroup_move = maximal_cgroup_move, + .cgroup_cancel_move = maximal_cgroup_cancel_move, + .cgroup_set_weight = maximal_cgroup_set_weight, .init = maximal_init, .exit = maximal_exit, .name = "maximal", -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched_ext: Fix kabi for u32 scx_flags and u32 scx_weight in struct task_group. Fixes: 975068b0ce86 ("sched_ext: Add cgroup support") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 176d30c0eaf5..32ea9c249d4f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -477,11 +477,6 @@ struct task_group { struct rt_bandwidth rt_bandwidth; #endif -#ifdef CONFIG_EXT_GROUP_SCHED - u32 scx_flags; /* SCX_TG_* */ - u32 scx_weight; -#endif - struct rcu_head rcu; struct list_head list; @@ -526,7 +521,12 @@ struct task_group { #else KABI_RESERVE(2) #endif + +#ifdef CONFIG_EXT_GROUP_SCHED + KABI_USE2(3, u32 scx_flags, u32 scx_weight) /* SCX_TG_* */ +#else KABI_RESERVE(3) +#endif KABI_RESERVE(4) KABI_RESERVE(5) KABI_RESERVE(6) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit a4103eacc2ab408bb65e9902f0857b219fb489de category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- This patch adds scx_flatcg example scheduler which implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer by compounding the active weight share at each level. This flattening of hierarchy can bring a substantial performance gain when the cgroup hierarchy is nested multiple levels. in a simple benchmark using wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two apache instances competing with 2:1 weight ratio nested four level deep. However, the gain comes at the cost of not being able to properly handle thundering herd of cgroups. For example, if many cgroups which are nested behind a low priority parent cgroup wake up around the same time, they may be able to consume more CPU cycles than they are entitled to. In many use cases, this isn't a real concern especially given the performance gain. Also, there are ways to mitigate the problem further by e.g. introducing an extra scheduling layer on cgroup delegation boundaries. v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of SCX_OPS_KNOB_CGROUP_WEIGHT. v4: - Revert reference counted kptr for cgv_node as the change caused easily reproducible stalls. v3: - Updated to reflect the core API changes including ops.init/exit_task() and direct dispatch from ops.select_cpu(). Fixes and improvements including additional statistics. - Use reference counted kptr for cgv_node instead of xchg'ing against stash location. - Dropped '-p' option. v2: - Use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/Makefile | 2 +- tools/sched_ext/README.md | 12 + tools/sched_ext/scx_flatcg.bpf.c | 949 +++++++++++++++++++++++++++++++ tools/sched_ext/scx_flatcg.c | 233 ++++++++ tools/sched_ext/scx_flatcg.h | 51 ++ 5 files changed, 1246 insertions(+), 1 deletion(-) create mode 100644 tools/sched_ext/scx_flatcg.bpf.c create mode 100644 tools/sched_ext/scx_flatcg.c create mode 100644 tools/sched_ext/scx_flatcg.h diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile index bf7e108f5ae1..ca3815e572d8 100644 --- a/tools/sched_ext/Makefile +++ b/tools/sched_ext/Makefile @@ -176,7 +176,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR) -c-sched-targets = scx_simple scx_qmap scx_central +c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg $(addprefix $(BINDIR)/,$(c-sched-targets)): \ $(BINDIR)/%: \ diff --git a/tools/sched_ext/README.md b/tools/sched_ext/README.md index 8efe70cc4363..16a42e4060f6 100644 --- a/tools/sched_ext/README.md +++ b/tools/sched_ext/README.md @@ -192,6 +192,18 @@ where this could be particularly useful is running VMs, where running with infinite slices and no timer ticks allows the VM to avoid unnecessary expensive vmexits. +## scx_flatcg + +A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical +weight-based cgroup CPU control by flattening the cgroup hierarchy into a single +layer, by compounding the active weight share at each level. The effect of this +is a much more performant CPU controller, which does not need to descend down +cgroup trees in order to properly compute a cgroup's share. + +Similar to scx_simple, in limited scenarios, this scheduler can perform +reasonably well on single socket-socket systems with a unified L3 cache and show +significantly lowered hierarchical scheduling overhead. + # Troubleshooting diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c new file mode 100644 index 000000000000..3ab2b60781a0 --- /dev/null +++ b/tools/sched_ext/scx_flatcg.bpf.c @@ -0,0 +1,949 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * A demo sched_ext flattened cgroup hierarchy scheduler. It implements + * hierarchical weight-based cgroup CPU control by flattening the cgroup + * hierarchy into a single layer by compounding the active weight share at each + * level. Consider the following hierarchy with weights in parentheses: + * + * R + A (100) + B (100) + * | \ C (100) + * \ D (200) + * + * Ignoring the root and threaded cgroups, only B, C and D can contain tasks. + * Let's say all three have runnable tasks. The total share that each of these + * three cgroups is entitled to can be calculated by compounding its share at + * each level. + * + * For example, B is competing against C and in that competition its share is + * 100/(100+100) == 1/2. At its parent level, A is competing against D and A's + * share in that competition is 100/(200+100) == 1/3. B's eventual share in the + * system can be calculated by multiplying the two shares, 1/2 * 1/3 == 1/6. C's + * eventual shaer is the same at 1/6. D is only competing at the top level and + * its share is 200/(100+200) == 2/3. + * + * So, instead of hierarchically scheduling level-by-level, we can consider it + * as B, C and D competing each other with respective share of 1/6, 1/6 and 2/3 + * and keep updating the eventual shares as the cgroups' runnable states change. + * + * This flattening of hierarchy can bring a substantial performance gain when + * the cgroup hierarchy is nested multiple levels. in a simple benchmark using + * wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it + * outperforms CFS by ~3% with CPU controller disabled and by ~10% with two + * apache instances competing with 2:1 weight ratio nested four level deep. + * + * However, the gain comes at the cost of not being able to properly handle + * thundering herd of cgroups. For example, if many cgroups which are nested + * behind a low priority parent cgroup wake up around the same time, they may be + * able to consume more CPU cycles than they are entitled to. In many use cases, + * this isn't a real concern especially given the performance gain. Also, there + * are ways to mitigate the problem further by e.g. introducing an extra + * scheduling layer on cgroup delegation boundaries. + * + * The scheduler first picks the cgroup to run and then schedule the tasks + * within by using nested weighted vtime scheduling by default. The + * cgroup-internal scheduling can be switched to FIFO with the -f option. + */ +#include <scx/common.bpf.h> +#include "scx_flatcg.h" + +/* + * Maximum amount of retries to find a valid cgroup. + */ +#define CGROUP_MAX_RETRIES 1024 + +char _license[] SEC("license") = "GPL"; + +const volatile u32 nr_cpus = 32; /* !0 for veristat, set during init */ +const volatile u64 cgrp_slice_ns = SCX_SLICE_DFL; +const volatile bool fifo_sched; + +u64 cvtime_now; +UEI_DEFINE(uei); + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __type(key, u32); + __type(value, u64); + __uint(max_entries, FCG_NR_STATS); +} stats SEC(".maps"); + +static void stat_inc(enum fcg_stat_idx idx) +{ + u32 idx_v = idx; + + u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx_v); + if (cnt_p) + (*cnt_p)++; +} + +struct fcg_cpu_ctx { + u64 cur_cgid; + u64 cur_at; +}; + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __type(key, u32); + __type(value, struct fcg_cpu_ctx); + __uint(max_entries, 1); +} cpu_ctx SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_CGRP_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct fcg_cgrp_ctx); +} cgrp_ctx SEC(".maps"); + +struct cgv_node { + struct bpf_rb_node rb_node; + __u64 cvtime; + __u64 cgid; +}; + +private(CGV_TREE) struct bpf_spin_lock cgv_tree_lock; +private(CGV_TREE) struct bpf_rb_root cgv_tree __contains(cgv_node, rb_node); + +struct cgv_node_stash { + struct cgv_node __kptr *node; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 16384); + __type(key, __u64); + __type(value, struct cgv_node_stash); +} cgv_node_stash SEC(".maps"); + +struct fcg_task_ctx { + u64 bypassed_at; +}; + +struct { + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct fcg_task_ctx); +} task_ctx SEC(".maps"); + +/* gets inc'd on weight tree changes to expire the cached hweights */ +u64 hweight_gen = 1; + +static u64 div_round_up(u64 dividend, u64 divisor) +{ + return (dividend + divisor - 1) / divisor; +} + +static bool vtime_before(u64 a, u64 b) +{ + return (s64)(a - b) < 0; +} + +static bool cgv_node_less(struct bpf_rb_node *a, const struct bpf_rb_node *b) +{ + struct cgv_node *cgc_a, *cgc_b; + + cgc_a = container_of(a, struct cgv_node, rb_node); + cgc_b = container_of(b, struct cgv_node, rb_node); + + return cgc_a->cvtime < cgc_b->cvtime; +} + +static struct fcg_cpu_ctx *find_cpu_ctx(void) +{ + struct fcg_cpu_ctx *cpuc; + u32 idx = 0; + + cpuc = bpf_map_lookup_elem(&cpu_ctx, &idx); + if (!cpuc) { + scx_bpf_error("cpu_ctx lookup failed"); + return NULL; + } + return cpuc; +} + +static struct fcg_cgrp_ctx *find_cgrp_ctx(struct cgroup *cgrp) +{ + struct fcg_cgrp_ctx *cgc; + + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0); + if (!cgc) { + scx_bpf_error("cgrp_ctx lookup failed for cgid %llu", cgrp->kn->id); + return NULL; + } + return cgc; +} + +static struct fcg_cgrp_ctx *find_ancestor_cgrp_ctx(struct cgroup *cgrp, int level) +{ + struct fcg_cgrp_ctx *cgc; + + cgrp = bpf_cgroup_ancestor(cgrp, level); + if (!cgrp) { + scx_bpf_error("ancestor cgroup lookup failed"); + return NULL; + } + + cgc = find_cgrp_ctx(cgrp); + if (!cgc) + scx_bpf_error("ancestor cgrp_ctx lookup failed"); + bpf_cgroup_release(cgrp); + return cgc; +} + +static void cgrp_refresh_hweight(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc) +{ + int level; + + if (!cgc->nr_active) { + stat_inc(FCG_STAT_HWT_SKIP); + return; + } + + if (cgc->hweight_gen == hweight_gen) { + stat_inc(FCG_STAT_HWT_CACHE); + return; + } + + stat_inc(FCG_STAT_HWT_UPDATES); + bpf_for(level, 0, cgrp->level + 1) { + struct fcg_cgrp_ctx *cgc; + bool is_active; + + cgc = find_ancestor_cgrp_ctx(cgrp, level); + if (!cgc) + break; + + if (!level) { + cgc->hweight = FCG_HWEIGHT_ONE; + cgc->hweight_gen = hweight_gen; + } else { + struct fcg_cgrp_ctx *pcgc; + + pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1); + if (!pcgc) + break; + + /* + * We can be oppotunistic here and not grab the + * cgv_tree_lock and deal with the occasional races. + * However, hweight updates are already cached and + * relatively low-frequency. Let's just do the + * straightforward thing. + */ + bpf_spin_lock(&cgv_tree_lock); + is_active = cgc->nr_active; + if (is_active) { + cgc->hweight_gen = pcgc->hweight_gen; + cgc->hweight = + div_round_up(pcgc->hweight * cgc->weight, + pcgc->child_weight_sum); + } + bpf_spin_unlock(&cgv_tree_lock); + + if (!is_active) { + stat_inc(FCG_STAT_HWT_RACE); + break; + } + } + } +} + +static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc) +{ + u64 delta, cvtime, max_budget; + + /* + * A node which is on the rbtree can't be pointed to from elsewhere yet + * and thus can't be updated and repositioned. Instead, we collect the + * vtime deltas separately and apply it asynchronously here. + */ + delta = cgc->cvtime_delta; + __sync_fetch_and_sub(&cgc->cvtime_delta, delta); + cvtime = cgv_node->cvtime + delta; + + /* + * Allow a cgroup to carry the maximum budget proportional to its + * hweight such that a full-hweight cgroup can immediately take up half + * of the CPUs at the most while staying at the front of the rbtree. + */ + max_budget = (cgrp_slice_ns * nr_cpus * cgc->hweight) / + (2 * FCG_HWEIGHT_ONE); + if (vtime_before(cvtime, cvtime_now - max_budget)) + cvtime = cvtime_now - max_budget; + + cgv_node->cvtime = cvtime; +} + +static void cgrp_enqueued(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc) +{ + struct cgv_node_stash *stash; + struct cgv_node *cgv_node; + u64 cgid = cgrp->kn->id; + + /* paired with cmpxchg in try_pick_next_cgroup() */ + if (__sync_val_compare_and_swap(&cgc->queued, 0, 1)) { + stat_inc(FCG_STAT_ENQ_SKIP); + return; + } + + stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid); + if (!stash) { + scx_bpf_error("cgv_node lookup failed for cgid %llu", cgid); + return; + } + + /* NULL if the node is already on the rbtree */ + cgv_node = bpf_kptr_xchg(&stash->node, NULL); + if (!cgv_node) { + stat_inc(FCG_STAT_ENQ_RACE); + return; + } + + bpf_spin_lock(&cgv_tree_lock); + cgrp_cap_budget(cgv_node, cgc); + bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less); + bpf_spin_unlock(&cgv_tree_lock); +} + +static void set_bypassed_at(struct task_struct *p, struct fcg_task_ctx *taskc) +{ + /* + * Tell fcg_stopping() that this bypassed the regular scheduling path + * and should be force charged to the cgroup. 0 is used to indicate that + * the task isn't bypassing, so if the current runtime is 0, go back by + * one nanosecond. + */ + taskc->bypassed_at = p->se.sum_exec_runtime ?: (u64)-1; +} + +s32 BPF_STRUCT_OPS(fcg_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) +{ + struct fcg_task_ctx *taskc; + bool is_idle = false; + s32 cpu; + + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle); + + taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); + if (!taskc) { + scx_bpf_error("task_ctx lookup failed"); + return cpu; + } + + /* + * If select_cpu_dfl() is recommending local enqueue, the target CPU is + * idle. Follow it and charge the cgroup later in fcg_stopping() after + * the fact. + */ + if (is_idle) { + set_bypassed_at(p, taskc); + stat_inc(FCG_STAT_LOCAL); + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); + } + + return cpu; +} + +void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags) +{ + struct fcg_task_ctx *taskc; + struct cgroup *cgrp; + struct fcg_cgrp_ctx *cgc; + + taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); + if (!taskc) { + scx_bpf_error("task_ctx lookup failed"); + return; + } + + /* + * Use the direct dispatching and force charging to deal with tasks with + * custom affinities so that we don't have to worry about per-cgroup + * dq's containing tasks that can't be executed from some CPUs. + */ + if (p->nr_cpus_allowed != nr_cpus) { + set_bypassed_at(p, taskc); + + /* + * The global dq is deprioritized as we don't want to let tasks + * to boost themselves by constraining its cpumask. The + * deprioritization is rather severe, so let's not apply that to + * per-cpu kernel threads. This is ham-fisted. We probably wanna + * implement per-cgroup fallback dq's instead so that we have + * more control over when tasks with custom cpumask get issued. + */ + if (p->nr_cpus_allowed == 1 && (p->flags & PF_KTHREAD)) { + stat_inc(FCG_STAT_LOCAL); + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags); + } else { + stat_inc(FCG_STAT_GLOBAL); + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + } + return; + } + + cgrp = scx_bpf_task_cgroup(p); + cgc = find_cgrp_ctx(cgrp); + if (!cgc) + goto out_release; + + if (fifo_sched) { + scx_bpf_dispatch(p, cgrp->kn->id, SCX_SLICE_DFL, enq_flags); + } else { + u64 tvtime = p->scx.dsq_vtime; + + /* + * Limit the amount of budget that an idling task can accumulate + * to one slice. + */ + if (vtime_before(tvtime, cgc->tvtime_now - SCX_SLICE_DFL)) + tvtime = cgc->tvtime_now - SCX_SLICE_DFL; + + scx_bpf_dispatch_vtime(p, cgrp->kn->id, SCX_SLICE_DFL, + tvtime, enq_flags); + } + + cgrp_enqueued(cgrp, cgc); +out_release: + bpf_cgroup_release(cgrp); +} + +/* + * Walk the cgroup tree to update the active weight sums as tasks wake up and + * sleep. The weight sums are used as the base when calculating the proportion a + * given cgroup or task is entitled to at each level. + */ +static void update_active_weight_sums(struct cgroup *cgrp, bool runnable) +{ + struct fcg_cgrp_ctx *cgc; + bool updated = false; + int idx; + + cgc = find_cgrp_ctx(cgrp); + if (!cgc) + return; + + /* + * In most cases, a hot cgroup would have multiple threads going to + * sleep and waking up while the whole cgroup stays active. In leaf + * cgroups, ->nr_runnable which is updated with __sync operations gates + * ->nr_active updates, so that we don't have to grab the cgv_tree_lock + * repeatedly for a busy cgroup which is staying active. + */ + if (runnable) { + if (__sync_fetch_and_add(&cgc->nr_runnable, 1)) + return; + stat_inc(FCG_STAT_ACT); + } else { + if (__sync_sub_and_fetch(&cgc->nr_runnable, 1)) + return; + stat_inc(FCG_STAT_DEACT); + } + + /* + * If @cgrp is becoming runnable, its hweight should be refreshed after + * it's added to the weight tree so that enqueue has the up-to-date + * value. If @cgrp is becoming quiescent, the hweight should be + * refreshed before it's removed from the weight tree so that the usage + * charging which happens afterwards has access to the latest value. + */ + if (!runnable) + cgrp_refresh_hweight(cgrp, cgc); + + /* propagate upwards */ + bpf_for(idx, 0, cgrp->level) { + int level = cgrp->level - idx; + struct fcg_cgrp_ctx *cgc, *pcgc = NULL; + bool propagate = false; + + cgc = find_ancestor_cgrp_ctx(cgrp, level); + if (!cgc) + break; + if (level) { + pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1); + if (!pcgc) + break; + } + + /* + * We need the propagation protected by a lock to synchronize + * against weight changes. There's no reason to drop the lock at + * each level but bpf_spin_lock() doesn't want any function + * calls while locked. + */ + bpf_spin_lock(&cgv_tree_lock); + + if (runnable) { + if (!cgc->nr_active++) { + updated = true; + if (pcgc) { + propagate = true; + pcgc->child_weight_sum += cgc->weight; + } + } + } else { + if (!--cgc->nr_active) { + updated = true; + if (pcgc) { + propagate = true; + pcgc->child_weight_sum -= cgc->weight; + } + } + } + + bpf_spin_unlock(&cgv_tree_lock); + + if (!propagate) + break; + } + + if (updated) + __sync_fetch_and_add(&hweight_gen, 1); + + if (runnable) + cgrp_refresh_hweight(cgrp, cgc); +} + +void BPF_STRUCT_OPS(fcg_runnable, struct task_struct *p, u64 enq_flags) +{ + struct cgroup *cgrp; + + cgrp = scx_bpf_task_cgroup(p); + update_active_weight_sums(cgrp, true); + bpf_cgroup_release(cgrp); +} + +void BPF_STRUCT_OPS(fcg_running, struct task_struct *p) +{ + struct cgroup *cgrp; + struct fcg_cgrp_ctx *cgc; + + if (fifo_sched) + return; + + cgrp = scx_bpf_task_cgroup(p); + cgc = find_cgrp_ctx(cgrp); + if (cgc) { + /* + * @cgc->tvtime_now always progresses forward as tasks start + * executing. The test and update can be performed concurrently + * from multiple CPUs and thus racy. Any error should be + * contained and temporary. Let's just live with it. + */ + if (vtime_before(cgc->tvtime_now, p->scx.dsq_vtime)) + cgc->tvtime_now = p->scx.dsq_vtime; + } + bpf_cgroup_release(cgrp); +} + +void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable) +{ + struct fcg_task_ctx *taskc; + struct cgroup *cgrp; + struct fcg_cgrp_ctx *cgc; + + /* + * Scale the execution time by the inverse of the weight and charge. + * + * Note that the default yield implementation yields by setting + * @p->scx.slice to zero and the following would treat the yielding task + * as if it has consumed all its slice. If this penalizes yielding tasks + * too much, determine the execution time by taking explicit timestamps + * instead of depending on @p->scx.slice. + */ + if (!fifo_sched) + p->scx.dsq_vtime += + (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; + + taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); + if (!taskc) { + scx_bpf_error("task_ctx lookup failed"); + return; + } + + if (!taskc->bypassed_at) + return; + + cgrp = scx_bpf_task_cgroup(p); + cgc = find_cgrp_ctx(cgrp); + if (cgc) { + __sync_fetch_and_add(&cgc->cvtime_delta, + p->se.sum_exec_runtime - taskc->bypassed_at); + taskc->bypassed_at = 0; + } + bpf_cgroup_release(cgrp); +} + +void BPF_STRUCT_OPS(fcg_quiescent, struct task_struct *p, u64 deq_flags) +{ + struct cgroup *cgrp; + + cgrp = scx_bpf_task_cgroup(p); + update_active_weight_sums(cgrp, false); + bpf_cgroup_release(cgrp); +} + +void BPF_STRUCT_OPS(fcg_cgroup_set_weight, struct cgroup *cgrp, u32 weight) +{ + struct fcg_cgrp_ctx *cgc, *pcgc = NULL; + + cgc = find_cgrp_ctx(cgrp); + if (!cgc) + return; + + if (cgrp->level) { + pcgc = find_ancestor_cgrp_ctx(cgrp, cgrp->level - 1); + if (!pcgc) + return; + } + + bpf_spin_lock(&cgv_tree_lock); + if (pcgc && cgc->nr_active) + pcgc->child_weight_sum += (s64)weight - cgc->weight; + cgc->weight = weight; + bpf_spin_unlock(&cgv_tree_lock); +} + +static bool try_pick_next_cgroup(u64 *cgidp) +{ + struct bpf_rb_node *rb_node; + struct cgv_node_stash *stash; + struct cgv_node *cgv_node; + struct fcg_cgrp_ctx *cgc; + struct cgroup *cgrp; + u64 cgid; + + /* pop the front cgroup and wind cvtime_now accordingly */ + bpf_spin_lock(&cgv_tree_lock); + + rb_node = bpf_rbtree_first(&cgv_tree); + if (!rb_node) { + bpf_spin_unlock(&cgv_tree_lock); + stat_inc(FCG_STAT_PNC_NO_CGRP); + *cgidp = 0; + return true; + } + + rb_node = bpf_rbtree_remove(&cgv_tree, rb_node); + bpf_spin_unlock(&cgv_tree_lock); + + if (!rb_node) { + /* + * This should never happen. bpf_rbtree_first() was called + * above while the tree lock was held, so the node should + * always be present. + */ + scx_bpf_error("node could not be removed"); + return true; + } + + cgv_node = container_of(rb_node, struct cgv_node, rb_node); + cgid = cgv_node->cgid; + + if (vtime_before(cvtime_now, cgv_node->cvtime)) + cvtime_now = cgv_node->cvtime; + + /* + * If lookup fails, the cgroup's gone. Free and move on. See + * fcg_cgroup_exit(). + */ + cgrp = bpf_cgroup_from_id(cgid); + if (!cgrp) { + stat_inc(FCG_STAT_PNC_GONE); + goto out_free; + } + + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0); + if (!cgc) { + bpf_cgroup_release(cgrp); + stat_inc(FCG_STAT_PNC_GONE); + goto out_free; + } + + if (!scx_bpf_consume(cgid)) { + bpf_cgroup_release(cgrp); + stat_inc(FCG_STAT_PNC_EMPTY); + goto out_stash; + } + + /* + * Successfully consumed from the cgroup. This will be our current + * cgroup for the new slice. Refresh its hweight. + */ + cgrp_refresh_hweight(cgrp, cgc); + + bpf_cgroup_release(cgrp); + + /* + * As the cgroup may have more tasks, add it back to the rbtree. Note + * that here we charge the full slice upfront and then exact later + * according to the actual consumption. This prevents lowpri thundering + * herd from saturating the machine. + */ + bpf_spin_lock(&cgv_tree_lock); + cgv_node->cvtime += cgrp_slice_ns * FCG_HWEIGHT_ONE / (cgc->hweight ?: 1); + cgrp_cap_budget(cgv_node, cgc); + bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less); + bpf_spin_unlock(&cgv_tree_lock); + + *cgidp = cgid; + stat_inc(FCG_STAT_PNC_NEXT); + return true; + +out_stash: + stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid); + if (!stash) { + stat_inc(FCG_STAT_PNC_GONE); + goto out_free; + } + + /* + * Paired with cmpxchg in cgrp_enqueued(). If they see the following + * transition, they'll enqueue the cgroup. If they are earlier, we'll + * see their task in the dq below and requeue the cgroup. + */ + __sync_val_compare_and_swap(&cgc->queued, 1, 0); + + if (scx_bpf_dsq_nr_queued(cgid)) { + bpf_spin_lock(&cgv_tree_lock); + bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less); + bpf_spin_unlock(&cgv_tree_lock); + stat_inc(FCG_STAT_PNC_RACE); + } else { + cgv_node = bpf_kptr_xchg(&stash->node, cgv_node); + if (cgv_node) { + scx_bpf_error("unexpected !NULL cgv_node stash"); + goto out_free; + } + } + + return false; + +out_free: + bpf_obj_drop(cgv_node); + return false; +} + +void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev) +{ + struct fcg_cpu_ctx *cpuc; + struct fcg_cgrp_ctx *cgc; + struct cgroup *cgrp; + u64 now = bpf_ktime_get_ns(); + bool picked_next = false; + + cpuc = find_cpu_ctx(); + if (!cpuc) + return; + + if (!cpuc->cur_cgid) + goto pick_next_cgroup; + + if (vtime_before(now, cpuc->cur_at + cgrp_slice_ns)) { + if (scx_bpf_consume(cpuc->cur_cgid)) { + stat_inc(FCG_STAT_CNS_KEEP); + return; + } + stat_inc(FCG_STAT_CNS_EMPTY); + } else { + stat_inc(FCG_STAT_CNS_EXPIRE); + } + + /* + * The current cgroup is expiring. It was already charged a full slice. + * Calculate the actual usage and accumulate the delta. + */ + cgrp = bpf_cgroup_from_id(cpuc->cur_cgid); + if (!cgrp) { + stat_inc(FCG_STAT_CNS_GONE); + goto pick_next_cgroup; + } + + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0); + if (cgc) { + /* + * We want to update the vtime delta and then look for the next + * cgroup to execute but the latter needs to be done in a loop + * and we can't keep the lock held. Oh well... + */ + bpf_spin_lock(&cgv_tree_lock); + __sync_fetch_and_add(&cgc->cvtime_delta, + (cpuc->cur_at + cgrp_slice_ns - now) * + FCG_HWEIGHT_ONE / (cgc->hweight ?: 1)); + bpf_spin_unlock(&cgv_tree_lock); + } else { + stat_inc(FCG_STAT_CNS_GONE); + } + + bpf_cgroup_release(cgrp); + +pick_next_cgroup: + cpuc->cur_at = now; + + if (scx_bpf_consume(SCX_DSQ_GLOBAL)) { + cpuc->cur_cgid = 0; + return; + } + + bpf_repeat(CGROUP_MAX_RETRIES) { + if (try_pick_next_cgroup(&cpuc->cur_cgid)) { + picked_next = true; + break; + } + } + + /* + * This only happens if try_pick_next_cgroup() races against enqueue + * path for more than CGROUP_MAX_RETRIES times, which is extremely + * unlikely and likely indicates an underlying bug. There shouldn't be + * any stall risk as the race is against enqueue. + */ + if (!picked_next) + stat_inc(FCG_STAT_PNC_FAIL); +} + +s32 BPF_STRUCT_OPS(fcg_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + struct fcg_task_ctx *taskc; + struct fcg_cgrp_ctx *cgc; + + /* + * @p is new. Let's ensure that its task_ctx is available. We can sleep + * in this function and the following will automatically use GFP_KERNEL. + */ + taskc = bpf_task_storage_get(&task_ctx, p, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!taskc) + return -ENOMEM; + + taskc->bypassed_at = 0; + + if (!(cgc = find_cgrp_ctx(args->cgroup))) + return -ENOENT; + + p->scx.dsq_vtime = cgc->tvtime_now; + + return 0; +} + +int BPF_STRUCT_OPS_SLEEPABLE(fcg_cgroup_init, struct cgroup *cgrp, + struct scx_cgroup_init_args *args) +{ + struct fcg_cgrp_ctx *cgc; + struct cgv_node *cgv_node; + struct cgv_node_stash empty_stash = {}, *stash; + u64 cgid = cgrp->kn->id; + int ret; + + /* + * Technically incorrect as cgroup ID is full 64bit while dq ID is + * 63bit. Should not be a problem in practice and easy to spot in the + * unlikely case that it breaks. + */ + ret = scx_bpf_create_dsq(cgid, -1); + if (ret) + return ret; + + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!cgc) { + ret = -ENOMEM; + goto err_destroy_dsq; + } + + cgc->weight = args->weight; + cgc->hweight = FCG_HWEIGHT_ONE; + + ret = bpf_map_update_elem(&cgv_node_stash, &cgid, &empty_stash, + BPF_NOEXIST); + if (ret) { + if (ret != -ENOMEM) + scx_bpf_error("unexpected stash creation error (%d)", + ret); + goto err_destroy_dsq; + } + + stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid); + if (!stash) { + scx_bpf_error("unexpected cgv_node stash lookup failure"); + ret = -ENOENT; + goto err_destroy_dsq; + } + + cgv_node = bpf_obj_new(struct cgv_node); + if (!cgv_node) { + ret = -ENOMEM; + goto err_del_cgv_node; + } + + cgv_node->cgid = cgid; + cgv_node->cvtime = cvtime_now; + + cgv_node = bpf_kptr_xchg(&stash->node, cgv_node); + if (cgv_node) { + scx_bpf_error("unexpected !NULL cgv_node stash"); + ret = -EBUSY; + goto err_drop; + } + + return 0; + +err_drop: + bpf_obj_drop(cgv_node); +err_del_cgv_node: + bpf_map_delete_elem(&cgv_node_stash, &cgid); +err_destroy_dsq: + scx_bpf_destroy_dsq(cgid); + return ret; +} + +void BPF_STRUCT_OPS(fcg_cgroup_exit, struct cgroup *cgrp) +{ + u64 cgid = cgrp->kn->id; + + /* + * For now, there's no way find and remove the cgv_node if it's on the + * cgv_tree. Let's drain them in the dispatch path as they get popped + * off the front of the tree. + */ + bpf_map_delete_elem(&cgv_node_stash, &cgid); + scx_bpf_destroy_dsq(cgid); +} + +void BPF_STRUCT_OPS(fcg_cgroup_move, struct task_struct *p, + struct cgroup *from, struct cgroup *to) +{ + struct fcg_cgrp_ctx *from_cgc, *to_cgc; + s64 vtime_delta; + + /* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */ + if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to))) + return; + + vtime_delta = p->scx.dsq_vtime - from_cgc->tvtime_now; + p->scx.dsq_vtime = to_cgc->tvtime_now + vtime_delta; +} + +void BPF_STRUCT_OPS(fcg_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SCX_OPS_DEFINE(flatcg_ops, + .select_cpu = (void *)fcg_select_cpu, + .enqueue = (void *)fcg_enqueue, + .dispatch = (void *)fcg_dispatch, + .runnable = (void *)fcg_runnable, + .running = (void *)fcg_running, + .stopping = (void *)fcg_stopping, + .quiescent = (void *)fcg_quiescent, + .init_task = (void *)fcg_init_task, + .cgroup_set_weight = (void *)fcg_cgroup_set_weight, + .cgroup_init = (void *)fcg_cgroup_init, + .cgroup_exit = (void *)fcg_cgroup_exit, + .cgroup_move = (void *)fcg_cgroup_move, + .exit = (void *)fcg_exit, + .flags = SCX_OPS_HAS_CGROUP_WEIGHT | SCX_OPS_ENQ_EXITING, + .name = "flatcg"); diff --git a/tools/sched_ext/scx_flatcg.c b/tools/sched_ext/scx_flatcg.c new file mode 100644 index 000000000000..5d24ca9c29d9 --- /dev/null +++ b/tools/sched_ext/scx_flatcg.c @@ -0,0 +1,233 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> + * Copyright (c) 2023 David Vernet <dvernet@meta.com> + */ +#include <stdio.h> +#include <signal.h> +#include <unistd.h> +#include <libgen.h> +#include <limits.h> +#include <inttypes.h> +#include <fcntl.h> +#include <time.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include "scx_flatcg.h" +#include "scx_flatcg.bpf.skel.h" + +#ifndef FILEID_KERNFS +#define FILEID_KERNFS 0xfe +#endif + +const char help_fmt[] = +"A flattened cgroup hierarchy sched_ext scheduler.\n" +"\n" +"See the top-level comment in .bpf.c for more details.\n" +"\n" +"Usage: %s [-s SLICE_US] [-i INTERVAL] [-f] [-v]\n" +"\n" +" -s SLICE_US Override slice duration\n" +" -i INTERVAL Report interval\n" +" -f Use FIFO scheduling instead of weighted vtime scheduling\n" +" -v Print libbpf debug messages\n" +" -h Display this help and exit\n"; + +static bool verbose; +static volatile int exit_req; + +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) +{ + if (level == LIBBPF_DEBUG && !verbose) + return 0; + return vfprintf(stderr, format, args); +} + +static void sigint_handler(int dummy) +{ + exit_req = 1; +} + +static float read_cpu_util(__u64 *last_sum, __u64 *last_idle) +{ + FILE *fp; + char buf[4096]; + char *line, *cur = NULL, *tok; + __u64 sum = 0, idle = 0; + __u64 delta_sum, delta_idle; + int idx; + + fp = fopen("/proc/stat", "r"); + if (!fp) { + perror("fopen(\"/proc/stat\")"); + return 0.0; + } + + if (!fgets(buf, sizeof(buf), fp)) { + perror("fgets(\"/proc/stat\")"); + fclose(fp); + return 0.0; + } + fclose(fp); + + line = buf; + for (idx = 0; (tok = strtok_r(line, " \n", &cur)); idx++) { + char *endp = NULL; + __u64 v; + + if (idx == 0) { + line = NULL; + continue; + } + v = strtoull(tok, &endp, 0); + if (!endp || *endp != '\0') { + fprintf(stderr, "failed to parse %dth field of /proc/stat (\"%s\")\n", + idx, tok); + continue; + } + sum += v; + if (idx == 4) + idle = v; + } + + delta_sum = sum - *last_sum; + delta_idle = idle - *last_idle; + *last_sum = sum; + *last_idle = idle; + + return delta_sum ? (float)(delta_sum - delta_idle) / delta_sum : 0.0; +} + +static void fcg_read_stats(struct scx_flatcg *skel, __u64 *stats) +{ + __u64 cnts[FCG_NR_STATS][skel->rodata->nr_cpus]; + __u32 idx; + + memset(stats, 0, sizeof(stats[0]) * FCG_NR_STATS); + + for (idx = 0; idx < FCG_NR_STATS; idx++) { + int ret, cpu; + + ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats), + &idx, cnts[idx]); + if (ret < 0) + continue; + for (cpu = 0; cpu < skel->rodata->nr_cpus; cpu++) + stats[idx] += cnts[idx][cpu]; + } +} + +int main(int argc, char **argv) +{ + struct scx_flatcg *skel; + struct bpf_link *link; + struct timespec intv_ts = { .tv_sec = 2, .tv_nsec = 0 }; + bool dump_cgrps = false; + __u64 last_cpu_sum = 0, last_cpu_idle = 0; + __u64 last_stats[FCG_NR_STATS] = {}; + unsigned long seq = 0; + __s32 opt; + __u64 ecode; + + libbpf_set_print(libbpf_print_fn); + signal(SIGINT, sigint_handler); + signal(SIGTERM, sigint_handler); +restart: + skel = SCX_OPS_OPEN(flatcg_ops, scx_flatcg); + + skel->rodata->nr_cpus = libbpf_num_possible_cpus(); + + while ((opt = getopt(argc, argv, "s:i:dfvh")) != -1) { + double v; + + switch (opt) { + case 's': + v = strtod(optarg, NULL); + skel->rodata->cgrp_slice_ns = v * 1000; + break; + case 'i': + v = strtod(optarg, NULL); + intv_ts.tv_sec = v; + intv_ts.tv_nsec = (v - (float)intv_ts.tv_sec) * 1000000000; + break; + case 'd': + dump_cgrps = true; + break; + case 'f': + skel->rodata->fifo_sched = true; + break; + case 'v': + verbose = true; + break; + case 'h': + default: + fprintf(stderr, help_fmt, basename(argv[0])); + return opt != 'h'; + } + } + + printf("slice=%.1lfms intv=%.1lfs dump_cgrps=%d", + (double)skel->rodata->cgrp_slice_ns / 1000000.0, + (double)intv_ts.tv_sec + (double)intv_ts.tv_nsec / 1000000000.0, + dump_cgrps); + + SCX_OPS_LOAD(skel, flatcg_ops, scx_flatcg, uei); + link = SCX_OPS_ATTACH(skel, flatcg_ops, scx_flatcg); + + while (!exit_req && !UEI_EXITED(skel, uei)) { + __u64 acc_stats[FCG_NR_STATS]; + __u64 stats[FCG_NR_STATS]; + float cpu_util; + int i; + + cpu_util = read_cpu_util(&last_cpu_sum, &last_cpu_idle); + + fcg_read_stats(skel, acc_stats); + for (i = 0; i < FCG_NR_STATS; i++) + stats[i] = acc_stats[i] - last_stats[i]; + + memcpy(last_stats, acc_stats, sizeof(acc_stats)); + + printf("\n[SEQ %6lu cpu=%5.1lf hweight_gen=%" PRIu64 "]\n", + seq++, cpu_util * 100.0, skel->data->hweight_gen); + printf(" act:%6llu deact:%6llu global:%6llu local:%6llu\n", + stats[FCG_STAT_ACT], + stats[FCG_STAT_DEACT], + stats[FCG_STAT_GLOBAL], + stats[FCG_STAT_LOCAL]); + printf("HWT cache:%6llu update:%6llu skip:%6llu race:%6llu\n", + stats[FCG_STAT_HWT_CACHE], + stats[FCG_STAT_HWT_UPDATES], + stats[FCG_STAT_HWT_SKIP], + stats[FCG_STAT_HWT_RACE]); + printf("ENQ skip:%6llu race:%6llu\n", + stats[FCG_STAT_ENQ_SKIP], + stats[FCG_STAT_ENQ_RACE]); + printf("CNS keep:%6llu expire:%6llu empty:%6llu gone:%6llu\n", + stats[FCG_STAT_CNS_KEEP], + stats[FCG_STAT_CNS_EXPIRE], + stats[FCG_STAT_CNS_EMPTY], + stats[FCG_STAT_CNS_GONE]); + printf("PNC next:%6llu empty:%6llu nocgrp:%6llu gone:%6llu race:%6llu fail:%6llu\n", + stats[FCG_STAT_PNC_NEXT], + stats[FCG_STAT_PNC_EMPTY], + stats[FCG_STAT_PNC_NO_CGRP], + stats[FCG_STAT_PNC_GONE], + stats[FCG_STAT_PNC_RACE], + stats[FCG_STAT_PNC_FAIL]); + printf("BAD remove:%6llu\n", + acc_stats[FCG_STAT_BAD_REMOVAL]); + fflush(stdout); + + nanosleep(&intv_ts, NULL); + } + + bpf_link__destroy(link); + ecode = UEI_REPORT(skel, uei); + scx_flatcg__destroy(skel); + + if (UEI_ECODE_RESTART(ecode)) + goto restart; + return 0; +} diff --git a/tools/sched_ext/scx_flatcg.h b/tools/sched_ext/scx_flatcg.h new file mode 100644 index 000000000000..6f2ea50acb1c --- /dev/null +++ b/tools/sched_ext/scx_flatcg.h @@ -0,0 +1,51 @@ +#ifndef __SCX_EXAMPLE_FLATCG_H +#define __SCX_EXAMPLE_FLATCG_H + +enum { + FCG_HWEIGHT_ONE = 1LLU << 16, +}; + +enum fcg_stat_idx { + FCG_STAT_ACT, + FCG_STAT_DEACT, + FCG_STAT_LOCAL, + FCG_STAT_GLOBAL, + + FCG_STAT_HWT_UPDATES, + FCG_STAT_HWT_CACHE, + FCG_STAT_HWT_SKIP, + FCG_STAT_HWT_RACE, + + FCG_STAT_ENQ_SKIP, + FCG_STAT_ENQ_RACE, + + FCG_STAT_CNS_KEEP, + FCG_STAT_CNS_EXPIRE, + FCG_STAT_CNS_EMPTY, + FCG_STAT_CNS_GONE, + + FCG_STAT_PNC_NO_CGRP, + FCG_STAT_PNC_NEXT, + FCG_STAT_PNC_EMPTY, + FCG_STAT_PNC_GONE, + FCG_STAT_PNC_RACE, + FCG_STAT_PNC_FAIL, + + FCG_STAT_BAD_REMOVAL, + + FCG_NR_STATS, +}; + +struct fcg_cgrp_ctx { + u32 nr_active; + u32 nr_runnable; + u32 queued; + u32 weight; + u32 hweight; + u64 child_weight_sum; + u64 hweight_gen; + s64 cvtime_delta; + u64 tvtime_now; +}; + +#endif /* __SCX_EXAMPLE_FLATCG_H */ -- 2.34.1
From: Yu Liao <liaoyu15@huawei.com> mainline inclusion from mainline-v6.12-rc1 commit 7ebd84d627e40cb9fb12b338588e81b6cca371e3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- When build with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED, the idle member is not defined: kernel/sched/ext.c:3701:16: error: 'struct task_group' has no member named 'idle' 3701 | if (!tg->idle) | ^~ Fix this by putting 'idle' under new CONFIG_GROUP_SCHED_WEIGHT. tj: Move idle field upward to avoid breaking up CONFIG_FAIR_GROUP_SCHED block. Fixes: e179e80c5d4f ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/ Signed-off-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 32ea9c249d4f..345b9c233334 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -450,16 +450,17 @@ struct soft_domain_ctx { struct task_group { struct cgroup_subsys_state css; +#ifdef CONFIG_GROUP_SCHED_WEIGHT + /* A positive value indicates that this is a SCHED_IDLE group. */ + int idle; +#endif + #ifdef CONFIG_FAIR_GROUP_SCHED /* schedulable entities of this group on each CPU */ struct sched_entity **se; /* runqueue "owned" by this group on each CPU */ struct cfs_rq **cfs_rq; unsigned long shares; - - /* A positive value indicates that this is a SCHED_IDLE group. */ - int idle; - #ifdef CONFIG_SMP /* * load_avg can be heavily contended at clock tick time, so put -- 2.34.1
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY -------------------------------- sched: Fix kabi for "int idle" moved from CONFIG_FAIR_GROUP_SCHED to CONFIG_GROUP_SCHED_WEIGHT in struct task_group. Fixes: d149b6356644 ("sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT") Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 345b9c233334..5a248dcb0473 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -450,17 +450,13 @@ struct soft_domain_ctx { struct task_group { struct cgroup_subsys_state css; -#ifdef CONFIG_GROUP_SCHED_WEIGHT - /* A positive value indicates that this is a SCHED_IDLE group. */ - int idle; -#endif - #ifdef CONFIG_FAIR_GROUP_SCHED /* schedulable entities of this group on each CPU */ struct sched_entity **se; /* runqueue "owned" by this group on each CPU */ struct cfs_rq **cfs_rq; unsigned long shares; + KABI_BROKEN_REMOVE(int idle) #ifdef CONFIG_SMP /* * load_avg can be heavily contended at clock tick time, so put @@ -528,7 +524,14 @@ struct task_group { #else KABI_RESERVE(3) #endif + +#ifdef CONFIG_GROUP_SCHED_WEIGHT + /* A positive value indicates that this is a SCHED_IDLE group. */ + KABI_USE(4, int idle) +#else KABI_RESERVE(4) +#endif + KABI_RESERVE(5) KABI_RESERVE(6) KABI_RESERVE(7) -- 2.34.1
From: Yu Liao <liaoyu15@huawei.com> mainline inclusion from mainline-v6.12-rc1 commit bdeb868c0ddf04c4777bf651834495baaf4f991b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Fix the following error when build with CONFIG_GROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED: kernel/sched/core.c:9634:15: error: implicit declaration of function 'sched_group_set_idle'; did you mean 'scx_group_set_idle'? [-Wimplicit-function-declaration] 9634 | ret = sched_group_set_idle(css_tg(css), idle); | ^~~~~~~~~~~~~~~~~~~~ | scx_group_set_idle Fixes: e179e80c5d4f ("sched: Introduce CONFIG_GROUP_SCHED_WEIGHT") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409220859.UiCAoFOW-lkp@intel.com/ Signed-off-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/sched.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 5a248dcb0473..aa1740216760 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -646,6 +646,7 @@ static inline void set_task_rq_fair(struct sched_entity *se, #endif /* CONFIG_SMP */ #else /* !CONFIG_FAIR_GROUP_SCHED */ static inline int sched_group_set_shares(struct task_group *tg, unsigned long shares) { return 0; } +static inline int sched_group_set_idle(struct task_group *tg, long idle) { return 0; } #endif /* CONFIG_FAIR_GROUP_SCHED */ #else /* CONFIG_CGROUP_SCHED */ -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit da330f5e4c197980ce4d343b1e557c9f33282770 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- pick_task_scx() must be preceded by balance_scx() but there currently is a bug where fair could say yes on balance() but no on pick_task(), which then ends up calling pick_task_scx() without preceding balance_scx(). Work around by dropping WARN_ON_ONCE() and ignoring cases which don't make sense. This isn't great and can theoretically lead to stalls. However, for switch_all cases, this happens only while a BPF scheduler is being loaded or unloaded, and, for partial cases, fair will likely keep triggering this CPU. This will be reverted once the fair behavior is fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 8ae97643caec..6484cf5beaa9 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2918,9 +2918,24 @@ static struct task_struct *pick_task_scx(struct rq *rq) * If balance_scx() is telling us to keep running @prev, replenish slice * if necessary and keep running @prev. Otherwise, pop the first one * from the local DSQ. + * + * WORKAROUND: + * + * %SCX_RQ_BAL_KEEP should be set iff $prev is on SCX as it must just + * have gone through balance_scx(). Unfortunately, there currently is a + * bug where fair could say yes on balance() but no on pick_task(), + * which then ends up calling pick_task_scx() without preceding + * balance_scx(). + * + * For now, ignore cases where $prev is not on SCX. This isn't great and + * can theoretically lead to stalls. However, for switch_all cases, this + * happens only while a BPF scheduler is being loaded or unloaded, and, + * for partial cases, fair will likely keep triggering this CPU. + * + * Once fair is fixed, restore WARN_ON_ONCE(). */ if ((rq->scx.flags & SCX_RQ_BAL_KEEP) && - !WARN_ON_ONCE(prev->sched_class != &ext_sched_class)) { + prev->sched_class == &ext_sched_class) { p = prev; if (!p->scx.slice) p->scx.slice = SCX_SLICE_DFL; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 02e65e1c1282b8e38638de238ac7410846898348 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_has_op[] is only used inside ext.c but doesn't have static. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409062337.m7qqI88I-lkp@intel.com/ Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 6484cf5beaa9..1a06636cac05 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -869,7 +869,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting); static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt); static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); -struct static_key_false scx_has_op[SCX_OPI_END] = +static struct static_key_false scx_has_op[SCX_OPI_END] = { [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT }; static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 3ac352797cf02b53ea4a6ca0b958a288eedda514 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- scx_dump_data is only used inside ext.c but doesn't have static. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/ Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 1a06636cac05..6d868c3b85e2 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -972,7 +972,7 @@ struct scx_dump_data { struct scx_bstr_buf buf; }; -struct scx_dump_data scx_dump_data = { +static struct scx_dump_data scx_dump_data = { .cpu = -1, }; -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit fdaedba2f96f6755f505c454ca7408930f4fe1bf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Sleepables don't need to be in its own kfunc set as each is tagged with KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked indicating that rq lock is not held and relocate right above the any set. This will be used to add kfuncs that are allowed to be called from SYSCALL but not TRACING. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 66 +++++++++++++++++++++++----------------------- 1 file changed, 33 insertions(+), 33 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 6d868c3b85e2..4ddd9dd3a2c6 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5810,35 +5810,6 @@ void __init init_sched_ext_class(void) __bpf_kfunc_start_defs(); -/** - * scx_bpf_create_dsq - Create a custom DSQ - * @dsq_id: DSQ to create - * @node: NUMA node to allocate from - * - * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable - * scx callback, and any BPF_PROG_TYPE_SYSCALL prog. - */ -__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) -{ - if (unlikely(node >= (int)nr_node_ids || - (node < 0 && node != NUMA_NO_NODE))) - return -EINVAL; - return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node)); -} - -__bpf_kfunc_end_defs(); - -BTF_KFUNCS_START(scx_kfunc_ids_sleepable) -BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) -BTF_KFUNCS_END(scx_kfunc_ids_sleepable) - -static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = { - .owner = THIS_MODULE, - .set = &scx_kfunc_ids_sleepable, -}; - -__bpf_kfunc_start_defs(); - /** * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu() * @p: task_struct to select a CPU for @@ -6181,6 +6152,35 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { __bpf_kfunc_start_defs(); +/** + * scx_bpf_create_dsq - Create a custom DSQ + * @dsq_id: DSQ to create + * @node: NUMA node to allocate from + * + * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable + * scx callback, and any BPF_PROG_TYPE_SYSCALL prog. + */ +__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) +{ + if (unlikely(node >= (int)nr_node_ids || + (node < 0 && node != NUMA_NO_NODE))) + return -EINVAL; + return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node)); +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(scx_kfunc_ids_unlocked) +BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) +BTF_KFUNCS_END(scx_kfunc_ids_unlocked) + +static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { + .owner = THIS_MODULE, + .set = &scx_kfunc_ids_unlocked, +}; + +__bpf_kfunc_start_defs(); + /** * scx_bpf_kick_cpu - Trigger reschedule on a CPU * @cpu: cpu to kick @@ -6915,10 +6915,6 @@ static int __init scx_init(void) * check using scx_kf_allowed(). */ if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, - &scx_kfunc_set_sleepable)) || - (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, - &scx_kfunc_set_sleepable)) || - (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_enqueue_dispatch)) || @@ -6926,6 +6922,10 @@ static int __init scx_init(void) &scx_kfunc_set_dispatch)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cpu_release)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &scx_kfunc_set_unlocked)) || + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, + &scx_kfunc_set_unlocked)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_any)) || (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 4d3ca89bdd31936dcb31cb655801c9e95fe413f3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- The tricky p->scx.holding_cpu handling was split across consume_remote_task() body and move_task_to_local_dsq(). Refactor such that: - All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with consolidated documentation. - move_task_to_local_dsq() now implements straightforward task migration making it easier to use in other places. - dispatch_to_local_dsq() is another user move_task_to_local_dsq(). The usage is updated accordingly. This makes the local and remote cases more symmetric. No functional changes intended. v2: s/task_rq/src_rq/ for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 145 ++++++++++++++++++++++++--------------------- 1 file changed, 76 insertions(+), 69 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 4ddd9dd3a2c6..7ecc29a001d9 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2187,49 +2187,13 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) * @src_rq: rq to move the task from, locked on entry, released on return * @dst_rq: rq to move the task into, locked on return * - * Move @p which is currently on @src_rq to @dst_rq's local DSQ. The caller - * must: - * - * 1. Start with exclusive access to @p either through its DSQ lock or - * %SCX_OPSS_DISPATCHING flag. - * - * 2. Set @p->scx.holding_cpu to raw_smp_processor_id(). - * - * 3. Remember task_rq(@p) as @src_rq. Release the exclusive access so that we - * don't deadlock with dequeue. - * - * 4. Lock @src_rq from #3. - * - * 5. Call this function. - * - * Returns %true if @p was successfully moved. %false after racing dequeue and - * losing. On return, @src_rq is unlocked and @dst_rq is locked. + * Move @p which is currently on @src_rq to @dst_rq's local DSQ. */ -static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags, +static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags, struct rq *src_rq, struct rq *dst_rq) { lockdep_assert_rq_held(src_rq); - /* - * If dequeue got to @p while we were trying to lock @src_rq, it'd have - * cleared @p->scx.holding_cpu to -1. While other cpus may have updated - * it to different values afterwards, as this operation can't be - * preempted or recurse, @p->scx.holding_cpu can never become - * raw_smp_processor_id() again before we're done. Thus, we can tell - * whether we lost to dequeue by testing whether @p->scx.holding_cpu is - * still raw_smp_processor_id(). - * - * @p->rq couldn't have changed if we're still the holding cpu. - * - * See dispatch_dequeue() for the counterpart. - */ - if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()) || - WARN_ON_ONCE(src_rq != task_rq(p))) { - raw_spin_rq_unlock(src_rq); - raw_spin_rq_lock(dst_rq); - return false; - } - /* the following marks @p MIGRATING which excludes dequeue */ deactivate_task(src_rq, p, 0); set_task_cpu(p, cpu_of(dst_rq)); @@ -2248,8 +2212,6 @@ static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags, dst_rq->scx.extra_enq_flags = enq_flags; activate_task(dst_rq, p, 0); dst_rq->scx.extra_enq_flags = 0; - - return true; } #endif /* CONFIG_SMP */ @@ -2314,28 +2276,69 @@ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, return true; } -static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, - struct task_struct *p, struct rq *task_rq) +/** + * unlink_dsq_and_lock_src_rq() - Unlink task from its DSQ and lock its task_rq + * @p: target task + * @dsq: locked DSQ @p is currently on + * @src_rq: rq @p is currently on, stable with @dsq locked + * + * Called with @dsq locked but no rq's locked. We want to move @p to a different + * DSQ, including any local DSQ, but are not locking @src_rq. Locking @src_rq is + * required when transferring into a local DSQ. Even when transferring into a + * non-local DSQ, it's better to use the same mechanism to protect against + * dequeues and maintain the invariant that @p->scx.dsq can only change while + * @src_rq is locked, which e.g. scx_dump_task() depends on. + * + * We want to grab @src_rq but that can deadlock if we try while locking @dsq, + * so we want to unlink @p from @dsq, drop its lock and then lock @src_rq. As + * this may race with dequeue, which can't drop the rq lock or fail, do a little + * dancing from our side. + * + * @p->scx.holding_cpu is set to this CPU before @dsq is unlocked. If @p gets + * dequeued after we unlock @dsq but before locking @src_rq, the holding_cpu + * would be cleared to -1. While other cpus may have updated it to different + * values afterwards, as this operation can't be preempted or recurse, the + * holding_cpu can never become this CPU again before we're done. Thus, we can + * tell whether we lost to dequeue by testing whether the holding_cpu still + * points to this CPU. See dispatch_dequeue() for the counterpart. + * + * On return, @dsq is unlocked and @src_rq is locked. Returns %true if @p is + * still valid. %false if lost to dequeue. + */ +static bool unlink_dsq_and_lock_src_rq(struct task_struct *p, + struct scx_dispatch_q *dsq, + struct rq *src_rq) { - lockdep_assert_held(&dsq->lock); /* released on return */ + s32 cpu = raw_smp_processor_id(); + + lockdep_assert_held(&dsq->lock); - /* - * @dsq is locked and @p is on a remote rq. @p is currently protected by - * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab - * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the - * rq lock or fail, do a little dancing from our side. See - * move_task_to_local_dsq(). - */ WARN_ON_ONCE(p->scx.holding_cpu >= 0); task_unlink_from_dsq(p, dsq); dsq_mod_nr(dsq, -1); - p->scx.holding_cpu = raw_smp_processor_id(); + p->scx.holding_cpu = cpu; + raw_spin_unlock(&dsq->lock); + raw_spin_rq_lock(src_rq); - raw_spin_rq_unlock(rq); - raw_spin_rq_lock(task_rq); + /* task_rq couldn't have changed if we're still the holding cpu */ + return likely(p->scx.holding_cpu == cpu) && + !WARN_ON_ONCE(src_rq != task_rq(p)); +} - return move_task_to_local_dsq(p, 0, task_rq, rq); +static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq, + struct task_struct *p, struct rq *src_rq) +{ + raw_spin_rq_unlock(this_rq); + + if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) { + move_task_to_local_dsq(p, 0, src_rq, this_rq); + return true; + } else { + raw_spin_rq_unlock(src_rq); + raw_spin_rq_lock(this_rq); + return false; + } } #else /* CONFIG_SMP */ static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; } @@ -2439,7 +2442,8 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, * As DISPATCHING guarantees that @p is wholly ours, we can * pretend that we're moving from a DSQ and use the same * mechanism - mark the task under transfer with holding_cpu, - * release DISPATCHING and then follow the same protocol. + * release DISPATCHING and then follow the same protocol. See + * unlink_dsq_and_lock_src_rq(). */ p->scx.holding_cpu = raw_smp_processor_id(); @@ -2452,28 +2456,31 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, raw_spin_rq_lock(src_rq); } - if (src_rq == dst_rq) { + /* task_rq couldn't have changed if we're still the holding cpu */ + dsp = p->scx.holding_cpu == raw_smp_processor_id() && + !WARN_ON_ONCE(src_rq != task_rq(p)); + + if (likely(dsp)) { /* - * As @p is staying on the same rq, there's no need to + * If @p is staying on the same rq, there's no need to * go through the full deactivate/activate cycle. * Optimize by abbreviating the operations in * move_task_to_local_dsq(). */ - dsp = p->scx.holding_cpu == raw_smp_processor_id(); - if (likely(dsp)) { + if (src_rq == dst_rq) { p->scx.holding_cpu = -1; - dispatch_enqueue(&dst_rq->scx.local_dsq, p, - enq_flags); + dispatch_enqueue(&dst_rq->scx.local_dsq, + p, enq_flags); + } else { + move_task_to_local_dsq(p, enq_flags, + src_rq, dst_rq); } - } else { - dsp = move_task_to_local_dsq(p, enq_flags, - src_rq, dst_rq); - } - /* if the destination CPU is idle, wake it up */ - if (dsp && sched_class_above(p->sched_class, - dst_rq->curr->sched_class)) - resched_curr(dst_rq); + /* if the destination CPU is idle, wake it up */ + if (sched_class_above(p->sched_class, + dst_rq->curr->sched_class)) + resched_curr(dst_rq); + } /* switch back to @rq lock */ if (rq != dst_rq) { -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e683949a4b8c954fcbaf6b3ba87bd925bf83e481 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON. Instead, each caller is hanlding SCX_DSQ_LOCAL_ON before calling it. Move SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate code in direct_dispatch() and dispatch_to_local_dsq(). No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 90 +++++++++++++++++++++------------------------- 1 file changed, 40 insertions(+), 50 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 7ecc29a001d9..e8b2d747cacd 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1813,6 +1813,15 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id, if (dsq_id == SCX_DSQ_LOCAL) return &rq->scx.local_dsq; + if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { + s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; + + if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict")) + return &scx_dsq_global; + + return &cpu_rq(cpu)->scx.local_dsq; + } + dsq = find_non_local_dsq(dsq_id); if (unlikely(!dsq)) { scx_ops_error("non-existent DSQ 0x%llx for %s[%d]", @@ -1856,8 +1865,8 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task, static void direct_dispatch(struct task_struct *p, u64 enq_flags) { struct rq *rq = task_rq(p); - struct scx_dispatch_q *dsq; - u64 dsq_id = p->scx.ddsp_dsq_id; + struct scx_dispatch_q *dsq = + find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p); touch_core_sched_dispatch(rq, p); @@ -1869,15 +1878,9 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags) * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer * the enqueue so that it's executed when @rq can be unlocked. */ - if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { - s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; + if (dsq->id == SCX_DSQ_LOCAL && dsq != &rq->scx.local_dsq) { unsigned long opss; - if (cpu == cpu_of(rq)) { - dsq_id = SCX_DSQ_LOCAL; - goto dispatch; - } - opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK; switch (opss & SCX_OPSS_STATE_MASK) { @@ -1904,8 +1907,6 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags) return; } -dispatch: - dsq = find_dsq_for_dispatch(rq, dsq_id, p); dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); } @@ -2381,51 +2382,38 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) enum dispatch_to_local_dsq_ret { DTL_DISPATCHED, /* successfully dispatched */ DTL_LOST, /* lost race to dequeue */ - DTL_NOT_LOCAL, /* destination is not a local DSQ */ DTL_INVALID, /* invalid local dsq_id */ }; /** * dispatch_to_local_dsq - Dispatch a task to a local dsq * @rq: current rq which is locked - * @dsq_id: destination dsq ID + * @dst_dsq: destination DSQ * @p: task to dispatch * @enq_flags: %SCX_ENQ_* * - * We're holding @rq lock and want to dispatch @p to the local DSQ identified by - * @dsq_id. This function performs all the synchronization dancing needed - * because local DSQs are protected with rq locks. + * We're holding @rq lock and want to dispatch @p to @dst_dsq which is a local + * DSQ. This function performs all the synchronization dancing needed because + * local DSQs are protected with rq locks. * * The caller must have exclusive ownership of @p (e.g. through * %SCX_OPSS_DISPATCHING). */ static enum dispatch_to_local_dsq_ret -dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p, - u64 enq_flags) +dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, + struct task_struct *p, u64 enq_flags) { struct rq *src_rq = task_rq(p); - struct rq *dst_rq; + struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); /* * We're synchronized against dequeue through DISPATCHING. As @p can't * be dequeued, its task_rq and cpus_allowed are stable too. + * + * If dispatching to @rq that @p is already on, no lock dancing needed. */ - if (dsq_id == SCX_DSQ_LOCAL) { - dst_rq = rq; - } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { - s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; - - if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict")) - return DTL_INVALID; - dst_rq = cpu_rq(cpu); - } else { - return DTL_NOT_LOCAL; - } - - /* if dispatching to @rq that @p is already on, no lock dancing needed */ if (rq == src_rq && rq == dst_rq) { - dispatch_enqueue(&dst_rq->scx.local_dsq, p, - enq_flags | SCX_ENQ_CLEAR_OPSS); + dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); return DTL_DISPATCHED; } @@ -2567,19 +2555,21 @@ static void finish_dispatch(struct rq *rq, struct task_struct *p, BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); - switch (dispatch_to_local_dsq(rq, dsq_id, p, enq_flags)) { - case DTL_DISPATCHED: - break; - case DTL_LOST: - break; - case DTL_INVALID: - dsq_id = SCX_DSQ_GLOBAL; - fallthrough; - case DTL_NOT_LOCAL: - dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()), - dsq_id, p); + dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p); + + if (dsq->id == SCX_DSQ_LOCAL) { + switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) { + case DTL_DISPATCHED: + break; + case DTL_LOST: + break; + case DTL_INVALID: + dispatch_enqueue(&scx_dsq_global, p, + enq_flags | SCX_ENQ_CLEAR_OPSS); + break; + } + } else { dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); - break; } } @@ -2756,13 +2746,13 @@ static void process_ddsp_deferred_locals(struct rq *rq) */ while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, struct task_struct, scx.dsq_list.node))) { - s32 ret; + struct scx_dispatch_q *dsq; list_del_init(&p->scx.dsq_list.node); - ret = dispatch_to_local_dsq(rq, p->scx.ddsp_dsq_id, p, - p->scx.ddsp_enq_flags); - WARN_ON_ONCE(ret == DTL_NOT_LOCAL); + dsq = find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p); + if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) + dispatch_to_local_dsq(rq, dsq, p, p->scx.ddsp_enq_flags); } } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 0aab26309ee9ceafcd5292b81690ccdac0796803 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- With the preceding update, the only return value which makes meaningful difference is DTL_INVALID, for which one caller, finish_dispatch(), falls back to the global DSQ and the other, process_ddsp_deferred_locals(), doesn't do anything. It should always fallback to the global DSQ. Move the global DSQ fallback into dispatch_to_local_dsq() and remove the return value. v2: Patch title and description updated to reflect the behavior fix for process_ddsp_deferred_locals(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 41 ++++++++++------------------------------- 1 file changed, 10 insertions(+), 31 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index e8b2d747cacd..ece108739d6a 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2379,12 +2379,6 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) return false; } -enum dispatch_to_local_dsq_ret { - DTL_DISPATCHED, /* successfully dispatched */ - DTL_LOST, /* lost race to dequeue */ - DTL_INVALID, /* invalid local dsq_id */ -}; - /** * dispatch_to_local_dsq - Dispatch a task to a local dsq * @rq: current rq which is locked @@ -2399,9 +2393,8 @@ enum dispatch_to_local_dsq_ret { * The caller must have exclusive ownership of @p (e.g. through * %SCX_OPSS_DISPATCHING). */ -static enum dispatch_to_local_dsq_ret -dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, - struct task_struct *p, u64 enq_flags) +static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, + struct task_struct *p, u64 enq_flags) { struct rq *src_rq = task_rq(p); struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); @@ -2414,13 +2407,11 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, */ if (rq == src_rq && rq == dst_rq) { dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); - return DTL_DISPATCHED; + return; } #ifdef CONFIG_SMP if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) { - bool dsp; - /* * @p is on a possibly remote @src_rq which we need to lock to * move the task. If dequeue is in progress, it'd be locking @@ -2445,10 +2436,8 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, } /* task_rq couldn't have changed if we're still the holding cpu */ - dsp = p->scx.holding_cpu == raw_smp_processor_id() && - !WARN_ON_ONCE(src_rq != task_rq(p)); - - if (likely(dsp)) { + if (likely(p->scx.holding_cpu == raw_smp_processor_id()) && + !WARN_ON_ONCE(src_rq != task_rq(p))) { /* * If @p is staying on the same rq, there's no need to * go through the full deactivate/activate cycle. @@ -2476,11 +2465,11 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, raw_spin_rq_lock(rq); } - return dsp ? DTL_DISPATCHED : DTL_LOST; + return; } #endif /* CONFIG_SMP */ - return DTL_INVALID; + dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS); } /** @@ -2557,20 +2546,10 @@ static void finish_dispatch(struct rq *rq, struct task_struct *p, dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p); - if (dsq->id == SCX_DSQ_LOCAL) { - switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) { - case DTL_DISPATCHED: - break; - case DTL_LOST: - break; - case DTL_INVALID: - dispatch_enqueue(&scx_dsq_global, p, - enq_flags | SCX_ENQ_CLEAR_OPSS); - break; - } - } else { + if (dsq->id == SCX_DSQ_LOCAL) + dispatch_to_local_dsq(rq, dsq, p, enq_flags); + else dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); - } } static void flush_dispatch_buf(struct rq *rq) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 18f856991d0596aa125cf63893ca08d7c74b85a2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Now that there's nothing left after the big if block, flip the if condition and unindent the body. No functional changes intended. v2: Add BUG() to clarify control can't reach the end of dispatch_to_local_dsq() in UP kernels per David. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 96 ++++++++++++++++++++++------------------------ 1 file changed, 46 insertions(+), 50 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index ece108739d6a..a89c62b06bd0 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2411,65 +2411,61 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, } #ifdef CONFIG_SMP - if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) { - /* - * @p is on a possibly remote @src_rq which we need to lock to - * move the task. If dequeue is in progress, it'd be locking - * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq - * lock while holding DISPATCHING. - * - * As DISPATCHING guarantees that @p is wholly ours, we can - * pretend that we're moving from a DSQ and use the same - * mechanism - mark the task under transfer with holding_cpu, - * release DISPATCHING and then follow the same protocol. See - * unlink_dsq_and_lock_src_rq(). - */ - p->scx.holding_cpu = raw_smp_processor_id(); + if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) { + dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS); + return; + } - /* store_release ensures that dequeue sees the above */ - atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); + /* + * @p is on a possibly remote @src_rq which we need to lock to move the + * task. If dequeue is in progress, it'd be locking @src_rq and waiting + * on DISPATCHING, so we can't grab @src_rq lock while holding + * DISPATCHING. + * + * As DISPATCHING guarantees that @p is wholly ours, we can pretend that + * we're moving from a DSQ and use the same mechanism - mark the task + * under transfer with holding_cpu, release DISPATCHING and then follow + * the same protocol. See unlink_dsq_and_lock_src_rq(). + */ + p->scx.holding_cpu = raw_smp_processor_id(); - /* switch to @src_rq lock */ - if (rq != src_rq) { - raw_spin_rq_unlock(rq); - raw_spin_rq_lock(src_rq); - } + /* store_release ensures that dequeue sees the above */ + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); - /* task_rq couldn't have changed if we're still the holding cpu */ - if (likely(p->scx.holding_cpu == raw_smp_processor_id()) && - !WARN_ON_ONCE(src_rq != task_rq(p))) { - /* - * If @p is staying on the same rq, there's no need to - * go through the full deactivate/activate cycle. - * Optimize by abbreviating the operations in - * move_task_to_local_dsq(). - */ - if (src_rq == dst_rq) { - p->scx.holding_cpu = -1; - dispatch_enqueue(&dst_rq->scx.local_dsq, - p, enq_flags); - } else { - move_task_to_local_dsq(p, enq_flags, - src_rq, dst_rq); - } + /* switch to @src_rq lock */ + if (rq != src_rq) { + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(src_rq); + } - /* if the destination CPU is idle, wake it up */ - if (sched_class_above(p->sched_class, - dst_rq->curr->sched_class)) - resched_curr(dst_rq); + /* task_rq couldn't have changed if we're still the holding cpu */ + if (likely(p->scx.holding_cpu == raw_smp_processor_id()) && + !WARN_ON_ONCE(src_rq != task_rq(p))) { + /* + * If @p is staying on the same rq, there's no need to go + * through the full deactivate/activate cycle. Optimize by + * abbreviating the operations in move_task_to_local_dsq(). + */ + if (src_rq == dst_rq) { + p->scx.holding_cpu = -1; + dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags); + } else { + move_task_to_local_dsq(p, enq_flags, src_rq, dst_rq); } - /* switch back to @rq lock */ - if (rq != dst_rq) { - raw_spin_rq_unlock(dst_rq); - raw_spin_rq_lock(rq); - } + /* if the destination CPU is idle, wake it up */ + if (sched_class_above(p->sched_class, dst_rq->curr->sched_class)) + resched_curr(dst_rq); + } - return; + /* switch back to @rq lock */ + if (rq != dst_rq) { + raw_spin_rq_unlock(dst_rq); + raw_spin_rq_lock(rq); } +#else /* CONFIG_SMP */ + BUG(); /* control can not reach here on UP */ #endif /* CONFIG_SMP */ - - dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS); } /** -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 1389f49098981a525ca9fb752b86a4ae874125a8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Reorder args for consistency in the order of: current_rq, p, src_[rq|dsq], dst_[rq|dsq]. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index a89c62b06bd0..57f450e0e097 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2217,8 +2217,8 @@ static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags, #endif /* CONFIG_SMP */ -static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq, - struct task_struct *p) +static void consume_local_task(struct task_struct *p, + struct scx_dispatch_q *dsq, struct rq *rq) { lockdep_assert_held(&dsq->lock); /* released on return */ @@ -2327,8 +2327,8 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p, !WARN_ON_ONCE(src_rq != task_rq(p)); } -static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq, - struct task_struct *p, struct rq *src_rq) +static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, + struct scx_dispatch_q *dsq, struct rq *src_rq) { raw_spin_rq_unlock(this_rq); @@ -2343,7 +2343,7 @@ static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq, } #else /* CONFIG_SMP */ static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; } -static inline bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { return false; } +static inline bool consume_remote_task(struct rq *this_rq, struct task_struct *p, struct scx_dispatch_q *dsq, struct rq *task_rq) { return false; } #endif /* CONFIG_SMP */ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) @@ -2364,12 +2364,12 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) struct rq *task_rq = task_rq(p); if (rq == task_rq) { - consume_local_task(rq, dsq, p); + consume_local_task(p, dsq, rq); return true; } if (task_can_run_on_remote_rq(p, rq, false)) { - if (likely(consume_remote_task(rq, dsq, p, task_rq))) + if (likely(consume_remote_task(rq, p, dsq, task_rq))) return true; goto retry; } -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 6557133ecd59b6d9922d5709b432b2de3a4d1316 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into task_unlink_from_dsq(). Also move sanity check into it. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Conflicts: kernel/sched/ext.c [Remove the WARN with the mainline patch.] Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 57f450e0e097..2f26e13fd520 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1723,6 +1723,8 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, static void task_unlink_from_dsq(struct task_struct *p, struct scx_dispatch_q *dsq) { + WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node)); + if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) { rb_erase(&p->scx.dsq_priq, &dsq->priq); RB_CLEAR_NODE(&p->scx.dsq_priq); @@ -1730,6 +1732,7 @@ static void task_unlink_from_dsq(struct task_struct *p, } list_del_init(&p->scx.dsq_list.node); + dsq_mod_nr(dsq, -1); } static bool task_linked_on_dsq(struct task_struct *p) @@ -1771,9 +1774,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p) */ if (p->scx.holding_cpu < 0) { /* @p must still be on @dsq, dequeue */ - WARN_ON_ONCE(!task_linked_on_dsq(p)); task_unlink_from_dsq(p, dsq); - dsq_mod_nr(dsq, -1); } else { /* * We're racing against dispatch_to_local_dsq() which already @@ -2226,7 +2227,6 @@ static void consume_local_task(struct task_struct *p, WARN_ON_ONCE(p->scx.holding_cpu >= 0); task_unlink_from_dsq(p, dsq); list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list); - dsq_mod_nr(dsq, -1); dsq_mod_nr(&rq->scx.local_dsq, 1); p->scx.dsq = &rq->scx.local_dsq; raw_spin_unlock(&dsq->lock); @@ -2316,7 +2316,6 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p, WARN_ON_ONCE(p->scx.holding_cpu >= 0); task_unlink_from_dsq(p, dsq); - dsq_mod_nr(dsq, -1); p->scx.holding_cpu = cpu; raw_spin_unlock(&dsq->lock); -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit d434210e13bb03b81cee85adbab23077e45a59e5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- So that the local case comes first and two CONFIG_SMP blocks can be merged. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 31 ++++++++++++++----------------- 1 file changed, 14 insertions(+), 17 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 2f26e13fd520..50ebc6bdda8a 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2181,6 +2181,20 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) return false; } +static void consume_local_task(struct task_struct *p, + struct scx_dispatch_q *dsq, struct rq *rq) +{ + lockdep_assert_held(&dsq->lock); /* released on return */ + + /* @dsq is locked and @p is on this rq */ + WARN_ON_ONCE(p->scx.holding_cpu >= 0); + task_unlink_from_dsq(p, dsq); + list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list); + dsq_mod_nr(&rq->scx.local_dsq, 1); + p->scx.dsq = &rq->scx.local_dsq; + raw_spin_unlock(&dsq->lock); +} + #ifdef CONFIG_SMP /** * move_task_to_local_dsq - Move a task from a different rq to a local DSQ @@ -2216,23 +2230,6 @@ static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags, dst_rq->scx.extra_enq_flags = 0; } -#endif /* CONFIG_SMP */ - -static void consume_local_task(struct task_struct *p, - struct scx_dispatch_q *dsq, struct rq *rq) -{ - lockdep_assert_held(&dsq->lock); /* released on return */ - - /* @dsq is locked and @p is on this rq */ - WARN_ON_ONCE(p->scx.holding_cpu >= 0); - task_unlink_from_dsq(p, dsq); - list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list); - dsq_mod_nr(&rq->scx.local_dsq, 1); - p->scx.dsq = &rq->scx.local_dsq; - raw_spin_unlock(&dsq->lock); -} - -#ifdef CONFIG_SMP /* * Similar to kernel/sched/core.c::is_cpu_allowed(). However, there are two * differences: -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit cf3e94430dd994114e48f0b488f22e68e14b5976 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- - Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq(). - Rename consume_local_task() to move_local_task_to_local_dsq() and remove task_unlink_from_dsq() and source DSQ unlocking from it. This is to make the migration code easier to reuse. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 42 ++++++++++++++++++++++++++---------------- 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 50ebc6bdda8a..bcd0db9c6d59 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2181,23 +2181,30 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) return false; } -static void consume_local_task(struct task_struct *p, - struct scx_dispatch_q *dsq, struct rq *rq) +static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, + struct scx_dispatch_q *src_dsq, + struct rq *dst_rq) { - lockdep_assert_held(&dsq->lock); /* released on return */ + struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq; + + /* @dsq is locked and @p is on @dst_rq */ + lockdep_assert_held(&src_dsq->lock); + lockdep_assert_rq_held(dst_rq); - /* @dsq is locked and @p is on this rq */ WARN_ON_ONCE(p->scx.holding_cpu >= 0); - task_unlink_from_dsq(p, dsq); - list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list); - dsq_mod_nr(&rq->scx.local_dsq, 1); - p->scx.dsq = &rq->scx.local_dsq; - raw_spin_unlock(&dsq->lock); + + if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) + list_add(&p->scx.dsq_list.node, &dst_dsq->list); + else + list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list); + + dsq_mod_nr(dst_dsq, 1); + p->scx.dsq = dst_dsq; } #ifdef CONFIG_SMP /** - * move_task_to_local_dsq - Move a task from a different rq to a local DSQ + * move_remote_task_to_local_dsq - Move a task from a foreign rq to a local DSQ * @p: task to move * @enq_flags: %SCX_ENQ_* * @src_rq: rq to move the task from, locked on entry, released on return @@ -2205,8 +2212,8 @@ static void consume_local_task(struct task_struct *p, * * Move @p which is currently on @src_rq to @dst_rq's local DSQ. */ -static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags, - struct rq *src_rq, struct rq *dst_rq) +static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags, + struct rq *src_rq, struct rq *dst_rq) { lockdep_assert_rq_held(src_rq); @@ -2329,7 +2336,7 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, raw_spin_rq_unlock(this_rq); if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) { - move_task_to_local_dsq(p, 0, src_rq, this_rq); + move_remote_task_to_local_dsq(p, 0, src_rq, this_rq); return true; } else { raw_spin_rq_unlock(src_rq); @@ -2360,7 +2367,9 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) struct rq *task_rq = task_rq(p); if (rq == task_rq) { - consume_local_task(p, dsq, rq); + task_unlink_from_dsq(p, dsq); + move_local_task_to_local_dsq(p, 0, dsq, rq); + raw_spin_unlock(&dsq->lock); return true; } @@ -2440,13 +2449,14 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, /* * If @p is staying on the same rq, there's no need to go * through the full deactivate/activate cycle. Optimize by - * abbreviating the operations in move_task_to_local_dsq(). + * abbreviating move_remote_task_to_local_dsq(). */ if (src_rq == dst_rq) { p->scx.holding_cpu = -1; dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags); } else { - move_task_to_local_dsq(p, enq_flags, src_rq, dst_rq); + move_remote_task_to_local_dsq(p, enq_flags, + src_rq, dst_rq); } /* if the destination CPU is idle, wake it up */ -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 6462dd53a26088a433f90a5c15822196f201037c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was using 5 of them. We want to add two more u64 fields but it's better if we do so while staying within scx_iter_scx_dsq to maintain binary compatibility. The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node field takes up three u64's but only one bit of the last u64 is used. Turn the bool into u32 flags and only use the lower 16 bits freeing up 48 bits - 16 bits for flags, 32 bits for a u32 - for use by struct bpf_iter_scx_dsq_kern. This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern into the cursor field reducing the struct size by a full u64. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- include/linux/sched/ext.h | 10 +++++++++- kernel/sched/ext.c | 23 ++++++++++++----------- 2 files changed, 21 insertions(+), 12 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index dc646cca4fcf..1ddbde64a31b 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -119,9 +119,17 @@ enum scx_kf_mask { __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, }; +enum scx_dsq_lnode_flags { + SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0, + + /* high 16 bits can be for iter cursor flags */ + __SCX_DSQ_LNODE_PRIV_SHIFT = 16, +}; + struct scx_dsq_list_node { struct list_head node; - bool is_bpf_iter_cursor; + u32 flags; + u32 priv; /* can be used by iter cursor */ }; /* diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index bcd0db9c6d59..d20734028bc6 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1195,7 +1195,7 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq, dsq_lnode = container_of(list_node, struct scx_dsq_list_node, node); - } while (dsq_lnode->is_bpf_iter_cursor); + } while (dsq_lnode->flags & SCX_DSQ_LNODE_ITER_CURSOR); return container_of(dsq_lnode, struct task_struct, scx.dsq_list); } @@ -1213,16 +1213,15 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq, */ enum scx_dsq_iter_flags { /* iterate in the reverse dispatch order */ - SCX_DSQ_ITER_REV = 1U << 0, + SCX_DSQ_ITER_REV = 1U << 16, - __SCX_DSQ_ITER_ALL_FLAGS = SCX_DSQ_ITER_REV, + __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, + __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS, }; struct bpf_iter_scx_dsq_kern { struct scx_dsq_list_node cursor; struct scx_dispatch_q *dsq; - u32 dsq_seq; - u32 flags; } __attribute__((aligned(8))); struct bpf_iter_scx_dsq { @@ -1260,6 +1259,9 @@ static void scx_task_iter_init(struct scx_task_iter *iter) { lockdep_assert_held(&scx_tasks_lock); + BUILD_BUG_ON(__SCX_DSQ_ITER_ALL_FLAGS & + ((1U << __SCX_DSQ_LNODE_PRIV_SHIFT) - 1)); + iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR }; list_add(&iter->cursor.tasks_node, &scx_tasks); iter->locked = NULL; @@ -6293,7 +6295,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, BUILD_BUG_ON(__alignof__(struct bpf_iter_scx_dsq_kern) != __alignof__(struct bpf_iter_scx_dsq)); - if (flags & ~__SCX_DSQ_ITER_ALL_FLAGS) + if (flags & ~__SCX_DSQ_ITER_USER_FLAGS) return -EINVAL; kit->dsq = find_non_local_dsq(dsq_id); @@ -6301,9 +6303,8 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, return -ENOENT; INIT_LIST_HEAD(&kit->cursor.node); - kit->cursor.is_bpf_iter_cursor = true; - kit->dsq_seq = READ_ONCE(kit->dsq->seq); - kit->flags = flags; + kit->cursor.flags |= SCX_DSQ_LNODE_ITER_CURSOR | flags; + kit->cursor.priv = READ_ONCE(kit->dsq->seq); return 0; } @@ -6317,7 +6318,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, __bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) { struct bpf_iter_scx_dsq_kern *kit = (void *)it; - bool rev = kit->flags & SCX_DSQ_ITER_REV; + bool rev = kit->cursor.flags & SCX_DSQ_ITER_REV; struct task_struct *p; unsigned long flags; @@ -6338,7 +6339,7 @@ __bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *i */ do { p = nldsq_next_task(kit->dsq, p, rev); - } while (p && unlikely(u32_before(kit->dsq_seq, p->scx.dsq_seq))); + } while (p && unlikely(u32_before(kit->cursor.priv, p->scx.dsq_seq))); if (p) { if (rev) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 4c30f5ce4f7af4f639af99e0bdeada8b268b7361 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Once a task is put into a DSQ, the allowed operations are fairly limited. Tasks in the built-in local and global DSQs are executed automatically and, ignoring dequeue, there is only one way a task in a user DSQ can be manipulated - scx_bpf_consume() moves the first task to the dispatching local DSQ. This inflexibility sometimes gets in the way and is an area where multiple feature requests have been made. Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context which dosen't hold a rq lock including BPF timers and SYSCALL programs. This is an expansion of an earlier patch which only allowed moving into the dispatching local DSQ: http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument count limit and often won't be needed anyway. Instead provide scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called only when needed and override the specified parameter for the subsequent dispatch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 232 ++++++++++++++++++++++- tools/sched_ext/include/scx/common.bpf.h | 10 + 2 files changed, 239 insertions(+), 3 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index d20734028bc6..d6291e06bb17 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1162,6 +1162,11 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask, return true; } +static bool scx_kf_allowed_if_unlocked(void) +{ + return !current->scx.kf_mask; +} + /** * nldsq_next_task - Iterate to the next task in a non-local DSQ * @dsq: user dsq being interated @@ -1215,13 +1220,20 @@ enum scx_dsq_iter_flags { /* iterate in the reverse dispatch order */ SCX_DSQ_ITER_REV = 1U << 16, + __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, + __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, + __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, - __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS, + __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | + __SCX_DSQ_ITER_HAS_SLICE | + __SCX_DSQ_ITER_HAS_VTIME, }; struct bpf_iter_scx_dsq_kern { struct scx_dsq_list_node cursor; struct scx_dispatch_q *dsq; + u64 slice; + u64 vtime; } __attribute__((aligned(8))); struct bpf_iter_scx_dsq { @@ -5880,7 +5892,7 @@ __bpf_kfunc_start_defs(); * scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ * @p: task_struct to dispatch * @dsq_id: DSQ to dispatch to - * @slice: duration @p can run for in nsecs + * @slice: duration @p can run for in nsecs, 0 to keep the current value * @enq_flags: SCX_ENQ_* * * Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe @@ -5930,7 +5942,7 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ * @p: task_struct to dispatch * @dsq_id: DSQ to dispatch to - * @slice: duration @p can run for in nsecs + * @slice: duration @p can run for in nsecs, 0 to keep the current value * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ * @enq_flags: SCX_ENQ_* * @@ -5971,6 +5983,118 @@ static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { .set = &scx_kfunc_ids_enqueue_dispatch, }; +static bool scx_dispatch_from_dsq(struct bpf_iter_scx_dsq_kern *kit, + struct task_struct *p, u64 dsq_id, + u64 enq_flags) +{ + struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq; + struct rq *this_rq, *src_rq, *dst_rq, *locked_rq; + bool dispatched = false; + bool in_balance; + unsigned long flags; + + if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH)) + return false; + + /* + * Can be called from either ops.dispatch() locking this_rq() or any + * context where no rq lock is held. If latter, lock @p's task_rq which + * we'll likely need anyway. + */ + src_rq = task_rq(p); + + local_irq_save(flags); + this_rq = this_rq(); + in_balance = this_rq->scx.flags & SCX_RQ_IN_BALANCE; + + if (in_balance) { + if (this_rq != src_rq) { + raw_spin_rq_unlock(this_rq); + raw_spin_rq_lock(src_rq); + } + } else { + raw_spin_rq_lock(src_rq); + } + + locked_rq = src_rq; + raw_spin_lock(&src_dsq->lock); + + /* + * Did someone else get to it? @p could have already left $src_dsq, got + * re-enqueud, or be in the process of being consumed by someone else. + */ + if (unlikely(p->scx.dsq != src_dsq || + u32_before(kit->cursor.priv, p->scx.dsq_seq) || + p->scx.holding_cpu >= 0) || + WARN_ON_ONCE(src_rq != task_rq(p))) { + raw_spin_unlock(&src_dsq->lock); + goto out; + } + + /* @p is still on $src_dsq and stable, determine the destination */ + dst_dsq = find_dsq_for_dispatch(this_rq, dsq_id, p); + + if (dst_dsq->id == SCX_DSQ_LOCAL) { + dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); + if (!task_can_run_on_remote_rq(p, dst_rq, true)) { + dst_dsq = &scx_dsq_global; + dst_rq = src_rq; + } + } else { + /* no need to migrate if destination is a non-local DSQ */ + dst_rq = src_rq; + } + + /* + * Move @p into $dst_dsq. If $dst_dsq is the local DSQ of a different + * CPU, @p will be migrated. + */ + if (dst_dsq->id == SCX_DSQ_LOCAL) { + /* @p is going from a non-local DSQ to a local DSQ */ + if (src_rq == dst_rq) { + task_unlink_from_dsq(p, src_dsq); + move_local_task_to_local_dsq(p, enq_flags, + src_dsq, dst_rq); + raw_spin_unlock(&src_dsq->lock); + } else { + raw_spin_unlock(&src_dsq->lock); + move_remote_task_to_local_dsq(p, enq_flags, + src_rq, dst_rq); + locked_rq = dst_rq; + } + } else { + /* + * @p is going from a non-local DSQ to a non-local DSQ. As + * $src_dsq is already locked, do an abbreviated dequeue. + */ + task_unlink_from_dsq(p, src_dsq); + p->scx.dsq = NULL; + raw_spin_unlock(&src_dsq->lock); + + if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_VTIME) + p->scx.dsq_vtime = kit->vtime; + dispatch_enqueue(dst_dsq, p, enq_flags); + } + + if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_SLICE) + p->scx.slice = kit->slice; + + dispatched = true; +out: + if (in_balance) { + if (this_rq != locked_rq) { + raw_spin_rq_unlock(locked_rq); + raw_spin_rq_lock(this_rq); + } + } else { + raw_spin_rq_unlock_irqrestore(locked_rq, flags); + } + + kit->cursor.flags &= ~(__SCX_DSQ_ITER_HAS_SLICE | + __SCX_DSQ_ITER_HAS_VTIME); + return dispatched; +} + __bpf_kfunc_start_defs(); /** @@ -6050,12 +6174,112 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id) } } +/** + * scx_bpf_dispatch_from_dsq_set_slice - Override slice when dispatching from DSQ + * @it__iter: DSQ iterator in progress + * @slice: duration the dispatched task can run for in nsecs + * + * Override the slice of the next task that will be dispatched from @it__iter + * using scx_bpf_dispatch_from_dsq[_vtime](). If this function is not called, + * the previous slice duration is kept. + */ +__bpf_kfunc void scx_bpf_dispatch_from_dsq_set_slice( + struct bpf_iter_scx_dsq *it__iter, u64 slice) +{ + struct bpf_iter_scx_dsq_kern *kit = (void *)it__iter; + + kit->slice = slice; + kit->cursor.flags |= __SCX_DSQ_ITER_HAS_SLICE; +} + +/** + * scx_bpf_dispatch_from_dsq_set_vtime - Override vtime when dispatching from DSQ + * @it__iter: DSQ iterator in progress + * @vtime: task's ordering inside the vtime-sorted queue of the target DSQ + * + * Override the vtime of the next task that will be dispatched from @it__iter + * using scx_bpf_dispatch_from_dsq_vtime(). If this function is not called, the + * previous slice vtime is kept. If scx_bpf_dispatch_from_dsq() is used to + * dispatch the next task, the override is ignored and cleared. + */ +__bpf_kfunc void scx_bpf_dispatch_from_dsq_set_vtime( + struct bpf_iter_scx_dsq *it__iter, u64 vtime) +{ + struct bpf_iter_scx_dsq_kern *kit = (void *)it__iter; + + kit->vtime = vtime; + kit->cursor.flags |= __SCX_DSQ_ITER_HAS_VTIME; +} + +/** + * scx_bpf_dispatch_from_dsq - Move a task from DSQ iteration to a DSQ + * @it__iter: DSQ iterator in progress + * @p: task to transfer + * @dsq_id: DSQ to move @p to + * @enq_flags: SCX_ENQ_* + * + * Transfer @p which is on the DSQ currently iterated by @it__iter to the DSQ + * specified by @dsq_id. All DSQs - local DSQs, global DSQ and user DSQs - can + * be the destination. + * + * For the transfer to be successful, @p must still be on the DSQ and have been + * queued before the DSQ iteration started. This function doesn't care whether + * @p was obtained from the DSQ iteration. @p just has to be on the DSQ and have + * been queued before the iteration started. + * + * @p's slice is kept by default. Use scx_bpf_dispatch_from_dsq_set_slice() to + * update. + * + * Can be called from ops.dispatch() or any BPF context which doesn't hold a rq + * lock (e.g. BPF timers or SYSCALL programs). + * + * Returns %true if @p has been consumed, %false if @p had already been consumed + * or dequeued. + */ +__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, + struct task_struct *p, u64 dsq_id, + u64 enq_flags) +{ + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter, + p, dsq_id, enq_flags); +} + +/** + * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ + * @it__iter: DSQ iterator in progress + * @p: task to transfer + * @dsq_id: DSQ to move @p to + * @enq_flags: SCX_ENQ_* + * + * Transfer @p which is on the DSQ currently iterated by @it__iter to the + * priority queue of the DSQ specified by @dsq_id. The destination must be a + * user DSQ as only user DSQs support priority queue. + * + * @p's slice and vtime are kept by default. Use + * scx_bpf_dispatch_from_dsq_set_slice() and + * scx_bpf_dispatch_from_dsq_set_vtime() to update. + * + * All other aspects are identical to scx_bpf_dispatch_from_dsq(). See + * scx_bpf_dispatch_vtime() for more information on @vtime. + */ +__bpf_kfunc bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, + struct task_struct *p, u64 dsq_id, + u64 enq_flags) +{ + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter, + p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); +} + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_dispatch) BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots) BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel) BTF_ID_FLAGS(func, scx_bpf_consume) +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_slice) +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_vtime) +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU) BTF_KFUNCS_END(scx_kfunc_ids_dispatch) static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { @@ -6152,6 +6376,8 @@ __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_unlocked) BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU) +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU) BTF_KFUNCS_END(scx_kfunc_ids_unlocked) static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 457462b19966..f538c75db183 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -35,6 +35,10 @@ void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vt u32 scx_bpf_dispatch_nr_slots(void) __ksym; void scx_bpf_dispatch_cancel(void) __ksym; bool scx_bpf_consume(u64 dsq_id) __ksym; +void scx_bpf_dispatch_from_dsq_set_slice(struct bpf_iter_scx_dsq *it__iter, u64 slice) __ksym; +void scx_bpf_dispatch_from_dsq_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym; +bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; +bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; u32 scx_bpf_reenqueue_local(void) __ksym; void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym; s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym; @@ -63,6 +67,12 @@ s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym; struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym; struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym; +/* + * Use the following as @it__iter when calling + * scx_bpf_dispatch[_vtime]_from_dsq() from within bpf_for_each() loops. + */ +#define BPF_FOR_EACH_ITER (&___it) + static inline __attribute__((format(printf, 1, 2))) void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {} -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 2d285d561543d76b4a13263731b447e646c258f2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- Implement a silly boosting mechanism for nice -20 tasks. The only purpose is demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only works within SHARED_DSQ and makes only minor differences with increased dispatch batch (-b). This exercises moving tasks to a user DSQ and all local DSQs from ops.dispatch() and BPF timerfn. v2: - Updated to use scx_bpf_dispatch_from_dsq_set_{slice|vtime}(). - Drop the workaround for the iterated tasks not being trusted by the verifier. The issue is fixed from BPF side. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: David Vernet <void@manifault.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/scx_qmap.bpf.c | 133 +++++++++++++++++++++++++++++---- tools/sched_ext/scx_qmap.c | 11 ++- 2 files changed, 130 insertions(+), 14 deletions(-) diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 7e7f28f4117e..83c8f54c1e31 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -27,6 +27,8 @@ enum consts { ONE_SEC_IN_NS = 1000000000, SHARED_DSQ = 0, + HIGHPRI_DSQ = 1, + HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */ }; char _license[] SEC("license") = "GPL"; @@ -36,10 +38,12 @@ const volatile u32 stall_user_nth; const volatile u32 stall_kernel_nth; const volatile u32 dsp_inf_loop_after; const volatile u32 dsp_batch; +const volatile bool highpri_boosting; const volatile bool print_shared_dsq; const volatile s32 disallow_tgid; const volatile bool suppress_dump; +u64 nr_highpri_queued; u32 test_error_cnt; UEI_DEFINE(uei); @@ -95,6 +99,7 @@ static u64 core_sched_tail_seqs[5]; /* Per-task scheduling context */ struct task_ctx { bool force_local; /* Dispatch directly to local_dsq */ + bool highpri; u64 core_sched_seq; }; @@ -122,6 +127,7 @@ struct { /* Statistics */ u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq; u64 nr_core_sched_execed; +u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer; u32 cpuperf_min, cpuperf_avg, cpuperf_max; u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max; @@ -140,17 +146,25 @@ static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu) return -1; } +static struct task_ctx *lookup_task_ctx(struct task_struct *p) +{ + struct task_ctx *tctx; + + if (!(tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) { + scx_bpf_error("task_ctx lookup failed"); + return NULL; + } + return tctx; +} + s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) { struct task_ctx *tctx; s32 cpu; - tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); - if (!tctx) { - scx_bpf_error("task_ctx lookup failed"); + if (!(tctx = lookup_task_ctx(p))) return -ESRCH; - } cpu = pick_direct_dispatch_cpu(p, prev_cpu); @@ -197,11 +211,8 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) if (test_error_cnt && !--test_error_cnt) scx_bpf_error("test triggering error"); - tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); - if (!tctx) { - scx_bpf_error("task_ctx lookup failed"); + if (!(tctx = lookup_task_ctx(p))) return; - } /* * All enqueued tasks must have their core_sched_seq updated for correct @@ -255,6 +266,10 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags) return; } + if (highpri_boosting && p->scx.weight >= HIGHPRI_WEIGHT) { + tctx->highpri = true; + __sync_fetch_and_add(&nr_highpri_queued, 1); + } __sync_fetch_and_add(&nr_enqueued, 1); } @@ -271,13 +286,80 @@ void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags) static void update_core_sched_head_seq(struct task_struct *p) { - struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); int idx = weight_to_idx(p->scx.weight); + struct task_ctx *tctx; - if (tctx) + if ((tctx = lookup_task_ctx(p))) core_sched_head_seqs[idx] = tctx->core_sched_seq; - else - scx_bpf_error("task_ctx lookup failed"); +} + +/* + * To demonstrate the use of scx_bpf_dispatch_from_dsq(), implement silly + * selective priority boosting mechanism by scanning SHARED_DSQ looking for + * highpri tasks, moving them to HIGHPRI_DSQ and then consuming them first. This + * makes minor difference only when dsp_batch is larger than 1. + * + * scx_bpf_dispatch[_vtime]_from_dsq() are allowed both from ops.dispatch() and + * non-rq-lock holding BPF programs. As demonstration, this function is called + * from qmap_dispatch() and monitor_timerfn(). + */ +static bool dispatch_highpri(bool from_timer) +{ + struct task_struct *p; + s32 this_cpu = bpf_get_smp_processor_id(); + + /* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */ + bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) { + static u64 highpri_seq; + struct task_ctx *tctx; + + if (!(tctx = lookup_task_ctx(p))) + return false; + + if (tctx->highpri) { + /* exercise the set_*() and vtime interface too */ + scx_bpf_dispatch_from_dsq_set_slice( + BPF_FOR_EACH_ITER, slice_ns * 2); + scx_bpf_dispatch_from_dsq_set_vtime( + BPF_FOR_EACH_ITER, highpri_seq++); + scx_bpf_dispatch_vtime_from_dsq( + BPF_FOR_EACH_ITER, p, HIGHPRI_DSQ, 0); + } + } + + /* + * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU + * is found. + */ + bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) { + bool dispatched = false; + s32 cpu; + + if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr)) + cpu = this_cpu; + else + cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0); + + if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p, + SCX_DSQ_LOCAL_ON | cpu, + SCX_ENQ_PREEMPT)) { + if (cpu == this_cpu) { + dispatched = true; + __sync_fetch_and_add(&nr_expedited_local, 1); + } else { + __sync_fetch_and_add(&nr_expedited_remote, 1); + } + if (from_timer) + __sync_fetch_and_add(&nr_expedited_from_timer, 1); + } else { + __sync_fetch_and_add(&nr_expedited_lost, 1); + } + + if (dispatched) + return true; + } + + return false; } void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) @@ -289,7 +371,10 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) void *fifo; s32 i, pid; - if (scx_bpf_consume(SHARED_DSQ)) + if (dispatch_highpri(false)) + return; + + if (!nr_highpri_queued && scx_bpf_consume(SHARED_DSQ)) return; if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) { @@ -326,6 +411,8 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) /* Dispatch or advance. */ bpf_repeat(BPF_MAX_LOOPS) { + struct task_ctx *tctx; + if (bpf_map_pop_elem(fifo, &pid)) break; @@ -333,13 +420,25 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) if (!p) continue; + if (!(tctx = lookup_task_ctx(p))) { + bpf_task_release(p); + return; + } + + if (tctx->highpri) + __sync_fetch_and_sub(&nr_highpri_queued, 1); + update_core_sched_head_seq(p); __sync_fetch_and_add(&nr_dispatched, 1); + scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0); bpf_task_release(p); + batch--; cpuc->dsp_cnt--; if (!batch || !scx_bpf_dispatch_nr_slots()) { + if (dispatch_highpri(false)) + return; scx_bpf_consume(SHARED_DSQ); return; } @@ -664,6 +763,10 @@ static void dump_shared_dsq(void) static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer) { + bpf_rcu_read_lock(); + dispatch_highpri(true); + bpf_rcu_read_unlock(); + monitor_cpuperf(); if (print_shared_dsq) @@ -685,6 +788,10 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) if (ret) return ret; + ret = scx_bpf_create_dsq(HIGHPRI_DSQ, -1); + if (ret) + return ret; + timer = bpf_map_lookup_elem(&monitor_timer, &key); if (!timer) return -ESRCH; diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index c9ca30d62b2b..ac45a02b4055 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -29,6 +29,7 @@ const char help_fmt[] = " -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n" " -b COUNT Dispatch upto COUNT tasks together\n" " -P Print out DSQ content to trace_pipe every second, use with -b\n" +" -H Boost nice -20 tasks in SHARED_DSQ, use with -b\n" " -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n" " -D LEN Set scx_exit_info.dump buffer length\n" " -S Suppress qmap-specific debug dump\n" @@ -63,7 +64,7 @@ int main(int argc, char **argv) skel = SCX_OPS_OPEN(qmap_ops, scx_qmap); - while ((opt = getopt(argc, argv, "s:e:t:T:l:b:Pd:D:Spvh")) != -1) { + while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) { switch (opt) { case 's': skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; @@ -86,6 +87,9 @@ int main(int argc, char **argv) case 'P': skel->rodata->print_shared_dsq = true; break; + case 'H': + skel->rodata->highpri_boosting = true; + break; case 'd': skel->rodata->disallow_tgid = strtol(optarg, NULL, 0); if (skel->rodata->disallow_tgid < 0) @@ -121,6 +125,11 @@ int main(int argc, char **argv) skel->bss->nr_reenqueued, skel->bss->nr_dequeued, skel->bss->nr_core_sched_execed, skel->bss->nr_ddsp_from_enq); + printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n", + skel->bss->nr_expedited_local, + skel->bss->nr_expedited_remote, + skel->bss->nr_expedited_from_timer, + skel->bss->nr_expedited_lost); if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur")) printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n", skel->bss->cpuperf_min, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 750a40d816de1567bd08f1017362f6e54e6549dc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- While the BPF scheduler is being unloaded, the following warning messages trigger sometimes: NOHZ tick-stop error: local softirq work is pending, handler #80!!! This is caused by the CPU entering idle while there are pending softirqs. The main culprit is the bypassing state assertion not being synchronized with rq operations. As the BPF scheduler cannot be trusted in the disable path, the first step is entering the bypass mode where the BPF scheduler is ignored and scheduling becomes global FIFO. This is implemented by turning scx_ops_bypassing() true. However, the transition isn't synchronized against anything and it's possible for enqueue and dispatch paths to have different ideas on whether bypass mode is on. Make each rq track its own bypass state with SCX_RQ_BYPASSING which is modified while rq is locked. This removes most of the NOHZ tick-stop messages but not completely. I believe the stragglers are from the sched core bug where pick_task_scx() can be called without preceding balance_scx(). Once that bug is fixed, we should verify that all occurrences of this error message are gone too. v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that the per-cpu states are always synchronized with the global state. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Vernet <void@manifault.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 63 ++++++++++++++++++++++++++------------------ kernel/sched/sched.h | 1 + 2 files changed, 38 insertions(+), 26 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index d6291e06bb17..2133bcce37fa 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1415,9 +1415,9 @@ static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to, return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to); } -static bool scx_ops_bypassing(void) +static bool scx_rq_bypassing(struct rq *rq) { - return unlikely(atomic_read(&scx_ops_bypass_depth)); + return unlikely(rq->scx.flags & SCX_RQ_BYPASSING); } /** @@ -1957,7 +1957,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, if (!scx_rq_online(rq)) goto local; - if (scx_ops_bypassing()) + if (scx_rq_bypassing(rq)) goto global; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) @@ -2621,7 +2621,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) * bypassing test. */ if ((prev->scx.flags & SCX_TASK_QUEUED) && - prev->scx.slice && !scx_ops_bypassing()) { + prev->scx.slice && !scx_rq_bypassing(rq)) { rq->scx.flags |= SCX_RQ_BAL_KEEP; goto has_tasks; } @@ -2634,7 +2634,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) if (consume_dispatch_q(rq, &scx_dsq_global)) goto has_tasks; - if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq)) + if (!SCX_HAS_OP(dispatch) || scx_rq_bypassing(rq) || !scx_rq_online(rq)) goto no_tasks; dspc->rq = rq; @@ -2680,7 +2680,8 @@ static int balance_one(struct rq *rq, struct task_struct *prev) * %SCX_OPS_ENQ_LAST is in effect. */ if ((prev->scx.flags & SCX_TASK_QUEUED) && - (!static_branch_unlikely(&scx_ops_enq_last) || scx_ops_bypassing())) { + (!static_branch_unlikely(&scx_ops_enq_last) || + scx_rq_bypassing(rq))) { rq->scx.flags |= SCX_RQ_BAL_KEEP; goto has_tasks; } @@ -2872,7 +2873,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, * forcing a different task. Leave it at the head of the local * DSQ. */ - if (p->scx.slice && !scx_ops_bypassing()) { + if (p->scx.slice && !scx_rq_bypassing(rq)) { dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD); return; } @@ -2937,7 +2938,7 @@ static struct task_struct *pick_task_scx(struct rq *rq) return NULL; if (unlikely(!p->scx.slice)) { - if (!scx_ops_bypassing() && !scx_warned_zero_slice) { + if (!scx_rq_bypassing(rq) && !scx_warned_zero_slice) { printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n", p->comm, p->pid); scx_warned_zero_slice = true; @@ -2975,7 +2976,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, * calling ops.core_sched_before(). Accesses are controlled by the * verifier. */ - if (SCX_HAS_OP(core_sched_before) && !scx_ops_bypassing()) + if (SCX_HAS_OP(core_sched_before) && !scx_rq_bypassing(task_rq(a))) return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before, (struct task_struct *)a, (struct task_struct *)b); @@ -3334,7 +3335,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) * While disabling, always resched and refresh core-sched timestamp as * we can't trust the slice management or ops.core_sched_before(). */ - if (scx_ops_bypassing()) { + if (scx_rq_bypassing(rq)) { curr->scx.slice = 0; touch_core_sched(rq, curr); } else if (SCX_HAS_OP(tick)) { @@ -3672,7 +3673,7 @@ bool scx_can_stop_tick(struct rq *rq) { struct task_struct *p = rq->curr; - if (scx_ops_bypassing()) + if (scx_rq_bypassing(rq)) return false; if (p->sched_class != &ext_sched_class) @@ -4264,17 +4265,9 @@ static void scx_ops_bypass(bool bypass) return; } - /* - * We need to guarantee that no tasks are on the BPF scheduler while - * bypassing. Either we see enabled or the enable path sees the - * increased bypass_depth before moving tasks to SCX. - */ - if (!scx_enabled()) - return; - /* * No task property is changing. We just need to make sure all currently - * queued tasks are re-queued according to the new scx_ops_bypassing() + * queued tasks are re-queued according to the new scx_rq_bypassing() * state. As an optimization, walk each rq's runnable_list instead of * the scx_tasks list. * @@ -4288,6 +4281,24 @@ static void scx_ops_bypass(bool bypass) rq_lock_irqsave(rq, &rf); + if (bypass) { + WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING); + rq->scx.flags |= SCX_RQ_BYPASSING; + } else { + WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING)); + rq->scx.flags &= ~SCX_RQ_BYPASSING; + } + + /* + * We need to guarantee that no tasks are on the BPF scheduler + * while bypassing. Either we see enabled or the enable path + * sees scx_rq_bypassing() before moving tasks to SCX. + */ + if (!scx_enabled()) { + rq_unlock_irqrestore(rq, &rf); + continue; + } + /* * The use of list_for_each_entry_safe_reverse() is required * because each task is going to be removed from and added back @@ -6405,17 +6416,17 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) if (!ops_cpu_valid(cpu, NULL)) return; + local_irq_save(irq_flags); + + this_rq = this_rq(); + /* * While bypassing for PM ops, IRQ handling may not be online which can * lead to irq_work_queue() malfunction such as infinite busy wait for * IRQ status update. Suppress kicking. */ - if (scx_ops_bypassing()) - return; - - local_irq_save(irq_flags); - - this_rq = this_rq(); + if (scx_rq_bypassing(this_rq)) + goto out; /* * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid nesting diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index aa1740216760..5ba3184808ec 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -859,6 +859,7 @@ enum scx_rq_flags { SCX_RQ_ONLINE = 1 << 0, SCX_RQ_CAN_STOP_TICK = 1 << 1, SCX_RQ_BAL_KEEP = 1 << 2, /* balance decided to keep current */ + SCX_RQ_BYPASSING = 1 << 3, SCX_RQ_IN_WAKEUP = 1 << 16, SCX_RQ_IN_BALANCE = 1 << 17, -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 513ed0c7ccc103c2ff668154854ec410729a3170 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- A task moving across CPUs should not trigger quiescent/runnable task state events as the task is staying runnable the whole time and just stopping and then starting on different CPUs. Suppress quiescent/runnable task state events if task_on_rq_migrating(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 2133bcce37fa..1560c98d362b 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2075,7 +2075,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags rq->scx.nr_running++; add_nr_running(rq, 1); - if (SCX_HAS_OP(runnable)) + if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p)) SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags); if (enq_flags & SCX_ENQ_WAKEUP) @@ -2159,7 +2159,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false); } - if (SCX_HAS_OP(quiescent)) + if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p)) SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags); if (deq_flags & SCX_DEQ_SLEEP) -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 62d3726d4cd66f3e48dfe0f0401e0d74e58c2170 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- a2f4b16e736d ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried fixing build when !CONFIG_STACKTRACE but didn't so fully. Also put stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix build when !CONFIG_STACKTRACE. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/ Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 1560c98d362b..149b6b58e8b6 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4477,8 +4477,9 @@ static void scx_ops_disable_workfn(struct kthread_work *work) if (ei->msg[0] != '\0') pr_err("sched_ext: %s: %s\n", scx_ops.name, ei->msg); - +#ifdef CONFIG_STACKTRACE stack_trace_print(ei->bt, ei->bt_len, 2); +#endif } else { pr_info("sched_ext: BPF scheduler \"%s\" disabled (%s)\n", scx_ops.name, ei->reason); @@ -4855,10 +4856,10 @@ static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, return; ei->exit_code = exit_code; - +#ifdef CONFIG_STACKTRACE if (kind >= SCX_EXIT_ERROR) ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1); - +#endif va_start(args, fmt); vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args); va_end(args); -- 2.34.1
From: Andrea Righi <andrea.righi@linux.dev> mainline inclusion from mainline-v6.12-rc1 commit 431844b65f4c1b988ccd886f2ed29c138f7bb262 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- As discussed during the distro-centric session within the sched_ext Microconference at LPC 2024, introduce a sequence counter that is incremented every time a BPF scheduler is loaded. This feature can help distributions in diagnosing potential performance regressions by identifying systems where users are running (or have ran) custom BPF schedulers. Example: arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 0 arighi@virtme-ng~> sudo scx_simple local=1 global=0 ^CEXIT: unregistered from user space arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq 1 In this way user-space tools (such as Ubuntu's apport and similar) are able to gather and include this information in bug reports. Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com> Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Phil Auld <pauld@redhat.com> Signed-off-by: Andrea Righi <andrea.righi@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- Documentation/scheduler/sched-ext.rst | 10 ++++++++++ kernel/sched/ext.c | 17 +++++++++++++++++ tools/sched_ext/scx_show_state.py | 1 + 3 files changed, 28 insertions(+) diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index a707d2181a77..6c0d70e2e27d 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -83,6 +83,15 @@ The current status of the BPF scheduler can be determined as follows: # cat /sys/kernel/sched_ext/root/ops simple +You can check if any BPF scheduler has ever been loaded since boot by examining +this monotonically incrementing counter (a value of zero indicates that no BPF +scheduler has been loaded): + +.. code-block:: none + + # cat /sys/kernel/sched_ext/enable_seq + 1 + ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more detailed information: @@ -96,6 +105,7 @@ detailed information: enable_state : enabled (2) bypass_depth : 0 nr_rejected : 0 + enable_seq : 1 If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can be determined as follows: diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 149b6b58e8b6..2261f6ff5d64 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -878,6 +878,13 @@ static struct scx_exit_info *scx_exit_info; static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0); static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0); +/* + * A monotically increasing sequence number that is incremented every time a + * scheduler is enabled. This can be used by to check if any custom sched_ext + * scheduler has ever been used in the system. + */ +static atomic_long_t scx_enable_seq = ATOMIC_LONG_INIT(0); + /* * The maximum amount of time in jiffies that a task may be runnable without * being scheduled on a CPU. If this timeout is exceeded, it will trigger @@ -4162,11 +4169,19 @@ static ssize_t scx_attr_hotplug_seq_show(struct kobject *kobj, } SCX_ATTR(hotplug_seq); +static ssize_t scx_attr_enable_seq_show(struct kobject *kobj, + struct kobj_attribute *ka, char *buf) +{ + return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_enable_seq)); +} +SCX_ATTR(enable_seq); + static struct attribute *scx_global_attrs[] = { &scx_attr_state.attr, &scx_attr_switch_all.attr, &scx_attr_nr_rejected.attr, &scx_attr_hotplug_seq.attr, + &scx_attr_enable_seq.attr, NULL, }; @@ -5185,6 +5200,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops) kobject_uevent(scx_root_kobj, KOBJ_ADD); mutex_unlock(&scx_ops_enable_mutex); + atomic_long_inc(&scx_enable_seq); + return 0; err_del: diff --git a/tools/sched_ext/scx_show_state.py b/tools/sched_ext/scx_show_state.py index d457d2a74e1e..8bc626ede1c4 100644 --- a/tools/sched_ext/scx_show_state.py +++ b/tools/sched_ext/scx_show_state.py @@ -37,3 +37,4 @@ print(f'switched_all : {read_static_key("__scx_switched_all")}') print(f'enable_state : {ops_state_str(enable_state)} ({enable_state})') print(f'bypass_depth : {read_atomic("scx_ops_bypass_depth")}') print(f'nr_rejected : {read_atomic("scx_nr_rejected")}') +print(f'enable_seq : {read_atomic("scx_enable_seq")}') -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 42268ad0eb4142245ea40ab01a5690a40e9c3b41 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- move_remote_task_to_local_dsq() is only defined on SMP configs but scx_disaptch_from_dsq() was calling move_remote_task_to_local_dsq() on UP configs too causing build failures. Add a dummy move_remote_task_to_local_dsq() which triggers a warning. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409241108.jaocHiDJ-lkp@intel.com/ Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- kernel/sched/ext.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 2261f6ff5d64..237243b62737 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2366,6 +2366,7 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, } } #else /* CONFIG_SMP */ +static inline void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags, struct rq *src_rq, struct rq *dst_rq) { WARN_ON_ONCE(1); } static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; } static inline bool consume_remote_task(struct rq *this_rq, struct task_struct *p, struct scx_dispatch_q *dsq, struct rq *task_rq) { return false; } #endif /* CONFIG_SMP */ -- 2.34.1
From: Tejun Heo <tj@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit 1e123fd73deb16cb362ecefb55c90c9196f4a6c2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ID7HLY Reference: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commi... -------------------------------- cgroup support and scx_bpf_dispatch[_vtime]_from_dsq() are newly added since 8bb30798fd6e ("sched_ext: Fixes incorrect type in bpf_scx_init()") which is the current earliest commit targeted by BPF schedulers. Add compat helpers for them and apply them in the example schedulers. These will be dropped after a few kernel releases. The exact backward compatibility window hasn't been decided yet. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: cheliequan <cheliequan@inspur.com> Signed-off-by: Luo Gengkun <luogengkun2@huawei.com> --- tools/sched_ext/include/scx/compat.bpf.h | 19 +++++++++++++++++++ tools/sched_ext/scx_flatcg.bpf.c | 10 +++++----- tools/sched_ext/scx_qmap.bpf.c | 12 ++++++------ 3 files changed, 30 insertions(+), 11 deletions(-) diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h index 3d2fe1208900..e5afe9efd3f3 100644 --- a/tools/sched_ext/include/scx/compat.bpf.h +++ b/tools/sched_ext/include/scx/compat.bpf.h @@ -15,6 +15,25 @@ __ret; \ }) +/* v6.12: 819513666966 ("sched_ext: Add cgroup support") */ +#define __COMPAT_scx_bpf_task_cgroup(p) \ + (bpf_ksym_exists(scx_bpf_task_cgroup) ? \ + scx_bpf_task_cgroup((p)) : NULL) + +/* v6.12: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") */ +#define __COMPAT_scx_bpf_dispatch_from_dsq_set_slice(it, slice) \ + (bpf_ksym_exists(scx_bpf_dispatch_from_dsq_set_slice) ? \ + scx_bpf_dispatch_from_dsq_set_slice((it), (slice)) : (void)0) +#define __COMPAT_scx_bpf_dispatch_from_dsq_set_vtime(it, vtime) \ + (bpf_ksym_exists(scx_bpf_dispatch_from_dsq_set_vtime) ? \ + scx_bpf_dispatch_from_dsq_set_vtime((it), (vtime)) : (void)0) +#define __COMPAT_scx_bpf_dispatch_from_dsq(it, p, dsq_id, enq_flags) \ + (bpf_ksym_exists(scx_bpf_dispatch_from_dsq) ? \ + scx_bpf_dispatch_from_dsq((it), (p), (dsq_id), (enq_flags)) : false) +#define __COMPAT_scx_bpf_dispatch_vtime_from_dsq(it, p, dsq_id, enq_flags) \ + (bpf_ksym_exists(scx_bpf_dispatch_vtime_from_dsq) ? \ + scx_bpf_dispatch_vtime_from_dsq((it), (p), (dsq_id), (enq_flags)) : false) + /* * Define sched_ext_ops. This may be expanded to define multiple variants for * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH(). diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c index 3ab2b60781a0..936415b98ae7 100644 --- a/tools/sched_ext/scx_flatcg.bpf.c +++ b/tools/sched_ext/scx_flatcg.bpf.c @@ -383,7 +383,7 @@ void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags) return; } - cgrp = scx_bpf_task_cgroup(p); + cgrp = __COMPAT_scx_bpf_task_cgroup(p); cgc = find_cgrp_ctx(cgrp); if (!cgc) goto out_release; @@ -509,7 +509,7 @@ void BPF_STRUCT_OPS(fcg_runnable, struct task_struct *p, u64 enq_flags) { struct cgroup *cgrp; - cgrp = scx_bpf_task_cgroup(p); + cgrp = __COMPAT_scx_bpf_task_cgroup(p); update_active_weight_sums(cgrp, true); bpf_cgroup_release(cgrp); } @@ -522,7 +522,7 @@ void BPF_STRUCT_OPS(fcg_running, struct task_struct *p) if (fifo_sched) return; - cgrp = scx_bpf_task_cgroup(p); + cgrp = __COMPAT_scx_bpf_task_cgroup(p); cgc = find_cgrp_ctx(cgrp); if (cgc) { /* @@ -565,7 +565,7 @@ void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable) if (!taskc->bypassed_at) return; - cgrp = scx_bpf_task_cgroup(p); + cgrp = __COMPAT_scx_bpf_task_cgroup(p); cgc = find_cgrp_ctx(cgrp); if (cgc) { __sync_fetch_and_add(&cgc->cvtime_delta, @@ -579,7 +579,7 @@ void BPF_STRUCT_OPS(fcg_quiescent, struct task_struct *p, u64 deq_flags) { struct cgroup *cgrp; - cgrp = scx_bpf_task_cgroup(p); + cgrp = __COMPAT_scx_bpf_task_cgroup(p); update_active_weight_sums(cgrp, false); bpf_cgroup_release(cgrp); } diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index 83c8f54c1e31..67e2a7968cc9 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -318,11 +318,11 @@ static bool dispatch_highpri(bool from_timer) if (tctx->highpri) { /* exercise the set_*() and vtime interface too */ - scx_bpf_dispatch_from_dsq_set_slice( + __COMPAT_scx_bpf_dispatch_from_dsq_set_slice( BPF_FOR_EACH_ITER, slice_ns * 2); - scx_bpf_dispatch_from_dsq_set_vtime( + __COMPAT_scx_bpf_dispatch_from_dsq_set_vtime( BPF_FOR_EACH_ITER, highpri_seq++); - scx_bpf_dispatch_vtime_from_dsq( + __COMPAT_scx_bpf_dispatch_vtime_from_dsq( BPF_FOR_EACH_ITER, p, HIGHPRI_DSQ, 0); } } @@ -340,9 +340,9 @@ static bool dispatch_highpri(bool from_timer) else cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0); - if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p, - SCX_DSQ_LOCAL_ON | cpu, - SCX_ENQ_PREEMPT)) { + if (__COMPAT_scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p, + SCX_DSQ_LOCAL_ON | cpu, + SCX_ENQ_PREEMPT)) { if (cpu == this_cpu) { dispatched = true; __sync_fetch_and_add(&nr_expedited_local, 1); -- 2.34.1
participants (1)
-
Zicheng Qu