infrastructure for scheduler bpf
Guan Jing (12):
  sched: programmable: Introduce bpf sched
  sched: programmable: Add a tag for the task
  sched: programmable: Add a tag for the task group
  sched: programmable: Add user interface of task group tag
  sched: programmable: Add user interface of task tag
  sched: basic infrastructure for scheduler bpf
  sched: introduce bpf_sched_enable()
  libbpf: add support for scheduler bpf programs
  bpftool: recognize scheduler programs
  sched: Add helper functions to get cpu statistics
  sched: programmable: Add hook in select_task_rq_fair()
  sched: programmable: Add hook in can_migrate_task()

 arch/arm64/configs/openeuler_defconfig |   1 +
 arch/x86/configs/openeuler_defconfig   |   1 +
 fs/proc/base.c                         |  64 ++++++++++++++
 include/linux/bpf_sched.h              |  50 +++++++++++
 include/linux/bpf_types.h              |   4 +
 include/linux/sched.h                  |  38 ++++++++
 include/linux/sched_hook_defs.h        |   4 +
 include/uapi/linux/bpf.h               |   9 ++
 init/init_task.c                       |   3 +
 kernel/bpf/Kconfig                     |  12 +++
 kernel/bpf/btf.c                       |   1 +
 kernel/bpf/syscall.c                   |  27 ++++++
 kernel/bpf/trampoline.c                |   1 +
 kernel/bpf/verifier.c                  |  11 ++-
 kernel/sched/bpf_sched.c               | 102 +++++++++++++++++++++
 kernel/sched/build_utility.c           |   4 +
 kernel/sched/core.c                    | 118 +++++++++++++++++++++++++
 kernel/sched/fair.c                    |  45 ++++++++++
 kernel/sched/sched.h                   |  10 +++
 scripts/bpf_doc.py                     |   4 +
 tools/include/uapi/linux/bpf.h         |   9 ++
 tools/lib/bpf/bpf.c                    |   1 +
 tools/lib/bpf/libbpf.c                 |  23 ++++-
 tools/lib/bpf/libbpf.h                 |   2 +
 tools/lib/bpf/libbpf.map               |   1 +
 25 files changed, 543 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/bpf_sched.h
 create mode 100644 include/linux/sched_hook_defs.h
 create mode 100644 kernel/sched/bpf_sched.c
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Introduce bpf sched, which enables instrumentation of the sched hooks with eBPF programs for implementing dynamic scheduling policies.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 arch/arm64/configs/openeuler_defconfig |  1 +
 arch/x86/configs/openeuler_defconfig   |  1 +
 kernel/bpf/Kconfig                     | 12 ++++++++++++
 3 files changed, 14 insertions(+)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 66737e8c8673..16568cd7d2cb 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -88,6 +88,7 @@ CONFIG_BPF_JIT_DEFAULT_ON=y # CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set # CONFIG_BPF_PRELOAD is not set # CONFIG_BPF_LSM is not set +CONFIG_BPF_SCHED=y # end of BPF subsystem
CONFIG_PREEMPT_NONE_BUILD=y diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index fcbef6c06587..2fc4487e9f5f 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -106,6 +106,7 @@ CONFIG_BPF_JIT_DEFAULT_ON=y # CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set # CONFIG_BPF_PRELOAD is not set # CONFIG_BPF_LSM is not set +CONFIG_BPF_SCHED=y # end of BPF subsystem
CONFIG_PREEMPT_BUILD=y diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig index 6a906ff93006..35b5c28d116c 100644 --- a/kernel/bpf/Kconfig +++ b/kernel/bpf/Kconfig @@ -100,4 +100,16 @@ config BPF_LSM
If you are unsure how to answer this question, answer N.
+config BPF_SCHED + bool "Sched Instrumentation with BPF" + depends on BPF_EVENTS + depends on BPF_SYSCALL + help + Enables instrumentation of the sched hooks with eBPF programs for + implementing dynamic scheduling policies. When CONFIG_BPF_SCHED + is enabled, privileged BPF could be used to expand scheduling + capabilities. + + If you are unsure how to answer this question, answer N. + endmenu # "BPF subsystem"
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Add a tag for the task, which is useful for identifying special tasks. Users can use the file system interface to set different tags for specific workloads. Kernel subsystems can set it too, using the set_* helpers. The bpf prog reads the tags to distinguish different workloads.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 include/linux/sched.h | 4 ++++
 init/init_task.c      | 3 +++
 kernel/sched/core.c   | 3 +++
 3 files changed, 10 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 9fdd08aa9626..ff76ebfc3d91 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1552,6 +1552,10 @@ struct task_struct { const cpumask_t *select_cpus; #endif
+#ifdef CONFIG_BPF_SCHED + long tag; +#endif + /* * New fields for task_struct should be added above here, so that * they are included in the randomized portion of task_struct. diff --git a/init/init_task.c b/init/init_task.c index ac0c5850f74b..2101c6e3432d 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -213,6 +213,9 @@ struct task_struct init_task #ifdef CONFIG_SECCOMP_FILTER .seccomp = { .filter_count = ATOMIC_INIT(0) }, #endif +#ifdef CONFIG_BPF_SCHED + .tag = 0, +#endif }; EXPORT_SYMBOL(init_task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7a0997e7e136..dfb1014a63e8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4541,6 +4541,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->migration_pending = NULL; #endif init_sched_mm_cid(p); +#ifdef CONFIG_BPF_SCHED + p->tag = 0; +#endif }
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Add a tag for the task group, to support the tag-based scheduling mechanism.
The tag identifies a special task or a class of special tasks. Many such classes exist in practice, such as foreground and background tasks, online and offline tasks, etc. Tagging them lets us recognize these tasks and apply specific policies to them.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 kernel/sched/core.c  | 19 +++++++++++++++++++
 kernel/sched/sched.h |  3 +++
 2 files changed, 22 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dfb1014a63e8..8615a048d91a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10496,6 +10496,13 @@ static void sched_unregister_group(struct task_group *tg) call_rcu(&tg->rcu, sched_free_group_rcu); }
+#ifdef CONFIG_BPF_SCHED +static inline void tg_init_tag(struct task_group *tg, struct task_group *ptg) +{ + tg->tag = ptg->tag; +} +#endif + /* allocate runqueue etc for a new task group */ struct task_group *sched_create_group(struct task_group *parent) { @@ -10516,6 +10523,10 @@ struct task_group *sched_create_group(struct task_group *parent) if (!alloc_rt_sched_group(tg, parent)) goto err;
+#ifdef CONFIG_BPF_SCHED + tg_init_tag(tg, parent); +#endif + alloc_uclamp_sched_group(tg, parent);
return tg; @@ -10603,6 +10614,14 @@ static void sched_change_group(struct task_struct *tsk, struct task_group *group sched_change_qos_group(tsk, group); #endif
+#ifdef CONFIG_BPF_SCHED + /* + * This function has cleared and restored the task status, + * so we do not need to dequeue and enqueue the task again. + */ + tsk->tag = group->tag; +#endif + #ifdef CONFIG_FAIR_GROUP_SCHED if (tsk->sched_class->task_change_group) tsk->sched_class->task_change_group(tsk); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4b679122d26f..0d2fc752ea7c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -419,6 +419,9 @@ struct task_group { struct uclamp_se uclamp[UCLAMP_CNT]; #endif
+#ifdef CONFIG_BPF_SCHED + long tag; +#endif };
#ifdef CONFIG_FAIR_GROUP_SCHED
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Add a user interface for the task group tag, bridging the information gap between user mode and kernel mode.
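The tag semantics implemented by this patch and the previous one can be summarized with a small user-space model. This is only a sketch for illustration: the `model_*` names and the fixed-size arrays are assumptions made here, not kernel code. It mirrors three behaviours visible in the diffs: a new group inherits its parent's tag (tg_init_tag()), a task takes the tag of the group it joins (sched_change_group()), and a write to a group's tag file is propagated top-down to all descendant groups and their tasks (cpu_tag_write() via walk_tg_tree_from()), with the root group rejected.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX 8	/* arbitrary capacity for this sketch */

/* Hypothetical stand-ins for task_struct and task_group. */
struct model_task {
	long tag;
};

struct model_group {
	long tag;
	struct model_group *parent;	/* NULL for the root group */
	struct model_group *child[MODEL_MAX];
	size_t nr_child;
	struct model_task *task[MODEL_MAX];
	size_t nr_task;
};

/* Like tg_init_tag(): a new group inherits its parent's tag. */
static int model_group_create(struct model_group *tg, struct model_group *parent)
{
	if (parent->nr_child >= MODEL_MAX)
		return -ENOSPC;
	*tg = (struct model_group){ .tag = parent->tag, .parent = parent };
	parent->child[parent->nr_child++] = tg;
	return 0;
}

/* Like sched_change_group(): a task takes the tag of the group it joins. */
static int model_task_attach(struct model_task *tsk, struct model_group *tg)
{
	if (tg->nr_task >= MODEL_MAX)
		return -ENOSPC;
	tg->task[tg->nr_task++] = tsk;
	tsk->tag = tg->tag;
	return 0;
}

/* Like tg_change_tag() driven by walk_tg_tree_from(): top-down propagation. */
static void model_propagate(struct model_group *tg, long tag)
{
	size_t i;

	tg->tag = tag;
	for (i = 0; i < tg->nr_task; i++)
		tg->task[i]->tag = tag;
	for (i = 0; i < tg->nr_child; i++)
		model_propagate(tg->child[i], tag);
}

/* Like cpu_tag_write(): the root group is rejected, no-op writes succeed. */
static int model_tag_write(struct model_group *tg, long tag)
{
	if (!tg->parent)
		return -EINVAL;
	if (tg->tag == tag)
		return 0;
	model_propagate(tg, tag);
	return 0;
}
```

The model leaves out locking, RCU and the dequeue/enqueue sequence that sched_settag() performs around the assignment; it only shows which tags change and when.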
Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 include/linux/sched.h |  4 +++
 kernel/sched/core.c   | 81 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  3 ++
 3 files changed, 88 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index ff76ebfc3d91..e6c13283015f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2468,6 +2468,10 @@ static inline void rseq_syscall(struct pt_regs *regs) { }
+#ifdef CONFIG_BPF_SCHED +extern void sched_settag(struct task_struct *tsk, s64 tag); +#endif + #endif
#ifdef CONFIG_SCHED_CORE diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8615a048d91a..5f2fe9d54c2a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -11413,6 +11413,80 @@ static inline s64 cpu_qos_read(struct cgroup_subsys_state *css, } #endif
+#ifdef CONFIG_BPF_SCHED +void sched_settag(struct task_struct *tsk, s64 tag) +{ + int queued, running, queue_flags = + DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; + struct rq_flags rf; + struct rq *rq; + + if (tsk->tag == tag) + return; + + rq = task_rq_lock(tsk, &rf); + + running = task_current(rq, tsk); + queued = task_on_rq_queued(tsk); + + update_rq_clock(rq); + if (queued) + dequeue_task(rq, tsk, queue_flags); + if (running) + put_prev_task(rq, tsk); + + tsk->tag = tag; + + if (queued) + enqueue_task(rq, tsk, queue_flags); + if (running) + set_next_task(rq, tsk); + + task_rq_unlock(rq, tsk, &rf); +} + +int tg_change_tag(struct task_group *tg, void *data) +{ + struct css_task_iter it; + struct task_struct *tsk; + s64 tag = *(s64 *)data; + struct cgroup_subsys_state *css = &tg->css; + + tg->tag = tag; + + css_task_iter_start(css, 0, &it); + while ((tsk = css_task_iter_next(&it))) + sched_settag(tsk, tag); + css_task_iter_end(&it); + + return 0; +} + +static int cpu_tag_write(struct cgroup_subsys_state *css, + struct cftype *cftype, s64 tag) +{ + struct task_group *tg = css_tg(css); + + if (tg == &root_task_group) + return -EINVAL; + + if (tg->tag == tag) + return 0; + + rcu_read_lock(); + walk_tg_tree_from(tg, tg_change_tag, tg_nop, (void *)(&tag)); + rcu_read_unlock(); + + return 0; +} + +static inline s64 cpu_tag_read(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return css_tg(css)->tag; +} +#endif + static struct cftype cpu_legacy_files[] = { #ifdef CONFIG_FAIR_GROUP_SCHED { @@ -11483,6 +11557,13 @@ static struct cftype cpu_legacy_files[] = { .read_s64 = cpu_qos_read, .write_s64 = cpu_qos_write, }, +#endif +#ifdef CONFIG_BPF_SCHED + { + .name = "tag", + .read_s64 = cpu_tag_read, + .write_s64 = cpu_tag_write, + }, #endif { } /* Terminate */ }; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0d2fc752ea7c..05a7f09f2bba 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -456,6 +456,9 @@ static inline int 
walk_tg_tree(tg_visitor down, tg_visitor up, void *data) }
extern int tg_nop(struct task_group *tg, void *data); +#ifdef CONFIG_BPF_SCHED +extern int tg_change_tag(struct task_group *tg, void *data); +#endif
extern void free_fair_sched_group(struct task_group *tg); extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Add a user interface for the task tag, bridging the information gap between user mode and kernel mode.
Add proc interface: /proc/${pid}/task/${tid}/tag
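As an illustration of how user space might drive this interface, here is a minimal sketch. The helper names `tag_path()`, `write_tag()` and `read_tag()` are made up for this example and are not part of the patch. It assumes the file lives in the per-thread directory (the entry is registered in tid_base_stuff) and that the tag is written as a decimal string, which the kernel side parses with kstrtol_from_user().

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Build "/proc/<pid>/task/<tid>/tag" into buf.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int tag_path(char *buf, size_t len, pid_t pid, pid_t tid)
{
	int n = snprintf(buf, len, "/proc/%d/task/%d/tag", (int)pid, (int)tid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a decimal tag to the given file. Returns 0 on success, -1 on failure. */
static int write_tag(const char *path, long tag)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", tag) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read the tag back. Returns 0 on success, -1 on failure. */
static int read_tag(const char *path, long *tag)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%ld", tag) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would typically build the path for a worker thread, call `write_tag()` once when the workload starts, and let the scheduler bpf program read the value from the task. Note that the patch rejects writes for pid 1 with -EPERM.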
Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 fs/proc/base.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
diff --git a/fs/proc/base.c b/fs/proc/base.c index 243c15919e18..b407d4a47acc 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3658,6 +3658,67 @@ static const struct inode_operations proc_tid_comm_inode_operations = { .permission = proc_tid_comm_permission, };
+#ifdef CONFIG_BPF_SCHED +static ssize_t pid_tag_write(struct file *file, const char __user *buf, + size_t count, loff_t *offset) +{ + struct inode *inode = file_inode(file); + struct task_struct *tsk; + int err = 0; + long tag = 0; + + tsk = get_proc_task(inode); + if (!tsk) { + err = -ESRCH; + goto out; + } + + if (unlikely(tsk->pid == 1)) { + err = -EPERM; + goto out; + } + + + err = kstrtol_from_user(buf, count, 0, &tag); + if (err) + goto out; + + sched_settag(tsk, tag); + +out: + put_task_struct(tsk); + return err < 0 ? err : count; +} + +static int pid_tag_show(struct seq_file *m, void *v) +{ + struct inode *inode = m->private; + struct task_struct *tsk; + + tsk = get_proc_task(inode); + if (!tsk) + return -ESRCH; + + seq_printf(m, "%ld\n", tsk->tag); + put_task_struct(tsk); + + return 0; +} + +static int pid_tag_open(struct inode *inode, struct file *flip) +{ + return single_open(flip, pid_tag_show, inode); +} + +static const struct file_operations proc_pid_tag_operations = { + .open = pid_tag_open, + .read = seq_read, + .write = pid_tag_write, + .llseek = seq_lseek, + .release = single_release, +}; +#endif + /* * Tasks */ @@ -3764,6 +3825,9 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY REG("preferred_cpuset", 0644, proc_preferred_cpuset_operations), #endif +#ifdef CONFIG_BPF_SCHED + REG("tag", 0644, proc_pid_tag_operations), +#endif };
static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
Reference: https://lore.kernel.org/all/20210916162451.709260-1-guro@fb.com/
-------------------
This commit introduces basic definitions and infrastructure for scheduler bpf programs. It defines the BPF_PROG_TYPE_SCHED program type and the BPF_SCHED attachment type.
The implementation is inspired by lsm bpf programs and is based on kretprobes. This allows new hooks to be added with minimal changes to the kernel code and without any changes to libbpf/bpftool. That is very convenient, as I anticipate a large number of private patches being used for a long time before (or if at all) reaching upstream.
Sched programs are expected to return an int, whose meaning is defined by the context.
This patch doesn't add any real scheduler hooks (only a stub); they will be added by the following patches in the series.
Scheduler bpf programs are currently very restricted in what they can do: apart from the base helper set, only the bpf_printk() helper is available. The scheduler context can impose significant restrictions on what's safe and what's not, so let's extend their abilities on a case-by-case basis when a need arises.
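The hook infrastructure relies on one list of BPF_SCHED_HOOK() entries being expanded several times with different macro definitions. The user-space sketch below illustrates that pattern only. It keeps the list in a macro (`SCHED_HOOK_LIST`, a name invented here) instead of re-including sched_hook_defs.h, adds a second, purely hypothetical hook (`example`), and replaces the BTF ID set with a plain table of names.

```c
#include <string.h>

/* Hypothetical hook list; the kernel keeps it in sched_hook_defs.h. */
#define SCHED_HOOK_LIST(X)		\
	X(int, 0, dummy, void)		\
	X(int, -1, example, void)

/* Expansion 1: prototypes, as done in bpf_sched.h. */
#define DECLARE_HOOK(RET, DEFAULT, NAME, ...) \
	RET bpf_sched_##NAME(__VA_ARGS__);
SCHED_HOOK_LIST(DECLARE_HOOK)
#undef DECLARE_HOOK

/* Expansion 2: nop bodies that return DEFAULT, as done in bpf_sched.c. */
#define DEFINE_HOOK(RET, DEFAULT, NAME, ...)	\
RET bpf_sched_##NAME(__VA_ARGS__)		\
{						\
	return DEFAULT;				\
}
SCHED_HOOK_LIST(DEFINE_HOOK)
#undef DEFINE_HOOK

/* Expansion 3: table of attachable names (stands in for the BTF ID set). */
#define NAME_HOOK(RET, DEFAULT, NAME, ...) "bpf_sched_" #NAME,
static const char *const sched_hook_names[] = {
	SCHED_HOOK_LIST(NAME_HOOK)
};
#undef NAME_HOOK

/* Returns 1 if name is a known hook, 0 otherwise. */
static int is_sched_hook(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(sched_hook_names) / sizeof(sched_hook_names[0]); i++)
		if (strcmp(sched_hook_names[i], name) == 0)
			return 1;
	return 0;
}
```

With this layout, adding a hook means adding one line to the list: the prototype, the stub and the membership check used to validate attach targets all follow from it. In the kernel version the stubs are additionally marked noinline so that a program can be attached to them.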
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 include/linux/bpf_sched.h       | 26 ++++++++++++++
 include/linux/bpf_types.h       |  4 +++
 include/linux/sched_hook_defs.h |  2 ++
 include/uapi/linux/bpf.h        |  2 ++
 kernel/bpf/btf.c                |  1 +
 kernel/bpf/syscall.c            | 16 +++++++++
 kernel/bpf/trampoline.c         |  1 +
 kernel/bpf/verifier.c           | 11 +++++-
 kernel/sched/bpf_sched.c        | 62 +++++++++++++++++++++++++++++++++
 kernel/sched/build_utility.c    |  4 +++
 tools/include/uapi/linux/bpf.h  |  2 ++
 tools/lib/bpf/bpf.c             |  1 +
 12 files changed, 131 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf_sched.h
 create mode 100644 include/linux/sched_hook_defs.h
 create mode 100644 kernel/sched/bpf_sched.c
diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h new file mode 100644 index 000000000000..874393e6a6aa --- /dev/null +++ b/include/linux/bpf_sched.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_BPF_SCHED_H +#define _LINUX_BPF_SCHED_H + +#include <linux/bpf.h> + +#ifdef CONFIG_BPF_SCHED + +#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \ + RET bpf_sched_##NAME(__VA_ARGS__); +#include <linux/sched_hook_defs.h> +#undef BPF_SCHED_HOOK + +int bpf_sched_verify_prog(struct bpf_verifier_log *vlog, + const struct bpf_prog *prog); + +#else /* !CONFIG_BPF_SCHED */ + +static inline int bpf_sched_verify_prog(struct bpf_verifier_log *vlog, + const struct bpf_prog *prog) +{ + return -EOPNOTSUPP; +} + +#endif /* CONFIG_BPF_SCHED */ +#endif /* _LINUX_BPF_SCHED_H */ diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index fc0d6f32c687..dd79463eea4e 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -83,6 +83,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall, BPF_PROG_TYPE(BPF_PROG_TYPE_NETFILTER, netfilter, struct bpf_nf_ctx, struct bpf_nf_ctx) #endif +#ifdef CONFIG_BPF_SCHED +BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED, bpf_sched, + void *, void *) +#endif /* CONFIG_BPF_SCHED */
BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h new file mode 100644 index 000000000000..14344004e335 --- /dev/null +++ b/include/linux/sched_hook_defs.h @@ -0,0 +1,2 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +BPF_SCHED_HOOK(int, 0, dummy, void) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 4924f0cde1bc..9dd0b85549b6 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -988,6 +988,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ BPF_PROG_TYPE_NETFILTER, + BPF_PROG_TYPE_SCHED, };
enum bpf_attach_type { @@ -1040,6 +1041,7 @@ enum bpf_attach_type { BPF_TCX_INGRESS, BPF_TCX_EGRESS, BPF_TRACE_UPROBE_MULTI, + BPF_SCHED, __MAX_BPF_ATTACH_TYPE };
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 8090d7fb11ef..133805b4bc71 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -5982,6 +5982,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, return true; t = btf_type_by_id(btf, t->type); break; + case BPF_SCHED: case BPF_MODIFY_RETURN: /* For now the BPF_MODIFY_RETURN can only be attached to * functions that return an int. diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index d77b2f8b9364..f60472ddb820 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2412,6 +2412,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, case BPF_PROG_TYPE_LSM: case BPF_PROG_TYPE_STRUCT_OPS: case BPF_PROG_TYPE_EXT: + case BPF_PROG_TYPE_SCHED: break; default: return -EINVAL; @@ -2539,6 +2540,7 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type) case BPF_PROG_TYPE_LSM: case BPF_PROG_TYPE_STRUCT_OPS: /* has access to struct sock */ case BPF_PROG_TYPE_EXT: /* extends any prog */ + case BPF_PROG_TYPE_SCHED: return true; default: return false; @@ -3115,6 +3117,12 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog, goto out_put_prog; } break; + case BPF_PROG_TYPE_SCHED: + if (prog->expected_attach_type != BPF_SCHED) { + err = -EINVAL; + goto out_put_prog; + } + break; default: err = -EINVAL; goto out_put_prog; @@ -3582,6 +3590,7 @@ static int bpf_raw_tp_link_attach(struct bpf_prog *prog, case BPF_PROG_TYPE_TRACING: case BPF_PROG_TYPE_EXT: case BPF_PROG_TYPE_LSM: + case BPF_PROG_TYPE_SCHED: if (user_tp_name) /* The attach point for this category of programs * should be specified via btf_id during program load. 
@@ -3717,6 +3726,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type) case BPF_TCX_INGRESS: case BPF_TCX_EGRESS: return BPF_PROG_TYPE_SCHED_CLS; + case BPF_SCHED: + return BPF_PROG_TYPE_SCHED; default: return BPF_PROG_TYPE_UNSPEC; } @@ -3744,6 +3755,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, -EINVAL : 0; case BPF_PROG_TYPE_EXT: return 0; + case BPF_PROG_TYPE_SCHED: + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + return 0; case BPF_PROG_TYPE_NETFILTER: if (attach_type != BPF_NETFILTER) return -EINVAL; @@ -4922,6 +4937,7 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr) ret = cgroup_bpf_link_attach(attr, prog); break; case BPF_PROG_TYPE_EXT: + case BPF_PROG_TYPE_SCHED: ret = bpf_tracing_prog_attach(prog, attr->link_create.target_fd, attr->link_create.target_btf_id, diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index e97aeda3a86b..2aa01c26f6a4 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -493,6 +493,7 @@ static enum bpf_tramp_prog_type bpf_attach_type_to_tramp(struct bpf_prog *prog) switch (prog->expected_attach_type) { case BPF_TRACE_FENTRY: return BPF_TRAMP_FENTRY; + case BPF_SCHED: case BPF_MODIFY_RETURN: return BPF_TRAMP_MODIFY_RETURN; case BPF_TRACE_FEXIT: diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 824531d4c262..c8da0be7d576 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -27,6 +27,7 @@ #include <linux/module.h> #include <linux/cpumask.h> #include <net/xdp.h> +#include <linux/bpf_sched.h>
#include "disasm.h"
@@ -19453,6 +19454,7 @@ int bpf_check_attach_target(struct bpf_verifier_log *log, case BPF_LSM_CGROUP: case BPF_TRACE_FENTRY: case BPF_TRACE_FEXIT: + case BPF_SCHED: if (!btf_type_is_func(t)) { bpf_log(log, "attach_btf_id %u is not a function\n", btf_id); @@ -19629,7 +19631,8 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
if (prog->type != BPF_PROG_TYPE_TRACING && prog->type != BPF_PROG_TYPE_LSM && - prog->type != BPF_PROG_TYPE_EXT) + prog->type != BPF_PROG_TYPE_EXT && + prog->type != BPF_PROG_TYPE_SCHED) return 0;
ret = bpf_check_attach_target(&env->log, prog, tgt_prog, btf_id, &tgt_info); @@ -19673,6 +19676,12 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) return -EINVAL; }
+ if (prog->type == BPF_PROG_TYPE_SCHED) { + ret = bpf_sched_verify_prog(&env->log, prog); + if (ret < 0) + return ret; + } + key = bpf_trampoline_compute_key(tgt_prog, prog->aux->attach_btf, btf_id); tr = bpf_trampoline_get(key, &tgt_info); if (!tr) diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c new file mode 100644 index 000000000000..2360404d4a07 --- /dev/null +++ b/kernel/sched/bpf_sched.c @@ -0,0 +1,62 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/bpf.h> +#include <linux/cgroup.h> +#include <linux/bpf_verifier.h> +#include <linux/bpf_sched.h> +#include <linux/btf_ids.h> +#include "sched.h" + +/* + * For every hook declare a nop function where a BPF program can be attached. + */ +#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \ +noinline RET bpf_sched_##NAME(__VA_ARGS__) \ +{ \ + return DEFAULT; \ +} + +#include <linux/sched_hook_defs.h> +#undef BPF_SCHED_HOOK + +#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) BTF_ID(func, bpf_sched_##NAME) +BTF_SET_START(bpf_sched_hooks) +#include <linux/sched_hook_defs.h> +#undef BPF_SCHED_HOOK +BTF_SET_END(bpf_sched_hooks) + +int bpf_sched_verify_prog(struct bpf_verifier_log *vlog, + const struct bpf_prog *prog) +{ + if (!prog->gpl_compatible) { + bpf_log(vlog, + "sched programs must have a GPL compatible license\n"); + return -EINVAL; + } + + if (!btf_id_set_contains(&bpf_sched_hooks, prog->aux->attach_btf_id)) { + bpf_log(vlog, "attach_btf_id %u points to wrong type name %s\n", + prog->aux->attach_btf_id, prog->aux->attach_func_name); + return -EINVAL; + } + + return 0; +} + +static const struct bpf_func_proto * +bpf_sched_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_trace_printk: + return bpf_get_trace_printk_proto(); + default: + return bpf_base_func_proto(func_id); + } +} + +const struct bpf_prog_ops bpf_sched_prog_ops = { +}; + +const struct bpf_verifier_ops bpf_sched_verifier_ops = { + .get_func_proto = bpf_sched_func_proto, + 
.is_valid_access = btf_ctx_access, +}; diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c index 99bdd96f454f..d44c584d9bc7 100644 --- a/kernel/sched/build_utility.c +++ b/kernel/sched/build_utility.c @@ -108,3 +108,7 @@ #ifdef CONFIG_SCHED_AUTOGROUP # include "autogroup.c" #endif + +#ifdef CONFIG_BPF_SCHED +# include "bpf_sched.c" +#endif diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 4924f0cde1bc..9dd0b85549b6 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -988,6 +988,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ BPF_PROG_TYPE_NETFILTER, + BPF_PROG_TYPE_SCHED, };
enum bpf_attach_type { @@ -1040,6 +1041,7 @@ enum bpf_attach_type { BPF_TCX_INGRESS, BPF_TCX_EGRESS, BPF_TRACE_UPROBE_MULTI, + BPF_SCHED, __MAX_BPF_ATTACH_TYPE };
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index b0f1913763a3..ddbb16be651f 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -781,6 +781,7 @@ int bpf_link_create(int prog_fd, int target_fd, case BPF_TRACE_FENTRY: case BPF_TRACE_FEXIT: case BPF_MODIFY_RETURN: + case BPF_SCHED: case BPF_LSM_MAC: attr.link_create.tracing.cookie = OPTS_GET(opts, tracing.cookie, 0); if (!OPTS_ZEROED(opts, tracing))
maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
Reference: https://lore.kernel.org/all/20210916162451.709260-1-guro@fb.com/
-------------------
Introduce a dedicated static key and the bpf_sched_enabled() wrapper to guard all invocations of bpf programs in the scheduler code.
This helps avoid any potential performance regression when no scheduler bpf programs are attached.
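To show what the guard buys, here is a user-space model of a guarded call site. It is a sketch under stated assumptions: a plain counter stands in for the static key (the kernel patches the branch instead of testing a variable), all `model_*` names are invented here, and the convention that a negative hook result means "no opinion" is hypothetical, since the real meaning of the return value is defined per hook.

```c
#include <stdbool.h>

/* Stand-in for bpf_sched_enabled_key: counts attached programs. */
static int model_enabled_count;

static bool model_sched_enabled(void)
{
	return model_enabled_count > 0;
}

/* Called on attach, like bpf_sched_inc(). */
static void model_sched_inc(void)
{
	model_enabled_count++;
}

/* Called on link release, like bpf_sched_dec(). */
static void model_sched_dec(void)
{
	model_enabled_count--;
}

/* Instrumentation for the sketch: how often the hook was entered. */
static int model_hook_calls;
/* Value an attached program would return (hypothetical). */
static int model_hook_result = -1;

static int model_hook(void)
{
	model_hook_calls++;
	return model_hook_result;
}

/*
 * Hypothetical call site: the hook is consulted only when at least one
 * program is attached; otherwise the default decision is used directly.
 */
static int model_pick_cpu(int default_cpu)
{
	if (model_sched_enabled()) {
		int ret = model_hook();

		if (ret >= 0)
			return ret;
	}
	return default_cpu;
}
```

With no program attached, `model_pick_cpu()` never enters the hook, which is the property the static key provides on the scheduler fast path at close to zero cost.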
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 include/linux/bpf_sched.h | 24 ++++++++++++++++++++++++
 kernel/bpf/syscall.c      | 11 +++++++++++
 kernel/sched/bpf_sched.c  |  2 ++
 3 files changed, 37 insertions(+)
diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h index 874393e6a6aa..9cd2493d2787 100644 --- a/include/linux/bpf_sched.h +++ b/include/linux/bpf_sched.h @@ -6,6 +6,8 @@
#ifdef CONFIG_BPF_SCHED
+#include <linux/jump_label.h> + #define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...) \ RET bpf_sched_##NAME(__VA_ARGS__); #include <linux/sched_hook_defs.h> @@ -14,6 +16,23 @@ int bpf_sched_verify_prog(struct bpf_verifier_log *vlog, const struct bpf_prog *prog);
+DECLARE_STATIC_KEY_FALSE(bpf_sched_enabled_key); + +static inline bool bpf_sched_enabled(void) +{ + return static_branch_unlikely(&bpf_sched_enabled_key); +} + +static inline void bpf_sched_inc(void) +{ + static_branch_inc(&bpf_sched_enabled_key); +} + +static inline void bpf_sched_dec(void) +{ + static_branch_dec(&bpf_sched_enabled_key); +} + #else /* !CONFIG_BPF_SCHED */
static inline int bpf_sched_verify_prog(struct bpf_verifier_log *vlog, @@ -22,5 +41,10 @@ static inline int bpf_sched_verify_prog(struct bpf_verifier_log *vlog, return -EOPNOTSUPP; }
+static inline bool bpf_sched_enabled(void) +{ + return false; +} + #endif /* CONFIG_BPF_SCHED */ #endif /* _LINUX_BPF_SCHED_H */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index f60472ddb820..875a3587350d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -36,6 +36,7 @@ #include <linux/memcontrol.h> #include <linux/trace_events.h> #include <net/netfilter/nf_bpf_link.h> +#include <linux/bpf_sched.h>
#include <net/tcx.h>
@@ -3027,6 +3028,11 @@ static void bpf_tracing_link_release(struct bpf_link *link) struct bpf_tracing_link *tr_link = container_of(link, struct bpf_tracing_link, link.link);
+#ifdef CONFIG_BPF_SCHED + if (link->prog->type == BPF_PROG_TYPE_SCHED) + bpf_sched_dec(); +#endif + WARN_ON_ONCE(bpf_trampoline_unlink_prog(&tr_link->link, tr_link->trampoline));
@@ -3242,6 +3248,11 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog, goto out_unlock; }
+#ifdef CONFIG_BPF_SCHED + if (prog->type == BPF_PROG_TYPE_SCHED) + bpf_sched_inc(); +#endif + link->tgt_prog = tgt_prog; link->trampoline = tr;
diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c index 2360404d4a07..e2525bd60abf 100644 --- a/kernel/sched/bpf_sched.c +++ b/kernel/sched/bpf_sched.c @@ -6,6 +6,8 @@ #include <linux/btf_ids.h> #include "sched.h"
+DEFINE_STATIC_KEY_FALSE(bpf_sched_enabled_key); + /* * For every hook declare a nop function where a BPF program can be attached. */
maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
Reference: https://lore.kernel.org/all/20210916162451.709260-1-guro@fb.com/
-------------------
This patch adds support for loading and attaching scheduler bpf programs.
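The diff registers the "sched/" section prefix and the "bpf_sched_" BTF prefix, so a program placed in section "sched/<hook>" is resolved against the kernel function "bpf_sched_<hook>". The helper below is an illustrative sketch of that name mapping only. `sched_sec_to_btf_name()` and the two macro names are invented for this example; this is not libbpf's actual implementation.

```c
#include <stdio.h>
#include <string.h>

#define SCHED_SEC_PREFIX "sched/"	/* section prefix from section_defs[] */
#define SCHED_BTF_PREFIX "bpf_sched_"	/* value of BTF_SCHED_PREFIX */

/*
 * Map "sched/<hook>" to "bpf_sched_<hook>".
 * Returns 0 on success, -1 if sec has the wrong prefix, names no hook,
 * or the result does not fit into buf.
 */
static int sched_sec_to_btf_name(const char *sec, char *buf, size_t len)
{
	size_t plen = strlen(SCHED_SEC_PREFIX);
	int n;

	if (strncmp(sec, SCHED_SEC_PREFIX, plen) != 0 || sec[plen] == '\0')
		return -1;

	n = snprintf(buf, len, "%s%s", SCHED_BTF_PREFIX, sec + plen);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For the stub hook added earlier in the series, a program in section "sched/dummy" would therefore be matched against the kernel function bpf_sched_dummy.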
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
---
 tools/lib/bpf/libbpf.c   | 21 ++++++++++++++++++++-
 tools/lib/bpf/libbpf.h   |  2 ++
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 96ff1aa4bf6a..d683d1bcc0f4 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -3029,7 +3029,8 @@ static int bpf_object_fixup_btf(struct bpf_object *obj) static bool prog_needs_vmlinux_btf(struct bpf_program *prog) { if (prog->type == BPF_PROG_TYPE_STRUCT_OPS || - prog->type == BPF_PROG_TYPE_LSM) + prog->type == BPF_PROG_TYPE_LSM || + prog->type == BPF_PROG_TYPE_SCHED) return true;
/* BPF_PROG_TYPE_TRACING programs which do not attach to other programs @@ -8764,6 +8765,7 @@ static int attach_kprobe_multi(const struct bpf_program *prog, long cookie, stru static int attach_uprobe_multi(const struct bpf_program *prog, long cookie, struct bpf_link **link); static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_link **link); static int attach_iter(const struct bpf_program *prog, long cookie, struct bpf_link **link); +static int attach_sched(const struct bpf_program *prog, long cookie, struct bpf_link **link);
static const struct bpf_sec_def section_defs[] = { SEC_DEF("socket", SOCKET_FILTER, 0, SEC_NONE), @@ -8858,6 +8860,7 @@ static const struct bpf_sec_def section_defs[] = { SEC_DEF("struct_ops.s+", STRUCT_OPS, 0, SEC_SLEEPABLE), SEC_DEF("sk_lookup", SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE), SEC_DEF("netfilter", NETFILTER, BPF_NETFILTER, SEC_NONE), + SEC_DEF("sched/", SCHED, BPF_SCHED, SEC_ATTACH_BTF, attach_sched), };
int libbpf_register_prog_handler(const char *sec, @@ -9237,6 +9240,7 @@ static int bpf_object__collect_st_ops_relos(struct bpf_object *obj, #define BTF_TRACE_PREFIX "btf_trace_" #define BTF_LSM_PREFIX "bpf_lsm_" #define BTF_ITER_PREFIX "bpf_iter_" +#define BTF_SCHED_PREFIX "bpf_sched_" #define BTF_MAX_NAME_SIZE 128
void btf_get_kernel_prefix_kind(enum bpf_attach_type attach_type, @@ -9256,6 +9260,10 @@ void btf_get_kernel_prefix_kind(enum bpf_attach_type attach_type, *prefix = BTF_ITER_PREFIX; *kind = BTF_KIND_FUNC; break; + case BPF_SCHED: + *prefix = BTF_SCHED_PREFIX; + *kind = BTF_KIND_FUNC; + break; default: *prefix = ""; *kind = BTF_KIND_FUNC; @@ -12113,6 +12121,17 @@ struct bpf_link *bpf_program__attach_netfilter(const struct bpf_program *prog, return link; }
+struct bpf_link *bpf_program__attach_sched(const struct bpf_program *prog) +{ + return bpf_program__attach_btf_id(prog, NULL); +} + +static int attach_sched(const struct bpf_program *prog, long cookie, struct bpf_link **link) +{ + *link = bpf_program__attach_sched(prog); + return libbpf_get_error(*link); +} + struct bpf_link *bpf_program__attach(const struct bpf_program *prog) { struct bpf_link *link = NULL; diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 0e52621cba43..aabdd973c1a5 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -769,6 +769,8 @@ bpf_program__attach_xdp(const struct bpf_program *prog, int ifindex); LIBBPF_API struct bpf_link * bpf_program__attach_freplace(const struct bpf_program *prog, int target_fd, const char *attach_func_name); +LIBBPF_API struct bpf_link * +bpf_program__attach_sched(const struct bpf_program *prog);
struct bpf_netfilter_opts { /* size of this struct, for forward/backward compatibility */ diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index 57712321490f..228ab00a5e69 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -236,6 +236,7 @@ LIBBPF_0.2.0 { perf_buffer__buffer_fd; perf_buffer__epoll_fd; perf_buffer__consume_buffer; + bpf_program__attach_sched; } LIBBPF_0.1.0;
LIBBPF_0.3.0 {
maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
Reference: https://lore.kernel.org/all/20210916162451.709260-1-guro@fb.com/
-------------------
Teach bpftool to recognize scheduler bpf programs.
Signed-off-by: Roman Gushchin guro@fb.com
Signed-off-by: Chen Hui judy.chenhui@huawei.com
Signed-off-by: Ren Zhijie renzhijie2@huawei.com
Signed-off-by: Hui Tang tanghui20@huawei.com
Signed-off-by: Guan Jing guanjing6@huawei.com
---
 tools/lib/bpf/libbpf.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index d683d1bcc0f4..41697c6274b9 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -121,6 +121,7 @@ static const char * const attach_type_name[] = { [BPF_TCX_INGRESS] = "tcx_ingress", [BPF_TCX_EGRESS] = "tcx_egress", [BPF_TRACE_UPROBE_MULTI] = "trace_uprobe_multi", + [BPF_SCHED] = "sched", };
static const char * const link_type_name[] = { @@ -209,6 +210,7 @@ static const char * const prog_type_name[] = { [BPF_PROG_TYPE_SK_LOOKUP] = "sk_lookup", [BPF_PROG_TYPE_SYSCALL] = "syscall", [BPF_PROG_TYPE_NETFILTER] = "netfilter", + [BPF_PROG_TYPE_SCHED] = "sched", };
static int __base_pr(enum libbpf_print_level level, const char *format,
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Add helper functions to get CPU statistics; this patch exposes several kinds of nr_running statistics.
With CPU statistics along different dimensions, a BPF program can implement specific scheduling policies.
Signed-off-by: Chen Hui judy.chenhui@huawei.com
Signed-off-by: Hui Tang tanghui20@huawei.com
Signed-off-by: Ren Zhijie renzhijie2@huawei.com
Signed-off-by: Guan Jing guanjing6@huawei.com
---
 include/linux/sched.h          | 11 ++++++++++
 include/uapi/linux/bpf.h       |  7 +++++++
 kernel/sched/bpf_sched.c       | 38 ++++++++++++++++++++++++++++++++++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h |  7 +++++++
 5 files changed, 65 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index e6c13283015f..c968bd562a9f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2470,6 +2470,17 @@ static inline void rseq_syscall(struct pt_regs *regs)
#ifdef CONFIG_BPF_SCHED extern void sched_settag(struct task_struct *tsk, s64 tag); + +struct bpf_sched_cpu_stats { + /* nr_running */ + unsigned int nr_running; + unsigned int cfs_nr_running; + unsigned int cfs_h_nr_running; + unsigned int cfs_idle_h_nr_running; + unsigned int rt_nr_running; + unsigned int rr_nr_running; +}; + #endif
#endif diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 9dd0b85549b6..87914a0fc2e3 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5644,6 +5644,12 @@ union bpf_attr { * 0 on success. * * **-ENOENT** if the bpf_local_storage cannot be found. + * + * int bpf_sched_cpu_stats_of(int cpu, struct bpf_sched_cpu_stats *ctx, int len) + * Description + * Get multiple types of *cpu* statistics and store in *ctx*. + * Return + * 0 on success, or a negative error in case of failure. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) \ FN(unspec, 0, ##ctx) \ @@ -5858,6 +5864,7 @@ union bpf_attr { FN(user_ringbuf_drain, 209, ##ctx) \ FN(cgrp_storage_get, 210, ##ctx) \ FN(cgrp_storage_delete, 211, ##ctx) \ + FN(sched_cpu_stats_of, 212, ##ctx) \ /* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't diff --git a/kernel/sched/bpf_sched.c b/kernel/sched/bpf_sched.c index e2525bd60abf..1ddff44b6a93 100644 --- a/kernel/sched/bpf_sched.c +++ b/kernel/sched/bpf_sched.c @@ -44,12 +44,50 @@ int bpf_sched_verify_prog(struct bpf_verifier_log *vlog, return 0; }
+BPF_CALL_3(bpf_sched_cpu_stats_of, int *, cpuid, + struct bpf_sched_cpu_stats *, ctx, + int, len) +{ + struct rq *rq; + int cpu = *cpuid; + + if ((unsigned int)cpu >= nr_cpu_ids) { + memset(ctx, 0, len); + return -EINVAL; + } + + rq = cpu_rq(cpu); + memset(ctx, 0, len); + + SCHED_WARN_ON(!rcu_read_lock_held()); + /* nr_running */ + ctx->nr_running = rq->nr_running; + ctx->cfs_nr_running = rq->cfs.nr_running; + ctx->cfs_h_nr_running = rq->cfs.h_nr_running; + ctx->cfs_idle_h_nr_running = rq->cfs.idle_h_nr_running; + ctx->rt_nr_running = rq->rt.rt_nr_running; + ctx->rr_nr_running = rq->rt.rr_nr_running; + + return 0; +} + +static const struct bpf_func_proto bpf_sched_cpu_stats_of_proto = { + .func = bpf_sched_cpu_stats_of, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_INT, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE, +}; + static const struct bpf_func_proto * bpf_sched_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_trace_printk: return bpf_get_trace_printk_proto(); + case BPF_FUNC_sched_cpu_stats_of: + return &bpf_sched_cpu_stats_of_proto; default: return bpf_base_func_proto(func_id); } diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py index 61b7dddedc46..fd0c5f5d25bd 100755 --- a/scripts/bpf_doc.py +++ b/scripts/bpf_doc.py @@ -700,6 +700,7 @@ class PrinterHelpers(Printer): 'struct bpf_dynptr', 'struct iphdr', 'struct ipv6hdr', + 'struct bpf_sched_cpu_stats', ] known_types = { '...', @@ -755,6 +756,7 @@ class PrinterHelpers(Printer): 'const struct bpf_dynptr', 'struct iphdr', 'struct ipv6hdr', + 'struct bpf_sched_cpu_stats', } mapped_types = { 'u8': '__u8', diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9dd0b85549b6..87914a0fc2e3 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5644,6 +5644,12 @@ union bpf_attr { * 0 on success. * * **-ENOENT** if the bpf_local_storage cannot be found. 
+ * + * int bpf_sched_cpu_stats_of(int cpu, struct bpf_sched_cpu_stats *ctx, int len) + * Description + * Get multiple types of *cpu* statistics and store in *ctx*. + * Return + * 0 on success, or a negative error in case of failure. */ #define ___BPF_FUNC_MAPPER(FN, ctx...) \ FN(unspec, 0, ##ctx) \ @@ -5858,6 +5864,7 @@ union bpf_attr { FN(user_ringbuf_drain, 209, ##ctx) \ FN(cgrp_storage_get, 210, ##ctx) \ FN(cgrp_storage_delete, 211, ##ctx) \ + FN(sched_cpu_stats_of, 212, ##ctx) \ /* */
/* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
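As an illustration of how a policy might consume the statistics that bpf_sched_cpu_stats_of() fills in, here is a minimal userspace sketch. It is not BPF code and does not call the helper. The struct mirrors the fields added to struct bpf_sched_cpu_stats in this patch; the names cpu_stats_model and pick_least_loaded_cpu() are hypothetical, and ranking CPUs by cfs_h_nr_running is only one possible policy.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mirror of the fields of struct bpf_sched_cpu_stats (hypothetical name). */
struct cpu_stats_model {
	unsigned int nr_running;
	unsigned int cfs_nr_running;
	unsigned int cfs_h_nr_running;
	unsigned int cfs_idle_h_nr_running;
	unsigned int rt_nr_running;
	unsigned int rr_nr_running;
};

/*
 * Hypothetical policy: return the index of the CPU with the fewest
 * hierarchical CFS runnable tasks. Ties go to the lowest CPU index.
 * In a real sched BPF program each entry would be filled in by
 * bpf_sched_cpu_stats_of(); here the caller supplies the array.
 */
static int pick_least_loaded_cpu(const struct cpu_stats_model *stats, int nr_cpus)
{
	int best = 0;

	assert(stats != NULL && nr_cpus > 0);

	for (int cpu = 1; cpu < nr_cpus; cpu++) {
		if (stats[cpu].cfs_h_nr_running < stats[best].cfs_h_nr_running)
			best = cpu;
	}
	return best;
}
```

A policy could equally weigh rt_nr_running or the total nr_running; the sketch only shows the shape of the comparison.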
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Add a sched-type hook, 'cfs_select_rq', in select_task_rq_fair(). It can replace the original core selection policy or implement dynamic CPU affinity.
Signed-off-by: Chen Hui judy.chenhui@huawei.com
Signed-off-by: Hui Tang tanghui20@huawei.com
Signed-off-by: Guan Jing guanjing6@huawei.com
---
 include/linux/sched.h           | 12 ++++++++++++
 include/linux/sched_hook_defs.h |  2 +-
 kernel/sched/core.c             | 15 +++++++++++++++
 kernel/sched/fair.c             | 28 ++++++++++++++++++++++++++++
 kernel/sched/sched.h            |  4 ++++
 scripts/bpf_doc.py              |  2 ++
 6 files changed, 62 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index c968bd562a9f..94e6cbb056fd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2481,6 +2481,18 @@ struct bpf_sched_cpu_stats { unsigned int rr_nr_running; };
+struct sched_migrate_ctx { + struct task_struct *task; + struct cpumask *select_idle_mask; + int prev_cpu; + int curr_cpu; + int is_sync; + int want_affine; + int wake_flags; + int sd_flag; + int new_cpu; +}; + #endif
#endif diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h index 14344004e335..0e91209826a1 100644 --- a/include/linux/sched_hook_defs.h +++ b/include/linux/sched_hook_defs.h @@ -1,2 +1,2 @@ /* SPDX-License-Identifier: GPL-2.0 */ -BPF_SCHED_HOOK(int, 0, dummy, void) +BPF_SCHED_HOOK(int, -1, cfs_select_rq, struct sched_migrate_ctx *ctx) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5f2fe9d54c2a..e7fd05db31a4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2468,7 +2468,11 @@ static inline bool rq_has_pinned_tasks(struct rq *rq) * Per-CPU kthreads are allowed to run on !active && online CPUs, see * __set_cpus_allowed_ptr() and select_fallback_rq(). */ +#ifdef CONFIG_BPF_SCHED +inline bool is_cpu_allowed(struct task_struct *p, int cpu) +#else static inline bool is_cpu_allowed(struct task_struct *p, int cpu) +#endif { /* When not in the task's cpumask, no point in looking further. */ if (!cpumask_test_cpu(cpu, p->cpus_ptr)) @@ -9955,6 +9959,10 @@ LIST_HEAD(task_groups); static struct kmem_cache *task_group_cache __read_mostly; #endif
+#ifdef CONFIG_BPF_SCHED +DECLARE_PER_CPU(cpumask_var_t, select_idle_mask); +#endif + void __init sched_init(void) { unsigned long ptr = 0; @@ -10010,6 +10018,13 @@ void __init sched_init(void) global_rt_period(), global_rt_runtime()); #endif /* CONFIG_RT_GROUP_SCHED */
+#if defined(CONFIG_CPUMASK_OFFSTACK) && defined(CONFIG_BPF_SCHED) + for_each_possible_cpu(i) { + per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node( + cpumask_size(), GFP_KERNEL, cpu_to_node(i)); + } +#endif + #ifdef CONFIG_CGROUP_SCHED task_group_cache = KMEM_CACHE(task_group, 0);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 318258ea011e..195728e36a1d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -52,6 +52,7 @@ #include <asm/switch_to.h>
#include <linux/sched/cond_resched.h> +#include <linux/bpf_sched.h>
#include "sched.h" #include "stats.h" @@ -99,6 +100,10 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+#ifdef CONFIG_BPF_SCHED +DEFINE_PER_CPU(cpumask_var_t, select_idle_mask); +#endif + int sched_thermal_decay_shift; static int __init setup_sched_thermal_decay_shift(char *str) { @@ -8441,6 +8446,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) #ifdef CONFIG_QOS_SCHED_DYNAMIC_AFFINITY int idlest_cpu = -1; #endif +#ifdef CONFIG_BPF_SCHED + struct sched_migrate_ctx ctx; + int ret; +#endif
time = schedstat_start_time();
@@ -8475,6 +8484,25 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) }
rcu_read_lock(); +#ifdef CONFIG_BPF_SCHED + if (bpf_sched_enabled()) { + ctx.task = p; + ctx.prev_cpu = prev_cpu; + ctx.curr_cpu = cpu; + ctx.is_sync = sync; + ctx.wake_flags = wake_flags; + ctx.want_affine = want_affine; + ctx.sd_flag = sd_flag; + ctx.select_idle_mask = this_cpu_cpumask_var_ptr(select_idle_mask); + + ret = bpf_sched_cfs_select_rq(&ctx); + if (ret >= 0 && is_cpu_allowed(p, ret)) { + rcu_read_unlock(); + return ret; + } + } +#endif + for_each_domain(cpu, tmp) { /* * If both 'cpu' and 'prev_cpu' are part of this domain, diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 05a7f09f2bba..830087ca204c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3597,4 +3597,8 @@ static inline void init_sched_mm_cid(struct task_struct *t) { } extern u64 avg_vruntime(struct cfs_rq *cfs_rq); extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+#ifdef CONFIG_BPF_SCHED +inline bool is_cpu_allowed(struct task_struct *p, int cpu); +#endif + #endif /* _KERNEL_SCHED_SCHED_H */ diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py index fd0c5f5d25bd..359373bc8dab 100755 --- a/scripts/bpf_doc.py +++ b/scripts/bpf_doc.py @@ -701,6 +701,7 @@ class PrinterHelpers(Printer): 'struct iphdr', 'struct ipv6hdr', 'struct bpf_sched_cpu_stats', + 'struct sched_migrate_ctx', ] known_types = { '...', @@ -757,6 +758,7 @@ class PrinterHelpers(Printer): 'struct iphdr', 'struct ipv6hdr', 'struct bpf_sched_cpu_stats', + 'struct sched_migrate_ctx', } mapped_types = { 'u8': '__u8',
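The way select_task_rq_fair() treats the return value of the cfs_select_rq hook in the patch above can be modeled in plain userspace C. This is a sketch under stated assumptions, not kernel code: cpu_allowed_model() is a hypothetical stand-in for is_cpu_allowed() that only tests a bitmask (the kernel function checks more than the task's cpumask), and resolve_selected_cpu() and default_cpu are illustrative names for the fall-through to the regular selection path.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical stand-in for is_cpu_allowed(): bit N of allowed_mask set
 * means CPU N is permitted. Out-of-range CPUs are never allowed.
 */
static int cpu_allowed_model(unsigned long allowed_mask, int cpu)
{
	if (cpu < 0 || cpu >= (int)(sizeof(allowed_mask) * CHAR_BIT))
		return 0;
	return (int)((allowed_mask >> cpu) & 1UL);
}

/*
 * Mirrors the check 'ret >= 0 && is_cpu_allowed(p, ret)' from the hook
 * site: a non-negative, allowed CPU returned by the BPF program is used
 * directly; anything else (including the hook's default of -1) falls
 * back to the CPU the regular path would have picked.
 */
static int resolve_selected_cpu(int hook_ret, unsigned long allowed_mask,
				int default_cpu)
{
	assert(default_cpu >= 0);

	if (hook_ret >= 0 && cpu_allowed_model(allowed_mask, hook_ret))
		return hook_ret;
	return default_cpu;
}
```

With an allowed mask of 0x5 (CPUs 0 and 2), a hook result of 2 is honored, while 1 or -1 falls back to the default choice.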
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIT1
--------------------------------
Add a sched-type hook, 'cfs_can_migrate_task', in can_migrate_task(). It decides whether the task can be migrated to dst_cpu.
Signed-off-by: Guan Jing guanjing6@huawei.com
---
 include/linux/sched.h           |  7 +++++++
 include/linux/sched_hook_defs.h |  2 ++
 kernel/sched/fair.c             | 17 +++++++++++++++++
 3 files changed, 26 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 94e6cbb056fd..d250a779059d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2493,6 +2493,13 @@ struct sched_migrate_ctx {
 	int new_cpu;
 };
 
+struct sched_migrate_node {
+	int src_cpu;
+	int src_node;
+	int dst_cpu;
+	int dst_node;
+};
+
 #endif
 
 #endif
diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
index 0e91209826a1..c43297cc6049 100644
--- a/include/linux/sched_hook_defs.h
+++ b/include/linux/sched_hook_defs.h
@@ -1,2 +1,4 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 BPF_SCHED_HOOK(int, -1, cfs_select_rq, struct sched_migrate_ctx *ctx)
+BPF_SCHED_HOOK(int, -1, cfs_can_migrate_task, struct task_struct *p,
+	       struct sched_migrate_node *migrate_node)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 195728e36a1d..e37590f6ed9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9847,9 +9847,26 @@ static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
+#ifdef CONFIG_BPF_SCHED
+	struct sched_migrate_node migrate_node;
+	int ret;
+#endif
 
 	lockdep_assert_rq_held(env->src_rq);
 
+#ifdef CONFIG_BPF_SCHED
+	if (bpf_sched_enabled()) {
+		migrate_node.src_cpu = env->src_cpu;
+		migrate_node.src_node = cpu_to_node(env->src_cpu);
+		migrate_node.dst_cpu = env->dst_cpu;
+		migrate_node.dst_node = cpu_to_node(env->dst_cpu);
+
+		ret = bpf_sched_cfs_can_migrate_task(p, &migrate_node);
+		if (!ret)
+			return ret;
+	}
+#endif
+
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) throttled_lb_pair, or
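The return-value convention of the cfs_can_migrate_task hook in the patch above is easy to misread, so here is a small userspace model of it. It is only a sketch: migrate_node_model mirrors the fields of struct sched_migrate_node, while numa_local_policy() and can_migrate_model() are hypothetical names, and the "keep tasks on their NUMA node" rule is an invented example policy rather than anything the series ships.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mirror of struct sched_migrate_node (hypothetical name). */
struct migrate_node_model {
	int src_cpu;
	int src_node;
	int dst_cpu;
	int dst_node;
};

/*
 * Hypothetical policy: veto migrations that cross NUMA nodes by
 * returning 0, otherwise return -1 (the hook's default, "no opinion").
 */
static int numa_local_policy(const struct migrate_node_model *n)
{
	assert(n != NULL);
	return (n->src_node != n->dst_node) ? 0 : -1;
}

/*
 * Mirrors 'if (!ret) return ret;' at the hook site: only a return value
 * of 0 short-circuits can_migrate_task() and refuses the migration.
 * Any non-zero value leaves the decision to the existing checks, which
 * are represented here by default_decision.
 */
static int can_migrate_model(int hook_ret, int default_decision)
{
	if (hook_ret == 0)
		return 0;
	return default_decision;
}
```

Note that, as modeled, a positive hook result does not force a migration; it simply defers to the regular can_migrate_task() checks.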
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/3890 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/T...