[PATCH OLK-6.6 00/39] High Performance Container Resource View Isolation -- Stage 2

Common supports:
  bpf-rvi: cpuset: Fix missing return for !tsk in task_effective_cpumask()
  bpf-rvi: memcg: Add bpf_mem_cgroup_from_task() kfunc
  bpf-rvi: cgroup: Add cgroup_rstat_flush_atomic() kfunc
  bpf-rvi: proc: Add bpf_get_{idle,iowait}_time kfunc
  bpf-rvi: cpuacct: Add bpf_cpuacct_kcpustat_cpu_fetch kfunc
  bpf-rvi: cpuacct: Add task_cpuacct()

Cpuinfo_arm64:
  bpf-rvi: arm64: Add bpf_arm64_cpu_have_feature() kfunc
  bpf-rvi: arm64: Add cpuinfo_arm64 iterator target
  bpf-rvi: Add bpf_arch_flags kfunc for arm64
  samples/bpf: Add iterator program for cpuinfo_arm64

Diskstats:
  bpf-rvi: block: Add diskstats iterator target
  bpf-rvi: blk-cgroup: Add bpf_blkcg_get_dev_iostat() kfunc
  samples/bpf: Add iterator program for diskstats

Partitions:
  bpf-rvi: block: Add partitions iterator target
  bpf-rvi: block: Look up /dev in reaper's fs->root and filter partitions
  samples/bpf: Add iterator program for partitions

Loadavg:
  bpf-rvi: pidns: Calculate loadavg for each pid namespace
  bpf-rvi: pidns: Add for_each_task_in_pidns and loadavg-related kfuncs
  samples/bpf: Add iterator program for loadavg

Uptime:
  bpf-rvi: cpuacct: Add bpf_task_ca_cpuusage() kfunc
  samples/bpf: Add iterator program for uptime

Swaps:
  bpf-rvi: Add bpf_si_memswinfo() kfunc
  bpf-rvi: Add bpf_page_counter_read() kfunc
  samples/bpf: Add iterator program for swaps

Stat:
  bpf-rvi: Add bpf_seq_file_append() kfunc
  bpf-rvi: Add stat-related misc kfuncs
  bpf-rvi: Add cpu runqueue related kfuncs
  bpf-rvi: Add kstat_ & kcpustat_ kfuncs
  bpf-rvi: stat: Add stat iterator target
  samples/bpf: Add iterator program for stat

Meminfo:
  bpf-rvi: x86: Add bpf_mem_direct_map kfunc
  bpf-rvi: proc/meminfo: Add bpf_mem_* kfunc
  bpf-rvi: cma: Add bpf_mem_{total,free}cma kfunc
  bpf-rvi: hugetlb: Add bpf_hugetlb_report_meminfo kfunc
  bpf-rvi: mm/memory-failure: Add bpf_mem_failure kfunc
  bpf-rvi: mm/percpu: Add bpf_mem_percpu kfunc
  bpf-rvi: mm/util: Add bpf_mem_commit* kfunc
  bpf-rvi: mm/vmalloc: Add bpf_mem_vmalloc_{used,total} kfunc
  samples/bpf: Add iterator program for meminfo

---

GONG Ruiqi (31):
  bpf-rvi: cpuset: Fix missing return for !tsk in task_effective_cpumask()
  bpf-rvi: memcg: Add bpf_mem_cgroup_from_task() kfunc
  bpf-rvi: cgroup: Add cgroup_rstat_flush_atomic() kfunc
  bpf-rvi: proc: Add bpf_get_{idle,iowait}_time kfunc
  bpf-rvi: cpuacct: Add bpf_cpuacct_kcpustat_cpu_fetch kfunc
  bpf-rvi: cpuacct: Add task_cpuacct()
  bpf-rvi: arm64: Add bpf_arm64_cpu_have_feature() kfunc
  bpf-rvi: arm64: Add cpuinfo_arm64 iterator target
  bpf-rvi: Add bpf_arch_flags kfunc for arm64
  samples/bpf: Add iterator program for cpuinfo_arm64
  bpf-rvi: block: Add diskstats iterator target
  bpf-rvi: blk-cgroup: Add bpf_blkcg_get_dev_iostat() kfunc
  samples/bpf: Add iterator program for diskstats
  bpf-rvi: block: Add partitions iterator target
  bpf-rvi: block: Look up /dev in reaper's fs->root and filter partitions
  samples/bpf: Add iterator program for partitions
  bpf-rvi: pidns: Calculate loadavg for each pid namespace
  bpf-rvi: pidns: Add for_each_task_in_pidns and loadavg-related kfuncs
  samples/bpf: Add iterator program for loadavg
  bpf-rvi: cpuacct: Add bpf_task_ca_cpuusage() kfunc
  samples/bpf: Add iterator program for uptime
  bpf-rvi: Add bpf_si_memswinfo() kfunc
  bpf-rvi: Add bpf_page_counter_read() kfunc
  samples/bpf: Add iterator program for swaps
  bpf-rvi: Add bpf_seq_file_append() kfunc
  bpf-rvi: Add stat-related misc kfuncs
  bpf-rvi: Add cpu runqueue related kfuncs
  bpf-rvi: Add kstat_ & kcpustat_ kfuncs
  bpf-rvi: stat: Add stat iterator target
  samples/bpf: Add iterator program for stat
  samples/bpf: Add iterator program for meminfo

Gu Bowen (8):
  bpf-rvi: x86: Add bpf_mem_direct_map kfunc
  bpf-rvi: proc/meminfo: Add bpf_mem_* kfunc
  bpf-rvi: cma: Add bpf_mem_{total,free}cma kfunc
  bpf-rvi: hugetlb: Add bpf_hugetlb_report_meminfo kfunc
  bpf-rvi: mm/memory-failure: Add bpf_mem_failure kfunc
  bpf-rvi: mm/percpu: Add bpf_mem_percpu kfunc
  bpf-rvi: mm/util: Add bpf_mem_commit* kfunc
  bpf-rvi: mm/vmalloc: Add bpf_mem_vmalloc_{used,total} kfunc

 arch/arm64/kernel/Makefile                    |   1 +
 arch/arm64/kernel/bpf-rvi.c                   | 212 +++++++++
 arch/arm64/kernel/cpufeature.c                |  28 ++
 arch/arm64/kernel/cpuinfo.c                   | 128 +-----
 arch/arm64/kernel/hwcap_str.h                 | 131 ++++++
 arch/x86/mm/pat/set_memory.c                  |  33 ++
 block/Kconfig.iosched                         |   3 +
 block/blk-cgroup.c                            | 159 +++++++
 block/genhd.c                                 | 432 ++++++++++++++++--
 fs/proc/meminfo.c                             |  40 ++
 fs/proc/stat.c                                | 204 +++++++++
 include/linux/cgroup.h                        |  12 +
 include/linux/pid.h                           |   5 +
 include/linux/pid_namespace.h                 |  29 ++
 kernel/bpf-rvi/Kconfig                        |   2 +
 kernel/bpf-rvi/Makefile                       |   2 +-
 kernel/bpf-rvi/common_kfuncs.c                | 136 ++++++
 kernel/bpf-rvi/generic_single_iter.c          |   3 +
 kernel/bpf/helpers.c                          |  27 ++
 kernel/cgroup/cpuset.c                        |   4 +-
 kernel/cgroup/rstat.c                         |  18 +
 kernel/pid.c                                  |  10 +
 kernel/pid_namespace.c                        | 142 ++++++
 kernel/sched/cpuacct.c                        |  43 ++
 mm/cma.c                                      |  35 ++
 mm/hugetlb.c                                  |  57 +++
 mm/memcontrol.c                               |  30 +-
 mm/memory-failure.c                           |  28 ++
 mm/percpu.c                                   |  27 ++
 mm/util.c                                     |  32 ++
 mm/vmalloc.c                                  |  33 ++
 samples/bpf/Makefile                          |  10 +
 samples/bpf/bpf_rvi_cpuinfo_arm64.bpf.c       | 112 +++++
 samples/bpf/bpf_rvi_diskstats.bpf.c           | 295 ++++++++++++
 samples/bpf/bpf_rvi_loadavg.bpf.c             |  60 +++
 samples/bpf/bpf_rvi_meminfo.bpf.c             | 239 ++++++++++
 samples/bpf/bpf_rvi_partitions.bpf.c          |  42 ++
 samples/bpf/bpf_rvi_stat.bpf.c                | 220 +++++++++
 samples/bpf/bpf_rvi_swaps.bpf.c               | 104 +++++
 samples/bpf/bpf_rvi_uptime.bpf.c              | 122 +++++
 .../selftests/bpf/progs/proc_iter_common.h    |   0
 41 files changed, 3084 insertions(+), 166 deletions(-)
 create mode 100644 arch/arm64/kernel/bpf-rvi.c
 create mode 100644 arch/arm64/kernel/hwcap_str.h
 create mode 100644 kernel/bpf-rvi/common_kfuncs.c
 create mode 100644 samples/bpf/bpf_rvi_cpuinfo_arm64.bpf.c
 create mode 100644 samples/bpf/bpf_rvi_diskstats.bpf.c
 create mode 100644 samples/bpf/bpf_rvi_loadavg.bpf.c
 create mode 100644 samples/bpf/bpf_rvi_meminfo.bpf.c
 create mode 100644 samples/bpf/bpf_rvi_partitions.bpf.c
 create mode 100644 samples/bpf/bpf_rvi_stat.bpf.c
 create mode 100644 samples/bpf/bpf_rvi_swaps.bpf.c
 create mode 100644 samples/bpf/bpf_rvi_uptime.bpf.c
 create mode 100644 tools/testing/selftests/bpf/progs/proc_iter_common.h

-- 
2.25.1
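For context on how these pieces fit together at runtime: each sample
program attaches to its bpf iter target, and the resulting iterator is
read like a regular file, typically by a container-runtime helper that
pins the link into bpffs and bind-mounts it over the corresponding
/proc file. The following is a minimal userspace sketch of that
consumption flow, using only standard libbpf APIs; the object file name
is illustrative and not taken from this series:

#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
        /* Illustrative object name; any sample in this series fits. */
        struct bpf_object *obj = bpf_object__open_file("bpf_rvi_loadavg.bpf.o", NULL);
        struct bpf_program *prog;
        struct bpf_link *link;
        char buf[4096];
        ssize_t n;
        int iter_fd;

        if (!obj || bpf_object__load(obj))
                return 1;
        prog = bpf_object__next_program(obj, NULL);
        link = bpf_program__attach_iter(prog, NULL);    /* attach to the iter target */
        if (!link)
                return 1;
        iter_fd = bpf_iter_create(bpf_link__fd(link));  /* one read session */
        if (iter_fd < 0)
                return 1;
        while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
                fwrite(buf, 1, n, stdout);              /* the /proc-style text */
        close(iter_fd);
        bpf_link__destroy(link);
        bpf_object__close(obj);
        return 0;
}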

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

As the title says: in the tsk == NULL case, task_effective_cpumask()
should return right after clearing pmask to all zeros. Otherwise a NULL
pointer dereference may occur.

Fixes: 0ea52fdbeffa ("cpuset: Add task_effective_cpumask()")
Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 kernel/cgroup/cpuset.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 417827f2c043..ccf74e7cb33f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -5212,8 +5212,10 @@ void task_effective_cpumask(struct task_struct *tsk, struct cpumask *pmask)
 {
        struct cpuset *cs;
 
-       if (!tsk)
+       if (!tsk) {
                cpumask_clear(pmask);
+               return;
+       }
 
        rcu_read_lock();
        cs = task_cs(tsk);
-- 
2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add bpf_mem_cgroup_from_task(), a kfunc that returns the memory cgroup
a task belongs to.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 mm/memcontrol.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b3609b71cbe8..d388ca2bdbf1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,12 +78,17 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "swap.h"
-
 #include <linux/uaccess.h>
 #include <trace/events/vmscan.h>
 #include <linux/ksm.h>
 
+#ifdef CONFIG_BPF_RVI
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#endif
+
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
@@ -8870,6 +8875,29 @@ static __init int mem_cgroup_sysctls_init(void)
 }
 #endif
 
+#ifdef CONFIG_BPF_RVI
+__bpf_kfunc struct mem_cgroup *bpf_mem_cgroup_from_task(struct task_struct *p)
+{
+       return mem_cgroup_from_task(p);
+}
+
+BTF_SET8_START(bpf_memcg_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_from_task, KF_RET_NULL | KF_RCU)
+BTF_SET8_END(bpf_memcg_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_memcg_kfunc_set = {
+       .owner = THIS_MODULE,
+       .set = &bpf_memcg_kfunc_ids,
+};
+
+static int __init bpf_memcg_kfunc_init(void)
+{
+       return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+                                        &bpf_memcg_kfunc_set);
+}
+late_initcall(bpf_memcg_kfunc_init);
+#endif
+
 static int __init cgroup_memory(char *s)
 {
        char *token;
-- 
2.25.1
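For illustration, a sketch of how a BPF program might consume this
kfunc (a hypothetical program, not part of this series): KF_RET_NULL
obliges the caller to NULL-check the result, and KF_RCU requires an
RCU-trusted task pointer such as the one an iter/task context provides.
The printed field is arbitrary.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern struct mem_cgroup *bpf_mem_cgroup_from_task(struct task_struct *p) __ksym;

SEC("iter/task")
int dump_task_memcg(struct bpf_iter__task *ctx)
{
        struct task_struct *task = ctx->task;
        struct mem_cgroup *memcg;

        if (!task)
                return 0;
        memcg = bpf_mem_cgroup_from_task(task);
        if (!memcg)     /* KF_RET_NULL: must be checked */
                return 0;
        /* Print the pid and the memcg's kernfs node id */
        BPF_SEQ_PRINTF(ctx->meta->seq, "%d %llu\n", task->pid,
                       memcg->css.cgroup->kn->id);
        return 0;
}

char _license[] SEC("license") = "GPL";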

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Restore cgroup_rstat_flush_atomic(), which was removed by commit
0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"), and expose
it to bpf. This function is basically the same as cgroup_rstat_flush();
the latter is sleepable only in the PREEMPT_RT case, which is not a
scenario supported by this feature.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 include/linux/cgroup.h |  3 +++
 kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 29fb4556d123..48f0e9dcd3a1 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -693,6 +693,9 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
+#if defined(CONFIG_BPF_RVI) && !defined(CONFIG_PREEMPT_RT)
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
+#endif
 void cgroup_rstat_flush_release(void);
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c32439b855f5..329c10c084a2 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -239,6 +239,21 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
        spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+#if defined(CONFIG_BPF_RVI) && !defined(CONFIG_PREEMPT_RT)
+/**
+ * cgroup_rstat_flush_atomic - atomic version of cgroup_rstat_flush()
+ * @cgrp: target cgroup
+ *
+ * This function can be called from any context.
+ */
+__bpf_kfunc void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
+{
+       spin_lock_irq(&cgroup_rstat_lock);
+       cgroup_rstat_flush_locked(cgrp);
+       spin_unlock_irq(&cgroup_rstat_lock);
+}
+#endif
+
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -525,6 +540,9 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 BTF_SET8_START(bpf_rstat_kfunc_ids)
 BTF_ID_FLAGS(func, cgroup_rstat_updated)
 BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE)
+#if defined(CONFIG_BPF_RVI) && !defined(CONFIG_PREEMPT_RT)
+BTF_ID_FLAGS(func, cgroup_rstat_flush_atomic)
+#endif
 BTF_SET8_END(bpf_rstat_kfunc_ids)
 
 static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
-- 
2.25.1
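A sketch of the intended usage from BPF (hypothetical helper, building
on the previous example): flush a memcg's rstat tree before reading its
counters, without going through the sleepable cgroup_rstat_flush().
Whether the verifier accepts the cgroup pointer here depends on the
trust of the task pointer it was derived from; the samples in this
series obtain it from iterator contexts.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern void cgroup_rstat_flush_atomic(struct cgroup *cgrp) __ksym;
extern struct mem_cgroup *bpf_mem_cgroup_from_task(struct task_struct *p) __ksym;

/* Flush the rstat tree of @task's memcg so that per-cgroup counters
 * read afterwards are up to date. For use inside a tracing/iter prog. */
static __always_inline void flush_task_memcg(struct task_struct *task)
{
        struct mem_cgroup *memcg = bpf_mem_cgroup_from_task(task);

        if (memcg)
                cgroup_rstat_flush_atomic(memcg->css.cgroup);
}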

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add kfuncs to get the idle and iowait time of a CPU.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 fs/proc/stat.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index da60956b2915..9b58e9ded6bf 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -14,6 +14,11 @@
 #include <linux/irqnr.h>
 #include <linux/sched/cputime.h>
 #include <linux/tick.h>
+#ifdef CONFIG_BPF_RVI
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#endif
 
 #ifndef arch_irq_stat_cpu
 #define arch_irq_stat_cpu(cpu) 0
@@ -214,3 +219,32 @@ static int __init proc_stat_init(void)
        return 0;
 }
 fs_initcall(proc_stat_init);
+
+#ifdef CONFIG_BPF_RVI
+__bpf_kfunc u64 bpf_get_idle_time(struct kernel_cpustat *kcs, int cpu)
+{
+       return get_idle_time(kcs, cpu);
+}
+
+__bpf_kfunc u64 bpf_get_iowait_time(struct kernel_cpustat *kcs, int cpu)
+{
+       return get_iowait_time(kcs, cpu);
+}
+
+BTF_SET8_START(bpf_proc_stat_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_get_idle_time)
+BTF_ID_FLAGS(func, bpf_get_iowait_time)
+BTF_SET8_END(bpf_proc_stat_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_proc_stat_kfunc_set = {
+       .owner = THIS_MODULE,
+       .set = &bpf_proc_stat_kfunc_ids,
+};
+
+static int __init bpf_proc_stat_kfunc_init(void)
+{
+       return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+                                        &bpf_proc_stat_kfunc_set);
+}
+late_initcall(bpf_proc_stat_kfunc_init);
+#endif /* CONFIG_BPF_RVI */
-- 
2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add a kfunc to fetch the per-CPU cputime statistics of a cpuacct
cgroup.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 kernel/sched/cpuacct.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 6fee560173c6..edb37dfda54d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -7,6 +7,12 @@
  * (balbir@in.ibm.com).
  */
 
+#ifdef CONFIG_BPF_RVI
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#endif
+
 /* Time spent by the tasks of the CPU accounting group executing in ... */
 enum cpuacct_stat_index {
        CPUACCT_STAT_USER,      /* ... user mode */
@@ -401,3 +407,27 @@ static int __init cgroup_v1_ifs_init(void)
 }
 late_initcall_sync(cgroup_v1_ifs_init);
 #endif
+
+#ifdef CONFIG_BPF_RVI
+__bpf_kfunc void bpf_cpuacct_kcpustat_cpu_fetch(struct kernel_cpustat *dst,
+                                               struct cpuacct *ca, int cpu)
+{
+       memcpy(dst, per_cpu_ptr(ca->cpustat, cpu), sizeof(struct kernel_cpustat));
+}
+
+BTF_SET8_START(bpf_cpuacct_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_cpuacct_kcpustat_cpu_fetch)
+BTF_SET8_END(bpf_cpuacct_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_cpuacct_kfunc_set = {
+       .owner = THIS_MODULE,
+       .set = &bpf_cpuacct_kfunc_ids,
+};
+
+static int __init bpf_cpuacct_kfunc_init(void)
+{
+       return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+                                        &bpf_cpuacct_kfunc_set);
+}
+late_initcall(bpf_cpuacct_kfunc_init);
+#endif
-- 
2.25.1
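A sketch of how this kfunc combines with the bpf_get_idle_time() kfunc
from the previous patch (hypothetical helper; how the program obtains
the cpuacct pointer is left to the caller, e.g. via a later kfunc in
this series, and the stack-allocated kernel_cpustat mirrors what the
stat sample later in the series does):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern void bpf_cpuacct_kcpustat_cpu_fetch(struct kernel_cpustat *dst,
                                           struct cpuacct *ca, int cpu) __ksym;
extern u64 bpf_get_idle_time(struct kernel_cpustat *kcs, int cpu) __ksym;

/* Print one CPU's user and idle time for @ca into the iterator's
 * seq_file. */
static void print_one_cpu(struct seq_file *m, struct cpuacct *ca, int cpu)
{
        struct kernel_cpustat kcs = {};

        bpf_cpuacct_kcpustat_cpu_fetch(&kcs, ca, cpu);
        BPF_SEQ_PRINTF(m, "cpu%d user %llu idle %llu\n", cpu,
                       kcs.cpustat[CPUTIME_USER], bpf_get_idle_time(&kcs, cpu));
}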

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add a helper to get the cpuacct cgroup of a task.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 include/linux/cgroup.h | 9 +++++++++
 kernel/sched/cpuacct.c | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 48f0e9dcd3a1..453beda8bc9a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -704,10 +704,19 @@ void cgroup_rstat_flush_release(void);
 #ifdef CONFIG_CGROUP_CPUACCT
 void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+#ifdef CONFIG_BPF_RVI
+struct cpuacct *task_cpuacct(struct task_struct *tsk);
+#endif
 #else
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 static inline void cpuacct_account_field(struct task_struct *tsk, int index,
                                         u64 val) {}
+#ifdef CONFIG_BPF_RVI
+static inline struct cpuacct *task_cpuacct(struct task_struct *tsk)
+{
+       return NULL;
+}
+#endif
 #endif
 
 void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index edb37dfda54d..3d3d12b60572 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -409,6 +409,11 @@ late_initcall_sync(cgroup_v1_ifs_init);
 #endif
 
 #ifdef CONFIG_BPF_RVI
+struct cpuacct *task_cpuacct(struct task_struct *tsk)
+{
+       return tsk ? task_ca(tsk) : NULL;
+}
+
 __bpf_kfunc void bpf_cpuacct_kcpustat_cpu_fetch(struct kernel_cpustat *dst,
                                                struct cpuacct *ca, int cpu)
 {
-- 
2.25.1
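On the kernel side, the helper is meant to be used in the same
reaper-relative pattern the iterator targets in this series follow. A
hypothetical caller, conceptually placed in kernel/sched/cpuacct.c so
that struct cpuacct's layout is visible (RCU shown because task_ca()
walks RCU-protected css pointers):

/* Read one CPU's cputime of the container's cpuacct, where the
 * container is identified by the current level-1 reaper. */
static void read_container_cpuacct(struct kernel_cpustat *dst, int cpu)
{
        struct task_struct *reaper = get_current_level1_reaper();
        struct cpuacct *ca;

        rcu_read_lock();
        ca = task_cpuacct(reaper ?: current);
        if (ca)
                memcpy(dst, per_cpu_ptr(ca->cpustat, cpu), sizeof(*dst));
        rcu_read_unlock();

        if (reaper)
                put_task_struct(reaper);
}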

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add a kfunc to query whether an arm64 CPU feature is present.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 arch/arm64/kernel/cpufeature.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index b13858668877..6a17507ac7e5 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -75,6 +75,11 @@
 #include <linux/cpu.h>
 #include <linux/kasan.h>
 #include <linux/percpu.h>
+#ifdef CONFIG_BPF_RVI
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#endif
 
 #include <asm/actlr.h>
 #include <asm/cpu.h>
@@ -3723,6 +3728,29 @@ bool cpu_have_feature(unsigned int num)
 }
 EXPORT_SYMBOL_GPL(cpu_have_feature);
 
+#ifdef CONFIG_BPF_RVI
+__bpf_kfunc bool bpf_arm64_cpu_have_feature(unsigned int num)
+{
+       return cpu_have_feature(num);
+}
+
+BTF_SET8_START(bpf_arm64_cpufeature_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_arm64_cpu_have_feature, KF_RCU)
+BTF_SET8_END(bpf_arm64_cpufeature_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_arm64_cpufeature_kfunc_set = {
+       .owner = THIS_MODULE,
+       .set = &bpf_arm64_cpufeature_kfunc_ids,
+};
+
+static int __init bpf_arm64_cpufeature_kfunc_init(void)
+{
+       return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+                                        &bpf_arm64_cpufeature_kfunc_set);
+}
+late_initcall(bpf_arm64_cpufeature_kfunc_init);
+#endif
+
 unsigned long cpu_get_elf_hwcap(void)
 {
        /*
-- 
2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

The v6.6 implementation of arm64 cpuinfo loops through all online CPUs
entirely within seq_show(), which isn't compatible with how bpf iter
works. Fortunately, commit 7bb797757bf5 ("arm64/cpuinfo: only show one
cpu's info in c_show()") from v6.16 refines this part of the code and
meets our requirement.

Create the bpf iter target for the cpuinfo interface and place all the
code in a new file, where CPUs are iterated via seq_{start,show,next}(),
while keeping a native_c_show() borrowed from the aforementioned commit.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 arch/arm64/kernel/Makefile    |   1 +
 arch/arm64/kernel/bpf-rvi.c   | 167 ++++++++++++++++++++++++++++
 arch/arm64/kernel/cpuinfo.c   | 128 +--------------------
 arch/arm64/kernel/hwcap_str.h | 131 ++++++++++++++++++++++
 4 files changed, 300 insertions(+), 127 deletions(-)
 create mode 100644 arch/arm64/kernel/bpf-rvi.c
 create mode 100644 arch/arm64/kernel/hwcap_str.h

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 4ce58887302a..87a7be72c95a 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -83,6 +83,7 @@ obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS)        += patch-scs.o
 obj-$(CONFIG_IPI_AS_NMI)               += ipi_nmi.o
 obj-$(CONFIG_HISI_VIRTCCA_GUEST)       += virtcca_cvm_guest.o virtcca_cvm_tsi.o
 obj-$(CONFIG_HISI_VIRTCCA_HOST)                += virtcca_cvm_host.o
+obj-$(CONFIG_BPF_RVI)                  += bpf-rvi.o
 CFLAGS_patch-scs.o                     += -mbranch-protection=none
 
 # Force dependency (vdso*-wrap.S includes vdso.so through incbin)
diff --git a/arch/arm64/kernel/bpf-rvi.c b/arch/arm64/kernel/bpf-rvi.c
new file mode 100644
index 000000000000..c6590704e3a6
--- /dev/null
+++ b/arch/arm64/kernel/bpf-rvi.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025 Huawei Technologies Co., Ltd */
+#include <asm/cpu.h>
+#include <asm/cputype.h>
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cpuset.h>
+#include <linux/pid_namespace.h>
+
+#include "hwcap_str.h"
+
+static int native_c_show(struct seq_file *m, void *v)
+{
+       int j;
+       int cpu = m->index;
+       bool aarch32 = personality(current->personality) == PER_LINUX32;
+       struct cpuinfo_arm64 *cpuinfo = v;
+       u32 midr = cpuinfo->reg_midr;
+
+       /*
+        * glibc reads /proc/cpuinfo to determine the number of
+        * online processors, looking for lines beginning with
+        * "processor". Give glibc what it expects.
+        */
+       seq_printf(m, "processor\t: %ld\n", cpu);
+       if (aarch32)
+               seq_printf(m, "model name\t: ARMv8 Processor rev %d (%s)\n",
+                          MIDR_REVISION(midr), COMPAT_ELF_PLATFORM);
+
+       seq_printf(m, "BogoMIPS\t: %lu.%02lu\n",
+                  loops_per_jiffy / (500000UL/HZ),
+                  loops_per_jiffy / (5000UL/HZ) % 100);
+
+       /*
+        * Dump out the common processor features in a single line.
+        * Userspace should read the hwcaps with getauxval(AT_HWCAP)
+        * rather than attempting to parse this, but there's a body of
+        * software which does already (at least for 32-bit).
+        */
+       seq_puts(m, "Features\t:");
+       if (aarch32) {
+#ifdef CONFIG_AARCH32_EL0
+               for (j = 0; j < ARRAY_SIZE(compat_hwcap_str); j++) {
+                       if (a32_elf_hwcap & (1 << j)) {
+                               /*
+                                * Warn once if any feature should not
+                                * have been present on arm64 platform.
+                                */
+                               if (WARN_ON_ONCE(!compat_hwcap_str[j]))
+                                       continue;
+
+                               seq_printf(m, " %s", compat_hwcap_str[j]);
+                       }
+               }
+
+               for (j = 0; j < ARRAY_SIZE(compat_hwcap2_str); j++)
+                       if (a32_elf_hwcap2 & (1 << j))
+                               seq_printf(m, " %s", compat_hwcap2_str[j]);
+#endif /* CONFIG_AARCH32_EL0 */
+       } else {
+               for (j = 0; j < ARRAY_SIZE(hwcap_str); j++)
+                       if (cpu_have_feature(j))
+                               seq_printf(m, " %s", hwcap_str[j]);
+       }
+       seq_puts(m, "\n");
+
+       seq_printf(m, "CPU implementer\t: 0x%02x\n", MIDR_IMPLEMENTOR(midr));
+       seq_printf(m, "CPU architecture: 8\n");
+       seq_printf(m, "CPU variant\t: 0x%x\n", MIDR_VARIANT(midr));
+       seq_printf(m, "CPU part\t: 0x%03x\n", MIDR_PARTNUM(midr));
+       seq_printf(m, "CPU revision\t: %d\n\n", MIDR_REVISION(midr));
+
+       return 0;
+}
+
+static void bpf_c_stop(struct seq_file *m, void *v)
+{
+}
+
+struct cpuinfo_arm64_seq_priv {
+       cpumask_t allowed_mask;
+};
+
+static void *bpf_c_start(struct seq_file *m, loff_t *pos)
+{
+       struct cpuinfo_arm64_seq_priv *priv = m->private;
+       struct task_struct *reaper = get_current_level1_reaper();
+
+       task_effective_cpumask(reaper ?: current, &priv->allowed_mask);
+       if (reaper)
+               put_task_struct(reaper);
+
+       /*
+        * DO NOT use cpumask_first() here: sys_read may start from somewhere
+        * in the middle of the file, and *pos may contain a value from the
+        * last read.
+        */
+       *pos = cpumask_next(*pos - 1, &priv->allowed_mask);
+       return *pos < nr_cpu_ids ? &per_cpu(cpu_data, *pos) : NULL;
+}
+
+static void *bpf_c_next(struct seq_file *m, void *v, loff_t *pos)
+{
+       struct cpuinfo_arm64_seq_priv *priv = m->private;
+
+       *pos = cpumask_next(*pos, &priv->allowed_mask);
+       return *pos < nr_cpu_ids ? &per_cpu(cpu_data, *pos) : NULL;
+}
+
+struct bpf_iter__cpuinfo_arm64 {
+       __bpf_md_ptr(struct bpf_iter_meta *, meta);
+       __bpf_md_ptr(struct cpuinfo_arm64 *, cpuinfo);
+};
+
+static int bpf_c_show(struct seq_file *m, void *v)
+{
+       struct bpf_iter__cpuinfo_arm64 ctx;
+       struct bpf_iter_meta meta;
+       struct bpf_prog *prog;
+
+       meta.seq = m;
+       prog = bpf_iter_get_info(&meta, false);
+       if (!prog)
+               return native_c_show(m, v);
+
+       ctx.meta = &meta;
+       ctx.cpuinfo = (struct cpuinfo_arm64 *)v;
+       return bpf_iter_run_prog(prog, &ctx);
+}
+
+static const struct seq_operations bpf_cpuinfo_op = {
+       .start  = bpf_c_start,
+       .next   = bpf_c_next,
+       .stop   = bpf_c_stop,
+       .show   = bpf_c_show
+};
+
+DEFINE_BPF_ITER_FUNC(cpuinfo_arm64, struct bpf_iter_meta *meta,
+                    struct cpuinfo_arm64 *cpuinfo)
+
+BTF_ID_LIST(btf_cpuinfo_arm64_id)
+BTF_ID(struct, cpuinfo_arm64)
+
+static const struct bpf_iter_seq_info cpuinfo_arm64_seq_info = {
+       .seq_ops                = &bpf_cpuinfo_op,
+       .init_seq_private       = NULL,
+       .fini_seq_private       = NULL,
+       .seq_priv_size          = sizeof(struct cpuinfo_arm64_seq_priv),
+};
+
+static struct bpf_iter_reg cpuinfo_arm64_reg_info = {
+       .target                 = "cpuinfo_arm64",
+       .ctx_arg_info_size      = 1,
+       .ctx_arg_info           = {
+               { offsetof(struct bpf_iter__cpuinfo_arm64, cpuinfo),
+                 PTR_TO_BTF_ID },
+       },
+       .seq_info               = &cpuinfo_arm64_seq_info,
+};
+
+static int __init cpuinfo_iter_init(void)
+{
+       cpuinfo_arm64_reg_info.ctx_arg_info[0].btf_id = *btf_cpuinfo_arm64_id;
+       return bpf_iter_reg_target(&cpuinfo_arm64_reg_info);
+}
+late_initcall(cpuinfo_iter_init);
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index dade66047478..eca636492f7d 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -49,132 +49,7 @@ static inline const char *icache_policy_str(int l1ip)
 
 unsigned long __icache_flags;
 
-static const char *const hwcap_str[] = {
-       [KERNEL_HWCAP_FP] = "fp",
-       [KERNEL_HWCAP_ASIMD] = "asimd",
-       [KERNEL_HWCAP_EVTSTRM] = "evtstrm",
-       [KERNEL_HWCAP_AES] = "aes",
-       [KERNEL_HWCAP_PMULL] = "pmull",
-       [KERNEL_HWCAP_SHA1] = "sha1",
-       [KERNEL_HWCAP_SHA2] = "sha2",
-       [KERNEL_HWCAP_CRC32] = "crc32",
-       [KERNEL_HWCAP_ATOMICS] = "atomics",
-       [KERNEL_HWCAP_FPHP] = "fphp",
-       [KERNEL_HWCAP_ASIMDHP] = "asimdhp",
-       [KERNEL_HWCAP_CPUID] = "cpuid",
-       [KERNEL_HWCAP_ASIMDRDM] = "asimdrdm",
-       [KERNEL_HWCAP_JSCVT] = "jscvt",
-       [KERNEL_HWCAP_FCMA] = "fcma",
-       [KERNEL_HWCAP_LRCPC] = "lrcpc",
-       [KERNEL_HWCAP_DCPOP] = "dcpop",
-       [KERNEL_HWCAP_SHA3] = "sha3",
-       [KERNEL_HWCAP_SM3] = "sm3",
-       [KERNEL_HWCAP_SM4] = "sm4",
-       [KERNEL_HWCAP_ASIMDDP] = "asimddp",
-       [KERNEL_HWCAP_SHA512] = "sha512",
-       [KERNEL_HWCAP_SVE] = "sve",
-       [KERNEL_HWCAP_ASIMDFHM] = "asimdfhm",
-       [KERNEL_HWCAP_DIT] = "dit",
-       [KERNEL_HWCAP_USCAT] = "uscat",
-       [KERNEL_HWCAP_ILRCPC] = "ilrcpc",
-       [KERNEL_HWCAP_FLAGM] = "flagm",
-       [KERNEL_HWCAP_SSBS] = "ssbs",
-       [KERNEL_HWCAP_SB] = "sb",
-       [KERNEL_HWCAP_PACA] = "paca",
-       [KERNEL_HWCAP_PACG] = "pacg",
-       [KERNEL_HWCAP_LS64] = "ls64",
-       [KERNEL_HWCAP_LS64_V] = "ls64_v",
-       [KERNEL_HWCAP_DCPODP] = "dcpodp",
-       [KERNEL_HWCAP_SVE2] = "sve2",
-       [KERNEL_HWCAP_SVEAES] = "sveaes",
-       [KERNEL_HWCAP_SVEPMULL] = "svepmull",
-       [KERNEL_HWCAP_SVEBITPERM] = "svebitperm",
-       [KERNEL_HWCAP_SVESHA3] = "svesha3",
-       [KERNEL_HWCAP_SVESM4] = "svesm4",
-       [KERNEL_HWCAP_FLAGM2] = "flagm2",
-       [KERNEL_HWCAP_FRINT] = "frint",
-       [KERNEL_HWCAP_SVEI8MM] = "svei8mm",
-       [KERNEL_HWCAP_SVEF32MM] = "svef32mm",
-       [KERNEL_HWCAP_SVEF64MM] = "svef64mm",
-       [KERNEL_HWCAP_SVEBF16] = "svebf16",
-       [KERNEL_HWCAP_I8MM] = "i8mm",
-       [KERNEL_HWCAP_BF16] = "bf16",
-       [KERNEL_HWCAP_DGH] = "dgh",
-       [KERNEL_HWCAP_RNG] = "rng",
-       [KERNEL_HWCAP_BTI] = "bti",
-       [KERNEL_HWCAP_MTE] = "mte",
-       [KERNEL_HWCAP_ECV] = "ecv",
-       [KERNEL_HWCAP_AFP] = "afp",
-       [KERNEL_HWCAP_RPRES] = "rpres",
-       [KERNEL_HWCAP_MTE3] = "mte3",
-       [KERNEL_HWCAP_SME] = "sme",
-       [KERNEL_HWCAP_SME_I16I64] = "smei16i64",
-       [KERNEL_HWCAP_SME_F64F64] = "smef64f64",
-       [KERNEL_HWCAP_SME_I8I32] = "smei8i32",
-       [KERNEL_HWCAP_SME_F16F32] = "smef16f32",
-       [KERNEL_HWCAP_SME_B16F32] = "smeb16f32",
-       [KERNEL_HWCAP_SME_F32F32] = "smef32f32",
-       [KERNEL_HWCAP_SME_FA64] = "smefa64",
-       [KERNEL_HWCAP_WFXT] = "wfxt",
-       [KERNEL_HWCAP_EBF16] = "ebf16",
-       [KERNEL_HWCAP_SVE_EBF16] = "sveebf16",
-       [KERNEL_HWCAP_CSSC] = "cssc",
-       [KERNEL_HWCAP_RPRFM] = "rprfm",
-       [KERNEL_HWCAP_SVE2P1] = "sve2p1",
-       [KERNEL_HWCAP_SME2] = "sme2",
-       [KERNEL_HWCAP_SME2P1] = "sme2p1",
-       [KERNEL_HWCAP_SME_I16I32] = "smei16i32",
-       [KERNEL_HWCAP_SME_BI32I32] = "smebi32i32",
-       [KERNEL_HWCAP_SME_B16B16] = "smeb16b16",
-       [KERNEL_HWCAP_SME_F16F16] = "smef16f16",
-       [KERNEL_HWCAP_MOPS] = "mops",
-       [KERNEL_HWCAP_HBC] = "hbc",
-};
-
-#ifdef CONFIG_AARCH32_EL0
-#define COMPAT_KERNEL_HWCAP(x) const_ilog2(COMPAT_HWCAP_ ## x)
-static const char *const compat_hwcap_str[] = {
-       [COMPAT_KERNEL_HWCAP(SWP)] = "swp",
-       [COMPAT_KERNEL_HWCAP(HALF)] = "half",
-       [COMPAT_KERNEL_HWCAP(THUMB)] = "thumb",
-       [COMPAT_KERNEL_HWCAP(26BIT)] = NULL,    /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(FAST_MULT)] = "fastmult",
-       [COMPAT_KERNEL_HWCAP(FPA)] = NULL,      /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(VFP)] = "vfp",
-       [COMPAT_KERNEL_HWCAP(EDSP)] = "edsp",
-       [COMPAT_KERNEL_HWCAP(JAVA)] = NULL,     /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(IWMMXT)] = NULL,   /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(CRUNCH)] = NULL,   /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(THUMBEE)] = NULL,  /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(NEON)] = "neon",
-       [COMPAT_KERNEL_HWCAP(VFPv3)] = "vfpv3",
-       [COMPAT_KERNEL_HWCAP(VFPV3D16)] = NULL, /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(TLS)] = "tls",
-       [COMPAT_KERNEL_HWCAP(VFPv4)] = "vfpv4",
-       [COMPAT_KERNEL_HWCAP(IDIVA)] = "idiva",
-       [COMPAT_KERNEL_HWCAP(IDIVT)] = "idivt",
-       [COMPAT_KERNEL_HWCAP(VFPD32)] = NULL,   /* Not possible on arm64 */
-       [COMPAT_KERNEL_HWCAP(LPAE)] = "lpae",
-       [COMPAT_KERNEL_HWCAP(EVTSTRM)] = "evtstrm",
-       [COMPAT_KERNEL_HWCAP(FPHP)] = "fphp",
-       [COMPAT_KERNEL_HWCAP(ASIMDHP)] = "asimdhp",
-       [COMPAT_KERNEL_HWCAP(ASIMDDP)] = "asimddp",
-       [COMPAT_KERNEL_HWCAP(ASIMDFHM)] = "asimdfhm",
-       [COMPAT_KERNEL_HWCAP(ASIMDBF16)] = "asimdbf16",
-       [COMPAT_KERNEL_HWCAP(I8MM)] = "i8mm",
-};
-
-#define COMPAT_KERNEL_HWCAP2(x) const_ilog2(COMPAT_HWCAP2_ ## x)
-static const char *const compat_hwcap2_str[] = {
-       [COMPAT_KERNEL_HWCAP2(AES)] = "aes",
-       [COMPAT_KERNEL_HWCAP2(PMULL)] = "pmull",
-       [COMPAT_KERNEL_HWCAP2(SHA1)] = "sha1",
-       [COMPAT_KERNEL_HWCAP2(SHA2)] = "sha2",
-       [COMPAT_KERNEL_HWCAP2(CRC32)] = "crc32",
-       [COMPAT_KERNEL_HWCAP2(SB)] = "sb",
-       [COMPAT_KERNEL_HWCAP2(SSBS)] = "ssbs",
-};
-#endif /* CONFIG_AARCH32_EL0 */
+#include "hwcap_str.h"
 
 static int c_show(struct seq_file *m, void *v)
 {
@@ -265,7 +140,6 @@ const struct seq_operations cpuinfo_op = {
        .show   = c_show
 };
 
-
 static struct kobj_type cpuregs_kobj_type = {
        .sysfs_ops      = &kobj_sysfs_ops,
 };
diff --git a/arch/arm64/kernel/hwcap_str.h b/arch/arm64/kernel/hwcap_str.h
new file mode 100644
index 000000000000..e53a2c10b0c1
--- /dev/null
+++ b/arch/arm64/kernel/hwcap_str.h
@@ -0,0 +1,131 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2025 Huawei Technologies Co., Ltd */
+#include <asm/hwcap.h>
+
+static const char *const hwcap_str[] = {
+       [KERNEL_HWCAP_FP] = "fp",
+       [KERNEL_HWCAP_ASIMD] = "asimd",
+       [KERNEL_HWCAP_EVTSTRM] = "evtstrm",
+       [KERNEL_HWCAP_AES] = "aes",
+       [KERNEL_HWCAP_PMULL] = "pmull",
+       [KERNEL_HWCAP_SHA1] = "sha1",
+       [KERNEL_HWCAP_SHA2] = "sha2",
+       [KERNEL_HWCAP_CRC32] = "crc32",
+       [KERNEL_HWCAP_ATOMICS] = "atomics",
+       [KERNEL_HWCAP_FPHP] = "fphp",
+       [KERNEL_HWCAP_ASIMDHP] = "asimdhp",
+       [KERNEL_HWCAP_CPUID] = "cpuid",
+       [KERNEL_HWCAP_ASIMDRDM] = "asimdrdm",
+       [KERNEL_HWCAP_JSCVT] = "jscvt",
+       [KERNEL_HWCAP_FCMA] = "fcma",
+       [KERNEL_HWCAP_LRCPC] = "lrcpc",
+       [KERNEL_HWCAP_DCPOP] = "dcpop",
+       [KERNEL_HWCAP_SHA3] = "sha3",
+       [KERNEL_HWCAP_SM3] = "sm3",
+       [KERNEL_HWCAP_SM4] = "sm4",
+       [KERNEL_HWCAP_ASIMDDP] = "asimddp",
+       [KERNEL_HWCAP_SHA512] = "sha512",
+       [KERNEL_HWCAP_SVE] = "sve",
+       [KERNEL_HWCAP_ASIMDFHM] = "asimdfhm",
+       [KERNEL_HWCAP_DIT] = "dit",
+       [KERNEL_HWCAP_USCAT] = "uscat",
+       [KERNEL_HWCAP_ILRCPC] = "ilrcpc",
+       [KERNEL_HWCAP_FLAGM] = "flagm",
+       [KERNEL_HWCAP_SSBS] = "ssbs",
+       [KERNEL_HWCAP_SB] = "sb",
+       [KERNEL_HWCAP_PACA] = "paca",
+       [KERNEL_HWCAP_PACG] = "pacg",
+       [KERNEL_HWCAP_LS64] = "ls64",
+       [KERNEL_HWCAP_LS64_V] = "ls64_v",
+       [KERNEL_HWCAP_DCPODP] = "dcpodp",
+       [KERNEL_HWCAP_SVE2] = "sve2",
+       [KERNEL_HWCAP_SVEAES] = "sveaes",
+       [KERNEL_HWCAP_SVEPMULL] = "svepmull",
+       [KERNEL_HWCAP_SVEBITPERM] = "svebitperm",
+       [KERNEL_HWCAP_SVESHA3] = "svesha3",
+       [KERNEL_HWCAP_SVESM4] = "svesm4",
+       [KERNEL_HWCAP_FLAGM2] = "flagm2",
+       [KERNEL_HWCAP_FRINT] = "frint",
+       [KERNEL_HWCAP_SVEI8MM] = "svei8mm",
+       [KERNEL_HWCAP_SVEF32MM] = "svef32mm",
+       [KERNEL_HWCAP_SVEF64MM] = "svef64mm",
+       [KERNEL_HWCAP_SVEBF16] = "svebf16",
+       [KERNEL_HWCAP_I8MM] = "i8mm",
+       [KERNEL_HWCAP_BF16] = "bf16",
+       [KERNEL_HWCAP_DGH] = "dgh",
+       [KERNEL_HWCAP_RNG] = "rng",
+       [KERNEL_HWCAP_BTI] = "bti",
+       [KERNEL_HWCAP_MTE] = "mte",
+       [KERNEL_HWCAP_ECV] = "ecv",
+       [KERNEL_HWCAP_AFP] = "afp",
+       [KERNEL_HWCAP_RPRES] = "rpres",
+       [KERNEL_HWCAP_MTE3] = "mte3",
+       [KERNEL_HWCAP_SME] = "sme",
+       [KERNEL_HWCAP_SME_I16I64] = "smei16i64",
+       [KERNEL_HWCAP_SME_F64F64] = "smef64f64",
+       [KERNEL_HWCAP_SME_I8I32] = "smei8i32",
+       [KERNEL_HWCAP_SME_F16F32] = "smef16f32",
+       [KERNEL_HWCAP_SME_B16F32] = "smeb16f32",
+       [KERNEL_HWCAP_SME_F32F32] = "smef32f32",
+       [KERNEL_HWCAP_SME_FA64] = "smefa64",
+       [KERNEL_HWCAP_WFXT] = "wfxt",
+       [KERNEL_HWCAP_EBF16] = "ebf16",
+       [KERNEL_HWCAP_SVE_EBF16] = "sveebf16",
+       [KERNEL_HWCAP_CSSC] = "cssc",
+       [KERNEL_HWCAP_RPRFM] = "rprfm",
+       [KERNEL_HWCAP_SVE2P1] = "sve2p1",
+       [KERNEL_HWCAP_SME2] = "sme2",
+       [KERNEL_HWCAP_SME2P1] = "sme2p1",
+       [KERNEL_HWCAP_SME_I16I32] = "smei16i32",
+       [KERNEL_HWCAP_SME_BI32I32] = "smebi32i32",
+       [KERNEL_HWCAP_SME_B16B16] = "smeb16b16",
+       [KERNEL_HWCAP_SME_F16F16] = "smef16f16",
+       [KERNEL_HWCAP_MOPS] = "mops",
+       [KERNEL_HWCAP_HBC] = "hbc",
+};
+
+#ifdef CONFIG_AARCH32_EL0
+#define COMPAT_KERNEL_HWCAP(x) const_ilog2(COMPAT_HWCAP_ ## x)
+static const char *const compat_hwcap_str[] = {
+       [COMPAT_KERNEL_HWCAP(SWP)] = "swp",
+       [COMPAT_KERNEL_HWCAP(HALF)] = "half",
+       [COMPAT_KERNEL_HWCAP(THUMB)] = "thumb",
+       [COMPAT_KERNEL_HWCAP(26BIT)] = NULL,    /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(FAST_MULT)] = "fastmult",
+       [COMPAT_KERNEL_HWCAP(FPA)] = NULL,      /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(VFP)] = "vfp",
+       [COMPAT_KERNEL_HWCAP(EDSP)] = "edsp",
+       [COMPAT_KERNEL_HWCAP(JAVA)] = NULL,     /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(IWMMXT)] = NULL,   /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(CRUNCH)] = NULL,   /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(THUMBEE)] = NULL,  /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(NEON)] = "neon",
+       [COMPAT_KERNEL_HWCAP(VFPv3)] = "vfpv3",
+       [COMPAT_KERNEL_HWCAP(VFPV3D16)] = NULL, /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(TLS)] = "tls",
+       [COMPAT_KERNEL_HWCAP(VFPv4)] = "vfpv4",
+       [COMPAT_KERNEL_HWCAP(IDIVA)] = "idiva",
+       [COMPAT_KERNEL_HWCAP(IDIVT)] = "idivt",
+       [COMPAT_KERNEL_HWCAP(VFPD32)] = NULL,   /* Not possible on arm64 */
+       [COMPAT_KERNEL_HWCAP(LPAE)] = "lpae",
+       [COMPAT_KERNEL_HWCAP(EVTSTRM)] = "evtstrm",
+       [COMPAT_KERNEL_HWCAP(FPHP)] = "fphp",
+       [COMPAT_KERNEL_HWCAP(ASIMDHP)] = "asimdhp",
+       [COMPAT_KERNEL_HWCAP(ASIMDDP)] = "asimddp",
+       [COMPAT_KERNEL_HWCAP(ASIMDFHM)] = "asimdfhm",
+       [COMPAT_KERNEL_HWCAP(ASIMDBF16)] = "asimdbf16",
+       [COMPAT_KERNEL_HWCAP(I8MM)] = "i8mm",
+};
+
+#define COMPAT_KERNEL_HWCAP2(x) const_ilog2(COMPAT_HWCAP2_ ## x)
+static const char *const compat_hwcap2_str[] = {
+       [COMPAT_KERNEL_HWCAP2(AES)] = "aes",
+       [COMPAT_KERNEL_HWCAP2(PMULL)] = "pmull",
+       [COMPAT_KERNEL_HWCAP2(SHA1)] = "sha1",
+       [COMPAT_KERNEL_HWCAP2(SHA2)] = "sha2",
+       [COMPAT_KERNEL_HWCAP2(CRC32)] = "crc32",
+       [COMPAT_KERNEL_HWCAP2(SB)] = "sb",
+       [COMPAT_KERNEL_HWCAP2(SSBS)] = "ssbs",
+};
+#endif /* CONFIG_AARCH32_EL0 */
-- 
2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add a kfunc (for arm64 in this patch) to get arch-related flags and
their string representations.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 arch/arm64/kernel/bpf-rvi.c | 45 +++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/arch/arm64/kernel/bpf-rvi.c b/arch/arm64/kernel/bpf-rvi.c
index c6590704e3a6..5b86a1f4a26e 100644
--- a/arch/arm64/kernel/bpf-rvi.c
+++ b/arch/arm64/kernel/bpf-rvi.c
@@ -165,3 +165,48 @@ static int __init cpuinfo_iter_init(void)
        return bpf_iter_reg_target(&cpuinfo_arm64_reg_info);
 }
 late_initcall(cpuinfo_iter_init);
+
+enum arch_flags_type {
+       ARM64_HWCAP,
+       ARM64_HWCAP_SIZE,
+       ARM64_COMPAT_HWCAP,
+       ARM64_COMPAT_HWCAP_SIZE,
+       ARM64_COMPAT_HWCAP2,
+       ARM64_COMPAT_HWCAP2_SIZE,
+};
+
+__bpf_kfunc const char *bpf_arch_flags(enum arch_flags_type t, int i)
+{
+       switch (t) {
+       case ARM64_HWCAP:
+               return hwcap_str[i];
+       case ARM64_HWCAP_SIZE:
+               return (void *)ARRAY_SIZE(hwcap_str);
+       case ARM64_COMPAT_HWCAP:
+               return compat_hwcap_str[i];
+       case ARM64_COMPAT_HWCAP_SIZE:
+               return (void *)ARRAY_SIZE(compat_hwcap_str);
+       case ARM64_COMPAT_HWCAP2:
+               return compat_hwcap2_str[i];
+       case ARM64_COMPAT_HWCAP2_SIZE:
+               return (void *)ARRAY_SIZE(compat_hwcap2_str);
+       default:
+               return NULL;
+       }
+}
+
+BTF_SET8_START(bpf_arch_flags_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_arch_flags)
+BTF_SET8_END(bpf_arch_flags_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_arch_flags_kfunc_set = {
+       .owner = THIS_MODULE,
+       .set = &bpf_arch_flags_kfunc_ids,
+};
+
+static int __init bpf_arch_flags_kfunc_init(void)
+{
+       return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+                                        &bpf_arch_flags_kfunc_set);
+}
+late_initcall(bpf_arch_flags_kfunc_init);
-- 
2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Implement the bpf prog for the arm64 'cpuinfo' interface.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 samples/bpf/Makefile                    |   3 +
 samples/bpf/bpf_rvi_cpuinfo_arm64.bpf.c | 112 ++++++++++++++++++++++++
 2 files changed, 115 insertions(+)
 create mode 100644 samples/bpf/bpf_rvi_cpuinfo_arm64.bpf.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 7627f996b5e5..7a8559c534c8 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -152,6 +152,9 @@ always-y += hbm_edt_kern.o
 ifeq ($(ARCH), x86)
 always-$(CONFIG_BPF_RVI) += bpf_rvi_cpuinfo_x86.bpf.o
 endif
+ifeq ($(ARCH), arm64)
+always-$(CONFIG_BPF_RVI) += bpf_rvi_cpuinfo_arm64.bpf.o
+endif
 always-$(CONFIG_BPF_RVI) += bpf_rvi_cpu_online.bpf.o
 
 ifeq ($(ARCH), arm)
diff --git a/samples/bpf/bpf_rvi_cpuinfo_arm64.bpf.c b/samples/bpf/bpf_rvi_cpuinfo_arm64.bpf.c
new file mode 100644
index 000000000000..1f5ab1053b3b
--- /dev/null
+++ b/samples/bpf/bpf_rvi_cpuinfo_arm64.bpf.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Huawei Technologies Co., Ltd */
+#include <vmlinux.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_helpers.h>
+
+#define MIDR_REVISION_MASK     0xf
+#define MIDR_REVISION(midr)    ((midr) & MIDR_REVISION_MASK)
+#define MIDR_PARTNUM_SHIFT     4
+#define MIDR_PARTNUM_MASK      (0xfff << MIDR_PARTNUM_SHIFT)
+#define MIDR_PARTNUM(midr)     (((midr) & MIDR_PARTNUM_MASK) >> MIDR_PARTNUM_SHIFT)
+#define MIDR_VARIANT_SHIFT     20
+#define MIDR_VARIANT_MASK      (0xf << MIDR_VARIANT_SHIFT)
+#define MIDR_VARIANT(midr)     (((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)
+#define MIDR_IMPLEMENTOR_SHIFT 24
+#define MIDR_IMPLEMENTOR_MASK  (0xff << MIDR_IMPLEMENTOR_SHIFT)
+#define MIDR_IMPLEMENTOR(midr) (((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+#define BITS_PER_BYTE 8UL
+#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+#define DIV_ROUND_UP __KERNEL_DIV_ROUND_UP
+#define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
+#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
+
+extern bool bpf_arm64_cpu_have_feature(unsigned int num) __ksym;
+extern const char *bpf_arch_flags(enum arch_flags_type t, int i) __ksym;
+
+/* Reference: https://docs.ebpf.io/ebpf-library/libbpf/ebpf/__ksym/ */
+extern void loops_per_jiffy __ksym;
+extern void a32_elf_hwcap __ksym;
+extern void a32_elf_hwcap2 __ksym;
+
+extern int CONFIG_HZ __kconfig __weak;
+extern int CONFIG_NR_CPUS __kconfig __weak;
+extern bool CONFIG_CPU_BIG_ENDIAN __kconfig __weak;
+extern bool CONFIG_AARCH32_EL0 __kconfig __weak;
+
+#define PER_LINUX32 0x0008
+#define PER_MASK 0x00ff
+#define personality(pers) (pers & PER_MASK)
+
+#define RET_OK 0
+#define RET_FAIL 1
+#define RET_SKIP -1
+
+SEC("iter/cpuinfo_arm64")
+int dump_cpuinfo_arm64(struct bpf_iter__cpuinfo_arm64 *ctx)
+{
+       struct seq_file *m = ctx->meta->seq;
+       struct cpuinfo_arm64 *cpuinfo = ctx->cpuinfo;
+       unsigned int midr = cpuinfo->reg_midr;
+       struct task_struct *current = bpf_get_current_task_btf();
+       bool aarch32 = personality(current->personality) == PER_LINUX32;
+       unsigned long out_loops_per_jiffy;
+       unsigned int out_a32_elf_hwcap, out_a32_elf_hwcap2;
+       int err = 0;
+       int j;
+       const char *COMPAT_ELF_PLATFORM = CONFIG_CPU_BIG_ENDIAN ? "v8b" : "v8l";
+
+       BPF_SEQ_PRINTF(m, "processor\t: %ld\n", ctx->meta->seq_num);
+
+       if (aarch32)
+               BPF_SEQ_PRINTF(m, "model name\t: ARMv8 Processor rev %d (%s)\n",
+                              MIDR_REVISION(midr), COMPAT_ELF_PLATFORM);
+
+       err = bpf_core_read(&out_loops_per_jiffy, sizeof(unsigned long), &loops_per_jiffy);
+       if (err)
+               return RET_FAIL;
+       BPF_SEQ_PRINTF(m, "BogoMIPS\t: %lu.%02lu\n",
+                      out_loops_per_jiffy / (500000UL/CONFIG_HZ),
+                      out_loops_per_jiffy / (5000UL/CONFIG_HZ) % 100);
+
+       BPF_SEQ_PRINTF(m, "Features\t:");
+       if (aarch32 && CONFIG_AARCH32_EL0) {
+               unsigned long compat_hwcap_str_size, compat_hwcap2_str_size;
+
+               compat_hwcap_str_size = (unsigned long)bpf_arch_flags(ARM64_COMPAT_HWCAP_SIZE, 0);
+               bpf_core_read(&out_a32_elf_hwcap, sizeof(unsigned int), &a32_elf_hwcap);
+               for (j = 0; j < compat_hwcap_str_size; j++) {
+                       if (out_a32_elf_hwcap & (1 << j)) {
+                               if (!bpf_arch_flags(ARM64_COMPAT_HWCAP, j))
+                                       continue;
+                               BPF_SEQ_PRINTF(m, " %s", bpf_arch_flags(ARM64_COMPAT_HWCAP, j));
+                       }
+               }
+
+               compat_hwcap2_str_size = (unsigned long)bpf_arch_flags(ARM64_COMPAT_HWCAP2_SIZE, 0);
+               bpf_core_read(&out_a32_elf_hwcap2, sizeof(unsigned int), &a32_elf_hwcap2);
+               for (j = 0; j < compat_hwcap2_str_size; j++)
+                       if (out_a32_elf_hwcap2 & (1 << j))
+                               BPF_SEQ_PRINTF(m, " %s", bpf_arch_flags(ARM64_COMPAT_HWCAP2, j));
+       } else {
+               unsigned long hwcap_str_size = (unsigned long)bpf_arch_flags(ARM64_HWCAP_SIZE, 0);
+
+               for (j = 0; j < hwcap_str_size; j++)
+                       if (bpf_arm64_cpu_have_feature(j))
+                               BPF_SEQ_PRINTF(m, " %s", bpf_arch_flags(ARM64_HWCAP, j));
+       }
+
+       BPF_SEQ_PRINTF(m, "\n");
+
+       BPF_SEQ_PRINTF(m, "CPU implementer\t: 0x%02x\n", MIDR_IMPLEMENTOR(midr));
+       BPF_SEQ_PRINTF(m, "CPU architecture: 8\n");
+       BPF_SEQ_PRINTF(m, "CPU variant\t: 0x%x\n", MIDR_VARIANT(midr));
+       BPF_SEQ_PRINTF(m, "CPU part\t: 0x%03x\n", MIDR_PARTNUM(midr));
+       BPF_SEQ_PRINTF(m, "CPU revision\t: %d\n\n", MIDR_REVISION(midr));
+
+       return RET_OK;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Create a bpf iter target for the 'diskstats' interface, to which bpf
progs can attach.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 block/genhd.c | 222 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 186 insertions(+), 36 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 61d340aa30f4..9d9b60501bcb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,10 @@
 #include <linux/badblocks.h>
 #include <linux/part_stat.h>
 #include <linux/blktrace_api.h>
+#ifdef CONFIG_BPF_RVI
+#include <linux/bpf.h>
+#include <linux/pid_namespace.h>
+#endif
 
 #include "blk-throttle.h"
 #include "blk.h"
@@ -1231,6 +1235,57 @@ const struct device_type disk_type = {
 };
 
 #ifdef CONFIG_PROC_FS
+static int native_diskstats_show(struct seq_file *seqf, struct block_device *hd,
+                                struct disk_stats *stat, unsigned int inflight)
+{
+       seq_printf(seqf, "%4d %7d %pg "
+                  "%lu %lu %lu %u "
+                  "%lu %lu %lu %u "
+                  "%u %u %u "
+                  "%lu %lu %lu %u "
+                  "%lu %u"
+                  "\n",
+                  MAJOR(hd->bd_dev), MINOR(hd->bd_dev), hd,
+                  stat->ios[STAT_READ],
+                  stat->merges[STAT_READ],
+                  stat->sectors[STAT_READ],
+                  (unsigned int)div_u64(stat->nsecs[STAT_READ],
+                                        NSEC_PER_MSEC),
+                  stat->ios[STAT_WRITE],
+                  stat->merges[STAT_WRITE],
+                  stat->sectors[STAT_WRITE],
+                  (unsigned int)div_u64(stat->nsecs[STAT_WRITE],
+                                        NSEC_PER_MSEC),
+                  inflight,
+                  jiffies_to_msecs(stat->io_ticks),
+                  (unsigned int)div_u64(stat->nsecs[STAT_READ] +
+                                        stat->nsecs[STAT_WRITE] +
+                                        stat->nsecs[STAT_DISCARD] +
+                                        stat->nsecs[STAT_FLUSH],
+                                        NSEC_PER_MSEC),
+                  stat->ios[STAT_DISCARD],
+                  stat->merges[STAT_DISCARD],
+                  stat->sectors[STAT_DISCARD],
+                  (unsigned int)div_u64(stat->nsecs[STAT_DISCARD],
+                                        NSEC_PER_MSEC),
+                  stat->ios[STAT_FLUSH],
+                  (unsigned int)div_u64(stat->nsecs[STAT_FLUSH],
+                                        NSEC_PER_MSEC)
+                  );
+       return 0;
+}
+
+#ifdef CONFIG_BPF_RVI
+static int __diskstats_show(struct seq_file *seqf, struct block_device *hd,
+                           struct disk_stats *stat, unsigned int inflight);
+#else
+static int __diskstats_show(struct seq_file *seqf, struct block_device *hd,
+                           struct disk_stats *stat, unsigned int inflight)
+{
+       return native_diskstats_show(seqf, hd, stat, inflight);
+}
+#endif
+
 /*
  * aggregate disk stat collector. Uses the same stats that the sysfs
  * entries do, above, but makes them available through one seq_file.
@@ -1245,6 +1300,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
        unsigned int inflight;
        struct disk_stats stat;
        unsigned long idx;
+       int ret = 0;
 
        /*
        if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
@@ -1269,44 +1325,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
                        part_stat_unlock();
                }
                part_stat_read_all(hd, &stat);
-               seq_printf(seqf, "%4d %7d %pg "
-                          "%lu %lu %lu %u "
-                          "%lu %lu %lu %u "
-                          "%u %u %u "
-                          "%lu %lu %lu %u "
-                          "%lu %u"
-                          "\n",
-                          MAJOR(hd->bd_dev), MINOR(hd->bd_dev), hd,
-                          stat.ios[STAT_READ],
-                          stat.merges[STAT_READ],
-                          stat.sectors[STAT_READ],
-                          (unsigned int)div_u64(stat.nsecs[STAT_READ],
-                                                NSEC_PER_MSEC),
-                          stat.ios[STAT_WRITE],
-                          stat.merges[STAT_WRITE],
-                          stat.sectors[STAT_WRITE],
-                          (unsigned int)div_u64(stat.nsecs[STAT_WRITE],
-                                                NSEC_PER_MSEC),
-                          inflight,
-                          jiffies_to_msecs(stat.io_ticks),
-                          (unsigned int)div_u64(stat.nsecs[STAT_READ] +
-                                                stat.nsecs[STAT_WRITE] +
-                                                stat.nsecs[STAT_DISCARD] +
-                                                stat.nsecs[STAT_FLUSH],
-                                                NSEC_PER_MSEC),
-                          stat.ios[STAT_DISCARD],
-                          stat.merges[STAT_DISCARD],
-                          stat.sectors[STAT_DISCARD],
-                          (unsigned int)div_u64(stat.nsecs[STAT_DISCARD],
-                                                NSEC_PER_MSEC),
-                          stat.ios[STAT_FLUSH],
-                          (unsigned int)div_u64(stat.nsecs[STAT_FLUSH],
-                                                NSEC_PER_MSEC)
-                          );
+               ret = __diskstats_show(seqf, hd, &stat, inflight);
+               if (ret)
+                       break;
        }
        rcu_read_unlock();
 
-       return 0;
+       return ret;
 }
 
 static const struct seq_operations diskstats_op = {
@@ -1316,11 +1341,136 @@ static const struct seq_operations diskstats_op = {
        .show   = diskstats_show
 };
 
+#ifdef CONFIG_BPF_RVI
+struct diskstats_seq_priv {
+       struct class_dev_iter iter;     // must be the first,
+                                       // to let us reuse disk_seqf_next()
+       struct blkcg *task_blkcg;
+};
+
+/*
+ * Basically the same as disk_seqf_start() but without allocating iter and
+ * then overwriting seqf->private, which points to priv_data->target_private
+ * in the bpf_iter case (see prepare_seq_file()), and is needed to retrieve
+ * struct bpf_iter_priv_data. Here we allocate iter by setting
+ * .seq_priv_size and turning priv_data->target_private into iter.
+ */
+static void *bpf_disk_seqf_start(struct seq_file *seqf, loff_t *pos)
+{
+       loff_t skip = *pos;
+       struct diskstats_seq_priv *priv = seqf->private;
+       struct class_dev_iter *iter;
+       struct device *dev;
+       struct task_struct *task;
+
+       task = get_current_level1_reaper();
+       if (!task)
+               task = current;
+       priv->task_blkcg = css_to_blkcg(task_css(task, io_cgrp_id));
+
+       iter = &priv->iter;
+       class_dev_iter_init(iter, &block_class, NULL, &disk_type);
+       do {
+               dev = class_dev_iter_next(iter);
+               if (!dev)
+                       return NULL;
+       } while (skip--);
+
+       return dev_to_disk(dev);
+}
+
+/*
+ * Similar to the difference between {bpf_,}disk_seqf_start(),
+ * here we don't free iter.
+ */
+static void bpf_disk_seqf_stop(struct seq_file *seqf, void *v)
+{
+       struct diskstats_seq_priv *priv = seqf->private;
+       struct class_dev_iter *iter = &priv->iter;
+
+       /* stop is called even after start failed :-( */
+       if (iter)
+               class_dev_iter_exit(iter);
+}
+
+struct bpf_iter__diskstats {
+       __bpf_md_ptr(struct bpf_iter_meta *, meta);
+       __bpf_md_ptr(struct block_device *, bd);
+       __bpf_md_ptr(struct disk_stats *, native_stat);
+       unsigned int inflight __aligned(8);
+       __bpf_md_ptr(struct blkcg *, task_blkcg);
+};
+
+DEFINE_BPF_ITER_FUNC(diskstats, struct bpf_iter_meta *meta,
+                    struct block_device *bd, struct disk_stats *native_stat,
+                    uint inflight,
+                    struct blkcg *task_blkcg)
+
+static int __diskstats_show(struct seq_file *seqf, struct block_device *hd,
+                           struct disk_stats *stat, unsigned int inflight)
+{
+       struct bpf_iter__diskstats ctx;
+       struct bpf_iter_meta meta;
+       struct bpf_prog *prog;
+       struct diskstats_seq_priv *priv = seqf->private;
+
+       meta.seq = seqf;
+       prog = bpf_iter_get_info(&meta, false);
+       if (!prog)
+               return native_diskstats_show(seqf, hd, stat, inflight);
+
+       ctx.meta = &meta;
+       ctx.bd = hd;
+       ctx.native_stat = stat;
+       ctx.inflight = inflight;
+       ctx.task_blkcg = priv->task_blkcg;
+       return bpf_iter_run_prog(prog, &ctx);
+}
+
+static const struct seq_operations bpf_diskstats_op = {
+       .start  = bpf_disk_seqf_start,
+       .next   = disk_seqf_next,
+       .stop   = bpf_disk_seqf_stop,
+       .show   = diskstats_show
+};
+
+static const struct bpf_iter_seq_info diskstats_seq_info = {
+       .seq_ops                = &bpf_diskstats_op,
+       .init_seq_private       = NULL,
+       .fini_seq_private       = NULL,
+       .seq_priv_size          = sizeof(struct diskstats_seq_priv),
+};
+
+static struct bpf_iter_reg diskstats_reg_info = {
+       .target                 = "diskstats",
+       .ctx_arg_info_size      = 2,
+       .ctx_arg_info           = {
+               { offsetof(struct bpf_iter__diskstats, bd),
+                 PTR_TO_BTF_ID },
+               { offsetof(struct bpf_iter__diskstats, native_stat),
+                 PTR_TO_BTF_ID },
+       },
+       .seq_info               = &diskstats_seq_info,
+};
+
+BTF_ID_LIST(btf_diskstats_ids)
+BTF_ID(struct, block_device)
+BTF_ID(struct, disk_stats)
+#endif /* CONFIG_BPF_RVI */
+
 static int __init proc_genhd_init(void)
 {
+       int err = 0;
+
        proc_create_seq("diskstats", 0, NULL, &diskstats_op);
        proc_create_seq("partitions", 0, NULL, &partitions_op);
-       return 0;
+
+#ifdef CONFIG_BPF_RVI
+       diskstats_reg_info.ctx_arg_info[0].btf_id = btf_diskstats_ids[0];
+       diskstats_reg_info.ctx_arg_info[1].btf_id = btf_diskstats_ids[1];
+       err = bpf_iter_reg_target(&diskstats_reg_info);
+#endif
+       return err;
 }
 module_init(proc_genhd_init);
 #endif /* CONFIG_PROC_FS */
-- 
2.25.1
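Before the full-featured sample later in the series, a minimal consumer
of this target may help illustrate the ctx contract registered above
(a sketch, not part of the series; it only prints the device numbers,
read I/Os and inflight count from the native stats):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)

SEC("iter/diskstats")
int dump_diskstats_min(struct bpf_iter__diskstats *ctx)
{
        struct block_device *bd = ctx->bd;

        if (!bd)
                return 0;
        BPF_SEQ_PRINTF(ctx->meta->seq, "%u:%u reads=%lu inflight=%u\n",
                       (unsigned int)(bd->bd_dev >> MINORBITS),
                       (unsigned int)(bd->bd_dev & MINORMASK),
                       ctx->native_stat->ios[STAT_READ], ctx->inflight);
        return 0;
}

char _license[] SEC("license") = "GPL";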

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add a kfunc to get iostat from a blkcg, which handles cgroup v1,
cgroup v2 and the different block policies differently.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 block/Kconfig.iosched  |   3 +
 block/blk-cgroup.c     | 159 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf-rvi/Kconfig |   1 +
 3 files changed, 163 insertions(+)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 27f11320b8d1..392b03dc3350 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -44,4 +44,7 @@ config BFQ_CGROUP_DEBUG
          Enable some debugging help. Currently it exports additional stat
          files in a cgroup which can be useful for debugging.
 
+config BPF_RVI_BLK_BFQ
+       bool
+
 endmenu
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 37e0b92e2e87..85e7e3307837 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -29,10 +29,19 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/part_stat.h>
+#ifdef CONFIG_BPF_RVI
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#endif
+
 #include "blk.h"
 #include "blk-cgroup.h"
 #include "blk-ioprio.h"
 #include "blk-throttle.h"
+#ifdef CONFIG_BPF_RVI
+#include "bfq-iosched.h"
+#endif
 
 static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
 
@@ -2208,5 +2217,155 @@ bool blk_cgroup_congested(void)
        return ret;
 }
 
+#ifdef CONFIG_BPF_RVI
+struct blkg_rw_iostat {
+       struct blkg_rwstat_sample throttle_bytes;
+       struct blkg_rwstat_sample throttle_ios;
+       struct blkg_rwstat_sample bfq_bytes;
+       struct blkg_rwstat_sample bfq_ios;
+       struct blkg_rwstat_sample bfq_service_time;
+       struct blkg_rwstat_sample bfq_wait_time;
+       struct blkg_rwstat_sample bfq_merged;
+       struct blkg_iostat v2_iostat;
+};
+
+/*
+ * Getting:
+ *
+ * - "throttle.io_{service_bytes,serviced}_recursive"
+ *   via offsetof(struct throtl_grp, stat_{bytes,ios})
+ * - "bfq.io_{merged,{wait,service}_time,service_bytes,serviced}_recursive"
+ *   via offsetof(struct bfq_group, stats.{merged,{wait,service}_time,bytes,ios})
+ */
+static void blkcg_get_one_stat_v1(struct blkcg_gq *blkg, struct blkg_rw_iostat *iostat)
+{
+       struct blkcg_policy *pol;
+
+#ifdef CONFIG_BLK_DEV_THROTTLING // what blkcg_policy_throtl depends on
+       pol = &blkcg_policy_throtl;
+       if (blkcg_policy_enabled(blkg->q, pol)) {
+               // throttle.io_service_bytes_recursive
+               blkg_rwstat_recursive_sum(blkg, pol,
+                               offsetof(struct throtl_grp, stat_bytes),
+                               &iostat->throttle_bytes);
+               // throttle.io_serviced_recursive
+               blkg_rwstat_recursive_sum(blkg, pol,
+                               offsetof(struct throtl_grp, stat_ios),
+                               &iostat->throttle_ios);
+       }
+#endif
+
+       /*
+        * CONFIG_BPF_RVI_BLK_BFQ: blkcg_policy_bfq is in block/bfq-cgroup.c, which could be
+        *   built as a module if CONFIG_IOSCHED_BFQ=m
+        * CONFIG_BFQ_GROUP_IOSCHED: what struct bfq_group.stats depends on
+        */
+#if defined(CONFIG_BPF_RVI_BLK_BFQ) && defined(CONFIG_BFQ_GROUP_IOSCHED)
+       pol = &blkcg_policy_bfq;
+       if (blkcg_policy_enabled(blkg->q, pol)) {
+               // bfq.io_service_bytes_recursive
+               blkg_rwstat_recursive_sum(blkg, pol,
+                               offsetof(struct bfq_group, stats.bytes),
+                               &iostat->bfq_bytes);
+               // bfq.io_serviced_recursive
+               blkg_rwstat_recursive_sum(blkg, pol,
+                               offsetof(struct bfq_group, stats.ios),
+                               &iostat->bfq_ios);
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
+               // bfq.io_service_time_recursive
+               blkg_rwstat_recursive_sum(blkg, pol,
+                               offsetof(struct bfq_group, stats.service_time),
+                               &iostat->bfq_service_time);
+               // bfq.io_wait_time_recursive
+               blkg_rwstat_recursive_sum(blkg, pol,
+                               offsetof(struct bfq_group, stats.wait_time),
+                               &iostat->bfq_wait_time);
+               // bfq.io_merged_recursive
+               blkg_rwstat_recursive_sum(blkg, pol,
+                               offsetof(struct bfq_group, stats.merged),
+                               &iostat->bfq_merged);
+#endif
+       }
+#endif
+}
+
+/* Reference: blkcg_print_one_stat() */
+static void blkcg_get_one_stat_v2(struct blkcg_gq *blkg, struct blkg_rw_iostat *iostat)
+{
+       struct blkg_iostat_set *bis = &blkg->iostat;
+       unsigned int seq;
+
+       if (!blkg->online)
+               return;
+
+       do {
+               seq = u64_stats_fetch_begin(&bis->sync);
+               iostat->v2_iostat = bis->cur;
+       } while (u64_stats_fetch_retry(&bis->sync, seq));
+}
+
+/*
+ * Basically imitating:
+ *
+ * - v1:
+ *   - tg_print_rwstat_recursive() in block/blk-throttle.c
+ *   - bfqg_print_rwstat_recursive() in block/bfq-cgroup.c
+ * - v2:
+ *   - blkcg_print_stat()
+ *
+ * without the final printing (e.g. the __blkg_prfill_rwstat() part).
+ *
+ * Note that a subsystem can only exist in either cgroup v1 or v2 at the same time.
+ */
+__bpf_kfunc void bpf_blkcg_get_dev_iostat(struct blkcg *blkcg, int major, int minor,
+                                         struct blkg_rw_iostat *iostat, bool is_v2)
+{
+       struct blkcg_gq *blkg;
+       char dev_name[64];
+
+       if (!blkcg || !iostat)
+               return;
+
+       if (is_v2) {
+               if (blkcg == &blkcg_root)
+                       blkcg_fill_root_iostats();
+               else
+                       cgroup_rstat_flush_atomic(blkcg->css.cgroup);
+       }
+
+       // memset(iostat, 0, sizeof(*iostat));
+       snprintf(dev_name, sizeof(dev_name), "%d:%d", major, minor);
+       rcu_read_lock();
+       hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+               if (strcmp(dev_name, blkg_dev_name(blkg)))
+                       continue;
+               spin_lock_irq(&blkg->q->queue_lock);
+               if (is_v2)
+                       blkcg_get_one_stat_v2(blkg, iostat);
+               else
+                       blkcg_get_one_stat_v1(blkg, iostat);
+               spin_unlock_irq(&blkg->q->queue_lock);
+               break;
+       }
+       rcu_read_unlock();
+}
+
+BTF_SET8_START(bpf_blkcg_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_blkcg_get_dev_iostat)
+BTF_SET8_END(bpf_blkcg_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_blkcg_kfunc_set = {
+       .owner = THIS_MODULE,
+       .set = &bpf_blkcg_kfunc_ids,
+};
+
+static int __init bpf_blkcg_kfunc_init(void)
+{
+       return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+                                        &bpf_blkcg_kfunc_set);
+}
+late_initcall(bpf_blkcg_kfunc_init);
+#endif /* CONFIG_BPF_RVI */
+
 module_param(blkcg_debug_stats, bool, 0644);
 MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
diff --git a/kernel/bpf-rvi/Kconfig b/kernel/bpf-rvi/Kconfig
index 0e356ae6fc85..8a9cbac36a0c 100644
--- a/kernel/bpf-rvi/Kconfig
+++ b/kernel/bpf-rvi/Kconfig
@@ -7,6 +7,7 @@ config BPF_RVI
        depends on BPF_SYSCALL
        depends on BPF_JIT
        depends on CPUSETS
+       select BPF_RVI_BLK_BFQ if IOSCHED_BFQ = y # built-in required
        default n
        help
          A resource view is a bundle of interfaces under /proc and /sys
-- 
2.25.1

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Implement the bpf prog for the 'diskstats' interface. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- samples/bpf/Makefile | 1 + samples/bpf/bpf_rvi_diskstats.bpf.c | 295 ++++++++++++++++++ .../selftests/bpf/progs/proc_iter_common.h | 0 3 files changed, 296 insertions(+) create mode 100644 samples/bpf/bpf_rvi_diskstats.bpf.c create mode 100644 tools/testing/selftests/bpf/progs/proc_iter_common.h diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 7a8559c534c8..91d73753c7fb 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -156,6 +156,7 @@ ifeq ($(ARCH), arm64) always-$(CONFIG_BPF_RVI) += bpf_rvi_cpuinfo_arm64.bpf.o endif always-$(CONFIG_BPF_RVI) += bpf_rvi_cpu_online.bpf.o +always-$(CONFIG_BPF_RVI) += bpf_rvi_diskstats.bpf.o ifeq ($(ARCH), arm) # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux diff --git a/samples/bpf/bpf_rvi_diskstats.bpf.c b/samples/bpf/bpf_rvi_diskstats.bpf.c new file mode 100644 index 000000000000..2e08a989c34b --- /dev/null +++ b/samples/bpf/bpf_rvi_diskstats.bpf.c @@ -0,0 +1,295 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Huawei Technologies Co., Ltd */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> + +void bpf_blkcg_get_dev_iostat(struct blkcg *blkcg, int major, int minor, + struct blkg_rw_iostat *iostat, bool is_v2) __ksym; + +char _license[] SEC("license") = "GPL"; + +#define MINORBITS 20 +#define MINORMASK ((1U << MINORBITS) - 1) +#define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS)) +#define MINOR(dev) ((unsigned int) ((dev) & MINORMASK)) +#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi)) + +#define USE_CGROUP_V1 + +#define anyof(seg) (r##seg || w##seg || d##seg) + +#ifdef USE_CGROUP_V1 +static bool throttle_stat_available(struct blkg_rw_iostat *iostat) +{ + u64 rbytes, wbytes, dbytes; + u64 rios, wios, dios; + + rbytes = iostat->throttle_bytes.cnt[BLKG_RWSTAT_READ]; + wbytes = iostat->throttle_bytes.cnt[BLKG_RWSTAT_WRITE]; + dbytes = iostat->throttle_bytes.cnt[BLKG_RWSTAT_DISCARD]; + rios = iostat->throttle_ios.cnt[BLKG_RWSTAT_READ]; + wios = iostat->throttle_ios.cnt[BLKG_RWSTAT_WRITE]; + dios = iostat->throttle_ios.cnt[BLKG_RWSTAT_DISCARD]; + + if (anyof(bytes) || anyof(ios)) + return true; + return false; +} + +static bool bfq_stat_available(struct blkg_rw_iostat *iostat) +{ + u64 rbytes, wbytes, dbytes; + u64 rios, wios, dios; + u64 rserv, wserv, dserv; + u64 rwait, wwait, dwait; + u64 rmerge, wmerge, dmerge; + + rbytes = iostat->bfq_bytes.cnt[BLKG_RWSTAT_READ]; + wbytes = iostat->bfq_bytes.cnt[BLKG_RWSTAT_WRITE]; + dbytes = iostat->bfq_bytes.cnt[BLKG_RWSTAT_DISCARD]; + rios = iostat->bfq_ios.cnt[BLKG_RWSTAT_READ]; + wios = iostat->bfq_ios.cnt[BLKG_RWSTAT_WRITE]; + dios = iostat->bfq_ios.cnt[BLKG_RWSTAT_DISCARD]; + rserv = iostat->bfq_service_time.cnt[BLKG_RWSTAT_READ]; + wserv = iostat->bfq_service_time.cnt[BLKG_RWSTAT_WRITE]; + dserv = iostat->bfq_service_time.cnt[BLKG_RWSTAT_DISCARD]; + rwait = iostat->bfq_wait_time.cnt[BLKG_RWSTAT_READ]; + wwait = iostat->bfq_wait_time.cnt[BLKG_RWSTAT_WRITE]; + dwait = iostat->bfq_wait_time.cnt[BLKG_RWSTAT_DISCARD]; + rmerge = iostat->bfq_merged.cnt[BLKG_RWSTAT_READ]; + wmerge = iostat->bfq_merged.cnt[BLKG_RWSTAT_WRITE]; + dmerge = iostat->bfq_merged.cnt[BLKG_RWSTAT_DISCARD]; + + if (anyof(bytes) || anyof(ios) || anyof(serv) || anyof(wait) || anyof(merge)) + return true; + return false; +} +#else +static bool 
v2_stat_available(struct blkg_rw_iostat *iostat) +{ + u64 rbytes, wbytes, dbytes; + u64 rios, wios, dios; + + rbytes = iostat->v2_iostat.bytes[BLKG_IOSTAT_READ]; + wbytes = iostat->v2_iostat.bytes[BLKG_IOSTAT_WRITE]; + dbytes = iostat->v2_iostat.bytes[BLKG_IOSTAT_DISCARD]; + rios = iostat->v2_iostat.ios[BLKG_IOSTAT_READ]; + wios = iostat->v2_iostat.ios[BLKG_IOSTAT_WRITE]; + dios = iostat->v2_iostat.ios[BLKG_IOSTAT_DISCARD]; + + if (anyof(bytes) || anyof(ios)) + return true; + return false; +} +#endif + +#define MSEC_PER_SEC 1000L +#define NSEC_PER_MSEC 1000000L +#define HZ 1000 +static inline u32 jiffies_to_msecs(const unsigned long j) +{ + return (MSEC_PER_SEC / HZ) * j; +} + +static inline u64 div_u64(u64 dividend, u32 divisor) +{ + return dividend / divisor; +} + +static void native_diskstats_show(struct seq_file *m, struct block_device *hd, + struct disk_stats *stat, unsigned int inflight) +{ + BPF_SEQ_PRINTF(m, "%4d %7d ", MAJOR(hd->bd_dev), MINOR(hd->bd_dev)); + /* Reference: bdev_name() in lib/vsprintf.c */ + if (hd->bd_partno) + BPF_SEQ_PRINTF(m, "%sp%d ", hd->bd_disk->disk_name, hd->bd_partno); + else + BPF_SEQ_PRINTF(m, "%s ", hd->bd_disk->disk_name); + + BPF_SEQ_PRINTF(m, "%lu %lu %lu %u %lu %lu %lu %u ", + stat->ios[STAT_READ], + stat->merges[STAT_READ], + stat->sectors[STAT_READ], + (unsigned int)div_u64(stat->nsecs[STAT_READ], + NSEC_PER_MSEC), + stat->ios[STAT_WRITE], + stat->merges[STAT_WRITE], + stat->sectors[STAT_WRITE], + (unsigned int)div_u64(stat->nsecs[STAT_WRITE], + NSEC_PER_MSEC) + ); + BPF_SEQ_PRINTF(m, "%u %u %u ", + inflight, + jiffies_to_msecs(stat->io_ticks), + (unsigned int)div_u64(stat->nsecs[STAT_READ] + + stat->nsecs[STAT_WRITE] + + stat->nsecs[STAT_DISCARD] + + stat->nsecs[STAT_FLUSH], + NSEC_PER_MSEC) + ); + BPF_SEQ_PRINTF(m, "%lu %lu %lu %u %lu %u\n", + stat->ios[STAT_DISCARD], + stat->merges[STAT_DISCARD], + stat->sectors[STAT_DISCARD], + (unsigned int)div_u64(stat->nsecs[STAT_DISCARD], + NSEC_PER_MSEC), + stat->ios[STAT_FLUSH], + (unsigned int)div_u64(stat->nsecs[STAT_FLUSH], + NSEC_PER_MSEC) + ); +} + +enum iostat_choice { + IOSTAT_CHOICE_NONE, +#ifdef USE_CGROUP_V1 + IOSTAT_CHOICE_BFQ, + IOSTAT_CHOICE_THROTTLE, +#else + IOSTAT_CHOICE_V2, +#endif +}; + +/* Reference: diskstats_show() in block/genhd.c */ +SEC("iter/diskstats") +s64 dump_diskstats(struct bpf_iter__diskstats *ctx) +{ + struct seq_file *m = ctx->meta->seq; + struct block_device *bd = ctx->bd; + // struct disk_stats *native_stat = ctx->native_stat; + struct blkcg *blkcg = ctx->task_blkcg; + struct blkg_rw_iostat iostat = {}; + int major, minor; + enum iostat_choice choice = IOSTAT_CHOICE_NONE; +#ifdef USE_CGROUP_V1 + bool use_v2 = false; +#else + bool use_v2 = true; +#endif + + major = MAJOR(bd->bd_dev); + minor = MINOR(bd->bd_dev); + + bpf_blkcg_get_dev_iostat(blkcg, major, minor, &iostat, use_v2); +#ifdef USE_CGROUP_V1 + if (bfq_stat_available(&iostat)) + choice = IOSTAT_CHOICE_BFQ; + else if (throttle_stat_available(&iostat)) + choice = IOSTAT_CHOICE_THROTTLE; +#else + if (v2_stat_available(&iostat)) + choice = IOSTAT_CHOICE_V2; +#endif + + if (choice == IOSTAT_CHOICE_NONE) { + native_diskstats_show(m, bd, ctx->native_stat, ctx->inflight); + return 0; + } + + BPF_SEQ_PRINTF(m, "%4d %7d ", major, minor); + /* Reference: bdev_name() in lib/vsprintf.c */ + if (bd->bd_partno) + BPF_SEQ_PRINTF(m, "%sp%d ", bd->bd_disk->disk_name, bd->bd_partno); + else + BPF_SEQ_PRINTF(m, "%s ", bd->bd_disk->disk_name); + + /* + * Long fmt needs to be split, as BPF_SEQ_PRINTF accepts limited + * number of 
arguments via macro expansion. + */ +#ifdef USE_CGROUP_V1 + if (choice == IOSTAT_CHOICE_BFQ) { + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 4-7: read {ios,*merges*,sectors,*nsecs*} */ + iostat.bfq_ios.cnt[BLKG_RWSTAT_READ], + iostat.bfq_merged.cnt[BLKG_RWSTAT_READ], + iostat.bfq_bytes.cnt[BLKG_RWSTAT_READ] >> 9, + iostat.bfq_service_time.cnt[BLKG_RWSTAT_READ] / 1000000 + + iostat.bfq_wait_time.cnt[BLKG_RWSTAT_READ] / 1000000); + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 8-11: write {ios,*merges*,sectors,*nsecs*} */ + iostat.bfq_ios.cnt[BLKG_RWSTAT_WRITE], + iostat.bfq_merged.cnt[BLKG_RWSTAT_WRITE], + iostat.bfq_bytes.cnt[BLKG_RWSTAT_WRITE] >> 9, + iostat.bfq_service_time.cnt[BLKG_RWSTAT_WRITE] / 1000000 + + iostat.bfq_wait_time.cnt[BLKG_RWSTAT_WRITE] / 1000000); + BPF_SEQ_PRINTF(m, "%u %u %u ", + // 12: I/Os currently in progress (inflight) + ctx->inflight, + // 13: time spent doing I/Os (ms) (io_ticks) TODO + 0, + // 14: weighted time doing I/Os (ns) (rd + wr + discard + flush) + 0); + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 15-18: discard {ios,*merges*,sectors,*nsecs*} */ + iostat.bfq_ios.cnt[BLKG_RWSTAT_DISCARD], + iostat.bfq_merged.cnt[BLKG_RWSTAT_DISCARD], + iostat.bfq_bytes.cnt[BLKG_RWSTAT_DISCARD] >> 9, + iostat.bfq_service_time.cnt[BLKG_RWSTAT_DISCARD] / 1000000 + + iostat.bfq_wait_time.cnt[BLKG_RWSTAT_DISCARD] / 1000000); + BPF_SEQ_PRINTF(m, "%lu %lu\n", + /* 19-20: flush {ios,nsec} */ + 0, 0); + } else if (choice == IOSTAT_CHOICE_THROTTLE) { + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 4-7: read {ios,*merges*,sectors,*nsecs*} */ + iostat.throttle_ios.cnt[BLKG_RWSTAT_READ], + 0, + iostat.throttle_bytes.cnt[BLKG_RWSTAT_READ] >> 9, + 0); + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 8-11: write {ios,*merges*,sectors,*nsecs*} */ + iostat.throttle_ios.cnt[BLKG_RWSTAT_WRITE], + 0, + iostat.throttle_bytes.cnt[BLKG_RWSTAT_WRITE] >> 9, + 0); + BPF_SEQ_PRINTF(m, "%u %u %u ", + // 12: I/Os currently in progress (inflight) + ctx->inflight, + // 13: time spent doing I/Os (ms) (io_ticks) TODO + 0, + // 14: weighted time doing I/Os (ns) (rd + wr + discard + flush) + 0); + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 15-18: discard {ios,*merges*,sectors,*nsecs*} */ + iostat.throttle_ios.cnt[BLKG_RWSTAT_DISCARD], + 0, + iostat.throttle_bytes.cnt[BLKG_RWSTAT_DISCARD] >> 9, + 0); + BPF_SEQ_PRINTF(m, "%lu %lu\n", + /* 19-20: flush {ios,nsec} */ + 0, 0); + } +#else + if (choice == IOSTAT_CHOICE_V2) { + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 4-7: read {ios,*merges*,sectors,*nsecs*} */ + iostat.v2_iostat.ios[BLKG_IOSTAT_READ], + 0, + iostat.v2_iostat.bytes[BLKG_IOSTAT_READ] >> 9, + 0); + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 8-11: write {ios,*merges*,sectors,*nsecs*} */ + iostat.v2_iostat.ios[BLKG_IOSTAT_WRITE], + 0, + iostat.v2_iostat.bytes[BLKG_IOSTAT_WRITE] >> 9, + 0); + BPF_SEQ_PRINTF(m, "%u %u %u ", + // 12: I/Os currently in progress (inflight) + ctx->inflight, + // 13: time spent doing I/Os (ms) (io_ticks) TODO + 0, + // 14: weighted time doing I/Os (ns) (rd + wr + discard + flush) + 0); + BPF_SEQ_PRINTF(m, "%lu %lu %lu %lu ", + /* 15-18: discard {ios,*merges*,sectors,*nsecs*} */ + iostat.v2_iostat.ios[BLKG_IOSTAT_DISCARD], + 0, + iostat.v2_iostat.bytes[BLKG_IOSTAT_DISCARD] >> 9, + 0); + BPF_SEQ_PRINTF(m, "%lu %lu\n", + /* 19-20: flush {ios,nsec} */ + 0, 0); + } +#endif + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/proc_iter_common.h b/tools/testing/selftests/bpf/progs/proc_iter_common.h new file mode 100644 index 000000000000..e69de29bb2d1 -- 2.25.1
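For reference, each line emitted by dump_diskstats() follows the 20-field /proc/diskstats layout spelled out by the field comments in the prog. An illustrative line for the throttle-only case, where merges, nsecs, io_ticks and the flush fields are all reported as 0 (the numbers are made up):

	 253       0 vda 1200 0 96000 0 800 0 64000 0 0 0 0 300 0 2400 0 0 0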

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Create a bpf iter target for the 'partitions' interface, to which the bpf prog can attach. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- block/genhd.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 92 insertions(+) diff --git a/block/genhd.c b/block/genhd.c index 9d9b60501bcb..f19a86b11dea 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1456,6 +1456,93 @@ static struct bpf_iter_reg diskstats_reg_info = { BTF_ID_LIST(btf_diststats_ids) BTF_ID(struct, block_device) BTF_ID(struct, disk_stats) + +static void *bpf_show_partition_start(struct seq_file *seqf, loff_t *pos) +{ + void *p; + + p = bpf_disk_seqf_start(seqf, pos); + if (!IS_ERR_OR_NULL(p) && !*pos) + seq_puts(seqf, "major minor #blocks name\n\n"); + return p; +} + +struct bpf_iter__partitions { + __bpf_md_ptr(struct bpf_iter_meta *, meta); + __bpf_md_ptr(struct block_device *, part); +}; + +DEFINE_BPF_ITER_FUNC(partitions, struct bpf_iter_meta *meta, + struct block_device *part) + +static void native_show_partition(struct seq_file *seqf, struct block_device *part) +{ + if (!bdev_nr_sectors(part)) + return; + seq_printf(seqf, "%4d %7d %10llu %pg\n", + MAJOR(part->bd_dev), MINOR(part->bd_dev), + bdev_nr_sectors(part) >> 1, part); +} + +static void __show_partition(struct seq_file *seqf, struct block_device *part) +{ + struct bpf_iter__partitions ctx; + struct bpf_iter_meta meta; + struct bpf_prog *prog; + + meta.seq = seqf; + prog = bpf_iter_get_info(&meta, false); + if (!prog) + return native_show_partition(seqf, part); + + ctx.meta = &meta; + ctx.part = part; + bpf_iter_run_prog(prog, &ctx); +} + +/* Inconvenient to operate Xarray in bpf progs. */ +static int bpf_show_partition(struct seq_file *seqf, void *v) +{ + struct gendisk *sgp = v; + struct block_device *part; + unsigned long idx; + + if (!get_capacity(sgp) || (sgp->flags & GENHD_FL_HIDDEN)) + return 0; + + rcu_read_lock(); + xa_for_each(&sgp->part_tbl, idx, part) + __show_partition(seqf, part); + rcu_read_unlock(); + return 0; +} + +static const struct seq_operations bpf_partitions_op = { + .start = bpf_show_partition_start, + .next = disk_seqf_next, + .stop = bpf_disk_seqf_stop, + .show = bpf_show_partition +}; + +static const struct bpf_iter_seq_info partitions_seq_info = { + .seq_ops = &bpf_partitions_op, + .init_seq_private = NULL, + .fini_seq_private = NULL, + .seq_priv_size = sizeof(struct class_dev_iter), +}; + +static struct bpf_iter_reg partitions_reg_info = { + .target = "partitions", + .ctx_arg_info_size = 1, + .ctx_arg_info = { + { offsetof(struct bpf_iter__partitions, part), + PTR_TO_BTF_ID }, // part won't be NULL + }, + .seq_info = &partitions_seq_info, +}; + +BTF_ID_LIST(btf_partitions_ids) +BTF_ID(struct, block_device) #endif /* CONFIG_BPF_RVI */ static int __init proc_genhd_init(void) @@ -1469,6 +1556,11 @@ static int __init proc_genhd_init(void) diskstats_reg_info.ctx_arg_info[0].btf_id = btf_diststats_ids[0]; diskstats_reg_info.ctx_arg_info[1].btf_id = btf_diststats_ids[1]; err = bpf_iter_reg_target(&diskstats_reg_info); + if (err) + return err; + + partitions_reg_info.ctx_arg_info[0].btf_id = btf_partitions_ids[0]; + err = bpf_iter_reg_target(&partitions_reg_info); #endif return err; } -- 2.25.1

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- When a container starts, it makes use of both mount namespace and chroot to achieve path isolation. So to get the /dev in the container, what we actually need is fs->root of container's init task (i.e. the reaper), not its mount namespace. Acquire reaper's fs->root, look up the /dev under it, and filter the partitions output based on the /dev content. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- block/genhd.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 121 insertions(+), 3 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index f19a86b11dea..b186530af3bd 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -29,6 +29,8 @@ #ifdef CONFIG_BPF_RVI #include <linux/bpf.h> #include <linux/pid_namespace.h> +#include <linux/namei.h> +#include <linux/fs_struct.h> #endif #include "blk-throttle.h" @@ -1457,13 +1459,125 @@ BTF_ID_LIST(btf_diststats_ids) BTF_ID(struct, block_device) BTF_ID(struct, disk_stats) +struct traverse_ctx { + struct dir_context ctx; + struct dentry *parent_dentry; + struct xarray *dev_list; + unsigned int index; +}; + +static bool filldir_callback(struct dir_context *ctx, const char *name, + int namelen, loff_t offset, u64 ino, + unsigned int d_type) +{ + struct traverse_ctx *tctx = container_of(ctx, struct traverse_ctx, ctx); + struct dentry *child_dentry; + struct inode *inode; + struct bdev_handle *handle; + void *rc; + + if (d_type != DT_BLK) + return true; + + child_dentry = lookup_one_len(name, tctx->parent_dentry, namelen); + if (IS_ERR(child_dentry)) { + pr_warn("Lookup failed for %s: %ld\n", name, PTR_ERR(child_dentry)); + return true; + } + + // double check if it's block dev + inode = d_inode(child_dentry); + if (!S_ISBLK(inode->i_mode)) + goto err_put; + + handle = bdev_open_by_dev(inode->i_rdev, BLK_OPEN_READ, NULL, NULL); + if (IS_ERR(handle)) { + pr_err("Failed to open block device %s (err=%ld)\n", + name, PTR_ERR(handle)); + goto err_put; + } + + rc = xa_store(tctx->dev_list, tctx->index++, handle->bdev, GFP_KERNEL); + if (xa_is_err(rc)) + pr_warn("xa_store() on %d failed\n", tctx->index - 1); + + bdev_release(handle); +err_put: + dput(child_dentry); + return true; +} + +static unsigned int get_targeted_dev(struct xarray *dev_list) +{ + struct task_struct *reaper; + struct path root_path, dev_path; + struct file *dir; + int ret; + struct traverse_ctx buf = { + .ctx.actor = filldir_callback, + .dev_list = dev_list, + .index = 0, + }; + + xa_init(dev_list); + reaper = get_current_level1_reaper(); + if (!reaper) + return 0; + + /* Reference: get_task_root() */ + task_lock(reaper); + if (!reaper->fs) { + task_unlock(reaper); + goto out_put_reaper; + } + get_fs_root(reaper->fs, &root_path); + task_unlock(reaper); + + /* + * For vfs_path_lookup(), @name being "dev" or "/dev" makes no + * difference, since struct nameidata.root is preset. 
+ */ + ret = vfs_path_lookup(root_path.dentry, root_path.mnt, "dev", + LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &dev_path); + if (ret) + goto out_put_root; + + dir = dentry_open(&dev_path, O_RDONLY, current_cred()); + if (IS_ERR(dir)) { + ret = PTR_ERR(dir); + goto out_put_devpath; + } + buf.parent_dentry = dev_path.dentry; + + iterate_dir(dir, &buf.ctx); + + filp_close(dir, NULL); +out_put_devpath: + path_put(&dev_path); +out_put_root: + path_put(&root_path); +out_put_reaper: + put_task_struct(reaper); + return buf.index; +} + +struct partitions_seq_priv { + struct class_dev_iter iter; // must be the first + struct xarray dev_list; + unsigned int dev_list_size; +}; + static void *bpf_show_partition_start(struct seq_file *seqf, loff_t *pos) { + struct partitions_seq_priv *priv = seqf->private; void *p; p = bpf_disk_seqf_start(seqf, pos); if (!IS_ERR_OR_NULL(p) && !*pos) seq_puts(seqf, "major minor #blocks name\n\n"); + + priv->dev_list_size = get_targeted_dev(&priv->dev_list); + return p; } @@ -1503,6 +1617,7 @@ static void __show_partition(struct seq_file *seqf, struct block_device *part) /* Inconvenient to operate Xarray in bpf progs. */ static int bpf_show_partition(struct seq_file *seqf, void *v) { + struct partitions_seq_priv *priv = seqf->private; struct gendisk *sgp = v; struct block_device *part; unsigned long idx; @@ -1511,8 +1626,11 @@ static int bpf_show_partition(struct seq_file *seqf, void *v) return 0; rcu_read_lock(); - xa_for_each(&sgp->part_tbl, idx, part) - __show_partition(seqf, part); + xa_for_each(&sgp->part_tbl, idx, part) { + for (int i = 0; i < priv->dev_list_size; ++i) + if (part == xa_load(&priv->dev_list, i)) + __show_partition(seqf, part); + } rcu_read_unlock(); return 0; } @@ -1528,7 +1646,7 @@ static const struct bpf_iter_seq_info partitions_seq_info = { .seq_ops = &bpf_partitions_op, .init_seq_private = NULL, .fini_seq_private = NULL, - .seq_priv_size = sizeof(struct class_dev_iter), + .seq_priv_size = sizeof(struct partitions_seq_priv), }; static struct bpf_iter_reg partitions_reg_info = { -- 2.25.1
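The intended effect is that a container whose /dev holds only its own device nodes sees only those partitions. Illustratively (sizes made up), a container with only nvme0n1 and nvme0n1p1 under its /dev would read:

	major minor  #blocks  name

	 259        0  500107608 nvme0n1
	 259        1  262144000 nvme0n1p1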

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Implement the bpf prog for the 'partitions' interface.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 samples/bpf/Makefile                 |  1 +
 samples/bpf/bpf_rvi_partitions.bpf.c | 42 ++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)
 create mode 100644 samples/bpf/bpf_rvi_partitions.bpf.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 91d73753c7fb..bc85adcb714f 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -157,6 +157,7 @@ always-$(CONFIG_BPF_RVI) += bpf_rvi_cpuinfo_arm64.bpf.o
 endif
 always-$(CONFIG_BPF_RVI) += bpf_rvi_cpu_online.bpf.o
 always-$(CONFIG_BPF_RVI) += bpf_rvi_diskstats.bpf.o
+always-$(CONFIG_BPF_RVI) += bpf_rvi_partitions.bpf.o
 
 ifeq ($(ARCH), arm)
 # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux
diff --git a/samples/bpf/bpf_rvi_partitions.bpf.c b/samples/bpf/bpf_rvi_partitions.bpf.c
new file mode 100644
index 000000000000..56afe576a789
--- /dev/null
+++ b/samples/bpf/bpf_rvi_partitions.bpf.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Huawei Technologies Co., Ltd */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+/* Copied from kdev_t.h */
+#define MINORBITS	20
+#define MINORMASK	((1U << MINORBITS) - 1)
+#define MAJOR(dev)	((unsigned int) ((dev) >> MINORBITS))
+#define MINOR(dev)	((unsigned int) ((dev) & MINORMASK))
+
+static inline sector_t bdev_nr_sectors(struct block_device *bdev)
+{
+	return bdev->bd_nr_sectors;
+}
+
+/* Reference: show_partition() in block/genhd.c */
+SEC("iter/partitions")
+s64 dump_partitions(struct bpf_iter__partitions *ctx)
+{
+	struct seq_file *m = ctx->meta->seq;
+	struct block_device *part = ctx->part;
+
+	if (!bdev_nr_sectors(part))
+		return 0;
+	BPF_SEQ_PRINTF(m, "%4d  %7d %10llu ",
+		       MAJOR(part->bd_dev), MINOR(part->bd_dev),
+		       bdev_nr_sectors(part) >> 1);
+
+	/*
+	 * Mimic %pg format of printk.
+	 * Reference: bdev_name() in lib/vsprintf.c
+	 */
+	if (part->bd_partno)
+		BPF_SEQ_PRINTF(m, "%sp%d\n", part->bd_disk->disk_name, part->bd_partno);
+	else
+		BPF_SEQ_PRINTF(m, "%s\n", part->bd_disk->disk_name);
+
+	return 0;
+}
-- 
2.25.1

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add struct pidns_loadavg to record and track the average load with respect to the tasks within a pid namespace. Use a delayed_work to update this data structure for all pid namespaces with LOAD_FREQ interval. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- include/linux/pid.h | 5 ++ include/linux/pid_namespace.h | 18 ++++++ kernel/bpf-rvi/Kconfig | 1 + kernel/pid.c | 10 +++ kernel/pid_namespace.c | 117 ++++++++++++++++++++++++++++++++++ 5 files changed, 151 insertions(+) diff --git a/include/linux/pid.h b/include/linux/pid.h index b90bc447d2a2..9ddee8589956 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -217,4 +217,9 @@ pid_t pid_vnr(struct pid *pid); } \ task = tg___; \ } while_each_pid_task(pid, type, task) + +#ifdef CONFIG_BPF_RVI +struct pidns_loadavg; +extern struct pidns_loadavg init_pidns_loadavg; +#endif #endif /* _LINUX_PID_H */ diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 28161eefca5d..062d6690b69a 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -45,7 +45,11 @@ struct pid_namespace { #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) int memfd_noexec_scope; #endif +#ifdef CONFIG_BPF_RVI + KABI_USE(1, struct pidns_loadavg *loadavg) +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) KABI_RESERVE(3) } __randomize_layout; @@ -132,6 +136,20 @@ static inline bool task_is_in_init_pid_ns(struct task_struct *tsk) #ifdef CONFIG_BPF_RVI extern struct task_struct *get_current_level1_reaper(void); + +/* + * This struct should be viewed as an extension but not an entity. + * IOW it doesn't hold refcount to struct pid_namespace (but the list does), and + * all its members are semantically embedded in struct pid_namespace. + */ +struct pidns_loadavg { + struct pid_namespace *pidns; + struct list_head list; + unsigned long load_tasks; + unsigned long avenrun[3]; +}; + +extern struct pidns_loadavg init_pidns_loadavg; #endif #endif /* _LINUX_PID_NS_H */ diff --git a/kernel/bpf-rvi/Kconfig b/kernel/bpf-rvi/Kconfig index 8a9cbac36a0c..c1a76498eeee 100644 --- a/kernel/bpf-rvi/Kconfig +++ b/kernel/bpf-rvi/Kconfig @@ -7,6 +7,7 @@ config BPF_RVI depends on BPF_SYSCALL depends on BPF_JIT depends on CPUSETS + depends on PID_NS select BPF_RVI_BLK_BFQ if IOSCHED_BFQ = y # built-in required default n help diff --git a/kernel/pid.c b/kernel/pid.c index 8000cf327985..dd14df48b118 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -93,9 +93,19 @@ struct pid_namespace init_pid_ns = { #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) .memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC, #endif +#ifdef CONFIG_BPF_RVI + .loadavg = &init_pidns_loadavg, +#endif }; EXPORT_SYMBOL_GPL(init_pid_ns); +#ifdef CONFIG_BPF_RVI +struct pidns_loadavg init_pidns_loadavg = { + .pidns = &init_pid_ns, + .list = LIST_HEAD_INIT(init_pidns_loadavg.list), +}; +#endif + /* * Note: disable interrupts while the pidmap_lock is held as an * interrupt might come in and do read_lock(&tasklist_lock). 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 1180070fc2a0..2e1afc01240a 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -22,11 +22,18 @@ #include <linux/export.h> #include <linux/sched/task.h> #include <linux/sched/signal.h> +#ifdef CONFIG_BPF_RVI +#include <linux/sched/loadavg.h> +#endif #include <linux/idr.h> #include "pid_sysctl.h" static DEFINE_MUTEX(pid_caches_mutex); static struct kmem_cache *pid_ns_cachep; +#ifdef CONFIG_BPF_RVI +static struct kmem_cache *pidns_loadavg_cachep; +static DEFINE_SPINLOCK(pidns_list_lock); +#endif /* Write once array, filled from the beginning. */ static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL]; @@ -116,6 +123,18 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns); #endif + +#ifdef CONFIG_BPF_RVI + ns->loadavg = kmem_cache_zalloc(pidns_loadavg_cachep, GFP_KERNEL); + if (ns->loadavg == NULL) + goto out_free_idr; + ns->loadavg->pidns = ns; + spin_lock(&pidns_list_lock); + // additional 1 refcount for the list + list_add_tail(&get_pid_ns(ns)->loadavg->list, &init_pidns_loadavg.list); + spin_unlock(&pidns_list_lock); +#endif + return ns; out_free_idr: @@ -142,6 +161,13 @@ static void destroy_pid_namespace(struct pid_namespace *ns) ns_free_inum(&ns->ns); idr_destroy(&ns->idr); +#ifdef CONFIG_BPF_RVI + /* + * ns->loadavg's lifecycle aligns precisely with ns, + * so don't need RCU delayed free. + */ + kmem_cache_free(pidns_loadavg_cachep, ns->loadavg); +#endif call_rcu(&ns->rcu, delayed_free_pidns); } @@ -481,6 +507,11 @@ const struct proc_ns_operations pidns_for_children_operations = { .get_parent = pidns_get_parent, }; +#ifdef CONFIG_BPF_RVI +static void pidns_calc_loadavg_workfn(struct work_struct *work); +static DECLARE_DELAYED_WORK(pidns_calc_loadavg_work, pidns_calc_loadavg_workfn); +#endif + static __init int pid_namespaces_init(void) { pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT); @@ -490,7 +521,93 @@ static __init int pid_namespaces_init(void) #endif register_pid_ns_sysctl_table_vm(); + +#ifdef CONFIG_BPF_RVI + pidns_loadavg_cachep = KMEM_CACHE(pidns_loadavg, SLAB_PANIC | SLAB_ACCOUNT); + schedule_delayed_work(&pidns_calc_loadavg_work, LOAD_FREQ); +#endif return 0; } __initcall(pid_namespaces_init); + +#ifdef CONFIG_BPF_RVI +static void pidns_list_reset(void) +{ + struct list_head *pos, *tmp; + + spin_lock(&pidns_list_lock); + list_for_each_safe(pos, tmp, &init_pidns_loadavg.list) { + struct pidns_loadavg *entry = list_entry(pos, struct pidns_loadavg, list); + struct pid_namespace *pidns = entry->pidns; + + /* + * Where the actual releasing of pidns is triggered: + * + * refcount == 1 means the pidns is only referred by this list, + * which should be released. + */ + if (refcount_read(&pidns->ns.count) == 1) { + list_del(pos); + put_pid_ns(pidns); + continue; + } + + pidns->loadavg->load_tasks = 0; // reset + } + spin_unlock(&pidns_list_lock); +} + +static void pidns_update_load_tasks(void) +{ + struct task_struct *p, *t; + + rcu_read_lock(); + for_each_process_thread(p, t) { + // exists for sure, don't need get_pid_ns() + struct pid_namespace *pidns = task_active_pid_ns(t); + unsigned int state = READ_ONCE(t->__state) & TASK_REPORT; + + if (state != TASK_UNINTERRUPTIBLE && state != TASK_RUNNING) + continue; + + // Skip calculating init_pid_ns's loadavg. Meaningless. 
+ while (pidns != &init_pid_ns) { + pidns->loadavg->load_tasks += 1; + pidns = pidns->parent; + } + } + rcu_read_unlock(); +} + +static void pidns_calc_avenrun(void) +{ + struct list_head *pos; + + spin_lock(&pidns_list_lock); + /* + * As the loadavg of init_pid_ns is exactly /proc/loadavg, avoid redundant + * re-calculation for init_pid_ns, and reuse init_pidns_loadavg.list as the + * list head. + */ + list_for_each(pos, &init_pidns_loadavg.list) { + struct pidns_loadavg *entry = list_entry(pos, struct pidns_loadavg, list); + long active = entry->load_tasks; + + /* Reference: calc_global_load() */ + active = active > 0 ? active * FIXED_1 : 0; + entry->avenrun[0] = calc_load(entry->avenrun[0], EXP_1, active); + entry->avenrun[1] = calc_load(entry->avenrun[1], EXP_5, active); + entry->avenrun[2] = calc_load(entry->avenrun[2], EXP_15, active); + } + spin_unlock(&pidns_list_lock); +} + +static void pidns_calc_loadavg_workfn(struct work_struct *work) +{ + pidns_list_reset(); + pidns_update_load_tasks(); + pidns_calc_avenrun(); + schedule_delayed_work(&pidns_calc_loadavg_work, LOAD_FREQ); +} +#endif /* CONFIG_BPF_RVI */ -- 2.25.1
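The avenrun[] values use the scheduler's 11-bit fixed-point format (FIXED_1 == 1 << 11 == 2048). A worked example of decoding one sample, mirroring the standard LOAD_INT()/LOAD_FRAC() macros and shown only for illustration:

	unsigned long v = 3072;				/* encodes a load of 1.50 */
	unsigned long whole = v >> 11;			/* 3072 / 2048 == 1 */
	unsigned long frac = ((v & 2047) * 100) >> 11;	/* 1024 * 100 / 2048 == 50 */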

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfuncs to get the active pid namespace of a task, and to get the number of tasks and the last pid of a given pid namespace. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- include/linux/pid_namespace.h | 11 +++++++++++ kernel/bpf/helpers.c | 27 +++++++++++++++++++++++++++ kernel/pid_namespace.c | 25 +++++++++++++++++++++++++ 3 files changed, 63 insertions(+) diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 062d6690b69a..43b18144f366 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -150,6 +150,17 @@ struct pidns_loadavg { }; extern struct pidns_loadavg init_pidns_loadavg; + +struct pid_iter { + unsigned int pid; + struct task_struct *task; +}; + +struct pid_iter next_pid(struct pid_namespace *ns, struct pid_iter iter); + +#define for_each_task_in_pidns(iter, ns) \ + for (iter = next_pid(ns, iter); iter.task; \ + iter.pid += 1, iter = next_pid(ns, iter)) #endif #endif /* _LINUX_PID_NS_H */ diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 140097c8198e..0e7c55a00124 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2386,6 +2386,30 @@ __bpf_kfunc struct task_struct *bpf_current_level1_reaper(void) return p; } + +__bpf_kfunc struct pid_namespace *bpf_task_active_pid_ns(struct task_struct *task) +{ + return task_active_pid_ns(task); +} + +__bpf_kfunc u64 bpf_pidns_nr_tasks(struct pid_namespace *ns) +{ + struct pid_iter iter; + u32 nr_running = 0, nr_threads = 0; + + for_each_task_in_pidns(iter, ns) { + nr_threads++; + if (task_is_running(iter.task)) + nr_running++; + } + + return (u64)nr_running << 32 | nr_threads; +} + +__bpf_kfunc u32 bpf_pidns_last_pid(struct pid_namespace *ns) +{ + return idr_get_cursor(&ns->idr) - 1; +} #endif /** @@ -2632,6 +2656,9 @@ BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU) BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL) #ifdef CONFIG_BPF_RVI BTF_ID_FLAGS(func, bpf_current_level1_reaper, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_task_active_pid_ns, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_pidns_nr_tasks) +BTF_ID_FLAGS(func, bpf_pidns_last_pid) #endif BTF_SET8_END(generic_btf_ids) diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 2e1afc01240a..c9d70a840842 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -610,4 +610,29 @@ static void pidns_calc_loadavg_workfn(struct work_struct *work) pidns_calc_avenrun(); schedule_delayed_work(&pidns_calc_loadavg_work, LOAD_FREQ); } + +/* Reference: next_tgid() in fs/proc/base.c */ +struct pid_iter next_pid(struct pid_namespace *ns, struct pid_iter iter) +{ + struct pid *pid; + + if (iter.task) + put_task_struct(iter.task); + rcu_read_lock(); +retry: + iter.task = NULL; + pid = find_ge_pid(iter.pid, ns); + if (pid) { + iter.pid = pid_nr_ns(pid, ns); + iter.task = pid_task(pid, PIDTYPE_PID); + // maybe don't need this + if (!iter.task) { + iter.pid += 1; + goto retry; + } + get_task_struct(iter.task); + } + rcu_read_unlock(); + return iter; +} #endif /* CONFIG_BPF_RVI */ -- 2.25.1
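Since bpf_pidns_nr_tasks() returns a single scalar, it packs two 32-bit counters into its u64 return value and the BPF side splits them back out. A minimal decoding sketch (the loadavg prog in the next patch does the same):

	u64 nr_mix = bpf_pidns_nr_tasks(ns);
	u32 nr_running = nr_mix >> 32;	/* high half */
	u32 nr_threads = (u32)nr_mix;	/* low half */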

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Implement the bpf prog for the 'loadavg' interface. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- kernel/bpf-rvi/generic_single_iter.c | 1 + samples/bpf/Makefile | 1 + samples/bpf/bpf_rvi_loadavg.bpf.c | 60 ++++++++++++++++++++++++++++ 3 files changed, 62 insertions(+) create mode 100644 samples/bpf/bpf_rvi_loadavg.bpf.c diff --git a/kernel/bpf-rvi/generic_single_iter.c b/kernel/bpf-rvi/generic_single_iter.c index c8b462427366..542241aed597 100644 --- a/kernel/bpf-rvi/generic_single_iter.c +++ b/kernel/bpf-rvi/generic_single_iter.c @@ -50,6 +50,7 @@ static const struct seq_operations generic_single_seq_ops = { /* * Users of "generic_single" iter type: * - cpu_online + * - loadavg */ DEFINE_BPF_ITER_FUNC(generic_single, struct bpf_iter_meta *meta) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index bc85adcb714f..b7afc09514d9 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -158,6 +158,7 @@ endif always-$(CONFIG_BPF_RVI) += bpf_rvi_cpu_online.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_diskstats.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_partitions.bpf.o +always-$(CONFIG_BPF_RVI) += bpf_rvi_loadavg.bpf.o ifeq ($(ARCH), arm) # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux diff --git a/samples/bpf/bpf_rvi_loadavg.bpf.c b/samples/bpf/bpf_rvi_loadavg.bpf.c new file mode 100644 index 000000000000..2841488b2bfc --- /dev/null +++ b/samples/bpf/bpf_rvi_loadavg.bpf.c @@ -0,0 +1,60 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Huawei Technologies Co., Ltd */ +#include <vmlinux.h> +#include <bpf/bpf_core_read.h> +#include <bpf/bpf_helpers.h> + +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; +void bpf_task_release(struct task_struct *p) __ksym; +struct task_struct *bpf_current_level1_reaper(void) __ksym; +struct pid_namespace *bpf_task_active_pid_ns(struct task_struct *task) __ksym; +u64 bpf_pidns_nr_tasks(struct pid_namespace *ns) __ksym; +u32 bpf_pidns_last_pid(struct pid_namespace *ns) __ksym; + +char _license[] SEC("license") = "GPL"; + +#define FSHIFT 11 /* nr of bits of precision */ +#define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */ +#define LOAD_INT(x) ((x) >> FSHIFT) +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) + +#define RET_OK 0 +#define RET_FAIL 1 +#define RET_SKIP -1 + +SEC("iter/generic_single") +s64 dump_loadavg(struct bpf_iter__generic_single *ctx) +{ + struct seq_file *m = ctx->meta->seq; + struct task_struct *reaper; + struct pid_namespace *pidns; + u64 nr_mix; + u32 nr_running, nr_threads, last_pid; + unsigned long avenrun[3]; + int ret = RET_OK; + + reaper = bpf_current_level1_reaper(); + if (!reaper) + return RET_FAIL; + bpf_rcu_read_lock(); + + pidns = bpf_task_active_pid_ns(reaper); + // ~= memcpy(avenrun, pidns->loadavg->avenrun, sizeof(avenrun)) + BPF_CORE_READ_INTO(&avenrun, pidns, loadavg, avenrun); + + nr_mix = bpf_pidns_nr_tasks(pidns); + nr_running = nr_mix >> 32; + nr_threads = (u32)nr_mix; + last_pid = bpf_pidns_last_pid(pidns); + + BPF_SEQ_PRINTF(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n", + LOAD_INT(avenrun[0]), LOAD_FRAC(avenrun[0]), + LOAD_INT(avenrun[1]), LOAD_FRAC(avenrun[1]), + LOAD_INT(avenrun[2]), LOAD_FRAC(avenrun[2]), + nr_running, nr_threads, last_pid); + + bpf_rcu_read_unlock(); + bpf_task_release(reaper); + return ret; +} -- 2.25.1
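With this prog attached, reading /proc/loadavg inside a container yields the usual five fields, but scoped to the container's pid namespace; an illustrative sample with made-up values:

	0.08 0.03 0.01 1/165 1024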

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add bpf_task_ca_cpuusage() to get the total CPU usage of the CPU
accounting cgroup that a task belongs to.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 kernel/sched/cpuacct.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3d3d12b60572..801f3df9734c 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -414,6 +414,13 @@ struct cpuacct *task_cpuacct(struct task_struct *tsk)
 	return tsk ? task_ca(tsk) : NULL;
 }
 
+__bpf_kfunc u64 bpf_task_ca_cpuusage(struct task_struct *p)
+{
+	if (!p)
+		return 0;
+	return cpuusage_read(task_css(p, cpuacct_cgrp_id), NULL);
+}
+
 __bpf_kfunc void bpf_cpuacct_kcpustat_cpu_fetch(struct kernel_cpustat *dst,
 						struct cpuacct *ca, int cpu)
 {
@@ -421,6 +428,7 @@ __bpf_kfunc void bpf_cpuacct_kcpustat_cpu_fetch(struct kernel_cpustat *dst,
 }
 
 BTF_SET8_START(bpf_cpuacct_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_task_ca_cpuusage)
 BTF_ID_FLAGS(func, bpf_cpuacct_kcpustat_cpu_fetch)
 BTF_SET8_END(bpf_cpuacct_kfunc_ids)
-- 
2.25.1
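On the BPF side the new kfunc only needs a __ksym declaration before it can be called; a minimal usage sketch (this matches how the uptime prog in the next patch consumes it):

	u64 bpf_task_ca_cpuusage(struct task_struct *p) __ksym;

	u64 usage = bpf_task_ca_cpuusage(reaper);	/* returns 0 if reaper is NULL */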

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Implement the bpf prog for the 'uptime' interface. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- kernel/bpf-rvi/generic_single_iter.c | 1 + samples/bpf/Makefile | 1 + samples/bpf/bpf_rvi_uptime.bpf.c | 122 +++++++++++++++++++++++++++ 3 files changed, 124 insertions(+) create mode 100644 samples/bpf/bpf_rvi_uptime.bpf.c diff --git a/kernel/bpf-rvi/generic_single_iter.c b/kernel/bpf-rvi/generic_single_iter.c index 542241aed597..88ced6d8fabd 100644 --- a/kernel/bpf-rvi/generic_single_iter.c +++ b/kernel/bpf-rvi/generic_single_iter.c @@ -51,6 +51,7 @@ static const struct seq_operations generic_single_seq_ops = { * Users of "generic_single" iter type: * - cpu_online * - loadavg + * - uptime */ DEFINE_BPF_ITER_FUNC(generic_single, struct bpf_iter_meta *meta) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index b7afc09514d9..4aeb40711241 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -159,6 +159,7 @@ always-$(CONFIG_BPF_RVI) += bpf_rvi_cpu_online.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_diskstats.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_partitions.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_loadavg.bpf.o +always-$(CONFIG_BPF_RVI) += bpf_rvi_uptime.bpf.o ifeq ($(ARCH), arm) # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux diff --git a/samples/bpf/bpf_rvi_uptime.bpf.c b/samples/bpf/bpf_rvi_uptime.bpf.c new file mode 100644 index 000000000000..2c6d539476b0 --- /dev/null +++ b/samples/bpf/bpf_rvi_uptime.bpf.c @@ -0,0 +1,122 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Huawei Technologies Co., Ltd */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> + +#define NSEC_PER_SEC 1000000000L + +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; +void bpf_task_release(struct task_struct *p) __ksym; +struct task_struct *bpf_current_level1_reaper(void) __ksym; +struct cpuset *bpf_cpuset_from_task(struct task_struct *p) __ksym; +void cgroup_rstat_flush_atomic(struct cgroup *cgrp) __ksym; +unsigned int bpf_cpumask_weight(struct cpumask *pmask) __ksym; +u64 bpf_task_ca_cpuusage(struct task_struct *p) __ksym; + +char _license[] SEC("license") = "GPL"; + +static int task_effective_cpus_num(struct task_struct *reaper) +{ + struct cpuset *cpuset; + + cpuset = bpf_cpuset_from_task(reaper); + if (!cpuset) + return -1; + + return bpf_cpumask_weight(cpuset->effective_cpus); +} + +/* cpuacct.usage of cgroup v1. See cpuusage_read(). */ +static int cgroup_v1_get_cpuusage(struct task_struct *reaper, u64 *usage) +{ + *usage = bpf_task_ca_cpuusage(reaper); + return 0; +} + +/* "usage_usec" of cpu.stat of cgroup v2. See cgroup_base_stat_cputime_show(). */ +static int cgroup_v2_get_cpuusage(struct task_struct *reaper, u64 *usage) +{ + struct cgroup *cgroup; + + cgroup = reaper->cgroups->dfl_cgrp; + if (!cgroup) + return -1; + cgroup_rstat_flush_atomic(cgroup); + *usage = cgroup->bstat.cputime.sum_exec_runtime; + + return 0; +} + +/* + * What LXCFS uses is cpuacct.usage, which, as LXCFS's code comment says, might overestimate the + * container's busy time if the container doesn't have its own cpuacct cgroup. 
+ */ +static int cgroup_get_cpuusage(struct task_struct *reaper, u64 *usage) +{ + int err; + + bpf_rcu_read_lock(); + if (reaper->cgroups->dfl_cgrp) + err = cgroup_v2_get_cpuusage(reaper, usage); + else + err = cgroup_v1_get_cpuusage(reaper, usage); + bpf_rcu_read_unlock(); + return err; +} + +#define RET_OK 0 +#define RET_FAIL 1 +#define RET_SKIP -1 + +SEC("iter/generic_single") +s64 dump_uptime(struct bpf_iter__generic_single *ctx) +{ + struct seq_file *m = ctx->meta->seq; + struct task_struct *reaper; + u64 cur_timestamp; + u64 runtime, totaltime, idletime = 0, cpuusage = 0; + u64 run_sec, run_nsec, idle_sec, idle_nsec; + unsigned int cpu_count; + int err, ret = RET_FAIL; + + reaper = bpf_current_level1_reaper(); + if (!reaper) + return RET_FAIL; + err = cgroup_get_cpuusage(reaper, &cpuusage); + if (err) + goto err; + cpu_count = task_effective_cpus_num(reaper); + if (cpu_count == -1) + goto err; + + cur_timestamp = bpf_ktime_get_boot_ns(); // "ns": nanosecond, not namespace + /* + * LXCFS takes the 22nd column from /proc/<pid>/stat, which is task->start_boottime (with + * transforming unit and adding timens bias). See do_task_stat() for details. + */ + if (cur_timestamp < reaper->start_boottime) + runtime = 0; + else + runtime = cur_timestamp - reaper->start_boottime; + /* + * As implemented in uptime_proc_show(), idle time of the original /proc/uptime is the sum + * of each cpu's idle time. Here we calculate it the other way around: subtract the total + * amount of cpu time by cpu usage. + */ + totaltime = runtime * cpu_count; + if (totaltime > cpuusage) + idletime = totaltime - cpuusage; + + run_sec = runtime / NSEC_PER_SEC; + run_nsec = runtime % NSEC_PER_SEC; + idle_sec = idletime / NSEC_PER_SEC; + idle_nsec = idletime % NSEC_PER_SEC; + BPF_SEQ_PRINTF(m, "%llu.%02llu %llu.%02llu\n", run_sec, run_nsec / (NSEC_PER_SEC / 100), + idle_sec, idle_nsec / (NSEC_PER_SEC / 100)); + + ret = RET_OK; +err: + bpf_task_release(reaper); + return ret; +} -- 2.25.1
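A worked example of the idle-time arithmetic above, with made-up numbers: for cpu_count == 4, runtime == 100s and cpuusage == 120s of aggregate busy time across those CPUs, totaltime == 400s and idletime == 280s, so the prog emits "100.00 280.00".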

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add bpf_si_memswinfo() to acquire the memory- and swap-related parts of
sysinfo.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 kernel/bpf-rvi/Makefile        |  2 +-
 kernel/bpf-rvi/common_kfuncs.c | 54 ++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf-rvi/common_kfuncs.c

diff --git a/kernel/bpf-rvi/Makefile b/kernel/bpf-rvi/Makefile
index 8c226d5f1b3e..9c846eefda4d 100644
--- a/kernel/bpf-rvi/Makefile
+++ b/kernel/bpf-rvi/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 # Copyright (c) 2025 Huawei Technologies Co., Ltd
 
-obj-y := generic_single_iter.o
+obj-y := generic_single_iter.o common_kfuncs.o
diff --git a/kernel/bpf-rvi/common_kfuncs.c b/kernel/bpf-rvi/common_kfuncs.c
new file mode 100644
index 000000000000..53b7ca1acc39
--- /dev/null
+++ b/kernel/bpf-rvi/common_kfuncs.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025 Huawei Technologies Co., Ltd */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/types.h>
+#include <linux/btf_ids.h>
+#include <linux/bpf.h>
+
+/*
+ * No padding member "_f" compared to struct sysinfo, because sizeof(_f)
+ * may be zero, which the bpf verifier does not support.
+ */
+struct bpf_sysinfo {
+	__kernel_long_t uptime;		/* Seconds since boot */
+	__kernel_ulong_t loads[3];	/* 1, 5, and 15 minute load averages */
+	__kernel_ulong_t totalram;	/* Total usable main memory size */
+	__kernel_ulong_t freeram;	/* Available memory size */
+	__kernel_ulong_t sharedram;	/* Amount of shared memory */
+	__kernel_ulong_t bufferram;	/* Memory used by buffers */
+	__kernel_ulong_t totalswap;	/* Total swap space size */
+	__kernel_ulong_t freeswap;	/* swap space still available */
+	__u16 procs;			/* Number of current processes */
+	__u16 pad;			/* Explicit padding for m68k */
+	__kernel_ulong_t totalhigh;	/* Total high memory size */
+	__kernel_ulong_t freehigh;	/* Available high memory size */
+	__u32 mem_unit;			/* Memory unit size in bytes */
+};
+
+__bpf_kfunc void bpf_si_memswinfo(struct bpf_sysinfo *bsi)
+{
+	struct sysinfo *si = (struct sysinfo *)bsi;
+
+	if (si) {
+		si_meminfo(si);
+		si_swapinfo(si);
+	}
+}
+
+BTF_SET8_START(bpf_common_kfuncs_ids)
+BTF_ID_FLAGS(func, bpf_si_memswinfo)
+BTF_SET8_END(bpf_common_kfuncs_ids)
+
+static const struct btf_kfunc_id_set bpf_common_kfuncs_set = {
+	.owner = THIS_MODULE,
+	.set = &bpf_common_kfuncs_ids,
+};
+
+static int __init bpf_common_kfuncs_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					 &bpf_common_kfuncs_set);
+}
+late_initcall(bpf_common_kfuncs_init);
-- 
2.25.1
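bpf_si_memswinfo() relies on struct bpf_sysinfo and struct sysinfo sharing the same layout up to mem_unit, with only the trailing _f padding missing. A compile-time guard one could add next to the cast (a sketch, not part of the patch):

	BUILD_BUG_ON(offsetof(struct bpf_sysinfo, mem_unit) !=
		     offsetof(struct sysinfo, mem_unit));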

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- This kfunc is to enable bpf progs to perform atomic read on ->usage of struct page_counter. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- kernel/bpf-rvi/common_kfuncs.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/kernel/bpf-rvi/common_kfuncs.c b/kernel/bpf-rvi/common_kfuncs.c index 53b7ca1acc39..c93b1fc08395 100644 --- a/kernel/bpf-rvi/common_kfuncs.c +++ b/kernel/bpf-rvi/common_kfuncs.c @@ -4,6 +4,7 @@ #include <linux/mm.h> #include <linux/swap.h> #include <linux/types.h> +#include <linux/page_counter.h> #include <linux/btf_ids.h> #include <linux/bpf.h> @@ -37,8 +38,14 @@ __bpf_kfunc void bpf_si_memswinfo(struct bpf_sysinfo *bsi) } } +__bpf_kfunc unsigned long bpf_page_counter_read(struct page_counter *counter) +{ + return page_counter_read(counter); +} + BTF_SET8_START(bpf_common_kfuncs_ids) BTF_ID_FLAGS(func, bpf_si_memswinfo) +BTF_ID_FLAGS(func, bpf_page_counter_read) BTF_SET8_END(bpf_common_kfuncs_ids) static const struct btf_kfunc_id_set bpf_common_kfuncs_set = { -- 2.25.1

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Implement the bpf prog for the 'swaps' interface. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- kernel/bpf-rvi/generic_single_iter.c | 1 + samples/bpf/Makefile | 1 + samples/bpf/bpf_rvi_swaps.bpf.c | 104 +++++++++++++++++++++++++++ 3 files changed, 106 insertions(+) create mode 100644 samples/bpf/bpf_rvi_swaps.bpf.c diff --git a/kernel/bpf-rvi/generic_single_iter.c b/kernel/bpf-rvi/generic_single_iter.c index 88ced6d8fabd..37b2db9020e8 100644 --- a/kernel/bpf-rvi/generic_single_iter.c +++ b/kernel/bpf-rvi/generic_single_iter.c @@ -52,6 +52,7 @@ static const struct seq_operations generic_single_seq_ops = { * - cpu_online * - loadavg * - uptime + * - swaps */ DEFINE_BPF_ITER_FUNC(generic_single, struct bpf_iter_meta *meta) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 4aeb40711241..ff59231e80de 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -160,6 +160,7 @@ always-$(CONFIG_BPF_RVI) += bpf_rvi_diskstats.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_partitions.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_loadavg.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_uptime.bpf.o +always-$(CONFIG_BPF_RVI) += bpf_rvi_swaps.bpf.o ifeq ($(ARCH), arm) # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux diff --git a/samples/bpf/bpf_rvi_swaps.bpf.c b/samples/bpf/bpf_rvi_swaps.bpf.c new file mode 100644 index 000000000000..50befdc272c7 --- /dev/null +++ b/samples/bpf/bpf_rvi_swaps.bpf.c @@ -0,0 +1,104 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Huawei Technologies Co., Ltd */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> + +struct task_struct *bpf_current_level1_reaper(void) __ksym; +void bpf_task_release(struct task_struct *p) __ksym; +struct mem_cgroup *bpf_mem_cgroup_from_task(struct task_struct *p) __ksym; +void bpf_si_memswinfo(struct bpf_sysinfo *si) __ksym; +unsigned long bpf_atomic_long_read(const atomic_long_t *v) __ksym; +unsigned long bpf_page_counter_read(struct page_counter *pc) __ksym; +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; +void cgroup_rstat_flush_atomic(struct cgroup *cgrp) __ksym; + +char _license[] SEC("license") = "GPL"; + +/* Reference: https://docs.ebpf.io/ebpf-library/libbpf/ebpf/__ksym/ */ +extern void cgrp_dfl_root __ksym; +/* Reference: cgroup_on_dfl() */ +static inline bool cgroup_on_dfl(const struct cgroup *cgrp) +{ + return cgrp->root == &cgrp_dfl_root; +} + +#define RET_OK 0 +#define RET_FAIL 1 +#define RET_SKIP -1 + +SEC("iter/generic_single") +s64 dump_swaps(struct bpf_iter__generic_single *ctx) +{ + struct seq_file *m = ctx->meta->seq; + struct task_struct *reaper; + struct mem_cgroup *memcg; + struct bpf_sysinfo si = {}; + u64 limit, usage, swapusage = 0, swaptotal = 0; + u64 kb_per_page; + + reaper = bpf_current_level1_reaper(); + if (!reaper) + return RET_FAIL; + bpf_rcu_read_lock(); + memcg = bpf_mem_cgroup_from_task(reaper); + if (!memcg) { + bpf_rcu_read_unlock(); + bpf_task_release(reaper); + return RET_FAIL; + } + + bpf_si_memswinfo(&si); + cgroup_rstat_flush_atomic(memcg->css.cgroup); + limit = memcg->memory.max; + /* + * si.totalram: size in pages + * si.mem_unit: PAGE_SIZE + * memcg->memory.{max,...}: counting in pages + */ + if (limit == 0 || limit > si.totalram) + limit = si.totalram; + /* + * Reference: page_counter_read(). + * memcg->memory.usage is atomic, should be read by (bpf_)atomic_long_read. 
+ * Consider using mem_cgroup_usage(memcg, true/false)? + */ + usage = bpf_page_counter_read(&memcg->memory); + if (usage == 0 || usage > limit) + usage = limit; + + if (cgroup_on_dfl(memcg->css.cgroup)) { // if memcg is on V2 hierarchy + swaptotal = memcg->swap.max; + swapusage = bpf_page_counter_read(&memcg->swap); + } else { + u64 memsw_limit = memcg->memsw.max; // memsw = mem + swap + u64 memsw_usage = bpf_page_counter_read(&memcg->memsw); + + /* + * Reasonably, memsw.max should >= memory.max, as memsw = mem + swap in V1. + * But it's not necessarily the case, as users may configure them as they wish. + */ + if (memsw_limit > limit) + swaptotal = memsw_limit - limit; + /* Similar treatment for {memsw,memory}.usage */ + if (swaptotal && memsw_usage > usage) + swapusage = memsw_usage - usage; + } + if (swaptotal > si.totalswap) + swaptotal = si.totalswap; + if (swapusage > si.totalswap - si.freeswap) + swapusage = si.totalswap - si.freeswap; + + kb_per_page = si.mem_unit >> 10; + /* Reference: swap_show(). Aligned with LXCFS. */ + BPF_SEQ_PRINTF(m, "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"); + if (swaptotal > 0) + BPF_SEQ_PRINTF(m, "none%*svirtual\t\t%llu\t%llu\t0\n", + 36, " ", swaptotal * kb_per_page, + swapusage * kb_per_page); // in KB + + bpf_rcu_read_unlock(); + bpf_task_release(reaper); + + return RET_OK; +} -- 2.25.1
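For reference, the virtual swap entry emitted above would render like this (sizes made up, in KB, matching the LXCFS-style column layout):

	Filename                                Type            Size            Used            Priority
	none                                    virtual         1048576         2048            0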

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add a special kfunc for appending the buffer content of one seq_file to
the end of another.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 kernel/bpf-rvi/common_kfuncs.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/bpf-rvi/common_kfuncs.c b/kernel/bpf-rvi/common_kfuncs.c
index c93b1fc08395..c8c47f72e7fd 100644
--- a/kernel/bpf-rvi/common_kfuncs.c
+++ b/kernel/bpf-rvi/common_kfuncs.c
@@ -5,6 +5,7 @@
 #include <linux/swap.h>
 #include <linux/types.h>
 #include <linux/page_counter.h>
+#include <linux/seq_file.h>
 #include <linux/btf_ids.h>
 #include <linux/bpf.h>
 
@@ -43,9 +44,25 @@ __bpf_kfunc unsigned long bpf_page_counter_read(struct page_counter *counter)
 	return page_counter_read(counter);
 }
 
+/* Append src's content to the end of dst. Reference: seq_vprintf(). */
+__bpf_kfunc void bpf_seq_file_append(struct seq_file *dst, struct seq_file *src)
+{
+	/*
+	 * ->count: length of content
+	 * ->size: available buffer space
+	 * i.e. seq_printf(dst, "%s", src->buf)
+	 */
+	if (dst->count < dst->size &&
+	    src->count < dst->size - dst->count) {
+		memmove(dst->buf + dst->count, src->buf, src->count);
+		dst->count += src->count;
+	}
+}
+
 BTF_SET8_START(bpf_common_kfuncs_ids)
 BTF_ID_FLAGS(func, bpf_si_memswinfo)
 BTF_ID_FLAGS(func, bpf_page_counter_read)
+BTF_ID_FLAGS(func, bpf_seq_file_append)
 BTF_SET8_END(bpf_common_kfuncs_ids)
 
 static const struct btf_kfunc_id_set bpf_common_kfuncs_set = {
-- 
2.25.1
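A usage sketch for the new kfunc: the stat prog added later in this series buffers the per-cpu "cpuN ..." lines in a scratch seq_file and flushes them after the aggregated "cpu" line, roughly like so (hedged; the exact call site lives in that prog):

	void bpf_seq_file_append(struct seq_file *dst, struct seq_file *src) __ksym;

	/* after emitting the aggregated "cpu ..." line into m: */
	bpf_seq_file_append(m, seqf_pcpu);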

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add the following three kfuncs:

- bpf_show_all_irqs
- bpf_get_boottime_timens
- bpf_get_total_forks

These are only used by the /proc/stat interface.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 fs/proc/stat.c                 |  6 ++++++
 kernel/bpf-rvi/common_kfuncs.c | 17 +++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9b58e9ded6bf..1d8699285e02 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -231,9 +231,15 @@ __bpf_kfunc u64 bpf_get_iowait_time(struct kernel_cpustat *kcs, int cpu)
 	return get_iowait_time(kcs, cpu);
 }
 
+__bpf_kfunc void bpf_show_all_irqs(struct seq_file *p)
+{
+	show_all_irqs(p);
+}
+
 BTF_SET8_START(bpf_proc_stat_kfunc_ids)
 BTF_ID_FLAGS(func, bpf_get_idle_time)
 BTF_ID_FLAGS(func, bpf_get_iowait_time)
+BTF_ID_FLAGS(func, bpf_show_all_irqs)
 BTF_SET8_END(bpf_proc_stat_kfunc_ids)
 
 static const struct btf_kfunc_id_set bpf_proc_stat_kfunc_set = {
diff --git a/kernel/bpf-rvi/common_kfuncs.c b/kernel/bpf-rvi/common_kfuncs.c
index c8c47f72e7fd..abcfb4515372 100644
--- a/kernel/bpf-rvi/common_kfuncs.c
+++ b/kernel/bpf-rvi/common_kfuncs.c
@@ -6,6 +6,10 @@
 #include <linux/types.h>
 #include <linux/page_counter.h>
 #include <linux/seq_file.h>
+#include <linux/sched/stat.h>
+#include <linux/time64.h>
+#include <linux/timekeeping.h>
+#include <linux/time_namespace.h>
 #include <linux/btf_ids.h>
 #include <linux/bpf.h>
 
@@ -59,10 +63,23 @@ __bpf_kfunc void bpf_seq_file_append(struct seq_file *dst, struct seq_file *src)
 	}
 }
 
+__bpf_kfunc void bpf_get_boottime_timens(struct task_struct *tsk, struct timespec64 *boottime)
+{
+	getboottime64(boottime);
+	*boottime = timespec64_sub(*boottime, tsk->nsproxy->time_ns->offsets.boottime);
+}
+
+__bpf_kfunc unsigned long bpf_get_total_forks(void)
+{
+	return total_forks;
+}
+
 BTF_SET8_START(bpf_common_kfuncs_ids)
 BTF_ID_FLAGS(func, bpf_si_memswinfo)
 BTF_ID_FLAGS(func, bpf_page_counter_read)
 BTF_ID_FLAGS(func, bpf_seq_file_append)
+BTF_ID_FLAGS(func, bpf_get_boottime_timens)
+BTF_ID_FLAGS(func, bpf_get_total_forks)
 BTF_SET8_END(bpf_common_kfuncs_ids)
 
 static const struct btf_kfunc_id_set bpf_common_kfuncs_set = {
-- 
2.25.1
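A sketch of how a stat prog would consume these (illustrative; boottime is a local struct timespec64 and reaper comes from bpf_current_level1_reaper()):

	struct timespec64 boottime = {};

	bpf_get_boottime_timens(reaper, &boottime);
	BPF_SEQ_PRINTF(m, "btime %llu\n", (u64)boottime.tv_sec);
	BPF_SEQ_PRINTF(m, "processes %lu\n", bpf_get_total_forks());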

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add bpf_nr_{running,context_switches,iowait} to get statistics about CPU runqueue and scheduling. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- kernel/bpf-rvi/common_kfuncs.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/kernel/bpf-rvi/common_kfuncs.c b/kernel/bpf-rvi/common_kfuncs.c index abcfb4515372..9eae982c081d 100644 --- a/kernel/bpf-rvi/common_kfuncs.c +++ b/kernel/bpf-rvi/common_kfuncs.c @@ -7,6 +7,7 @@ #include <linux/page_counter.h> #include <linux/seq_file.h> #include <linux/sched/stat.h> +#include <linux/kernel_stat.h> #include <linux/time64.h> #include <linux/timekeeping.h> #include <linux/time_namespace.h> @@ -74,12 +75,30 @@ __bpf_kfunc unsigned long bpf_get_total_forks(void) return total_forks; } +__bpf_kfunc unsigned int bpf_nr_running(void) +{ + return nr_running(); +} + +__bpf_kfunc unsigned long long bpf_nr_context_switches(void) +{ + return nr_context_switches(); +} + +__bpf_kfunc unsigned int bpf_nr_iowait(void) +{ + return nr_iowait(); +} + BTF_SET8_START(bpf_common_kfuncs_ids) BTF_ID_FLAGS(func, bpf_si_memswinfo) BTF_ID_FLAGS(func, bpf_page_counter_read) BTF_ID_FLAGS(func, bpf_seq_file_append) BTF_ID_FLAGS(func, bpf_get_boottime_timens) BTF_ID_FLAGS(func, bpf_get_total_forks) +BTF_ID_FLAGS(func, bpf_nr_running) +BTF_ID_FLAGS(func, bpf_nr_context_switches) +BTF_ID_FLAGS(func, bpf_nr_iowait) BTF_SET8_END(bpf_common_kfuncs_ids) static const struct btf_kfunc_id_set bpf_common_kfuncs_set = { -- 2.25.1
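These map one-to-one onto the tail lines of /proc/stat; a minimal sketch of the intended use:

	BPF_SEQ_PRINTF(m, "ctxt %llu\n", bpf_nr_context_switches());
	BPF_SEQ_PRINTF(m, "procs_running %u\n", bpf_nr_running());
	BPF_SEQ_PRINTF(m, "procs_blocked %u\n", bpf_nr_iowait());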

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK

--------------------------------

Add kstat_ and kcpustat_ kfuncs to get statistics about {soft,}irqs and
all categories of vtime for a specific CPU.

Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com>
---
 kernel/bpf-rvi/common_kfuncs.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/kernel/bpf-rvi/common_kfuncs.c b/kernel/bpf-rvi/common_kfuncs.c
index 9eae982c081d..2b78ca5ca1f8 100644
--- a/kernel/bpf-rvi/common_kfuncs.c
+++ b/kernel/bpf-rvi/common_kfuncs.c
@@ -90,6 +90,25 @@ __bpf_kfunc unsigned int bpf_nr_iowait(void)
 	return nr_iowait();
 }
 
+/*
+ * Kernel statistics for CPU accounting
+ */
+
+__bpf_kfunc unsigned int bpf_kstat_softirqs_cpu(unsigned int irq, int cpu)
+{
+	return kstat_softirqs_cpu(irq, cpu);
+}
+
+__bpf_kfunc unsigned long bpf_kstat_cpu_irqs_sum(unsigned int cpu)
+{
+	return kstat_cpu_irqs_sum(cpu);
+}
+
+__bpf_kfunc void bpf_kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu)
+{
+	kcpustat_cpu_fetch(dst, cpu);
+}
+
 BTF_SET8_START(bpf_common_kfuncs_ids)
 BTF_ID_FLAGS(func, bpf_si_memswinfo)
 BTF_ID_FLAGS(func, bpf_page_counter_read)
@@ -99,6 +118,9 @@ BTF_ID_FLAGS(func, bpf_get_total_forks)
 BTF_ID_FLAGS(func, bpf_nr_running)
 BTF_ID_FLAGS(func, bpf_nr_context_switches)
 BTF_ID_FLAGS(func, bpf_nr_iowait)
+BTF_ID_FLAGS(func, bpf_kstat_softirqs_cpu)
+BTF_ID_FLAGS(func, bpf_kstat_cpu_irqs_sum)
+BTF_ID_FLAGS(func, bpf_kcpustat_cpu_fetch)
 BTF_SET8_END(bpf_common_kfuncs_ids)
 
 static const struct btf_kfunc_id_set bpf_common_kfuncs_set = {
-- 
2.25.1
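A sketch of the intended per-cpu accumulation (the stat prog later in the series does essentially this for the "intr" and "softirq" totals):

	unsigned long sum = bpf_kstat_cpu_irqs_sum(cpu);
	int j;

	for (j = 0; j < NR_SOFTIRQS; j++)
		sum_softirq += bpf_kstat_softirqs_cpu(j, cpu);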

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Create a bpf iter target for the 'stat' interface, to which the bpf prog can attach. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- fs/proc/stat.c | 164 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 164 insertions(+) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 1d8699285e02..4757e5b1be38 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -18,6 +18,9 @@ #include <linux/bpf.h> #include <linux/btf.h> #include <linux/btf_ids.h> +#include <linux/pid_namespace.h> +#include <linux/cgroup.h> +#include <linux/cpuset.h> #endif #ifndef arch_irq_stat_cpu @@ -253,4 +256,165 @@ static int __init bpf_proc_stat_kfunc_init(void) &bpf_proc_stat_kfunc_set); } late_initcall(bpf_proc_stat_kfunc_init); + +struct stat_sum_data { + u64 user, nice, system, idle, iowait, irq, softirq, steal; + u64 guest, guest_nice; + u64 sum; + u64 sum_softirq; + unsigned int per_softirq_sums[NR_SOFTIRQS]; +}; + +struct stat_seq_priv { + cpumask_t allowed_mask; + struct cpuacct *cpuacct; + bool sum_printed; + struct task_struct *task; + struct seq_file seqf_pcpu; +}; + +static int seq_file_setup(struct seq_file *seq) +{ + seq->size = PAGE_SIZE << 3; + seq->buf = kvzalloc(seq->size, GFP_KERNEL); + if (!seq->buf) + return -ENOMEM; + return 0; +} + +static void seq_file_destroy(struct seq_file *seq) +{ + if (seq->buf) + kvfree(seq->buf); +} + +static void *bpf_c_start(struct seq_file *m, loff_t *pos) +{ + struct stat_seq_priv *priv = m->private; + struct task_struct *reaper = get_current_level1_reaper(); + + priv->task = reaper ?: current; + task_effective_cpumask(priv->task, &priv->allowed_mask); + priv->cpuacct = task_cpuacct(priv->task); + if (seq_file_setup(&priv->seqf_pcpu)) + return NULL; + + /* + * DO NOT use cpumask_first() here: sys_read may start from somewhere in + * the middle of the file, and *pos may contain a value from the last + * read. 
+	 */
+	*pos = cpumask_next(*pos - 1, &priv->allowed_mask);
+	if ((*pos) < nr_cpu_ids)
+		// avoid 0, which will be treated as NULL
+		return (void *)(unsigned long)((*pos) + 1);
+	return NULL;
+}
+
+static void *bpf_c_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct stat_seq_priv *priv = m->private;
+
+	*pos = cpumask_next(*pos, &priv->allowed_mask);
+
+	if ((*pos) == nr_cpu_ids) {
+		if (!priv->sum_printed)
+			priv->sum_printed = true;
+		else {
+			++*pos; // just to silence the "did not update position index" msg
+			return NULL;
+		}
+	}
+
+	// avoid 0, which will be treated as NULL
+	return (void *)(unsigned long)((*pos) + 1);
+}
+
+struct bpf_iter__stat {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	u64 cpuid __aligned(8);
+	__bpf_md_ptr(struct cpuacct *, cpuacct);
+	u64 arch_irq_stat_cpu __aligned(8);
+	u64 arch_irq_stat __aligned(8);
+	bool print_all __aligned(8);
+	__bpf_md_ptr(struct seq_file *, seqf_pcpu);
+};
+
+static int bpf_show_stat(struct seq_file *m, void *v)
+{
+	struct stat_seq_priv *priv = m->private;
+	struct bpf_iter__stat ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	u64 cpuid = (unsigned long)v - 1; // decode '+ 1'
+
+	meta.seq = m;
+	prog = bpf_iter_get_info(&meta, false);
+	if (!prog)
+		return show_stat(m, v);
+
+	ctx.meta = &meta;
+
+	ctx.cpuid = cpuid;
+	ctx.cpuacct = priv->cpuacct;
+	if (cpuid != nr_cpu_ids)
+		ctx.arch_irq_stat_cpu = arch_irq_stat_cpu(cpuid);
+	else
+		ctx.arch_irq_stat = arch_irq_stat();
+	ctx.print_all = (cpuid == nr_cpu_ids);
+	ctx.seqf_pcpu = &priv->seqf_pcpu;
+
+	return bpf_iter_run_prog(prog, &ctx);
+}
+
+static void bpf_c_stop(struct seq_file *m, void *v)
+{
+	struct stat_seq_priv *priv = m->private;
+
+	if (priv->task != current)
+		put_task_struct(priv->task);
+	seq_file_destroy(&priv->seqf_pcpu);
+}
+
+static const struct seq_operations bpf_stat_ops = {
+	.start	= bpf_c_start,
+	.next	= bpf_c_next,
+	.stop	= bpf_c_stop,
+	.show	= bpf_show_stat,
+};
+
+DEFINE_BPF_ITER_FUNC(stat, struct bpf_iter_meta *meta,
+		     u64 cpuid,
+		     struct cpuacct *cpuacct,
+		     u64 arch_irq_stat_cpu,
+		     u64 arch_irq_stat,
+		     bool print_all,
+		     struct seq_file *seqf_pcpu)
+
+BTF_ID_LIST(btf_stat_id)
+BTF_ID(struct, cpuacct)
+
+static const struct bpf_iter_seq_info stat_seq_info = {
+	.seq_ops		= &bpf_stat_ops,
+	.init_seq_private	= NULL,
+	.fini_seq_private	= NULL,
+	.seq_priv_size		= sizeof(struct stat_seq_priv),
+};
+
+static struct bpf_iter_reg stat_reg_info = {
+	.target			= "stat",
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__stat, cpuacct),
+		  PTR_TO_BTF_ID, },
+	},
+	.seq_info		= &stat_seq_info,
+};
+
+static int __init stat_iter_init(void)
+{
+	stat_reg_info.ctx_arg_info[0].btf_id = btf_stat_id[0];
+	return bpf_iter_reg_target(&stat_reg_info);
+}
+late_initcall(stat_iter_init);
 #endif /* CONFIG_BPF_RVI */
-- 
2.25.1

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Implement the bpf prog for the 'stat' interface. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- samples/bpf/Makefile | 1 + samples/bpf/bpf_rvi_stat.bpf.c | 220 +++++++++++++++++++++++++++++++++ 2 files changed, 221 insertions(+) create mode 100644 samples/bpf/bpf_rvi_stat.bpf.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index ff59231e80de..0b8036730076 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -161,6 +161,7 @@ always-$(CONFIG_BPF_RVI) += bpf_rvi_partitions.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_loadavg.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_uptime.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_swaps.bpf.o +always-$(CONFIG_BPF_RVI) += bpf_rvi_stat.bpf.o ifeq ($(ARCH), arm) # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux diff --git a/samples/bpf/bpf_rvi_stat.bpf.c b/samples/bpf/bpf_rvi_stat.bpf.c new file mode 100644 index 000000000000..66cc82de231b --- /dev/null +++ b/samples/bpf/bpf_rvi_stat.bpf.c @@ -0,0 +1,220 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Huawei Technologies Co., Ltd */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) + +unsigned int bpf_kstat_softirqs_cpu(unsigned int irq, int cpu) __ksym; +unsigned long bpf_kstat_cpu_irqs_sum(unsigned int cpu) __ksym; +void bpf_kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu) __ksym; +u64 bpf_get_idle_time(struct kernel_cpustat *kcs, int cpu) __ksym; +u64 bpf_get_iowait_time(struct kernel_cpustat *kcs, int cpu) __ksym; +struct task_struct *bpf_current_level1_reaper(void) __ksym; +void bpf_task_release(struct task_struct *p) __ksym; +struct cpumask *bpf_task_allowed_cpus(struct task_struct *p) __ksym; +u32 bpf_cpumask_next_idx(int n, const struct cpumask *mask) __ksym; +u64 bpf_cpuacct_stat_from_task(struct task_struct *p, int cpu, enum cpuacct_stat_index idx) __ksym; +void bpf_cpuacct_kcpustat_cpu_fetch(struct kernel_cpustat *dst, struct cpuacct *ca, int cpu) __ksym; +void bpf_seq_file_append(struct seq_file *dst, struct seq_file *src) __ksym; +void bpf_get_boottime_timens(struct task_struct *tsk, struct timespec64 *boottime) __ksym; +unsigned long bpf_get_total_forks(void) __ksym; +unsigned int bpf_nr_running(void) __ksym; +unsigned long long bpf_nr_context_switches(void) __ksym; +unsigned int bpf_nr_iowait(void) __ksym; +void bpf_show_all_irqs(struct seq_file *p) __ksym; + +#define NSEC_PER_SEC 1000000000L +#define USER_HZ 100 +/* Take the `#if (NSEC_PER_SEC % USER_HZ) == 0` case */ +static inline u64 nsec_to_clock_t(u64 x) +{ + return x / (NSEC_PER_SEC / USER_HZ); +} + +struct stat_sum_data { + u64 user, nice, system, idle, iowait, irq, softirq, steal; + u64 guest, guest_nice; + u64 sum; + u64 sum_softirq; + unsigned int per_softirq_sums[NR_SOFTIRQS]; + u64 nr_context_switches, nr_running, nr_iowait; +}; + +struct stat_sum_data_map { + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct stat_sum_data); +} collect_map SEC(".maps"); + +#define RET_OK 0 +#define RET_FAIL 1 +#define RET_SKIP -1 + +SEC("iter/stat") +s64 dump_stat(struct bpf_iter__stat *ctx) +{ + struct seq_file *m = ctx->meta->seq; + u64 cpuid = ctx->cpuid; + struct cpuacct *cpuacct = ctx->cpuacct; + bool print_all = ctx->print_all; + struct seq_file *seqf_pcpu = ctx->seqf_pcpu; + struct task_struct 
*current = bpf_get_current_task_btf(); // just for bpf map management + struct stat_sum_data *collect; + int j; + + collect = bpf_task_storage_get(&collect_map, current, NULL, BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!collect) + return RET_FAIL; + + if (!print_all) { + u64 user, nice, system, idle, iowait, irq, softirq, steal; + u64 guest, guest_nice; + struct kernel_cpustat kcpustat = {}; + u64 *cpustat = kcpustat.cpustat; + + bpf_kcpustat_cpu_fetch(&kcpustat, cpuid); + user = cpustat[CPUTIME_USER]; + nice = cpustat[CPUTIME_NICE]; + system = cpustat[CPUTIME_SYSTEM]; + idle = bpf_get_idle_time(&kcpustat, cpuid); + iowait = bpf_get_iowait_time(&kcpustat, cpuid); + irq = cpustat[CPUTIME_IRQ]; + softirq = cpustat[CPUTIME_SOFTIRQ]; + steal = cpustat[CPUTIME_STEAL]; + guest = cpustat[CPUTIME_GUEST]; + guest_nice = cpustat[CPUTIME_GUEST_NICE]; + + collect->sum += bpf_kstat_cpu_irqs_sum(cpuid); + collect->sum += ctx->arch_irq_stat_cpu; + for (j = 0; j < NR_SOFTIRQS; j++) { + unsigned int softirq_stat = bpf_kstat_softirqs_cpu(j, cpuid); + + collect->per_softirq_sums[j] += softirq_stat; + collect->sum_softirq += softirq_stat; + } + + // don't print cpuid to avoid leaking host info + BPF_SEQ_PRINTF(seqf_pcpu, "cpu%d", ctx->meta->seq_num); + + if (cpuacct) { + struct kernel_cpustat kcpustat = {}; + u64 *cpustat = kcpustat.cpustat; + + bpf_cpuacct_kcpustat_cpu_fetch(&kcpustat, cpuacct, cpuid); + + user = cpustat[CPUTIME_USER]; + nice = cpustat[CPUTIME_NICE]; + system = cpustat[CPUTIME_SYSTEM]; + irq = cpustat[CPUTIME_IRQ]; + softirq = cpustat[CPUTIME_SOFTIRQ]; + idle = cpustat[CPUTIME_IDLE]; + iowait = cpustat[CPUTIME_IOWAIT]; + steal = cpustat[CPUTIME_STEAL]; + guest = cpustat[CPUTIME_GUEST]; + guest_nice = cpustat[CPUTIME_GUEST_NICE]; + + collect->user += user; + collect->nice += nice; + collect->system += system; + collect->idle += idle; + collect->iowait += iowait; + collect->irq += irq; + collect->softirq += softirq; + collect->steal += steal; + collect->guest += guest; + collect->guest_nice += guest_nice; + + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(user)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(nice)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(system)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(idle)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(iowait)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(irq)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(softirq)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(steal)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(guest)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(guest_nice)); + } else { + collect->user += user; + collect->nice += nice; + collect->system += system; + collect->idle += idle; + collect->iowait += iowait; + collect->irq += irq; + collect->softirq += softirq; + collect->steal += steal; + collect->guest += guest; + collect->guest_nice += guest_nice; + + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(user)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(nice)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(system)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(idle)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(iowait)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(irq)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(softirq)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(steal)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(guest)); + BPF_SEQ_PRINTF(seqf_pcpu, " %ld", nsec_to_clock_t(guest_nice)); 
+ } + + BPF_SEQ_PRINTF(seqf_pcpu, "\n"); + } else { + struct timespec64 boottime; + + // Add only once + collect->sum += ctx->arch_irq_stat; + + BPF_SEQ_PRINTF(m, "cpu %ld", nsec_to_clock_t(collect->user)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->nice)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->system)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->idle)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->iowait)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->irq)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->softirq)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->steal)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->guest)); + BPF_SEQ_PRINTF(m, " %ld", nsec_to_clock_t(collect->guest_nice)); + BPF_SEQ_PRINTF(m, "\n"); + + // ************************************ + // Dump percpu printing + // Don't do this: + // BPF_SEQ_PRINTF(m, "%s", seqf_pcpu->buf); + // as it prints at most 512 bytes each time + bpf_seq_file_append(m, seqf_pcpu); + // ************************************ + + BPF_SEQ_PRINTF(m, "intr %ld\n", collect->sum); + + bpf_show_all_irqs(m); + + bpf_get_boottime_timens(current, &boottime); + BPF_SEQ_PRINTF(m, + "\nctxt %llu\n" + "btime %llu\n" + "processes %lu\n" + "procs_running %u\n" + "procs_blocked %u\n", + bpf_nr_context_switches(), + (unsigned long long)boottime.tv_sec, + bpf_get_total_forks(), + bpf_nr_running(), + bpf_nr_iowait()); + + BPF_SEQ_PRINTF(m, "softirq %ld", collect->sum_softirq); + + for (j = 0; j < NR_SOFTIRQS; j++) + BPF_SEQ_PRINTF(m, " %d", collect->per_softirq_sums[j]); + BPF_SEQ_PRINTF(m, "\n"); + + bpf_task_storage_delete(&collect_map, current); + } + + return RET_OK; +} + +char _license[] SEC("license") = "GPL"; -- 2.25.1
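For testing, the iterator can be driven from user space with libbpf; a minimal loader sketch follows (object and program names are taken from the sample above, and most error handling is omitted, so this is a sketch rather than a reference tool):

	#include <stdio.h>
	#include <unistd.h>
	#include <bpf/bpf.h>
	#include <bpf/libbpf.h>

	int main(void)
	{
		struct bpf_object *obj;
		struct bpf_program *prog;
		struct bpf_link *link;
		char buf[4096];
		ssize_t n;
		int iter_fd;

		obj = bpf_object__open_file("bpf_rvi_stat.bpf.o", NULL);
		if (!obj || bpf_object__load(obj))
			return 1;
		prog = bpf_object__find_program_by_name(obj, "dump_stat");
		link = bpf_program__attach_iter(prog, NULL);
		if (!link)
			return 1;
		iter_fd = bpf_iter_create(bpf_link__fd(link));
		if (iter_fd < 0)
			return 1;
		/* reading the iterator fd renders the bpf-provided /proc/stat view */
		while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
			fwrite(buf, 1, n, stdout);
		close(iter_fd);
		bpf_link__destroy(link);
		bpf_object__close(obj);
		return 0;
	}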

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfunc to retrieve x86 direct_pages_count info. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- arch/x86/mm/pat/set_memory.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 2d850f6bae70..0116b554269e 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -22,6 +22,10 @@ #include <linux/cc_platform.h> #include <linux/set_memory.h> #include <linux/memregion.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#endif #include <asm/e820/api.h> #include <asm/processor.h> @@ -122,6 +126,35 @@ void arch_report_meminfo(struct seq_file *m) seq_printf(m, "DirectMap1G: %8lu kB\n", direct_pages_count[PG_LEVEL_1G] << 20); } + +#ifdef CONFIG_BPF_RVI +__bpf_kfunc void bpf_mem_direct_map(unsigned long *p) +{ + p[0] = direct_pages_count[PG_LEVEL_4K] << 2; +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) + p[1] = direct_pages_count[PG_LEVEL_2M] << 11; +#else + p[1] = direct_pages_count[PG_LEVEL_2M] << 12; +#endif + p[2] = direct_pages_count[PG_LEVEL_1G] << 20; +} + +BTF_SET8_START(bpf_direct_map_kfunc_ids) +BTF_ID_FLAGS(func, bpf_mem_direct_map, KF_TRUSTED_ARGS) +BTF_SET8_END(bpf_direct_map_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_direct_map_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_direct_map_kfunc_ids, +}; + +static int __init bpf_direct_map_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_direct_map_kfunc_set); +} +late_initcall(bpf_direct_map_kfunc_init); +#endif /* CONFIG_BPF_RVI */ #else static inline void split_page_count(int level) { } #endif -- 2.25.1
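The shift constants are a pages-to-kB conversion per mapping size, matching arch_report_meminfo() above: a 4 KiB page contributes 4 kB (<< 2), a 2 MiB page 2048 kB (<< 11), a 4 MiB page 4096 kB (<< 12, the 32-bit non-PAE case where PG_LEVEL_2M describes 4 MiB mappings), and a 1 GiB page 1048576 kB (<< 20). A worked example with an illustrative count:

	/* 1000 direct-map entries at PG_LEVEL_2M on x86_64:
	 *   p[1] = 1000 << 11 = 2048000 kB, i.e. "DirectMap2M: 2048000 kB"
	 */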

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfuncs to retrieve the number of some sorts of memory pages. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- fs/proc/meminfo.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 57a431c1130b..d9a89d254274 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -19,6 +19,10 @@ #endif #include <linux/zswap.h> #include <linux/dynamic_pool.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#endif #include <asm/page.h> #include "internal.h" @@ -176,6 +180,42 @@ static int meminfo_proc_show(struct seq_file *m, void *v) return 0; } +#ifdef CONFIG_BPF_RVI +__bpf_kfunc unsigned long bpf_mem_file_hugepage(void) +{ + return global_node_page_state(NR_FILE_THPS); +} + +__bpf_kfunc unsigned long bpf_mem_file_pmdmapped(void) +{ + return global_node_page_state(NR_FILE_PMDMAPPED); +} + +__bpf_kfunc unsigned long bpf_mem_kreclaimable(void) +{ + return global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) + + global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE); +} + +BTF_SET8_START(bpf_fs_meminfo_kfunc_ids) +BTF_ID_FLAGS(func, bpf_mem_file_hugepage) +BTF_ID_FLAGS(func, bpf_mem_file_pmdmapped) +BTF_ID_FLAGS(func, bpf_mem_kreclaimable) +BTF_SET8_END(bpf_fs_meminfo_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_fs_meminfo_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_fs_meminfo_kfunc_ids, +}; + +static int __init bpf_fs_meminfo_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_fs_meminfo_kfunc_set); +} +late_initcall(bpf_fs_meminfo_kfunc_init); +#endif + static int __init proc_meminfo_init(void) { struct proc_dir_entry *pde; -- 2.25.1
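Consumers declare these as __ksym externs and scale the returned page counts to kB themselves; a minimal BPF-side fragment under that assumption (the meminfo sample later in this series follows the same pattern, and KB() assumes 4 KiB pages):

	unsigned long bpf_mem_file_hugepage(void) __ksym;
	unsigned long bpf_mem_file_pmdmapped(void) __ksym;
	unsigned long bpf_mem_kreclaimable(void) __ksym;

	#define KB(pg) ((pg) * 4)	/* pages -> kB, assuming 4 KiB pages */

	/* inside an iterator program, with m = ctx->meta->seq: */
	BPF_SEQ_PRINTF(m, "KReclaimable: %8llu kB\n", KB(bpf_mem_kreclaimable()));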

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfuncs to get CMA information. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- mm/cma.c | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/mm/cma.c b/mm/cma.c index 7d5253421f34..6a18acca509f 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -31,6 +31,12 @@ #include <linux/highmem.h> #include <linux/io.h> #include <linux/kmemleak.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#include <linux/vmstat.h> +#include <linux/mmzone.h> +#endif #include <trace/events/cma.h> #include "internal.h" @@ -624,3 +630,32 @@ int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data) return 0; } + +#ifdef CONFIG_BPF_RVI +__bpf_kfunc unsigned long bpf_mem_totalcma(void) +{ + return totalcma_pages; +} + +__bpf_kfunc unsigned long bpf_mem_freecma(void) +{ + return global_zone_page_state(NR_FREE_CMA_PAGES); +} + +BTF_SET8_START(bpf_mem_cma_kfunc_ids) +BTF_ID_FLAGS(func, bpf_mem_totalcma) +BTF_ID_FLAGS(func, bpf_mem_freecma) +BTF_SET8_END(bpf_mem_cma_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_mem_cma_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_mem_cma_kfunc_ids, +}; + +static int __init bpf_mem_cma_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_mem_cma_kfunc_set); +} +late_initcall(bpf_mem_cma_kfunc_init); +#endif -- 2.25.1
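Both kfuncs return raw page counts (totalcma_pages and the NR_FREE_CMA_PAGES zone counter), leaving unit scaling to the caller. For reference, /proc/meminfo derives its kB values from the same sources:

	/* Equivalent meminfo arithmetic (illustrative):
	 *   CmaTotal = totalcma_pages << (PAGE_SHIFT - 10)
	 *   CmaFree  = global_zone_page_state(NR_FREE_CMA_PAGES) << (PAGE_SHIFT - 10)
	 */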

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfunc to get hugetlb information. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- mm/hugetlb.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b94470b4cfc1..a30b7cc66cb5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -37,6 +37,10 @@ #include <linux/mm_inline.h> #include <linux/share_pool.h> #include <linux/dynamic_pool.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#endif #include <asm/page.h> #include <asm/pgalloc.h> @@ -7884,3 +7888,56 @@ struct folio *alloc_hugetlb_folio_size(int nid, unsigned long size) } EXPORT_SYMBOL(alloc_hugetlb_folio_size); #endif + +#ifdef CONFIG_BPF_RVI +struct bpf_mem_hugepage { + unsigned long total; + unsigned long free; + unsigned long rsvd; + unsigned long surp; + unsigned long size; + unsigned long hugetlb; +}; + +__bpf_kfunc int bpf_hugetlb_report_meminfo(struct bpf_mem_hugepage *hugepage_info) +{ + struct hstate *h; + unsigned long total = 0; + + if (!hugepages_supported()) + return -1; + + for_each_hstate(h) { + unsigned long count = h->nr_huge_pages; + + total += huge_page_size(h) * count; + + if (h == &default_hstate) { + hugepage_info->total = count; + hugepage_info->free = h->free_huge_pages; + hugepage_info->rsvd = h->resv_huge_pages; + hugepage_info->surp = h->surplus_huge_pages; + hugepage_info->size = huge_page_size(h) / SZ_1K; + } + } + + hugepage_info->hugetlb = total / SZ_1K; + return 0; +} + +BTF_SET8_START(bpf_mem_hugepage_kfunc_ids) +BTF_ID_FLAGS(func, bpf_hugetlb_report_meminfo, KF_TRUSTED_ARGS) +BTF_SET8_END(bpf_mem_hugepage_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_mem_hugepage_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_mem_hugepage_kfunc_ids, +}; + +static int __init bpf_mem_hugepage_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_mem_hugepage_kfunc_set); +} +late_initcall(bpf_mem_hugepage_kfunc_init); +#endif -- 2.25.1
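Note that only the default hstate populates the per-field counters, while the hugetlb field accumulates huge_page_size() * nr_huge_pages across all hstates, mirroring hugetlb_report_meminfo(). On the BPF side the mirror struct is visible through vmlinux.h BTF, and the -1 return (hugepages not supported) is worth checking before printing; a minimal fragment under those assumptions, with m and RET_SKIP as in the samples:

	struct bpf_mem_hugepage hg_info = {};

	if (bpf_hugetlb_report_meminfo(&hg_info))
		return RET_SKIP;	/* hugepages not supported */
	BPF_SEQ_PRINTF(m, "HugePages_Total: %8llu\n", hg_info.total);
	BPF_SEQ_PRINTF(m, "Hugepagesize: %8llu kB\n", hg_info.size);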

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfunc to get the number of poisoned pages. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- mm/memory-failure.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 7f7b75611869..ec9f29c1fa79 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -61,6 +61,11 @@ #include <linux/shmem_fs.h> #include <linux/sysctl.h> #include <linux/dynamic_pool.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#include <linux/atomic.h> +#endif #include "swap.h" #include "internal.h" #include "ras/ras_event.h" @@ -2841,3 +2846,26 @@ int soft_offline_page(unsigned long pfn, int flags) return ret; } EXPORT_SYMBOL_GPL(soft_offline_page); + +#ifdef CONFIG_BPF_RVI +__bpf_kfunc unsigned long bpf_mem_failure(void) +{ + return atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10); +} + +BTF_SET8_START(bpf_mem_failure_kfunc_ids) +BTF_ID_FLAGS(func, bpf_mem_failure) +BTF_SET8_END(bpf_mem_failure_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_mem_failure_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_mem_failure_kfunc_ids, +}; + +static int __init bpf_mem_failure_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_mem_failure_kfunc_set); +} +late_initcall(bpf_mem_failure_kfunc_init); +#endif -- 2.25.1
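Unlike the CMA kfuncs, this one converts to kB in the kernel: << (PAGE_SHIFT - 10) multiplies the page count by 4 on 4 KiB-page systems, so the BPF program prints the value as-is.

	/* 3 poisoned pages with PAGE_SHIFT == 12:
	 *   3 << (12 - 10) == 12  ->  "HardwareCorrupted: 12 kB"
	 */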

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfunc to get the number of per-cpu pages. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- mm/percpu.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/mm/percpu.c b/mm/percpu.c index f5ed46ba3813..fa7d7a87f971 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -87,6 +87,10 @@ #include <linux/sched.h> #include <linux/sched/mm.h> #include <linux/memcontrol.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#endif #include <asm/cacheflush.h> #include <asm/sections.h> @@ -3419,6 +3423,29 @@ unsigned long pcpu_nr_pages(void) return pcpu_nr_populated * pcpu_nr_units; } +#ifdef CONFIG_BPF_RVI +__bpf_kfunc unsigned long bpf_mem_percpu(void) +{ + return pcpu_nr_pages(); +} + +BTF_SET8_START(bpf_mem_percpu_kfunc_ids) +BTF_ID_FLAGS(func, bpf_mem_percpu) +BTF_SET8_END(bpf_mem_percpu_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_mem_percpu_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_mem_percpu_kfunc_ids, +}; + +static int __init bpf_mem_percpu_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_mem_percpu_kfunc_set); +} +late_initcall(bpf_mem_percpu_kfunc_init); +#endif + /* * Percpu allocator is initialized early during boot when neither slab or * workqueue is available. Plug async management until everything is up -- 2.25.1
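As the hunk context shows, pcpu_nr_pages() is pcpu_nr_populated * pcpu_nr_units, a count of populated per-CPU pages, and the kfunc passes it through unscaled. Illustrative consumer arithmetic, assuming 4 KiB pages as the samples do:

	/* pages -> kB on 4 KiB-page systems */
	BPF_SEQ_PRINTF(m, "Percpu: %8llu kB\n", bpf_mem_percpu() * 4);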

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfuncs to get info about committed pages of virtual memory. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- mm/util.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/mm/util.c b/mm/util.c index f3d6751b2f2a..dcc9d1412991 100644 --- a/mm/util.c +++ b/mm/util.c @@ -24,6 +24,10 @@ #include <linux/sizes.h> #include <linux/compat.h> #include <linux/share_pool.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#endif #include <linux/uaccess.h> #include <linux/oom.h> @@ -944,6 +948,34 @@ unsigned long vm_memory_committed(void) } EXPORT_SYMBOL_GPL(vm_memory_committed); +#ifdef CONFIG_BPF_RVI +__bpf_kfunc unsigned long bpf_mem_commit_limit(void) +{ + return vm_commit_limit(); +} +__bpf_kfunc unsigned long bpf_mem_committed(void) +{ + return vm_memory_committed(); +} + +BTF_SET8_START(bpf_memcommit_kfunc_ids) +BTF_ID_FLAGS(func, bpf_mem_commit_limit) +BTF_ID_FLAGS(func, bpf_mem_committed) +BTF_SET8_END(bpf_memcommit_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_memcommit_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_memcommit_kfunc_ids, +}; + +static int __init bpf_memcommit_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_memcommit_kfunc_set); +} +late_initcall(bpf_memcommit_kfunc_init); +#endif + /* * Check that a process has enough memory to allocate a new virtual * mapping. 0 means there is enough memory for the allocation to -- 2.25.1
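For context, vm_commit_limit() depends on the overcommit sysctls; mainline computes it roughly as below (a sketch for reference only, using mainline names; this patch simply re-exports the result):

	unsigned long vm_commit_limit(void)
	{
		unsigned long allowed;

		if (sysctl_overcommit_kbytes)
			allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
		else
			allowed = (totalram_pages() - hugetlb_total_pages())
				  * sysctl_overcommit_ratio / 100;
		allowed += total_swap_pages;

		return allowed;
	}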

From: Gu Bowen <gubowen5@huawei.com> hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Add kfunc to get info about vmalloc pages. Signed-off-by: Gu Bowen <gubowen5@huawei.com> --- mm/vmalloc.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 1855affa144e..6c92fea79336 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -40,6 +40,10 @@ #include <linux/pgtable.h> #include <linux/hugetlb.h> #include <linux/sched/mm.h> +#ifdef CONFIG_BPF_RVI +#include <linux/btf.h> +#include <linux/btf_ids.h> +#endif #include <asm/tlbflush.h> #include <asm/shmparam.h> @@ -989,6 +993,35 @@ unsigned long vmalloc_nr_pages(void) return atomic_long_read(&nr_vmalloc_pages); } +#ifdef CONFIG_BPF_RVI +__bpf_kfunc unsigned long bpf_mem_vmalloc_used(void) +{ + return vmalloc_nr_pages(); +} + +__bpf_kfunc unsigned long bpf_mem_vmalloc_total(void) +{ + return (unsigned long)VMALLOC_TOTAL >> 10; +} + +BTF_SET8_START(bpf_mem_vmalloc_kfunc_ids) +BTF_ID_FLAGS(func, bpf_mem_vmalloc_used) +BTF_ID_FLAGS(func, bpf_mem_vmalloc_total) +BTF_SET8_END(bpf_mem_vmalloc_kfunc_ids) + +static const struct btf_kfunc_id_set bpf_mem_vmalloc_kfunc_set = { + .owner = THIS_MODULE, + .set = &bpf_mem_vmalloc_kfunc_ids, +}; + +static int __init bpf_mem_vmalloc_kfunc_init(void) +{ + return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, + &bpf_mem_vmalloc_kfunc_set); +} +late_initcall(bpf_mem_vmalloc_kfunc_init); +#endif + static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root) { struct rb_node *n = root->rb_node; -- 2.25.1
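Note the asymmetric units: bpf_mem_vmalloc_used() returns pages (vmalloc_nr_pages()) while bpf_mem_vmalloc_total() already returns kB (VMALLOC_TOTAL >> 10). A consumer therefore scales only the former, as the meminfo sample later in this series does:

	BPF_SEQ_PRINTF(m, "VmallocTotal: %8llu kB\n", bpf_mem_vmalloc_total());
	/* used is in pages; * 4 converts to kB on 4 KiB-page systems */
	BPF_SEQ_PRINTF(m, "VmallocUsed: %8llu kB\n", bpf_mem_vmalloc_used() * 4);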

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICA1GK -------------------------------- Implement the bpf prog for the 'meminfo' interface. Co-developed-by: Gu Bowen <gubowen5@huawei.com> Signed-off-by: Gu Bowen <gubowen5@huawei.com> Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> --- samples/bpf/Makefile | 1 + samples/bpf/bpf_rvi_meminfo.bpf.c | 239 ++++++++++++++++++++++++++++++ 2 files changed, 240 insertions(+) create mode 100644 samples/bpf/bpf_rvi_meminfo.bpf.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 0b8036730076..172e507a56ba 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -162,6 +162,7 @@ always-$(CONFIG_BPF_RVI) += bpf_rvi_loadavg.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_uptime.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_swaps.bpf.o always-$(CONFIG_BPF_RVI) += bpf_rvi_stat.bpf.o +always-$(CONFIG_BPF_RVI) += bpf_rvi_meminfo.bpf.o ifeq ($(ARCH), arm) # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux diff --git a/samples/bpf/bpf_rvi_meminfo.bpf.c b/samples/bpf/bpf_rvi_meminfo.bpf.c new file mode 100644 index 000000000000..b6ed4cf12be1 --- /dev/null +++ b/samples/bpf/bpf_rvi_meminfo.bpf.c @@ -0,0 +1,239 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Huawei Technologies Co., Ltd */ +#include <vmlinux.h> +#include <bpf/bpf_helpers.h> + +#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) + +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; +void bpf_task_release(struct task_struct *p) __ksym; +struct task_struct *bpf_current_level1_reaper(void) __ksym; +struct mem_cgroup *bpf_mem_cgroup_from_task(struct task_struct *p) __ksym; +void cgroup_rstat_flush_atomic(struct cgroup *cgrp) __ksym; +void bpf_si_memswinfo(struct bpf_sysinfo *si) __ksym; +unsigned long bpf_page_counter_read(struct page_counter *pc) __ksym; +unsigned long bpf_mem_committed(void) __ksym; +unsigned long bpf_mem_commit_limit(void) __ksym; +unsigned long bpf_mem_vmalloc_total(void) __ksym; +unsigned long bpf_mem_vmalloc_used(void) __ksym; +unsigned long bpf_mem_percpu(void) __ksym; +unsigned long bpf_mem_failure(void) __ksym; +unsigned long bpf_mem_totalcma(void) __ksym; +unsigned long bpf_mem_freecma(void) __ksym; +int bpf_hugetlb_report_meminfo(struct bpf_mem_hugepage *hugepage_info) __ksym; +void bpf_mem_direct_map(unsigned long *p) __ksym; +unsigned long bpf_mem_file_hugepage(void) __ksym; +unsigned long bpf_mem_file_pmdmapped(void) __ksym; +unsigned long bpf_mem_kreclaimable(void) __ksym; + +extern bool CONFIG_SWAP __kconfig __weak; +extern bool CONFIG_MEMCG_KMEM __kconfig __weak; +extern bool CONFIG_ZSWAP __kconfig __weak; +extern bool CONFIG_MEMORY_FAILURE __kconfig __weak; +extern bool CONFIG_TRANSPARENT_HUGEPAGE __kconfig __weak; +extern bool CONFIG_CMA __kconfig __weak; +extern bool CONFIG_X86 __kconfig __weak; +extern bool CONFIG_X86_64 __kconfig __weak; +extern bool CONFIG_X86_PAE __kconfig __weak; + +/* Axiom */ +#define PAGE_SHIFT 12 +#define PMD_SHIFT 21 +/* include/linux/huge_mm.h */ +#define HPAGE_PMD_SHIFT PMD_SHIFT +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER) + +char _license[] SEC("license") = "GPL"; + +#define KB(pg) ((pg) * 4) +#define SI_PG(v, unit) (v * unit / 4096) + +static inline unsigned long +memcg_page_state(struct mem_cgroup *memcg, int idx) +{ + long x = READ_ONCE(memcg->vmstats->state[idx]); + + return x < 0 ? 
0 : x; +} + +/* Reference: https://docs.ebpf.io/ebpf-library/libbpf/ebpf/__ksym/ */ +extern void cgrp_dfl_root __ksym; +/* Reference: cgroup_on_dfl() */ +static inline bool cgroup_on_dfl(const struct cgroup *cgrp) +{ + return cgrp->root == &cgrp_dfl_root; +} + +#define RET_OK 0 +#define RET_FAIL 1 +#define RET_SKIP -1 + +SEC("iter/generic_single") +s64 dump_meminfo(struct bpf_iter__generic_single *ctx) +{ + struct seq_file *m = ctx->meta->seq; + struct task_struct *reaper; + struct mem_cgroup *memcg; + struct bpf_sysinfo si = {}; + struct bpf_mem_hugepage hg_info = {}; + u64 usage, limit; + unsigned long sreclaimable, sunreclaim; + unsigned long memswlimit, memswusage; + unsigned long cached, active_anon, inactive_anon; + unsigned long active_file, inactive_file, unevictable; + unsigned long swapusage = 0, swapfree = 0, swaptotal = 0; + unsigned long committed; + + bpf_hugetlb_report_meminfo(&hg_info); + + committed = bpf_mem_committed(); + bpf_si_memswinfo(&si); + + reaper = bpf_current_level1_reaper(); + if (!reaper) + return RET_FAIL; + bpf_rcu_read_lock(); + memcg = bpf_mem_cgroup_from_task(reaper); + if (!memcg) { + bpf_rcu_read_unlock(); + bpf_task_release(reaper); + return RET_FAIL; + } + cgroup_rstat_flush_atomic(memcg->css.cgroup); + limit = memcg->memory.max; + if (limit == 0 || limit > si.totalram) + limit = si.totalram; + /* + * Reference: page_counter_read(). + * memcg->memory.usage is atomic and should be read via (bpf_)atomic_long_read. + */ + usage = bpf_page_counter_read(&memcg->memory); + if (usage == 0 || usage > limit) + usage = limit; + + if (cgroup_on_dfl(memcg->css.cgroup)) { // if memcg is on V2 hierarchy + swaptotal = memcg->swap.max; + swapusage = bpf_page_counter_read(&memcg->swap); + } else { + u64 memsw_limit = memcg->memsw.max; // memsw = mem + swap + u64 memsw_usage = bpf_page_counter_read(&memcg->memsw); + + /* + * Reasonably, memsw.max should be >= memory.max, as memsw = mem + swap in V1. + * But that's not necessarily the case, as users may configure them as they wish. + */ + if (memsw_limit > limit) + swaptotal = memsw_limit - limit; + /* Similar treatment for {memsw,memory}.usage */ + if (swaptotal && memsw_usage > usage) + swapusage = memsw_usage - usage; + } + if (swaptotal > si.totalswap) + swaptotal = si.totalswap; + if (swapusage > si.totalswap - si.freeswap) + swapusage = si.totalswap - si.freeswap; + + swapfree = swaptotal - swapusage; + if (swapfree > si.freeswap) + swapfree = si.freeswap; + + cached = memcg_page_state(memcg, NR_FILE_PAGES); + active_anon = memcg_page_state(memcg, NR_ACTIVE_ANON); + inactive_anon = memcg_page_state(memcg, NR_INACTIVE_ANON); + active_file = memcg_page_state(memcg, NR_ACTIVE_FILE); + inactive_file = memcg_page_state(memcg, NR_INACTIVE_FILE); + unevictable = memcg_page_state(memcg, NR_UNEVICTABLE); + sreclaimable = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B); + sunreclaim = memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B); + + BPF_SEQ_PRINTF(m, "MemTotal: %8llu kB\n", KB(limit)); + BPF_SEQ_PRINTF(m, "MemFree: %8llu kB\n", KB(limit - usage)); + BPF_SEQ_PRINTF(m, "MemAvailable: %8llu kB\n", KB(limit - usage + cached)); + BPF_SEQ_PRINTF(m, "Buffers: %8llu kB\n", KB(0)); + BPF_SEQ_PRINTF(m, "Cached: %8llu kB\n", KB(cached)); + + if (CONFIG_SWAP) + BPF_SEQ_PRINTF(m, "SwapCached: %8llu kB\n", KB(memcg_page_state(memcg, NR_SWAPCACHE))); + + BPF_SEQ_PRINTF(m, "Active: %8llu kB\n", KB(active_anon + active_file)); + BPF_SEQ_PRINTF(m, "Inactive: %8llu kB\n", KB(inactive_anon + inactive_file)); + BPF_SEQ_PRINTF(m, "Active(anon): %8llu kB\n", KB(active_anon)); + BPF_SEQ_PRINTF(m, "Inactive(anon): %8llu kB\n", KB(inactive_anon)); + BPF_SEQ_PRINTF(m, "Active(file): %8llu kB\n", KB(active_file)); + BPF_SEQ_PRINTF(m, "Inactive(file): %8llu kB\n", KB(inactive_file)); + BPF_SEQ_PRINTF(m, "Unevictable: %8llu kB\n", KB(unevictable)); + BPF_SEQ_PRINTF(m, "Mlocked: %8llu kB\n", KB(0)); + BPF_SEQ_PRINTF(m, "SwapTotal: %8llu kB\n", KB(swaptotal)); + BPF_SEQ_PRINTF(m, "SwapFree: %8llu kB\n", KB(swapfree)); + + if (CONFIG_MEMCG_KMEM && CONFIG_ZSWAP) { + BPF_SEQ_PRINTF(m, "Zswap: %8llu kB\n", memcg_page_state(memcg, MEMCG_ZSWAP_B)); + BPF_SEQ_PRINTF(m, "Zswapped: %8llu kB\n", memcg_page_state(memcg, MEMCG_ZSWAPPED)); + } + + BPF_SEQ_PRINTF(m, "Dirty: %8llu kB\n", KB(memcg_page_state(memcg, NR_FILE_DIRTY))); + BPF_SEQ_PRINTF(m, "Writeback: %8llu kB\n", KB(memcg_page_state(memcg, NR_WRITEBACK))); + BPF_SEQ_PRINTF(m, "AnonPages: %8llu kB\n", KB(memcg_page_state(memcg, NR_ANON_MAPPED))); + BPF_SEQ_PRINTF(m, "Mapped: %8llu kB\n", KB(memcg_page_state(memcg, NR_FILE_MAPPED))); + BPF_SEQ_PRINTF(m, "Shmem: %8llu kB\n", KB(memcg_page_state(memcg, NR_SHMEM))); + BPF_SEQ_PRINTF(m, "KReclaimable: %8llu kB\n", KB(bpf_mem_kreclaimable())); + BPF_SEQ_PRINTF(m, "Slab: %8llu kB\n", KB(sreclaimable + sunreclaim)); + BPF_SEQ_PRINTF(m, "SReclaimable: %8llu kB\n", KB(sreclaimable)); + BPF_SEQ_PRINTF(m, "SUnreclaim: %8llu kB\n", KB(sunreclaim)); + BPF_SEQ_PRINTF(m, "KernelStack: %8llu kB\n", memcg_page_state(memcg, NR_KERNEL_STACK_KB)); + BPF_SEQ_PRINTF(m, "PageTables: %8llu kB\n", KB(memcg_page_state(memcg, NR_PAGETABLE))); + BPF_SEQ_PRINTF(m, "SecPageTables: %8llu kB\n", KB(memcg_page_state(memcg, NR_SECONDARY_PAGETABLE))); + BPF_SEQ_PRINTF(m, "NFS_Unstable: %8llu kB\n", KB(0)); + BPF_SEQ_PRINTF(m, "Bounce: %8llu kB\n", KB(0)); + BPF_SEQ_PRINTF(m, "WritebackTmp: %8llu kB\n", KB(memcg_page_state(memcg, NR_WRITEBACK_TEMP))); + BPF_SEQ_PRINTF(m, "CommitLimit: %8llu kB\n", KB(bpf_mem_commit_limit())); + BPF_SEQ_PRINTF(m, "Committed_AS: %8llu kB\n",
KB(committed)); + BPF_SEQ_PRINTF(m, "VmallocTotal: %8llu kB\n", bpf_mem_vmalloc_total()); + BPF_SEQ_PRINTF(m, "VmallocUsed: %8llu kB\n", KB(bpf_mem_vmalloc_used())); + BPF_SEQ_PRINTF(m, "VmallocChunk: %8llu kB\n", KB(0)); + BPF_SEQ_PRINTF(m, "Percpu: %8llu kB\n", KB(bpf_mem_percpu())); + + if (CONFIG_MEMORY_FAILURE) + BPF_SEQ_PRINTF(m, "HardwareCorrupted: %8llu kB\n", bpf_mem_failure()); + + if (CONFIG_TRANSPARENT_HUGEPAGE) { + BPF_SEQ_PRINTF(m, "AnonHugePages: %8llu kB\n", KB(memcg_page_state(memcg, NR_ANON_THPS) * + HPAGE_PMD_NR)); + BPF_SEQ_PRINTF(m, "ShmemHugePages: %8llu kB\n", KB(memcg_page_state(memcg, NR_SHMEM_THPS) * + HPAGE_PMD_NR)); + BPF_SEQ_PRINTF(m, "ShmemPmdMapped: %8llu kB\n", KB(memcg_page_state(memcg, NR_SHMEM_PMDMAPPED) * + HPAGE_PMD_NR)); + BPF_SEQ_PRINTF(m, "FileHugePages: %8llu kB\n", KB(bpf_mem_file_hugepage())); + BPF_SEQ_PRINTF(m, "FilePmdMapped: %8llu kB\n", KB(bpf_mem_file_pmdmapped())); + } + if (CONFIG_CMA) { + BPF_SEQ_PRINTF(m, "CmaTotal: %8llu kB\n", KB(bpf_mem_totalcma())); + BPF_SEQ_PRINTF(m, "CmaFree: %8llu kB\n", KB(bpf_mem_freecma())); + } + BPF_SEQ_PRINTF(m, "Unaccepted: %8llu kB\n", KB(0)); + BPF_SEQ_PRINTF(m, "HugePages_Total: %8llu\n", hg_info.total); + BPF_SEQ_PRINTF(m, "HugePages_Free: %8llu\n", hg_info.free); + BPF_SEQ_PRINTF(m, "HugePages_Rsvd: %8llu\n", hg_info.rsvd); + BPF_SEQ_PRINTF(m, "HugePages_Surp: %8llu\n", hg_info.surp); + BPF_SEQ_PRINTF(m, "Hugepagesize: %8llu kB\n", hg_info.size); + BPF_SEQ_PRINTF(m, "Hugetlb: %8llu kB\n", hg_info.hugetlb); + + if (CONFIG_X86) { + unsigned long direct_map_info[3] = {}; + + bpf_mem_direct_map(direct_map_info); + BPF_SEQ_PRINTF(m, "DirectMap4k: %8llu kB\n", direct_map_info[0]); + if (CONFIG_X86_64 || CONFIG_X86_PAE) + BPF_SEQ_PRINTF(m, "DirectMap2M: %8llu kB\n", direct_map_info[1]); + else + BPF_SEQ_PRINTF(m, "DirectMap4M: %8llu kB\n", direct_map_info[1]); + BPF_SEQ_PRINTF(m, "DirectMap1G: %8llu kB\n", direct_map_info[2]); + } + + bpf_rcu_read_unlock(); + bpf_task_release(reaper); + + return RET_OK; +} -- 2.25.1
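Once the kernel side is in place, the compiled object can be pinned and read like a regular file; a hypothetical session (the pin path is illustrative, not part of this series):

	# bpftool iter pin bpf_rvi_meminfo.bpf.o /sys/fs/bpf/rvi_meminfo
	# cat /sys/fs/bpf/rvi_meminfo
	MemTotal:       <limit in kB>
	MemFree:        <limit - usage in kB>
	...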

FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list have been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/17276 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/UOC...