[PATCH openEuler-22.03-LTS-SP1 00/23] Support perf --bpf-counters

This is a minimal patch series to enable '--bpf-counters' options for perf tool in 5.10 kernel version. How to compile? =============== 1. Download openEuler kernel source code: git clone https://gitee.com/openeuler/kernel.git 2. Download these patches, apply them in kernel/ directory: cd kernel/ git am --reject /path/to/*.patch 3. Prepare some requirements before compiling perf tool: Not list one by one here, please refer to [1]. [1] https://medium.com/@manas.marwah/building-perf-tool-fc838f084f71 4. Compile perf tool: cd tools/perf make BUILD_BPF_SKEL=1 If successful, you will see a binary file named 'perf' under tools/perf/ directory. How to use? =========== 1. Use bperf to sample a process: $ perf stat --bpf-counters -e cycles,instructions \ perf bench sched messaging -g 1 -l 100 -t # Running 'sched/messaging' benchmark: # 20 sender and receiver threads per group # 1 groups == 40 threads run Total time: 0.011 [sec] Performance counter stats for 'perf bench sched messaging -g 1 -l 100 -t': 639,816,606 cycles 284,102,919 instructions # 0.44 insn per cycle 0.018093452 seconds time elapsed 0.023612000 seconds user 0.255643000 seconds sys 2. Use bperf to sample a container: $ perf stat --bpf-counters -e cycles,instructions --timeout 1000 \ --for-each-cgroup docker/66166578dd7d17655923862792a0b9d60c8785fdc1bb8da8b76daedb205c4b10 Performance counter stats for 'system wide': 7,402,983 cycles docker/66166578dd7d17655923862792a0b9d60c8785fdc1bb8da8b76daedb205c4b10 6,036,926 instructions docker/66166578dd7d17655923862792a0b9d60c8785fdc1bb8da8b76daedb205c4b10 # 0.82 insn per cycle 1.004999462 seconds time elapsed How to integrate via bpf? ======================== If you are interested in process sampling, please check the following bpf progs: tools/perf/util/bpf_skel/bperf_leader.bpf.c tools/perf/util/bpf_skel/bperf_follower.bpf.c And if you are interested in container sampling, please check the following bpf prog: tools/perf/util/bpf_skel/bperf_cgroup.bpf.c There have some documents in tools/perf/util/bpf_counter.c, search the keywords 'bperf: share hardware PMCs with BPF' in it and see how bperf works. That's all. Enjoy. Namhyung Kim (9): bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing perf core: Factor out __perf_sw_event_sched perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event perf tools: Add 'cgroup-switches' software event perf tools: Add read_cgroup_id() function perf tools: Add cgroup_is_v2() helper perf bpf_counter: Move common functions to bpf_counter.h perf stat: Enable BPF counter with --for-each-cgroup perf stat: Fix BPF program section name Song Liu (12): bpftool: Add Makefile target bootstrap perf build: Support build BPF skeletons with perf perf stat: Enable counting events for BPF programs perf stat: Introduce 'bperf' to share hardware PMCs with BPF perf stat: Measure 't0' and 'ref_time' after enable_counters() perf util: Move bpf_perf definitions to a libperf header perf bpf: check perf_attr_map is compatible with the perf binary perf stat: Introduce config stat.bpf-counter-events perf stat: Introduce ':b' modifier perf stat: Introduce bpf_counter_ops->disable() perf bpf: Fix building perf with BUILD_BPF_SKEL=1 by default in more distros perf bpf_skel: Do not use typedef to avoid error on old clang Tengda Wu (1): perf stat: Support inherit events during fork() for bperf Xiaomeng Zhang (1): perf stat: Increase perf_attr_map entries include/linux/perf_event.h | 40 +- include/uapi/linux/perf_event.h | 1 + kernel/trace/bpf_trace.c | 2 + tools/bpf/bpftool/Makefile | 2 + tools/build/Makefile.feature | 4 +- tools/include/uapi/linux/perf_event.h | 7 + tools/lib/perf/include/perf/bpf_perf.h | 31 + tools/perf/Documentation/perf-stat.txt | 31 + tools/perf/Makefile.config | 9 + tools/perf/Makefile.perf | 65 +- tools/perf/builtin-stat.c | 111 ++- tools/perf/util/Build | 2 + tools/perf/util/bpf_counter.c | 828 ++++++++++++++++++ tools/perf/util/bpf_counter.h | 131 +++ tools/perf/util/bpf_counter_cgroup.c | 307 +++++++ tools/perf/util/bpf_skel/.gitignore | 3 + tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 191 ++++ tools/perf/util/bpf_skel/bperf_follower.bpf.c | 162 ++++ tools/perf/util/bpf_skel/bperf_leader.bpf.c | 55 ++ tools/perf/util/bpf_skel/bperf_u.h | 19 + .../util/bpf_skel/bpf_prog_profiler.bpf.c | 93 ++ tools/perf/util/cgroup.c | 47 + tools/perf/util/cgroup.h | 13 + tools/perf/util/config.c | 4 + tools/perf/util/evlist.c | 4 + tools/perf/util/evsel.c | 27 + tools/perf/util/evsel.h | 32 + tools/perf/util/parse-events.c | 12 +- tools/perf/util/parse-events.l | 3 +- tools/perf/util/python.c | 27 + tools/perf/util/stat-display.c | 4 +- tools/perf/util/stat.c | 2 +- tools/perf/util/target.c | 34 +- tools/perf/util/target.h | 8 + tools/scripts/Makefile.include | 1 + 35 files changed, 2266 insertions(+), 46 deletions(-) create mode 100644 tools/lib/perf/include/perf/bpf_perf.h create mode 100644 tools/perf/util/bpf_counter.c create mode 100644 tools/perf/util/bpf_counter.h create mode 100644 tools/perf/util/bpf_counter_cgroup.c create mode 100644 tools/perf/util/bpf_skel/.gitignore create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c create mode 100644 tools/perf/util/bpf_skel/bperf_u.h create mode 100644 tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> Allow the helper to be called from tracing programs. This is needed to handle cgroup hiererachies in the program. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210627153627.824198-1-namhyung@kernel.org Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- kernel/trace/bpf_trace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b1e255e4f8e7..fbd681e15259 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1303,6 +1303,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) #ifdef CONFIG_CGROUPS case BPF_FUNC_get_current_cgroup_id: return &bpf_get_current_cgroup_id_proto; + case BPF_FUNC_get_current_ancestor_cgroup_id: + return &bpf_get_current_ancestor_cgroup_id_proto; #endif case BPF_FUNC_send_signal: return &bpf_send_signal_proto; -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> In some cases, we need to check more than whether the software event is enabled. So split the condition check and the actual event handling. This is a preparation for the next change. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210210083327.22726-1-namhyung@kernel.org Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- include/linux/perf_event.h | 33 ++++++++++++--------------------- 1 file changed, 12 insertions(+), 21 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 8e0e8cd6d4bf..789338a342c2 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1204,30 +1204,24 @@ DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]); * which is guaranteed by us not actually scheduling inside other swevents * because those disable preemption. */ -static __always_inline void -perf_sw_event_sched(u32 event_id, u64 nr, u64 addr) +static __always_inline void __perf_sw_event_sched(u32 event_id, u64 nr, u64 addr) { - if (static_key_false(&perf_swevent_enabled[event_id])) { - struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]); + struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]); - perf_fetch_caller_regs(regs); - ___perf_sw_event(event_id, nr, regs, addr); - } + perf_fetch_caller_regs(regs); + ___perf_sw_event(event_id, nr, regs, addr); } extern struct static_key_false perf_sched_events; -static __always_inline bool -perf_sw_migrate_enabled(void) +static __always_inline bool __perf_sw_enabled(int swevt) { - if (static_key_false(&perf_swevent_enabled[PERF_COUNT_SW_CPU_MIGRATIONS])) - return true; - return false; + return static_key_false(&perf_swevent_enabled[swevt]); } static inline void perf_event_task_migrate(struct task_struct *task) { - if (perf_sw_migrate_enabled()) + if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS)) task->sched_migrated = 1; } @@ -1237,11 +1231,9 @@ static inline void perf_event_task_sched_in(struct task_struct *prev, if (static_branch_unlikely(&perf_sched_events)) __perf_event_task_sched_in(prev, task); - if (perf_sw_migrate_enabled() && task->sched_migrated) { - struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]); - - perf_fetch_caller_regs(regs); - ___perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, regs, 0); + if (__perf_sw_enabled(PERF_COUNT_SW_CPU_MIGRATIONS) && + task->sched_migrated) { + __perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0); task->sched_migrated = 0; } } @@ -1249,7 +1241,8 @@ static inline void perf_event_task_sched_in(struct task_struct *prev, static inline void perf_event_task_sched_out(struct task_struct *prev, struct task_struct *next) { - perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0); + if (__perf_sw_enabled(PERF_COUNT_SW_CONTEXT_SWITCHES)) + __perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0); if (static_branch_unlikely(&perf_sched_events)) __perf_event_task_sched_out(prev, next); @@ -1516,8 +1509,6 @@ static inline int perf_event_refresh(struct perf_event *event, int refresh) static inline void perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr) { } static inline void -perf_sw_event_sched(u32 event_id, u64 nr, u64 addr) { } -static inline void perf_bp_event(struct perf_event *event, void *data) { } static inline int perf_register_guest_info_callbacks -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> This patch adds a new software event to count context switches involving cgroup switches. So it's counted only if cgroups of previous and next tasks are different. Note that it only checks the cgroups in the perf_event subsystem. For cgroup v2, it shouldn't matter anyway. One can argue that we can do this by using existing sched_switch event with eBPF. But some systems might not have eBPF for some reason so I'd like to add this as a simple way. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210210083327.22726-2-namhyung@kernel.org Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- include/linux/perf_event.h | 7 +++++++ include/uapi/linux/perf_event.h | 1 + 2 files changed, 8 insertions(+) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 789338a342c2..1038d32d3893 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1244,6 +1244,13 @@ static inline void perf_event_task_sched_out(struct task_struct *prev, if (__perf_sw_enabled(PERF_COUNT_SW_CONTEXT_SWITCHES)) __perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0); +#ifdef CONFIG_CGROUP_PERF + if (__perf_sw_enabled(PERF_COUNT_SW_CGROUP_SWITCHES) && + perf_cgroup_from_task(prev, NULL) != + perf_cgroup_from_task(next, NULL)) + __perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0); +#endif + if (static_branch_unlikely(&perf_sched_events)) __perf_event_task_sched_out(prev, next); } diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 4b0893023877..e1b9861b028c 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -112,6 +112,7 @@ enum perf_sw_ids { PERF_COUNT_SW_EMULATION_FAULTS = 8, PERF_COUNT_SW_DUMMY = 9, PERF_COUNT_SW_BPF_OUTPUT = 10, + PERF_COUNT_SW_CGROUP_SWITCHES = 11, PERF_COUNT_SW_MAX, /* non-ABI */ }; -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> It counts how often cgroups are changed actually during the context switches. # perf stat -a -e context-switches,cgroup-switches -a sleep 1 Performance counter stats for 'system wide': 11,267 context-switches 10,950 cgroup-switches 1.015634369 seconds time elapsed Committer notes: The kernel patches landed in v5.13, but this entry wasn't filled in perf's parse-events tables, which was leading to a segfault when running 'perf list' on a kernel with that feature, as reported by Thomas Richter. Also removed the part touching tools/include/uapi/linux/perf_event.h as it was updated in the usual sync with the kernel UAPI headers, in a previous, already upstream, patch. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Link: http://lore.kernel.org/lkml/20210210083327.22726-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/parse-events.c | 4 ++++ tools/perf/util/parse-events.l | 1 + 2 files changed, 5 insertions(+) diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 66ba42e3c4a7..1f389d51bc73 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -145,6 +145,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = { .symbol = "bpf-output", .alias = "", }, + [PERF_COUNT_SW_CGROUP_SWITCHES] = { + .symbol = "cgroup-switches", + .alias = "", + }, }; #define __PERF_EVENT_FIELD(config, name) \ diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l index 9db5097317f4..88f203bb6fab 100644 --- a/tools/perf/util/parse-events.l +++ b/tools/perf/util/parse-events.l @@ -347,6 +347,7 @@ emulation-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EM dummy { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); } duration_time { return tool(yyscanner, PERF_TOOL_DURATION_TIME); } bpf-output { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); } +cgroup-switches { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CGROUP_SWITCHES); } /* * We have to handle the kernel PMU event cycles-ct/cycles-t/mem-loads/mem-stores separately. -- 2.34.1

From: Song Liu <songliubraving@fb.com> This target is used to only build the bootstrap bpftool, which will be used to generate bpf skeletons for other tools, like perf. Signed-off-by: Song Liu <songliubraving@fb.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Link: http://lore.kernel.org/lkml/20201228174054.907740-2-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/bpf/bpftool/Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile index 802c4a113400..0e031ab5fbc1 100644 --- a/tools/bpf/bpftool/Makefile +++ b/tools/bpf/bpftool/Makefile @@ -145,6 +145,8 @@ VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \ /boot/vmlinux-$(shell uname -r) VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS)))) +bootstrap: $(BPFTOOL_BOOTSTRAP) + ifneq ($(VMLINUX_BTF)$(VMLINUX_H),) ifeq ($(feature-clang-bpf-co-re),1) -- 2.34.1

From: Song Liu <songliubraving@fb.com> BPF programs are useful in perf to profile BPF programs. BPF skeleton is by far the easiest way to write BPF tools. Enable building BPF skeletons in util/bpf_skel. A dummy bpf skeleton is added. More bpf skeletons will be added for different use cases. Signed-off-by: Song Liu <songliubraving@fb.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Link: http://lore.kernel.org/lkml/20201229214214.3413833-3-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/build/Makefile.feature | 4 ++- tools/perf/Makefile.config | 9 ++++++ tools/perf/Makefile.perf | 49 +++++++++++++++++++++++++++-- tools/perf/util/bpf_skel/.gitignore | 3 ++ tools/scripts/Makefile.include | 1 + 5 files changed, 63 insertions(+), 3 deletions(-) create mode 100644 tools/perf/util/bpf_skel/.gitignore diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature index 2b948928dfd6..2e8dae24bc07 100644 --- a/tools/build/Makefile.feature +++ b/tools/build/Makefile.feature @@ -100,7 +100,9 @@ FEATURE_TESTS_EXTRA := \ clang \ libbpf \ libpfm4 \ - libdebuginfod + libdebuginfod \ + clang-bpf-co-re + FEATURE_TESTS ?= $(FEATURE_TESTS_BASIC) diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config index f195e079660f..9dac90c93174 100644 --- a/tools/perf/Makefile.config +++ b/tools/perf/Makefile.config @@ -652,6 +652,15 @@ ifndef NO_LIBBPF endif endif +ifdef BUILD_BPF_SKEL + $(call feature_check,clang-bpf-co-re) + ifeq ($(feature-clang-bpf-co-re), 0) + dummy := $(error Error: clang too old. Please install recent clang) + endif + $(call detected,CONFIG_PERF_BPF_SKEL) + CFLAGS += -DHAVE_BPF_SKEL +endif + dwarf-post-unwind := 1 dwarf-post-unwind-text := BUG diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf index e41a8f9b99d2..e22457ebee45 100644 --- a/tools/perf/Makefile.perf +++ b/tools/perf/Makefile.perf @@ -126,6 +126,8 @@ include ../scripts/utilities.mak # # Define NO_LIBDEBUGINFOD if you do not want support debuginfod # +# Define BUILD_BPF_SKEL to enable BPF skeletons +# # As per kernel Makefile, avoid funny character set dependencies unexport LC_ALL @@ -175,6 +177,12 @@ endef LD += $(EXTRA_LDFLAGS) +HOSTCC ?= gcc +HOSTLD ?= ld +HOSTAR ?= ar +CLANG ?= clang +LLVM_STRIP ?= llvm-strip + PKG_CONFIG = $(CROSS_COMPILE)pkg-config LLVM_CONFIG ?= llvm-config @@ -731,7 +739,8 @@ prepare: $(OUTPUT)PERF-VERSION-FILE $(OUTPUT)common-cmds.h archheaders $(drm_ioc $(x86_arch_prctl_code_array) \ $(rename_flags_array) \ $(arch_errno_name_array) \ - $(sync_file_range_arrays) + $(sync_file_range_arrays) \ + bpf-skel $(OUTPUT)%.o: %.c prepare FORCE $(Q)$(MAKE) -f $(srctree)/tools/build/Makefile.build dir=$(build-dir) $@ @@ -1004,7 +1013,43 @@ config-clean: python-clean: $(python-clean) -clean:: $(LIBTRACEEVENT)-clean $(LIBAPI)-clean $(LIBBPF)-clean $(LIBSUBCMD)-clean $(LIBPERF)-clean config-clean fixdep-clean python-clean +SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel) +SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp) +SKELETONS := + +ifdef BUILD_BPF_SKEL +BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool +LIBBPF_SRC := $(abspath ../lib/bpf) +BPF_INCLUDE := -I$(SKEL_TMP_OUT)/.. -I$(BPF_PATH) -I$(LIBBPF_SRC)/.. + +$(SKEL_TMP_OUT): + $(Q)$(MKDIR) -p $@ + +$(BPFTOOL): | $(SKEL_TMP_OUT) + CFLAGS= $(MAKE) -C ../bpf/bpftool \ + OUTPUT=$(SKEL_TMP_OUT)/ bootstrap + +$(SKEL_TMP_OUT)/%.bpf.o: util/bpf_skel/%.bpf.c $(LIBBPF) | $(SKEL_TMP_OUT) + $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf $(BPF_INCLUDE) \ + -c $(filter util/bpf_skel/%.bpf.c,$^) -o $@ && $(LLVM_STRIP) -g $@ + +$(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o | $(BPFTOOL) + $(QUIET_GENSKEL)$(BPFTOOL) gen skeleton $< > $@ + +bpf-skel: $(SKELETONS) + +.PRECIOUS: $(SKEL_TMP_OUT)/%.bpf.o + +else # BUILD_BPF_SKEL + +bpf-skel: + +endif # BUILD_BPF_SKEL + +bpf-skel-clean: + $(call QUIET_CLEAN, bpf-skel) $(RM) -r $(SKEL_TMP_OUT) $(SKELETONS) + +clean:: $(LIBTRACEEVENT)-clean $(LIBAPI)-clean $(LIBBPF)-clean $(LIBSUBCMD)-clean $(LIBPERF)-clean config-clean fixdep-clean python-clean bpf-skel-clean $(call QUIET_CLEAN, core-objs) $(RM) $(LIBPERF_A) $(OUTPUT)perf-archive $(OUTPUT)perf-with-kcore $(LANG_BINDINGS) $(Q)find $(if $(OUTPUT),$(OUTPUT),.) -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete $(Q)$(RM) $(OUTPUT).config-detected diff --git a/tools/perf/util/bpf_skel/.gitignore b/tools/perf/util/bpf_skel/.gitignore new file mode 100644 index 000000000000..5263e9e6c5d8 --- /dev/null +++ b/tools/perf/util/bpf_skel/.gitignore @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only +.tmp +*.skel.h \ No newline at end of file diff --git a/tools/scripts/Makefile.include b/tools/scripts/Makefile.include index a3f99ef6b11b..397fe4787d16 100644 --- a/tools/scripts/Makefile.include +++ b/tools/scripts/Makefile.include @@ -135,6 +135,7 @@ ifneq ($(silent),1) $(MAKE) $(PRINT_DIR) -C $$subdir QUIET_FLEX = @echo ' FLEX '$@; QUIET_BISON = @echo ' BISON '$@; + QUIET_GENSKEL = @echo ' GEN-SKEL '$@; descend = \ +@echo ' DESCEND '$(1); \ -- 2.34.1

From: Song Liu <songliubraving@fb.com> Introduce 'perf stat -b' option, which counts events for BPF programs, like: [root@localhost ~]# ~/perf stat -e ref-cycles,cycles -b 254 -I 1000 1.487903822 115,200 ref-cycles 1.487903822 86,012 cycles 2.489147029 80,560 ref-cycles 2.489147029 73,784 cycles 3.490341825 60,720 ref-cycles 3.490341825 37,797 cycles 4.491540887 37,120 ref-cycles 4.491540887 31,963 cycles The example above counts 'cycles' and 'ref-cycles' of BPF program of id 254. This is similar to bpftool-prog-profile command, but more flexible. 'perf stat -b' creates per-cpu perf_event and loads fentry/fexit BPF programs (monitor-progs) to the target BPF program (target-prog). The monitor-progs read perf_event before and after the target-prog, and aggregate the difference in a BPF map. Then the user space reads data from these maps. A new 'struct bpf_counter' is introduced to provide a common interface that uses BPF programs/maps to count perf events. Committer notes: Removed all but bpf_counter.h includes from evsel.h, not needed at all. Also BPF map lookups for PERCPU_ARRAYs need to have as its value receive buffer passed to the kernel libbpf_num_possible_cpus() entries, not evsel__nr_cpus(evsel), as the former uses /sys/devices/system/cpu/possible while the later uses /sys/devices/system/cpu/online, which may be less than the 'possible' number making the bpf map lookup overwrite memory and cause hard to debug memory corruption. We need to continue using evsel__nr_cpus(evsel) when accessing the perf_counts array tho, not to overwrite another are of memory :-) Signed-off-by: Song Liu <songliubraving@fb.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/lkml/20210120163031.GU12699@kernel.org/ Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-team@fb.com Link: http://lore.kernel.org/lkml/20201229214214.3413833-4-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Conflicts: tools/perf/builtin-stat.c Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/include/uapi/linux/perf_event.h | 7 + tools/perf/Documentation/perf-stat.txt | 18 + tools/perf/Makefile.perf | 2 +- tools/perf/builtin-stat.c | 82 ++++- tools/perf/util/Build | 1 + tools/perf/util/bpf_counter.c | 314 ++++++++++++++++++ tools/perf/util/bpf_counter.h | 72 ++++ .../util/bpf_skel/bpf_prog_profiler.bpf.c | 93 ++++++ tools/perf/util/evsel.c | 5 + tools/perf/util/evsel.h | 5 + tools/perf/util/python.c | 21 ++ tools/perf/util/stat-display.c | 4 +- tools/perf/util/stat.c | 2 +- tools/perf/util/target.c | 34 +- tools/perf/util/target.h | 10 + 15 files changed, 652 insertions(+), 18 deletions(-) create mode 100644 tools/perf/util/bpf_counter.c create mode 100644 tools/perf/util/bpf_counter.h create mode 100644 tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h index e0b41a42c524..5af05bb8189a 100644 --- a/tools/include/uapi/linux/perf_event.h +++ b/tools/include/uapi/linux/perf_event.h @@ -112,6 +112,7 @@ enum perf_sw_ids { PERF_COUNT_SW_EMULATION_FAULTS = 8, PERF_COUNT_SW_DUMMY = 9, PERF_COUNT_SW_BPF_OUTPUT = 10, + PERF_COUNT_SW_CGROUP_SWITCHES = 11, PERF_COUNT_SW_MAX, /* non-ABI */ }; @@ -441,6 +442,12 @@ struct perf_event_attr { __u16 __reserved_2; __u32 aux_sample_size; __u32 __reserved_3; + + /* + * User provided data if sigtrap=1, passed back to user via + * siginfo_t::si_perf, e.g. to permit user to identify the event. + */ + __u64 sig_data; }; /* diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index f9bcd95bf352..5bfc0357694b 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -75,6 +75,24 @@ report:: --tid=<tid>:: stat events on existing thread id (comma separated list) +-b:: +--bpf-prog:: + stat events on existing bpf program id (comma separated list), + requiring root rights. bpftool-prog could be used to find program + id all bpf programs in the system. For example: + + # bpftool prog | head -n 1 + 17247: tracepoint name sys_enter tag 192d548b9d754067 gpl + + # perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000 + + Performance counter stats for 'BPF program(s) 17247': + + 85,967 cycles + 28,982 instructions # 0.34 insn per cycle + + 1.102235068 seconds time elapsed + ifdef::HAVE_LIBPFM[] --pfm-events events:: Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net) diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf index e22457ebee45..0407e9aaaeef 100644 --- a/tools/perf/Makefile.perf +++ b/tools/perf/Makefile.perf @@ -1015,7 +1015,7 @@ python-clean: SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel) SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp) -SKELETONS := +SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h ifdef BUILD_BPF_SKEL BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 89e80a3bc9c3..8cf0490d01b6 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -67,6 +67,7 @@ #include "util/top.h" #include "util/affinity.h" #include "util/pfm.h" +#include "util/bpf_counter.h" #include "asm/bug.h" #include <linux/time64.h> @@ -409,12 +410,32 @@ static int read_affinity_counters(struct timespec *rs) return 0; } +static int read_bpf_map_counters(void) +{ + struct evsel *counter; + int err; + + evlist__for_each_entry(evsel_list, counter) { + err = bpf_counter__read(counter); + if (err) + return err; + } + return 0; +} + static void read_counters(struct timespec *rs) { struct evsel *counter; + int err; - if (!stat_config.stop_read_counter && (read_affinity_counters(rs) < 0)) - return; + if (!stat_config.stop_read_counter) { + if (target__has_bpf(&target)) + err = read_bpf_map_counters(); + else + err = read_affinity_counters(rs); + if (err < 0) + return; + } evlist__for_each_entry(evsel_list, counter) { if (counter->err) @@ -496,11 +517,22 @@ static bool handle_interval(unsigned int interval, int *times) return false; } -static void enable_counters(void) +static int enable_counters(void) { + struct evsel *evsel; + int err; + + if (target__has_bpf(&target)) { + evlist__for_each_entry(evsel_list, evsel) { + err = bpf_counter__enable(evsel); + if (err) + return err; + } + } + if (stat_config.initial_delay < 0) { pr_info(EVLIST_DISABLED_MSG); - return; + return 0; } if (stat_config.initial_delay > 0) { @@ -518,6 +550,7 @@ static void enable_counters(void) if (stat_config.initial_delay > 0) pr_info(EVLIST_ENABLED_MSG); } + return 0; } static void disable_counters(void) @@ -720,7 +753,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) const bool forks = (argc > 0); bool is_pipe = STAT_RECORD ? perf_stat.data.is_pipe : false; struct affinity affinity; - int i, cpu; + int i, cpu, err; bool second_pass = false; if (forks) { @@ -738,6 +771,13 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) if (affinity__setup(&affinity) < 0) return -1; + if (target__has_bpf(&target)) { + evlist__for_each_entry(evsel_list, counter) { + if (bpf_counter__load(counter, &target)) + return -1; + } + } + evlist__for_each_cpu (evsel_list, i, cpu) { affinity__set(&affinity, cpu); @@ -851,7 +891,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) } if (STAT_RECORD) { - int err, fd = perf_data__fd(&perf_stat.data); + int fd = perf_data__fd(&perf_stat.data); if (is_pipe) { err = perf_header__write_pipe(perf_data__fd(&perf_stat.data)); @@ -877,7 +917,9 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) if (forks) { perf_evlist__start_workload(evsel_list); - enable_counters(); + err = enable_counters(); + if (err) + return -1; if (interval || timeout || evlist__ctlfd_initialized(evsel_list)) status = dispatch_events(forks, timeout, interval, ×); @@ -896,7 +938,9 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) if (WIFSIGNALED(status)) psignal(WTERMSIG(status), argv[0]); } else { - enable_counters(); + err = enable_counters(); + if (err) + return -1; status = dispatch_events(forks, timeout, interval, ×); } @@ -1087,6 +1131,10 @@ static struct option stat_options[] = { "stat events on existing process id"), OPT_STRING('t', "tid", &target.tid, "tid", "stat events on existing thread id"), +#ifdef HAVE_BPF_SKEL + OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id", + "stat events on existing bpf program id"), +#endif OPT_BOOLEAN('a', "all-cpus", &target.system_wide, "system-wide collection from all CPUs"), OPT_BOOLEAN('g', "group", &group, @@ -2058,11 +2106,12 @@ int cmd_stat(int argc, const char **argv) "perf stat [<options>] [<command>]", NULL }; - int status = -EINVAL, run_idx; + int status = -EINVAL, run_idx, err; const char *mode; FILE *output = stderr; unsigned int interval, timeout; const char * const stat_subcommands[] = { "record", "report" }; + char errbuf[BUFSIZ]; setlocale(LC_ALL, ""); @@ -2173,6 +2222,12 @@ int cmd_stat(int argc, const char **argv) } else if (big_num_opt == 0) /* User passed --no-big-num */ stat_config.big_num = false; + err = target__validate(&target); + if (err) { + target__strerror(&target, err, errbuf, BUFSIZ); + pr_warning("%s\n", errbuf); + } + setup_system_wide(argc); /* @@ -2243,8 +2298,6 @@ int cmd_stat(int argc, const char **argv) goto out; } - target__validate(&target); - if ((stat_config.aggr_mode == AGGR_THREAD) && (target.system_wide)) target.per_thread = true; @@ -2375,9 +2428,10 @@ int cmd_stat(int argc, const char **argv) * tools remain -acme */ int fd = perf_data__fd(&perf_stat.data); - int err = perf_event__synthesize_kernel_mmap((void *)&perf_stat, - process_synthesized_event, - &perf_stat.session->machines.host); + + err = perf_event__synthesize_kernel_mmap((void *)&perf_stat, + process_synthesized_event, + &perf_stat.session->machines.host); if (err) { pr_warning("Couldn't synthesize the kernel mmap record, harmless, " "older tools may produce warnings about this file\n."); diff --git a/tools/perf/util/Build b/tools/perf/util/Build index 6bf0547bf134..aa3f308cb325 100644 --- a/tools/perf/util/Build +++ b/tools/perf/util/Build @@ -137,6 +137,7 @@ perf-y += clockid.o perf-$(CONFIG_LIBBPF) += bpf-loader.o perf-$(CONFIG_LIBBPF) += bpf_map.o +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o perf-$(CONFIG_LIBELF) += symbol-elf.o perf-$(CONFIG_LIBELF) += probe-file.o diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c new file mode 100644 index 000000000000..04f89120b323 --- /dev/null +++ b/tools/perf/util/bpf_counter.c @@ -0,0 +1,314 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* Copyright (c) 2019 Facebook */ + +#include <assert.h> +#include <limits.h> +#include <unistd.h> +#include <sys/time.h> +#include <sys/resource.h> +#include <linux/err.h> +#include <linux/zalloc.h> +#include <bpf/bpf.h> +#include <bpf/btf.h> +#include <bpf/libbpf.h> + +#include "bpf_counter.h" +#include "counts.h" +#include "debug.h" +#include "evsel.h" +#include "target.h" + +#include "bpf_skel/bpf_prog_profiler.skel.h" + +static inline void *u64_to_ptr(__u64 ptr) +{ + return (void *)(unsigned long)ptr; +} + +static void set_max_rlimit(void) +{ + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; + + setrlimit(RLIMIT_MEMLOCK, &rinf); +} + +static struct bpf_counter *bpf_counter_alloc(void) +{ + struct bpf_counter *counter; + + counter = zalloc(sizeof(*counter)); + if (counter) + INIT_LIST_HEAD(&counter->list); + return counter; +} + +static int bpf_program_profiler__destroy(struct evsel *evsel) +{ + struct bpf_counter *counter, *tmp; + + list_for_each_entry_safe(counter, tmp, + &evsel->bpf_counter_list, list) { + list_del_init(&counter->list); + bpf_prog_profiler_bpf__destroy(counter->skel); + free(counter); + } + assert(list_empty(&evsel->bpf_counter_list)); + + return 0; +} + +static char *bpf_target_prog_name(int tgt_fd) +{ + struct bpf_prog_info_linear *info_linear; + struct bpf_func_info *func_info; + const struct btf_type *t; + char *name = NULL; + struct btf *btf; + + info_linear = bpf_program__get_prog_info_linear( + tgt_fd, 1UL << BPF_PROG_INFO_FUNC_INFO); + if (IS_ERR_OR_NULL(info_linear)) { + pr_debug("failed to get info_linear for prog FD %d\n", tgt_fd); + return NULL; + } + + if (info_linear->info.btf_id == 0 || + btf__get_from_id(info_linear->info.btf_id, &btf)) { + pr_debug("prog FD %d doesn't have valid btf\n", tgt_fd); + goto out; + } + + func_info = u64_to_ptr(info_linear->info.func_info); + t = btf__type_by_id(btf, func_info[0].type_id); + if (!t) { + pr_debug("btf %d doesn't have type %d\n", + info_linear->info.btf_id, func_info[0].type_id); + goto out; + } + name = strdup(btf__name_by_offset(btf, t->name_off)); +out: + free(info_linear); + return name; +} + +static int bpf_program_profiler_load_one(struct evsel *evsel, u32 prog_id) +{ + struct bpf_prog_profiler_bpf *skel; + struct bpf_counter *counter; + struct bpf_program *prog; + char *prog_name; + int prog_fd; + int err; + + prog_fd = bpf_prog_get_fd_by_id(prog_id); + if (prog_fd < 0) { + pr_err("Failed to open fd for bpf prog %u\n", prog_id); + return -1; + } + counter = bpf_counter_alloc(); + if (!counter) { + close(prog_fd); + return -1; + } + + skel = bpf_prog_profiler_bpf__open(); + if (!skel) { + pr_err("Failed to open bpf skeleton\n"); + goto err_out; + } + + skel->rodata->num_cpu = evsel__nr_cpus(evsel); + + bpf_map__resize(skel->maps.events, evsel__nr_cpus(evsel)); + bpf_map__resize(skel->maps.fentry_readings, 1); + bpf_map__resize(skel->maps.accum_readings, 1); + + prog_name = bpf_target_prog_name(prog_fd); + if (!prog_name) { + pr_err("Failed to get program name for bpf prog %u. Does it have BTF?\n", prog_id); + goto err_out; + } + + bpf_object__for_each_program(prog, skel->obj) { + err = bpf_program__set_attach_target(prog, prog_fd, prog_name); + if (err) { + pr_err("bpf_program__set_attach_target failed.\n" + "Does bpf prog %u have BTF?\n", prog_id); + goto err_out; + } + } + set_max_rlimit(); + err = bpf_prog_profiler_bpf__load(skel); + if (err) { + pr_err("bpf_prog_profiler_bpf__load failed\n"); + goto err_out; + } + + assert(skel != NULL); + counter->skel = skel; + list_add(&counter->list, &evsel->bpf_counter_list); + close(prog_fd); + return 0; +err_out: + bpf_prog_profiler_bpf__destroy(skel); + free(counter); + close(prog_fd); + return -1; +} + +static int bpf_program_profiler__load(struct evsel *evsel, struct target *target) +{ + char *bpf_str, *bpf_str_, *tok, *saveptr = NULL, *p; + u32 prog_id; + int ret; + + bpf_str_ = bpf_str = strdup(target->bpf_str); + if (!bpf_str) + return -1; + + while ((tok = strtok_r(bpf_str, ",", &saveptr)) != NULL) { + prog_id = strtoul(tok, &p, 10); + if (prog_id == 0 || prog_id == UINT_MAX || + (*p != '\0' && *p != ',')) { + pr_err("Failed to parse bpf prog ids %s\n", + target->bpf_str); + return -1; + } + + ret = bpf_program_profiler_load_one(evsel, prog_id); + if (ret) { + bpf_program_profiler__destroy(evsel); + free(bpf_str_); + return -1; + } + bpf_str = NULL; + } + free(bpf_str_); + return 0; +} + +static int bpf_program_profiler__enable(struct evsel *evsel) +{ + struct bpf_counter *counter; + int ret; + + list_for_each_entry(counter, &evsel->bpf_counter_list, list) { + assert(counter->skel != NULL); + ret = bpf_prog_profiler_bpf__attach(counter->skel); + if (ret) { + bpf_program_profiler__destroy(evsel); + return ret; + } + } + return 0; +} + +static int bpf_program_profiler__read(struct evsel *evsel) +{ + // perf_cpu_map uses /sys/devices/system/cpu/online + int num_cpu = evsel__nr_cpus(evsel); + // BPF_MAP_TYPE_PERCPU_ARRAY uses /sys/devices/system/cpu/possible + // Sometimes possible > online, like on a Ryzen 3900X that has 24 + // threads but its possible showed 0-31 -acme + int num_cpu_bpf = libbpf_num_possible_cpus(); + struct bpf_perf_event_value values[num_cpu_bpf]; + struct bpf_counter *counter; + int reading_map_fd; + __u32 key = 0; + int err, cpu; + + if (list_empty(&evsel->bpf_counter_list)) + return -EAGAIN; + + for (cpu = 0; cpu < num_cpu; cpu++) { + perf_counts(evsel->counts, cpu, 0)->val = 0; + perf_counts(evsel->counts, cpu, 0)->ena = 0; + perf_counts(evsel->counts, cpu, 0)->run = 0; + } + list_for_each_entry(counter, &evsel->bpf_counter_list, list) { + struct bpf_prog_profiler_bpf *skel = counter->skel; + + assert(skel != NULL); + reading_map_fd = bpf_map__fd(skel->maps.accum_readings); + + err = bpf_map_lookup_elem(reading_map_fd, &key, values); + if (err) { + pr_err("failed to read value\n"); + return err; + } + + for (cpu = 0; cpu < num_cpu; cpu++) { + perf_counts(evsel->counts, cpu, 0)->val += values[cpu].counter; + perf_counts(evsel->counts, cpu, 0)->ena += values[cpu].enabled; + perf_counts(evsel->counts, cpu, 0)->run += values[cpu].running; + } + } + return 0; +} + +static int bpf_program_profiler__install_pe(struct evsel *evsel, int cpu, + int fd) +{ + struct bpf_prog_profiler_bpf *skel; + struct bpf_counter *counter; + int ret; + + list_for_each_entry(counter, &evsel->bpf_counter_list, list) { + skel = counter->skel; + assert(skel != NULL); + + ret = bpf_map_update_elem(bpf_map__fd(skel->maps.events), + &cpu, &fd, BPF_ANY); + if (ret) + return ret; + } + return 0; +} + +struct bpf_counter_ops bpf_program_profiler_ops = { + .load = bpf_program_profiler__load, + .enable = bpf_program_profiler__enable, + .read = bpf_program_profiler__read, + .destroy = bpf_program_profiler__destroy, + .install_pe = bpf_program_profiler__install_pe, +}; + +int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd) +{ + if (list_empty(&evsel->bpf_counter_list)) + return 0; + return evsel->bpf_counter_ops->install_pe(evsel, cpu, fd); +} + +int bpf_counter__load(struct evsel *evsel, struct target *target) +{ + if (target__has_bpf(target)) + evsel->bpf_counter_ops = &bpf_program_profiler_ops; + + if (evsel->bpf_counter_ops) + return evsel->bpf_counter_ops->load(evsel, target); + return 0; +} + +int bpf_counter__enable(struct evsel *evsel) +{ + if (list_empty(&evsel->bpf_counter_list)) + return 0; + return evsel->bpf_counter_ops->enable(evsel); +} + +int bpf_counter__read(struct evsel *evsel) +{ + if (list_empty(&evsel->bpf_counter_list)) + return -EAGAIN; + return evsel->bpf_counter_ops->read(evsel); +} + +void bpf_counter__destroy(struct evsel *evsel) +{ + if (list_empty(&evsel->bpf_counter_list)) + return; + evsel->bpf_counter_ops->destroy(evsel); + evsel->bpf_counter_ops = NULL; +} diff --git a/tools/perf/util/bpf_counter.h b/tools/perf/util/bpf_counter.h new file mode 100644 index 000000000000..2eca210e5dc1 --- /dev/null +++ b/tools/perf/util/bpf_counter.h @@ -0,0 +1,72 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __PERF_BPF_COUNTER_H +#define __PERF_BPF_COUNTER_H 1 + +#include <linux/list.h> + +struct evsel; +struct target; +struct bpf_counter; + +typedef int (*bpf_counter_evsel_op)(struct evsel *evsel); +typedef int (*bpf_counter_evsel_target_op)(struct evsel *evsel, + struct target *target); +typedef int (*bpf_counter_evsel_install_pe_op)(struct evsel *evsel, + int cpu, + int fd); + +struct bpf_counter_ops { + bpf_counter_evsel_target_op load; + bpf_counter_evsel_op enable; + bpf_counter_evsel_op read; + bpf_counter_evsel_op destroy; + bpf_counter_evsel_install_pe_op install_pe; +}; + +struct bpf_counter { + void *skel; + struct list_head list; +}; + +#ifdef HAVE_BPF_SKEL + +int bpf_counter__load(struct evsel *evsel, struct target *target); +int bpf_counter__enable(struct evsel *evsel); +int bpf_counter__read(struct evsel *evsel); +void bpf_counter__destroy(struct evsel *evsel); +int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd); + +#else /* HAVE_BPF_SKEL */ + +#include<linux/err.h> + +static inline int bpf_counter__load(struct evsel *evsel __maybe_unused, + struct target *target __maybe_unused) +{ + return 0; +} + +static inline int bpf_counter__enable(struct evsel *evsel __maybe_unused) +{ + return 0; +} + +static inline int bpf_counter__read(struct evsel *evsel __maybe_unused) +{ + return -EAGAIN; +} + +static inline void bpf_counter__destroy(struct evsel *evsel __maybe_unused) +{ +} + +static inline int bpf_counter__install_pe(struct evsel *evsel __maybe_unused, + int cpu __maybe_unused, + int fd __maybe_unused) +{ + return 0; +} + +#endif /* HAVE_BPF_SKEL */ + +#endif /* __PERF_BPF_COUNTER_H */ diff --git a/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c b/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c new file mode 100644 index 000000000000..c7cec92d0236 --- /dev/null +++ b/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c @@ -0,0 +1,93 @@ +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +// Copyright (c) 2020 Facebook +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +/* map of perf event fds, num_cpu * num_metric entries */ +struct { + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(int)); +} events SEC(".maps"); + +/* readings at fentry */ +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); + __uint(max_entries, 1); +} fentry_readings SEC(".maps"); + +/* accumulated readings */ +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); + __uint(max_entries, 1); +} accum_readings SEC(".maps"); + +const volatile __u32 num_cpu = 1; + +SEC("fentry/XXX") +int BPF_PROG(fentry_XXX) +{ + __u32 key = bpf_get_smp_processor_id(); + struct bpf_perf_event_value *ptr; + __u32 zero = 0; + long err; + + /* look up before reading, to reduce error */ + ptr = bpf_map_lookup_elem(&fentry_readings, &zero); + if (!ptr) + return 0; + + err = bpf_perf_event_read_value(&events, key, ptr, sizeof(*ptr)); + if (err) + return 0; + + return 0; +} + +static inline void +fexit_update_maps(struct bpf_perf_event_value *after) +{ + struct bpf_perf_event_value *before, diff, *accum; + __u32 zero = 0; + + before = bpf_map_lookup_elem(&fentry_readings, &zero); + /* only account samples with a valid fentry_reading */ + if (before && before->counter) { + struct bpf_perf_event_value *accum; + + diff.counter = after->counter - before->counter; + diff.enabled = after->enabled - before->enabled; + diff.running = after->running - before->running; + + accum = bpf_map_lookup_elem(&accum_readings, &zero); + if (accum) { + accum->counter += diff.counter; + accum->enabled += diff.enabled; + accum->running += diff.running; + } + } +} + +SEC("fexit/XXX") +int BPF_PROG(fexit_XXX) +{ + struct bpf_perf_event_value reading; + __u32 cpu = bpf_get_smp_processor_id(); + __u32 one = 1, zero = 0; + int err; + + /* read all events before updating the maps, to reduce error */ + err = bpf_perf_event_read_value(&events, cpu, &reading, sizeof(reading)); + if (err) + return 0; + + fexit_update_maps(&reading); + return 0; +} + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index 1a1cbd16d76d..8829fed10910 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -25,6 +25,7 @@ #include <stdlib.h> #include <perf/evsel.h> #include "asm/bug.h" +#include "bpf_counter.h" #include "callchain.h" #include "cgroup.h" #include "counts.h" @@ -247,6 +248,7 @@ void evsel__init(struct evsel *evsel, evsel->bpf_obj = NULL; evsel->bpf_fd = -1; INIT_LIST_HEAD(&evsel->config_terms); + INIT_LIST_HEAD(&evsel->bpf_counter_list); perf_evsel__object.init(evsel); evsel->sample_size = __evsel__sample_size(attr->sample_type); evsel__calc_id_pos(evsel); @@ -1374,6 +1376,7 @@ void evsel__exit(struct evsel *evsel) { assert(list_empty(&evsel->core.node)); assert(evsel->evlist == NULL); + bpf_counter__destroy(evsel); evsel__free_counts(evsel); perf_evsel__free_fd(&evsel->core); perf_evsel__free_id(&evsel->core); @@ -1797,6 +1800,8 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus, FD(evsel, cpu, thread) = fd; + bpf_counter__install_pe(evsel, cpu, fd); + if (unlikely(test_attr__enabled)) { test_attr__open(&evsel->core.attr, pid, cpus->map[cpu], fd, group_fd, flags); diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h index 79a860d8e3ee..e818e558a275 100644 --- a/tools/perf/util/evsel.h +++ b/tools/perf/util/evsel.h @@ -17,6 +17,8 @@ struct cgroup; struct perf_counts; struct perf_stat_evsel; union perf_event; +struct bpf_counter_ops; +struct target; typedef int (evsel__sb_cb_t)(union perf_event *event, void *data); @@ -127,6 +129,8 @@ struct evsel { * See also evsel__has_callchain(). */ __u64 synth_sample_type; + struct list_head bpf_counter_list; + struct bpf_counter_ops *bpf_counter_ops; }; struct perf_missing_features { @@ -423,4 +427,5 @@ static inline bool evsel__is_dummy_event(struct evsel *evsel) struct perf_env *evsel__env(struct evsel *evsel); int evsel__store_ids(struct evsel *evsel, struct evlist *evlist); + #endif /* __PERF_EVSEL_H */ diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c index ae8edde7c50e..5dcc30ae6cab 100644 --- a/tools/perf/util/python.c +++ b/tools/perf/util/python.c @@ -79,6 +79,27 @@ int metricgroup__copy_metric_events(struct evlist *evlist, struct cgroup *cgrp, return 0; } +/* + * XXX: All these evsel destructors need some better mechanism, like a linked + * list of destructors registered when the relevant code indeed is used instead + * of having more and more calls in perf_evsel__delete(). -- acme + * + * For now, add some more: + * + * Not to drag the BPF bandwagon... + */ +void bpf_counter__destroy(struct evsel *evsel); +int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd); + +void bpf_counter__destroy(struct evsel *evsel __maybe_unused) +{ +} + +int bpf_counter__install_pe(struct evsel *evsel __maybe_unused, int cpu __maybe_unused, int fd __maybe_unused) +{ + return 0; +} + /* * Support debug printing even though util/debug.c is not linked. That means * implementing 'verbose' and 'eprintf'. diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c index 96fe9c1af336..b6f71ad4b536 100644 --- a/tools/perf/util/stat-display.c +++ b/tools/perf/util/stat-display.c @@ -1027,7 +1027,9 @@ static void print_header(struct perf_stat_config *config, if (!config->csv_output) { fprintf(output, "\n"); fprintf(output, " Performance counter stats for "); - if (_target->system_wide) + if (_target->bpf_str) + fprintf(output, "\'BPF program(s) %s", _target->bpf_str); + else if (_target->system_wide) fprintf(output, "\'system wide"); else if (_target->cpu_list) fprintf(output, "\'CPU(s) %s", _target->cpu_list); diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c index bd0decd6d753..8cd26aa111c1 100644 --- a/tools/perf/util/stat.c +++ b/tools/perf/util/stat.c @@ -527,7 +527,7 @@ int create_perf_stat_counter(struct evsel *evsel, if (leader->core.nr_members > 1) attr->read_format |= PERF_FORMAT_ID|PERF_FORMAT_GROUP; - attr->inherit = !config->no_inherit; + attr->inherit = !config->no_inherit && list_empty(&evsel->bpf_counter_list); /* * Some events get initialized with sample_(period/type) set, diff --git a/tools/perf/util/target.c b/tools/perf/util/target.c index a3db13dea937..0f383418e3df 100644 --- a/tools/perf/util/target.c +++ b/tools/perf/util/target.c @@ -56,6 +56,34 @@ enum target_errno target__validate(struct target *target) ret = TARGET_ERRNO__UID_OVERRIDE_SYSTEM; } + /* BPF and CPU are mutually exclusive */ + if (target->bpf_str && target->cpu_list) { + target->cpu_list = NULL; + if (ret == TARGET_ERRNO__SUCCESS) + ret = TARGET_ERRNO__BPF_OVERRIDE_CPU; + } + + /* BPF and PID/TID are mutually exclusive */ + if (target->bpf_str && target->tid) { + target->tid = NULL; + if (ret == TARGET_ERRNO__SUCCESS) + ret = TARGET_ERRNO__BPF_OVERRIDE_PID; + } + + /* BPF and UID are mutually exclusive */ + if (target->bpf_str && target->uid_str) { + target->uid_str = NULL; + if (ret == TARGET_ERRNO__SUCCESS) + ret = TARGET_ERRNO__BPF_OVERRIDE_UID; + } + + /* BPF and THREADS are mutually exclusive */ + if (target->bpf_str && target->per_thread) { + target->per_thread = false; + if (ret == TARGET_ERRNO__SUCCESS) + ret = TARGET_ERRNO__BPF_OVERRIDE_THREAD; + } + /* THREAD and SYSTEM/CPU are mutually exclusive */ if (target->per_thread && (target->system_wide || target->cpu_list)) { target->per_thread = false; @@ -109,6 +137,10 @@ static const char *target__error_str[] = { "PID/TID switch overriding SYSTEM", "UID switch overriding SYSTEM", "SYSTEM/CPU switch overriding PER-THREAD", + "BPF switch overriding CPU", + "BPF switch overriding PID/TID", + "BPF switch overriding UID", + "BPF switch overriding THREAD", "Invalid User: %s", "Problems obtaining information for user %s", }; @@ -134,7 +166,7 @@ int target__strerror(struct target *target, int errnum, switch (errnum) { case TARGET_ERRNO__PID_OVERRIDE_CPU ... - TARGET_ERRNO__SYSTEM_OVERRIDE_THREAD: + TARGET_ERRNO__BPF_OVERRIDE_THREAD: snprintf(buf, buflen, "%s", msg); break; diff --git a/tools/perf/util/target.h b/tools/perf/util/target.h index 6ef01a83b24e..f132c6c2eef8 100644 --- a/tools/perf/util/target.h +++ b/tools/perf/util/target.h @@ -10,6 +10,7 @@ struct target { const char *tid; const char *cpu_list; const char *uid_str; + const char *bpf_str; uid_t uid; bool system_wide; bool uses_mmap; @@ -36,6 +37,10 @@ enum target_errno { TARGET_ERRNO__PID_OVERRIDE_SYSTEM, TARGET_ERRNO__UID_OVERRIDE_SYSTEM, TARGET_ERRNO__SYSTEM_OVERRIDE_THREAD, + TARGET_ERRNO__BPF_OVERRIDE_CPU, + TARGET_ERRNO__BPF_OVERRIDE_PID, + TARGET_ERRNO__BPF_OVERRIDE_UID, + TARGET_ERRNO__BPF_OVERRIDE_THREAD, /* for target__parse_uid() */ TARGET_ERRNO__INVALID_UID, @@ -59,6 +64,11 @@ static inline bool target__has_cpu(struct target *target) return target->system_wide || target->cpu_list; } +static inline bool target__has_bpf(struct target *target) +{ + return target->bpf_str; +} + static inline bool target__none(struct target *target) { return !target__has_task(target) && !target__has_cpu(target); -- 2.34.1

From: Song Liu <songliubraving@fb.com> The perf tool uses performance monitoring counters (PMCs) to monitor system performance. The PMCs are limited hardware resources. For example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu. Modern data center systems use these PMCs in many different ways: system level monitoring, (maybe nested) container level monitoring, per process monitoring, profiling (in sample mode), etc. In some cases, there are more active perf_events than available hardware PMCs. To allow all perf_events to have a chance to run, it is necessary to do expensive time multiplexing of events. On the other hand, many monitoring tools count the common metrics (cycles, instructions). It is a waste to have multiple tools create multiple perf_events of "cycles" and occupy multiple PMCs. bperf tries to reduce such wastes by allowing multiple perf_events of "cycles" or "instructions" (at different scopes) to share PMUs. Instead of having each perf-stat session to read its own perf_events, bperf uses BPF programs to read the perf_events and aggregate readings to BPF maps. Then, the perf-stat session(s) reads the values from these BPF maps. Please refer to the comment before the definition of bperf_ops for the description of bperf architecture. bperf is off by default. To enable it, pass --bpf-counters option to perf-stat. bperf uses a BPF hashmap to share information about BPF programs and maps used by bperf. This map is pinned to bpffs. The default path is /sys/fs/bpf/perf_attr_map. The user could change the path with option --bpf-attr-map. Committer testing: # dmesg|grep "Performance Events" -A5 [ 0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver. [ 0.225280] ... version: 0 [ 0.225280] ... bit width: 48 [ 0.225281] ... generic registers: 6 [ 0.225281] ... value mask: 0000ffffffffffff [ 0.225281] ... max period: 00007fffffffffff # # for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done [1] 2436231 [2] 2436232 [3] 2436233 [4] 2436234 [5] 2436235 [6] 2436236 # perf stat -a -e cycles,instructions sleep 0.1 Performance counter stats for 'system wide': 310,326,987 cycles (41.87%) 236,143,290 instructions # 0.76 insn per cycle (41.87%) 0.100800885 seconds time elapsed # We can see that the counters were enabled for this workload 41.87% of the time. Now with --bpf-counters: # for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done [1] 2436514 [2] 2436515 [3] 2436516 [4] 2436517 [5] 2436518 [6] 2436519 [7] 2436520 [8] 2436521 [9] 2436522 [10] 2436523 [11] 2436524 [12] 2436525 [13] 2436526 [14] 2436527 [15] 2436528 [16] 2436529 [17] 2436530 [18] 2436531 [19] 2436532 [20] 2436533 [21] 2436534 [22] 2436535 [23] 2436536 [24] 2436537 [25] 2436538 [26] 2436539 [27] 2436540 [28] 2436541 [29] 2436542 [30] 2436543 [31] 2436544 [32] 2436545 # # ls -la /sys/fs/bpf/perf_attr_map -rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map # bpftool map | grep bperf | wc -l 64 # # bpftool map | tail 1265: percpu_array name accum_readings flags 0x0 key 4B value 24B max_entries 1 memlock 4096B 1266: hash name filter flags 0x0 key 4B value 4B max_entries 1 memlock 4096B 1267: array name bperf_fo.bss flags 0x400 key 4B value 8B max_entries 1 memlock 4096B btf_id 996 pids perf(2436545) 1268: percpu_array name accum_readings flags 0x0 key 4B value 24B max_entries 1 memlock 4096B 1269: hash name filter flags 0x0 key 4B value 4B max_entries 1 memlock 4096B 1270: array name bperf_fo.bss flags 0x400 key 4B value 8B max_entries 1 memlock 4096B btf_id 997 pids perf(2436541) 1285: array name pid_iter.rodata flags 0x480 key 4B value 4B max_entries 1 memlock 4096B btf_id 1017 frozen pids bpftool(2437504) 1286: array flags 0x0 key 4B value 32B max_entries 1 memlock 4096B # # bpftool map dump id 1268 | tail value (CPU 21): 8f f3 bc ca 00 00 00 00 80 fd 2a d1 4d 00 00 00 80 fd 2a d1 4d 00 00 00 value (CPU 22): 7e d5 64 4d 00 00 00 00 a4 8a 2e ee 4d 00 00 00 a4 8a 2e ee 4d 00 00 00 value (CPU 23): a7 78 3e 06 01 00 00 00 b2 34 94 f6 4d 00 00 00 b2 34 94 f6 4d 00 00 00 Found 1 element # bpftool map dump id 1268 | tail value (CPU 21): c6 8b d9 ca 00 00 00 00 20 c6 fc 83 4e 00 00 00 20 c6 fc 83 4e 00 00 00 value (CPU 22): 9c b4 d2 4d 00 00 00 00 3e 0c df 89 4e 00 00 00 3e 0c df 89 4e 00 00 00 value (CPU 23): 18 43 66 06 01 00 00 00 5b 69 ed 83 4e 00 00 00 5b 69 ed 83 4e 00 00 00 Found 1 element # bpftool map dump id 1268 | tail value (CPU 21): f2 6e db ca 00 00 00 00 92 67 4c ba 4e 00 00 00 92 67 4c ba 4e 00 00 00 value (CPU 22): dc 8e e1 4d 00 00 00 00 d9 32 7a c5 4e 00 00 00 d9 32 7a c5 4e 00 00 00 value (CPU 23): bd 2b 73 06 01 00 00 00 7c 73 87 bf 4e 00 00 00 7c 73 87 bf 4e 00 00 00 Found 1 element # # perf stat --bpf-counters -a -e cycles,instructions sleep 0.1 Performance counter stats for 'system wide': 119,410,122 cycles 152,105,479 instructions # 1.27 insn per cycle 0.101395093 seconds time elapsed # See? We had the counters enabled all the time. Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: kernel-team@fb.com Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Conflicts: tools/perf/util/evsel.h Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/Documentation/perf-stat.txt | 11 + tools/perf/Makefile.perf | 1 + tools/perf/builtin-stat.c | 10 + tools/perf/util/bpf_counter.c | 519 +++++++++++++++++- tools/perf/util/bpf_skel/bperf.h | 14 + tools/perf/util/bpf_skel/bperf_follower.bpf.c | 69 +++ tools/perf/util/bpf_skel/bperf_leader.bpf.c | 46 ++ tools/perf/util/bpf_skel/bperf_u.h | 14 + tools/perf/util/evsel.h | 20 +- tools/perf/util/target.h | 4 +- 10 files changed, 701 insertions(+), 7 deletions(-) create mode 100644 tools/perf/util/bpf_skel/bperf.h create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c create mode 100644 tools/perf/util/bpf_skel/bperf_u.h diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index 5bfc0357694b..925d684d539c 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -93,6 +93,17 @@ report:: 1.102235068 seconds time elapsed +--bpf-counters:: + Use BPF programs to aggregate readings from perf_events. This + allows multiple perf-stat sessions that are counting the same metric (cycles, + instructions, etc.) to share hardware counters. + +--bpf-attr-map:: + With option "--bpf-counters", different perf-stat sessions share + information about shared BPF programs and maps via a pinned hashmap. + Use "--bpf-attr-map" to specify the path of this pinned hashmap. + The default path is /sys/fs/bpf/perf_attr_map. + ifdef::HAVE_LIBPFM[] --pfm-events events:: Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net) diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf index 0407e9aaaeef..7ddc4b65260a 100644 --- a/tools/perf/Makefile.perf +++ b/tools/perf/Makefile.perf @@ -1016,6 +1016,7 @@ python-clean: SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel) SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp) SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h ifdef BUILD_BPF_SKEL BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 8cf0490d01b6..6e2b28bb4dfc 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -779,6 +779,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) } evlist__for_each_cpu (evsel_list, i, cpu) { + /* + * bperf calls evsel__open_per_cpu() in bperf__load(), so + * no need to call it again here. + */ + if (target.use_bpf) + break; affinity__set(&affinity, cpu); evlist__for_each_entry(evsel_list, counter) { @@ -1134,6 +1140,10 @@ static struct option stat_options[] = { #ifdef HAVE_BPF_SKEL OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id", "stat events on existing bpf program id"), + OPT_BOOLEAN(0, "bpf-counters", &target.use_bpf, + "use bpf program to count events"), + OPT_STRING(0, "bpf-attr-map", &target.attr_map, "attr-map-path", + "path to perf_event_attr map"), #endif OPT_BOOLEAN('a', "all-cpus", &target.system_wide, "system-wide collection from all CPUs"), diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index 04f89120b323..81d1df3c4ec0 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -5,6 +5,7 @@ #include <assert.h> #include <limits.h> #include <unistd.h> +#include <sys/file.h> #include <sys/time.h> #include <sys/resource.h> #include <linux/err.h> @@ -12,14 +13,45 @@ #include <bpf/bpf.h> #include <bpf/btf.h> #include <bpf/libbpf.h> +#include <api/fs/fs.h> #include "bpf_counter.h" #include "counts.h" #include "debug.h" #include "evsel.h" +#include "evlist.h" #include "target.h" +#include "cpumap.h" +#include "thread_map.h" #include "bpf_skel/bpf_prog_profiler.skel.h" +#include "bpf_skel/bperf_u.h" +#include "bpf_skel/bperf_leader.skel.h" +#include "bpf_skel/bperf_follower.skel.h" + +/* + * bperf uses a hashmap, the attr_map, to track all the leader programs. + * The hashmap is pinned in bpffs. flock() on this file is used to ensure + * no concurrent access to the attr_map. The key of attr_map is struct + * perf_event_attr, and the value is struct perf_event_attr_map_entry. + * + * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the + * leader prog, and the diff_map. Each perf-stat session holds a reference + * to the bpf_link to make sure the leader prog is attached to sched_switch + * tracepoint. + * + * Since the hashmap only contains IDs of the bpf_link and diff_map, it + * does not hold any references to the leader program. Once all perf-stat + * sessions of these events exit, the leader prog, its maps, and the + * perf_events will be freed. + */ +struct perf_event_attr_map_entry { + __u32 link_id; + __u32 diff_map_id; +}; + +#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map" +#define ATTR_MAP_SIZE 16 static inline void *u64_to_ptr(__u64 ptr) { @@ -274,17 +306,494 @@ struct bpf_counter_ops bpf_program_profiler_ops = { .install_pe = bpf_program_profiler__install_pe, }; +static __u32 bpf_link_get_id(int fd) +{ + struct bpf_link_info link_info = {0}; + __u32 link_info_len = sizeof(link_info); + + bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + return link_info.id; +} + +static __u32 bpf_link_get_prog_id(int fd) +{ + struct bpf_link_info link_info = {0}; + __u32 link_info_len = sizeof(link_info); + + bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + return link_info.prog_id; +} + +static __u32 bpf_map_get_id(int fd) +{ + struct bpf_map_info map_info = {0}; + __u32 map_info_len = sizeof(map_info); + + bpf_obj_get_info_by_fd(fd, &map_info, &map_info_len); + return map_info.id; +} + +static int bperf_lock_attr_map(struct target *target) +{ + char path[PATH_MAX]; + int map_fd, err; + + if (target->attr_map) { + scnprintf(path, PATH_MAX, "%s", target->attr_map); + } else { + scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(), + DEFAULT_ATTR_MAP_PATH); + } + + if (access(path, F_OK)) { + map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, + sizeof(struct perf_event_attr), + sizeof(struct perf_event_attr_map_entry), + ATTR_MAP_SIZE, 0); + if (map_fd < 0) + return -1; + + err = bpf_obj_pin(map_fd, path); + if (err) { + /* someone pinned the map in parallel? */ + close(map_fd); + map_fd = bpf_obj_get(path); + if (map_fd < 0) + return -1; + } + } else { + map_fd = bpf_obj_get(path); + if (map_fd < 0) + return -1; + } + + err = flock(map_fd, LOCK_EX); + if (err) { + close(map_fd); + return -1; + } + return map_fd; +} + +/* trigger the leader program on a cpu */ +static int bperf_trigger_reading(int prog_fd, int cpu) +{ + DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts, + .ctx_in = NULL, + .ctx_size_in = 0, + .flags = BPF_F_TEST_RUN_ON_CPU, + .cpu = cpu, + .retval = 0, + ); + + return bpf_prog_test_run_opts(prog_fd, &opts); +} + +static int bperf_check_target(struct evsel *evsel, + struct target *target, + enum bperf_filter_type *filter_type, + __u32 *filter_entry_cnt) +{ + if (evsel->leader->core.nr_members > 1) { + pr_err("bpf managed perf events do not yet support groups.\n"); + return -1; + } + + /* determine filter type based on target */ + if (target->system_wide) { + *filter_type = BPERF_FILTER_GLOBAL; + *filter_entry_cnt = 1; + } else if (target->cpu_list) { + *filter_type = BPERF_FILTER_CPU; + *filter_entry_cnt = perf_cpu_map__nr(evsel__cpus(evsel)); + } else if (target->tid) { + *filter_type = BPERF_FILTER_PID; + *filter_entry_cnt = perf_thread_map__nr(evsel->core.threads); + } else if (target->pid || evsel->evlist->workload.pid != -1) { + *filter_type = BPERF_FILTER_TGID; + *filter_entry_cnt = perf_thread_map__nr(evsel->core.threads); + } else { + pr_err("bpf managed perf events do not yet support these targets.\n"); + return -1; + } + + return 0; +} + +static struct perf_cpu_map *all_cpu_map; + +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd, + struct perf_event_attr_map_entry *entry) +{ + struct bperf_leader_bpf *skel = bperf_leader_bpf__open(); + int link_fd, diff_map_fd, err; + struct bpf_link *link = NULL; + + if (!skel) { + pr_err("Failed to open leader skeleton\n"); + return -1; + } + + bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus()); + err = bperf_leader_bpf__load(skel); + if (err) { + pr_err("Failed to load leader skeleton\n"); + goto out; + } + + err = -1; + link = bpf_program__attach(skel->progs.on_switch); + if (!link) { + pr_err("Failed to attach leader program\n"); + goto out; + } + + link_fd = bpf_link__fd(link); + diff_map_fd = bpf_map__fd(skel->maps.diff_readings); + entry->link_id = bpf_link_get_id(link_fd); + entry->diff_map_id = bpf_map_get_id(diff_map_fd); + err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, BPF_ANY); + assert(err == 0); + + evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id); + assert(evsel->bperf_leader_link_fd >= 0); + + /* + * save leader_skel for install_pe, which is called within + * following evsel__open_per_cpu call + */ + evsel->leader_skel = skel; + evsel__open_per_cpu(evsel, all_cpu_map, -1); + +out: + bperf_leader_bpf__destroy(skel); + bpf_link__destroy(link); + return err; +} + +static int bperf__load(struct evsel *evsel, struct target *target) +{ + struct perf_event_attr_map_entry entry = {0xffffffff, 0xffffffff}; + int attr_map_fd, diff_map_fd = -1, err; + enum bperf_filter_type filter_type; + __u32 filter_entry_cnt, i; + + if (bperf_check_target(evsel, target, &filter_type, &filter_entry_cnt)) + return -1; + + if (!all_cpu_map) { + all_cpu_map = perf_cpu_map__new(NULL); + if (!all_cpu_map) + return -1; + } + + evsel->bperf_leader_prog_fd = -1; + evsel->bperf_leader_link_fd = -1; + + /* + * Step 1: hold a fd on the leader program and the bpf_link, if + * the program is not already gone, reload the program. + * Use flock() to ensure exclusive access to the perf_event_attr + * map. + */ + attr_map_fd = bperf_lock_attr_map(target); + if (attr_map_fd < 0) { + pr_err("Failed to lock perf_event_attr map\n"); + return -1; + } + + err = bpf_map_lookup_elem(attr_map_fd, &evsel->core.attr, &entry); + if (err) { + err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, &entry, BPF_ANY); + if (err) + goto out; + } + + evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry.link_id); + if (evsel->bperf_leader_link_fd < 0 && + bperf_reload_leader_program(evsel, attr_map_fd, &entry)) + goto out; + + /* + * The bpf_link holds reference to the leader program, and the + * leader program holds reference to the maps. Therefore, if + * link_id is valid, diff_map_id should also be valid. + */ + evsel->bperf_leader_prog_fd = bpf_prog_get_fd_by_id( + bpf_link_get_prog_id(evsel->bperf_leader_link_fd)); + assert(evsel->bperf_leader_prog_fd >= 0); + + diff_map_fd = bpf_map_get_fd_by_id(entry.diff_map_id); + assert(diff_map_fd >= 0); + + /* + * bperf uses BPF_PROG_TEST_RUN to get accurate reading. Check + * whether the kernel support it + */ + err = bperf_trigger_reading(evsel->bperf_leader_prog_fd, 0); + if (err) { + pr_err("The kernel does not support test_run for raw_tp BPF programs.\n" + "Therefore, --use-bpf might show inaccurate readings\n"); + goto out; + } + + /* Step 2: load the follower skeleton */ + evsel->follower_skel = bperf_follower_bpf__open(); + if (!evsel->follower_skel) { + pr_err("Failed to open follower skeleton\n"); + goto out; + } + + /* attach fexit program to the leader program */ + bpf_program__set_attach_target(evsel->follower_skel->progs.fexit_XXX, + evsel->bperf_leader_prog_fd, "on_switch"); + + /* connect to leader diff_reading map */ + bpf_map__reuse_fd(evsel->follower_skel->maps.diff_readings, diff_map_fd); + + /* set up reading map */ + bpf_map__set_max_entries(evsel->follower_skel->maps.accum_readings, + filter_entry_cnt); + /* set up follower filter based on target */ + bpf_map__set_max_entries(evsel->follower_skel->maps.filter, + filter_entry_cnt); + err = bperf_follower_bpf__load(evsel->follower_skel); + if (err) { + pr_err("Failed to load follower skeleton\n"); + bperf_follower_bpf__destroy(evsel->follower_skel); + evsel->follower_skel = NULL; + goto out; + } + + for (i = 0; i < filter_entry_cnt; i++) { + int filter_map_fd; + __u32 key; + + if (filter_type == BPERF_FILTER_PID || + filter_type == BPERF_FILTER_TGID) + key = evsel->core.threads->map[i].pid; + else if (filter_type == BPERF_FILTER_CPU) + key = evsel->core.cpus->map[i]; + else + break; + + filter_map_fd = bpf_map__fd(evsel->follower_skel->maps.filter); + bpf_map_update_elem(filter_map_fd, &key, &i, BPF_ANY); + } + + evsel->follower_skel->bss->type = filter_type; + + err = bperf_follower_bpf__attach(evsel->follower_skel); + +out: + if (err && evsel->bperf_leader_link_fd >= 0) + close(evsel->bperf_leader_link_fd); + if (err && evsel->bperf_leader_prog_fd >= 0) + close(evsel->bperf_leader_prog_fd); + if (diff_map_fd >= 0) + close(diff_map_fd); + + flock(attr_map_fd, LOCK_UN); + close(attr_map_fd); + + return err; +} + +static int bperf__install_pe(struct evsel *evsel, int cpu, int fd) +{ + struct bperf_leader_bpf *skel = evsel->leader_skel; + + return bpf_map_update_elem(bpf_map__fd(skel->maps.events), + &cpu, &fd, BPF_ANY); +} + +/* + * trigger the leader prog on each cpu, so the accum_reading map could get + * the latest readings. + */ +static int bperf_sync_counters(struct evsel *evsel) +{ + int num_cpu, i, cpu; + + num_cpu = all_cpu_map->nr; + for (i = 0; i < num_cpu; i++) { + cpu = all_cpu_map->map[i]; + bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu); + } + return 0; +} + +static int bperf__enable(struct evsel *evsel) +{ + evsel->follower_skel->bss->enabled = 1; + return 0; +} + +static int bperf__read(struct evsel *evsel) +{ + struct bperf_follower_bpf *skel = evsel->follower_skel; + __u32 num_cpu_bpf = cpu__max_cpu(); + struct bpf_perf_event_value values[num_cpu_bpf]; + int reading_map_fd, err = 0; + __u32 i, j, num_cpu; + + bperf_sync_counters(evsel); + reading_map_fd = bpf_map__fd(skel->maps.accum_readings); + + for (i = 0; i < bpf_map__max_entries(skel->maps.accum_readings); i++) { + __u32 cpu; + + err = bpf_map_lookup_elem(reading_map_fd, &i, values); + if (err) + goto out; + switch (evsel->follower_skel->bss->type) { + case BPERF_FILTER_GLOBAL: + assert(i == 0); + + num_cpu = all_cpu_map->nr; + for (j = 0; j < num_cpu; j++) { + cpu = all_cpu_map->map[j]; + perf_counts(evsel->counts, cpu, 0)->val = values[cpu].counter; + perf_counts(evsel->counts, cpu, 0)->ena = values[cpu].enabled; + perf_counts(evsel->counts, cpu, 0)->run = values[cpu].running; + } + break; + case BPERF_FILTER_CPU: + cpu = evsel->core.cpus->map[i]; + perf_counts(evsel->counts, i, 0)->val = values[cpu].counter; + perf_counts(evsel->counts, i, 0)->ena = values[cpu].enabled; + perf_counts(evsel->counts, i, 0)->run = values[cpu].running; + break; + case BPERF_FILTER_PID: + case BPERF_FILTER_TGID: + perf_counts(evsel->counts, 0, i)->val = 0; + perf_counts(evsel->counts, 0, i)->ena = 0; + perf_counts(evsel->counts, 0, i)->run = 0; + + for (cpu = 0; cpu < num_cpu_bpf; cpu++) { + perf_counts(evsel->counts, 0, i)->val += values[cpu].counter; + perf_counts(evsel->counts, 0, i)->ena += values[cpu].enabled; + perf_counts(evsel->counts, 0, i)->run += values[cpu].running; + } + break; + default: + break; + } + } +out: + return err; +} + +static int bperf__destroy(struct evsel *evsel) +{ + bperf_follower_bpf__destroy(evsel->follower_skel); + close(evsel->bperf_leader_prog_fd); + close(evsel->bperf_leader_link_fd); + return 0; +} + +/* + * bperf: share hardware PMCs with BPF + * + * perf uses performance monitoring counters (PMC) to monitor system + * performance. The PMCs are limited hardware resources. For example, + * Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu. + * + * Modern data center systems use these PMCs in many different ways: + * system level monitoring, (maybe nested) container level monitoring, per + * process monitoring, profiling (in sample mode), etc. In some cases, + * there are more active perf_events than available hardware PMCs. To allow + * all perf_events to have a chance to run, it is necessary to do expensive + * time multiplexing of events. + * + * On the other hand, many monitoring tools count the common metrics + * (cycles, instructions). It is a waste to have multiple tools create + * multiple perf_events of "cycles" and occupy multiple PMCs. + * + * bperf tries to reduce such wastes by allowing multiple perf_events of + * "cycles" or "instructions" (at different scopes) to share PMUs. Instead + * of having each perf-stat session to read its own perf_events, bperf uses + * BPF programs to read the perf_events and aggregate readings to BPF maps. + * Then, the perf-stat session(s) reads the values from these BPF maps. + * + * || + * shared progs and maps <- || -> per session progs and maps + * || + * --------------- || + * | perf_events | || + * --------------- fexit || ----------------- + * | --------||----> | follower prog | + * --------------- / || --- ----------------- + * cs -> | leader prog |/ ||/ | | + * --> --------------- /|| -------------- ------------------ + * / | | / || | filter map | | accum_readings | + * / ------------ ------------ || -------------- ------------------ + * | | prev map | | diff map | || | + * | ------------ ------------ || | + * \ || | + * = \ ==================================================== | ============ + * \ / user space + * \ / + * \ / + * BPF_PROG_TEST_RUN BPF_MAP_LOOKUP_ELEM + * \ / + * \ / + * \------ perf-stat ----------------------/ + * + * The figure above shows the architecture of bperf. Note that the figure + * is divided into 3 regions: shared progs and maps (top left), per session + * progs and maps (top right), and user space (bottom). + * + * The leader prog is triggered on each context switch (cs). The leader + * prog reads perf_events and stores the difference (current_reading - + * previous_reading) to the diff map. For the same metric, e.g. "cycles", + * multiple perf-stat sessions share the same leader prog. + * + * Each perf-stat session creates a follower prog as fexit program to the + * leader prog. It is possible to attach up to BPF_MAX_TRAMP_PROGS (38) + * follower progs to the same leader prog. The follower prog checks current + * task and processor ID to decide whether to add the value from the diff + * map to its accumulated reading map (accum_readings). + * + * Finally, perf-stat user space reads the value from accum_reading map. + * + * Besides context switch, it is also necessary to trigger the leader prog + * before perf-stat reads the value. Otherwise, the accum_reading map may + * not have the latest reading from the perf_events. This is achieved by + * triggering the event via sys_bpf(BPF_PROG_TEST_RUN) to each CPU. + * + * Comment before the definition of struct perf_event_attr_map_entry + * describes how different sessions of perf-stat share information about + * the leader prog. + */ + +struct bpf_counter_ops bperf_ops = { + .load = bperf__load, + .enable = bperf__enable, + .read = bperf__read, + .install_pe = bperf__install_pe, + .destroy = bperf__destroy, +}; + +static inline bool bpf_counter_skip(struct evsel *evsel) +{ + return list_empty(&evsel->bpf_counter_list) && + evsel->follower_skel == NULL; +} + int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd) { - if (list_empty(&evsel->bpf_counter_list)) + if (bpf_counter_skip(evsel)) return 0; return evsel->bpf_counter_ops->install_pe(evsel, cpu, fd); } int bpf_counter__load(struct evsel *evsel, struct target *target) { - if (target__has_bpf(target)) + if (target->bpf_str) evsel->bpf_counter_ops = &bpf_program_profiler_ops; + else if (target->use_bpf) + evsel->bpf_counter_ops = &bperf_ops; if (evsel->bpf_counter_ops) return evsel->bpf_counter_ops->load(evsel, target); @@ -293,21 +802,21 @@ int bpf_counter__load(struct evsel *evsel, struct target *target) int bpf_counter__enable(struct evsel *evsel) { - if (list_empty(&evsel->bpf_counter_list)) + if (bpf_counter_skip(evsel)) return 0; return evsel->bpf_counter_ops->enable(evsel); } int bpf_counter__read(struct evsel *evsel) { - if (list_empty(&evsel->bpf_counter_list)) + if (bpf_counter_skip(evsel)) return -EAGAIN; return evsel->bpf_counter_ops->read(evsel); } void bpf_counter__destroy(struct evsel *evsel) { - if (list_empty(&evsel->bpf_counter_list)) + if (bpf_counter_skip(evsel)) return; evsel->bpf_counter_ops->destroy(evsel); evsel->bpf_counter_ops = NULL; diff --git a/tools/perf/util/bpf_skel/bperf.h b/tools/perf/util/bpf_skel/bperf.h new file mode 100644 index 000000000000..186a5551ddb9 --- /dev/null +++ b/tools/perf/util/bpf_skel/bperf.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +// Copyright (c) 2021 Facebook + +#ifndef __BPERF_STAT_H +#define __BPERF_STAT_H + +typedef struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); + __uint(max_entries, 1); +} reading_map; + +#endif /* __BPERF_STAT_H */ diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c new file mode 100644 index 000000000000..b8fa3cb2da23 --- /dev/null +++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c @@ -0,0 +1,69 @@ +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +// Copyright (c) 2021 Facebook +#include <linux/bpf.h> +#include <linux/perf_event.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bperf.h" +#include "bperf_u.h" + +reading_map diff_readings SEC(".maps"); +reading_map accum_readings SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(__u32)); +} filter SEC(".maps"); + +enum bperf_filter_type type = 0; +int enabled = 0; + +SEC("fexit/XXX") +int BPF_PROG(fexit_XXX) +{ + struct bpf_perf_event_value *diff_val, *accum_val; + __u32 filter_key, zero = 0; + __u32 *accum_key; + + if (!enabled) + return 0; + + switch (type) { + case BPERF_FILTER_GLOBAL: + accum_key = &zero; + goto do_add; + case BPERF_FILTER_CPU: + filter_key = bpf_get_smp_processor_id(); + break; + case BPERF_FILTER_PID: + filter_key = bpf_get_current_pid_tgid() & 0xffffffff; + break; + case BPERF_FILTER_TGID: + filter_key = bpf_get_current_pid_tgid() >> 32; + break; + default: + return 0; + } + + accum_key = bpf_map_lookup_elem(&filter, &filter_key); + if (!accum_key) + return 0; + +do_add: + diff_val = bpf_map_lookup_elem(&diff_readings, &zero); + if (!diff_val) + return 0; + + accum_val = bpf_map_lookup_elem(&accum_readings, accum_key); + if (!accum_val) + return 0; + + accum_val->counter += diff_val->counter; + accum_val->enabled += diff_val->enabled; + accum_val->running += diff_val->running; + + return 0; +} + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/tools/perf/util/bpf_skel/bperf_leader.bpf.c b/tools/perf/util/bpf_skel/bperf_leader.bpf.c new file mode 100644 index 000000000000..4f70d1459e86 --- /dev/null +++ b/tools/perf/util/bpf_skel/bperf_leader.bpf.c @@ -0,0 +1,46 @@ +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +// Copyright (c) 2021 Facebook +#include <linux/bpf.h> +#include <linux/perf_event.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include "bperf.h" + +struct { + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(int)); + __uint(map_flags, BPF_F_PRESERVE_ELEMS); +} events SEC(".maps"); + +reading_map prev_readings SEC(".maps"); +reading_map diff_readings SEC(".maps"); + +SEC("raw_tp/sched_switch") +int BPF_PROG(on_switch) +{ + struct bpf_perf_event_value val, *prev_val, *diff_val; + __u32 key = bpf_get_smp_processor_id(); + __u32 zero = 0; + long err; + + prev_val = bpf_map_lookup_elem(&prev_readings, &zero); + if (!prev_val) + return 0; + + diff_val = bpf_map_lookup_elem(&diff_readings, &zero); + if (!diff_val) + return 0; + + err = bpf_perf_event_read_value(&events, key, &val, sizeof(val)); + if (err) + return 0; + + diff_val->counter = val.counter - prev_val->counter; + diff_val->enabled = val.enabled - prev_val->enabled; + diff_val->running = val.running - prev_val->running; + *prev_val = val; + return 0; +} + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/tools/perf/util/bpf_skel/bperf_u.h b/tools/perf/util/bpf_skel/bperf_u.h new file mode 100644 index 000000000000..1ce0c2c905c1 --- /dev/null +++ b/tools/perf/util/bpf_skel/bperf_u.h @@ -0,0 +1,14 @@ +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +// Copyright (c) 2021 Facebook + +#ifndef __BPERF_STAT_U_H +#define __BPERF_STAT_U_H + +enum bperf_filter_type { + BPERF_FILTER_GLOBAL = 1, + BPERF_FILTER_CPU, + BPERF_FILTER_PID, + BPERF_FILTER_TGID, +}; + +#endif /* __BPERF_STAT_U_H */ diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h index e818e558a275..29199e57353b 100644 --- a/tools/perf/util/evsel.h +++ b/tools/perf/util/evsel.h @@ -19,6 +19,8 @@ struct perf_stat_evsel; union perf_event; struct bpf_counter_ops; struct target; +struct bperf_leader_bpf; +struct bperf_follower_bpf; typedef int (evsel__sb_cb_t)(union perf_event *event, void *data); @@ -129,8 +131,24 @@ struct evsel { * See also evsel__has_callchain(). */ __u64 synth_sample_type; - struct list_head bpf_counter_list; + + /* + * bpf_counter_ops serves two use cases: + * 1. perf-stat -b counting events used byBPF programs + * 2. perf-stat --use-bpf use BPF programs to aggregate counts + */ struct bpf_counter_ops *bpf_counter_ops; + + /* for perf-stat -b */ + struct list_head bpf_counter_list; + + /* for perf-stat --use-bpf */ + int bperf_leader_prog_fd; + int bperf_leader_link_fd; + union { + struct bperf_leader_bpf *leader_skel; + struct bperf_follower_bpf *follower_skel; + }; }; struct perf_missing_features { diff --git a/tools/perf/util/target.h b/tools/perf/util/target.h index f132c6c2eef8..1bce3eb28ef2 100644 --- a/tools/perf/util/target.h +++ b/tools/perf/util/target.h @@ -16,6 +16,8 @@ struct target { bool uses_mmap; bool default_per_cpu; bool per_thread; + bool use_bpf; + const char *attr_map; }; enum target_errno { @@ -66,7 +68,7 @@ static inline bool target__has_cpu(struct target *target) static inline bool target__has_bpf(struct target *target) { - return target->bpf_str; + return target->bpf_str || target->use_bpf; } static inline bool target__none(struct target *target) -- 2.34.1

From: Song Liu <songliubraving@fb.com> Take measurements of 't0' and 'ref_time' after enable_counters(), so that they only measure the time consumed when the counters are enabled. Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Andi Kleen <andi@firstfloor.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: kernel-team@fb.com Link: http://lore.kernel.org/lkml/20210316211837.910506-3-songliubraving@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Conflicts: tools/perf/builtin-stat.c Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/builtin-stat.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 6e2b28bb4dfc..0374fa35e2e6 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -918,15 +918,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) /* * Enable counters and exec the command: */ - t0 = rdclock(); - clock_gettime(CLOCK_MONOTONIC, &ref_time); - if (forks) { perf_evlist__start_workload(evsel_list); err = enable_counters(); if (err) return -1; + t0 = rdclock(); + clock_gettime(CLOCK_MONOTONIC, &ref_time); + if (interval || timeout || evlist__ctlfd_initialized(evsel_list)) status = dispatch_events(forks, timeout, interval, ×); if (child_pid != -1) { @@ -947,6 +947,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) err = enable_counters(); if (err) return -1; + + t0 = rdclock(); + clock_gettime(CLOCK_MONOTONIC, &ref_time); + status = dispatch_events(forks, timeout, interval, ×); } -- 2.34.1

From: Song Liu <song@kernel.org> By following the same protocol, other tools can share hardware PMCs with perf. Move perf_event_attr_map_entry and BPF_PERF_DEFAULT_ATTR_MAP_PATH to bpf_perf.h for other tools to use. Signed-off-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: kernel-team@fb.com Link: https://lore.kernel.org/r/20210425214333.1090950-2-song@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/lib/perf/include/perf/bpf_perf.h | 31 ++++++++++++++++++++++++++ tools/perf/util/bpf_counter.c | 27 +++------------------- 2 files changed, 34 insertions(+), 24 deletions(-) create mode 100644 tools/lib/perf/include/perf/bpf_perf.h diff --git a/tools/lib/perf/include/perf/bpf_perf.h b/tools/lib/perf/include/perf/bpf_perf.h new file mode 100644 index 000000000000..e7cf6ba7b674 --- /dev/null +++ b/tools/lib/perf/include/perf/bpf_perf.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +#ifndef __LIBPERF_BPF_PERF_H +#define __LIBPERF_BPF_PERF_H + +#include <linux/types.h> /* for __u32 */ + +/* + * bpf_perf uses a hashmap, the attr_map, to track all the leader programs. + * The hashmap is pinned in bpffs. flock() on this file is used to ensure + * no concurrent access to the attr_map. The key of attr_map is struct + * perf_event_attr, and the value is struct perf_event_attr_map_entry. + * + * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the + * leader prog, and the diff_map. Each perf-stat session holds a reference + * to the bpf_link to make sure the leader prog is attached to sched_switch + * tracepoint. + * + * Since the hashmap only contains IDs of the bpf_link and diff_map, it + * does not hold any references to the leader program. Once all perf-stat + * sessions of these events exit, the leader prog, its maps, and the + * perf_events will be freed. + */ +struct perf_event_attr_map_entry { + __u32 link_id; + __u32 diff_map_id; +}; + +/* default attr_map name */ +#define BPF_PERF_DEFAULT_ATTR_MAP_PATH "perf_attr_map" + +#endif /* __LIBPERF_BPF_PERF_H */ diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index 81d1df3c4ec0..be484ddbbd5b 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -14,6 +14,7 @@ #include <bpf/btf.h> #include <bpf/libbpf.h> #include <api/fs/fs.h> +#include <perf/bpf_perf.h> #include "bpf_counter.h" #include "counts.h" @@ -29,28 +30,6 @@ #include "bpf_skel/bperf_leader.skel.h" #include "bpf_skel/bperf_follower.skel.h" -/* - * bperf uses a hashmap, the attr_map, to track all the leader programs. - * The hashmap is pinned in bpffs. flock() on this file is used to ensure - * no concurrent access to the attr_map. The key of attr_map is struct - * perf_event_attr, and the value is struct perf_event_attr_map_entry. - * - * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the - * leader prog, and the diff_map. Each perf-stat session holds a reference - * to the bpf_link to make sure the leader prog is attached to sched_switch - * tracepoint. - * - * Since the hashmap only contains IDs of the bpf_link and diff_map, it - * does not hold any references to the leader program. Once all perf-stat - * sessions of these events exit, the leader prog, its maps, and the - * perf_events will be freed. - */ -struct perf_event_attr_map_entry { - __u32 link_id; - __u32 diff_map_id; -}; - -#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map" #define ATTR_MAP_SIZE 16 static inline void *u64_to_ptr(__u64 ptr) @@ -341,8 +320,8 @@ static int bperf_lock_attr_map(struct target *target) if (target->attr_map) { scnprintf(path, PATH_MAX, "%s", target->attr_map); } else { - scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(), - DEFAULT_ATTR_MAP_PATH); + scnprintf(path, PATH_MAX, "%s/fs/bpf/%s", sysfs__mountpoint(), + BPF_PERF_DEFAULT_ATTR_MAP_PATH); } if (access(path, F_OK)) { -- 2.34.1

From: Song Liu <song@kernel.org> perf_attr_map could be shared among different version of perf binary. Add bperf_attr_map_compatible() to check whether the existing attr_map is compatible with current perf binary. Signed-off-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: kernel-team@fb.com Link: https://lore.kernel.org/r/20210425214333.1090950-3-song@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_counter.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index be484ddbbd5b..5de991ab46af 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -312,6 +312,20 @@ static __u32 bpf_map_get_id(int fd) return map_info.id; } +static bool bperf_attr_map_compatible(int attr_map_fd) +{ + struct bpf_map_info map_info = {0}; + __u32 map_info_len = sizeof(map_info); + int err; + + err = bpf_obj_get_info_by_fd(attr_map_fd, &map_info, &map_info_len); + + if (err) + return false; + return (map_info.key_size == sizeof(struct perf_event_attr)) && + (map_info.value_size == sizeof(struct perf_event_attr_map_entry)); +} + static int bperf_lock_attr_map(struct target *target) { char path[PATH_MAX]; @@ -346,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target) return -1; } + if (!bperf_attr_map_compatible(map_fd)) { + close(map_fd); + return -1; + + } err = flock(map_fd, LOCK_EX); if (err) { close(map_fd); -- 2.34.1

From: Song Liu <song@kernel.org> Currently, to use BPF to aggregate perf event counters, the user uses --bpf-counters option. Enable "use bpf by default" events with a config option, stat.bpf-counter-events. Events with name in the option will use BPF. This also enables mixed BPF event and regular event in the same sesssion. For example: perf config stat.bpf-counter-events=instructions perf stat -e instructions,cs The second command will use BPF for "instructions" but not "cs". Signed-off-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20210425214333.1090950-4-song@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Conflicts: tools/perf/util/config.c [A config named stat.no-csv-summary didn't merge] Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/Documentation/perf-stat.txt | 2 ++ tools/perf/builtin-stat.c | 42 +++++++++++++++----------- tools/perf/util/bpf_counter.c | 3 +- tools/perf/util/config.c | 4 +++ tools/perf/util/evsel.c | 22 ++++++++++++++ tools/perf/util/evsel.h | 8 +++++ tools/perf/util/target.h | 5 --- 7 files changed, 63 insertions(+), 23 deletions(-) diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index 925d684d539c..c2ce57aab63e 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -97,6 +97,8 @@ report:: Use BPF programs to aggregate readings from perf_events. This allows multiple perf-stat sessions that are counting the same metric (cycles, instructions, etc.) to share hardware counters. + To use BPF programs on common events by default, use + "perf config stat.bpf-counter-events=<list_of_events>". --bpf-attr-map:: With option "--bpf-counters", different perf-stat sessions share diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 0374fa35e2e6..5ba56a1ae41f 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -147,6 +147,7 @@ static const char *smi_cost_attrs = { }; static struct evlist *evsel_list; +static bool all_counters_use_bpf = true; static struct target target = { .uid = UINT_MAX, @@ -386,6 +387,9 @@ static int read_affinity_counters(struct timespec *rs) struct affinity affinity; int i, ncpus, cpu; + if (all_counters_use_bpf) + return 0; + if (affinity__setup(&affinity) < 0) return -1; @@ -400,6 +404,8 @@ static int read_affinity_counters(struct timespec *rs) evlist__for_each_entry(evsel_list, counter) { if (evsel__cpu_iter_skip(counter, cpu)) continue; + if (evsel__is_bpf(counter)) + continue; if (!counter->err) { counter->err = read_counter_cpu(counter, rs, counter->cpu_iter - 1); @@ -416,6 +422,9 @@ static int read_bpf_map_counters(void) int err; evlist__for_each_entry(evsel_list, counter) { + if (!evsel__is_bpf(counter)) + continue; + err = bpf_counter__read(counter); if (err) return err; @@ -426,14 +435,10 @@ static int read_bpf_map_counters(void) static void read_counters(struct timespec *rs) { struct evsel *counter; - int err; if (!stat_config.stop_read_counter) { - if (target__has_bpf(&target)) - err = read_bpf_map_counters(); - else - err = read_affinity_counters(rs); - if (err < 0) + if (read_bpf_map_counters() || + read_affinity_counters(rs)) return; } @@ -522,12 +527,13 @@ static int enable_counters(void) struct evsel *evsel; int err; - if (target__has_bpf(&target)) { - evlist__for_each_entry(evsel_list, evsel) { - err = bpf_counter__enable(evsel); - if (err) - return err; - } + evlist__for_each_entry(evsel_list, evsel) { + if (!evsel__is_bpf(evsel)) + continue; + + err = bpf_counter__enable(evsel); + if (err) + return err; } if (stat_config.initial_delay < 0) { @@ -771,11 +777,11 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) if (affinity__setup(&affinity) < 0) return -1; - if (target__has_bpf(&target)) { - evlist__for_each_entry(evsel_list, counter) { - if (bpf_counter__load(counter, &target)) - return -1; - } + evlist__for_each_entry(evsel_list, counter) { + if (bpf_counter__load(counter, &target)) + return -1; + if (!evsel__is_bpf(counter)) + all_counters_use_bpf = false; } evlist__for_each_cpu (evsel_list, i, cpu) { @@ -792,6 +798,8 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx) continue; if (counter->reset_group || counter->errored) continue; + if (evsel__is_bpf(counter)) + continue; try_again: if (create_perf_stat_counter(counter, &stat_config, &target, counter->cpu_iter - 1) < 0) { diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index 5de991ab46af..33b1888103df 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -790,7 +790,8 @@ int bpf_counter__load(struct evsel *evsel, struct target *target) { if (target->bpf_str) evsel->bpf_counter_ops = &bpf_program_profiler_ops; - else if (target->use_bpf) + else if (target->use_bpf || + evsel__match_bpf_counter_events(evsel->name)) evsel->bpf_counter_ops = &bperf_ops; if (evsel->bpf_counter_ops) diff --git a/tools/perf/util/config.c b/tools/perf/util/config.c index 6969f82843ee..9d3f61579ed0 100644 --- a/tools/perf/util/config.c +++ b/tools/perf/util/config.c @@ -18,6 +18,7 @@ #include "util/hist.h" /* perf_hist_config */ #include "util/llvm-utils.h" /* perf_llvm_config */ #include "util/stat.h" /* perf_stat__set_big_num */ +#include "util/evsel.h" /* evsel__hw_names, evsel__use_bpf_counters */ #include "build-id.h" #include "debug.h" #include "config.h" @@ -457,6 +458,9 @@ static int perf_stat_config(const char *var, const char *value) if (!strcmp(var, "stat.big-num")) perf_stat__set_big_num(perf_config_bool(var, value)); + if (!strcmp(var, "stat.bpf-counter-events")) + evsel__bpf_counter_events = strdup(value); + /* Add other config variables here. */ return 0; } diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index 8829fed10910..2a3e415c73b4 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -491,6 +491,28 @@ const char *evsel__hw_names[PERF_COUNT_HW_MAX] = { "ref-cycles", }; +char *evsel__bpf_counter_events; + +bool evsel__match_bpf_counter_events(const char *name) +{ + int name_len; + bool match; + char *ptr; + + if (!evsel__bpf_counter_events) + return false; + + ptr = strstr(evsel__bpf_counter_events, name); + name_len = strlen(name); + + /* check name matches a full token in evsel__bpf_counter_events */ + match = (ptr != NULL) && + ((ptr == evsel__bpf_counter_events) || (*(ptr - 1) == ',')) && + ((*(ptr + name_len) == ',') || (*(ptr + name_len) == '\0')); + + return match; +} + static const char *__evsel__hw_name(u64 config) { if (config < PERF_COUNT_HW_MAX && evsel__hw_names[config]) diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h index 29199e57353b..62da6db284f3 100644 --- a/tools/perf/util/evsel.h +++ b/tools/perf/util/evsel.h @@ -236,6 +236,11 @@ void evsel__calc_id_pos(struct evsel *evsel); bool evsel__is_cache_op_valid(u8 type, u8 op); +static inline bool evsel__is_bpf(struct evsel *evsel) +{ + return evsel->bpf_counter_ops != NULL; +} + #define EVSEL__MAX_ALIASES 8 extern const char *evsel__hw_cache[PERF_COUNT_HW_CACHE_MAX][EVSEL__MAX_ALIASES]; @@ -243,6 +248,9 @@ extern const char *evsel__hw_cache_op[PERF_COUNT_HW_CACHE_OP_MAX][EVSEL__MAX_ALI extern const char *evsel__hw_cache_result[PERF_COUNT_HW_CACHE_RESULT_MAX][EVSEL__MAX_ALIASES]; extern const char *evsel__hw_names[PERF_COUNT_HW_MAX]; extern const char *evsel__sw_names[PERF_COUNT_SW_MAX]; +extern char *evsel__bpf_counter_events; +bool evsel__match_bpf_counter_events(const char *name); + int __evsel__hw_cache_type_op_res_name(u8 type, u8 op, u8 result, char *bf, size_t size); const char *evsel__name(struct evsel *evsel); diff --git a/tools/perf/util/target.h b/tools/perf/util/target.h index 1bce3eb28ef2..4ff56217f2a6 100644 --- a/tools/perf/util/target.h +++ b/tools/perf/util/target.h @@ -66,11 +66,6 @@ static inline bool target__has_cpu(struct target *target) return target->system_wide || target->cpu_list; } -static inline bool target__has_bpf(struct target *target) -{ - return target->bpf_str || target->use_bpf; -} - static inline bool target__none(struct target *target) { return !target__has_task(target) && !target__has_cpu(target); -- 2.34.1

From: Song Liu <song@kernel.org> Introduce 'b' modifier to event parser, which means use BPF program to manage this event. This is the same as --bpf-counters option, but only applies to this event. For example, perf stat -e cycles:b,cs # use bpf for cycles, but not cs perf stat -e cycles,cs --bpf-counters # use bpf for both cycles and cs Suggested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20210425214333.1090950-5-song@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_counter.c | 2 +- tools/perf/util/evsel.h | 1 + tools/perf/util/parse-events.c | 8 +++++++- tools/perf/util/parse-events.l | 2 +- 4 files changed, 10 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index 33b1888103df..f179f5743025 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -790,7 +790,7 @@ int bpf_counter__load(struct evsel *evsel, struct target *target) { if (target->bpf_str) evsel->bpf_counter_ops = &bpf_program_profiler_ops; - else if (target->use_bpf || + else if (target->use_bpf || evsel->bpf_counter || evsel__match_bpf_counter_events(evsel->name)) evsel->bpf_counter_ops = &bperf_ops; diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h index 62da6db284f3..37c0e9d97c49 100644 --- a/tools/perf/util/evsel.h +++ b/tools/perf/util/evsel.h @@ -81,6 +81,7 @@ struct evsel { bool auto_merge_stats; bool collect_stat; bool weak_group; + bool bpf_counter; int bpf_fd; struct bpf_object *bpf_obj; }; diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 1f389d51bc73..71c566dc2d2d 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -1785,6 +1785,7 @@ struct event_modifier { int pinned; int weak; int exclusive; + int bpf_counter; }; static int get_event_modifier(struct event_modifier *mod, char *str, @@ -1805,6 +1806,7 @@ static int get_event_modifier(struct event_modifier *mod, char *str, int exclude = eu | ek | eh; int exclude_GH = evsel ? evsel->exclude_GH : 0; int weak = 0; + int bpf_counter = 0; memset(mod, 0, sizeof(*mod)); @@ -1848,6 +1850,8 @@ static int get_event_modifier(struct event_modifier *mod, char *str, exclusive = 1; } else if (*str == 'W') { weak = 1; + } else if (*str == 'b') { + bpf_counter = 1; } else break; @@ -1879,6 +1883,7 @@ static int get_event_modifier(struct event_modifier *mod, char *str, mod->sample_read = sample_read; mod->pinned = pinned; mod->weak = weak; + mod->bpf_counter = bpf_counter; mod->exclusive = exclusive; return 0; @@ -1893,7 +1898,7 @@ static int check_modifier(char *str) char *p = str; /* The sizeof includes 0 byte as well. */ - if (strlen(str) > (sizeof("ukhGHpppPSDIWe") - 1)) + if (strlen(str) > (sizeof("ukhGHpppPSDIWeb") - 1)) return -1; while (*p) { @@ -1934,6 +1939,7 @@ int parse_events__modifier_event(struct list_head *list, char *str, bool add) evsel->sample_read = mod.sample_read; evsel->precise_max = mod.precise_max; evsel->weak_group = mod.weak; + evsel->bpf_counter = mod.bpf_counter; if (evsel__is_group_leader(evsel)) { evsel->core.attr.pinned = mod.pinned; diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l index 88f203bb6fab..89fe19bc36d3 100644 --- a/tools/perf/util/parse-events.l +++ b/tools/perf/util/parse-events.l @@ -210,7 +210,7 @@ name_tag [\'][a-zA-Z_*?\[\]][a-zA-Z0-9_*?\-,\.\[\]:=]*[\'] name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?.:]* drv_cfg_term [a-zA-Z0-9_\.]+(=[a-zA-Z0-9_*?\.:]+)? /* If you add a modifier you need to update check_modifier() */ -modifier_event [ukhpPGHSDIWe]+ +modifier_event [ukhpPGHSDIWeb]+ modifier_bp [rwx]{1,3} %% -- 2.34.1

From: Song Liu <song@kernel.org> Introduce bpf_counter_ops->disable(), which is used stop counting the event. Committer notes: Added a dummy bpf_counter__disable() to the python binding to avoid having 'perf test python' failing. bpf_counter isn't supported in the python binding. Signed-off-by: Song Liu <song@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: kernel-team@fb.com Link: https://lore.kernel.org/r/20210425214333.1090950-6-song@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_counter.c | 26 ++++++++++++++++++++++++++ tools/perf/util/bpf_counter.h | 7 +++++++ tools/perf/util/evlist.c | 4 ++++ tools/perf/util/python.c | 6 ++++++ 4 files changed, 43 insertions(+) diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index f179f5743025..ddb52f748c8e 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -215,6 +215,17 @@ static int bpf_program_profiler__enable(struct evsel *evsel) return 0; } +static int bpf_program_profiler__disable(struct evsel *evsel) +{ + struct bpf_counter *counter; + + list_for_each_entry(counter, &evsel->bpf_counter_list, list) { + assert(counter->skel != NULL); + bpf_prog_profiler_bpf__detach(counter->skel); + } + return 0; +} + static int bpf_program_profiler__read(struct evsel *evsel) { // perf_cpu_map uses /sys/devices/system/cpu/online @@ -280,6 +291,7 @@ static int bpf_program_profiler__install_pe(struct evsel *evsel, int cpu, struct bpf_counter_ops bpf_program_profiler_ops = { .load = bpf_program_profiler__load, .enable = bpf_program_profiler__enable, + .disable = bpf_program_profiler__disable, .read = bpf_program_profiler__read, .destroy = bpf_program_profiler__destroy, .install_pe = bpf_program_profiler__install_pe, @@ -627,6 +639,12 @@ static int bperf__enable(struct evsel *evsel) return 0; } +static int bperf__disable(struct evsel *evsel) +{ + evsel->follower_skel->bss->enabled = 0; + return 0; +} + static int bperf__read(struct evsel *evsel) { struct bperf_follower_bpf *skel = evsel->follower_skel; @@ -768,6 +786,7 @@ static int bperf__destroy(struct evsel *evsel) struct bpf_counter_ops bperf_ops = { .load = bperf__load, .enable = bperf__enable, + .disable = bperf__disable, .read = bperf__read, .install_pe = bperf__install_pe, .destroy = bperf__destroy, @@ -806,6 +825,13 @@ int bpf_counter__enable(struct evsel *evsel) return evsel->bpf_counter_ops->enable(evsel); } +int bpf_counter__disable(struct evsel *evsel) +{ + if (bpf_counter_skip(evsel)) + return 0; + return evsel->bpf_counter_ops->disable(evsel); +} + int bpf_counter__read(struct evsel *evsel) { if (bpf_counter_skip(evsel)) diff --git a/tools/perf/util/bpf_counter.h b/tools/perf/util/bpf_counter.h index 2eca210e5dc1..dac052b16ecf 100644 --- a/tools/perf/util/bpf_counter.h +++ b/tools/perf/util/bpf_counter.h @@ -18,6 +18,7 @@ typedef int (*bpf_counter_evsel_install_pe_op)(struct evsel *evsel, struct bpf_counter_ops { bpf_counter_evsel_target_op load; bpf_counter_evsel_op enable; + bpf_counter_evsel_op disable; bpf_counter_evsel_op read; bpf_counter_evsel_op destroy; bpf_counter_evsel_install_pe_op install_pe; @@ -32,6 +33,7 @@ struct bpf_counter { int bpf_counter__load(struct evsel *evsel, struct target *target); int bpf_counter__enable(struct evsel *evsel); +int bpf_counter__disable(struct evsel *evsel); int bpf_counter__read(struct evsel *evsel); void bpf_counter__destroy(struct evsel *evsel); int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd); @@ -51,6 +53,11 @@ static inline int bpf_counter__enable(struct evsel *evsel __maybe_unused) return 0; } +static inline int bpf_counter__disable(struct evsel *evsel __maybe_unused) +{ + return 0; +} + static inline int bpf_counter__read(struct evsel *evsel __maybe_unused) { return -EAGAIN; diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index f62244c8238f..3f72eaaf4de7 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -17,6 +17,7 @@ #include "evsel.h" #include "debug.h" #include "units.h" +#include "bpf_counter.h" #include <internal/lib.h> // page_size #include "affinity.h" #include "../perf.h" @@ -397,6 +398,9 @@ void evlist__disable(struct evlist *evlist) if (affinity__setup(&affinity) < 0) return; + evlist__for_each_entry(evlist, pos) + bpf_counter__disable(pos); + /* Disable 'immediate' events last */ for (imm = 0; imm <= 1; imm++) { evlist__for_each_cpu(evlist, i, cpu) { diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c index 5dcc30ae6cab..1e0f5d1fe36a 100644 --- a/tools/perf/util/python.c +++ b/tools/perf/util/python.c @@ -90,6 +90,7 @@ int metricgroup__copy_metric_events(struct evlist *evlist, struct cgroup *cgrp, */ void bpf_counter__destroy(struct evsel *evsel); int bpf_counter__install_pe(struct evsel *evsel, int cpu, int fd); +int bpf_counter__disable(struct evsel *evsel); void bpf_counter__destroy(struct evsel *evsel __maybe_unused) { @@ -100,6 +101,11 @@ int bpf_counter__install_pe(struct evsel *evsel __maybe_unused, int cpu __maybe_ return 0; } +int bpf_counter__disable(struct evsel *evsel __maybe_unused) +{ + return 0; +} + /* * Support debug printing even though util/debug.c is not linked. That means * implementing 'verbose' and 'eprintf'. -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> The read_cgroup_id() is to read a cgroup id from a file handle using name_to_handle_at(2) for the given cgroup. It'll be used by bperf cgroup stat later. Committer notes: -int read_cgroup_id(struct cgroup *cgrp) +static inline int read_cgroup_id(struct cgroup *cgrp __maybe_unused) To fix the build when HAVE_FILE_HANDLE is not defined. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210625071826.608504-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/cgroup.c | 25 +++++++++++++++++++++++++ tools/perf/util/cgroup.h | 10 ++++++++++ 2 files changed, 35 insertions(+) diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c index b81324a13a2b..677144aa2af2 100644 --- a/tools/perf/util/cgroup.c +++ b/tools/perf/util/cgroup.c @@ -35,6 +35,31 @@ static int open_cgroup(const char *name) return fd; } +#ifdef HAVE_FILE_HANDLE +int read_cgroup_id(struct cgroup *cgrp) +{ + char path[PATH_MAX + 1]; + char mnt[PATH_MAX + 1]; + struct { + struct file_handle fh; + uint64_t cgroup_id; + } handle; + int mount_id; + + if (cgroupfs_find_mountpoint(mnt, PATH_MAX + 1, "perf_event")) + return -1; + + scnprintf(path, PATH_MAX, "%s/%s", mnt, cgrp->name); + + handle.fh.handle_bytes = sizeof(handle.cgroup_id); + if (name_to_handle_at(AT_FDCWD, path, &handle.fh, &mount_id, 0) < 0) + return -1; + + cgrp->id = handle.cgroup_id; + return 0; +} +#endif /* HAVE_FILE_HANDLE */ + static struct cgroup *evlist__find_cgroup(struct evlist *evlist, const char *str) { struct evsel *counter; diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h index 162906f3412a..f3c978746999 100644 --- a/tools/perf/util/cgroup.h +++ b/tools/perf/util/cgroup.h @@ -2,6 +2,7 @@ #ifndef __CGROUP_H__ #define __CGROUP_H__ +#include <linux/compiler.h> #include <linux/refcount.h> #include <linux/rbtree.h> #include "util/env.h" @@ -38,4 +39,13 @@ struct cgroup *cgroup__find(struct perf_env *env, uint64_t id); void perf_env__purge_cgroups(struct perf_env *env); +#ifdef HAVE_FILE_HANDLE +int read_cgroup_id(struct cgroup *cgrp); +#else +static inline int read_cgroup_id(struct cgroup *cgrp __maybe_unused) +{ + return -1; +} +#endif /* HAVE_FILE_HANDLE */ + #endif /* __CGROUP_H__ */ -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> The cgroup_is_v2() is to check if the given subsystem is mounted on cgroup v2 or not. It'll be used by BPF cgroup code later. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Ian Rogers <irogers@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210625071826.608504-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/cgroup.c | 19 +++++++++++++++++++ tools/perf/util/cgroup.h | 2 ++ 2 files changed, 21 insertions(+) diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c index 677144aa2af2..35148f8a7291 100644 --- a/tools/perf/util/cgroup.c +++ b/tools/perf/util/cgroup.c @@ -9,6 +9,7 @@ #include <linux/zalloc.h> #include <sys/types.h> #include <sys/stat.h> +#include <sys/statfs.h> #include <fcntl.h> #include <stdlib.h> #include <string.h> @@ -60,6 +61,24 @@ int read_cgroup_id(struct cgroup *cgrp) } #endif /* HAVE_FILE_HANDLE */ +#ifndef CGROUP2_SUPER_MAGIC +#define CGROUP2_SUPER_MAGIC 0x63677270 +#endif + +int cgroup_is_v2(const char *subsys) +{ + char mnt[PATH_MAX + 1]; + struct statfs stbuf; + + if (cgroupfs_find_mountpoint(mnt, PATH_MAX + 1, subsys)) + return -1; + + if (statfs(mnt, &stbuf) < 0) + return -1; + + return (stbuf.f_type == CGROUP2_SUPER_MAGIC); +} + static struct cgroup *evlist__find_cgroup(struct evlist *evlist, const char *str) { struct evsel *counter; diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h index f3c978746999..de5b272560ab 100644 --- a/tools/perf/util/cgroup.h +++ b/tools/perf/util/cgroup.h @@ -48,4 +48,6 @@ static inline int read_cgroup_id(struct cgroup *cgrp __maybe_unused) } #endif /* HAVE_FILE_HANDLE */ +int cgroup_is_v2(const char *subsys); + #endif /* __CGROUP_H__ */ -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> Some helper functions will be used for cgroup counting too. Move them to a header file for sharing. Committer notes: Fix the build on older systems with: - struct bpf_map_info map_info = {0}; + struct bpf_map_info map_info = { .id = 0, }; This wasn't breaking the build in such systems as bpf_counter.c isn't built due to: tools/perf/util/Build: perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o The bpf_counter.h file on the other hand is included from places that are built everywhere. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210625071826.608504-4-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_counter.c | 52 ----------------------------------- tools/perf/util/bpf_counter.h | 52 +++++++++++++++++++++++++++++++++++ 2 files changed, 52 insertions(+), 52 deletions(-) diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index ddb52f748c8e..da75793ed986 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -7,12 +7,8 @@ #include <unistd.h> #include <sys/file.h> #include <sys/time.h> -#include <sys/resource.h> #include <linux/err.h> #include <linux/zalloc.h> -#include <bpf/bpf.h> -#include <bpf/btf.h> -#include <bpf/libbpf.h> #include <api/fs/fs.h> #include <perf/bpf_perf.h> @@ -37,13 +33,6 @@ static inline void *u64_to_ptr(__u64 ptr) return (void *)(unsigned long)ptr; } -static void set_max_rlimit(void) -{ - struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; - - setrlimit(RLIMIT_MEMLOCK, &rinf); -} - static struct bpf_counter *bpf_counter_alloc(void) { struct bpf_counter *counter; @@ -297,33 +286,6 @@ struct bpf_counter_ops bpf_program_profiler_ops = { .install_pe = bpf_program_profiler__install_pe, }; -static __u32 bpf_link_get_id(int fd) -{ - struct bpf_link_info link_info = {0}; - __u32 link_info_len = sizeof(link_info); - - bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); - return link_info.id; -} - -static __u32 bpf_link_get_prog_id(int fd) -{ - struct bpf_link_info link_info = {0}; - __u32 link_info_len = sizeof(link_info); - - bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); - return link_info.prog_id; -} - -static __u32 bpf_map_get_id(int fd) -{ - struct bpf_map_info map_info = {0}; - __u32 map_info_len = sizeof(map_info); - - bpf_obj_get_info_by_fd(fd, &map_info, &map_info_len); - return map_info.id; -} - static bool bperf_attr_map_compatible(int attr_map_fd) { struct bpf_map_info map_info = {0}; @@ -385,20 +347,6 @@ static int bperf_lock_attr_map(struct target *target) return map_fd; } -/* trigger the leader program on a cpu */ -static int bperf_trigger_reading(int prog_fd, int cpu) -{ - DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts, - .ctx_in = NULL, - .ctx_size_in = 0, - .flags = BPF_F_TEST_RUN_ON_CPU, - .cpu = cpu, - .retval = 0, - ); - - return bpf_prog_test_run_opts(prog_fd, &opts); -} - static int bperf_check_target(struct evsel *evsel, struct target *target, enum bperf_filter_type *filter_type, diff --git a/tools/perf/util/bpf_counter.h b/tools/perf/util/bpf_counter.h index dac052b16ecf..434ded6749d6 100644 --- a/tools/perf/util/bpf_counter.h +++ b/tools/perf/util/bpf_counter.h @@ -3,6 +3,10 @@ #define __PERF_BPF_COUNTER_H 1 #include <linux/list.h> +#include <sys/resource.h> +#include <bpf/bpf.h> +#include <bpf/btf.h> +#include <bpf/libbpf.h> struct evsel; struct target; @@ -76,4 +80,52 @@ static inline int bpf_counter__install_pe(struct evsel *evsel __maybe_unused, #endif /* HAVE_BPF_SKEL */ +static inline void set_max_rlimit(void) +{ + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; + + setrlimit(RLIMIT_MEMLOCK, &rinf); +} + +static inline __u32 bpf_link_get_id(int fd) +{ + struct bpf_link_info link_info = { .id = 0, }; + __u32 link_info_len = sizeof(link_info); + + bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + return link_info.id; +} + +static inline __u32 bpf_link_get_prog_id(int fd) +{ + struct bpf_link_info link_info = { .id = 0, }; + __u32 link_info_len = sizeof(link_info); + + bpf_obj_get_info_by_fd(fd, &link_info, &link_info_len); + return link_info.prog_id; +} + +static inline __u32 bpf_map_get_id(int fd) +{ + struct bpf_map_info map_info = { .id = 0, }; + __u32 map_info_len = sizeof(map_info); + + bpf_obj_get_info_by_fd(fd, &map_info, &map_info_len); + return map_info.id; +} + +/* trigger the leader program on a cpu */ +static inline int bperf_trigger_reading(int prog_fd, int cpu) +{ + DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts, + .ctx_in = NULL, + .ctx_size_in = 0, + .flags = BPF_F_TEST_RUN_ON_CPU, + .cpu = cpu, + .retval = 0, + ); + + return bpf_prog_test_run_opts(prog_fd, &opts); +} + #endif /* __PERF_BPF_COUNTER_H */ -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> Recently bperf was added to use BPF to count perf events for various purposes. This is an extension for the approach and targetting to cgroup usages. Unlike the other bperf, it doesn't share the events with other processes but it'd reduce unnecessary events (and the overhead of multiplexing) for each monitored cgroup within the perf session. When --for-each-cgroup is used with --bpf-counters, it will open cgroup-switches event per cpu internally and attach the new BPF program to read given perf_events and to aggregate the results for cgroups. It's only called when task is switched to a task in a different cgroup. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20210701211227.1403788-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Conflicts: tools/perf/util/cgroup.c Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/Makefile.perf | 17 +- tools/perf/util/Build | 1 + tools/perf/util/bpf_counter.c | 5 + tools/perf/util/bpf_counter_cgroup.c | 307 ++++++++++++++++++++ tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 191 ++++++++++++ tools/perf/util/cgroup.c | 3 + tools/perf/util/cgroup.h | 1 + 7 files changed, 524 insertions(+), 1 deletion(-) create mode 100644 tools/perf/util/bpf_counter_cgroup.c create mode 100644 tools/perf/util/bpf_skel/bperf_cgroup.bpf.c diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf index 7ddc4b65260a..8eacd8028b68 100644 --- a/tools/perf/Makefile.perf +++ b/tools/perf/Makefile.perf @@ -1017,6 +1017,7 @@ SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel) SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp) SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h +SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h ifdef BUILD_BPF_SKEL BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool @@ -1030,7 +1031,21 @@ $(BPFTOOL): | $(SKEL_TMP_OUT) CFLAGS= $(MAKE) -C ../bpf/bpftool \ OUTPUT=$(SKEL_TMP_OUT)/ bootstrap -$(SKEL_TMP_OUT)/%.bpf.o: util/bpf_skel/%.bpf.c $(LIBBPF) | $(SKEL_TMP_OUT) +VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \ + $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \ + ../../vmlinux \ + /sys/kernel/btf/vmlinux \ + /boot/vmlinux-$(shell uname -r) +VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS)))) + +$(SKEL_OUT)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) +ifeq ($(VMLINUX_H),) + $(QUIET_GEN)$(BPFTOOL) btf dump file $< format c > $@ +else + $(Q)cp "$(VMLINUX_H)" $@ +endif + +$(SKEL_TMP_OUT)/%.bpf.o: util/bpf_skel/%.bpf.c $(LIBBPF) $(SKEL_OUT)/vmlinux.h | $(SKEL_TMP_OUT) $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf $(BPF_INCLUDE) \ -c $(filter util/bpf_skel/%.bpf.c,$^) -o $@ && $(LLVM_STRIP) -g $@ diff --git a/tools/perf/util/Build b/tools/perf/util/Build index aa3f308cb325..f5211c87f7fa 100644 --- a/tools/perf/util/Build +++ b/tools/perf/util/Build @@ -138,6 +138,7 @@ perf-y += clockid.o perf-$(CONFIG_LIBBPF) += bpf-loader.o perf-$(CONFIG_LIBBPF) += bpf_map.o perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter.o +perf-$(CONFIG_PERF_BPF_SKEL) += bpf_counter_cgroup.o perf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o perf-$(CONFIG_LIBELF) += symbol-elf.o perf-$(CONFIG_LIBELF) += probe-file.o diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index da75793ed986..ede77e92ab08 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -18,6 +18,7 @@ #include "evsel.h" #include "evlist.h" #include "target.h" +#include "cgroup.h" #include "cpumap.h" #include "thread_map.h" @@ -740,6 +741,8 @@ struct bpf_counter_ops bperf_ops = { .destroy = bperf__destroy, }; +extern struct bpf_counter_ops bperf_cgrp_ops; + static inline bool bpf_counter_skip(struct evsel *evsel) { return list_empty(&evsel->bpf_counter_list) && @@ -757,6 +760,8 @@ int bpf_counter__load(struct evsel *evsel, struct target *target) { if (target->bpf_str) evsel->bpf_counter_ops = &bpf_program_profiler_ops; + else if (cgrp_event_expanded && target->use_bpf) + evsel->bpf_counter_ops = &bperf_cgrp_ops; else if (target->use_bpf || evsel->bpf_counter || evsel__match_bpf_counter_events(evsel->name)) evsel->bpf_counter_ops = &bperf_ops; diff --git a/tools/perf/util/bpf_counter_cgroup.c b/tools/perf/util/bpf_counter_cgroup.c new file mode 100644 index 000000000000..21261f1a0501 --- /dev/null +++ b/tools/perf/util/bpf_counter_cgroup.c @@ -0,0 +1,307 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* Copyright (c) 2021 Facebook */ +/* Copyright (c) 2021 Google */ + +#include <assert.h> +#include <limits.h> +#include <unistd.h> +#include <sys/file.h> +#include <sys/time.h> +#include <sys/resource.h> +#include <linux/err.h> +#include <linux/zalloc.h> +#include <linux/perf_event.h> +#include <api/fs/fs.h> +#include <perf/bpf_perf.h> + +#include "affinity.h" +#include "bpf_counter.h" +#include "cgroup.h" +#include "counts.h" +#include "debug.h" +#include "evsel.h" +#include "evlist.h" +#include "target.h" +#include "cpumap.h" +#include "thread_map.h" + +#include "bpf_skel/bperf_cgroup.skel.h" + +static struct perf_event_attr cgrp_switch_attr = { + .type = PERF_TYPE_SOFTWARE, + .config = PERF_COUNT_SW_CGROUP_SWITCHES, + .size = sizeof(cgrp_switch_attr), + .sample_period = 1, + .disabled = 1, +}; + +static struct evsel *cgrp_switch; +static struct bperf_cgroup_bpf *skel; + +#define FD(evt, cpu) (*(int *)xyarray__entry(evt->core.fd, cpu, 0)) + +static int bperf_load_program(struct evlist *evlist) +{ + struct bpf_link *link; + struct evsel *evsel; + struct cgroup *cgrp, *leader_cgrp; + __u32 i, cpu; + __u32 nr_cpus = evlist->core.all_cpus->nr; + int total_cpus = cpu__max_cpu(); + int map_size, map_fd; + int prog_fd, err; + + skel = bperf_cgroup_bpf__open(); + if (!skel) { + pr_err("Failed to open cgroup skeleton\n"); + return -1; + } + + skel->rodata->num_cpus = total_cpus; + skel->rodata->num_events = evlist->core.nr_entries / nr_cgroups; + + BUG_ON(evlist->core.nr_entries % nr_cgroups != 0); + + /* we need one copy of events per cpu for reading */ + map_size = total_cpus * evlist->core.nr_entries / nr_cgroups; + bpf_map__resize(skel->maps.events, map_size); + bpf_map__resize(skel->maps.cgrp_idx, nr_cgroups); + /* previous result is saved in a per-cpu array */ + map_size = evlist->core.nr_entries / nr_cgroups; + bpf_map__resize(skel->maps.prev_readings, map_size); + /* cgroup result needs all events (per-cpu) */ + map_size = evlist->core.nr_entries; + bpf_map__resize(skel->maps.cgrp_readings, map_size); + + set_max_rlimit(); + + err = bperf_cgroup_bpf__load(skel); + if (err) { + pr_err("Failed to load cgroup skeleton\n"); + goto out; + } + + if (cgroup_is_v2("perf_event") > 0) + skel->bss->use_cgroup_v2 = 1; + + err = -1; + + cgrp_switch = evsel__new(&cgrp_switch_attr); + if (evsel__open_per_cpu(cgrp_switch, evlist->core.all_cpus, -1) < 0) { + pr_err("Failed to open cgroup switches event\n"); + goto out; + } + + for (i = 0; i < nr_cpus; i++) { + link = bpf_program__attach_perf_event(skel->progs.on_cgrp_switch, + FD(cgrp_switch, i)); + if (IS_ERR(link)) { + pr_err("Failed to attach cgroup program\n"); + err = PTR_ERR(link); + goto out; + } + } + + /* + * Update cgrp_idx map from cgroup-id to event index. + */ + cgrp = NULL; + i = 0; + + evlist__for_each_entry(evlist, evsel) { + if (cgrp == NULL || evsel->cgrp == leader_cgrp) { + leader_cgrp = evsel->cgrp; + evsel->cgrp = NULL; + + /* open single copy of the events w/o cgroup */ + err = evsel__open_per_cpu(evsel, evlist->core.all_cpus, -1); + if (err) { + pr_err("Failed to open first cgroup events\n"); + goto out; + } + + map_fd = bpf_map__fd(skel->maps.events); + for (cpu = 0; cpu < nr_cpus; cpu++) { + int fd = FD(evsel, cpu); + __u32 idx = evsel->idx * total_cpus + + evlist->core.all_cpus->map[cpu]; + + err = bpf_map_update_elem(map_fd, &idx, &fd, + BPF_ANY); + if (err < 0) { + pr_err("Failed to update perf_event fd\n"); + goto out; + } + } + + evsel->cgrp = leader_cgrp; + } + evsel->supported = true; + + if (evsel->cgrp == cgrp) + continue; + + cgrp = evsel->cgrp; + + if (read_cgroup_id(cgrp) < 0) { + pr_err("Failed to get cgroup id\n"); + err = -1; + goto out; + } + + map_fd = bpf_map__fd(skel->maps.cgrp_idx); + err = bpf_map_update_elem(map_fd, &cgrp->id, &i, BPF_ANY); + if (err < 0) { + pr_err("Failed to update cgroup index map\n"); + goto out; + } + + i++; + } + + /* + * bperf uses BPF_PROG_TEST_RUN to get accurate reading. Check + * whether the kernel support it + */ + prog_fd = bpf_program__fd(skel->progs.trigger_read); + err = bperf_trigger_reading(prog_fd, 0); + if (err) { + pr_warning("The kernel does not support test_run for raw_tp BPF programs.\n" + "Therefore, --for-each-cgroup might show inaccurate readings\n"); + err = 0; + } + +out: + return err; +} + +static int bperf_cgrp__load(struct evsel *evsel, + struct target *target __maybe_unused) +{ + static bool bperf_loaded = false; + + evsel->bperf_leader_prog_fd = -1; + evsel->bperf_leader_link_fd = -1; + + if (!bperf_loaded && bperf_load_program(evsel->evlist)) + return -1; + + bperf_loaded = true; + /* just to bypass bpf_counter_skip() */ + evsel->follower_skel = (struct bperf_follower_bpf *)skel; + + return 0; +} + +static int bperf_cgrp__install_pe(struct evsel *evsel __maybe_unused, + int cpu __maybe_unused, int fd __maybe_unused) +{ + /* nothing to do */ + return 0; +} + +/* + * trigger the leader prog on each cpu, so the cgrp_reading map could get + * the latest results. + */ +static int bperf_cgrp__sync_counters(struct evlist *evlist) +{ + int i, cpu; + int nr_cpus = evlist->core.all_cpus->nr; + int prog_fd = bpf_program__fd(skel->progs.trigger_read); + + for (i = 0; i < nr_cpus; i++) { + cpu = evlist->core.all_cpus->map[i]; + bperf_trigger_reading(prog_fd, cpu); + } + + return 0; +} + +static int bperf_cgrp__enable(struct evsel *evsel) +{ + if (evsel->idx) + return 0; + + bperf_cgrp__sync_counters(evsel->evlist); + + skel->bss->enabled = 1; + return 0; +} + +static int bperf_cgrp__disable(struct evsel *evsel) +{ + if (evsel->idx) + return 0; + + bperf_cgrp__sync_counters(evsel->evlist); + + skel->bss->enabled = 0; + return 0; +} + +static int bperf_cgrp__read(struct evsel *evsel) +{ + struct evlist *evlist = evsel->evlist; + int i, cpu, nr_cpus = evlist->core.all_cpus->nr; + int total_cpus = cpu__max_cpu(); + struct perf_counts_values *counts; + struct bpf_perf_event_value *values; + int reading_map_fd, err = 0; + __u32 idx; + + if (evsel->idx) + return 0; + + bperf_cgrp__sync_counters(evsel->evlist); + + values = calloc(total_cpus, sizeof(*values)); + if (values == NULL) + return -ENOMEM; + + reading_map_fd = bpf_map__fd(skel->maps.cgrp_readings); + + evlist__for_each_entry(evlist, evsel) { + idx = evsel->idx; + err = bpf_map_lookup_elem(reading_map_fd, &idx, values); + if (err) { + pr_err("bpf map lookup falied: idx=%u, event=%s, cgrp=%s\n", + idx, evsel__name(evsel), evsel->cgrp->name); + goto out; + } + + for (i = 0; i < nr_cpus; i++) { + cpu = evlist->core.all_cpus->map[i]; + + counts = perf_counts(evsel->counts, i, 0); + counts->val = values[cpu].counter; + counts->ena = values[cpu].enabled; + counts->run = values[cpu].running; + } + } + +out: + free(values); + return err; +} + +static int bperf_cgrp__destroy(struct evsel *evsel) +{ + if (evsel->idx) + return 0; + + bperf_cgroup_bpf__destroy(skel); + evsel__delete(cgrp_switch); // it'll destroy on_switch progs too + + return 0; +} + +struct bpf_counter_ops bperf_cgrp_ops = { + .load = bperf_cgrp__load, + .enable = bperf_cgrp__enable, + .disable = bperf_cgrp__disable, + .read = bperf_cgrp__read, + .install_pe = bperf_cgrp__install_pe, + .destroy = bperf_cgrp__destroy, +}; diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c new file mode 100644 index 000000000000..292c430768b5 --- /dev/null +++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c @@ -0,0 +1,191 @@ +// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +// Copyright (c) 2021 Facebook +// Copyright (c) 2021 Google +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h> + +#define MAX_LEVELS 10 // max cgroup hierarchy level: arbitrary +#define MAX_EVENTS 32 // max events per cgroup: arbitrary + +// NOTE: many of map and global data will be modified before loading +// from the userspace (perf tool) using the skeleton helpers. + +// single set of global perf events to measure +struct { + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(int)); + __uint(max_entries, 1); +} events SEC(".maps"); + +// from cgroup id to event index +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(key_size, sizeof(__u64)); + __uint(value_size, sizeof(__u32)); + __uint(max_entries, 1); +} cgrp_idx SEC(".maps"); + +// per-cpu event snapshots to calculate delta +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); +} prev_readings SEC(".maps"); + +// aggregated event values for each cgroup (per-cpu) +// will be read from the user-space +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); +} cgrp_readings SEC(".maps"); + +const volatile __u32 num_events = 1; +const volatile __u32 num_cpus = 1; + +int enabled = 0; +int use_cgroup_v2 = 0; + +static inline int get_cgroup_v1_idx(__u32 *cgrps, int size) +{ + struct task_struct *p = (void *)bpf_get_current_task(); + struct cgroup *cgrp; + register int i = 0; + __u32 *elem; + int level; + int cnt; + + cgrp = BPF_CORE_READ(p, cgroups, subsys[perf_event_cgrp_id], cgroup); + level = BPF_CORE_READ(cgrp, level); + + for (cnt = 0; i < MAX_LEVELS; i++) { + __u64 cgrp_id; + + if (i > level) + break; + + // convert cgroup-id to a map index + cgrp_id = BPF_CORE_READ(cgrp, ancestor_ids[i]); + elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id); + if (!elem) + continue; + + cgrps[cnt++] = *elem; + if (cnt == size) + break; + } + + return cnt; +} + +static inline int get_cgroup_v2_idx(__u32 *cgrps, int size) +{ + register int i = 0; + __u32 *elem; + int cnt; + + for (cnt = 0; i < MAX_LEVELS; i++) { + __u64 cgrp_id = bpf_get_current_ancestor_cgroup_id(i); + + if (cgrp_id == 0) + break; + + // convert cgroup-id to a map index + elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id); + if (!elem) + continue; + + cgrps[cnt++] = *elem; + if (cnt == size) + break; + } + + return cnt; +} + +static int bperf_cgroup_count(void) +{ + register __u32 idx = 0; // to have it in a register to pass BPF verifier + register int c = 0; + struct bpf_perf_event_value val, delta, *prev_val, *cgrp_val; + __u32 cpu = bpf_get_smp_processor_id(); + __u32 cgrp_idx[MAX_LEVELS]; + int cgrp_cnt; + __u32 key, cgrp; + long err; + + if (use_cgroup_v2) + cgrp_cnt = get_cgroup_v2_idx(cgrp_idx, MAX_LEVELS); + else + cgrp_cnt = get_cgroup_v1_idx(cgrp_idx, MAX_LEVELS); + + for ( ; idx < MAX_EVENTS; idx++) { + if (idx == num_events) + break; + + // XXX: do not pass idx directly (for verifier) + key = idx; + // this is per-cpu array for diff + prev_val = bpf_map_lookup_elem(&prev_readings, &key); + if (!prev_val) { + val.counter = val.enabled = val.running = 0; + bpf_map_update_elem(&prev_readings, &key, &val, BPF_ANY); + + prev_val = bpf_map_lookup_elem(&prev_readings, &key); + if (!prev_val) + continue; + } + + // read from global perf_event array + key = idx * num_cpus + cpu; + err = bpf_perf_event_read_value(&events, key, &val, sizeof(val)); + if (err) + continue; + + if (enabled) { + delta.counter = val.counter - prev_val->counter; + delta.enabled = val.enabled - prev_val->enabled; + delta.running = val.running - prev_val->running; + + for (c = 0; c < MAX_LEVELS; c++) { + if (c == cgrp_cnt) + break; + + cgrp = cgrp_idx[c]; + + // aggregate the result by cgroup + key = cgrp * num_events + idx; + cgrp_val = bpf_map_lookup_elem(&cgrp_readings, &key); + if (cgrp_val) { + cgrp_val->counter += delta.counter; + cgrp_val->enabled += delta.enabled; + cgrp_val->running += delta.running; + } else { + bpf_map_update_elem(&cgrp_readings, &key, + &delta, BPF_ANY); + } + } + } + + *prev_val = val; + } + return 0; +} + +// This will be attached to cgroup-switches event for each cpu +SEC("perf_events") +int BPF_PROG(on_cgrp_switch) +{ + return bperf_cgroup_count(); +} + +SEC("raw_tp/sched_switch") +int BPF_PROG(trigger_read) +{ + return bperf_cgroup_count(); +} + +char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/tools/perf/util/cgroup.c b/tools/perf/util/cgroup.c index 35148f8a7291..22bd8c97338a 100644 --- a/tools/perf/util/cgroup.c +++ b/tools/perf/util/cgroup.c @@ -16,6 +16,7 @@ #include <api/fs/fs.h> int nr_cgroups; +bool cgrp_event_expanded; static int open_cgroup(const char *name) { @@ -334,6 +335,8 @@ int evlist__expand_cgroup(struct evlist *evlist, const char *str, str = p+1; } + cgrp_event_expanded = true; + out_err: evlist__delete(orig_list); evlist__delete(tmp_list); diff --git a/tools/perf/util/cgroup.h b/tools/perf/util/cgroup.h index de5b272560ab..12256b78608c 100644 --- a/tools/perf/util/cgroup.h +++ b/tools/perf/util/cgroup.h @@ -18,6 +18,7 @@ struct cgroup { }; extern int nr_cgroups; /* number of explicit cgroups defined */ +extern bool cgrp_event_expanded; struct cgroup *cgroup__get(struct cgroup *cgroup); void cgroup__put(struct cgroup *cgroup); -- 2.34.1

From: Song Liu <songliubraving@fb.com> Arnaldo reported that building all his containers with BUILD_BPF_SKEL=1 to then make this the default he found problems in some distros where the system linux/bpf.h file was being used and lacked this: util/bpf_skel/bperf_leader.bpf.c:13:20: error: use of undeclared identifier 'BPF_F_PRESERVE_ELEMS' __uint(map_flags, BPF_F_PRESERVE_ELEMS); So use instead the vmlinux.h file generated by bpftool from BTF info. This fixed these as well, getting the build back working on debian:11, debian:experimental and ubuntu:21.10: In file included from In file included from util/bpf_skel/bperf_leader.bpf.cutil/bpf_skel/bpf_prog_profiler.bpf.c::33: : In file included from In file included from /usr/include/linux/bpf.h/usr/include/linux/bpf.h::1111: : /usr/include/linux/types.h/usr/include/linux/types.h::55::1010:: In file included from util/bpf_skel/bperf_follower.bpf.c:3fatal errorfatal error: : : In file included from /usr/include/linux/bpf.h:'asm/types.h' file not found11'asm/types.h' file not found: /usr/include/linux/types.h:5:10: fatal error: 'asm/types.h' file not found #include <asm/types.h>#include <asm/types.h> ^~~~~~~~~~~~~ ^~~~~~~~~~~~~ #include <asm/types.h> ^~~~~~~~~~~~~ 1 error generated. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Tested-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/CF175681-8101-43D1-ABDB-449E644BE986@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_skel/bperf_follower.bpf.c | 3 +-- tools/perf/util/bpf_skel/bperf_leader.bpf.c | 3 +-- tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c | 2 +- 3 files changed, 3 insertions(+), 5 deletions(-) diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c index b8fa3cb2da23..4a6acfde1493 100644 --- a/tools/perf/util/bpf_skel/bperf_follower.bpf.c +++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c @@ -1,7 +1,6 @@ // SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) // Copyright (c) 2021 Facebook -#include <linux/bpf.h> -#include <linux/perf_event.h> +#include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> #include "bperf.h" diff --git a/tools/perf/util/bpf_skel/bperf_leader.bpf.c b/tools/perf/util/bpf_skel/bperf_leader.bpf.c index 4f70d1459e86..40d962b05863 100644 --- a/tools/perf/util/bpf_skel/bperf_leader.bpf.c +++ b/tools/perf/util/bpf_skel/bperf_leader.bpf.c @@ -1,7 +1,6 @@ // SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) // Copyright (c) 2021 Facebook -#include <linux/bpf.h> -#include <linux/perf_event.h> +#include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> #include "bperf.h" diff --git a/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c b/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c index c7cec92d0236..72a5131e0eb9 100644 --- a/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c +++ b/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) // Copyright (c) 2020 Facebook -#include <linux/bpf.h> +#include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -- 2.34.1

From: Song Liu <songliubraving@fb.com> When building bpf_skel with clang-10, typedef causes confusions like: libbpf: map 'prev_readings': unexpected def kind var. Fix this by removing the typedef. Fixes: 7fac83aaf2eecc9e ("perf stat: Introduce 'bperf' to share hardware PMCs with BPF") Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/BEF5C312-4331-4A60-AEC0-AD7617CB2BC4@fb.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_skel/bperf.h | 14 -------------- tools/perf/util/bpf_skel/bperf_follower.bpf.c | 16 +++++++++++++--- tools/perf/util/bpf_skel/bperf_leader.bpf.c | 16 +++++++++++++--- 3 files changed, 26 insertions(+), 20 deletions(-) delete mode 100644 tools/perf/util/bpf_skel/bperf.h diff --git a/tools/perf/util/bpf_skel/bperf.h b/tools/perf/util/bpf_skel/bperf.h deleted file mode 100644 index 186a5551ddb9..000000000000 --- a/tools/perf/util/bpf_skel/bperf.h +++ /dev/null @@ -1,14 +0,0 @@ -// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -// Copyright (c) 2021 Facebook - -#ifndef __BPERF_STAT_H -#define __BPERF_STAT_H - -typedef struct { - __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); - __uint(key_size, sizeof(__u32)); - __uint(value_size, sizeof(struct bpf_perf_event_value)); - __uint(max_entries, 1); -} reading_map; - -#endif /* __BPERF_STAT_H */ diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c index 4a6acfde1493..f193998530d4 100644 --- a/tools/perf/util/bpf_skel/bperf_follower.bpf.c +++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c @@ -3,11 +3,21 @@ #include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -#include "bperf.h" #include "bperf_u.h" -reading_map diff_readings SEC(".maps"); -reading_map accum_readings SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); + __uint(max_entries, 1); +} diff_readings SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); + __uint(max_entries, 1); +} accum_readings SEC(".maps"); struct { __uint(type, BPF_MAP_TYPE_HASH); diff --git a/tools/perf/util/bpf_skel/bperf_leader.bpf.c b/tools/perf/util/bpf_skel/bperf_leader.bpf.c index 40d962b05863..e2a2d4cd7779 100644 --- a/tools/perf/util/bpf_skel/bperf_leader.bpf.c +++ b/tools/perf/util/bpf_skel/bperf_leader.bpf.c @@ -3,7 +3,6 @@ #include "vmlinux.h" #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -#include "bperf.h" struct { __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); @@ -12,8 +11,19 @@ struct { __uint(map_flags, BPF_F_PRESERVE_ELEMS); } events SEC(".maps"); -reading_map prev_readings SEC(".maps"); -reading_map diff_readings SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); + __uint(max_entries, 1); +} prev_readings SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(key_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bpf_perf_event_value)); + __uint(max_entries, 1); +} diff_readings SEC(".maps"); SEC("raw_tp/sched_switch") int BPF_PROG(on_switch) -- 2.34.1

From: Namhyung Kim <namhyung@kernel.org> It seems the recent libbpf got more strict about the section name. I'm seeing a failure like this: $ sudo ./perf stat -a --bpf-counters --for-each-cgroup ^. sleep 1 libbpf: prog 'on_cgrp_switch': missing BPF prog type, check ELF section name 'perf_events' libbpf: prog 'on_cgrp_switch': failed to load: -22 libbpf: failed to load object 'bperf_cgroup_bpf' libbpf: failed to load BPF skeleton 'bperf_cgroup_bpf': -22 Failed to load cgroup skeleton The section name should be 'perf_event' (without the trailing 's'). Although it's related to the libbpf change, it'd be better fix the section name in the first place. Fixes: 944138f048f7d759 ("perf stat: Enable BPF counter with --for-each-cgroup") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: bpf@vger.kernel.org Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220916184132.1161506-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_skel/bperf_cgroup.bpf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c index 292c430768b5..c72f8ad96f75 100644 --- a/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c +++ b/tools/perf/util/bpf_skel/bperf_cgroup.bpf.c @@ -176,7 +176,7 @@ static int bperf_cgroup_count(void) } // This will be attached to cgroup-switches event for each cpu -SEC("perf_events") +SEC("perf_event") int BPF_PROG(on_cgrp_switch) { return bperf_cgroup_count(); -- 2.34.1

From: Tengda Wu <wutengda@huaweicloud.com> bperf has a nice ability to share PMUs, but it still does not support inherit events during fork(), resulting in some deviations in its stat results compared with perf. perf stat result: $ ./perf stat -e cycles,instructions -- ./perf test -w sqrtloop Performance counter stats for './perf test -w sqrtloop': 2,316,038,116 cycles 2,859,350,725 instructions 1.009603637 seconds time elapsed 1.004196000 seconds user 0.003950000 seconds sys bperf stat result: $ ./perf stat --bpf-counters -e cycles,instructions -- \ ./perf test -w sqrtloop Performance counter stats for './perf test -w sqrtloop': 18,762,093 cycles 23,487,766 instructions 1.008913769 seconds time elapsed 1.003248000 seconds user 0.004069000 seconds sys In order to support event inheritance, two new bpf programs are added to monitor the fork and exit of tasks respectively. When a task is created, add it to the filter map to enable counting, and reuse the `accum_key` of its parent task to count together with the parent task. When a task exits, remove it from the filter map to disable counting. After support: $ ./perf stat --bpf-counters -e cycles,instructions -- \ ./perf test -w sqrtloop Performance counter stats for './perf test -w sqrtloop': 2,316,252,189 cycles 2,859,946,547 instructions 1.009422314 seconds time elapsed 1.003597000 seconds user 0.004270000 seconds sys Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Cc: song@kernel.org Cc: bpf@vger.kernel.org Link: https://lore.kernel.org/r/20241021110201.325617-2-wutengda@huaweicloud.com Signed-off-by: Namhyung Kim <namhyung@kernel.org> Conflicts: tools/perf/util/bpf_counter.c Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/builtin-stat.c | 1 + tools/perf/util/bpf_counter.c | 35 +++++-- tools/perf/util/bpf_skel/bperf_follower.bpf.c | 98 +++++++++++++++++-- tools/perf/util/bpf_skel/bperf_u.h | 5 + tools/perf/util/target.h | 1 + 5 files changed, 126 insertions(+), 14 deletions(-) diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index 5ba56a1ae41f..59ca2cce3f8f 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -2244,6 +2244,7 @@ int cmd_stat(int argc, const char **argv) } else if (big_num_opt == 0) /* User passed --no-big-num */ stat_config.big_num = false; + target.inherit = !stat_config.no_inherit; err = target__validate(&target); if (err) { target__strerror(&target, err, errbuf, BUFSIZ); diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index ede77e92ab08..b54a3ef85ba3 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -380,6 +380,7 @@ static int bperf_check_target(struct evsel *evsel, } static struct perf_cpu_map *all_cpu_map; +static __u32 filter_entry_cnt; static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd, struct perf_event_attr_map_entry *entry) @@ -430,12 +431,32 @@ static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd, return err; } +static int bperf_attach_follower_program(struct bperf_follower_bpf *skel, + enum bperf_filter_type filter_type, + bool inherit) +{ + struct bpf_link *link; + int err = 0; + + if ((filter_type == BPERF_FILTER_PID || + filter_type == BPERF_FILTER_TGID) && inherit) + /* attach all follower bpf progs to enable event inheritance */ + err = bperf_follower_bpf__attach(skel); + else { + link = bpf_program__attach(skel->progs.fexit_XXX); + if (IS_ERR(link)) + err = PTR_ERR(link); + } + + return err; +} + static int bperf__load(struct evsel *evsel, struct target *target) { struct perf_event_attr_map_entry entry = {0xffffffff, 0xffffffff}; int attr_map_fd, diff_map_fd = -1, err; enum bperf_filter_type filter_type; - __u32 filter_entry_cnt, i; + __u32 i; if (bperf_check_target(evsel, target, &filter_type, &filter_entry_cnt)) return -1; @@ -513,9 +534,6 @@ static int bperf__load(struct evsel *evsel, struct target *target) /* set up reading map */ bpf_map__set_max_entries(evsel->follower_skel->maps.accum_readings, filter_entry_cnt); - /* set up follower filter based on target */ - bpf_map__set_max_entries(evsel->follower_skel->maps.filter, - filter_entry_cnt); err = bperf_follower_bpf__load(evsel->follower_skel); if (err) { pr_err("Failed to load follower skeleton\n"); @@ -527,6 +545,7 @@ static int bperf__load(struct evsel *evsel, struct target *target) for (i = 0; i < filter_entry_cnt; i++) { int filter_map_fd; __u32 key; + struct bperf_filter_value fval = { i, 0 }; if (filter_type == BPERF_FILTER_PID || filter_type == BPERF_FILTER_TGID) @@ -537,12 +556,14 @@ static int bperf__load(struct evsel *evsel, struct target *target) break; filter_map_fd = bpf_map__fd(evsel->follower_skel->maps.filter); - bpf_map_update_elem(filter_map_fd, &key, &i, BPF_ANY); + bpf_map_update_elem(filter_map_fd, &key, &fval, BPF_ANY); } evsel->follower_skel->bss->type = filter_type; + evsel->follower_skel->bss->inherit = target->inherit; - err = bperf_follower_bpf__attach(evsel->follower_skel); + err = bperf_attach_follower_program(evsel->follower_skel, filter_type, + target->inherit); out: if (err && evsel->bperf_leader_link_fd >= 0) @@ -605,7 +626,7 @@ static int bperf__read(struct evsel *evsel) bperf_sync_counters(evsel); reading_map_fd = bpf_map__fd(skel->maps.accum_readings); - for (i = 0; i < bpf_map__max_entries(skel->maps.accum_readings); i++) { + for (i = 0; i < filter_entry_cnt; i++) { __u32 cpu; err = bpf_map_lookup_elem(reading_map_fd, &i, values); diff --git a/tools/perf/util/bpf_skel/bperf_follower.bpf.c b/tools/perf/util/bpf_skel/bperf_follower.bpf.c index f193998530d4..0595063139a3 100644 --- a/tools/perf/util/bpf_skel/bperf_follower.bpf.c +++ b/tools/perf/util/bpf_skel/bperf_follower.bpf.c @@ -5,6 +5,8 @@ #include <bpf/bpf_tracing.h> #include "bperf_u.h" +#define MAX_ENTRIES 102400 + struct { __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); __uint(key_size, sizeof(__u32)); @@ -22,25 +24,29 @@ struct { struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(key_size, sizeof(__u32)); - __uint(value_size, sizeof(__u32)); + __uint(value_size, sizeof(struct bperf_filter_value)); + __uint(max_entries, MAX_ENTRIES); + __uint(map_flags, BPF_F_NO_PREALLOC); } filter SEC(".maps"); enum bperf_filter_type type = 0; int enabled = 0; +int inherit; SEC("fexit/XXX") int BPF_PROG(fexit_XXX) { struct bpf_perf_event_value *diff_val, *accum_val; __u32 filter_key, zero = 0; - __u32 *accum_key; + __u32 accum_key; + struct bperf_filter_value *fval; if (!enabled) return 0; switch (type) { case BPERF_FILTER_GLOBAL: - accum_key = &zero; + accum_key = zero; goto do_add; case BPERF_FILTER_CPU: filter_key = bpf_get_smp_processor_id(); @@ -49,22 +55,34 @@ int BPF_PROG(fexit_XXX) filter_key = bpf_get_current_pid_tgid() & 0xffffffff; break; case BPERF_FILTER_TGID: - filter_key = bpf_get_current_pid_tgid() >> 32; + /* Use pid as the filter_key to exclude new task counts + * when inherit is disabled. Don't worry about the existing + * children in TGID losing their counts, bpf_counter has + * already added them to the filter map via perf_thread_map + * before this bpf prog runs. + */ + filter_key = inherit ? + bpf_get_current_pid_tgid() >> 32 : + bpf_get_current_pid_tgid() & 0xffffffff; break; default: return 0; } - accum_key = bpf_map_lookup_elem(&filter, &filter_key); - if (!accum_key) + fval = bpf_map_lookup_elem(&filter, &filter_key); + if (!fval) return 0; + accum_key = fval->accum_key; + if (fval->exited) + bpf_map_delete_elem(&filter, &filter_key); + do_add: diff_val = bpf_map_lookup_elem(&diff_readings, &zero); if (!diff_val) return 0; - accum_val = bpf_map_lookup_elem(&accum_readings, accum_key); + accum_val = bpf_map_lookup_elem(&accum_readings, &accum_key); if (!accum_val) return 0; @@ -75,4 +93,70 @@ int BPF_PROG(fexit_XXX) return 0; } +/* The program is only used for PID or TGID filter types. */ +SEC("tp_btf/task_newtask") +int BPF_PROG(on_newtask, struct task_struct *task, __u64 clone_flags) +{ + __u32 parent_key, child_key; + struct bperf_filter_value *parent_fval; + struct bperf_filter_value child_fval = { 0 }; + + if (!enabled) + return 0; + + switch (type) { + case BPERF_FILTER_PID: + parent_key = bpf_get_current_pid_tgid() & 0xffffffff; + child_key = task->pid; + break; + case BPERF_FILTER_TGID: + parent_key = bpf_get_current_pid_tgid() >> 32; + child_key = task->tgid; + if (child_key == parent_key) + return 0; + break; + default: + return 0; + } + + /* Check if the current task is one of the target tasks to be counted */ + parent_fval = bpf_map_lookup_elem(&filter, &parent_key); + if (!parent_fval) + return 0; + + /* Start counting for the new task by adding it into filter map, + * inherit the accum key of its parent task so that they can be + * counted together. + */ + child_fval.accum_key = parent_fval->accum_key; + child_fval.exited = 0; + bpf_map_update_elem(&filter, &child_key, &child_fval, BPF_NOEXIST); + + return 0; +} + +/* The program is only used for PID or TGID filter types. */ +SEC("tp_btf/sched_process_exit") +int BPF_PROG(on_exittask, struct task_struct *task) +{ + __u32 pid; + struct bperf_filter_value *fval; + + if (!enabled) + return 0; + + /* Stop counting for this task by removing it from filter map. + * For TGID type, if the pid can be found in the map, it means that + * this pid belongs to the leader task. After the task exits, the + * tgid of its child tasks (if any) will be 1, so the pid can be + * safely removed. + */ + pid = task->pid; + fval = bpf_map_lookup_elem(&filter, &pid); + if (fval) + fval->exited = 1; + + return 0; +} + char LICENSE[] SEC("license") = "Dual BSD/GPL"; diff --git a/tools/perf/util/bpf_skel/bperf_u.h b/tools/perf/util/bpf_skel/bperf_u.h index 1ce0c2c905c1..4a4a753980be 100644 --- a/tools/perf/util/bpf_skel/bperf_u.h +++ b/tools/perf/util/bpf_skel/bperf_u.h @@ -11,4 +11,9 @@ enum bperf_filter_type { BPERF_FILTER_TGID, }; +struct bperf_filter_value { + __u32 accum_key; + __u8 exited; +}; + #endif /* __BPERF_STAT_U_H */ diff --git a/tools/perf/util/target.h b/tools/perf/util/target.h index 4ff56217f2a6..4bb3afc5194c 100644 --- a/tools/perf/util/target.h +++ b/tools/perf/util/target.h @@ -17,6 +17,7 @@ struct target { bool default_per_cpu; bool per_thread; bool use_bpf; + bool inherit; const char *attr_map; }; -- 2.34.1

From: Xiaomeng Zhang <zhangxiaomeng13@huawei.com> bperf restricts the size of perf_attr_map's entries to 16, which cannot hold all events in many scenarios. A typical example is when the user specifies `-a -ddd` ([0]). And in other cases such as top-down analysis, which often requires a set of more than 16 PMUs to be collected simultaneously. Fix this by increase perf_attr_map entries to 100, and an event number check has been introduced when bperf__load() to ensure that users receive a more friendly prompt when the event limit is reached. [0] https://lore.kernel.org/all/20230104064402.1551516-3-namhyung@kernel.org/ Fixes: 7fac83aaf2ee ("perf stat: Introduce 'bperf' to share hardware PMCs with BPF") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Xiaomeng Zhang <zhangxiaomeng13@huawei.com> Conflicts: tools/perf/util/bpf_counter.c Signed-off-by: Tengda Wu <wutengda2@huawei.com> --- tools/perf/util/bpf_counter.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c index b54a3ef85ba3..2394b0a5bcfe 100644 --- a/tools/perf/util/bpf_counter.c +++ b/tools/perf/util/bpf_counter.c @@ -27,7 +27,7 @@ #include "bpf_skel/bperf_leader.skel.h" #include "bpf_skel/bperf_follower.skel.h" -#define ATTR_MAP_SIZE 16 +#define ATTR_MAP_SIZE 100 static inline void *u64_to_ptr(__u64 ptr) { @@ -458,6 +458,12 @@ static int bperf__load(struct evsel *evsel, struct target *target) enum bperf_filter_type filter_type; __u32 i; + if (evsel->evlist->core.nr_entries > ATTR_MAP_SIZE) { + pr_err("Too many events, please limit to %d or less\n", + ATTR_MAP_SIZE); + return -1; + } + if (bperf_check_target(evsel, target, &filter_type, &filter_entry_cnt)) return -1; -- 2.34.1

反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/15341 邮件列表地址:https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/7DC... FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/15341 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/7DC...
participants (2)
-
patchwork bot
-
Tengda Wu