Hi,
Arm processors provide a hardware PMU (Performance Monitoring Unit) facility called the Statistical Profiling Extension (SPE), which can gather memory access metrics.
In this patchset, SPE is used as an access-information sampling approach to drive NUMA balancing. It replaces the existing method, which is based on address space scanning and hint faults, with access information provided by the hardware. With this, NUMA balancing no longer needs to scan the address space periodically or rely on the task-to-page association built by NUMA hint faults. Instead, the access samples obtained from the hardware PMU are fed to NUMA balancing as equivalents of page faults. Apart from the sampling approach, the rest of the NUMA balancing policy is retained and performs page and task migrations according to the samples.
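To illustrate the idea of feeding hardware samples into the same accounting that hint faults would normally drive, here is a minimal userspace sketch. It is not code from this patchset: struct access_sample, struct task_numa_stats, addr_to_node() and the 1 GiB-per-node layout are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 4

/* Hypothetical access sample: the physical address a task touched. */
struct access_sample {
	uint64_t phys_addr;
};

/* Hypothetical per-task counters, standing in for NUMA fault statistics. */
struct task_numa_stats {
	unsigned long faults[MAX_NODES];
};

/* Toy topology (assumption): each node owns a contiguous 1 GiB range. */
static int addr_to_node(uint64_t phys_addr)
{
	return (int)((phys_addr >> 30) % MAX_NODES);
}

/* Account one hardware sample the way this sketch would account a hint fault. */
static void account_sample(struct task_numa_stats *st,
			   const struct access_sample *s)
{
	assert(st && s);
	st->faults[addr_to_node(s->phys_addr)]++;
}

/* The node with the most sampled accesses becomes the preferred node. */
static int preferred_node(const struct task_numa_stats *st)
{
	int best = 0;

	for (int n = 1; n < MAX_NODES; n++)
		if (st->faults[n] > st->faults[best])
			best = n;
	return best;
}
```

In this sketch the placement decision only ever sees the counters, so the source of the events (hint faults or PMU samples) can change without touching the policy.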
Profiling based on SPE is a valid alternative sampling approach in NUMA balancing for optimal page and task placement. It can also be extended to other architectures, as long as there is a hardware PMU that supports memory access profiling.
An abstraction layer, mem_sampling, is introduced so that other kernel features and other types of hardware PMU can be supported later.
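As a rough picture of what such a layer can look like, the userspace sketch below separates the hardware backend (which produces samples) from the consumers (which register a callback). Every name in it (struct mem_sample, sampling_register_consumer(), sampling_deliver(), count_samples()) is hypothetical and is not the interface defined in include/linux/mem_sampling.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware-independent sample record. */
struct mem_sample {
	uint64_t vaddr;
	uint64_t paddr;
};

typedef void (*sample_cb_t)(const struct mem_sample *s, void *priv);

#define MAX_CONSUMERS 4

static struct {
	sample_cb_t cb;
	void *priv;
} consumers[MAX_CONSUMERS];
static size_t nr_consumers;

/* A kernel feature (for example NUMA balancing) subscribes to samples here. */
static int sampling_register_consumer(sample_cb_t cb, void *priv)
{
	if (!cb || nr_consumers == MAX_CONSUMERS)
		return -1;
	consumers[nr_consumers].cb = cb;
	consumers[nr_consumers].priv = priv;
	nr_consumers++;
	return 0;
}

/* A hardware backend (SPE or another PMU) hands over decoded samples here. */
static void sampling_deliver(const struct mem_sample *samples, size_t n)
{
	assert(samples || n == 0);
	for (size_t i = 0; i < n; i++)
		for (size_t c = 0; c < nr_consumers; c++)
			consumers[c].cb(&samples[i], consumers[c].priv);
}

/* Example consumer: simply counts how many samples it has seen. */
static void count_samples(const struct mem_sample *s, void *priv)
{
	(void)s;
	(*(unsigned long *)priv)++;
}
```

With this split, adding a new PMU only requires a new caller of sampling_deliver(), and adding a new kernel feature only requires a new callback.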
To help evaluate the performance of this approach on a system, sysctl interfaces are added to enable/disable hardware mem sampling. The NUMA balancing sampling approach can also be switched back to the hint-fault-based approach dynamically.
TODOs:
- Currently, SPE for NUMA balancing does not support PMD-level page migration; this will be supported in a later version.

Changes since v1:
- Code cleanup, no functional change.

Ze Zuo (11):
  drivers/arm/spe: In-kernel SPE driver for page access profiling
  Add hardware PMU (mem_sampling) abstract layer
  mm/mem_sampling.c: Add controlling interface for mem_sampling
  Enable per-process mem_sampling from sched switch path
  Drive NUMA balancing via mem_sampling access data
  Add controlling interface for mem_sampling numa_balance
  tracing, numa balance: add trace events for numa data caused by
    mem_sampling
  driver/arm/spe: make mem_sampling and perf reuse arm spe driver
  tracing, mem-sampling-sample: add trace events for page access
  spe record ring buffer just for ctx origin flaw
  config: Enable MEM_SAMPLING for Numa balance by default

 arch/arm64/configs/openeuler_defconfig        |   3 +
 arch/x86/configs/openeuler_defconfig          |   1 +
 drivers/Kconfig                               |   2 +
 drivers/Makefile                              |   1 +
 drivers/arm/Kconfig                           |   2 +
 drivers/arm/spe/Kconfig                       |  11 +
 drivers/arm/spe/Makefile                      |   2 +
 drivers/arm/spe/spe-decoder/Makefile          |   2 +
 drivers/arm/spe/spe-decoder/arm-spe-decoder.c | 213 +++++
 drivers/arm/spe/spe-decoder/arm-spe-decoder.h |  74 ++
 .../arm/spe/spe-decoder/arm-spe-pkt-decoder.c | 227 +++++
 .../arm/spe/spe-decoder/arm-spe-pkt-decoder.h | 153 ++++
 drivers/arm/spe/spe.c                         | 864 ++++++++++++++++++
 drivers/arm/spe/spe.h                         | 130 +++
 drivers/perf/Kconfig                          |   2 +-
 drivers/perf/arm_pmu_acpi.c                   |  30 +-
 drivers/perf/arm_spe_pmu.c                    | 371 ++------
 include/linux/mem_sampling.h                  | 121 +++
 include/linux/perf/arm_pmu.h                  |   8 +
 include/trace/events/kmem.h                   |  80 ++
 kernel/sched/core.c                           |   2 +
 kernel/sched/fair.c                           |  12 +
 kernel/sched/sched.h                          |   1 +
 mm/Kconfig                                    |  23 +
 mm/Makefile                                   |   1 +
 mm/mem_sampling.c                             | 534 +++++++++++
 samples/bpf/spe/Makefile                      |  26 +
 samples/bpf/spe/Makefile.arch                 |  47 +
 samples/bpf/spe/README.md                     |   0
 samples/bpf/spe/spe-record.bpf.c              |  40 +
 samples/bpf/spe/spe-record.h                  |  47 +
 samples/bpf/spe/spe-record.user.c             | 116 +++
 32 files changed, 2834 insertions(+), 312 deletions(-)
 create mode 100644 drivers/arm/Kconfig
 create mode 100644 drivers/arm/spe/Kconfig
 create mode 100644 drivers/arm/spe/Makefile
 create mode 100644 drivers/arm/spe/spe-decoder/Makefile
 create mode 100644 drivers/arm/spe/spe-decoder/arm-spe-decoder.c
 create mode 100644 drivers/arm/spe/spe-decoder/arm-spe-decoder.h
 create mode 100644 drivers/arm/spe/spe-decoder/arm-spe-pkt-decoder.c
 create mode 100644 drivers/arm/spe/spe-decoder/arm-spe-pkt-decoder.h
 create mode 100644 drivers/arm/spe/spe.c
 create mode 100644 drivers/arm/spe/spe.h
 create mode 100644 include/linux/mem_sampling.h
 create mode 100644 mm/mem_sampling.c
 create mode 100644 samples/bpf/spe/Makefile
 create mode 100644 samples/bpf/spe/Makefile.arch
 create mode 100644 samples/bpf/spe/README.md
 create mode 100644 samples/bpf/spe/spe-record.bpf.c
 create mode 100644 samples/bpf/spe/spe-record.h
 create mode 100644 samples/bpf/spe/spe-record.user.c