[PATCH v3 00/10] arm_mpam: Introduce the Narrow-PARTID feature for MPAM driver
This patch set is based on the mpam_resctrl_glue_v5_debugfs branch of the
https://gitlab.arm.com/linux-arm/linux-bh.git repository.

The narrow-PARTID feature in MPAM allows PARTIDs to be used more efficiently
by enabling a many-to-one mapping of reqPARTIDs (requested PARTIDs) to
intPARTIDs (internal PARTIDs). This mapping reduces the number of unique
PARTIDs needed, allowing more tasks or processes to be monitored and managed
with the available resources.

On systems with a mixture of MSCs, for those MSCs that do not support
narrow-PARTID, the PARTIDs exceeding the number of closids are used as
reqPARTIDs to expand the monitoring groups. The information carried in the
RMID is therefore expanded so that it includes not only the PMG, but also the
reqPARTID. The new RMID layout is:

	RMID = reqPARTID * NUM_PMG + PMG

Each control group has m (req)PARTIDs, which are used to expand the number of
monitoring groups under one control group. The number of monitoring groups is
therefore no longer limited by the range of MPAM's PMG, which improves the
scalability of the system's monitoring capabilities.

---
Compared with v2:
- Add dynamic reqPARTID allocation implementation.
- Patch 4 has been merged into the repository and is dropped from this series.

Compared with v1:
- Redefine the RMID information.
- Refactor resctrl_arch_rmid_idx_decode() and resctrl_arch_rmid_idx_encode().
- Simplify closid_rmid2reqpartid() to rmid2reqpartid() and replace it
  accordingly.

Compared with RFC-v4:
- Rebase the patch set on the v6.14-rc1 branch.

Compared with RFC-v3:
- Add a limitation of the Narrow-PARTID feature (see patch 2).
- Remove redundant reqpartid2closid() and reqpartid_pmg2rmid().
- Partially refactor closid_rmid2reqpartid().
- Merge the PARTID conversion-related patches into a single patch for
  bisectability.
- Skip adapting resctrl_arch_set_rmid(), which is going to be removed.

Compared with RFC-v2:
- Refactor closid/rmid pair translation.
- Simplify the configuration synchronisation logic.
- Remove the reqPARTID source bitmap.

Compared with RFC-v1:
- Rebase this patch set on the latest MPAM driver of the v6.12-rc1 branch.
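As an aside, the RMID layout described above amounts to a simple mixed-radix
encoding. The following is an illustrative sketch only (the helper names and
the num_pmg parameter are hypothetical, not helpers from this series; NUM_PMG
is assumed to be mpam_pmg_max + 1):

	/* Hypothetical helpers showing the expanded RMID layout. */
	static inline u32 example_rmid_encode(u32 reqpartid, u32 pmg,
					      u32 num_pmg)
	{
		/* RMID = reqPARTID * NUM_PMG + PMG */
		return reqpartid * num_pmg + pmg;
	}

	static inline void example_rmid_decode(u32 rmid, u32 num_pmg,
					       u32 *reqpartid, u32 *pmg)
	{
		*reqpartid = rmid / num_pmg;
		*pmg = rmid % num_pmg;
	}

e.g. with NUM_PMG = 2, reqPARTID 3 and PMG 1 encode to RMID 7, and RMID 7
decodes back to the same pair.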
v2: https://lore.kernel.org/all/20250222112448.2438586-1-zengheng4@huawei.com/
v1: https://lore.kernel.org/all/20250217031852.2014939-1-zengheng4@huawei.com/
RFC-v4: https://lore.kernel.org/all/20250104101224.873926-1-zengheng4@huawei.com/
RFC-v3: https://lore.kernel.org/all/20241207092136.2488426-1-zengheng4@huawei.com/
RFC-v2: https://lore.kernel.org/all/20241119135104.595630-1-zengheng4@huawei.com/
RFC-v1: https://lore.kernel.org/all/20241114135037.918470-1-zengheng4@huawei.com/

---

Ben Horgan (1):
  arm64/sysreg: Add MPAMSM_EL1 register

Zeng Heng (9):
  arm_mpam: Add intPARTID and reqPARTID support for narrow PARTID feature
  arm_mpam: Disable Narrow-PARTID when MBA lacks support
  arm_mpam: Refactor rmid to reqPARTID/PMG mapping
  arm_mpam: Propagate control group config to sub-monitoring groups
  fs/resctrl: Add rmid_entry state helpers
  arm_mpam: Implement dynamic reqPARTID allocation for monitoring groups
  fs/resctrl: Wire up rmid expansion and reclaim functions
  arm64/mpam: Add mpam_sync_config() for dynamic rmid expansion
  fs/resctrl: Prevent rmid parsing errors by flushing limbo on umount

 Documentation/arch/arm64/index.rst          |    1 +
 Documentation/arch/arm64/mpam.rst           |   94 +
 Documentation/arch/arm64/silicon-errata.rst |    9 +
 arch/arm64/Kconfig                          |    6 +-
 arch/arm64/include/asm/el2_setup.h          |    3 +-
 arch/arm64/include/asm/mpam.h               |   96 +
 arch/arm64/include/asm/resctrl.h            |    2 +
 arch/arm64/include/asm/thread_info.h        |    3 +
 arch/arm64/kernel/Makefile                  |    1 +
 arch/arm64/kernel/cpufeature.c              |   21 +-
 arch/arm64/kernel/mpam.c                    |   62 +
 arch/arm64/kernel/process.c                 |    7 +
 arch/arm64/kvm/hyp/include/hyp/switch.h     |   12 +-
 arch/arm64/kvm/hyp/nvhe/hyp-main.c          |    9 +
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c          |   16 +
 arch/arm64/kvm/sys_regs.c                   |    2 +
 arch/arm64/tools/sysreg                     |    8 +
 arch/x86/include/asm/resctrl.h              |    7 +
 drivers/resctrl/Kconfig                     |    9 +-
 drivers/resctrl/Makefile                    |    1 +
 drivers/resctrl/mpam_devices.c              |  340 ++-
 drivers/resctrl/mpam_internal.h             |  112 +-
 drivers/resctrl/mpam_resctrl.c              | 2094 +++++++++++++++++++
 drivers/resctrl/test_mpam_resctrl.c         |  364 ++++
 fs/resctrl/monitor.c                        |   50 +-
 fs/resctrl/rdtgroup.c                       |   24 +-
 include/linux/arm_mpam.h                    |   49 +
 include/linux/resctrl.h                     |   21 +
 28 files changed, 3368 insertions(+), 55 deletions(-)
 create mode 100644 Documentation/arch/arm64/mpam.rst
 create mode 100644 arch/arm64/include/asm/mpam.h
 create mode 100644 arch/arm64/include/asm/resctrl.h
 create mode 100644 arch/arm64/kernel/mpam.c
 create mode 100644 drivers/resctrl/mpam_resctrl.c
 create mode 100644 drivers/resctrl/test_mpam_resctrl.c

--
2.25.1
From: Ben Horgan <ben.horgan@arm.com>

The MPAMSM_EL1 register determines the MPAM configuration for an SMCU. Add
the register definition.

Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
---
 Documentation/arch/arm64/index.rst          |    1 +
 Documentation/arch/arm64/mpam.rst           |   94 +
 Documentation/arch/arm64/silicon-errata.rst |    9 +
 arch/arm64/Kconfig                          |    6 +-
 arch/arm64/include/asm/el2_setup.h          |    3 +-
 arch/arm64/include/asm/mpam.h               |   96 +
 arch/arm64/include/asm/resctrl.h            |    2 +
 arch/arm64/include/asm/thread_info.h        |    3 +
 arch/arm64/kernel/Makefile                  |    1 +
 arch/arm64/kernel/cpufeature.c              |   21 +-
 arch/arm64/kernel/mpam.c                    |   62 +
 arch/arm64/kernel/process.c                 |    7 +
 arch/arm64/kvm/hyp/include/hyp/switch.h     |   12 +-
 arch/arm64/kvm/hyp/nvhe/hyp-main.c          |    9 +
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c          |   16 +
 arch/arm64/kvm/sys_regs.c                   |    2 +
 arch/arm64/tools/sysreg                     |    8 +
 drivers/resctrl/Kconfig                     |    9 +-
 drivers/resctrl/Makefile                    |    1 +
 drivers/resctrl/mpam_devices.c              |  257 ++-
 drivers/resctrl/mpam_internal.h             |  105 +-
 drivers/resctrl/mpam_resctrl.c              | 1877 +++++++++++++++++++
 drivers/resctrl/test_mpam_resctrl.c         |  364 ++++
 include/linux/arm_mpam.h                    |   32 +
 24 files changed, 2970 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/arch/arm64/mpam.rst
 create mode 100644 arch/arm64/include/asm/mpam.h
 create mode 100644 arch/arm64/include/asm/resctrl.h
 create mode 100644 arch/arm64/kernel/mpam.c
 create mode 100644 drivers/resctrl/mpam_resctrl.c
 create mode 100644 drivers/resctrl/test_mpam_resctrl.c

diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst
index af52edc8c0ac..98052b4ef4a1 100644
--- a/Documentation/arch/arm64/index.rst
+++ b/Documentation/arch/arm64/index.rst
@@ -23,6 +23,7 @@ ARM64 Architecture
    memory
    memory-tagging-extension
    mops
+   mpam
    perf
    pointer-authentication
    ptdump
diff --git a/Documentation/arch/arm64/mpam.rst b/Documentation/arch/arm64/mpam.rst
new file mode 100644
index 000000000000..6dc3de54ec9a
--- /dev/null
+++ b/Documentation/arch/arm64/mpam.rst
@@ -0,0 +1,94 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+MPAM
+====
+
+What is MPAM
+============
+MPAM (Memory Partitioning and Monitoring) is a feature in the CPUs and memory
+system components, such as the caches or memory controllers, that allows
+memory traffic to be labelled, partitioned and monitored.
+
+Traffic is labelled by the CPU, based on the control or monitor group the
+current task is assigned to using resctrl. Partitioning policy can be set
+using the schemata file in resctrl, and monitor values read via resctrl.
+See Documentation/filesystems/resctrl.rst for more details.
+
+This allows tasks that share memory system resources, such as caches, to be
+isolated from each other according to the partitioning policy (so called noisy
+neighbours).
+
+Supported Platforms
+===================
+Use of this feature requires CPU support, support in the memory system
+components, and a description from firmware of where the MPAM device controls
+are in the MMIO address space (e.g. the 'MPAM' ACPI table).
+
+The MMIO device that provides MPAM controls/monitors for a memory system
+component is called a Memory System Component (MSC).
+
+Because the user interface to MPAM is via resctrl, only MPAM features that are
+compatible with resctrl can be exposed to user-space.
+
+MSC are considered as a group based on the topology. MSC that correspond with
+the L3 cache are considered together; it is not possible to mix MSC between L2
+and L3 to 'cover' a resctrl schema.
+
+The supported features are:
+
+* Cache portion bitmap controls (CPOR) on the L2 or L3 caches. To expose
+  CPOR at L2 or L3, every CPU must have a corresponding CPU cache at this
+  level that also supports the feature. Mismatched big/little platforms are
+  not supported as resctrl's controls would then also depend on task
+  placement.
+
+* Memory bandwidth maximum controls (MBW_MAX) on or after the L3 cache.
+  resctrl uses the L3 cache-id to identify where the memory bandwidth
+  control is applied. For this reason the platform must have an L3 cache
+  with cache-id's supplied by firmware. (It doesn't need to support MPAM.)
+
+  To be exported as the 'MB' schema, the topology of the group of MSC chosen
+  must match the topology of the L3 cache so that the cache-id's can be
+  repainted. For example: platforms with memory bandwidth maximum controls
+  on CPU-less NUMA nodes cannot expose the 'MB' schema to resctrl as these
+  nodes do not have a corresponding L3 cache. If the memory bandwidth
+  control is on the memory rather than the L3 then there must be a single
+  global L3 as otherwise it is unknown which L3 the traffic came from.
+
+  When the MPAM driver finds multiple groups of MSC it can use for the 'MB'
+  schema, it prefers the group closest to the L3 cache.
+
+* Cache Storage Usage (CSU) counters can expose the 'llc_occupancy' provided
+  there is at least one CSU monitor on each MSC that makes up the L3 group.
+  Exposing CSU counters from other caches or devices is not supported.
+
+* Memory Bandwidth Usage (MBWU) on or after the L3 cache. resctrl uses the
+  L3 cache-id to identify where the memory bandwidth is measured. For this
+  reason the platform must have an L3 cache with cache-id's supplied by
+  firmware. (It doesn't need to support MPAM.)
+
+  Memory bandwidth monitoring makes use of MBWU monitors in each MSC that
+  makes up the L3 group. If there are more monitors than the maximum number
+  of control and monitor groups, these will be allocated and configured at
+  boot. Otherwise, the monitors will not be usable as they are required to
+  be free running. If the memory bandwidth monitoring is on the memory
+  rather than the L3 then there must be a single global L3 as otherwise it
+  is unknown which L3 the traffic came from.
+
+  To expose 'mbm_total_bytes', the topology of the group of MSC chosen must
+  match the topology of the L3 cache so that the cache-id's can be
+  repainted. For example: platforms with memory bandwidth monitors on
+  CPU-less NUMA nodes cannot expose 'mbm_total_bytes' as these nodes do not
+  have a corresponding L3 cache. 'mbm_local_bytes' is not exposed as MPAM
+  cannot distinguish local traffic from global traffic.
+
+Feature emulation
+=================
+MPAM will emulate the Code Data Prioritisation (CDP) feature on all platforms.
+
+Reporting Bugs
+==============
+If you are not seeing the counters or controls you expect, please share the
+debug messages produced when enabling dynamic debug and booting with:
+dyndbg="file mpam_resctrl.c +pl"
diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index 4c300caad901..65ed6ea33751 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -214,6 +214,9 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | SI L1           | #4311569        | ARM64_ERRATUM_4311569       |
 +----------------+-----------------+-----------------+-----------------------------+
+| ARM            | CMN-650         | #3642720        | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
 | Broadcom       | Brahma-B53      | N/A             | ARM64_ERRATUM_845719        |
 +----------------+-----------------+-----------------+-----------------------------+
 | Broadcom       | Brahma-B53      | N/A             | ARM64_ERRATUM_843419        |
@@ -247,6 +250,12 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
 +----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | T241 MPAM       | T241-MPAM-1     | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | T241 MPAM       | T241-MPAM-4     | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA         | T241 MPAM       | T241-MPAM-6     | N/A                         |
++----------------+-----------------+-----------------+-----------------------------+
 +----------------+-----------------+-----------------+-----------------------------+
 | Freescale/NXP  | LS2080A/LS1043A | A-008585        | FSL_ERRATUM_A008585         |
 +----------------+-----------------+-----------------+-----------------------------+
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 38dba5f7e4d2..41a5b4ef86b4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2016,8 +2016,8 @@ config ARM64_TLB_RANGE
 
 config ARM64_MPAM
 	bool "Enable support for MPAM"
-	select ARM64_MPAM_DRIVER if EXPERT # does nothing yet
-	select ACPI_MPAM if ACPI
+	select ARM64_MPAM_DRIVER
+	select ARCH_HAS_CPU_RESCTRL
 	help
 	  Memory System Resource Partitioning and Monitoring (MPAM) is an
 	  optional extension to the Arm architecture that allows each
@@ -2039,6 +2039,8 @@ config ARM64_MPAM
 
 	  MPAM is exposed to user-space via the resctrl pseudo filesystem.
 
+	  This option enables the extra context switch code.
+
 endmenu # "ARMv8.4 architectural features"
 
 menu "ARMv8.5 architectural features"
diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
index 85f4c1615472..4d15071a4f3f 100644
--- a/arch/arm64/include/asm/el2_setup.h
+++ b/arch/arm64/include/asm/el2_setup.h
@@ -513,7 +513,8 @@
 	check_override id_aa64pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT, .Linit_mpam_\@, .Lskip_mpam_\@, x1, x2
 
 .Linit_mpam_\@:
-	msr_s	SYS_MPAM2_EL2, xzr		// use the default partition
+	mov	x0, #MPAM2_EL2_EnMPAMSM_MASK
+	msr_s	SYS_MPAM2_EL2, x0		// use the default partition,
 						// and disable lower traps
 	mrs_s	x0, SYS_MPAMIDR_EL1
 	tbz	x0, #MPAMIDR_EL1_HAS_HCR_SHIFT, .Lskip_mpam_\@	// skip if no MPAMHCR reg
diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h
new file mode 100644
index 000000000000..70d396e7b6da
--- /dev/null
+++ b/arch/arm64/include/asm/mpam.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2025 Arm Ltd. */
+
+#ifndef __ASM__MPAM_H
+#define __ASM__MPAM_H
+
+#include <linux/arm_mpam.h>
+#include <linux/bitfield.h>
+#include <linux/jump_label.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+
+#include <asm/sysreg.h>
+
+DECLARE_STATIC_KEY_FALSE(mpam_enabled);
+DECLARE_PER_CPU(u64, arm64_mpam_default);
+DECLARE_PER_CPU(u64, arm64_mpam_current);
+
+/*
+ * The value of the MPAM0_EL1 sysreg when a task is in resctrl's default group.
+ * This is used by the context switch code to use the resctrl CPU property
+ * instead. The value is modified when CDP is enabled/disabled by mounting
+ * the resctrl filesystem.
+ */
+extern u64 arm64_mpam_global_default;
+
+#ifdef CONFIG_ARM64_MPAM
+static inline u64 __mpam_regval(u16 partid_d, u16 partid_i, u8 pmg_d, u8 pmg_i)
+{
+	return FIELD_PREP(MPAM0_EL1_PARTID_D, partid_d) |
+	       FIELD_PREP(MPAM0_EL1_PARTID_I, partid_i) |
+	       FIELD_PREP(MPAM0_EL1_PMG_D, pmg_d) |
+	       FIELD_PREP(MPAM0_EL1_PMG_I, pmg_i);
+}
+
+static inline void mpam_set_cpu_defaults(int cpu, u16 partid_d, u16 partid_i,
+					 u8 pmg_d, u8 pmg_i)
+{
+	u64 default_val = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i);
+
+	WRITE_ONCE(per_cpu(arm64_mpam_default, cpu), default_val);
+}
+
+/*
+ * The resctrl filesystem writes to the partid/pmg values for threads and
+ * CPUs, which may race with reads in mpam_thread_switch(). Ensure only one of
+ * the old or new values is used. Particular care should be taken with the pmg
+ * field as mpam_thread_switch() may read a partid and pmg that don't match,
+ * causing this value to be stored with cache allocations, despite being
+ * considered 'free' by resctrl.
+ */
+static inline u64 mpam_get_regval(struct task_struct *tsk)
+{
+	return READ_ONCE(task_thread_info(tsk)->mpam_partid_pmg);
+}
+
+static inline void mpam_set_task_partid_pmg(struct task_struct *tsk,
+					    u16 partid_d, u16 partid_i,
+					    u8 pmg_d, u8 pmg_i)
+{
+	u64 regval = __mpam_regval(partid_d, partid_i, pmg_d, pmg_i);
+
+	WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval);
+}
+
+static inline void mpam_thread_switch(struct task_struct *tsk)
+{
+	u64 oldregval;
+	int cpu = smp_processor_id();
+	u64 regval = mpam_get_regval(tsk);
+
+	if (!static_branch_likely(&mpam_enabled))
+		return;
+
+	if (regval == READ_ONCE(arm64_mpam_global_default))
+		regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu));
+
+	oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
+	if (oldregval == regval)
+		return;
+
+	write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
+	if (system_supports_sme())
+		write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D),
+			       SYS_MPAMSM_EL1);
+	isb();
+
+	/* Synchronising the EL0 write is left until the ERET to EL0 */
+	write_sysreg_s(regval, SYS_MPAM0_EL1);
+
+	WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval);
+}
+#else
+static inline void mpam_thread_switch(struct task_struct *tsk) {}
+#endif /* CONFIG_ARM64_MPAM */
+
+#endif /* __ASM__MPAM_H */
diff --git a/arch/arm64/include/asm/resctrl.h b/arch/arm64/include/asm/resctrl.h
new file mode 100644
index 000000000000..b506e95cf6e3
--- /dev/null
+++ b/arch/arm64/include/asm/resctrl.h
@@ -0,0 +1,2 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/arm_mpam.h>
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 7942478e4065..5d7fe3e153c8 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -41,6 +41,9 @@ struct thread_info {
 #ifdef CONFIG_SHADOW_CALL_STACK
 	void			*scs_base;
 	void			*scs_sp;
+#endif
+#ifdef CONFIG_ARM64_MPAM
+	u64			mpam_partid_pmg;
 #endif
 	u32			cpu;
 };
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 76f32e424065..15979f366519 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -67,6 +67,7 @@ obj-$(CONFIG_CRASH_DUMP)		+= crash_dump.o
 obj-$(CONFIG_VMCORE_INFO)		+= vmcore_info.o
 obj-$(CONFIG_ARM_SDE_INTERFACE)		+= sdei.o
 obj-$(CONFIG_ARM64_PTR_AUTH)		+= pointer_auth.o
+obj-$(CONFIG_ARM64_MPAM)		+= mpam.o
 obj-$(CONFIG_ARM64_MTE)			+= mte.o
 obj-y					+= vdso-wrap.o
 obj-$(CONFIG_COMPAT_VDSO)		+= vdso32-wrap.o
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index c31f8e17732a..4f34e7a76f64 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -86,6 +86,7 @@
 #include <asm/kvm_host.h>
 #include <asm/mmu.h>
 #include <asm/mmu_context.h>
+#include <asm/mpam.h>
 #include <asm/mte.h>
 #include <asm/hypervisor.h>
 #include <asm/processor.h>
@@ -2492,13 +2493,19 @@ test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope)
 
 static void cpu_enable_mpam(const struct arm64_cpu_capabilities *entry)
 {
-	/*
-	 * Access by the kernel (at EL1) should use the reserved PARTID
-	 * which is configured unrestricted. This avoids priority-inversion
-	 * where latency sensitive tasks have to wait for a task that has
-	 * been throttled to release the lock.
-	 */
-	write_sysreg_s(0, SYS_MPAM1_EL1);
+	int cpu = smp_processor_id();
+	u64 regval = 0;
+
+	if (IS_ENABLED(CONFIG_ARM64_MPAM) && static_branch_likely(&mpam_enabled))
+		regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
+
+	write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
+	if (cpus_have_cap(ARM64_SME))
+		write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D),
+			       SYS_MPAMSM_EL1);
+	isb();
+
+	/* Synchronising the EL0 write is left until the ERET to EL0 */
+	write_sysreg_s(regval, SYS_MPAM0_EL1);
 }
 
 static bool
diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c
new file mode 100644
index 000000000000..3a490de4fa12
--- /dev/null
+++ b/arch/arm64/kernel/mpam.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025 Arm Ltd. */
+
+#include <asm/mpam.h>
+
+#include <linux/arm_mpam.h>
+#include <linux/cpu_pm.h>
+#include <linux/jump_label.h>
+#include <linux/percpu.h>
+
+DEFINE_STATIC_KEY_FALSE(mpam_enabled);
+DEFINE_PER_CPU(u64, arm64_mpam_default);
+DEFINE_PER_CPU(u64, arm64_mpam_current);
+
+u64 arm64_mpam_global_default;
+
+static int mpam_pm_notifier(struct notifier_block *self,
+			    unsigned long cmd, void *v)
+{
+	u64 regval;
+	int cpu = smp_processor_id();
+
+	switch (cmd) {
+	case CPU_PM_EXIT:
+		/*
+		 * Don't use mpam_thread_switch() as the system register
+		 * value has changed under our feet.
+		 */
+		regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
+		write_sysreg_s(regval | MPAM1_EL1_MPAMEN, SYS_MPAM1_EL1);
+		if (system_supports_sme()) {
+			write_sysreg_s(regval & (MPAMSM_EL1_PARTID_D | MPAMSM_EL1_PMG_D),
+				       SYS_MPAMSM_EL1);
+		}
+		isb();
+
+		write_sysreg_s(regval, SYS_MPAM0_EL1);
+
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block mpam_pm_nb = {
+	.notifier_call = mpam_pm_notifier,
+};
+
+static int __init arm64_mpam_register_cpus(void)
+{
+	u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
+	u16 partid_max = FIELD_GET(MPAMIDR_EL1_PARTID_MAX, mpamidr);
+	u8 pmg_max = FIELD_GET(MPAMIDR_EL1_PMG_MAX, mpamidr);
+
+	if (!system_supports_mpam())
+		return 0;
+
+	cpu_pm_register_notifier(&mpam_pm_nb);
+	return mpam_register_requestor(partid_max, pmg_max);
+}
+/* Must occur before mpam_msc_driver_init() from subsys_initcall() */
+arch_initcall(arm64_mpam_register_cpus);
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 489554931231..47698955fa1e 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -51,6 +51,7 @@
 #include <asm/fpsimd.h>
 #include <asm/gcs.h>
 #include <asm/mmu_context.h>
+#include <asm/mpam.h>
 #include <asm/mte.h>
 #include <asm/processor.h>
 #include <asm/pointer_auth.h>
@@ -738,6 +739,12 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	if (prev->thread.sctlr_user != next->thread.sctlr_user)
 		update_sctlr_el1(next->thread.sctlr_user);
 
+	/*
+	 * MPAM thread switch happens after the DSB to ensure prev's accesses
+	 * use prev's MPAM settings.
+	 */
+	mpam_thread_switch(next);
+
 	/* the actual thread switch */
 	last = cpu_switch_to(prev, next);
 
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 2597e8bda867..0b50ddd530f3 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -267,7 +267,8 @@ static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu)
 
 static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu)
 {
-	u64 r = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;
+	u64 clr = MPAM2_EL2_EnMPAMSM;
+	u64 set = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1;
 
 	if (!system_supports_mpam())
 		return;
@@ -277,18 +278,21 @@ static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu)
 		write_sysreg_s(MPAMHCR_EL2_TRAP_MPAMIDR_EL1, SYS_MPAMHCR_EL2);
 	} else {
 		/* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */
-		r |= MPAM2_EL2_TIDR;
+		set |= MPAM2_EL2_TIDR;
 	}
 
-	write_sysreg_s(r, SYS_MPAM2_EL2);
+	sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set);
 }
 
 static inline void __deactivate_traps_mpam(void)
 {
+	u64 clr = MPAM2_EL2_TRAPMPAM0EL1 | MPAM2_EL2_TRAPMPAM1EL1 | MPAM2_EL2_TIDR;
+	u64 set = MPAM2_EL2_EnMPAMSM;
+
 	if (!system_supports_mpam())
 		return;
 
-	write_sysreg_s(0, SYS_MPAM2_EL2);
+	sysreg_clear_set_s(SYS_MPAM2_EL2, clr, set);
 
 	if (system_supports_mpam_hcr())
 		write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index e7790097db93..80e71eeddc03 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -638,6 +638,15 @@ static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
 	unsigned long hcall_min = 0;
 	hcall_t hfn;
 
+	if (system_supports_mpam()) {
+		u64 mask = MPAM1_EL1_PARTID_D | MPAM1_EL1_PARTID_I |
+			   MPAM1_EL1_PMG_D | MPAM1_EL1_PMG_I;
+		u64 val = MPAM2_EL2_MPAMEN | (read_sysreg_el1(SYS_MPAM1) & mask);
+
+		write_sysreg_s(val, SYS_MPAM2_EL2);
+		isb();
+	}
+
 	/*
 	 * If pKVM has been initialised then reject any calls to the
 	 * early "privileged" hypercalls. Note that we cannot reject
diff --git a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
index b254d442e54e..be685b63e8cf 100644
--- a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
+++ b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
@@ -183,6 +183,21 @@ void sysreg_restore_guest_state_vhe(struct kvm_cpu_context *ctxt)
 }
 NOKPROBE_SYMBOL(sysreg_restore_guest_state_vhe);
 
+/*
+ * The _EL0 value was written by the host's context switch and belongs to the
+ * VMM. Copy this into the guest's _EL1 register.
+ */
+static inline void __mpam_guest_load(void)
+{
+	u64 mask = MPAM0_EL1_PARTID_D | MPAM0_EL1_PARTID_I |
+		   MPAM0_EL1_PMG_D | MPAM0_EL1_PMG_I;
+
+	if (system_supports_mpam()) {
+		u64 val = (read_sysreg_s(SYS_MPAM0_EL1) & mask) | MPAM1_EL1_MPAMEN;
+
+		write_sysreg_el1(val, SYS_MPAM1);
+	}
+}
+
 /**
  * __vcpu_load_switch_sysregs - Load guest system registers to the physical CPU
  *
@@ -222,6 +237,7 @@ void __vcpu_load_switch_sysregs(struct kvm_vcpu *vcpu)
 	 */
 	__sysreg32_restore_state(vcpu);
 	__sysreg_restore_user_state(guest_ctxt);
+	__mpam_guest_load();
 
 	if (unlikely(is_hyp_ctxt(vcpu))) {
 		__sysreg_restore_vel2_state(vcpu);
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index a7cd0badc20c..2c9a52e66fe0 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -3373,6 +3373,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_MPAM1_EL1), undef_access },
 	{ SYS_DESC(SYS_MPAM0_EL1), undef_access },
 
+	{ SYS_DESC(SYS_MPAMSM_EL1), undef_access },
+
 	{ SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 },
 
 	{ SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 },
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 9d1c21108057..1287cb1de6f3 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -5172,6 +5172,14 @@ Field	31:16	PARTID_D
 Field	15:0	PARTID_I
 EndSysreg
 
+Sysreg	MPAMSM_EL1	3	0	10	5	3
+Res0	63:48
+Field	47:40	PMG_D
+Res0	39:32
+Field	31:16	PARTID_D
+Res0	15:0
+EndSysreg
+
 Sysreg	ISR_EL1	3	0	12	1	0
 Res0	63:11
 Field	10	IS
diff --git a/drivers/resctrl/Kconfig b/drivers/resctrl/Kconfig
index c808e0470394..672abea3b03c 100644
--- a/drivers/resctrl/Kconfig
+++ b/drivers/resctrl/Kconfig
@@ -1,6 +1,7 @@
 menuconfig ARM64_MPAM_DRIVER
 	bool "MPAM driver"
-	depends on ARM64 && ARM64_MPAM && EXPERT
+	depends on ARM64 && ARM64_MPAM
+	select ACPI_MPAM if ACPI
 	help
 	  Memory System Resource Partitioning and Monitoring (MPAM) driver
 	  for System IP, e.g. caches and memory controllers.
@@ -22,3 +23,9 @@ config MPAM_KUNIT_TEST
 	  If unsure, say N.
 
 endif
+
+config ARM64_MPAM_RESCTRL_FS
+	bool
+	default y if ARM64_MPAM_DRIVER && RESCTRL_FS
+	select RESCTRL_RMID_DEPENDS_ON_CLOSID
+	select RESCTRL_ASSIGN_FIXED
diff --git a/drivers/resctrl/Makefile b/drivers/resctrl/Makefile
index 898199dcf80d..4f6d0e81f9b8 100644
--- a/drivers/resctrl/Makefile
+++ b/drivers/resctrl/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_ARM64_MPAM_DRIVER)		+= mpam.o
 mpam-y					+= mpam_devices.o
+mpam-$(CONFIG_ARM64_MPAM_RESCTRL_FS)	+= mpam_resctrl.o
 
 ccflags-$(CONFIG_ARM64_MPAM_DRIVER_DEBUG)	+= -DDEBUG
diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
index 1eebc2602187..9182c8fcf003 100644
--- a/drivers/resctrl/mpam_devices.c
+++ b/drivers/resctrl/mpam_devices.c
@@ -29,7 +29,15 @@
 
 #include "mpam_internal.h"
 
-DEFINE_STATIC_KEY_FALSE(mpam_enabled);	/* This moves to arch code */
+/* Values for the T241 errata workaround */
+#define T241_CHIPS_MAX			4
+#define T241_CHIP_NSLICES		12
+#define T241_SPARE_REG0_OFF		0x1b0000
+#define T241_SPARE_REG1_OFF		0x1c0000
+#define T241_CHIP_ID(phys)		FIELD_GET(GENMASK_ULL(44, 43), phys)
+#define T241_SHADOW_REG_OFF(sidx, pid)	(0x360048 + (sidx) * 0x10000 + (pid) * 8)
+#define SMCCC_SOC_ID_T241		0x036b0241
+static void __iomem *t241_scratch_regs[T241_CHIPS_MAX];
 
 /*
  * mpam_list_lock protects the SRCU lists when writing. Once the
@@ -75,6 +83,14 @@ static DECLARE_WORK(mpam_broken_work, &mpam_disable);
 /* When mpam is disabled, the printed reason to aid debugging */
 static char *mpam_disable_reason;
 
+/*
+ * Whether resctrl has been set up. Used by cpuhp in preference to
+ * mpam_is_enabled(). The disable call after an error interrupt makes
+ * mpam_is_enabled() false before the cpuhp callbacks are made.
+ * Reads/writes should hold mpam_cpuhp_state_lock (or be cpuhp callbacks).
+ */
+static bool mpam_resctrl_enabled;
+
 /*
  * An MSC is a physical container for controls and monitors, each identified by
  * their RIS index. These share a base-address, interrupts and some MMIO
@@ -624,6 +640,86 @@ static struct mpam_msc_ris *mpam_get_or_create_ris(struct mpam_msc *msc,
 	return ERR_PTR(-ENOENT);
 }
 
+static int mpam_enable_quirk_nvidia_t241_1(struct mpam_msc *msc,
+					   const struct mpam_quirk *quirk)
+{
+	s32 soc_id = arm_smccc_get_soc_id_version();
+	struct resource *r;
+	phys_addr_t phys;
+
+	/*
+	 * A mapping to a device other than the MSC is needed, check
+	 * SOC_ID is the NVIDIA T241 chip (036b:0241).
+	 */
+	if (soc_id < 0 || soc_id != SMCCC_SOC_ID_T241)
+		return -EINVAL;
+
+	r = platform_get_resource(msc->pdev, IORESOURCE_MEM, 0);
+	if (!r)
+		return -EINVAL;
+
+	/* Find the internal registers base addr from the CHIP ID */
+	msc->t241_id = T241_CHIP_ID(r->start);
+	phys = FIELD_PREP(GENMASK_ULL(45, 44), msc->t241_id) | 0x19000000ULL;
+
+	t241_scratch_regs[msc->t241_id] = ioremap(phys, SZ_8M);
+	if (WARN_ON_ONCE(!t241_scratch_regs[msc->t241_id]))
+		return -EINVAL;
+
+	pr_info_once("Enabled workaround for NVIDIA T241 erratum T241-MPAM-1\n");
+
+	return 0;
+}
+
+static const struct mpam_quirk mpam_quirks[] = {
+	{
+		/* NVIDIA T241 erratum T241-MPAM-1 */
+		.init = mpam_enable_quirk_nvidia_t241_1,
+		.iidr = MPAM_IIDR_NVIDIA_T241,
+		.iidr_mask = MPAM_IIDR_MATCH_ONE,
+		.workaround = T241_SCRUB_SHADOW_REGS,
+	},
+	{
+		/* NVIDIA T241 erratum T241-MPAM-4 */
+		.iidr = MPAM_IIDR_NVIDIA_T241,
+		.iidr_mask = MPAM_IIDR_MATCH_ONE,
+		.workaround = T241_FORCE_MBW_MIN_TO_ONE,
+	},
+	{
+		/* NVIDIA T241 erratum T241-MPAM-6 */
+		.iidr = MPAM_IIDR_NVIDIA_T241,
+		.iidr_mask = MPAM_IIDR_MATCH_ONE,
+		.workaround = T241_MBW_COUNTER_SCALE_64,
+	},
+	{
+		/* ARM CMN-650 CSU erratum 3642720 */
+		.iidr = MPAM_IIDR_ARM_CMN_650,
+		.iidr_mask = MPAM_IIDR_MATCH_ONE,
+		.workaround = IGNORE_CSU_NRDY,
+	},
+	{ NULL }	/* Sentinel */
+};
+
+static void mpam_enable_quirks(struct mpam_msc *msc)
+{
+	const struct mpam_quirk *quirk;
+
+	for (quirk = &mpam_quirks[0]; quirk->iidr_mask; quirk++) {
+		int err = 0;
+
+		if (quirk->iidr != (msc->iidr & quirk->iidr_mask))
+			continue;
+
+		if (quirk->init)
+			err = quirk->init(msc, quirk);
+
+		if (err)
+			continue;
+
+		mpam_set_quirk(quirk->workaround, msc);
+	}
+}
+
 /*
  * IHI009A.a has this nugget: "If a monitor does not support automatic behaviour
  * of NRDY, software can use this bit for any purpose" - so hardware might not
@@ -715,6 +811,13 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris)
 			mpam_set_feature(mpam_feat_mbw_part, props);
 
 		props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features);
+
+		/*
+		 * The BWA_WD field can represent 0-63, but the control fields
+		 * it describes have a maximum of 16 bits.
+		 */
+		props->bwa_wd = min(props->bwa_wd, 16);
+
 		if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features))
 			mpam_set_feature(mpam_feat_mbw_max, props);
 
@@ -851,8 +954,11 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc)
 	/* Grab an IDR value to find out how many RIS there are */
 	mutex_lock(&msc->part_sel_lock);
 	idr = mpam_msc_read_idr(msc);
+	msc->iidr = mpam_read_partsel_reg(msc, IIDR);
 	mutex_unlock(&msc->part_sel_lock);
 
+	mpam_enable_quirks(msc);
+
 	msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr);
 
 	/* Use these values so partid/pmg always starts with a valid value */
@@ -903,6 +1009,7 @@ struct mon_read {
 	enum mpam_device_features	type;
 	u64				*val;
 	int				err;
+	bool				waited_timeout;
 };
 
 static bool mpam_ris_has_mbwu_long_counter(struct mpam_msc_ris *ris)
@@ -1052,7 +1159,7 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val,
 	}
 }
 
-static u64 mpam_msmon_overflow_val(enum mpam_device_features type)
+static u64 __mpam_msmon_overflow_val(enum mpam_device_features type)
 {
 	/* TODO: implement scaling counters */
 	switch (type) {
@@ -1067,6 +1174,18 @@ static u64 mpam_msmon_overflow_val(enum mpam_device_features type)
 	}
 }
 
+static u64 mpam_msmon_overflow_val(enum mpam_device_features type,
+				   struct mpam_msc *msc)
+{
+	u64 overflow_val = __mpam_msmon_overflow_val(type);
+
+	if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) &&
+	    type != mpam_feat_msmon_mbwu_63counter)
+		overflow_val *= 64;
+
+	return overflow_val;
+}
+
 static void __ris_msmon_read(void *arg)
 {
 	u64 now;
@@ -1137,6 +1256,10 @@ static void __ris_msmon_read(void *arg)
 		if (mpam_has_feature(mpam_feat_msmon_csu_hw_nrdy, rprops))
 			nrdy = now & MSMON___NRDY;
 		now = FIELD_GET(MSMON___VALUE, now);
+
+		if (mpam_has_quirk(IGNORE_CSU_NRDY, msc) && m->waited_timeout)
+			nrdy = false;
+
 		break;
 	case mpam_feat_msmon_mbwu_31counter:
 	case mpam_feat_msmon_mbwu_44counter:
@@ -1157,13 +1280,17 @@ static void __ris_msmon_read(void *arg)
 			now = FIELD_GET(MSMON___VALUE, now);
 		}
 
+		if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc) &&
+		    m->type != mpam_feat_msmon_mbwu_63counter)
+			now *= 64;
+
 		if (nrdy)
 			break;
 
 		mbwu_state = &ris->mbwu_state[ctx->mon];
 
 		if (overflow)
-			mbwu_state->correction += mpam_msmon_overflow_val(m->type);
+			mbwu_state->correction += mpam_msmon_overflow_val(m->type, msc);
 
 		/*
 		 * Include bandwidth consumed before the last hardware reset and
@@ -1270,6 +1397,7 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx,
 		.ctx = ctx,
 		.type = type,
 		.val = val,
+		.waited_timeout = true,
 	};
 
 	*val = 0;
@@ -1338,6 +1466,69 @@ static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd)
 	__mpam_write_reg(msc, reg, bm);
 }
 
+static void mpam_apply_t241_erratum(struct mpam_msc_ris *ris, u16 partid)
+{
+	int sidx, i, lcount = 1000;
+	void __iomem *regs;
+	u64 val0, val;
+
+	regs = t241_scratch_regs[ris->vmsc->msc->t241_id];
+
+	for (i = 0; i < lcount; i++) {
+		/* Read the shadow register at index 0 */
+		val0 = readq_relaxed(regs + T241_SHADOW_REG_OFF(0, partid));
+
+		/* Check if all the shadow registers have the same value */
+		for (sidx = 1; sidx < T241_CHIP_NSLICES; sidx++) {
+			val = readq_relaxed(regs +
+					    T241_SHADOW_REG_OFF(sidx, partid));
+			if (val != val0)
+				break;
+		}
+		if (sidx == T241_CHIP_NSLICES)
+			break;
+	}
+
+	if (i == lcount)
+		pr_warn_once("t241: inconsistent values in shadow regs");
+
+	/* Write zero to the spare registers so the MBW configuration takes effect */
+	writeq_relaxed(0, regs + T241_SPARE_REG0_OFF);
+	writeq_relaxed(0, regs + T241_SPARE_REG1_OFF);
+}
+
+static void mpam_quirk_post_config_change(struct mpam_msc_ris *ris, u16 partid,
+					  struct mpam_config *cfg)
+{
+	if (mpam_has_quirk(T241_SCRUB_SHADOW_REGS, ris->vmsc->msc))
+		mpam_apply_t241_erratum(ris, partid);
+}
+
+static u16 mpam_wa_t241_force_mbw_min_to_one(struct mpam_props *props)
+{
+	u16 max_hw_value, min_hw_granule, res0_bits;
+
+	res0_bits = 16 - props->bwa_wd;
+	max_hw_value = ((1 << props->bwa_wd) - 1) << res0_bits;
+	min_hw_granule = ~max_hw_value;
+
+	return min_hw_granule + 1;
+}
+
+static u16 mpam_wa_t241_calc_min_from_max(struct mpam_config *cfg)
+{
+	u16 val = 0;
+
+	if (mpam_has_feature(mpam_feat_mbw_max, cfg)) {
+		u16 delta = ((5 * MPAMCFG_MBW_MAX_MAX) / 100) - 1;
+
+		if (cfg->mbw_max > delta)
+			val = cfg->mbw_max - delta;
+	}
+
+	return val;
+}
+
 /* Called via IPI. Call while holding an SRCU reference */
 static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid,
 				      struct mpam_config *cfg)
@@ -1380,9 +1571,19 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid,
 		mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm);
 	}
 
-	if (mpam_has_feature(mpam_feat_mbw_min, rprops) &&
-	    mpam_has_feature(mpam_feat_mbw_min, cfg))
-		mpam_write_partsel_reg(msc, MBW_MIN, 0);
+	if (mpam_has_feature(mpam_feat_mbw_min, rprops)) {
+		u16 val = 0;
+
+		if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, msc)) {
+			u16 min = mpam_wa_t241_force_mbw_min_to_one(rprops);
+
+			val = mpam_wa_t241_calc_min_from_max(cfg);
+			if (val < min)
+				val = min;
+		}
+
+		mpam_write_partsel_reg(msc, MBW_MIN, val);
+	}
 
 	if (mpam_has_feature(mpam_feat_mbw_max, rprops) &&
 	    mpam_has_feature(mpam_feat_mbw_max, cfg)) {
@@ -1421,6 +1622,8 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid,
 		mpam_write_partsel_reg(msc, PRI, pri_val);
 	}
 
+	mpam_quirk_post_config_change(ris, partid, cfg);
+
 	mutex_unlock(&msc->part_sel_lock);
 }
 
@@ -1630,6 +1833,9 @@ static int mpam_cpu_online(unsigned int cpu)
 			mpam_reprogram_msc(msc);
 	}
 
+	if (mpam_resctrl_enabled)
+		return mpam_resctrl_online_cpu(cpu);
+
 	return 0;
 }
 
@@ -1673,6 +1879,9 @@ static int mpam_cpu_offline(unsigned int cpu)
 {
 	struct mpam_msc *msc;
 
+	if (mpam_resctrl_enabled)
+		mpam_resctrl_offline_cpu(cpu);
+
 	guard(srcu)(&mpam_srcu);
 	list_for_each_entry_srcu(msc, &mpam_all_msc, all_msc_list,
 				 srcu_read_lock_held(&mpam_srcu)) {
@@ -1969,6 +2178,7 @@ static bool mpam_has_cmax_wd_feature(struct mpam_props *props)
  * resulting safe value must be compatible with both. When merging values in
  * the tree, all the aliasing resources must be handled first.
  * On mismatch, parent is modified.
+ * Quirks on an MSC will apply to all MSC in that class.
  */
 static void __props_mismatch(struct mpam_props *parent,
			      struct mpam_props *child, bool alias)
@@ -2088,6 +2298,7 @@ static void __props_mismatch(struct mpam_props *parent,
  * nobble the class feature, as we can't configure all the resources.
  * e.g. The L3 cache is composed of two resources with 13 and 17 portion
  * bitmaps respectively.
+ * Quirks on an MSC will apply to all MSC in that class.
  */
 static void
 __class_props_mismatch(struct mpam_class *class, struct mpam_vmsc *vmsc)
 {
@@ -2101,6 +2312,9 @@ __class_props_mismatch(struct mpam_class *class, struct mpam_vmsc *vmsc)
 	dev_dbg(dev, "Merging features for class:0x%lx &= vmsc:0x%lx\n",
 		(long)cprops->features, (long)vprops->features);
 
+	/* Merge quirks */
+	class->quirks |= vmsc->msc->quirks;
+
 	/* Take the safe value for any common features */
 	__props_mismatch(cprops, vprops, false);
 }
@@ -2165,6 +2379,9 @@ static void mpam_enable_merge_class_features(struct mpam_component *comp)
 
 	list_for_each_entry(vmsc, &comp->vmsc, comp_list)
 		__class_props_mismatch(class, vmsc);
+
+	if (mpam_has_quirk(T241_FORCE_MBW_MIN_TO_ONE, class))
+		mpam_clear_feature(mpam_feat_mbw_min, &class->props);
 }
 
 /*
@@ -2518,6 +2735,12 @@ static void mpam_enable_once(void)
 	mutex_unlock(&mpam_list_lock);
 	cpus_read_unlock();
 
+	if (!err) {
+		err = mpam_resctrl_setup();
+		if (err)
+			pr_err("Failed to initialise resctrl: %d\n", err);
+	}
+
 	if (err) {
 		mpam_disable_reason = "Failed to enable.";
 		schedule_work(&mpam_broken_work);
@@ -2525,6 +2748,7 @@ static void mpam_enable_once(void)
 	}
 
 	static_branch_enable(&mpam_enabled);
+	mpam_resctrl_enabled = true;
 
 	mpam_register_cpuhp_callbacks(mpam_cpu_online, mpam_cpu_offline,
 				      "mpam:online");
@@ -2557,7 +2781,7 @@ static void mpam_reset_component_locked(struct mpam_component *comp)
 	}
 }
 
-static void mpam_reset_class_locked(struct mpam_class *class)
+void mpam_reset_class_locked(struct mpam_class *class)
 {
 	struct mpam_component *comp;
 
@@ -2584,24 +2808,39 @@ static void mpam_reset_class(struct mpam_class *class)
 void mpam_disable(struct work_struct *ignored)
 {
 	int idx;
+	bool do_resctrl_exit;
 	struct mpam_class *class;
 	struct mpam_msc *msc, *tmp;
 
+	if (mpam_is_enabled())
+		static_branch_disable(&mpam_enabled);
+
 	mutex_lock(&mpam_cpuhp_state_lock);
 	if (mpam_cpuhp_state) {
 		cpuhp_remove_state(mpam_cpuhp_state);
 		mpam_cpuhp_state = 0;
 	}
+
+	/*
+	 * Removing the cpuhp state called mpam_cpu_offline() and told resctrl
+	 * all the CPUs are offline.
+	 */
+	do_resctrl_exit = mpam_resctrl_enabled;
+	mpam_resctrl_enabled = false;
 	mutex_unlock(&mpam_cpuhp_state_lock);
 
-	static_branch_disable(&mpam_enabled);
+	if (do_resctrl_exit)
+		mpam_resctrl_exit();
 
 	mpam_unregister_irqs();
 
 	idx = srcu_read_lock(&mpam_srcu);
 	list_for_each_entry_srcu(class, &mpam_classes, classes_list,
-				 srcu_read_lock_held(&mpam_srcu))
+				 srcu_read_lock_held(&mpam_srcu)) {
 		mpam_reset_class(class);
+		if (do_resctrl_exit)
+			mpam_resctrl_teardown_class(class);
+	}
 	srcu_read_unlock(&mpam_srcu, idx);
 
 	mutex_lock(&mpam_list_lock);
diff --git a/drivers/resctrl/mpam_internal.h b/drivers/resctrl/mpam_internal.h
index e8971842b124..195ab821cc52 100644
--- a/drivers/resctrl/mpam_internal.h
+++ b/drivers/resctrl/mpam_internal.h
@@ -12,22 +12,31 @@
 #include <linux/jump_label.h>
 #include <linux/llist.h>
 #include <linux/mutex.h>
+#include <linux/resctrl.h>
 #include <linux/spinlock.h>
 #include <linux/srcu.h>
 #include <linux/types.h>
 
+#include <asm/mpam.h>
+
 #define MPAM_MSC_MAX_NUM_RIS	16
 
 struct platform_device;
 
-DECLARE_STATIC_KEY_FALSE(mpam_enabled);
-
 #ifdef CONFIG_MPAM_KUNIT_TEST
 #define PACKED_FOR_KUNIT __packed
 #else
 #define PACKED_FOR_KUNIT
 #endif
 
+/*
+ * This 'mon' value must not alias an actual monitor, so it must be larger
+ * than U16_MAX, but must not be confused with an errno value, so it must be
+ * smaller than (u32)-SZ_4K.
+ * USE_PRE_ALLOCATED is used to avoid confusion with an actual monitor.
+ */
+#define USE_PRE_ALLOCATED	(U16_MAX + 1)
+
 static inline bool mpam_is_enabled(void)
 {
 	return static_branch_likely(&mpam_enabled);
@@ -76,6 +85,8 @@ struct mpam_msc {
 	u8			pmg_max;
 	unsigned long		ris_idxs;
 	u32			ris_max;
+	u32			iidr;
+	u16			quirks;
 
 	/*
 	 * error_irq_lock is taken when registering/unregistering the error
@@ -119,6 +130,9 @@ struct mpam_msc {
 	void __iomem		*mapped_hwpage;
 	size_t			mapped_hwpage_sz;
 
+	/* Values only used on some platforms for quirks */
+	u32			t241_id;
+
 	struct mpam_garbage	garbage;
 };
@@ -207,6 +221,42 @@ struct mpam_props {
 #define mpam_set_feature(_feat, x)	__set_bit(_feat, (x)->features)
 #define mpam_clear_feature(_feat, x)	__clear_bit(_feat, (x)->features)
 
+/* Workaround bits for msc->quirks */
+enum mpam_device_quirks {
+	T241_SCRUB_SHADOW_REGS,
+	T241_FORCE_MBW_MIN_TO_ONE,
+	T241_MBW_COUNTER_SCALE_64,
+	IGNORE_CSU_NRDY,
+	MPAM_QUIRK_LAST
+};
+
+#define mpam_has_quirk(_quirk, x)	((1 << (_quirk)) & (x)->quirks)
+#define mpam_set_quirk(_quirk, x)	((x)->quirks |= (1 << (_quirk)))
+
+struct mpam_quirk {
+	int (*init)(struct mpam_msc *msc, const struct mpam_quirk *quirk);
+
+	u32 iidr;
+	u32 iidr_mask;
+
+	enum mpam_device_quirks workaround;
+};
+
+#define MPAM_IIDR_MATCH_ONE	(FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0xfff) | \
+				 FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0xf) |	 \
+				 FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0xf) |	 \
+				 FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0xfff))
+
+#define MPAM_IIDR_NVIDIA_T241	(FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0x241) | \
+				 FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) |	 \
+				 FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) |	 \
+				 FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x36b))
+
+#define MPAM_IIDR_ARM_CMN_650	(FIELD_PREP_CONST(MPAMF_IIDR_PRODUCTID, 0) | \
+				 FIELD_PREP_CONST(MPAMF_IIDR_VARIANT, 0) |   \
+				 FIELD_PREP_CONST(MPAMF_IIDR_REVISION, 0) |  \
+				 FIELD_PREP_CONST(MPAMF_IIDR_IMPLEMENTER, 0x43b))
+
 /* The values for MSMON_CFG_MBWU_FLT.RWBW */
 enum mon_filter_options {
 	COUNT_BOTH	= 0,
@@ -215,7 +265,11 @@ enum mon_filter_options {
 };
 
 struct mon_cfg {
-	u16			mon;
+	/*
+	 * mon must be large enough to hold out-of-range values like
+	 * USE_PRE_ALLOCATED.
+	 */
+	u32			mon;
 	u8			pmg;
 	bool			match_pmg;
 	bool			csu_exclude_clean;
@@ -246,6 +300,7 @@ struct mpam_class {
 	struct mpam_props	props;
 	u32			nrdy_usec;
+	u16			quirks;
 	u8			level;
 	enum mpam_class_types	type;
 
@@ -337,6 +392,33 @@ struct mpam_msc_ris {
 	struct mpam_garbage	garbage;
 };
 
+struct mpam_resctrl_dom {
+	struct mpam_component		*ctrl_comp;
+
+	/*
+	 * There is no single mon_comp because different events may be backed
+	 * by different class/components. mon_comp is indexed by the event
+	 * number.
+	 */
+	struct mpam_component		*mon_comp[QOS_NUM_EVENTS];
+
+	struct rdt_ctrl_domain		resctrl_ctrl_dom;
+	struct rdt_l3_mon_domain	resctrl_mon_dom;
+};
+
+struct mpam_resctrl_res {
+	struct mpam_class	*class;
+	struct rdt_resource	resctrl_res;
+	bool			cdp_enabled;
+};
+
+struct mpam_resctrl_mon {
+	struct mpam_class	*class;
+
+	/* Array of allocated MBWU monitors, indexed by (closid, rmid). */
+	int			*mbwu_idx_to_mon;
+};
+
 static inline int mpam_alloc_csu_mon(struct mpam_class *class)
 {
 	struct mpam_props *cprops = &class->props;
@@ -381,6 +463,9 @@ extern u8 mpam_pmg_max;
 void mpam_enable(struct work_struct *work);
 void mpam_disable(struct work_struct *work);
 
+/* Reset all the RIS in a class under cpus_read_lock() */
+void mpam_reset_class_locked(struct mpam_class *class);
+
 int mpam_apply_config(struct mpam_component *comp, u16 partid,
 		      struct mpam_config *cfg);
 
@@ -391,6 +476,20 @@ void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx);
 int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level,
 				   cpumask_t *affinity);
 
+#ifdef CONFIG_RESCTRL_FS
+int mpam_resctrl_setup(void);
+void mpam_resctrl_exit(void);
+int mpam_resctrl_online_cpu(unsigned int cpu);
+void mpam_resctrl_offline_cpu(unsigned int cpu);
+void mpam_resctrl_teardown_class(struct mpam_class *class);
+#else
+static inline int mpam_resctrl_setup(void) { return 0; }
+static inline void mpam_resctrl_exit(void) { }
+static inline int mpam_resctrl_online_cpu(unsigned int cpu) { return 0; }
+static inline void mpam_resctrl_offline_cpu(unsigned int cpu) { }
+static inline void mpam_resctrl_teardown_class(struct mpam_class *class) { }
+#endif /* CONFIG_RESCTRL_FS */
+
 /*
  * MPAM MSCs have the following register layout. See:
  * Arm Memory System Resource Partitioning and Monitoring (MPAM) System
diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c
new file mode 100644
index 000000000000..19b306017845
--- /dev/null
+++ b/drivers/resctrl/mpam_resctrl.c
@@ -0,0 +1,1877 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2025 Arm Ltd.
+
+#define pr_fmt(fmt) "%s:%s: " fmt, KBUILD_MODNAME, __func__
+
+#include <linux/arm_mpam.h>
+#include <linux/cacheinfo.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/errno.h>
+#include <linux/limits.h>
+#include <linux/list.h>
+#include <linux/math.h>
+#include <linux/printk.h>
+#include <linux/rculist.h>
+#include <linux/resctrl.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+
+#include <asm/mpam.h>
+
+#include "mpam_internal.h"
+
+DECLARE_WAIT_QUEUE_HEAD(resctrl_mon_ctx_waiters);
+
+/*
+ * The classes we've picked to map to resctrl resources, wrapped
+ * in with their resctrl structure.
+ * Class pointer may be NULL.
+ */
+static struct mpam_resctrl_res mpam_resctrl_controls[RDT_NUM_RESOURCES];
+
+#define for_each_mpam_resctrl_control(res, rid)			\
+	for (rid = 0, res = &mpam_resctrl_controls[rid];	\
+	     rid < RDT_NUM_RESOURCES;				\
+	     rid++, res = &mpam_resctrl_controls[rid])
+
+/*
+ * The classes we've picked to map to resctrl events.
+ * Resctrl believes all the world's a Xeon, and these are all on the L3. This
+ * array lets us find the actual class backing the event counters. e.g.
+ * the only memory bandwidth counters may be on the memory controller, but to
+ * make use of them, we pretend they are on L3. Restrict the events considered
+ * to those supported by MPAM.
+ * Class pointer may be NULL.
+ */
+#define MPAM_MAX_EVENT	QOS_L3_MBM_TOTAL_EVENT_ID
+static struct mpam_resctrl_mon mpam_resctrl_counters[MPAM_MAX_EVENT + 1];
+
+#define for_each_mpam_resctrl_mon(mon, eventid)				       \
+	for (eventid = QOS_FIRST_EVENT, mon = &mpam_resctrl_counters[eventid]; \
+	     eventid <= MPAM_MAX_EVENT;					       \
+	     eventid++, mon = &mpam_resctrl_counters[eventid])
+
+/* The lock for modifying resctrl's domain lists from cpuhp callbacks. */
*/ +static DEFINE_MUTEX(domain_list_lock); + +/* + * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM0_EL1. + * This applies globally to all traffic the CPU generates. + */ +static bool cdp_enabled; + +/* + * We use cacheinfo to discover the size of the caches and their id. cacheinfo + * populates this from a device_initcall(). mpam_resctrl_setup() must wait. + */ +static bool cacheinfo_ready; +static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready); + +/* + * If resctrl_init() succeeded, resctrl_exit() can be used to remove support + * for the filesystem in the event of an error. + */ +static bool resctrl_enabled; + +/* Whether this num_mbw_mon could result in a free_running system */ +static int __mpam_monitors_free_running(u16 num_mbwu_mon) +{ + if (num_mbwu_mon >= resctrl_arch_system_num_rmid_idx()) + return resctrl_arch_system_num_rmid_idx(); + return 0; +} + +bool resctrl_arch_alloc_capable(void) +{ + struct mpam_resctrl_res *res; + enum resctrl_res_level rid; + + for_each_mpam_resctrl_control(res, rid) { + if (res->resctrl_res.alloc_capable) + return true; + } + + return false; +} + +bool resctrl_arch_mon_capable(void) +{ + struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3]; + struct rdt_resource *l3 = &res->resctrl_res; + + /* All monitors are presented as being on the L3 cache */ + return l3->mon_capable; +} + +bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) +{ + return false; +} + +void resctrl_arch_mon_event_config_read(void *info) +{ +} + +void resctrl_arch_mon_event_config_write(void *info) +{ +} + +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_l3_mon_domain *d) +{ +} + +void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d, + u32 closid, u32 rmid, int cntr_id, + enum resctrl_event_id eventid) +{ +} + +void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_l3_mon_domain *d, + enum resctrl_event_id evtid, u32 rmid, u32 closid, + u32 cntr_id, bool assign) +{ +} + +int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_l3_mon_domain *d, + u32 unused, u32 rmid, int cntr_id, + enum resctrl_event_id eventid, u64 *val) +{ + return -EOPNOTSUPP; +} + +bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r) +{ + return false; +} + +int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable) +{ + return -EINVAL; +} + +int resctrl_arch_io_alloc_enable(struct rdt_resource *r, bool enable) +{ + return -EOPNOTSUPP; +} + +bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r) +{ + return false; +} + +void resctrl_arch_pre_mount(void) +{ +} + +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) +{ + return mpam_resctrl_controls[rid].cdp_enabled; +} + +/** + * resctrl_reset_task_closids() - Reset the PARTID/PMG values for all tasks. + * + * At boot, all existing tasks use partid zero for D and I. + * To enable/disable CDP emulation, all these tasks need relabelling. 
+ */ +static void resctrl_reset_task_closids(void) +{ + struct task_struct *p, *t; + + read_lock(&tasklist_lock); + for_each_process_thread(p, t) { + resctrl_arch_set_closid_rmid(t, RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID); + } + read_unlock(&tasklist_lock); +} + +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level rid, bool enable) +{ + u32 partid_i = RESCTRL_RESERVED_CLOSID, partid_d = RESCTRL_RESERVED_CLOSID; + int cpu; + + /* This resctrl hook is only called with enable set to false on error */ + cdp_enabled = enable; + mpam_resctrl_controls[rid].cdp_enabled = enable; + + /* The mbw_max feature can't hide cdp as it's a per-partid maximum. */ + if (cdp_enabled && !mpam_resctrl_controls[RDT_RESOURCE_MBA].cdp_enabled) + mpam_resctrl_controls[RDT_RESOURCE_MBA].resctrl_res.alloc_capable = false; + + if (mpam_resctrl_controls[RDT_RESOURCE_MBA].cdp_enabled && + mpam_resctrl_controls[RDT_RESOURCE_MBA].class) + mpam_resctrl_controls[RDT_RESOURCE_MBA].resctrl_res.alloc_capable = true; + + if (enable) { + if (mpam_partid_max < 1) + return -EINVAL; + + partid_d = resctrl_get_config_index(RESCTRL_RESERVED_CLOSID, CDP_DATA); + partid_i = resctrl_get_config_index(RESCTRL_RESERVED_CLOSID, CDP_CODE); + } + + mpam_set_task_partid_pmg(current, partid_d, partid_i, 0, 0); + WRITE_ONCE(arm64_mpam_global_default, mpam_get_regval(current)); + + resctrl_reset_task_closids(); + + for_each_possible_cpu(cpu) + mpam_set_cpu_defaults(cpu, partid_d, partid_i, 0, 0); + on_each_cpu(resctrl_arch_sync_cpu_closid_rmid, NULL, 1); + + return 0; +} + +static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) +{ + return cdp_enabled && !resctrl_arch_get_cdp_enabled(rid); +} + +/* + * MSC may raise an error interrupt if it sees an out or range partid/pmg, + * and go on to truncate the value. Regardless of what the hardware supports, + * only the system wide safe value is safe to use. + */ +u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) +{ + return mpam_partid_max + 1; +} + +u32 resctrl_arch_system_num_rmid_idx(void) +{ + return (mpam_pmg_max + 1) * (mpam_partid_max + 1); +} + +u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) +{ + return closid * (mpam_pmg_max + 1) + rmid; +} + +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) +{ + *closid = idx / (mpam_pmg_max + 1); + *rmid = idx % (mpam_pmg_max + 1); +} + +void resctrl_arch_sched_in(struct task_struct *tsk) +{ + lockdep_assert_preemption_disabled(); + + mpam_thread_switch(tsk); +} + +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid) +{ + WARN_ON_ONCE(closid > U16_MAX); + WARN_ON_ONCE(rmid > U8_MAX); + + if (!cdp_enabled) { + mpam_set_cpu_defaults(cpu, closid, closid, rmid, rmid); + } else { + /* + * When CDP is enabled, resctrl halves the closid range and we + * use odd/even partid for one closid. 
+ */ + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_cpu_defaults(cpu, partid_d, partid_i, rmid, rmid); + } +} + +void resctrl_arch_sync_cpu_closid_rmid(void *info) +{ + struct resctrl_cpu_defaults *r = info; + + lockdep_assert_preemption_disabled(); + + if (r) { + resctrl_arch_set_cpu_default_closid_rmid(smp_processor_id(), + r->closid, r->rmid); + } + + resctrl_arch_sched_in(current); +} + +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + WARN_ON_ONCE(closid > U16_MAX); + WARN_ON_ONCE(rmid > U8_MAX); + + if (!cdp_enabled) { + mpam_set_task_partid_pmg(tsk, closid, closid, rmid, rmid); + } else { + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_task_partid_pmg(tsk, partid_d, partid_i, rmid, rmid); + } +} + +bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return tsk_closid == closid; +} + +/* The task's pmg is not unique, the partid must be considered too */ +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); + u32 tsk_rmid = FIELD_GET(MPAM0_EL1_PMG_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return (tsk_closid == closid) && (tsk_rmid == rmid); +} + +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) +{ + if (l >= RDT_NUM_RESOURCES) + return NULL; + + return &mpam_resctrl_controls[l].resctrl_res; +} + +static int resctrl_arch_mon_ctx_alloc_no_wait(enum resctrl_event_id evtid) +{ + struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[evtid]; + + if (!mpam_is_enabled()) + return -EINVAL; + + if (!mon->class) + return -EINVAL; + + switch (evtid) { + case QOS_L3_OCCUP_EVENT_ID: + /* With CDP, one monitor gets used for both code/data reads */ + return mpam_alloc_csu_mon(mon->class); + case QOS_L3_MBM_LOCAL_EVENT_ID: + case QOS_L3_MBM_TOTAL_EVENT_ID: + return USE_PRE_ALLOCATED; + default: + return -EOPNOTSUPP; + } +} + +void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, + enum resctrl_event_id evtid) +{ + DEFINE_WAIT(wait); + int *ret; + + ret = kmalloc(sizeof(*ret), GFP_KERNEL); + if (!ret) + return ERR_PTR(-ENOMEM); + + do { + prepare_to_wait(&resctrl_mon_ctx_waiters, &wait, + TASK_INTERRUPTIBLE); + *ret = resctrl_arch_mon_ctx_alloc_no_wait(evtid); + if (*ret == -ENOSPC) + schedule(); + } while (*ret == -ENOSPC && !signal_pending(current)); + finish_wait(&resctrl_mon_ctx_waiters, &wait); + + return ret; +} + +static void resctrl_arch_mon_ctx_free_no_wait(enum resctrl_event_id evtid, + u32 mon_idx) +{ + struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[evtid]; + + if (!mpam_is_enabled()) + return; + + if (!mon->class) + return; + + if (evtid == QOS_L3_OCCUP_EVENT_ID) + mpam_free_csu_mon(mon->class, mon_idx); + + wake_up(&resctrl_mon_ctx_waiters); +} + +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, + enum resctrl_event_id evtid, void *arch_mon_ctx) +{ + u32 mon_idx = *(u32 *)arch_mon_ctx; + + kfree(arch_mon_ctx); + + resctrl_arch_mon_ctx_free_no_wait(evtid, mon_idx); +} + +static int __read_mon(struct mpam_resctrl_mon *mon, struct mpam_component *mon_comp, + enum mpam_device_features mon_type, + int mon_idx, + enum resctrl_conf_type cdp_type, u32 closid, 
u32 rmid, u64 *val)
+{
+	struct mon_cfg cfg;
+
+	if (!mpam_is_enabled())
+		return -EINVAL;
+
+	/* Shift closid to account for CDP */
+	closid = resctrl_get_config_index(closid, cdp_type);
+
+	if (mon_idx == USE_PRE_ALLOCATED) {
+		int mbwu_idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+
+		mon_idx = mon->mbwu_idx_to_mon[mbwu_idx];
+		if (mon_idx == -1)
+			return -EINVAL;
+	}
+
+	if (irqs_disabled()) {
+		/*
+		 * Reading this domain may need an IPI, which can't be
+		 * sent with interrupts disabled.
+		 */
+		return -EIO;
+	}
+
+	cfg = (struct mon_cfg) {
+		.mon = mon_idx,
+		.match_pmg = true,
+		.partid = closid,
+		.pmg = rmid,
+	};
+
+	return mpam_msmon_read(mon_comp, &cfg, mon_type, val);
+}
+
+static int read_mon_cdp_safe(struct mpam_resctrl_mon *mon, struct mpam_component *mon_comp,
+			     enum mpam_device_features mon_type,
+			     int mon_idx, u32 closid, u32 rmid, u64 *val)
+{
+	if (cdp_enabled) {
+		u64 code_val = 0, data_val = 0;
+		int err;
+
+		err = __read_mon(mon, mon_comp, mon_type, mon_idx,
+				 CDP_CODE, closid, rmid, &code_val);
+		if (err)
+			return err;
+
+		err = __read_mon(mon, mon_comp, mon_type, mon_idx,
+				 CDP_DATA, closid, rmid, &data_val);
+		if (err)
+			return err;
+
+		*val += code_val + data_val;
+		return 0;
+	}
+
+	return __read_mon(mon, mon_comp, mon_type, mon_idx,
+			  CDP_NONE, closid, rmid, val);
+}
+
+/* MBWU when not in ABMC mode, and CSU counters. */
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain_hdr *hdr,
+			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
+			   void *arch_priv, u64 *val, void *arch_mon_ctx)
+{
+	struct mpam_resctrl_dom *l3_dom;
+	struct mpam_component *mon_comp;
+	u32 mon_idx = *(u32 *)arch_mon_ctx;
+	enum mpam_device_features mon_type;
+	struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[eventid];
+
+	resctrl_arch_rmid_read_context_check();
+
+	if (!mpam_is_enabled())
+		return -EINVAL;
+
+	if (eventid >= QOS_NUM_EVENTS || !mon->class)
+		return -EINVAL;
+
+	l3_dom = container_of(hdr, struct mpam_resctrl_dom, resctrl_mon_dom.hdr);
+	mon_comp = l3_dom->mon_comp[eventid];
+
+	switch (eventid) {
+	case QOS_L3_OCCUP_EVENT_ID:
+		mon_type = mpam_feat_msmon_csu;
+		break;
+	case QOS_L3_MBM_LOCAL_EVENT_ID:
+	case QOS_L3_MBM_TOTAL_EVENT_ID:
+		mon_type = mpam_feat_msmon_mbwu;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return read_mon_cdp_safe(mon, mon_comp, mon_type, mon_idx,
+				 closid, rmid, val);
+}
+
+static void __reset_mon(struct mpam_resctrl_mon *mon, struct mpam_component *mon_comp,
+			int mon_idx,
+			enum resctrl_conf_type cdp_type, u32 closid, u32 rmid)
+{
+	struct mon_cfg cfg = { };
+
+	if (!mpam_is_enabled())
+		return;
+
+	/* Shift closid to account for CDP */
+	closid = resctrl_get_config_index(closid, cdp_type);
+
+	if (mon_idx == USE_PRE_ALLOCATED) {
+		int mbwu_idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+
+		mon_idx = mon->mbwu_idx_to_mon[mbwu_idx];
+	}
+
+	if (mon_idx == -1)
+		return;
+
+	cfg.mon = mon_idx;
+	mpam_msmon_reset_mbwu(mon_comp, &cfg);
+}
+
+static void reset_mon_cdp_safe(struct mpam_resctrl_mon *mon, struct mpam_component *mon_comp,
+			       int mon_idx, u32 closid, u32 rmid)
+{
+	if (cdp_enabled) {
+		__reset_mon(mon, mon_comp, mon_idx, CDP_CODE, closid, rmid);
+		__reset_mon(mon, mon_comp, mon_idx, CDP_DATA, closid, rmid);
+	} else {
+		__reset_mon(mon, mon_comp, mon_idx, CDP_NONE, closid, rmid);
+	}
+}
+
+/* Called via IPI. Call with cpus_read_lock() held.
*/ +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_l3_mon_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid) +{ + struct mpam_resctrl_dom *l3_dom; + struct mpam_component *mon_comp; + struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[eventid]; + + if (!mpam_is_enabled()) + return; + + /* Only MBWU counters are relevant, and for supported event types. */ + if (eventid == QOS_L3_OCCUP_EVENT_ID || !mon->class) + return; + + l3_dom = container_of(d, struct mpam_resctrl_dom, resctrl_mon_dom); + mon_comp = l3_dom->mon_comp[eventid]; + + reset_mon_cdp_safe(mon, mon_comp, USE_PRE_ALLOCATED, closid, rmid); +} + +/* + * The rmid realloc threshold should be for the smallest cache exposed to + * resctrl. + */ +static int update_rmid_limits(struct mpam_class *class) +{ + u32 num_unique_pmg = resctrl_arch_system_num_rmid_idx(); + struct mpam_props *cprops = &class->props; + struct cacheinfo *ci; + + lockdep_assert_cpus_held(); + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return 0; + + /* + * Assume cache levels are the same size for all CPUs... + * The check just requires any online CPU and it can't go offline as we + * hold the cpu lock. + */ + ci = get_cpu_cacheinfo_level(raw_smp_processor_id(), class->level); + if (!ci || ci->size == 0) { + pr_debug("Could not read cache size for class %u\n", + class->level); + return -EINVAL; + } + + if (!resctrl_rmid_realloc_limit || + ci->size < resctrl_rmid_realloc_limit) { + resctrl_rmid_realloc_limit = ci->size; + resctrl_rmid_realloc_threshold = ci->size / num_unique_pmg; + } + + return 0; +} + +static bool cache_has_usable_cpor(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_cpor_part, cprops)) + return false; + + /* resctrl uses u32 for all bitmap configurations */ + return class->props.cpbm_wd <= 32; +} + +static bool mba_class_use_mbw_max(struct mpam_props *cprops) +{ + return (mpam_has_feature(mpam_feat_mbw_max, cprops) && + cprops->bwa_wd); +} + +static bool class_has_usable_mba(struct mpam_props *cprops) +{ + return mba_class_use_mbw_max(cprops); +} + +static bool cache_has_usable_csu(struct mpam_class *class) +{ + struct mpam_props *cprops; + + if (!class) + return false; + + cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return false; + + /* + * CSU counters settle on the value, so we can get away with + * having only one. + */ + if (!cprops->num_csu_mon) + return false; + + return true; +} + +static bool class_has_usable_mbwu(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops)) + return false; + + /* + * resctrl expects the bandwidth counters to be free running, + * which means we need as many monitors as resctrl has + * control/monitor groups. + */ + if (__mpam_monitors_free_running(cprops->num_mbwu_mon)) { + pr_debug("monitors usable in free-running mode\n"); + return true; + } + + pr_debug("Insufficient monitors for free-running mode\n"); + + return false; +} + +/* + * Calculate the worst-case percentage change from each implemented step + * in the control. + */ +static u32 get_mba_granularity(struct mpam_props *cprops) +{ + if (!mba_class_use_mbw_max(cprops)) + return 0; + + /* + * bwa_wd is the number of bits implemented in the 0.xxx + * fixed point fraction. 1 bit is 50%, 2 is 25% etc. 
+ */
+	return DIV_ROUND_UP(MAX_MBA_BW, 1 << cprops->bwa_wd);
+}
+
+/*
+ * Each fixed-point hardware value architecturally represents a range
+ * of values: the full range 0% - 100% is split contiguously into
+ * (1 << cprops->bwa_wd) equal bands.
+ *
+ * Although the bwa_wd field has 6 bits, the maximum valid value is 16
+ * as it reports the width of fields that are at most 16 bits. When
+ * fewer than 16 bits are valid the least significant bits are
+ * ignored. The implied binary point is kept between bits 15 and 16 and
+ * so the valid bits are leftmost.
+ *
+ * See ARM IHI0099B.a "MPAM system component specification", Section 9.3,
+ * "The fixed-point fractional format" for more information.
+ *
+ * Find the nearest percentage value to the upper bound of the selected band:
+ */
+static u32 mbw_max_to_percent(u16 mbw_max, struct mpam_props *cprops)
+{
+	u32 val = mbw_max;
+
+	val >>= 16 - cprops->bwa_wd;
+	val += 1;
+	val *= MAX_MBA_BW;
+	val = DIV_ROUND_CLOSEST(val, 1 << cprops->bwa_wd);
+
+	return val;
+}
+
+/*
+ * Find the band whose upper bound is closest to the specified percentage.
+ *
+ * A round-to-nearest policy is followed here as a balanced compromise
+ * between unexpected under-commit of the resource (where the total of
+ * a set of resource allocations after conversion is less than the
+ * expected total, due to rounding of the individual converted
+ * percentages) and over-commit (where the total of the converted
+ * allocations is greater than expected).
+ */
+static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops)
+{
+	u32 val = pc;
+
+	val <<= cprops->bwa_wd;
+	val = DIV_ROUND_CLOSEST(val, MAX_MBA_BW);
+	val = max(val, 1) - 1;
+	val <<= 16 - cprops->bwa_wd;
+
+	return val;
+}
+
+static u32 get_mba_min(struct mpam_props *cprops)
+{
+	if (!mba_class_use_mbw_max(cprops)) {
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+
+	return mbw_max_to_percent(0, cprops);
+}
+
+/* Find the L3 cache that has affinity with this CPU */
+static int find_l3_equivalent_bitmask(int cpu, cpumask_var_t tmp_cpumask)
+{
+	u32 cache_id = get_cpu_cacheinfo_id(cpu, 3);
+
+	lockdep_assert_cpus_held();
+
+	return mpam_get_cpumask_from_cache_id(cache_id, 3, tmp_cpumask);
+}
+
+/*
+ * topology_matches_l3() - Is the provided class the same shape as L3
+ * @victim: The class we'd like to pretend is L3.
+ *
+ * resctrl expects all the world's a Xeon, and all counters are on the
+ * L3. We allow mapping some counters on other classes. This requires
+ * that the CPU->domain mapping is the same kind of shape.
+ *
+ * Using cacheinfo directly would make this work even if resctrl can't
+ * use the L3 - but cacheinfo can't tell us anything about offline CPUs.
+ * Using the L3 resctrl domain list also depends on CPUs being online.
+ * Using the mpam_class we picked for L3 so we can use its domain list
+ * assumes that there are MPAM controls on the L3.
+ * Instead, this path eventually uses the mpam_get_cpumask_from_cache_id()
+ * helper which can tell us about offline CPUs ... but getting the cache_id
+ * to start with relies on at least one CPU per L3 cache being online at
+ * boot.
+ *
+ * Walk the victim component list and compare the affinity mask with the
+ * corresponding L3. The topology matches if each victim:component's affinity
+ * mask is the same as the CPU's corresponding L3's. These lists/masks are
+ * computed from firmware tables so don't change at runtime.
+ */
+static bool topology_matches_l3(struct mpam_class *victim)
+{
+	int cpu, err;
+	struct mpam_component *victim_iter;
+
+	lockdep_assert_cpus_held();
+
+	cpumask_var_t __free(free_cpumask_var) tmp_cpumask = CPUMASK_VAR_NULL;
+	if (!alloc_cpumask_var(&tmp_cpumask, GFP_KERNEL))
+		return false;
+
+	guard(srcu)(&mpam_srcu);
+	list_for_each_entry_srcu(victim_iter, &victim->components, class_list,
+				 srcu_read_lock_held(&mpam_srcu)) {
+		if (cpumask_empty(&victim_iter->affinity)) {
+			pr_debug("class %u has CPU-less component %u - can't match L3!\n",
+				 victim->level, victim_iter->comp_id);
+			return false;
+		}
+
+		cpu = cpumask_any_and(&victim_iter->affinity, cpu_online_mask);
+		if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
+			return false;
+
+		cpumask_clear(tmp_cpumask);
+		err = find_l3_equivalent_bitmask(cpu, tmp_cpumask);
+		if (err) {
+			pr_debug("Failed to find L3's equivalent component to class %u component %u\n",
+				 victim->level, victim_iter->comp_id);
+			return false;
+		}
+
+		/* Any differing bits in the affinity mask? */
+		if (!cpumask_equal(tmp_cpumask, &victim_iter->affinity)) {
+			pr_debug("class %u component %u has mismatched CPU mask with L3 equivalent\n"
+				 "L3:%*pbl != victim:%*pbl\n",
+				 victim->level, victim_iter->comp_id,
+				 cpumask_pr_args(tmp_cpumask),
+				 cpumask_pr_args(&victim_iter->affinity));
+
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Test if the traffic for a class matches that at egress from the L3. For
+ * MSC at memory controllers this is only possible if there is a single L3
+ * as otherwise the counters at the memory can include bandwidth from the
+ * non-local L3.
+ */
+static bool traffic_matches_l3(struct mpam_class *class)
+{
+	int err, cpu;
+
+	lockdep_assert_cpus_held();
+
+	if (class->type == MPAM_CLASS_CACHE && class->level == 3)
+		return true;
+
+	if (class->type == MPAM_CLASS_CACHE && class->level != 3) {
+		pr_debug("class %u is a different cache from L3\n", class->level);
+		return false;
+	}
+
+	if (class->type != MPAM_CLASS_MEMORY) {
+		pr_debug("class %u is neither a cache nor memory\n", class->level);
+		return false;
+	}
+
+	cpumask_var_t __free(free_cpumask_var) tmp_cpumask = CPUMASK_VAR_NULL;
+	if (!alloc_cpumask_var(&tmp_cpumask, GFP_KERNEL)) {
+		pr_debug("cpumask allocation failed\n");
+		return false;
+	}
+
+	cpu = cpumask_any_and(&class->affinity, cpu_online_mask);
+	err = find_l3_equivalent_bitmask(cpu, tmp_cpumask);
+	if (err) {
+		pr_debug("Failed to find L3 downstream to cpu %d\n", cpu);
+		return false;
+	}
+
+	if (!cpumask_equal(tmp_cpumask, cpu_possible_mask)) {
+		pr_debug("There is more than one L3\n");
+		return false;
+	}
+
+	/* Be strict; the traffic might stop in the intermediate cache. */
+	if (get_cpu_cacheinfo_id(cpu, 4) != -1) {
+		pr_debug("L3 isn't the last level of cache\n");
+		return false;
+	}
+
+	return true;
+}
+
+/* Can we export MPAM_CLASS_CACHE:{2,3}? */
+static void mpam_resctrl_pick_caches(void)
+{
+	struct mpam_class *class;
+	struct mpam_resctrl_res *res;
+
+	lockdep_assert_cpus_held();
+
+	guard(srcu)(&mpam_srcu);
+	list_for_each_entry_srcu(class, &mpam_classes, classes_list,
+				 srcu_read_lock_held(&mpam_srcu)) {
+		if (class->type != MPAM_CLASS_CACHE) {
+			pr_debug("class %u is not a cache\n", class->level);
+			continue;
+		}
+
+		if (class->level != 2 && class->level != 3) {
+			pr_debug("class %u is not L2 or L3\n", class->level);
+			continue;
+		}
+
+		if (!cache_has_usable_cpor(class)) {
+			pr_debug("class %u cache lacks usable CPOR\n", class->level);
+			continue;
+		}
+
+		if (!cpumask_equal(&class->affinity, cpu_possible_mask)) {
+			pr_debug("class %u has missing CPUs, mask %*pb != %*pb\n", class->level,
+				 cpumask_pr_args(&class->affinity),
+				 cpumask_pr_args(cpu_possible_mask));
+			continue;
+		}
+
+		if (class->level == 2)
+			res = &mpam_resctrl_controls[RDT_RESOURCE_L2];
+		else
+			res = &mpam_resctrl_controls[RDT_RESOURCE_L3];
+		res->class = class;
+	}
+}
+
+static void mpam_resctrl_pick_mba(void)
+{
+	struct mpam_class *class, *candidate_class = NULL;
+	struct mpam_resctrl_res *res;
+
+	lockdep_assert_cpus_held();
+
+	guard(srcu)(&mpam_srcu);
+	list_for_each_entry_srcu(class, &mpam_classes, classes_list,
+				 srcu_read_lock_held(&mpam_srcu)) {
+		struct mpam_props *cprops = &class->props;
+
+		if (class->level != 3 && class->type == MPAM_CLASS_CACHE) {
+			pr_debug("class %u is a cache but not the L3\n", class->level);
+			continue;
+		}
+
+		if (!class_has_usable_mba(cprops)) {
+			pr_debug("class %u has no bandwidth control\n",
+				 class->level);
+			continue;
+		}
+
+		if (!cpumask_equal(&class->affinity, cpu_possible_mask)) {
+			pr_debug("class %u has missing CPUs\n", class->level);
+			continue;
+		}
+
+		if (!topology_matches_l3(class)) {
+			pr_debug("class %u topology doesn't match L3\n",
+				 class->level);
+			continue;
+		}
+
+		if (!traffic_matches_l3(class)) {
+			pr_debug("class %u traffic doesn't match L3 egress\n",
+				 class->level);
+			continue;
+		}
+
+		/*
+		 * Pick a resource to be MBA that is as close as possible to
+		 * the L3. mbm_total counts the bandwidth leaving the L3
+		 * cache and MBA should correspond as closely as possible
+		 * for proper operation of mba_sc.
+		 */
+		if (!candidate_class || class->level < candidate_class->level)
+			candidate_class = class;
+	}
+
+	if (candidate_class) {
+		pr_debug("selected class %u to back MBA\n",
+			 candidate_class->level);
+		res = &mpam_resctrl_controls[RDT_RESOURCE_MBA];
+		res->class = candidate_class;
+	}
+}
+
+static void __free_mbwu_mon(struct mpam_class *class, int *array,
+			    u16 num_mbwu_mon)
+{
+	for (int i = 0; i < num_mbwu_mon; i++) {
+		if (array[i] < 0)
+			continue;
+
+		mpam_free_mbwu_mon(class, array[i]);
+		array[i] = ~0;
+	}
+}
+
+static int __alloc_mbwu_mon(struct mpam_class *class, int *array,
+			    u16 num_mbwu_mon)
+{
+	for (int i = 0; i < num_mbwu_mon; i++) {
+		int mbwu_mon = mpam_alloc_mbwu_mon(class);
+
+		if (mbwu_mon < 0) {
+			__free_mbwu_mon(class, array, num_mbwu_mon);
+			return mbwu_mon;
+		}
+		array[i] = mbwu_mon;
+	}
+
+	return 0;
+}
+
+static int *__alloc_mbwu_array(struct mpam_class *class, u16 num_mbwu_mon)
+{
+	int err;
+	size_t array_size = num_mbwu_mon * sizeof(int);
+	int *array __free(kfree) = kmalloc(array_size, GFP_KERNEL);
+
+	if (!array)
+		return ERR_PTR(-ENOMEM);
+
+	memset(array, -1, array_size);
+
+	err = __alloc_mbwu_mon(class, array, num_mbwu_mon);
+	if (err)
+		return ERR_PTR(err);
+
+	return_ptr(array);
+}
+
+static void counter_update_class(enum resctrl_event_id evt_id,
+				 struct mpam_class *class)
+{
+	struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[evt_id];
+	struct mpam_class *existing_class = mon->class;
+	u16 num_mbwu_mon = class->props.num_mbwu_mon;
+	int *new_array, *existing_array = mon->mbwu_idx_to_mon;
+
+	if (existing_class) {
+		if (class->level == 3) {
+			pr_debug("Existing class is L3 - L3 wins\n");
+			return;
+		}
+
+		if (existing_class->level < class->level) {
+			pr_debug("Existing class is closer to L3, %u versus %u - closer is better\n",
+				 existing_class->level, class->level);
+			return;
+		}
+	}
+
+	pr_debug("Updating event %u to use class %u\n", evt_id, class->level);
+
+	/* Might not need all the monitors */
+	num_mbwu_mon = __mpam_monitors_free_running(num_mbwu_mon);
+
+	if (evt_id != QOS_L3_OCCUP_EVENT_ID && num_mbwu_mon) {
+		/*
+		 * This is the pre-allocated free-running monitors path. It always
+		 * allocates one monitor per PARTID * PMG.
+		 */
+		WARN_ON_ONCE(num_mbwu_mon != resctrl_arch_system_num_rmid_idx());
+
+		new_array = __alloc_mbwu_array(class, num_mbwu_mon);
+		if (IS_ERR(new_array)) {
+			pr_debug("Failed to allocate MBWU array\n");
+			return;
+		}
+		mon->mbwu_idx_to_mon = new_array;
+
+		if (existing_array) {
+			pr_debug("Releasing previous class %u's monitors\n",
+				 existing_class->level);
+			__free_mbwu_mon(existing_class, existing_array, num_mbwu_mon);
+			kfree(existing_array);
+		}
+	} else if (evt_id != QOS_L3_OCCUP_EVENT_ID) {
+		pr_debug("Not pre-allocating free-running counters\n");
+	}
+
+	mon->class = class;
+}
+
+static void mpam_resctrl_pick_counters(void)
+{
+	struct mpam_class *class;
+
+	lockdep_assert_cpus_held();
+
+	guard(srcu)(&mpam_srcu);
+	list_for_each_entry_srcu(class, &mpam_classes, classes_list,
+				 srcu_read_lock_held(&mpam_srcu)) {
+		/* The name of the resource is L3... */
+		if (class->type == MPAM_CLASS_CACHE && class->level != 3) {
+			pr_debug("class %u is a cache but not the L3\n", class->level);
+			continue;
+		}
+
+		if (!cpumask_equal(&class->affinity, cpu_possible_mask)) {
+			pr_debug("class %u does not cover all CPUs\n",
+				 class->level);
+			continue;
+		}
+
+		if (cache_has_usable_csu(class)) {
+			pr_debug("class %u has usable CSU\n",
+				 class->level);
+
+			/* CSU counters only make sense on a cache. */
+			switch (class->type) {
+			case MPAM_CLASS_CACHE:
+				if (update_rmid_limits(class))
+					break;
+
+				counter_update_class(QOS_L3_OCCUP_EVENT_ID, class);
+				break;
+			default:
+				break;
+			}
+		}
+
+		if (class_has_usable_mbwu(class) &&
+		    topology_matches_l3(class) &&
+		    traffic_matches_l3(class)) {
+			pr_debug("class %u has usable MBWU, and matches L3 topology and traffic\n",
+				 class->level);
+
+			/*
+			 * We can't distinguish traffic by destination, so
+			 * we don't know if it stays on the same NUMA node.
+			 * Hence we can't calculate mbm_local, except when
+			 * there is only one L3, where it is equivalent to
+			 * mbm_total - so always use mbm_total.
+			 */
+			counter_update_class(QOS_L3_MBM_TOTAL_EVENT_ID, class);
+		}
+	}
+}
+
+static int mpam_resctrl_control_init(struct mpam_resctrl_res *res)
+{
+	struct mpam_class *class = res->class;
+	struct mpam_props *cprops = &class->props;
+	struct rdt_resource *r = &res->resctrl_res;
+
+	switch (r->rid) {
+	case RDT_RESOURCE_L2:
+	case RDT_RESOURCE_L3:
+		r->schema_fmt = RESCTRL_SCHEMA_BITMAP;
+		r->cache.arch_has_sparse_bitmasks = true;
+
+		r->cache.cbm_len = class->props.cpbm_wd;
+		/* mpam_devices will reject empty bitmaps */
+		r->cache.min_cbm_bits = 1;
+
+		if (r->rid == RDT_RESOURCE_L2) {
+			r->name = "L2";
+			r->ctrl_scope = RESCTRL_L2_CACHE;
+		} else {
+			r->name = "L3";
+			r->ctrl_scope = RESCTRL_L3_CACHE;
+		}
+
+		/*
+		 * Which bits are shared with other ...things...
+		 * Unknown devices use partid-0 which uses all the bitmap
+		 * fields. Until we have configured the SMMU and GIC not to
+		 * do this, 'all the bits' is the correct answer here.
+		 */
+		r->cache.shareable_bits = resctrl_get_default_ctrl(r);
+		r->alloc_capable = true;
+		break;
+	case RDT_RESOURCE_MBA:
+		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
+		r->ctrl_scope = RESCTRL_L3_CACHE;
+
+		r->membw.delay_linear = true;
+		r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED;
+		r->membw.min_bw = get_mba_min(cprops);
+		r->membw.max_bw = MAX_MBA_BW;
+		r->membw.bw_gran = get_mba_granularity(cprops);
+
+		r->name = "MB";
+		r->alloc_capable = true;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int mpam_resctrl_pick_domain_id(int cpu, struct mpam_component *comp)
+{
+	struct mpam_class *class = comp->class;
+
+	if (class->type == MPAM_CLASS_CACHE)
+		return comp->comp_id;
+
+	if (topology_matches_l3(class)) {
+		/* Use the corresponding L3 component ID as the domain ID */
+		int id = get_cpu_cacheinfo_id(cpu, 3);
+
+		/* Implies topology_matches_l3() made a mistake */
+		if (WARN_ON_ONCE(id == -1))
+			return comp->comp_id;
+
+		return id;
+	}
+
+	/* Otherwise, expose the ID used by the firmware table code. */
+	return comp->comp_id;
+}
+
+static int mpam_resctrl_monitor_init(struct mpam_resctrl_mon *mon,
+				     enum resctrl_event_id type)
+{
+	struct mpam_resctrl_res *res = &mpam_resctrl_controls[RDT_RESOURCE_L3];
+	struct rdt_resource *l3 = &res->resctrl_res;
+
+	lockdep_assert_cpus_held();
+
+	/*
+	 * There also needs to be an L3 cache present.
+	 * The check just requires any online CPU and it can't go offline as we
+	 * hold the cpu lock.
+	 */
+	if (get_cpu_cacheinfo_id(raw_smp_processor_id(), 3) == -1)
+		return 0;
+
+	/*
+	 * If there are no MPAM resources on L3, force it into existence.
+	 * topology_matches_l3() already ensures this looks like the L3.
+	 * The domain-ids will be fixed up by mpam_resctrl_domain_hdr_init().
+	 */
+	if (!res->class) {
+		pr_warn_once("Faking L3 MSC to enable counters.\n");
+		res->class = mpam_resctrl_counters[type].class;
+	}
+
+	/*
+	 * Called multiple times, once per event type that has a
+	 * monitoring class.
+	 * Setting the name is necessary on monitor-only platforms.
+	 */
+	l3->name = "L3";
+	l3->mon_scope = RESCTRL_L3_CACHE;
+
+	/*
+	 * num-rmid is the upper bound for the number of monitoring
+	 * groups that can exist simultaneously, including the
+	 * default monitoring group for each control group. Hence,
+	 * advertise the whole rmid_idx space even though each
+	 * control group has its own pmg/rmid space. Unfortunately,
+	 * this does mean userspace needs to know the architecture
+	 * to correctly interpret this value.
+	 */
+	l3->mon.num_rmid = resctrl_arch_system_num_rmid_idx();
+
+	if (resctrl_enable_mon_event(type, false, 0, NULL))
+		l3->mon_capable = true;
+
+	return 0;
+}
+
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
+			    u32 closid, enum resctrl_conf_type type)
+{
+	u32 partid;
+	struct mpam_config *cfg;
+	struct mpam_props *cprops;
+	struct mpam_resctrl_res *res;
+	struct mpam_resctrl_dom *dom;
+	enum mpam_device_features configured_by;
+
+	lockdep_assert_cpus_held();
+
+	if (!mpam_is_enabled())
+		return resctrl_get_default_ctrl(r);
+
+	res = container_of(r, struct mpam_resctrl_res, resctrl_res);
+	dom = container_of(d, struct mpam_resctrl_dom, resctrl_ctrl_dom);
+	cprops = &res->class->props;
+
+	/*
+	 * When CDP is enabled, but the resource doesn't support it,
+	 * the control is cloned across both partids.
+	 * Pick one at random to read:
+	 */
+	if (mpam_resctrl_hide_cdp(r->rid))
+		type = CDP_DATA;
+
+	partid = resctrl_get_config_index(closid, type);
+	cfg = &dom->ctrl_comp->cfg[partid];
+
+	switch (r->rid) {
+	case RDT_RESOURCE_L2:
+	case RDT_RESOURCE_L3:
+		configured_by = mpam_feat_cpor_part;
+		break;
+	case RDT_RESOURCE_MBA:
+		if (mpam_has_feature(mpam_feat_mbw_max, cprops)) {
+			configured_by = mpam_feat_mbw_max;
+			break;
+		}
+		fallthrough;
+	default:
+		return resctrl_get_default_ctrl(r);
+	}
+
+	if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r) ||
+	    !mpam_has_feature(configured_by, cfg))
+		return resctrl_get_default_ctrl(r);
+
+	switch (configured_by) {
+	case mpam_feat_cpor_part:
+		return cfg->cpbm;
+	case mpam_feat_mbw_max:
+		return mbw_max_to_percent(cfg->mbw_max, cprops);
+	default:
+		return resctrl_get_default_ctrl(r);
+	}
+}
+
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
+			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
+{
+	int err;
+	u32 partid;
+	struct mpam_config cfg;
+	struct mpam_props *cprops;
+	struct mpam_resctrl_res *res;
+	struct mpam_resctrl_dom *dom;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_irqs_enabled();
+
+	if (!mpam_is_enabled())
+		return -EINVAL;
+
+	/*
+	 * No need to check the CPU as mpam_apply_config() doesn't care, and
+	 * resctrl_arch_update_domains() relies on this.
+	 */
+	res = container_of(r, struct mpam_resctrl_res, resctrl_res);
+	dom = container_of(d, struct mpam_resctrl_dom, resctrl_ctrl_dom);
+	cprops = &res->class->props;
+
+	if (mpam_resctrl_hide_cdp(r->rid))
+		t = CDP_DATA;
+
+	partid = resctrl_get_config_index(closid, t);
+	if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r)) {
+		pr_debug("Not alloc capable or computed PARTID out of range\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Copy the current config to avoid clearing other resources when the
+	 * same component is exposed multiple times through resctrl.
+ */ + cfg = dom->ctrl_comp->cfg[partid]; + + switch (r->rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + cfg.cpbm = cfg_val; + mpam_set_feature(mpam_feat_cpor_part, &cfg); + break; + case RDT_RESOURCE_MBA: + if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops); + mpam_set_feature(mpam_feat_mbw_max, &cfg); + break; + } + fallthrough; + default: + return -EINVAL; + } + + /* + * When CDP is enabled, but the resource doesn't support it, we need to + * apply the same configuration to the other partid. + */ + if (mpam_resctrl_hide_cdp(r->rid)) { + partid = resctrl_get_config_index(closid, CDP_CODE); + err = mpam_apply_config(dom->ctrl_comp, partid, &cfg); + if (err) + return err; + + partid = resctrl_get_config_index(closid, CDP_DATA); + return mpam_apply_config(dom->ctrl_comp, partid, &cfg); + } + + return mpam_apply_config(dom->ctrl_comp, partid, &cfg); +} + +int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) +{ + int err; + struct rdt_ctrl_domain *d; + + lockdep_assert_cpus_held(); + lockdep_assert_irqs_enabled(); + + if (!mpam_is_enabled()) + return -EINVAL; + + list_for_each_entry_rcu(d, &r->ctrl_domains, hdr.list) { + for (enum resctrl_conf_type t = 0; t < CDP_NUM_TYPES; t++) { + struct resctrl_staged_config *cfg = &d->staged_config[t]; + + if (!cfg->have_new_ctrl) + continue; + + err = resctrl_arch_update_one(r, d, closid, t, + cfg->new_ctrl); + if (err) + return err; + } + } + + return 0; +} + +void resctrl_arch_reset_all_ctrls(struct rdt_resource *r) +{ + struct mpam_resctrl_res *res; + + lockdep_assert_cpus_held(); + + if (!mpam_is_enabled()) + return; + + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + mpam_reset_class_locked(res->class); +} + +static void mpam_resctrl_domain_hdr_init(int cpu, struct mpam_component *comp, + enum resctrl_res_level rid, + struct rdt_domain_hdr *hdr) +{ + lockdep_assert_cpus_held(); + + INIT_LIST_HEAD(&hdr->list); + hdr->id = mpam_resctrl_pick_domain_id(cpu, comp); + hdr->rid = rid; + cpumask_set_cpu(cpu, &hdr->cpu_mask); +} + +static void mpam_resctrl_online_domain_hdr(unsigned int cpu, + struct rdt_domain_hdr *hdr) +{ + lockdep_assert_cpus_held(); + + cpumask_set_cpu(cpu, &hdr->cpu_mask); +} + +/** + * mpam_resctrl_offline_domain_hdr() - Update the domain header to remove a CPU. + * @cpu: The CPU to remove from the domain. + * @hdr: The domain's header. + * + * Removes @cpu from the header mask. If this was the last CPU in the domain, + * the domain header is removed from its parent list and true is returned, + * indicating the parent structure can be freed. + * If there are other CPUs in the domain, returns false. 
+ */
+static bool mpam_resctrl_offline_domain_hdr(unsigned int cpu,
+					    struct rdt_domain_hdr *hdr)
+{
+	lockdep_assert_held(&domain_list_lock);
+
+	cpumask_clear_cpu(cpu, &hdr->cpu_mask);
+	if (cpumask_empty(&hdr->cpu_mask)) {
+		list_del_rcu(&hdr->list);
+		synchronize_rcu();
+		return true;
+	}
+
+	return false;
+}
+
+static void mpam_resctrl_domain_insert(struct list_head *list,
+				       struct rdt_domain_hdr *new)
+{
+	struct rdt_domain_hdr *err;
+	struct list_head *pos = NULL;
+
+	lockdep_assert_held(&domain_list_lock);
+
+	err = resctrl_find_domain(list, new->id, &pos);
+	if (WARN_ON_ONCE(err))
+		return;
+
+	list_add_tail_rcu(&new->list, pos);
+}
+
+static struct mpam_component *find_component(struct mpam_class *class, int cpu)
+{
+	struct mpam_component *comp;
+
+	guard(srcu)(&mpam_srcu);
+	list_for_each_entry_srcu(comp, &class->components, class_list,
+				 srcu_read_lock_held(&mpam_srcu)) {
+		if (cpumask_test_cpu(cpu, &comp->affinity))
+			return comp;
+	}
+
+	return NULL;
+}
+
+static struct mpam_resctrl_dom *
+mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res)
+{
+	int err;
+	struct mpam_resctrl_dom *dom;
+	struct rdt_l3_mon_domain *mon_d;
+	struct rdt_ctrl_domain *ctrl_d;
+	struct mpam_class *class = res->class;
+	struct mpam_component *comp_iter, *ctrl_comp;
+	struct rdt_resource *r = &res->resctrl_res;
+
+	lockdep_assert_held(&domain_list_lock);
+
+	ctrl_comp = NULL;
+	guard(srcu)(&mpam_srcu);
+	list_for_each_entry_srcu(comp_iter, &class->components, class_list,
+				 srcu_read_lock_held(&mpam_srcu)) {
+		if (cpumask_test_cpu(cpu, &comp_iter->affinity)) {
+			ctrl_comp = comp_iter;
+			break;
+		}
+	}
+
+	/* class has no component for this CPU */
+	if (WARN_ON_ONCE(!ctrl_comp))
+		return ERR_PTR(-EINVAL);
+
+	dom = kzalloc_node(sizeof(*dom), GFP_KERNEL, cpu_to_node(cpu));
+	if (!dom)
+		return ERR_PTR(-ENOMEM);
+
+	if (r->alloc_capable) {
+		dom->ctrl_comp = ctrl_comp;
+
+		ctrl_d = &dom->resctrl_ctrl_dom;
+		mpam_resctrl_domain_hdr_init(cpu, ctrl_comp, r->rid, &ctrl_d->hdr);
+		ctrl_d->hdr.type = RESCTRL_CTRL_DOMAIN;
+		err = resctrl_online_ctrl_domain(r, ctrl_d);
+		if (err)
+			goto free_domain;
+
+		mpam_resctrl_domain_insert(&r->ctrl_domains, &ctrl_d->hdr);
+	} else {
+		pr_debug("Skipped control domain online - no controls\n");
+	}
+
+	if (r->mon_capable) {
+		struct mpam_component *any_mon_comp = NULL;
+		struct mpam_resctrl_mon *mon;
+		enum resctrl_event_id eventid;
+
+		/*
+		 * Even if the monitor domain is backed by a different
+		 * component, the L3 component IDs need to be used... only
+		 * there may be no ctrl_comp for the L3.
+		 * Search each event's class list for a component with
+		 * overlapping CPUs and set up the dom->mon_comp array.
+ */ + + for_each_mpam_resctrl_mon(mon, eventid) { + struct mpam_component *mon_comp; + + if (!mon->class) + continue; // dummy resource + + mon_comp = find_component(mon->class, cpu); + dom->mon_comp[eventid] = mon_comp; + if (mon_comp) + any_mon_comp = mon_comp; + } + if (!any_mon_comp) { + WARN_ON_ONCE(0); + err = -EFAULT; + goto offline_ctrl_domain; + } + + mon_d = &dom->resctrl_mon_dom; + mpam_resctrl_domain_hdr_init(cpu, any_mon_comp, r->rid, &mon_d->hdr); + mon_d->hdr.type = RESCTRL_MON_DOMAIN; + err = resctrl_online_mon_domain(r, &mon_d->hdr); + if (err) + goto offline_ctrl_domain; + + mpam_resctrl_domain_insert(&r->mon_domains, &mon_d->hdr); + } else { + pr_debug("Skipped monitor domain online - no monitors\n"); + } + + return dom; + +offline_ctrl_domain: + if (r->alloc_capable) { + mpam_resctrl_offline_domain_hdr(cpu, &ctrl_d->hdr); + resctrl_offline_ctrl_domain(r, ctrl_d); + } +free_domain: + kfree(dom); + dom = ERR_PTR(err); + + return dom; +} + +/* + * We know all the monitors are associated with the L3, even if there are no + * controls and therefore no control component. Find the cache-id for the CPU + * and use that to search for existing resctrl domains. + * This relies on mpam_resctrl_pick_domain_id() using the L3 cache-id + * for anything that is not a cache. + */ +static struct mpam_resctrl_dom *mpam_resctrl_get_mon_domain_from_cpu(int cpu) +{ + int cache_id; + struct mpam_resctrl_dom *dom; + struct mpam_resctrl_res *l3 = &mpam_resctrl_controls[RDT_RESOURCE_L3]; + + lockdep_assert_cpus_held(); + + if (!l3->class) + return NULL; + cache_id = get_cpu_cacheinfo_id(cpu, 3); + if (cache_id < 0) + return NULL; + + list_for_each_entry_rcu(dom, &l3->resctrl_res.mon_domains, resctrl_mon_dom.hdr.list) { + if (dom->resctrl_mon_dom.hdr.id == cache_id) + return dom; + } + + return NULL; +} + +static struct mpam_resctrl_dom * +mpam_resctrl_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res) +{ + struct mpam_resctrl_dom *dom; + struct rdt_resource *r = &res->resctrl_res; + + lockdep_assert_cpus_held(); + + list_for_each_entry_rcu(dom, &r->ctrl_domains, resctrl_ctrl_dom.hdr.list) { + if (cpumask_test_cpu(cpu, &dom->ctrl_comp->affinity)) + return dom; + } + + if (r->rid != RDT_RESOURCE_L3) + return NULL; + + /* Search the mon domain list too - needed on monitor only platforms. 
*/ + return mpam_resctrl_get_mon_domain_from_cpu(cpu); +} + +int mpam_resctrl_online_cpu(unsigned int cpu) +{ + struct mpam_resctrl_res *res; + enum resctrl_res_level rid; + + guard(mutex)(&domain_list_lock); + for_each_mpam_resctrl_control(res, rid) { + struct mpam_resctrl_dom *dom; + struct rdt_resource *r = &res->resctrl_res; + + if (!res->class) + continue; // dummy_resource; + + dom = mpam_resctrl_get_domain_from_cpu(cpu, res); + if (!dom) { + dom = mpam_resctrl_alloc_domain(cpu, res); + } else { + if (r->alloc_capable) { + struct rdt_ctrl_domain *ctrl_d = &dom->resctrl_ctrl_dom; + + mpam_resctrl_online_domain_hdr(cpu, &ctrl_d->hdr); + } + if (r->mon_capable) { + struct rdt_l3_mon_domain *mon_d = &dom->resctrl_mon_dom; + + mpam_resctrl_online_domain_hdr(cpu, &mon_d->hdr); + } + } + if (IS_ERR(dom)) + return PTR_ERR(dom); + } + + resctrl_online_cpu(cpu); + + return 0; +} + +void mpam_resctrl_offline_cpu(unsigned int cpu) +{ + struct mpam_resctrl_res *res; + enum resctrl_res_level rid; + + resctrl_offline_cpu(cpu); + + guard(mutex)(&domain_list_lock); + for_each_mpam_resctrl_control(res, rid) { + struct mpam_resctrl_dom *dom; + struct rdt_l3_mon_domain *mon_d; + struct rdt_ctrl_domain *ctrl_d; + bool ctrl_dom_empty, mon_dom_empty; + struct rdt_resource *r = &res->resctrl_res; + + if (!res->class) + continue; // dummy resource + + dom = mpam_resctrl_get_domain_from_cpu(cpu, res); + if (WARN_ON_ONCE(!dom)) + continue; + + if (r->alloc_capable) { + ctrl_d = &dom->resctrl_ctrl_dom; + ctrl_dom_empty = mpam_resctrl_offline_domain_hdr(cpu, &ctrl_d->hdr); + if (ctrl_dom_empty) + resctrl_offline_ctrl_domain(&res->resctrl_res, ctrl_d); + } else { + ctrl_dom_empty = true; + } + + if (r->mon_capable) { + mon_d = &dom->resctrl_mon_dom; + mon_dom_empty = mpam_resctrl_offline_domain_hdr(cpu, &mon_d->hdr); + if (mon_dom_empty) + resctrl_offline_mon_domain(&res->resctrl_res, &mon_d->hdr); + } else { + mon_dom_empty = true; + } + + if (ctrl_dom_empty && mon_dom_empty) + kfree(dom); + } +} + +int mpam_resctrl_setup(void) +{ + int err = 0; + struct mpam_resctrl_res *res; + enum resctrl_res_level rid; + struct mpam_resctrl_mon *mon; + enum resctrl_event_id eventid; + + wait_event(wait_cacheinfo_ready, cacheinfo_ready); + + cpus_read_lock(); + for_each_mpam_resctrl_control(res, rid) { + INIT_LIST_HEAD_RCU(&res->resctrl_res.ctrl_domains); + INIT_LIST_HEAD_RCU(&res->resctrl_res.mon_domains); + res->resctrl_res.rid = rid; + } + + /* Find some classes to use for controls */ + mpam_resctrl_pick_caches(); + mpam_resctrl_pick_mba(); + + /* Initialise the resctrl structures from the classes */ + for_each_mpam_resctrl_control(res, rid) { + if (!res->class) + continue; // dummy resource + + err = mpam_resctrl_control_init(res); + if (err) { + pr_debug("Failed to initialise rid %u\n", rid); + goto internal_error; + } + } + + /* Find some classes to use for monitors */ + mpam_resctrl_pick_counters(); + + for_each_mpam_resctrl_mon(mon, eventid) { + if (!mon->class) + continue; // dummy resource + + err = mpam_resctrl_monitor_init(mon, eventid); + if (err) { + pr_debug("Failed to initialise event %u\n", eventid); + goto internal_error; + } + } + + cpus_read_unlock(); + + if (!resctrl_arch_alloc_capable() && !resctrl_arch_mon_capable()) { + pr_debug("No alloc(%u) or monitor(%u) found - resctrl not supported\n", + resctrl_arch_alloc_capable(), resctrl_arch_mon_capable()); + return -EOPNOTSUPP; + } + + err = resctrl_init(); + if (err) + return err; + + WRITE_ONCE(resctrl_enabled, true); + + return 0; + +internal_error: + 
cpus_read_unlock(); + pr_debug("Internal error %d - resctrl not supported\n", err); + return err; +} + +void mpam_resctrl_exit(void) +{ + if (!READ_ONCE(resctrl_enabled)) + return; + + WRITE_ONCE(resctrl_enabled, false); + resctrl_exit(); +} + +static void mpam_resctrl_teardown_mon(struct mpam_resctrl_mon *mon, struct mpam_class *class) +{ + u32 num_mbwu_mon = resctrl_arch_system_num_rmid_idx(); + + if (!mon->mbwu_idx_to_mon) + return; + + __free_mbwu_mon(class, mon->mbwu_idx_to_mon, num_mbwu_mon); + mon->mbwu_idx_to_mon = NULL; +} + +/* + * The driver is detaching an MSC from this class, if resctrl was using it, + * pull on resctrl_exit(). + */ +void mpam_resctrl_teardown_class(struct mpam_class *class) +{ + struct mpam_resctrl_res *res; + enum resctrl_res_level rid; + struct mpam_resctrl_mon *mon; + enum resctrl_event_id eventid; + + might_sleep(); + + for_each_mpam_resctrl_control(res, rid) { + if (res->class == class) { + res->class = NULL; + break; + } + } + for_each_mpam_resctrl_mon(mon, eventid) { + if (mon->class == class) { + mon->class = NULL; + + mpam_resctrl_teardown_mon(mon, class); + break; + } + } +} + +static int __init __cacheinfo_ready(void) +{ + cacheinfo_ready = true; + wake_up(&wait_cacheinfo_ready); + + return 0; +} +device_initcall_sync(__cacheinfo_ready); + +#ifdef CONFIG_MPAM_KUNIT_TEST +#include "test_mpam_resctrl.c" +#endif diff --git a/drivers/resctrl/test_mpam_resctrl.c b/drivers/resctrl/test_mpam_resctrl.c new file mode 100644 index 000000000000..a20da161d965 --- /dev/null +++ b/drivers/resctrl/test_mpam_resctrl.c @@ -0,0 +1,364 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2025 Arm Ltd. +/* This file is intended to be included into mpam_resctrl.c */ + +#include <kunit/test.h> +#include <linux/array_size.h> +#include <linux/bits.h> +#include <linux/math.h> +#include <linux/sprintf.h> + +struct percent_value_case { + u8 pc; + u8 width; + u16 value; +}; + +/* + * Mysterious inscriptions taken from the union of ARM DDI 0598D.b, + * "Arm Architecture Reference Manual Supplement - Memory System + * Resource Partitioning and Monitoring (MPAM), for A-profile + * architecture", Section 9.8, "About the fixed-point fractional + * format" (exact percentage entries only) and ARM IHI0099B.a + * "MPAM system component specification", Section 9.3, + * "The fixed-point fractional format": + */ +static const struct percent_value_case percent_value_cases[] = { + /* Architectural cases: */ + { 1, 8, 1 }, { 1, 12, 0x27 }, { 1, 16, 0x28e }, + { 25, 8, 0x3f }, { 25, 12, 0x3ff }, { 25, 16, 0x3fff }, + { 33, 8, 0x53 }, { 33, 12, 0x546 }, { 33, 16, 0x5479 }, + { 35, 8, 0x58 }, { 35, 12, 0x598 }, { 35, 16, 0x5998 }, + { 45, 8, 0x72 }, { 45, 12, 0x732 }, { 45, 16, 0x7332 }, + { 50, 8, 0x7f }, { 50, 12, 0x7ff }, { 50, 16, 0x7fff }, + { 52, 8, 0x84 }, { 52, 12, 0x850 }, { 52, 16, 0x851d }, + { 55, 8, 0x8b }, { 55, 12, 0x8cb }, { 55, 16, 0x8ccb }, + { 58, 8, 0x93 }, { 58, 12, 0x946 }, { 58, 16, 0x9479 }, + { 75, 8, 0xbf }, { 75, 12, 0xbff }, { 75, 16, 0xbfff }, + { 80, 8, 0xcb }, { 80, 12, 0xccb }, { 80, 16, 0xcccb }, + { 88, 8, 0xe0 }, { 88, 12, 0xe13 }, { 88, 16, 0xe146 }, + { 95, 8, 0xf2 }, { 95, 12, 0xf32 }, { 95, 16, 0xf332 }, + { 100, 8, 0xff }, { 100, 12, 0xfff }, { 100, 16, 0xffff }, +}; + +static void test_percent_value_desc(const struct percent_value_case *param, + char *desc) +{ + snprintf(desc, KUNIT_PARAM_DESC_SIZE, + "pc=%d, width=%d, value=0x%.*x\n", + param->pc, param->width, + DIV_ROUND_UP(param->width, 4), param->value); +} + 
+KUNIT_ARRAY_PARAM(test_percent_value, percent_value_cases,
+		  test_percent_value_desc);
+
+struct percent_value_test_info {
+	u32 pc;			/* result of value-to-percent conversion */
+	u32 value;		/* result of percent-to-value conversion */
+	u32 max_value;		/* maximum raw value allowed by test params */
+	unsigned int shift;	/* promotes raw testcase value to 16 bits */
+};
+
+/*
+ * Convert a reference percentage to a fixed-point MAX value and
+ * vice-versa, based on param (not test->param_value!)
+ */
+static void __prepare_percent_value_test(struct kunit *test,
+					 struct percent_value_test_info *res,
+					 const struct percent_value_case *param)
+{
+	struct mpam_props fake_props = { };
+
+	/* Reject bogus test parameters that would break the tests: */
+	KUNIT_ASSERT_GE(test, param->width, 1);
+	KUNIT_ASSERT_LE(test, param->width, 16);
+	KUNIT_ASSERT_LT(test, param->value, 1 << param->width);
+
+	mpam_set_feature(mpam_feat_mbw_max, &fake_props);
+	fake_props.bwa_wd = param->width;
+
+	res->shift = 16 - param->width;
+	res->max_value = GENMASK_U32(param->width - 1, 0);
+	res->value = percent_to_mbw_max(param->pc, &fake_props);
+	res->pc = mbw_max_to_percent(param->value << res->shift, &fake_props);
+}
+
+static void test_get_mba_granularity(struct kunit *test)
+{
+	int ret;
+	struct mpam_props fake_props = { };
+
+	/* Use MBW_MAX */
+	mpam_set_feature(mpam_feat_mbw_max, &fake_props);
+
+	fake_props.bwa_wd = 0;
+	KUNIT_EXPECT_FALSE(test, mba_class_use_mbw_max(&fake_props));
+
+	fake_props.bwa_wd = 1;
+	KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props));
+
+	/* Architectural maximum: */
+	fake_props.bwa_wd = 16;
+	KUNIT_EXPECT_TRUE(test, mba_class_use_mbw_max(&fake_props));
+
+	/* No usable control... */
+	fake_props.bwa_wd = 0;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 0);
+
+	fake_props.bwa_wd = 1;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 50);	/* DIV_ROUND_UP(100, 1 << 1)% = 50% */
+
+	fake_props.bwa_wd = 2;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 25);	/* DIV_ROUND_UP(100, 1 << 2)% = 25% */
+
+	fake_props.bwa_wd = 3;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 13);	/* DIV_ROUND_UP(100, 1 << 3)% = 13% */
+
+	fake_props.bwa_wd = 6;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 2);	/* DIV_ROUND_UP(100, 1 << 6)% = 2% */
+
+	fake_props.bwa_wd = 7;
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 1);	/* DIV_ROUND_UP(100, 1 << 7)% = 1% */
+
+	/* Granularity saturates at 1% */
+	fake_props.bwa_wd = 16;	/* architectural maximum */
+	ret = get_mba_granularity(&fake_props);
+	KUNIT_EXPECT_EQ(test, ret, 1);	/* DIV_ROUND_UP(100, 1 << 16)% = 1% */
+}
+
+static void test_mbw_max_to_percent(struct kunit *test)
+{
+	const struct percent_value_case *param = test->param_value;
+	struct percent_value_test_info res;
+
+	/*
+	 * Since the reference values in percent_value_cases[] all
+	 * correspond to exact percentages, round-to-nearest will
+	 * always give the exact percentage back when the MPAM max
+	 * value has precision of 0.5% or finer. (Always true for the
+	 * reference data, since they all specify 8 bits or more of
+	 * precision.)
+ * + * So, keep it simple and demand an exact match: + */ + __prepare_percent_value_test(test, &res, param); + KUNIT_EXPECT_EQ(test, res.pc, param->pc); +} + +static void test_percent_to_mbw_max(struct kunit *test) +{ + const struct percent_value_case *param = test->param_value; + struct percent_value_test_info res; + + __prepare_percent_value_test(test, &res, param); + + KUNIT_EXPECT_GE(test, res.value, param->value << res.shift); + KUNIT_EXPECT_LE(test, res.value, (param->value + 1) << res.shift); + KUNIT_EXPECT_LE(test, res.value, res.max_value << res.shift); + + /* No flexibility allowed for 0% and 100%! */ + + if (param->pc == 0) + KUNIT_EXPECT_EQ(test, res.value, 0); + + if (param->pc == 100) + KUNIT_EXPECT_EQ(test, res.value, res.max_value << res.shift); +} + +static const void *test_all_bwa_wd_gen_params(struct kunit *test, const void *prev, + char *desc) +{ + uintptr_t param = (uintptr_t)prev; + + if (param > 15) + return NULL; + + param++; + + snprintf(desc, KUNIT_PARAM_DESC_SIZE, "wd=%u\n", (unsigned int)param); + + return (void *)param; +} + +static unsigned int test_get_bwa_wd(struct kunit *test) +{ + uintptr_t param = (uintptr_t)test->param_value; + + KUNIT_ASSERT_GE(test, param, 1); + KUNIT_ASSERT_LE(test, param, 16); + + return param; +} + +static void test_mbw_max_to_percent_limits(struct kunit *test) +{ + struct mpam_props fake_props = {0}; + u32 max_value; + + mpam_set_feature(mpam_feat_mbw_max, &fake_props); + fake_props.bwa_wd = test_get_bwa_wd(test); + max_value = GENMASK(15, 16 - fake_props.bwa_wd); + + KUNIT_EXPECT_EQ(test, mbw_max_to_percent(max_value, &fake_props), + MAX_MBA_BW); + KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props), + get_mba_min(&fake_props)); + + /* + * Rounding policy dependent 0% sanity-check: + * With round-to-nearest, the minimum mbw_max value really + * should map to 0% if there are at least 200 steps. + * (100 steps may be enough for some other rounding policies.) 
+ */ + if (fake_props.bwa_wd >= 8) + KUNIT_EXPECT_EQ(test, mbw_max_to_percent(0, &fake_props), 0); + + if (fake_props.bwa_wd < 8 && + mbw_max_to_percent(0, &fake_props) == 0) + kunit_warn(test, "wd=%d: Testsuite/driver Rounding policy mismatch?", + fake_props.bwa_wd); +} + +/* + * Check that converting a percentage to mbw_max and back again (or, as + * appropriate, vice-versa) always restores the original value: + */ +static void test_percent_max_roundtrip_stability(struct kunit *test) +{ + struct mpam_props fake_props = {0}; + unsigned int shift; + u32 pc, max, pc2, max2; + + mpam_set_feature(mpam_feat_mbw_max, &fake_props); + fake_props.bwa_wd = test_get_bwa_wd(test); + shift = 16 - fake_props.bwa_wd; + + /* + * Converting a valid value from the coarser scale to the finer + * scale and back again must yield the original value: + */ + if (fake_props.bwa_wd >= 7) { + /* More than 100 steps: only test exact pc values: */ + for (pc = get_mba_min(&fake_props); pc <= MAX_MBA_BW; pc++) { + max = percent_to_mbw_max(pc, &fake_props); + pc2 = mbw_max_to_percent(max, &fake_props); + KUNIT_EXPECT_EQ(test, pc2, pc); + } + } else { + /* Fewer than 100 steps: only test exact mbw_max values: */ + for (max = 0; max < 1 << 16; max += 1 << shift) { + pc = mbw_max_to_percent(max, &fake_props); + max2 = percent_to_mbw_max(pc, &fake_props); + KUNIT_EXPECT_EQ(test, max2, max); + } + } +} + +static void test_percent_to_max_rounding(struct kunit *test) +{ + const struct percent_value_case *param = test->param_value; + unsigned int num_rounded_up = 0, total = 0; + struct percent_value_test_info res; + + for (param = percent_value_cases, total = 0; + param < &percent_value_cases[ARRAY_SIZE(percent_value_cases)]; + param++, total++) { + __prepare_percent_value_test(test, &res, param); + if (res.value > param->value << res.shift) + num_rounded_up++; + } + + /* + * The MPAM driver applies a round-to-nearest policy, whereas a + * round-down policy seems to have been applied in the + * reference table from which the test vectors were selected. + * + * For a large and well-distributed suite of test vectors, + * about half should be rounded up and half down compared with + * the reference table. 
The actual test vectors are few in + * number and probably not very well distributed however, so + * tolerate a round-up rate of between 1/4 and 3/4 before + * crying foul: + */ + + kunit_info(test, "Round-up rate: %u%% (%u/%u)\n", + DIV_ROUND_CLOSEST(num_rounded_up * 100, total), + num_rounded_up, total); + + KUNIT_EXPECT_GE(test, 4 * num_rounded_up, 1 * total); + KUNIT_EXPECT_LE(test, 4 * num_rounded_up, 3 * total); +} + +struct rmid_idx_case { + u32 max_partid; + u32 max_pmg; +}; + +static const struct rmid_idx_case rmid_idx_cases[] = { + {0, 0}, {1, 4}, {3, 1}, {5, 9}, {4, 4}, {100, 11}, {0xFFFF, 0xFF}, +}; + +static void test_rmid_idx_desc(const struct rmid_idx_case *param, char *desc) +{ + snprintf(desc, KUNIT_PARAM_DESC_SIZE, "max_partid=%d, max_pmg=%d\n", + param->max_partid, param->max_pmg); +} + +KUNIT_ARRAY_PARAM(test_rmid_idx, rmid_idx_cases, test_rmid_idx_desc); + +static void test_rmid_idx_encoding(struct kunit *test) +{ + u32 orig_mpam_partid_max = mpam_partid_max; + u32 orig_mpam_pmg_max = mpam_pmg_max; + const struct rmid_idx_case *param = test->param_value; + u32 idx, num_idx, count = 0; + + mpam_partid_max = param->max_partid; + mpam_pmg_max = param->max_pmg; + + for (u32 partid = 0; partid <= mpam_partid_max; partid++) { + for (u32 pmg = 0; pmg <= mpam_pmg_max; pmg++) { + u32 partid_out, pmg_out; + + idx = resctrl_arch_rmid_idx_encode(partid, pmg); + /* Confirm there are no holes in the rmid idx range */ + KUNIT_EXPECT_EQ(test, count, idx); + count++; + resctrl_arch_rmid_idx_decode(idx, &partid_out, &pmg_out); + KUNIT_EXPECT_EQ(test, pmg, pmg_out); + KUNIT_EXPECT_EQ(test, partid, partid_out); + } + } + num_idx = resctrl_arch_system_num_rmid_idx(); + KUNIT_EXPECT_EQ(test, idx + 1, num_idx); + + /* Restore global variables that were messed with */ + mpam_partid_max = orig_mpam_partid_max; + mpam_pmg_max = orig_mpam_pmg_max; +} + +static struct kunit_case mpam_resctrl_test_cases[] = { + KUNIT_CASE(test_get_mba_granularity), + KUNIT_CASE_PARAM(test_mbw_max_to_percent, test_percent_value_gen_params), + KUNIT_CASE_PARAM(test_percent_to_mbw_max, test_percent_value_gen_params), + KUNIT_CASE_PARAM(test_mbw_max_to_percent_limits, test_all_bwa_wd_gen_params), + KUNIT_CASE(test_percent_to_max_rounding), + KUNIT_CASE_PARAM(test_percent_max_roundtrip_stability, + test_all_bwa_wd_gen_params), + KUNIT_CASE_PARAM(test_rmid_idx_encoding, test_rmid_idx_gen_params), + {} +}; + +static struct kunit_suite mpam_resctrl_test_suite = { + .name = "mpam_resctrl_test_suite", + .test_cases = mpam_resctrl_test_cases, +}; + +kunit_test_suites(&mpam_resctrl_test_suite); diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 7f00c5285a32..f92a36187a52 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -5,6 +5,7 @@ #define __LINUX_ARM_MPAM_H #include <linux/acpi.h> +#include <linux/resctrl_types.h> #include <linux/types.h> struct mpam_msc; @@ -49,6 +50,37 @@ static inline int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, } #endif +bool resctrl_arch_alloc_capable(void); +bool resctrl_arch_mon_capable(void); + +void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid); +void resctrl_arch_sched_in(struct task_struct *tsk); +bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +u32 
resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); +u32 resctrl_arch_system_num_rmid_idx(void); + +struct rdt_resource; +void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, enum resctrl_event_id evtid); +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, enum resctrl_event_id evtid, void *ctx); + +/* + * The CPU configuration for MPAM is cheap to write, and is only written if it + * has changed. No need for fine grained enables. + */ +static inline void resctrl_arch_enable_mon(void) { } +static inline void resctrl_arch_disable_mon(void) { } +static inline void resctrl_arch_enable_alloc(void) { } +static inline void resctrl_arch_disable_alloc(void) { } + +static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) +{ + return val; +} + /** * mpam_register_requestor() - Register a requestor with the MPAM driver * @partid_max: The maximum PARTID value the requestor can generate. -- 2.25.1
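Note for reviewers: the flat rmid-idx space implemented by
resctrl_arch_rmid_idx_encode()/_decode() above can be sanity-checked in
isolation. Below is a minimal standalone sketch, not part of the series:
the helpers are renamed copies of the driver's arithmetic, and the
partid/pmg bounds are made-up illustrative values.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical system-wide bounds, for illustration only. */
static const uint32_t partid_max = 63;
static const uint32_t pmg_max = 1;

static uint32_t rmid_idx_encode(uint32_t closid, uint32_t rmid)
{
	return closid * (pmg_max + 1) + rmid;
}

static void rmid_idx_decode(uint32_t idx, uint32_t *closid, uint32_t *rmid)
{
	*closid = idx / (pmg_max + 1);
	*rmid = idx % (pmg_max + 1);
}

int main(void)
{
	uint32_t count = 0;

	/* The index space is dense: encode() enumerates 0..N-1 with no holes. */
	for (uint32_t closid = 0; closid <= partid_max; closid++) {
		for (uint32_t rmid = 0; rmid <= pmg_max; rmid++) {
			uint32_t c, r, idx = rmid_idx_encode(closid, rmid);

			assert(idx == count);
			count++;

			rmid_idx_decode(idx, &c, &r);
			assert(c == closid && r == rmid);
		}
	}

	/* Matches (pmg_max + 1) * (partid_max + 1): 128 with these bounds. */
	printf("num_rmid_idx = %u\n", count);
	return 0;
}

This is the same property the test_rmid_idx_encoding() KUnit case above
checks inside the kernel; the sketch merely restates it for a quick
userspace check.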
Introduce narrow PARTID (partid_nrw) feature support, which enables
many-to-one mapping of request PARTIDs (reqPARTID) to internal PARTIDs
(intPARTID). This expands monitoring capability by allowing a single
control group to track more task types through multiple reqPARTIDs per
intPARTID, bypassing the PMG limit to some extent.

intPARTID: Internal PARTID used for control group configuration.
Configurations are synchronized to all reqPARTIDs mapped to the same
intPARTID. The count is indicated by MPAMF_PARTID_NRW_IDR.INTPARTID_MAX,
or defaults to the PARTID count if narrow PARTID is unsupported.

reqPARTID: Request PARTID used to expand monitoring groups. Enables a
single control group to monitor more task types via multiple reqPARTIDs
within one intPARTID, overcoming the PMG count limitation.

Control group and monitoring limits are calculated as:

n = min(intPARTID, PARTID)	/* control groups (closids) */
l = min(reqPARTID, PARTID)	/* total reqPARTIDs */
m = l / n			/* reqPARTIDs per control group (integer division) */

Where:
intPARTID: intPARTIDs on narrow-PARTID-capable MSCs
reqPARTID: reqPARTIDs on narrow-PARTID-capable MSCs
PARTID:    PARTIDs on non-narrow-PARTID-capable MSCs

Example: L3 cache (256 PARTIDs, without narrow PARTID feature) +
MATA (32 intPARTIDs, 256 reqPARTIDs):

n = min( 32, 256) = 32 intPARTIDs
l = min(256, 256) = 256 reqPARTIDs
m = 256 / 32      = 8 reqPARTIDs per intPARTID

Implementation notes:
* Handle mixed MSC systems (some support narrow PARTID, some don't) by
  taking minimum values across all MSCs.
* resctrl_arch_get_num_closid() now returns intPARTID count (was PARTID).

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/resctrl/mpam_devices.c  | 30 ++++++++++++++++++++----------
 drivers/resctrl/mpam_internal.h |  2 ++
 drivers/resctrl/mpam_resctrl.c  |  4 ++--
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c
index 9182c8fcf003..c5b3812e86ab 100644
--- a/drivers/resctrl/mpam_devices.c
+++ b/drivers/resctrl/mpam_devices.c
@@ -63,6 +63,7 @@ static DEFINE_MUTEX(mpam_cpuhp_state_lock);
  * Generating traffic outside this range will result in screaming interrupts.
*/ u16 mpam_partid_max; +u16 mpam_intpartid_max; u8 mpam_pmg_max; static bool partid_max_init, partid_max_published; static DEFINE_SPINLOCK(partid_max_lock); @@ -288,10 +289,12 @@ int mpam_register_requestor(u16 partid_max, u8 pmg_max) { guard(spinlock)(&partid_max_lock); if (!partid_max_init) { + mpam_intpartid_max = partid_max; mpam_partid_max = partid_max; mpam_pmg_max = pmg_max; partid_max_init = true; } else if (!partid_max_published) { + mpam_intpartid_max = min(mpam_intpartid_max, partid_max); mpam_partid_max = min(mpam_partid_max, partid_max); mpam_pmg_max = min(mpam_pmg_max, pmg_max); } else { @@ -931,7 +934,9 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) u16 partid_max = FIELD_GET(MPAMF_PARTID_NRW_IDR_INTPARTID_MAX, nrwidr); mpam_set_feature(mpam_feat_partid_nrw, props); - msc->partid_max = min(msc->partid_max, partid_max); + msc->intpartid_max = min(msc->partid_max, partid_max); + } else { + msc->intpartid_max = msc->partid_max; } } @@ -995,6 +1000,7 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) spin_lock(&partid_max_lock); mpam_partid_max = min(mpam_partid_max, msc->partid_max); + mpam_intpartid_max = min(mpam_intpartid_max, msc->intpartid_max); mpam_pmg_max = min(mpam_pmg_max, msc->pmg_max); spin_unlock(&partid_max_lock); @@ -1720,7 +1726,7 @@ static int mpam_reset_ris(void *arg) mpam_init_reset_cfg(&reset_cfg); spin_lock(&partid_max_lock); - partid_max = mpam_partid_max; + partid_max = mpam_intpartid_max; spin_unlock(&partid_max_lock); for (partid = 0; partid <= partid_max; partid++) mpam_reprogram_ris_partid(ris, partid, &reset_cfg); @@ -1776,7 +1782,7 @@ static void mpam_reprogram_msc(struct mpam_msc *msc) struct mpam_write_config_arg arg; /* - * No lock for mpam_partid_max as partid_max_published has been + * No lock for mpam_intpartid_max as partid_max_published has been * set by mpam_enabled(), so the values can no longer change. */ mpam_assert_partid_sizes_fixed(); @@ -1793,7 +1799,7 @@ static void mpam_reprogram_msc(struct mpam_msc *msc) arg.comp = ris->vmsc->comp; arg.ris = ris; reset = true; - for (partid = 0; partid <= mpam_partid_max; partid++) { + for (partid = 0; partid <= mpam_intpartid_max; partid++) { cfg = &ris->vmsc->comp->cfg[partid]; if (!bitmap_empty(cfg->features, MPAM_FEATURE_LAST)) reset = false; @@ -2616,7 +2622,7 @@ static void mpam_reset_component_cfg(struct mpam_component *comp) if (!comp->cfg) return; - for (i = 0; i <= mpam_partid_max; i++) { + for (i = 0; i <= mpam_intpartid_max; i++) { comp->cfg[i] = (struct mpam_config) {}; if (cprops->cpbm_wd) comp->cfg[i].cpbm = GENMASK(cprops->cpbm_wd - 1, 0); @@ -2636,7 +2642,7 @@ static int __allocate_component_cfg(struct mpam_component *comp) if (comp->cfg) return 0; - comp->cfg = kzalloc_objs(*comp->cfg, mpam_partid_max + 1); + comp->cfg = kzalloc_objs(*comp->cfg, mpam_intpartid_max + 1); if (!comp->cfg) return -ENOMEM; @@ -2703,7 +2709,7 @@ static void mpam_enable_once(void) int err; /* - * Once the cpuhp callbacks have been changed, mpam_partid_max can no + * Once the cpuhp callbacks have been changed, mpam_intpartid_max can no * longer change. */ spin_lock(&partid_max_lock); @@ -2752,9 +2758,13 @@ static void mpam_enable_once(void) mpam_register_cpuhp_callbacks(mpam_cpu_online, mpam_cpu_offline, "mpam:online"); - /* Use printk() to avoid the pr_fmt adding the function name. 
*/ - printk(KERN_INFO "MPAM enabled with %u PARTIDs and %u PMGs\n", - mpam_partid_max + 1, mpam_pmg_max + 1); + if (mpam_partid_max == mpam_intpartid_max) + /* Use printk() to avoid the pr_fmt adding the function name. */ + printk(KERN_INFO "MPAM enabled with %u PARTIDs and %u PMGs\n", + mpam_partid_max + 1, mpam_pmg_max + 1); + else + printk(KERN_INFO "MPAM enabled with %u reqPARTIDs, %u intPARTIDs and %u PMGs\n", + mpam_partid_max + 1, mpam_intpartid_max + 1, mpam_pmg_max + 1); } static void mpam_reset_component_locked(struct mpam_component *comp) diff --git a/drivers/resctrl/mpam_internal.h b/drivers/resctrl/mpam_internal.h index 195ab821cc52..0b19742d8660 100644 --- a/drivers/resctrl/mpam_internal.h +++ b/drivers/resctrl/mpam_internal.h @@ -81,6 +81,7 @@ struct mpam_msc { */ struct mutex probe_lock; bool probed; + u16 intpartid_max; u16 partid_max; u8 pmg_max; unsigned long ris_idxs; @@ -457,6 +458,7 @@ extern struct list_head mpam_classes; /* System wide partid/pmg values */ extern u16 mpam_partid_max; +extern u16 mpam_intpartid_max; extern u8 mpam_pmg_max; /* Scheduled work callback to enable mpam once all MSC have been probed */ diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c index 19b306017845..222ea1d199e1 100644 --- a/drivers/resctrl/mpam_resctrl.c +++ b/drivers/resctrl/mpam_resctrl.c @@ -206,7 +206,7 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level rid, bool enable) mpam_resctrl_controls[RDT_RESOURCE_MBA].resctrl_res.alloc_capable = true; if (enable) { - if (mpam_partid_max < 1) + if (mpam_intpartid_max < 1) return -EINVAL; partid_d = resctrl_get_config_index(RESCTRL_RESERVED_CLOSID, CDP_DATA); @@ -237,7 +237,7 @@ static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) */ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) { - return mpam_partid_max + 1; + return mpam_intpartid_max + 1; } u32 resctrl_arch_system_num_rmid_idx(void) -- 2.25.1
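For reference, the limit arithmetic above can be checked with a small standalone C program. This is an illustration only: the variable names are invented stand-ins for the formulas in the commit message, not driver symbols, and the values are taken from the example system.

        #include <stdio.h>

        static unsigned int min_u(unsigned int a, unsigned int b)
        {
                return a < b ? a : b;
        }

        int main(void)
        {
                /* Example system from the commit message. */
                unsigned int nrw_intpartids = 32;   /* MATA intPARTIDs */
                unsigned int nrw_reqpartids = 256;  /* MATA reqPARTIDs */
                unsigned int plain_partids = 256;   /* L3 PARTIDs (no narrow PARTID) */

                unsigned int n = min_u(nrw_intpartids, plain_partids); /* closids */
                unsigned int l = min_u(nrw_reqpartids, plain_partids); /* total reqPARTIDs */
                unsigned int m = l / n; /* reqPARTIDs per control group */

                printf("n=%u l=%u m=%u\n", n, l, m); /* prints n=32 l=256 m=8 */
                return 0;
        }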
MPAM supports mixed systems with MSCs that may or may not implement Narrow-PARTID. However, when the MBA MSC uses percentage-based throttling (non-bitmap partition control) and lacks Narrow-PARTID support, resctrl cannot correctly apply control group configurations across multiple PARTIDs. Since there is no straightforward way to program compatible control values in this scenario, disable Narrow-PARTID system-wide when detected. The detection occurs at initialization time on the first call to get_num_reqpartid() from mpam_resctrl_pick_counters(), which is guaranteed to occur after mpam_resctrl_pick_mba() has set up the MBA resource class. When disabled, get_num_reqpartid() falls back to returning the number of internal PARTIDs (mpam_intpartid_max + 1). Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/resctrl/mpam_resctrl.c | 38 +++++++++++++++++++++++++++++++++- 1 file changed, 37 insertions(+), 1 deletion(-) diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c index 222ea1d199e1..0245b16c4531 100644 --- a/drivers/resctrl/mpam_resctrl.c +++ b/drivers/resctrl/mpam_resctrl.c @@ -240,9 +240,45 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) return mpam_intpartid_max + 1; } +/* + * Determine the effective number of PARTIDs available for resctrl. + * + * This function performs a one-time check to determine if Narrow-PARTID + * can be used. It must be called after mpam_resctrl_pick_mba() has + * initialized the MBA resource, as the MBA class properties are used + * to detect Narrow-PARTID support. + * + * The first call occurs in mpam_resctrl_pick_counters(), ensuring the + * prerequisite initialization is complete. + */ +static u32 get_num_reqpartid(void) +{ + struct mpam_resctrl_res *res; + struct rdt_resource *r_mba; + struct mpam_props *cprops; + static bool first = true; + + if (first) { + r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + res = container_of(r_mba, struct mpam_resctrl_res, resctrl_res); + if (!res->class) + goto out; + + /* If MBA lacks Narrow-PARTID support, fall back to intPARTIDs. */ + cprops = &res->class->props; + if (!mpam_has_feature(mpam_feat_partid_nrw, cprops)) + mpam_partid_max = mpam_intpartid_max; + + } + +out: + first = false; + return mpam_partid_max + 1; +} + u32 resctrl_arch_system_num_rmid_idx(void) { - return (mpam_pmg_max + 1) * (mpam_partid_max + 1); + return (mpam_pmg_max + 1) * get_num_reqpartid(); } u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) -- 2.25.1
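The shape of this one-time fallback can be sketched in isolation. The snippet below is a hypothetical standalone model of get_num_reqpartid(): the two variables stand in for the mpam_partid_max/mpam_intpartid_max globals, and the MBA feature test is reduced to a flag argument.

        #include <stdio.h>

        static unsigned int partid_max = 255;    /* stand-in for mpam_partid_max */
        static unsigned int intpartid_max = 31;  /* stand-in for mpam_intpartid_max */

        static unsigned int num_reqpartid(int mba_has_narrow_partid)
        {
                static int checked;

                if (!checked) {
                        /*
                         * One-time check: if MBA cannot narrow PARTIDs, collapse
                         * the reqPARTID range back onto the intPARTID range.
                         */
                        if (!mba_has_narrow_partid)
                                partid_max = intpartid_max;
                        checked = 1;
                }

                return partid_max + 1;
        }

        int main(void)
        {
                printf("%u usable reqPARTIDs\n", num_reqpartid(0)); /* prints 32 */
                return 0;
        }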
The Narrow PARTID feature allows the MPAM driver to statically or dynamically allocate request PARTIDs (reqPARTIDs) to internal PARTIDs (intPARTIDs). This enables expanding the number of monitoring groups beyond the hardware PMG limit. For systems with mixed MSCs (Memory System Components), MSCs that do not support narrow PARTID use PARTIDs exceeding the minimum number of intPARTIDs as reqPARTIDs to expand monitoring groups.

Expand RMID to include reqPARTID information:

  rmid = reqPARTID * NUM_PMG + PMG

To maintain compatibility with the existing resctrl layer, reqPARTIDs are allocated statically with a linear mapping to intPARTIDs via req2intpartid(). Mapping relationships (n = intPARTID count, m = reqPARTIDs per intPARTID):

P - Partition group
M - Monitoring group

Group  closid  rmid.reqPARTID  MSCs w/ narrow-PARTID  MSCs w/o narrow-PARTID
P1     0                       intPARTID_1            PARTID_1
M1_1   0       0               ├── reqPARTID_1_1      ├── PARTID_1
M1_2   0       0+n             ├── reqPARTID_1_2      ├── PARTID_1_2
M1_3   0       0+n*2           ├── reqPARTID_1_3      ├── PARTID_1_3
...            ...             ├── ...                ├── ...
M1_m   0       0+n*(m-1)       └── reqPARTID_1_m      └── PARTID_1_m

P2     1                       intPARTID_2            PARTID_2
M2_1   1       1               ├── reqPARTID_2_1      ├── PARTID_2
M2_2   1       1+n             ├── reqPARTID_2_2      ├── PARTID_2_2
M2_3   1       1+n*2           ├── reqPARTID_2_3      ├── PARTID_2_3
...            ...             ├── ...                ├── ...
M2_m   1       1+n*(m-1)       └── reqPARTID_2_m      └── PARTID_2_m

Pn     n-1                     intPARTID_n            PARTID_n
Mn_1   n-1     n-1             ├── reqPARTID_n_1      ├── PARTID_n
Mn_2   n-1     n-1+n           ├── reqPARTID_n_2      ├── PARTID_n_2
Mn_3   n-1     n-1+n*2         ├── reqPARTID_n_3      ├── PARTID_n_3
...            ...             ├── ...                ├── ...
Mn_m   n-1     n*m-1           └── reqPARTID_n_m      └── PARTID_n_m

Refactor the glue layer between resctrl abstractions (rmid) and MPAM hardware registers (reqPARTID/PMG) to support narrow PARTID. The resctrl layer uses rmid2reqpartid() and rmid2pmg() to extract components from rmid. The closid-to-intPARTID translation remains unchanged via resctrl_get_config_index(). Since narrow PARTID is a monitoring enhancement, reqPARTID is only used in monitoring paths while configuration paths retain the original closid semantics.

Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/resctrl/mpam_resctrl.c | 101 +++++++++++++++++++++++---------- 1 file changed, 71 insertions(+), 30 deletions(-) diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c index 0245b16c4531..6cbca21096f7 100644 --- a/drivers/resctrl/mpam_resctrl.c +++ b/drivers/resctrl/mpam_resctrl.c @@ -281,15 +281,60 @@ u32 resctrl_arch_system_num_rmid_idx(void) return (mpam_pmg_max + 1) * get_num_reqpartid(); } +static u16 rmid2reqpartid(u32 rmid) +{ + rmid /= (mpam_pmg_max + 1); + + if (cdp_enabled) + return resctrl_get_config_index(rmid, CDP_DATA); + + return resctrl_get_config_index(rmid, CDP_NONE); +} + +static u8 rmid2pmg(u32 rmid) +{ + return rmid % (mpam_pmg_max + 1); +} + +static u16 req2intpartid(u16 reqpartid) +{ + return reqpartid % (mpam_intpartid_max + 1); +} + +/* + * To avoid the reuse of rmid across multiple control groups, check + * the incoming closid to prevent rmid from being reallocated by + * resctrl_find_free_rmid(). + * + * If the closid and rmid do not match upon inspection, return an + * invalid rmid immediately. A valid rmid must not exceed 24 bits.
+ */ u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) { - return closid * (mpam_pmg_max + 1) + rmid; + u32 reqpartid = rmid2reqpartid(rmid); + u32 intpartid = req2intpartid(reqpartid); + + if (cdp_enabled) + intpartid >>= 1; + + if (closid != intpartid) + return U32_MAX; + + return rmid; } void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) { - *closid = idx / (mpam_pmg_max + 1); - *rmid = idx % (mpam_pmg_max + 1); + u32 reqpartid = rmid2reqpartid(idx); + u32 intpartid = req2intpartid(reqpartid); + + if (rmid) + *rmid = idx; + if (closid) { + if (cdp_enabled) + intpartid >>= 1; + *closid = intpartid; + } } void resctrl_arch_sched_in(struct task_struct *tsk) @@ -301,21 +346,17 @@ void resctrl_arch_sched_in(struct task_struct *tsk) void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid) { - WARN_ON_ONCE(closid > U16_MAX); - WARN_ON_ONCE(rmid > U8_MAX); + u32 reqpartid = rmid2reqpartid(rmid); + u8 pmg = rmid2pmg(rmid); - if (!cdp_enabled) { - mpam_set_cpu_defaults(cpu, closid, closid, rmid, rmid); - } else { + if (!cdp_enabled) + mpam_set_cpu_defaults(cpu, reqpartid, reqpartid, pmg, pmg); + else /* * When CDP is enabled, resctrl halves the closid range and we * use odd/even partid for one closid. */ - u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); - u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); - - mpam_set_cpu_defaults(cpu, partid_d, partid_i, rmid, rmid); - } + mpam_set_cpu_defaults(cpu, reqpartid, reqpartid + 1, pmg, pmg); } void resctrl_arch_sync_cpu_closid_rmid(void *info) @@ -334,17 +375,16 @@ void resctrl_arch_sync_cpu_closid_rmid(void *info) void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) { - WARN_ON_ONCE(closid > U16_MAX); - WARN_ON_ONCE(rmid > U8_MAX); + u32 reqpartid = rmid2reqpartid(rmid); + u8 pmg = rmid2pmg(rmid); - if (!cdp_enabled) { - mpam_set_task_partid_pmg(tsk, closid, closid, rmid, rmid); - } else { - u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); - u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + WARN_ON_ONCE(reqpartid > U16_MAX); + WARN_ON_ONCE(pmg > U8_MAX); - mpam_set_task_partid_pmg(tsk, partid_d, partid_i, rmid, rmid); - } + if (!cdp_enabled) + mpam_set_task_partid_pmg(tsk, reqpartid, reqpartid, pmg, pmg); + else + mpam_set_task_partid_pmg(tsk, reqpartid, reqpartid + 1, pmg, pmg); } bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) @@ -352,6 +392,8 @@ bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) u64 regval = mpam_get_regval(tsk); u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); + tsk_closid = req2intpartid(tsk_closid); + if (cdp_enabled) tsk_closid >>= 1; @@ -362,13 +404,11 @@ bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid) { u64 regval = mpam_get_regval(tsk); - u32 tsk_closid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); - u32 tsk_rmid = FIELD_GET(MPAM0_EL1_PMG_D, regval); - - if (cdp_enabled) - tsk_closid >>= 1; + u32 tsk_partid = FIELD_GET(MPAM0_EL1_PARTID_D, regval); + u32 tsk_pmg = FIELD_GET(MPAM0_EL1_PMG_D, regval); - return (tsk_closid == closid) && (tsk_rmid == rmid); + return (tsk_partid == rmid2reqpartid(rmid)) && + (tsk_pmg == rmid2pmg(rmid)); } struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) @@ -456,6 +496,7 @@ static int __read_mon(struct mpam_resctrl_mon *mon, struct mpam_component *mon_c enum resctrl_conf_type cdp_type, u32 closid, u32 rmid, u64 *val) { struct mon_cfg 
cfg; + u32 reqpartid = rmid2reqpartid(rmid); if (!mpam_is_enabled()) return -EINVAL; @@ -479,8 +520,8 @@ static int __read_mon(struct mpam_resctrl_mon *mon, struct mpam_component *mon_c cfg = (struct mon_cfg) { .mon = mon_idx, .match_pmg = true, - .partid = closid, - .pmg = rmid, + .partid = (cdp_type == CDP_CODE) ? reqpartid + 1 : reqpartid, + .pmg = rmid2pmg(rmid), }; return mpam_msmon_read(mon_comp, &cfg, mon_type, val); -- 2.25.1
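The rmid split itself is plain integer arithmetic; a standalone round-trip check looks like the following. This sketch omits the CDP adjustment that the real rmid2reqpartid() applies via resctrl_get_config_index(), and NUM_PMG is a hypothetical constant standing in for mpam_pmg_max + 1.

        #include <assert.h>

        #define NUM_PMG 4 /* hypothetical: mpam_pmg_max + 1 */

        static unsigned int rmid_encode(unsigned int reqpartid, unsigned int pmg)
        {
                return reqpartid * NUM_PMG + pmg;
        }

        static unsigned int reqpartid_of(unsigned int rmid)
        {
                return rmid / NUM_PMG;
        }

        static unsigned int pmg_of(unsigned int rmid)
        {
                return rmid % NUM_PMG;
        }

        int main(void)
        {
                unsigned int rmid = rmid_encode(5, 2);

                assert(reqpartid_of(rmid) == 5);
                assert(pmg_of(rmid) == 2);
                return 0;
        }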
With the narrow PARTID feature, each control group is assigned multiple (req)PARTIDs to expand monitoring capacity. When a control group's configuration is updated, all associated sub-monitoring groups (each identified by a unique reqPARTID) should be synchronized. In __write_config(), iterate over all reqPARTIDs belonging to the control group and propagate the configuration to each sub-monitoring group: 1. For MSCs supporting narrow PARTID, establish the reqPARTID to intPARTID mapping. 2. For MSCs without narrow PARTID support, synchronize the configuration to new PARTIDs directly. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/resctrl/mpam_devices.c | 33 +++++++++++++++++++++++++++------ drivers/resctrl/mpam_internal.h | 2 ++ drivers/resctrl/mpam_resctrl.c | 2 +- 3 files changed, 30 insertions(+), 7 deletions(-) diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c index c5b3812e86ab..52533ee0e6ce 100644 --- a/drivers/resctrl/mpam_devices.c +++ b/drivers/resctrl/mpam_devices.c @@ -1541,6 +1541,7 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, { u32 pri_val = 0; u16 cmax = MPAMCFG_CMAX_CMAX; + u16 intpartid = req2intpartid(partid); struct mpam_msc *msc = ris->vmsc->msc; struct mpam_props *rprops = &ris->props; u16 dspri = GENMASK(rprops->dspri_wd, 0); @@ -1550,15 +1551,17 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, __mpam_part_sel(ris->ris_idx, partid, msc); if (mpam_has_feature(mpam_feat_partid_nrw, rprops)) { - /* Update the intpartid mapping */ mpam_write_partsel_reg(msc, INTPARTID, - MPAMCFG_INTPARTID_INTERNAL | partid); + MPAMCFG_INTPARTID_INTERNAL | intpartid); /* - * Then switch to the 'internal' partid to update the - * configuration. + * Mapping from reqpartid to intpartid already established. + * Sub-monitoring groups share the parent's configuration. */ - __mpam_intpart_sel(ris->ris_idx, partid, msc); + if (partid != intpartid) + goto out; + + __mpam_intpart_sel(ris->ris_idx, intpartid, msc); } if (mpam_has_feature(mpam_feat_cpor_part, rprops) && @@ -1630,6 +1633,7 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, mpam_quirk_post_config_change(ris, partid, cfg); +out: mutex_unlock(&msc->part_sel_lock); } @@ -1764,11 +1768,28 @@ struct mpam_write_config_arg { u16 partid; }; +static u32 get_num_reqpartid_per_intpartid(void) +{ + return (mpam_partid_max + 1) / (mpam_intpartid_max + 1); +} + static int __write_config(void *arg) { + int closid_num = resctrl_arch_get_num_closid(NULL); struct mpam_write_config_arg *c = arg; + u32 reqpartid, req_idx; - mpam_reprogram_ris_partid(c->ris, c->partid, &c->comp->cfg[c->partid]); + /* c->partid should be within the range of intPARTIDs */ + WARN_ON_ONCE(c->partid >= closid_num); + + /* Synchronize the configuration to each sub-monitoring group. 
*/ + for (req_idx = 0; req_idx < get_num_reqpartid_per_intpartid(); + req_idx++) { + reqpartid = req_idx * closid_num + c->partid; + + mpam_reprogram_ris_partid(c->ris, reqpartid, + &c->comp->cfg[c->partid]); + } return 0; } diff --git a/drivers/resctrl/mpam_internal.h b/drivers/resctrl/mpam_internal.h index 0b19742d8660..013162faf37f 100644 --- a/drivers/resctrl/mpam_internal.h +++ b/drivers/resctrl/mpam_internal.h @@ -478,6 +478,8 @@ void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx); int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level, cpumask_t *affinity); +u16 req2intpartid(u16 reqpartid); + #ifdef CONFIG_RESCTRL_FS int mpam_resctrl_setup(void); void mpam_resctrl_exit(void); diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c index 6cbca21096f7..bb2fdfc03b75 100644 --- a/drivers/resctrl/mpam_resctrl.c +++ b/drivers/resctrl/mpam_resctrl.c @@ -296,7 +296,7 @@ static u8 rmid2pmg(u32 rmid) return rmid % (mpam_pmg_max + 1); } -static u16 req2intpartid(u16 reqpartid) +u16 req2intpartid(u16 reqpartid) { return reqpartid % (mpam_intpartid_max + 1); } -- 2.25.1
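The static layout that __write_config() walks can be reproduced with the same arithmetic in isolation: with n closids, closid p owns reqPARTIDs p, p+n, p+2n, and so on. The values below are the example system's, and the loop is illustrative, not driver code.

        #include <stdio.h>

        int main(void)
        {
                unsigned int closid_num = 32;     /* n: number of intPARTIDs/closids */
                unsigned int reqs_per_closid = 8; /* m: reqPARTIDs per control group */
                unsigned int partid = 5;          /* this control group's intPARTID */
                unsigned int req_idx;

                for (req_idx = 0; req_idx < reqs_per_closid; req_idx++)
                        printf("sync config to reqPARTID %u\n",
                               req_idx * closid_num + partid); /* 5, 37, 69, ... */
                return 0;
        }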
Introduce helper functions for rmid_entry management, in preparation for upcoming patches supporting dynamic monitoring group allocation: - rmid_is_occupied(): Query whether a rmid_entry is currently allocated by checking if its list node has been removed from the free list. - rmid_entry_reassign_closid(): Update the closid associated with a rmid entry. Fix list node initialization in alloc_rmid() and setup_rmid_lru_list() by using list_del_init() instead of list_del(). This ensures list_empty() checks in rmid_is_occupied() work correctly without encountering LIST_POISON values. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- fs/resctrl/monitor.c | 18 ++++++++++++++++-- include/linux/resctrl.h | 21 +++++++++++++++++++++ 2 files changed, 37 insertions(+), 2 deletions(-) diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index 49f3f6b846b2..4e78fb194c16 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -284,7 +284,7 @@ int alloc_rmid(u32 closid) if (IS_ERR(entry)) return PTR_ERR(entry); - list_del(&entry->list); + list_del_init(&entry->list); return entry->rmid; } @@ -344,6 +344,20 @@ void free_rmid(u32 closid, u32 rmid) list_add_tail(&entry->list, &rmid_free_lru); } +bool rmid_is_occupied(u32 closid, u32 rmid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + + return list_empty(&rmid_ptrs[idx].list); +} + +void rmid_entry_reassign_closid(u32 closid, u32 rmid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + + rmid_ptrs[idx].closid = closid; +} + static struct mbm_state *get_mbm_state(struct rdt_l3_mon_domain *d, u32 closid, u32 rmid, enum resctrl_event_id evtid) { @@ -943,7 +957,7 @@ int setup_rmid_lru_list(void) idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, RESCTRL_RESERVED_RMID); entry = __rmid_entry(idx); - list_del(&entry->list); + list_del_init(&entry->list); return 0; } diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 006e57fd7ca5..b636e7250c20 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -702,6 +702,27 @@ bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r); extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit; +/** + * rmid_is_occupied() - Check whether the specified rmid has been + * allocated. + * @closid: The closid that matches the rmid. + * @rmid: The rmid entry whose allocation state is checked. + * + * This function checks if the rmid_entry is currently allocated by testing + * whether its list node is empty (removed from the free list). + * + * Return: + * True if the specified rmid is still in use. + */ +bool rmid_is_occupied(u32 closid, u32 rmid); + +/** + * rmid_entry_reassign_closid() - Update the closid field of a rmid_entry. + * @closid: The closid to assign. + * @rmid: The rmid entry whose closid is updated. + */ +void rmid_entry_reassign_closid(u32 closid, u32 rmid); + int resctrl_init(void); void resctrl_exit(void); -- 2.25.1
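The reason list_del_init() matters here can be seen with a trimmed-down re-implementation of the list primitives. This is a sketch for illustration only: the kernel's real list.h is more elaborate, but the poisoning behaviour modelled below is why list_empty() on a list_del()'d node cannot be trusted.

        #include <assert.h>

        struct list_head {
                struct list_head *next, *prev;
        };

        #define LIST_POISON1 ((struct list_head *)0x100)
        #define LIST_POISON2 ((struct list_head *)0x122)

        static void list_del(struct list_head *e)
        {
                e->prev->next = e->next;
                e->next->prev = e->prev;
                e->next = LIST_POISON1; /* the kernel poisons removed nodes */
                e->prev = LIST_POISON2;
        }

        static void list_del_init(struct list_head *e)
        {
                e->prev->next = e->next;
                e->next->prev = e->prev;
                e->next = e; /* self-linked, so list_empty() is reliably true */
                e->prev = e;
        }

        static int list_empty(const struct list_head *h)
        {
                return h->next == h;
        }

        int main(void)
        {
                struct list_head free_lru = { &free_lru, &free_lru };
                struct list_head entry;

                /* Put entry on the free list, then allocate (remove) it. */
                entry.next = free_lru.next;
                entry.prev = &free_lru;
                free_lru.next->prev = &entry;
                free_lru.next = &entry;

                list_del_init(&entry);
                assert(list_empty(&entry)); /* "occupied": off the free list */
                (void)list_del;             /* list_del() would leave poison here */
                return 0;
        }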
Replace static reqPARTID allocation with dynamic binding to maximize monitoring group utilization. Static allocation wastes resources when control groups create fewer sub-groups than the pre-allocated limit. Add a lookup table (reqpartid_map) to dynamically bind reqPARTIDs to control groups needing extended monitoring capacity: - resctrl_arch_rmid_expand(): Find and bind a free reqPARTID to the specified closid when creating monitoring groups. - resctrl_arch_rmid_reclaim(): Unbind the reqPARTID when all monitoring groups associated with its pmg are freed, making it available for reuse. Update conversion helpers for dynamic mapping: - req2intpartid(): Use the lookup table for dynamic entries - Add partid2closid() and req_pmg2rmid() helpers Refactor __write_config() to iterate over all reqPARTIDs that map to the matching intPARTID, removing the fixed per-closid slot assumption. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/resctrl/mpam_devices.c | 21 ++--- drivers/resctrl/mpam_internal.h | 1 + drivers/resctrl/mpam_resctrl.c | 141 +++++++++++++++++++++++++++++--- include/linux/arm_mpam.h | 17 ++++ 4 files changed, 156 insertions(+), 24 deletions(-) diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c index 52533ee0e6ce..b05900f15f4f 100644 --- a/drivers/resctrl/mpam_devices.c +++ b/drivers/resctrl/mpam_devices.c @@ -1768,27 +1768,24 @@ struct mpam_write_config_arg { u16 partid; }; -static u32 get_num_reqpartid_per_intpartid(void) -{ - return (mpam_partid_max + 1) / (mpam_intpartid_max + 1); -} - static int __write_config(void *arg) { int closid_num = resctrl_arch_get_num_closid(NULL); struct mpam_write_config_arg *c = arg; - u32 reqpartid, req_idx; + u32 reqpartid; /* c->partid should be within the range of intPARTIDs */ WARN_ON_ONCE(c->partid >= closid_num); - /* Synchronize the configuration to each sub-monitoring group. */ - for (req_idx = 0; req_idx < get_num_reqpartid_per_intpartid(); - req_idx++) { - reqpartid = req_idx * closid_num + c->partid; + mpam_reprogram_ris_partid(c->ris, c->partid, + &c->comp->cfg[c->partid]); - mpam_reprogram_ris_partid(c->ris, reqpartid, - &c->comp->cfg[c->partid]); + /* Synchronize the configuration to each sub-monitoring group. */ + for (reqpartid = closid_num; + reqpartid < get_num_reqpartid(); reqpartid++) { + if (req2intpartid(reqpartid) == c->partid) + mpam_reprogram_ris_partid(c->ris, reqpartid, + &c->comp->cfg[c->partid]); } return 0; diff --git a/drivers/resctrl/mpam_internal.h b/drivers/resctrl/mpam_internal.h index 013162faf37f..d03630e28be8 100644 --- a/drivers/resctrl/mpam_internal.h +++ b/drivers/resctrl/mpam_internal.h @@ -479,6 +479,7 @@ int mpam_get_cpumask_from_cache_id(unsigned long cache_id, u32 cache_level, cpumask_t *affinity); u16 req2intpartid(u16 reqpartid); +u32 get_num_reqpartid(void); #ifdef CONFIG_RESCTRL_FS int mpam_resctrl_setup(void); diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c index bb2fdfc03b75..a3f027ad322c 100644 --- a/drivers/resctrl/mpam_resctrl.c +++ b/drivers/resctrl/mpam_resctrl.c @@ -251,7 +251,7 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) * The first call occurs in mpam_resctrl_pick_counters(), ensuring the * prerequisite initialization is complete.
*/ -static u32 get_num_reqpartid(void) +u32 get_num_reqpartid(void) { struct mpam_resctrl_res *res; struct rdt_resource *r_mba; @@ -296,9 +296,36 @@ static u8 rmid2pmg(u32 rmid) return rmid % (mpam_pmg_max + 1); } +static u32 req_pmg2rmid(u32 reqpartid, u8 pmg) +{ + if (cdp_enabled) + reqpartid >>= 1; + + return reqpartid * (mpam_pmg_max + 1) + pmg; +} + +static u32 *reqpartid_map; + u16 req2intpartid(u16 reqpartid) { - return reqpartid % (mpam_intpartid_max + 1); + WARN_ON_ONCE(reqpartid >= get_num_reqpartid()); + + /* + * Return the intPARTID directly so that mpam_reset_ris() does not + * access a NULL reqpartid_map. + */ + if (reqpartid < (mpam_intpartid_max + 1)) + return reqpartid; + + return reqpartid_map[reqpartid]; +} + +static u32 partid2closid(u32 partid) +{ + if (cdp_enabled) + partid >>= 1; + + return partid; } /* @@ -312,12 +339,12 @@ u16 req2intpartid(u16 reqpartid) u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) { u32 reqpartid = rmid2reqpartid(rmid); - u32 intpartid = req2intpartid(reqpartid); - if (cdp_enabled) - intpartid >>= 1; + /* When CDP mode is enabled, filter invalid rmid entries out */ + if (reqpartid >= get_num_reqpartid()) + return U32_MAX; - if (closid != intpartid) + if (closid != partid2closid(req2intpartid(reqpartid))) return U32_MAX; return rmid; @@ -330,11 +357,9 @@ void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) if (rmid) *rmid = idx; - if (closid) { - if (cdp_enabled) - intpartid >>= 1; - *closid = intpartid; - } + + if (closid) + *closid = partid2closid(intpartid); } void resctrl_arch_sched_in(struct task_struct *tsk) @@ -1822,6 +1847,91 @@ void mpam_resctrl_offline_cpu(unsigned int cpu) } } +static int reqpartid_init(void) +{ + int req_num, idx; + + req_num = get_num_reqpartid(); + reqpartid_map = kcalloc(req_num, sizeof(u32), GFP_KERNEL); + if (!reqpartid_map) + return -ENOMEM; + + for (idx = 0; idx < req_num; idx++) + reqpartid_map[idx] = idx; + + return 0; +} + +static void reqpartid_exit(void) +{ + kfree(reqpartid_map); +} + +static void update_rmid_entries_for_reqpartid(u32 reqpartid) +{ + int pmg; + u32 intpartid = reqpartid_map[reqpartid]; + u32 closid = partid2closid(intpartid); + + for (pmg = 0; pmg <= mpam_pmg_max; pmg++) + rmid_entry_reassign_closid(closid, req_pmg2rmid(reqpartid, pmg)); +} + +int resctrl_arch_rmid_expand(u32 closid) +{ + int i; + + for (i = resctrl_arch_get_num_closid(NULL); + i < get_num_reqpartid(); i++) { + /* + * If reqpartid_map[i] > mpam_intpartid_max, + * it means the reqpartid 'i' is free.
+ */ + if (reqpartid_map[i] >= resctrl_arch_get_num_closid(NULL)) { + if (cdp_enabled) { + reqpartid_map[i] = resctrl_get_config_index(closid, CDP_DATA); + reqpartid_map[i + 1] = resctrl_get_config_index(closid, CDP_CODE); + } else { + reqpartid_map[i] = resctrl_get_config_index(closid, CDP_NONE); + } + update_rmid_entries_for_reqpartid(i); + return i; + } + } + + return -ENOSPC; +} + +void resctrl_arch_rmid_reclaim(u32 closid, u32 rmid) +{ + int pmg; + u32 intpartid; + int reqpartid = rmid2reqpartid(rmid); + + if (reqpartid < resctrl_arch_get_num_closid(NULL)) + return; + + if (cdp_enabled) + intpartid = resctrl_get_config_index(closid, CDP_DATA); + else + intpartid = resctrl_get_config_index(closid, CDP_NONE); + + WARN_ON_ONCE(intpartid != req2intpartid(reqpartid)); + + for (pmg = 0; pmg <= mpam_pmg_max; pmg++) { + if (rmid_is_occupied(closid, req_pmg2rmid(reqpartid, pmg))) + break; + } + + if (pmg > mpam_pmg_max) { + reqpartid_map[reqpartid] = reqpartid; + if (cdp_enabled) + reqpartid_map[reqpartid + 1] = reqpartid + 1; + + update_rmid_entries_for_reqpartid(reqpartid); + } +} + int mpam_resctrl_setup(void) { int err = 0; @@ -1877,10 +1987,16 @@ int mpam_resctrl_setup(void) return -EOPNOTSUPP; } - err = resctrl_init(); + err = reqpartid_init(); if (err) return err; + err = resctrl_init(); + if (err) { + reqpartid_exit(); + return err; + } + WRITE_ONCE(resctrl_enabled, true); return 0; @@ -1898,6 +2014,7 @@ void mpam_resctrl_exit(void) WRITE_ONCE(resctrl_enabled, false); resctrl_exit(); + reqpartid_exit(); } static void mpam_resctrl_teardown_mon(struct mpam_resctrl_mon *mon, struct mpam_class *class) diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index f92a36187a52..d45422965907 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -59,6 +59,23 @@ void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid); void resctrl_arch_sched_in(struct task_struct *tsk); bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); + +/** + * resctrl_arch_rmid_expand() - Expand the RMID resources for the specified closid. + * @closid: closid that matches the rmid. + * + * Return: + * 0 on success, or -ENOSPC etc on error. + */ +int resctrl_arch_rmid_expand(u32 closid); + +/** + * resctrl_arch_rmid_reclaim() - Reclaim the rmid resources for the specified closid. + * @closid: closid that matches the rmid. + * @rmid: Reclaim the rmid specified. + */ +void resctrl_arch_rmid_reclaim(u32 closid, u32 rmid); + u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); u32 resctrl_arch_system_num_rmid_idx(void); -- 2.25.1
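The lookup-table lifecycle reduces to: a free slot keeps an identity mapping (which lands above the closid range), binding overwrites the slot with the owning intPARTID, and reclaim restores the identity mapping. A minimal standalone sketch, without the CDP pairing and with hypothetical sizes:

        #include <stdio.h>

        #define NUM_CLOSID 4     /* intPARTIDs, i.e. mpam_intpartid_max + 1 */
        #define NUM_REQPARTID 8  /* total reqPARTIDs */

        static unsigned int map[NUM_REQPARTID];

        static void map_init(void)
        {
                unsigned int i;

                /* Identity mapping: slots >= NUM_CLOSID map above the
                 * closid range and are therefore free. */
                for (i = 0; i < NUM_REQPARTID; i++)
                        map[i] = i;
        }

        static int expand(unsigned int closid)
        {
                unsigned int i;

                for (i = NUM_CLOSID; i < NUM_REQPARTID; i++) {
                        if (map[i] >= NUM_CLOSID) { /* free slot */
                                map[i] = closid;    /* bind to this control group */
                                return (int)i;
                        }
                }
                return -1; /* -ENOSPC analogue */
        }

        static void reclaim(unsigned int reqpartid)
        {
                map[reqpartid] = reqpartid; /* identity again == free */
        }

        int main(void)
        {
                int req;

                map_init();
                req = expand(2);
                printf("bound reqPARTID %d to closid 2\n", req); /* reqPARTID 4 */
                reclaim((unsigned int)req);
                return 0;
        }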
The previous patch implemented resctrl_arch_rmid_expand() and resctrl_arch_rmid_reclaim() for ARM MPAM. This patch integrates these architecture-specific functions into the generic resctrl layer. Refactor resctrl_find_free_rmid() to support dynamic rmid expansion. If no free rmid is available for the current closid, attempt to expand via resctrl_arch_rmid_expand(). On success, retry the rmid allocation. As this capability is architecture-specific, x86 maintains existing behavior by returning -ENOSPC when rmid resources are exhausted. Additionally, invoke resctrl_arch_rmid_reclaim() when rmids are released to enable architecture-specific resource cleanup. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 7 +++++++ fs/resctrl/monitor.c | 32 ++++++++++++++++++++++++++++++-- 2 files changed, 37 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 575f8408a9e7..7f8c8d84f2a0 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -167,6 +167,13 @@ static inline void resctrl_arch_sched_in(struct task_struct *tsk) __resctrl_sched_in(tsk); } +static inline int resctrl_arch_rmid_expand(u32 closid) +{ + return -ENOSPC; +} + +static inline void resctrl_arch_rmid_reclaim(u32 closid, u32 rmid) {} + static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) { *rmid = idx; diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index 4e78fb194c16..c58440eb7cc9 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -122,6 +122,8 @@ static void limbo_release_entry(struct rmid_entry *entry) if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) closid_num_dirty_rmid[entry->closid]--; + + resctrl_arch_rmid_reclaim(entry->closid, entry->rmid); } /* @@ -197,7 +199,7 @@ bool has_busy_rmid(struct rdt_l3_mon_domain *d) return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; } -static struct rmid_entry *resctrl_find_free_rmid(u32 closid) +static struct rmid_entry *__resctrl_find_free_rmid(u32 closid) { struct rmid_entry *itr; u32 itr_idx, cmp_idx; @@ -214,7 +216,12 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid) * very first entry will be returned. */ itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid); + if (itr_idx == U32_MAX) + continue; + cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid); + if (cmp_idx == U32_MAX) + continue; if (itr_idx == cmp_idx) return itr; @@ -223,6 +230,25 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid) return ERR_PTR(-ENOSPC); } +static struct rmid_entry *resctrl_find_free_rmid(u32 closid) +{ + struct rmid_entry *err; + int ret; + + err = __resctrl_find_free_rmid(closid); + if (err == ERR_PTR(-ENOSPC)) { + ret = resctrl_arch_rmid_expand(closid); + if (ret < 0) + /* Out of rmid */ + goto out; + + /* Try it again */ + return __resctrl_find_free_rmid(closid); + } +out: + return err; +} + /** * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated * RMID are clean, or the CLOSID that has @@ -340,8 +366,10 @@ void free_rmid(u32 closid, u32 rmid) if (resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) add_rmid_to_limbo(entry); - else + else { list_add_tail(&entry->list, &rmid_free_lru); + resctrl_arch_rmid_reclaim(closid, rmid); + } } bool rmid_is_occupied(u32 closid, u32 rmid) -- 2.25.1
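The allocation flow is a plain expand-and-retry pattern. The sketch below models it with hypothetical stand-ins (find_free(), arch_expand()) rather than the real resctrl internals; only the control flow mirrors resctrl_find_free_rmid().

        #include <stdio.h>
        #include <errno.h>

        /* Hypothetical stand-ins for the resctrl internals. */
        static int free_rmids_left;

        static int find_free(unsigned int closid)
        {
                (void)closid;
                return free_rmids_left > 0 ? free_rmids_left-- : -ENOSPC;
        }

        static int arch_expand(unsigned int closid)
        {
                (void)closid;
                free_rmids_left += 4; /* the architecture grants more rmids */
                return 0;
        }

        static int alloc_rmid(unsigned int closid)
        {
                int ret = find_free(closid);

                if (ret == -ENOSPC) {
                        /* Ask the architecture for more capacity, then retry once. */
                        if (arch_expand(closid) < 0)
                                return -ENOSPC;
                        ret = find_free(closid);
                }
                return ret;
        }

        int main(void)
        {
                printf("first alloc: %d\n", alloc_rmid(0)); /* expands, then succeeds */
                return 0;
        }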
Add mpam_sync_config() to synchronize configuration when dynamically expanding rmid resources. When binding a new reqpartid to a control group, the driver maps the reqpartid to the corresponding intpartid, or, for MSCs without the narrow-PARTID feature, applies the control group's existing configuration to the new partid. Extend mpam_apply_config() with a sync mode: - Sync mode: mpam_sync_config() calls this to apply the existing configuration without updating it. - Non-sync mode: resctrl_arch_update_one() calls this to compare, update, and apply configuration. This mode retains the original behavior. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/resctrl/mpam_devices.c | 23 ++++++++++++++++++----- drivers/resctrl/mpam_internal.h | 2 +- drivers/resctrl/mpam_resctrl.c | 29 ++++++++++++++++++++++++++--- 3 files changed, 45 insertions(+), 9 deletions(-) diff --git a/drivers/resctrl/mpam_devices.c b/drivers/resctrl/mpam_devices.c index b05900f15f4f..5a47131855af 100644 --- a/drivers/resctrl/mpam_devices.c +++ b/drivers/resctrl/mpam_devices.c @@ -1766,6 +1766,7 @@ struct mpam_write_config_arg { struct mpam_msc_ris *ris; struct mpam_component *comp; u16 partid; + bool sync; }; static int __write_config(void *arg) @@ -1774,6 +1775,15 @@ static int __write_config(void *arg) struct mpam_write_config_arg *c = arg; u32 reqpartid; + if (c->sync) { + /* c->partid should be within the range of reqPARTIDs */ + WARN_ON_ONCE(c->partid < closid_num); + + mpam_reprogram_ris_partid(c->ris, c->partid, + &c->comp->cfg[req2intpartid(c->partid)]); + return 0; + } + /* c->partid should be within the range of intPARTIDs */ WARN_ON_ONCE(c->partid >= closid_num); @@ -2931,7 +2941,7 @@ static bool mpam_update_config(struct mpam_config *cfg, } int mpam_apply_config(struct mpam_component *comp, u16 partid, - struct mpam_config *cfg) + struct mpam_config *cfg, bool sync) { struct mpam_write_config_arg arg; struct mpam_msc_ris *ris; @@ -2940,14 +2950,17 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid, lockdep_assert_cpus_held(); - /* Don't pass in the current config!
*/ - WARN_ON_ONCE(&comp->cfg[partid] == cfg); + if (!sync) { + /* The partid is within the range of intPARTIDs */ + WARN_ON_ONCE(partid >= resctrl_arch_get_num_closid(NULL)); - if (!mpam_update_config(&comp->cfg[partid], cfg)) - return 0; + if (!mpam_update_config(&comp->cfg[partid], cfg)) + return 0; + } arg.comp = comp; arg.partid = partid; + arg.sync = sync; guard(srcu)(&mpam_srcu); list_for_each_entry_srcu(vmsc, &comp->vmsc, comp_list, diff --git a/drivers/resctrl/mpam_internal.h b/drivers/resctrl/mpam_internal.h index d03630e28be8..88ae438ca496 100644 --- a/drivers/resctrl/mpam_internal.h +++ b/drivers/resctrl/mpam_internal.h @@ -469,7 +469,7 @@ void mpam_disable(struct work_struct *work); void mpam_reset_class_locked(struct mpam_class *class); int mpam_apply_config(struct mpam_component *comp, u16 partid, - struct mpam_config *cfg); + struct mpam_config *cfg, bool sync); int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, enum mpam_device_features, u64 *val); diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c index a3f027ad322c..cfb074d23965 100644 --- a/drivers/resctrl/mpam_resctrl.c +++ b/drivers/resctrl/mpam_resctrl.c @@ -1488,15 +1488,15 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d, */ if (mpam_resctrl_hide_cdp(r->rid)) { partid = resctrl_get_config_index(closid, CDP_CODE); - err = mpam_apply_config(dom->ctrl_comp, partid, &cfg); + err = mpam_apply_config(dom->ctrl_comp, partid, &cfg, false); if (err) return err; partid = resctrl_get_config_index(closid, CDP_DATA); - return mpam_apply_config(dom->ctrl_comp, partid, &cfg); + return mpam_apply_config(dom->ctrl_comp, partid, &cfg, false); } - return mpam_apply_config(dom->ctrl_comp, partid, &cfg); + return mpam_apply_config(dom->ctrl_comp, partid, &cfg, false); } int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) @@ -1877,6 +1877,23 @@ static void update_rmid_entries_for_reqpartid(u32 reqpartid) rmid_entry_reassign_closid(closid, req_pmg2rmid(reqpartid, pmg)); } +static int mpam_sync_config(u32 reqpartid) +{ + struct mpam_component *comp; + struct mpam_class *class; + int err; + + list_for_each_entry(class, &mpam_classes, classes_list) { + list_for_each_entry(comp, &class->components, class_list) { + err = mpam_apply_config(comp, reqpartid, NULL, true); + if (err) + return err; + } + } + + return 0; +} + int resctrl_arch_rmid_expand(u32 closid) { int i; @@ -1890,10 +1907,16 @@ int resctrl_arch_rmid_expand(u32 closid) if (reqpartid_map[i] >= resctrl_arch_get_num_closid(NULL)) { if (cdp_enabled) { reqpartid_map[i] = resctrl_get_config_index(closid, CDP_DATA); + mpam_sync_config(i); + reqpartid_map[i + 1] = resctrl_get_config_index(closid, CDP_CODE); + mpam_sync_config(i + 1); + } else { reqpartid_map[i] = resctrl_get_config_index(closid, CDP_NONE); + mpam_sync_config(i); } + update_rmid_entries_for_reqpartid(i); return i; } -- 2.25.1
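The two modes of mpam_apply_config() can be modelled as one function with a sync flag over a per-intPARTID config cache. A hypothetical standalone sketch (struct cfg and program_hw() are invented; the real code programs MSC registers per RIS):

        #include <stdio.h>
        #include <string.h>

        struct cfg { unsigned int mbw_pct; };

        static struct cfg cached[8]; /* per-intPARTID cached configuration */

        static void program_hw(unsigned int partid, const struct cfg *c)
        {
                printf("program PARTID %u: mbw=%u%%\n", partid, c->mbw_pct);
        }

        /* Non-sync: compare/update the cache, then program. Sync: re-program
         * a reqPARTID from its intPARTID's cached config without touching it. */
        static void apply_config(unsigned int partid, unsigned int intpartid,
                                 const struct cfg *c, int sync)
        {
                if (!sync) {
                        if (!memcmp(&cached[partid], c, sizeof(*c)))
                                return; /* nothing changed */
                        cached[partid] = *c;
                        program_hw(partid, &cached[partid]);
                        return;
                }
                program_hw(partid, &cached[intpartid]);
        }

        int main(void)
        {
                struct cfg c = { .mbw_pct = 50 };

                apply_config(1, 1, &c, 0);   /* control-path update */
                apply_config(5, 1, NULL, 1); /* sync new reqPARTID 5 from intPARTID 1 */
                return 0;
        }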
Reorder rdt_kill_sb() to prevent rmid parsing errors caused by premature cdp_enabled reset. The rmid to reqpartid conversion depends on cdp_enabled state, but rdt_disable_ctx() clears this flag. Delay rdt_disable_ctx() until after rmdir_all_sub() completes, ensuring free_rmid() operates with correct CDP state. Additionally, flush all pending limbo work before umount completes. This prevents rmid release errors on subsequent mounts if CDP state changes between mount sessions. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- fs/resctrl/rdtgroup.c | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 5da305bd36c9..62e57a939cb7 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -3165,6 +3165,25 @@ static void resctrl_fs_teardown(void) rdtgroup_destroy_root(); } +static void rdt_destroy_limbo(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdt_l3_mon_domain *d; + + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + return; + + if (!resctrl_is_mon_event_enabled(QOS_L3_OCCUP_EVENT_ID)) + return; + + list_for_each_entry(d, &r->mon_domains, hdr.list) { + if (has_busy_rmid(d)) { + __check_limbo(d, true); + cancel_delayed_work(&d->cqm_limbo); + } + } +} + static void rdt_kill_sb(struct super_block *sb) { struct rdt_resource *r; @@ -3172,13 +3191,14 @@ static void rdt_kill_sb(struct super_block *sb) cpus_read_lock(); mutex_lock(&rdtgroup_mutex); - rdt_disable_ctx(); - /* Put everything back to default values. */ for_each_alloc_capable_rdt_resource(r) resctrl_arch_reset_all_ctrls(r); resctrl_fs_teardown(); + rdt_destroy_limbo(); + rdt_disable_ctx(); + if (resctrl_arch_alloc_capable()) resctrl_arch_disable_alloc(); if (resctrl_arch_mon_capable()) -- 2.25.1
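The bug being fixed is easiest to see in miniature: an rmid encoded while CDP pairing is active decodes to a different closid once the flag is cleared. A standalone model of just that conversion, with hypothetical constants:

        #include <assert.h>

        #define NUM_PMG 4 /* hypothetical pmg count */

        static int cdp_enabled = 1;

        static unsigned int rmid2closid(unsigned int rmid)
        {
                unsigned int partid = rmid / NUM_PMG;

                /* Under CDP, odd/even partids pair up into one closid. */
                return cdp_enabled ? partid >> 1 : partid;
        }

        int main(void)
        {
                unsigned int rmid = 6 * NUM_PMG; /* reqPARTID 6 -> closid 3 under CDP */

                assert(rmid2closid(rmid) == 3);

                cdp_enabled = 0; /* the premature rdt_disable_ctx() analogue */
                assert(rmid2closid(rmid) == 6); /* now mis-parsed: 6 instead of 3 */
                return 0;
        }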