Amit Singh Tomar (1): fs/resctrl: Remove the limit on the number of CLOSID
Babu Moger (9): x86/resctrl: Add multiple tasks to the resctrl group at once x86/resctrl: Simplify rftype flag definitions x86/resctrl: Rename rftype flags for consistency x86/resctrl: Unwind properly from rdt_enable_ctx() x86/resctrl: Move default group file creation to mount x86/resctrl: Introduce "-o debug" mount option x86/resctrl: Display CLOSID for resource group x86/resctrl: Add support for the files of MON groups only x86/resctrl: Display RMID of resource group
Fenghua Yu (1): x86/resctrl: Add sparse_masks file in info
James Morse (104): tick/nohz: Move tick_nohz_full_mask declaration outside the #ifdef x86/resctrl: kfree() rmid_ptrs from resctrl_exit() x86/resctrl: Create helper for RMID allocation and mondata dir creation x86/resctrl: Move rmid allocation out of mkdir_rdt_prepare() x86/resctrl: Track the closid with the rmid x86/resctrl: Access per-rmid structures by index x86/resctrl: Allow RMID allocation to be scoped by CLOSID x86/resctrl: Track the number of dirty RMID a CLOSID has x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid x86/resctrl: Move CLOSID/RMID matching and setting to use helpers x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow x86/resctrl: Queue mon_event_read() instead of sending an IPI x86/resctrl: Allow resctrl_arch_rmid_read() to sleep x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read() x86/resctrl: Make resctrl_mounted checks explicit x86/resctrl: Move alloc/mon static keys into helpers x86/resctrl: Make rdt_enable_key the arch's decision to switch x86/resctrl: Add helpers for system wide mon/alloc capable x86/resctrl: Add CPU online callback for resctrl work x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but cpu x86/resctrl: Add CPU offline callback for resctrl work x86/resctrl: Move domain helper migration into resctrl_offline_cpu() x86/resctrl: Separate arch and fs resctrl locks x86/resctrl: Add a helper to avoid reaching into the arch code resource list x86/resctrl: Move ctrlval string parsing links away from the arch code x86/resctrl: Add helper for setting CPU default properties x86/resctrl: Remove rdtgroup from update_cpu_closid_rmid() x86/resctrl: Export resctrl fs's init function x86/resctrl: Wrap resctrl_arch_find_domain() around rdt_find_domain() x86/resctrl: Move resctrl types to a separate header x86/resctrl: Add a resctrl helper to reset all the resources x86/resctrl: Move monitor init work to a resctrl init call x86/resctrl: Move monitor exit work to a restrl exit call x86/resctrl: Move max_{name,data}_width into resctrl code x86/resctrl: Stop using the for_each_*_rdt_resource() walkers x86/resctrl: Export the is_mbm_*_enabled() helpers to asm/resctrl.h x86/resctrl: Add resctrl_arch_is_evt_configurable() to abstract BMEC x86/resctrl: Change mon_event_config_{read,write}() to be arch helpers x86/resctrl: Allow resctrl_arch_mon_event_config_write() to return an error x86/resctrl: Allow an architecture to disable pseudo lock x86/resctrl: Make prefetch_disable_bits belong to the arch code x86/resctrl: Make resctrl_arch_pseudo_lock_fn() take a plr x86/resctrl: Move thread_throttle_mode_init() to be managed by resctrl x86/resctrl: Move get_config_index() to a header x86/resctrl: Claim get_domain_from_cpu() for resctrl x86/resctrl: Describe resctrl's bitmap size assumptions x86/resctrl: Drop __init/__exit on assorted symbols fs/resctrl: Add boiler plate for external resctrl code x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl x86/resctrl: Move the filesystem portions of resctrl to live in '/fs/' arm64: head.S: Initialise MPAM EL2 registers and disable traps arm64: cpufeature: discover CPU support for MPAM KVM: arm64: Fix missing traps of guest accesses to the MPAM registers KVM: arm64: Disable MPAM visibility by default, and handle traps arm64: mpam: Context switch the MPAM registers untested: KVM: arm64: Force guest EL1 to use user-space's partid configuration ACPI / PPTT: Provide a helper to walk processor containers ACPI / PPTT: Find PPTT cache level by ID ACPI / PPTT: Add a helper to fill a cpumask from a processor container ACPI / PPTT: Add a helper to fill a cpumask from a cache_id cacheinfo: Expose the code to generate a cache-id from a device_node drivers: base: cacheinfo: Add helper to find the cache size from cpu+level ACPI / MPAM: Parse the MPAM table arm_mpam: Add probe/remove for mpam msc driver and kbuild boiler plate arm_mpam: Add the class and component structures for ris firmware described arm_mpam: Add MPAM MSC register layout definitions arm_mpam: Add cpuhp callbacks to probe MSC hardware arm_mpam: Probe MSCs to find the supported partid/pmg values arm_mpam: Probe the hardware features resctrl supports arm_mpam: Merge supported features during mpam_enable() into mpam_class arm_mpam: Reset MSC controls from cpu hp callbacks arm_mpam: Add a helper to touch an MSC from any CPU arm_mpam: Extend reset logic to allow devices to be reset any time arm_mpam: Register and enable IRQs arm_mpam: Use the arch static key to indicate when mpam is enabled arm_mpam: Allow configuration to be applied and restored during cpu online arm_mpam: Probe and reset the rest of the features arm_mpam: Add helpers to allocate monitors arm_mpam: Add mpam_msmon_read() to read monitor value arm_mpam: Track bandwidth counter state for overflow and power management arm_mpam: Add helper to reset saved mbwu state arm_mpam: resctrl: Add boilerplate cpuhp and domain allocation arm_mpam: resctrl: Pick the caches we will use as resctrl resources arm_mpam: resctrl: Pick a value for num_rmid arm_mpam: resctrl: Implement resctrl_arch_reset_resources() arm_mpam: resctrl: Add resctrl_arch_get_config() arm_mpam: resctrl: Implement helpers to update configuration arm_mpam: resctrl: Add CDP emulation arm64: mpam: Add helpers to change a tasks and cpu mpam partid/pmg values arm_mpam: resctrl: Add rmid index helpers untested: arm_mpam: resctrl: Add support for MB resource untested: arm_mpam: resctrl: Add support for mbm counters arm_mpam: resctrl: Allow resctrl to allocate monitors arm_mpam: resctrl: Add resctrl_arch_rmid_read() and resctrl_arch_reset_rmid() untested: arm_mpam: resctrl: Allow monitors to be configured arm_mpam: resctrl: Add empty definitions for pseudo lock arm_mpam: resctrl: Add empty definitions for fine-grained enables arm_mpam: resctrl: Add dummy definition for free running counters arm64: mpam: Select ARCH_HAS_CPU_RESCTRL perf/arm-cmn: Stop claiming all the resources arm_mpam: resctrl: Tell resctrl about cpu/domain online/offline arm_mpam: resctrl: Call resctrl_exit() in the event of errors arm_mpam: resctrl: Update the rmid reallocation limit
Maciej Wieczor-Retman (3): x86/resctrl: Fix remaining kernel-doc warnings x86/resctrl: Rename arch_has_sparse_bitmaps x86/resctrl: Enable non-contiguous CBMs in Intel CAT
Rob Herring (3): cacheinfo: Allow for >32-bit cache 'id' cacheinfo: Set cache 'id' based on DT data dt-bindings: arm: Add MPAM MSC binding
Rohit Mathew (2): arm_mpam: Probe for long/lwd mbwu counters arm_mpam: Use long MBWU counters if supported
.../arch/arm64/cpu-feature-registers.rst | 2 + Documentation/arch/x86/resctrl.rst | 22 +- .../devicetree/bindings/arm/arm,mpam-msc.yaml | 227 + MAINTAINERS | 2 + arch/Kconfig | 8 + arch/arm64/Kconfig | 22 +- arch/arm64/include/asm/cpu.h | 1 + arch/arm64/include/asm/cpufeature.h | 13 + arch/arm64/include/asm/el2_setup.h | 16 + arch/arm64/include/asm/kvm_arm.h | 1 + arch/arm64/include/asm/mpam.h | 167 + arch/arm64/include/asm/resctrl.h | 1 + arch/arm64/include/asm/sysreg.h | 8 + arch/arm64/include/asm/thread_info.h | 3 + arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/cpufeature.c | 68 + arch/arm64/kernel/cpuinfo.c | 4 + arch/arm64/kernel/image-vars.h | 5 + arch/arm64/kernel/mpam.c | 23 + arch/arm64/kernel/process.c | 7 + arch/arm64/kvm/hyp/include/hyp/switch.h | 32 + arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 27 + arch/arm64/kvm/hyp/nvhe/switch.c | 11 + arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 1 + arch/arm64/kvm/sys_regs.c | 70 +- arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/sysreg | 33 + arch/x86/Kconfig | 5 +- arch/x86/include/asm/resctrl.h | 131 +- arch/x86/kernel/cpu/resctrl/core.c | 212 +- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 476 +- arch/x86/kernel/cpu/resctrl/internal.h | 394 +- arch/x86/kernel/cpu/resctrl/monitor.c | 594 +-- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1123 +---- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3916 +--------------- drivers/acpi/arm64/Kconfig | 3 + drivers/acpi/arm64/Makefile | 2 + drivers/acpi/arm64/mpam.c | 369 ++ drivers/acpi/pptt.c | 278 +- drivers/acpi/tables.c | 2 +- drivers/base/cacheinfo.c | 41 +- drivers/perf/arm-cmn.c | 12 +- drivers/platform/Kconfig | 2 + drivers/platform/Makefile | 1 + drivers/platform/mpam/Kconfig | 8 + drivers/platform/mpam/Makefile | 1 + drivers/platform/mpam/mpam_devices.c | 2422 ++++++++++ drivers/platform/mpam/mpam_internal.h | 576 +++ drivers/platform/mpam/mpam_resctrl.c | 1209 +++++ fs/Kconfig | 1 + fs/Makefile | 1 + fs/resctrl/Kconfig | 23 + fs/resctrl/Makefile | 1 + fs/resctrl/ctrlmondata.c | 528 +++ fs/resctrl/internal.h | 313 ++ fs/resctrl/monitor.c | 859 ++++ fs/resctrl/psuedo_lock.c | 1134 +++++ fs/resctrl/rdtgroup.c | 4015 +++++++++++++++++ include/linux/acpi.h | 17 + include/linux/arm_mpam.h | 105 + include/linux/cacheinfo.h | 30 +- include/linux/resctrl.h | 203 +- include/linux/resctrl_types.h | 98 + include/linux/tick.h | 9 +- 64 files changed, 13418 insertions(+), 6472 deletions(-) create mode 100644 Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml create mode 100644 arch/arm64/include/asm/mpam.h create mode 100644 arch/arm64/include/asm/resctrl.h create mode 100644 arch/arm64/kernel/mpam.c create mode 100644 drivers/acpi/arm64/mpam.c create mode 100644 drivers/platform/mpam/Kconfig create mode 100644 drivers/platform/mpam/Makefile create mode 100644 drivers/platform/mpam/mpam_devices.c create mode 100644 drivers/platform/mpam/mpam_internal.h create mode 100644 drivers/platform/mpam/mpam_resctrl.c create mode 100644 fs/resctrl/Kconfig create mode 100644 fs/resctrl/Makefile create mode 100644 fs/resctrl/ctrlmondata.c create mode 100644 fs/resctrl/internal.h create mode 100644 fs/resctrl/monitor.c create mode 100644 fs/resctrl/psuedo_lock.c create mode 100644 fs/resctrl/rdtgroup.c create mode 100644 include/linux/arm_mpam.h create mode 100644 include/linux/resctrl_types.h
-- 2.25.1
From: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com
The kernel test robot reported kernel-doc warnings here:
arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'of' not described in 'rdt_bit_usage_show' arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'seq' not described in 'rdt_bit_usage_show' arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'v' not described in 'rdt_bit_usage_show' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1144: warning: Function parameter or member 'type' not described in '__rdtgroup_cbm_overlaps' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1224: warning: Function parameter or member 'rdtgrp' not described in 'rdtgroup_mode_test_exclusive' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'of' not described in 'rdtgroup_mode_write' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'buf' not described in 'rdtgroup_mode_write' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'nbytes' not described in 'rdtgroup_mode_write' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'off' not described in 'rdtgroup_mode_write' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 'of' not described in 'rdtgroup_size_show' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 's' not described in 'rdtgroup_size_show' arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 'v' not described in 'rdtgroup_size_show'
The first two functions are missing an argument description while the other three are file callbacks and don't require a kernel-doc comment.
Closes: https://lore.kernel.org/oe-kbuild-all/202310070434.mD8eRNAz-lkp@intel.com/ Reported-by: kernel test robot lkp@intel.com Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com Signed-off-by: Ingo Molnar mingo@kernel.org Cc: Peter Newman peternewman@google.com Cc: Borislav Petkov bp@alien8.de Cc: Reinette Chatre reinette.chatre@intel.com Link: https://lore.kernel.org/r/20231011064843.246592-1-maciej.wieczor-retman@inte... --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 725344048f85..a52cdf97c21c 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -895,7 +895,7 @@ static int rdt_shareable_bits_show(struct kernfs_open_file *of, return 0; }
-/** +/* * rdt_bit_usage_show - Display current usage of resources * * A domain is a shared resource that can now be allocated differently. Here @@ -1123,6 +1123,7 @@ static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type) * @d: The domain instance for which @closid is being tested. * @cbm: Capacity bitmask being tested. * @closid: Intended closid for @cbm. + * @type: CDP type of @r. * @exclusive: Only check if overlaps with exclusive resource groups * * Checks if provided @cbm intended to be used for @closid on domain @@ -1209,6 +1210,7 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
/** * rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive + * @rdtgrp: Resource group identified through its closid. * * An exclusive resource group implies that there should be no sharing of * its allocated resources. At the time this group is considered to be @@ -1251,9 +1253,8 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) return true; }
-/** +/* * rdtgroup_mode_write - Modify the resource group's mode - * */ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) @@ -1357,12 +1358,11 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, return size; }
-/** +/* * rdtgroup_size_show - Display size in bytes of allocated regions * * The "size" file mirrors the layout of the "schemata" file, printing the * size in bytes of each region instead of the capacity bitmask. - * */ static int rdtgroup_size_show(struct kernfs_open_file *of, struct seq_file *s, void *v)
From: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com
Rename arch_has_sparse_bitmaps to arch_has_sparse_bitmasks to ensure consistent terminology throughout resctrl.
Suggested-by: Reinette Chatre reinette.chatre@intel.com Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com Tested-by: Peter Newman peternewman@google.com Link: https://lore.kernel.org/r/e330fcdae873ef1a831e707025a4b70fa346666e.169693409... --- arch/x86/kernel/cpu/resctrl/core.c | 4 ++-- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 4 ++-- include/linux/resctrl.h | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 030d3b409768..c09e4fdded3c 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -872,7 +872,7 @@ static __init void rdt_init_res_defs_intel(void)
if (r->rid == RDT_RESOURCE_L3 || r->rid == RDT_RESOURCE_L2) { - r->cache.arch_has_sparse_bitmaps = false; + r->cache.arch_has_sparse_bitmasks = false; r->cache.arch_has_per_cpu_cfg = false; r->cache.min_cbm_bits = 1; } else if (r->rid == RDT_RESOURCE_MBA) { @@ -892,7 +892,7 @@ static __init void rdt_init_res_defs_amd(void)
if (r->rid == RDT_RESOURCE_L3 || r->rid == RDT_RESOURCE_L2) { - r->cache.arch_has_sparse_bitmaps = true; + r->cache.arch_has_sparse_bitmasks = true; r->cache.arch_has_per_cpu_cfg = true; r->cache.min_cbm_bits = 0; } else if (r->rid == RDT_RESOURCE_MBA) { diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index b44c487727d4..ab45012288bb 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -113,8 +113,8 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) first_bit = find_first_bit(&val, cbm_len); zero_bit = find_next_zero_bit(&val, cbm_len, first_bit);
- /* Are non-contiguous bitmaps allowed? */ - if (!r->cache.arch_has_sparse_bitmaps && + /* Are non-contiguous bitmasks allowed? */ + if (!r->cache.arch_has_sparse_bitmasks && (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)) { rdt_last_cmd_printf("The mask %lx has non-consecutive 1-bits\n", val); return false; diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 8334eeacfec5..66942d7fba7f 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -94,7 +94,7 @@ struct rdt_domain { * zero CBM. * @shareable_bits: Bitmask of shareable resource with other * executing entities - * @arch_has_sparse_bitmaps: True if a bitmap like f00f is valid. + * @arch_has_sparse_bitmasks: True if a bitmask like f00f is valid. * @arch_has_per_cpu_cfg: True if QOS_CFG register for this cache * level has CPU scope. */ @@ -102,7 +102,7 @@ struct resctrl_cache { unsigned int cbm_len; unsigned int min_cbm_bits; unsigned int shareable_bits; - bool arch_has_sparse_bitmaps; + bool arch_has_sparse_bitmasks; bool arch_has_per_cpu_cfg; };
From: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com
The setting for non-contiguous 1s support in Intel CAT is hardcoded to false. On these systems, writing non-contiguous 1s into the schemata file will fail before resctrl passes the value to the hardware.
In Intel CAT CPUID.0x10.1:ECX[3] and CPUID.0x10.2:ECX[3] stopped being reserved and now carry information about non-contiguous 1s value support for L3 and L2 cache respectively. The CAT capacity bitmask (CBM) supports a non-contiguous 1s value if the bit is set.
The exception are Haswell systems where non-contiguous 1s value support needs to stay disabled since they can't make use of CPUID for Cache allocation.
Originally-by: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com Tested-by: Peter Newman peternewman@google.com Link: https://lore.kernel.org/r/1849b487256fe4de40b30f88450cba3d9abc9171.169693409... --- arch/x86/kernel/cpu/resctrl/core.c | 9 ++++++--- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 10 ++++++---- arch/x86/kernel/cpu/resctrl/internal.h | 9 +++++++++ 3 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index c09e4fdded3c..19e0681f0435 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -152,6 +152,7 @@ static inline void cache_alloc_hsw_probe(void) r->cache.cbm_len = 20; r->cache.shareable_bits = 0xc0000; r->cache.min_cbm_bits = 2; + r->cache.arch_has_sparse_bitmasks = false; r->alloc_capable = true;
rdt_alloc_capable = true; @@ -267,15 +268,18 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); union cpuid_0x10_1_eax eax; + union cpuid_0x10_x_ecx ecx; union cpuid_0x10_x_edx edx; - u32 ebx, ecx; + u32 ebx;
- cpuid_count(0x00000010, idx, &eax.full, &ebx, &ecx, &edx.full); + cpuid_count(0x00000010, idx, &eax.full, &ebx, &ecx.full, &edx.full); hw_res->num_closid = edx.split.cos_max + 1; r->cache.cbm_len = eax.split.cbm_len + 1; r->default_ctrl = BIT_MASK(eax.split.cbm_len + 1) - 1; r->cache.shareable_bits = ebx & r->default_ctrl; r->data_width = (r->cache.cbm_len + 3) / 4; + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + r->cache.arch_has_sparse_bitmasks = ecx.split.noncont; r->alloc_capable = true; }
@@ -872,7 +876,6 @@ static __init void rdt_init_res_defs_intel(void)
if (r->rid == RDT_RESOURCE_L3 || r->rid == RDT_RESOURCE_L2) { - r->cache.arch_has_sparse_bitmasks = false; r->cache.arch_has_per_cpu_cfg = false; r->cache.min_cbm_bits = 1; } else if (r->rid == RDT_RESOURCE_MBA) { diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index ab45012288bb..beccb0e87ba7 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -87,10 +87,12 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
/* * Check whether a cache bit mask is valid. - * For Intel the SDM says: - * Please note that all (and only) contiguous '1' combinations - * are allowed (e.g. FFFFH, 0FF0H, 003CH, etc.). - * Additionally Haswell requires at least two bits set. + * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID: + * - CPUID.0x10.1:ECX[3]: L3 non-contiguous 1s value supported if 1 + * - CPUID.0x10.2:ECX[3]: L2 non-contiguous 1s value supported if 1 + * + * Haswell does not support a non-contiguous 1s value and additionally + * requires at least two bits set. * AMD allows non-contiguous bitmasks. */ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 85ceaf9a31ac..c47ef2f13e8e 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -492,6 +492,15 @@ union cpuid_0x10_3_eax { unsigned int full; };
+/* CPUID.(EAX=10H, ECX=ResID).ECX */ +union cpuid_0x10_x_ecx { + struct { + unsigned int reserved:3; + unsigned int noncont:1; + } split; + unsigned int full; +}; + /* CPUID.(EAX=10H, ECX=ResID).EDX */ union cpuid_0x10_x_edx { struct {
From: Fenghua Yu fenghua.yu@intel.com
Add the interface in resctrl FS to show if sparse cache allocation bit masks are supported on the platform. Reading the file returns either a "1" if non-contiguous 1s are supported and "0" otherwise. The file path is /sys/fs/resctrl/info/{resource}/sparse_masks, where {resource} can be either "L2" or "L3".
Signed-off-by: Fenghua Yu fenghua.yu@intel.com Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com Tested-by: Peter Newman peternewman@google.com Link: https://lore.kernel.org/r/7300535160beba41fd8aa073749ec1ee29b4621f.169693409... --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index a52cdf97c21c..09848ff11f7b 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1117,6 +1117,17 @@ static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type) } }
+static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks); + + return 0; +} + /** * __rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other * @r: Resource to which domain instance @d belongs. @@ -1839,6 +1850,13 @@ static struct rftype res_common_files[] = { .seq_show = rdtgroup_size_show, .fflags = RF_CTRL_BASE, }, + { + .name = "sparse_masks", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_has_sparse_bitmasks_show, + .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, + },
};
From: Babu Moger babu.moger@amd.com
The resctrl task assignment for monitor or control group needs to be done one at a time. For example:
$mount -t resctrl resctrl /sys/fs/resctrl/ $mkdir /sys/fs/resctrl/ctrl_grp1 $echo 123 > /sys/fs/resctrl/ctrl_grp1/tasks $echo 456 > /sys/fs/resctrl/ctrl_grp1/tasks $echo 789 > /sys/fs/resctrl/ctrl_grp1/tasks
This is not user-friendly when dealing with hundreds of tasks.
Support multiple task assignment in one command with tasks ids separated by commas. For example:
$echo 123,456,789 > /sys/fs/resctrl/ctrl_grp1/tasks
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-2-babu.moger@amd.com --- Documentation/arch/x86/resctrl.rst | 9 ++++++++- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 25 ++++++++++++++++++++++--- 2 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index cb05d90111b4..8154e9975d1e 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -299,7 +299,14 @@ All groups contain the following files: "tasks": Reading this file shows the list of all tasks that belong to this group. Writing a task id to the file will add a task to the - group. If the group is a CTRL_MON group the task is removed from + group. Multiple tasks can be added by separating the task ids + with commas. Tasks will be assigned sequentially. Multiple + failures are not supported. A single failure encountered while + attempting to assign a task will cause the operation to abort and + already added tasks before the failure will remain in the group. + Failures will be logged to /sys/fs/resctrl/info/last_cmd_status. + + If the group is a CTRL_MON group the task is removed from whichever previous CTRL_MON group owned the task and also from any MON group that owned the task. If the group is a MON group, then the task must already belong to the CTRL_MON parent of this diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 09848ff11f7b..fe239691628a 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -696,11 +696,10 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct rdtgroup *rdtgrp; + char *pid_str; int ret = 0; pid_t pid;
- if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) - return -EINVAL; rdtgrp = rdtgroup_kn_lock_live(of->kn); if (!rdtgrp) { rdtgroup_kn_unlock(of->kn); @@ -715,7 +714,27 @@ static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, goto unlock; }
- ret = rdtgroup_move_task(pid, rdtgrp, of); + while (buf && buf[0] != '\0' && buf[0] != '\n') { + pid_str = strim(strsep(&buf, ",")); + + if (kstrtoint(pid_str, 0, &pid)) { + rdt_last_cmd_printf("Task list parsing error pid %s\n", pid_str); + ret = -EINVAL; + break; + } + + if (pid < 0) { + rdt_last_cmd_printf("Invalid pid %d\n", pid); + ret = -EINVAL; + break; + } + + ret = rdtgroup_move_task(pid, rdtgrp, of); + if (ret) { + rdt_last_cmd_printf("Error while processing task %d\n", pid); + break; + } + }
unlock: rdtgroup_kn_unlock(of->kn);
From: Babu Moger babu.moger@amd.com
The rftype flags are bitmaps used for adding files under the resctrl filesystem. Some of these bitmap defines have one extra level of indirection which is not necessary.
Drop the RF_* defines and simplify the macros.
[ bp: Massage commit message. ]
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-3-babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/internal.h | 9 +++------ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 +++++- 2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index c47ef2f13e8e..0ad970c5c867 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -243,12 +243,9 @@ struct rdtgroup { */ #define RFTYPE_INFO BIT(0) #define RFTYPE_BASE BIT(1) -#define RF_CTRLSHIFT 4 -#define RF_MONSHIFT 5 -#define RF_TOPSHIFT 6 -#define RFTYPE_CTRL BIT(RF_CTRLSHIFT) -#define RFTYPE_MON BIT(RF_MONSHIFT) -#define RFTYPE_TOP BIT(RF_TOPSHIFT) +#define RFTYPE_CTRL BIT(4) +#define RFTYPE_MON BIT(5) +#define RFTYPE_TOP BIT(6) #define RFTYPE_RES_CACHE BIT(8) #define RFTYPE_RES_MB BIT(9) #define RF_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index fe239691628a..09141f1f0b96 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3260,7 +3260,11 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, goto out_destroy; }
- files = RFTYPE_BASE | BIT(RF_CTRLSHIFT + rtype); + if (rtype == RDTCTRL_GROUP) + files = RFTYPE_BASE | RFTYPE_CTRL; + else + files = RFTYPE_BASE | RFTYPE_MON; + ret = rdtgroup_add_files(kn, files); if (ret) { rdt_last_cmd_puts("kernfs fill error\n");
From: Babu Moger babu.moger@amd.com
resctrl associates rftype flags with its files so that files can be chosen based on the resource, whether it is info or base, and if it is control or monitor type file. These flags use the RF_ as well as RFTYPE_ prefixes.
Change the prefix to RFTYPE_ for all these flags to be consistent.
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-4-babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/internal.h | 10 +++--- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 44 +++++++++++++------------- 2 files changed, 27 insertions(+), 27 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 0ad970c5c867..ba4611111212 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -248,10 +248,10 @@ struct rdtgroup { #define RFTYPE_TOP BIT(6) #define RFTYPE_RES_CACHE BIT(8) #define RFTYPE_RES_MB BIT(9) -#define RF_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) -#define RF_MON_INFO (RFTYPE_INFO | RFTYPE_MON) -#define RF_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) -#define RF_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) +#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) +#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) +#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) +#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL)
/* List of all resource groups */ extern struct list_head rdt_all_groups; @@ -267,7 +267,7 @@ void __exit rdtgroup_exit(void); * @mode: Access mode * @kf_ops: File operations * @flags: File specific RFTYPE_FLAGS_* flags - * @fflags: File specific RF_* or RFTYPE_* flags + * @fflags: File specific RFTYPE_* flags * @seq_show: Show content of the file * @write: Write to the file */ diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 09141f1f0b96..125d12d8f4b9 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1716,77 +1716,77 @@ static struct rftype res_common_files[] = { .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_last_cmd_status_show, - .fflags = RF_TOP_INFO, + .fflags = RFTYPE_TOP_INFO, }, { .name = "num_closids", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_num_closids_show, - .fflags = RF_CTRL_INFO, + .fflags = RFTYPE_CTRL_INFO, }, { .name = "mon_features", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_mon_features_show, - .fflags = RF_MON_INFO, + .fflags = RFTYPE_MON_INFO, }, { .name = "num_rmids", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_num_rmids_show, - .fflags = RF_MON_INFO, + .fflags = RFTYPE_MON_INFO, }, { .name = "cbm_mask", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_default_ctrl_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, }, { .name = "min_cbm_bits", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_min_cbm_bits_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, }, { .name = "shareable_bits", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_shareable_bits_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, }, { .name = "bit_usage", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_bit_usage_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, }, { .name = "min_bandwidth", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_min_bw_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_MB, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, }, { .name = "bandwidth_gran", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_bw_gran_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_MB, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, }, { .name = "delay_linear", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_delay_linear_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_MB, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, }, /* * Platform specific which (if any) capabilities are provided by @@ -1805,7 +1805,7 @@ static struct rftype res_common_files[] = { .kf_ops = &rdtgroup_kf_single_ops, .write = max_threshold_occ_write, .seq_show = max_threshold_occ_show, - .fflags = RF_MON_INFO | RFTYPE_RES_CACHE, + .fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE, }, { .name = "mbm_total_bytes_config", @@ -1852,7 +1852,7 @@ static struct rftype res_common_files[] = { .kf_ops = &rdtgroup_kf_single_ops, .write = rdtgroup_schemata_write, .seq_show = rdtgroup_schemata_show, - .fflags = RF_CTRL_BASE, + .fflags = RFTYPE_CTRL_BASE, }, { .name = "mode", @@ -1860,21 +1860,21 @@ static struct rftype res_common_files[] = { .kf_ops = &rdtgroup_kf_single_ops, .write = rdtgroup_mode_write, .seq_show = rdtgroup_mode_show, - .fflags = RF_CTRL_BASE, + .fflags = RFTYPE_CTRL_BASE, }, { .name = "size", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdtgroup_size_show, - .fflags = RF_CTRL_BASE, + .fflags = RFTYPE_CTRL_BASE, }, { .name = "sparse_masks", .mode = 0444, .kf_ops = &rdtgroup_kf_single_ops, .seq_show = rdt_has_sparse_bitmasks_show, - .fflags = RF_CTRL_INFO | RFTYPE_RES_CACHE, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, },
}; @@ -1931,7 +1931,7 @@ void __init thread_throttle_mode_init(void) if (!rft) return;
- rft->fflags = RF_CTRL_INFO | RFTYPE_RES_MB; + rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; }
void __init mbm_config_rftype_init(const char *config) @@ -1940,7 +1940,7 @@ void __init mbm_config_rftype_init(const char *config)
rft = rdtgroup_get_rftype_by_name(config); if (rft) - rft->fflags = RF_MON_INFO | RFTYPE_RES_CACHE; + rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE; }
/** @@ -2075,21 +2075,21 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) if (IS_ERR(kn_info)) return PTR_ERR(kn_info);
- ret = rdtgroup_add_files(kn_info, RF_TOP_INFO); + ret = rdtgroup_add_files(kn_info, RFTYPE_TOP_INFO); if (ret) goto out_destroy;
/* loop over enabled controls, these are all alloc_capable */ list_for_each_entry(s, &resctrl_schema_all, list) { r = s->res; - fflags = r->fflags | RF_CTRL_INFO; + fflags = r->fflags | RFTYPE_CTRL_INFO; ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); if (ret) goto out_destroy; }
for_each_mon_capable_rdt_resource(r) { - fflags = r->fflags | RF_MON_INFO; + fflags = r->fflags | RFTYPE_MON_INFO; sprintf(name, "%s_MON", r->name); ret = rdtgroup_mkdir_info_resdir(r, name, fflags); if (ret) @@ -3727,7 +3727,7 @@ static int __init rdtgroup_setup_root(void)
list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
- ret = rdtgroup_add_files(kernfs_root_to_node(rdt_root), RF_CTRL_BASE); + ret = rdtgroup_add_files(kernfs_root_to_node(rdt_root), RFTYPE_CTRL_BASE); if (ret) { kernfs_destroy_root(rdt_root); goto out;
From: Babu Moger babu.moger@amd.com
rdt_enable_ctx() enables the features provided during resctrl mount.
Additions to rdt_enable_ctx() are required to also modify error paths of rdt_enable_ctx() callers to ensure correct unwinding if errors are encountered after calling rdt_enable_ctx(). This is error prone.
Introduce rdt_disable_ctx() to refactor the error unwinding of rdt_enable_ctx() to simplify future additions. This also simplifies cleanup in rdt_kill_sb().
Suggested-by: Reinette Chatre reinette.chatre@intel.com Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-5-babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 53 ++++++++++++++++---------- 1 file changed, 32 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 125d12d8f4b9..d1b5fd70ccbb 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2308,14 +2308,6 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable) return 0; }
-static void cdp_disable_all(void) -{ - if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3)) - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); - if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); -} - /* * We don't allow rdtgroup directories to be created anywhere * except the root directory. Thus when looking for the rdtgroup @@ -2395,19 +2387,42 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **mon_data_kn);
+static void rdt_disable_ctx(void) +{ + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); + set_mba_sc(false); +} + static int rdt_enable_ctx(struct rdt_fs_context *ctx) { int ret = 0;
- if (ctx->enable_cdpl2) + if (ctx->enable_cdpl2) { ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, true); + if (ret) + goto out_done; + }
- if (!ret && ctx->enable_cdpl3) + if (ctx->enable_cdpl3) { ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, true); + if (ret) + goto out_cdpl2; + }
- if (!ret && ctx->enable_mba_mbps) + if (ctx->enable_mba_mbps) { ret = set_mba_sc(true); + if (ret) + goto out_cdpl3; + } + + return 0;
+out_cdpl3: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); +out_cdpl2: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); +out_done: return ret; }
@@ -2515,13 +2530,13 @@ static int rdt_get_tree(struct fs_context *fc) }
ret = rdt_enable_ctx(ctx); - if (ret < 0) - goto out_cdp; + if (ret) + goto out;
ret = schemata_list_create(); if (ret) { schemata_list_destroy(); - goto out_mba; + goto out_ctx; }
closid_init(); @@ -2580,11 +2595,8 @@ static int rdt_get_tree(struct fs_context *fc) kernfs_remove(kn_info); out_schemata_free: schemata_list_destroy(); -out_mba: - if (ctx->enable_mba_mbps) - set_mba_sc(false); -out_cdp: - cdp_disable_all(); +out_ctx: + rdt_disable_ctx(); out: rdt_last_cmd_clear(); mutex_unlock(&rdtgroup_mutex); @@ -2816,12 +2828,11 @@ static void rdt_kill_sb(struct super_block *sb) cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
- set_mba_sc(false); + rdt_disable_ctx();
/*Put everything back to default values. */ for_each_alloc_capable_rdt_resource(r) reset_all_ctrls(r); - cdp_disable_all(); rmdir_all_sub(); rdt_pseudo_lock_release(); rdtgroup_default.mode = RDT_MODE_SHAREABLE;
From: Babu Moger babu.moger@amd.com
The default resource group and its files are created during kernel init time. Upcoming changes will make some resctrl files optional based on a mount parameter. If optional files are to be added to the default group based on the mount option, then each new file needs to be created separately and call kernfs_activate() again.
Create all files of the default resource group during resctrl mount, destroyed during unmount, to avoid scattering resctrl file addition across two separate code flows.
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-6-babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 58 +++++++++++++++----------- 1 file changed, 34 insertions(+), 24 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index d1b5fd70ccbb..55da93b5ca53 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -54,6 +54,9 @@ static struct kernfs_node *kn_mondata; static struct seq_buf last_cmd_status; static char last_cmd_status_buf[512];
+static int rdtgroup_setup_root(struct rdt_fs_context *ctx); +static void rdtgroup_destroy_root(void); + struct dentry *debugfs_resctrl;
void rdt_last_cmd_clear(void) @@ -2529,10 +2532,14 @@ static int rdt_get_tree(struct fs_context *fc) goto out; }
- ret = rdt_enable_ctx(ctx); + ret = rdtgroup_setup_root(ctx); if (ret) goto out;
+ ret = rdt_enable_ctx(ctx); + if (ret) + goto out_root; + ret = schemata_list_create(); if (ret) { schemata_list_destroy(); @@ -2541,6 +2548,12 @@ static int rdt_get_tree(struct fs_context *fc)
closid_init();
+ ret = rdtgroup_add_files(rdtgroup_default.kn, RFTYPE_CTRL_BASE); + if (ret) + goto out_schemata_free; + + kernfs_activate(rdtgroup_default.kn); + ret = rdtgroup_create_info_dir(rdtgroup_default.kn); if (ret < 0) goto out_schemata_free; @@ -2597,6 +2610,8 @@ static int rdt_get_tree(struct fs_context *fc) schemata_list_destroy(); out_ctx: rdt_disable_ctx(); +out_root: + rdtgroup_destroy_root(); out: rdt_last_cmd_clear(); mutex_unlock(&rdtgroup_mutex); @@ -2667,7 +2682,6 @@ static int rdt_init_fs_context(struct fs_context *fc) if (!ctx) return -ENOMEM;
- ctx->kfc.root = rdt_root; ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; fc->fs_private = &ctx->kfc; fc->ops = &rdt_fs_context_ops; @@ -2837,6 +2851,7 @@ static void rdt_kill_sb(struct super_block *sb) rdt_pseudo_lock_release(); rdtgroup_default.mode = RDT_MODE_SHAREABLE; schemata_list_destroy(); + rdtgroup_destroy_root(); static_branch_disable_cpuslocked(&rdt_alloc_enable_key); static_branch_disable_cpuslocked(&rdt_mon_enable_key); static_branch_disable_cpuslocked(&rdt_enable_key); @@ -3718,10 +3733,8 @@ static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = { .show_options = rdtgroup_show_options, };
-static int __init rdtgroup_setup_root(void) +static int rdtgroup_setup_root(struct rdt_fs_context *ctx) { - int ret; - rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops, KERNFS_ROOT_CREATE_DEACTIVATED | KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK, @@ -3729,6 +3742,20 @@ static int __init rdtgroup_setup_root(void) if (IS_ERR(rdt_root)) return PTR_ERR(rdt_root);
+ ctx->kfc.root = rdt_root; + rdtgroup_default.kn = kernfs_root_to_node(rdt_root); + + return 0; +} + +static void rdtgroup_destroy_root(void) +{ + kernfs_destroy_root(rdt_root); + rdtgroup_default.kn = NULL; +} + +static void __init rdtgroup_setup_default(void) +{ mutex_lock(&rdtgroup_mutex);
rdtgroup_default.closid = 0; @@ -3738,19 +3765,7 @@ static int __init rdtgroup_setup_root(void)
list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
- ret = rdtgroup_add_files(kernfs_root_to_node(rdt_root), RFTYPE_CTRL_BASE); - if (ret) { - kernfs_destroy_root(rdt_root); - goto out; - } - - rdtgroup_default.kn = kernfs_root_to_node(rdt_root); - kernfs_activate(rdtgroup_default.kn); - -out: mutex_unlock(&rdtgroup_mutex); - - return ret; }
static void domain_destroy_mon_state(struct rdt_domain *d) @@ -3872,13 +3887,11 @@ int __init rdtgroup_init(void) seq_buf_init(&last_cmd_status, last_cmd_status_buf, sizeof(last_cmd_status_buf));
- ret = rdtgroup_setup_root(); - if (ret) - return ret; + rdtgroup_setup_default();
ret = sysfs_create_mount_point(fs_kobj, "resctrl"); if (ret) - goto cleanup_root; + return ret;
ret = register_filesystem(&rdt_fs_type); if (ret) @@ -3911,8 +3924,6 @@ int __init rdtgroup_init(void)
cleanup_mountpoint: sysfs_remove_mount_point(fs_kobj, "resctrl"); -cleanup_root: - kernfs_destroy_root(rdt_root);
return ret; } @@ -3922,5 +3933,4 @@ void __exit rdtgroup_exit(void) debugfs_remove_recursive(debugfs_resctrl); unregister_filesystem(&rdt_fs_type); sysfs_remove_mount_point(fs_kobj, "resctrl"); - kernfs_destroy_root(rdt_root); }
From: Babu Moger babu.moger@amd.com
Add "-o debug" option to mount resctrl filesystem in debug mode. When in debug mode resctrl displays files that have the new RFTYPE_DEBUG flag to help resctrl debugging.
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-7-babu.moger@amd.com --- Documentation/arch/x86/resctrl.rst | 5 ++++- arch/x86/kernel/cpu/resctrl/internal.h | 2 ++ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 18 ++++++++++++++++++ 3 files changed, 24 insertions(+), 1 deletion(-)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index 8154e9975d1e..28d35aaa74b4 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -35,7 +35,7 @@ about the feature from resctrl's info directory.
To use the feature mount the file system::
- # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
mount options are:
@@ -46,6 +46,9 @@ mount options are: "mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA bandwidth in MBps +"debug": + Make debug files accessible. Available debug files are annotated with + "Available only with debug option".
L2 and L3 CDP are controlled separately.
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index ba4611111212..b816b902b5c0 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -59,6 +59,7 @@ struct rdt_fs_context { bool enable_cdpl2; bool enable_cdpl3; bool enable_mba_mbps; + bool enable_debug; };
static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) @@ -248,6 +249,7 @@ struct rdtgroup { #define RFTYPE_TOP BIT(6) #define RFTYPE_RES_CACHE BIT(8) #define RFTYPE_RES_MB BIT(9) +#define RFTYPE_DEBUG BIT(10) #define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) #define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) #define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 55da93b5ca53..84e0f45578dd 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -59,6 +59,8 @@ static void rdtgroup_destroy_root(void);
struct dentry *debugfs_resctrl;
+static bool resctrl_debug; + void rdt_last_cmd_clear(void) { lockdep_assert_held(&rdtgroup_mutex); @@ -1892,6 +1894,9 @@ static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
lockdep_assert_held(&rdtgroup_mutex);
+ if (resctrl_debug) + fflags |= RFTYPE_DEBUG; + for (rft = rfts; rft < rfts + len; rft++) { if (rft->fflags && ((fflags & rft->fflags) == rft->fflags)) { ret = rdtgroup_add_file(kn, rft); @@ -2395,6 +2400,8 @@ static void rdt_disable_ctx(void) resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); set_mba_sc(false); + + resctrl_debug = false; }
static int rdt_enable_ctx(struct rdt_fs_context *ctx) @@ -2419,6 +2426,9 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx) goto out_cdpl3; }
+ if (ctx->enable_debug) + resctrl_debug = true; + return 0;
out_cdpl3: @@ -2623,6 +2633,7 @@ enum rdt_param { Opt_cdp, Opt_cdpl2, Opt_mba_mbps, + Opt_debug, nr__rdt_params };
@@ -2630,6 +2641,7 @@ static const struct fs_parameter_spec rdt_fs_parameters[] = { fsparam_flag("cdp", Opt_cdp), fsparam_flag("cdpl2", Opt_cdpl2), fsparam_flag("mba_MBps", Opt_mba_mbps), + fsparam_flag("debug", Opt_debug), {} };
@@ -2655,6 +2667,9 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param) return -EINVAL; ctx->enable_mba_mbps = true; return 0; + case Opt_debug: + ctx->enable_debug = true; + return 0; }
return -EINVAL; @@ -3723,6 +3738,9 @@ static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl)) seq_puts(seq, ",mba_MBps");
+ if (resctrl_debug) + seq_puts(seq, ",debug"); + return 0; }
From: Babu Moger babu.moger@amd.com
In x86, hardware uses CLOSID to identify a control group. When a user creates a control group this information is not visible to the user. It can help resctrl debugging.
Add CLOSID(ctrl_hw_id) to the control groups display in the resctrl interface. Users can see this detail when resctrl is mounted with the "-o debug" option.
Other architectures do not use "CLOSID". Use the names ctrl_hw_id to refer to "CLOSID" in an effort to keep the naming generic.
For example: $cat /sys/fs/resctrl/ctrl_grp1/ctrl_hw_id 1
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-8-babu.moger@amd.com --- Documentation/arch/x86/resctrl.rst | 4 ++++ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 23 +++++++++++++++++++++++ 2 files changed, 27 insertions(+)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index 28d35aaa74b4..54691c8b832d 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -352,6 +352,10 @@ When control is enabled all CTRL_MON groups will also contain: file. On successful pseudo-locked region creation the mode will automatically change to "pseudo-locked".
+"ctrl_hw_id": + Available only with debug option. The identifier used by hardware + for the control group. On x86 this is the CLOSID. + When monitoring is enabled all MON groups will also contain:
"mon_data": diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 84e0f45578dd..5814a0bf3cea 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -779,6 +779,22 @@ static int rdtgroup_tasks_show(struct kernfs_open_file *of, return ret; }
+static int rdtgroup_closid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->closid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + #ifdef CONFIG_PROC_CPU_RESCTRL
/* @@ -1881,6 +1897,13 @@ static struct rftype res_common_files[] = { .seq_show = rdt_has_sparse_bitmasks_show, .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, }, + { + .name = "ctrl_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_closid_show, + .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG, + },
};
From: Babu Moger babu.moger@amd.com
Files unique to monitoring groups have the RFTYPE_MON flag. When a new monitoring group is created the resctrl files with flags RFTYPE_BASE (files common to all resource groups) and RFTYPE_MON (files unique to monitoring groups) are created to support interacting with the new monitoring group.
A resource group can support both monitoring and control, also termed a CTRL_MON resource group. CTRL_MON groups should get both monitoring and control resctrl files but that is not the case. Only the RFTYPE_BASE and RFTYPE_CTRL files are created for CTRL_MON groups.
Ensure that files with the RFTYPE_MON flag are created for CTRL_MON groups.
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-9-babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 5814a0bf3cea..5f6d6ba63a2e 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2551,6 +2551,7 @@ static void schemata_list_destroy(void) static int rdt_get_tree(struct fs_context *fc) { struct rdt_fs_context *ctx = rdt_fc2context(fc); + unsigned long flags = RFTYPE_CTRL_BASE; struct rdt_domain *dom; struct rdt_resource *r; int ret; @@ -2581,7 +2582,10 @@ static int rdt_get_tree(struct fs_context *fc)
closid_init();
- ret = rdtgroup_add_files(rdtgroup_default.kn, RFTYPE_CTRL_BASE); + if (rdt_mon_capable) + flags |= RFTYPE_MON; + + ret = rdtgroup_add_files(rdtgroup_default.kn, flags); if (ret) goto out_schemata_free;
@@ -3271,8 +3275,8 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, enum rdt_group_type rtype, struct rdtgroup **r) { struct rdtgroup *prdtgrp, *rdtgrp; + unsigned long files = 0; struct kernfs_node *kn; - uint files = 0; int ret;
prdtgrp = rdtgroup_kn_lock_live(parent_kn); @@ -3324,10 +3328,13 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, goto out_destroy; }
- if (rtype == RDTCTRL_GROUP) + if (rtype == RDTCTRL_GROUP) { files = RFTYPE_BASE | RFTYPE_CTRL; - else + if (rdt_mon_capable) + files |= RFTYPE_MON; + } else { files = RFTYPE_BASE | RFTYPE_MON; + }
ret = rdtgroup_add_files(kn, files); if (ret) {
From: Babu Moger babu.moger@amd.com
In x86, hardware uses RMID to identify a monitoring group. When a user creates a monitor group these details are not visible. These details can help resctrl debugging.
Add RMID(mon_hw_id) to the monitor groups display in the resctrl interface. Users can see these details when resctrl is mounted with "-o debug" option.
Add RFTYPE_MON_BASE that complements existing RFTYPE_CTRL_BASE and represents files belonging to monitoring groups.
Other architectures do not use "RMID". Use the name mon_hw_id to refer to "RMID" in an effort to keep the naming generic.
For example: $cat /sys/fs/resctrl/mon_groups/mon_grp1/mon_hw_id 3
Signed-off-by: Babu Moger babu.moger@amd.com Signed-off-by: Borislav Petkov (AMD) bp@alien8.de Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Reviewed-by: Fenghua Yu fenghua.yu@intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Tested-by: Peter Newman peternewman@google.com Tested-by: Tan Shaopeng tan.shaopeng@jp.fujitsu.com Link: https://lore.kernel.org/r/20231017002308.134480-10-babu.moger@amd.com --- Documentation/arch/x86/resctrl.rst | 4 ++++ arch/x86/kernel/cpu/resctrl/internal.h | 1 + arch/x86/kernel/cpu/resctrl/rdtgroup.c | 23 +++++++++++++++++++++++ 3 files changed, 28 insertions(+)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index 54691c8b832d..98b0eb509ed4 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -369,6 +369,10 @@ When monitoring is enabled all MON groups will also contain: the sum for all tasks in the CTRL_MON group and all tasks in MON groups. Please see example section for more details on usage.
+"mon_hw_id": + Available only with debug option. The identifier used by hardware + for the monitor group. On x86 this is the RMID. + Resource allocation rules -------------------------
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index b816b902b5c0..a4f1aa15f0a2 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -254,6 +254,7 @@ struct rdtgroup { #define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) #define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) #define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) +#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON)
/* List of all resource groups */ extern struct list_head rdt_all_groups; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 5f6d6ba63a2e..69a1de92384a 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -795,6 +795,22 @@ static int rdtgroup_closid_show(struct kernfs_open_file *of, return ret; }
+static int rdtgroup_rmid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->mon.rmid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + #ifdef CONFIG_PROC_CPU_RESCTRL
/* @@ -1867,6 +1883,13 @@ static struct rftype res_common_files[] = { .seq_show = rdtgroup_tasks_show, .fflags = RFTYPE_BASE, }, + { + .name = "mon_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_rmid_show, + .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG, + }, { .name = "schemata", .mode = 0644,
From: James Morse james.morse@arm.com
tick_nohz_full_mask lists the CPUs that are nohz_full. This is only needed when CONFIG_NO_HZ_FULL is defined. tick_nohz_full_cpu() allows a specific CPU to be tested against the mask, and evaluates to false when CONFIG_NO_HZ_FULL is not defined.
The resctrl code needs to pick a CPU to run some work on, a new helper prefers housekeeping CPUs by examining the tick_nohz_full_mask. Hiding the declaration behind #ifdef CONFIG_NO_HZ_FULL forces all the users to be behind an ifdef too.
Move the tick_nohz_full_mask declaration, this lets callers drop the ifdef, and guard access to tick_nohz_full_mask with IS_ENABLED() or something like tick_nohz_full_cpu().
The definition does not need to be moved as any callers should be removed at compile time unless CONFIG_NO_HZ_FULL is defined.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Acked-by: Reinette Chatre reinette.chatre@intel.com # for resctrl dependency CC: Frederic Weisbecker frederic@kernel.org --- include/linux/tick.h | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/tick.h b/include/linux/tick.h index 9459fef5b857..65af90ca409a 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -174,9 +174,16 @@ static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; } static inline void tick_nohz_idle_stop_tick_protected(void) { } #endif /* !CONFIG_NO_HZ_COMMON */
+/* + * Mask of CPUs that are nohz_full. + * + * Users should be guarded by CONFIG_NO_HZ_FULL or a tick_nohz_full_cpu() + * check. + */ +extern cpumask_var_t tick_nohz_full_mask; + #ifdef CONFIG_NO_HZ_FULL extern bool tick_nohz_full_running; -extern cpumask_var_t tick_nohz_full_mask;
static inline bool tick_nohz_full_enabled(void) {
From: James Morse james.morse@arm.com
rmid_ptrs[] is allocated from dom_data_init() but never free()d.
While the exit text ends up in the linker script's DISCARD section, the direction of travel is for resctrl to be/have loadable modules.
Add resctrl_put_mon_l3_config() to cleanup any memory allocated by rdt_get_mon_l3_config().
There is no reason to backport this to a stable kernel.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++ arch/x86/kernel/cpu/resctrl/internal.h | 1 + arch/x86/kernel/cpu/resctrl/monitor.c | 15 +++++++++++++++ 3 files changed, 22 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 19e0681f0435..b5d4b76f97c5 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -992,8 +992,14 @@ late_initcall(resctrl_late_init);
static void __exit resctrl_exit(void) { + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + cpuhp_remove_state(rdt_online); + rdtgroup_exit(); + + if (r->mon_capable) + rdt_put_mon_l3_config(r); }
__exitcall(resctrl_exit); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index a4f1aa15f0a2..f68c6aecfa66 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -546,6 +546,7 @@ void closid_free(int closid); int alloc_rmid(void); void free_rmid(u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); +void __exit rdt_put_mon_l3_config(struct rdt_resource *r); bool __init rdt_cpu_has(int flag); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index f136ac046851..5d9864919f1c 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -741,6 +741,16 @@ static int dom_data_init(struct rdt_resource *r) return 0; }
+static void __exit dom_data_exit(struct rdt_resource *r) +{ + mutex_lock(&rdtgroup_mutex); + + kfree(rmid_ptrs); + rmid_ptrs = NULL; + + mutex_unlock(&rdtgroup_mutex); +} + static struct mon_evt llc_occupancy_event = { .name = "llc_occupancy", .evtid = QOS_L3_OCCUP_EVENT_ID, @@ -830,6 +840,11 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) return 0; }
+void __exit rdt_put_mon_l3_config(struct rdt_resource *r) +{ + dom_data_exit(r); +} + void __init intel_rdt_mbm_apply_quirk(void) { int cf_index;
From: James Morse james.morse@arm.com
When monitoring is supported, each monitor and control group is allocated an RMID. For control groups, rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.
MPAM's equivalent of RMID are not an independent number, so can't be allocated until the CLOSID is known. An RMID allocation for one CLOSID may fail, whereas another may succeed depending on how many monitor groups a control group has.
The RMID allocation needs to move to be after the CLOSID has been allocated.
Move the RMID allocation and mondata dir creation to a helper.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Peter Newman peternewman@google.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 42 +++++++++++++++++--------- 1 file changed, 27 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 69a1de92384a..1eb3a3075093 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3293,6 +3293,30 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) return ret; }
+static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) +{ + int ret; + + if (!rdt_mon_capable) + return 0; + + ret = alloc_rmid(); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + + ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + free_rmid(rdtgrp->mon.rmid); + return ret; + } + + return 0; +} + static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, const char *name, umode_t mode, enum rdt_group_type rtype, struct rdtgroup **r) @@ -3365,20 +3389,10 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, goto out_destroy; }
- if (rdt_mon_capable) { - ret = alloc_rmid(); - if (ret < 0) { - rdt_last_cmd_puts("Out of RMIDs\n"); - goto out_destroy; - } - rdtgrp->mon.rmid = ret; + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) + goto out_destroy;
- ret = mkdir_mondata_all(kn, rdtgrp, &rdtgrp->mon.mon_data_kn); - if (ret) { - rdt_last_cmd_puts("kernfs subdir error\n"); - goto out_idfree; - } - } kernfs_activate(kn);
/* @@ -3386,8 +3400,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, */ return 0;
-out_idfree: - free_rmid(rdtgrp->mon.rmid); out_destroy: kernfs_put(rdtgrp->kn); kernfs_remove(rdtgrp->kn);
From: James Morse james.morse@arm.com
RMID are allocated for each monitor or control group directory, because each of these needs its own RMID. For control groups, rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.
MPAM's equivalent of RMID is not an independent number, so can't be allocated until the CLOSID is known. An RMID allocation for one CLOSID may fail, whereas another may succeed depending on how many monitor groups a control group has.
The RMID allocation needs to move to be after the CLOSID has been allocated.
Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller, after the mkdir_rdt_prepare() call. This allows the RMID allocator to know the CLOSID.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 35 +++++++++++++++++++------- 1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 1eb3a3075093..8fb0f56f64be 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3317,6 +3317,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) return 0; }
+static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) +{ + if (rdt_mon_capable) + free_rmid(rgrp->mon.rmid); +} + static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, const char *name, umode_t mode, enum rdt_group_type rtype, struct rdtgroup **r) @@ -3389,12 +3395,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, goto out_destroy; }
- ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); - if (ret) - goto out_destroy; - - kernfs_activate(kn); - /* * The caller unlocks the parent_kn upon success. */ @@ -3413,7 +3413,6 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) { kernfs_remove(rgrp->kn); - free_rmid(rgrp->mon.rmid); rdtgroup_remove(rgrp); }
@@ -3435,12 +3434,21 @@ static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, prgrp = rdtgrp->mon.parent; rdtgrp->closid = prgrp->closid;
+ ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) { + mkdir_rdt_prepare_clean(rdtgrp); + goto out_unlock; + } + + kernfs_activate(rdtgrp->kn); + /* * Add the rdtgrp to the list of rdtgrps the parent * ctrl_mon group has to track. */ list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list);
+out_unlock: rdtgroup_kn_unlock(parent_kn); return ret; } @@ -3471,9 +3479,16 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, ret = 0;
rdtgrp->closid = closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) + goto out_closid_free; + + kernfs_activate(rdtgrp->kn); + ret = rdtgroup_init_alloc(rdtgrp); if (ret < 0) - goto out_id_free; + goto out_rmid_free;
list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
@@ -3493,7 +3508,9 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
out_del_list: list_del(&rdtgrp->rdtgroup_list); -out_id_free: +out_rmid_free: + mkdir_rdt_prepare_rmid_free(rdtgrp); +out_closid_free: closid_free(closid); out_common_fail: mkdir_rdt_prepare_clean(rdtgrp);
From: James Morse james.morse@arm.com
x86's RMID are independent of the CLOSID. An RMID can be allocated, used and freed without considering the CLOSID.
MPAM's equivalent feature is PMG, which is not an independent number, it extends the CLOSID/PARTID space. For MPAM, only PMG-bits worth of 'RMID' can be allocated for a single CLOSID. i.e. if there is 1 bit of PMG space, then each CLOSID can have two monitor groups.
To allow resctrl to disambiguate RMID values for different CLOSID, everything in resctrl that keeps an RMID value needs to know the CLOSID too. This will always be ignored on x86.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Xin Hao xhao@linux.alibaba.com --- arch/x86/include/asm/resctrl.h | 7 +++ arch/x86/kernel/cpu/resctrl/internal.h | 2 +- arch/x86/kernel/cpu/resctrl/monitor.c | 73 +++++++++++++++-------- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 ++-- include/linux/resctrl.h | 16 ++++- 6 files changed, 77 insertions(+), 37 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 255a78d9d906..cc6e1bce7b1a 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -7,6 +7,13 @@ #include <linux/sched.h> #include <linux/jump_label.h>
+/* + * This value can never be a valid CLOSID, and is used when mapping a + * (closid, rmid) pair to an index and back. On x86 only the RMID is + * needed. The index is a software defined value. + */ +#define X86_RESCTRL_EMPTY_CLOSID ((u32)~0) + /** * struct resctrl_pqr_state - State cache for the PQR MSR * @cur_rmid: The cached Resource Monitoring ID diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index f68c6aecfa66..c836e3294e12 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -544,7 +544,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r); int closids_supported(void); void closid_free(int closid); int alloc_rmid(void); -void free_rmid(u32 rmid); +void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); void __exit rdt_put_mon_l3_config(struct rdt_resource *r); bool __init rdt_cpu_has(int flag); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 5d9864919f1c..f5d5570a5b5a 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -24,7 +24,20 @@
#include "internal.h"
+/** + * struct rmid_entry - dirty tracking for all RMID. + * @closid: The CLOSID for this entry. + * @rmid: The RMID for this entry. + * @busy: The number of domains with cached data using this RMID. + * @list: Member of the rmid_free_lru list when busy == 0. + * + * Depending on the architecture the correct monitor is accessed using + * both @closid and @rmid, or @rmid only. + * + * Take the rdtgroup_mutex when accessing. + */ struct rmid_entry { + u32 closid; u32 rmid; int busy; struct list_head list; @@ -136,7 +149,7 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val) return val; }
-static inline struct rmid_entry *__rmid_entry(u32 rmid) +static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid) { struct rmid_entry *entry;
@@ -190,7 +203,8 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom, }
void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, - u32 rmid, enum resctrl_event_id eventid) + u32 unused, u32 rmid, + enum resctrl_event_id eventid) { struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); struct arch_mbm_state *am; @@ -230,7 +244,8 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width) }
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, - u32 rmid, enum resctrl_event_id eventid, u64 *val) + u32 unused, u32 rmid, enum resctrl_event_id eventid, + u64 *val) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); @@ -285,9 +300,9 @@ void __check_limbo(struct rdt_domain *d, bool force_free) if (nrmid >= r->num_rmid) break;
- entry = __rmid_entry(nrmid); + entry = __rmid_entry(X86_RESCTRL_EMPTY_CLOSID, nrmid);// temporary
- if (resctrl_arch_rmid_read(r, d, entry->rmid, + if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, QOS_L3_OCCUP_EVENT_ID, &val)) { rmid_dirty = true; } else { @@ -342,7 +357,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) cpu = get_cpu(); list_for_each_entry(d, &r->domains, list) { if (cpumask_test_cpu(cpu, &d->cpu_mask)) { - err = resctrl_arch_rmid_read(r, d, entry->rmid, + err = resctrl_arch_rmid_read(r, d, entry->closid, + entry->rmid, QOS_L3_OCCUP_EVENT_ID, &val); if (err || val <= resctrl_rmid_realloc_threshold) @@ -366,7 +382,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) list_add_tail(&entry->list, &rmid_free_lru); }
-void free_rmid(u32 rmid) +void free_rmid(u32 closid, u32 rmid) { struct rmid_entry *entry;
@@ -375,7 +391,7 @@ void free_rmid(u32 rmid)
lockdep_assert_held(&rdtgroup_mutex);
- entry = __rmid_entry(rmid); + entry = __rmid_entry(closid, rmid);
if (is_llc_occupancy_enabled()) add_rmid_to_limbo(entry); @@ -383,8 +399,8 @@ void free_rmid(u32 rmid) list_add_tail(&entry->list, &rmid_free_lru); }
-static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid, - enum resctrl_event_id evtid) +static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid, + u32 rmid, enum resctrl_event_id evtid) { switch (evtid) { case QOS_L3_MBM_TOTAL_EVENT_ID: @@ -396,20 +412,21 @@ static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid, } }
-static int __mon_event_count(u32 rmid, struct rmid_read *rr) +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) { struct mbm_state *m; u64 tval = 0;
if (rr->first) { - resctrl_arch_reset_rmid(rr->r, rr->d, rmid, rr->evtid); - m = get_mbm_state(rr->d, rmid, rr->evtid); + resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid); + m = get_mbm_state(rr->d, closid, rmid, rr->evtid); if (m) memset(m, 0, sizeof(struct mbm_state)); return 0; }
- rr->err = resctrl_arch_rmid_read(rr->r, rr->d, rmid, rr->evtid, &tval); + rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid, + &tval); if (rr->err) return rr->err;
@@ -421,6 +438,7 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr) /* * mbm_bw_count() - Update bw count from values previously read by * __mon_event_count(). + * @closid: The closid used to identify the cached mbm_state. * @rmid: The rmid used to identify the cached mbm_state. * @rr: The struct rmid_read populated by __mon_event_count(). * @@ -429,7 +447,7 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr) * __mon_event_count() is compared with the chunks value from the previous * invocation. This must be called once per second to maintain values in MBps. */ -static void mbm_bw_count(u32 rmid, struct rmid_read *rr) +static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) { struct mbm_state *m = &rr->d->mbm_local[rmid]; u64 cur_bw, bytes, cur_bytes; @@ -459,7 +477,7 @@ void mon_event_count(void *info)
rdtgrp = rr->rgrp;
- ret = __mon_event_count(rdtgrp->mon.rmid, rr); + ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr);
/* * For Ctrl groups read data from child monitor groups and @@ -470,7 +488,8 @@ void mon_event_count(void *info)
if (rdtgrp->type == RDTCTRL_GROUP) { list_for_each_entry(entry, head, mon.crdtgrp_list) { - if (__mon_event_count(entry->mon.rmid, rr) == 0) + if (__mon_event_count(entry->closid, entry->mon.rmid, + rr) == 0) ret = 0; } } @@ -600,7 +619,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) } }
-static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid) +static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid) { struct rmid_read rr;
@@ -615,12 +635,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid) if (is_mbm_total_enabled()) { rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; rr.val = 0; - __mon_event_count(rmid, &rr); + __mon_event_count(closid, rmid, &rr); } if (is_mbm_local_enabled()) { rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; rr.val = 0; - __mon_event_count(rmid, &rr); + __mon_event_count(closid, rmid, &rr);
/* * Call the MBA software controller only for the @@ -628,7 +648,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid) * the software controller explicitly. */ if (is_mba_sc(NULL)) - mbm_bw_count(rmid, &rr); + mbm_bw_count(closid, rmid, &rr); } }
@@ -685,11 +705,11 @@ void mbm_handle_overflow(struct work_struct *work) d = container_of(work, struct rdt_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { - mbm_update(r, d, prgrp->mon.rmid); + mbm_update(r, d, prgrp->closid, prgrp->mon.rmid);
head = &prgrp->mon.crdtgrp_list; list_for_each_entry(crgrp, head, mon.crdtgrp_list) - mbm_update(r, d, crgrp->mon.rmid); + mbm_update(r, d, crgrp->closid, crgrp->mon.rmid);
if (is_mba_sc(NULL)) update_mba_bw(prgrp, d); @@ -732,10 +752,11 @@ static int dom_data_init(struct rdt_resource *r) }
/* - * RMID 0 is special and is always allocated. It's used for all - * tasks that are not monitored. + * RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and + * are always allocated. These are used for the rdtgroup_default + * control group, which will be setup later in rdtgroup_init(). */ - entry = __rmid_entry(0); + entry = __rmid_entry(RESCTRL_RESERVED_CLOSID, RESCTRL_RESERVED_RMID); list_del(&entry->list);
return 0; diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 8f559eeae08e..65bee6f11015 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -752,7 +752,7 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) * anymore when this group would be used for pseudo-locking. This * is safe to call on platforms not capable of monitoring. */ - free_rmid(rdtgrp->mon.rmid); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
ret = 0; goto out; @@ -787,7 +787,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
ret = rdtgroup_locksetup_user_restore(rdtgrp); if (ret) { - free_rmid(rdtgrp->mon.rmid); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); return ret; }
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 8fb0f56f64be..f64c13fe22d7 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2842,7 +2842,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
head = &rdtgrp->mon.crdtgrp_list; list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { - free_rmid(sentry->mon.rmid); + free_rmid(sentry->closid, sentry->mon.rmid); list_del(&sentry->mon.crdtgrp_list);
if (atomic_read(&sentry->waitcount) != 0) @@ -2882,7 +2882,7 @@ static void rmdir_all_sub(void) cpumask_or(&rdtgroup_default.cpu_mask, &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
- free_rmid(rdtgrp->mon.rmid); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
kernfs_remove(rdtgrp->kn); list_del(&rdtgrp->rdtgroup_list); @@ -3310,7 +3310,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); if (ret) { rdt_last_cmd_puts("kernfs subdir error\n"); - free_rmid(rdtgrp->mon.rmid); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); return ret; }
@@ -3320,7 +3320,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) { if (rdt_mon_capable) - free_rmid(rgrp->mon.rmid); + free_rmid(rgrp->closid, rgrp->mon.rmid); }
static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, @@ -3579,7 +3579,7 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) update_closid_rmid(tmpmask, NULL);
rdtgrp->flags = RDT_DELETED; - free_rmid(rdtgrp->mon.rmid); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
/* * Remove the rdtgrp from the parent ctrl_mon group's list @@ -3625,8 +3625,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); update_closid_rmid(tmpmask, NULL);
+ free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); closid_free(rdtgrp->closid); - free_rmid(rdtgrp->mon.rmid);
rdtgroup_ctrl_remove(rdtgrp);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 66942d7fba7f..bd4ec22b5a96 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -6,6 +6,10 @@ #include <linux/list.h> #include <linux/pid.h>
+/* CLOSID, RMID value used by the default control group */ +#define RESCTRL_RESERVED_CLOSID 0 +#define RESCTRL_RESERVED_RMID 0 + #ifdef CONFIG_PROC_CPU_RESCTRL
int proc_resctrl_show(struct seq_file *m, @@ -225,6 +229,9 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); * for this resource and domain. * @r: resource that the counter should be read from. * @d: domain that the counter should be read from. + * @closid: closid that matches the rmid. Depending on the architecture, the + * counter may match traffic of both @closid and @rmid, or @rmid + * only. * @rmid: rmid of the counter to read. * @eventid: eventid to read, e.g. L3 occupancy. * @val: result of the counter read in bytes. @@ -235,20 +242,25 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); * 0 on success, or -EIO, -EINVAL etc on error. */ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, - u32 rmid, enum resctrl_event_id eventid, u64 *val); + u32 closid, u32 rmid, enum resctrl_event_id eventid, + u64 *val); +
/** * resctrl_arch_reset_rmid() - Reset any private state associated with rmid * and eventid. * @r: The domain's resource. * @d: The rmid's domain. + * @closid: closid that matches the rmid. Depending on the architecture, the + * counter may match traffic of both @closid and @rmid, or @rmid only. * @rmid: The rmid whose counter values should be reset. * @eventid: The eventid whose counter values should be reset. * * This can be called from any CPU. */ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, - u32 rmid, enum resctrl_event_id eventid); + u32 closid, u32 rmid, + enum resctrl_event_id eventid);
/** * resctrl_arch_reset_rmid_all() - Reset all private state associated with
From: James Morse james.morse@arm.com
x86 systems identify traffic using the CLOSID and RMID. The CLOSID is used to lookup the control policy, the RMID is used for monitoring. For x86 these are independent numbers. Arm's MPAM has equivalent features PARTID and PMG, where the PARTID is used to lookup the control policy. The PMG in contrast is a small number of bits that are used to subdivide PARTID when monitoring. The cache-occupancy monitors require the PARTID to be specified when monitoring.
This means MPAM's PMG field is not unique. There are multiple PMG-0, one per allocated CLOSID/PARTID. If PMG is treated as equivalent to RMID, it cannot be allocated as an independent number. Bitmaps like rmid_busy_llc need to be sized by the number of unique entries for this resource.
Treat the combined CLOSID and RMID as an index, and provide architecture helpers to pack and unpack an index. This makes the MPAM values unique. The domain's rmid_busy_llc and rmid_ptrs[] are then sized by index, as are domain mbm_local[] and mbm_total[].
x86 can ignore the CLOSID field when packing and unpacking an index, and report as many indexes as RMID.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/include/asm/resctrl.h | 17 +++++ arch/x86/kernel/cpu/resctrl/core.c | 13 ++-- arch/x86/kernel/cpu/resctrl/internal.h | 4 +- arch/x86/kernel/cpu/resctrl/monitor.c | 98 +++++++++++++++++--------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 13 ++-- 5 files changed, 100 insertions(+), 45 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index cc6e1bce7b1a..db4c84dde2d5 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -101,6 +101,23 @@ static inline void resctrl_sched_in(struct task_struct *tsk) __resctrl_sched_in(tsk); }
+static inline u32 resctrl_arch_system_num_rmid_idx(void) +{ + /* RMID are independent numbers for x86. num_rmid_idx == num_rmid */ + return boot_cpu_data.x86_cache_max_rmid + 1; +} + +static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) +{ + *rmid = idx; + *closid = X86_RESCTRL_EMPTY_CLOSID; +} + +static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid) +{ + return rmid; +} + void resctrl_cpu_detect(struct cpuinfo_x86 *c);
#else diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index b5d4b76f97c5..ba6daef6bd26 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -589,7 +589,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r) mbm_setup_overflow_handler(d, 0); } if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu && - has_busy_rmid(r, d)) { + has_busy_rmid(d)) { cancel_delayed_work(&d->cqm_limbo); cqm_setup_limbo_handler(d, 0); } @@ -600,11 +600,12 @@ static void clear_closid_rmid(int cpu) { struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
- state->default_closid = 0; - state->default_rmid = 0; - state->cur_closid = 0; - state->cur_rmid = 0; - wrmsr(MSR_IA32_PQR_ASSOC, 0, 0); + state->default_closid = RESCTRL_RESERVED_CLOSID; + state->default_rmid = RESCTRL_RESERVED_RMID; + state->cur_closid = RESCTRL_RESERVED_CLOSID; + state->cur_rmid = RESCTRL_RESERVED_RMID; + wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_RMID, + RESCTRL_RESERVED_CLOSID); }
static int resctrl_online_cpu(unsigned int cpu) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index c836e3294e12..97ec24f91ac4 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -8,6 +8,8 @@ #include <linux/fs_context.h> #include <linux/jump_label.h>
+#include <asm/resctrl.h> + #define L3_QOS_CDP_ENABLE 0x01ULL
#define L2_QOS_CDP_ENABLE 0x01ULL @@ -560,7 +562,7 @@ void __init intel_rdt_mbm_apply_quirk(void); bool is_mba_sc(struct rdt_resource *r); void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms); void cqm_handle_limbo(struct work_struct *work); -bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d); +bool has_busy_rmid(struct rdt_domain *d); void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); void __init thread_throttle_mode_init(void); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index f5d5570a5b5a..d623031def4c 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -149,12 +149,29 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val) return val; }
-static inline struct rmid_entry *__rmid_entry(u32 closid, u32 rmid) +/* + * x86 and arm64 differ in their handling of monitoring. + * x86's RMID are independent numbers, there is only one source of traffic + * with an RMID value of '1'. + * arm64's PMG extends the PARTID/CLOSID space, there are multiple sources of + * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID + * value is no longer unique. + * To account for this, resctrl uses an index. On x86 this is just the RMID, + * on arm64 it encodes the CLOSID and RMID. This gives a unique number. + * + * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code + * must accept an attempt to read every index. + */ +static inline struct rmid_entry *__rmid_entry(u32 idx) { struct rmid_entry *entry; + u32 closid, rmid; + + entry = &rmid_ptrs[idx]; + resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
- entry = &rmid_ptrs[rmid]; - WARN_ON(entry->rmid != rmid); + WARN_ON_ONCE(entry->closid != closid); + WARN_ON_ONCE(entry->rmid != rmid);
return entry; } @@ -284,8 +301,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, void __check_limbo(struct rdt_domain *d, bool force_free) { struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); struct rmid_entry *entry; - u32 crmid = 1, nrmid; + u32 idx, cur_idx = 1; bool rmid_dirty; u64 val = 0;
@@ -296,12 +314,11 @@ void __check_limbo(struct rdt_domain *d, bool force_free) * RMID and move it to the free list when the counter reaches 0. */ for (;;) { - nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid); - if (nrmid >= r->num_rmid) + idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx); + if (idx >= idx_limit) break;
- entry = __rmid_entry(X86_RESCTRL_EMPTY_CLOSID, nrmid);// temporary - + entry = __rmid_entry(idx); if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, QOS_L3_OCCUP_EVENT_ID, &val)) { rmid_dirty = true; @@ -310,19 +327,21 @@ void __check_limbo(struct rdt_domain *d, bool force_free) }
if (force_free || !rmid_dirty) { - clear_bit(entry->rmid, d->rmid_busy_llc); + clear_bit(idx, d->rmid_busy_llc); if (!--entry->busy) { rmid_limbo_count--; list_add_tail(&entry->list, &rmid_free_lru); } } - crmid = nrmid + 1; + cur_idx = idx + 1; } }
-bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d) +bool has_busy_rmid(struct rdt_domain *d) { - return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid; + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + + return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; }
/* @@ -352,6 +371,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) struct rdt_domain *d; int cpu, err; u64 val = 0; + u32 idx; + + idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0; cpu = get_cpu(); @@ -369,9 +391,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) * For the first limbo RMID in the domain, * setup up the limbo worker. */ - if (!has_busy_rmid(r, d)) + if (!has_busy_rmid(d)) cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL); - set_bit(entry->rmid, d->rmid_busy_llc); + set_bit(idx, d->rmid_busy_llc); entry->busy++; } put_cpu(); @@ -384,14 +406,21 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
void free_rmid(u32 closid, u32 rmid) { + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); struct rmid_entry *entry;
- if (!rmid) - return; - lockdep_assert_held(&rdtgroup_mutex);
- entry = __rmid_entry(closid, rmid); + /* + * Do not allow the default rmid to be free'd. Comparing by index + * allows architectures that ignore the closid parameter to avoid an + * unnecessary check. + */ + if (idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID)) + return; + + entry = __rmid_entry(idx);
if (is_llc_occupancy_enabled()) add_rmid_to_limbo(entry); @@ -402,11 +431,13 @@ void free_rmid(u32 closid, u32 rmid) static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id evtid) { + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + switch (evtid) { case QOS_L3_MBM_TOTAL_EVENT_ID: - return &d->mbm_total[rmid]; + return &d->mbm_total[idx]; case QOS_L3_MBM_LOCAL_EVENT_ID: - return &d->mbm_local[rmid]; + return &d->mbm_local[idx]; default: return NULL; } @@ -449,7 +480,8 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) */ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) { - struct mbm_state *m = &rr->d->mbm_local[rmid]; + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + struct mbm_state *m = &rr->d->mbm_local[idx]; u64 cur_bw, bytes, cur_bytes;
cur_bytes = rr->val; @@ -539,7 +571,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) { u32 closid, rmid, cur_msr_val, new_msr_val; struct mbm_state *pmbm_data, *cmbm_data; - u32 cur_bw, delta_bw, user_bw; + u32 cur_bw, delta_bw, user_bw, idx; struct rdt_resource *r_mba; struct rdt_domain *dom_mba; struct list_head *head; @@ -552,7 +584,8 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
closid = rgrp->closid; rmid = rgrp->mon.rmid; - pmbm_data = &dom_mbm->mbm_local[rmid]; + idx = resctrl_arch_rmid_idx_encode(closid, rmid); + pmbm_data = &dom_mbm->mbm_local[idx];
dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba); if (!dom_mba) { @@ -660,17 +693,15 @@ void cqm_handle_limbo(struct work_struct *work) { unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); int cpu = smp_processor_id(); - struct rdt_resource *r; struct rdt_domain *d;
mutex_lock(&rdtgroup_mutex);
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; d = container_of(work, struct rdt_domain, cqm_limbo.work);
__check_limbo(d, false);
- if (has_busy_rmid(r, d)) + if (has_busy_rmid(d)) schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);
mutex_unlock(&rdtgroup_mutex); @@ -735,19 +766,20 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
static int dom_data_init(struct rdt_resource *r) { + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); struct rmid_entry *entry = NULL; - int i, nr_rmids; + u32 idx; + int i;
- nr_rmids = r->num_rmid; - rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL); + rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); if (!rmid_ptrs) return -ENOMEM;
- for (i = 0; i < nr_rmids; i++) { + for (i = 0; i < idx_limit; i++) { entry = &rmid_ptrs[i]; INIT_LIST_HEAD(&entry->list);
- entry->rmid = i; + resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid); list_add_tail(&entry->list, &rmid_free_lru); }
@@ -756,7 +788,9 @@ static int dom_data_init(struct rdt_resource *r) * are always allocated. These are used for the rdtgroup_default * control group, which will be setup later in rdtgroup_init(). */ - entry = __rmid_entry(RESCTRL_RESERVED_CLOSID, RESCTRL_RESERVED_RMID); + idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID); + entry = __rmid_entry(idx); list_del(&entry->list);
return 0; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index f64c13fe22d7..351b74670673 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3858,8 +3858,8 @@ static void __init rdtgroup_setup_default(void) { mutex_lock(&rdtgroup_mutex);
- rdtgroup_default.closid = 0; - rdtgroup_default.mon.rmid = 0; + rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID; + rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID; rdtgroup_default.type = RDTCTRL_GROUP; INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list);
@@ -3894,7 +3894,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
if (is_mbm_enabled()) cancel_delayed_work(&d->mbm_over); - if (is_llc_occupancy_enabled() && has_busy_rmid(r, d)) { + if (is_llc_occupancy_enabled() && has_busy_rmid(d)) { /* * When a package is going down, forcefully * decrement rmid->ebusy. There is no way to know @@ -3912,16 +3912,17 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) { + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); size_t tsize;
if (is_llc_occupancy_enabled()) { - d->rmid_busy_llc = bitmap_zalloc(r->num_rmid, GFP_KERNEL); + d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL); if (!d->rmid_busy_llc) return -ENOMEM; } if (is_mbm_total_enabled()) { tsize = sizeof(*d->mbm_total); - d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL); + d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL); if (!d->mbm_total) { bitmap_free(d->rmid_busy_llc); return -ENOMEM; @@ -3929,7 +3930,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) } if (is_mbm_local_enabled()) { tsize = sizeof(*d->mbm_local); - d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL); + d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL); if (!d->mbm_local) { bitmap_free(d->rmid_busy_llc); kfree(d->mbm_total);
From: James Morse james.morse@arm.com
MPAMs RMID values are not unique unless the CLOSID is considered as well.
alloc_rmid() expects the RMID to be an independent number.
Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when allocating. If the CLOSID is not relevant to the index, this ends up comparing the free RMID with itself, and the first free entry will be used. With MPAM the CLOSID is included in the index, so this becomes a walk of the free RMID entries, until one that matches the supplied CLOSID is found.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/internal.h | 2 +- arch/x86/kernel/cpu/resctrl/monitor.c | 43 ++++++++++++++++++----- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +- 4 files changed, 37 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 97ec24f91ac4..2f1d4f141dab 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -545,7 +545,7 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r); int closids_supported(void); void closid_free(int closid); -int alloc_rmid(void); +int alloc_rmid(u32 closid); void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); void __exit rdt_put_mon_l3_config(struct rdt_resource *r); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index d623031def4c..1f371b108a74 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -344,24 +344,49 @@ bool has_busy_rmid(struct rdt_domain *d) return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; }
+static struct rmid_entry *resctrl_find_free_rmid(u32 closid) +{ + struct rmid_entry *itr; + u32 itr_idx, cmp_idx; + + if (list_empty(&rmid_free_lru)) + return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC); + + list_for_each_entry(itr, &rmid_free_lru, list) { + /* + * Get the index of this free RMID, and the index it would need + * to be if it were used with this CLOSID. + * If the CLOSID is irrelevant on this architecture, the two + * index values are always the same on every entry and thus the + * very first entry will be returned. + */ + itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid); + cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid); + + if (itr_idx == cmp_idx) + return itr; + } + + return ERR_PTR(-ENOSPC); +} + /* - * As of now the RMIDs allocation is global. - * However we keep track of which packages the RMIDs - * are used to optimize the limbo list management. + * For MPAM the RMID value is not unique, and has to be considered with + * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which + * allows all domains to be managed by a single free list. + * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler. */ -int alloc_rmid(void) +int alloc_rmid(u32 closid) { struct rmid_entry *entry;
lockdep_assert_held(&rdtgroup_mutex);
- if (list_empty(&rmid_free_lru)) - return rmid_limbo_count ? -EBUSY : -ENOSPC; + entry = resctrl_find_free_rmid(closid); + if (IS_ERR(entry)) + return PTR_ERR(entry);
- entry = list_first_entry(&rmid_free_lru, - struct rmid_entry, list); list_del(&entry->list); - return entry->rmid; }
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 65bee6f11015..d8f44113ed1f 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -777,7 +777,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) int ret;
if (rdt_mon_capable) { - ret = alloc_rmid(); + ret = alloc_rmid(rdtgrp->closid); if (ret < 0) { rdt_last_cmd_puts("Out of RMIDs\n"); return ret; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 351b74670673..12a557c96100 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3300,7 +3300,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) if (!rdt_mon_capable) return 0;
- ret = alloc_rmid(); + ret = alloc_rmid(rdtgrp->closid); if (ret < 0) { rdt_last_cmd_puts("Out of RMIDs\n"); return ret;
From: James Morse james.morse@arm.com
MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be used for different control groups.
This means once a CLOSID is allocated, all its monitoring ids may still be dirty, and held in limbo.
Keep track of the number of RMID held in limbo each CLOSID has. This will allow a future helper to find the 'cleanest' CLOSID when allocating.
The array is only needed when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. This will never be the case on x86.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com --- arch/x86/kernel/cpu/resctrl/monitor.c | 69 +++++++++++++++++++++++---- 1 file changed, 59 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 1f371b108a74..6dfc68c800c8 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -50,6 +50,13 @@ struct rmid_entry { */ static LIST_HEAD(rmid_free_lru);
+/** + * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has. + * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. + * Indexed by CLOSID. Protected by rdtgroup_mutex. + */ +static u32 *closid_num_dirty_rmid; + /* * @rmid_limbo_count - count of currently unused but (potentially) * dirty RMIDs. @@ -292,6 +299,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, return 0; }
+static void limbo_release_entry(struct rmid_entry *entry) +{ + lockdep_assert_held(&rdtgroup_mutex); + + rmid_limbo_count--; + list_add_tail(&entry->list, &rmid_free_lru); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]--; +} + /* * Check the RMIDs that are marked as busy for this domain. If the * reported LLC occupancy is below the threshold clear the busy bit and @@ -328,10 +346,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
if (force_free || !rmid_dirty) { clear_bit(idx, d->rmid_busy_llc); - if (!--entry->busy) { - rmid_limbo_count--; - list_add_tail(&entry->list, &rmid_free_lru); - } + if (!--entry->busy) + limbo_release_entry(entry); } cur_idx = idx + 1; } @@ -398,6 +414,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) u64 val = 0; u32 idx;
+ lockdep_assert_held(&rdtgroup_mutex); + idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0; @@ -423,10 +441,13 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) } put_cpu();
- if (entry->busy) + if (entry->busy) { rmid_limbo_count++; - else + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]++; + } else { list_add_tail(&entry->list, &rmid_free_lru); + } }
void free_rmid(u32 closid, u32 rmid) @@ -792,13 +813,33 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms) static int dom_data_init(struct rdt_resource *r) { u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + u32 num_closid = resctrl_arch_get_num_closid(r); struct rmid_entry *entry = NULL; + int err = 0, i; u32 idx; - int i; + + mutex_lock(&rdtgroup_mutex); + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + u32 *tmp; + + tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL); + if (!tmp) { + err = -ENOMEM; + goto out_unlock; + } + + closid_num_dirty_rmid = tmp; + }
rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); - if (!rmid_ptrs) - return -ENOMEM; + if (!rmid_ptrs) { + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + err = -ENOMEM; + goto out_unlock; + }
for (i = 0; i < idx_limit; i++) { entry = &rmid_ptrs[i]; @@ -818,13 +859,21 @@ static int dom_data_init(struct rdt_resource *r) entry = __rmid_entry(idx); list_del(&entry->list);
- return 0; +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; }
static void __exit dom_data_exit(struct rdt_resource *r) { mutex_lock(&rdtgroup_mutex);
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + kfree(rmid_ptrs); rmid_ptrs = NULL;
From: James Morse james.morse@arm.com
The resctrl CLOSID allocator uses a single 32bit word to track which CLOSID are free. The setting and clearing of bits is open coded.
Convert the existing open coded bit manipulations of closid_free_map to use __set_bit() and friends. These don't need to be atomic as this list is protected by the mutex.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 12a557c96100..f6b52415ca3d 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -111,7 +111,7 @@ void rdt_staged_configs_clear(void) * - Our choices on how to configure each resource become progressively more * limited as the number of resources grows. */ -static int closid_free_map; +static unsigned long closid_free_map; static int closid_free_map_len;
int closids_supported(void) @@ -130,8 +130,8 @@ static void closid_init(void)
closid_free_map = BIT_MASK(rdt_min_closid) - 1;
- /* CLOSID 0 is always reserved for the default group */ - closid_free_map &= ~1; + /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ + __clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map); closid_free_map_len = rdt_min_closid; }
@@ -139,17 +139,21 @@ static int closid_alloc(void) { u32 closid = ffs(closid_free_map);
+ lockdep_assert_held(&rdtgroup_mutex); + if (closid == 0) return -ENOSPC; closid--; - closid_free_map &= ~(1 << closid); + __clear_bit(closid, &closid_free_map);
return closid; }
void closid_free(int closid) { - closid_free_map |= 1 << closid; + lockdep_assert_held(&rdtgroup_mutex); + + __set_bit(closid, &closid_free_map); }
/** @@ -161,7 +165,9 @@ void closid_free(int closid) */ static bool closid_allocated(unsigned int closid) { - return (closid_free_map & (1 << closid)) == 0; + lockdep_assert_held(&rdtgroup_mutex); + + return !test_bit(closid, &closid_free_map); }
/**
From: James Morse james.morse@arm.com
MPAM's PMG bits extend its PARTID space, meaning the same PMG value can be used for different control groups.
This means once a CLOSID is allocated, all its monitoring ids may still be dirty, and held in limbo.
Instead of allocating the first free CLOSID, on architectures where CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is enabled, search closid_num_dirty_rmid[] to find the cleanest CLOSID.
The CLOSID found is returned to closid_alloc() for the free list to be updated.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/internal.h | 2 ++ arch/x86/kernel/cpu/resctrl/monitor.c | 47 +++++++++++++++++++++++++- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 ++++++++--- 3 files changed, 62 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 2f1d4f141dab..521afa016b05 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -568,5 +568,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r); void __init thread_throttle_mode_init(void); void __init mbm_config_rftype_init(const char *config); void rdt_staged_configs_clear(void); +bool closid_allocated(unsigned int closid); +int resctrl_find_cleanest_closid(void);
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */ diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 6dfc68c800c8..fdbef88ff39b 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -50,7 +50,7 @@ struct rmid_entry { */ static LIST_HEAD(rmid_free_lru);
-/** +/* * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has. * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. * Indexed by CLOSID. Protected by rdtgroup_mutex. @@ -386,6 +386,51 @@ static struct rmid_entry *resctrl_find_free_rmid(u32 closid) return ERR_PTR(-ENOSPC); }
+/** + * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated + * RMID are clean, or the CLOSID that has + * the most clean RMID. + * + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID + * may not be able to allocate clean RMID. To avoid this the allocator will + * choose the CLOSID with the most clean RMID. + * + * When the CLOSID and RMID are independent numbers, the first free CLOSID will + * be returned. + */ +int resctrl_find_cleanest_closid(void) +{ + u32 cleanest_closid = ~0; + int i = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + return -EIO; + + for (i = 0; i < closids_supported(); i++) { + int num_dirty; + + if (closid_allocated(i)) + continue; + + num_dirty = closid_num_dirty_rmid[i]; + if (num_dirty == 0) + return i; + + if (cleanest_closid == ~0) + cleanest_closid = i; + + if (num_dirty < closid_num_dirty_rmid[cleanest_closid]) + cleanest_closid = i; + } + + if (cleanest_closid == ~0) + return -ENOSPC; + + return cleanest_closid; +} + /* * For MPAM the RMID value is not unique, and has to be considered with * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index f6b52415ca3d..00f830f729b3 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -137,13 +137,22 @@ static void closid_init(void)
static int closid_alloc(void) { - u32 closid = ffs(closid_free_map); + int cleanest_closid; + u32 closid;
lockdep_assert_held(&rdtgroup_mutex);
- if (closid == 0) - return -ENOSPC; - closid--; + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + cleanest_closid = resctrl_find_cleanest_closid(); + if (cleanest_closid < 0) + return cleanest_closid; + closid = cleanest_closid; + } else { + closid = ffs(closid_free_map); + if (closid == 0) + return -ENOSPC; + closid--; + } __clear_bit(closid, &closid_free_map);
return closid; @@ -163,7 +172,7 @@ void closid_free(int closid) * Return: true if @closid is currently associated with a resource group, * false if @closid is free */ -static bool closid_allocated(unsigned int closid) +bool closid_allocated(unsigned int closid) { lockdep_assert_held(&rdtgroup_mutex);
From: James Morse james.morse@arm.com
When switching tasks, the CLOSID and RMID that the new task should use are stored in struct task_struct. For x86 the CLOSID known by resctrl, the value in task_struct, and the value written to the CPU register are all the same thing.
MPAM's CPU interface has two different PARTID's one for data accesses the other for instruction fetch. Storing resctrl's CLOSID value in struct task_struct implies the arch code knows whether resctrl is using CDP.
Move the matching and setting of the struct task_struct properties to use helpers. This allows arm64 to store the hardware format of the register, instead of having to convert it each time.
__rdtgroup_move_task()s use of READ_ONCE()/WRITE_ONCE() ensures torn values aren't seen as another CPU may schedule the task being moved while the value is being changed. MPAM has an additional corner-case here as the PMG bits extend the PARTID space. If the scheduler sees a new-CLOSID but old-RMID, the task will dirty an RMID that the limbo code is not watching causing an inaccurate count. x86's RMID are independent values, so the limbo code will still be watching the old-RMID in this circumstance. To avoid this, arm64 needs both the CLOSID/RMID WRITE_ONCE()d together. Both values must be provided together.
Because MPAM's RMID values are not unique, the CLOSID must be provided when matching the RMID.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/include/asm/resctrl.h | 18 ++++++++ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 ++++++++++++++++---------- 2 files changed, 56 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index db4c84dde2d5..1d274dbabc44 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -95,6 +95,24 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) return val * scale; }
+static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk, + u32 closid, u32 rmid) +{ + WRITE_ONCE(tsk->closid, closid); + WRITE_ONCE(tsk->rmid, rmid); +} + +static inline bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) +{ + return READ_ONCE(tsk->closid) == closid; +} + +static inline bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 ignored, + u32 rmid) +{ + return READ_ONCE(tsk->rmid) == rmid; +} + static inline void resctrl_sched_in(struct task_struct *tsk) { if (static_branch_likely(&rdt_enable_key)) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 00f830f729b3..460ac501f9e8 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -102,7 +102,7 @@ void rdt_staged_configs_clear(void) * * Using a global CLOSID across all resources has some advantages and * some drawbacks: - * + We can simply set "current->closid" to assign a task to a resource + * + We can simply set current's closid to assign a task to a resource * group. * + Context switch code can avoid extra memory references deciding which * CLOSID to load into the PQR_ASSOC MSR @@ -574,14 +574,26 @@ static void update_task_closid_rmid(struct task_struct *t) _update_task_closid_rmid(t); }
+static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp) +{ + u32 closid, rmid = rdtgrp->mon.rmid; + + if (rdtgrp->type == RDTCTRL_GROUP) + closid = rdtgrp->closid; + else if (rdtgrp->type == RDTMON_GROUP) + closid = rdtgrp->mon.parent->closid; + else + return false; + + return resctrl_arch_match_closid(tsk, closid) && + resctrl_arch_match_rmid(tsk, closid, rmid); +} + static int __rdtgroup_move_task(struct task_struct *tsk, struct rdtgroup *rdtgrp) { /* If the task is already in rdtgrp, no need to move the task. */ - if ((rdtgrp->type == RDTCTRL_GROUP && tsk->closid == rdtgrp->closid && - tsk->rmid == rdtgrp->mon.rmid) || - (rdtgrp->type == RDTMON_GROUP && tsk->rmid == rdtgrp->mon.rmid && - tsk->closid == rdtgrp->mon.parent->closid)) + if (task_in_rdtgroup(tsk, rdtgrp)) return 0;
/* @@ -592,19 +604,19 @@ static int __rdtgroup_move_task(struct task_struct *tsk, * For monitor groups, can move the tasks only from * their parent CTRL group. */ - - if (rdtgrp->type == RDTCTRL_GROUP) { - WRITE_ONCE(tsk->closid, rdtgrp->closid); - WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid); - } else if (rdtgrp->type == RDTMON_GROUP) { - if (rdtgrp->mon.parent->closid == tsk->closid) { - WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid); - } else { - rdt_last_cmd_puts("Can't move task to different control group\n"); - return -EINVAL; - } + if (rdtgrp->type == RDTMON_GROUP && + !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) { + rdt_last_cmd_puts("Can't move task to different control group\n"); + return -EINVAL; }
+ if (rdtgrp->type == RDTMON_GROUP) + resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid, + rdtgrp->mon.rmid); + else + resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, + rdtgrp->mon.rmid); + /* * Ensure the task's closid and rmid are written before determining if * the task is current that will decide if it will be interrupted. @@ -626,14 +638,15 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) { - return (rdt_alloc_capable && - (r->type == RDTCTRL_GROUP) && (t->closid == r->closid)); + return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) && + resctrl_arch_match_closid(t, r->closid)); }
static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) { - return (rdt_mon_capable && - (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid)); + return (rdt_mon_capable && (r->type == RDTMON_GROUP) && + resctrl_arch_match_rmid(t, r->mon.parent->closid, + r->mon.rmid)); }
/** @@ -884,7 +897,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, rdtg->mode != RDT_MODE_EXCLUSIVE) continue;
- if (rdtg->closid != tsk->closid) + if (!resctrl_arch_match_closid(tsk, rdtg->closid)) continue;
seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "", @@ -892,7 +905,8 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, seq_puts(s, "mon:"); list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) { - if (tsk->rmid != crg->mon.rmid) + if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid, + crg->mon.rmid)) continue; seq_printf(s, "%s", crg->kn->name); break; @@ -2825,8 +2839,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, for_each_process_thread(p, t) { if (!from || is_closid_match(t, from) || is_rmid_match(t, from)) { - WRITE_ONCE(t->closid, to->closid); - WRITE_ONCE(t->rmid, to->mon.rmid); + resctrl_arch_set_closid_rmid(t, to->closid, + to->mon.rmid);
/* * Order the closid/rmid stores above before the loads
From: James Morse james.morse@arm.com
The limbo and overflow code picks a CPU to use from the domain's list of online CPUs. Work is then scheduled on these CPUs to maintain the limbo list and any counters that may overflow.
cpumask_any() may pick a CPU that is marked nohz_full, which will either penalise the work that CPU was dedicated to, or delay the processing of limbo list or counters that may overflow. Perhaps indefinitely. Delaying the overflow handling will skew the bandwidth values calculated by mba_sc, which expects to be called once a second.
Add cpumask_any_housekeeping() as a replacement for cpumask_any() that prefers housekeeping CPUs. This helper will still return a nohz_full CPU if that is the only option. The CPU to use is re-evaluated each time the limbo/overflow work runs. This ensures the work will move off a nohz_full CPU once a housekeeping CPU is available.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/internal.h | 24 ++++++++++++++++++++++++ arch/x86/kernel/cpu/resctrl/monitor.c | 20 +++++++++++++------- 2 files changed, 37 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 521afa016b05..33e24fcc8dd0 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -7,6 +7,7 @@ #include <linux/kernfs.h> #include <linux/fs_context.h> #include <linux/jump_label.h> +#include <linux/tick.h>
#include <asm/resctrl.h>
@@ -56,6 +57,29 @@ /* Max event bits supported */ #define MAX_EVT_CONFIG_BITS GENMASK(6, 0)
+/** + * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that + * aren't marked nohz_full + * @mask: The mask to pick a CPU from. + * + * Returns a CPU in @mask. If there are housekeeping CPUs that don't use + * nohz_full, these are preferred. + */ +static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask) +{ + unsigned int cpu, hk_cpu; + + cpu = cpumask_any(mask); + if (!tick_nohz_full_cpu(cpu)) + return cpu; + + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask); + if (hk_cpu < nr_cpu_ids) + cpu = hk_cpu; + + return cpu; +} + struct rdt_fs_context { struct kernfs_fs_context kfc; bool enable_cdpl2; diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index fdbef88ff39b..8737a9d6caef 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -783,7 +783,6 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, void cqm_handle_limbo(struct work_struct *work) { unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); - int cpu = smp_processor_id(); struct rdt_domain *d;
mutex_lock(&rdtgroup_mutex); @@ -792,8 +791,11 @@ void cqm_handle_limbo(struct work_struct *work)
__check_limbo(d, false);
- if (has_busy_rmid(d)) - schedule_delayed_work_on(cpu, &d->cqm_limbo, delay); + if (has_busy_rmid(d)) { + d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask); + schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, + delay); + }
mutex_unlock(&rdtgroup_mutex); } @@ -803,7 +805,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms) unsigned long delay = msecs_to_jiffies(delay_ms); int cpu;
- cpu = cpumask_any(&dom->cpu_mask); + cpu = cpumask_any_housekeeping(&dom->cpu_mask); dom->cqm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); @@ -813,7 +815,6 @@ void mbm_handle_overflow(struct work_struct *work) { unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); struct rdtgroup *prgrp, *crgrp; - int cpu = smp_processor_id(); struct list_head *head; struct rdt_resource *r; struct rdt_domain *d; @@ -837,7 +838,12 @@ void mbm_handle_overflow(struct work_struct *work) update_mba_bw(prgrp, d); }
- schedule_delayed_work_on(cpu, &d->mbm_over, delay); + /* + * Re-check for housekeeping CPUs. This allows the overflow handler to + * move off a nohz_full CPU quickly. + */ + d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask); + schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
out_unlock: mutex_unlock(&rdtgroup_mutex); @@ -850,7 +856,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
if (!static_branch_likely(&rdt_mon_enable_key)) return; - cpu = cpumask_any(&dom->cpu_mask); + cpu = cpumask_any_housekeeping(&dom->cpu_mask); dom->mbm_work_cpu = cpu; schedule_delayed_work_on(cpu, &dom->mbm_over, delay); }
From: James Morse james.morse@arm.com
Intel is blessed with an abundance of monitors, one per RMID, that can be read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC, the number implemented is up to the manufacturer. This means when there are fewer monitors than needed, they need to be allocated and freed.
MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The CSU counter is allowed to return 'not ready' for a small number of micro-seconds after programming. To allow one CSU hardware monitor to be used for multiple control or monitor groups, the CPU accessing the monitor needs to be able to block when configuring and reading the counter.
Worse, the domain may be broken up into slices, and the MMIO accesses for each slice may need performing from different CPUs.
These two details mean MPAMs monitor code needs to be able to sleep, and IPI another CPU in the domain to read from a resource that has been sliced.
mon_event_read() already invokes mon_event_count() via IPI, which means this isn't possible. On systems using nohz-full, some CPUs need to be interrupted to run kernel work as they otherwise stay in user-space running realtime workloads. Interrupting these CPUs should be avoided, and scheduling work on them may never complete.
Change mon_event_read() to pick a housekeeping CPU, (one that is not using nohz_full) and schedule mon_event_count() and wait. If all the CPUs in a domain are using nohz-full, then an IPI is used as the fallback.
This function is only used in response to a user-space filesystem request (not the timing sensitive overflow code).
This allows MPAM to hide the slice behaviour from resctrl, and to keep the monitor-allocation in monitor.c. When the IPI fallback is used on machines where MPAM needs to make an access on multiple CPUs, the counter read will always fail.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Peter Newman peternewman@google.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com --- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 26 +++++++++++++++++++++-- arch/x86/kernel/cpu/resctrl/monitor.c | 2 +- 2 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index beccb0e87ba7..d07f99245851 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -19,6 +19,8 @@ #include <linux/kernfs.h> #include <linux/seq_file.h> #include <linux/slab.h> +#include <linux/tick.h> + #include "internal.h"
/* @@ -522,12 +524,21 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of, return ret; }
+static int smp_mon_event_count(void *arg) +{ + mon_event_count(arg); + + return 0; +} + void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, struct rdt_domain *d, struct rdtgroup *rdtgrp, int evtid, int first) { + int cpu; + /* - * setup the parameters to send to the IPI to read the data. + * Setup the parameters to pass to mon_event_count() to read the data. */ rr->rgrp = rdtgrp; rr->evtid = evtid; @@ -536,7 +547,18 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, rr->val = 0; rr->first = first;
- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); + cpu = cpumask_any_housekeeping(&d->cpu_mask); + + /* + * cpumask_any_housekeeping() prefers housekeeping CPUs, but + * are all the CPUs nohz_full? If yes, pick a CPU to IPI. + * MPAM's resctrl_arch_rmid_read() is unable to read the + * counters on some platforms if its called in irq context. + */ + if (tick_nohz_full_cpu(cpu)) + smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); + else + smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); }
int rdtgroup_mondata_show(struct seq_file *m, void *arg) diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 8737a9d6caef..7e81268137b0 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -588,7 +588,7 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) }
/* - * This is called via IPI to read the CQM/MBM counters + * This is scheduled by mon_event_read() to read the CQM/MBM counters * on a domain. */ void mon_event_count(void *info)
From: James Morse james.morse@arm.com
MPAM's cache occupancy counters can take a little while to settle once the monitor has been configured. The maximum settling time is described to the driver via a firmware table. The value could be large enough that it makes sense to sleep. To avoid exposing this to resctrl, it should be hidden behind MPAM's resctrl_arch_rmid_read().
resctrl_arch_rmid_read() may be called via IPI meaning it is unable to sleep. In this case resctrl_arch_rmid_read() should return an error if it needs to sleep. This will only affect MPAM platforms where the cache occupancy counter isn't available immediately, nohz_full is in use, and there are no housekeeping CPUs in the necessary domain.
There are three callers of resctrl_arch_rmid_read(): __mon_event_count() and __check_limbo() are both called from a non-migrateable context. mon_event_read() invokes __mon_event_count() using smp_call_on_cpu(), which adds work to the target CPUs workqueue. rdtgroup_mutex() is held, meaning this cannot race with the resctrl cpuhp callback. __check_limbo() is invoked via schedule_delayed_work_on() also adds work to a per-cpu workqueue.
The remaining call is add_rmid_to_limbo() which is called in response to a user-space syscall that frees an RMID. This opportunistically reads the LLC occupancy counter on the current domain to see if the RMID is over the dirty threshold. This has to disable preemption to avoid reading the wrong domain's value. Disabling pre-emption here prevents resctrl_arch_rmid_read() from sleeping.
add_rmid_to_limbo() walks each domain, but only reads the counter on one domain. If the system has more than one domain, the RMID will always be added to the limbo list. If the RMIDs usage was not over the threshold, it will be removed from the list when __check_limbo() runs. Make this the default behaviour. Free RMIDs are always added to the limbo list for each domain.
The user visible effect of this is that a clean RMID is not available for re-allocation immediately after 'rmdir()' completes, this behaviour was never portable as it never happened on a machine with multiple domains.
Removing this path allows resctrl_arch_rmid_read() to sleep if its called with interrupts unmasked. Document this is the expected behaviour, and add a might_sleep() annotation to catch changes that won't work on arm64.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com --- arch/x86/kernel/cpu/resctrl/monitor.c | 25 +++++-------------------- include/linux/resctrl.h | 23 ++++++++++++++++++++++- 2 files changed, 27 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 7e81268137b0..2785a2a4ea33 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -277,6 +277,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u64 msr_val, chunks; int ret;
+ resctrl_arch_rmid_read_context_check(); + if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask)) return -EINVAL;
@@ -455,8 +457,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) { struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; struct rdt_domain *d; - int cpu, err; - u64 val = 0; u32 idx;
lockdep_assert_held(&rdtgroup_mutex); @@ -464,17 +464,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0; - cpu = get_cpu(); list_for_each_entry(d, &r->domains, list) { - if (cpumask_test_cpu(cpu, &d->cpu_mask)) { - err = resctrl_arch_rmid_read(r, d, entry->closid, - entry->rmid, - QOS_L3_OCCUP_EVENT_ID, - &val); - if (err || val <= resctrl_rmid_realloc_threshold) - continue; - } - /* * For the first limbo RMID in the domain, * setup up the limbo worker. @@ -484,15 +474,10 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) set_bit(idx, d->rmid_busy_llc); entry->busy++; } - put_cpu();
- if (entry->busy) { - rmid_limbo_count++; - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) - closid_num_dirty_rmid[entry->closid]++; - } else { - list_add_tail(&entry->list, &rmid_free_lru); - } + rmid_limbo_count++; + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]++; }
void free_rmid(u32 closid, u32 rmid) diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index bd4ec22b5a96..8649fc84aac2 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -236,7 +236,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); * @eventid: eventid to read, e.g. L3 occupancy. * @val: result of the counter read in bytes. * - * Call from process context on a CPU that belongs to domain @d. + * Some architectures need to sleep when first programming some of the counters. + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready' + * for a short period of time). Call from a non-migrateable process context on + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or + * schedule_work_on(). This function can be called with interrupts masked, + * e.g. using smp_call_function_any(), but may consistently return an error. * * Return: * 0 on success, or -EIO, -EINVAL etc on error. @@ -245,6 +250,22 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id eventid, u64 *val);
+/** + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts + * + * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when + * resctrl_arch_rmid_read() is called with preemption disabled. + * + * The contract with resctrl_arch_rmid_read() is that if interrupts + * are unmasked, it can sleep. This allows NOHZ_FULL systems to use an + * IPI, (and fail if the call needed to sleep), while most of the time + * the work is scheduled, allowing the call to sleep. + */ +static inline void resctrl_arch_rmid_read_context_check(void) +{ + if (!irqs_disabled()) + might_sleep(); +}
/** * resctrl_arch_reset_rmid() - Reset any private state associated with rmid
From: James Morse james.morse@arm.com
Depending on the number of monitors available, Arm's MPAM may need to allocate a monitor prior to reading the counter value. Allocating a contended resource may involve sleeping.
__check_limbo() and mon_event_count() each make multiple calls to resctrl_arch_rmid_read(), to avoid extra work on contended systems, the allocation should be valid for multiple invocations of resctrl_arch_rmid_read().
The memory or hardware allocated is not specific to a domain.
Add arch hooks for this allocation, which need calling before resctrl_arch_rmid_read(). The allocated monitor is passed to resctrl_arch_rmid_read(), then freed again afterwards. The helper can be called on any CPU, and can sleep.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/include/asm/resctrl.h | 11 +++++++ arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 7 +++++ arch/x86/kernel/cpu/resctrl/internal.h | 1 + arch/x86/kernel/cpu/resctrl/monitor.c | 35 +++++++++++++++++++++-- include/linux/resctrl.h | 5 +++- 5 files changed, 55 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 1d274dbabc44..29c4cc343787 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -136,6 +136,17 @@ static inline u32 resctrl_arch_rmid_idx_encode(u32 ignored, u32 rmid) return rmid; }
+/* x86 can always read an rmid, nothing needs allocating */ +struct rdt_resource; +static inline void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid) +{ + might_sleep(); + return NULL; +}; + +static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, + void *ctx) { }; + void resctrl_cpu_detect(struct cpuinfo_x86 *c);
#else diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index d07f99245851..a033e8e32108 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -546,6 +546,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, rr->d = d; rr->val = 0; rr->first = first; + rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid); + if (IS_ERR(rr->arch_mon_ctx)) { + rr->err = -EINVAL; + return; + }
cpu = cpumask_any_housekeeping(&d->cpu_mask);
@@ -559,6 +564,8 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); else smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); + + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx); }
int rdtgroup_mondata_show(struct seq_file *m, void *arg) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 33e24fcc8dd0..bb07b9b01e36 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -138,6 +138,7 @@ struct rmid_read { bool first; int err; u64 val; + void *arch_mon_ctx; };
extern bool rdt_alloc_capable; diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 2785a2a4ea33..20fe7ed494c2 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -269,7 +269,7 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u32 unused, u32 rmid, enum resctrl_event_id eventid, - u64 *val) + u64 *val, void *ignored) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); @@ -324,9 +324,17 @@ void __check_limbo(struct rdt_domain *d, bool force_free) u32 idx_limit = resctrl_arch_system_num_rmid_idx(); struct rmid_entry *entry; u32 idx, cur_idx = 1; + void *arch_mon_ctx; bool rmid_dirty; u64 val = 0;
+ arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID); + if (IS_ERR(arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(arch_mon_ctx)); + return; + } + /* * Skip RMID 0 and start from RMID 1 and check all the RMIDs that * are marked as busy for occupancy < threshold. If the occupancy @@ -340,7 +348,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
entry = __rmid_entry(idx); if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, - QOS_L3_OCCUP_EVENT_ID, &val)) { + QOS_L3_OCCUP_EVENT_ID, &val, + arch_mon_ctx)) { rmid_dirty = true; } else { rmid_dirty = (val >= resctrl_rmid_realloc_threshold); @@ -353,6 +362,8 @@ void __check_limbo(struct rdt_domain *d, bool force_free) } cur_idx = idx + 1; } + + resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx); }
bool has_busy_rmid(struct rdt_domain *d) @@ -533,7 +544,7 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) }
rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid, - &tval); + &tval, rr->arch_mon_ctx); if (rr->err) return rr->err;
@@ -744,11 +755,27 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, if (is_mbm_total_enabled()) { rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + __mon_event_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); } if (is_mbm_local_enabled()) { rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + __mon_event_count(closid, rmid, &rr);
/* @@ -758,6 +785,8 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, */ if (is_mba_sc(NULL)) mbm_bw_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); } }
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 8649fc84aac2..bf460c912bf5 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -235,6 +235,9 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); * @rmid: rmid of the counter to read. * @eventid: eventid to read, e.g. L3 occupancy. * @val: result of the counter read in bytes. + * @arch_mon_ctx: An architecture specific value from + * resctrl_arch_mon_ctx_alloc(), for MPAM this identifies + * the hardware monitor allocated for this read request. * * Some architectures need to sleep when first programming some of the counters. * (specifically: arm64's MPAM cache occupancy counters can return 'not ready' @@ -248,7 +251,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); */ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id eventid, - u64 *val); + u64 *val, void *arch_mon_ctx);
/** * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
From: James Morse james.morse@arm.com
The rdt_enable_key is switched when resctrl is mounted, and used to prevent a second mount of the filesystem. It also enables the architecture's context switch code.
This requires another architecture to have the same set of static-keys, as resctrl depends on them too. The existing users of these static-keys are implicitly also checking if the filesystem is mounted.
Make the resctrl_mounted checks explicit: resctrl can keep track of whether it has been mounted once. This doesn't need to be combined with whether the arch code is context switching the CLOSID.
rdt_mon_enable_key is never used just to test that resctrl is mounted, but does also have this implication. Add a resctrl_mounted to all uses of rdt_mon_enable_key.
This will allow the static-key changing to be moved behind resctrl_arch_ calls.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/internal.h | 1 + arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++++++++-- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 23 +++++++++++++++++------ 3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index bb07b9b01e36..5ce49503fcd8 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -145,6 +145,7 @@ extern bool rdt_alloc_capable; extern bool rdt_mon_capable; extern unsigned int rdt_mon_features; extern struct list_head resctrl_schema_all; +extern bool resctrl_mounted;
enum rdt_group_type { RDTCTRL_GROUP = 0, diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 20fe7ed494c2..e3e42736fb22 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -835,7 +835,11 @@ void mbm_handle_overflow(struct work_struct *work)
mutex_lock(&rdtgroup_mutex);
- if (!static_branch_likely(&rdt_mon_enable_key)) + /* + * If the filesystem has been unmounted this work no longer needs to + * run. + */ + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key)) goto out_unlock;
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; @@ -868,7 +872,11 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms) unsigned long delay = msecs_to_jiffies(delay_ms); int cpu;
- if (!static_branch_likely(&rdt_mon_enable_key)) + /* + * When a domain comes online there is no guarantee the filesystem is + * mounted. If not, there is no need to catch counter overflow. + */ + if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key)) return; cpu = cpumask_any_housekeeping(&dom->cpu_mask); dom->mbm_work_cpu = cpu; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 460ac501f9e8..646d224c3ac9 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -42,6 +42,9 @@ LIST_HEAD(rdt_all_groups); /* list of entries for the schemata file */ LIST_HEAD(resctrl_schema_all);
+/* The filesystem can only be mounted once. */ +bool resctrl_mounted; + /* Kernel fs node for "info" directory under root */ static struct kernfs_node *kn_info;
@@ -881,7 +884,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, mutex_lock(&rdtgroup_mutex);
/* Return empty if resctrl has not been mounted. */ - if (!static_branch_unlikely(&rdt_enable_key)) { + if (!resctrl_mounted) { seq_puts(s, "res:\nmon:\n"); goto unlock; } @@ -2613,7 +2616,7 @@ static int rdt_get_tree(struct fs_context *fc) /* * resctrl file system can only be mounted once. */ - if (static_branch_unlikely(&rdt_enable_key)) { + if (resctrl_mounted) { ret = -EBUSY; goto out; } @@ -2674,8 +2677,10 @@ static int rdt_get_tree(struct fs_context *fc) if (rdt_mon_capable) static_branch_enable_cpuslocked(&rdt_mon_enable_key);
- if (rdt_alloc_capable || rdt_mon_capable) + if (rdt_alloc_capable || rdt_mon_capable) { static_branch_enable_cpuslocked(&rdt_enable_key); + resctrl_mounted = true; + }
if (is_mbm_enabled()) { r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; @@ -2949,6 +2954,7 @@ static void rdt_kill_sb(struct super_block *sb) static_branch_disable_cpuslocked(&rdt_alloc_enable_key); static_branch_disable_cpuslocked(&rdt_mon_enable_key); static_branch_disable_cpuslocked(&rdt_enable_key); + resctrl_mounted = false; kernfs_kill_sb(sb); mutex_unlock(&rdtgroup_mutex); cpus_read_unlock(); @@ -3918,7 +3924,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) * If resctrl is mounted, remove all the * per domain monitor data directories. */ - if (static_branch_unlikely(&rdt_mon_enable_key)) + if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key)) rmdir_mondata_subdir_allrdtgrp(r, d->id);
if (is_mbm_enabled()) @@ -3995,8 +4001,13 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) if (is_llc_occupancy_enabled()) INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
- /* If resctrl is mounted, add per domain monitor data directories. */ - if (static_branch_unlikely(&rdt_mon_enable_key)) + /* + * If the filesystem is not mounted then only the default resource group + * exists. Creation of its directories is deferred until mount time + * by rdt_get_tree() calling mkdir_mondata_all(). + * If resctrl is mounted, add per domain monitor data directories. + */ + if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key)) mkdir_mondata_subdir_allrdtgrp(r, d);
return 0;
From: James Morse james.morse@arm.com
resctrl enables three static keys depending on the features it has enabled. Another architecture's context switch code may look different, any static keys that control it should be buried behind helpers.
Move the alloc/mon logic into arch-specific helpers as a preparatory step for making the rdt_enable_key's status something the arch code decides.
This means other architectures don't have to mirror the static keys.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/include/asm/resctrl.h | 20 ++++++++++++++++++++ arch/x86/kernel/cpu/resctrl/internal.h | 5 ----- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++---- 3 files changed, 24 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 29c4cc343787..3c9137b6ad4f 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -42,6 +42,26 @@ DECLARE_STATIC_KEY_FALSE(rdt_enable_key); DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+static inline void resctrl_arch_enable_alloc(void) +{ + static_branch_enable_cpuslocked(&rdt_alloc_enable_key); +} + +static inline void resctrl_arch_disable_alloc(void) +{ + static_branch_disable_cpuslocked(&rdt_alloc_enable_key); +} + +static inline void resctrl_arch_enable_mon(void) +{ + static_branch_enable_cpuslocked(&rdt_mon_enable_key); +} + +static inline void resctrl_arch_disable_mon(void) +{ + static_branch_disable_cpuslocked(&rdt_mon_enable_key); +} + /* * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR * diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 5ce49503fcd8..68b9beed8e42 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -95,9 +95,6 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) return container_of(kfc, struct rdt_fs_context, kfc); }
-DECLARE_STATIC_KEY_FALSE(rdt_enable_key); -DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key); - /** * struct mon_evt - Entry in the event list of a resource * @evtid: event id @@ -454,8 +451,6 @@ extern struct mutex rdtgroup_mutex;
extern struct rdt_hw_resource rdt_resources_all[]; extern struct rdtgroup rdtgroup_default; -DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); - extern struct dentry *debugfs_resctrl;
enum resctrl_res_level { diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 646d224c3ac9..8cc5a0937fa8 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2673,9 +2673,9 @@ static int rdt_get_tree(struct fs_context *fc) goto out_psl;
if (rdt_alloc_capable) - static_branch_enable_cpuslocked(&rdt_alloc_enable_key); + resctrl_arch_enable_alloc(); if (rdt_mon_capable) - static_branch_enable_cpuslocked(&rdt_mon_enable_key); + resctrl_arch_enable_mon();
if (rdt_alloc_capable || rdt_mon_capable) { static_branch_enable_cpuslocked(&rdt_enable_key); @@ -2951,8 +2951,8 @@ static void rdt_kill_sb(struct super_block *sb) rdtgroup_default.mode = RDT_MODE_SHAREABLE; schemata_list_destroy(); rdtgroup_destroy_root(); - static_branch_disable_cpuslocked(&rdt_alloc_enable_key); - static_branch_disable_cpuslocked(&rdt_mon_enable_key); + resctrl_arch_disable_alloc(); + resctrl_arch_disable_mon(); static_branch_disable_cpuslocked(&rdt_enable_key); resctrl_mounted = false; kernfs_kill_sb(sb);
From: James Morse james.morse@arm.com
rdt_enable_key is switched when resctrl is mounted. It was also previously used to prevent a second mount of the filesystem.
Any other architecture that wants to support resctrl has to provide identical static keys.
Now that there are helpers for enabling and disabling the alloc/mon keys, resctrl doesn't need to switch this extra key, it can be done by the arch code. Use the static-key increment and decrement helpers, and change resctrl to ensure the calls are balanced.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/include/asm/resctrl.h | 4 ++++ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +++++------ 2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 3c9137b6ad4f..b74aa34dc9e8 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -45,21 +45,25 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key); static inline void resctrl_arch_enable_alloc(void) { static_branch_enable_cpuslocked(&rdt_alloc_enable_key); + static_branch_inc_cpuslocked(&rdt_enable_key); }
static inline void resctrl_arch_disable_alloc(void) { static_branch_disable_cpuslocked(&rdt_alloc_enable_key); + static_branch_dec_cpuslocked(&rdt_enable_key); }
static inline void resctrl_arch_enable_mon(void) { static_branch_enable_cpuslocked(&rdt_mon_enable_key); + static_branch_inc_cpuslocked(&rdt_enable_key); }
static inline void resctrl_arch_disable_mon(void) { static_branch_disable_cpuslocked(&rdt_mon_enable_key); + static_branch_dec_cpuslocked(&rdt_enable_key); }
/* diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 8cc5a0937fa8..74a8494a0c91 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2677,10 +2677,8 @@ static int rdt_get_tree(struct fs_context *fc) if (rdt_mon_capable) resctrl_arch_enable_mon();
- if (rdt_alloc_capable || rdt_mon_capable) { - static_branch_enable_cpuslocked(&rdt_enable_key); + if (rdt_alloc_capable || rdt_mon_capable) resctrl_mounted = true; - }
if (is_mbm_enabled()) { r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; @@ -2951,9 +2949,10 @@ static void rdt_kill_sb(struct super_block *sb) rdtgroup_default.mode = RDT_MODE_SHAREABLE; schemata_list_destroy(); rdtgroup_destroy_root(); - resctrl_arch_disable_alloc(); - resctrl_arch_disable_mon(); - static_branch_disable_cpuslocked(&rdt_enable_key); + if (rdt_alloc_capable) + resctrl_arch_disable_alloc(); + if (rdt_mon_capable) + resctrl_arch_disable_mon(); resctrl_mounted = false; kernfs_kill_sb(sb); mutex_unlock(&rdtgroup_mutex);
From: James Morse james.morse@arm.com
resctrl reads rdt_alloc_capable or rdt_mon_capable to determine whether any of the resources support the corresponding features. resctrl also uses the static-keys that affect the architecture's context-switch code to determine the same thing.
This forces another architecture to have the same static-keys.
As the static-key is enabled based on the capable flag, and none of the filesystem uses of these are in the scheduler path, move the capable flags behind helpers, and use these in the filesystem code instead of the static-key.
After this change, only the architecture code manages and uses the static-keys to ensure __resctrl_sched_in() does not need runtime checks.
This avoids multiple architectures having to define the same static-keys.
Cases where the static-key implicitly tested if the resctrl filesystem was mounted all have an explicit check added by a previous patch.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com --- arch/x86/include/asm/resctrl.h | 13 ++++++++ arch/x86/kernel/cpu/resctrl/internal.h | 2 -- arch/x86/kernel/cpu/resctrl/monitor.c | 4 +-- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 ++-- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 38 +++++++++++------------ 5 files changed, 37 insertions(+), 26 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index b74aa34dc9e8..12dbd2588ca7 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -38,10 +38,18 @@ struct resctrl_pqr_state {
DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
+extern bool rdt_alloc_capable; +extern bool rdt_mon_capable; + DECLARE_STATIC_KEY_FALSE(rdt_enable_key); DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+static inline bool resctrl_arch_alloc_capable(void) +{ + return rdt_alloc_capable; +} + static inline void resctrl_arch_enable_alloc(void) { static_branch_enable_cpuslocked(&rdt_alloc_enable_key); @@ -54,6 +62,11 @@ static inline void resctrl_arch_disable_alloc(void) static_branch_dec_cpuslocked(&rdt_enable_key); }
+static inline bool resctrl_arch_mon_capable(void) +{ + return rdt_mon_capable; +} + static inline void resctrl_arch_enable_mon(void) { static_branch_enable_cpuslocked(&rdt_mon_enable_key); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 68b9beed8e42..c4c1e1909058 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -138,8 +138,6 @@ struct rmid_read { void *arch_mon_ctx; };
-extern bool rdt_alloc_capable; -extern bool rdt_mon_capable; extern unsigned int rdt_mon_features; extern struct list_head resctrl_schema_all; extern bool resctrl_mounted; diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index e3e42736fb22..95020b113ee5 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -839,7 +839,7 @@ void mbm_handle_overflow(struct work_struct *work) * If the filesystem has been unmounted this work no longer needs to * run. */ - if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key)) + if (!resctrl_mounted || !resctrl_arch_mon_capable()) goto out_unlock;
r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; @@ -876,7 +876,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms) * When a domain comes online there is no guarantee the filesystem is * mounted. If not, there is no need to catch counter overflow. */ - if (!resctrl_mounted || !static_branch_likely(&rdt_mon_enable_key)) + if (!resctrl_mounted || !resctrl_arch_mon_capable()) return; cpu = cpumask_any_housekeeping(&dom->cpu_mask); dom->mbm_work_cpu = cpu; diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index d8f44113ed1f..8056bed033cc 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -581,7 +581,7 @@ static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp) if (ret) goto err_cpus;
- if (rdt_mon_capable) { + if (resctrl_arch_mon_capable()) { ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups"); if (ret) goto err_cpus_list; @@ -628,7 +628,7 @@ static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp) if (ret) goto err_cpus;
- if (rdt_mon_capable) { + if (resctrl_arch_mon_capable()) { ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777); if (ret) goto err_cpus_list; @@ -776,7 +776,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) { int ret;
- if (rdt_mon_capable) { + if (resctrl_arch_mon_capable()) { ret = alloc_rmid(rdtgrp->closid); if (ret < 0) { rdt_last_cmd_puts("Out of RMIDs\n"); diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 74a8494a0c91..89aef67588c0 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -641,13 +641,13 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) { - return (rdt_alloc_capable && (r->type == RDTCTRL_GROUP) && + return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) && resctrl_arch_match_closid(t, r->closid)); }
static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) { - return (rdt_mon_capable && (r->type == RDTMON_GROUP) && + return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) && resctrl_arch_match_rmid(t, r->mon.parent->closid, r->mon.rmid)); } @@ -2637,7 +2637,7 @@ static int rdt_get_tree(struct fs_context *fc)
closid_init();
- if (rdt_mon_capable) + if (resctrl_arch_mon_capable()) flags |= RFTYPE_MON;
ret = rdtgroup_add_files(rdtgroup_default.kn, flags); @@ -2650,7 +2650,7 @@ static int rdt_get_tree(struct fs_context *fc) if (ret < 0) goto out_schemata_free;
- if (rdt_mon_capable) { + if (resctrl_arch_mon_capable()) { ret = mongroup_create_dir(rdtgroup_default.kn, &rdtgroup_default, "mon_groups", &kn_mongrp); @@ -2672,12 +2672,12 @@ static int rdt_get_tree(struct fs_context *fc) if (ret < 0) goto out_psl;
- if (rdt_alloc_capable) + if (resctrl_arch_alloc_capable()) resctrl_arch_enable_alloc(); - if (rdt_mon_capable) + if (resctrl_arch_mon_capable()) resctrl_arch_enable_mon();
- if (rdt_alloc_capable || rdt_mon_capable) + if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) resctrl_mounted = true;
if (is_mbm_enabled()) { @@ -2691,10 +2691,10 @@ static int rdt_get_tree(struct fs_context *fc) out_psl: rdt_pseudo_lock_release(); out_mondata: - if (rdt_mon_capable) + if (resctrl_arch_mon_capable()) kernfs_remove(kn_mondata); out_mongrp: - if (rdt_mon_capable) + if (resctrl_arch_mon_capable()) kernfs_remove(kn_mongrp); out_info: kernfs_remove(kn_info); @@ -2949,9 +2949,9 @@ static void rdt_kill_sb(struct super_block *sb) rdtgroup_default.mode = RDT_MODE_SHAREABLE; schemata_list_destroy(); rdtgroup_destroy_root(); - if (rdt_alloc_capable) + if (resctrl_arch_alloc_capable()) resctrl_arch_disable_alloc(); - if (rdt_mon_capable) + if (resctrl_arch_mon_capable()) resctrl_arch_disable_mon(); resctrl_mounted = false; kernfs_kill_sb(sb); @@ -3331,7 +3331,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) { int ret;
- if (!rdt_mon_capable) + if (!resctrl_arch_mon_capable()) return 0;
ret = alloc_rmid(rdtgrp->closid); @@ -3353,7 +3353,7 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) { - if (rdt_mon_capable) + if (resctrl_arch_mon_capable()) free_rmid(rgrp->closid, rgrp->mon.rmid); }
@@ -3417,7 +3417,7 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
if (rtype == RDTCTRL_GROUP) { files = RFTYPE_BASE | RFTYPE_CTRL; - if (rdt_mon_capable) + if (resctrl_arch_mon_capable()) files |= RFTYPE_MON; } else { files = RFTYPE_BASE | RFTYPE_MON; @@ -3526,7 +3526,7 @@ static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn,
list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
- if (rdt_mon_capable) { + if (resctrl_arch_mon_capable()) { /* * Create an empty mon_groups directory to hold the subset * of tasks and cpus to monitor. @@ -3581,14 +3581,14 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, * allocation is supported, add a control and monitoring * subdirectory */ - if (rdt_alloc_capable && parent_kn == rdtgroup_default.kn) + if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn) return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode);
/* * If RDT monitoring is supported and the parent directory is a valid * "mon_groups" directory, add a monitoring subdirectory. */ - if (rdt_mon_capable && is_mon_groups(parent_kn, name)) + if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name)) return rdtgroup_mkdir_mon(parent_kn, name, mode);
return -EPERM; @@ -3923,7 +3923,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) * If resctrl is mounted, remove all the * per domain monitor data directories. */ - if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key)) + if (resctrl_mounted && resctrl_arch_mon_capable()) rmdir_mondata_subdir_allrdtgrp(r, d->id);
if (is_mbm_enabled()) @@ -4006,7 +4006,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) * by rdt_get_tree() calling mkdir_mondata_all(). * If resctrl is mounted, add per domain monitor data directories. */ - if (resctrl_mounted && static_branch_unlikely(&rdt_mon_enable_key)) + if (resctrl_mounted && resctrl_arch_mon_capable()) mkdir_mondata_subdir_allrdtgrp(r, d);
return 0;
From: James Morse james.morse@arm.com
The resctrl architecture specific code may need to create a domain when a CPU comes online, it also needs to reset the CPUs PQR_ASSOC register. The resctrl filesystem code needs to update the rdtgroup_default CPU mask when CPUs are brought online.
Currently this is all done in one function, resctrl_online_cpu(). This will need to be split into architecture and filesystem parts before resctrl can be moved to /fs/.
Pull the rdtgroup_default update work out as a filesystem specific cpu_online helper. resctrl_online_cpu() is the obvious name for this, which means the version in core.c needs renaming.
resctrl_online_cpu() is called by the arch code once it has done the work to add the new CPU to any domains.
In future patches, resctrl_online_cpu() will take the rdtgroup_mutex itself.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/core.c | 8 ++++---- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++++++ include/linux/resctrl.h | 1 + 3 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index ba6daef6bd26..8ac89dbf4484 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -608,16 +608,16 @@ static void clear_closid_rmid(int cpu) RESCTRL_RESERVED_CLOSID); }
-static int resctrl_online_cpu(unsigned int cpu) +static int resctrl_arch_online_cpu(unsigned int cpu) { struct rdt_resource *r;
mutex_lock(&rdtgroup_mutex); for_each_capable_rdt_resource(r) domain_add_cpu(cpu, r); - /* The cpu is set in default rdtgroup after online. */ - cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); clear_closid_rmid(cpu); + + resctrl_online_cpu(cpu); mutex_unlock(&rdtgroup_mutex);
return 0; @@ -969,7 +969,7 @@ static int __init resctrl_late_init(void)
state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/resctrl/cat:online:", - resctrl_online_cpu, resctrl_offline_cpu); + resctrl_arch_online_cpu, resctrl_offline_cpu); if (state < 0) return state;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 89aef67588c0..76aa73c49fad 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4012,6 +4012,14 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) return 0; }
+void resctrl_online_cpu(unsigned int cpu) +{ + lockdep_assert_held(&rdtgroup_mutex); + + /* The CPU is set in default rdtgroup after online. */ + cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); +} + /* * rdtgroup_init - rdtgroup initialization * diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index bf460c912bf5..4c4bad3c34e4 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -223,6 +223,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type); int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d); void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); +void resctrl_online_cpu(unsigned int cpu);
/** * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
From: James Morse james.morse@arm.com
When a CPU is taken offline resctrl may need to move the overflow or limbo handlers to run on a different CPU.
Once the offline callbacks have been split, cqm_setup_limbo_handler() will be called while the CPU that is going offline is still present in the cpu_mask.
Pass the CPU to exclude to cqm_setup_limbo_handler() and mbm_setup_overflow_handler(). These functions can use a variant of cpumask_any_but() when selecting the CPU. -1 is used to indicate no CPUs need excluding.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/core.c | 8 +++-- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 33 ++++++++++++++---- arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++++++++++++++----- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 ++-- include/linux/resctrl.h | 2 ++ 6 files changed, 72 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 8ac89dbf4484..118ac7c28106 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -586,12 +586,16 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r) if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) { if (is_mbm_enabled() && cpu == d->mbm_work_cpu) { cancel_delayed_work(&d->mbm_over); - mbm_setup_overflow_handler(d, 0); + /* + * temporary: exclude_cpu=-1 as this CPU has already + * been removed by cpumask_clear_cpu()d + */ + mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU); } if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu && has_busy_rmid(d)) { cancel_delayed_work(&d->cqm_limbo); - cqm_setup_limbo_handler(d, 0); + cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU); } } } diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index a033e8e32108..64db51455df3 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -552,7 +552,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, return; }
- cpu = cpumask_any_housekeeping(&d->cpu_mask); + cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU);
/* * cpumask_any_housekeeping() prefers housekeeping CPUs, but diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index c4c1e1909058..8091eb59b159 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -61,19 +61,36 @@ * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that * aren't marked nohz_full * @mask: The mask to pick a CPU from. + * @exclude_cpu:The CPU to avoid picking. * - * Returns a CPU in @mask. If there are housekeeping CPUs that don't use - * nohz_full, these are preferred. + * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping + * CPUs that don't use nohz_full, these are preferred. Pass + * RESCTRL_PICK_ANY_CPU to avoid excluding any CPUs. + * + * When a CPU is excluded, returns >= nr_cpu_ids if no CPUs are available. */ -static inline unsigned int cpumask_any_housekeeping(const struct cpumask *mask) +static inline unsigned int +cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu) { unsigned int cpu, hk_cpu;
- cpu = cpumask_any(mask); - if (!tick_nohz_full_cpu(cpu)) + if (exclude_cpu == RESCTRL_PICK_ANY_CPU) + cpu = cpumask_any(mask); + else + cpu = cpumask_any_but(mask, exclude_cpu); + + if (!IS_ENABLED(CONFIG_NO_HZ_FULL)) return cpu;
+ /* If the CPU picked isn't marked nohz_full nothing more needs doing. */ + if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu)) + return cpu; + + /* Try to find a CPU that isn't nohz_full to use in preference */ hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask); + if (hk_cpu == exclude_cpu) + hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask); + if (hk_cpu < nr_cpu_ids) cpu = hk_cpu;
@@ -575,11 +592,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, struct rdt_domain *d, struct rdtgroup *rdtgrp, int evtid, int first); void mbm_setup_overflow_handler(struct rdt_domain *dom, - unsigned long delay_ms); + unsigned long delay_ms, + int exclude_cpu); void mbm_handle_overflow(struct work_struct *work); void __init intel_rdt_mbm_apply_quirk(void); bool is_mba_sc(struct rdt_resource *r); -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms); +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu); void cqm_handle_limbo(struct work_struct *work); bool has_busy_rmid(struct rdt_domain *d); void __check_limbo(struct rdt_domain *d, bool force_free); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 95020b113ee5..c78b51eb1256 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -481,7 +481,8 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) * setup up the limbo worker. */ if (!has_busy_rmid(d)) - cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL); + cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, + RESCTRL_PICK_ANY_CPU); set_bit(idx, d->rmid_busy_llc); entry->busy++; } @@ -806,7 +807,8 @@ void cqm_handle_limbo(struct work_struct *work) __check_limbo(d, false);
if (has_busy_rmid(d)) { - d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask); + d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, delay); } @@ -814,15 +816,25 @@ void cqm_handle_limbo(struct work_struct *work) mutex_unlock(&rdtgroup_mutex); }
-void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms) +/** + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this + * domain. + * @dom: The domain the limbo handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) { unsigned long delay = msecs_to_jiffies(delay_ms); int cpu;
- cpu = cpumask_any_housekeeping(&dom->cpu_mask); + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); dom->cqm_work_cpu = cpu;
- schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); }
void mbm_handle_overflow(struct work_struct *work) @@ -860,14 +872,24 @@ void mbm_handle_overflow(struct work_struct *work) * Re-check for housekeeping CPUs. This allows the overflow handler to * move off a nohz_full CPU quickly. */ - d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask); + d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay);
out_unlock: mutex_unlock(&rdtgroup_mutex); }
-void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms) +/** + * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this + * domain. + * @dom: The domain the overflow handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) { unsigned long delay = msecs_to_jiffies(delay_ms); int cpu; @@ -878,9 +900,11 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms) */ if (!resctrl_mounted || !resctrl_arch_mon_capable()) return; - cpu = cpumask_any_housekeeping(&dom->cpu_mask); + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); dom->mbm_work_cpu = cpu; - schedule_delayed_work_on(cpu, &dom->mbm_over, delay); + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->mbm_over, delay); }
static int dom_data_init(struct rdt_resource *r) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 76aa73c49fad..fc1b913e54bd 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2683,7 +2683,8 @@ static int rdt_get_tree(struct fs_context *fc) if (is_mbm_enabled()) { r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; list_for_each_entry(dom, &r->domains, list) - mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL); + mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); }
goto out; @@ -3994,7 +3995,8 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
if (is_mbm_enabled()) { INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); - mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL); + mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); }
if (is_llc_occupancy_enabled()) diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 4c4bad3c34e4..ccbbbe5d18d3 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -10,6 +10,8 @@ #define RESCTRL_RESERVED_CLOSID 0 #define RESCTRL_RESERVED_RMID 0
+#define RESCTRL_PICK_ANY_CPU -1 + #ifdef CONFIG_PROC_CPU_RESCTRL
int proc_resctrl_show(struct seq_file *m,
From: James Morse james.morse@arm.com
The resctrl architecture specific code may need to free a domain when a CPU goes offline, it also needs to reset the CPUs PQR_ASSOC register. Amongst other things, the resctrl filesystem code needs to clear this CPU from the cpu_mask of any control and monitor groups.
Currently this is all done in core.c and called from resctrl_offline_cpu(), making the split between architecture and filesystem code unclear.
Move the filesystem work to remove the CPU from the control and monitor groups into a filesystem helper called resctrl_offline_cpu(), and rename the one in core.c resctrl_arch_offline_cpu().
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/core.c | 25 +++++-------------------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 24 ++++++++++++++++++++++++ include/linux/resctrl.h | 1 + 3 files changed, 30 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 118ac7c28106..65540acd59de 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -627,31 +627,15 @@ static int resctrl_arch_online_cpu(unsigned int cpu) return 0; }
-static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) +static int resctrl_arch_offline_cpu(unsigned int cpu) { - struct rdtgroup *cr; - - list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) { - if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) { - break; - } - } -} - -static int resctrl_offline_cpu(unsigned int cpu) -{ - struct rdtgroup *rdtgrp; struct rdt_resource *r;
mutex_lock(&rdtgroup_mutex); + resctrl_offline_cpu(cpu); + for_each_capable_rdt_resource(r) domain_remove_cpu(cpu, r); - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { - if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { - clear_childcpus(rdtgrp, cpu); - break; - } - } clear_closid_rmid(cpu); mutex_unlock(&rdtgroup_mutex);
@@ -973,7 +957,8 @@ static int __init resctrl_late_init(void)
state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/resctrl/cat:online:", - resctrl_arch_online_cpu, resctrl_offline_cpu); + resctrl_arch_online_cpu, + resctrl_arch_offline_cpu); if (state < 0) return state;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index fc1b913e54bd..958f73974860 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4022,6 +4022,30 @@ void resctrl_online_cpu(unsigned int cpu) cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); }
+static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) +{ + struct rdtgroup *cr; + + list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) { + if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) + break; + } +} + +void resctrl_offline_cpu(unsigned int cpu) +{ + struct rdtgroup *rdtgrp; + + lockdep_assert_held(&rdtgroup_mutex); + + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { + clear_childcpus(rdtgrp, cpu); + break; + } + } +} + /* * rdtgroup_init - rdtgroup initialization * diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index ccbbbe5d18d3..270ff1d5c051 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -226,6 +226,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d); void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d); void resctrl_online_cpu(unsigned int cpu); +void resctrl_offline_cpu(unsigned int cpu);
/** * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
From: James Morse james.morse@arm.com
When a CPU is taken offline the resctrl filesystem code needs to check if it was the CPU nominated to perform the periodic overflow and limbo work. If so, another CPU needs to be chosen to do this work.
This is currently done in core.c, mixed in with the code that removes the CPU from the domain's mask, and potentially free()s the domain.
Move the migration of the overflow and limbo helpers into the filesystem code, into resctrl_offline_cpu(). As resctrl_offline_cpu() runs before the architecture code has removed the CPU from the domain mask, the callers need to be told which CPU is being removed, to avoid picking it as the new CPU. This uses the exclude_cpu feature previously added.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/core.c | 16 ---------------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 18 ++++++++++++++++++ 2 files changed, 18 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 65540acd59de..c943cb78b2a1 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -582,22 +582,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
return; } - - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) { - if (is_mbm_enabled() && cpu == d->mbm_work_cpu) { - cancel_delayed_work(&d->mbm_over); - /* - * temporary: exclude_cpu=-1 as this CPU has already - * been removed by cpumask_clear_cpu()d - */ - mbm_setup_overflow_handler(d, 0, RESCTRL_PICK_ANY_CPU); - } - if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu && - has_busy_rmid(d)) { - cancel_delayed_work(&d->cqm_limbo); - cqm_setup_limbo_handler(d, 0, RESCTRL_PICK_ANY_CPU); - } - } }
static void clear_closid_rmid(int cpu) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 958f73974860..f007c488fa31 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4034,7 +4034,9 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
void resctrl_offline_cpu(unsigned int cpu) { + struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; struct rdtgroup *rdtgrp; + struct rdt_domain *d;
lockdep_assert_held(&rdtgroup_mutex);
@@ -4044,6 +4046,22 @@ void resctrl_offline_cpu(unsigned int cpu) break; } } + + if (!l3->mon_capable) + return; + + d = get_domain_from_cpu(cpu, l3); + if (d) { + if (is_mbm_enabled() && cpu == d->mbm_work_cpu) { + cancel_delayed_work(&d->mbm_over); + mbm_setup_overflow_handler(d, 0, cpu); + } + if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu && + has_busy_rmid(d)) { + cancel_delayed_work(&d->cqm_limbo); + cqm_setup_limbo_handler(d, 0, cpu); + } + } }
/*
From: James Morse james.morse@arm.com
resctrl has one mutex that is taken by the architecture specific code, and the filesystem parts. The two interact via cpuhp, where the architecture code updates the domain list. Filesystem handlers that walk the domains list should not run concurrently with the cpuhp callback modifying the list.
Exposing a lock from the filesystem code means the interface is not cleanly defined, and creates the possibility of cross-architecture lock ordering headaches. The interaction only exists so that certain filesystem paths are serialised against CPU hotplug. The CPU hotplug code already has a mechanism to do this using cpus_read_lock().
MPAM's monitors have an overflow interrupt, so it needs to be possible to walk the domains list in irq context. RCU is ideal for this, but some paths need to be able to sleep to allocate memory.
Because resctrl_{on,off}line_cpu() take the rdtgroup_mutex as part of a cpuhp callback, cpus_read_lock() must always be taken first. rdtgroup_schemata_write() already does this.
Most of the filesystem code's domain list walkers are currently protected by the rdtgroup_mutex taken in rdtgroup_kn_lock_live(). The exceptions are rdt_bit_usage_show() and the mon_config helpers which take the lock directly.
Make the domain list protected by RCU. An architecture-specific lock prevents concurrent writers. rdt_bit_usage_show() could walk the domain list using RCU, but to keep all the filesystem operations the same, this is changed to call cpus_read_lock(). The mon_config helpers send multiple IPIs, take the cpus_read_lock() in these cases.
The other filesystem list walkers need to be able to sleep. Add cpus_read_lock() to rdtgroup_kn_lock_live() so that the cpuhp callbacks can't be invoked when file system operations are occurring.
Add lockdep_assert_cpus_held() in the cases where the rdtgroup_kn_lock_live() call isn't obvious.
Resctrl's domain online/offline calls now need to take the rdtgroup_mutex themselves.
Signed-off-by: James Morse james.morse@arm.com Tested-by: Shaopeng Tan tan.shaopeng@fujitsu.com Tested-by: Peter Newman peternewman@google.com Tested-by: Babu Moger babu.moger@amd.com Reviewed-by: Shaopeng Tan tan.shaopeng@fujitsu.com Reviewed-by: Reinette Chatre reinette.chatre@intel.com Reviewed-by: Babu Moger babu.moger@amd.com --- arch/x86/kernel/cpu/resctrl/core.c | 44 +++++++++++---- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 15 ++++- arch/x86/kernel/cpu/resctrl/monitor.c | 8 +++ arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 3 + arch/x86/kernel/cpu/resctrl/rdtgroup.c | 68 ++++++++++++++++++----- include/linux/resctrl.h | 2 +- 6 files changed, 112 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index c943cb78b2a1..4a35369e2d3c 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -16,6 +16,7 @@
#define pr_fmt(fmt) "resctrl: " fmt
+#include <linux/cpu.h> #include <linux/slab.h> #include <linux/err.h> #include <linux/cacheinfo.h> @@ -25,8 +26,15 @@ #include <asm/resctrl.h> #include "internal.h"
-/* Mutex to protect rdtgroup access. */ -DEFINE_MUTEX(rdtgroup_mutex); +/* + * rdt_domain structures are kfree()d when their last CPU goes offline, + * and allocated when the first CPU in a new domain comes online. + * The rdt_resource's domain list is updated when this happens. Readers of + * the domain list must either take cpus_read_lock(), or rely on an RCU + * read-side critical section, to avoid observing concurrent modification. + * All writers take this mutex: + */ +static DEFINE_MUTEX(domain_list_lock);
/* * The cached resctrl_pqr_state is strictly per CPU and can never be @@ -356,6 +364,15 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) { struct rdt_domain *d;
+ /* + * Walking r->domains, ensure it can't race with cpuhp. + * Because this is called via IPI by rdt_ctrl_update(), assertions + * about locks this thread holds will lead to false positives. Check + * someone is holding the CPUs lock. + */ + if (IS_ENABLED(CONFIG_LOCKDEP)) + lockdep_is_cpus_held(); + list_for_each_entry(d, &r->domains, list) { /* Find the domain that contains this CPU */ if (cpumask_test_cpu(cpu, &d->cpu_mask)) @@ -512,6 +529,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) struct rdt_domain *d; int err;
+ lockdep_assert_held(&domain_list_lock); + d = rdt_find_domain(r, id, &add_pos); if (IS_ERR(d)) { pr_warn("Couldn't find cache id for CPU %d\n", cpu); @@ -545,11 +564,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) return; }
- list_add_tail(&d->list, add_pos); + list_add_tail_rcu(&d->list, add_pos);
err = resctrl_online_domain(r, d); if (err) { - list_del(&d->list); + list_del_rcu(&d->list); + synchronize_rcu(); domain_free(hw_dom); } } @@ -560,6 +580,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r) struct rdt_hw_domain *hw_dom; struct rdt_domain *d;
+ lockdep_assert_held(&domain_list_lock); + d = rdt_find_domain(r, id, NULL); if (IS_ERR_OR_NULL(d)) { pr_warn("Couldn't find cache id for CPU %d\n", cpu); @@ -570,7 +592,8 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r) cpumask_clear_cpu(cpu, &d->cpu_mask); if (cpumask_empty(&d->cpu_mask)) { resctrl_offline_domain(r, d); - list_del(&d->list); + list_del_rcu(&d->list); + synchronize_rcu();
/* * rdt_domain "d" is going to be freed below, so clear @@ -600,13 +623,13 @@ static int resctrl_arch_online_cpu(unsigned int cpu) { struct rdt_resource *r;
- mutex_lock(&rdtgroup_mutex); + mutex_lock(&domain_list_lock); for_each_capable_rdt_resource(r) domain_add_cpu(cpu, r); - clear_closid_rmid(cpu); + mutex_unlock(&domain_list_lock);
+ clear_closid_rmid(cpu); resctrl_online_cpu(cpu); - mutex_unlock(&rdtgroup_mutex);
return 0; } @@ -615,13 +638,14 @@ static int resctrl_arch_offline_cpu(unsigned int cpu) { struct rdt_resource *r;
- mutex_lock(&rdtgroup_mutex); resctrl_offline_cpu(cpu);
+ mutex_lock(&domain_list_lock); for_each_capable_rdt_resource(r) domain_remove_cpu(cpu, r); + mutex_unlock(&domain_list_lock); + clear_closid_rmid(cpu); - mutex_unlock(&rdtgroup_mutex);
return 0; } diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 64db51455df3..dc59643498bf 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -212,6 +212,9 @@ static int parse_line(char *line, struct resctrl_schema *s, struct rdt_domain *d; unsigned long dom_id;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) { rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n"); @@ -316,6 +319,9 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) struct rdt_domain *d; u32 idx;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) return -ENOMEM;
@@ -381,11 +387,9 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, return -EINVAL; buf[nbytes - 1] = '\0';
- cpus_read_lock(); rdtgrp = rdtgroup_kn_lock_live(of->kn); if (!rdtgrp) { rdtgroup_kn_unlock(of->kn); - cpus_read_unlock(); return -ENOENT; } rdt_last_cmd_clear(); @@ -447,7 +451,6 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, out: rdt_staged_configs_clear(); rdtgroup_kn_unlock(of->kn); - cpus_read_unlock(); return ret ?: nbytes; }
@@ -467,6 +470,9 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo bool sep = false; u32 ctrl_val;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + seq_printf(s, "%*s:", max_name_width, schema->name); list_for_each_entry(dom, &r->domains, list) { if (sep) @@ -537,6 +543,9 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, { int cpu;
+ /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + /* * Setup the parameters to pass to mon_event_count() to read the data. */ diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index c78b51eb1256..971ca45fc4e6 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -15,6 +15,7 @@ * Software Developer Manual June 2016, volume 3, section 17.17. */
+#include <linux/cpu.h> #include <linux/module.h> #include <linux/sizes.h> #include <linux/slab.h> @@ -472,6 +473,9 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
lockdep_assert_held(&rdtgroup_mutex);
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0; @@ -800,6 +804,7 @@ void cqm_handle_limbo(struct work_struct *work) unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); struct rdt_domain *d;
+ cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
d = container_of(work, struct rdt_domain, cqm_limbo.work); @@ -814,6 +819,7 @@ void cqm_handle_limbo(struct work_struct *work) }
mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); }
/** @@ -845,6 +851,7 @@ void mbm_handle_overflow(struct work_struct *work) struct rdt_resource *r; struct rdt_domain *d;
+ cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
/* @@ -878,6 +885,7 @@ void mbm_handle_overflow(struct work_struct *work)
out_unlock: mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); }
/** diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 8056bed033cc..884b88e25141 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -844,6 +844,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) struct rdt_domain *d_i; bool ret = false;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) return true;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index f007c488fa31..39b807a28a55 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -35,6 +35,10 @@ DEFINE_STATIC_KEY_FALSE(rdt_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key); + +/* Mutex to protect rdtgroup access. */ +DEFINE_MUTEX(rdtgroup_mutex); + static struct kernfs_root *rdt_root; struct rdtgroup rdtgroup_default; LIST_HEAD(rdt_all_groups); @@ -1014,6 +1018,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of, bool sep = false; u32 ctrl_val;
+ cpus_read_lock(); mutex_lock(&rdtgroup_mutex); hw_shareable = r->cache.shareable_bits; list_for_each_entry(dom, &r->domains, list) { @@ -1074,6 +1079,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of, } seq_putc(seq, '\n'); mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); return 0; }
@@ -1329,6 +1335,9 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) struct rdt_domain *d; u32 ctrl;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + list_for_each_entry(s, &resctrl_schema_all, list) { r = s->res; if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) @@ -1593,6 +1602,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid struct rdt_domain *dom; bool sep = false;
+ cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
list_for_each_entry(dom, &r->domains, list) { @@ -1609,6 +1619,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid seq_puts(s, "\n");
mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock();
return 0; } @@ -1700,6 +1711,9 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) struct rdt_domain *d; int ret = 0;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + next: if (!tok || tok[0] == '\0') return 0; @@ -1741,6 +1755,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, if (nbytes == 0 || buf[nbytes - 1] != '\n') return -EINVAL;
+ cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
rdt_last_cmd_clear(); @@ -1750,6 +1765,7 @@ static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID);
mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock();
return ret ?: nbytes; } @@ -1765,6 +1781,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, if (nbytes == 0 || buf[nbytes - 1] != '\n') return -EINVAL;
+ cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
rdt_last_cmd_clear(); @@ -1774,6 +1791,7 @@ static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID);
mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock();
return ret ?: nbytes; } @@ -2250,6 +2268,9 @@ static int set_cache_qos_cfg(int level, bool enable) struct rdt_domain *d; int cpu;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + if (level == RDT_RESOURCE_L3) update = l3_qos_cfg_update; else if (level == RDT_RESOURCE_L2) @@ -2449,6 +2470,7 @@ struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn)
rdtgroup_kn_get(rdtgrp, kn);
+ cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
/* Was this group deleted while we waited? */ @@ -2466,6 +2488,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn) return;
mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + rdtgroup_kn_put(rdtgrp, kn); }
@@ -2798,6 +2822,9 @@ static int reset_all_ctrls(struct rdt_resource *r) struct rdt_domain *d; int i;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) return -ENOMEM;
@@ -3082,6 +3109,9 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, struct rdt_domain *dom; int ret;
+ /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + list_for_each_entry(dom, &r->domains, list) { ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); if (ret) @@ -3912,13 +3942,13 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) { - lockdep_assert_held(&rdtgroup_mutex); + mutex_lock(&rdtgroup_mutex);
if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) mba_sc_domain_destroy(r, d);
if (!r->mon_capable) - return; + goto out_unlock;
/* * If resctrl is mounted, remove all the @@ -3943,6 +3973,9 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) }
domain_destroy_mon_state(d); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); }
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) @@ -3978,20 +4011,22 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) { - int err; + int err = 0;
- lockdep_assert_held(&rdtgroup_mutex); + mutex_lock(&rdtgroup_mutex);
- if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) + if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) { /* RDT_RESOURCE_MBA is never mon_capable */ - return mba_sc_domain_allocate(r, d); + err = mba_sc_domain_allocate(r, d); + goto out_unlock; + }
if (!r->mon_capable) - return 0; + goto out_unlock;
err = domain_setup_mon_state(r, d); if (err) - return err; + goto out_unlock;
if (is_mbm_enabled()) { INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); @@ -4011,15 +4046,18 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) if (resctrl_mounted && resctrl_arch_mon_capable()) mkdir_mondata_subdir_allrdtgrp(r, d);
- return 0; +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; }
void resctrl_online_cpu(unsigned int cpu) { - lockdep_assert_held(&rdtgroup_mutex); - + mutex_lock(&rdtgroup_mutex); /* The CPU is set in default rdtgroup after online. */ cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); + mutex_unlock(&rdtgroup_mutex); }
static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) @@ -4038,8 +4076,7 @@ void resctrl_offline_cpu(unsigned int cpu) struct rdtgroup *rdtgrp; struct rdt_domain *d;
- lockdep_assert_held(&rdtgroup_mutex); - + mutex_lock(&rdtgroup_mutex); list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { clear_childcpus(rdtgrp, cpu); @@ -4048,7 +4085,7 @@ void resctrl_offline_cpu(unsigned int cpu) }
if (!l3->mon_capable) - return; + goto out_unlock;
d = get_domain_from_cpu(cpu, l3); if (d) { @@ -4062,6 +4099,9 @@ void resctrl_offline_cpu(unsigned int cpu) cqm_setup_limbo_handler(d, 0, cpu); } } + +out_unlock: + mutex_unlock(&rdtgroup_mutex); }
/* diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 270ff1d5c051..a365f67131ec 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -159,7 +159,7 @@ struct resctrl_schema; * @cache_level: Which cache level defines scope of this resource * @cache: Cache allocation related data * @membw: If the component has bandwidth controls, their properties. - * @domains: All domains for this resource + * @domains: RCU list of all domains for this resource * @name: Name to use in "schemata" file. * @data_width: Character width of data when displaying * @default_ctrl: Specifies default cache cbm or memory B/W percent.
From: James Morse james.morse@arm.com
Resctrl occasionally wants to know something about a specific resource. It reaches into the arch code's rdt_resources_all.
Once the filesystem parts of resctrl are moved to /fs/, this means it will need visibility of the architecture specific structure.
Move the level enum to the resctrl header and add a helper to retrieve the resctrl struct by id. resctrl_arch_get_resource() should not return NULL for any value in the enum, it may instead return a dummy resource that is !alloc_enabled && !mon_enabled.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++-- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 10 ---------- arch/x86/kernel/cpu/resctrl/monitor.c | 8 ++++---- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 +++++++-------- include/linux/resctrl.h | 17 +++++++++++++++++ 6 files changed, 39 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 4a35369e2d3c..2e84b6547214 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -122,6 +122,14 @@ struct rdt_hw_resource rdt_resources_all[] = { }, };
+struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) +{ + if (l >= RDT_NUM_RESOURCES) + return NULL; + + return &rdt_resources_all[l].r_resctrl; +} + /* * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs * as they do not have CPUID enumeration support for Cache allocation. @@ -169,7 +177,7 @@ static inline void cache_alloc_hsw_probe(void) bool is_mba_sc(struct rdt_resource *r) { if (!r) - return rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl.membw.mba_sc; + r = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
/* * The software controller support is only applicable to MBA resource. @@ -990,7 +998,7 @@ late_initcall(resctrl_late_init);
static void __exit resctrl_exit(void) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
cpuhp_remove_state(rdt_online);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index dc59643498bf..f12712766924 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -599,7 +599,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) domid = md.u.domid; evtid = md.u.evtid;
- r = &rdt_resources_all[resid].r_resctrl; + r = resctrl_arch_get_resource(resid); d = rdt_find_domain(r, domid, NULL); if (IS_ERR_OR_NULL(d)) { ret = -ENOENT; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 8091eb59b159..3e0246238729 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -468,16 +468,6 @@ extern struct rdt_hw_resource rdt_resources_all[]; extern struct rdtgroup rdtgroup_default; extern struct dentry *debugfs_resctrl;
-enum resctrl_res_level { - RDT_RESOURCE_L3, - RDT_RESOURCE_L2, - RDT_RESOURCE_MBA, - RDT_RESOURCE_SMBA, - - /* Must be the last */ - RDT_NUM_RESOURCES, -}; - static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 971ca45fc4e6..79a6d1b8367b 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -321,7 +321,7 @@ static void limbo_release_entry(struct rmid_entry *entry) */ void __check_limbo(struct rdt_domain *d, bool force_free) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); u32 idx_limit = resctrl_arch_system_num_rmid_idx(); struct rmid_entry *entry; u32 idx, cur_idx = 1; @@ -467,7 +467,7 @@ int alloc_rmid(u32 closid)
static void add_rmid_to_limbo(struct rmid_entry *entry) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); struct rdt_domain *d; u32 idx;
@@ -672,7 +672,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) if (!is_mbm_local_enabled()) return;
- r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
closid = rgrp->closid; rmid = rgrp->mon.rmid; @@ -861,7 +861,7 @@ void mbm_handle_overflow(struct work_struct *work) if (!resctrl_mounted || !resctrl_arch_mon_capable()) goto out_unlock;
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + r = resctrl_arch_get_resource(RDT_RESOURCE_L3); d = container_of(work, struct rdt_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 39b807a28a55..76e32529475b 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2257,7 +2257,7 @@ static void l2_qos_cfg_update(void *arg)
static inline bool is_mba_linear(void) { - return rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl.membw.delay_linear; + return resctrl_arch_get_resource(RDT_RESOURCE_MBA)->membw.delay_linear; }
static int set_cache_qos_cfg(int level, bool enable) @@ -2345,7 +2345,7 @@ static void mba_sc_domain_destroy(struct rdt_resource *r, */ static bool supports_mba_mbps(void) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
return (is_mbm_local_enabled() && r->alloc_capable && is_mba_linear()); @@ -2357,7 +2357,7 @@ static bool supports_mba_mbps(void) */ static int set_mba_sc(bool mba_sc) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); u32 num_closid = resctrl_arch_get_num_closid(r); struct rdt_domain *d; int i; @@ -2629,10 +2629,10 @@ static void schemata_list_destroy(void)
static int rdt_get_tree(struct fs_context *fc) { + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); struct rdt_fs_context *ctx = rdt_fc2context(fc); unsigned long flags = RFTYPE_CTRL_BASE; struct rdt_domain *dom; - struct rdt_resource *r; int ret;
cpus_read_lock(); @@ -2705,8 +2705,7 @@ static int rdt_get_tree(struct fs_context *fc) resctrl_mounted = true;
if (is_mbm_enabled()) { - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; - list_for_each_entry(dom, &r->domains, list) + list_for_each_entry(dom, &l3->domains, list) mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); } @@ -3882,7 +3881,7 @@ static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) seq_puts(seq, ",cdpl2");
- if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl)) + if (is_mba_sc(resctrl_arch_get_resource(RDT_RESOURCE_MBA))) seq_puts(seq, ",mba_MBps");
if (resctrl_debug) @@ -4072,7 +4071,7 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
void resctrl_offline_cpu(unsigned int cpu) { - struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); struct rdtgroup *rdtgrp; struct rdt_domain *d;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index a365f67131ec..168cc9510069 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -36,6 +36,16 @@ enum resctrl_conf_type { CDP_DATA, };
+enum resctrl_res_level { + RDT_RESOURCE_L3, + RDT_RESOURCE_L2, + RDT_RESOURCE_MBA, + RDT_RESOURCE_SMBA, + + /* Must be the last */ + RDT_NUM_RESOURCES, +}; + #define CDP_NUM_TYPES (CDP_DATA + 1)
/* @@ -190,6 +200,13 @@ struct rdt_resource { bool cdp_capable; };
+/* + * Get the resource that exists at this level. If the level is not supported + * a dummy/not-capable resource can be returned. Levels >= RDT_NUM_RESOURCES + * will return NULL. + */ +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l); + /** * struct resctrl_schema - configuration abilities of a resource presented to * user-space
From: James Morse james.morse@arm.com
How to parse the ctrlval strings from user-space is specified by a function pointer the arch code specifies.
The strings are part of resctrl's ABI, and the functions and their caller both live in the same file. Exporting the parsing functions and allowing the architecture to choose allows another architecture to get this wrong.
Keep this all in the flesystem parts of resctrl. This should prevent any architecture's string-parsing behaviour from varying without core code changes. Use the fflags to spot caches and bandwidth resources, and use the appropriate helper.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 4 ---- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 +++++++++++++++++++---- arch/x86/kernel/cpu/resctrl/internal.h | 10 -------- include/linux/resctrl.h | 7 ------ 4 files changed, 23 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 2e84b6547214..92d6e811fd3f 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -75,7 +75,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "L3", .cache_level = 3, .domains = domain_init(RDT_RESOURCE_L3), - .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, @@ -89,7 +88,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "L2", .cache_level = 2, .domains = domain_init(RDT_RESOURCE_L2), - .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, @@ -103,7 +101,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "MB", .cache_level = 3, .domains = domain_init(RDT_RESOURCE_MBA), - .parse_ctrlval = parse_bw, .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, }, @@ -115,7 +112,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "SMBA", .cache_level = 3, .domains = domain_init(RDT_RESOURCE_SMBA), - .parse_ctrlval = parse_bw, .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, }, diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index f12712766924..c7cc850311ef 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -23,6 +23,15 @@
#include "internal.h"
+struct rdt_parse_data { + struct rdtgroup *rdtgrp; + char *buf; +}; + +typedef int (ctrlval_parser_t)(struct rdt_parse_data *data, + struct resctrl_schema *s, + struct rdt_domain *d); + /* * Check whether MBA bandwidth percentage value is correct. The value is * checked against the minimum and max bandwidth values specified by the @@ -59,8 +68,8 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r) return true; }
-int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) +static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) { struct resctrl_staged_config *cfg; u32 closid = data->rdtgrp->closid; @@ -138,8 +147,8 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) * Read one cache bit mask (hex). Check that it is valid for the current * resource type. */ -int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) +static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) { struct rdtgroup *rdtgrp = data->rdtgrp; struct resctrl_staged_config *cfg; @@ -195,6 +204,14 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, return 0; }
+static ctrlval_parser_t *get_parser(struct rdt_resource *res) +{ + if (res->fflags & RFTYPE_RES_CACHE) + return &parse_cbm; + else + return &parse_bw; +} + /* * For each domain in this resource we expect to find a series of: * id=mask @@ -204,6 +221,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, static int parse_line(char *line, struct resctrl_schema *s, struct rdtgroup *rdtgrp) { + ctrlval_parser_t *parse_ctrlval = get_parser(s->res); enum resctrl_conf_type t = s->conf_type; struct resctrl_staged_config *cfg; struct rdt_resource *r = s->res; @@ -235,7 +253,7 @@ static int parse_line(char *line, struct resctrl_schema *s, if (d->id == dom_id) { data.buf = dom; data.rdtgrp = rdtgrp; - if (r->parse_ctrlval(&data, s, d)) + if (parse_ctrlval(&data, s, d)) return -EINVAL; if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { cfg = &d->staged_config[t]; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 3e0246238729..ec54aea622f1 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -418,11 +418,6 @@ static inline bool is_mbm_event(int e) e <= QOS_L3_MBM_LOCAL_EVENT_ID); }
-struct rdt_parse_data { - struct rdtgroup *rdtgrp; - char *buf; -}; - /** * struct rdt_hw_resource - arch private attributes of a resctrl resource * @r_resctrl: Attributes of the resource used directly by resctrl. @@ -457,11 +452,6 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r return container_of(r, struct rdt_hw_resource, r_resctrl); }
-int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d); -int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d); - extern struct mutex rdtgroup_mutex;
extern struct rdt_hw_resource rdt_resources_all[]; diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 168cc9510069..6e87bc95f5ea 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -157,9 +157,6 @@ struct resctrl_membw { u32 *mb_map; };
-struct rdt_parse_data; -struct resctrl_schema; - /** * struct rdt_resource - attributes of a resctrl resource * @rid: The index of the resource @@ -174,7 +171,6 @@ struct resctrl_schema; * @data_width: Character width of data when displaying * @default_ctrl: Specifies default cache cbm or memory B/W percent. * @format_str: Per resource format string to show domain value - * @parse_ctrlval: Per resource function pointer to parse control values * @evt_list: List of monitoring events * @fflags: flags to choose base and info files * @cdp_capable: Is the CDP feature available on this resource @@ -192,9 +188,6 @@ struct rdt_resource { int data_width; u32 default_ctrl; const char *format_str; - int (*parse_ctrlval)(struct rdt_parse_data *data, - struct resctrl_schema *s, - struct rdt_domain *d); struct list_head evt_list; unsigned long fflags; bool cdp_capable;
From: James Morse james.morse@arm.com
rdtgroup_rmdir_ctrl() and rdtgroup_rmdir_mon() set the per-cpu pqr_state for CPUs that were part of the rmdir'd group.
Another architecture might not have a 'pqr_state', its hardware may need the values in a different format. MPAM's equivalent of RMID values are not unique, and always need the CLOSID to be provided too.
As only rdtgroup_rmdir_mon() modifies just the RMID, provide a helper that sets both values. This allows another architecture to convert the values into the format the hardware expects. As these values are read by __resctrl_sched_in(), but may be written by a different CPU without any locking, add READ/WRTE_ONCE() to avoid torn values.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/include/asm/resctrl.h | 14 +++++++++++--- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 13 ++++++++----- 2 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 12dbd2588ca7..f61382258743 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -4,8 +4,9 @@
#ifdef CONFIG_X86_CPU_RESCTRL
-#include <linux/sched.h> #include <linux/jump_label.h> +#include <linux/percpu.h> +#include <linux/sched.h>
/* * This value can never be a valid CLOSID, and is used when mapping a @@ -96,8 +97,8 @@ static inline void resctrl_arch_disable_mon(void) static inline void __resctrl_sched_in(struct task_struct *tsk) { struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state); - u32 closid = state->default_closid; - u32 rmid = state->default_rmid; + u32 closid = READ_ONCE(state->default_closid); + u32 rmid = READ_ONCE(state->default_rmid); u32 tmp;
/* @@ -132,6 +133,13 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) return val * scale; }
+static inline void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, + u32 rmid) +{ + WRITE_ONCE(per_cpu(pqr_state.default_closid, cpu), closid); + WRITE_ONCE(per_cpu(pqr_state.default_rmid, cpu), rmid); +} + static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) { diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 76e32529475b..073f78b24cba 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3634,7 +3634,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
/* Update per cpu rmid of the moved CPUs first */ for_each_cpu(cpu, &rdtgrp->cpu_mask) - per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid; + resctrl_arch_set_cpu_default_closid_rmid(cpu, rdtgrp->closid, + prdtgrp->mon.rmid); + /* * Update the MSR on moved CPUs and CPUs which have moved * task running on them. @@ -3667,6 +3669,7 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) { + u32 closid, rmid; int cpu;
/* Give any tasks back to the default group */ @@ -3677,10 +3680,10 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
/* Update per cpu closid and rmid of the moved CPUs first */ - for_each_cpu(cpu, &rdtgrp->cpu_mask) { - per_cpu(pqr_state.default_closid, cpu) = rdtgroup_default.closid; - per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid; - } + closid = rdtgroup_default.closid; + rmid = rdtgroup_default.mon.rmid; + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, closid, rmid);
/* * Update the MSR on moved CPUs and CPUs which have moved
From: James Morse james.morse@arm.com
update_cpu_closid_rmid() takes a struct rdtgroup as an argument, which it uses to update the local CPUs default pqr values. This is a problem once the resctrl parts move out to /fs/, as the arch code cannot poke around inside struct rdtgroup.
Add a helper resctrl_arch_sync_cpus_defaults() to abstract the cross calling, and pass the effective CLOSID and RMID in a new struct.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 +++++++++++++++---- include/linux/resctrl.h | 11 +++++++++++ 2 files changed, 26 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 073f78b24cba..94988c55bcde 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -340,13 +340,13 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of, * from update_closid_rmid() is protected against __switch_to() because * preemption is disabled. */ -static void update_cpu_closid_rmid(void *info) +void resctrl_arch_sync_cpu_defaults(void *info) { - struct rdtgroup *r = info; + struct resctrl_cpu_sync *r = info;
if (r) { this_cpu_write(pqr_state.default_closid, r->closid); - this_cpu_write(pqr_state.default_rmid, r->mon.rmid); + this_cpu_write(pqr_state.default_rmid, r->rmid); }
/* @@ -361,11 +361,22 @@ static void update_cpu_closid_rmid(void *info) * Update the PGR_ASSOC MSR on all cpus in @cpu_mask, * * Per task closids/rmids must have been set up before calling this function. + * @r may be NULL. */ static void update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) { - on_each_cpu_mask(cpu_mask, update_cpu_closid_rmid, r, 1); + struct resctrl_cpu_sync defaults; + struct resctrl_cpu_sync *defaults_p = NULL; + + if (r) { + defaults.closid = r->closid; + defaults.rmid = r->mon.rmid; + defaults_p = &defaults; + } + + on_each_cpu_mask(cpu_mask, resctrl_arch_sync_cpu_defaults, defaults_p, + 1); }
static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 6e87bc95f5ea..2b79e4159507 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -220,6 +220,17 @@ struct resctrl_schema { u32 num_closid; };
+struct resctrl_cpu_sync { + u32 closid; + u32 rmid; +}; + +/* + * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to + * struct resctrl_cpu_sync, or NULL. + */ +void resctrl_arch_sync_cpu_defaults(void *info); + /* The number of closid supported by this resource regardless of CDP */ u32 resctrl_arch_get_num_closid(struct rdt_resource *r); int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
From: James Morse james.morse@arm.com
rdtgroup_init() needs exporting so that arch code can call it once it lives in core code. As this is one of the few functions we export, rename it to have the filesystems name.
x86's arch code init functions for RDT need renaming (again) to avoid duplicate definitions.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 10 +++++----- arch/x86/kernel/cpu/resctrl/internal.h | 3 --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++---- include/linux/resctrl.h | 3 +++ 4 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 92d6e811fd3f..870208e9c9c3 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -949,7 +949,7 @@ void resctrl_cpu_detect(struct cpuinfo_x86 *c) } }
-static int __init resctrl_late_init(void) +static int __init resctrl_arch_late_init(void) { struct rdt_resource *r; int state, ret; @@ -974,7 +974,7 @@ static int __init resctrl_late_init(void) if (state < 0) return state;
- ret = rdtgroup_init(); + ret = resctrl_init(); if (ret) { cpuhp_remove_state(state); return ret; @@ -990,9 +990,9 @@ static int __init resctrl_late_init(void) return 0; }
-late_initcall(resctrl_late_init); +late_initcall(resctrl_arch_late_init);
-static void __exit resctrl_exit(void) +static void __exit resctrl_arch_exit(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
@@ -1004,4 +1004,4 @@ static void __exit resctrl_exit(void) rdt_put_mon_l3_config(r); }
-__exitcall(resctrl_exit); +__exitcall(resctrl_arch_exit); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index ec54aea622f1..5ed209fb5d3c 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -301,9 +301,6 @@ extern struct list_head rdt_all_groups;
extern int max_name_width, max_data_width;
-int __init rdtgroup_init(void); -void __exit rdtgroup_exit(void); - /** * struct rftype - describe each file in the resctrl file system * @name: File name diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 94988c55bcde..71514da384c8 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4118,14 +4118,14 @@ void resctrl_offline_cpu(unsigned int cpu) }
/* - * rdtgroup_init - rdtgroup initialization + * resctrl_init - resctrl filesystem initialization * * Setup resctrl file system including set up root, create mount point, - * register rdtgroup filesystem, and initialize files under root directory. + * register resctrl filesystem, and initialize files under root directory. * * Return: 0 on success or -errno */ -int __init rdtgroup_init(void) +int __init resctrl_init(void) { int ret = 0;
@@ -4173,7 +4173,7 @@ int __init rdtgroup_init(void) return ret; }
-void __exit rdtgroup_exit(void) +void __exit resctrl_exit(void) { debugfs_remove_recursive(debugfs_resctrl); unregister_filesystem(&rdt_fs_type); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 2b79e4159507..f6a4b75f8122 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -325,4 +325,7 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d); extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit;
+int __init resctrl_init(void); +void __exit resctrl_exit(void); + #endif /* _RESCTRL_H */
From: James Morse james.morse@arm.com
rdt_find_domain() finds a domain given a resource and a cache-id. Its not quite right for the resctrl arch API as it also returns the position to insert a new domain.
Wrap rdt_find_domain() in a another function resctrl_arch_find_domain() so we avoid the unnecessary argument outside the arch code.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 9 +++++++-- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 2 -- include/linux/resctrl.h | 2 ++ 4 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 870208e9c9c3..aebed4608b34 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -416,8 +416,8 @@ void rdt_ctrl_update(void *arg) * caller, return the first domain whose id is bigger than the input id. * The domain list is sorted by id in ascending order. */ -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, - struct list_head **pos) +static struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, + struct list_head **pos) { struct rdt_domain *d; struct list_head *l; @@ -441,6 +441,11 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, return NULL; }
+struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id) +{ + return rdt_find_domain(r, id, NULL); +} + static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index c7cc850311ef..0a5a945ca368 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -618,7 +618,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) evtid = md.u.evtid;
r = resctrl_arch_get_resource(resid); - d = rdt_find_domain(r, domid, NULL); + d = resctrl_arch_find_domain(r, domid); if (IS_ERR_OR_NULL(d)) { ret = -ENOENT; goto out; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 5ed209fb5d3c..f86b884ef43c 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -535,8 +535,6 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn); int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, umode_t mask); -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, - struct list_head **pos); ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off); int rdtgroup_schemata_show(struct kernfs_open_file *of, diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index f6a4b75f8122..c5fcbb524136 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -233,6 +233,8 @@ void resctrl_arch_sync_cpu_defaults(void *info);
/* The number of closid supported by this resource regardless of CDP */ u32 resctrl_arch_get_num_closid(struct rdt_resource *r); + +struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id); int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
/*
From: James Morse james.morse@arm.com
To avoid sticky problems in the mpam glue code, move the resctrl enums into a separate header.
This lets the arch code declare prototypes that use these enums without creating a loop via asm<->linux resctrl.h
The same logic applies to the monitor-configuration defines, move these too.
Signed-off-by: James Morse james.morse@arm.com --- MAINTAINERS | 1 + arch/x86/kernel/cpu/resctrl/internal.h | 24 --------- include/linux/resctrl.h | 35 +------------ include/linux/resctrl_types.h | 68 ++++++++++++++++++++++++++ 4 files changed, 70 insertions(+), 58 deletions(-) create mode 100644 include/linux/resctrl_types.h
diff --git a/MAINTAINERS b/MAINTAINERS index dd5de540ec0b..a707271ab5f0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18062,6 +18062,7 @@ S: Supported F: Documentation/arch/x86/resctrl* F: arch/x86/include/asm/resctrl.h F: arch/x86/kernel/cpu/resctrl/ +F: include/linux/resctrl*.h F: tools/testing/selftests/resctrl/
READ-COPY UPDATE (RCU) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index f86b884ef43c..fc47dcfc349b 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -33,30 +33,6 @@ */ #define MBM_CNTR_WIDTH_OFFSET_MAX (62 - MBM_CNTR_WIDTH_BASE)
-/* Reads to Local DRAM Memory */ -#define READS_TO_LOCAL_MEM BIT(0) - -/* Reads to Remote DRAM Memory */ -#define READS_TO_REMOTE_MEM BIT(1) - -/* Non-Temporal Writes to Local Memory */ -#define NON_TEMP_WRITE_TO_LOCAL_MEM BIT(2) - -/* Non-Temporal Writes to Remote Memory */ -#define NON_TEMP_WRITE_TO_REMOTE_MEM BIT(3) - -/* Reads to Local Memory the system identifies as "Slow Memory" */ -#define READS_TO_LOCAL_S_MEM BIT(4) - -/* Reads to Remote Memory the system identifies as "Slow Memory" */ -#define READS_TO_REMOTE_S_MEM BIT(5) - -/* Dirty Victims to All Types of Memory */ -#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6) - -/* Max event bits supported */ -#define MAX_EVT_CONFIG_BITS GENMASK(6, 0) - /** * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that * aren't marked nohz_full diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index c5fcbb524136..b0ee7256e095 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -5,6 +5,7 @@ #include <linux/kernel.h> #include <linux/list.h> #include <linux/pid.h> +#include <linux/resctrl_types.h>
/* CLOSID, RMID value used by the default control group */ #define RESCTRL_RESERVED_CLOSID 0 @@ -24,40 +25,6 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX
-/** - * enum resctrl_conf_type - The type of configuration. - * @CDP_NONE: No prioritisation, both code and data are controlled or monitored. - * @CDP_CODE: Configuration applies to instruction fetches. - * @CDP_DATA: Configuration applies to reads and writes. - */ -enum resctrl_conf_type { - CDP_NONE, - CDP_CODE, - CDP_DATA, -}; - -enum resctrl_res_level { - RDT_RESOURCE_L3, - RDT_RESOURCE_L2, - RDT_RESOURCE_MBA, - RDT_RESOURCE_SMBA, - - /* Must be the last */ - RDT_NUM_RESOURCES, -}; - -#define CDP_NUM_TYPES (CDP_DATA + 1) - -/* - * Event IDs, the values match those used to program IA32_QM_EVTSEL before - * reading IA32_QM_CTR on RDT systems. - */ -enum resctrl_event_id { - QOS_L3_OCCUP_EVENT_ID = 0x01, - QOS_L3_MBM_TOTAL_EVENT_ID = 0x02, - QOS_L3_MBM_LOCAL_EVENT_ID = 0x03, -}; - /** * struct resctrl_staged_config - parsed configuration to be applied * @new_ctrl: new ctrl value to be loaded diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h new file mode 100644 index 000000000000..06e0f5567558 --- /dev/null +++ b/include/linux/resctrl_types.h @@ -0,0 +1,68 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2023 Arm Ltd. + * Based on arch/x86/kernel/cpu/resctrl/internal.h + */ + +#ifndef __LINUX_RESCTRL_TYPES_H +#define __LINUX_RESCTRL_TYPES_H + +/* Reads to Local DRAM Memory */ +#define READS_TO_LOCAL_MEM BIT(0) + +/* Reads to Remote DRAM Memory */ +#define READS_TO_REMOTE_MEM BIT(1) + +/* Non-Temporal Writes to Local Memory */ +#define NON_TEMP_WRITE_TO_LOCAL_MEM BIT(2) + +/* Non-Temporal Writes to Remote Memory */ +#define NON_TEMP_WRITE_TO_REMOTE_MEM BIT(3) + +/* Reads to Local Memory the system identifies as "Slow Memory" */ +#define READS_TO_LOCAL_S_MEM BIT(4) + +/* Reads to Remote Memory the system identifies as "Slow Memory" */ +#define READS_TO_REMOTE_S_MEM BIT(5) + +/* Dirty Victims to All Types of Memory */ +#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6) + +/* Max event bits supported */ +#define MAX_EVT_CONFIG_BITS GENMASK(6, 0) + +/** + * enum resctrl_conf_type - The type of configuration. + * @CDP_NONE: No prioritisation, both code and data are controlled or monitored. + * @CDP_CODE: Configuration applies to instruction fetches. + * @CDP_DATA: Configuration applies to reads and writes. + */ +enum resctrl_conf_type { + CDP_NONE, + CDP_CODE, + CDP_DATA, +}; + +enum resctrl_res_level { + RDT_RESOURCE_L3, + RDT_RESOURCE_L2, + RDT_RESOURCE_MBA, + RDT_RESOURCE_SMBA, + + /* Must be the last */ + RDT_NUM_RESOURCES, +}; + +#define CDP_NUM_TYPES (CDP_DATA + 1) + +/* + * Event IDs, the values match those used to program IA32_QM_EVTSEL before + * reading IA32_QM_CTR on RDT systems. + */ +enum resctrl_event_id { + QOS_L3_OCCUP_EVENT_ID = 0x01, + QOS_L3_MBM_TOTAL_EVENT_ID = 0x02, + QOS_L3_MBM_LOCAL_EVENT_ID = 0x03, +}; + +#endif /* __LINUX_RESCTRL_TYPES_H */
From: James Morse james.morse@arm.com
On umount(), resctrl resets each resource back to its default configuration. It only ever does this for all resources, move the whole lot behind a single helper.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/include/asm/resctrl.h | 2 ++ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 +++++++++++----- 2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index f61382258743..5f6a5375bb4a 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -15,6 +15,8 @@ */ #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0)
+void resctrl_arch_reset_resources(void); + /** * struct resctrl_pqr_state - State cache for the PQR MSR * @cur_rmid: The cached Resource Monitoring ID diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 71514da384c8..5e59687dc04b 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2863,6 +2863,14 @@ static int reset_all_ctrls(struct rdt_resource *r) return 0; }
+void resctrl_arch_reset_resources(void) +{ + struct rdt_resource *r; + + for_each_capable_rdt_resource(r) + reset_all_ctrls(r); +} + /* * Move tasks from one to the other group. If @from is NULL, then all tasks * in the systems are moved unconditionally (used for teardown). @@ -2972,16 +2980,14 @@ static void rmdir_all_sub(void)
static void rdt_kill_sb(struct super_block *sb) { - struct rdt_resource *r; - cpus_read_lock(); mutex_lock(&rdtgroup_mutex);
rdt_disable_ctx();
- /*Put everything back to default values. */ - for_each_alloc_capable_rdt_resource(r) - reset_all_ctrls(r); + /* Put everything back to default values. */ + resctrl_arch_reset_resources(); + rmdir_all_sub(); rdt_pseudo_lock_release(); rdtgroup_default.mode = RDT_MODE_SHAREABLE;
From: James Morse james.morse@arm.com
rdt_get_mon_l3_config() is called via the architecture's resctrl_late_init(), and initialises both architecture specific fields, such as hw_res->mon_scale and resctrl filesystem fields by calling dom_data_init().
To separate the filesystem and architecture parts of resctrl, this function needs splitting up.
Add resctrl_mon_resource_init() to do the filesystem specific work, and call it from resctrl_init().
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/internal.h | 1 + arch/x86/kernel/cpu/resctrl/monitor.c | 22 +++++++++++++++------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 ++++ 3 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index fc47dcfc349b..d28200dc6b16 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -542,6 +542,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg); void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, struct rdt_domain *d, struct rdtgroup *rdtgrp, int evtid, int first); +int resctrl_mon_resource_init(void); void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, int exclude_cpu); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 79a6d1b8367b..35c9ee5ea51a 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -1019,12 +1019,26 @@ static void l3_mon_evt_init(struct rdt_resource *r) list_add_tail(&mbm_local_event.list, &r->evt_list); }
+int resctrl_mon_resource_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + int ret; + + ret = dom_data_init(r); + if (ret) + return ret; + + if (r->mon_capable) + l3_mon_evt_init(r); + + return 0; +} + int __init rdt_get_mon_l3_config(struct rdt_resource *r) { unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset; struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); unsigned int threshold; - int ret;
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024; hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale; @@ -1052,10 +1066,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) */ resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold);
- ret = dom_data_init(r); - if (ret) - return ret; - if (rdt_cpu_has(X86_FEATURE_BMEC)) { if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) { mbm_total_event.configurable = true; @@ -1067,8 +1077,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) } }
- l3_mon_evt_init(r); - r->mon_capable = true;
return 0; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 5e59687dc04b..637a804f5bcb 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4140,6 +4140,10 @@ int __init resctrl_init(void)
rdtgroup_setup_default();
+ ret = resctrl_mon_resource_init(); + if (ret) + return ret; + ret = sysfs_create_mount_point(fs_kobj, "resctrl"); if (ret) return ret;
From: James Morse james.morse@arm.com
rdt_put_mon_l3_config() is called via the architecture's resctrl_arch_exit() call, and appears to free the rmid_ptrs[] and closid_num_dirty_rmid[] arrays. In reality this code is marked __exit, and is removed by the linker as resctl can't be built as a module.
MPAM can make use of this code from its error interrupt handler, a later patch drops all the __init/__exit annotations.
To separate the filesystem and architecture parts of resctrl, this free()ing work needs to be triggered by the filesystem.
Rename rdt_put_mon_l3_config() resctrl_mon_resource_exit() and call it from resctrl_exit().
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 8 +------- arch/x86/kernel/cpu/resctrl/internal.h | 2 +- arch/x86/kernel/cpu/resctrl/monitor.c | 8 ++++++-- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 ++ 4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index aebed4608b34..faa094e50c4a 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -999,14 +999,8 @@ late_initcall(resctrl_arch_late_init);
static void __exit resctrl_arch_exit(void) { - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - cpuhp_remove_state(rdt_online); - - rdtgroup_exit(); - - if (r->mon_capable) - rdt_put_mon_l3_config(r); + resctrl_exit(); }
__exitcall(resctrl_arch_exit); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index d28200dc6b16..10d568d6c32c 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -535,7 +535,7 @@ void closid_free(int closid); int alloc_rmid(u32 closid); void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); -void __exit rdt_put_mon_l3_config(struct rdt_resource *r); +void __exit resctrl_mon_resource_exit(void); bool __init rdt_cpu_has(int flag); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 35c9ee5ea51a..31d216814794 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -972,8 +972,10 @@ static int dom_data_init(struct rdt_resource *r)
static void __exit dom_data_exit(struct rdt_resource *r) { - mutex_lock(&rdtgroup_mutex); + if (!r->mon_capable) + return;
+ mutex_lock(&rdtgroup_mutex); if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { kfree(closid_num_dirty_rmid); closid_num_dirty_rmid = NULL; @@ -1082,8 +1084,10 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) return 0; }
-void __exit rdt_put_mon_l3_config(struct rdt_resource *r) +void __exit resctrl_mon_resource_exit(void) { + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + dom_data_exit(r); }
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 637a804f5bcb..2ab35d7a658b 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4188,4 +4188,6 @@ void __exit resctrl_exit(void) debugfs_remove_recursive(debugfs_resctrl); unregister_filesystem(&rdt_fs_type); sysfs_remove_mount_point(fs_kobj, "resctrl"); + + resctrl_mon_resource_exit(); }
From: James Morse james.morse@arm.com
max_name_width and max_data_width are used to pad the strings in the resctrl schemata file.
This should be part of the fs code as it influences the user-space interface, but currently max_data_width is generated by the arch init code. max_name_width is already managed by schemata_list_add().
Move the variables and max_data_width's initialisation code to rdtgroup.c.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 22 ---------------------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 ++++++++++++ 2 files changed, 12 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index faa094e50c4a..fadcf0d3c0aa 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -44,12 +44,6 @@ static DEFINE_MUTEX(domain_list_lock); */ DEFINE_PER_CPU(struct resctrl_pqr_state, pqr_state);
-/* - * Used to store the max resource name width and max resource data width - * to display the schemata in a tabular format - */ -int max_name_width, max_data_width; - /* * Global boolean for rdt_alloc which is true if any * resource allocation is enabled. @@ -659,20 +653,6 @@ static int resctrl_arch_offline_cpu(unsigned int cpu) return 0; }
-/* - * Choose a width for the resource name and resource data based on the - * resource that has widest name and cbm. - */ -static __init void rdt_init_padding(void) -{ - struct rdt_resource *r; - - for_each_alloc_capable_rdt_resource(r) { - if (r->data_width > max_data_width) - max_data_width = r->data_width; - } -} - enum { RDT_FLAG_CMT, RDT_FLAG_MBM_TOTAL, @@ -970,8 +950,6 @@ static int __init resctrl_arch_late_init(void) if (!get_rdt_resources()) return -ENODEV;
- rdt_init_padding(); - state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/resctrl/cat:online:", resctrl_arch_online_cpu, diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 2ab35d7a658b..0a226ed2d1a8 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -58,6 +58,12 @@ static struct kernfs_node *kn_mongrp; /* Kernel fs node for "mon_data" directory under root */ static struct kernfs_node *kn_mondata;
+/* + * Used to store the max resource name width and max resource data width + * to display the schemata in a tabular format + */ +int max_name_width, max_data_width; + static struct seq_buf last_cmd_status; static char last_cmd_status_buf[512];
@@ -2599,6 +2605,12 @@ static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type if (cl > max_name_width) max_name_width = cl;
+ /* + * Choose a width for the resource data based on the resource that has + * widest name and cbm. + */ + max_data_width = max(max_data_width, r->data_width); + INIT_LIST_HEAD(&s->list); list_add(&s->list, &resctrl_schema_all);
From: James Morse james.morse@arm.com
The for_each_*_rdt_resource() helpers walk the architectures array of structures, using the resctrl visible part as an iterator.
To support resctrl, another architecture would have to provide equally complex macros.
Change the resctrl code that uses these to walk through the resource_level enum instead. Instances in core.c, and resctrl_arch_reset_resources() remain part of x86's architecture specific code.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 7 +++++- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 +++++++++++++++++++---- 2 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 884b88e25141..f2315a50ea4f 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -840,6 +840,7 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) { cpumask_var_t cpu_with_psl; + enum resctrl_res_level i; struct rdt_resource *r; struct rdt_domain *d_i; bool ret = false; @@ -854,7 +855,11 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) * First determine which cpus have pseudo-locked regions * associated with them. */ - for_each_alloc_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + list_for_each_entry(d_i, &r->domains, list) { if (d_i->plr) cpumask_or(cpu_with_psl, cpu_with_psl, diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 0a226ed2d1a8..bcffa2fc324e 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -100,10 +100,15 @@ void rdt_staged_configs_clear(void) { struct rdt_resource *r; struct rdt_domain *dom; + int i;
lockdep_assert_held(&rdtgroup_mutex);
- for_each_alloc_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + list_for_each_entry(dom, &r->domains, list) memset(dom->staged_config, 0, sizeof(dom->staged_config)); } @@ -2185,6 +2190,7 @@ static int rdtgroup_mkdir_info_resdir(void *priv, char *name,
static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) { + enum resctrl_res_level i; struct resctrl_schema *s; struct rdt_resource *r; unsigned long fflags; @@ -2209,8 +2215,12 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) goto out_destroy; }
- for_each_mon_capable_rdt_resource(r) { - fflags = r->fflags | RFTYPE_MON_INFO; + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + fflags = r->fflags | RFTYPE_MON_INFO; sprintf(name, "%s_MON", r->name); ret = rdtgroup_mkdir_info_resdir(r, name, fflags); if (ret) @@ -2619,10 +2629,15 @@ static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type
static int schemata_list_create(void) { + enum resctrl_res_level i; struct rdt_resource *r; int ret = 0;
- for_each_alloc_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + if (resctrl_arch_get_cdp_enabled(r->rid)) { ret = schemata_list_add(r, CDP_CODE); if (ret) @@ -3170,6 +3185,7 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **dest_kn) { + enum resctrl_res_level i; struct rdt_resource *r; struct kernfs_node *kn; int ret; @@ -3188,7 +3204,11 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, * Create the subdirectories for each domain. Note that all events * in a domain like L3 are grouped into a resource whose domain is L3 */ - for_each_mon_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); if (ret) goto out_destroy;
From: James Morse james.morse@arm.com
The architecture specific parts of resctrl have helpers to hide accesses to the rdt_mon_features bitmap.
To ease moving the filesystem parts of the code out to /fs/, give these a resctrl_arch prefix. Move is_mbm_event() and is_mbm_enabled() into the only file that uses them.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/include/asm/resctrl.h | 17 ++++++++++++ arch/x86/kernel/cpu/resctrl/core.c | 4 +-- arch/x86/kernel/cpu/resctrl/internal.h | 16 ----------- arch/x86/kernel/cpu/resctrl/monitor.c | 18 ++++++------ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 38 +++++++++++++++++--------- 5 files changed, 53 insertions(+), 40 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 5f6a5375bb4a..50407e83d0ca 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -7,6 +7,7 @@ #include <linux/jump_label.h> #include <linux/percpu.h> #include <linux/sched.h> +#include <linux/resctrl_types.h>
/* * This value can never be a valid CLOSID, and is used when mapping a @@ -43,6 +44,7 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state);
extern bool rdt_alloc_capable; extern bool rdt_mon_capable; +extern unsigned int rdt_mon_features;
DECLARE_STATIC_KEY_FALSE(rdt_enable_key); DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); @@ -82,6 +84,21 @@ static inline void resctrl_arch_disable_mon(void) static_branch_dec_cpuslocked(&rdt_enable_key); }
+static inline bool resctrl_arch_is_llc_occupancy_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID)); +} + +static inline bool resctrl_arch_is_mbm_total_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID)); +} + +static inline bool resctrl_arch_is_mbm_local_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); +} + /* * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR * diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index fadcf0d3c0aa..e62d89cd9252 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -492,13 +492,13 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom) { size_t tsize;
- if (is_mbm_total_enabled()) { + if (resctrl_arch_is_mbm_total_enabled()) { tsize = sizeof(*hw_dom->arch_mbm_total); hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL); if (!hw_dom->arch_mbm_total) return -ENOMEM; } - if (is_mbm_local_enabled()) { + if (resctrl_arch_is_mbm_local_enabled()) { tsize = sizeof(*hw_dom->arch_mbm_local); hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL); if (!hw_dom->arch_mbm_local) { diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 10d568d6c32c..9357ad1ff98b 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -131,7 +131,6 @@ struct rmid_read { void *arch_mon_ctx; };
-extern unsigned int rdt_mon_features; extern struct list_head resctrl_schema_all; extern bool resctrl_mounted;
@@ -370,21 +369,6 @@ static inline bool is_llc_occupancy_enabled(void) return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID)); }
-static inline bool is_mbm_total_enabled(void) -{ - return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID)); -} - -static inline bool is_mbm_local_enabled(void) -{ - return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); -} - -static inline bool is_mbm_enabled(void) -{ - return (is_mbm_total_enabled() || is_mbm_local_enabled()); -} - static inline bool is_mbm_event(int e) { return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 31d216814794..d39c113a726a 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -251,11 +251,11 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) { struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
- if (is_mbm_total_enabled()) + if (resctrl_arch_is_mbm_total_enabled()) memset(hw_dom->arch_mbm_total, 0, sizeof(*hw_dom->arch_mbm_total) * r->num_rmid);
- if (is_mbm_local_enabled()) + if (resctrl_arch_is_mbm_local_enabled()) memset(hw_dom->arch_mbm_local, 0, sizeof(*hw_dom->arch_mbm_local) * r->num_rmid); } @@ -514,7 +514,7 @@ void free_rmid(u32 closid, u32 rmid)
entry = __rmid_entry(idx);
- if (is_llc_occupancy_enabled()) + if (resctrl_arch_is_llc_occupancy_enabled()) add_rmid_to_limbo(entry); else list_add_tail(&entry->list, &rmid_free_lru); @@ -669,7 +669,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) struct list_head *head; struct rdtgroup *entry;
- if (!is_mbm_local_enabled()) + if (!resctrl_arch_is_mbm_local_enabled()) return;
r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); @@ -757,7 +757,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, * This is protected from concurrent reads from user * as both the user and we hold the global mutex. */ - if (is_mbm_total_enabled()) { + if (resctrl_arch_is_mbm_total_enabled()) { rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; rr.val = 0; rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); @@ -771,7 +771,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); } - if (is_mbm_local_enabled()) { + if (resctrl_arch_is_mbm_local_enabled()) { rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; rr.val = 0; rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); @@ -1013,11 +1013,11 @@ static void l3_mon_evt_init(struct rdt_resource *r) { INIT_LIST_HEAD(&r->evt_list);
- if (is_llc_occupancy_enabled()) + if (resctrl_arch_is_llc_occupancy_enabled()) list_add_tail(&llc_occupancy_event.list, &r->evt_list); - if (is_mbm_total_enabled()) + if (resctrl_arch_is_mbm_total_enabled()) list_add_tail(&mbm_total_event.list, &r->evt_list); - if (is_mbm_local_enabled()) + if (resctrl_arch_is_mbm_local_enabled()) list_add_tail(&mbm_local_event.list, &r->evt_list); }
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index bcffa2fc324e..d213c339c594 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -114,6 +114,18 @@ void rdt_staged_configs_clear(void) } }
+static bool resctrl_is_mbm_enabled(void) +{ + return (resctrl_arch_is_mbm_total_enabled() || + resctrl_arch_is_mbm_local_enabled()); +} + +static bool resctrl_is_mbm_event(int e) +{ + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && + e <= QOS_L3_MBM_LOCAL_EVENT_ID); +} + /* * Trivial allocator for CLOSIDs. Since h/w only supports a small number, * we can keep a bitmap of free CLOSIDs in a single integer. @@ -2374,7 +2386,7 @@ static bool supports_mba_mbps(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
- return (is_mbm_local_enabled() && + return (resctrl_arch_is_mbm_local_enabled() && r->alloc_capable && is_mba_linear()); }
@@ -2742,7 +2754,7 @@ static int rdt_get_tree(struct fs_context *fc) if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) resctrl_mounted = true;
- if (is_mbm_enabled()) { + if (resctrl_is_mbm_enabled()) { list_for_each_entry(dom, &l3->domains, list) mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); @@ -3111,7 +3123,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, if (ret) goto out_destroy;
- if (is_mbm_event(mevt->evtid)) + if (resctrl_is_mbm_event(mevt->evtid)) mon_event_read(&rr, r, d, prgrp, mevt->evtid, true); } kernfs_activate(kn); @@ -4008,9 +4020,9 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) if (resctrl_mounted && resctrl_arch_mon_capable()) rmdir_mondata_subdir_allrdtgrp(r, d->id);
- if (is_mbm_enabled()) + if (resctrl_is_mbm_enabled()) cancel_delayed_work(&d->mbm_over); - if (is_llc_occupancy_enabled() && has_busy_rmid(d)) { + if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) { /* * When a package is going down, forcefully * decrement rmid->ebusy. There is no way to know @@ -4034,12 +4046,12 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) u32 idx_limit = resctrl_arch_system_num_rmid_idx(); size_t tsize;
- if (is_llc_occupancy_enabled()) { + if (resctrl_arch_is_llc_occupancy_enabled()) { d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL); if (!d->rmid_busy_llc) return -ENOMEM; } - if (is_mbm_total_enabled()) { + if (resctrl_arch_is_mbm_total_enabled()) { tsize = sizeof(*d->mbm_total); d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL); if (!d->mbm_total) { @@ -4047,7 +4059,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) return -ENOMEM; } } - if (is_mbm_local_enabled()) { + if (resctrl_arch_is_mbm_local_enabled()) { tsize = sizeof(*d->mbm_local); d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL); if (!d->mbm_local) { @@ -4079,13 +4091,13 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) if (err) goto out_unlock;
- if (is_mbm_enabled()) { + if (resctrl_is_mbm_enabled()) { INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); }
- if (is_llc_occupancy_enabled()) + if (resctrl_arch_is_llc_occupancy_enabled()) INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
/* @@ -4140,12 +4152,12 @@ void resctrl_offline_cpu(unsigned int cpu)
d = get_domain_from_cpu(cpu, l3); if (d) { - if (is_mbm_enabled() && cpu == d->mbm_work_cpu) { + if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) { cancel_delayed_work(&d->mbm_over); mbm_setup_overflow_handler(d, 0, cpu); } - if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu && - has_busy_rmid(d)) { + if (resctrl_arch_is_llc_occupancy_enabled() && + cpu == d->cqm_work_cpu && has_busy_rmid(d)) { cancel_delayed_work(&d->cqm_limbo); cqm_setup_limbo_handler(d, 0, cpu); }
From: James Morse james.morse@arm.com
When BMEC is supported the resctrl event can be configured in a number of ways. This depends on architecture support. rdt_get_mon_l3_config() modifies the struct mon_evt and calls mbm_config_rftype_init() to create the files that allow the configuration.
Splitting this into separate architecture and filesystem parts would require the struct mon_evt and mbm_config_rftype_init() to be exposed.
Instead, add resctrl_arch_is_evt_configurable(), and use this from resctrl_mon_resource_init() to initialise struct mon_evt and call mbm_config_rftype_init().
resctrl_arch_is_evt_configurable() uses rdt_cpu_has() so it can't easily be in a header. Putting it in core.c allows rdt_cpu_has() to become static.
resctrl_mon_resource_init() isn't marked __init, and MPAM will need to initialise resctrl late. Drop the __init on the relevant functions, and change rdt_options[] to be read-only after init.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 19 +++++++++++++++++-- arch/x86/kernel/cpu/resctrl/internal.h | 3 +-- arch/x86/kernel/cpu/resctrl/monitor.c | 26 +++++++++++++------------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +- include/linux/resctrl.h | 2 ++ 5 files changed, 34 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index e62d89cd9252..b4d94ca04999 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -678,7 +678,7 @@ struct rdt_options { bool force_off, force_on; };
-static struct rdt_options rdt_options[] __initdata = { +static struct rdt_options rdt_options[] __ro_after_init = { RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC), RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL), RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL), @@ -718,7 +718,7 @@ static int __init set_rdt_options(char *str) } __setup("rdt", set_rdt_options);
-bool __init rdt_cpu_has(int flag) +static bool rdt_cpu_has(int flag) { bool ret = boot_cpu_has(flag); struct rdt_options *o; @@ -738,6 +738,21 @@ bool __init rdt_cpu_has(int flag) return ret; }
+bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) +{ + if (!rdt_cpu_has(X86_FEATURE_BMEC)) + return false; + + switch (evt) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL); + case QOS_L3_MBM_LOCAL_EVENT_ID: + return rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL); + default: + return false; + } +} + static __init bool get_mem_config(void) { struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_MBA]; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 9357ad1ff98b..e9a8724faedb 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -520,7 +520,6 @@ int alloc_rmid(u32 closid); void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); void __exit resctrl_mon_resource_exit(void); -bool __init rdt_cpu_has(int flag); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, @@ -540,7 +539,7 @@ bool has_busy_rmid(struct rdt_domain *d); void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); void __init thread_throttle_mode_init(void); -void __init mbm_config_rftype_init(const char *config); +void mbm_config_rftype_init(const char *config); void rdt_staged_configs_clear(void); bool closid_allocated(unsigned int closid); int resctrl_find_cleanest_closid(void); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index d39c113a726a..ec460296360e 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -1030,8 +1030,19 @@ int resctrl_mon_resource_init(void) if (ret) return ret;
- if (r->mon_capable) - l3_mon_evt_init(r); + if (!r->mon_capable) + return 0; + + l3_mon_evt_init(r); + + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { + mbm_total_event.configurable = true; + mbm_config_rftype_init("mbm_total_bytes_config"); + } + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { + mbm_local_event.configurable = true; + mbm_config_rftype_init("mbm_local_bytes_config"); + }
return 0; } @@ -1068,17 +1079,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) */ resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold);
- if (rdt_cpu_has(X86_FEATURE_BMEC)) { - if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) { - mbm_total_event.configurable = true; - mbm_config_rftype_init("mbm_total_bytes_config"); - } - if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) { - mbm_local_event.configurable = true; - mbm_config_rftype_init("mbm_local_bytes_config"); - } - } - r->mon_capable = true;
return 0; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index d213c339c594..4e2e731e7d89 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2072,7 +2072,7 @@ void __init thread_throttle_mode_init(void) rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; }
-void __init mbm_config_rftype_init(const char *config) +void mbm_config_rftype_init(const char *config) { struct rftype *rft;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index b0ee7256e095..bfc63e8219e5 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -204,6 +204,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r); struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id); int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
+bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt); + /* * Update the ctrl_val and apply this config right now. * Must be called on one of the domain's CPUs.
From: James Morse james.morse@arm.com
mon_event_config_{read,write}() are called via IPI and access model specific registers to do their work.
To support another architecture, this needs abstracting.
Rename mon_event_config_{read,write}() to have a resctrl_arch_ prefix, and move their struct mon_config_info parameter into the restrl_types header. This allows another architecture to supply an implementation of these.
As struct mon_config_info is now exposed globally, give it a 'resctrl_' prefix. MPAM systems need access to the domain to do this work, add the resource and domain to struct resctrl_mon_config_info.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 34 +++++++++++++------------- include/linux/resctrl.h | 9 +++++++ 2 files changed, 26 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 4e2e731e7d89..a7dbd66927ba 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1579,11 +1579,6 @@ static int rdtgroup_size_show(struct kernfs_open_file *of, return ret; }
-struct mon_config_info { - u32 evtid; - u32 mon_config; -}; - #define INVALID_CONFIG_INDEX UINT_MAX
/** @@ -1608,9 +1603,9 @@ static inline unsigned int mon_event_config_index_get(u32 evtid) } }
-static void mon_event_config_read(void *info) +void resctrl_arch_mon_event_config_read(void *info) { - struct mon_config_info *mon_info = info; + struct resctrl_mon_config_info *mon_info = info; unsigned int index; u64 msrval;
@@ -1625,14 +1620,15 @@ static void mon_event_config_read(void *info) mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; }
-static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info) +static void mondata_config_read(struct resctrl_mon_config_info *mon_info) { - smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1); + smp_call_function_any(&mon_info->d->cpu_mask, + resctrl_arch_mon_event_config_read, mon_info, 1); }
static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) { - struct mon_config_info mon_info = {0}; + struct resctrl_mon_config_info mon_info = {0}; struct rdt_domain *dom; bool sep = false;
@@ -1643,9 +1639,11 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid if (sep) seq_puts(s, ";");
- memset(&mon_info, 0, sizeof(struct mon_config_info)); + memset(&mon_info, 0, sizeof(struct resctrl_mon_config_info)); + mon_info.r = r; + mon_info.d = dom; mon_info.evtid = evtid; - mondata_config_read(dom, &mon_info); + mondata_config_read(&mon_info);
seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); sep = true; @@ -1678,9 +1676,9 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of, return 0; }
-static void mon_event_config_write(void *info) +void resctrl_arch_mon_event_config_write(void *info) { - struct mon_config_info *mon_info = info; + struct resctrl_mon_config_info *mon_info = info; unsigned int index;
index = mon_event_config_index_get(mon_info->evtid); @@ -1694,7 +1692,7 @@ static void mon_event_config_write(void *info) static int mbm_config_write_domain(struct rdt_resource *r, struct rdt_domain *d, u32 evtid, u32 val) { - struct mon_config_info mon_info = {0}; + struct resctrl_mon_config_info mon_info = {0}; int ret = 0;
/* mon_config cannot be more than the supported set of events */ @@ -1707,8 +1705,10 @@ static int mbm_config_write_domain(struct rdt_resource *r, * Read the current config value first. If both are the same then * no need to write it again. */ + mon_info.r = r; + mon_info.d = d; mon_info.evtid = evtid; - mondata_config_read(d, &mon_info); + mondata_config_read(&mon_info); if (mon_info.mon_config == val) goto out;
@@ -1720,7 +1720,7 @@ static int mbm_config_write_domain(struct rdt_resource *r, * are scoped at the domain level. Writing any of these MSRs * on one CPU is observed by all the CPUs in the domain. */ - smp_call_function_any(&d->cpu_mask, mon_event_config_write, + smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, &mon_info, 1);
/* diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index bfc63e8219e5..975b80102fbe 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -192,6 +192,13 @@ struct resctrl_cpu_sync { u32 rmid; };
+struct resctrl_mon_config_info { + struct rdt_resource *r; + struct rdt_domain *d; + u32 evtid; + u32 mon_config; +}; + /* * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to * struct resctrl_cpu_sync, or NULL. @@ -205,6 +212,8 @@ struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id); int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt); +void resctrl_arch_mon_event_config_write(void *info); +void resctrl_arch_mon_event_config_read(void *info);
/* * Update the ctrl_val and apply this config right now.
From: James Morse james.morse@arm.com
resctrl_arch_mon_event_config_write() writes a bitmap of events provided by user-space into the configuration register for the monitors.
This assumes that all architectures support all the features each bit corresponds to.
MPAM can filter monitors based on read, write, or both, but there are many more options in the existing bitmap. To allow this interface to work for machines with MPAM, allow the architecture helper to return an error if an incompatible bitmap is set.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 10 ++++++++-- include/linux/resctrl.h | 2 ++ 2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index a7dbd66927ba..47d0b9a73c5c 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1684,16 +1684,18 @@ void resctrl_arch_mon_event_config_write(void *info) index = mon_event_config_index_get(mon_info->evtid); if (index == INVALID_CONFIG_INDEX) { pr_warn_once("Invalid event id %d\n", mon_info->evtid); + mon_info->err = -EINVAL; return; } wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); + + mon_info->err = 0; }
static int mbm_config_write_domain(struct rdt_resource *r, struct rdt_domain *d, u32 evtid, u32 val) { struct resctrl_mon_config_info mon_info = {0}; - int ret = 0;
/* mon_config cannot be more than the supported set of events */ if (val > MAX_EVT_CONFIG_BITS) { @@ -1722,6 +1724,10 @@ static int mbm_config_write_domain(struct rdt_resource *r, */ smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, &mon_info, 1); + if (mon_info.err) { + rdt_last_cmd_puts("Invalid event configuration\n"); + goto out; + }
/* * When an Event Configuration is changed, the bandwidth counters @@ -1735,7 +1741,7 @@ static int mbm_config_write_domain(struct rdt_resource *r, resctrl_arch_reset_rmid_all(r, d);
out: - return ret; + return mon_info.err; }
static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 975b80102fbe..1027421f233e 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -197,6 +197,8 @@ struct resctrl_mon_config_info { struct rdt_domain *d; u32 evtid; u32 mon_config; + + int err; };
/*
From: James Morse james.morse@arm.com
resctrl's pseudo lock has some copy-to-cache and measurement functions that are micro-architecture specific. pseudo_lock_fn() is not at all portable. Label these 'resctrl_arch_' so they stay under /arch/x86.
Pseudo-lock doesn't work without these, add a Kconfig symbol to disable it. This ensures that rdtgrp->mode can never be RDT_MODE_PSEUDO_LOCKED.
Arm's cache-lockdown works in a very different way. As it isn't possible to disable prefetch or efficiently restrict the CPUs ability to pull lines into the cache, this resctrl pseudo_lock stuff isn't going to be used on arm.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/Kconfig | 7 ++++ arch/x86/include/asm/resctrl.h | 5 +++ arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 3 +- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 48 +++++++++++++++-------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++--- 5 files changed, 56 insertions(+), 24 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 66bfabae8814..d9434d0d99b4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -480,6 +480,7 @@ config X86_CPU_RESCTRL depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD) select KERNFS select PROC_CPU_RESCTRL if PROC_FS + select RESCTRL_FS_PSEUDO_LOCK help Enable x86 CPU resource control support.
@@ -496,6 +497,12 @@ config X86_CPU_RESCTRL
Say N if unsure.
+config RESCTRL_FS_PSEUDO_LOCK + bool + help + Software mechanism to try and pin data in a cache portion using + micro-architecture tricks. + if X86_32 config X86_BIGSMP bool "Support for big SMP systems with more than 8 CPUs" diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 50407e83d0ca..a88af68f9fe2 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -211,6 +211,11 @@ static inline void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx) { };
+u64 resctrl_arch_get_prefetch_disable_bits(void); +int resctrl_arch_pseudo_lock_fn(void *_rdtgrp); +int resctrl_arch_measure_cycles_lat_fn(void *_plr); +int resctrl_arch_measure_l2_residency(void *_plr); +int resctrl_arch_measure_l3_residency(void *_plr); void resctrl_cpu_detect(struct cpuinfo_x86 *c);
#else diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 0a5a945ca368..215386d031ef 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -174,7 +174,8 @@ static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, if (!cbm_validate(data->buf, &cbm_val, r)) return -EINVAL;
- if ((rdtgrp->mode == RDT_MODE_EXCLUSIVE || + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + (rdtgrp->mode == RDT_MODE_EXCLUSIVE || rdtgrp->mode == RDT_MODE_SHAREABLE) && rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) { rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n"); diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index f2315a50ea4f..09adf0e3e753 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -62,7 +62,8 @@ static const struct class pseudo_lock_class = { };
/** - * get_prefetch_disable_bits - prefetch disable bits of supported platforms + * resctrl_arch_get_prefetch_disable_bits - prefetch disable bits of supported + * platforms * @void: It takes no parameters. * * Capture the list of platforms that have been validated to support @@ -76,13 +77,13 @@ static const struct class pseudo_lock_class = { * in the SDM. * * When adding a platform here also add support for its cache events to - * measure_cycles_perf_fn() + * resctrl_arch_measure_l*_residency() * * Return: * If platform is supported, the bits to disable hardware prefetchers, 0 * if platform is not supported. */ -static u64 get_prefetch_disable_bits(void) +u64 resctrl_arch_get_prefetch_disable_bits(void) { if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || boot_cpu_data.x86 != 6) @@ -410,7 +411,7 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp) }
/** - * pseudo_lock_fn - Load kernel memory into cache + * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache * @_rdtgrp: resource group to which pseudo-lock region belongs * * This is the core pseudo-locking flow. @@ -428,7 +429,7 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp) * * Return: 0. Waiter on waitqueue will be woken on completion. */ -static int pseudo_lock_fn(void *_rdtgrp) +int resctrl_arch_pseudo_lock_fn(void *_rdtgrp) { struct rdtgroup *rdtgrp = _rdtgrp; struct pseudo_lock_region *plr = rdtgrp->plr; @@ -714,7 +715,7 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) * Not knowing the bits to disable prefetching implies that this * platform does not support Cache Pseudo-Locking. */ - prefetch_disable_bits = get_prefetch_disable_bits(); + prefetch_disable_bits = resctrl_arch_get_prefetch_disable_bits(); if (prefetch_disable_bits == 0) { rdt_last_cmd_puts("Pseudo-locking not supported\n"); return -EINVAL; @@ -776,6 +777,9 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) { int ret;
+ if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + if (resctrl_arch_mon_capable()) { ret = alloc_rmid(rdtgrp->closid); if (ret < 0) { @@ -848,6 +852,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held();
+ if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) return true;
@@ -879,7 +886,8 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) }
/** - * measure_cycles_lat_fn - Measure cycle latency to read pseudo-locked memory + * resctrl_arch_measure_cycles_lat_fn - Measure cycle latency to read + * pseudo-locked memory * @_plr: pseudo-lock region to measure * * There is no deterministic way to test if a memory region is cached. One @@ -892,7 +900,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) * * Return: 0. Waiter on waitqueue will be woken on completion. */ -static int measure_cycles_lat_fn(void *_plr) +int resctrl_arch_measure_cycles_lat_fn(void *_plr) { struct pseudo_lock_region *plr = _plr; u32 saved_low, saved_high; @@ -1076,7 +1084,7 @@ static int measure_residency_fn(struct perf_event_attr *miss_attr, return 0; }
-static int measure_l2_residency(void *_plr) +int resctrl_arch_measure_l2_residency(void *_plr) { struct pseudo_lock_region *plr = _plr; struct residency_counts counts = {0}; @@ -1114,7 +1122,7 @@ static int measure_l2_residency(void *_plr) return 0; }
-static int measure_l3_residency(void *_plr) +int resctrl_arch_measure_l3_residency(void *_plr) { struct pseudo_lock_region *plr = _plr; struct residency_counts counts = {0}; @@ -1212,18 +1220,18 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) plr->cpu = cpu;
if (sel == 1) - thread = kthread_create_on_node(measure_cycles_lat_fn, plr, - cpu_to_node(cpu), + thread = kthread_create_on_node(resctrl_arch_measure_cycles_lat_fn, + plr, cpu_to_node(cpu), "pseudo_lock_measure/%u", cpu); else if (sel == 2) - thread = kthread_create_on_node(measure_l2_residency, plr, - cpu_to_node(cpu), + thread = kthread_create_on_node(resctrl_arch_measure_l2_residency, + plr, cpu_to_node(cpu), "pseudo_lock_measure/%u", cpu); else if (sel == 3) - thread = kthread_create_on_node(measure_l3_residency, plr, - cpu_to_node(cpu), + thread = kthread_create_on_node(resctrl_arch_measure_l3_residency, + plr, cpu_to_node(cpu), "pseudo_lock_measure/%u", cpu); else @@ -1310,6 +1318,9 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) struct device *dev; int ret;
+ if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + ret = pseudo_lock_region_alloc(plr); if (ret < 0) return ret; @@ -1322,7 +1333,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp)
plr->thread_done = 0;
- thread = kthread_create_on_node(pseudo_lock_fn, rdtgrp, + thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, rdtgrp, cpu_to_node(plr->cpu), "pseudo_lock/%u", plr->cpu); if (IS_ERR(thread)) { @@ -1435,6 +1446,9 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) { struct pseudo_lock_region *plr = rdtgrp->plr;
+ if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return; + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { /* * Default group cannot be a pseudo-locked region so we can diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 47d0b9a73c5c..4c1d3216b136 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1451,7 +1451,8 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, goto out; } rdtgrp->mode = RDT_MODE_EXCLUSIVE; - } else if (!strcmp(buf, "pseudo-locksetup")) { + } else if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + !strcmp(buf, "pseudo-locksetup")) { ret = rdtgroup_locksetup_enter(rdtgrp); if (ret) goto out; @@ -2744,9 +2745,11 @@ static int rdt_get_tree(struct fs_context *fc) rdtgroup_default.mon.mon_data_kn = kn_mondata; }
- ret = rdt_pseudo_lock_init(); - if (ret) - goto out_mondata; + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) { + ret = rdt_pseudo_lock_init(); + if (ret) + goto out_mondata; + }
ret = kernfs_get_tree(fc); if (ret < 0) @@ -2769,7 +2772,8 @@ static int rdt_get_tree(struct fs_context *fc) goto out;
out_psl: - rdt_pseudo_lock_release(); + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); out_mondata: if (resctrl_arch_mon_capable()) kernfs_remove(kn_mondata); @@ -3034,7 +3038,8 @@ static void rdt_kill_sb(struct super_block *sb) resctrl_arch_reset_resources();
rmdir_all_sub(); - rdt_pseudo_lock_release(); + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); rdtgroup_default.mode = RDT_MODE_SHAREABLE; schemata_list_destroy(); rdtgroup_destroy_root();
From: James Morse james.morse@arm.com
prefetch_disable_bits is set by rdtgroup_locksetup_enter() from an arch code value, then read by the arch code.
Instead of exporting this value, make resctrl_arch_get_prefetch_disable_bits() set it so that the other arch-code helpers can use the cached-value.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 09adf0e3e753..20e6c99db25b 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -85,6 +85,8 @@ static const struct class pseudo_lock_class = { */ u64 resctrl_arch_get_prefetch_disable_bits(void) { + prefetch_disable_bits = 0; + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || boot_cpu_data.x86 != 6) return 0; @@ -100,7 +102,8 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) * 3 DCU IP Prefetcher Disable (R/W) * 63:4 Reserved */ - return 0xF; + prefetch_disable_bits = 0xF; + break; case INTEL_FAM6_ATOM_GOLDMONT: case INTEL_FAM6_ATOM_GOLDMONT_PLUS: /* @@ -111,10 +114,11 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) * 2 DCU Hardware Prefetcher Disable (R/W) * 63:3 Reserved */ - return 0x5; + prefetch_disable_bits = 0x5; + break; }
- return 0; + return prefetch_disable_bits; }
/** @@ -715,8 +719,7 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) * Not knowing the bits to disable prefetching implies that this * platform does not support Cache Pseudo-Locking. */ - prefetch_disable_bits = resctrl_arch_get_prefetch_disable_bits(); - if (prefetch_disable_bits == 0) { + if (resctrl_arch_get_prefetch_disable_bits() == 0) { rdt_last_cmd_puts("Pseudo-locking not supported\n"); return -EINVAL; }
From: James Morse james.morse@arm.com
resctrl_arch_pseudo_lock_fn() has some very arch specific behaviour, and it takes a struct rdtgroup. This means resctrl has to expose that structure its enums and substructures to the rest of the kernel.
The only reason resctrl_arch_pseudo_lock_fn() wants the rdtgroup is for the CLOSID. Embed that in the pseudo_lock_region as a hw_closid.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/include/asm/resctrl.h | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 37 --------------------- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 13 ++++---- include/linux/resctrl.h | 39 +++++++++++++++++++++++ 4 files changed, 47 insertions(+), 44 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index a88af68f9fe2..9940398e367e 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -212,7 +212,7 @@ static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx) { };
u64 resctrl_arch_get_prefetch_disable_bits(void); -int resctrl_arch_pseudo_lock_fn(void *_rdtgrp); +int resctrl_arch_pseudo_lock_fn(void *_plr); int resctrl_arch_measure_cycles_lat_fn(void *_plr); int resctrl_arch_measure_l2_residency(void *_plr); int resctrl_arch_measure_l3_residency(void *_plr); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index e9a8724faedb..0e50cd10a443 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -183,43 +183,6 @@ struct mongroup { u32 rmid; };
-/** - * struct pseudo_lock_region - pseudo-lock region information - * @s: Resctrl schema for the resource to which this - * pseudo-locked region belongs - * @d: RDT domain to which this pseudo-locked region - * belongs - * @cbm: bitmask of the pseudo-locked region - * @lock_thread_wq: waitqueue used to wait on the pseudo-locking thread - * completion - * @thread_done: variable used by waitqueue to test if pseudo-locking - * thread completed - * @cpu: core associated with the cache on which the setup code - * will be run - * @line_size: size of the cache lines - * @size: size of pseudo-locked region in bytes - * @kmem: the kernel memory associated with pseudo-locked region - * @minor: minor number of character device associated with this - * region - * @debugfs_dir: pointer to this region's directory in the debugfs - * filesystem - * @pm_reqs: Power management QoS requests related to this region - */ -struct pseudo_lock_region { - struct resctrl_schema *s; - struct rdt_domain *d; - u32 cbm; - wait_queue_head_t lock_thread_wq; - int thread_done; - int cpu; - unsigned int line_size; - unsigned int size; - void *kmem; - unsigned int minor; - struct dentry *debugfs_dir; - struct list_head pm_reqs; -}; - /** * struct rdtgroup - store rdtgroup's data in resctrl file system. * @kn: kernfs node diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 20e6c99db25b..85536566d7f5 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -416,7 +416,7 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp)
/** * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache - * @_rdtgrp: resource group to which pseudo-lock region belongs + * @_plr: the pseudo-lock region descriptor * * This is the core pseudo-locking flow. * @@ -433,10 +433,9 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp) * * Return: 0. Waiter on waitqueue will be woken on completion. */ -int resctrl_arch_pseudo_lock_fn(void *_rdtgrp) +int resctrl_arch_pseudo_lock_fn(void *_plr) { - struct rdtgroup *rdtgrp = _rdtgrp; - struct pseudo_lock_region *plr = rdtgrp->plr; + struct pseudo_lock_region *plr = _plr; u32 rmid_p, closid_p; unsigned long i; u64 saved_msr; @@ -496,7 +495,8 @@ int resctrl_arch_pseudo_lock_fn(void *_rdtgrp) * pseudo-locked followed by reading of kernel memory to load it * into the cache. */ - __wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, rdtgrp->closid); + __wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, plr->closid); + /* * Cache was flushed earlier. Now access kernel memory to read it * into cache region associated with just activated plr->closid. @@ -1336,7 +1336,8 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp)
plr->thread_done = 0;
- thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, rdtgrp, + plr->closid = rdtgrp->closid; + thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, plr, cpu_to_node(plr->cpu), "pseudo_lock/%u", plr->cpu); if (IS_ERR(thread)) { diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 1027421f233e..223937f89911 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -25,6 +25,45 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX
+/** + * struct pseudo_lock_region - pseudo-lock region information + * @s: Resctrl schema for the resource to which this + * pseudo-locked region belongs + * @closid: The closid that this pseudo-locked region uses + * @d: RDT domain to which this pseudo-locked region + * belongs + * @cbm: bitmask of the pseudo-locked region + * @lock_thread_wq: waitqueue used to wait on the pseudo-locking thread + * completion + * @thread_done: variable used by waitqueue to test if pseudo-locking + * thread completed + * @cpu: core associated with the cache on which the setup code + * will be run + * @line_size: size of the cache lines + * @size: size of pseudo-locked region in bytes + * @kmem: the kernel memory associated with pseudo-locked region + * @minor: minor number of character device associated with this + * region + * @debugfs_dir: pointer to this region's directory in the debugfs + * filesystem + * @pm_reqs: Power management QoS requests related to this region + */ +struct pseudo_lock_region { + struct resctrl_schema *s; + u32 closid; + struct rdt_domain *d; + u32 cbm; + wait_queue_head_t lock_thread_wq; + int thread_done; + int cpu; + unsigned int line_size; + unsigned int size; + void *kmem; + unsigned int minor; + struct dentry *debugfs_dir; + struct list_head pm_reqs; +}; + /** * struct resctrl_staged_config - parsed configuration to be applied * @new_ctrl: new ctrl value to be loaded
From: James Morse james.morse@arm.com
thread_throttle_mode_init() is called by the architecture specific code to make the 'thread_throttle_mode' file visible. The architecture specific code has already set the membw.throttle_mode in the rdt_resource.
The architecture specific call isn't necessary, and creates an extra interface between the architecture and resctrl.
Call thread_throttle_mode_init() from resctrl_setup(), check the membw.throttle_mode on the MBA resource. This avoids an extra API call.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 1 - arch/x86/kernel/cpu/resctrl/internal.h | 1 - arch/x86/kernel/cpu/resctrl/rdtgroup.c | 9 ++++++++- 3 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index b4d94ca04999..b2260a4f2c09 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -227,7 +227,6 @@ static bool __get_mem_config_intel(struct rdt_resource *r) r->membw.throttle_mode = THREAD_THROTTLE_PER_THREAD; else r->membw.throttle_mode = THREAD_THROTTLE_MAX; - thread_throttle_mode_init();
r->alloc_capable = true;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 0e50cd10a443..2844576bd10a 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -501,7 +501,6 @@ void cqm_handle_limbo(struct work_struct *work); bool has_busy_rmid(struct rdt_domain *d); void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); -void __init thread_throttle_mode_init(void); void mbm_config_rftype_init(const char *config); void rdt_staged_configs_clear(void); bool closid_allocated(unsigned int closid); diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 4c1d3216b136..76c102245060 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2068,10 +2068,15 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name) return NULL; }
-void __init thread_throttle_mode_init(void) +static void __init thread_throttle_mode_init(void) { + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); struct rftype *rft;
+ if (!r->alloc_capable || + r->membw.throttle_mode == THREAD_THROTTLE_UNDEFINED) + return; + rft = rdtgroup_get_rftype_by_name("thread_throttle_mode"); if (!rft) return; @@ -4195,6 +4200,8 @@ int __init resctrl_init(void)
rdtgroup_setup_default();
+ thread_throttle_mode_init(); + ret = resctrl_mon_resource_init(); if (ret) return ret;
From: James Morse james.morse@arm.com
get_config_index() is used by the architecture specific code to map a CLOSID+type pair to an index in the configuration arrays.
MPAM needs to do this too to preserve the ABI to user-space, there is no reason to do it differently.
Move the helper to a header file.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +++---------------- include/linux/resctrl.h | 15 +++++++++++++++ 2 files changed, 18 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 215386d031ef..6b31e6c97c20 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -278,19 +278,6 @@ static int parse_line(char *line, struct resctrl_schema *s, return -EINVAL; }
-static u32 get_config_index(u32 closid, enum resctrl_conf_type type) -{ - switch (type) { - default: - case CDP_NONE: - return closid; - case CDP_CODE: - return closid * 2 + 1; - case CDP_DATA: - return closid * 2; - } -} - static bool apply_config(struct rdt_hw_domain *hw_dom, struct resctrl_staged_config *cfg, u32 idx, cpumask_var_t cpu_mask) @@ -312,7 +299,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - u32 idx = get_config_index(closid, t); + u32 idx = resctrl_get_config_index(closid, t); struct msr_param msr_param;
if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask)) @@ -352,7 +339,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) if (!cfg->have_new_ctrl) continue;
- idx = get_config_index(closid, t); + idx = resctrl_get_config_index(closid, t); if (!apply_config(hw_dom, cfg, idx, cpu_mask)) continue;
@@ -477,7 +464,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type) { struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - u32 idx = get_config_index(closid, type); + u32 idx = resctrl_get_config_index(closid, type);
return hw_dom->ctrl_val[idx]; } diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 223937f89911..d7850cf6effe 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -256,6 +256,21 @@ bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt); void resctrl_arch_mon_event_config_write(void *info); void resctrl_arch_mon_event_config_read(void *info);
+/* For use by arch code that needs to remap resctrl's smaller CDP closid */ +static inline u32 resctrl_get_config_index(u32 closid, + enum resctrl_conf_type type) +{ + switch (type) { + default: + case CDP_NONE: + return closid; + case CDP_CODE: + return (closid * 2) + 1; + case CDP_DATA: + return (closid * 2); + } +} + /* * Update the ctrl_val and apply this config right now. * Must be called on one of the domain's CPUs.
From: James Morse james.morse@arm.com
get_domain_from_cpu() is a handy helper that both the arch code and resctrl need to use. Rename it resctrl_get_domain_from_cpu() so it gets moved out to /fs, and exported back to the arch code.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 24 +--------------------- arch/x86/kernel/cpu/resctrl/internal.h | 1 - arch/x86/kernel/cpu/resctrl/monitor.c | 2 +- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +- include/linux/resctrl.h | 28 ++++++++++++++++++++++++++ 5 files changed, 31 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index b2260a4f2c09..50809621f27a 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -357,28 +357,6 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]); }
-struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) -{ - struct rdt_domain *d; - - /* - * Walking r->domains, ensure it can't race with cpuhp. - * Because this is called via IPI by rdt_ctrl_update(), assertions - * about locks this thread holds will lead to false positives. Check - * someone is holding the CPUs lock. - */ - if (IS_ENABLED(CONFIG_LOCKDEP)) - lockdep_is_cpus_held(); - - list_for_each_entry(d, &r->domains, list) { - /* Find the domain that contains this CPU */ - if (cpumask_test_cpu(cpu, &d->cpu_mask)) - return d; - } - - return NULL; -} - u32 resctrl_arch_get_num_closid(struct rdt_resource *r) { return resctrl_to_arch_res(r)->num_closid; @@ -392,7 +370,7 @@ void rdt_ctrl_update(void *arg) int cpu = smp_processor_id(); struct rdt_domain *d;
- d = get_domain_from_cpu(cpu, r); + d = resctrl_get_domain_from_cpu(cpu, r); if (d) { hw_res->msr_update(d, m, r); return; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 2844576bd10a..939fa43af911 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -476,7 +476,6 @@ int rdt_pseudo_lock_init(void); void rdt_pseudo_lock_release(void); int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r); int closids_supported(void); void closid_free(int closid); int alloc_rmid(u32 closid); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index ec460296360e..965a2339f8f5 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -679,7 +679,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) idx = resctrl_arch_rmid_idx_encode(closid, rmid); pmbm_data = &dom_mbm->mbm_local[idx];
- dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba); + dom_mba = resctrl_get_domain_from_cpu(smp_processor_id(), r_mba); if (!dom_mba) { pr_warn_once("Failure to get domain for MBA update\n"); return; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 76c102245060..d68a50dc9b05 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4166,7 +4166,7 @@ void resctrl_offline_cpu(unsigned int cpu) if (!l3->mon_capable) goto out_unlock;
- d = get_domain_from_cpu(cpu, l3); + d = resctrl_get_domain_from_cpu(cpu, l3); if (d) { if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) { cancel_delayed_work(&d->mbm_over); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index d7850cf6effe..8970dcf0e242 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -2,6 +2,7 @@ #ifndef _RESCTRL_H #define _RESCTRL_H
+#include <linux/cpu.h> #include <linux/kernel.h> #include <linux/list.h> #include <linux/pid.h> @@ -271,6 +272,33 @@ static inline u32 resctrl_get_config_index(u32 closid, } }
+/* + * Caller must be in a RCU read-side critical section, or hold the + * cpuhp read lock to prevent the struct rdt_domain being freed. + */ +static inline struct rdt_domain * +resctrl_get_domain_from_cpu(int cpu, struct rdt_resource *r) +{ + struct rdt_domain *d; + + /* + * Walking r->domains, ensure it can't race with cpuhp. + * Because this is called via IPI by rdt_ctrl_update(), assertions + * about locks this thread holds will lead to false positives. Check + * someone is holding the CPUs lock. + */ + if (IS_ENABLED(CONFIG_LOCKDEP)) + lockdep_is_cpus_held(); + + list_for_each_entry_rcu(d, &r->domains, list) { + /* Find the domain that contains this CPU */ + if (cpumask_test_cpu(cpu, &d->cpu_mask)) + return d; + } + + return NULL; +} + /* * Update the ctrl_val and apply this config right now. * Must be called on one of the domain's CPUs.
From: James Morse james.morse@arm.com
resctrl stores the bitmaps and has a closid bitmap stored in u32.
MPAM supports bitmaps and partids bigger than this.
Add some preprocessor values that make it clear why MPAM clamps some of these values.
Signed-off-by: James Morse james.morse@arm.com --- include/linux/resctrl.h | 11 +++++++++++ 1 file changed, 11 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 8970dcf0e242..a88a43a0380d 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -26,6 +26,17 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX
+/* + * Resctrl uses a u32 as a closid bitmap. The maximum closid is 32. + */ +#define RESCTRL_MAX_CLOSID 32 + +/* + * Resctrl uses u32 to hold the user-space config. The maximum bitmap size is + * 32. + */ +#define RESCTRL_MAX_CBM 32 + /** * struct pseudo_lock_region - pseudo-lock region information * @s: Resctrl schema for the resource to which this
From: James Morse james.morse@arm.com
Because ARM's MPAM controls are probed using MMIO, resctrl can't be initialised until enough CPUs are online to have determined the system-wide supported num_closid.
To allow MPAM to initialise resctrl after __init text has been free'd, remove all the __init markings from resctrl.
The existing __exit markings cause these functions to be removed by the linker as it has never been possible to build resctrl as a module. MPAM has an error interrupt which causes the driver to reset and disable itself. Remove the __exit markings to allow the MPAM driver to tear down resctrl.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/internal.h | 2 +- arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++-- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 +++--- include/linux/resctrl.h | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 939fa43af911..95e597a445b3 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -481,7 +481,7 @@ void closid_free(int closid); int alloc_rmid(u32 closid); void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); -void __exit resctrl_mon_resource_exit(void); +void resctrl_mon_resource_exit(void); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 965a2339f8f5..507145fdd79a 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -970,7 +970,7 @@ static int dom_data_init(struct rdt_resource *r) return err; }
-static void __exit dom_data_exit(struct rdt_resource *r) +static void dom_data_exit(struct rdt_resource *r) { if (!r->mon_capable) return; @@ -1084,7 +1084,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) return 0; }
-void __exit resctrl_mon_resource_exit(void) +void resctrl_mon_resource_exit(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index d68a50dc9b05..253f5fbd8e1c 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2068,7 +2068,7 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name) return NULL; }
-static void __init thread_throttle_mode_init(void) +static void thread_throttle_mode_init(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); struct rftype *rft; @@ -4191,7 +4191,7 @@ void resctrl_offline_cpu(unsigned int cpu) * * Return: 0 on success or -errno */ -int __init resctrl_init(void) +int resctrl_init(void) { int ret = 0;
@@ -4245,7 +4245,7 @@ int __init resctrl_init(void) return ret; }
-void __exit resctrl_exit(void) +void resctrl_exit(void) { debugfs_remove_recursive(debugfs_resctrl); unregister_filesystem(&rdt_fs_type); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index a88a43a0380d..f9de90bc1f55 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -400,7 +400,7 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d); extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit;
-int __init resctrl_init(void); -void __exit resctrl_exit(void); +int resctrl_init(void); +void resctrl_exit(void);
#endif /* _RESCTRL_H */
From: James Morse james.morse@arm.com
Add Makefile and Kconfig for fs/resctrl. Add ARCH_HAS_CPU_RESCTRL for the common parts of the resctrl interface and make X86_CPU_RESCTRL depend on this.
Signed-off-by: James Morse james.morse@arm.com --- MAINTAINERS | 1 + arch/Kconfig | 8 ++++++++ arch/x86/Kconfig | 10 +++------- fs/Kconfig | 1 + fs/Makefile | 1 + fs/resctrl/Kconfig | 23 +++++++++++++++++++++++ fs/resctrl/Makefile | 1 + fs/resctrl/ctrlmondata.c | 0 fs/resctrl/internal.h | 0 fs/resctrl/monitor.c | 0 fs/resctrl/psuedo_lock.c | 0 fs/resctrl/rdtgroup.c | 0 include/linux/resctrl.h | 4 ++++ 13 files changed, 42 insertions(+), 7 deletions(-) create mode 100644 fs/resctrl/Kconfig create mode 100644 fs/resctrl/Makefile create mode 100644 fs/resctrl/ctrlmondata.c create mode 100644 fs/resctrl/internal.h create mode 100644 fs/resctrl/monitor.c create mode 100644 fs/resctrl/psuedo_lock.c create mode 100644 fs/resctrl/rdtgroup.c
diff --git a/MAINTAINERS b/MAINTAINERS index a707271ab5f0..58207397888a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18062,6 +18062,7 @@ S: Supported F: Documentation/arch/x86/resctrl* F: arch/x86/include/asm/resctrl.h F: arch/x86/kernel/cpu/resctrl/ +F: fs/resctrl/ F: include/linux/resctrl*.h F: tools/testing/selftests/resctrl/
diff --git a/arch/Kconfig b/arch/Kconfig index 12d51495caec..1ee89d0ada03 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1324,6 +1324,14 @@ config STRICT_MODULE_RWX config ARCH_HAS_PHYS_TO_DMA bool
+config ARCH_HAS_CPU_RESCTRL + bool + help + The 'resctrl' filesystem allows cpu controls of shared resources + such as caches and memory bandwidth to be configured. An architecture + selects this if it provides the arch-specific hooks for the filesystem + and needs the per-task closid/rmid properties. + config HAVE_ARCH_COMPILER_H bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index d9434d0d99b4..d8322ee6e746 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -478,8 +478,10 @@ config GOLDFISH config X86_CPU_RESCTRL bool "x86 CPU resource control support" depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD) + depends on MISC_FILESYSTEMS select KERNFS - select PROC_CPU_RESCTRL if PROC_FS + select ARCH_HAS_CPU_RESCTRL + select RESCTRL_FS select RESCTRL_FS_PSEUDO_LOCK help Enable x86 CPU resource control support. @@ -497,12 +499,6 @@ config X86_CPU_RESCTRL
Say N if unsure.
-config RESCTRL_FS_PSEUDO_LOCK - bool - help - Software mechanism to try and pin data in a cache portion using - micro-architecture tricks. - if X86_32 config X86_BIGSMP bool "Support for big SMP systems with more than 8 CPUs" diff --git a/fs/Kconfig b/fs/Kconfig index aa7e03cc1941..1e3ed753b9fe 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -325,6 +325,7 @@ source "fs/omfs/Kconfig" source "fs/hpfs/Kconfig" source "fs/qnx4/Kconfig" source "fs/qnx6/Kconfig" +source "fs/resctrl/Kconfig" source "fs/romfs/Kconfig" source "fs/pstore/Kconfig" source "fs/sysv/Kconfig" diff --git a/fs/Makefile b/fs/Makefile index bb6164200a39..df06712e1204 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-$(CONFIG_RESCTRL_FS) += resctrl/ diff --git a/fs/resctrl/Kconfig b/fs/resctrl/Kconfig new file mode 100644 index 000000000000..5bbf5a1798cd --- /dev/null +++ b/fs/resctrl/Kconfig @@ -0,0 +1,23 @@ +config RESCTRL_FS + bool "CPU Resource Control Filesystem (resctrl)" + depends on ARCH_HAS_CPU_RESCTRL + select KERNFS + select PROC_CPU_RESCTRL if PROC_FS + help + Resctrl is a filesystem interface + to control allocation and + monitoring of system resources + used by the CPUs. + +config RESCTRL_FS_PSEUDO_LOCK + bool + help + Software mechanism to try and pin data in a cache portion using + micro-architecture tricks. + +config RESCTRL_RMID_DEPENDS_ON_CLOSID + bool + help + Enable by the architecture when the RMID values depend on the CLOSID. + This causes the closid allocator to search for CLOSID with clean + RMID. diff --git a/fs/resctrl/Makefile b/fs/resctrl/Makefile new file mode 100644 index 000000000000..4c32e5e914b6 --- /dev/null +++ b/fs/resctrl/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_RESCTRL_FS) += rdtgroup.o ctrlmondata.o monitor.o psuedo_lock.o diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/psuedo_lock.c b/fs/resctrl/psuedo_lock.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index f9de90bc1f55..9066017d4f33 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -8,6 +8,10 @@ #include <linux/pid.h> #include <linux/resctrl_types.h>
+#ifdef CONFIG_ARCH_HAS_CPU_RESCTRL +#include <asm/resctrl.h> +#endif + /* CLOSID, RMID value used by the default control group */ #define RESCTRL_RESERVED_CLOSID 0 #define RESCTRL_RESERVED_RMID 0
From: James Morse james.morse@arm.com
Once the filsystem parts of resctrl move to fs/resctrl, it cannot rely on definitions in x86's internal.h.
Move definitions in internal.h that need to be shared between the filesystem and architecture code to header files that fs/resctrl can include.
Doing this separately means the filesystem code only moves between files of the same name, instead of having these changes mixed in too.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/include/asm/resctrl.h | 3 +++ arch/x86/kernel/cpu/resctrl/core.c | 5 ++++ arch/x86/kernel/cpu/resctrl/internal.h | 36 -------------------------- include/linux/resctrl.h | 3 +++ include/linux/resctrl_types.h | 30 +++++++++++++++++++++ 5 files changed, 41 insertions(+), 36 deletions(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 9940398e367e..37b8d7dfcfed 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -218,6 +218,9 @@ int resctrl_arch_measure_l2_residency(void *_plr); int resctrl_arch_measure_l3_residency(void *_plr); void resctrl_cpu_detect(struct cpuinfo_x86 *c);
+bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); + #else
static inline void resctrl_sched_in(struct task_struct *tsk) {} diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 50809621f27a..9ca3a39de912 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -308,6 +308,11 @@ static void rdt_get_cdp_l2_config(void) rdt_get_cdp_config(RDT_RESOURCE_L2); }
+bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l) +{ + return rdt_resources_all[l].cdp_enabled; +} + static void mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) { diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 95e597a445b3..8803640d017f 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -15,12 +15,6 @@
#define L2_QOS_CDP_ENABLE 0x01ULL
-#define CQM_LIMBOCHECK_INTERVAL 1000 - -#define MBM_CNTR_WIDTH_BASE 24 -#define MBM_OVERFLOW_INTERVAL 1000 -#define MAX_MBA_BW 100u -#define MBA_IS_LINEAR 0x4 #define MAX_MBA_BW_AMD 0x800 #define MBM_CNTR_WIDTH_OFFSET_AMD 20
@@ -211,29 +205,6 @@ struct rdtgroup { struct pseudo_lock_region *plr; };
-/* rdtgroup.flags */ -#define RDT_DELETED 1 - -/* rftype.flags */ -#define RFTYPE_FLAGS_CPUS_LIST 1 - -/* - * Define the file type flags for base and info directories. - */ -#define RFTYPE_INFO BIT(0) -#define RFTYPE_BASE BIT(1) -#define RFTYPE_CTRL BIT(4) -#define RFTYPE_MON BIT(5) -#define RFTYPE_TOP BIT(6) -#define RFTYPE_RES_CACHE BIT(8) -#define RFTYPE_RES_MB BIT(9) -#define RFTYPE_DEBUG BIT(10) -#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) -#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) -#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) -#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) -#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON) - /* List of all resource groups */ extern struct list_head rdt_all_groups;
@@ -386,13 +357,6 @@ static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) return &hw_res->r_resctrl; }
-static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l) -{ - return rdt_resources_all[l].cdp_enabled; -} - -int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); - /* * To return the common struct rdt_resource, which is contained in struct * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource. diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 9066017d4f33..9b7dc3c8d3d2 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -41,6 +41,9 @@ int proc_resctrl_show(struct seq_file *m, */ #define RESCTRL_MAX_CBM 32
+extern unsigned int resctrl_rmid_realloc_limit; +extern unsigned int resctrl_rmid_realloc_threshold; + /** * struct pseudo_lock_region - pseudo-lock region information * @s: Resctrl schema for the resource to which this diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h index 06e0f5567558..44f7a47f986b 100644 --- a/include/linux/resctrl_types.h +++ b/include/linux/resctrl_types.h @@ -7,6 +7,36 @@ #ifndef __LINUX_RESCTRL_TYPES_H #define __LINUX_RESCTRL_TYPES_H
+#define CQM_LIMBOCHECK_INTERVAL 1000 + +#define MBM_CNTR_WIDTH_BASE 24 +#define MBM_OVERFLOW_INTERVAL 1000 +#define MAX_MBA_BW 100u +#define MBA_IS_LINEAR 0x4 + +/* rdtgroup.flags */ +#define RDT_DELETED 1 + +/* rftype.flags */ +#define RFTYPE_FLAGS_CPUS_LIST 1 + +/* + * Define the file type flags for base and info directories. + */ +#define RFTYPE_INFO BIT(0) +#define RFTYPE_BASE BIT(1) +#define RFTYPE_CTRL BIT(4) +#define RFTYPE_MON BIT(5) +#define RFTYPE_TOP BIT(6) +#define RFTYPE_RES_CACHE BIT(8) +#define RFTYPE_RES_MB BIT(9) +#define RFTYPE_DEBUG BIT(10) +#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) +#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) +#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) +#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) +#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON) + /* Reads to Local DRAM Memory */ #define READS_TO_LOCAL_MEM BIT(0)
From: James Morse james.morse@arm.com
Move the parts of resctrl that aren't archictecture specific to live in /fs/.
This lets other architectures implement resctrl too.
Signed-off-by: James Morse james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/core.c | 15 - arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 506 --- arch/x86/kernel/cpu/resctrl/internal.h | 282 -- arch/x86/kernel/cpu/resctrl/monitor.c | 837 ---- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1105 ------ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4289 +-------------------- fs/resctrl/ctrlmondata.c | 528 +++ fs/resctrl/internal.h | 313 ++ fs/resctrl/monitor.c | 859 +++++ fs/resctrl/psuedo_lock.c | 1134 ++++++ fs/resctrl/rdtgroup.c | 4014 +++++++++++++++++++ 11 files changed, 6995 insertions(+), 6887 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 9ca3a39de912..46411c9afd5c 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -164,21 +164,6 @@ static inline void cache_alloc_hsw_probe(void) rdt_alloc_capable = true; }
-bool is_mba_sc(struct rdt_resource *r) -{ - if (!r) - r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - - /* - * The software controller support is only applicable to MBA resource. - * Make sure to check for resource type. - */ - if (r->rid != RDT_RESOURCE_MBA) - return false; - - return r->membw.mba_sc; -} - /* * rdt_get_mb_table() - get a mapping of bandwidth(b/w) percentage values * exposed to user interface and the h/w understandable delay values. diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 6b31e6c97c20..c5c3eaea27b6 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -23,261 +23,6 @@
#include "internal.h"
-struct rdt_parse_data { - struct rdtgroup *rdtgrp; - char *buf; -}; - -typedef int (ctrlval_parser_t)(struct rdt_parse_data *data, - struct resctrl_schema *s, - struct rdt_domain *d); - -/* - * Check whether MBA bandwidth percentage value is correct. The value is - * checked against the minimum and max bandwidth values specified by the - * hardware. The allocated bandwidth percentage is rounded to the next - * control step available on the hardware. - */ -static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r) -{ - unsigned long bw; - int ret; - - /* - * Only linear delay values is supported for current Intel SKUs. - */ - if (!r->membw.delay_linear && r->membw.arch_needs_linear) { - rdt_last_cmd_puts("No support for non-linear MB domains\n"); - return false; - } - - ret = kstrtoul(buf, 10, &bw); - if (ret) { - rdt_last_cmd_printf("Non-decimal digit in MB value %s\n", buf); - return false; - } - - if ((bw < r->membw.min_bw || bw > r->default_ctrl) && - !is_mba_sc(r)) { - rdt_last_cmd_printf("MB value %ld out of range [%d,%d]\n", bw, - r->membw.min_bw, r->default_ctrl); - return false; - } - - *data = roundup(bw, (unsigned long)r->membw.bw_gran); - return true; -} - -static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) -{ - struct resctrl_staged_config *cfg; - u32 closid = data->rdtgrp->closid; - struct rdt_resource *r = s->res; - unsigned long bw_val; - - cfg = &d->staged_config[s->conf_type]; - if (cfg->have_new_ctrl) { - rdt_last_cmd_printf("Duplicate domain %d\n", d->id); - return -EINVAL; - } - - if (!bw_validate(data->buf, &bw_val, r)) - return -EINVAL; - - if (is_mba_sc(r)) { - d->mbps_val[closid] = bw_val; - return 0; - } - - cfg->new_ctrl = bw_val; - cfg->have_new_ctrl = true; - - return 0; -} - -/* - * Check whether a cache bit mask is valid. - * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID: - * - CPUID.0x10.1:ECX[3]: L3 non-contiguous 1s value supported if 1 - * - CPUID.0x10.2:ECX[3]: L2 non-contiguous 1s value supported if 1 - * - * Haswell does not support a non-contiguous 1s value and additionally - * requires at least two bits set. - * AMD allows non-contiguous bitmasks. - */ -static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) -{ - unsigned long first_bit, zero_bit, val; - unsigned int cbm_len = r->cache.cbm_len; - int ret; - - ret = kstrtoul(buf, 16, &val); - if (ret) { - rdt_last_cmd_printf("Non-hex character in the mask %s\n", buf); - return false; - } - - if ((r->cache.min_cbm_bits > 0 && val == 0) || val > r->default_ctrl) { - rdt_last_cmd_puts("Mask out of range\n"); - return false; - } - - first_bit = find_first_bit(&val, cbm_len); - zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); - - /* Are non-contiguous bitmasks allowed? */ - if (!r->cache.arch_has_sparse_bitmasks && - (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)) { - rdt_last_cmd_printf("The mask %lx has non-consecutive 1-bits\n", val); - return false; - } - - if ((zero_bit - first_bit) < r->cache.min_cbm_bits) { - rdt_last_cmd_printf("Need at least %d bits in the mask\n", - r->cache.min_cbm_bits); - return false; - } - - *data = val; - return true; -} - -/* - * Read one cache bit mask (hex). Check that it is valid for the current - * resource type. - */ -static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) -{ - struct rdtgroup *rdtgrp = data->rdtgrp; - struct resctrl_staged_config *cfg; - struct rdt_resource *r = s->res; - u32 cbm_val; - - cfg = &d->staged_config[s->conf_type]; - if (cfg->have_new_ctrl) { - rdt_last_cmd_printf("Duplicate domain %d\n", d->id); - return -EINVAL; - } - - /* - * Cannot set up more than one pseudo-locked region in a cache - * hierarchy. - */ - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && - rdtgroup_pseudo_locked_in_hierarchy(d)) { - rdt_last_cmd_puts("Pseudo-locked region in hierarchy\n"); - return -EINVAL; - } - - if (!cbm_validate(data->buf, &cbm_val, r)) - return -EINVAL; - - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && - (rdtgrp->mode == RDT_MODE_EXCLUSIVE || - rdtgrp->mode == RDT_MODE_SHAREABLE) && - rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) { - rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n"); - return -EINVAL; - } - - /* - * The CBM may not overlap with the CBM of another closid if - * either is exclusive. - */ - if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, true)) { - rdt_last_cmd_puts("Overlaps with exclusive group\n"); - return -EINVAL; - } - - if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, false)) { - if (rdtgrp->mode == RDT_MODE_EXCLUSIVE || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - rdt_last_cmd_puts("Overlaps with other group\n"); - return -EINVAL; - } - } - - cfg->new_ctrl = cbm_val; - cfg->have_new_ctrl = true; - - return 0; -} - -static ctrlval_parser_t *get_parser(struct rdt_resource *res) -{ - if (res->fflags & RFTYPE_RES_CACHE) - return &parse_cbm; - else - return &parse_bw; -} - -/* - * For each domain in this resource we expect to find a series of: - * id=mask - * separated by ";". The "id" is in decimal, and must match one of - * the "id"s for this resource. - */ -static int parse_line(char *line, struct resctrl_schema *s, - struct rdtgroup *rdtgrp) -{ - ctrlval_parser_t *parse_ctrlval = get_parser(s->res); - enum resctrl_conf_type t = s->conf_type; - struct resctrl_staged_config *cfg; - struct rdt_resource *r = s->res; - struct rdt_parse_data data; - char *dom = NULL, *id; - struct rdt_domain *d; - unsigned long dom_id; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && - (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) { - rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n"); - return -EINVAL; - } - -next: - if (!line || line[0] == '\0') - return 0; - dom = strsep(&line, ";"); - id = strsep(&dom, "="); - if (!dom || kstrtoul(id, 10, &dom_id)) { - rdt_last_cmd_puts("Missing '=' or non-numeric domain\n"); - return -EINVAL; - } - dom = strim(dom); - list_for_each_entry(d, &r->domains, list) { - if (d->id == dom_id) { - data.buf = dom; - data.rdtgrp = rdtgrp; - if (parse_ctrlval(&data, s, d)) - return -EINVAL; - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - cfg = &d->staged_config[t]; - /* - * In pseudo-locking setup mode and just - * parsed a valid CBM that should be - * pseudo-locked. Only one locked region per - * resource group and domain so just do - * the required initialization for single - * region and return. - */ - rdtgrp->plr->s = s; - rdtgrp->plr->d = d; - rdtgrp->plr->cbm = cfg->new_ctrl; - d->plr = rdtgrp->plr; - return 0; - } - goto next; - } - } - return -EINVAL; -} - static bool apply_config(struct rdt_hw_domain *hw_dom, struct resctrl_staged_config *cfg, u32 idx, cpumask_var_t cpu_mask) @@ -366,100 +111,6 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) return 0; }
-static int rdtgroup_parse_resource(char *resname, char *tok, - struct rdtgroup *rdtgrp) -{ - struct resctrl_schema *s; - - list_for_each_entry(s, &resctrl_schema_all, list) { - if (!strcmp(resname, s->name) && rdtgrp->closid < s->num_closid) - return parse_line(tok, s, rdtgrp); - } - rdt_last_cmd_printf("Unknown or unsupported resource name '%s'\n", resname); - return -EINVAL; -} - -ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct resctrl_schema *s; - struct rdtgroup *rdtgrp; - struct rdt_resource *r; - char *tok, *resname; - int ret = 0; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - buf[nbytes - 1] = '\0'; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - rdt_last_cmd_clear(); - - /* - * No changes to pseudo-locked region allowed. It has to be removed - * and re-created instead. - */ - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - ret = -EINVAL; - rdt_last_cmd_puts("Resource group is pseudo-locked\n"); - goto out; - } - - rdt_staged_configs_clear(); - - while ((tok = strsep(&buf, "\n")) != NULL) { - resname = strim(strsep(&tok, ":")); - if (!tok) { - rdt_last_cmd_puts("Missing ':'\n"); - ret = -EINVAL; - goto out; - } - if (tok[0] == '\0') { - rdt_last_cmd_printf("Missing '%s' value\n", resname); - ret = -EINVAL; - goto out; - } - ret = rdtgroup_parse_resource(resname, tok, rdtgrp); - if (ret) - goto out; - } - - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - - /* - * Writes to mba_sc resources update the software controller, - * not the control MSR. - */ - if (is_mba_sc(r)) - continue; - - ret = resctrl_arch_update_domains(r, rdtgrp->closid); - if (ret) - goto out; - } - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - /* - * If pseudo-locking fails we keep the resource group in - * mode RDT_MODE_PSEUDO_LOCKSETUP with its class of service - * active and updated for just the domain the pseudo-locked - * region was requested for. - */ - ret = rdtgroup_pseudo_lock_create(rdtgrp); - } - -out: - rdt_staged_configs_clear(); - rdtgroup_kn_unlock(of->kn); - return ret ?: nbytes; -} - u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type) { @@ -468,160 +119,3 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
return hw_dom->ctrl_val[idx]; } - -static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid) -{ - struct rdt_resource *r = schema->res; - struct rdt_domain *dom; - bool sep = false; - u32 ctrl_val; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - seq_printf(s, "%*s:", max_name_width, schema->name); - list_for_each_entry(dom, &r->domains, list) { - if (sep) - seq_puts(s, ";"); - - if (is_mba_sc(r)) - ctrl_val = dom->mbps_val[closid]; - else - ctrl_val = resctrl_arch_get_config(r, dom, closid, - schema->conf_type); - - seq_printf(s, r->format_str, dom->id, max_data_width, - ctrl_val); - sep = true; - } - seq_puts(s, "\n"); -} - -int rdtgroup_schemata_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct resctrl_schema *schema; - struct rdtgroup *rdtgrp; - int ret = 0; - u32 closid; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - list_for_each_entry(schema, &resctrl_schema_all, list) { - seq_printf(s, "%s:uninitialized\n", schema->name); - } - } else if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - if (!rdtgrp->plr->d) { - rdt_last_cmd_clear(); - rdt_last_cmd_puts("Cache domain offline\n"); - ret = -ENODEV; - } else { - seq_printf(s, "%s:%d=%x\n", - rdtgrp->plr->s->res->name, - rdtgrp->plr->d->id, - rdtgrp->plr->cbm); - } - } else { - closid = rdtgrp->closid; - list_for_each_entry(schema, &resctrl_schema_all, list) { - if (closid < schema->num_closid) - show_doms(s, schema, closid); - } - } - } else { - ret = -ENOENT; - } - rdtgroup_kn_unlock(of->kn); - return ret; -} - -static int smp_mon_event_count(void *arg) -{ - mon_event_count(arg); - - return 0; -} - -void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, - struct rdt_domain *d, struct rdtgroup *rdtgrp, - int evtid, int first) -{ - int cpu; - - /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - /* - * Setup the parameters to pass to mon_event_count() to read the data. - */ - rr->rgrp = rdtgrp; - rr->evtid = evtid; - rr->r = r; - rr->d = d; - rr->val = 0; - rr->first = first; - rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid); - if (IS_ERR(rr->arch_mon_ctx)) { - rr->err = -EINVAL; - return; - } - - cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU); - - /* - * cpumask_any_housekeeping() prefers housekeeping CPUs, but - * are all the CPUs nohz_full? If yes, pick a CPU to IPI. - * MPAM's resctrl_arch_rmid_read() is unable to read the - * counters on some platforms if its called in irq context. - */ - if (tick_nohz_full_cpu(cpu)) - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); - else - smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); - - resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx); -} - -int rdtgroup_mondata_show(struct seq_file *m, void *arg) -{ - struct kernfs_open_file *of = m->private; - u32 resid, evtid, domid; - struct rdtgroup *rdtgrp; - struct rdt_resource *r; - union mon_data_bits md; - struct rdt_domain *d; - struct rmid_read rr; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - ret = -ENOENT; - goto out; - } - - md.priv = of->kn->priv; - resid = md.u.rid; - domid = md.u.domid; - evtid = md.u.evtid; - - r = resctrl_arch_get_resource(resid); - d = resctrl_arch_find_domain(r, domid); - if (IS_ERR_OR_NULL(d)) { - ret = -ENOENT; - goto out; - } - - mon_event_read(&rr, r, d, rdtgrp, evtid, false); - - if (rr.err == -EIO) - seq_puts(m, "Error\n"); - else if (rr.err == -EINVAL) - seq_puts(m, "Unavailable\n"); - else - seq_printf(m, "%llu\n", rr.val); - -out: - rdtgroup_kn_unlock(of->kn); - return ret; -} diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 8803640d017f..d7b8762716f2 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -27,231 +27,6 @@ */ #define MBM_CNTR_WIDTH_OFFSET_MAX (62 - MBM_CNTR_WIDTH_BASE)
-/** - * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that - * aren't marked nohz_full - * @mask: The mask to pick a CPU from. - * @exclude_cpu:The CPU to avoid picking. - * - * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping - * CPUs that don't use nohz_full, these are preferred. Pass - * RESCTRL_PICK_ANY_CPU to avoid excluding any CPUs. - * - * When a CPU is excluded, returns >= nr_cpu_ids if no CPUs are available. - */ -static inline unsigned int -cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu) -{ - unsigned int cpu, hk_cpu; - - if (exclude_cpu == RESCTRL_PICK_ANY_CPU) - cpu = cpumask_any(mask); - else - cpu = cpumask_any_but(mask, exclude_cpu); - - if (!IS_ENABLED(CONFIG_NO_HZ_FULL)) - return cpu; - - /* If the CPU picked isn't marked nohz_full nothing more needs doing. */ - if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu)) - return cpu; - - /* Try to find a CPU that isn't nohz_full to use in preference */ - hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask); - if (hk_cpu == exclude_cpu) - hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask); - - if (hk_cpu < nr_cpu_ids) - cpu = hk_cpu; - - return cpu; -} - -struct rdt_fs_context { - struct kernfs_fs_context kfc; - bool enable_cdpl2; - bool enable_cdpl3; - bool enable_mba_mbps; - bool enable_debug; -}; - -static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) -{ - struct kernfs_fs_context *kfc = fc->fs_private; - - return container_of(kfc, struct rdt_fs_context, kfc); -} - -/** - * struct mon_evt - Entry in the event list of a resource - * @evtid: event id - * @name: name of the event - * @configurable: true if the event is configurable - * @list: entry in &rdt_resource->evt_list - */ -struct mon_evt { - enum resctrl_event_id evtid; - char *name; - bool configurable; - struct list_head list; -}; - -/** - * union mon_data_bits - Monitoring details for each event file - * @priv: Used to store monitoring event data in @u - * as kernfs private data - * @rid: Resource id associated with the event file - * @evtid: Event id associated with the event file - * @domid: The domain to which the event file belongs - * @u: Name of the bit fields struct - */ -union mon_data_bits { - void *priv; - struct { - unsigned int rid : 10; - enum resctrl_event_id evtid : 8; - unsigned int domid : 14; - } u; -}; - -struct rmid_read { - struct rdtgroup *rgrp; - struct rdt_resource *r; - struct rdt_domain *d; - enum resctrl_event_id evtid; - bool first; - int err; - u64 val; - void *arch_mon_ctx; -}; - -extern struct list_head resctrl_schema_all; -extern bool resctrl_mounted; - -enum rdt_group_type { - RDTCTRL_GROUP = 0, - RDTMON_GROUP, - RDT_NUM_GROUP, -}; - -/** - * enum rdtgrp_mode - Mode of a RDT resource group - * @RDT_MODE_SHAREABLE: This resource group allows sharing of its allocations - * @RDT_MODE_EXCLUSIVE: No sharing of this resource group's allocations allowed - * @RDT_MODE_PSEUDO_LOCKSETUP: Resource group will be used for Pseudo-Locking - * @RDT_MODE_PSEUDO_LOCKED: No sharing of this resource group's allocations - * allowed AND the allocations are Cache Pseudo-Locked - * @RDT_NUM_MODES: Total number of modes - * - * The mode of a resource group enables control over the allowed overlap - * between allocations associated with different resource groups (classes - * of service). User is able to modify the mode of a resource group by - * writing to the "mode" resctrl file associated with the resource group. - * - * The "shareable", "exclusive", and "pseudo-locksetup" modes are set by - * writing the appropriate text to the "mode" file. A resource group enters - * "pseudo-locked" mode after the schemata is written while the resource - * group is in "pseudo-locksetup" mode. - */ -enum rdtgrp_mode { - RDT_MODE_SHAREABLE = 0, - RDT_MODE_EXCLUSIVE, - RDT_MODE_PSEUDO_LOCKSETUP, - RDT_MODE_PSEUDO_LOCKED, - - /* Must be last */ - RDT_NUM_MODES, -}; - -/** - * struct mongroup - store mon group's data in resctrl fs. - * @mon_data_kn: kernfs node for the mon_data directory - * @parent: parent rdtgrp - * @crdtgrp_list: child rdtgroup node list - * @rmid: rmid for this rdtgroup - */ -struct mongroup { - struct kernfs_node *mon_data_kn; - struct rdtgroup *parent; - struct list_head crdtgrp_list; - u32 rmid; -}; - -/** - * struct rdtgroup - store rdtgroup's data in resctrl file system. - * @kn: kernfs node - * @rdtgroup_list: linked list for all rdtgroups - * @closid: closid for this rdtgroup - * @cpu_mask: CPUs assigned to this rdtgroup - * @flags: status bits - * @waitcount: how many cpus expect to find this - * group when they acquire rdtgroup_mutex - * @type: indicates type of this rdtgroup - either - * monitor only or ctrl_mon group - * @mon: mongroup related data - * @mode: mode of resource group - * @plr: pseudo-locked region - */ -struct rdtgroup { - struct kernfs_node *kn; - struct list_head rdtgroup_list; - u32 closid; - struct cpumask cpu_mask; - int flags; - atomic_t waitcount; - enum rdt_group_type type; - struct mongroup mon; - enum rdtgrp_mode mode; - struct pseudo_lock_region *plr; -}; - -/* List of all resource groups */ -extern struct list_head rdt_all_groups; - -extern int max_name_width, max_data_width; - -/** - * struct rftype - describe each file in the resctrl file system - * @name: File name - * @mode: Access mode - * @kf_ops: File operations - * @flags: File specific RFTYPE_FLAGS_* flags - * @fflags: File specific RFTYPE_* flags - * @seq_show: Show content of the file - * @write: Write to the file - */ -struct rftype { - char *name; - umode_t mode; - const struct kernfs_ops *kf_ops; - unsigned long flags; - unsigned long fflags; - - int (*seq_show)(struct kernfs_open_file *of, - struct seq_file *sf, void *v); - /* - * write() is the generic write callback which maps directly to - * kernfs write operation and overrides all other operations. - * Maximum write size is determined by ->max_write_len. - */ - ssize_t (*write)(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off); -}; - -/** - * struct mbm_state - status for each MBM counter in each domain - * @prev_bw_bytes: Previous bytes value read for bandwidth calculation - * @prev_bw: The most recent bandwidth in MBps - * @delta_bw: Difference between the current and previous bandwidth - * @delta_comp: Indicates whether to compute the delta_bw - */ -struct mbm_state { - u64 prev_bw_bytes; - u32 prev_bw; - u32 delta_bw; - bool delta_comp; -}; - /** * struct arch_mbm_state - values used to compute resctrl_arch_rmid_read()s * return value. @@ -343,11 +118,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r return container_of(r, struct rdt_hw_resource, r_resctrl); }
-extern struct mutex rdtgroup_mutex; - extern struct rdt_hw_resource rdt_resources_all[]; -extern struct rdtgroup rdtgroup_default; -extern struct dentry *debugfs_resctrl;
static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) { @@ -411,62 +182,9 @@ union cpuid_0x10_x_edx { unsigned int full; };
-void rdt_last_cmd_clear(void); -void rdt_last_cmd_puts(const char *s); -__printf(1, 2) -void rdt_last_cmd_printf(const char *fmt, ...); - void rdt_ctrl_update(void *arg); -struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); -void rdtgroup_kn_unlock(struct kernfs_node *kn); -int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); -int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, - umode_t mask); -ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off); -int rdtgroup_schemata_show(struct kernfs_open_file *of, - struct seq_file *s, void *v); -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, - unsigned long cbm, int closid, bool exclusive); -unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d, - unsigned long cbm); -enum rdtgrp_mode rdtgroup_mode_by_closid(int closid); -int rdtgroup_tasks_assigned(struct rdtgroup *r); -int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp); -int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp); -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm); -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d); -int rdt_pseudo_lock_init(void); -void rdt_pseudo_lock_release(void); -int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); -void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); -int closids_supported(void); -void closid_free(int closid); -int alloc_rmid(u32 closid); -void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); -void resctrl_mon_resource_exit(void); -void mon_event_count(void *info); -int rdtgroup_mondata_show(struct seq_file *m, void *arg); -void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, - struct rdt_domain *d, struct rdtgroup *rdtgrp, - int evtid, int first); -int resctrl_mon_resource_init(void); -void mbm_setup_overflow_handler(struct rdt_domain *dom, - unsigned long delay_ms, - int exclude_cpu); -void mbm_handle_overflow(struct work_struct *work); void __init intel_rdt_mbm_apply_quirk(void); -bool is_mba_sc(struct rdt_resource *r); -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, - int exclude_cpu); -void cqm_handle_limbo(struct work_struct *work); -bool has_busy_rmid(struct rdt_domain *d); -void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); -void mbm_config_rftype_init(const char *config); -void rdt_staged_configs_clear(void); -bool closid_allocated(unsigned int closid); -int resctrl_find_cleanest_closid(void);
#endif /* _ASM_X86_RESCTRL_INTERNAL_H */ diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 507145fdd79a..54cf311b6de4 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -25,53 +25,6 @@
#include "internal.h"
-/** - * struct rmid_entry - dirty tracking for all RMID. - * @closid: The CLOSID for this entry. - * @rmid: The RMID for this entry. - * @busy: The number of domains with cached data using this RMID. - * @list: Member of the rmid_free_lru list when busy == 0. - * - * Depending on the architecture the correct monitor is accessed using - * both @closid and @rmid, or @rmid only. - * - * Take the rdtgroup_mutex when accessing. - */ -struct rmid_entry { - u32 closid; - u32 rmid; - int busy; - struct list_head list; -}; - -/* - * @rmid_free_lru - A least recently used list of free RMIDs - * These RMIDs are guaranteed to have an occupancy less than the - * threshold occupancy - */ -static LIST_HEAD(rmid_free_lru); - -/* - * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has. - * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. - * Indexed by CLOSID. Protected by rdtgroup_mutex. - */ -static u32 *closid_num_dirty_rmid; - -/* - * @rmid_limbo_count - count of currently unused but (potentially) - * dirty RMIDs. - * This counts RMIDs that no one is currently using but that - * may have a occupancy value > resctrl_rmid_realloc_threshold. User can - * change the threshold occupancy value. - */ -static unsigned int rmid_limbo_count; - -/* - * @rmid_entry - The entry in the limbo and free lists. - */ -static struct rmid_entry *rmid_ptrs; - /* * Global boolean for rdt_monitor which is true if any * resource monitoring is enabled. @@ -83,17 +36,6 @@ bool rdt_mon_capable; */ unsigned int rdt_mon_features;
-/* - * This is the threshold cache occupancy in bytes at which we will consider an - * RMID available for re-allocation. - */ -unsigned int resctrl_rmid_realloc_threshold; - -/* - * This is the maximum value for the reallocation threshold, in bytes. - */ -unsigned int resctrl_rmid_realloc_limit; - #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
/* @@ -157,33 +99,6 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val) return val; }
-/* - * x86 and arm64 differ in their handling of monitoring. - * x86's RMID are independent numbers, there is only one source of traffic - * with an RMID value of '1'. - * arm64's PMG extends the PARTID/CLOSID space, there are multiple sources of - * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID - * value is no longer unique. - * To account for this, resctrl uses an index. On x86 this is just the RMID, - * on arm64 it encodes the CLOSID and RMID. This gives a unique number. - * - * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code - * must accept an attempt to read every index. - */ -static inline struct rmid_entry *__rmid_entry(u32 idx) -{ - struct rmid_entry *entry; - u32 closid, rmid; - - entry = &rmid_ptrs[idx]; - resctrl_arch_rmid_idx_decode(idx, &closid, &rmid); - - WARN_ON_ONCE(entry->closid != closid); - WARN_ON_ONCE(entry->rmid != rmid); - - return entry; -} - static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) { u64 msr_val; @@ -302,751 +217,6 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, return 0; }
-static void limbo_release_entry(struct rmid_entry *entry) -{ - lockdep_assert_held(&rdtgroup_mutex); - - rmid_limbo_count--; - list_add_tail(&entry->list, &rmid_free_lru); - - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) - closid_num_dirty_rmid[entry->closid]--; -} - -/* - * Check the RMIDs that are marked as busy for this domain. If the - * reported LLC occupancy is below the threshold clear the busy bit and - * decrement the count. If the busy count gets to zero on an RMID, we - * free the RMID - */ -void __check_limbo(struct rdt_domain *d, bool force_free) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - u32 idx_limit = resctrl_arch_system_num_rmid_idx(); - struct rmid_entry *entry; - u32 idx, cur_idx = 1; - void *arch_mon_ctx; - bool rmid_dirty; - u64 val = 0; - - arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID); - if (IS_ERR(arch_mon_ctx)) { - pr_warn_ratelimited("Failed to allocate monitor context: %ld", - PTR_ERR(arch_mon_ctx)); - return; - } - - /* - * Skip RMID 0 and start from RMID 1 and check all the RMIDs that - * are marked as busy for occupancy < threshold. If the occupancy - * is less than the threshold decrement the busy counter of the - * RMID and move it to the free list when the counter reaches 0. - */ - for (;;) { - idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx); - if (idx >= idx_limit) - break; - - entry = __rmid_entry(idx); - if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, - QOS_L3_OCCUP_EVENT_ID, &val, - arch_mon_ctx)) { - rmid_dirty = true; - } else { - rmid_dirty = (val >= resctrl_rmid_realloc_threshold); - } - - if (force_free || !rmid_dirty) { - clear_bit(idx, d->rmid_busy_llc); - if (!--entry->busy) - limbo_release_entry(entry); - } - cur_idx = idx + 1; - } - - resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx); -} - -bool has_busy_rmid(struct rdt_domain *d) -{ - u32 idx_limit = resctrl_arch_system_num_rmid_idx(); - - return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; -} - -static struct rmid_entry *resctrl_find_free_rmid(u32 closid) -{ - struct rmid_entry *itr; - u32 itr_idx, cmp_idx; - - if (list_empty(&rmid_free_lru)) - return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC); - - list_for_each_entry(itr, &rmid_free_lru, list) { - /* - * Get the index of this free RMID, and the index it would need - * to be if it were used with this CLOSID. - * If the CLOSID is irrelevant on this architecture, the two - * index values are always the same on every entry and thus the - * very first entry will be returned. - */ - itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid); - cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid); - - if (itr_idx == cmp_idx) - return itr; - } - - return ERR_PTR(-ENOSPC); -} - -/** - * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated - * RMID are clean, or the CLOSID that has - * the most clean RMID. - * - * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID - * may not be able to allocate clean RMID. To avoid this the allocator will - * choose the CLOSID with the most clean RMID. - * - * When the CLOSID and RMID are independent numbers, the first free CLOSID will - * be returned. - */ -int resctrl_find_cleanest_closid(void) -{ - u32 cleanest_closid = ~0; - int i = 0; - - lockdep_assert_held(&rdtgroup_mutex); - - if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) - return -EIO; - - for (i = 0; i < closids_supported(); i++) { - int num_dirty; - - if (closid_allocated(i)) - continue; - - num_dirty = closid_num_dirty_rmid[i]; - if (num_dirty == 0) - return i; - - if (cleanest_closid == ~0) - cleanest_closid = i; - - if (num_dirty < closid_num_dirty_rmid[cleanest_closid]) - cleanest_closid = i; - } - - if (cleanest_closid == ~0) - return -ENOSPC; - - return cleanest_closid; -} - -/* - * For MPAM the RMID value is not unique, and has to be considered with - * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which - * allows all domains to be managed by a single free list. - * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler. - */ -int alloc_rmid(u32 closid) -{ - struct rmid_entry *entry; - - lockdep_assert_held(&rdtgroup_mutex); - - entry = resctrl_find_free_rmid(closid); - if (IS_ERR(entry)) - return PTR_ERR(entry); - - list_del(&entry->list); - return entry->rmid; -} - -static void add_rmid_to_limbo(struct rmid_entry *entry) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - struct rdt_domain *d; - u32 idx; - - lockdep_assert_held(&rdtgroup_mutex); - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid); - - entry->busy = 0; - list_for_each_entry(d, &r->domains, list) { - /* - * For the first limbo RMID in the domain, - * setup up the limbo worker. - */ - if (!has_busy_rmid(d)) - cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, - RESCTRL_PICK_ANY_CPU); - set_bit(idx, d->rmid_busy_llc); - entry->busy++; - } - - rmid_limbo_count++; - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) - closid_num_dirty_rmid[entry->closid]++; -} - -void free_rmid(u32 closid, u32 rmid) -{ - u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); - struct rmid_entry *entry; - - lockdep_assert_held(&rdtgroup_mutex); - - /* - * Do not allow the default rmid to be free'd. Comparing by index - * allows architectures that ignore the closid parameter to avoid an - * unnecessary check. - */ - if (idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, - RESCTRL_RESERVED_RMID)) - return; - - entry = __rmid_entry(idx); - - if (resctrl_arch_is_llc_occupancy_enabled()) - add_rmid_to_limbo(entry); - else - list_add_tail(&entry->list, &rmid_free_lru); -} - -static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid, - u32 rmid, enum resctrl_event_id evtid) -{ - u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); - - switch (evtid) { - case QOS_L3_MBM_TOTAL_EVENT_ID: - return &d->mbm_total[idx]; - case QOS_L3_MBM_LOCAL_EVENT_ID: - return &d->mbm_local[idx]; - default: - return NULL; - } -} - -static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) -{ - struct mbm_state *m; - u64 tval = 0; - - if (rr->first) { - resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid); - m = get_mbm_state(rr->d, closid, rmid, rr->evtid); - if (m) - memset(m, 0, sizeof(struct mbm_state)); - return 0; - } - - rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid, - &tval, rr->arch_mon_ctx); - if (rr->err) - return rr->err; - - rr->val += tval; - - return 0; -} - -/* - * mbm_bw_count() - Update bw count from values previously read by - * __mon_event_count(). - * @closid: The closid used to identify the cached mbm_state. - * @rmid: The rmid used to identify the cached mbm_state. - * @rr: The struct rmid_read populated by __mon_event_count(). - * - * Supporting function to calculate the memory bandwidth - * and delta bandwidth in MBps. The chunks value previously read by - * __mon_event_count() is compared with the chunks value from the previous - * invocation. This must be called once per second to maintain values in MBps. - */ -static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) -{ - u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); - struct mbm_state *m = &rr->d->mbm_local[idx]; - u64 cur_bw, bytes, cur_bytes; - - cur_bytes = rr->val; - bytes = cur_bytes - m->prev_bw_bytes; - m->prev_bw_bytes = cur_bytes; - - cur_bw = bytes / SZ_1M; - - if (m->delta_comp) - m->delta_bw = abs(cur_bw - m->prev_bw); - m->delta_comp = false; - m->prev_bw = cur_bw; -} - -/* - * This is scheduled by mon_event_read() to read the CQM/MBM counters - * on a domain. - */ -void mon_event_count(void *info) -{ - struct rdtgroup *rdtgrp, *entry; - struct rmid_read *rr = info; - struct list_head *head; - int ret; - - rdtgrp = rr->rgrp; - - ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr); - - /* - * For Ctrl groups read data from child monitor groups and - * add them together. Count events which are read successfully. - * Discard the rmid_read's reporting errors. - */ - head = &rdtgrp->mon.crdtgrp_list; - - if (rdtgrp->type == RDTCTRL_GROUP) { - list_for_each_entry(entry, head, mon.crdtgrp_list) { - if (__mon_event_count(entry->closid, entry->mon.rmid, - rr) == 0) - ret = 0; - } - } - - /* - * __mon_event_count() calls for newly created monitor groups may - * report -EINVAL/Unavailable if the monitor hasn't seen any traffic. - * Discard error if any of the monitor event reads succeeded. - */ - if (ret == 0) - rr->err = 0; -} - -/* - * Feedback loop for MBA software controller (mba_sc) - * - * mba_sc is a feedback loop where we periodically read MBM counters and - * adjust the bandwidth percentage values via the IA32_MBA_THRTL_MSRs so - * that: - * - * current bandwidth(cur_bw) < user specified bandwidth(user_bw) - * - * This uses the MBM counters to measure the bandwidth and MBA throttle - * MSRs to control the bandwidth for a particular rdtgrp. It builds on the - * fact that resctrl rdtgroups have both monitoring and control. - * - * The frequency of the checks is 1s and we just tag along the MBM overflow - * timer. Having 1s interval makes the calculation of bandwidth simpler. - * - * Although MBA's goal is to restrict the bandwidth to a maximum, there may - * be a need to increase the bandwidth to avoid unnecessarily restricting - * the L2 <-> L3 traffic. - * - * Since MBA controls the L2 external bandwidth where as MBM measures the - * L3 external bandwidth the following sequence could lead to such a - * situation. - * - * Consider an rdtgroup which had high L3 <-> memory traffic in initial - * phases -> mba_sc kicks in and reduced bandwidth percentage values -> but - * after some time rdtgroup has mostly L2 <-> L3 traffic. - * - * In this case we may restrict the rdtgroup's L2 <-> L3 traffic as its - * throttle MSRs already have low percentage values. To avoid - * unnecessarily restricting such rdtgroups, we also increase the bandwidth. - */ -static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) -{ - u32 closid, rmid, cur_msr_val, new_msr_val; - struct mbm_state *pmbm_data, *cmbm_data; - u32 cur_bw, delta_bw, user_bw, idx; - struct rdt_resource *r_mba; - struct rdt_domain *dom_mba; - struct list_head *head; - struct rdtgroup *entry; - - if (!resctrl_arch_is_mbm_local_enabled()) - return; - - r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - - closid = rgrp->closid; - rmid = rgrp->mon.rmid; - idx = resctrl_arch_rmid_idx_encode(closid, rmid); - pmbm_data = &dom_mbm->mbm_local[idx]; - - dom_mba = resctrl_get_domain_from_cpu(smp_processor_id(), r_mba); - if (!dom_mba) { - pr_warn_once("Failure to get domain for MBA update\n"); - return; - } - - cur_bw = pmbm_data->prev_bw; - user_bw = dom_mba->mbps_val[closid]; - delta_bw = pmbm_data->delta_bw; - - /* MBA resource doesn't support CDP */ - cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE); - - /* - * For Ctrl groups read data from child monitor groups. - */ - head = &rgrp->mon.crdtgrp_list; - list_for_each_entry(entry, head, mon.crdtgrp_list) { - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid]; - cur_bw += cmbm_data->prev_bw; - delta_bw += cmbm_data->delta_bw; - } - - /* - * Scale up/down the bandwidth linearly for the ctrl group. The - * bandwidth step is the bandwidth granularity specified by the - * hardware. - * - * The delta_bw is used when increasing the bandwidth so that we - * dont alternately increase and decrease the control values - * continuously. - * - * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if - * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep - * switching between 90 and 110 continuously if we only check - * cur_bw < user_bw. - */ - if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) { - new_msr_val = cur_msr_val - r_mba->membw.bw_gran; - } else if (cur_msr_val < MAX_MBA_BW && - (user_bw > (cur_bw + delta_bw))) { - new_msr_val = cur_msr_val + r_mba->membw.bw_gran; - } else { - return; - } - - resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val); - - /* - * Delta values are updated dynamically package wise for each - * rdtgrp every time the throttle MSR changes value. - * - * This is because (1)the increase in bandwidth is not perfectly - * linear and only "approximately" linear even when the hardware - * says it is linear.(2)Also since MBA is a core specific - * mechanism, the delta values vary based on number of cores used - * by the rdtgrp. - */ - pmbm_data->delta_comp = true; - list_for_each_entry(entry, head, mon.crdtgrp_list) { - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid]; - cmbm_data->delta_comp = true; - } -} - -static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, - u32 closid, u32 rmid) -{ - struct rmid_read rr; - - rr.first = false; - rr.r = r; - rr.d = d; - - /* - * This is protected from concurrent reads from user - * as both the user and we hold the global mutex. - */ - if (resctrl_arch_is_mbm_total_enabled()) { - rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; - rr.val = 0; - rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); - if (IS_ERR(rr.arch_mon_ctx)) { - pr_warn_ratelimited("Failed to allocate monitor context: %ld", - PTR_ERR(rr.arch_mon_ctx)); - return; - } - - __mon_event_count(closid, rmid, &rr); - - resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); - } - if (resctrl_arch_is_mbm_local_enabled()) { - rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; - rr.val = 0; - rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); - if (IS_ERR(rr.arch_mon_ctx)) { - pr_warn_ratelimited("Failed to allocate monitor context: %ld", - PTR_ERR(rr.arch_mon_ctx)); - return; - } - - __mon_event_count(closid, rmid, &rr); - - /* - * Call the MBA software controller only for the - * control groups and when user has enabled - * the software controller explicitly. - */ - if (is_mba_sc(NULL)) - mbm_bw_count(closid, rmid, &rr); - - resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); - } -} - -/* - * Handler to scan the limbo list and move the RMIDs - * to free list whose occupancy < threshold_occupancy. - */ -void cqm_handle_limbo(struct work_struct *work) -{ - unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); - struct rdt_domain *d; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - d = container_of(work, struct rdt_domain, cqm_limbo.work); - - __check_limbo(d, false); - - if (has_busy_rmid(d)) { - d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, - RESCTRL_PICK_ANY_CPU); - schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, - delay); - } - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); -} - -/** - * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this - * domain. - * @dom: The domain the limbo handler should run for. - * @delay_ms: How far in the future the handler should run. - * @exclude_cpu: Which CPU the handler should not run on, - * RESCTRL_PICK_ANY_CPU to pick any CPU. - */ -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, - int exclude_cpu) -{ - unsigned long delay = msecs_to_jiffies(delay_ms); - int cpu; - - cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); - dom->cqm_work_cpu = cpu; - - if (cpu < nr_cpu_ids) - schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); -} - -void mbm_handle_overflow(struct work_struct *work) -{ - unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); - struct rdtgroup *prgrp, *crgrp; - struct list_head *head; - struct rdt_resource *r; - struct rdt_domain *d; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - /* - * If the filesystem has been unmounted this work no longer needs to - * run. - */ - if (!resctrl_mounted || !resctrl_arch_mon_capable()) - goto out_unlock; - - r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - d = container_of(work, struct rdt_domain, mbm_over.work); - - list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { - mbm_update(r, d, prgrp->closid, prgrp->mon.rmid); - - head = &prgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) - mbm_update(r, d, crgrp->closid, crgrp->mon.rmid); - - if (is_mba_sc(NULL)) - update_mba_bw(prgrp, d); - } - - /* - * Re-check for housekeeping CPUs. This allows the overflow handler to - * move off a nohz_full CPU quickly. - */ - d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, - RESCTRL_PICK_ANY_CPU); - schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay); - -out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); -} - -/** - * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this - * domain. - * @dom: The domain the overflow handler should run for. - * @delay_ms: How far in the future the handler should run. - * @exclude_cpu: Which CPU the handler should not run on, - * RESCTRL_PICK_ANY_CPU to pick any CPU. - */ -void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, - int exclude_cpu) -{ - unsigned long delay = msecs_to_jiffies(delay_ms); - int cpu; - - /* - * When a domain comes online there is no guarantee the filesystem is - * mounted. If not, there is no need to catch counter overflow. - */ - if (!resctrl_mounted || !resctrl_arch_mon_capable()) - return; - cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); - dom->mbm_work_cpu = cpu; - - if (cpu < nr_cpu_ids) - schedule_delayed_work_on(cpu, &dom->mbm_over, delay); -} - -static int dom_data_init(struct rdt_resource *r) -{ - u32 idx_limit = resctrl_arch_system_num_rmid_idx(); - u32 num_closid = resctrl_arch_get_num_closid(r); - struct rmid_entry *entry = NULL; - int err = 0, i; - u32 idx; - - mutex_lock(&rdtgroup_mutex); - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - u32 *tmp; - - tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL); - if (!tmp) { - err = -ENOMEM; - goto out_unlock; - } - - closid_num_dirty_rmid = tmp; - } - - rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); - if (!rmid_ptrs) { - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - kfree(closid_num_dirty_rmid); - closid_num_dirty_rmid = NULL; - } - err = -ENOMEM; - goto out_unlock; - } - - for (i = 0; i < idx_limit; i++) { - entry = &rmid_ptrs[i]; - INIT_LIST_HEAD(&entry->list); - - resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid); - list_add_tail(&entry->list, &rmid_free_lru); - } - - /* - * RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and - * are always allocated. These are used for the rdtgroup_default - * control group, which will be setup later in rdtgroup_init(). - */ - idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, - RESCTRL_RESERVED_RMID); - entry = __rmid_entry(idx); - list_del(&entry->list); - -out_unlock: - mutex_unlock(&rdtgroup_mutex); - - return err; -} - -static void dom_data_exit(struct rdt_resource *r) -{ - if (!r->mon_capable) - return; - - mutex_lock(&rdtgroup_mutex); - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - kfree(closid_num_dirty_rmid); - closid_num_dirty_rmid = NULL; - } - - kfree(rmid_ptrs); - rmid_ptrs = NULL; - - mutex_unlock(&rdtgroup_mutex); -} - -static struct mon_evt llc_occupancy_event = { - .name = "llc_occupancy", - .evtid = QOS_L3_OCCUP_EVENT_ID, -}; - -static struct mon_evt mbm_total_event = { - .name = "mbm_total_bytes", - .evtid = QOS_L3_MBM_TOTAL_EVENT_ID, -}; - -static struct mon_evt mbm_local_event = { - .name = "mbm_local_bytes", - .evtid = QOS_L3_MBM_LOCAL_EVENT_ID, -}; - -/* - * Initialize the event list for the resource. - * - * Note that MBM events are also part of RDT_RESOURCE_L3 resource - * because as per the SDM the total and local memory bandwidth - * are enumerated as part of L3 monitoring. - */ -static void l3_mon_evt_init(struct rdt_resource *r) -{ - INIT_LIST_HEAD(&r->evt_list); - - if (resctrl_arch_is_llc_occupancy_enabled()) - list_add_tail(&llc_occupancy_event.list, &r->evt_list); - if (resctrl_arch_is_mbm_total_enabled()) - list_add_tail(&mbm_total_event.list, &r->evt_list); - if (resctrl_arch_is_mbm_local_enabled()) - list_add_tail(&mbm_local_event.list, &r->evt_list); -} - -int resctrl_mon_resource_init(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - int ret; - - ret = dom_data_init(r); - if (ret) - return ret; - - if (!r->mon_capable) - return 0; - - l3_mon_evt_init(r); - - if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { - mbm_total_event.configurable = true; - mbm_config_rftype_init("mbm_total_bytes_config"); - } - if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { - mbm_local_event.configurable = true; - mbm_config_rftype_init("mbm_local_bytes_config"); - } - - return 0; -} - int __init rdt_get_mon_l3_config(struct rdt_resource *r) { unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset; @@ -1084,13 +254,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) return 0; }
-void resctrl_mon_resource_exit(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - - dom_data_exit(r); -} - void __init intel_rdt_mbm_apply_quirk(void) { int cf_index; diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 85536566d7f5..ba1596afee10 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -39,28 +39,6 @@ */ static u64 prefetch_disable_bits;
-/* - * Major number assigned to and shared by all devices exposing - * pseudo-locked regions. - */ -static unsigned int pseudo_lock_major; -static unsigned long pseudo_lock_minor_avail = GENMASK(MINORBITS, 0); - -static char *pseudo_lock_devnode(const struct device *dev, umode_t *mode) -{ - const struct rdtgroup *rdtgrp; - - rdtgrp = dev_get_drvdata(dev); - if (mode) - *mode = 0600; - return kasprintf(GFP_KERNEL, "pseudo_lock/%s", rdtgrp->kn->name); -} - -static const struct class pseudo_lock_class = { - .name = "pseudo_lock", - .devnode = pseudo_lock_devnode, -}; - /** * resctrl_arch_get_prefetch_disable_bits - prefetch disable bits of supported * platforms @@ -121,299 +99,6 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) return prefetch_disable_bits; }
-/** - * pseudo_lock_minor_get - Obtain available minor number - * @minor: Pointer to where new minor number will be stored - * - * A bitmask is used to track available minor numbers. Here the next free - * minor number is marked as unavailable and returned. - * - * Return: 0 on success, <0 on failure. - */ -static int pseudo_lock_minor_get(unsigned int *minor) -{ - unsigned long first_bit; - - first_bit = find_first_bit(&pseudo_lock_minor_avail, MINORBITS); - - if (first_bit == MINORBITS) - return -ENOSPC; - - __clear_bit(first_bit, &pseudo_lock_minor_avail); - *minor = first_bit; - - return 0; -} - -/** - * pseudo_lock_minor_release - Return minor number to available - * @minor: The minor number made available - */ -static void pseudo_lock_minor_release(unsigned int minor) -{ - __set_bit(minor, &pseudo_lock_minor_avail); -} - -/** - * region_find_by_minor - Locate a pseudo-lock region by inode minor number - * @minor: The minor number of the device representing pseudo-locked region - * - * When the character device is accessed we need to determine which - * pseudo-locked region it belongs to. This is done by matching the minor - * number of the device to the pseudo-locked region it belongs. - * - * Minor numbers are assigned at the time a pseudo-locked region is associated - * with a cache instance. - * - * Return: On success return pointer to resource group owning the pseudo-locked - * region, NULL on failure. - */ -static struct rdtgroup *region_find_by_minor(unsigned int minor) -{ - struct rdtgroup *rdtgrp, *rdtgrp_match = NULL; - - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { - if (rdtgrp->plr && rdtgrp->plr->minor == minor) { - rdtgrp_match = rdtgrp; - break; - } - } - return rdtgrp_match; -} - -/** - * struct pseudo_lock_pm_req - A power management QoS request list entry - * @list: Entry within the @pm_reqs list for a pseudo-locked region - * @req: PM QoS request - */ -struct pseudo_lock_pm_req { - struct list_head list; - struct dev_pm_qos_request req; -}; - -static void pseudo_lock_cstates_relax(struct pseudo_lock_region *plr) -{ - struct pseudo_lock_pm_req *pm_req, *next; - - list_for_each_entry_safe(pm_req, next, &plr->pm_reqs, list) { - dev_pm_qos_remove_request(&pm_req->req); - list_del(&pm_req->list); - kfree(pm_req); - } -} - -/** - * pseudo_lock_cstates_constrain - Restrict cores from entering C6 - * @plr: Pseudo-locked region - * - * To prevent the cache from being affected by power management entering - * C6 has to be avoided. This is accomplished by requesting a latency - * requirement lower than lowest C6 exit latency of all supported - * platforms as found in the cpuidle state tables in the intel_idle driver. - * At this time it is possible to do so with a single latency requirement - * for all supported platforms. - * - * Since Goldmont is supported, which is affected by X86_BUG_MONITOR, - * the ACPI latencies need to be considered while keeping in mind that C2 - * may be set to map to deeper sleep states. In this case the latency - * requirement needs to prevent entering C2 also. - * - * Return: 0 on success, <0 on failure - */ -static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr) -{ - struct pseudo_lock_pm_req *pm_req; - int cpu; - int ret; - - for_each_cpu(cpu, &plr->d->cpu_mask) { - pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL); - if (!pm_req) { - rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n"); - ret = -ENOMEM; - goto out_err; - } - ret = dev_pm_qos_add_request(get_cpu_device(cpu), - &pm_req->req, - DEV_PM_QOS_RESUME_LATENCY, - 30); - if (ret < 0) { - rdt_last_cmd_printf("Failed to add latency req CPU%d\n", - cpu); - kfree(pm_req); - ret = -1; - goto out_err; - } - list_add(&pm_req->list, &plr->pm_reqs); - } - - return 0; - -out_err: - pseudo_lock_cstates_relax(plr); - return ret; -} - -/** - * pseudo_lock_region_clear - Reset pseudo-lock region data - * @plr: pseudo-lock region - * - * All content of the pseudo-locked region is reset - any memory allocated - * freed. - * - * Return: void - */ -static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) -{ - plr->size = 0; - plr->line_size = 0; - kfree(plr->kmem); - plr->kmem = NULL; - plr->s = NULL; - if (plr->d) - plr->d->plr = NULL; - plr->d = NULL; - plr->cbm = 0; - plr->debugfs_dir = NULL; -} - -/** - * pseudo_lock_region_init - Initialize pseudo-lock region information - * @plr: pseudo-lock region - * - * Called after user provided a schemata to be pseudo-locked. From the - * schemata the &struct pseudo_lock_region is on entry already initialized - * with the resource, domain, and capacity bitmask. Here the information - * required for pseudo-locking is deduced from this data and &struct - * pseudo_lock_region initialized further. This information includes: - * - size in bytes of the region to be pseudo-locked - * - cache line size to know the stride with which data needs to be accessed - * to be pseudo-locked - * - a cpu associated with the cache instance on which the pseudo-locking - * flow can be executed - * - * Return: 0 on success, <0 on failure. Descriptive error will be written - * to last_cmd_status buffer. - */ -static int pseudo_lock_region_init(struct pseudo_lock_region *plr) -{ - struct cpu_cacheinfo *ci; - int ret; - int i; - - /* Pick the first cpu we find that is associated with the cache. */ - plr->cpu = cpumask_first(&plr->d->cpu_mask); - - if (!cpu_online(plr->cpu)) { - rdt_last_cmd_printf("CPU %u associated with cache not online\n", - plr->cpu); - ret = -ENODEV; - goto out_region; - } - - ci = get_cpu_cacheinfo(plr->cpu); - - plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); - - for (i = 0; i < ci->num_leaves; i++) { - if (ci->info_list[i].level == plr->s->res->cache_level) { - plr->line_size = ci->info_list[i].coherency_line_size; - return 0; - } - } - - ret = -1; - rdt_last_cmd_puts("Unable to determine cache line size\n"); -out_region: - pseudo_lock_region_clear(plr); - return ret; -} - -/** - * pseudo_lock_init - Initialize a pseudo-lock region - * @rdtgrp: resource group to which new pseudo-locked region will belong - * - * A pseudo-locked region is associated with a resource group. When this - * association is created the pseudo-locked region is initialized. The - * details of the pseudo-locked region are not known at this time so only - * allocation is done and association established. - * - * Return: 0 on success, <0 on failure - */ -static int pseudo_lock_init(struct rdtgroup *rdtgrp) -{ - struct pseudo_lock_region *plr; - - plr = kzalloc(sizeof(*plr), GFP_KERNEL); - if (!plr) - return -ENOMEM; - - init_waitqueue_head(&plr->lock_thread_wq); - INIT_LIST_HEAD(&plr->pm_reqs); - rdtgrp->plr = plr; - return 0; -} - -/** - * pseudo_lock_region_alloc - Allocate kernel memory that will be pseudo-locked - * @plr: pseudo-lock region - * - * Initialize the details required to set up the pseudo-locked region and - * allocate the contiguous memory that will be pseudo-locked to the cache. - * - * Return: 0 on success, <0 on failure. Descriptive error will be written - * to last_cmd_status buffer. - */ -static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr) -{ - int ret; - - ret = pseudo_lock_region_init(plr); - if (ret < 0) - return ret; - - /* - * We do not yet support contiguous regions larger than - * KMALLOC_MAX_SIZE. - */ - if (plr->size > KMALLOC_MAX_SIZE) { - rdt_last_cmd_puts("Requested region exceeds maximum size\n"); - ret = -E2BIG; - goto out_region; - } - - plr->kmem = kzalloc(plr->size, GFP_KERNEL); - if (!plr->kmem) { - rdt_last_cmd_puts("Unable to allocate memory\n"); - ret = -ENOMEM; - goto out_region; - } - - ret = 0; - goto out; -out_region: - pseudo_lock_region_clear(plr); -out: - return ret; -} - -/** - * pseudo_lock_free - Free a pseudo-locked region - * @rdtgrp: resource group to which pseudo-locked region belonged - * - * The pseudo-locked region's resources have already been released, or not - * yet created at this point. Now it can be freed and disassociated from the - * resource group. - * - * Return: void - */ -static void pseudo_lock_free(struct rdtgroup *rdtgrp) -{ - pseudo_lock_region_clear(rdtgrp->plr); - kfree(rdtgrp->plr); - rdtgrp->plr = NULL; -} - /** * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache * @_plr: the pseudo-lock region descriptor @@ -543,351 +228,6 @@ int resctrl_arch_pseudo_lock_fn(void *_plr) return 0; }
-/** - * rdtgroup_monitor_in_progress - Test if monitoring in progress - * @rdtgrp: resource group being queried - * - * Return: 1 if monitor groups have been created for this resource - * group, 0 otherwise. - */ -static int rdtgroup_monitor_in_progress(struct rdtgroup *rdtgrp) -{ - return !list_empty(&rdtgrp->mon.crdtgrp_list); -} - -/** - * rdtgroup_locksetup_user_restrict - Restrict user access to group - * @rdtgrp: resource group needing access restricted - * - * A resource group used for cache pseudo-locking cannot have cpus or tasks - * assigned to it. This is communicated to the user by restricting access - * to all the files that can be used to make such changes. - * - * Permissions restored with rdtgroup_locksetup_user_restore() - * - * Return: 0 on success, <0 on failure. If a failure occurs during the - * restriction of access an attempt will be made to restore permissions but - * the state of the mode of these files will be uncertain when a failure - * occurs. - */ -static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp) -{ - int ret; - - ret = rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); - if (ret) - return ret; - - ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); - if (ret) - goto err_tasks; - - ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); - if (ret) - goto err_cpus; - - if (resctrl_arch_mon_capable()) { - ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups"); - if (ret) - goto err_cpus_list; - } - - ret = 0; - goto out; - -err_cpus_list: - rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); -err_cpus: - rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); -err_tasks: - rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); -out: - return ret; -} - -/** - * rdtgroup_locksetup_user_restore - Restore user access to group - * @rdtgrp: resource group needing access restored - * - * Restore all file access previously removed using - * rdtgroup_locksetup_user_restrict() - * - * Return: 0 on success, <0 on failure. If a failure occurs during the - * restoration of access an attempt will be made to restrict permissions - * again but the state of the mode of these files will be uncertain when - * a failure occurs. - */ -static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp) -{ - int ret; - - ret = rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); - if (ret) - return ret; - - ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); - if (ret) - goto err_tasks; - - ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); - if (ret) - goto err_cpus; - - if (resctrl_arch_mon_capable()) { - ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777); - if (ret) - goto err_cpus_list; - } - - ret = 0; - goto out; - -err_cpus_list: - rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); -err_cpus: - rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); -err_tasks: - rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); -out: - return ret; -} - -/** - * rdtgroup_locksetup_enter - Resource group enters locksetup mode - * @rdtgrp: resource group requested to enter locksetup mode - * - * A resource group enters locksetup mode to reflect that it would be used - * to represent a pseudo-locked region and is in the process of being set - * up to do so. A resource group used for a pseudo-locked region would - * lose the closid associated with it so we cannot allow it to have any - * tasks or cpus assigned nor permit tasks or cpus to be assigned in the - * future. Monitoring of a pseudo-locked region is not allowed either. - * - * The above and more restrictions on a pseudo-locked region are checked - * for and enforced before the resource group enters the locksetup mode. - * - * Returns: 0 if the resource group successfully entered locksetup mode, <0 - * on failure. On failure the last_cmd_status buffer is updated with text to - * communicate details of failure to the user. - */ -int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) -{ - int ret; - - /* - * The default resource group can neither be removed nor lose the - * default closid associated with it. - */ - if (rdtgrp == &rdtgroup_default) { - rdt_last_cmd_puts("Cannot pseudo-lock default group\n"); - return -EINVAL; - } - - /* - * Cache Pseudo-locking not supported when CDP is enabled. - * - * Some things to consider if you would like to enable this - * support (using L3 CDP as example): - * - When CDP is enabled two separate resources are exposed, - * L3DATA and L3CODE, but they are actually on the same cache. - * The implication for pseudo-locking is that if a - * pseudo-locked region is created on a domain of one - * resource (eg. L3CODE), then a pseudo-locked region cannot - * be created on that same domain of the other resource - * (eg. L3DATA). This is because the creation of a - * pseudo-locked region involves a call to wbinvd that will - * affect all cache allocations on particular domain. - * - Considering the previous, it may be possible to only - * expose one of the CDP resources to pseudo-locking and - * hide the other. For example, we could consider to only - * expose L3DATA and since the L3 cache is unified it is - * still possible to place instructions there are execute it. - * - If only one region is exposed to pseudo-locking we should - * still keep in mind that availability of a portion of cache - * for pseudo-locking should take into account both resources. - * Similarly, if a pseudo-locked region is created in one - * resource, the portion of cache used by it should be made - * unavailable to all future allocations from both resources. - */ - if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3) || - resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) { - rdt_last_cmd_puts("CDP enabled\n"); - return -EINVAL; - } - - /* - * Not knowing the bits to disable prefetching implies that this - * platform does not support Cache Pseudo-Locking. - */ - if (resctrl_arch_get_prefetch_disable_bits() == 0) { - rdt_last_cmd_puts("Pseudo-locking not supported\n"); - return -EINVAL; - } - - if (rdtgroup_monitor_in_progress(rdtgrp)) { - rdt_last_cmd_puts("Monitoring in progress\n"); - return -EINVAL; - } - - if (rdtgroup_tasks_assigned(rdtgrp)) { - rdt_last_cmd_puts("Tasks assigned to resource group\n"); - return -EINVAL; - } - - if (!cpumask_empty(&rdtgrp->cpu_mask)) { - rdt_last_cmd_puts("CPUs assigned to resource group\n"); - return -EINVAL; - } - - if (rdtgroup_locksetup_user_restrict(rdtgrp)) { - rdt_last_cmd_puts("Unable to modify resctrl permissions\n"); - return -EIO; - } - - ret = pseudo_lock_init(rdtgrp); - if (ret) { - rdt_last_cmd_puts("Unable to init pseudo-lock region\n"); - goto out_release; - } - - /* - * If this system is capable of monitoring a rmid would have been - * allocated when the control group was created. This is not needed - * anymore when this group would be used for pseudo-locking. This - * is safe to call on platforms not capable of monitoring. - */ - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - - ret = 0; - goto out; - -out_release: - rdtgroup_locksetup_user_restore(rdtgrp); -out: - return ret; -} - -/** - * rdtgroup_locksetup_exit - resource group exist locksetup mode - * @rdtgrp: resource group - * - * When a resource group exits locksetup mode the earlier restrictions are - * lifted. - * - * Return: 0 on success, <0 on failure - */ -int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) -{ - int ret; - - if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - return -EOPNOTSUPP; - - if (resctrl_arch_mon_capable()) { - ret = alloc_rmid(rdtgrp->closid); - if (ret < 0) { - rdt_last_cmd_puts("Out of RMIDs\n"); - return ret; - } - rdtgrp->mon.rmid = ret; - } - - ret = rdtgroup_locksetup_user_restore(rdtgrp); - if (ret) { - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - return ret; - } - - pseudo_lock_free(rdtgrp); - return 0; -} - -/** - * rdtgroup_cbm_overlaps_pseudo_locked - Test if CBM or portion is pseudo-locked - * @d: RDT domain - * @cbm: CBM to test - * - * @d represents a cache instance and @cbm a capacity bitmask that is - * considered for it. Determine if @cbm overlaps with any existing - * pseudo-locked region on @d. - * - * @cbm is unsigned long, even if only 32 bits are used, to make the - * bitmap functions work correctly. - * - * Return: true if @cbm overlaps with pseudo-locked region on @d, false - * otherwise. - */ -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm) -{ - unsigned int cbm_len; - unsigned long cbm_b; - - if (d->plr) { - cbm_len = d->plr->s->res->cache.cbm_len; - cbm_b = d->plr->cbm; - if (bitmap_intersects(&cbm, &cbm_b, cbm_len)) - return true; - } - return false; -} - -/** - * rdtgroup_pseudo_locked_in_hierarchy - Pseudo-locked region in cache hierarchy - * @d: RDT domain under test - * - * The setup of a pseudo-locked region affects all cache instances within - * the hierarchy of the region. It is thus essential to know if any - * pseudo-locked regions exist within a cache hierarchy to prevent any - * attempts to create new pseudo-locked regions in the same hierarchy. - * - * Return: true if a pseudo-locked region exists in the hierarchy of @d or - * if it is not possible to test due to memory allocation issue, - * false otherwise. - */ -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) -{ - cpumask_var_t cpu_with_psl; - enum resctrl_res_level i; - struct rdt_resource *r; - struct rdt_domain *d_i; - bool ret = false; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - return -EOPNOTSUPP; - - if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) - return true; - - /* - * First determine which cpus have pseudo-locked regions - * associated with them. - */ - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->alloc_capable) - continue; - - list_for_each_entry(d_i, &r->domains, list) { - if (d_i->plr) - cpumask_or(cpu_with_psl, cpu_with_psl, - &d_i->cpu_mask); - } - } - - /* - * Next test if new pseudo-locked region would intersect with - * existing region. - */ - if (cpumask_intersects(&d->cpu_mask, cpu_with_psl)) - ret = true; - - free_cpumask_var(cpu_with_psl); - return ret; -} - /** * resctrl_arch_measure_cycles_lat_fn - Measure cycle latency to read * pseudo-locked memory @@ -1180,448 +520,3 @@ int resctrl_arch_measure_l3_residency(void *_plr) wake_up_interruptible(&plr->lock_thread_wq); return 0; } - -/** - * pseudo_lock_measure_cycles - Trigger latency measure to pseudo-locked region - * @rdtgrp: Resource group to which the pseudo-locked region belongs. - * @sel: Selector of which measurement to perform on a pseudo-locked region. - * - * The measurement of latency to access a pseudo-locked region should be - * done from a cpu that is associated with that pseudo-locked region. - * Determine which cpu is associated with this region and start a thread on - * that cpu to perform the measurement, wait for that thread to complete. - * - * Return: 0 on success, <0 on failure - */ -static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) -{ - struct pseudo_lock_region *plr = rdtgrp->plr; - struct task_struct *thread; - unsigned int cpu; - int ret = -1; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - if (rdtgrp->flags & RDT_DELETED) { - ret = -ENODEV; - goto out; - } - - if (!plr->d) { - ret = -ENODEV; - goto out; - } - - plr->thread_done = 0; - cpu = cpumask_first(&plr->d->cpu_mask); - if (!cpu_online(cpu)) { - ret = -ENODEV; - goto out; - } - - plr->cpu = cpu; - - if (sel == 1) - thread = kthread_create_on_node(resctrl_arch_measure_cycles_lat_fn, - plr, cpu_to_node(cpu), - "pseudo_lock_measure/%u", - cpu); - else if (sel == 2) - thread = kthread_create_on_node(resctrl_arch_measure_l2_residency, - plr, cpu_to_node(cpu), - "pseudo_lock_measure/%u", - cpu); - else if (sel == 3) - thread = kthread_create_on_node(resctrl_arch_measure_l3_residency, - plr, cpu_to_node(cpu), - "pseudo_lock_measure/%u", - cpu); - else - goto out; - - if (IS_ERR(thread)) { - ret = PTR_ERR(thread); - goto out; - } - kthread_bind(thread, cpu); - wake_up_process(thread); - - ret = wait_event_interruptible(plr->lock_thread_wq, - plr->thread_done == 1); - if (ret < 0) - goto out; - - ret = 0; - -out: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - return ret; -} - -static ssize_t pseudo_lock_measure_trigger(struct file *file, - const char __user *user_buf, - size_t count, loff_t *ppos) -{ - struct rdtgroup *rdtgrp = file->private_data; - size_t buf_size; - char buf[32]; - int ret; - int sel; - - buf_size = min(count, (sizeof(buf) - 1)); - if (copy_from_user(buf, user_buf, buf_size)) - return -EFAULT; - - buf[buf_size] = '\0'; - ret = kstrtoint(buf, 10, &sel); - if (ret == 0) { - if (sel != 1 && sel != 2 && sel != 3) - return -EINVAL; - ret = debugfs_file_get(file->f_path.dentry); - if (ret) - return ret; - ret = pseudo_lock_measure_cycles(rdtgrp, sel); - if (ret == 0) - ret = count; - debugfs_file_put(file->f_path.dentry); - } - - return ret; -} - -static const struct file_operations pseudo_measure_fops = { - .write = pseudo_lock_measure_trigger, - .open = simple_open, - .llseek = default_llseek, -}; - -/** - * rdtgroup_pseudo_lock_create - Create a pseudo-locked region - * @rdtgrp: resource group to which pseudo-lock region belongs - * - * Called when a resource group in the pseudo-locksetup mode receives a - * valid schemata that should be pseudo-locked. Since the resource group is - * in pseudo-locksetup mode the &struct pseudo_lock_region has already been - * allocated and initialized with the essential information. If a failure - * occurs the resource group remains in the pseudo-locksetup mode with the - * &struct pseudo_lock_region associated with it, but cleared from all - * information and ready for the user to re-attempt pseudo-locking by - * writing the schemata again. - * - * Return: 0 if the pseudo-locked region was successfully pseudo-locked, <0 - * on failure. Descriptive error will be written to last_cmd_status buffer. - */ -int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) -{ - struct pseudo_lock_region *plr = rdtgrp->plr; - struct task_struct *thread; - unsigned int new_minor; - struct device *dev; - int ret; - - if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - return -EOPNOTSUPP; - - ret = pseudo_lock_region_alloc(plr); - if (ret < 0) - return ret; - - ret = pseudo_lock_cstates_constrain(plr); - if (ret < 0) { - ret = -EINVAL; - goto out_region; - } - - plr->thread_done = 0; - - plr->closid = rdtgrp->closid; - thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, plr, - cpu_to_node(plr->cpu), - "pseudo_lock/%u", plr->cpu); - if (IS_ERR(thread)) { - ret = PTR_ERR(thread); - rdt_last_cmd_printf("Locking thread returned error %d\n", ret); - goto out_cstates; - } - - kthread_bind(thread, plr->cpu); - wake_up_process(thread); - - ret = wait_event_interruptible(plr->lock_thread_wq, - plr->thread_done == 1); - if (ret < 0) { - /* - * If the thread does not get on the CPU for whatever - * reason and the process which sets up the region is - * interrupted then this will leave the thread in runnable - * state and once it gets on the CPU it will dereference - * the cleared, but not freed, plr struct resulting in an - * empty pseudo-locking loop. - */ - rdt_last_cmd_puts("Locking thread interrupted\n"); - goto out_cstates; - } - - ret = pseudo_lock_minor_get(&new_minor); - if (ret < 0) { - rdt_last_cmd_puts("Unable to obtain a new minor number\n"); - goto out_cstates; - } - - /* - * Unlock access but do not release the reference. The - * pseudo-locked region will still be here on return. - * - * The mutex has to be released temporarily to avoid a potential - * deadlock with the mm->mmap_lock which is obtained in the - * device_create() and debugfs_create_dir() callpath below as well as - * before the mmap() callback is called. - */ - mutex_unlock(&rdtgroup_mutex); - - if (!IS_ERR_OR_NULL(debugfs_resctrl)) { - plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name, - debugfs_resctrl); - if (!IS_ERR_OR_NULL(plr->debugfs_dir)) - debugfs_create_file("pseudo_lock_measure", 0200, - plr->debugfs_dir, rdtgrp, - &pseudo_measure_fops); - } - - dev = device_create(&pseudo_lock_class, NULL, - MKDEV(pseudo_lock_major, new_minor), - rdtgrp, "%s", rdtgrp->kn->name); - - mutex_lock(&rdtgroup_mutex); - - if (IS_ERR(dev)) { - ret = PTR_ERR(dev); - rdt_last_cmd_printf("Failed to create character device: %d\n", - ret); - goto out_debugfs; - } - - /* We released the mutex - check if group was removed while we did so */ - if (rdtgrp->flags & RDT_DELETED) { - ret = -ENODEV; - goto out_device; - } - - plr->minor = new_minor; - - rdtgrp->mode = RDT_MODE_PSEUDO_LOCKED; - closid_free(rdtgrp->closid); - rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0444); - rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0444); - - ret = 0; - goto out; - -out_device: - device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, new_minor)); -out_debugfs: - debugfs_remove_recursive(plr->debugfs_dir); - pseudo_lock_minor_release(new_minor); -out_cstates: - pseudo_lock_cstates_relax(plr); -out_region: - pseudo_lock_region_clear(plr); -out: - return ret; -} - -/** - * rdtgroup_pseudo_lock_remove - Remove a pseudo-locked region - * @rdtgrp: resource group to which the pseudo-locked region belongs - * - * The removal of a pseudo-locked region can be initiated when the resource - * group is removed from user space via a "rmdir" from userspace or the - * unmount of the resctrl filesystem. On removal the resource group does - * not go back to pseudo-locksetup mode before it is removed, instead it is - * removed directly. There is thus asymmetry with the creation where the - * &struct pseudo_lock_region is removed here while it was not created in - * rdtgroup_pseudo_lock_create(). - * - * Return: void - */ -void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) -{ - struct pseudo_lock_region *plr = rdtgrp->plr; - - if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - return; - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - /* - * Default group cannot be a pseudo-locked region so we can - * free closid here. - */ - closid_free(rdtgrp->closid); - goto free; - } - - pseudo_lock_cstates_relax(plr); - debugfs_remove_recursive(rdtgrp->plr->debugfs_dir); - device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor)); - pseudo_lock_minor_release(plr->minor); - -free: - pseudo_lock_free(rdtgrp); -} - -static int pseudo_lock_dev_open(struct inode *inode, struct file *filp) -{ - struct rdtgroup *rdtgrp; - - mutex_lock(&rdtgroup_mutex); - - rdtgrp = region_find_by_minor(iminor(inode)); - if (!rdtgrp) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - - filp->private_data = rdtgrp; - atomic_inc(&rdtgrp->waitcount); - /* Perform a non-seekable open - llseek is not supported */ - filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); - - mutex_unlock(&rdtgroup_mutex); - - return 0; -} - -static int pseudo_lock_dev_release(struct inode *inode, struct file *filp) -{ - struct rdtgroup *rdtgrp; - - mutex_lock(&rdtgroup_mutex); - rdtgrp = filp->private_data; - WARN_ON(!rdtgrp); - if (!rdtgrp) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - filp->private_data = NULL; - atomic_dec(&rdtgrp->waitcount); - mutex_unlock(&rdtgroup_mutex); - return 0; -} - -static int pseudo_lock_dev_mremap(struct vm_area_struct *area) -{ - /* Not supported */ - return -EINVAL; -} - -static const struct vm_operations_struct pseudo_mmap_ops = { - .mremap = pseudo_lock_dev_mremap, -}; - -static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma) -{ - unsigned long vsize = vma->vm_end - vma->vm_start; - unsigned long off = vma->vm_pgoff << PAGE_SHIFT; - struct pseudo_lock_region *plr; - struct rdtgroup *rdtgrp; - unsigned long physical; - unsigned long psize; - - mutex_lock(&rdtgroup_mutex); - - rdtgrp = filp->private_data; - WARN_ON(!rdtgrp); - if (!rdtgrp) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - - plr = rdtgrp->plr; - - if (!plr->d) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - - /* - * Task is required to run with affinity to the cpus associated - * with the pseudo-locked region. If this is not the case the task - * may be scheduled elsewhere and invalidate entries in the - * pseudo-locked region. - */ - if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) { - mutex_unlock(&rdtgroup_mutex); - return -EINVAL; - } - - physical = __pa(plr->kmem) >> PAGE_SHIFT; - psize = plr->size - off; - - if (off > plr->size) { - mutex_unlock(&rdtgroup_mutex); - return -ENOSPC; - } - - /* - * Ensure changes are carried directly to the memory being mapped, - * do not allow copy-on-write mapping. - */ - if (!(vma->vm_flags & VM_SHARED)) { - mutex_unlock(&rdtgroup_mutex); - return -EINVAL; - } - - if (vsize > psize) { - mutex_unlock(&rdtgroup_mutex); - return -ENOSPC; - } - - memset(plr->kmem + off, 0, vsize); - - if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff, - vsize, vma->vm_page_prot)) { - mutex_unlock(&rdtgroup_mutex); - return -EAGAIN; - } - vma->vm_ops = &pseudo_mmap_ops; - mutex_unlock(&rdtgroup_mutex); - return 0; -} - -static const struct file_operations pseudo_lock_dev_fops = { - .owner = THIS_MODULE, - .llseek = no_llseek, - .read = NULL, - .write = NULL, - .open = pseudo_lock_dev_open, - .release = pseudo_lock_dev_release, - .mmap = pseudo_lock_dev_mmap, -}; - -int rdt_pseudo_lock_init(void) -{ - int ret; - - ret = register_chrdev(0, "pseudo_lock", &pseudo_lock_dev_fops); - if (ret < 0) - return ret; - - pseudo_lock_major = ret; - - ret = class_register(&pseudo_lock_class); - if (ret) { - unregister_chrdev(pseudo_lock_major, "pseudo_lock"); - return ret; - } - - return 0; -} - -void rdt_pseudo_lock_release(void) -{ - class_unregister(&pseudo_lock_class); - unregister_chrdev(pseudo_lock_major, "pseudo_lock"); - pseudo_lock_major = 0; -} diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 253f5fbd8e1c..10bd76e1d0d5 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -12,22 +12,8 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/cacheinfo.h> #include <linux/cpu.h> -#include <linux/debugfs.h> -#include <linux/fs.h> -#include <linux/fs_parser.h> -#include <linux/sysfs.h> -#include <linux/kernfs.h> -#include <linux/seq_buf.h> -#include <linux/seq_file.h> -#include <linux/sched/signal.h> -#include <linux/sched/task.h> #include <linux/slab.h> -#include <linux/task_work.h> -#include <linux/user_namespace.h> - -#include <uapi/linux/magic.h>
#include <asm/resctrl.h> #include "internal.h" @@ -36,4220 +22,239 @@ DEFINE_STATIC_KEY_FALSE(rdt_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
-/* Mutex to protect rdtgroup access. */ -DEFINE_MUTEX(rdtgroup_mutex); - -static struct kernfs_root *rdt_root; -struct rdtgroup rdtgroup_default; -LIST_HEAD(rdt_all_groups); - -/* list of entries for the schemata file */ -LIST_HEAD(resctrl_schema_all); - -/* The filesystem can only be mounted once. */ -bool resctrl_mounted; - -/* Kernel fs node for "info" directory under root */ -static struct kernfs_node *kn_info; - -/* Kernel fs node for "mon_groups" directory under root */ -static struct kernfs_node *kn_mongrp; - -/* Kernel fs node for "mon_data" directory under root */ -static struct kernfs_node *kn_mondata; - /* - * Used to store the max resource name width and max resource data width - * to display the schemata in a tabular format + * This is safe against resctrl_sched_in() called from __switch_to() + * because __switch_to() is executed with interrupts disabled. A local call + * from update_closid_rmid() is protected against __switch_to() because + * preemption is disabled. */ -int max_name_width, max_data_width; - -static struct seq_buf last_cmd_status; -static char last_cmd_status_buf[512]; - -static int rdtgroup_setup_root(struct rdt_fs_context *ctx); -static void rdtgroup_destroy_root(void); - -struct dentry *debugfs_resctrl; - -static bool resctrl_debug; - -void rdt_last_cmd_clear(void) -{ - lockdep_assert_held(&rdtgroup_mutex); - seq_buf_clear(&last_cmd_status); -} - -void rdt_last_cmd_puts(const char *s) -{ - lockdep_assert_held(&rdtgroup_mutex); - seq_buf_puts(&last_cmd_status, s); -} - -void rdt_last_cmd_printf(const char *fmt, ...) -{ - va_list ap; - - va_start(ap, fmt); - lockdep_assert_held(&rdtgroup_mutex); - seq_buf_vprintf(&last_cmd_status, fmt, ap); - va_end(ap); -} - -void rdt_staged_configs_clear(void) +void resctrl_arch_sync_cpu_defaults(void *info) { - struct rdt_resource *r; - struct rdt_domain *dom; - int i; - - lockdep_assert_held(&rdtgroup_mutex); - - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->alloc_capable) - continue; + struct resctrl_cpu_sync *r = info;
- list_for_each_entry(dom, &r->domains, list) - memset(dom->staged_config, 0, sizeof(dom->staged_config)); + if (r) { + this_cpu_write(pqr_state.default_closid, r->closid); + this_cpu_write(pqr_state.default_rmid, r->rmid); } -}
-static bool resctrl_is_mbm_enabled(void) -{ - return (resctrl_arch_is_mbm_total_enabled() || - resctrl_arch_is_mbm_local_enabled()); + /* + * We cannot unconditionally write the MSR because the current + * executing task might have its own closid selected. Just reuse + * the context switch code. + */ + resctrl_sched_in(current); }
-static bool resctrl_is_mbm_event(int e) -{ - return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && - e <= QOS_L3_MBM_LOCAL_EVENT_ID); -} +#define INVALID_CONFIG_INDEX UINT_MAX
-/* - * Trivial allocator for CLOSIDs. Since h/w only supports a small number, - * we can keep a bitmap of free CLOSIDs in a single integer. +/** + * mon_event_config_index_get - get the hardware index for the + * configurable event + * @evtid: event id. * - * Using a global CLOSID across all resources has some advantages and - * some drawbacks: - * + We can simply set current's closid to assign a task to a resource - * group. - * + Context switch code can avoid extra memory references deciding which - * CLOSID to load into the PQR_ASSOC MSR - * - We give up some options in configuring resource groups across multi-socket - * systems. - * - Our choices on how to configure each resource become progressively more - * limited as the number of resources grows. + * Return: 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID + * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID + * INVALID_CONFIG_INDEX for invalid evtid */ -static unsigned long closid_free_map; -static int closid_free_map_len; - -int closids_supported(void) -{ - return closid_free_map_len; -} - -static void closid_init(void) +static inline unsigned int mon_event_config_index_get(u32 evtid) { - struct resctrl_schema *s; - u32 rdt_min_closid = 32; - - /* Compute rdt_min_closid across all resources */ - list_for_each_entry(s, &resctrl_schema_all, list) - rdt_min_closid = min(rdt_min_closid, s->num_closid); - - closid_free_map = BIT_MASK(rdt_min_closid) - 1; - - /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ - __clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map); - closid_free_map_len = rdt_min_closid; + switch (evtid) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return 0; + case QOS_L3_MBM_LOCAL_EVENT_ID: + return 1; + default: + /* Should never reach here */ + return INVALID_CONFIG_INDEX; + } }
-static int closid_alloc(void) +void resctrl_arch_mon_event_config_read(void *info) { - int cleanest_closid; - u32 closid; - - lockdep_assert_held(&rdtgroup_mutex); + struct resctrl_mon_config_info *mon_info = info; + unsigned int index; + u64 msrval;
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - cleanest_closid = resctrl_find_cleanest_closid(); - if (cleanest_closid < 0) - return cleanest_closid; - closid = cleanest_closid; - } else { - closid = ffs(closid_free_map); - if (closid == 0) - return -ENOSPC; - closid--; + index = mon_event_config_index_get(mon_info->evtid); + if (index == INVALID_CONFIG_INDEX) { + pr_warn_once("Invalid event id %d\n", mon_info->evtid); + return; } - __clear_bit(closid, &closid_free_map); + rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval);
- return closid; + /* Report only the valid event configuration bits */ + mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; }
-void closid_free(int closid) +void resctrl_arch_mon_event_config_write(void *info) { - lockdep_assert_held(&rdtgroup_mutex); - - __set_bit(closid, &closid_free_map); -} + struct resctrl_mon_config_info *mon_info = info; + unsigned int index;
-/** - * closid_allocated - test if provided closid is in use - * @closid: closid to be tested - * - * Return: true if @closid is currently associated with a resource group, - * false if @closid is free - */ -bool closid_allocated(unsigned int closid) -{ - lockdep_assert_held(&rdtgroup_mutex); + index = mon_event_config_index_get(mon_info->evtid); + if (index == INVALID_CONFIG_INDEX) { + pr_warn_once("Invalid event id %d\n", mon_info->evtid); + mon_info->err = -EINVAL; + return; + } + wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0);
- return !test_bit(closid, &closid_free_map); + mon_info->err = 0; }
-/** - * rdtgroup_mode_by_closid - Return mode of resource group with closid - * @closid: closid if the resource group - * - * Each resource group is associated with a @closid. Here the mode - * of a resource group can be queried by searching for it using its closid. - * - * Return: mode as &enum rdtgrp_mode of resource group with closid @closid - */ -enum rdtgrp_mode rdtgroup_mode_by_closid(int closid) +static void l3_qos_cfg_update(void *arg) { - struct rdtgroup *rdtgrp; - - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { - if (rdtgrp->closid == closid) - return rdtgrp->mode; - } + bool *enable = arg;
- return RDT_NUM_MODES; + wrmsrl(MSR_IA32_L3_QOS_CFG, *enable ? L3_QOS_CDP_ENABLE : 0ULL); }
-static const char * const rdt_mode_str[] = { - [RDT_MODE_SHAREABLE] = "shareable", - [RDT_MODE_EXCLUSIVE] = "exclusive", - [RDT_MODE_PSEUDO_LOCKSETUP] = "pseudo-locksetup", - [RDT_MODE_PSEUDO_LOCKED] = "pseudo-locked", -}; - -/** - * rdtgroup_mode_str - Return the string representation of mode - * @mode: the resource group mode as &enum rdtgroup_mode - * - * Return: string representation of valid mode, "unknown" otherwise - */ -static const char *rdtgroup_mode_str(enum rdtgrp_mode mode) +static void l2_qos_cfg_update(void *arg) { - if (mode < RDT_MODE_SHAREABLE || mode >= RDT_NUM_MODES) - return "unknown"; + bool *enable = arg;
- return rdt_mode_str[mode]; + wrmsrl(MSR_IA32_L2_QOS_CFG, *enable ? L2_QOS_CDP_ENABLE : 0ULL); }
-/* set uid and gid of rdtgroup dirs and files to that of the creator */ -static int rdtgroup_kn_set_ugid(struct kernfs_node *kn) +static int set_cache_qos_cfg(int level, bool enable) { - struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID, - .ia_uid = current_fsuid(), - .ia_gid = current_fsgid(), }; - - if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) && - gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID)) - return 0; + void (*update)(void *arg); + struct rdt_resource *r_l; + cpumask_var_t cpu_mask; + struct rdt_domain *d; + int cpu;
- return kernfs_setattr(kn, &iattr); -} + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held();
-static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft) -{ - struct kernfs_node *kn; - int ret; + if (level == RDT_RESOURCE_L3) + update = l3_qos_cfg_update; + else if (level == RDT_RESOURCE_L2) + update = l2_qos_cfg_update; + else + return -EINVAL;
- kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, - 0, rft->kf_ops, rft, NULL, NULL); - if (IS_ERR(kn)) - return PTR_ERR(kn); + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) + return -ENOMEM;
- ret = rdtgroup_kn_set_ugid(kn); - if (ret) { - kernfs_remove(kn); - return ret; + r_l = &rdt_resources_all[level].r_resctrl; + list_for_each_entry(d, &r_l->domains, list) { + if (r_l->cache.arch_has_per_cpu_cfg) + /* Pick all the CPUs in the domain instance */ + for_each_cpu(cpu, &d->cpu_mask) + cpumask_set_cpu(cpu, cpu_mask); + else + /* Pick one CPU from each domain instance to update MSR */ + cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); }
- return 0; -} + /* Update QOS_CFG MSR on all the CPUs in cpu_mask */ + on_each_cpu_mask(cpu_mask, update, &enable, 1);
-static int rdtgroup_seqfile_show(struct seq_file *m, void *arg) -{ - struct kernfs_open_file *of = m->private; - struct rftype *rft = of->kn->priv; + free_cpumask_var(cpu_mask);
- if (rft->seq_show) - return rft->seq_show(of, m, arg); return 0; }
-static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf, - size_t nbytes, loff_t off) +/* Restore the qos cfg state when a domain comes online */ +void rdt_domain_reconfigure_cdp(struct rdt_resource *r) { - struct rftype *rft = of->kn->priv; - - if (rft->write) - return rft->write(of, buf, nbytes, off); - - return -EINVAL; -} - -static const struct kernfs_ops rdtgroup_kf_single_ops = { - .atomic_write_len = PAGE_SIZE, - .write = rdtgroup_file_write, - .seq_show = rdtgroup_seqfile_show, -}; + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-static const struct kernfs_ops kf_mondata_ops = { - .atomic_write_len = PAGE_SIZE, - .seq_show = rdtgroup_mondata_show, -}; + if (!r->cdp_capable) + return;
-static bool is_cpu_list(struct kernfs_open_file *of) -{ - struct rftype *rft = of->kn->priv; + if (r->rid == RDT_RESOURCE_L2) + l2_qos_cfg_update(&hw_res->cdp_enabled);
- return rft->flags & RFTYPE_FLAGS_CPUS_LIST; + if (r->rid == RDT_RESOURCE_L3) + l3_qos_cfg_update(&hw_res->cdp_enabled); }
-static int rdtgroup_cpus_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) +static int cdp_enable(int level) { - struct rdtgroup *rdtgrp; - struct cpumask *mask; - int ret = 0; + struct rdt_resource *r_l = &rdt_resources_all[level].r_resctrl; + int ret;
- rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!r_l->alloc_capable) + return -EINVAL;
- if (rdtgrp) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - if (!rdtgrp->plr->d) { - rdt_last_cmd_clear(); - rdt_last_cmd_puts("Cache domain offline\n"); - ret = -ENODEV; - } else { - mask = &rdtgrp->plr->d->cpu_mask; - seq_printf(s, is_cpu_list(of) ? - "%*pbl\n" : "%*pb\n", - cpumask_pr_args(mask)); - } - } else { - seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n", - cpumask_pr_args(&rdtgrp->cpu_mask)); - } - } else { - ret = -ENOENT; - } - rdtgroup_kn_unlock(of->kn); + ret = set_cache_qos_cfg(level, true); + if (!ret) + rdt_resources_all[level].cdp_enabled = true;
return ret; }
-/* - * This is safe against resctrl_sched_in() called from __switch_to() - * because __switch_to() is executed with interrupts disabled. A local call - * from update_closid_rmid() is protected against __switch_to() because - * preemption is disabled. - */ -void resctrl_arch_sync_cpu_defaults(void *info) -{ - struct resctrl_cpu_sync *r = info; - - if (r) { - this_cpu_write(pqr_state.default_closid, r->closid); - this_cpu_write(pqr_state.default_rmid, r->rmid); - } - - /* - * We cannot unconditionally write the MSR because the current - * executing task might have its own closid selected. Just reuse - * the context switch code. - */ - resctrl_sched_in(current); -} - -/* - * Update the PGR_ASSOC MSR on all cpus in @cpu_mask, - * - * Per task closids/rmids must have been set up before calling this function. - * @r may be NULL. - */ -static void -update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) +static void cdp_disable(int level) { - struct resctrl_cpu_sync defaults; - struct resctrl_cpu_sync *defaults_p = NULL; + struct rdt_hw_resource *r_hw = &rdt_resources_all[level];
- if (r) { - defaults.closid = r->closid; - defaults.rmid = r->mon.rmid; - defaults_p = &defaults; + if (r_hw->cdp_enabled) { + set_cache_qos_cfg(level, false); + r_hw->cdp_enabled = false; } - - on_each_cpu_mask(cpu_mask, resctrl_arch_sync_cpu_defaults, defaults_p, - 1); }
-static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, - cpumask_var_t tmpmask) +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable) { - struct rdtgroup *prgrp = rdtgrp->mon.parent, *crgrp; - struct list_head *head; + struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
- /* Check whether cpus belong to parent ctrl group */ - cpumask_andnot(tmpmask, newmask, &prgrp->cpu_mask); - if (!cpumask_empty(tmpmask)) { - rdt_last_cmd_puts("Can only add CPUs to mongroup that belong to parent\n"); + if (!hw_res->r_resctrl.cdp_capable) return -EINVAL; - } - - /* Check whether cpus are dropped from this group */ - cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); - if (!cpumask_empty(tmpmask)) { - /* Give any dropped cpus to parent rdtgroup */ - cpumask_or(&prgrp->cpu_mask, &prgrp->cpu_mask, tmpmask); - update_closid_rmid(tmpmask, prgrp); - }
- /* - * If we added cpus, remove them from previous group that owned them - * and update per-cpu rmid - */ - cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); - if (!cpumask_empty(tmpmask)) { - head = &prgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) { - if (crgrp == rdtgrp) - continue; - cpumask_andnot(&crgrp->cpu_mask, &crgrp->cpu_mask, - tmpmask); - } - update_closid_rmid(tmpmask, rdtgrp); - } + if (enable) + return cdp_enable(l);
- /* Done pushing/pulling - update this group with new mask */ - cpumask_copy(&rdtgrp->cpu_mask, newmask); + cdp_disable(l);
return 0; }
-static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m) +static int reset_all_ctrls(struct rdt_resource *r) { - struct rdtgroup *crgrp; - - cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m); - /* update the child mon group masks as well*/ - list_for_each_entry(crgrp, &r->mon.crdtgrp_list, mon.crdtgrp_list) - cpumask_and(&crgrp->cpu_mask, &r->cpu_mask, &crgrp->cpu_mask); -} + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + struct rdt_hw_domain *hw_dom; + struct msr_param msr_param; + cpumask_var_t cpu_mask; + struct rdt_domain *d; + int i;
-static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, - cpumask_var_t tmpmask, cpumask_var_t tmpmask1) -{ - struct rdtgroup *r, *crgrp; - struct list_head *head; + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held();
- /* Check whether cpus are dropped from this group */ - cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); - if (!cpumask_empty(tmpmask)) { - /* Can't drop from default group */ - if (rdtgrp == &rdtgroup_default) { - rdt_last_cmd_puts("Can't drop CPUs from default group\n"); - return -EINVAL; - } + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) + return -ENOMEM;
- /* Give any dropped cpus to rdtgroup_default */ - cpumask_or(&rdtgroup_default.cpu_mask, - &rdtgroup_default.cpu_mask, tmpmask); - update_closid_rmid(tmpmask, &rdtgroup_default); - } + msr_param.res = r; + msr_param.low = 0; + msr_param.high = hw_res->num_closid;
/* - * If we added cpus, remove them from previous group and - * the prev group's child groups that owned them - * and update per-cpu closid/rmid. + * Disable resource control for this resource by setting all + * CBMs in all domains to the maximum mask value. Pick one CPU + * from each domain to update the MSRs below. */ - cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); - if (!cpumask_empty(tmpmask)) { - list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { - if (r == rdtgrp) - continue; - cpumask_and(tmpmask1, &r->cpu_mask, tmpmask); - if (!cpumask_empty(tmpmask1)) - cpumask_rdtgrp_clear(r, tmpmask1); - } - update_closid_rmid(tmpmask, rdtgrp); + list_for_each_entry(d, &r->domains, list) { + hw_dom = resctrl_to_arch_dom(d); + cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); + + for (i = 0; i < hw_res->num_closid; i++) + hw_dom->ctrl_val[i] = r->default_ctrl; }
- /* Done pushing/pulling - update this group with new mask */ - cpumask_copy(&rdtgrp->cpu_mask, newmask); + /* Update CBM on all the CPUs in cpu_mask */ + on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1);
- /* - * Clear child mon group masks since there is a new parent mask - * now and update the rmid for the cpus the child lost. - */ - head = &rdtgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) { - cpumask_and(tmpmask, &rdtgrp->cpu_mask, &crgrp->cpu_mask); - update_closid_rmid(tmpmask, rdtgrp); - cpumask_clear(&crgrp->cpu_mask); - } + free_cpumask_var(cpu_mask);
return 0; }
-static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) +void resctrl_arch_reset_resources(void) { - cpumask_var_t tmpmask, newmask, tmpmask1; - struct rdtgroup *rdtgrp; - int ret; - - if (!buf) - return -EINVAL; - - if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) - return -ENOMEM; - if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) { - free_cpumask_var(tmpmask); - return -ENOMEM; - } - if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) { - free_cpumask_var(tmpmask); - free_cpumask_var(newmask); - return -ENOMEM; - } - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - ret = -ENOENT; - goto unlock; - } - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = -EINVAL; - rdt_last_cmd_puts("Pseudo-locking in progress\n"); - goto unlock; - } - - if (is_cpu_list(of)) - ret = cpulist_parse(buf, newmask); - else - ret = cpumask_parse(buf, newmask); - - if (ret) { - rdt_last_cmd_puts("Bad CPU list/mask\n"); - goto unlock; - } - - /* check that user didn't specify any offline cpus */ - cpumask_andnot(tmpmask, newmask, cpu_online_mask); - if (!cpumask_empty(tmpmask)) { - ret = -EINVAL; - rdt_last_cmd_puts("Can only assign online CPUs\n"); - goto unlock; - } - - if (rdtgrp->type == RDTCTRL_GROUP) - ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1); - else if (rdtgrp->type == RDTMON_GROUP) - ret = cpus_mon_write(rdtgrp, newmask, tmpmask); - else - ret = -EINVAL; - -unlock: - rdtgroup_kn_unlock(of->kn); - free_cpumask_var(tmpmask); - free_cpumask_var(newmask); - free_cpumask_var(tmpmask1); - - return ret ?: nbytes; -} - -/** - * rdtgroup_remove - the helper to remove resource group safely - * @rdtgrp: resource group to remove - * - * On resource group creation via a mkdir, an extra kernfs_node reference is - * taken to ensure that the rdtgroup structure remains accessible for the - * rdtgroup_kn_unlock() calls where it is removed. - * - * Drop the extra reference here, then free the rdtgroup structure. - * - * Return: void - */ -static void rdtgroup_remove(struct rdtgroup *rdtgrp) -{ - kernfs_put(rdtgrp->kn); - kfree(rdtgrp); -} - -static void _update_task_closid_rmid(void *task) -{ - /* - * If the task is still current on this CPU, update PQR_ASSOC MSR. - * Otherwise, the MSR is updated when the task is scheduled in. - */ - if (task == current) - resctrl_sched_in(task); -} - -static void update_task_closid_rmid(struct task_struct *t) -{ - if (IS_ENABLED(CONFIG_SMP) && task_curr(t)) - smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1); - else - _update_task_closid_rmid(t); -} - -static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp) -{ - u32 closid, rmid = rdtgrp->mon.rmid; - - if (rdtgrp->type == RDTCTRL_GROUP) - closid = rdtgrp->closid; - else if (rdtgrp->type == RDTMON_GROUP) - closid = rdtgrp->mon.parent->closid; - else - return false; - - return resctrl_arch_match_closid(tsk, closid) && - resctrl_arch_match_rmid(tsk, closid, rmid); -} - -static int __rdtgroup_move_task(struct task_struct *tsk, - struct rdtgroup *rdtgrp) -{ - /* If the task is already in rdtgrp, no need to move the task. */ - if (task_in_rdtgroup(tsk, rdtgrp)) - return 0; - - /* - * Set the task's closid/rmid before the PQR_ASSOC MSR can be - * updated by them. - * - * For ctrl_mon groups, move both closid and rmid. - * For monitor groups, can move the tasks only from - * their parent CTRL group. - */ - if (rdtgrp->type == RDTMON_GROUP && - !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) { - rdt_last_cmd_puts("Can't move task to different control group\n"); - return -EINVAL; - } - - if (rdtgrp->type == RDTMON_GROUP) - resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid, - rdtgrp->mon.rmid); - else - resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, - rdtgrp->mon.rmid); - - /* - * Ensure the task's closid and rmid are written before determining if - * the task is current that will decide if it will be interrupted. - * This pairs with the full barrier between the rq->curr update and - * resctrl_sched_in() during context switch. - */ - smp_mb(); - - /* - * By now, the task's closid and rmid are set. If the task is current - * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource - * group go into effect. If the task is not current, the MSR will be - * updated when the task is scheduled in. - */ - update_task_closid_rmid(tsk); - - return 0; -} - -static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) -{ - return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) && - resctrl_arch_match_closid(t, r->closid)); -} - -static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) -{ - return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) && - resctrl_arch_match_rmid(t, r->mon.parent->closid, - r->mon.rmid)); -} - -/** - * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group - * @r: Resource group - * - * Return: 1 if tasks have been assigned to @r, 0 otherwise - */ -int rdtgroup_tasks_assigned(struct rdtgroup *r) -{ - struct task_struct *p, *t; - int ret = 0; - - lockdep_assert_held(&rdtgroup_mutex); - - rcu_read_lock(); - for_each_process_thread(p, t) { - if (is_closid_match(t, r) || is_rmid_match(t, r)) { - ret = 1; - break; - } - } - rcu_read_unlock(); - - return ret; -} - -static int rdtgroup_task_write_permission(struct task_struct *task, - struct kernfs_open_file *of) -{ - const struct cred *tcred = get_task_cred(task); - const struct cred *cred = current_cred(); - int ret = 0; - - /* - * Even if we're attaching all tasks in the thread group, we only - * need to check permissions on one of them. - */ - if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && - !uid_eq(cred->euid, tcred->uid) && - !uid_eq(cred->euid, tcred->suid)) { - rdt_last_cmd_printf("No permission to move task %d\n", task->pid); - ret = -EPERM; - } - - put_cred(tcred); - return ret; -} - -static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp, - struct kernfs_open_file *of) -{ - struct task_struct *tsk; - int ret; - - rcu_read_lock(); - if (pid) { - tsk = find_task_by_vpid(pid); - if (!tsk) { - rcu_read_unlock(); - rdt_last_cmd_printf("No task %d\n", pid); - return -ESRCH; - } - } else { - tsk = current; - } - - get_task_struct(tsk); - rcu_read_unlock(); - - ret = rdtgroup_task_write_permission(tsk, of); - if (!ret) - ret = __rdtgroup_move_task(tsk, rdtgrp); - - put_task_struct(tsk); - return ret; -} - -static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct rdtgroup *rdtgrp; - char *pid_str; - int ret = 0; - pid_t pid; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - rdt_last_cmd_clear(); - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = -EINVAL; - rdt_last_cmd_puts("Pseudo-locking in progress\n"); - goto unlock; - } - - while (buf && buf[0] != '\0' && buf[0] != '\n') { - pid_str = strim(strsep(&buf, ",")); - - if (kstrtoint(pid_str, 0, &pid)) { - rdt_last_cmd_printf("Task list parsing error pid %s\n", pid_str); - ret = -EINVAL; - break; - } - - if (pid < 0) { - rdt_last_cmd_printf("Invalid pid %d\n", pid); - ret = -EINVAL; - break; - } - - ret = rdtgroup_move_task(pid, rdtgrp, of); - if (ret) { - rdt_last_cmd_printf("Error while processing task %d\n", pid); - break; - } - } - -unlock: - rdtgroup_kn_unlock(of->kn); - - return ret ?: nbytes; -} - -static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s) -{ - struct task_struct *p, *t; - pid_t pid; - - rcu_read_lock(); - for_each_process_thread(p, t) { - if (is_closid_match(t, r) || is_rmid_match(t, r)) { - pid = task_pid_vnr(t); - if (pid) - seq_printf(s, "%d\n", pid); - } - } - rcu_read_unlock(); -} - -static int rdtgroup_tasks_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct rdtgroup *rdtgrp; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) - show_rdt_tasks(rdtgrp, s); - else - ret = -ENOENT; - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -static int rdtgroup_closid_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct rdtgroup *rdtgrp; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) - seq_printf(s, "%u\n", rdtgrp->closid); - else - ret = -ENOENT; - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -static int rdtgroup_rmid_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct rdtgroup *rdtgrp; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) - seq_printf(s, "%u\n", rdtgrp->mon.rmid); - else - ret = -ENOENT; - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -#ifdef CONFIG_PROC_CPU_RESCTRL - -/* - * A task can only be part of one resctrl control group and of one monitor - * group which is associated to that control group. - * - * 1) res: - * mon: - * - * resctrl is not available. - * - * 2) res:/ - * mon: - * - * Task is part of the root resctrl control group, and it is not associated - * to any monitor group. - * - * 3) res:/ - * mon:mon0 - * - * Task is part of the root resctrl control group and monitor group mon0. - * - * 4) res:group0 - * mon: - * - * Task is part of resctrl control group group0, and it is not associated - * to any monitor group. - * - * 5) res:group0 - * mon:mon1 - * - * Task is part of resctrl control group group0 and monitor group mon1. - */ -int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, - struct pid *pid, struct task_struct *tsk) -{ - struct rdtgroup *rdtg; - int ret = 0; - - mutex_lock(&rdtgroup_mutex); - - /* Return empty if resctrl has not been mounted. */ - if (!resctrl_mounted) { - seq_puts(s, "res:\nmon:\n"); - goto unlock; - } - - list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) { - struct rdtgroup *crg; - - /* - * Task information is only relevant for shareable - * and exclusive groups. - */ - if (rdtg->mode != RDT_MODE_SHAREABLE && - rdtg->mode != RDT_MODE_EXCLUSIVE) - continue; - - if (!resctrl_arch_match_closid(tsk, rdtg->closid)) - continue; - - seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "", - rdtg->kn->name); - seq_puts(s, "mon:"); - list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, - mon.crdtgrp_list) { - if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid, - crg->mon.rmid)) - continue; - seq_printf(s, "%s", crg->kn->name); - break; - } - seq_putc(s, '\n'); - goto unlock; - } - /* - * The above search should succeed. Otherwise return - * with an error. - */ - ret = -ENOENT; -unlock: - mutex_unlock(&rdtgroup_mutex); - - return ret; -} -#endif - -static int rdt_last_cmd_status_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - int len; - - mutex_lock(&rdtgroup_mutex); - len = seq_buf_used(&last_cmd_status); - if (len) - seq_printf(seq, "%.*s", len, last_cmd_status_buf); - else - seq_puts(seq, "ok\n"); - mutex_unlock(&rdtgroup_mutex); - return 0; -} - -static int rdt_num_closids_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - - seq_printf(seq, "%u\n", s->num_closid); - return 0; -} - -static int rdt_default_ctrl_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%x\n", r->default_ctrl); - return 0; -} - -static int rdt_min_cbm_bits_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->cache.min_cbm_bits); - return 0; -} - -static int rdt_shareable_bits_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%x\n", r->cache.shareable_bits); - return 0; -} - -/* - * rdt_bit_usage_show - Display current usage of resources - * - * A domain is a shared resource that can now be allocated differently. Here - * we display the current regions of the domain as an annotated bitmask. - * For each domain of this resource its allocation bitmask - * is annotated as below to indicate the current usage of the corresponding bit: - * 0 - currently unused - * X - currently available for sharing and used by software and hardware - * H - currently used by hardware only but available for software use - * S - currently used and shareable by software only - * E - currently used exclusively by one resource group - * P - currently pseudo-locked by one resource group - */ -static int rdt_bit_usage_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - /* - * Use unsigned long even though only 32 bits are used to ensure - * test_bit() is used safely. - */ - unsigned long sw_shareable = 0, hw_shareable = 0; - unsigned long exclusive = 0, pseudo_locked = 0; - struct rdt_resource *r = s->res; - struct rdt_domain *dom; - int i, hwb, swb, excl, psl; - enum rdtgrp_mode mode; - bool sep = false; - u32 ctrl_val; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - hw_shareable = r->cache.shareable_bits; - list_for_each_entry(dom, &r->domains, list) { - if (sep) - seq_putc(seq, ';'); - sw_shareable = 0; - exclusive = 0; - seq_printf(seq, "%d=", dom->id); - for (i = 0; i < closids_supported(); i++) { - if (!closid_allocated(i)) - continue; - ctrl_val = resctrl_arch_get_config(r, dom, i, - s->conf_type); - mode = rdtgroup_mode_by_closid(i); - switch (mode) { - case RDT_MODE_SHAREABLE: - sw_shareable |= ctrl_val; - break; - case RDT_MODE_EXCLUSIVE: - exclusive |= ctrl_val; - break; - case RDT_MODE_PSEUDO_LOCKSETUP: - /* - * RDT_MODE_PSEUDO_LOCKSETUP is possible - * here but not included since the CBM - * associated with this CLOSID in this mode - * is not initialized and no task or cpu can be - * assigned this CLOSID. - */ - break; - case RDT_MODE_PSEUDO_LOCKED: - case RDT_NUM_MODES: - WARN(1, - "invalid mode for closid %d\n", i); - break; - } - } - for (i = r->cache.cbm_len - 1; i >= 0; i--) { - pseudo_locked = dom->plr ? dom->plr->cbm : 0; - hwb = test_bit(i, &hw_shareable); - swb = test_bit(i, &sw_shareable); - excl = test_bit(i, &exclusive); - psl = test_bit(i, &pseudo_locked); - if (hwb && swb) - seq_putc(seq, 'X'); - else if (hwb && !swb) - seq_putc(seq, 'H'); - else if (!hwb && swb) - seq_putc(seq, 'S'); - else if (excl) - seq_putc(seq, 'E'); - else if (psl) - seq_putc(seq, 'P'); - else /* Unused bits remain */ - seq_putc(seq, '0'); - } - sep = true; - } - seq_putc(seq, '\n'); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - return 0; -} - -static int rdt_min_bw_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->membw.min_bw); - return 0; -} - -static int rdt_num_rmids_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - - seq_printf(seq, "%d\n", r->num_rmid); - - return 0; -} - -static int rdt_mon_features_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - struct mon_evt *mevt; - - list_for_each_entry(mevt, &r->evt_list, list) { - seq_printf(seq, "%s\n", mevt->name); - if (mevt->configurable) - seq_printf(seq, "%s_config\n", mevt->name); - } - - return 0; -} - -static int rdt_bw_gran_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->membw.bw_gran); - return 0; -} - -static int rdt_delay_linear_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->membw.delay_linear); - return 0; -} - -static int max_threshold_occ_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold); - - return 0; -} - -static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - if (r->membw.throttle_mode == THREAD_THROTTLE_PER_THREAD) - seq_puts(seq, "per-thread\n"); - else - seq_puts(seq, "max\n"); - - return 0; -} - -static ssize_t max_threshold_occ_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - unsigned int bytes; - int ret; - - ret = kstrtouint(buf, 0, &bytes); - if (ret) - return ret; - - if (bytes > resctrl_rmid_realloc_limit) - return -EINVAL; - - resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes); - - return nbytes; -} - -/* - * rdtgroup_mode_show - Display mode of this resource group - */ -static int rdtgroup_mode_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct rdtgroup *rdtgrp; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - - seq_printf(s, "%s\n", rdtgroup_mode_str(rdtgrp->mode)); - - rdtgroup_kn_unlock(of->kn); - return 0; -} - -static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type) -{ - switch (my_type) { - case CDP_CODE: - return CDP_DATA; - case CDP_DATA: - return CDP_CODE; - default: - case CDP_NONE: - return CDP_NONE; - } -} - -static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks); - - return 0; -} - -/** - * __rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other - * @r: Resource to which domain instance @d belongs. - * @d: The domain instance for which @closid is being tested. - * @cbm: Capacity bitmask being tested. - * @closid: Intended closid for @cbm. - * @type: CDP type of @r. - * @exclusive: Only check if overlaps with exclusive resource groups - * - * Checks if provided @cbm intended to be used for @closid on domain - * @d overlaps with any other closids or other hardware usage associated - * with this domain. If @exclusive is true then only overlaps with - * resource groups in exclusive mode will be considered. If @exclusive - * is false then overlaps with any resource group or hardware entities - * will be considered. - * - * @cbm is unsigned long, even if only 32 bits are used, to make the - * bitmap functions work correctly. - * - * Return: false if CBM does not overlap, true if it does. - */ -static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d, - unsigned long cbm, int closid, - enum resctrl_conf_type type, bool exclusive) -{ - enum rdtgrp_mode mode; - unsigned long ctrl_b; - int i; - - /* Check for any overlap with regions used by hardware directly */ - if (!exclusive) { - ctrl_b = r->cache.shareable_bits; - if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) - return true; - } - - /* Check for overlap with other resource groups */ - for (i = 0; i < closids_supported(); i++) { - ctrl_b = resctrl_arch_get_config(r, d, i, type); - mode = rdtgroup_mode_by_closid(i); - if (closid_allocated(i) && i != closid && - mode != RDT_MODE_PSEUDO_LOCKSETUP) { - if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) { - if (exclusive) { - if (mode == RDT_MODE_EXCLUSIVE) - return true; - continue; - } - return true; - } - } - } - - return false; -} - -/** - * rdtgroup_cbm_overlaps - Does CBM overlap with other use of hardware - * @s: Schema for the resource to which domain instance @d belongs. - * @d: The domain instance for which @closid is being tested. - * @cbm: Capacity bitmask being tested. - * @closid: Intended closid for @cbm. - * @exclusive: Only check if overlaps with exclusive resource groups - * - * Resources that can be allocated using a CBM can use the CBM to control - * the overlap of these allocations. rdtgroup_cmb_overlaps() is the test - * for overlap. Overlap test is not limited to the specific resource for - * which the CBM is intended though - when dealing with CDP resources that - * share the underlying hardware the overlap check should be performed on - * the CDP resource sharing the hardware also. - * - * Refer to description of __rdtgroup_cbm_overlaps() for the details of the - * overlap test. - * - * Return: true if CBM overlap detected, false if there is no overlap - */ -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, - unsigned long cbm, int closid, bool exclusive) -{ - enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); - struct rdt_resource *r = s->res; - - if (__rdtgroup_cbm_overlaps(r, d, cbm, closid, s->conf_type, - exclusive)) - return true; - - if (!resctrl_arch_get_cdp_enabled(r->rid)) - return false; - return __rdtgroup_cbm_overlaps(r, d, cbm, closid, peer_type, exclusive); -} - -/** - * rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive - * @rdtgrp: Resource group identified through its closid. - * - * An exclusive resource group implies that there should be no sharing of - * its allocated resources. At the time this group is considered to be - * exclusive this test can determine if its current schemata supports this - * setting by testing for overlap with all other resource groups. - * - * Return: true if resource group can be exclusive, false if there is overlap - * with allocations of other resource groups and thus this resource group - * cannot be exclusive. - */ -static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) -{ - int closid = rdtgrp->closid; - struct resctrl_schema *s; - struct rdt_resource *r; - bool has_cache = false; - struct rdt_domain *d; - u32 ctrl; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) - continue; - has_cache = true; - list_for_each_entry(d, &r->domains, list) { - ctrl = resctrl_arch_get_config(r, d, closid, - s->conf_type); - if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { - rdt_last_cmd_puts("Schemata overlaps\n"); - return false; - } - } - } - - if (!has_cache) { - rdt_last_cmd_puts("Cannot be exclusive without CAT/CDP\n"); - return false; - } - - return true; -} - -/* - * rdtgroup_mode_write - Modify the resource group's mode - */ -static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct rdtgroup *rdtgrp; - enum rdtgrp_mode mode; - int ret = 0; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - buf[nbytes - 1] = '\0'; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - - rdt_last_cmd_clear(); - - mode = rdtgrp->mode; - - if ((!strcmp(buf, "shareable") && mode == RDT_MODE_SHAREABLE) || - (!strcmp(buf, "exclusive") && mode == RDT_MODE_EXCLUSIVE) || - (!strcmp(buf, "pseudo-locksetup") && - mode == RDT_MODE_PSEUDO_LOCKSETUP) || - (!strcmp(buf, "pseudo-locked") && mode == RDT_MODE_PSEUDO_LOCKED)) - goto out; - - if (mode == RDT_MODE_PSEUDO_LOCKED) { - rdt_last_cmd_puts("Cannot change pseudo-locked group\n"); - ret = -EINVAL; - goto out; - } - - if (!strcmp(buf, "shareable")) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = rdtgroup_locksetup_exit(rdtgrp); - if (ret) - goto out; - } - rdtgrp->mode = RDT_MODE_SHAREABLE; - } else if (!strcmp(buf, "exclusive")) { - if (!rdtgroup_mode_test_exclusive(rdtgrp)) { - ret = -EINVAL; - goto out; - } - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = rdtgroup_locksetup_exit(rdtgrp); - if (ret) - goto out; - } - rdtgrp->mode = RDT_MODE_EXCLUSIVE; - } else if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && - !strcmp(buf, "pseudo-locksetup")) { - ret = rdtgroup_locksetup_enter(rdtgrp); - if (ret) - goto out; - rdtgrp->mode = RDT_MODE_PSEUDO_LOCKSETUP; - } else { - rdt_last_cmd_puts("Unknown or unsupported mode\n"); - ret = -EINVAL; - } - -out: - rdtgroup_kn_unlock(of->kn); - return ret ?: nbytes; -} - -/** - * rdtgroup_cbm_to_size - Translate CBM to size in bytes - * @r: RDT resource to which @d belongs. - * @d: RDT domain instance. - * @cbm: bitmask for which the size should be computed. - * - * The bitmask provided associated with the RDT domain instance @d will be - * translated into how many bytes it represents. The size in bytes is - * computed by first dividing the total cache size by the CBM length to - * determine how many bytes each bit in the bitmask represents. The result - * is multiplied with the number of bits set in the bitmask. - * - * @cbm is unsigned long, even if only 32 bits are used to make the - * bitmap functions work correctly. - */ -unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, - struct rdt_domain *d, unsigned long cbm) -{ - struct cpu_cacheinfo *ci; - unsigned int size = 0; - int num_b, i; - - num_b = bitmap_weight(&cbm, r->cache.cbm_len); - ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask)); - for (i = 0; i < ci->num_leaves; i++) { - if (ci->info_list[i].level == r->cache_level) { - size = ci->info_list[i].size / r->cache.cbm_len * num_b; - break; - } - } - - return size; -} - -/* - * rdtgroup_size_show - Display size in bytes of allocated regions - * - * The "size" file mirrors the layout of the "schemata" file, printing the - * size in bytes of each region instead of the capacity bitmask. - */ -static int rdtgroup_size_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct resctrl_schema *schema; - enum resctrl_conf_type type; - struct rdtgroup *rdtgrp; - struct rdt_resource *r; - struct rdt_domain *d; - unsigned int size; - int ret = 0; - u32 closid; - bool sep; - u32 ctrl; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - if (!rdtgrp->plr->d) { - rdt_last_cmd_clear(); - rdt_last_cmd_puts("Cache domain offline\n"); - ret = -ENODEV; - } else { - seq_printf(s, "%*s:", max_name_width, - rdtgrp->plr->s->name); - size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res, - rdtgrp->plr->d, - rdtgrp->plr->cbm); - seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size); - } - goto out; - } - - closid = rdtgrp->closid; - - list_for_each_entry(schema, &resctrl_schema_all, list) { - r = schema->res; - type = schema->conf_type; - sep = false; - seq_printf(s, "%*s:", max_name_width, schema->name); - list_for_each_entry(d, &r->domains, list) { - if (sep) - seq_putc(s, ';'); - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - size = 0; - } else { - if (is_mba_sc(r)) - ctrl = d->mbps_val[closid]; - else - ctrl = resctrl_arch_get_config(r, d, - closid, - type); - if (r->rid == RDT_RESOURCE_MBA || - r->rid == RDT_RESOURCE_SMBA) - size = ctrl; - else - size = rdtgroup_cbm_to_size(r, d, ctrl); - } - seq_printf(s, "%d=%u", d->id, size); - sep = true; - } - seq_putc(s, '\n'); - } - -out: - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -#define INVALID_CONFIG_INDEX UINT_MAX - -/** - * mon_event_config_index_get - get the hardware index for the - * configurable event - * @evtid: event id. - * - * Return: 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID - * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID - * INVALID_CONFIG_INDEX for invalid evtid - */ -static inline unsigned int mon_event_config_index_get(u32 evtid) -{ - switch (evtid) { - case QOS_L3_MBM_TOTAL_EVENT_ID: - return 0; - case QOS_L3_MBM_LOCAL_EVENT_ID: - return 1; - default: - /* Should never reach here */ - return INVALID_CONFIG_INDEX; - } -} - -void resctrl_arch_mon_event_config_read(void *info) -{ - struct resctrl_mon_config_info *mon_info = info; - unsigned int index; - u64 msrval; - - index = mon_event_config_index_get(mon_info->evtid); - if (index == INVALID_CONFIG_INDEX) { - pr_warn_once("Invalid event id %d\n", mon_info->evtid); - return; - } - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval); - - /* Report only the valid event configuration bits */ - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; -} - -static void mondata_config_read(struct resctrl_mon_config_info *mon_info) -{ - smp_call_function_any(&mon_info->d->cpu_mask, - resctrl_arch_mon_event_config_read, mon_info, 1); -} - -static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) -{ - struct resctrl_mon_config_info mon_info = {0}; - struct rdt_domain *dom; - bool sep = false; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - list_for_each_entry(dom, &r->domains, list) { - if (sep) - seq_puts(s, ";"); - - memset(&mon_info, 0, sizeof(struct resctrl_mon_config_info)); - mon_info.r = r; - mon_info.d = dom; - mon_info.evtid = evtid; - mondata_config_read(&mon_info); - - seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); - sep = true; - } - seq_puts(s, "\n"); - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - return 0; -} - -static int mbm_total_bytes_config_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - - mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); - - return 0; -} - -static int mbm_local_bytes_config_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - - mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); - - return 0; -} - -void resctrl_arch_mon_event_config_write(void *info) -{ - struct resctrl_mon_config_info *mon_info = info; - unsigned int index; - - index = mon_event_config_index_get(mon_info->evtid); - if (index == INVALID_CONFIG_INDEX) { - pr_warn_once("Invalid event id %d\n", mon_info->evtid); - mon_info->err = -EINVAL; - return; - } - wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); - - mon_info->err = 0; -} - -static int mbm_config_write_domain(struct rdt_resource *r, - struct rdt_domain *d, u32 evtid, u32 val) -{ - struct resctrl_mon_config_info mon_info = {0}; - - /* mon_config cannot be more than the supported set of events */ - if (val > MAX_EVT_CONFIG_BITS) { - rdt_last_cmd_puts("Invalid event configuration\n"); - return -EINVAL; - } - - /* - * Read the current config value first. If both are the same then - * no need to write it again. - */ - mon_info.r = r; - mon_info.d = d; - mon_info.evtid = evtid; - mondata_config_read(&mon_info); - if (mon_info.mon_config == val) - goto out; - - mon_info.mon_config = val; - - /* - * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the - * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE - * are scoped at the domain level. Writing any of these MSRs - * on one CPU is observed by all the CPUs in the domain. - */ - smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, - &mon_info, 1); - if (mon_info.err) { - rdt_last_cmd_puts("Invalid event configuration\n"); - goto out; - } - - /* - * When an Event Configuration is changed, the bandwidth counters - * for all RMIDs and Events will be cleared by the hardware. The - * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for - * every RMID on the next read to any event for every RMID. - * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) - * cleared while it is tracked by the hardware. Clear the - * mbm_local and mbm_total counts for all the RMIDs. - */ - resctrl_arch_reset_rmid_all(r, d); - -out: - return mon_info.err; -} - -static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) -{ - char *dom_str = NULL, *id_str; - unsigned long dom_id, val; - struct rdt_domain *d; - int ret = 0; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - -next: - if (!tok || tok[0] == '\0') - return 0; - - /* Start processing the strings for each domain */ - dom_str = strim(strsep(&tok, ";")); - id_str = strsep(&dom_str, "="); - - if (!id_str || kstrtoul(id_str, 10, &dom_id)) { - rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n"); - return -EINVAL; - } - - if (!dom_str || kstrtoul(dom_str, 16, &val)) { - rdt_last_cmd_puts("Non-numeric event configuration value\n"); - return -EINVAL; - } - - list_for_each_entry(d, &r->domains, list) { - if (d->id == dom_id) { - ret = mbm_config_write_domain(r, d, evtid, val); - if (ret) - return -EINVAL; - goto next; - } - } - - return -EINVAL; -} - -static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, - loff_t off) -{ - struct rdt_resource *r = of->kn->parent->priv; - int ret; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_last_cmd_clear(); - - buf[nbytes - 1] = '\0'; - - ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - return ret ?: nbytes; -} - -static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, - loff_t off) -{ - struct rdt_resource *r = of->kn->parent->priv; - int ret; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_last_cmd_clear(); - - buf[nbytes - 1] = '\0'; - - ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - return ret ?: nbytes; -} - -/* rdtgroup information files for one cache resource. */ -static struct rftype res_common_files[] = { - { - .name = "last_cmd_status", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_last_cmd_status_show, - .fflags = RFTYPE_TOP_INFO, - }, - { - .name = "num_closids", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_num_closids_show, - .fflags = RFTYPE_CTRL_INFO, - }, - { - .name = "mon_features", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_mon_features_show, - .fflags = RFTYPE_MON_INFO, - }, - { - .name = "num_rmids", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_num_rmids_show, - .fflags = RFTYPE_MON_INFO, - }, - { - .name = "cbm_mask", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_default_ctrl_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "min_cbm_bits", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_min_cbm_bits_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "shareable_bits", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_shareable_bits_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "bit_usage", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_bit_usage_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "min_bandwidth", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_min_bw_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, - }, - { - .name = "bandwidth_gran", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_bw_gran_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, - }, - { - .name = "delay_linear", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_delay_linear_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, - }, - /* - * Platform specific which (if any) capabilities are provided by - * thread_throttle_mode. Defer "fflags" initialization to platform - * discovery. - */ - { - .name = "thread_throttle_mode", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_thread_throttle_mode_show, - }, - { - .name = "max_threshold_occupancy", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = max_threshold_occ_write, - .seq_show = max_threshold_occ_show, - .fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "mbm_total_bytes_config", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = mbm_total_bytes_config_show, - .write = mbm_total_bytes_config_write, - }, - { - .name = "mbm_local_bytes_config", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = mbm_local_bytes_config_show, - .write = mbm_local_bytes_config_write, - }, - { - .name = "cpus", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_cpus_write, - .seq_show = rdtgroup_cpus_show, - .fflags = RFTYPE_BASE, - }, - { - .name = "cpus_list", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_cpus_write, - .seq_show = rdtgroup_cpus_show, - .flags = RFTYPE_FLAGS_CPUS_LIST, - .fflags = RFTYPE_BASE, - }, - { - .name = "tasks", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_tasks_write, - .seq_show = rdtgroup_tasks_show, - .fflags = RFTYPE_BASE, - }, - { - .name = "mon_hw_id", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdtgroup_rmid_show, - .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG, - }, - { - .name = "schemata", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_schemata_write, - .seq_show = rdtgroup_schemata_show, - .fflags = RFTYPE_CTRL_BASE, - }, - { - .name = "mode", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_mode_write, - .seq_show = rdtgroup_mode_show, - .fflags = RFTYPE_CTRL_BASE, - }, - { - .name = "size", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdtgroup_size_show, - .fflags = RFTYPE_CTRL_BASE, - }, - { - .name = "sparse_masks", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_has_sparse_bitmasks_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "ctrl_hw_id", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdtgroup_closid_show, - .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG, - }, - -}; - -static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags) -{ - struct rftype *rfts, *rft; - int ret, len; - - rfts = res_common_files; - len = ARRAY_SIZE(res_common_files); - - lockdep_assert_held(&rdtgroup_mutex); - - if (resctrl_debug) - fflags |= RFTYPE_DEBUG; - - for (rft = rfts; rft < rfts + len; rft++) { - if (rft->fflags && ((fflags & rft->fflags) == rft->fflags)) { - ret = rdtgroup_add_file(kn, rft); - if (ret) - goto error; - } - } - - return 0; -error: - pr_warn("Failed to add %s, err=%d\n", rft->name, ret); - while (--rft >= rfts) { - if ((fflags & rft->fflags) == rft->fflags) - kernfs_remove_by_name(kn, rft->name); - } - return ret; -} - -static struct rftype *rdtgroup_get_rftype_by_name(const char *name) -{ - struct rftype *rfts, *rft; - int len; - - rfts = res_common_files; - len = ARRAY_SIZE(res_common_files); - - for (rft = rfts; rft < rfts + len; rft++) { - if (!strcmp(rft->name, name)) - return rft; - } - - return NULL; -} - -static void thread_throttle_mode_init(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - struct rftype *rft; - - if (!r->alloc_capable || - r->membw.throttle_mode == THREAD_THROTTLE_UNDEFINED) - return; - - rft = rdtgroup_get_rftype_by_name("thread_throttle_mode"); - if (!rft) - return; - - rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; -} - -void mbm_config_rftype_init(const char *config) -{ - struct rftype *rft; - - rft = rdtgroup_get_rftype_by_name(config); - if (rft) - rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE; -} - -/** - * rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file - * @r: The resource group with which the file is associated. - * @name: Name of the file - * - * The permissions of named resctrl file, directory, or link are modified - * to not allow read, write, or execute by any user. - * - * WARNING: This function is intended to communicate to the user that the - * resctrl file has been locked down - that it is not relevant to the - * particular state the system finds itself in. It should not be relied - * on to protect from user access because after the file's permissions - * are restricted the user can still change the permissions using chmod - * from the command line. - * - * Return: 0 on success, <0 on failure. - */ -int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name) -{ - struct iattr iattr = {.ia_valid = ATTR_MODE,}; - struct kernfs_node *kn; - int ret = 0; - - kn = kernfs_find_and_get_ns(r->kn, name, NULL); - if (!kn) - return -ENOENT; - - switch (kernfs_type(kn)) { - case KERNFS_DIR: - iattr.ia_mode = S_IFDIR; - break; - case KERNFS_FILE: - iattr.ia_mode = S_IFREG; - break; - case KERNFS_LINK: - iattr.ia_mode = S_IFLNK; - break; - } - - ret = kernfs_setattr(kn, &iattr); - kernfs_put(kn); - return ret; -} - -/** - * rdtgroup_kn_mode_restore - Restore user access to named resctrl file - * @r: The resource group with which the file is associated. - * @name: Name of the file - * @mask: Mask of permissions that should be restored - * - * Restore the permissions of the named file. If @name is a directory the - * permissions of its parent will be used. - * - * Return: 0 on success, <0 on failure. - */ -int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, - umode_t mask) -{ - struct iattr iattr = {.ia_valid = ATTR_MODE,}; - struct kernfs_node *kn, *parent; - struct rftype *rfts, *rft; - int ret, len; - - rfts = res_common_files; - len = ARRAY_SIZE(res_common_files); - - for (rft = rfts; rft < rfts + len; rft++) { - if (!strcmp(rft->name, name)) - iattr.ia_mode = rft->mode & mask; - } - - kn = kernfs_find_and_get_ns(r->kn, name, NULL); - if (!kn) - return -ENOENT; - - switch (kernfs_type(kn)) { - case KERNFS_DIR: - parent = kernfs_get_parent(kn); - if (parent) { - iattr.ia_mode |= parent->mode; - kernfs_put(parent); - } - iattr.ia_mode |= S_IFDIR; - break; - case KERNFS_FILE: - iattr.ia_mode |= S_IFREG; - break; - case KERNFS_LINK: - iattr.ia_mode |= S_IFLNK; - break; - } - - ret = kernfs_setattr(kn, &iattr); - kernfs_put(kn); - return ret; -} - -static int rdtgroup_mkdir_info_resdir(void *priv, char *name, - unsigned long fflags) -{ - struct kernfs_node *kn_subdir; - int ret; - - kn_subdir = kernfs_create_dir(kn_info, name, - kn_info->mode, priv); - if (IS_ERR(kn_subdir)) - return PTR_ERR(kn_subdir); - - ret = rdtgroup_kn_set_ugid(kn_subdir); - if (ret) - return ret; - - ret = rdtgroup_add_files(kn_subdir, fflags); - if (!ret) - kernfs_activate(kn_subdir); - - return ret; -} - -static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) -{ - enum resctrl_res_level i; - struct resctrl_schema *s; - struct rdt_resource *r; - unsigned long fflags; - char name[32]; - int ret; - - /* create the directory */ - kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL); - if (IS_ERR(kn_info)) - return PTR_ERR(kn_info); - - ret = rdtgroup_add_files(kn_info, RFTYPE_TOP_INFO); - if (ret) - goto out_destroy; - - /* loop over enabled controls, these are all alloc_capable */ - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - fflags = r->fflags | RFTYPE_CTRL_INFO; - ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); - if (ret) - goto out_destroy; - } - - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->mon_capable) - continue; - - fflags = r->fflags | RFTYPE_MON_INFO; - sprintf(name, "%s_MON", r->name); - ret = rdtgroup_mkdir_info_resdir(r, name, fflags); - if (ret) - goto out_destroy; - } - - ret = rdtgroup_kn_set_ugid(kn_info); - if (ret) - goto out_destroy; - - kernfs_activate(kn_info); - - return 0; - -out_destroy: - kernfs_remove(kn_info); - return ret; -} - -static int -mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, - char *name, struct kernfs_node **dest_kn) -{ - struct kernfs_node *kn; - int ret; - - /* create the directory */ - kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); - if (IS_ERR(kn)) - return PTR_ERR(kn); - - if (dest_kn) - *dest_kn = kn; - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) - goto out_destroy; - - kernfs_activate(kn); - - return 0; - -out_destroy: - kernfs_remove(kn); - return ret; -} - -static void l3_qos_cfg_update(void *arg) -{ - bool *enable = arg; - - wrmsrl(MSR_IA32_L3_QOS_CFG, *enable ? L3_QOS_CDP_ENABLE : 0ULL); -} - -static void l2_qos_cfg_update(void *arg) -{ - bool *enable = arg; - - wrmsrl(MSR_IA32_L2_QOS_CFG, *enable ? L2_QOS_CDP_ENABLE : 0ULL); -} - -static inline bool is_mba_linear(void) -{ - return resctrl_arch_get_resource(RDT_RESOURCE_MBA)->membw.delay_linear; -} - -static int set_cache_qos_cfg(int level, bool enable) -{ - void (*update)(void *arg); - struct rdt_resource *r_l; - cpumask_var_t cpu_mask; - struct rdt_domain *d; - int cpu; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (level == RDT_RESOURCE_L3) - update = l3_qos_cfg_update; - else if (level == RDT_RESOURCE_L2) - update = l2_qos_cfg_update; - else - return -EINVAL; - - if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) - return -ENOMEM; - - r_l = &rdt_resources_all[level].r_resctrl; - list_for_each_entry(d, &r_l->domains, list) { - if (r_l->cache.arch_has_per_cpu_cfg) - /* Pick all the CPUs in the domain instance */ - for_each_cpu(cpu, &d->cpu_mask) - cpumask_set_cpu(cpu, cpu_mask); - else - /* Pick one CPU from each domain instance to update MSR */ - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); - } - - /* Update QOS_CFG MSR on all the CPUs in cpu_mask */ - on_each_cpu_mask(cpu_mask, update, &enable, 1); - - free_cpumask_var(cpu_mask); - - return 0; -} - -/* Restore the qos cfg state when a domain comes online */ -void rdt_domain_reconfigure_cdp(struct rdt_resource *r) -{ - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - - if (!r->cdp_capable) - return; - - if (r->rid == RDT_RESOURCE_L2) - l2_qos_cfg_update(&hw_res->cdp_enabled); - - if (r->rid == RDT_RESOURCE_L3) - l3_qos_cfg_update(&hw_res->cdp_enabled); -} - -static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d) -{ - u32 num_closid = resctrl_arch_get_num_closid(r); - int cpu = cpumask_any(&d->cpu_mask); - int i; - - d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val), - GFP_KERNEL, cpu_to_node(cpu)); - if (!d->mbps_val) - return -ENOMEM; - - for (i = 0; i < num_closid; i++) - d->mbps_val[i] = MBA_MAX_MBPS; - - return 0; -} - -static void mba_sc_domain_destroy(struct rdt_resource *r, - struct rdt_domain *d) -{ - kfree(d->mbps_val); - d->mbps_val = NULL; -} - -/* - * MBA software controller is supported only if - * MBM is supported and MBA is in linear scale. - */ -static bool supports_mba_mbps(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - - return (resctrl_arch_is_mbm_local_enabled() && - r->alloc_capable && is_mba_linear()); -} - -/* - * Enable or disable the MBA software controller - * which helps user specify bandwidth in MBps. - */ -static int set_mba_sc(bool mba_sc) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - u32 num_closid = resctrl_arch_get_num_closid(r); - struct rdt_domain *d; - int i; - - if (!supports_mba_mbps() || mba_sc == is_mba_sc(r)) - return -EINVAL; - - r->membw.mba_sc = mba_sc; - - list_for_each_entry(d, &r->domains, list) { - for (i = 0; i < num_closid; i++) - d->mbps_val[i] = MBA_MAX_MBPS; - } - - return 0; -} - -static int cdp_enable(int level) -{ - struct rdt_resource *r_l = &rdt_resources_all[level].r_resctrl; - int ret; - - if (!r_l->alloc_capable) - return -EINVAL; - - ret = set_cache_qos_cfg(level, true); - if (!ret) - rdt_resources_all[level].cdp_enabled = true; - - return ret; -} - -static void cdp_disable(int level) -{ - struct rdt_hw_resource *r_hw = &rdt_resources_all[level]; - - if (r_hw->cdp_enabled) { - set_cache_qos_cfg(level, false); - r_hw->cdp_enabled = false; - } -} - -int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable) -{ - struct rdt_hw_resource *hw_res = &rdt_resources_all[l]; - - if (!hw_res->r_resctrl.cdp_capable) - return -EINVAL; - - if (enable) - return cdp_enable(l); - - cdp_disable(l); - - return 0; -} - -/* - * We don't allow rdtgroup directories to be created anywhere - * except the root directory. Thus when looking for the rdtgroup - * structure for a kernfs node we are either looking at a directory, - * in which case the rdtgroup structure is pointed at by the "priv" - * field, otherwise we have a file, and need only look to the parent - * to find the rdtgroup. - */ -static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn) -{ - if (kernfs_type(kn) == KERNFS_DIR) { - /* - * All the resource directories use "kn->priv" - * to point to the "struct rdtgroup" for the - * resource. "info" and its subdirectories don't - * have rdtgroup structures, so return NULL here. - */ - if (kn == kn_info || kn->parent == kn_info) - return NULL; - else - return kn->priv; - } else { - return kn->parent->priv; - } -} - -static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn) -{ - atomic_inc(&rdtgrp->waitcount); - kernfs_break_active_protection(kn); -} - -static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) -{ - if (atomic_dec_and_test(&rdtgrp->waitcount) && - (rdtgrp->flags & RDT_DELETED)) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) - rdtgroup_pseudo_lock_remove(rdtgrp); - kernfs_unbreak_active_protection(kn); - rdtgroup_remove(rdtgrp); - } else { - kernfs_unbreak_active_protection(kn); - } -} - -struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn) -{ - struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); - - if (!rdtgrp) - return NULL; - - rdtgroup_kn_get(rdtgrp, kn); - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - /* Was this group deleted while we waited? */ - if (rdtgrp->flags & RDT_DELETED) - return NULL; - - return rdtgrp; -} - -void rdtgroup_kn_unlock(struct kernfs_node *kn) -{ - struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); - - if (!rdtgrp) - return; - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - rdtgroup_kn_put(rdtgrp, kn); -} - -static int mkdir_mondata_all(struct kernfs_node *parent_kn, - struct rdtgroup *prgrp, - struct kernfs_node **mon_data_kn); - -static void rdt_disable_ctx(void) -{ - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); - set_mba_sc(false); - - resctrl_debug = false; -} - -static int rdt_enable_ctx(struct rdt_fs_context *ctx) -{ - int ret = 0; - - if (ctx->enable_cdpl2) { - ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, true); - if (ret) - goto out_done; - } - - if (ctx->enable_cdpl3) { - ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, true); - if (ret) - goto out_cdpl2; - } - - if (ctx->enable_mba_mbps) { - ret = set_mba_sc(true); - if (ret) - goto out_cdpl3; - } - - if (ctx->enable_debug) - resctrl_debug = true; - - return 0; - -out_cdpl3: - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); -out_cdpl2: - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); -out_done: - return ret; -} - -static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type) -{ - struct resctrl_schema *s; - const char *suffix = ""; - int ret, cl; - - s = kzalloc(sizeof(*s), GFP_KERNEL); - if (!s) - return -ENOMEM; - - s->res = r; - s->num_closid = resctrl_arch_get_num_closid(r); - if (resctrl_arch_get_cdp_enabled(r->rid)) - s->num_closid /= 2; - - s->conf_type = type; - switch (type) { - case CDP_CODE: - suffix = "CODE"; - break; - case CDP_DATA: - suffix = "DATA"; - break; - case CDP_NONE: - suffix = ""; - break; - } - - ret = snprintf(s->name, sizeof(s->name), "%s%s", r->name, suffix); - if (ret >= sizeof(s->name)) { - kfree(s); - return -EINVAL; - } - - cl = strlen(s->name); - - /* - * If CDP is supported by this resource, but not enabled, - * include the suffix. This ensures the tabular format of the - * schemata file does not change between mounts of the filesystem. - */ - if (r->cdp_capable && !resctrl_arch_get_cdp_enabled(r->rid)) - cl += 4; - - if (cl > max_name_width) - max_name_width = cl; - - /* - * Choose a width for the resource data based on the resource that has - * widest name and cbm. - */ - max_data_width = max(max_data_width, r->data_width); - - INIT_LIST_HEAD(&s->list); - list_add(&s->list, &resctrl_schema_all); - - return 0; -} - -static int schemata_list_create(void) -{ - enum resctrl_res_level i; - struct rdt_resource *r; - int ret = 0; - - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->alloc_capable) - continue; - - if (resctrl_arch_get_cdp_enabled(r->rid)) { - ret = schemata_list_add(r, CDP_CODE); - if (ret) - break; - - ret = schemata_list_add(r, CDP_DATA); - } else { - ret = schemata_list_add(r, CDP_NONE); - } - - if (ret) - break; - } - - return ret; -} - -static void schemata_list_destroy(void) -{ - struct resctrl_schema *s, *tmp; - - list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) { - list_del(&s->list); - kfree(s); - } -} - -static int rdt_get_tree(struct fs_context *fc) -{ - struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); - struct rdt_fs_context *ctx = rdt_fc2context(fc); - unsigned long flags = RFTYPE_CTRL_BASE; - struct rdt_domain *dom; - int ret; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - /* - * resctrl file system can only be mounted once. - */ - if (resctrl_mounted) { - ret = -EBUSY; - goto out; - } - - ret = rdtgroup_setup_root(ctx); - if (ret) - goto out; - - ret = rdt_enable_ctx(ctx); - if (ret) - goto out_root; - - ret = schemata_list_create(); - if (ret) { - schemata_list_destroy(); - goto out_ctx; - } - - closid_init(); - - if (resctrl_arch_mon_capable()) - flags |= RFTYPE_MON; - - ret = rdtgroup_add_files(rdtgroup_default.kn, flags); - if (ret) - goto out_schemata_free; - - kernfs_activate(rdtgroup_default.kn); - - ret = rdtgroup_create_info_dir(rdtgroup_default.kn); - if (ret < 0) - goto out_schemata_free; - - if (resctrl_arch_mon_capable()) { - ret = mongroup_create_dir(rdtgroup_default.kn, - &rdtgroup_default, "mon_groups", - &kn_mongrp); - if (ret < 0) - goto out_info; - - ret = mkdir_mondata_all(rdtgroup_default.kn, - &rdtgroup_default, &kn_mondata); - if (ret < 0) - goto out_mongrp; - rdtgroup_default.mon.mon_data_kn = kn_mondata; - } - - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) { - ret = rdt_pseudo_lock_init(); - if (ret) - goto out_mondata; - } - - ret = kernfs_get_tree(fc); - if (ret < 0) - goto out_psl; - - if (resctrl_arch_alloc_capable()) - resctrl_arch_enable_alloc(); - if (resctrl_arch_mon_capable()) - resctrl_arch_enable_mon(); - - if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) - resctrl_mounted = true; - - if (resctrl_is_mbm_enabled()) { - list_for_each_entry(dom, &l3->domains, list) - mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, - RESCTRL_PICK_ANY_CPU); - } - - goto out; - -out_psl: - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - rdt_pseudo_lock_release(); -out_mondata: - if (resctrl_arch_mon_capable()) - kernfs_remove(kn_mondata); -out_mongrp: - if (resctrl_arch_mon_capable()) - kernfs_remove(kn_mongrp); -out_info: - kernfs_remove(kn_info); -out_schemata_free: - schemata_list_destroy(); -out_ctx: - rdt_disable_ctx(); -out_root: - rdtgroup_destroy_root(); -out: - rdt_last_cmd_clear(); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - return ret; -} - -enum rdt_param { - Opt_cdp, - Opt_cdpl2, - Opt_mba_mbps, - Opt_debug, - nr__rdt_params -}; - -static const struct fs_parameter_spec rdt_fs_parameters[] = { - fsparam_flag("cdp", Opt_cdp), - fsparam_flag("cdpl2", Opt_cdpl2), - fsparam_flag("mba_MBps", Opt_mba_mbps), - fsparam_flag("debug", Opt_debug), - {} -}; - -static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param) -{ - struct rdt_fs_context *ctx = rdt_fc2context(fc); - struct fs_parse_result result; - int opt; - - opt = fs_parse(fc, rdt_fs_parameters, param, &result); - if (opt < 0) - return opt; - - switch (opt) { - case Opt_cdp: - ctx->enable_cdpl3 = true; - return 0; - case Opt_cdpl2: - ctx->enable_cdpl2 = true; - return 0; - case Opt_mba_mbps: - if (!supports_mba_mbps()) - return -EINVAL; - ctx->enable_mba_mbps = true; - return 0; - case Opt_debug: - ctx->enable_debug = true; - return 0; - } - - return -EINVAL; -} - -static void rdt_fs_context_free(struct fs_context *fc) -{ - struct rdt_fs_context *ctx = rdt_fc2context(fc); - - kernfs_free_fs_context(fc); - kfree(ctx); -} - -static const struct fs_context_operations rdt_fs_context_ops = { - .free = rdt_fs_context_free, - .parse_param = rdt_parse_param, - .get_tree = rdt_get_tree, -}; - -static int rdt_init_fs_context(struct fs_context *fc) -{ - struct rdt_fs_context *ctx; - - ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL); - if (!ctx) - return -ENOMEM; - - ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; - fc->fs_private = &ctx->kfc; - fc->ops = &rdt_fs_context_ops; - put_user_ns(fc->user_ns); - fc->user_ns = get_user_ns(&init_user_ns); - fc->global = true; - return 0; -} - -static int reset_all_ctrls(struct rdt_resource *r) -{ - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - struct rdt_hw_domain *hw_dom; - struct msr_param msr_param; - cpumask_var_t cpu_mask; - struct rdt_domain *d; - int i; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) - return -ENOMEM; - - msr_param.res = r; - msr_param.low = 0; - msr_param.high = hw_res->num_closid; - - /* - * Disable resource control for this resource by setting all - * CBMs in all domains to the maximum mask value. Pick one CPU - * from each domain to update the MSRs below. - */ - list_for_each_entry(d, &r->domains, list) { - hw_dom = resctrl_to_arch_dom(d); - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); - - for (i = 0; i < hw_res->num_closid; i++) - hw_dom->ctrl_val[i] = r->default_ctrl; - } - - /* Update CBM on all the CPUs in cpu_mask */ - on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1); - - free_cpumask_var(cpu_mask); - - return 0; -} - -void resctrl_arch_reset_resources(void) -{ - struct rdt_resource *r; - - for_each_capable_rdt_resource(r) - reset_all_ctrls(r); -} - -/* - * Move tasks from one to the other group. If @from is NULL, then all tasks - * in the systems are moved unconditionally (used for teardown). - * - * If @mask is not NULL the cpus on which moved tasks are running are set - * in that mask so the update smp function call is restricted to affected - * cpus. - */ -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, - struct cpumask *mask) -{ - struct task_struct *p, *t; - - read_lock(&tasklist_lock); - for_each_process_thread(p, t) { - if (!from || is_closid_match(t, from) || - is_rmid_match(t, from)) { - resctrl_arch_set_closid_rmid(t, to->closid, - to->mon.rmid); - - /* - * Order the closid/rmid stores above before the loads - * in task_curr(). This pairs with the full barrier - * between the rq->curr update and resctrl_sched_in() - * during context switch. - */ - smp_mb(); - - /* - * If the task is on a CPU, set the CPU in the mask. - * The detection is inaccurate as tasks might move or - * schedule before the smp function call takes place. - * In such a case the function call is pointless, but - * there is no other side effect. - */ - if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) - cpumask_set_cpu(task_cpu(t), mask); - } - } - read_unlock(&tasklist_lock); -} - -static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) -{ - struct rdtgroup *sentry, *stmp; - struct list_head *head; - - head = &rdtgrp->mon.crdtgrp_list; - list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { - free_rmid(sentry->closid, sentry->mon.rmid); - list_del(&sentry->mon.crdtgrp_list); - - if (atomic_read(&sentry->waitcount) != 0) - sentry->flags = RDT_DELETED; - else - rdtgroup_remove(sentry); - } -} - -/* - * Forcibly remove all of subdirectories under root. - */ -static void rmdir_all_sub(void) -{ - struct rdtgroup *rdtgrp, *tmp; - - /* Move all tasks to the default resource group */ - rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); - - list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { - /* Free any child rmids */ - free_all_child_rdtgrp(rdtgrp); - - /* Remove each rdtgroup other than root */ - if (rdtgrp == &rdtgroup_default) - continue; - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) - rdtgroup_pseudo_lock_remove(rdtgrp); - - /* - * Give any CPUs back to the default group. We cannot copy - * cpu_online_mask because a CPU might have executed the - * offline callback already, but is still marked online. - */ - cpumask_or(&rdtgroup_default.cpu_mask, - &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); - - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - - kernfs_remove(rdtgrp->kn); - list_del(&rdtgrp->rdtgroup_list); - - if (atomic_read(&rdtgrp->waitcount) != 0) - rdtgrp->flags = RDT_DELETED; - else - rdtgroup_remove(rdtgrp); - } - /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ - update_closid_rmid(cpu_online_mask, &rdtgroup_default); - - kernfs_remove(kn_info); - kernfs_remove(kn_mongrp); - kernfs_remove(kn_mondata); -} - -static void rdt_kill_sb(struct super_block *sb) -{ - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_disable_ctx(); - - /* Put everything back to default values. */ - resctrl_arch_reset_resources(); - - rmdir_all_sub(); - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - rdt_pseudo_lock_release(); - rdtgroup_default.mode = RDT_MODE_SHAREABLE; - schemata_list_destroy(); - rdtgroup_destroy_root(); - if (resctrl_arch_alloc_capable()) - resctrl_arch_disable_alloc(); - if (resctrl_arch_mon_capable()) - resctrl_arch_disable_mon(); - resctrl_mounted = false; - kernfs_kill_sb(sb); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); -} - -static struct file_system_type rdt_fs_type = { - .name = "resctrl", - .init_fs_context = rdt_init_fs_context, - .parameters = rdt_fs_parameters, - .kill_sb = rdt_kill_sb, -}; - -static int mon_addfile(struct kernfs_node *parent_kn, const char *name, - void *priv) -{ - struct kernfs_node *kn; - int ret = 0; - - kn = __kernfs_create_file(parent_kn, name, 0444, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, - &kf_mondata_ops, priv, NULL, NULL); - if (IS_ERR(kn)) - return PTR_ERR(kn); - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) { - kernfs_remove(kn); - return ret; - } - - return ret; -} - -/* - * Remove all subdirectories of mon_data of ctrl_mon groups - * and monitor groups with given domain id. - */ -static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, - unsigned int dom_id) -{ - struct rdtgroup *prgrp, *crgrp; - char name[32]; - - list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { - sprintf(name, "mon_%s_%02d", r->name, dom_id); - kernfs_remove_by_name(prgrp->mon.mon_data_kn, name); - - list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) - kernfs_remove_by_name(crgrp->mon.mon_data_kn, name); - } -} - -static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, - struct rdt_domain *d, - struct rdt_resource *r, struct rdtgroup *prgrp) -{ - union mon_data_bits priv; - struct kernfs_node *kn; - struct mon_evt *mevt; - struct rmid_read rr; - char name[32]; - int ret; - - sprintf(name, "mon_%s_%02d", r->name, d->id); - /* create the directory */ - kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); - if (IS_ERR(kn)) - return PTR_ERR(kn); - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) - goto out_destroy; - - if (WARN_ON(list_empty(&r->evt_list))) { - ret = -EPERM; - goto out_destroy; - } - - priv.u.rid = r->rid; - priv.u.domid = d->id; - list_for_each_entry(mevt, &r->evt_list, list) { - priv.u.evtid = mevt->evtid; - ret = mon_addfile(kn, mevt->name, priv.priv); - if (ret) - goto out_destroy; - - if (resctrl_is_mbm_event(mevt->evtid)) - mon_event_read(&rr, r, d, prgrp, mevt->evtid, true); - } - kernfs_activate(kn); - return 0; - -out_destroy: - kernfs_remove(kn); - return ret; -} - -/* - * Add all subdirectories of mon_data for "ctrl_mon" groups - * and "monitor" groups with given domain id. - */ -static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, - struct rdt_domain *d) -{ - struct kernfs_node *parent_kn; - struct rdtgroup *prgrp, *crgrp; - struct list_head *head; - - list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { - parent_kn = prgrp->mon.mon_data_kn; - mkdir_mondata_subdir(parent_kn, d, r, prgrp); - - head = &prgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) { - parent_kn = crgrp->mon.mon_data_kn; - mkdir_mondata_subdir(parent_kn, d, r, crgrp); - } - } -} - -static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, - struct rdt_resource *r, - struct rdtgroup *prgrp) -{ - struct rdt_domain *dom; - int ret; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - list_for_each_entry(dom, &r->domains, list) { - ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); - if (ret) - return ret; - } - - return 0; -} - -/* - * This creates a directory mon_data which contains the monitored data. - * - * mon_data has one directory for each domain which are named - * in the format mon_<domain_name>_<domain_id>. For ex: A mon_data - * with L3 domain looks as below: - * ./mon_data: - * mon_L3_00 - * mon_L3_01 - * mon_L3_02 - * ... - * - * Each domain directory has one file per event: - * ./mon_L3_00/: - * llc_occupancy - * - */ -static int mkdir_mondata_all(struct kernfs_node *parent_kn, - struct rdtgroup *prgrp, - struct kernfs_node **dest_kn) -{ - enum resctrl_res_level i; - struct rdt_resource *r; - struct kernfs_node *kn; - int ret; - - /* - * Create the mon_data directory first. - */ - ret = mongroup_create_dir(parent_kn, prgrp, "mon_data", &kn); - if (ret) - return ret; - - if (dest_kn) - *dest_kn = kn; - - /* - * Create the subdirectories for each domain. Note that all events - * in a domain like L3 are grouped into a resource whose domain is L3 - */ - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->mon_capable) - continue; - - ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); - if (ret) - goto out_destroy; - } - - return 0; - -out_destroy: - kernfs_remove(kn); - return ret; -} - -/** - * cbm_ensure_valid - Enforce validity on provided CBM - * @_val: Candidate CBM - * @r: RDT resource to which the CBM belongs - * - * The provided CBM represents all cache portions available for use. This - * may be represented by a bitmap that does not consist of contiguous ones - * and thus be an invalid CBM. - * Here the provided CBM is forced to be a valid CBM by only considering - * the first set of contiguous bits as valid and clearing all bits. - * The intention here is to provide a valid default CBM with which a new - * resource group is initialized. The user can follow this with a - * modification to the CBM if the default does not satisfy the - * requirements. - */ -static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r) -{ - unsigned int cbm_len = r->cache.cbm_len; - unsigned long first_bit, zero_bit; - unsigned long val = _val; - - if (!val) - return 0; - - first_bit = find_first_bit(&val, cbm_len); - zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); - - /* Clear any remaining bits to ensure contiguous region */ - bitmap_clear(&val, zero_bit, cbm_len - zero_bit); - return (u32)val; -} - -/* - * Initialize cache resources per RDT domain - * - * Set the RDT domain up to start off with all usable allocations. That is, - * all shareable and unused bits. All-zero CBM is invalid. - */ -static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s, - u32 closid) -{ - enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); - enum resctrl_conf_type t = s->conf_type; - struct resctrl_staged_config *cfg; - struct rdt_resource *r = s->res; - u32 used_b = 0, unused_b = 0; - unsigned long tmp_cbm; - enum rdtgrp_mode mode; - u32 peer_ctl, ctrl_val; - int i; - - cfg = &d->staged_config[t]; - cfg->have_new_ctrl = false; - cfg->new_ctrl = r->cache.shareable_bits; - used_b = r->cache.shareable_bits; - for (i = 0; i < closids_supported(); i++) { - if (closid_allocated(i) && i != closid) { - mode = rdtgroup_mode_by_closid(i); - if (mode == RDT_MODE_PSEUDO_LOCKSETUP) - /* - * ctrl values for locksetup aren't relevant - * until the schemata is written, and the mode - * becomes RDT_MODE_PSEUDO_LOCKED. - */ - continue; - /* - * If CDP is active include peer domain's - * usage to ensure there is no overlap - * with an exclusive group. - */ - if (resctrl_arch_get_cdp_enabled(r->rid)) - peer_ctl = resctrl_arch_get_config(r, d, i, - peer_type); - else - peer_ctl = 0; - ctrl_val = resctrl_arch_get_config(r, d, i, - s->conf_type); - used_b |= ctrl_val | peer_ctl; - if (mode == RDT_MODE_SHAREABLE) - cfg->new_ctrl |= ctrl_val | peer_ctl; - } - } - if (d->plr && d->plr->cbm > 0) - used_b |= d->plr->cbm; - unused_b = used_b ^ (BIT_MASK(r->cache.cbm_len) - 1); - unused_b &= BIT_MASK(r->cache.cbm_len) - 1; - cfg->new_ctrl |= unused_b; - /* - * Force the initial CBM to be valid, user can - * modify the CBM based on system availability. - */ - cfg->new_ctrl = cbm_ensure_valid(cfg->new_ctrl, r); - /* - * Assign the u32 CBM to an unsigned long to ensure that - * bitmap_weight() does not access out-of-bound memory. - */ - tmp_cbm = cfg->new_ctrl; - if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) { - rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id); - return -ENOSPC; - } - cfg->have_new_ctrl = true; - - return 0; -} - -/* - * Initialize cache resources with default values. - * - * A new RDT group is being created on an allocation capable (CAT) - * supporting system. Set this group up to start off with all usable - * allocations. - * - * If there are no more shareable bits available on any domain then - * the entire allocation will fail. - */ -static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) -{ - struct rdt_domain *d; - int ret; - - list_for_each_entry(d, &s->res->domains, list) { - ret = __init_one_rdt_domain(d, s, closid); - if (ret < 0) - return ret; - } - - return 0; -} - -/* Initialize MBA resource with default values. */ -static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) -{ - struct resctrl_staged_config *cfg; - struct rdt_domain *d; - - list_for_each_entry(d, &r->domains, list) { - if (is_mba_sc(r)) { - d->mbps_val[closid] = MBA_MAX_MBPS; - continue; - } - - cfg = &d->staged_config[CDP_NONE]; - cfg->new_ctrl = r->default_ctrl; - cfg->have_new_ctrl = true; - } -} - -/* Initialize the RDT group's allocations. */ -static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) -{ - struct resctrl_schema *s; - struct rdt_resource *r; - int ret = 0; - - rdt_staged_configs_clear(); - - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - if (r->rid == RDT_RESOURCE_MBA || - r->rid == RDT_RESOURCE_SMBA) { - rdtgroup_init_mba(r, rdtgrp->closid); - if (is_mba_sc(r)) - continue; - } else { - ret = rdtgroup_init_cat(s, rdtgrp->closid); - if (ret < 0) - goto out; - } - - ret = resctrl_arch_update_domains(r, rdtgrp->closid); - if (ret < 0) { - rdt_last_cmd_puts("Failed to initialize allocations\n"); - goto out; - } - - } - - rdtgrp->mode = RDT_MODE_SHAREABLE; - -out: - rdt_staged_configs_clear(); - return ret; -} - -static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) -{ - int ret; - - if (!resctrl_arch_mon_capable()) - return 0; - - ret = alloc_rmid(rdtgrp->closid); - if (ret < 0) { - rdt_last_cmd_puts("Out of RMIDs\n"); - return ret; - } - rdtgrp->mon.rmid = ret; - - ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); - if (ret) { - rdt_last_cmd_puts("kernfs subdir error\n"); - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - return ret; - } - - return 0; -} - -static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) -{ - if (resctrl_arch_mon_capable()) - free_rmid(rgrp->closid, rgrp->mon.rmid); -} - -static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, - const char *name, umode_t mode, - enum rdt_group_type rtype, struct rdtgroup **r) -{ - struct rdtgroup *prdtgrp, *rdtgrp; - unsigned long files = 0; - struct kernfs_node *kn; - int ret; - - prdtgrp = rdtgroup_kn_lock_live(parent_kn); - if (!prdtgrp) { - ret = -ENODEV; - goto out_unlock; - } - - if (rtype == RDTMON_GROUP && - (prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || - prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) { - ret = -EINVAL; - rdt_last_cmd_puts("Pseudo-locking in progress\n"); - goto out_unlock; - } - - /* allocate the rdtgroup. */ - rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL); - if (!rdtgrp) { - ret = -ENOSPC; - rdt_last_cmd_puts("Kernel out of memory\n"); - goto out_unlock; - } - *r = rdtgrp; - rdtgrp->mon.parent = prdtgrp; - rdtgrp->type = rtype; - INIT_LIST_HEAD(&rdtgrp->mon.crdtgrp_list); - - /* kernfs creates the directory for rdtgrp */ - kn = kernfs_create_dir(parent_kn, name, mode, rdtgrp); - if (IS_ERR(kn)) { - ret = PTR_ERR(kn); - rdt_last_cmd_puts("kernfs create error\n"); - goto out_free_rgrp; - } - rdtgrp->kn = kn; - - /* - * kernfs_remove() will drop the reference count on "kn" which - * will free it. But we still need it to stick around for the - * rdtgroup_kn_unlock(kn) call. Take one extra reference here, - * which will be dropped by kernfs_put() in rdtgroup_remove(). - */ - kernfs_get(kn); - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) { - rdt_last_cmd_puts("kernfs perm error\n"); - goto out_destroy; - } - - if (rtype == RDTCTRL_GROUP) { - files = RFTYPE_BASE | RFTYPE_CTRL; - if (resctrl_arch_mon_capable()) - files |= RFTYPE_MON; - } else { - files = RFTYPE_BASE | RFTYPE_MON; - } - - ret = rdtgroup_add_files(kn, files); - if (ret) { - rdt_last_cmd_puts("kernfs fill error\n"); - goto out_destroy; - } - - /* - * The caller unlocks the parent_kn upon success. - */ - return 0; - -out_destroy: - kernfs_put(rdtgrp->kn); - kernfs_remove(rdtgrp->kn); -out_free_rgrp: - kfree(rdtgrp); -out_unlock: - rdtgroup_kn_unlock(parent_kn); - return ret; -} - -static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) -{ - kernfs_remove(rgrp->kn); - rdtgroup_remove(rgrp); -} - -/* - * Create a monitor group under "mon_groups" directory of a control - * and monitor group(ctrl_mon). This is a resource group - * to monitor a subset of tasks and cpus in its parent ctrl_mon group. - */ -static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, - const char *name, umode_t mode) -{ - struct rdtgroup *rdtgrp, *prgrp; - int ret; - - ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTMON_GROUP, &rdtgrp); - if (ret) - return ret; - - prgrp = rdtgrp->mon.parent; - rdtgrp->closid = prgrp->closid; - - ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); - if (ret) { - mkdir_rdt_prepare_clean(rdtgrp); - goto out_unlock; - } - - kernfs_activate(rdtgrp->kn); - - /* - * Add the rdtgrp to the list of rdtgrps the parent - * ctrl_mon group has to track. - */ - list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list); - -out_unlock: - rdtgroup_kn_unlock(parent_kn); - return ret; -} - -/* - * These are rdtgroups created under the root directory. Can be used - * to allocate and monitor resources. - */ -static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, - const char *name, umode_t mode) -{ - struct rdtgroup *rdtgrp; - struct kernfs_node *kn; - u32 closid; - int ret; - - ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp); - if (ret) - return ret; - - kn = rdtgrp->kn; - ret = closid_alloc(); - if (ret < 0) { - rdt_last_cmd_puts("Out of CLOSIDs\n"); - goto out_common_fail; - } - closid = ret; - ret = 0; - - rdtgrp->closid = closid; - - ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); - if (ret) - goto out_closid_free; - - kernfs_activate(rdtgrp->kn); - - ret = rdtgroup_init_alloc(rdtgrp); - if (ret < 0) - goto out_rmid_free; - - list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); - - if (resctrl_arch_mon_capable()) { - /* - * Create an empty mon_groups directory to hold the subset - * of tasks and cpus to monitor. - */ - ret = mongroup_create_dir(kn, rdtgrp, "mon_groups", NULL); - if (ret) { - rdt_last_cmd_puts("kernfs subdir error\n"); - goto out_del_list; - } - } - - goto out_unlock; - -out_del_list: - list_del(&rdtgrp->rdtgroup_list); -out_rmid_free: - mkdir_rdt_prepare_rmid_free(rdtgrp); -out_closid_free: - closid_free(closid); -out_common_fail: - mkdir_rdt_prepare_clean(rdtgrp); -out_unlock: - rdtgroup_kn_unlock(parent_kn); - return ret; -} - -/* - * We allow creating mon groups only with in a directory called "mon_groups" - * which is present in every ctrl_mon group. Check if this is a valid - * "mon_groups" directory. - * - * 1. The directory should be named "mon_groups". - * 2. The mon group itself should "not" be named "mon_groups". - * This makes sure "mon_groups" directory always has a ctrl_mon group - * as parent. - */ -static bool is_mon_groups(struct kernfs_node *kn, const char *name) -{ - return (!strcmp(kn->name, "mon_groups") && - strcmp(name, "mon_groups")); -} - -static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, - umode_t mode) -{ - /* Do not accept '\n' to avoid unparsable situation. */ - if (strchr(name, '\n')) - return -EINVAL; - - /* - * If the parent directory is the root directory and RDT - * allocation is supported, add a control and monitoring - * subdirectory - */ - if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn) - return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode); - - /* - * If RDT monitoring is supported and the parent directory is a valid - * "mon_groups" directory, add a monitoring subdirectory. - */ - if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name)) - return rdtgroup_mkdir_mon(parent_kn, name, mode); - - return -EPERM; -} - -static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) -{ - struct rdtgroup *prdtgrp = rdtgrp->mon.parent; - int cpu; - - /* Give any tasks back to the parent group */ - rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask); - - /* Update per cpu rmid of the moved CPUs first */ - for_each_cpu(cpu, &rdtgrp->cpu_mask) - resctrl_arch_set_cpu_default_closid_rmid(cpu, rdtgrp->closid, - prdtgrp->mon.rmid); - - /* - * Update the MSR on moved CPUs and CPUs which have moved - * task running on them. - */ - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); - update_closid_rmid(tmpmask, NULL); - - rdtgrp->flags = RDT_DELETED; - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - - /* - * Remove the rdtgrp from the parent ctrl_mon group's list - */ - WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); - list_del(&rdtgrp->mon.crdtgrp_list); - - kernfs_remove(rdtgrp->kn); - - return 0; -} - -static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp) -{ - rdtgrp->flags = RDT_DELETED; - list_del(&rdtgrp->rdtgroup_list); - - kernfs_remove(rdtgrp->kn); - return 0; -} - -static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) -{ - u32 closid, rmid; - int cpu; - - /* Give any tasks back to the default group */ - rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); - - /* Give any CPUs back to the default group */ - cpumask_or(&rdtgroup_default.cpu_mask, - &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); - - /* Update per cpu closid and rmid of the moved CPUs first */ - closid = rdtgroup_default.closid; - rmid = rdtgroup_default.mon.rmid; - for_each_cpu(cpu, &rdtgrp->cpu_mask) - resctrl_arch_set_cpu_default_closid_rmid(cpu, closid, rmid); - - /* - * Update the MSR on moved CPUs and CPUs which have moved - * task running on them. - */ - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); - update_closid_rmid(tmpmask, NULL); - - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - closid_free(rdtgrp->closid); - - rdtgroup_ctrl_remove(rdtgrp); - - /* - * Free all the child monitor group rmids. - */ - free_all_child_rdtgrp(rdtgrp); - - return 0; -} - -static int rdtgroup_rmdir(struct kernfs_node *kn) -{ - struct kernfs_node *parent_kn = kn->parent; - struct rdtgroup *rdtgrp; - cpumask_var_t tmpmask; - int ret = 0; - - if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) - return -ENOMEM; - - rdtgrp = rdtgroup_kn_lock_live(kn); - if (!rdtgrp) { - ret = -EPERM; - goto out; - } - - /* - * If the rdtgroup is a ctrl_mon group and parent directory - * is the root directory, remove the ctrl_mon group. - * - * If the rdtgroup is a mon group and parent directory - * is a valid "mon_groups" directory, remove the mon group. - */ - if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn && - rdtgrp != &rdtgroup_default) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - ret = rdtgroup_ctrl_remove(rdtgrp); - } else { - ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask); - } - } else if (rdtgrp->type == RDTMON_GROUP && - is_mon_groups(parent_kn, kn->name)) { - ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask); - } else { - ret = -EPERM; - } - -out: - rdtgroup_kn_unlock(kn); - free_cpumask_var(tmpmask); - return ret; -} - -/** - * mongrp_reparent() - replace parent CTRL_MON group of a MON group - * @rdtgrp: the MON group whose parent should be replaced - * @new_prdtgrp: replacement parent CTRL_MON group for @rdtgrp - * @cpus: cpumask provided by the caller for use during this call - * - * Replaces the parent CTRL_MON group for a MON group, resulting in all member - * tasks' CLOSID immediately changing to that of the new parent group. - * Monitoring data for the group is unaffected by this operation. - */ -static void mongrp_reparent(struct rdtgroup *rdtgrp, - struct rdtgroup *new_prdtgrp, - cpumask_var_t cpus) -{ - struct rdtgroup *prdtgrp = rdtgrp->mon.parent; - - WARN_ON(rdtgrp->type != RDTMON_GROUP); - WARN_ON(new_prdtgrp->type != RDTCTRL_GROUP); - - /* Nothing to do when simply renaming a MON group. */ - if (prdtgrp == new_prdtgrp) - return; - - WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); - list_move_tail(&rdtgrp->mon.crdtgrp_list, - &new_prdtgrp->mon.crdtgrp_list); - - rdtgrp->mon.parent = new_prdtgrp; - rdtgrp->closid = new_prdtgrp->closid; - - /* Propagate updated closid to all tasks in this group. */ - rdt_move_group_tasks(rdtgrp, rdtgrp, cpus); - - update_closid_rmid(cpus, NULL); -} - -static int rdtgroup_rename(struct kernfs_node *kn, - struct kernfs_node *new_parent, const char *new_name) -{ - struct rdtgroup *new_prdtgrp; - struct rdtgroup *rdtgrp; - cpumask_var_t tmpmask; - int ret; - - rdtgrp = kernfs_to_rdtgroup(kn); - new_prdtgrp = kernfs_to_rdtgroup(new_parent); - if (!rdtgrp || !new_prdtgrp) - return -ENOENT; - - /* Release both kernfs active_refs before obtaining rdtgroup mutex. */ - rdtgroup_kn_get(rdtgrp, kn); - rdtgroup_kn_get(new_prdtgrp, new_parent); - - mutex_lock(&rdtgroup_mutex); - - rdt_last_cmd_clear(); - - /* - * Don't allow kernfs_to_rdtgroup() to return a parent rdtgroup if - * either kernfs_node is a file. - */ - if (kernfs_type(kn) != KERNFS_DIR || - kernfs_type(new_parent) != KERNFS_DIR) { - rdt_last_cmd_puts("Source and destination must be directories"); - ret = -EPERM; - goto out; - } - - if ((rdtgrp->flags & RDT_DELETED) || (new_prdtgrp->flags & RDT_DELETED)) { - ret = -ENOENT; - goto out; - } - - if (rdtgrp->type != RDTMON_GROUP || !kn->parent || - !is_mon_groups(kn->parent, kn->name)) { - rdt_last_cmd_puts("Source must be a MON group\n"); - ret = -EPERM; - goto out; - } - - if (!is_mon_groups(new_parent, new_name)) { - rdt_last_cmd_puts("Destination must be a mon_groups subdirectory\n"); - ret = -EPERM; - goto out; - } - - /* - * If the MON group is monitoring CPUs, the CPUs must be assigned to the - * current parent CTRL_MON group and therefore cannot be assigned to - * the new parent, making the move illegal. - */ - if (!cpumask_empty(&rdtgrp->cpu_mask) && - rdtgrp->mon.parent != new_prdtgrp) { - rdt_last_cmd_puts("Cannot move a MON group that monitors CPUs\n"); - ret = -EPERM; - goto out; - } - - /* - * Allocate the cpumask for use in mongrp_reparent() to avoid the - * possibility of failing to allocate it after kernfs_rename() has - * succeeded. - */ - if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) { - ret = -ENOMEM; - goto out; - } - - /* - * Perform all input validation and allocations needed to ensure - * mongrp_reparent() will succeed before calling kernfs_rename(), - * otherwise it would be necessary to revert this call if - * mongrp_reparent() failed. - */ - ret = kernfs_rename(kn, new_parent, new_name); - if (!ret) - mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask); - - free_cpumask_var(tmpmask); - -out: - mutex_unlock(&rdtgroup_mutex); - rdtgroup_kn_put(rdtgrp, kn); - rdtgroup_kn_put(new_prdtgrp, new_parent); - return ret; -} - -static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) -{ - if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3)) - seq_puts(seq, ",cdp"); - - if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) - seq_puts(seq, ",cdpl2"); - - if (is_mba_sc(resctrl_arch_get_resource(RDT_RESOURCE_MBA))) - seq_puts(seq, ",mba_MBps"); - - if (resctrl_debug) - seq_puts(seq, ",debug"); - - return 0; -} - -static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = { - .mkdir = rdtgroup_mkdir, - .rmdir = rdtgroup_rmdir, - .rename = rdtgroup_rename, - .show_options = rdtgroup_show_options, -}; - -static int rdtgroup_setup_root(struct rdt_fs_context *ctx) -{ - rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops, - KERNFS_ROOT_CREATE_DEACTIVATED | - KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK, - &rdtgroup_default); - if (IS_ERR(rdt_root)) - return PTR_ERR(rdt_root); - - ctx->kfc.root = rdt_root; - rdtgroup_default.kn = kernfs_root_to_node(rdt_root); - - return 0; -} - -static void rdtgroup_destroy_root(void) -{ - kernfs_destroy_root(rdt_root); - rdtgroup_default.kn = NULL; -} - -static void __init rdtgroup_setup_default(void) -{ - mutex_lock(&rdtgroup_mutex); - - rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID; - rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID; - rdtgroup_default.type = RDTCTRL_GROUP; - INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list); - - list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups); - - mutex_unlock(&rdtgroup_mutex); -} - -static void domain_destroy_mon_state(struct rdt_domain *d) -{ - bitmap_free(d->rmid_busy_llc); - kfree(d->mbm_total); - kfree(d->mbm_local); -} - -void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) -{ - mutex_lock(&rdtgroup_mutex); - - if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) - mba_sc_domain_destroy(r, d); - - if (!r->mon_capable) - goto out_unlock; - - /* - * If resctrl is mounted, remove all the - * per domain monitor data directories. - */ - if (resctrl_mounted && resctrl_arch_mon_capable()) - rmdir_mondata_subdir_allrdtgrp(r, d->id); - - if (resctrl_is_mbm_enabled()) - cancel_delayed_work(&d->mbm_over); - if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) { - /* - * When a package is going down, forcefully - * decrement rmid->ebusy. There is no way to know - * that the L3 was flushed and hence may lead to - * incorrect counts in rare scenarios, but leaving - * the RMID as busy creates RMID leaks if the - * package never comes back. - */ - __check_limbo(d, true); - cancel_delayed_work(&d->cqm_limbo); - } - - domain_destroy_mon_state(d); - -out_unlock: - mutex_unlock(&rdtgroup_mutex); -} - -static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) -{ - u32 idx_limit = resctrl_arch_system_num_rmid_idx(); - size_t tsize; - - if (resctrl_arch_is_llc_occupancy_enabled()) { - d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL); - if (!d->rmid_busy_llc) - return -ENOMEM; - } - if (resctrl_arch_is_mbm_total_enabled()) { - tsize = sizeof(*d->mbm_total); - d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL); - if (!d->mbm_total) { - bitmap_free(d->rmid_busy_llc); - return -ENOMEM; - } - } - if (resctrl_arch_is_mbm_local_enabled()) { - tsize = sizeof(*d->mbm_local); - d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL); - if (!d->mbm_local) { - bitmap_free(d->rmid_busy_llc); - kfree(d->mbm_total); - return -ENOMEM; - } - } - - return 0; -} - -int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) -{ - int err = 0; - - mutex_lock(&rdtgroup_mutex); - - if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) { - /* RDT_RESOURCE_MBA is never mon_capable */ - err = mba_sc_domain_allocate(r, d); - goto out_unlock; - } - - if (!r->mon_capable) - goto out_unlock; - - err = domain_setup_mon_state(r, d); - if (err) - goto out_unlock; - - if (resctrl_is_mbm_enabled()) { - INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); - mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, - RESCTRL_PICK_ANY_CPU); - } - - if (resctrl_arch_is_llc_occupancy_enabled()) - INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo); - - /* - * If the filesystem is not mounted then only the default resource group - * exists. Creation of its directories is deferred until mount time - * by rdt_get_tree() calling mkdir_mondata_all(). - * If resctrl is mounted, add per domain monitor data directories. - */ - if (resctrl_mounted && resctrl_arch_mon_capable()) - mkdir_mondata_subdir_allrdtgrp(r, d); - -out_unlock: - mutex_unlock(&rdtgroup_mutex); - - return err; -} - -void resctrl_online_cpu(unsigned int cpu) -{ - mutex_lock(&rdtgroup_mutex); - /* The CPU is set in default rdtgroup after online. */ - cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); - mutex_unlock(&rdtgroup_mutex); -} - -static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) -{ - struct rdtgroup *cr; - - list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) { - if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) - break; - } -} - -void resctrl_offline_cpu(unsigned int cpu) -{ - struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); - struct rdtgroup *rdtgrp; - struct rdt_domain *d; - - mutex_lock(&rdtgroup_mutex); - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { - if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { - clear_childcpus(rdtgrp, cpu); - break; - } - } - - if (!l3->mon_capable) - goto out_unlock; - - d = resctrl_get_domain_from_cpu(cpu, l3); - if (d) { - if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) { - cancel_delayed_work(&d->mbm_over); - mbm_setup_overflow_handler(d, 0, cpu); - } - if (resctrl_arch_is_llc_occupancy_enabled() && - cpu == d->cqm_work_cpu && has_busy_rmid(d)) { - cancel_delayed_work(&d->cqm_limbo); - cqm_setup_limbo_handler(d, 0, cpu); - } - } - -out_unlock: - mutex_unlock(&rdtgroup_mutex); -} - -/* - * resctrl_init - resctrl filesystem initialization - * - * Setup resctrl file system including set up root, create mount point, - * register resctrl filesystem, and initialize files under root directory. - * - * Return: 0 on success or -errno - */ -int resctrl_init(void) -{ - int ret = 0; - - seq_buf_init(&last_cmd_status, last_cmd_status_buf, - sizeof(last_cmd_status_buf)); - - rdtgroup_setup_default(); - - thread_throttle_mode_init(); - - ret = resctrl_mon_resource_init(); - if (ret) - return ret; - - ret = sysfs_create_mount_point(fs_kobj, "resctrl"); - if (ret) - return ret; - - ret = register_filesystem(&rdt_fs_type); - if (ret) - goto cleanup_mountpoint; - - /* - * Adding the resctrl debugfs directory here may not be ideal since - * it would let the resctrl debugfs directory appear on the debugfs - * filesystem before the resctrl filesystem is mounted. - * It may also be ok since that would enable debugging of RDT before - * resctrl is mounted. - * The reason why the debugfs directory is created here and not in - * rdt_get_tree() is because rdt_get_tree() takes rdtgroup_mutex and - * during the debugfs directory creation also &sb->s_type->i_mutex_key - * (the lockdep class of inode->i_rwsem). Other filesystem - * interactions (eg. SyS_getdents) have the lock ordering: - * &sb->s_type->i_mutex_key --> &mm->mmap_lock - * During mmap(), called with &mm->mmap_lock, the rdtgroup_mutex - * is taken, thus creating dependency: - * &mm->mmap_lock --> rdtgroup_mutex for the latter that can cause - * issues considering the other two lock dependencies. - * By creating the debugfs directory here we avoid a dependency - * that may cause deadlock (even though file operations cannot - * occur until the filesystem is mounted, but I do not know how to - * tell lockdep that). - */ - debugfs_resctrl = debugfs_create_dir("resctrl", NULL); - - return 0; - -cleanup_mountpoint: - sysfs_remove_mount_point(fs_kobj, "resctrl"); - - return ret; -} - -void resctrl_exit(void) -{ - debugfs_remove_recursive(debugfs_resctrl); - unregister_filesystem(&rdt_fs_type); - sysfs_remove_mount_point(fs_kobj, "resctrl"); + struct rdt_resource *r;
- resctrl_mon_resource_exit(); + for_each_capable_rdt_resource(r) + reset_all_ctrls(r); } diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index e69de29bb2d1..62a6a67f1192 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -0,0 +1,528 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Resource Director Technology(RDT) + * - Cache Allocation code. + * + * Copyright (C) 2016 Intel Corporation + * + * Authors: + * Fenghua Yu fenghua.yu@intel.com + * Tony Luck tony.luck@intel.com + * + * More information about RDT be found in the Intel (R) x86 Architecture + * Software Developer Manual June 2016, volume 3, section 17.17. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/cpu.h> +#include <linux/kernfs.h> +#include <linux/seq_file.h> +#include <linux/slab.h> +#include "internal.h" + +struct rdt_parse_data { + struct rdtgroup *rdtgrp; + char *buf; +}; + +typedef int (ctrlval_parser_t)(struct rdt_parse_data *data, + struct resctrl_schema *s, + struct rdt_domain *d); + +/* + * Check whether MBA bandwidth percentage value is correct. The value is + * checked against the minimum and max bandwidth values specified by the + * hardware. The allocated bandwidth percentage is rounded to the next + * control step available on the hardware. + */ +static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r) +{ + unsigned long bw; + int ret; + + /* + * Only linear delay values is supported for current Intel SKUs. + */ + if (!r->membw.delay_linear && r->membw.arch_needs_linear) { + rdt_last_cmd_puts("No support for non-linear MB domains\n"); + return false; + } + + ret = kstrtoul(buf, 10, &bw); + if (ret) { + rdt_last_cmd_printf("Non-decimal digit in MB value %s\n", buf); + return false; + } + + if ((bw < r->membw.min_bw || bw > r->default_ctrl) && + !is_mba_sc(r)) { + rdt_last_cmd_printf("MB value %ld out of range [%d,%d]\n", bw, + r->membw.min_bw, r->default_ctrl); + return false; + } + + *data = roundup(bw, (unsigned long)r->membw.bw_gran); + return true; +} + +static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) +{ + struct resctrl_staged_config *cfg; + u32 closid = data->rdtgrp->closid; + struct rdt_resource *r = s->res; + unsigned long bw_val; + + cfg = &d->staged_config[s->conf_type]; + if (cfg->have_new_ctrl) { + rdt_last_cmd_printf("Duplicate domain %d\n", d->id); + return -EINVAL; + } + + if (!bw_validate(data->buf, &bw_val, r)) + return -EINVAL; + + if (is_mba_sc(r)) { + d->mbps_val[closid] = bw_val; + return 0; + } + + cfg->new_ctrl = bw_val; + cfg->have_new_ctrl = true; + + return 0; +} + +/* + * Check whether a cache bit mask is valid. + * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID: + * - CPUID.0x10.1:ECX[3]: L3 non-contiguous 1s value supported if 1 + * - CPUID.0x10.2:ECX[3]: L2 non-contiguous 1s value supported if 1 + * + * Haswell does not support a non-contiguous 1s value and additionally + * requires at least two bits set. + * AMD allows non-contiguous bitmasks. + */ +static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) +{ + unsigned long first_bit, zero_bit, val; + unsigned int cbm_len = r->cache.cbm_len; + int ret; + + ret = kstrtoul(buf, 16, &val); + if (ret) { + rdt_last_cmd_printf("Non-hex character in the mask %s\n", buf); + return false; + } + + if ((r->cache.min_cbm_bits > 0 && val == 0) || val > r->default_ctrl) { + rdt_last_cmd_puts("Mask out of range\n"); + return false; + } + + first_bit = find_first_bit(&val, cbm_len); + zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); + + /* Are non-contiguous bitmasks allowed? */ + if (!r->cache.arch_has_sparse_bitmasks && + (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)) { + rdt_last_cmd_printf("The mask %lx has non-consecutive 1-bits\n", val); + return false; + } + + if ((zero_bit - first_bit) < r->cache.min_cbm_bits) { + rdt_last_cmd_printf("Need at least %d bits in the mask\n", + r->cache.min_cbm_bits); + return false; + } + + *data = val; + return true; +} + +/* + * Read one cache bit mask (hex). Check that it is valid for the current + * resource type. + */ +static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) +{ + struct rdtgroup *rdtgrp = data->rdtgrp; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + u32 cbm_val; + + cfg = &d->staged_config[s->conf_type]; + if (cfg->have_new_ctrl) { + rdt_last_cmd_printf("Duplicate domain %d\n", d->id); + return -EINVAL; + } + + /* + * Cannot set up more than one pseudo-locked region in a cache + * hierarchy. + */ + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && + rdtgroup_pseudo_locked_in_hierarchy(d)) { + rdt_last_cmd_puts("Pseudo-locked region in hierarchy\n"); + return -EINVAL; + } + + if (!cbm_validate(data->buf, &cbm_val, r)) + return -EINVAL; + + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + (rdtgrp->mode == RDT_MODE_EXCLUSIVE || + rdtgrp->mode == RDT_MODE_SHAREABLE) && + rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) { + rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n"); + return -EINVAL; + } + + /* + * The CBM may not overlap with the CBM of another closid if + * either is exclusive. + */ + if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, true)) { + rdt_last_cmd_puts("Overlaps with exclusive group\n"); + return -EINVAL; + } + + if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, false)) { + if (rdtgrp->mode == RDT_MODE_EXCLUSIVE || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + rdt_last_cmd_puts("Overlaps with other group\n"); + return -EINVAL; + } + } + + cfg->new_ctrl = cbm_val; + cfg->have_new_ctrl = true; + + return 0; +} + +static ctrlval_parser_t *get_parser(struct rdt_resource *res) +{ + if (res->fflags & RFTYPE_RES_CACHE) + return &parse_cbm; + else + return &parse_bw; +} + +/* + * For each domain in this resource we expect to find a series of: + * id=mask + * separated by ";". The "id" is in decimal, and must match one of + * the "id"s for this resource. + */ +static int parse_line(char *line, struct resctrl_schema *s, + struct rdtgroup *rdtgrp) +{ + ctrlval_parser_t *parse_ctrlval = get_parser(s->res); + enum resctrl_conf_type t = s->conf_type; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + struct rdt_parse_data data; + char *dom = NULL, *id; + struct rdt_domain *d; + unsigned long dom_id; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && + (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) { + rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n"); + return -EINVAL; + } + +next: + if (!line || line[0] == '\0') + return 0; + dom = strsep(&line, ";"); + id = strsep(&dom, "="); + if (!dom || kstrtoul(id, 10, &dom_id)) { + rdt_last_cmd_puts("Missing '=' or non-numeric domain\n"); + return -EINVAL; + } + dom = strim(dom); + list_for_each_entry(d, &r->domains, list) { + if (d->id == dom_id) { + data.buf = dom; + data.rdtgrp = rdtgrp; + if (parse_ctrlval(&data, s, d)) + return -EINVAL; + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + cfg = &d->staged_config[t]; + /* + * In pseudo-locking setup mode and just + * parsed a valid CBM that should be + * pseudo-locked. Only one locked region per + * resource group and domain so just do + * the required initialization for single + * region and return. + */ + rdtgrp->plr->s = s; + rdtgrp->plr->d = d; + rdtgrp->plr->cbm = cfg->new_ctrl; + d->plr = rdtgrp->plr; + return 0; + } + goto next; + } + } + return -EINVAL; +} + +static int rdtgroup_parse_resource(char *resname, char *tok, + struct rdtgroup *rdtgrp) +{ + struct resctrl_schema *s; + + list_for_each_entry(s, &resctrl_schema_all, list) { + if (!strcmp(resname, s->name) && rdtgrp->closid < s->num_closid) + return parse_line(tok, s, rdtgrp); + } + rdt_last_cmd_printf("Unknown or unsupported resource name '%s'\n", resname); + return -EINVAL; +} + +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct resctrl_schema *s; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + char *tok, *resname; + int ret = 0; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + buf[nbytes - 1] = '\0'; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + rdt_last_cmd_clear(); + + /* + * No changes to pseudo-locked region allowed. It has to be removed + * and re-created instead. + */ + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + ret = -EINVAL; + rdt_last_cmd_puts("Resource group is pseudo-locked\n"); + goto out; + } + + rdt_staged_configs_clear(); + + while ((tok = strsep(&buf, "\n")) != NULL) { + resname = strim(strsep(&tok, ":")); + if (!tok) { + rdt_last_cmd_puts("Missing ':'\n"); + ret = -EINVAL; + goto out; + } + if (tok[0] == '\0') { + rdt_last_cmd_printf("Missing '%s' value\n", resname); + ret = -EINVAL; + goto out; + } + ret = rdtgroup_parse_resource(resname, tok, rdtgrp); + if (ret) + goto out; + } + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + + /* + * Writes to mba_sc resources update the software controller, + * not the control MSR. + */ + if (is_mba_sc(r)) + continue; + + ret = resctrl_arch_update_domains(r, rdtgrp->closid); + if (ret) + goto out; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + /* + * If pseudo-locking fails we keep the resource group in + * mode RDT_MODE_PSEUDO_LOCKSETUP with its class of service + * active and updated for just the domain the pseudo-locked + * region was requested for. + */ + ret = rdtgroup_pseudo_lock_create(rdtgrp); + } + +out: + rdt_staged_configs_clear(); + rdtgroup_kn_unlock(of->kn); + return ret ?: nbytes; +} + +static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid) +{ + struct rdt_resource *r = schema->res; + struct rdt_domain *dom; + bool sep = false; + u32 ctrl_val; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + seq_printf(s, "%*s:", max_name_width, schema->name); + list_for_each_entry(dom, &r->domains, list) { + if (sep) + seq_puts(s, ";"); + + if (is_mba_sc(r)) + ctrl_val = dom->mbps_val[closid]; + else + ctrl_val = resctrl_arch_get_config(r, dom, closid, + schema->conf_type); + + seq_printf(s, r->format_str, dom->id, max_data_width, + ctrl_val); + sep = true; + } + seq_puts(s, "\n"); +} + +int rdtgroup_schemata_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct resctrl_schema *schema; + struct rdtgroup *rdtgrp; + int ret = 0; + u32 closid; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + list_for_each_entry(schema, &resctrl_schema_all, list) { + seq_printf(s, "%s:uninitialized\n", schema->name); + } + } else if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + seq_printf(s, "%s:%d=%x\n", + rdtgrp->plr->s->res->name, + rdtgrp->plr->d->id, + rdtgrp->plr->cbm); + } + } else { + closid = rdtgrp->closid; + list_for_each_entry(schema, &resctrl_schema_all, list) { + if (closid < schema->num_closid) + show_doms(s, schema, closid); + } + } + } else { + ret = -ENOENT; + } + rdtgroup_kn_unlock(of->kn); + return ret; +} + +static int smp_mon_event_count(void *arg) +{ + mon_event_count(arg); + + return 0; +} + +void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, + struct rdt_domain *d, struct rdtgroup *rdtgrp, + int evtid, int first) +{ + int cpu; + + /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + /* + * Setup the parameters to pass to mon_event_count() to read the data. + */ + rr->rgrp = rdtgrp; + rr->evtid = evtid; + rr->r = r; + rr->d = d; + rr->val = 0; + rr->first = first; + rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid); + if (IS_ERR(rr->arch_mon_ctx)) { + rr->err = -EINVAL; + return; + } + + cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU); + + /* + * cpumask_any_housekeeping() prefers housekeeping CPUs, but + * are all the CPUs nohz_full? If yes, pick a CPU to IPI. + * MPAM's resctrl_arch_rmid_read() is unable to read the + * counters on some platforms if its called in irq context. + */ + if (tick_nohz_full_cpu(cpu)) + smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); + else + smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); + + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx); +} + +int rdtgroup_mondata_show(struct seq_file *m, void *arg) +{ + struct kernfs_open_file *of = m->private; + u32 resid, evtid, domid; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + union mon_data_bits md; + struct rdt_domain *d; + struct rmid_read rr; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto out; + } + + md.priv = of->kn->priv; + resid = md.u.rid; + domid = md.u.domid; + evtid = md.u.evtid; + + r = resctrl_arch_get_resource(resid); + d = resctrl_arch_find_domain(r, domid); + if (IS_ERR_OR_NULL(d)) { + ret = -ENOENT; + goto out; + } + + mon_event_read(&rr, r, d, rdtgrp, evtid, false); + + if (rr.err == -EIO) + seq_puts(m, "Error\n"); + else if (rr.err == -EINVAL) + seq_puts(m, "Unavailable\n"); + else + seq_printf(m, "%llu\n", rr.val); + +out: + rdtgroup_kn_unlock(of->kn); + return ret; +} diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index e69de29bb2d1..7a6f46b4edd0 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -0,0 +1,313 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _FS_RESCTRL_INTERNAL_H +#define _FS_RESCTRL_INTERNAL_H + +#include <linux/resctrl.h> +#include <linux/sched.h> +#include <linux/kernfs.h> +#include <linux/fs_context.h> +#include <linux/jump_label.h> +#include <linux/tick.h> + +#include <asm/resctrl.h> + +/** + * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that + * aren't marked nohz_full + * @mask: The mask to pick a CPU from. + * @exclude_cpu:The CPU to avoid picking. + * + * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping + * CPUs that don't use nohz_full, these are preferred. Pass + * RESCTRL_PICK_ANY_CPU to avoid excluding any CPUs. + * + * When a CPU is excluded, returns >= nr_cpu_ids if no CPUs are available. + */ +static inline unsigned int +cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu) +{ + unsigned int cpu, hk_cpu; + + if (exclude_cpu == RESCTRL_PICK_ANY_CPU) + cpu = cpumask_any(mask); + else + cpu = cpumask_any_but(mask, exclude_cpu); + + if (!IS_ENABLED(CONFIG_NO_HZ_FULL)) + return cpu; + + /* If the CPU picked isn't marked nohz_full nothing more needs doing. */ + if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu)) + return cpu; + + /* Try to find a CPU that isn't nohz_full to use in preference */ + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask); + if (hk_cpu == exclude_cpu) + hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask); + + if (hk_cpu < nr_cpu_ids) + cpu = hk_cpu; + + return cpu; +} + +struct rdt_fs_context { + struct kernfs_fs_context kfc; + bool enable_cdpl2; + bool enable_cdpl3; + bool enable_mba_mbps; + bool enable_debug; +}; + +static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) +{ + struct kernfs_fs_context *kfc = fc->fs_private; + + return container_of(kfc, struct rdt_fs_context, kfc); +} + +/** + * struct mon_evt - Entry in the event list of a resource + * @evtid: event id + * @name: name of the event + * @configurable: true if the event is configurable + * @list: entry in &rdt_resource->evt_list + */ +struct mon_evt { + enum resctrl_event_id evtid; + char *name; + bool configurable; + struct list_head list; +}; + +/** + * union mon_data_bits - Monitoring details for each event file + * @priv: Used to store monitoring event data in @u + * as kernfs private data + * @rid: Resource id associated with the event file + * @evtid: Event id associated with the event file + * @domid: The domain to which the event file belongs + * @u: Name of the bit fields struct + */ +union mon_data_bits { + void *priv; + struct { + unsigned int rid : 10; + enum resctrl_event_id evtid : 8; + unsigned int domid : 14; + } u; +}; + +struct rmid_read { + struct rdtgroup *rgrp; + struct rdt_resource *r; + struct rdt_domain *d; + enum resctrl_event_id evtid; + bool first; + int err; + u64 val; + void *arch_mon_ctx; +}; + +extern struct list_head resctrl_schema_all; +extern bool resctrl_mounted; + +enum rdt_group_type { + RDTCTRL_GROUP = 0, + RDTMON_GROUP, + RDT_NUM_GROUP, +}; + +/** + * enum rdtgrp_mode - Mode of a RDT resource group + * @RDT_MODE_SHAREABLE: This resource group allows sharing of its allocations + * @RDT_MODE_EXCLUSIVE: No sharing of this resource group's allocations allowed + * @RDT_MODE_PSEUDO_LOCKSETUP: Resource group will be used for Pseudo-Locking + * @RDT_MODE_PSEUDO_LOCKED: No sharing of this resource group's allocations + * allowed AND the allocations are Cache Pseudo-Locked + * @RDT_NUM_MODES: Total number of modes + * + * The mode of a resource group enables control over the allowed overlap + * between allocations associated with different resource groups (classes + * of service). User is able to modify the mode of a resource group by + * writing to the "mode" resctrl file associated with the resource group. + * + * The "shareable", "exclusive", and "pseudo-locksetup" modes are set by + * writing the appropriate text to the "mode" file. A resource group enters + * "pseudo-locked" mode after the schemata is written while the resource + * group is in "pseudo-locksetup" mode. + */ +enum rdtgrp_mode { + RDT_MODE_SHAREABLE = 0, + RDT_MODE_EXCLUSIVE, + RDT_MODE_PSEUDO_LOCKSETUP, + RDT_MODE_PSEUDO_LOCKED, + + /* Must be last */ + RDT_NUM_MODES, +}; + +/** + * struct mongroup - store mon group's data in resctrl fs. + * @mon_data_kn: kernfs node for the mon_data directory + * @parent: parent rdtgrp + * @crdtgrp_list: child rdtgroup node list + * @rmid: rmid for this rdtgroup + */ +struct mongroup { + struct kernfs_node *mon_data_kn; + struct rdtgroup *parent; + struct list_head crdtgrp_list; + u32 rmid; +}; + +/** + * struct rdtgroup - store rdtgroup's data in resctrl file system. + * @kn: kernfs node + * @rdtgroup_list: linked list for all rdtgroups + * @closid: closid for this rdtgroup + * @cpu_mask: CPUs assigned to this rdtgroup + * @flags: status bits + * @waitcount: how many cpus expect to find this + * group when they acquire rdtgroup_mutex + * @type: indicates type of this rdtgroup - either + * monitor only or ctrl_mon group + * @mon: mongroup related data + * @mode: mode of resource group + * @plr: pseudo-locked region + */ +struct rdtgroup { + struct kernfs_node *kn; + struct list_head rdtgroup_list; + u32 closid; + struct cpumask cpu_mask; + int flags; + atomic_t waitcount; + enum rdt_group_type type; + struct mongroup mon; + enum rdtgrp_mode mode; + struct pseudo_lock_region *plr; +}; + +/* List of all resource groups */ +extern struct list_head rdt_all_groups; + +extern int max_name_width, max_data_width; + +/** + * struct rftype - describe each file in the resctrl file system + * @name: File name + * @mode: Access mode + * @kf_ops: File operations + * @flags: File specific RFTYPE_FLAGS_* flags + * @fflags: File specific RFTYPE_* flags + * @seq_show: Show content of the file + * @write: Write to the file + */ +struct rftype { + char *name; + umode_t mode; + const struct kernfs_ops *kf_ops; + unsigned long flags; + unsigned long fflags; + + int (*seq_show)(struct kernfs_open_file *of, + struct seq_file *sf, void *v); + /* + * write() is the generic write callback which maps directly to + * kernfs write operation and overrides all other operations. + * Maximum write size is determined by ->max_write_len. + */ + ssize_t (*write)(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off); +}; + +/** + * struct mbm_state - status for each MBM counter in each domain + * @prev_bw_bytes: Previous bytes value read for bandwidth calculation + * @prev_bw: The most recent bandwidth in MBps + * @delta_bw: Difference between the current and previous bandwidth + * @delta_comp: Indicates whether to compute the delta_bw + */ +struct mbm_state { + u64 prev_bw_bytes; + u32 prev_bw; + u32 delta_bw; + bool delta_comp; +}; + +static inline bool is_mba_sc(struct rdt_resource *r) +{ + if (!r) + r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + + /* + * The software controller support is only applicable to MBA resource. + * Make sure to check for resource type. + */ + if (r->rid != RDT_RESOURCE_MBA) + return false; + + return r->membw.mba_sc; +} + +extern struct mutex rdtgroup_mutex; +extern struct rdtgroup rdtgroup_default; +extern struct dentry *debugfs_resctrl; + +void rdt_last_cmd_clear(void); +void rdt_last_cmd_puts(const char *s); +__printf(1, 2) +void rdt_last_cmd_printf(const char *fmt, ...); + +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); +void rdtgroup_kn_unlock(struct kernfs_node *kn); +int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); +int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, + umode_t mask); +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off); +int rdtgroup_schemata_show(struct kernfs_open_file *of, + struct seq_file *s, void *v); +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, + unsigned long cbm, int closid, bool exclusive); +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d, + unsigned long cbm); +enum rdtgrp_mode rdtgroup_mode_by_closid(int closid); +int rdtgroup_tasks_assigned(struct rdtgroup *r); +int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp); +int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp); +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm); +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d); +int rdt_pseudo_lock_init(void); +void rdt_pseudo_lock_release(void); +int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); +void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); +int closids_supported(void); +bool closid_allocated(unsigned int closid); +bool resctrl_closid_is_dirty(u32 closid); +void closid_free(int closid); +int alloc_rmid(u32 closid); +void free_rmid(u32 closid, u32 rmid); +void resctrl_mon_resource_exit(void); +void mon_event_count(void *info); +int rdtgroup_mondata_show(struct seq_file *m, void *arg); +void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, + struct rdt_domain *d, struct rdtgroup *rdtgrp, + int evtid, int first); +int resctrl_mon_resource_init(void); +void mbm_setup_overflow_handler(struct rdt_domain *dom, + unsigned long delay_ms, + int exclude_cpu); +void mbm_handle_overflow(struct work_struct *work); +void setup_default_ctrlval(struct rdt_resource *r, u32 *dc); +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu); +void cqm_handle_limbo(struct work_struct *work); +bool has_busy_rmid(struct rdt_domain *d); +void __check_limbo(struct rdt_domain *d, bool force_free); +void mbm_config_rftype_init(const char *config); +void rdt_staged_configs_clear(void); +int resctrl_find_cleanest_closid(void); + +#endif /* _FS_RESCTRL_INTERNAL_H */ diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index e69de29bb2d1..fcf2ab18966f 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -0,0 +1,859 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Resource Director Technology(RDT) + * - Monitoring code + * + * Copyright (C) 2017 Intel Corporation + * + * Author: + * Vikas Shivappa vikas.shivappa@intel.com + * + * This replaces the cqm.c based on perf but we reuse a lot of + * code and datastructures originally from Peter Zijlstra and Matt Fleming. + * + * More information about RDT be found in the Intel (R) x86 Architecture + * Software Developer Manual June 2016, volume 3, section 17.17. + */ + +#include <linux/cpu.h> +#include <linux/module.h> +#include <linux/sizes.h> +#include <linux/slab.h> +#include "internal.h" + +/* + * struct rmid_entry - dirty tracking for all RMID. + * @closid: The CLOSID for this entry. + * @rmid: The RMID for this entry. + * @busy: The number of domains with cached data using this RMID. + * @list: Member of the rmid_free_lru list when busy == 0. + * + * Depending on the architecture the correct monitor is accessed using + * both @closid and @rmid, or @rmid only. + * + * Take the rdtgroup_mutex when accessing. + */ +struct rmid_entry { + u32 closid; + u32 rmid; + int busy; + struct list_head list; +}; + +/* + * @rmid_free_lru - A least recently used list of free RMIDs + * These RMIDs are guaranteed to have an occupancy less than the + * threshold occupancy + */ +static LIST_HEAD(rmid_free_lru); + +/** + * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has. + * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. + * Indexed by CLOSID. Protected by rdtgroup_mutex. + */ +static u32 *closid_num_dirty_rmid; + +/* + * @rmid_limbo_count - count of currently unused but (potentially) + * dirty RMIDs. + * This counts RMIDs that no one is currently using but that + * may have a occupancy value > resctrl_rmid_realloc_threshold. User can + * change the threshold occupancy value. + */ +static unsigned int rmid_limbo_count; + +/* + * @rmid_entry - The entry in the limbo and free lists. + */ +static struct rmid_entry *rmid_ptrs; + +/* + * This is the threshold cache occupancy in bytes at which we will consider an + * RMID available for re-allocation. + */ +unsigned int resctrl_rmid_realloc_threshold; + +/* + * This is the maximum value for the reallocation threshold, in bytes. + */ +unsigned int resctrl_rmid_realloc_limit; + +/* + * x86 and arm64 differ in their handling of monitoring. + * x86's RMID are independent numbers, there is only one source of traffic + * with an RMID value of '1'. + * arm64's PMG extends the PARTID/CLOSID space, there are multiple sources of + * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID + * value is no longer unique. + * To account for this, resctrl uses an index. On x86 this is just the RMID, + * on arm64 it encodes the CLOSID and RMID. This gives a unique number. + * + * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code + * must accept an attempt to read every index. + */ +static inline struct rmid_entry *__rmid_entry(u32 idx) +{ + struct rmid_entry *entry; + u32 closid, rmid; + + entry = &rmid_ptrs[idx]; + resctrl_arch_rmid_idx_decode(idx, &closid, &rmid); + + WARN_ON_ONCE(entry->closid != closid); + WARN_ON_ONCE(entry->rmid != rmid); + + return entry; +} + +static void limbo_release_entry(struct rmid_entry *entry) +{ + lockdep_assert_held(&rdtgroup_mutex); + + rmid_limbo_count--; + list_add_tail(&entry->list, &rmid_free_lru); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]--; +} + +/* + * Check the RMIDs that are marked as busy for this domain. If the + * reported LLC occupancy is below the threshold clear the busy bit and + * decrement the count. If the busy count gets to zero on an RMID, we + * free the RMID + */ +void __check_limbo(struct rdt_domain *d, bool force_free) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + struct rmid_entry *entry; + u32 idx, cur_idx = 1; + void *arch_mon_ctx; + bool rmid_dirty; + u64 val = 0; + + arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID); + if (IS_ERR(arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(arch_mon_ctx)); + return; + } + + /* + * Skip RMID 0 and start from RMID 1 and check all the RMIDs that + * are marked as busy for occupancy < threshold. If the occupancy + * is less than the threshold decrement the busy counter of the + * RMID and move it to the free list when the counter reaches 0. + */ + for (;;) { + idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx); + if (idx >= idx_limit) + break; + + entry = __rmid_entry(idx); + if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, + QOS_L3_OCCUP_EVENT_ID, &val, + arch_mon_ctx)) { + rmid_dirty = true; + } else { + rmid_dirty = (val >= resctrl_rmid_realloc_threshold); + } + + if (force_free || !rmid_dirty) { + clear_bit(idx, d->rmid_busy_llc); + if (!--entry->busy) + limbo_release_entry(entry); + } + cur_idx = idx + 1; + } + + resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx); +} + +bool has_busy_rmid(struct rdt_domain *d) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + + return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; +} + +static struct rmid_entry *resctrl_find_free_rmid(u32 closid) +{ + struct rmid_entry *itr; + u32 itr_idx, cmp_idx; + + if (list_empty(&rmid_free_lru)) + return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC); + + list_for_each_entry(itr, &rmid_free_lru, list) { + /* + * Get the index of this free RMID, and the index it would need + * to be if it were used with this CLOSID. + * If the CLOSID is irrelevant on this architecture, the two + * index values are always the same on every entry and thus the + * very first entry will be returned. + */ + itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid); + cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid); + + if (itr_idx == cmp_idx) + return itr; + } + + return ERR_PTR(-ENOSPC); +} + +/** + * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated + * RMID are clean, or the CLOSID that has + * the most clean RMID. + * + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID + * may not be able to allocate clean RMID. To avoid this the allocator will + * choose the CLOSID with the most clean RMID. + * + * When the CLOSID and RMID are independent numbers, the first free CLOSID will + * be returned. + */ +int resctrl_find_cleanest_closid(void) +{ + u32 cleanest_closid = ~0; + int i = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + return -EIO; + + for (i = 0; i < closids_supported(); i++) { + int num_dirty; + + if (closid_allocated(i)) + continue; + + num_dirty = closid_num_dirty_rmid[i]; + if (num_dirty == 0) + return i; + + if (cleanest_closid == ~0) + cleanest_closid = i; + + if (num_dirty < closid_num_dirty_rmid[cleanest_closid]) + cleanest_closid = i; + } + + if (cleanest_closid == ~0) + return -ENOSPC; + + return cleanest_closid; +} + +/* + * For MPAM the RMID value is not unique, and has to be considered with + * the CLOSID. The (CLOSID, RMID) pair is allocated on all domains, which + * allows all domains to be managed by a single free list. + * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler. + */ +int alloc_rmid(u32 closid) +{ + struct rmid_entry *entry; + + lockdep_assert_held(&rdtgroup_mutex); + + entry = resctrl_find_free_rmid(closid); + if (IS_ERR(entry)) + return PTR_ERR(entry); + + list_del(&entry->list); + return entry->rmid; +} + +static void add_rmid_to_limbo(struct rmid_entry *entry) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdt_domain *d; + u32 idx; + + lockdep_assert_held(&rdtgroup_mutex); + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid); + + entry->busy = 0; + list_for_each_entry(d, &r->domains, list) { + /* + * For the first limbo RMID in the domain, + * setup up the limbo worker. + */ + if (!has_busy_rmid(d)) + cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, + RESCTRL_PICK_ANY_CPU); + set_bit(idx, d->rmid_busy_llc); + entry->busy++; + } + + rmid_limbo_count++; + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]++; +} + +void free_rmid(u32 closid, u32 rmid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + struct rmid_entry *entry; + + lockdep_assert_held(&rdtgroup_mutex); + + /* + * Do not allow the default rmid to be free'd. Comparing by index + * allows architectures that ignore the closid parameter to avoid an + * unnecessary check. + */ + if (idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID)) + return; + + entry = __rmid_entry(idx); + + if (resctrl_arch_is_llc_occupancy_enabled()) + add_rmid_to_limbo(entry); + else + list_add_tail(&entry->list, &rmid_free_lru); +} + +static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid, + u32 rmid, enum resctrl_event_id evtid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + + switch (evtid) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return &d->mbm_total[idx]; + case QOS_L3_MBM_LOCAL_EVENT_ID: + return &d->mbm_local[idx]; + default: + return NULL; + } +} + +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) +{ + struct mbm_state *m; + u64 tval = 0; + + if (rr->first) { + resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid); + m = get_mbm_state(rr->d, closid, rmid, rr->evtid); + if (m) + memset(m, 0, sizeof(struct mbm_state)); + return 0; + } + + rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid, + &tval, rr->arch_mon_ctx); + if (rr->err) + return rr->err; + + rr->val += tval; + + return 0; +} + +/* + * mbm_bw_count() - Update bw count from values previously read by + * __mon_event_count(). + * @closid: The closid used to identify the cached mbm_state. + * @rmid: The rmid used to identify the cached mbm_state. + * @rr: The struct rmid_read populated by __mon_event_count(). + * + * Supporting function to calculate the memory bandwidth + * and delta bandwidth in MBps. The chunks value previously read by + * __mon_event_count() is compared with the chunks value from the previous + * invocation. This must be called once per second to maintain values in MBps. + */ +static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + struct mbm_state *m = &rr->d->mbm_local[idx]; + u64 cur_bw, bytes, cur_bytes; + + cur_bytes = rr->val; + bytes = cur_bytes - m->prev_bw_bytes; + m->prev_bw_bytes = cur_bytes; + + cur_bw = bytes / SZ_1M; + + if (m->delta_comp) + m->delta_bw = abs(cur_bw - m->prev_bw); + m->delta_comp = false; + m->prev_bw = cur_bw; +} + +/* + * This is scheduled by mon_event_read() to read the CQM/MBM counters + * on a domain. + */ +void mon_event_count(void *info) +{ + struct rdtgroup *rdtgrp, *entry; + struct rmid_read *rr = info; + struct list_head *head; + int ret; + + rdtgrp = rr->rgrp; + + ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr); + + /* + * For Ctrl groups read data from child monitor groups and + * add them together. Count events which are read successfully. + * Discard the rmid_read's reporting errors. + */ + head = &rdtgrp->mon.crdtgrp_list; + + if (rdtgrp->type == RDTCTRL_GROUP) { + list_for_each_entry(entry, head, mon.crdtgrp_list) { + if (__mon_event_count(entry->closid, entry->mon.rmid, + rr) == 0) + ret = 0; + } + } + + /* + * __mon_event_count() calls for newly created monitor groups may + * report -EINVAL/Unavailable if the monitor hasn't seen any traffic. + * Discard error if any of the monitor event reads succeeded. + */ + if (ret == 0) + rr->err = 0; +} + +/* + * Feedback loop for MBA software controller (mba_sc) + * + * mba_sc is a feedback loop where we periodically read MBM counters and + * adjust the bandwidth percentage values via the IA32_MBA_THRTL_MSRs so + * that: + * + * current bandwidth(cur_bw) < user specified bandwidth(user_bw) + * + * This uses the MBM counters to measure the bandwidth and MBA throttle + * MSRs to control the bandwidth for a particular rdtgrp. It builds on the + * fact that resctrl rdtgroups have both monitoring and control. + * + * The frequency of the checks is 1s and we just tag along the MBM overflow + * timer. Having 1s interval makes the calculation of bandwidth simpler. + * + * Although MBA's goal is to restrict the bandwidth to a maximum, there may + * be a need to increase the bandwidth to avoid unnecessarily restricting + * the L2 <-> L3 traffic. + * + * Since MBA controls the L2 external bandwidth where as MBM measures the + * L3 external bandwidth the following sequence could lead to such a + * situation. + * + * Consider an rdtgroup which had high L3 <-> memory traffic in initial + * phases -> mba_sc kicks in and reduced bandwidth percentage values -> but + * after some time rdtgroup has mostly L2 <-> L3 traffic. + * + * In this case we may restrict the rdtgroup's L2 <-> L3 traffic as its + * throttle MSRs already have low percentage values. To avoid + * unnecessarily restricting such rdtgroups, we also increase the bandwidth. + */ +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) +{ + u32 closid, rmid, cur_msr_val, new_msr_val; + struct mbm_state *pmbm_data, *cmbm_data; + u32 cur_bw, delta_bw, user_bw, idx; + struct rdt_resource *r_mba; + struct rdt_domain *dom_mba; + struct list_head *head; + struct rdtgroup *entry; + + if (!resctrl_arch_is_mbm_local_enabled()) + return; + + r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + + closid = rgrp->closid; + rmid = rgrp->mon.rmid; + idx = resctrl_arch_rmid_idx_encode(closid, rmid); + pmbm_data = &dom_mbm->mbm_local[idx]; + + dom_mba = resctrl_get_domain_from_cpu(smp_processor_id(), r_mba); + if (!dom_mba) { + pr_warn_once("Failure to get domain for MBA update\n"); + return; + } + + cur_bw = pmbm_data->prev_bw; + user_bw = dom_mba->mbps_val[closid]; + delta_bw = pmbm_data->delta_bw; + + /* MBA resource doesn't support CDP */ + cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE); + + /* + * For Ctrl groups read data from child monitor groups. + */ + head = &rgrp->mon.crdtgrp_list; + list_for_each_entry(entry, head, mon.crdtgrp_list) { + cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid]; + cur_bw += cmbm_data->prev_bw; + delta_bw += cmbm_data->delta_bw; + } + + /* + * Scale up/down the bandwidth linearly for the ctrl group. The + * bandwidth step is the bandwidth granularity specified by the + * hardware. + * + * The delta_bw is used when increasing the bandwidth so that we + * dont alternately increase and decrease the control values + * continuously. + * + * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if + * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep + * switching between 90 and 110 continuously if we only check + * cur_bw < user_bw. + */ + if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) { + new_msr_val = cur_msr_val - r_mba->membw.bw_gran; + } else if (cur_msr_val < MAX_MBA_BW && + (user_bw > (cur_bw + delta_bw))) { + new_msr_val = cur_msr_val + r_mba->membw.bw_gran; + } else { + return; + } + + resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val); + + /* + * Delta values are updated dynamically package wise for each + * rdtgrp every time the throttle MSR changes value. + * + * This is because (1)the increase in bandwidth is not perfectly + * linear and only "approximately" linear even when the hardware + * says it is linear.(2)Also since MBA is a core specific + * mechanism, the delta values vary based on number of cores used + * by the rdtgrp. + */ + pmbm_data->delta_comp = true; + list_for_each_entry(entry, head, mon.crdtgrp_list) { + cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid]; + cmbm_data->delta_comp = true; + } +} + +static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid) +{ + struct rmid_read rr; + + rr.first = false; + rr.r = r; + rr.d = d; + + /* + * This is protected from concurrent reads from user + * as both the user and we hold the global mutex. + */ + if (resctrl_arch_is_mbm_total_enabled()) { + rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } + if (resctrl_arch_is_mbm_local_enabled()) { + rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + + /* + * Call the MBA software controller only for the + * control groups and when user has enabled + * the software controller explicitly. + */ + if (is_mba_sc(NULL)) + mbm_bw_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } +} + +/* + * Handler to scan the limbo list and move the RMIDs + * to free list whose occupancy < threshold_occupancy. + */ +void cqm_handle_limbo(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); + struct rdt_domain *d; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + d = container_of(work, struct rdt_domain, cqm_limbo.work); + + __check_limbo(d, false); + + if (has_busy_rmid(d)) { + d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, + delay); + } + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this + * domain. + * @dom: The domain the limbo handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); + dom->cqm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); +} + +void mbm_handle_overflow(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); + struct rdtgroup *prgrp, *crgrp; + struct list_head *head; + struct rdt_resource *r; + struct rdt_domain *d; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* + * If the filesystem has been unmounted this work no longer needs to + * run. + */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + goto out_unlock; + + r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + d = container_of(work, struct rdt_domain, mbm_over.work); + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + mbm_update(r, d, prgrp->closid, prgrp->mon.rmid); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) + mbm_update(r, d, crgrp->closid, crgrp->mon.rmid); + + if (is_mba_sc(NULL)) + update_mba_bw(prgrp, d); + } + + /* + * Re-check for housekeeping CPUs. This allows the overflow handler to + * move off a nohz_full CPU quickly. + */ + d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this + * domain. + * @dom: The domain the overflow handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + /* + * When a domain comes online there is no guarantee the filesystem is + * mounted. If not, there is no need to catch counter overflow. + */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + return; + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); + dom->mbm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->mbm_over, delay); +} + +static int dom_data_init(struct rdt_resource *r) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rmid_entry *entry = NULL; + int err = 0, i; + u32 idx; + + mutex_lock(&rdtgroup_mutex); + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + u32 *tmp; + + tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL); + if (!tmp) { + err = -ENOMEM; + goto out_unlock; + } + + closid_num_dirty_rmid = tmp; + } + + rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); + if (!rmid_ptrs) { + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + err = -ENOMEM; + goto out_unlock; + } + + for (i = 0; i < idx_limit; i++) { + entry = &rmid_ptrs[i]; + INIT_LIST_HEAD(&entry->list); + + resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid); + list_add_tail(&entry->list, &rmid_free_lru); + } + + /* + * RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and + * are always allocated. These are used for the rdtgroup_default + * control group, which will be setup later in rdtgroup_init(). + */ + idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID); + entry = __rmid_entry(idx); + list_del(&entry->list); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; +} + +static void dom_data_exit(struct rdt_resource *r) +{ + if (!r->mon_capable) + return; + + mutex_lock(&rdtgroup_mutex); + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + + kfree(rmid_ptrs); + rmid_ptrs = NULL; + + mutex_unlock(&rdtgroup_mutex); +} + +static struct mon_evt llc_occupancy_event = { + .name = "llc_occupancy", + .evtid = QOS_L3_OCCUP_EVENT_ID, +}; + +static struct mon_evt mbm_total_event = { + .name = "mbm_total_bytes", + .evtid = QOS_L3_MBM_TOTAL_EVENT_ID, +}; + +static struct mon_evt mbm_local_event = { + .name = "mbm_local_bytes", + .evtid = QOS_L3_MBM_LOCAL_EVENT_ID, +}; + +/* + * Initialize the event list for the resource. + * + * Note that MBM events are also part of RDT_RESOURCE_L3 resource + * because as per the SDM the total and local memory bandwidth + * are enumerated as part of L3 monitoring. + */ +static void l3_mon_evt_init(struct rdt_resource *r) +{ + INIT_LIST_HEAD(&r->evt_list); + + if (resctrl_arch_is_llc_occupancy_enabled()) + list_add_tail(&llc_occupancy_event.list, &r->evt_list); + if (resctrl_arch_is_mbm_total_enabled()) + list_add_tail(&mbm_total_event.list, &r->evt_list); + if (resctrl_arch_is_mbm_local_enabled()) + list_add_tail(&mbm_local_event.list, &r->evt_list); +} + +int resctrl_mon_resource_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + int ret; + + ret = dom_data_init(r); + if (ret) + return ret; + + if (!r->mon_capable) + return 0; + + l3_mon_evt_init(r); + + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { + mbm_total_event.configurable = true; + mbm_config_rftype_init("mbm_total_bytes_config"); + } + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { + mbm_local_event.configurable = true; + mbm_config_rftype_init("mbm_local_bytes_config"); + } + + return 0; +} + +void resctrl_mon_resource_exit(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + + dom_data_exit(r); +} diff --git a/fs/resctrl/psuedo_lock.c b/fs/resctrl/psuedo_lock.c index e69de29bb2d1..2fa42d0a33ea 100644 --- a/fs/resctrl/psuedo_lock.c +++ b/fs/resctrl/psuedo_lock.c @@ -0,0 +1,1134 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Resource Director Technology (RDT) + * + * Pseudo-locking support built on top of Cache Allocation Technology (CAT) + * + * Copyright (C) 2018 Intel Corporation + * + * Author: Reinette Chatre reinette.chatre@intel.com + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/cacheinfo.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/debugfs.h> +#include <linux/kthread.h> +#include <linux/mman.h> +#include <linux/perf_event.h> +#include <linux/pm_qos.h> +#include <linux/slab.h> +#include <linux/uaccess.h> + +#include <asm/cacheflush.h> +#include <asm/resctrl.h> +#include <asm/perf_event.h> + +#include "internal.h" + +/* + * Major number assigned to and shared by all devices exposing + * pseudo-locked regions. + */ +static unsigned int pseudo_lock_major; +static unsigned long pseudo_lock_minor_avail = GENMASK(MINORBITS, 0); + +static char *pseudo_lock_devnode(const struct device *dev, umode_t *mode) +{ + const struct rdtgroup *rdtgrp; + + rdtgrp = dev_get_drvdata(dev); + if (mode) + *mode = 0600; + return kasprintf(GFP_KERNEL, "pseudo_lock/%s", rdtgrp->kn->name); +} + +static const struct class pseudo_lock_class = { + .name = "pseudo_lock", + .devnode = pseudo_lock_devnode, +}; + +/** + * pseudo_lock_minor_get - Obtain available minor number + * @minor: Pointer to where new minor number will be stored + * + * A bitmask is used to track available minor numbers. Here the next free + * minor number is marked as unavailable and returned. + * + * Return: 0 on success, <0 on failure. + */ +static int pseudo_lock_minor_get(unsigned int *minor) +{ + unsigned long first_bit; + + first_bit = find_first_bit(&pseudo_lock_minor_avail, MINORBITS); + + if (first_bit == MINORBITS) + return -ENOSPC; + + __clear_bit(first_bit, &pseudo_lock_minor_avail); + *minor = first_bit; + + return 0; +} + +/** + * pseudo_lock_minor_release - Return minor number to available + * @minor: The minor number made available + */ +static void pseudo_lock_minor_release(unsigned int minor) +{ + __set_bit(minor, &pseudo_lock_minor_avail); +} + +/** + * region_find_by_minor - Locate a pseudo-lock region by inode minor number + * @minor: The minor number of the device representing pseudo-locked region + * + * When the character device is accessed we need to determine which + * pseudo-locked region it belongs to. This is done by matching the minor + * number of the device to the pseudo-locked region it belongs. + * + * Minor numbers are assigned at the time a pseudo-locked region is associated + * with a cache instance. + * + * Return: On success return pointer to resource group owning the pseudo-locked + * region, NULL on failure. + */ +static struct rdtgroup *region_find_by_minor(unsigned int minor) +{ + struct rdtgroup *rdtgrp, *rdtgrp_match = NULL; + + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (rdtgrp->plr && rdtgrp->plr->minor == minor) { + rdtgrp_match = rdtgrp; + break; + } + } + return rdtgrp_match; +} + +/** + * struct pseudo_lock_pm_req - A power management QoS request list entry + * @list: Entry within the @pm_reqs list for a pseudo-locked region + * @req: PM QoS request + */ +struct pseudo_lock_pm_req { + struct list_head list; + struct dev_pm_qos_request req; +}; + +static void pseudo_lock_cstates_relax(struct pseudo_lock_region *plr) +{ + struct pseudo_lock_pm_req *pm_req, *next; + + list_for_each_entry_safe(pm_req, next, &plr->pm_reqs, list) { + dev_pm_qos_remove_request(&pm_req->req); + list_del(&pm_req->list); + kfree(pm_req); + } +} + +/** + * pseudo_lock_cstates_constrain - Restrict cores from entering C6 + * @plr: Pseudo-locked region + * + * To prevent the cache from being affected by power management entering + * C6 has to be avoided. This is accomplished by requesting a latency + * requirement lower than lowest C6 exit latency of all supported + * platforms as found in the cpuidle state tables in the intel_idle driver. + * At this time it is possible to do so with a single latency requirement + * for all supported platforms. + * + * Since Goldmont is supported, which is affected by X86_BUG_MONITOR, + * the ACPI latencies need to be considered while keeping in mind that C2 + * may be set to map to deeper sleep states. In this case the latency + * requirement needs to prevent entering C2 also. + * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr) +{ + struct pseudo_lock_pm_req *pm_req; + int cpu; + int ret; + + for_each_cpu(cpu, &plr->d->cpu_mask) { + pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL); + if (!pm_req) { + rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n"); + ret = -ENOMEM; + goto out_err; + } + ret = dev_pm_qos_add_request(get_cpu_device(cpu), + &pm_req->req, + DEV_PM_QOS_RESUME_LATENCY, + 30); + if (ret < 0) { + rdt_last_cmd_printf("Failed to add latency req CPU%d\n", + cpu); + kfree(pm_req); + ret = -1; + goto out_err; + } + list_add(&pm_req->list, &plr->pm_reqs); + } + + return 0; + +out_err: + pseudo_lock_cstates_relax(plr); + return ret; +} + +/** + * pseudo_lock_region_clear - Reset pseudo-lock region data + * @plr: pseudo-lock region + * + * All content of the pseudo-locked region is reset - any memory allocated + * freed. + * + * Return: void + */ +static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) +{ + plr->size = 0; + plr->line_size = 0; + kfree(plr->kmem); + plr->kmem = NULL; + plr->s = NULL; + if (plr->d) + plr->d->plr = NULL; + plr->d = NULL; + plr->cbm = 0; + plr->debugfs_dir = NULL; +} + +/** + * pseudo_lock_region_init - Initialize pseudo-lock region information + * @plr: pseudo-lock region + * + * Called after user provided a schemata to be pseudo-locked. From the + * schemata the &struct pseudo_lock_region is on entry already initialized + * with the resource, domain, and capacity bitmask. Here the information + * required for pseudo-locking is deduced from this data and &struct + * pseudo_lock_region initialized further. This information includes: + * - size in bytes of the region to be pseudo-locked + * - cache line size to know the stride with which data needs to be accessed + * to be pseudo-locked + * - a cpu associated with the cache instance on which the pseudo-locking + * flow can be executed + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_init(struct pseudo_lock_region *plr) +{ + struct cpu_cacheinfo *ci; + int ret; + int i; + + /* Pick the first cpu we find that is associated with the cache. */ + plr->cpu = cpumask_first(&plr->d->cpu_mask); + + if (!cpu_online(plr->cpu)) { + rdt_last_cmd_printf("CPU %u associated with cache not online\n", + plr->cpu); + ret = -ENODEV; + goto out_region; + } + + ci = get_cpu_cacheinfo(plr->cpu); + + plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); + + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == plr->s->res->cache_level) { + plr->line_size = ci->info_list[i].coherency_line_size; + return 0; + } + } + + ret = -1; + rdt_last_cmd_puts("Unable to determine cache line size\n"); +out_region: + pseudo_lock_region_clear(plr); + return ret; +} + +/** + * pseudo_lock_init - Initialize a pseudo-lock region + * @rdtgrp: resource group to which new pseudo-locked region will belong + * + * A pseudo-locked region is associated with a resource group. When this + * association is created the pseudo-locked region is initialized. The + * details of the pseudo-locked region are not known at this time so only + * allocation is done and association established. + * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_init(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr; + + plr = kzalloc(sizeof(*plr), GFP_KERNEL); + if (!plr) + return -ENOMEM; + + init_waitqueue_head(&plr->lock_thread_wq); + INIT_LIST_HEAD(&plr->pm_reqs); + rdtgrp->plr = plr; + return 0; +} + +/** + * pseudo_lock_region_alloc - Allocate kernel memory that will be pseudo-locked + * @plr: pseudo-lock region + * + * Initialize the details required to set up the pseudo-locked region and + * allocate the contiguous memory that will be pseudo-locked to the cache. + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr) +{ + int ret; + + ret = pseudo_lock_region_init(plr); + if (ret < 0) + return ret; + + /* + * We do not yet support contiguous regions larger than + * KMALLOC_MAX_SIZE. + */ + if (plr->size > KMALLOC_MAX_SIZE) { + rdt_last_cmd_puts("Requested region exceeds maximum size\n"); + ret = -E2BIG; + goto out_region; + } + + plr->kmem = kzalloc(plr->size, GFP_KERNEL); + if (!plr->kmem) { + rdt_last_cmd_puts("Unable to allocate memory\n"); + ret = -ENOMEM; + goto out_region; + } + + ret = 0; + goto out; +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * pseudo_lock_free - Free a pseudo-locked region + * @rdtgrp: resource group to which pseudo-locked region belonged + * + * The pseudo-locked region's resources have already been released, or not + * yet created at this point. Now it can be freed and disassociated from the + * resource group. + * + * Return: void + */ +static void pseudo_lock_free(struct rdtgroup *rdtgrp) +{ + pseudo_lock_region_clear(rdtgrp->plr); + kfree(rdtgrp->plr); + rdtgrp->plr = NULL; +} + +/** + * rdtgroup_monitor_in_progress - Test if monitoring in progress + * @rdtgrp: resource group being queried + * + * Return: 1 if monitor groups have been created for this resource + * group, 0 otherwise. + */ +static int rdtgroup_monitor_in_progress(struct rdtgroup *rdtgrp) +{ + return !list_empty(&rdtgrp->mon.crdtgrp_list); +} + +/** + * rdtgroup_locksetup_user_restrict - Restrict user access to group + * @rdtgrp: resource group needing access restricted + * + * A resource group used for cache pseudo-locking cannot have cpus or tasks + * assigned to it. This is communicated to the user by restricting access + * to all the files that can be used to make such changes. + * + * Permissions restored with rdtgroup_locksetup_user_restore() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restriction of access an attempt will be made to restore permissions but + * the state of the mode of these files will be uncertain when a failure + * occurs. + */ +static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups"); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); +err_cpus: + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); +err_tasks: + rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); +out: + return ret; +} + +/** + * rdtgroup_locksetup_user_restore - Restore user access to group + * @rdtgrp: resource group needing access restored + * + * Restore all file access previously removed using + * rdtgroup_locksetup_user_restrict() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restoration of access an attempt will be made to restrict permissions + * again but the state of the mode of these files will be uncertain when + * a failure occurs. + */ +static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); +err_cpus: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); +err_tasks: + rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); +out: + return ret; +} + +/** + * rdtgroup_locksetup_enter - Resource group enters locksetup mode + * @rdtgrp: resource group requested to enter locksetup mode + * + * A resource group enters locksetup mode to reflect that it would be used + * to represent a pseudo-locked region and is in the process of being set + * up to do so. A resource group used for a pseudo-locked region would + * lose the closid associated with it so we cannot allow it to have any + * tasks or cpus assigned nor permit tasks or cpus to be assigned in the + * future. Monitoring of a pseudo-locked region is not allowed either. + * + * The above and more restrictions on a pseudo-locked region are checked + * for and enforced before the resource group enters the locksetup mode. + * + * Returns: 0 if the resource group successfully entered locksetup mode, <0 + * on failure. On failure the last_cmd_status buffer is updated with text to + * communicate details of failure to the user. + */ +int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) +{ + int ret; + + /* + * The default resource group can neither be removed nor lose the + * default closid associated with it. + */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Cannot pseudo-lock default group\n"); + return -EINVAL; + } + + /* + * Cache Pseudo-locking not supported when CDP is enabled. + * + * Some things to consider if you would like to enable this + * support (using L3 CDP as example): + * - When CDP is enabled two separate resources are exposed, + * L3DATA and L3CODE, but they are actually on the same cache. + * The implication for pseudo-locking is that if a + * pseudo-locked region is created on a domain of one + * resource (eg. L3CODE), then a pseudo-locked region cannot + * be created on that same domain of the other resource + * (eg. L3DATA). This is because the creation of a + * pseudo-locked region involves a call to wbinvd that will + * affect all cache allocations on particular domain. + * - Considering the previous, it may be possible to only + * expose one of the CDP resources to pseudo-locking and + * hide the other. For example, we could consider to only + * expose L3DATA and since the L3 cache is unified it is + * still possible to place instructions there are execute it. + * - If only one region is exposed to pseudo-locking we should + * still keep in mind that availability of a portion of cache + * for pseudo-locking should take into account both resources. + * Similarly, if a pseudo-locked region is created in one + * resource, the portion of cache used by it should be made + * unavailable to all future allocations from both resources. + */ + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3) || + resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) { + rdt_last_cmd_puts("CDP enabled\n"); + return -EINVAL; + } + + /* + * Not knowing the bits to disable prefetching implies that this + * platform does not support Cache Pseudo-Locking. + */ + if (resctrl_arch_get_prefetch_disable_bits() == 0) { + rdt_last_cmd_puts("Pseudo-locking not supported\n"); + return -EINVAL; + } + + if (rdtgroup_monitor_in_progress(rdtgrp)) { + rdt_last_cmd_puts("Monitoring in progress\n"); + return -EINVAL; + } + + if (rdtgroup_tasks_assigned(rdtgrp)) { + rdt_last_cmd_puts("Tasks assigned to resource group\n"); + return -EINVAL; + } + + if (!cpumask_empty(&rdtgrp->cpu_mask)) { + rdt_last_cmd_puts("CPUs assigned to resource group\n"); + return -EINVAL; + } + + if (rdtgroup_locksetup_user_restrict(rdtgrp)) { + rdt_last_cmd_puts("Unable to modify resctrl permissions\n"); + return -EIO; + } + + ret = pseudo_lock_init(rdtgrp); + if (ret) { + rdt_last_cmd_puts("Unable to init pseudo-lock region\n"); + goto out_release; + } + + /* + * If this system is capable of monitoring a rmid would have been + * allocated when the control group was created. This is not needed + * anymore when this group would be used for pseudo-locking. This + * is safe to call on platforms not capable of monitoring. + */ + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + ret = 0; + goto out; + +out_release: + rdtgroup_locksetup_user_restore(rdtgrp); +out: + return ret; +} + +/** + * rdtgroup_locksetup_exit - resource group exist locksetup mode + * @rdtgrp: resource group + * + * When a resource group exits locksetup mode the earlier restrictions are + * lifted. + * + * Return: 0 on success, <0 on failure + */ +int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) +{ + int ret; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + + if (resctrl_arch_mon_capable()) { + ret = alloc_rmid(rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + } + + ret = rdtgroup_locksetup_user_restore(rdtgrp); + if (ret) { + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + return ret; + } + + pseudo_lock_free(rdtgrp); + return 0; +} + +/** + * rdtgroup_cbm_overlaps_pseudo_locked - Test if CBM or portion is pseudo-locked + * @d: RDT domain + * @cbm: CBM to test + * + * @d represents a cache instance and @cbm a capacity bitmask that is + * considered for it. Determine if @cbm overlaps with any existing + * pseudo-locked region on @d. + * + * @cbm is unsigned long, even if only 32 bits are used, to make the + * bitmap functions work correctly. + * + * Return: true if @cbm overlaps with pseudo-locked region on @d, false + * otherwise. + */ +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm) +{ + unsigned int cbm_len; + unsigned long cbm_b; + + if (d->plr) { + cbm_len = d->plr->s->res->cache.cbm_len; + cbm_b = d->plr->cbm; + if (bitmap_intersects(&cbm, &cbm_b, cbm_len)) + return true; + } + return false; +} + +/** + * rdtgroup_pseudo_locked_in_hierarchy - Pseudo-locked region in cache hierarchy + * @d: RDT domain under test + * + * The setup of a pseudo-locked region affects all cache instances within + * the hierarchy of the region. It is thus essential to know if any + * pseudo-locked regions exist within a cache hierarchy to prevent any + * attempts to create new pseudo-locked regions in the same hierarchy. + * + * Return: true if a pseudo-locked region exists in the hierarchy of @d or + * if it is not possible to test due to memory allocation issue, + * false otherwise. + */ +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) +{ + cpumask_var_t cpu_with_psl; + enum resctrl_res_level i; + struct rdt_resource *r; + struct rdt_domain *d_i; + bool ret = false; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + + if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) + return true; + + /* + * First determine which cpus have pseudo-locked regions + * associated with them. + */ + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + list_for_each_entry(d_i, &r->domains, list) { + if (d_i->plr) + cpumask_or(cpu_with_psl, cpu_with_psl, + &d_i->cpu_mask); + } + } + + /* + * Next test if new pseudo-locked region would intersect with + * existing region. + */ + if (cpumask_intersects(&d->cpu_mask, cpu_with_psl)) + ret = true; + + free_cpumask_var(cpu_with_psl); + return ret; +} + +/** + * pseudo_lock_measure_cycles - Trigger latency measure to pseudo-locked region + * @rdtgrp: Resource group to which the pseudo-locked region belongs. + * @sel: Selector of which measurement to perform on a pseudo-locked region. + * + * The measurement of latency to access a pseudo-locked region should be + * done from a cpu that is associated with that pseudo-locked region. + * Determine which cpu is associated with this region and start a thread on + * that cpu to perform the measurement, wait for that thread to complete. + * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int cpu; + int ret = -1; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out; + } + + if (!plr->d) { + ret = -ENODEV; + goto out; + } + + plr->thread_done = 0; + cpu = cpumask_first(&plr->d->cpu_mask); + if (!cpu_online(cpu)) { + ret = -ENODEV; + goto out; + } + + plr->cpu = cpu; + + if (sel == 1) + thread = kthread_create_on_node(resctrl_arch_measure_cycles_lat_fn, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 2) + thread = kthread_create_on_node(resctrl_arch_measure_l2_residency, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 3) + thread = kthread_create_on_node(resctrl_arch_measure_l3_residency, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else + goto out; + + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + goto out; + } + kthread_bind(thread, cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) + goto out; + + ret = 0; + +out: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +static ssize_t pseudo_lock_measure_trigger(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct rdtgroup *rdtgrp = file->private_data; + size_t buf_size; + char buf[32]; + int ret; + int sel; + + buf_size = min(count, (sizeof(buf) - 1)); + if (copy_from_user(buf, user_buf, buf_size)) + return -EFAULT; + + buf[buf_size] = '\0'; + ret = kstrtoint(buf, 10, &sel); + if (ret == 0) { + if (sel != 1 && sel != 2 && sel != 3) + return -EINVAL; + ret = debugfs_file_get(file->f_path.dentry); + if (ret) + return ret; + ret = pseudo_lock_measure_cycles(rdtgrp, sel); + if (ret == 0) + ret = count; + debugfs_file_put(file->f_path.dentry); + } + + return ret; +} + +static const struct file_operations pseudo_measure_fops = { + .write = pseudo_lock_measure_trigger, + .open = simple_open, + .llseek = default_llseek, +}; + +/** + * rdtgroup_pseudo_lock_create - Create a pseudo-locked region + * @rdtgrp: resource group to which pseudo-lock region belongs + * + * Called when a resource group in the pseudo-locksetup mode receives a + * valid schemata that should be pseudo-locked. Since the resource group is + * in pseudo-locksetup mode the &struct pseudo_lock_region has already been + * allocated and initialized with the essential information. If a failure + * occurs the resource group remains in the pseudo-locksetup mode with the + * &struct pseudo_lock_region associated with it, but cleared from all + * information and ready for the user to re-attempt pseudo-locking by + * writing the schemata again. + * + * Return: 0 if the pseudo-locked region was successfully pseudo-locked, <0 + * on failure. Descriptive error will be written to last_cmd_status buffer. + */ +int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int new_minor; + struct device *dev; + int ret; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + + ret = pseudo_lock_region_alloc(plr); + if (ret < 0) + return ret; + + ret = pseudo_lock_cstates_constrain(plr); + if (ret < 0) { + ret = -EINVAL; + goto out_region; + } + + plr->thread_done = 0; + + plr->closid = rdtgrp->closid; + thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, plr, + cpu_to_node(plr->cpu), + "pseudo_lock/%u", plr->cpu); + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + rdt_last_cmd_printf("Locking thread returned error %d\n", ret); + goto out_cstates; + } + + kthread_bind(thread, plr->cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) { + /* + * If the thread does not get on the CPU for whatever + * reason and the process which sets up the region is + * interrupted then this will leave the thread in runnable + * state and once it gets on the CPU it will dereference + * the cleared, but not freed, plr struct resulting in an + * empty pseudo-locking loop. + */ + rdt_last_cmd_puts("Locking thread interrupted\n"); + goto out_cstates; + } + + ret = pseudo_lock_minor_get(&new_minor); + if (ret < 0) { + rdt_last_cmd_puts("Unable to obtain a new minor number\n"); + goto out_cstates; + } + + /* + * Unlock access but do not release the reference. The + * pseudo-locked region will still be here on return. + * + * The mutex has to be released temporarily to avoid a potential + * deadlock with the mm->mmap_lock which is obtained in the + * device_create() and debugfs_create_dir() callpath below as well as + * before the mmap() callback is called. + */ + mutex_unlock(&rdtgroup_mutex); + + if (!IS_ERR_OR_NULL(debugfs_resctrl)) { + plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name, + debugfs_resctrl); + if (!IS_ERR_OR_NULL(plr->debugfs_dir)) + debugfs_create_file("pseudo_lock_measure", 0200, + plr->debugfs_dir, rdtgrp, + &pseudo_measure_fops); + } + + dev = device_create(&pseudo_lock_class, NULL, + MKDEV(pseudo_lock_major, new_minor), + rdtgrp, "%s", rdtgrp->kn->name); + + mutex_lock(&rdtgroup_mutex); + + if (IS_ERR(dev)) { + ret = PTR_ERR(dev); + rdt_last_cmd_printf("Failed to create character device: %d\n", + ret); + goto out_debugfs; + } + + /* We released the mutex - check if group was removed while we did so */ + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out_device; + } + + plr->minor = new_minor; + + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKED; + closid_free(rdtgrp->closid); + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0444); + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0444); + + ret = 0; + goto out; + +out_device: + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, new_minor)); +out_debugfs: + debugfs_remove_recursive(plr->debugfs_dir); + pseudo_lock_minor_release(new_minor); +out_cstates: + pseudo_lock_cstates_relax(plr); +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * rdtgroup_pseudo_lock_remove - Remove a pseudo-locked region + * @rdtgrp: resource group to which the pseudo-locked region belongs + * + * The removal of a pseudo-locked region can be initiated when the resource + * group is removed from user space via a "rmdir" from userspace or the + * unmount of the resctrl filesystem. On removal the resource group does + * not go back to pseudo-locksetup mode before it is removed, instead it is + * removed directly. There is thus asymmetry with the creation where the + * &struct pseudo_lock_region is removed here while it was not created in + * rdtgroup_pseudo_lock_create(). + * + * Return: void + */ +void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + /* + * Default group cannot be a pseudo-locked region so we can + * free closid here. + */ + closid_free(rdtgrp->closid); + goto free; + } + + pseudo_lock_cstates_relax(plr); + debugfs_remove_recursive(rdtgrp->plr->debugfs_dir); + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor)); + pseudo_lock_minor_release(plr->minor); + +free: + pseudo_lock_free(rdtgrp); +} + +static int pseudo_lock_dev_open(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = region_find_by_minor(iminor(inode)); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + filp->private_data = rdtgrp; + atomic_inc(&rdtgrp->waitcount); + /* Perform a non-seekable open - llseek is not supported */ + filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); + + mutex_unlock(&rdtgroup_mutex); + + return 0; +} + +static int pseudo_lock_dev_release(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + filp->private_data = NULL; + atomic_dec(&rdtgrp->waitcount); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int pseudo_lock_dev_mremap(struct vm_area_struct *area) +{ + /* Not supported */ + return -EINVAL; +} + +static const struct vm_operations_struct pseudo_mmap_ops = { + .mremap = pseudo_lock_dev_mremap, +}; + +static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long vsize = vma->vm_end - vma->vm_start; + unsigned long off = vma->vm_pgoff << PAGE_SHIFT; + struct pseudo_lock_region *plr; + struct rdtgroup *rdtgrp; + unsigned long physical; + unsigned long psize; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + plr = rdtgrp->plr; + + if (!plr->d) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + /* + * Task is required to run with affinity to the cpus associated + * with the pseudo-locked region. If this is not the case the task + * may be scheduled elsewhere and invalidate entries in the + * pseudo-locked region. + */ + if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) { + mutex_unlock(&rdtgroup_mutex); + return -EINVAL; + } + + physical = __pa(plr->kmem) >> PAGE_SHIFT; + psize = plr->size - off; + + if (off > plr->size) { + mutex_unlock(&rdtgroup_mutex); + return -ENOSPC; + } + + /* + * Ensure changes are carried directly to the memory being mapped, + * do not allow copy-on-write mapping. + */ + if (!(vma->vm_flags & VM_SHARED)) { + mutex_unlock(&rdtgroup_mutex); + return -EINVAL; + } + + if (vsize > psize) { + mutex_unlock(&rdtgroup_mutex); + return -ENOSPC; + } + + memset(plr->kmem + off, 0, vsize); + + if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff, + vsize, vma->vm_page_prot)) { + mutex_unlock(&rdtgroup_mutex); + return -EAGAIN; + } + vma->vm_ops = &pseudo_mmap_ops; + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static const struct file_operations pseudo_lock_dev_fops = { + .owner = THIS_MODULE, + .llseek = no_llseek, + .read = NULL, + .write = NULL, + .open = pseudo_lock_dev_open, + .release = pseudo_lock_dev_release, + .mmap = pseudo_lock_dev_mmap, +}; + +int rdt_pseudo_lock_init(void) +{ + int ret; + + ret = register_chrdev(0, "pseudo_lock", &pseudo_lock_dev_fops); + if (ret < 0) + return ret; + + pseudo_lock_major = ret; + + ret = class_register(&pseudo_lock_class); + if (ret) { + unregister_chrdev(pseudo_lock_major, "pseudo_lock"); + return ret; + } + + return 0; +} + +void rdt_pseudo_lock_release(void) +{ + class_unregister(&pseudo_lock_class); + unregister_chrdev(pseudo_lock_major, "pseudo_lock"); + pseudo_lock_major = 0; +} diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index e69de29bb2d1..0a3b99ca3b48 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -0,0 +1,4014 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * User interface for Resource Allocation in Resource Director Technology(RDT) + * + * Copyright (C) 2016 Intel Corporation + * + * Author: Fenghua Yu fenghua.yu@intel.com + * + * More information about RDT be found in the Intel (R) x86 Architecture + * Software Developer Manual. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/cacheinfo.h> +#include <linux/cpu.h> +#include <linux/debugfs.h> +#include <linux/fs.h> +#include <linux/fs_parser.h> +#include <linux/sysfs.h> +#include <linux/kernfs.h> +#include <linux/seq_buf.h> +#include <linux/seq_file.h> +#include <linux/sched/signal.h> +#include <linux/sched/task.h> +#include <linux/slab.h> +#include <linux/task_work.h> +#include <linux/user_namespace.h> + +#include <uapi/linux/magic.h> + +#include <asm/resctrl.h> +#include "internal.h" + +/* Mutex to protect rdtgroup access. */ +DEFINE_MUTEX(rdtgroup_mutex); + +static struct kernfs_root *rdt_root; +struct rdtgroup rdtgroup_default; +LIST_HEAD(rdt_all_groups); + +/* list of entries for the schemata file */ +LIST_HEAD(resctrl_schema_all); + +/* The filesystem can only be mounted once. */ +bool resctrl_mounted; + +/* Kernel fs node for "info" directory under root */ +static struct kernfs_node *kn_info; + +/* Kernel fs node for "mon_groups" directory under root */ +static struct kernfs_node *kn_mongrp; + +/* Kernel fs node for "mon_data" directory under root */ +static struct kernfs_node *kn_mondata; + +/* + * Used to store the max resource name width and max resource data width + * to display the schemata in a tabular format + */ +int max_name_width, max_data_width; + +static struct seq_buf last_cmd_status; +static char last_cmd_status_buf[512]; + +static int rdtgroup_setup_root(struct rdt_fs_context *ctx); +static void rdtgroup_destroy_root(void); + +struct dentry *debugfs_resctrl; + +static bool resctrl_debug; + +void rdt_last_cmd_clear(void) +{ + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_clear(&last_cmd_status); +} + +void rdt_last_cmd_puts(const char *s) +{ + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_puts(&last_cmd_status, s); +} + +void rdt_last_cmd_printf(const char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_vprintf(&last_cmd_status, fmt, ap); + va_end(ap); +} + +void rdt_staged_configs_clear(void) +{ + struct rdt_resource *r; + struct rdt_domain *dom; + int i; + + lockdep_assert_held(&rdtgroup_mutex); + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + list_for_each_entry(dom, &r->domains, list) + memset(dom->staged_config, 0, sizeof(dom->staged_config)); + } +} + +static bool resctrl_is_mbm_enabled(void) +{ + return (resctrl_arch_is_mbm_total_enabled() || + resctrl_arch_is_mbm_local_enabled()); +} + +static bool resctrl_is_mbm_event(int e) +{ + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && + e <= QOS_L3_MBM_LOCAL_EVENT_ID); +} + +/* + * Trivial allocator for CLOSIDs. Since h/w only supports a small number, + * we can keep a bitmap of free CLOSIDs in a single integer. + * + * Using a global CLOSID across all resources has some advantages and + * some drawbacks: + * + We can simply set current's closid to assign a task to a resource + * group. + * + Context switch code can avoid extra memory references deciding which + * CLOSID to load into the PQR_ASSOC MSR + * - We give up some options in configuring resource groups across multi-socket + * systems. + * - Our choices on how to configure each resource become progressively more + * limited as the number of resources grows. + */ +static unsigned long closid_free_map; +static int closid_free_map_len; + +int closids_supported(void) +{ + return closid_free_map_len; +} + +static void closid_init(void) +{ + struct resctrl_schema *s; + u32 rdt_min_closid = 32; + + /* Compute rdt_min_closid across all resources */ + list_for_each_entry(s, &resctrl_schema_all, list) + rdt_min_closid = min(rdt_min_closid, s->num_closid); + + closid_free_map = BIT_MASK(rdt_min_closid) - 1; + + /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ + __clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map); + closid_free_map_len = rdt_min_closid; +} + +static int closid_alloc(void) +{ + int cleanest_closid; + u32 closid; + + lockdep_assert_held(&rdtgroup_mutex); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + cleanest_closid = resctrl_find_cleanest_closid(); + if (cleanest_closid < 0) + return cleanest_closid; + closid = cleanest_closid; + } else { + closid = ffs(closid_free_map); + if (closid == 0) + return -ENOSPC; + closid--; + } + __clear_bit(closid, &closid_free_map); + + return closid; +} + +void closid_free(int closid) +{ + lockdep_assert_held(&rdtgroup_mutex); + + __set_bit(closid, &closid_free_map); +} + +/** + * closid_allocated - test if provided closid is in use + * @closid: closid to be tested + * + * Return: true if @closid is currently associated with a resource group, + * false if @closid is free + */ +bool closid_allocated(unsigned int closid) +{ + lockdep_assert_held(&rdtgroup_mutex); + + return !test_bit(closid, &closid_free_map); +} + +/** + * rdtgroup_mode_by_closid - Return mode of resource group with closid + * @closid: closid if the resource group + * + * Each resource group is associated with a @closid. Here the mode + * of a resource group can be queried by searching for it using its closid. + * + * Return: mode as &enum rdtgrp_mode of resource group with closid @closid + */ +enum rdtgrp_mode rdtgroup_mode_by_closid(int closid) +{ + struct rdtgroup *rdtgrp; + + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (rdtgrp->closid == closid) + return rdtgrp->mode; + } + + return RDT_NUM_MODES; +} + +static const char * const rdt_mode_str[] = { + [RDT_MODE_SHAREABLE] = "shareable", + [RDT_MODE_EXCLUSIVE] = "exclusive", + [RDT_MODE_PSEUDO_LOCKSETUP] = "pseudo-locksetup", + [RDT_MODE_PSEUDO_LOCKED] = "pseudo-locked", +}; + +/** + * rdtgroup_mode_str - Return the string representation of mode + * @mode: the resource group mode as &enum rdtgroup_mode + * + * Return: string representation of valid mode, "unknown" otherwise + */ +static const char *rdtgroup_mode_str(enum rdtgrp_mode mode) +{ + if (mode < RDT_MODE_SHAREABLE || mode >= RDT_NUM_MODES) + return "unknown"; + + return rdt_mode_str[mode]; +} + +/* set uid and gid of rdtgroup dirs and files to that of the creator */ +static int rdtgroup_kn_set_ugid(struct kernfs_node *kn) +{ + struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID, + .ia_uid = current_fsuid(), + .ia_gid = current_fsgid(), }; + + if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) && + gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID)) + return 0; + + return kernfs_setattr(kn, &iattr); +} + +static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft) +{ + struct kernfs_node *kn; + int ret; + + kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, + 0, rft->kf_ops, rft, NULL, NULL); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; + } + + return 0; +} + +static int rdtgroup_seqfile_show(struct seq_file *m, void *arg) +{ + struct kernfs_open_file *of = m->private; + struct rftype *rft = of->kn->priv; + + if (rft->seq_show) + return rft->seq_show(of, m, arg); + return 0; +} + +static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct rftype *rft = of->kn->priv; + + if (rft->write) + return rft->write(of, buf, nbytes, off); + + return -EINVAL; +} + +static const struct kernfs_ops rdtgroup_kf_single_ops = { + .atomic_write_len = PAGE_SIZE, + .write = rdtgroup_file_write, + .seq_show = rdtgroup_seqfile_show, +}; + +static const struct kernfs_ops kf_mondata_ops = { + .atomic_write_len = PAGE_SIZE, + .seq_show = rdtgroup_mondata_show, +}; + +static bool is_cpu_list(struct kernfs_open_file *of) +{ + struct rftype *rft = of->kn->priv; + + return rft->flags & RFTYPE_FLAGS_CPUS_LIST; +} + +static int rdtgroup_cpus_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + struct cpumask *mask; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + + if (rdtgrp) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + mask = &rdtgrp->plr->d->cpu_mask; + seq_printf(s, is_cpu_list(of) ? + "%*pbl\n" : "%*pb\n", + cpumask_pr_args(mask)); + } + } else { + seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n", + cpumask_pr_args(&rdtgrp->cpu_mask)); + } + } else { + ret = -ENOENT; + } + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +/* + * Update the PGR_ASSOC MSR on all cpus in @cpu_mask, + * + * Per task closids/rmids must have been set up before calling this function. + * @r may be NULL. + */ +static void +update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) +{ + struct resctrl_cpu_sync defaults; + struct resctrl_cpu_sync *defaults_p = NULL; + + if (r) { + defaults.closid = r->closid; + defaults.rmid = r->mon.rmid; + defaults_p = &defaults; + } + + on_each_cpu_mask(cpu_mask, resctrl_arch_sync_cpu_defaults, defaults_p, + 1); +} + +static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask) +{ + struct rdtgroup *prgrp = rdtgrp->mon.parent, *crgrp; + struct list_head *head; + + /* Check whether cpus belong to parent ctrl group */ + cpumask_andnot(tmpmask, newmask, &prgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + rdt_last_cmd_puts("Can only add CPUs to mongroup that belong to parent\n"); + return -EINVAL; + } + + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Give any dropped cpus to parent rdtgroup */ + cpumask_or(&prgrp->cpu_mask, &prgrp->cpu_mask, tmpmask); + update_closid_rmid(tmpmask, prgrp); + } + + /* + * If we added cpus, remove them from previous group that owned them + * and update per-cpu rmid + */ + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + if (crgrp == rdtgrp) + continue; + cpumask_andnot(&crgrp->cpu_mask, &crgrp->cpu_mask, + tmpmask); + } + update_closid_rmid(tmpmask, rdtgrp); + } + + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); + + return 0; +} + +static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m) +{ + struct rdtgroup *crgrp; + + cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m); + /* update the child mon group masks as well*/ + list_for_each_entry(crgrp, &r->mon.crdtgrp_list, mon.crdtgrp_list) + cpumask_and(&crgrp->cpu_mask, &r->cpu_mask, &crgrp->cpu_mask); +} + +static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask, cpumask_var_t tmpmask1) +{ + struct rdtgroup *r, *crgrp; + struct list_head *head; + + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Can't drop from default group */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Can't drop CPUs from default group\n"); + return -EINVAL; + } + + /* Give any dropped cpus to rdtgroup_default */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, tmpmask); + update_closid_rmid(tmpmask, &rdtgroup_default); + } + + /* + * If we added cpus, remove them from previous group and + * the prev group's child groups that owned them + * and update per-cpu closid/rmid. + */ + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { + if (r == rdtgrp) + continue; + cpumask_and(tmpmask1, &r->cpu_mask, tmpmask); + if (!cpumask_empty(tmpmask1)) + cpumask_rdtgrp_clear(r, tmpmask1); + } + update_closid_rmid(tmpmask, rdtgrp); + } + + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); + + /* + * Clear child mon group masks since there is a new parent mask + * now and update the rmid for the cpus the child lost. + */ + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + cpumask_and(tmpmask, &rdtgrp->cpu_mask, &crgrp->cpu_mask); + update_closid_rmid(tmpmask, rdtgrp); + cpumask_clear(&crgrp->cpu_mask); + } + + return 0; +} + +static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + cpumask_var_t tmpmask, newmask, tmpmask1; + struct rdtgroup *rdtgrp; + int ret; + + if (!buf) + return -EINVAL; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + return -ENOMEM; + } + if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + return -ENOMEM; + } + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto unlock; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + if (is_cpu_list(of)) + ret = cpulist_parse(buf, newmask); + else + ret = cpumask_parse(buf, newmask); + + if (ret) { + rdt_last_cmd_puts("Bad CPU list/mask\n"); + goto unlock; + } + + /* check that user didn't specify any offline cpus */ + cpumask_andnot(tmpmask, newmask, cpu_online_mask); + if (!cpumask_empty(tmpmask)) { + ret = -EINVAL; + rdt_last_cmd_puts("Can only assign online CPUs\n"); + goto unlock; + } + + if (rdtgrp->type == RDTCTRL_GROUP) + ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1); + else if (rdtgrp->type == RDTMON_GROUP) + ret = cpus_mon_write(rdtgrp, newmask, tmpmask); + else + ret = -EINVAL; + +unlock: + rdtgroup_kn_unlock(of->kn); + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + free_cpumask_var(tmpmask1); + + return ret ?: nbytes; +} + +/** + * rdtgroup_remove - the helper to remove resource group safely + * @rdtgrp: resource group to remove + * + * On resource group creation via a mkdir, an extra kernfs_node reference is + * taken to ensure that the rdtgroup structure remains accessible for the + * rdtgroup_kn_unlock() calls where it is removed. + * + * Drop the extra reference here, then free the rdtgroup structure. + * + * Return: void + */ +static void rdtgroup_remove(struct rdtgroup *rdtgrp) +{ + kernfs_put(rdtgrp->kn); + kfree(rdtgrp); +} + +static void _update_task_closid_rmid(void *task) +{ + /* + * If the task is still current on this CPU, update PQR_ASSOC MSR. + * Otherwise, the MSR is updated when the task is scheduled in. + */ + if (task == current) + resctrl_sched_in(task); +} + +static void update_task_closid_rmid(struct task_struct *t) +{ + if (IS_ENABLED(CONFIG_SMP) && task_curr(t)) + smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1); + else + _update_task_closid_rmid(t); +} + +static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp) +{ + u32 closid, rmid = rdtgrp->mon.rmid; + + if (rdtgrp->type == RDTCTRL_GROUP) + closid = rdtgrp->closid; + else if (rdtgrp->type == RDTMON_GROUP) + closid = rdtgrp->mon.parent->closid; + else + return false; + + return resctrl_arch_match_closid(tsk, closid) && + resctrl_arch_match_rmid(tsk, closid, rmid); +} + +static int __rdtgroup_move_task(struct task_struct *tsk, + struct rdtgroup *rdtgrp) +{ + /* If the task is already in rdtgrp, no need to move the task. */ + if (task_in_rdtgroup(tsk, rdtgrp)) + return 0; + + /* + * Set the task's closid/rmid before the PQR_ASSOC MSR can be + * updated by them. + * + * For ctrl_mon groups, move both closid and rmid. + * For monitor groups, can move the tasks only from + * their parent CTRL group. + */ + if (rdtgrp->type == RDTMON_GROUP && + !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) { + rdt_last_cmd_puts("Can't move task to different control group\n"); + return -EINVAL; + } + + if (rdtgrp->type == RDTMON_GROUP) + resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid, + rdtgrp->mon.rmid); + else + resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, + rdtgrp->mon.rmid); + + /* + * Ensure the task's closid and rmid are written before determining if + * the task is current that will decide if it will be interrupted. + * This pairs with the full barrier between the rq->curr update and + * resctrl_sched_in() during context switch. + */ + smp_mb(); + + /* + * By now, the task's closid and rmid are set. If the task is current + * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource + * group go into effect. If the task is not current, the MSR will be + * updated when the task is scheduled in. + */ + update_task_closid_rmid(tsk); + + return 0; +} + +static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) && + resctrl_arch_match_closid(t, r->closid)); +} + +static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) && + resctrl_arch_match_rmid(t, r->mon.parent->closid, + r->mon.rmid)); +} + +/** + * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group + * @r: Resource group + * + * Return: 1 if tasks have been assigned to @r, 0 otherwise + */ +int rdtgroup_tasks_assigned(struct rdtgroup *r) +{ + struct task_struct *p, *t; + int ret = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + ret = 1; + break; + } + } + rcu_read_unlock(); + + return ret; +} + +static int rdtgroup_task_write_permission(struct task_struct *task, + struct kernfs_open_file *of) +{ + const struct cred *tcred = get_task_cred(task); + const struct cred *cred = current_cred(); + int ret = 0; + + /* + * Even if we're attaching all tasks in the thread group, we only + * need to check permissions on one of them. + */ + if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && + !uid_eq(cred->euid, tcred->uid) && + !uid_eq(cred->euid, tcred->suid)) { + rdt_last_cmd_printf("No permission to move task %d\n", task->pid); + ret = -EPERM; + } + + put_cred(tcred); + return ret; +} + +static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp, + struct kernfs_open_file *of) +{ + struct task_struct *tsk; + int ret; + + rcu_read_lock(); + if (pid) { + tsk = find_task_by_vpid(pid); + if (!tsk) { + rcu_read_unlock(); + rdt_last_cmd_printf("No task %d\n", pid); + return -ESRCH; + } + } else { + tsk = current; + } + + get_task_struct(tsk); + rcu_read_unlock(); + + ret = rdtgroup_task_write_permission(tsk, of); + if (!ret) + ret = __rdtgroup_move_task(tsk, rdtgrp); + + put_task_struct(tsk); + return ret; +} + +static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + char *pid_str; + int ret = 0; + pid_t pid; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + rdt_last_cmd_clear(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + while (buf && buf[0] != '\0' && buf[0] != '\n') { + pid_str = strim(strsep(&buf, ",")); + + if (kstrtoint(pid_str, 0, &pid)) { + rdt_last_cmd_printf("Task list parsing error pid %s\n", pid_str); + ret = -EINVAL; + break; + } + + if (pid < 0) { + rdt_last_cmd_printf("Invalid pid %d\n", pid); + ret = -EINVAL; + break; + } + + ret = rdtgroup_move_task(pid, rdtgrp, of); + if (ret) { + rdt_last_cmd_printf("Error while processing task %d\n", pid); + break; + } + } + +unlock: + rdtgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + +static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s) +{ + struct task_struct *p, *t; + pid_t pid; + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + pid = task_pid_vnr(t); + if (pid) + seq_printf(s, "%d\n", pid); + } + } + rcu_read_unlock(); +} + +static int rdtgroup_tasks_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + show_rdt_tasks(rdtgrp, s); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_closid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->closid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_rmid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->mon.rmid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +#ifdef CONFIG_PROC_CPU_RESCTRL + +/* + * A task can only be part of one resctrl control group and of one monitor + * group which is associated to that control group. + * + * 1) res: + * mon: + * + * resctrl is not available. + * + * 2) res:/ + * mon: + * + * Task is part of the root resctrl control group, and it is not associated + * to any monitor group. + * + * 3) res:/ + * mon:mon0 + * + * Task is part of the root resctrl control group and monitor group mon0. + * + * 4) res:group0 + * mon: + * + * Task is part of resctrl control group group0, and it is not associated + * to any monitor group. + * + * 5) res:group0 + * mon:mon1 + * + * Task is part of resctrl control group group0 and monitor group mon1. + */ +int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, + struct pid *pid, struct task_struct *tsk) +{ + struct rdtgroup *rdtg; + int ret = 0; + + mutex_lock(&rdtgroup_mutex); + + /* Return empty if resctrl has not been mounted. */ + if (!resctrl_mounted) { + seq_puts(s, "res:\nmon:\n"); + goto unlock; + } + + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) { + struct rdtgroup *crg; + + /* + * Task information is only relevant for shareable + * and exclusive groups. + */ + if (rdtg->mode != RDT_MODE_SHAREABLE && + rdtg->mode != RDT_MODE_EXCLUSIVE) + continue; + + if (!resctrl_arch_match_closid(tsk, rdtg->closid)) + continue; + + seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "", + rdtg->kn->name); + seq_puts(s, "mon:"); + list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, + mon.crdtgrp_list) { + if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid, + crg->mon.rmid)) + continue; + seq_printf(s, "%s", crg->kn->name); + break; + } + seq_putc(s, '\n'); + goto unlock; + } + /* + * The above search should succeed. Otherwise return + * with an error. + */ + ret = -ENOENT; +unlock: + mutex_unlock(&rdtgroup_mutex); + + return ret; +} +#endif + +static int rdt_last_cmd_status_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + int len; + + mutex_lock(&rdtgroup_mutex); + len = seq_buf_used(&last_cmd_status); + if (len) + seq_printf(seq, "%.*s", len, last_cmd_status_buf); + else + seq_puts(seq, "ok\n"); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int rdt_num_closids_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + + seq_printf(seq, "%u\n", s->num_closid); + return 0; +} + +static int rdt_default_ctrl_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->default_ctrl); + return 0; +} + +static int rdt_min_cbm_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->cache.min_cbm_bits); + return 0; +} + +static int rdt_shareable_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->cache.shareable_bits); + return 0; +} + +/* + * rdt_bit_usage_show - Display current usage of resources + * + * A domain is a shared resource that can now be allocated differently. Here + * we display the current regions of the domain as an annotated bitmask. + * For each domain of this resource its allocation bitmask + * is annotated as below to indicate the current usage of the corresponding bit: + * 0 - currently unused + * X - currently available for sharing and used by software and hardware + * H - currently used by hardware only but available for software use + * S - currently used and shareable by software only + * E - currently used exclusively by one resource group + * P - currently pseudo-locked by one resource group + */ +static int rdt_bit_usage_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + /* + * Use unsigned long even though only 32 bits are used to ensure + * test_bit() is used safely. + */ + unsigned long sw_shareable = 0, hw_shareable = 0; + unsigned long exclusive = 0, pseudo_locked = 0; + struct rdt_resource *r = s->res; + struct rdt_domain *dom; + int i, hwb, swb, excl, psl; + enum rdtgrp_mode mode; + bool sep = false; + u32 ctrl_val; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + hw_shareable = r->cache.shareable_bits; + list_for_each_entry(dom, &r->domains, list) { + if (sep) + seq_putc(seq, ';'); + sw_shareable = 0; + exclusive = 0; + seq_printf(seq, "%d=", dom->id); + for (i = 0; i < closids_supported(); i++) { + if (!closid_allocated(i)) + continue; + ctrl_val = resctrl_arch_get_config(r, dom, i, + s->conf_type); + mode = rdtgroup_mode_by_closid(i); + switch (mode) { + case RDT_MODE_SHAREABLE: + sw_shareable |= ctrl_val; + break; + case RDT_MODE_EXCLUSIVE: + exclusive |= ctrl_val; + break; + case RDT_MODE_PSEUDO_LOCKSETUP: + /* + * RDT_MODE_PSEUDO_LOCKSETUP is possible + * here but not included since the CBM + * associated with this CLOSID in this mode + * is not initialized and no task or cpu can be + * assigned this CLOSID. + */ + break; + case RDT_MODE_PSEUDO_LOCKED: + case RDT_NUM_MODES: + WARN(1, + "invalid mode for closid %d\n", i); + break; + } + } + for (i = r->cache.cbm_len - 1; i >= 0; i--) { + pseudo_locked = dom->plr ? dom->plr->cbm : 0; + hwb = test_bit(i, &hw_shareable); + swb = test_bit(i, &sw_shareable); + excl = test_bit(i, &exclusive); + psl = test_bit(i, &pseudo_locked); + if (hwb && swb) + seq_putc(seq, 'X'); + else if (hwb && !swb) + seq_putc(seq, 'H'); + else if (!hwb && swb) + seq_putc(seq, 'S'); + else if (excl) + seq_putc(seq, 'E'); + else if (psl) + seq_putc(seq, 'P'); + else /* Unused bits remain */ + seq_putc(seq, '0'); + } + sep = true; + } + seq_putc(seq, '\n'); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return 0; +} + +static int rdt_min_bw_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.min_bw); + return 0; +} + +static int rdt_num_rmids_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + seq_printf(seq, "%d\n", r->num_rmid); + + return 0; +} + +static int rdt_mon_features_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + struct mon_evt *mevt; + + list_for_each_entry(mevt, &r->evt_list, list) { + seq_printf(seq, "%s\n", mevt->name); + if (mevt->configurable) + seq_printf(seq, "%s_config\n", mevt->name); + } + + return 0; +} + +static int rdt_bw_gran_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.bw_gran); + return 0; +} + +static int rdt_delay_linear_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.delay_linear); + return 0; +} + +static int max_threshold_occ_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold); + + return 0; +} + +static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + if (r->membw.throttle_mode == THREAD_THROTTLE_PER_THREAD) + seq_puts(seq, "per-thread\n"); + else + seq_puts(seq, "max\n"); + + return 0; +} + +static ssize_t max_threshold_occ_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + unsigned int bytes; + int ret; + + ret = kstrtouint(buf, 0, &bytes); + if (ret) + return ret; + + if (bytes > resctrl_rmid_realloc_limit) + return -EINVAL; + + resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes); + + return nbytes; +} + +/* + * rdtgroup_mode_show - Display mode of this resource group + */ +static int rdtgroup_mode_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + seq_printf(s, "%s\n", rdtgroup_mode_str(rdtgrp->mode)); + + rdtgroup_kn_unlock(of->kn); + return 0; +} + +static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type) +{ + switch (my_type) { + case CDP_CODE: + return CDP_DATA; + case CDP_DATA: + return CDP_CODE; + default: + case CDP_NONE: + return CDP_NONE; + } +} + +static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks); + + return 0; +} + +/** + * __rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other + * @r: Resource to which domain instance @d belongs. + * @d: The domain instance for which @closid is being tested. + * @cbm: Capacity bitmask being tested. + * @closid: Intended closid for @cbm. + * @type: CDP type of @r. + * @exclusive: Only check if overlaps with exclusive resource groups + * + * Checks if provided @cbm intended to be used for @closid on domain + * @d overlaps with any other closids or other hardware usage associated + * with this domain. If @exclusive is true then only overlaps with + * resource groups in exclusive mode will be considered. If @exclusive + * is false then overlaps with any resource group or hardware entities + * will be considered. + * + * @cbm is unsigned long, even if only 32 bits are used, to make the + * bitmap functions work correctly. + * + * Return: false if CBM does not overlap, true if it does. + */ +static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d, + unsigned long cbm, int closid, + enum resctrl_conf_type type, bool exclusive) +{ + enum rdtgrp_mode mode; + unsigned long ctrl_b; + int i; + + /* Check for any overlap with regions used by hardware directly */ + if (!exclusive) { + ctrl_b = r->cache.shareable_bits; + if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) + return true; + } + + /* Check for overlap with other resource groups */ + for (i = 0; i < closids_supported(); i++) { + ctrl_b = resctrl_arch_get_config(r, d, i, type); + mode = rdtgroup_mode_by_closid(i); + if (closid_allocated(i) && i != closid && + mode != RDT_MODE_PSEUDO_LOCKSETUP) { + if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) { + if (exclusive) { + if (mode == RDT_MODE_EXCLUSIVE) + return true; + continue; + } + return true; + } + } + } + + return false; +} + +/** + * rdtgroup_cbm_overlaps - Does CBM overlap with other use of hardware + * @s: Schema for the resource to which domain instance @d belongs. + * @d: The domain instance for which @closid is being tested. + * @cbm: Capacity bitmask being tested. + * @closid: Intended closid for @cbm. + * @exclusive: Only check if overlaps with exclusive resource groups + * + * Resources that can be allocated using a CBM can use the CBM to control + * the overlap of these allocations. rdtgroup_cmb_overlaps() is the test + * for overlap. Overlap test is not limited to the specific resource for + * which the CBM is intended though - when dealing with CDP resources that + * share the underlying hardware the overlap check should be performed on + * the CDP resource sharing the hardware also. + * + * Refer to description of __rdtgroup_cbm_overlaps() for the details of the + * overlap test. + * + * Return: true if CBM overlap detected, false if there is no overlap + */ +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, + unsigned long cbm, int closid, bool exclusive) +{ + enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); + struct rdt_resource *r = s->res; + + if (__rdtgroup_cbm_overlaps(r, d, cbm, closid, s->conf_type, + exclusive)) + return true; + + if (!resctrl_arch_get_cdp_enabled(r->rid)) + return false; + return __rdtgroup_cbm_overlaps(r, d, cbm, closid, peer_type, exclusive); +} + +/** + * rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive + * @rdtgrp: Resource group identified through its closid. + * + * An exclusive resource group implies that there should be no sharing of + * its allocated resources. At the time this group is considered to be + * exclusive this test can determine if its current schemata supports this + * setting by testing for overlap with all other resource groups. + * + * Return: true if resource group can be exclusive, false if there is overlap + * with allocations of other resource groups and thus this resource group + * cannot be exclusive. + */ +static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) +{ + int closid = rdtgrp->closid; + struct resctrl_schema *s; + struct rdt_resource *r; + bool has_cache = false; + struct rdt_domain *d; + u32 ctrl; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) + continue; + has_cache = true; + list_for_each_entry(d, &r->domains, list) { + ctrl = resctrl_arch_get_config(r, d, closid, + s->conf_type); + if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { + rdt_last_cmd_puts("Schemata overlaps\n"); + return false; + } + } + } + + if (!has_cache) { + rdt_last_cmd_puts("Cannot be exclusive without CAT/CDP\n"); + return false; + } + + return true; +} + +/* + * rdtgroup_mode_write - Modify the resource group's mode + */ +static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + enum rdtgrp_mode mode; + int ret = 0; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + buf[nbytes - 1] = '\0'; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + rdt_last_cmd_clear(); + + mode = rdtgrp->mode; + + if ((!strcmp(buf, "shareable") && mode == RDT_MODE_SHAREABLE) || + (!strcmp(buf, "exclusive") && mode == RDT_MODE_EXCLUSIVE) || + (!strcmp(buf, "pseudo-locksetup") && + mode == RDT_MODE_PSEUDO_LOCKSETUP) || + (!strcmp(buf, "pseudo-locked") && mode == RDT_MODE_PSEUDO_LOCKED)) + goto out; + + if (mode == RDT_MODE_PSEUDO_LOCKED) { + rdt_last_cmd_puts("Cannot change pseudo-locked group\n"); + ret = -EINVAL; + goto out; + } + + if (!strcmp(buf, "shareable")) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = RDT_MODE_SHAREABLE; + } else if (!strcmp(buf, "exclusive")) { + if (!rdtgroup_mode_test_exclusive(rdtgrp)) { + ret = -EINVAL; + goto out; + } + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = RDT_MODE_EXCLUSIVE; + } else if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + !strcmp(buf, "pseudo-locksetup")) { + ret = rdtgroup_locksetup_enter(rdtgrp); + if (ret) + goto out; + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKSETUP; + } else { + rdt_last_cmd_puts("Unknown or unsupported mode\n"); + ret = -EINVAL; + } + +out: + rdtgroup_kn_unlock(of->kn); + return ret ?: nbytes; +} + +/** + * rdtgroup_cbm_to_size - Translate CBM to size in bytes + * @r: RDT resource to which @d belongs. + * @d: RDT domain instance. + * @cbm: bitmask for which the size should be computed. + * + * The bitmask provided associated with the RDT domain instance @d will be + * translated into how many bytes it represents. The size in bytes is + * computed by first dividing the total cache size by the CBM length to + * determine how many bytes each bit in the bitmask represents. The result + * is multiplied with the number of bits set in the bitmask. + * + * @cbm is unsigned long, even if only 32 bits are used to make the + * bitmap functions work correctly. + */ +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, + struct rdt_domain *d, unsigned long cbm) +{ + struct cpu_cacheinfo *ci; + unsigned int size = 0; + int num_b, i; + + num_b = bitmap_weight(&cbm, r->cache.cbm_len); + ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask)); + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == r->cache_level) { + size = ci->info_list[i].size / r->cache.cbm_len * num_b; + break; + } + } + + return size; +} + +/* + * rdtgroup_size_show - Display size in bytes of allocated regions + * + * The "size" file mirrors the layout of the "schemata" file, printing the + * size in bytes of each region instead of the capacity bitmask. + */ +static int rdtgroup_size_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct resctrl_schema *schema; + enum resctrl_conf_type type; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + struct rdt_domain *d; + unsigned int size; + int ret = 0; + u32 closid; + bool sep; + u32 ctrl; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + seq_printf(s, "%*s:", max_name_width, + rdtgrp->plr->s->name); + size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res, + rdtgrp->plr->d, + rdtgrp->plr->cbm); + seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size); + } + goto out; + } + + closid = rdtgrp->closid; + + list_for_each_entry(schema, &resctrl_schema_all, list) { + r = schema->res; + type = schema->conf_type; + sep = false; + seq_printf(s, "%*s:", max_name_width, schema->name); + list_for_each_entry(d, &r->domains, list) { + if (sep) + seq_putc(s, ';'); + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + size = 0; + } else { + if (is_mba_sc(r)) + ctrl = d->mbps_val[closid]; + else + ctrl = resctrl_arch_get_config(r, d, + closid, + type); + if (r->rid == RDT_RESOURCE_MBA || + r->rid == RDT_RESOURCE_SMBA) + size = ctrl; + else + size = rdtgroup_cbm_to_size(r, d, ctrl); + } + seq_printf(s, "%d=%u", d->id, size); + sep = true; + } + seq_putc(s, '\n'); + } + +out: + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static void mondata_config_read(struct resctrl_mon_config_info *mon_info) +{ + smp_call_function_any(&mon_info->d->cpu_mask, + resctrl_arch_mon_event_config_read, mon_info, 1); +} + +static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) +{ + struct resctrl_mon_config_info mon_info = {0}; + struct rdt_domain *dom; + bool sep = false; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + list_for_each_entry(dom, &r->domains, list) { + if (sep) + seq_puts(s, ";"); + + memset(&mon_info, 0, sizeof(struct resctrl_mon_config_info)); + mon_info.r = r; + mon_info.d = dom; + mon_info.evtid = evtid; + mondata_config_read(&mon_info); + + seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); + sep = true; + } + seq_puts(s, "\n"); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return 0; +} + +static int mbm_total_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); + + return 0; +} + +static int mbm_local_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); + + return 0; +} + +static int mbm_config_write_domain(struct rdt_resource *r, + struct rdt_domain *d, u32 evtid, u32 val) +{ + struct resctrl_mon_config_info mon_info = {0}; + + /* mon_config cannot be more than the supported set of events */ + if (val > MAX_EVT_CONFIG_BITS) { + rdt_last_cmd_puts("Invalid event configuration\n"); + return -EINVAL; + } + + /* + * Read the current config value first. If both are the same then + * no need to write it again. + */ + mon_info.r = r; + mon_info.d = d; + mon_info.evtid = evtid; + mondata_config_read(&mon_info); + if (mon_info.mon_config == val) + goto out; + + mon_info.mon_config = val; + + /* + * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the + * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE + * are scoped at the domain level. Writing any of these MSRs + * on one CPU is observed by all the CPUs in the domain. + */ + smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, + &mon_info, 1); + if (mon_info.err) { + rdt_last_cmd_puts("Invalid event configuration\n"); + goto out; + } + + /* + * When an Event Configuration is changed, the bandwidth counters + * for all RMIDs and Events will be cleared by the hardware. The + * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for + * every RMID on the next read to any event for every RMID. + * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) + * cleared while it is tracked by the hardware. Clear the + * mbm_local and mbm_total counts for all the RMIDs. + */ + resctrl_arch_reset_rmid_all(r, d); + +out: + return mon_info.err; +} + +static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) +{ + char *dom_str = NULL, *id_str; + unsigned long dom_id, val; + struct rdt_domain *d; + int ret = 0; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + +next: + if (!tok || tok[0] == '\0') + return 0; + + /* Start processing the strings for each domain */ + dom_str = strim(strsep(&tok, ";")); + id_str = strsep(&dom_str, "="); + + if (!id_str || kstrtoul(id_str, 10, &dom_id)) { + rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n"); + return -EINVAL; + } + + if (!dom_str || kstrtoul(dom_str, 16, &val)) { + rdt_last_cmd_puts("Non-numeric event configuration value\n"); + return -EINVAL; + } + + list_for_each_entry(d, &r->domains, list) { + if (d->id == dom_id) { + ret = mbm_config_write_domain(r, d, evtid, val); + if (ret) + return -EINVAL; + goto next; + } + } + + return -EINVAL; +} + +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return ret ?: nbytes; +} + +static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return ret ?: nbytes; +} + +/* rdtgroup information files for one cache resource. */ +static struct rftype res_common_files[] = { + { + .name = "last_cmd_status", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_last_cmd_status_show, + .fflags = RFTYPE_TOP_INFO, + }, + { + .name = "num_closids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_closids_show, + .fflags = RFTYPE_CTRL_INFO, + }, + { + .name = "mon_features", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_features_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "num_rmids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_rmids_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "cbm_mask", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_default_ctrl_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_cbm_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_cbm_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "shareable_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_shareable_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "bit_usage", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bit_usage_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_bandwidth", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_bw_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "bandwidth_gran", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bw_gran_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "delay_linear", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_delay_linear_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + /* + * Platform specific which (if any) capabilities are provided by + * thread_throttle_mode. Defer "fflags" initialization to platform + * discovery. + */ + { + .name = "thread_throttle_mode", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_thread_throttle_mode_show, + }, + { + .name = "max_threshold_occupancy", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = max_threshold_occ_write, + .seq_show = max_threshold_occ_show, + .fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "mbm_total_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_total_bytes_config_show, + .write = mbm_total_bytes_config_write, + }, + { + .name = "mbm_local_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_local_bytes_config_show, + .write = mbm_local_bytes_config_write, + }, + { + .name = "cpus", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "cpus_list", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_BASE, + }, + { + .name = "tasks", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_tasks_write, + .seq_show = rdtgroup_tasks_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "mon_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_rmid_show, + .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG, + }, + { + .name = "schemata", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_schemata_write, + .seq_show = rdtgroup_schemata_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "mode", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_mode_write, + .seq_show = rdtgroup_mode_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "size", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_size_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "sparse_masks", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_has_sparse_bitmasks_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "ctrl_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_closid_show, + .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG, + }, + +}; + +static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags) +{ + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + lockdep_assert_held(&rdtgroup_mutex); + + if (resctrl_debug) + fflags |= RFTYPE_DEBUG; + + for (rft = rfts; rft < rfts + len; rft++) { + if (rft->fflags && ((fflags & rft->fflags) == rft->fflags)) { + ret = rdtgroup_add_file(kn, rft); + if (ret) + goto error; + } + } + + return 0; +error: + pr_warn("Failed to add %s, err=%d\n", rft->name, ret); + while (--rft >= rfts) { + if ((fflags & rft->fflags) == rft->fflags) + kernfs_remove_by_name(kn, rft->name); + } + return ret; +} + +static struct rftype *rdtgroup_get_rftype_by_name(const char *name) +{ + struct rftype *rfts, *rft; + int len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + return rft; + } + + return NULL; +} + +static void thread_throttle_mode_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + struct rftype *rft; + + if (!r->alloc_capable || + r->membw.throttle_mode == THREAD_THROTTLE_UNDEFINED) + return; + + rft = rdtgroup_get_rftype_by_name("thread_throttle_mode"); + if (!rft) + return; + + rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; +} + +void mbm_config_rftype_init(const char *config) +{ + struct rftype *rft; + + rft = rdtgroup_get_rftype_by_name(config); + if (rft) + rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE; +} + +/** + * rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * + * The permissions of named resctrl file, directory, or link are modified + * to not allow read, write, or execute by any user. + * + * WARNING: This function is intended to communicate to the user that the + * resctrl file has been locked down - that it is not relevant to the + * particular state the system finds itself in. It should not be relied + * on to protect from user access because after the file's permissions + * are restricted the user can still change the permissions using chmod + * from the command line. + * + * Return: 0 on success, <0 on failure. + */ +int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn; + int ret = 0; + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + iattr.ia_mode = S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode = S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode = S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +/** + * rdtgroup_kn_mode_restore - Restore user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * @mask: Mask of permissions that should be restored + * + * Restore the permissions of the named file. If @name is a directory the + * permissions of its parent will be used. + * + * Return: 0 on success, <0 on failure. + */ +int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, + umode_t mask) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn, *parent; + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + iattr.ia_mode = rft->mode & mask; + } + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + parent = kernfs_get_parent(kn); + if (parent) { + iattr.ia_mode |= parent->mode; + kernfs_put(parent); + } + iattr.ia_mode |= S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode |= S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode |= S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +static int rdtgroup_mkdir_info_resdir(void *priv, char *name, + unsigned long fflags) +{ + struct kernfs_node *kn_subdir; + int ret; + + kn_subdir = kernfs_create_dir(kn_info, name, + kn_info->mode, priv); + if (IS_ERR(kn_subdir)) + return PTR_ERR(kn_subdir); + + ret = rdtgroup_kn_set_ugid(kn_subdir); + if (ret) + return ret; + + ret = rdtgroup_add_files(kn_subdir, fflags); + if (!ret) + kernfs_activate(kn_subdir); + + return ret; +} + +static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) +{ + enum resctrl_res_level i; + struct resctrl_schema *s; + struct rdt_resource *r; + unsigned long fflags; + char name[32]; + int ret; + + /* create the directory */ + kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL); + if (IS_ERR(kn_info)) + return PTR_ERR(kn_info); + + ret = rdtgroup_add_files(kn_info, RFTYPE_TOP_INFO); + if (ret) + goto out_destroy; + + /* loop over enabled controls, these are all alloc_capable */ + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + fflags = r->fflags | RFTYPE_CTRL_INFO; + ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); + if (ret) + goto out_destroy; + } + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + fflags = r->fflags | RFTYPE_MON_INFO; + sprintf(name, "%s_MON", r->name); + ret = rdtgroup_mkdir_info_resdir(r, name, fflags); + if (ret) + goto out_destroy; + } + + ret = rdtgroup_kn_set_ugid(kn_info); + if (ret) + goto out_destroy; + + kernfs_activate(kn_info); + + return 0; + +out_destroy: + kernfs_remove(kn_info); + return ret; +} + +static int +mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, + char *name, struct kernfs_node **dest_kn) +{ + struct kernfs_node *kn; + int ret; + + /* create the directory */ + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + if (dest_kn) + *dest_kn = kn; + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + + kernfs_activate(kn); + + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +static inline bool is_mba_linear(void) +{ + return resctrl_arch_get_resource(RDT_RESOURCE_MBA)->membw.delay_linear; +} + +static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d) +{ + u32 num_closid = resctrl_arch_get_num_closid(r); + int cpu = cpumask_any(&d->cpu_mask); + int i; + + d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val), + GFP_KERNEL, cpu_to_node(cpu)); + if (!d->mbps_val) + return -ENOMEM; + + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + + return 0; +} + +static void mba_sc_domain_destroy(struct rdt_resource *r, + struct rdt_domain *d) +{ + kfree(d->mbps_val); + d->mbps_val = NULL; +} + +/* + * MBA software controller is supported only if + * MBM is supported and MBA is in linear scale. + */ +static bool supports_mba_mbps(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + + return (resctrl_arch_is_mbm_local_enabled() && + r->alloc_capable && is_mba_linear()); +} + +/* + * Enable or disable the MBA software controller + * which helps user specify bandwidth in MBps. + */ +static int set_mba_sc(bool mba_sc) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rdt_domain *d; + int i; + + if (!supports_mba_mbps() || mba_sc == is_mba_sc(r)) + return -EINVAL; + + r->membw.mba_sc = mba_sc; + + list_for_each_entry(d, &r->domains, list) { + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + } + + return 0; +} + +/* + * We don't allow rdtgroup directories to be created anywhere + * except the root directory. Thus when looking for the rdtgroup + * structure for a kernfs node we are either looking at a directory, + * in which case the rdtgroup structure is pointed at by the "priv" + * field, otherwise we have a file, and need only look to the parent + * to find the rdtgroup. + */ +static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn) +{ + if (kernfs_type(kn) == KERNFS_DIR) { + /* + * All the resource directories use "kn->priv" + * to point to the "struct rdtgroup" for the + * resource. "info" and its subdirectories don't + * have rdtgroup structures, so return NULL here. + */ + if (kn == kn_info || kn->parent == kn_info) + return NULL; + else + return kn->priv; + } else { + return kn->parent->priv; + } +} + +static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + atomic_inc(&rdtgrp->waitcount); + kernfs_break_active_protection(kn); +} + +static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + if (atomic_dec_and_test(&rdtgrp->waitcount) && + (rdtgrp->flags & RDT_DELETED)) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + kernfs_unbreak_active_protection(kn); + rdtgroup_remove(rdtgrp); + } else { + kernfs_unbreak_active_protection(kn); + } +} + +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return NULL; + + rdtgroup_kn_get(rdtgrp, kn); + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* Was this group deleted while we waited? */ + if (rdtgrp->flags & RDT_DELETED) + return NULL; + + return rdtgrp; +} + +void rdtgroup_kn_unlock(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return; + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + rdtgroup_kn_put(rdtgrp, kn); +} + +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **mon_data_kn); + +static void rdt_disable_ctx(void) +{ + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); + set_mba_sc(false); + + resctrl_debug = false; +} + +static int rdt_enable_ctx(struct rdt_fs_context *ctx) +{ + int ret = 0; + + if (ctx->enable_cdpl2) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, true); + if (ret) + goto out_done; + } + + if (ctx->enable_cdpl3) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, true); + if (ret) + goto out_cdpl2; + } + + if (ctx->enable_mba_mbps) { + ret = set_mba_sc(true); + if (ret) + goto out_cdpl3; + } + + if (ctx->enable_debug) + resctrl_debug = true; + + return 0; + +out_cdpl3: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); +out_cdpl2: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); +out_done: + return ret; +} + +static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type) +{ + struct resctrl_schema *s; + const char *suffix = ""; + int ret, cl; + + s = kzalloc(sizeof(*s), GFP_KERNEL); + if (!s) + return -ENOMEM; + + s->res = r; + s->num_closid = resctrl_arch_get_num_closid(r); + if (resctrl_arch_get_cdp_enabled(r->rid)) + s->num_closid /= 2; + + s->conf_type = type; + switch (type) { + case CDP_CODE: + suffix = "CODE"; + break; + case CDP_DATA: + suffix = "DATA"; + break; + case CDP_NONE: + suffix = ""; + break; + } + + ret = snprintf(s->name, sizeof(s->name), "%s%s", r->name, suffix); + if (ret >= sizeof(s->name)) { + kfree(s); + return -EINVAL; + } + + cl = strlen(s->name); + + /* + * If CDP is supported by this resource, but not enabled, + * include the suffix. This ensures the tabular format of the + * schemata file does not change between mounts of the filesystem. + */ + if (r->cdp_capable && !resctrl_arch_get_cdp_enabled(r->rid)) + cl += 4; + + if (cl > max_name_width) + max_name_width = cl; + + /* + * Choose a width for the resource data based on the resource that has + * widest name and cbm. + */ + max_data_width = max(max_data_width, r->data_width); + + INIT_LIST_HEAD(&s->list); + list_add(&s->list, &resctrl_schema_all); + + return 0; +} + +static int schemata_list_create(void) +{ + enum resctrl_res_level i; + struct rdt_resource *r; + int ret = 0; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + if (resctrl_arch_get_cdp_enabled(r->rid)) { + ret = schemata_list_add(r, CDP_CODE); + if (ret) + break; + + ret = schemata_list_add(r, CDP_DATA); + } else { + ret = schemata_list_add(r, CDP_NONE); + } + + if (ret) + break; + } + + return ret; +} + +static void schemata_list_destroy(void) +{ + struct resctrl_schema *s, *tmp; + + list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) { + list_del(&s->list); + kfree(s); + } +} + +static int rdt_get_tree(struct fs_context *fc) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdt_fs_context *ctx = rdt_fc2context(fc); + unsigned long flags = RFTYPE_CTRL_BASE; + struct rdt_domain *dom; + int ret; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + /* + * resctrl file system can only be mounted once. + */ + if (resctrl_mounted) { + ret = -EBUSY; + goto out; + } + + ret = rdtgroup_setup_root(ctx); + if (ret) + goto out; + + ret = rdt_enable_ctx(ctx); + if (ret) + goto out_root; + + ret = schemata_list_create(); + if (ret) { + schemata_list_destroy(); + goto out_ctx; + } + + closid_init(); + + if (resctrl_arch_mon_capable()) + flags |= RFTYPE_MON; + + ret = rdtgroup_add_files(rdtgroup_default.kn, flags); + if (ret) + goto out_schemata_free; + + kernfs_activate(rdtgroup_default.kn); + + ret = rdtgroup_create_info_dir(rdtgroup_default.kn); + if (ret < 0) + goto out_schemata_free; + + if (resctrl_arch_mon_capable()) { + ret = mongroup_create_dir(rdtgroup_default.kn, + &rdtgroup_default, "mon_groups", + &kn_mongrp); + if (ret < 0) + goto out_info; + + ret = mkdir_mondata_all(rdtgroup_default.kn, + &rdtgroup_default, &kn_mondata); + if (ret < 0) + goto out_mongrp; + rdtgroup_default.mon.mon_data_kn = kn_mondata; + } + + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) { + ret = rdt_pseudo_lock_init(); + if (ret) + goto out_mondata; + } + + ret = kernfs_get_tree(fc); + if (ret < 0) + goto out_psl; + + if (resctrl_arch_alloc_capable()) + resctrl_arch_enable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_enable_mon(); + + if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) + resctrl_mounted = true; + + if (resctrl_is_mbm_enabled()) { + list_for_each_entry(dom, &l3->domains, list) + mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); + } + + goto out; + +out_psl: + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); +out_mondata: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mondata); +out_mongrp: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mongrp); +out_info: + kernfs_remove(kn_info); +out_schemata_free: + schemata_list_destroy(); +out_ctx: + rdt_disable_ctx(); +out_root: + rdtgroup_destroy_root(); +out: + rdt_last_cmd_clear(); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +enum rdt_param { + Opt_cdp, + Opt_cdpl2, + Opt_mba_mbps, + Opt_debug, + nr__rdt_params +}; + +static const struct fs_parameter_spec rdt_fs_parameters[] = { + fsparam_flag("cdp", Opt_cdp), + fsparam_flag("cdpl2", Opt_cdpl2), + fsparam_flag("mba_MBps", Opt_mba_mbps), + fsparam_flag("debug", Opt_debug), + {} +}; + +static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + struct fs_parse_result result; + int opt; + + opt = fs_parse(fc, rdt_fs_parameters, param, &result); + if (opt < 0) + return opt; + + switch (opt) { + case Opt_cdp: + ctx->enable_cdpl3 = true; + return 0; + case Opt_cdpl2: + ctx->enable_cdpl2 = true; + return 0; + case Opt_mba_mbps: + if (!supports_mba_mbps()) + return -EINVAL; + ctx->enable_mba_mbps = true; + return 0; + case Opt_debug: + ctx->enable_debug = true; + return 0; + } + + return -EINVAL; +} + +static void rdt_fs_context_free(struct fs_context *fc) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + + kernfs_free_fs_context(fc); + kfree(ctx); +} + +static const struct fs_context_operations rdt_fs_context_ops = { + .free = rdt_fs_context_free, + .parse_param = rdt_parse_param, + .get_tree = rdt_get_tree, +}; + +static int rdt_init_fs_context(struct fs_context *fc) +{ + struct rdt_fs_context *ctx; + + ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; + fc->fs_private = &ctx->kfc; + fc->ops = &rdt_fs_context_ops; + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(&init_user_ns); + fc->global = true; + return 0; +} + +/* + * Move tasks from one to the other group. If @from is NULL, then all tasks + * in the systems are moved unconditionally (used for teardown). + * + * If @mask is not NULL the cpus on which moved tasks are running are set + * in that mask so the update smp function call is restricted to affected + * cpus. + */ +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, + struct cpumask *mask) +{ + struct task_struct *p, *t; + + read_lock(&tasklist_lock); + for_each_process_thread(p, t) { + if (!from || is_closid_match(t, from) || + is_rmid_match(t, from)) { + resctrl_arch_set_closid_rmid(t, to->closid, + to->mon.rmid); + + /* + * Order the closid/rmid stores above before the loads + * in task_curr(). This pairs with the full barrier + * between the rq->curr update and resctrl_sched_in() + * during context switch. + */ + smp_mb(); + + /* + * If the task is on a CPU, set the CPU in the mask. + * The detection is inaccurate as tasks might move or + * schedule before the smp function call takes place. + * In such a case the function call is pointless, but + * there is no other side effect. + */ + if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) + cpumask_set_cpu(task_cpu(t), mask); + } + } + read_unlock(&tasklist_lock); +} + +static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) +{ + struct rdtgroup *sentry, *stmp; + struct list_head *head; + + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { + free_rmid(sentry->closid, sentry->mon.rmid); + list_del(&sentry->mon.crdtgrp_list); + + if (atomic_read(&sentry->waitcount) != 0) + sentry->flags = RDT_DELETED; + else + rdtgroup_remove(sentry); + } +} + +/* + * Forcibly remove all of subdirectories under root. + */ +static void rmdir_all_sub(void) +{ + struct rdtgroup *rdtgrp, *tmp; + + /* Move all tasks to the default resource group */ + rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); + + list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { + /* Free any child rmids */ + free_all_child_rdtgrp(rdtgrp); + + /* Remove each rdtgroup other than root */ + if (rdtgrp == &rdtgroup_default) + continue; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + + /* + * Give any CPUs back to the default group. We cannot copy + * cpu_online_mask because a CPU might have executed the + * offline callback already, but is still marked online. + */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + kernfs_remove(rdtgrp->kn); + list_del(&rdtgrp->rdtgroup_list); + + if (atomic_read(&rdtgrp->waitcount) != 0) + rdtgrp->flags = RDT_DELETED; + else + rdtgroup_remove(rdtgrp); + } + /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ + update_closid_rmid(cpu_online_mask, &rdtgroup_default); + + kernfs_remove(kn_info); + kernfs_remove(kn_mongrp); + kernfs_remove(kn_mondata); +} + +static void rdt_kill_sb(struct super_block *sb) +{ + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_disable_ctx(); + + /* Put everything back to default values. */ + resctrl_arch_reset_resources(); + + rmdir_all_sub(); + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); + rdtgroup_default.mode = RDT_MODE_SHAREABLE; + schemata_list_destroy(); + rdtgroup_destroy_root(); + if (resctrl_arch_alloc_capable()) + resctrl_arch_disable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_disable_mon(); + resctrl_mounted = false; + kernfs_kill_sb(sb); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +static struct file_system_type rdt_fs_type = { + .name = "resctrl", + .init_fs_context = rdt_init_fs_context, + .parameters = rdt_fs_parameters, + .kill_sb = rdt_kill_sb, +}; + +static int mon_addfile(struct kernfs_node *parent_kn, const char *name, + void *priv) +{ + struct kernfs_node *kn; + int ret = 0; + + kn = __kernfs_create_file(parent_kn, name, 0444, + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, + &kf_mondata_ops, priv, NULL, NULL); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; + } + + return ret; +} + +/* + * Remove all subdirectories of mon_data of ctrl_mon groups + * and monitor groups with given domain id. + */ +static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + unsigned int dom_id) +{ + struct rdtgroup *prgrp, *crgrp; + char name[32]; + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + sprintf(name, "mon_%s_%02d", r->name, dom_id); + kernfs_remove_by_name(prgrp->mon.mon_data_kn, name); + + list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) + kernfs_remove_by_name(crgrp->mon.mon_data_kn, name); + } +} + +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, + struct rdt_domain *d, + struct rdt_resource *r, struct rdtgroup *prgrp) +{ + union mon_data_bits priv; + struct kernfs_node *kn; + struct mon_evt *mevt; + struct rmid_read rr; + char name[32]; + int ret; + + sprintf(name, "mon_%s_%02d", r->name, d->id); + /* create the directory */ + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + + if (WARN_ON(list_empty(&r->evt_list))) { + ret = -EPERM; + goto out_destroy; + } + + priv.u.rid = r->rid; + priv.u.domid = d->id; + list_for_each_entry(mevt, &r->evt_list, list) { + priv.u.evtid = mevt->evtid; + ret = mon_addfile(kn, mevt->name, priv.priv); + if (ret) + goto out_destroy; + + if (resctrl_is_mbm_event(mevt->evtid)) + mon_event_read(&rr, r, d, prgrp, mevt->evtid, true); + } + kernfs_activate(kn); + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +/* + * Add all subdirectories of mon_data for "ctrl_mon" groups + * and "monitor" groups with given domain id. + */ +static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + struct rdt_domain *d) +{ + struct kernfs_node *parent_kn; + struct rdtgroup *prgrp, *crgrp; + struct list_head *head; + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + parent_kn = prgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, prgrp); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + parent_kn = crgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, crgrp); + } + } +} + +static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, + struct rdt_resource *r, + struct rdtgroup *prgrp) +{ + struct rdt_domain *dom; + int ret; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(dom, &r->domains, list) { + ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); + if (ret) + return ret; + } + + return 0; +} + +/* + * This creates a directory mon_data which contains the monitored data. + * + * mon_data has one directory for each domain which are named + * in the format mon_<domain_name>_<domain_id>. For ex: A mon_data + * with L3 domain looks as below: + * ./mon_data: + * mon_L3_00 + * mon_L3_01 + * mon_L3_02 + * ... + * + * Each domain directory has one file per event: + * ./mon_L3_00/: + * llc_occupancy + * + */ +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **dest_kn) +{ + enum resctrl_res_level i; + struct rdt_resource *r; + struct kernfs_node *kn; + int ret; + + /* + * Create the mon_data directory first. + */ + ret = mongroup_create_dir(parent_kn, prgrp, "mon_data", &kn); + if (ret) + return ret; + + if (dest_kn) + *dest_kn = kn; + + /* + * Create the subdirectories for each domain. Note that all events + * in a domain like L3 are grouped into a resource whose domain is L3 + */ + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); + if (ret) + goto out_destroy; + } + + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +/** + * cbm_ensure_valid - Enforce validity on provided CBM + * @_val: Candidate CBM + * @r: RDT resource to which the CBM belongs + * + * The provided CBM represents all cache portions available for use. This + * may be represented by a bitmap that does not consist of contiguous ones + * and thus be an invalid CBM. + * Here the provided CBM is forced to be a valid CBM by only considering + * the first set of contiguous bits as valid and clearing all bits. + * The intention here is to provide a valid default CBM with which a new + * resource group is initialized. The user can follow this with a + * modification to the CBM if the default does not satisfy the + * requirements. + */ +static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r) +{ + unsigned int cbm_len = r->cache.cbm_len; + unsigned long first_bit, zero_bit; + unsigned long val = _val; + + if (!val) + return 0; + + first_bit = find_first_bit(&val, cbm_len); + zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); + + /* Clear any remaining bits to ensure contiguous region */ + bitmap_clear(&val, zero_bit, cbm_len - zero_bit); + return (u32)val; +} + +/* + * Initialize cache resources per RDT domain + * + * Set the RDT domain up to start off with all usable allocations. That is, + * all shareable and unused bits. All-zero CBM is invalid. + */ +static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s, + u32 closid) +{ + enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); + enum resctrl_conf_type t = s->conf_type; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + u32 used_b = 0, unused_b = 0; + unsigned long tmp_cbm; + enum rdtgrp_mode mode; + u32 peer_ctl, ctrl_val; + int i; + + cfg = &d->staged_config[t]; + cfg->have_new_ctrl = false; + cfg->new_ctrl = r->cache.shareable_bits; + used_b = r->cache.shareable_bits; + for (i = 0; i < closids_supported(); i++) { + if (closid_allocated(i) && i != closid) { + mode = rdtgroup_mode_by_closid(i); + if (mode == RDT_MODE_PSEUDO_LOCKSETUP) + /* + * ctrl values for locksetup aren't relevant + * until the schemata is written, and the mode + * becomes RDT_MODE_PSEUDO_LOCKED. + */ + continue; + /* + * If CDP is active include peer domain's + * usage to ensure there is no overlap + * with an exclusive group. + */ + if (resctrl_arch_get_cdp_enabled(r->rid)) + peer_ctl = resctrl_arch_get_config(r, d, i, + peer_type); + else + peer_ctl = 0; + ctrl_val = resctrl_arch_get_config(r, d, i, + s->conf_type); + used_b |= ctrl_val | peer_ctl; + if (mode == RDT_MODE_SHAREABLE) + cfg->new_ctrl |= ctrl_val | peer_ctl; + } + } + if (d->plr && d->plr->cbm > 0) + used_b |= d->plr->cbm; + unused_b = used_b ^ (BIT_MASK(r->cache.cbm_len) - 1); + unused_b &= BIT_MASK(r->cache.cbm_len) - 1; + cfg->new_ctrl |= unused_b; + /* + * Force the initial CBM to be valid, user can + * modify the CBM based on system availability. + */ + cfg->new_ctrl = cbm_ensure_valid(cfg->new_ctrl, r); + /* + * Assign the u32 CBM to an unsigned long to ensure that + * bitmap_weight() does not access out-of-bound memory. + */ + tmp_cbm = cfg->new_ctrl; + if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) { + rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id); + return -ENOSPC; + } + cfg->have_new_ctrl = true; + + return 0; +} + +/* + * Initialize cache resources with default values. + * + * A new RDT group is being created on an allocation capable (CAT) + * supporting system. Set this group up to start off with all usable + * allocations. + * + * If there are no more shareable bits available on any domain then + * the entire allocation will fail. + */ +static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) +{ + struct rdt_domain *d; + int ret; + + list_for_each_entry(d, &s->res->domains, list) { + ret = __init_one_rdt_domain(d, s, closid); + if (ret < 0) + return ret; + } + + return 0; +} + +/* Initialize MBA resource with default values. */ +static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) +{ + struct resctrl_staged_config *cfg; + struct rdt_domain *d; + + list_for_each_entry(d, &r->domains, list) { + if (is_mba_sc(r)) { + d->mbps_val[closid] = MBA_MAX_MBPS; + continue; + } + + cfg = &d->staged_config[CDP_NONE]; + cfg->new_ctrl = r->default_ctrl; + cfg->have_new_ctrl = true; + } +} + +/* Initialize the RDT group's allocations. */ +static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) +{ + struct resctrl_schema *s; + struct rdt_resource *r; + int ret = 0; + + rdt_staged_configs_clear(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_MBA || + r->rid == RDT_RESOURCE_SMBA) { + rdtgroup_init_mba(r, rdtgrp->closid); + if (is_mba_sc(r)) + continue; + } else { + ret = rdtgroup_init_cat(s, rdtgrp->closid); + if (ret < 0) + goto out; + } + + ret = resctrl_arch_update_domains(r, rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Failed to initialize allocations\n"); + goto out; + } + + } + + rdtgrp->mode = RDT_MODE_SHAREABLE; + +out: + rdt_staged_configs_clear(); + return ret; +} + +static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) +{ + int ret; + + if (!resctrl_arch_mon_capable()) + return 0; + + ret = alloc_rmid(rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + + ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + return ret; + } + + return 0; +} + +static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) +{ + if (resctrl_arch_mon_capable()) + free_rmid(rgrp->closid, rgrp->mon.rmid); +} + +static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, + const char *name, umode_t mode, + enum rdt_group_type rtype, struct rdtgroup **r) +{ + struct rdtgroup *prdtgrp, *rdtgrp; + unsigned long files = 0; + struct kernfs_node *kn; + int ret; + + prdtgrp = rdtgroup_kn_lock_live(parent_kn); + if (!prdtgrp) { + ret = -ENODEV; + goto out_unlock; + } + + if (rtype == RDTMON_GROUP && + (prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto out_unlock; + } + + /* allocate the rdtgroup. */ + rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL); + if (!rdtgrp) { + ret = -ENOSPC; + rdt_last_cmd_puts("Kernel out of memory\n"); + goto out_unlock; + } + *r = rdtgrp; + rdtgrp->mon.parent = prdtgrp; + rdtgrp->type = rtype; + INIT_LIST_HEAD(&rdtgrp->mon.crdtgrp_list); + + /* kernfs creates the directory for rdtgrp */ + kn = kernfs_create_dir(parent_kn, name, mode, rdtgrp); + if (IS_ERR(kn)) { + ret = PTR_ERR(kn); + rdt_last_cmd_puts("kernfs create error\n"); + goto out_free_rgrp; + } + rdtgrp->kn = kn; + + /* + * kernfs_remove() will drop the reference count on "kn" which + * will free it. But we still need it to stick around for the + * rdtgroup_kn_unlock(kn) call. Take one extra reference here, + * which will be dropped by kernfs_put() in rdtgroup_remove(). + */ + kernfs_get(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + rdt_last_cmd_puts("kernfs perm error\n"); + goto out_destroy; + } + + if (rtype == RDTCTRL_GROUP) { + files = RFTYPE_BASE | RFTYPE_CTRL; + if (resctrl_arch_mon_capable()) + files |= RFTYPE_MON; + } else { + files = RFTYPE_BASE | RFTYPE_MON; + } + + ret = rdtgroup_add_files(kn, files); + if (ret) { + rdt_last_cmd_puts("kernfs fill error\n"); + goto out_destroy; + } + + /* + * The caller unlocks the parent_kn upon success. + */ + return 0; + +out_destroy: + kernfs_put(rdtgrp->kn); + kernfs_remove(rdtgrp->kn); +out_free_rgrp: + kfree(rdtgrp); +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) +{ + kernfs_remove(rgrp->kn); + rdtgroup_remove(rgrp); +} + +/* + * Create a monitor group under "mon_groups" directory of a control + * and monitor group(ctrl_mon). This is a resource group + * to monitor a subset of tasks and cpus in its parent ctrl_mon group. + */ +static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp, *prgrp; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTMON_GROUP, &rdtgrp); + if (ret) + return ret; + + prgrp = rdtgrp->mon.parent; + rdtgrp->closid = prgrp->closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) { + mkdir_rdt_prepare_clean(rdtgrp); + goto out_unlock; + } + + kernfs_activate(rdtgrp->kn); + + /* + * Add the rdtgrp to the list of rdtgrps the parent + * ctrl_mon group has to track. + */ + list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list); + +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +/* + * These are rdtgroups created under the root directory. Can be used + * to allocate and monitor resources. + */ +static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp; + struct kernfs_node *kn; + u32 closid; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp); + if (ret) + return ret; + + kn = rdtgrp->kn; + ret = closid_alloc(); + if (ret < 0) { + rdt_last_cmd_puts("Out of CLOSIDs\n"); + goto out_common_fail; + } + closid = ret; + ret = 0; + + rdtgrp->closid = closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) + goto out_closid_free; + + kernfs_activate(rdtgrp->kn); + + ret = rdtgroup_init_alloc(rdtgrp); + if (ret < 0) + goto out_rmid_free; + + list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); + + if (resctrl_arch_mon_capable()) { + /* + * Create an empty mon_groups directory to hold the subset + * of tasks and cpus to monitor. + */ + ret = mongroup_create_dir(kn, rdtgrp, "mon_groups", NULL); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + goto out_del_list; + } + } + + goto out_unlock; + +out_del_list: + list_del(&rdtgrp->rdtgroup_list); +out_rmid_free: + mkdir_rdt_prepare_rmid_free(rdtgrp); +out_closid_free: + closid_free(closid); +out_common_fail: + mkdir_rdt_prepare_clean(rdtgrp); +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +/* + * We allow creating mon groups only with in a directory called "mon_groups" + * which is present in every ctrl_mon group. Check if this is a valid + * "mon_groups" directory. + * + * 1. The directory should be named "mon_groups". + * 2. The mon group itself should "not" be named "mon_groups". + * This makes sure "mon_groups" directory always has a ctrl_mon group + * as parent. + */ +static bool is_mon_groups(struct kernfs_node *kn, const char *name) +{ + return (!strcmp(kn->name, "mon_groups") && + strcmp(name, "mon_groups")); +} + +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, + umode_t mode) +{ + /* Do not accept '\n' to avoid unparsable situation. */ + if (strchr(name, '\n')) + return -EINVAL; + + /* + * If the parent directory is the root directory and RDT + * allocation is supported, add a control and monitoring + * subdirectory + */ + if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn) + return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode); + + /* + * If RDT monitoring is supported and the parent directory is a valid + * "mon_groups" directory, add a monitoring subdirectory. + */ + if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name)) + return rdtgroup_mkdir_mon(parent_kn, name, mode); + + return -EPERM; +} + +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +{ + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; + int cpu; + + /* Give any tasks back to the parent group */ + rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask); + + /* Update per cpu rmid of the moved CPUs first */ + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, rdtgrp->closid, + prdtgrp->mon.rmid); + + /* + * Update the MSR on moved CPUs and CPUs which have moved + * task running on them. + */ + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); + update_closid_rmid(tmpmask, NULL); + + rdtgrp->flags = RDT_DELETED; + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + /* + * Remove the rdtgrp from the parent ctrl_mon group's list + */ + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); + list_del(&rdtgrp->mon.crdtgrp_list); + + kernfs_remove(rdtgrp->kn); + + return 0; +} + +static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp) +{ + rdtgrp->flags = RDT_DELETED; + list_del(&rdtgrp->rdtgroup_list); + + kernfs_remove(rdtgrp->kn); + return 0; +} + +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +{ + u32 closid, rmid; + int cpu; + + /* Give any tasks back to the default group */ + rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); + + /* Give any CPUs back to the default group */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + /* Update per cpu closid and rmid of the moved CPUs first */ + closid = rdtgroup_default.closid; + rmid = rdtgroup_default.mon.rmid; + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, closid, rmid); + + /* + * Update the MSR on moved CPUs and CPUs which have moved + * task running on them. + */ + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); + update_closid_rmid(tmpmask, NULL); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + closid_free(rdtgrp->closid); + + rdtgroup_ctrl_remove(rdtgrp); + + /* + * Free all the child monitor group rmids. + */ + free_all_child_rdtgrp(rdtgrp); + + return 0; +} + +static int rdtgroup_rmdir(struct kernfs_node *kn) +{ + struct kernfs_node *parent_kn = kn->parent; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret = 0; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + + rdtgrp = rdtgroup_kn_lock_live(kn); + if (!rdtgrp) { + ret = -EPERM; + goto out; + } + + /* + * If the rdtgroup is a ctrl_mon group and parent directory + * is the root directory, remove the ctrl_mon group. + * + * If the rdtgroup is a mon group and parent directory + * is a valid "mon_groups" directory, remove the mon group. + */ + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn && + rdtgrp != &rdtgroup_default) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + ret = rdtgroup_ctrl_remove(rdtgrp); + } else { + ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask); + } + } else if (rdtgrp->type == RDTMON_GROUP && + is_mon_groups(parent_kn, kn->name)) { + ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask); + } else { + ret = -EPERM; + } + +out: + rdtgroup_kn_unlock(kn); + free_cpumask_var(tmpmask); + return ret; +} + +/** + * mongrp_reparent() - replace parent CTRL_MON group of a MON group + * @rdtgrp: the MON group whose parent should be replaced + * @new_prdtgrp: replacement parent CTRL_MON group for @rdtgrp + * @cpus: cpumask provided by the caller for use during this call + * + * Replaces the parent CTRL_MON group for a MON group, resulting in all member + * tasks' CLOSID immediately changing to that of the new parent group. + * Monitoring data for the group is unaffected by this operation. + */ +static void mongrp_reparent(struct rdtgroup *rdtgrp, + struct rdtgroup *new_prdtgrp, + cpumask_var_t cpus) +{ + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; + + WARN_ON(rdtgrp->type != RDTMON_GROUP); + WARN_ON(new_prdtgrp->type != RDTCTRL_GROUP); + + /* Nothing to do when simply renaming a MON group. */ + if (prdtgrp == new_prdtgrp) + return; + + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); + list_move_tail(&rdtgrp->mon.crdtgrp_list, + &new_prdtgrp->mon.crdtgrp_list); + + rdtgrp->mon.parent = new_prdtgrp; + rdtgrp->closid = new_prdtgrp->closid; + + /* Propagate updated closid to all tasks in this group. */ + rdt_move_group_tasks(rdtgrp, rdtgrp, cpus); + + update_closid_rmid(cpus, NULL); +} + +static int rdtgroup_rename(struct kernfs_node *kn, + struct kernfs_node *new_parent, const char *new_name) +{ + struct rdtgroup *new_prdtgrp; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret; + + rdtgrp = kernfs_to_rdtgroup(kn); + new_prdtgrp = kernfs_to_rdtgroup(new_parent); + if (!rdtgrp || !new_prdtgrp) + return -ENOENT; + + /* Release both kernfs active_refs before obtaining rdtgroup mutex. */ + rdtgroup_kn_get(rdtgrp, kn); + rdtgroup_kn_get(new_prdtgrp, new_parent); + + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + /* + * Don't allow kernfs_to_rdtgroup() to return a parent rdtgroup if + * either kernfs_node is a file. + */ + if (kernfs_type(kn) != KERNFS_DIR || + kernfs_type(new_parent) != KERNFS_DIR) { + rdt_last_cmd_puts("Source and destination must be directories"); + ret = -EPERM; + goto out; + } + + if ((rdtgrp->flags & RDT_DELETED) || (new_prdtgrp->flags & RDT_DELETED)) { + ret = -ENOENT; + goto out; + } + + if (rdtgrp->type != RDTMON_GROUP || !kn->parent || + !is_mon_groups(kn->parent, kn->name)) { + rdt_last_cmd_puts("Source must be a MON group\n"); + ret = -EPERM; + goto out; + } + + if (!is_mon_groups(new_parent, new_name)) { + rdt_last_cmd_puts("Destination must be a mon_groups subdirectory\n"); + ret = -EPERM; + goto out; + } + + /* + * If the MON group is monitoring CPUs, the CPUs must be assigned to the + * current parent CTRL_MON group and therefore cannot be assigned to + * the new parent, making the move illegal. + */ + if (!cpumask_empty(&rdtgrp->cpu_mask) && + rdtgrp->mon.parent != new_prdtgrp) { + rdt_last_cmd_puts("Cannot move a MON group that monitors CPUs\n"); + ret = -EPERM; + goto out; + } + + /* + * Allocate the cpumask for use in mongrp_reparent() to avoid the + * possibility of failing to allocate it after kernfs_rename() has + * succeeded. + */ + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) { + ret = -ENOMEM; + goto out; + } + + /* + * Perform all input validation and allocations needed to ensure + * mongrp_reparent() will succeed before calling kernfs_rename(), + * otherwise it would be necessary to revert this call if + * mongrp_reparent() failed. + */ + ret = kernfs_rename(kn, new_parent, new_name); + if (!ret) + mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask); + + free_cpumask_var(tmpmask); + +out: + mutex_unlock(&rdtgroup_mutex); + rdtgroup_kn_put(rdtgrp, kn); + rdtgroup_kn_put(new_prdtgrp, new_parent); + return ret; +} + +static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) +{ + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3)) + seq_puts(seq, ",cdp"); + + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) + seq_puts(seq, ",cdpl2"); + + if (is_mba_sc(resctrl_arch_get_resource(RDT_RESOURCE_MBA))) + seq_puts(seq, ",mba_MBps"); + + if (resctrl_debug) + seq_puts(seq, ",debug"); + + return 0; +} + +static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = { + .mkdir = rdtgroup_mkdir, + .rmdir = rdtgroup_rmdir, + .rename = rdtgroup_rename, + .show_options = rdtgroup_show_options, +}; + +static int rdtgroup_setup_root(struct rdt_fs_context *ctx) +{ + rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops, + KERNFS_ROOT_CREATE_DEACTIVATED | + KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK, + &rdtgroup_default); + if (IS_ERR(rdt_root)) + return PTR_ERR(rdt_root); + + ctx->kfc.root = rdt_root; + rdtgroup_default.kn = kernfs_root_to_node(rdt_root); + + return 0; +} + +static void rdtgroup_destroy_root(void) +{ + kernfs_destroy_root(rdt_root); + rdtgroup_default.kn = NULL; +} + +static void __init rdtgroup_setup_default(void) +{ + mutex_lock(&rdtgroup_mutex); + + rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID; + rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID; + rdtgroup_default.type = RDTCTRL_GROUP; + INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list); + + list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups); + + mutex_unlock(&rdtgroup_mutex); +} + +static void domain_destroy_mon_state(struct rdt_domain *d) +{ + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_total); + kfree(d->mbm_local); +} + +void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) +{ + mutex_lock(&rdtgroup_mutex); + + if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) + mba_sc_domain_destroy(r, d); + + if (!r->mon_capable) + goto out_unlock; + + /* + * If resctrl is mounted, remove all the + * per domain monitor data directories. + */ + if (resctrl_mounted && resctrl_arch_mon_capable()) + rmdir_mondata_subdir_allrdtgrp(r, d->id); + + if (resctrl_is_mbm_enabled()) + cancel_delayed_work(&d->mbm_over); + if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) { + /* + * When a package is going down, forcefully + * decrement rmid->ebusy. There is no way to know + * that the L3 was flushed and hence may lead to + * incorrect counts in rare scenarios, but leaving + * the RMID as busy creates RMID leaks if the + * package never comes back. + */ + __check_limbo(d, true); + cancel_delayed_work(&d->cqm_limbo); + } + + domain_destroy_mon_state(d); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); +} + +static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + size_t tsize; + + if (resctrl_arch_is_llc_occupancy_enabled()) { + d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL); + if (!d->rmid_busy_llc) + return -ENOMEM; + } + if (resctrl_arch_is_mbm_total_enabled()) { + tsize = sizeof(*d->mbm_total); + d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_total) { + bitmap_free(d->rmid_busy_llc); + return -ENOMEM; + } + } + if (resctrl_arch_is_mbm_local_enabled()) { + tsize = sizeof(*d->mbm_local); + d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_local) { + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_total); + return -ENOMEM; + } + } + + return 0; +} + +int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) +{ + int err = 0; + + mutex_lock(&rdtgroup_mutex); + + if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) { + /* RDT_RESOURCE_MBA is never mon_capable */ + err = mba_sc_domain_allocate(r, d); + goto out_unlock; + } + + if (!r->mon_capable) + goto out_unlock; + + err = domain_setup_mon_state(r, d); + if (err) + goto out_unlock; + + if (resctrl_is_mbm_enabled()) { + INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); + mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); + } + + if (resctrl_arch_is_llc_occupancy_enabled()) + INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo); + + /* + * If the filesystem is not mounted then only the default resource group + * exists. Creation of its directories is deferred until mount time + * by rdt_get_tree() calling mkdir_mondata_all(). + * If resctrl is mounted, add per domain monitor data directories. + */ + if (resctrl_mounted && resctrl_arch_mon_capable()) + mkdir_mondata_subdir_allrdtgrp(r, d); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; +} + +void resctrl_online_cpu(unsigned int cpu) +{ + mutex_lock(&rdtgroup_mutex); + /* The CPU is set in default rdtgroup after online. */ + cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); + mutex_unlock(&rdtgroup_mutex); +} + +static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) +{ + struct rdtgroup *cr; + + list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) { + if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) + break; + } +} + +void resctrl_offline_cpu(unsigned int cpu) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdtgroup *rdtgrp; + struct rdt_domain *d; + + mutex_lock(&rdtgroup_mutex); + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { + clear_childcpus(rdtgrp, cpu); + break; + } + } + + if (!l3->mon_capable) + goto out_unlock; + + d = resctrl_get_domain_from_cpu(cpu, l3); + if (d) { + if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) { + cancel_delayed_work(&d->mbm_over); + mbm_setup_overflow_handler(d, 0, cpu); + } + if (resctrl_arch_is_llc_occupancy_enabled() && + cpu == d->cqm_work_cpu && has_busy_rmid(d)) { + cancel_delayed_work(&d->cqm_limbo); + cqm_setup_limbo_handler(d, 0, cpu); + } + } + +out_unlock: + mutex_unlock(&rdtgroup_mutex); +} + +/* + * resctrl_init - resctrl filesystem initialization + * + * Setup resctrl file system including set up root, create mount point, + * register resctrl filesystem, and initialize files under root directory. + * + * Return: 0 on success or -errno + */ +int resctrl_init(void) +{ + int ret = 0; + + seq_buf_init(&last_cmd_status, last_cmd_status_buf, + sizeof(last_cmd_status_buf)); + + rdtgroup_setup_default(); + + thread_throttle_mode_init(); + + ret = resctrl_mon_resource_init(); + if (ret) + return ret; + + ret = sysfs_create_mount_point(fs_kobj, "resctrl"); + if (ret) + return ret; + + ret = register_filesystem(&rdt_fs_type); + if (ret) + goto cleanup_mountpoint; + + /* + * Adding the resctrl debugfs directory here may not be ideal since + * it would let the resctrl debugfs directory appear on the debugfs + * filesystem before the resctrl filesystem is mounted. + * It may also be ok since that would enable debugging of RDT before + * resctrl is mounted. + * The reason why the debugfs directory is created here and not in + * rdt_get_tree() is because rdt_get_tree() takes rdtgroup_mutex and + * during the debugfs directory creation also &sb->s_type->i_mutex_key + * (the lockdep class of inode->i_rwsem). Other filesystem + * interactions (eg. SyS_getdents) have the lock ordering: + * &sb->s_type->i_mutex_key --> &mm->mmap_lock + * During mmap(), called with &mm->mmap_lock, the rdtgroup_mutex + * is taken, thus creating dependency: + * &mm->mmap_lock --> rdtgroup_mutex for the latter that can cause + * issues considering the other two lock dependencies. + * By creating the debugfs directory here we avoid a dependency + * that may cause deadlock (even though file operations cannot + * occur until the filesystem is mounted, but I do not know how to + * tell lockdep that). + */ + debugfs_resctrl = debugfs_create_dir("resctrl", NULL); + + return 0; + +cleanup_mountpoint: + sysfs_remove_mount_point(fs_kobj, "resctrl"); + + return ret; +} + +void resctrl_exit(void) +{ + debugfs_remove_recursive(debugfs_resctrl); + unregister_filesystem(&rdt_fs_type); + sysfs_remove_mount_point(fs_kobj, "resctrl"); + + resctrl_mon_resource_exit(); +}
From: James Morse james.morse@arm.com
Add code to head.S's el2_setup to detect MPAM and disable any EL2 traps. This register resets to an unknown value, setting it to the default parititons/pmg before we enable the MMU is the best thing to do.
Kexec/kdump will depend on this if the previous kernel left the CPU configured with a restrictive configuration.
If linux is booted at the highest implemented exception level el2_setup will clear the enable bit, disabling MPAM.
This code can't be enabled until a subsequent patch adds the Kconfig and cpufeature boiler plate.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/include/asm/el2_setup.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h index b7afaa026842..1e2181820a0a 100644 --- a/arch/arm64/include/asm/el2_setup.h +++ b/arch/arm64/include/asm/el2_setup.h @@ -208,6 +208,21 @@ msr spsr_el2, x0 .endm
+.macro __init_el2_mpam +#ifdef CONFIG_ARM64_MPAM + /* Memory Partioning And Monitoring: disable EL2 traps */ + mrs x1, id_aa64pfr0_el1 + ubfx x0, x1, #ID_AA64PFR0_EL1_MPAM_SHIFT, #4 + cbz x0, .Lskip_mpam_@ // skip if no MPAM + msr_s SYS_MPAM2_EL2, xzr // use the default partition + // and disable lower traps + mrs_s x0, SYS_MPAMIDR_EL1 + tbz x0, #17, .Lskip_mpam_@ // skip if no MPAMHCR reg + msr_s SYS_MPAMHCR_EL2, xzr // clear TRAP_MPAMIDR_EL1 -> EL2 +.Lskip_mpam_@: +#endif /* CONFIG_ARM64_MPAM */ +.endm + /** * Initialize EL2 registers to sane values. This should be called early on all * cores that were booted in EL2. Note that everything gets initialised as @@ -225,6 +240,7 @@ __init_el2_stage2 __init_el2_gicv3 __init_el2_hstr + __init_el2_mpam __init_el2_nvhe_idregs __init_el2_cptr __init_el2_fgt
From: James Morse james.morse@arm.com
ARMv8.4 adds support for 'Memory Partitioning And Monitoring' (MPAM) which describes an interface to cache and bandwidth controls wherever they appear in the system.
Add support to detect MPAM. Like SVE, MPAM has an extra id register that describes the virtualisation support, which is optional. Detect this separately so we can detect mismatched/insane systems, but still use MPAM on the host even if the virtualisation support is missing.
MPAM needs enabling at the highest implemented exception level, otherwise the register accesses trap. The 'enabled' flag is accessible to lower exception levels, but its in a register that traps when MPAM isn't enabled. The cpufeature 'matches' hook is extended to test this on one of the CPUs, so that firwmare can emulate MPAM as disabled if it is reserved for use by secure world.
(If you have a boot failure that bisects here its likely your CPUs advertise MPAM in the id registers, but firmware failed to either enable or MPAM, or emulate the trap as if it were disabled)
Signed-off-by: James Morse james.morse@arm.com --- .../arch/arm64/cpu-feature-registers.rst | 2 + arch/arm64/Kconfig | 19 ++++- arch/arm64/include/asm/cpu.h | 1 + arch/arm64/include/asm/cpufeature.h | 13 ++++ arch/arm64/include/asm/mpam.h | 76 +++++++++++++++++++ arch/arm64/include/asm/sysreg.h | 8 ++ arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/cpufeature.c | 68 +++++++++++++++++ arch/arm64/kernel/cpuinfo.c | 4 + arch/arm64/kernel/mpam.c | 8 ++ arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/sysreg | 33 ++++++++ 12 files changed, 233 insertions(+), 1 deletion(-) create mode 100644 arch/arm64/include/asm/mpam.h create mode 100644 arch/arm64/kernel/mpam.c
diff --git a/Documentation/arch/arm64/cpu-feature-registers.rst b/Documentation/arch/arm64/cpu-feature-registers.rst index de6d8a4790e2..14ea68bcf196 100644 --- a/Documentation/arch/arm64/cpu-feature-registers.rst +++ b/Documentation/arch/arm64/cpu-feature-registers.rst @@ -152,6 +152,8 @@ infrastructure: +------------------------------+---------+---------+ | DIT | [51-48] | y | +------------------------------+---------+---------+ + | MPAM | [43-40] | n | + +------------------------------+---------+---------+ | SVE | [35-32] | y | +------------------------------+---------+---------+ | GIC | [27-24] | n | diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 8781381bb720..3d838b7a9cdd 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2053,7 +2053,24 @@ config ARM64_TLB_RANGE The feature introduces new assembly instructions, and they were support when binutils >= 2.30.
-endmenu # "ARMv8.4 architectural features" +config ARM64_MPAM + bool "Enable support for MPAM" + help + Memory Partitioning and Monitoring is an optional extension + that allows the CPUs to mark load and store transactions with + labels for partition-id and performance-monitoring-group. + System components, such as the caches, can use the partition-id + to apply a performance policy. MPAM monitors can use the + partition-id and performance-monitoring-group to measure the + cache occupancy or data throughput. + + Use of this extension requires CPU support, support in the + memory system components (MSC), and a description from firmware + of where the MSC are in the address space. + + MPAM is exposed to user-space via the resctrl pseudo filesystem. + +endmenu
menu "ARMv8.5 architectural features"
diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h index e749838b9c5d..1cb5bafd9238 100644 --- a/arch/arm64/include/asm/cpu.h +++ b/arch/arm64/include/asm/cpu.h @@ -47,6 +47,7 @@ struct cpuinfo_arm64 { u64 reg_revidr; u64 reg_gmid; u64 reg_smidr; + u64 reg_mpamidr;
u64 reg_id_aa64dfr0; u64 reg_id_aa64dfr1; diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 5bba39376055..e873848ad9d9 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -619,6 +619,13 @@ static inline bool id_aa64pfr1_sme(u64 pfr1) return val > 0; }
+static inline bool id_aa64pfr0_mpam(u64 pfr0) +{ + u32 val = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT); + + return val > 0; +} + static inline bool id_aa64pfr1_mte(u64 pfr1) { u32 val = cpuid_feature_extract_unsigned_field(pfr1, ID_AA64PFR1_EL1_MTE_SHIFT); @@ -831,6 +838,12 @@ static inline bool system_supports_tlb_range(void) cpus_have_const_cap(ARM64_HAS_TLB_RANGE); }
+static inline bool cpus_support_mpam(void) +{ + return IS_ENABLED(CONFIG_ARM64_MPAM) && + cpus_have_final_cap(ARM64_MPAM); +} + int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt); bool try_emulate_mrs(struct pt_regs *regs, u32 isn);
diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h new file mode 100644 index 000000000000..a4a969be233a --- /dev/null +++ b/arch/arm64/include/asm/mpam.h @@ -0,0 +1,76 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2021 Arm Ltd. */ + +#ifndef __ASM__MPAM_H +#define __ASM__MPAM_H + +#include <linux/bitops.h> +#include <linux/init.h> +#include <linux/jump_label.h> + +#include <asm/cpucaps.h> +#include <asm/cpufeature.h> +#include <asm/sysreg.h> + +/* CPU Registers */ +#define MPAM_SYSREG_EN BIT_ULL(63) +#define MPAM_SYSREG_TRAP_IDR BIT_ULL(58) +#define MPAM_SYSREG_TRAP_MPAM0_EL1 BIT_ULL(49) +#define MPAM_SYSREG_TRAP_MPAM1_EL1 BIT_ULL(48) +#define MPAM_SYSREG_PMG_D GENMASK(47, 40) +#define MPAM_SYSREG_PMG_I GENMASK(39, 32) +#define MPAM_SYSREG_PARTID_D GENMASK(31, 16) +#define MPAM_SYSREG_PARTID_I GENMASK(15, 0) + +#define MPAMIDR_PMG_MAX GENMASK(40, 32) +#define MPAMIDR_PMG_MAX_SHIFT 32 +#define MPAMIDR_PMG_MAX_LEN 8 +#define MPAMIDR_VPMR_MAX GENMASK(20, 18) +#define MPAMIDR_VPMR_MAX_SHIFT 18 +#define MPAMIDR_VPMR_MAX_LEN 3 +#define MPAMIDR_HAS_HCR BIT(17) +#define MPAMIDR_HAS_HCR_SHIFT 17 +#define MPAMIDR_PARTID_MAX GENMASK(15, 0) +#define MPAMIDR_PARTID_MAX_SHIFT 0 +#define MPAMIDR_PARTID_MAX_LEN 15 + +#define MPAMHCR_EL0_VPMEN BIT_ULL(0) +#define MPAMHCR_EL1_VPMEN BIT_ULL(1) +#define MPAMHCR_GSTAPP_PLK BIT_ULL(8) +#define MPAMHCR_TRAP_MPAMIDR BIT_ULL(31) + +/* Properties of the VPM registers */ +#define MPAM_VPM_NUM_REGS 8 +#define MPAM_VPM_PARTID_LEN 16 +#define MPAM_VPM_PARTID_MASK 0xffff +#define MPAM_VPM_REG_LEN 64 +#define MPAM_VPM_PARTIDS_PER_REG (MPAM_VPM_REG_LEN / MPAM_VPM_PARTID_LEN) +#define MPAM_VPM_MAX_PARTID (MPAM_VPM_NUM_REGS * MPAM_VPM_PARTIDS_PER_REG) + + +DECLARE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); + +/* check whether all CPUs have MPAM support */ +static inline bool mpam_cpus_have_feature(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + return cpus_have_final_cap(ARM64_MPAM); + return false; +} + +/* check whether all CPUs have MPAM virtualisation support */ +static inline bool mpam_cpus_have_mpam_hcr(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + return static_branch_unlikely(&arm64_mpam_has_hcr); + return false; +} + +/* enable MPAM virtualisation support */ +static inline void __init __enable_mpam_hcr(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + static_branch_enable(&arm64_mpam_has_hcr); +} + +#endif /* __ASM__MPAM_H */ diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index 38296579a4fd..94633246d311 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ -515,6 +515,13 @@ #define SYS_MAIR_EL2 sys_reg(3, 4, 10, 2, 0) #define SYS_AMAIR_EL2 sys_reg(3, 4, 10, 3, 0)
+#define SYS_MPAMHCR_EL2 sys_reg(3, 4, 10, 4, 0) +#define SYS_MPAMVPMV_EL2 sys_reg(3, 4, 10, 4, 1) +#define SYS_MPAM2_EL2 sys_reg(3, 4, 10, 5, 0) + +#define __VPMn_op2(n) ((n) & 0x7) +#define SYS_MPAM_VPMn_EL2(n) sys_reg(3, 4, 10, 6, __VPMn_op2(n)) + #define SYS_VBAR_EL2 sys_reg(3, 4, 12, 0, 0) #define SYS_RVBAR_EL2 sys_reg(3, 4, 12, 0, 1) #define SYS_RMR_EL2 sys_reg(3, 4, 12, 0, 2) @@ -579,6 +586,7 @@ #define SYS_TFSR_EL12 sys_reg(3, 5, 5, 6, 0) #define SYS_MAIR_EL12 sys_reg(3, 5, 10, 2, 0) #define SYS_AMAIR_EL12 sys_reg(3, 5, 10, 3, 0) +#define SYS_MPAM1_EL12 sys_reg(3, 5, 10, 5, 0) #define SYS_VBAR_EL12 sys_reg(3, 5, 12, 0, 0) #define SYS_CNTKCTL_EL12 sys_reg(3, 5, 14, 1, 0) #define SYS_CNTP_TVAL_EL02 sys_reg(3, 5, 14, 2, 0) diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 187685f57980..b36e9ff00bf6 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -73,6 +73,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o obj-$(CONFIG_SDEI_WATCHDOG) += watchdog_sdei.o obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o +obj-$(CONFIG_ARM64_MPAM) += mpam.o obj-$(CONFIG_ARM64_MTE) += mte.o obj-y += vdso-wrap.o obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 455e72ce080a..9a2ff5eb1335 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -84,6 +84,7 @@ #include <asm/insn.h> #include <asm/kvm_host.h> #include <asm/mmu_context.h> +#include <asm/mpam.h> #include <asm/mte.h> #include <asm/processor.h> #include <asm/smp.h> @@ -623,6 +624,18 @@ static const struct arm64_ftr_bits ftr_smcr[] = { ARM64_FTR_END, };
+static const struct arm64_ftr_bits ftr_mpamidr[] = { + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_PMG_MAX_SHIFT, MPAMIDR_PMG_MAX_LEN, 0), /* PMG_MAX */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_VPMR_MAX_SHIFT, MPAMIDR_VPMR_MAX_LEN, 0), /* VPMR_MAX */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, + MPAMIDR_HAS_HCR_SHIFT, 1, 0), /* HAS_HCR */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_PARTID_MAX_SHIFT, MPAMIDR_PARTID_MAX_LEN, 0), /* PARTID_MAX */ + ARM64_FTR_END, +}; + /* * Common ftr bits for a 32bit register with all hidden, strict * attributes, with 4bit feature fields and a default safe value of @@ -739,6 +752,9 @@ static const struct __ftr_reg_entry { ARM64_FTR_REG(SYS_ZCR_EL1, ftr_zcr), ARM64_FTR_REG(SYS_SMCR_EL1, ftr_smcr),
+ /* Op1 = 0, CRn = 10, CRm = 4 */ + ARM64_FTR_REG(SYS_MPAMIDR_EL1, ftr_mpamidr), + /* Op1 = 1, CRn = 0, CRm = 0 */ ARM64_FTR_REG(SYS_GMID_EL1, ftr_gmid),
@@ -1058,6 +1074,9 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info) vec_init_vq_map(ARM64_VEC_SME); }
+ if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) + init_cpu_ftr_reg(SYS_MPAMIDR_EL1, info->reg_mpamidr); + if (id_aa64pfr1_mte(info->reg_id_aa64pfr1)) init_cpu_ftr_reg(SYS_GMID_EL1, info->reg_gmid);
@@ -1317,6 +1336,11 @@ void update_cpu_features(int cpu, vec_update_vq_map(ARM64_VEC_SME); }
+ if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) { + taint |= check_update_ftr_reg(SYS_MPAMIDR_EL1, cpu, + info->reg_mpamidr, boot->reg_mpamidr); + } + /* * The kernel uses the LDGM/STGM instructions and the number of tags * they read/write depends on the GMID_EL1.BS field. Check that the @@ -2242,6 +2266,39 @@ cpucap_panic_on_conflict(const struct arm64_cpu_capabilities *cap) return !!(cap->type & ARM64_CPUCAP_PANIC_ON_CONFLICT); }
+static bool __maybe_unused +test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope) +{ + if (!has_cpuid_feature(entry, scope)) + return false; + + /* Check firmware actually enabled MPAM on this cpu. */ + return (read_sysreg_s(SYS_MPAM1_EL1) & MPAM_SYSREG_EN); +} + +static void __maybe_unused +cpu_enable_mpam(const struct arm64_cpu_capabilities *entry) +{ + /* + * Access by the kernel (at EL1) should use the reserved PARTID + * which is configured unrestricted. This avoids priority-inversion + * where latency sensitive tasks have to wait for a task that has + * been throttled to release the lock. + */ + write_sysreg_s(0, SYS_MPAM1_EL1); +} + +static void mpam_extra_caps(void) +{ + u64 idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); + + if (!IS_ENABLED(CONFIG_ARM64_MPAM)) + return; + + if (idr & MPAMIDR_HAS_HCR) + __enable_mpam_hcr(); +} + static const struct arm64_cpu_capabilities arm64_features[] = { { .capability = ARM64_ALWAYS_BOOT, @@ -2722,6 +2779,16 @@ static const struct arm64_cpu_capabilities arm64_features[] = { .matches = has_cpuid_feature, ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, EVT, IMP) }, +#ifdef CONFIG_ARM64_MPAM + { + .desc = "Memory Partitioning And Monitoring", + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .capability = ARM64_MPAM, + .matches = test_has_mpam, + .cpu_enable = cpu_enable_mpam, + ARM64_CPUID_FIELDS(ID_AA64PFR0_EL1, MPAM, 1) + }, +#endif {}, };
@@ -3361,6 +3428,7 @@ void __init setup_cpu_features(void) sve_setup(); sme_setup(); minsigstksz_setup(); + mpam_extra_caps();
/* * Check for sane CTR_EL0.CWG value. diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c index 49f2627e9d9a..eecedf5b67fa 100644 --- a/arch/arm64/kernel/cpuinfo.c +++ b/arch/arm64/kernel/cpuinfo.c @@ -460,6 +460,10 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info) if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) __cpuinfo_store_cpu_32bit(&info->aarch32);
+ if (IS_ENABLED(CONFIG_ARM64_MPAM) && + id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) + info->reg_mpamidr = read_cpuid(MPAMIDR_EL1); + cpuinfo_detect_icache_policy(info); }
diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c new file mode 100644 index 000000000000..ff29b666e025 --- /dev/null +++ b/arch/arm64/kernel/mpam.c @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2021 Arm Ltd. */ + +#include <asm/mpam.h> + +#include <linux/jump_label.h> + +DEFINE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps index 569ecec76c16..c60a603d2cfe 100644 --- a/arch/arm64/tools/cpucaps +++ b/arch/arm64/tools/cpucaps @@ -56,6 +56,7 @@ HW_DBM KVM_HVHE KVM_PROTECTED_MODE MISMATCHED_CACHE_TYPE +MPAM MTE MTE_ASYMM SME diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 76ce150e7347..0e7d7f327410 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -2530,6 +2530,22 @@ Res0 1 Field 0 EN EndSysreg
+Sysreg MPAMIDR_EL1 3 0 10 4 4 +Res0 63:62 +Field 61 HAS_SDEFLT +Field 60 HAS_FORCE_NS +Field 59 SP4 +Field 58 HAS_TIDR +Field 57 HAS_ALTSP +Res0 56:40 +Field 39:32 PMG_MAX +Res0 31:21 +Field 20:18 VPMR_MAX +Field 17 HAS_HCR +Res0 16 +Field 15:0 PARTID_MAX +EndSysreg + Sysreg LORID_EL1 3 0 10 4 7 Res0 63:24 Field 23:16 LD @@ -2537,6 +2553,22 @@ Res0 15:8 Field 7:0 LR EndSysreg
+Sysreg MPAM1_EL1 3 0 10 5 0 +Res0 63:48 +Field 47:40 PMG_D +Field 39:32 PMG_I +Field 31:16 PARTID_D +Field 15:0 PARTID_I +EndSysreg + +Sysreg MPAM0_EL1 3 0 10 5 1 +Res0 63:48 +Field 47:40 PMG_D +Field 39:32 PMG_I +Field 31:16 PARTID_D +Field 15:0 PARTID_I +EndSysreg + Sysreg ISR_EL1 3 0 12 1 0 Res0 63:11 Field 10 IS @@ -2550,6 +2582,7 @@ EndSysreg Sysreg ICC_NMIAR1_EL1 3 0 12 9 5 Res0 63:24 Field 23:0 INTID + EndSysreg
Sysreg TRBLIMITR_EL1 3 0 9 11 0
From: James Morse james.morse@arm.com
commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests, but didn't add trap handling.
If you are unlucky, this results in an MPAM aware guest being delivered an undef during boot. The host prints: | kvm [97]: Unsupported guest sys_reg access at: ffff800080024c64 [00000005] | { Op0( 3), Op1( 0), CRn(10), CRm( 5), Op2( 0), func_read },
Which results in: | Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP | Modules linked in: | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc7-00559-gd89c186d50b2 #14616 | Hardware name: linux,dummy-virt (DT) | pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : test_has_mpam+0x18/0x30 | lr : test_has_mpam+0x10/0x30 | sp : ffff80008000bd90 ... | Call trace: | test_has_mpam+0x18/0x30 | update_cpu_capabilities+0x7c/0x11c | setup_cpu_features+0x14/0xd8 | smp_cpus_done+0x24/0xb8 | smp_init+0x7c/0x8c | kernel_init_freeable+0xf8/0x280 | kernel_init+0x24/0x1e0 | ret_from_fork+0x10/0x20 | Code: 910003fd 97ffffde 72001c00 54000080 (d538a500) | ---[ end trace 0000000000000000 ]--- | Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b | ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
Add the support to enable the traps, and handle the three guest accessible registers as RAZ/WI. This allows guests to keep the invariant id-register value, while advertising that MPAM isn't really supported.
With MPAM v1.0 we can trap the MPAMIDR_EL1 register only if ARM64_HAS_MPAM_HCR, with v1.1 an additional MPAM2_EL2.TIDR bit traps MPAMIDR_EL1 on platforms that don't have MPAMHCR_EL2. Enable one of these if either is supported. If neither is supported, the guest can discover that the CPU has MPAM support, and how many PARTID etc the host has ... but it can't influence anything, so its harmless.
Full support for the feature would only expose MPAM to the guest if a psuedo-device has been created to describe the virt->phys partid mapping the VMM expects. This will depend on ARM64_HAS_MPAM_HCR.
Fixes: 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register") CC: Anshuman Khandual anshuman.khandual@arm.com Link: https://lore.kernel.org/linux-arm-kernel/20200925160102.118858-1-james.morse... Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/include/asm/kvm_arm.h | 1 + arch/arm64/include/asm/mpam.h | 4 ++-- arch/arm64/kernel/image-vars.h | 5 ++++ arch/arm64/kvm/hyp/include/hyp/switch.h | 32 +++++++++++++++++++++++++ arch/arm64/kvm/sys_regs.c | 20 ++++++++++++++++ 5 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index 1095c6647e96..2d8b243a86cd 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -104,6 +104,7 @@
#define HCRX_GUEST_FLAGS (HCRX_EL2_SMPME | HCRX_EL2_TCR2En) #define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En) +#define MPAMHCR_HOST_FLAGS 0
/* TCR_EL2 Registers bits */ #define TCR_EL2_RES1 ((1U << 31) | (1 << 23)) diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index a4a969be233a..576102d510ad 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -51,7 +51,7 @@ DECLARE_STATIC_KEY_FALSE(arm64_mpam_has_hcr);
/* check whether all CPUs have MPAM support */ -static inline bool mpam_cpus_have_feature(void) +static __always_inline bool mpam_cpus_have_feature(void) { if (IS_ENABLED(CONFIG_ARM64_MPAM)) return cpus_have_final_cap(ARM64_MPAM); @@ -59,7 +59,7 @@ static inline bool mpam_cpus_have_feature(void) }
/* check whether all CPUs have MPAM virtualisation support */ -static inline bool mpam_cpus_have_mpam_hcr(void) +static __always_inline bool mpam_cpus_have_mpam_hcr(void) { if (IS_ENABLED(CONFIG_ARM64_MPAM)) return static_branch_unlikely(&arm64_mpam_has_hcr); diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h index 35f3c7959513..d10d3fed31d9 100644 --- a/arch/arm64/kernel/image-vars.h +++ b/arch/arm64/kernel/image-vars.h @@ -64,6 +64,11 @@ KVM_NVHE_ALIAS(nvhe_hyp_panic_handler); /* Vectors installed by hyp-init on reset HVC. */ KVM_NVHE_ALIAS(__hyp_stub_vectors);
+/* Additional static keys for cpufeatures */ +#ifdef CONFIG_ARM64_MPAM +KVM_NVHE_ALIAS(arm64_mpam_has_hcr); +#endif + /* Static keys which are set if a vGIC trap should be handled in hyp. */ KVM_NVHE_ALIAS(vgic_v2_cpuif_trap); KVM_NVHE_ALIAS(vgic_v3_cpuif_trap); diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index 9cfe6bd1dbe4..657320f453e6 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -27,6 +27,7 @@ #include <asm/kvm_hyp.h> #include <asm/kvm_mmu.h> #include <asm/kvm_nested.h> +#include <asm/mpam.h> #include <asm/fpsimd.h> #include <asm/debug-monitors.h> #include <asm/processor.h> @@ -172,6 +173,35 @@ static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu) write_sysreg_s(ctxt_sys_reg(hctxt, HDFGWTR_EL2), SYS_HDFGWTR_EL2); }
+static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu) +{ + u64 r = MPAM_SYSREG_TRAP_MPAM0_EL1 | MPAM_SYSREG_TRAP_MPAM1_EL1; + + if (!mpam_cpus_have_feature()) + return; + + /* trap guest access to MPAMIDR_EL1 */ + if (mpam_cpus_have_mpam_hcr()) { + write_sysreg_s(MPAMHCR_TRAP_MPAMIDR, SYS_MPAMHCR_EL2); + } else { + /* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */ + r |= MPAM_SYSREG_TRAP_IDR; + } + + write_sysreg_s(r, SYS_MPAM2_EL2); +} + +static inline void __deactivate_traps_mpam(void) +{ + if (!mpam_cpus_have_feature()) + return; + + write_sysreg_s(0, SYS_MPAM2_EL2); + + if (mpam_cpus_have_mpam_hcr()) + write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2); +} + static inline void __activate_traps_common(struct kvm_vcpu *vcpu) { /* Trap on AArch32 cp15 c15 (impdef sysregs) accesses (EL1 or EL0) */ @@ -212,6 +242,7 @@ static inline void __activate_traps_common(struct kvm_vcpu *vcpu) }
__activate_traps_hfgxtr(vcpu); + __activate_traps_mpam(vcpu); }
static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) @@ -231,6 +262,7 @@ static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) write_sysreg_s(HCRX_HOST_FLAGS, SYS_HCRX_EL2);
__deactivate_traps_hfgxtr(vcpu); + __deactivate_traps_mpam(); }
static inline void ___activate_traps(struct kvm_vcpu *vcpu) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 0afd6136e275..0a6e5386107e 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -411,6 +411,23 @@ static bool trap_oslar_el1(struct kvm_vcpu *vcpu, return true; }
+static bool workaround_bad_mpam_abi(struct kvm_vcpu *vcpu, + struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + /* + * The ID register can't be removed without breaking migration, + * but MPAMIDR_EL1 can advertise all-zeroes, indicating there are zero + * PARTID/PMG supported by the CPU, allowing the other two trapped + * registers (MPAM1_EL1 and MPAM0_EL1) to be treated as RAZ/WI. + * Emulating MPAM1_EL1 as RAZ/WI means the guest sees the MPAMEN bit + * as clear, and realises MPAM isn't usable on this CPU. + */ + p->regval = 0; + + return true; +} + static bool trap_oslsr_el1(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) @@ -2130,8 +2147,11 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_LOREA_EL1), trap_loregion }, { SYS_DESC(SYS_LORN_EL1), trap_loregion }, { SYS_DESC(SYS_LORC_EL1), trap_loregion }, + { SYS_DESC(SYS_MPAMIDR_EL1), workaround_bad_mpam_abi }, { SYS_DESC(SYS_LORID_EL1), trap_loregion },
+ { SYS_DESC(SYS_MPAM1_EL1), workaround_bad_mpam_abi }, + { SYS_DESC(SYS_MPAM0_EL1), workaround_bad_mpam_abi }, { SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 }, { SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 },
From: James Morse james.morse@arm.com
Currently KVM only allows writeable ID registers to be downgraded in the 'safe' direction, as determined by the cpufeature 'lower safe' flags.
commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests, but didn't add trap handling. A previous patch supplied the missing trap handling.
Existing VMs that have the MPAM field of AA64PFR0_EL1 need to be migratable, but there is little point enabling the MPAM CPU interface on new VMs until there is something a guest can do with it.
Clear the MPAM field from the guest's AA64PFR0_EL1 by default, but allow user-space to set it again if the host supports MPAM. Add a helper to return the maximum permitted value for an ID register. For most this is the reset value. To allow the MPAM field to be written as supported, check if the host sanitised value is '1' and upgrade the reset value.
Finally, change the trap handling to inject an undef if MPAM was not advertised to the guest.
Full support will depend on an psuedo-device being created that describes the virt->phys PARTID mapping the VMM expects. Migration would be expected to fail if this psuedo-device can't be created on the remote end. This ID bit isn't needed to block migration.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/kvm/sys_regs.c | 74 +++++++++++++++++++++++++++++++-------- 1 file changed, 60 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 0a6e5386107e..9424fa7351bf 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -411,21 +411,29 @@ static bool trap_oslar_el1(struct kvm_vcpu *vcpu, return true; }
-static bool workaround_bad_mpam_abi(struct kvm_vcpu *vcpu, - struct sys_reg_params *p, - const struct sys_reg_desc *r) +static bool trap_mpam(struct kvm_vcpu *vcpu, + struct sys_reg_params *p, + const struct sys_reg_desc *r) { + u64 aa64pfr0_el1 = IDREG(vcpu->kvm, SYS_ID_AA64PFR0_EL1); + /* - * The ID register can't be removed without breaking migration, - * but MPAMIDR_EL1 can advertise all-zeroes, indicating there are zero - * PARTID/PMG supported by the CPU, allowing the other two trapped - * registers (MPAM1_EL1 and MPAM0_EL1) to be treated as RAZ/WI. + * What did we expose to the guest? + * Earlier guests may have seen the ID bits, which can't be removed + * without breaking migration, but MPAMIDR_EL1 can advertise all-zeroes, + * indicating there are zero PARTID/PMG supported by the CPU, allowing + * the other two trapped registers (MPAM1_EL1 and MPAM0_EL1) to be + * treated as RAZ/WI. * Emulating MPAM1_EL1 as RAZ/WI means the guest sees the MPAMEN bit * as clear, and realises MPAM isn't usable on this CPU. */ - p->regval = 0; + if (FIELD_GET(ID_AA64PFR0_EL1_MPAM_MASK, aa64pfr0_el1)) { + p->regval = 0; + return true; + }
- return true; + kvm_inject_undefined(vcpu); + return false; }
static bool trap_oslsr_el1(struct kvm_vcpu *vcpu, @@ -1245,6 +1253,36 @@ static s64 kvm_arm64_ftr_safe_value(u32 id, const struct arm64_ftr_bits *ftrp, return arm64_ftr_safe_value(&kvm_ftr, new, cur); }
+static u64 kvm_arm64_ftr_max(struct kvm_vcpu *vcpu, + const struct sys_reg_desc *rd) +{ + u64 pfr0, val = rd->reset(vcpu, rd); + u32 field, id = reg_to_encoding(rd); + + /* + * Some values may reset to a lower value than can be supported, + * get the maximum feature value. + */ + switch (id) { + case SYS_ID_AA64PFR0_EL1: + pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); + + /* + * MPAM resets to 0, but migration of MPAM=1 guests is needed. + * See trap_mpam() for more. + */ + field = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT); + if (field == ID_AA64PFR0_EL1_MPAM_1) { + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; + val |= FIELD_PREP(ID_AA64PFR0_EL1_MPAM_MASK, ID_AA64PFR0_EL1_MPAM_1); + } + + break; + } + + return val; +} + /** * arm64_check_features() - Check if a feature register value constitutes * a subset of features indicated by the idreg's KVM sanitised limit. @@ -1265,8 +1303,7 @@ static int arm64_check_features(struct kvm_vcpu *vcpu, const struct arm64_ftr_bits *ftrp = NULL; u32 id = reg_to_encoding(rd); u64 writable_mask = rd->val; - u64 limit = rd->reset(vcpu, rd); - u64 mask = 0; + u64 limit, mask = 0;
/* * Hidden and unallocated ID registers may not have a corresponding @@ -1280,6 +1317,7 @@ static int arm64_check_features(struct kvm_vcpu *vcpu, if (!ftr_reg) return -EINVAL;
+ limit = kvm_arm64_ftr_max(vcpu, rd); ftrp = ftr_reg->ftr_bits;
for (; ftrp && ftrp->width; ftrp++) { @@ -1483,6 +1521,14 @@ static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
val &= ~ID_AA64PFR0_EL1_AMU_MASK;
+ /* + * MPAM is disabled by default as KVM also needs a set of PARTID to + * program the MPAMVPMx_EL2 PARTID remapping registers with. But some + * older kernels let the guest see the ID bit. Turning it on causes + * the registers to be emulated as RAZ/WI. See trap_mpam() for more. + */ + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; + return val; }
@@ -2147,11 +2193,11 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_LOREA_EL1), trap_loregion }, { SYS_DESC(SYS_LORN_EL1), trap_loregion }, { SYS_DESC(SYS_LORC_EL1), trap_loregion }, - { SYS_DESC(SYS_MPAMIDR_EL1), workaround_bad_mpam_abi }, + { SYS_DESC(SYS_MPAMIDR_EL1), trap_mpam }, { SYS_DESC(SYS_LORID_EL1), trap_loregion },
- { SYS_DESC(SYS_MPAM1_EL1), workaround_bad_mpam_abi }, - { SYS_DESC(SYS_MPAM0_EL1), workaround_bad_mpam_abi }, + { SYS_DESC(SYS_MPAM1_EL1), trap_mpam }, + { SYS_DESC(SYS_MPAM0_EL1), trap_mpam }, { SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 }, { SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 },
From: James Morse james.morse@arm.com
MPAM has a system register that is used to hold the partid and pmg values that traffic generated by EL0 will use. This can be set per-task by the resctrl file system.
Add a helper to switch this. resctrl expects a 'default' value to be used in preference if the default partid and pmg are selected.
struct task_struct's separate closid and rmid fields are insufficient to implement resctrl using MPAM, as resctrl can change the partid (closid) and pmg (sort of like the rmid) separately. On x86, the rmid is an independent number, so a race that writes a mismatched closid and rmid into hardware is benign. On arm64, the pmg bits extend the partid. (i.e. partid-5 has a pmg-0 that is not the same as partid-6's pmg-0). In this case, mismatching the values will 'dirty' a pmg value that resctrl believes is clean, and is not tracking with its 'limbo' code.
To avoid this, the partid and pmg are always read and written as a pair. Instead of making struct task_struct's closid and rmid fields an endian-unsafe union, add the value to struct thread_info and always use READ_ONCE()/WRITE_ONCE() when accessing this field.
CC: Amit Singh Tomar amitsinght@marvell.com Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/include/asm/mpam.h | 46 ++++++++++++++++++++++++++++ arch/arm64/include/asm/thread_info.h | 3 ++ arch/arm64/kernel/mpam.c | 5 +++ arch/arm64/kernel/process.c | 7 +++++ 4 files changed, 61 insertions(+)
diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index 576102d510ad..1d81a6f26acd 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -7,6 +7,8 @@ #include <linux/bitops.h> #include <linux/init.h> #include <linux/jump_label.h> +#include <linux/percpu.h> +#include <linux/sched.h>
#include <asm/cpucaps.h> #include <asm/cpufeature.h> @@ -49,6 +51,9 @@
DECLARE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); +DECLARE_STATIC_KEY_FALSE(mpam_enabled); +DECLARE_PER_CPU(u64, arm64_mpam_default); +DECLARE_PER_CPU(u64, arm64_mpam_current);
/* check whether all CPUs have MPAM support */ static __always_inline bool mpam_cpus_have_feature(void) @@ -73,4 +78,45 @@ static inline void __init __enable_mpam_hcr(void) static_branch_enable(&arm64_mpam_has_hcr); }
+/* + * The resctrl filesystem writes to the partid/pmg values for threads and CPUs, + * which may race with reads in __mpam_sched_in(). Ensure only one of the old + * or new values are used. Particular care should be taken with the pmg field + * as __mpam_sched_in() may read a partid and pmg that don't match, causing + * this value to be stored with cache allocations, despite being considered + * 'free' by resctrl. + * + * A value in struct thread_info is used instead of struct task_struct as the + * cpu's u64 register format is used, but struct task_struct has two u32'. + */ +static inline u64 mpam_get_regval(struct task_struct *tsk) +{ +#ifdef CONFIG_ARM64_MPAM + return READ_ONCE(task_thread_info(tsk)->mpam_partid_pmg); +#else + return 0; +#endif +} + +static inline void mpam_thread_switch(struct task_struct *tsk) +{ + u64 oldregval; + int cpu = smp_processor_id(); + u64 regval = mpam_get_regval(tsk); + + if (!IS_ENABLED(CONFIG_ARM64_MPAM) || + !static_branch_likely(&mpam_enabled)) + return; + + if (!regval) + regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu)); + + oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); + if (oldregval == regval) + return; + + /* Synchronising this write is left until the ERET to EL0 */ + write_sysreg_s(regval, SYS_MPAM0_EL1); + WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval); +} #endif /* __ASM__MPAM_H */ diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 889a7a43792d..a2596f942500 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -41,6 +41,9 @@ struct thread_info { #ifdef CONFIG_SHADOW_CALL_STACK void *scs_base; void *scs_sp; +#endif +#ifdef CONFIG_ARM64_MPAM + u64 mpam_partid_pmg; #endif u32 cpu; }; diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c index ff29b666e025..894ca52549b8 100644 --- a/arch/arm64/kernel/mpam.c +++ b/arch/arm64/kernel/mpam.c @@ -4,5 +4,10 @@ #include <asm/mpam.h>
#include <linux/jump_label.h> +#include <linux/percpu.h>
DEFINE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); +DEFINE_STATIC_KEY_FALSE(mpam_enabled); +DEFINE_PER_CPU(u64, arm64_mpam_default); +DEFINE_PER_CPU(u64, arm64_mpam_current); + diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 489810e3a7a1..068e5bb2661b 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -48,6 +48,7 @@ #include <asm/exec.h> #include <asm/fpsimd.h> #include <asm/mmu_context.h> +#include <asm/mpam.h> #include <asm/mte.h> #include <asm/processor.h> #include <asm/pointer_auth.h> @@ -551,6 +552,12 @@ struct task_struct *__switch_to(struct task_struct *prev, if (prev->thread.sctlr_user != next->thread.sctlr_user) update_sctlr_el1(next->thread.sctlr_user);
+ /* + * MPAM thread switch happens after the DSB to ensure prev's accesses + * use prev's MPAM settings. + */ + mpam_thread_switch(next); + /* the actual thread switch */ last = cpu_switch_to(prev, next);
From: James Morse james.morse@arm.com
While we trap the guest's attempts to read/write the MPAM control registers, these remain in effect. guest-EL0 uses KVM's user-space's configuration, as the value is left in the register, and guest-EL1 uses either the host kernel's configuration, or in the case of VHE, the unknown reset value of MPAM1_EL1.
On nVHE systems, EL2 continues to use partid-0 for world-switch, even when the host may have configured its kernel threads to use a different partid. 0 may have been assigned to another task.
We want to force the guest-EL1 to use KVM's user-space's MPAM configuration.
On a nVHE system, copy the EL1 MPAM register to EL2. This ensures world-switch uses the same partid as the kernel thread does on the host.
When loading the guests EL1 registers, copy the VMM's EL0 partid to the EL1 register. When restoring the hosts registers, the partid previously copied to EL2 can be used to restore EL1.
For VHE systems, we can skip restoring the EL1 register for the host, as it is out-of-context once HCR_EL2.TGE is set.
This is done outside the usual sysreg save/restore as the values can change behind KVMs back, so should not be stored in the guest context.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 27 ++++++++++++++++++++++ arch/arm64/kvm/hyp/nvhe/switch.c | 11 +++++++++ arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 1 + 3 files changed, 39 insertions(+)
diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h index bb6b571ec627..d6cfb3dc7f7c 100644 --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h @@ -15,6 +15,7 @@ #include <asm/kvm_emulate.h> #include <asm/kvm_hyp.h> #include <asm/kvm_mmu.h> +#include <asm/mpam.h>
static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt) { @@ -243,4 +244,30 @@ static inline void __sysreg32_restore_state(struct kvm_vcpu *vcpu) write_sysreg(__vcpu_sys_reg(vcpu, DBGVCR32_EL2), dbgvcr32_el2); }
+/* + * The _EL0 value was written by the host's context switch, copy this into the + * guest's EL1. + */ +static inline void __mpam_guest_load(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) + write_sysreg_el1(read_sysreg_s(SYS_MPAM0_EL1), SYS_MPAM1); +} + +/* + * Copy the _EL2 register back to _EL1, clearing any trap bits EL2 may have set. + * nVHE world-switch copies the _EL1 register to _EL2. A VHE host writes to the + * _EL2 register as it is aliased by the hardware when TGE is set. + */ +static inline void __mpam_guest_put(void) +{ + u64 val, mask = MPAM_SYSREG_PMG_D | MPAM_SYSREG_PMG_I | + MPAM_SYSREG_PARTID_D | MPAM_SYSREG_PARTID_I; + + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) { + val = FIELD_GET(mask, read_sysreg_s(SYS_MPAM2_EL2)); + write_sysreg_el1(val, SYS_MPAM1); + } +} + #endif /* __ARM64_KVM_HYP_SYSREG_SR_H__ */ diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index c353a06ee7e6..c2118f658e22 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -242,6 +242,13 @@ static void early_exit_filter(struct kvm_vcpu *vcpu, u64 *exit_code) } }
+/* Use the host thread's partid and pmg for world switch */ +static void __mpam_copy_el1_to_el2(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) + write_sysreg_s(read_sysreg_s(SYS_MPAM1_EL1), SYS_MPAM2_EL2); +} + /* Switch to the guest for legacy non-VHE systems */ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) { @@ -251,6 +258,8 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) bool pmu_switch_needed; u64 exit_code;
+ __mpam_copy_el1_to_el2(); + /* * Having IRQs masked via PMR when entering the guest means the GIC * will not signal the CPU of interrupts of lower priority, and the @@ -310,6 +319,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) __timer_enable_traps(vcpu);
__debug_switch_to_guest(vcpu); + __mpam_guest_load();
do { /* Jump in the fire! */ @@ -320,6 +330,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
__sysreg_save_state_nvhe(guest_ctxt); __sysreg32_save_state(vcpu); + __mpam_guest_put(); __timer_disable_traps(vcpu); __hyp_vgic_save_state(vcpu);
diff --git a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c index b35a178e7e0d..6b407cd3230d 100644 --- a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c +++ b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c @@ -90,6 +90,7 @@ void kvm_vcpu_load_sysregs_vhe(struct kvm_vcpu *vcpu) __sysreg32_restore_state(vcpu); __sysreg_restore_user_state(guest_ctxt); __sysreg_restore_el1_state(guest_ctxt); + __mpam_guest_load();
vcpu_set_flag(vcpu, SYSREGS_ON_CPU);
From: James Morse james.morse@arm.com
The PPTT describes CPUs and caches, as well as processor containers.
To enable PPI partitions, the irqchip driver needs to know how many partitions the platform has, and which CPUs belong to which partition. When a percpu interrupt is registered, the partition is provided to allow a different driver to request the same percpu interrupt intid, one per partition.
The acpi_id of the Processor Container is the natural way to do this with ACPI, but the DSDT AML interpreter is not available early enough for the irqchipi driver. Fortunately, the same information can be described in the PPTT.
Add a helper to count the number or Processor Containers in the PPTT. This is structured as a walker/callback as the irqchip driver will also use this to configure each partition.
Only Processor entries in the PPTT that have a valid acpi id are considered as containers. To identify a particular Processor Container, it must have an id.
Signed-off-by: James Morse james.morse@arm.com --- drivers/acpi/pptt.c | 58 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+)
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index a35dd0e41c27..f3279bcb78ff 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -21,6 +21,8 @@ #include <linux/cacheinfo.h> #include <acpi/processor.h>
+typedef int (*acpi_pptt_cpu_callback_t)(struct acpi_pptt_processor *, void *); + static struct acpi_subtable_header *fetch_pptt_subtable(struct acpi_table_header *table_hdr, u32 pptt_ref) { @@ -293,6 +295,62 @@ static struct acpi_pptt_processor *acpi_find_processor_node(struct acpi_table_he return NULL; }
+/** + * acpi_pptt_for_each_container() - Iterate over all processor containers + * + * Not all 'Processor' entries in the PPTT are either a CPU or a Processor + * Container, they may exist purely to describe a Private resource. CPUs + * have to be leaves, so a Processor Container is a non-leaf that has the + * 'ACPI Processor ID valid' flag set. + * + * Return: 0 for a complete walk, or the first non-zero value from the callback + * that stopped the walk. + */ +int acpi_pptt_for_each_container(acpi_pptt_cpu_callback_t callback, void *arg) +{ + struct acpi_pptt_processor *cpu_node; + struct acpi_table_header *table_hdr; + struct acpi_subtable_header *entry; + bool leaf_flag, has_leaf_flag = false; + unsigned long table_end; + acpi_status status; + u32 proc_sz; + int ret = 0; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table_hdr); + if (ACPI_FAILURE(status)) + return 0; + + if (table_hdr->revision > 1) + has_leaf_flag = true; + + table_end = (unsigned long)table_hdr + table_hdr->length; + entry = ACPI_ADD_PTR(struct acpi_subtable_header, table_hdr, + sizeof(struct acpi_table_pptt)); + proc_sz = sizeof(struct acpi_pptt_processor); + while ((unsigned long)entry + proc_sz < table_end) { + cpu_node = (struct acpi_pptt_processor *)entry; + if (entry->type == ACPI_PPTT_TYPE_PROCESSOR && + cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID) + { + leaf_flag = cpu_node->flags & ACPI_PPTT_ACPI_LEAF_NODE; + if ((has_leaf_flag && !leaf_flag) || + (!has_leaf_flag && !acpi_pptt_leaf_node(table_hdr, cpu_node))) + { + ret = callback(cpu_node, arg); + if (ret) + break; + } + } + entry = ACPI_ADD_PTR(struct acpi_subtable_header, entry, + entry->length); + } + + acpi_put_table(table_hdr); + + return ret; +} + static u8 acpi_cache_type(enum cache_type type) { switch (type) {
From: James Morse james.morse@arm.com
The MPAM table identifies caches by id, but the driver also wants to know the cache level, without having to wait for whichever core has that cache to come online.
Add a helper that walks every possible cache, until it finds the one identified by id, then return the level.
Signed-off-by: James Morse james.morse@arm.com --- drivers/acpi/pptt.c | 74 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/acpi.h | 5 +++ 2 files changed, 79 insertions(+)
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index f3279bcb78ff..6c1f668d63ba 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -870,3 +870,77 @@ int find_acpi_cpu_topology_hetero_id(unsigned int cpu) return find_acpi_cpu_topology_tag(cpu, PPTT_ABORT_PACKAGE, ACPI_PPTT_ACPI_IDENTICAL); } + + +/** + * find_acpi_cache_level_from_id() - Get the level of the specified cache + * @cache_id: The id field of the unified cache + * + * Determine the level relative to any CPU for the unified cache identified by + * cache_id. This allows the property to be found even if the CPUs are offline. + * + * The returned level can be used to group unified caches that are peers. + * + * The PPTT table must be rev 3 or later, + * + * If one CPUs L2 is shared with another as L3, this function will return + * and unpredictable value. + * + * Return: -ENOENT if the PPTT doesn't exist, or the cache cannot be found. + * Otherwise returns a value which represents the level of the specified cache. + */ +int find_acpi_cache_level_from_id(u32 cache_id) +{ + u32 acpi_cpu_id; + acpi_status status; + int level, cpu, num_levels; + struct acpi_pptt_cache *cache; + struct acpi_table_header *table; + struct acpi_pptt_cache_v1* cache_v1; + struct acpi_pptt_processor *cpu_node; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table); + if (ACPI_FAILURE(status)) { + acpi_pptt_warn_missing(); + return -ENOENT; + } + + if (table->revision < 3) { + acpi_put_table(table); + return -ENOENT; + } + + /* + * If we found the cache first, we'd still need to walk from each CPU + * to find the level... + */ + for_each_possible_cpu(cpu) { + acpi_cpu_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table, acpi_cpu_id); + if (!cpu_node) + break; + acpi_count_levels(table, cpu_node, &num_levels, NULL); + + /* Start at 1 for L1 */ + for (level = 1; level <= num_levels; level++) { + cache = acpi_find_cache_node(table, acpi_cpu_id, + ACPI_PPTT_CACHE_TYPE_UNIFIED, + level, &cpu_node); + if (!cache) + continue; + + cache_v1 = ACPI_ADD_PTR(struct acpi_pptt_cache_v1, + cache, + sizeof(struct acpi_pptt_cache)); + + if (cache->flags & ACPI_PPTT_CACHE_ID_VALID && + cache_v1->cache_id == cache_id) { + acpi_put_table(table); + return level; + } + } + } + + acpi_put_table(table); + return -ENOENT; +} diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 1c2bb14975af..12e40bc8592d 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1495,6 +1495,7 @@ int find_acpi_cpu_topology(unsigned int cpu, int level); int find_acpi_cpu_topology_cluster(unsigned int cpu); int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); +int find_acpi_cache_level_from_id(u32 cache_id); #else static inline int acpi_pptt_cpu_is_thread(unsigned int cpu) { @@ -1516,6 +1517,10 @@ static inline int find_acpi_cpu_topology_hetero_id(unsigned int cpu) { return -EINVAL; } +static inline int find_acpi_cache_level_from_id(u32 cache_id) +{ + return -EINVAL; +} #endif
#ifdef CONFIG_ARM64
From: James Morse james.morse@arm.com
The ACPI table for MPAM describes a set of CPUs with the UID of a processor container. These exist both in the namespace and the PPTT.
Using the existing for-each helpers, provide a helper to find the specified processor container in the PPTT, and fill a cpumask with the CPUs that belong to it.
Signed-off-by: James Morse james.morse@arm.com --- drivers/acpi/pptt.c | 67 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/acpi.h | 6 ++++ 2 files changed, 73 insertions(+)
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index 6c1f668d63ba..9fbeed61cd87 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -295,6 +295,38 @@ static struct acpi_pptt_processor *acpi_find_processor_node(struct acpi_table_he return NULL; }
+/* parent_node points into the table, but the table isn't provided. */ +static void acpi_pptt_get_child_cpus(struct acpi_pptt_processor *parent_node, + cpumask_t *cpus) +{ + struct acpi_pptt_processor *cpu_node; + struct acpi_table_header *table_hdr; + acpi_status status; + u32 acpi_id; + int cpu; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table_hdr); + if (ACPI_FAILURE(status)) + return; + + for_each_possible_cpu(cpu) { + acpi_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table_hdr, acpi_id); + + while (cpu_node) { + if (cpu_node == parent_node) { + cpumask_set_cpu(cpu, cpus); + break; + } + cpu_node = fetch_pptt_node(table_hdr, cpu_node->parent); + } + } + + acpi_put_table(table_hdr); + + return; +} + /** * acpi_pptt_for_each_container() - Iterate over all processor containers * @@ -351,6 +383,41 @@ int acpi_pptt_for_each_container(acpi_pptt_cpu_callback_t callback, void *arg) return ret; }
+struct __cpus_from_container_arg { + u32 acpi_cpu_id; + cpumask_t *cpus; +}; + +static int __cpus_from_container(struct acpi_pptt_processor *container, void *arg) +{ + struct __cpus_from_container_arg *params = arg; + + if (container->acpi_processor_id == params->acpi_cpu_id) + acpi_pptt_get_child_cpus(container, params->cpus); + + return 0; +} + +/** + * acpi_pptt_get_cpus_from_container() - Populate a cpumask with all CPUs in a + * processor containers + * + * Find the specified Processor Container, and fill cpus with all the cpus + * below it. + * + * Return: 0 for a complete walk, or an error if the mask is incomplete. + */ +int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus) +{ + struct __cpus_from_container_arg params; + + params.acpi_cpu_id = acpi_cpu_id; + params.cpus = cpus; + + cpumask_clear(cpus); + return acpi_pptt_for_each_container(&__cpus_from_container, ¶ms); +} + static u8 acpi_cache_type(enum cache_type type) { switch (type) { diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 12e40bc8592d..51c6551f9b63 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1496,6 +1496,7 @@ int find_acpi_cpu_topology_cluster(unsigned int cpu); int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); int find_acpi_cache_level_from_id(u32 cache_id); +int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus); #else static inline int acpi_pptt_cpu_is_thread(unsigned int cpu) { @@ -1521,6 +1522,11 @@ static inline int find_acpi_cache_level_from_id(u32 cache_id) { return -EINVAL; } +static inline int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, + cpumask_t *cpus) +{ + return -EINVAL; +} #endif
#ifdef CONFIG_ARM64
From: James Morse james.morse@arm.com
MPAM identifies CPUs by the cache_id in the PPTT cache structure.
The driver needs to know which CPUs are associated with the cache, the CPUs may not all be online, so cacheinfo does not have the information.
Add a helper to pull this information out of the PPTT.
CC: Rohit Mathew Rohit.Mathew@arm.com Signed-off-by: James Morse james.morse@arm.com --- drivers/acpi/pptt.c | 79 ++++++++++++++++++++++++++++++++++++++++++-- include/linux/acpi.h | 6 ++++ 2 files changed, 83 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index 9fbeed61cd87..201df60e4a79 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -183,9 +183,10 @@ acpi_find_cache_level(struct acpi_table_header *table_hdr, * levels and split cache levels (data/instruction). * @table_hdr: Pointer to the head of the PPTT table * @cpu_node: processor node we wish to count caches for - * @levels: Number of levels if success. + * @levels: Number of levels if success. (*levels) should be initialized by + * the caller with the value to be used as the starting level. * @split_levels: Number of split cache levels (data/instruction) if - * success. Can by NULL. + * success. Can be NULL. * * Given a processor node containing a processing unit, walk into it and count * how many levels exist solely for it, and then walk up each level until we hit @@ -982,6 +983,8 @@ int find_acpi_cache_level_from_id(u32 cache_id) * to find the level... */ for_each_possible_cpu(cpu) { + + num_levels = 0; acpi_cpu_id = get_acpi_id_for_cpu(cpu); cpu_node = acpi_find_processor_node(table, acpi_cpu_id); if (!cpu_node) @@ -1011,3 +1014,75 @@ int find_acpi_cache_level_from_id(u32 cache_id) acpi_put_table(table); return -ENOENT; } + +/** + * acpi_pptt_get_cpumask_from_cache_id() - Get the cpus associated with the + * specified cache + * @cache_id: The id field of the unified cache + * @cpus: Where to buidl the cpumask + * + * Determine which CPUs are below this cache in the PPTT. This allows the property + * to be found even if the CPUs are offline. + * + * The PPTT table must be rev 3 or later, + * + * Return: -ENOENT if the PPTT doesn't exist, or the cache cannot be found. + * Otherwise returns 0 and sets the cpus in the provided cpumask. + */ +int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, cpumask_t *cpus) +{ + u32 acpi_cpu_id; + acpi_status status; + int level, cpu, num_levels; + struct acpi_pptt_cache *cache; + struct acpi_table_header *table; + struct acpi_pptt_cache_v1* cache_v1; + struct acpi_pptt_processor *cpu_node; + + cpumask_clear(cpus); + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table); + if (ACPI_FAILURE(status)) { + acpi_pptt_warn_missing(); + return -ENOENT; + } + + if (table->revision < 3) { + acpi_put_table(table); + return -ENOENT; + } + + /* + * If we found the cache first, we'd still need to walk from each cpu. + */ + for_each_possible_cpu(cpu) { + + num_levels = 0; + acpi_cpu_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table, acpi_cpu_id); + if (!cpu_node) + break; + acpi_count_levels(table, cpu_node, &num_levels, NULL); + + /* Start at 1 for L1 */ + for (level = 1; level <= num_levels; level++) { + cache = acpi_find_cache_node(table, acpi_cpu_id, + ACPI_PPTT_CACHE_TYPE_UNIFIED, + level, &cpu_node); + if (!cache) + continue; + + cache_v1 = ACPI_ADD_PTR(struct acpi_pptt_cache_v1, + cache, + sizeof(struct acpi_pptt_cache)); + + if (cache->flags & ACPI_PPTT_CACHE_ID_VALID && + cache_v1->cache_id == cache_id) { + cpumask_set_cpu(cpu, cpus); + } + } + } + + acpi_put_table(table); + return 0; +} diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 51c6551f9b63..7749d71e49fe 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1497,6 +1497,7 @@ int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); int find_acpi_cache_level_from_id(u32 cache_id); int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus); +int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, cpumask_t *cpus); #else static inline int acpi_pptt_cpu_is_thread(unsigned int cpu) { @@ -1527,6 +1528,11 @@ static inline int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, { return -EINVAL; } +static inline int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, + cpumask_t *cpus) +{ + return -EINVAL; +} #endif
#ifdef CONFIG_ARM64
From: Rob Herring robh@kernel.org
In preparation to set the cache 'id' based on the CPU h/w ids, allow for 64-bit bit 'id' value. The only case that needs this is arm64, so unsigned long is sufficient.
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Rafael J. Wysocki" rafael@kernel.org Signed-off-by: Rob Herring robh@kernel.org [ Update get_cpu_cacheinfo_id() too. Use UL instead of ULL. ] Signed-off-by: James Morse james.morse@arm.com --- drivers/base/cacheinfo.c | 8 +++++++- include/linux/cacheinfo.h | 8 ++++---- 2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index f1e79263fe61..8fb251f84d66 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -620,13 +620,19 @@ static ssize_t file_name##_show(struct device *dev, \ return sysfs_emit(buf, "%u\n", this_leaf->object); \ }
-show_one(id, id); show_one(level, level); show_one(coherency_line_size, coherency_line_size); show_one(number_of_sets, number_of_sets); show_one(physical_line_partition, physical_line_partition); show_one(ways_of_associativity, ways_of_associativity);
+static ssize_t id_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct cacheinfo *this_leaf = dev_get_drvdata(dev); + + return sysfs_emit(buf, "%lu\n", this_leaf->id); +} + static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index d504eb4b49ab..e0717a2d7ca0 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -47,7 +47,7 @@ extern unsigned int coherency_max_size; * keeping, the remaining members form the core properties of the cache */ struct cacheinfo { - unsigned int id; + unsigned long id; enum cache_type type; unsigned int level; unsigned int coherency_line_size; @@ -116,7 +116,7 @@ const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf); * Get the id of the cache associated with @cpu at level @level. * cpuhp lock must be held. */ -static inline int get_cpu_cacheinfo_id(int cpu, int level) +static inline unsigned long get_cpu_cacheinfo_id(int cpu, int level) { struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu); int i; @@ -125,11 +125,11 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level) if (ci->info_list[i].level == level) { if (ci->info_list[i].attributes & CACHE_ID) return ci->info_list[i].id; - return -1; + return ~0UL; } }
- return -1; + return ~0UL; }
#ifdef CONFIG_ARM64
From: Rob Herring robh@kernel.org
Use the minimum CPU h/w id of the CPUs associated with the cache for the cache 'id'. This will provide a stable id value for a given system. As we need to check all possible CPUs, we can't use the shared_cpu_map which is just online CPUs. As there's not a cache to CPUs mapping in DT, we have to walk all CPU nodes and then walk cache levels.
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Rafael J. Wysocki" rafael@kernel.org Signed-off-by: Rob Herring robh@kernel.org Signed-off-by: James Morse james.morse@arm.com --- drivers/base/cacheinfo.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+)
diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index 8fb251f84d66..b12630805ba9 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -183,6 +183,31 @@ static bool cache_node_is_unified(struct cacheinfo *this_leaf, return of_property_read_bool(np, "cache-unified"); }
+static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) +{ + struct device_node *cpu; + unsigned long min_id = ~0UL; + + for_each_of_cpu_node(cpu) { + struct device_node *cache_node = cpu; + u64 id = of_get_cpu_hwid(cache_node, 0); + + while ((cache_node = of_find_next_cache_node(cache_node))) { + if ((cache_node == np) && (id < min_id)) { + min_id = id; + of_node_put(cache_node); + break; + } + of_node_put(cache_node); + } + } + + if (min_id != ~0UL) { + this_leaf->id = min_id; + this_leaf->attributes |= CACHE_ID; + } +} + static void cache_of_set_props(struct cacheinfo *this_leaf, struct device_node *np) { @@ -198,6 +223,7 @@ static void cache_of_set_props(struct cacheinfo *this_leaf, cache_get_line_size(this_leaf, np); cache_nr_sets(this_leaf, np); cache_associativity(this_leaf); + cache_of_set_id(this_leaf, np); }
static int cache_setup_of_node(unsigned int cpu)
From: James Morse james.morse@arm.com
The MPAM driver identifies caches by id for use with resctrl. It needs to know the cache-id when probeing, but the value isn't set in cacheinfo until device_initcall().
Expose the code that generates the cache-id. The parts of the MPAM driver that run early can use this to set up the resctrl structures.
Signed-off-by: James Morse james.morse@arm.com --- drivers/base/cacheinfo.c | 13 ++++++++++--- include/linux/cacheinfo.h | 1 + 2 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index b12630805ba9..0618827f73fe 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -183,7 +183,7 @@ static bool cache_node_is_unified(struct cacheinfo *this_leaf, return of_property_read_bool(np, "cache-unified"); }
-static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) +unsigned long cache_of_get_id(struct device_node *np) { struct device_node *cpu; unsigned long min_id = ~0UL; @@ -202,8 +202,15 @@ static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) } }
- if (min_id != ~0UL) { - this_leaf->id = min_id; + return min_id; +} + +static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) +{ + unsigned long id = cache_of_get_id(np); + + if (id != ~0UL) { + this_leaf->id = id; this_leaf->attributes |= CACHE_ID; } } diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index e0717a2d7ca0..2802406bb422 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -111,6 +111,7 @@ int acpi_get_cache_info(unsigned int cpu, #endif
const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf); +unsigned long cache_of_get_id(struct device_node *np);
/* * Get the id of the cache associated with @cpu at level @level.
From: James Morse james.morse@arm.com
MPAM needs to know the size of a cache associated with a particular CPU. The DT/ACPI agnostic way of doing this is to ask cacheinfo.
Add a helper to do this.
Signed-off-by: James Morse james.morse@arm.com --- include/linux/cacheinfo.h | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+)
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index 2802406bb422..251e8b393ec7 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -133,6 +133,27 @@ static inline unsigned long get_cpu_cacheinfo_id(int cpu, int level) return ~0UL; }
+/* + * Get the size of the cache associated with @cpu at level @level. + * cpuhp lock must be held. + */ +static inline unsigned int get_cpu_cacheinfo_size(int cpu, int level) +{ + struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu); + int i; + + if (!ci->info_list) + return 0; + + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == level) { + return ci->info_list[i].size; + } + } + + return 0; +} + #ifdef CONFIG_ARM64 #define use_arch_cache_info() (true) #else
From: James Morse james.morse@arm.com
Add code to parse the arm64 specific MPAM table, looking up the cache level from the PPTT and feeding the end result into the MPAM driver.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/Kconfig | 1 + drivers/acpi/arm64/Kconfig | 3 + drivers/acpi/arm64/Makefile | 2 + drivers/acpi/arm64/mpam.c | 369 ++++++++++++++++++++++++++++++++++++ drivers/acpi/tables.c | 2 +- include/linux/arm_mpam.h | 43 +++++ 6 files changed, 419 insertions(+), 1 deletion(-) create mode 100644 drivers/acpi/arm64/mpam.c create mode 100644 include/linux/arm_mpam.h
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 3d838b7a9cdd..b947ae37b964 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2055,6 +2055,7 @@ config ARM64_TLB_RANGE
config ARM64_MPAM bool "Enable support for MPAM" + select ACPI_MPAM if ACPI help Memory Partitioning and Monitoring is an optional extension that allows the CPUs to mark load and store transactions with diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig index b3ed6212244c..f2fd79f22e7d 100644 --- a/drivers/acpi/arm64/Kconfig +++ b/drivers/acpi/arm64/Kconfig @@ -21,3 +21,6 @@ config ACPI_AGDI
config ACPI_APMT bool + +config ACPI_MPAM + bool diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile index 143debc1ba4a..9497a7777729 100644 --- a/drivers/acpi/arm64/Makefile +++ b/drivers/acpi/arm64/Makefile @@ -4,4 +4,6 @@ obj-$(CONFIG_ACPI_IORT) += iort.o obj-$(CONFIG_ACPI_GTDT) += gtdt.o obj-$(CONFIG_ACPI_APMT) += apmt.o obj-$(CONFIG_ARM_AMBA) += amba.o +obj-$(CONFIG_ACPI_MPAM) += mpam.o obj-y += dma.o init.o + diff --git a/drivers/acpi/arm64/mpam.c b/drivers/acpi/arm64/mpam.c new file mode 100644 index 000000000000..8a63449f27b5 --- /dev/null +++ b/drivers/acpi/arm64/mpam.c @@ -0,0 +1,369 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2022 Arm Ltd. + +/* Parse the MPAM ACPI table feeding the discovered nodes into the driver */ + +#define pr_fmt(fmt) "ACPI MPAM: " fmt + +#include <linux/acpi.h> +#include <linux/arm_mpam.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/platform_device.h> + +#include <acpi/processor.h> + +#include <asm/mpam.h> + +/* Flags for acpi_table_mpam_msc.*_interrupt_flags */ +#define ACPI_MPAM_MSC_IRQ_MODE_EDGE 1 +#define ACPI_MPAM_MSC_IRQ_TYPE_MASK (3<<1) +#define ACPI_MPAM_MSC_IRQ_TYPE_WIRED 0 +#define ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER (1<<3) +#define ACPI_MPAM_MSC_IRQ_AFFINITY_VALID (1<<4) + +static bool frob_irq(struct platform_device *pdev, int intid, u32 flags, + int *irq, u32 processor_container_uid) +{ + int sense; + + if (!intid) + return false; + + /* 0 in this field indicates a wired interrupt */ + if (flags & ACPI_MPAM_MSC_IRQ_TYPE_MASK) + return false; + + if (flags & ACPI_MPAM_MSC_IRQ_MODE_EDGE) + sense = ACPI_EDGE_SENSITIVE; + else + sense = ACPI_LEVEL_SENSITIVE; + + /* + * If the GSI is in the GIC's PPI range, try and create a partitioned + * percpu interrupt. + */ + if (16 <= intid && intid < 32 && processor_container_uid != ~0) { + pr_err_once("Partitioned interrupts not supported\n"); + return false; + } else { + *irq = acpi_register_gsi(&pdev->dev, intid, sense, + ACPI_ACTIVE_HIGH); + } + if (*irq <= 0) { + pr_err_once("Failed to register interrupt 0x%x with ACPI\n", + intid); + return false; + } + + return true; +} + +static void acpi_mpam_parse_irqs(struct platform_device *pdev, + struct acpi_mpam_msc_node *tbl_msc, + struct resource *res, int *res_idx) +{ + u32 flags, aff = ~0; + int irq; + + flags = tbl_msc->overflow_interrupt_flags; + if (flags & ACPI_MPAM_MSC_IRQ_AFFINITY_VALID && + flags & ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER) + aff = tbl_msc->overflow_interrupt_affinity; + if (frob_irq(pdev, tbl_msc->overflow_interrupt, flags, &irq, aff)) { + res[*res_idx].start = irq; + res[*res_idx].end = irq; + res[*res_idx].flags = IORESOURCE_IRQ; + res[*res_idx].name = "overflow"; + + (*res_idx)++; + } + + flags = tbl_msc->error_interrupt_flags; + if (flags & ACPI_MPAM_MSC_IRQ_AFFINITY_VALID && + flags & ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER) + aff = tbl_msc->error_interrupt_affinity; + else + aff = ~0; + if (frob_irq(pdev, tbl_msc->error_interrupt, flags, &irq, aff)) { + res[*res_idx].start = irq; + res[*res_idx].end = irq; + res[*res_idx].flags = IORESOURCE_IRQ; + res[*res_idx].name = "error"; + + (*res_idx)++; + } +} + +static int acpi_mpam_parse_resource(struct mpam_msc *msc, + struct acpi_mpam_resource_node *res) +{ + u32 cache_id; + int level; + + switch (res->locator_type) { + case ACPI_MPAM_LOCATION_TYPE_PROCESSOR_CACHE: + cache_id = res->locator.cache_locator.cache_reference; + level = find_acpi_cache_level_from_id(cache_id); + if (level < 0) { + pr_err_once("Bad level for cache with id %u\n", cache_id); + return level; + } + return mpam_ris_create(msc, res->ris_index, MPAM_CLASS_CACHE, + level, cache_id); + case ACPI_MPAM_LOCATION_TYPE_MEMORY: + return mpam_ris_create(msc, res->ris_index, MPAM_CLASS_MEMORY, + 255, res->locator.memory_locator.proximity_domain); + default: + /* These get discovered later and treated as unknown */ + return 0; + } +} + +int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc) +{ + int i, err; + struct acpi_mpam_resource_node *resources; + + resources = (struct acpi_mpam_resource_node *)(tbl_msc + 1); + for (i = 0; i < tbl_msc->num_resouce_nodes; i++) { + err = acpi_mpam_parse_resource(msc, &resources[i]); + if (err) + return err; + } + + return 0; +} + +static bool __init parse_msc_pm_link(struct acpi_mpam_msc_node *tbl_msc, + struct platform_device *pdev, + u32 *acpi_id) +{ + bool acpi_id_valid = false; + struct acpi_device *buddy; + char hid[16], uid[16]; + int err; + + memset(&hid, 0, sizeof(hid)); + memcpy(hid, &tbl_msc->hardware_id_linked_device, + sizeof(tbl_msc->hardware_id_linked_device)); + + if (!strcmp(hid, ACPI_PROCESSOR_CONTAINER_HID)) { + *acpi_id = tbl_msc->instance_id_linked_device; + acpi_id_valid = true; + } + + err = snprintf(uid, sizeof(uid), "%u", + tbl_msc->instance_id_linked_device); + if (err < 0 || err >= sizeof(uid)) + return acpi_id_valid; + + buddy = acpi_dev_get_first_match_dev(hid, uid, -1); + if (buddy) { + device_link_add(&pdev->dev, &buddy->dev, DL_FLAG_STATELESS); + } + + return acpi_id_valid; +} + +static int decode_interface_type(struct acpi_mpam_msc_node *tbl_msc, + enum mpam_msc_iface *iface) +{ + switch (tbl_msc->interface_type){ + case 0: + *iface = MPAM_IFACE_MMIO; + return 0; + case 1: + *iface = MPAM_IFACE_PCC; + return 0; + default: + return -EINVAL; + } +} + +static int __init _parse_table(struct acpi_table_header *table) +{ + char *table_end, *table_offset = (char *)(table + 1); + struct property_entry props[4]; /* needs a sentinel */ + struct acpi_mpam_msc_node *tbl_msc; + int next_res, next_prop, err = 0; + struct acpi_device *companion; + struct platform_device *pdev; + enum mpam_msc_iface iface; + struct resource res[3]; + char uid[16]; + u32 acpi_id; + + table_end = (char *)table + table->length; + + while (table_offset < table_end) { + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + table_offset += tbl_msc->length; + + /* + * If any of the reserved fields are set, make no attempt to + * parse the msc structure. This will prevent the driver from + * probing all the MSC, meaning it can't discover the system + * wide supported partid and pmg ranges. This avoids whatever + * this MSC is truncating the partids and creating a screaming + * error interrupt. + */ + if (tbl_msc->reserved || tbl_msc->reserved1 || tbl_msc->reserved2) + continue; + + if (decode_interface_type(tbl_msc, &iface)) + continue; + + next_res = 0; + next_prop = 0; + memset(res, 0, sizeof(res)); + memset(props, 0, sizeof(props)); + + pdev = platform_device_alloc("mpam_msc", tbl_msc->identifier); + if (IS_ERR(pdev)) { + err = PTR_ERR(pdev); + break; + } + + if (tbl_msc->length < sizeof(*tbl_msc)) { + err = -EINVAL; + break; + } + + /* Some power management is described in the namespace: */ + err = snprintf(uid, sizeof(uid), "%u", tbl_msc->identifier); + if (err > 0 && err < sizeof(uid)) { + companion = acpi_dev_get_first_match_dev("ARMHAA5C", uid, -1); + if (companion) + ACPI_COMPANION_SET(&pdev->dev, companion); + } + + if (iface == MPAM_IFACE_MMIO) { + res[next_res].name = "MPAM:MSC"; + res[next_res].start = tbl_msc->base_address; + res[next_res].end = tbl_msc->base_address + tbl_msc->mmio_size - 1; + res[next_res].flags = IORESOURCE_MEM; + next_res++; + } else if (iface == MPAM_IFACE_PCC) { + props[next_prop++] = PROPERTY_ENTRY_U32("pcc-channel", + tbl_msc->base_address); + next_prop++; + } + + acpi_mpam_parse_irqs(pdev, tbl_msc, res, &next_res); + err = platform_device_add_resources(pdev, res, next_res); + if (err) + break; + + props[next_prop++] = PROPERTY_ENTRY_U32("arm,not-ready-us", + tbl_msc->max_nrdy_usec); + + /* + * The MSC's CPU affinity is described via its linked power + * management device, but only if it points at a Processor or + * Processor Container. + */ + if (parse_msc_pm_link(tbl_msc, pdev, &acpi_id)) { + props[next_prop++] = PROPERTY_ENTRY_U32("cpu_affinity", + acpi_id); + } + + err = device_create_managed_software_node(&pdev->dev, props, + NULL); + if (err) + break; + + /* Come back later if you want the RIS too */ + err = platform_device_add_data(pdev, tbl_msc, tbl_msc->length); + if (err) + break; + + platform_device_add(pdev); + } + + if (err) + platform_device_put(pdev); + + return err; +} + +static struct acpi_table_header *get_table(void) +{ + struct acpi_table_header *table; + acpi_status status; + + if (acpi_disabled || !mpam_cpus_have_feature()) + return NULL; + + status = acpi_get_table(ACPI_SIG_MPAM, 0, &table); + if (ACPI_FAILURE(status)) + return NULL; + + if (table->revision != 1) + return NULL; + + return table; +} + + + +static int __init acpi_mpam_parse(void) +{ + struct acpi_table_header *mpam; + int err; + + mpam = get_table(); + if (!mpam) + return 0; + + err = _parse_table(mpam); + acpi_put_table(mpam); + + return err; +} + +static int _count_msc(struct acpi_table_header *table) +{ + char *table_end, *table_offset = (char *)(table + 1); + struct acpi_mpam_msc_node *tbl_msc; + int ret = 0; + + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + table_end = (char *)table + table->length; + + while (table_offset < table_end) { + if (tbl_msc->length < sizeof(*tbl_msc)) + return -EINVAL; + + ret++; + + table_offset += tbl_msc->length; + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + } + + return ret; +} + + +int acpi_mpam_count_msc(void) +{ + struct acpi_table_header *mpam; + int ret; + + mpam = get_table(); + if (!mpam) + return 0; + + ret = _count_msc(mpam); + acpi_put_table(mpam); + + return ret; +} + +/* + * Call after ACPI devices have been created, which happens behind acpi_scan_init() + * called from subsys_initcall(). PCC requires the mailbox driver, which is + * initialised from postcore_initcall(). + */ +subsys_initcall_sync(acpi_mpam_parse); diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index 8ab0a82b4da4..94cb47d740c9 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -566,7 +566,7 @@ static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst = { ACPI_SIG_PSDT, ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, ACPI_SIG_IORT, ACPI_SIG_NFIT, ACPI_SIG_HMAT, ACPI_SIG_PPTT, ACPI_SIG_NHLT, ACPI_SIG_AEST, ACPI_SIG_CEDT, ACPI_SIG_AGDI, - ACPI_SIG_NBFT }; + ACPI_SIG_NBFT, ACPI_SIG_MPAM };
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h new file mode 100644 index 000000000000..0f1d3f07e789 --- /dev/null +++ b/include/linux/arm_mpam.h @@ -0,0 +1,43 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2021 Arm Ltd. + +#ifndef __LINUX_ARM_MPAM_H +#define __LINUX_ARM_MPAM_H + +#include <linux/acpi.h> +#include <linux/types.h> + +struct mpam_msc; + +enum mpam_msc_iface { + MPAM_IFACE_MMIO, /* a real MPAM MSC */ + MPAM_IFACE_PCC, /* a fake MPAM MSC */ +}; + +enum mpam_class_types { + MPAM_CLASS_CACHE, /* Well known caches, e.g. L2 */ + MPAM_CLASS_MEMORY, /* Main memory */ + MPAM_CLASS_UNKNOWN, /* Everything else, e.g. SMMU */ +}; + +#ifdef CONFIG_ACPI_MPAM +/* Parse the ACPI description of resources entries for this MSC. */ +int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc); +int acpi_mpam_count_msc(void); +#else +static inline int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc) +{ + return -EINVAL; +} +static inline int acpi_mpam_count_msc(void) { return -EINVAL; } +#endif + +static inline int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id) +{ + return -EINVAL; +} + +#endif /* __LINUX_ARM_MPAM_H */
From: Rob Herring robh@kernel.org
The binding is designed around the assumption that an MSC will be a sub-block of something else such as a memory controller, cache controller, or IOMMU. However, it's certainly possible a design does not have that association or has a mixture of both, so the binding illustrates how we can support that with RIS child nodes.
A key part of MPAM is we need to know about all of the MSCs in the system before it can be enabled. This drives the need for the genericish 'arm,mpam-msc' compatible. Though we can't assume an MSC is accessible until a h/w specific driver potentially enables the h/w.
Cc: James Morse james.morse@arm.com Signed-off-by: Rob Herring robh@kernel.org Signed-off-by: James Morse james.morse@arm.com --- .../devicetree/bindings/arm/arm,mpam-msc.yaml | 227 ++++++++++++++++++ 1 file changed, 227 insertions(+) create mode 100644 Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml
diff --git a/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml b/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml new file mode 100644 index 000000000000..9d542ecb1a7d --- /dev/null +++ b/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml @@ -0,0 +1,227 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/arm/arm,mpam-msc.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Arm Memory System Resource Partitioning and Monitoring (MPAM) + +description: | + The Arm MPAM specification can be found here: + + https://developer.arm.com/documentation/ddi0598/latest + +maintainers: + - Rob Herring robh@kernel.org + +properties: + compatible: + items: + - const: arm,mpam-msc # Further details are discoverable + - const: arm,mpam-memory-controller-msc + + reg: + maxItems: 1 + description: A memory region containing registers as defined in the MPAM + specification. + + interrupts: + minItems: 1 + items: + - description: error (optional) + - description: overflow (optional, only for monitoring) + + interrupt-names: + oneOf: + - items: + - enum: [ error, overflow ] + - items: + - const: error + - const: overflow + + arm,not-ready-us: + description: The maximum time in microseconds for monitoring data to be + accurate after a settings change. For more information, see the + Not-Ready (NRDY) bit description in the MPAM specification. + + numa-node-id: true # see NUMA binding + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + +patternProperties: + '^ris@[0-9a-f]$': + type: object + additionalProperties: false + description: | + RIS nodes for each RIS in an MSC. These nodes are required for each RIS + implementing known MPAM controls + + properties: + compatible: + enum: + # Bulk storage for cache + - arm,mpam-cache + # Memory bandwidth + - arm,mpam-memory + + reg: + minimum: 0 + maximum: 0xf + + cpus: + $ref: '/schemas/types.yaml#/definitions/phandle-array' + description: + Phandle(s) to the CPU node(s) this RIS belongs to. By default, the parent + device's affinity is used. + + arm,mpam-device: + $ref: '/schemas/types.yaml#/definitions/phandle' + description: + By default, the MPAM enabled device associated with a RIS is the MSC's + parent node. It is possible for each RIS to be associated with different + devices in which case 'arm,mpam-device' should be used. + + required: + - compatible + - reg + +required: + - compatible + - reg + +dependencies: + interrupts: [ interrupt-names ] + +additionalProperties: false + +examples: + - | + /* + cpus { + cpu@0 { + next-level-cache = <&L2_0>; + }; + cpu@100 { + next-level-cache = <&L2_1>; + }; + }; + */ + L2_0: cache-controller-0 { + compatible = "cache"; + cache-level = <2>; + cache-unified; + next-level-cache = <&L3>; + + }; + + L2_1: cache-controller-1 { + compatible = "cache"; + cache-level = <2>; + cache-unified; + next-level-cache = <&L3>; + + }; + + L3: cache-controller@30000000 { + compatible = "arm,dsu-l3-cache", "cache"; + cache-level = <3>; + cache-unified; + + ranges = <0x0 0x30000000 0x800000>; + #address-cells = <1>; + #size-cells = <1>; + + msc@10000 { + compatible = "arm,mpam-msc"; + + /* CPU affinity implied by parent cache node's */ + reg = <0x10000 0x2000>; + interrupts = <1>, <2>; + interrupt-names = "error", "overflow"; + arm,not-ready-us = <1>; + }; + }; + + mem: memory-controller@20000 { + compatible = "foo,a-memory-controller"; + reg = <0x20000 0x1000>; + + #address-cells = <1>; + #size-cells = <1>; + ranges; + + msc@21000 { + compatible = "arm,mpam-memory-controller-msc", "arm,mpam-msc"; + reg = <0x21000 0x1000>; + interrupts = <3>; + interrupt-names = "error"; + arm,not-ready-us = <1>; + numa-node-id = <1>; + }; + }; + + iommu@40000 { + reg = <0x40000 0x1000>; + + ranges; + #address-cells = <1>; + #size-cells = <1>; + + msc@41000 { + compatible = "arm,mpam-msc"; + reg = <0 0x1000>; + interrupts = <5>, <6>; + interrupt-names = "error", "overflow"; + arm,not-ready-us = <1>; + + #address-cells = <1>; + #size-cells = <0>; + + ris@2 { + compatible = "arm,mpam-cache"; + reg = <0>; + // TODO: How to map to device(s)? + }; + }; + }; + + msc@80000 { + compatible = "foo,a-standalone-msc"; + reg = <0x80000 0x1000>; + + clocks = <&clks 123>; + + ranges; + #address-cells = <1>; + #size-cells = <1>; + + msc@10000 { + compatible = "arm,mpam-msc"; + + reg = <0x10000 0x2000>; + interrupts = <7>; + interrupt-names = "overflow"; + arm,not-ready-us = <1>; + + #address-cells = <1>; + #size-cells = <0>; + + ris@0 { + compatible = "arm,mpam-cache"; + reg = <0>; + arm,mpam-device = <&L2_0>; + }; + + ris@1 { + compatible = "arm,mpam-memory"; + reg = <1>; + arm,mpam-device = <&mem>; + }; + }; + }; + +...
From: James Morse james.morse@arm.com
Probing MPAM is convoluted. MSCs that are integrated with a CPU may only be accessible from those CPUs, and they may not be online. Touching the hardware early is pointless as MPAM can't be used until the system-wide common values for num_partid and num_pmg have been discovered.
Start with driver probe/remove and mapping the MSC.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/Kconfig | 1 + drivers/platform/Kconfig | 2 + drivers/platform/Makefile | 1 + drivers/platform/mpam/Kconfig | 6 + drivers/platform/mpam/Makefile | 1 + drivers/platform/mpam/mpam_devices.c | 355 ++++++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 48 ++++ 7 files changed, 414 insertions(+) create mode 100644 drivers/platform/mpam/Kconfig create mode 100644 drivers/platform/mpam/Makefile create mode 100644 drivers/platform/mpam/mpam_devices.c create mode 100644 drivers/platform/mpam/mpam_internal.h
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index b947ae37b964..7535cc94ce01 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2056,6 +2056,7 @@ config ARM64_TLB_RANGE config ARM64_MPAM bool "Enable support for MPAM" select ACPI_MPAM if ACPI + select ARM_CPU_RESCTRL help Memory Partitioning and Monitoring is an optional extension that allows the CPUs to mark load and store transactions with diff --git a/drivers/platform/Kconfig b/drivers/platform/Kconfig index 868b20361769..f26534a4a83b 100644 --- a/drivers/platform/Kconfig +++ b/drivers/platform/Kconfig @@ -9,6 +9,8 @@ source "drivers/platform/chrome/Kconfig"
source "drivers/platform/mellanox/Kconfig"
+source "drivers/platform/mpam/Kconfig" + source "drivers/platform/olpc/Kconfig"
source "drivers/platform/surface/Kconfig" diff --git a/drivers/platform/Makefile b/drivers/platform/Makefile index 41640172975a..db73de7cb1f4 100644 --- a/drivers/platform/Makefile +++ b/drivers/platform/Makefile @@ -11,3 +11,4 @@ obj-$(CONFIG_OLPC_EC) += olpc/ obj-$(CONFIG_GOLDFISH) += goldfish/ obj-$(CONFIG_CHROME_PLATFORMS) += chrome/ obj-$(CONFIG_SURFACE_PLATFORMS) += surface/ +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam/ diff --git a/drivers/platform/mpam/Kconfig b/drivers/platform/mpam/Kconfig new file mode 100644 index 000000000000..13bd86fc5e58 --- /dev/null +++ b/drivers/platform/mpam/Kconfig @@ -0,0 +1,6 @@ +# Confusingly, this is everything but the CPU bits of MPAM. CPU here means +# CPU resources, not containers or cgroups etc. +config ARM_CPU_RESCTRL + bool + depends on ARM64 + select RESCTRL_RMID_DEPENDS_ON_CLOSID diff --git a/drivers/platform/mpam/Makefile b/drivers/platform/mpam/Makefile new file mode 100644 index 000000000000..8ad69bfa2aa2 --- /dev/null +++ b/drivers/platform/mpam/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam_devices.o diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c new file mode 100644 index 000000000000..885f8b61cb65 --- /dev/null +++ b/drivers/platform/mpam/mpam_devices.c @@ -0,0 +1,355 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2022 Arm Ltd. + +#define pr_fmt(fmt) "mpam: " fmt + +#include <linux/acpi.h> +#include <linux/arm_mpam.h> +#include <linux/cacheinfo.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/device.h> +#include <linux/errno.h> +#include <linux/gfp.h> +#include <linux/list.h> +#include <linux/lockdep.h> +#include <linux/mutex.h> +#include <linux/of.h> +#include <linux/of_platform.h> +#include <linux/platform_device.h> +#include <linux/printk.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/srcu.h> +#include <linux/types.h> + +#include <acpi/pcc.h> + +#include <asm/mpam.h> + +#include "mpam_internal.h" + +/* + * mpam_list_lock protects the SRCU lists when writing. Once the + * mpam_enabled key is enabled these lists are read-only, + * unless the error interrupt disables the driver. + */ +static DEFINE_MUTEX(mpam_list_lock); +static LIST_HEAD(mpam_all_msc); + +static struct srcu_struct mpam_srcu; + +/* MPAM isn't available until all the MSC have been probed. */ +static u32 mpam_num_msc; + +static void mpam_discovery_complete(void) +{ + pr_err("Discovered all MSC\n"); +} + +static int mpam_dt_count_msc(void) +{ + int count = 0; + struct device_node *np; + + for_each_compatible_node(np, NULL, "arm,mpam-msc") + count++; + + return count; +} + +static int mpam_dt_parse_resource(struct mpam_msc *msc, struct device_node *np, + u32 ris_idx) +{ + int err = 0; + u32 level = 0; + unsigned long cache_id; + struct device_node *cache; + + do { + if (of_device_is_compatible(np, "arm,mpam-cache")) { + cache = of_parse_phandle(np, "arm,mpam-device", 0); + if (!cache) { + pr_err("Failed to read phandle\n"); + break; + } + } else if (of_device_is_compatible(np->parent, "cache")) { + cache = np->parent; + } else { + /* For now, only caches are supported */ + cache = NULL; + break; + } + + err = of_property_read_u32(cache, "cache-level", &level); + if (err) { + pr_err("Failed to read cache-level\n"); + break; + } + + cache_id = cache_of_get_id(cache); + if (cache_id == ~0UL) { + err = -ENOENT; + break; + } + + err = mpam_ris_create(msc, ris_idx, MPAM_CLASS_CACHE, level, + cache_id); + } while (0); + of_node_put(cache); + + return err; +} + + +static int mpam_dt_parse_resources(struct mpam_msc *msc, void *ignored) +{ + int err, num_ris = 0; + const u32 *ris_idx_p; + struct device_node *iter, *np; + + np = msc->pdev->dev.of_node; + for_each_child_of_node(np, iter) { + ris_idx_p = of_get_property(iter, "reg", NULL); + if (ris_idx_p) { + num_ris++; + err = mpam_dt_parse_resource(msc, iter, *ris_idx_p); + if (err) { + of_node_put(iter); + return err; + } + } + } + + if (!num_ris) + mpam_dt_parse_resource(msc, np, 0); + + return err; +} + +static int get_msc_affinity(struct mpam_msc *msc) +{ + struct device_node *parent; + u32 affinity_id; + int err; + + if (!acpi_disabled) { + err = device_property_read_u32(&msc->pdev->dev, "cpu_affinity", + &affinity_id); + if (err) { + cpumask_copy(&msc->accessibility, cpu_possible_mask); + err = 0; + } else { + err = acpi_pptt_get_cpus_from_container(affinity_id, + &msc->accessibility); + } + + return err; + } + + /* This depends on the path to of_node */ + parent = of_get_parent(msc->pdev->dev.of_node); + if (parent == of_root) { + cpumask_copy(&msc->accessibility, cpu_possible_mask); + err = 0; + } else { + err = -EINVAL; + pr_err("Cannot determine CPU accessibility of MSC\n"); + } + of_node_put(parent); + + return err; +} + +static int fw_num_msc; + +static void mpam_pcc_rx_callback(struct mbox_client *cl, void *msg) +{ + /* TODO: wake up tasks blocked on this MSC's PCC channel */ +} + +static int mpam_msc_drv_probe(struct platform_device *pdev) +{ + int err; + pgprot_t prot; + void * __iomem io; + struct mpam_msc *msc; + struct resource *msc_res; + void *plat_data = pdev->dev.platform_data; + + mutex_lock(&mpam_list_lock); + do { + msc = devm_kzalloc(&pdev->dev, sizeof(*msc), GFP_KERNEL); + if (!msc) { + err = -ENOMEM; + break; + } + + INIT_LIST_HEAD_RCU(&msc->glbl_list); + msc->pdev = pdev; + + err = device_property_read_u32(&pdev->dev, "arm,not-ready-us", + &msc->nrdy_usec); + if (err) { + /* This will prevent CSU monitors being usable */ + msc->nrdy_usec = 0; + } + + err = get_msc_affinity(msc); + if (err) + break; + if (cpumask_empty(&msc->accessibility)) { + pr_err_once("msc:%u is not accessible from any CPU!", + msc->id); + err = -EINVAL; + break; + } + + mutex_init(&msc->lock); + msc->id = mpam_num_msc++; + INIT_LIST_HEAD_RCU(&msc->ris); + spin_lock_init(&msc->part_sel_lock); + spin_lock_init(&msc->mon_sel_lock); + + if (device_property_read_u32(&pdev->dev, "pcc-channel", + &msc->pcc_subspace_id)) + msc->iface = MPAM_IFACE_MMIO; + else + msc->iface = MPAM_IFACE_PCC; + + if (msc->iface == MPAM_IFACE_MMIO) { + io = devm_platform_get_and_ioremap_resource(pdev, 0, + &msc_res); + if (IS_ERR(io)) { + pr_err("Failed to map MSC base address\n"); + devm_kfree(&pdev->dev, msc); + err = PTR_ERR(io); + break; + } + msc->mapped_hwpage_sz = msc_res->end - msc_res->start; + msc->mapped_hwpage = io; + } else if (msc->iface == MPAM_IFACE_PCC) { + msc->pcc_cl.dev = &pdev->dev; + msc->pcc_cl.rx_callback = mpam_pcc_rx_callback; + msc->pcc_cl.tx_block = false; + msc->pcc_cl.tx_tout = 1000; /* 1s */ + msc->pcc_cl.knows_txdone = false; + + msc->pcc_chan = pcc_mbox_request_channel(&msc->pcc_cl, + msc->pcc_subspace_id); + if (IS_ERR(msc->pcc_chan)) { + pr_err("Failed to request MSC PCC channel\n"); + devm_kfree(&pdev->dev, msc); + err = PTR_ERR(msc->pcc_chan); + break; + } + + prot = __acpi_get_mem_attribute(msc->pcc_chan->shmem_base_addr); + io = ioremap_prot(msc->pcc_chan->shmem_base_addr, + msc->pcc_chan->shmem_size, pgprot_val(prot)); + if (IS_ERR(io)) { + pr_err("Failed to map MSC base address\n"); + pcc_mbox_free_channel(msc->pcc_chan); + devm_kfree(&pdev->dev, msc); + err = PTR_ERR(io); + break; + } + + /* TODO: issue a read to update the registers */ + + msc->mapped_hwpage_sz = msc->pcc_chan->shmem_size; + msc->mapped_hwpage = io + sizeof(struct acpi_pcct_shared_memory); + } + + list_add_rcu(&msc->glbl_list, &mpam_all_msc); + platform_set_drvdata(pdev, msc); + } while (0); + mutex_unlock(&mpam_list_lock); + + if (!err) { + /* Create RIS entries described by firmware */ + if (!acpi_disabled) + err = acpi_mpam_parse_resources(msc, plat_data); + else + err = mpam_dt_parse_resources(msc, plat_data); + } + + if (!err && fw_num_msc == mpam_num_msc) + mpam_discovery_complete(); + + return err; +} + +static int mpam_msc_drv_remove(struct platform_device *pdev) +{ + struct mpam_msc *msc = platform_get_drvdata(pdev); + + if (!msc) + return 0; + + mutex_lock(&mpam_list_lock); + mpam_num_msc--; + platform_set_drvdata(pdev, NULL); + list_del_rcu(&msc->glbl_list); + synchronize_srcu(&mpam_srcu); + mutex_unlock(&mpam_list_lock); + + return 0; +} + +static const struct of_device_id mpam_of_match[] = { + { .compatible = "arm,mpam-msc", }, + {}, +}; +MODULE_DEVICE_TABLE(of, mpam_of_match); + +static struct platform_driver mpam_msc_driver = { + .driver = { + .name = "mpam_msc", + .of_match_table = of_match_ptr(mpam_of_match), + }, + .probe = mpam_msc_drv_probe, + .remove = mpam_msc_drv_remove, +}; + +/* + * MSC that are hidden under caches are not created as platform devices + * as there is no cache driver. Caches are also special-cased in + * get_msc_affinity(). + */ +static void mpam_dt_create_foundling_msc(void) +{ + int err; + struct device_node *cache; + + for_each_compatible_node(cache, NULL, "cache") { + err = of_platform_populate(cache, mpam_of_match, NULL, NULL); + if (err) { + pr_err("Failed to create MSC devices under caches\n"); + } + } +} + +static int __init mpam_msc_driver_init(void) +{ + if (!mpam_cpus_have_feature()) + return -EOPNOTSUPP; + + init_srcu_struct(&mpam_srcu); + + if (!acpi_disabled) + fw_num_msc = acpi_mpam_count_msc(); + else + fw_num_msc = mpam_dt_count_msc(); + + if (fw_num_msc <= 0) { + pr_err("No MSC devices found in firmware\n"); + return -EINVAL; + } + + if (acpi_disabled) + mpam_dt_create_foundling_msc(); + + return platform_driver_register(&mpam_msc_driver); +} +subsys_initcall(mpam_msc_driver_init); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h new file mode 100644 index 000000000000..affd7999fcad --- /dev/null +++ b/drivers/platform/mpam/mpam_internal.h @@ -0,0 +1,48 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2021 Arm Ltd. + +#ifndef MPAM_INTERNAL_H +#define MPAM_INTERNAL_H + +#include <linux/arm_mpam.h> +#include <linux/cpumask.h> +#include <linux/io.h> +#include <linux/mailbox_client.h> +#include <linux/mutex.h> +#include <linux/resctrl.h> +#include <linux/sizes.h> + +struct mpam_msc +{ + /* member of mpam_all_msc */ + struct list_head glbl_list; + + int id; + struct platform_device *pdev; + + /* Not modified after mpam_is_enabled() becomes true */ + enum mpam_msc_iface iface; + u32 pcc_subspace_id; + struct mbox_client pcc_cl; + struct pcc_mbox_chan *pcc_chan; + u32 nrdy_usec; + cpumask_t accessibility; + + struct mutex lock; + unsigned long ris_idxs[128 / BITS_PER_LONG]; + u32 ris_max; + + /* mpam_msc_ris of this component */ + struct list_head ris; + + /* + * part_sel_lock protects access to the MSC hardware registers that are + * affected by MPAMCFG_PART_SEL. (including the ID registers) + * If needed, take msc->lock first. + */ + spinlock_t part_sel_lock; + void __iomem * mapped_hwpage; + size_t mapped_hwpage_sz; +}; + +#endif /* MPAM_INTERNAL_H */
From: James Morse james.morse@arm.com
An MSC is a container of resources, each identified by their RIS index. Some RIS are described by firmware to provide their position in the system. Others are discovered when the driver probes the hardware.
To configure a resource it needs to be found by its class, e.g. 'L2'. There are two kinds of grouping, a class is a set of components, which are visible as there are likely to be multiple instances of the L2 cache. struct mpam_components are a set of struct mpam_msc_ris, which are not visible as each L2 cache may be composed of individual slices which need to be configured the same as the hardware is not able to distribute the configuration.
Add support for creating and destroying these structures. A gfp is passed as the structure for 'unknown' may need creating if a new RIS entry is discovered when probing the MSC.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 362 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 52 ++++ include/linux/arm_mpam.h | 7 +- 3 files changed, 412 insertions(+), 9 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 885f8b61cb65..599bd4ee5b00 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -20,7 +20,6 @@ #include <linux/printk.h> #include <linux/slab.h> #include <linux/spinlock.h> -#include <linux/srcu.h> #include <linux/types.h>
#include <acpi/pcc.h> @@ -37,11 +36,359 @@ static DEFINE_MUTEX(mpam_list_lock); static LIST_HEAD(mpam_all_msc);
-static struct srcu_struct mpam_srcu; +struct srcu_struct mpam_srcu;
/* MPAM isn't available until all the MSC have been probed. */ static u32 mpam_num_msc;
+/* + * An MSC is a container for resources, each identified by their RIS index. + * Components are a group of RIS that control the same thing. + * Classes are the set components of the same type. + * + * e.g. The set of RIS that make up the L2 are a component. These are sometimes + * termed slices. They should be configured as if they were one MSC. + * + * e.g. The SoC probably has more than one L2, each attached to a distinct set + * of CPUs. All the L2 components are grouped as a class. + * + * When creating an MSC, struct mpam_msc is added to the all mpam_all_msc list, + * then linked via struct mpam_ris to a component and a class. + * The same MSC may exist under different class->component paths, but the RIS + * index will be unique. + */ +LIST_HEAD(mpam_classes); + +static struct mpam_component * +mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp) +{ + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + comp = kzalloc(sizeof(*comp), gfp); + if (!comp) + return ERR_PTR(-ENOMEM); + + comp->comp_id = id; + INIT_LIST_HEAD_RCU(&comp->ris); + /* affinity is updated when ris are added */ + INIT_LIST_HEAD_RCU(&comp->class_list); + comp->class = class; + + list_add_rcu(&comp->class_list, &class->components); + + return comp; +} + +static struct mpam_component * +mpam_component_get(struct mpam_class *class, int id, bool alloc, gfp_t gfp) +{ + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(comp, &class->components, class_list) { + if (comp->comp_id == id) + return comp; + } + + if (!alloc) + return ERR_PTR(-ENOENT); + + return mpam_component_alloc(class, id, gfp); +} + +static struct mpam_class * +mpam_class_alloc(u8 level_idx, enum mpam_class_types type, gfp_t gfp) +{ + struct mpam_class *class; + + lockdep_assert_held(&mpam_list_lock); + + class = kzalloc(sizeof(*class), gfp); + if (!class) + return ERR_PTR(-ENOMEM); + + INIT_LIST_HEAD_RCU(&class->components); + /* affinity is updated when ris are added */ + class->level = level_idx; + class->type = type; + INIT_LIST_HEAD_RCU(&class->classes_list); + + list_add_rcu(&class->classes_list, &mpam_classes); + + return class; +} + +static struct mpam_class * +mpam_class_get(u8 level_idx, enum mpam_class_types type, bool alloc, gfp_t gfp) +{ + bool found = false; + struct mpam_class *class; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + if (class->type == type && class->level == level_idx) { + found = true; + break; + } + } + + if (found) + return class; + + if (!alloc) + return ERR_PTR(-ENOENT); + + return mpam_class_alloc(level_idx, type, gfp); +} + +static void mpam_class_destroy(struct mpam_class *class) +{ + lockdep_assert_held(&mpam_list_lock); + + list_del_rcu(&class->classes_list); + synchronize_srcu(&mpam_srcu); + kfree(class); +} + +static void mpam_comp_destroy(struct mpam_component *comp) +{ + struct mpam_class *class = comp->class; + + lockdep_assert_held(&mpam_list_lock); + + list_del_rcu(&comp->class_list); + synchronize_srcu(&mpam_srcu); + kfree(comp); + + if (list_empty(&class->components)) + mpam_class_destroy(class); +} + +/* synchronise_srcu() before freeing ris */ +static void mpam_ris_destroy(struct mpam_msc_ris *ris) +{ + struct mpam_component *comp = ris->comp; + struct mpam_class *class = comp->class; + struct mpam_msc *msc = ris->msc; + + lockdep_assert_held(&mpam_list_lock); + lockdep_assert_preemption_enabled(); + + clear_bit(ris->ris_idx, msc->ris_idxs); + list_del_rcu(&ris->comp_list); + list_del_rcu(&ris->msc_list); + + cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity); + cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity); + + if (list_empty(&comp->ris)) + mpam_comp_destroy(comp); +} + +/* + * There are two ways of reaching a struct mpam_msc_ris. Via the + * class->component->ris, or via the msc. + * When destroying the msc, the other side needs unlinking and cleaning up too. + * synchronise_srcu() before freeing msc. + */ +static void mpam_msc_destroy(struct mpam_msc *msc) +{ + struct mpam_msc_ris *ris, *tmp; + + lockdep_assert_held(&mpam_list_lock); + lockdep_assert_preemption_enabled(); + + list_for_each_entry_safe(ris, tmp, &msc->ris, msc_list) + mpam_ris_destroy(ris); +} + +/* + * The cacheinfo structures are only populated when CPUs are online. + * This helper walks the device tree to include offline CPUs too. + */ +static int get_cpumask_from_cache_id(u32 cache_id, u32 cache_level, + cpumask_t *affinity) +{ + int cpu, err; + u32 iter_level; + int iter_cache_id; + struct device_node *iter; + + if (!acpi_disabled) + return acpi_pptt_get_cpumask_from_cache_id(cache_id, affinity); + + for_each_possible_cpu(cpu) { + iter = of_get_cpu_node(cpu, NULL); + if (!iter) { + pr_err("Failed to find cpu%d device node\n", cpu); + return -ENOENT; + } + + while ((iter = of_find_next_cache_node(iter))) { + err = of_property_read_u32(iter, "cache-level", + &iter_level); + if (err || (iter_level != cache_level)) { + of_node_put(iter); + continue; + } + + /* + * get_cpu_cacheinfo_id() isn't ready until sometime + * during device_initcall(). Use cache_of_get_id(). + */ + iter_cache_id = cache_of_get_id(iter); + if (cache_id == ~0UL) { + of_node_put(iter); + continue; + } + + if (iter_cache_id == cache_id) + cpumask_set_cpu(cpu, affinity); + + of_node_put(iter); + } + } + + return 0; +} + + +/* + * cpumask_of_node() only knows about online CPUs. This can't tell us whether + * a class is represented on all possible CPUs. + */ +static void get_cpumask_from_node_id(u32 node_id, cpumask_t *affinity) +{ + int cpu; + + for_each_possible_cpu(cpu) { + if (node_id == cpu_to_node(cpu)) + cpumask_set_cpu(cpu, affinity); + } +} + +static int get_cpumask_from_cache(struct device_node *cache, + cpumask_t *affinity) +{ + int err; + u32 cache_level; + int cache_id; + + err = of_property_read_u32(cache, "cache-level", &cache_level); + if (err) { + pr_err("Failed to read cache-level from cache node\n"); + return -ENOENT; + } + + cache_id = cache_of_get_id(cache); + if (cache_id == ~0UL) { + pr_err("Failed to calculate cache-id from cache node\n"); + return -ENOENT; + } + + return get_cpumask_from_cache_id(cache_id, cache_level, affinity); +} + +static int mpam_ris_get_affinity(struct mpam_msc *msc, cpumask_t *affinity, + enum mpam_class_types type, + struct mpam_class *class, + struct mpam_component *comp) +{ + int err; + + switch (type) { + case MPAM_CLASS_CACHE: + err = get_cpumask_from_cache_id(comp->comp_id, class->level, + affinity); + if (err) + return err; + + if (cpumask_empty(affinity)) + pr_warn_once("%s no CPUs associated with cache node", + dev_name(&msc->pdev->dev)); + + break; + case MPAM_CLASS_MEMORY: + get_cpumask_from_node_id(comp->comp_id, affinity); + if (cpumask_empty(affinity)) + pr_warn_once("%s no CPUs associated with memory node", + dev_name(&msc->pdev->dev)); + break; + case MPAM_CLASS_UNKNOWN: + return 0; + } + + cpumask_and(affinity, affinity, &msc->accessibility); + + return 0; +} + +static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, + int component_id, gfp_t gfp) +{ + int err; + struct mpam_msc_ris *ris; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + if (test_and_set_bit(ris_idx, msc->ris_idxs)) + return -EBUSY; + + ris = devm_kzalloc(&msc->pdev->dev, sizeof(*ris), gfp); + if (!ris) + return -ENOMEM; + + class = mpam_class_get(class_id, type, true, gfp); + if (IS_ERR(class)) + return PTR_ERR(class); + + comp = mpam_component_get(class, component_id, true, gfp); + if (IS_ERR(comp)) { + if (list_empty(&class->components)) + mpam_class_destroy(class); + return PTR_ERR(comp); + } + + err = mpam_ris_get_affinity(msc, &ris->affinity, type, class, comp); + if (err) { + if (list_empty(&class->components)) + mpam_class_destroy(class); + return err; + } + + ris->ris_idx = ris_idx; + INIT_LIST_HEAD_RCU(&ris->comp_list); + INIT_LIST_HEAD_RCU(&ris->msc_list); + ris->msc = msc; + ris->comp = comp; + + cpumask_or(&comp->affinity, &comp->affinity, &ris->affinity); + cpumask_or(&class->affinity, &class->affinity, &ris->affinity); + list_add_rcu(&ris->comp_list, &comp->ris); + + return 0; +} + +int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id) +{ + int err; + + mutex_lock(&mpam_list_lock); + err = mpam_ris_create_locked(msc, ris_idx, type, class_id, + component_id, GFP_KERNEL); + mutex_unlock(&mpam_list_lock); + + return err; +} + static void mpam_discovery_complete(void) { pr_err("Discovered all MSC\n"); @@ -153,8 +500,14 @@ static int get_msc_affinity(struct mpam_msc *msc) cpumask_copy(&msc->accessibility, cpu_possible_mask); err = 0; } else { - err = -EINVAL; - pr_err("Cannot determine CPU accessibility of MSC\n"); + if (of_device_is_compatible(parent, "cache")) { + err = get_cpumask_from_cache(parent, + &msc->accessibility); + } else { + err = -EINVAL; + pr_err("Cannot determine accessibility of MSC: %s\n", + dev_name(&msc->pdev->dev)); + } } of_node_put(parent);
@@ -291,6 +644,7 @@ static int mpam_msc_drv_remove(struct platform_device *pdev) mpam_num_msc--; platform_set_drvdata(pdev, NULL); list_del_rcu(&msc->glbl_list); + mpam_msc_destroy(msc); synchronize_srcu(&mpam_srcu); mutex_unlock(&mpam_list_lock);
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index affd7999fcad..07d9c70bf1e6 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -11,6 +11,7 @@ #include <linux/mutex.h> #include <linux/resctrl.h> #include <linux/sizes.h> +#include <linux/srcu.h>
struct mpam_msc { @@ -45,4 +46,55 @@ struct mpam_msc size_t mapped_hwpage_sz; };
+struct mpam_class +{ + /* mpam_components in this class */ + struct list_head components; + + cpumask_t affinity; + + u8 level; + enum mpam_class_types type; + + /* member of mpam_classes */ + struct list_head classes_list; +}; + +struct mpam_component +{ + u32 comp_id; + + /* mpam_msc_ris in this component */ + struct list_head ris; + + cpumask_t affinity; + + /* member of mpam_class:components */ + struct list_head class_list; + + /* parent: */ + struct mpam_class *class; +}; + +struct mpam_msc_ris +{ + u8 ris_idx; + + cpumask_t affinity; + + /* member of mpam_component:ris */ + struct list_head comp_list; + + /* member of mpam_msc:ris */ + struct list_head msc_list; + + /* parents: */ + struct mpam_msc *msc; + struct mpam_component *comp; +}; + +/* List of all classes */ +extern struct list_head mpam_classes; +extern struct srcu_struct mpam_srcu; + #endif /* MPAM_INTERNAL_H */ diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 0f1d3f07e789..950ea7049d53 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -34,10 +34,7 @@ static inline int acpi_mpam_parse_resources(struct mpam_msc *msc, static inline int acpi_mpam_count_msc(void) { return -EINVAL; } #endif
-static inline int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, - enum mpam_class_types type, u8 class_id, int component_id) -{ - return -EINVAL; -} +int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id);
#endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
Memory Partitioning and Monitoring (MPAM) has memory mapped devices (MSCs) with an identity/configuration page.
Add the definitions for these registers as offset within the page(s).
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_internal.h | 254 ++++++++++++++++++++++++++ 1 file changed, 254 insertions(+)
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 07d9c70bf1e6..5d339e4375e2 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -97,4 +97,258 @@ struct mpam_msc_ris extern struct list_head mpam_classes; extern struct srcu_struct mpam_srcu;
+ +/* + * MPAM MSCs have the following register layout. See: + * Arm Architecture Reference Manual Supplement - Memory System Resource + * Partitioning and Monitoring (MPAM), for Armv8-A. DDI 0598A.a + */ +#define MPAM_ARCHITECTURE_V1 0x10 + +/* Memory mapped control pages: */ +/* ID Register offsets in the memory mapped page */ +#define MPAMF_IDR 0x0000 /* features id register */ +#define MPAMF_MSMON_IDR 0x0080 /* performance monitoring features */ +#define MPAMF_IMPL_IDR 0x0028 /* imp-def partitioning */ +#define MPAMF_CPOR_IDR 0x0030 /* cache-portion partitioning */ +#define MPAMF_CCAP_IDR 0x0038 /* cache-capacity partitioning */ +#define MPAMF_MBW_IDR 0x0040 /* mem-bw partitioning */ +#define MPAMF_PRI_IDR 0x0048 /* priority partitioning */ +#define MPAMF_CSUMON_IDR 0x0088 /* cache-usage monitor */ +#define MPAMF_MBWUMON_IDR 0x0090 /* mem-bw usage monitor */ +#define MPAMF_PARTID_NRW_IDR 0x0050 /* partid-narrowing */ +#define MPAMF_IIDR 0x0018 /* implementer id register */ +#define MPAMF_AIDR 0x0020 /* architectural id register */ + +/* Configuration and Status Register offsets in the memory mapped page */ +#define MPAMCFG_PART_SEL 0x0100 /* partid to configure: */ +#define MPAMCFG_CPBM 0x1000 /* cache-portion config */ +#define MPAMCFG_CMAX 0x0108 /* cache-capacity config */ +#define MPAMCFG_MBW_MIN 0x0200 /* min mem-bw config */ +#define MPAMCFG_MBW_MAX 0x0208 /* max mem-bw config */ +#define MPAMCFG_MBW_WINWD 0x0220 /* mem-bw accounting window config */ +#define MPAMCFG_MBW_PBM 0x2000 /* mem-bw portion bitmap config */ +#define MPAMCFG_PRI 0x0400 /* priority partitioning config */ +#define MPAMCFG_MBW_PROP 0x0500 /* mem-bw stride config */ +#define MPAMCFG_INTPARTID 0x0600 /* partid-narrowing config */ + +#define MSMON_CFG_MON_SEL 0x0800 /* monitor selector */ +#define MSMON_CFG_CSU_FLT 0x0810 /* cache-usage monitor filter */ +#define MSMON_CFG_CSU_CTL 0x0818 /* cache-usage monitor config */ +#define MSMON_CFG_MBWU_FLT 0x0820 /* mem-bw monitor filter */ +#define MSMON_CFG_MBWU_CTL 0x0828 /* mem-bw monitor config */ +#define MSMON_CSU 0x0840 /* current cache-usage */ +#define MSMON_CSU_CAPTURE 0x0848 /* last cache-usage value captured */ +#define MSMON_MBWU 0x0860 /* current mem-bw usage value */ +#define MSMON_MBWU_CAPTURE 0x0868 /* last mem-bw value captured */ +#define MSMON_CAPT_EVNT 0x0808 /* signal a capture event */ +#define MPAMF_ESR 0x00F8 /* error status register */ +#define MPAMF_ECR 0x00F0 /* error control register */ + +/* MPAMF_IDR - MPAM features ID register */ +#define MPAMF_IDR_PARTID_MAX GENMASK(15, 0) +#define MPAMF_IDR_PMG_MAX GENMASK(23, 16) +#define MPAMF_IDR_HAS_CCAP_PART BIT(24) +#define MPAMF_IDR_HAS_CPOR_PART BIT(25) +#define MPAMF_IDR_HAS_MBW_PART BIT(26) +#define MPAMF_IDR_HAS_PRI_PART BIT(27) +#define MPAMF_IDR_HAS_EXT BIT(28) +#define MPAMF_IDR_HAS_IMPL_IDR BIT(29) +#define MPAMF_IDR_HAS_MSMON BIT(30) +#define MPAMF_IDR_HAS_PARTID_NRW BIT(31) +#define MPAMF_IDR_HAS_RIS BIT(32) +#define MPAMF_IDR_HAS_EXT_ESR BIT(38) +#define MPAMF_IDR_HAS_ESR BIT(39) +#define MPAMF_IDR_RIS_MAX GENMASK(59, 56) + + +/* MPAMF_MSMON_IDR - MPAM performance monitoring ID register */ +#define MPAMF_MSMON_IDR_MSMON_CSU BIT(16) +#define MPAMF_MSMON_IDR_MSMON_MBWU BIT(17) +#define MPAMF_MSMON_IDR_HAS_LOCAL_CAPT_EVNT BIT(31) + +/* MPAMF_CPOR_IDR - MPAM features cache portion partitioning ID register */ +#define MPAMF_CPOR_IDR_CPBM_WD GENMASK(15, 0) + +/* MPAMF_CCAP_IDR - MPAM features cache capacity partitioning ID register */ +#define MPAMF_CCAP_IDR_CMAX_WD GENMASK(5, 0) + +/* MPAMF_MBW_IDR - MPAM features memory bandwidth partitioning ID register */ +#define MPAMF_MBW_IDR_BWA_WD GENMASK(5, 0) +#define MPAMF_MBW_IDR_HAS_MIN BIT(10) +#define MPAMF_MBW_IDR_HAS_MAX BIT(11) +#define MPAMF_MBW_IDR_HAS_PBM BIT(12) +#define MPAMF_MBW_IDR_HAS_PROP BIT(13) +#define MPAMF_MBW_IDR_WINDWR BIT(14) +#define MPAMF_MBW_IDR_BWPBM_WD GENMASK(28, 16) + +/* MPAMF_PRI_IDR - MPAM features priority partitioning ID register */ +#define MPAMF_PRI_IDR_HAS_INTPRI BIT(0) +#define MPAMF_PRI_IDR_INTPRI_0_IS_LOW BIT(1) +#define MPAMF_PRI_IDR_INTPRI_WD GENMASK(9, 4) +#define MPAMF_PRI_IDR_HAS_DSPRI BIT(16) +#define MPAMF_PRI_IDR_DSPRI_0_IS_LOW BIT(17) +#define MPAMF_PRI_IDR_DSPRI_WD GENMASK(25, 20) + +/* MPAMF_CSUMON_IDR - MPAM cache storage usage monitor ID register */ +#define MPAMF_CSUMON_IDR_NUM_MON GENMASK(15, 0) +#define MPAMF_CSUMON_IDR_HAS_CAPTURE BIT(31) + +/* MPAMF_MBWUMON_IDR - MPAM memory bandwidth usage monitor ID register */ +#define MPAMF_MBWUMON_IDR_NUM_MON GENMASK(15, 0) +#define MPAMF_MBWUMON_IDR_RWBW BIT(28) +#define MPAMF_MBWUMON_IDR_LWD BIT(29) +#define MPAMF_MBWUMON_IDR_HAS_LONG BIT(30) +#define MPAMF_MBWUMON_IDR_HAS_CAPTURE BIT(31) + +/* MPAMF_PARTID_NRW_IDR - MPAM PARTID narrowing ID register */ +#define MPAMF_PARTID_NRW_IDR_INTPARTID_MAX GENMASK(15, 0) + +/* MPAMF_IIDR - MPAM implementation ID register */ +#define MPAMF_IIDR_PRODUCTID GENMASK(31, 20) +#define MPAMF_IIDR_PRODUCTID_SHIFT 20 +#define MPAMF_IIDR_VARIANT GENMASK(19, 16) +#define MPAMF_IIDR_VARIANT_SHIFT 16 +#define MPAMF_IIDR_REVISON GENMASK(15, 12) +#define MPAMF_IIDR_REVISON_SHIFT 12 +#define MPAMF_IIDR_IMPLEMENTER GENMASK(11, 0) +#define MPAMF_IIDR_IMPLEMENTER_SHIFT 0 + +/* MPAMF_AIDR - MPAM architecture ID register */ +#define MPAMF_AIDR_ARCH_MAJOR_REV GENMASK(7, 4) +#define MPAMF_AIDR_ARCH_MINOR_REV GENMASK(3, 0) + +/* MPAMCFG_PART_SEL - MPAM partition configuration selection register */ +#define MPAMCFG_PART_SEL_PARTID_SEL GENMASK(15, 0) +#define MPAMCFG_PART_SEL_INTERNAL BIT(16) +#define MPAMCFG_PART_SEL_RIS GENMASK(27, 24) + +/* MPAMCFG_CMAX - MPAM cache portion bitmap partition configuration register */ +#define MPAMCFG_CMAX_CMAX GENMASK(15, 0) + +/* + * MPAMCFG_MBW_MIN - MPAM memory minimum bandwidth partitioning configuration + * register + */ +#define MPAMCFG_MBW_MIN_MIN GENMASK(15, 0) + +/* + * MPAMCFG_MBW_MAX - MPAM memory maximum bandwidth partitioning configuration + * register + */ +#define MPAMCFG_MBW_MAX_MAX GENMASK(15, 0) +#define MPAMCFG_MBW_MAX_HARDLIM BIT(31) + +/* + * MPAMCFG_MBW_WINWD - MPAM memory bandwidth partitioning window width + * register + */ +#define MPAMCFG_MBW_WINWD_US_FRAC GENMASK(7, 0) +#define MPAMCFG_MBW_WINWD_US_INT GENMASK(23, 8) + + +/* MPAMCFG_PRI - MPAM priority partitioning configuration register */ +#define MPAMCFG_PRI_INTPRI GENMASK(15, 0) +#define MPAMCFG_PRI_DSPRI GENMASK(31, 16) + +/* + * MPAMCFG_MBW_PROP - Memory bandwidth proportional stride partitioning + * configuration register + */ +#define MPAMCFG_MBW_PROP_STRIDEM1 GENMASK(15, 0) +#define MPAMCFG_MBW_PROP_EN BIT(31) + +/* + * MPAMCFG_INTPARTID - MPAM internal partition narrowing configuration register + */ +#define MPAMCFG_INTPARTID_INTPARTID GENMASK(15, 0) +#define MPAMCFG_INTPARTID_INTERNAL BIT(16) + +/* MSMON_CFG_MON_SEL - Memory system performance monitor selection register */ +#define MSMON_CFG_MON_SEL_MON_SEL GENMASK(7, 0) +#define MSMON_CFG_MON_SEL_RIS GENMASK(27, 24) + +/* MPAMF_ESR - MPAM Error Status Register */ +#define MPAMF_ESR_PARTID_OR_MON GENMASK(15, 0) +#define MPAMF_ESR_PMG GENMASK(23, 16) +#define MPAMF_ESR_ERRCODE GENMASK(27, 24) +#define MPAMF_ESR_OVRWR BIT(31) +#define MPAMF_ESR_RIS GENMASK(35, 32) + +/* MPAMF_ECR - MPAM Error Control Register */ +#define MPAMF_ECR_INTEN BIT(0) + +/* Error conditions in accessing memory mapped registers */ +#define MPAM_ERRCODE_NONE 0 +#define MPAM_ERRCODE_PARTID_SEL_RANGE 1 +#define MPAM_ERRCODE_REQ_PARTID_RANGE 2 +#define MPAM_ERRCODE_MSMONCFG_ID_RANGE 3 +#define MPAM_ERRCODE_REQ_PMG_RANGE 4 +#define MPAM_ERRCODE_MONITOR_RANGE 5 +#define MPAM_ERRCODE_INTPARTID_RANGE 6 +#define MPAM_ERRCODE_UNEXPECTED_INTERNAL 7 + +/* + * MSMON_CFG_CSU_FLT - Memory system performance monitor configure cache storage + * usage monitor filter register + */ +#define MSMON_CFG_CSU_FLT_PARTID GENMASK(15, 0) +#define MSMON_CFG_CSU_FLT_PMG GENMASK(23, 16) + +/* + * MSMON_CFG_CSU_CTL - Memory system performance monitor configure cache storage + * usage monitor control register + * MSMON_CFG_MBWU_CTL - Memory system performance monitor configure memory + * bandwidth usage monitor control register + */ +#define MSMON_CFG_x_CTL_TYPE GENMASK(7, 0) +#define MSMON_CFG_x_CTL_MATCH_PARTID BIT(16) +#define MSMON_CFG_x_CTL_MATCH_PMG BIT(17) +#define MSMON_CFG_x_CTL_SCLEN BIT(19) +#define MSMON_CFG_x_CTL_SUBTYPE GENMASK(23, 20) +#define MSMON_CFG_x_CTL_OFLOW_FRZ BIT(24) +#define MSMON_CFG_x_CTL_OFLOW_INTR BIT(25) +#define MSMON_CFG_x_CTL_OFLOW_STATUS BIT(26) +#define MSMON_CFG_x_CTL_CAPT_RESET BIT(27) +#define MSMON_CFG_x_CTL_CAPT_EVNT GENMASK(30, 28) +#define MSMON_CFG_x_CTL_EN BIT(31) + +#define MSMON_CFG_MBWU_CTL_TYPE_MBWU 0x42 +#define MSMON_CFG_MBWU_CTL_TYPE_CSU 0x43 + +#define MSMON_CFG_MBWU_CTL_SUBTYPE_NONE 0 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_READ 1 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_WRITE 2 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_BOTH 3 + +#define MSMON_CFG_MBWU_CTL_SUBTYPE_MAX 3 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_MASK 0x3 + +/* + * MSMON_CFG_MBWU_FLT - Memory system performance monitor configure memory + * bandwidth usage monitor filter register + */ +#define MSMON_CFG_MBWU_FLT_PARTID GENMASK(15, 0) +#define MSMON_CFG_MBWU_FLT_PMG GENMASK(23, 16) +#define MSMON_CFG_MBWU_FLT_RWBW GENMASK(31, 30) + +/* + * MSMON_CSU - Memory system performance monitor cache storage usage monitor + * register + * MSMON_CSU_CAPTURE - Memory system performance monitor cache storage usage + * capture register + * MSMON_MBWU - Memory system performance monitor memory bandwidth usage + * monitor register + * MSMON_MBWU_CAPTURE - Memory system performance monitor memory bandwidth usage + * capture register + */ +#define MSMON___VALUE GENMASK(30, 0) +#define MSMON___NRDY BIT(31) +#define MSMON_MBWU_L_VALUE GENMASK(62, 0) +/* + * MSMON_CAPT_EVNT - Memory system performance monitoring capture event + * generation register + */ +#define MSMON_CAPT_EVNT_NOW BIT(0) + #endif /* MPAM_INTERNAL_H */
From: James Morse james.morse@arm.com
Because an MSC can only by accessed from the CPUs in its cpu-affinity set we need to be running on one of those CPUs to probe the MSC hardware.
Do this work in the cpuhp callback. Probing the hardware will only happen before MPAM is enabled, walk all the MSCs and probe those we can reach that haven't already been probed.
Later, enabling MPAM will enable a static key which will allow mpam_discovery_cpu_online() and its mutex to be skipped.
Enabling a static key will also take the cpuhp lock, so can't be done from the cpuhp callback. Whenever a new MSC has been probed schedule work to test if all the MSCs have now been probed.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 148 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 5 +- 2 files changed, 149 insertions(+), 4 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 599bd4ee5b00..48811c90b78a 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -4,6 +4,7 @@ #define pr_fmt(fmt) "mpam: " fmt
#include <linux/acpi.h> +#include <linux/atomic.h> #include <linux/arm_mpam.h> #include <linux/cacheinfo.h> #include <linux/cpu.h> @@ -21,6 +22,7 @@ #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/types.h> +#include <linux/workqueue.h>
#include <acpi/pcc.h>
@@ -41,6 +43,16 @@ struct srcu_struct mpam_srcu; /* MPAM isn't available until all the MSC have been probed. */ static u32 mpam_num_msc;
+static int mpam_cpuhp_state; +static DEFINE_MUTEX(mpam_cpuhp_state_lock); + +/* + * mpam is enabled once all devices have been probed from CPU online callbacks, + * scheduled via this work_struct. If access to an MSC depends on a CPU that + * was not brought online at boot, this can happen surprisingly late. + */ +static DECLARE_WORK(mpam_enable_work, &mpam_enable); + /* * An MSC is a container for resources, each identified by their RIS index. * Components are a group of RIS that control the same thing. @@ -59,6 +71,24 @@ static u32 mpam_num_msc; */ LIST_HEAD(mpam_classes);
+static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) +{ + WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + return readl_relaxed(msc->mapped_hwpage + reg); +} + +#define mpam_read_partsel_reg(msc, reg) \ +({ \ + u32 ____ret; \ + \ + lockdep_assert_held_once(&msc->part_sel_lock); \ + ____ret = __mpam_read_reg(msc, MPAMF_##reg); \ + \ + ____ret; \ +}) + static struct mpam_component * mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp) { @@ -389,9 +419,81 @@ int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, return err; }
-static void mpam_discovery_complete(void) +static int mpam_msc_hw_probe(struct mpam_msc *msc) +{ + u64 idr; + int err; + + lockdep_assert_held(&msc->lock); + + spin_lock(&msc->part_sel_lock); + idr = mpam_read_partsel_reg(msc, AIDR); + if ((idr & MPAMF_AIDR_ARCH_MAJOR_REV) != MPAM_ARCHITECTURE_V1) { + pr_err_once("%s does not match MPAM architecture v1.0\n", + dev_name(&msc->pdev->dev)); + err = -EIO; + } else { + msc->probed = true; + err = 0; + } + spin_unlock(&msc->part_sel_lock); + + return err; +} + +static int mpam_cpu_online(unsigned int cpu) { - pr_err("Discovered all MSC\n"); + return 0; +} + +/* Before mpam is enabled, try to probe new MSC */ +static int mpam_discovery_cpu_online(unsigned int cpu) +{ + int err = 0; + struct mpam_msc *msc; + bool new_device_probed = false; + + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + mutex_lock(&msc->lock); + if (!msc->probed) + err = mpam_msc_hw_probe(msc); + mutex_unlock(&msc->lock); + + if (!err) + new_device_probed = true; + else + break; // mpam_broken + } + mutex_unlock(&mpam_list_lock); + + if (new_device_probed && !err) + schedule_work(&mpam_enable_work); + + if (err < 0) + return err; + + return mpam_cpu_online(cpu); +} + +static int mpam_cpu_offline(unsigned int cpu) +{ + return 0; +} + +static void mpam_register_cpuhp_callbacks(int (*online)(unsigned int online)) +{ + mutex_lock(&mpam_cpuhp_state_lock); + mpam_cpuhp_state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mpam:online", + online, mpam_cpu_offline); + if (mpam_cpuhp_state <= 0) { + pr_err("Failed to register cpuhp callbacks"); + mpam_cpuhp_state = 0; + } + mutex_unlock(&mpam_cpuhp_state_lock); }
static int mpam_dt_count_msc(void) @@ -628,11 +730,51 @@ static int mpam_msc_drv_probe(struct platform_device *pdev) }
if (!err && fw_num_msc == mpam_num_msc) - mpam_discovery_complete(); + mpam_register_cpuhp_callbacks(&mpam_discovery_cpu_online);
return err; }
+static void mpam_enable_once(void) +{ + mutex_lock(&mpam_cpuhp_state_lock); + cpuhp_remove_state(mpam_cpuhp_state); + mpam_cpuhp_state = 0; + mutex_unlock(&mpam_cpuhp_state_lock); + + mpam_register_cpuhp_callbacks(mpam_cpu_online); + + pr_info("MPAM enabled\n"); +} + +/* + * Enable mpam once all devices have been probed. + * Scheduled by mpam_discovery_cpu_online() once all devices have been created. + * Also scheduled when new devices are probed when new CPUs come online. + */ +void mpam_enable(struct work_struct *work) +{ + static atomic_t once; + struct mpam_msc *msc; + bool all_devices_probed = true; + + /* Have we probed all the hw devices? */ + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + mutex_lock(&msc->lock); + if (!msc->probed) + all_devices_probed = false; + mutex_unlock(&msc->lock); + + if (!all_devices_probed) + break; + } + mutex_unlock(&mpam_list_lock); + + if (all_devices_probed && !atomic_fetch_inc(&once)) + mpam_enable_once(); +} + static int mpam_msc_drv_remove(struct platform_device *pdev) { struct mpam_msc *msc = platform_get_drvdata(pdev); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 5d339e4375e2..d5d567fe57ed 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -30,6 +30,7 @@ struct mpam_msc cpumask_t accessibility;
struct mutex lock; + bool probed; unsigned long ris_idxs[128 / BITS_PER_LONG]; u32 ris_max;
@@ -97,6 +98,8 @@ struct mpam_msc_ris extern struct list_head mpam_classes; extern struct srcu_struct mpam_srcu;
+/* Scheduled work callback to enable mpam once all MSC have been probed */ +void mpam_enable(struct work_struct *work);
/* * MPAM MSCs have the following register layout. See: @@ -196,7 +199,7 @@ extern struct srcu_struct mpam_srcu;
/* MPAMF_MBWUMON_IDR - MPAM memory bandwidth usage monitor ID register */ #define MPAMF_MBWUMON_IDR_NUM_MON GENMASK(15, 0) -#define MPAMF_MBWUMON_IDR_RWBW BIT(28) +#define MPAMF_MBWUMON_IDR_HAS_RWBW BIT(28) #define MPAMF_MBWUMON_IDR_LWD BIT(29) #define MPAMF_MBWUMON_IDR_HAS_LONG BIT(30) #define MPAMF_MBWUMON_IDR_HAS_CAPTURE BIT(31)
From: James Morse james.morse@arm.com
CPUs can generate traffic with a range of PARTID and PMG values, but each MSC may have its own maximum size for these fields. Before MPAM can be used, the driver needs to probe each RIS on each MSC, to find the system-wide smallest value that can be used.
While doing this, RIS entries that firmware didn't describe are create under MPAM_CLASS_UNKNOWN.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/kernel/mpam.c | 10 ++ drivers/platform/mpam/mpam_devices.c | 166 ++++++++++++++++++++++++-- drivers/platform/mpam/mpam_internal.h | 6 + include/linux/arm_mpam.h | 2 + 4 files changed, 177 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c index 894ca52549b8..02f43334f078 100644 --- a/arch/arm64/kernel/mpam.c +++ b/arch/arm64/kernel/mpam.c @@ -3,6 +3,7 @@
#include <asm/mpam.h>
+#include <linux/arm_mpam.h> #include <linux/jump_label.h> #include <linux/percpu.h>
@@ -11,3 +12,12 @@ DEFINE_STATIC_KEY_FALSE(mpam_enabled); DEFINE_PER_CPU(u64, arm64_mpam_default); DEFINE_PER_CPU(u64, arm64_mpam_current);
+static int __init arm64_mpam_register_cpus(void) +{ + u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); + u16 partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, mpamidr); + u8 pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, mpamidr); + + return mpam_register_requestor(partid_max, pmg_max); +} +arch_initcall(arm64_mpam_register_cpus) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 48811c90b78a..ec22c279cfe9 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -6,6 +6,7 @@ #include <linux/acpi.h> #include <linux/atomic.h> #include <linux/arm_mpam.h> +#include <linux/bitfield.h> #include <linux/cacheinfo.h> #include <linux/cpu.h> #include <linux/cpumask.h> @@ -46,6 +47,15 @@ static u32 mpam_num_msc; static int mpam_cpuhp_state; static DEFINE_MUTEX(mpam_cpuhp_state_lock);
+/* + * The smallest common values for any CPU or MSC in the system. + * Generating traffic outside this range will result in screaming interrupts. + */ +u16 mpam_partid_max; +u8 mpam_pmg_max; +static bool partid_max_init, partid_max_published; +static DEFINE_SPINLOCK(partid_max_lock); + /* * mpam is enabled once all devices have been probed from CPU online callbacks, * scheduled via this work_struct. If access to an MSC depends on a CPU that @@ -79,6 +89,14 @@ static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) return readl_relaxed(msc->mapped_hwpage + reg); }
+static void __mpam_write_reg(struct mpam_msc *msc, u16 reg, u32 val) +{ + WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + writel_relaxed(val, msc->mapped_hwpage + reg); +} + #define mpam_read_partsel_reg(msc, reg) \ ({ \ u32 ____ret; \ @@ -89,6 +107,59 @@ static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) ____ret; \ })
+#define mpam_write_partsel_reg(msc, reg, val) \ +({ \ + lockdep_assert_held_once(&msc->part_sel_lock); \ + __mpam_write_reg(msc, MPAMCFG_##reg, val); \ +}) + +static u64 mpam_msc_read_idr(struct mpam_msc *msc) +{ + u64 idr_high = 0, idr_low; + + lockdep_assert_held(&msc->part_sel_lock); + + idr_low = mpam_read_partsel_reg(msc, IDR); + if (FIELD_GET(MPAMF_IDR_HAS_EXT, idr_low)) + idr_high = mpam_read_partsel_reg(msc, IDR + 4); + + return (idr_high << 32) | idr_low; +} + +static void __mpam_part_sel(u8 ris_idx, u16 partid, struct mpam_msc *msc) +{ + u32 partsel; + + lockdep_assert_held(&msc->part_sel_lock); + + partsel = FIELD_PREP(MPAMCFG_PART_SEL_RIS, ris_idx) | + FIELD_PREP(MPAMCFG_PART_SEL_PARTID_SEL, partid); + mpam_write_partsel_reg(msc, PART_SEL, partsel); +} + +int mpam_register_requestor(u16 partid_max, u8 pmg_max) +{ + int err = 0; + + spin_lock(&partid_max_lock); + if (!partid_max_init) { + mpam_partid_max = partid_max; + mpam_pmg_max = pmg_max; + partid_max_init = true; + } else if (!partid_max_published) { + mpam_partid_max = min(mpam_partid_max, partid_max); + mpam_pmg_max = min(mpam_pmg_max, pmg_max); + } else { + /* New requestors can't lower the values */ + if ((partid_max < mpam_partid_max) || (pmg_max < mpam_pmg_max)) + err = -EBUSY; + } + spin_unlock(&partid_max_lock); + + return err; +} +EXPORT_SYMBOL(mpam_register_requestor); + static struct mpam_component * mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp) { @@ -402,6 +473,7 @@ static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx, cpumask_or(&comp->affinity, &comp->affinity, &ris->affinity); cpumask_or(&class->affinity, &class->affinity, &ris->affinity); list_add_rcu(&ris->comp_list, &comp->ris); + list_add_rcu(&ris->msc_list, &msc->ris);
return 0; } @@ -419,10 +491,37 @@ int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, return err; }
+static struct mpam_msc_ris *mpam_get_or_create_ris(struct mpam_msc *msc, + u8 ris_idx) +{ + int err; + struct mpam_msc_ris *ris, *found = ERR_PTR(-ENOENT); + + lockdep_assert_held(&mpam_list_lock); + + if (!test_bit(ris_idx, msc->ris_idxs)) { + err = mpam_ris_create_locked(msc, ris_idx, MPAM_CLASS_UNKNOWN, + 0, 0, GFP_ATOMIC); + if (err) + return ERR_PTR(err); + } + + list_for_each_entry(ris, &msc->ris, msc_list) { + if (ris->ris_idx == ris_idx) { + found = ris; + break; + } + } + + return found; +} + static int mpam_msc_hw_probe(struct mpam_msc *msc) { u64 idr; - int err; + u16 partid_max; + u8 ris_idx, pmg_max; + struct mpam_msc_ris *ris;
lockdep_assert_held(&msc->lock);
@@ -431,14 +530,43 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) if ((idr & MPAMF_AIDR_ARCH_MAJOR_REV) != MPAM_ARCHITECTURE_V1) { pr_err_once("%s does not match MPAM architecture v1.0\n", dev_name(&msc->pdev->dev)); - err = -EIO; - } else { - msc->probed = true; - err = 0; + spin_unlock(&msc->part_sel_lock); + return -EIO; } + + idr = mpam_msc_read_idr(msc); spin_unlock(&msc->part_sel_lock); + msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr); + + /* Use these values so partid/pmg always starts with a valid value */ + msc->partid_max = FIELD_GET(MPAMF_IDR_PARTID_MAX, idr); + msc->pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); + + for (ris_idx = 0; ris_idx <= msc->ris_max; ris_idx++) { + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris_idx, 0, msc); + idr = mpam_msc_read_idr(msc); + spin_unlock(&msc->part_sel_lock); + + partid_max = FIELD_GET(MPAMF_IDR_PARTID_MAX, idr); + pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); + msc->partid_max = min(msc->partid_max, partid_max); + msc->pmg_max = min(msc->pmg_max, pmg_max); + + ris = mpam_get_or_create_ris(msc, ris_idx); + if (IS_ERR(ris)) { + return PTR_ERR(ris); + } + }
- return err; + spin_lock(&partid_max_lock); + mpam_partid_max = min(mpam_partid_max, msc->partid_max); + mpam_pmg_max = min(mpam_pmg_max, msc->pmg_max); + spin_unlock(&partid_max_lock); + + msc->probed = true; + + return 0; }
static int mpam_cpu_online(unsigned int cpu) @@ -742,9 +870,18 @@ static void mpam_enable_once(void) mpam_cpuhp_state = 0; mutex_unlock(&mpam_cpuhp_state_lock);
+ /* + * Once the cpuhp callbacks have been changed, mpam_partid_max can no + * longer change. + */ + spin_lock(&partid_max_lock); + partid_max_published = true; + spin_unlock(&partid_max_lock); + mpam_register_cpuhp_callbacks(mpam_cpu_online);
- pr_info("MPAM enabled\n"); + pr_info("MPAM enabled with %u partid and %u pmg\n", + mpam_partid_max + 1, mpam_pmg_max + 1); }
/* @@ -828,11 +965,25 @@ static void mpam_dt_create_foundling_msc(void)
static int __init mpam_msc_driver_init(void) { + bool mpam_not_available = false; + if (!mpam_cpus_have_feature()) return -EOPNOTSUPP;
init_srcu_struct(&mpam_srcu);
+ /* + * If the MPAM CPU interface is not implemented, or reserved by + * firmware, there is no point touching the rest of the hardware. + */ + spin_lock(&partid_max_lock); + if (!partid_max_init || (!mpam_partid_max && !mpam_pmg_max)) + mpam_not_available = true; + spin_unlock(&partid_max_lock); + + if (mpam_not_available) + return 0; + if (!acpi_disabled) fw_num_msc = acpi_mpam_count_msc(); else @@ -848,4 +999,5 @@ static int __init mpam_msc_driver_init(void)
return platform_driver_register(&mpam_msc_driver); } +/* Must occur after arm64_mpam_register_cpus() from arch_initcall() */ subsys_initcall(mpam_msc_driver_init); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index d5d567fe57ed..a7de4a69b9f8 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -31,6 +31,8 @@ struct mpam_msc
struct mutex lock; bool probed; + u16 partid_max; + u8 pmg_max; unsigned long ris_idxs[128 / BITS_PER_LONG]; u32 ris_max;
@@ -98,6 +100,10 @@ struct mpam_msc_ris extern struct list_head mpam_classes; extern struct srcu_struct mpam_srcu;
+/* System wide partid/pmg values */ +extern u16 mpam_partid_max; +extern u8 mpam_pmg_max; + /* Scheduled work callback to enable mpam once all MSC have been probed */ void mpam_enable(struct work_struct *work);
diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 950ea7049d53..40e09b4d236b 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -34,6 +34,8 @@ static inline int acpi_mpam_parse_resources(struct mpam_msc *msc, static inline int acpi_mpam_count_msc(void) { return -EINVAL; } #endif
+int mpam_register_requestor(u16 partid_max, u8 pmg_max); + int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, enum mpam_class_types type, u8 class_id, int component_id);
From: James Morse james.morse@arm.com
Expand the probing support with the control and monitor types we can use with resctrl.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 79 ++++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 50 +++++++++++++++++ 2 files changed, 127 insertions(+), 2 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index ec22c279cfe9..7eeeb44506b1 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -83,7 +83,7 @@ LIST_HEAD(mpam_classes);
static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) { - WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(reg + sizeof(u32) > msc->mapped_hwpage_sz); WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility));
return readl_relaxed(msc->mapped_hwpage + reg); @@ -91,7 +91,7 @@ static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg)
static void __mpam_write_reg(struct mpam_msc *msc, u16 reg, u32 val) { - WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(reg + sizeof(u32) > msc->mapped_hwpage_sz); WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility));
writel_relaxed(val, msc->mapped_hwpage + reg); @@ -516,6 +516,74 @@ static struct mpam_msc_ris *mpam_get_or_create_ris(struct mpam_msc *msc, return found; }
+static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) +{ + int err; + struct mpam_msc *msc = ris->msc; + struct mpam_props *props = &ris->props; + + lockdep_assert_held(&msc->lock); + lockdep_assert_held(&msc->part_sel_lock); + + /* Cache Portion partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_CPOR_PART, ris->idr)) { + u32 cpor_features = mpam_read_partsel_reg(msc, CPOR_IDR); + + props->cpbm_wd = FIELD_GET(MPAMF_CPOR_IDR_CPBM_WD, cpor_features); + if (props->cpbm_wd) + mpam_set_feature(mpam_feat_cpor_part, props); + } + + /* Memory bandwidth partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_MBW_PART, ris->idr)) { + u32 mbw_features = mpam_read_partsel_reg(msc, MBW_IDR); + + /* portion bitmap resolution */ + props->mbw_pbm_bits = FIELD_GET(MPAMF_MBW_IDR_BWPBM_WD, mbw_features); + if (props->mbw_pbm_bits && + FIELD_GET(MPAMF_MBW_IDR_HAS_PBM, mbw_features)) + mpam_set_feature(mpam_feat_mbw_part, props); + + props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) + mpam_set_feature(mpam_feat_mbw_max, props); + } + + /* Performance Monitoring */ + if (FIELD_GET(MPAMF_IDR_HAS_MSMON, ris->idr)) { + u32 msmon_features = mpam_read_partsel_reg(msc, MSMON_IDR); + + if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_CSU, msmon_features)) { + u32 csumonidr, discard; + + /* + * If the firmware max-nrdy-us property is missing, the + * CSU counters can't be used. Should we wait forever? + */ + err = device_property_read_u32(&msc->pdev->dev, + "arm,not-ready-us", + &discard); + + csumonidr = mpam_read_partsel_reg(msc, CSUMON_IDR); + props->num_csu_mon = FIELD_GET(MPAMF_CSUMON_IDR_NUM_MON, csumonidr); + if (props->num_csu_mon && !err) + mpam_set_feature(mpam_feat_msmon_csu, props); + else if (props->num_csu_mon) + pr_err_once("Counters are not usable because not-ready timeout was not provided by firmware."); + } + if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_MBWU, msmon_features)) { + u32 mbwumonidr = mpam_read_partsel_reg(msc, MBWUMON_IDR); + + props->num_mbwu_mon = FIELD_GET(MPAMF_MBWUMON_IDR_NUM_MON, mbwumonidr); + if (props->num_mbwu_mon) + mpam_set_feature(mpam_feat_msmon_mbwu, props); + + if (FIELD_GET(MPAMF_MBWUMON_IDR_HAS_RWBW, mbwumonidr)) + mpam_set_feature(mpam_feat_msmon_mbwu_rwbw, props); + } + } +} + static int mpam_msc_hw_probe(struct mpam_msc *msc) { u64 idr; @@ -536,6 +604,7 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc)
idr = mpam_msc_read_idr(msc); spin_unlock(&msc->part_sel_lock); + msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr);
/* Use these values so partid/pmg always starts with a valid value */ @@ -557,6 +626,12 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) if (IS_ERR(ris)) { return PTR_ERR(ris); } + ris->idr = idr; + + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris_idx, 0, msc); + mpam_ris_hw_probe(ris); + spin_unlock(&msc->part_sel_lock); }
spin_lock(&partid_max_lock); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index a7de4a69b9f8..71e62594876d 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -49,6 +49,54 @@ struct mpam_msc size_t mapped_hwpage_sz; };
+/* + * When we compact the supported features, we don't care what they are. + * Storing them as a bitmap makes life easy. + */ +typedef u16 mpam_features_t; + +/* Bits for mpam_features_t */ +enum mpam_device_features { + mpam_feat_ccap_part = 0, + mpam_feat_cpor_part, + mpam_feat_mbw_part, + mpam_feat_mbw_min, + mpam_feat_mbw_max, + mpam_feat_mbw_prop, + mpam_feat_msmon, + mpam_feat_msmon_csu, + mpam_feat_msmon_csu_capture, + mpam_feat_msmon_mbwu, + mpam_feat_msmon_mbwu_capture, + mpam_feat_msmon_mbwu_rwbw, + mpam_feat_msmon_capt, + MPAM_FEATURE_LAST, +}; +#define MPAM_ALL_FEATURES ((1<<MPAM_FEATURE_LAST) - 1) + +struct mpam_props +{ + mpam_features_t features; + + u16 cpbm_wd; + u16 mbw_pbm_bits; + u8 bwa_wd; + u16 num_csu_mon; + u16 num_mbwu_mon; +}; + +static inline bool mpam_has_feature(enum mpam_device_features feat, + struct mpam_props *props) +{ + return (1<<feat) & props->features; +} + +static inline void mpam_set_feature(enum mpam_device_features feat, + struct mpam_props *props) +{ + props->features |= (1<<feat); +} + struct mpam_class { /* mpam_components in this class */ @@ -82,6 +130,8 @@ struct mpam_component struct mpam_msc_ris { u8 ris_idx; + u64 idr; + struct mpam_props props;
cpumask_t affinity;
From: James Morse james.morse@arm.com
To make a decision about whether to expose an mpam class as a resctrl resource we need to know its overall supported features and properties.
Once we've probed all the resources, we can walk the tree and produced overall values by merging the bitmaps. This eliminates features that are only supported by some MSC that make up a component or class.
If bitmap properties are mismatched within a component we cannot support the mismatched feature.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 87 +++++++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 8 +++ 2 files changed, 95 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 7eeeb44506b1..f1574f393e1d 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -938,8 +938,95 @@ static int mpam_msc_drv_probe(struct platform_device *pdev) return err; }
+/* + * If a resource doesn't match class feature/configuration, do the right thing. + * For 'num' properties we can just take the minimum. + * For properties where the mismatched unused bits would make a difference, we + * nobble the class feature, as we can't configure all the resources. + * e.g. The L3 cache is composed of two resources with 13 and 17 portion + * bitmaps respectively. + */ +static void +__resource_props_mismatch(struct mpam_msc_ris *ris, struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + struct mpam_props *rprops = &ris->props; + + lockdep_assert_held(&mpam_list_lock); /* we modify class */ + + /* Clear missing features */ + cprops->features &= rprops->features; + + /* Clear incompatible features */ + if (cprops->cpbm_wd != rprops->cpbm_wd) + mpam_clear_feature(mpam_feat_cpor_part, &cprops->features); + if (cprops->mbw_pbm_bits != rprops->mbw_pbm_bits) + mpam_clear_feature(mpam_feat_mbw_part, &cprops->features); + + /* bwa_wd is a count of bits, fewer bits means less precision */ + if (cprops->bwa_wd != rprops->bwa_wd) + cprops->bwa_wd = min(cprops->bwa_wd, rprops->bwa_wd); + + /* For num properties, take the minimum */ + if (cprops->num_csu_mon != rprops->num_csu_mon) + cprops->num_csu_mon = min(cprops->num_csu_mon, rprops->num_csu_mon); + if (cprops->num_mbwu_mon != rprops->num_mbwu_mon) + cprops->num_mbwu_mon = min(cprops->num_mbwu_mon, rprops->num_mbwu_mon); +} + +/* + * Copy the first component's first resources's properties and features to the + * class. __resource_props_mismatch() will remove conflicts. + * It is not possible to have a class with no components, or a component with + * no resources. + */ +static void mpam_enable_init_class_features(struct mpam_class *class) +{ + struct mpam_msc_ris *ris; + struct mpam_component *comp; + + comp = list_first_entry_or_null(&class->components, + struct mpam_component, class_list); + if (WARN_ON(!comp)) + return; + + ris = list_first_entry_or_null(&comp->ris, + struct mpam_msc_ris, comp_list); + if (WARN_ON(!ris)) + return; + + class->props = ris->props; +} + +/* Merge all the common resource features into class. */ +static void mpam_enable_merge_features(void) +{ + struct mpam_msc_ris *ris; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + mpam_enable_init_class_features(class); + + list_for_each_entry(comp, &class->components, class_list) { + list_for_each_entry(ris, &comp->ris, comp_list) { + __resource_props_mismatch(ris, class); + + class->nrdy_usec = max(class->nrdy_usec, + ris->msc->nrdy_usec); + } + } + } +} + static void mpam_enable_once(void) { + mutex_lock(&mpam_list_lock); + mpam_enable_merge_features(); + mutex_unlock(&mpam_list_lock); + mutex_lock(&mpam_cpuhp_state_lock); cpuhp_remove_state(mpam_cpuhp_state); mpam_cpuhp_state = 0; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 71e62594876d..db50f5e40d98 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -97,6 +97,12 @@ static inline void mpam_set_feature(enum mpam_device_features feat, props->features |= (1<<feat); }
+static inline void mpam_clear_feature(enum mpam_device_features feat, + mpam_features_t *supported) +{ + *supported &= ~(1<<feat); +} + struct mpam_class { /* mpam_components in this class */ @@ -104,6 +110,8 @@ struct mpam_class
cpumask_t affinity;
+ struct mpam_props props; + u32 nrdy_usec; u8 level; enum mpam_class_types type;
From: James Morse james.morse@arm.com
When a CPU comes online, it may bring a newly accessible MSC with it. Only the default partid has its value reset by hardware, and even then the MSC might not have been reset since its config was previously dirtyied. e.g. Kexec.
Any in-use partid must have its configuration restored, or reset. In-use partids may be held in caches and evicted later.
MSC are also reset when CPUs are taken offline to cover cases where firmware doesn't reset the MSC over reboot using UEFI, or kexec where there is no firmware involvement.
If the configuration for a RIS has not been touched since it was brought online, it does not need resetting again.
To reset, write the maximum values for all discovered controls.
CC: Rohit Mathew Rohit.Mathew@arm.com Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 126 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 5 +- 2 files changed, 129 insertions(+), 2 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index f1574f393e1d..e281e6e910e0 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -7,6 +7,7 @@ #include <linux/atomic.h> #include <linux/arm_mpam.h> #include <linux/bitfield.h> +#include <linux/bitmap.h> #include <linux/cacheinfo.h> #include <linux/cpu.h> #include <linux/cpumask.h> @@ -644,8 +645,115 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) return 0; }
+static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) +{ + u32 num_words, msb; + u32 bm = ~0; + int i; + + lockdep_assert_held(&msc->part_sel_lock); + + /* + * Write all ~0 to all but the last 32bit-word, which may + * have fewer bits... + */ + num_words = DIV_ROUND_UP(wd, 32); + for (i = 0; i < num_words - 1; i++, reg += sizeof(bm)) + __mpam_write_reg(msc, reg, bm); + + /* + * ....and then the last (maybe) partial 32bit word. When wd is a + * multiple of 32, msb should be 31 to write a full 32bit word. + */ + msb = (wd - 1) % 32; + bm = GENMASK(msb , 0); + if (bm) + __mpam_write_reg(msc, reg, bm); +} + +static void mpam_reset_ris_partid(struct mpam_msc_ris *ris, u16 partid) +{ + struct mpam_msc *msc = ris->msc; + u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; + struct mpam_props *rprops = &ris->props; + + lockdep_assert_held(&msc->lock); + + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris->ris_idx, partid, msc); + + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) + mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); + + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) + mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, rprops->mbw_pbm_bits); + + if (mpam_has_feature(mpam_feat_mbw_min, rprops)) + mpam_write_partsel_reg(msc, MBW_MIN, 0); + + if (mpam_has_feature(mpam_feat_mbw_max, rprops)) + mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); + + if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) + mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); + spin_unlock(&msc->part_sel_lock); +} + +static void mpam_reset_ris(struct mpam_msc_ris *ris) +{ + u16 partid, partid_max; + struct mpam_msc *msc = ris->msc; + + lockdep_assert_held(&msc->lock); + + if (ris->in_reset_state) + return; + + spin_lock(&partid_max_lock); + partid_max = mpam_partid_max; + spin_unlock(&partid_max_lock); + for (partid = 0; partid < partid_max; partid++) + mpam_reset_ris_partid(ris, partid); +} + +static void mpam_reset_msc(struct mpam_msc *msc, bool online) +{ + int idx; + struct mpam_msc_ris *ris; + + lockdep_assert_held(&msc->lock); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &msc->ris, msc_list) { + mpam_reset_ris(ris); + + /* + * Set in_reset_state when coming online. The reset state + * for non-zero partid may be lost while the CPUs are offline. + */ + ris->in_reset_state = online; + } + srcu_read_unlock(&mpam_srcu, idx); +} + static int mpam_cpu_online(unsigned int cpu) { + int idx; + struct mpam_msc *msc; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + if (atomic_fetch_inc(&msc->online_refs) == 0) { + mutex_lock(&msc->lock); + mpam_reset_msc(msc, true); + mutex_unlock(&msc->lock); + } + } + srcu_read_unlock(&mpam_srcu, idx); + return 0; }
@@ -684,6 +792,22 @@ static int mpam_discovery_cpu_online(unsigned int cpu)
static int mpam_cpu_offline(unsigned int cpu) { + int idx; + struct mpam_msc *msc; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + if (atomic_dec_and_test(&msc->online_refs)) { + mutex_lock(&msc->lock); + mpam_reset_msc(msc, false); + mutex_unlock(&msc->lock); + } + } + srcu_read_unlock(&mpam_srcu, idx); + return 0; }
@@ -1043,7 +1167,7 @@ static void mpam_enable_once(void) mpam_register_cpuhp_callbacks(mpam_cpu_online);
pr_info("MPAM enabled with %u partid and %u pmg\n", - mpam_partid_max + 1, mpam_pmg_max + 1); + READ_ONCE(mpam_partid_max) + 1, mpam_pmg_max + 1); }
/* diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index db50f5e40d98..dd1c017c6e08 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -5,6 +5,7 @@ #define MPAM_INTERNAL_H
#include <linux/arm_mpam.h> +#include <linux/atomic.h> #include <linux/cpumask.h> #include <linux/io.h> #include <linux/mailbox_client.h> @@ -28,7 +29,8 @@ struct mpam_msc struct pcc_mbox_chan *pcc_chan; u32 nrdy_usec; cpumask_t accessibility; - + atomic_t online_refs; + struct mutex lock; bool probed; u16 partid_max; @@ -140,6 +142,7 @@ struct mpam_msc_ris u8 ris_idx; u64 idr; struct mpam_props props; + bool in_reset_state;
cpumask_t affinity;
From: James Morse james.morse@arm.com
Resetting RIS entries from the cpuhp callback is easy as the callback occurs on the correct CPU. This won't be true for any other caller that wants to reset or configure an MSC.
Add a helper that schedules the provided function if necessary. Prevent the cpuhp callbacks from changing the MSC state by taking the cpuhp lock.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 40 +++++++++++++++++++++++----- 1 file changed, 34 insertions(+), 6 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index e281e6e910e0..9156591fc4b9 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -699,21 +699,49 @@ static void mpam_reset_ris_partid(struct mpam_msc_ris *ris, u16 partid) spin_unlock(&msc->part_sel_lock); }
-static void mpam_reset_ris(struct mpam_msc_ris *ris) +/* + * Called via smp_call_on_cpu() to prevent migration, while still being + * pre-emptible. + */ +static int mpam_reset_ris(void *arg) { u16 partid, partid_max; - struct mpam_msc *msc = ris->msc; - - lockdep_assert_held(&msc->lock); + struct mpam_msc_ris *ris = arg;
if (ris->in_reset_state) - return; + return 0;
spin_lock(&partid_max_lock); partid_max = mpam_partid_max; spin_unlock(&partid_max_lock); for (partid = 0; partid < partid_max; partid++) mpam_reset_ris_partid(ris, partid); + + return 0; +} + +/* + * Get the preferred CPU for this MSC. If it is accessible from this CPU, + * this CPU is preferred. This can be preempted/migrated, it will only result + * in more work. + */ +static int mpam_get_msc_preferred_cpu(struct mpam_msc *msc) +{ + int cpu = raw_smp_processor_id(); + + if (cpumask_test_cpu(cpu, &msc->accessibility)) + return cpu; + + return cpumask_first_and(&msc->accessibility, cpu_online_mask); +} + +static int mpam_touch_msc(struct mpam_msc *msc, int (*fn)(void *a), void *arg) +{ + lockdep_assert_irqs_enabled(); + lockdep_assert_cpus_held(); + lockdep_assert_held(&msc->lock); + + return smp_call_on_cpu(mpam_get_msc_preferred_cpu(msc), fn, arg, true); }
static void mpam_reset_msc(struct mpam_msc *msc, bool online) @@ -725,7 +753,7 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online)
idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(ris, &msc->ris, msc_list) { - mpam_reset_ris(ris); + mpam_touch_msc(msc, &mpam_reset_ris, ris);
/* * Set in_reset_state when coming online. The reset state
From: James Morse james.morse@arm.com
cpuhp callbacks aren't the only time the MSC configuration may need to be reset. Resctrl has an API call to reset a class. If an MPAM error interrupt arrives it indicates the driver has misprogrammed an MSC. The safest thing to do is reset all the MSCs and disable MPAM.
Add a helper to reset RIS via their class. Call this from mpam_disable(), which can be scheduled from the error interrupt handler.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 34 ++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 9156591fc4b9..6091e87308ba 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1198,6 +1198,40 @@ static void mpam_enable_once(void) READ_ONCE(mpam_partid_max) + 1, mpam_pmg_max + 1); }
+static void mpam_reset_class(struct mpam_class *class) +{ + int idx; + struct mpam_msc_ris *ris; + struct mpam_component *comp; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(comp, &class->components, class_list) { + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + mutex_lock(&ris->msc->lock); + mpam_touch_msc(ris->msc, mpam_reset_ris, ris); + mutex_unlock(&ris->msc->lock); + ris->in_reset_state = true; + } + } + srcu_read_unlock(&mpam_srcu, idx); +} + +/* + * Called in response to an error IRQ. + * All of MPAMs errors indicate a software bug, restore any modified + * controls to their reset values. + */ +void mpam_disable(void) +{ + int idx; + struct mpam_class *class; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) + mpam_reset_class(class); + srcu_read_unlock(&mpam_srcu, idx); +} + /* * Enable mpam once all devices have been probed. * Scheduled by mpam_discovery_cpu_online() once all devices have been created.
From: James Morse james.morse@arm.com
Register and enable error IRQs. All the MPAM error interrupts indicate a software bug, e.g. out of range partid. If the error interrupt is ever signalled, attempt to disable MPAM.
Only the irq handler accesses the ESR register, so no locking is needed. The work to disable MPAM after an error needs to happen at process context, use a threaded interrupt.
There is no support for percpu threaded interrupts, for now schedule the work to be done from the irq handler.
Enabling the IRQs in the MSC may involve cross calling to a CPU that can access the MSC.
CC: Rohit Mathew rohit.mathew@arm.com Tested-by: Rohit Mathew rohit.mathew@arm.com Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 311 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 8 + 2 files changed, 309 insertions(+), 10 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 6091e87308ba..ac82c041999d 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -14,6 +14,9 @@ #include <linux/device.h> #include <linux/errno.h> #include <linux/gfp.h> +#include <linux/interrupt.h> +#include <linux/irq.h> +#include <linux/irqdesc.h> #include <linux/list.h> #include <linux/lockdep.h> #include <linux/mutex.h> @@ -64,6 +67,12 @@ static DEFINE_SPINLOCK(partid_max_lock); */ static DECLARE_WORK(mpam_enable_work, &mpam_enable);
+/* + * All mpam error interrupts indicate a software bug. On receipt, disable the + * driver. + */ +static DECLARE_WORK(mpam_broken_work, &mpam_disable); + /* * An MSC is a container for resources, each identified by their RIS index. * Components are a group of RIS that control the same thing. @@ -127,6 +136,24 @@ static u64 mpam_msc_read_idr(struct mpam_msc *msc) return (idr_high << 32) | idr_low; }
+static void mpam_msc_zero_esr(struct mpam_msc *msc) +{ + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ESR); + if (msc->has_extd_esr) + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ESR + 4); +} + +static u64 mpam_msc_read_esr(struct mpam_msc *msc) +{ + u64 esr_high = 0, esr_low; + + esr_low = readl_relaxed(msc->mapped_hwpage + MPAMF_ESR); + if (msc->has_extd_esr) + esr_high = readl_relaxed(msc->mapped_hwpage + MPAMF_ESR + 4); + + return (esr_high << 32) | esr_low; +} + static void __mpam_part_sel(u8 ris_idx, u16 partid, struct mpam_msc *msc) { u32 partsel; @@ -622,6 +649,7 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); msc->partid_max = min(msc->partid_max, partid_max); msc->pmg_max = min(msc->pmg_max, pmg_max); + msc->has_extd_esr = FIELD_GET(MPAMF_IDR_HAS_EXT_ESR, idr);
ris = mpam_get_or_create_ris(msc, ris_idx); if (IS_ERR(ris)) { @@ -764,6 +792,12 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online) srcu_read_unlock(&mpam_srcu, idx); }
+static void _enable_percpu_irq(void *_irq) +{ + int *irq = _irq; + enable_percpu_irq(*irq, IRQ_TYPE_NONE); +} + static int mpam_cpu_online(unsigned int cpu) { int idx; @@ -774,11 +808,13 @@ static int mpam_cpu_online(unsigned int cpu) if (!cpumask_test_cpu(cpu, &msc->accessibility)) continue;
- if (atomic_fetch_inc(&msc->online_refs) == 0) { - mutex_lock(&msc->lock); + mutex_lock(&msc->lock); + if (msc->reenable_error_ppi) + _enable_percpu_irq(&msc->reenable_error_ppi); + + if (atomic_fetch_inc(&msc->online_refs) == 0) mpam_reset_msc(msc, true); - mutex_unlock(&msc->lock); - } + mutex_unlock(&msc->lock); } srcu_read_unlock(&mpam_srcu, idx);
@@ -828,11 +864,13 @@ static int mpam_cpu_offline(unsigned int cpu) if (!cpumask_test_cpu(cpu, &msc->accessibility)) continue;
- if (atomic_dec_and_test(&msc->online_refs)) { - mutex_lock(&msc->lock); + mutex_lock(&msc->lock); + if (msc->reenable_error_ppi) + disable_percpu_irq(msc->reenable_error_ppi); + + if (atomic_dec_and_test(&msc->online_refs)) mpam_reset_msc(msc, false); - mutex_unlock(&msc->lock); - } + mutex_unlock(&msc->lock); } srcu_read_unlock(&mpam_srcu, idx);
@@ -851,6 +889,50 @@ static void mpam_register_cpuhp_callbacks(int (*online)(unsigned int online)) mutex_unlock(&mpam_cpuhp_state_lock); }
+static int __setup_ppi(struct mpam_msc *msc) +{ + int cpu; + + msc->error_dev_id = alloc_percpu_gfp(struct mpam_msc *, GFP_KERNEL); + if (!msc->error_dev_id) + return -ENOMEM; + + for_each_cpu(cpu, &msc->accessibility) { + struct mpam_msc *empty = *per_cpu_ptr(msc->error_dev_id, cpu); + if (empty != NULL) { + pr_err_once("%s shares PPI with %s!\n", dev_name(&msc->pdev->dev), + dev_name(&empty->pdev->dev)); + return -EBUSY; + } + *per_cpu_ptr(msc->error_dev_id, cpu) = msc; + } + + return 0; +} + +static int mpam_msc_setup_error_irq(struct mpam_msc *msc) +{ + int irq; + + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + return 0; + + /* Allocate and initialise the percpu device pointer for PPI */ + if (irq_is_percpu(irq)) + + return __setup_ppi(msc); + + /* sanity check: shared interrupts can be routed anywhere? */ + if (!cpumask_equal(&msc->accessibility, cpu_possible_mask)) { + pr_err_once("msc:%u is a private resource with a shared error interrupt", + msc->id); + return -EINVAL; + } + + return 0; +} + static int mpam_dt_count_msc(void) { int count = 0; @@ -1021,6 +1103,13 @@ static int mpam_msc_drv_probe(struct platform_device *pdev) spin_lock_init(&msc->part_sel_lock); spin_lock_init(&msc->mon_sel_lock);
+ err = mpam_msc_setup_error_irq(msc); + if (err) { + devm_kfree(&pdev->dev, msc); + msc = ERR_PTR(err); + break; + } + if (device_property_read_u32(&pdev->dev, "pcc-channel", &msc->pcc_subspace_id)) msc->iface = MPAM_IFACE_MMIO; @@ -1173,11 +1262,198 @@ static void mpam_enable_merge_features(void) } }
+static char *mpam_errcode_names[16] = { + [0] = "No error", + [1] = "PARTID_SEL_Range", + [2] = "Req_PARTID_Range", + [3] = "MSMONCFG_ID_RANGE", + [4] = "Req_PMG_Range", + [5] = "Monitor_Range", + [6] = "intPARTID_Range", + [7] = "Unexpected_INTERNAL", + [8] = "Undefined_RIS_PART_SEL", + [9] = "RIS_No_Control", + [10] = "Undefined_RIS_MON_SEL", + [11] = "RIS_No_Monitor", + [12 ... 15] = "Reserved" +}; + +static int mpam_enable_msc_ecr(void *_msc) +{ + struct mpam_msc *msc = _msc; + + writel_relaxed(1, msc->mapped_hwpage + MPAMF_ECR); + + return 0; +} + +static int mpam_disable_msc_ecr(void *_msc) +{ + struct mpam_msc *msc = _msc; + + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ECR); + + return 0; +} + +static irqreturn_t __mpam_irq_handler(int irq, struct mpam_msc *msc) +{ + u64 reg; + u16 partid; + u8 errcode, pmg, ris; + + if (WARN_ON_ONCE(!msc) || + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), + &msc->accessibility))) + return IRQ_NONE; + + reg = mpam_msc_read_esr(msc); + + errcode = FIELD_GET(MPAMF_ESR_ERRCODE, reg); + if (!errcode) + return IRQ_NONE; + + /* Clear level triggered irq */ + mpam_msc_zero_esr(msc); + + partid = FIELD_GET(MPAMF_ESR_PARTID_OR_MON, reg); + pmg = FIELD_GET(MPAMF_ESR_PMG, reg); + ris = FIELD_GET(MPAMF_ESR_PMG, reg); + + pr_err("error irq from msc:%u '%s', partid:%u, pmg: %u, ris: %u\n", + msc->id, mpam_errcode_names[errcode], partid, pmg, ris); + + if (irq_is_percpu(irq)) { + mpam_disable_msc_ecr(msc); + schedule_work(&mpam_broken_work); + return IRQ_HANDLED; + } + + return IRQ_WAKE_THREAD; +} + +static irqreturn_t mpam_ppi_handler(int irq, void *dev_id) +{ + struct mpam_msc *msc = *(struct mpam_msc **)dev_id; + + return __mpam_irq_handler(irq, msc); +} + +static irqreturn_t mpam_spi_handler(int irq, void *dev_id) +{ + struct mpam_msc *msc = dev_id; + + return __mpam_irq_handler(irq, msc); +} + +static irqreturn_t mpam_disable_thread(int irq, void *dev_id); + +static int mpam_register_irqs(void) +{ + int err, irq; + struct mpam_msc *msc; + + lockdep_assert_cpus_held(); + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + continue; + + /* The MPAM spec says the interrupt can be SPI, PPI or LPI */ + /* We anticipate sharing the interrupt with other MSCs */ + if (irq_is_percpu(irq)) { + err = request_percpu_irq(irq, &mpam_ppi_handler, + "mpam:msc:error", + msc->error_dev_id); + if (err) + return err; + + mutex_lock(&msc->lock); + msc->reenable_error_ppi = irq; + smp_call_function_many(&msc->accessibility, + &_enable_percpu_irq, &irq, + true); + mutex_unlock(&msc->lock); + } else { + err = devm_request_threaded_irq(&msc->pdev->dev, irq, + &mpam_spi_handler, + &mpam_disable_thread, + IRQF_SHARED, + "mpam:msc:error", msc); + if (err) + return err; + } + + mutex_lock(&msc->lock); + msc->error_irq_requested = true; + mpam_touch_msc(msc, mpam_enable_msc_ecr, msc); + msc->error_irq_hw_enabled = true; + mutex_unlock(&msc->lock); + } + + return 0; +} + +static void mpam_unregister_irqs(void) +{ + int irq; + struct mpam_msc *msc; + + cpus_read_lock(); + /* take the lock as free_irq() can sleep */ + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + continue; + + mutex_lock(&msc->lock); + if (msc->error_irq_hw_enabled) { + mpam_touch_msc(msc, mpam_disable_msc_ecr, msc); + msc->error_irq_hw_enabled = false; + } + + if (msc->error_irq_requested) { + if (irq_is_percpu(irq)) { + msc->reenable_error_ppi = 0; + free_percpu_irq(irq, msc->error_dev_id); + } else { + devm_free_irq(&msc->pdev->dev, irq, msc); + } + msc->error_irq_requested = false; + } + mutex_unlock(&msc->lock); + } + mutex_unlock(&mpam_list_lock); + cpus_read_unlock(); +} + static void mpam_enable_once(void) { + int err; + + /* + * If all the MSC have been probed, enabling the IRQs happens next. + * That involves cross-calling to a CPU that can reach the MSC, and + * the locks must be taken in this order: + */ + cpus_read_lock(); mutex_lock(&mpam_list_lock); mpam_enable_merge_features(); + + err = mpam_register_irqs(); + if (err) + pr_warn("Failed to register irqs: %d\n", err); + mutex_unlock(&mpam_list_lock); + cpus_read_unlock(); + + if (err) { + schedule_work(&mpam_broken_work); + return; + }
mutex_lock(&mpam_cpuhp_state_lock); cpuhp_remove_state(mpam_cpuhp_state); @@ -1221,15 +1497,31 @@ static void mpam_reset_class(struct mpam_class *class) * All of MPAMs errors indicate a software bug, restore any modified * controls to their reset values. */ -void mpam_disable(void) +static irqreturn_t mpam_disable_thread(int irq, void *dev_id) { int idx; struct mpam_class *class;
+ mutex_lock(&mpam_cpuhp_state_lock); + if (mpam_cpuhp_state) { + cpuhp_remove_state(mpam_cpuhp_state); + mpam_cpuhp_state = 0; + } + mutex_unlock(&mpam_cpuhp_state_lock); + + mpam_unregister_irqs(); + idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(class, &mpam_classes, classes_list) mpam_reset_class(class); srcu_read_unlock(&mpam_srcu, idx); + + return IRQ_HANDLED; +} + +void mpam_disable(struct work_struct *ignored) +{ + mpam_disable_thread(0, NULL); }
/* @@ -1243,7 +1535,6 @@ void mpam_enable(struct work_struct *work) struct mpam_msc *msc; bool all_devices_probed = true;
- /* Have we probed all the hw devices? */ mutex_lock(&mpam_list_lock); list_for_each_entry(msc, &mpam_all_msc, glbl_list) { mutex_lock(&msc->lock); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index dd1c017c6e08..e81aed7e6be6 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -29,10 +29,17 @@ struct mpam_msc struct pcc_mbox_chan *pcc_chan; u32 nrdy_usec; cpumask_t accessibility; + bool has_extd_esr; + + int reenable_error_ppi; + struct mpam_msc * __percpu *error_dev_id; + atomic_t online_refs; struct mutex lock; bool probed; + bool error_irq_requested; + bool error_irq_hw_enabled; u16 partid_max; u8 pmg_max; unsigned long ris_idxs[128 / BITS_PER_LONG]; @@ -167,6 +174,7 @@ extern u8 mpam_pmg_max;
/* Scheduled work callback to enable mpam once all MSC have been probed */ void mpam_enable(struct work_struct *work); +void mpam_disable(struct work_struct *work);
/* * MPAM MSCs have the following register layout. See:
From: James Morse james.morse@arm.com
Once all the MSC have been probed, the system wide usable number of PARTID is known and the configuration arrays can be allocated.
After this point, checking all the MSC have been probed is pointless, and the cpuhp callbacks should restore the configuration, instead of just resetting the MSC.
Enable the architecture's static key that indicates whether mpam is enabled and use this to skip the discovery work on cpu hotplug.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 6 ++++++ drivers/platform/mpam/mpam_internal.h | 8 ++++++++ 2 files changed, 14 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index ac82c041999d..04e1dfd1c5e3 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -828,6 +828,9 @@ static int mpam_discovery_cpu_online(unsigned int cpu) struct mpam_msc *msc; bool new_device_probed = false;
+ if (mpam_is_enabled()) + return 0; + mutex_lock(&mpam_list_lock); list_for_each_entry(msc, &mpam_all_msc, glbl_list) { if (!cpumask_test_cpu(cpu, &msc->accessibility)) @@ -1468,6 +1471,7 @@ static void mpam_enable_once(void) partid_max_published = true; spin_unlock(&partid_max_lock);
+ static_branch_enable(&mpam_enabled); mpam_register_cpuhp_callbacks(mpam_cpu_online);
pr_info("MPAM enabled with %u partid and %u pmg\n", @@ -1509,6 +1513,8 @@ static irqreturn_t mpam_disable_thread(int irq, void *dev_id) } mutex_unlock(&mpam_cpuhp_state_lock);
+ static_branch_disable(&mpam_enabled); + mpam_unregister_irqs();
idx = srcu_read_lock(&mpam_srcu); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index e81aed7e6be6..9f9e7a9899a6 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -8,12 +8,20 @@ #include <linux/atomic.h> #include <linux/cpumask.h> #include <linux/io.h> +#include <linux/jump_label.h> #include <linux/mailbox_client.h> #include <linux/mutex.h> #include <linux/resctrl.h> #include <linux/sizes.h> #include <linux/srcu.h>
+DECLARE_STATIC_KEY_FALSE(mpam_enabled); + +static inline bool mpam_is_enabled(void) +{ + return static_branch_likely(&mpam_enabled); +} + struct mpam_msc { /* member of mpam_all_msc */
From: James Morse james.morse@arm.com
When CPUs come online the original configuration should be restored. Once the maximum partid is known, allocate an configuration array for each component, and reprogram each RIS configuration from this.
The MPAM spec describes how multiple controls can interact. To prevent this happening by accident, always reset controls that don't have a valid configuration. This allows the same helper to be used for configuration and reset.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 193 +++++++++++++++++++++++--- drivers/platform/mpam/mpam_internal.h | 24 +++- 2 files changed, 192 insertions(+), 25 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 04e1dfd1c5e3..7da57310d52c 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -699,51 +699,89 @@ static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) __mpam_write_reg(msc, reg, bm); }
-static void mpam_reset_ris_partid(struct mpam_msc_ris *ris, u16 partid) +static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, + struct mpam_config *cfg) { struct mpam_msc *msc = ris->msc; u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; struct mpam_props *rprops = &ris->props;
- lockdep_assert_held(&msc->lock); - spin_lock(&msc->part_sel_lock); __mpam_part_sel(ris->ris_idx, partid, msc);
- if (mpam_has_feature(mpam_feat_cpor_part, rprops)) - mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) { + if (mpam_has_feature(mpam_feat_cpor_part, cfg)) + mpam_write_partsel_reg(msc, CPBM, cfg->cpbm); + else + mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, + rprops->cpbm_wd); + }
- if (mpam_has_feature(mpam_feat_mbw_part, rprops)) - mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, rprops->mbw_pbm_bits); + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_part, cfg)) + mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm); + else + mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, + rprops->mbw_pbm_bits); + }
if (mpam_has_feature(mpam_feat_mbw_min, rprops)) mpam_write_partsel_reg(msc, MBW_MIN, 0);
- if (mpam_has_feature(mpam_feat_mbw_max, rprops)) - mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); + if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_max, cfg)) + mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max); + else + mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); + }
if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); spin_unlock(&msc->part_sel_lock); }
+struct reprogram_ris { + struct mpam_msc_ris *ris; + struct mpam_config *cfg; +}; + +/* Call with MSC lock held */ +static int mpam_reprogram_ris(void *_arg) +{ + u16 partid, partid_max; + struct reprogram_ris *arg = _arg; + struct mpam_msc_ris *ris = arg->ris; + struct mpam_config *cfg = arg->cfg; + + if (ris->in_reset_state) + return 0; + + spin_lock(&partid_max_lock); + partid_max = mpam_partid_max; + spin_unlock(&partid_max_lock); + for (partid = 0; partid < partid_max; partid++) + mpam_reprogram_ris_partid(ris, partid, cfg); + + return 0; +} + /* * Called via smp_call_on_cpu() to prevent migration, while still being * pre-emptible. */ static int mpam_reset_ris(void *arg) { - u16 partid, partid_max; struct mpam_msc_ris *ris = arg; + struct reprogram_ris reprogram_arg; + struct mpam_config empty_cfg = { 0 };
if (ris->in_reset_state) return 0;
- spin_lock(&partid_max_lock); - partid_max = mpam_partid_max; - spin_unlock(&partid_max_lock); - for (partid = 0; partid < partid_max; partid++) - mpam_reset_ris_partid(ris, partid); + reprogram_arg.ris = ris; + reprogram_arg.cfg = &empty_cfg; + + mpam_reprogram_ris(&reprogram_arg);
return 0; } @@ -792,6 +830,37 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online) srcu_read_unlock(&mpam_srcu, idx); }
+static void mpam_reprogram_msc(struct mpam_msc *msc) +{ + int idx; + u16 partid; + bool reset; + struct mpam_config *cfg; + struct mpam_msc_ris *ris; + + lockdep_assert_held(&msc->lock); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &msc->ris, msc_list) { + if (!mpam_is_enabled() && !ris->in_reset_state) { + mpam_touch_msc(msc, &mpam_reset_ris, ris); + ris->in_reset_state = true; + continue; + } + + reset = true; + for (partid = 0; partid < mpam_partid_max; partid++) { + cfg = &ris->comp->cfg[partid]; + if (cfg->features) + reset = false; + + mpam_reprogram_ris_partid(ris, partid, cfg); + } + ris->in_reset_state = reset; + } + srcu_read_unlock(&mpam_srcu, idx); +} + static void _enable_percpu_irq(void *_irq) { int *irq = _irq; @@ -813,7 +882,7 @@ static int mpam_cpu_online(unsigned int cpu) _enable_percpu_irq(&msc->reenable_error_ppi);
if (atomic_fetch_inc(&msc->online_refs) == 0) - mpam_reset_msc(msc, true); + mpam_reprogram_msc(msc); mutex_unlock(&msc->lock); } srcu_read_unlock(&mpam_srcu, idx); @@ -1433,6 +1502,37 @@ static void mpam_unregister_irqs(void) cpus_read_unlock(); }
+static int __allocate_component_cfg(struct mpam_component *comp) +{ + if (comp->cfg) + return 0; + + comp->cfg = kcalloc(mpam_partid_max, sizeof(*comp->cfg), GFP_KERNEL); + if (!comp->cfg) + return -ENOMEM; + + return 0; +} + +static int mpam_allocate_config(void) +{ + int err = 0; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + list_for_each_entry(comp, &class->components, class_list) { + err = __allocate_component_cfg(comp); + if (err) + return err; + } + } + + return 0; +} + static void mpam_enable_once(void) { int err; @@ -1444,12 +1544,21 @@ static void mpam_enable_once(void) */ cpus_read_lock(); mutex_lock(&mpam_list_lock); - mpam_enable_merge_features(); + do { + mpam_enable_merge_features();
- err = mpam_register_irqs(); - if (err) - pr_warn("Failed to register irqs: %d\n", err); + err = mpam_allocate_config(); + if (err) { + pr_err("Failed to allocate configuration arrays.\n"); + break; + }
+ err = mpam_register_irqs(); + if (err) { + pr_warn("Failed to register irqs: %d\n", err); + break; + } + } while (0); mutex_unlock(&mpam_list_lock); cpus_read_unlock();
@@ -1486,6 +1595,8 @@ static void mpam_reset_class(struct mpam_class *class)
idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(comp, &class->components, class_list) { + memset(comp->cfg, 0, (mpam_partid_max * sizeof(*comp->cfg))); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { mutex_lock(&ris->msc->lock); mpam_touch_msc(ris->msc, mpam_reset_ris, ris); @@ -1575,6 +1686,48 @@ static int mpam_msc_drv_remove(struct platform_device *pdev) return 0; }
+struct mpam_write_config_arg { + struct mpam_msc_ris *ris; + struct mpam_component *comp; + u16 partid; +}; + +static int __write_config(void *arg) +{ + struct mpam_write_config_arg *c = arg; + + mpam_reprogram_ris_partid(c->ris, c->partid, &c->comp->cfg[c->partid]); + + return 0; +} + +/* TODO: split into write_config/sync_config */ +/* TODO: add config_dirty bitmap to drive sync_config */ +int mpam_apply_config(struct mpam_component *comp, u16 partid, + struct mpam_config *cfg) +{ + struct mpam_write_config_arg arg; + struct mpam_msc_ris *ris; + int idx; + + lockdep_assert_cpus_held(); + + comp->cfg[partid] = *cfg; + arg.comp = comp; + arg.partid = partid; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + arg.ris = ris; + mutex_lock(&ris->msc->lock); + mpam_touch_msc(ris->msc, __write_config, &arg); + mutex_unlock(&ris->msc->lock); + } + srcu_read_unlock(&mpam_srcu, idx); + + return 0; +} + static const struct of_device_id mpam_of_match[] = { { .compatible = "arm,mpam-msc", }, {}, diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 9f9e7a9899a6..6a6d67a473dc 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -102,11 +102,7 @@ struct mpam_props u16 num_mbwu_mon; };
-static inline bool mpam_has_feature(enum mpam_device_features feat, - struct mpam_props *props) -{ - return (1<<feat) & props->features; -} +#define mpam_has_feature(_feat, x) ((1<<_feat) & (x)->features)
static inline void mpam_set_feature(enum mpam_device_features feat, struct mpam_props *props) @@ -136,6 +132,15 @@ struct mpam_class struct list_head classes_list; };
+struct mpam_config { + /* Which configuration values are valid. 0 is used for reset */ + mpam_features_t features; + + u32 cpbm; + u32 mbw_pbm; + u16 mbw_max; +}; + struct mpam_component { u32 comp_id; @@ -145,6 +150,12 @@ struct mpam_component
cpumask_t affinity;
+ /* + * Array of configuration values, indexed by partid. + * Read from cpuhp callbacks, hold the cpuhp lock when writing. + */ + struct mpam_config *cfg; + /* member of mpam_class:components */ struct list_head class_list;
@@ -184,6 +195,9 @@ extern u8 mpam_pmg_max; void mpam_enable(struct work_struct *work); void mpam_disable(struct work_struct *work);
+int mpam_apply_config(struct mpam_component *comp, u16 partid, + struct mpam_config *cfg); + /* * MPAM MSCs have the following register layout. See: * Arm Architecture Reference Manual Supplement - Memory System Resource
From: James Morse james.morse@arm.com
MPAM supports more features than are going to be exposed to resctrl. For partid other than 0, the reset values of these controls isn't known.
Discover the rest of the features so they can be reset to avoid any side effects when resctrl is in use.
PARTID narrowing allows MSC/RIS to support less configuration space than is usable. If this feature is found on a class of device we are likely to use, then reduce the partid_max to make it usable. This allows us to map a PARTID to itself.
CC: Rohit Mathew Rohit.Mathew@arm.com Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 91 +++++++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 10 ++- 2 files changed, 100 insertions(+), 1 deletion(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 7da57310d52c..612c4b97d373 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -549,10 +549,20 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) int err; struct mpam_msc *msc = ris->msc; struct mpam_props *props = &ris->props; + struct mpam_class *class = ris->comp->class;
lockdep_assert_held(&msc->lock); lockdep_assert_held(&msc->part_sel_lock);
+ /* Cache Capacity Partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_CCAP_PART, ris->idr)) { + u32 ccap_features = mpam_read_partsel_reg(msc, CCAP_IDR); + + props->cmax_wd = FIELD_GET(MPAMF_CCAP_IDR_CMAX_WD, ccap_features); + if (props->cmax_wd) + mpam_set_feature(mpam_feat_ccap_part, props); + } + /* Cache Portion partitioning */ if (FIELD_GET(MPAMF_IDR_HAS_CPOR_PART, ris->idr)) { u32 cpor_features = mpam_read_partsel_reg(msc, CPOR_IDR); @@ -575,6 +585,31 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) mpam_set_feature(mpam_feat_mbw_max, props); + + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MIN, mbw_features)) + mpam_set_feature(mpam_feat_mbw_min, props); + + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_PROP, mbw_features)) + mpam_set_feature(mpam_feat_mbw_prop, props); + } + + /* Priority partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_PRI_PART, ris->idr)) { + u32 pri_features = mpam_read_partsel_reg(msc, PRI_IDR); + + props->intpri_wd = FIELD_GET(MPAMF_PRI_IDR_INTPRI_WD, pri_features); + if (props->intpri_wd && FIELD_GET(MPAMF_PRI_IDR_HAS_INTPRI, pri_features)) { + mpam_set_feature(mpam_feat_intpri_part, props); + if (FIELD_GET(MPAMF_PRI_IDR_INTPRI_0_IS_LOW, pri_features)) + mpam_set_feature(mpam_feat_intpri_part_0_low, props); + } + + props->dspri_wd = FIELD_GET(MPAMF_PRI_IDR_DSPRI_WD, pri_features); + if (props->dspri_wd && FIELD_GET(MPAMF_PRI_IDR_HAS_DSPRI, pri_features)) { + mpam_set_feature(mpam_feat_dspri_part, props); + if (FIELD_GET(MPAMF_PRI_IDR_DSPRI_0_IS_LOW, pri_features)) + mpam_set_feature(mpam_feat_dspri_part_0_low, props); + } }
/* Performance Monitoring */ @@ -610,6 +645,21 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) mpam_set_feature(mpam_feat_msmon_mbwu_rwbw, props); } } + + /* + * RIS with PARTID narrowing don't have enough storage for one + * configuration per PARTID. If these are in a class we could use, + * reduce the supported partid_max to match the numer of intpartid. + * If the class is unknown, just ignore it. + */ + if (FIELD_GET(MPAMF_IDR_HAS_PARTID_NRW, ris->idr) && + class->type != MPAM_CLASS_UNKNOWN) { + u32 nrwidr = mpam_read_partsel_reg(msc, PARTID_NRW_IDR); + u16 partid_max = FIELD_GET(MPAMF_PARTID_NRW_IDR_INTPARTID_MAX, nrwidr); + + mpam_set_feature(mpam_feat_partid_nrw, props); + msc->partid_max = min(msc->partid_max, partid_max); + } }
static int mpam_msc_hw_probe(struct mpam_msc *msc) @@ -702,13 +752,21 @@ static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, struct mpam_config *cfg) { + u32 pri_val = 0; + u16 cmax = MPAMCFG_CMAX_CMAX; struct mpam_msc *msc = ris->msc; u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; struct mpam_props *rprops = &ris->props; + u16 dspri = GENMASK(rprops->dspri_wd, 0); + u16 intpri = GENMASK(rprops->intpri_wd, 0);
spin_lock(&msc->part_sel_lock); __mpam_part_sel(ris->ris_idx, partid, msc);
+ if(mpam_has_feature(mpam_feat_partid_nrw, rprops)) + mpam_write_partsel_reg(msc, INTPARTID, + (MPAMCFG_PART_SEL_INTERNAL | partid)); + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) { if (mpam_has_feature(mpam_feat_cpor_part, cfg)) mpam_write_partsel_reg(msc, CPBM, cfg->cpbm); @@ -737,6 +795,26 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid,
if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); + + if (mpam_has_feature(mpam_feat_ccap_part, rprops)) + mpam_write_partsel_reg(msc, CMAX, cmax); + + if (mpam_has_feature(mpam_feat_intpri_part, rprops) || + mpam_has_feature(mpam_feat_dspri_part, rprops)) { + /* aces high? */ + if (!mpam_has_feature(mpam_feat_intpri_part_0_low, rprops)) + intpri = 0; + if (!mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) + dspri = 0; + + if (mpam_has_feature(mpam_feat_intpri_part, rprops)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_INTPRI, intpri); + if (mpam_has_feature(mpam_feat_dspri_part, rprops)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_DSPRI, dspri); + + mpam_write_partsel_reg(msc, PRI, pri_val); + } + spin_unlock(&msc->part_sel_lock); }
@@ -1285,6 +1363,19 @@ __resource_props_mismatch(struct mpam_msc_ris *ris, struct mpam_class *class) cprops->num_csu_mon = min(cprops->num_csu_mon, rprops->num_csu_mon); if (cprops->num_mbwu_mon != rprops->num_mbwu_mon) cprops->num_mbwu_mon = min(cprops->num_mbwu_mon, rprops->num_mbwu_mon); + + if (cprops->intpri_wd != rprops->intpri_wd) + cprops->intpri_wd = min(cprops->intpri_wd, rprops->intpri_wd); + if (cprops->dspri_wd != rprops->dspri_wd) + cprops->dspri_wd = min(cprops->dspri_wd, rprops->dspri_wd); + + /* {int,ds}pri may not have differing 0-low behaviour */ + if (mpam_has_feature(mpam_feat_intpri_part_0_low, cprops) != + mpam_has_feature(mpam_feat_intpri_part_0_low, rprops)) + mpam_clear_feature(mpam_feat_intpri_part, &cprops->features); + if (mpam_has_feature(mpam_feat_dspri_part_0_low, cprops) != + mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) + mpam_clear_feature(mpam_feat_dspri_part, &cprops->features); }
/* diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 6a6d67a473dc..9b90f10b0436 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -70,7 +70,7 @@ struct mpam_msc * When we compact the supported features, we don't care what they are. * Storing them as a bitmap makes life easy. */ -typedef u16 mpam_features_t; +typedef u32 mpam_features_t;
/* Bits for mpam_features_t */ enum mpam_device_features { @@ -80,6 +80,10 @@ enum mpam_device_features { mpam_feat_mbw_min, mpam_feat_mbw_max, mpam_feat_mbw_prop, + mpam_feat_intpri_part, + mpam_feat_intpri_part_0_low, + mpam_feat_dspri_part, + mpam_feat_dspri_part_0_low, mpam_feat_msmon, mpam_feat_msmon_csu, mpam_feat_msmon_csu_capture, @@ -87,6 +91,7 @@ enum mpam_device_features { mpam_feat_msmon_mbwu_capture, mpam_feat_msmon_mbwu_rwbw, mpam_feat_msmon_capt, + mpam_feat_partid_nrw, MPAM_FEATURE_LAST, }; #define MPAM_ALL_FEATURES ((1<<MPAM_FEATURE_LAST) - 1) @@ -98,6 +103,9 @@ struct mpam_props u16 cpbm_wd; u16 mbw_pbm_bits; u8 bwa_wd; + u16 cmax_wd; + u16 intpri_wd; + u16 dspri_wd; u16 num_csu_mon; u16 num_mbwu_mon; };
From: James Morse james.morse@arm.com
MPAM's MSC support a number of monitors, each of which supports bandwidth counters, or cache-storage-utilisation counters. To use a counter, a monitor needs to be configured. Add helpers to allocate and free CSU or MBWU monitors.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 2 ++ drivers/platform/mpam/mpam_internal.h | 35 +++++++++++++++++++++++++++ 2 files changed, 37 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 612c4b97d373..82dffe8c03ef 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -244,6 +244,8 @@ mpam_class_alloc(u8 level_idx, enum mpam_class_types type, gfp_t gfp) class->level = level_idx; class->type = type; INIT_LIST_HEAD_RCU(&class->classes_list); + ida_init(&class->ida_csu_mon); + ida_init(&class->ida_mbwu_mon);
list_add_rcu(&class->classes_list, &mpam_classes);
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 9b90f10b0436..19acb460aeba 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -138,6 +138,9 @@ struct mpam_class
/* member of mpam_classes */ struct list_head classes_list; + + struct ida ida_csu_mon; + struct ida ida_mbwu_mon; };
struct mpam_config { @@ -191,6 +194,38 @@ struct mpam_msc_ris struct mpam_component *comp; };
+static inline int mpam_alloc_csu_mon(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return -EOPNOTSUPP; + + return ida_alloc_range(&class->ida_csu_mon, 0, cprops->num_csu_mon - 1, + GFP_KERNEL); +} + +static inline void mpam_free_csu_mon(struct mpam_class *class, int csu_mon) +{ + ida_free(&class->ida_csu_mon, csu_mon); +} + +static inline int mpam_alloc_mbwu_mon(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops)) + return -EOPNOTSUPP; + + return ida_alloc_range(&class->ida_mbwu_mon, 0, + cprops->num_mbwu_mon - 1, GFP_KERNEL); +} + +static inline void mpam_free_mbwu_mon(struct mpam_class *class, int mbwu_mon) +{ + ida_free(&class->ida_mbwu_mon, mbwu_mon); +} + /* List of all classes */ extern struct list_head mpam_classes; extern struct srcu_struct mpam_srcu;
From: James Morse james.morse@arm.com
Reaing a monitor involves configuring what you want to monitor, and reading the value. Components made up of multiple MSC may need values from each MSC. MSCs may take time to configure, returning 'not ready'. The maximum 'not ready' time should have been provided by firmware.
Add mpam_msmon_read() to hide all this. If (one of) the MSC returns not ready, then wait the full timeout value before trying again.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 219 ++++++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 19 +++ 2 files changed, 238 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 82dffe8c03ef..3516528f2e14 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -123,6 +123,22 @@ static void __mpam_write_reg(struct mpam_msc *msc, u16 reg, u32 val) __mpam_write_reg(msc, MPAMCFG_##reg, val); \ })
+#define mpam_read_monsel_reg(msc, reg) \ +({ \ + u32 ____ret; \ + \ + lockdep_assert_held_once(&msc->mon_sel_lock); \ + ____ret = __mpam_read_reg(msc, MSMON_##reg); \ + \ + ____ret; \ +}) + +#define mpam_write_monsel_reg(msc, reg, val) \ +({ \ + lockdep_assert_held_once(&msc->mon_sel_lock); \ + __mpam_write_reg(msc, MSMON_##reg, val); \ +}) + static u64 mpam_msc_read_idr(struct mpam_msc *msc) { u64 idr_high = 0, idr_low; @@ -725,6 +741,209 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) return 0; }
+struct mon_read +{ + struct mpam_msc_ris *ris; + struct mon_cfg *ctx; + enum mpam_device_features type; + u64 *val; + int err; +}; + +static void gen_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, + u32 *flt_val) +{ + struct mon_cfg *ctx = m->ctx; + + switch (m->type) { + case mpam_feat_msmon_csu: + *ctl_val = MSMON_CFG_MBWU_CTL_TYPE_CSU; + break; + case mpam_feat_msmon_mbwu: + *ctl_val = MSMON_CFG_MBWU_CTL_TYPE_MBWU; + break; + default: + return; + } + + /* + * For CSU counters its implementation-defined what happens when not + * filtering by partid. + */ + *ctl_val |= MSMON_CFG_x_CTL_MATCH_PARTID; + + *flt_val = FIELD_PREP(MSMON_CFG_MBWU_FLT_PARTID, ctx->partid); + if (m->ctx->match_pmg) { + *ctl_val |= MSMON_CFG_x_CTL_MATCH_PMG; + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_PMG, ctx->pmg); + } + + if (mpam_has_feature(mpam_feat_msmon_mbwu_rwbw, &m->ris->props)) + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_RWBW, ctx->opts); +} + +static void read_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, + u32 *flt_val) +{ + struct mpam_msc *msc = m->ris->msc; + + switch (m->type) { + case mpam_feat_msmon_csu: + *ctl_val = mpam_read_monsel_reg(msc, CFG_CSU_CTL); + *flt_val = mpam_read_monsel_reg(msc, CFG_CSU_FLT); + break; + case mpam_feat_msmon_mbwu: + *ctl_val = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); + *flt_val = mpam_read_monsel_reg(msc, CFG_MBWU_FLT); + break; + default: + return; + } +} + +static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, + u32 flt_val) +{ + struct mpam_msc *msc = m->ris->msc; + + /* + * Write the ctl_val with the enable bit cleared, reset the counter, + * then enable counter. + */ + switch (m->type) { + case mpam_feat_msmon_csu: + mpam_write_monsel_reg(msc, CFG_CSU_FLT, flt_val); + mpam_write_monsel_reg(msc, CFG_CSU_CTL, ctl_val); + mpam_write_monsel_reg(msc, CSU, 0); + mpam_write_monsel_reg(msc, CFG_CSU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + break; + case mpam_feat_msmon_mbwu: + mpam_write_monsel_reg(msc, CFG_MBWU_FLT, flt_val); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val); + mpam_write_monsel_reg(msc, MBWU, 0); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + break; + default: + return; + } +} + +static void __ris_msmon_read(void *arg) +{ + u64 now; + bool nrdy = false; + unsigned long flags; + struct mon_read *m = arg; + struct mon_cfg *ctx = m->ctx; + struct mpam_msc_ris *ris = m->ris; + struct mpam_msc *msc = m->ris->msc; + u32 mon_sel, ctl_val, flt_val, cur_ctl, cur_flt; + + lockdep_assert_held(&msc->lock); + + spin_lock_irqsave(&msc->mon_sel_lock, flags); + mon_sel = FIELD_PREP(MSMON_CFG_MON_SEL_MON_SEL, ctx->mon) | + FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); + mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel); + + /* + * Read the existing configuration to avoid re-writing the same values. + * This saves waiting for 'nrdy' on subsequent reads. + */ + read_msmon_ctl_flt_vals(m, &cur_ctl, &cur_flt); + gen_msmon_ctl_flt_vals(m, &ctl_val, &flt_val); + if (cur_flt != flt_val || cur_ctl != (ctl_val | MSMON_CFG_x_CTL_EN)) + write_msmon_ctl_flt_vals(m, ctl_val, flt_val); + + switch (m->type) { + case mpam_feat_msmon_csu: + now = mpam_read_monsel_reg(msc, CSU); + break; + case mpam_feat_msmon_mbwu: + now = mpam_read_monsel_reg(msc, MBWU); + break; + default: + return; + } + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + + nrdy = now & MSMON___NRDY; + if (nrdy) { + m->err = -EBUSY; + return; + } + + now = FIELD_GET(MSMON___VALUE, now); + *(m->val) += now; +} + +static int _msmon_read(struct mpam_component *comp, struct mon_read *arg) +{ + int err, idx; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + arg->ris = ris; + + msc = ris->msc; + mutex_lock(&msc->lock); + err = smp_call_function_any(&msc->accessibility, + __ris_msmon_read, arg, true); + mutex_unlock(&msc->lock); + if (!err && arg->err) + err = arg->err; + if (err) + break; + } + srcu_read_unlock(&mpam_srcu, idx); + + return err; +} + +int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, + enum mpam_device_features type, u64 *val) +{ + int err; + struct mon_read arg; + u64 wait_jiffies = 0; + struct mpam_props *cprops = &comp->class->props; + + might_sleep(); + + if (!mpam_is_enabled()) + return -EIO; + + if (!mpam_has_feature(type, cprops)) + return -EOPNOTSUPP; + + memset(&arg, 0, sizeof(arg)); + arg.ctx = ctx; + arg.type = type; + arg.val = val; + *val = 0; + + err = _msmon_read(comp, &arg); + if (err == -EBUSY) + wait_jiffies = usecs_to_jiffies(comp->class->nrdy_usec); + + while (wait_jiffies) + wait_jiffies = schedule_timeout_uninterruptible(wait_jiffies); + + if (err == -EBUSY) { + memset(&arg, 0, sizeof(arg)); + arg.ctx = ctx; + arg.type = type; + arg.val = val; + *val = 0; + + err = _msmon_read(comp, &arg); + } + + return err; +} + static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) { u32 num_words, msb; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 19acb460aeba..fb3a4c00e5d3 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -62,6 +62,7 @@ struct mpam_msc * If needed, take msc->lock first. */ spinlock_t part_sel_lock; + spinlock_t mon_sel_lock; void __iomem * mapped_hwpage; size_t mapped_hwpage_sz; }; @@ -194,6 +195,21 @@ struct mpam_msc_ris struct mpam_component *comp; };
+/* The values for MSMON_CFG_MBWU_FLT.RWBW */ +enum mon_filter_options { + COUNT_BOTH = 0, + COUNT_WRITE = 1, + COUNT_READ = 2, +}; + +struct mon_cfg { + u16 mon; + u8 pmg; + bool match_pmg; + u32 partid; + enum mon_filter_options opts; +}; + static inline int mpam_alloc_csu_mon(struct mpam_class *class) { struct mpam_props *cprops = &class->props; @@ -241,6 +257,9 @@ void mpam_disable(struct work_struct *work); int mpam_apply_config(struct mpam_component *comp, u16 partid, struct mpam_config *cfg);
+int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, + enum mpam_device_features, u64 *val); + /* * MPAM MSCs have the following register layout. See: * Arm Architecture Reference Manual Supplement - Memory System Resource
From: James Morse james.morse@arm.com
Bandwidth counters need to run continuously to correctly reflect the bandwidth.
The value read may be lower than the previous value read in the case of overflow and when the hardware is reset due to CPU hotplug.
Add struct mbwu_state to track the bandwidth counter to allow overflow and power management to be handled.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 148 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 52 ++++++--- 2 files changed, 181 insertions(+), 19 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 3516528f2e14..930e14471378 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -773,6 +773,7 @@ static void gen_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, *ctl_val |= MSMON_CFG_x_CTL_MATCH_PARTID;
*flt_val = FIELD_PREP(MSMON_CFG_MBWU_FLT_PARTID, ctx->partid); + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_RWBW, ctx->opts); if (m->ctx->match_pmg) { *ctl_val |= MSMON_CFG_x_CTL_MATCH_PMG; *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_PMG, ctx->pmg); @@ -805,6 +806,7 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, u32 flt_val) { struct mpam_msc *msc = m->ris->msc; + struct msmon_mbwu_state *mbwu_state;
/* * Write the ctl_val with the enable bit cleared, reset the counter, @@ -822,21 +824,33 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val); mpam_write_monsel_reg(msc, MBWU, 0); mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + + mbwu_state = &m->ris->mbwu_state[m->ctx->mon]; + if (mbwu_state) + mbwu_state->prev_val = 0; + break; default: return; } }
+static u64 mpam_msmon_overflow_val(struct mpam_msc_ris *ris) +{ + /* TODO: scaling, and long counters */ + return GENMASK_ULL(30, 0); +} + static void __ris_msmon_read(void *arg) { - u64 now; bool nrdy = false; unsigned long flags; struct mon_read *m = arg; + u64 now, overflow_val = 0; struct mon_cfg *ctx = m->ctx; struct mpam_msc_ris *ris = m->ris; struct mpam_msc *msc = m->ris->msc; + struct msmon_mbwu_state *mbwu_state; u32 mon_sel, ctl_val, flt_val, cur_ctl, cur_flt;
lockdep_assert_held(&msc->lock); @@ -858,22 +872,41 @@ static void __ris_msmon_read(void *arg) switch (m->type) { case mpam_feat_msmon_csu: now = mpam_read_monsel_reg(msc, CSU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); break; case mpam_feat_msmon_mbwu: now = mpam_read_monsel_reg(msc, MBWU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); + + if (nrdy) + break; + + mbwu_state = &ris->mbwu_state[ctx->mon]; + if (!mbwu_state) + break; + + /* Add any pre-overflow value to the mbwu_state->val */ + if (mbwu_state->prev_val > now) + overflow_val = mpam_msmon_overflow_val(ris) - mbwu_state->prev_val; + + mbwu_state->prev_val = now; + mbwu_state->correction += overflow_val; + + /* Include bandwidth consumed before the last hardware reset */ + now += mbwu_state->correction; break; default: return; } spin_unlock_irqrestore(&msc->mon_sel_lock, flags);
- nrdy = now & MSMON___NRDY; if (nrdy) { m->err = -EBUSY; return; }
- now = FIELD_GET(MSMON___VALUE, now); *(m->val) += now; }
@@ -1064,6 +1097,68 @@ static int mpam_reprogram_ris(void *_arg) return 0; }
+static int mpam_restore_mbwu_state(void *_ris) +{ + int i; + struct mon_read mwbu_arg; + struct mpam_msc_ris *ris = _ris; + + lockdep_assert_held(&ris->msc->lock); + + for (i = 0; i < ris->props.num_mbwu_mon; i++) { + if (ris->mbwu_state[i].enabled) { + mwbu_arg.ris = ris; + mwbu_arg.ctx = &ris->mbwu_state[i].cfg; + mwbu_arg.type = mpam_feat_msmon_mbwu; + + __ris_msmon_read(&mwbu_arg); + } + } + + return 0; +} + +static int mpam_save_mbwu_state(void *arg) +{ + int i; + u64 val; + struct mon_cfg *cfg; + unsigned long flags; + u32 cur_flt, cur_ctl, mon_sel; + struct mpam_msc_ris *ris = arg; + struct mpam_msc *msc = ris->msc; + struct msmon_mbwu_state *mbwu_state; + + lockdep_assert_held(&msc->lock); + + for (i = 0; i < ris->props.num_mbwu_mon; i++) { + mbwu_state = &ris->mbwu_state[i]; + cfg = &mbwu_state->cfg; + + spin_lock_irqsave(&msc->mon_sel_lock, flags); + mon_sel = FIELD_PREP(MSMON_CFG_MON_SEL_MON_SEL, i) | + FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); + mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel); + + cur_flt = mpam_read_monsel_reg(msc, CFG_MBWU_FLT); + cur_ctl = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, 0); + + val = mpam_read_monsel_reg(msc, MBWU); + mpam_write_monsel_reg(msc, MBWU, 0); + + cfg->mon = i; + cfg->pmg = FIELD_GET(MSMON_CFG_MBWU_FLT_PMG, cur_flt); + cfg->match_pmg = FIELD_GET(MSMON_CFG_x_CTL_MATCH_PMG, cur_ctl); + cfg->partid = FIELD_GET(MSMON_CFG_MBWU_FLT_PARTID, cur_flt); + mbwu_state->correction += val; + mbwu_state->enabled = FIELD_GET(MSMON_CFG_x_CTL_EN, cur_ctl); + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + + return 0; +} + /* * Called via smp_call_on_cpu() to prevent migration, while still being * pre-emptible. @@ -1125,6 +1220,9 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online) * for non-zero partid may be lost while the CPUs are offline. */ ris->in_reset_state = online; + + if (mpam_is_enabled() && !online) + mpam_touch_msc(msc, &mpam_save_mbwu_state, ris); } srcu_read_unlock(&mpam_srcu, idx); } @@ -1156,6 +1254,9 @@ static void mpam_reprogram_msc(struct mpam_msc *msc) mpam_reprogram_ris_partid(ris, partid, cfg); } ris->in_reset_state = reset; + + if (mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + mpam_touch_msc(msc, &mpam_restore_mbwu_state, ris); } srcu_read_unlock(&mpam_srcu, idx); } @@ -1814,8 +1915,31 @@ static void mpam_unregister_irqs(void) cpus_read_unlock(); }
+static void __destroy_component_cfg(struct mpam_component *comp) +{ + unsigned long flags; + struct mpam_msc_ris *ris; + struct msmon_mbwu_state *mbwu_state; + + kfree(comp->cfg); + list_for_each_entry(ris, &comp->ris, comp_list) { + mutex_lock(&ris->msc->lock); + spin_lock_irqsave(&ris->msc->mon_sel_lock, flags); + mbwu_state = ris->mbwu_state; + ris->mbwu_state = NULL; + spin_unlock_irqrestore(&ris->msc->mon_sel_lock, flags); + mutex_unlock(&ris->msc->lock); + + kfree(mbwu_state); + } +} + static int __allocate_component_cfg(struct mpam_component *comp) { + unsigned long flags; + struct mpam_msc_ris *ris; + struct msmon_mbwu_state *mbwu_state; + if (comp->cfg) return 0;
@@ -1823,6 +1947,24 @@ static int __allocate_component_cfg(struct mpam_component *comp) if (!comp->cfg) return -ENOMEM;
+ list_for_each_entry(ris, &comp->ris, comp_list) { + if (!ris->props.num_mbwu_mon) + continue; + + mbwu_state = kcalloc(ris->props.num_mbwu_mon, + sizeof(*ris->mbwu_state), GFP_KERNEL); + if (!mbwu_state) { + __destroy_component_cfg(comp); + return -ENOMEM; + } + + mutex_lock(&ris->msc->lock); + spin_lock_irqsave(&ris->msc->mon_sel_lock, flags); + ris->mbwu_state = mbwu_state; + spin_unlock_irqrestore(&ris->msc->mon_sel_lock, flags); + mutex_unlock(&ris->msc->lock); + } + return 0; }
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index fb3a4c00e5d3..320036851777 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -175,8 +175,40 @@ struct mpam_component struct mpam_class *class; };
-struct mpam_msc_ris -{ +/* The values for MSMON_CFG_MBWU_FLT.RWBW */ +enum mon_filter_options { + COUNT_BOTH = 0, + COUNT_WRITE = 1, + COUNT_READ = 2, +}; + +struct mon_cfg { + u16 mon; + u8 pmg; + bool match_pmg; + u32 partid; + enum mon_filter_options opts; +}; + +/* + * Changes to enabled and cfg are protected by the msc->lock. + * Changes to prev_val and correction are protected by the msc's mon_sel_lock. + */ +struct msmon_mbwu_state { + bool enabled; + struct mon_cfg cfg; + + /* The value last read from the hardware. Used to detect overflow. */ + u64 prev_val; + + /* + * The value to add to the new reading to account for power management, + * and shifts to trigger the overflow interrupt. + */ + u64 correction; +}; + +struct mpam_msc_ris { u8 ris_idx; u64 idr; struct mpam_props props; @@ -193,21 +225,9 @@ struct mpam_msc_ris /* parents: */ struct mpam_msc *msc; struct mpam_component *comp; -};
-/* The values for MSMON_CFG_MBWU_FLT.RWBW */ -enum mon_filter_options { - COUNT_BOTH = 0, - COUNT_WRITE = 1, - COUNT_READ = 2, -}; - -struct mon_cfg { - u16 mon; - u8 pmg; - bool match_pmg; - u32 partid; - enum mon_filter_options opts; + /* msmon mbwu configuration is preserved over reset */ + struct msmon_mbwu_state *mbwu_state; };
static inline int mpam_alloc_csu_mon(struct mpam_class *class)
From: Rohit Mathew rohit.mathew@arm.com
mpam v0.1 and versions above v1.0 support optional long counter for memory bandwidth monitoring. The MPAMF_MBWUMON_IDR register have fields indicating support for long counters. As of now, a 44 bit counter represented by HAS_LONG field (bit 30) and a 63 bit counter represented by LWD (bit 29) can be optionally integrated. Probe for these counters and set corresponding feature bits if any of these counters are present.
Signed-off-by: Rohit Mathew rohit.mathew@arm.com Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 22 ++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 7 +++++++ 2 files changed, 29 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 930e14471378..c2c5dfa77083 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -653,6 +653,7 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) pr_err_once("Counters are not usable because not-ready timeout was not provided by firmware."); } if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_MBWU, msmon_features)) { + bool has_long; u32 mbwumonidr = mpam_read_partsel_reg(msc, MBWUMON_IDR);
props->num_mbwu_mon = FIELD_GET(MPAMF_MBWUMON_IDR_NUM_MON, mbwumonidr); @@ -661,6 +662,27 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris)
if (FIELD_GET(MPAMF_MBWUMON_IDR_HAS_RWBW, mbwumonidr)) mpam_set_feature(mpam_feat_msmon_mbwu_rwbw, props); + + /* + * Treat long counter and its extension, lwd as mutually + * exclusive feature bits. Though these are dependent + * fields at the implementation level, there would never + * be a need for mpam_feat_msmon_mbwu_44counter (long + * counter) and mpam_feat_msmon_mbwu_63counter (lwd) + * bits to be set together. + * + * mpam_feat_msmon_mbwu isn't treated as an exclusive + * bit as this feature bit would be used as the "front + * facing feature bit" for any checks related to mbwu + * monitors. + */ + has_long = FIELD_GET(MPAMF_MBWUMON_IDR_HAS_LONG, mbwumonidr); + if (props->num_mbwu_mon && has_long) { + if (FIELD_GET(MPAMF_MBWUMON_IDR_LWD, mbwumonidr)) + mpam_set_feature(mpam_feat_msmon_mbwu_63counter, props); + else + mpam_set_feature(mpam_feat_msmon_mbwu_44counter, props); + } } }
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 320036851777..7ce7587202bf 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -88,7 +88,14 @@ enum mpam_device_features { mpam_feat_msmon, mpam_feat_msmon_csu, mpam_feat_msmon_csu_capture, + /* + * Having mpam_feat_msmon_mbwu set doesn't mean the regular 31 bit MBWU + * counter would be used. The exact counter used is decided based on the + * status of mpam_feat_msmon_mbwu_l/mpam_feat_msmon_mbwu_lwd as well. + */ mpam_feat_msmon_mbwu, + mpam_feat_msmon_mbwu_44counter, + mpam_feat_msmon_mbwu_63counter, mpam_feat_msmon_mbwu_capture, mpam_feat_msmon_mbwu_rwbw, mpam_feat_msmon_capt,
From: Rohit Mathew rohit.mathew@arm.com
If the 44 bit (long) or 63 bit (LWD) counters are detected on probing the RIS, use long/LWD counter instead of the regular 31 bit mbwu counter.
Only 32bit accesses to the MSC are required to be supported by the spec, but these registers are 64bits. The lower half may overflow into the higher half between two 32bit reads. To avoid this, use a helper that reads the top half twice to check for overflow.
Signed-off-by: Rohit Mathew rohit.mathew@arm.com [morse: merged multiple patches from Rohit] Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 86 ++++++++++++++++++++++++--- drivers/platform/mpam/mpam_internal.h | 7 ++- 2 files changed, 84 insertions(+), 9 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index c2c5dfa77083..7f1f848281be 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -772,6 +772,48 @@ struct mon_read int err; };
+static bool mpam_ris_has_mbwu_long_counter(struct mpam_msc_ris *ris) +{ + return (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props) || + mpam_has_feature(mpam_feat_msmon_mbwu_44counter, &ris->props)); +} + +static u64 mpam_msc_read_mbwu_l(struct mpam_msc *msc) +{ + int retry = 3; + u32 mbwu_l_low; + u64 mbwu_l_high1, mbwu_l_high2; + + lockdep_assert_held_once(&msc->mon_sel_lock); + + WARN_ON_ONCE((MSMON_MBWU_L + sizeof(u64)) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + mbwu_l_high2 = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L + 4); + do { + mbwu_l_high1 = mbwu_l_high2; + mbwu_l_low = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L); + mbwu_l_high2 = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L + 4); + + retry--; + } while (mbwu_l_high1 != mbwu_l_high2 && retry > 0); + + if (mbwu_l_high2 == mbwu_l_high1) + return (mbwu_l_high1 << 32) | mbwu_l_low; + return MSMON___NRDY_L; +} + +static void mpam_msc_zero_mbwu_l(struct mpam_msc *msc) +{ + lockdep_assert_held_once(&msc->mon_sel_lock); + + WARN_ON_ONCE((MSMON_MBWU_L + sizeof(u64)) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + writel_relaxed(0, msc->mapped_hwpage + MSMON_MBWU_L); + writel_relaxed(0, msc->mapped_hwpage + MSMON_MBWU_L + 4); +} + static void gen_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, u32 *flt_val) { @@ -844,7 +886,12 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, case mpam_feat_msmon_mbwu: mpam_write_monsel_reg(msc, CFG_MBWU_FLT, flt_val); mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val); - mpam_write_monsel_reg(msc, MBWU, 0); + + if (mpam_ris_has_mbwu_long_counter(m->ris)) + mpam_msc_zero_mbwu_l(m->ris->msc); + else + mpam_write_monsel_reg(msc, MBWU, 0); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val|MSMON_CFG_x_CTL_EN);
mbwu_state = &m->ris->mbwu_state[m->ctx->mon]; @@ -859,8 +906,13 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val,
static u64 mpam_msmon_overflow_val(struct mpam_msc_ris *ris) { - /* TODO: scaling, and long counters */ - return GENMASK_ULL(30, 0); + /* TODO: implement scaling counters */ + if (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props)) + return GENMASK_ULL(62, 0); + else if (mpam_has_feature(mpam_feat_msmon_mbwu_44counter, &ris->props)) + return GENMASK_ULL(43, 0); + else + return GENMASK_ULL(30, 0); }
static void __ris_msmon_read(void *arg) @@ -898,9 +950,22 @@ static void __ris_msmon_read(void *arg) now = FIELD_GET(MSMON___VALUE, now); break; case mpam_feat_msmon_mbwu: - now = mpam_read_monsel_reg(msc, MBWU); - nrdy = now & MSMON___NRDY; - now = FIELD_GET(MSMON___VALUE, now); + /* + * If long or lwd counters are supported, use them, else revert + * to the 32 bit counter. + */ + if (mpam_ris_has_mbwu_long_counter(ris)) { + now = mpam_msc_read_mbwu_l(msc); + nrdy = now & MSMON___NRDY_L; + if (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props)) + now = FIELD_GET(MSMON___LWD_VALUE, now); + else + now = FIELD_GET(MSMON___L_VALUE, now); + } else { + now = mpam_read_monsel_reg(msc, MBWU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); + }
if (nrdy) break; @@ -1166,8 +1231,13 @@ static int mpam_save_mbwu_state(void *arg) cur_ctl = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); mpam_write_monsel_reg(msc, CFG_MBWU_CTL, 0);
- val = mpam_read_monsel_reg(msc, MBWU); - mpam_write_monsel_reg(msc, MBWU, 0); + if (mpam_ris_has_mbwu_long_counter(ris)) { + val = mpam_msc_read_mbwu_l(msc); + mpam_msc_zero_mbwu_l(msc); + } else { + val = mpam_read_monsel_reg(msc, MBWU); + mpam_write_monsel_reg(msc, MBWU, 0); + }
cfg->mon = i; cfg->pmg = FIELD_GET(MSMON_CFG_MBWU_FLT_PMG, cur_flt); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 7ce7587202bf..0f49932463dd 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -330,6 +330,8 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, #define MSMON_CSU_CAPTURE 0x0848 /* last cache-usage value captured */ #define MSMON_MBWU 0x0860 /* current mem-bw usage value */ #define MSMON_MBWU_CAPTURE 0x0868 /* last mem-bw value captured */ +#define MSMON_MBWU_L 0x0880 /* current long mem-bw usage value */ +#define MSMON_MBWU_CAPTURE_L 0x0890 /* last long mem-bw value captured */ #define MSMON_CAPT_EVNT 0x0808 /* signal a capture event */ #define MPAMF_ESR 0x00F8 /* error status register */ #define MPAMF_ECR 0x00F0 /* error control register */ @@ -533,7 +535,10 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, */ #define MSMON___VALUE GENMASK(30, 0) #define MSMON___NRDY BIT(31) -#define MSMON_MBWU_L_VALUE GENMASK(62, 0) +#define MSMON___NRDY_L BIT(63) +#define MSMON___L_VALUE GENMASK(43, 0) +#define MSMON___LWD_VALUE GENMASK(62, 0) + /* * MSMON_CAPT_EVNT - Memory system performance monitoring capture event * generation register
From: James Morse james.morse@arm.com
resctrl expects to reset the bandwidth counters when the filesystem is mounted.
To allow this, add a helper that clears the saved mbwu state. Instead of cross calling to each CPU that can access the component MSC to write to the counter, set a flag that causes it to be zero'd on the the next read. This is easily done by forcing a configuration update.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 44 +++++++++++++++++++++++---- drivers/platform/mpam/mpam_internal.h | 5 ++- 2 files changed, 42 insertions(+), 7 deletions(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 7f1f848281be..f9f22d1d698c 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -919,9 +919,11 @@ static void __ris_msmon_read(void *arg) { bool nrdy = false; unsigned long flags; + bool config_mismatch; struct mon_read *m = arg; u64 now, overflow_val = 0; struct mon_cfg *ctx = m->ctx; + bool reset_on_next_read = false; struct mpam_msc_ris *ris = m->ris; struct mpam_msc *msc = m->ris->msc; struct msmon_mbwu_state *mbwu_state; @@ -934,13 +936,24 @@ static void __ris_msmon_read(void *arg) FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel);
+ if (m->type == mpam_feat_msmon_mbwu) { + mbwu_state = &ris->mbwu_state[ctx->mon]; + if (mbwu_state) { + reset_on_next_read = mbwu_state->reset_on_next_read; + mbwu_state->reset_on_next_read = false; + } + } + /* * Read the existing configuration to avoid re-writing the same values. * This saves waiting for 'nrdy' on subsequent reads. */ read_msmon_ctl_flt_vals(m, &cur_ctl, &cur_flt); gen_msmon_ctl_flt_vals(m, &ctl_val, &flt_val); - if (cur_flt != flt_val || cur_ctl != (ctl_val | MSMON_CFG_x_CTL_EN)) + config_mismatch = cur_flt != flt_val || + cur_ctl != (ctl_val | MSMON_CFG_x_CTL_EN); + + if (config_mismatch || reset_on_next_read) write_msmon_ctl_flt_vals(m, ctl_val, flt_val);
switch (m->type) { @@ -970,7 +983,6 @@ static void __ris_msmon_read(void *arg) if (nrdy) break;
- mbwu_state = &ris->mbwu_state[ctx->mon]; if (!mbwu_state) break;
@@ -1064,6 +1076,30 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, return err; }
+void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx) +{ + int idx; + unsigned long flags; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + if (!mpam_is_enabled()) + return; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + if (!mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + continue; + + msc = ris->msc; + spin_lock_irqsave(&msc->mon_sel_lock, flags); + ris->mbwu_state[ctx->mon].correction = 0; + ris->mbwu_state[ctx->mon].reset_on_next_read = true; + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + srcu_read_unlock(&mpam_srcu, idx); +} + static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) { u32 num_words, msb; @@ -1190,8 +1226,6 @@ static int mpam_restore_mbwu_state(void *_ris) struct mon_read mwbu_arg; struct mpam_msc_ris *ris = _ris;
- lockdep_assert_held(&ris->msc->lock); - for (i = 0; i < ris->props.num_mbwu_mon; i++) { if (ris->mbwu_state[i].enabled) { mwbu_arg.ris = ris; @@ -1216,8 +1250,6 @@ static int mpam_save_mbwu_state(void *arg) struct mpam_msc *msc = ris->msc; struct msmon_mbwu_state *mbwu_state;
- lockdep_assert_held(&msc->lock); - for (i = 0; i < ris->props.num_mbwu_mon; i++) { mbwu_state = &ris->mbwu_state[i]; cfg = &mbwu_state->cfg; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 0f49932463dd..80f4b434b3ad 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -199,10 +199,12 @@ struct mon_cfg {
/* * Changes to enabled and cfg are protected by the msc->lock. - * Changes to prev_val and correction are protected by the msc's mon_sel_lock. + * Changes to reset_on_next_read, prev_val and correction are protected by the + * msc's mon_sel_lock. */ struct msmon_mbwu_state { bool enabled; + bool reset_on_next_read; struct mon_cfg cfg;
/* The value last read from the hardware. Used to detect overflow. */ @@ -286,6 +288,7 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid,
int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, enum mpam_device_features, u64 *val); +void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx);
/* * MPAM MSCs have the following register layout. See:
From: James Morse james.morse@arm.com
resctrl has its own data structures to describe its resources. We can't use these directly as we play tricks with the 'MBA' resource, picking the MPAM controls or monitors that best apply. We may export the same component as both L3 and MBA.
Add mpam_resctrl_exports[] as the array of class->resctrl mappings we are exporting, and add the cpuhp hooks that allocated and free the resctrl domain structures.
While we're here, plumb in a few other obvious things.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/Makefile | 2 +- drivers/platform/mpam/mpam_devices.c | 12 ++ drivers/platform/mpam/mpam_internal.h | 15 ++ drivers/platform/mpam/mpam_resctrl.c | 237 ++++++++++++++++++++++++++ include/linux/arm_mpam.h | 4 + 5 files changed, 269 insertions(+), 1 deletion(-) create mode 100644 drivers/platform/mpam/mpam_resctrl.c
diff --git a/drivers/platform/mpam/Makefile b/drivers/platform/mpam/Makefile index 8ad69bfa2aa2..37693be531c3 100644 --- a/drivers/platform/mpam/Makefile +++ b/drivers/platform/mpam/Makefile @@ -1 +1 @@ -obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam_devices.o +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam_devices.o mpam_resctrl.o diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index f9f22d1d698c..28f6010df6dc 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1411,6 +1411,9 @@ static int mpam_cpu_online(unsigned int cpu) } srcu_read_unlock(&mpam_srcu, idx);
+ if (mpam_is_enabled()) + mpam_resctrl_online_cpu(cpu); + return 0; }
@@ -1470,6 +1473,9 @@ static int mpam_cpu_offline(unsigned int cpu) } srcu_read_unlock(&mpam_srcu, idx);
+ if (mpam_is_enabled()) + mpam_resctrl_offline_cpu(cpu); + return 0; }
@@ -2140,6 +2146,12 @@ static void mpam_enable_once(void) mutex_unlock(&mpam_list_lock); cpus_read_unlock();
+ if (!err) { + err = mpam_resctrl_setup(); + if (err) + pr_err("Failed to initialise resctrl: %d\n", err); + } + if (err) { schedule_work(&mpam_broken_work); return; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 80f4b434b3ad..579a8485d320 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -239,6 +239,16 @@ struct mpam_msc_ris { struct msmon_mbwu_state *mbwu_state; };
+struct mpam_resctrl_dom { + struct mpam_component *comp; + struct rdt_domain resctrl_dom; +}; + +struct mpam_resctrl_res { + struct mpam_class *class; + struct rdt_resource resctrl_res; +}; + static inline int mpam_alloc_csu_mon(struct mpam_class *class) { struct mpam_props *cprops = &class->props; @@ -290,6 +300,11 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, enum mpam_device_features, u64 *val); void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx);
+int mpam_resctrl_online_cpu(unsigned int cpu); +int mpam_resctrl_offline_cpu(unsigned int cpu); + +int mpam_resctrl_setup(void); + /* * MPAM MSCs have the following register layout. See: * Arm Architecture Reference Manual Supplement - Memory System Resource diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c new file mode 100644 index 000000000000..b9c1292ff630 --- /dev/null +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -0,0 +1,237 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2021 Arm Ltd. + +#define pr_fmt(fmt) "mpam: resctrl: " fmt + +#include <linux/arm_mpam.h> +#include <linux/cacheinfo.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/errno.h> +#include <linux/list.h> +#include <linux/printk.h> +#include <linux/rculist.h> +#include <linux/resctrl.h> +#include <linux/slab.h> +#include <linux/types.h> + +#include <asm/mpam.h> + +#include "mpam_internal.h" + +/* + * The classes we've picked to map to resctrl resources. + * Class pointer may be NULL. + */ +static struct mpam_resctrl_res mpam_resctrl_exports[RDT_NUM_RESOURCES]; + +static bool exposed_alloc_capable; +static bool exposed_mon_capable; + +bool resctrl_arch_alloc_capable(void) +{ + return exposed_alloc_capable; +} + +bool resctrl_arch_mon_capable(void) +{ + return exposed_mon_capable; +} + +/* + * MSC may raise an error interrupt if it sees an out or range partid/pmg, + * and go on to truncate the value. Regardless of what the hardware supports, + * only the system wide safe value is safe to use. + */ +u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) +{ + return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); +} + +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) +{ + if (l >= RDT_NUM_RESOURCES) + return NULL; + + return &mpam_resctrl_exports[l].resctrl_res; +} + +static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) +{ + /* TODO: initialise the resctrl resources */ + + return 0; +} + +/* Called with the mpam classes lock held */ +int mpam_resctrl_setup(void) +{ + int err = 0; + struct mpam_resctrl_res *res; + enum resctrl_res_level i; + + cpus_read_lock(); + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + INIT_LIST_HEAD(&res->resctrl_res.domains); + INIT_LIST_HEAD(&res->resctrl_res.evt_list); + res->resctrl_res.rid = i; + } + + /* TODO: pick MPAM classes to map to resctrl resources */ + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + if (!res->class) + continue; // dummy resource + + err = mpam_resctrl_resource_init(res); + if (err) + break; + } + cpus_read_unlock(); + + if (!err && !exposed_alloc_capable && !exposed_mon_capable) + err = -EOPNOTSUPP; + + if (!err) { + if (!is_power_of_2(mpam_pmg_max + 1)) { + /* + * If not all the partid*pmg values are valid indexes, + * resctrl may allocate pmg that don't exist. This + * should cause an error interrupt. + */ + pr_warn("Number of PMG is not a power of 2! resctrl may misbehave"); + } + err = resctrl_init(); + } + + return err; +} + +static struct mpam_resctrl_dom * +mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) +{ + struct mpam_resctrl_dom *dom; + struct mpam_class *class = res->class; + struct mpam_component *comp_iter, *comp; + + comp = NULL; + list_for_each_entry(comp_iter, &class->components, class_list) { + if (cpumask_test_cpu(cpu, &comp_iter->affinity)) { + comp = comp_iter; + break; + } + } + + /* cpu with unknown exported component? */ + if (WARN_ON_ONCE(!comp)) + return ERR_PTR(-EINVAL); + + dom = kzalloc_node(sizeof(*dom), GFP_KERNEL, cpu_to_node(cpu)); + if (!dom) + return ERR_PTR(-ENOMEM); + + dom->comp = comp; + INIT_LIST_HEAD(&dom->resctrl_dom.list); + dom->resctrl_dom.id = comp->comp_id; + cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask); + + /* TODO: this list should be sorted */ + list_add_tail(&dom->resctrl_dom.list, &res->resctrl_res.domains); + + return dom; +} + +/* Like resctrl_get_domain_from_cpu(), but for offline CPUs */ +static struct mpam_resctrl_dom * +mpam_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res) +{ + struct rdt_domain *d; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &res->resctrl_res.domains, list) { + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + if (cpumask_test_cpu(cpu, &dom->comp->affinity)) + return dom; + } + + return NULL; +} + +struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id) +{ + struct rdt_domain *d; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &r->domains, list) { + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + if (dom->comp->comp_id == id) + return &dom->resctrl_dom; + } + + return NULL; +} + +int mpam_resctrl_online_cpu(unsigned int cpu) +{ + int i; + struct mpam_resctrl_dom *dom; + struct mpam_resctrl_res *res; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy_resource; + + dom = mpam_get_domain_from_cpu(cpu, res); + if (dom) { + cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask); + continue; + } + + dom = mpam_resctrl_alloc_domain(cpu, res); + if (IS_ERR(dom)) + return PTR_ERR(dom); + } + + return 0; +} + +int mpam_resctrl_offline_cpu(unsigned int cpu) +{ + int i; + struct rdt_domain *d; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy resource + + d = resctrl_get_domain_from_cpu(cpu, &res->resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + /* The last one standing was ahead of us... */ + if (WARN_ON_ONCE(!d)) + continue; + + cpumask_clear_cpu(cpu, &d->cpu_mask); + + if (!cpumask_empty(&d->cpu_mask)) + continue; + + list_del(&d->list); + kfree(dom); + } + + return 0; +} diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 40e09b4d236b..27c3ad9912ef 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -39,4 +39,8 @@ int mpam_register_requestor(u16 partid_max, u8 pmg_max); int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, enum mpam_class_types type, u8 class_id, int component_id);
+ +bool resctrl_arch_alloc_capable(void); +bool resctrl_arch_mon_capable(void); + #endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
Sytems with MPAM support may have a variety of control types at any point of their system layout. We can only expose certain types of control, and only if they exist at particular locations.
Start with the well-know caches. These have to be depth 2 or 3 and support MPAM's cache portion bitmap controls, with a number of portions fewer that resctrl's limit.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 135 ++++++++++++++++++++++++++- include/linux/arm_mpam.h | 7 ++ 2 files changed, 138 insertions(+), 4 deletions(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index b9c1292ff630..3e72cf514c4a 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -27,6 +27,7 @@ static struct mpam_resctrl_res mpam_resctrl_exports[RDT_NUM_RESOURCES];
static bool exposed_alloc_capable; static bool exposed_mon_capable; +static struct mpam_class *mbm_local_class;
bool resctrl_arch_alloc_capable(void) { @@ -38,6 +39,11 @@ bool resctrl_arch_mon_capable(void) return exposed_mon_capable; }
+bool resctrl_arch_is_mbm_local_enabled(void) +{ + return mbm_local_class; +} + /* * MSC may raise an error interrupt if it sees an out or range partid/pmg, * and go on to truncate the value. Regardless of what the hardware supports, @@ -56,14 +62,134 @@ struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) return &mpam_resctrl_exports[l].resctrl_res; }
+static bool cache_has_usable_cpor(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_cpor_part, cprops)) + return false; + + /* TODO: Scaling is not yet supported */ + return (class->props.cpbm_wd <= RESCTRL_MAX_CBM); +} + +static bool cache_has_usable_csu(struct mpam_class *class) +{ + struct mpam_props *cprops; + + if (!class) + return false; + + cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return false; + + /* + * CSU counters settle on the value, so we can get away with + * having only one. + */ + if (!cprops->num_csu_mon) + return false; + + return (mpam_partid_max > 1) || (mpam_pmg_max != 0); +} + +bool resctrl_arch_is_llc_occupancy_enabled(void) +{ + return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); +} + +/* Test whether we can export MPAM_CLASS_CACHE:{2,3}? */ +static void mpam_resctrl_pick_caches(void) +{ + int idx; + struct mpam_class *class; + struct mpam_resctrl_res *res; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + bool has_cpor = cache_has_usable_cpor(class); + + if (class->type != MPAM_CLASS_CACHE) { + pr_debug("pick_caches: Class is not a cache\n"); + continue; + } + + if (class->level != 2 && class->level != 3) { + pr_debug("pick_caches: not L2 or L3\n"); + continue; + } + + if (class->level == 2 && !has_cpor) { + pr_debug("pick_caches: L2 missing CPOR\n"); + continue; + } + else if (!has_cpor && !cache_has_usable_csu(class)) { + pr_debug("pick_caches: Cache misses CPOR and CSU\n"); + continue; + } + + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) { + pr_debug("pick_caches: Class has missing CPUs\n"); + continue; + } + + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2]; + res->resctrl_res.name = "L2"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3]; + res->resctrl_res.name = "L3"; + } + res->class = class; + } + srcu_read_unlock(&mpam_srcu, idx); +} + static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { - /* TODO: initialise the resctrl resources */ + struct mpam_class *class = res->class; + struct rdt_resource *r = &res->resctrl_res; + + /* Is this one of the two well-known caches? */ + if (res->resctrl_res.rid == RDT_RESOURCE_L2 || + res->resctrl_res.rid == RDT_RESOURCE_L3) { + /* TODO: Scaling is not yet supported */ + r->cache.cbm_len = class->props.cpbm_wd; + r->cache.arch_has_sparse_bitmasks = true; + + /* mpam_devices will reject empty bitmaps */ + r->cache.min_cbm_bits = 1; + + /* TODO: kill these properties off as they are derivatives */ + r->format_str = "%d=%0*x"; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = BIT_MASK(class->props.cpbm_wd) - 1; + r->data_width = (class->props.cpbm_wd + 3) / 4; + + /* + * Which bits are shared with other ...things... + * Unknown devices use partid-0 which uses all the bitmap + * fields. Until we configured the SMMU and GIC not to do this + * 'all the bits' is the correct answer here. + */ + r->cache.shareable_bits = r->default_ctrl; + + if (mpam_has_feature(mpam_feat_cpor_part, &class->props)) { + r->alloc_capable = true; + exposed_alloc_capable = true; + } + + if (class->level == 3 && cache_has_usable_csu(class)) { + r->mon_capable = true; + exposed_mon_capable = true; + } + }
return 0; }
-/* Called with the mpam classes lock held */ int mpam_resctrl_setup(void) { int err = 0; @@ -78,7 +204,7 @@ int mpam_resctrl_setup(void) res->resctrl_res.rid = i; }
- /* TODO: pick MPAM classes to map to resctrl resources */ + mpam_resctrl_pick_caches();
for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; @@ -103,7 +229,8 @@ int mpam_resctrl_setup(void) */ pr_warn("Number of PMG is not a power of 2! resctrl may misbehave"); } - err = resctrl_init(); + + /* TODO: call resctrl_init() */ }
return err; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 27c3ad9912ef..576bb97fa552 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -42,5 +42,12 @@ int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx,
bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); +bool resctrl_arch_is_llc_occupancy_enabled(void); +bool resctrl_arch_is_mbm_local_enabled(void); + +static inline bool resctrl_arch_is_mbm_total_enabled(void) +{ + return false; +}
#endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
After the changes to resctrl to support MPAM, num_rmid is only used as a value that is unfortunately exposed to user-space. For MPAM, this value doesn't mean anything, and whatever value we do expose will be wrong for some use cases.
User-space may expect it can use this value to know how many 'extra' monitor groups it can create. e.g. on x86 if num_closid=4, num_rmid=8, then a total of 4 monitor groups can be created. If num_rmid were 2, then only 2 control groups could be created. For MPAM the number of pmg is very likely to be smaller than the number of partid, but this doesn't restrict the creation of control groups, as each control group has its own pmg space.
Pick 1 if monitoring is supported.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 3e72cf514c4a..c2cf9432e50d 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -181,10 +181,22 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) exposed_alloc_capable = true; }
- if (class->level == 3 && cache_has_usable_csu(class)) { + if (class->level == 3 && cache_has_usable_csu(class)) r->mon_capable = true; - exposed_mon_capable = true; - } + } + + if (r->mon_capable) { + exposed_mon_capable = true; + + /* + * Unfortunately, num_rmid doesn't mean anything for + * mpam, and its exposed to user-space! + * num-rmid is supposed to mean the number of groups + * that can be created, both control or monitor groups. + * For mpam, each control group has its own pmg/rmid + * space. + */ + r->num_rmid = 1; }
return 0;
From: James Morse james.morse@arm.com
We already have a helper for reseting an mpam class. Hook it up to resctrl_arch_reset_resources().
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 2 +- drivers/platform/mpam/mpam_internal.h | 2 ++ drivers/platform/mpam/mpam_resctrl.c | 27 +++++++++++++++++++++++++++ include/linux/arm_mpam.h | 3 +++ 4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 28f6010df6dc..13f48878c3d5 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -2177,7 +2177,7 @@ static void mpam_enable_once(void) READ_ONCE(mpam_partid_max) + 1, mpam_pmg_max + 1); }
-static void mpam_reset_class(struct mpam_class *class) +void mpam_reset_class(struct mpam_class *class) { int idx; struct mpam_msc_ris *ris; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 579a8485d320..9c47d5a35d99 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -293,6 +293,8 @@ extern u8 mpam_pmg_max; void mpam_enable(struct work_struct *work); void mpam_disable(struct work_struct *work);
+void mpam_reset_class(struct mpam_class *class); + int mpam_apply_config(struct mpam_component *comp, u16 partid, struct mpam_config *cfg);
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index c2cf9432e50d..7a54dc7c8056 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -248,6 +248,33 @@ int mpam_resctrl_setup(void) return err; }
+void resctrl_arch_reset_resources(void) +{ + int i, idx; + struct mpam_class *class; + struct mpam_resctrl_res *res; + + lockdep_assert_cpus_held(); + + if (!mpam_is_enabled()) + return; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy resource + + if (!res->resctrl_res.alloc_capable) + continue; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) + mpam_reset_class(class); + srcu_read_unlock(&mpam_srcu, idx); + } +} + static struct mpam_resctrl_dom * mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) { diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 576bb97fa552..97d4c8f076e4 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -50,4 +50,7 @@ static inline bool resctrl_arch_is_mbm_total_enabled(void) return false; }
+/* reset cached configurations, then all devices */ +void resctrl_arch_reset_resources(void); + #endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
Implement resctrl_arch_get_config() by testing the configuration for a CPOR bitmap. For any other configuration type return the default.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 3 ++ drivers/platform/mpam/mpam_resctrl.c | 44 ++++++++++++++++++++++++++++ 2 files changed, 47 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 13f48878c3d5..5ebb944ad9fc 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -2302,6 +2302,9 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid,
lockdep_assert_cpus_held();
+ if (!memcmp(&comp->cfg[partid], cfg, sizeof(*cfg))) + return 0; + comp->cfg[partid] = *cfg; arg.comp = comp; arg.partid = partid; diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 7a54dc7c8056..940d2bf7ccd4 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -248,6 +248,50 @@ int mpam_resctrl_setup(void) return err; }
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type type) +{ + u32 partid; + struct mpam_config *cfg; + struct mpam_props *cprops; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + enum mpam_device_features configured_by; + + lockdep_assert_cpus_held(); + + if (!mpam_is_enabled()) + return r->default_ctrl; + + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + cprops = &res->class->props; + + partid = resctrl_get_config_index(closid, type); + cfg = &dom->comp->cfg[partid]; + + switch (r->rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + configured_by = mpam_feat_cpor_part; + break; + default: + return -EINVAL; + } + + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r) || + !mpam_has_feature(configured_by, cfg)) + return r->default_ctrl; + + switch (configured_by) { + case mpam_feat_cpor_part: + /* TODO: Scaling is not yet supported */ + return cfg->cpbm; + default: + return -EINVAL; + } +} + void resctrl_arch_reset_resources(void) { int i, idx;
From: James Morse james.morse@arm.com
resctrl has two helpers for updating the configuration. resctrl_arch_update_one() updates a single value, and is used by the software-controller to apply feedback to the bandwidth controls, it has to be called on one of the CPUs in the resctrl:domain.
resctrl_arch_update_domains() copies multiple staged configurations, it can be called from anywhere.
Both helpers should update any changes to the underlying hardware.
Imlpement resctrl_arch_update_domains() to use resctrl_arch_update_one(), which doesn't depend on being called on the right CPU.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_internal.h | 7 +-- drivers/platform/mpam/mpam_resctrl.c | 63 +++++++++++++++++++++++++++ 2 files changed, 64 insertions(+), 6 deletions(-)
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 9c47d5a35d99..49fc264d32f7 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -119,12 +119,7 @@ struct mpam_props };
#define mpam_has_feature(_feat, x) ((1<<_feat) & (x)->features) - -static inline void mpam_set_feature(enum mpam_device_features feat, - struct mpam_props *props) -{ - props->features |= (1<<feat); -} +#define mpam_set_feature(_feat, x) ((x)->features |= (1<<_feat))
static inline void mpam_clear_feature(enum mpam_device_features feat, mpam_features_t *supported) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 940d2bf7ccd4..9d808f68a122 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -292,6 +292,69 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, } }
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type t, u32 cfg_val) +{ + u32 partid; + struct mpam_config cfg; + struct mpam_props *cprops; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + lockdep_assert_irqs_enabled(); + + /* NOTE: don't check the CPU as mpam_apply_config() doesn't care, + * and resctrl_arch_update_domains() depends on this. */ + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + cprops = &res->class->props; + + partid = resctrl_get_config_index(closid, t); + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r)) + return -EINVAL; + + switch (r->rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + /* TODO: Scaling is not yet supported */ + cfg.cpbm = cfg_val; + mpam_set_feature(mpam_feat_cpor_part, &cfg); + break; + default: + return -EINVAL; + } + + return mpam_apply_config(dom->comp, partid, &cfg); +} + +/* TODO: this is IPI heavy */ +int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) +{ + int err = 0; + struct rdt_domain *d; + enum resctrl_conf_type t; + struct resctrl_staged_config *cfg; + + lockdep_assert_cpus_held(); + lockdep_assert_irqs_enabled(); + + list_for_each_entry(d, &r->domains, list) { + for (t = 0; t < CDP_NUM_TYPES; t++) { + cfg = &d->staged_config[t]; + if (!cfg->have_new_ctrl) + continue; + + err = resctrl_arch_update_one(r, d, closid, t, + cfg->new_ctrl); + if (err) + return err; + } + } + + return err; +} + void resctrl_arch_reset_resources(void) { int i, idx;
From: James Morse james.morse@arm.com
Intel RDT's CDP feature allows the cache to use a different control value depending on whether the accesses was for instruction fetch or a data access. MPAM's equivalent feature is the other way up: the CPU assigns a different partid label to traffic depending on whether it was instruction fetch or a data access, which causes the cache to use a different control value based solely on the partid.
MPAM can emulate CDP, with the side effect that the alternative partid is seen by all MSC, it can't be enabled per-MSC.
Add the resctrl hooks to turn this on or off. Add the helpers that match a closid against a task, which need to be aware that the value written to hardware is not the same as the one resctrl is using.
The context switch code needs to match the default resctrl group's value against a variable, as this value changes depending on whether CDP is in use.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/include/asm/mpam.h | 3 +- drivers/platform/mpam/mpam_resctrl.c | 62 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 13 ++++++ 3 files changed, 77 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index 1d81a6f26acd..edae8c98fa28 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -4,6 +4,7 @@ #ifndef __ASM__MPAM_H #define __ASM__MPAM_H
+#include <linux/arm_mpam.h> #include <linux/bitops.h> #include <linux/init.h> #include <linux/jump_label.h> @@ -108,7 +109,7 @@ static inline void mpam_thread_switch(struct task_struct *tsk) !static_branch_likely(&mpam_enabled)) return;
- if (!regval) + if (regval == READ_ONCE(mpam_resctrl_default_group)) regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu));
oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 9d808f68a122..eead81775f94 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -19,6 +19,8 @@
#include "mpam_internal.h"
+u64 mpam_resctrl_default_group; + /* * The classes we've picked to map to resctrl resources. * Class pointer may be NULL. @@ -29,6 +31,12 @@ static bool exposed_alloc_capable; static bool exposed_mon_capable; static struct mpam_class *mbm_local_class;
+/* + * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM1_EL1. + * This applies globally to all traffic the CPU generates. + */ +static bool cdp_enabled; + bool resctrl_arch_alloc_capable(void) { return exposed_alloc_capable; @@ -44,6 +52,36 @@ bool resctrl_arch_is_mbm_local_enabled(void) return mbm_local_class; }
+bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored) +{ + return cdp_enabled; +} + +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) +{ + u64 regval; + u32 partid, partid_i, partid_d; + + cdp_enabled = enable; + + partid = RESCTRL_RESERVED_CLOSID; + + if (enable) { + partid_d = resctrl_get_config_index(partid, CDP_CODE); + partid_i = resctrl_get_config_index(partid, CDP_DATA); + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d) | + FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + + } else { + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid) | + FIELD_PREP(MPAM_SYSREG_PARTID_I, partid); + } + + WRITE_ONCE(mpam_resctrl_default_group, regval); + + return 0; +} + /* * MSC may raise an error interrupt if it sees an out or range partid/pmg, * and go on to truncate the value. Regardless of what the hardware supports, @@ -54,6 +92,30 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); }
+bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM_SYSREG_PARTID_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return tsk_closid == closid; +} + +/* The task's pmg is not unique, the partid must be considered too */ +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM_SYSREG_PARTID_D, regval); + u32 tsk_rmid = FIELD_GET(MPAM_SYSREG_PMG_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return (tsk_closid == closid) && (tsk_rmid == rmid); +} + struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) { if (l >= RDT_NUM_RESOURCES) diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 97d4c8f076e4..e3921b0ab836 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -5,8 +5,17 @@ #define __LINUX_ARM_MPAM_H
#include <linux/acpi.h> +#include <linux/resctrl_types.h> #include <linux/types.h>
+/* + * The value of the MPAM1_EL1 sysreg when a task is in the default group. + * This is used by the context switch code to use the resctrl CPU property + * instead. The value is modified when CDP is enabled/disabled by mounting + * the resctrl filesystem. + */ +extern u64 mpam_resctrl_default_group; + struct mpam_msc;
enum mpam_msc_iface { @@ -53,4 +62,8 @@ static inline bool resctrl_arch_is_mbm_total_enabled(void) /* reset cached configurations, then all devices */ void resctrl_arch_reset_resources(void);
+bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored); +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable); +bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); #endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
Care must be taken when modifying the partid and pmg of a task, as writing these values may race with the task being scheduled in, and reading the modified values.
Add helpers to set the task properties, and the cpu default value, and add the plumbing to the mpam driver that lets resctrl use them.
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/include/asm/mpam.h | 44 +++++++++++++++++++++ drivers/platform/mpam/mpam_resctrl.c | 58 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 7 ++++ 3 files changed, 109 insertions(+)
diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index edae8c98fa28..9abe1fe58c34 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -6,6 +6,7 @@
#include <linux/arm_mpam.h> #include <linux/bitops.h> +#include <linux/bitfield.h> #include <linux/init.h> #include <linux/jump_label.h> #include <linux/percpu.h> @@ -90,6 +91,35 @@ static inline void __init __enable_mpam_hcr(void) * A value in struct thread_info is used instead of struct task_struct as the * cpu's u64 register format is used, but struct task_struct has two u32'. */ + static inline void mpam_set_cpu_defaults(int cpu, u16 partid_d, u16 partid_i, + u8 pmg_d, u8 pmg_i) +{ + u64 default_val; + + default_val = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d); + default_val |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + default_val |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d); + default_val |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i); + + WRITE_ONCE(per_cpu(arm64_mpam_default, cpu), default_val); +} + +static inline void mpam_set_task_partid_pmg(struct task_struct *tsk, + u16 partid_d, u16 partid_i, + u8 pmg_d, u8 pmg_i) +{ +#ifdef CONFIG_ARM64_MPAM + u64 regval; + + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d); + regval |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i); + + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); +#endif +} + static inline u64 mpam_get_regval(struct task_struct *tsk) { #ifdef CONFIG_ARM64_MPAM @@ -99,6 +129,20 @@ static inline u64 mpam_get_regval(struct task_struct *tsk) #endif }
+static inline void resctrl_arch_set_rmid(struct task_struct *tsk, u32 rmid) +{ +#ifdef CONFIG_ARM64_MPAM + u64 regval = mpam_get_regval(tsk); + + regval &= ~MPAM_SYSREG_PMG_D; + regval &= ~MPAM_SYSREG_PMG_I; + regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, rmid); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, rmid); + + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); +#endif +} + static inline void mpam_thread_switch(struct task_struct *tsk) { u64 oldregval; diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index eead81775f94..19a7807f8013 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -8,6 +8,7 @@ #include <linux/cpu.h> #include <linux/cpumask.h> #include <linux/errno.h> +#include <linux/limits.h> #include <linux/list.h> #include <linux/printk.h> #include <linux/rculist.h> @@ -92,6 +93,63 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); }
+void resctrl_sched_in(struct task_struct *tsk) +{ + lockdep_assert_preemption_disabled(); + + mpam_thread_switch(tsk); +} + +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg) +{ + BUG_ON(closid > U16_MAX); + BUG_ON(pmg > U8_MAX); + + if (!cdp_enabled) { + mpam_set_cpu_defaults(cpu, closid, closid, pmg, pmg); + } else { + /* + * When CDP is enabled, resctrl halves the closid range and we + * use odd/even partid for one closid. + */ + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_cpu_defaults(cpu, partid_d, partid_i, pmg, pmg); + } +} + +void resctrl_arch_sync_cpu_defaults(void *info) +{ + struct resctrl_cpu_sync *r = info; + + lockdep_assert_preemption_disabled(); + + if (r) { + resctrl_arch_set_cpu_default_closid_rmid(smp_processor_id(), + r->closid, r->rmid); + } + + resctrl_sched_in(current); +} + +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + + + BUG_ON(closid > U16_MAX); + BUG_ON(rmid > U8_MAX); + + if (!cdp_enabled) { + mpam_set_task_partid_pmg(tsk, closid, closid, rmid, rmid); + } else { + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_task_partid_pmg(tsk, partid_d, partid_i, rmid, rmid); + } +} + bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) { u64 regval = mpam_get_regval(tsk); diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index e3921b0ab836..e2428a0dbe2f 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -16,6 +16,8 @@ */ extern u64 mpam_resctrl_default_group;
+#include <asm/mpam.h> + struct mpam_msc;
enum mpam_msc_iface { @@ -66,4 +68,9 @@ bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored); int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable); bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg); +void resctrl_sched_in(struct task_struct *tsk); + #endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
Because MPAM's pmg aren't identical to RDT's rmid, resctrl handles some datastructrues by index. This allows x86 to map indexes to RMID, and MPAM to map them to partid-and-pmg.
Add the helpers to do this.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 28 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 3 +++ 2 files changed, 31 insertions(+)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 19a7807f8013..01d43ce9911a 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -93,6 +93,34 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); }
+u32 resctrl_arch_system_num_rmid_idx(void) +{ + u8 closid_shift = fls(mpam_pmg_max); + u32 num_partid = resctrl_arch_get_num_closid(NULL); + + return num_partid << closid_shift; +} + +u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) +{ + u8 closid_shift = fls(mpam_pmg_max); + + BUG_ON(closid_shift > 8); + + return (closid << closid_shift) | rmid; +} + +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) +{ + u8 closid_shift = fls(mpam_pmg_max); + u32 pmg_mask = ~(~0 << closid_shift); + + BUG_ON(closid_shift > 8); + + *closid = idx >> closid_shift; + *rmid = idx & pmg_mask; +} + void resctrl_sched_in(struct task_struct *tsk) { lockdep_assert_preemption_disabled(); diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index e2428a0dbe2f..3ba466659058 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -72,5 +72,8 @@ void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid); void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg); void resctrl_sched_in(struct task_struct *tsk); +u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); +u32 resctrl_arch_system_num_rmid_idx(void);
#endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
resctrl supports 'MB', as a percentage throttling of traffic somewhere after the L3. This is the control that mba_sc uses, so ideally the class chosen should be as close as possible to the counters used for mba_local.
MB's percentage control can be backed either with the fixed point fraction MBW_MAX or the bandwidth portion bitmap. Add helper to convert to/from percentages. One problem here is the value written is not the same as the value read back. This is deliberatly made visible to user-space. Another is the MBW_MAX fixed point fraction can't represent 100%. This is also exposed to user-space, as otherwise the values for a single-bit system is 100%, 0%, instead of 50%, 0%.
The way CDP is emulated means MB controls need programming twice by the resctrl glue, as the bandwidth controls can be applied independently for data or instruction-fetch. This isn't how x86 behaves, and neither user-space nor resctrl support it.
CC: Amit Singh Tomar amitsinght@marvell.com Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 219 ++++++++++++++++++++++++++- 1 file changed, 216 insertions(+), 3 deletions(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 01d43ce9911a..ce7216ae3817 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -53,9 +53,20 @@ bool resctrl_arch_is_mbm_local_enabled(void) return mbm_local_class; }
-bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored) +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) { - return cdp_enabled; + switch (rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + return cdp_enabled; + case RDT_RESOURCE_MBA: + default: + /* + * x86's MBA control doesn't support CDP, so user-space doesn't + * expect it. + */ + return false; + } }
int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) @@ -83,6 +94,11 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) return 0; }
+static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) +{ + return cdp_enabled && !resctrl_arch_get_cdp_enabled(rid); +} + /* * MSC may raise an error interrupt if it sees an out or range partid/pmg, * and go on to truncate the value. Regardless of what the hardware supports, @@ -248,6 +264,102 @@ bool resctrl_arch_is_llc_occupancy_enabled(void) return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); }
+static bool mba_class_use_mbw_part(struct mpam_props *cprops) +{ + /* TODO: Scaling is not yet supported */ + return (mpam_has_feature(mpam_feat_mbw_part, cprops) && + cprops->mbw_pbm_bits < MAX_MBA_BW); +} + +static bool class_has_usable_mba(struct mpam_props *cprops) +{ + if (mba_class_use_mbw_part(cprops) || + mpam_has_feature(mpam_feat_mbw_max, cprops)) + return true; + + return false; +} + +/* + * Calculate the percentage change from each implemented bit in the control + * This can return 0 when BWA_WD is greater than 6. (100 / (1<<7) == 0) + */ +static u32 get_mba_granularity(struct mpam_props *cprops) +{ + if (mba_class_use_mbw_part(cprops)) { + return MAX_MBA_BW / cprops->mbw_pbm_bits; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + /* + * bwa_wd is the number of bits implemented in the 0.xxx + * fixed point fraction. 1 bit is 50%, 2 is 25% etc. + */ + return MAX_MBA_BW / (cprops->bwa_wd + 1); + } + + return 0; +} + +static u32 mbw_pbm_to_percent(unsigned long mbw_pbm, struct mpam_props *cprops) +{ + u32 bit, result = 0, granularity = get_mba_granularity(cprops); + + for_each_set_bit(bit, &mbw_pbm, cprops->mbw_pbm_bits % 32) { + result += granularity; + } + + return result; +} + +static u32 mbw_max_to_percent(u16 mbw_max, struct mpam_props *cprops) +{ + u8 bit; + u32 divisor = 2, value = 0; + + for (bit = 15; bit; bit--) { + if (mbw_max & BIT(bit)) + value += MAX_MBA_BW / divisor; + divisor <<= 1; + } + + return value; +} + +static u32 percent_to_mbw_pbm(u8 pc, struct mpam_props *cprops) +{ + u32 granularity = get_mba_granularity(cprops); + u8 num_bits = pc / granularity; + + if (!num_bits) + return 0; + + /* TODO: pick bits at random to avoid contention */ + return (1 << num_bits) - 1; +} + +static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops) +{ + u8 bit; + u32 divisor = 2, value = 0; + + if (WARN_ON_ONCE(cprops->bwa_wd > 15)) + return MAX_MBA_BW; + + for (bit = 15; bit; bit--) { + if (pc >= MAX_MBA_BW / divisor) { + pc -= MAX_MBA_BW / divisor; + value |= BIT(bit); + } + divisor <<= 1; + + if (!pc || !(MAX_MBA_BW / divisor)) + break; + } + + value &= GENMASK(15, 15 - cprops->bwa_wd); + + return value; +} + /* Test whether we can export MPAM_CLASS_CACHE:{2,3}? */ static void mpam_resctrl_pick_caches(void) { @@ -295,6 +407,44 @@ static void mpam_resctrl_pick_caches(void) srcu_read_unlock(&mpam_srcu, idx); }
+static void mpam_resctrl_pick_mba(void) +{ + struct mpam_class *class, *candidate_class = NULL; + struct mpam_resctrl_res *res; + int idx; + + lockdep_assert_cpus_held(); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + struct mpam_props *cprops = &class->props; + + if (class->level < 3) + continue; + + if (!class_has_usable_mba(cprops)) + continue; + + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) + continue; + + /* + * mba_sc reads the mbm_local counter, and waggles the MBA controls. + * mbm_local is implicitly part of the L3, pick a resouce to be MBA + * that as close as possible to the L3. + */ + if (!candidate_class || class->level < candidate_class->level) + candidate_class = class; + } + srcu_read_unlock(&mpam_srcu, idx); + + if (candidate_class) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MBA]; + res->class = candidate_class; + res->resctrl_res.name = "MB"; + } +} + static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { struct mpam_class *class = res->class; @@ -331,6 +481,27 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
if (class->level == 3 && cache_has_usable_csu(class)) r->mon_capable = true; + } else if (res->resctrl_res.rid == RDT_RESOURCE_MBA) { + struct mpam_props *cprops = &class->props; + + /* TODO: kill these properties off as they are derivatives */ + r->format_str = "%d=%0*u"; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = MAX_MBA_BW; + r->data_width = 3; + + r->membw.delay_linear = true; + r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED; + r->membw.bw_gran = get_mba_granularity(cprops); + + /* Round up to at least 1% */ + if (!r->membw.bw_gran) + r->membw.bw_gran = 1; + + if (class_has_usable_mba(cprops)) { + r->alloc_capable = true; + exposed_alloc_capable = true; + } }
if (r->mon_capable) { @@ -365,6 +536,7 @@ int mpam_resctrl_setup(void) }
mpam_resctrl_pick_caches(); + mpam_resctrl_pick_mba();
for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; @@ -423,6 +595,15 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case RDT_RESOURCE_L3: configured_by = mpam_feat_cpor_part; break; + case RDT_RESOURCE_MBA: + if (mba_class_use_mbw_part(cprops)) { + configured_by = mpam_feat_mbw_part; + break; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + configured_by = mpam_feat_mbw_max; + break; + } + fallthrough; default: return -EINVAL; } @@ -435,6 +616,11 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case mpam_feat_cpor_part: /* TODO: Scaling is not yet supported */ return cfg->cpbm; + case mpam_feat_mbw_part: + /* TODO: Scaling is not yet supported */ + return mbw_pbm_to_percent(cfg->mbw_pbm, cprops); + case mpam_feat_mbw_max: + return mbw_max_to_percent(cfg->mbw_max, cprops); default: return -EINVAL; } @@ -443,6 +629,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type t, u32 cfg_val) { + int err; u32 partid; struct mpam_config cfg; struct mpam_props *cprops; @@ -469,11 +656,37 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, cfg.cpbm = cfg_val; mpam_set_feature(mpam_feat_cpor_part, &cfg); break; + case RDT_RESOURCE_MBA: + if (mba_class_use_mbw_part(cprops)) { + cfg.mbw_pbm = percent_to_mbw_pbm(cfg_val, cprops); + mpam_set_feature(mpam_feat_mbw_part, &cfg); + break; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops); + mpam_set_feature(mpam_feat_mbw_max, &cfg); + break; + } + fallthrough; default: return -EINVAL; }
- return mpam_apply_config(dom->comp, partid, &cfg); + /* + * When CDP is enabled, but the resource doesn't support it, we need to + * apply the same configuration to the other partid. + */ + if (mpam_resctrl_hide_cdp(r->rid)) { + partid = resctrl_get_config_index(closid, CDP_CODE); + err = mpam_apply_config(dom->comp, partid, &cfg); + if (err) + return err; + + partid = resctrl_get_config_index(closid, CDP_DATA); + return mpam_apply_config(dom->comp, partid, &cfg); + + } else { + return mpam_apply_config(dom->comp, partid, &cfg); + } }
/* TODO: this is IPI heavy */
From: James Morse james.morse@arm.com
resctrl has two types of counters, NUMA-local and global. MPAM has only bandwidth counters, but the position of the MSC may mean it counts NUMA-local, or global traffic.
But the topology information is not available.
Apply a hueristic: the L2 or L3 supports bandwidth monitors, these are probably NUMA-local. If the memory controller supports bandwidth monitors, they are probably global.
This selection is made from mpam_resctrl_resource_init(), which implies resources that can be used for resctrl controls exist and also have counters. This would be a problem on a platform that only supports monitoring.
TODO: Add an extra pass of all classes to find the classes to use as bandwidth counters.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 52 +++++++++++++++++++++++++++- include/linux/arm_mpam.h | 6 +--- 2 files changed, 52 insertions(+), 6 deletions(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index ce7216ae3817..19916d9149c5 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -31,6 +31,7 @@ static struct mpam_resctrl_res mpam_resctrl_exports[RDT_NUM_RESOURCES]; static bool exposed_alloc_capable; static bool exposed_mon_capable; static struct mpam_class *mbm_local_class; +static struct mpam_class *mbm_total_class;
/* * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM1_EL1. @@ -53,6 +54,11 @@ bool resctrl_arch_is_mbm_local_enabled(void) return mbm_local_class; }
+bool resctrl_arch_is_mbm_total_enabled(void) +{ + return mbm_total_class; +} + bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) { switch (rid) { @@ -264,6 +270,24 @@ bool resctrl_arch_is_llc_occupancy_enabled(void) return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); }
+static bool class_has_usable_mbwu(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops)) + return false; + + /* + * resctrl expects the bandwidth counters to be free running, + * which means we need as many monitors as resctrl has + * control/monitor groups. + */ + if (cprops->num_mbwu_mon < resctrl_arch_system_num_rmid_idx()) + return false; + + return (mpam_partid_max > 1) || (mpam_pmg_max != 0); +} + static bool mba_class_use_mbw_part(struct mpam_props *cprops) { /* TODO: Scaling is not yet supported */ @@ -449,10 +473,13 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { struct mpam_class *class = res->class; struct rdt_resource *r = &res->resctrl_res; + bool has_mbwu = class_has_usable_mbwu(class);
/* Is this one of the two well-known caches? */ if (res->resctrl_res.rid == RDT_RESOURCE_L2 || res->resctrl_res.rid == RDT_RESOURCE_L3) { + bool has_csu = cache_has_usable_csu(class); + /* TODO: Scaling is not yet supported */ r->cache.cbm_len = class->props.cpbm_wd; r->cache.arch_has_sparse_bitmasks = true; @@ -479,8 +506,25 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) exposed_alloc_capable = true; }
- if (class->level == 3 && cache_has_usable_csu(class)) + /* + * MBWU counters may be 'local' or 'total' depending on where + * they are in the topology. Counters on caches are assumed to + * be local. If it's on the memory controller, its assumed to + * be global. + */ + if (has_mbwu && class->level >= 3) { + mbm_local_class = class; r->mon_capable = true; + } + + /* + * CSU counters only make sense on a cache. The file is called + * llc_occupancy, but its expected to the on the L3. + */ + if (has_csu && class->type == MPAM_CLASS_CACHE && + class->level == 3) { + r->mon_capable = true; + } } else if (res->resctrl_res.rid == RDT_RESOURCE_MBA) { struct mpam_props *cprops = &class->props;
@@ -502,6 +546,11 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) r->alloc_capable = true; exposed_alloc_capable = true; } + + if (has_mbwu && class->type == MPAM_CLASS_MEMORY) { + mbm_total_class = class; + r->mon_capable = true; + } }
if (r->mon_capable) { @@ -537,6 +586,7 @@ int mpam_resctrl_setup(void)
mpam_resctrl_pick_caches(); mpam_resctrl_pick_mba(); + /* TODO: mpam_resctrl_pick_counters(); */
for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 3ba466659058..94179b86a9e6 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -55,11 +55,7 @@ bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); bool resctrl_arch_is_llc_occupancy_enabled(void); bool resctrl_arch_is_mbm_local_enabled(void); - -static inline bool resctrl_arch_is_mbm_total_enabled(void) -{ - return false; -} +bool resctrl_arch_is_mbm_total_enabled(void);
/* reset cached configurations, then all devices */ void resctrl_arch_reset_resources(void);
From: James Morse james.morse@arm.com
When resctrl wants to read a domain's 'QOS_L3_OCCUP', it needs to allocate a monitor on the corresponding resource. Monitors are allocated by class instead of component.
Add helpers to do this.
The MBM events depend on having their monitors allocated at init time so that they can be left running. The value USE_RMID_IDX is out of range for a monitor, and is used to indicate this behaviour.
resctrl_arch_mon_ctx_alloc() is implemented to have a no_wait version and a waitqueue for callers that sleep. The no_wait version will later become an interface for the resctrl_pmu to use.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_internal.h | 3 ++ drivers/platform/mpam/mpam_resctrl.c | 72 +++++++++++++++++++++++++++ include/linux/arm_mpam.h | 4 ++ 3 files changed, 79 insertions(+)
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 49fc264d32f7..4d982469b9f5 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -17,6 +17,9 @@
DECLARE_STATIC_KEY_FALSE(mpam_enabled);
+/* Value to indicate the allocated monitor is derived from the RMID index. */ +#define USE_RMID_IDX (U16_MAX + 1) + static inline bool mpam_is_enabled(void) { return static_branch_likely(&mpam_enabled); diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 19916d9149c5..70b1879de10a 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -22,6 +22,8 @@
u64 mpam_resctrl_default_group;
+DECLARE_WAIT_QUEUE_HEAD(resctrl_mon_ctx_waiters); + /* * The classes we've picked to map to resctrl resources. * Class pointer may be NULL. @@ -39,6 +41,10 @@ static struct mpam_class *mbm_total_class; */ static bool cdp_enabled;
+/* A dummy mon context to use when the monitors were allocated up front */ +u32 __mon_is_rmid_idx = USE_RMID_IDX; +void *mon_is_rmid_idx = &__mon_is_rmid_idx; + bool resctrl_arch_alloc_capable(void) { return exposed_alloc_capable; @@ -232,6 +238,72 @@ struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) return &mpam_resctrl_exports[l].resctrl_res; }
+static void *resctrl_arch_mon_ctx_alloc_no_wait(struct rdt_resource *r, + int evtid) +{ + struct mpam_resctrl_res *res; + u32 *ret = kmalloc(sizeof(*ret), GFP_KERNEL); + + if (!ret) + return ERR_PTR(-ENOMEM); + + switch (evtid) { + case QOS_L3_OCCUP_EVENT_ID: + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + + *ret = mpam_alloc_csu_mon(res->class); + return ret; + case QOS_L3_MBM_LOCAL_EVENT_ID: + case QOS_L3_MBM_TOTAL_EVENT_ID: + return mon_is_rmid_idx; + } + + return ERR_PTR(-EOPNOTSUPP); +} + +void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid) +{ + DEFINE_WAIT(wait); + void *ret; + + might_sleep(); + + do { + prepare_to_wait(&resctrl_mon_ctx_waiters, &wait, + TASK_INTERRUPTIBLE); + ret = resctrl_arch_mon_ctx_alloc_no_wait(r, evtid); + if (PTR_ERR(ret) == -ENOSPC) + schedule(); + } while (PTR_ERR(ret) == -ENOSPC && !signal_pending(current)); + finish_wait(&resctrl_mon_ctx_waiters, &wait); + + return ret; +} + +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, + void *arch_mon_ctx) +{ + struct mpam_resctrl_res *res; + u32 mon = *(u32 *)arch_mon_ctx; + + if (mon == USE_RMID_IDX) + return; + kfree(arch_mon_ctx); + arch_mon_ctx = NULL; + + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + + switch (evtid) { + case QOS_L3_OCCUP_EVENT_ID: + mpam_free_csu_mon(res->class, mon); + wake_up(&resctrl_mon_ctx_waiters); + return; + case QOS_L3_MBM_TOTAL_EVENT_ID: + case QOS_L3_MBM_LOCAL_EVENT_ID: + return; + } +} + static bool cache_has_usable_cpor(struct mpam_class *class) { struct mpam_props *cprops = &class->props; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 94179b86a9e6..7206a35bb205 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -72,4 +72,8 @@ u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); u32 resctrl_arch_system_num_rmid_idx(void);
+struct rdt_resource; +void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid); +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx); + #endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
resctrl uses resctrl_arch_rmid_read() to read counters. CDP emulation means the counter may need reading twice to get both the I and D side allocations. The same goes for reset.
Add the rounding helper for checking monitor values while we're here.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 79 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 4 ++ 2 files changed, 83 insertions(+)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 70b1879de10a..d0140943bd9c 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -304,6 +304,85 @@ void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, } }
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid, + u64 *val, void *arch_mon_ctx) +{ + int err; + u64 cdp_val; + struct mon_cfg cfg; + struct mpam_resctrl_dom *dom; + u32 mon = *(u32 *)arch_mon_ctx; + enum mpam_device_features type; + + resctrl_arch_rmid_read_context_check(); + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + switch (eventid) { + case QOS_L3_OCCUP_EVENT_ID: + type = mpam_feat_msmon_csu; + break; + case QOS_L3_MBM_LOCAL_EVENT_ID: + case QOS_L3_MBM_TOTAL_EVENT_ID: + type = mpam_feat_msmon_mbwu; + break; + default: + return -EINVAL; + } + + cfg.mon = mon; + if (cfg.mon == USE_RMID_IDX) + cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid); + + cfg.match_pmg = true; + cfg.pmg = rmid; + + if (cdp_enabled) { + cfg.partid = closid << 1; + err = mpam_msmon_read(dom->comp, &cfg, type, val); + if (err) + return err; + + cfg.partid += 1; + err = mpam_msmon_read(dom->comp, &cfg, type, &cdp_val); + if (!err) + *val += cdp_val; + } else { + cfg.partid = closid; + err = mpam_msmon_read(dom->comp, &cfg, type, val); + } + + return err; +} + +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid) +{ + struct mon_cfg cfg; + struct mpam_resctrl_dom *dom; + + if (eventid != QOS_L3_MBM_LOCAL_EVENT_ID) + return; + + cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid); + cfg.match_pmg = true; + cfg.pmg = rmid; + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + if (cdp_enabled) { + cfg.partid = closid << 1; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + + cfg.partid += 1; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + } else { + cfg.partid = closid; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + } +} + static bool cache_has_usable_cpor(struct mpam_class *class) { struct mpam_props *cprops = &class->props; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 7206a35bb205..0f2522c2b11c 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -50,6 +50,10 @@ int mpam_register_requestor(u16 partid_max, u8 pmg_max); int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, enum mpam_class_types type, u8 class_id, int component_id);
+static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) +{ + return val; +}
bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void);
From: James Morse james.morse@arm.com
MPAM MSCs may have support for filtering reads or writes when monitoring traffic. Resctrl has a configuration bitmap for which kind of accesses should be monitored.
Bridge the gap where possible. MPAM only has a read/write bit, so not all the combinations can be supported.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 26 +++++++++++ drivers/platform/mpam/mpam_internal.h | 9 ++++ drivers/platform/mpam/mpam_resctrl.c | 62 +++++++++++++++++++++++++++ 3 files changed, 97 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 5ebb944ad9fc..c922f2b264cc 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1076,6 +1076,32 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, return err; }
+void mpam_msmon_reset_all_mbwu(struct mpam_component *comp) +{ + int idx, i; + unsigned long flags; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + if (!mpam_is_enabled()) + return; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + if (!mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + continue; + + msc = ris->msc; + spin_lock_irqsave(&msc->mon_sel_lock, flags); + for(i = 0; i < ris->props.num_mbwu_mon; i++) { + ris->mbwu_state[i].correction = 0; + ris->mbwu_state[i].reset_on_next_read = true; + } + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + srcu_read_unlock(&mpam_srcu, idx); +} + void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx) { int idx; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 4d982469b9f5..1f7c0ddcf8b1 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -20,6 +20,12 @@ DECLARE_STATIC_KEY_FALSE(mpam_enabled); /* Value to indicate the allocated monitor is derived from the RMID index. */ #define USE_RMID_IDX (U16_MAX + 1)
+/* + * Only these event configuration bits are supported. MPAM can't know if + * data is being written back, these will show up as a write. + */ +#define MPAM_RESTRL_EVT_CONFIG_VALID (READS_TO_LOCAL_MEM | NON_TEMP_WRITE_TO_LOCAL_MEM) + static inline bool mpam_is_enabled(void) { return static_branch_likely(&mpam_enabled); @@ -240,6 +246,8 @@ struct mpam_msc_ris { struct mpam_resctrl_dom { struct mpam_component *comp; struct rdt_domain resctrl_dom; + + u32 mbm_local_evt_cfg; };
struct mpam_resctrl_res { @@ -299,6 +307,7 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid, int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, enum mpam_device_features, u64 *val); void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx); +void mpam_msmon_reset_all_mbwu(struct mpam_component *comp);
int mpam_resctrl_online_cpu(unsigned int cpu); int mpam_resctrl_offline_cpu(unsigned int cpu); diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index d0140943bd9c..d0becc296108 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -304,6 +304,18 @@ void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, } }
+static enum mon_filter_options resctrl_evt_config_to_mpam(u32 local_evt_cfg) +{ + switch (local_evt_cfg) { + case READS_TO_LOCAL_MEM: + return COUNT_READ; + case NON_TEMP_WRITE_TO_LOCAL_MEM: + return COUNT_WRITE; + default: + return COUNT_BOTH; + } +} + int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id eventid, u64 *val, void *arch_mon_ctx) @@ -337,6 +349,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
cfg.match_pmg = true; cfg.pmg = rmid; + cfg.opts = resctrl_evt_config_to_mpam(dom->mbm_local_evt_cfg);
if (cdp_enabled) { cfg.partid = closid << 1; @@ -620,6 +633,54 @@ static void mpam_resctrl_pick_mba(void) } }
+bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) +{ + struct mpam_props *cprops; + + switch (evt) { + case QOS_L3_MBM_LOCAL_EVENT_ID: + if (!mbm_local_class) + return false; + cprops = &mbm_local_class->props; + + return mpam_has_feature(mpam_feat_msmon_mbwu_rwbw, cprops); + default: + return false; + } +} + +void resctrl_arch_mon_event_config_read(void *info) +{ + struct mpam_resctrl_dom *dom; + struct resctrl_mon_config_info *mon_info = info; + + dom = container_of(mon_info->d, struct mpam_resctrl_dom, resctrl_dom); + mon_info->mon_config = dom->mbm_local_evt_cfg & MAX_EVT_CONFIG_BITS; +} + +void resctrl_arch_mon_event_config_write(void *info) +{ + struct mpam_resctrl_dom *dom; + struct resctrl_mon_config_info *mon_info = info; + + if (mon_info->mon_config & ~MPAM_RESTRL_EVT_CONFIG_VALID) { + mon_info->err = -EOPNOTSUPP; + return; + } + + dom = container_of(mon_info->d, struct mpam_resctrl_dom, resctrl_dom); + dom->mbm_local_evt_cfg = mon_info->mon_config & MPAM_RESTRL_EVT_CONFIG_VALID; +} + +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) +{ + struct mpam_resctrl_dom *dom; + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + dom->mbm_local_evt_cfg = MPAM_RESTRL_EVT_CONFIG_VALID; + mpam_msmon_reset_all_mbwu(dom->comp); +} + static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { struct mpam_class *class = res->class; @@ -970,6 +1031,7 @@ mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) dom->comp = comp; INIT_LIST_HEAD(&dom->resctrl_dom.list); dom->resctrl_dom.id = comp->comp_id; + dom->mbm_local_evt_cfg = MPAM_RESTRL_EVT_CONFIG_VALID; cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask);
/* TODO: this list should be sorted */
From: James Morse james.morse@arm.com
Pseudo lock isn't supported on arm64. Add empty definitions of the functions arm64 doesn't implement. Because the Kconfig option is not selected, none of these will be called.
Signed-off-by: James Morse james.morse@arm.com --- include/linux/arm_mpam.h | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 0f2522c2b11c..796ce59cc540 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -80,4 +80,11 @@ struct rdt_resource; void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid); void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx);
+/* Pseudo lock is not supported by MPAM */ +static inline int resctrl_arch_pseudo_lock_fn(void *_plr) { return 0; } +static inline int resctrl_arch_measure_l2_residency(void *_plr) { return 0; } +static inline int resctrl_arch_measure_l3_residency(void *_plr) { return 0; } +static inline int resctrl_arch_measure_cycles_lat_fn(void *_plr) { return 0; } +static inline u64 resctrl_arch_get_prefetch_disable_bits(void) { return 0; } + #endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
resctrl has individual hooks to separately enable and disable the closid/partid and rmid/pmg context switching code.
For MPAM this is all the same thing, as the value in struct task_struct is used to cache the value that should be written to hardware. arm64's context switching code is enabled once MPAM is usable, but doesn't touch the hardware unless the value has changed.
Resctrl doesn't need to ask. Add empty definitions for these hoooks.
Signed-off-by: James Morse james.morse@arm.com --- include/linux/arm_mpam.h | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 796ce59cc540..5fd62e0d131d 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -87,4 +87,13 @@ static inline int resctrl_arch_measure_l3_residency(void *_plr) { return 0; } static inline int resctrl_arch_measure_cycles_lat_fn(void *_plr) { return 0; } static inline u64 resctrl_arch_get_prefetch_disable_bits(void) { return 0; }
+/* + * The CPU configuration for MPAM is cheap to write, and is only written if it + * has changed. No need for fine grained enables. + */ +static inline void resctrl_arch_enable_mon(void) { } +static inline void resctrl_arch_disable_mon(void) { } +static inline void resctrl_arch_enable_alloc(void) { } +static inline void resctrl_arch_disable_alloc(void) { } + #endif /* __LINUX_ARM_MPAM_H */
From: James Morse james.morse@arm.com
resctrl expects RDT like counters that are free running. MPAM's counters don't behave like this as they need a monitor to be allocated first.
Provide the helper that says whether free running counters are supported.
Subsequent patches will make this more intelligent.
Signed-off-by: James Morse james.morse@arm.com --- include/linux/arm_mpam.h | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 5fd62e0d131d..d70e4e726fe6 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -55,6 +55,12 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) return val; }
+/* MPAM counters requires a monitor to be allocated */ +static inline bool resctrl_arch_event_is_free_running(enum resctrl_event_id evt) +{ + return false; +} + bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); bool resctrl_arch_is_llc_occupancy_enabled(void);
From: James Morse james.morse@arm.com
Enough MPAM support is present to enable ARCH_HAS_CPU_RESCTRL. Let it rip^Wlink!
Signed-off-by: James Morse james.morse@arm.com --- arch/arm64/Kconfig | 3 ++- arch/arm64/include/asm/resctrl.h | 1 + drivers/platform/mpam/Kconfig | 4 +++- 3 files changed, 6 insertions(+), 2 deletions(-) create mode 100644 arch/arm64/include/asm/resctrl.h
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7535cc94ce01..dffd8c8ede80 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2056,7 +2056,8 @@ config ARM64_TLB_RANGE config ARM64_MPAM bool "Enable support for MPAM" select ACPI_MPAM if ACPI - select ARM_CPU_RESCTRL + select ARCH_HAS_CPU_RESCTRL + select RESCTRL_FS help Memory Partitioning and Monitoring is an optional extension that allows the CPUs to mark load and store transactions with diff --git a/arch/arm64/include/asm/resctrl.h b/arch/arm64/include/asm/resctrl.h new file mode 100644 index 000000000000..bf33f82f04e2 --- /dev/null +++ b/arch/arm64/include/asm/resctrl.h @@ -0,0 +1 @@ +#include <linux/arm_mpam.h> diff --git a/drivers/platform/mpam/Kconfig b/drivers/platform/mpam/Kconfig index 13bd86fc5e58..75f5b2454fbe 100644 --- a/drivers/platform/mpam/Kconfig +++ b/drivers/platform/mpam/Kconfig @@ -2,5 +2,7 @@ # CPU resources, not containers or cgroups etc. config ARM_CPU_RESCTRL bool - depends on ARM64 + default y + depends on ARM64 && ARCH_HAS_CPU_RESCTRL + depends on MISC_FILESYSTEMS select RESCTRL_RMID_DEPENDS_ON_CLOSID
From: James Morse james.morse@arm.com
Carl reports that when both the MPAM driver and CMN driver are built into the kernel, they fight over who can claim the resources associated with their registers. This prevents the second of these two drivers from probing.
Currently the CMN PMU driver claims all the CMN registers.
The MPAM registers are grouped together in a small number of pages, whereas the PMU registers that the CMN PMU driver uses appear throughout the CMN register space.
Having the CMN driver claim all the resources is the wrong thing to do, and claiming individual registers here and there is not worthwhile. Instead, stop the CMN driver from claiming any resources as its registers are not grouped together.
Reported-by: Carl Worth carl@os.amperecomputing.com Tested-by: Carl Worth carl@os.amperecomputing.com Signed-off-by: James Morse james.morse@arm.com CC: Ilkka Koskinen ilkka@os.amperecomputing.com --- drivers/perf/arm-cmn.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c index caae2d3e9d3e..10598a4f932a 100644 --- a/drivers/perf/arm-cmn.c +++ b/drivers/perf/arm-cmn.c @@ -2428,6 +2428,7 @@ static int arm_cmn_probe(struct platform_device *pdev) struct arm_cmn *cmn; const char *name; static atomic_t id; + struct resource *cfg; int err, rootnode, this_id;
cmn = devm_kzalloc(&pdev->dev, sizeof(*cmn), GFP_KERNEL); @@ -2442,7 +2443,16 @@ static int arm_cmn_probe(struct platform_device *pdev) rootnode = arm_cmn600_acpi_probe(pdev, cmn); } else { rootnode = 0; - cmn->base = devm_platform_ioremap_resource(pdev, 0); + + /* + * Avoid registering resources as the PMUs registers are + * scattered through CMN, and may appear either side of + * registers for other 'devices'. (e.g. the MPAM MSC controls). + */ + cfg = platform_get_resource(pdev, IORESOURCE_MEM, 0); + if (!cfg) + return PTR_ERR(-EINVAL); + cmn->base = devm_ioremap(&pdev->dev, cfg->start, resource_size(cfg)); if (IS_ERR(cmn->base)) return PTR_ERR(cmn->base); if (cmn->part == PART_CMN600)
From: James Morse james.morse@arm.com
Now that mpam links against resctrl, call the cpu and domain online/offline calls at the appropriate point.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index d0becc296108..c862f9015c57 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -824,7 +824,7 @@ int mpam_resctrl_setup(void) pr_warn("Number of PMG is not a power of 2! resctrl may misbehave"); }
- /* TODO: call resctrl_init() */ + err = resctrl_init(); }
return err; @@ -1077,7 +1077,7 @@ struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id)
int mpam_resctrl_online_cpu(unsigned int cpu) { - int i; + int i, err; struct mpam_resctrl_dom *dom; struct mpam_resctrl_res *res;
@@ -1096,8 +1096,12 @@ int mpam_resctrl_online_cpu(unsigned int cpu) dom = mpam_resctrl_alloc_domain(cpu, res); if (IS_ERR(dom)) return PTR_ERR(dom); + err = resctrl_online_domain(&res->resctrl_res, &dom->resctrl_dom); + if (err) + return err; }
+ resctrl_online_cpu(cpu); return 0; }
@@ -1108,6 +1112,8 @@ int mpam_resctrl_offline_cpu(unsigned int cpu) struct mpam_resctrl_res *res; struct mpam_resctrl_dom *dom;
+ resctrl_offline_cpu(cpu); + for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i];
@@ -1126,6 +1132,7 @@ int mpam_resctrl_offline_cpu(unsigned int cpu) if (!cpumask_empty(&d->cpu_mask)) continue;
+ resctrl_offline_domain(&res->resctrl_res, &dom->resctrl_dom); list_del(&d->list); kfree(dom); }
From: James Morse james.morse@arm.com
All of MPAMs errors indicate a software bug, e.g. an out-of-bounds partid has been generated. When this happens, the mpam driver is disabled.
If resctrl_init() succeeded, also call resctrl_exit() to remove resctrl.
If the filesystem was mounted in its traditional place, it is no longer possible for processes to find it as the mount point has been removed. If the filesystem was mounted elsewhere, it will appear that all CPU and domains are offline. User-space will not be able to update the hardware.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_devices.c | 2 ++ drivers/platform/mpam/mpam_internal.h | 1 + drivers/platform/mpam/mpam_resctrl.c | 17 +++++++++++++++++ 3 files changed, 20 insertions(+)
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index c922f2b264cc..c2457c36ab12 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -2240,6 +2240,8 @@ static irqreturn_t mpam_disable_thread(int irq, void *dev_id) } mutex_unlock(&mpam_cpuhp_state_lock);
+ mpam_resctrl_exit(); + static_branch_disable(&mpam_enabled);
mpam_unregister_irqs(); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 1f7c0ddcf8b1..119660a53e16 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -313,6 +313,7 @@ int mpam_resctrl_online_cpu(unsigned int cpu); int mpam_resctrl_offline_cpu(unsigned int cpu);
int mpam_resctrl_setup(void); +void mpam_resctrl_exit(void);
/* * MPAM MSCs have the following register layout. See: diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index c862f9015c57..074d8ba32115 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -41,6 +41,12 @@ static struct mpam_class *mbm_total_class; */ static bool cdp_enabled;
+/* + * If resctrl_init() succeeded, resctrl_exit() can be used to remove support + * for the filesystem in the event of an error. + */ +static bool resctrl_enabled; + /* A dummy mon context to use when the monitors were allocated up front */ u32 __mon_is_rmid_idx = USE_RMID_IDX; void *mon_is_rmid_idx = &__mon_is_rmid_idx; @@ -825,11 +831,22 @@ int mpam_resctrl_setup(void) }
err = resctrl_init(); + if (!err) + WRITE_ONCE(resctrl_enabled, true); }
return err; }
+void mpam_resctrl_exit(void) +{ + if (!READ_ONCE(resctrl_enabled)) + return; + + WRITE_ONCE(resctrl_enabled, false); + resctrl_exit(); +} + u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type) {
From: James Morse james.morse@arm.com
resctrl's limbo code needs to be told when the data left in a cache is small enough for the partid+pmg value to be re-allocated.
x86 uses the cache size divded by the number of rmid users the cache may have. Do the same, but for the smallest cache, and with the number of partid-and-pmg users.
Querying the cache size can't happen until after cacheinfo_sysfs_init() has run, so mpam_resctrl_setup() must wait until then.
Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 51 ++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 074d8ba32115..9e69e0e1bb07 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -15,6 +15,7 @@ #include <linux/resctrl.h> #include <linux/slab.h> #include <linux/types.h> +#include <linux/wait.h>
#include <asm/mpam.h>
@@ -47,6 +48,13 @@ static bool cdp_enabled; */ static bool resctrl_enabled;
+/* + * mpam_resctrl_pick_caches() needs to know the size of the caches. cacheinfo + * populates this from a device_initcall(). mpam_resctrl_setup() must wait. + */ +static bool cacheinfo_ready; +static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready); + /* A dummy mon context to use when the monitors were allocated up front */ u32 __mon_is_rmid_idx = USE_RMID_IDX; void *mon_is_rmid_idx = &__mon_is_rmid_idx; @@ -402,6 +410,24 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, } }
+/* + * The rmid realloc threshold should be for the smallest cache exposed to + * resctrl. + */ +static void update_rmid_limits(unsigned int size) +{ + u32 num_unique_pmg = resctrl_arch_system_num_rmid_idx(); + + if (WARN_ON_ONCE(!size)) + return; + + if (resctrl_rmid_realloc_limit && size > resctrl_rmid_realloc_limit) + return; + + resctrl_rmid_realloc_limit = size; + resctrl_rmid_realloc_threshold = size / num_unique_pmg; +} + static bool cache_has_usable_cpor(struct mpam_class *class) { struct mpam_props *cprops = &class->props; @@ -558,11 +584,15 @@ static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops) static void mpam_resctrl_pick_caches(void) { int idx; + unsigned int cache_size; struct mpam_class *class; struct mpam_resctrl_res *res;
+ lockdep_assert_cpus_held(); + idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + struct mpam_props *cprops = &class->props; bool has_cpor = cache_has_usable_cpor(class);
if (class->type != MPAM_CLASS_CACHE) { @@ -589,6 +619,16 @@ static void mpam_resctrl_pick_caches(void) continue; }
+ /* Assume cache levels are the same size for all CPUs... */ + cache_size = get_cpu_cacheinfo_size(smp_processor_id(), class->level); + if (!cache_size) { + pr_debug("pick_caches: Could not read cache size\n"); + continue; + } + + if (mpam_has_feature(mpam_feat_msmon_csu, cprops)) + update_rmid_limits(cache_size); + if (class->level == 2) { res = &mpam_resctrl_exports[RDT_RESOURCE_L2]; res->resctrl_res.name = "L2"; @@ -794,6 +834,8 @@ int mpam_resctrl_setup(void) struct mpam_resctrl_res *res; enum resctrl_res_level i;
+ wait_event(wait_cacheinfo_ready, cacheinfo_ready); + cpus_read_lock(); for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; @@ -1156,3 +1198,12 @@ int mpam_resctrl_offline_cpu(unsigned int cpu)
return 0; } + +static int __init __cacheinfo_ready(void) +{ + cacheinfo_ready = true; + wake_up(&wait_cacheinfo_ready); + + return 0; +} +device_initcall_sync(__cacheinfo_ready);
From: Amit Singh Tomar amitsinght@marvell.com
At the moment, number of resource control group (user can create) is limited to 32. Remove the limit.
ffs() returns '1' for bit 0, hence the existing code subtracts 1 from the index to get the CLOSID value. find_first_bit() returns the bit number which does not need adjusting.
Signed-off-by: Amit Singh Tomar amitsinght@marvell.com [ morse: fixed the off-by-one in the allocator and the wrong not-found value. Removed the limit. ] Signed-off-by: James Morse james.morse@arm.com --- drivers/platform/mpam/mpam_resctrl.c | 2 +- fs/resctrl/rdtgroup.c | 25 +++++++++++++------------ include/linux/resctrl.h | 5 ----- 3 files changed, 14 insertions(+), 18 deletions(-)
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 9e69e0e1bb07..38e53c46a9ec 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -132,7 +132,7 @@ static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) */ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) { - return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); + return mpam_partid_max + 1; }
u32 resctrl_arch_system_num_rmid_idx(void) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 0a3b99ca3b48..20a4962d3e81 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -123,8 +123,8 @@ static bool resctrl_is_mbm_event(int e) }
/* - * Trivial allocator for CLOSIDs. Since h/w only supports a small number, - * we can keep a bitmap of free CLOSIDs in a single integer. + * Trivial allocator for CLOSIDs. Use BITMAP APIs to manipulate a bitmap + * of free CLOSIDs. * * Using a global CLOSID across all resources has some advantages and * some drawbacks: @@ -137,7 +137,7 @@ static bool resctrl_is_mbm_event(int e) * - Our choices on how to configure each resource become progressively more * limited as the number of resources grows. */ -static unsigned long closid_free_map; +static unsigned long *closid_free_map; static int closid_free_map_len;
int closids_supported(void) @@ -148,16 +148,17 @@ int closids_supported(void) static void closid_init(void) { struct resctrl_schema *s; - u32 rdt_min_closid = 32; + u32 rdt_min_closid = ~0;
/* Compute rdt_min_closid across all resources */ list_for_each_entry(s, &resctrl_schema_all, list) rdt_min_closid = min(rdt_min_closid, s->num_closid);
- closid_free_map = BIT_MASK(rdt_min_closid) - 1; + closid_free_map = bitmap_alloc(rdt_min_closid, GFP_KERNEL); + bitmap_fill(closid_free_map, rdt_min_closid);
/* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ - __clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map); + __clear_bit(RESCTRL_RESERVED_CLOSID, closid_free_map); closid_free_map_len = rdt_min_closid; }
@@ -174,12 +175,12 @@ static int closid_alloc(void) return cleanest_closid; closid = cleanest_closid; } else { - closid = ffs(closid_free_map); - if (closid == 0) + closid = find_first_bit(closid_free_map, closid_free_map_len); + if (closid == closid_free_map_len) return -ENOSPC; - closid--; } - __clear_bit(closid, &closid_free_map); + + __clear_bit(closid, closid_free_map);
return closid; } @@ -188,7 +189,7 @@ void closid_free(int closid) { lockdep_assert_held(&rdtgroup_mutex);
- __set_bit(closid, &closid_free_map); + __set_bit(closid, closid_free_map); }
/** @@ -202,7 +203,7 @@ bool closid_allocated(unsigned int closid) { lockdep_assert_held(&rdtgroup_mutex);
- return !test_bit(closid, &closid_free_map); + return !test_bit(closid, closid_free_map); }
/** diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 9b7dc3c8d3d2..dd34523469a5 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -30,11 +30,6 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX
-/* - * Resctrl uses a u32 as a closid bitmap. The maximum closid is 32. - */ -#define RESCTRL_MAX_CLOSID 32 - /* * Resctrl uses u32 to hold the user-space config. The maximum bitmap size is * 32.
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/3728 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/G...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/3728 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/G...