[PATCH OLK-6.6 000/168] arm64/mpam: Introduce MPAM feature

Amit Singh Tomar (1):
  fs/resctrl: Remove the limit on the number of CLOSID

James Morse (84):
  x86/resctrl: Add a helper to avoid reaching into the arch code resource list
  x86/resctrl: Move ctrlval string parsing links away from the arch code
  x86/resctrl: Add helper for setting CPU default properties
  x86/resctrl: Remove rdtgroup from update_cpu_closid_rmid()
  x86/resctrl: Export resctrl fs's init function
  x86/resctrl: Wrap resctrl_arch_find_domain() around rdt_find_domain()
  x86/resctrl: Move resctrl types to a separate header
  x86/resctrl: Add a resctrl helper to reset all the resources
  x86/resctrl: Move monitor init work to a resctrl init call
  x86/resctrl: Move monitor exit work to a restrl exit call
  x86/resctrl: Move max_{name,data}_width into resctrl code
  x86/resctrl: Stop using the for_each_*_rdt_resource() walkers
  x86/resctrl: Export the is_mbm_*_enabled() helpers to asm/resctrl.h
  x86/resctrl: Add resctrl_arch_is_evt_configurable() to abstract BMEC
  x86/resctrl: Change mon_event_config_{read,write}() to be arch helpers
  x86/resctrl: Allow resctrl_arch_mon_event_config_write() to return an error
  x86/resctrl: Allow an architecture to disable pseudo lock
  x86/resctrl: Make prefetch_disable_bits belong to the arch code
  x86/resctrl: Make resctrl_arch_pseudo_lock_fn() take a plr
  x86/resctrl: Move thread_throttle_mode_init() to be managed by resctrl
  x86/resctrl: Move get_config_index() to a header
  x86/resctrl: Claim get_domain_from_cpu() for resctrl
  x86/resctrl: Describe resctrl's bitmap size assumptions
  x86/resctrl: Drop __init/__exit on assorted symbols
  fs/resctrl: Add boiler plate for external resctrl code
  x86/resctrl: Move mbm_cfg_mask to struct rdt_resource
  x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl
  x86/resctrl: Move the filesystem portions of resctrl to live in '/fs/'
  arm64: head.S: Initialise MPAM EL2 registers and disable traps
  arm64: cpufeature: discover CPU support for MPAM
  KVM: arm64: Fix missing traps of guest accesses to the MPAM registers
  KVM: arm64: Disable MPAM visibility by default, and handle traps
  arm64: mpam: Context switch the MPAM registers
  untested: KVM: arm64: Force guest EL1 to use user-space's partid configuration
  ACPI / PPTT: Provide a helper to walk processor containers
  ACPI / PPTT: Find PPTT cache level by ID
  ACPI / PPTT: Add a helper to fill a cpumask from a processor container
  ACPI / PPTT: Add a helper to fill a cpumask from a cache_id
  cacheinfo: Expose the code to generate a cache-id from a device_node
  drivers: base: cacheinfo: Add helper to find the cache size from cpu+level
  ACPI / MPAM: Parse the MPAM table
  arm_mpam: Add probe/remove for mpam msc driver and kbuild boiler plate
  arm_mpam: Add the class and component structures for ris firmware described
  arm_mpam: Add MPAM MSC register layout definitions
  arm_mpam: Add cpuhp callbacks to probe MSC hardware
  arm_mpam: Probe MSCs to find the supported partid/pmg values
  arm_mpam: Probe the hardware features resctrl supports
  arm_mpam: Merge supported features during mpam_enable() into mpam_class
  arm_mpam: Reset MSC controls from cpu hp callbacks
  arm_mpam: Add a helper to touch an MSC from any CPU
  arm_mpam: Extend reset logic to allow devices to be reset any time
  arm_mpam: Register and enable IRQs
  arm_mpam: Use the arch static key to indicate when mpam is enabled
  arm_mpam: Allow configuration to be applied and restored during cpu online
  arm_mpam: Probe and reset the rest of the features
  arm_mpam: Add helpers to allocate monitors
  arm_mpam: Add mpam_msmon_read() to read monitor value
  arm_mpam: Track bandwidth counter state for overflow and power management
  arm_mpam: Add helper to reset saved mbwu state
  arm_mpam: resctrl: Add boilerplate cpuhp and domain allocation
  arm_mpam: resctrl: Pick the caches we will use as resctrl resources
  arm_mpam: resctrl: Pick a value for num_rmid
  arm_mpam: resctrl: Implement resctrl_arch_reset_resources()
  arm_mpam: resctrl: Add resctrl_arch_get_config()
  arm_mpam: resctrl: Implement helpers to update configuration
  arm_mpam: resctrl: Add CDP emulation
  arm64: mpam: Add helpers to change a tasks and cpu mpam partid/pmg values
  arm_mpam: resctrl: Add rmid index helpers
  untested: arm_mpam: resctrl: Add support for MB resource
  untested: arm_mpam: resctrl: Add support for mbm counters
  arm_mpam: resctrl: Allow resctrl to allocate monitors
  arm_mpam: resctrl: Add resctrl_arch_rmid_read() and resctrl_arch_reset_rmid()
  untested: arm_mpam: resctrl: Allow monitors to be configured
  arm_mpam: resctrl: Add empty definitions for pseudo lock
  arm_mpam: resctrl: Add empty definitions for fine-grained enables
  arm_mpam: resctrl: Add dummy definition for free running counters
  arm64: mpam: Select ARCH_HAS_CPU_RESCTRL
  perf/arm-cmn: Stop claiming all the resources
  arm_mpam: resctrl: Tell resctrl about cpu/domain online/offline
  arm_mpam: resctrl: Call resctrl_exit() in the event of errors
  arm_mpam: resctrl: Update the rmid reallocation limit
  arm64: mpam: Add cpu_pm notifier to restore MPAM sysregs
  arm64: mpam: Restore the expected MPAM sysregs on cpuhp
  x86/resctrl: Add max_bw to struct resctrl_membw

Rob Herring (3):
  cacheinfo: Allow for >32-bit cache 'id'
  cacheinfo: Set cache 'id' based on DT data
  dt-bindings: arm: Add MPAM MSC binding

Rohit Mathew (2):
  arm_mpam: Probe for long/lwd mbwu counters
  arm_mpam: Use long MBWU counters if supported

Zeng Heng (78):
  Revert "x86/resctrl: Fix arch_mbm_* array overrun on SNC"
  Revert "x86/resctrl: Detect Sub-NUMA Cluster (SNC) mode"
  Revert "x86/resctrl: Enable shared RMID mode on Sub-NUMA Cluster (SNC) systems"
  Revert "x86/resctrl: Make __mon_event_count() handle sum domains"
  Revert "x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counter"
  Revert "x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) mode"
  Revert "x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor files"
  Revert "x86/resctrl: Allocate a new field in union mon_data_bits"
  Revert "x86/resctrl: Refactor mkdir_mondata_subdir() with a helper function"
  Revert "x86/resctrl: Initialize on-stack struct rmid_read instances"
  Revert "x86/resctrl: Add a new field to struct rmid_read for summation of domains"
  Revert "x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files"
  Revert "x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) systems"
  Revert "x86/resctrl: Introduce snc_nodes_per_l3_cache"
  Revert "x86/resctrl: Add node-scope to the options for feature scope"
  Revert "x86/resctrl: Split the rdt_domain and rdt_hw_domain structures"
  Revert "x86/resctrl: Prepare for different scope for control/monitor operations"
  Revert "x86/resctrl: Prepare to split rdt_domain structure"
  Revert "x86/resctrl: Prepare for new domain scope"
  Revert "x86/resctrl: Replace open coded cacheinfo searches"
  Revert "x86/resctrl: Add tracepoint for llc_occupancy tracking"
  Revert "x86/resctrl: Rename pseudo_lock_event.h to trace.h"
  Revert "x86/resctrl: Simplify call convention for MSR update functions"
  Revert "x86/resctrl: Pass domain to target CPU"
  arm_mpam: control memory bandwidth with hard limit flag
  resctrl: fix undefined reference to lockdep_is_cpus_held()
  fs/resctrl: Move rdtgroup_setup_default() out of init.text section
  Revert "KVM: arm64: Disable MPAM visibility by default, and handle traps"
  arm64: cpufeature: Both the major and the minor version numbers need to be checked
  arm64/mpam: skip mpam initialize under kdump kernel
  x86: resctrl: Fix illegal access by the chips not having RDT
  Revert "arm64: head.S: Initialise MPAM EL2 registers and disable traps"
  arm64/mpam: Check mpam_detect_is_enabled() before accessing MPAM registers
  arm64/mpam: Fix redefined reference of 'mpam_detect_is_enabled'
  fs/resctrl: Adapt to the hardware topology structures of RDT and MPAM
  arm64/mpam: Support MATA monitor feature for MPAM
  arm64/mpam: Add judgment to distinguish MSMON_MBWU_CAPTURE definition
  arm64/mpam: Fix out-of-bound access of mbwu_state array
  arm64/mpam: fix MBA granularity conversion formula
  arm64/mpam: fix bug in percent_to_mbw_max()
  arm64/mpam: Fix out-of-bound access of cfg array
  arm64/mpam: Improve conversion accuracy between percent and fixed-point fraction
  arm64/mpam: Add write memory barrier to guarantee monitor results
  arm64/mpam: Try reading again if monitor instance returns not ready
  arm64/mpam: fix memleak in resctrl_arch_mon_ctx_alloc_no_wait()
  arm64/mpam: fix impossible condition in get_cpumask_from_cache_id()
  arm64/mpam: fix impossible condition in resctrl_arch_rmid_read()
  arm64/mpam: Optimize CSU/MBWU monitor multiplexing
  fs/resctrl: Determine whether the MBM monitors require overflow checking
  arm64/mpam: Set the cpbm width of msc class with the minimum
  arm64/mpam: Correct the judgment condition of the CMAX feature
  fs/resctrl: Fix kmemleak caused by closid_init()
  arm64/mpam: Fix allocated cache size information
  arm64/mpam: Add debugging information about CDP monitor value
  fs/resctrl: Fix configuration to wrong control group when CDP is enabled
  fs/resctrl: As a pre-patch for expanding MPAM's QoS capability
  arm64/mpam: Add CMAX feature
  arm64/mpam: Add mbw_min and cmin features
  arm64/mpam: Add PRIO feature
  arm64/mpam: Add limit feature
  x86/resctrl: Add a handling path of default label in get_arch_mbm_state()
  fs/resctrl: Create l2 cache monitors
  fs/resctrl: Add l2 mount option to enable L2 msc
  arm64/mpam: Refuse cpu offline when L2 msc is enabled
  arm64/mpam: Refuse to enter powerdown state after L2 msc updated
  fs/resctrl: Fix the crash caused by mounting resctrl but not support RDT
  arm64/mpam: Fix num_rmids information
  arm64/mpam: Set 1 as the minimum setting value for MBA
  arm64/mpam: Update QoS partition default value
  fs/resctrl: Free mbm_total and mbm_local when fails
  fs/resctrl: Restore default settings for all resctrl_res_level
  fs/resctrl: Add missing rdt_last_cmd_clear() after rdtgroup_kn_lock_live()
  arm64/mpam: Add MPAM manual
  fs/resctrl: Prevent idle RMIDs from not being released in time from limbo
  fs/resctrl: L2_MON does not support the limbo mechanism
  fs/resctrl: Fix return value in rdtgroup_pseudo_locked_in_hierarchy()
  x86/resctrl: Revert x86 RDT code back
  x86/resctrl: Decouple ARM MPAM from x86 RDT

 .../arch/arm64/cpu-feature-registers.rst      |    2 +
 Documentation/arch/arm64/mpam.md              |  587 +++
 .../devicetree/bindings/arm/arm,mpam-msc.yaml |  227 +
 MAINTAINERS                                   |    2 +
 arch/Kconfig                                  |    8 +
 arch/arm64/Kconfig                            |   22 +-
 arch/arm64/include/asm/cpu.h                  |    1 +
 arch/arm64/include/asm/cpufeature.h           |   22 +
 arch/arm64/include/asm/kvm_arm.h              |    1 +
 arch/arm64/include/asm/mpam.h                 |  167 +
 arch/arm64/include/asm/resctrl.h              |    1 +
 arch/arm64/include/asm/sysreg.h               |    8 +
 arch/arm64/include/asm/thread_info.h          |    3 +
 arch/arm64/kernel/Makefile                    |    1 +
 arch/arm64/kernel/cpufeature.c                |  110 +
 arch/arm64/kernel/cpuinfo.c                   |    6 +
 arch/arm64/kernel/image-vars.h                |    5 +
 arch/arm64/kernel/mpam.c                      |   75 +
 arch/arm64/kernel/process.c                   |    7 +
 arch/arm64/kvm/hyp/include/hyp/switch.h       |   32 +
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h    |   27 +
 arch/arm64/kvm/hyp/nvhe/switch.c              |   11 +
 arch/arm64/kvm/hyp/vhe/sysreg-sr.c            |    1 +
 arch/arm64/kvm/sys_regs.c                     |   20 +
 arch/arm64/tools/cpucaps                      |    1 +
 arch/arm64/tools/sysreg                       |   33 +
 arch/x86/Kconfig                              |    3 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        |    4 +-
 drivers/acpi/arm64/Kconfig                    |    3 +
 drivers/acpi/arm64/Makefile                   |    2 +
 drivers/acpi/arm64/mpam.c                     |  369 ++
 drivers/acpi/pptt.c                           |  278 +-
 drivers/acpi/tables.c                         |    2 +-
 drivers/base/cacheinfo.c                      |   41 +-
 drivers/perf/arm-cmn.c                        |   12 +-
 drivers/platform/Kconfig                      |    2 +
 drivers/platform/Makefile                     |    1 +
 drivers/platform/mpam/Kconfig                 |    8 +
 drivers/platform/mpam/Makefile                |    1 +
 drivers/platform/mpam/mpam_devices.c          | 2491 ++++++++++
 drivers/platform/mpam/mpam_internal.h         |  595 +++
 drivers/platform/mpam/mpam_resctrl.c          | 1599 +++++++
 fs/Kconfig                                    |    1 +
 fs/Makefile                                   |    1 +
 fs/resctrl/Kconfig                            |   24 +
 fs/resctrl/Makefile                           |    1 +
 fs/resctrl/ctrlmondata.c                      |  563 +++
 fs/resctrl/internal.h                         |  295 ++
 fs/resctrl/monitor.c                          |  814 ++++
 fs/resctrl/psuedo_lock.c                      | 1134 +++++
 fs/resctrl/rdtgroup.c                         | 4066 +++++++++++++++++
 include/linux/acpi.h                          |   17 +
 include/linux/arm_mpam.h                      |  109 +
 include/linux/cacheinfo.h                     |   28 +-
 include/linux/resctrl.h                       |    7 +
 include/linux/resctrl_mpam.h                  |  452 ++
 include/linux/resctrl_types.h                 |  119 +
 57 files changed, 14410 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/arch/arm64/mpam.md
 create mode 100644 Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml
 create mode 100644 arch/arm64/include/asm/mpam.h
 create mode 100644 arch/arm64/include/asm/resctrl.h
 create mode 100644 arch/arm64/kernel/mpam.c
 create mode 100644 drivers/acpi/arm64/mpam.c
 create mode 100644 drivers/platform/mpam/Kconfig
 create mode 100644 drivers/platform/mpam/Makefile
 create mode 100644 drivers/platform/mpam/mpam_devices.c
 create mode 100644 drivers/platform/mpam/mpam_internal.h
 create mode 100644 drivers/platform/mpam/mpam_resctrl.c
 create mode 100644 fs/resctrl/Kconfig
 create mode 100644 fs/resctrl/Makefile
 create mode 100644 fs/resctrl/ctrlmondata.c
 create mode 100644 fs/resctrl/internal.h
 create mode 100644 fs/resctrl/monitor.c
 create mode 100644 fs/resctrl/psuedo_lock.c
 create mode 100644 fs/resctrl/rdtgroup.c
 create mode 100644 include/linux/arm_mpam.h
 create mode 100644 include/linux/resctrl_mpam.h
 create mode 100644 include/linux/resctrl_types.h

--
2.25.1
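
For readers new to MPAM: the arm64 context-switch patches in the list above
work by packing the running task's partid/pmg into the MPAMn_ELx system
registers. A minimal sketch of the encoding, assuming the MPAM1_EL1 field
layout from the Arm MPAM specification (PARTID_D [15:0], PARTID_I [31:16],
PMG_D [39:32], PMG_I [47:40]); the helper name is illustrative, not the
series' actual symbol:

/*
 * Illustrative sketch only -- not the series' API. Uses one partid/pmg
 * for both the instruction (I) and data (D) fields.
 */
static inline u64 mpam_encode_regval(u16 partid, u8 pmg)
{
	return ((u64)partid <<  0) |	/* PARTID_D */
	       ((u64)partid << 16) |	/* PARTID_I */
	       ((u64)pmg    << 32) |	/* PMG_D */
	       ((u64)pmg    << 40);	/* PMG_I */
}

On context switch the arch code would write a value like this to MPAM0_EL1
(EL0 accesses) and/or MPAM1_EL1, which is also why the series has to
save/restore these registers around cpuidle and cpu hotplug.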

This reverts commit 5cf3389f93d220994730c51f5edad35cf69029ee.
---
 arch/x86/include/asm/resctrl.h     | 6 ++++++
 arch/x86/kernel/cpu/resctrl/core.c | 8 --------
 include/linux/resctrl.h            | 1 -
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 8b1b6ce1e51b..12dbd2588ca7 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -156,6 +156,12 @@ static inline void resctrl_sched_in(struct task_struct *tsk)
 		__resctrl_sched_in(tsk);
 }
 
+static inline u32 resctrl_arch_system_num_rmid_idx(void)
+{
+	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
+	return boot_cpu_data.x86_cache_max_rmid + 1;
+}
+
 static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid)
 {
 	*rmid = idx;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 89a3b145dff8..d3a87070bda4 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -119,14 +119,6 @@ struct rdt_hw_resource rdt_resources_all[] = {
 	},
 };
 
-u32 resctrl_arch_system_num_rmid_idx(void)
-{
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-
-	/* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
-	return r->num_rmid;
-}
-
 /*
  * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs
  * as they do not have CPUID enumeration support for Cache allocation.
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..b0875b99e811 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -248,7 +248,6 @@ struct resctrl_schema {
 
 /* The number of closid supported by this resource regardless of CDP */
 u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
-u32 resctrl_arch_system_num_rmid_idx(void);
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
 
 /*
--
2.25.1
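
The helper moved back into the x86 header above is trivially idx == rmid
because x86 RMIDs form a single flat space. The reason resctrl treats this
as a per-arch hook is that MPAM monitors are scoped by (PARTID, PMG) pairs
instead. A hedged sketch of what an MPAM-style encoding could look like
(hypothetical names, not the series' actual code):

/*
 * Hypothetical MPAM-side index encoding: fold closid and pmg together,
 * so num_rmid_idx = num_closid * num_pmg rather than num_rmid.
 */
static inline u32 example_mpam_rmid_idx_encode(u32 closid, u32 pmg,
					       u32 num_pmg)
{
	return closid * num_pmg + pmg;
}

The matching decode would be the inverse (idx / num_pmg, idx % num_pmg);
the x86 decode visible in the hunk above simply returns idx as the rmid.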

This reverts commit f18d4d6bb1d7e3c03a695720748944b2448855e3.
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 66 ---------------------------
 1 file changed, 66 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index cc0f1c48d7b2..a62c4dc91161 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -15,8 +15,6 @@
  * Software Developer Manual June 2016, volume 3, section 17.17.
  */
 
-#define pr_fmt(fmt)	"resctrl: " fmt
-
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/sizes.h>
@@ -1113,68 +1111,6 @@ void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
 		msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
 }
 
-/* CPU models that support MSR_RMID_SNC_CONFIG */
-static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
-	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
-	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
-	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
-	X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
-	X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0),
-	{}
-};
-
-/*
- * There isn't a simple hardware bit that indicates whether a CPU is running
- * in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the
- * number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
- * the same NUMA node as CPU0.
- * It is not possible to accurately determine SNC state if the system is
- * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
- * to L3 caches. It will be OK if system is booted with hyperthreading
- * disabled (since this doesn't affect the ratio).
- */
-static __init int snc_get_config(void)
-{
-	struct cacheinfo *ci = get_cpu_cacheinfo_level(0, RESCTRL_L3_CACHE);
-	const cpumask_t *node0_cpumask;
-	int cpus_per_node, cpus_per_l3;
-	int ret;
-
-	if (!x86_match_cpu(snc_cpu_ids) || !ci)
-		return 1;
-
-	cpus_read_lock();
-	if (num_online_cpus() != num_present_cpus())
-		pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
-	cpus_read_unlock();
-
-	node0_cpumask = cpumask_of_node(cpu_to_node(0));
-
-	cpus_per_node = cpumask_weight(node0_cpumask);
-	cpus_per_l3 = cpumask_weight(&ci->shared_cpu_map);
-
-	if (!cpus_per_node || !cpus_per_l3)
-		return 1;
-
-	ret = cpus_per_l3 / cpus_per_node;
-
-	/* sanity check: Only valid results are 1, 2, 3, 4 */
-	switch (ret) {
-	case 1:
-		break;
-	case 2 ... 4:
-		pr_info("Sub-NUMA Cluster mode detected with %d nodes per L3 cache\n", ret);
-		rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_L3_NODE;
-		break;
-	default:
-		pr_warn("Ignore improbable SNC node count %d\n", ret);
-		ret = 1;
-		break;
-	}
-
-	return ret;
-}
-
 int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 {
 	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
@@ -1182,8 +1118,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	unsigned int threshold;
 	int ret;
 
-	snc_nodes_per_l3_cache = snc_get_config();
-
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
 	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
 	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
--
2.25.1
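
A worked instance of the heuristic removed above (numbers hypothetical):
on an SNC-2 part where each L3 is shared by 96 CPUs but a NUMA node holds
only 48 of them, cpus_per_l3 / cpus_per_node = 96 / 48 = 2, so two SNC
nodes per L3 cache; on a non-SNC system the two counts match, the ratio
is 1, and resctrl stays in its default configuration.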

This reverts commit d6293e7f01d0d21122484d0644aa3dccb6e50ba5.
---
 arch/x86/include/asm/msr-index.h       |  1 -
 arch/x86/kernel/cpu/resctrl/core.c     |  2 --
 arch/x86/kernel/cpu/resctrl/internal.h |  2 --
 arch/x86/kernel/cpu/resctrl/monitor.c  | 20 --------------------
 4 files changed, 25 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0aae302e3daa..182fa1daccb6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1156,7 +1156,6 @@
 #define MSR_IA32_QM_CTR			0xc8e
 #define MSR_IA32_PQR_ASSOC		0xc8f
 #define MSR_IA32_L3_CBM_BASE		0xc90
-#define MSR_RMID_SNC_CONFIG		0xca0
 #define MSR_IA32_L2_CBM_BASE		0xd10
 #define MSR_IA32_MBA_THRTL_BASE		0xd50
 
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d3a87070bda4..a1b67620ac04 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -615,8 +615,6 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	}
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
-	arch_mon_domain_online(r, d);
-
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
 		mon_domain_free(hw_dom);
 		return;
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 955999aecfca..16982d1baf99 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -534,8 +534,6 @@ static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
 
 int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable);
 
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
-
 /*
  * To return the common struct rdt_resource, which is contained in struct
  * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource.
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index a62c4dc91161..7ee4e0c90159 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1091,26 +1091,6 @@ static void l3_mon_evt_init(struct rdt_resource *r)
 		list_add_tail(&mbm_local_event.list, &r->evt_list);
 }
 
-/*
- * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
- * which indicates that RMIDs are configured in legacy mode.
- * This mode is incompatible with Linux resctrl semantics
- * as RMIDs are partitioned between SNC nodes, which requires
- * a user to know which RMID is allocated to a task.
- * Clearing bit 0 reconfigures the RMID counters for use
- * in RMID sharing mode. This mode is better for Linux.
- * The RMID space is divided between all SNC nodes with the
- * RMIDs renumbered to start from zero in each node when
- * counting operations from tasks. Code to read the counters
- * must adjust RMID counter numbers based on SNC node. See
- * logical_rmid_to_physical_rmid() for code that does this.
- */
-void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
-{
-	if (snc_nodes_per_l3_cache > 1)
-		msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
-}
-
 int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 {
 	unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
--
2.25.1

This reverts commit fa882cd398cdd16545d36189ffa7eeede4cf6e9f.
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 51 +++++----------------------
 1 file changed, 9 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 7ee4e0c90159..9bba9d59d5f6 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -324,6 +324,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 	resctrl_arch_rmid_read_context_check();
 
+	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
+		return -EINVAL;
+
 	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 	ret = __rmid_read_phys(prmid, eventid, &msr_val);
 	if (ret)
@@ -590,10 +593,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
 
 static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 {
-	int cpu = smp_processor_id();
-	struct rdt_mon_domain *d;
 	struct mbm_state *m;
-	int err, ret;
 	u64 tval = 0;
 
 	if (rr->first) {
@@ -604,47 +604,14 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
 		return 0;
 	}
 
-	if (rr->d) {
-		/* Reading a single domain, must be on a CPU in that domain. */
-		if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask))
-			return -EINVAL;
-		rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid,
-						 rr->evtid, &tval, rr->arch_mon_ctx);
-		if (rr->err)
-			return rr->err;
+	rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
+					 &tval, rr->arch_mon_ctx);
+	if (rr->err)
+		return rr->err;
 
-		rr->val += tval;
+	rr->val += tval;
 
-		return 0;
-	}
-
-	/* Summing domains that share a cache, must be on a CPU for that cache. */
-	if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map))
-		return -EINVAL;
-
-	/*
-	 * Legacy files must report the sum of an event across all
-	 * domains that share the same L3 cache instance.
-	 * Report success if a read from any domain succeeds, -EINVAL
-	 * (translated to "Unavailable" for user space) if reading from
-	 * all domains fail for any reason.
-	 */
-	ret = -EINVAL;
-	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
-		if (d->ci->id != rr->ci->id)
-			continue;
-		err = resctrl_arch_rmid_read(rr->r, d, closid, rmid,
-					     rr->evtid, &tval, rr->arch_mon_ctx);
-		if (!err) {
-			rr->val += tval;
-			ret = 0;
-		}
-	}
-
-	if (ret)
-		rr->err = ret;
-
-	return ret;
+	return 0;
 }
 
 /*
--
2.25.1

This reverts commit f3c8c014eb255ff53a1ecf073a8c507bb20ee45f.
---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 40 +++++------------------
 arch/x86/kernel/cpu/resctrl/internal.h    |  2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  2 +-
 3 files changed, 10 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 200d89a64027..7b670ee6fd9d 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -520,7 +520,7 @@ static int smp_mon_event_count(void *arg)
 
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
-		    cpumask_t *cpumask, int evtid, int first)
+		    int evtid, int first)
 {
 	int cpu;
 
@@ -541,7 +541,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		return;
 	}
 
-	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
+	cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU);
 
 	/*
 	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
@@ -550,7 +550,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	 * counters on some platforms if its called in IRQ context.
 	 */
 	if (tick_nohz_full_cpu(cpu))
-		smp_call_function_any(cpumask, mon_event_count, rr, 1);
+		smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1);
 	else
 		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
 
@@ -579,40 +579,16 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 	resid = md.u.rid;
 	domid = md.u.domid;
 	evtid = md.u.evtid;
-	r = &rdt_resources_all[resid].r_resctrl;
 
-	if (md.u.sum) {
-		/*
-		 * This file requires summing across all domains that share
-		 * the L3 cache id that was provided in the "domid" field of the
-		 * mon_data_bits union. Search all domains in the resource for
-		 * one that matches this cache id.
-		 */
-		list_for_each_entry(d, &r->mon_domains, hdr.list) {
-			if (d->ci->id == domid) {
-				rr.ci = d->ci;
-				mon_event_read(&rr, r, NULL, rdtgrp,
-					       &d->ci->shared_cpu_map, evtid, false);
-				goto checkresult;
-			}
-		}
+	r = &rdt_resources_all[resid].r_resctrl;
+	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
+	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
 		ret = -ENOENT;
 		goto out;
-	} else {
-		/*
-		 * This file provides data from a single domain. Search
-		 * the resource to find the domain with "domid".
-		 */
-		hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
-		if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
-			ret = -ENOENT;
-			goto out;
-		}
-		d = container_of(hdr, struct rdt_mon_domain, hdr);
-		mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false);
 	}
+	d = container_of(hdr, struct rdt_mon_domain, hdr);
 
-checkresult:
+	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
 	if (rr.err == -EIO)
 		seq_puts(m, "Error\n");
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 16982d1baf99..13d862221f9c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -632,7 +632,7 @@ void mon_event_count(void *info);
 int rdtgroup_mondata_show(struct seq_file *m, void *arg);
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_mon_domain *d, struct rdtgroup *rdtgrp,
-		    cpumask_t *cpumask, int evtid, int first);
+		    int evtid, int first);
 void mbm_setup_overflow_handler(struct rdt_mon_domain *dom,
 				unsigned long delay_ms,
 				int exclude_cpu);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d7163b764c62..58e53f1f52a0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3070,7 +3070,7 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
 			return ret;
 
 		if (!do_sum && is_mbm_event(mevt->evtid))
-			mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
+			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
 	}
 
 	return 0;
--
2.25.1

This reverts commit 199eea7677198c357f41b0a38cebaa2701072094.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 35 +++++---------------------
 1 file changed, 6 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 58e53f1f52a0..8502385e389f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3006,45 +3006,22 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 	return ret;
 }
 
-static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subname)
-{
-	struct kernfs_node *kn;
-
-	kn = kernfs_find_and_get(pkn, name);
-	if (!kn)
-		return;
-	kernfs_put(kn);
-
-	if (kn->dir.subdirs <= 1)
-		kernfs_remove(kn);
-	else
-		kernfs_remove_by_name(kn, subname);
-}
-
 /*
  * Remove all subdirectories of mon_data of ctrl_mon groups
- * and monitor groups for the given domain.
- * Remove files and directories containing "sum" of domain data
- * when last domain being summed is removed.
+ * and monitor groups with given domain id.
 */
 static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
-					   struct rdt_mon_domain *d)
+					   unsigned int dom_id)
 {
 	struct rdtgroup *prgrp, *crgrp;
-	char subname[32];
-	bool snc_mode;
 	char name[32];
 
-	snc_mode = r->mon_scope == RESCTRL_L3_NODE;
-	sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
-	if (snc_mode)
-		sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
-
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname);
+		sprintf(name, "mon_%s_%02d", r->name, dom_id);
+		kernfs_remove_by_name(prgrp->mon.mon_data_kn, name);
 
 		list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list)
-			mon_rmdir_one_subdir(crgrp->mon.mon_data_kn, name, subname);
+			kernfs_remove_by_name(crgrp->mon.mon_data_kn, name);
 	}
 }
 
@@ -4014,7 +3991,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d
 	 * per domain monitor data directories.
 	 */
 	if (resctrl_mounted && resctrl_arch_mon_capable())
-		rmdir_mondata_subdir_allrdtgrp(r, d);
+		rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id);
 
 	if (is_mbm_enabled())
 		cancel_delayed_work(&d->mbm_over);
--
2.25.1

This reverts commit adb183fcbea03bdbb9cdbf67ea8bb9902daf53f9.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 62 +++++++-------------------
 1 file changed, 16 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 8502385e389f..9c38ddcfe150 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3026,8 +3026,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 }
 
 static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
-			     struct rdt_resource *r, struct rdtgroup *prgrp,
-			     bool do_sum)
+			     struct rdt_resource *r, struct rdtgroup *prgrp)
 {
 	struct rmid_read rr = {0};
 	union mon_data_bits priv;
@@ -3038,15 +3037,14 @@ static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
 		return -EPERM;
 
 	priv.u.rid = r->rid;
-	priv.u.domid = do_sum ? d->ci->id : d->hdr.id;
-	priv.u.sum = do_sum;
+	priv.u.domid = d->hdr.id;
 	list_for_each_entry(mevt, &r->evt_list, list) {
 		priv.u.evtid = mevt->evtid;
 		ret = mon_addfile(kn, mevt->name, priv.priv);
 		if (ret)
 			return ret;
 
-		if (!do_sum && is_mbm_event(mevt->evtid))
+		if (is_mbm_event(mevt->evtid))
 			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
 	}
 
@@ -3057,51 +3055,23 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
-	struct kernfs_node *kn, *ckn;
+	struct kernfs_node *kn;
 	char name[32];
-	bool snc_mode;
-	int ret = 0;
-
-	lockdep_assert_held(&rdtgroup_mutex);
-
-	snc_mode = r->mon_scope == RESCTRL_L3_NODE;
-	sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
-	kn = kernfs_find_and_get(parent_kn, name);
-	if (kn) {
-		/*
-		 * rdtgroup_mutex will prevent this directory from being
-		 * removed. No need to keep this hold.
-		 */
-		kernfs_put(kn);
-	} else {
-		kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
-		if (IS_ERR(kn))
-			return PTR_ERR(kn);
-
-		ret = rdtgroup_kn_set_ugid(kn);
-		if (ret)
-			goto out_destroy;
-		ret = mon_add_all_files(kn, d, r, prgrp, snc_mode);
-		if (ret)
-			goto out_destroy;
-	}
+	int ret;
 
-	if (snc_mode) {
-		sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id);
-		ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp);
-		if (IS_ERR(ckn)) {
-			ret = -EINVAL;
-			goto out_destroy;
-		}
+	sprintf(name, "mon_%s_%02d", r->name, d->hdr.id);
+	/* create the directory */
+	kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
 
-		ret = rdtgroup_kn_set_ugid(ckn);
-		if (ret)
-			goto out_destroy;
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret)
+		goto out_destroy;
 
-		ret = mon_add_all_files(ckn, d, r, prgrp, false);
-		if (ret)
-			goto out_destroy;
-	}
+	ret = mon_add_all_files(kn, d, r, prgrp);
+	if (ret)
+		goto out_destroy;
 
 	kernfs_activate(kn);
 	return 0;
--
2.25.1
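
Concretely, the reverted code gave each L3 cache one directory with one
subdirectory per SNC node, e.g. mon_L3_00/mon_sub_L3_00 and
mon_L3_00/mon_sub_L3_01 on a two-node-per-L3 system (per the
"mon_%s_%02d" / "mon_sub_%s_%02d" formats above); after the revert the
tree returns to the flat legacy layout with a single mon_L3_XX
directory per monitor domain.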

This reverts commit e0f8aeaef19814e05b226e7248e7c4efacb1a31c.
---
 arch/x86/kernel/cpu/resctrl/internal.h | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 13d862221f9c..681b5bdcd2f9 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -127,25 +127,19 @@ struct mon_evt {
 };
 
 /**
- * union mon_data_bits - Monitoring details for each event file.
+ * union mon_data_bits - Monitoring details for each event file
  * @priv:              Used to store monitoring event data in @u
- *                     as kernfs private data.
- * @u.rid:             Resource id associated with the event file.
- * @u.evtid:           Event id associated with the event file.
- * @u.sum:             Set when event must be summed across multiple
- *                     domains.
- * @u.domid:           When @u.sum is zero this is the domain to which
- *                     the event file belongs. When @sum is one this
- *                     is the id of the L3 cache that all domains to be
- *                     summed share.
- * @u:                 Name of the bit fields struct.
+ *                     as kernfs private data
+ * @rid:               Resource id associated with the event file
+ * @evtid:             Event id associated with the event file
+ * @domid:             The domain to which the event file belongs
+ * @u:                 Name of the bit fields struct
 */
 union mon_data_bits {
 	void *priv;
 	struct {
 		unsigned int rid		: 10;
-		enum resctrl_event_id evtid	: 7;
-		unsigned int sum		: 1;
+		enum resctrl_event_id evtid	: 8;
 		unsigned int domid		: 14;
 	} u;
 };
--
2.25.1
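
The arithmetic behind the union above: rid(10) + evtid(8) + domid(14)
= 32 bits, so the whole identity of an event file packs into a
pointer-sized scalar and can be stored as the kernfs private data with
no per-file allocation. A standalone illustration (demo code, not
kernel source):

/* Three small ids share one pointer-sized scalar. */
union mon_data_bits_demo {
	void *priv;
	struct {
		unsigned int rid	: 10;
		unsigned int evtid	: 8;
		unsigned int domid	: 14;
	} u;
};

/* On LP64, sizeof(union mon_data_bits_demo) == 8 and the bitfields
 * occupy only the first 4 bytes; on 32-bit they fill the pointer
 * exactly. */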

This reverts commit d41b6015c8277fccc515cd7a3718392687ffa7a2.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 45 ++++++++++----------------
 1 file changed, 17 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9c38ddcfe150..d0443589cd86 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3025,37 +3025,14 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 	}
 }
 
-static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d,
-			     struct rdt_resource *r, struct rdtgroup *prgrp)
-{
-	struct rmid_read rr = {0};
-	union mon_data_bits priv;
-	struct mon_evt *mevt;
-	int ret;
-
-	if (WARN_ON(list_empty(&r->evt_list)))
-		return -EPERM;
-
-	priv.u.rid = r->rid;
-	priv.u.domid = d->hdr.id;
-	list_for_each_entry(mevt, &r->evt_list, list) {
-		priv.u.evtid = mevt->evtid;
-		ret = mon_addfile(kn, mevt->name, priv.priv);
-		if (ret)
-			return ret;
-
-		if (is_mbm_event(mevt->evtid))
-			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
-	}
-
-	return 0;
-}
-
 static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
+	struct rmid_read rr = {0};
+	union mon_data_bits priv;
 	struct kernfs_node *kn;
+	struct mon_evt *mevt;
 	char name[32];
 	int ret;
 
@@ -3069,10 +3046,22 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	if (ret)
 		goto out_destroy;
 
-	ret = mon_add_all_files(kn, d, r, prgrp);
-	if (ret)
+	if (WARN_ON(list_empty(&r->evt_list))) {
+		ret = -EPERM;
 		goto out_destroy;
+	}
 
+	priv.u.rid = r->rid;
+	priv.u.domid = d->hdr.id;
+	list_for_each_entry(mevt, &r->evt_list, list) {
+		priv.u.evtid = mevt->evtid;
+		ret = mon_addfile(kn, mevt->name, priv.priv);
+		if (ret)
+			goto out_destroy;
+
+		if (is_mbm_event(mevt->evtid))
+			mon_event_read(&rr, r, d, prgrp, mevt->evtid, true);
+	}
 
 	kernfs_activate(kn);
 	return 0;
--
2.25.1

This reverts commit 6d9c4ce1f252565c7a6b00dd384259ba04cea1cb.
---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 3 ++-
 arch/x86/kernel/cpu/resctrl/monitor.c     | 3 ++-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 2 +-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 7b670ee6fd9d..97e554e36f07 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -534,6 +534,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->evtid = evtid;
 	rr->r = r;
 	rr->d = d;
+	rr->val = 0;
 	rr->first = first;
 	rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid);
 	if (IS_ERR(rr->arch_mon_ctx)) {
@@ -561,12 +562,12 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
 	struct kernfs_open_file *of = m->private;
 	struct rdt_domain_hdr *hdr;
-	struct rmid_read rr = {0};
 	struct rdt_mon_domain *d;
 	u32 resid, evtid, domid;
 	struct rdtgroup *rdtgrp;
 	struct rdt_resource *r;
 	union mon_data_bits md;
+	struct rmid_read rr;
 	int ret = 0;
 
 	rdtgrp = rdtgroup_kn_lock_live(of->kn);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 9bba9d59d5f6..b6220089c68c 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -781,8 +781,9 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm)
 static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d,
 		       u32 closid, u32 rmid)
 {
-	struct rmid_read rr = {0};
+	struct rmid_read rr;
 
+	rr.first = false;
 	rr.r = r;
 	rr.d = d;
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d0443589cd86..70d41a8fd788 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3029,10 +3029,10 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 				struct rdt_mon_domain *d,
 				struct rdt_resource *r, struct rdtgroup *prgrp)
 {
-	struct rmid_read rr = {0};
 	union mon_data_bits priv;
 	struct kernfs_node *kn;
 	struct mon_evt *mevt;
+	struct rmid_read rr;
 	char name[32];
 	int ret;
 
--
2.25.1

This reverts commit 51e37b1547765f08b8ad98b57209c918fcc6812c.
---
 arch/x86/kernel/cpu/resctrl/internal.h | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 681b5bdcd2f9..135190e0711c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -144,31 +144,12 @@ union mon_data_bits {
 	} u;
 };
 
-/**
- * struct rmid_read - Data passed across smp_call*() to read event count.
- * @rgrp:  Resource group for which the counter is being read. If it is a parent
- *	   resource group then its event count is summed with the count from all
- *	   its child resource groups.
- * @r:	   Resource describing the properties of the event being read.
- * @d:	   Domain that the counter should be read from. If NULL then sum all
- *	   domains in @r sharing L3 @ci.id
- * @evtid: Which monitor event to read.
- * @first: Initialize MBM counter when true.
- * @ci:    Cacheinfo for L3. Only set when @d is NULL. Used when summing domains.
- * @err:   Error encountered when reading counter.
- * @val:   Returned value of event counter. If @rgrp is a parent resource group,
- *	   @val includes the sum of event counts from its child resource groups.
- *	   If @d is NULL, @val includes the sum of all domains in @r sharing @ci.id,
- *	   (summed across child resource groups if @rgrp is a parent resource group).
- * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only).
- */
 struct rmid_read {
 	struct rdtgroup		*rgrp;
 	struct rdt_resource	*r;
 	struct rdt_mon_domain	*d;
 	enum resctrl_event_id	evtid;
 	bool			first;
-	struct cacheinfo	*ci;
 	int			err;
 	u64			val;
 	void			*arch_mon_ctx;
--
2.25.1

This reverts commit 3368249fabf3c810cd202e22737e6771cad81eec.
---
 arch/x86/kernel/cpu/resctrl/core.c        | 7 +------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 1 +
 include/linux/resctrl.h                   | 3 ---
 4 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a1b67620ac04..a48d3d4327a2 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -19,6 +19,7 @@
 #include <linux/cpu.h>
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/cacheinfo.h>
 #include <linux/cpuhotplug.h>
 
 #include <asm/intel-family.h>
@@ -607,12 +608,6 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	d = &hw_dom->d_resctrl;
 	d->hdr.id = id;
 	d->hdr.type = RESCTRL_MON_DOMAIN;
-	d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
-	if (!d->ci) {
-		pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
-		mon_domain_free(hw_dom);
-		return;
-	}
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 180bcacddf75..3e7e405ea715 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -11,6 +11,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/cacheinfo.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
 #include <linux/debugfs.h>
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 70d41a8fd788..d3b0fa958266 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -12,6 +12,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/cacheinfo.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index b0875b99e811..64b6ad1b22a1 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -2,7 +2,6 @@
 #ifndef _RESCTRL_H
 #define _RESCTRL_H
 
-#include <linux/cacheinfo.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/pid.h>
@@ -97,7 +96,6 @@ struct rdt_ctrl_domain {
 /**
  * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
  * @hdr:		common header for different domain types
- * @ci:			cache info for this domain
 * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
 * @mbm_total:		saved state for MBM total bandwidth
 * @mbm_local:		saved state for MBM local bandwidth
@@ -108,7 +106,6 @@ struct rdt_ctrl_domain {
 */
 struct rdt_mon_domain {
 	struct rdt_domain_hdr		hdr;
-	struct cacheinfo		*ci;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
--
2.25.1

This reverts commit 22e534936a1ecbfdcf213f1ef1fd6b2a3da4c6ce.
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d3b0fa958266..eb3bbfa96d5a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2335,18 +2335,14 @@ static void mba_sc_domain_destroy(struct rdt_resource *r,
 
 /*
  * MBA software controller is supported only if
- * MBM is supported and MBA is in linear scale,
- * and the MBM monitor scope is the same as MBA
- * control scope.
+ * MBM is supported and MBA is in linear scale.
 */
 static bool supports_mba_mbps(void)
 {
-	struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
 
 	return (is_mbm_local_enabled() &&
-		r->alloc_capable && is_mba_linear() &&
-		r->ctrl_scope == rmbm->mon_scope);
+		r->alloc_capable && is_mba_linear());
 }
 
 /*
@@ -2754,7 +2750,6 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	struct fs_parse_result result;
-	const char *msg;
 	int opt;
 
 	opt = fs_parse(fc, rdt_fs_parameters, param, &result);
@@ -2769,9 +2764,8 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->enable_cdpl2 = true;
 		return 0;
 	case Opt_mba_mbps:
-		msg = "mba_MBps requires local MBM and linear scale MBA at L3 scope";
 		if (!supports_mba_mbps())
-			return invalfc(fc, msg);
+			return -EINVAL;
 		ctx->enable_mba_mbps = true;
 		return 0;
 	case Opt_debug:
--
2.25.1

This reverts commit 9c7bf4bf48443593d2855ddfaad923f87544b63c.
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 56 +++------------------------
 1 file changed, 6 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index b6220089c68c..2f5d35f0716f 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -97,8 +97,6 @@ unsigned int resctrl_rmid_realloc_limit;
 
 #define CF(cf)	((unsigned long)(1048576 * (cf) + 0.5))
 
-static int snc_nodes_per_l3_cache = 1;
-
 /*
  * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
  * If rmid > rmid threshold, MBM total and local values should be multiplied
@@ -187,43 +185,7 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
 	return entry;
 }
 
-/*
- * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
- * "snc_nodes_per_l3_cache == 1") no translation of the RMID value is
- * needed. The physical RMID is the same as the logical RMID.
- *
- * On a platform with SNC mode enabled, Linux enables RMID sharing mode
- * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
- * Resource Director Technology Architecture Specification" for a full
- * description of RMID sharing mode).
- *
- * In RMID sharing mode there are fewer "logical RMID" values available
- * to accumulate data ("physical RMIDs" are divided evenly between SNC
- * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
- * each SNC node.
- *
- * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
- *
- * Data is collected independently on each SNC node and can be retrieved
- * using the "physical RMID" value computed by this function and loaded
- * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
- *
- * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
- * cache. So a "physical RMID" may be read from any CPU that shares
- * the L3 cache with the desired SNC node, not just from a CPU in
- * the specific SNC node.
- */
-static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
-{
-	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-
-	if (snc_nodes_per_l3_cache == 1)
-		return lrmid;
-
-	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
-}
-
-static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
+static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
 {
 	u64 msr_val;
 
@@ -235,7 +197,7 @@ static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
 	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
 	 * are error bits.
 	 */
-	wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
+	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
 	rdmsrl(MSR_IA32_QM_CTR, msr_val);
 
 	if (msr_val & RMID_VAL_ERROR)
@@ -271,17 +233,14 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     enum resctrl_event_id eventid)
 {
 	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
-	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	struct arch_mbm_state *am;
-	u32 prmid;
 
 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
 	if (am) {
 		memset(am, 0, sizeof(*am));
 
-		prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 		/* Record any initial, non-zero count value. */
-		__rmid_read_phys(prmid, eventid, &am->prev_msr);
+		__rmid_read(rmid, eventid, &am->prev_msr);
 	}
 }
 
@@ -316,10 +275,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 {
 	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
-	int cpu = cpumask_any(&d->hdr.cpu_mask);
 	struct arch_mbm_state *am;
 	u64 msr_val, chunks;
-	u32 prmid;
 	int ret;
 
 	resctrl_arch_rmid_read_context_check();
@@ -327,8 +284,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
 		return -EINVAL;
 
-	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
-	ret = __rmid_read_phys(prmid, eventid, &msr_val);
+	ret = __rmid_read(rmid, eventid, &msr_val);
 	if (ret)
 		return ret;
 
@@ -1067,8 +1023,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 	int ret;
 
 	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
-	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
-	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
+	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
+	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
 	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
 
 	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
--
2.25.1
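
A worked instance of the mapping removed above (numbers hypothetical):
with snc_nodes_per_l3_cache = 2 and r->num_rmid = 128 per SNC node, a
CPU in NUMA node 1 reading logical RMID 5 programs physical RMID
5 + (1 % 2) * 128 = 133 into IA32_QM_EVTSEL, while a node-0 CPU uses 5
unchanged. A standalone version of the reverted helper, for reference:

/* Hypothetical standalone form of logical_rmid_to_physical_rmid(). */
static int demo_logical_to_physical_rmid(int node, int lrmid,
					 int snc_nodes, int num_rmid)
{
	if (snc_nodes == 1)
		return lrmid;	/* no SNC: the two spaces are identical */
	return lrmid + (node % snc_nodes) * num_rmid;
}

With the revert applied, logical and physical RMIDs are always identical.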

This reverts commit 89fa7a721d083206a7aa8fcb0d8417c0182c34f1.
---
 arch/x86/kernel/cpu/resctrl/core.c | 2 --
 include/linux/resctrl.h           | 1 -
 2 files changed, 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index a48d3d4327a2..642fd1ec8611 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -510,8 +510,6 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 	case RESCTRL_L2_CACHE:
 	case RESCTRL_L3_CACHE:
 		return get_cpu_cacheinfo_id(cpu, scope);
-	case RESCTRL_L3_NODE:
-		return cpu_to_node(cpu);
 	default:
 		break;
 	}
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 64b6ad1b22a1..aa2c22a8e37b 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -176,7 +176,6 @@ struct resctrl_schema;
 enum resctrl_scope {
 	RESCTRL_L2_CACHE = 2,
 	RESCTRL_L3_CACHE = 3,
-	RESCTRL_L3_NODE,
 };
 
 /**
--
2.25.1

This reverts commit 0b82242ad24516ae7457f2473a6333b3df02b906.
---
 arch/x86/kernel/cpu/resctrl/core.c        | 71 +++++++++++------------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 ++++-----
 arch/x86/kernel/cpu/resctrl/internal.h    | 62 ++++++++------------
 arch/x86/kernel/cpu/resctrl/monitor.c     | 40 ++++++-------
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  6 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 64 ++++++++++----------
 include/linux/resctrl.h                   | 48 +++++++--------
 7 files changed, 145 insertions(+), 174 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 642fd1ec8611..e59c3fd8a1a1 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -309,8 +309,8 @@ static void rdt_get_cdp_l2_config(void)
 
 static void mba_wrmsr_amd(struct msr_param *m)
 {
-	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
+	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
@@ -333,8 +333,8 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 
 static void mba_wrmsr_intel(struct msr_param *m)
 {
-	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
+	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	/* Write the delay values for mba. */
@@ -344,17 +344,17 @@ static void mba_wrmsr_intel(struct msr_param *m)
 
 static void cat_wrmsr(struct msr_param *m)
 {
-	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
+	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom);
 	unsigned int i;
 
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_ctrl_domain *d;
+	struct rdt_domain *d;
 
 	lockdep_assert_cpus_held();
@@ -367,9 +367,9 @@ struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
 	return NULL;
 }
 
-struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
-	struct rdt_mon_domain *d;
+	struct rdt_domain *d;
 
 	lockdep_assert_cpus_held();
@@ -440,23 +440,18 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
 	*dc = r->default_ctrl;
 }
 
-static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
-{
-	kfree(hw_dom->ctrl_val);
-	kfree(hw_dom);
-}
-
-static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
+static void domain_free(struct rdt_hw_domain *hw_dom)
 {
 	kfree(hw_dom->arch_mbm_total);
 	kfree(hw_dom->arch_mbm_local);
+	kfree(hw_dom->ctrl_val);
 	kfree(hw_dom);
 }
 
-static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
 {
-	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	struct msr_param m;
 	u32 *dc;
@@ -481,7 +476,7 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
  * @num_rmid:	The size of the MBM counter array
  * @hw_dom:	The domain that owns the allocated arrays
 */
-static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
 {
 	size_t tsize;
@@ -520,10 +515,10 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
 static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_ctrl_domain *hw_dom;
 	struct list_head *add_pos = NULL;
+	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_ctrl_domain *d;
+	struct rdt_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -538,7 +533,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (hdr) {
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 			return;
-		d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+		d = container_of(hdr, struct rdt_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		if (r->cache.arch_has_per_cpu_cfg)
@@ -558,7 +553,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	rdt_domain_reconfigure_cdp(r);
 
 	if (domain_setup_ctrlval(r, d)) {
-		ctrl_domain_free(hw_dom);
+		domain_free(hw_dom);
 		return;
 	}
@@ -568,7 +563,7 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		ctrl_domain_free(hw_dom);
+		domain_free(hw_dom);
 	}
 }
@@ -576,9 +571,9 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
 	struct list_head *add_pos = NULL;
-	struct rdt_hw_mon_domain *hw_dom;
+	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_mon_domain *d;
+	struct rdt_domain *d;
 	int err;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -593,7 +588,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (hdr) {
 		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 			return;
-		d = container_of(hdr, struct rdt_mon_domain, hdr);
+		d = container_of(hdr, struct rdt_domain, hdr);
 
 		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 		return;
@@ -609,7 +604,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
 	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
-		mon_domain_free(hw_dom);
+		domain_free(hw_dom);
 		return;
 	}
@@ -619,7 +614,7 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
 	if (err) {
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		mon_domain_free(hw_dom);
+		domain_free(hw_dom);
 	}
 }
@@ -634,9 +629,9 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
 static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
-	struct rdt_hw_ctrl_domain *hw_dom;
+	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_ctrl_domain *d;
+	struct rdt_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -656,8 +651,8 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_ctrl_domain, hdr);
-	hw_dom = resctrl_to_arch_ctrl_dom(d);
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
@@ -666,12 +661,12 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
 		synchronize_rcu();
 
 		/*
-		 * rdt_ctrl_domain "d" is going to be freed below, so clear
+		 * rdt_domain "d" is going to be freed below, so clear
 		 * its pointer from pseudo_lock_region struct.
 		 */
 		if (d->plr)
 			d->plr->d = NULL;
-		ctrl_domain_free(hw_dom);
+		domain_free(hw_dom);
 
 		return;
 	}
@@ -680,9 +675,9 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 {
 	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_mon_domain *hw_dom;
+	struct rdt_hw_domain *hw_dom;
 	struct rdt_domain_hdr *hdr;
-	struct rdt_mon_domain *d;
+	struct rdt_domain *d;
 
 	lockdep_assert_held(&domain_list_lock);
@@ -702,15 +697,15 @@ static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
 	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
 		return;
 
-	d = container_of(hdr, struct rdt_mon_domain, hdr);
-	hw_dom = resctrl_to_arch_mon_dom(d);
+	d = container_of(hdr, struct rdt_domain, hdr);
+	hw_dom = resctrl_to_arch_dom(d);
 
 	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
 	if (cpumask_empty(&d->hdr.cpu_mask)) {
 		resctrl_offline_mon_domain(r, d);
 		list_del_rcu(&d->hdr.list);
 		synchronize_rcu();
-		mon_domain_free(hw_dom);
+		domain_free(hw_dom);
 
 		return;
 	}
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 97e554e36f07..9aceed827ff0 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -65,7 +65,7 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
 }
 
 int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
-	     struct rdt_ctrl_domain *d)
+	     struct rdt_domain *d)
 {
 	struct resctrl_staged_config *cfg;
 	u32 closid = data->rdtgrp->closid;
@@ -144,7 +144,7 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
  * resource type.
 */
 int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
-	      struct rdt_ctrl_domain *d)
+	      struct rdt_domain *d)
 {
 	struct rdtgroup *rdtgrp = data->rdtgrp;
 	struct resctrl_staged_config *cfg;
@@ -213,8 +213,8 @@ static int parse_line(char *line, struct resctrl_schema *s,
 	struct resctrl_staged_config *cfg;
 	struct rdt_resource *r = s->res;
 	struct rdt_parse_data data;
-	struct rdt_ctrl_domain *d;
 	char *dom = NULL, *id;
+	struct rdt_domain *d;
 	unsigned long dom_id;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
@@ -277,11 +277,11 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
 	}
 }
 
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val)
 {
-	struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
 	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+	struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d);
 	u32 idx = get_config_index(closid, t);
 	struct msr_param msr_param;
@@ -302,17 +302,17 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
 {
 	struct resctrl_staged_config *cfg;
-	struct rdt_hw_ctrl_domain *hw_dom;
+	struct rdt_hw_domain *hw_dom;
 	struct msr_param msr_param;
-	struct rdt_ctrl_domain *d;
 	enum resctrl_conf_type t;
+	struct rdt_domain *d;
 	u32 idx;
 
 	/* Walking r->domains, ensure it can't race with cpuhp */
 	lockdep_assert_cpus_held();
 
 	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		hw_dom = resctrl_to_arch_ctrl_dom(d);
+		hw_dom = resctrl_to_arch_dom(d);
 		msr_param.res = NULL;
 		for (t = 0; t < CDP_NUM_TYPES; t++) {
 			cfg = &hw_dom->d_resctrl.staged_config[t];
@@ -435,10 +435,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 	return ret ?:
nbytes; } -u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d, +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type) { - struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d); + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); u32 idx = get_config_index(closid, type); return hw_dom->ctrl_val[idx]; @@ -447,7 +447,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d, static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid) { struct rdt_resource *r = schema->res; - struct rdt_ctrl_domain *dom; + struct rdt_domain *dom; bool sep = false; u32 ctrl_val; @@ -519,7 +519,7 @@ static int smp_mon_event_count(void *arg) } void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, - struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, + struct rdt_domain *d, struct rdtgroup *rdtgrp, int evtid, int first) { int cpu; @@ -562,11 +562,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) { struct kernfs_open_file *of = m->private; struct rdt_domain_hdr *hdr; - struct rdt_mon_domain *d; u32 resid, evtid, domid; struct rdtgroup *rdtgrp; struct rdt_resource *r; union mon_data_bits md; + struct rdt_domain *d; struct rmid_read rr; int ret = 0; @@ -587,7 +587,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) ret = -ENOENT; goto out; } - d = container_of(hdr, struct rdt_mon_domain, hdr); + d = container_of(hdr, struct rdt_domain, hdr); mon_event_read(&rr, r, d, rdtgrp, evtid, false); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 135190e0711c..377679b79919 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -147,7 +147,7 @@ union mon_data_bits { struct rmid_read { struct rdtgroup *rgrp; struct rdt_resource *r; - struct rdt_mon_domain *d; + struct rdt_domain *d; enum resctrl_event_id evtid; bool first; int err; @@ -232,7 +232,7 @@ struct mongroup { */ struct pseudo_lock_region { struct resctrl_schema *s; - struct rdt_ctrl_domain *d; + struct rdt_domain *d; u32 cbm; wait_queue_head_t lock_thread_wq; int thread_done; @@ -355,41 +355,25 @@ struct arch_mbm_state { }; /** - * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share - * a resource for a control function + * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share + * a resource * @d_resctrl: Properties exposed to the resctrl file system * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID) - * - * Members of this structure are accessed via helpers that provide abstraction. - */ -struct rdt_hw_ctrl_domain { - struct rdt_ctrl_domain d_resctrl; - u32 *ctrl_val; -}; - -/** - * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share - * a resource for a monitor function - * @d_resctrl: Properties exposed to the resctrl file system * @arch_mbm_total: arch private state for MBM total bandwidth * @arch_mbm_local: arch private state for MBM local bandwidth * * Members of this structure are accessed via helpers that provide abstraction. 
*/ -struct rdt_hw_mon_domain { - struct rdt_mon_domain d_resctrl; +struct rdt_hw_domain { + struct rdt_domain d_resctrl; + u32 *ctrl_val; struct arch_mbm_state *arch_mbm_total; struct arch_mbm_state *arch_mbm_local; }; -static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r) -{ - return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl); -} - -static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r) +static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r) { - return container_of(r, struct rdt_hw_mon_domain, d_resctrl); + return container_of(r, struct rdt_hw_domain, d_resctrl); } /** @@ -401,7 +385,7 @@ static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_d */ struct msr_param { struct rdt_resource *res; - struct rdt_ctrl_domain *dom; + struct rdt_domain *dom; u32 low; u32 high; }; @@ -474,9 +458,9 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r } int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_ctrl_domain *d); + struct rdt_domain *d); int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_ctrl_domain *d); + struct rdt_domain *d); extern struct mutex rdtgroup_mutex; @@ -580,22 +564,22 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off); int rdtgroup_schemata_show(struct kernfs_open_file *of, struct seq_file *s, void *v); -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d, +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, unsigned long cbm, int closid, bool exclusive); -unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d, +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d, unsigned long cbm); enum rdtgrp_mode rdtgroup_mode_by_closid(int closid); int rdtgroup_tasks_assigned(struct rdtgroup *r); int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp); int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp); -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm); -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d); +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm); +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d); int rdt_pseudo_lock_init(void); void rdt_pseudo_lock_release(void); int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); -struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r); -struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r); +struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r); +struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r); int closids_supported(void); void closid_free(int closid); int alloc_rmid(u32 closid); @@ -606,19 +590,19 @@ bool __init rdt_cpu_has(int flag); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, - struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, + struct rdt_domain *d, struct rdtgroup *rdtgrp, int evtid, int first); -void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, int exclude_cpu); void mbm_handle_overflow(struct 
work_struct *work); void __init intel_rdt_mbm_apply_quirk(void); bool is_mba_sc(struct rdt_resource *r); -void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms, +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, int exclude_cpu); void cqm_handle_limbo(struct work_struct *work); -bool has_busy_rmid(struct rdt_mon_domain *d); -void __check_limbo(struct rdt_mon_domain *d, bool force_free); +bool has_busy_rmid(struct rdt_domain *d); +void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); void __init thread_throttle_mode_init(void); void __init mbm_config_rftype_init(const char *config); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 2f5d35f0716f..18ed3a6b0818 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -209,7 +209,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) return 0; } -static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom, +static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom, u32 rmid, enum resctrl_event_id eventid) { @@ -228,11 +228,11 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_do return NULL; } -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, u32 unused, u32 rmid, enum resctrl_event_id eventid) { - struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d); + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); struct arch_mbm_state *am; am = get_arch_mbm_state(hw_dom, rmid, eventid); @@ -248,9 +248,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, * Assumes that hardware counters are also reset and thus that there is * no need to record initial non-zero counts. */ -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d) +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) { - struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d); + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); if (is_mbm_total_enabled()) memset(hw_dom->arch_mbm_total, 0, @@ -269,12 +269,12 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width) return chunks >> shift; } -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d, +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u32 unused, u32 rmid, enum resctrl_event_id eventid, u64 *val, void *ignored) { - struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d); struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); struct arch_mbm_state *am; u64 msr_val, chunks; int ret; @@ -320,7 +320,7 @@ static void limbo_release_entry(struct rmid_entry *entry) * decrement the count. 
If the busy count gets to zero on an RMID, we * free the RMID */ -void __check_limbo(struct rdt_mon_domain *d, bool force_free) +void __check_limbo(struct rdt_domain *d, bool force_free) { struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; u32 idx_limit = resctrl_arch_system_num_rmid_idx(); @@ -378,7 +378,7 @@ void __check_limbo(struct rdt_mon_domain *d, bool force_free) resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx); } -bool has_busy_rmid(struct rdt_mon_domain *d) +bool has_busy_rmid(struct rdt_domain *d) { u32 idx_limit = resctrl_arch_system_num_rmid_idx(); @@ -479,7 +479,7 @@ int alloc_rmid(u32 closid) static void add_rmid_to_limbo(struct rmid_entry *entry) { struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; - struct rdt_mon_domain *d; + struct rdt_domain *d; u32 idx; lockdep_assert_held(&rdtgroup_mutex); @@ -532,7 +532,7 @@ void free_rmid(u32 closid, u32 rmid) list_add_tail(&entry->list, &rmid_free_lru); } -static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid, +static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id evtid) { u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); @@ -668,12 +668,12 @@ void mon_event_count(void *info) * throttle MSRs already have low percentage values. To avoid * unnecessarily restricting such rdtgroups, we also increase the bandwidth. */ -static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm) +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) { u32 closid, rmid, cur_msr_val, new_msr_val; struct mbm_state *pmbm_data, *cmbm_data; - struct rdt_ctrl_domain *dom_mba; struct rdt_resource *r_mba; + struct rdt_domain *dom_mba; u32 cur_bw, user_bw, idx; struct list_head *head; struct rdtgroup *entry; @@ -734,7 +734,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm) resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val); } -static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, +static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 rmid) { struct rmid_read rr; @@ -792,12 +792,12 @@ static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, void cqm_handle_limbo(struct work_struct *work) { unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); - struct rdt_mon_domain *d; + struct rdt_domain *d; cpus_read_lock(); mutex_lock(&rdtgroup_mutex); - d = container_of(work, struct rdt_mon_domain, cqm_limbo.work); + d = container_of(work, struct rdt_domain, cqm_limbo.work); __check_limbo(d, false); @@ -820,7 +820,7 @@ void cqm_handle_limbo(struct work_struct *work) * @exclude_cpu: Which CPU the handler should not run on, * RESCTRL_PICK_ANY_CPU to pick any CPU. 
*/ -void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms, +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, int exclude_cpu) { unsigned long delay = msecs_to_jiffies(delay_ms); @@ -837,9 +837,9 @@ void mbm_handle_overflow(struct work_struct *work) { unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); struct rdtgroup *prgrp, *crgrp; - struct rdt_mon_domain *d; struct list_head *head; struct rdt_resource *r; + struct rdt_domain *d; cpus_read_lock(); mutex_lock(&rdtgroup_mutex); @@ -852,7 +852,7 @@ void mbm_handle_overflow(struct work_struct *work) goto out_unlock; r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; - d = container_of(work, struct rdt_mon_domain, mbm_over.work); + d = container_of(work, struct rdt_domain, mbm_over.work); list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { mbm_update(r, d, prgrp->closid, prgrp->mon.rmid); @@ -886,7 +886,7 @@ void mbm_handle_overflow(struct work_struct *work) * @exclude_cpu: Which CPU the handler should not run on, * RESCTRL_PICK_ANY_CPU to pick any CPU. */ -void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms, +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, int exclude_cpu) { unsigned long delay = msecs_to_jiffies(delay_ms); diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 3e7e405ea715..63941dab11b2 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -809,7 +809,7 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) * Return: true if @cbm overlaps with pseudo-locked region on @d, false * otherwise. */ -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm) +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm) { unsigned int cbm_len; unsigned long cbm_b; @@ -836,11 +836,11 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned lon * if it is not possible to test due to memory allocation issue, * false otherwise. */ -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d) +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) { - struct rdt_ctrl_domain *d_i; cpumask_var_t cpu_with_psl; struct rdt_resource *r; + struct rdt_domain *d_i; bool ret = false; /* Walking r->domains, ensure it can't race with cpuhp */ diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index eb3bbfa96d5a..17d4610eecf5 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -92,8 +92,8 @@ void rdt_last_cmd_printf(const char *fmt, ...) void rdt_staged_configs_clear(void) { - struct rdt_ctrl_domain *dom; struct rdt_resource *r; + struct rdt_domain *dom; lockdep_assert_held(&rdtgroup_mutex); @@ -1012,7 +1012,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of, unsigned long sw_shareable = 0, hw_shareable = 0; unsigned long exclusive = 0, pseudo_locked = 0; struct rdt_resource *r = s->res; - struct rdt_ctrl_domain *dom; + struct rdt_domain *dom; int i, hwb, swb, excl, psl; enum rdtgrp_mode mode; bool sep = false; @@ -1243,7 +1243,7 @@ static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of, * * Return: false if CBM does not overlap, true if it does. 
*/ -static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d, +static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d, unsigned long cbm, int closid, enum resctrl_conf_type type, bool exclusive) { @@ -1298,7 +1298,7 @@ static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_doma * * Return: true if CBM overlap detected, false if there is no overlap */ -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d, +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, unsigned long cbm, int closid, bool exclusive) { enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); @@ -1329,10 +1329,10 @@ bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d, static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) { int closid = rdtgrp->closid; - struct rdt_ctrl_domain *d; struct resctrl_schema *s; struct rdt_resource *r; bool has_cache = false; + struct rdt_domain *d; u32 ctrl; /* Walking r->domains, ensure it can't race with cpuhp */ @@ -1448,7 +1448,7 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, * bitmap functions work correctly. */ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, - struct rdt_ctrl_domain *d, unsigned long cbm) + struct rdt_domain *d, unsigned long cbm) { unsigned int size = 0; struct cacheinfo *ci; @@ -1476,9 +1476,9 @@ static int rdtgroup_size_show(struct kernfs_open_file *of, { struct resctrl_schema *schema; enum resctrl_conf_type type; - struct rdt_ctrl_domain *d; struct rdtgroup *rdtgrp; struct rdt_resource *r; + struct rdt_domain *d; unsigned int size; int ret = 0; u32 closid; @@ -1590,7 +1590,7 @@ static void mon_event_config_read(void *info) mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; } -static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info) +static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info) { smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1); } @@ -1598,7 +1598,7 @@ static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) { struct mon_config_info mon_info = {0}; - struct rdt_mon_domain *dom; + struct rdt_domain *dom; bool sep = false; cpus_read_lock(); @@ -1657,7 +1657,7 @@ static void mon_event_config_write(void *info) } static void mbm_config_write_domain(struct rdt_resource *r, - struct rdt_mon_domain *d, u32 evtid, u32 val) + struct rdt_domain *d, u32 evtid, u32 val) { struct mon_config_info mon_info = {0}; @@ -1698,7 +1698,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); char *dom_str = NULL, *id_str; unsigned long dom_id, val; - struct rdt_mon_domain *d; + struct rdt_domain *d; /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); @@ -2257,9 +2257,9 @@ static inline bool is_mba_linear(void) static int set_cache_qos_cfg(int level, bool enable) { void (*update)(void *arg); - struct rdt_ctrl_domain *d; struct rdt_resource *r_l; cpumask_var_t cpu_mask; + struct rdt_domain *d; int cpu; /* Walking r->domains, ensure it can't race with cpuhp */ @@ -2309,7 +2309,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r) l3_qos_cfg_update(&hw_res->cdp_enabled); } -static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d) +static 
int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d) { u32 num_closid = resctrl_arch_get_num_closid(r); int cpu = cpumask_any(&d->hdr.cpu_mask); @@ -2327,7 +2327,7 @@ static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain } static void mba_sc_domain_destroy(struct rdt_resource *r, - struct rdt_ctrl_domain *d) + struct rdt_domain *d) { kfree(d->mbps_val); d->mbps_val = NULL; @@ -2353,7 +2353,7 @@ static int set_mba_sc(bool mba_sc) { struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; u32 num_closid = resctrl_arch_get_num_closid(r); - struct rdt_ctrl_domain *d; + struct rdt_domain *d; int i; if (!supports_mba_mbps() || mba_sc == is_mba_sc(r)) @@ -2625,7 +2625,7 @@ static int rdt_get_tree(struct fs_context *fc) { struct rdt_fs_context *ctx = rdt_fc2context(fc); unsigned long flags = RFTYPE_CTRL_BASE; - struct rdt_mon_domain *dom; + struct rdt_domain *dom; struct rdt_resource *r; int ret; @@ -2810,9 +2810,9 @@ static int rdt_init_fs_context(struct fs_context *fc) static int reset_all_ctrls(struct rdt_resource *r) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - struct rdt_hw_ctrl_domain *hw_dom; + struct rdt_hw_domain *hw_dom; struct msr_param msr_param; - struct rdt_ctrl_domain *d; + struct rdt_domain *d; int i; /* Walking r->domains, ensure it can't race with cpuhp */ @@ -2828,7 +2828,7 @@ static int reset_all_ctrls(struct rdt_resource *r) * from each domain to update the MSRs below. */ list_for_each_entry(d, &r->ctrl_domains, hdr.list) { - hw_dom = resctrl_to_arch_ctrl_dom(d); + hw_dom = resctrl_to_arch_dom(d); for (i = 0; i < hw_res->num_closid; i++) hw_dom->ctrl_val[i] = r->default_ctrl; @@ -3021,7 +3021,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, } static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, - struct rdt_mon_domain *d, + struct rdt_domain *d, struct rdt_resource *r, struct rdtgroup *prgrp) { union mon_data_bits priv; @@ -3070,7 +3070,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, * and "monitor" groups with given domain id. */ static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, - struct rdt_mon_domain *d) + struct rdt_domain *d) { struct kernfs_node *parent_kn; struct rdtgroup *prgrp, *crgrp; @@ -3092,7 +3092,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, struct rdt_resource *r, struct rdtgroup *prgrp) { - struct rdt_mon_domain *dom; + struct rdt_domain *dom; int ret; /* Walking r->domains, ensure it can't race with cpuhp */ @@ -3197,7 +3197,7 @@ static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r) * Set the RDT domain up to start off with all usable allocations. That is, * all shareable and unused bits. All-zero CBM is invalid. 
*/ -static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s, +static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s, u32 closid) { enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); @@ -3277,7 +3277,7 @@ static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schem */ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) { - struct rdt_ctrl_domain *d; + struct rdt_domain *d; int ret; list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) { @@ -3293,7 +3293,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) { struct resctrl_staged_config *cfg; - struct rdt_ctrl_domain *d; + struct rdt_domain *d; list_for_each_entry(d, &r->ctrl_domains, hdr.list) { if (is_mba_sc(r)) { @@ -3919,14 +3919,14 @@ static void __init rdtgroup_setup_default(void) mutex_unlock(&rdtgroup_mutex); } -static void domain_destroy_mon_state(struct rdt_mon_domain *d) +static void domain_destroy_mon_state(struct rdt_domain *d) { bitmap_free(d->rmid_busy_llc); kfree(d->mbm_total); kfree(d->mbm_local); } -void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d) +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d) { mutex_lock(&rdtgroup_mutex); @@ -3936,7 +3936,7 @@ void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain mutex_unlock(&rdtgroup_mutex); } -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d) +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d) { mutex_lock(&rdtgroup_mutex); @@ -3967,7 +3967,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d mutex_unlock(&rdtgroup_mutex); } -static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d) +static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) { u32 idx_limit = resctrl_arch_system_num_rmid_idx(); size_t tsize; @@ -3998,7 +3998,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain return 0; } -int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d) +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d) { int err = 0; @@ -4014,7 +4014,7 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d return err; } -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d) +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d) { int err; @@ -4069,8 +4069,8 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) void resctrl_offline_cpu(unsigned int cpu) { struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; - struct rdt_mon_domain *d; struct rdtgroup *rdtgrp; + struct rdt_domain *d; mutex_lock(&rdtgroup_mutex); list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index aa2c22a8e37b..96ddf9ff3183 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -78,23 +78,7 @@ struct rdt_domain_hdr { }; /** - * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource - * @hdr: common header for different domain types - * @plr: pseudo-locked region (if any) associated with domain - * @staged_config: parsed configuration to be applied - * @mbps_val: When mba_sc is enabled, this holds the array of user - * 
specified control values for mba_sc in MBps, indexed - * by closid - */ -struct rdt_ctrl_domain { - struct rdt_domain_hdr hdr; - struct pseudo_lock_region *plr; - struct resctrl_staged_config staged_config[CDP_NUM_TYPES]; - u32 *mbps_val; -}; - -/** - * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource + * struct rdt_domain - group of CPUs sharing a resctrl resource * @hdr: common header for different domain types * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold * @mbm_total: saved state for MBM total bandwidth @@ -103,8 +87,13 @@ struct rdt_ctrl_domain { * @cqm_limbo: worker to periodically read CQM h/w counters * @mbm_work_cpu: worker CPU for MBM h/w counters * @cqm_work_cpu: worker CPU for CQM h/w counters + * @plr: pseudo-locked region (if any) associated with domain + * @staged_config: parsed configuration to be applied + * @mbps_val: When mba_sc is enabled, this holds the array of user + * specified control values for mba_sc in MBps, indexed + * by closid */ -struct rdt_mon_domain { +struct rdt_domain { struct rdt_domain_hdr hdr; unsigned long *rmid_busy_llc; struct mbm_state *mbm_total; @@ -113,6 +102,9 @@ struct rdt_mon_domain { struct delayed_work cqm_limbo; int mbm_work_cpu; int cqm_work_cpu; + struct pseudo_lock_region *plr; + struct resctrl_staged_config staged_config[CDP_NUM_TYPES]; + u32 *mbps_val; }; /** @@ -216,7 +208,7 @@ struct rdt_resource { const char *format_str; int (*parse_ctrlval)(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_ctrl_domain *d); + struct rdt_domain *d); struct list_head evt_list; unsigned long fflags; bool cdp_capable; @@ -250,15 +242,15 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid); * Update the ctrl_val and apply this config right now. * Must be called on one of the domain's CPUs. */ -int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d, +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type t, u32 cfg_val); -u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d, +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type); -int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d); -int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d); -void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d); -void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d); +int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d); +int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d); +void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d); +void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d); void resctrl_online_cpu(unsigned int cpu); void resctrl_offline_cpu(unsigned int cpu); @@ -287,7 +279,7 @@ void resctrl_offline_cpu(unsigned int cpu); * Return: * 0 on success, or -EIO, -EINVAL etc on error. */ -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d, +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id eventid, u64 *val, void *arch_mon_ctx); @@ -320,7 +312,7 @@ static inline void resctrl_arch_rmid_read_context_check(void) * * This can be called from any CPU. 
*/ -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id eventid); @@ -333,7 +325,7 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, * * This can be called from any CPU. */ -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d); +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d); extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit; -- 2.25.1
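For readers following the series: the net effect of this first revert is easiest to see in the resulting arch-private structure. The sketch below is condensed from the hunks above (layout only, comments paraphrased; the full definitions live in arch/x86/kernel/cpu/resctrl/internal.h):

/*
 * After the revert, a single rdt_hw_domain again carries both the
 * control state (ctrl_val, indexed by CLOSID) and the monitor state
 * (arch_mbm_total/arch_mbm_local), instead of the split
 * rdt_hw_ctrl_domain/rdt_hw_mon_domain pair.
 */
struct rdt_hw_domain {
	struct rdt_domain	d_resctrl;	/* properties exposed to resctrl */
	u32			*ctrl_val;	/* cache or memory ctrl values */
	struct arch_mbm_state	*arch_mbm_total;
	struct arch_mbm_state	*arch_mbm_local;
};

/* One container_of() helper now serves both control and monitor paths. */
static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
{
	return container_of(r, struct rdt_hw_domain, d_resctrl);
}

This is also why the revert can collapse ctrl_domain_free() and mon_domain_free() into a single domain_free(): every per-domain allocation hangs off the one structure.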

This reverts commit 960daf33b7aca2f1f9533af1dfe91dc593833d52.
---
 arch/x86/kernel/cpu/resctrl/core.c        | 224 +++++-----------------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  12 +-
 arch/x86/kernel/cpu/resctrl/internal.h    |   7 +-
 arch/x86/kernel/cpu/resctrl/monitor.c     |   4 +-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  60 +++---
 include/linux/resctrl.h                   |  25 +--
 7 files changed, 96 insertions(+), 240 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index e59c3fd8a1a1..2ff05ffda328 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,8 +60,7 @@ static void mba_wrmsr_intel(struct msr_param *m);
 static void cat_wrmsr(struct msr_param *m);
 static void mba_wrmsr_amd(struct msr_param *m);
 
-#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
-#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
+#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
 
 struct rdt_hw_resource rdt_resources_all[] = {
	[RDT_RESOURCE_L3] =
@@ -69,10 +68,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
		.r_resctrl = {
			.rid			= RDT_RESOURCE_L3,
			.name			= "L3",
-			.ctrl_scope		= RESCTRL_L3_CACHE,
-			.mon_scope		= RESCTRL_L3_CACHE,
-			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L3),
-			.mon_domains		= mon_domain_init(RDT_RESOURCE_L3),
+			.scope			= RESCTRL_L3_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_L3),
			.parse_ctrlval		= parse_cbm,
			.format_str		= "%d=%0*x",
			.fflags			= RFTYPE_RES_CACHE,
@@ -85,8 +82,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
		.r_resctrl = {
			.rid			= RDT_RESOURCE_L2,
			.name			= "L2",
-			.ctrl_scope		= RESCTRL_L2_CACHE,
-			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_L2),
+			.scope			= RESCTRL_L2_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_L2),
			.parse_ctrlval		= parse_cbm,
			.format_str		= "%d=%0*x",
			.fflags			= RFTYPE_RES_CACHE,
@@ -99,8 +96,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
		.r_resctrl = {
			.rid			= RDT_RESOURCE_MBA,
			.name			= "MB",
-			.ctrl_scope		= RESCTRL_L3_CACHE,
-			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_MBA),
+			.scope			= RESCTRL_L3_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_MBA),
			.parse_ctrlval		= parse_bw,
			.format_str		= "%d=%*u",
			.fflags			= RFTYPE_RES_MB,
@@ -111,8 +108,8 @@ struct rdt_hw_resource rdt_resources_all[] = {
		.r_resctrl = {
			.rid			= RDT_RESOURCE_SMBA,
			.name			= "SMBA",
-			.ctrl_scope		= RESCTRL_L3_CACHE,
-			.ctrl_domains		= ctrl_domain_init(RDT_RESOURCE_SMBA),
+			.scope			= RESCTRL_L3_CACHE,
+			.domains		= domain_init(RDT_RESOURCE_SMBA),
			.parse_ctrlval		= parse_bw,
			.format_str		= "%d=%*u",
			.fflags			= RFTYPE_RES_MB,
@@ -352,28 +349,13 @@ static void cat_wrmsr(struct msr_param *m)
		wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
 }
 
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r)
+struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
 {
	struct rdt_domain *d;
 
	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
-		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
-			return d;
-	}
-
-	return NULL;
-}
-
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r)
-{
-	struct rdt_domain *d;
-
-	lockdep_assert_cpus_held();
-
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
		/* Find the domain that contains this CPU */
		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
			return d;
@@ -397,26 +379,26 @@ void rdt_ctrl_update(void *arg)
 }
 
 /*
- * rdt_find_domain - Search for a domain id in a resource domain list.
+ * rdt_find_domain - Find a domain in a resource that matches input resource id
 *
- * Search the domain list to find the domain id. If the domain id is
- * found, return the domain. NULL otherwise.  If the domain id is not
- * found (and NULL returned) then the first domain with id bigger than
- * the input id can be returned to the caller via @pos.
+ * Search resource r's domain list to find the resource id. If the resource
+ * id is found in a domain, return the domain. Otherwise, if requested by
+ * caller, return the first domain whose id is bigger than the input id.
+ * The domain list is sorted by id in ascending order.
 */
-struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
-				       struct list_head **pos)
+struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+				   struct list_head **pos)
 {
-	struct rdt_domain_hdr *d;
+	struct rdt_domain *d;
	struct list_head *l;
 
-	list_for_each(l, h) {
-		d = list_entry(l, struct rdt_domain_hdr, list);
+	list_for_each(l, &r->domains) {
+		d = list_entry(l, struct rdt_domain, hdr.list);
		/* When id is found, return its domain. */
-		if (id == d->id)
+		if (id == d->hdr.id)
			return d;
		/* Stop searching when finding id's position in sorted list. */
-		if (id < d->id)
+		if (id < d->hdr.id)
			break;
	}
 
@@ -512,29 +494,38 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
	return -EINVAL;
 }
 
-static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
+/*
+ * domain_add_cpu - Add a cpu to a resource's domain list.
+ *
+ * If an existing domain in the resource r's domain list matches the cpu's
+ * resource id, add the cpu in the domain.
+ *
+ * Otherwise, a new domain is allocated and inserted into the right position
+ * in the domain list sorted by id in ascending order.
+ *
+ * The order in the domain list is visible to users when we print entries
+ * in the schemata file and schemata input is validated to have the same order
+ * as this list.
+ */
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	int id = get_domain_id_from_scope(cpu, r->scope);
	struct list_head *add_pos = NULL;
	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain_hdr *hdr;
	struct rdt_domain *d;
	int err;
 
	lockdep_assert_held(&domain_list_lock);
 
	if (id < 0) {
-		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->ctrl_scope, r->name);
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
		return;
	}
 
-	hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos);
-	if (hdr) {
-		if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
-			return;
-		d = container_of(hdr, struct rdt_domain, hdr);
-
+	d = rdt_find_domain(r, id, &add_pos);
+	if (d) {
		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
		if (r->cache.arch_has_per_cpu_cfg)
			rdt_domain_reconfigure_cdp(r);
		return;
@@ -547,70 +538,23 @@ static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
 
	d = &hw_dom->d_resctrl;
	d->hdr.id = id;
-	d->hdr.type = RESCTRL_CTRL_DOMAIN;
	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
 
	rdt_domain_reconfigure_cdp(r);
 
-	if (domain_setup_ctrlval(r, d)) {
+	if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
		domain_free(hw_dom);
		return;
	}
 
-	list_add_tail_rcu(&d->hdr.list, add_pos);
-
-	err = resctrl_online_ctrl_domain(r, d);
-	if (err) {
-		list_del_rcu(&d->hdr.list);
-		synchronize_rcu();
-		domain_free(hw_dom);
-	}
-}
-
-static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
-{
-	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct list_head *add_pos = NULL;
-	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
-	int err;
-
-	lockdep_assert_held(&domain_list_lock);
-
-	if (id < 0) {
-		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->mon_scope, r->name);
-		return;
-	}
-
-	hdr = rdt_find_domain(&r->mon_domains, id, &add_pos);
-	if (hdr) {
-		if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
-			return;
-		d = container_of(hdr, struct rdt_domain, hdr);
-
-		cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
-		return;
-	}
-
-	hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
-	if (!hw_dom)
-		return;
-
-	d = &hw_dom->d_resctrl;
-	d->hdr.id = id;
-	d->hdr.type = RESCTRL_MON_DOMAIN;
-	cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
-
-	if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
+	if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) {
		domain_free(hw_dom);
		return;
	}
 
	list_add_tail_rcu(&d->hdr.list, add_pos);
 
-	err = resctrl_online_mon_domain(r, d);
+	err = resctrl_online_domain(r, d);
	if (err) {
		list_del_rcu(&d->hdr.list);
		synchronize_rcu();
@@ -618,45 +562,30 @@ static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
	}
 }
 
-static void domain_add_cpu(int cpu, struct rdt_resource *r)
-{
-	if (r->alloc_capable)
-		domain_add_cpu_ctrl(cpu, r);
-	if (r->mon_capable)
-		domain_add_cpu_mon(cpu, r);
-}
-
-static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 {
-	int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+	int id = get_domain_id_from_scope(cpu, r->scope);
	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain_hdr *hdr;
	struct rdt_domain *d;
 
	lockdep_assert_held(&domain_list_lock);
 
	if (id < 0) {
-		pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->ctrl_scope, r->name);
+		pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+			     cpu, r->scope, r->name);
		return;
	}
 
-	hdr = rdt_find_domain(&r->ctrl_domains, id, NULL);
-	if (!hdr) {
-		pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
-			id, cpu, r->name);
+	d = rdt_find_domain(r, id, NULL);
+	if (!d) {
+		pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu);
		return;
	}
-
-	if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
-		return;
-
-	d = container_of(hdr, struct rdt_domain, hdr);
	hw_dom = resctrl_to_arch_dom(d);
 
	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_ctrl_domain(r, d);
+		resctrl_offline_domain(r, d);
		list_del_rcu(&d->hdr.list);
		synchronize_rcu();
 
@@ -672,53 +601,6 @@ static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
	}
 }
 
-static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
-{
-	int id = get_domain_id_from_scope(cpu, r->mon_scope);
-	struct rdt_hw_domain *hw_dom;
-	struct rdt_domain_hdr *hdr;
-	struct rdt_domain *d;
-
-	lockdep_assert_held(&domain_list_lock);
-
-	if (id < 0) {
-		pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
-			     cpu, r->mon_scope, r->name);
-		return;
-	}
-
-	hdr = rdt_find_domain(&r->mon_domains, id, NULL);
-	if (!hdr) {
-		pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
-			id, cpu, r->name);
-		return;
-	}
-
-	if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
-		return;
-
-	d = container_of(hdr, struct rdt_domain, hdr);
-	hw_dom = resctrl_to_arch_dom(d);
-
-	cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
-	if (cpumask_empty(&d->hdr.cpu_mask)) {
-		resctrl_offline_mon_domain(r, d);
-		list_del_rcu(&d->hdr.list);
-		synchronize_rcu();
-		domain_free(hw_dom);
-
-		return;
-	}
-}
-
-static void domain_remove_cpu(int cpu, struct rdt_resource *r)
-{
-	if (r->alloc_capable)
-		domain_remove_cpu_ctrl(cpu, r);
-	if (r->mon_capable)
-		domain_remove_cpu_mon(cpu, r);
-}
-
 static void clear_closid_rmid(int cpu)
 {
	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 9aceed827ff0..1d8261e84362 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -236,7 +236,7 @@ static int parse_line(char *line, struct resctrl_schema *s,
			return -EINVAL;
		}
		dom = strim(dom);
-		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
			if (d->hdr.id == dom_id) {
				data.buf = dom;
				data.rdtgrp = rdtgrp;
@@ -311,7 +311,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
	/* Walking r->domains, ensure it can't race with cpuhp */
	lockdep_assert_cpus_held();
 
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
		hw_dom = resctrl_to_arch_dom(d);
		msr_param.res = NULL;
		for (t = 0; t < CDP_NUM_TYPES; t++) {
@@ -455,7 +455,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo
	lockdep_assert_cpus_held();
 
	seq_printf(s, "%*s:", max_name_width, schema->name);
-	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
		if (sep)
			seq_puts(s, ";");
 
@@ -561,7 +561,6 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
 {
	struct kernfs_open_file *of = m->private;
-	struct rdt_domain_hdr *hdr;
	u32 resid, evtid, domid;
	struct rdtgroup *rdtgrp;
	struct rdt_resource *r;
@@ -582,12 +581,11 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
	evtid = md.u.evtid;
	r = &rdt_resources_all[resid].r_resctrl;
 
-	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
-	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
+	d = rdt_find_domain(r, domid, NULL);
+	if (!d) {
		ret = -ENOENT;
		goto out;
	}
-	d = container_of(hdr, struct rdt_domain, hdr);
 
	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
 
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 377679b79919..f1d926832ec8 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -558,8 +558,8 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn);
 int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
 int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
			     umode_t mask);
-struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id,
-				       struct list_head **pos);
+struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
+				   struct list_head **pos);
 ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, char *buf,
				size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
@@ -578,8 +578,7 @@ int rdt_pseudo_lock_init(void);
 void rdt_pseudo_lock_release(void);
 int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
 void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
-struct rdt_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r);
-struct rdt_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r);
+struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closids_supported(void);
 void closid_free(int closid);
 int alloc_rmid(u32 closid);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 18ed3a6b0818..4d7c596f9276 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
 
	entry->busy = 0;
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
		/*
		 * For the first limbo RMID in the domain,
		 * setup up the limbo worker.
@@ -688,7 +688,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
	idx = resctrl_arch_rmid_idx_encode(closid, rmid);
	pmbm_data = &dom_mbm->mbm_local[idx];
 
-	dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba);
+	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
	if (!dom_mba) {
		pr_warn_once("Failure to get domain for MBA update\n");
		return;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 63941dab11b2..86436464959c 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,7 +292,7 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
 */
 static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
 {
-	enum resctrl_scope scope = plr->s->res->ctrl_scope;
+	enum resctrl_scope scope = plr->s->res->scope;
	struct cacheinfo *ci;
	int ret;
 
@@ -854,7 +854,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
	 * associated with them.
	 */
	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) {
+		list_for_each_entry(d_i, &r->domains, hdr.list) {
			if (d_i->plr)
				cpumask_or(cpu_with_psl, cpu_with_psl,
					   &d_i->hdr.cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 17d4610eecf5..b6ba77cdf0e8 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void)
	lockdep_assert_held(&rdtgroup_mutex);
 
	for_each_alloc_capable_rdt_resource(r) {
-		list_for_each_entry(dom, &r->ctrl_domains, hdr.list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
			memset(dom->staged_config, 0, sizeof(dom->staged_config));
	}
 }
@@ -1021,7 +1021,7 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of,
	cpus_read_lock();
	mutex_lock(&rdtgroup_mutex);
	hw_shareable = r->cache.shareable_bits;
-	list_for_each_entry(dom, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
		if (sep)
			seq_putc(seq, ';');
		sw_shareable = 0;
@@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp)
		if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)
			continue;
		has_cache = true;
-		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
			ctrl = resctrl_arch_get_config(r, d, closid,
						       s->conf_type);
			if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) {
@@ -1454,11 +1454,11 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
	struct cacheinfo *ci;
	int num_b;
 
-	if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE))
+	if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
		return size;
 
	num_b = bitmap_weight(&cbm, r->cache.cbm_len);
-	ci = get_cpu_cacheinfo_level(cpumask_any(&d->hdr.cpu_mask), r->ctrl_scope);
+	ci = get_cpu_cacheinfo_level(cpumask_any(&d->hdr.cpu_mask), r->scope);
	if (ci)
		size = ci->size / r->cache.cbm_len * num_b;
 
@@ -1514,7 +1514,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of,
		type = schema->conf_type;
		sep = false;
		seq_printf(s, "%*s:", max_name_width, schema->name);
-		list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+		list_for_each_entry(d, &r->domains, hdr.list) {
			if (sep)
				seq_putc(s, ';');
			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
@@ -1604,7 +1604,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid
	cpus_read_lock();
	mutex_lock(&rdtgroup_mutex);
 
-	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
		if (sep)
			seq_puts(s, ";");
 
@@ -1728,7 +1728,7 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid)
		return -EINVAL;
	}
 
-	list_for_each_entry(d, &r->mon_domains, hdr.list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
		if (d->hdr.id == dom_id) {
			mbm_config_write_domain(r, d, evtid, val);
			goto next;
@@ -2276,7 +2276,7 @@ static int set_cache_qos_cfg(int level, bool enable)
		return -ENOMEM;
 
	r_l = &rdt_resources_all[level].r_resctrl;
-	list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
+	list_for_each_entry(d, &r_l->domains, hdr.list) {
		if (r_l->cache.arch_has_per_cpu_cfg)
			/* Pick all the CPUs in the domain instance */
			for_each_cpu(cpu, &d->hdr.cpu_mask)
@@ -2361,7 +2361,7 @@ static int set_mba_sc(bool mba_sc)
 
	r->membw.mba_sc = mba_sc;
 
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
		for (i = 0; i < num_closid; i++)
			d->mbps_val[i] = MBA_MAX_MBPS;
	}
@@ -2700,7 +2700,7 @@ static int rdt_get_tree(struct fs_context *fc)
 
	if (is_mbm_enabled()) {
		r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
-		list_for_each_entry(dom, &r->mon_domains, hdr.list)
+		list_for_each_entry(dom, &r->domains, hdr.list)
			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
						   RESCTRL_PICK_ANY_CPU);
	}
@@ -2824,10 +2824,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
 
	/*
	 * Disable resource control for this resource by setting all
-	 * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
+	 * CBMs in all domains to the maximum mask value. Pick one CPU
	 * from each domain to update the MSRs below.
	 */
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
		hw_dom = resctrl_to_arch_dom(d);
 
		for (i = 0; i < hw_res->num_closid; i++)
@@ -3098,7 +3098,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn,
	/* Walking r->domains, ensure it can't race with cpuhp */
	lockdep_assert_cpus_held();
 
-	list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+	list_for_each_entry(dom, &r->domains, hdr.list) {
		ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp);
		if (ret)
			return ret;
@@ -3280,7 +3280,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid)
	struct rdt_domain *d;
	int ret;
 
-	list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) {
+	list_for_each_entry(d, &s->res->domains, hdr.list) {
		ret = __init_one_rdt_domain(d, s, closid);
		if (ret < 0)
			return ret;
@@ -3295,7 +3295,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid)
	struct resctrl_staged_config *cfg;
	struct rdt_domain *d;
 
-	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+	list_for_each_entry(d, &r->domains, hdr.list) {
		if (is_mba_sc(r)) {
			d->mbps_val[closid] = MBA_MAX_MBPS;
			continue;
@@ -3926,19 +3926,15 @@ static void domain_destroy_mon_state(struct rdt_domain *d)
	kfree(d->mbm_local);
 }
 
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
	mutex_lock(&rdtgroup_mutex);
 
	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
		mba_sc_domain_destroy(r, d);
 
-	mutex_unlock(&rdtgroup_mutex);
-}
-
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
-{
-	mutex_lock(&rdtgroup_mutex);
+	if (!r->mon_capable)
+		goto out_unlock;
 
	/*
	 * If resctrl is mounted, remove all the
@@ -3964,6 +3960,7 @@ void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
 
	domain_destroy_mon_state(d);
 
+out_unlock:
	mutex_unlock(&rdtgroup_mutex);
 }
 
@@ -3998,7 +3995,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
	return 0;
 }
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 {
	int err = 0;
 
@@ -4007,18 +4004,11 @@ int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d)
	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) {
		/* RDT_RESOURCE_MBA is never mon_capable */
		err = mba_sc_domain_allocate(r, d);
+		goto out_unlock;
	}
 
-	mutex_unlock(&rdtgroup_mutex);
-
-	return err;
-}
-
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d)
-{
-	int err;
-
-	mutex_lock(&rdtgroup_mutex);
+	if (!r->mon_capable)
+		goto out_unlock;
 
	err = domain_setup_mon_state(r, d);
	if (err)
@@ -4083,7 +4073,7 @@ void resctrl_offline_cpu(unsigned int cpu)
	if (!l3->mon_capable)
		goto out_unlock;
 
-	d = get_mon_domain_from_cpu(cpu, l3);
+	d = get_domain_from_cpu(cpu, l3);
	if (d) {
		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
			cancel_delayed_work(&d->mbm_over);
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 96ddf9ff3183..f63fcf17a3bc 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -58,22 +58,15 @@ struct resctrl_staged_config {
	bool			have_new_ctrl;
 };
 
-enum resctrl_domain_type {
-	RESCTRL_CTRL_DOMAIN,
-	RESCTRL_MON_DOMAIN,
-};
-
 /**
 * struct rdt_domain_hdr - common header for different domain types
 * @list:		all instances of this resource
 * @id:			unique id for this instance
- * @type:		type of this instance
 * @cpu_mask:		which CPUs share this resource
 */
 struct rdt_domain_hdr {
	struct list_head		list;
	int				id;
-	enum resctrl_domain_type	type;
	struct cpumask			cpu_mask;
 };
 
@@ -176,12 +169,10 @@ enum resctrl_scope {
 * @alloc_capable:	Is allocation available on this machine
 * @mon_capable:	Is monitor feature available on this machine
 * @num_rmid:		Number of RMIDs available
- * @ctrl_scope:		Scope of this resource for control functions
- * @mon_scope:		Scope of this resource for monitor functions
+ * @scope:		Scope of this resource
 * @cache:		Cache allocation related data
 * @membw:		If the component has bandwidth controls, their properties.
- * @ctrl_domains:	RCU list of all control domains for this resource
- * @mon_domains:	RCU list of all monitor domains for this resource
+ * @domains:		RCU list of all domains for this resource
 * @name:		Name to use in "schemata" file.
 * @data_width:		Character width of data when displaying
 * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
@@ -196,12 +187,10 @@ struct rdt_resource {
	bool			alloc_capable;
	bool			mon_capable;
	int			num_rmid;
-	enum resctrl_scope	ctrl_scope;
-	enum resctrl_scope	mon_scope;
+	enum resctrl_scope	scope;
	struct resctrl_cache	cache;
	struct resctrl_membw	membw;
-	struct list_head	ctrl_domains;
-	struct list_head	mon_domains;
+	struct list_head	domains;
	char			*name;
	int			data_width;
	u32			default_ctrl;
@@ -247,10 +236,8 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
			    u32 closid, enum resctrl_conf_type type);
 
-int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
 void resctrl_online_cpu(unsigned int cpu);
 void resctrl_offline_cpu(unsigned int cpu);
-- 
2.25.1
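Taken together with the previous revert, a resource is back to one sorted domain list. The sketch below is illustrative only — example_lookup() is a hypothetical caller, not part of the patch — and assumes the post-revert signatures shown in the hunks above:

/*
 * After this revert there is a single r->domains list, so the
 * allocation and monitoring paths find a domain the same way.
 */
static struct rdt_domain *example_lookup(struct rdt_resource *r, int cpu, int id)
{
	struct rdt_domain *d;

	/* Walk the one list, matching on the CPU mask... */
	list_for_each_entry(d, &r->domains, hdr.list) {
		if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask))
			return d;
	}

	/* ...or search by sorted domain id; NULL when the id is absent. */
	return rdt_find_domain(r, id, NULL);
}

resctrl_online_domain()/resctrl_offline_domain() likewise replace the separate ctrl/mon callbacks, with r->alloc_capable and r->mon_capable deciding which per-domain state gets set up or torn down.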

This reverts commit 7fadaff757b2fb2a7704375c3930bd1138a36187. --- arch/x86/kernel/cpu/resctrl/core.c | 26 +++++----- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 24 ++++----- arch/x86/kernel/cpu/resctrl/monitor.c | 14 +++--- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 14 +++--- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 60 +++++++++++------------ include/linux/resctrl.h | 16 ++---- 6 files changed, 73 insertions(+), 81 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 2ff05ffda328..d3cbe51e6b45 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -355,9 +355,9 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) lockdep_assert_cpus_held(); - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { /* Find the domain that contains this CPU */ - if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) + if (cpumask_test_cpu(cpu, &d->cpu_mask)) return d; } @@ -393,12 +393,12 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, struct list_head *l; list_for_each(l, &r->domains) { - d = list_entry(l, struct rdt_domain, hdr.list); + d = list_entry(l, struct rdt_domain, list); /* When id is found, return its domain. */ - if (id == d->hdr.id) + if (id == d->id) return d; /* Stop searching when finding id's position in sorted list. */ - if (id < d->hdr.id) + if (id < d->id) break; } @@ -526,7 +526,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) d = rdt_find_domain(r, id, &add_pos); if (d) { - cpumask_set_cpu(cpu, &d->hdr.cpu_mask); + cpumask_set_cpu(cpu, &d->cpu_mask); if (r->cache.arch_has_per_cpu_cfg) rdt_domain_reconfigure_cdp(r); return; @@ -537,8 +537,8 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) return; d = &hw_dom->d_resctrl; - d->hdr.id = id; - cpumask_set_cpu(cpu, &d->hdr.cpu_mask); + d->id = id; + cpumask_set_cpu(cpu, &d->cpu_mask); rdt_domain_reconfigure_cdp(r); @@ -552,11 +552,11 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) return; } - list_add_tail_rcu(&d->hdr.list, add_pos); + list_add_tail_rcu(&d->list, add_pos); err = resctrl_online_domain(r, d); if (err) { - list_del_rcu(&d->hdr.list); + list_del_rcu(&d->list); synchronize_rcu(); domain_free(hw_dom); } @@ -583,10 +583,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r) } hw_dom = resctrl_to_arch_dom(d); - cpumask_clear_cpu(cpu, &d->hdr.cpu_mask); - if (cpumask_empty(&d->hdr.cpu_mask)) { + cpumask_clear_cpu(cpu, &d->cpu_mask); + if (cpumask_empty(&d->cpu_mask)) { resctrl_offline_domain(r, d); - list_del_rcu(&d->hdr.list); + list_del_rcu(&d->list); synchronize_rcu(); /* diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 1d8261e84362..b5b8e9c5be11 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -74,7 +74,7 @@ int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, cfg = &d->staged_config[s->conf_type]; if (cfg->have_new_ctrl) { - rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id); + rdt_last_cmd_printf("Duplicate domain %d\n", d->id); return -EINVAL; } @@ -153,7 +153,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, cfg = &d->staged_config[s->conf_type]; if (cfg->have_new_ctrl) { - rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id); + rdt_last_cmd_printf("Duplicate domain %d\n", d->id); return -EINVAL; } @@ -236,8 +236,8 @@ static int parse_line(char *line, 
struct resctrl_schema *s, return -EINVAL; } dom = strim(dom); - list_for_each_entry(d, &r->domains, hdr.list) { - if (d->hdr.id == dom_id) { + list_for_each_entry(d, &r->domains, list) { + if (d->id == dom_id) { data.buf = dom; data.rdtgrp = rdtgrp; if (r->parse_ctrlval(&data, s, d)) @@ -285,7 +285,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, u32 idx = get_config_index(closid, t); struct msr_param msr_param; - if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask)) + if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask)) return -EINVAL; hw_dom->ctrl_val[idx] = cfg_val; @@ -311,7 +311,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { hw_dom = resctrl_to_arch_dom(d); msr_param.res = NULL; for (t = 0; t < CDP_NUM_TYPES; t++) { @@ -335,7 +335,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) } } if (msr_param.res) - smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1); + smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1); } return 0; @@ -455,7 +455,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo lockdep_assert_cpus_held(); seq_printf(s, "%*s:", max_name_width, schema->name); - list_for_each_entry(dom, &r->domains, hdr.list) { + list_for_each_entry(dom, &r->domains, list) { if (sep) seq_puts(s, ";"); @@ -465,7 +465,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo ctrl_val = resctrl_arch_get_config(r, dom, closid, schema->conf_type); - seq_printf(s, r->format_str, dom->hdr.id, max_data_width, + seq_printf(s, r->format_str, dom->id, max_data_width, ctrl_val); sep = true; } @@ -494,7 +494,7 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of, } else { seq_printf(s, "%s:%d=%x\n", rdtgrp->plr->s->res->name, - rdtgrp->plr->d->hdr.id, + rdtgrp->plr->d->id, rdtgrp->plr->cbm); } } else { @@ -542,7 +542,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, return; } - cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, RESCTRL_PICK_ANY_CPU); + cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU); /* * cpumask_any_housekeeping() prefers housekeeping CPUs, but @@ -551,7 +551,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, * counters on some platforms if its called in IRQ context. */ if (tick_nohz_full_cpu(cpu)) - smp_call_function_any(&d->hdr.cpu_mask, mon_event_count, rr, 1); + smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); else smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 4d7c596f9276..366f496ca3ce 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -281,7 +281,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, resctrl_arch_rmid_read_context_check(); - if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask)) + if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask)) return -EINVAL; ret = __rmid_read(rmid, eventid, &msr_val); @@ -364,7 +364,7 @@ void __check_limbo(struct rdt_domain *d, bool force_free) * CLOSID and RMID because there may be dependencies between them * on some architectures. 
*/ - trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->hdr.id, val); + trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val); } if (force_free || !rmid_dirty) { @@ -490,7 +490,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry) idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid); entry->busy = 0; - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { /* * For the first limbo RMID in the domain, * setup up the limbo worker. @@ -802,7 +802,7 @@ void cqm_handle_limbo(struct work_struct *work) __check_limbo(d, false); if (has_busy_rmid(d)) { - d->cqm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, + d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU); schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, delay); @@ -826,7 +826,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, unsigned long delay = msecs_to_jiffies(delay_ms); int cpu; - cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu); + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); dom->cqm_work_cpu = cpu; if (cpu < nr_cpu_ids) @@ -869,7 +869,7 @@ void mbm_handle_overflow(struct work_struct *work) * Re-check for housekeeping CPUs. This allows the overflow handler to * move off a nohz_full CPU quickly. */ - d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, + d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU); schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay); @@ -898,7 +898,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, */ if (!resctrl_mounted || !resctrl_arch_mon_capable()) return; - cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu); + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); dom->mbm_work_cpu = cpu; if (cpu < nr_cpu_ids) diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 86436464959c..1041c6401e9e 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -221,7 +221,7 @@ static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr) int cpu; int ret; - for_each_cpu(cpu, &plr->d->hdr.cpu_mask) { + for_each_cpu(cpu, &plr->d->cpu_mask) { pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL); if (!pm_req) { rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n"); @@ -300,7 +300,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr) return -ENODEV; /* Pick the first cpu we find that is associated with the cache. */ - plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask); + plr->cpu = cpumask_first(&plr->d->cpu_mask); if (!cpu_online(plr->cpu)) { rdt_last_cmd_printf("CPU %u associated with cache not online\n", @@ -854,10 +854,10 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) * associated with them. */ for_each_alloc_capable_rdt_resource(r) { - list_for_each_entry(d_i, &r->domains, hdr.list) { + list_for_each_entry(d_i, &r->domains, list) { if (d_i->plr) cpumask_or(cpu_with_psl, cpu_with_psl, - &d_i->hdr.cpu_mask); + &d_i->cpu_mask); } } @@ -865,7 +865,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) * Next test if new pseudo-locked region would intersect with * existing region. 
*/ - if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl)) + if (cpumask_intersects(&d->cpu_mask, cpu_with_psl)) ret = true; free_cpumask_var(cpu_with_psl); @@ -1197,7 +1197,7 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) } plr->thread_done = 0; - cpu = cpumask_first(&plr->d->hdr.cpu_mask); + cpu = cpumask_first(&plr->d->cpu_mask); if (!cpu_online(cpu)) { ret = -ENODEV; goto out; @@ -1527,7 +1527,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma) * may be scheduled elsewhere and invalidate entries in the * pseudo-locked region. */ - if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) { + if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) { mutex_unlock(&rdtgroup_mutex); return -EINVAL; } diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index b6ba77cdf0e8..50f5876a3020 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -98,7 +98,7 @@ void rdt_staged_configs_clear(void) lockdep_assert_held(&rdtgroup_mutex); for_each_alloc_capable_rdt_resource(r) { - list_for_each_entry(dom, &r->domains, hdr.list) + list_for_each_entry(dom, &r->domains, list) memset(dom->staged_config, 0, sizeof(dom->staged_config)); } } @@ -317,7 +317,7 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of, rdt_last_cmd_puts("Cache domain offline\n"); ret = -ENODEV; } else { - mask = &rdtgrp->plr->d->hdr.cpu_mask; + mask = &rdtgrp->plr->d->cpu_mask; seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n", cpumask_pr_args(mask)); @@ -1021,12 +1021,12 @@ static int rdt_bit_usage_show(struct kernfs_open_file *of, cpus_read_lock(); mutex_lock(&rdtgroup_mutex); hw_shareable = r->cache.shareable_bits; - list_for_each_entry(dom, &r->domains, hdr.list) { + list_for_each_entry(dom, &r->domains, list) { if (sep) seq_putc(seq, ';'); sw_shareable = 0; exclusive = 0; - seq_printf(seq, "%d=", dom->hdr.id); + seq_printf(seq, "%d=", dom->id); for (i = 0; i < closids_supported(); i++) { if (!closid_allocated(i)) continue; @@ -1343,7 +1343,7 @@ static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) continue; has_cache = true; - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { ctrl = resctrl_arch_get_config(r, d, closid, s->conf_type); if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { @@ -1458,7 +1458,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, return size; num_b = bitmap_weight(&cbm, r->cache.cbm_len); - ci = get_cpu_cacheinfo_level(cpumask_any(&d->hdr.cpu_mask), r->scope); + ci = get_cpu_cacheinfo_level(cpumask_any(&d->cpu_mask), r->scope); if (ci) size = ci->size / r->cache.cbm_len * num_b; @@ -1502,7 +1502,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of, size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res, rdtgrp->plr->d, rdtgrp->plr->cbm); - seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size); + seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size); } goto out; } @@ -1514,7 +1514,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of, type = schema->conf_type; sep = false; seq_printf(s, "%*s:", max_name_width, schema->name); - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { if (sep) seq_putc(s, ';'); if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { @@ -1532,7 +1532,7 @@ static int rdtgroup_size_show(struct kernfs_open_file *of, else size = 
rdtgroup_cbm_to_size(r, d, ctrl); } - seq_printf(s, "%d=%u", d->hdr.id, size); + seq_printf(s, "%d=%u", d->id, size); sep = true; } seq_putc(s, '\n'); @@ -1592,7 +1592,7 @@ static void mon_event_config_read(void *info) static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info) { - smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1); + smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1); } static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) @@ -1604,7 +1604,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid cpus_read_lock(); mutex_lock(&rdtgroup_mutex); - list_for_each_entry(dom, &r->domains, hdr.list) { + list_for_each_entry(dom, &r->domains, list) { if (sep) seq_puts(s, ";"); @@ -1612,7 +1612,7 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid mon_info.evtid = evtid; mondata_config_read(dom, &mon_info); - seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config); + seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); sep = true; } seq_puts(s, "\n"); @@ -1678,7 +1678,7 @@ static void mbm_config_write_domain(struct rdt_resource *r, * are scoped at the domain level. Writing any of these MSRs * on one CPU is observed by all the CPUs in the domain. */ - smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write, + smp_call_function_any(&d->cpu_mask, mon_event_config_write, &mon_info, 1); /* @@ -1728,8 +1728,8 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) return -EINVAL; } - list_for_each_entry(d, &r->domains, hdr.list) { - if (d->hdr.id == dom_id) { + list_for_each_entry(d, &r->domains, list) { + if (d->id == dom_id) { mbm_config_write_domain(r, d, evtid, val); goto next; } @@ -2276,14 +2276,14 @@ static int set_cache_qos_cfg(int level, bool enable) return -ENOMEM; r_l = &rdt_resources_all[level].r_resctrl; - list_for_each_entry(d, &r_l->domains, hdr.list) { + list_for_each_entry(d, &r_l->domains, list) { if (r_l->cache.arch_has_per_cpu_cfg) /* Pick all the CPUs in the domain instance */ - for_each_cpu(cpu, &d->hdr.cpu_mask) + for_each_cpu(cpu, &d->cpu_mask) cpumask_set_cpu(cpu, cpu_mask); else /* Pick one CPU from each domain instance to update MSR */ - cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask); + cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); } /* Update QOS_CFG MSR on all the CPUs in cpu_mask */ @@ -2312,7 +2312,7 @@ void rdt_domain_reconfigure_cdp(struct rdt_resource *r) static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d) { u32 num_closid = resctrl_arch_get_num_closid(r); - int cpu = cpumask_any(&d->hdr.cpu_mask); + int cpu = cpumask_any(&d->cpu_mask); int i; d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val), @@ -2361,7 +2361,7 @@ static int set_mba_sc(bool mba_sc) r->membw.mba_sc = mba_sc; - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { for (i = 0; i < num_closid; i++) d->mbps_val[i] = MBA_MAX_MBPS; } @@ -2700,7 +2700,7 @@ static int rdt_get_tree(struct fs_context *fc) if (is_mbm_enabled()) { r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; - list_for_each_entry(dom, &r->domains, hdr.list) + list_for_each_entry(dom, &r->domains, list) mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); } @@ -2827,13 +2827,13 @@ static int reset_all_ctrls(struct rdt_resource *r) * CBMs in all domains to the maximum mask value. 
Pick one CPU * from each domain to update the MSRs below. */ - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { hw_dom = resctrl_to_arch_dom(d); for (i = 0; i < hw_res->num_closid; i++) hw_dom->ctrl_val[i] = r->default_ctrl; msr_param.dom = d; - smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1); + smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1); } return 0; @@ -3031,7 +3031,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, char name[32]; int ret; - sprintf(name, "mon_%s_%02d", r->name, d->hdr.id); + sprintf(name, "mon_%s_%02d", r->name, d->id); /* create the directory */ kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); if (IS_ERR(kn)) @@ -3047,7 +3047,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, } priv.u.rid = r->rid; - priv.u.domid = d->hdr.id; + priv.u.domid = d->id; list_for_each_entry(mevt, &r->evt_list, list) { priv.u.evtid = mevt->evtid; ret = mon_addfile(kn, mevt->name, priv.priv); @@ -3098,7 +3098,7 @@ static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); - list_for_each_entry(dom, &r->domains, hdr.list) { + list_for_each_entry(dom, &r->domains, list) { ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); if (ret) return ret; @@ -3257,7 +3257,7 @@ static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s, */ tmp_cbm = cfg->new_ctrl; if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) { - rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id); + rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id); return -ENOSPC; } cfg->have_new_ctrl = true; @@ -3280,7 +3280,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) struct rdt_domain *d; int ret; - list_for_each_entry(d, &s->res->domains, hdr.list) { + list_for_each_entry(d, &s->res->domains, list) { ret = __init_one_rdt_domain(d, s, closid); if (ret < 0) return ret; @@ -3295,7 +3295,7 @@ static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) struct resctrl_staged_config *cfg; struct rdt_domain *d; - list_for_each_entry(d, &r->domains, hdr.list) { + list_for_each_entry(d, &r->domains, list) { if (is_mba_sc(r)) { d->mbps_val[closid] = MBA_MAX_MBPS; continue; @@ -3941,7 +3941,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) * per domain monitor data directories. 
*/ if (resctrl_mounted && resctrl_arch_mon_capable()) - rmdir_mondata_subdir_allrdtgrp(r, d->hdr.id); + rmdir_mondata_subdir_allrdtgrp(r, d->id); if (is_mbm_enabled()) cancel_delayed_work(&d->mbm_over); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index f63fcf17a3bc..ed693bfe474d 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -59,20 +59,10 @@ struct resctrl_staged_config { }; /** - * struct rdt_domain_hdr - common header for different domain types + * struct rdt_domain - group of CPUs sharing a resctrl resource * @list: all instances of this resource * @id: unique id for this instance * @cpu_mask: which CPUs share this resource - */ -struct rdt_domain_hdr { - struct list_head list; - int id; - struct cpumask cpu_mask; -}; - -/** - * struct rdt_domain - group of CPUs sharing a resctrl resource - * @hdr: common header for different domain types * @rmid_busy_llc: bitmap of which limbo RMIDs are above threshold * @mbm_total: saved state for MBM total bandwidth * @mbm_local: saved state for MBM local bandwidth @@ -87,7 +77,9 @@ struct rdt_domain_hdr { * by closid */ struct rdt_domain { - struct rdt_domain_hdr hdr; + struct list_head list; + int id; + struct cpumask cpu_mask; unsigned long *rmid_busy_llc; struct mbm_state *mbm_total; struct mbm_state *mbm_local; -- 2.25.1
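
With struct rdt_domain_hdr gone, call sites address the embedded fields directly again (d->id, d->cpu_mask, d->list rather than d->hdr.*). The effect on a typical walk, sketched from get_domain_from_cpu() above:

	struct rdt_domain *d;

	list_for_each_entry(d, &r->domains, list) {
		/* Find the domain that contains this CPU */
		if (cpumask_test_cpu(cpu, &d->cpu_mask))
			return d;
	}
	return NULL;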

This reverts commit 32fd77acd090e14cdcdf80f00bfb6ebc78d5c8bd. --- arch/x86/kernel/cpu/resctrl/core.c | 46 +++++++---------------- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +-- include/linux/resctrl.h | 9 +---- 5 files changed, 19 insertions(+), 49 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index d3cbe51e6b45..de2e9a91d6d1 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -68,7 +68,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_L3, .name = "L3", - .scope = RESCTRL_L3_CACHE, + .cache_level = 3, .domains = domain_init(RDT_RESOURCE_L3), .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", @@ -82,7 +82,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_L2, .name = "L2", - .scope = RESCTRL_L2_CACHE, + .cache_level = 2, .domains = domain_init(RDT_RESOURCE_L2), .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", @@ -96,7 +96,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_MBA, .name = "MB", - .scope = RESCTRL_L3_CACHE, + .cache_level = 3, .domains = domain_init(RDT_RESOURCE_MBA), .parse_ctrlval = parse_bw, .format_str = "%d=%*u", @@ -108,7 +108,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_SMBA, .name = "SMBA", - .scope = RESCTRL_L3_CACHE, + .cache_level = 3, .domains = domain_init(RDT_RESOURCE_SMBA), .parse_ctrlval = parse_bw, .format_str = "%d=%*u", @@ -392,6 +392,9 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, struct rdt_domain *d; struct list_head *l; + if (id < 0) + return ERR_PTR(-ENODEV); + list_for_each(l, &r->domains) { d = list_entry(l, struct rdt_domain, list); /* When id is found, return its domain. */ @@ -481,19 +484,6 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom) return 0; } -static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope) -{ - switch (scope) { - case RESCTRL_L2_CACHE: - case RESCTRL_L3_CACHE: - return get_cpu_cacheinfo_id(cpu, scope); - default: - break; - } - - return -EINVAL; -} - /* * domain_add_cpu - Add a cpu to a resource's domain list. 
* @@ -509,7 +499,7 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope) */ static void domain_add_cpu(int cpu, struct rdt_resource *r) { - int id = get_domain_id_from_scope(cpu, r->scope); + int id = get_cpu_cacheinfo_id(cpu, r->cache_level); struct list_head *add_pos = NULL; struct rdt_hw_domain *hw_dom; struct rdt_domain *d; @@ -517,14 +507,12 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) lockdep_assert_held(&domain_list_lock); - if (id < 0) { - pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n", - cpu, r->scope, r->name); + d = rdt_find_domain(r, id, &add_pos); + if (IS_ERR(d)) { + pr_warn("Couldn't find cache id for CPU %d\n", cpu); return; } - d = rdt_find_domain(r, id, &add_pos); - if (d) { cpumask_set_cpu(cpu, &d->cpu_mask); if (r->cache.arch_has_per_cpu_cfg) @@ -564,21 +552,15 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) static void domain_remove_cpu(int cpu, struct rdt_resource *r) { - int id = get_domain_id_from_scope(cpu, r->scope); + int id = get_cpu_cacheinfo_id(cpu, r->cache_level); struct rdt_hw_domain *hw_dom; struct rdt_domain *d; lockdep_assert_held(&domain_list_lock); - if (id < 0) { - pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n", - cpu, r->scope, r->name); - return; - } - d = rdt_find_domain(r, id, NULL); - if (!d) { - pr_warn("Couldn't find domain with id=%d for CPU %d\n", id, cpu); + if (IS_ERR_OR_NULL(d)) { + pr_warn("Couldn't find cache id for CPU %d\n", cpu); return; } hw_dom = resctrl_to_arch_dom(d); diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index b5b8e9c5be11..5891fc950f74 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -582,7 +582,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) r = &rdt_resources_all[resid].r_resctrl; d = rdt_find_domain(r, domid, NULL); - if (!d) { + if (IS_ERR_OR_NULL(d)) { ret = -ENOENT; goto out; } diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 1041c6401e9e..1e3f9d28a4b5 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -292,13 +292,9 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) */ static int pseudo_lock_region_init(struct pseudo_lock_region *plr) { - enum resctrl_scope scope = plr->s->res->scope; struct cacheinfo *ci; int ret; - if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE)) - return -ENODEV; - /* Pick the first cpu we find that is associated with the cache. 
*/ plr->cpu = cpumask_first(&plr->d->cpu_mask); @@ -309,7 +305,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr) goto out_region; } - ci = get_cpu_cacheinfo_level(plr->cpu, scope); + ci = get_cpu_cacheinfo_level(plr->cpu, plr->s->res->cache_level); if (ci) { plr->line_size = ci->coherency_line_size; plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 50f5876a3020..cb68a121dabb 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1454,11 +1454,8 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct cacheinfo *ci; int num_b; - if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE)) - return size; - num_b = bitmap_weight(&cbm, r->cache.cbm_len); - ci = get_cpu_cacheinfo_level(cpumask_any(&d->cpu_mask), r->scope); + ci = get_cpu_cacheinfo_level(cpumask_any(&d->cpu_mask), r->cache_level); if (ci) size = ci->size / r->cache.cbm_len * num_b; diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index ed693bfe474d..a365f67131ec 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -150,18 +150,13 @@ struct resctrl_membw { struct rdt_parse_data; struct resctrl_schema; -enum resctrl_scope { - RESCTRL_L2_CACHE = 2, - RESCTRL_L3_CACHE = 3, -}; - /** * struct rdt_resource - attributes of a resctrl resource * @rid: The index of the resource * @alloc_capable: Is allocation available on this machine * @mon_capable: Is monitor feature available on this machine * @num_rmid: Number of RMIDs available - * @scope: Scope of this resource + * @cache_level: Which cache level defines scope of this resource * @cache: Cache allocation related data * @membw: If the component has bandwidth controls, their properties. * @domains: RCU list of all domains for this resource @@ -179,7 +174,7 @@ struct rdt_resource { bool alloc_capable; bool mon_capable; int num_rmid; - enum resctrl_scope scope; + int cache_level; struct resctrl_cache cache; struct resctrl_membw membw; struct list_head domains; -- 2.25.1
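
Note the three-state contract this restores for rdt_find_domain(): ERR_PTR(-ENODEV) when the id is negative (the cache id lookup failed), NULL when no domain with that id exists, otherwise the domain. Callers must therefore test with IS_ERR_OR_NULL(), as rdtgroup_mondata_show() does above; a call-site sketch:

	d = rdt_find_domain(r, domid, NULL);
	if (IS_ERR_OR_NULL(d)) {	/* covers both the error and "not found" cases */
		ret = -ENOENT;
		goto out;
	}
	/* d points at a live domain from here on */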

This reverts commit 52210851652322dd063b6cfe8ab909848f4daebe. --- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 17 +++++++++++------ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 14 +++++++++----- 2 files changed, 20 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 1e3f9d28a4b5..492c8e28c4ce 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -292,8 +292,9 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) */ static int pseudo_lock_region_init(struct pseudo_lock_region *plr) { - struct cacheinfo *ci; + struct cpu_cacheinfo *ci; int ret; + int i; /* Pick the first cpu we find that is associated with the cache. */ plr->cpu = cpumask_first(&plr->d->cpu_mask); @@ -305,11 +306,15 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr) goto out_region; } - ci = get_cpu_cacheinfo_level(plr->cpu, plr->s->res->cache_level); - if (ci) { - plr->line_size = ci->coherency_line_size; - plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); - return 0; + ci = get_cpu_cacheinfo(plr->cpu); + + plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); + + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == plr->s->res->cache_level) { + plr->line_size = ci->info_list[i].coherency_line_size; + return 0; + } } ret = -1; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index cb68a121dabb..02f213f1c51c 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1450,14 +1450,18 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d, unsigned long cbm) { + struct cpu_cacheinfo *ci; unsigned int size = 0; - struct cacheinfo *ci; - int num_b; + int num_b, i; num_b = bitmap_weight(&cbm, r->cache.cbm_len); - ci = get_cpu_cacheinfo_level(cpumask_any(&d->cpu_mask), r->cache_level); - if (ci) - size = ci->size / r->cache.cbm_len * num_b; + ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask)); + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == r->cache_level) { + size = ci->info_list[i].size / r->cache.cbm_len * num_b; + break; + } + } return size; } -- 2.25.1
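
Both hunks fall back to the same open-coded pattern: fetch the CPU's full cacheinfo and scan info_list for the leaf at the resource's cache level. As a standalone sketch:

	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
	unsigned int line_size = 0;
	int i;

	for (i = 0; i < ci->num_leaves; i++) {
		if (ci->info_list[i].level == r->cache_level) {
			line_size = ci->info_list[i].coherency_line_size;
			break;
		}
	}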

This reverts commit 408f174b747ee73e2b1adb3bc57bfd2ca901924c. --- Documentation/arch/x86/resctrl.rst | 6 ------ arch/x86/kernel/cpu/resctrl/monitor.c | 11 ----------- arch/x86/kernel/cpu/resctrl/trace.h | 16 ---------------- 3 files changed, 33 deletions(-) diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index 6f07af4a885e..908861167821 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -450,12 +450,6 @@ during mkdir. max_threshold_occupancy is a user configurable value to determine the occupancy at which an RMID can be freed. -The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes -for a subset of RMID that are not immediately available for allocation. -This can't be relied on to produce output every second, it may be necessary -to attempt to create an empty monitor group to force an update. Output may -only be produced if creation of a control or monitor group fails. - Schemata files - general concepts --------------------------------- Each line in the file describes one resource. The line starts with diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 366f496ca3ce..2ce5f4913c82 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -24,7 +24,6 @@ #include <asm/resctrl.h> #include "internal.h" -#include "trace.h" /** * struct rmid_entry - dirty tracking for all RMID. @@ -355,16 +354,6 @@ void __check_limbo(struct rdt_domain *d, bool force_free) rmid_dirty = true; } else { rmid_dirty = (val >= resctrl_rmid_realloc_threshold); - - /* - * x86's CLOSID and RMID are independent numbers, so the entry's - * CLOSID is an empty CLOSID (X86_RESCTRL_EMPTY_CLOSID). On Arm the - * RMID (PMG) extends the CLOSID (PARTID) space with bits that aren't - * used to select the configuration. It is thus necessary to track both - * CLOSID and RMID because there may be dependencies between them - * on some architectures. - */ - trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val); } if (force_free || !rmid_dirty) { diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h index 2a506316b303..495fb90c8572 100644 --- a/arch/x86/kernel/cpu/resctrl/trace.h +++ b/arch/x86/kernel/cpu/resctrl/trace.h @@ -35,22 +35,6 @@ TRACE_EVENT(pseudo_lock_l3, TP_printk("hits=%llu miss=%llu", __entry->l3_hits, __entry->l3_miss)); -TRACE_EVENT(mon_llc_occupancy_limbo, - TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int domain_id, u64 llc_occupancy_bytes), - TP_ARGS(ctrl_hw_id, mon_hw_id, domain_id, llc_occupancy_bytes), - TP_STRUCT__entry(__field(u32, ctrl_hw_id) - __field(u32, mon_hw_id) - __field(int, domain_id) - __field(u64, llc_occupancy_bytes)), - TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id; - __entry->mon_hw_id = mon_hw_id; - __entry->domain_id = domain_id; - __entry->llc_occupancy_bytes = llc_occupancy_bytes;), - TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain_id=%d llc_occupancy_bytes=%llu", - __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->domain_id, - __entry->llc_occupancy_bytes) - ); - #endif /* _TRACE_RESCTRL_H */ #undef TRACE_INCLUDE_PATH -- 2.25.1

This reverts commit 4cd7986a343c53002448012f77b76c1b9faa1b69. --- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +- .../kernel/cpu/resctrl/{trace.h => pseudo_lock_event.h} | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) rename arch/x86/kernel/cpu/resctrl/{trace.h => pseudo_lock_event.h} (86%) diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 492c8e28c4ce..884b88e25141 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -31,7 +31,7 @@ #include "internal.h" #define CREATE_TRACE_POINTS -#include "trace.h" +#include "pseudo_lock_event.h" /* * The bits needed to disable hardware prefetching varies based on the diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h similarity index 86% rename from arch/x86/kernel/cpu/resctrl/trace.h rename to arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h index 495fb90c8572..428ebbd4270b 100644 --- a/arch/x86/kernel/cpu/resctrl/trace.h +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h @@ -2,8 +2,8 @@ #undef TRACE_SYSTEM #define TRACE_SYSTEM resctrl -#if !defined(_TRACE_RESCTRL_H) || defined(TRACE_HEADER_MULTI_READ) -#define _TRACE_RESCTRL_H +#if !defined(_TRACE_PSEUDO_LOCK_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_PSEUDO_LOCK_H #include <linux/tracepoint.h> @@ -35,9 +35,9 @@ TRACE_EVENT(pseudo_lock_l3, TP_printk("hits=%llu miss=%llu", __entry->l3_hits, __entry->l3_miss)); -#endif /* _TRACE_RESCTRL_H */ +#endif /* _TRACE_PSEUDO_LOCK_H */ #undef TRACE_INCLUDE_PATH #define TRACE_INCLUDE_PATH . -#define TRACE_INCLUDE_FILE trace +#define TRACE_INCLUDE_FILE pseudo_lock_event #include <trace/define_trace.h> -- 2.25.1

This reverts commit 64c3d003e756088e1bf8351fc96a294930865c31. --- arch/x86/kernel/cpu/resctrl/core.c | 40 ++++++++++++++--------- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 3 +- 3 files changed, 27 insertions(+), 18 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index de2e9a91d6d1..670519f48a1c 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -56,9 +56,14 @@ int max_name_width, max_data_width; */ bool rdt_alloc_capable; -static void mba_wrmsr_intel(struct msr_param *m); -static void cat_wrmsr(struct msr_param *m); -static void mba_wrmsr_amd(struct msr_param *m); +static void +mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m, + struct rdt_resource *r); +static void +cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r); +static void +mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, + struct rdt_resource *r); #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains) @@ -304,11 +309,12 @@ static void rdt_get_cdp_l2_config(void) rdt_get_cdp_config(RDT_RESOURCE_L2); } -static void mba_wrmsr_amd(struct msr_param *m) +static void +mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) { - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom); unsigned int i; + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); for (i = m->low; i < m->high; i++) wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]); @@ -328,22 +334,25 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r) return r->default_ctrl; } -static void mba_wrmsr_intel(struct msr_param *m) +static void +mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m, + struct rdt_resource *r) { - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom); unsigned int i; + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); /* Write the delay values for mba. 
*/ for (i = m->low; i < m->high; i++) - wrmsrl(hw_res->msr_base + i, delay_bw_map(hw_dom->ctrl_val[i], m->res)); + wrmsrl(hw_res->msr_base + i, delay_bw_map(hw_dom->ctrl_val[i], r)); } -static void cat_wrmsr(struct msr_param *m) +static void +cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) { - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(m->dom); unsigned int i; + struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); for (i = m->low; i < m->high; i++) wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]); @@ -375,7 +384,7 @@ void rdt_ctrl_update(void *arg) struct msr_param *m = arg; hw_res = resctrl_to_arch_res(m->res); - hw_res->msr_update(m); + hw_res->msr_update(m->dom, m, m->res); } /* @@ -448,11 +457,10 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d) hw_dom->ctrl_val = dc; setup_default_ctrlval(r, dc); - m.res = r; m.dom = d; m.low = 0; m.high = hw_res->num_closid; - hw_res->msr_update(&m); + hw_res->msr_update(d, &m, r); return 0; } diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 5891fc950f74..15d916449e47 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -294,7 +294,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, msr_param.dom = d; msr_param.low = idx; msr_param.high = idx + 1; - hw_res->msr_update(&msr_param); + hw_res->msr_update(d, &msr_param, r); return 0; } diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index f1d926832ec8..ab2d315f7a2e 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -445,7 +445,8 @@ struct rdt_hw_resource { struct rdt_resource r_resctrl; u32 num_closid; unsigned int msr_base; - void (*msr_update)(struct msr_param *m); + void (*msr_update) (struct rdt_domain *d, struct msr_param *m, + struct rdt_resource *r); unsigned int mon_scale; unsigned int mbm_width; unsigned int mbm_cfg_mask; -- 2.25.1
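
With this revert struct msr_param carries only the MSR index range; the domain and resource travel as explicit arguments to the msr_update hook. Caller side, condensed from domain_setup_ctrlval() above:

	struct msr_param m;

	m.low = 0;
	m.high = hw_res->num_closid;
	hw_res->msr_update(d, &m, r);	/* domain and resource passed explicitly */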

This reverts commit 8bc23b6f7bfa279ec458d914788bec2eba15e327. --- arch/x86/kernel/cpu/resctrl/core.c | 17 ++++++---- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 38 ++++++++++++++++++----- arch/x86/kernel/cpu/resctrl/internal.h | 2 -- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +++++-- 4 files changed, 52 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 670519f48a1c..4edc2b9d53c4 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -362,8 +362,6 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) { struct rdt_domain *d; - lockdep_assert_cpus_held(); - list_for_each_entry(d, &r->domains, list) { /* Find the domain that contains this CPU */ if (cpumask_test_cpu(cpu, &d->cpu_mask)) @@ -380,11 +378,19 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r) void rdt_ctrl_update(void *arg) { - struct rdt_hw_resource *hw_res; struct msr_param *m = arg; + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); + struct rdt_resource *r = m->res; + int cpu = smp_processor_id(); + struct rdt_domain *d; - hw_res = resctrl_to_arch_res(m->res); - hw_res->msr_update(m->dom, m, m->res); + d = get_domain_from_cpu(cpu, r); + if (d) { + hw_res->msr_update(d, m, r); + return; + } + pr_warn_once("cpu %d not found in any domain for resource %s\n", + cpu, r->name); } /* @@ -457,7 +463,6 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d) hw_dom->ctrl_val = dc; setup_default_ctrlval(r, dc); - m.dom = d; m.low = 0; m.high = hw_res->num_closid; hw_res->msr_update(d, &m, r); diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 15d916449e47..8bd717222f0c 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -277,6 +277,22 @@ static u32 get_config_index(u32 closid, enum resctrl_conf_type type) } } +static bool apply_config(struct rdt_hw_domain *hw_dom, + struct resctrl_staged_config *cfg, u32 idx, + cpumask_var_t cpu_mask) +{ + struct rdt_domain *dom = &hw_dom->d_resctrl; + + if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) { + cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask); + hw_dom->ctrl_val[idx] = cfg->new_ctrl; + + return true; + } + + return false; +} + int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type t, u32 cfg_val) { @@ -291,7 +307,6 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, hw_dom->ctrl_val[idx] = cfg_val; msr_param.res = r; - msr_param.dom = d; msr_param.low = idx; msr_param.high = idx + 1; hw_res->msr_update(d, &msr_param, r); @@ -305,39 +320,48 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) struct rdt_hw_domain *hw_dom; struct msr_param msr_param; enum resctrl_conf_type t; + cpumask_var_t cpu_mask; struct rdt_domain *d; u32 idx; /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) + return -ENOMEM; + + msr_param.res = NULL; list_for_each_entry(d, &r->domains, list) { hw_dom = resctrl_to_arch_dom(d); - msr_param.res = NULL; for (t = 0; t < CDP_NUM_TYPES; t++) { cfg = &hw_dom->d_resctrl.staged_config[t]; if (!cfg->have_new_ctrl) continue; idx = get_config_index(closid, t); - if (cfg->new_ctrl == hw_dom->ctrl_val[idx]) + if (!apply_config(hw_dom, cfg, idx, cpu_mask)) continue; - hw_dom->ctrl_val[idx] = cfg->new_ctrl; if (!msr_param.res) { 
msr_param.low = idx; msr_param.high = msr_param.low + 1; msr_param.res = r; - msr_param.dom = d; } else { msr_param.low = min(msr_param.low, idx); msr_param.high = max(msr_param.high, idx + 1); } } - if (msr_param.res) - smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1); } + if (cpumask_empty(cpu_mask)) + goto done; + + /* Update resource control msr on all the CPUs. */ + on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1); + +done: + free_cpumask_var(cpu_mask); + return 0; } diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index ab2d315f7a2e..1a8687f8073a 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -379,13 +379,11 @@ static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r) /** * struct msr_param - set a range of MSRs from a domain * @res: The resource to use - * @dom: The domain to update * @low: Beginning index from base MSR * @high: End index */ struct msr_param { struct rdt_resource *res; - struct rdt_domain *dom; u32 low; u32 high; }; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 02f213f1c51c..011e17efb1a6 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2813,12 +2813,16 @@ static int reset_all_ctrls(struct rdt_resource *r) struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); struct rdt_hw_domain *hw_dom; struct msr_param msr_param; + cpumask_var_t cpu_mask; struct rdt_domain *d; int i; /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) + return -ENOMEM; + msr_param.res = r; msr_param.low = 0; msr_param.high = hw_res->num_closid; @@ -2830,13 +2834,17 @@ static int reset_all_ctrls(struct rdt_resource *r) */ list_for_each_entry(d, &r->domains, list) { hw_dom = resctrl_to_arch_dom(d); + cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); for (i = 0; i < hw_res->num_closid; i++) hw_dom->ctrl_val[i] = r->default_ctrl; - msr_param.dom = d; - smp_call_function_any(&d->cpu_mask, rdt_ctrl_update, &msr_param, 1); } + /* Update CBM on all the CPUs in cpu_mask */ + on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1); + + free_cpumask_var(cpu_mask); + return 0; } -- 2.25.1
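
The restored update scheme gathers one CPU from each affected domain into a temporary cpumask and issues a single batched cross-call, instead of one smp_call_function_any() per domain; each target CPU re-derives its own domain inside rdt_ctrl_update(). The pattern, condensed from reset_all_ctrls() above:

	cpumask_var_t cpu_mask;

	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
		return -ENOMEM;

	list_for_each_entry(d, &r->domains, list)
		cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);

	/* One IPI batch covers every domain */
	on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1);
	free_cpumask_var(cpu_mask);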

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Resctrl occasionally wants to know something about a specific resource. It reaches into the arch code's rdt_resources_all. Once the filesystem parts of resctrl are moved to /fs/, this means it will need visibility of the architecture-specific structure. Move the level enum to the resctrl header and add a helper to retrieve the resctrl struct by id. resctrl_arch_get_resource() should not return NULL for any value in the enum; it may instead return a dummy resource that is !alloc_enabled && !mon_enabled. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 12 ++++++++++-- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 10 ---------- arch/x86/kernel/cpu/resctrl/monitor.c | 8 ++++---- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 +++++++-------- include/linux/resctrl.h | 17 +++++++++++++++++ 6 files changed, 39 insertions(+), 25 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 4edc2b9d53c4..afefbeda4f72 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -122,6 +122,14 @@ struct rdt_hw_resource rdt_resources_all[] = { }, }; +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) +{ + if (l >= RDT_NUM_RESOURCES) + return NULL; + + return &rdt_resources_all[l].r_resctrl; +} + /* * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs * as they do not have CPUID enumeration support for Cache allocation. @@ -169,7 +177,7 @@ static inline void cache_alloc_hsw_probe(void) bool is_mba_sc(struct rdt_resource *r) { if (!r) - return rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl.membw.mba_sc; + r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); /* * The software controller support is only applicable to MBA resource. 
@@ -979,7 +987,7 @@ late_initcall(resctrl_late_init); static void __exit resctrl_exit(void) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); cpuhp_remove_state(rdt_online); diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 8bd717222f0c..e2634a3263a6 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -604,7 +604,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) domid = md.u.domid; evtid = md.u.evtid; - r = &rdt_resources_all[resid].r_resctrl; + r = resctrl_arch_get_resource(resid); d = rdt_find_domain(r, domid, NULL); if (IS_ERR_OR_NULL(d)) { ret = -ENOENT; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 1a8687f8073a..64a86b593fb1 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -467,16 +467,6 @@ extern struct rdt_hw_resource rdt_resources_all[]; extern struct rdtgroup rdtgroup_default; extern struct dentry *debugfs_resctrl; -enum resctrl_res_level { - RDT_RESOURCE_L3, - RDT_RESOURCE_L2, - RDT_RESOURCE_MBA, - RDT_RESOURCE_SMBA, - - /* Must be the last */ - RDT_NUM_RESOURCES, -}; - static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(res); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 2ce5f4913c82..484ba10928f5 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -321,7 +321,7 @@ static void limbo_release_entry(struct rmid_entry *entry) */ void __check_limbo(struct rdt_domain *d, bool force_free) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); u32 idx_limit = resctrl_arch_system_num_rmid_idx(); struct rmid_entry *entry; u32 idx, cur_idx = 1; @@ -467,7 +467,7 @@ int alloc_rmid(u32 closid) static void add_rmid_to_limbo(struct rmid_entry *entry) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); struct rdt_domain *d; u32 idx; @@ -670,7 +670,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) if (!is_mbm_local_enabled()) return; - r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); closid = rgrp->closid; rmid = rgrp->mon.rmid; @@ -840,7 +840,7 @@ void mbm_handle_overflow(struct work_struct *work) if (!resctrl_mounted || !resctrl_arch_mon_capable()) goto out_unlock; - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + r = resctrl_arch_get_resource(RDT_RESOURCE_L3); d = container_of(work, struct rdt_domain, mbm_over.work); list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 011e17efb1a6..4bb4a9e41534 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2252,7 +2252,7 @@ static void l2_qos_cfg_update(void *arg) static inline bool is_mba_linear(void) { - return rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl.membw.delay_linear; + return resctrl_arch_get_resource(RDT_RESOURCE_MBA)->membw.delay_linear; } static int set_cache_qos_cfg(int level, bool enable) @@ -2340,7 +2340,7 @@ static 
void mba_sc_domain_destroy(struct rdt_resource *r, */ static bool supports_mba_mbps(void) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); return (is_mbm_local_enabled() && r->alloc_capable && is_mba_linear()); @@ -2352,7 +2352,7 @@ static bool supports_mba_mbps(void) */ static int set_mba_sc(bool mba_sc) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); u32 num_closid = resctrl_arch_get_num_closid(r); struct rdt_domain *d; int i; @@ -2624,10 +2624,10 @@ static void schemata_list_destroy(void) static int rdt_get_tree(struct fs_context *fc) { + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); struct rdt_fs_context *ctx = rdt_fc2context(fc); unsigned long flags = RFTYPE_CTRL_BASE; struct rdt_domain *dom; - struct rdt_resource *r; int ret; cpus_read_lock(); @@ -2700,8 +2700,7 @@ static int rdt_get_tree(struct fs_context *fc) resctrl_mounted = true; if (is_mbm_enabled()) { - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; - list_for_each_entry(dom, &r->domains, list) + list_for_each_entry(dom, &l3->domains, list) mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); } @@ -3877,7 +3876,7 @@ static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) seq_puts(seq, ",cdpl2"); - if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl)) + if (is_mba_sc(resctrl_arch_get_resource(RDT_RESOURCE_MBA))) seq_puts(seq, ",mba_MBps"); if (resctrl_debug) @@ -4067,7 +4066,7 @@ static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) void resctrl_offline_cpu(unsigned int cpu) { - struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); struct rdtgroup *rdtgrp; struct rdt_domain *d; diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index a365f67131ec..168cc9510069 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -36,6 +36,16 @@ enum resctrl_conf_type { CDP_DATA, }; +enum resctrl_res_level { + RDT_RESOURCE_L3, + RDT_RESOURCE_L2, + RDT_RESOURCE_MBA, + RDT_RESOURCE_SMBA, + + /* Must be the last */ + RDT_NUM_RESOURCES, +}; + #define CDP_NUM_TYPES (CDP_DATA + 1) /* @@ -190,6 +200,13 @@ struct rdt_resource { bool cdp_capable; }; +/* + * Get the resource that exists at this level. If the level is not supported + * a dummy/not-capable resource can be returned. Levels >= RDT_NUM_RESOURCES + * will return NULL. + */ +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l); + /** * struct resctrl_schema - configuration abilities of a resource presented to * user-space -- 2.25.1
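
The substitution this helper enables at every filesystem-side call site, sketched:

	/* Before: reaching into the arch-private array */
	r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

	/*
	 * After: via the helper. Never NULL for in-range enum values, though
	 * the result may be a dummy resource (!alloc_capable && !mon_capable),
	 * so capability checks still apply.
	 */
	r = resctrl_arch_get_resource(RDT_RESOURCE_L3);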

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- How the ctrlval strings from user-space are parsed is determined by a function pointer that the arch code supplies. The strings are part of resctrl's ABI, and the functions and their caller both live in the same file. Exporting the parsing functions and allowing the architecture to choose allows another architecture to get this wrong. Keep this all in the filesystem parts of resctrl. This should prevent any architecture's string-parsing behaviour from varying without core code changes. Use the fflags to spot caches and bandwidth resources, and use the appropriate helper. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 4 ---- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 +++++++++++++++++++---- arch/x86/kernel/cpu/resctrl/internal.h | 10 -------- include/linux/resctrl.h | 7 ------ 4 files changed, 23 insertions(+), 26 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index afefbeda4f72..5d3efd5bc50d 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -75,7 +75,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "L3", .cache_level = 3, .domains = domain_init(RDT_RESOURCE_L3), - .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, @@ -89,7 +88,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "L2", .cache_level = 2, .domains = domain_init(RDT_RESOURCE_L2), - .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, @@ -103,7 +101,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "MB", .cache_level = 3, .domains = domain_init(RDT_RESOURCE_MBA), - .parse_ctrlval = parse_bw, .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, }, @@ -115,7 +112,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .name = "SMBA", .cache_level = 3, .domains = domain_init(RDT_RESOURCE_SMBA), - .parse_ctrlval = parse_bw, .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, }, diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index e2634a3263a6..077b4cbaa074 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -23,6 +23,15 @@ #include "internal.h" +struct rdt_parse_data { + struct rdtgroup *rdtgrp; + char *buf; +}; + +typedef int (ctrlval_parser_t)(struct rdt_parse_data *data, + struct resctrl_schema *s, + struct rdt_domain *d); + /* * Check whether MBA bandwidth percentage value is correct. The value is * checked against the minimum and max bandwidth values specified by the @@ -64,8 +73,8 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) return true; } -int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) +static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) { struct resctrl_staged_config *cfg; u32 closid = data->rdtgrp->closid; @@ -143,8 +152,8 @@ static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) * Read one cache bit mask (hex). Check that it is valid for the current * resource type. 
*/ -int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) +static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) { struct rdtgroup *rdtgrp = data->rdtgrp; struct resctrl_staged_config *cfg; @@ -200,6 +209,14 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, return 0; } +static ctrlval_parser_t *get_parser(struct rdt_resource *res) +{ + if (res->fflags & RFTYPE_RES_CACHE) + return &parse_cbm; + else + return &parse_bw; +} + /* * For each domain in this resource we expect to find a series of: * id=mask @@ -209,6 +226,7 @@ int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, static int parse_line(char *line, struct resctrl_schema *s, struct rdtgroup *rdtgrp) { + ctrlval_parser_t *parse_ctrlval = get_parser(s->res); enum resctrl_conf_type t = s->conf_type; struct resctrl_staged_config *cfg; struct rdt_resource *r = s->res; @@ -240,7 +258,7 @@ static int parse_line(char *line, struct resctrl_schema *s, if (d->id == dom_id) { data.buf = dom; data.rdtgrp = rdtgrp; - if (r->parse_ctrlval(&data, s, d)) + if (parse_ctrlval(&data, s, d)) return -EINVAL; if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { cfg = &d->staged_config[t]; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 64a86b593fb1..3449d7db653d 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -414,11 +414,6 @@ static inline bool is_mbm_event(int e) e <= QOS_L3_MBM_LOCAL_EVENT_ID); } -struct rdt_parse_data { - struct rdtgroup *rdtgrp; - char *buf; -}; - /** * struct rdt_hw_resource - arch private attributes of a resctrl resource * @r_resctrl: Attributes of the resource used directly by resctrl. @@ -456,11 +451,6 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r return container_of(r, struct rdt_hw_resource, r_resctrl); } -int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d); -int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d); - extern struct mutex rdtgroup_mutex; extern struct rdt_hw_resource rdt_resources_all[]; diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 168cc9510069..6e87bc95f5ea 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -157,9 +157,6 @@ struct resctrl_membw { u32 *mb_map; }; -struct rdt_parse_data; -struct resctrl_schema; - /** * struct rdt_resource - attributes of a resctrl resource * @rid: The index of the resource @@ -174,7 +171,6 @@ struct resctrl_schema; * @data_width: Character width of data when displaying * @default_ctrl: Specifies default cache cbm or memory B/W percent. * @format_str: Per resource format string to show domain value - * @parse_ctrlval: Per resource function pointer to parse control values * @evt_list: List of monitoring events * @fflags: flags to choose base and info files * @cdp_capable: Is the CDP feature available on this resource @@ -192,9 +188,6 @@ struct rdt_resource { int data_width; u32 default_ctrl; const char *format_str; - int (*parse_ctrlval)(struct rdt_parse_data *data, - struct resctrl_schema *s, - struct rdt_domain *d); struct list_head evt_list; unsigned long fflags; bool cdp_capable; -- 2.25.1
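With the parse_ctrlval pointer gone, the core code picks a parser purely from the fflags, so a new architecture only has to describe each resource correctly. A minimal sketch of what an arch resource definition reduces to (field values illustrative, not a complete initialiser):

  static struct rdt_resource example_l2 = {
      .rid        = RDT_RESOURCE_L2,
      .name       = "L2",
      .fflags     = RFTYPE_RES_CACHE,   /* core code selects parse_cbm */
      .format_str = "%d=%0*x",
  };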

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- rdtgroup_rmdir_ctrl() and rdtgroup_rmdir_mon() set the per-cpu pqr_state for CPUs that were part of the rmdir'd group. Another architecture might not have a 'pqr_state'; its hardware may need the values in a different format. MPAM's equivalent of RMID values are not unique, and always need the CLOSID to be provided too. As rdtgroup_rmdir_mon() is the only caller that modifies just the RMID, provide a single helper that sets both values. This allows another architecture to convert the values into the format the hardware expects. As these values are read by __resctrl_sched_in(), but may be written by a different CPU without any locking, add READ/WRITE_ONCE() to avoid torn values. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 14 +++++++++++--- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 13 ++++++++----- 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 12dbd2588ca7..f61382258743 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -4,8 +4,9 @@ #ifdef CONFIG_X86_CPU_RESCTRL -#include <linux/sched.h> #include <linux/jump_label.h> +#include <linux/percpu.h> +#include <linux/sched.h> /* * This value can never be a valid CLOSID, and is used when mapping a @@ -96,8 +97,8 @@ static inline void resctrl_arch_disable_mon(void) static inline void __resctrl_sched_in(struct task_struct *tsk) { struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state); - u32 closid = state->default_closid; - u32 rmid = state->default_rmid; + u32 closid = READ_ONCE(state->default_closid); + u32 rmid = READ_ONCE(state->default_rmid); u32 tmp; /* @@ -132,6 +133,13 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) return val * scale; } +static inline void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, + u32 rmid) +{ + WRITE_ONCE(per_cpu(pqr_state.default_closid, cpu), closid); + WRITE_ONCE(per_cpu(pqr_state.default_rmid, cpu), rmid); +} + static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) { diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 4bb4a9e41534..8a2ea90e8cdf 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -3629,7 +3629,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) /* Update per cpu rmid of the moved CPUs first */ for_each_cpu(cpu, &rdtgrp->cpu_mask) - per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid; + resctrl_arch_set_cpu_default_closid_rmid(cpu, rdtgrp->closid, + prdtgrp->mon.rmid); + /* * Update the MSR on moved CPUs and CPUs which have moved * task running on them.
@@ -3662,6 +3664,7 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp) static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) { + u32 closid, rmid; int cpu; /* Give any tasks back to the default group */ @@ -3672,10 +3675,10 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); /* Update per cpu closid and rmid of the moved CPUs first */ - for_each_cpu(cpu, &rdtgrp->cpu_mask) { - per_cpu(pqr_state.default_closid, cpu) = rdtgroup_default.closid; - per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid; - } + closid = rdtgroup_default.closid; + rmid = rdtgroup_default.mon.rmid; + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, closid, rmid); /* * Update the MSR on moved CPUs and CPUs which have moved -- 2.25.1
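The READ_ONCE()/WRITE_ONCE() pairing is what makes the lockless cross-CPU write safe against __resctrl_sched_in(). It also shows why the helper takes both values: an architecture is free to fold them into whatever format its hardware wants. A hedged sketch of such an implementation (the per-cpu variable and the field layout are illustrative, not MPAM's real register format):

  /* Illustrative only: pack both defaults into one per-cpu word. */
  DEFINE_PER_CPU(u64, example_default_reg);

  static inline void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid,
                                                              u32 rmid)
  {
      u64 val = ((u64)rmid << 32) | closid;

      WRITE_ONCE(per_cpu(example_default_reg, cpu), val);
  }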

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- update_cpu_closid_rmid() takes a struct rdtgroup as an argument, which it uses to update the local CPU's default pqr values. This is a problem once the resctrl parts move out to /fs/, as the arch code cannot poke around inside struct rdtgroup. Add a helper, resctrl_arch_sync_cpu_defaults(), to abstract the cross-calling, and pass the effective CLOSID and RMID in a new struct. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 19 +++++++++++++++---- include/linux/resctrl.h | 11 +++++++++++ 2 files changed, 26 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 8a2ea90e8cdf..6526e8e15969 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -340,13 +340,13 @@ static int rdtgroup_cpus_show(struct kernfs_open_file *of, * from update_closid_rmid() is protected against __switch_to() because * preemption is disabled. */ -static void update_cpu_closid_rmid(void *info) +void resctrl_arch_sync_cpu_defaults(void *info) { - struct rdtgroup *r = info; + struct resctrl_cpu_sync *r = info; if (r) { this_cpu_write(pqr_state.default_closid, r->closid); - this_cpu_write(pqr_state.default_rmid, r->mon.rmid); + this_cpu_write(pqr_state.default_rmid, r->rmid); } /* @@ -361,11 +361,22 @@ static void update_cpu_closid_rmid(void *info) * Update the PGR_ASSOC MSR on all cpus in @cpu_mask, * * Per task closids/rmids must have been set up before calling this function. + * @r may be NULL. */ static void update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) { - on_each_cpu_mask(cpu_mask, update_cpu_closid_rmid, r, 1); + struct resctrl_cpu_sync defaults; + struct resctrl_cpu_sync *defaults_p = NULL; + + if (r) { + defaults.closid = r->closid; + defaults.rmid = r->mon.rmid; + defaults_p = &defaults; + } + + on_each_cpu_mask(cpu_mask, resctrl_arch_sync_cpu_defaults, defaults_p, + 1); } static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 6e87bc95f5ea..2b79e4159507 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -220,6 +220,17 @@ struct resctrl_schema { u32 num_closid; }; +struct resctrl_cpu_sync { + u32 closid; + u32 rmid; +}; + +/* + * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to + * struct resctrl_cpu_sync, or NULL. + */ +void resctrl_arch_sync_cpu_defaults(void *info); + /* The number of closid supported by this resource regardless of CDP */ u32 resctrl_arch_get_num_closid(struct rdt_resource *r); int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid); -- 2.25.1
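Because struct resctrl_cpu_sync carries the effective values, the IPI handler never needs to look inside struct rdtgroup. A hedged sketch of what a non-x86 implementation of the callback could look like (the resync hook at the end is an illustrative name, not a real symbol):

  void resctrl_arch_sync_cpu_defaults(void *info)
  {
      struct resctrl_cpu_sync *r = info;

      if (r)
          resctrl_arch_set_cpu_default_closid_rmid(smp_processor_id(),
                                                   r->closid, r->rmid);

      /* Re-program the hardware with current's (possibly new) values. */
      example_resync_current_task();    /* illustrative */
  }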

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- rdtgroup_init() needs exporting so that arch code can call it once it lives in core code. As this is one of the few functions we export, rename it to have the filesystem's name. x86's arch code init functions for RDT need renaming (again) to avoid duplicate definitions. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 10 +++++----- arch/x86/kernel/cpu/resctrl/internal.h | 3 --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 8 ++++---- include/linux/resctrl.h | 3 +++ 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 5d3efd5bc50d..a913fdbdd962 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -938,7 +938,7 @@ void resctrl_cpu_detect(struct cpuinfo_x86 *c) } } -static int __init resctrl_late_init(void) +static int __init resctrl_arch_late_init(void) { struct rdt_resource *r; int state, ret; @@ -963,7 +963,7 @@ static int __init resctrl_late_init(void) if (state < 0) return state; - ret = rdtgroup_init(); + ret = resctrl_init(); if (ret) { cpuhp_remove_state(state); return ret; @@ -979,9 +979,9 @@ static int __init resctrl_late_init(void) return 0; } -late_initcall(resctrl_late_init); +late_initcall(resctrl_arch_late_init); -static void __exit resctrl_exit(void) +static void __exit resctrl_arch_exit(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); @@ -993,4 +993,4 @@ static void __exit resctrl_exit(void) rdt_put_mon_l3_config(); } -__exitcall(resctrl_exit); +__exitcall(resctrl_arch_exit); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 3449d7db653d..05845ae6739e 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -301,9 +301,6 @@ extern struct list_head rdt_all_groups; extern int max_name_width, max_data_width; -int __init rdtgroup_init(void); -void __exit rdtgroup_exit(void); - /** * struct rftype - describe each file in the resctrl file system * @name: File name diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 6526e8e15969..9b25c633e247 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4113,14 +4113,14 @@ void resctrl_offline_cpu(unsigned int cpu) } /* - * rdtgroup_init - rdtgroup initialization + * resctrl_init - resctrl filesystem initialization * * Setup resctrl file system including set up root, create mount point, - * register rdtgroup filesystem, and initialize files under root directory. + * register resctrl filesystem, and initialize files under root directory.
* * Return: 0 on success or -errno */ -int __init rdtgroup_init(void) +int __init resctrl_init(void) { int ret = 0; @@ -4168,7 +4168,7 @@ int __init rdtgroup_init(void) return ret; } -void __exit rdtgroup_exit(void) +void __exit resctrl_exit(void) { debugfs_remove_recursive(debugfs_resctrl); unregister_filesystem(&rdt_fs_type); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 2b79e4159507..f6a4b75f8122 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -325,4 +325,7 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d); extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit; +int __init resctrl_init(void); +void __exit resctrl_exit(void); + #endif /* _RESCTRL_H */ -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- rdt_find_domain() finds a domain given a resource and a cache-id. It's not quite right for the resctrl arch API as it also returns the position to insert a new domain. Wrap rdt_find_domain() in another function, resctrl_arch_find_domain(), so the unnecessary argument is avoided outside the arch code. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 9 +++++++-- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 2 -- include/linux/resctrl.h | 2 ++ 4 files changed, 10 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index a913fdbdd962..0a5f538495cb 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -405,8 +405,8 @@ void rdt_ctrl_update(void *arg) * caller, return the first domain whose id is bigger than the input id. * The domain list is sorted by id in ascending order. */ -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, - struct list_head **pos) +static struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, + struct list_head **pos) { struct rdt_domain *d; struct list_head *l; @@ -430,6 +430,11 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, return NULL; } +struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id) +{ + return rdt_find_domain(r, id, NULL); +} + static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 077b4cbaa074..ad9e5516e607 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -623,7 +623,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg) evtid = md.u.evtid; r = resctrl_arch_get_resource(resid); - d = rdt_find_domain(r, domid, NULL); + d = resctrl_arch_find_domain(r, domid); if (IS_ERR_OR_NULL(d)) { ret = -ENOENT; goto out; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 05845ae6739e..74a8b0a86206 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -534,8 +534,6 @@ void rdtgroup_kn_unlock(struct kernfs_node *kn); int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, umode_t mask); -struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, - struct list_head **pos); ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off); int rdtgroup_schemata_show(struct kernfs_open_file *of, diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index f6a4b75f8122..c5fcbb524136 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -233,6 +233,8 @@ void resctrl_arch_sync_cpu_defaults(void *info); /* The number of closid supported by this resource regardless of CDP */ u32 resctrl_arch_get_num_closid(struct rdt_resource *r); + +struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id); int resctrl_arch_update_domains(struct rdt_resource *r, 
u32 closid); /* -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- To avoid sticky problems in the mpam glue code, move the resctrl enums into a separate header. This lets the arch code declare prototypes that use these enums without creating a loop via asm<->linux resctrl.h The same logic applies to the monitor-configuration defines, move these too. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- MAINTAINERS | 1 + arch/x86/kernel/cpu/resctrl/internal.h | 24 --------- include/linux/resctrl.h | 35 +------------ include/linux/resctrl_types.h | 68 ++++++++++++++++++++++++++ 4 files changed, 70 insertions(+), 58 deletions(-) create mode 100644 include/linux/resctrl_types.h diff --git a/MAINTAINERS b/MAINTAINERS index ae7060868aa8..c47e200a439a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18077,6 +18077,7 @@ S: Supported F: Documentation/arch/x86/resctrl* F: arch/x86/include/asm/resctrl.h F: arch/x86/kernel/cpu/resctrl/ +F: include/linux/resctrl*.h F: tools/testing/selftests/resctrl/ READ-COPY UPDATE (RCU) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 74a8b0a86206..d79491769bcf 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -32,30 +32,6 @@ */ #define MBM_CNTR_WIDTH_OFFSET_MAX (62 - MBM_CNTR_WIDTH_BASE) -/* Reads to Local DRAM Memory */ -#define READS_TO_LOCAL_MEM BIT(0) - -/* Reads to Remote DRAM Memory */ -#define READS_TO_REMOTE_MEM BIT(1) - -/* Non-Temporal Writes to Local Memory */ -#define NON_TEMP_WRITE_TO_LOCAL_MEM BIT(2) - -/* Non-Temporal Writes to Remote Memory */ -#define NON_TEMP_WRITE_TO_REMOTE_MEM BIT(3) - -/* Reads to Local Memory the system identifies as "Slow Memory" */ -#define READS_TO_LOCAL_S_MEM BIT(4) - -/* Reads to Remote Memory the system identifies as "Slow Memory" */ -#define READS_TO_REMOTE_S_MEM BIT(5) - -/* Dirty Victims to All Types of Memory */ -#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6) - -/* Max event bits supported */ -#define MAX_EVT_CONFIG_BITS GENMASK(6, 0) - /** * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that * aren't marked nohz_full diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index c5fcbb524136..b0ee7256e095 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -5,6 +5,7 @@ #include <linux/kernel.h> #include <linux/list.h> #include <linux/pid.h> +#include <linux/resctrl_types.h> /* CLOSID, RMID value used by the default control group */ #define RESCTRL_RESERVED_CLOSID 0 @@ -24,40 +25,6 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX -/** - * enum resctrl_conf_type - The type of configuration. - * @CDP_NONE: No prioritisation, both code and data are controlled or monitored. - * @CDP_CODE: Configuration applies to instruction fetches. - * @CDP_DATA: Configuration applies to reads and writes. - */ -enum resctrl_conf_type { - CDP_NONE, - CDP_CODE, - CDP_DATA, -}; - -enum resctrl_res_level { - RDT_RESOURCE_L3, - RDT_RESOURCE_L2, - RDT_RESOURCE_MBA, - RDT_RESOURCE_SMBA, - - /* Must be the last */ - RDT_NUM_RESOURCES, -}; - -#define CDP_NUM_TYPES (CDP_DATA + 1) - -/* - * Event IDs, the values match those used to program IA32_QM_EVTSEL before - * reading IA32_QM_CTR on RDT systems. 
- */ -enum resctrl_event_id { - QOS_L3_OCCUP_EVENT_ID = 0x01, - QOS_L3_MBM_TOTAL_EVENT_ID = 0x02, - QOS_L3_MBM_LOCAL_EVENT_ID = 0x03, -}; - /** * struct resctrl_staged_config - parsed configuration to be applied * @new_ctrl: new ctrl value to be loaded diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h new file mode 100644 index 000000000000..06e0f5567558 --- /dev/null +++ b/include/linux/resctrl_types.h @@ -0,0 +1,68 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2023 Arm Ltd. + * Based on arch/x86/kernel/cpu/resctrl/internal.h + */ + +#ifndef __LINUX_RESCTRL_TYPES_H +#define __LINUX_RESCTRL_TYPES_H + +/* Reads to Local DRAM Memory */ +#define READS_TO_LOCAL_MEM BIT(0) + +/* Reads to Remote DRAM Memory */ +#define READS_TO_REMOTE_MEM BIT(1) + +/* Non-Temporal Writes to Local Memory */ +#define NON_TEMP_WRITE_TO_LOCAL_MEM BIT(2) + +/* Non-Temporal Writes to Remote Memory */ +#define NON_TEMP_WRITE_TO_REMOTE_MEM BIT(3) + +/* Reads to Local Memory the system identifies as "Slow Memory" */ +#define READS_TO_LOCAL_S_MEM BIT(4) + +/* Reads to Remote Memory the system identifies as "Slow Memory" */ +#define READS_TO_REMOTE_S_MEM BIT(5) + +/* Dirty Victims to All Types of Memory */ +#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6) + +/* Max event bits supported */ +#define MAX_EVT_CONFIG_BITS GENMASK(6, 0) + +/** + * enum resctrl_conf_type - The type of configuration. + * @CDP_NONE: No prioritisation, both code and data are controlled or monitored. + * @CDP_CODE: Configuration applies to instruction fetches. + * @CDP_DATA: Configuration applies to reads and writes. + */ +enum resctrl_conf_type { + CDP_NONE, + CDP_CODE, + CDP_DATA, +}; + +enum resctrl_res_level { + RDT_RESOURCE_L3, + RDT_RESOURCE_L2, + RDT_RESOURCE_MBA, + RDT_RESOURCE_SMBA, + + /* Must be the last */ + RDT_NUM_RESOURCES, +}; + +#define CDP_NUM_TYPES (CDP_DATA + 1) + +/* + * Event IDs, the values match those used to program IA32_QM_EVTSEL before + * reading IA32_QM_CTR on RDT systems. + */ +enum resctrl_event_id { + QOS_L3_OCCUP_EVENT_ID = 0x01, + QOS_L3_MBM_TOTAL_EVENT_ID = 0x02, + QOS_L3_MBM_LOCAL_EVENT_ID = 0x03, +}; + +#endif /* __LINUX_RESCTRL_TYPES_H */ -- 2.25.1
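The pay-off is in the include graph: an arch header can now take the shared enums without pulling in linux/resctrl.h, which would loop back into the arch definitions. An illustrative fragment (the prototype is an example, not code added by this patch):

  /* In a hypothetical asm/resctrl.h: */
  #include <linux/resctrl_types.h>

  bool example_arch_event_supported(enum resctrl_event_id evt);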

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- On umount(), resctrl resets each resource back to its default configuration. It only ever does this for all resources, move the whole lot behind a single helper. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 2 ++ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 16 +++++++++++----- 2 files changed, 13 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index f61382258743..5f6a5375bb4a 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -15,6 +15,8 @@ */ #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0) +void resctrl_arch_reset_resources(void); + /** * struct resctrl_pqr_state - State cache for the PQR MSR * @cur_rmid: The cached Resource Monitoring ID diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 9b25c633e247..93f3126318e7 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2858,6 +2858,14 @@ static int reset_all_ctrls(struct rdt_resource *r) return 0; } +void resctrl_arch_reset_resources(void) +{ + struct rdt_resource *r; + + for_each_capable_rdt_resource(r) + reset_all_ctrls(r); +} + /* * Move tasks from one to the other group. If @from is NULL, then all tasks * in the systems are moved unconditionally (used for teardown). @@ -2967,16 +2975,14 @@ static void rmdir_all_sub(void) static void rdt_kill_sb(struct super_block *sb) { - struct rdt_resource *r; - cpus_read_lock(); mutex_lock(&rdtgroup_mutex); rdt_disable_ctx(); - /*Put everything back to default values. */ - for_each_alloc_capable_rdt_resource(r) - reset_all_ctrls(r); + /* Put everything back to default values. */ + resctrl_arch_reset_resources(); + rmdir_all_sub(); rdt_pseudo_lock_release(); rdtgroup_default.mode = RDT_MODE_SHAREABLE; -- 2.25.1
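Behind this interface another architecture can reset by whatever means its hardware provides. A hedged sketch of an MPAM-flavoured implementation (mpam_reset_class() and res_to_class() are illustrative names, not symbols introduced at this point in the series):

  void resctrl_arch_reset_resources(void)
  {
      enum resctrl_res_level i;
      struct rdt_resource *r;

      for (i = 0; i < RDT_NUM_RESOURCES; i++) {
          r = resctrl_arch_get_resource(i);
          if (r && r->alloc_capable)
              mpam_reset_class(res_to_class(r));    /* illustrative */
      }
  }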

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- rdt_get_mon_l3_config() is called via the architecture's resctrl_late_init(), and initialises both architecture-specific fields, such as hw_res->mon_scale, and resctrl filesystem fields, by calling dom_data_init(). To separate the filesystem and architecture parts of resctrl, this function needs splitting up. Add resctrl_mon_resource_init() to do the filesystem-specific work, and call it from resctrl_init(). Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/internal.h | 1 + arch/x86/kernel/cpu/resctrl/monitor.c | 22 +++++++++++++++------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 ++++ 3 files changed, 20 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index d79491769bcf..d2da875de4f5 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -541,6 +541,7 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg); void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, struct rdt_domain *d, struct rdtgroup *rdtgrp, int evtid, int first); +int resctrl_mon_resource_init(void); void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, int exclude_cpu); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 484ba10928f5..2f38a488689b 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -1004,12 +1004,26 @@ static void l3_mon_evt_init(struct rdt_resource *r) list_add_tail(&mbm_local_event.list, &r->evt_list); } +int resctrl_mon_resource_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + int ret; + + ret = dom_data_init(r); + if (ret) + return ret; + + if (r->mon_capable) + l3_mon_evt_init(r); + + return 0; +} + int __init rdt_get_mon_l3_config(struct rdt_resource *r) { unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset; struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); unsigned int threshold; - int ret; resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024; hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale; @@ -1037,10 +1051,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) */ resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold); - ret = dom_data_init(r); - if (ret) - return ret; - if (rdt_cpu_has(X86_FEATURE_BMEC)) { u32 eax, ebx, ecx, edx; @@ -1058,8 +1068,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) } } - l3_mon_evt_init(r); - r->mon_capable = true; return 0; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 93f3126318e7..18ea66eefd36 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4135,6 +4135,10 @@ int __init resctrl_init(void) rdtgroup_setup_default(); + ret = resctrl_mon_resource_init(); + if (ret) + return ret; + ret = sysfs_create_mount_point(fs_kobj, "resctrl"); if (ret) return ret; -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- rdt_put_mon_l3_config() is called via the architecture's resctrl_arch_exit() call, and appears to free the rmid_ptrs[] and closid_num_dirty_rmid[] arrays. In reality this code is marked __exit, and is removed by the linker as resctrl can't be built as a module. MPAM can make use of this code from its error interrupt handler; a later patch drops all the __init/__exit annotations. To separate the filesystem and architecture parts of resctrl, this free()ing work needs to be triggered by the filesystem. Rename rdt_put_mon_l3_config() to resctrl_mon_resource_exit() and call it from resctrl_exit(). Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 7 +------ arch/x86/kernel/cpu/resctrl/internal.h | 2 +- arch/x86/kernel/cpu/resctrl/monitor.c | 7 ++++++- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 ++ 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 0a5f538495cb..2d1af875e1c4 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -988,14 +988,9 @@ late_initcall(resctrl_arch_late_init); static void __exit resctrl_arch_exit(void) { - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - cpuhp_remove_state(rdt_online); - rdtgroup_exit(); - - if (r->mon_capable) - rdt_put_mon_l3_config(); + resctrl_exit(); } __exitcall(resctrl_arch_exit); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index d2da875de4f5..e5ef68402516 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -534,7 +534,7 @@ void closid_free(int closid); int alloc_rmid(u32 closid); void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); -void __exit rdt_put_mon_l3_config(void); +void __exit resctrl_mon_resource_exit(void); bool __init rdt_cpu_has(int flag); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 2f38a488689b..47aeca2769aa 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -1073,8 +1073,13 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) return 0; } -void __exit rdt_put_mon_l3_config(void) +void __exit resctrl_mon_resource_exit(void) { + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + + if (!r->mon_capable) + return; + dom_data_exit(); } diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 18ea66eefd36..ec8ef2ecf80d 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4183,4 +4183,6 @@ void __exit resctrl_exit(void) debugfs_remove_recursive(debugfs_resctrl); unregister_filesystem(&rdt_fs_type); sysfs_remove_mount_point(fs_kobj, "resctrl"); + + resctrl_mon_resource_exit(); } -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- max_name_width and max_data_width are used to pad the strings in the resctrl schemata file. This should be part of the fs code as it influences the user-space interface, but currently max_data_width is generated by the arch init code. max_name_width is already managed by schemata_list_add(). Move the variables and max_data_width's initialisation code to rdtgroup.c. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 22 ---------------------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 ++++++++++++ 2 files changed, 12 insertions(+), 22 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 2d1af875e1c4..689649ad48ff 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -44,12 +44,6 @@ static DEFINE_MUTEX(domain_list_lock); */ DEFINE_PER_CPU(struct resctrl_pqr_state, pqr_state); -/* - * Used to store the max resource name width and max resource data width - * to display the schemata in a tabular format - */ -int max_name_width, max_data_width; - /* * Global boolean for rdt_alloc which is true if any * resource allocation is enabled. @@ -648,20 +642,6 @@ static int resctrl_arch_offline_cpu(unsigned int cpu) return 0; } -/* - * Choose a width for the resource name and resource data based on the - * resource that has widest name and cbm. - */ -static __init void rdt_init_padding(void) -{ - struct rdt_resource *r; - - for_each_alloc_capable_rdt_resource(r) { - if (r->data_width > max_data_width) - max_data_width = r->data_width; - } -} - enum { RDT_FLAG_CMT, RDT_FLAG_MBM_TOTAL, @@ -959,8 +939,6 @@ static int __init resctrl_arch_late_init(void) if (!get_rdt_resources()) return -ENODEV; - rdt_init_padding(); - state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/resctrl/cat:online:", resctrl_arch_online_cpu, diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index ec8ef2ecf80d..9113c0a8b2cf 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -58,6 +58,12 @@ static struct kernfs_node *kn_mongrp; /* Kernel fs node for "mon_data" directory under root */ static struct kernfs_node *kn_mondata; +/* + * Used to store the max resource name width and max resource data width + * to display the schemata in a tabular format + */ +int max_name_width, max_data_width; + static struct seq_buf last_cmd_status; static char last_cmd_status_buf[512]; @@ -2594,6 +2600,12 @@ static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type if (cl > max_name_width) max_name_width = cl; + /* + * Choose a width for the resource data based on the resource that has + * widest name and cbm. + */ + max_data_width = max(max_data_width, r->data_width); + INIT_LIST_HEAD(&s->list); list_add(&s->list, &resctrl_schema_all); -- 2.25.1
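For reference, a simplified sketch of how the two widths are consumed when one schemata line is printed; the real code is in rdtgroup_schemata_show() and the helper name here is illustrative:

  static void example_show_schema(struct seq_file *seq, struct resctrl_schema *s,
                                  struct rdt_domain *d, u32 ctrl_val)
  {
      /* Pad the resource name, then print the per-domain value. */
      seq_printf(seq, "%*s:", max_name_width, s->name);
      seq_printf(seq, s->res->format_str, d->id, max_data_width, ctrl_val);
      seq_putc(seq, '\n');
  }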

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- The for_each_*_rdt_resource() helpers walk the architecture's array of structures, using the resctrl visible part as an iterator. To support resctrl, another architecture would have to provide equally complex macros. Change the resctrl code that uses these to walk through the resctrl_res_level enum instead. Instances in core.c and resctrl_arch_reset_resources() remain part of x86's architecture-specific code. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 7 +++++- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 30 +++++++++++++++++++---- 2 files changed, 31 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 884b88e25141..f2315a50ea4f 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -840,6 +840,7 @@ bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) { cpumask_var_t cpu_with_psl; + enum resctrl_res_level i; struct rdt_resource *r; struct rdt_domain *d_i; bool ret = false; @@ -854,7 +855,11 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) * First determine which cpus have pseudo-locked regions * associated with them. */ - for_each_alloc_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + list_for_each_entry(d_i, &r->domains, list) { if (d_i->plr) cpumask_or(cpu_with_psl, cpu_with_psl, diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 9113c0a8b2cf..0454d34cd066 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -100,10 +100,15 @@ void rdt_staged_configs_clear(void) { struct rdt_resource *r; struct rdt_domain *dom; + int i; lockdep_assert_held(&rdtgroup_mutex); - for_each_alloc_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + list_for_each_entry(dom, &r->domains, list) memset(dom->staged_config, 0, sizeof(dom->staged_config)); } @@ -2180,6 +2185,7 @@ static int rdtgroup_mkdir_info_resdir(void *priv, char *name, static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) { + enum resctrl_res_level i; struct resctrl_schema *s; struct rdt_resource *r; unsigned long fflags; @@ -2204,8 +2210,12 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) goto out_destroy; } - for_each_mon_capable_rdt_resource(r) { - fflags = r->fflags | RFTYPE_MON_INFO; + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + fflags = r->fflags | RFTYPE_MON_INFO; sprintf(name, "%s_MON", r->name); ret = rdtgroup_mkdir_info_resdir(r, name, fflags); if (ret) @@ -2614,10 +2624,15 @@ static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type static int schemata_list_create(void) { + enum resctrl_res_level i; struct rdt_resource *r; int ret = 0; - for_each_alloc_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = 
resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + if (resctrl_arch_get_cdp_enabled(r->rid)) { ret = schemata_list_add(r, CDP_CODE); if (ret) @@ -3165,6 +3180,7 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **dest_kn) { + enum resctrl_res_level i; struct rdt_resource *r; struct kernfs_node *kn; int ret; @@ -3183,7 +3199,11 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, * Create the subdirectories for each domain. Note that all events * in a domain like L3 are grouped into a resource whose domain is L3 */ - for_each_mon_capable_rdt_resource(r) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); if (ret) goto out_destroy; -- 2.25.1
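Every converted site now follows the same loop shape, with resctrl_arch_get_resource() as the only call that crosses into arch code. The pattern, extracted for reference (the function name is illustrative):

  static void example_walk_alloc_capable(void)
  {
      enum resctrl_res_level i;
      struct rdt_resource *r;

      for (i = 0; i < RDT_NUM_RESOURCES; i++) {
          r = resctrl_arch_get_resource(i);
          if (!r->alloc_capable)
              continue;

          /* ... operate on r ... */
      }
  }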

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- The architecture specific parts of resctrl have helpers to hide accesses to the rdt_mon_features bitmap. To ease moving the filesystem parts of the code out to /fs/, give these a resctrl_arch prefix. Move is_mbm_event() and is_mbm_enabled() into the only file that uses them. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 17 ++++++++++++ arch/x86/kernel/cpu/resctrl/core.c | 4 +-- arch/x86/kernel/cpu/resctrl/internal.h | 16 ----------- arch/x86/kernel/cpu/resctrl/monitor.c | 18 ++++++------ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 38 +++++++++++++++++--------- 5 files changed, 53 insertions(+), 40 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 5f6a5375bb4a..50407e83d0ca 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -7,6 +7,7 @@ #include <linux/jump_label.h> #include <linux/percpu.h> #include <linux/sched.h> +#include <linux/resctrl_types.h> /* * This value can never be a valid CLOSID, and is used when mapping a @@ -43,6 +44,7 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state); extern bool rdt_alloc_capable; extern bool rdt_mon_capable; +extern unsigned int rdt_mon_features; DECLARE_STATIC_KEY_FALSE(rdt_enable_key); DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); @@ -82,6 +84,21 @@ static inline void resctrl_arch_disable_mon(void) static_branch_dec_cpuslocked(&rdt_enable_key); } +static inline bool resctrl_arch_is_llc_occupancy_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID)); +} + +static inline bool resctrl_arch_is_mbm_total_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID)); +} + +static inline bool resctrl_arch_is_mbm_local_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); +} + /* * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR * diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 689649ad48ff..96aa4663f1ea 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -481,13 +481,13 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom) { size_t tsize; - if (is_mbm_total_enabled()) { + if (resctrl_arch_is_mbm_total_enabled()) { tsize = sizeof(*hw_dom->arch_mbm_total); hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL); if (!hw_dom->arch_mbm_total) return -ENOMEM; } - if (is_mbm_local_enabled()) { + if (resctrl_arch_is_mbm_local_enabled()) { tsize = sizeof(*hw_dom->arch_mbm_local); hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL); if (!hw_dom->arch_mbm_local) { diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index e5ef68402516..45840e197d12 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -131,7 +131,6 @@ struct rmid_read { void *arch_mon_ctx; }; -extern unsigned int rdt_mon_features; extern struct list_head resctrl_schema_all; extern bool resctrl_mounted; @@ -366,21 +365,6 @@ static inline bool is_llc_occupancy_enabled(void) return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID)); } -static inline bool is_mbm_total_enabled(void) -{ - return 
(rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID)); -} - -static inline bool is_mbm_local_enabled(void) -{ - return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); -} - -static inline bool is_mbm_enabled(void) -{ - return (is_mbm_total_enabled() || is_mbm_local_enabled()); -} - static inline bool is_mbm_event(int e) { return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 47aeca2769aa..5f7a47514a0f 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -251,11 +251,11 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) { struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - if (is_mbm_total_enabled()) + if (resctrl_arch_is_mbm_total_enabled()) memset(hw_dom->arch_mbm_total, 0, sizeof(*hw_dom->arch_mbm_total) * r->num_rmid); - if (is_mbm_local_enabled()) + if (resctrl_arch_is_mbm_local_enabled()) memset(hw_dom->arch_mbm_local, 0, sizeof(*hw_dom->arch_mbm_local) * r->num_rmid); } @@ -515,7 +515,7 @@ void free_rmid(u32 closid, u32 rmid) entry = __rmid_entry(idx); - if (is_llc_occupancy_enabled()) + if (resctrl_arch_is_llc_occupancy_enabled()) add_rmid_to_limbo(entry); else list_add_tail(&entry->list, &rmid_free_lru); @@ -667,7 +667,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) struct list_head *head; struct rdtgroup *entry; - if (!is_mbm_local_enabled()) + if (!resctrl_arch_is_mbm_local_enabled()) return; r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); @@ -736,7 +736,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, * This is protected from concurrent reads from user * as both the user and we hold the global mutex. */ - if (is_mbm_total_enabled()) { + if (resctrl_arch_is_mbm_total_enabled()) { rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; rr.val = 0; rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); @@ -750,7 +750,7 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); } - if (is_mbm_local_enabled()) { + if (resctrl_arch_is_mbm_local_enabled()) { rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; rr.val = 0; rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); @@ -996,11 +996,11 @@ static void l3_mon_evt_init(struct rdt_resource *r) { INIT_LIST_HEAD(&r->evt_list); - if (is_llc_occupancy_enabled()) + if (resctrl_arch_is_llc_occupancy_enabled()) list_add_tail(&llc_occupancy_event.list, &r->evt_list); - if (is_mbm_total_enabled()) + if (resctrl_arch_is_mbm_total_enabled()) list_add_tail(&mbm_total_event.list, &r->evt_list); - if (is_mbm_local_enabled()) + if (resctrl_arch_is_mbm_local_enabled()) list_add_tail(&mbm_local_event.list, &r->evt_list); } diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 0454d34cd066..307854c3a349 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -114,6 +114,18 @@ void rdt_staged_configs_clear(void) } } +static bool resctrl_is_mbm_enabled(void) +{ + return (resctrl_arch_is_mbm_total_enabled() || + resctrl_arch_is_mbm_local_enabled()); +} + +static bool resctrl_is_mbm_event(int e) +{ + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && + e <= QOS_L3_MBM_LOCAL_EVENT_ID); +} + /* * Trivial allocator for CLOSIDs. Since h/w only supports a small number, * we can keep a bitmap of free CLOSIDs in a single integer. 
@@ -2369,7 +2381,7 @@ static bool supports_mba_mbps(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - return (is_mbm_local_enabled() && + return (resctrl_arch_is_mbm_local_enabled() && r->alloc_capable && is_mba_linear()); } @@ -2737,7 +2749,7 @@ static int rdt_get_tree(struct fs_context *fc) if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) resctrl_mounted = true; - if (is_mbm_enabled()) { + if (resctrl_is_mbm_enabled()) { list_for_each_entry(dom, &l3->domains, list) mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); @@ -3106,7 +3118,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, if (ret) goto out_destroy; - if (is_mbm_event(mevt->evtid)) + if (resctrl_is_mbm_event(mevt->evtid)) mon_event_read(&rr, r, d, prgrp, mevt->evtid, true); } kernfs_activate(kn); @@ -4003,9 +4015,9 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) if (resctrl_mounted && resctrl_arch_mon_capable()) rmdir_mondata_subdir_allrdtgrp(r, d->id); - if (is_mbm_enabled()) + if (resctrl_is_mbm_enabled()) cancel_delayed_work(&d->mbm_over); - if (is_llc_occupancy_enabled() && has_busy_rmid(d)) { + if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) { /* * When a package is going down, forcefully * decrement rmid->ebusy. There is no way to know @@ -4029,12 +4041,12 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) u32 idx_limit = resctrl_arch_system_num_rmid_idx(); size_t tsize; - if (is_llc_occupancy_enabled()) { + if (resctrl_arch_is_llc_occupancy_enabled()) { d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL); if (!d->rmid_busy_llc) return -ENOMEM; } - if (is_mbm_total_enabled()) { + if (resctrl_arch_is_mbm_total_enabled()) { tsize = sizeof(*d->mbm_total); d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL); if (!d->mbm_total) { @@ -4042,7 +4054,7 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) return -ENOMEM; } } - if (is_mbm_local_enabled()) { + if (resctrl_arch_is_mbm_local_enabled()) { tsize = sizeof(*d->mbm_local); d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL); if (!d->mbm_local) { @@ -4074,13 +4086,13 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) if (err) goto out_unlock; - if (is_mbm_enabled()) { + if (resctrl_is_mbm_enabled()) { INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, RESCTRL_PICK_ANY_CPU); } - if (is_llc_occupancy_enabled()) + if (resctrl_arch_is_llc_occupancy_enabled()) INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo); /* @@ -4135,12 +4147,12 @@ void resctrl_offline_cpu(unsigned int cpu) d = get_domain_from_cpu(cpu, l3); if (d) { - if (is_mbm_enabled() && cpu == d->mbm_work_cpu) { + if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) { cancel_delayed_work(&d->mbm_over); mbm_setup_overflow_handler(d, 0, cpu); } - if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu && - has_busy_rmid(d)) { + if (resctrl_arch_is_llc_occupancy_enabled() && + cpu == d->cqm_work_cpu && has_busy_rmid(d)) { cancel_delayed_work(&d->cqm_limbo); cqm_setup_limbo_handler(d, 0, cpu); } -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- When BMEC is supported the resctrl event can be configured in a number of ways. This depends on architecture support. rdt_get_mon_l3_config() modifies the struct mon_evt and calls mbm_config_rftype_init() to create the files that allow the configuration. Splitting this into separate architecture and filesystem parts would require the struct mon_evt and mbm_config_rftype_init() to be exposed. Instead, add resctrl_arch_is_evt_configurable(), and use this from resctrl_mon_resource_init() to initialise struct mon_evt and call mbm_config_rftype_init(). resctrl_arch_is_evt_configurable() uses rdt_cpu_has() so it can't easily be in a header. Putting it in core.c allows rdt_cpu_has() to become static. resctrl_mon_resource_init() isn't marked __init, and MPAM will need to initialise resctrl late. Drop the __init on the relevant functions, and change rdt_options[] to be read-only after init. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 19 +++++++++++++++++-- arch/x86/kernel/cpu/resctrl/internal.h | 4 ++-- arch/x86/kernel/cpu/resctrl/monitor.c | 24 +++++++++++++----------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +- include/linux/resctrl.h | 2 ++ 5 files changed, 35 insertions(+), 16 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 96aa4663f1ea..6aa179b5c183 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -667,7 +667,7 @@ struct rdt_options { bool force_off, force_on; }; -static struct rdt_options rdt_options[] __initdata = { +static struct rdt_options rdt_options[] __ro_after_init = { RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC), RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL), RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL), @@ -707,7 +707,7 @@ static int __init set_rdt_options(char *str) } __setup("rdt", set_rdt_options); -bool __init rdt_cpu_has(int flag) +bool rdt_cpu_has(int flag) { bool ret = boot_cpu_has(flag); struct rdt_options *o; @@ -727,6 +727,21 @@ bool __init rdt_cpu_has(int flag) return ret; } +bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) +{ + if (!rdt_cpu_has(X86_FEATURE_BMEC)) + return false; + + switch (evt) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL); + case QOS_L3_MBM_LOCAL_EVENT_ID: + return rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL); + default: + return false; + } +} + static __init bool get_mem_config(void) { struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_MBA]; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 45840e197d12..7cab19a6a714 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -519,7 +519,7 @@ int alloc_rmid(u32 closid); void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); void __exit resctrl_mon_resource_exit(void); -bool __init rdt_cpu_has(int flag); +bool rdt_cpu_has(int flag); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, @@ -539,7 +539,7 @@ bool 
has_busy_rmid(struct rdt_domain *d); void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); void __init thread_throttle_mode_init(void); -void __init mbm_config_rftype_init(const char *config); +void mbm_config_rftype_init(const char *config); void rdt_staged_configs_clear(void); bool closid_allocated(unsigned int closid); int resctrl_find_cleanest_closid(void); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 5f7a47514a0f..fbb2877e317c 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -1013,8 +1013,19 @@ int resctrl_mon_resource_init(void) if (ret) return ret; - if (r->mon_capable) - l3_mon_evt_init(r); + if (!r->mon_capable) + return 0; + + l3_mon_evt_init(r); + + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { + mbm_total_event.configurable = true; + mbm_config_rftype_init("mbm_total_bytes_config"); + } + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { + mbm_local_event.configurable = true; + mbm_config_rftype_init("mbm_local_bytes_config"); + } return 0; } @@ -1057,15 +1068,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) /* Detect list of bandwidth sources that can be tracked */ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx); hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS; - - if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) { - mbm_total_event.configurable = true; - mbm_config_rftype_init("mbm_total_bytes_config"); - } - if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) { - mbm_local_event.configurable = true; - mbm_config_rftype_init("mbm_local_bytes_config"); - } } r->mon_capable = true; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 307854c3a349..b0d8b2709df5 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2067,7 +2067,7 @@ void __init thread_throttle_mode_init(void) rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; } -void __init mbm_config_rftype_init(const char *config) +void mbm_config_rftype_init(const char *config) { struct rftype *rft; diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index b0ee7256e095..bfc63e8219e5 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -204,6 +204,8 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r); struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id); int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid); +bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt); + /* * Update the ctrl_val and apply this config right now. * Must be called on one of the domain's CPUs. -- 2.25.1
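A hedged sketch of how an MPAM implementation might answer the same question, assuming the driver probes whether its bandwidth monitors can filter by traffic type (mpam_mbwu_can_filter() is an illustrative name):

  bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
  {
      switch (evt) {
      case QOS_L3_MBM_TOTAL_EVENT_ID:
      case QOS_L3_MBM_LOCAL_EVENT_ID:
          return mpam_mbwu_can_filter();    /* illustrative */
      default:
          return false;
      }
  }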

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- mon_event_config_{read,write}() are called via IPI and access model specific registers to do their work. To support another architecture, this needs abstracting. Rename mon_event_config_{read,write}() to have a resctrl_arch_ prefix, and move their struct mon_config_info parameter into the shared linux/resctrl.h header. This allows another architecture to supply an implementation of these. As struct mon_config_info is now exposed globally, give it a 'resctrl_' prefix. MPAM systems need access to the domain to do this work, add the resource and domain to struct resctrl_mon_config_info. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 34 +++++++++++++------------- include/linux/resctrl.h | 9 +++++++ 2 files changed, 26 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index b0d8b2709df5..b22b2960542c 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1579,11 +1579,6 @@ static int rdtgroup_size_show(struct kernfs_open_file *of, return ret; } -struct mon_config_info { - u32 evtid; - u32 mon_config; -}; - #define INVALID_CONFIG_INDEX UINT_MAX /** @@ -1608,9 +1603,9 @@ static inline unsigned int mon_event_config_index_get(u32 evtid) } } -static void mon_event_config_read(void *info) +void resctrl_arch_mon_event_config_read(void *info) { - struct mon_config_info *mon_info = info; + struct resctrl_mon_config_info *mon_info = info; unsigned int index; u64 msrval; @@ -1625,14 +1620,15 @@ static void mon_event_config_read(void *info) mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; } -static void mondata_config_read(struct rdt_domain *d, struct mon_config_info *mon_info) +static void mondata_config_read(struct resctrl_mon_config_info *mon_info) { - smp_call_function_any(&d->cpu_mask, mon_event_config_read, mon_info, 1); + smp_call_function_any(&mon_info->d->cpu_mask, + resctrl_arch_mon_event_config_read, mon_info, 1); } static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) { - struct mon_config_info mon_info = {0}; + struct resctrl_mon_config_info mon_info = {0}; struct rdt_domain *dom; bool sep = false; @@ -1643,9 +1639,11 @@ static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) if (sep) seq_puts(s, ";"); - memset(&mon_info, 0, sizeof(struct mon_config_info)); + memset(&mon_info, 0, sizeof(struct resctrl_mon_config_info)); + mon_info.r = r; + mon_info.d = dom; mon_info.evtid = evtid; - mondata_config_read(dom, &mon_info); + mondata_config_read(&mon_info); seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); sep = true; @@ -1678,9 +1676,9 @@ static int mbm_local_bytes_config_show(struct kernfs_open_file *of, return 0; } -static void mon_event_config_write(void *info) +void resctrl_arch_mon_event_config_write(void *info) { - struct mon_config_info *mon_info = info; + struct resctrl_mon_config_info *mon_info = info; unsigned int index; index = mon_event_config_index_get(mon_info->evtid); @@ -1694,14 +1692,16 @@ static void mon_event_config_write(void *info) static void mbm_config_write_domain(struct rdt_resource *r, struct rdt_domain *d, u32 evtid, u32 val) { - struct mon_config_info 
mon_info = {0}; + struct resctrl_mon_config_info mon_info = {0}; /* * Read the current config value first. If both are the same then * no need to write it again. */ + mon_info.r = r; + mon_info.d = d; mon_info.evtid = evtid; - mondata_config_read(d, &mon_info); + mondata_config_read(&mon_info); if (mon_info.mon_config == val) return; @@ -1713,7 +1713,7 @@ static void mbm_config_write_domain(struct rdt_resource *r, * are scoped at the domain level. Writing any of these MSRs * on one CPU is observed by all the CPUs in the domain. */ - smp_call_function_any(&d->cpu_mask, mon_event_config_write, + smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, &mon_info, 1); /* diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index bfc63e8219e5..975b80102fbe 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -192,6 +192,13 @@ struct resctrl_cpu_sync { u32 rmid; }; +struct resctrl_mon_config_info { + struct rdt_resource *r; + struct rdt_domain *d; + u32 evtid; + u32 mon_config; +}; + /* * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to * struct resctrl_cpu_sync, or NULL. @@ -205,6 +212,8 @@ struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id); int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid); bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt); +void resctrl_arch_mon_event_config_write(void *info); +void resctrl_arch_mon_event_config_read(void *info); /* * Update the ctrl_val and apply this config right now. -- 2.25.1
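Because smp_call_function_any() requires a void (*)(void *) callback, everything the helper consumes or produces has to travel through struct resctrl_mon_config_info. As a sketch of why the resource and domain were added: an MPAM implementation could key its lookup on them, roughly like the following (mpam_get_mbwu_filter() is an invented name, purely illustrative):

void resctrl_arch_mon_event_config_read(void *info)
{
        struct resctrl_mon_config_info *mon_info = info;

        /* Hypothetical MPAM-side lookup using the new r and d fields. */
        mon_info->mon_config = mpam_get_mbwu_filter(mon_info->r, mon_info->d,
                                                    mon_info->evtid);
}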

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl_arch_mon_event_config_write() writes a bitmap of events provided by user-space into the configuration register for the monitors. This assumes that all architectures support all the features each bit corresponds to. MPAM can filter monitors based on read, write, or both, but there are many more options in the existing bitmap. To allow this interface to work for machines with MPAM, allow the architecture helper to return an error if an incompatible bitmap is set. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 +++++++ include/linux/resctrl.h | 2 ++ 2 files changed, 9 insertions(+) diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index b22b2960542c..b464f840cfe3 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1684,9 +1684,12 @@ void resctrl_arch_mon_event_config_write(void *info) index = mon_event_config_index_get(mon_info->evtid); if (index == INVALID_CONFIG_INDEX) { pr_warn_once("Invalid event id %d\n", mon_info->evtid); + mon_info->err = -EINVAL; return; } wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); + + mon_info->err = 0; } static void mbm_config_write_domain(struct rdt_resource *r, @@ -1715,6 +1718,10 @@ static void mbm_config_write_domain(struct rdt_resource *r, */ smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, &mon_info, 1); + if (mon_info.err) { + rdt_last_cmd_puts("Invalid event configuration\n"); + return; + } /* * When an Event Configuration is changed, the bandwidth counters diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 975b80102fbe..1027421f233e 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -197,6 +197,8 @@ struct resctrl_mon_config_info { struct rdt_domain *d; u32 evtid; u32 mon_config; + + int err; }; /* -- 2.25.1
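The IPI callback still has to return void, so the result is carried back in the structure: the caller can safely read mon_info.err after smp_call_function_any() returns, because the final argument (wait=1) makes it block until the callback has run. An architecture that only supports a subset of the bandwidth-source bits might implement the check roughly as below; the 'supported' mask is illustrative:

void resctrl_arch_mon_event_config_write(void *info)
{
        struct resctrl_mon_config_info *mon_info = info;
        u32 supported = READS_TO_LOCAL_MEM;     /* arch-specific in practice */

        if (mon_info->mon_config & ~supported) {
                mon_info->err = -EOPNOTSUPP;
                return;
        }

        mon_info->err = 0;
        /* ... apply the filter to the monitor hardware ... */
}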

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl's pseudo lock has some copy-to-cache and measurement functions that are micro-architecture specific. pseudo_lock_fn() is not at all portable. Label these 'resctrl_arch_' so they stay under /arch/x86. Pseudo-lock doesn't work without these, so add a Kconfig symbol to disable it. This ensures that rdtgrp->mode can never be RDT_MODE_PSEUDO_LOCKED. Arm's cache-lockdown works in a very different way. As it isn't possible to disable prefetch or efficiently restrict the CPU's ability to pull lines into the cache, this resctrl pseudo_lock stuff isn't going to be used on arm. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/Kconfig | 7 ++++ arch/x86/include/asm/resctrl.h | 5 +++ arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 3 +- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 48 +++++++++++++++-------- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 17 +++++--- 5 files changed, 56 insertions(+), 24 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 52f55990e375..8bfa4a7b2737 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -481,6 +481,7 @@ config X86_CPU_RESCTRL depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD) select KERNFS select PROC_CPU_RESCTRL if PROC_FS + select RESCTRL_FS_PSEUDO_LOCK help Enable x86 CPU resource control support. @@ -497,6 +498,12 @@ config X86_CPU_RESCTRL Say N if unsure. +config RESCTRL_FS_PSEUDO_LOCK + bool + help + Software mechanism to try and pin data in a cache portion using + micro-architecture tricks. + if X86_32 config X86_BIGSMP bool "Support for big SMP systems with more than 8 CPUs" diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 50407e83d0ca..a88af68f9fe2 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -211,6 +211,11 @@ static inline void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx) { }; +u64 resctrl_arch_get_prefetch_disable_bits(void); +int resctrl_arch_pseudo_lock_fn(void *_rdtgrp); +int resctrl_arch_measure_cycles_lat_fn(void *_plr); +int resctrl_arch_measure_l2_residency(void *_plr); +int resctrl_arch_measure_l3_residency(void *_plr); void resctrl_cpu_detect(struct cpuinfo_x86 *c); #else diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index ad9e5516e607..96ae487f3120 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -179,7 +179,8 @@ static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, if (!cbm_validate(data->buf, &cbm_val, r)) return -EINVAL; - if ((rdtgrp->mode == RDT_MODE_EXCLUSIVE || + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + (rdtgrp->mode == RDT_MODE_EXCLUSIVE || rdtgrp->mode == RDT_MODE_SHAREABLE) && rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) { rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n"); diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index f2315a50ea4f..09adf0e3e753 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -62,7 +62,8 @@ static const struct class pseudo_lock_class = { }; /** - 
* get_prefetch_disable_bits - prefetch disable bits of supported platforms + * resctrl_arch_get_prefetch_disable_bits - prefetch disable bits of supported + * platforms * @void: It takes no parameters. * * Capture the list of platforms that have been validated to support @@ -76,13 +77,13 @@ static const struct class pseudo_lock_class = { * in the SDM. * * When adding a platform here also add support for its cache events to - * measure_cycles_perf_fn() + * resctrl_arch_measure_l*_residency() * * Return: * If platform is supported, the bits to disable hardware prefetchers, 0 * if platform is not supported. */ -static u64 get_prefetch_disable_bits(void) +u64 resctrl_arch_get_prefetch_disable_bits(void) { if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || boot_cpu_data.x86 != 6) @@ -410,7 +411,7 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp) } /** - * pseudo_lock_fn - Load kernel memory into cache + * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache * @_rdtgrp: resource group to which pseudo-lock region belongs * * This is the core pseudo-locking flow. @@ -428,7 +429,7 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp) * * Return: 0. Waiter on waitqueue will be woken on completion. */ -static int pseudo_lock_fn(void *_rdtgrp) +int resctrl_arch_pseudo_lock_fn(void *_rdtgrp) { struct rdtgroup *rdtgrp = _rdtgrp; struct pseudo_lock_region *plr = rdtgrp->plr; @@ -714,7 +715,7 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) * Not knowing the bits to disable prefetching implies that this * platform does not support Cache Pseudo-Locking. */ - prefetch_disable_bits = get_prefetch_disable_bits(); + prefetch_disable_bits = resctrl_arch_get_prefetch_disable_bits(); if (prefetch_disable_bits == 0) { rdt_last_cmd_puts("Pseudo-locking not supported\n"); return -EINVAL; @@ -776,6 +777,9 @@ int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) { int ret; + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + if (resctrl_arch_mon_capable()) { ret = alloc_rmid(rdtgrp->closid); if (ret < 0) { @@ -848,6 +852,9 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) return true; @@ -879,7 +886,8 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) } /** - * measure_cycles_lat_fn - Measure cycle latency to read pseudo-locked memory + * resctrl_arch_measure_cycles_lat_fn - Measure cycle latency to read + * pseudo-locked memory * @_plr: pseudo-lock region to measure * * There is no deterministic way to test if a memory region is cached. One @@ -892,7 +900,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) * * Return: 0. Waiter on waitqueue will be woken on completion. 
*/ -static int measure_cycles_lat_fn(void *_plr) +int resctrl_arch_measure_cycles_lat_fn(void *_plr) { struct pseudo_lock_region *plr = _plr; u32 saved_low, saved_high; @@ -1076,7 +1084,7 @@ static int measure_residency_fn(struct perf_event_attr *miss_attr, return 0; } -static int measure_l2_residency(void *_plr) +int resctrl_arch_measure_l2_residency(void *_plr) { struct pseudo_lock_region *plr = _plr; struct residency_counts counts = {0}; @@ -1114,7 +1122,7 @@ static int measure_l2_residency(void *_plr) return 0; } -static int measure_l3_residency(void *_plr) +int resctrl_arch_measure_l3_residency(void *_plr) { struct pseudo_lock_region *plr = _plr; struct residency_counts counts = {0}; @@ -1212,18 +1220,18 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) plr->cpu = cpu; if (sel == 1) - thread = kthread_create_on_node(measure_cycles_lat_fn, plr, - cpu_to_node(cpu), + thread = kthread_create_on_node(resctrl_arch_measure_cycles_lat_fn, + plr, cpu_to_node(cpu), "pseudo_lock_measure/%u", cpu); else if (sel == 2) - thread = kthread_create_on_node(measure_l2_residency, plr, - cpu_to_node(cpu), + thread = kthread_create_on_node(resctrl_arch_measure_l2_residency, + plr, cpu_to_node(cpu), "pseudo_lock_measure/%u", cpu); else if (sel == 3) - thread = kthread_create_on_node(measure_l3_residency, plr, - cpu_to_node(cpu), + thread = kthread_create_on_node(resctrl_arch_measure_l3_residency, + plr, cpu_to_node(cpu), "pseudo_lock_measure/%u", cpu); else @@ -1310,6 +1318,9 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) struct device *dev; int ret; + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + ret = pseudo_lock_region_alloc(plr); if (ret < 0) return ret; @@ -1322,7 +1333,7 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) plr->thread_done = 0; - thread = kthread_create_on_node(pseudo_lock_fn, rdtgrp, + thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, rdtgrp, cpu_to_node(plr->cpu), "pseudo_lock/%u", plr->cpu); if (IS_ERR(thread)) { @@ -1435,6 +1446,9 @@ void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) { struct pseudo_lock_region *plr = rdtgrp->plr; + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return; + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { /* * Default group cannot be a pseudo-locked region so we can diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index b464f840cfe3..ef84d8a855b6 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1451,7 +1451,8 @@ static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, goto out; } rdtgrp->mode = RDT_MODE_EXCLUSIVE; - } else if (!strcmp(buf, "pseudo-locksetup")) { + } else if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + !strcmp(buf, "pseudo-locksetup")) { ret = rdtgroup_locksetup_enter(rdtgrp); if (ret) goto out; @@ -2740,9 +2741,11 @@ static int rdt_get_tree(struct fs_context *fc) rdtgroup_default.mon.mon_data_kn = kn_mondata; } - ret = rdt_pseudo_lock_init(); - if (ret) - goto out_mondata; + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) { + ret = rdt_pseudo_lock_init(); + if (ret) + goto out_mondata; + } ret = kernfs_get_tree(fc); if (ret < 0) @@ -2765,7 +2768,8 @@ static int rdt_get_tree(struct fs_context *fc) goto out; out_psl: - rdt_pseudo_lock_release(); + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); out_mondata: if (resctrl_arch_mon_capable()) kernfs_remove(kn_mondata); @@ -3030,7 +3034,8 @@ static void 
rdt_kill_sb(struct super_block *sb) resctrl_arch_reset_resources(); rmdir_all_sub(); - rdt_pseudo_lock_release(); + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); rdtgroup_default.mode = RDT_MODE_SHAREABLE; schemata_list_destroy(); rdtgroup_destroy_root(); -- 2.25.1
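IS_ENABLED() expands to a compile-time 0 or 1, so the guarded pseudo-lock calls are still parsed and type-checked but become dead code that the compiler discards when CONFIG_RESCTRL_FS_PSEUDO_LOCK is not selected. A stand-alone illustration of the pattern, using a simplified IS_ENABLED() (the kernel's real macro in include/linux/kconfig.h also copes with undefined options):

#include <stdio.h>

#define CONFIG_RESCTRL_FS_PSEUDO_LOCK 0 /* as if the option were unset */
#define IS_ENABLED(option) (option)

static int rdt_pseudo_lock_init(void)
{
        return 0;
}

int main(void)
{
        if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) {
                /* Type-checked, then eliminated as dead code. */
                if (rdt_pseudo_lock_init())
                        return 1;
        }

        printf("pseudo-lock compiled %s\n",
               IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) ? "in" : "out");
        return 0;
}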

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- prefetch_disable_bits is set by rdtgroup_locksetup_enter() from an arch code value, then read by the arch code. Instead of exporting this value, make resctrl_arch_get_prefetch_disable_bits() set it so that the other arch-code helpers can use the cached value. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 09adf0e3e753..20e6c99db25b 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -85,6 +85,8 @@ static const struct class pseudo_lock_class = { */ u64 resctrl_arch_get_prefetch_disable_bits(void) { + prefetch_disable_bits = 0; + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || boot_cpu_data.x86 != 6) return 0; @@ -100,7 +102,8 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) * 3 DCU IP Prefetcher Disable (R/W) * 63:4 Reserved */ - return 0xF; + prefetch_disable_bits = 0xF; + break; case INTEL_FAM6_ATOM_GOLDMONT: case INTEL_FAM6_ATOM_GOLDMONT_PLUS: /* @@ -111,10 +114,11 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) * 2 DCU Hardware Prefetcher Disable (R/W) * 63:3 Reserved */ - return 0x5; + prefetch_disable_bits = 0x5; + break; } - return 0; + return prefetch_disable_bits; } /** @@ -715,8 +719,7 @@ int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) * Not knowing the bits to disable prefetching implies that this * platform does not support Cache Pseudo-Locking. */ - prefetch_disable_bits = resctrl_arch_get_prefetch_disable_bits(); - if (prefetch_disable_bits == 0) { + if (resctrl_arch_get_prefetch_disable_bits() == 0) { rdt_last_cmd_puts("Pseudo-locking not supported\n"); return -EINVAL; } -- 2.25.1
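With the probe result cached in the file-scope prefetch_disable_bits, the other helpers consume it directly; roughly how resctrl_arch_pseudo_lock_fn() already uses the value (abridged from the existing function, not new code):

        /* Disable hardware prefetchers using the cached probe result. */
        saved_msr = __rdmsr(MSR_MISC_FEATURE_CONTROL);
        __wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);

        /* ... touch the pseudo-locked region to pull it into the cache ... */

        /* Restore the prefetchers on the way out. */
        wrmsrl(MSR_MISC_FEATURE_CONTROL, saved_msr);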

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl_arch_pseudo_lock_fn() has some very arch specific behaviour, and it takes a struct rdtgroup. This means resctrl has to expose that structure, its enums and substructures, to the rest of the kernel. The only reason resctrl_arch_pseudo_lock_fn() wants the rdtgroup is for the CLOSID. Embed that in the pseudo_lock_region as a closid. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 2 +- arch/x86/kernel/cpu/resctrl/internal.h | 37 --------------------- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 13 ++++---- include/linux/resctrl.h | 39 +++++++++++++++++++++++ 4 files changed, 47 insertions(+), 44 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index a88af68f9fe2..9940398e367e 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -212,7 +212,7 @@ static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx) { }; u64 resctrl_arch_get_prefetch_disable_bits(void); -int resctrl_arch_pseudo_lock_fn(void *_rdtgrp); +int resctrl_arch_pseudo_lock_fn(void *_plr); int resctrl_arch_measure_cycles_lat_fn(void *_plr); int resctrl_arch_measure_l2_residency(void *_plr); int resctrl_arch_measure_l3_residency(void *_plr); diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 7cab19a6a714..9198a5bf59d4 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -183,43 +183,6 @@ struct mongroup { u32 rmid; }; -/** - * struct pseudo_lock_region - pseudo-lock region information - * @s: Resctrl schema for the resource to which this - * pseudo-locked region belongs - * @d: RDT domain to which this pseudo-locked region - * belongs - * @cbm: bitmask of the pseudo-locked region - * @lock_thread_wq: waitqueue used to wait on the pseudo-locking thread - * completion - * @thread_done: variable used by waitqueue to test if pseudo-locking - * thread completed - * @cpu: core associated with the cache on which the setup code - * will be run - * @line_size: size of the cache lines - * @size: size of pseudo-locked region in bytes - * @kmem: the kernel memory associated with pseudo-locked region - * @minor: minor number of character device associated with this - * region - * @debugfs_dir: pointer to this region's directory in the debugfs - * filesystem - * @pm_reqs: Power management QoS requests related to this region - */ -struct pseudo_lock_region { - struct resctrl_schema *s; - struct rdt_domain *d; - u32 cbm; - wait_queue_head_t lock_thread_wq; - int thread_done; - int cpu; - unsigned int line_size; - unsigned int size; - void *kmem; - unsigned int minor; - struct dentry *debugfs_dir; - struct list_head pm_reqs; -}; - /** * struct rdtgroup - store rdtgroup's data in resctrl file system. 
* @kn: kernfs node diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 20e6c99db25b..85536566d7f5 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -416,7 +416,7 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp) /** * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache - * @_rdtgrp: resource group to which pseudo-lock region belongs + * @_plr: the pseudo-lock region descriptor * * This is the core pseudo-locking flow. * @@ -433,10 +433,9 @@ static void pseudo_lock_free(struct rdtgroup *rdtgrp) * * Return: 0. Waiter on waitqueue will be woken on completion. */ -int resctrl_arch_pseudo_lock_fn(void *_rdtgrp) +int resctrl_arch_pseudo_lock_fn(void *_plr) { - struct rdtgroup *rdtgrp = _rdtgrp; - struct pseudo_lock_region *plr = rdtgrp->plr; + struct pseudo_lock_region *plr = _plr; u32 rmid_p, closid_p; unsigned long i; u64 saved_msr; @@ -496,7 +495,8 @@ int resctrl_arch_pseudo_lock_fn(void *_rdtgrp) * pseudo-locked followed by reading of kernel memory to load it * into the cache. */ - __wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, rdtgrp->closid); + __wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, plr->closid); + /* * Cache was flushed earlier. Now access kernel memory to read it * into cache region associated with just activated plr->closid. @@ -1336,7 +1336,8 @@ int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) plr->thread_done = 0; - thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, rdtgrp, + plr->closid = rdtgrp->closid; + thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, plr, cpu_to_node(plr->cpu), "pseudo_lock/%u", plr->cpu); if (IS_ERR(thread)) { diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 1027421f233e..223937f89911 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -25,6 +25,45 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX +/** + * struct pseudo_lock_region - pseudo-lock region information + * @s: Resctrl schema for the resource to which this + * pseudo-locked region belongs + * @closid: The closid that this pseudo-locked region uses + * @d: RDT domain to which this pseudo-locked region + * belongs + * @cbm: bitmask of the pseudo-locked region + * @lock_thread_wq: waitqueue used to wait on the pseudo-locking thread + * completion + * @thread_done: variable used by waitqueue to test if pseudo-locking + * thread completed + * @cpu: core associated with the cache on which the setup code + * will be run + * @line_size: size of the cache lines + * @size: size of pseudo-locked region in bytes + * @kmem: the kernel memory associated with pseudo-locked region + * @minor: minor number of character device associated with this + * region + * @debugfs_dir: pointer to this region's directory in the debugfs + * filesystem + * @pm_reqs: Power management QoS requests related to this region + */ +struct pseudo_lock_region { + struct resctrl_schema *s; + u32 closid; + struct rdt_domain *d; + u32 cbm; + wait_queue_head_t lock_thread_wq; + int thread_done; + int cpu; + unsigned int line_size; + unsigned int size; + void *kmem; + unsigned int minor; + struct dentry *debugfs_dir; + struct list_head pm_reqs; +}; + /** * struct resctrl_staged_config - parsed configuration to be applied * @new_ctrl: new ctrl value to be loaded -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- thread_throttle_mode_init() is called by the architecture specific code to make the 'thread_throttle_mode' file visible. The architecture specific code has already set the membw.throttle_mode in the rdt_resource. The architecture specific call isn't necessary, and creates an extra interface between the architecture and resctrl. Call thread_throttle_mode_init() from resctrl_init(), and check the membw.throttle_mode on the MBA resource. This avoids an extra API call. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 1 - arch/x86/kernel/cpu/resctrl/internal.h | 1 - arch/x86/kernel/cpu/resctrl/rdtgroup.c | 9 ++++++++- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 6aa179b5c183..786f3c9a7267 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -227,7 +227,6 @@ static __init bool __get_mem_config_intel(struct rdt_resource *r) r->membw.throttle_mode = THREAD_THROTTLE_PER_THREAD; else r->membw.throttle_mode = THREAD_THROTTLE_MAX; - thread_throttle_mode_init(); r->alloc_capable = true; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 9198a5bf59d4..1d1412d19737 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -501,7 +501,6 @@ void cqm_handle_limbo(struct work_struct *work); bool has_busy_rmid(struct rdt_domain *d); void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); -void __init thread_throttle_mode_init(void); void mbm_config_rftype_init(const char *config); void rdt_staged_configs_clear(void); bool closid_allocated(unsigned int closid); diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index ef84d8a855b6..f5ed5089084d 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2064,10 +2064,15 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name) return NULL; } -void __init thread_throttle_mode_init(void) +static void __init thread_throttle_mode_init(void) { + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); struct rftype *rft; + if (!r->alloc_capable || + r->membw.throttle_mode == THREAD_THROTTLE_UNDEFINED) + return; + rft = rdtgroup_get_rftype_by_name("thread_throttle_mode"); if (!rft) return; @@ -4191,6 +4196,8 @@ int __init resctrl_init(void) rdtgroup_setup_default(); + thread_throttle_mode_init(); + ret = resctrl_mon_resource_init(); if (ret) return ret; -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- get_config_index() is used by the architecture specific code to map a CLOSID+type pair to an index in the configuration arrays. MPAM needs to do this too to preserve the ABI to user-space; there is no reason to do it differently. Move the helper to a header file. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +++---------------- include/linux/resctrl.h | 15 +++++++++++++++ 2 files changed, 18 insertions(+), 16 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 96ae487f3120..5953b997c682 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -283,19 +283,6 @@ static int parse_line(char *line, struct resctrl_schema *s, return -EINVAL; } -static u32 get_config_index(u32 closid, enum resctrl_conf_type type) -{ - switch (type) { - default: - case CDP_NONE: - return closid; - case CDP_CODE: - return closid * 2 + 1; - case CDP_DATA: - return closid * 2; - } -} - static bool apply_config(struct rdt_hw_domain *hw_dom, struct resctrl_staged_config *cfg, u32 idx, cpumask_var_t cpu_mask) @@ -317,7 +304,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - u32 idx = get_config_index(closid, t); + u32 idx = resctrl_get_config_index(closid, t); struct msr_param msr_param; if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask)) @@ -357,7 +344,7 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) if (!cfg->have_new_ctrl) continue; - idx = get_config_index(closid, t); + idx = resctrl_get_config_index(closid, t); if (!apply_config(hw_dom, cfg, idx, cpu_mask)) continue; @@ -482,7 +469,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type) { struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - u32 idx = get_config_index(closid, type); + u32 idx = resctrl_get_config_index(closid, type); return hw_dom->ctrl_val[idx]; } diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 223937f89911..d7850cf6effe 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -256,6 +256,21 @@ bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt); void resctrl_arch_mon_event_config_write(void *info); void resctrl_arch_mon_event_config_read(void *info); +/* For use by arch code that needs to remap resctrl's smaller CDP closid */ +static inline u32 resctrl_get_config_index(u32 closid, + enum resctrl_conf_type type) +{ + switch (type) { + default: + case CDP_NONE: + return closid; + case CDP_CODE: + return (closid * 2) + 1; + case CDP_DATA: + return (closid * 2); + } +} + /* * Update the ctrl_val and apply this config right now. * Must be called on one of the domain's CPUs. -- 2.25.1
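The mapping is small enough to check in isolation; this stand-alone program mirrors the helper (uint32_t standing in for the kernel's u32) and demonstrates the code/data interleaving that CDP relies on:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum resctrl_conf_type { CDP_NONE, CDP_CODE, CDP_DATA };

static uint32_t resctrl_get_config_index(uint32_t closid,
                                         enum resctrl_conf_type type)
{
        switch (type) {
        default:
        case CDP_NONE:
                return closid;
        case CDP_CODE:
                return (closid * 2) + 1;
        case CDP_DATA:
                return (closid * 2);
        }
}

int main(void)
{
        /* With CDP enabled, closid 3 owns an adjacent data/code pair. */
        assert(resctrl_get_config_index(3, CDP_DATA) == 6);
        assert(resctrl_get_config_index(3, CDP_CODE) == 7);
        /* Without CDP, each closid maps to a single configuration slot. */
        assert(resctrl_get_config_index(3, CDP_NONE) == 3);
        printf("CDP index mapping holds\n");
        return 0;
}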

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- get_domain_from_cpu() is a handy helper that both the arch code and resctrl need to use. Rename it resctrl_get_domain_from_cpu() so it can be moved out to /fs, and exported back to the arch code. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 15 +------------- arch/x86/kernel/cpu/resctrl/internal.h | 1 - arch/x86/kernel/cpu/resctrl/monitor.c | 2 +- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +- include/linux/resctrl.h | 28 ++++++++++++++++++++++++ 5 files changed, 31 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 786f3c9a7267..6176c163af54 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -355,19 +355,6 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]); } -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) -{ - struct rdt_domain *d; - - list_for_each_entry(d, &r->domains, list) { - /* Find the domain that contains this CPU */ - if (cpumask_test_cpu(cpu, &d->cpu_mask)) - return d; - } - - return NULL; -} - u32 resctrl_arch_get_num_closid(struct rdt_resource *r) { return resctrl_to_arch_res(r)->num_closid; @@ -381,7 +368,7 @@ void rdt_ctrl_update(void *arg) int cpu = smp_processor_id(); struct rdt_domain *d; - d = get_domain_from_cpu(cpu, r); + d = resctrl_get_domain_from_cpu(cpu, r); if (d) { hw_res->msr_update(d, m, r); return; diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 1d1412d19737..3d40e215b96b 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -475,7 +475,6 @@ int rdt_pseudo_lock_init(void); void rdt_pseudo_lock_release(void); int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); -struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r); int closids_supported(void); void closid_free(int closid); int alloc_rmid(u32 closid); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index fbb2877e317c..91861a54c4d0 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -677,7 +677,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) idx = resctrl_arch_rmid_idx_encode(closid, rmid); pmbm_data = &dom_mbm->mbm_local[idx]; - dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba); + dom_mba = resctrl_get_domain_from_cpu(smp_processor_id(), r_mba); if (!dom_mba) { pr_warn_once("Failure to get domain for MBA update\n"); return; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index f5ed5089084d..c3336a320375 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -4162,7 +4162,7 @@ void resctrl_offline_cpu(unsigned int cpu) if (!l3->mon_capable) goto out_unlock; - d = get_domain_from_cpu(cpu, l3); + d = resctrl_get_domain_from_cpu(cpu, l3); if (d) { if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) { cancel_delayed_work(&d->mbm_over); diff --git 
a/include/linux/resctrl.h b/include/linux/resctrl.h index d7850cf6effe..8970dcf0e242 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -2,6 +2,7 @@ #ifndef _RESCTRL_H #define _RESCTRL_H +#include <linux/cpu.h> #include <linux/kernel.h> #include <linux/list.h> #include <linux/pid.h> @@ -271,6 +272,33 @@ static inline u32 resctrl_get_config_index(u32 closid, } } +/* + * Caller must be in a RCU read-side critical section, or hold the + * cpuhp read lock to prevent the struct rdt_domain being freed. + */ +static inline struct rdt_domain * +resctrl_get_domain_from_cpu(int cpu, struct rdt_resource *r) +{ + struct rdt_domain *d; + + /* + * Walking r->domains, ensure it can't race with cpuhp. + * Because this is called via IPI by rdt_ctrl_update(), assertions + * about locks this thread holds will lead to false positives. Check + * someone is holding the CPUs lock. + */ + if (IS_ENABLED(CONFIG_LOCKDEP)) + lockdep_is_cpus_held(); + + list_for_each_entry_rcu(d, &r->domains, list) { + /* Find the domain that contains this CPU */ + if (cpumask_test_cpu(cpu, &d->cpu_mask)) + return d; + } + + return NULL; +} + /* * Update the ctrl_val and apply this config right now. * Must be called on one of the domain's CPUs. -- 2.25.1
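Callers in process context take the cpuhp read lock around the walk; IPI callers run with interrupts disabled, which effectively acts as an RCU read-side critical section, hence the relaxed lockdep check. A hedged usage sketch for a process-context caller:

        struct rdt_domain *d;

        cpus_read_lock();       /* keeps r->domains stable */
        d = resctrl_get_domain_from_cpu(cpu, r);
        if (d)
                pr_info("cpu %d belongs to domain %d\n", cpu, d->id);
        cpus_read_unlock();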

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl uses u32 to store the user-space configuration bitmaps, and has a closid bitmap stored in a u32. MPAM supports bitmaps and partids bigger than this. Add some preprocessor values that make it clear why MPAM clamps some of these values. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- include/linux/resctrl.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 8970dcf0e242..a88a43a0380d 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -26,6 +26,17 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX +/* + * Resctrl uses a u32 as a closid bitmap. The maximum closid is 32. + */ +#define RESCTRL_MAX_CLOSID 32 + +/* + * Resctrl uses u32 to hold the user-space config. The maximum bitmap size is + * 32. + */ +#define RESCTRL_MAX_CBM 32 + /** * struct pseudo_lock_region - pseudo-lock region information * @s: Resctrl schema for the resource to which this * pseudo-locked region belongs -- 2.25.1
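The MPAM driver can then clamp what the hardware reports against these limits when it fills in a resource. An illustrative sketch only; the MPAM-side names partid_max and cpbm_wd are assumptions here:

        u32 num_closid, cbm_len;

        /* Clamp hardware limits to what resctrl's u32s can represent. */
        num_closid = min_t(u32, partid_max + 1, RESCTRL_MAX_CLOSID);
        cbm_len = min_t(u32, cpbm_wd, RESCTRL_MAX_CBM);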

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Because ARM's MPAM controls are probed using MMIO, resctrl can't be initialised until enough CPUs are online to have determined the system-wide supported num_closid. To allow MPAM to initialise resctrl after __init text has been freed, remove all the __init markings from resctrl. The existing __exit markings cause these functions to be removed by the linker as it has never been possible to build resctrl as a module. MPAM has an error interrupt which causes the driver to reset and disable itself. Remove the __exit markings to allow the MPAM driver to tear down resctrl. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/internal.h | 2 +- arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++-- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 6 +++--- include/linux/resctrl.h | 4 ++-- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 3d40e215b96b..18b79d19cfd8 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -480,7 +480,7 @@ void closid_free(int closid); int alloc_rmid(u32 closid); void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); -void __exit resctrl_mon_resource_exit(void); +void resctrl_mon_resource_exit(void); bool rdt_cpu_has(int flag); void mon_event_count(void *info); int rdtgroup_mondata_show(struct seq_file *m, void *arg); diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 91861a54c4d0..324311bfbdd1 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -955,7 +955,7 @@ static int dom_data_init(struct rdt_resource *r) return err; } -static void __exit dom_data_exit(void) +static void dom_data_exit(void) { mutex_lock(&rdtgroup_mutex); @@ -1075,7 +1075,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) return 0; } -void __exit resctrl_mon_resource_exit(void) +void resctrl_mon_resource_exit(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index c3336a320375..4c62eb3554dd 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -2064,7 +2064,7 @@ static struct rftype *rdtgroup_get_rftype_by_name(const char *name) return NULL; } -static void __init thread_throttle_mode_init(void) +static void thread_throttle_mode_init(void) { struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); struct rftype *rft; @@ -4187,7 +4187,7 @@ void resctrl_offline_cpu(unsigned int cpu) * * Return: 0 on success or -errno */ -int __init resctrl_init(void) +int resctrl_init(void) { int ret = 0; @@ -4241,7 +4241,7 @@ int resctrl_init(void) return ret; } -void __exit resctrl_exit(void) +void resctrl_exit(void) { debugfs_remove_recursive(debugfs_resctrl); unregister_filesystem(&rdt_fs_type); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index a88a43a0380d..f9de90bc1f55 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -400,7 +400,7 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct 
rdt_domain *d); extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit; -int __init resctrl_init(void); -void __exit resctrl_exit(void); +int resctrl_init(void); +void resctrl_exit(void); #endif /* _RESCTRL_H */ -- 2.25.1
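The markings matter because __init and __exit place code in sections the kernel can discard: .init.text is freed once boot completes, so a late MPAM probe calling into such a function would jump into released memory. Approximately what the markers expand to (the real definitions in include/linux/init.h carry additional attributes such as __cold):

#define __init  __attribute__((__section__(".init.text")))
#define __exit  __attribute__((__section__(".exit.text")))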

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Add Makefile and Kconfig for fs/resctrl. Add ARCH_HAS_CPU_RESCTRL for the common parts of the resctrl interface and make X86_CPU_RESCTRL depend on this. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- MAINTAINERS | 1 + arch/Kconfig | 8 ++++++++ arch/x86/Kconfig | 10 +++------- fs/Kconfig | 1 + fs/Makefile | 1 + fs/resctrl/Kconfig | 23 +++++++++++++++++++++++ fs/resctrl/Makefile | 1 + fs/resctrl/ctrlmondata.c | 0 fs/resctrl/internal.h | 0 fs/resctrl/monitor.c | 0 fs/resctrl/psuedo_lock.c | 0 fs/resctrl/rdtgroup.c | 0 include/linux/resctrl.h | 4 ++++ 13 files changed, 42 insertions(+), 7 deletions(-) create mode 100644 fs/resctrl/Kconfig create mode 100644 fs/resctrl/Makefile create mode 100644 fs/resctrl/ctrlmondata.c create mode 100644 fs/resctrl/internal.h create mode 100644 fs/resctrl/monitor.c create mode 100644 fs/resctrl/psuedo_lock.c create mode 100644 fs/resctrl/rdtgroup.c diff --git a/MAINTAINERS b/MAINTAINERS index c47e200a439a..84df3741c961 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18077,6 +18077,7 @@ S: Supported F: Documentation/arch/x86/resctrl* F: arch/x86/include/asm/resctrl.h F: arch/x86/kernel/cpu/resctrl/ +F: fs/resctrl/ F: include/linux/resctrl*.h F: tools/testing/selftests/resctrl/ diff --git a/arch/Kconfig b/arch/Kconfig index 09603e0bc2cc..ba24f2f5eff3 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1333,6 +1333,14 @@ config STRICT_MODULE_RWX config ARCH_HAS_PHYS_TO_DMA bool +config ARCH_HAS_CPU_RESCTRL + bool + help + The 'resctrl' filesystem allows cpu controls of shared resources + such as caches and memory bandwidth to be configured. An architecture + selects this if it provides the arch-specific hooks for the filesystem + and needs the per-task closid/rmid properties. + config HAVE_ARCH_COMPILER_H bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8bfa4a7b2737..4ba29ccfaa9d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -479,8 +479,10 @@ config GOLDFISH config X86_CPU_RESCTRL bool "x86 CPU resource control support" depends on X86 && (CPU_SUP_INTEL || CPU_SUP_AMD) + depends on MISC_FILESYSTEMS select KERNFS - select PROC_CPU_RESCTRL if PROC_FS + select ARCH_HAS_CPU_RESCTRL + select RESCTRL_FS select RESCTRL_FS_PSEUDO_LOCK help Enable x86 CPU resource control support. @@ -498,12 +500,6 @@ config X86_CPU_RESCTRL Say N if unsure. -config RESCTRL_FS_PSEUDO_LOCK - bool - help - Software mechanism to try and pin data in a cache portion using - micro-architecture tricks. 
- if X86_32 config X86_BIGSMP bool "Support for big SMP systems with more than 8 CPUs" diff --git a/fs/Kconfig b/fs/Kconfig index aa7e03cc1941..1e3ed753b9fe 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -325,6 +325,7 @@ source "fs/omfs/Kconfig" source "fs/hpfs/Kconfig" source "fs/qnx4/Kconfig" source "fs/qnx6/Kconfig" +source "fs/resctrl/Kconfig" source "fs/romfs/Kconfig" source "fs/pstore/Kconfig" source "fs/sysv/Kconfig" diff --git a/fs/Makefile b/fs/Makefile index f9541f40be4e..b62375770dee 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -129,3 +129,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-$(CONFIG_RESCTRL_FS) += resctrl/ diff --git a/fs/resctrl/Kconfig b/fs/resctrl/Kconfig new file mode 100644 index 000000000000..5bbf5a1798cd --- /dev/null +++ b/fs/resctrl/Kconfig @@ -0,0 +1,23 @@ +config RESCTRL_FS + bool "CPU Resource Control Filesystem (resctrl)" + depends on ARCH_HAS_CPU_RESCTRL + select KERNFS + select PROC_CPU_RESCTRL if PROC_FS + help + Resctrl is a filesystem interface + to control allocation and + monitoring of system resources + used by the CPUs. + +config RESCTRL_FS_PSEUDO_LOCK + bool + help + Software mechanism to try and pin data in a cache portion using + micro-architecture tricks. + +config RESCTRL_RMID_DEPENDS_ON_CLOSID + bool + help + Enable by the architecture when the RMID values depend on the CLOSID. + This causes the closid allocator to search for CLOSID with clean + RMID. diff --git a/fs/resctrl/Makefile b/fs/resctrl/Makefile new file mode 100644 index 000000000000..4c32e5e914b6 --- /dev/null +++ b/fs/resctrl/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_RESCTRL_FS) += rdtgroup.o ctrlmondata.o monitor.o psuedo_lock.o diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/psuedo_lock.c b/fs/resctrl/psuedo_lock.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index f9de90bc1f55..9066017d4f33 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -8,6 +8,10 @@ #include <linux/pid.h> #include <linux/resctrl_types.h> +#ifdef CONFIG_ARCH_HAS_CPU_RESCTRL +#include <asm/resctrl.h> +#endif + /* CLOSID, RMID value used by the default control group */ #define RESCTRL_RESERVED_CLOSID 0 #define RESCTRL_RESERVED_RMID 0 -- 2.25.1

From: James Morse <james.morse@arm.com> The mbm_cfg_mask field lists the bits that user-space can set when configuring an event. This value is output via the last_cmd_status file. Once the filesystem parts of resctrl are moved to live in /fs/, the struct rdt_hw_resource is inaccessible to the filesystem code. Because this value is output to user-space, it has to be accessible to the filesystem code. Move it to struct rdt_resource. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64 Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64 Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64 Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20250311183715.16445-23-james.morse@arm.com --- arch/x86/kernel/cpu/resctrl/internal.h | 3 --- arch/x86/kernel/cpu/resctrl/monitor.c | 2 +- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 ++--- include/linux/resctrl.h | 3 +++ 4 files changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 18b79d19cfd8..a0cfe860558d 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -346,8 +346,6 @@ static inline bool is_mbm_event(int e) * @msr_update: Function pointer to update QOS MSRs * @mon_scale: cqm counter * mon_scale = occupancy in bytes * @mbm_width: Monitor width, to detect and correct for overflow. - * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth - * Monitoring Event Configuration (BMEC) is supported. 
* @cdp_enabled: CDP state of this resource * * Members of this structure are either private to the architecture @@ -362,7 +360,6 @@ struct rdt_hw_resource { struct rdt_resource *r); unsigned int mon_scale; unsigned int mbm_width; - unsigned int mbm_cfg_mask; bool cdp_enabled; }; diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 324311bfbdd1..356796ea784c 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -1067,7 +1067,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) /* Detect list of bandwidth sources that can be tracked */ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx); - hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS; + r->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS; } r->mon_capable = true; diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 4c62eb3554dd..f69715d5126e 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -1738,7 +1738,6 @@ static void mbm_config_write_domain(struct rdt_resource *r, static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) { - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); char *dom_str = NULL, *id_str; unsigned long dom_id, val; struct rdt_domain *d; @@ -1765,9 +1764,9 @@ static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) } /* Value from user cannot be more than the supported set of events */ - if ((val & hw_res->mbm_cfg_mask) != val) { + if ((val & r->mbm_cfg_mask) != val) { rdt_last_cmd_printf("Invalid event configuration: max valid mask is 0x%02x\n", - hw_res->mbm_cfg_mask); + r->mbm_cfg_mask); return -EINVAL; } diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 9066017d4f33..940f136a30b0 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -194,6 +194,8 @@ struct resctrl_membw { * @default_ctrl: Specifies default cache cbm or memory B/W percent. * @format_str: Per resource format string to show domain value * @evt_list: List of monitoring events + * @mbm_cfg_mask: Bandwidth sources that can be tracked when bandwidth + * monitoring events can be configured. * @fflags: flags to choose base and info files * @cdp_capable: Is the CDP feature available on this resource */ @@ -211,6 +213,7 @@ struct rdt_resource { u32 default_ctrl; const char *format_str; struct list_head evt_list; + unsigned int mbm_cfg_mask; unsigned long fflags; bool cdp_capable; }; -- 2.25.1
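The user-supplied value is rejected if it sets any bit outside mbm_cfg_mask; the test from mon_config_write() can be exercised stand-alone (bit definitions illustrative, in the style of resctrl_types.h):

#include <stdio.h>

#define READS_TO_LOCAL_MEM      (1u << 0)
#define READS_TO_REMOTE_MEM     (1u << 1)       /* assumed second source */

int main(void)
{
        unsigned int mbm_cfg_mask = READS_TO_LOCAL_MEM | READS_TO_REMOTE_MEM;
        unsigned int val = 0x0c;        /* requests two unsupported sources */

        /* The same rejection test mon_config_write() applies. */
        if ((val & mbm_cfg_mask) != val)
                printf("Invalid event configuration: max valid mask is 0x%02x\n",
                       mbm_cfg_mask);

        return 0;
}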

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Once the filesystem parts of resctrl move to fs/resctrl, it cannot rely on definitions in x86's internal.h. Move definitions in internal.h that need to be shared between the filesystem and architecture code to header files that fs/resctrl can include. Doing this separately means the filesystem code only moves between files of the same name, instead of having these changes mixed in too. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 3 +++ arch/x86/kernel/cpu/resctrl/core.c | 5 ++++ arch/x86/kernel/cpu/resctrl/internal.h | 36 -------------------------- include/linux/resctrl.h | 3 +++ include/linux/resctrl_types.h | 30 +++++++++++++++++++++ 5 files changed, 41 insertions(+), 36 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 9940398e367e..37b8d7dfcfed 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -218,6 +218,9 @@ int resctrl_arch_measure_l2_residency(void *_plr); int resctrl_arch_measure_l3_residency(void *_plr); void resctrl_cpu_detect(struct cpuinfo_x86 *c); +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); + #else static inline void resctrl_sched_in(struct task_struct *tsk) {} diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 6176c163af54..607527943199 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -306,6 +306,11 @@ static void rdt_get_cdp_l2_config(void) rdt_get_cdp_config(RDT_RESOURCE_L2); } +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l) +{ + return rdt_resources_all[l].cdp_enabled; +} + static void mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) { diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index a0cfe860558d..3a364ecbeaff 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -15,12 +15,6 @@ #define L2_QOS_CDP_ENABLE 0x01ULL -#define CQM_LIMBOCHECK_INTERVAL 1000 - -#define MBM_CNTR_WIDTH_BASE 24 -#define MBM_OVERFLOW_INTERVAL 1000 -#define MAX_MBA_BW 100u -#define MBA_IS_LINEAR 0x4 #define MBM_CNTR_WIDTH_OFFSET_AMD 20 #define RMID_VAL_ERROR BIT_ULL(63) @@ -211,29 +205,6 @@ struct rdtgroup { struct pseudo_lock_region *plr; }; -/* rdtgroup.flags */ -#define RDT_DELETED 1 - -/* rftype.flags */ -#define RFTYPE_FLAGS_CPUS_LIST 1 - -/* - * Define the file type flags for base and info directories. 
- */ -#define RFTYPE_INFO BIT(0) -#define RFTYPE_BASE BIT(1) -#define RFTYPE_CTRL BIT(4) -#define RFTYPE_MON BIT(5) -#define RFTYPE_TOP BIT(6) -#define RFTYPE_RES_CACHE BIT(8) -#define RFTYPE_RES_MB BIT(9) -#define RFTYPE_DEBUG BIT(10) -#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) -#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) -#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) -#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) -#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON) - /* List of all resource groups */ extern struct list_head rdt_all_groups; @@ -382,13 +353,6 @@ static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) return &hw_res->r_resctrl; } -static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l) -{ - return rdt_resources_all[l].cdp_enabled; -} - -int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); - /* * To return the common struct rdt_resource, which is contained in struct * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource. diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 940f136a30b0..35c3bf2ec599 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -41,6 +41,9 @@ int proc_resctrl_show(struct seq_file *m, */ #define RESCTRL_MAX_CBM 32 +extern unsigned int resctrl_rmid_realloc_limit; +extern unsigned int resctrl_rmid_realloc_threshold; + /** * struct pseudo_lock_region - pseudo-lock region information * @s: Resctrl schema for the resource to which this diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h index 06e0f5567558..44f7a47f986b 100644 --- a/include/linux/resctrl_types.h +++ b/include/linux/resctrl_types.h @@ -7,6 +7,36 @@ #ifndef __LINUX_RESCTRL_TYPES_H #define __LINUX_RESCTRL_TYPES_H +#define CQM_LIMBOCHECK_INTERVAL 1000 + +#define MBM_CNTR_WIDTH_BASE 24 +#define MBM_OVERFLOW_INTERVAL 1000 +#define MAX_MBA_BW 100u +#define MBA_IS_LINEAR 0x4 + +/* rdtgroup.flags */ +#define RDT_DELETED 1 + +/* rftype.flags */ +#define RFTYPE_FLAGS_CPUS_LIST 1 + +/* + * Define the file type flags for base and info directories. + */ +#define RFTYPE_INFO BIT(0) +#define RFTYPE_BASE BIT(1) +#define RFTYPE_CTRL BIT(4) +#define RFTYPE_MON BIT(5) +#define RFTYPE_TOP BIT(6) +#define RFTYPE_RES_CACHE BIT(8) +#define RFTYPE_RES_MB BIT(9) +#define RFTYPE_DEBUG BIT(10) +#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) +#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) +#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) +#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) +#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON) + /* Reads to Local DRAM Memory */ #define READS_TO_LOCAL_MEM BIT(0) -- 2.25.1
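Once shared, both the filesystem and arch code compose these flags the same way; for example, earlier in the series thread_throttle_mode_init() registers its file for the info directory of memory-bandwidth resources with:

        rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB;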

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Move the parts of resctrl that aren't architecture specific to live in /fs/. This lets other architectures implement resctrl too. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 15 - arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 511 --- arch/x86/kernel/cpu/resctrl/internal.h | 279 -- arch/x86/kernel/cpu/resctrl/monitor.c | 823 ---- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1105 ------ arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4284 +-------------------- fs/resctrl/ctrlmondata.c | 533 +++ fs/resctrl/internal.h | 310 ++ fs/resctrl/monitor.c | 845 ++++ fs/resctrl/psuedo_lock.c | 1134 ++++++ fs/resctrl/rdtgroup.c | 4002 +++++++++++++++++++ 11 files changed, 6971 insertions(+), 6870 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 607527943199..2613d617e8bd 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -164,21 +164,6 @@ static inline void cache_alloc_hsw_probe(void) rdt_alloc_capable = true; } -bool is_mba_sc(struct rdt_resource *r) -{ - if (!r) - r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - - /* - * The software controller support is only applicable to MBA resource. - * Make sure to check for resource type. - */ - if (r->rid != RDT_RESOURCE_MBA) - return false; - - return r->membw.mba_sc; -} - /* * rdt_get_mb_table() - get a mapping of bandwidth(b/w) percentage values * exposed to user interface and the h/w understandable delay values. diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index 5953b997c682..c5c3eaea27b6 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -23,266 +23,6 @@ #include "internal.h" -struct rdt_parse_data { - struct rdtgroup *rdtgrp; - char *buf; -}; - -typedef int (ctrlval_parser_t)(struct rdt_parse_data *data, - struct resctrl_schema *s, - struct rdt_domain *d); - -/* - * Check whether MBA bandwidth percentage value is correct. The value is - * checked against the minimum and max bandwidth values specified by the - * hardware. The allocated bandwidth percentage is rounded to the next - * control step available on the hardware. - */ -static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) -{ - int ret; - u32 bw; - - /* - * Only linear delay values is supported for current Intel SKUs. - */ - if (!r->membw.delay_linear && r->membw.arch_needs_linear) { - rdt_last_cmd_puts("No support for non-linear MB domains\n"); - return false; - } - - ret = kstrtou32(buf, 10, &bw); - if (ret) { - rdt_last_cmd_printf("Invalid MB value %s\n", buf); - return false; - } - - /* Nothing else to do if software controller is enabled. 
*/ - if (is_mba_sc(r)) { - *data = bw; - return true; - } - - if (bw < r->membw.min_bw || bw > r->default_ctrl) { - rdt_last_cmd_printf("MB value %u out of range [%d,%d]\n", - bw, r->membw.min_bw, r->default_ctrl); - return false; - } - - *data = roundup(bw, (unsigned long)r->membw.bw_gran); - return true; -} - -static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) -{ - struct resctrl_staged_config *cfg; - u32 closid = data->rdtgrp->closid; - struct rdt_resource *r = s->res; - u32 bw_val; - - cfg = &d->staged_config[s->conf_type]; - if (cfg->have_new_ctrl) { - rdt_last_cmd_printf("Duplicate domain %d\n", d->id); - return -EINVAL; - } - - if (!bw_validate(data->buf, &bw_val, r)) - return -EINVAL; - - if (is_mba_sc(r)) { - d->mbps_val[closid] = bw_val; - return 0; - } - - cfg->new_ctrl = bw_val; - cfg->have_new_ctrl = true; - - return 0; -} - -/* - * Check whether a cache bit mask is valid. - * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID: - * - CPUID.0x10.1:ECX[3]: L3 non-contiguous 1s value supported if 1 - * - CPUID.0x10.2:ECX[3]: L2 non-contiguous 1s value supported if 1 - * - * Haswell does not support a non-contiguous 1s value and additionally - * requires at least two bits set. - * AMD allows non-contiguous bitmasks. - */ -static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) -{ - unsigned long first_bit, zero_bit, val; - unsigned int cbm_len = r->cache.cbm_len; - int ret; - - ret = kstrtoul(buf, 16, &val); - if (ret) { - rdt_last_cmd_printf("Non-hex character in the mask %s\n", buf); - return false; - } - - if ((r->cache.min_cbm_bits > 0 && val == 0) || val > r->default_ctrl) { - rdt_last_cmd_puts("Mask out of range\n"); - return false; - } - - first_bit = find_first_bit(&val, cbm_len); - zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); - - /* Are non-contiguous bitmasks allowed? */ - if (!r->cache.arch_has_sparse_bitmasks && - (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)) { - rdt_last_cmd_printf("The mask %lx has non-consecutive 1-bits\n", val); - return false; - } - - if ((zero_bit - first_bit) < r->cache.min_cbm_bits) { - rdt_last_cmd_printf("Need at least %d bits in the mask\n", - r->cache.min_cbm_bits); - return false; - } - - *data = val; - return true; -} - -/* - * Read one cache bit mask (hex). Check that it is valid for the current - * resource type. - */ -static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) -{ - struct rdtgroup *rdtgrp = data->rdtgrp; - struct resctrl_staged_config *cfg; - struct rdt_resource *r = s->res; - u32 cbm_val; - - cfg = &d->staged_config[s->conf_type]; - if (cfg->have_new_ctrl) { - rdt_last_cmd_printf("Duplicate domain %d\n", d->id); - return -EINVAL; - } - - /* - * Cannot set up more than one pseudo-locked region in a cache - * hierarchy. - */ - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && - rdtgroup_pseudo_locked_in_hierarchy(d)) { - rdt_last_cmd_puts("Pseudo-locked region in hierarchy\n"); - return -EINVAL; - } - - if (!cbm_validate(data->buf, &cbm_val, r)) - return -EINVAL; - - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && - (rdtgrp->mode == RDT_MODE_EXCLUSIVE || - rdtgrp->mode == RDT_MODE_SHAREABLE) && - rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) { - rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n"); - return -EINVAL; - } - - /* - * The CBM may not overlap with the CBM of another closid if - * either is exclusive. 
- */ - if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, true)) { - rdt_last_cmd_puts("Overlaps with exclusive group\n"); - return -EINVAL; - } - - if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, false)) { - if (rdtgrp->mode == RDT_MODE_EXCLUSIVE || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - rdt_last_cmd_puts("Overlaps with other group\n"); - return -EINVAL; - } - } - - cfg->new_ctrl = cbm_val; - cfg->have_new_ctrl = true; - - return 0; -} - -static ctrlval_parser_t *get_parser(struct rdt_resource *res) -{ - if (res->fflags & RFTYPE_RES_CACHE) - return &parse_cbm; - else - return &parse_bw; -} - -/* - * For each domain in this resource we expect to find a series of: - * id=mask - * separated by ";". The "id" is in decimal, and must match one of - * the "id"s for this resource. - */ -static int parse_line(char *line, struct resctrl_schema *s, - struct rdtgroup *rdtgrp) -{ - ctrlval_parser_t *parse_ctrlval = get_parser(s->res); - enum resctrl_conf_type t = s->conf_type; - struct resctrl_staged_config *cfg; - struct rdt_resource *r = s->res; - struct rdt_parse_data data; - char *dom = NULL, *id; - struct rdt_domain *d; - unsigned long dom_id; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && - (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) { - rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n"); - return -EINVAL; - } - -next: - if (!line || line[0] == '\0') - return 0; - dom = strsep(&line, ";"); - id = strsep(&dom, "="); - if (!dom || kstrtoul(id, 10, &dom_id)) { - rdt_last_cmd_puts("Missing '=' or non-numeric domain\n"); - return -EINVAL; - } - dom = strim(dom); - list_for_each_entry(d, &r->domains, list) { - if (d->id == dom_id) { - data.buf = dom; - data.rdtgrp = rdtgrp; - if (parse_ctrlval(&data, s, d)) - return -EINVAL; - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - cfg = &d->staged_config[t]; - /* - * In pseudo-locking setup mode and just - * parsed a valid CBM that should be - * pseudo-locked. Only one locked region per - * resource group and domain so just do - * the required initialization for single - * region and return. - */ - rdtgrp->plr->s = s; - rdtgrp->plr->d = d; - rdtgrp->plr->cbm = cfg->new_ctrl; - d->plr = rdtgrp->plr; - return 0; - } - goto next; - } - } - return -EINVAL; -} - static bool apply_config(struct rdt_hw_domain *hw_dom, struct resctrl_staged_config *cfg, u32 idx, cpumask_var_t cpu_mask) @@ -371,100 +111,6 @@ int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) return 0; } -static int rdtgroup_parse_resource(char *resname, char *tok, - struct rdtgroup *rdtgrp) -{ - struct resctrl_schema *s; - - list_for_each_entry(s, &resctrl_schema_all, list) { - if (!strcmp(resname, s->name) && rdtgrp->closid < s->num_closid) - return parse_line(tok, s, rdtgrp); - } - rdt_last_cmd_printf("Unknown or unsupported resource name '%s'\n", resname); - return -EINVAL; -} - -ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct resctrl_schema *s; - struct rdtgroup *rdtgrp; - struct rdt_resource *r; - char *tok, *resname; - int ret = 0; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - buf[nbytes - 1] = '\0'; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - rdt_last_cmd_clear(); - - /* - * No changes to pseudo-locked region allowed. 
It has to be removed - * and re-created instead. - */ - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - ret = -EINVAL; - rdt_last_cmd_puts("Resource group is pseudo-locked\n"); - goto out; - } - - rdt_staged_configs_clear(); - - while ((tok = strsep(&buf, "\n")) != NULL) { - resname = strim(strsep(&tok, ":")); - if (!tok) { - rdt_last_cmd_puts("Missing ':'\n"); - ret = -EINVAL; - goto out; - } - if (tok[0] == '\0') { - rdt_last_cmd_printf("Missing '%s' value\n", resname); - ret = -EINVAL; - goto out; - } - ret = rdtgroup_parse_resource(resname, tok, rdtgrp); - if (ret) - goto out; - } - - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - - /* - * Writes to mba_sc resources update the software controller, - * not the control MSR. - */ - if (is_mba_sc(r)) - continue; - - ret = resctrl_arch_update_domains(r, rdtgrp->closid); - if (ret) - goto out; - } - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - /* - * If pseudo-locking fails we keep the resource group in - * mode RDT_MODE_PSEUDO_LOCKSETUP with its class of service - * active and updated for just the domain the pseudo-locked - * region was requested for. - */ - ret = rdtgroup_pseudo_lock_create(rdtgrp); - } - -out: - rdt_staged_configs_clear(); - rdtgroup_kn_unlock(of->kn); - return ret ?: nbytes; -} - u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type) { @@ -473,160 +119,3 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, return hw_dom->ctrl_val[idx]; } - -static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid) -{ - struct rdt_resource *r = schema->res; - struct rdt_domain *dom; - bool sep = false; - u32 ctrl_val; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - seq_printf(s, "%*s:", max_name_width, schema->name); - list_for_each_entry(dom, &r->domains, list) { - if (sep) - seq_puts(s, ";"); - - if (is_mba_sc(r)) - ctrl_val = dom->mbps_val[closid]; - else - ctrl_val = resctrl_arch_get_config(r, dom, closid, - schema->conf_type); - - seq_printf(s, r->format_str, dom->id, max_data_width, - ctrl_val); - sep = true; - } - seq_puts(s, "\n"); -} - -int rdtgroup_schemata_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct resctrl_schema *schema; - struct rdtgroup *rdtgrp; - int ret = 0; - u32 closid; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - list_for_each_entry(schema, &resctrl_schema_all, list) { - seq_printf(s, "%s:uninitialized\n", schema->name); - } - } else if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - if (!rdtgrp->plr->d) { - rdt_last_cmd_clear(); - rdt_last_cmd_puts("Cache domain offline\n"); - ret = -ENODEV; - } else { - seq_printf(s, "%s:%d=%x\n", - rdtgrp->plr->s->res->name, - rdtgrp->plr->d->id, - rdtgrp->plr->cbm); - } - } else { - closid = rdtgrp->closid; - list_for_each_entry(schema, &resctrl_schema_all, list) { - if (closid < schema->num_closid) - show_doms(s, schema, closid); - } - } - } else { - ret = -ENOENT; - } - rdtgroup_kn_unlock(of->kn); - return ret; -} - -static int smp_mon_event_count(void *arg) -{ - mon_event_count(arg); - - return 0; -} - -void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, - struct rdt_domain *d, struct rdtgroup *rdtgrp, - int evtid, int first) -{ - int cpu; - - /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - /* - * Setup the parameters to pass 
to mon_event_count() to read the data. - */ - rr->rgrp = rdtgrp; - rr->evtid = evtid; - rr->r = r; - rr->d = d; - rr->val = 0; - rr->first = first; - rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid); - if (IS_ERR(rr->arch_mon_ctx)) { - rr->err = -EINVAL; - return; - } - - cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU); - - /* - * cpumask_any_housekeeping() prefers housekeeping CPUs, but - * are all the CPUs nohz_full? If yes, pick a CPU to IPI. - * MPAM's resctrl_arch_rmid_read() is unable to read the - * counters on some platforms if its called in IRQ context. - */ - if (tick_nohz_full_cpu(cpu)) - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1); - else - smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); - - resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx); -} - -int rdtgroup_mondata_show(struct seq_file *m, void *arg) -{ - struct kernfs_open_file *of = m->private; - u32 resid, evtid, domid; - struct rdtgroup *rdtgrp; - struct rdt_resource *r; - union mon_data_bits md; - struct rdt_domain *d; - struct rmid_read rr; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - ret = -ENOENT; - goto out; - } - - md.priv = of->kn->priv; - resid = md.u.rid; - domid = md.u.domid; - evtid = md.u.evtid; - - r = resctrl_arch_get_resource(resid); - d = resctrl_arch_find_domain(r, domid); - if (IS_ERR_OR_NULL(d)) { - ret = -ENOENT; - goto out; - } - - mon_event_read(&rr, r, d, rdtgrp, evtid, false); - - if (rr.err == -EIO) - seq_puts(m, "Error\n"); - else if (rr.err == -EINVAL) - seq_puts(m, "Unavailable\n"); - else - seq_printf(m, "%llu\n", rr.val); - -out: - rdtgroup_kn_unlock(of->kn); - return ret; -} diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 3a364ecbeaff..23d574ab56fb 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -26,228 +26,6 @@ */ #define MBM_CNTR_WIDTH_OFFSET_MAX (62 - MBM_CNTR_WIDTH_BASE) -/** - * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that - * aren't marked nohz_full - * @mask: The mask to pick a CPU from. - * @exclude_cpu:The CPU to avoid picking. - * - * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping - * CPUs that don't use nohz_full, these are preferred. Pass - * RESCTRL_PICK_ANY_CPU to avoid excluding any CPUs. - * - * When a CPU is excluded, returns >= nr_cpu_ids if no CPUs are available. - */ -static inline unsigned int -cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu) -{ - unsigned int cpu, hk_cpu; - - if (exclude_cpu == RESCTRL_PICK_ANY_CPU) - cpu = cpumask_any(mask); - else - cpu = cpumask_any_but(mask, exclude_cpu); - - /* Only continue if tick_nohz_full_mask has been initialized. */ - if (!tick_nohz_full_enabled()) - return cpu; - - /* If the CPU picked isn't marked nohz_full nothing more needs doing. 
*/ - if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu)) - return cpu; - - /* Try to find a CPU that isn't nohz_full to use in preference */ - hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask); - if (hk_cpu == exclude_cpu) - hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask); - - if (hk_cpu < nr_cpu_ids) - cpu = hk_cpu; - - return cpu; -} - -struct rdt_fs_context { - struct kernfs_fs_context kfc; - bool enable_cdpl2; - bool enable_cdpl3; - bool enable_mba_mbps; - bool enable_debug; -}; - -static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) -{ - struct kernfs_fs_context *kfc = fc->fs_private; - - return container_of(kfc, struct rdt_fs_context, kfc); -} - -/** - * struct mon_evt - Entry in the event list of a resource - * @evtid: event id - * @name: name of the event - * @configurable: true if the event is configurable - * @list: entry in &rdt_resource->evt_list - */ -struct mon_evt { - enum resctrl_event_id evtid; - char *name; - bool configurable; - struct list_head list; -}; - -/** - * union mon_data_bits - Monitoring details for each event file - * @priv: Used to store monitoring event data in @u - * as kernfs private data - * @rid: Resource id associated with the event file - * @evtid: Event id associated with the event file - * @domid: The domain to which the event file belongs - * @u: Name of the bit fields struct - */ -union mon_data_bits { - void *priv; - struct { - unsigned int rid : 10; - enum resctrl_event_id evtid : 8; - unsigned int domid : 14; - } u; -}; - -struct rmid_read { - struct rdtgroup *rgrp; - struct rdt_resource *r; - struct rdt_domain *d; - enum resctrl_event_id evtid; - bool first; - int err; - u64 val; - void *arch_mon_ctx; -}; - -extern struct list_head resctrl_schema_all; -extern bool resctrl_mounted; - -enum rdt_group_type { - RDTCTRL_GROUP = 0, - RDTMON_GROUP, - RDT_NUM_GROUP, -}; - -/** - * enum rdtgrp_mode - Mode of a RDT resource group - * @RDT_MODE_SHAREABLE: This resource group allows sharing of its allocations - * @RDT_MODE_EXCLUSIVE: No sharing of this resource group's allocations allowed - * @RDT_MODE_PSEUDO_LOCKSETUP: Resource group will be used for Pseudo-Locking - * @RDT_MODE_PSEUDO_LOCKED: No sharing of this resource group's allocations - * allowed AND the allocations are Cache Pseudo-Locked - * @RDT_NUM_MODES: Total number of modes - * - * The mode of a resource group enables control over the allowed overlap - * between allocations associated with different resource groups (classes - * of service). User is able to modify the mode of a resource group by - * writing to the "mode" resctrl file associated with the resource group. - * - * The "shareable", "exclusive", and "pseudo-locksetup" modes are set by - * writing the appropriate text to the "mode" file. A resource group enters - * "pseudo-locked" mode after the schemata is written while the resource - * group is in "pseudo-locksetup" mode. - */ -enum rdtgrp_mode { - RDT_MODE_SHAREABLE = 0, - RDT_MODE_EXCLUSIVE, - RDT_MODE_PSEUDO_LOCKSETUP, - RDT_MODE_PSEUDO_LOCKED, - - /* Must be last */ - RDT_NUM_MODES, -}; - -/** - * struct mongroup - store mon group's data in resctrl fs. - * @mon_data_kn: kernfs node for the mon_data directory - * @parent: parent rdtgrp - * @crdtgrp_list: child rdtgroup node list - * @rmid: rmid for this rdtgroup - */ -struct mongroup { - struct kernfs_node *mon_data_kn; - struct rdtgroup *parent; - struct list_head crdtgrp_list; - u32 rmid; -}; - -/** - * struct rdtgroup - store rdtgroup's data in resctrl file system. 
- * @kn: kernfs node - * @rdtgroup_list: linked list for all rdtgroups - * @closid: closid for this rdtgroup - * @cpu_mask: CPUs assigned to this rdtgroup - * @flags: status bits - * @waitcount: how many cpus expect to find this - * group when they acquire rdtgroup_mutex - * @type: indicates type of this rdtgroup - either - * monitor only or ctrl_mon group - * @mon: mongroup related data - * @mode: mode of resource group - * @plr: pseudo-locked region - */ -struct rdtgroup { - struct kernfs_node *kn; - struct list_head rdtgroup_list; - u32 closid; - struct cpumask cpu_mask; - int flags; - atomic_t waitcount; - enum rdt_group_type type; - struct mongroup mon; - enum rdtgrp_mode mode; - struct pseudo_lock_region *plr; -}; - -/* List of all resource groups */ -extern struct list_head rdt_all_groups; - -extern int max_name_width, max_data_width; - -/** - * struct rftype - describe each file in the resctrl file system - * @name: File name - * @mode: Access mode - * @kf_ops: File operations - * @flags: File specific RFTYPE_FLAGS_* flags - * @fflags: File specific RFTYPE_* flags - * @seq_show: Show content of the file - * @write: Write to the file - */ -struct rftype { - char *name; - umode_t mode; - const struct kernfs_ops *kf_ops; - unsigned long flags; - unsigned long fflags; - - int (*seq_show)(struct kernfs_open_file *of, - struct seq_file *sf, void *v); - /* - * write() is the generic write callback which maps directly to - * kernfs write operation and overrides all other operations. - * Maximum write size is determined by ->max_write_len. - */ - ssize_t (*write)(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off); -}; - -/** - * struct mbm_state - status for each MBM counter in each domain - * @prev_bw_bytes: Previous bytes value read for bandwidth calculation - * @prev_bw: The most recent bandwidth in MBps - */ -struct mbm_state { - u64 prev_bw_bytes; - u32 prev_bw; -}; - /** * struct arch_mbm_state - values used to compute resctrl_arch_rmid_read()s * return value. 
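/*
 * Stand-alone sketch, not part of the patch: how the union mon_data_bits
 * shown above packs a monitoring file's identity (resource, domain and
 * event ids) into the single pointer-sized kernfs private field, and how
 * rdtgroup_mondata_show() can unpack it again later. The ids in main()
 * are made-up values for the demonstration; the enum bit-field follows
 * the kernel's own declaration (a GNU C extension).
 */
#include <assert.h>

enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID		= 0x01,
	QOS_L3_MBM_TOTAL_EVENT_ID	= 0x02,
	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
};

union mon_data_bits {
	void *priv;
	struct {
		unsigned int rid		: 10;
		enum resctrl_event_id evtid	: 8;
		unsigned int domid		: 14;
	} u;
};

int main(void)
{
	union mon_data_bits md = { 0 };

	/* Encode: fill the bitfields, then hand md.priv to kernfs. */
	md.u.rid = 2;		/* resource id */
	md.u.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
	md.u.domid = 1;		/* domain id */

	/* Decode: read the ids back from what kernfs stored in kn->priv. */
	union mon_data_bits rd = { .priv = md.priv };

	assert(rd.u.rid == 2);
	assert(rd.u.evtid == QOS_L3_MBM_LOCAL_EVENT_ID);
	assert(rd.u.domid == 1);
	return 0;
}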
@@ -339,11 +117,7 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r return container_of(r, struct rdt_hw_resource, r_resctrl); } -extern struct mutex rdtgroup_mutex; - extern struct rdt_hw_resource rdt_resources_all[]; -extern struct rdtgroup rdtgroup_default; -extern struct dentry *debugfs_resctrl; static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) { @@ -407,63 +181,10 @@ union cpuid_0x10_x_edx { unsigned int full; }; -void rdt_last_cmd_clear(void); -void rdt_last_cmd_puts(const char *s); -__printf(1, 2) -void rdt_last_cmd_printf(const char *fmt, ...); - void rdt_ctrl_update(void *arg); -struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); -void rdtgroup_kn_unlock(struct kernfs_node *kn); -int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); -int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, - umode_t mask); -ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off); -int rdtgroup_schemata_show(struct kernfs_open_file *of, - struct seq_file *s, void *v); -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, - unsigned long cbm, int closid, bool exclusive); -unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d, - unsigned long cbm); -enum rdtgrp_mode rdtgroup_mode_by_closid(int closid); -int rdtgroup_tasks_assigned(struct rdtgroup *r); -int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp); -int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp); -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm); -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d); -int rdt_pseudo_lock_init(void); -void rdt_pseudo_lock_release(void); -int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); -void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); -int closids_supported(void); -void closid_free(int closid); -int alloc_rmid(u32 closid); -void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); -void resctrl_mon_resource_exit(void); bool rdt_cpu_has(int flag); -void mon_event_count(void *info); -int rdtgroup_mondata_show(struct seq_file *m, void *arg); -void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, - struct rdt_domain *d, struct rdtgroup *rdtgrp, - int evtid, int first); -int resctrl_mon_resource_init(void); -void mbm_setup_overflow_handler(struct rdt_domain *dom, - unsigned long delay_ms, - int exclude_cpu); -void mbm_handle_overflow(struct work_struct *work); void __init intel_rdt_mbm_apply_quirk(void); -bool is_mba_sc(struct rdt_resource *r); -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, - int exclude_cpu); -void cqm_handle_limbo(struct work_struct *work); -bool has_busy_rmid(struct rdt_domain *d); -void __check_limbo(struct rdt_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); -void mbm_config_rftype_init(const char *config); -void rdt_staged_configs_clear(void); -bool closid_allocated(unsigned int closid); -int resctrl_find_cleanest_closid(void); #endif /* _ASM_X86_RESCTRL_INTERNAL_H */ diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 356796ea784c..02fb9d87479a 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -25,53 +25,6 @@ #include "internal.h" -/** - * struct rmid_entry - dirty tracking for all RMID. - * @closid: The CLOSID for this entry. 
- * @rmid: The RMID for this entry. - * @busy: The number of domains with cached data using this RMID. - * @list: Member of the rmid_free_lru list when busy == 0. - * - * Depending on the architecture the correct monitor is accessed using - * both @closid and @rmid, or @rmid only. - * - * Take the rdtgroup_mutex when accessing. - */ -struct rmid_entry { - u32 closid; - u32 rmid; - int busy; - struct list_head list; -}; - -/* - * @rmid_free_lru - A least recently used list of free RMIDs - * These RMIDs are guaranteed to have an occupancy less than the - * threshold occupancy - */ -static LIST_HEAD(rmid_free_lru); - -/* - * @closid_num_dirty_rmid The number of dirty RMID each CLOSID has. - * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. - * Indexed by CLOSID. Protected by rdtgroup_mutex. - */ -static u32 *closid_num_dirty_rmid; - -/* - * @rmid_limbo_count - count of currently unused but (potentially) - * dirty RMIDs. - * This counts RMIDs that no one is currently using but that - * may have a occupancy value > resctrl_rmid_realloc_threshold. User can - * change the threshold occupancy value. - */ -static unsigned int rmid_limbo_count; - -/* - * @rmid_entry - The entry in the limbo and free lists. - */ -static struct rmid_entry *rmid_ptrs; - /* * Global boolean for rdt_monitor which is true if any * resource monitoring is enabled. @@ -83,17 +36,6 @@ bool rdt_mon_capable; */ unsigned int rdt_mon_features; -/* - * This is the threshold cache occupancy in bytes at which we will consider an - * RMID available for re-allocation. - */ -unsigned int resctrl_rmid_realloc_threshold; - -/* - * This is the maximum value for the reallocation threshold, in bytes. - */ -unsigned int resctrl_rmid_realloc_limit; - #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5)) /* @@ -157,33 +99,6 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val) return val; } -/* - * x86 and arm64 differ in their handling of monitoring. - * x86's RMID are independent numbers, there is only one source of traffic - * with an RMID value of '1'. - * arm64's PMG extends the PARTID/CLOSID space, there are multiple sources of - * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID - * value is no longer unique. - * To account for this, resctrl uses an index. On x86 this is just the RMID, - * on arm64 it encodes the CLOSID and RMID. This gives a unique number. - * - * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code - * must accept an attempt to read every index. - */ -static inline struct rmid_entry *__rmid_entry(u32 idx) -{ - struct rmid_entry *entry; - u32 closid, rmid; - - entry = &rmid_ptrs[idx]; - resctrl_arch_rmid_idx_decode(idx, &closid, &rmid); - - WARN_ON_ONCE(entry->closid != closid); - WARN_ON_ONCE(entry->rmid != rmid); - - return entry; -} - static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) { u64 msr_val; @@ -302,734 +217,6 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, return 0; } -static void limbo_release_entry(struct rmid_entry *entry) -{ - lockdep_assert_held(&rdtgroup_mutex); - - rmid_limbo_count--; - list_add_tail(&entry->list, &rmid_free_lru); - - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) - closid_num_dirty_rmid[entry->closid]--; -} - -/* - * Check the RMIDs that are marked as busy for this domain. If the - * reported LLC occupancy is below the threshold clear the busy bit and - * decrement the count. 
If the busy count gets to zero on an RMID, we - * free the RMID - */ -void __check_limbo(struct rdt_domain *d, bool force_free) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - u32 idx_limit = resctrl_arch_system_num_rmid_idx(); - struct rmid_entry *entry; - u32 idx, cur_idx = 1; - void *arch_mon_ctx; - bool rmid_dirty; - u64 val = 0; - - arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID); - if (IS_ERR(arch_mon_ctx)) { - pr_warn_ratelimited("Failed to allocate monitor context: %ld", - PTR_ERR(arch_mon_ctx)); - return; - } - - /* - * Skip RMID 0 and start from RMID 1 and check all the RMIDs that - * are marked as busy for occupancy < threshold. If the occupancy - * is less than the threshold decrement the busy counter of the - * RMID and move it to the free list when the counter reaches 0. - */ - for (;;) { - idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx); - if (idx >= idx_limit) - break; - - entry = __rmid_entry(idx); - if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, - QOS_L3_OCCUP_EVENT_ID, &val, - arch_mon_ctx)) { - rmid_dirty = true; - } else { - rmid_dirty = (val >= resctrl_rmid_realloc_threshold); - } - - if (force_free || !rmid_dirty) { - clear_bit(idx, d->rmid_busy_llc); - if (!--entry->busy) - limbo_release_entry(entry); - } - cur_idx = idx + 1; - } - - resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx); -} - -bool has_busy_rmid(struct rdt_domain *d) -{ - u32 idx_limit = resctrl_arch_system_num_rmid_idx(); - - return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; -} - -static struct rmid_entry *resctrl_find_free_rmid(u32 closid) -{ - struct rmid_entry *itr; - u32 itr_idx, cmp_idx; - - if (list_empty(&rmid_free_lru)) - return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC); - - list_for_each_entry(itr, &rmid_free_lru, list) { - /* - * Get the index of this free RMID, and the index it would need - * to be if it were used with this CLOSID. - * If the CLOSID is irrelevant on this architecture, the two - * index values are always the same on every entry and thus the - * very first entry will be returned. - */ - itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid); - cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid); - - if (itr_idx == cmp_idx) - return itr; - } - - return ERR_PTR(-ENOSPC); -} - -/** - * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated - * RMID are clean, or the CLOSID that has - * the most clean RMID. - * - * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID - * may not be able to allocate clean RMID. To avoid this the allocator will - * choose the CLOSID with the most clean RMID. - * - * When the CLOSID and RMID are independent numbers, the first free CLOSID will - * be returned. - */ -int resctrl_find_cleanest_closid(void) -{ - u32 cleanest_closid = ~0; - int i = 0; - - lockdep_assert_held(&rdtgroup_mutex); - - if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) - return -EIO; - - for (i = 0; i < closids_supported(); i++) { - int num_dirty; - - if (closid_allocated(i)) - continue; - - num_dirty = closid_num_dirty_rmid[i]; - if (num_dirty == 0) - return i; - - if (cleanest_closid == ~0) - cleanest_closid = i; - - if (num_dirty < closid_num_dirty_rmid[cleanest_closid]) - cleanest_closid = i; - } - - if (cleanest_closid == ~0) - return -ENOSPC; - - return cleanest_closid; -} - -/* - * For MPAM the RMID value is not unique, and has to be considered with - * the CLOSID. 
The (CLOSID, RMID) pair is allocated on all domains, which - * allows all domains to be managed by a single free list. - * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler. - */ -int alloc_rmid(u32 closid) -{ - struct rmid_entry *entry; - - lockdep_assert_held(&rdtgroup_mutex); - - entry = resctrl_find_free_rmid(closid); - if (IS_ERR(entry)) - return PTR_ERR(entry); - - list_del(&entry->list); - return entry->rmid; -} - -static void add_rmid_to_limbo(struct rmid_entry *entry) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - struct rdt_domain *d; - u32 idx; - - lockdep_assert_held(&rdtgroup_mutex); - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid); - - entry->busy = 0; - list_for_each_entry(d, &r->domains, list) { - /* - * For the first limbo RMID in the domain, - * setup up the limbo worker. - */ - if (!has_busy_rmid(d)) - cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, - RESCTRL_PICK_ANY_CPU); - set_bit(idx, d->rmid_busy_llc); - entry->busy++; - } - - rmid_limbo_count++; - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) - closid_num_dirty_rmid[entry->closid]++; -} - -void free_rmid(u32 closid, u32 rmid) -{ - u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); - struct rmid_entry *entry; - - lockdep_assert_held(&rdtgroup_mutex); - - /* - * Do not allow the default rmid to be free'd. Comparing by index - * allows architectures that ignore the closid parameter to avoid an - * unnecessary check. - */ - if (!resctrl_arch_mon_capable() || - idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, - RESCTRL_RESERVED_RMID)) - return; - - entry = __rmid_entry(idx); - - if (resctrl_arch_is_llc_occupancy_enabled()) - add_rmid_to_limbo(entry); - else - list_add_tail(&entry->list, &rmid_free_lru); -} - -static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid, - u32 rmid, enum resctrl_event_id evtid) -{ - u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); - - switch (evtid) { - case QOS_L3_MBM_TOTAL_EVENT_ID: - return &d->mbm_total[idx]; - case QOS_L3_MBM_LOCAL_EVENT_ID: - return &d->mbm_local[idx]; - default: - return NULL; - } -} - -static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) -{ - struct mbm_state *m; - u64 tval = 0; - - if (rr->first) { - resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid); - m = get_mbm_state(rr->d, closid, rmid, rr->evtid); - if (m) - memset(m, 0, sizeof(struct mbm_state)); - return 0; - } - - rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid, - &tval, rr->arch_mon_ctx); - if (rr->err) - return rr->err; - - rr->val += tval; - - return 0; -} - -/* - * mbm_bw_count() - Update bw count from values previously read by - * __mon_event_count(). - * @closid: The closid used to identify the cached mbm_state. - * @rmid: The rmid used to identify the cached mbm_state. - * @rr: The struct rmid_read populated by __mon_event_count(). - * - * Supporting function to calculate the memory bandwidth - * and delta bandwidth in MBps. The chunks value previously read by - * __mon_event_count() is compared with the chunks value from the previous - * invocation. This must be called once per second to maintain values in MBps. 
- */ -static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) -{ - u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); - struct mbm_state *m = &rr->d->mbm_local[idx]; - u64 cur_bw, bytes, cur_bytes; - - cur_bytes = rr->val; - bytes = cur_bytes - m->prev_bw_bytes; - m->prev_bw_bytes = cur_bytes; - - cur_bw = bytes / SZ_1M; - - m->prev_bw = cur_bw; -} - -/* - * This is scheduled by mon_event_read() to read the CQM/MBM counters - * on a domain. - */ -void mon_event_count(void *info) -{ - struct rdtgroup *rdtgrp, *entry; - struct rmid_read *rr = info; - struct list_head *head; - int ret; - - rdtgrp = rr->rgrp; - - ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr); - - /* - * For Ctrl groups read data from child monitor groups and - * add them together. Count events which are read successfully. - * Discard the rmid_read's reporting errors. - */ - head = &rdtgrp->mon.crdtgrp_list; - - if (rdtgrp->type == RDTCTRL_GROUP) { - list_for_each_entry(entry, head, mon.crdtgrp_list) { - if (__mon_event_count(entry->closid, entry->mon.rmid, - rr) == 0) - ret = 0; - } - } - - /* - * __mon_event_count() calls for newly created monitor groups may - * report -EINVAL/Unavailable if the monitor hasn't seen any traffic. - * Discard error if any of the monitor event reads succeeded. - */ - if (ret == 0) - rr->err = 0; -} - -/* - * Feedback loop for MBA software controller (mba_sc) - * - * mba_sc is a feedback loop where we periodically read MBM counters and - * adjust the bandwidth percentage values via the IA32_MBA_THRTL_MSRs so - * that: - * - * current bandwidth(cur_bw) < user specified bandwidth(user_bw) - * - * This uses the MBM counters to measure the bandwidth and MBA throttle - * MSRs to control the bandwidth for a particular rdtgrp. It builds on the - * fact that resctrl rdtgroups have both monitoring and control. - * - * The frequency of the checks is 1s and we just tag along the MBM overflow - * timer. Having 1s interval makes the calculation of bandwidth simpler. - * - * Although MBA's goal is to restrict the bandwidth to a maximum, there may - * be a need to increase the bandwidth to avoid unnecessarily restricting - * the L2 <-> L3 traffic. - * - * Since MBA controls the L2 external bandwidth where as MBM measures the - * L3 external bandwidth the following sequence could lead to such a - * situation. - * - * Consider an rdtgroup which had high L3 <-> memory traffic in initial - * phases -> mba_sc kicks in and reduced bandwidth percentage values -> but - * after some time rdtgroup has mostly L2 <-> L3 traffic. - * - * In this case we may restrict the rdtgroup's L2 <-> L3 traffic as its - * throttle MSRs already have low percentage values. To avoid - * unnecessarily restricting such rdtgroups, we also increase the bandwidth. 
- */ -static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) -{ - u32 closid, rmid, cur_msr_val, new_msr_val; - struct mbm_state *pmbm_data, *cmbm_data; - struct rdt_resource *r_mba; - struct rdt_domain *dom_mba; - u32 cur_bw, user_bw, idx; - struct list_head *head; - struct rdtgroup *entry; - - if (!resctrl_arch_is_mbm_local_enabled()) - return; - - r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - - closid = rgrp->closid; - rmid = rgrp->mon.rmid; - idx = resctrl_arch_rmid_idx_encode(closid, rmid); - pmbm_data = &dom_mbm->mbm_local[idx]; - - dom_mba = resctrl_get_domain_from_cpu(smp_processor_id(), r_mba); - if (!dom_mba) { - pr_warn_once("Failure to get domain for MBA update\n"); - return; - } - - cur_bw = pmbm_data->prev_bw; - user_bw = dom_mba->mbps_val[closid]; - - /* MBA resource doesn't support CDP */ - cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE); - - /* - * For Ctrl groups read data from child monitor groups. - */ - head = &rgrp->mon.crdtgrp_list; - list_for_each_entry(entry, head, mon.crdtgrp_list) { - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid]; - cur_bw += cmbm_data->prev_bw; - } - - /* - * Scale up/down the bandwidth linearly for the ctrl group. The - * bandwidth step is the bandwidth granularity specified by the - * hardware. - * Always increase throttling if current bandwidth is above the - * target set by user. - * But avoid thrashing up and down on every poll by checking - * whether a decrease in throttling is likely to push the group - * back over target. E.g. if currently throttling to 30% of bandwidth - * on a system with 10% granularity steps, check whether moving to - * 40% would go past the limit by multiplying current bandwidth by - * "(30 + 10) / 30". - */ - if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) { - new_msr_val = cur_msr_val - r_mba->membw.bw_gran; - } else if (cur_msr_val < MAX_MBA_BW && - (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) { - new_msr_val = cur_msr_val + r_mba->membw.bw_gran; - } else { - return; - } - - resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val); -} - -static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, - u32 closid, u32 rmid) -{ - struct rmid_read rr; - - rr.first = false; - rr.r = r; - rr.d = d; - - /* - * This is protected from concurrent reads from user - * as both the user and we hold the global mutex. - */ - if (resctrl_arch_is_mbm_total_enabled()) { - rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; - rr.val = 0; - rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); - if (IS_ERR(rr.arch_mon_ctx)) { - pr_warn_ratelimited("Failed to allocate monitor context: %ld", - PTR_ERR(rr.arch_mon_ctx)); - return; - } - - __mon_event_count(closid, rmid, &rr); - - resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); - } - if (resctrl_arch_is_mbm_local_enabled()) { - rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; - rr.val = 0; - rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); - if (IS_ERR(rr.arch_mon_ctx)) { - pr_warn_ratelimited("Failed to allocate monitor context: %ld", - PTR_ERR(rr.arch_mon_ctx)); - return; - } - - __mon_event_count(closid, rmid, &rr); - - /* - * Call the MBA software controller only for the - * control groups and when user has enabled - * the software controller explicitly. 
- */ - if (is_mba_sc(NULL)) - mbm_bw_count(closid, rmid, &rr); - - resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); - } -} - -/* - * Handler to scan the limbo list and move the RMIDs - * to free list whose occupancy < threshold_occupancy. - */ -void cqm_handle_limbo(struct work_struct *work) -{ - unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); - struct rdt_domain *d; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - d = container_of(work, struct rdt_domain, cqm_limbo.work); - - __check_limbo(d, false); - - if (has_busy_rmid(d)) { - d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, - RESCTRL_PICK_ANY_CPU); - schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, - delay); - } - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); -} - -/** - * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this - * domain. - * @dom: The domain the limbo handler should run for. - * @delay_ms: How far in the future the handler should run. - * @exclude_cpu: Which CPU the handler should not run on, - * RESCTRL_PICK_ANY_CPU to pick any CPU. - */ -void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, - int exclude_cpu) -{ - unsigned long delay = msecs_to_jiffies(delay_ms); - int cpu; - - cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); - dom->cqm_work_cpu = cpu; - - if (cpu < nr_cpu_ids) - schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); -} - -void mbm_handle_overflow(struct work_struct *work) -{ - unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); - struct rdtgroup *prgrp, *crgrp; - struct list_head *head; - struct rdt_resource *r; - struct rdt_domain *d; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - /* - * If the filesystem has been unmounted this work no longer needs to - * run. - */ - if (!resctrl_mounted || !resctrl_arch_mon_capable()) - goto out_unlock; - - r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - d = container_of(work, struct rdt_domain, mbm_over.work); - - list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { - mbm_update(r, d, prgrp->closid, prgrp->mon.rmid); - - head = &prgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) - mbm_update(r, d, crgrp->closid, crgrp->mon.rmid); - - if (is_mba_sc(NULL)) - update_mba_bw(prgrp, d); - } - - /* - * Re-check for housekeeping CPUs. This allows the overflow handler to - * move off a nohz_full CPU quickly. - */ - d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, - RESCTRL_PICK_ANY_CPU); - schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay); - -out_unlock: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); -} - -/** - * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this - * domain. - * @dom: The domain the overflow handler should run for. - * @delay_ms: How far in the future the handler should run. - * @exclude_cpu: Which CPU the handler should not run on, - * RESCTRL_PICK_ANY_CPU to pick any CPU. - */ -void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, - int exclude_cpu) -{ - unsigned long delay = msecs_to_jiffies(delay_ms); - int cpu; - - /* - * When a domain comes online there is no guarantee the filesystem is - * mounted. If not, there is no need to catch counter overflow. 
- */ - if (!resctrl_mounted || !resctrl_arch_mon_capable()) - return; - cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); - dom->mbm_work_cpu = cpu; - - if (cpu < nr_cpu_ids) - schedule_delayed_work_on(cpu, &dom->mbm_over, delay); -} - -static int dom_data_init(struct rdt_resource *r) -{ - u32 idx_limit = resctrl_arch_system_num_rmid_idx(); - u32 num_closid = resctrl_arch_get_num_closid(r); - struct rmid_entry *entry = NULL; - int err = 0, i; - u32 idx; - - mutex_lock(&rdtgroup_mutex); - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - u32 *tmp; - - /* - * If the architecture hasn't provided a sanitised value here, - * this may result in larger arrays than necessary. Resctrl will - * use a smaller system wide value based on the resources in - * use. - */ - tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL); - if (!tmp) { - err = -ENOMEM; - goto out_unlock; - } - - closid_num_dirty_rmid = tmp; - } - - rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); - if (!rmid_ptrs) { - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - kfree(closid_num_dirty_rmid); - closid_num_dirty_rmid = NULL; - } - err = -ENOMEM; - goto out_unlock; - } - - for (i = 0; i < idx_limit; i++) { - entry = &rmid_ptrs[i]; - INIT_LIST_HEAD(&entry->list); - - resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid); - list_add_tail(&entry->list, &rmid_free_lru); - } - - /* - * RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and - * are always allocated. These are used for the rdtgroup_default - * control group, which will be setup later in rdtgroup_init(). - */ - idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, - RESCTRL_RESERVED_RMID); - entry = __rmid_entry(idx); - list_del(&entry->list); - -out_unlock: - mutex_unlock(&rdtgroup_mutex); - - return err; -} - -static void dom_data_exit(void) -{ - mutex_lock(&rdtgroup_mutex); - - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - kfree(closid_num_dirty_rmid); - closid_num_dirty_rmid = NULL; - } - - kfree(rmid_ptrs); - rmid_ptrs = NULL; - - mutex_unlock(&rdtgroup_mutex); -} - -static struct mon_evt llc_occupancy_event = { - .name = "llc_occupancy", - .evtid = QOS_L3_OCCUP_EVENT_ID, -}; - -static struct mon_evt mbm_total_event = { - .name = "mbm_total_bytes", - .evtid = QOS_L3_MBM_TOTAL_EVENT_ID, -}; - -static struct mon_evt mbm_local_event = { - .name = "mbm_local_bytes", - .evtid = QOS_L3_MBM_LOCAL_EVENT_ID, -}; - -/* - * Initialize the event list for the resource. - * - * Note that MBM events are also part of RDT_RESOURCE_L3 resource - * because as per the SDM the total and local memory bandwidth - * are enumerated as part of L3 monitoring. 
- */ -static void l3_mon_evt_init(struct rdt_resource *r) -{ - INIT_LIST_HEAD(&r->evt_list); - - if (resctrl_arch_is_llc_occupancy_enabled()) - list_add_tail(&llc_occupancy_event.list, &r->evt_list); - if (resctrl_arch_is_mbm_total_enabled()) - list_add_tail(&mbm_total_event.list, &r->evt_list); - if (resctrl_arch_is_mbm_local_enabled()) - list_add_tail(&mbm_local_event.list, &r->evt_list); -} - -int resctrl_mon_resource_init(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - int ret; - - ret = dom_data_init(r); - if (ret) - return ret; - - if (!r->mon_capable) - return 0; - - l3_mon_evt_init(r); - - if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { - mbm_total_event.configurable = true; - mbm_config_rftype_init("mbm_total_bytes_config"); - } - if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { - mbm_local_event.configurable = true; - mbm_config_rftype_init("mbm_local_bytes_config"); - } - - return 0; -} - int __init rdt_get_mon_l3_config(struct rdt_resource *r) { unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset; @@ -1075,16 +262,6 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r) return 0; } -void resctrl_mon_resource_exit(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); - - if (!r->mon_capable) - return; - - dom_data_exit(); -} - void __init intel_rdt_mbm_apply_quirk(void) { int cf_index; diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index 85536566d7f5..ba1596afee10 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -39,28 +39,6 @@ */ static u64 prefetch_disable_bits; -/* - * Major number assigned to and shared by all devices exposing - * pseudo-locked regions. - */ -static unsigned int pseudo_lock_major; -static unsigned long pseudo_lock_minor_avail = GENMASK(MINORBITS, 0); - -static char *pseudo_lock_devnode(const struct device *dev, umode_t *mode) -{ - const struct rdtgroup *rdtgrp; - - rdtgrp = dev_get_drvdata(dev); - if (mode) - *mode = 0600; - return kasprintf(GFP_KERNEL, "pseudo_lock/%s", rdtgrp->kn->name); -} - -static const struct class pseudo_lock_class = { - .name = "pseudo_lock", - .devnode = pseudo_lock_devnode, -}; - /** * resctrl_arch_get_prefetch_disable_bits - prefetch disable bits of supported * platforms @@ -121,299 +99,6 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) return prefetch_disable_bits; } -/** - * pseudo_lock_minor_get - Obtain available minor number - * @minor: Pointer to where new minor number will be stored - * - * A bitmask is used to track available minor numbers. Here the next free - * minor number is marked as unavailable and returned. - * - * Return: 0 on success, <0 on failure. 
- */ -static int pseudo_lock_minor_get(unsigned int *minor) -{ - unsigned long first_bit; - - first_bit = find_first_bit(&pseudo_lock_minor_avail, MINORBITS); - - if (first_bit == MINORBITS) - return -ENOSPC; - - __clear_bit(first_bit, &pseudo_lock_minor_avail); - *minor = first_bit; - - return 0; -} - -/** - * pseudo_lock_minor_release - Return minor number to available - * @minor: The minor number made available - */ -static void pseudo_lock_minor_release(unsigned int minor) -{ - __set_bit(minor, &pseudo_lock_minor_avail); -} - -/** - * region_find_by_minor - Locate a pseudo-lock region by inode minor number - * @minor: The minor number of the device representing pseudo-locked region - * - * When the character device is accessed we need to determine which - * pseudo-locked region it belongs to. This is done by matching the minor - * number of the device to the pseudo-locked region it belongs. - * - * Minor numbers are assigned at the time a pseudo-locked region is associated - * with a cache instance. - * - * Return: On success return pointer to resource group owning the pseudo-locked - * region, NULL on failure. - */ -static struct rdtgroup *region_find_by_minor(unsigned int minor) -{ - struct rdtgroup *rdtgrp, *rdtgrp_match = NULL; - - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { - if (rdtgrp->plr && rdtgrp->plr->minor == minor) { - rdtgrp_match = rdtgrp; - break; - } - } - return rdtgrp_match; -} - -/** - * struct pseudo_lock_pm_req - A power management QoS request list entry - * @list: Entry within the @pm_reqs list for a pseudo-locked region - * @req: PM QoS request - */ -struct pseudo_lock_pm_req { - struct list_head list; - struct dev_pm_qos_request req; -}; - -static void pseudo_lock_cstates_relax(struct pseudo_lock_region *plr) -{ - struct pseudo_lock_pm_req *pm_req, *next; - - list_for_each_entry_safe(pm_req, next, &plr->pm_reqs, list) { - dev_pm_qos_remove_request(&pm_req->req); - list_del(&pm_req->list); - kfree(pm_req); - } -} - -/** - * pseudo_lock_cstates_constrain - Restrict cores from entering C6 - * @plr: Pseudo-locked region - * - * To prevent the cache from being affected by power management entering - * C6 has to be avoided. This is accomplished by requesting a latency - * requirement lower than lowest C6 exit latency of all supported - * platforms as found in the cpuidle state tables in the intel_idle driver. - * At this time it is possible to do so with a single latency requirement - * for all supported platforms. - * - * Since Goldmont is supported, which is affected by X86_BUG_MONITOR, - * the ACPI latencies need to be considered while keeping in mind that C2 - * may be set to map to deeper sleep states. In this case the latency - * requirement needs to prevent entering C2 also. 
- * - * Return: 0 on success, <0 on failure - */ -static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr) -{ - struct pseudo_lock_pm_req *pm_req; - int cpu; - int ret; - - for_each_cpu(cpu, &plr->d->cpu_mask) { - pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL); - if (!pm_req) { - rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n"); - ret = -ENOMEM; - goto out_err; - } - ret = dev_pm_qos_add_request(get_cpu_device(cpu), - &pm_req->req, - DEV_PM_QOS_RESUME_LATENCY, - 30); - if (ret < 0) { - rdt_last_cmd_printf("Failed to add latency req CPU%d\n", - cpu); - kfree(pm_req); - ret = -1; - goto out_err; - } - list_add(&pm_req->list, &plr->pm_reqs); - } - - return 0; - -out_err: - pseudo_lock_cstates_relax(plr); - return ret; -} - -/** - * pseudo_lock_region_clear - Reset pseudo-lock region data - * @plr: pseudo-lock region - * - * All content of the pseudo-locked region is reset - any memory allocated - * freed. - * - * Return: void - */ -static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) -{ - plr->size = 0; - plr->line_size = 0; - kfree(plr->kmem); - plr->kmem = NULL; - plr->s = NULL; - if (plr->d) - plr->d->plr = NULL; - plr->d = NULL; - plr->cbm = 0; - plr->debugfs_dir = NULL; -} - -/** - * pseudo_lock_region_init - Initialize pseudo-lock region information - * @plr: pseudo-lock region - * - * Called after user provided a schemata to be pseudo-locked. From the - * schemata the &struct pseudo_lock_region is on entry already initialized - * with the resource, domain, and capacity bitmask. Here the information - * required for pseudo-locking is deduced from this data and &struct - * pseudo_lock_region initialized further. This information includes: - * - size in bytes of the region to be pseudo-locked - * - cache line size to know the stride with which data needs to be accessed - * to be pseudo-locked - * - a cpu associated with the cache instance on which the pseudo-locking - * flow can be executed - * - * Return: 0 on success, <0 on failure. Descriptive error will be written - * to last_cmd_status buffer. - */ -static int pseudo_lock_region_init(struct pseudo_lock_region *plr) -{ - struct cpu_cacheinfo *ci; - int ret; - int i; - - /* Pick the first cpu we find that is associated with the cache. */ - plr->cpu = cpumask_first(&plr->d->cpu_mask); - - if (!cpu_online(plr->cpu)) { - rdt_last_cmd_printf("CPU %u associated with cache not online\n", - plr->cpu); - ret = -ENODEV; - goto out_region; - } - - ci = get_cpu_cacheinfo(plr->cpu); - - plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); - - for (i = 0; i < ci->num_leaves; i++) { - if (ci->info_list[i].level == plr->s->res->cache_level) { - plr->line_size = ci->info_list[i].coherency_line_size; - return 0; - } - } - - ret = -1; - rdt_last_cmd_puts("Unable to determine cache line size\n"); -out_region: - pseudo_lock_region_clear(plr); - return ret; -} - -/** - * pseudo_lock_init - Initialize a pseudo-lock region - * @rdtgrp: resource group to which new pseudo-locked region will belong - * - * A pseudo-locked region is associated with a resource group. When this - * association is created the pseudo-locked region is initialized. The - * details of the pseudo-locked region are not known at this time so only - * allocation is done and association established. 
- * - * Return: 0 on success, <0 on failure - */ -static int pseudo_lock_init(struct rdtgroup *rdtgrp) -{ - struct pseudo_lock_region *plr; - - plr = kzalloc(sizeof(*plr), GFP_KERNEL); - if (!plr) - return -ENOMEM; - - init_waitqueue_head(&plr->lock_thread_wq); - INIT_LIST_HEAD(&plr->pm_reqs); - rdtgrp->plr = plr; - return 0; -} - -/** - * pseudo_lock_region_alloc - Allocate kernel memory that will be pseudo-locked - * @plr: pseudo-lock region - * - * Initialize the details required to set up the pseudo-locked region and - * allocate the contiguous memory that will be pseudo-locked to the cache. - * - * Return: 0 on success, <0 on failure. Descriptive error will be written - * to last_cmd_status buffer. - */ -static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr) -{ - int ret; - - ret = pseudo_lock_region_init(plr); - if (ret < 0) - return ret; - - /* - * We do not yet support contiguous regions larger than - * KMALLOC_MAX_SIZE. - */ - if (plr->size > KMALLOC_MAX_SIZE) { - rdt_last_cmd_puts("Requested region exceeds maximum size\n"); - ret = -E2BIG; - goto out_region; - } - - plr->kmem = kzalloc(plr->size, GFP_KERNEL); - if (!plr->kmem) { - rdt_last_cmd_puts("Unable to allocate memory\n"); - ret = -ENOMEM; - goto out_region; - } - - ret = 0; - goto out; -out_region: - pseudo_lock_region_clear(plr); -out: - return ret; -} - -/** - * pseudo_lock_free - Free a pseudo-locked region - * @rdtgrp: resource group to which pseudo-locked region belonged - * - * The pseudo-locked region's resources have already been released, or not - * yet created at this point. Now it can be freed and disassociated from the - * resource group. - * - * Return: void - */ -static void pseudo_lock_free(struct rdtgroup *rdtgrp) -{ - pseudo_lock_region_clear(rdtgrp->plr); - kfree(rdtgrp->plr); - rdtgrp->plr = NULL; -} - /** * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache * @_plr: the pseudo-lock region descriptor @@ -543,351 +228,6 @@ int resctrl_arch_pseudo_lock_fn(void *_plr) return 0; } -/** - * rdtgroup_monitor_in_progress - Test if monitoring in progress - * @rdtgrp: resource group being queried - * - * Return: 1 if monitor groups have been created for this resource - * group, 0 otherwise. - */ -static int rdtgroup_monitor_in_progress(struct rdtgroup *rdtgrp) -{ - return !list_empty(&rdtgrp->mon.crdtgrp_list); -} - -/** - * rdtgroup_locksetup_user_restrict - Restrict user access to group - * @rdtgrp: resource group needing access restricted - * - * A resource group used for cache pseudo-locking cannot have cpus or tasks - * assigned to it. This is communicated to the user by restricting access - * to all the files that can be used to make such changes. - * - * Permissions restored with rdtgroup_locksetup_user_restore() - * - * Return: 0 on success, <0 on failure. If a failure occurs during the - * restriction of access an attempt will be made to restore permissions but - * the state of the mode of these files will be uncertain when a failure - * occurs. 
- */ -static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp) -{ - int ret; - - ret = rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); - if (ret) - return ret; - - ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); - if (ret) - goto err_tasks; - - ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); - if (ret) - goto err_cpus; - - if (resctrl_arch_mon_capable()) { - ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups"); - if (ret) - goto err_cpus_list; - } - - ret = 0; - goto out; - -err_cpus_list: - rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); -err_cpus: - rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); -err_tasks: - rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); -out: - return ret; -} - -/** - * rdtgroup_locksetup_user_restore - Restore user access to group - * @rdtgrp: resource group needing access restored - * - * Restore all file access previously removed using - * rdtgroup_locksetup_user_restrict() - * - * Return: 0 on success, <0 on failure. If a failure occurs during the - * restoration of access an attempt will be made to restrict permissions - * again but the state of the mode of these files will be uncertain when - * a failure occurs. - */ -static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp) -{ - int ret; - - ret = rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); - if (ret) - return ret; - - ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); - if (ret) - goto err_tasks; - - ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); - if (ret) - goto err_cpus; - - if (resctrl_arch_mon_capable()) { - ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777); - if (ret) - goto err_cpus_list; - } - - ret = 0; - goto out; - -err_cpus_list: - rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); -err_cpus: - rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); -err_tasks: - rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); -out: - return ret; -} - -/** - * rdtgroup_locksetup_enter - Resource group enters locksetup mode - * @rdtgrp: resource group requested to enter locksetup mode - * - * A resource group enters locksetup mode to reflect that it would be used - * to represent a pseudo-locked region and is in the process of being set - * up to do so. A resource group used for a pseudo-locked region would - * lose the closid associated with it so we cannot allow it to have any - * tasks or cpus assigned nor permit tasks or cpus to be assigned in the - * future. Monitoring of a pseudo-locked region is not allowed either. - * - * The above and more restrictions on a pseudo-locked region are checked - * for and enforced before the resource group enters the locksetup mode. - * - * Returns: 0 if the resource group successfully entered locksetup mode, <0 - * on failure. On failure the last_cmd_status buffer is updated with text to - * communicate details of failure to the user. - */ -int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) -{ - int ret; - - /* - * The default resource group can neither be removed nor lose the - * default closid associated with it. - */ - if (rdtgrp == &rdtgroup_default) { - rdt_last_cmd_puts("Cannot pseudo-lock default group\n"); - return -EINVAL; - } - - /* - * Cache Pseudo-locking not supported when CDP is enabled. - * - * Some things to consider if you would like to enable this - * support (using L3 CDP as example): - * - When CDP is enabled two separate resources are exposed, - * L3DATA and L3CODE, but they are actually on the same cache. 
- *    The implication for pseudo-locking is that if a
- *    pseudo-locked region is created on a domain of one
- *    resource (eg. L3CODE), then a pseudo-locked region cannot
- *    be created on that same domain of the other resource
- *    (eg. L3DATA). This is because the creation of a
- *    pseudo-locked region involves a call to wbinvd that will
- *    affect all cache allocations on the particular domain.
- *  - Considering the previous, it may be possible to only
- *    expose one of the CDP resources to pseudo-locking and
- *    hide the other. For example, we could consider to only
- *    expose L3DATA and since the L3 cache is unified it is
- *    still possible to place instructions there and execute it.
- *  - If only one region is exposed to pseudo-locking we should
- *    still keep in mind that availability of a portion of cache
- *    for pseudo-locking should take into account both resources.
- *    Similarly, if a pseudo-locked region is created in one
- *    resource, the portion of cache used by it should be made
- *    unavailable to all future allocations from both resources.
- */
-	if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3) ||
-	    resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) {
-		rdt_last_cmd_puts("CDP enabled\n");
-		return -EINVAL;
-	}
-
-	/*
-	 * Not knowing the bits to disable prefetching implies that this
-	 * platform does not support Cache Pseudo-Locking.
-	 */
-	if (resctrl_arch_get_prefetch_disable_bits() == 0) {
-		rdt_last_cmd_puts("Pseudo-locking not supported\n");
-		return -EINVAL;
-	}
-
-	if (rdtgroup_monitor_in_progress(rdtgrp)) {
-		rdt_last_cmd_puts("Monitoring in progress\n");
-		return -EINVAL;
-	}
-
-	if (rdtgroup_tasks_assigned(rdtgrp)) {
-		rdt_last_cmd_puts("Tasks assigned to resource group\n");
-		return -EINVAL;
-	}
-
-	if (!cpumask_empty(&rdtgrp->cpu_mask)) {
-		rdt_last_cmd_puts("CPUs assigned to resource group\n");
-		return -EINVAL;
-	}
-
-	if (rdtgroup_locksetup_user_restrict(rdtgrp)) {
-		rdt_last_cmd_puts("Unable to modify resctrl permissions\n");
-		return -EIO;
-	}
-
-	ret = pseudo_lock_init(rdtgrp);
-	if (ret) {
-		rdt_last_cmd_puts("Unable to init pseudo-lock region\n");
-		goto out_release;
-	}
-
-	/*
-	 * If this system is capable of monitoring, an rmid would have been
-	 * allocated when the control group was created. This is not needed
-	 * anymore when this group would be used for pseudo-locking. This
-	 * is safe to call on platforms not capable of monitoring.
-	 */
-	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-
-	ret = 0;
-	goto out;
-
-out_release:
-	rdtgroup_locksetup_user_restore(rdtgrp);
-out:
-	return ret;
-}
-
-/**
- * rdtgroup_locksetup_exit - resource group exits locksetup mode
- * @rdtgrp: resource group
- *
- * When a resource group exits locksetup mode the earlier restrictions are
- * lifted.
- *
- * Return: 0 on success, <0 on failure
- */
-int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
-{
-	int ret;
-
-	if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK))
-		return -EOPNOTSUPP;
-
-	if (resctrl_arch_mon_capable()) {
-		ret = alloc_rmid(rdtgrp->closid);
-		if (ret < 0) {
-			rdt_last_cmd_puts("Out of RMIDs\n");
-			return ret;
-		}
-		rdtgrp->mon.rmid = ret;
-	}
-
-	ret = rdtgroup_locksetup_user_restore(rdtgrp);
-	if (ret) {
-		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-		return ret;
-	}
-
-	pseudo_lock_free(rdtgrp);
-	return 0;
-}
-
-/**
- * rdtgroup_cbm_overlaps_pseudo_locked - Test if CBM or portion is pseudo-locked
- * @d: RDT domain
- * @cbm: CBM to test
- *
- * @d represents a cache instance and @cbm a capacity bitmask that is
- * considered for it.
Determine if @cbm overlaps with any existing - * pseudo-locked region on @d. - * - * @cbm is unsigned long, even if only 32 bits are used, to make the - * bitmap functions work correctly. - * - * Return: true if @cbm overlaps with pseudo-locked region on @d, false - * otherwise. - */ -bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm) -{ - unsigned int cbm_len; - unsigned long cbm_b; - - if (d->plr) { - cbm_len = d->plr->s->res->cache.cbm_len; - cbm_b = d->plr->cbm; - if (bitmap_intersects(&cbm, &cbm_b, cbm_len)) - return true; - } - return false; -} - -/** - * rdtgroup_pseudo_locked_in_hierarchy - Pseudo-locked region in cache hierarchy - * @d: RDT domain under test - * - * The setup of a pseudo-locked region affects all cache instances within - * the hierarchy of the region. It is thus essential to know if any - * pseudo-locked regions exist within a cache hierarchy to prevent any - * attempts to create new pseudo-locked regions in the same hierarchy. - * - * Return: true if a pseudo-locked region exists in the hierarchy of @d or - * if it is not possible to test due to memory allocation issue, - * false otherwise. - */ -bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d) -{ - cpumask_var_t cpu_with_psl; - enum resctrl_res_level i; - struct rdt_resource *r; - struct rdt_domain *d_i; - bool ret = false; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - return -EOPNOTSUPP; - - if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) - return true; - - /* - * First determine which cpus have pseudo-locked regions - * associated with them. - */ - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->alloc_capable) - continue; - - list_for_each_entry(d_i, &r->domains, list) { - if (d_i->plr) - cpumask_or(cpu_with_psl, cpu_with_psl, - &d_i->cpu_mask); - } - } - - /* - * Next test if new pseudo-locked region would intersect with - * existing region. - */ - if (cpumask_intersects(&d->cpu_mask, cpu_with_psl)) - ret = true; - - free_cpumask_var(cpu_with_psl); - return ret; -} - /** * resctrl_arch_measure_cycles_lat_fn - Measure cycle latency to read * pseudo-locked memory @@ -1180,448 +520,3 @@ int resctrl_arch_measure_l3_residency(void *_plr) wake_up_interruptible(&plr->lock_thread_wq); return 0; } - -/** - * pseudo_lock_measure_cycles - Trigger latency measure to pseudo-locked region - * @rdtgrp: Resource group to which the pseudo-locked region belongs. - * @sel: Selector of which measurement to perform on a pseudo-locked region. - * - * The measurement of latency to access a pseudo-locked region should be - * done from a cpu that is associated with that pseudo-locked region. - * Determine which cpu is associated with this region and start a thread on - * that cpu to perform the measurement, wait for that thread to complete. 
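/*
 * Illustrative userspace sketch, not part of the patch: the measurement flow
 * described above creates a kthread, binds it to a CPU of the cache domain
 * and then waits for it to finish. A rough analogue using pthreads and the
 * glibc-specific pthread_attr_setaffinity_np(), which, like kthread_bind(),
 * pins the worker before it first runs. Names and the CPU choice are
 * placeholders.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *measure(void *arg)
{
	printf("measuring on cpu %d\n", sched_getcpu());
	return NULL;
}

int main(void)
{
	pthread_attr_t attr;
	pthread_t thread;
	cpu_set_t set;
	int cpu = 0;		/* first CPU in the domain's cpu_mask */

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	pthread_attr_init(&attr);
	pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

	pthread_create(&thread, &attr, measure, NULL);	/* wake_up_process() */
	pthread_join(thread, NULL);	/* wait_event_interruptible() analogue */

	pthread_attr_destroy(&attr);
	return 0;
}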
- * - * Return: 0 on success, <0 on failure - */ -static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) -{ - struct pseudo_lock_region *plr = rdtgrp->plr; - struct task_struct *thread; - unsigned int cpu; - int ret = -1; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - if (rdtgrp->flags & RDT_DELETED) { - ret = -ENODEV; - goto out; - } - - if (!plr->d) { - ret = -ENODEV; - goto out; - } - - plr->thread_done = 0; - cpu = cpumask_first(&plr->d->cpu_mask); - if (!cpu_online(cpu)) { - ret = -ENODEV; - goto out; - } - - plr->cpu = cpu; - - if (sel == 1) - thread = kthread_create_on_node(resctrl_arch_measure_cycles_lat_fn, - plr, cpu_to_node(cpu), - "pseudo_lock_measure/%u", - cpu); - else if (sel == 2) - thread = kthread_create_on_node(resctrl_arch_measure_l2_residency, - plr, cpu_to_node(cpu), - "pseudo_lock_measure/%u", - cpu); - else if (sel == 3) - thread = kthread_create_on_node(resctrl_arch_measure_l3_residency, - plr, cpu_to_node(cpu), - "pseudo_lock_measure/%u", - cpu); - else - goto out; - - if (IS_ERR(thread)) { - ret = PTR_ERR(thread); - goto out; - } - kthread_bind(thread, cpu); - wake_up_process(thread); - - ret = wait_event_interruptible(plr->lock_thread_wq, - plr->thread_done == 1); - if (ret < 0) - goto out; - - ret = 0; - -out: - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - return ret; -} - -static ssize_t pseudo_lock_measure_trigger(struct file *file, - const char __user *user_buf, - size_t count, loff_t *ppos) -{ - struct rdtgroup *rdtgrp = file->private_data; - size_t buf_size; - char buf[32]; - int ret; - int sel; - - buf_size = min(count, (sizeof(buf) - 1)); - if (copy_from_user(buf, user_buf, buf_size)) - return -EFAULT; - - buf[buf_size] = '\0'; - ret = kstrtoint(buf, 10, &sel); - if (ret == 0) { - if (sel != 1 && sel != 2 && sel != 3) - return -EINVAL; - ret = debugfs_file_get(file->f_path.dentry); - if (ret) - return ret; - ret = pseudo_lock_measure_cycles(rdtgrp, sel); - if (ret == 0) - ret = count; - debugfs_file_put(file->f_path.dentry); - } - - return ret; -} - -static const struct file_operations pseudo_measure_fops = { - .write = pseudo_lock_measure_trigger, - .open = simple_open, - .llseek = default_llseek, -}; - -/** - * rdtgroup_pseudo_lock_create - Create a pseudo-locked region - * @rdtgrp: resource group to which pseudo-lock region belongs - * - * Called when a resource group in the pseudo-locksetup mode receives a - * valid schemata that should be pseudo-locked. Since the resource group is - * in pseudo-locksetup mode the &struct pseudo_lock_region has already been - * allocated and initialized with the essential information. If a failure - * occurs the resource group remains in the pseudo-locksetup mode with the - * &struct pseudo_lock_region associated with it, but cleared from all - * information and ready for the user to re-attempt pseudo-locking by - * writing the schemata again. - * - * Return: 0 if the pseudo-locked region was successfully pseudo-locked, <0 - * on failure. Descriptive error will be written to last_cmd_status buffer. 
- */ -int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) -{ - struct pseudo_lock_region *plr = rdtgrp->plr; - struct task_struct *thread; - unsigned int new_minor; - struct device *dev; - int ret; - - if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - return -EOPNOTSUPP; - - ret = pseudo_lock_region_alloc(plr); - if (ret < 0) - return ret; - - ret = pseudo_lock_cstates_constrain(plr); - if (ret < 0) { - ret = -EINVAL; - goto out_region; - } - - plr->thread_done = 0; - - plr->closid = rdtgrp->closid; - thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, plr, - cpu_to_node(plr->cpu), - "pseudo_lock/%u", plr->cpu); - if (IS_ERR(thread)) { - ret = PTR_ERR(thread); - rdt_last_cmd_printf("Locking thread returned error %d\n", ret); - goto out_cstates; - } - - kthread_bind(thread, plr->cpu); - wake_up_process(thread); - - ret = wait_event_interruptible(plr->lock_thread_wq, - plr->thread_done == 1); - if (ret < 0) { - /* - * If the thread does not get on the CPU for whatever - * reason and the process which sets up the region is - * interrupted then this will leave the thread in runnable - * state and once it gets on the CPU it will dereference - * the cleared, but not freed, plr struct resulting in an - * empty pseudo-locking loop. - */ - rdt_last_cmd_puts("Locking thread interrupted\n"); - goto out_cstates; - } - - ret = pseudo_lock_minor_get(&new_minor); - if (ret < 0) { - rdt_last_cmd_puts("Unable to obtain a new minor number\n"); - goto out_cstates; - } - - /* - * Unlock access but do not release the reference. The - * pseudo-locked region will still be here on return. - * - * The mutex has to be released temporarily to avoid a potential - * deadlock with the mm->mmap_lock which is obtained in the - * device_create() and debugfs_create_dir() callpath below as well as - * before the mmap() callback is called. - */ - mutex_unlock(&rdtgroup_mutex); - - if (!IS_ERR_OR_NULL(debugfs_resctrl)) { - plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name, - debugfs_resctrl); - if (!IS_ERR_OR_NULL(plr->debugfs_dir)) - debugfs_create_file("pseudo_lock_measure", 0200, - plr->debugfs_dir, rdtgrp, - &pseudo_measure_fops); - } - - dev = device_create(&pseudo_lock_class, NULL, - MKDEV(pseudo_lock_major, new_minor), - rdtgrp, "%s", rdtgrp->kn->name); - - mutex_lock(&rdtgroup_mutex); - - if (IS_ERR(dev)) { - ret = PTR_ERR(dev); - rdt_last_cmd_printf("Failed to create character device: %d\n", - ret); - goto out_debugfs; - } - - /* We released the mutex - check if group was removed while we did so */ - if (rdtgrp->flags & RDT_DELETED) { - ret = -ENODEV; - goto out_device; - } - - plr->minor = new_minor; - - rdtgrp->mode = RDT_MODE_PSEUDO_LOCKED; - closid_free(rdtgrp->closid); - rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0444); - rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0444); - - ret = 0; - goto out; - -out_device: - device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, new_minor)); -out_debugfs: - debugfs_remove_recursive(plr->debugfs_dir); - pseudo_lock_minor_release(new_minor); -out_cstates: - pseudo_lock_cstates_relax(plr); -out_region: - pseudo_lock_region_clear(plr); -out: - return ret; -} - -/** - * rdtgroup_pseudo_lock_remove - Remove a pseudo-locked region - * @rdtgrp: resource group to which the pseudo-locked region belongs - * - * The removal of a pseudo-locked region can be initiated when the resource - * group is removed from user space via a "rmdir" from userspace or the - * unmount of the resctrl filesystem. 
On removal the resource group does - * not go back to pseudo-locksetup mode before it is removed, instead it is - * removed directly. There is thus asymmetry with the creation where the - * &struct pseudo_lock_region is removed here while it was not created in - * rdtgroup_pseudo_lock_create(). - * - * Return: void - */ -void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) -{ - struct pseudo_lock_region *plr = rdtgrp->plr; - - if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - return; - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - /* - * Default group cannot be a pseudo-locked region so we can - * free closid here. - */ - closid_free(rdtgrp->closid); - goto free; - } - - pseudo_lock_cstates_relax(plr); - debugfs_remove_recursive(rdtgrp->plr->debugfs_dir); - device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor)); - pseudo_lock_minor_release(plr->minor); - -free: - pseudo_lock_free(rdtgrp); -} - -static int pseudo_lock_dev_open(struct inode *inode, struct file *filp) -{ - struct rdtgroup *rdtgrp; - - mutex_lock(&rdtgroup_mutex); - - rdtgrp = region_find_by_minor(iminor(inode)); - if (!rdtgrp) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - - filp->private_data = rdtgrp; - atomic_inc(&rdtgrp->waitcount); - /* Perform a non-seekable open - llseek is not supported */ - filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); - - mutex_unlock(&rdtgroup_mutex); - - return 0; -} - -static int pseudo_lock_dev_release(struct inode *inode, struct file *filp) -{ - struct rdtgroup *rdtgrp; - - mutex_lock(&rdtgroup_mutex); - rdtgrp = filp->private_data; - WARN_ON(!rdtgrp); - if (!rdtgrp) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - filp->private_data = NULL; - atomic_dec(&rdtgrp->waitcount); - mutex_unlock(&rdtgroup_mutex); - return 0; -} - -static int pseudo_lock_dev_mremap(struct vm_area_struct *area) -{ - /* Not supported */ - return -EINVAL; -} - -static const struct vm_operations_struct pseudo_mmap_ops = { - .mremap = pseudo_lock_dev_mremap, -}; - -static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma) -{ - unsigned long vsize = vma->vm_end - vma->vm_start; - unsigned long off = vma->vm_pgoff << PAGE_SHIFT; - struct pseudo_lock_region *plr; - struct rdtgroup *rdtgrp; - unsigned long physical; - unsigned long psize; - - mutex_lock(&rdtgroup_mutex); - - rdtgrp = filp->private_data; - WARN_ON(!rdtgrp); - if (!rdtgrp) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - - plr = rdtgrp->plr; - - if (!plr->d) { - mutex_unlock(&rdtgroup_mutex); - return -ENODEV; - } - - /* - * Task is required to run with affinity to the cpus associated - * with the pseudo-locked region. If this is not the case the task - * may be scheduled elsewhere and invalidate entries in the - * pseudo-locked region. - */ - if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) { - mutex_unlock(&rdtgroup_mutex); - return -EINVAL; - } - - physical = __pa(plr->kmem) >> PAGE_SHIFT; - psize = plr->size - off; - - if (off > plr->size) { - mutex_unlock(&rdtgroup_mutex); - return -ENOSPC; - } - - /* - * Ensure changes are carried directly to the memory being mapped, - * do not allow copy-on-write mapping. 
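/*
 * Illustrative userspace sketch, not part of the patch: the offset and size
 * checks the mmap handler performs above, modelled as a standalone helper.
 * The sketch checks "off > size" before deriving the remaining size, so the
 * subtraction cannot wrap; the ordering is the important part.
 */
#include <stdio.h>

/* Return 0 if [off, off + vsize) fits inside a region of @size bytes. */
static int validate_window(unsigned long size, unsigned long off,
			   unsigned long vsize)
{
	if (off > size)
		return -1;		/* offset past end of region */
	if (vsize > size - off)
		return -1;		/* mapping longer than what remains */
	return 0;
}

int main(void)
{
	printf("%d\n", validate_window(4096, 0, 4096));	   /* 0: fits */
	printf("%d\n", validate_window(4096, 8192, 1));	   /* -1: bad offset */
	printf("%d\n", validate_window(4096, 1024, 4096)); /* -1: too long */
	return 0;
}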
- */ - if (!(vma->vm_flags & VM_SHARED)) { - mutex_unlock(&rdtgroup_mutex); - return -EINVAL; - } - - if (vsize > psize) { - mutex_unlock(&rdtgroup_mutex); - return -ENOSPC; - } - - memset(plr->kmem + off, 0, vsize); - - if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff, - vsize, vma->vm_page_prot)) { - mutex_unlock(&rdtgroup_mutex); - return -EAGAIN; - } - vma->vm_ops = &pseudo_mmap_ops; - mutex_unlock(&rdtgroup_mutex); - return 0; -} - -static const struct file_operations pseudo_lock_dev_fops = { - .owner = THIS_MODULE, - .llseek = no_llseek, - .read = NULL, - .write = NULL, - .open = pseudo_lock_dev_open, - .release = pseudo_lock_dev_release, - .mmap = pseudo_lock_dev_mmap, -}; - -int rdt_pseudo_lock_init(void) -{ - int ret; - - ret = register_chrdev(0, "pseudo_lock", &pseudo_lock_dev_fops); - if (ret < 0) - return ret; - - pseudo_lock_major = ret; - - ret = class_register(&pseudo_lock_class); - if (ret) { - unregister_chrdev(pseudo_lock_major, "pseudo_lock"); - return ret; - } - - return 0; -} - -void rdt_pseudo_lock_release(void) -{ - class_unregister(&pseudo_lock_class); - unregister_chrdev(pseudo_lock_major, "pseudo_lock"); - pseudo_lock_major = 0; -} diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index f69715d5126e..10bd76e1d0d5 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -12,22 +12,8 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -#include <linux/cacheinfo.h> #include <linux/cpu.h> -#include <linux/debugfs.h> -#include <linux/fs.h> -#include <linux/fs_parser.h> -#include <linux/sysfs.h> -#include <linux/kernfs.h> -#include <linux/seq_buf.h> -#include <linux/seq_file.h> -#include <linux/sched/signal.h> -#include <linux/sched/task.h> #include <linux/slab.h> -#include <linux/task_work.h> -#include <linux/user_namespace.h> - -#include <uapi/linux/magic.h> #include <asm/resctrl.h> #include "internal.h" @@ -36,4215 +22,239 @@ DEFINE_STATIC_KEY_FALSE(rdt_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key); -/* Mutex to protect rdtgroup access. */ -DEFINE_MUTEX(rdtgroup_mutex); - -static struct kernfs_root *rdt_root; -struct rdtgroup rdtgroup_default; -LIST_HEAD(rdt_all_groups); - -/* list of entries for the schemata file */ -LIST_HEAD(resctrl_schema_all); - -/* The filesystem can only be mounted once. */ -bool resctrl_mounted; - -/* Kernel fs node for "info" directory under root */ -static struct kernfs_node *kn_info; - -/* Kernel fs node for "mon_groups" directory under root */ -static struct kernfs_node *kn_mongrp; - -/* Kernel fs node for "mon_data" directory under root */ -static struct kernfs_node *kn_mondata; - /* - * Used to store the max resource name width and max resource data width - * to display the schemata in a tabular format + * This is safe against resctrl_sched_in() called from __switch_to() + * because __switch_to() is executed with interrupts disabled. A local call + * from update_closid_rmid() is protected against __switch_to() because + * preemption is disabled. 
*/ -int max_name_width, max_data_width; - -static struct seq_buf last_cmd_status; -static char last_cmd_status_buf[512]; - -static int rdtgroup_setup_root(struct rdt_fs_context *ctx); -static void rdtgroup_destroy_root(void); - -struct dentry *debugfs_resctrl; - -static bool resctrl_debug; - -void rdt_last_cmd_clear(void) -{ - lockdep_assert_held(&rdtgroup_mutex); - seq_buf_clear(&last_cmd_status); -} - -void rdt_last_cmd_puts(const char *s) -{ - lockdep_assert_held(&rdtgroup_mutex); - seq_buf_puts(&last_cmd_status, s); -} - -void rdt_last_cmd_printf(const char *fmt, ...) -{ - va_list ap; - - va_start(ap, fmt); - lockdep_assert_held(&rdtgroup_mutex); - seq_buf_vprintf(&last_cmd_status, fmt, ap); - va_end(ap); -} - -void rdt_staged_configs_clear(void) +void resctrl_arch_sync_cpu_defaults(void *info) { - struct rdt_resource *r; - struct rdt_domain *dom; - int i; - - lockdep_assert_held(&rdtgroup_mutex); - - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->alloc_capable) - continue; + struct resctrl_cpu_sync *r = info; - list_for_each_entry(dom, &r->domains, list) - memset(dom->staged_config, 0, sizeof(dom->staged_config)); + if (r) { + this_cpu_write(pqr_state.default_closid, r->closid); + this_cpu_write(pqr_state.default_rmid, r->rmid); } -} -static bool resctrl_is_mbm_enabled(void) -{ - return (resctrl_arch_is_mbm_total_enabled() || - resctrl_arch_is_mbm_local_enabled()); + /* + * We cannot unconditionally write the MSR because the current + * executing task might have its own closid selected. Just reuse + * the context switch code. + */ + resctrl_sched_in(current); } -static bool resctrl_is_mbm_event(int e) -{ - return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && - e <= QOS_L3_MBM_LOCAL_EVENT_ID); -} +#define INVALID_CONFIG_INDEX UINT_MAX -/* - * Trivial allocator for CLOSIDs. Since h/w only supports a small number, - * we can keep a bitmap of free CLOSIDs in a single integer. +/** + * mon_event_config_index_get - get the hardware index for the + * configurable event + * @evtid: event id. * - * Using a global CLOSID across all resources has some advantages and - * some drawbacks: - * + We can simply set current's closid to assign a task to a resource - * group. - * + Context switch code can avoid extra memory references deciding which - * CLOSID to load into the PQR_ASSOC MSR - * - We give up some options in configuring resource groups across multi-socket - * systems. - * - Our choices on how to configure each resource become progressively more - * limited as the number of resources grows. 
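/*
 * Illustrative userspace sketch, not part of the patch: the "bitmap of free
 * CLOSIDs in a single integer" allocator described above. Bit 0 stays
 * reserved for the default group, mirroring RESCTRL_RESERVED_CLOSID.
 * __builtin_ffsl() is a GCC/Clang builtin standing in for the kernel's ffs().
 */
#include <stdio.h>

static unsigned long closid_free_map;

static void closid_init(int min_closid)
{
	closid_free_map = (1UL << min_closid) - 1;
	closid_free_map &= ~1UL;		/* reserve closid 0 */
}

static int closid_alloc(void)
{
	int closid = __builtin_ffsl(closid_free_map);

	if (closid == 0)
		return -1;			/* map exhausted: -ENOSPC */
	closid--;				/* ffs() is 1-based */
	closid_free_map &= ~(1UL << closid);
	return closid;
}

static void closid_free(int closid)
{
	closid_free_map |= 1UL << closid;
}

int main(void)
{
	closid_init(4);		/* minimum num_closid over all resources */
	printf("%d %d %d\n", closid_alloc(), closid_alloc(), closid_alloc());
	closid_free(1);
	printf("%d\n", closid_alloc());		/* reuses closid 1 */
	return 0;
}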
+ * Return: 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID + * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID + * INVALID_CONFIG_INDEX for invalid evtid */ -static unsigned long closid_free_map; -static int closid_free_map_len; - -int closids_supported(void) -{ - return closid_free_map_len; -} - -static void closid_init(void) +static inline unsigned int mon_event_config_index_get(u32 evtid) { - struct resctrl_schema *s; - u32 rdt_min_closid = 32; - - /* Compute rdt_min_closid across all resources */ - list_for_each_entry(s, &resctrl_schema_all, list) - rdt_min_closid = min(rdt_min_closid, s->num_closid); - - closid_free_map = BIT_MASK(rdt_min_closid) - 1; - - /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ - __clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map); - closid_free_map_len = rdt_min_closid; + switch (evtid) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return 0; + case QOS_L3_MBM_LOCAL_EVENT_ID: + return 1; + default: + /* Should never reach here */ + return INVALID_CONFIG_INDEX; + } } -static int closid_alloc(void) +void resctrl_arch_mon_event_config_read(void *info) { - int cleanest_closid; - u32 closid; - - lockdep_assert_held(&rdtgroup_mutex); + struct resctrl_mon_config_info *mon_info = info; + unsigned int index; + u64 msrval; - if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { - cleanest_closid = resctrl_find_cleanest_closid(); - if (cleanest_closid < 0) - return cleanest_closid; - closid = cleanest_closid; - } else { - closid = ffs(closid_free_map); - if (closid == 0) - return -ENOSPC; - closid--; + index = mon_event_config_index_get(mon_info->evtid); + if (index == INVALID_CONFIG_INDEX) { + pr_warn_once("Invalid event id %d\n", mon_info->evtid); + return; } - __clear_bit(closid, &closid_free_map); + rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval); - return closid; + /* Report only the valid event configuration bits */ + mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; } -void closid_free(int closid) +void resctrl_arch_mon_event_config_write(void *info) { - lockdep_assert_held(&rdtgroup_mutex); - - __set_bit(closid, &closid_free_map); -} + struct resctrl_mon_config_info *mon_info = info; + unsigned int index; -/** - * closid_allocated - test if provided closid is in use - * @closid: closid to be tested - * - * Return: true if @closid is currently associated with a resource group, - * false if @closid is free - */ -bool closid_allocated(unsigned int closid) -{ - lockdep_assert_held(&rdtgroup_mutex); + index = mon_event_config_index_get(mon_info->evtid); + if (index == INVALID_CONFIG_INDEX) { + pr_warn_once("Invalid event id %d\n", mon_info->evtid); + mon_info->err = -EINVAL; + return; + } + wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); - return !test_bit(closid, &closid_free_map); + mon_info->err = 0; } -/** - * rdtgroup_mode_by_closid - Return mode of resource group with closid - * @closid: closid if the resource group - * - * Each resource group is associated with a @closid. Here the mode - * of a resource group can be queried by searching for it using its closid. - * - * Return: mode as &enum rdtgrp_mode of resource group with closid @closid - */ -enum rdtgrp_mode rdtgroup_mode_by_closid(int closid) +static void l3_qos_cfg_update(void *arg) { - struct rdtgroup *rdtgrp; - - list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { - if (rdtgrp->closid == closid) - return rdtgrp->mode; - } + bool *enable = arg; - return RDT_NUM_MODES; + wrmsrl(MSR_IA32_L3_QOS_CFG, *enable ? 
L3_QOS_CDP_ENABLE : 0ULL); } -static const char * const rdt_mode_str[] = { - [RDT_MODE_SHAREABLE] = "shareable", - [RDT_MODE_EXCLUSIVE] = "exclusive", - [RDT_MODE_PSEUDO_LOCKSETUP] = "pseudo-locksetup", - [RDT_MODE_PSEUDO_LOCKED] = "pseudo-locked", -}; - -/** - * rdtgroup_mode_str - Return the string representation of mode - * @mode: the resource group mode as &enum rdtgroup_mode - * - * Return: string representation of valid mode, "unknown" otherwise - */ -static const char *rdtgroup_mode_str(enum rdtgrp_mode mode) +static void l2_qos_cfg_update(void *arg) { - if (mode < RDT_MODE_SHAREABLE || mode >= RDT_NUM_MODES) - return "unknown"; + bool *enable = arg; - return rdt_mode_str[mode]; + wrmsrl(MSR_IA32_L2_QOS_CFG, *enable ? L2_QOS_CDP_ENABLE : 0ULL); } -/* set uid and gid of rdtgroup dirs and files to that of the creator */ -static int rdtgroup_kn_set_ugid(struct kernfs_node *kn) +static int set_cache_qos_cfg(int level, bool enable) { - struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID, - .ia_uid = current_fsuid(), - .ia_gid = current_fsgid(), }; - - if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) && - gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID)) - return 0; + void (*update)(void *arg); + struct rdt_resource *r_l; + cpumask_var_t cpu_mask; + struct rdt_domain *d; + int cpu; - return kernfs_setattr(kn, &iattr); -} + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); -static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft) -{ - struct kernfs_node *kn; - int ret; + if (level == RDT_RESOURCE_L3) + update = l3_qos_cfg_update; + else if (level == RDT_RESOURCE_L2) + update = l2_qos_cfg_update; + else + return -EINVAL; - kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, - 0, rft->kf_ops, rft, NULL, NULL); - if (IS_ERR(kn)) - return PTR_ERR(kn); + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) + return -ENOMEM; - ret = rdtgroup_kn_set_ugid(kn); - if (ret) { - kernfs_remove(kn); - return ret; + r_l = &rdt_resources_all[level].r_resctrl; + list_for_each_entry(d, &r_l->domains, list) { + if (r_l->cache.arch_has_per_cpu_cfg) + /* Pick all the CPUs in the domain instance */ + for_each_cpu(cpu, &d->cpu_mask) + cpumask_set_cpu(cpu, cpu_mask); + else + /* Pick one CPU from each domain instance to update MSR */ + cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); } - return 0; -} + /* Update QOS_CFG MSR on all the CPUs in cpu_mask */ + on_each_cpu_mask(cpu_mask, update, &enable, 1); -static int rdtgroup_seqfile_show(struct seq_file *m, void *arg) -{ - struct kernfs_open_file *of = m->private; - struct rftype *rft = of->kn->priv; + free_cpumask_var(cpu_mask); - if (rft->seq_show) - return rft->seq_show(of, m, arg); return 0; } -static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf, - size_t nbytes, loff_t off) +/* Restore the qos cfg state when a domain comes online */ +void rdt_domain_reconfigure_cdp(struct rdt_resource *r) { - struct rftype *rft = of->kn->priv; - - if (rft->write) - return rft->write(of, buf, nbytes, off); - - return -EINVAL; -} - -static const struct kernfs_ops rdtgroup_kf_single_ops = { - .atomic_write_len = PAGE_SIZE, - .write = rdtgroup_file_write, - .seq_show = rdtgroup_seqfile_show, -}; + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); -static const struct kernfs_ops kf_mondata_ops = { - .atomic_write_len = PAGE_SIZE, - .seq_show = rdtgroup_mondata_show, -}; + if (!r->cdp_capable) + return; -static bool is_cpu_list(struct kernfs_open_file *of) -{ - 
struct rftype *rft = of->kn->priv; + if (r->rid == RDT_RESOURCE_L2) + l2_qos_cfg_update(&hw_res->cdp_enabled); - return rft->flags & RFTYPE_FLAGS_CPUS_LIST; + if (r->rid == RDT_RESOURCE_L3) + l3_qos_cfg_update(&hw_res->cdp_enabled); } -static int rdtgroup_cpus_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) +static int cdp_enable(int level) { - struct rdtgroup *rdtgrp; - struct cpumask *mask; - int ret = 0; + struct rdt_resource *r_l = &rdt_resources_all[level].r_resctrl; + int ret; - rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!r_l->alloc_capable) + return -EINVAL; - if (rdtgrp) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - if (!rdtgrp->plr->d) { - rdt_last_cmd_clear(); - rdt_last_cmd_puts("Cache domain offline\n"); - ret = -ENODEV; - } else { - mask = &rdtgrp->plr->d->cpu_mask; - seq_printf(s, is_cpu_list(of) ? - "%*pbl\n" : "%*pb\n", - cpumask_pr_args(mask)); - } - } else { - seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n", - cpumask_pr_args(&rdtgrp->cpu_mask)); - } - } else { - ret = -ENOENT; - } - rdtgroup_kn_unlock(of->kn); + ret = set_cache_qos_cfg(level, true); + if (!ret) + rdt_resources_all[level].cdp_enabled = true; return ret; } -/* - * This is safe against resctrl_sched_in() called from __switch_to() - * because __switch_to() is executed with interrupts disabled. A local call - * from update_closid_rmid() is protected against __switch_to() because - * preemption is disabled. - */ -void resctrl_arch_sync_cpu_defaults(void *info) -{ - struct resctrl_cpu_sync *r = info; - - if (r) { - this_cpu_write(pqr_state.default_closid, r->closid); - this_cpu_write(pqr_state.default_rmid, r->rmid); - } - - /* - * We cannot unconditionally write the MSR because the current - * executing task might have its own closid selected. Just reuse - * the context switch code. - */ - resctrl_sched_in(current); -} - -/* - * Update the PGR_ASSOC MSR on all cpus in @cpu_mask, - * - * Per task closids/rmids must have been set up before calling this function. - * @r may be NULL. 
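/*
 * Illustrative userspace sketch, not part of the patch: rdtgroup_cpus_show()
 * earlier in this hunk prints the same cpumask either as a bitmask ("%*pb",
 * the "cpus" file) or as a ranged list ("%*pbl", the "cpus_list" file). A
 * plain C model of the two output formats for a mask that fits in 64 bits:
 */
#include <stdio.h>
#include <stdint.h>

static void print_cpulist(uint64_t mask)
{
	const char *sep = "";

	for (int cpu = 0; cpu < 64; cpu++) {
		if (!(mask & (1ULL << cpu)))
			continue;
		int end = cpu;

		/* extend the run while consecutive bits are set */
		while (end + 1 < 64 && (mask & (1ULL << (end + 1))))
			end++;
		if (end > cpu)
			printf("%s%d-%d", sep, cpu, end);
		else
			printf("%s%d", sep, cpu);
		sep = ",";
		cpu = end;
	}
	putchar('\n');
}

int main(void)
{
	uint64_t mask = 0x10f;	/* CPUs 0-3 and 8 */

	printf("%llx\n", (unsigned long long)mask);	/* "cpus": 10f */
	print_cpulist(mask);				/* "cpus_list": 0-3,8 */
	return 0;
}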
- */ -static void -update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) +static void cdp_disable(int level) { - struct resctrl_cpu_sync defaults; - struct resctrl_cpu_sync *defaults_p = NULL; + struct rdt_hw_resource *r_hw = &rdt_resources_all[level]; - if (r) { - defaults.closid = r->closid; - defaults.rmid = r->mon.rmid; - defaults_p = &defaults; + if (r_hw->cdp_enabled) { + set_cache_qos_cfg(level, false); + r_hw->cdp_enabled = false; } - - on_each_cpu_mask(cpu_mask, resctrl_arch_sync_cpu_defaults, defaults_p, - 1); } -static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, - cpumask_var_t tmpmask) +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable) { - struct rdtgroup *prgrp = rdtgrp->mon.parent, *crgrp; - struct list_head *head; + struct rdt_hw_resource *hw_res = &rdt_resources_all[l]; - /* Check whether cpus belong to parent ctrl group */ - cpumask_andnot(tmpmask, newmask, &prgrp->cpu_mask); - if (!cpumask_empty(tmpmask)) { - rdt_last_cmd_puts("Can only add CPUs to mongroup that belong to parent\n"); + if (!hw_res->r_resctrl.cdp_capable) return -EINVAL; - } - - /* Check whether cpus are dropped from this group */ - cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); - if (!cpumask_empty(tmpmask)) { - /* Give any dropped cpus to parent rdtgroup */ - cpumask_or(&prgrp->cpu_mask, &prgrp->cpu_mask, tmpmask); - update_closid_rmid(tmpmask, prgrp); - } - /* - * If we added cpus, remove them from previous group that owned them - * and update per-cpu rmid - */ - cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); - if (!cpumask_empty(tmpmask)) { - head = &prgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) { - if (crgrp == rdtgrp) - continue; - cpumask_andnot(&crgrp->cpu_mask, &crgrp->cpu_mask, - tmpmask); - } - update_closid_rmid(tmpmask, rdtgrp); - } + if (enable) + return cdp_enable(l); - /* Done pushing/pulling - update this group with new mask */ - cpumask_copy(&rdtgrp->cpu_mask, newmask); + cdp_disable(l); return 0; } -static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m) +static int reset_all_ctrls(struct rdt_resource *r) { - struct rdtgroup *crgrp; - - cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m); - /* update the child mon group masks as well*/ - list_for_each_entry(crgrp, &r->mon.crdtgrp_list, mon.crdtgrp_list) - cpumask_and(&crgrp->cpu_mask, &r->cpu_mask, &crgrp->cpu_mask); -} + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + struct rdt_hw_domain *hw_dom; + struct msr_param msr_param; + cpumask_var_t cpu_mask; + struct rdt_domain *d; + int i; -static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, - cpumask_var_t tmpmask, cpumask_var_t tmpmask1) -{ - struct rdtgroup *r, *crgrp; - struct list_head *head; + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); - /* Check whether cpus are dropped from this group */ - cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); - if (!cpumask_empty(tmpmask)) { - /* Can't drop from default group */ - if (rdtgrp == &rdtgroup_default) { - rdt_last_cmd_puts("Can't drop CPUs from default group\n"); - return -EINVAL; - } + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) + return -ENOMEM; - /* Give any dropped cpus to rdtgroup_default */ - cpumask_or(&rdtgroup_default.cpu_mask, - &rdtgroup_default.cpu_mask, tmpmask); - update_closid_rmid(tmpmask, &rdtgroup_default); - } + msr_param.res = r; + msr_param.low = 0; + msr_param.high = hw_res->num_closid; /* - * If we added cpus, remove them 
from previous group and - * the prev group's child groups that owned them - * and update per-cpu closid/rmid. + * Disable resource control for this resource by setting all + * CBMs in all domains to the maximum mask value. Pick one CPU + * from each domain to update the MSRs below. */ - cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); - if (!cpumask_empty(tmpmask)) { - list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { - if (r == rdtgrp) - continue; - cpumask_and(tmpmask1, &r->cpu_mask, tmpmask); - if (!cpumask_empty(tmpmask1)) - cpumask_rdtgrp_clear(r, tmpmask1); - } - update_closid_rmid(tmpmask, rdtgrp); + list_for_each_entry(d, &r->domains, list) { + hw_dom = resctrl_to_arch_dom(d); + cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); + + for (i = 0; i < hw_res->num_closid; i++) + hw_dom->ctrl_val[i] = r->default_ctrl; } - /* Done pushing/pulling - update this group with new mask */ - cpumask_copy(&rdtgrp->cpu_mask, newmask); + /* Update CBM on all the CPUs in cpu_mask */ + on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1); - /* - * Clear child mon group masks since there is a new parent mask - * now and update the rmid for the cpus the child lost. - */ - head = &rdtgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) { - cpumask_and(tmpmask, &rdtgrp->cpu_mask, &crgrp->cpu_mask); - update_closid_rmid(tmpmask, rdtgrp); - cpumask_clear(&crgrp->cpu_mask); - } + free_cpumask_var(cpu_mask); return 0; } -static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) +void resctrl_arch_reset_resources(void) { - cpumask_var_t tmpmask, newmask, tmpmask1; - struct rdtgroup *rdtgrp; - int ret; - - if (!buf) - return -EINVAL; - - if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) - return -ENOMEM; - if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) { - free_cpumask_var(tmpmask); - return -ENOMEM; - } - if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) { - free_cpumask_var(tmpmask); - free_cpumask_var(newmask); - return -ENOMEM; - } - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - ret = -ENOENT; - goto unlock; - } - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = -EINVAL; - rdt_last_cmd_puts("Pseudo-locking in progress\n"); - goto unlock; - } - - if (is_cpu_list(of)) - ret = cpulist_parse(buf, newmask); - else - ret = cpumask_parse(buf, newmask); - - if (ret) { - rdt_last_cmd_puts("Bad CPU list/mask\n"); - goto unlock; - } - - /* check that user didn't specify any offline cpus */ - cpumask_andnot(tmpmask, newmask, cpu_online_mask); - if (!cpumask_empty(tmpmask)) { - ret = -EINVAL; - rdt_last_cmd_puts("Can only assign online CPUs\n"); - goto unlock; - } - - if (rdtgrp->type == RDTCTRL_GROUP) - ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1); - else if (rdtgrp->type == RDTMON_GROUP) - ret = cpus_mon_write(rdtgrp, newmask, tmpmask); - else - ret = -EINVAL; - -unlock: - rdtgroup_kn_unlock(of->kn); - free_cpumask_var(tmpmask); - free_cpumask_var(newmask); - free_cpumask_var(tmpmask1); - - return ret ?: nbytes; -} - -/** - * rdtgroup_remove - the helper to remove resource group safely - * @rdtgrp: resource group to remove - * - * On resource group creation via a mkdir, an extra kernfs_node reference is - * taken to ensure that the rdtgroup structure remains accessible for the - * rdtgroup_kn_unlock() calls where it is removed. - * - * Drop the extra reference here, then free the rdtgroup structure. 
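/*
 * Illustrative userspace sketch, not part of the patch: the "extra reference
 * taken at mkdir, dropped at removal" pattern described above, as a minimal
 * refcount where the object is freed only when the last user lets go.
 * hold()/drop() stand in for kernfs_get()/kernfs_put(); error handling is
 * trimmed.
 */
#include <stdio.h>
#include <stdlib.h>

struct node {
	int refcount;
	const char *name;
};

static struct node *node_create(const char *name)
{
	struct node *n = calloc(1, sizeof(*n));

	n->refcount = 1;	/* the extra reference taken at creation */
	n->name = name;
	return n;
}

static void node_hold(struct node *n) { n->refcount++; }

static void node_drop(struct node *n)
{
	if (--n->refcount == 0) {
		printf("freeing %s\n", n->name);
		free(n);
	}
}

int main(void)
{
	struct node *grp = node_create("grp0");

	node_hold(grp);		/* a concurrent user still holds it */
	node_drop(grp);		/* removal path drops the mkdir reference */
	node_drop(grp);		/* last user goes away: now it is freed */
	return 0;
}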
- * - * Return: void - */ -static void rdtgroup_remove(struct rdtgroup *rdtgrp) -{ - kernfs_put(rdtgrp->kn); - kfree(rdtgrp); -} - -static void _update_task_closid_rmid(void *task) -{ - /* - * If the task is still current on this CPU, update PQR_ASSOC MSR. - * Otherwise, the MSR is updated when the task is scheduled in. - */ - if (task == current) - resctrl_sched_in(task); -} - -static void update_task_closid_rmid(struct task_struct *t) -{ - if (IS_ENABLED(CONFIG_SMP) && task_curr(t)) - smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1); - else - _update_task_closid_rmid(t); -} - -static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp) -{ - u32 closid, rmid = rdtgrp->mon.rmid; - - if (rdtgrp->type == RDTCTRL_GROUP) - closid = rdtgrp->closid; - else if (rdtgrp->type == RDTMON_GROUP) - closid = rdtgrp->mon.parent->closid; - else - return false; - - return resctrl_arch_match_closid(tsk, closid) && - resctrl_arch_match_rmid(tsk, closid, rmid); -} - -static int __rdtgroup_move_task(struct task_struct *tsk, - struct rdtgroup *rdtgrp) -{ - /* If the task is already in rdtgrp, no need to move the task. */ - if (task_in_rdtgroup(tsk, rdtgrp)) - return 0; - - /* - * Set the task's closid/rmid before the PQR_ASSOC MSR can be - * updated by them. - * - * For ctrl_mon groups, move both closid and rmid. - * For monitor groups, can move the tasks only from - * their parent CTRL group. - */ - if (rdtgrp->type == RDTMON_GROUP && - !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) { - rdt_last_cmd_puts("Can't move task to different control group\n"); - return -EINVAL; - } - - if (rdtgrp->type == RDTMON_GROUP) - resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid, - rdtgrp->mon.rmid); - else - resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, - rdtgrp->mon.rmid); - - /* - * Ensure the task's closid and rmid are written before determining if - * the task is current that will decide if it will be interrupted. - * This pairs with the full barrier between the rq->curr update and - * resctrl_sched_in() during context switch. - */ - smp_mb(); - - /* - * By now, the task's closid and rmid are set. If the task is current - * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource - * group go into effect. If the task is not current, the MSR will be - * updated when the task is scheduled in. 
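/*
 * Illustrative sketch, not part of the patch: the ordering that the smp_mb()
 * above provides, expressed with C11 atomics. This is a rough model, not the
 * kernel's barrier implementation: the new closid must be globally visible
 * before the "is the task running?" check, so that either this path or the
 * context-switch path is guaranteed to load the new value.
 */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int task_closid;
static _Atomic int task_running;	/* stand-in for task_curr() */

static void move_task(int new_closid)
{
	atomic_store_explicit(&task_closid, new_closid,
			      memory_order_relaxed);
	/* smp_mb(): publish the closid before checking task_curr() */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&task_running, memory_order_relaxed))
		printf("task is current: poke its CPU to reload the MSR\n");
	else
		printf("task not running: MSR reloaded at next switch-in\n");
}

int main(void)
{
	atomic_store(&task_running, 1);
	move_task(5);
	return 0;
}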
- */ - update_task_closid_rmid(tsk); - - return 0; -} - -static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) -{ - return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) && - resctrl_arch_match_closid(t, r->closid)); -} - -static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) -{ - return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) && - resctrl_arch_match_rmid(t, r->mon.parent->closid, - r->mon.rmid)); -} - -/** - * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group - * @r: Resource group - * - * Return: 1 if tasks have been assigned to @r, 0 otherwise - */ -int rdtgroup_tasks_assigned(struct rdtgroup *r) -{ - struct task_struct *p, *t; - int ret = 0; - - lockdep_assert_held(&rdtgroup_mutex); - - rcu_read_lock(); - for_each_process_thread(p, t) { - if (is_closid_match(t, r) || is_rmid_match(t, r)) { - ret = 1; - break; - } - } - rcu_read_unlock(); - - return ret; -} - -static int rdtgroup_task_write_permission(struct task_struct *task, - struct kernfs_open_file *of) -{ - const struct cred *tcred = get_task_cred(task); - const struct cred *cred = current_cred(); - int ret = 0; - - /* - * Even if we're attaching all tasks in the thread group, we only - * need to check permissions on one of them. - */ - if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && - !uid_eq(cred->euid, tcred->uid) && - !uid_eq(cred->euid, tcred->suid)) { - rdt_last_cmd_printf("No permission to move task %d\n", task->pid); - ret = -EPERM; - } - - put_cred(tcred); - return ret; -} - -static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp, - struct kernfs_open_file *of) -{ - struct task_struct *tsk; - int ret; - - rcu_read_lock(); - if (pid) { - tsk = find_task_by_vpid(pid); - if (!tsk) { - rcu_read_unlock(); - rdt_last_cmd_printf("No task %d\n", pid); - return -ESRCH; - } - } else { - tsk = current; - } - - get_task_struct(tsk); - rcu_read_unlock(); - - ret = rdtgroup_task_write_permission(tsk, of); - if (!ret) - ret = __rdtgroup_move_task(tsk, rdtgrp); - - put_task_struct(tsk); - return ret; -} - -static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct rdtgroup *rdtgrp; - char *pid_str; - int ret = 0; - pid_t pid; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - rdt_last_cmd_clear(); - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = -EINVAL; - rdt_last_cmd_puts("Pseudo-locking in progress\n"); - goto unlock; - } - - while (buf && buf[0] != '\0' && buf[0] != '\n') { - pid_str = strim(strsep(&buf, ",")); - - if (kstrtoint(pid_str, 0, &pid)) { - rdt_last_cmd_printf("Task list parsing error pid %s\n", pid_str); - ret = -EINVAL; - break; - } - - if (pid < 0) { - rdt_last_cmd_printf("Invalid pid %d\n", pid); - ret = -EINVAL; - break; - } - - ret = rdtgroup_move_task(pid, rdtgrp, of); - if (ret) { - rdt_last_cmd_printf("Error while processing task %d\n", pid); - break; - } - } - -unlock: - rdtgroup_kn_unlock(of->kn); - - return ret ?: nbytes; -} - -static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s) -{ - struct task_struct *p, *t; - pid_t pid; - - rcu_read_lock(); - for_each_process_thread(p, t) { - if (is_closid_match(t, r) || is_rmid_match(t, r)) { - pid = task_pid_vnr(t); - if (pid) - seq_printf(s, "%d\n", pid); - } - } - rcu_read_unlock(); -} - -static int rdtgroup_tasks_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct 
rdtgroup *rdtgrp; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) - show_rdt_tasks(rdtgrp, s); - else - ret = -ENOENT; - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -static int rdtgroup_closid_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct rdtgroup *rdtgrp; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) - seq_printf(s, "%u\n", rdtgrp->closid); - else - ret = -ENOENT; - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -static int rdtgroup_rmid_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct rdtgroup *rdtgrp; - int ret = 0; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (rdtgrp) - seq_printf(s, "%u\n", rdtgrp->mon.rmid); - else - ret = -ENOENT; - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -#ifdef CONFIG_PROC_CPU_RESCTRL - -/* - * A task can only be part of one resctrl control group and of one monitor - * group which is associated to that control group. - * - * 1) res: - * mon: - * - * resctrl is not available. - * - * 2) res:/ - * mon: - * - * Task is part of the root resctrl control group, and it is not associated - * to any monitor group. - * - * 3) res:/ - * mon:mon0 - * - * Task is part of the root resctrl control group and monitor group mon0. - * - * 4) res:group0 - * mon: - * - * Task is part of resctrl control group group0, and it is not associated - * to any monitor group. - * - * 5) res:group0 - * mon:mon1 - * - * Task is part of resctrl control group group0 and monitor group mon1. - */ -int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, - struct pid *pid, struct task_struct *tsk) -{ - struct rdtgroup *rdtg; - int ret = 0; - - mutex_lock(&rdtgroup_mutex); - - /* Return empty if resctrl has not been mounted. */ - if (!resctrl_mounted) { - seq_puts(s, "res:\nmon:\n"); - goto unlock; - } - - list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) { - struct rdtgroup *crg; - - /* - * Task information is only relevant for shareable - * and exclusive groups. - */ - if (rdtg->mode != RDT_MODE_SHAREABLE && - rdtg->mode != RDT_MODE_EXCLUSIVE) - continue; - - if (!resctrl_arch_match_closid(tsk, rdtg->closid)) - continue; - - seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "", - rdtg->kn->name); - seq_puts(s, "mon:"); - list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, - mon.crdtgrp_list) { - if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid, - crg->mon.rmid)) - continue; - seq_printf(s, "%s", crg->kn->name); - break; - } - seq_putc(s, '\n'); - goto unlock; - } - /* - * The above search should succeed. Otherwise return - * with an error. 
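/*
 * Illustrative userspace sketch, not part of the patch: the five output
 * shapes enumerated in the comment above all reduce to one line per field.
 * Hypothetical helper; NULL stands for "not associated".
 */
#include <stdio.h>

static void show_membership(const char *ctrl_grp, const char *mon_grp)
{
	printf("res:%s\n", ctrl_grp ? ctrl_grp : "");
	printf("mon:%s\n", mon_grp ? mon_grp : "");
}

int main(void)
{
	show_membership(NULL, NULL);		/* case 1: resctrl unavailable */
	show_membership("/", NULL);		/* case 2: root group only */
	show_membership("/", "mon0");		/* case 3: root group + mon0 */
	show_membership("group0", "mon1");	/* case 5: group0 + mon1 */
	return 0;
}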
- */ - ret = -ENOENT; -unlock: - mutex_unlock(&rdtgroup_mutex); - - return ret; -} -#endif - -static int rdt_last_cmd_status_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - int len; - - mutex_lock(&rdtgroup_mutex); - len = seq_buf_used(&last_cmd_status); - if (len) - seq_printf(seq, "%.*s", len, last_cmd_status_buf); - else - seq_puts(seq, "ok\n"); - mutex_unlock(&rdtgroup_mutex); - return 0; -} - -static int rdt_num_closids_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - - seq_printf(seq, "%u\n", s->num_closid); - return 0; -} - -static int rdt_default_ctrl_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%x\n", r->default_ctrl); - return 0; -} - -static int rdt_min_cbm_bits_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->cache.min_cbm_bits); - return 0; -} - -static int rdt_shareable_bits_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%x\n", r->cache.shareable_bits); - return 0; -} - -/* - * rdt_bit_usage_show - Display current usage of resources - * - * A domain is a shared resource that can now be allocated differently. Here - * we display the current regions of the domain as an annotated bitmask. - * For each domain of this resource its allocation bitmask - * is annotated as below to indicate the current usage of the corresponding bit: - * 0 - currently unused - * X - currently available for sharing and used by software and hardware - * H - currently used by hardware only but available for software use - * S - currently used and shareable by software only - * E - currently used exclusively by one resource group - * P - currently pseudo-locked by one resource group - */ -static int rdt_bit_usage_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - /* - * Use unsigned long even though only 32 bits are used to ensure - * test_bit() is used safely. - */ - unsigned long sw_shareable = 0, hw_shareable = 0; - unsigned long exclusive = 0, pseudo_locked = 0; - struct rdt_resource *r = s->res; - struct rdt_domain *dom; - int i, hwb, swb, excl, psl; - enum rdtgrp_mode mode; - bool sep = false; - u32 ctrl_val; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - hw_shareable = r->cache.shareable_bits; - list_for_each_entry(dom, &r->domains, list) { - if (sep) - seq_putc(seq, ';'); - sw_shareable = 0; - exclusive = 0; - seq_printf(seq, "%d=", dom->id); - for (i = 0; i < closids_supported(); i++) { - if (!closid_allocated(i)) - continue; - ctrl_val = resctrl_arch_get_config(r, dom, i, - s->conf_type); - mode = rdtgroup_mode_by_closid(i); - switch (mode) { - case RDT_MODE_SHAREABLE: - sw_shareable |= ctrl_val; - break; - case RDT_MODE_EXCLUSIVE: - exclusive |= ctrl_val; - break; - case RDT_MODE_PSEUDO_LOCKSETUP: - /* - * RDT_MODE_PSEUDO_LOCKSETUP is possible - * here but not included since the CBM - * associated with this CLOSID in this mode - * is not initialized and no task or cpu can be - * assigned this CLOSID. 
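/*
 * Illustrative userspace sketch, not part of the patch: the per-bit
 * annotation emitted by rdt_bit_usage_show(), with the same precedence as
 * the legend above: hardware+software 'X', hardware-only 'H', software-only
 * 'S', exclusive 'E', pseudo-locked 'P', unused '0'. The example masks are
 * made up.
 */
#include <stdio.h>

static void print_bit_usage(unsigned long hw, unsigned long sw,
			    unsigned long excl, unsigned long psl,
			    int cbm_len)
{
	for (int i = cbm_len - 1; i >= 0; i--) {
		int hwb = !!(hw & (1UL << i));
		int swb = !!(sw & (1UL << i));

		if (hwb && swb)
			putchar('X');
		else if (hwb)
			putchar('H');
		else if (swb)
			putchar('S');
		else if (excl & (1UL << i))
			putchar('E');
		else if (psl & (1UL << i))
			putchar('P');
		else
			putchar('0');
	}
	putchar('\n');
}

int main(void)
{
	/* 8-bit CBM: top bits hw-shareable, low bits exclusive */
	print_bit_usage(0xc0, 0x30, 0x03, 0x04, 8);	/* prints HHSS0PEE */
	return 0;
}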
- */ - break; - case RDT_MODE_PSEUDO_LOCKED: - case RDT_NUM_MODES: - WARN(1, - "invalid mode for closid %d\n", i); - break; - } - } - for (i = r->cache.cbm_len - 1; i >= 0; i--) { - pseudo_locked = dom->plr ? dom->plr->cbm : 0; - hwb = test_bit(i, &hw_shareable); - swb = test_bit(i, &sw_shareable); - excl = test_bit(i, &exclusive); - psl = test_bit(i, &pseudo_locked); - if (hwb && swb) - seq_putc(seq, 'X'); - else if (hwb && !swb) - seq_putc(seq, 'H'); - else if (!hwb && swb) - seq_putc(seq, 'S'); - else if (excl) - seq_putc(seq, 'E'); - else if (psl) - seq_putc(seq, 'P'); - else /* Unused bits remain */ - seq_putc(seq, '0'); - } - sep = true; - } - seq_putc(seq, '\n'); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - return 0; -} - -static int rdt_min_bw_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->membw.min_bw); - return 0; -} - -static int rdt_num_rmids_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - - seq_printf(seq, "%d\n", r->num_rmid); - - return 0; -} - -static int rdt_mon_features_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - struct mon_evt *mevt; - - list_for_each_entry(mevt, &r->evt_list, list) { - seq_printf(seq, "%s\n", mevt->name); - if (mevt->configurable) - seq_printf(seq, "%s_config\n", mevt->name); - } - - return 0; -} - -static int rdt_bw_gran_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->membw.bw_gran); - return 0; -} - -static int rdt_delay_linear_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->membw.delay_linear); - return 0; -} - -static int max_threshold_occ_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold); - - return 0; -} - -static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - if (r->membw.throttle_mode == THREAD_THROTTLE_PER_THREAD) - seq_puts(seq, "per-thread\n"); - else - seq_puts(seq, "max\n"); - - return 0; -} - -static ssize_t max_threshold_occ_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - unsigned int bytes; - int ret; - - ret = kstrtouint(buf, 0, &bytes); - if (ret) - return ret; - - if (bytes > resctrl_rmid_realloc_limit) - return -EINVAL; - - resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes); - - return nbytes; -} - -/* - * rdtgroup_mode_show - Display mode of this resource group - */ -static int rdtgroup_mode_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct rdtgroup *rdtgrp; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - - seq_printf(s, "%s\n", rdtgroup_mode_str(rdtgrp->mode)); - - rdtgroup_kn_unlock(of->kn); - return 0; -} - -static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type) -{ - switch (my_type) { - case CDP_CODE: - return CDP_DATA; - case CDP_DATA: - return CDP_CODE; - default: - case 
CDP_NONE: - return CDP_NONE; - } -} - -static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct resctrl_schema *s = of->kn->parent->priv; - struct rdt_resource *r = s->res; - - seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks); - - return 0; -} - -/** - * __rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other - * @r: Resource to which domain instance @d belongs. - * @d: The domain instance for which @closid is being tested. - * @cbm: Capacity bitmask being tested. - * @closid: Intended closid for @cbm. - * @type: CDP type of @r. - * @exclusive: Only check if overlaps with exclusive resource groups - * - * Checks if provided @cbm intended to be used for @closid on domain - * @d overlaps with any other closids or other hardware usage associated - * with this domain. If @exclusive is true then only overlaps with - * resource groups in exclusive mode will be considered. If @exclusive - * is false then overlaps with any resource group or hardware entities - * will be considered. - * - * @cbm is unsigned long, even if only 32 bits are used, to make the - * bitmap functions work correctly. - * - * Return: false if CBM does not overlap, true if it does. - */ -static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d, - unsigned long cbm, int closid, - enum resctrl_conf_type type, bool exclusive) -{ - enum rdtgrp_mode mode; - unsigned long ctrl_b; - int i; - - /* Check for any overlap with regions used by hardware directly */ - if (!exclusive) { - ctrl_b = r->cache.shareable_bits; - if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) - return true; - } - - /* Check for overlap with other resource groups */ - for (i = 0; i < closids_supported(); i++) { - ctrl_b = resctrl_arch_get_config(r, d, i, type); - mode = rdtgroup_mode_by_closid(i); - if (closid_allocated(i) && i != closid && - mode != RDT_MODE_PSEUDO_LOCKSETUP) { - if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) { - if (exclusive) { - if (mode == RDT_MODE_EXCLUSIVE) - return true; - continue; - } - return true; - } - } - } - - return false; -} - -/** - * rdtgroup_cbm_overlaps - Does CBM overlap with other use of hardware - * @s: Schema for the resource to which domain instance @d belongs. - * @d: The domain instance for which @closid is being tested. - * @cbm: Capacity bitmask being tested. - * @closid: Intended closid for @cbm. - * @exclusive: Only check if overlaps with exclusive resource groups - * - * Resources that can be allocated using a CBM can use the CBM to control - * the overlap of these allocations. rdtgroup_cmb_overlaps() is the test - * for overlap. Overlap test is not limited to the specific resource for - * which the CBM is intended though - when dealing with CDP resources that - * share the underlying hardware the overlap check should be performed on - * the CDP resource sharing the hardware also. - * - * Refer to description of __rdtgroup_cbm_overlaps() for the details of the - * overlap test. 
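
__rdtgroup_cbm_overlaps() above is plain bitmap intersection across the allocated CLOSIDs. A minimal user-space sketch of the same rule, with two hard-coded groups standing in for resctrl_arch_get_config() and rdtgroup_mode_by_closid(), and closid_allocated()/pseudo-lock handling elided:

#include <stdbool.h>
#include <stdio.h>

enum mode { SHAREABLE, EXCLUSIVE };

struct grp { enum mode mode; unsigned long cbm; };

/* Stand-in for two allocated CLOSIDs in one domain. */
static const struct grp grps[] = {
	{ SHAREABLE, 0x00f },
	{ EXCLUSIVE, 0x0f0 },
};

/*
 * Mirrors __rdtgroup_cbm_overlaps(): @exclusive restricts the test to
 * groups in exclusive mode, otherwise any intersection (including the
 * hardware-shareable bits) counts as an overlap.
 */
static bool cbm_overlaps(unsigned long cbm, int self, bool exclusive,
			 unsigned long hw_shareable)
{
	if (!exclusive && (cbm & hw_shareable))
		return true;

	for (int i = 0; i < 2; i++) {
		if (i == self || !(cbm & grps[i].cbm))
			continue;
		if (!exclusive || grps[i].mode == EXCLUSIVE)
			return true;
	}
	return false;
}

int main(void)
{
	/* 0x300 touches nobody; 0x0f0 hits the exclusive group. */
	printf("%d %d\n", cbm_overlaps(0x300, -1, false, 0x3),
	       cbm_overlaps(0x0f0, -1, true, 0x3));
	return 0;
}

The kernel version additionally walks closids_supported() and skips CLOSIDs in pseudo-locksetup mode, whose CBMs are not yet meaningful.
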
- * - * Return: true if CBM overlap detected, false if there is no overlap - */ -bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d, - unsigned long cbm, int closid, bool exclusive) -{ - enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); - struct rdt_resource *r = s->res; - - if (__rdtgroup_cbm_overlaps(r, d, cbm, closid, s->conf_type, - exclusive)) - return true; - - if (!resctrl_arch_get_cdp_enabled(r->rid)) - return false; - return __rdtgroup_cbm_overlaps(r, d, cbm, closid, peer_type, exclusive); -} - -/** - * rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive - * @rdtgrp: Resource group identified through its closid. - * - * An exclusive resource group implies that there should be no sharing of - * its allocated resources. At the time this group is considered to be - * exclusive this test can determine if its current schemata supports this - * setting by testing for overlap with all other resource groups. - * - * Return: true if resource group can be exclusive, false if there is overlap - * with allocations of other resource groups and thus this resource group - * cannot be exclusive. - */ -static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) -{ - int closid = rdtgrp->closid; - struct resctrl_schema *s; - struct rdt_resource *r; - bool has_cache = false; - struct rdt_domain *d; - u32 ctrl; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) - continue; - has_cache = true; - list_for_each_entry(d, &r->domains, list) { - ctrl = resctrl_arch_get_config(r, d, closid, - s->conf_type); - if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { - rdt_last_cmd_puts("Schemata overlaps\n"); - return false; - } - } - } - - if (!has_cache) { - rdt_last_cmd_puts("Cannot be exclusive without CAT/CDP\n"); - return false; - } - - return true; -} - -/* - * rdtgroup_mode_write - Modify the resource group's mode - */ -static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, loff_t off) -{ - struct rdtgroup *rdtgrp; - enum rdtgrp_mode mode; - int ret = 0; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - buf[nbytes - 1] = '\0'; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - - rdt_last_cmd_clear(); - - mode = rdtgrp->mode; - - if ((!strcmp(buf, "shareable") && mode == RDT_MODE_SHAREABLE) || - (!strcmp(buf, "exclusive") && mode == RDT_MODE_EXCLUSIVE) || - (!strcmp(buf, "pseudo-locksetup") && - mode == RDT_MODE_PSEUDO_LOCKSETUP) || - (!strcmp(buf, "pseudo-locked") && mode == RDT_MODE_PSEUDO_LOCKED)) - goto out; - - if (mode == RDT_MODE_PSEUDO_LOCKED) { - rdt_last_cmd_puts("Cannot change pseudo-locked group\n"); - ret = -EINVAL; - goto out; - } - - if (!strcmp(buf, "shareable")) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = rdtgroup_locksetup_exit(rdtgrp); - if (ret) - goto out; - } - rdtgrp->mode = RDT_MODE_SHAREABLE; - } else if (!strcmp(buf, "exclusive")) { - if (!rdtgroup_mode_test_exclusive(rdtgrp)) { - ret = -EINVAL; - goto out; - } - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - ret = rdtgroup_locksetup_exit(rdtgrp); - if (ret) - goto out; - } - rdtgrp->mode = RDT_MODE_EXCLUSIVE; - } else if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && - !strcmp(buf, 
"pseudo-locksetup")) { - ret = rdtgroup_locksetup_enter(rdtgrp); - if (ret) - goto out; - rdtgrp->mode = RDT_MODE_PSEUDO_LOCKSETUP; - } else { - rdt_last_cmd_puts("Unknown or unsupported mode\n"); - ret = -EINVAL; - } - -out: - rdtgroup_kn_unlock(of->kn); - return ret ?: nbytes; -} - -/** - * rdtgroup_cbm_to_size - Translate CBM to size in bytes - * @r: RDT resource to which @d belongs. - * @d: RDT domain instance. - * @cbm: bitmask for which the size should be computed. - * - * The bitmask provided associated with the RDT domain instance @d will be - * translated into how many bytes it represents. The size in bytes is - * computed by first dividing the total cache size by the CBM length to - * determine how many bytes each bit in the bitmask represents. The result - * is multiplied with the number of bits set in the bitmask. - * - * @cbm is unsigned long, even if only 32 bits are used to make the - * bitmap functions work correctly. - */ -unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, - struct rdt_domain *d, unsigned long cbm) -{ - struct cpu_cacheinfo *ci; - unsigned int size = 0; - int num_b, i; - - num_b = bitmap_weight(&cbm, r->cache.cbm_len); - ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask)); - for (i = 0; i < ci->num_leaves; i++) { - if (ci->info_list[i].level == r->cache_level) { - size = ci->info_list[i].size / r->cache.cbm_len * num_b; - break; - } - } - - return size; -} - -/* - * rdtgroup_size_show - Display size in bytes of allocated regions - * - * The "size" file mirrors the layout of the "schemata" file, printing the - * size in bytes of each region instead of the capacity bitmask. - */ -static int rdtgroup_size_show(struct kernfs_open_file *of, - struct seq_file *s, void *v) -{ - struct resctrl_schema *schema; - enum resctrl_conf_type type; - struct rdtgroup *rdtgrp; - struct rdt_resource *r; - struct rdt_domain *d; - unsigned int size; - int ret = 0; - u32 closid; - bool sep; - u32 ctrl; - - rdtgrp = rdtgroup_kn_lock_live(of->kn); - if (!rdtgrp) { - rdtgroup_kn_unlock(of->kn); - return -ENOENT; - } - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { - if (!rdtgrp->plr->d) { - rdt_last_cmd_clear(); - rdt_last_cmd_puts("Cache domain offline\n"); - ret = -ENODEV; - } else { - seq_printf(s, "%*s:", max_name_width, - rdtgrp->plr->s->name); - size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res, - rdtgrp->plr->d, - rdtgrp->plr->cbm); - seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size); - } - goto out; - } - - closid = rdtgrp->closid; - - list_for_each_entry(schema, &resctrl_schema_all, list) { - r = schema->res; - type = schema->conf_type; - sep = false; - seq_printf(s, "%*s:", max_name_width, schema->name); - list_for_each_entry(d, &r->domains, list) { - if (sep) - seq_putc(s, ';'); - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { - size = 0; - } else { - if (is_mba_sc(r)) - ctrl = d->mbps_val[closid]; - else - ctrl = resctrl_arch_get_config(r, d, - closid, - type); - if (r->rid == RDT_RESOURCE_MBA || - r->rid == RDT_RESOURCE_SMBA) - size = ctrl; - else - size = rdtgroup_cbm_to_size(r, d, ctrl); - } - seq_printf(s, "%d=%u", d->id, size); - sep = true; - } - seq_putc(s, '\n'); - } - -out: - rdtgroup_kn_unlock(of->kn); - - return ret; -} - -#define INVALID_CONFIG_INDEX UINT_MAX - -/** - * mon_event_config_index_get - get the hardware index for the - * configurable event - * @evtid: event id. 
- * - * Return: 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID - * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID - * INVALID_CONFIG_INDEX for invalid evtid - */ -static inline unsigned int mon_event_config_index_get(u32 evtid) -{ - switch (evtid) { - case QOS_L3_MBM_TOTAL_EVENT_ID: - return 0; - case QOS_L3_MBM_LOCAL_EVENT_ID: - return 1; - default: - /* Should never reach here */ - return INVALID_CONFIG_INDEX; - } -} - -void resctrl_arch_mon_event_config_read(void *info) -{ - struct resctrl_mon_config_info *mon_info = info; - unsigned int index; - u64 msrval; - - index = mon_event_config_index_get(mon_info->evtid); - if (index == INVALID_CONFIG_INDEX) { - pr_warn_once("Invalid event id %d\n", mon_info->evtid); - return; - } - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval); - - /* Report only the valid event configuration bits */ - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; -} - -static void mondata_config_read(struct resctrl_mon_config_info *mon_info) -{ - smp_call_function_any(&mon_info->d->cpu_mask, - resctrl_arch_mon_event_config_read, mon_info, 1); -} - -static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) -{ - struct resctrl_mon_config_info mon_info = {0}; - struct rdt_domain *dom; - bool sep = false; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - list_for_each_entry(dom, &r->domains, list) { - if (sep) - seq_puts(s, ";"); - - memset(&mon_info, 0, sizeof(struct resctrl_mon_config_info)); - mon_info.r = r; - mon_info.d = dom; - mon_info.evtid = evtid; - mondata_config_read(&mon_info); - - seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); - sep = true; - } - seq_puts(s, "\n"); - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - return 0; -} - -static int mbm_total_bytes_config_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - - mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); - - return 0; -} - -static int mbm_local_bytes_config_show(struct kernfs_open_file *of, - struct seq_file *seq, void *v) -{ - struct rdt_resource *r = of->kn->parent->priv; - - mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); - - return 0; -} - -void resctrl_arch_mon_event_config_write(void *info) -{ - struct resctrl_mon_config_info *mon_info = info; - unsigned int index; - - index = mon_event_config_index_get(mon_info->evtid); - if (index == INVALID_CONFIG_INDEX) { - pr_warn_once("Invalid event id %d\n", mon_info->evtid); - mon_info->err = -EINVAL; - return; - } - wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); - - mon_info->err = 0; -} - -static void mbm_config_write_domain(struct rdt_resource *r, - struct rdt_domain *d, u32 evtid, u32 val) -{ - struct resctrl_mon_config_info mon_info = {0}; - - /* - * Read the current config value first. If both are the same then - * no need to write it again. - */ - mon_info.r = r; - mon_info.d = d; - mon_info.evtid = evtid; - mondata_config_read(&mon_info); - if (mon_info.mon_config == val) - return; - - mon_info.mon_config = val; - - /* - * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the - * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE - * are scoped at the domain level. Writing any of these MSRs - * on one CPU is observed by all the CPUs in the domain. 
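
rdtgroup_cbm_to_size(), further up, is straight arithmetic: the cache size divided by the CBM length gives bytes per bit, multiplied by the popcount of the mask. A worked example with assumed numbers (a 32MB cache and a 16 bit CBM; the real values come from cacheinfo):

#include <stdio.h>

int main(void)
{
	unsigned int cache_size = 32 * 1024 * 1024;	/* assumed L3 size */
	unsigned int cbm_len = 16;			/* assumed CBM width */
	unsigned long cbm = 0x00ff;			/* 8 of 16 portions */

	/* Same computation as rdtgroup_cbm_to_size(). */
	unsigned int size = cache_size / cbm_len *
			    __builtin_popcountl(cbm);

	printf("%u\n", size);	/* 16777216: half the cache */
	return 0;
}

Half the bits set reports half the cache, which is exactly what the per-domain entries in the "size" file show.
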
- */ - smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, - &mon_info, 1); - if (mon_info.err) { - rdt_last_cmd_puts("Invalid event configuration\n"); - return; - } - - /* - * When an Event Configuration is changed, the bandwidth counters - * for all RMIDs and Events will be cleared by the hardware. The - * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for - * every RMID on the next read to any event for every RMID. - * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) - * cleared while it is tracked by the hardware. Clear the - * mbm_local and mbm_total counts for all the RMIDs. - */ - resctrl_arch_reset_rmid_all(r, d); -} - -static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) -{ - char *dom_str = NULL, *id_str; - unsigned long dom_id, val; - struct rdt_domain *d; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - -next: - if (!tok || tok[0] == '\0') - return 0; - - /* Start processing the strings for each domain */ - dom_str = strim(strsep(&tok, ";")); - id_str = strsep(&dom_str, "="); - - if (!id_str || kstrtoul(id_str, 10, &dom_id)) { - rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n"); - return -EINVAL; - } - - if (!dom_str || kstrtoul(dom_str, 16, &val)) { - rdt_last_cmd_puts("Non-numeric event configuration value\n"); - return -EINVAL; - } - - /* Value from user cannot be more than the supported set of events */ - if ((val & r->mbm_cfg_mask) != val) { - rdt_last_cmd_printf("Invalid event configuration: max valid mask is 0x%02x\n", - r->mbm_cfg_mask); - return -EINVAL; - } - - list_for_each_entry(d, &r->domains, list) { - if (d->id == dom_id) { - mbm_config_write_domain(r, d, evtid, val); - goto next; - } - } - - return -EINVAL; -} - -static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, - loff_t off) -{ - struct rdt_resource *r = of->kn->parent->priv; - int ret; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_last_cmd_clear(); - - buf[nbytes - 1] = '\0'; - - ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - return ret ?: nbytes; -} - -static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, - char *buf, size_t nbytes, - loff_t off) -{ - struct rdt_resource *r = of->kn->parent->priv; - int ret; - - /* Valid input requires a trailing newline */ - if (nbytes == 0 || buf[nbytes - 1] != '\n') - return -EINVAL; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_last_cmd_clear(); - - buf[nbytes - 1] = '\0'; - - ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - return ret ?: nbytes; -} - -/* rdtgroup information files for one cache resource. 
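
mon_config_write() above accepts the same "id=val" list format the schemata file uses, e.g. "0=0x33;1=0x33". A user-space sketch of that strsep() based parse, with 0x3f standing in for r->mbm_cfg_mask and the strim()/error-message details trimmed:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char buf[] = "0=0x33;1=0x3f";	/* sample config write */
	unsigned long mask = 0x3f;	/* stand-in for r->mbm_cfg_mask */
	char *tok = buf, *dom_str, *id_str;

	while (tok && *tok) {
		dom_str = strsep(&tok, ";");
		id_str = strsep(&dom_str, "=");
		if (!id_str || !dom_str)
			return 1;	/* missing '=' */

		unsigned long dom_id = strtoul(id_str, NULL, 10);
		unsigned long val = strtoul(dom_str, NULL, 16);

		/* Reject bits outside the supported event mask. */
		if ((val & mask) != val)
			return 1;

		printf("domain %lu <- 0x%02lx\n", dom_id, val);
	}
	return 0;
}
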
*/ -static struct rftype res_common_files[] = { - { - .name = "last_cmd_status", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_last_cmd_status_show, - .fflags = RFTYPE_TOP_INFO, - }, - { - .name = "num_closids", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_num_closids_show, - .fflags = RFTYPE_CTRL_INFO, - }, - { - .name = "mon_features", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_mon_features_show, - .fflags = RFTYPE_MON_INFO, - }, - { - .name = "num_rmids", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_num_rmids_show, - .fflags = RFTYPE_MON_INFO, - }, - { - .name = "cbm_mask", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_default_ctrl_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "min_cbm_bits", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_min_cbm_bits_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "shareable_bits", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_shareable_bits_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "bit_usage", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_bit_usage_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "min_bandwidth", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_min_bw_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, - }, - { - .name = "bandwidth_gran", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_bw_gran_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, - }, - { - .name = "delay_linear", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_delay_linear_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, - }, - /* - * Platform specific which (if any) capabilities are provided by - * thread_throttle_mode. Defer "fflags" initialization to platform - * discovery. 
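
Whether an entry of res_common_files above appears in a given directory is decided by rdtgroup_add_files() below: every flag the entry demands must be offered by the directory. The test in isolation (flag values here are illustrative, not the real RFTYPE_* masks):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative flag values only; the real RFTYPE_* masks differ. */
#define RFTYPE_CTRL_INFO	0x1
#define RFTYPE_RES_CACHE	0x2
#define RFTYPE_DEBUG		0x4

/* The subset test rdtgroup_add_files() applies per entry. */
static bool file_wanted(unsigned long dir_fflags, unsigned long rft_fflags)
{
	return rft_fflags && (dir_fflags & rft_fflags) == rft_fflags;
}

int main(void)
{
	unsigned long dir = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE;

	/* 1: both required flags offered. 0: RFTYPE_DEBUG missing. */
	printf("%d %d\n",
	       file_wanted(dir, RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE),
	       file_wanted(dir, RFTYPE_CTRL_INFO | RFTYPE_DEBUG));
	return 0;
}

Entries with zero fflags, like thread_throttle_mode above, stay hidden until platform discovery fills the field in.
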
- */ - { - .name = "thread_throttle_mode", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_thread_throttle_mode_show, - }, - { - .name = "max_threshold_occupancy", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = max_threshold_occ_write, - .seq_show = max_threshold_occ_show, - .fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "mbm_total_bytes_config", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = mbm_total_bytes_config_show, - .write = mbm_total_bytes_config_write, - }, - { - .name = "mbm_local_bytes_config", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = mbm_local_bytes_config_show, - .write = mbm_local_bytes_config_write, - }, - { - .name = "cpus", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_cpus_write, - .seq_show = rdtgroup_cpus_show, - .fflags = RFTYPE_BASE, - }, - { - .name = "cpus_list", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_cpus_write, - .seq_show = rdtgroup_cpus_show, - .flags = RFTYPE_FLAGS_CPUS_LIST, - .fflags = RFTYPE_BASE, - }, - { - .name = "tasks", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_tasks_write, - .seq_show = rdtgroup_tasks_show, - .fflags = RFTYPE_BASE, - }, - { - .name = "mon_hw_id", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdtgroup_rmid_show, - .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG, - }, - { - .name = "schemata", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_schemata_write, - .seq_show = rdtgroup_schemata_show, - .fflags = RFTYPE_CTRL_BASE, - }, - { - .name = "mode", - .mode = 0644, - .kf_ops = &rdtgroup_kf_single_ops, - .write = rdtgroup_mode_write, - .seq_show = rdtgroup_mode_show, - .fflags = RFTYPE_CTRL_BASE, - }, - { - .name = "size", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdtgroup_size_show, - .fflags = RFTYPE_CTRL_BASE, - }, - { - .name = "sparse_masks", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdt_has_sparse_bitmasks_show, - .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, - }, - { - .name = "ctrl_hw_id", - .mode = 0444, - .kf_ops = &rdtgroup_kf_single_ops, - .seq_show = rdtgroup_closid_show, - .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG, - }, - -}; - -static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags) -{ - struct rftype *rfts, *rft; - int ret, len; - - rfts = res_common_files; - len = ARRAY_SIZE(res_common_files); - - lockdep_assert_held(&rdtgroup_mutex); - - if (resctrl_debug) - fflags |= RFTYPE_DEBUG; - - for (rft = rfts; rft < rfts + len; rft++) { - if (rft->fflags && ((fflags & rft->fflags) == rft->fflags)) { - ret = rdtgroup_add_file(kn, rft); - if (ret) - goto error; - } - } - - return 0; -error: - pr_warn("Failed to add %s, err=%d\n", rft->name, ret); - while (--rft >= rfts) { - if ((fflags & rft->fflags) == rft->fflags) - kernfs_remove_by_name(kn, rft->name); - } - return ret; -} - -static struct rftype *rdtgroup_get_rftype_by_name(const char *name) -{ - struct rftype *rfts, *rft; - int len; - - rfts = res_common_files; - len = ARRAY_SIZE(res_common_files); - - for (rft = rfts; rft < rfts + len; rft++) { - if (!strcmp(rft->name, name)) - return rft; - } - - return NULL; -} - -static void thread_throttle_mode_init(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - struct rftype *rft; - - if (!r->alloc_capable || - r->membw.throttle_mode == THREAD_THROTTLE_UNDEFINED) - return; - - rft = 
rdtgroup_get_rftype_by_name("thread_throttle_mode"); - if (!rft) - return; - - rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; -} - -void mbm_config_rftype_init(const char *config) -{ - struct rftype *rft; - - rft = rdtgroup_get_rftype_by_name(config); - if (rft) - rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE; -} - -/** - * rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file - * @r: The resource group with which the file is associated. - * @name: Name of the file - * - * The permissions of named resctrl file, directory, or link are modified - * to not allow read, write, or execute by any user. - * - * WARNING: This function is intended to communicate to the user that the - * resctrl file has been locked down - that it is not relevant to the - * particular state the system finds itself in. It should not be relied - * on to protect from user access because after the file's permissions - * are restricted the user can still change the permissions using chmod - * from the command line. - * - * Return: 0 on success, <0 on failure. - */ -int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name) -{ - struct iattr iattr = {.ia_valid = ATTR_MODE,}; - struct kernfs_node *kn; - int ret = 0; - - kn = kernfs_find_and_get_ns(r->kn, name, NULL); - if (!kn) - return -ENOENT; - - switch (kernfs_type(kn)) { - case KERNFS_DIR: - iattr.ia_mode = S_IFDIR; - break; - case KERNFS_FILE: - iattr.ia_mode = S_IFREG; - break; - case KERNFS_LINK: - iattr.ia_mode = S_IFLNK; - break; - } - - ret = kernfs_setattr(kn, &iattr); - kernfs_put(kn); - return ret; -} - -/** - * rdtgroup_kn_mode_restore - Restore user access to named resctrl file - * @r: The resource group with which the file is associated. - * @name: Name of the file - * @mask: Mask of permissions that should be restored - * - * Restore the permissions of the named file. If @name is a directory the - * permissions of its parent will be used. - * - * Return: 0 on success, <0 on failure. 
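
rdtgroup_kn_mode_restrict() above drops every rwx bit by writing a mode that carries only the file-type bits; rdtgroup_kn_mode_restore() below rebuilds the mode from the rftype table, filtered by the caller's mask. The octal arithmetic, as a user-space sketch:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	/* Restricted: type bits only, i.e. 0000 permissions. */
	printf("%o\n", S_IFREG);			/* 100000 */
	/* Restored: table mode 0644 filtered by a 0222 mask. */
	printf("%o\n", S_IFREG | (0644 & 0222));	/* 100200 */
	return 0;
}
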
- */ -int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, - umode_t mask) -{ - struct iattr iattr = {.ia_valid = ATTR_MODE,}; - struct kernfs_node *kn, *parent; - struct rftype *rfts, *rft; - int ret, len; - - rfts = res_common_files; - len = ARRAY_SIZE(res_common_files); - - for (rft = rfts; rft < rfts + len; rft++) { - if (!strcmp(rft->name, name)) - iattr.ia_mode = rft->mode & mask; - } - - kn = kernfs_find_and_get_ns(r->kn, name, NULL); - if (!kn) - return -ENOENT; - - switch (kernfs_type(kn)) { - case KERNFS_DIR: - parent = kernfs_get_parent(kn); - if (parent) { - iattr.ia_mode |= parent->mode; - kernfs_put(parent); - } - iattr.ia_mode |= S_IFDIR; - break; - case KERNFS_FILE: - iattr.ia_mode |= S_IFREG; - break; - case KERNFS_LINK: - iattr.ia_mode |= S_IFLNK; - break; - } - - ret = kernfs_setattr(kn, &iattr); - kernfs_put(kn); - return ret; -} - -static int rdtgroup_mkdir_info_resdir(void *priv, char *name, - unsigned long fflags) -{ - struct kernfs_node *kn_subdir; - int ret; - - kn_subdir = kernfs_create_dir(kn_info, name, - kn_info->mode, priv); - if (IS_ERR(kn_subdir)) - return PTR_ERR(kn_subdir); - - ret = rdtgroup_kn_set_ugid(kn_subdir); - if (ret) - return ret; - - ret = rdtgroup_add_files(kn_subdir, fflags); - if (!ret) - kernfs_activate(kn_subdir); - - return ret; -} - -static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) -{ - enum resctrl_res_level i; - struct resctrl_schema *s; - struct rdt_resource *r; - unsigned long fflags; - char name[32]; - int ret; - - /* create the directory */ - kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL); - if (IS_ERR(kn_info)) - return PTR_ERR(kn_info); - - ret = rdtgroup_add_files(kn_info, RFTYPE_TOP_INFO); - if (ret) - goto out_destroy; - - /* loop over enabled controls, these are all alloc_capable */ - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - fflags = r->fflags | RFTYPE_CTRL_INFO; - ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); - if (ret) - goto out_destroy; - } - - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->mon_capable) - continue; - - fflags = r->fflags | RFTYPE_MON_INFO; - sprintf(name, "%s_MON", r->name); - ret = rdtgroup_mkdir_info_resdir(r, name, fflags); - if (ret) - goto out_destroy; - } - - ret = rdtgroup_kn_set_ugid(kn_info); - if (ret) - goto out_destroy; - - kernfs_activate(kn_info); - - return 0; - -out_destroy: - kernfs_remove(kn_info); - return ret; -} - -static int -mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, - char *name, struct kernfs_node **dest_kn) -{ - struct kernfs_node *kn; - int ret; - - /* create the directory */ - kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); - if (IS_ERR(kn)) - return PTR_ERR(kn); - - if (dest_kn) - *dest_kn = kn; - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) - goto out_destroy; - - kernfs_activate(kn); - - return 0; - -out_destroy: - kernfs_remove(kn); - return ret; -} - -static void l3_qos_cfg_update(void *arg) -{ - bool *enable = arg; - - wrmsrl(MSR_IA32_L3_QOS_CFG, *enable ? L3_QOS_CDP_ENABLE : 0ULL); -} - -static void l2_qos_cfg_update(void *arg) -{ - bool *enable = arg; - - wrmsrl(MSR_IA32_L2_QOS_CFG, *enable ? 
L2_QOS_CDP_ENABLE : 0ULL); -} - -static inline bool is_mba_linear(void) -{ - return resctrl_arch_get_resource(RDT_RESOURCE_MBA)->membw.delay_linear; -} - -static int set_cache_qos_cfg(int level, bool enable) -{ - void (*update)(void *arg); - struct rdt_resource *r_l; - cpumask_var_t cpu_mask; - struct rdt_domain *d; - int cpu; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (level == RDT_RESOURCE_L3) - update = l3_qos_cfg_update; - else if (level == RDT_RESOURCE_L2) - update = l2_qos_cfg_update; - else - return -EINVAL; - - if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) - return -ENOMEM; - - r_l = &rdt_resources_all[level].r_resctrl; - list_for_each_entry(d, &r_l->domains, list) { - if (r_l->cache.arch_has_per_cpu_cfg) - /* Pick all the CPUs in the domain instance */ - for_each_cpu(cpu, &d->cpu_mask) - cpumask_set_cpu(cpu, cpu_mask); - else - /* Pick one CPU from each domain instance to update MSR */ - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); - } - - /* Update QOS_CFG MSR on all the CPUs in cpu_mask */ - on_each_cpu_mask(cpu_mask, update, &enable, 1); - - free_cpumask_var(cpu_mask); - - return 0; -} - -/* Restore the qos cfg state when a domain comes online */ -void rdt_domain_reconfigure_cdp(struct rdt_resource *r) -{ - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - - if (!r->cdp_capable) - return; - - if (r->rid == RDT_RESOURCE_L2) - l2_qos_cfg_update(&hw_res->cdp_enabled); - - if (r->rid == RDT_RESOURCE_L3) - l3_qos_cfg_update(&hw_res->cdp_enabled); -} - -static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d) -{ - u32 num_closid = resctrl_arch_get_num_closid(r); - int cpu = cpumask_any(&d->cpu_mask); - int i; - - d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val), - GFP_KERNEL, cpu_to_node(cpu)); - if (!d->mbps_val) - return -ENOMEM; - - for (i = 0; i < num_closid; i++) - d->mbps_val[i] = MBA_MAX_MBPS; - - return 0; -} - -static void mba_sc_domain_destroy(struct rdt_resource *r, - struct rdt_domain *d) -{ - kfree(d->mbps_val); - d->mbps_val = NULL; -} - -/* - * MBA software controller is supported only if - * MBM is supported and MBA is in linear scale. - */ -static bool supports_mba_mbps(void) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - - return (resctrl_arch_is_mbm_local_enabled() && - r->alloc_capable && is_mba_linear()); -} - -/* - * Enable or disable the MBA software controller - * which helps user specify bandwidth in MBps. 
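
set_cache_qos_cfg() above only needs to reach one CPU per domain unless the QOS_CFG MSR is per-CPU scoped (arch_has_per_cpu_cfg). The cpumask selection, sketched with 8-bit integers standing in for cpumasks:

#include <stdio.h>

int main(void)
{
	/* Two domains of four CPUs each (hypothetical masks). */
	unsigned int dom_mask[2] = { 0x0f, 0xf0 };
	unsigned int cpu_mask = 0;
	int per_cpu_cfg = 0;	/* flip to 1 for per-CPU scoped MSRs */

	for (int i = 0; i < 2; i++) {
		if (per_cpu_cfg)
			cpu_mask |= dom_mask[i];
		else	/* cpumask_any(): lowest set bit will do */
			cpu_mask |= dom_mask[i] & -dom_mask[i];
	}
	printf("0x%02x\n", cpu_mask);	/* 0x11: one CPU per domain */
	return 0;
}
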
- */ -static int set_mba_sc(bool mba_sc) -{ - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); - u32 num_closid = resctrl_arch_get_num_closid(r); - struct rdt_domain *d; - int i; - - if (!supports_mba_mbps() || mba_sc == is_mba_sc(r)) - return -EINVAL; - - r->membw.mba_sc = mba_sc; - - list_for_each_entry(d, &r->domains, list) { - for (i = 0; i < num_closid; i++) - d->mbps_val[i] = MBA_MAX_MBPS; - } - - return 0; -} - -static int cdp_enable(int level) -{ - struct rdt_resource *r_l = &rdt_resources_all[level].r_resctrl; - int ret; - - if (!r_l->alloc_capable) - return -EINVAL; - - ret = set_cache_qos_cfg(level, true); - if (!ret) - rdt_resources_all[level].cdp_enabled = true; - - return ret; -} - -static void cdp_disable(int level) -{ - struct rdt_hw_resource *r_hw = &rdt_resources_all[level]; - - if (r_hw->cdp_enabled) { - set_cache_qos_cfg(level, false); - r_hw->cdp_enabled = false; - } -} - -int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable) -{ - struct rdt_hw_resource *hw_res = &rdt_resources_all[l]; - - if (!hw_res->r_resctrl.cdp_capable) - return -EINVAL; - - if (enable) - return cdp_enable(l); - - cdp_disable(l); - - return 0; -} - -/* - * We don't allow rdtgroup directories to be created anywhere - * except the root directory. Thus when looking for the rdtgroup - * structure for a kernfs node we are either looking at a directory, - * in which case the rdtgroup structure is pointed at by the "priv" - * field, otherwise we have a file, and need only look to the parent - * to find the rdtgroup. - */ -static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn) -{ - if (kernfs_type(kn) == KERNFS_DIR) { - /* - * All the resource directories use "kn->priv" - * to point to the "struct rdtgroup" for the - * resource. "info" and its subdirectories don't - * have rdtgroup structures, so return NULL here. - */ - if (kn == kn_info || kn->parent == kn_info) - return NULL; - else - return kn->priv; - } else { - return kn->parent->priv; - } -} - -static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn) -{ - atomic_inc(&rdtgrp->waitcount); - kernfs_break_active_protection(kn); -} - -static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) -{ - if (atomic_dec_and_test(&rdtgrp->waitcount) && - (rdtgrp->flags & RDT_DELETED)) { - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) - rdtgroup_pseudo_lock_remove(rdtgrp); - kernfs_unbreak_active_protection(kn); - rdtgroup_remove(rdtgrp); - } else { - kernfs_unbreak_active_protection(kn); - } -} - -struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn) -{ - struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); - - if (!rdtgrp) - return NULL; - - rdtgroup_kn_get(rdtgrp, kn); - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - /* Was this group deleted while we waited? 
*/ - if (rdtgrp->flags & RDT_DELETED) - return NULL; - - return rdtgrp; -} - -void rdtgroup_kn_unlock(struct kernfs_node *kn) -{ - struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); - - if (!rdtgrp) - return; - - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - - rdtgroup_kn_put(rdtgrp, kn); -} - -static int mkdir_mondata_all(struct kernfs_node *parent_kn, - struct rdtgroup *prgrp, - struct kernfs_node **mon_data_kn); - -static void rdt_disable_ctx(void) -{ - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); - set_mba_sc(false); - - resctrl_debug = false; -} - -static int rdt_enable_ctx(struct rdt_fs_context *ctx) -{ - int ret = 0; - - if (ctx->enable_cdpl2) { - ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, true); - if (ret) - goto out_done; - } - - if (ctx->enable_cdpl3) { - ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, true); - if (ret) - goto out_cdpl2; - } - - if (ctx->enable_mba_mbps) { - ret = set_mba_sc(true); - if (ret) - goto out_cdpl3; - } - - if (ctx->enable_debug) - resctrl_debug = true; - - return 0; - -out_cdpl3: - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); -out_cdpl2: - resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); -out_done: - return ret; -} - -static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type) -{ - struct resctrl_schema *s; - const char *suffix = ""; - int ret, cl; - - s = kzalloc(sizeof(*s), GFP_KERNEL); - if (!s) - return -ENOMEM; - - s->res = r; - s->num_closid = resctrl_arch_get_num_closid(r); - if (resctrl_arch_get_cdp_enabled(r->rid)) - s->num_closid /= 2; - - s->conf_type = type; - switch (type) { - case CDP_CODE: - suffix = "CODE"; - break; - case CDP_DATA: - suffix = "DATA"; - break; - case CDP_NONE: - suffix = ""; - break; - } - - ret = snprintf(s->name, sizeof(s->name), "%s%s", r->name, suffix); - if (ret >= sizeof(s->name)) { - kfree(s); - return -EINVAL; - } - - cl = strlen(s->name); - - /* - * If CDP is supported by this resource, but not enabled, - * include the suffix. This ensures the tabular format of the - * schemata file does not change between mounts of the filesystem. - */ - if (r->cdp_capable && !resctrl_arch_get_cdp_enabled(r->rid)) - cl += 4; - - if (cl > max_name_width) - max_name_width = cl; - - /* - * Choose a width for the resource data based on the resource that has - * widest name and cbm. 
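
With CDP enabled, schemata_list_add() halves num_closid, since each resource group consumes a code/data pair of hardware CLOSIDs, and appends a CODE or DATA suffix to the schema name. For example, assuming 16 hardware CLOSIDs on L3:

#include <stdio.h>

int main(void)
{
	unsigned int num_closid = 16;	/* assumed hardware CLOSIDs */
	int cdp_enabled = 1;

	if (cdp_enabled)
		num_closid /= 2;	/* one code/data pair per group */

	/* Schema names as they head lines in the schemata file. */
	printf("L3CODE num_closid=%u\n", num_closid);
	printf("L3DATA num_closid=%u\n", num_closid);
	return 0;
}

When CDP is supported but disabled, the suffix width is still reserved (the cl += 4 above) so the tabular layout of the schemata file does not change between mounts.
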
- */ - max_data_width = max(max_data_width, r->data_width); - - INIT_LIST_HEAD(&s->list); - list_add(&s->list, &resctrl_schema_all); - - return 0; -} - -static int schemata_list_create(void) -{ - enum resctrl_res_level i; - struct rdt_resource *r; - int ret = 0; - - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->alloc_capable) - continue; - - if (resctrl_arch_get_cdp_enabled(r->rid)) { - ret = schemata_list_add(r, CDP_CODE); - if (ret) - break; - - ret = schemata_list_add(r, CDP_DATA); - } else { - ret = schemata_list_add(r, CDP_NONE); - } - - if (ret) - break; - } - - return ret; -} - -static void schemata_list_destroy(void) -{ - struct resctrl_schema *s, *tmp; - - list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) { - list_del(&s->list); - kfree(s); - } -} - -static int rdt_get_tree(struct fs_context *fc) -{ - struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); - struct rdt_fs_context *ctx = rdt_fc2context(fc); - unsigned long flags = RFTYPE_CTRL_BASE; - struct rdt_domain *dom; - int ret; - - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - /* - * resctrl file system can only be mounted once. - */ - if (resctrl_mounted) { - ret = -EBUSY; - goto out; - } - - ret = rdtgroup_setup_root(ctx); - if (ret) - goto out; - - ret = rdt_enable_ctx(ctx); - if (ret) - goto out_root; - - ret = schemata_list_create(); - if (ret) { - schemata_list_destroy(); - goto out_ctx; - } - - closid_init(); - - if (resctrl_arch_mon_capable()) - flags |= RFTYPE_MON; - - ret = rdtgroup_add_files(rdtgroup_default.kn, flags); - if (ret) - goto out_schemata_free; - - kernfs_activate(rdtgroup_default.kn); - - ret = rdtgroup_create_info_dir(rdtgroup_default.kn); - if (ret < 0) - goto out_schemata_free; - - if (resctrl_arch_mon_capable()) { - ret = mongroup_create_dir(rdtgroup_default.kn, - &rdtgroup_default, "mon_groups", - &kn_mongrp); - if (ret < 0) - goto out_info; - - ret = mkdir_mondata_all(rdtgroup_default.kn, - &rdtgroup_default, &kn_mondata); - if (ret < 0) - goto out_mongrp; - rdtgroup_default.mon.mon_data_kn = kn_mondata; - } - - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) { - ret = rdt_pseudo_lock_init(); - if (ret) - goto out_mondata; - } - - ret = kernfs_get_tree(fc); - if (ret < 0) - goto out_psl; - - if (resctrl_arch_alloc_capable()) - resctrl_arch_enable_alloc(); - if (resctrl_arch_mon_capable()) - resctrl_arch_enable_mon(); - - if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) - resctrl_mounted = true; - - if (resctrl_is_mbm_enabled()) { - list_for_each_entry(dom, &l3->domains, list) - mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, - RESCTRL_PICK_ANY_CPU); - } - - goto out; - -out_psl: - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - rdt_pseudo_lock_release(); -out_mondata: - if (resctrl_arch_mon_capable()) - kernfs_remove(kn_mondata); -out_mongrp: - if (resctrl_arch_mon_capable()) - kernfs_remove(kn_mongrp); -out_info: - kernfs_remove(kn_info); -out_schemata_free: - schemata_list_destroy(); -out_ctx: - rdt_disable_ctx(); -out_root: - rdtgroup_destroy_root(); -out: - rdt_last_cmd_clear(); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); - return ret; -} - -enum rdt_param { - Opt_cdp, - Opt_cdpl2, - Opt_mba_mbps, - Opt_debug, - nr__rdt_params -}; - -static const struct fs_parameter_spec rdt_fs_parameters[] = { - fsparam_flag("cdp", Opt_cdp), - fsparam_flag("cdpl2", Opt_cdpl2), - fsparam_flag("mba_MBps", Opt_mba_mbps), - fsparam_flag("debug", Opt_debug), - {} -}; - -static int rdt_parse_param(struct 
fs_context *fc, struct fs_parameter *param) -{ - struct rdt_fs_context *ctx = rdt_fc2context(fc); - struct fs_parse_result result; - int opt; - - opt = fs_parse(fc, rdt_fs_parameters, param, &result); - if (opt < 0) - return opt; - - switch (opt) { - case Opt_cdp: - ctx->enable_cdpl3 = true; - return 0; - case Opt_cdpl2: - ctx->enable_cdpl2 = true; - return 0; - case Opt_mba_mbps: - if (!supports_mba_mbps()) - return -EINVAL; - ctx->enable_mba_mbps = true; - return 0; - case Opt_debug: - ctx->enable_debug = true; - return 0; - } - - return -EINVAL; -} - -static void rdt_fs_context_free(struct fs_context *fc) -{ - struct rdt_fs_context *ctx = rdt_fc2context(fc); - - kernfs_free_fs_context(fc); - kfree(ctx); -} - -static const struct fs_context_operations rdt_fs_context_ops = { - .free = rdt_fs_context_free, - .parse_param = rdt_parse_param, - .get_tree = rdt_get_tree, -}; - -static int rdt_init_fs_context(struct fs_context *fc) -{ - struct rdt_fs_context *ctx; - - ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL); - if (!ctx) - return -ENOMEM; - - ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; - fc->fs_private = &ctx->kfc; - fc->ops = &rdt_fs_context_ops; - put_user_ns(fc->user_ns); - fc->user_ns = get_user_ns(&init_user_ns); - fc->global = true; - return 0; -} - -static int reset_all_ctrls(struct rdt_resource *r) -{ - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - struct rdt_hw_domain *hw_dom; - struct msr_param msr_param; - cpumask_var_t cpu_mask; - struct rdt_domain *d; - int i; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) - return -ENOMEM; - - msr_param.res = r; - msr_param.low = 0; - msr_param.high = hw_res->num_closid; - - /* - * Disable resource control for this resource by setting all - * CBMs in all domains to the maximum mask value. Pick one CPU - * from each domain to update the MSRs below. - */ - list_for_each_entry(d, &r->domains, list) { - hw_dom = resctrl_to_arch_dom(d); - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); - - for (i = 0; i < hw_res->num_closid; i++) - hw_dom->ctrl_val[i] = r->default_ctrl; - } - - /* Update CBM on all the CPUs in cpu_mask */ - on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1); - - free_cpumask_var(cpu_mask); - - return 0; -} - -void resctrl_arch_reset_resources(void) -{ - struct rdt_resource *r; - - for_each_capable_rdt_resource(r) - reset_all_ctrls(r); -} - -/* - * Move tasks from one to the other group. If @from is NULL, then all tasks - * in the systems are moved unconditionally (used for teardown). - * - * If @mask is not NULL the cpus on which moved tasks are running are set - * in that mask so the update smp function call is restricted to affected - * cpus. - */ -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, - struct cpumask *mask) -{ - struct task_struct *p, *t; - - read_lock(&tasklist_lock); - for_each_process_thread(p, t) { - if (!from || is_closid_match(t, from) || - is_rmid_match(t, from)) { - resctrl_arch_set_closid_rmid(t, to->closid, - to->mon.rmid); - - /* - * Order the closid/rmid stores above before the loads - * in task_curr(). This pairs with the full barrier - * between the rq->curr update and resctrl_sched_in() - * during context switch. - */ - smp_mb(); - - /* - * If the task is on a CPU, set the CPU in the mask. - * The detection is inaccurate as tasks might move or - * schedule before the smp function call takes place. 
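
reset_all_ctrls() above amounts to: write default_ctrl into every CLOSID slot of the cached control values, then let one CPU per domain push them to the MSRs. The data flow, with an array standing in for hw_dom->ctrl_val and the IPI elided:

#include <stdio.h>

#define NUM_CLOSID 4

int main(void)
{
	unsigned int default_ctrl = 0xfff;	/* full-mask CBM */
	unsigned int ctrl_val[NUM_CLOSID] = { 0xf, 0xf0, 0xf00, 0x3 };

	/* Disable control: every CLOSID reverts to the full mask. */
	for (int i = 0; i < NUM_CLOSID; i++)
		ctrl_val[i] = default_ctrl;

	for (int i = 0; i < NUM_CLOSID; i++)
		printf("closid %d -> 0x%x\n", i, ctrl_val[i]);
	return 0;
}
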
- * In such a case the function call is pointless, but - * there is no other side effect. - */ - if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) - cpumask_set_cpu(task_cpu(t), mask); - } - } - read_unlock(&tasklist_lock); -} - -static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) -{ - struct rdtgroup *sentry, *stmp; - struct list_head *head; - - head = &rdtgrp->mon.crdtgrp_list; - list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { - free_rmid(sentry->closid, sentry->mon.rmid); - list_del(&sentry->mon.crdtgrp_list); - - if (atomic_read(&sentry->waitcount) != 0) - sentry->flags = RDT_DELETED; - else - rdtgroup_remove(sentry); - } -} - -/* - * Forcibly remove all of subdirectories under root. - */ -static void rmdir_all_sub(void) -{ - struct rdtgroup *rdtgrp, *tmp; - - /* Move all tasks to the default resource group */ - rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); - - list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { - /* Free any child rmids */ - free_all_child_rdtgrp(rdtgrp); - - /* Remove each rdtgroup other than root */ - if (rdtgrp == &rdtgroup_default) - continue; - - if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || - rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) - rdtgroup_pseudo_lock_remove(rdtgrp); - - /* - * Give any CPUs back to the default group. We cannot copy - * cpu_online_mask because a CPU might have executed the - * offline callback already, but is still marked online. - */ - cpumask_or(&rdtgroup_default.cpu_mask, - &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); - - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - - kernfs_remove(rdtgrp->kn); - list_del(&rdtgrp->rdtgroup_list); - - if (atomic_read(&rdtgrp->waitcount) != 0) - rdtgrp->flags = RDT_DELETED; - else - rdtgroup_remove(rdtgrp); - } - /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ - update_closid_rmid(cpu_online_mask, &rdtgroup_default); - - kernfs_remove(kn_info); - kernfs_remove(kn_mongrp); - kernfs_remove(kn_mondata); -} - -static void rdt_kill_sb(struct super_block *sb) -{ - cpus_read_lock(); - mutex_lock(&rdtgroup_mutex); - - rdt_disable_ctx(); - - /* Put everything back to default values. */ - resctrl_arch_reset_resources(); - - rmdir_all_sub(); - if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) - rdt_pseudo_lock_release(); - rdtgroup_default.mode = RDT_MODE_SHAREABLE; - schemata_list_destroy(); - rdtgroup_destroy_root(); - if (resctrl_arch_alloc_capable()) - resctrl_arch_disable_alloc(); - if (resctrl_arch_mon_capable()) - resctrl_arch_disable_mon(); - resctrl_mounted = false; - kernfs_kill_sb(sb); - mutex_unlock(&rdtgroup_mutex); - cpus_read_unlock(); -} - -static struct file_system_type rdt_fs_type = { - .name = "resctrl", - .init_fs_context = rdt_init_fs_context, - .parameters = rdt_fs_parameters, - .kill_sb = rdt_kill_sb, -}; - -static int mon_addfile(struct kernfs_node *parent_kn, const char *name, - void *priv) -{ - struct kernfs_node *kn; - int ret = 0; - - kn = __kernfs_create_file(parent_kn, name, 0444, - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, - &kf_mondata_ops, priv, NULL, NULL); - if (IS_ERR(kn)) - return PTR_ERR(kn); - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) { - kernfs_remove(kn); - return ret; - } - - return ret; -} - -/* - * Remove all subdirectories of mon_data of ctrl_mon groups - * and monitor groups with given domain id. 
- */ -static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, - unsigned int dom_id) -{ - struct rdtgroup *prgrp, *crgrp; - char name[32]; - - list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { - sprintf(name, "mon_%s_%02d", r->name, dom_id); - kernfs_remove_by_name(prgrp->mon.mon_data_kn, name); - - list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) - kernfs_remove_by_name(crgrp->mon.mon_data_kn, name); - } -} - -static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, - struct rdt_domain *d, - struct rdt_resource *r, struct rdtgroup *prgrp) -{ - union mon_data_bits priv; - struct kernfs_node *kn; - struct mon_evt *mevt; - struct rmid_read rr; - char name[32]; - int ret; - - sprintf(name, "mon_%s_%02d", r->name, d->id); - /* create the directory */ - kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); - if (IS_ERR(kn)) - return PTR_ERR(kn); - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) - goto out_destroy; - - if (WARN_ON(list_empty(&r->evt_list))) { - ret = -EPERM; - goto out_destroy; - } - - priv.u.rid = r->rid; - priv.u.domid = d->id; - list_for_each_entry(mevt, &r->evt_list, list) { - priv.u.evtid = mevt->evtid; - ret = mon_addfile(kn, mevt->name, priv.priv); - if (ret) - goto out_destroy; - - if (resctrl_is_mbm_event(mevt->evtid)) - mon_event_read(&rr, r, d, prgrp, mevt->evtid, true); - } - kernfs_activate(kn); - return 0; - -out_destroy: - kernfs_remove(kn); - return ret; -} - -/* - * Add all subdirectories of mon_data for "ctrl_mon" groups - * and "monitor" groups with given domain id. - */ -static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, - struct rdt_domain *d) -{ - struct kernfs_node *parent_kn; - struct rdtgroup *prgrp, *crgrp; - struct list_head *head; - - list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { - parent_kn = prgrp->mon.mon_data_kn; - mkdir_mondata_subdir(parent_kn, d, r, prgrp); - - head = &prgrp->mon.crdtgrp_list; - list_for_each_entry(crgrp, head, mon.crdtgrp_list) { - parent_kn = crgrp->mon.mon_data_kn; - mkdir_mondata_subdir(parent_kn, d, r, crgrp); - } - } -} - -static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, - struct rdt_resource *r, - struct rdtgroup *prgrp) -{ - struct rdt_domain *dom; - int ret; - - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); - - list_for_each_entry(dom, &r->domains, list) { - ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); - if (ret) - return ret; - } - - return 0; -} - -/* - * This creates a directory mon_data which contains the monitored data. - * - * mon_data has one directory for each domain which are named - * in the format mon_<domain_name>_<domain_id>. For ex: A mon_data - * with L3 domain looks as below: - * ./mon_data: - * mon_L3_00 - * mon_L3_01 - * mon_L3_02 - * ... - * - * Each domain directory has one file per event: - * ./mon_L3_00/: - * llc_occupancy - * - */ -static int mkdir_mondata_all(struct kernfs_node *parent_kn, - struct rdtgroup *prgrp, - struct kernfs_node **dest_kn) -{ - enum resctrl_res_level i; - struct rdt_resource *r; - struct kernfs_node *kn; - int ret; - - /* - * Create the mon_data directory first. - */ - ret = mongroup_create_dir(parent_kn, prgrp, "mon_data", &kn); - if (ret) - return ret; - - if (dest_kn) - *dest_kn = kn; - - /* - * Create the subdirectories for each domain. 
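
The domain directories created above all follow the mon_<resource>_<id> naming with a two-digit id, and rmdir_mondata_subdir_allrdtgrp() rebuilds the same name when a domain goes offline. For example:

#include <stdio.h>

int main(void)
{
	char name[32];

	/* Same format string as mkdir_mondata_subdir(). */
	for (int id = 0; id < 3; id++) {
		snprintf(name, sizeof(name), "mon_%s_%02d", "L3", id);
		printf("%s\n", name);	/* mon_L3_00, mon_L3_01, ... */
	}
	return 0;
}
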
Note that all events - * in a domain like L3 are grouped into a resource whose domain is L3 - */ - for (i = 0; i < RDT_NUM_RESOURCES; i++) { - r = resctrl_arch_get_resource(i); - if (!r->mon_capable) - continue; - - ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); - if (ret) - goto out_destroy; - } - - return 0; - -out_destroy: - kernfs_remove(kn); - return ret; -} - -/** - * cbm_ensure_valid - Enforce validity on provided CBM - * @_val: Candidate CBM - * @r: RDT resource to which the CBM belongs - * - * The provided CBM represents all cache portions available for use. This - * may be represented by a bitmap that does not consist of contiguous ones - * and thus be an invalid CBM. - * Here the provided CBM is forced to be a valid CBM by only considering - * the first set of contiguous bits as valid and clearing all bits. - * The intention here is to provide a valid default CBM with which a new - * resource group is initialized. The user can follow this with a - * modification to the CBM if the default does not satisfy the - * requirements. - */ -static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r) -{ - unsigned int cbm_len = r->cache.cbm_len; - unsigned long first_bit, zero_bit; - unsigned long val = _val; - - if (!val) - return 0; - - first_bit = find_first_bit(&val, cbm_len); - zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); - - /* Clear any remaining bits to ensure contiguous region */ - bitmap_clear(&val, zero_bit, cbm_len - zero_bit); - return (u32)val; -} - -/* - * Initialize cache resources per RDT domain - * - * Set the RDT domain up to start off with all usable allocations. That is, - * all shareable and unused bits. All-zero CBM is invalid. - */ -static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s, - u32 closid) -{ - enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); - enum resctrl_conf_type t = s->conf_type; - struct resctrl_staged_config *cfg; - struct rdt_resource *r = s->res; - u32 used_b = 0, unused_b = 0; - unsigned long tmp_cbm; - enum rdtgrp_mode mode; - u32 peer_ctl, ctrl_val; - int i; - - cfg = &d->staged_config[t]; - cfg->have_new_ctrl = false; - cfg->new_ctrl = r->cache.shareable_bits; - used_b = r->cache.shareable_bits; - for (i = 0; i < closids_supported(); i++) { - if (closid_allocated(i) && i != closid) { - mode = rdtgroup_mode_by_closid(i); - if (mode == RDT_MODE_PSEUDO_LOCKSETUP) - /* - * ctrl values for locksetup aren't relevant - * until the schemata is written, and the mode - * becomes RDT_MODE_PSEUDO_LOCKED. - */ - continue; - /* - * If CDP is active include peer domain's - * usage to ensure there is no overlap - * with an exclusive group. - */ - if (resctrl_arch_get_cdp_enabled(r->rid)) - peer_ctl = resctrl_arch_get_config(r, d, i, - peer_type); - else - peer_ctl = 0; - ctrl_val = resctrl_arch_get_config(r, d, i, - s->conf_type); - used_b |= ctrl_val | peer_ctl; - if (mode == RDT_MODE_SHAREABLE) - cfg->new_ctrl |= ctrl_val | peer_ctl; - } - } - if (d->plr && d->plr->cbm > 0) - used_b |= d->plr->cbm; - unused_b = used_b ^ (BIT_MASK(r->cache.cbm_len) - 1); - unused_b &= BIT_MASK(r->cache.cbm_len) - 1; - cfg->new_ctrl |= unused_b; - /* - * Force the initial CBM to be valid, user can - * modify the CBM based on system availability. - */ - cfg->new_ctrl = cbm_ensure_valid(cfg->new_ctrl, r); - /* - * Assign the u32 CBM to an unsigned long to ensure that - * bitmap_weight() does not access out-of-bound memory. 
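
cbm_ensure_valid() above keeps only the first contiguous run of set bits. A user-space equivalent, with compiler builtins in place of find_first_bit()/find_next_zero_bit() and assuming CBM widths below 32:

#include <stdio.h>

/* Keep the first contiguous run of ones, clear everything above it. */
static unsigned int cbm_ensure_valid(unsigned int val, unsigned int cbm_len)
{
	if (!val)
		return 0;

	unsigned int first = __builtin_ctz(val);	/* first set bit */
	unsigned int zero = first;

	while (zero < cbm_len && (val & (1u << zero)))
		zero++;					/* next zero bit */

	return val & ((1u << zero) - 1);	/* drop bits >= zero */
}

int main(void)
{
	/* 0b10111 -> 0b00111: the stray high bit is dropped. */
	printf("0x%x\n", cbm_ensure_valid(0x17, 16));
	return 0;
}
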
- */ - tmp_cbm = cfg->new_ctrl; - if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) { - rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id); - return -ENOSPC; - } - cfg->have_new_ctrl = true; - - return 0; -} - -/* - * Initialize cache resources with default values. - * - * A new RDT group is being created on an allocation capable (CAT) - * supporting system. Set this group up to start off with all usable - * allocations. - * - * If there are no more shareable bits available on any domain then - * the entire allocation will fail. - */ -static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) -{ - struct rdt_domain *d; - int ret; - - list_for_each_entry(d, &s->res->domains, list) { - ret = __init_one_rdt_domain(d, s, closid); - if (ret < 0) - return ret; - } - - return 0; -} - -/* Initialize MBA resource with default values. */ -static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) -{ - struct resctrl_staged_config *cfg; - struct rdt_domain *d; - - list_for_each_entry(d, &r->domains, list) { - if (is_mba_sc(r)) { - d->mbps_val[closid] = MBA_MAX_MBPS; - continue; - } - - cfg = &d->staged_config[CDP_NONE]; - cfg->new_ctrl = r->default_ctrl; - cfg->have_new_ctrl = true; - } -} - -/* Initialize the RDT group's allocations. */ -static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) -{ - struct resctrl_schema *s; - struct rdt_resource *r; - int ret = 0; - - rdt_staged_configs_clear(); - - list_for_each_entry(s, &resctrl_schema_all, list) { - r = s->res; - if (r->rid == RDT_RESOURCE_MBA || - r->rid == RDT_RESOURCE_SMBA) { - rdtgroup_init_mba(r, rdtgrp->closid); - if (is_mba_sc(r)) - continue; - } else { - ret = rdtgroup_init_cat(s, rdtgrp->closid); - if (ret < 0) - goto out; - } - - ret = resctrl_arch_update_domains(r, rdtgrp->closid); - if (ret < 0) { - rdt_last_cmd_puts("Failed to initialize allocations\n"); - goto out; - } - - } - - rdtgrp->mode = RDT_MODE_SHAREABLE; - -out: - rdt_staged_configs_clear(); - return ret; -} - -static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) -{ - int ret; - - if (!resctrl_arch_mon_capable()) - return 0; - - ret = alloc_rmid(rdtgrp->closid); - if (ret < 0) { - rdt_last_cmd_puts("Out of RMIDs\n"); - return ret; - } - rdtgrp->mon.rmid = ret; - - ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); - if (ret) { - rdt_last_cmd_puts("kernfs subdir error\n"); - free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); - return ret; - } - - return 0; -} - -static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) -{ - if (resctrl_arch_mon_capable()) - free_rmid(rgrp->closid, rgrp->mon.rmid); -} - -static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, - const char *name, umode_t mode, - enum rdt_group_type rtype, struct rdtgroup **r) -{ - struct rdtgroup *prdtgrp, *rdtgrp; - unsigned long files = 0; - struct kernfs_node *kn; - int ret; - - prdtgrp = rdtgroup_kn_lock_live(parent_kn); - if (!prdtgrp) { - ret = -ENODEV; - goto out_unlock; - } - - if (rtype == RDTMON_GROUP && - (prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || - prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) { - ret = -EINVAL; - rdt_last_cmd_puts("Pseudo-locking in progress\n"); - goto out_unlock; - } - - /* allocate the rdtgroup. 
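
The default CBM composed by __init_one_rdt_domain() above is "hardware-shareable bits plus everything no other group uses"; the result may be non-contiguous, which is why cbm_ensure_valid() and the min_cbm_bits floor are applied afterwards. The mask algebra with sample values:

#include <stdio.h>

int main(void)
{
	unsigned int cbm_len = 12;
	unsigned int shareable = 0x003;		/* hw shareable bits */
	unsigned int used_b = 0x0ff;		/* other groups + hw */
	unsigned int new_ctrl = shareable;

	/* unused_b = used_b ^ (BIT_MASK(cbm_len) - 1), masked to width */
	unsigned int unused_b = used_b ^ ((1u << cbm_len) - 1);
	unused_b &= (1u << cbm_len) - 1;
	new_ctrl |= unused_b;

	/* 0xf03: non-contiguous, so cbm_ensure_valid() runs next */
	printf("new_ctrl=0x%03x weight=%d\n", new_ctrl,
	       __builtin_popcount(new_ctrl));
	return 0;
}
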
*/ - rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL); - if (!rdtgrp) { - ret = -ENOSPC; - rdt_last_cmd_puts("Kernel out of memory\n"); - goto out_unlock; - } - *r = rdtgrp; - rdtgrp->mon.parent = prdtgrp; - rdtgrp->type = rtype; - INIT_LIST_HEAD(&rdtgrp->mon.crdtgrp_list); - - /* kernfs creates the directory for rdtgrp */ - kn = kernfs_create_dir(parent_kn, name, mode, rdtgrp); - if (IS_ERR(kn)) { - ret = PTR_ERR(kn); - rdt_last_cmd_puts("kernfs create error\n"); - goto out_free_rgrp; - } - rdtgrp->kn = kn; - - /* - * kernfs_remove() will drop the reference count on "kn" which - * will free it. But we still need it to stick around for the - * rdtgroup_kn_unlock(kn) call. Take one extra reference here, - * which will be dropped by kernfs_put() in rdtgroup_remove(). - */ - kernfs_get(kn); - - ret = rdtgroup_kn_set_ugid(kn); - if (ret) { - rdt_last_cmd_puts("kernfs perm error\n"); - goto out_destroy; - } - - if (rtype == RDTCTRL_GROUP) { - files = RFTYPE_BASE | RFTYPE_CTRL; - if (resctrl_arch_mon_capable()) - files |= RFTYPE_MON; - } else { - files = RFTYPE_BASE | RFTYPE_MON; - } - - ret = rdtgroup_add_files(kn, files); - if (ret) { - rdt_last_cmd_puts("kernfs fill error\n"); - goto out_destroy; - } - - /* - * The caller unlocks the parent_kn upon success. - */ - return 0; - -out_destroy: - kernfs_put(rdtgrp->kn); - kernfs_remove(rdtgrp->kn); -out_free_rgrp: - kfree(rdtgrp); -out_unlock: - rdtgroup_kn_unlock(parent_kn); - return ret; -} - -static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) -{ - kernfs_remove(rgrp->kn); - rdtgroup_remove(rgrp); -} - -/* - * Create a monitor group under "mon_groups" directory of a control - * and monitor group(ctrl_mon). This is a resource group - * to monitor a subset of tasks and cpus in its parent ctrl_mon group. - */ -static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, - const char *name, umode_t mode) -{ - struct rdtgroup *rdtgrp, *prgrp; - int ret; - - ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTMON_GROUP, &rdtgrp); - if (ret) - return ret; - - prgrp = rdtgrp->mon.parent; - rdtgrp->closid = prgrp->closid; - - ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); - if (ret) { - mkdir_rdt_prepare_clean(rdtgrp); - goto out_unlock; - } - - kernfs_activate(rdtgrp->kn); - - /* - * Add the rdtgrp to the list of rdtgrps the parent - * ctrl_mon group has to track. - */ - list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list); - -out_unlock: - rdtgroup_kn_unlock(parent_kn); - return ret; -} - -/* - * These are rdtgroups created under the root directory. Can be used - * to allocate and monitor resources. - */ -static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, - const char *name, umode_t mode) -{ - struct rdtgroup *rdtgrp; - struct kernfs_node *kn; - u32 closid; - int ret; - - ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp); - if (ret) - return ret; - - kn = rdtgrp->kn; - ret = closid_alloc(); - if (ret < 0) { - rdt_last_cmd_puts("Out of CLOSIDs\n"); - goto out_common_fail; - } - closid = ret; - ret = 0; - - rdtgrp->closid = closid; - - ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); - if (ret) - goto out_closid_free; - - kernfs_activate(rdtgrp->kn); - - ret = rdtgroup_init_alloc(rdtgrp); - if (ret < 0) - goto out_rmid_free; - - list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); - - if (resctrl_arch_mon_capable()) { - /* - * Create an empty mon_groups directory to hold the subset - * of tasks and cpus to monitor. 
-		 */
-		ret = mongroup_create_dir(kn, rdtgrp, "mon_groups", NULL);
-		if (ret) {
-			rdt_last_cmd_puts("kernfs subdir error\n");
-			goto out_del_list;
-		}
-	}
-
-	goto out_unlock;
-
-out_del_list:
-	list_del(&rdtgrp->rdtgroup_list);
-out_rmid_free:
-	mkdir_rdt_prepare_rmid_free(rdtgrp);
-out_closid_free:
-	closid_free(closid);
-out_common_fail:
-	mkdir_rdt_prepare_clean(rdtgrp);
-out_unlock:
-	rdtgroup_kn_unlock(parent_kn);
-	return ret;
-}
-
-/*
- * We allow creating mon groups only within a directory called "mon_groups"
- * which is present in every ctrl_mon group. Check if this is a valid
- * "mon_groups" directory.
- *
- * 1. The directory should be named "mon_groups".
- * 2. The mon group itself should "not" be named "mon_groups".
- *    This makes sure the "mon_groups" directory always has a ctrl_mon group
- *    as parent.
- */
-static bool is_mon_groups(struct kernfs_node *kn, const char *name)
-{
-	return (!strcmp(kn->name, "mon_groups") &&
-		strcmp(name, "mon_groups"));
-}
-
-static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
-			  umode_t mode)
-{
-	/* Do not accept '\n' to avoid unparsable situation. */
-	if (strchr(name, '\n'))
-		return -EINVAL;
-
-	/*
-	 * If the parent directory is the root directory and RDT
-	 * allocation is supported, add a control and monitoring
-	 * subdirectory
-	 */
-	if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn)
-		return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode);
-
-	/*
-	 * If RDT monitoring is supported and the parent directory is a valid
-	 * "mon_groups" directory, add a monitoring subdirectory.
-	 */
-	if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name))
-		return rdtgroup_mkdir_mon(parent_kn, name, mode);
-
-	return -EPERM;
-}
-
-static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
-{
-	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
-	int cpu;
-
-	/* Give any tasks back to the parent group */
-	rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
-
-	/* Update per cpu rmid of the moved CPUs first */
-	for_each_cpu(cpu, &rdtgrp->cpu_mask)
-		resctrl_arch_set_cpu_default_closid_rmid(cpu, rdtgrp->closid,
-							 prdtgrp->mon.rmid);
-
-	/*
-	 * Update the MSR on moved CPUs and CPUs which have a moved
-	 * task running on them.
-	 */
-	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
-	update_closid_rmid(tmpmask, NULL);
-
-	rdtgrp->flags = RDT_DELETED;
-	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-
-	/*
-	 * Remove the rdtgrp from the parent ctrl_mon group's list
-	 */
-	WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list));
-	list_del(&rdtgrp->mon.crdtgrp_list);
-
-	kernfs_remove(rdtgrp->kn);
-
-	return 0;
-}
-
-static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
-{
-	rdtgrp->flags = RDT_DELETED;
-	list_del(&rdtgrp->rdtgroup_list);
-
-	kernfs_remove(rdtgrp->kn);
-	return 0;
-}
-
-static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
-{
-	u32 closid, rmid;
-	int cpu;
-
-	/* Give any tasks back to the default group */
-	rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
-
-	/* Give any CPUs back to the default group */
-	cpumask_or(&rdtgroup_default.cpu_mask,
-		   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
-
-	/* Update per cpu closid and rmid of the moved CPUs first */
-	closid = rdtgroup_default.closid;
-	rmid = rdtgroup_default.mon.rmid;
-	for_each_cpu(cpu, &rdtgrp->cpu_mask)
-		resctrl_arch_set_cpu_default_closid_rmid(cpu, closid, rmid);
-
-	/*
-	 * Update the MSR on moved CPUs and CPUs which have a moved
-	 * task running on them.
-	 */
-	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
-	update_closid_rmid(tmpmask, NULL);
-
-	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
-	closid_free(rdtgrp->closid);
-
-	rdtgroup_ctrl_remove(rdtgrp);
-
-	/*
-	 * Free all the child monitor group rmids.
-	 */
-	free_all_child_rdtgrp(rdtgrp);
-
-	return 0;
-}
-
-static int rdtgroup_rmdir(struct kernfs_node *kn)
-{
-	struct kernfs_node *parent_kn = kn->parent;
-	struct rdtgroup *rdtgrp;
-	cpumask_var_t tmpmask;
-	int ret = 0;
-
-	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
-		return -ENOMEM;
-
-	rdtgrp = rdtgroup_kn_lock_live(kn);
-	if (!rdtgrp) {
-		ret = -EPERM;
-		goto out;
-	}
-
-	/*
-	 * If the rdtgroup is a ctrl_mon group and parent directory
-	 * is the root directory, remove the ctrl_mon group.
-	 *
-	 * If the rdtgroup is a mon group and parent directory
-	 * is a valid "mon_groups" directory, remove the mon group.
-	 */
-	if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn &&
-	    rdtgrp != &rdtgroup_default) {
-		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
-		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
-			ret = rdtgroup_ctrl_remove(rdtgrp);
-		} else {
-			ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask);
-		}
-	} else if (rdtgrp->type == RDTMON_GROUP &&
-		   is_mon_groups(parent_kn, kn->name)) {
-		ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask);
-	} else {
-		ret = -EPERM;
-	}
-
-out:
-	rdtgroup_kn_unlock(kn);
-	free_cpumask_var(tmpmask);
-	return ret;
-}
-
-/**
- * mongrp_reparent() - replace parent CTRL_MON group of a MON group
- * @rdtgrp: the MON group whose parent should be replaced
- * @new_prdtgrp: replacement parent CTRL_MON group for @rdtgrp
- * @cpus: cpumask provided by the caller for use during this call
- *
- * Replaces the parent CTRL_MON group for a MON group, resulting in all member
- * tasks' CLOSID immediately changing to that of the new parent group.
- * Monitoring data for the group is unaffected by this operation.
- */
-static void mongrp_reparent(struct rdtgroup *rdtgrp,
-			    struct rdtgroup *new_prdtgrp,
-			    cpumask_var_t cpus)
-{
-	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
-
-	WARN_ON(rdtgrp->type != RDTMON_GROUP);
-	WARN_ON(new_prdtgrp->type != RDTCTRL_GROUP);
-
-	/* Nothing to do when simply renaming a MON group. */
-	if (prdtgrp == new_prdtgrp)
-		return;
-
-	WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list));
-	list_move_tail(&rdtgrp->mon.crdtgrp_list,
-		       &new_prdtgrp->mon.crdtgrp_list);
-
-	rdtgrp->mon.parent = new_prdtgrp;
-	rdtgrp->closid = new_prdtgrp->closid;
-
-	/* Propagate updated closid to all tasks in this group. */
-	rdt_move_group_tasks(rdtgrp, rdtgrp, cpus);
-
-	update_closid_rmid(cpus, NULL);
-}
-
-static int rdtgroup_rename(struct kernfs_node *kn,
-			   struct kernfs_node *new_parent, const char *new_name)
-{
-	struct rdtgroup *new_prdtgrp;
-	struct rdtgroup *rdtgrp;
-	cpumask_var_t tmpmask;
-	int ret;
-
-	rdtgrp = kernfs_to_rdtgroup(kn);
-	new_prdtgrp = kernfs_to_rdtgroup(new_parent);
-	if (!rdtgrp || !new_prdtgrp)
-		return -ENOENT;
-
-	/* Release both kernfs active_refs before obtaining rdtgroup mutex. */
-	rdtgroup_kn_get(rdtgrp, kn);
-	rdtgroup_kn_get(new_prdtgrp, new_parent);
-
-	mutex_lock(&rdtgroup_mutex);
-
-	rdt_last_cmd_clear();
-
-	/*
-	 * Don't allow kernfs_to_rdtgroup() to return a parent rdtgroup if
-	 * either kernfs_node is a file.
-	 */
-	if (kernfs_type(kn) != KERNFS_DIR ||
-	    kernfs_type(new_parent) != KERNFS_DIR) {
-		rdt_last_cmd_puts("Source and destination must be directories\n");
-		ret = -EPERM;
-		goto out;
-	}
-
-	if ((rdtgrp->flags & RDT_DELETED) || (new_prdtgrp->flags & RDT_DELETED)) {
-		ret = -ENOENT;
-		goto out;
-	}
-
-	if (rdtgrp->type != RDTMON_GROUP || !kn->parent ||
-	    !is_mon_groups(kn->parent, kn->name)) {
-		rdt_last_cmd_puts("Source must be a MON group\n");
-		ret = -EPERM;
-		goto out;
-	}
-
-	if (!is_mon_groups(new_parent, new_name)) {
-		rdt_last_cmd_puts("Destination must be a mon_groups subdirectory\n");
-		ret = -EPERM;
-		goto out;
-	}
-
-	/*
-	 * If the MON group is monitoring CPUs, the CPUs must be assigned to the
-	 * current parent CTRL_MON group and therefore cannot be assigned to
-	 * the new parent, making the move illegal.
-	 */
-	if (!cpumask_empty(&rdtgrp->cpu_mask) &&
-	    rdtgrp->mon.parent != new_prdtgrp) {
-		rdt_last_cmd_puts("Cannot move a MON group that monitors CPUs\n");
-		ret = -EPERM;
-		goto out;
-	}
-
-	/*
-	 * Allocate the cpumask for use in mongrp_reparent() to avoid the
-	 * possibility of failing to allocate it after kernfs_rename() has
-	 * succeeded.
-	 */
-	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	/*
-	 * Perform all input validation and allocations needed to ensure
-	 * mongrp_reparent() will succeed before calling kernfs_rename(),
-	 * otherwise it would be necessary to revert this call if
-	 * mongrp_reparent() failed.
-	 */
-	ret = kernfs_rename(kn, new_parent, new_name);
-	if (!ret)
-		mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask);
-
-	free_cpumask_var(tmpmask);
-
-out:
-	mutex_unlock(&rdtgroup_mutex);
-	rdtgroup_kn_put(rdtgrp, kn);
-	rdtgroup_kn_put(new_prdtgrp, new_parent);
-	return ret;
-}
-
-static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
-{
-	if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3))
-		seq_puts(seq, ",cdp");
-
-	if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2))
-		seq_puts(seq, ",cdpl2");
-
-	if (is_mba_sc(resctrl_arch_get_resource(RDT_RESOURCE_MBA)))
-		seq_puts(seq, ",mba_MBps");
-
-	if (resctrl_debug)
-		seq_puts(seq, ",debug");
-
-	return 0;
-}
-
-static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
-	.mkdir = rdtgroup_mkdir,
-	.rmdir = rdtgroup_rmdir,
-	.rename = rdtgroup_rename,
-	.show_options = rdtgroup_show_options,
-};
-
-static int rdtgroup_setup_root(struct rdt_fs_context *ctx)
-{
-	rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
-				      KERNFS_ROOT_CREATE_DEACTIVATED |
-				      KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK,
-				      &rdtgroup_default);
-	if (IS_ERR(rdt_root))
-		return PTR_ERR(rdt_root);
-
-	ctx->kfc.root = rdt_root;
-	rdtgroup_default.kn = kernfs_root_to_node(rdt_root);
-
-	return 0;
-}
-
-static void rdtgroup_destroy_root(void)
-{
-	kernfs_destroy_root(rdt_root);
-	rdtgroup_default.kn = NULL;
-}
-
-static void __init rdtgroup_setup_default(void)
-{
-	mutex_lock(&rdtgroup_mutex);
-
-	rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID;
-	rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID;
-	rdtgroup_default.type = RDTCTRL_GROUP;
-	INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list);
-
-	list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
-
-	mutex_unlock(&rdtgroup_mutex);
-}
-
-static void domain_destroy_mon_state(struct rdt_domain *d)
-{
-	bitmap_free(d->rmid_busy_llc);
-	kfree(d->mbm_total);
-	kfree(d->mbm_local);
-}
-
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
-{
-	mutex_lock(&rdtgroup_mutex);
-
-	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
-		mba_sc_domain_destroy(r, d);
-
-	if (!r->mon_capable)
-		goto out_unlock;
-
-	/*
-	 * If resctrl is mounted, remove all the
-	 * per domain monitor data directories.
-	 */
-	if (resctrl_mounted && resctrl_arch_mon_capable())
-		rmdir_mondata_subdir_allrdtgrp(r, d->id);
-
-	if (resctrl_is_mbm_enabled())
-		cancel_delayed_work(&d->mbm_over);
-	if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) {
-		/*
-		 * When a package is going down, forcefully
-		 * decrement each RMID's busy count. There is no way to know
-		 * that the L3 was flushed and hence may lead to
-		 * incorrect counts in rare scenarios, but leaving
-		 * the RMID as busy creates RMID leaks if the
-		 * package never comes back.
-		 */
-		__check_limbo(d, true);
-		cancel_delayed_work(&d->cqm_limbo);
-	}
-
-	domain_destroy_mon_state(d);
-
-out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-}
-
-static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
-{
-	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
-	size_t tsize;
-
-	if (resctrl_arch_is_llc_occupancy_enabled()) {
-		d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
-		if (!d->rmid_busy_llc)
-			return -ENOMEM;
-	}
-	if (resctrl_arch_is_mbm_total_enabled()) {
-		tsize = sizeof(*d->mbm_total);
-		d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
-		if (!d->mbm_total) {
-			bitmap_free(d->rmid_busy_llc);
-			return -ENOMEM;
-		}
-	}
-	if (resctrl_arch_is_mbm_local_enabled()) {
-		tsize = sizeof(*d->mbm_local);
-		d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
-		if (!d->mbm_local) {
-			bitmap_free(d->rmid_busy_llc);
-			kfree(d->mbm_total);
-			return -ENOMEM;
-		}
-	}
-
-	return 0;
-}
-
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
-{
-	int err = 0;
-
-	mutex_lock(&rdtgroup_mutex);
-
-	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) {
-		/* RDT_RESOURCE_MBA is never mon_capable */
-		err = mba_sc_domain_allocate(r, d);
-		goto out_unlock;
-	}
-
-	if (!r->mon_capable)
-		goto out_unlock;
-
-	err = domain_setup_mon_state(r, d);
-	if (err)
-		goto out_unlock;
-
-	if (resctrl_is_mbm_enabled()) {
-		INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
-		mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL,
-					   RESCTRL_PICK_ANY_CPU);
-	}
-
-	if (resctrl_arch_is_llc_occupancy_enabled())
-		INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
-
-	/*
-	 * If the filesystem is not mounted then only the default resource group
-	 * exists. Creation of its directories is deferred until mount time
-	 * by rdt_get_tree() calling mkdir_mondata_all().
-	 * If resctrl is mounted, add per domain monitor data directories.
-	 */
-	if (resctrl_mounted && resctrl_arch_mon_capable())
-		mkdir_mondata_subdir_allrdtgrp(r, d);
-
-out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-
-	return err;
-}
-
-void resctrl_online_cpu(unsigned int cpu)
-{
-	mutex_lock(&rdtgroup_mutex);
-	/* The CPU is set in default rdtgroup after online. */
-	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
-	mutex_unlock(&rdtgroup_mutex);
-}
-
-static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
-{
-	struct rdtgroup *cr;
-
-	list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
-		if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask))
-			break;
-	}
-}
-
-void resctrl_offline_cpu(unsigned int cpu)
-{
-	struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3);
-	struct rdtgroup *rdtgrp;
-	struct rdt_domain *d;
-
-	mutex_lock(&rdtgroup_mutex);
-	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
-		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
-			clear_childcpus(rdtgrp, cpu);
-			break;
-		}
-	}
-
-	if (!l3->mon_capable)
-		goto out_unlock;
-
-	d = resctrl_get_domain_from_cpu(cpu, l3);
-	if (d) {
-		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
-			cancel_delayed_work(&d->mbm_over);
-			mbm_setup_overflow_handler(d, 0, cpu);
-		}
-		if (resctrl_arch_is_llc_occupancy_enabled() &&
-		    cpu == d->cqm_work_cpu && has_busy_rmid(d)) {
-			cancel_delayed_work(&d->cqm_limbo);
-			cqm_setup_limbo_handler(d, 0, cpu);
-		}
-	}
-
-out_unlock:
-	mutex_unlock(&rdtgroup_mutex);
-}
-
-/*
- * resctrl_init - resctrl filesystem initialization
- *
- * Setup resctrl file system including set up root, create mount point,
- * register resctrl filesystem, and initialize files under root directory.
- *
- * Return: 0 on success or -errno
- */
-int resctrl_init(void)
-{
-	int ret = 0;
-
-	seq_buf_init(&last_cmd_status, last_cmd_status_buf,
-		     sizeof(last_cmd_status_buf));
-
-	rdtgroup_setup_default();
-
-	thread_throttle_mode_init();
-
-	ret = resctrl_mon_resource_init();
-	if (ret)
-		return ret;
-
-	ret = sysfs_create_mount_point(fs_kobj, "resctrl");
-	if (ret)
-		return ret;
-
-	ret = register_filesystem(&rdt_fs_type);
-	if (ret)
-		goto cleanup_mountpoint;
-
-	/*
-	 * Adding the resctrl debugfs directory here may not be ideal since
-	 * it would let the resctrl debugfs directory appear on the debugfs
-	 * filesystem before the resctrl filesystem is mounted.
-	 * It may also be ok since that would enable debugging of RDT before
-	 * resctrl is mounted.
-	 * The reason why the debugfs directory is created here and not in
-	 * rdt_get_tree() is because rdt_get_tree() takes rdtgroup_mutex and
-	 * during the debugfs directory creation also &sb->s_type->i_mutex_key
-	 * (the lockdep class of inode->i_rwsem). Other filesystem
-	 * interactions (eg. SyS_getdents) have the lock ordering:
-	 * &sb->s_type->i_mutex_key --> &mm->mmap_lock
-	 * During mmap(), called with &mm->mmap_lock, the rdtgroup_mutex
-	 * is taken, thus creating dependency:
-	 * &mm->mmap_lock --> rdtgroup_mutex for the latter that can cause
-	 * issues considering the other two lock dependencies.
-	 * By creating the debugfs directory here we avoid a dependency
-	 * that may cause deadlock. In practice file operations cannot occur
-	 * until the filesystem is mounted, but there is no way to tell
-	 * lockdep that.
-	 */
-	debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
-
-	return 0;
-
-cleanup_mountpoint:
-	sysfs_remove_mount_point(fs_kobj, "resctrl");
-
-	return ret;
-}
-
-void resctrl_exit(void)
-{
-	debugfs_remove_recursive(debugfs_resctrl);
-	unregister_filesystem(&rdt_fs_type);
-	sysfs_remove_mount_point(fs_kobj, "resctrl");
+	struct rdt_resource *r;
-	resctrl_mon_resource_exit();
+	for_each_capable_rdt_resource(r)
+		reset_all_ctrls(r);
 }
diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index e69de29bb2d1..ae9584be4348 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -0,0 +1,533 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology (RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Authors:
+ *    Fenghua Yu <fenghua.yu@intel.com>
+ *    Tony Luck <tony.luck@intel.com>
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/cpu.h>
+#include <linux/kernfs.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+struct rdt_parse_data {
+	struct rdtgroup *rdtgrp;
+	char *buf;
+};
+
+typedef int (ctrlval_parser_t)(struct rdt_parse_data *data,
+			       struct resctrl_schema *s,
+			       struct rdt_domain *d);
+
+/*
+ * Check whether MBA bandwidth percentage value is correct. The value is
+ * checked against the minimum and max bandwidth values specified by the
+ * hardware. The allocated bandwidth percentage is rounded to the next
+ * control step available on the hardware.
+ */
+static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r)
+{
+	int ret;
+	u32 bw;
+
+	/*
+	 * Only linear delay values are supported for current Intel SKUs.
+	 */
+	if (!r->membw.delay_linear && r->membw.arch_needs_linear) {
+		rdt_last_cmd_puts("No support for non-linear MB domains\n");
+		return false;
+	}
+
+	ret = kstrtou32(buf, 10, &bw);
+	if (ret) {
+		rdt_last_cmd_printf("Invalid MB value %s\n", buf);
+		return false;
+	}
+
+	/* Nothing else to do if software controller is enabled. */
+	if (is_mba_sc(r)) {
+		*data = bw;
+		return true;
+	}
+
+	if (bw < r->membw.min_bw || bw > r->default_ctrl) {
+		rdt_last_cmd_printf("MB value %u out of range [%d,%d]\n",
+				    bw, r->membw.min_bw, r->default_ctrl);
+		return false;
+	}
+
+	*data = roundup(bw, (unsigned long)r->membw.bw_gran);
+	return true;
+}
+
+static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s,
+		    struct rdt_domain *d)
+{
+	struct resctrl_staged_config *cfg;
+	u32 closid = data->rdtgrp->closid;
+	struct rdt_resource *r = s->res;
+	u32 bw_val;
+
+	cfg = &d->staged_config[s->conf_type];
+	if (cfg->have_new_ctrl) {
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		return -EINVAL;
+	}
+
+	if (!bw_validate(data->buf, &bw_val, r))
+		return -EINVAL;
+
+	if (is_mba_sc(r)) {
+		d->mbps_val[closid] = bw_val;
+		return 0;
+	}
+
+	cfg->new_ctrl = bw_val;
+	cfg->have_new_ctrl = true;
+
+	return 0;
+}
+
+/*
+ * Check whether a cache bit mask is valid.
+ * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID:
+ *   - CPUID.0x10.1:ECX[3]: L3 non-contiguous 1s value supported if 1
+ *   - CPUID.0x10.2:ECX[3]: L2 non-contiguous 1s value supported if 1
+ *
+ * Haswell does not support a non-contiguous 1s value and additionally
+ * requires at least two bits set.
+ * AMD allows non-contiguous bitmasks.
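+ *
+ * For illustration (hypothetical masks for a 12-bit cbm_len):
+ *   0x0ff - one contiguous run of 1s, accepted by all of the above.
+ *   0xf0f - two runs of 1s, only accepted when the resource sets
+ *           arch_has_sparse_bitmasks.
+ *   0x001 - a single bit, rejected where min_cbm_bits is 2 (Haswell).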
+ */
+static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r)
+{
+	unsigned long first_bit, zero_bit, val;
+	unsigned int cbm_len = r->cache.cbm_len;
+	int ret;
+
+	ret = kstrtoul(buf, 16, &val);
+	if (ret) {
+		rdt_last_cmd_printf("Non-hex character in the mask %s\n", buf);
+		return false;
+	}
+
+	if ((r->cache.min_cbm_bits > 0 && val == 0) || val > r->default_ctrl) {
+		rdt_last_cmd_puts("Mask out of range\n");
+		return false;
+	}
+
+	first_bit = find_first_bit(&val, cbm_len);
+	zero_bit = find_next_zero_bit(&val, cbm_len, first_bit);
+
+	/* Are non-contiguous bitmasks allowed? */
+	if (!r->cache.arch_has_sparse_bitmasks &&
+	    (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)) {
+		rdt_last_cmd_printf("The mask %lx has non-consecutive 1-bits\n", val);
+		return false;
+	}
+
+	if ((zero_bit - first_bit) < r->cache.min_cbm_bits) {
+		rdt_last_cmd_printf("Need at least %d bits in the mask\n",
+				    r->cache.min_cbm_bits);
+		return false;
+	}
+
+	*data = val;
+	return true;
+}
+
+/*
+ * Read one cache bit mask (hex). Check that it is valid for the current
+ * resource type.
+ */
+static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s,
+		     struct rdt_domain *d)
+{
+	struct rdtgroup *rdtgrp = data->rdtgrp;
+	struct resctrl_staged_config *cfg;
+	struct rdt_resource *r = s->res;
+	u32 cbm_val;
+
+	cfg = &d->staged_config[s->conf_type];
+	if (cfg->have_new_ctrl) {
+		rdt_last_cmd_printf("Duplicate domain %d\n", d->id);
+		return -EINVAL;
+	}
+
+	/*
+	 * Cannot set up more than one pseudo-locked region in a cache
+	 * hierarchy.
+	 */
+	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
+	    rdtgroup_pseudo_locked_in_hierarchy(d)) {
+		rdt_last_cmd_puts("Pseudo-locked region in hierarchy\n");
+		return -EINVAL;
+	}
+
+	if (!cbm_validate(data->buf, &cbm_val, r))
+		return -EINVAL;
+
+	if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) &&
+	    (rdtgrp->mode == RDT_MODE_EXCLUSIVE ||
+	     rdtgrp->mode == RDT_MODE_SHAREABLE) &&
+	    rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) {
+		rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * The CBM may not overlap with the CBM of another closid if
+	 * either is exclusive.
+	 */
+	if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, true)) {
+		rdt_last_cmd_puts("Overlaps with exclusive group\n");
+		return -EINVAL;
+	}
+
+	if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, false)) {
+		if (rdtgrp->mode == RDT_MODE_EXCLUSIVE ||
+		    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
+			rdt_last_cmd_puts("Overlaps with other group\n");
+			return -EINVAL;
+		}
+	}
+
+	cfg->new_ctrl = cbm_val;
+	cfg->have_new_ctrl = true;
+
+	return 0;
+}
+
+static ctrlval_parser_t *get_parser(struct rdt_resource *res)
+{
+	if (res->fflags & RFTYPE_RES_CACHE)
+		return &parse_cbm;
+	else
+		return &parse_bw;
+}
+
+/*
+ * For each domain in this resource we expect to find a series of:
+ *	id=mask
+ * separated by ";". The "id" is in decimal, and must match one of
+ * the "id"s for this resource.
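+ *
+ * For example, with hypothetical domain ids 0 and 1 on a cache resource,
+ * the payload of one schema line might be:
+ *	0=7ff;1=3c0
+ * which stages mask 0x7ff for domain 0 and 0x3c0 for domain 1.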
+ */
+static int parse_line(char *line, struct resctrl_schema *s,
+		      struct rdtgroup *rdtgrp)
+{
+	ctrlval_parser_t *parse_ctrlval = get_parser(s->res);
+	enum resctrl_conf_type t = s->conf_type;
+	struct resctrl_staged_config *cfg;
+	struct rdt_resource *r = s->res;
+	struct rdt_parse_data data;
+	char *dom = NULL, *id;
+	struct rdt_domain *d;
+	unsigned long dom_id;
+
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
+	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP &&
+	    (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) {
+		rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n");
+		return -EINVAL;
+	}
+
+next:
+	if (!line || line[0] == '\0')
+		return 0;
+	dom = strsep(&line, ";");
+	id = strsep(&dom, "=");
+	if (!dom || kstrtoul(id, 10, &dom_id)) {
+		rdt_last_cmd_puts("Missing '=' or non-numeric domain\n");
+		return -EINVAL;
+	}
+	dom = strim(dom);
+	list_for_each_entry(d, &r->domains, list) {
+		if (d->id == dom_id) {
+			data.buf = dom;
+			data.rdtgrp = rdtgrp;
+			if (parse_ctrlval(&data, s, d))
+				return -EINVAL;
+			if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
+				cfg = &d->staged_config[t];
+				/*
+				 * We are in pseudo-locking setup mode and have
+				 * just parsed a valid CBM that should be
+				 * pseudo-locked. Only one locked region per
+				 * resource group and domain is allowed, so just
+				 * do the required initialization for a single
+				 * region and return.
+				 */
+				rdtgrp->plr->s = s;
+				rdtgrp->plr->d = d;
+				rdtgrp->plr->cbm = cfg->new_ctrl;
+				d->plr = rdtgrp->plr;
+				return 0;
+			}
+			goto next;
+		}
+	}
+	return -EINVAL;
+}
+
+static int rdtgroup_parse_resource(char *resname, char *tok,
+				   struct rdtgroup *rdtgrp)
+{
+	struct resctrl_schema *s;
+
+	list_for_each_entry(s, &resctrl_schema_all, list) {
+		if (!strcmp(resname, s->name) && rdtgrp->closid < s->num_closid)
+			return parse_line(tok, s, rdtgrp);
+	}
+	rdt_last_cmd_printf("Unknown or unsupported resource name '%s'\n", resname);
+	return -EINVAL;
+}
+
+ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct resctrl_schema *s;
+	struct rdtgroup *rdtgrp;
+	struct rdt_resource *r;
+	char *tok, *resname;
+	int ret = 0;
+
+	/* Valid input requires a trailing newline */
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+	buf[nbytes - 1] = '\0';
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp) {
+		rdtgroup_kn_unlock(of->kn);
+		return -ENOENT;
+	}
+	rdt_last_cmd_clear();
+
+	/*
+	 * No changes to pseudo-locked region allowed. It has to be removed
+	 * and re-created instead.
+	 */
+	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
+		ret = -EINVAL;
+		rdt_last_cmd_puts("Resource group is pseudo-locked\n");
+		goto out;
+	}
+
+	rdt_staged_configs_clear();
+
+	while ((tok = strsep(&buf, "\n")) != NULL) {
+		resname = strim(strsep(&tok, ":"));
+		if (!tok) {
+			rdt_last_cmd_puts("Missing ':'\n");
+			ret = -EINVAL;
+			goto out;
+		}
+		if (tok[0] == '\0') {
+			rdt_last_cmd_printf("Missing '%s' value\n", resname);
+			ret = -EINVAL;
+			goto out;
+		}
+		ret = rdtgroup_parse_resource(resname, tok, rdtgrp);
+		if (ret)
+			goto out;
+	}
+
+	list_for_each_entry(s, &resctrl_schema_all, list) {
+		r = s->res;
+
+		/*
+		 * Writes to mba_sc resources update the software controller,
+		 * not the control MSR.
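+		 * For example (assuming the mba_MBps mount option is in
+		 * use), a write of "MB:0=2048" ends up in
+		 * d->mbps_val[closid] via parse_bw() above, and the
+		 * resctrl_arch_update_domains() call is skipped here.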
+		 */
+		if (is_mba_sc(r))
+			continue;
+
+		ret = resctrl_arch_update_domains(r, rdtgrp->closid);
+		if (ret)
+			goto out;
+	}
+
+	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
+		/*
+		 * If pseudo-locking fails we keep the resource group in
+		 * mode RDT_MODE_PSEUDO_LOCKSETUP with its class of service
+		 * active and updated for just the domain the pseudo-locked
+		 * region was requested for.
+		 */
+		ret = rdtgroup_pseudo_lock_create(rdtgrp);
+	}
+
+out:
+	rdt_staged_configs_clear();
+	rdtgroup_kn_unlock(of->kn);
+	return ret ?: nbytes;
+}
+
+static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid)
+{
+	struct rdt_resource *r = schema->res;
+	struct rdt_domain *dom;
+	bool sep = false;
+	u32 ctrl_val;
+
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
+	seq_printf(s, "%*s:", max_name_width, schema->name);
+	list_for_each_entry(dom, &r->domains, list) {
+		if (sep)
+			seq_puts(s, ";");
+
+		if (is_mba_sc(r))
+			ctrl_val = dom->mbps_val[closid];
+		else
+			ctrl_val = resctrl_arch_get_config(r, dom, closid,
+							   schema->conf_type);
+
+		seq_printf(s, r->format_str, dom->id, max_data_width,
+			   ctrl_val);
+		sep = true;
+	}
+	seq_puts(s, "\n");
+}
+
+int rdtgroup_schemata_show(struct kernfs_open_file *of,
+			   struct seq_file *s, void *v)
+{
+	struct resctrl_schema *schema;
+	struct rdtgroup *rdtgrp;
+	int ret = 0;
+	u32 closid;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (rdtgrp) {
+		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
+			list_for_each_entry(schema, &resctrl_schema_all, list) {
+				seq_printf(s, "%s:uninitialized\n", schema->name);
+			}
+		} else if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
+			if (!rdtgrp->plr->d) {
+				rdt_last_cmd_clear();
+				rdt_last_cmd_puts("Cache domain offline\n");
+				ret = -ENODEV;
+			} else {
+				seq_printf(s, "%s:%d=%x\n",
+					   rdtgrp->plr->s->res->name,
+					   rdtgrp->plr->d->id,
+					   rdtgrp->plr->cbm);
+			}
+		} else {
+			closid = rdtgrp->closid;
+			list_for_each_entry(schema, &resctrl_schema_all, list) {
+				if (closid < schema->num_closid)
+					show_doms(s, schema, closid);
+			}
+		}
+	} else {
+		ret = -ENOENT;
+	}
+	rdtgroup_kn_unlock(of->kn);
+	return ret;
+}
+
+static int smp_mon_event_count(void *arg)
+{
+	mon_event_count(arg);
+
+	return 0;
+}
+
+void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
+		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    int evtid, int first)
+{
+	int cpu;
+
+	/* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
+	/*
+	 * Setup the parameters to pass to mon_event_count() to read the data.
+	 */
+	rr->rgrp = rdtgrp;
+	rr->evtid = evtid;
+	rr->r = r;
+	rr->d = d;
+	rr->val = 0;
+	rr->first = first;
+	rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid);
+	if (IS_ERR(rr->arch_mon_ctx)) {
+		rr->err = -EINVAL;
+		return;
+	}
+
+	cpu = cpumask_any_housekeeping(&d->cpu_mask, RESCTRL_PICK_ANY_CPU);
+
+	/*
+	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but if all
+	 * the CPUs in the domain are nohz_full, one must be picked and
+	 * sent an IPI.
+	 * MPAM's resctrl_arch_rmid_read() is unable to read the
+	 * counters on some platforms if it's called in IRQ context.
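+	 * The smp_call_on_cpu() path below runs smp_mon_event_count() in
+	 * process context on the chosen housekeeping CPU; the IPI path is
+	 * only the fallback for when no housekeeping CPU is available in
+	 * the domain.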
+	 */
+	if (tick_nohz_full_cpu(cpu))
+		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	else
+		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
+
+	resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
+}
+
+int rdtgroup_mondata_show(struct seq_file *m, void *arg)
+{
+	struct kernfs_open_file *of = m->private;
+	u32 resid, evtid, domid;
+	struct rdtgroup *rdtgrp;
+	struct rdt_resource *r;
+	union mon_data_bits md;
+	struct rdt_domain *d;
+	struct rmid_read rr;
+	int ret = 0;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	md.priv = of->kn->priv;
+	resid = md.u.rid;
+	domid = md.u.domid;
+	evtid = md.u.evtid;
+
+	r = resctrl_arch_get_resource(resid);
+	d = resctrl_arch_find_domain(r, domid);
+	if (IS_ERR_OR_NULL(d)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	mon_event_read(&rr, r, d, rdtgrp, evtid, false);
+
+	if (rr.err == -EIO)
+		seq_puts(m, "Error\n");
+	else if (rr.err == -EINVAL)
+		seq_puts(m, "Unavailable\n");
+	else
+		seq_printf(m, "%llu\n", rr.val);
+
+out:
+	rdtgroup_kn_unlock(of->kn);
+	return ret;
+}
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index e69de29bb2d1..0cdabf19ff79 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -0,0 +1,310 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _FS_RESCTRL_INTERNAL_H
+#define _FS_RESCTRL_INTERNAL_H
+
+#include <linux/resctrl.h>
+#include <linux/sched.h>
+#include <linux/kernfs.h>
+#include <linux/fs_context.h>
+#include <linux/jump_label.h>
+#include <linux/tick.h>
+
+#include <asm/resctrl.h>
+
+/**
+ * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that
+ *			        aren't marked nohz_full
+ * @mask: The mask to pick a CPU from.
+ * @exclude_cpu: The CPU to avoid picking.
+ *
+ * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping
+ * CPUs that don't use nohz_full, these are preferred. Pass
+ * RESCTRL_PICK_ANY_CPU to avoid excluding any CPUs.
+ *
+ * When a CPU is excluded, returns >= nr_cpu_ids if no CPUs are available.
+ */
+static inline unsigned int
+cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
+{
+	unsigned int cpu, hk_cpu;
+
+	if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
+		cpu = cpumask_any(mask);
+	else
+		cpu = cpumask_any_but(mask, exclude_cpu);
+
+	/* Only continue if tick_nohz_full_mask has been initialized. */
+	if (!tick_nohz_full_enabled())
+		return cpu;
+
+	/* If the CPU picked isn't marked nohz_full nothing more needs doing. */
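+	/*
+	 * Worked example with hypothetical values: mask = {2,3},
+	 * tick_nohz_full_mask = {3}, exclude_cpu == 2. The first and-not
+	 * pick below returns CPU 2 (== exclude_cpu), the second pick finds
+	 * nothing (>= nr_cpu_ids), so the nohz_full CPU 3 chosen above is
+	 * kept.
+	 */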
+	if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu))
+		return cpu;
+
+	/* Try to find a CPU that isn't nohz_full to use in preference */
+	hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
+	if (hk_cpu == exclude_cpu)
+		hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);
+
+	if (hk_cpu < nr_cpu_ids)
+		cpu = hk_cpu;
+
+	return cpu;
+}
+
+struct rdt_fs_context {
+	struct kernfs_fs_context kfc;
+	bool enable_cdpl2;
+	bool enable_cdpl3;
+	bool enable_mba_mbps;
+	bool enable_debug;
+};
+
+static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
+{
+	struct kernfs_fs_context *kfc = fc->fs_private;
+
+	return container_of(kfc, struct rdt_fs_context, kfc);
+}
+
+/**
+ * struct mon_evt - Entry in the event list of a resource
+ * @evtid: event id
+ * @name: name of the event
+ * @configurable: true if the event is configurable
+ * @list: entry in &rdt_resource->evt_list
+ */
+struct mon_evt {
+	enum resctrl_event_id evtid;
+	char *name;
+	bool configurable;
+	struct list_head list;
+};
+
+/**
+ * union mon_data_bits - Monitoring details for each event file
+ * @priv: Used to store monitoring event data in @u
+ *        as kernfs private data
+ * @rid: Resource id associated with the event file
+ * @evtid: Event id associated with the event file
+ * @domid: The domain to which the event file belongs
+ * @u: Name of the bit fields struct
+ */
+union mon_data_bits {
+	void *priv;
+	struct {
+		unsigned int rid : 10;
+		enum resctrl_event_id evtid : 8;
+		unsigned int domid : 14;
+	} u;
+};
+
+struct rmid_read {
+	struct rdtgroup *rgrp;
+	struct rdt_resource *r;
+	struct rdt_domain *d;
+	enum resctrl_event_id evtid;
+	bool first;
+	int err;
+	u64 val;
+	void *arch_mon_ctx;
+};
+
+extern struct list_head resctrl_schema_all;
+extern bool resctrl_mounted;
+
+enum rdt_group_type {
+	RDTCTRL_GROUP = 0,
+	RDTMON_GROUP,
+	RDT_NUM_GROUP,
+};
+
+/**
+ * enum rdtgrp_mode - Mode of a RDT resource group
+ * @RDT_MODE_SHAREABLE: This resource group allows sharing of its allocations
+ * @RDT_MODE_EXCLUSIVE: No sharing of this resource group's allocations allowed
+ * @RDT_MODE_PSEUDO_LOCKSETUP: Resource group will be used for Pseudo-Locking
+ * @RDT_MODE_PSEUDO_LOCKED: No sharing of this resource group's allocations
+ *			    allowed AND the allocations are Cache Pseudo-Locked
+ * @RDT_NUM_MODES: Total number of modes
+ *
+ * The mode of a resource group enables control over the allowed overlap
+ * between allocations associated with different resource groups (classes
+ * of service). User is able to modify the mode of a resource group by
+ * writing to the "mode" resctrl file associated with the resource group.
+ *
+ * The "shareable", "exclusive", and "pseudo-locksetup" modes are set by
+ * writing the appropriate text to the "mode" file. A resource group enters
+ * "pseudo-locked" mode after the schemata is written while the resource
+ * group is in "pseudo-locksetup" mode.
+ */
+enum rdtgrp_mode {
+	RDT_MODE_SHAREABLE = 0,
+	RDT_MODE_EXCLUSIVE,
+	RDT_MODE_PSEUDO_LOCKSETUP,
+	RDT_MODE_PSEUDO_LOCKED,
+
+	/* Must be last */
+	RDT_NUM_MODES,
+};
+
+/**
+ * struct mongroup - store mon group's data in resctrl fs.
+ * @mon_data_kn: kernfs node for the mon_data directory
+ * @parent: parent rdtgrp
+ * @crdtgrp_list: child rdtgroup node list
+ * @rmid: rmid for this rdtgroup
+ */
+struct mongroup {
+	struct kernfs_node *mon_data_kn;
+	struct rdtgroup *parent;
+	struct list_head crdtgrp_list;
+	u32 rmid;
+};
+
+/**
+ * struct rdtgroup - store rdtgroup's data in resctrl file system.
+ * @kn: kernfs node
+ * @rdtgroup_list: linked list for all rdtgroups
+ * @closid: closid for this rdtgroup
+ * @cpu_mask: CPUs assigned to this rdtgroup
+ * @flags: status bits
+ * @waitcount: how many cpus expect to find this
+ *	       group when they acquire rdtgroup_mutex
+ * @type: indicates type of this rdtgroup - either
+ *	  monitor only or ctrl_mon group
+ * @mon: mongroup related data
+ * @mode: mode of resource group
+ * @plr: pseudo-locked region
+ */
+struct rdtgroup {
+	struct kernfs_node *kn;
+	struct list_head rdtgroup_list;
+	u32 closid;
+	struct cpumask cpu_mask;
+	int flags;
+	atomic_t waitcount;
+	enum rdt_group_type type;
+	struct mongroup mon;
+	enum rdtgrp_mode mode;
+	struct pseudo_lock_region *plr;
+};
+
+/* List of all resource groups */
+extern struct list_head rdt_all_groups;
+
+extern int max_name_width, max_data_width;
+
+/**
+ * struct rftype - describe each file in the resctrl file system
+ * @name: File name
+ * @mode: Access mode
+ * @kf_ops: File operations
+ * @flags: File specific RFTYPE_FLAGS_* flags
+ * @fflags: File specific RFTYPE_* flags
+ * @seq_show: Show content of the file
+ * @write: Write to the file
+ */
+struct rftype {
+	char *name;
+	umode_t mode;
+	const struct kernfs_ops *kf_ops;
+	unsigned long flags;
+	unsigned long fflags;
+
+	int (*seq_show)(struct kernfs_open_file *of,
+			struct seq_file *sf, void *v);
+	/*
+	 * write() is the generic write callback which maps directly to
+	 * kernfs write operation and overrides all other operations.
+	 * Maximum write size is determined by ->max_write_len.
+	 */
+	ssize_t (*write)(struct kernfs_open_file *of,
+			 char *buf, size_t nbytes, loff_t off);
+};
+
+/**
+ * struct mbm_state - status for each MBM counter in each domain
+ * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
+ * @prev_bw: The most recent bandwidth in MBps
+ */
+struct mbm_state {
+	u64 prev_bw_bytes;
+	u32 prev_bw;
+};
+
+static inline bool is_mba_sc(struct rdt_resource *r)
+{
+	if (!r)
+		r = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
+
+	/*
+	 * The software controller support is only applicable to the MBA
+	 * resource. Make sure to check the resource type.
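+	 * Callers that don't already have the resource to hand pass NULL,
+	 * e.g. the is_mba_sc(NULL) calls in the MBM overflow path.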
+	 */
+	if (r->rid != RDT_RESOURCE_MBA)
+		return false;
+
+	return r->membw.mba_sc;
+}
+
+extern struct mutex rdtgroup_mutex;
+extern struct rdtgroup rdtgroup_default;
+extern struct dentry *debugfs_resctrl;
+
+void rdt_last_cmd_clear(void);
+void rdt_last_cmd_puts(const char *s);
+__printf(1, 2)
+void rdt_last_cmd_printf(const char *fmt, ...);
+
+struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
+void rdtgroup_kn_unlock(struct kernfs_node *kn);
+int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name);
+int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name,
+			     umode_t mask);
+ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off);
+int rdtgroup_schemata_show(struct kernfs_open_file *of,
+			   struct seq_file *s, void *v);
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+			   unsigned long cbm, int closid, bool exclusive);
+unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_domain *d,
+				  unsigned long cbm);
+enum rdtgrp_mode rdtgroup_mode_by_closid(int closid);
+int rdtgroup_tasks_assigned(struct rdtgroup *r);
+int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp);
+int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp);
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm);
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d);
+int rdt_pseudo_lock_init(void);
+void rdt_pseudo_lock_release(void);
+int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp);
+void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp);
+int closids_supported(void);
+bool closid_allocated(unsigned int closid);
+bool resctrl_closid_is_dirty(u32 closid);
+void closid_free(int closid);
+int alloc_rmid(u32 closid);
+void free_rmid(u32 closid, u32 rmid);
+void resctrl_mon_resource_exit(void);
+void mon_event_count(void *info);
+int rdtgroup_mondata_show(struct seq_file *m, void *arg);
+void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
+		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
+		    int evtid, int first);
+int resctrl_mon_resource_init(void);
+void mbm_setup_overflow_handler(struct rdt_domain *dom,
+				unsigned long delay_ms,
+				int exclude_cpu);
+void mbm_handle_overflow(struct work_struct *work);
+void setup_default_ctrlval(struct rdt_resource *r, u32 *dc);
+void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
+			     int exclude_cpu);
+void cqm_handle_limbo(struct work_struct *work);
+bool has_busy_rmid(struct rdt_domain *d);
+void __check_limbo(struct rdt_domain *d, bool force_free);
+void mbm_config_rftype_init(const char *config);
+void rdt_staged_configs_clear(void);
+int resctrl_find_cleanest_closid(void);
+
+#endif /* _FS_RESCTRL_INTERNAL_H */
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index e69de29bb2d1..710ba5bf9962 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -0,0 +1,845 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology (RDT)
+ * - Monitoring code
+ *
+ * Copyright (C) 2017 Intel Corporation
+ *
+ * Author:
+ *    Vikas Shivappa <vikas.shivappa@intel.com>
+ *
+ * This replaces the perf-based cqm.c, but reuses a lot of code and
+ * data structures originally from Peter Zijlstra and Matt Fleming.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+/*
+ * struct rmid_entry - dirty tracking for all RMID.
+ * @closid: The CLOSID for this entry.
+ * @rmid: The RMID for this entry.
+ * @busy: The number of domains with cached data using this RMID.
+ * @list: Member of the rmid_free_lru list when busy == 0.
+ *
+ * Depending on the architecture the correct monitor is accessed using
+ * both @closid and @rmid, or @rmid only.
+ *
+ * Take the rdtgroup_mutex when accessing.
+ */
+struct rmid_entry {
+	u32 closid;
+	u32 rmid;
+	int busy;
+	struct list_head list;
+};
+
+/*
+ * @rmid_free_lru - A least recently used list of free RMIDs
+ *	These RMIDs are guaranteed to have an occupancy less than the
+ *	threshold occupancy
+ */
+static LIST_HEAD(rmid_free_lru);
+
+/**
+ * @closid_num_dirty_rmid: The number of dirty RMID each CLOSID has.
+ *	Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined.
+ *	Indexed by CLOSID. Protected by rdtgroup_mutex.
+ */
+static u32 *closid_num_dirty_rmid;
+
+/*
+ * @rmid_limbo_count - count of currently unused but (potentially)
+ *	dirty RMIDs.
+ *	This counts RMIDs that no one is currently using but that
+ *	may have an occupancy value > resctrl_rmid_realloc_threshold. User can
+ *	change the threshold occupancy value.
+ */
+static unsigned int rmid_limbo_count;
+
+/*
+ * @rmid_entry - The entry in the limbo and free lists.
+ */
+static struct rmid_entry *rmid_ptrs;
+
+/*
+ * This is the threshold cache occupancy in bytes at which we will consider an
+ * RMID available for re-allocation.
+ */
+unsigned int resctrl_rmid_realloc_threshold;
+
+/*
+ * This is the maximum value for the reallocation threshold, in bytes.
+ */
+unsigned int resctrl_rmid_realloc_limit;
+
+/*
+ * x86 and arm64 differ in their handling of monitoring.
+ * x86's RMID are independent numbers, there is only one source of traffic
+ * with an RMID value of '1'.
+ * arm64's PMG extends the PARTID/CLOSID space, there are multiple sources of
+ * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID
+ * value is no longer unique.
+ * To account for this, resctrl uses an index. On x86 this is just the RMID,
+ * on arm64 it encodes the CLOSID and RMID. This gives a unique number.
+ *
+ * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code
+ * must accept an attempt to read every index.
+ */
+static inline struct rmid_entry *__rmid_entry(u32 idx)
+{
+	struct rmid_entry *entry;
+	u32 closid, rmid;
+
+	entry = &rmid_ptrs[idx];
+	resctrl_arch_rmid_idx_decode(idx, &closid, &rmid);
+
+	WARN_ON_ONCE(entry->closid != closid);
+	WARN_ON_ONCE(entry->rmid != rmid);
+
+	return entry;
+}
+
+static void limbo_release_entry(struct rmid_entry *entry)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	rmid_limbo_count--;
+	list_add_tail(&entry->list, &rmid_free_lru);
+
+	if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+		closid_num_dirty_rmid[entry->closid]--;
+}
+
+/*
+ * Check the RMIDs that are marked as busy for this domain. If the
+ * reported LLC occupancy is below the threshold, clear the busy bit and
+ * decrement the count.
+ * If the busy count gets to zero on an RMID, we free the RMID.
+ */
+void __check_limbo(struct rdt_domain *d, bool force_free)
+{
+	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+	struct rmid_entry *entry;
+	u32 idx, cur_idx = 1;
+	void *arch_mon_ctx;
+	bool rmid_dirty;
+	u64 val = 0;
+
+	arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);
+	if (IS_ERR(arch_mon_ctx)) {
+		pr_warn_ratelimited("Failed to allocate monitor context: %ld",
+				    PTR_ERR(arch_mon_ctx));
+		return;
+	}
+
+	/*
+	 * Skip RMID 0 and start from RMID 1 and check all the RMIDs that
+	 * are marked as busy for occupancy < threshold. If the occupancy
+	 * is less than the threshold decrement the busy counter of the
+	 * RMID and move it to the free list when the counter reaches 0.
+	 */
+	for (;;) {
+		idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx);
+		if (idx >= idx_limit)
+			break;
+
+		entry = __rmid_entry(idx);
+		if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid,
+					   QOS_L3_OCCUP_EVENT_ID, &val,
+					   arch_mon_ctx)) {
+			rmid_dirty = true;
+		} else {
+			rmid_dirty = (val >= resctrl_rmid_realloc_threshold);
+		}
+
+		if (force_free || !rmid_dirty) {
+			clear_bit(idx, d->rmid_busy_llc);
+			if (!--entry->busy)
+				limbo_release_entry(entry);
+		}
+		cur_idx = idx + 1;
+	}
+
+	resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
+}
+
+bool has_busy_rmid(struct rdt_domain *d)
+{
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+
+	return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit;
+}
+
+static struct rmid_entry *resctrl_find_free_rmid(u32 closid)
+{
+	struct rmid_entry *itr;
+	u32 itr_idx, cmp_idx;
+
+	if (list_empty(&rmid_free_lru))
+		return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC);
+
+	list_for_each_entry(itr, &rmid_free_lru, list) {
+		/*
+		 * Get the index of this free RMID, and the index it would need
+		 * to be if it were used with this CLOSID.
+		 * If the CLOSID is irrelevant on this architecture, the two
+		 * index values are always the same on every entry and thus the
+		 * very first entry will be returned.
+		 */
+		itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid);
+		cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid);
+
+		if (itr_idx == cmp_idx)
+			return itr;
+	}
+
+	return ERR_PTR(-ENOSPC);
+}
+
+/**
+ * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated
+ *				    RMID are clean, or the CLOSID that has
+ *				    the most clean RMID.
+ *
+ * MPAM's equivalent of RMID is per-CLOSID, meaning a freshly allocated CLOSID
+ * may not be able to allocate a clean RMID. To avoid this the allocator will
+ * choose the CLOSID with the most clean RMID.
+ *
+ * When the CLOSID and RMID are independent numbers, the first free CLOSID will
+ * be returned.
+ */
+int resctrl_find_cleanest_closid(void)
+{
+	u32 cleanest_closid = ~0;
+	int i = 0;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+		return -EIO;
+
+	for (i = 0; i < closids_supported(); i++) {
+		int num_dirty;
+
+		if (closid_allocated(i))
+			continue;
+
+		num_dirty = closid_num_dirty_rmid[i];
+		if (num_dirty == 0)
+			return i;
+
+		if (cleanest_closid == ~0)
+			cleanest_closid = i;
+
+		if (num_dirty < closid_num_dirty_rmid[cleanest_closid])
+			cleanest_closid = i;
+	}
+
+	if (cleanest_closid == ~0)
+		return -ENOSPC;
+
+	return cleanest_closid;
+}
+
+/*
+ * For MPAM the RMID value is not unique, and has to be considered with
+ * the CLOSID.
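+ * As a sketch of the index scheme described above (the real mapping is
+ * whatever resctrl_arch_rmid_idx_encode() implements; num_pmg is a
+ * hypothetical name for the size of the PMG space per CLOSID):
+ *
+ *	x86:   idx = rmid;
+ *	arm64: idx = (closid * num_pmg) + rmid;
+ *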
+ * The (CLOSID, RMID) pair is allocated on all domains, which
+ * allows all domains to be managed by a single free list.
+ * Each domain also has a rmid_busy_llc to reduce the work of the limbo handler.
+ */
+int alloc_rmid(u32 closid)
+{
+	struct rmid_entry *entry;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	entry = resctrl_find_free_rmid(closid);
+	if (IS_ERR(entry))
+		return PTR_ERR(entry);
+
+	list_del(&entry->list);
+	return entry->rmid;
+}
+
+static void add_rmid_to_limbo(struct rmid_entry *entry)
+{
+	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+	struct rdt_domain *d;
+	u32 idx;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
+	idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
+
+	entry->busy = 0;
+	list_for_each_entry(d, &r->domains, list) {
+		/*
+		 * For the first limbo RMID in the domain,
+		 * set up the limbo worker.
+		 */
+		if (!has_busy_rmid(d))
+			cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL,
+						RESCTRL_PICK_ANY_CPU);
+		set_bit(idx, d->rmid_busy_llc);
+		entry->busy++;
+	}
+
+	rmid_limbo_count++;
+	if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+		closid_num_dirty_rmid[entry->closid]++;
+}
+
+void free_rmid(u32 closid, u32 rmid)
+{
+	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+	struct rmid_entry *entry;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	/*
+	 * Do not allow the default rmid to be free'd. Comparing by index
+	 * allows architectures that ignore the closid parameter to avoid an
+	 * unnecessary check.
+	 */
+	if (!resctrl_arch_mon_capable() ||
+	    idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID,
+						RESCTRL_RESERVED_RMID))
+		return;
+
+	entry = __rmid_entry(idx);
+
+	if (resctrl_arch_is_llc_occupancy_enabled())
+		add_rmid_to_limbo(entry);
+	else
+		list_add_tail(&entry->list, &rmid_free_lru);
+}
+
+static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 closid,
+				       u32 rmid, enum resctrl_event_id evtid)
+{
+	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+
+	switch (evtid) {
+	case QOS_L3_MBM_TOTAL_EVENT_ID:
+		return &d->mbm_total[idx];
+	case QOS_L3_MBM_LOCAL_EVENT_ID:
+		return &d->mbm_local[idx];
+	default:
+		return NULL;
+	}
+}
+
+static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
+{
+	struct mbm_state *m;
+	u64 tval = 0;
+
+	if (rr->first) {
+		resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
+		m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
+		if (m)
+			memset(m, 0, sizeof(struct mbm_state));
+		return 0;
+	}
+
+	rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, rr->evtid,
+					 &tval, rr->arch_mon_ctx);
+	if (rr->err)
+		return rr->err;
+
+	rr->val += tval;
+
+	return 0;
+}
+
+/*
+ * mbm_bw_count() - Update bw count from values previously read by
+ *		    __mon_event_count().
+ * @closid: The closid used to identify the cached mbm_state.
+ * @rmid: The rmid used to identify the cached mbm_state.
+ * @rr: The struct rmid_read populated by __mon_event_count().
+ *
+ * Supporting function to calculate the memory bandwidth
+ * and delta bandwidth in MBps. The chunks value previously read by
+ * __mon_event_count() is compared with the chunks value from the previous
+ * invocation. This must be called once per second to maintain values in MBps.
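+ *
+ * Worked example with hypothetical counter values: two reads one second
+ * apart returning 0xC0000000 and 0xE0000000 bytes give
+ * bytes = 0x20000000 (512 MiB), so cur_bw = 512 MiB / SZ_1M = 512 MBps.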
+ */
+static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
+{
+	u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+	struct mbm_state *m = &rr->d->mbm_local[idx];
+	u64 cur_bw, bytes, cur_bytes;
+
+	cur_bytes = rr->val;
+	bytes = cur_bytes - m->prev_bw_bytes;
+	m->prev_bw_bytes = cur_bytes;
+
+	cur_bw = bytes / SZ_1M;
+
+	m->prev_bw = cur_bw;
+}
+
+/*
+ * This is scheduled by mon_event_read() to read the CQM/MBM counters
+ * on a domain.
+ */
+void mon_event_count(void *info)
+{
+	struct rdtgroup *rdtgrp, *entry;
+	struct rmid_read *rr = info;
+	struct list_head *head;
+	int ret;
+
+	rdtgrp = rr->rgrp;
+
+	ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr);
+
+	/*
+	 * For Ctrl groups read data from child monitor groups and
+	 * add them together. Count events which are read successfully.
+	 * Discard the rmid_read's reporting errors.
+	 */
+	head = &rdtgrp->mon.crdtgrp_list;
+
+	if (rdtgrp->type == RDTCTRL_GROUP) {
+		list_for_each_entry(entry, head, mon.crdtgrp_list) {
+			if (__mon_event_count(entry->closid, entry->mon.rmid,
+					      rr) == 0)
+				ret = 0;
+		}
+	}
+
+	/*
+	 * __mon_event_count() calls for newly created monitor groups may
+	 * report -EINVAL/Unavailable if the monitor hasn't seen any traffic.
+	 * Discard error if any of the monitor event reads succeeded.
+	 */
+	if (ret == 0)
+		rr->err = 0;
+}
+
+/*
+ * Feedback loop for MBA software controller (mba_sc)
+ *
+ * mba_sc is a feedback loop where we periodically read MBM counters and
+ * adjust the bandwidth percentage values via the IA32_MBA_THRTL_MSRs so
+ * that:
+ *
+ *   current bandwidth (cur_bw) < user specified bandwidth (user_bw)
+ *
+ * This uses the MBM counters to measure the bandwidth and MBA throttle
+ * MSRs to control the bandwidth for a particular rdtgrp. It builds on the
+ * fact that resctrl rdtgroups have both monitoring and control.
+ *
+ * The frequency of the checks is 1s and we just tag along the MBM overflow
+ * timer. Having 1s interval makes the calculation of bandwidth simpler.
+ *
+ * Although MBA's goal is to restrict the bandwidth to a maximum, there may
+ * be a need to increase the bandwidth to avoid unnecessarily restricting
+ * the L2 <-> L3 traffic.
+ *
+ * Since MBA controls the L2 external bandwidth, whereas MBM measures the
+ * L3 external bandwidth, the following sequence could lead to such a
+ * situation.
+ *
+ * Consider an rdtgroup which had high L3 <-> memory traffic in initial
+ * phases -> mba_sc kicks in and reduces bandwidth percentage values -> but
+ * after some time rdtgroup has mostly L2 <-> L3 traffic.
+ *
+ * In this case we may restrict the rdtgroup's L2 <-> L3 traffic as its
+ * throttle MSRs already have low percentage values. To avoid
+ * unnecessarily restricting such rdtgroups, we also increase the bandwidth.
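+ *
+ * Worked example with hypothetical numbers: throttling at
+ * cur_msr_val = 30% in 10% steps with a measured cur_bw of 600 MBps,
+ * the step up to 40% is only taken if user_bw exceeds
+ * 600 * (30 + 10) / 30 = 800 MBps; a target of 1000 MBps steps up,
+ * while one of 700 MBps stays put.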
+ */
+static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
+{
+	u32 closid, rmid, cur_msr_val, new_msr_val;
+	struct mbm_state *pmbm_data, *cmbm_data;
+	struct rdt_resource *r_mba;
+	struct rdt_domain *dom_mba;
+	u32 cur_bw, user_bw, idx;
+	struct list_head *head;
+	struct rdtgroup *entry;
+
+	if (!resctrl_arch_is_mbm_local_enabled())
+		return;
+
+	r_mba = resctrl_arch_get_resource(RDT_RESOURCE_MBA);
+
+	closid = rgrp->closid;
+	rmid = rgrp->mon.rmid;
+	idx = resctrl_arch_rmid_idx_encode(closid, rmid);
+	pmbm_data = &dom_mbm->mbm_local[idx];
+
+	dom_mba = resctrl_get_domain_from_cpu(smp_processor_id(), r_mba);
+	if (!dom_mba) {
+		pr_warn_once("Failure to get domain for MBA update\n");
+		return;
+	}
+
+	cur_bw = pmbm_data->prev_bw;
+	user_bw = dom_mba->mbps_val[closid];
+
+	/* MBA resource doesn't support CDP */
+	cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
+
+	/*
+	 * For Ctrl groups read data from child monitor groups.
+	 */
+	head = &rgrp->mon.crdtgrp_list;
+	list_for_each_entry(entry, head, mon.crdtgrp_list) {
+		cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+		cur_bw += cmbm_data->prev_bw;
+	}
+
+	/*
+	 * Scale up/down the bandwidth linearly for the ctrl group. The
+	 * bandwidth step is the bandwidth granularity specified by the
+	 * hardware.
+	 * Always increase throttling if current bandwidth is above the
+	 * target set by user.
+	 * But avoid thrashing up and down on every poll by checking
+	 * whether a decrease in throttling is likely to push the group
+	 * back over target. E.g. if currently throttling to 30% of bandwidth
+	 * on a system with 10% granularity steps, check whether moving to
+	 * 40% would go past the limit by multiplying current bandwidth by
+	 * "(30 + 10) / 30".
+	 */
+	if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
+		new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
+	} else if (cur_msr_val < MAX_MBA_BW &&
+		   (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
+		new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
+	} else {
+		return;
+	}
+
+	resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
+}
+
+static void mbm_update(struct rdt_resource *r, struct rdt_domain *d,
+		       u32 closid, u32 rmid)
+{
+	struct rmid_read rr;
+
+	rr.first = false;
+	rr.r = r;
+	rr.d = d;
+
+	/*
+	 * This is protected from concurrent reads from user
+	 * as both the user and we hold the global mutex.
+	 */
+	if (resctrl_arch_is_mbm_total_enabled()) {
+		rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
+		rr.val = 0;
+		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+		if (IS_ERR(rr.arch_mon_ctx)) {
+			pr_warn_ratelimited("Failed to allocate monitor context: %ld",
+					    PTR_ERR(rr.arch_mon_ctx));
+			return;
+		}
+
+		__mon_event_count(closid, rmid, &rr);
+
+		resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx);
+	}
+	if (resctrl_arch_is_mbm_local_enabled()) {
+		rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
+		rr.val = 0;
+		rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid);
+		if (IS_ERR(rr.arch_mon_ctx)) {
+			pr_warn_ratelimited("Failed to allocate monitor context: %ld",
+					    PTR_ERR(rr.arch_mon_ctx));
+			return;
+		}
+
+		__mon_event_count(closid, rmid, &rr);
+
+		/*
+		 * Call the MBA software controller only for the
+		 * control groups and when user has enabled
+		 * the software controller explicitly.
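+		 * The software controller is enabled by mounting resctrl
+		 * with the mba_MBps option (enable_mba_mbps in
+		 * struct rdt_fs_context); is_mba_sc(NULL) checks the live
+		 * MBA resource.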
+ */ + if (is_mba_sc(NULL)) + mbm_bw_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } +} + +/* + * Handler to scan the limbo list and move the RMIDs + * to free list whose occupancy < threshold_occupancy. + */ +void cqm_handle_limbo(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); + struct rdt_domain *d; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + d = container_of(work, struct rdt_domain, cqm_limbo.work); + + __check_limbo(d, false); + + if (has_busy_rmid(d)) { + d->cqm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, + delay); + } + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this + * domain. + * @dom: The domain the limbo handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); + dom->cqm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); +} + +void mbm_handle_overflow(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); + struct rdtgroup *prgrp, *crgrp; + struct list_head *head; + struct rdt_resource *r; + struct rdt_domain *d; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* + * If the filesystem has been unmounted this work no longer needs to + * run. + */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + goto out_unlock; + + r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + d = container_of(work, struct rdt_domain, mbm_over.work); + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + mbm_update(r, d, prgrp->closid, prgrp->mon.rmid); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) + mbm_update(r, d, crgrp->closid, crgrp->mon.rmid); + + if (is_mba_sc(NULL)) + update_mba_bw(prgrp, d); + } + + /* + * Re-check for housekeeping CPUs. This allows the overflow handler to + * move off a nohz_full CPU quickly. + */ + d->mbm_work_cpu = cpumask_any_housekeeping(&d->cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this + * domain. + * @dom: The domain the overflow handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + /* + * When a domain comes online there is no guarantee the filesystem is + * mounted. If not, there is no need to catch counter overflow. 
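The limbo handler above re-arms itself while busy RMIDs remain, and the overflow handler does the same once per second. A minimal sketch of that self-rearming delayed-work pattern (hypothetical work item and condition, kernel context assumed; not part of the patch):

  #include <linux/workqueue.h>
  #include <linux/jiffies.h>

  static void example_fn(struct work_struct *work);
  static DECLARE_DELAYED_WORK(example_work, example_fn);

  static void example_fn(struct work_struct *work)
  {
  	bool still_busy = false;	/* stand-in for has_busy_rmid() */

  	/* Re-queue ourselves until the condition clears */
  	if (still_busy)
  		schedule_delayed_work(&example_work, msecs_to_jiffies(1000));
  }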
+ */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + return; + cpu = cpumask_any_housekeeping(&dom->cpu_mask, exclude_cpu); + dom->mbm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->mbm_over, delay); +} + +static int dom_data_init(struct rdt_resource *r) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rmid_entry *entry = NULL; + int err = 0, i; + u32 idx; + + mutex_lock(&rdtgroup_mutex); + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + u32 *tmp; + + /* + * If the architecture hasn't provided a sanitised value here, + * this may result in larger arrays than necessary. Resctrl will + * use a smaller system wide value based on the resources in + * use. + */ + tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL); + if (!tmp) { + err = -ENOMEM; + goto out_unlock; + } + + closid_num_dirty_rmid = tmp; + } + + rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); + if (!rmid_ptrs) { + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + err = -ENOMEM; + goto out_unlock; + } + + for (i = 0; i < idx_limit; i++) { + entry = &rmid_ptrs[i]; + INIT_LIST_HEAD(&entry->list); + + resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid); + list_add_tail(&entry->list, &rmid_free_lru); + } + + /* + * RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and + * are always allocated. These are used for the rdtgroup_default + * control group, which will be setup later in rdtgroup_init(). + */ + idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID); + entry = __rmid_entry(idx); + list_del(&entry->list); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; +} + +static void dom_data_exit(void) +{ + mutex_lock(&rdtgroup_mutex); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + + kfree(rmid_ptrs); + rmid_ptrs = NULL; + + mutex_unlock(&rdtgroup_mutex); +} + +static struct mon_evt llc_occupancy_event = { + .name = "llc_occupancy", + .evtid = QOS_L3_OCCUP_EVENT_ID, +}; + +static struct mon_evt mbm_total_event = { + .name = "mbm_total_bytes", + .evtid = QOS_L3_MBM_TOTAL_EVENT_ID, +}; + +static struct mon_evt mbm_local_event = { + .name = "mbm_local_bytes", + .evtid = QOS_L3_MBM_LOCAL_EVENT_ID, +}; + +/* + * Initialize the event list for the resource. + * + * Note that MBM events are also part of RDT_RESOURCE_L3 resource + * because as per the SDM the total and local memory bandwidth + * are enumerated as part of L3 monitoring. 
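dom_data_init() above sizes rmid_ptrs[] by the architecture's index space. On x86 the index is effectively just the RMID; on MPAM-style hardware, where monitors depend on the partition, an encoding might pack (closid, pmg) together. An illustrative sketch only (the constant and helper names are hypothetical, not the series' actual implementation):

  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative: assume 4 PMG values per PARTID, as MPAM-like hw might */
  #define EXAMPLE_NUM_PMG	4

  static uint32_t example_idx_encode(uint32_t closid, uint32_t rmid)
  {
  	return closid * EXAMPLE_NUM_PMG + rmid;
  }

  static void example_idx_decode(uint32_t idx, uint32_t *closid, uint32_t *rmid)
  {
  	*closid = idx / EXAMPLE_NUM_PMG;
  	*rmid = idx % EXAMPLE_NUM_PMG;
  }

  int main(void)
  {
  	uint32_t closid, rmid;

  	example_idx_decode(example_idx_encode(5, 3), &closid, &rmid);
  	printf("closid=%u rmid=%u\n", closid, rmid);	/* 5 3 */
  	return 0;
  }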
+ */ +static void l3_mon_evt_init(struct rdt_resource *r) +{ + INIT_LIST_HEAD(&r->evt_list); + + if (resctrl_arch_is_llc_occupancy_enabled()) + list_add_tail(&llc_occupancy_event.list, &r->evt_list); + if (resctrl_arch_is_mbm_total_enabled()) + list_add_tail(&mbm_total_event.list, &r->evt_list); + if (resctrl_arch_is_mbm_local_enabled()) + list_add_tail(&mbm_local_event.list, &r->evt_list); +} + +int resctrl_mon_resource_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + int ret; + + ret = dom_data_init(r); + if (ret) + return ret; + + if (!r->mon_capable) + return 0; + + l3_mon_evt_init(r); + + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { + mbm_total_event.configurable = true; + mbm_config_rftype_init("mbm_total_bytes_config"); + } + if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { + mbm_local_event.configurable = true; + mbm_config_rftype_init("mbm_local_bytes_config"); + } + + return 0; +} + +void resctrl_mon_resource_exit(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + + if (!r->mon_capable) + return; + + dom_data_exit(); +} diff --git a/fs/resctrl/psuedo_lock.c b/fs/resctrl/psuedo_lock.c index e69de29bb2d1..2fa42d0a33ea 100644 --- a/fs/resctrl/psuedo_lock.c +++ b/fs/resctrl/psuedo_lock.c @@ -0,0 +1,1134 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Resource Director Technology (RDT) + * + * Pseudo-locking support built on top of Cache Allocation Technology (CAT) + * + * Copyright (C) 2018 Intel Corporation + * + * Author: Reinette Chatre <reinette.chatre@intel.com> + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/cacheinfo.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/debugfs.h> +#include <linux/kthread.h> +#include <linux/mman.h> +#include <linux/perf_event.h> +#include <linux/pm_qos.h> +#include <linux/slab.h> +#include <linux/uaccess.h> + +#include <asm/cacheflush.h> +#include <asm/resctrl.h> +#include <asm/perf_event.h> + +#include "internal.h" + +/* + * Major number assigned to and shared by all devices exposing + * pseudo-locked regions. + */ +static unsigned int pseudo_lock_major; +static unsigned long pseudo_lock_minor_avail = GENMASK(MINORBITS, 0); + +static char *pseudo_lock_devnode(const struct device *dev, umode_t *mode) +{ + const struct rdtgroup *rdtgrp; + + rdtgrp = dev_get_drvdata(dev); + if (mode) + *mode = 0600; + return kasprintf(GFP_KERNEL, "pseudo_lock/%s", rdtgrp->kn->name); +} + +static const struct class pseudo_lock_class = { + .name = "pseudo_lock", + .devnode = pseudo_lock_devnode, +}; + +/** + * pseudo_lock_minor_get - Obtain available minor number + * @minor: Pointer to where new minor number will be stored + * + * A bitmask is used to track available minor numbers. Here the next free + * minor number is marked as unavailable and returned. + * + * Return: 0 on success, <0 on failure. 
+ */
+static int pseudo_lock_minor_get(unsigned int *minor)
+{
+	unsigned long first_bit;
+
+	first_bit = find_first_bit(&pseudo_lock_minor_avail, MINORBITS);
+
+	if (first_bit == MINORBITS)
+		return -ENOSPC;
+
+	__clear_bit(first_bit, &pseudo_lock_minor_avail);
+	*minor = first_bit;
+
+	return 0;
+}
+
+/**
+ * pseudo_lock_minor_release - Return minor number to available
+ * @minor: The minor number made available
+ */
+static void pseudo_lock_minor_release(unsigned int minor)
+{
+	__set_bit(minor, &pseudo_lock_minor_avail);
+}
+
+/**
+ * region_find_by_minor - Locate a pseudo-lock region by inode minor number
+ * @minor: The minor number of the device representing pseudo-locked region
+ *
+ * When the character device is accessed we need to determine which
+ * pseudo-locked region it belongs to. This is done by matching the minor
+ * number of the device to the pseudo-locked region it belongs to.
+ *
+ * Minor numbers are assigned at the time a pseudo-locked region is associated
+ * with a cache instance.
+ *
+ * Return: On success return pointer to resource group owning the pseudo-locked
+ * region, NULL on failure.
+ */
+static struct rdtgroup *region_find_by_minor(unsigned int minor)
+{
+	struct rdtgroup *rdtgrp, *rdtgrp_match = NULL;
+
+	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
+		if (rdtgrp->plr && rdtgrp->plr->minor == minor) {
+			rdtgrp_match = rdtgrp;
+			break;
+		}
+	}
+	return rdtgrp_match;
+}
+
+/**
+ * struct pseudo_lock_pm_req - A power management QoS request list entry
+ * @list:	Entry within the @pm_reqs list for a pseudo-locked region
+ * @req:	PM QoS request
+ */
+struct pseudo_lock_pm_req {
+	struct list_head list;
+	struct dev_pm_qos_request req;
+};
+
+static void pseudo_lock_cstates_relax(struct pseudo_lock_region *plr)
+{
+	struct pseudo_lock_pm_req *pm_req, *next;
+
+	list_for_each_entry_safe(pm_req, next, &plr->pm_reqs, list) {
+		dev_pm_qos_remove_request(&pm_req->req);
+		list_del(&pm_req->list);
+		kfree(pm_req);
+	}
+}
+
+/**
+ * pseudo_lock_cstates_constrain - Restrict cores from entering C6
+ * @plr: Pseudo-locked region
+ *
+ * To prevent the cache from being affected by power management, entering
+ * C6 has to be avoided. This is accomplished by requesting a latency
+ * requirement lower than the lowest C6 exit latency of all supported
+ * platforms as found in the cpuidle state tables in the intel_idle driver.
+ * At this time it is possible to do so with a single latency requirement
+ * for all supported platforms.
+ *
+ * Since Goldmont is supported, which is affected by X86_BUG_MONITOR,
+ * the ACPI latencies need to be considered while keeping in mind that C2
+ * may be set to map to deeper sleep states. In this case the latency
+ * requirement needs to prevent entering C2 also.
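The minor-number allocator earlier in this file is a simple find-first-bit bitmap. A toy userspace equivalent (8-bit mask, purely illustrative):

  #include <stdio.h>

  int main(void)
  {
  	/* 8-bit toy version of pseudo_lock_minor_avail: 1 = free */
  	unsigned int avail = 0xff;

  	/* Allocate: take the lowest set bit, mark it busy */
  	int minor = __builtin_ctz(avail);
  	avail &= ~(1u << minor);
  	printf("allocated minor %d, mask now %#x\n", minor, avail);

  	/* Release: set the bit again */
  	avail |= 1u << minor;
  	printf("released, mask now %#x\n", avail);
  	return 0;
  }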
+ * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr) +{ + struct pseudo_lock_pm_req *pm_req; + int cpu; + int ret; + + for_each_cpu(cpu, &plr->d->cpu_mask) { + pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL); + if (!pm_req) { + rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n"); + ret = -ENOMEM; + goto out_err; + } + ret = dev_pm_qos_add_request(get_cpu_device(cpu), + &pm_req->req, + DEV_PM_QOS_RESUME_LATENCY, + 30); + if (ret < 0) { + rdt_last_cmd_printf("Failed to add latency req CPU%d\n", + cpu); + kfree(pm_req); + ret = -1; + goto out_err; + } + list_add(&pm_req->list, &plr->pm_reqs); + } + + return 0; + +out_err: + pseudo_lock_cstates_relax(plr); + return ret; +} + +/** + * pseudo_lock_region_clear - Reset pseudo-lock region data + * @plr: pseudo-lock region + * + * All content of the pseudo-locked region is reset - any memory allocated + * freed. + * + * Return: void + */ +static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) +{ + plr->size = 0; + plr->line_size = 0; + kfree(plr->kmem); + plr->kmem = NULL; + plr->s = NULL; + if (plr->d) + plr->d->plr = NULL; + plr->d = NULL; + plr->cbm = 0; + plr->debugfs_dir = NULL; +} + +/** + * pseudo_lock_region_init - Initialize pseudo-lock region information + * @plr: pseudo-lock region + * + * Called after user provided a schemata to be pseudo-locked. From the + * schemata the &struct pseudo_lock_region is on entry already initialized + * with the resource, domain, and capacity bitmask. Here the information + * required for pseudo-locking is deduced from this data and &struct + * pseudo_lock_region initialized further. This information includes: + * - size in bytes of the region to be pseudo-locked + * - cache line size to know the stride with which data needs to be accessed + * to be pseudo-locked + * - a cpu associated with the cache instance on which the pseudo-locking + * flow can be executed + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_init(struct pseudo_lock_region *plr) +{ + struct cpu_cacheinfo *ci; + int ret; + int i; + + /* Pick the first cpu we find that is associated with the cache. */ + plr->cpu = cpumask_first(&plr->d->cpu_mask); + + if (!cpu_online(plr->cpu)) { + rdt_last_cmd_printf("CPU %u associated with cache not online\n", + plr->cpu); + ret = -ENODEV; + goto out_region; + } + + ci = get_cpu_cacheinfo(plr->cpu); + + plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); + + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == plr->s->res->cache_level) { + plr->line_size = ci->info_list[i].coherency_line_size; + return 0; + } + } + + ret = -1; + rdt_last_cmd_puts("Unable to determine cache line size\n"); +out_region: + pseudo_lock_region_clear(plr); + return ret; +} + +/** + * pseudo_lock_init - Initialize a pseudo-lock region + * @rdtgrp: resource group to which new pseudo-locked region will belong + * + * A pseudo-locked region is associated with a resource group. When this + * association is created the pseudo-locked region is initialized. The + * details of the pseudo-locked region are not known at this time so only + * allocation is done and association established. 
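A stripped-down sketch of the per-CPU resume-latency constraint that pseudo_lock_cstates_constrain() above builds per CPU (single request, error handling omitted, kernel context assumed; the helper names here are hypothetical):

  #include <linux/pm_qos.h>
  #include <linux/cpu.h>

  static struct dev_pm_qos_request example_req;	/* hypothetical */

  static int example_constrain(int cpu)
  {
  	/* 30us resume latency: shallow enough to keep the CPU out of C6 */
  	return dev_pm_qos_add_request(get_cpu_device(cpu), &example_req,
  				      DEV_PM_QOS_RESUME_LATENCY, 30);
  }

  static void example_relax(void)
  {
  	dev_pm_qos_remove_request(&example_req);
  }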
+ * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_init(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr; + + plr = kzalloc(sizeof(*plr), GFP_KERNEL); + if (!plr) + return -ENOMEM; + + init_waitqueue_head(&plr->lock_thread_wq); + INIT_LIST_HEAD(&plr->pm_reqs); + rdtgrp->plr = plr; + return 0; +} + +/** + * pseudo_lock_region_alloc - Allocate kernel memory that will be pseudo-locked + * @plr: pseudo-lock region + * + * Initialize the details required to set up the pseudo-locked region and + * allocate the contiguous memory that will be pseudo-locked to the cache. + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr) +{ + int ret; + + ret = pseudo_lock_region_init(plr); + if (ret < 0) + return ret; + + /* + * We do not yet support contiguous regions larger than + * KMALLOC_MAX_SIZE. + */ + if (plr->size > KMALLOC_MAX_SIZE) { + rdt_last_cmd_puts("Requested region exceeds maximum size\n"); + ret = -E2BIG; + goto out_region; + } + + plr->kmem = kzalloc(plr->size, GFP_KERNEL); + if (!plr->kmem) { + rdt_last_cmd_puts("Unable to allocate memory\n"); + ret = -ENOMEM; + goto out_region; + } + + ret = 0; + goto out; +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * pseudo_lock_free - Free a pseudo-locked region + * @rdtgrp: resource group to which pseudo-locked region belonged + * + * The pseudo-locked region's resources have already been released, or not + * yet created at this point. Now it can be freed and disassociated from the + * resource group. + * + * Return: void + */ +static void pseudo_lock_free(struct rdtgroup *rdtgrp) +{ + pseudo_lock_region_clear(rdtgrp->plr); + kfree(rdtgrp->plr); + rdtgrp->plr = NULL; +} + +/** + * rdtgroup_monitor_in_progress - Test if monitoring in progress + * @rdtgrp: resource group being queried + * + * Return: 1 if monitor groups have been created for this resource + * group, 0 otherwise. + */ +static int rdtgroup_monitor_in_progress(struct rdtgroup *rdtgrp) +{ + return !list_empty(&rdtgrp->mon.crdtgrp_list); +} + +/** + * rdtgroup_locksetup_user_restrict - Restrict user access to group + * @rdtgrp: resource group needing access restricted + * + * A resource group used for cache pseudo-locking cannot have cpus or tasks + * assigned to it. This is communicated to the user by restricting access + * to all the files that can be used to make such changes. + * + * Permissions restored with rdtgroup_locksetup_user_restore() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restriction of access an attempt will be made to restore permissions but + * the state of the mode of these files will be uncertain when a failure + * occurs. 
+ */ +static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups"); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); +err_cpus: + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); +err_tasks: + rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); +out: + return ret; +} + +/** + * rdtgroup_locksetup_user_restore - Restore user access to group + * @rdtgrp: resource group needing access restored + * + * Restore all file access previously removed using + * rdtgroup_locksetup_user_restrict() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restoration of access an attempt will be made to restrict permissions + * again but the state of the mode of these files will be uncertain when + * a failure occurs. + */ +static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); +err_cpus: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); +err_tasks: + rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); +out: + return ret; +} + +/** + * rdtgroup_locksetup_enter - Resource group enters locksetup mode + * @rdtgrp: resource group requested to enter locksetup mode + * + * A resource group enters locksetup mode to reflect that it would be used + * to represent a pseudo-locked region and is in the process of being set + * up to do so. A resource group used for a pseudo-locked region would + * lose the closid associated with it so we cannot allow it to have any + * tasks or cpus assigned nor permit tasks or cpus to be assigned in the + * future. Monitoring of a pseudo-locked region is not allowed either. + * + * The above and more restrictions on a pseudo-locked region are checked + * for and enforced before the resource group enters the locksetup mode. + * + * Returns: 0 if the resource group successfully entered locksetup mode, <0 + * on failure. On failure the last_cmd_status buffer is updated with text to + * communicate details of failure to the user. + */ +int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) +{ + int ret; + + /* + * The default resource group can neither be removed nor lose the + * default closid associated with it. + */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Cannot pseudo-lock default group\n"); + return -EINVAL; + } + + /* + * Cache Pseudo-locking not supported when CDP is enabled. + * + * Some things to consider if you would like to enable this + * support (using L3 CDP as example): + * - When CDP is enabled two separate resources are exposed, + * L3DATA and L3CODE, but they are actually on the same cache. 
+	 *   The implication for pseudo-locking is that if a
+	 *   pseudo-locked region is created on a domain of one
+	 *   resource (eg. L3CODE), then a pseudo-locked region cannot
+	 *   be created on that same domain of the other resource
+	 *   (eg. L3DATA). This is because the creation of a
+	 *   pseudo-locked region involves a call to wbinvd that will
+	 *   affect all cache allocations on that domain.
+	 * - Considering the previous, it may be possible to only
+	 *   expose one of the CDP resources to pseudo-locking and
+	 *   hide the other. For example, we could consider only
+	 *   exposing L3DATA and since the L3 cache is unified it is
+	 *   still possible to place instructions there and execute them.
+	 * - If only one region is exposed to pseudo-locking we should
+	 *   still keep in mind that availability of a portion of cache
+	 *   for pseudo-locking should take into account both resources.
+	 *   Similarly, if a pseudo-locked region is created in one
+	 *   resource, the portion of cache used by it should be made
+	 *   unavailable to all future allocations from both resources.
+	 */
+	if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3) ||
+	    resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) {
+		rdt_last_cmd_puts("CDP enabled\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Not knowing the bits to disable prefetching implies that this
+	 * platform does not support Cache Pseudo-Locking.
+	 */
+	if (resctrl_arch_get_prefetch_disable_bits() == 0) {
+		rdt_last_cmd_puts("Pseudo-locking not supported\n");
+		return -EINVAL;
+	}
+
+	if (rdtgroup_monitor_in_progress(rdtgrp)) {
+		rdt_last_cmd_puts("Monitoring in progress\n");
+		return -EINVAL;
+	}
+
+	if (rdtgroup_tasks_assigned(rdtgrp)) {
+		rdt_last_cmd_puts("Tasks assigned to resource group\n");
+		return -EINVAL;
+	}
+
+	if (!cpumask_empty(&rdtgrp->cpu_mask)) {
+		rdt_last_cmd_puts("CPUs assigned to resource group\n");
+		return -EINVAL;
+	}
+
+	if (rdtgroup_locksetup_user_restrict(rdtgrp)) {
+		rdt_last_cmd_puts("Unable to modify resctrl permissions\n");
+		return -EIO;
+	}
+
+	ret = pseudo_lock_init(rdtgrp);
+	if (ret) {
+		rdt_last_cmd_puts("Unable to init pseudo-lock region\n");
+		goto out_release;
+	}
+
+	/*
+	 * If this system is capable of monitoring, an rmid would have been
+	 * allocated when the control group was created. This is not needed
+	 * anymore once this group is used for pseudo-locking. This is safe
+	 * to call on platforms not capable of monitoring.
+	 */
+	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+
+	ret = 0;
+	goto out;
+
+out_release:
+	rdtgroup_locksetup_user_restore(rdtgrp);
+out:
+	return ret;
+}
+
+/**
+ * rdtgroup_locksetup_exit - Resource group exits locksetup mode
+ * @rdtgrp: resource group
+ *
+ * When a resource group exits locksetup mode the earlier restrictions are
+ * lifted.
+ *
+ * Return: 0 on success, <0 on failure
+ */
+int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp)
+{
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK))
+		return -EOPNOTSUPP;
+
+	if (resctrl_arch_mon_capable()) {
+		ret = alloc_rmid(rdtgrp->closid);
+		if (ret < 0) {
+			rdt_last_cmd_puts("Out of RMIDs\n");
+			return ret;
+		}
+		rdtgrp->mon.rmid = ret;
+	}
+
+	ret = rdtgroup_locksetup_user_restore(rdtgrp);
+	if (ret) {
+		free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+		return ret;
+	}
+
+	pseudo_lock_free(rdtgrp);
+	return 0;
+}
+
+/**
+ * rdtgroup_cbm_overlaps_pseudo_locked - Test if CBM or portion is pseudo-locked
+ * @d: RDT domain
+ * @cbm: CBM to test
+ *
+ * @d represents a cache instance and @cbm a capacity bitmask that is
+ * considered for it. Determine if @cbm overlaps with any existing
+ * pseudo-locked region on @d.
+ *
+ * @cbm is unsigned long, even if only 32 bits are used, to make the
+ * bitmap functions work correctly.
+ *
+ * Return: true if @cbm overlaps with pseudo-locked region on @d, false
+ * otherwise.
+ */
+bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_domain *d, unsigned long cbm)
+{
+	unsigned int cbm_len;
+	unsigned long cbm_b;
+
+	if (d->plr) {
+		cbm_len = d->plr->s->res->cache.cbm_len;
+		cbm_b = d->plr->cbm;
+		if (bitmap_intersects(&cbm, &cbm_b, cbm_len))
+			return true;
+	}
+	return false;
+}
+
+/**
+ * rdtgroup_pseudo_locked_in_hierarchy - Pseudo-locked region in cache hierarchy
+ * @d: RDT domain under test
+ *
+ * The setup of a pseudo-locked region affects all cache instances within
+ * the hierarchy of the region. It is thus essential to know if any
+ * pseudo-locked regions exist within a cache hierarchy to prevent any
+ * attempts to create new pseudo-locked regions in the same hierarchy.
+ *
+ * Return: true if a pseudo-locked region exists in the hierarchy of @d or
+ *         if it is not possible to test due to memory allocation issue,
+ *         false otherwise.
+ */
+bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
+{
+	cpumask_var_t cpu_with_psl;
+	enum resctrl_res_level i;
+	struct rdt_resource *r;
+	struct rdt_domain *d_i;
+	bool ret = false;
+
+	/* Walking r->domains, ensure it can't race with cpuhp */
+	lockdep_assert_cpus_held();
+
+	/*
+	 * No pseudo-locked regions can exist if pseudo-locking is not
+	 * supported. This function returns bool, not an errno.
+	 */
+	if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK))
+		return false;
+
+	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
+		return true;
+
+	/*
+	 * First determine which cpus have pseudo-locked regions
+	 * associated with them.
+	 */
+	for (i = 0; i < RDT_NUM_RESOURCES; i++) {
+		r = resctrl_arch_get_resource(i);
+		if (!r->alloc_capable)
+			continue;
+
+		list_for_each_entry(d_i, &r->domains, list) {
+			if (d_i->plr)
+				cpumask_or(cpu_with_psl, cpu_with_psl,
+					   &d_i->cpu_mask);
+		}
+	}
+
+	/*
+	 * Next test if new pseudo-locked region would intersect with
+	 * existing region.
+	 */
+	if (cpumask_intersects(&d->cpu_mask, cpu_with_psl))
+		ret = true;
+
+	free_cpumask_var(cpu_with_psl);
+	return ret;
+}
+
+/**
+ * pseudo_lock_measure_cycles - Trigger latency measure to pseudo-locked region
+ * @rdtgrp: Resource group to which the pseudo-locked region belongs.
+ * @sel: Selector of which measurement to perform on a pseudo-locked region.
+ *
+ * The measurement of latency to access a pseudo-locked region should be
+ * done from a cpu that is associated with that pseudo-locked region.
+ * Determine which cpu is associated with this region and start a thread on
+ * that cpu to perform the measurement, wait for that thread to complete.
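A toy illustration of the overlap test in rdtgroup_cbm_overlaps_pseudo_locked() (a plain AND on small made-up masks; the kernel code uses bitmap_intersects() so CBMs wider than a word also work):

  #include <stdio.h>

  int main(void)
  {
  	/* Toy CBMs: an existing pseudo-locked region owns bits 0-3 */
  	unsigned long plr_cbm = 0x00f;
  	unsigned long try_a = 0x0f0;	/* bits 4-7: no overlap */
  	unsigned long try_b = 0x01e;	/* bits 1-4: overlaps bits 1-3 */

  	printf("try_a overlaps: %d\n", (try_a & plr_cbm) != 0);	/* 0 */
  	printf("try_b overlaps: %d\n", (try_b & plr_cbm) != 0);	/* 1 */
  	return 0;
  }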
+ * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int cpu; + int ret = -1; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out; + } + + if (!plr->d) { + ret = -ENODEV; + goto out; + } + + plr->thread_done = 0; + cpu = cpumask_first(&plr->d->cpu_mask); + if (!cpu_online(cpu)) { + ret = -ENODEV; + goto out; + } + + plr->cpu = cpu; + + if (sel == 1) + thread = kthread_create_on_node(resctrl_arch_measure_cycles_lat_fn, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 2) + thread = kthread_create_on_node(resctrl_arch_measure_l2_residency, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 3) + thread = kthread_create_on_node(resctrl_arch_measure_l3_residency, + plr, cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else + goto out; + + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + goto out; + } + kthread_bind(thread, cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) + goto out; + + ret = 0; + +out: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +static ssize_t pseudo_lock_measure_trigger(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct rdtgroup *rdtgrp = file->private_data; + size_t buf_size; + char buf[32]; + int ret; + int sel; + + buf_size = min(count, (sizeof(buf) - 1)); + if (copy_from_user(buf, user_buf, buf_size)) + return -EFAULT; + + buf[buf_size] = '\0'; + ret = kstrtoint(buf, 10, &sel); + if (ret == 0) { + if (sel != 1 && sel != 2 && sel != 3) + return -EINVAL; + ret = debugfs_file_get(file->f_path.dentry); + if (ret) + return ret; + ret = pseudo_lock_measure_cycles(rdtgrp, sel); + if (ret == 0) + ret = count; + debugfs_file_put(file->f_path.dentry); + } + + return ret; +} + +static const struct file_operations pseudo_measure_fops = { + .write = pseudo_lock_measure_trigger, + .open = simple_open, + .llseek = default_llseek, +}; + +/** + * rdtgroup_pseudo_lock_create - Create a pseudo-locked region + * @rdtgrp: resource group to which pseudo-lock region belongs + * + * Called when a resource group in the pseudo-locksetup mode receives a + * valid schemata that should be pseudo-locked. Since the resource group is + * in pseudo-locksetup mode the &struct pseudo_lock_region has already been + * allocated and initialized with the essential information. If a failure + * occurs the resource group remains in the pseudo-locksetup mode with the + * &struct pseudo_lock_region associated with it, but cleared from all + * information and ready for the user to re-attempt pseudo-locking by + * writing the schemata again. + * + * Return: 0 if the pseudo-locked region was successfully pseudo-locked, <0 + * on failure. Descriptive error will be written to last_cmd_status buffer. 
+ */ +int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int new_minor; + struct device *dev; + int ret; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return -EOPNOTSUPP; + + ret = pseudo_lock_region_alloc(plr); + if (ret < 0) + return ret; + + ret = pseudo_lock_cstates_constrain(plr); + if (ret < 0) { + ret = -EINVAL; + goto out_region; + } + + plr->thread_done = 0; + + plr->closid = rdtgrp->closid; + thread = kthread_create_on_node(resctrl_arch_pseudo_lock_fn, plr, + cpu_to_node(plr->cpu), + "pseudo_lock/%u", plr->cpu); + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + rdt_last_cmd_printf("Locking thread returned error %d\n", ret); + goto out_cstates; + } + + kthread_bind(thread, plr->cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) { + /* + * If the thread does not get on the CPU for whatever + * reason and the process which sets up the region is + * interrupted then this will leave the thread in runnable + * state and once it gets on the CPU it will dereference + * the cleared, but not freed, plr struct resulting in an + * empty pseudo-locking loop. + */ + rdt_last_cmd_puts("Locking thread interrupted\n"); + goto out_cstates; + } + + ret = pseudo_lock_minor_get(&new_minor); + if (ret < 0) { + rdt_last_cmd_puts("Unable to obtain a new minor number\n"); + goto out_cstates; + } + + /* + * Unlock access but do not release the reference. The + * pseudo-locked region will still be here on return. + * + * The mutex has to be released temporarily to avoid a potential + * deadlock with the mm->mmap_lock which is obtained in the + * device_create() and debugfs_create_dir() callpath below as well as + * before the mmap() callback is called. + */ + mutex_unlock(&rdtgroup_mutex); + + if (!IS_ERR_OR_NULL(debugfs_resctrl)) { + plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name, + debugfs_resctrl); + if (!IS_ERR_OR_NULL(plr->debugfs_dir)) + debugfs_create_file("pseudo_lock_measure", 0200, + plr->debugfs_dir, rdtgrp, + &pseudo_measure_fops); + } + + dev = device_create(&pseudo_lock_class, NULL, + MKDEV(pseudo_lock_major, new_minor), + rdtgrp, "%s", rdtgrp->kn->name); + + mutex_lock(&rdtgroup_mutex); + + if (IS_ERR(dev)) { + ret = PTR_ERR(dev); + rdt_last_cmd_printf("Failed to create character device: %d\n", + ret); + goto out_debugfs; + } + + /* We released the mutex - check if group was removed while we did so */ + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out_device; + } + + plr->minor = new_minor; + + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKED; + closid_free(rdtgrp->closid); + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0444); + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0444); + + ret = 0; + goto out; + +out_device: + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, new_minor)); +out_debugfs: + debugfs_remove_recursive(plr->debugfs_dir); + pseudo_lock_minor_release(new_minor); +out_cstates: + pseudo_lock_cstates_relax(plr); +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * rdtgroup_pseudo_lock_remove - Remove a pseudo-locked region + * @rdtgrp: resource group to which the pseudo-locked region belongs + * + * The removal of a pseudo-locked region can be initiated when the resource + * group is removed from user space via a "rmdir" from userspace or the + * unmount of the resctrl filesystem. 
On removal the resource group does + * not go back to pseudo-locksetup mode before it is removed, instead it is + * removed directly. There is thus asymmetry with the creation where the + * &struct pseudo_lock_region is removed here while it was not created in + * rdtgroup_pseudo_lock_create(). + * + * Return: void + */ +void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + + if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + return; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + /* + * Default group cannot be a pseudo-locked region so we can + * free closid here. + */ + closid_free(rdtgrp->closid); + goto free; + } + + pseudo_lock_cstates_relax(plr); + debugfs_remove_recursive(rdtgrp->plr->debugfs_dir); + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor)); + pseudo_lock_minor_release(plr->minor); + +free: + pseudo_lock_free(rdtgrp); +} + +static int pseudo_lock_dev_open(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = region_find_by_minor(iminor(inode)); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + filp->private_data = rdtgrp; + atomic_inc(&rdtgrp->waitcount); + /* Perform a non-seekable open - llseek is not supported */ + filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); + + mutex_unlock(&rdtgroup_mutex); + + return 0; +} + +static int pseudo_lock_dev_release(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + filp->private_data = NULL; + atomic_dec(&rdtgrp->waitcount); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int pseudo_lock_dev_mremap(struct vm_area_struct *area) +{ + /* Not supported */ + return -EINVAL; +} + +static const struct vm_operations_struct pseudo_mmap_ops = { + .mremap = pseudo_lock_dev_mremap, +}; + +static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long vsize = vma->vm_end - vma->vm_start; + unsigned long off = vma->vm_pgoff << PAGE_SHIFT; + struct pseudo_lock_region *plr; + struct rdtgroup *rdtgrp; + unsigned long physical; + unsigned long psize; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + plr = rdtgrp->plr; + + if (!plr->d) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + /* + * Task is required to run with affinity to the cpus associated + * with the pseudo-locked region. If this is not the case the task + * may be scheduled elsewhere and invalidate entries in the + * pseudo-locked region. + */ + if (!cpumask_subset(current->cpus_ptr, &plr->d->cpu_mask)) { + mutex_unlock(&rdtgroup_mutex); + return -EINVAL; + } + + physical = __pa(plr->kmem) >> PAGE_SHIFT; + psize = plr->size - off; + + if (off > plr->size) { + mutex_unlock(&rdtgroup_mutex); + return -ENOSPC; + } + + /* + * Ensure changes are carried directly to the memory being mapped, + * do not allow copy-on-write mapping. 
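A compilable sketch of the offset/size validation that pseudo_lock_dev_mmap() above performs (made-up region and request sizes, userspace arithmetic only):

  #include <stdio.h>

  int main(void)
  {
  	/* Hypothetical 256KiB pseudo-locked region, 4KiB pages */
  	unsigned long size = 256 * 1024;
  	unsigned long pgoff = 16;		/* vm_pgoff from the vma */
  	unsigned long off = pgoff << 12;	/* 64KiB into the region */
  	unsigned long vsize = 512 * 1024;	/* requested mapping size */
  	unsigned long psize = size - off;	/* 192KiB remain past off */

  	if (off > size || vsize > psize)
  		printf("reject: only %lu bytes past offset %lu\n", psize, off);
  	return 0;
  }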
+ */
+	if (!(vma->vm_flags & VM_SHARED)) {
+		mutex_unlock(&rdtgroup_mutex);
+		return -EINVAL;
+	}
+
+	if (vsize > psize) {
+		mutex_unlock(&rdtgroup_mutex);
+		return -ENOSPC;
+	}
+
+	memset(plr->kmem + off, 0, vsize);
+
+	if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff,
+			    vsize, vma->vm_page_prot)) {
+		mutex_unlock(&rdtgroup_mutex);
+		return -EAGAIN;
+	}
+	vma->vm_ops = &pseudo_mmap_ops;
+	mutex_unlock(&rdtgroup_mutex);
+	return 0;
+}
+
+static const struct file_operations pseudo_lock_dev_fops = {
+	.owner =	THIS_MODULE,
+	.llseek =	no_llseek,
+	.read =		NULL,
+	.write =	NULL,
+	.open =		pseudo_lock_dev_open,
+	.release =	pseudo_lock_dev_release,
+	.mmap =		pseudo_lock_dev_mmap,
+};
+
+int rdt_pseudo_lock_init(void)
+{
+	int ret;
+
+	ret = register_chrdev(0, "pseudo_lock", &pseudo_lock_dev_fops);
+	if (ret < 0)
+		return ret;
+
+	pseudo_lock_major = ret;
+
+	ret = class_register(&pseudo_lock_class);
+	if (ret) {
+		unregister_chrdev(pseudo_lock_major, "pseudo_lock");
+		return ret;
+	}
+
+	return 0;
+}
+
+void rdt_pseudo_lock_release(void)
+{
+	class_unregister(&pseudo_lock_class);
+	unregister_chrdev(pseudo_lock_major, "pseudo_lock");
+	pseudo_lock_major = 0;
+}
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index e69de29bb2d1..427aa26dc006 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -0,0 +1,4002 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * User interface for Resource Allocation in Resource Director Technology (RDT)
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Author: Fenghua Yu <fenghua.yu@intel.com>
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/cacheinfo.h>
+#include <linux/cpu.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/fs_parser.h>
+#include <linux/sysfs.h>
+#include <linux/kernfs.h>
+#include <linux/seq_buf.h>
+#include <linux/seq_file.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
+#include <linux/task_work.h>
+#include <linux/user_namespace.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/resctrl.h>
+#include "internal.h"
+
+/* Mutex to protect rdtgroup access. */
+DEFINE_MUTEX(rdtgroup_mutex);
+
+static struct kernfs_root *rdt_root;
+struct rdtgroup rdtgroup_default;
+LIST_HEAD(rdt_all_groups);
+
+/* list of entries for the schemata file */
+LIST_HEAD(resctrl_schema_all);
+
+/* The filesystem can only be mounted once.
*/ +bool resctrl_mounted; + +/* Kernel fs node for "info" directory under root */ +static struct kernfs_node *kn_info; + +/* Kernel fs node for "mon_groups" directory under root */ +static struct kernfs_node *kn_mongrp; + +/* Kernel fs node for "mon_data" directory under root */ +static struct kernfs_node *kn_mondata; + +/* + * Used to store the max resource name width and max resource data width + * to display the schemata in a tabular format + */ +int max_name_width, max_data_width; + +static struct seq_buf last_cmd_status; +static char last_cmd_status_buf[512]; + +static int rdtgroup_setup_root(struct rdt_fs_context *ctx); +static void rdtgroup_destroy_root(void); + +struct dentry *debugfs_resctrl; + +static bool resctrl_debug; + +void rdt_last_cmd_clear(void) +{ + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_clear(&last_cmd_status); +} + +void rdt_last_cmd_puts(const char *s) +{ + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_puts(&last_cmd_status, s); +} + +void rdt_last_cmd_printf(const char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_vprintf(&last_cmd_status, fmt, ap); + va_end(ap); +} + +void rdt_staged_configs_clear(void) +{ + struct rdt_resource *r; + struct rdt_domain *dom; + int i; + + lockdep_assert_held(&rdtgroup_mutex); + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + list_for_each_entry(dom, &r->domains, list) + memset(dom->staged_config, 0, sizeof(dom->staged_config)); + } +} + +static bool resctrl_is_mbm_enabled(void) +{ + return (resctrl_arch_is_mbm_total_enabled() || + resctrl_arch_is_mbm_local_enabled()); +} + +static bool resctrl_is_mbm_event(int e) +{ + return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && + e <= QOS_L3_MBM_LOCAL_EVENT_ID); +} + +/* + * Trivial allocator for CLOSIDs. Since h/w only supports a small number, + * we can keep a bitmap of free CLOSIDs in a single integer. + * + * Using a global CLOSID across all resources has some advantages and + * some drawbacks: + * + We can simply set current's closid to assign a task to a resource + * group. + * + Context switch code can avoid extra memory references deciding which + * CLOSID to load into the PQR_ASSOC MSR + * - We give up some options in configuring resource groups across multi-socket + * systems. + * - Our choices on how to configure each resource become progressively more + * limited as the number of resources grows. 
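A toy userspace rendering of the closid_init()/closid_alloc() bitmap described above (16 CLOSIDs assumed here; closid 0 is the reserved default-group closid):

  #include <stdio.h>

  int main(void)
  {
  	/* Toy closid_init(): min of per-resource closids, say 16 */
  	unsigned int rdt_min_closid = 16;
  	unsigned long free_map = (1UL << rdt_min_closid) - 1;

  	free_map &= ~1UL;	/* closid 0 reserved for the default group */

  	/* Toy closid_alloc(): ffs() picks the lowest free closid */
  	int closid = __builtin_ffsl(free_map) - 1;
  	printf("allocated closid %d\n", closid);	/* 1 */
  	return 0;
  }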
+ */
+static unsigned long closid_free_map;
+static int closid_free_map_len;
+
+int closids_supported(void)
+{
+	return closid_free_map_len;
+}
+
+static void closid_init(void)
+{
+	struct resctrl_schema *s;
+	u32 rdt_min_closid = 32;
+
+	/* Compute rdt_min_closid across all resources */
+	list_for_each_entry(s, &resctrl_schema_all, list)
+		rdt_min_closid = min(rdt_min_closid, s->num_closid);
+
+	closid_free_map = BIT_MASK(rdt_min_closid) - 1;
+
+	/* RESCTRL_RESERVED_CLOSID is always reserved for the default group */
+	__clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map);
+	closid_free_map_len = rdt_min_closid;
+}
+
+static int closid_alloc(void)
+{
+	int cleanest_closid;
+	u32 closid;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) {
+		cleanest_closid = resctrl_find_cleanest_closid();
+		if (cleanest_closid < 0)
+			return cleanest_closid;
+		closid = cleanest_closid;
+	} else {
+		closid = ffs(closid_free_map);
+		if (closid == 0)
+			return -ENOSPC;
+		closid--;
+	}
+	__clear_bit(closid, &closid_free_map);
+
+	return closid;
+}
+
+void closid_free(int closid)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	__set_bit(closid, &closid_free_map);
+}
+
+/**
+ * closid_allocated - test if provided closid is in use
+ * @closid: closid to be tested
+ *
+ * Return: true if @closid is currently associated with a resource group,
+ * false if @closid is free
+ */
+bool closid_allocated(unsigned int closid)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	return !test_bit(closid, &closid_free_map);
+}
+
+/**
+ * rdtgroup_mode_by_closid - Return mode of resource group with closid
+ * @closid: closid of the resource group
+ *
+ * Each resource group is associated with a @closid. Here the mode
+ * of a resource group can be queried by searching for it using its closid.
+ *
+ * Return: mode as &enum rdtgrp_mode of resource group with closid @closid
+ */
+enum rdtgrp_mode rdtgroup_mode_by_closid(int closid)
+{
+	struct rdtgroup *rdtgrp;
+
+	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
+		if (rdtgrp->closid == closid)
+			return rdtgrp->mode;
+	}
+
+	return RDT_NUM_MODES;
+}
+
+static const char * const rdt_mode_str[] = {
+	[RDT_MODE_SHAREABLE]		= "shareable",
+	[RDT_MODE_EXCLUSIVE]		= "exclusive",
+	[RDT_MODE_PSEUDO_LOCKSETUP]	= "pseudo-locksetup",
+	[RDT_MODE_PSEUDO_LOCKED]	= "pseudo-locked",
+};
+
+/**
+ * rdtgroup_mode_str - Return the string representation of mode
+ * @mode: the resource group mode as &enum rdtgrp_mode
+ *
+ * Return: string representation of valid mode, "unknown" otherwise
+ */
+static const char *rdtgroup_mode_str(enum rdtgrp_mode mode)
+{
+	if (mode < RDT_MODE_SHAREABLE || mode >= RDT_NUM_MODES)
+		return "unknown";
+
+	return rdt_mode_str[mode];
+}
+
+/* set uid and gid of rdtgroup dirs and files to that of the creator */
+static int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
+{
+	struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
+				.ia_uid = current_fsuid(),
+				.ia_gid = current_fsgid(), };
+
+	if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) &&
+	    gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID))
+		return 0;
+
+	return kernfs_setattr(kn, &iattr);
+}
+
+static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
+{
+	struct kernfs_node *kn;
+	int ret;
+
+	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
+				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+				  0, rft->kf_ops, rft, NULL, NULL);
+	if (IS_ERR(kn))
+		return PTR_ERR(kn);
+
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret) {
+		kernfs_remove(kn);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int rdtgroup_seqfile_show(struct seq_file *m, void *arg)
+{
+	struct kernfs_open_file *of = m->private;
+	struct rftype *rft = of->kn->priv;
+
+	if (rft->seq_show)
+		return rft->seq_show(of, m, arg);
+	return 0;
+}
+
+static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf,
+				   size_t nbytes, loff_t off)
+{
+	struct rftype *rft = of->kn->priv;
+
+	if (rft->write)
+		return rft->write(of, buf, nbytes, off);
+
+	return -EINVAL;
+}
+
+static const struct kernfs_ops rdtgroup_kf_single_ops = {
+	.atomic_write_len	= PAGE_SIZE,
+	.write			= rdtgroup_file_write,
+	.seq_show		= rdtgroup_seqfile_show,
+};
+
+static const struct kernfs_ops kf_mondata_ops = {
+	.atomic_write_len	= PAGE_SIZE,
+	.seq_show		= rdtgroup_mondata_show,
+};
+
+static bool is_cpu_list(struct kernfs_open_file *of)
+{
+	struct rftype *rft = of->kn->priv;
+
+	return rft->flags & RFTYPE_FLAGS_CPUS_LIST;
+}
+
+static int rdtgroup_cpus_show(struct kernfs_open_file *of,
+			      struct seq_file *s, void *v)
+{
+	struct rdtgroup *rdtgrp;
+	struct cpumask *mask;
+	int ret = 0;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+
+	if (rdtgrp) {
+		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
+			if (!rdtgrp->plr->d) {
+				rdt_last_cmd_clear();
+				rdt_last_cmd_puts("Cache domain offline\n");
+				ret = -ENODEV;
+			} else {
+				mask = &rdtgrp->plr->d->cpu_mask;
+				seq_printf(s, is_cpu_list(of) ?
+					   "%*pbl\n" : "%*pb\n",
+					   cpumask_pr_args(mask));
+			}
+		} else {
+			seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n",
+				   cpumask_pr_args(&rdtgrp->cpu_mask));
+		}
+	} else {
+		ret = -ENOENT;
+	}
+	rdtgroup_kn_unlock(of->kn);
+
+	return ret;
+}
+
+/*
+ * Update the PQR_ASSOC MSR on all cpus in @cpu_mask.
+ *
+ * Per task closids/rmids must have been set up before calling this function.
+ * @r may be NULL.
+ */ +static void +update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) +{ + struct resctrl_cpu_sync defaults; + struct resctrl_cpu_sync *defaults_p = NULL; + + if (r) { + defaults.closid = r->closid; + defaults.rmid = r->mon.rmid; + defaults_p = &defaults; + } + + on_each_cpu_mask(cpu_mask, resctrl_arch_sync_cpu_defaults, defaults_p, + 1); +} + +static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask) +{ + struct rdtgroup *prgrp = rdtgrp->mon.parent, *crgrp; + struct list_head *head; + + /* Check whether cpus belong to parent ctrl group */ + cpumask_andnot(tmpmask, newmask, &prgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + rdt_last_cmd_puts("Can only add CPUs to mongroup that belong to parent\n"); + return -EINVAL; + } + + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Give any dropped cpus to parent rdtgroup */ + cpumask_or(&prgrp->cpu_mask, &prgrp->cpu_mask, tmpmask); + update_closid_rmid(tmpmask, prgrp); + } + + /* + * If we added cpus, remove them from previous group that owned them + * and update per-cpu rmid + */ + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + if (crgrp == rdtgrp) + continue; + cpumask_andnot(&crgrp->cpu_mask, &crgrp->cpu_mask, + tmpmask); + } + update_closid_rmid(tmpmask, rdtgrp); + } + + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); + + return 0; +} + +static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m) +{ + struct rdtgroup *crgrp; + + cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m); + /* update the child mon group masks as well*/ + list_for_each_entry(crgrp, &r->mon.crdtgrp_list, mon.crdtgrp_list) + cpumask_and(&crgrp->cpu_mask, &r->cpu_mask, &crgrp->cpu_mask); +} + +static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask, cpumask_var_t tmpmask1) +{ + struct rdtgroup *r, *crgrp; + struct list_head *head; + + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Can't drop from default group */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Can't drop CPUs from default group\n"); + return -EINVAL; + } + + /* Give any dropped cpus to rdtgroup_default */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, tmpmask); + update_closid_rmid(tmpmask, &rdtgroup_default); + } + + /* + * If we added cpus, remove them from previous group and + * the prev group's child groups that owned them + * and update per-cpu closid/rmid. + */ + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { + if (r == rdtgrp) + continue; + cpumask_and(tmpmask1, &r->cpu_mask, tmpmask); + if (!cpumask_empty(tmpmask1)) + cpumask_rdtgrp_clear(r, tmpmask1); + } + update_closid_rmid(tmpmask, rdtgrp); + } + + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); + + /* + * Clear child mon group masks since there is a new parent mask + * now and update the rmid for the cpus the child lost. 
+ */ + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + cpumask_and(tmpmask, &rdtgrp->cpu_mask, &crgrp->cpu_mask); + update_closid_rmid(tmpmask, rdtgrp); + cpumask_clear(&crgrp->cpu_mask); + } + + return 0; +} + +static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + cpumask_var_t tmpmask, newmask, tmpmask1; + struct rdtgroup *rdtgrp; + int ret; + + if (!buf) + return -EINVAL; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + return -ENOMEM; + } + if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + return -ENOMEM; + } + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto unlock; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + if (is_cpu_list(of)) + ret = cpulist_parse(buf, newmask); + else + ret = cpumask_parse(buf, newmask); + + if (ret) { + rdt_last_cmd_puts("Bad CPU list/mask\n"); + goto unlock; + } + + /* check that user didn't specify any offline cpus */ + cpumask_andnot(tmpmask, newmask, cpu_online_mask); + if (!cpumask_empty(tmpmask)) { + ret = -EINVAL; + rdt_last_cmd_puts("Can only assign online CPUs\n"); + goto unlock; + } + + if (rdtgrp->type == RDTCTRL_GROUP) + ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1); + else if (rdtgrp->type == RDTMON_GROUP) + ret = cpus_mon_write(rdtgrp, newmask, tmpmask); + else + ret = -EINVAL; + +unlock: + rdtgroup_kn_unlock(of->kn); + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + free_cpumask_var(tmpmask1); + + return ret ?: nbytes; +} + +/** + * rdtgroup_remove - the helper to remove resource group safely + * @rdtgrp: resource group to remove + * + * On resource group creation via a mkdir, an extra kernfs_node reference is + * taken to ensure that the rdtgroup structure remains accessible for the + * rdtgroup_kn_unlock() calls where it is removed. + * + * Drop the extra reference here, then free the rdtgroup structure. + * + * Return: void + */ +static void rdtgroup_remove(struct rdtgroup *rdtgrp) +{ + kernfs_put(rdtgrp->kn); + kfree(rdtgrp); +} + +static void _update_task_closid_rmid(void *task) +{ + /* + * If the task is still current on this CPU, update PQR_ASSOC MSR. + * Otherwise, the MSR is updated when the task is scheduled in. + */ + if (task == current) + resctrl_sched_in(task); +} + +static void update_task_closid_rmid(struct task_struct *t) +{ + if (IS_ENABLED(CONFIG_SMP) && task_curr(t)) + smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1); + else + _update_task_closid_rmid(t); +} + +static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp) +{ + u32 closid, rmid = rdtgrp->mon.rmid; + + if (rdtgrp->type == RDTCTRL_GROUP) + closid = rdtgrp->closid; + else if (rdtgrp->type == RDTMON_GROUP) + closid = rdtgrp->mon.parent->closid; + else + return false; + + return resctrl_arch_match_closid(tsk, closid) && + resctrl_arch_match_rmid(tsk, closid, rmid); +} + +static int __rdtgroup_move_task(struct task_struct *tsk, + struct rdtgroup *rdtgrp) +{ + /* If the task is already in rdtgrp, no need to move the task. */ + if (task_in_rdtgroup(tsk, rdtgrp)) + return 0; + + /* + * Set the task's closid/rmid before the PQR_ASSOC MSR can be + * updated by them. 
+ * + * For ctrl_mon groups, move both closid and rmid. + * For monitor groups, can move the tasks only from + * their parent CTRL group. + */ + if (rdtgrp->type == RDTMON_GROUP && + !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) { + rdt_last_cmd_puts("Can't move task to different control group\n"); + return -EINVAL; + } + + if (rdtgrp->type == RDTMON_GROUP) + resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid, + rdtgrp->mon.rmid); + else + resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, + rdtgrp->mon.rmid); + + /* + * Ensure the task's closid and rmid are written before determining if + * the task is current that will decide if it will be interrupted. + * This pairs with the full barrier between the rq->curr update and + * resctrl_sched_in() during context switch. + */ + smp_mb(); + + /* + * By now, the task's closid and rmid are set. If the task is current + * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource + * group go into effect. If the task is not current, the MSR will be + * updated when the task is scheduled in. + */ + update_task_closid_rmid(tsk); + + return 0; +} + +static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) && + resctrl_arch_match_closid(t, r->closid)); +} + +static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) && + resctrl_arch_match_rmid(t, r->mon.parent->closid, + r->mon.rmid)); +} + +/** + * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group + * @r: Resource group + * + * Return: 1 if tasks have been assigned to @r, 0 otherwise + */ +int rdtgroup_tasks_assigned(struct rdtgroup *r) +{ + struct task_struct *p, *t; + int ret = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + ret = 1; + break; + } + } + rcu_read_unlock(); + + return ret; +} + +static int rdtgroup_task_write_permission(struct task_struct *task, + struct kernfs_open_file *of) +{ + const struct cred *tcred = get_task_cred(task); + const struct cred *cred = current_cred(); + int ret = 0; + + /* + * Even if we're attaching all tasks in the thread group, we only + * need to check permissions on one of them. 
+ */ + if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && + !uid_eq(cred->euid, tcred->uid) && + !uid_eq(cred->euid, tcred->suid)) { + rdt_last_cmd_printf("No permission to move task %d\n", task->pid); + ret = -EPERM; + } + + put_cred(tcred); + return ret; +} + +static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp, + struct kernfs_open_file *of) +{ + struct task_struct *tsk; + int ret; + + rcu_read_lock(); + if (pid) { + tsk = find_task_by_vpid(pid); + if (!tsk) { + rcu_read_unlock(); + rdt_last_cmd_printf("No task %d\n", pid); + return -ESRCH; + } + } else { + tsk = current; + } + + get_task_struct(tsk); + rcu_read_unlock(); + + ret = rdtgroup_task_write_permission(tsk, of); + if (!ret) + ret = __rdtgroup_move_task(tsk, rdtgrp); + + put_task_struct(tsk); + return ret; +} + +static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + char *pid_str; + int ret = 0; + pid_t pid; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + rdt_last_cmd_clear(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + while (buf && buf[0] != '\0' && buf[0] != '\n') { + pid_str = strim(strsep(&buf, ",")); + + if (kstrtoint(pid_str, 0, &pid)) { + rdt_last_cmd_printf("Task list parsing error pid %s\n", pid_str); + ret = -EINVAL; + break; + } + + if (pid < 0) { + rdt_last_cmd_printf("Invalid pid %d\n", pid); + ret = -EINVAL; + break; + } + + ret = rdtgroup_move_task(pid, rdtgrp, of); + if (ret) { + rdt_last_cmd_printf("Error while processing task %d\n", pid); + break; + } + } + +unlock: + rdtgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + +static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s) +{ + struct task_struct *p, *t; + pid_t pid; + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + pid = task_pid_vnr(t); + if (pid) + seq_printf(s, "%d\n", pid); + } + } + rcu_read_unlock(); +} + +static int rdtgroup_tasks_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + show_rdt_tasks(rdtgrp, s); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_closid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->closid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_rmid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->mon.rmid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +#ifdef CONFIG_PROC_CPU_RESCTRL + +/* + * A task can only be part of one resctrl control group and of one monitor + * group which is associated to that control group. + * + * 1) res: + * mon: + * + * resctrl is not available. + * + * 2) res:/ + * mon: + * + * Task is part of the root resctrl control group, and it is not associated + * to any monitor group. + * + * 3) res:/ + * mon:mon0 + * + * Task is part of the root resctrl control group and monitor group mon0. 
+ * + * 4) res:group0 + * mon: + * + * Task is part of resctrl control group group0, and it is not associated + * to any monitor group. + * + * 5) res:group0 + * mon:mon1 + * + * Task is part of resctrl control group group0 and monitor group mon1. + */ +int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, + struct pid *pid, struct task_struct *tsk) +{ + struct rdtgroup *rdtg; + int ret = 0; + + mutex_lock(&rdtgroup_mutex); + + /* Return empty if resctrl has not been mounted. */ + if (!resctrl_mounted) { + seq_puts(s, "res:\nmon:\n"); + goto unlock; + } + + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) { + struct rdtgroup *crg; + + /* + * Task information is only relevant for shareable + * and exclusive groups. + */ + if (rdtg->mode != RDT_MODE_SHAREABLE && + rdtg->mode != RDT_MODE_EXCLUSIVE) + continue; + + if (!resctrl_arch_match_closid(tsk, rdtg->closid)) + continue; + + seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "", + rdtg->kn->name); + seq_puts(s, "mon:"); + list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, + mon.crdtgrp_list) { + if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid, + crg->mon.rmid)) + continue; + seq_printf(s, "%s", crg->kn->name); + break; + } + seq_putc(s, '\n'); + goto unlock; + } + /* + * The above search should succeed. Otherwise return + * with an error. + */ + ret = -ENOENT; +unlock: + mutex_unlock(&rdtgroup_mutex); + + return ret; +} +#endif + +static int rdt_last_cmd_status_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + int len; + + mutex_lock(&rdtgroup_mutex); + len = seq_buf_used(&last_cmd_status); + if (len) + seq_printf(seq, "%.*s", len, last_cmd_status_buf); + else + seq_puts(seq, "ok\n"); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int rdt_num_closids_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + + seq_printf(seq, "%u\n", s->num_closid); + return 0; +} + +static int rdt_default_ctrl_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->default_ctrl); + return 0; +} + +static int rdt_min_cbm_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->cache.min_cbm_bits); + return 0; +} + +static int rdt_shareable_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->cache.shareable_bits); + return 0; +} + +/* + * rdt_bit_usage_show - Display current usage of resources + * + * A domain is a shared resource that can now be allocated differently. Here + * we display the current regions of the domain as an annotated bitmask. 
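+ * Output uses the same "<domain_id>=<bitmask>" layout as the schemata
+ * file, with domains separated by ';', e.g. "0=XXXXX00000;1=XXXXXHHHHH".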
+ * For each domain of this resource its allocation bitmask
+ * is annotated as below to indicate the current usage of the corresponding bit:
+ *   0 - currently unused
+ *   X - currently available for sharing and used by software and hardware
+ *   H - currently used by hardware only but available for software use
+ *   S - currently used and shareable by software only
+ *   E - currently used exclusively by one resource group
+ *   P - currently pseudo-locked by one resource group
+ */
+static int rdt_bit_usage_show(struct kernfs_open_file *of,
+			      struct seq_file *seq, void *v)
+{
+	struct resctrl_schema *s = of->kn->parent->priv;
+	/*
+	 * Use unsigned long even though only 32 bits are used to ensure
+	 * test_bit() is used safely.
+	 */
+	unsigned long sw_shareable = 0, hw_shareable = 0;
+	unsigned long exclusive = 0, pseudo_locked = 0;
+	struct rdt_resource *r = s->res;
+	struct rdt_domain *dom;
+	int i, hwb, swb, excl, psl;
+	enum rdtgrp_mode mode;
+	bool sep = false;
+	u32 ctrl_val;
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+	hw_shareable = r->cache.shareable_bits;
+	list_for_each_entry(dom, &r->domains, list) {
+		if (sep)
+			seq_putc(seq, ';');
+		sw_shareable = 0;
+		exclusive = 0;
+		seq_printf(seq, "%d=", dom->id);
+		for (i = 0; i < closids_supported(); i++) {
+			if (!closid_allocated(i))
+				continue;
+			ctrl_val = resctrl_arch_get_config(r, dom, i,
+							   s->conf_type);
+			mode = rdtgroup_mode_by_closid(i);
+			switch (mode) {
+			case RDT_MODE_SHAREABLE:
+				sw_shareable |= ctrl_val;
+				break;
+			case RDT_MODE_EXCLUSIVE:
+				exclusive |= ctrl_val;
+				break;
+			case RDT_MODE_PSEUDO_LOCKSETUP:
+				/*
+				 * RDT_MODE_PSEUDO_LOCKSETUP is possible
+				 * here but not included since the CBM
+				 * associated with this CLOSID in this mode
+				 * is not initialized and no task or cpu can be
+				 * assigned this CLOSID.
+				 */
+				break;
+			case RDT_MODE_PSEUDO_LOCKED:
+			case RDT_NUM_MODES:
+				WARN(1,
+				     "invalid mode for closid %d\n", i);
+				break;
+			}
+		}
+		for (i = r->cache.cbm_len - 1; i >= 0; i--) {
+			pseudo_locked = dom->plr ? dom->plr->cbm : 0;
+			hwb = test_bit(i, &hw_shareable);
+			swb = test_bit(i, &sw_shareable);
+			excl = test_bit(i, &exclusive);
+			psl = test_bit(i, &pseudo_locked);
+			if (hwb && swb)
+				seq_putc(seq, 'X');
+			else if (hwb && !swb)
+				seq_putc(seq, 'H');
+			else if (!hwb && swb)
+				seq_putc(seq, 'S');
+			else if (excl)
+				seq_putc(seq, 'E');
+			else if (psl)
+				seq_putc(seq, 'P');
+			else /* Unused bits remain */
+				seq_putc(seq, '0');
+		}
+		sep = true;
+	}
+	seq_putc(seq, '\n');
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+	return 0;
+}
+
+static int rdt_min_bw_show(struct kernfs_open_file *of,
+			   struct seq_file *seq, void *v)
+{
+	struct resctrl_schema *s = of->kn->parent->priv;
+	struct rdt_resource *r = s->res;
+
+	seq_printf(seq, "%u\n", r->membw.min_bw);
+	return 0;
+}
+
+static int rdt_num_rmids_show(struct kernfs_open_file *of,
+			      struct seq_file *seq, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	seq_printf(seq, "%d\n", r->num_rmid);
+
+	return 0;
+}
+
+static int rdt_mon_features_show(struct kernfs_open_file *of,
+				 struct seq_file *seq, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	struct mon_evt *mevt;
+
+	list_for_each_entry(mevt, &r->evt_list, list) {
+		seq_printf(seq, "%s\n", mevt->name);
+		if (mevt->configurable)
+			seq_printf(seq, "%s_config\n", mevt->name);
+	}
+
+	return 0;
+}
+
+static int rdt_bw_gran_show(struct kernfs_open_file *of,
+			    struct seq_file *seq, void *v)
+{
+	struct resctrl_schema *s = of->kn->parent->priv;
+	struct rdt_resource *r = s->res;
+
+	seq_printf(seq, "%u\n", r->membw.bw_gran);
+	return 0;
+}
+
+static int rdt_delay_linear_show(struct kernfs_open_file *of,
+				 struct seq_file *seq, void *v)
+{
+	struct resctrl_schema *s = of->kn->parent->priv;
+	struct rdt_resource *r = s->res;
+
+	seq_printf(seq, "%u\n", r->membw.delay_linear);
+	return 0;
+}
+
+static int max_threshold_occ_show(struct kernfs_open_file *of,
+				  struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold);
+
+	return 0;
+}
+
+static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of,
+					 struct seq_file *seq, void *v)
+{
+	struct resctrl_schema *s = of->kn->parent->priv;
+	struct rdt_resource *r = s->res;
+
+	if (r->membw.throttle_mode == THREAD_THROTTLE_PER_THREAD)
+		seq_puts(seq, "per-thread\n");
+	else
+		seq_puts(seq, "max\n");
+
+	return 0;
+}
+
+static ssize_t max_threshold_occ_write(struct kernfs_open_file *of,
+				       char *buf, size_t nbytes, loff_t off)
+{
+	unsigned int bytes;
+	int ret;
+
+	ret = kstrtouint(buf, 0, &bytes);
+	if (ret)
+		return ret;
+
+	if (bytes > resctrl_rmid_realloc_limit)
+		return -EINVAL;
+
+	resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes);
+
+	return nbytes;
+}
+
+/*
+ * rdtgroup_mode_show - Display mode of this resource group
+ */
+static int rdtgroup_mode_show(struct kernfs_open_file *of,
+			      struct seq_file *s, void *v)
+{
+	struct rdtgroup *rdtgrp;
+
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
+	if (!rdtgrp) {
+		rdtgroup_kn_unlock(of->kn);
+		return -ENOENT;
+	}
+
+	seq_printf(s, "%s\n", rdtgroup_mode_str(rdtgrp->mode));
+
+	rdtgroup_kn_unlock(of->kn);
+	return 0;
+}
+
+static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type)
+{
+	switch (my_type) {
+	case CDP_CODE:
+		return CDP_DATA;
+	case CDP_DATA:
+		return CDP_CODE;
+	default:
+	case CDP_NONE:
+		return CDP_NONE;
+	}
+}
+
+static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of,
+					struct seq_file *seq, void *v)
+{
+	struct resctrl_schema *s = of->kn->parent->priv;
+	struct rdt_resource *r = s->res;
+
+	seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks);
+
+	return 0;
+}
+
+/**
+ * __rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other
+ * @r: Resource to which domain instance @d belongs.
+ * @d: The domain instance for which @closid is being tested.
+ * @cbm: Capacity bitmask being tested.
+ * @closid: Intended closid for @cbm.
+ * @type: CDP type of @r.
+ * @exclusive: Only check if overlaps with exclusive resource groups
+ *
+ * Checks if provided @cbm intended to be used for @closid on domain
+ * @d overlaps with any other closids or other hardware usage associated
+ * with this domain. If @exclusive is true then only overlaps with
+ * resource groups in exclusive mode will be considered. If @exclusive
+ * is false then overlaps with any resource group or hardware entities
+ * will be considered.
+ *
+ * @cbm is unsigned long, even if only 32 bits are used, to make the
+ * bitmap functions work correctly.
+ *
+ * Return: false if CBM does not overlap, true if it does.
+ */
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_domain *d,
+				    unsigned long cbm, int closid,
+				    enum resctrl_conf_type type, bool exclusive)
+{
+	enum rdtgrp_mode mode;
+	unsigned long ctrl_b;
+	int i;
+
+	/* Check for any overlap with regions used by hardware directly */
+	if (!exclusive) {
+		ctrl_b = r->cache.shareable_bits;
+		if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len))
+			return true;
+	}
+
+	/* Check for overlap with other resource groups */
+	for (i = 0; i < closids_supported(); i++) {
+		ctrl_b = resctrl_arch_get_config(r, d, i, type);
+		mode = rdtgroup_mode_by_closid(i);
+		if (closid_allocated(i) && i != closid &&
+		    mode != RDT_MODE_PSEUDO_LOCKSETUP) {
+			if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) {
+				if (exclusive) {
+					if (mode == RDT_MODE_EXCLUSIVE)
+						return true;
+					continue;
+				}
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/**
+ * rdtgroup_cbm_overlaps - Does CBM overlap with other use of hardware
+ * @s: Schema for the resource to which domain instance @d belongs.
+ * @d: The domain instance for which @closid is being tested.
+ * @cbm: Capacity bitmask being tested.
+ * @closid: Intended closid for @cbm.
+ * @exclusive: Only check if overlaps with exclusive resource groups
+ *
+ * Resources that can be allocated using a CBM can use the CBM to control
+ * the overlap of these allocations. rdtgroup_cbm_overlaps() is the test
+ * for overlap. The overlap test is not limited to the specific resource for
+ * which the CBM is intended though - when dealing with CDP resources that
+ * share the underlying hardware the overlap check should be performed on
+ * the CDP resource sharing the hardware also.
+ *
+ * Refer to description of __rdtgroup_cbm_overlaps() for the details of the
+ * overlap test.
+ *
+ * Return: true if CBM overlap detected, false if there is no overlap
+ */
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_domain *d,
+			   unsigned long cbm, int closid, bool exclusive)
+{
+	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
+	struct rdt_resource *r = s->res;
+
+	if (__rdtgroup_cbm_overlaps(r, d, cbm, closid, s->conf_type,
+				    exclusive))
+		return true;
+
+	if (!resctrl_arch_get_cdp_enabled(r->rid))
+		return false;
+	return __rdtgroup_cbm_overlaps(r, d, cbm, closid, peer_type, exclusive);
+}
+
+/**
+ * rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive
+ * @rdtgrp: Resource group identified through its closid.
+ * + * An exclusive resource group implies that there should be no sharing of + * its allocated resources. At the time this group is considered to be + * exclusive this test can determine if its current schemata supports this + * setting by testing for overlap with all other resource groups. + * + * Return: true if resource group can be exclusive, false if there is overlap + * with allocations of other resource groups and thus this resource group + * cannot be exclusive. + */ +static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) +{ + int closid = rdtgrp->closid; + struct resctrl_schema *s; + struct rdt_resource *r; + bool has_cache = false; + struct rdt_domain *d; + u32 ctrl; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) + continue; + has_cache = true; + list_for_each_entry(d, &r->domains, list) { + ctrl = resctrl_arch_get_config(r, d, closid, + s->conf_type); + if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { + rdt_last_cmd_puts("Schemata overlaps\n"); + return false; + } + } + } + + if (!has_cache) { + rdt_last_cmd_puts("Cannot be exclusive without CAT/CDP\n"); + return false; + } + + return true; +} + +/* + * rdtgroup_mode_write - Modify the resource group's mode + */ +static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + enum rdtgrp_mode mode; + int ret = 0; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + buf[nbytes - 1] = '\0'; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + rdt_last_cmd_clear(); + + mode = rdtgrp->mode; + + if ((!strcmp(buf, "shareable") && mode == RDT_MODE_SHAREABLE) || + (!strcmp(buf, "exclusive") && mode == RDT_MODE_EXCLUSIVE) || + (!strcmp(buf, "pseudo-locksetup") && + mode == RDT_MODE_PSEUDO_LOCKSETUP) || + (!strcmp(buf, "pseudo-locked") && mode == RDT_MODE_PSEUDO_LOCKED)) + goto out; + + if (mode == RDT_MODE_PSEUDO_LOCKED) { + rdt_last_cmd_puts("Cannot change pseudo-locked group\n"); + ret = -EINVAL; + goto out; + } + + if (!strcmp(buf, "shareable")) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = RDT_MODE_SHAREABLE; + } else if (!strcmp(buf, "exclusive")) { + if (!rdtgroup_mode_test_exclusive(rdtgrp)) { + ret = -EINVAL; + goto out; + } + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = RDT_MODE_EXCLUSIVE; + } else if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK) && + !strcmp(buf, "pseudo-locksetup")) { + ret = rdtgroup_locksetup_enter(rdtgrp); + if (ret) + goto out; + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKSETUP; + } else { + rdt_last_cmd_puts("Unknown or unsupported mode\n"); + ret = -EINVAL; + } + +out: + rdtgroup_kn_unlock(of->kn); + return ret ?: nbytes; +} + +/** + * rdtgroup_cbm_to_size - Translate CBM to size in bytes + * @r: RDT resource to which @d belongs. + * @d: RDT domain instance. + * @cbm: bitmask for which the size should be computed. + * + * The bitmask provided associated with the RDT domain instance @d will be + * translated into how many bytes it represents. 
The size in bytes is + * computed by first dividing the total cache size by the CBM length to + * determine how many bytes each bit in the bitmask represents. The result + * is multiplied with the number of bits set in the bitmask. + * + * @cbm is unsigned long, even if only 32 bits are used to make the + * bitmap functions work correctly. + */ +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, + struct rdt_domain *d, unsigned long cbm) +{ + struct cpu_cacheinfo *ci; + unsigned int size = 0; + int num_b, i; + + num_b = bitmap_weight(&cbm, r->cache.cbm_len); + ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask)); + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == r->cache_level) { + size = ci->info_list[i].size / r->cache.cbm_len * num_b; + break; + } + } + + return size; +} + +/* + * rdtgroup_size_show - Display size in bytes of allocated regions + * + * The "size" file mirrors the layout of the "schemata" file, printing the + * size in bytes of each region instead of the capacity bitmask. + */ +static int rdtgroup_size_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct resctrl_schema *schema; + enum resctrl_conf_type type; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + struct rdt_domain *d; + unsigned int size; + int ret = 0; + u32 closid; + bool sep; + u32 ctrl; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + seq_printf(s, "%*s:", max_name_width, + rdtgrp->plr->s->name); + size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res, + rdtgrp->plr->d, + rdtgrp->plr->cbm); + seq_printf(s, "%d=%u\n", rdtgrp->plr->d->id, size); + } + goto out; + } + + closid = rdtgrp->closid; + + list_for_each_entry(schema, &resctrl_schema_all, list) { + r = schema->res; + type = schema->conf_type; + sep = false; + seq_printf(s, "%*s:", max_name_width, schema->name); + list_for_each_entry(d, &r->domains, list) { + if (sep) + seq_putc(s, ';'); + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + size = 0; + } else { + if (is_mba_sc(r)) + ctrl = d->mbps_val[closid]; + else + ctrl = resctrl_arch_get_config(r, d, + closid, + type); + if (r->rid == RDT_RESOURCE_MBA || + r->rid == RDT_RESOURCE_SMBA) + size = ctrl; + else + size = rdtgroup_cbm_to_size(r, d, ctrl); + } + seq_printf(s, "%d=%u", d->id, size); + sep = true; + } + seq_putc(s, '\n'); + } + +out: + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static void mondata_config_read(struct resctrl_mon_config_info *mon_info) +{ + smp_call_function_any(&mon_info->d->cpu_mask, + resctrl_arch_mon_event_config_read, mon_info, 1); +} + +static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) +{ + struct resctrl_mon_config_info mon_info = {0}; + struct rdt_domain *dom; + bool sep = false; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + list_for_each_entry(dom, &r->domains, list) { + if (sep) + seq_puts(s, ";"); + + memset(&mon_info, 0, sizeof(struct resctrl_mon_config_info)); + mon_info.r = r; + mon_info.d = dom; + mon_info.evtid = evtid; + mondata_config_read(&mon_info); + + seq_printf(s, "%d=0x%02x", dom->id, mon_info.mon_config); + sep = true; + } + seq_puts(s, "\n"); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return 0; +} + +static int mbm_total_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) 
+{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); + + return 0; +} + +static int mbm_local_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); + + return 0; +} + +static void mbm_config_write_domain(struct rdt_resource *r, + struct rdt_domain *d, u32 evtid, u32 val) +{ + struct resctrl_mon_config_info mon_info = {0}; + + /* + * Read the current config value first. If both are the same then + * no need to write it again. + */ + mon_info.r = r; + mon_info.d = d; + mon_info.evtid = evtid; + mondata_config_read(&mon_info); + if (mon_info.mon_config == val) + return; + + mon_info.mon_config = val; + + /* + * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the + * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE + * are scoped at the domain level. Writing any of these MSRs + * on one CPU is observed by all the CPUs in the domain. + */ + smp_call_function_any(&d->cpu_mask, resctrl_arch_mon_event_config_write, + &mon_info, 1); + if (mon_info.err) { + rdt_last_cmd_puts("Invalid event configuration\n"); + return; + } + + /* + * When an Event Configuration is changed, the bandwidth counters + * for all RMIDs and Events will be cleared by the hardware. The + * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for + * every RMID on the next read to any event for every RMID. + * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) + * cleared while it is tracked by the hardware. Clear the + * mbm_local and mbm_total counts for all the RMIDs. + */ + resctrl_arch_reset_rmid_all(r, d); +} + +static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) +{ + char *dom_str = NULL, *id_str; + unsigned long dom_id, val; + struct rdt_domain *d; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + +next: + if (!tok || tok[0] == '\0') + return 0; + + /* Start processing the strings for each domain */ + dom_str = strim(strsep(&tok, ";")); + id_str = strsep(&dom_str, "="); + + if (!id_str || kstrtoul(id_str, 10, &dom_id)) { + rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n"); + return -EINVAL; + } + + if (!dom_str || kstrtoul(dom_str, 16, &val)) { + rdt_last_cmd_puts("Non-numeric event configuration value\n"); + return -EINVAL; + } + + list_for_each_entry(d, &r->domains, list) { + if (d->id == dom_id) { + mbm_config_write_domain(r, d, evtid, val); + goto next; + } + } + + return -EINVAL; +} + +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return ret ?: nbytes; +} + +static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret 
= mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return ret ?: nbytes; +} + +/* rdtgroup information files for one cache resource. */ +static struct rftype res_common_files[] = { + { + .name = "last_cmd_status", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_last_cmd_status_show, + .fflags = RFTYPE_TOP_INFO, + }, + { + .name = "num_closids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_closids_show, + .fflags = RFTYPE_CTRL_INFO, + }, + { + .name = "mon_features", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_features_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "num_rmids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_rmids_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "cbm_mask", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_default_ctrl_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_cbm_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_cbm_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "shareable_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_shareable_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "bit_usage", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bit_usage_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_bandwidth", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_bw_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "bandwidth_gran", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bw_gran_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "delay_linear", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_delay_linear_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + /* + * Platform specific which (if any) capabilities are provided by + * thread_throttle_mode. Defer "fflags" initialization to platform + * discovery. 
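+	 * (thread_throttle_mode_init() below fills in "fflags" once the
+	 * MBA resource reports a defined throttle_mode.)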
+ */ + { + .name = "thread_throttle_mode", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_thread_throttle_mode_show, + }, + { + .name = "max_threshold_occupancy", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = max_threshold_occ_write, + .seq_show = max_threshold_occ_show, + .fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "mbm_total_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_total_bytes_config_show, + .write = mbm_total_bytes_config_write, + }, + { + .name = "mbm_local_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_local_bytes_config_show, + .write = mbm_local_bytes_config_write, + }, + { + .name = "cpus", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "cpus_list", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_BASE, + }, + { + .name = "tasks", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_tasks_write, + .seq_show = rdtgroup_tasks_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "mon_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_rmid_show, + .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG, + }, + { + .name = "schemata", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_schemata_write, + .seq_show = rdtgroup_schemata_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "mode", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_mode_write, + .seq_show = rdtgroup_mode_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "size", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_size_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "sparse_masks", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_has_sparse_bitmasks_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "ctrl_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_closid_show, + .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG, + }, + +}; + +static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags) +{ + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + lockdep_assert_held(&rdtgroup_mutex); + + if (resctrl_debug) + fflags |= RFTYPE_DEBUG; + + for (rft = rfts; rft < rfts + len; rft++) { + if (rft->fflags && ((fflags & rft->fflags) == rft->fflags)) { + ret = rdtgroup_add_file(kn, rft); + if (ret) + goto error; + } + } + + return 0; +error: + pr_warn("Failed to add %s, err=%d\n", rft->name, ret); + while (--rft >= rfts) { + if ((fflags & rft->fflags) == rft->fflags) + kernfs_remove_by_name(kn, rft->name); + } + return ret; +} + +static struct rftype *rdtgroup_get_rftype_by_name(const char *name) +{ + struct rftype *rfts, *rft; + int len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + return rft; + } + + return NULL; +} + +static void thread_throttle_mode_init(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + struct rftype *rft; + + if (!r->alloc_capable || + r->membw.throttle_mode == THREAD_THROTTLE_UNDEFINED) + return; + + rft = 
rdtgroup_get_rftype_by_name("thread_throttle_mode"); + if (!rft) + return; + + rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; +} + +void mbm_config_rftype_init(const char *config) +{ + struct rftype *rft; + + rft = rdtgroup_get_rftype_by_name(config); + if (rft) + rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE; +} + +/** + * rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * + * The permissions of named resctrl file, directory, or link are modified + * to not allow read, write, or execute by any user. + * + * WARNING: This function is intended to communicate to the user that the + * resctrl file has been locked down - that it is not relevant to the + * particular state the system finds itself in. It should not be relied + * on to protect from user access because after the file's permissions + * are restricted the user can still change the permissions using chmod + * from the command line. + * + * Return: 0 on success, <0 on failure. + */ +int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn; + int ret = 0; + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + iattr.ia_mode = S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode = S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode = S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +/** + * rdtgroup_kn_mode_restore - Restore user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * @mask: Mask of permissions that should be restored + * + * Restore the permissions of the named file. If @name is a directory the + * permissions of its parent will be used. + * + * Return: 0 on success, <0 on failure. 
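+ *
+ * The restored mode is the matching rftype's mode masked with @mask;
+ * e.g. a @mask of 0777 restores the file's default permissions.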
+ */ +int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, + umode_t mask) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn, *parent; + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + iattr.ia_mode = rft->mode & mask; + } + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + parent = kernfs_get_parent(kn); + if (parent) { + iattr.ia_mode |= parent->mode; + kernfs_put(parent); + } + iattr.ia_mode |= S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode |= S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode |= S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +static int rdtgroup_mkdir_info_resdir(void *priv, char *name, + unsigned long fflags) +{ + struct kernfs_node *kn_subdir; + int ret; + + kn_subdir = kernfs_create_dir(kn_info, name, + kn_info->mode, priv); + if (IS_ERR(kn_subdir)) + return PTR_ERR(kn_subdir); + + ret = rdtgroup_kn_set_ugid(kn_subdir); + if (ret) + return ret; + + ret = rdtgroup_add_files(kn_subdir, fflags); + if (!ret) + kernfs_activate(kn_subdir); + + return ret; +} + +static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) +{ + enum resctrl_res_level i; + struct resctrl_schema *s; + struct rdt_resource *r; + unsigned long fflags; + char name[32]; + int ret; + + /* create the directory */ + kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL); + if (IS_ERR(kn_info)) + return PTR_ERR(kn_info); + + ret = rdtgroup_add_files(kn_info, RFTYPE_TOP_INFO); + if (ret) + goto out_destroy; + + /* loop over enabled controls, these are all alloc_capable */ + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + fflags = r->fflags | RFTYPE_CTRL_INFO; + ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); + if (ret) + goto out_destroy; + } + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + fflags = r->fflags | RFTYPE_MON_INFO; + sprintf(name, "%s_MON", r->name); + ret = rdtgroup_mkdir_info_resdir(r, name, fflags); + if (ret) + goto out_destroy; + } + + ret = rdtgroup_kn_set_ugid(kn_info); + if (ret) + goto out_destroy; + + kernfs_activate(kn_info); + + return 0; + +out_destroy: + kernfs_remove(kn_info); + return ret; +} + +static int +mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, + char *name, struct kernfs_node **dest_kn) +{ + struct kernfs_node *kn; + int ret; + + /* create the directory */ + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + if (dest_kn) + *dest_kn = kn; + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + + kernfs_activate(kn); + + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +static inline bool is_mba_linear(void) +{ + return resctrl_arch_get_resource(RDT_RESOURCE_MBA)->membw.delay_linear; +} + +static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_domain *d) +{ + u32 num_closid = resctrl_arch_get_num_closid(r); + int cpu = cpumask_any(&d->cpu_mask); + int i; + + d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val), + GFP_KERNEL, cpu_to_node(cpu)); + if (!d->mbps_val) + return -ENOMEM; + + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + + return 0; +} + +static void 
mba_sc_domain_destroy(struct rdt_resource *r, + struct rdt_domain *d) +{ + kfree(d->mbps_val); + d->mbps_val = NULL; +} + +/* + * MBA software controller is supported only if + * MBM is supported and MBA is in linear scale. + */ +static bool supports_mba_mbps(void) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + + return (resctrl_arch_is_mbm_local_enabled() && + r->alloc_capable && is_mba_linear()); +} + +/* + * Enable or disable the MBA software controller + * which helps user specify bandwidth in MBps. + */ +static int set_mba_sc(bool mba_sc) +{ + struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_MBA); + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rdt_domain *d; + int i; + + if (!supports_mba_mbps() || mba_sc == is_mba_sc(r)) + return -EINVAL; + + r->membw.mba_sc = mba_sc; + + list_for_each_entry(d, &r->domains, list) { + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + } + + return 0; +} + +/* + * We don't allow rdtgroup directories to be created anywhere + * except the root directory. Thus when looking for the rdtgroup + * structure for a kernfs node we are either looking at a directory, + * in which case the rdtgroup structure is pointed at by the "priv" + * field, otherwise we have a file, and need only look to the parent + * to find the rdtgroup. + */ +static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn) +{ + if (kernfs_type(kn) == KERNFS_DIR) { + /* + * All the resource directories use "kn->priv" + * to point to the "struct rdtgroup" for the + * resource. "info" and its subdirectories don't + * have rdtgroup structures, so return NULL here. + */ + if (kn == kn_info || kn->parent == kn_info) + return NULL; + else + return kn->priv; + } else { + return kn->parent->priv; + } +} + +static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + atomic_inc(&rdtgrp->waitcount); + kernfs_break_active_protection(kn); +} + +static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + if (atomic_dec_and_test(&rdtgrp->waitcount) && + (rdtgrp->flags & RDT_DELETED)) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + kernfs_unbreak_active_protection(kn); + rdtgroup_remove(rdtgrp); + } else { + kernfs_unbreak_active_protection(kn); + } +} + +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return NULL; + + rdtgroup_kn_get(rdtgrp, kn); + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* Was this group deleted while we waited? 
*/ + if (rdtgrp->flags & RDT_DELETED) + return NULL; + + return rdtgrp; +} + +void rdtgroup_kn_unlock(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return; + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + rdtgroup_kn_put(rdtgrp, kn); +} + +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **mon_data_kn); + +static void rdt_disable_ctx(void) +{ + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); + set_mba_sc(false); + + resctrl_debug = false; +} + +static int rdt_enable_ctx(struct rdt_fs_context *ctx) +{ + int ret = 0; + + if (ctx->enable_cdpl2) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, true); + if (ret) + goto out_done; + } + + if (ctx->enable_cdpl3) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, true); + if (ret) + goto out_cdpl2; + } + + if (ctx->enable_mba_mbps) { + ret = set_mba_sc(true); + if (ret) + goto out_cdpl3; + } + + if (ctx->enable_debug) + resctrl_debug = true; + + return 0; + +out_cdpl3: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); +out_cdpl2: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); +out_done: + return ret; +} + +static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type) +{ + struct resctrl_schema *s; + const char *suffix = ""; + int ret, cl; + + s = kzalloc(sizeof(*s), GFP_KERNEL); + if (!s) + return -ENOMEM; + + s->res = r; + s->num_closid = resctrl_arch_get_num_closid(r); + if (resctrl_arch_get_cdp_enabled(r->rid)) + s->num_closid /= 2; + + s->conf_type = type; + switch (type) { + case CDP_CODE: + suffix = "CODE"; + break; + case CDP_DATA: + suffix = "DATA"; + break; + case CDP_NONE: + suffix = ""; + break; + } + + ret = snprintf(s->name, sizeof(s->name), "%s%s", r->name, suffix); + if (ret >= sizeof(s->name)) { + kfree(s); + return -EINVAL; + } + + cl = strlen(s->name); + + /* + * If CDP is supported by this resource, but not enabled, + * include the suffix. This ensures the tabular format of the + * schemata file does not change between mounts of the filesystem. + */ + if (r->cdp_capable && !resctrl_arch_get_cdp_enabled(r->rid)) + cl += 4; + + if (cl > max_name_width) + max_name_width = cl; + + /* + * Choose a width for the resource data based on the resource that has + * widest name and cbm. 
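+	 * Both widths are global so that all rows of the schemata file
+	 * line up as one table across every resource.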
+ */ + max_data_width = max(max_data_width, r->data_width); + + INIT_LIST_HEAD(&s->list); + list_add(&s->list, &resctrl_schema_all); + + return 0; +} + +static int schemata_list_create(void) +{ + enum resctrl_res_level i; + struct rdt_resource *r; + int ret = 0; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->alloc_capable) + continue; + + if (resctrl_arch_get_cdp_enabled(r->rid)) { + ret = schemata_list_add(r, CDP_CODE); + if (ret) + break; + + ret = schemata_list_add(r, CDP_DATA); + } else { + ret = schemata_list_add(r, CDP_NONE); + } + + if (ret) + break; + } + + return ret; +} + +static void schemata_list_destroy(void) +{ + struct resctrl_schema *s, *tmp; + + list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) { + list_del(&s->list); + kfree(s); + } +} + +static int rdt_get_tree(struct fs_context *fc) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdt_fs_context *ctx = rdt_fc2context(fc); + unsigned long flags = RFTYPE_CTRL_BASE; + struct rdt_domain *dom; + int ret; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + /* + * resctrl file system can only be mounted once. + */ + if (resctrl_mounted) { + ret = -EBUSY; + goto out; + } + + ret = rdtgroup_setup_root(ctx); + if (ret) + goto out; + + ret = rdt_enable_ctx(ctx); + if (ret) + goto out_root; + + ret = schemata_list_create(); + if (ret) { + schemata_list_destroy(); + goto out_ctx; + } + + closid_init(); + + if (resctrl_arch_mon_capable()) + flags |= RFTYPE_MON; + + ret = rdtgroup_add_files(rdtgroup_default.kn, flags); + if (ret) + goto out_schemata_free; + + kernfs_activate(rdtgroup_default.kn); + + ret = rdtgroup_create_info_dir(rdtgroup_default.kn); + if (ret < 0) + goto out_schemata_free; + + if (resctrl_arch_mon_capable()) { + ret = mongroup_create_dir(rdtgroup_default.kn, + &rdtgroup_default, "mon_groups", + &kn_mongrp); + if (ret < 0) + goto out_info; + + ret = mkdir_mondata_all(rdtgroup_default.kn, + &rdtgroup_default, &kn_mondata); + if (ret < 0) + goto out_mongrp; + rdtgroup_default.mon.mon_data_kn = kn_mondata; + } + + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) { + ret = rdt_pseudo_lock_init(); + if (ret) + goto out_mondata; + } + + ret = kernfs_get_tree(fc); + if (ret < 0) + goto out_psl; + + if (resctrl_arch_alloc_capable()) + resctrl_arch_enable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_enable_mon(); + + if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) + resctrl_mounted = true; + + if (resctrl_is_mbm_enabled()) { + list_for_each_entry(dom, &l3->domains, list) + mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); + } + + goto out; + +out_psl: + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); +out_mondata: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mondata); +out_mongrp: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mongrp); +out_info: + kernfs_remove(kn_info); +out_schemata_free: + schemata_list_destroy(); +out_ctx: + rdt_disable_ctx(); +out_root: + rdtgroup_destroy_root(); +out: + rdt_last_cmd_clear(); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +enum rdt_param { + Opt_cdp, + Opt_cdpl2, + Opt_mba_mbps, + Opt_debug, + nr__rdt_params +}; + +static const struct fs_parameter_spec rdt_fs_parameters[] = { + fsparam_flag("cdp", Opt_cdp), + fsparam_flag("cdpl2", Opt_cdpl2), + fsparam_flag("mba_MBps", Opt_mba_mbps), + fsparam_flag("debug", Opt_debug), + {} +}; + +static int rdt_parse_param(struct 
fs_context *fc, struct fs_parameter *param) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + struct fs_parse_result result; + int opt; + + opt = fs_parse(fc, rdt_fs_parameters, param, &result); + if (opt < 0) + return opt; + + switch (opt) { + case Opt_cdp: + ctx->enable_cdpl3 = true; + return 0; + case Opt_cdpl2: + ctx->enable_cdpl2 = true; + return 0; + case Opt_mba_mbps: + if (!supports_mba_mbps()) + return -EINVAL; + ctx->enable_mba_mbps = true; + return 0; + case Opt_debug: + ctx->enable_debug = true; + return 0; + } + + return -EINVAL; +} + +static void rdt_fs_context_free(struct fs_context *fc) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + + kernfs_free_fs_context(fc); + kfree(ctx); +} + +static const struct fs_context_operations rdt_fs_context_ops = { + .free = rdt_fs_context_free, + .parse_param = rdt_parse_param, + .get_tree = rdt_get_tree, +}; + +static int rdt_init_fs_context(struct fs_context *fc) +{ + struct rdt_fs_context *ctx; + + ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; + fc->fs_private = &ctx->kfc; + fc->ops = &rdt_fs_context_ops; + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(&init_user_ns); + fc->global = true; + return 0; +} + +/* + * Move tasks from one to the other group. If @from is NULL, then all tasks + * in the systems are moved unconditionally (used for teardown). + * + * If @mask is not NULL the cpus on which moved tasks are running are set + * in that mask so the update smp function call is restricted to affected + * cpus. + */ +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, + struct cpumask *mask) +{ + struct task_struct *p, *t; + + read_lock(&tasklist_lock); + for_each_process_thread(p, t) { + if (!from || is_closid_match(t, from) || + is_rmid_match(t, from)) { + resctrl_arch_set_closid_rmid(t, to->closid, + to->mon.rmid); + + /* + * Order the closid/rmid stores above before the loads + * in task_curr(). This pairs with the full barrier + * between the rq->curr update and resctrl_sched_in() + * during context switch. + */ + smp_mb(); + + /* + * If the task is on a CPU, set the CPU in the mask. + * The detection is inaccurate as tasks might move or + * schedule before the smp function call takes place. + * In such a case the function call is pointless, but + * there is no other side effect. + */ + if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) + cpumask_set_cpu(task_cpu(t), mask); + } + } + read_unlock(&tasklist_lock); +} + +static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) +{ + struct rdtgroup *sentry, *stmp; + struct list_head *head; + + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { + free_rmid(sentry->closid, sentry->mon.rmid); + list_del(&sentry->mon.crdtgrp_list); + + if (atomic_read(&sentry->waitcount) != 0) + sentry->flags = RDT_DELETED; + else + rdtgroup_remove(sentry); + } +} + +/* + * Forcibly remove all of subdirectories under root. 
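+ * Called with rdtgroup_mutex held during unmount; every task and CPU is
+ * handed back to the default resource group first.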
+ */ +static void rmdir_all_sub(void) +{ + struct rdtgroup *rdtgrp, *tmp; + + /* Move all tasks to the default resource group */ + rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); + + list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { + /* Free any child rmids */ + free_all_child_rdtgrp(rdtgrp); + + /* Remove each rdtgroup other than root */ + if (rdtgrp == &rdtgroup_default) + continue; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + + /* + * Give any CPUs back to the default group. We cannot copy + * cpu_online_mask because a CPU might have executed the + * offline callback already, but is still marked online. + */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + kernfs_remove(rdtgrp->kn); + list_del(&rdtgrp->rdtgroup_list); + + if (atomic_read(&rdtgrp->waitcount) != 0) + rdtgrp->flags = RDT_DELETED; + else + rdtgroup_remove(rdtgrp); + } + /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ + update_closid_rmid(cpu_online_mask, &rdtgroup_default); + + kernfs_remove(kn_info); + kernfs_remove(kn_mongrp); + kernfs_remove(kn_mondata); +} + +static void rdt_kill_sb(struct super_block *sb) +{ + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_disable_ctx(); + + /* Put everything back to default values. */ + resctrl_arch_reset_resources(); + + rmdir_all_sub(); + if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) + rdt_pseudo_lock_release(); + rdtgroup_default.mode = RDT_MODE_SHAREABLE; + schemata_list_destroy(); + rdtgroup_destroy_root(); + if (resctrl_arch_alloc_capable()) + resctrl_arch_disable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_disable_mon(); + resctrl_mounted = false; + kernfs_kill_sb(sb); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +static struct file_system_type rdt_fs_type = { + .name = "resctrl", + .init_fs_context = rdt_init_fs_context, + .parameters = rdt_fs_parameters, + .kill_sb = rdt_kill_sb, +}; + +static int mon_addfile(struct kernfs_node *parent_kn, const char *name, + void *priv) +{ + struct kernfs_node *kn; + int ret = 0; + + kn = __kernfs_create_file(parent_kn, name, 0444, + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, + &kf_mondata_ops, priv, NULL, NULL); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; + } + + return ret; +} + +/* + * Remove all subdirectories of mon_data of ctrl_mon groups + * and monitor groups with given domain id. 
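+ * This runs when a domain goes offline so that stale
+ * "mon_<resource>_<id>" directories do not linger under mon_data.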
+ */ +static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + unsigned int dom_id) +{ + struct rdtgroup *prgrp, *crgrp; + char name[32]; + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + sprintf(name, "mon_%s_%02d", r->name, dom_id); + kernfs_remove_by_name(prgrp->mon.mon_data_kn, name); + + list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) + kernfs_remove_by_name(crgrp->mon.mon_data_kn, name); + } +} + +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, + struct rdt_domain *d, + struct rdt_resource *r, struct rdtgroup *prgrp) +{ + union mon_data_bits priv; + struct kernfs_node *kn; + struct mon_evt *mevt; + struct rmid_read rr; + char name[32]; + int ret; + + sprintf(name, "mon_%s_%02d", r->name, d->id); + /* create the directory */ + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + + if (WARN_ON(list_empty(&r->evt_list))) { + ret = -EPERM; + goto out_destroy; + } + + priv.u.rid = r->rid; + priv.u.domid = d->id; + list_for_each_entry(mevt, &r->evt_list, list) { + priv.u.evtid = mevt->evtid; + ret = mon_addfile(kn, mevt->name, priv.priv); + if (ret) + goto out_destroy; + + if (resctrl_is_mbm_event(mevt->evtid)) + mon_event_read(&rr, r, d, prgrp, mevt->evtid, true); + } + kernfs_activate(kn); + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +/* + * Add all subdirectories of mon_data for "ctrl_mon" groups + * and "monitor" groups with given domain id. + */ +static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + struct rdt_domain *d) +{ + struct kernfs_node *parent_kn; + struct rdtgroup *prgrp, *crgrp; + struct list_head *head; + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + parent_kn = prgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, prgrp); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + parent_kn = crgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, crgrp); + } + } +} + +static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, + struct rdt_resource *r, + struct rdtgroup *prgrp) +{ + struct rdt_domain *dom; + int ret; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(dom, &r->domains, list) { + ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); + if (ret) + return ret; + } + + return 0; +} + +/* + * This creates a directory mon_data which contains the monitored data. + * + * mon_data has one directory for each domain which are named + * in the format mon_<domain_name>_<domain_id>. For ex: A mon_data + * with L3 domain looks as below: + * ./mon_data: + * mon_L3_00 + * mon_L3_01 + * mon_L3_02 + * ... + * + * Each domain directory has one file per event: + * ./mon_L3_00/: + * llc_occupancy + * + */ +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **dest_kn) +{ + enum resctrl_res_level i; + struct rdt_resource *r; + struct kernfs_node *kn; + int ret; + + /* + * Create the mon_data directory first. + */ + ret = mongroup_create_dir(parent_kn, prgrp, "mon_data", &kn); + if (ret) + return ret; + + if (dest_kn) + *dest_kn = kn; + + /* + * Create the subdirectories for each domain. 
Note that all events + * in a domain like L3 are grouped into a resource whose domain is L3 + */ + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + r = resctrl_arch_get_resource(i); + if (!r->mon_capable) + continue; + + ret = mkdir_mondata_subdir_alldom(kn, r, prgrp); + if (ret) + goto out_destroy; + } + + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +/** + * cbm_ensure_valid - Enforce validity on provided CBM + * @_val: Candidate CBM + * @r: RDT resource to which the CBM belongs + * + * The provided CBM represents all cache portions available for use. This + * may be represented by a bitmap that does not consist of contiguous ones + * and thus be an invalid CBM. + * Here the provided CBM is forced to be a valid CBM by only considering + * the first set of contiguous bits as valid and clearing all bits. + * The intention here is to provide a valid default CBM with which a new + * resource group is initialized. The user can follow this with a + * modification to the CBM if the default does not satisfy the + * requirements. + */ +static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r) +{ + unsigned int cbm_len = r->cache.cbm_len; + unsigned long first_bit, zero_bit; + unsigned long val = _val; + + if (!val) + return 0; + + first_bit = find_first_bit(&val, cbm_len); + zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); + + /* Clear any remaining bits to ensure contiguous region */ + bitmap_clear(&val, zero_bit, cbm_len - zero_bit); + return (u32)val; +} + +/* + * Initialize cache resources per RDT domain + * + * Set the RDT domain up to start off with all usable allocations. That is, + * all shareable and unused bits. All-zero CBM is invalid. + */ +static int __init_one_rdt_domain(struct rdt_domain *d, struct resctrl_schema *s, + u32 closid) +{ + enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type); + enum resctrl_conf_type t = s->conf_type; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + u32 used_b = 0, unused_b = 0; + unsigned long tmp_cbm; + enum rdtgrp_mode mode; + u32 peer_ctl, ctrl_val; + int i; + + cfg = &d->staged_config[t]; + cfg->have_new_ctrl = false; + cfg->new_ctrl = r->cache.shareable_bits; + used_b = r->cache.shareable_bits; + for (i = 0; i < closids_supported(); i++) { + if (closid_allocated(i) && i != closid) { + mode = rdtgroup_mode_by_closid(i); + if (mode == RDT_MODE_PSEUDO_LOCKSETUP) + /* + * ctrl values for locksetup aren't relevant + * until the schemata is written, and the mode + * becomes RDT_MODE_PSEUDO_LOCKED. + */ + continue; + /* + * If CDP is active include peer domain's + * usage to ensure there is no overlap + * with an exclusive group. + */ + if (resctrl_arch_get_cdp_enabled(r->rid)) + peer_ctl = resctrl_arch_get_config(r, d, i, + peer_type); + else + peer_ctl = 0; + ctrl_val = resctrl_arch_get_config(r, d, i, + s->conf_type); + used_b |= ctrl_val | peer_ctl; + if (mode == RDT_MODE_SHAREABLE) + cfg->new_ctrl |= ctrl_val | peer_ctl; + } + } + if (d->plr && d->plr->cbm > 0) + used_b |= d->plr->cbm; + unused_b = used_b ^ (BIT_MASK(r->cache.cbm_len) - 1); + unused_b &= BIT_MASK(r->cache.cbm_len) - 1; + cfg->new_ctrl |= unused_b; + /* + * Force the initial CBM to be valid, user can + * modify the CBM based on system availability. + */ + cfg->new_ctrl = cbm_ensure_valid(cfg->new_ctrl, r); + /* + * Assign the u32 CBM to an unsigned long to ensure that + * bitmap_weight() does not access out-of-bound memory. 
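+	 * (The bitmap helpers always read whole unsigned longs.)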
+ */ + tmp_cbm = cfg->new_ctrl; + if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) { + rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->id); + return -ENOSPC; + } + cfg->have_new_ctrl = true; + + return 0; +} + +/* + * Initialize cache resources with default values. + * + * A new RDT group is being created on an allocation capable (CAT) + * supporting system. Set this group up to start off with all usable + * allocations. + * + * If there are no more shareable bits available on any domain then + * the entire allocation will fail. + */ +static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) +{ + struct rdt_domain *d; + int ret; + + list_for_each_entry(d, &s->res->domains, list) { + ret = __init_one_rdt_domain(d, s, closid); + if (ret < 0) + return ret; + } + + return 0; +} + +/* Initialize MBA resource with default values. */ +static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) +{ + struct resctrl_staged_config *cfg; + struct rdt_domain *d; + + list_for_each_entry(d, &r->domains, list) { + if (is_mba_sc(r)) { + d->mbps_val[closid] = MBA_MAX_MBPS; + continue; + } + + cfg = &d->staged_config[CDP_NONE]; + cfg->new_ctrl = r->default_ctrl; + cfg->have_new_ctrl = true; + } +} + +/* Initialize the RDT group's allocations. */ +static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) +{ + struct resctrl_schema *s; + struct rdt_resource *r; + int ret = 0; + + rdt_staged_configs_clear(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_MBA || + r->rid == RDT_RESOURCE_SMBA) { + rdtgroup_init_mba(r, rdtgrp->closid); + if (is_mba_sc(r)) + continue; + } else { + ret = rdtgroup_init_cat(s, rdtgrp->closid); + if (ret < 0) + goto out; + } + + ret = resctrl_arch_update_domains(r, rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Failed to initialize allocations\n"); + goto out; + } + + } + + rdtgrp->mode = RDT_MODE_SHAREABLE; + +out: + rdt_staged_configs_clear(); + return ret; +} + +static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) +{ + int ret; + + if (!resctrl_arch_mon_capable()) + return 0; + + ret = alloc_rmid(rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + + ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + return ret; + } + + return 0; +} + +static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) +{ + if (resctrl_arch_mon_capable()) + free_rmid(rgrp->closid, rgrp->mon.rmid); +} + +static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, + const char *name, umode_t mode, + enum rdt_group_type rtype, struct rdtgroup **r) +{ + struct rdtgroup *prdtgrp, *rdtgrp; + unsigned long files = 0; + struct kernfs_node *kn; + int ret; + + prdtgrp = rdtgroup_kn_lock_live(parent_kn); + if (!prdtgrp) { + ret = -ENODEV; + goto out_unlock; + } + + if (rtype == RDTMON_GROUP && + (prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto out_unlock; + } + + /* allocate the rdtgroup. 
*/ + rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL); + if (!rdtgrp) { + ret = -ENOSPC; + rdt_last_cmd_puts("Kernel out of memory\n"); + goto out_unlock; + } + *r = rdtgrp; + rdtgrp->mon.parent = prdtgrp; + rdtgrp->type = rtype; + INIT_LIST_HEAD(&rdtgrp->mon.crdtgrp_list); + + /* kernfs creates the directory for rdtgrp */ + kn = kernfs_create_dir(parent_kn, name, mode, rdtgrp); + if (IS_ERR(kn)) { + ret = PTR_ERR(kn); + rdt_last_cmd_puts("kernfs create error\n"); + goto out_free_rgrp; + } + rdtgrp->kn = kn; + + /* + * kernfs_remove() will drop the reference count on "kn" which + * will free it. But we still need it to stick around for the + * rdtgroup_kn_unlock(kn) call. Take one extra reference here, + * which will be dropped by kernfs_put() in rdtgroup_remove(). + */ + kernfs_get(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + rdt_last_cmd_puts("kernfs perm error\n"); + goto out_destroy; + } + + if (rtype == RDTCTRL_GROUP) { + files = RFTYPE_BASE | RFTYPE_CTRL; + if (resctrl_arch_mon_capable()) + files |= RFTYPE_MON; + } else { + files = RFTYPE_BASE | RFTYPE_MON; + } + + ret = rdtgroup_add_files(kn, files); + if (ret) { + rdt_last_cmd_puts("kernfs fill error\n"); + goto out_destroy; + } + + /* + * The caller unlocks the parent_kn upon success. + */ + return 0; + +out_destroy: + kernfs_put(rdtgrp->kn); + kernfs_remove(rdtgrp->kn); +out_free_rgrp: + kfree(rdtgrp); +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) +{ + kernfs_remove(rgrp->kn); + rdtgroup_remove(rgrp); +} + +/* + * Create a monitor group under "mon_groups" directory of a control + * and monitor group(ctrl_mon). This is a resource group + * to monitor a subset of tasks and cpus in its parent ctrl_mon group. + */ +static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp, *prgrp; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTMON_GROUP, &rdtgrp); + if (ret) + return ret; + + prgrp = rdtgrp->mon.parent; + rdtgrp->closid = prgrp->closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) { + mkdir_rdt_prepare_clean(rdtgrp); + goto out_unlock; + } + + kernfs_activate(rdtgrp->kn); + + /* + * Add the rdtgrp to the list of rdtgrps the parent + * ctrl_mon group has to track. + */ + list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list); + +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +/* + * These are rdtgroups created under the root directory. Can be used + * to allocate and monitor resources. + */ +static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp; + struct kernfs_node *kn; + u32 closid; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp); + if (ret) + return ret; + + kn = rdtgrp->kn; + ret = closid_alloc(); + if (ret < 0) { + rdt_last_cmd_puts("Out of CLOSIDs\n"); + goto out_common_fail; + } + closid = ret; + ret = 0; + + rdtgrp->closid = closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) + goto out_closid_free; + + kernfs_activate(rdtgrp->kn); + + ret = rdtgroup_init_alloc(rdtgrp); + if (ret < 0) + goto out_rmid_free; + + list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); + + if (resctrl_arch_mon_capable()) { + /* + * Create an empty mon_groups directory to hold the subset + * of tasks and cpus to monitor. 
+ */ + ret = mongroup_create_dir(kn, rdtgrp, "mon_groups", NULL); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + goto out_del_list; + } + } + + goto out_unlock; + +out_del_list: + list_del(&rdtgrp->rdtgroup_list); +out_rmid_free: + mkdir_rdt_prepare_rmid_free(rdtgrp); +out_closid_free: + closid_free(closid); +out_common_fail: + mkdir_rdt_prepare_clean(rdtgrp); +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +/* + * We allow creating mon groups only with in a directory called "mon_groups" + * which is present in every ctrl_mon group. Check if this is a valid + * "mon_groups" directory. + * + * 1. The directory should be named "mon_groups". + * 2. The mon group itself should "not" be named "mon_groups". + * This makes sure "mon_groups" directory always has a ctrl_mon group + * as parent. + */ +static bool is_mon_groups(struct kernfs_node *kn, const char *name) +{ + return (!strcmp(kn->name, "mon_groups") && + strcmp(name, "mon_groups")); +} + +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name, + umode_t mode) +{ + /* Do not accept '\n' to avoid unparsable situation. */ + if (strchr(name, '\n')) + return -EINVAL; + + /* + * If the parent directory is the root directory and RDT + * allocation is supported, add a control and monitoring + * subdirectory + */ + if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn) + return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode); + + /* + * If RDT monitoring is supported and the parent directory is a valid + * "mon_groups" directory, add a monitoring subdirectory. + */ + if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name)) + return rdtgroup_mkdir_mon(parent_kn, name, mode); + + return -EPERM; +} + +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +{ + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; + int cpu; + + /* Give any tasks back to the parent group */ + rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask); + + /* Update per cpu rmid of the moved CPUs first */ + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, rdtgrp->closid, + prdtgrp->mon.rmid); + + /* + * Update the MSR on moved CPUs and CPUs which have moved + * task running on them. + */ + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); + update_closid_rmid(tmpmask, NULL); + + rdtgrp->flags = RDT_DELETED; + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + /* + * Remove the rdtgrp from the parent ctrl_mon group's list + */ + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); + list_del(&rdtgrp->mon.crdtgrp_list); + + kernfs_remove(rdtgrp->kn); + + return 0; +} + +static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp) +{ + rdtgrp->flags = RDT_DELETED; + list_del(&rdtgrp->rdtgroup_list); + + kernfs_remove(rdtgrp->kn); + return 0; +} + +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask) +{ + u32 closid, rmid; + int cpu; + + /* Give any tasks back to the default group */ + rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask); + + /* Give any CPUs back to the default group */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + /* Update per cpu closid and rmid of the moved CPUs first */ + closid = rdtgroup_default.closid; + rmid = rdtgroup_default.mon.rmid; + for_each_cpu(cpu, &rdtgrp->cpu_mask) + resctrl_arch_set_cpu_default_closid_rmid(cpu, closid, rmid); + + /* + * Update the MSR on moved CPUs and CPUs which have moved + * task running on them. 
+ */ + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); + update_closid_rmid(tmpmask, NULL); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + closid_free(rdtgrp->closid); + + rdtgroup_ctrl_remove(rdtgrp); + + /* + * Free all the child monitor group rmids. + */ + free_all_child_rdtgrp(rdtgrp); + + return 0; +} + +static int rdtgroup_rmdir(struct kernfs_node *kn) +{ + struct kernfs_node *parent_kn = kn->parent; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret = 0; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + + rdtgrp = rdtgroup_kn_lock_live(kn); + if (!rdtgrp) { + ret = -EPERM; + goto out; + } + + /* + * If the rdtgroup is a ctrl_mon group and parent directory + * is the root directory, remove the ctrl_mon group. + * + * If the rdtgroup is a mon group and parent directory + * is a valid "mon_groups" directory, remove the mon group. + */ + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn && + rdtgrp != &rdtgroup_default) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + ret = rdtgroup_ctrl_remove(rdtgrp); + } else { + ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask); + } + } else if (rdtgrp->type == RDTMON_GROUP && + is_mon_groups(parent_kn, kn->name)) { + ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask); + } else { + ret = -EPERM; + } + +out: + rdtgroup_kn_unlock(kn); + free_cpumask_var(tmpmask); + return ret; +} + +/** + * mongrp_reparent() - replace parent CTRL_MON group of a MON group + * @rdtgrp: the MON group whose parent should be replaced + * @new_prdtgrp: replacement parent CTRL_MON group for @rdtgrp + * @cpus: cpumask provided by the caller for use during this call + * + * Replaces the parent CTRL_MON group for a MON group, resulting in all member + * tasks' CLOSID immediately changing to that of the new parent group. + * Monitoring data for the group is unaffected by this operation. + */ +static void mongrp_reparent(struct rdtgroup *rdtgrp, + struct rdtgroup *new_prdtgrp, + cpumask_var_t cpus) +{ + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; + + WARN_ON(rdtgrp->type != RDTMON_GROUP); + WARN_ON(new_prdtgrp->type != RDTCTRL_GROUP); + + /* Nothing to do when simply renaming a MON group. */ + if (prdtgrp == new_prdtgrp) + return; + + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); + list_move_tail(&rdtgrp->mon.crdtgrp_list, + &new_prdtgrp->mon.crdtgrp_list); + + rdtgrp->mon.parent = new_prdtgrp; + rdtgrp->closid = new_prdtgrp->closid; + + /* Propagate updated closid to all tasks in this group. */ + rdt_move_group_tasks(rdtgrp, rdtgrp, cpus); + + update_closid_rmid(cpus, NULL); +} + +static int rdtgroup_rename(struct kernfs_node *kn, + struct kernfs_node *new_parent, const char *new_name) +{ + struct rdtgroup *new_prdtgrp; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret; + + rdtgrp = kernfs_to_rdtgroup(kn); + new_prdtgrp = kernfs_to_rdtgroup(new_parent); + if (!rdtgrp || !new_prdtgrp) + return -ENOENT; + + /* Release both kernfs active_refs before obtaining rdtgroup mutex. */ + rdtgroup_kn_get(rdtgrp, kn); + rdtgroup_kn_get(new_prdtgrp, new_parent); + + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + /* + * Don't allow kernfs_to_rdtgroup() to return a parent rdtgroup if + * either kernfs_node is a file. 
+ */ + if (kernfs_type(kn) != KERNFS_DIR || + kernfs_type(new_parent) != KERNFS_DIR) { + rdt_last_cmd_puts("Source and destination must be directories"); + ret = -EPERM; + goto out; + } + + if ((rdtgrp->flags & RDT_DELETED) || (new_prdtgrp->flags & RDT_DELETED)) { + ret = -ENOENT; + goto out; + } + + if (rdtgrp->type != RDTMON_GROUP || !kn->parent || + !is_mon_groups(kn->parent, kn->name)) { + rdt_last_cmd_puts("Source must be a MON group\n"); + ret = -EPERM; + goto out; + } + + if (!is_mon_groups(new_parent, new_name)) { + rdt_last_cmd_puts("Destination must be a mon_groups subdirectory\n"); + ret = -EPERM; + goto out; + } + + /* + * If the MON group is monitoring CPUs, the CPUs must be assigned to the + * current parent CTRL_MON group and therefore cannot be assigned to + * the new parent, making the move illegal. + */ + if (!cpumask_empty(&rdtgrp->cpu_mask) && + rdtgrp->mon.parent != new_prdtgrp) { + rdt_last_cmd_puts("Cannot move a MON group that monitors CPUs\n"); + ret = -EPERM; + goto out; + } + + /* + * Allocate the cpumask for use in mongrp_reparent() to avoid the + * possibility of failing to allocate it after kernfs_rename() has + * succeeded. + */ + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) { + ret = -ENOMEM; + goto out; + } + + /* + * Perform all input validation and allocations needed to ensure + * mongrp_reparent() will succeed before calling kernfs_rename(), + * otherwise it would be necessary to revert this call if + * mongrp_reparent() failed. + */ + ret = kernfs_rename(kn, new_parent, new_name); + if (!ret) + mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask); + + free_cpumask_var(tmpmask); + +out: + mutex_unlock(&rdtgroup_mutex); + rdtgroup_kn_put(rdtgrp, kn); + rdtgroup_kn_put(new_prdtgrp, new_parent); + return ret; +} + +static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf) +{ + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3)) + seq_puts(seq, ",cdp"); + + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) + seq_puts(seq, ",cdpl2"); + + if (is_mba_sc(resctrl_arch_get_resource(RDT_RESOURCE_MBA))) + seq_puts(seq, ",mba_MBps"); + + if (resctrl_debug) + seq_puts(seq, ",debug"); + + return 0; +} + +static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = { + .mkdir = rdtgroup_mkdir, + .rmdir = rdtgroup_rmdir, + .rename = rdtgroup_rename, + .show_options = rdtgroup_show_options, +}; + +static int rdtgroup_setup_root(struct rdt_fs_context *ctx) +{ + rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops, + KERNFS_ROOT_CREATE_DEACTIVATED | + KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK, + &rdtgroup_default); + if (IS_ERR(rdt_root)) + return PTR_ERR(rdt_root); + + ctx->kfc.root = rdt_root; + rdtgroup_default.kn = kernfs_root_to_node(rdt_root); + + return 0; +} + +static void rdtgroup_destroy_root(void) +{ + kernfs_destroy_root(rdt_root); + rdtgroup_default.kn = NULL; +} + +static void __init rdtgroup_setup_default(void) +{ + mutex_lock(&rdtgroup_mutex); + + rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID; + rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID; + rdtgroup_default.type = RDTCTRL_GROUP; + INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list); + + list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups); + + mutex_unlock(&rdtgroup_mutex); +} + +static void domain_destroy_mon_state(struct rdt_domain *d) +{ + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_total); + kfree(d->mbm_local); +} + +void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d) +{ + mutex_lock(&rdtgroup_mutex); + + if (supports_mba_mbps() && r->rid == 
RDT_RESOURCE_MBA) + mba_sc_domain_destroy(r, d); + + if (!r->mon_capable) + goto out_unlock; + + /* + * If resctrl is mounted, remove all the + * per domain monitor data directories. + */ + if (resctrl_mounted && resctrl_arch_mon_capable()) + rmdir_mondata_subdir_allrdtgrp(r, d->id); + + if (resctrl_is_mbm_enabled()) + cancel_delayed_work(&d->mbm_over); + if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) { + /* + * When a package is going down, forcefully + * decrement rmid->ebusy. There is no way to know + * that the L3 was flushed and hence may lead to + * incorrect counts in rare scenarios, but leaving + * the RMID as busy creates RMID leaks if the + * package never comes back. + */ + __check_limbo(d, true); + cancel_delayed_work(&d->cqm_limbo); + } + + domain_destroy_mon_state(d); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); +} + +static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + size_t tsize; + + if (resctrl_arch_is_llc_occupancy_enabled()) { + d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL); + if (!d->rmid_busy_llc) + return -ENOMEM; + } + if (resctrl_arch_is_mbm_total_enabled()) { + tsize = sizeof(*d->mbm_total); + d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_total) { + bitmap_free(d->rmid_busy_llc); + return -ENOMEM; + } + } + if (resctrl_arch_is_mbm_local_enabled()) { + tsize = sizeof(*d->mbm_local); + d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_local) { + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_total); + return -ENOMEM; + } + } + + return 0; +} + +int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d) +{ + int err = 0; + + mutex_lock(&rdtgroup_mutex); + + if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) { + /* RDT_RESOURCE_MBA is never mon_capable */ + err = mba_sc_domain_allocate(r, d); + goto out_unlock; + } + + if (!r->mon_capable) + goto out_unlock; + + err = domain_setup_mon_state(r, d); + if (err) + goto out_unlock; + + if (resctrl_is_mbm_enabled()) { + INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow); + mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); + } + + if (resctrl_arch_is_llc_occupancy_enabled()) + INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo); + + /* + * If the filesystem is not mounted then only the default resource group + * exists. Creation of its directories is deferred until mount time + * by rdt_get_tree() calling mkdir_mondata_all(). + * If resctrl is mounted, add per domain monitor data directories. + */ + if (resctrl_mounted && resctrl_arch_mon_capable()) + mkdir_mondata_subdir_allrdtgrp(r, d); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; +} + +void resctrl_online_cpu(unsigned int cpu) +{ + mutex_lock(&rdtgroup_mutex); + /* The CPU is set in default rdtgroup after online. 
*/ + cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask); + mutex_unlock(&rdtgroup_mutex); +} + +static void clear_childcpus(struct rdtgroup *r, unsigned int cpu) +{ + struct rdtgroup *cr; + + list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) { + if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) + break; + } +} + +void resctrl_offline_cpu(unsigned int cpu) +{ + struct rdt_resource *l3 = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct rdtgroup *rdtgrp; + struct rdt_domain *d; + + mutex_lock(&rdtgroup_mutex); + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) { + clear_childcpus(rdtgrp, cpu); + break; + } + } + + if (!l3->mon_capable) + goto out_unlock; + + d = resctrl_get_domain_from_cpu(cpu, l3); + if (d) { + if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) { + cancel_delayed_work(&d->mbm_over); + mbm_setup_overflow_handler(d, 0, cpu); + } + if (resctrl_arch_is_llc_occupancy_enabled() && + cpu == d->cqm_work_cpu && has_busy_rmid(d)) { + cancel_delayed_work(&d->cqm_limbo); + cqm_setup_limbo_handler(d, 0, cpu); + } + } + +out_unlock: + mutex_unlock(&rdtgroup_mutex); +} + +/* + * resctrl_init - resctrl filesystem initialization + * + * Setup resctrl file system including set up root, create mount point, + * register resctrl filesystem, and initialize files under root directory. + * + * Return: 0 on success or -errno + */ +int resctrl_init(void) +{ + int ret = 0; + + seq_buf_init(&last_cmd_status, last_cmd_status_buf, + sizeof(last_cmd_status_buf)); + + rdtgroup_setup_default(); + + thread_throttle_mode_init(); + + ret = resctrl_mon_resource_init(); + if (ret) + return ret; + + ret = sysfs_create_mount_point(fs_kobj, "resctrl"); + if (ret) + return ret; + + ret = register_filesystem(&rdt_fs_type); + if (ret) + goto cleanup_mountpoint; + + /* + * Adding the resctrl debugfs directory here may not be ideal since + * it would let the resctrl debugfs directory appear on the debugfs + * filesystem before the resctrl filesystem is mounted. + * It may also be ok since that would enable debugging of RDT before + * resctrl is mounted. + * The reason why the debugfs directory is created here and not in + * rdt_get_tree() is because rdt_get_tree() takes rdtgroup_mutex and + * during the debugfs directory creation also &sb->s_type->i_mutex_key + * (the lockdep class of inode->i_rwsem). Other filesystem + * interactions (eg. SyS_getdents) have the lock ordering: + * &sb->s_type->i_mutex_key --> &mm->mmap_lock + * During mmap(), called with &mm->mmap_lock, the rdtgroup_mutex + * is taken, thus creating dependency: + * &mm->mmap_lock --> rdtgroup_mutex for the latter that can cause + * issues considering the other two lock dependencies. + * By creating the debugfs directory here we avoid a dependency + * that may cause deadlock (even though file operations cannot + * occur until the filesystem is mounted, but I do not know how to + * tell lockdep that). + */ + debugfs_resctrl = debugfs_create_dir("resctrl", NULL); + + return 0; + +cleanup_mountpoint: + sysfs_remove_mount_point(fs_kobj, "resctrl"); + + return ret; +} + +void resctrl_exit(void) +{ + debugfs_remove_recursive(debugfs_resctrl); + unregister_filesystem(&rdt_fs_type); + sysfs_remove_mount_point(fs_kobj, "resctrl"); + + resctrl_mon_resource_exit(); +} -- 2.25.1
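[Illustration, not part of the patch] The mkdir/rmdir/rename hooks above are driven entirely by ordinary directory syscalls on a mounted resctrl filesystem. A minimal user-space sketch, assuming resctrl is mounted at /sys/fs/resctrl on an L3-monitoring-capable system (the group names "demo" and "web" and the mon_L3_00 domain directory are invented for the example):

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define RESCTRL "/sys/fs/resctrl"

int main(void)
{
	char buf[64];
	FILE *f;

	/* mkdir() under the root invokes rdtgroup_mkdir_ctrl_mon() */
	if (mkdir(RESCTRL "/demo", 0755) && errno != EEXIST) {
		perror("mkdir ctrl_mon group");
		return 1;
	}

	/* mkdir() under mon_groups invokes rdtgroup_mkdir_mon() */
	if (mkdir(RESCTRL "/demo/mon_groups/web", 0755) && errno != EEXIST) {
		perror("mkdir mon group");
		return 1;
	}

	/* mkdir_mondata_all() populated mon_data with per-domain event files */
	f = fopen(RESCTRL "/demo/mon_data/mon_L3_00/llc_occupancy", "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("llc_occupancy: %s", buf);
	if (f)
		fclose(f);

	/* rmdir() lands in rdtgroup_rmdir(); children go first */
	rmdir(RESCTRL "/demo/mon_groups/web");
	rmdir(RESCTRL "/demo");
	return 0;
}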

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Add code to head.S's el2_setup to detect MPAM and disable any EL2 traps. MPAM2_EL2 resets to an unknown value; setting it to the default partitions/pmg before we enable the MMU is the best thing to do. Kexec/kdump will depend on this if the previous kernel left the CPU configured with a restrictive configuration. If Linux is booted at the highest implemented exception level, el2_setup will clear the enable bit, disabling MPAM. This code can't be enabled until a subsequent patch adds the Kconfig and cpufeature boilerplate. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/include/asm/el2_setup.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h index b7afaa026842..1e2181820a0a 100644 --- a/arch/arm64/include/asm/el2_setup.h +++ b/arch/arm64/include/asm/el2_setup.h @@ -208,6 +208,21 @@ msr spsr_el2, x0 .endm +.macro __init_el2_mpam +#ifdef CONFIG_ARM64_MPAM + /* Memory Partitioning And Monitoring: disable EL2 traps */ + mrs x1, id_aa64pfr0_el1 + ubfx x0, x1, #ID_AA64PFR0_EL1_MPAM_SHIFT, #4 + cbz x0, .Lskip_mpam_\@ // skip if no MPAM + msr_s SYS_MPAM2_EL2, xzr // use the default partition + // and disable lower traps + mrs_s x0, SYS_MPAMIDR_EL1 + tbz x0, #17, .Lskip_mpam_\@ // skip if no MPAMHCR reg + msr_s SYS_MPAMHCR_EL2, xzr // clear TRAP_MPAMIDR_EL1 -> EL2 +.Lskip_mpam_\@: +#endif /* CONFIG_ARM64_MPAM */ +.endm + /** * Initialize EL2 registers to sane values. This should be called early on all * cores that were booted in EL2. Note that everything gets initialised as @@ -225,6 +240,7 @@ __init_el2_stage2 __init_el2_gicv3 __init_el2_hstr + __init_el2_mpam __init_el2_nvhe_idregs __init_el2_cptr __init_el2_fgt -- 2.25.1
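[Illustration, not part of the patch] A rough C model of the decision tree __init_el2_mpam encodes, operating on sampled register values rather than live system registers; the printf calls stand in for the msr writes, and the field positions are the ones used above:

#include <stdint.h>
#include <stdio.h>

#define PFR0_MPAM_SHIFT	40		/* ID_AA64PFR0_EL1.MPAM, bits [43:40] */
#define MPAMIDR_HAS_HCR	(1ULL << 17)	/* MPAMIDR_EL1 bit 17 */

static void el2_mpam_setup(uint64_t pfr0, uint64_t mpamidr)
{
	if (!((pfr0 >> PFR0_MPAM_SHIFT) & 0xf))
		return;				/* cbz: no MPAM, skip */

	printf("MPAM2_EL2 <- 0 (default partid/pmg, no lower-EL traps)\n");

	if (!(mpamidr & MPAMIDR_HAS_HCR))
		return;				/* tbz: no MPAMHCR_EL2 register */

	printf("MPAMHCR_EL2 <- 0 (clear TRAP_MPAMIDR_EL1 -> EL2)\n");
}

int main(void)
{
	el2_mpam_setup(1ULL << PFR0_MPAM_SHIFT, MPAMIDR_HAS_HCR);
	return 0;
}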

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- ARMv8.4 adds support for 'Memory Partitioning And Monitoring' (MPAM) which describes an interface to cache and bandwidth controls wherever they appear in the system. Add support to detect MPAM. Like SVE, MPAM has an extra id register that describes the virtualisation support, which is optional. Detect this separately so we can detect mismatched/insane systems, but still use MPAM on the host even if the virtualisation support is missing. MPAM needs enabling at the highest implemented exception level, otherwise the register accesses trap. The 'enabled' flag is accessible to lower exception levels, but it's in a register that traps when MPAM isn't enabled. The cpufeature 'matches' hook is extended to test this on one of the CPUs, so that firmware can emulate MPAM as disabled if it is reserved for use by secure world. (If you have a boot failure that bisects here, it's likely your CPUs advertise MPAM in the id registers, but firmware failed to either enable MPAM or emulate the trap as if it were disabled.) Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- .../arch/arm64/cpu-feature-registers.rst | 2 + arch/arm64/Kconfig | 19 ++++- arch/arm64/include/asm/cpu.h | 1 + arch/arm64/include/asm/cpufeature.h | 13 ++++ arch/arm64/include/asm/mpam.h | 76 +++++++++++++++++++ arch/arm64/include/asm/sysreg.h | 8 ++ arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/cpufeature.c | 68 +++++++++++++++++ arch/arm64/kernel/cpuinfo.c | 4 + arch/arm64/kernel/mpam.c | 8 ++ arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/sysreg | 33 ++++++++ 12 files changed, 233 insertions(+), 1 deletion(-) create mode 100644 arch/arm64/include/asm/mpam.h create mode 100644 arch/arm64/kernel/mpam.c diff --git a/Documentation/arch/arm64/cpu-feature-registers.rst b/Documentation/arch/arm64/cpu-feature-registers.rst index de6d8a4790e2..14ea68bcf196 100644 --- a/Documentation/arch/arm64/cpu-feature-registers.rst +++ b/Documentation/arch/arm64/cpu-feature-registers.rst @@ -152,6 +152,8 @@ infrastructure: +------------------------------+---------+---------+ | DIT | [51-48] | y | +------------------------------+---------+---------+ + | MPAM | [43-40] | n | + +------------------------------+---------+---------+ | SVE | [35-32] | y | +------------------------------+---------+---------+ | GIC | [27-24] | n | diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 658c6a61ab6f..471e8129d0fb 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2013,7 +2013,24 @@ config ARM64_TLB_RANGE The feature introduces new assembly instructions, and they were support when binutils >= 2.30. -endmenu # "ARMv8.4 architectural features" +config ARM64_MPAM + bool "Enable support for MPAM" + help + Memory Partitioning and Monitoring is an optional extension + that allows the CPUs to mark load and store transactions with + labels for partition-id and performance-monitoring-group. + System components, such as the caches, can use the partition-id + to apply a performance policy. MPAM monitors can use the + partition-id and performance-monitoring-group to measure the + cache occupancy or data throughput.
+ + Use of this extension requires CPU support, support in the + memory system components (MSC), and a description from firmware + of where the MSC are in the address space. + + MPAM is exposed to user-space via the resctrl pseudo filesystem. + +endmenu menu "ARMv8.5 architectural features" diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h index e749838b9c5d..1cb5bafd9238 100644 --- a/arch/arm64/include/asm/cpu.h +++ b/arch/arm64/include/asm/cpu.h @@ -47,6 +47,7 @@ struct cpuinfo_arm64 { u64 reg_revidr; u64 reg_gmid; u64 reg_smidr; + u64 reg_mpamidr; u64 reg_id_aa64dfr0; u64 reg_id_aa64dfr1; diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 86a475eb720e..876e0ae300fd 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -608,6 +608,13 @@ static inline bool id_aa64pfr1_sme(u64 pfr1) return val > 0; } +static inline bool id_aa64pfr0_mpam(u64 pfr0) +{ + u32 val = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT); + + return val > 0; +} + static inline bool id_aa64pfr1_mte(u64 pfr1) { u32 val = cpuid_feature_extract_unsigned_field(pfr1, ID_AA64PFR1_EL1_MTE_SHIFT); @@ -817,6 +824,12 @@ static inline bool system_supports_tlb_range(void) return alternative_has_cap_unlikely(ARM64_HAS_TLB_RANGE); } +static inline bool cpus_support_mpam(void) +{ + return IS_ENABLED(CONFIG_ARM64_MPAM) && + cpus_have_final_cap(ARM64_MPAM); +} + static inline bool system_supports_lpa2(void) { return cpus_have_final_cap(ARM64_HAS_LPA2); diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h new file mode 100644 index 000000000000..a4a969be233a --- /dev/null +++ b/arch/arm64/include/asm/mpam.h @@ -0,0 +1,76 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2021 Arm Ltd. 
*/ + +#ifndef __ASM__MPAM_H +#define __ASM__MPAM_H + +#include <linux/bitops.h> +#include <linux/init.h> +#include <linux/jump_label.h> + +#include <asm/cpucaps.h> +#include <asm/cpufeature.h> +#include <asm/sysreg.h> + +/* CPU Registers */ +#define MPAM_SYSREG_EN BIT_ULL(63) +#define MPAM_SYSREG_TRAP_IDR BIT_ULL(58) +#define MPAM_SYSREG_TRAP_MPAM0_EL1 BIT_ULL(49) +#define MPAM_SYSREG_TRAP_MPAM1_EL1 BIT_ULL(48) +#define MPAM_SYSREG_PMG_D GENMASK(47, 40) +#define MPAM_SYSREG_PMG_I GENMASK(39, 32) +#define MPAM_SYSREG_PARTID_D GENMASK(31, 16) +#define MPAM_SYSREG_PARTID_I GENMASK(15, 0) + +#define MPAMIDR_PMG_MAX GENMASK(40, 32) +#define MPAMIDR_PMG_MAX_SHIFT 32 +#define MPAMIDR_PMG_MAX_LEN 8 +#define MPAMIDR_VPMR_MAX GENMASK(20, 18) +#define MPAMIDR_VPMR_MAX_SHIFT 18 +#define MPAMIDR_VPMR_MAX_LEN 3 +#define MPAMIDR_HAS_HCR BIT(17) +#define MPAMIDR_HAS_HCR_SHIFT 17 +#define MPAMIDR_PARTID_MAX GENMASK(15, 0) +#define MPAMIDR_PARTID_MAX_SHIFT 0 +#define MPAMIDR_PARTID_MAX_LEN 15 + +#define MPAMHCR_EL0_VPMEN BIT_ULL(0) +#define MPAMHCR_EL1_VPMEN BIT_ULL(1) +#define MPAMHCR_GSTAPP_PLK BIT_ULL(8) +#define MPAMHCR_TRAP_MPAMIDR BIT_ULL(31) + +/* Properties of the VPM registers */ +#define MPAM_VPM_NUM_REGS 8 +#define MPAM_VPM_PARTID_LEN 16 +#define MPAM_VPM_PARTID_MASK 0xffff +#define MPAM_VPM_REG_LEN 64 +#define MPAM_VPM_PARTIDS_PER_REG (MPAM_VPM_REG_LEN / MPAM_VPM_PARTID_LEN) +#define MPAM_VPM_MAX_PARTID (MPAM_VPM_NUM_REGS * MPAM_VPM_PARTIDS_PER_REG) + + +DECLARE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); + +/* check whether all CPUs have MPAM support */ +static inline bool mpam_cpus_have_feature(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + return cpus_have_final_cap(ARM64_MPAM); + return false; +} + +/* check whether all CPUs have MPAM virtualisation support */ +static inline bool mpam_cpus_have_mpam_hcr(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + return static_branch_unlikely(&arm64_mpam_has_hcr); + return false; +} + +/* enable MPAM virtualisation support */ +static inline void __init __enable_mpam_hcr(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + static_branch_enable(&arm64_mpam_has_hcr); +} + +#endif /* __ASM__MPAM_H */ diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index bc782116315a..7e80440a6414 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ -515,6 +515,13 @@ #define SYS_MAIR_EL2 sys_reg(3, 4, 10, 2, 0) #define SYS_AMAIR_EL2 sys_reg(3, 4, 10, 3, 0) +#define SYS_MPAMHCR_EL2 sys_reg(3, 4, 10, 4, 0) +#define SYS_MPAMVPMV_EL2 sys_reg(3, 4, 10, 4, 1) +#define SYS_MPAM2_EL2 sys_reg(3, 4, 10, 5, 0) + +#define __VPMn_op2(n) ((n) & 0x7) +#define SYS_MPAM_VPMn_EL2(n) sys_reg(3, 4, 10, 6, __VPMn_op2(n)) + #define SYS_VBAR_EL2 sys_reg(3, 4, 12, 0, 0) #define SYS_RVBAR_EL2 sys_reg(3, 4, 12, 0, 1) #define SYS_RMR_EL2 sys_reg(3, 4, 12, 0, 2) @@ -579,6 +586,7 @@ #define SYS_TFSR_EL12 sys_reg(3, 5, 5, 6, 0) #define SYS_MAIR_EL12 sys_reg(3, 5, 10, 2, 0) #define SYS_AMAIR_EL12 sys_reg(3, 5, 10, 3, 0) +#define SYS_MPAM1_EL12 sys_reg(3, 5, 10, 5, 0) #define SYS_VBAR_EL12 sys_reg(3, 5, 12, 0, 0) #define SYS_CNTKCTL_EL12 sys_reg(3, 5, 14, 1, 0) #define SYS_CNTP_TVAL_EL02 sys_reg(3, 5, 14, 2, 0) diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index d95b3d6b471a..685e0a58a4c6 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -69,6 +69,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o obj-$(CONFIG_CRASH_CORE) += crash_core.o obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o obj-$(CONFIG_ARM64_PTR_AUTH) += 
pointer_auth.o +obj-$(CONFIG_ARM64_MPAM) += mpam.o obj-$(CONFIG_ARM64_MTE) += mte.o obj-y += vdso-wrap.o obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 58156edbcf01..d76225b4a00f 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -84,6 +84,7 @@ #include <asm/insn.h> #include <asm/kvm_host.h> #include <asm/mmu_context.h> +#include <asm/mpam.h> #include <asm/mte.h> #include <asm/processor.h> #include <asm/smp.h> @@ -623,6 +624,18 @@ static const struct arm64_ftr_bits ftr_smcr[] = { ARM64_FTR_END, }; +static const struct arm64_ftr_bits ftr_mpamidr[] = { + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_PMG_MAX_SHIFT, MPAMIDR_PMG_MAX_LEN, 0), /* PMG_MAX */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_VPMR_MAX_SHIFT, MPAMIDR_VPMR_MAX_LEN, 0), /* VPMR_MAX */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, + MPAMIDR_HAS_HCR_SHIFT, 1, 0), /* HAS_HCR */ + ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, + MPAMIDR_PARTID_MAX_SHIFT, MPAMIDR_PARTID_MAX_LEN, 0), /* PARTID_MAX */ + ARM64_FTR_END, +}; + /* * Common ftr bits for a 32bit register with all hidden, strict * attributes, with 4bit feature fields and a default safe value of @@ -739,6 +752,9 @@ static const struct __ftr_reg_entry { ARM64_FTR_REG(SYS_ZCR_EL1, ftr_zcr), ARM64_FTR_REG(SYS_SMCR_EL1, ftr_smcr), + /* Op1 = 0, CRn = 10, CRm = 4 */ + ARM64_FTR_REG(SYS_MPAMIDR_EL1, ftr_mpamidr), + /* Op1 = 1, CRn = 0, CRm = 0 */ ARM64_FTR_REG(SYS_GMID_EL1, ftr_gmid), @@ -1066,6 +1082,9 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info) cpacr_restore(cpacr); } + if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) + init_cpu_ftr_reg(SYS_MPAMIDR_EL1, info->reg_mpamidr); + if (id_aa64pfr1_mte(info->reg_id_aa64pfr1)) init_cpu_ftr_reg(SYS_GMID_EL1, info->reg_gmid); @@ -1333,6 +1352,11 @@ void update_cpu_features(int cpu, cpacr_restore(cpacr); } + if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) { + taint |= check_update_ftr_reg(SYS_MPAMIDR_EL1, cpu, + info->reg_mpamidr, boot->reg_mpamidr); + } + /* * The kernel uses the LDGM/STGM instructions and the number of tags * they read/write depends on the GMID_EL1.BS field. Check that the @@ -2313,6 +2337,39 @@ cpucap_panic_on_conflict(const struct arm64_cpu_capabilities *cap) return !!(cap->type & ARM64_CPUCAP_PANIC_ON_CONFLICT); } +static bool __maybe_unused +test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope) +{ + if (!has_cpuid_feature(entry, scope)) + return false; + + /* Check firmware actually enabled MPAM on this cpu. */ + return (read_sysreg_s(SYS_MPAM1_EL1) & MPAM_SYSREG_EN); +} + +static void __maybe_unused +cpu_enable_mpam(const struct arm64_cpu_capabilities *entry) +{ + /* + * Access by the kernel (at EL1) should use the reserved PARTID + * which is configured unrestricted. This avoids priority-inversion + * where latency sensitive tasks have to wait for a task that has + * been throttled to release the lock. 
+ */ + write_sysreg_s(0, SYS_MPAM1_EL1); +} + +static void mpam_extra_caps(void) +{ + u64 idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); + + if (!IS_ENABLED(CONFIG_ARM64_MPAM)) + return; + + if (idr & MPAMIDR_HAS_HCR) + __enable_mpam_hcr(); +} + static const struct arm64_cpu_capabilities arm64_features[] = { { .capability = ARM64_ALWAYS_BOOT, @@ -2799,6 +2856,16 @@ static const struct arm64_cpu_capabilities arm64_features[] = { .type = ARM64_CPUCAP_SYSTEM_FEATURE, .matches = has_lpa2, }, +#ifdef CONFIG_ARM64_MPAM + { + .desc = "Memory Partitioning And Monitoring", + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .capability = ARM64_MPAM, + .matches = test_has_mpam, + .cpu_enable = cpu_enable_mpam, + ARM64_CPUID_FIELDS(ID_AA64PFR0_EL1, MPAM, 1) + }, +#endif {}, }; @@ -3432,6 +3499,7 @@ void __init setup_system_features(void) sve_setup(); sme_setup(); + mpam_extra_caps(); /* * Check for sane CTR_EL0.CWG value. diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c index 98fda8500535..1b1fe0f58a86 100644 --- a/arch/arm64/kernel/cpuinfo.c +++ b/arch/arm64/kernel/cpuinfo.c @@ -460,6 +460,10 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info) if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) __cpuinfo_store_cpu_32bit(&info->aarch32); + if (IS_ENABLED(CONFIG_ARM64_MPAM) && + id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) + info->reg_mpamidr = read_cpuid(MPAMIDR_EL1); + cpuinfo_detect_icache_policy(info); } diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c new file mode 100644 index 000000000000..ff29b666e025 --- /dev/null +++ b/arch/arm64/kernel/mpam.c @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (C) 2021 Arm Ltd. */ + +#include <asm/mpam.h> + +#include <linux/jump_label.h> + +DEFINE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps index 7369d2440af8..296eeb91bfa4 100644 --- a/arch/arm64/tools/cpucaps +++ b/arch/arm64/tools/cpucaps @@ -57,6 +57,7 @@ HW_DBM KVM_HVHE KVM_PROTECTED_MODE MISMATCHED_CACHE_TYPE +MPAM MTE MTE_ASYMM SME diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 76ce150e7347..0e7d7f327410 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -2530,6 +2530,22 @@ Res0 1 Field 0 EN EndSysreg +Sysreg MPAMIDR_EL1 3 0 10 4 4 +Res0 63:62 +Field 61 HAS_SDEFLT +Field 60 HAS_FORCE_NS +Field 59 SP4 +Field 58 HAS_TIDR +Field 57 HAS_ALTSP +Res0 56:40 +Field 39:32 PMG_MAX +Res0 31:21 +Field 20:18 VPMR_MAX +Field 17 HAS_HCR +Res0 16 +Field 15:0 PARTID_MAX +EndSysreg + Sysreg LORID_EL1 3 0 10 4 7 Res0 63:24 Field 23:16 LD @@ -2537,6 +2553,22 @@ Res0 15:8 Field 7:0 LR EndSysreg +Sysreg MPAM1_EL1 3 0 10 5 0 +Res0 63:48 +Field 47:40 PMG_D +Field 39:32 PMG_I +Field 31:16 PARTID_D +Field 15:0 PARTID_I +EndSysreg + +Sysreg MPAM0_EL1 3 0 10 5 1 +Res0 63:48 +Field 47:40 PMG_D +Field 39:32 PMG_I +Field 31:16 PARTID_D +Field 15:0 PARTID_I +EndSysreg + Sysreg ISR_EL1 3 0 12 1 0 Res0 63:11 Field 10 IS @@ -2550,6 +2582,7 @@ EndSysreg Sysreg ICC_NMIAR1_EL1 3 0 12 9 5 Res0 63:24 Field 23:0 INTID + EndSysreg Sysreg TRBLIMITR_EL1 3 0 9 11 0 -- 2.25.1
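[Illustration, not part of the patch] The fields ftr_mpamidr sanitises pack into MPAMIDR_EL1 as described in the sysreg table above; a self-contained decoder, with an invented sample value, might look like:

#include <stdint.h>
#include <stdio.h>

/* Field layout per the MPAMIDR_EL1 sysreg description above */
#define PARTID_MAX(x)	((x) & 0xffffULL)	/* bits [15:0]  */
#define HAS_HCR(x)	(((x) >> 17) & 0x1ULL)	/* bit  17      */
#define VPMR_MAX(x)	(((x) >> 18) & 0x7ULL)	/* bits [20:18] */
#define PMG_MAX(x)	(((x) >> 32) & 0xffULL)	/* bits [39:32] */

int main(void)
{
	/* hypothetical CPU: 64 partids, 2 pmgs, MPAMHCR_EL2 present */
	uint64_t idr = 63ULL | (1ULL << 17) | (1ULL << 32);

	printf("PARTID_MAX=%llu PMG_MAX=%llu VPMR_MAX=%llu HAS_HCR=%llu\n",
	       (unsigned long long)PARTID_MAX(idr),
	       (unsigned long long)PMG_MAX(idr),
	       (unsigned long long)VPMR_MAX(idr),
	       (unsigned long long)HAS_HCR(idr));
	return 0;
}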

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests, but didn't add trap handling. If you are unlucky, this results in an MPAM-aware guest being delivered an undef during boot. The host prints: | kvm [97]: Unsupported guest sys_reg access at: ffff800080024c64 [00000005] | { Op0( 3), Op1( 0), CRn(10), CRm( 5), Op2( 0), func_read }, Which results in: | Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP | Modules linked in: | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc7-00559-gd89c186d50b2 #14616 | Hardware name: linux,dummy-virt (DT) | pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : test_has_mpam+0x18/0x30 | lr : test_has_mpam+0x10/0x30 | sp : ffff80008000bd90 ... | Call trace: | test_has_mpam+0x18/0x30 | update_cpu_capabilities+0x7c/0x11c | setup_cpu_features+0x14/0xd8 | smp_cpus_done+0x24/0xb8 | smp_init+0x7c/0x8c | kernel_init_freeable+0xf8/0x280 | kernel_init+0x24/0x1e0 | ret_from_fork+0x10/0x20 | Code: 910003fd 97ffffde 72001c00 54000080 (d538a500) | ---[ end trace 0000000000000000 ]--- | Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b | ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Add the support to enable the traps, and handle the three guest-accessible registers as RAZ/WI. This allows guests to keep the invariant id-register value, while advertising that MPAM isn't really supported. With MPAM v1.0 we can trap the MPAMIDR_EL1 register only if ARM64_HAS_MPAM_HCR, with v1.1 an additional MPAM2_EL2.TIDR bit traps MPAMIDR_EL1 on platforms that don't have MPAMHCR_EL2. Enable one of these if either is supported. If neither is supported, the guest can discover that the CPU has MPAM support, and how many PARTID etc the host has ... but it can't influence anything, so it's harmless. Full support for the feature would only expose MPAM to the guest if a pseudo-device has been created to describe the virt->phys partid mapping the VMM expects. This will depend on ARM64_HAS_MPAM_HCR. Fixes: 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register") CC: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/linux-arm-kernel/20200925160102.118858-1-james.morse...
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/include/asm/kvm_arm.h | 1 + arch/arm64/include/asm/mpam.h | 4 ++-- arch/arm64/kernel/image-vars.h | 5 ++++ arch/arm64/kvm/hyp/include/hyp/switch.h | 32 +++++++++++++++++++++++++ arch/arm64/kvm/sys_regs.c | 20 ++++++++++++++++ 5 files changed, 60 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index 5e8a0835470d..aeaabcc3eafb 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -104,6 +104,7 @@ #define HCRX_GUEST_FLAGS (HCRX_EL2_SMPME | HCRX_EL2_TCR2En) #define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En) +#define MPAMHCR_HOST_FLAGS 0 /* TCR_EL2 Registers bits */ #define TCR_EL2_DS (1UL << 32) diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index a4a969be233a..576102d510ad 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -51,7 +51,7 @@ DECLARE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); /* check whether all CPUs have MPAM support */ -static inline bool mpam_cpus_have_feature(void) +static __always_inline bool mpam_cpus_have_feature(void) { if (IS_ENABLED(CONFIG_ARM64_MPAM)) return cpus_have_final_cap(ARM64_MPAM); @@ -59,7 +59,7 @@ static inline bool mpam_cpus_have_feature(void) } /* check whether all CPUs have MPAM virtualisation support */ -static inline bool mpam_cpus_have_mpam_hcr(void) +static __always_inline bool mpam_cpus_have_mpam_hcr(void) { if (IS_ENABLED(CONFIG_ARM64_MPAM)) return static_branch_unlikely(&arm64_mpam_has_hcr); diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h index 35f3c7959513..d10d3fed31d9 100644 --- a/arch/arm64/kernel/image-vars.h +++ b/arch/arm64/kernel/image-vars.h @@ -64,6 +64,11 @@ KVM_NVHE_ALIAS(nvhe_hyp_panic_handler); /* Vectors installed by hyp-init on reset HVC. */ KVM_NVHE_ALIAS(__hyp_stub_vectors); +/* Additional static keys for cpufeatures */ +#ifdef CONFIG_ARM64_MPAM +KVM_NVHE_ALIAS(arm64_mpam_has_hcr); +#endif + /* Static keys which are set if a vGIC trap should be handled in hyp. 
*/ KVM_NVHE_ALIAS(vgic_v2_cpuif_trap); KVM_NVHE_ALIAS(vgic_v3_cpuif_trap); diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index 9cfe6bd1dbe4..657320f453e6 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -27,6 +27,7 @@ #include <asm/kvm_hyp.h> #include <asm/kvm_mmu.h> #include <asm/kvm_nested.h> +#include <asm/mpam.h> #include <asm/fpsimd.h> #include <asm/debug-monitors.h> #include <asm/processor.h> @@ -172,6 +173,35 @@ static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu) write_sysreg_s(ctxt_sys_reg(hctxt, HDFGWTR_EL2), SYS_HDFGWTR_EL2); } +static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu) +{ + u64 r = MPAM_SYSREG_TRAP_MPAM0_EL1 | MPAM_SYSREG_TRAP_MPAM1_EL1; + + if (!mpam_cpus_have_feature()) + return; + + /* trap guest access to MPAMIDR_EL1 */ + if (mpam_cpus_have_mpam_hcr()) { + write_sysreg_s(MPAMHCR_TRAP_MPAMIDR, SYS_MPAMHCR_EL2); + } else { + /* From v1.1 TIDR can trap MPAMIDR, set it unconditionally */ + r |= MPAM_SYSREG_TRAP_IDR; + } + + write_sysreg_s(r, SYS_MPAM2_EL2); +} + +static inline void __deactivate_traps_mpam(void) +{ + if (!mpam_cpus_have_feature()) + return; + + write_sysreg_s(0, SYS_MPAM2_EL2); + + if (mpam_cpus_have_mpam_hcr()) + write_sysreg_s(MPAMHCR_HOST_FLAGS, SYS_MPAMHCR_EL2); +} + static inline void __activate_traps_common(struct kvm_vcpu *vcpu) { /* Trap on AArch32 cp15 c15 (impdef sysregs) accesses (EL1 or EL0) */ @@ -212,6 +242,7 @@ static inline void __activate_traps_common(struct kvm_vcpu *vcpu) } __activate_traps_hfgxtr(vcpu); + __activate_traps_mpam(vcpu); } static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) @@ -231,6 +262,7 @@ static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) write_sysreg_s(HCRX_HOST_FLAGS, SYS_HCRX_EL2); __deactivate_traps_hfgxtr(vcpu); + __deactivate_traps_mpam(); } static inline void ___activate_traps(struct kvm_vcpu *vcpu) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index d0cd1295d248..c471af52dbeb 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -417,6 +417,23 @@ static bool trap_oslar_el1(struct kvm_vcpu *vcpu, return true; } +static bool workaround_bad_mpam_abi(struct kvm_vcpu *vcpu, + struct sys_reg_params *p, + const struct sys_reg_desc *r) +{ + /* + * The ID register can't be removed without breaking migration, + * but MPAMIDR_EL1 can advertise all-zeroes, indicating there are zero + * PARTID/PMG supported by the CPU, allowing the other two trapped + * registers (MPAM1_EL1 and MPAM0_EL1) to be treated as RAZ/WI. + * Emulating MPAM1_EL1 as RAZ/WI means the guest sees the MPAMEN bit + * as clear, and realises MPAM isn't usable on this CPU. + */ + p->regval = 0; + + return true; +} + static bool trap_oslsr_el1(struct kvm_vcpu *vcpu, struct sys_reg_params *p, const struct sys_reg_desc *r) @@ -2136,8 +2153,11 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_LOREA_EL1), trap_loregion }, { SYS_DESC(SYS_LORN_EL1), trap_loregion }, { SYS_DESC(SYS_LORC_EL1), trap_loregion }, + { SYS_DESC(SYS_MPAMIDR_EL1), workaround_bad_mpam_abi }, { SYS_DESC(SYS_LORID_EL1), trap_loregion }, + { SYS_DESC(SYS_MPAM1_EL1), workaround_bad_mpam_abi }, + { SYS_DESC(SYS_MPAM0_EL1), workaround_bad_mpam_abi }, { SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 }, { SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 }, -- 2.25.1
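[Illustration, not part of the patch] A sketch of the values __activate_traps_mpam() computes, using the bit positions from the MPAM_SYSREG_* and MPAMHCR_* defines introduced earlier; it shows the v1.0 path (MPAMHCR_EL2 present) against the v1.1 MPAM2_EL2.TIDR fallback:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TRAP_IDR	 (1ULL << 58)	/* MPAM_SYSREG_TRAP_IDR */
#define TRAP_MPAM0_EL1	 (1ULL << 49)
#define TRAP_MPAM1_EL1	 (1ULL << 48)
#define HCR_TRAP_MPAMIDR (1ULL << 31)	/* MPAMHCR_TRAP_MPAMIDR */

static void activate_traps(bool have_mpam, bool have_hcr,
			   uint64_t *mpam2_el2, uint64_t *mpamhcr_el2)
{
	*mpam2_el2 = 0;
	*mpamhcr_el2 = 0;

	if (!have_mpam)
		return;

	*mpam2_el2 = TRAP_MPAM0_EL1 | TRAP_MPAM1_EL1;
	if (have_hcr)
		*mpamhcr_el2 = HCR_TRAP_MPAMIDR;	/* v1.0 trap path */
	else
		*mpam2_el2 |= TRAP_IDR;			/* v1.1 TIDR path */
}

int main(void)
{
	uint64_t mpam2, hcr;

	activate_traps(true, false, &mpam2, &hcr);
	printf("MPAM2_EL2=%#llx MPAMHCR_EL2=%#llx\n",
	       (unsigned long long)mpam2, (unsigned long long)hcr);
	return 0;
}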

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Currently KVM only allows writeable ID registers to be downgraded in the 'safe' direction, as determined by the cpufeature 'lower safe' flags. commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests, but didn't add trap handling. A previous patch supplied the missing trap handling. Existing VMs that have the MPAM field of AA64PFR0_EL1 need to be migratable, but there is little point enabling the MPAM CPU interface on new VMs until there is something a guest can do with it. Clear the MPAM field from the guest's AA64PFR0_EL1 by default, but allow user-space to set it again if the host supports MPAM. Add a helper to return the maximum permitted value for an ID register. For most this is the reset value. To allow the MPAM field to be written as supported, check if the host sanitised value is '1' and upgrade the reset value. Finally, change the trap handling to inject an undef if MPAM was not advertised to the guest. Full support will depend on a pseudo-device being created that describes the virt->phys PARTID mapping the VMM expects. Migration would be expected to fail if this pseudo-device can't be created on the remote end. This ID bit isn't needed to block migration. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/kvm/sys_regs.c | 74 +++++++++++++++++++++++++++++++-------- 1 file changed, 60 insertions(+), 14 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index c471af52dbeb..e368571a71b6 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -417,21 +417,29 @@ static bool trap_oslar_el1(struct kvm_vcpu *vcpu, return true; } -static bool workaround_bad_mpam_abi(struct kvm_vcpu *vcpu, - struct sys_reg_params *p, - const struct sys_reg_desc *r) +static bool trap_mpam(struct kvm_vcpu *vcpu, + struct sys_reg_params *p, + const struct sys_reg_desc *r) { + u64 aa64pfr0_el1 = IDREG(vcpu->kvm, SYS_ID_AA64PFR0_EL1); + /* - * The ID register can't be removed without breaking migration, - * but MPAMIDR_EL1 can advertise all-zeroes, indicating there are zero - * PARTID/PMG supported by the CPU, allowing the other two trapped - * registers (MPAM1_EL1 and MPAM0_EL1) to be treated as RAZ/WI. + * What did we expose to the guest? + * Earlier guests may have seen the ID bits, which can't be removed + * without breaking migration, but MPAMIDR_EL1 can advertise all-zeroes, + * indicating there are zero PARTID/PMG supported by the CPU, allowing + * the other two trapped registers (MPAM1_EL1 and MPAM0_EL1) to be + * treated as RAZ/WI. * Emulating MPAM1_EL1 as RAZ/WI means the guest sees the MPAMEN bit * as clear, and realises MPAM isn't usable on this CPU.
*/ - p->regval = 0; + if (FIELD_GET(ID_AA64PFR0_EL1_MPAM_MASK, aa64pfr0_el1)) { + p->regval = 0; + return true; + } - return true; + kvm_inject_undefined(vcpu); + return false; } static bool trap_oslsr_el1(struct kvm_vcpu *vcpu, @@ -1251,6 +1259,36 @@ static s64 kvm_arm64_ftr_safe_value(u32 id, const struct arm64_ftr_bits *ftrp, return arm64_ftr_safe_value(&kvm_ftr, new, cur); } +static u64 kvm_arm64_ftr_max(struct kvm_vcpu *vcpu, + const struct sys_reg_desc *rd) +{ + u64 pfr0, val = rd->reset(vcpu, rd); + u32 field, id = reg_to_encoding(rd); + + /* + * Some values may reset to a lower value than can be supported, + * get the maximum feature value. + */ + switch (id) { + case SYS_ID_AA64PFR0_EL1: + pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1); + + /* + * MPAM resets to 0, but migration of MPAM=1 guests is needed. + * See trap_mpam() for more. + */ + field = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT); + if (field == ID_AA64PFR0_EL1_MPAM_1) { + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; + val |= FIELD_PREP(ID_AA64PFR0_EL1_MPAM_MASK, ID_AA64PFR0_EL1_MPAM_1); + } + + break; + } + + return val; +} + /** * arm64_check_features() - Check if a feature register value constitutes * a subset of features indicated by the idreg's KVM sanitised limit. @@ -1271,8 +1309,7 @@ static int arm64_check_features(struct kvm_vcpu *vcpu, const struct arm64_ftr_bits *ftrp = NULL; u32 id = reg_to_encoding(rd); u64 writable_mask = rd->val; - u64 limit = rd->reset(vcpu, rd); - u64 mask = 0; + u64 limit, mask = 0; /* * Hidden and unallocated ID registers may not have a corresponding @@ -1286,6 +1323,7 @@ static int arm64_check_features(struct kvm_vcpu *vcpu, if (!ftr_reg) return -EINVAL; + limit = kvm_arm64_ftr_max(vcpu, rd); ftrp = ftr_reg->ftr_bits; for (; ftrp && ftrp->width; ftrp++) { @@ -1489,6 +1527,14 @@ static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu, val &= ~ID_AA64PFR0_EL1_AMU_MASK; + /* + * MPAM is disabled by default as KVM also needs a set of PARTID to + * program the MPAMVPMx_EL2 PARTID remapping registers with. But some + * older kernels let the guest see the ID bit. Turning it on causes + * the registers to be emulated as RAZ/WI. See trap_mpam() for more. + */ + val &= ~ID_AA64PFR0_EL1_MPAM_MASK; + return val; } @@ -2153,11 +2199,11 @@ static const struct sys_reg_desc sys_reg_descs[] = { { SYS_DESC(SYS_LOREA_EL1), trap_loregion }, { SYS_DESC(SYS_LORN_EL1), trap_loregion }, { SYS_DESC(SYS_LORC_EL1), trap_loregion }, - { SYS_DESC(SYS_MPAMIDR_EL1), workaround_bad_mpam_abi }, + { SYS_DESC(SYS_MPAMIDR_EL1), trap_mpam }, { SYS_DESC(SYS_LORID_EL1), trap_loregion }, - { SYS_DESC(SYS_MPAM1_EL1), workaround_bad_mpam_abi }, - { SYS_DESC(SYS_MPAM0_EL1), workaround_bad_mpam_abi }, + { SYS_DESC(SYS_MPAM1_EL1), trap_mpam }, + { SYS_DESC(SYS_MPAM0_EL1), trap_mpam }, { SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 }, { SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 }, -- 2.25.1
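[Illustration, not part of the patch] The limit upgrade in kvm_arm64_ftr_max() can be modelled stand-alone; the FIELD_GET/FIELD_PREP steps are spelled out by hand and the input values are invented:

#include <stdint.h>
#include <stdio.h>

#define MPAM_SHIFT	40
#define MPAM_MASK	(0xfULL << MPAM_SHIFT)

/* The reset value hides MPAM; if the host's sanitised ID register shows
 * MPAM=1, user-space may write the field back up to that value. */
static uint64_t ftr_max(uint64_t reset_val, uint64_t host_pfr0)
{
	uint64_t limit = reset_val;

	if (((host_pfr0 & MPAM_MASK) >> MPAM_SHIFT) == 1) {
		limit &= ~MPAM_MASK;
		limit |= 1ULL << MPAM_SHIFT;
	}
	return limit;
}

int main(void)
{
	uint64_t reset = 0;			/* MPAM cleared by default */
	uint64_t host = 1ULL << MPAM_SHIFT;	/* host has MPAM v1.0 */
	uint64_t limit = ftr_max(reset, host);

	printf("writable MPAM limit = %llu\n",
	       (unsigned long long)((limit & MPAM_MASK) >> MPAM_SHIFT));
	return 0;
}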

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- MPAM has a system register that is used to hold the partid and pmg values that traffic generated by EL0 will use. This can be set per-task by the resctrl file system. Add a helper to switch this. resctrl expects a 'default' value to be used in preference if the default partid and pmg are selected. struct task_struct's separate closid and rmid fields are insufficient to implement resctrl using MPAM, as resctrl can change the partid (closid) and pmg (sort of like the rmid) separately. On x86, the rmid is an independent number, so a race that writes a mismatched closid and rmid into hardware is benign. On arm64, the pmg bits extend the partid. (i.e. partid-5 has a pmg-0 that is not the same as partid-6's pmg-0). In this case, mismatching the values will 'dirty' a pmg value that resctrl believes is clean, and is not tracking with its 'limbo' code. To avoid this, the partid and pmg are always read and written as a pair. Instead of making struct task_struct's closid and rmid fields an endian-unsafe union, add the value to struct thread_info and always use READ_ONCE()/WRITE_ONCE() when accessing this field. CC: Amit Singh Tomar <amitsinght@marvell.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/include/asm/mpam.h | 46 ++++++++++++++++++++++++++++ arch/arm64/include/asm/thread_info.h | 3 ++ arch/arm64/kernel/mpam.c | 5 +++ arch/arm64/kernel/process.c | 7 +++++ 4 files changed, 61 insertions(+) diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index 576102d510ad..1d81a6f26acd 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -7,6 +7,8 @@ #include <linux/bitops.h> #include <linux/init.h> #include <linux/jump_label.h> +#include <linux/percpu.h> +#include <linux/sched.h> #include <asm/cpucaps.h> #include <asm/cpufeature.h> @@ -49,6 +51,9 @@ DECLARE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); +DECLARE_STATIC_KEY_FALSE(mpam_enabled); +DECLARE_PER_CPU(u64, arm64_mpam_default); +DECLARE_PER_CPU(u64, arm64_mpam_current); /* check whether all CPUs have MPAM support */ static __always_inline bool mpam_cpus_have_feature(void) @@ -73,4 +78,45 @@ static inline void __init __enable_mpam_hcr(void) static_branch_enable(&arm64_mpam_has_hcr); } +/* + * The resctrl filesystem writes to the partid/pmg values for threads and CPUs, + * which may race with reads in __mpam_sched_in(). Ensure only one of the old + * or new values is used. Particular care should be taken with the pmg field + * as __mpam_sched_in() may read a partid and pmg that don't match, causing + * this value to be stored with cache allocations, despite being considered + * 'free' by resctrl. + * + * A value in struct thread_info is used instead of struct task_struct as the + * cpu's u64 register format is used, but struct task_struct has two u32s.
+ */ +static inline u64 mpam_get_regval(struct task_struct *tsk) +{ +#ifdef CONFIG_ARM64_MPAM + return READ_ONCE(task_thread_info(tsk)->mpam_partid_pmg); +#else + return 0; +#endif +} + +static inline void mpam_thread_switch(struct task_struct *tsk) +{ + u64 oldregval; + int cpu = smp_processor_id(); + u64 regval = mpam_get_regval(tsk); + + if (!IS_ENABLED(CONFIG_ARM64_MPAM) || + !static_branch_likely(&mpam_enabled)) + return; + + if (!regval) + regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu)); + + oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); + if (oldregval == regval) + return; + + /* Synchronising this write is left until the ERET to EL0 */ + write_sysreg_s(regval, SYS_MPAM0_EL1); + WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval); +} #endif /* __ASM__MPAM_H */ diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 553d1bc559c6..c57b33de0ed1 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -41,6 +41,9 @@ struct thread_info { #ifdef CONFIG_SHADOW_CALL_STACK void *scs_base; void *scs_sp; +#endif +#ifdef CONFIG_ARM64_MPAM + u64 mpam_partid_pmg; #endif u32 cpu; }; diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c index ff29b666e025..894ca52549b8 100644 --- a/arch/arm64/kernel/mpam.c +++ b/arch/arm64/kernel/mpam.c @@ -4,5 +4,10 @@ #include <asm/mpam.h> #include <linux/jump_label.h> +#include <linux/percpu.h> DEFINE_STATIC_KEY_FALSE(arm64_mpam_has_hcr); +DEFINE_STATIC_KEY_FALSE(mpam_enabled); +DEFINE_PER_CPU(u64, arm64_mpam_default); +DEFINE_PER_CPU(u64, arm64_mpam_current); + diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c index 657ea273c0f9..2c2cff1927e7 100644 --- a/arch/arm64/kernel/process.c +++ b/arch/arm64/kernel/process.c @@ -49,6 +49,7 @@ #include <asm/exec.h> #include <asm/fpsimd.h> #include <asm/mmu_context.h> +#include <asm/mpam.h> #include <asm/mte.h> #include <asm/processor.h> #include <asm/pointer_auth.h> @@ -552,6 +553,12 @@ struct task_struct *__switch_to(struct task_struct *prev, if (prev->thread.sctlr_user != next->thread.sctlr_user) update_sctlr_el1(next->thread.sctlr_user); + /* + * MPAM thread switch happens after the DSB to ensure prev's accesses + * use prev's MPAM settings. + */ + mpam_thread_switch(next); + /* the actual thread switch */ last = cpu_switch_to(prev, next); -- 2.25.1
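The need for a single u64 is easiest to see from the register layout. A stand-alone sketch of the packing, assuming the MPAMn_ELx field positions from the MPAM specification (PARTID_D in bits [15:0], PARTID_I in [31:16], PMG_D in [39:32], PMG_I in [47:40]); check the spec before relying on the exact shifts:

#include <stdint.h>
#include <stdio.h>

/* Pack one partid/pmg pair in the hardware register format */
static uint64_t mpam_pack_regval(uint16_t partid, uint8_t pmg)
{
	return (uint64_t)partid |		/* PARTID_D */
	       ((uint64_t)partid << 16) |	/* PARTID_I */
	       ((uint64_t)pmg << 32) |		/* PMG_D */
	       ((uint64_t)pmg << 40);		/* PMG_I */
}

int main(void)
{
	/*
	 * Written with one WRITE_ONCE() and read with one READ_ONCE(), a
	 * reader can never observe partid-5 paired with partid-6's pmg.
	 */
	printf("%#llx\n", (unsigned long long)mpam_pack_regval(5, 1));
	return 0;
}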

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- While we trap the guest's attempts to read/write the MPAM control registers, these remain in effect. guest-EL0 uses KVM's user-space's configuration, as the value is left in the register, and guest-EL1 uses either the host kernel's configuration, or in the case of VHE, the unknown reset value of MPAM1_EL1. On nVHE systems, EL2 continues to use partid-0 for world-switch, even when the host may have configured its kernel threads to use a different partid. 0 may have been assigned to another task. We want to force the guest-EL1 to use KVM's user-space's MPAM configuration. On an nVHE system, copy the EL1 MPAM register to EL2. This ensures world-switch uses the same partid as the kernel thread does on the host. When loading the guest's EL1 registers, copy the VMM's EL0 partid to the EL1 register. When restoring the host's registers, the partid previously copied to EL2 can be used to restore EL1. For VHE systems, we can skip restoring the EL1 register for the host, as it is out-of-context once HCR_EL2.TGE is set. This is done outside the usual sysreg save/restore as the values can change behind KVM's back, so should not be stored in the guest context. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 27 ++++++++++++++++++++++ arch/arm64/kvm/hyp/nvhe/switch.c | 11 +++++++++ arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 1 + 3 files changed, 39 insertions(+) diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h index bb6b571ec627..d6cfb3dc7f7c 100644 --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h @@ -15,6 +15,7 @@ #include <asm/kvm_emulate.h> #include <asm/kvm_hyp.h> #include <asm/kvm_mmu.h> +#include <asm/mpam.h> static inline void __sysreg_save_common_state(struct kvm_cpu_context *ctxt) { @@ -243,4 +244,30 @@ static inline void __sysreg32_restore_state(struct kvm_vcpu *vcpu) write_sysreg(__vcpu_sys_reg(vcpu, DBGVCR32_EL2), dbgvcr32_el2); } +/* + * The _EL0 value was written by the host's context switch; copy this into the + * guest's EL1. + */ +static inline void __mpam_guest_load(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) + write_sysreg_el1(read_sysreg_s(SYS_MPAM0_EL1), SYS_MPAM1); +} + +/* + * Copy the _EL2 register back to _EL1, clearing any trap bits EL2 may have set. + * nVHE world-switch copies the _EL1 register to _EL2. A VHE host writes to the + * _EL2 register as it is aliased by the hardware when TGE is set.
+ */ +static inline void __mpam_guest_put(void) +{ + u64 val, mask = MPAM_SYSREG_PMG_D | MPAM_SYSREG_PMG_I | + MPAM_SYSREG_PARTID_D | MPAM_SYSREG_PARTID_I; + + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) { + val = FIELD_GET(mask, read_sysreg_s(SYS_MPAM2_EL2)); + write_sysreg_el1(val, SYS_MPAM1); + } +} + #endif /* __ARM64_KVM_HYP_SYSREG_SR_H__ */ diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index c353a06ee7e6..c2118f658e22 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -242,6 +242,13 @@ static void early_exit_filter(struct kvm_vcpu *vcpu, u64 *exit_code) } } +/* Use the host thread's partid and pmg for world switch */ +static void __mpam_copy_el1_to_el2(void) +{ + if (IS_ENABLED(CONFIG_ARM64_MPAM) && mpam_cpus_have_feature()) + write_sysreg_s(read_sysreg_s(SYS_MPAM1_EL1), SYS_MPAM2_EL2); +} + /* Switch to the guest for legacy non-VHE systems */ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) { @@ -251,6 +258,8 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) bool pmu_switch_needed; u64 exit_code; + __mpam_copy_el1_to_el2(); + /* * Having IRQs masked via PMR when entering the guest means the GIC * will not signal the CPU of interrupts of lower priority, and the @@ -310,6 +319,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) __timer_enable_traps(vcpu); __debug_switch_to_guest(vcpu); + __mpam_guest_load(); do { /* Jump in the fire! */ @@ -320,6 +330,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) __sysreg_save_state_nvhe(guest_ctxt); __sysreg32_save_state(vcpu); + __mpam_guest_put(); __timer_disable_traps(vcpu); __hyp_vgic_save_state(vcpu); diff --git a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c index b35a178e7e0d..6b407cd3230d 100644 --- a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c +++ b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c @@ -90,6 +90,7 @@ void kvm_vcpu_load_sysregs_vhe(struct kvm_vcpu *vcpu) __sysreg32_restore_state(vcpu); __sysreg_restore_user_state(guest_ctxt); __sysreg_restore_el1_state(guest_ctxt); + __mpam_guest_load(); vcpu_set_flag(vcpu, SYSREGS_ON_CPU); -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- The PPTT describes CPUs and caches, as well as processor containers. To enable PPI partitions, the irqchip driver needs to know how many partitions the platform has, and which CPUs belong to which partition. When a percpu interrupt is registered, the partition is provided to allow a different driver to request the same percpu interrupt intid, one per partition. The acpi_id of the Processor Container is the natural way to do this with ACPI, but the DSDT AML interpreter is not available early enough for the irqchip driver. Fortunately, the same information can be described in the PPTT. Add a helper to count the number of Processor Containers in the PPTT. This is structured as a walker/callback as the irqchip driver will also use this to configure each partition. Only Processor entries in the PPTT that have a valid acpi id are considered containers. To identify a particular Processor Container, it must have an id. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/acpi/pptt.c | 58 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 58 insertions(+) diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index a35dd0e41c27..f3279bcb78ff 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -21,6 +21,8 @@ #include <linux/cacheinfo.h> #include <acpi/processor.h> +typedef int (*acpi_pptt_cpu_callback_t)(struct acpi_pptt_processor *, void *); + static struct acpi_subtable_header *fetch_pptt_subtable(struct acpi_table_header *table_hdr, u32 pptt_ref) { @@ -293,6 +295,62 @@ static struct acpi_pptt_processor *acpi_find_processor_node(struct acpi_table_he return NULL; } +/** + * acpi_pptt_for_each_container() - Iterate over all processor containers + * + * Not all 'Processor' entries in the PPTT are either a CPU or a Processor + * Container; they may exist purely to describe a Private resource. CPUs + * have to be leaves, so a Processor Container is a non-leaf that has the + * 'ACPI Processor ID valid' flag set. + * + * Return: 0 for a complete walk, or the first non-zero value from the callback + * that stopped the walk.
+ */ +int acpi_pptt_for_each_container(acpi_pptt_cpu_callback_t callback, void *arg) +{ + struct acpi_pptt_processor *cpu_node; + struct acpi_table_header *table_hdr; + struct acpi_subtable_header *entry; + bool leaf_flag, has_leaf_flag = false; + unsigned long table_end; + acpi_status status; + u32 proc_sz; + int ret = 0; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table_hdr); + if (ACPI_FAILURE(status)) + return 0; + + if (table_hdr->revision > 1) + has_leaf_flag = true; + + table_end = (unsigned long)table_hdr + table_hdr->length; + entry = ACPI_ADD_PTR(struct acpi_subtable_header, table_hdr, + sizeof(struct acpi_table_pptt)); + proc_sz = sizeof(struct acpi_pptt_processor); + while ((unsigned long)entry + proc_sz < table_end) { + cpu_node = (struct acpi_pptt_processor *)entry; + if (entry->type == ACPI_PPTT_TYPE_PROCESSOR && + cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID) + { + leaf_flag = cpu_node->flags & ACPI_PPTT_ACPI_LEAF_NODE; + if ((has_leaf_flag && !leaf_flag) || + (!has_leaf_flag && !acpi_pptt_leaf_node(table_hdr, cpu_node))) + { + ret = callback(cpu_node, arg); + if (ret) + break; + } + } + entry = ACPI_ADD_PTR(struct acpi_subtable_header, entry, + entry->length); + } + + acpi_put_table(table_hdr); + + return ret; +} + static u8 acpi_cache_type(enum cache_type type) { switch (type) { -- 2.25.1
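As a usage sketch, an early caller such as the irqchip driver could count the Processor Containers before allocating per-partition state. The function and variable names here are illustrative only:

static int count_cb(struct acpi_pptt_processor *container, void *arg)
{
	int *count = arg;

	(*count)++;
	return 0;	/* a non-zero return would stop the walk */
}

static int count_processor_containers(void)
{
	int count = 0;

	/* A missing PPTT is treated as a complete walk of zero containers */
	acpi_pptt_for_each_container(count_cb, &count);
	return count;
}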

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- The MPAM table identifies caches by id, but the driver also wants to know the cache level, without having to wait for whichever core has that cache to come online. Add a helper that walks every possible cache until it finds the one identified by id, then returns the level. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/acpi/pptt.c | 74 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/acpi.h | 5 +++ 2 files changed, 79 insertions(+) diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index f3279bcb78ff..6c1f668d63ba 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -870,3 +870,77 @@ int find_acpi_cpu_topology_hetero_id(unsigned int cpu) return find_acpi_cpu_topology_tag(cpu, PPTT_ABORT_PACKAGE, ACPI_PPTT_ACPI_IDENTICAL); } + + +/** + * find_acpi_cache_level_from_id() - Get the level of the specified cache + * @cache_id: The id field of the unified cache + * + * Determine the level relative to any CPU for the unified cache identified by + * cache_id. This allows the property to be found even if the CPUs are offline. + * + * The returned level can be used to group unified caches that are peers. + * + * The PPTT table must be rev 3 or later. + * + * If one CPU's L2 is shared with another CPU as its L3, this function will + * return an unpredictable value. + * + * Return: -ENOENT if the PPTT doesn't exist, or the cache cannot be found. + * Otherwise returns a value which represents the level of the specified cache. + */ +int find_acpi_cache_level_from_id(u32 cache_id) +{ + u32 acpi_cpu_id; + acpi_status status; + int level, cpu, num_levels; + struct acpi_pptt_cache *cache; + struct acpi_table_header *table; + struct acpi_pptt_cache_v1 *cache_v1; + struct acpi_pptt_processor *cpu_node; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table); + if (ACPI_FAILURE(status)) { + acpi_pptt_warn_missing(); + return -ENOENT; + } + + if (table->revision < 3) { + acpi_put_table(table); + return -ENOENT; + } + + /* + * If we found the cache first, we'd still need to walk from each CPU + * to find the level...
+ */ + for_each_possible_cpu(cpu) { + acpi_cpu_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table, acpi_cpu_id); + if (!cpu_node) + break; + acpi_count_levels(table, cpu_node, &num_levels, NULL); + + /* Start at 1 for L1 */ + for (level = 1; level <= num_levels; level++) { + cache = acpi_find_cache_node(table, acpi_cpu_id, + ACPI_PPTT_CACHE_TYPE_UNIFIED, + level, &cpu_node); + if (!cache) + continue; + + cache_v1 = ACPI_ADD_PTR(struct acpi_pptt_cache_v1, + cache, + sizeof(struct acpi_pptt_cache)); + + if (cache->flags & ACPI_PPTT_CACHE_ID_VALID && + cache_v1->cache_id == cache_id) { + acpi_put_table(table); + return level; + } + } + } + + acpi_put_table(table); + return -ENOENT; +} diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 1b76d2f83eac..7e4fc2f6883c 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1495,6 +1495,7 @@ int find_acpi_cpu_topology(unsigned int cpu, int level); int find_acpi_cpu_topology_cluster(unsigned int cpu); int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); +int find_acpi_cache_level_from_id(u32 cache_id); #else static inline int acpi_pptt_cpu_is_thread(unsigned int cpu) { @@ -1516,6 +1517,10 @@ static inline int find_acpi_cpu_topology_hetero_id(unsigned int cpu) { return -EINVAL; } +static inline int find_acpi_cache_level_from_id(u32 cache_id) +{ + return -EINVAL; +} #endif #ifdef CONFIG_ARM64 -- 2.25.1
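A short usage sketch, mirroring how the MPAM table parser later in this series consumes the helper (the surrounding names are hypothetical):

int level = find_acpi_cache_level_from_id(cache_id);

if (level < 0)
	pr_warn("cache id %u not described by a rev 3+ PPTT\n", cache_id);
else
	pr_info("cache id %u is an L%d cache\n", cache_id, level);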

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- The ACPI table for MPAM describes a set of CPUs with the UID of a processor container. These exist both in the namespace and the PPTT. Using the existing for-each helpers, provide a helper to find the specified processor container in the PPTT, and fill a cpumask with the CPUs that belong to it. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/acpi/pptt.c | 67 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/acpi.h | 6 ++++ 2 files changed, 73 insertions(+) diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index 6c1f668d63ba..9fbeed61cd87 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -295,6 +295,38 @@ static struct acpi_pptt_processor *acpi_find_processor_node(struct acpi_table_he return NULL; } +/* parent_node points into the table, but the table isn't provided. */ +static void acpi_pptt_get_child_cpus(struct acpi_pptt_processor *parent_node, + cpumask_t *cpus) +{ + struct acpi_pptt_processor *cpu_node; + struct acpi_table_header *table_hdr; + acpi_status status; + u32 acpi_id; + int cpu; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table_hdr); + if (ACPI_FAILURE(status)) + return; + + for_each_possible_cpu(cpu) { + acpi_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table_hdr, acpi_id); + + while (cpu_node) { + if (cpu_node == parent_node) { + cpumask_set_cpu(cpu, cpus); + break; + } + cpu_node = fetch_pptt_node(table_hdr, cpu_node->parent); + } + } + + acpi_put_table(table_hdr); + + return; +} + /** * acpi_pptt_for_each_container() - Iterate over all processor containers * @@ -351,6 +383,41 @@ int acpi_pptt_for_each_container(acpi_pptt_cpu_callback_t callback, void *arg) return ret; } +struct __cpus_from_container_arg { + u32 acpi_cpu_id; + cpumask_t *cpus; +}; + +static int __cpus_from_container(struct acpi_pptt_processor *container, void *arg) +{ + struct __cpus_from_container_arg *params = arg; + + if (container->acpi_processor_id == params->acpi_cpu_id) + acpi_pptt_get_child_cpus(container, params->cpus); + + return 0; +} + +/** + * acpi_pptt_get_cpus_from_container() - Populate a cpumask with all CPUs in a + * processor container + * + * Find the specified Processor Container, and fill @cpus with all the CPUs + * below it. + * + * Return: 0 for a complete walk, or an error if the mask is incomplete.
+ */ +int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus) +{ + struct __cpus_from_container_arg params; + + params.acpi_cpu_id = acpi_cpu_id; + params.cpus = cpus; + + cpumask_clear(cpus); + return acpi_pptt_for_each_container(&__cpus_from_container, &params); +} + static u8 acpi_cache_type(enum cache_type type) { switch (type) { diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 7e4fc2f6883c..93b8155625bf 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1496,6 +1496,7 @@ int find_acpi_cpu_topology_cluster(unsigned int cpu); int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); int find_acpi_cache_level_from_id(u32 cache_id); +int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus); #else static inline int acpi_pptt_cpu_is_thread(unsigned int cpu) { @@ -1521,6 +1522,11 @@ static inline int find_acpi_cache_level_from_id(u32 cache_id) { return -EINVAL; } +static inline int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, + cpumask_t *cpus) +{ + return -EINVAL; +} #endif #ifdef CONFIG_ARM64 -- 2.25.1
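An illustrative caller, assuming the UID comes from something like the MPAM table's linked Processor Container (as used by later patches in this series); the allocation and names are for the sketch only:

cpumask_var_t cpus;

if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
	return -ENOMEM;

/* The helper clears the mask before filling it */
if (acpi_pptt_get_cpus_from_container(container_uid, cpus))
	pr_warn("incomplete CPU mask for container %u\n", container_uid);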

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- MPAM identifies CPUs by the cache_id in the PPTT cache structure. The driver needs to know which CPUs are associated with the cache, but the CPUs may not all be online, so cacheinfo does not have the information. Add a helper to pull this information out of the PPTT. CC: Rohit Mathew <Rohit.Mathew@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/acpi/pptt.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++- include/linux/acpi.h | 6 ++++ 2 files changed, 83 insertions(+), 2 deletions(-) diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index 9fbeed61cd87..201df60e4a79 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -183,9 +183,10 @@ acpi_find_cache_level(struct acpi_table_header *table_hdr, * levels and split cache levels (data/instruction). * @table_hdr: Pointer to the head of the PPTT table * @cpu_node: processor node we wish to count caches for - * @levels: Number of levels if success. + * @levels: Number of levels if success. (*levels) should be initialized by + * the caller with the value to be used as the starting level. * @split_levels: Number of split cache levels (data/instruction) if - * success. Can by NULL. + * success. Can be NULL. * * Given a processor node containing a processing unit, walk into it and count * how many levels exist solely for it, and then walk up each level until we hit @@ -982,6 +983,8 @@ int find_acpi_cache_level_from_id(u32 cache_id) * to find the level... */ for_each_possible_cpu(cpu) { + + num_levels = 0; acpi_cpu_id = get_acpi_id_for_cpu(cpu); cpu_node = acpi_find_processor_node(table, acpi_cpu_id); if (!cpu_node) @@ -1011,3 +1014,75 @@ int find_acpi_cache_level_from_id(u32 cache_id) acpi_put_table(table); return -ENOENT; } + +/** + * acpi_pptt_get_cpumask_from_cache_id() - Get the cpus associated with the + * specified cache + * @cache_id: The id field of the unified cache + * @cpus: Where to build the cpumask + * + * Determine which CPUs are below this cache in the PPTT. This allows the property + * to be found even if the CPUs are offline. + * + * The PPTT table must be rev 3 or later. + * + * Return: -ENOENT if the PPTT doesn't exist, or the cache cannot be found. + * Otherwise returns 0 and sets the cpus in the provided cpumask. + */ +int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, cpumask_t *cpus) +{ + u32 acpi_cpu_id; + acpi_status status; + int level, cpu, num_levels; + struct acpi_pptt_cache *cache; + struct acpi_table_header *table; + struct acpi_pptt_cache_v1 *cache_v1; + struct acpi_pptt_processor *cpu_node; + + cpumask_clear(cpus); + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table); + if (ACPI_FAILURE(status)) { + acpi_pptt_warn_missing(); + return -ENOENT; + } + + if (table->revision < 3) { + acpi_put_table(table); + return -ENOENT; + } + + /* + * If we found the cache first, we'd still need to walk from each cpu.
+ */ + for_each_possible_cpu(cpu) { + + num_levels = 0; + acpi_cpu_id = get_acpi_id_for_cpu(cpu); + cpu_node = acpi_find_processor_node(table, acpi_cpu_id); + if (!cpu_node) + break; + acpi_count_levels(table, cpu_node, &num_levels, NULL); + + /* Start at 1 for L1 */ + for (level = 1; level <= num_levels; level++) { + cache = acpi_find_cache_node(table, acpi_cpu_id, + ACPI_PPTT_CACHE_TYPE_UNIFIED, + level, &cpu_node); + if (!cache) + continue; + + cache_v1 = ACPI_ADD_PTR(struct acpi_pptt_cache_v1, + cache, + sizeof(struct acpi_pptt_cache)); + + if (cache->flags & ACPI_PPTT_CACHE_ID_VALID && + cache_v1->cache_id == cache_id) { + cpumask_set_cpu(cpu, cpus); + } + } + } + + acpi_put_table(table); + return 0; +} diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 93b8155625bf..3c4c5ad9b136 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1497,6 +1497,7 @@ int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); int find_acpi_cache_level_from_id(u32 cache_id); int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, cpumask_t *cpus); +int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, cpumask_t *cpus); #else static inline int acpi_pptt_cpu_is_thread(unsigned int cpu) { @@ -1527,6 +1528,11 @@ static inline int acpi_pptt_get_cpus_from_container(u32 acpi_cpu_id, { return -EINVAL; } +static inline int acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, + cpumask_t *cpus) +{ + return -EINVAL; +} #endif #ifdef CONFIG_ARM64 -- 2.25.1
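A matching usage sketch: the MPAM driver can build an MSC's accessibility mask from the cache_id in the ACPI table even while those CPUs are offline (names illustrative):

cpumask_var_t affinity;

if (!alloc_cpumask_var(&affinity, GFP_KERNEL))
	return -ENOMEM;

if (acpi_pptt_get_cpumask_from_cache_id(cache_id, affinity))
	pr_warn("no rev 3+ PPTT entry for cache id %u\n", cache_id);
else
	pr_info("cache id %u spans %u possible CPUs\n", cache_id,
		cpumask_weight(affinity));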

From: Rob Herring <robh@kernel.org> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- In preparation to set the cache 'id' based on the CPU h/w ids, allow for a 64-bit 'id' value. The only case that needs this is arm64, so unsigned long is sufficient. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org> [ Update get_cpu_cacheinfo_id() too. Use UL instead of ULL. ] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/base/cacheinfo.c | 8 +++++++- include/linux/cacheinfo.h | 6 +++--- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index 23b8cba4a2a3..bb6949abf25b 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -620,13 +620,19 @@ static ssize_t file_name##_show(struct device *dev, \ return sysfs_emit(buf, "%u\n", this_leaf->object); \ } -show_one(id, id); show_one(level, level); show_one(coherency_line_size, coherency_line_size); show_one(number_of_sets, number_of_sets); show_one(physical_line_partition, physical_line_partition); show_one(ways_of_associativity, ways_of_associativity); +static ssize_t id_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + struct cacheinfo *this_leaf = dev_get_drvdata(dev); + + return sysfs_emit(buf, "%lu\n", this_leaf->id); +} + static ssize_t size_show(struct device *dev, struct device_attribute *attr, char *buf) { diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index ef701ba043e9..73b5bdf363a0 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -48,7 +48,7 @@ extern unsigned int coherency_max_size; * keeping, the remaining members form the core properties of the cache */ struct cacheinfo { - unsigned int id; + unsigned long id; enum cache_type type; unsigned int level; unsigned int coherency_line_size; @@ -140,11 +140,11 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level) * Get the id of the cache associated with @cpu at level @level. * cpuhp lock must be held. */ -static inline int get_cpu_cacheinfo_id(int cpu, int level) +static inline unsigned long get_cpu_cacheinfo_id(int cpu, int level) { struct cacheinfo *ci = get_cpu_cacheinfo_level(cpu, level); - return ci ? ci->id : -1; + return ci ? ci->id : ~0UL; } #ifdef CONFIG_ARM64 -- 2.25.1
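Note the sentinel change this implies for callers: a missing cache is now reported as ~0UL rather than -1 truncated to an int. For example:

unsigned long id = get_cpu_cacheinfo_id(cpu, 3);

if (id == ~0UL)
	pr_debug("CPU %d has no L3, or cacheinfo is not yet populated\n", cpu);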

From: Rob Herring <robh@kernel.org> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Use the minimum CPU h/w id of the CPUs associated with the cache for the cache 'id'. This will provide a stable id value for a given system. As we need to check all possible CPUs, we can't use the shared_cpu_map which is just online CPUs. As there's not a cache to CPUs mapping in DT, we have to walk all CPU nodes and then walk cache levels. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/base/cacheinfo.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index bb6949abf25b..a153acfb2c48 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -183,6 +183,31 @@ static bool cache_node_is_unified(struct cacheinfo *this_leaf, return of_property_read_bool(np, "cache-unified"); } +static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) +{ + struct device_node *cpu; + unsigned long min_id = ~0UL; + + for_each_of_cpu_node(cpu) { + struct device_node *cache_node = cpu; + u64 id = of_get_cpu_hwid(cache_node, 0); + + while ((cache_node = of_find_next_cache_node(cache_node))) { + if ((cache_node == np) && (id < min_id)) { + min_id = id; + of_node_put(cache_node); + break; + } + of_node_put(cache_node); + } + } + + if (min_id != ~0UL) { + this_leaf->id = min_id; + this_leaf->attributes |= CACHE_ID; + } +} + static void cache_of_set_props(struct cacheinfo *this_leaf, struct device_node *np) { @@ -198,6 +223,7 @@ static void cache_of_set_props(struct cacheinfo *this_leaf, cache_get_line_size(this_leaf, np); cache_nr_sets(this_leaf, np); cache_associativity(this_leaf); + cache_of_set_id(this_leaf, np); } static int cache_setup_of_node(unsigned int cpu) -- 2.25.1
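The rule itself is simple enough to state as a stand-alone sketch: a cache's id is the smallest h/w id (on arm64, the value from the CPU node's reg property, i.e. the MPIDR) among the CPUs that share it, so it does not depend on logical CPU numbering or probe order:

#include <stdint.h>

static uint64_t pick_cache_id(const uint64_t *sharing_cpu_hwids, int nr)
{
	uint64_t id = ~0ULL;	/* 'no id' sentinel, as in the patch */

	for (int i = 0; i < nr; i++)
		if (sharing_cpu_hwids[i] < id)
			id = sharing_cpu_hwids[i];
	return id;
}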

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- The MPAM driver identifies caches by id for use with resctrl. It needs to know the cache-id when probing, but the value isn't set in cacheinfo until device_initcall(). Expose the code that generates the cache-id. The parts of the MPAM driver that run early can use this to set up the resctrl structures. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/base/cacheinfo.c | 13 ++++++++++--- include/linux/cacheinfo.h | 1 + 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index a153acfb2c48..fc443c6a49e1 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -183,7 +183,7 @@ static bool cache_node_is_unified(struct cacheinfo *this_leaf, return of_property_read_bool(np, "cache-unified"); } -static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) +unsigned long cache_of_get_id(struct device_node *np) { struct device_node *cpu; unsigned long min_id = ~0UL; @@ -202,8 +202,15 @@ static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) } } - if (min_id != ~0UL) { - this_leaf->id = min_id; + return min_id; +} + +static void cache_of_set_id(struct cacheinfo *this_leaf, struct device_node *np) +{ + unsigned long id = cache_of_get_id(np); + + if (id != ~0UL) { + this_leaf->id = id; this_leaf->attributes |= CACHE_ID; } } diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index 73b5bdf363a0..d5d27842ea0c 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -112,6 +112,7 @@ int acpi_get_cache_info(unsigned int cpu, #endif const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf); +unsigned long cache_of_get_id(struct device_node *np); /* * Get the cacheinfo structure for the cache associated with @cpu at -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- MPAM needs to know the size of a cache associated with a particular CPU. The DT/ACPI agnostic way of doing this is to ask cacheinfo. Add a helper to do this. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- include/linux/cacheinfo.h | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index d5d27842ea0c..779be650ac83 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -148,6 +148,27 @@ static inline unsigned long get_cpu_cacheinfo_id(int cpu, int level) return ci ? ci->id : ~0UL; } +/* + * Get the size of the cache associated with @cpu at level @level. + * cpuhp lock must be held. + */ +static inline unsigned int get_cpu_cacheinfo_size(int cpu, int level) +{ + struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu); + int i; + + if (!ci->info_list) + return 0; + + for (i = 0; i < ci->num_leaves; i++) { + if (ci->info_list[i].level == level) { + return ci->info_list[i].size; + } + } + + return 0; +} + #ifdef CONFIG_ARM64 #define use_arch_cache_info() (true) #else -- 2.25.1
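For example (illustrative names, not from this series), a driver could derive the number of bytes per portion of a cache-portion bitmap from the new helper:

unsigned int l3_size = get_cpu_cacheinfo_size(cpu, 3);

/* 0 means no cache at that level, or cacheinfo is not yet populated */
if (l3_size)
	bytes_per_portion = l3_size / num_portions;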

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Add code to parse the arm64 specific MPAM table, looking up the cache level from the PPTT and feeding the end result into the MPAM driver. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/Kconfig | 1 + drivers/acpi/arm64/Kconfig | 3 + drivers/acpi/arm64/Makefile | 2 + drivers/acpi/arm64/mpam.c | 369 ++++++++++++++++++++++++++++++++++++ drivers/acpi/tables.c | 2 +- include/linux/arm_mpam.h | 43 +++++ 6 files changed, 419 insertions(+), 1 deletion(-) create mode 100644 drivers/acpi/arm64/mpam.c create mode 100644 include/linux/arm_mpam.h diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 471e8129d0fb..145995b14daa 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2015,6 +2015,7 @@ config ARM64_TLB_RANGE config ARM64_MPAM bool "Enable support for MPAM" + select ACPI_MPAM if ACPI help Memory Partitioning and Monitoring is an optional extension that allows the CPUs to mark load and store transactions with diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig index b3ed6212244c..f2fd79f22e7d 100644 --- a/drivers/acpi/arm64/Kconfig +++ b/drivers/acpi/arm64/Kconfig @@ -21,3 +21,6 @@ config ACPI_AGDI config ACPI_APMT bool + +config ACPI_MPAM + bool diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile index 143debc1ba4a..9497a7777729 100644 --- a/drivers/acpi/arm64/Makefile +++ b/drivers/acpi/arm64/Makefile @@ -4,4 +4,6 @@ obj-$(CONFIG_ACPI_IORT) += iort.o obj-$(CONFIG_ACPI_GTDT) += gtdt.o obj-$(CONFIG_ACPI_APMT) += apmt.o obj-$(CONFIG_ARM_AMBA) += amba.o +obj-$(CONFIG_ACPI_MPAM) += mpam.o obj-y += dma.o init.o + diff --git a/drivers/acpi/arm64/mpam.c b/drivers/acpi/arm64/mpam.c new file mode 100644 index 000000000000..8a63449f27b5 --- /dev/null +++ b/drivers/acpi/arm64/mpam.c @@ -0,0 +1,369 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2022 Arm Ltd. + +/* Parse the MPAM ACPI table feeding the discovered nodes into the driver */ + +#define pr_fmt(fmt) "ACPI MPAM: " fmt + +#include <linux/acpi.h> +#include <linux/arm_mpam.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/platform_device.h> + +#include <acpi/processor.h> + +#include <asm/mpam.h> + +/* Flags for acpi_table_mpam_msc.*_interrupt_flags */ +#define ACPI_MPAM_MSC_IRQ_MODE_EDGE 1 +#define ACPI_MPAM_MSC_IRQ_TYPE_MASK (3<<1) +#define ACPI_MPAM_MSC_IRQ_TYPE_WIRED 0 +#define ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER (1<<3) +#define ACPI_MPAM_MSC_IRQ_AFFINITY_VALID (1<<4) + +static bool frob_irq(struct platform_device *pdev, int intid, u32 flags, + int *irq, u32 processor_container_uid) +{ + int sense; + + if (!intid) + return false; + + /* 0 in this field indicates a wired interrupt */ + if (flags & ACPI_MPAM_MSC_IRQ_TYPE_MASK) + return false; + + if (flags & ACPI_MPAM_MSC_IRQ_MODE_EDGE) + sense = ACPI_EDGE_SENSITIVE; + else + sense = ACPI_LEVEL_SENSITIVE; + + /* + * If the GSI is in the GIC's PPI range, try and create a partitioned + * percpu interrupt. 
+ */ + if (16 <= intid && intid < 32 && processor_container_uid != ~0) { + pr_err_once("Partitioned interrupts not supported\n"); + return false; + } else { + *irq = acpi_register_gsi(&pdev->dev, intid, sense, + ACPI_ACTIVE_HIGH); + } + if (*irq <= 0) { + pr_err_once("Failed to register interrupt 0x%x with ACPI\n", + intid); + return false; + } + + return true; +} + +static void acpi_mpam_parse_irqs(struct platform_device *pdev, + struct acpi_mpam_msc_node *tbl_msc, + struct resource *res, int *res_idx) +{ + u32 flags, aff = ~0; + int irq; + + flags = tbl_msc->overflow_interrupt_flags; + if (flags & ACPI_MPAM_MSC_IRQ_AFFINITY_VALID && + flags & ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER) + aff = tbl_msc->overflow_interrupt_affinity; + if (frob_irq(pdev, tbl_msc->overflow_interrupt, flags, &irq, aff)) { + res[*res_idx].start = irq; + res[*res_idx].end = irq; + res[*res_idx].flags = IORESOURCE_IRQ; + res[*res_idx].name = "overflow"; + + (*res_idx)++; + } + + flags = tbl_msc->error_interrupt_flags; + if (flags & ACPI_MPAM_MSC_IRQ_AFFINITY_VALID && + flags & ACPI_MPAM_MSC_IRQ_AFFINITY_PROCESSOR_CONTAINER) + aff = tbl_msc->error_interrupt_affinity; + else + aff = ~0; + if (frob_irq(pdev, tbl_msc->error_interrupt, flags, &irq, aff)) { + res[*res_idx].start = irq; + res[*res_idx].end = irq; + res[*res_idx].flags = IORESOURCE_IRQ; + res[*res_idx].name = "error"; + + (*res_idx)++; + } +} + +static int acpi_mpam_parse_resource(struct mpam_msc *msc, + struct acpi_mpam_resource_node *res) +{ + u32 cache_id; + int level; + + switch (res->locator_type) { + case ACPI_MPAM_LOCATION_TYPE_PROCESSOR_CACHE: + cache_id = res->locator.cache_locator.cache_reference; + level = find_acpi_cache_level_from_id(cache_id); + if (level < 0) { + pr_err_once("Bad level for cache with id %u\n", cache_id); + return level; + } + return mpam_ris_create(msc, res->ris_index, MPAM_CLASS_CACHE, + level, cache_id); + case ACPI_MPAM_LOCATION_TYPE_MEMORY: + return mpam_ris_create(msc, res->ris_index, MPAM_CLASS_MEMORY, + 255, res->locator.memory_locator.proximity_domain); + default: + /* These get discovered later and treated as unknown */ + return 0; + } +} + +int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc) +{ + int i, err; + struct acpi_mpam_resource_node *resources; + + resources = (struct acpi_mpam_resource_node *)(tbl_msc + 1); + for (i = 0; i < tbl_msc->num_resouce_nodes; i++) { + err = acpi_mpam_parse_resource(msc, &resources[i]); + if (err) + return err; + } + + return 0; +} + +static bool __init parse_msc_pm_link(struct acpi_mpam_msc_node *tbl_msc, + struct platform_device *pdev, + u32 *acpi_id) +{ + bool acpi_id_valid = false; + struct acpi_device *buddy; + char hid[16], uid[16]; + int err; + + memset(&hid, 0, sizeof(hid)); + memcpy(hid, &tbl_msc->hardware_id_linked_device, + sizeof(tbl_msc->hardware_id_linked_device)); + + if (!strcmp(hid, ACPI_PROCESSOR_CONTAINER_HID)) { + *acpi_id = tbl_msc->instance_id_linked_device; + acpi_id_valid = true; + } + + err = snprintf(uid, sizeof(uid), "%u", + tbl_msc->instance_id_linked_device); + if (err < 0 || err >= sizeof(uid)) + return acpi_id_valid; + + buddy = acpi_dev_get_first_match_dev(hid, uid, -1); + if (buddy) { + device_link_add(&pdev->dev, &buddy->dev, DL_FLAG_STATELESS); + } + + return acpi_id_valid; +} + +static int decode_interface_type(struct acpi_mpam_msc_node *tbl_msc, + enum mpam_msc_iface *iface) +{ + switch (tbl_msc->interface_type){ + case 0: + *iface = MPAM_IFACE_MMIO; + return 0; + case 1: + *iface = MPAM_IFACE_PCC; + 
return 0; + default: + return -EINVAL; + } +} + +static int __init _parse_table(struct acpi_table_header *table) +{ + char *table_end, *table_offset = (char *)(table + 1); + struct property_entry props[4]; /* needs a sentinel */ + struct acpi_mpam_msc_node *tbl_msc; + int next_res, next_prop, err = 0; + struct acpi_device *companion; + struct platform_device *pdev; + enum mpam_msc_iface iface; + struct resource res[3]; + char uid[16]; + u32 acpi_id; + + table_end = (char *)table + table->length; + + while (table_offset < table_end) { + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + table_offset += tbl_msc->length; + + /* + * If any of the reserved fields are set, make no attempt to + * parse the msc structure. This will prevent the driver from + * probing all the MSC, meaning it can't discover the system + * wide supported partid and pmg ranges. This avoids whatever + * this MSC is truncating the partids and creating a screaming + * error interrupt. + */ + if (tbl_msc->reserved || tbl_msc->reserved1 || tbl_msc->reserved2) + continue; + + if (decode_interface_type(tbl_msc, &iface)) + continue; + + next_res = 0; + next_prop = 0; + memset(res, 0, sizeof(res)); + memset(props, 0, sizeof(props)); + + pdev = platform_device_alloc("mpam_msc", tbl_msc->identifier); + if (IS_ERR(pdev)) { + err = PTR_ERR(pdev); + break; + } + + if (tbl_msc->length < sizeof(*tbl_msc)) { + err = -EINVAL; + break; + } + + /* Some power management is described in the namespace: */ + err = snprintf(uid, sizeof(uid), "%u", tbl_msc->identifier); + if (err > 0 && err < sizeof(uid)) { + companion = acpi_dev_get_first_match_dev("ARMHAA5C", uid, -1); + if (companion) + ACPI_COMPANION_SET(&pdev->dev, companion); + } + + if (iface == MPAM_IFACE_MMIO) { + res[next_res].name = "MPAM:MSC"; + res[next_res].start = tbl_msc->base_address; + res[next_res].end = tbl_msc->base_address + tbl_msc->mmio_size - 1; + res[next_res].flags = IORESOURCE_MEM; + next_res++; + } else if (iface == MPAM_IFACE_PCC) { + props[next_prop++] = PROPERTY_ENTRY_U32("pcc-channel", + tbl_msc->base_address); + next_prop++; + } + + acpi_mpam_parse_irqs(pdev, tbl_msc, res, &next_res); + err = platform_device_add_resources(pdev, res, next_res); + if (err) + break; + + props[next_prop++] = PROPERTY_ENTRY_U32("arm,not-ready-us", + tbl_msc->max_nrdy_usec); + + /* + * The MSC's CPU affinity is described via its linked power + * management device, but only if it points at a Processor or + * Processor Container. 
+ */ + if (parse_msc_pm_link(tbl_msc, pdev, &acpi_id)) { + props[next_prop++] = PROPERTY_ENTRY_U32("cpu_affinity", + acpi_id); + } + + err = device_create_managed_software_node(&pdev->dev, props, + NULL); + if (err) + break; + + /* Come back later if you want the RIS too */ + err = platform_device_add_data(pdev, tbl_msc, tbl_msc->length); + if (err) + break; + + platform_device_add(pdev); + } + + if (err) + platform_device_put(pdev); + + return err; +} + +static struct acpi_table_header *get_table(void) +{ + struct acpi_table_header *table; + acpi_status status; + + if (acpi_disabled || !mpam_cpus_have_feature()) + return NULL; + + status = acpi_get_table(ACPI_SIG_MPAM, 0, &table); + if (ACPI_FAILURE(status)) + return NULL; + + if (table->revision != 1) + return NULL; + + return table; +} + + + +static int __init acpi_mpam_parse(void) +{ + struct acpi_table_header *mpam; + int err; + + mpam = get_table(); + if (!mpam) + return 0; + + err = _parse_table(mpam); + acpi_put_table(mpam); + + return err; +} + +static int _count_msc(struct acpi_table_header *table) +{ + char *table_end, *table_offset = (char *)(table + 1); + struct acpi_mpam_msc_node *tbl_msc; + int ret = 0; + + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + table_end = (char *)table + table->length; + + while (table_offset < table_end) { + if (tbl_msc->length < sizeof(*tbl_msc)) + return -EINVAL; + + ret++; + + table_offset += tbl_msc->length; + tbl_msc = (struct acpi_mpam_msc_node *)table_offset; + } + + return ret; +} + + +int acpi_mpam_count_msc(void) +{ + struct acpi_table_header *mpam; + int ret; + + mpam = get_table(); + if (!mpam) + return 0; + + ret = _count_msc(mpam); + acpi_put_table(mpam); + + return ret; +} + +/* + * Call after ACPI devices have been created, which happens behind acpi_scan_init() + * called from subsys_initcall(). PCC requires the mailbox driver, which is + * initialised from postcore_initcall(). + */ +subsys_initcall_sync(acpi_mpam_parse); diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index ffa1888844fe..d7b74e909623 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -580,7 +580,7 @@ static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst = { ACPI_SIG_PSDT, ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, ACPI_SIG_IORT, ACPI_SIG_NFIT, ACPI_SIG_HMAT, ACPI_SIG_PPTT, ACPI_SIG_NHLT, ACPI_SIG_AEST, ACPI_SIG_CEDT, ACPI_SIG_AGDI, - ACPI_SIG_NBFT }; + ACPI_SIG_NBFT, ACPI_SIG_MPAM }; #define ACPI_HEADER_SIZE sizeof(struct acpi_table_header) diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h new file mode 100644 index 000000000000..0f1d3f07e789 --- /dev/null +++ b/include/linux/arm_mpam.h @@ -0,0 +1,43 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2021 Arm Ltd. + +#ifndef __LINUX_ARM_MPAM_H +#define __LINUX_ARM_MPAM_H + +#include <linux/acpi.h> +#include <linux/types.h> + +struct mpam_msc; + +enum mpam_msc_iface { + MPAM_IFACE_MMIO, /* a real MPAM MSC */ + MPAM_IFACE_PCC, /* a fake MPAM MSC */ +}; + +enum mpam_class_types { + MPAM_CLASS_CACHE, /* Well known caches, e.g. L2 */ + MPAM_CLASS_MEMORY, /* Main memory */ + MPAM_CLASS_UNKNOWN, /* Everything else, e.g. SMMU */ +}; + +#ifdef CONFIG_ACPI_MPAM +/* Parse the ACPI description of resources entries for this MSC. 
*/ +int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc); +int acpi_mpam_count_msc(void); +#else +static inline int acpi_mpam_parse_resources(struct mpam_msc *msc, + struct acpi_mpam_msc_node *tbl_msc) +{ + return -EINVAL; +} +static inline int acpi_mpam_count_msc(void) { return -EINVAL; } +#endif + +static inline int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id) +{ + return -EINVAL; +} + +#endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1
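The pointer arithmetic in acpi_mpam_parse_resources() relies on the MPAM table layout: each variable-length MSC node is immediately followed by its array of resource nodes. A sketch of that layout, with field names per ACPICA (including its 'num_resouce_nodes' spelling); note the resources[i] indexing implicitly assumes no functional-dependency descriptors trail each resource node:

/*
 * struct acpi_table_mpam (standard ACPI header)
 *   struct acpi_mpam_msc_node		<- tbl_msc
 *     struct acpi_mpam_resource_node	<- (tbl_msc + 1), entry 0
 *     struct acpi_mpam_resource_node	<- entry 1
 *     ...				   (num_resouce_nodes entries)
 *   struct acpi_mpam_msc_node		<- (char *)tbl_msc + tbl_msc->length
 *     ...
 */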

From: Rob Herring <robh@kernel.org> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- The binding is designed around the assumption that an MSC will be a sub-block of something else such as a memory controller, cache controller, or IOMMU. However, it's certainly possible a design does not have that association or has a mixture of both, so the binding illustrates how we can support that with RIS child nodes. A key part of MPAM is we need to know about all of the MSCs in the system before it can be enabled. This drives the need for the genericish 'arm,mpam-msc' compatible. Though we can't assume an MSC is accessible until a h/w specific driver potentially enables the h/w. Cc: James Morse <james.morse@arm.com> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- .../devicetree/bindings/arm/arm,mpam-msc.yaml | 227 ++++++++++++++++++ 1 file changed, 227 insertions(+) create mode 100644 Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml diff --git a/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml b/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml new file mode 100644 index 000000000000..9d542ecb1a7d --- /dev/null +++ b/Documentation/devicetree/bindings/arm/arm,mpam-msc.yaml @@ -0,0 +1,227 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/arm/arm,mpam-msc.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Arm Memory System Resource Partitioning and Monitoring (MPAM) + +description: | + The Arm MPAM specification can be found here: + + https://developer.arm.com/documentation/ddi0598/latest + +maintainers: + - Rob Herring <robh@kernel.org> + +properties: + compatible: + items: + - const: arm,mpam-msc # Further details are discoverable + - const: arm,mpam-memory-controller-msc + + reg: + maxItems: 1 + description: A memory region containing registers as defined in the MPAM + specification. + + interrupts: + minItems: 1 + items: + - description: error (optional) + - description: overflow (optional, only for monitoring) + + interrupt-names: + oneOf: + - items: + - enum: [ error, overflow ] + - items: + - const: error + - const: overflow + + arm,not-ready-us: + description: The maximum time in microseconds for monitoring data to be + accurate after a settings change. For more information, see the + Not-Ready (NRDY) bit description in the MPAM specification. + + numa-node-id: true # see NUMA binding + + '#address-cells': + const: 1 + + '#size-cells': + const: 0 + +patternProperties: + '^ris@[0-9a-f]$': + type: object + additionalProperties: false + description: | + RIS nodes for each RIS in an MSC. These nodes are required for each RIS + implementing known MPAM controls + + properties: + compatible: + enum: + # Bulk storage for cache + - arm,mpam-cache + # Memory bandwidth + - arm,mpam-memory + + reg: + minimum: 0 + maximum: 0xf + + cpus: + $ref: '/schemas/types.yaml#/definitions/phandle-array' + description: + Phandle(s) to the CPU node(s) this RIS belongs to. By default, the parent + device's affinity is used. + + arm,mpam-device: + $ref: '/schemas/types.yaml#/definitions/phandle' + description: + By default, the MPAM enabled device associated with a RIS is the MSC's + parent node. 
It is possible for each RIS to be associated with different + devices in which case 'arm,mpam-device' should be used. + + required: + - compatible + - reg + +required: + - compatible + - reg + +dependencies: + interrupts: [ interrupt-names ] + +additionalProperties: false + +examples: + - | + /* + cpus { + cpu@0 { + next-level-cache = <&L2_0>; + }; + cpu@100 { + next-level-cache = <&L2_1>; + }; + }; + */ + L2_0: cache-controller-0 { + compatible = "cache"; + cache-level = <2>; + cache-unified; + next-level-cache = <&L3>; + + }; + + L2_1: cache-controller-1 { + compatible = "cache"; + cache-level = <2>; + cache-unified; + next-level-cache = <&L3>; + + }; + + L3: cache-controller@30000000 { + compatible = "arm,dsu-l3-cache", "cache"; + cache-level = <3>; + cache-unified; + + ranges = <0x0 0x30000000 0x800000>; + #address-cells = <1>; + #size-cells = <1>; + + msc@10000 { + compatible = "arm,mpam-msc"; + + /* CPU affinity implied by parent cache node's */ + reg = <0x10000 0x2000>; + interrupts = <1>, <2>; + interrupt-names = "error", "overflow"; + arm,not-ready-us = <1>; + }; + }; + + mem: memory-controller@20000 { + compatible = "foo,a-memory-controller"; + reg = <0x20000 0x1000>; + + #address-cells = <1>; + #size-cells = <1>; + ranges; + + msc@21000 { + compatible = "arm,mpam-memory-controller-msc", "arm,mpam-msc"; + reg = <0x21000 0x1000>; + interrupts = <3>; + interrupt-names = "error"; + arm,not-ready-us = <1>; + numa-node-id = <1>; + }; + }; + + iommu@40000 { + reg = <0x40000 0x1000>; + + ranges; + #address-cells = <1>; + #size-cells = <1>; + + msc@41000 { + compatible = "arm,mpam-msc"; + reg = <0 0x1000>; + interrupts = <5>, <6>; + interrupt-names = "error", "overflow"; + arm,not-ready-us = <1>; + + #address-cells = <1>; + #size-cells = <0>; + + ris@2 { + compatible = "arm,mpam-cache"; + reg = <0>; + // TODO: How to map to device(s)? + }; + }; + }; + + msc@80000 { + compatible = "foo,a-standalone-msc"; + reg = <0x80000 0x1000>; + + clocks = <&clks 123>; + + ranges; + #address-cells = <1>; + #size-cells = <1>; + + msc@10000 { + compatible = "arm,mpam-msc"; + + reg = <0x10000 0x2000>; + interrupts = <7>; + interrupt-names = "overflow"; + arm,not-ready-us = <1>; + + #address-cells = <1>; + #size-cells = <0>; + + ris@0 { + compatible = "arm,mpam-cache"; + reg = <0>; + arm,mpam-device = <&L2_0>; + }; + + ris@1 { + compatible = "arm,mpam-memory"; + reg = <1>; + arm,mpam-device = <&mem>; + }; + }; + }; + +... -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Probing MPAM is convoluted. MSCs that are integrated with a CPU may only be accessible from those CPUs, and they may not be online. Touching the hardware early is pointless as MPAM can't be used until the system-wide common values for num_partid and num_pmg have been discovered. Start with driver probe/remove and mapping the MSC. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/Kconfig | 1 + drivers/platform/Kconfig | 2 + drivers/platform/Makefile | 1 + drivers/platform/mpam/Kconfig | 6 + drivers/platform/mpam/Makefile | 1 + drivers/platform/mpam/mpam_devices.c | 355 ++++++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 48 ++++ 7 files changed, 414 insertions(+) create mode 100644 drivers/platform/mpam/Kconfig create mode 100644 drivers/platform/mpam/Makefile create mode 100644 drivers/platform/mpam/mpam_devices.c create mode 100644 drivers/platform/mpam/mpam_internal.h diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 145995b14daa..95c50f3ac290 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2016,6 +2016,7 @@ config ARM64_TLB_RANGE config ARM64_MPAM bool "Enable support for MPAM" select ACPI_MPAM if ACPI + select ARM_CPU_RESCTRL help Memory Partitioning and Monitoring is an optional extension that allows the CPUs to mark load and store transactions with diff --git a/drivers/platform/Kconfig b/drivers/platform/Kconfig index 868b20361769..f26534a4a83b 100644 --- a/drivers/platform/Kconfig +++ b/drivers/platform/Kconfig @@ -9,6 +9,8 @@ source "drivers/platform/chrome/Kconfig" source "drivers/platform/mellanox/Kconfig" +source "drivers/platform/mpam/Kconfig" + source "drivers/platform/olpc/Kconfig" source "drivers/platform/surface/Kconfig" diff --git a/drivers/platform/Makefile b/drivers/platform/Makefile index 41640172975a..db73de7cb1f4 100644 --- a/drivers/platform/Makefile +++ b/drivers/platform/Makefile @@ -11,3 +11,4 @@ obj-$(CONFIG_OLPC_EC) += olpc/ obj-$(CONFIG_GOLDFISH) += goldfish/ obj-$(CONFIG_CHROME_PLATFORMS) += chrome/ obj-$(CONFIG_SURFACE_PLATFORMS) += surface/ +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam/ diff --git a/drivers/platform/mpam/Kconfig b/drivers/platform/mpam/Kconfig new file mode 100644 index 000000000000..13bd86fc5e58 --- /dev/null +++ b/drivers/platform/mpam/Kconfig @@ -0,0 +1,6 @@ +# Confusingly, this is everything but the CPU bits of MPAM. CPU here means +# CPU resources, not containers or cgroups etc. +config ARM_CPU_RESCTRL + bool + depends on ARM64 + select RESCTRL_RMID_DEPENDS_ON_CLOSID diff --git a/drivers/platform/mpam/Makefile b/drivers/platform/mpam/Makefile new file mode 100644 index 000000000000..8ad69bfa2aa2 --- /dev/null +++ b/drivers/platform/mpam/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam_devices.o diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c new file mode 100644 index 000000000000..885f8b61cb65 --- /dev/null +++ b/drivers/platform/mpam/mpam_devices.c @@ -0,0 +1,355 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2022 Arm Ltd. 
+ +#define pr_fmt(fmt) "mpam: " fmt + +#include <linux/acpi.h> +#include <linux/arm_mpam.h> +#include <linux/cacheinfo.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/device.h> +#include <linux/errno.h> +#include <linux/gfp.h> +#include <linux/list.h> +#include <linux/lockdep.h> +#include <linux/mutex.h> +#include <linux/of.h> +#include <linux/of_platform.h> +#include <linux/platform_device.h> +#include <linux/printk.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/srcu.h> +#include <linux/types.h> + +#include <acpi/pcc.h> + +#include <asm/mpam.h> + +#include "mpam_internal.h" + +/* + * mpam_list_lock protects the SRCU lists when writing. Once the + * mpam_enabled key is enabled these lists are read-only, + * unless the error interrupt disables the driver. + */ +static DEFINE_MUTEX(mpam_list_lock); +static LIST_HEAD(mpam_all_msc); + +static struct srcu_struct mpam_srcu; + +/* MPAM isn't available until all the MSC have been probed. */ +static u32 mpam_num_msc; + +static void mpam_discovery_complete(void) +{ + pr_err("Discovered all MSC\n"); +} + +static int mpam_dt_count_msc(void) +{ + int count = 0; + struct device_node *np; + + for_each_compatible_node(np, NULL, "arm,mpam-msc") + count++; + + return count; +} + +static int mpam_dt_parse_resource(struct mpam_msc *msc, struct device_node *np, + u32 ris_idx) +{ + int err = 0; + u32 level = 0; + unsigned long cache_id; + struct device_node *cache; + + do { + if (of_device_is_compatible(np, "arm,mpam-cache")) { + cache = of_parse_phandle(np, "arm,mpam-device", 0); + if (!cache) { + pr_err("Failed to read phandle\n"); + break; + } + } else if (of_device_is_compatible(np->parent, "cache")) { + cache = np->parent; + } else { + /* For now, only caches are supported */ + cache = NULL; + break; + } + + err = of_property_read_u32(cache, "cache-level", &level); + if (err) { + pr_err("Failed to read cache-level\n"); + break; + } + + cache_id = cache_of_get_id(cache); + if (cache_id == ~0UL) { + err = -ENOENT; + break; + } + + err = mpam_ris_create(msc, ris_idx, MPAM_CLASS_CACHE, level, + cache_id); + } while (0); + of_node_put(cache); + + return err; +} + + +static int mpam_dt_parse_resources(struct mpam_msc *msc, void *ignored) +{ + int err, num_ris = 0; + const u32 *ris_idx_p; + struct device_node *iter, *np; + + np = msc->pdev->dev.of_node; + for_each_child_of_node(np, iter) { + ris_idx_p = of_get_property(iter, "reg", NULL); + if (ris_idx_p) { + num_ris++; + err = mpam_dt_parse_resource(msc, iter, *ris_idx_p); + if (err) { + of_node_put(iter); + return err; + } + } + } + + if (!num_ris) + mpam_dt_parse_resource(msc, np, 0); + + return err; +} + +static int get_msc_affinity(struct mpam_msc *msc) +{ + struct device_node *parent; + u32 affinity_id; + int err; + + if (!acpi_disabled) { + err = device_property_read_u32(&msc->pdev->dev, "cpu_affinity", + &affinity_id); + if (err) { + cpumask_copy(&msc->accessibility, cpu_possible_mask); + err = 0; + } else { + err = acpi_pptt_get_cpus_from_container(affinity_id, + &msc->accessibility); + } + + return err; + } + + /* This depends on the path to of_node */ + parent = of_get_parent(msc->pdev->dev.of_node); + if (parent == of_root) { + cpumask_copy(&msc->accessibility, cpu_possible_mask); + err = 0; + } else { + err = -EINVAL; + pr_err("Cannot determine CPU accessibility of MSC\n"); + } + of_node_put(parent); + + return err; +} + +static int fw_num_msc; + +static void mpam_pcc_rx_callback(struct mbox_client *cl, void *msg) +{ + /* TODO: wake up tasks blocked 
+         * on this MSC's PCC channel */
+}
+
+static int mpam_msc_drv_probe(struct platform_device *pdev)
+{
+        int err;
+        pgprot_t prot;
+        void __iomem *io;
+        struct mpam_msc *msc;
+        struct resource *msc_res;
+        void *plat_data = pdev->dev.platform_data;
+
+        mutex_lock(&mpam_list_lock);
+        do {
+                msc = devm_kzalloc(&pdev->dev, sizeof(*msc), GFP_KERNEL);
+                if (!msc) {
+                        err = -ENOMEM;
+                        break;
+                }
+
+                INIT_LIST_HEAD_RCU(&msc->glbl_list);
+                msc->pdev = pdev;
+
+                err = device_property_read_u32(&pdev->dev, "arm,not-ready-us",
+                                               &msc->nrdy_usec);
+                if (err) {
+                        /* This will prevent CSU monitors being usable */
+                        msc->nrdy_usec = 0;
+                }
+
+                err = get_msc_affinity(msc);
+                if (err)
+                        break;
+                if (cpumask_empty(&msc->accessibility)) {
+                        /* msc->id is not assigned yet, print the device name */
+                        pr_err_once("%s is not accessible from any CPU!",
+                                    dev_name(&pdev->dev));
+                        err = -EINVAL;
+                        break;
+                }
+
+                mutex_init(&msc->lock);
+                msc->id = mpam_num_msc++;
+                INIT_LIST_HEAD_RCU(&msc->ris);
+                spin_lock_init(&msc->part_sel_lock);
+                spin_lock_init(&msc->mon_sel_lock);
+
+                if (device_property_read_u32(&pdev->dev, "pcc-channel",
+                                             &msc->pcc_subspace_id))
+                        msc->iface = MPAM_IFACE_MMIO;
+                else
+                        msc->iface = MPAM_IFACE_PCC;
+
+                if (msc->iface == MPAM_IFACE_MMIO) {
+                        io = devm_platform_get_and_ioremap_resource(pdev, 0,
+                                                                    &msc_res);
+                        if (IS_ERR(io)) {
+                                pr_err("Failed to map MSC base address\n");
+                                devm_kfree(&pdev->dev, msc);
+                                err = PTR_ERR(io);
+                                break;
+                        }
+                        msc->mapped_hwpage_sz = resource_size(msc_res);
+                        msc->mapped_hwpage = io;
+                } else if (msc->iface == MPAM_IFACE_PCC) {
+                        msc->pcc_cl.dev = &pdev->dev;
+                        msc->pcc_cl.rx_callback = mpam_pcc_rx_callback;
+                        msc->pcc_cl.tx_block = false;
+                        msc->pcc_cl.tx_tout = 1000; /* 1s */
+                        msc->pcc_cl.knows_txdone = false;
+
+                        msc->pcc_chan = pcc_mbox_request_channel(&msc->pcc_cl,
+                                                                 msc->pcc_subspace_id);
+                        if (IS_ERR(msc->pcc_chan)) {
+                                pr_err("Failed to request MSC PCC channel\n");
+                                devm_kfree(&pdev->dev, msc);
+                                err = PTR_ERR(msc->pcc_chan);
+                                break;
+                        }
+
+                        prot = __acpi_get_mem_attribute(msc->pcc_chan->shmem_base_addr);
+                        io = ioremap_prot(msc->pcc_chan->shmem_base_addr,
+                                          msc->pcc_chan->shmem_size, pgprot_val(prot));
+                        if (!io) {
+                                /* ioremap_prot() returns NULL on failure */
+                                pr_err("Failed to map MSC PCC shared memory\n");
+                                pcc_mbox_free_channel(msc->pcc_chan);
+                                devm_kfree(&pdev->dev, msc);
+                                err = -ENOMEM;
+                                break;
+                        }
+
+                        /* TODO: issue a read to update the registers */
+
+                        msc->mapped_hwpage_sz = msc->pcc_chan->shmem_size;
+                        msc->mapped_hwpage = io + sizeof(struct acpi_pcct_shared_memory);
+                }
+
+                list_add_rcu(&msc->glbl_list, &mpam_all_msc);
+                platform_set_drvdata(pdev, msc);
+        } while (0);
+        mutex_unlock(&mpam_list_lock);
+
+        if (!err) {
+                /* Create RIS entries described by firmware */
+                if (!acpi_disabled)
+                        err = acpi_mpam_parse_resources(msc, plat_data);
+                else
+                        err = mpam_dt_parse_resources(msc, plat_data);
+        }
+
+        if (!err && fw_num_msc == mpam_num_msc)
+                mpam_discovery_complete();
+
+        return err;
+}
+
+static int mpam_msc_drv_remove(struct platform_device *pdev)
+{
+        struct mpam_msc *msc = platform_get_drvdata(pdev);
+
+        if (!msc)
+                return 0;
+
+        mutex_lock(&mpam_list_lock);
+        mpam_num_msc--;
+        platform_set_drvdata(pdev, NULL);
+        list_del_rcu(&msc->glbl_list);
+        synchronize_srcu(&mpam_srcu);
+        mutex_unlock(&mpam_list_lock);
+
+        return 0;
+}
+
+static const struct of_device_id mpam_of_match[] = {
+        { .compatible = "arm,mpam-msc", },
+        {},
+};
+MODULE_DEVICE_TABLE(of, mpam_of_match);
+
+static struct platform_driver mpam_msc_driver = {
+        .driver = {
+                .name = "mpam_msc",
+                .of_match_table = of_match_ptr(mpam_of_match),
+        },
+        .probe = mpam_msc_drv_probe,
.remove = mpam_msc_drv_remove, +}; + +/* + * MSC that are hidden under caches are not created as platform devices + * as there is no cache driver. Caches are also special-cased in + * get_msc_affinity(). + */ +static void mpam_dt_create_foundling_msc(void) +{ + int err; + struct device_node *cache; + + for_each_compatible_node(cache, NULL, "cache") { + err = of_platform_populate(cache, mpam_of_match, NULL, NULL); + if (err) { + pr_err("Failed to create MSC devices under caches\n"); + } + } +} + +static int __init mpam_msc_driver_init(void) +{ + if (!mpam_cpus_have_feature()) + return -EOPNOTSUPP; + + init_srcu_struct(&mpam_srcu); + + if (!acpi_disabled) + fw_num_msc = acpi_mpam_count_msc(); + else + fw_num_msc = mpam_dt_count_msc(); + + if (fw_num_msc <= 0) { + pr_err("No MSC devices found in firmware\n"); + return -EINVAL; + } + + if (acpi_disabled) + mpam_dt_create_foundling_msc(); + + return platform_driver_register(&mpam_msc_driver); +} +subsys_initcall(mpam_msc_driver_init); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h new file mode 100644 index 000000000000..affd7999fcad --- /dev/null +++ b/drivers/platform/mpam/mpam_internal.h @@ -0,0 +1,48 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2021 Arm Ltd. + +#ifndef MPAM_INTERNAL_H +#define MPAM_INTERNAL_H + +#include <linux/arm_mpam.h> +#include <linux/cpumask.h> +#include <linux/io.h> +#include <linux/mailbox_client.h> +#include <linux/mutex.h> +#include <linux/resctrl.h> +#include <linux/sizes.h> + +struct mpam_msc +{ + /* member of mpam_all_msc */ + struct list_head glbl_list; + + int id; + struct platform_device *pdev; + + /* Not modified after mpam_is_enabled() becomes true */ + enum mpam_msc_iface iface; + u32 pcc_subspace_id; + struct mbox_client pcc_cl; + struct pcc_mbox_chan *pcc_chan; + u32 nrdy_usec; + cpumask_t accessibility; + + struct mutex lock; + unsigned long ris_idxs[128 / BITS_PER_LONG]; + u32 ris_max; + + /* mpam_msc_ris of this component */ + struct list_head ris; + + /* + * part_sel_lock protects access to the MSC hardware registers that are + * affected by MPAMCFG_PART_SEL. (including the ID registers) + * If needed, take msc->lock first. + */ + spinlock_t part_sel_lock; + void __iomem * mapped_hwpage; + size_t mapped_hwpage_sz; +}; + +#endif /* MPAM_INTERNAL_H */ -- 2.25.1
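[Illustration, not part of the patch: a minimal sketch of the read side that the locking comment above implies. Readers walk mpam_all_msc under the SRCU read lock; writers hold mpam_list_lock and call synchronize_srcu() before freeing. The function name is made up.]

static void example_walk_all_msc(void)
{
        struct mpam_msc *msc;
        int idx;

        idx = srcu_read_lock(&mpam_srcu);
        list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list)
                pr_debug("msc:%d iface:%d\n", msc->id, msc->iface);
        srcu_read_unlock(&mpam_srcu, idx);
}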

From: James Morse <james.morse@arm.com>

maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/...

---------------------------

An MSC is a container of resources, each identified by its RIS index.
Some RIS are described by firmware to provide their position in the
system. Others are discovered when the driver probes the hardware.

To configure a resource it needs to be found by its class, e.g. 'L2'.
There are two kinds of grouping. A class is a set of components; these
are visible, as there are likely to be multiple instances of the L2
cache. A struct mpam_component is a set of struct mpam_msc_ris; these
are not visible, as each L2 cache may be composed of individual slices
that must be configured identically because the hardware cannot
distribute the configuration.

Add support for creating and destroying these structures.

A gfp is passed, as the structure for 'unknown' may need creating if a
new RIS entry is discovered when probing the MSC.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c  | 362 +++++++++++++++++++++++++-
 drivers/platform/mpam/mpam_internal.h |  52 ++++
 include/linux/arm_mpam.h              |   7 +-
 3 files changed, 412 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 885f8b61cb65..599bd4ee5b00 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -20,7 +20,6 @@
 #include <linux/printk.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/srcu.h>
 #include <linux/types.h>
@@ -37,11 +36,359 @@
 static DEFINE_MUTEX(mpam_list_lock);
 static LIST_HEAD(mpam_all_msc);
 
-static struct srcu_struct mpam_srcu;
+struct srcu_struct mpam_srcu;
 
 /* MPAM isn't available until all the MSC have been probed. */
 static u32 mpam_num_msc;
 
+/*
+ * An MSC is a container for resources, each identified by their RIS index.
+ * Components are a group of RIS that control the same thing.
+ * Classes are the set of components of the same type.
+ *
+ * e.g. The set of RIS that make up the L2 are a component. These are sometimes
+ * termed slices. They should be configured as if they were one MSC.
+ *
+ * e.g. The SoC probably has more than one L2, each attached to a distinct set
+ * of CPUs. All the L2 components are grouped as a class.
+ *
+ * When creating an MSC, struct mpam_msc is added to the mpam_all_msc list,
+ * then linked via struct mpam_ris to a component and a class.
+ * The same MSC may exist under different class->component paths, but the RIS
+ * index will be unique.
+ */ +LIST_HEAD(mpam_classes); + +static struct mpam_component * +mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp) +{ + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + comp = kzalloc(sizeof(*comp), gfp); + if (!comp) + return ERR_PTR(-ENOMEM); + + comp->comp_id = id; + INIT_LIST_HEAD_RCU(&comp->ris); + /* affinity is updated when ris are added */ + INIT_LIST_HEAD_RCU(&comp->class_list); + comp->class = class; + + list_add_rcu(&comp->class_list, &class->components); + + return comp; +} + +static struct mpam_component * +mpam_component_get(struct mpam_class *class, int id, bool alloc, gfp_t gfp) +{ + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(comp, &class->components, class_list) { + if (comp->comp_id == id) + return comp; + } + + if (!alloc) + return ERR_PTR(-ENOENT); + + return mpam_component_alloc(class, id, gfp); +} + +static struct mpam_class * +mpam_class_alloc(u8 level_idx, enum mpam_class_types type, gfp_t gfp) +{ + struct mpam_class *class; + + lockdep_assert_held(&mpam_list_lock); + + class = kzalloc(sizeof(*class), gfp); + if (!class) + return ERR_PTR(-ENOMEM); + + INIT_LIST_HEAD_RCU(&class->components); + /* affinity is updated when ris are added */ + class->level = level_idx; + class->type = type; + INIT_LIST_HEAD_RCU(&class->classes_list); + + list_add_rcu(&class->classes_list, &mpam_classes); + + return class; +} + +static struct mpam_class * +mpam_class_get(u8 level_idx, enum mpam_class_types type, bool alloc, gfp_t gfp) +{ + bool found = false; + struct mpam_class *class; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + if (class->type == type && class->level == level_idx) { + found = true; + break; + } + } + + if (found) + return class; + + if (!alloc) + return ERR_PTR(-ENOENT); + + return mpam_class_alloc(level_idx, type, gfp); +} + +static void mpam_class_destroy(struct mpam_class *class) +{ + lockdep_assert_held(&mpam_list_lock); + + list_del_rcu(&class->classes_list); + synchronize_srcu(&mpam_srcu); + kfree(class); +} + +static void mpam_comp_destroy(struct mpam_component *comp) +{ + struct mpam_class *class = comp->class; + + lockdep_assert_held(&mpam_list_lock); + + list_del_rcu(&comp->class_list); + synchronize_srcu(&mpam_srcu); + kfree(comp); + + if (list_empty(&class->components)) + mpam_class_destroy(class); +} + +/* synchronise_srcu() before freeing ris */ +static void mpam_ris_destroy(struct mpam_msc_ris *ris) +{ + struct mpam_component *comp = ris->comp; + struct mpam_class *class = comp->class; + struct mpam_msc *msc = ris->msc; + + lockdep_assert_held(&mpam_list_lock); + lockdep_assert_preemption_enabled(); + + clear_bit(ris->ris_idx, msc->ris_idxs); + list_del_rcu(&ris->comp_list); + list_del_rcu(&ris->msc_list); + + cpumask_andnot(&comp->affinity, &comp->affinity, &ris->affinity); + cpumask_andnot(&class->affinity, &class->affinity, &ris->affinity); + + if (list_empty(&comp->ris)) + mpam_comp_destroy(comp); +} + +/* + * There are two ways of reaching a struct mpam_msc_ris. Via the + * class->component->ris, or via the msc. + * When destroying the msc, the other side needs unlinking and cleaning up too. + * synchronise_srcu() before freeing msc. 
+ */ +static void mpam_msc_destroy(struct mpam_msc *msc) +{ + struct mpam_msc_ris *ris, *tmp; + + lockdep_assert_held(&mpam_list_lock); + lockdep_assert_preemption_enabled(); + + list_for_each_entry_safe(ris, tmp, &msc->ris, msc_list) + mpam_ris_destroy(ris); +} + +/* + * The cacheinfo structures are only populated when CPUs are online. + * This helper walks the device tree to include offline CPUs too. + */ +static int get_cpumask_from_cache_id(u32 cache_id, u32 cache_level, + cpumask_t *affinity) +{ + int cpu, err; + u32 iter_level; + int iter_cache_id; + struct device_node *iter; + + if (!acpi_disabled) + return acpi_pptt_get_cpumask_from_cache_id(cache_id, affinity); + + for_each_possible_cpu(cpu) { + iter = of_get_cpu_node(cpu, NULL); + if (!iter) { + pr_err("Failed to find cpu%d device node\n", cpu); + return -ENOENT; + } + + while ((iter = of_find_next_cache_node(iter))) { + err = of_property_read_u32(iter, "cache-level", + &iter_level); + if (err || (iter_level != cache_level)) { + of_node_put(iter); + continue; + } + + /* + * get_cpu_cacheinfo_id() isn't ready until sometime + * during device_initcall(). Use cache_of_get_id(). + */ + iter_cache_id = cache_of_get_id(iter); + if (cache_id == ~0UL) { + of_node_put(iter); + continue; + } + + if (iter_cache_id == cache_id) + cpumask_set_cpu(cpu, affinity); + + of_node_put(iter); + } + } + + return 0; +} + + +/* + * cpumask_of_node() only knows about online CPUs. This can't tell us whether + * a class is represented on all possible CPUs. + */ +static void get_cpumask_from_node_id(u32 node_id, cpumask_t *affinity) +{ + int cpu; + + for_each_possible_cpu(cpu) { + if (node_id == cpu_to_node(cpu)) + cpumask_set_cpu(cpu, affinity); + } +} + +static int get_cpumask_from_cache(struct device_node *cache, + cpumask_t *affinity) +{ + int err; + u32 cache_level; + int cache_id; + + err = of_property_read_u32(cache, "cache-level", &cache_level); + if (err) { + pr_err("Failed to read cache-level from cache node\n"); + return -ENOENT; + } + + cache_id = cache_of_get_id(cache); + if (cache_id == ~0UL) { + pr_err("Failed to calculate cache-id from cache node\n"); + return -ENOENT; + } + + return get_cpumask_from_cache_id(cache_id, cache_level, affinity); +} + +static int mpam_ris_get_affinity(struct mpam_msc *msc, cpumask_t *affinity, + enum mpam_class_types type, + struct mpam_class *class, + struct mpam_component *comp) +{ + int err; + + switch (type) { + case MPAM_CLASS_CACHE: + err = get_cpumask_from_cache_id(comp->comp_id, class->level, + affinity); + if (err) + return err; + + if (cpumask_empty(affinity)) + pr_warn_once("%s no CPUs associated with cache node", + dev_name(&msc->pdev->dev)); + + break; + case MPAM_CLASS_MEMORY: + get_cpumask_from_node_id(comp->comp_id, affinity); + if (cpumask_empty(affinity)) + pr_warn_once("%s no CPUs associated with memory node", + dev_name(&msc->pdev->dev)); + break; + case MPAM_CLASS_UNKNOWN: + return 0; + } + + cpumask_and(affinity, affinity, &msc->accessibility); + + return 0; +} + +static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, + int component_id, gfp_t gfp) +{ + int err; + struct mpam_msc_ris *ris; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + if (test_and_set_bit(ris_idx, msc->ris_idxs)) + return -EBUSY; + + ris = devm_kzalloc(&msc->pdev->dev, sizeof(*ris), gfp); + if (!ris) + return -ENOMEM; + + class = mpam_class_get(class_id, type, true, gfp); + if (IS_ERR(class)) + return 
PTR_ERR(class); + + comp = mpam_component_get(class, component_id, true, gfp); + if (IS_ERR(comp)) { + if (list_empty(&class->components)) + mpam_class_destroy(class); + return PTR_ERR(comp); + } + + err = mpam_ris_get_affinity(msc, &ris->affinity, type, class, comp); + if (err) { + if (list_empty(&class->components)) + mpam_class_destroy(class); + return err; + } + + ris->ris_idx = ris_idx; + INIT_LIST_HEAD_RCU(&ris->comp_list); + INIT_LIST_HEAD_RCU(&ris->msc_list); + ris->msc = msc; + ris->comp = comp; + + cpumask_or(&comp->affinity, &comp->affinity, &ris->affinity); + cpumask_or(&class->affinity, &class->affinity, &ris->affinity); + list_add_rcu(&ris->comp_list, &comp->ris); + + return 0; +} + +int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id) +{ + int err; + + mutex_lock(&mpam_list_lock); + err = mpam_ris_create_locked(msc, ris_idx, type, class_id, + component_id, GFP_KERNEL); + mutex_unlock(&mpam_list_lock); + + return err; +} + static void mpam_discovery_complete(void) { pr_err("Discovered all MSC\n"); @@ -153,8 +500,14 @@ static int get_msc_affinity(struct mpam_msc *msc) cpumask_copy(&msc->accessibility, cpu_possible_mask); err = 0; } else { - err = -EINVAL; - pr_err("Cannot determine CPU accessibility of MSC\n"); + if (of_device_is_compatible(parent, "cache")) { + err = get_cpumask_from_cache(parent, + &msc->accessibility); + } else { + err = -EINVAL; + pr_err("Cannot determine accessibility of MSC: %s\n", + dev_name(&msc->pdev->dev)); + } } of_node_put(parent); @@ -291,6 +644,7 @@ static int mpam_msc_drv_remove(struct platform_device *pdev) mpam_num_msc--; platform_set_drvdata(pdev, NULL); list_del_rcu(&msc->glbl_list); + mpam_msc_destroy(msc); synchronize_srcu(&mpam_srcu); mutex_unlock(&mpam_list_lock); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index affd7999fcad..07d9c70bf1e6 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -11,6 +11,7 @@ #include <linux/mutex.h> #include <linux/resctrl.h> #include <linux/sizes.h> +#include <linux/srcu.h> struct mpam_msc { @@ -45,4 +46,55 @@ struct mpam_msc size_t mapped_hwpage_sz; }; +struct mpam_class +{ + /* mpam_components in this class */ + struct list_head components; + + cpumask_t affinity; + + u8 level; + enum mpam_class_types type; + + /* member of mpam_classes */ + struct list_head classes_list; +}; + +struct mpam_component +{ + u32 comp_id; + + /* mpam_msc_ris in this component */ + struct list_head ris; + + cpumask_t affinity; + + /* member of mpam_class:components */ + struct list_head class_list; + + /* parent: */ + struct mpam_class *class; +}; + +struct mpam_msc_ris +{ + u8 ris_idx; + + cpumask_t affinity; + + /* member of mpam_component:ris */ + struct list_head comp_list; + + /* member of mpam_msc:ris */ + struct list_head msc_list; + + /* parents: */ + struct mpam_msc *msc; + struct mpam_component *comp; +}; + +/* List of all classes */ +extern struct list_head mpam_classes; +extern struct srcu_struct mpam_srcu; + #endif /* MPAM_INTERNAL_H */ diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 0f1d3f07e789..950ea7049d53 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -34,10 +34,7 @@ static inline int acpi_mpam_parse_resources(struct mpam_msc *msc, static inline int acpi_mpam_count_msc(void) { return -EINVAL; } #endif -static inline int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, - enum mpam_class_types type, u8 
class_id, int component_id) -{ - return -EINVAL; -} +int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, + enum mpam_class_types type, u8 class_id, int component_id); #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1
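[Illustration, not part of the patch: how a firmware parser is expected to call the interface above. The RIS index, cache level and cache-id are made-up values.]

static int example_describe_l2_slice(struct mpam_msc *msc)
{
        /* RIS 3 of this MSC partitions a level-2 cache with cache-id 7 */
        return mpam_ris_create(msc, 3, MPAM_CLASS_CACHE, 2, 7);
}

Internally this takes mpam_list_lock, creates the 'L2' class and component 7 if they don't exist yet, and links the new struct mpam_msc_ris into the component's list.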

From: James Morse <james.morse@arm.com>

maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/...

---------------------------

Memory Partitioning and Monitoring (MPAM) has memory mapped devices
(MSCs) with an identity/configuration page.

Add the definitions for these registers as offsets within the page(s).

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_internal.h | 254 ++++++++++++++++++++++++++
 1 file changed, 254 insertions(+)

diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h
index 07d9c70bf1e6..5d339e4375e2 100644
--- a/drivers/platform/mpam/mpam_internal.h
+++ b/drivers/platform/mpam/mpam_internal.h
@@ -97,4 +97,258 @@ struct mpam_msc_ris
 extern struct list_head mpam_classes;
 extern struct srcu_struct mpam_srcu;
 
+
+/*
+ * MPAM MSCs have the following register layout. See:
+ * Arm Architecture Reference Manual Supplement - Memory System Resource
+ * Partitioning and Monitoring (MPAM), for Armv8-A. DDI 0598A.a
+ */
+#define MPAM_ARCHITECTURE_V1            0x10
+
+/* Memory mapped control pages: */
+/* ID Register offsets in the memory mapped page */
+#define MPAMF_IDR                       0x0000  /* features id register */
+#define MPAMF_MSMON_IDR                 0x0080  /* performance monitoring features */
+#define MPAMF_IMPL_IDR                  0x0028  /* imp-def partitioning */
+#define MPAMF_CPOR_IDR                  0x0030  /* cache-portion partitioning */
+#define MPAMF_CCAP_IDR                  0x0038  /* cache-capacity partitioning */
+#define MPAMF_MBW_IDR                   0x0040  /* mem-bw partitioning */
+#define MPAMF_PRI_IDR                   0x0048  /* priority partitioning */
+#define MPAMF_CSUMON_IDR                0x0088  /* cache-usage monitor */
+#define MPAMF_MBWUMON_IDR               0x0090  /* mem-bw usage monitor */
+#define MPAMF_PARTID_NRW_IDR            0x0050  /* partid-narrowing */
+#define MPAMF_IIDR                      0x0018  /* implementer id register */
+#define MPAMF_AIDR                      0x0020  /* architectural id register */
+
+/* Configuration and Status Register offsets in the memory mapped page */
+#define MPAMCFG_PART_SEL                0x0100  /* partid to configure */
+#define MPAMCFG_CPBM                    0x1000  /* cache-portion config */
+#define MPAMCFG_CMAX                    0x0108  /* cache-capacity config */
+#define MPAMCFG_MBW_MIN                 0x0200  /* min mem-bw config */
+#define MPAMCFG_MBW_MAX                 0x0208  /* max mem-bw config */
+#define MPAMCFG_MBW_WINWD               0x0220  /* mem-bw accounting window config */
+#define MPAMCFG_MBW_PBM                 0x2000  /* mem-bw portion bitmap config */
+#define MPAMCFG_PRI                     0x0400  /* priority partitioning config */
+#define MPAMCFG_MBW_PROP                0x0500  /* mem-bw stride config */
+#define MPAMCFG_INTPARTID               0x0600  /* partid-narrowing config */
+
+#define MSMON_CFG_MON_SEL               0x0800  /* monitor selector */
+#define MSMON_CFG_CSU_FLT               0x0810  /* cache-usage monitor filter */
+#define MSMON_CFG_CSU_CTL               0x0818  /* cache-usage monitor config */
+#define MSMON_CFG_MBWU_FLT              0x0820  /* mem-bw monitor filter */
+#define MSMON_CFG_MBWU_CTL              0x0828  /* mem-bw monitor config */
+#define MSMON_CSU                       0x0840  /* current cache-usage */
+#define MSMON_CSU_CAPTURE               0x0848  /* last cache-usage value captured */
+#define MSMON_MBWU                      0x0860  /* current mem-bw usage value */
+#define MSMON_MBWU_CAPTURE              0x0868  /* last mem-bw value captured */
+#define MSMON_CAPT_EVNT                 0x0808  /* signal a capture event */
+#define MPAMF_ESR                       0x00F8  /* error status register */
+#define MPAMF_ECR                       0x00F0  /* error control register */
+
+/* MPAMF_IDR - MPAM features ID register */
+#define MPAMF_IDR_PARTID_MAX            GENMASK(15, 0)
+#define MPAMF_IDR_PMG_MAX GENMASK(23, 16) +#define MPAMF_IDR_HAS_CCAP_PART BIT(24) +#define MPAMF_IDR_HAS_CPOR_PART BIT(25) +#define MPAMF_IDR_HAS_MBW_PART BIT(26) +#define MPAMF_IDR_HAS_PRI_PART BIT(27) +#define MPAMF_IDR_HAS_EXT BIT(28) +#define MPAMF_IDR_HAS_IMPL_IDR BIT(29) +#define MPAMF_IDR_HAS_MSMON BIT(30) +#define MPAMF_IDR_HAS_PARTID_NRW BIT(31) +#define MPAMF_IDR_HAS_RIS BIT(32) +#define MPAMF_IDR_HAS_EXT_ESR BIT(38) +#define MPAMF_IDR_HAS_ESR BIT(39) +#define MPAMF_IDR_RIS_MAX GENMASK(59, 56) + + +/* MPAMF_MSMON_IDR - MPAM performance monitoring ID register */ +#define MPAMF_MSMON_IDR_MSMON_CSU BIT(16) +#define MPAMF_MSMON_IDR_MSMON_MBWU BIT(17) +#define MPAMF_MSMON_IDR_HAS_LOCAL_CAPT_EVNT BIT(31) + +/* MPAMF_CPOR_IDR - MPAM features cache portion partitioning ID register */ +#define MPAMF_CPOR_IDR_CPBM_WD GENMASK(15, 0) + +/* MPAMF_CCAP_IDR - MPAM features cache capacity partitioning ID register */ +#define MPAMF_CCAP_IDR_CMAX_WD GENMASK(5, 0) + +/* MPAMF_MBW_IDR - MPAM features memory bandwidth partitioning ID register */ +#define MPAMF_MBW_IDR_BWA_WD GENMASK(5, 0) +#define MPAMF_MBW_IDR_HAS_MIN BIT(10) +#define MPAMF_MBW_IDR_HAS_MAX BIT(11) +#define MPAMF_MBW_IDR_HAS_PBM BIT(12) +#define MPAMF_MBW_IDR_HAS_PROP BIT(13) +#define MPAMF_MBW_IDR_WINDWR BIT(14) +#define MPAMF_MBW_IDR_BWPBM_WD GENMASK(28, 16) + +/* MPAMF_PRI_IDR - MPAM features priority partitioning ID register */ +#define MPAMF_PRI_IDR_HAS_INTPRI BIT(0) +#define MPAMF_PRI_IDR_INTPRI_0_IS_LOW BIT(1) +#define MPAMF_PRI_IDR_INTPRI_WD GENMASK(9, 4) +#define MPAMF_PRI_IDR_HAS_DSPRI BIT(16) +#define MPAMF_PRI_IDR_DSPRI_0_IS_LOW BIT(17) +#define MPAMF_PRI_IDR_DSPRI_WD GENMASK(25, 20) + +/* MPAMF_CSUMON_IDR - MPAM cache storage usage monitor ID register */ +#define MPAMF_CSUMON_IDR_NUM_MON GENMASK(15, 0) +#define MPAMF_CSUMON_IDR_HAS_CAPTURE BIT(31) + +/* MPAMF_MBWUMON_IDR - MPAM memory bandwidth usage monitor ID register */ +#define MPAMF_MBWUMON_IDR_NUM_MON GENMASK(15, 0) +#define MPAMF_MBWUMON_IDR_RWBW BIT(28) +#define MPAMF_MBWUMON_IDR_LWD BIT(29) +#define MPAMF_MBWUMON_IDR_HAS_LONG BIT(30) +#define MPAMF_MBWUMON_IDR_HAS_CAPTURE BIT(31) + +/* MPAMF_PARTID_NRW_IDR - MPAM PARTID narrowing ID register */ +#define MPAMF_PARTID_NRW_IDR_INTPARTID_MAX GENMASK(15, 0) + +/* MPAMF_IIDR - MPAM implementation ID register */ +#define MPAMF_IIDR_PRODUCTID GENMASK(31, 20) +#define MPAMF_IIDR_PRODUCTID_SHIFT 20 +#define MPAMF_IIDR_VARIANT GENMASK(19, 16) +#define MPAMF_IIDR_VARIANT_SHIFT 16 +#define MPAMF_IIDR_REVISON GENMASK(15, 12) +#define MPAMF_IIDR_REVISON_SHIFT 12 +#define MPAMF_IIDR_IMPLEMENTER GENMASK(11, 0) +#define MPAMF_IIDR_IMPLEMENTER_SHIFT 0 + +/* MPAMF_AIDR - MPAM architecture ID register */ +#define MPAMF_AIDR_ARCH_MAJOR_REV GENMASK(7, 4) +#define MPAMF_AIDR_ARCH_MINOR_REV GENMASK(3, 0) + +/* MPAMCFG_PART_SEL - MPAM partition configuration selection register */ +#define MPAMCFG_PART_SEL_PARTID_SEL GENMASK(15, 0) +#define MPAMCFG_PART_SEL_INTERNAL BIT(16) +#define MPAMCFG_PART_SEL_RIS GENMASK(27, 24) + +/* MPAMCFG_CMAX - MPAM cache portion bitmap partition configuration register */ +#define MPAMCFG_CMAX_CMAX GENMASK(15, 0) + +/* + * MPAMCFG_MBW_MIN - MPAM memory minimum bandwidth partitioning configuration + * register + */ +#define MPAMCFG_MBW_MIN_MIN GENMASK(15, 0) + +/* + * MPAMCFG_MBW_MAX - MPAM memory maximum bandwidth partitioning configuration + * register + */ +#define MPAMCFG_MBW_MAX_MAX GENMASK(15, 0) +#define MPAMCFG_MBW_MAX_HARDLIM BIT(31) + +/* + * MPAMCFG_MBW_WINWD - MPAM memory bandwidth partitioning 
window width + * register + */ +#define MPAMCFG_MBW_WINWD_US_FRAC GENMASK(7, 0) +#define MPAMCFG_MBW_WINWD_US_INT GENMASK(23, 8) + + +/* MPAMCFG_PRI - MPAM priority partitioning configuration register */ +#define MPAMCFG_PRI_INTPRI GENMASK(15, 0) +#define MPAMCFG_PRI_DSPRI GENMASK(31, 16) + +/* + * MPAMCFG_MBW_PROP - Memory bandwidth proportional stride partitioning + * configuration register + */ +#define MPAMCFG_MBW_PROP_STRIDEM1 GENMASK(15, 0) +#define MPAMCFG_MBW_PROP_EN BIT(31) + +/* + * MPAMCFG_INTPARTID - MPAM internal partition narrowing configuration register + */ +#define MPAMCFG_INTPARTID_INTPARTID GENMASK(15, 0) +#define MPAMCFG_INTPARTID_INTERNAL BIT(16) + +/* MSMON_CFG_MON_SEL - Memory system performance monitor selection register */ +#define MSMON_CFG_MON_SEL_MON_SEL GENMASK(7, 0) +#define MSMON_CFG_MON_SEL_RIS GENMASK(27, 24) + +/* MPAMF_ESR - MPAM Error Status Register */ +#define MPAMF_ESR_PARTID_OR_MON GENMASK(15, 0) +#define MPAMF_ESR_PMG GENMASK(23, 16) +#define MPAMF_ESR_ERRCODE GENMASK(27, 24) +#define MPAMF_ESR_OVRWR BIT(31) +#define MPAMF_ESR_RIS GENMASK(35, 32) + +/* MPAMF_ECR - MPAM Error Control Register */ +#define MPAMF_ECR_INTEN BIT(0) + +/* Error conditions in accessing memory mapped registers */ +#define MPAM_ERRCODE_NONE 0 +#define MPAM_ERRCODE_PARTID_SEL_RANGE 1 +#define MPAM_ERRCODE_REQ_PARTID_RANGE 2 +#define MPAM_ERRCODE_MSMONCFG_ID_RANGE 3 +#define MPAM_ERRCODE_REQ_PMG_RANGE 4 +#define MPAM_ERRCODE_MONITOR_RANGE 5 +#define MPAM_ERRCODE_INTPARTID_RANGE 6 +#define MPAM_ERRCODE_UNEXPECTED_INTERNAL 7 + +/* + * MSMON_CFG_CSU_FLT - Memory system performance monitor configure cache storage + * usage monitor filter register + */ +#define MSMON_CFG_CSU_FLT_PARTID GENMASK(15, 0) +#define MSMON_CFG_CSU_FLT_PMG GENMASK(23, 16) + +/* + * MSMON_CFG_CSU_CTL - Memory system performance monitor configure cache storage + * usage monitor control register + * MSMON_CFG_MBWU_CTL - Memory system performance monitor configure memory + * bandwidth usage monitor control register + */ +#define MSMON_CFG_x_CTL_TYPE GENMASK(7, 0) +#define MSMON_CFG_x_CTL_MATCH_PARTID BIT(16) +#define MSMON_CFG_x_CTL_MATCH_PMG BIT(17) +#define MSMON_CFG_x_CTL_SCLEN BIT(19) +#define MSMON_CFG_x_CTL_SUBTYPE GENMASK(23, 20) +#define MSMON_CFG_x_CTL_OFLOW_FRZ BIT(24) +#define MSMON_CFG_x_CTL_OFLOW_INTR BIT(25) +#define MSMON_CFG_x_CTL_OFLOW_STATUS BIT(26) +#define MSMON_CFG_x_CTL_CAPT_RESET BIT(27) +#define MSMON_CFG_x_CTL_CAPT_EVNT GENMASK(30, 28) +#define MSMON_CFG_x_CTL_EN BIT(31) + +#define MSMON_CFG_MBWU_CTL_TYPE_MBWU 0x42 +#define MSMON_CFG_MBWU_CTL_TYPE_CSU 0x43 + +#define MSMON_CFG_MBWU_CTL_SUBTYPE_NONE 0 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_READ 1 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_WRITE 2 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_BOTH 3 + +#define MSMON_CFG_MBWU_CTL_SUBTYPE_MAX 3 +#define MSMON_CFG_MBWU_CTL_SUBTYPE_MASK 0x3 + +/* + * MSMON_CFG_MBWU_FLT - Memory system performance monitor configure memory + * bandwidth usage monitor filter register + */ +#define MSMON_CFG_MBWU_FLT_PARTID GENMASK(15, 0) +#define MSMON_CFG_MBWU_FLT_PMG GENMASK(23, 16) +#define MSMON_CFG_MBWU_FLT_RWBW GENMASK(31, 30) + +/* + * MSMON_CSU - Memory system performance monitor cache storage usage monitor + * register + * MSMON_CSU_CAPTURE - Memory system performance monitor cache storage usage + * capture register + * MSMON_MBWU - Memory system performance monitor memory bandwidth usage + * monitor register + * MSMON_MBWU_CAPTURE - Memory system performance monitor memory bandwidth usage + * capture register + */ +#define 
MSMON___VALUE GENMASK(30, 0) +#define MSMON___NRDY BIT(31) +#define MSMON_MBWU_L_VALUE GENMASK(62, 0) +/* + * MSMON_CAPT_EVNT - Memory system performance monitoring capture event + * generation register + */ +#define MSMON_CAPT_EVNT_NOW BIT(0) + #endif /* MPAM_INTERNAL_H */ -- 2.25.1
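[Illustration, not part of the patch: a short sketch of how these layout definitions pair with the kernel's bitfield helpers. The idr value stands in for a real MPAMF_IDR read.]

#include <linux/bitfield.h>

static void example_decode_idr(u64 idr)
{
        u16 partid_max = FIELD_GET(MPAMF_IDR_PARTID_MAX, idr);
        u8 pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr);
        bool has_ris = FIELD_GET(MPAMF_IDR_HAS_RIS, idr);

        pr_debug("partid: 0-%u, pmg: 0-%u, has_ris: %d\n",
                 partid_max, pmg_max, has_ris);
}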

From: James Morse <james.morse@arm.com>

maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/...

---------------------------

Because an MSC can only be accessed from the CPUs in its cpu-affinity
set, we need to be running on one of those CPUs to probe the MSC
hardware. Do this work in the cpuhp callback.

Probing the hardware only happens before MPAM is enabled: walk all the
MSCs and probe those we can reach that haven't already been probed.

Later, enabling MPAM will enable a static key, which will allow
mpam_discovery_cpu_online() and its mutex to be skipped. Enabling a
static key also takes the cpuhp lock, so this can't be done from the
cpuhp callback.

Whenever a new MSC has been probed, schedule work to test whether all
the MSCs have now been probed.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c  | 148 +++++++++++++++++++++++++-
 drivers/platform/mpam/mpam_internal.h |   5 +-
 2 files changed, 149 insertions(+), 4 deletions(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 599bd4ee5b00..48811c90b78a 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -4,6 +4,7 @@
 #define pr_fmt(fmt) "mpam: " fmt
 
 #include <linux/acpi.h>
+#include <linux/atomic.h>
 #include <linux/arm_mpam.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpu.h>
@@ -21,6 +22,7 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/types.h>
+#include <linux/workqueue.h>
 
 #include <acpi/pcc.h>
 
@@ -41,6 +43,16 @@ struct srcu_struct mpam_srcu;
 /* MPAM isn't available until all the MSC have been probed. */
 static u32 mpam_num_msc;
 
+static int mpam_cpuhp_state;
+static DEFINE_MUTEX(mpam_cpuhp_state_lock);
+
+/*
+ * mpam is enabled once all devices have been probed from CPU online callbacks,
+ * scheduled via this work_struct. If access to an MSC depends on a CPU that
+ * was not brought online at boot, this can happen surprisingly late.
+ */
+static DECLARE_WORK(mpam_enable_work, &mpam_enable);
+
 /*
  * An MSC is a container for resources, each identified by their RIS index.
  * Components are a group of RIS that control the same thing.
@@ -59,6 +71,24 @@ static u32 mpam_num_msc; */ LIST_HEAD(mpam_classes); +static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) +{ + WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + return readl_relaxed(msc->mapped_hwpage + reg); +} + +#define mpam_read_partsel_reg(msc, reg) \ +({ \ + u32 ____ret; \ + \ + lockdep_assert_held_once(&msc->part_sel_lock); \ + ____ret = __mpam_read_reg(msc, MPAMF_##reg); \ + \ + ____ret; \ +}) + static struct mpam_component * mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp) { @@ -389,9 +419,81 @@ int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, return err; } -static void mpam_discovery_complete(void) +static int mpam_msc_hw_probe(struct mpam_msc *msc) +{ + u64 idr; + int err; + + lockdep_assert_held(&msc->lock); + + spin_lock(&msc->part_sel_lock); + idr = mpam_read_partsel_reg(msc, AIDR); + if ((idr & MPAMF_AIDR_ARCH_MAJOR_REV) != MPAM_ARCHITECTURE_V1) { + pr_err_once("%s does not match MPAM architecture v1.0\n", + dev_name(&msc->pdev->dev)); + err = -EIO; + } else { + msc->probed = true; + err = 0; + } + spin_unlock(&msc->part_sel_lock); + + return err; +} + +static int mpam_cpu_online(unsigned int cpu) { - pr_err("Discovered all MSC\n"); + return 0; +} + +/* Before mpam is enabled, try to probe new MSC */ +static int mpam_discovery_cpu_online(unsigned int cpu) +{ + int err = 0; + struct mpam_msc *msc; + bool new_device_probed = false; + + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + mutex_lock(&msc->lock); + if (!msc->probed) + err = mpam_msc_hw_probe(msc); + mutex_unlock(&msc->lock); + + if (!err) + new_device_probed = true; + else + break; // mpam_broken + } + mutex_unlock(&mpam_list_lock); + + if (new_device_probed && !err) + schedule_work(&mpam_enable_work); + + if (err < 0) + return err; + + return mpam_cpu_online(cpu); +} + +static int mpam_cpu_offline(unsigned int cpu) +{ + return 0; +} + +static void mpam_register_cpuhp_callbacks(int (*online)(unsigned int online)) +{ + mutex_lock(&mpam_cpuhp_state_lock); + mpam_cpuhp_state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mpam:online", + online, mpam_cpu_offline); + if (mpam_cpuhp_state <= 0) { + pr_err("Failed to register cpuhp callbacks"); + mpam_cpuhp_state = 0; + } + mutex_unlock(&mpam_cpuhp_state_lock); } static int mpam_dt_count_msc(void) @@ -628,11 +730,51 @@ static int mpam_msc_drv_probe(struct platform_device *pdev) } if (!err && fw_num_msc == mpam_num_msc) - mpam_discovery_complete(); + mpam_register_cpuhp_callbacks(&mpam_discovery_cpu_online); return err; } +static void mpam_enable_once(void) +{ + mutex_lock(&mpam_cpuhp_state_lock); + cpuhp_remove_state(mpam_cpuhp_state); + mpam_cpuhp_state = 0; + mutex_unlock(&mpam_cpuhp_state_lock); + + mpam_register_cpuhp_callbacks(mpam_cpu_online); + + pr_info("MPAM enabled\n"); +} + +/* + * Enable mpam once all devices have been probed. + * Scheduled by mpam_discovery_cpu_online() once all devices have been created. + * Also scheduled when new devices are probed when new CPUs come online. + */ +void mpam_enable(struct work_struct *work) +{ + static atomic_t once; + struct mpam_msc *msc; + bool all_devices_probed = true; + + /* Have we probed all the hw devices? 
*/ + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + mutex_lock(&msc->lock); + if (!msc->probed) + all_devices_probed = false; + mutex_unlock(&msc->lock); + + if (!all_devices_probed) + break; + } + mutex_unlock(&mpam_list_lock); + + if (all_devices_probed && !atomic_fetch_inc(&once)) + mpam_enable_once(); +} + static int mpam_msc_drv_remove(struct platform_device *pdev) { struct mpam_msc *msc = platform_get_drvdata(pdev); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 5d339e4375e2..d5d567fe57ed 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -30,6 +30,7 @@ struct mpam_msc cpumask_t accessibility; struct mutex lock; + bool probed; unsigned long ris_idxs[128 / BITS_PER_LONG]; u32 ris_max; @@ -97,6 +98,8 @@ struct mpam_msc_ris extern struct list_head mpam_classes; extern struct srcu_struct mpam_srcu; +/* Scheduled work callback to enable mpam once all MSC have been probed */ +void mpam_enable(struct work_struct *work); /* * MPAM MSCs have the following register layout. See: @@ -196,7 +199,7 @@ extern struct srcu_struct mpam_srcu; /* MPAMF_MBWUMON_IDR - MPAM memory bandwidth usage monitor ID register */ #define MPAMF_MBWUMON_IDR_NUM_MON GENMASK(15, 0) -#define MPAMF_MBWUMON_IDR_RWBW BIT(28) +#define MPAMF_MBWUMON_IDR_HAS_RWBW BIT(28) #define MPAMF_MBWUMON_IDR_LWD BIT(29) #define MPAMF_MBWUMON_IDR_HAS_LONG BIT(30) #define MPAMF_MBWUMON_IDR_HAS_CAPTURE BIT(31) -- 2.25.1
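[Illustration, not part of the patch: mpam_enable_work may be scheduled once per newly probed MSC, so mpam_enable() guards mpam_enable_once() with atomic_fetch_inc(). A standalone sketch of that run-once idiom, with made-up names:]

#include <linux/atomic.h>

static atomic_t example_once;

static void example_enable_exactly_once(void)
{
        /* Only the first caller sees 0; later calls are harmless no-ops */
        if (!atomic_fetch_inc(&example_once))
                pr_info("enabled\n");
}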

From: James Morse <james.morse@arm.com>

maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/...

---------------------------

CPUs can generate traffic with a range of PARTID and PMG values, but
each MSC may have its own maximum size for these fields. Before MPAM
can be used, the driver needs to probe each RIS on each MSC to find
the system-wide smallest values that can be used.

While doing this, RIS entries that firmware didn't describe are created
under MPAM_CLASS_UNKNOWN.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam.c              |  10 ++
 drivers/platform/mpam/mpam_devices.c  | 166 ++++++++++++++++++++++++--
 drivers/platform/mpam/mpam_internal.h |   6 +
 include/linux/arm_mpam.h              |   2 +
 4 files changed, 177 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c
index 894ca52549b8..02f43334f078 100644
--- a/arch/arm64/kernel/mpam.c
+++ b/arch/arm64/kernel/mpam.c
@@ -3,6 +3,7 @@
 
 #include <asm/mpam.h>
 
+#include <linux/arm_mpam.h>
 #include <linux/jump_label.h>
 #include <linux/percpu.h>
 
@@ -11,3 +12,12 @@ DEFINE_STATIC_KEY_FALSE(mpam_enabled);
 DEFINE_PER_CPU(u64, arm64_mpam_default);
 DEFINE_PER_CPU(u64, arm64_mpam_current);
+
+static int __init arm64_mpam_register_cpus(void)
+{
+        u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
+        u16 partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, mpamidr);
+        u8 pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, mpamidr);
+
+        return mpam_register_requestor(partid_max, pmg_max);
+}
+arch_initcall(arm64_mpam_register_cpus);

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 48811c90b78a..ec22c279cfe9 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -6,6 +6,7 @@
 #include <linux/acpi.h>
 #include <linux/atomic.h>
 #include <linux/arm_mpam.h>
+#include <linux/bitfield.h>
 #include <linux/cacheinfo.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
@@ -46,6 +47,15 @@ static u32 mpam_num_msc;
 static int mpam_cpuhp_state;
 static DEFINE_MUTEX(mpam_cpuhp_state_lock);
 
+/*
+ * The smallest common values for any CPU or MSC in the system.
+ * Generating traffic outside this range will result in screaming interrupts.
+ */
+u16 mpam_partid_max;
+u8 mpam_pmg_max;
+static bool partid_max_init, partid_max_published;
+static DEFINE_SPINLOCK(partid_max_lock);
+
 /*
  * mpam is enabled once all devices have been probed from CPU online callbacks,
  * scheduled via this work_struct.
If access to an MSC depends on a CPU that @@ -79,6 +89,14 @@ static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) return readl_relaxed(msc->mapped_hwpage + reg); } +static void __mpam_write_reg(struct mpam_msc *msc, u16 reg, u32 val) +{ + WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + writel_relaxed(val, msc->mapped_hwpage + reg); +} + #define mpam_read_partsel_reg(msc, reg) \ ({ \ u32 ____ret; \ @@ -89,6 +107,59 @@ static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) ____ret; \ }) +#define mpam_write_partsel_reg(msc, reg, val) \ +({ \ + lockdep_assert_held_once(&msc->part_sel_lock); \ + __mpam_write_reg(msc, MPAMCFG_##reg, val); \ +}) + +static u64 mpam_msc_read_idr(struct mpam_msc *msc) +{ + u64 idr_high = 0, idr_low; + + lockdep_assert_held(&msc->part_sel_lock); + + idr_low = mpam_read_partsel_reg(msc, IDR); + if (FIELD_GET(MPAMF_IDR_HAS_EXT, idr_low)) + idr_high = mpam_read_partsel_reg(msc, IDR + 4); + + return (idr_high << 32) | idr_low; +} + +static void __mpam_part_sel(u8 ris_idx, u16 partid, struct mpam_msc *msc) +{ + u32 partsel; + + lockdep_assert_held(&msc->part_sel_lock); + + partsel = FIELD_PREP(MPAMCFG_PART_SEL_RIS, ris_idx) | + FIELD_PREP(MPAMCFG_PART_SEL_PARTID_SEL, partid); + mpam_write_partsel_reg(msc, PART_SEL, partsel); +} + +int mpam_register_requestor(u16 partid_max, u8 pmg_max) +{ + int err = 0; + + spin_lock(&partid_max_lock); + if (!partid_max_init) { + mpam_partid_max = partid_max; + mpam_pmg_max = pmg_max; + partid_max_init = true; + } else if (!partid_max_published) { + mpam_partid_max = min(mpam_partid_max, partid_max); + mpam_pmg_max = min(mpam_pmg_max, pmg_max); + } else { + /* New requestors can't lower the values */ + if ((partid_max < mpam_partid_max) || (pmg_max < mpam_pmg_max)) + err = -EBUSY; + } + spin_unlock(&partid_max_lock); + + return err; +} +EXPORT_SYMBOL(mpam_register_requestor); + static struct mpam_component * mpam_component_alloc(struct mpam_class *class, int id, gfp_t gfp) { @@ -402,6 +473,7 @@ static int mpam_ris_create_locked(struct mpam_msc *msc, u8 ris_idx, cpumask_or(&comp->affinity, &comp->affinity, &ris->affinity); cpumask_or(&class->affinity, &class->affinity, &ris->affinity); list_add_rcu(&ris->comp_list, &comp->ris); + list_add_rcu(&ris->msc_list, &msc->ris); return 0; } @@ -419,10 +491,37 @@ int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, return err; } +static struct mpam_msc_ris *mpam_get_or_create_ris(struct mpam_msc *msc, + u8 ris_idx) +{ + int err; + struct mpam_msc_ris *ris, *found = ERR_PTR(-ENOENT); + + lockdep_assert_held(&mpam_list_lock); + + if (!test_bit(ris_idx, msc->ris_idxs)) { + err = mpam_ris_create_locked(msc, ris_idx, MPAM_CLASS_UNKNOWN, + 0, 0, GFP_ATOMIC); + if (err) + return ERR_PTR(err); + } + + list_for_each_entry(ris, &msc->ris, msc_list) { + if (ris->ris_idx == ris_idx) { + found = ris; + break; + } + } + + return found; +} + static int mpam_msc_hw_probe(struct mpam_msc *msc) { u64 idr; - int err; + u16 partid_max; + u8 ris_idx, pmg_max; + struct mpam_msc_ris *ris; lockdep_assert_held(&msc->lock); @@ -431,14 +530,43 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) if ((idr & MPAMF_AIDR_ARCH_MAJOR_REV) != MPAM_ARCHITECTURE_V1) { pr_err_once("%s does not match MPAM architecture v1.0\n", dev_name(&msc->pdev->dev)); - err = -EIO; - } else { - msc->probed = true; - err = 0; + spin_unlock(&msc->part_sel_lock); + return -EIO; } + + idr = mpam_msc_read_idr(msc); spin_unlock(&msc->part_sel_lock); + msc->ris_max = 
FIELD_GET(MPAMF_IDR_RIS_MAX, idr); + + /* Use these values so partid/pmg always starts with a valid value */ + msc->partid_max = FIELD_GET(MPAMF_IDR_PARTID_MAX, idr); + msc->pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); + + for (ris_idx = 0; ris_idx <= msc->ris_max; ris_idx++) { + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris_idx, 0, msc); + idr = mpam_msc_read_idr(msc); + spin_unlock(&msc->part_sel_lock); + + partid_max = FIELD_GET(MPAMF_IDR_PARTID_MAX, idr); + pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); + msc->partid_max = min(msc->partid_max, partid_max); + msc->pmg_max = min(msc->pmg_max, pmg_max); + + ris = mpam_get_or_create_ris(msc, ris_idx); + if (IS_ERR(ris)) { + return PTR_ERR(ris); + } + } - return err; + spin_lock(&partid_max_lock); + mpam_partid_max = min(mpam_partid_max, msc->partid_max); + mpam_pmg_max = min(mpam_pmg_max, msc->pmg_max); + spin_unlock(&partid_max_lock); + + msc->probed = true; + + return 0; } static int mpam_cpu_online(unsigned int cpu) @@ -742,9 +870,18 @@ static void mpam_enable_once(void) mpam_cpuhp_state = 0; mutex_unlock(&mpam_cpuhp_state_lock); + /* + * Once the cpuhp callbacks have been changed, mpam_partid_max can no + * longer change. + */ + spin_lock(&partid_max_lock); + partid_max_published = true; + spin_unlock(&partid_max_lock); + mpam_register_cpuhp_callbacks(mpam_cpu_online); - pr_info("MPAM enabled\n"); + pr_info("MPAM enabled with %u partid and %u pmg\n", + mpam_partid_max + 1, mpam_pmg_max + 1); } /* @@ -828,11 +965,25 @@ static void mpam_dt_create_foundling_msc(void) static int __init mpam_msc_driver_init(void) { + bool mpam_not_available = false; + if (!mpam_cpus_have_feature()) return -EOPNOTSUPP; init_srcu_struct(&mpam_srcu); + /* + * If the MPAM CPU interface is not implemented, or reserved by + * firmware, there is no point touching the rest of the hardware. + */ + spin_lock(&partid_max_lock); + if (!partid_max_init || (!mpam_partid_max && !mpam_pmg_max)) + mpam_not_available = true; + spin_unlock(&partid_max_lock); + + if (mpam_not_available) + return 0; + if (!acpi_disabled) fw_num_msc = acpi_mpam_count_msc(); else @@ -848,4 +999,5 @@ static int __init mpam_msc_driver_init(void) return platform_driver_register(&mpam_msc_driver); } +/* Must occur after arm64_mpam_register_cpus() from arch_initcall() */ subsys_initcall(mpam_msc_driver_init); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index d5d567fe57ed..a7de4a69b9f8 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -31,6 +31,8 @@ struct mpam_msc struct mutex lock; bool probed; + u16 partid_max; + u8 pmg_max; unsigned long ris_idxs[128 / BITS_PER_LONG]; u32 ris_max; @@ -98,6 +100,10 @@ struct mpam_msc_ris extern struct list_head mpam_classes; extern struct srcu_struct mpam_srcu; +/* System wide partid/pmg values */ +extern u16 mpam_partid_max; +extern u8 mpam_pmg_max; + /* Scheduled work callback to enable mpam once all MSC have been probed */ void mpam_enable(struct work_struct *work); diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 950ea7049d53..40e09b4d236b 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -34,6 +34,8 @@ static inline int acpi_mpam_parse_resources(struct mpam_msc *msc, static inline int acpi_mpam_count_msc(void) { return -EINVAL; } #endif +int mpam_register_requestor(u16 partid_max, u8 pmg_max); + int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, enum mpam_class_types type, u8 class_id, int component_id); -- 2.25.1
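[Illustration, not part of the patch: a hypothetical second requestor, say an SMMU driver, registering its limits. The values are made up. Before mpam_enable_once() publishes the system-wide values the range can only shrink; afterwards an attempt to lower it fails:]

static void example_register_another_requestor(void)
{
        int err;

        /* This device can generate partid 0-63 and pmg 0-3 */
        err = mpam_register_requestor(63, 3);
        if (err == -EBUSY)
                pr_warn("Cannot lower partid/pmg range: MPAM already enabled\n");
}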

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Expand the probing support with the control and monitor types we can use with resctrl. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 79 ++++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 50 +++++++++++++++++ 2 files changed, 127 insertions(+), 2 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index ec22c279cfe9..7eeeb44506b1 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -83,7 +83,7 @@ LIST_HEAD(mpam_classes); static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) { - WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(reg + sizeof(u32) > msc->mapped_hwpage_sz); WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); return readl_relaxed(msc->mapped_hwpage + reg); @@ -91,7 +91,7 @@ static u32 __mpam_read_reg(struct mpam_msc *msc, u16 reg) static void __mpam_write_reg(struct mpam_msc *msc, u16 reg, u32 val) { - WARN_ON_ONCE(reg > msc->mapped_hwpage_sz); + WARN_ON_ONCE(reg + sizeof(u32) > msc->mapped_hwpage_sz); WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); writel_relaxed(val, msc->mapped_hwpage + reg); @@ -516,6 +516,74 @@ static struct mpam_msc_ris *mpam_get_or_create_ris(struct mpam_msc *msc, return found; } +static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) +{ + int err; + struct mpam_msc *msc = ris->msc; + struct mpam_props *props = &ris->props; + + lockdep_assert_held(&msc->lock); + lockdep_assert_held(&msc->part_sel_lock); + + /* Cache Portion partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_CPOR_PART, ris->idr)) { + u32 cpor_features = mpam_read_partsel_reg(msc, CPOR_IDR); + + props->cpbm_wd = FIELD_GET(MPAMF_CPOR_IDR_CPBM_WD, cpor_features); + if (props->cpbm_wd) + mpam_set_feature(mpam_feat_cpor_part, props); + } + + /* Memory bandwidth partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_MBW_PART, ris->idr)) { + u32 mbw_features = mpam_read_partsel_reg(msc, MBW_IDR); + + /* portion bitmap resolution */ + props->mbw_pbm_bits = FIELD_GET(MPAMF_MBW_IDR_BWPBM_WD, mbw_features); + if (props->mbw_pbm_bits && + FIELD_GET(MPAMF_MBW_IDR_HAS_PBM, mbw_features)) + mpam_set_feature(mpam_feat_mbw_part, props); + + props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) + mpam_set_feature(mpam_feat_mbw_max, props); + } + + /* Performance Monitoring */ + if (FIELD_GET(MPAMF_IDR_HAS_MSMON, ris->idr)) { + u32 msmon_features = mpam_read_partsel_reg(msc, MSMON_IDR); + + if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_CSU, msmon_features)) { + u32 csumonidr, discard; + + /* + * If the firmware max-nrdy-us property is missing, the + * CSU counters can't be used. Should we wait forever? 
+ */ + err = device_property_read_u32(&msc->pdev->dev, + "arm,not-ready-us", + &discard); + + csumonidr = mpam_read_partsel_reg(msc, CSUMON_IDR); + props->num_csu_mon = FIELD_GET(MPAMF_CSUMON_IDR_NUM_MON, csumonidr); + if (props->num_csu_mon && !err) + mpam_set_feature(mpam_feat_msmon_csu, props); + else if (props->num_csu_mon) + pr_err_once("Counters are not usable because not-ready timeout was not provided by firmware."); + } + if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_MBWU, msmon_features)) { + u32 mbwumonidr = mpam_read_partsel_reg(msc, MBWUMON_IDR); + + props->num_mbwu_mon = FIELD_GET(MPAMF_MBWUMON_IDR_NUM_MON, mbwumonidr); + if (props->num_mbwu_mon) + mpam_set_feature(mpam_feat_msmon_mbwu, props); + + if (FIELD_GET(MPAMF_MBWUMON_IDR_HAS_RWBW, mbwumonidr)) + mpam_set_feature(mpam_feat_msmon_mbwu_rwbw, props); + } + } +} + static int mpam_msc_hw_probe(struct mpam_msc *msc) { u64 idr; @@ -536,6 +604,7 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) idr = mpam_msc_read_idr(msc); spin_unlock(&msc->part_sel_lock); + msc->ris_max = FIELD_GET(MPAMF_IDR_RIS_MAX, idr); /* Use these values so partid/pmg always starts with a valid value */ @@ -557,6 +626,12 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) if (IS_ERR(ris)) { return PTR_ERR(ris); } + ris->idr = idr; + + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris_idx, 0, msc); + mpam_ris_hw_probe(ris); + spin_unlock(&msc->part_sel_lock); } spin_lock(&partid_max_lock); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index a7de4a69b9f8..71e62594876d 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -49,6 +49,54 @@ struct mpam_msc size_t mapped_hwpage_sz; }; +/* + * When we compact the supported features, we don't care what they are. + * Storing them as a bitmap makes life easy. + */ +typedef u16 mpam_features_t; + +/* Bits for mpam_features_t */ +enum mpam_device_features { + mpam_feat_ccap_part = 0, + mpam_feat_cpor_part, + mpam_feat_mbw_part, + mpam_feat_mbw_min, + mpam_feat_mbw_max, + mpam_feat_mbw_prop, + mpam_feat_msmon, + mpam_feat_msmon_csu, + mpam_feat_msmon_csu_capture, + mpam_feat_msmon_mbwu, + mpam_feat_msmon_mbwu_capture, + mpam_feat_msmon_mbwu_rwbw, + mpam_feat_msmon_capt, + MPAM_FEATURE_LAST, +}; +#define MPAM_ALL_FEATURES ((1<<MPAM_FEATURE_LAST) - 1) + +struct mpam_props +{ + mpam_features_t features; + + u16 cpbm_wd; + u16 mbw_pbm_bits; + u8 bwa_wd; + u16 num_csu_mon; + u16 num_mbwu_mon; +}; + +static inline bool mpam_has_feature(enum mpam_device_features feat, + struct mpam_props *props) +{ + return (1<<feat) & props->features; +} + +static inline void mpam_set_feature(enum mpam_device_features feat, + struct mpam_props *props) +{ + props->features |= (1<<feat); +} + struct mpam_class { /* mpam_components in this class */ @@ -82,6 +130,8 @@ struct mpam_component struct mpam_msc_ris { u8 ris_idx; + u64 idr; + struct mpam_props props; cpumask_t affinity; -- 2.25.1
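[Illustration, not part of the patch: a small sketch of how the mpam_props feature bitmap is meant to be queried once probing has run. The helper name is made up.]

static bool example_can_partition_cache(struct mpam_msc_ris *ris)
{
        /* Set during probe only if MPAMF_CPOR_IDR reported a non-zero CPBM_WD */
        return mpam_has_feature(mpam_feat_cpor_part, &ris->props);
}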

From: James Morse <james.morse@arm.com>

maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/...

---------------------------

To make a decision about whether to expose an mpam class as a resctrl
resource, we need to know its overall supported features and properties.

Once we've probed all the resources, we can walk the tree and produce
overall values by merging the bitmaps. This eliminates features that
are only supported by some of the MSCs that make up a component or
class.

If bitmap properties are mismatched within a component, we cannot
support the mismatched feature.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c  | 87 +++++++++++++++++++++++++++
 drivers/platform/mpam/mpam_internal.h |  8 +++
 2 files changed, 95 insertions(+)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 7eeeb44506b1..f1574f393e1d 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -938,8 +938,95 @@ static int mpam_msc_drv_probe(struct platform_device *pdev)
 	return err;
 }
 
+/*
+ * If a resource doesn't match the class feature/configuration, do the right
+ * thing. For 'num' properties we can just take the minimum.
+ * For properties where the mismatched unused bits would make a difference, we
+ * nobble the class feature, as we can't configure all the resources.
+ * e.g. The L3 cache is composed of two resources with 13 and 17 bit portion
+ * bitmaps respectively.
+ */
+static void
+__resource_props_mismatch(struct mpam_msc_ris *ris, struct mpam_class *class)
+{
+        struct mpam_props *cprops = &class->props;
+        struct mpam_props *rprops = &ris->props;
+
+        lockdep_assert_held(&mpam_list_lock); /* we modify class */
+
+        /* Clear missing features */
+        cprops->features &= rprops->features;
+
+        /* Clear incompatible features */
+        if (cprops->cpbm_wd != rprops->cpbm_wd)
+                mpam_clear_feature(mpam_feat_cpor_part, &cprops->features);
+        if (cprops->mbw_pbm_bits != rprops->mbw_pbm_bits)
+                mpam_clear_feature(mpam_feat_mbw_part, &cprops->features);
+
+        /* bwa_wd is a count of bits, fewer bits means less precision */
+        if (cprops->bwa_wd != rprops->bwa_wd)
+                cprops->bwa_wd = min(cprops->bwa_wd, rprops->bwa_wd);
+
+        /* For num properties, take the minimum */
+        if (cprops->num_csu_mon != rprops->num_csu_mon)
+                cprops->num_csu_mon = min(cprops->num_csu_mon, rprops->num_csu_mon);
+        if (cprops->num_mbwu_mon != rprops->num_mbwu_mon)
+                cprops->num_mbwu_mon = min(cprops->num_mbwu_mon, rprops->num_mbwu_mon);
+}
+
+/*
+ * Copy the first component's first resource's properties and features to the
+ * class. __resource_props_mismatch() will remove conflicts.
+ * It is not possible to have a class with no components, or a component with
+ * no resources.
+ */
+static void mpam_enable_init_class_features(struct mpam_class *class)
+{
+        struct mpam_msc_ris *ris;
+        struct mpam_component *comp;
+
+        comp = list_first_entry_or_null(&class->components,
+                                        struct mpam_component, class_list);
+        if (WARN_ON(!comp))
+                return;
+
+        ris = list_first_entry_or_null(&comp->ris,
+                                       struct mpam_msc_ris, comp_list);
+        if (WARN_ON(!ris))
+                return;
+
+        class->props = ris->props;
+}
+
+/* Merge all the common resource features into class.
*/ +static void mpam_enable_merge_features(void) +{ + struct mpam_msc_ris *ris; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + mpam_enable_init_class_features(class); + + list_for_each_entry(comp, &class->components, class_list) { + list_for_each_entry(ris, &comp->ris, comp_list) { + __resource_props_mismatch(ris, class); + + class->nrdy_usec = max(class->nrdy_usec, + ris->msc->nrdy_usec); + } + } + } +} + static void mpam_enable_once(void) { + mutex_lock(&mpam_list_lock); + mpam_enable_merge_features(); + mutex_unlock(&mpam_list_lock); + mutex_lock(&mpam_cpuhp_state_lock); cpuhp_remove_state(mpam_cpuhp_state); mpam_cpuhp_state = 0; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 71e62594876d..db50f5e40d98 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -97,6 +97,12 @@ static inline void mpam_set_feature(enum mpam_device_features feat, props->features |= (1<<feat); } +static inline void mpam_clear_feature(enum mpam_device_features feat, + mpam_features_t *supported) +{ + *supported &= ~(1<<feat); +} + struct mpam_class { /* mpam_components in this class */ @@ -104,6 +110,8 @@ struct mpam_class cpumask_t affinity; + struct mpam_props props; + u32 nrdy_usec; u8 level; enum mpam_class_types type; -- 2.25.1
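[Illustration, not part of the patch: a worked example of the merge under the assumptions in the comment above, an L3 class built from two RIS whose portion bitmaps are 13 and 17 bits wide. The values are made up.]

static void example_merge_outcome(void)
{
        struct mpam_props a = { .cpbm_wd = 13, .num_csu_mon = 8 };
        struct mpam_props b = { .cpbm_wd = 17, .num_csu_mon = 4 };

        mpam_set_feature(mpam_feat_cpor_part, &a);
        mpam_set_feature(mpam_feat_cpor_part, &b);

        /*
         * If a and b were the two RIS of one class: merging starts from a's
         * properties, then __resource_props_mismatch() clears
         * mpam_feat_cpor_part because 13 != 17, and reduces num_csu_mon to
         * min(8, 4) == 4. The class keeps only what every RIS can honour.
         */
}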

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- When a CPU comes online, it may bring a newly accessible MSC with it. Only the default partid has its value reset by hardware, and even then the MSC might not have been reset since its config was previously dirtied, e.g. by kexec. Any in-use partid must have its configuration restored, or reset. In-use partids may be held in caches and evicted later. MSCs are also reset when CPUs are taken offline to cover cases where firmware doesn't reset the MSC over reboot using UEFI, or kexec, where there is no firmware involvement. If the configuration for a RIS has not been touched since it was brought online, it does not need resetting again. To reset, write the maximum values for all discovered controls. CC: Rohit Mathew <Rohit.Mathew@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 126 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 5 +- 2 files changed, 129 insertions(+), 2 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index f1574f393e1d..e281e6e910e0 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -7,6 +7,7 @@ #include <linux/atomic.h> #include <linux/arm_mpam.h> #include <linux/bitfield.h> +#include <linux/bitmap.h> #include <linux/cacheinfo.h> #include <linux/cpu.h> #include <linux/cpumask.h> @@ -644,8 +645,115 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) return 0; } +static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) +{ + u32 num_words, msb; + u32 bm = ~0; + int i; + + lockdep_assert_held(&msc->part_sel_lock); + + /* + * Write all ~0 to all but the last 32bit-word, which may + * have fewer bits... + */ + num_words = DIV_ROUND_UP(wd, 32); + for (i = 0; i < num_words - 1; i++, reg += sizeof(bm)) + __mpam_write_reg(msc, reg, bm); + + /* + * ...and then the last (maybe) partial 32bit word. When wd is a + * multiple of 32, msb should be 31 to write a full 32bit word. 
+ */ + msb = (wd - 1) % 32; + bm = GENMASK(msb , 0); + if (bm) + __mpam_write_reg(msc, reg, bm); +} + +static void mpam_reset_ris_partid(struct mpam_msc_ris *ris, u16 partid) +{ + struct mpam_msc *msc = ris->msc; + u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; + struct mpam_props *rprops = &ris->props; + + lockdep_assert_held(&msc->lock); + + spin_lock(&msc->part_sel_lock); + __mpam_part_sel(ris->ris_idx, partid, msc); + + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) + mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); + + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) + mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, rprops->mbw_pbm_bits); + + if (mpam_has_feature(mpam_feat_mbw_min, rprops)) + mpam_write_partsel_reg(msc, MBW_MIN, 0); + + if (mpam_has_feature(mpam_feat_mbw_max, rprops)) + mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); + + if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) + mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); + spin_unlock(&msc->part_sel_lock); +} + +static void mpam_reset_ris(struct mpam_msc_ris *ris) +{ + u16 partid, partid_max; + struct mpam_msc *msc = ris->msc; + + lockdep_assert_held(&msc->lock); + + if (ris->in_reset_state) + return; + + spin_lock(&partid_max_lock); + partid_max = mpam_partid_max; + spin_unlock(&partid_max_lock); + for (partid = 0; partid < partid_max; partid++) + mpam_reset_ris_partid(ris, partid); +} + +static void mpam_reset_msc(struct mpam_msc *msc, bool online) +{ + int idx; + struct mpam_msc_ris *ris; + + lockdep_assert_held(&msc->lock); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &msc->ris, msc_list) { + mpam_reset_ris(ris); + + /* + * Set in_reset_state when coming online. The reset state + * for non-zero partid may be lost while the CPUs are offline. + */ + ris->in_reset_state = online; + } + srcu_read_unlock(&mpam_srcu, idx); +} + static int mpam_cpu_online(unsigned int cpu) { + int idx; + struct mpam_msc *msc; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + if (atomic_fetch_inc(&msc->online_refs) == 0) { + mutex_lock(&msc->lock); + mpam_reset_msc(msc, true); + mutex_unlock(&msc->lock); + } + } + srcu_read_unlock(&mpam_srcu, idx); + return 0; } @@ -684,6 +792,22 @@ static int mpam_discovery_cpu_online(unsigned int cpu) static int mpam_cpu_offline(unsigned int cpu) { + int idx; + struct mpam_msc *msc; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) { + if (!cpumask_test_cpu(cpu, &msc->accessibility)) + continue; + + if (atomic_dec_and_test(&msc->online_refs)) { + mutex_lock(&msc->lock); + mpam_reset_msc(msc, false); + mutex_unlock(&msc->lock); + } + } + srcu_read_unlock(&mpam_srcu, idx); + return 0; } @@ -1043,7 +1167,7 @@ static void mpam_enable_once(void) mpam_register_cpuhp_callbacks(mpam_cpu_online); pr_info("MPAM enabled with %u partid and %u pmg\n", - mpam_partid_max + 1, mpam_pmg_max + 1); + READ_ONCE(mpam_partid_max) + 1, mpam_pmg_max + 1); } /* diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index db50f5e40d98..dd1c017c6e08 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -5,6 +5,7 @@ #define MPAM_INTERNAL_H #include <linux/arm_mpam.h> +#include <linux/atomic.h> #include <linux/cpumask.h> #include <linux/io.h> #include <linux/mailbox_client.h> @@ -28,7 +29,8 @@ struct mpam_msc struct pcc_mbox_chan *pcc_chan; u32 nrdy_usec; cpumask_t 
accessibility; - + atomic_t online_refs; + struct mutex lock; bool probed; u16 partid_max; @@ -140,6 +142,7 @@ struct mpam_msc_ris u8 ris_idx; u64 idr; struct mpam_props props; + bool in_reset_state; cpumask_t affinity; -- 2.25.1
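The word arithmetic in mpam_reset_msc_bitmap() is easy to check in isolation: full ~0 words first, then one GENMASK word for the (possibly partial) tail. A minimal userspace model, with 32-bit GENMASK and DIV_ROUND_UP reimplemented for the demo:

#include <stdint.h>
#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define GENMASK32(h, l)		((~0u >> (31 - (h))) & ~((1u << (l)) - 1))

/* Print the register words mpam_reset_msc_bitmap() would write for a
 * portion bitmap of 'wd' bits. */
static void show_reset(unsigned int wd)
{
	unsigned int num_words = DIV_ROUND_UP(wd, 32);
	unsigned int msb = (wd - 1) % 32;
	unsigned int i;

	printf("wd=%2u:", wd);
	for (i = 0; i < num_words - 1; i++)
		printf(" 0xffffffff");
	printf(" 0x%08x\n", GENMASK32(msb, 0));
}

int main(void)
{
	show_reset(13);	/* 0x00001fff */
	show_reset(17);	/* 0x0001ffff */
	show_reset(32);	/* 0xffffffff: wd a multiple of 32, msb is 31 */
	show_reset(33);	/* 0xffffffff 0x00000001 */
	return 0;
}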

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Resetting RIS entries from the cpuhp callback is easy as the callback occurs on the correct CPU. This won't be true for any other caller that wants to reset or configure an MSC. Add a helper that schedules the provided function if necessary. Prevent the cpuhp callbacks from changing the MSC state by taking the cpuhp lock. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 40 +++++++++++++++++++++++----- 1 file changed, 34 insertions(+), 6 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index e281e6e910e0..9156591fc4b9 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -699,21 +699,49 @@ static void mpam_reset_ris_partid(struct mpam_msc_ris *ris, u16 partid) spin_unlock(&msc->part_sel_lock); } -static void mpam_reset_ris(struct mpam_msc_ris *ris) +/* + * Called via smp_call_on_cpu() to prevent migration, while still being + * pre-emptible. + */ +static int mpam_reset_ris(void *arg) { u16 partid, partid_max; - struct mpam_msc *msc = ris->msc; - - lockdep_assert_held(&msc->lock); + struct mpam_msc_ris *ris = arg; if (ris->in_reset_state) - return; + return 0; spin_lock(&partid_max_lock); partid_max = mpam_partid_max; spin_unlock(&partid_max_lock); for (partid = 0; partid < partid_max; partid++) mpam_reset_ris_partid(ris, partid); + + return 0; +} + +/* + * Get the preferred CPU for this MSC. If it is accessible from this CPU, + * this CPU is preferred. This can be preempted/migrated, it will only result + * in more work. + */ +static int mpam_get_msc_preferred_cpu(struct mpam_msc *msc) +{ + int cpu = raw_smp_processor_id(); + + if (cpumask_test_cpu(cpu, &msc->accessibility)) + return cpu; + + return cpumask_first_and(&msc->accessibility, cpu_online_mask); +} + +static int mpam_touch_msc(struct mpam_msc *msc, int (*fn)(void *a), void *arg) +{ + lockdep_assert_irqs_enabled(); + lockdep_assert_cpus_held(); + lockdep_assert_held(&msc->lock); + + return smp_call_on_cpu(mpam_get_msc_preferred_cpu(msc), fn, arg, true); } static void mpam_reset_msc(struct mpam_msc *msc, bool online) @@ -725,7 +753,7 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online) idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(ris, &msc->ris, msc_list) { - mpam_reset_ris(ris); + mpam_touch_msc(msc, &mpam_reset_ris, ris); /* * Set in_reset_state when coming online. The reset state -- 2.25.1
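The CPU-selection policy in mpam_get_msc_preferred_cpu() can be modelled with plain bitmasks; a userspace sketch, with cpumask operations swapped for uint64_t purely for illustration:

#include <stdint.h>
#include <stdio.h>

/*
 * Model of mpam_get_msc_preferred_cpu(): if the current CPU can access
 * the MSC, use it (no cross-call needed); otherwise pick the first
 * online CPU that can. Racing with migration only costs extra work.
 */
static int preferred_cpu(int this_cpu, uint64_t accessible, uint64_t online)
{
	uint64_t usable = accessible & online;
	int cpu;

	if (accessible & (1ull << this_cpu))
		return this_cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (usable & (1ull << cpu))
			return cpu;

	return -1;	/* no usable CPU */
}

int main(void)
{
	/* MSC reachable from CPUs 4-7 only; CPUs 0-5 online */
	uint64_t access = 0xf0, online = 0x3f;

	printf("from cpu 5: %d\n", preferred_cpu(5, access, online)); /* 5 */
	printf("from cpu 0: %d\n", preferred_cpu(0, access, online)); /* 4 */
	return 0;
}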

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- cpuhp callbacks aren't the only time the MSC configuration may need to be reset. Resctrl has an API call to reset a class. If an MPAM error interrupt arrives it indicates the driver has misprogrammed an MSC. The safest thing to do is reset all the MSCs and disable MPAM. Add a helper to reset RIS via their class. Call this from mpam_disable(), which can be scheduled from the error interrupt handler. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 34 ++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 9156591fc4b9..6091e87308ba 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1198,6 +1198,40 @@ static void mpam_enable_once(void) READ_ONCE(mpam_partid_max) + 1, mpam_pmg_max + 1); } +static void mpam_reset_class(struct mpam_class *class) +{ + int idx; + struct mpam_msc_ris *ris; + struct mpam_component *comp; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(comp, &class->components, class_list) { + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + mutex_lock(&ris->msc->lock); + mpam_touch_msc(ris->msc, mpam_reset_ris, ris); + mutex_unlock(&ris->msc->lock); + ris->in_reset_state = true; + } + } + srcu_read_unlock(&mpam_srcu, idx); +} + +/* + * Called in response to an error IRQ. + * All of MPAMs errors indicate a software bug, restore any modified + * controls to their reset values. + */ +void mpam_disable(void) +{ + int idx; + struct mpam_class *class; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) + mpam_reset_class(class); + srcu_read_unlock(&mpam_srcu, idx); +} + /* * Enable mpam once all devices have been probed. * Scheduled by mpam_discovery_cpu_online() once all devices have been created. -- 2.25.1
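For reference, the traversal in mpam_reset_class() is an instance of the driver's usual pattern: walk the component and RIS lists under mpam_srcu, take the MSC mutex around the hardware access, and let mpam_touch_msc() pick a CPU that can reach the MSC. A hypothetical generalisation of that walk (sketch only, not an API this series adds):

/*
 * Run fn on every RIS in the class, from a CPU that can reach each
 * MSC. The caller holds cpus_read_lock(), which mpam_touch_msc()
 * asserts.
 */
static void for_each_ris_in_class(struct mpam_class *class,
				  int (*fn)(void *arg))
{
	int idx;
	struct mpam_msc_ris *ris;
	struct mpam_component *comp;

	idx = srcu_read_lock(&mpam_srcu);
	list_for_each_entry_rcu(comp, &class->components, class_list) {
		list_for_each_entry_rcu(ris, &comp->ris, comp_list) {
			mutex_lock(&ris->msc->lock);
			mpam_touch_msc(ris->msc, fn, ris);
			mutex_unlock(&ris->msc->lock);
		}
	}
	srcu_read_unlock(&mpam_srcu, idx);
}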

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Register and enable error IRQs. All the MPAM error interrupts indicate a software bug, e.g. out of range partid. If the error interrupt is ever signalled, attempt to disable MPAM. Only the irq handler accesses the ESR register, so no locking is needed. The work to disable MPAM after an error needs to happen in process context, so use a threaded interrupt. There is no support for percpu threaded interrupts, so for now schedule the work from the irq handler. Enabling the IRQs in the MSC may involve cross-calling to a CPU that can access the MSC. CC: Rohit Mathew <rohit.mathew@arm.com> Tested-by: Rohit Mathew <rohit.mathew@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 311 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 8 + 2 files changed, 309 insertions(+), 10 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 6091e87308ba..ac82c041999d 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -14,6 +14,9 @@ #include <linux/device.h> #include <linux/errno.h> #include <linux/gfp.h> +#include <linux/interrupt.h> +#include <linux/irq.h> +#include <linux/irqdesc.h> #include <linux/list.h> #include <linux/lockdep.h> #include <linux/mutex.h> @@ -64,6 +67,12 @@ static DEFINE_SPINLOCK(partid_max_lock); */ static DECLARE_WORK(mpam_enable_work, &mpam_enable); +/* + * All mpam error interrupts indicate a software bug. On receipt, disable the + * driver. + */ +static DECLARE_WORK(mpam_broken_work, &mpam_disable); + /* * An MSC is a container for resources, each identified by their RIS index. * Components are a group of RIS that control the same thing. 
@@ -127,6 +136,24 @@ static u64 mpam_msc_read_idr(struct mpam_msc *msc) return (idr_high << 32) | idr_low; } +static void mpam_msc_zero_esr(struct mpam_msc *msc) +{ + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ESR); + if (msc->has_extd_esr) + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ESR + 4); +} + +static u64 mpam_msc_read_esr(struct mpam_msc *msc) +{ + u64 esr_high = 0, esr_low; + + esr_low = readl_relaxed(msc->mapped_hwpage + MPAMF_ESR); + if (msc->has_extd_esr) + esr_high = readl_relaxed(msc->mapped_hwpage + MPAMF_ESR + 4); + + return (esr_high << 32) | esr_low; +} + static void __mpam_part_sel(u8 ris_idx, u16 partid, struct mpam_msc *msc) { u32 partsel; @@ -622,6 +649,7 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) pmg_max = FIELD_GET(MPAMF_IDR_PMG_MAX, idr); msc->partid_max = min(msc->partid_max, partid_max); msc->pmg_max = min(msc->pmg_max, pmg_max); + msc->has_extd_esr = FIELD_GET(MPAMF_IDR_HAS_EXT_ESR, idr); ris = mpam_get_or_create_ris(msc, ris_idx); if (IS_ERR(ris)) { @@ -764,6 +792,12 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online) srcu_read_unlock(&mpam_srcu, idx); } +static void _enable_percpu_irq(void *_irq) +{ + int *irq = _irq; + enable_percpu_irq(*irq, IRQ_TYPE_NONE); +} + static int mpam_cpu_online(unsigned int cpu) { int idx; @@ -774,11 +808,13 @@ static int mpam_cpu_online(unsigned int cpu) if (!cpumask_test_cpu(cpu, &msc->accessibility)) continue; - if (atomic_fetch_inc(&msc->online_refs) == 0) { - mutex_lock(&msc->lock); + mutex_lock(&msc->lock); + if (msc->reenable_error_ppi) + _enable_percpu_irq(&msc->reenable_error_ppi); + + if (atomic_fetch_inc(&msc->online_refs) == 0) mpam_reset_msc(msc, true); - mutex_unlock(&msc->lock); - } + mutex_unlock(&msc->lock); } srcu_read_unlock(&mpam_srcu, idx); @@ -828,11 +864,13 @@ static int mpam_cpu_offline(unsigned int cpu) if (!cpumask_test_cpu(cpu, &msc->accessibility)) continue; - if (atomic_dec_and_test(&msc->online_refs)) { - mutex_lock(&msc->lock); + mutex_lock(&msc->lock); + if (msc->reenable_error_ppi) + disable_percpu_irq(msc->reenable_error_ppi); + + if (atomic_dec_and_test(&msc->online_refs)) mpam_reset_msc(msc, false); - mutex_unlock(&msc->lock); - } + mutex_unlock(&msc->lock); } srcu_read_unlock(&mpam_srcu, idx); @@ -851,6 +889,50 @@ static void mpam_register_cpuhp_callbacks(int (*online)(unsigned int online)) mutex_unlock(&mpam_cpuhp_state_lock); } +static int __setup_ppi(struct mpam_msc *msc) +{ + int cpu; + + msc->error_dev_id = alloc_percpu_gfp(struct mpam_msc *, GFP_KERNEL); + if (!msc->error_dev_id) + return -ENOMEM; + + for_each_cpu(cpu, &msc->accessibility) { + struct mpam_msc *empty = *per_cpu_ptr(msc->error_dev_id, cpu); + if (empty != NULL) { + pr_err_once("%s shares PPI with %s!\n", dev_name(&msc->pdev->dev), + dev_name(&empty->pdev->dev)); + return -EBUSY; + } + *per_cpu_ptr(msc->error_dev_id, cpu) = msc; + } + + return 0; +} + +static int mpam_msc_setup_error_irq(struct mpam_msc *msc) +{ + int irq; + + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + return 0; + + /* Allocate and initialise the percpu device pointer for PPI */ + if (irq_is_percpu(irq)) + + return __setup_ppi(msc); + + /* sanity check: shared interrupts can be routed anywhere? 
*/ + if (!cpumask_equal(&msc->accessibility, cpu_possible_mask)) { + pr_err_once("msc:%u is a private resource with a shared error interrupt", + msc->id); + return -EINVAL; + } + + return 0; +} + static int mpam_dt_count_msc(void) { int count = 0; @@ -1021,6 +1103,13 @@ static int mpam_msc_drv_probe(struct platform_device *pdev) spin_lock_init(&msc->part_sel_lock); spin_lock_init(&msc->mon_sel_lock); + err = mpam_msc_setup_error_irq(msc); + if (err) { + devm_kfree(&pdev->dev, msc); + msc = ERR_PTR(err); + break; + } + if (device_property_read_u32(&pdev->dev, "pcc-channel", &msc->pcc_subspace_id)) msc->iface = MPAM_IFACE_MMIO; @@ -1173,11 +1262,198 @@ static void mpam_enable_merge_features(void) } } +static char *mpam_errcode_names[16] = { + [0] = "No error", + [1] = "PARTID_SEL_Range", + [2] = "Req_PARTID_Range", + [3] = "MSMONCFG_ID_RANGE", + [4] = "Req_PMG_Range", + [5] = "Monitor_Range", + [6] = "intPARTID_Range", + [7] = "Unexpected_INTERNAL", + [8] = "Undefined_RIS_PART_SEL", + [9] = "RIS_No_Control", + [10] = "Undefined_RIS_MON_SEL", + [11] = "RIS_No_Monitor", + [12 ... 15] = "Reserved" +}; + +static int mpam_enable_msc_ecr(void *_msc) +{ + struct mpam_msc *msc = _msc; + + writel_relaxed(1, msc->mapped_hwpage + MPAMF_ECR); + + return 0; +} + +static int mpam_disable_msc_ecr(void *_msc) +{ + struct mpam_msc *msc = _msc; + + writel_relaxed(0, msc->mapped_hwpage + MPAMF_ECR); + + return 0; +} + +static irqreturn_t __mpam_irq_handler(int irq, struct mpam_msc *msc) +{ + u64 reg; + u16 partid; + u8 errcode, pmg, ris; + + if (WARN_ON_ONCE(!msc) || + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), + &msc->accessibility))) + return IRQ_NONE; + + reg = mpam_msc_read_esr(msc); + + errcode = FIELD_GET(MPAMF_ESR_ERRCODE, reg); + if (!errcode) + return IRQ_NONE; + + /* Clear level triggered irq */ + mpam_msc_zero_esr(msc); + + partid = FIELD_GET(MPAMF_ESR_PARTID_OR_MON, reg); + pmg = FIELD_GET(MPAMF_ESR_PMG, reg); + ris = FIELD_GET(MPAMF_ESR_RIS, reg); + + pr_err("error irq from msc:%u '%s', partid:%u, pmg: %u, ris: %u\n", + msc->id, mpam_errcode_names[errcode], partid, pmg, ris); + + if (irq_is_percpu(irq)) { + mpam_disable_msc_ecr(msc); + schedule_work(&mpam_broken_work); + return IRQ_HANDLED; + } + + return IRQ_WAKE_THREAD; +} + +static irqreturn_t mpam_ppi_handler(int irq, void *dev_id) +{ + struct mpam_msc *msc = *(struct mpam_msc **)dev_id; + + return __mpam_irq_handler(irq, msc); +} + +static irqreturn_t mpam_spi_handler(int irq, void *dev_id) +{ + struct mpam_msc *msc = dev_id; + + return __mpam_irq_handler(irq, msc); +} + +static irqreturn_t mpam_disable_thread(int irq, void *dev_id); + +static int mpam_register_irqs(void) +{ + int err, irq; + struct mpam_msc *msc; + + lockdep_assert_cpus_held(); + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + continue; + + /* The MPAM spec says the interrupt can be SPI, PPI or LPI */ + /* We anticipate sharing the interrupt with other MSCs */ + if (irq_is_percpu(irq)) { + err = request_percpu_irq(irq, &mpam_ppi_handler, + "mpam:msc:error", + msc->error_dev_id); + if (err) + return err; + + mutex_lock(&msc->lock); + msc->reenable_error_ppi = irq; + smp_call_function_many(&msc->accessibility, + &_enable_percpu_irq, &irq, + true); + mutex_unlock(&msc->lock); + } else { + err = devm_request_threaded_irq(&msc->pdev->dev, irq, + &mpam_spi_handler, + &mpam_disable_thread, + IRQF_SHARED, + "mpam:msc:error", msc); + if (err) + return 
err; + } + + mutex_lock(&msc->lock); + msc->error_irq_requested = true; + mpam_touch_msc(msc, mpam_enable_msc_ecr, msc); + msc->error_irq_hw_enabled = true; + mutex_unlock(&msc->lock); + } + + return 0; +} + +static void mpam_unregister_irqs(void) +{ + int irq; + struct mpam_msc *msc; + + cpus_read_lock(); + /* take the lock as free_irq() can sleep */ + mutex_lock(&mpam_list_lock); + list_for_each_entry(msc, &mpam_all_msc, glbl_list) { + irq = platform_get_irq_byname_optional(msc->pdev, "error"); + if (irq <= 0) + continue; + + mutex_lock(&msc->lock); + if (msc->error_irq_hw_enabled) { + mpam_touch_msc(msc, mpam_disable_msc_ecr, msc); + msc->error_irq_hw_enabled = false; + } + + if (msc->error_irq_requested) { + if (irq_is_percpu(irq)) { + msc->reenable_error_ppi = 0; + free_percpu_irq(irq, msc->error_dev_id); + } else { + devm_free_irq(&msc->pdev->dev, irq, msc); + } + msc->error_irq_requested = false; + } + mutex_unlock(&msc->lock); + } + mutex_unlock(&mpam_list_lock); + cpus_read_unlock(); +} + static void mpam_enable_once(void) { + int err; + + /* + * If all the MSC have been probed, enabling the IRQs happens next. + * That involves cross-calling to a CPU that can reach the MSC, and + * the locks must be taken in this order: + */ + cpus_read_lock(); mutex_lock(&mpam_list_lock); mpam_enable_merge_features(); + + err = mpam_register_irqs(); + if (err) + pr_warn("Failed to register irqs: %d\n", err); + mutex_unlock(&mpam_list_lock); + cpus_read_unlock(); + + if (err) { + schedule_work(&mpam_broken_work); + return; + } mutex_lock(&mpam_cpuhp_state_lock); cpuhp_remove_state(mpam_cpuhp_state); @@ -1221,15 +1497,31 @@ static void mpam_reset_class(struct mpam_class *class) * All of MPAMs errors indicate a software bug, restore any modified * controls to their reset values. */ -void mpam_disable(void) +static irqreturn_t mpam_disable_thread(int irq, void *dev_id) { int idx; struct mpam_class *class; + mutex_lock(&mpam_cpuhp_state_lock); + if (mpam_cpuhp_state) { + cpuhp_remove_state(mpam_cpuhp_state); + mpam_cpuhp_state = 0; + } + mutex_unlock(&mpam_cpuhp_state_lock); + + mpam_unregister_irqs(); + idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(class, &mpam_classes, classes_list) mpam_reset_class(class); srcu_read_unlock(&mpam_srcu, idx); + + return IRQ_HANDLED; +} + +void mpam_disable(struct work_struct *ignored) +{ + mpam_disable_thread(0, NULL); } /* @@ -1243,7 +1535,6 @@ void mpam_enable(struct work_struct *work) struct mpam_msc *msc; bool all_devices_probed = true; - /* Have we probed all the hw devices? */ mutex_lock(&mpam_list_lock); list_for_each_entry(msc, &mpam_all_msc, glbl_list) { mutex_lock(&msc->lock); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index dd1c017c6e08..e81aed7e6be6 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -29,10 +29,17 @@ struct mpam_msc struct pcc_mbox_chan *pcc_chan; u32 nrdy_usec; cpumask_t accessibility; + bool has_extd_esr; + + int reenable_error_ppi; + struct mpam_msc * __percpu *error_dev_id; + atomic_t online_refs; struct mutex lock; bool probed; + bool error_irq_requested; + bool error_irq_hw_enabled; u16 partid_max; u8 pmg_max; unsigned long ris_idxs[128 / BITS_PER_LONG]; @@ -167,6 +174,7 @@ extern u8 mpam_pmg_max; /* Scheduled work callback to enable mpam once all MSC have been probed */ void mpam_enable(struct work_struct *work); +void mpam_disable(struct work_struct *work); /* * MPAM MSCs have the following register layout. See: -- 2.25.1
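The handler's decode step splits the syndrome into errcode/partid/pmg/ris fields and maps the errcode through mpam_errcode_names[]. A userspace sketch of that decode; the field offsets below are illustrative placeholders, the driver's MPAMF_ESR_* masks are authoritative:

#include <stdint.h>
#include <stdio.h>

/* Placeholder field offsets, not the architected MPAMF_ESR layout */
#define ESR_PARTID_OR_MON(x)	((unsigned int)((x) & 0xffff))
#define ESR_PMG(x)		((unsigned int)(((x) >> 16) & 0xff))
#define ESR_ERRCODE(x)		((unsigned int)(((x) >> 24) & 0xf))

static const char *errcode_names[16] = {
	[0] = "No error",		[1] = "PARTID_SEL_Range",
	[2] = "Req_PARTID_Range",	[3] = "MSMONCFG_ID_RANGE",
	[4] = "Req_PMG_Range",		[5] = "Monitor_Range",
	[6] = "intPARTID_Range",	[7] = "Unexpected_INTERNAL",
	[8] = "Undefined_RIS_PART_SEL",	[9] = "RIS_No_Control",
	[10] = "Undefined_RIS_MON_SEL",	[11] = "RIS_No_Monitor",
	[12 ... 15] = "Reserved"	/* GNU range initialiser, as in the patch */
};

int main(void)
{
	uint64_t esr = 0x2000024ull;	/* errcode 2, partid 0x24 */

	if (!ESR_ERRCODE(esr))
		return 0;	/* the handler treats this as IRQ_NONE */

	printf("'%s', partid/mon:%u, pmg:%u\n",
	       errcode_names[ESR_ERRCODE(esr)],
	       ESR_PARTID_OR_MON(esr), ESR_PMG(esr));
	return 0;
}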

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Once all the MSC have been probed, the system wide usable number of PARTID is known and the configuration arrays can be allocated. After this point, checking all the MSC have been probed is pointless, and the cpuhp callbacks should restore the configuration, instead of just resetting the MSC. Enable the architecture's static key that indicates whether mpam is enabled and use this to skip the discovery work on cpu hotplug. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 6 ++++++ drivers/platform/mpam/mpam_internal.h | 8 ++++++++ 2 files changed, 14 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index ac82c041999d..04e1dfd1c5e3 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -828,6 +828,9 @@ static int mpam_discovery_cpu_online(unsigned int cpu) struct mpam_msc *msc; bool new_device_probed = false; + if (mpam_is_enabled()) + return 0; + mutex_lock(&mpam_list_lock); list_for_each_entry(msc, &mpam_all_msc, glbl_list) { if (!cpumask_test_cpu(cpu, &msc->accessibility)) @@ -1468,6 +1471,7 @@ static void mpam_enable_once(void) partid_max_published = true; spin_unlock(&partid_max_lock); + static_branch_enable(&mpam_enabled); mpam_register_cpuhp_callbacks(mpam_cpu_online); pr_info("MPAM enabled with %u partid and %u pmg\n", @@ -1509,6 +1513,8 @@ static irqreturn_t mpam_disable_thread(int irq, void *dev_id) } mutex_unlock(&mpam_cpuhp_state_lock); + static_branch_disable(&mpam_enabled); + mpam_unregister_irqs(); idx = srcu_read_lock(&mpam_srcu); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index e81aed7e6be6..9f9e7a9899a6 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -8,12 +8,20 @@ #include <linux/atomic.h> #include <linux/cpumask.h> #include <linux/io.h> +#include <linux/jump_label.h> #include <linux/mailbox_client.h> #include <linux/mutex.h> #include <linux/resctrl.h> #include <linux/sizes.h> #include <linux/srcu.h> +DECLARE_STATIC_KEY_FALSE(mpam_enabled); + +static inline bool mpam_is_enabled(void) +{ + return static_branch_likely(&mpam_enabled); +} + struct mpam_msc { /* member of mpam_all_msc */ -- 2.25.1
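The static key gives the hot paths a patched branch rather than a flag load: mpam_is_enabled() compiles to a no-op jump until mpam_enable_once() flips it. The pattern in isolation (a sketch mirroring the jump_label API used above; the example_* names are illustrative):

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(example_enabled);	/* counterpart of mpam_enabled */

static int example_hot_path(void)
{
	/* Until static_branch_enable() runs, this branch is patched out */
	if (!static_branch_likely(&example_enabled))
		return 0;	/* discovery path: nothing to restore yet */

	/* ... enabled path: restore configuration instead of probing ... */
	return 1;
}

static void example_enable_once(void)
{
	static_branch_enable(&example_enabled);	/* as in mpam_enable_once() */
}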

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- When CPUs come online the original configuration should be restored. Once the maximum partid is known, allocate a configuration array for each component, and reprogram each RIS configuration from this. The MPAM spec describes how multiple controls can interact. To prevent this from happening by accident, always reset controls that don't have a valid configuration. This allows the same helper to be used for configuration and reset. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 193 +++++++++++++++++++++++--- drivers/platform/mpam/mpam_internal.h | 24 +++- 2 files changed, 192 insertions(+), 25 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 04e1dfd1c5e3..7da57310d52c 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -699,51 +699,89 @@ static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) __mpam_write_reg(msc, reg, bm); } -static void mpam_reset_ris_partid(struct mpam_msc_ris *ris, u16 partid) +static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, + struct mpam_config *cfg) { struct mpam_msc *msc = ris->msc; u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; struct mpam_props *rprops = &ris->props; - lockdep_assert_held(&msc->lock); - spin_lock(&msc->part_sel_lock); __mpam_part_sel(ris->ris_idx, partid, msc); - if (mpam_has_feature(mpam_feat_cpor_part, rprops)) - mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, rprops->cpbm_wd); + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) { + if (mpam_has_feature(mpam_feat_cpor_part, cfg)) + mpam_write_partsel_reg(msc, CPBM, cfg->cpbm); + else + mpam_reset_msc_bitmap(msc, MPAMCFG_CPBM, + rprops->cpbm_wd); + } - if (mpam_has_feature(mpam_feat_mbw_part, rprops)) - mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, rprops->mbw_pbm_bits); + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_part, cfg)) + mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm); + else + mpam_reset_msc_bitmap(msc, MPAMCFG_MBW_PBM, + rprops->mbw_pbm_bits); + } if (mpam_has_feature(mpam_feat_mbw_min, rprops)) mpam_write_partsel_reg(msc, MBW_MIN, 0); - if (mpam_has_feature(mpam_feat_mbw_max, rprops)) - mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); + if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_max, cfg)) + mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max); + else + mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); + } if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); spin_unlock(&msc->part_sel_lock); } +struct reprogram_ris { + struct mpam_msc_ris *ris; + struct mpam_config *cfg; +}; + +/* Call with MSC lock held */ +static int mpam_reprogram_ris(void *_arg) +{ + u16 partid, partid_max; + struct reprogram_ris *arg = _arg; + struct mpam_msc_ris *ris = arg->ris; + struct mpam_config *cfg = arg->cfg; + + if (ris->in_reset_state) + return 0; + + spin_lock(&partid_max_lock); + partid_max = mpam_partid_max; + spin_unlock(&partid_max_lock); + for (partid = 0; partid < partid_max; partid++) + mpam_reprogram_ris_partid(ris, partid, cfg); + + return 0; +} + /* * Called via smp_call_on_cpu() to prevent 
migration, while still being * pre-emptible. */ static int mpam_reset_ris(void *arg) { - u16 partid, partid_max; struct mpam_msc_ris *ris = arg; + struct reprogram_ris reprogram_arg; + struct mpam_config empty_cfg = { 0 }; if (ris->in_reset_state) return 0; - spin_lock(&partid_max_lock); - partid_max = mpam_partid_max; - spin_unlock(&partid_max_lock); - for (partid = 0; partid < partid_max; partid++) - mpam_reset_ris_partid(ris, partid); + reprogram_arg.ris = ris; + reprogram_arg.cfg = &empty_cfg; + + mpam_reprogram_ris(&reprogram_arg); return 0; } @@ -792,6 +830,37 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online) srcu_read_unlock(&mpam_srcu, idx); } +static void mpam_reprogram_msc(struct mpam_msc *msc) +{ + int idx; + u16 partid; + bool reset; + struct mpam_config *cfg; + struct mpam_msc_ris *ris; + + lockdep_assert_held(&msc->lock); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &msc->ris, msc_list) { + if (!mpam_is_enabled() && !ris->in_reset_state) { + mpam_touch_msc(msc, &mpam_reset_ris, ris); + ris->in_reset_state = true; + continue; + } + + reset = true; + for (partid = 0; partid < mpam_partid_max; partid++) { + cfg = &ris->comp->cfg[partid]; + if (cfg->features) + reset = false; + + mpam_reprogram_ris_partid(ris, partid, cfg); + } + ris->in_reset_state = reset; + } + srcu_read_unlock(&mpam_srcu, idx); +} + static void _enable_percpu_irq(void *_irq) { int *irq = _irq; @@ -813,7 +882,7 @@ static int mpam_cpu_online(unsigned int cpu) _enable_percpu_irq(&msc->reenable_error_ppi); if (atomic_fetch_inc(&msc->online_refs) == 0) - mpam_reset_msc(msc, true); + mpam_reprogram_msc(msc); mutex_unlock(&msc->lock); } srcu_read_unlock(&mpam_srcu, idx); @@ -1433,6 +1502,37 @@ static void mpam_unregister_irqs(void) cpus_read_unlock(); } +static int __allocate_component_cfg(struct mpam_component *comp) +{ + if (comp->cfg) + return 0; + + comp->cfg = kcalloc(mpam_partid_max, sizeof(*comp->cfg), GFP_KERNEL); + if (!comp->cfg) + return -ENOMEM; + + return 0; +} + +static int mpam_allocate_config(void) +{ + int err = 0; + struct mpam_class *class; + struct mpam_component *comp; + + lockdep_assert_held(&mpam_list_lock); + + list_for_each_entry(class, &mpam_classes, classes_list) { + list_for_each_entry(comp, &class->components, class_list) { + err = __allocate_component_cfg(comp); + if (err) + return err; + } + } + + return 0; +} + static void mpam_enable_once(void) { int err; @@ -1444,12 +1544,21 @@ static void mpam_enable_once(void) */ cpus_read_lock(); mutex_lock(&mpam_list_lock); - mpam_enable_merge_features(); + do { + mpam_enable_merge_features(); - err = mpam_register_irqs(); - if (err) - pr_warn("Failed to register irqs: %d\n", err); + err = mpam_allocate_config(); + if (err) { + pr_err("Failed to allocate configuration arrays.\n"); + break; + } + err = mpam_register_irqs(); + if (err) { + pr_warn("Failed to register irqs: %d\n", err); + break; + } + } while (0); mutex_unlock(&mpam_list_lock); cpus_read_unlock(); @@ -1486,6 +1595,8 @@ static void mpam_reset_class(struct mpam_class *class) idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(comp, &class->components, class_list) { + memset(comp->cfg, 0, (mpam_partid_max * sizeof(*comp->cfg))); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { mutex_lock(&ris->msc->lock); mpam_touch_msc(ris->msc, mpam_reset_ris, ris); @@ -1575,6 +1686,48 @@ static int mpam_msc_drv_remove(struct platform_device *pdev) return 0; } +struct mpam_write_config_arg { + struct mpam_msc_ris *ris; + struct mpam_component 
*comp; + u16 partid; +}; + +static int __write_config(void *arg) +{ + struct mpam_write_config_arg *c = arg; + + mpam_reprogram_ris_partid(c->ris, c->partid, &c->comp->cfg[c->partid]); + + return 0; +} + +/* TODO: split into write_config/sync_config */ +/* TODO: add config_dirty bitmap to drive sync_config */ +int mpam_apply_config(struct mpam_component *comp, u16 partid, + struct mpam_config *cfg) +{ + struct mpam_write_config_arg arg; + struct mpam_msc_ris *ris; + int idx; + + lockdep_assert_cpus_held(); + + comp->cfg[partid] = *cfg; + arg.comp = comp; + arg.partid = partid; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + arg.ris = ris; + mutex_lock(&ris->msc->lock); + mpam_touch_msc(ris->msc, __write_config, &arg); + mutex_unlock(&ris->msc->lock); + } + srcu_read_unlock(&mpam_srcu, idx); + + return 0; +} + static const struct of_device_id mpam_of_match[] = { { .compatible = "arm,mpam-msc", }, {}, diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 9f9e7a9899a6..6a6d67a473dc 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -102,11 +102,7 @@ struct mpam_props u16 num_mbwu_mon; }; -static inline bool mpam_has_feature(enum mpam_device_features feat, - struct mpam_props *props) -{ - return (1<<feat) & props->features; -} +#define mpam_has_feature(_feat, x) ((1<<_feat) & (x)->features) static inline void mpam_set_feature(enum mpam_device_features feat, struct mpam_props *props) @@ -136,6 +132,15 @@ struct mpam_class struct list_head classes_list; }; +struct mpam_config { + /* Which configuration values are valid. 0 is used for reset */ + mpam_features_t features; + + u32 cpbm; + u32 mbw_pbm; + u16 mbw_max; +}; + struct mpam_component { u32 comp_id; @@ -145,6 +150,12 @@ struct mpam_component cpumask_t affinity; + /* + * Array of configuration values, indexed by partid. + * Read from cpuhp callbacks, hold the cpuhp lock when writing. + */ + struct mpam_config *cfg; + /* member of mpam_class:components */ struct list_head class_list; @@ -184,6 +195,9 @@ extern u8 mpam_pmg_max; void mpam_enable(struct work_struct *work); void mpam_disable(struct work_struct *work); +int mpam_apply_config(struct mpam_component *comp, u16 partid, + struct mpam_config *cfg); + /* * MPAM MSCs have the following register layout. See: * Arm Architecture Reference Manual Supplement - Memory System Resource -- 2.25.1
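A hypothetical caller of the new interface, restricting one partid to a cache portion bitmap. example_set_cpbm() is not part of the series; only the cpbm field is marked valid, so every other control falls back to its reset value, as the commit message describes:

/* Hypothetical caller: everything except 'cpbm' is reset to defaults */
static int example_set_cpbm(struct mpam_component *comp, u16 partid, u32 cpbm)
{
	struct mpam_config cfg = { 0 };
	int err;

	cfg.cpbm = cpbm;
	cfg.features = 1 << mpam_feat_cpor_part;	/* mark cpbm as valid */

	cpus_read_lock();	/* mpam_apply_config() asserts the cpuhp lock */
	err = mpam_apply_config(comp, partid, &cfg);
	cpus_read_unlock();

	return err;
}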

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- MPAM supports more features than are going to be exposed to resctrl. For partid other than 0, the reset values of these controls isn't known. Discover the rest of the features so they can be reset to avoid any side effects when resctrl is in use. PARTID narrowing allows MSC/RIS to support less configuration space than is usable. If this feature is found on a class of device we are likely to use, then reduce the partid_max to make it usable. This allows us to map a PARTID to itself. CC: Rohit Mathew <Rohit.Mathew@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 91 +++++++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 10 ++- 2 files changed, 100 insertions(+), 1 deletion(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 7da57310d52c..612c4b97d373 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -549,10 +549,20 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) int err; struct mpam_msc *msc = ris->msc; struct mpam_props *props = &ris->props; + struct mpam_class *class = ris->comp->class; lockdep_assert_held(&msc->lock); lockdep_assert_held(&msc->part_sel_lock); + /* Cache Capacity Partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_CCAP_PART, ris->idr)) { + u32 ccap_features = mpam_read_partsel_reg(msc, CCAP_IDR); + + props->cmax_wd = FIELD_GET(MPAMF_CCAP_IDR_CMAX_WD, ccap_features); + if (props->cmax_wd) + mpam_set_feature(mpam_feat_ccap_part, props); + } + /* Cache Portion partitioning */ if (FIELD_GET(MPAMF_IDR_HAS_CPOR_PART, ris->idr)) { u32 cpor_features = mpam_read_partsel_reg(msc, CPOR_IDR); @@ -575,6 +585,31 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) mpam_set_feature(mpam_feat_mbw_max, props); + + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MIN, mbw_features)) + mpam_set_feature(mpam_feat_mbw_min, props); + + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_PROP, mbw_features)) + mpam_set_feature(mpam_feat_mbw_prop, props); + } + + /* Priority partitioning */ + if (FIELD_GET(MPAMF_IDR_HAS_PRI_PART, ris->idr)) { + u32 pri_features = mpam_read_partsel_reg(msc, PRI_IDR); + + props->intpri_wd = FIELD_GET(MPAMF_PRI_IDR_INTPRI_WD, pri_features); + if (props->intpri_wd && FIELD_GET(MPAMF_PRI_IDR_HAS_INTPRI, pri_features)) { + mpam_set_feature(mpam_feat_intpri_part, props); + if (FIELD_GET(MPAMF_PRI_IDR_INTPRI_0_IS_LOW, pri_features)) + mpam_set_feature(mpam_feat_intpri_part_0_low, props); + } + + props->dspri_wd = FIELD_GET(MPAMF_PRI_IDR_DSPRI_WD, pri_features); + if (props->dspri_wd && FIELD_GET(MPAMF_PRI_IDR_HAS_DSPRI, pri_features)) { + mpam_set_feature(mpam_feat_dspri_part, props); + if (FIELD_GET(MPAMF_PRI_IDR_DSPRI_0_IS_LOW, pri_features)) + mpam_set_feature(mpam_feat_dspri_part_0_low, props); + } } /* Performance Monitoring */ @@ -610,6 +645,21 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) mpam_set_feature(mpam_feat_msmon_mbwu_rwbw, props); } } + + /* + * RIS with PARTID narrowing don't have enough storage for one + * configuration per PARTID. 
If these are in a class we could use, + * reduce the supported partid_max to match the number of intpartid. + * If the class is unknown, just ignore it. + */ + if (FIELD_GET(MPAMF_IDR_HAS_PARTID_NRW, ris->idr) && + class->type != MPAM_CLASS_UNKNOWN) { + u32 nrwidr = mpam_read_partsel_reg(msc, PARTID_NRW_IDR); + u16 partid_max = FIELD_GET(MPAMF_PARTID_NRW_IDR_INTPARTID_MAX, nrwidr); + + mpam_set_feature(mpam_feat_partid_nrw, props); + msc->partid_max = min(msc->partid_max, partid_max); + } } static int mpam_msc_hw_probe(struct mpam_msc *msc) @@ -702,13 +752,21 @@ static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, struct mpam_config *cfg) { + u32 pri_val = 0; + u16 cmax = MPAMCFG_CMAX_CMAX; struct mpam_msc *msc = ris->msc; u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; struct mpam_props *rprops = &ris->props; + u16 dspri = GENMASK(rprops->dspri_wd, 0); + u16 intpri = GENMASK(rprops->intpri_wd, 0); spin_lock(&msc->part_sel_lock); __mpam_part_sel(ris->ris_idx, partid, msc); + if (mpam_has_feature(mpam_feat_partid_nrw, rprops)) + mpam_write_partsel_reg(msc, INTPARTID, + (MPAMCFG_PART_SEL_INTERNAL | partid)); + if (mpam_has_feature(mpam_feat_cpor_part, rprops)) { if (mpam_has_feature(mpam_feat_cpor_part, cfg)) mpam_write_partsel_reg(msc, CPBM, cfg->cpbm); @@ -737,6 +795,26 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); + + if (mpam_has_feature(mpam_feat_ccap_part, rprops)) + mpam_write_partsel_reg(msc, CMAX, cmax); + + if (mpam_has_feature(mpam_feat_intpri_part, rprops) || + mpam_has_feature(mpam_feat_dspri_part, rprops)) { + /* aces high? */ + if (!mpam_has_feature(mpam_feat_intpri_part_0_low, rprops)) + intpri = 0; + if (!mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) + dspri = 0; + + if (mpam_has_feature(mpam_feat_intpri_part, rprops)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_INTPRI, intpri); + if (mpam_has_feature(mpam_feat_dspri_part, rprops)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_DSPRI, dspri); + + mpam_write_partsel_reg(msc, PRI, pri_val); + } + spin_unlock(&msc->part_sel_lock); } @@ -1285,6 +1363,19 @@ __resource_props_mismatch(struct mpam_msc_ris *ris, struct mpam_class *class) cprops->num_csu_mon = min(cprops->num_csu_mon, rprops->num_csu_mon); if (cprops->num_mbwu_mon != rprops->num_mbwu_mon) cprops->num_mbwu_mon = min(cprops->num_mbwu_mon, rprops->num_mbwu_mon); + + if (cprops->intpri_wd != rprops->intpri_wd) + cprops->intpri_wd = min(cprops->intpri_wd, rprops->intpri_wd); + if (cprops->dspri_wd != rprops->dspri_wd) + cprops->dspri_wd = min(cprops->dspri_wd, rprops->dspri_wd); + + /* {int,ds}pri may not have differing 0-low behaviour */ + if (mpam_has_feature(mpam_feat_intpri_part_0_low, cprops) != + mpam_has_feature(mpam_feat_intpri_part_0_low, rprops)) + mpam_clear_feature(mpam_feat_intpri_part, &cprops->features); + if (mpam_has_feature(mpam_feat_dspri_part_0_low, cprops) != + mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) + mpam_clear_feature(mpam_feat_dspri_part, &cprops->features); } /* diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 6a6d67a473dc..9b90f10b0436 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -70,7 +70,7 @@ struct mpam_msc * When we compact the supported features, we don't care what they are. * Storing them as a bitmap makes life easy. 
*/ -typedef u16 mpam_features_t; +typedef u32 mpam_features_t; /* Bits for mpam_features_t */ enum mpam_device_features { @@ -80,6 +80,10 @@ enum mpam_device_features { mpam_feat_mbw_min, mpam_feat_mbw_max, mpam_feat_mbw_prop, + mpam_feat_intpri_part, + mpam_feat_intpri_part_0_low, + mpam_feat_dspri_part, + mpam_feat_dspri_part_0_low, mpam_feat_msmon, mpam_feat_msmon_csu, mpam_feat_msmon_csu_capture, @@ -87,6 +91,7 @@ enum mpam_device_features { mpam_feat_msmon_mbwu_capture, mpam_feat_msmon_mbwu_rwbw, mpam_feat_msmon_capt, + mpam_feat_partid_nrw, MPAM_FEATURE_LAST, }; #define MPAM_ALL_FEATURES ((1<<MPAM_FEATURE_LAST) - 1) @@ -98,6 +103,9 @@ struct mpam_props u16 cpbm_wd; u16 mbw_pbm_bits; u8 bwa_wd; + u16 cmax_wd; + u16 intpri_wd; + u16 dspri_wd; u16 num_csu_mon; u16 num_mbwu_mon; }; -- 2.25.1
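The probe path is almost entirely FIELD_GET() calls against the *_IDR registers: a mask encodes both a field's position and its width. A userspace model of that mechanism, using a hypothetical IDR layout (the real MPAMF_*_IDR field positions come from the MPAM spec, not from this demo):

#include <stdint.h>
#include <stdio.h>

#define GENMASK32(h, l)	((~0u >> (31 - (h))) & ~((1u << (l)) - 1))

/* FIELD_GET() in miniature: divide by the mask's lowest set bit */
static uint32_t field_get(uint32_t mask, uint32_t reg)
{
	return (reg & mask) / (mask & -mask);
}

/* Hypothetical IDR layout: an 8-bit width field at bits [7:0] and a
 * has-feature flag at bit 16 (stand-ins for MPAMF_*_IDR fields). */
#define IDR_WD		GENMASK32(7, 0)
#define IDR_HAS_MAX	(1u << 16)

int main(void)
{
	uint32_t idr = 0x0001000d;	/* wd = 13, HAS_MAX set */

	printf("wd = %u\n", field_get(IDR_WD, idr));
	printf("has_max = %u\n", field_get(IDR_HAS_MAX, idr));
	return 0;
}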

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- MPAM's MSC support a number of monitors, each of which supports bandwidth counters, or cache-storage-utilisation counters. To use a counter, a monitor needs to be configured. Add helpers to allocate and free CSU or MBWU monitors. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 2 ++ drivers/platform/mpam/mpam_internal.h | 35 +++++++++++++++++++++++++++ 2 files changed, 37 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 612c4b97d373..82dffe8c03ef 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -244,6 +244,8 @@ mpam_class_alloc(u8 level_idx, enum mpam_class_types type, gfp_t gfp) class->level = level_idx; class->type = type; INIT_LIST_HEAD_RCU(&class->classes_list); + ida_init(&class->ida_csu_mon); + ida_init(&class->ida_mbwu_mon); list_add_rcu(&class->classes_list, &mpam_classes); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 9b90f10b0436..19acb460aeba 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -138,6 +138,9 @@ struct mpam_class /* member of mpam_classes */ struct list_head classes_list; + + struct ida ida_csu_mon; + struct ida ida_mbwu_mon; }; struct mpam_config { @@ -191,6 +194,38 @@ struct mpam_msc_ris struct mpam_component *comp; }; +static inline int mpam_alloc_csu_mon(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return -EOPNOTSUPP; + + return ida_alloc_range(&class->ida_csu_mon, 0, cprops->num_csu_mon - 1, + GFP_KERNEL); +} + +static inline void mpam_free_csu_mon(struct mpam_class *class, int csu_mon) +{ + ida_free(&class->ida_csu_mon, csu_mon); +} + +static inline int mpam_alloc_mbwu_mon(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops)) + return -EOPNOTSUPP; + + return ida_alloc_range(&class->ida_mbwu_mon, 0, + cprops->num_mbwu_mon - 1, GFP_KERNEL); +} + +static inline void mpam_free_mbwu_mon(struct mpam_class *class, int mbwu_mon) +{ + ida_free(&class->ida_mbwu_mon, mbwu_mon); +} + /* List of all classes */ extern struct list_head mpam_classes; extern struct srcu_struct mpam_srcu; -- 2.25.1
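A hypothetical user of the new allocators: the IDA hands out monitor indices in [0, num_csu_mon), and -EOPNOTSUPP falls out naturally when the class never gained the feature during mpam_enable_merge_features(). example_with_csu_mon() is illustrative, not part of the series:

static int example_with_csu_mon(struct mpam_class *class)
{
	int mon = mpam_alloc_csu_mon(class);

	if (mon < 0)
		return mon;	/* -EOPNOTSUPP if CSU monitors unsupported */

	/* ... program and read the monitor via the mon_sel registers ... */

	mpam_free_csu_mon(class, mon);
	return 0;
}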

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Reading a monitor involves configuring what you want to monitor, and reading the value. Components made up of multiple MSC may need values from each MSC. MSCs may take time to configure, returning 'not ready'. The maximum 'not ready' time should have been provided by firmware. Add mpam_msmon_read() to hide all this. If (one of) the MSC returns not ready, then wait the full timeout value before trying again. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 219 ++++++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 19 +++ 2 files changed, 238 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 82dffe8c03ef..3516528f2e14 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -123,6 +123,22 @@ static void __mpam_write_reg(struct mpam_msc *msc, u16 reg, u32 val) __mpam_write_reg(msc, MPAMCFG_##reg, val); \ }) +#define mpam_read_monsel_reg(msc, reg) \ +({ \ + u32 ____ret; \ + \ + lockdep_assert_held_once(&msc->mon_sel_lock); \ + ____ret = __mpam_read_reg(msc, MSMON_##reg); \ + \ + ____ret; \ +}) + +#define mpam_write_monsel_reg(msc, reg, val) \ +({ \ + lockdep_assert_held_once(&msc->mon_sel_lock); \ + __mpam_write_reg(msc, MSMON_##reg, val); \ +}) + static u64 mpam_msc_read_idr(struct mpam_msc *msc) { u64 idr_high = 0, idr_low; @@ -725,6 +741,209 @@ static int mpam_msc_hw_probe(struct mpam_msc *msc) return 0; } +struct mon_read +{ + struct mpam_msc_ris *ris; + struct mon_cfg *ctx; + enum mpam_device_features type; + u64 *val; + int err; +}; + +static void gen_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, + u32 *flt_val) +{ + struct mon_cfg *ctx = m->ctx; + + switch (m->type) { + case mpam_feat_msmon_csu: + *ctl_val = MSMON_CFG_MBWU_CTL_TYPE_CSU; + break; + case mpam_feat_msmon_mbwu: + *ctl_val = MSMON_CFG_MBWU_CTL_TYPE_MBWU; + break; + default: + return; + } + + /* + * For CSU counters it's implementation-defined what happens when not + * filtering by partid. + */ + *ctl_val |= MSMON_CFG_x_CTL_MATCH_PARTID; + + *flt_val = FIELD_PREP(MSMON_CFG_MBWU_FLT_PARTID, ctx->partid); + if (m->ctx->match_pmg) { + *ctl_val |= MSMON_CFG_x_CTL_MATCH_PMG; + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_PMG, ctx->pmg); + } + + if (mpam_has_feature(mpam_feat_msmon_mbwu_rwbw, &m->ris->props)) + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_RWBW, ctx->opts); +} + +static void read_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, + u32 *flt_val) +{ + struct mpam_msc *msc = m->ris->msc; + + switch (m->type) { + case mpam_feat_msmon_csu: + *ctl_val = mpam_read_monsel_reg(msc, CFG_CSU_CTL); + *flt_val = mpam_read_monsel_reg(msc, CFG_CSU_FLT); + break; + case mpam_feat_msmon_mbwu: + *ctl_val = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); + *flt_val = mpam_read_monsel_reg(msc, CFG_MBWU_FLT); + break; + default: + return; + } +} + +static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, + u32 flt_val) +{ + struct mpam_msc *msc = m->ris->msc; + + /* + * Write the ctl_val with the enable bit cleared, reset the counter, + * then enable the counter. 
+ */ + switch (m->type) { + case mpam_feat_msmon_csu: + mpam_write_monsel_reg(msc, CFG_CSU_FLT, flt_val); + mpam_write_monsel_reg(msc, CFG_CSU_CTL, ctl_val); + mpam_write_monsel_reg(msc, CSU, 0); + mpam_write_monsel_reg(msc, CFG_CSU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + break; + case mpam_feat_msmon_mbwu: + mpam_write_monsel_reg(msc, CFG_MBWU_FLT, flt_val); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val); + mpam_write_monsel_reg(msc, MBWU, 0); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + break; + default: + return; + } +} + +static void __ris_msmon_read(void *arg) +{ + u64 now; + bool nrdy = false; + unsigned long flags; + struct mon_read *m = arg; + struct mon_cfg *ctx = m->ctx; + struct mpam_msc_ris *ris = m->ris; + struct mpam_msc *msc = m->ris->msc; + u32 mon_sel, ctl_val, flt_val, cur_ctl, cur_flt; + + lockdep_assert_held(&msc->lock); + + spin_lock_irqsave(&msc->mon_sel_lock, flags); + mon_sel = FIELD_PREP(MSMON_CFG_MON_SEL_MON_SEL, ctx->mon) | + FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); + mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel); + + /* + * Read the existing configuration to avoid re-writing the same values. + * This saves waiting for 'nrdy' on subsequent reads. + */ + read_msmon_ctl_flt_vals(m, &cur_ctl, &cur_flt); + gen_msmon_ctl_flt_vals(m, &ctl_val, &flt_val); + if (cur_flt != flt_val || cur_ctl != (ctl_val | MSMON_CFG_x_CTL_EN)) + write_msmon_ctl_flt_vals(m, ctl_val, flt_val); + + switch (m->type) { + case mpam_feat_msmon_csu: + now = mpam_read_monsel_reg(msc, CSU); + break; + case mpam_feat_msmon_mbwu: + now = mpam_read_monsel_reg(msc, MBWU); + break; + default: + return; + } + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + + nrdy = now & MSMON___NRDY; + if (nrdy) { + m->err = -EBUSY; + return; + } + + now = FIELD_GET(MSMON___VALUE, now); + *(m->val) += now; +} + +static int _msmon_read(struct mpam_component *comp, struct mon_read *arg) +{ + int err, idx; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + arg->ris = ris; + + msc = ris->msc; + mutex_lock(&msc->lock); + err = smp_call_function_any(&msc->accessibility, + __ris_msmon_read, arg, true); + mutex_unlock(&msc->lock); + if (!err && arg->err) + err = arg->err; + if (err) + break; + } + srcu_read_unlock(&mpam_srcu, idx); + + return err; +} + +int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, + enum mpam_device_features type, u64 *val) +{ + int err; + struct mon_read arg; + u64 wait_jiffies = 0; + struct mpam_props *cprops = &comp->class->props; + + might_sleep(); + + if (!mpam_is_enabled()) + return -EIO; + + if (!mpam_has_feature(type, cprops)) + return -EOPNOTSUPP; + + memset(&arg, 0, sizeof(arg)); + arg.ctx = ctx; + arg.type = type; + arg.val = val; + *val = 0; + + err = _msmon_read(comp, &arg); + if (err == -EBUSY) + wait_jiffies = usecs_to_jiffies(comp->class->nrdy_usec); + + while (wait_jiffies) + wait_jiffies = schedule_timeout_uninterruptible(wait_jiffies); + + if (err == -EBUSY) { + memset(&arg, 0, sizeof(arg)); + arg.ctx = ctx; + arg.type = type; + arg.val = val; + *val = 0; + + err = _msmon_read(comp, &arg); + } + + return err; +} + static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) { u32 num_words, msb; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 19acb460aeba..fb3a4c00e5d3 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ 
-62,6 +62,7 @@ struct mpam_msc * If needed, take msc->lock first. */ spinlock_t part_sel_lock; + spinlock_t mon_sel_lock; void __iomem * mapped_hwpage; size_t mapped_hwpage_sz; }; @@ -194,6 +195,21 @@ struct mpam_msc_ris struct mpam_component *comp; }; +/* The values for MSMON_CFG_MBWU_FLT.RWBW */ +enum mon_filter_options { + COUNT_BOTH = 0, + COUNT_WRITE = 1, + COUNT_READ = 2, +}; + +struct mon_cfg { + u16 mon; + u8 pmg; + bool match_pmg; + u32 partid; + enum mon_filter_options opts; +}; + static inline int mpam_alloc_csu_mon(struct mpam_class *class) { struct mpam_props *cprops = &class->props; @@ -241,6 +257,9 @@ void mpam_disable(struct work_struct *work); int mpam_apply_config(struct mpam_component *comp, u16 partid, struct mpam_config *cfg); +int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, + enum mpam_device_features, u64 *val); + /* * MPAM MSCs have the following register layout. See: * Arm Architecture Reference Manual Supplement - Memory System Resource -- 2.25.1
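A hypothetical caller counting write bandwidth for one partid/pmg. All the types and flags below come from mpam_internal.h as added above; only example_read_write_bw() itself is invented:

static int example_read_write_bw(struct mpam_component *comp,
				 u32 partid, u8 pmg, u64 *val)
{
	struct mon_cfg cfg = {
		.mon = 0,		/* from mpam_alloc_mbwu_mon() */
		.partid = partid,
		.pmg = pmg,
		.match_pmg = true,
		.opts = COUNT_WRITE,
	};

	/*
	 * Sleeps for the firmware-provided nrdy timeout and retries once
	 * if an MSC reports 'not ready'; may still return -EBUSY.
	 */
	return mpam_msmon_read(comp, &cfg, mpam_feat_msmon_mbwu, val);
}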

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Bandwidth counters need to run continuously to correctly reflect the bandwidth. The value read may be lower than the previous value read in the case of overflow and when the hardware is reset due to CPU hotplug. Add struct mbwu_state to track the bandwidth counter to allow overflow and power management to be handled. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 148 +++++++++++++++++++++++++- drivers/platform/mpam/mpam_internal.h | 52 ++++++--- 2 files changed, 181 insertions(+), 19 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 3516528f2e14..930e14471378 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -773,6 +773,7 @@ static void gen_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, *ctl_val |= MSMON_CFG_x_CTL_MATCH_PARTID; *flt_val = FIELD_PREP(MSMON_CFG_MBWU_FLT_PARTID, ctx->partid); + *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_RWBW, ctx->opts); if (m->ctx->match_pmg) { *ctl_val |= MSMON_CFG_x_CTL_MATCH_PMG; *flt_val |= FIELD_PREP(MSMON_CFG_MBWU_FLT_PMG, ctx->pmg); @@ -805,6 +806,7 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, u32 flt_val) { struct mpam_msc *msc = m->ris->msc; + struct msmon_mbwu_state *mbwu_state; /* * Write the ctl_val with the enable bit cleared, reset the counter, @@ -822,21 +824,33 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val); mpam_write_monsel_reg(msc, MBWU, 0); mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); + + mbwu_state = &m->ris->mbwu_state[m->ctx->mon]; + if (mbwu_state) + mbwu_state->prev_val = 0; + break; default: return; } } +static u64 mpam_msmon_overflow_val(struct mpam_msc_ris *ris) +{ + /* TODO: scaling, and long counters */ + return GENMASK_ULL(30, 0); +} + static void __ris_msmon_read(void *arg) { - u64 now; bool nrdy = false; unsigned long flags; struct mon_read *m = arg; + u64 now, overflow_val = 0; struct mon_cfg *ctx = m->ctx; struct mpam_msc_ris *ris = m->ris; struct mpam_msc *msc = m->ris->msc; + struct msmon_mbwu_state *mbwu_state; u32 mon_sel, ctl_val, flt_val, cur_ctl, cur_flt; lockdep_assert_held(&msc->lock); @@ -858,22 +872,41 @@ static void __ris_msmon_read(void *arg) switch (m->type) { case mpam_feat_msmon_csu: now = mpam_read_monsel_reg(msc, CSU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); break; case mpam_feat_msmon_mbwu: now = mpam_read_monsel_reg(msc, MBWU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); + + if (nrdy) + break; + + mbwu_state = &ris->mbwu_state[ctx->mon]; + if (!mbwu_state) + break; + + /* Add any pre-overflow value to the mbwu_state->val */ + if (mbwu_state->prev_val > now) + overflow_val = mpam_msmon_overflow_val(ris) - mbwu_state->prev_val; + + mbwu_state->prev_val = now; + mbwu_state->correction += overflow_val; + + /* Include bandwidth consumed before the last hardware reset */ + now += mbwu_state->correction; break; default: return; } spin_unlock_irqrestore(&msc->mon_sel_lock, flags); - nrdy = now & MSMON___NRDY; if (nrdy) { m->err = -EBUSY; return; } - now = FIELD_GET(MSMON___VALUE, now); 
*(m->val) += now; } @@ -1064,6 +1097,68 @@ static int mpam_reprogram_ris(void *_arg) return 0; } +static int mpam_restore_mbwu_state(void *_ris) +{ + int i; + struct mon_read mwbu_arg; + struct mpam_msc_ris *ris = _ris; + + lockdep_assert_held(&ris->msc->lock); + + for (i = 0; i < ris->props.num_mbwu_mon; i++) { + if (ris->mbwu_state[i].enabled) { + mwbu_arg.ris = ris; + mwbu_arg.ctx = &ris->mbwu_state[i].cfg; + mwbu_arg.type = mpam_feat_msmon_mbwu; + + __ris_msmon_read(&mwbu_arg); + } + } + + return 0; +} + +static int mpam_save_mbwu_state(void *arg) +{ + int i; + u64 val; + struct mon_cfg *cfg; + unsigned long flags; + u32 cur_flt, cur_ctl, mon_sel; + struct mpam_msc_ris *ris = arg; + struct mpam_msc *msc = ris->msc; + struct msmon_mbwu_state *mbwu_state; + + lockdep_assert_held(&msc->lock); + + for (i = 0; i < ris->props.num_mbwu_mon; i++) { + mbwu_state = &ris->mbwu_state[i]; + cfg = &mbwu_state->cfg; + + spin_lock_irqsave(&msc->mon_sel_lock, flags); + mon_sel = FIELD_PREP(MSMON_CFG_MON_SEL_MON_SEL, i) | + FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); + mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel); + + cur_flt = mpam_read_monsel_reg(msc, CFG_MBWU_FLT); + cur_ctl = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, 0); + + val = mpam_read_monsel_reg(msc, MBWU); + mpam_write_monsel_reg(msc, MBWU, 0); + + cfg->mon = i; + cfg->pmg = FIELD_GET(MSMON_CFG_MBWU_FLT_PMG, cur_flt); + cfg->match_pmg = FIELD_GET(MSMON_CFG_x_CTL_MATCH_PMG, cur_ctl); + cfg->partid = FIELD_GET(MSMON_CFG_MBWU_FLT_PARTID, cur_flt); + mbwu_state->correction += val; + mbwu_state->enabled = FIELD_GET(MSMON_CFG_x_CTL_EN, cur_ctl); + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + + return 0; +} + /* * Called via smp_call_on_cpu() to prevent migration, while still being * pre-emptible. @@ -1125,6 +1220,9 @@ static void mpam_reset_msc(struct mpam_msc *msc, bool online) * for non-zero partid may be lost while the CPUs are offline. 
*/ ris->in_reset_state = online; + + if (mpam_is_enabled() && !online) + mpam_touch_msc(msc, &mpam_save_mbwu_state, ris); } srcu_read_unlock(&mpam_srcu, idx); } @@ -1156,6 +1254,9 @@ static void mpam_reprogram_msc(struct mpam_msc *msc) mpam_reprogram_ris_partid(ris, partid, cfg); } ris->in_reset_state = reset; + + if (mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + mpam_touch_msc(msc, &mpam_restore_mbwu_state, ris); } srcu_read_unlock(&mpam_srcu, idx); } @@ -1814,8 +1915,31 @@ static void mpam_unregister_irqs(void) cpus_read_unlock(); } +static void __destroy_component_cfg(struct mpam_component *comp) +{ + unsigned long flags; + struct mpam_msc_ris *ris; + struct msmon_mbwu_state *mbwu_state; + + kfree(comp->cfg); + list_for_each_entry(ris, &comp->ris, comp_list) { + mutex_lock(&ris->msc->lock); + spin_lock_irqsave(&ris->msc->mon_sel_lock, flags); + mbwu_state = ris->mbwu_state; + ris->mbwu_state = NULL; + spin_unlock_irqrestore(&ris->msc->mon_sel_lock, flags); + mutex_unlock(&ris->msc->lock); + + kfree(mbwu_state); + } +} + static int __allocate_component_cfg(struct mpam_component *comp) { + unsigned long flags; + struct mpam_msc_ris *ris; + struct msmon_mbwu_state *mbwu_state; + if (comp->cfg) return 0; @@ -1823,6 +1947,24 @@ static int __allocate_component_cfg(struct mpam_component *comp) if (!comp->cfg) return -ENOMEM; + list_for_each_entry(ris, &comp->ris, comp_list) { + if (!ris->props.num_mbwu_mon) + continue; + + mbwu_state = kcalloc(ris->props.num_mbwu_mon, + sizeof(*ris->mbwu_state), GFP_KERNEL); + if (!mbwu_state) { + __destroy_component_cfg(comp); + return -ENOMEM; + } + + mutex_lock(&ris->msc->lock); + spin_lock_irqsave(&ris->msc->mon_sel_lock, flags); + ris->mbwu_state = mbwu_state; + spin_unlock_irqrestore(&ris->msc->mon_sel_lock, flags); + mutex_unlock(&ris->msc->lock); + } + return 0; } diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index fb3a4c00e5d3..320036851777 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -175,8 +175,40 @@ struct mpam_component struct mpam_class *class; }; -struct mpam_msc_ris -{ +/* The values for MSMON_CFG_MBWU_FLT.RWBW */ +enum mon_filter_options { + COUNT_BOTH = 0, + COUNT_WRITE = 1, + COUNT_READ = 2, +}; + +struct mon_cfg { + u16 mon; + u8 pmg; + bool match_pmg; + u32 partid; + enum mon_filter_options opts; +}; + +/* + * Changes to enabled and cfg are protected by the msc->lock. + * Changes to prev_val and correction are protected by the msc's mon_sel_lock. + */ +struct msmon_mbwu_state { + bool enabled; + struct mon_cfg cfg; + + /* The value last read from the hardware. Used to detect overflow. */ + u64 prev_val; + + /* + * The value to add to the new reading to account for power management, + * and shifts to trigger the overflow interrupt. + */ + u64 correction; +}; + +struct mpam_msc_ris { u8 ris_idx; u64 idr; struct mpam_props props; @@ -193,21 +225,9 @@ struct mpam_msc_ris /* parents: */ struct mpam_msc *msc; struct mpam_component *comp; -}; -/* The values for MSMON_CFG_MBWU_FLT.RWBW */ -enum mon_filter_options { - COUNT_BOTH = 0, - COUNT_WRITE = 1, - COUNT_READ = 2, -}; - -struct mon_cfg { - u16 mon; - u8 pmg; - bool match_pmg; - u32 partid; - enum mon_filter_options opts; + /* msmon mbwu configuration is preserved over reset */ + struct msmon_mbwu_state *mbwu_state; }; static inline int mpam_alloc_csu_mon(struct mpam_class *class) -- 2.25.1

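The prev_val/correction bookkeeping above is easier to see in isolation. Below is a minimal sketch, not part of the patch: plain C, assuming a 31 bit wrapping counter (GENMASK_ULL(30, 0)) and hypothetical struct/function names.

#include <stdint.h>

#define MBWU_COUNTER_MAX ((1ULL << 31) - 1)	/* GENMASK_ULL(30, 0) */

struct mbwu_sketch {
	uint64_t prev_val;	/* last raw value read from the hardware */
	uint64_t correction;	/* counts accumulated before wraps/resets */
};

/* Fold a new raw reading into a monotonic bandwidth count */
static uint64_t mbwu_account(struct mbwu_sketch *s, uint64_t now)
{
	/* A smaller reading than last time means the counter wrapped */
	if (s->prev_val > now)
		s->correction += MBWU_COUNTER_MAX - s->prev_val;

	s->prev_val = now;

	/* Include bandwidth consumed before the wrap or hardware reset */
	return now + s->correction;
}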
From: Rohit Mathew <rohit.mathew@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- MPAM v0.1 and versions above v1.0 support an optional long counter for memory bandwidth monitoring. The MPAMF_MBWUMON_IDR register has fields indicating support for long counters. As of now, a 44 bit counter indicated by the HAS_LONG field (bit 30) and a 63 bit counter indicated by LWD (bit 29) can optionally be implemented. Probe for these counters and set the corresponding feature bits if either counter is present. Signed-off-by: Rohit Mathew <rohit.mathew@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 22 ++++++++++++++++++++++ drivers/platform/mpam/mpam_internal.h | 7 +++++++ 2 files changed, 29 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 930e14471378..c2c5dfa77083 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -653,6 +653,7 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) pr_err_once("Counters are not usable because not-ready timeout was not provided by firmware."); } if (FIELD_GET(MPAMF_MSMON_IDR_MSMON_MBWU, msmon_features)) { + bool has_long; u32 mbwumonidr = mpam_read_partsel_reg(msc, MBWUMON_IDR); props->num_mbwu_mon = FIELD_GET(MPAMF_MBWUMON_IDR_NUM_MON, mbwumonidr); @@ -661,6 +662,27 @@ if (FIELD_GET(MPAMF_MBWUMON_IDR_HAS_RWBW, mbwumonidr)) mpam_set_feature(mpam_feat_msmon_mbwu_rwbw, props); + + /* + * Treat long counter and its extension, lwd as mutually + * exclusive feature bits. Though these are dependent + * fields at the implementation level, there would never + * be a need for mpam_feat_msmon_mbwu_44counter (long + * counter) and mpam_feat_msmon_mbwu_63counter (lwd) + * bits to be set together. + * + * mpam_feat_msmon_mbwu isn't treated as an exclusive + * bit as this feature bit would be used as the "front + * facing feature bit" for any checks related to mbwu + * monitors. + */ + has_long = FIELD_GET(MPAMF_MBWUMON_IDR_HAS_LONG, mbwumonidr); + if (props->num_mbwu_mon && has_long) { + if (FIELD_GET(MPAMF_MBWUMON_IDR_LWD, mbwumonidr)) + mpam_set_feature(mpam_feat_msmon_mbwu_63counter, props); + else + mpam_set_feature(mpam_feat_msmon_mbwu_44counter, props); + } } } diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 320036851777..7ce7587202bf 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -88,7 +88,14 @@ enum mpam_device_features { mpam_feat_msmon, mpam_feat_msmon_csu, mpam_feat_msmon_csu_capture, + /* + * Having mpam_feat_msmon_mbwu set doesn't mean the regular 31 bit MBWU + * counter would be used. The exact counter used is decided based on the + * status of mpam_feat_msmon_mbwu_l/mpam_feat_msmon_mbwu_lwd as well. + */ mpam_feat_msmon_mbwu, + mpam_feat_msmon_mbwu_44counter, + mpam_feat_msmon_mbwu_63counter, mpam_feat_msmon_mbwu_capture, mpam_feat_msmon_mbwu_rwbw, mpam_feat_msmon_capt, -- 2.25.1

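A consumer of these bits would then select the counter width from the mutually exclusive feature bits. A sketch, assuming the driver's mpam_has_feature() helper and a hypothetical mbwu_counter_bits():

static unsigned int mbwu_counter_bits(struct mpam_props *props)
{
	if (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, props))
		return 63;	/* LWD */
	if (mpam_has_feature(mpam_feat_msmon_mbwu_44counter, props))
		return 44;	/* HAS_LONG without LWD */

	return 31;		/* the regular MBWU counter */
}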
From: Rohit Mathew <rohit.mathew@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- If the 44 bit (long) or 63 bit (LWD) counters are detected on probing the RIS, use the long/LWD counter instead of the regular 31 bit MBWU counter. Only 32 bit accesses to the MSC are required to be supported by the spec, but these registers are 64 bits. The lower half may overflow into the higher half between two 32 bit reads. To avoid this, use a helper that reads the top half twice to check for overflow. Signed-off-by: Rohit Mathew <rohit.mathew@arm.com> [morse: merged multiple patches from Rohit] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 86 ++++++++++++++++++++++++--- drivers/platform/mpam/mpam_internal.h | 7 ++- 2 files changed, 84 insertions(+), 9 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index c2c5dfa77083..7f1f848281be 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -772,6 +772,48 @@ struct mon_read int err; }; +static bool mpam_ris_has_mbwu_long_counter(struct mpam_msc_ris *ris) +{ + return (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props) || + mpam_has_feature(mpam_feat_msmon_mbwu_44counter, &ris->props)); +} + +static u64 mpam_msc_read_mbwu_l(struct mpam_msc *msc) +{ + int retry = 3; + u32 mbwu_l_low; + u64 mbwu_l_high1, mbwu_l_high2; + + lockdep_assert_held_once(&msc->mon_sel_lock); + + WARN_ON_ONCE((MSMON_MBWU_L + sizeof(u64)) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + mbwu_l_high2 = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L + 4); + do { + mbwu_l_high1 = mbwu_l_high2; + mbwu_l_low = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L); + mbwu_l_high2 = readl_relaxed(msc->mapped_hwpage + MSMON_MBWU_L + 4); + + retry--; + } while (mbwu_l_high1 != mbwu_l_high2 && retry > 0); + + if (mbwu_l_high2 == mbwu_l_high1) + return (mbwu_l_high1 << 32) | mbwu_l_low; + return MSMON___NRDY_L; +} + +static void mpam_msc_zero_mbwu_l(struct mpam_msc *msc) +{ + lockdep_assert_held_once(&msc->mon_sel_lock); + + WARN_ON_ONCE((MSMON_MBWU_L + sizeof(u64)) > msc->mapped_hwpage_sz); + WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(), &msc->accessibility)); + + writel_relaxed(0, msc->mapped_hwpage + MSMON_MBWU_L); + writel_relaxed(0, msc->mapped_hwpage + MSMON_MBWU_L + 4); +} + static void gen_msmon_ctl_flt_vals(struct mon_read *m, u32 *ctl_val, u32 *flt_val) { @@ -844,7 +886,12 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, case mpam_feat_msmon_mbwu: mpam_write_monsel_reg(msc, CFG_MBWU_FLT, flt_val); mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val); - mpam_write_monsel_reg(msc, MBWU, 0); + + if (mpam_ris_has_mbwu_long_counter(m->ris)) + mpam_msc_zero_mbwu_l(m->ris->msc); + else + mpam_write_monsel_reg(msc, MBWU, 0); + mpam_write_monsel_reg(msc, CFG_MBWU_CTL, ctl_val|MSMON_CFG_x_CTL_EN); mbwu_state = &m->ris->mbwu_state[m->ctx->mon]; @@ -859,8 +906,13 @@ static void write_msmon_ctl_flt_vals(struct mon_read *m, u32 ctl_val, static u64 mpam_msmon_overflow_val(struct mpam_msc_ris *ris) { - /* TODO: scaling, and long counters */ - return GENMASK_ULL(30, 0); + /* TODO: implement scaling counters */ + if (mpam_has_feature(mpam_feat_msmon_mbwu_63counter,
&ris->props)) + return GENMASK_ULL(62, 0); + else if (mpam_has_feature(mpam_feat_msmon_mbwu_44counter, &ris->props)) + return GENMASK_ULL(43, 0); + else + return GENMASK_ULL(30, 0); } static void __ris_msmon_read(void *arg) @@ -898,9 +950,22 @@ static void __ris_msmon_read(void *arg) now = FIELD_GET(MSMON___VALUE, now); break; case mpam_feat_msmon_mbwu: - now = mpam_read_monsel_reg(msc, MBWU); - nrdy = now & MSMON___NRDY; - now = FIELD_GET(MSMON___VALUE, now); + /* + * If long or lwd counters are supported, use them, else revert + * to the 32 bit counter. + */ + if (mpam_ris_has_mbwu_long_counter(ris)) { + now = mpam_msc_read_mbwu_l(msc); + nrdy = now & MSMON___NRDY_L; + if (mpam_has_feature(mpam_feat_msmon_mbwu_63counter, &ris->props)) + now = FIELD_GET(MSMON___LWD_VALUE, now); + else + now = FIELD_GET(MSMON___L_VALUE, now); + } else { + now = mpam_read_monsel_reg(msc, MBWU); + nrdy = now & MSMON___NRDY; + now = FIELD_GET(MSMON___VALUE, now); + } if (nrdy) break; @@ -1166,8 +1231,13 @@ static int mpam_save_mbwu_state(void *arg) cur_ctl = mpam_read_monsel_reg(msc, CFG_MBWU_CTL); mpam_write_monsel_reg(msc, CFG_MBWU_CTL, 0); - val = mpam_read_monsel_reg(msc, MBWU); - mpam_write_monsel_reg(msc, MBWU, 0); + if (mpam_ris_has_mbwu_long_counter(ris)) { + val = mpam_msc_read_mbwu_l(msc); + mpam_msc_zero_mbwu_l(msc); + } else { + val = mpam_read_monsel_reg(msc, MBWU); + mpam_write_monsel_reg(msc, MBWU, 0); + } cfg->mon = i; cfg->pmg = FIELD_GET(MSMON_CFG_MBWU_FLT_PMG, cur_flt); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 7ce7587202bf..0f49932463dd 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -330,6 +330,8 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, #define MSMON_CSU_CAPTURE 0x0848 /* last cache-usage value captured */ #define MSMON_MBWU 0x0860 /* current mem-bw usage value */ #define MSMON_MBWU_CAPTURE 0x0868 /* last mem-bw value captured */ +#define MSMON_MBWU_L 0x0880 /* current long mem-bw usage value */ +#define MSMON_MBWU_CAPTURE_L 0x0890 /* last long mem-bw value captured */ #define MSMON_CAPT_EVNT 0x0808 /* signal a capture event */ #define MPAMF_ESR 0x00F8 /* error status register */ #define MPAMF_ECR 0x00F0 /* error control register */ @@ -533,7 +535,10 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, */ #define MSMON___VALUE GENMASK(30, 0) #define MSMON___NRDY BIT(31) -#define MSMON_MBWU_L_VALUE GENMASK(62, 0) +#define MSMON___NRDY_L BIT(63) +#define MSMON___L_VALUE GENMASK(43, 0) +#define MSMON___LWD_VALUE GENMASK(62, 0) + /* * MSMON_CAPT_EVNT - Memory system performance monitoring capture event * generation register -- 2.25.1

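The high/low/high read sequence generalises beyond MPAM. A standalone sketch of the technique, in plain C with hypothetical pointers standing in for the MMIO accessors:

#include <stdint.h>

static uint64_t read64_by_halves(volatile uint32_t *lo, volatile uint32_t *hi)
{
	int retry = 3;
	uint32_t low = 0;
	uint64_t high1, high2;

	high2 = *hi;
	do {
		high1 = high2;
		low = *lo;
		high2 = *hi;	/* re-read: did a carry move the high half? */
		retry--;
	} while (high1 != high2 && retry > 0);

	if (high1 == high2)
		return (high1 << 32) | low;

	return UINT64_MAX;	/* caller treats this like not-ready */
}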
From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl expects to reset the bandwidth counters when the filesystem is mounted. To allow this, add a helper that clears the saved mbwu state. Instead of cross-calling to each CPU that can access the component MSC to write to the counter, set a flag that causes it to be zeroed on the next read. This is easily done by forcing a configuration update. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 44 +++++++++++++++++++++++---- drivers/platform/mpam/mpam_internal.h | 5 ++- 2 files changed, 42 insertions(+), 7 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 7f1f848281be..f9f22d1d698c 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -919,9 +919,11 @@ static void __ris_msmon_read(void *arg) { bool nrdy = false; unsigned long flags; + bool config_mismatch; struct mon_read *m = arg; u64 now, overflow_val = 0; struct mon_cfg *ctx = m->ctx; + bool reset_on_next_read = false; struct mpam_msc_ris *ris = m->ris; struct mpam_msc *msc = m->ris->msc; struct msmon_mbwu_state *mbwu_state; @@ -934,13 +936,24 @@ FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx); mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel); + if (m->type == mpam_feat_msmon_mbwu) { + mbwu_state = &ris->mbwu_state[ctx->mon]; + if (mbwu_state) { + reset_on_next_read = mbwu_state->reset_on_next_read; + mbwu_state->reset_on_next_read = false; + } + } + /* * Read the existing configuration to avoid re-writing the same values. * This saves waiting for 'nrdy' on subsequent reads.
*/ read_msmon_ctl_flt_vals(m, &cur_ctl, &cur_flt); gen_msmon_ctl_flt_vals(m, &ctl_val, &flt_val); - if (cur_flt != flt_val || cur_ctl != (ctl_val | MSMON_CFG_x_CTL_EN)) + config_mismatch = cur_flt != flt_val || + cur_ctl != (ctl_val | MSMON_CFG_x_CTL_EN); + + if (config_mismatch || reset_on_next_read) write_msmon_ctl_flt_vals(m, ctl_val, flt_val); switch (m->type) { @@ -970,7 +983,6 @@ static void __ris_msmon_read(void *arg) if (nrdy) break; - mbwu_state = &ris->mbwu_state[ctx->mon]; if (!mbwu_state) break; @@ -1064,6 +1076,30 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, return err; } +void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx) +{ + int idx; + unsigned long flags; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + if (!mpam_is_enabled()) + return; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + if (!mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + continue; + + msc = ris->msc; + spin_lock_irqsave(&msc->mon_sel_lock, flags); + ris->mbwu_state[ctx->mon].correction = 0; + ris->mbwu_state[ctx->mon].reset_on_next_read = true; + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + srcu_read_unlock(&mpam_srcu, idx); +} + static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) { u32 num_words, msb; @@ -1190,8 +1226,6 @@ static int mpam_restore_mbwu_state(void *_ris) struct mon_read mwbu_arg; struct mpam_msc_ris *ris = _ris; - lockdep_assert_held(&ris->msc->lock); - for (i = 0; i < ris->props.num_mbwu_mon; i++) { if (ris->mbwu_state[i].enabled) { mwbu_arg.ris = ris; @@ -1216,8 +1250,6 @@ static int mpam_save_mbwu_state(void *arg) struct mpam_msc *msc = ris->msc; struct msmon_mbwu_state *mbwu_state; - lockdep_assert_held(&msc->lock); - for (i = 0; i < ris->props.num_mbwu_mon; i++) { mbwu_state = &ris->mbwu_state[i]; cfg = &mbwu_state->cfg; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 0f49932463dd..80f4b434b3ad 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -199,10 +199,12 @@ struct mon_cfg { /* * Changes to enabled and cfg are protected by the msc->lock. - * Changes to prev_val and correction are protected by the msc's mon_sel_lock. + * Changes to reset_on_next_read, prev_val and correction are protected by the + * msc's mon_sel_lock. */ struct msmon_mbwu_state { bool enabled; + bool reset_on_next_read; struct mon_cfg cfg; /* The value last read from the hardware. Used to detect overflow. */ @@ -286,6 +288,7 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid, int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, enum mpam_device_features, u64 *val); +void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx); /* * MPAM MSCs have the following register layout. See: -- 2.25.1

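The deferred-zero handshake can be modelled without any MMIO. A simplified sketch with hypothetical names; both functions assume the caller holds the lock that stands in for mon_sel_lock:

#include <stdbool.h>
#include <stdint.h>

struct mon_model {
	bool reset_on_next_read;
	uint64_t correction;
};

/* Any CPU: request the zeroing, but do no hardware access here */
static void model_request_reset(struct mon_model *s)
{
	s->correction = 0;
	s->reset_on_next_read = true;
}

/* A CPU that can reach the MSC: honour the request on the next read */
static uint64_t model_read(struct mon_model *s, uint64_t hw_value)
{
	if (s->reset_on_next_read) {
		s->reset_on_next_read = false;
		hw_value = 0;	/* stands in for rewriting CTL/FLT, zeroing the counter */
	}

	return hw_value + s->correction;
}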
From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl has its own data structures to describe its resources. We can't use these directly as we play tricks with the 'MBA' resource, picking the MPAM controls or monitors that best apply. We may export the same component as both L3 and MBA. Add mpam_resctrl_exports[] as the array of class->resctrl mappings we are exporting, and add the cpuhp hooks that allocate and free the resctrl domain structures. While we're here, plumb in a few other obvious things. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/Makefile | 2 +- drivers/platform/mpam/mpam_devices.c | 12 ++ drivers/platform/mpam/mpam_internal.h | 15 ++ drivers/platform/mpam/mpam_resctrl.c | 237 ++++++++++++++++++++++++++ include/linux/arm_mpam.h | 4 + 5 files changed, 269 insertions(+), 1 deletion(-) create mode 100644 drivers/platform/mpam/mpam_resctrl.c diff --git a/drivers/platform/mpam/Makefile b/drivers/platform/mpam/Makefile index 8ad69bfa2aa2..37693be531c3 100644 --- a/drivers/platform/mpam/Makefile +++ b/drivers/platform/mpam/Makefile @@ -1 +1 @@ -obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam_devices.o +obj-$(CONFIG_ARM_CPU_RESCTRL) += mpam_devices.o mpam_resctrl.o diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index f9f22d1d698c..28f6010df6dc 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1411,6 +1411,9 @@ static int mpam_cpu_online(unsigned int cpu) } srcu_read_unlock(&mpam_srcu, idx); + if (mpam_is_enabled()) + mpam_resctrl_online_cpu(cpu); + return 0; } @@ -1470,6 +1473,9 @@ static int mpam_cpu_offline(unsigned int cpu) } srcu_read_unlock(&mpam_srcu, idx); + if (mpam_is_enabled()) + mpam_resctrl_offline_cpu(cpu); + return 0; } @@ -2140,6 +2146,12 @@ static void mpam_enable_once(void) mutex_unlock(&mpam_list_lock); cpus_read_unlock(); + if (!err) { + err = mpam_resctrl_setup(); + if (err) + pr_err("Failed to initialise resctrl: %d\n", err); + } + if (err) { schedule_work(&mpam_broken_work); return; } diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 80f4b434b3ad..579a8485d320 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -239,6 +239,16 @@ struct mpam_msc_ris { struct msmon_mbwu_state *mbwu_state; }; +struct mpam_resctrl_dom { + struct mpam_component *comp; + struct rdt_domain resctrl_dom; +}; + +struct mpam_resctrl_res { + struct mpam_class *class; + struct rdt_resource resctrl_res; +}; + static inline int mpam_alloc_csu_mon(struct mpam_class *class) { struct mpam_props *cprops = &class->props; @@ -290,6 +300,11 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, enum mpam_device_features, u64 *val); void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx); +int mpam_resctrl_online_cpu(unsigned int cpu); +int mpam_resctrl_offline_cpu(unsigned int cpu); + +int mpam_resctrl_setup(void); + /* * MPAM MSCs have the following register layout.
See: * Arm Architecture Reference Manual Supplement - Memory System Resource diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c new file mode 100644 index 000000000000..b9c1292ff630 --- /dev/null +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -0,0 +1,237 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2021 Arm Ltd. + +#define pr_fmt(fmt) "mpam: resctrl: " fmt + +#include <linux/arm_mpam.h> +#include <linux/cacheinfo.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/errno.h> +#include <linux/list.h> +#include <linux/printk.h> +#include <linux/rculist.h> +#include <linux/resctrl.h> +#include <linux/slab.h> +#include <linux/types.h> + +#include <asm/mpam.h> + +#include "mpam_internal.h" + +/* + * The classes we've picked to map to resctrl resources. + * Class pointer may be NULL. + */ +static struct mpam_resctrl_res mpam_resctrl_exports[RDT_NUM_RESOURCES]; + +static bool exposed_alloc_capable; +static bool exposed_mon_capable; + +bool resctrl_arch_alloc_capable(void) +{ + return exposed_alloc_capable; +} + +bool resctrl_arch_mon_capable(void) +{ + return exposed_mon_capable; +} + +/* + * MSC may raise an error interrupt if it sees an out or range partid/pmg, + * and go on to truncate the value. Regardless of what the hardware supports, + * only the system wide safe value is safe to use. + */ +u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) +{ + return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); +} + +struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) +{ + if (l >= RDT_NUM_RESOURCES) + return NULL; + + return &mpam_resctrl_exports[l].resctrl_res; +} + +static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) +{ + /* TODO: initialise the resctrl resources */ + + return 0; +} + +/* Called with the mpam classes lock held */ +int mpam_resctrl_setup(void) +{ + int err = 0; + struct mpam_resctrl_res *res; + enum resctrl_res_level i; + + cpus_read_lock(); + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + INIT_LIST_HEAD(&res->resctrl_res.domains); + INIT_LIST_HEAD(&res->resctrl_res.evt_list); + res->resctrl_res.rid = i; + } + + /* TODO: pick MPAM classes to map to resctrl resources */ + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + if (!res->class) + continue; // dummy resource + + err = mpam_resctrl_resource_init(res); + if (err) + break; + } + cpus_read_unlock(); + + if (!err && !exposed_alloc_capable && !exposed_mon_capable) + err = -EOPNOTSUPP; + + if (!err) { + if (!is_power_of_2(mpam_pmg_max + 1)) { + /* + * If not all the partid*pmg values are valid indexes, + * resctrl may allocate pmg that don't exist. This + * should cause an error interrupt. + */ + pr_warn("Number of PMG is not a power of 2! resctrl may misbehave"); + } + err = resctrl_init(); + } + + return err; +} + +static struct mpam_resctrl_dom * +mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) +{ + struct mpam_resctrl_dom *dom; + struct mpam_class *class = res->class; + struct mpam_component *comp_iter, *comp; + + comp = NULL; + list_for_each_entry(comp_iter, &class->components, class_list) { + if (cpumask_test_cpu(cpu, &comp_iter->affinity)) { + comp = comp_iter; + break; + } + } + + /* cpu with unknown exported component? 
*/ + if (WARN_ON_ONCE(!comp)) + return ERR_PTR(-EINVAL); + + dom = kzalloc_node(sizeof(*dom), GFP_KERNEL, cpu_to_node(cpu)); + if (!dom) + return ERR_PTR(-ENOMEM); + + dom->comp = comp; + INIT_LIST_HEAD(&dom->resctrl_dom.list); + dom->resctrl_dom.id = comp->comp_id; + cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask); + + /* TODO: this list should be sorted */ + list_add_tail(&dom->resctrl_dom.list, &res->resctrl_res.domains); + + return dom; +} + +/* Like resctrl_get_domain_from_cpu(), but for offline CPUs */ +static struct mpam_resctrl_dom * +mpam_get_domain_from_cpu(int cpu, struct mpam_resctrl_res *res) +{ + struct rdt_domain *d; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &res->resctrl_res.domains, list) { + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + if (cpumask_test_cpu(cpu, &dom->comp->affinity)) + return dom; + } + + return NULL; +} + +struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id) +{ + struct rdt_domain *d; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &r->domains, list) { + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + if (dom->comp->comp_id == id) + return &dom->resctrl_dom; + } + + return NULL; +} + +int mpam_resctrl_online_cpu(unsigned int cpu) +{ + int i; + struct mpam_resctrl_dom *dom; + struct mpam_resctrl_res *res; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy_resource; + + dom = mpam_get_domain_from_cpu(cpu, res); + if (dom) { + cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask); + continue; + } + + dom = mpam_resctrl_alloc_domain(cpu, res); + if (IS_ERR(dom)) + return PTR_ERR(dom); + } + + return 0; +} + +int mpam_resctrl_offline_cpu(unsigned int cpu) +{ + int i; + struct rdt_domain *d; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy resource + + d = resctrl_get_domain_from_cpu(cpu, &res->resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + /* The last one standing was ahead of us... */ + if (WARN_ON_ONCE(!d)) + continue; + + cpumask_clear_cpu(cpu, &d->cpu_mask); + + if (!cpumask_empty(&d->cpu_mask)) + continue; + + list_del(&d->list); + kfree(dom); + } + + return 0; +} diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 40e09b4d236b..27c3ad9912ef 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -39,4 +39,8 @@ int mpam_register_requestor(u16 partid_max, u8 pmg_max); int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, enum mpam_class_types type, u8 class_id, int component_id); + +bool resctrl_arch_alloc_capable(void); +bool resctrl_arch_mon_capable(void); + #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1

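The resctrl structures are embedded in the mpam ones precisely so that a pointer handed out to resctrl can be mapped back with container_of(). A minimal standalone illustration of the pattern, with simplified stand-in types:

#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct rdt_domain_eg { int id; };		/* what resctrl sees */

struct mpam_dom_eg {
	void *comp;				/* driver-private state */
	struct rdt_domain_eg resctrl_dom;	/* embedded, handed to resctrl */
};

static struct mpam_dom_eg *dom_of(struct rdt_domain_eg *d)
{
	return container_of(d, struct mpam_dom_eg, resctrl_dom);
}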
From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Systems with MPAM support may have a variety of control types at any point of their system layout. We can only expose certain types of control, and only if they exist at particular locations. Start with the well-known caches. These have to be depth 2 or 3 and support MPAM's cache portion bitmap controls, with a number of portions fewer than resctrl's limit. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 135 ++++++++++++++++++++++++++- include/linux/arm_mpam.h | 7 ++ 2 files changed, 138 insertions(+), 4 deletions(-) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index b9c1292ff630..3e72cf514c4a 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -27,6 +27,7 @@ static struct mpam_resctrl_res mpam_resctrl_exports[RDT_NUM_RESOURCES]; static bool exposed_alloc_capable; static bool exposed_mon_capable; +static struct mpam_class *mbm_local_class; bool resctrl_arch_alloc_capable(void) { @@ -38,6 +39,11 @@ bool resctrl_arch_mon_capable(void) return exposed_mon_capable; } +bool resctrl_arch_is_mbm_local_enabled(void) +{ + return mbm_local_class; +} + /* * MSC may raise an error interrupt if it sees an out or range partid/pmg, * and go on to truncate the value. Regardless of what the hardware supports, @@ -56,14 +62,134 @@ struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) return &mpam_resctrl_exports[l].resctrl_res; } +static bool cache_has_usable_cpor(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_cpor_part, cprops)) + return false; + + /* TODO: Scaling is not yet supported */ + return (class->props.cpbm_wd <= RESCTRL_MAX_CBM); +} + +static bool cache_has_usable_csu(struct mpam_class *class) +{ + struct mpam_props *cprops; + + if (!class) + return false; + + cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_csu, cprops)) + return false; + + /* + * CSU counters settle on the value, so we can get away with + * having only one. + */ + if (!cprops->num_csu_mon) + return false; + + return (mpam_partid_max > 1) || (mpam_pmg_max != 0); +} + +bool resctrl_arch_is_llc_occupancy_enabled(void) +{ + return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); +} + +/* Test whether we can export MPAM_CLASS_CACHE:{2,3}?
*/ +static void mpam_resctrl_pick_caches(void) +{ + int idx; + struct mpam_class *class; + struct mpam_resctrl_res *res; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + bool has_cpor = cache_has_usable_cpor(class); + + if (class->type != MPAM_CLASS_CACHE) { + pr_debug("pick_caches: Class is not a cache\n"); + continue; + } + + if (class->level != 2 && class->level != 3) { + pr_debug("pick_caches: not L2 or L3\n"); + continue; + } + + if (class->level == 2 && !has_cpor) { + pr_debug("pick_caches: L2 missing CPOR\n"); + continue; + } + else if (!has_cpor && !cache_has_usable_csu(class)) { + pr_debug("pick_caches: Cache misses CPOR and CSU\n"); + continue; + } + + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) { + pr_debug("pick_caches: Class has missing CPUs\n"); + continue; + } + + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2]; + res->resctrl_res.name = "L2"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3]; + res->resctrl_res.name = "L3"; + } + res->class = class; + } + srcu_read_unlock(&mpam_srcu, idx); +} + static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { - /* TODO: initialise the resctrl resources */ + struct mpam_class *class = res->class; + struct rdt_resource *r = &res->resctrl_res; + + /* Is this one of the two well-known caches? */ + if (res->resctrl_res.rid == RDT_RESOURCE_L2 || + res->resctrl_res.rid == RDT_RESOURCE_L3) { + /* TODO: Scaling is not yet supported */ + r->cache.cbm_len = class->props.cpbm_wd; + r->cache.arch_has_sparse_bitmasks = true; + + /* mpam_devices will reject empty bitmaps */ + r->cache.min_cbm_bits = 1; + + /* TODO: kill these properties off as they are derivatives */ + r->format_str = "%d=%0*x"; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = BIT_MASK(class->props.cpbm_wd) - 1; + r->data_width = (class->props.cpbm_wd + 3) / 4; + + /* + * Which bits are shared with other ...things... + * Unknown devices use partid-0 which uses all the bitmap + * fields. Until we configured the SMMU and GIC not to do this + * 'all the bits' is the correct answer here. + */ + r->cache.shareable_bits = r->default_ctrl; + + if (mpam_has_feature(mpam_feat_cpor_part, &class->props)) { + r->alloc_capable = true; + exposed_alloc_capable = true; + } + + if (class->level == 3 && cache_has_usable_csu(class)) { + r->mon_capable = true; + exposed_mon_capable = true; + } + } return 0; } -/* Called with the mpam classes lock held */ int mpam_resctrl_setup(void) { int err = 0; @@ -78,7 +204,7 @@ int mpam_resctrl_setup(void) res->resctrl_res.rid = i; } - /* TODO: pick MPAM classes to map to resctrl resources */ + mpam_resctrl_pick_caches(); for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; @@ -103,7 +229,8 @@ int mpam_resctrl_setup(void) */ pr_warn("Number of PMG is not a power of 2! resctrl may misbehave"); } - err = resctrl_init(); + + /* TODO: call resctrl_init() */ } return err; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 27c3ad9912ef..576bb97fa552 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -42,5 +42,12 @@ int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); +bool resctrl_arch_is_llc_occupancy_enabled(void); +bool resctrl_arch_is_mbm_local_enabled(void); + +static inline bool resctrl_arch_is_mbm_total_enabled(void) +{ + return false; +} #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1

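To make the derived fields concrete, a worked example for a hypothetical class with 16 cache portions:

/*
 * cpbm_wd = 16:
 *   default_ctrl = BIT_MASK(16) - 1 = 0xffff  (all portions set)
 *   data_width   = (16 + 3) / 4    = 4       (hex digits in the schemata)
 *
 * cpbm_wd = 12:
 *   default_ctrl = 0xfff, data_width = 3
 */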
From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- After the changes to resctrl to support MPAM, num_rmid is only used as a value that is unfortunately exposed to user-space. For MPAM, this value doesn't mean anything, and whatever value we do expose will be wrong for some use cases. User-space may expect it can use this value to know how many 'extra' monitor groups it can create. e.g. on x86 if num_closid=4, num_rmid=8, then a total of 4 monitor groups can be created. If num_rmid were 2, then only 2 control groups could be created. For MPAM the number of pmg is very likely to be smaller than the number of partid, but this doesn't restrict the creation of control groups, as each control group has its own pmg space. Pick 1 if monitoring is supported. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 3e72cf514c4a..c2cf9432e50d 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -181,10 +181,22 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) exposed_alloc_capable = true; } - if (class->level == 3 && cache_has_usable_csu(class)) { + if (class->level == 3 && cache_has_usable_csu(class)) r->mon_capable = true; - exposed_mon_capable = true; - } + } + + if (r->mon_capable) { + exposed_mon_capable = true; + + /* + * Unfortunately, num_rmid doesn't mean anything for + * mpam, and its exposed to user-space! + * num-rmid is supposed to mean the number of groups + * that can be created, both control or monitor groups. + * For mpam, each control group has its own pmg/rmid + * space. + */ + r->num_rmid = 1; } return 0; -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- We already have a helper for resetting an mpam class. Hook it up to resctrl_arch_reset_resources(). Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 2 +- drivers/platform/mpam/mpam_internal.h | 2 ++ drivers/platform/mpam/mpam_resctrl.c | 27 +++++++++++++++++++++++++++ include/linux/arm_mpam.h | 3 +++ 4 files changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 28f6010df6dc..13f48878c3d5 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -2177,7 +2177,7 @@ static void mpam_enable_once(void) READ_ONCE(mpam_partid_max) + 1, mpam_pmg_max + 1); } -static void mpam_reset_class(struct mpam_class *class) +void mpam_reset_class(struct mpam_class *class) { int idx; struct mpam_msc_ris *ris; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 579a8485d320..9c47d5a35d99 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -293,6 +293,8 @@ extern u8 mpam_pmg_max; void mpam_enable(struct work_struct *work); void mpam_disable(struct work_struct *work); +void mpam_reset_class(struct mpam_class *class); + int mpam_apply_config(struct mpam_component *comp, u16 partid, struct mpam_config *cfg); diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index c2cf9432e50d..7a54dc7c8056 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -248,6 +248,33 @@ int mpam_resctrl_setup(void) return err; } +void resctrl_arch_reset_resources(void) +{ + int i, idx; + struct mpam_class *class; + struct mpam_resctrl_res *res; + + lockdep_assert_cpus_held(); + + if (!mpam_is_enabled()) + return; + + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (!res->class) + continue; // dummy resource + + if (!res->resctrl_res.alloc_capable) + continue; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) + mpam_reset_class(class); + srcu_read_unlock(&mpam_srcu, idx); + } +} + static struct mpam_resctrl_dom * mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) { diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 576bb97fa552..97d4c8f076e4 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -50,4 +50,7 @@ static inline bool resctrl_arch_is_mbm_total_enabled(void) return false; } +/* reset cached configurations, then all devices */ +void resctrl_arch_reset_resources(void); + #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Implement resctrl_arch_get_config() by testing the configuration for a CPOR bitmap. For any other configuration type return the default. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 3 ++ drivers/platform/mpam/mpam_resctrl.c | 44 ++++++++++++++++++++++++++++ 2 files changed, 47 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 13f48878c3d5..5ebb944ad9fc 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -2302,6 +2302,9 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid, lockdep_assert_cpus_held(); + if (!memcmp(&comp->cfg[partid], cfg, sizeof(*cfg))) + return 0; + comp->cfg[partid] = *cfg; arg.comp = comp; arg.partid = partid; diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 7a54dc7c8056..940d2bf7ccd4 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -248,6 +248,50 @@ int mpam_resctrl_setup(void) return err; } +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type type) +{ + u32 partid; + struct mpam_config *cfg; + struct mpam_props *cprops; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + enum mpam_device_features configured_by; + + lockdep_assert_cpus_held(); + + if (!mpam_is_enabled()) + return r->default_ctrl; + + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + cprops = &res->class->props; + + partid = resctrl_get_config_index(closid, type); + cfg = &dom->comp->cfg[partid]; + + switch (r->rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + configured_by = mpam_feat_cpor_part; + break; + default: + return -EINVAL; + } + + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r) || + !mpam_has_feature(configured_by, cfg)) + return r->default_ctrl; + + switch (configured_by) { + case mpam_feat_cpor_part: + /* TODO: Scaling is not yet supported */ + return cfg->cpbm; + default: + return -EINVAL; + } +} + void resctrl_arch_reset_resources(void) { int i, idx; -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl has two helpers for updating the configuration. resctrl_arch_update_one() updates a single value, and is used by the software-controller to apply feedback to the bandwidth controls; it has to be called on one of the CPUs in the resctrl domain. resctrl_arch_update_domains() copies multiple staged configurations; it can be called from anywhere. Both helpers must apply any changes to the underlying hardware. Implement resctrl_arch_update_domains() to use resctrl_arch_update_one(), which doesn't depend on being called on the right CPU. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_internal.h | 7 +-- drivers/platform/mpam/mpam_resctrl.c | 63 +++++++++++++++++++++++++++ 2 files changed, 64 insertions(+), 6 deletions(-) diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 9c47d5a35d99..49fc264d32f7 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -119,12 +119,7 @@ struct mpam_props }; #define mpam_has_feature(_feat, x) ((1<<_feat) & (x)->features) - -static inline void mpam_set_feature(enum mpam_device_features feat, - struct mpam_props *props) -{ - props->features |= (1<<feat); -} +#define mpam_set_feature(_feat, x) ((x)->features |= (1<<_feat)) static inline void mpam_clear_feature(enum mpam_device_features feat, mpam_features_t *supported) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 940d2bf7ccd4..9d808f68a122 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -292,6 +292,69 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, } } +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, enum resctrl_conf_type t, u32 cfg_val) +{ + u32 partid; + struct mpam_config cfg; + struct mpam_props *cprops; + struct mpam_resctrl_res *res; + struct mpam_resctrl_dom *dom; + + lockdep_assert_cpus_held(); + lockdep_assert_irqs_enabled(); + + /* NOTE: don't check the CPU as mpam_apply_config() doesn't care, + * and resctrl_arch_update_domains() depends on this.
*/ + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + cprops = &res->class->props; + + partid = resctrl_get_config_index(closid, t); + if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r)) + return -EINVAL; + + switch (r->rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + /* TODO: Scaling is not yet supported */ + cfg.cpbm = cfg_val; + mpam_set_feature(mpam_feat_cpor_part, &cfg); + break; + default: + return -EINVAL; + } + + return mpam_apply_config(dom->comp, partid, &cfg); +} + +/* TODO: this is IPI heavy */ +int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) +{ + int err = 0; + struct rdt_domain *d; + enum resctrl_conf_type t; + struct resctrl_staged_config *cfg; + + lockdep_assert_cpus_held(); + lockdep_assert_irqs_enabled(); + + list_for_each_entry(d, &r->domains, list) { + for (t = 0; t < CDP_NUM_TYPES; t++) { + cfg = &d->staged_config[t]; + if (!cfg->have_new_ctrl) + continue; + + err = resctrl_arch_update_one(r, d, closid, t, + cfg->new_ctrl); + if (err) + return err; + } + } + + return err; +} + void resctrl_arch_reset_resources(void) { int i, idx; -- 2.25.1

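A hypothetical caller-side sketch of the staged-configuration contract, using the resctrl structures visible in the patch (error handling elided; the 0x3f bitmap is an arbitrary example):

struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
struct rdt_domain *d;

list_for_each_entry(d, &r->domains, list) {
	d->staged_config[CDP_NONE].new_ctrl = 0x3f;
	d->staged_config[CDP_NONE].have_new_ctrl = true;
}

/* Copies each staged value out via resctrl_arch_update_one() */
resctrl_arch_update_domains(r, 3);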
From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Intel RDT's CDP feature allows the cache to use a different control value depending on whether the access was an instruction fetch or a data access. MPAM's equivalent feature is the other way up: the CPU assigns a different partid label to traffic depending on whether it was an instruction fetch or a data access, which causes the cache to use a different control value based solely on the partid. MPAM can emulate CDP, with the side effect that the alternative partid is seen by all MSCs; it can't be enabled per-MSC. Add the resctrl hooks to turn this on or off. Add the helpers that match a closid against a task, which need to be aware that the value written to hardware is not the same as the one resctrl is using. The context switch code needs to match the default resctrl group's value against a variable, as this value changes depending on whether CDP is in use. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/include/asm/mpam.h | 3 +- drivers/platform/mpam/mpam_resctrl.c | 62 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 13 ++++++ 3 files changed, 77 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index 1d81a6f26acd..edae8c98fa28 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -4,6 +4,7 @@ #ifndef __ASM__MPAM_H #define __ASM__MPAM_H +#include <linux/arm_mpam.h> #include <linux/bitops.h> #include <linux/init.h> #include <linux/jump_label.h> @@ -108,7 +109,7 @@ static inline void mpam_thread_switch(struct task_struct *tsk) !static_branch_likely(&mpam_enabled)) return; - if (!regval) + if (regval == READ_ONCE(mpam_resctrl_default_group)) regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu)); oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 9d808f68a122..eead81775f94 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -19,6 +19,8 @@ #include "mpam_internal.h" +u64 mpam_resctrl_default_group; + /* * The classes we've picked to map to resctrl resources. * Class pointer may be NULL. @@ -29,6 +31,12 @@ static bool exposed_alloc_capable; static bool exposed_mon_capable; static struct mpam_class *mbm_local_class; +/* + * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM1_EL1. + * This applies globally to all traffic the CPU generates.
+ */ +static bool cdp_enabled; + bool resctrl_arch_alloc_capable(void) { return exposed_alloc_capable; @@ -44,6 +52,36 @@ bool resctrl_arch_is_mbm_local_enabled(void) return mbm_local_class; } +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored) +{ + return cdp_enabled; +} + +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) +{ + u64 regval; + u32 partid, partid_i, partid_d; + + cdp_enabled = enable; + + partid = RESCTRL_RESERVED_CLOSID; + + if (enable) { + partid_d = resctrl_get_config_index(partid, CDP_CODE); + partid_i = resctrl_get_config_index(partid, CDP_DATA); + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d) | + FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + + } else { + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid) | + FIELD_PREP(MPAM_SYSREG_PARTID_I, partid); + } + + WRITE_ONCE(mpam_resctrl_default_group, regval); + + return 0; +} + /* * MSC may raise an error interrupt if it sees an out or range partid/pmg, * and go on to truncate the value. Regardless of what the hardware supports, @@ -54,6 +92,30 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); } +bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM_SYSREG_PARTID_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return tsk_closid == closid; +} + +/* The task's pmg is not unique, the partid must be considered too */ +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + u64 regval = mpam_get_regval(tsk); + u32 tsk_closid = FIELD_GET(MPAM_SYSREG_PARTID_D, regval); + u32 tsk_rmid = FIELD_GET(MPAM_SYSREG_PMG_D, regval); + + if (cdp_enabled) + tsk_closid >>= 1; + + return (tsk_closid == closid) && (tsk_rmid == rmid); +} + struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) { if (l >= RDT_NUM_RESOURCES) diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 97d4c8f076e4..e3921b0ab836 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -5,8 +5,17 @@ #define __LINUX_ARM_MPAM_H #include <linux/acpi.h> +#include <linux/resctrl_types.h> #include <linux/types.h> +/* + * The value of the MPAM1_EL1 sysreg when a task is in the default group. + * This is used by the context switch code to use the resctrl CPU property + * instead. The value is modified when CDP is enabled/disabled by mounting + * the resctrl filesystem. + */ +extern u64 mpam_resctrl_default_group; + struct mpam_msc; enum mpam_msc_iface { @@ -53,4 +62,8 @@ static inline bool resctrl_arch_is_mbm_total_enabled(void) /* reset cached configurations, then all devices */ void resctrl_arch_reset_resources(void); +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored); +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable); +bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); +bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1

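Assuming the upstream resctrl_get_config_index() convention, where CDP_DATA maps to closid * 2 and CDP_CODE to closid * 2 + 1, the emulation doubles up partids like so:

/*
 * closid 0 -> data partid 0, code partid 1
 * closid 1 -> data partid 2, code partid 3
 * closid n -> data partid 2n, code partid 2n + 1
 *
 * This is why matching a task's closid with CDP enabled only needs
 * the stored partid shifting right by one bit.
 */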
From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Care must be taken when modifying the partid and pmg of a task, as writing these values may race with the task being scheduled in, and reading the modified values. Add helpers to set the task properties, and the cpu default value, and add the plumbing to the mpam driver that lets resctrl use them. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/include/asm/mpam.h | 44 +++++++++++++++++++++ drivers/platform/mpam/mpam_resctrl.c | 58 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 7 ++++ 3 files changed, 109 insertions(+) diff --git a/arch/arm64/include/asm/mpam.h b/arch/arm64/include/asm/mpam.h index edae8c98fa28..9abe1fe58c34 100644 --- a/arch/arm64/include/asm/mpam.h +++ b/arch/arm64/include/asm/mpam.h @@ -6,6 +6,7 @@ #include <linux/arm_mpam.h> #include <linux/bitops.h> +#include <linux/bitfield.h> #include <linux/init.h> #include <linux/jump_label.h> #include <linux/percpu.h> @@ -90,6 +91,35 @@ static inline void __init __enable_mpam_hcr(void) * A value in struct thread_info is used instead of struct task_struct as the * cpu's u64 register format is used, but struct task_struct has two u32'. */ + static inline void mpam_set_cpu_defaults(int cpu, u16 partid_d, u16 partid_i, + u8 pmg_d, u8 pmg_i) +{ + u64 default_val; + + default_val = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d); + default_val |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + default_val |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d); + default_val |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i); + + WRITE_ONCE(per_cpu(arm64_mpam_default, cpu), default_val); +} + +static inline void mpam_set_task_partid_pmg(struct task_struct *tsk, + u16 partid_d, u16 partid_i, + u8 pmg_d, u8 pmg_i) +{ +#ifdef CONFIG_ARM64_MPAM + u64 regval; + + regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d); + regval |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i); + + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); +#endif +} + static inline u64 mpam_get_regval(struct task_struct *tsk) { #ifdef CONFIG_ARM64_MPAM @@ -99,6 +129,20 @@ static inline u64 mpam_get_regval(struct task_struct *tsk) #endif } +static inline void resctrl_arch_set_rmid(struct task_struct *tsk, u32 rmid) +{ +#ifdef CONFIG_ARM64_MPAM + u64 regval = mpam_get_regval(tsk); + + regval &= ~MPAM_SYSREG_PMG_D; + regval &= ~MPAM_SYSREG_PMG_I; + regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, rmid); + regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, rmid); + + WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval); +#endif +} + static inline void mpam_thread_switch(struct task_struct *tsk) { u64 oldregval; diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index eead81775f94..19a7807f8013 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -8,6 +8,7 @@ #include <linux/cpu.h> #include <linux/cpumask.h> #include <linux/errno.h> +#include <linux/limits.h> #include <linux/list.h> #include <linux/printk.h> #include <linux/rculist.h> @@ -92,6 +93,63 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); } +void resctrl_sched_in(struct 
task_struct *tsk) +{ + lockdep_assert_preemption_disabled(); + + mpam_thread_switch(tsk); +} + +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg) +{ + BUG_ON(closid > U16_MAX); + BUG_ON(pmg > U8_MAX); + + if (!cdp_enabled) { + mpam_set_cpu_defaults(cpu, closid, closid, pmg, pmg); + } else { + /* + * When CDP is enabled, resctrl halves the closid range and we + * use odd/even partid for one closid. + */ + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_cpu_defaults(cpu, partid_d, partid_i, pmg, pmg); + } +} + +void resctrl_arch_sync_cpu_defaults(void *info) +{ + struct resctrl_cpu_sync *r = info; + + lockdep_assert_preemption_disabled(); + + if (r) { + resctrl_arch_set_cpu_default_closid_rmid(smp_processor_id(), + r->closid, r->rmid); + } + + resctrl_sched_in(current); +} + +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) +{ + + + BUG_ON(closid > U16_MAX); + BUG_ON(rmid > U8_MAX); + + if (!cdp_enabled) { + mpam_set_task_partid_pmg(tsk, closid, closid, rmid, rmid); + } else { + u32 partid_d = resctrl_get_config_index(closid, CDP_DATA); + u32 partid_i = resctrl_get_config_index(closid, CDP_CODE); + + mpam_set_task_partid_pmg(tsk, partid_d, partid_i, rmid, rmid); + } +} + bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid) { u64 regval = mpam_get_regval(tsk); diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index e3921b0ab836..e2428a0dbe2f 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -16,6 +16,8 @@ */ extern u64 mpam_resctrl_default_group; +#include <asm/mpam.h> + struct mpam_msc; enum mpam_msc_iface { @@ -66,4 +68,9 @@ bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored); int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable); bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); +void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid); +void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg); +void resctrl_sched_in(struct task_struct *tsk); + #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Because MPAM's pmg aren't identical to RDT's rmid, resctrl handles some data structures by index. This allows x86 to map indexes to RMID, and MPAM to map them to partid-and-pmg. Add the helpers to do this. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 28 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 3 +++ 2 files changed, 31 insertions(+) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 19a7807f8013..01d43ce9911a 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -93,6 +93,34 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); } +u32 resctrl_arch_system_num_rmid_idx(void) +{ + u8 closid_shift = fls(mpam_pmg_max); + u32 num_partid = resctrl_arch_get_num_closid(NULL); + + return num_partid << closid_shift; +} + +u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid) +{ + u8 closid_shift = fls(mpam_pmg_max); + + BUG_ON(closid_shift > 8); + + return (closid << closid_shift) | rmid; +} + +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) +{ + u8 closid_shift = fls(mpam_pmg_max); + u32 pmg_mask = ~(~0 << closid_shift); + + BUG_ON(closid_shift > 8); + + *closid = idx >> closid_shift; + *rmid = idx & pmg_mask; +} + void resctrl_sched_in(struct task_struct *tsk) { lockdep_assert_preemption_disabled(); diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index e2428a0dbe2f..3ba466659058 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -72,5 +72,8 @@ void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid); void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg); void resctrl_sched_in(struct task_struct *tsk); +u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); +void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); +u32 resctrl_arch_system_num_rmid_idx(void); #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1
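A worked example of the index scheme: with mpam_pmg_max = 3, fls() gives a shift of 2, so closid 5 with rmid 2 becomes index 22. A small standalone sketch of the same arithmetic (the pmg_max value is made up):

#include <stdint.h>
#include <stdio.h>

static unsigned int fls_u32(uint32_t x)	/* kernel-style fls(): 1-based */
{
	return x ? 32 - __builtin_clz(x) : 0;
}

int main(void)
{
	uint32_t pmg_max = 3, closid = 5, rmid = 2;
	unsigned int shift = fls_u32(pmg_max);		/* 2 pmg bits */
	uint32_t idx = (closid << shift) | rmid;	/* 0b10110 == 22 */

	printf("idx=%u -> closid=%u rmid=%u\n",
	       idx, idx >> shift, idx & ((1u << shift) - 1));
	return 0;
}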

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl supports 'MB', as a percentage throttling of traffic somewhere after the L3. This is the control that mba_sc uses, so ideally the class chosen should be as close as possible to the counters used for mba_local. MB's percentage control can be backed either with the fixed point fraction MBW_MAX or the bandwidth portion bitmap. Add helpers to convert to/from percentages. One problem here is the value written is not the same as the value read back. This is deliberately made visible to user-space. Another is that the MBW_MAX fixed point fraction can't represent 100%. This is also exposed to user-space, as otherwise the values for a single-bit system are 100% and 0%, instead of 50% and 0%. The way CDP is emulated means MB controls need programming twice by the resctrl glue, as the bandwidth controls can be applied independently for data or instruction-fetch. This isn't how x86 behaves, and neither user-space nor resctrl supports it. CC: Amit Singh Tomar <amitsinght@marvell.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 219 ++++++++++++++++++++++++++- 1 file changed, 216 insertions(+), 3 deletions(-) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 01d43ce9911a..ce7216ae3817 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -53,9 +53,20 @@ bool resctrl_arch_is_mbm_local_enabled(void) return mbm_local_class; } -bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored) +bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) { - return cdp_enabled; + switch (rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: + return cdp_enabled; + case RDT_RESOURCE_MBA: + default: + /* + * x86's MBA control doesn't support CDP, so user-space doesn't + * expect it. + */ + return false; + } } int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) @@ -83,6 +94,11 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) return 0; } +static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) +{ + return cdp_enabled && !resctrl_arch_get_cdp_enabled(rid); +} + /* * MSC may raise an error interrupt if it sees an out of range partid/pmg, * and go on to truncate the value. Regardless of what the hardware supports, @@ -248,6 +264,102 @@ bool resctrl_arch_is_llc_occupancy_enabled(void) return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); } +static bool mba_class_use_mbw_part(struct mpam_props *cprops) +{ + /* TODO: Scaling is not yet supported */ + return (mpam_has_feature(mpam_feat_mbw_part, cprops) && + cprops->mbw_pbm_bits < MAX_MBA_BW); } + +static bool class_has_usable_mba(struct mpam_props *cprops) +{ + if (mba_class_use_mbw_part(cprops) || + mpam_has_feature(mpam_feat_mbw_max, cprops)) + return true; + + return false; +} + +/* + * Calculate the percentage change from each implemented bit in the control. + * This can return 0 when BWA_WD is greater than 6.
(100 / (1<<7) == 0) */ +static u32 get_mba_granularity(struct mpam_props *cprops) +{ + if (mba_class_use_mbw_part(cprops)) { + return MAX_MBA_BW / cprops->mbw_pbm_bits; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + /* + * bwa_wd is the number of bits implemented in the 0.xxx + * fixed point fraction. 1 bit is 50%, 2 is 25% etc. + */ + return MAX_MBA_BW / (1 << cprops->bwa_wd); + } + + return 0; +} + +static u32 mbw_pbm_to_percent(unsigned long mbw_pbm, struct mpam_props *cprops) +{ + u32 bit, result = 0, granularity = get_mba_granularity(cprops); + + for_each_set_bit(bit, &mbw_pbm, cprops->mbw_pbm_bits % 32) { + result += granularity; + } + + return result; +} + +static u32 mbw_max_to_percent(u16 mbw_max, struct mpam_props *cprops) +{ + u8 bit; + u32 divisor = 2, value = 0; + + for (bit = 15; bit; bit--) { + if (mbw_max & BIT(bit)) + value += MAX_MBA_BW / divisor; + divisor <<= 1; + } + + return value; +} + +static u32 percent_to_mbw_pbm(u8 pc, struct mpam_props *cprops) +{ + u32 granularity = get_mba_granularity(cprops); + u8 num_bits = pc / granularity; + + if (!num_bits) + return 0; + + /* TODO: pick bits at random to avoid contention */ + return (1 << num_bits) - 1; +} + +static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops) +{ + u8 bit; + u32 divisor = 2, value = 0; + + if (WARN_ON_ONCE(cprops->bwa_wd > 15)) + return MAX_MBA_BW; + + for (bit = 15; bit; bit--) { + if (pc >= MAX_MBA_BW / divisor) { + pc -= MAX_MBA_BW / divisor; + value |= BIT(bit); + } + divisor <<= 1; + + if (!pc || !(MAX_MBA_BW / divisor)) + break; + } + + value &= GENMASK(15, 15 - cprops->bwa_wd); + + return value; +} + /* Test whether we can export MPAM_CLASS_CACHE:{2,3}? */ static void mpam_resctrl_pick_caches(void) { @@ -295,6 +407,44 @@ static void mpam_resctrl_pick_caches(void) srcu_read_unlock(&mpam_srcu, idx); } +static void mpam_resctrl_pick_mba(void) +{ + struct mpam_class *class, *candidate_class = NULL; + struct mpam_resctrl_res *res; + int idx; + + lockdep_assert_cpus_held(); + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + struct mpam_props *cprops = &class->props; + + if (class->level < 3) + continue; + + if (!class_has_usable_mba(cprops)) + continue; + + if (!cpumask_equal(&class->affinity, cpu_possible_mask)) + continue; + + /* + * mba_sc reads the mbm_local counter, and waggles the MBA controls. + * mbm_local is implicitly part of the L3; pick a resource to be MBA + * that is as close as possible to the L3.
+ */ + if (!candidate_class || class->level < candidate_class->level) + candidate_class = class; + } + srcu_read_unlock(&mpam_srcu, idx); + + if (candidate_class) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MBA]; + res->class = candidate_class; + res->resctrl_res.name = "MB"; + } +} + static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { struct mpam_class *class = res->class; @@ -331,6 +481,27 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) if (class->level == 3 && cache_has_usable_csu(class)) r->mon_capable = true; + } else if (res->resctrl_res.rid == RDT_RESOURCE_MBA) { + struct mpam_props *cprops = &class->props; + + /* TODO: kill these properties off as they are derivatives */ + r->format_str = "%d=%0*u"; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = MAX_MBA_BW; + r->data_width = 3; + + r->membw.delay_linear = true; + r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED; + r->membw.bw_gran = get_mba_granularity(cprops); + + /* Round up to at least 1% */ + if (!r->membw.bw_gran) + r->membw.bw_gran = 1; + + if (class_has_usable_mba(cprops)) { + r->alloc_capable = true; + exposed_alloc_capable = true; + } } if (r->mon_capable) { @@ -365,6 +536,7 @@ int mpam_resctrl_setup(void) } mpam_resctrl_pick_caches(); + mpam_resctrl_pick_mba(); for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; @@ -423,6 +595,15 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case RDT_RESOURCE_L3: configured_by = mpam_feat_cpor_part; break; + case RDT_RESOURCE_MBA: + if (mba_class_use_mbw_part(cprops)) { + configured_by = mpam_feat_mbw_part; + break; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + configured_by = mpam_feat_mbw_max; + break; + } + fallthrough; default: return -EINVAL; } @@ -435,6 +616,11 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case mpam_feat_cpor_part: /* TODO: Scaling is not yet supported */ return cfg->cpbm; + case mpam_feat_mbw_part: + /* TODO: Scaling is not yet supported */ + return mbw_pbm_to_percent(cfg->mbw_pbm, cprops); + case mpam_feat_mbw_max: + return mbw_max_to_percent(cfg->mbw_max, cprops); default: return -EINVAL; } @@ -443,6 +629,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type t, u32 cfg_val) { + int err; u32 partid; struct mpam_config cfg; struct mpam_props *cprops; @@ -469,11 +656,37 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, cfg.cpbm = cfg_val; mpam_set_feature(mpam_feat_cpor_part, &cfg); break; + case RDT_RESOURCE_MBA: + if (mba_class_use_mbw_part(cprops)) { + cfg.mbw_pbm = percent_to_mbw_pbm(cfg_val, cprops); + mpam_set_feature(mpam_feat_mbw_part, &cfg); + break; + } else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) { + cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops); + mpam_set_feature(mpam_feat_mbw_max, &cfg); + break; + } + fallthrough; default: return -EINVAL; } - return mpam_apply_config(dom->comp, partid, &cfg); + /* + * When CDP is enabled, but the resource doesn't support it, we need to + * apply the same configuration to the other partid. 
+ */ + if (mpam_resctrl_hide_cdp(r->rid)) { + partid = resctrl_get_config_index(closid, CDP_CODE); + err = mpam_apply_config(dom->comp, partid, &cfg); + if (err) + return err; + + partid = resctrl_get_config_index(closid, CDP_DATA); + return mpam_apply_config(dom->comp, partid, &cfg); + + } else { + return mpam_apply_config(dom->comp, partid, &cfg); + } } /* TODO: this is IPI heavy */ -- 2.25.1
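The fixed-point conversion is easiest to follow with numbers. The standalone sketch below mirrors the two MBW_MAX helpers above (assuming MAX_MBA_BW is 100): asking for 75% with bwa_wd = 2 sets the 50% and 25% bits, giving 0xc000, which reads back as exactly 75%. All-ones can only ever sum to just under 100%, which is why 100% is not representable:

#include <stdint.h>
#include <stdio.h>

#define MAX_MBA_BW 100u

static uint16_t percent_to_mbw_max(uint8_t pc, uint8_t bwa_wd)
{
	uint32_t divisor = 2, value = 0;
	int bit;

	for (bit = 15; bit; bit--) {
		if (pc >= MAX_MBA_BW / divisor) {
			pc -= MAX_MBA_BW / divisor;
			value |= 1u << bit;
		}
		divisor <<= 1;
		if (!pc || !(MAX_MBA_BW / divisor))
			break;
	}

	/* keep only the implemented top bwa_wd+1 bits, as GENMASK() does */
	return value & ~((1u << (15 - bwa_wd)) - 1);
}

static uint32_t mbw_max_to_percent(uint16_t mbw_max)
{
	uint32_t divisor = 2, value = 0;
	int bit;

	for (bit = 15; bit; bit--) {
		if (mbw_max & (1u << bit))
			value += MAX_MBA_BW / divisor;
		divisor <<= 1;
	}
	return value;
}

int main(void)
{
	uint16_t raw = percent_to_mbw_max(75, 2);

	printf("75%% -> %#x -> %u%%\n", (unsigned int)raw,
	       mbw_max_to_percent(raw));
	return 0;
}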

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl has two types of counters, NUMA-local and global. MPAM has only bandwidth counters, but the position of the MSC may mean it counts NUMA-local or global traffic, and the topology information is not available. Apply a heuristic: if the L2 or L3 supports bandwidth monitors, these are probably NUMA-local. If the memory controller supports bandwidth monitors, they are probably global. This selection is made from mpam_resctrl_resource_init(), which implies resources that can be used for resctrl controls exist and also have counters. This would be a problem on a platform that only supports monitoring. TODO: Add an extra pass of all classes to find the classes to use as bandwidth counters. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 52 +++++++++++++++++++++++++++- include/linux/arm_mpam.h | 6 +--- 2 files changed, 52 insertions(+), 6 deletions(-) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index ce7216ae3817..19916d9149c5 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -31,6 +31,7 @@ static struct mpam_resctrl_res mpam_resctrl_exports[RDT_NUM_RESOURCES]; static bool exposed_alloc_capable; static bool exposed_mon_capable; static struct mpam_class *mbm_local_class; +static struct mpam_class *mbm_total_class; /* * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM1_EL1. @@ -53,6 +54,11 @@ bool resctrl_arch_is_mbm_local_enabled(void) return mbm_local_class; } +bool resctrl_arch_is_mbm_total_enabled(void) +{ + return mbm_total_class; +} + bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) { switch (rid) { @@ -264,6 +270,24 @@ bool resctrl_arch_is_llc_occupancy_enabled(void) return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); } +static bool class_has_usable_mbwu(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops)) + return false; + + /* + * resctrl expects the bandwidth counters to be free running, + * which means we need as many monitors as resctrl has + * control/monitor groups. + */ + if (cprops->num_mbwu_mon < resctrl_arch_system_num_rmid_idx()) + return false; + + return (mpam_partid_max > 1) || (mpam_pmg_max != 0); +} + static bool mba_class_use_mbw_part(struct mpam_props *cprops) { /* TODO: Scaling is not yet supported */ @@ -449,10 +473,13 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { struct mpam_class *class = res->class; struct rdt_resource *r = &res->resctrl_res; + bool has_mbwu = class_has_usable_mbwu(class); /* Is this one of the two well-known caches? */ if (res->resctrl_res.rid == RDT_RESOURCE_L2 || res->resctrl_res.rid == RDT_RESOURCE_L3) { + bool has_csu = cache_has_usable_csu(class); + /* TODO: Scaling is not yet supported */ r->cache.cbm_len = class->props.cpbm_wd; r->cache.arch_has_sparse_bitmasks = true; @@ -479,8 +506,25 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) exposed_alloc_capable = true; } - if (class->level == 3 && cache_has_usable_csu(class)) + /* + * MBWU counters may be 'local' or 'total' depending on where + * they are in the topology.
Counters on caches are assumed to + * be local. If it's on the memory controller, it's assumed to + * be global. + */ + if (has_mbwu && class->level >= 3) { + mbm_local_class = class; r->mon_capable = true; + } + + /* + * CSU counters only make sense on a cache. The file is called + * llc_occupancy, but it's expected to be on the L3. + */ + if (has_csu && class->type == MPAM_CLASS_CACHE && + class->level == 3) { + r->mon_capable = true; + } } else if (res->resctrl_res.rid == RDT_RESOURCE_MBA) { struct mpam_props *cprops = &class->props; @@ -502,6 +546,11 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) r->alloc_capable = true; exposed_alloc_capable = true; } + + if (has_mbwu && class->type == MPAM_CLASS_MEMORY) { + mbm_total_class = class; + r->mon_capable = true; + } } if (r->mon_capable) { @@ -537,6 +586,7 @@ int mpam_resctrl_setup(void) mpam_resctrl_pick_caches(); mpam_resctrl_pick_mba(); + /* TODO: mpam_resctrl_pick_counters(); */ for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 3ba466659058..94179b86a9e6 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -55,11 +55,7 @@ bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); bool resctrl_arch_is_llc_occupancy_enabled(void); bool resctrl_arch_is_mbm_local_enabled(void); - -static inline bool resctrl_arch_is_mbm_total_enabled(void) -{ - return false; -} +bool resctrl_arch_is_mbm_total_enabled(void); /* reset cached configurations, then all devices */ void resctrl_arch_reset_resources(void); -- 2.25.1
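The heuristic reduces to a small decision function; the sketch below restates it in isolation, with simplified stand-ins for the struct mpam_class fields:

/* Simplified restatement of the placement heuristic above. */
enum mbm_kind { MBM_NONE, MBM_LOCAL, MBM_TOTAL };

static enum mbm_kind classify_mbwu(int is_memory_class, int cache_level,
				   int has_usable_mbwu)
{
	if (!has_usable_mbwu)
		return MBM_NONE;
	if (is_memory_class)
		return MBM_TOTAL;	/* memory-side counters see all traffic */
	if (cache_level >= 3)
		return MBM_LOCAL;	/* cache-side counters assumed NUMA-local */
	return MBM_NONE;
}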

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- When resctrl wants to read a domain's 'QOS_L3_OCCUP', it needs to allocate a monitor on the corresponding resource. Monitors are allocated by class instead of component. Add helpers to do this. The MBM events depend on having their monitors allocated at init time so that they can be left running. The value USE_RMID_IDX is out of range for a monitor, and is used to indicate this behaviour. resctrl_arch_mon_ctx_alloc() is implemented to have a no_wait version and a waitqueue for callers that sleep. The no_wait version will later become an interface for the resctrl_pmu to use. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_internal.h | 3 ++ drivers/platform/mpam/mpam_resctrl.c | 72 +++++++++++++++++++++++++++ include/linux/arm_mpam.h | 4 ++ 3 files changed, 79 insertions(+) diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 49fc264d32f7..4d982469b9f5 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -17,6 +17,9 @@ DECLARE_STATIC_KEY_FALSE(mpam_enabled); +/* Value to indicate the allocated monitor is derived from the RMID index. */ +#define USE_RMID_IDX (U16_MAX + 1) + static inline bool mpam_is_enabled(void) { return static_branch_likely(&mpam_enabled); diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 19916d9149c5..70b1879de10a 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -22,6 +22,8 @@ u64 mpam_resctrl_default_group; +DECLARE_WAIT_QUEUE_HEAD(resctrl_mon_ctx_waiters); + /* * The classes we've picked to map to resctrl resources. * Class pointer may be NULL. 
@@ -39,6 +41,10 @@ static struct mpam_class *mbm_total_class; */ static bool cdp_enabled; +/* A dummy mon context to use when the monitors were allocated up front */ +u32 __mon_is_rmid_idx = USE_RMID_IDX; +void *mon_is_rmid_idx = &__mon_is_rmid_idx; + bool resctrl_arch_alloc_capable(void) { return exposed_alloc_capable; @@ -232,6 +238,72 @@ struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) return &mpam_resctrl_exports[l].resctrl_res; } +static void *resctrl_arch_mon_ctx_alloc_no_wait(struct rdt_resource *r, + int evtid) +{ + struct mpam_resctrl_res *res; + u32 *ret = kmalloc(sizeof(*ret), GFP_KERNEL); + + if (!ret) + return ERR_PTR(-ENOMEM); + + switch (evtid) { + case QOS_L3_OCCUP_EVENT_ID: + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + + *ret = mpam_alloc_csu_mon(res->class); + return ret; + case QOS_L3_MBM_LOCAL_EVENT_ID: + case QOS_L3_MBM_TOTAL_EVENT_ID: + return mon_is_rmid_idx; + } + + return ERR_PTR(-EOPNOTSUPP); +} + +void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid) +{ + DEFINE_WAIT(wait); + void *ret; + + might_sleep(); + + do { + prepare_to_wait(&resctrl_mon_ctx_waiters, &wait, + TASK_INTERRUPTIBLE); + ret = resctrl_arch_mon_ctx_alloc_no_wait(r, evtid); + if (PTR_ERR(ret) == -ENOSPC) + schedule(); + } while (PTR_ERR(ret) == -ENOSPC && !signal_pending(current)); + finish_wait(&resctrl_mon_ctx_waiters, &wait); + + return ret; +} + +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, + void *arch_mon_ctx) +{ + struct mpam_resctrl_res *res; + u32 mon = *(u32 *)arch_mon_ctx; + + if (mon == USE_RMID_IDX) + return; + kfree(arch_mon_ctx); + arch_mon_ctx = NULL; + + res = container_of(r, struct mpam_resctrl_res, resctrl_res); + + switch (evtid) { + case QOS_L3_OCCUP_EVENT_ID: + mpam_free_csu_mon(res->class, mon); + wake_up(&resctrl_mon_ctx_waiters); + return; + case QOS_L3_MBM_TOTAL_EVENT_ID: + case QOS_L3_MBM_LOCAL_EVENT_ID: + return; + } +} + static bool cache_has_usable_cpor(struct mpam_class *class) { struct mpam_props *cprops = &class->props; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 94179b86a9e6..7206a35bb205 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -72,4 +72,8 @@ u32 resctrl_arch_rmid_idx_encode(u32 closid, u32 rmid); void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid); u32 resctrl_arch_system_num_rmid_idx(void); +struct rdt_resource; +void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid); +void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx); + #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1
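A sketch of the expected caller pattern, with a hypothetical caller and error handling trimmed:

/* Hypothetical caller, following the contract described above. */
void *ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID);

if (IS_ERR(ctx))
	return PTR_ERR(ctx);	/* e.g. -ENOSPC: no free CSU monitor */

/* ... read the counter using the allocated monitor ... */

resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, ctx);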

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl uses resctrl_arch_rmid_read() to read counters. CDP emulation means the counter may need reading twice to get both the I and D side allocations. The same goes for reset. Add the rounding helper for checking monitor values while we're here. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 79 ++++++++++++++++++++++++++++ include/linux/arm_mpam.h | 4 ++ 2 files changed, 83 insertions(+) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 70b1879de10a..d0140943bd9c 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -304,6 +304,85 @@ void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, } } +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid, + u64 *val, void *arch_mon_ctx) +{ + int err; + u64 cdp_val; + struct mon_cfg cfg; + struct mpam_resctrl_dom *dom; + u32 mon = *(u32 *)arch_mon_ctx; + enum mpam_device_features type; + + resctrl_arch_rmid_read_context_check(); + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + switch (eventid) { + case QOS_L3_OCCUP_EVENT_ID: + type = mpam_feat_msmon_csu; + break; + case QOS_L3_MBM_LOCAL_EVENT_ID: + case QOS_L3_MBM_TOTAL_EVENT_ID: + type = mpam_feat_msmon_mbwu; + break; + default: + return -EINVAL; + } + + cfg.mon = mon; + if (cfg.mon == USE_RMID_IDX) + cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid); + + cfg.match_pmg = true; + cfg.pmg = rmid; + + if (cdp_enabled) { + cfg.partid = closid << 1; + err = mpam_msmon_read(dom->comp, &cfg, type, val); + if (err) + return err; + + cfg.partid += 1; + err = mpam_msmon_read(dom->comp, &cfg, type, &cdp_val); + if (!err) + *val += cdp_val; + } else { + cfg.partid = closid; + err = mpam_msmon_read(dom->comp, &cfg, type, val); + } + + return err; +} + +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 rmid, enum resctrl_event_id eventid) +{ + struct mon_cfg cfg; + struct mpam_resctrl_dom *dom; + + if (eventid != QOS_L3_MBM_LOCAL_EVENT_ID) + return; + + cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid); + cfg.match_pmg = true; + cfg.pmg = rmid; + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + + if (cdp_enabled) { + cfg.partid = closid << 1; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + + cfg.partid += 1; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + } else { + cfg.partid = closid; + mpam_msmon_reset_mbwu(dom->comp, &cfg); + } +} + static bool cache_has_usable_cpor(struct mpam_class *class) { struct mpam_props *cprops = &class->props; diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 7206a35bb205..0f2522c2b11c 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -50,6 +50,10 @@ int mpam_register_requestor(u16 partid_max, u8 pmg_max); int mpam_ris_create(struct mpam_msc *msc, u8 ris_idx, enum mpam_class_types type, u8 class_id, int component_id); +static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) +{ + return val; +} bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); -- 2.25.1
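With CDP emulated, one closid owns an even/odd pair of partids, so one resctrl counter read becomes two hardware reads whose results are summed. A standalone sketch of the partid arithmetic (the byte counts are invented):

#include <stdio.h>

int main(void)
{
	unsigned int closid = 3;
	unsigned int partid_d = closid << 1;		/* 6: data side */
	unsigned int partid_i = (closid << 1) + 1;	/* 7: instruction side */
	unsigned long long d_bytes = 4096, i_bytes = 512; /* made-up counts */

	printf("closid %u reads partid %u + partid %u = %llu bytes\n",
	       closid, partid_d, partid_i, d_bytes + i_bytes);
	return 0;
}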

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- MPAM MSCs may have support for filtering reads or writes when monitoring traffic. Resctrl has a configuration bitmap for which kind of accesses should be monitored. Bridge the gap where possible. MPAM only has a read/write bit, so not all the combinations can be supported. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 26 +++++++++++ drivers/platform/mpam/mpam_internal.h | 9 ++++ drivers/platform/mpam/mpam_resctrl.c | 62 +++++++++++++++++++++++++++ 3 files changed, 97 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 5ebb944ad9fc..c922f2b264cc 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1076,6 +1076,32 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, return err; } +void mpam_msmon_reset_all_mbwu(struct mpam_component *comp) +{ + int idx, i; + unsigned long flags; + struct mpam_msc *msc; + struct mpam_msc_ris *ris; + + if (!mpam_is_enabled()) + return; + + idx = srcu_read_lock(&mpam_srcu); + list_for_each_entry_rcu(ris, &comp->ris, comp_list) { + if (!mpam_has_feature(mpam_feat_msmon_mbwu, &ris->props)) + continue; + + msc = ris->msc; + spin_lock_irqsave(&msc->mon_sel_lock, flags); + for(i = 0; i < ris->props.num_mbwu_mon; i++) { + ris->mbwu_state[i].correction = 0; + ris->mbwu_state[i].reset_on_next_read = true; + } + spin_unlock_irqrestore(&msc->mon_sel_lock, flags); + } + srcu_read_unlock(&mpam_srcu, idx); +} + void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx) { int idx; diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 4d982469b9f5..1f7c0ddcf8b1 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -20,6 +20,12 @@ DECLARE_STATIC_KEY_FALSE(mpam_enabled); /* Value to indicate the allocated monitor is derived from the RMID index. */ #define USE_RMID_IDX (U16_MAX + 1) +/* + * Only these event configuration bits are supported. MPAM can't know if + * data is being written back, these will show up as a write. 
+ */ +#define MPAM_RESTRL_EVT_CONFIG_VALID (READS_TO_LOCAL_MEM | NON_TEMP_WRITE_TO_LOCAL_MEM) + static inline bool mpam_is_enabled(void) { return static_branch_likely(&mpam_enabled); @@ -240,6 +246,8 @@ struct mpam_msc_ris { struct mpam_resctrl_dom { struct mpam_component *comp; struct rdt_domain resctrl_dom; + + u32 mbm_local_evt_cfg; }; struct mpam_resctrl_res { @@ -299,6 +307,7 @@ int mpam_apply_config(struct mpam_component *comp, u16 partid, int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx, enum mpam_device_features, u64 *val); void mpam_msmon_reset_mbwu(struct mpam_component *comp, struct mon_cfg *ctx); +void mpam_msmon_reset_all_mbwu(struct mpam_component *comp); int mpam_resctrl_online_cpu(unsigned int cpu); int mpam_resctrl_offline_cpu(unsigned int cpu); diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index d0140943bd9c..d0becc296108 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -304,6 +304,18 @@ void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, } } +static enum mon_filter_options resctrl_evt_config_to_mpam(u32 local_evt_cfg) +{ + switch (local_evt_cfg) { + case READS_TO_LOCAL_MEM: + return COUNT_READ; + case NON_TEMP_WRITE_TO_LOCAL_MEM: + return COUNT_WRITE; + default: + return COUNT_BOTH; + } +} + int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 rmid, enum resctrl_event_id eventid, u64 *val, void *arch_mon_ctx) @@ -337,6 +349,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, cfg.match_pmg = true; cfg.pmg = rmid; + cfg.opts = resctrl_evt_config_to_mpam(dom->mbm_local_evt_cfg); if (cdp_enabled) { cfg.partid = closid << 1; @@ -620,6 +633,54 @@ static void mpam_resctrl_pick_mba(void) } } +bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) +{ + struct mpam_props *cprops; + + switch (evt) { + case QOS_L3_MBM_LOCAL_EVENT_ID: + if (!mbm_local_class) + return false; + cprops = &mbm_local_class->props; + + return mpam_has_feature(mpam_feat_msmon_mbwu_rwbw, cprops); + default: + return false; + } +} + +void resctrl_arch_mon_event_config_read(void *info) +{ + struct mpam_resctrl_dom *dom; + struct resctrl_mon_config_info *mon_info = info; + + dom = container_of(mon_info->d, struct mpam_resctrl_dom, resctrl_dom); + mon_info->mon_config = dom->mbm_local_evt_cfg & MAX_EVT_CONFIG_BITS; +} + +void resctrl_arch_mon_event_config_write(void *info) +{ + struct mpam_resctrl_dom *dom; + struct resctrl_mon_config_info *mon_info = info; + + if (mon_info->mon_config & ~MPAM_RESTRL_EVT_CONFIG_VALID) { + mon_info->err = -EOPNOTSUPP; + return; + } + + dom = container_of(mon_info->d, struct mpam_resctrl_dom, resctrl_dom); + dom->mbm_local_evt_cfg = mon_info->mon_config & MPAM_RESTRL_EVT_CONFIG_VALID; +} + +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) +{ + struct mpam_resctrl_dom *dom; + + dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); + dom->mbm_local_evt_cfg = MPAM_RESTRL_EVT_CONFIG_VALID; + mpam_msmon_reset_all_mbwu(dom->comp); +} + static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { struct mpam_class *class = res->class; @@ -970,6 +1031,7 @@ mpam_resctrl_alloc_domain(unsigned int cpu, struct mpam_resctrl_res *res) dom->comp = comp; INIT_LIST_HEAD(&dom->resctrl_dom.list); dom->resctrl_dom.id = comp->comp_id; + dom->mbm_local_evt_cfg = MPAM_RESTRL_EVT_CONFIG_VALID; cpumask_set_cpu(cpu, &dom->resctrl_dom.cpu_mask); /* TODO: this list 
should be sorted */ -- 2.25.1
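The mapping from resctrl's event configuration bitmap to MPAM's single read/write filter has only three useful outcomes. A standalone sketch, where the bit values are illustrative assumptions rather than resctrl's real definitions:

#include <stdio.h>

#define READS_TO_LOCAL_MEM		(1u << 0)	/* assumed value */
#define NON_TEMP_WRITE_TO_LOCAL_MEM	(1u << 1)	/* assumed value */

enum mon_filter { COUNT_BOTH, COUNT_READ, COUNT_WRITE };

static enum mon_filter to_mpam_filter(unsigned int evt_cfg)
{
	switch (evt_cfg) {
	case READS_TO_LOCAL_MEM:
		return COUNT_READ;
	case NON_TEMP_WRITE_TO_LOCAL_MEM:
		return COUNT_WRITE;
	default:	/* both bits (or neither): don't filter */
		return COUNT_BOTH;
	}
}

int main(void)
{
	printf("%d\n", to_mpam_filter(READS_TO_LOCAL_MEM)); /* COUNT_READ */
	return 0;
}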

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Pseudo lock isn't supported on arm64. Add empty definitions of the functions arm64 doesn't implement. Because the Kconfig option is not selected, none of these will be called. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- include/linux/arm_mpam.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 0f2522c2b11c..796ce59cc540 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -80,4 +80,11 @@ struct rdt_resource; void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid); void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx); +/* Pseudo lock is not supported by MPAM */ +static inline int resctrl_arch_pseudo_lock_fn(void *_plr) { return 0; } +static inline int resctrl_arch_measure_l2_residency(void *_plr) { return 0; } +static inline int resctrl_arch_measure_l3_residency(void *_plr) { return 0; } +static inline int resctrl_arch_measure_cycles_lat_fn(void *_plr) { return 0; } +static inline u64 resctrl_arch_get_prefetch_disable_bits(void) { return 0; } + #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl has individual hooks to separately enable and disable the closid/partid and rmid/pmg context switching code. For MPAM this is all the same thing, as the value in struct task_struct is used to cache the value that should be written to hardware. arm64's context switching code is enabled once MPAM is usable, but doesn't touch the hardware unless the value has changed. Resctrl doesn't need to ask. Add empty definitions for these hooks. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- include/linux/arm_mpam.h | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 796ce59cc540..5fd62e0d131d 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -87,4 +87,13 @@ static inline int resctrl_arch_measure_l3_residency(void *_plr) { return 0; } static inline int resctrl_arch_measure_cycles_lat_fn(void *_plr) { return 0; } static inline u64 resctrl_arch_get_prefetch_disable_bits(void) { return 0; } +/* + * The CPU configuration for MPAM is cheap to write, and is only written if it + * has changed. No need for fine-grained enables. + */ +static inline void resctrl_arch_enable_mon(void) { } +static inline void resctrl_arch_disable_mon(void) { } +static inline void resctrl_arch_enable_alloc(void) { } +static inline void resctrl_arch_disable_alloc(void) { } + #endif /* __LINUX_ARM_MPAM_H */ -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl expects RDT-like counters that are free running. MPAM's counters don't behave like this as they need a monitor to be allocated first. Provide the helper that says whether free-running counters are supported. Subsequent patches will make this more intelligent. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- include/linux/arm_mpam.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index 5fd62e0d131d..d70e4e726fe6 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -55,6 +55,12 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) return val; } +/* MPAM counters require a monitor to be allocated */ +static inline bool resctrl_arch_event_is_free_running(enum resctrl_event_id evt) +{ + return false; +} + bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); bool resctrl_arch_is_llc_occupancy_enabled(void); -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Enough MPAM support is present to enable ARCH_HAS_CPU_RESCTRL. Let it rip^Wlink! Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/Kconfig | 3 ++- arch/arm64/include/asm/resctrl.h | 1 + drivers/platform/mpam/Kconfig | 4 +++- 3 files changed, 6 insertions(+), 2 deletions(-) create mode 100644 arch/arm64/include/asm/resctrl.h diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 95c50f3ac290..7ff5e6becc9b 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2016,7 +2016,8 @@ config ARM64_TLB_RANGE config ARM64_MPAM bool "Enable support for MPAM" select ACPI_MPAM if ACPI - select ARM_CPU_RESCTRL + select ARCH_HAS_CPU_RESCTRL + select RESCTRL_FS help Memory Partitioning and Monitoring is an optional extension that allows the CPUs to mark load and store transactions with diff --git a/arch/arm64/include/asm/resctrl.h b/arch/arm64/include/asm/resctrl.h new file mode 100644 index 000000000000..bf33f82f04e2 --- /dev/null +++ b/arch/arm64/include/asm/resctrl.h @@ -0,0 +1 @@ +#include <linux/arm_mpam.h> diff --git a/drivers/platform/mpam/Kconfig b/drivers/platform/mpam/Kconfig index 13bd86fc5e58..75f5b2454fbe 100644 --- a/drivers/platform/mpam/Kconfig +++ b/drivers/platform/mpam/Kconfig @@ -2,5 +2,7 @@ # CPU resources, not containers or cgroups etc. config ARM_CPU_RESCTRL bool - depends on ARM64 + default y + depends on ARM64 && ARCH_HAS_CPU_RESCTRL + depends on MISC_FILESYSTEMS select RESCTRL_RMID_DEPENDS_ON_CLOSID -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Carl reports that when both the MPAM driver and CMN driver are built into the kernel, they fight over who can claim the resources associated with their registers. This prevents the second of these two drivers from probing. Currently the CMN PMU driver claims all the CMN registers. The MPAM registers are grouped together in a small number of pages, whereas the PMU registers that the CMN PMU driver uses appear throughout the CMN register space. Having the CMN driver claim all the resources is the wrong thing to do, and claiming individual registers here and there is not worthwhile. Instead, stop the CMN driver from claiming any resources as its registers are not grouped together. Reported-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Signed-off-by: James Morse <james.morse@arm.com> CC: Ilkka Koskinen <ilkka@os.amperecomputing.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/perf/arm-cmn.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c index 0b3ce7713645..b5507068ba67 100644 --- a/drivers/perf/arm-cmn.c +++ b/drivers/perf/arm-cmn.c @@ -2435,6 +2435,7 @@ static int arm_cmn_probe(struct platform_device *pdev) struct arm_cmn *cmn; const char *name; static atomic_t id; + struct resource *cfg; int err, rootnode, this_id; cmn = devm_kzalloc(&pdev->dev, sizeof(*cmn), GFP_KERNEL); @@ -2449,7 +2450,16 @@ static int arm_cmn_probe(struct platform_device *pdev) rootnode = arm_cmn600_acpi_probe(pdev, cmn); } else { rootnode = 0; - cmn->base = devm_platform_ioremap_resource(pdev, 0); + + /* + * Avoid registering resources as the PMUs registers are + * scattered through CMN, and may appear either side of + * registers for other 'devices'. (e.g. the MPAM MSC controls). + */ + cfg = platform_get_resource(pdev, IORESOURCE_MEM, 0); + if (!cfg) + return -EINVAL; + cmn->base = devm_ioremap(&pdev->dev, cfg->start, resource_size(cfg)); if (IS_ERR(cmn->base)) return PTR_ERR(cmn->base); if (cmn->part == PART_CMN600) -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- Now that mpam links against resctrl, call the cpu and domain online/offline calls at the appropriate point. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index d0becc296108..c862f9015c57 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -824,7 +824,7 @@ int mpam_resctrl_setup(void) pr_warn("Number of PMG is not a power of 2! resctrl may misbehave"); } - /* TODO: call resctrl_init() */ + err = resctrl_init(); } return err; @@ -1077,7 +1077,7 @@ struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id) int mpam_resctrl_online_cpu(unsigned int cpu) { - int i; + int i, err; struct mpam_resctrl_dom *dom; struct mpam_resctrl_res *res; @@ -1096,8 +1096,12 @@ int mpam_resctrl_online_cpu(unsigned int cpu) dom = mpam_resctrl_alloc_domain(cpu, res); if (IS_ERR(dom)) return PTR_ERR(dom); + err = resctrl_online_domain(&res->resctrl_res, &dom->resctrl_dom); + if (err) + return err; } + resctrl_online_cpu(cpu); return 0; } @@ -1108,6 +1112,8 @@ int mpam_resctrl_offline_cpu(unsigned int cpu) struct mpam_resctrl_res *res; struct mpam_resctrl_dom *dom; + resctrl_offline_cpu(cpu); + for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; @@ -1126,6 +1132,7 @@ int mpam_resctrl_offline_cpu(unsigned int cpu) if (!cpumask_empty(&d->cpu_mask)) continue; + resctrl_offline_domain(&res->resctrl_res, &dom->resctrl_dom); list_del(&d->list); kfree(dom); } -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- All of MPAM's errors indicate a software bug, e.g. an out-of-bounds partid has been generated. When this happens, the mpam driver is disabled. If resctrl_init() succeeded, also call resctrl_exit() to remove resctrl. If the filesystem was mounted in its traditional place, it is no longer possible for processes to find it as the mount point has been removed. If the filesystem was mounted elsewhere, it will appear that all CPUs and domains are offline. User-space will not be able to update the hardware. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 2 ++ drivers/platform/mpam/mpam_internal.h | 1 + drivers/platform/mpam/mpam_resctrl.c | 17 +++++++++++++++++ 3 files changed, 20 insertions(+) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index c922f2b264cc..c2457c36ab12 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -2240,6 +2240,8 @@ static irqreturn_t mpam_disable_thread(int irq, void *dev_id) } mutex_unlock(&mpam_cpuhp_state_lock); + mpam_resctrl_exit(); + static_branch_disable(&mpam_enabled); mpam_unregister_irqs(); diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 1f7c0ddcf8b1..119660a53e16 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -313,6 +313,7 @@ int mpam_resctrl_online_cpu(unsigned int cpu); int mpam_resctrl_offline_cpu(unsigned int cpu); int mpam_resctrl_setup(void); +void mpam_resctrl_exit(void); /* * MPAM MSCs have the following register layout. See: diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index c862f9015c57..074d8ba32115 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -41,6 +41,12 @@ static struct mpam_class *mbm_total_class; */ static bool cdp_enabled; +/* + * If resctrl_init() succeeded, resctrl_exit() can be used to remove support + * for the filesystem in the event of an error. + */ +static bool resctrl_enabled; + /* A dummy mon context to use when the monitors were allocated up front */ u32 __mon_is_rmid_idx = USE_RMID_IDX; void *mon_is_rmid_idx = &__mon_is_rmid_idx; @@ -825,11 +831,22 @@ int mpam_resctrl_setup(void) } err = resctrl_init(); + if (!err) + WRITE_ONCE(resctrl_enabled, true); } return err; } +void mpam_resctrl_exit(void) +{ + if (!READ_ONCE(resctrl_enabled)) + return; + + WRITE_ONCE(resctrl_enabled, false); + resctrl_exit(); +} + u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type type) { -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- resctrl's limbo code needs to be told when the data left in a cache is small enough for the partid+pmg value to be re-allocated. x86 uses the cache size divided by the number of rmid users the cache may have. Do the same, but for the smallest cache, and with the number of partid-and-pmg users. Querying the cache size can't happen until after cacheinfo_sysfs_init() has run, so mpam_resctrl_setup() must wait until then. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 51 ++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 074d8ba32115..9e69e0e1bb07 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -15,6 +15,7 @@ #include <linux/resctrl.h> #include <linux/slab.h> #include <linux/types.h> +#include <linux/wait.h> #include <asm/mpam.h> @@ -47,6 +48,13 @@ static bool cdp_enabled; */ static bool resctrl_enabled; +/* + * mpam_resctrl_pick_caches() needs to know the size of the caches. cacheinfo + * populates this from a device_initcall(). mpam_resctrl_setup() must wait. + */ +static bool cacheinfo_ready; +static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready); + /* A dummy mon context to use when the monitors were allocated up front */ u32 __mon_is_rmid_idx = USE_RMID_IDX; void *mon_is_rmid_idx = &__mon_is_rmid_idx; @@ -402,6 +410,24 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, } } +/* + * The rmid realloc threshold should be for the smallest cache exposed to + * resctrl. + */ +static void update_rmid_limits(unsigned int size) +{ + u32 num_unique_pmg = resctrl_arch_system_num_rmid_idx(); + + if (WARN_ON_ONCE(!size)) + return; + + if (resctrl_rmid_realloc_limit && size > resctrl_rmid_realloc_limit) + return; + + resctrl_rmid_realloc_limit = size; + resctrl_rmid_realloc_threshold = size / num_unique_pmg; +} + static bool cache_has_usable_cpor(struct mpam_class *class) { struct mpam_props *cprops = &class->props; @@ -558,11 +584,15 @@ static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops) static void mpam_resctrl_pick_caches(void) { int idx; + unsigned int cache_size; struct mpam_class *class; struct mpam_resctrl_res *res; + lockdep_assert_cpus_held(); + idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(class, &mpam_classes, classes_list) { + struct mpam_props *cprops = &class->props; bool has_cpor = cache_has_usable_cpor(class); if (class->type != MPAM_CLASS_CACHE) { @@ -589,6 +619,16 @@ static void mpam_resctrl_pick_caches(void) continue; } + /* Assume cache levels are the same size for all CPUs...
*/ + cache_size = get_cpu_cacheinfo_size(smp_processor_id(), class->level); + if (!cache_size) { + pr_debug("pick_caches: Could not read cache size\n"); + continue; + } + + if (mpam_has_feature(mpam_feat_msmon_csu, cprops)) + update_rmid_limits(cache_size); + if (class->level == 2) { res = &mpam_resctrl_exports[RDT_RESOURCE_L2]; res->resctrl_res.name = "L2"; @@ -794,6 +834,8 @@ int mpam_resctrl_setup(void) struct mpam_resctrl_res *res; enum resctrl_res_level i; + wait_event(wait_cacheinfo_ready, cacheinfo_ready); + cpus_read_lock(); for (i = 0; i < RDT_NUM_RESOURCES; i++) { res = &mpam_resctrl_exports[i]; @@ -1156,3 +1198,12 @@ int mpam_resctrl_offline_cpu(unsigned int cpu) return 0; } + +static int __init __cacheinfo_ready(void) +{ + cacheinfo_ready = true; + wake_up(&wait_cacheinfo_ready); + + return 0; +} +device_initcall_sync(__cacheinfo_ready); -- 2.25.1
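The threshold arithmetic mirrors x86, but divides by the size of the partid-and-pmg index space rather than the number of RMIDs. A worked example with invented sizes: a 32MiB smallest cache and 64 partids with 2 pmg bits gives 256 index values, so limbo frees an entry once its occupancy drops below 128KiB:

#include <stdio.h>

int main(void)
{
	unsigned int smallest_cache = 32u << 20;	/* 32MiB, made up */
	unsigned int num_partid = 64, pmg_bits = 2;
	unsigned int num_idx = num_partid << pmg_bits;	/* 256 */

	printf("realloc threshold = %u bytes\n", smallest_cache / num_idx);
	return 0;
}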

From: Amit Singh Tomar <amitsinght@marvell.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/... --------------------------- At the moment, the number of resource control groups the user can create is limited to 32. Remove the limit. ffs() returns '1' for bit 0, hence the existing code subtracts 1 from the index to get the CLOSID value. find_first_bit() returns the bit number, which does not need adjusting. Signed-off-by: Amit Singh Tomar <amitsinght@marvell.com> [ morse: fixed the off-by-one in the allocator and the wrong not-found value. Removed the limit. ] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 2 +- fs/resctrl/rdtgroup.c | 25 +++++++++++++------------ include/linux/resctrl.h | 5 ----- 3 files changed, 14 insertions(+), 18 deletions(-) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 9e69e0e1bb07..38e53c46a9ec 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -132,7 +132,7 @@ static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) */ u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored) { - return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID); + return mpam_partid_max + 1; } u32 resctrl_arch_system_num_rmid_idx(void) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 427aa26dc006..79d994fa813f 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -123,8 +123,8 @@ static bool resctrl_is_mbm_event(int e) } /* - * Trivial allocator for CLOSIDs. Since h/w only supports a small number, - * we can keep a bitmap of free CLOSIDs in a single integer. + * Trivial allocator for CLOSIDs. Use the bitmap API to manipulate a bitmap + * of free CLOSIDs. * * Using a global CLOSID across all resources has some advantages and * some drawbacks: @@ -137,7 +137,7 @@ static bool resctrl_is_mbm_event(int e) * - Our choices on how to configure each resource become progressively more * limited as the number of resources grows.
*/ -static unsigned long closid_free_map; +static unsigned long *closid_free_map; static int closid_free_map_len; int closids_supported(void) @@ -148,16 +148,17 @@ int closids_supported(void) static void closid_init(void) { struct resctrl_schema *s; - u32 rdt_min_closid = 32; + u32 rdt_min_closid = ~0; /* Compute rdt_min_closid across all resources */ list_for_each_entry(s, &resctrl_schema_all, list) rdt_min_closid = min(rdt_min_closid, s->num_closid); - closid_free_map = BIT_MASK(rdt_min_closid) - 1; + closid_free_map = bitmap_alloc(rdt_min_closid, GFP_KERNEL); + bitmap_fill(closid_free_map, rdt_min_closid); /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ - __clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map); + __clear_bit(RESCTRL_RESERVED_CLOSID, closid_free_map); closid_free_map_len = rdt_min_closid; } @@ -174,12 +175,12 @@ static int closid_alloc(void) return cleanest_closid; closid = cleanest_closid; } else { - closid = ffs(closid_free_map); - if (closid == 0) + closid = find_first_bit(closid_free_map, closid_free_map_len); + if (closid == closid_free_map_len) return -ENOSPC; - closid--; } - __clear_bit(closid, &closid_free_map); + + __clear_bit(closid, closid_free_map); return closid; } @@ -188,7 +189,7 @@ void closid_free(int closid) { lockdep_assert_held(&rdtgroup_mutex); - __set_bit(closid, &closid_free_map); + __set_bit(closid, closid_free_map); } /** @@ -202,7 +203,7 @@ bool closid_allocated(unsigned int closid) { lockdep_assert_held(&rdtgroup_mutex); - return !test_bit(closid, &closid_free_map); + return !test_bit(closid, closid_free_map); } /** diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 35c3bf2ec599..bd419324e822 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -30,11 +30,6 @@ int proc_resctrl_show(struct seq_file *m, /* max value for struct rdt_domain's mbps_val */ #define MBA_MAX_MBPS U32_MAX -/* - * Resctrl uses a u32 as a closid bitmap. The maximum closid is 32. - */ -#define RESCTRL_MAX_CLOSID 32 - /* * Resctrl uses u32 to hold the user-space config. The maximum bitmap size is * 32. -- 2.25.1
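The off-by-one described above is easy to demonstrate: ffs() is 1-based and uses 0 as its not-found value, while a find_first_bit()-style scan is 0-based. Both lines below print 2 for a map with closids 2 and 3 free:

#include <stdio.h>
#include <strings.h>	/* POSIX ffs() */

int main(void)
{
	unsigned long map = 0xc;	/* closids 2 and 3 free */
	int bit = 0;

	printf("ffs() - 1        -> %d\n", ffs(map) - 1);

	while (bit < 64 && !(map & (1ul << bit)))	/* find_first_bit() style */
		bit++;
	printf("find_first_bit() -> %d\n", bit);
	return 0;
}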

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT --------------------------- When the max memory bandwidth is exceeded, the partition does not use any more bandwidth until the memory bandwidth measurement for the partition falls below the configured max value. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index c2457c36ab12..19bd40814685 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1191,7 +1191,7 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { if (mpam_has_feature(mpam_feat_mbw_max, cfg)) - mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max); + mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max | MPAMCFG_MBW_MAX_HARDLIM); else mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); } -- 2.25.1
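For illustration, the value now written combines the fixed-point fraction with the hard-limit flag; MPAMCFG_MBW_MAX_HARDLIM is assumed here to be bit 31, following the MPAM register layout:

#include <stdint.h>
#include <stdio.h>

#define MPAMCFG_MBW_MAX_HARDLIM (1u << 31)	/* assumed bit position */

int main(void)
{
	uint16_t mbw_max = 0xc000;	/* 75%, from the earlier example */
	uint32_t regval = mbw_max | MPAMCFG_MBW_MAX_HARDLIM;

	printf("MBW_MAX write: %#x (hard limit set)\n", regval);
	return 0;
}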

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT -------------------------------- lockdep_is_cpus_held() is only built when CONFIG_HOTPLUG_CPU is defined. Otherwise, the kernel test robot reports the following link error: ld: vmlinux.o: in function `resctrl_get_domain_from_cpu': include/linux/resctrl.h:291: undefined reference to `lockdep_is_cpus_held' Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401252038.VOiTn1TA-lkp@intel.com/ Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- include/linux/resctrl.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index bd419324e822..f945a6672bce 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -303,7 +303,7 @@ resctrl_get_domain_from_cpu(int cpu, struct rdt_resource *r) * about locks this thread holds will lead to false positives. Check * someone is holding the CPUs lock. */ - if (IS_ENABLED(CONFIG_LOCKDEP)) + if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP)) lockdep_is_cpus_held(); list_for_each_entry_rcu(d, &r->domains, list) { -- 2.25.1

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT -------------------------------- rdtgroup_setup_default() is placed in the .init.text section, but is called by resctrl_init() in the .text section, which leads to an incorrect reference. Here is the warning reported by modpost: WARNING: modpost: vmlinux: section mismatch in reference: resctrl_init+0x6c (section: .text) -> rdtgroup_setup_default (section: .init.text) Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401261430.Xq824gQ5-lkp@intel.com/ Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- fs/resctrl/rdtgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 79d994fa813f..526a494080b6 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -3746,7 +3746,7 @@ static void rdtgroup_destroy_root(void) rdtgroup_default.kn = NULL; } -static void __init rdtgroup_setup_default(void) +static void rdtgroup_setup_default(void) { mutex_lock(&rdtgroup_mutex); -- 2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

This reverts commit 30fae95d10af693623d759c7a9cd56ade352a42a.

The reverted commit caused KVM to return E2BIG after performing the
arm64_check_features() check.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kvm/sys_regs.c | 74 ++++++++-------------------------------
 1 file changed, 14 insertions(+), 60 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e368571a71b6..c471af52dbeb 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -417,29 +417,21 @@ static bool trap_oslar_el1(struct kvm_vcpu *vcpu,
 	return true;
 }
 
-static bool trap_mpam(struct kvm_vcpu *vcpu,
-		      struct sys_reg_params *p,
-		      const struct sys_reg_desc *r)
+static bool workaround_bad_mpam_abi(struct kvm_vcpu *vcpu,
+				    struct sys_reg_params *p,
+				    const struct sys_reg_desc *r)
 {
-	u64 aa64pfr0_el1 = IDREG(vcpu->kvm, SYS_ID_AA64PFR0_EL1);
-
 	/*
-	 * What did we expose to the guest?
-	 * Earlier guests may have seen the ID bits, which can't be removed
-	 * without breaking migration, but MPAMIDR_EL1 can advertise all-zeroes,
-	 * indicating there are zero PARTID/PMG supported by the CPU, allowing
-	 * the other two trapped registers (MPAM1_EL1 and MPAM0_EL1) to be
-	 * treated as RAZ/WI.
+	 * The ID register can't be removed without breaking migration,
+	 * but MPAMIDR_EL1 can advertise all-zeroes, indicating there are zero
+	 * PARTID/PMG supported by the CPU, allowing the other two trapped
+	 * registers (MPAM1_EL1 and MPAM0_EL1) to be treated as RAZ/WI.
 	 * Emulating MPAM1_EL1 as RAZ/WI means the guest sees the MPAMEN bit
 	 * as clear, and realises MPAM isn't usable on this CPU.
 	 */
-	if (FIELD_GET(ID_AA64PFR0_EL1_MPAM_MASK, aa64pfr0_el1)) {
-		p->regval = 0;
-		return true;
-	}
+	p->regval = 0;
 
-	kvm_inject_undefined(vcpu);
-	return false;
+	return true;
 }
 
 static bool trap_oslsr_el1(struct kvm_vcpu *vcpu,
@@ -1259,36 +1251,6 @@ static s64 kvm_arm64_ftr_safe_value(u32 id, const struct arm64_ftr_bits *ftrp,
 	return arm64_ftr_safe_value(&kvm_ftr, new, cur);
 }
 
-static u64 kvm_arm64_ftr_max(struct kvm_vcpu *vcpu,
-			     const struct sys_reg_desc *rd)
-{
-	u64 pfr0, val = rd->reset(vcpu, rd);
-	u32 field, id = reg_to_encoding(rd);
-
-	/*
-	 * Some values may reset to a lower value than can be supported,
-	 * get the maximum feature value.
-	 */
-	switch (id) {
-	case SYS_ID_AA64PFR0_EL1:
-		pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
-
-		/*
-		 * MPAM resets to 0, but migration of MPAM=1 guests is needed.
-		 * See trap_mpam() for more.
-		 */
-		field = cpuid_feature_extract_unsigned_field(pfr0, ID_AA64PFR0_EL1_MPAM_SHIFT);
-		if (field == ID_AA64PFR0_EL1_MPAM_1) {
-			val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
-			val |= FIELD_PREP(ID_AA64PFR0_EL1_MPAM_MASK, ID_AA64PFR0_EL1_MPAM_1);
-		}
-
-		break;
-	}
-
-	return val;
-}
-
 /**
  * arm64_check_features() - Check if a feature register value constitutes
  * a subset of features indicated by the idreg's KVM sanitised limit.
@@ -1309,7 +1271,8 @@ static int arm64_check_features(struct kvm_vcpu *vcpu,
 	const struct arm64_ftr_bits *ftrp = NULL;
 	u32 id = reg_to_encoding(rd);
 	u64 writable_mask = rd->val;
-	u64 limit, mask = 0;
+	u64 limit = rd->reset(vcpu, rd);
+	u64 mask = 0;
 
 	/*
 	 * Hidden and unallocated ID registers may not have a corresponding
@@ -1323,7 +1286,6 @@ static int arm64_check_features(struct kvm_vcpu *vcpu,
 	if (!ftr_reg)
 		return -EINVAL;
 
-	limit = kvm_arm64_ftr_max(vcpu, rd);
 	ftrp = ftr_reg->ftr_bits;
 
 	for (; ftrp && ftrp->width; ftrp++) {
@@ -1527,14 +1489,6 @@ static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
 
 	val &= ~ID_AA64PFR0_EL1_AMU_MASK;
 
-	/*
-	 * MPAM is disabled by default as KVM also needs a set of PARTID to
-	 * program the MPAMVPMx_EL2 PARTID remapping registers with. But some
-	 * older kernels let the guest see the ID bit. Turning it on causes
-	 * the registers to be emulated as RAZ/WI. See trap_mpam() for more.
-	 */
-	val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
-
 	return val;
 }
 
@@ -2199,11 +2153,11 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_LOREA_EL1), trap_loregion },
 	{ SYS_DESC(SYS_LORN_EL1), trap_loregion },
 	{ SYS_DESC(SYS_LORC_EL1), trap_loregion },
-	{ SYS_DESC(SYS_MPAMIDR_EL1), trap_mpam },
+	{ SYS_DESC(SYS_MPAMIDR_EL1), workaround_bad_mpam_abi },
 	{ SYS_DESC(SYS_LORID_EL1), trap_loregion },
 
-	{ SYS_DESC(SYS_MPAM1_EL1), trap_mpam },
-	{ SYS_DESC(SYS_MPAM0_EL1), trap_mpam },
+	{ SYS_DESC(SYS_MPAM1_EL1), workaround_bad_mpam_abi },
+	{ SYS_DESC(SYS_MPAM0_EL1), workaround_bad_mpam_abi },
 	{ SYS_DESC(SYS_VBAR_EL1), access_rw, reset_val, VBAR_EL1, 0 },
 	{ SYS_DESC(SYS_DISR_EL1), NULL, reset_val, DISR_EL1, 0 },
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I92AK4

-----------------------------

Both the major version number and the minor version number need to be
checked to confirm whether the MPAM extension is supported on the
platform. Add MPAM v0.1 support to the cpufeature detection; see [1]
for details.

[1] https://developer.arm.com/documentation/ddi0595/2021-12/AArch64-Registers/ID-AA64PFR1-EL1--AArch64-Processor-Feature-Register-1?lang=en

Fixes: 21771eaaf93a ("arm64: cpufeature: discover CPU support for MPAM")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/include/asm/cpufeature.h |  7 +++++++
 arch/arm64/kernel/cpufeature.c      | 13 +++++++++----
 arch/arm64/kernel/cpuinfo.c         |  3 ++-
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 876e0ae300fd..4050293403c9 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -615,6 +615,13 @@ static inline bool id_aa64pfr0_mpam(u64 pfr0)
 	return val > 0;
 }
 
+static inline bool id_aa64pfr1_mpamfrac(u64 pfr1)
+{
+	u32 val = cpuid_feature_extract_unsigned_field(pfr1, ID_AA64PFR1_EL1_MPAM_frac_SHIFT);
+
+	return val > 0;
+}
+
 static inline bool id_aa64pfr1_mte(u64 pfr1)
 {
 	u32 val = cpuid_feature_extract_unsigned_field(pfr1, ID_AA64PFR1_EL1_MTE_SHIFT);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index d76225b4a00f..811f32cf42d6 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1082,7 +1082,8 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info)
 		cpacr_restore(cpacr);
 	}
 
-	if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0))
+	if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
+	    id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1))
 		init_cpu_ftr_reg(SYS_MPAMIDR_EL1, info->reg_mpamidr);
 
 	if (id_aa64pfr1_mte(info->reg_id_aa64pfr1))
@@ -1352,7 +1353,8 @@ void update_cpu_features(int cpu,
 		cpacr_restore(cpacr);
 	}
 
-	if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0)) {
+	if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
+	    id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1)) {
 		taint |= check_update_ftr_reg(SYS_MPAMIDR_EL1, cpu,
 					      info->reg_mpamidr, boot->reg_mpamidr);
 	}
@@ -2340,7 +2342,11 @@ cpucap_panic_on_conflict(const struct arm64_cpu_capabilities *cap)
 static bool __maybe_unused
 test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope)
 {
-	if (!has_cpuid_feature(entry, scope))
+	u64 pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
+	u64 pfr1 = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
+
+	if (!id_aa64pfr0_mpam(pfr0) &&
+	    !id_aa64pfr1_mpamfrac(pfr1))
 		return false;
 
 	/* Check firmware actually enabled MPAM on this cpu. */
@@ -2863,7 +2869,6 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.capability = ARM64_MPAM,
 		.matches = test_has_mpam,
 		.cpu_enable = cpu_enable_mpam,
-		ARM64_CPUID_FIELDS(ID_AA64PFR0_EL1, MPAM, 1)
 	},
 #endif
 	{},
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 1b1fe0f58a86..59dcbcfe1ba0 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -461,7 +461,8 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
 		__cpuinfo_store_cpu_32bit(&info->aarch32);
 
 	if (IS_ENABLED(CONFIG_ARM64_MPAM) &&
-	    id_aa64pfr0_mpam(info->reg_id_aa64pfr0))
+	    (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
+	     id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1)))
 		info->reg_mpamidr = read_cpuid(MPAMIDR_EL1);
 
 	cpuinfo_detect_icache_policy(info);
-- 
2.25.1
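As a worked illustration of the combined check (a standalone sketch, not
the kernel helpers): MPAM v1.x reports a non-zero ID_AA64PFR0_EL1.MPAM
field (bits [43:40]), while MPAM v0.1 reports that field as zero and sets
ID_AA64PFR1_EL1.MPAM_frac (bits [19:16]) instead, so the platform supports
the extension if either field is non-zero.

~~~
/* Sketch: MPAM presence from both ID registers (field positions per
 * the Arm ARM; userspace-style code for illustration only). */
#include <stdbool.h>
#include <stdint.h>

static unsigned int field4(uint64_t reg, unsigned int shift)
{
	return (reg >> shift) & 0xf;
}

bool platform_has_mpam(uint64_t pfr0, uint64_t pfr1)
{
	return field4(pfr0, 40) > 0 ||	/* ID_AA64PFR0_EL1.MPAM: v1.x */
	       field4(pfr1, 16) > 0;	/* ID_AA64PFR1_EL1.MPAM_frac: v0.1 */
}
~~~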

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I92AK4

-----------------------------

Skip MPAM initialization when running as the kdump kernel. Put the
check at the very start of MPAM initialization to avoid as much
pointless setup as possible.

Fixes: 9cb26074545b ("arm_mpam: Probe MSCs to find the supported partid/pmg values")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c
index 02f43334f078..8fd9dbdc2b01 100644
--- a/arch/arm64/kernel/mpam.c
+++ b/arch/arm64/kernel/mpam.c
@@ -6,6 +6,7 @@
 #include <linux/arm_mpam.h>
 #include <linux/jump_label.h>
 #include <linux/percpu.h>
+#include <linux/crash_dump.h>
 
 DEFINE_STATIC_KEY_FALSE(arm64_mpam_has_hcr);
 DEFINE_STATIC_KEY_FALSE(mpam_enabled);
@@ -14,6 +15,9 @@ DEFINE_PER_CPU(u64, arm64_mpam_current);
 
 static int __init arm64_mpam_register_cpus(void)
 {
+	if (is_kdump_kernel())
+		return 0;
+
 	u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
 	u16 partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, mpamidr);
 	u8 pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, mpamidr);
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA7L2U

---------------------------

Before dom_data_init() initializes the monitor resource, check whether
the MSC is monitor-capable. If the chip has no RDT-style monitoring
function, the idx_limit variable in dom_data_init() is set to 0 and,
after the kcalloc() allocation, rmid_ptrs points to a zero-length
buffer, which causes an illegal access fault when it is dereferenced.

Fixes: 13e249bf4944 ("x86/resctrl: Move the filesystem portions of resctrl to live in '/fs/'")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 fs/resctrl/monitor.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 710ba5bf9962..97170a4d3edf 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -813,13 +813,13 @@ int resctrl_mon_resource_init(void)
 	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
 	int ret;
 
+	if (!r->mon_capable)
+		return 0;
+
 	ret = dom_data_init(r);
 	if (ret)
 		return ret;
 
-	if (!r->mon_capable)
-		return 0;
-
 	l3_mon_evt_init(r);
 
 	if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9OXPO

--------------------------------

This reverts commit a01fda352718f178db2effe5e9d2c1b6a5602356.

If the BIOS firmware does not enable the MPAM function at EL3 while the
hardware has the MPAM ability, accessing the MPAM registers causes an
illegal-instruction fault.

It should also be noted that this brings a potential risk to kdump: if
the previous kernel left an unhandled MPAM exception interrupt, the
kdump kernel would crash once interrupts are re-enabled.

Fixes: a01fda352718 ("arm64: head.S: Initialise MPAM EL2 registers and disable traps")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/include/asm/el2_setup.h | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
index 1e2181820a0a..b7afaa026842 100644
--- a/arch/arm64/include/asm/el2_setup.h
+++ b/arch/arm64/include/asm/el2_setup.h
@@ -208,21 +208,6 @@
 	msr	spsr_el2, x0
 .endm
 
-.macro __init_el2_mpam
-#ifdef CONFIG_ARM64_MPAM
-	/* Memory Partioning And Monitoring: disable EL2 traps */
-	mrs	x1, id_aa64pfr0_el1
-	ubfx	x0, x1, #ID_AA64PFR0_EL1_MPAM_SHIFT, #4
-	cbz	x0, .Lskip_mpam_\@		// skip if no MPAM
-	msr_s	SYS_MPAM2_EL2, xzr		// use the default partition
-						// and disable lower traps
-	mrs_s	x0, SYS_MPAMIDR_EL1
-	tbz	x0, #17, .Lskip_mpam_\@		// skip if no MPAMHCR reg
-	msr_s	SYS_MPAMHCR_EL2, xzr		// clear TRAP_MPAMIDR_EL1 -> EL2
-.Lskip_mpam_\@:
-#endif /* CONFIG_ARM64_MPAM */
-.endm
-
 /**
  * Initialize EL2 registers to sane values. This should be called early on all
  * cores that were booted in EL2. Note that everything gets initialised as
@@ -240,7 +225,6 @@
 	__init_el2_stage2
 	__init_el2_gicv3
 	__init_el2_hstr
-	__init_el2_mpam
 	__init_el2_nvhe_idregs
 	__init_el2_cptr
 	__init_el2_fgt
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9OXPO

--------------------------------

If the BIOS firmware does not enable the MPAM function at EL3 while the
hardware has the MPAM ability, accessing the MPAM registers causes an
illegal-instruction fault.

To be compatible with this scenario and avoid crashes during startup,
enable the MPAM feature only when the 'arm64.mpam' boot parameter is
added to the kernel command line.

Fixes: 21771eaaf93a ("arm64: cpufeature: discover CPU support for MPAM")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/include/asm/cpufeature.h |  9 +++++++
 arch/arm64/kernel/cpufeature.c      | 39 ++++++++++++++++++++++++++---
 arch/arm64/kernel/cpuinfo.c         |  1 +
 arch/arm64/kernel/mpam.c            | 15 ++++++++---
 4 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 4050293403c9..e8db3f242f93 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -837,6 +837,15 @@ static inline bool cpus_support_mpam(void)
 	       cpus_have_final_cap(ARM64_MPAM);
 }
 
+#ifdef CONFIG_ARM64_MPAM
+bool mpam_detect_is_enabled(void);
+#else
+static inline bool mpam_detect_is_enabled(void)
+{
+	return false;
+}
+#endif
+
 static inline bool system_supports_lpa2(void)
 {
 	return cpus_have_final_cap(ARM64_HAS_LPA2);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 811f32cf42d6..3958738da32f 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1082,8 +1082,9 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info)
 		cpacr_restore(cpacr);
 	}
 
-	if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
-	    id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1))
+	if (mpam_detect_is_enabled() &&
+	    (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
+	     id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1)))
 		init_cpu_ftr_reg(SYS_MPAMIDR_EL1, info->reg_mpamidr);
 
 	if (id_aa64pfr1_mte(info->reg_id_aa64pfr1))
@@ -1353,8 +1354,9 @@ void update_cpu_features(int cpu,
 		cpacr_restore(cpacr);
 	}
 
-	if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
-	    id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1)) {
+	if (mpam_detect_is_enabled() &&
+	    (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
+	     id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1))) {
 		taint |= check_update_ftr_reg(SYS_MPAMIDR_EL1, cpu,
 					      info->reg_mpamidr, boot->reg_mpamidr);
 	}
@@ -2339,6 +2341,19 @@ cpucap_panic_on_conflict(const struct arm64_cpu_capabilities *cap)
 	return !!(cap->type & ARM64_CPUCAP_PANIC_ON_CONFLICT);
 }
 
+static bool __read_mostly mpam_force_enabled;
+bool mpam_detect_is_enabled(void)
+{
+	return mpam_force_enabled;
+}
+
+static int __init mpam_setup(char *str)
+{
+	mpam_force_enabled = true;
+	return 0;
+}
+early_param("arm64.mpam", mpam_setup);
+
 static bool __maybe_unused
 test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope)
 {
@@ -2349,6 +2364,12 @@ test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope)
 	    !id_aa64pfr1_mpamfrac(pfr1))
 		return false;
 
+	if (is_kdump_kernel())
+		return false;
+
+	if (!mpam_detect_is_enabled())
+		return false;
+
 	/* Check firmware actually enabled MPAM on this cpu. */
 	return (read_sysreg_s(SYS_MPAM1_EL1) & MPAM_SYSREG_EN);
 }
@@ -2356,6 +2377,16 @@ test_has_mpam(const struct arm64_cpu_capabilities *entry, int scope)
 static void __maybe_unused
 cpu_enable_mpam(const struct arm64_cpu_capabilities *entry)
 {
+	u64 idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
+
+	/*
+	 * Initialise MPAM EL2 registers and disable EL2 traps.
+	 */
+	write_sysreg_s(0, SYS_MPAM2_EL2);
+
+	if (idr & MPAMIDR_HAS_HCR)
+		write_sysreg_s(0, SYS_MPAMHCR_EL2);
+
 	/*
 	 * Access by the kernel (at EL1) should use the reserved PARTID
 	 * which is configured unrestricted. This avoids priority-inversion
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 59dcbcfe1ba0..f568fbefa51d 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -461,6 +461,7 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
 		__cpuinfo_store_cpu_32bit(&info->aarch32);
 
 	if (IS_ENABLED(CONFIG_ARM64_MPAM) &&
+	    mpam_detect_is_enabled() &&
 	    (id_aa64pfr0_mpam(info->reg_id_aa64pfr0) ||
 	     id_aa64pfr1_mpamfrac(info->reg_id_aa64pfr1)))
 		info->reg_mpamidr = read_cpuid(MPAMIDR_EL1);
diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c
index 8fd9dbdc2b01..d8891d5c59c0 100644
--- a/arch/arm64/kernel/mpam.c
+++ b/arch/arm64/kernel/mpam.c
@@ -15,13 +15,20 @@ DEFINE_PER_CPU(u64, arm64_mpam_current);
 
 static int __init arm64_mpam_register_cpus(void)
 {
+	u16 partid_max;
+	u64 mpamidr;
+	u8 pmg_max;
+
 	if (is_kdump_kernel())
 		return 0;
 
-	u64 mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
-	u16 partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, mpamidr);
-	u8 pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, mpamidr);
+	if (!mpam_cpus_have_feature())
+		return 0;
+
+	mpamidr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1);
+	partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, mpamidr);
+	pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, mpamidr);
 
 	return mpam_register_requestor(partid_max, pmg_max);
 }
-arch_initcall(arm64_mpam_register_cpus)
+arch_initcall(arm64_mpam_register_cpus);
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IATSCU

--------------------------------

Fix the redefinition of 'mpam_detect_is_enabled' when CONFIG_ARM64_MPAM
is enabled.

Fixes: 797c68d3b52d ("arm64/mpam: Check mpam_detect_is_enabled() before accessing MPAM registers")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/include/asm/cpufeature.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index e8db3f242f93..98d1dee3a5d6 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -837,14 +837,7 @@ static inline bool cpus_support_mpam(void)
 	       cpus_have_final_cap(ARM64_MPAM);
 }
 
-#ifdef CONFIG_ARM64_MPAM
 bool mpam_detect_is_enabled(void);
-#else
-static inline bool mpam_detect_is_enabled(void)
-{
-	return false;
-}
-#endif
 
 static inline bool system_supports_lpa2(void)
 {
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

When attempting to enable the MPAM memory bandwidth monitoring feature,
we found that MPAM differs from x86 RDT in its hardware topology. To
stay compatible with the existing resctrl interface, a small amount of
common code needs to be split out.

According to the x86 SDM, both total memory bandwidth and local memory
bandwidth are classified as part of L3 cache monitoring:

~~~
mon_data
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   └── mbm_total_bytes
└── mon_L3_01
    ├── llc_occupancy
    ├── mbm_local_bytes
    └── mbm_total_bytes
~~~

In contrast, arm64 MPAM implements L3 and memory monitoring as separate
L3 and MATA resources:

~~~
mon_data
├── mon_L3_01
│   ├── llc_occupancy
│   └── mbm_local_bytes
├── mon_L3_195
│   ├── llc_occupancy
│   └── mbm_local_bytes
├── mon_MB_00
│   └── mbm_total_bytes
└── mon_MB_01
    └── mbm_total_bytes
~~~

The common file fs/resctrl/monitor.c now calls
resctrl_arch_mon_resource_init() and no longer contains
l3_mon_evt_init(). Both x86 RDT and arm64 MPAM implement their own
versions of resctrl_arch_mon_resource_init() and l3_mon_evt_init(),
with the x86 part retaining its original logic unchanged.

Fixes: 13e249bf4944 ("x86/resctrl: Move the filesystem portions of resctrl to live in '/fs/'")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 52 +++++++++++++++++++++++
 drivers/platform/mpam/mpam_resctrl.c  | 59 +++++++++++++++++++++++++++
 fs/resctrl/internal.h                 | 15 -------
 fs/resctrl/monitor.c                  | 47 +--------------------
 include/linux/resctrl.h               | 17 ++++++++
 5 files changed, 129 insertions(+), 61 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 02fb9d87479a..8fcd25b1aa86 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -275,3 +275,55 @@ void __init intel_rdt_mbm_apply_quirk(void)
 	mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
 	mbm_cf = mbm_cf_table[cf_index].cf;
 }
+
+static struct mon_evt llc_occupancy_event = {
+	.name		= "llc_occupancy",
+	.evtid		= QOS_L3_OCCUP_EVENT_ID,
+};
+
+static struct mon_evt mbm_total_event = {
+	.name		= "mbm_total_bytes",
+	.evtid		= QOS_L3_MBM_TOTAL_EVENT_ID,
+};
+
+static struct mon_evt mbm_local_event = {
+	.name		= "mbm_local_bytes",
+	.evtid		= QOS_L3_MBM_LOCAL_EVENT_ID,
+};
+
+/*
+ * Initialize the event list for the resource.
+ *
+ * Note that MBM events are also part of RDT_RESOURCE_L3 resource
+ * because as per the SDM the total and local memory bandwidth
+ * are enumerated as part of L3 monitoring.
+ */
+static void l3_mon_evt_init(struct rdt_resource *r)
+{
+	INIT_LIST_HEAD(&r->evt_list);
+
+	if (resctrl_arch_is_llc_occupancy_enabled())
+		list_add_tail(&llc_occupancy_event.list, &r->evt_list);
+	if (resctrl_arch_is_mbm_total_enabled())
+		list_add_tail(&mbm_total_event.list, &r->evt_list);
+	if (resctrl_arch_is_mbm_local_enabled())
+		list_add_tail(&mbm_local_event.list, &r->evt_list);
+}
+
+int resctrl_arch_mon_resource_init(void)
+{
+	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
+
+	l3_mon_evt_init(r);
+
+	if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
+		mbm_total_event.configurable = true;
+		mbm_config_rftype_init("mbm_total_bytes_config");
+	}
+	if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) {
+		mbm_local_event.configurable = true;
+		mbm_config_rftype_init("mbm_local_bytes_config");
+	}
+
+	return 0;
+}
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index 38e53c46a9ec..b42cabffb277 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -1199,6 +1199,65 @@ int mpam_resctrl_offline_cpu(unsigned int cpu)
 	return 0;
 }
 
+static struct mon_evt llc_occupancy_event = {
+	.name		= "llc_occupancy",
+	.evtid		= QOS_L3_OCCUP_EVENT_ID,
+};
+
+static struct mon_evt mbm_total_event = {
+	.name		= "mbm_total_bytes",
+	.evtid		= QOS_L3_MBM_TOTAL_EVENT_ID,
+};
+
+static struct mon_evt mbm_local_event = {
+	.name		= "mbm_local_bytes",
+	.evtid		= QOS_L3_MBM_LOCAL_EVENT_ID,
+};
+
+/*
+ * Initialize the event list for the resource.
+ *
+ * Note that MBM events are also part of RDT_RESOURCE_L3 resource
+ * because as per the SDM the total and local memory bandwidth
+ * are enumerated as part of L3 monitoring.
+ */
+static void l3_mon_evt_init(struct rdt_resource *r)
+{
+	INIT_LIST_HEAD(&r->evt_list);
+
+	if (!r->mon_capable)
+		return;
+
+	if (r->rid == RDT_RESOURCE_L3) {
+		if (resctrl_arch_is_llc_occupancy_enabled())
+			list_add_tail(&llc_occupancy_event.list, &r->evt_list);
+
+		if (resctrl_arch_is_mbm_local_enabled())
+			list_add_tail(&mbm_local_event.list, &r->evt_list);
+	}
+
+	if ((r->rid == RDT_RESOURCE_MBA) &&
+	    resctrl_arch_is_mbm_total_enabled())
+		list_add_tail(&mbm_total_event.list, &r->evt_list);
+}
+
+int resctrl_arch_mon_resource_init(void)
+{
+	l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_L3));
+	l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_MBA));
+
+	if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
+		mbm_total_event.configurable = true;
+		mbm_config_rftype_init("mbm_total_bytes_config");
+	}
+	if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) {
+		mbm_local_event.configurable = true;
+		mbm_config_rftype_init("mbm_local_bytes_config");
+	}
+
+	return 0;
+}
+
 static int __init __cacheinfo_ready(void)
 {
 	cacheinfo_ready = true;
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 0cdabf19ff79..9e7ff00b8514 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -67,20 +67,6 @@ static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
 	return container_of(kfc, struct rdt_fs_context, kfc);
 }
 
-/**
- * struct mon_evt - Entry in the event list of a resource
- * @evtid:		event id
- * @name:		name of the event
- * @configurable:	true if the event is configurable
- * @list:		entry in &rdt_resource->evt_list
- */
-struct mon_evt {
-	enum resctrl_event_id	evtid;
-	char			*name;
-	bool			configurable;
-	struct list_head	list;
-};
-
 /**
  * union mon_data_bits - Monitoring details for each event file
  * @priv:		Used to store monitoring event data in @u
@@ -303,7 +289,6 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms,
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_domain *d);
 void __check_limbo(struct rdt_domain *d, bool force_free);
-void mbm_config_rftype_init(const char *config);
 void rdt_staged_configs_clear(void);
 int resctrl_find_cleanest_closid(void);
 
diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index 97170a4d3edf..402985f25ac6 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -774,40 +774,6 @@ static void dom_data_exit(void)
 	mutex_unlock(&rdtgroup_mutex);
 }
 
-static struct mon_evt llc_occupancy_event = {
-	.name		= "llc_occupancy",
-	.evtid		= QOS_L3_OCCUP_EVENT_ID,
-};
-
-static struct mon_evt mbm_total_event = {
-	.name		= "mbm_total_bytes",
-	.evtid		= QOS_L3_MBM_TOTAL_EVENT_ID,
-};
-
-static struct mon_evt mbm_local_event = {
-	.name		= "mbm_local_bytes",
-	.evtid		= QOS_L3_MBM_LOCAL_EVENT_ID,
-};
-
-/*
- * Initialize the event list for the resource.
- *
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
- */
-static void l3_mon_evt_init(struct rdt_resource *r)
-{
-	INIT_LIST_HEAD(&r->evt_list);
-
-	if (resctrl_arch_is_llc_occupancy_enabled())
-		list_add_tail(&llc_occupancy_event.list, &r->evt_list);
-	if (resctrl_arch_is_mbm_total_enabled())
-		list_add_tail(&mbm_total_event.list, &r->evt_list);
-	if (resctrl_arch_is_mbm_local_enabled())
-		list_add_tail(&mbm_local_event.list, &r->evt_list);
-}
-
 int resctrl_mon_resource_init(void)
 {
 	struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3);
@@ -820,18 +786,7 @@ int resctrl_mon_resource_init(void)
 	if (ret)
 		return ret;
 
-	l3_mon_evt_init(r);
-
-	if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) {
-		mbm_total_event.configurable = true;
-		mbm_config_rftype_init("mbm_total_bytes_config");
-	}
-	if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) {
-		mbm_local_event.configurable = true;
-		mbm_config_rftype_init("mbm_local_bytes_config");
-	}
-
-	return 0;
+	return resctrl_arch_mon_resource_init();
 }
 
 void resctrl_mon_resource_exit(void)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index f945a6672bce..bee15ba1099a 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -257,6 +257,20 @@ struct resctrl_mon_config_info {
 	int err;
 };
 
+/**
+ * struct mon_evt - Entry in the event list of a resource
+ * @evtid:		event id
+ * @name:		name of the event
+ * @configurable:	true if the event is configurable
+ * @list:		entry in &rdt_resource->evt_list
+ */
+struct mon_evt {
+	enum resctrl_event_id	evtid;
+	char			*name;
+	bool			configurable;
+	struct list_head	list;
+};
+
 /*
  * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to
  * struct resctrl_cpu_sync, or NULL.
@@ -408,4 +422,7 @@ extern unsigned int resctrl_rmid_realloc_limit;
 int resctrl_init(void);
 void resctrl_exit(void);
 
+int resctrl_arch_mon_resource_init(void);
+void mbm_config_rftype_init(const char *config);
+
 #endif /* _RESCTRL_H */
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

To support the MB monitoring feature, provide the mbm_total_bytes
interface under the resctrl fs to monitor MATA memory bandwidth.

To accommodate platforms where the number of MATA monitor resources is
smaller than the number of CLOSIDs, which would otherwise prevent
support for the MB (memory bandwidth) monitoring feature, remove this
constraint. This allows bandwidth monitoring to operate in
non-free-running mode.

Fixes: dc2005d467b3 ("untested: arm_mpam: resctrl: Add support for mbm counters")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index b42cabffb277..ad36036c0765 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -473,14 +473,6 @@ static bool class_has_usable_mbwu(struct mpam_class *class)
 	if (!mpam_has_feature(mpam_feat_msmon_mbwu, cprops))
 		return false;
 
-	/*
-	 * resctrl expects the bandwidth counters to be free running,
-	 * which means we need as many monitors as resctrl has
-	 * control/monitor groups.
-	 */
-	if (cprops->num_mbwu_mon < resctrl_arch_system_num_rmid_idx())
-		return false;
-
 	return (mpam_partid_max > 1) || (mpam_pmg_max != 0);
 }
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

The MSMON_MBWU register is implemented with different definitions
depending on the version of the MPAM specification, which affects the
results returned by the mbm_total_bytes interface.

So far, all HiSilicon chipsets follow the definition of the DDI0598
version: the value field of the MPAM Memory Bandwidth Usage Monitor
register indicates the memory bandwidth usage in bytes per second.
Since that is an instantaneous value rather than a cumulative count,
the register cannot overflow.

Fixes: 21ac9fc8f275 ("arm_mpam: Track bandwidth counter state for overflow and power management")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 19bd40814685..169e9d43c70c 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -915,6 +915,11 @@ static u64 mpam_msmon_overflow_val(struct mpam_msc_ris *ris)
 	return GENMASK_ULL(30, 0);
 }
 
+static bool resctrl_arch_would_mbm_overflow(void)
+{
+	return read_cpuid_implementor() != ARM_CPU_IMP_HISI;
+}
+
 static void __ris_msmon_read(void *arg)
 {
 	bool nrdy = false;
@@ -986,6 +991,18 @@ static void __ris_msmon_read(void *arg)
 		if (!mbwu_state)
 			break;
 
+		/*
+		 * Following the definition of the DDI0598 version,
+		 * the value field of MPAM Memory Bandwidth Usage Monitor Register
+		 * indicates the memory bandwidth usage in bytes per second,
+		 * instead of the scaled count of bytes transferred since the
+		 * monitor was last reset in the latest version (DDI0598D_b).
+		 */
+		if (ris->comp->class->type == MPAM_CLASS_MEMORY) {
+			if (!resctrl_arch_would_mbm_overflow())
+				break;
+		}
+
 		/* Add any pre-overflow value to the mbwu_state->val */
 		if (mbwu_state->prev_val > now)
 			overflow_val = mpam_msmon_overflow_val(ris) - mbwu_state->prev_val;
-- 
2.25.1
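A back-of-the-envelope sketch of why this matters (assumed bandwidth
figure, not from the patch): mpam_msmon_overflow_val() treats the counter
as 31 bits wide, and a cumulative byte count at a plausible 20 GB/s wraps
such a counter in about a tenth of a second, whereas an instantaneous
bytes-per-second value only ever needs to represent the bandwidth itself.

~~~
/* Sketch: time for a 31-bit cumulative byte counter to wrap. */
#include <stdio.h>

int main(void)
{
	const double counter_max = (1ULL << 31) - 1; /* GENMASK_ULL(30, 0) */
	const double bytes_per_sec = 20e9;           /* assumed 20 GB/s */

	/* ~0.107 s: far shorter than any sane polling interval */
	printf("wraps after %.3f s\n", counter_max / bytes_per_sec);
	return 0;
}
~~~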

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFGJ6

---------------------------------------------------

The mbwu_state array is allocated according to the number of monitors.
However, when selecting a monitor, the index is encoded from the closid
and rmid and may exceed the number of monitors. With KASAN enabled,
creating new monitor groups triggers the following error:

BUG: KASAN: slab-use-after-free in mpam_msmon_reset_mbwu+0x3b8/0x450
Write of size 1 at addr ffff0800c2350fa1 by task kworker/0:0/8

In addition, MBWU monitor readings are cumulative values, unlike CSU
monitors, so the same monitor must be used between two successive
readings. As a compromise, map the encoded rmid index onto num_mbwu_mon
with a modulo when the system's MBWU monitors cannot support
free-running mode.

Fixes: dc2005d467b3 ("untested: arm_mpam: resctrl: Add support for mbm counters")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index ad36036c0765..b60d5e3ab9fe 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -336,8 +336,10 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 {
 	int err;
 	u64 cdp_val;
+	u16 num_mbwu_mon;
 	struct mon_cfg cfg;
 	struct mpam_resctrl_dom *dom;
+	struct mpam_resctrl_res *res;
 	u32 mon = *(u32 *)arch_mon_ctx;
 	enum mpam_device_features type;
 
@@ -358,8 +360,16 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	}
 
 	cfg.mon = mon;
-	if (cfg.mon == USE_RMID_IDX)
-		cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid);
+	if (cfg.mon == USE_RMID_IDX) {
+		/*
+		 * The number of mbwu monitors can't support free run mode,
+		 * adapt the remainder of rmid to the num_mbwu_mon as a
+		 * compromise.
+		 */
+		res = container_of(r, struct mpam_resctrl_res, resctrl_res);
+		num_mbwu_mon = res->class->props.num_mbwu_mon;
+		cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid) % num_mbwu_mon;
+	}
 
 	cfg.match_pmg = true;
 	cfg.pmg = rmid;
@@ -386,13 +396,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
 			     u32 closid, u32 rmid, enum resctrl_event_id eventid)
 {
+	u16 num_mbwu_mon;
 	struct mon_cfg cfg;
 	struct mpam_resctrl_dom *dom;
+	struct mpam_resctrl_res *res;
 
 	if (eventid != QOS_L3_MBM_LOCAL_EVENT_ID)
 		return;
 
-	cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid);
+	res = container_of(r, struct mpam_resctrl_res, resctrl_res);
+	num_mbwu_mon = res->class->props.num_mbwu_mon;
+	cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid) % num_mbwu_mon;
 	cfg.match_pmg = true;
 	cfg.pmg = rmid;
 
-- 
2.25.1
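A minimal sketch of the compromise mapping (simplified encoding, not the
driver's exact helper): when the encoded index can exceed the number of
hardware MBWU monitors, it is folded back with a modulo, at the cost of
distinct groups sharing a monitor.

~~~
/* Sketch: fold an encoded monitor index into the available range. */
#include <stdint.h>

/* Simplified stand-in for resctrl_arch_rmid_idx_encode(). */
static uint32_t rmid_idx_encode(uint32_t closid, uint32_t rmid,
				uint32_t num_rmid)
{
	return closid * num_rmid + rmid;
}

uint32_t pick_mbwu_mon(uint32_t closid, uint32_t rmid,
		       uint32_t num_rmid, uint16_t num_mbwu_mon)
{
	/* The encoded index may exceed num_mbwu_mon, so wrap it:
	 * accuracy is traded for never indexing out of bounds. */
	return rmid_idx_encode(closid, rmid, num_rmid) % num_mbwu_mon;
}
~~~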

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

Correct the MBA granularity conversion formula.

Fixes: 58db5c68e84a ("untested: arm_mpam: resctrl: Add support for MB resource")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index b60d5e3ab9fe..9076274d5325 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -519,7 +519,7 @@ static u32 get_mba_granularity(struct mpam_props *cprops)
 		 * bwa_wd is the number of bits implemented in the 0.xxx
 		 * fixed point fraction. 1 bit is 50%, 2 is 25% etc.
 		 */
-		return MAX_MBA_BW / (cprops->bwa_wd + 1);
+		return MAX_MBA_BW / (1 << cprops->bwa_wd);
 	}
 
 	return 0;
-- 
2.25.1
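A quick sanity check of the corrected formula: bwa_wd is the number of
implemented fraction bits, so the granularity is MAX_MBA_BW / 2^bwa_wd,
not MAX_MBA_BW / (bwa_wd + 1). The standalone sketch below (MAX_MBA_BW
assumed to be 100) prints both versions side by side:

~~~
/* Sketch: MBA granularity for bwa_wd fraction bits. */
#include <stdio.h>

#define MAX_MBA_BW 100u

int main(void)
{
	for (unsigned int wd = 1; wd <= 4; wd++)
		printf("bwa_wd=%u: old=%u%%  fixed=%u%%\n", wd,
		       MAX_MBA_BW / (wd + 1),    /* old: 50, 33, 25, 20 */
		       MAX_MBA_BW / (1u << wd)); /* new: 50, 25, 12, 6  */
	return 0;
}
~~~

The fixed column matches the comment in the driver: 1 bit is 50%, 2 bits
is 25%, and so on.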

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

Correct the percentage to fixed-point fraction conversion formula.

Fixes: 58db5c68e84a ("untested: arm_mpam: resctrl: Add support for MB resource")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index 9076274d5325..b5a4be7c4fea 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -581,7 +581,7 @@ static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops)
 		break;
 	}
 
-	value &= GENMASK(15, 15 - cprops->bwa_wd);
+	value &= GENMASK(15, 15 - cprops->bwa_wd + 1);
 
 	return value;
 }
-- 
2.25.1
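The off-by-one is easiest to see by counting mask bits: the hardware
fraction occupies the top bwa_wd bits of the 16-bit mbw_max field, so the
mask must cover bits [15 : 16 - bwa_wd], i.e. GENMASK(15, 15 - bwa_wd + 1).
A standalone sketch, with GENMASK reimplemented locally for illustration:

~~~
/* Sketch: keep only the top 'wd' bits of a 16-bit fixed-point value. */
#include <stdio.h>
#include <stdint.h>

#define GENMASK(h, l) ((~0u << (l)) & (~0u >> (31 - (h))))

int main(void)
{
	unsigned int wd = 2;                     /* 2 implemented bits */
	uint16_t old = GENMASK(15, 15 - wd);     /* 0xe000: 3 bits, too wide */
	uint16_t new = GENMASK(15, 15 - wd + 1); /* 0xc000: 2 bits, correct  */

	printf("old=0x%04x fixed=0x%04x\n", old, new);
	return 0;
}
~~~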

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAFGJ6

--------------------------------

The cfg array of each MSC component is allocated according to the
PARTID number. The length of the array should be (mpam_partid_max + 1)
instead of mpam_partid_max. Otherwise, when resctrl_arch_get_config()
accesses the array, it raises a slab-out-of-bounds fault like the one
below:

BUG: KASAN: slab-out-of-bounds in resctrl_arch_get_config+0x404/0x7c8
Read of size 4 at addr ffff08280da29b64 by task mkdir/4156

Fixes: be74872ad2e3 ("arm_mpam: Allow configuration to be applied and restored during cpu online")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 169e9d43c70c..a227b0716195 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -1246,7 +1246,7 @@ struct reprogram_ris {
 /* Call with MSC lock held */
 static int mpam_reprogram_ris(void *_arg)
 {
-	u16 partid, partid_max;
+	u16 partid, num_partid;
 	struct reprogram_ris *arg = _arg;
 	struct mpam_msc_ris *ris = arg->ris;
 	struct mpam_config *cfg = arg->cfg;
@@ -1255,9 +1255,9 @@ static int mpam_reprogram_ris(void *_arg)
 		return 0;
 
 	spin_lock(&partid_max_lock);
-	partid_max = mpam_partid_max;
+	num_partid = resctrl_arch_get_num_closid(NULL);
 	spin_unlock(&partid_max_lock);
-	for (partid = 0; partid < partid_max; partid++)
+	for (partid = 0; partid < num_partid; partid++)
 		mpam_reprogram_ris_partid(ris, partid, cfg);
 
 	return 0;
@@ -1413,7 +1413,7 @@ static void mpam_reprogram_msc(struct mpam_msc *msc)
 		}
 
 		reset = true;
-		for (partid = 0; partid < mpam_partid_max; partid++) {
+		for (partid = 0; partid < resctrl_arch_get_num_closid(NULL); partid++) {
 			cfg = &ris->comp->cfg[partid];
 			if (cfg->features)
 				reset = false;
@@ -2116,7 +2116,8 @@ static int __allocate_component_cfg(struct mpam_component *comp)
 	if (comp->cfg)
 		return 0;
 
-	comp->cfg = kcalloc(mpam_partid_max, sizeof(*comp->cfg), GFP_KERNEL);
+	comp->cfg = kcalloc(resctrl_arch_get_num_closid(NULL),
+			    sizeof(*comp->cfg), GFP_KERNEL);
 	if (!comp->cfg)
 		return -ENOMEM;
 
@@ -2228,7 +2229,7 @@ void mpam_reset_class(struct mpam_class *class)
 
 	idx = srcu_read_lock(&mpam_srcu);
 	list_for_each_entry_rcu(comp, &class->components, class_list) {
-		memset(comp->cfg, 0, (mpam_partid_max * sizeof(*comp->cfg)));
+		memset(comp->cfg, 0, resctrl_arch_get_num_closid(NULL) * sizeof(*comp->cfg));
 
 		list_for_each_entry_rcu(ris, &comp->ris, comp_list) {
 			mutex_lock(&ris->msc->lock);
-- 
2.25.1
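The underlying off-by-one: PARTIDs run from 0 to mpam_partid_max
inclusive, so a per-PARTID array needs mpam_partid_max + 1 entries, which
is exactly the inclusive count that resctrl_arch_get_num_closid() is used
for here. A standalone sketch with illustrative numbers:

~~~
/* Sketch: sizing an array indexed by PARTID (illustrative values). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	unsigned int partid_max = 63;             /* highest valid PARTID   */
	unsigned int num_partid = partid_max + 1; /* count of valid PARTIDs */

	int *cfg = calloc(num_partid, sizeof(*cfg));
	if (!cfg)
		return 1;

	cfg[partid_max] = 1; /* in bounds only with the +1 sizing */
	printf("allocated %u entries\n", num_partid);
	free(cfg);
	return 0;
}
~~~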

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAKAKN

--------------------------------

According to the bit width of the fixed-point fraction implemented by
the hardware, calculate the maximum supported precision to reduce the
conversion granularity as much as possible, thereby improving the
conversion accuracy between percentages and the fixed-point fraction.

Fixes: 58db5c68e84a ("untested: arm_mpam: resctrl: Add support for MB resource")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 40 ++++++++++++++++++----------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index b5a4be7c4fea..dd1ccca8edcb 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -536,21 +536,31 @@ static u32 mbw_pbm_to_percent(unsigned long mbw_pbm, struct mpam_props *cprops)
 	return result;
 }
 
-static u32 mbw_max_to_percent(u16 mbw_max, struct mpam_props *cprops)
+static int get_wd_precision(u8 wd)
+{
+	int ret = (1 << wd) / MAX_MBA_BW;
+
+	if (!ret)
+		return 1;
+
+	return ret;
+}
+
+static u32 mbw_max_to_percent(u16 mbw_max, u8 wd)
 {
 	u8 bit;
-	u32 divisor = 2, value = 0;
+	u32 divisor = 2, value = 0, precision = get_wd_precision(wd);
 
 	for (bit = 15; bit; bit--) {
 		if (mbw_max & BIT(bit))
-			value += MAX_MBA_BW / divisor;
+			value += MAX_MBA_BW * precision / divisor;
 		divisor <<= 1;
 	}
 
-	return value;
+	return DIV_ROUND_UP(value, precision);
 }
 
-static u32 percent_to_mbw_pbm(u8 pc, struct mpam_props *cprops)
+static u32 percent_to_mbw_pbm(u32 pc, struct mpam_props *cprops)
 {
 	u32 granularity = get_mba_granularity(cprops);
 	u8 num_bits = pc / granularity;
@@ -562,26 +572,28 @@ static u32 percent_to_mbw_pbm(u8 pc, struct mpam_props *cprops)
 	return (1 << num_bits) - 1;
 }
 
-static u16 percent_to_mbw_max(u8 pc, struct mpam_props *cprops)
+static u16 percent_to_mbw_max(u32 pc, u8 wd)
 {
 	u8 bit;
-	u32 divisor = 2, value = 0;
+	u32 divisor = 2, value = 0, precision = get_wd_precision(wd);
 
-	if (WARN_ON_ONCE(cprops->bwa_wd > 15))
+	if (WARN_ON_ONCE(wd > 15))
 		return MAX_MBA_BW;
 
+	pc *= precision;
+
 	for (bit = 15; bit; bit--) {
-		if (pc >= MAX_MBA_BW / divisor) {
-			pc -= MAX_MBA_BW / divisor;
+		if (pc >= MAX_MBA_BW * precision / divisor) {
+			pc -= MAX_MBA_BW * precision / divisor;
 			value |= BIT(bit);
 		}
 
 		divisor <<= 1;
 
-		if (!pc || !(MAX_MBA_BW / divisor))
+		if (!pc || !(MAX_MBA_BW * precision / divisor))
 			break;
 	}
 
-	value &= GENMASK(15, 15 - cprops->bwa_wd + 1);
+	value &= GENMASK(15, 15 - wd + 1);
 
 	return value;
 }
@@ -947,7 +959,7 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
 		/* TODO: Scaling is not yet supported */
 		return mbw_pbm_to_percent(cfg->mbw_pbm, cprops);
 	case mpam_feat_mbw_max:
-		return mbw_max_to_percent(cfg->mbw_max, cprops);
+		return mbw_max_to_percent(cfg->mbw_max, cprops->bwa_wd);
 	default:
 		return -EINVAL;
 	}
@@ -989,7 +1001,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
 			mpam_set_feature(mpam_feat_mbw_part, &cfg);
 			break;
 		} else if (mpam_has_feature(mpam_feat_mbw_max, cprops)) {
-			cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops);
+			cfg.mbw_max = percent_to_mbw_max(cfg_val, cprops->bwa_wd);
 			mpam_set_feature(mpam_feat_mbw_max, &cfg);
 			break;
 		}
-- 
2.25.1
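The precision factor is a fixed scale applied to the intermediate
arithmetic so that integer division does not collapse fine-grained
fractions. A simplified standalone sketch (not the driver's exact code):
for bwa_wd = 8 the value 1/128 ≈ 0.78% converts to 0% without scaling but
to 1% with it.

~~~
/* Sketch: fixed-point fraction (top 'wd' of 16 bits) to percent,
 * with and without the precision scale factor. */
#include <stdio.h>
#include <stdint.h>

#define MAX_MBA_BW 100u

static unsigned int precision(unsigned int wd)
{
	unsigned int p = (1u << wd) / MAX_MBA_BW;

	return p ? p : 1;
}

static unsigned int to_percent(uint16_t mbw_max, unsigned int prec)
{
	unsigned int div = 2, val = 0;

	for (int bit = 15; bit; bit--, div <<= 1)
		if (mbw_max & (1u << bit))
			val += MAX_MBA_BW * prec / div;

	return (val + prec - 1) / prec; /* DIV_ROUND_UP */
}

int main(void)
{
	uint16_t v = 0x0200; /* bit 9: fraction 1/128, ~0.78% */

	printf("unscaled: %u%%\n", to_percent(v, 1));            /* prints 0% */
	printf("scaled:   %u%%\n", to_percent(v, precision(8))); /* prints 1% */
	return 0;
}
~~~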

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

Before configuring the PARTID of a monitor instance, make sure the
CFG_MON_SEL register has already selected the target instance. For the
same reason, ensure the PARTID has been configured properly before
reading the counter result of the monitor instance.

Fixes: bb66b4d115e5 ("arm_mpam: Add mpam_msmon_read() to read monitor value")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index a227b0716195..ef5ce9316e27 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -941,6 +941,9 @@ static void __ris_msmon_read(void *arg)
 		  FIELD_PREP(MSMON_CFG_MON_SEL_RIS, ris->ris_idx);
 	mpam_write_monsel_reg(msc, CFG_MON_SEL, mon_sel);
 
+	/* Selects a monitor instance to configure PARTID. */
+	wmb();
+
 	if (m->type == mpam_feat_msmon_mbwu) {
 		mbwu_state = &ris->mbwu_state[ctx->mon];
 		if (mbwu_state) {
@@ -961,6 +964,12 @@ static void __ris_msmon_read(void *arg)
 	if (config_mismatch || reset_on_next_read)
 		write_msmon_ctl_flt_vals(m, ctl_val, flt_val);
 
+	/*
+	 * Selects the monitor instance associated to the specified PARTID
+	 * to read counter value.
+	 */
+	wmb();
+
 	switch (m->type) {
 	case mpam_feat_msmon_csu:
 		now = mpam_read_monsel_reg(msc, CSU);
-- 
2.25.1
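The pattern being enforced is the classic select-then-access ordering on
device registers. A sketch under stated assumptions: the register offsets
below are hypothetical and the accessors are assumed to be relaxed MMIO
writes, which is what makes the explicit barrier necessary.

~~~
#include <linux/io.h>

#define CFG_MON_SEL	0x0800	/* hypothetical offsets, for illustration */
#define CFG_MBWU_CTL	0x0820

static void msmon_select_then_program(void __iomem *base,
				      u32 mon_sel, u32 ctl)
{
	writel_relaxed(mon_sel, base + CFG_MON_SEL); /* pick the instance   */
	wmb();			/* selection must reach the MSC first */
	writel_relaxed(ctl, base + CFG_MBWU_CTL);    /* programs *that* one */
}
~~~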

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAG93D

--------------------------------

When an MSC monitor instance is configured for the first time, the
monitor needs to wait the 'not ready' time before it returns a valid
value. Components made up of multiple MSCs may need values from each
MSC, so when reading a whole component's monitor values, the 'not
ready' time must be waited for every single monitor instance instead of
once per component.

Fixes: bb66b4d115e5 ("arm_mpam: Add mpam_msmon_read() to read monitor value")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c | 37 +++++++++++++---------------
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index ef5ce9316e27..52976e80351a 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -1033,11 +1033,14 @@ static void __ris_msmon_read(void *arg)
 	}
 
 	*(m->val) += now;
+	m->err = 0;
 }
 
 static int _msmon_read(struct mpam_component *comp, struct mon_read *arg)
 {
 	int err, idx;
+	bool read_again;
+	u64 wait_jiffies;
 	struct mpam_msc *msc;
 	struct mpam_msc_ris *ris;
 
@@ -1046,10 +1049,23 @@ static int _msmon_read(struct mpam_component *comp, struct mon_read *arg)
 		arg->ris = ris;
 
 		msc = ris->msc;
+		read_again = false;
+again:
 		mutex_lock(&msc->lock);
 		err = smp_call_function_any(&msc->accessibility,
 					    __ris_msmon_read, arg, true);
 		mutex_unlock(&msc->lock);
+
+		if (arg->err == -EBUSY && !read_again) {
+			read_again = true;
+
+			wait_jiffies = usecs_to_jiffies(comp->class->nrdy_usec);
+			while (wait_jiffies)
+				wait_jiffies = schedule_timeout_uninterruptible(wait_jiffies);
+
+			goto again;
+		}
+
 		if (!err && arg->err)
 			err = arg->err;
 		if (err)
@@ -1063,9 +1079,7 @@ static int _msmon_read(struct mpam_component *comp, struct mon_read *arg)
 int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx,
 		    enum mpam_device_features type, u64 *val)
 {
-	int err;
 	struct mon_read arg;
-	u64 wait_jiffies = 0;
 	struct mpam_props *cprops = &comp->class->props;
 
 	might_sleep();
@@ -1082,24 +1096,7 @@ int mpam_msmon_read(struct mpam_component *comp, struct mon_cfg *ctx,
 	arg.val = val;
 	*val = 0;
 
-	err = _msmon_read(comp, &arg);
-	if (err == -EBUSY)
-		wait_jiffies = usecs_to_jiffies(comp->class->nrdy_usec);
-
-	while (wait_jiffies)
-		wait_jiffies = schedule_timeout_uninterruptible(wait_jiffies);
-
-	if (err == -EBUSY) {
-		memset(&arg, 0, sizeof(arg));
-		arg.ctx = ctx;
-		arg.type = type;
-		arg.val = val;
-		*val = 0;
-
-		err = _msmon_read(comp, &arg);
-	}
-
-	return err;
+	return _msmon_read(comp, &arg);
 }
 
 void mpam_msmon_reset_all_mbwu(struct mpam_component *comp)
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAJUN4

--------------------------------

The smatch tool complains about the possible memory leaks shown below:

drivers/platform/mpam/mpam_resctrl.c:323 resctrl_arch_mon_ctx_alloc_no_wait()
warn: possible memory leak of 'ret'
drivers/platform/mpam/mpam_resctrl.c:326 resctrl_arch_mon_ctx_alloc_no_wait()
warn: possible memory leak of 'ret'

Only the QOS_L3_OCCUP_EVENT_ID path returns the allocated object; the
other case paths leak it. In the QOS_L3_MBM_LOCAL_EVENT_ID and
QOS_L3_MBM_TOTAL_EVENT_ID cases, resctrl_arch_mon_ctx_alloc_no_wait()
should return a pointer to the allocated object, not an invalid
pointer.

Fixes: 9119da143939 ("arm_mpam: resctrl: Allow resctrl to allocate monitors")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index dd1ccca8edcb..9aadf009b6a8 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -57,7 +57,6 @@ static DECLARE_WAIT_QUEUE_HEAD(wait_cacheinfo_ready);
 
 /* A dummy mon context to use when the monitors were allocated up front */
 u32 __mon_is_rmid_idx = USE_RMID_IDX;
-void *mon_is_rmid_idx = &__mon_is_rmid_idx;
 
 bool resctrl_arch_alloc_capable(void)
 {
@@ -267,12 +266,16 @@ static void *resctrl_arch_mon_ctx_alloc_no_wait(struct rdt_resource *r,
 
 		*ret = mpam_alloc_csu_mon(res->class);
 		return ret;
+
+	case QOS_L3_MBM_LOCAL_EVENT_ID:
 	case QOS_L3_MBM_TOTAL_EVENT_ID:
-		return mon_is_rmid_idx;
-	}
+		*ret = __mon_is_rmid_idx;
+		return ret;
 
-	return ERR_PTR(-EOPNOTSUPP);
+	default:
+		kfree(ret);
+		return ERR_PTR(-EOPNOTSUPP);
+	}
 }
 
 void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid)
@@ -300,15 +303,11 @@ void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
 	struct mpam_resctrl_res *res;
 	u32 mon = *(u32 *)arch_mon_ctx;
 
-	if (mon == USE_RMID_IDX)
-		return;
 	kfree(arch_mon_ctx);
-	arch_mon_ctx = NULL;
-
-	res = container_of(r, struct mpam_resctrl_res, resctrl_res);
 
 	switch (evtid) {
 	case QOS_L3_OCCUP_EVENT_ID:
+		res = container_of(r, struct mpam_resctrl_res, resctrl_res);
 		mpam_free_csu_mon(res->class, mon);
 		wake_up(&resctrl_mon_ctx_waiters);
 		return;
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAJUN4

--------------------------------

The smatch tool complains about the impossible condition warning shown
below:

drivers/platform/mpam/mpam_devices.c:391 get_cpumask_from_cache_id()
warn: impossible condition '(cache_id == ~0) => (0-u32max == u64max)'

cache_id is an unsigned 32-bit object, so keep the data types
consistent when comparing it.

Fixes: 848cefee5a21 ("arm_mpam: Add the class and component structures for ris firmware described")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 52976e80351a..335ee203ebfa 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -388,7 +388,7 @@ static int get_cpumask_from_cache_id(u32 cache_id, u32 cache_level,
 		 * during device_initcall(). Use cache_of_get_id().
 		 */
 		iter_cache_id = cache_of_get_id(iter);
-		if (cache_id == ~0UL) {
+		if (cache_id == (~0)) {
 			of_node_put(iter);
 			continue;
 		}
-- 
2.25.1
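The warning comes straight from C's usual arithmetic conversions:
comparing a u32 against ~0UL widens the u32 to 64 bits, where it can never
equal 0xffffffffffffffff, while comparing against (~0), an int, converts
-1 to the 32-bit value 0xffffffff, which can match. A standalone
demonstration:

~~~
/* Sketch: why 'u32 == ~0UL' is always false on LP64 targets. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t cache_id = ~0u; /* the "invalid id" sentinel, 0xffffffff */

	printf("== ~0UL : %d\n", cache_id == ~0UL); /* 0: widened to 64 bit   */
	printf("== (~0) : %d\n", cache_id == (~0)); /* 1: -1 becomes 0xffffffff */
	return 0;
}
~~~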

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAJUN4

--------------------------------

The smatch tool complains about the impossible condition warning shown
below:

drivers/platform/mpam/mpam_resctrl.c:414 resctrl_arch_rmid_read()
warn: impossible condition '(cfg.mon == ((~0) + 1)) => (0-u16max == 65536)'

The USE_RMID_IDX macro is a 32-bit value; as the assignment target,
cfg.mon should also be a 32-bit unsigned variable.

Fixes: 21ac9fc8f275 ("arm_mpam: Track bandwidth counter state for overflow and power management")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h
index 119660a53e16..a613e5a77a85 100644
--- a/drivers/platform/mpam/mpam_internal.h
+++ b/drivers/platform/mpam/mpam_internal.h
@@ -194,7 +194,7 @@ enum mon_filter_options {
 };
 
 struct mon_cfg {
-	u16 mon;
+	u32 mon;
 	u8 pmg;
 	bool match_pmg;
 	u32 partid;
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV

--------------------------------

When MBM is implemented according to the *DDI0598* protocol, i.e. the
mbm_total_bytes interface returns instantaneous bandwidth traffic,
there is no need to create delayed work to handle MBM counter overflow.

For both MBWU and CSU monitors, using per-CLOSID monitor instances
avoids the inaccurate results caused by monitor multiplexing, which can
occur when multiple control groups are monitored simultaneously.

Fixes: 9119da143939 ("arm_mpam: resctrl: Allow resctrl to allocate monitors")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 33 +++++++---------------------
 1 file changed, 8 insertions(+), 25 deletions(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index 9aadf009b6a8..728f03a2a3f6 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -254,7 +254,6 @@ struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l)
 static void *resctrl_arch_mon_ctx_alloc_no_wait(struct rdt_resource *r,
 						int evtid)
 {
-	struct mpam_resctrl_res *res;
 	u32 *ret = kmalloc(sizeof(*ret), GFP_KERNEL);
 
 	if (!ret)
@@ -262,11 +261,6 @@ static void *resctrl_arch_mon_ctx_alloc_no_wait(struct rdt_resource *r,
 
 	switch (evtid) {
 	case QOS_L3_OCCUP_EVENT_ID:
-		res = container_of(r, struct mpam_resctrl_res, resctrl_res);
-
-		*ret = mpam_alloc_csu_mon(res->class);
-		return ret;
-
 	case QOS_L3_MBM_LOCAL_EVENT_ID:
 	case QOS_L3_MBM_TOTAL_EVENT_ID:
 		*ret = __mon_is_rmid_idx;
@@ -300,21 +294,7 @@ void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid)
 void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
 			       void *arch_mon_ctx)
 {
-	struct mpam_resctrl_res *res;
-	u32 mon = *(u32 *)arch_mon_ctx;
-
 	kfree(arch_mon_ctx);
-
-	switch (evtid) {
-	case QOS_L3_OCCUP_EVENT_ID:
-		res = container_of(r, struct mpam_resctrl_res, resctrl_res);
-		mpam_free_csu_mon(res->class, mon);
-		wake_up(&resctrl_mon_ctx_waiters);
-		return;
-	case QOS_L3_MBM_TOTAL_EVENT_ID:
-	case QOS_L3_MBM_LOCAL_EVENT_ID:
-		return;
-	}
 }
 
 static enum mon_filter_options resctrl_evt_config_to_mpam(u32 local_evt_cfg)
@@ -335,7 +315,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 {
 	int err;
 	u64 cdp_val;
-	u16 num_mbwu_mon;
+	u16 num_mon;
 	struct mon_cfg cfg;
 	struct mpam_resctrl_dom *dom;
 	struct mpam_resctrl_res *res;
@@ -362,12 +342,15 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
 	if (cfg.mon == USE_RMID_IDX) {
 		/*
 		 * The number of mbwu monitors can't support free run mode,
-		 * adapt the remainder of rmid to the num_mbwu_mon as a
-		 * compromise.
+		 * adapt the remainder of rmid to the num_mon as compromise.
 		 */
 		res = container_of(r, struct mpam_resctrl_res, resctrl_res);
-		num_mbwu_mon = res->class->props.num_mbwu_mon;
-		cfg.mon = resctrl_arch_rmid_idx_encode(closid, rmid) % num_mbwu_mon;
+		if (type == mpam_feat_msmon_mbwu)
+			num_mon = res->class->props.num_mbwu_mon;
+		else
+			num_mon = res->class->props.num_csu_mon;
+
+		cfg.mon = closid % num_mon;
 	}
 
 	cfg.match_pmg = true;
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV

--------------------------------

When MBM is implemented according to the *DDI0598* protocol, i.e. the
mbm_total_bytes interface returns instantaneous bandwidth traffic,
there is no need to create delayed work to handle MBM counter overflow.

Fixes: 9119da143939 ("arm_mpam: resctrl: Allow resctrl to allocate monitors")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/x86/include/asm/resctrl.h       | 2 ++
 drivers/platform/mpam/mpam_devices.c | 8 ++++++++
 fs/resctrl/rdtgroup.c                | 9 +++++----
 include/linux/arm_mpam.h             | 1 +
 4 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 37b8d7dfcfed..c5f90edc3e20 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -94,6 +94,8 @@ static inline bool resctrl_arch_is_mbm_total_enabled(void)
 	return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID));
 }
 
+static inline bool resctrl_arch_would_mbm_overflow(void) { return true; }
+
 static inline bool resctrl_arch_is_mbm_local_enabled(void)
 {
 	return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID));
diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 335ee203ebfa..c1bf5d81f180 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -920,6 +920,14 @@ static bool resctrl_arch_would_mbm_overflow(void)
 	return read_cpuid_implementor() != ARM_CPU_IMP_HISI;
 }
 
+bool resctrl_arch_would_mbm_overflow(void)
+{
+	if (is_midr_in_range_list(read_cpuid_id(), mbwu_flowrate_list))
+		return false;
+
+	return true;
+}
+
 static void __ris_msmon_read(void *arg)
 {
 	bool nrdy = false;
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index 526a494080b6..e1697e6129d0 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2564,7 +2564,7 @@ static int rdt_get_tree(struct fs_context *fc)
 	if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable())
 		resctrl_mounted = true;
 
-	if (resctrl_is_mbm_enabled()) {
+	if (resctrl_is_mbm_enabled() && resctrl_arch_would_mbm_overflow()) {
 		list_for_each_entry(dom, &l3->domains, list)
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL,
 						   RESCTRL_PICK_ANY_CPU);
@@ -3784,7 +3784,7 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (resctrl_mounted && resctrl_arch_mon_capable())
 		rmdir_mondata_subdir_allrdtgrp(r, d->id);
 
-	if (resctrl_is_mbm_enabled())
+	if (resctrl_is_mbm_enabled() && resctrl_arch_would_mbm_overflow())
 		cancel_delayed_work(&d->mbm_over);
 	if (resctrl_arch_is_llc_occupancy_enabled() && has_busy_rmid(d)) {
 		/*
@@ -3855,7 +3855,7 @@ int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d)
 	if (err)
 		goto out_unlock;
 
-	if (resctrl_is_mbm_enabled()) {
+	if (resctrl_is_mbm_enabled() && resctrl_arch_would_mbm_overflow()) {
 		INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
 		mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL,
 					   RESCTRL_PICK_ANY_CPU);
@@ -3916,7 +3916,8 @@ void resctrl_offline_cpu(unsigned int cpu)
 
 	d = resctrl_get_domain_from_cpu(cpu, l3);
 	if (d) {
-		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu) {
+		if (resctrl_is_mbm_enabled() && cpu == d->mbm_work_cpu &&
+		    resctrl_arch_would_mbm_overflow()) {
 			cancel_delayed_work(&d->mbm_over);
 			mbm_setup_overflow_handler(d, 0, cpu);
 		}
diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h
index d70e4e726fe6..c80f220d98d6 100644
--- a/include/linux/arm_mpam.h
+++ b/include/linux/arm_mpam.h
@@ -66,6 +66,7 @@ bool resctrl_arch_mon_capable(void);
 bool resctrl_arch_is_llc_occupancy_enabled(void);
 bool resctrl_arch_is_mbm_local_enabled(void);
 bool resctrl_arch_is_mbm_total_enabled(void);
+bool resctrl_arch_would_mbm_overflow(void);
 
 /* reset cached configurations, then all devices */
 void resctrl_arch_reset_resources(void);
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV

---------------------------------------------------

Be compatible with hardware platforms that have different CPBM widths
among the cache MSCs within the same class, and prevent the CPBM
feature from being cleared.

Fixes: 931c09722546 ("arm_mpam: Merge supported features during mpam_enable() into mpam_class")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index c1bf5d81f180..157179e864a7 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -1857,9 +1857,10 @@ __resource_props_mismatch(struct mpam_msc_ris *ris, struct mpam_class *class)
 	/* Clear missing features */
 	cprops->features &= rprops->features;
 
-	/* Clear incompatible features */
+	/* Set cpbm_wd with the min cpbm_wd among all cache msc */
 	if (cprops->cpbm_wd != rprops->cpbm_wd)
-		mpam_clear_feature(mpam_feat_cpor_part, &cprops->features);
+		cprops->cpbm_wd = min(cprops->cpbm_wd, rprops->cpbm_wd);
+
 	if (cprops->mbw_pbm_bits != rprops->mbw_pbm_bits)
 		mpam_clear_feature(mpam_feat_mbw_part, &cprops->features);
 
-- 
2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV

---------------------------------------------------

To check whether the hardware supports the CMAX feature, it is not
enough to check the CMAX_WD field of the MPAMF_CCAP_IDR register; the
NO_CMAX field must be checked as well.

Fixes: b1576ca4ed95 ("arm_mpam: Probe and reset the rest of the features")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c  | 3 ++-
 drivers/platform/mpam/mpam_internal.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 157179e864a7..968d51e22675 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -577,7 +577,8 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris)
 		u32 ccap_features = mpam_read_partsel_reg(msc, CCAP_IDR);
 
 		props->cmax_wd = FIELD_GET(MPAMF_CCAP_IDR_CMAX_WD, ccap_features);
-		if (props->cmax_wd)
+		if (props->cmax_wd &&
+		    !FIELD_GET(MPAMF_CCAP_IDR_NO_CMAX, ccap_features))
 			mpam_set_feature(mpam_feat_ccap_part, props);
 	}
 
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h
index a613e5a77a85..d8854cc6f78e 100644
--- a/drivers/platform/mpam/mpam_internal.h
+++ b/drivers/platform/mpam/mpam_internal.h
@@ -391,6 +391,9 @@ void mpam_resctrl_exit(void);
 
 /* MPAMF_CCAP_IDR - MPAM features cache capacity partitioning ID register */
 #define MPAMF_CCAP_IDR_CMAX_WD		GENMASK(5, 0)
+#define MPAMF_CCAP_IDR_HAS_CMIN		BIT(29)
+#define MPAMF_CCAP_IDR_NO_CMAX		BIT(30)
+#define MPAMF_CCAP_IDR_HAS_CMAX_SOFTLIM	BIT(31)
 
 /* MPAMF_MBW_IDR - MPAM features memory bandwidth partitioning ID register */
 #define MPAMF_MBW_IDR_BWA_WD		GENMASK(5, 0)
-- 
2.25.1

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- When the resctrl fs is mounted, closid_init() allocates memory for closid_free_map, but it is never released when the fs is unmounted, which causes a memory leak. Release closid_free_map in rdt_kill_sb() to ensure there is no leak when the fs is unmounted. Fixes: 1e7284b7981c ("fs/resctrl: Remove the limit on the number of CLOSID") Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- fs/resctrl/rdtgroup.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index e1697e6129d0..2903c505ef17 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -162,6 +162,11 @@ static void closid_init(void) closid_free_map_len = rdt_min_closid; } +static void closid_exit(void) +{ + bitmap_free(closid_free_map); +} + static int closid_alloc(void) { int cleanest_closid; @@ -2512,10 +2517,8 @@ static int rdt_get_tree(struct fs_context *fc) goto out_root; ret = schemata_list_create(); - if (ret) { - schemata_list_destroy(); - goto out_ctx; - } + if (ret) + goto out_schemata_free; closid_init(); @@ -2524,13 +2527,13 @@ static int rdt_get_tree(struct fs_context *fc) ret = rdtgroup_add_files(rdtgroup_default.kn, flags); if (ret) - goto out_schemata_free; + goto out_closid; kernfs_activate(rdtgroup_default.kn); ret = rdtgroup_create_info_dir(rdtgroup_default.kn); if (ret < 0) - goto out_schemata_free; + goto out_closid; if (resctrl_arch_mon_capable()) { ret = mongroup_create_dir(rdtgroup_default.kn, @@ -2583,9 +2586,10 @@ static int rdt_get_tree(struct fs_context *fc) kernfs_remove(kn_mongrp); out_info: kernfs_remove(kn_info); +out_closid: + closid_exit(); out_schemata_free: schemata_list_destroy(); -out_ctx: rdt_disable_ctx(); out_root: rdtgroup_destroy_root(); @@ -2794,6 +2798,7 @@ static void rdt_kill_sb(struct super_block *sb) if (IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK)) rdt_pseudo_lock_release(); rdtgroup_default.mode = RDT_MODE_SHAREABLE; + closid_exit(); schemata_list_destroy(); rdtgroup_destroy_root(); if (resctrl_arch_alloc_capable()) -- 2.25.1
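The ownership rule the fix restores, sketched in user space: every allocation made on the mount path needs exactly one release on both the error-unwind path and the unmount path, in reverse order of setup. Names are illustrative stand-ins, not the resctrl functions.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

static unsigned long *free_map;

static int map_init(void)
{
    free_map = calloc(1, sizeof(*free_map));
    return free_map ? 0 : -1;
}

static void map_exit(void)
{
    free(free_map);     /* free(NULL) is safe, like bitmap_free() */
    free_map = NULL;
}

static int mount_fs(bool later_step_fails)
{
    if (map_init())
        return -1;
    if (later_step_fails)
        goto out_map;
    return 0;
out_map:
    map_exit();         /* unwind: mirrors the new out_closid label */
    return -1;
}

static void unmount_fs(void)
{
    map_exit();         /* the release added to rdt_kill_sb() */
}

int main(void)
{
    if (!mount_fs(false))
        unmount_fs();
    return 0;
}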

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAFXTZ -------------------------------- Initialize the cache level of struct rdt_resource in mpam_resctrl_resource_init(); otherwise, when the allocated cache size is obtained through a control group's size file, the cache level turns out to be uninitialized. Fixes: 1ca491c7bbcd ("arm_mpam: resctrl: Pick the caches we will use as resctrl resources") Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 728f03a2a3f6..6040b0d53600 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -750,6 +750,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) r->fflags = RFTYPE_RES_CACHE; r->default_ctrl = BIT_MASK(class->props.cpbm_wd) - 1; r->data_width = (class->props.cpbm_wd + 3) / 4; + r->cache_level = class->level; /* * Which bits are shared with other ...things... -- 2.25.1
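Why the uninitialized level shows up in the size file: resctrl converts a CBM into bytes by scaling the capacity of the cache found via r->cache_level, so a zero level breaks the lookup. The calculation itself, mirroring rdtgroup_cbm_to_size() with an assumed 32 MiB L3 and a 16-bit CBM:

#include <stdio.h>

static unsigned int cbm_to_size(unsigned int cache_bytes,
                                unsigned int cbm, unsigned int cbm_len)
{
    /* each implemented CBM bit is worth 1/cbm_len of the cache */
    return cache_bytes / cbm_len * __builtin_popcount(cbm);
}

int main(void)
{
    printf("%u bytes\n", cbm_to_size(32u << 20, 0x00ff, 16)); /* half: 16 MiB */
    return 0;
}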

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- If the CDP mount option is enabled, users are allowed to configure the code and data portions separately, but each MSC class still provides only one monitor instead of two. For that reason, add debugging information that reports the monitor values for the code and data streams separately. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_resctrl.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 6040b0d53600..660a1d13fdff 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -358,15 +358,19 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, cfg.opts = resctrl_evt_config_to_mpam(dom->mbm_local_evt_cfg); if (cdp_enabled) { - cfg.partid = closid << 1; + cfg.partid = resctrl_get_config_index(closid, CDP_DATA); err = mpam_msmon_read(dom->comp, &cfg, type, val); if (err) return err; - cfg.partid += 1; + cfg.partid = resctrl_get_config_index(closid, CDP_CODE); err = mpam_msmon_read(dom->comp, &cfg, type, &cdp_val); - if (!err) + if (!err) { + pr_debug("read monitor rmid %u %s:%u CODE/DATA: %lld/%lld\n", + resctrl_arch_rmid_idx_encode(closid, rmid), + r->name, dom->comp->comp_id, cdp_val, *val); *val += cdp_val; + } } else { cfg.partid = closid; err = mpam_msmon_read(dom->comp, &cfg, type, val); -- 2.25.1
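The read path this debug output describes, reduced to user space: with CDP, one control group owns two PARTIDs, so the single per-class monitor is read once for the data stream and once for the code stream and the results are summed. read_partid_counter() and the demo values are hypothetical stand-ins for mpam_msmon_read().

#include <stdio.h>
#include <stdint.h>

enum conf_type { CDP_DATA, CDP_CODE };

/* CDP keeps the stream in the PARTID's bottom bit, as in resctrl_get_config_index() */
static uint16_t config_index(uint16_t closid, enum conf_type t)
{
    return (closid << 1) + (t == CDP_CODE);
}

static long long read_partid_counter(uint16_t partid)
{
    static const long long demo[4] = { 100, 200, 300, 400 }; /* fake counters */

    return demo[partid & 3];
}

int main(void)
{
    uint16_t closid = 1;
    long long data = read_partid_counter(config_index(closid, CDP_DATA));
    long long code = read_partid_counter(config_index(closid, CDP_CODE));

    printf("CODE/DATA: %lld/%lld, reported: %lld\n", code, data, code + data);
    return 0;
}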

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAK8CS -------------------------------- When CDP is enabled, but the resource like MATA doesn't support it, we need to apply the same configuration to both of the CDP_CODE and CDP_DATA partids. In mpam_apply_config(), it will synchronize the configuration to the wrong comp->cfg[partid]. We will first update the staged_config with correct conf_type and then update the configuration to the comp->cfg corresponding to the partid by resctrl_arch_update_one(). In the same way, when get configuration from schemata interface by resctrl_arch_get_config(), we need to get the configuration of the CDP_CODE partid. Fixes: 58db5c68e84a ("untested: arm_mpam: resctrl: Add support for MB resource") Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 4 ++++ drivers/platform/mpam/mpam_resctrl.c | 31 +++++++++++----------------- fs/resctrl/ctrlmondata.c | 28 ++++++++++++++++++++++--- include/linux/arm_mpam.h | 1 + 4 files changed, 42 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index c5f90edc3e20..b4a0a99bde6a 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -222,6 +222,10 @@ void resctrl_cpu_detect(struct cpuinfo_x86 *c); bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); +static inline bool resctrl_arch_hide_cdp(enum resctrl_res_level rid) +{ + return false; +}; #else diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 660a1d13fdff..60cc53530cbb 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -119,7 +119,7 @@ int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable) return 0; } -static bool mpam_resctrl_hide_cdp(enum resctrl_res_level rid) +bool resctrl_arch_hide_cdp(enum resctrl_res_level rid) { return cdp_enabled && !resctrl_arch_get_cdp_enabled(rid); } @@ -913,7 +913,16 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, dom = container_of(d, struct mpam_resctrl_dom, resctrl_dom); cprops = &res->class->props; - partid = resctrl_get_config_index(closid, type); + /* + * When CDP is enabled, but the resource doesn't support it, we + * need to get the configuration from the CDP_CODE resctrl_conf_type + * which is same as the CDP_DATA one. + */ + if (resctrl_arch_hide_cdp(r->rid)) + partid = resctrl_get_config_index(closid, CDP_CODE); + else + partid = resctrl_get_config_index(closid, type); + cfg = &dom->comp->cfg[partid]; switch (r->rid) { @@ -955,7 +964,6 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, u32 closid, enum resctrl_conf_type t, u32 cfg_val) { - int err; u32 partid; struct mpam_config cfg; struct mpam_props *cprops; @@ -997,22 +1005,7 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, return -EINVAL; } - /* - * When CDP is enabled, but the resource doesn't support it, we need to - * apply the same configuration to the other partid. 
- */ - if (mpam_resctrl_hide_cdp(r->rid)) { - partid = resctrl_get_config_index(closid, CDP_CODE); - err = mpam_apply_config(dom->comp, partid, &cfg); - if (err) - return err; - - partid = resctrl_get_config_index(closid, CDP_DATA); - return mpam_apply_config(dom->comp, partid, &cfg); - - } else { - return mpam_apply_config(dom->comp, partid, &cfg); - } + return mpam_apply_config(dom->comp, partid, &cfg); } /* TODO: this is IPI heavy */ diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index ae9584be4348..e57faaf11189 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -71,15 +71,15 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) return true; } -static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, - struct rdt_domain *d) +static int parse_bw_conf_type(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d, enum resctrl_conf_type conf_type) { struct resctrl_staged_config *cfg; u32 closid = data->rdtgrp->closid; struct rdt_resource *r = s->res; u32 bw_val; - cfg = &d->staged_config[s->conf_type]; + cfg = &d->staged_config[conf_type]; if (cfg->have_new_ctrl) { rdt_last_cmd_printf("Duplicate domain %d\n", d->id); return -EINVAL; @@ -99,6 +99,28 @@ static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, return 0; } +static int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_domain *d) +{ + struct rdt_resource *r = s->res; + int err; + + /* + * When CDP is enabled, but the resource doesn't support it, we + * need to apply the same configuration to both of the CDP_CODE + * and CDP_DATA resctrl_conf_type. + */ + if (resctrl_arch_hide_cdp(r->rid)) { + err = parse_bw_conf_type(data, s, d, CDP_CODE); + if (err) + return err; + + return parse_bw_conf_type(data, s, d, CDP_DATA); + } + + return parse_bw_conf_type(data, s, d, s->conf_type); +} + /* * Check whether a cache bit mask is valid. * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID: diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index c80f220d98d6..aef951595f65 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -73,6 +73,7 @@ void resctrl_arch_reset_resources(void); bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level ignored); int resctrl_arch_set_cdp_enabled(enum resctrl_res_level ignored, bool enable); +bool resctrl_arch_hide_cdp(enum resctrl_res_level rid); bool resctrl_arch_match_closid(struct task_struct *tsk, u32 closid); bool resctrl_arch_match_rmid(struct task_struct *tsk, u32 closid, u32 rmid); void resctrl_arch_set_cpu_default_closid(int cpu, u32 closid); -- 2.25.1
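The relocated duplication, modelled in user space: instead of applying one parsed value to two PARTIDs at the arch layer, the value is staged once per resctrl_conf_type, so resctrl_arch_update_one() stays a one-PARTID operation. Structures here are simplified stand-ins for the staged_config handling.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

enum conf_type { CDP_DATA, CDP_CODE, CDP_NUM_TYPES };

struct staged { bool have_new_ctrl; uint32_t new_ctrl; };

static struct staged staged_config[CDP_NUM_TYPES];

static int stage(enum conf_type t, uint32_t val)
{
    if (staged_config[t].have_new_ctrl)
        return -1;  /* duplicate domain in the schema line */
    staged_config[t].new_ctrl = val;
    staged_config[t].have_new_ctrl = true;
    return 0;
}

static int stage_bw(uint32_t val, bool resource_hides_cdp)
{
    if (resource_hides_cdp) {
        int err = stage(CDP_CODE, val);

        return err ? err : stage(CDP_DATA, val);
    }
    return stage(CDP_DATA, val);    /* normally: the schema's own type */
}

int main(void)
{
    stage_bw(50, true);
    printf("code=%u data=%u\n", staged_config[CDP_CODE].new_ctrl,
           staged_config[CDP_DATA].new_ctrl);
    return 0;
}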

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- In the following patches, we would introduce many QoS configuration in order to enhance MPAM's capability for isolating interference and schedule shared resources. As a prerequisite patch, it introduces the resctrl_schema_fmt, an enumeration type that indicates the format in which resource configurations are issued by users. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 4 ++ drivers/platform/mpam/mpam_resctrl.c | 70 ++++++++++++++++------------ fs/resctrl/ctrlmondata.c | 14 +++--- fs/resctrl/rdtgroup.c | 13 ++++-- include/linux/resctrl.h | 13 +++++- 5 files changed, 74 insertions(+), 40 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 2613d617e8bd..1454aa1a1f84 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -71,6 +71,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .domains = domain_init(RDT_RESOURCE_L3), .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, + .schema_fmt = RESCTRL_SCHEMA_BITMAP, }, .msr_base = MSR_IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, @@ -84,6 +85,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .domains = domain_init(RDT_RESOURCE_L2), .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, + .schema_fmt = RESCTRL_SCHEMA_BITMAP, }, .msr_base = MSR_IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, @@ -97,6 +99,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .domains = domain_init(RDT_RESOURCE_MBA), .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, + .schema_fmt = RESCTRL_SCHEMA_RANGE, }, }, [RDT_RESOURCE_SMBA] = @@ -108,6 +111,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .domains = domain_init(RDT_RESOURCE_SMBA), .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, + .schema_fmt = RESCTRL_SCHEMA_RANGE, }, }, }; diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 60cc53530cbb..8a4c482adc82 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -537,6 +537,9 @@ static u32 mbw_max_to_percent(u16 mbw_max, u8 wd) u8 bit; u32 divisor = 2, value = 0, precision = get_wd_precision(wd); + if (mbw_max == GENMASK(15, 15 - wd + 1)) + return MAX_MBA_BW; + for (bit = 15; bit; bit--) { if (mbw_max & BIT(bit)) value += MAX_MBA_BW * precision / divisor; @@ -566,6 +569,9 @@ static u16 percent_to_mbw_max(u32 pc, u8 wd) if (WARN_ON_ONCE(wd > 15)) return MAX_MBA_BW; + if (pc == MAX_MBA_BW) + return GENMASK(15, 15 - wd + 1); + pc *= precision; for (bit = 15; bit; bit--) { @@ -633,22 +639,25 @@ static void mpam_resctrl_pick_caches(void) if (mpam_has_feature(mpam_feat_msmon_csu, cprops)) update_rmid_limits(cache_size); - if (class->level == 2) { - res = &mpam_resctrl_exports[RDT_RESOURCE_L2]; - res->resctrl_res.name = "L2"; - } else { - res = &mpam_resctrl_exports[RDT_RESOURCE_L3]; - res->resctrl_res.name = "L3"; + if (has_cpor) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2]; + res->resctrl_res.name = "L2"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3]; + res->resctrl_res.name = "L3"; + } + res->class = class; } - res->class = class; } srcu_read_unlock(&mpam_srcu, idx); } static void mpam_resctrl_pick_mba(void) { - struct mpam_class *class, *candidate_class = NULL; struct mpam_resctrl_res *res; + struct mpam_class *class; + bool has_mba; int idx; lockdep_assert_cpus_held(); @@ -657,6 +666,8 @@ static void 
mpam_resctrl_pick_mba(void) list_for_each_entry_rcu(class, &mpam_classes, classes_list) { struct mpam_props *cprops = &class->props; + has_mba = class_has_usable_mba(cprops); + if (class->level < 3) continue; @@ -666,21 +677,13 @@ static void mpam_resctrl_pick_mba(void) if (!cpumask_equal(&class->affinity, cpu_possible_mask)) continue; - /* - * mba_sc reads the mbm_local counter, and waggles the MBA controls. - * mbm_local is implicitly part of the L3, pick a resouce to be MBA - * that as close as possible to the L3. - */ - if (!candidate_class || class->level < candidate_class->level) - candidate_class = class; + if (has_mba) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MBA]; + res->class = class; + res->resctrl_res.name = "MB"; + } } srcu_read_unlock(&mpam_srcu, idx); - - if (candidate_class) { - res = &mpam_resctrl_exports[RDT_RESOURCE_MBA]; - res->class = candidate_class; - res->resctrl_res.name = "MB"; - } } bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) @@ -734,14 +737,15 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) { struct mpam_class *class = res->class; + struct mpam_props *cprops = &class->props; struct rdt_resource *r = &res->resctrl_res; bool has_mbwu = class_has_usable_mbwu(class); + bool has_csu = cache_has_usable_csu(class); /* Is this one of the two well-known caches? */ - if (res->resctrl_res.rid == RDT_RESOURCE_L2 || - res->resctrl_res.rid == RDT_RESOURCE_L3) { - bool has_csu = cache_has_usable_csu(class); - + switch (res->resctrl_res.rid) { + case RDT_RESOURCE_L2: + case RDT_RESOURCE_L3: /* TODO: Scaling is not yet supported */ r->cache.cbm_len = class->props.cpbm_wd; r->cache.arch_has_sparse_bitmasks = true; @@ -751,6 +755,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) /* TODO: kill these properties off as they are derivatives */ r->format_str = "%d=%0*x"; + r->schema_fmt = RESCTRL_SCHEMA_BITMAP; r->fflags = RFTYPE_RES_CACHE; r->default_ctrl = BIT_MASK(class->props.cpbm_wd) - 1; r->data_width = (class->props.cpbm_wd + 3) / 4; @@ -764,7 +769,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) */ r->cache.shareable_bits = r->default_ctrl; - if (mpam_has_feature(mpam_feat_cpor_part, &class->props)) { + if (mpam_has_feature(mpam_feat_cpor_part, cprops)) { r->alloc_capable = true; exposed_alloc_capable = true; } @@ -788,11 +793,12 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) class->level == 3) { r->mon_capable = true; } - } else if (res->resctrl_res.rid == RDT_RESOURCE_MBA) { - struct mpam_props *cprops = &class->props; + break; + case RDT_RESOURCE_MBA: /* TODO: kill these properties off as they are derivatives */ r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; r->fflags = RFTYPE_RES_MB; r->default_ctrl = MAX_MBA_BW; r->data_width = 3; @@ -814,6 +820,10 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) mbm_total_class = class; r->mon_capable = true; } + break; + + default: + break; } if (r->mon_capable) { @@ -827,7 +837,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) * For mpam, each control group has its own pmg/rmid * space. 
*/ - r->num_rmid = 1; + r->num_rmid = mpam_partid_max * mpam_pmg_max; } return 0; @@ -983,6 +993,8 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, if (!r->alloc_capable || partid >= resctrl_arch_get_num_closid(r)) return -EINVAL; + cfg = dom->comp->cfg[partid]; + switch (r->rid) { case RDT_RESOURCE_L2: case RDT_RESOURCE_L3: diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index e57faaf11189..d2793e6407e8 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -45,14 +45,15 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) * Only linear delay values is supported for current Intel SKUs. */ if (!r->membw.delay_linear && r->membw.arch_needs_linear) { - rdt_last_cmd_puts("No support for non-linear MB domains\n"); + rdt_last_cmd_printf("No support for non-linear %s domains\n", + r->name); return false; } ret = kstrtou32(buf, 10, &bw); if (ret) { - rdt_last_cmd_printf("Invalid MB value %s\n", buf); - return false; + rdt_last_cmd_printf("Non-decimal digit in %s value %s\n", + r->name, buf); } /* Nothing else to do if software controller is enabled. */ @@ -62,8 +63,9 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) } if (bw < r->membw.min_bw || bw > r->default_ctrl) { - rdt_last_cmd_printf("MB value %u out of range [%d,%d]\n", - bw, r->membw.min_bw, r->default_ctrl); + rdt_last_cmd_printf("%s value %d out of range [%d,%d]\n", + r->name, bw, r->membw.min_bw, + r->default_ctrl); return false; } @@ -232,7 +234,7 @@ static int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, static ctrlval_parser_t *get_parser(struct rdt_resource *res) { - if (res->fflags & RFTYPE_RES_CACHE) + if (res->schema_fmt == RESCTRL_SCHEMA_BITMAP) return &parse_cbm; else return &parse_bw; diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 2903c505ef17..cc40b178ddee 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -1541,11 +1541,11 @@ static int rdtgroup_size_show(struct kernfs_open_file *of, ctrl = resctrl_arch_get_config(r, d, closid, type); - if (r->rid == RDT_RESOURCE_MBA || - r->rid == RDT_RESOURCE_SMBA) - size = ctrl; - else + if (r->rid == RDT_RESOURCE_L3 || + r->rid == RDT_RESOURCE_L2) size = rdtgroup_cbm_to_size(r, d, ctrl); + else + size = ctrl; } seq_printf(s, "%d=%u", d->id, size); sep = true; @@ -2143,6 +2143,11 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) /* loop over enabled controls, these are all alloc_capable */ list_for_each_entry(s, &resctrl_schema_all, list) { r = s->res; + + /* Not supported yet */ + if (r->rid > RDT_RESOURCE_SMBA) + continue; + fflags = r->fflags | RFTYPE_CTRL_INFO; ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); if (ret) diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index bee15ba1099a..ad58fb43957e 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -177,6 +177,16 @@ struct resctrl_membw { u32 *mb_map; }; +/** + * enum resctrl_schema_fmt - The format user-space provides for a schema. + * @RESCTRL_SCHEMA_BITMAP: The schema is a bitmap in hex. + * @RESCTRL_SCHEMA_RANGE: The schema is a decimal number. 
+ */ +enum resctrl_schema_fmt { + RESCTRL_SCHEMA_BITMAP, + RESCTRL_SCHEMA_RANGE, +}; + /** * struct rdt_resource - attributes of a resctrl resource * @rid: The index of the resource @@ -214,6 +224,7 @@ struct rdt_resource { unsigned int mbm_cfg_mask; unsigned long fflags; bool cdp_capable; + enum resctrl_schema_fmt schema_fmt; }; /* @@ -237,7 +248,7 @@ struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l); */ struct resctrl_schema { struct list_head list; - char name[8]; + char name[16]; enum resctrl_conf_type conf_type; struct rdt_resource *res; u32 num_closid; -- 2.25.1
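What the new enum buys, as a standalone sketch: the parser is chosen from the schema's declared format rather than inferred from RFTYPE_RES_CACHE, so a bandwidth-style control on a cache resource (as later patches in this series add) picks the right parser. parse_bitmap()/parse_range() stand in for parse_cbm()/parse_bw().

#include <stdio.h>
#include <stdlib.h>

enum schema_fmt { SCHEMA_BITMAP, SCHEMA_RANGE };

static unsigned long parse_bitmap(const char *s) { return strtoul(s, NULL, 16); }
static unsigned long parse_range(const char *s)  { return strtoul(s, NULL, 10); }

static unsigned long parse_val(enum schema_fmt fmt, const char *s)
{
    return fmt == SCHEMA_BITMAP ? parse_bitmap(s) : parse_range(s);
}

int main(void)
{
    /* "ff" is a hex bitmap for L2/L3, "50" a decimal percentage for MB */
    printf("bitmap: %#lx\n", parse_val(SCHEMA_BITMAP, "ff"));
    printf("range:  %lu\n", parse_val(SCHEMA_RANGE, "50"));
    return 0;
}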

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- Implement the cmax interface so that users can configure the maximum cache capacity through the schemata interface under the resctrl fs. Users can update the CMAX settings via the {L2,L3}MAX:${domain_id}=${value} entries in the schemata file. While the cache occupancy of the target PARTID stays below the cmax configuration, the PARTID is allowed to use the shared resources with higher priority. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 10 ++++-- drivers/platform/mpam/mpam_internal.h | 3 +- drivers/platform/mpam/mpam_resctrl.c | 51 ++++++++++++++++++++++++++- include/linux/resctrl_types.h | 4 +++ 4 files changed, 63 insertions(+), 5 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 968d51e22675..e7b5af32aa2d 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1210,6 +1210,13 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, rprops->cpbm_wd); } + if (mpam_has_feature(mpam_feat_ccap_part, rprops)) { + if (mpam_has_feature(mpam_feat_ccap_part, cfg)) + mpam_write_partsel_reg(msc, CMAX, cfg->cmax); + else + mpam_write_partsel_reg(msc, CMAX, cmax); + } + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) { if (mpam_has_feature(mpam_feat_mbw_part, cfg)) mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm); @@ -1231,9 +1238,6 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); - if (mpam_has_feature(mpam_feat_ccap_part, rprops)) - mpam_write_partsel_reg(msc, CMAX, cmax); - if (mpam_has_feature(mpam_feat_intpri_part, rprops) || mpam_has_feature(mpam_feat_dspri_part, rprops)) { /* aces high?
*/ diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index d8854cc6f78e..c650eb878cd2 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -52,7 +52,7 @@ struct mpam_msc struct mpam_msc * __percpu *error_dev_id; atomic_t online_refs; - + struct mutex lock; bool probed; bool error_irq_requested; @@ -162,6 +162,7 @@ struct mpam_config { u32 cpbm; u32 mbw_pbm; u16 mbw_max; + u16 cmax; }; struct mpam_component diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 8a4c482adc82..91f7d0810c3d 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -439,6 +439,13 @@ static bool cache_has_usable_cpor(struct mpam_class *class) return (class->props.cpbm_wd <= RESCTRL_MAX_CBM); } +static bool cache_has_usable_cmax(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + return mpam_has_feature(mpam_feat_ccap_part, cprops); +} + static bool cache_has_usable_csu(struct mpam_class *class) { struct mpam_props *cprops; @@ -595,6 +602,7 @@ static void mpam_resctrl_pick_caches(void) { int idx; unsigned int cache_size; + bool has_cpor, has_cmax; struct mpam_class *class; struct mpam_resctrl_res *res; @@ -603,7 +611,9 @@ static void mpam_resctrl_pick_caches(void) idx = srcu_read_lock(&mpam_srcu); list_for_each_entry_rcu(class, &mpam_classes, classes_list) { struct mpam_props *cprops = &class->props; - bool has_cpor = cache_has_usable_cpor(class); + + has_cpor = cache_has_usable_cpor(class); + has_cmax = cache_has_usable_cmax(class); if (class->type != MPAM_CLASS_CACHE) { pr_debug("pick_caches: Class is not a cache\n"); @@ -649,6 +659,17 @@ static void mpam_resctrl_pick_caches(void) } res->class = class; } + + if (has_cmax) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2_MAX]; + res->resctrl_res.name = "L2MAX"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3_MAX]; + res->resctrl_res.name = "L3MAX"; + } + res->class = class; + } } srcu_read_unlock(&mpam_srcu, idx); } @@ -822,6 +843,22 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) } break; + case RDT_RESOURCE_L3_MAX: + case RDT_RESOURCE_L2_MAX: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = MAX_MBA_BW; + r->data_width = 3; + r->cache_level = class->level; + + if (cache_has_usable_cmax(class)) + r->alloc_capable = true; + + r->membw.min_bw = 0; + r->membw.bw_gran = max(100 / (1 << cprops->cmax_wd), 1); + break; + default: break; } @@ -940,6 +977,11 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case RDT_RESOURCE_L3: configured_by = mpam_feat_cpor_part; break; + case RDT_RESOURCE_L2_MAX: + case RDT_RESOURCE_L3_MAX: + configured_by = mpam_feat_ccap_part; + break; + case RDT_RESOURCE_MBA: if (mba_class_use_mbw_part(cprops)) { configured_by = mpam_feat_mbw_part; @@ -961,6 +1003,8 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case mpam_feat_cpor_part: /* TODO: Scaling is not yet supported */ return cfg->cpbm; + case mpam_feat_ccap_part: + return mbw_max_to_percent(cfg->cmax, cprops->cmax_wd); case mpam_feat_mbw_part: /* TODO: Scaling is not yet supported */ return mbw_pbm_to_percent(cfg->mbw_pbm, cprops); @@ -1002,6 +1046,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, cfg.cpbm = cfg_val; mpam_set_feature(mpam_feat_cpor_part, &cfg); break; + case 
RDT_RESOURCE_L2_MAX: + case RDT_RESOURCE_L3_MAX: + cfg.cmax = percent_to_mbw_max(cfg_val, cprops->cmax_wd); + mpam_set_feature(mpam_feat_ccap_part, &cfg); + break; case RDT_RESOURCE_MBA: if (mba_class_use_mbw_part(cprops)) { cfg.mbw_pbm = percent_to_mbw_pbm(cfg_val, cprops); diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h index 44f7a47f986b..b155d536ed3d 100644 --- a/include/linux/resctrl_types.h +++ b/include/linux/resctrl_types.h @@ -78,6 +78,10 @@ enum resctrl_res_level { RDT_RESOURCE_L2, RDT_RESOURCE_MBA, RDT_RESOURCE_SMBA, +#ifdef CONFIG_ARM64_MPAM + RDT_RESOURCE_L3_MAX, + RDT_RESOURCE_L2_MAX, +#endif /* Must be the last */ RDT_NUM_RESOURCES, -- 2.25.1
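How the {L2,L3}MAX percentage becomes a register value: CMAX (like MBW_MAX) is an unsigned fixed-point fraction left-aligned in a 16-bit field, with only the top cmax_wd bits implemented. A simplified round trip, truncating instead of rounding; note the real mbw_max_to_percent() earlier in this series special-cases the all-implemented-bits pattern so that 100% survives the conversion.

#include <stdio.h>
#include <stdint.h>

static uint16_t percent_to_fixed(uint32_t pc, unsigned int wd)
{
    uint32_t frac = pc * 0x10000u / 100;                /* pc% of 2^16 */
    uint32_t mask = (0xffffu << (16 - wd)) & 0xffffu;   /* implemented bits */

    return (frac > 0xffffu ? 0xffffu : frac) & mask;
}

static uint32_t fixed_to_percent(uint16_t val)
{
    return (uint32_t)val * 100 / 0x10000u;
}

int main(void)
{
    unsigned int wd = 6;    /* e.g. CMAX_WD == 6 implemented bits */
    uint16_t enc = percent_to_fixed(30, wd);

    printf("30%% -> %#06x -> %u%%\n", enc, fixed_to_percent(enc));
    return 0;
}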

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- Implement mbw_min/cmin feature for mata/cache msc, and provide MBMIN/LxMIN interface in schemata file for user. On some specific platforms, L3 implements the cmin feature, but L2 may not provide it, or vice versa. Therefore, it is necessary to separately determine whether the cache msc has implemented the corresponding feature. Before transmitting the cache MIN portion value to msc, need to check whether the CMIN setting from user space is validate or not. The same goes for MBMIN. It should be noted that the new configuration value is in decimal form instead of hexadecimal, so here reuse the parse_bw() to analysis input value for cache msc. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 26 +++++-- drivers/platform/mpam/mpam_internal.h | 4 + drivers/platform/mpam/mpam_resctrl.c | 102 ++++++++++++++++++++++++-- include/linux/resctrl_types.h | 3 + 4 files changed, 123 insertions(+), 12 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index e7b5af32aa2d..396389c12769 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -577,9 +577,14 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) u32 ccap_features = mpam_read_partsel_reg(msc, CCAP_IDR); props->cmax_wd = FIELD_GET(MPAMF_CCAP_IDR_CMAX_WD, ccap_features); - if (props->cmax_wd && - !FIELD_GET(MPAMF_CCAP_IDR_NO_CMAX, ccap_features)) - mpam_set_feature(mpam_feat_ccap_part, props); + + if (props->cmax_wd) { + if (!FIELD_GET(MPAMF_CCAP_IDR_NO_CMAX, ccap_features)) + mpam_set_feature(mpam_feat_ccap_part, props); + + if (FIELD_GET(MPAMF_CCAP_IDR_HAS_CMIN, ccap_features)) + mpam_set_feature(mpam_feat_cmin, props); + } } /* Cache Portion partitioning */ @@ -1217,6 +1222,13 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, mpam_write_partsel_reg(msc, CMAX, cmax); } + if (mpam_has_feature(mpam_feat_cmin, rprops)) { + if (mpam_has_feature(mpam_feat_cmin, cfg)) + mpam_write_partsel_reg(msc, CMIN, cfg->cmin); + else + mpam_write_partsel_reg(msc, CMIN, 0); + } + if (mpam_has_feature(mpam_feat_mbw_part, rprops)) { if (mpam_has_feature(mpam_feat_mbw_part, cfg)) mpam_write_partsel_reg(msc, MBW_PBM, cfg->mbw_pbm); @@ -1225,8 +1237,12 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, rprops->mbw_pbm_bits); } - if (mpam_has_feature(mpam_feat_mbw_min, rprops)) - mpam_write_partsel_reg(msc, MBW_MIN, 0); + if (mpam_has_feature(mpam_feat_mbw_min, rprops)) { + if (mpam_has_feature(mpam_feat_mbw_min, cfg)) + mpam_write_partsel_reg(msc, MBW_MIN, cfg->mbw_min); + else + mpam_write_partsel_reg(msc, MBW_MIN, 0); + } if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { if (mpam_has_feature(mpam_feat_mbw_max, cfg)) diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index c650eb878cd2..6555c2def51a 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -86,6 +86,7 @@ typedef u32 mpam_features_t; enum mpam_device_features { mpam_feat_ccap_part = 0, mpam_feat_cpor_part, + mpam_feat_cmin, mpam_feat_mbw_part, mpam_feat_mbw_min, mpam_feat_mbw_max, @@ -162,7 +163,9 @@ struct mpam_config { u32 cpbm; u32 mbw_pbm; u16 mbw_max; + u16 mbw_min; u16 cmax; + u16 cmin; }; struct mpam_component @@ -342,6 +345,7 @@ void mpam_resctrl_exit(void); #define MPAMCFG_PART_SEL 0x0100 /* partid to 
configure: */ #define MPAMCFG_CPBM 0x1000 /* cache-portion config */ #define MPAMCFG_CMAX 0x0108 /* cache-capacity config */ +#define MPAMCFG_CMIN 0x0110 /* cache-capacity min config */ #define MPAMCFG_MBW_MIN 0x0200 /* min mem-bw config */ #define MPAMCFG_MBW_MAX 0x0208 /* max mem-bw config */ #define MPAMCFG_MBW_WINWD 0x0220 /* mem-bw accounting window config */ diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 91f7d0810c3d..4fdebc3b7947 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -446,6 +446,13 @@ static bool cache_has_usable_cmax(struct mpam_class *class) return mpam_has_feature(mpam_feat_ccap_part, cprops); } +static bool cache_has_usable_cmin(struct mpam_class *class) +{ + struct mpam_props *cprops = &class->props; + + return mpam_has_feature(mpam_feat_cmin, cprops); +} + static bool cache_has_usable_csu(struct mpam_class *class) { struct mpam_props *cprops; @@ -499,6 +506,14 @@ static bool class_has_usable_mba(struct mpam_props *cprops) return false; } +static bool class_has_usable_mbw_min(struct mpam_props *cprops) +{ + if (mpam_has_feature(mpam_feat_mbw_min, cprops)) + return true; + + return false; +} + /* * Calculate the percentage change from each implemented bit in the control * This can return 0 when BWA_WD is greater than 6. (100 / (1<<7) == 0) @@ -602,9 +617,9 @@ static void mpam_resctrl_pick_caches(void) { int idx; unsigned int cache_size; - bool has_cpor, has_cmax; struct mpam_class *class; struct mpam_resctrl_res *res; + bool has_cpor, has_cmax, has_cmin; lockdep_assert_cpus_held(); @@ -614,6 +629,7 @@ static void mpam_resctrl_pick_caches(void) has_cpor = cache_has_usable_cpor(class); has_cmax = cache_has_usable_cmax(class); + has_cmin = cache_has_usable_cmin(class); if (class->type != MPAM_CLASS_CACHE) { pr_debug("pick_caches: Class is not a cache\n"); @@ -670,6 +686,17 @@ static void mpam_resctrl_pick_caches(void) } res->class = class; } + + if (has_cmin) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2_MIN]; + res->resctrl_res.name = "L2MIN"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3_MIN]; + res->resctrl_res.name = "L3MIN"; + } + res->class = class; + } } srcu_read_unlock(&mpam_srcu, idx); } @@ -677,8 +704,8 @@ static void mpam_resctrl_pick_caches(void) static void mpam_resctrl_pick_mba(void) { struct mpam_resctrl_res *res; + bool has_mba, has_mbw_min; struct mpam_class *class; - bool has_mba; int idx; lockdep_assert_cpus_held(); @@ -688,13 +715,11 @@ static void mpam_resctrl_pick_mba(void) struct mpam_props *cprops = &class->props; has_mba = class_has_usable_mba(cprops); + has_mbw_min = class_has_usable_mbw_min(cprops); if (class->level < 3) continue; - if (!class_has_usable_mba(cprops)) - continue; - if (!cpumask_equal(&class->affinity, cpu_possible_mask)) continue; @@ -703,6 +728,12 @@ static void mpam_resctrl_pick_mba(void) res->class = class; res->resctrl_res.name = "MB"; } + + if (has_mbw_min) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MB_MIN]; + res->class = class; + res->resctrl_res.name = "MBMIN"; + } } srcu_read_unlock(&mpam_srcu, idx); } @@ -859,6 +890,41 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) r->membw.bw_gran = max(100 / (1 << cprops->cmax_wd), 1); break; + case RDT_RESOURCE_L3_MIN: + case RDT_RESOURCE_L2_MIN: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = MAX_MBA_BW; + r->data_width = 3; + r->cache_level = 
class->level; + + if (cache_has_usable_cmin(class)) + r->alloc_capable = true; + + r->membw.min_bw = 0; + r->membw.bw_gran = max(100 / (1 << cprops->cmax_wd), 1); + break; + + case RDT_RESOURCE_MB_MIN: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = MAX_MBA_BW; + r->data_width = 3; + + r->membw.delay_linear = true; + r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED; + r->membw.bw_gran = get_mba_granularity(cprops); + + /* Round up to at least 1% */ + if (!r->membw.bw_gran) + r->membw.bw_gran = 1; + + if (class_has_usable_mbw_min(cprops)) + r->alloc_capable = true; + break; + default: break; } @@ -981,6 +1047,10 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case RDT_RESOURCE_L3_MAX: configured_by = mpam_feat_ccap_part; break; + case RDT_RESOURCE_L2_MIN: + case RDT_RESOURCE_L3_MIN: + configured_by = mpam_feat_cmin; + break; case RDT_RESOURCE_MBA: if (mba_class_use_mbw_part(cprops)) { @@ -990,7 +1060,12 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, configured_by = mpam_feat_mbw_max; break; } - fallthrough; + return -EINVAL; + + case RDT_RESOURCE_MB_MIN: + configured_by = mpam_feat_mbw_min; + break; + default: return -EINVAL; } @@ -1005,11 +1080,15 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, return cfg->cpbm; case mpam_feat_ccap_part: return mbw_max_to_percent(cfg->cmax, cprops->cmax_wd); + case mpam_feat_cmin: + return mbw_max_to_percent(cfg->cmin, cprops->cmax_wd); case mpam_feat_mbw_part: /* TODO: Scaling is not yet supported */ return mbw_pbm_to_percent(cfg->mbw_pbm, cprops); case mpam_feat_mbw_max: return mbw_max_to_percent(cfg->mbw_max, cprops->bwa_wd); + case mpam_feat_mbw_min: + return mbw_max_to_percent(cfg->mbw_min, cprops->bwa_wd); default: return -EINVAL; } @@ -1051,6 +1130,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, cfg.cmax = percent_to_mbw_max(cfg_val, cprops->cmax_wd); mpam_set_feature(mpam_feat_ccap_part, &cfg); break; + case RDT_RESOURCE_L2_MIN: + case RDT_RESOURCE_L3_MIN: + cfg.cmin = percent_to_mbw_max(cfg_val, cprops->cmax_wd); + mpam_set_feature(mpam_feat_cmin, &cfg); + break; case RDT_RESOURCE_MBA: if (mba_class_use_mbw_part(cprops)) { cfg.mbw_pbm = percent_to_mbw_pbm(cfg_val, cprops); @@ -1061,7 +1145,11 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, mpam_set_feature(mpam_feat_mbw_max, &cfg); break; } - fallthrough; + return -EINVAL; + case RDT_RESOURCE_MB_MIN: + cfg.mbw_min = percent_to_mbw_max(cfg_val, cprops->bwa_wd); + mpam_set_feature(mpam_feat_mbw_min, &cfg); + break; default: return -EINVAL; } diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h index b155d536ed3d..ceaf8d31333e 100644 --- a/include/linux/resctrl_types.h +++ b/include/linux/resctrl_types.h @@ -81,6 +81,9 @@ enum resctrl_res_level { #ifdef CONFIG_ARM64_MPAM RDT_RESOURCE_L3_MAX, RDT_RESOURCE_L2_MAX, + RDT_RESOURCE_L3_MIN, + RDT_RESOURCE_L2_MIN, + RDT_RESOURCE_MB_MIN, #endif /* Must be the last */ -- 2.25.1
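The per-feature programming pattern the MIN support extends, as a user-space model of mpam_reprogram_ris_partid(): write the staged value when the PARTID has one, otherwise write the neutral default, which for a minimum is 0 (no capacity or bandwidth reserved). write_reg() stands in for an MMIO write.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct cfg { bool min_set; uint16_t min; };

static void write_reg(const char *name, uint16_t val)
{
    printf("%s <- %#06x\n", name, val);
}

static void program_min(const struct cfg *cfg)
{
    /* 0 reserves nothing, so an unconfigured PARTID gets no floor */
    write_reg("MPAMCFG_MBW_MIN", cfg->min_set ? cfg->min : 0);
}

int main(void)
{
    struct cfg set = { .min_set = true, .min = 0x2000 };
    struct cfg unset = { 0 };

    program_min(&set);
    program_min(&unset);
    return 0;
}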

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- The Priority Partition feature controls the priority of requests attributed to the target PARTID. In addition to the MATA msc implementing priority features, the L3 controller can also have the same features. Here the *MBPRI* and *L3PRI* interface are provided in the schemata file to allow users to configure and query QoS partition settings. Before transmitting the priority value to msc, it needs to check whether the setting from user is validate or not. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 35 ++++++++----- drivers/platform/mpam/mpam_internal.h | 7 +++ drivers/platform/mpam/mpam_resctrl.c | 73 ++++++++++++++++++++++++++- include/linux/resctrl.h | 6 +++ include/linux/resctrl_types.h | 3 ++ 5 files changed, 108 insertions(+), 16 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 396389c12769..80ef4eb908e8 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -1193,12 +1193,11 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, struct mpam_config *cfg) { u32 pri_val = 0; + u16 intpri, dspri; u16 cmax = MPAMCFG_CMAX_CMAX; struct mpam_msc *msc = ris->msc; u16 bwa_fract = MPAMCFG_MBW_MAX_MAX; struct mpam_props *rprops = &ris->props; - u16 dspri = GENMASK(rprops->dspri_wd, 0); - u16 intpri = GENMASK(rprops->intpri_wd, 0); spin_lock(&msc->part_sel_lock); __mpam_part_sel(ris->ris_idx, partid, msc); @@ -1254,21 +1253,29 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, if (mpam_has_feature(mpam_feat_mbw_prop, rprops)) mpam_write_partsel_reg(msc, MBW_PROP, bwa_fract); - if (mpam_has_feature(mpam_feat_intpri_part, rprops) || - mpam_has_feature(mpam_feat_dspri_part, rprops)) { - /* aces high? 
*/ - if (!mpam_has_feature(mpam_feat_intpri_part_0_low, rprops)) - intpri = 0; - if (!mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) - dspri = 0; - - if (mpam_has_feature(mpam_feat_intpri_part, rprops)) + if (mpam_has_feature(mpam_feat_intpri_part_0_low, rprops)) + intpri = GENMASK(rprops->intpri_wd - 1, 0); + else + intpri = 0; + + if (mpam_has_feature(mpam_feat_intpri_part, rprops)) { + if (mpam_has_feature(mpam_feat_intpri_part, cfg)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_INTPRI, cfg->intpri); + else pri_val |= FIELD_PREP(MPAMCFG_PRI_INTPRI, intpri); - if (mpam_has_feature(mpam_feat_dspri_part, rprops)) - pri_val |= FIELD_PREP(MPAMCFG_PRI_DSPRI, dspri); + } + if (mpam_has_feature(mpam_feat_dspri_part_0_low, rprops)) + dspri = GENMASK(rprops->dspri_wd - 1, 0); + else + dspri = 0; + + if (mpam_has_feature(mpam_feat_dspri_part, rprops)) + pri_val |= FIELD_PREP(MPAMCFG_PRI_DSPRI, dspri); + + if (mpam_has_feature(mpam_feat_intpri_part, rprops) || + mpam_has_feature(mpam_feat_dspri_part, rprops)) mpam_write_partsel_reg(msc, PRI, pri_val); - } spin_unlock(&msc->part_sel_lock); } diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 6555c2def51a..47582ddc554f 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -166,6 +166,13 @@ struct mpam_config { u16 mbw_min; u16 cmax; u16 cmin; + + /* + * dspri is downstream priority + * intpri is internal priority + */ + u16 dspri; + u16 intpri; }; struct mpam_component diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 4fdebc3b7947..854aa414a03c 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -514,6 +514,14 @@ static bool class_has_usable_mbw_min(struct mpam_props *cprops) return false; } +static bool class_has_usable_intpri(struct mpam_props *cprops) +{ + if (mpam_has_feature(mpam_feat_intpri_part, cprops)) + return true; + + return false; +} + /* * Calculate the percentage change from each implemented bit in the control * This can return 0 when BWA_WD is greater than 6. 
(100 / (1<<7) == 0) @@ -619,7 +627,7 @@ static void mpam_resctrl_pick_caches(void) unsigned int cache_size; struct mpam_class *class; struct mpam_resctrl_res *res; - bool has_cpor, has_cmax, has_cmin; + bool has_cpor, has_cmax, has_cmin, has_intpri; lockdep_assert_cpus_held(); @@ -630,6 +638,7 @@ static void mpam_resctrl_pick_caches(void) has_cpor = cache_has_usable_cpor(class); has_cmax = cache_has_usable_cmax(class); has_cmin = cache_has_usable_cmin(class); + has_intpri = class_has_usable_intpri(cprops); if (class->type != MPAM_CLASS_CACHE) { pr_debug("pick_caches: Class is not a cache\n"); @@ -697,6 +706,17 @@ static void mpam_resctrl_pick_caches(void) } res->class = class; } + + if (has_intpri) { + if (class->level == 2) { + res = &mpam_resctrl_exports[RDT_RESOURCE_L2_PRI]; + res->resctrl_res.name = "L2PRI"; + } else { + res = &mpam_resctrl_exports[RDT_RESOURCE_L3_PRI]; + res->resctrl_res.name = "L3PRI"; + } + res->class = class; + } } srcu_read_unlock(&mpam_srcu, idx); } @@ -704,7 +724,7 @@ static void mpam_resctrl_pick_caches(void) static void mpam_resctrl_pick_mba(void) { struct mpam_resctrl_res *res; - bool has_mba, has_mbw_min; + bool has_mba, has_mbw_min, has_intpri; struct mpam_class *class; int idx; @@ -716,6 +736,7 @@ static void mpam_resctrl_pick_mba(void) has_mba = class_has_usable_mba(cprops); has_mbw_min = class_has_usable_mbw_min(cprops); + has_intpri = class_has_usable_intpri(cprops); if (class->level < 3) continue; @@ -734,6 +755,12 @@ static void mpam_resctrl_pick_mba(void) res->class = class; res->resctrl_res.name = "MBMIN"; } + + if (has_intpri) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MB_PRI]; + res->class = class; + res->resctrl_res.name = "MBPRI"; + } } srcu_read_unlock(&mpam_srcu, idx); } @@ -925,6 +952,35 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) r->alloc_capable = true; break; + case RDT_RESOURCE_L3_PRI: + case RDT_RESOURCE_L2_PRI: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_CACHE; + r->default_ctrl = GENMASK(cprops->intpri_wd - 1, 0); + r->data_width = 3; + r->cache_level = class->level; + + if (class_has_usable_intpri(cprops)) + r->alloc_capable = true; + + r->membw.min_bw = 0; + r->membw.bw_gran = 1; + break; + + case RDT_RESOURCE_MB_PRI: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = GENMASK(cprops->intpri_wd - 1, 0); + r->data_width = 3; + + r->membw.bw_gran = 1; + + if (class_has_usable_intpri(cprops)) + r->alloc_capable = true; + break; + default: break; } @@ -1051,6 +1107,11 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, case RDT_RESOURCE_L3_MIN: configured_by = mpam_feat_cmin; break; + case RDT_RESOURCE_L2_PRI: + case RDT_RESOURCE_L3_PRI: + case RDT_RESOURCE_MB_PRI: + configured_by = mpam_feat_intpri_part; + break; case RDT_RESOURCE_MBA: if (mba_class_use_mbw_part(cprops)) { @@ -1082,6 +1143,8 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, return mbw_max_to_percent(cfg->cmax, cprops->cmax_wd); case mpam_feat_cmin: return mbw_max_to_percent(cfg->cmin, cprops->cmax_wd); + case mpam_feat_intpri_part: + return cfg->intpri; case mpam_feat_mbw_part: /* TODO: Scaling is not yet supported */ return mbw_pbm_to_percent(cfg->mbw_pbm, cprops); @@ -1135,6 +1198,12 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, cfg.cmin = percent_to_mbw_max(cfg_val, cprops->cmax_wd); mpam_set_feature(mpam_feat_cmin, &cfg); break; + case 
RDT_RESOURCE_L2_PRI: + case RDT_RESOURCE_L3_PRI: + case RDT_RESOURCE_MB_PRI: + cfg.intpri = cfg_val; + mpam_set_feature(mpam_feat_intpri_part, &cfg); + break; case RDT_RESOURCE_MBA: if (mba_class_use_mbw_part(cprops)) { cfg.mbw_pbm = percent_to_mbw_pbm(cfg_val, cprops); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index ad58fb43957e..c064c01596e0 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -133,6 +133,8 @@ struct rdt_domain { * @arch_has_sparse_bitmasks: True if a bitmask like f00f is valid. * @arch_has_per_cpu_cfg: True if QOS_CFG register for this cache * level has CPU scope. + * @intpri_wd: Number of implemented bits in the priority + * partition. */ struct resctrl_cache { unsigned int cbm_len; @@ -140,6 +142,7 @@ struct resctrl_cache { unsigned int shareable_bits; bool arch_has_sparse_bitmasks; bool arch_has_per_cpu_cfg; + unsigned int intpri_wd; }; /** @@ -166,6 +169,8 @@ enum membw_throttle_mode { * different memory bandwidths * @mba_sc: True if MBA software controller(mba_sc) is enabled * @mb_map: Mapping of memory B/W percentage to memory B/W delay + * @intpri_wd: Number of implemented bits in the priority + * partition. */ struct resctrl_membw { u32 min_bw; @@ -175,6 +180,7 @@ struct resctrl_membw { enum membw_throttle_mode throttle_mode; bool mba_sc; u32 *mb_map; + u32 intpri_wd; }; /** diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h index ceaf8d31333e..ad8d1ac7ca29 100644 --- a/include/linux/resctrl_types.h +++ b/include/linux/resctrl_types.h @@ -84,6 +84,9 @@ enum resctrl_res_level { RDT_RESOURCE_L3_MIN, RDT_RESOURCE_L2_MIN, RDT_RESOURCE_MB_MIN, + RDT_RESOURCE_L3_PRI, + RDT_RESOURCE_L2_PRI, + RDT_RESOURCE_MB_PRI, #endif /* Must be the last */ -- 2.25.1
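Packing the two priority fields, modelled on the FIELD_PREP() calls above. The field placement used here (INTPRI in the low halfword, DSPRI in the high halfword) is an assumption for the sketch; the implemented widths really come from MPAMF_PRI_IDR, and 0 is the highest priority unless the *_0_low features say otherwise.

#include <stdio.h>
#include <stdint.h>

#define PRI_INTPRI_SHIFT    0   /* assumed layout for the sketch */
#define PRI_DSPRI_SHIFT     16

static uint32_t pack_pri(uint16_t intpri, uint16_t dspri)
{
    return ((uint32_t)dspri << PRI_DSPRI_SHIFT) |
           ((uint32_t)intpri << PRI_INTPRI_SHIFT);
}

int main(void)
{
    printf("PRI = %#010x\n", pack_pri(3, 1));
    return 0;
}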

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- Allow users to configure the MATA hard-limit feature and the cache soft-limit feature through the schemata MBHDL:${domain_id}=${value} interface under the resctrl fs. When the MAX bandwidth is exceeded, the hard-limit feature prevents the partition from using any more bandwidth until the memory bandwidth measurement for the partition falls below MAX. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- drivers/platform/mpam/mpam_devices.c | 16 +++++++++-- drivers/platform/mpam/mpam_internal.h | 2 ++ drivers/platform/mpam/mpam_resctrl.c | 40 ++++++++++++++++++++++++++- include/linux/resctrl_types.h | 1 + 4 files changed, 56 insertions(+), 3 deletions(-) diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 80ef4eb908e8..8190272e5a1b 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -607,8 +607,10 @@ static void mpam_ris_hw_probe(struct mpam_msc_ris *ris) mpam_set_feature(mpam_feat_mbw_part, props); props->bwa_wd = FIELD_GET(MPAMF_MBW_IDR_BWA_WD, mbw_features); - if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) + if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MAX, mbw_features)) { mpam_set_feature(mpam_feat_mbw_max, props); + mpam_set_feature(mpam_feat_max_limit, props); + } if (props->bwa_wd && FIELD_GET(MPAMF_MBW_IDR_HAS_MIN, mbw_features)) mpam_set_feature(mpam_feat_mbw_min, props); @@ -1192,6 +1194,7 @@ static void mpam_reset_msc_bitmap(struct mpam_msc *msc, u16 reg, u16 wd) static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, struct mpam_config *cfg) { + bool limit; u32 pri_val = 0; u16 intpri, dspri; u16 cmax = MPAMCFG_CMAX_CMAX; @@ -1245,7 +1248,16 @@ static void mpam_reprogram_ris_partid(struct mpam_msc_ris *ris, u16 partid, if (mpam_has_feature(mpam_feat_mbw_max, rprops)) { if (mpam_has_feature(mpam_feat_mbw_max, cfg)) - mpam_write_partsel_reg(msc, MBW_MAX, cfg->mbw_max | MPAMCFG_MBW_MAX_HARDLIM); + bwa_fract = cfg->mbw_max; + + if (mpam_has_feature(mpam_feat_max_limit, cfg)) + limit = cfg->max_limit; + else + limit = false; + + if (limit) + mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract | + MPAMCFG_MBW_MAX_HARDLIM); else mpam_write_partsel_reg(msc, MBW_MAX, bwa_fract); } diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h index 47582ddc554f..e9c52078edea 100644 --- a/drivers/platform/mpam/mpam_internal.h +++ b/drivers/platform/mpam/mpam_internal.h @@ -90,6 +90,7 @@ enum mpam_device_features { mpam_feat_ccap_part = 0, mpam_feat_cpor_part, mpam_feat_cmin, mpam_feat_mbw_part, mpam_feat_mbw_min, mpam_feat_mbw_max, + mpam_feat_max_limit, mpam_feat_mbw_prop, mpam_feat_intpri_part, mpam_feat_intpri_part_0_low, @@ -164,6 +165,7 @@ struct mpam_config { u32 mbw_pbm; u16 mbw_max; u16 mbw_min; + bool max_limit; u16 cmax; u16 cmin; diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 854aa414a03c..53bca8ba056a 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -522,6 +522,14 @@ static bool class_has_usable_intpri(struct mpam_props *cprops) return false; } +static bool class_has_usable_max_limit(struct mpam_props *cprops) +{ + if (mpam_has_feature(mpam_feat_max_limit, cprops)) + return true; + + return false; +} + /* * Calculate the percentage change from each implemented bit in the control * This can return 0 when BWA_WD is greater than 6.
(100 / (1<<7) == 0) @@ -723,8 +731,8 @@ static void mpam_resctrl_pick_caches(void) static void mpam_resctrl_pick_mba(void) { + bool has_mba, has_mbw_min, has_intpri, has_limit; struct mpam_resctrl_res *res; - bool has_mba, has_mbw_min, has_intpri; struct mpam_class *class; int idx; @@ -737,6 +745,7 @@ static void mpam_resctrl_pick_mba(void) has_mba = class_has_usable_mba(cprops); has_mbw_min = class_has_usable_mbw_min(cprops); has_intpri = class_has_usable_intpri(cprops); + has_limit = class_has_usable_max_limit(cprops); if (class->level < 3) continue; @@ -761,6 +770,12 @@ static void mpam_resctrl_pick_mba(void) res->class = class; res->resctrl_res.name = "MBPRI"; } + + if (has_limit) { + res = &mpam_resctrl_exports[RDT_RESOURCE_MB_HDL]; + res->class = class; + res->resctrl_res.name = "MBHDL"; + } } srcu_read_unlock(&mpam_srcu, idx); } @@ -981,6 +996,19 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) r->alloc_capable = true; break; + case RDT_RESOURCE_MB_HDL: + r->format_str = "%d=%0*u"; + r->schema_fmt = RESCTRL_SCHEMA_RANGE; + r->fflags = RFTYPE_RES_MB; + r->default_ctrl = 1; + r->data_width = 1; + + r->membw.bw_gran = 1; + + if (class_has_usable_max_limit(cprops)) + r->alloc_capable = true; + break; + default: break; } @@ -1127,6 +1155,10 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, configured_by = mpam_feat_mbw_min; break; + case RDT_RESOURCE_MB_HDL: + configured_by = mpam_feat_max_limit; + break; + default: return -EINVAL; } @@ -1152,6 +1184,8 @@ u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, return mbw_max_to_percent(cfg->mbw_max, cprops->bwa_wd); case mpam_feat_mbw_min: return mbw_max_to_percent(cfg->mbw_min, cprops->bwa_wd); + case mpam_feat_max_limit: + return cfg->max_limit; default: return -EINVAL; } @@ -1219,6 +1253,10 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, cfg.mbw_min = percent_to_mbw_max(cfg_val, cprops->bwa_wd); mpam_set_feature(mpam_feat_mbw_min, &cfg); break; + case RDT_RESOURCE_MB_HDL: + cfg.max_limit = cfg_val; + mpam_set_feature(mpam_feat_max_limit, &cfg); + break; default: return -EINVAL; } diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h index ad8d1ac7ca29..fd1704766c29 100644 --- a/include/linux/resctrl_types.h +++ b/include/linux/resctrl_types.h @@ -87,6 +87,7 @@ enum resctrl_res_level { RDT_RESOURCE_L3_PRI, RDT_RESOURCE_L2_PRI, RDT_RESOURCE_MB_PRI, + RDT_RESOURCE_MB_HDL, #endif /* Must be the last */ -- 2.25.1
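The control in miniature: MBW_MAX carries the bandwidth fraction in its lower bits and the hard-limit flag in bit 31, matching MPAMCFG_MBW_MAX_HARDLIM. Without the flag, the fraction is a soft target that may be exceeded while bandwidth is otherwise idle; with it, the partition is throttled once its measured bandwidth reaches MAX.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MBW_MAX_HARDLIM (1u << 31)

static uint32_t mbw_max_val(uint16_t fraction, bool hardlim)
{
    return fraction | (hardlim ? MBW_MAX_HARDLIM : 0);
}

int main(void)
{
    printf("soft: %#010x\n", mbw_max_val(0x8000, false));
    printf("hard: %#010x\n", mbw_max_val(0x8000, true));
    return 0;
}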

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1 -------------------------------- Add a default label to the switch statement in get_arch_mbm_state() to avoid compiler warnings. This is a prerequisite patch for extending enum resctrl_event_id. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/monitor.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index 8fcd25b1aa86..ddc12606c602 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -134,6 +134,8 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom, return &hw_dom->arch_mbm_total[rmid]; case QOS_L3_MBM_LOCAL_EVENT_ID: return &hw_dom->arch_mbm_local[rmid]; + default: + break; } /* Never expect to get here */ -- 2.25.1
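The shape of the warning this avoids: once enum resctrl_event_id grows values that x86 keeps no MBM state for, a switch over the enum without a default trips -Wswitch. A minimal model; the event names are illustrative.

#include <stdio.h>

enum event_id { EV_OCCUP = 1, EV_MBM_TOTAL, EV_MBM_LOCAL, EV_L2_OCCUP };

static const char *state_for(enum event_id ev)
{
    switch (ev) {
    case EV_MBM_TOTAL:
        return "total-state";
    case EV_MBM_LOCAL:
        return "local-state";
    default:    /* events this arch keeps no state for */
        break;
    }
    return NULL;    /* "never expect to get here" */
}

int main(void)
{
    printf("%s\n", state_for(EV_MBM_TOTAL));
    printf("%p\n", (void *)state_for(EV_L2_OCCUP));
    return 0;
}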

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1 -------------------------------- Support MPAM to create *mon_L2_${cache_id}/l2c_occupancy* cache occupancy monitors under mon_data directory. In the other hand, RDT doesn't support L2 monitor feature. In addition, let MPAM L2 cache msc support memory bandwidth monitor feature. L2 cache msc not only has the MSMON_CSU monitor, but also has the MSMON_MBWU monitor. Now provide *mon_L2_${cache_id}/mbm_core_bytes* interface to display the value of l2c MSMON_MBWU counter. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/include/asm/resctrl.h | 2 ++ drivers/platform/mpam/mpam_devices.c | 10 +----- drivers/platform/mpam/mpam_resctrl.c | 50 ++++++++++++++++++++++++---- fs/resctrl/monitor.c | 14 ++++++++ fs/resctrl/rdtgroup.c | 17 ++++++++-- include/linux/arm_mpam.h | 2 ++ include/linux/resctrl.h | 1 + include/linux/resctrl_types.h | 10 ++++++ 8 files changed, 88 insertions(+), 18 deletions(-) diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index b4a0a99bde6a..406b4cf301bb 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -101,6 +101,8 @@ static inline bool resctrl_arch_is_mbm_local_enabled(void) return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); } +static inline bool resctrl_arch_is_mbm_core_enabled(void) { return false; } + /* * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR * diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c index 8190272e5a1b..237a7cd54099 100644 --- a/drivers/platform/mpam/mpam_devices.c +++ b/drivers/platform/mpam/mpam_devices.c @@ -923,17 +923,9 @@ static u64 mpam_msmon_overflow_val(struct mpam_msc_ris *ris) return GENMASK_ULL(30, 0); } -static bool resctrl_arch_would_mbm_overflow(void) -{ - return read_cpuid_implementor() != ARM_CPU_IMP_HISI; -} - bool resctrl_arch_would_mbm_overflow(void) { - if (is_midr_in_range_list(read_cpuid_id(), mbwu_flowrate_list)) - return false; - - return true; + return read_cpuid_implementor() != ARM_CPU_IMP_HISI; } static void __ris_msmon_read(void *arg) diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c index 53bca8ba056a..0e96b64872aa 100644 --- a/drivers/platform/mpam/mpam_resctrl.c +++ b/drivers/platform/mpam/mpam_resctrl.c @@ -35,6 +35,7 @@ static bool exposed_alloc_capable; static bool exposed_mon_capable; static struct mpam_class *mbm_local_class; static struct mpam_class *mbm_total_class; +static struct mpam_class *mbm_core_class; /* * MPAM emulates CDP by setting different PARTID in the I/D fields of MPAM1_EL1. 
@@ -78,6 +79,11 @@ bool resctrl_arch_is_mbm_total_enabled(void) return mbm_total_class; } +bool resctrl_arch_is_mbm_core_enabled(void) +{ + return mbm_core_class; +} + bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level rid) { switch (rid) { @@ -263,6 +269,8 @@ static void *resctrl_arch_mon_ctx_alloc_no_wait(struct rdt_resource *r, case QOS_L3_OCCUP_EVENT_ID: case QOS_L3_MBM_LOCAL_EVENT_ID: case QOS_L3_MBM_TOTAL_EVENT_ID: + case QOS_L2_OCCUP_EVENT_ID: + case QOS_L2_MBM_CORE_EVENT_ID: *ret = __mon_is_rmid_idx; return ret; @@ -328,10 +336,12 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, switch (eventid) { case QOS_L3_OCCUP_EVENT_ID: + case QOS_L2_OCCUP_EVENT_ID: type = mpam_feat_msmon_csu; break; case QOS_L3_MBM_LOCAL_EVENT_ID: case QOS_L3_MBM_TOTAL_EVENT_ID: + case QOS_L2_MBM_CORE_EVENT_ID: type = mpam_feat_msmon_mbwu; break; default: @@ -480,6 +490,11 @@ bool resctrl_arch_is_llc_occupancy_enabled(void) return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L3].class); } +bool resctrl_arch_is_l2c_occupancy_enabled(void) +{ + return cache_has_usable_csu(mpam_resctrl_exports[RDT_RESOURCE_L2].class); +} + static bool class_has_usable_mbwu(struct mpam_class *class) { struct mpam_props *cprops = &class->props; @@ -874,19 +889,23 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res) * be local. If it's on the memory controller, its assumed to * be global. */ - if (has_mbwu && class->level >= 3) { - mbm_local_class = class; - r->mon_capable = true; + if (has_mbwu) { + if (class->level == 3) { + mbm_local_class = class; + r->mon_capable = true; + + } else if (class->level == 2) { + mbm_core_class = class; + r->mon_capable = true; + } } /* * CSU counters only make sense on a cache. The file is called * llc_occupancy, but its expected to the on the L3. */ - if (has_csu && class->type == MPAM_CLASS_CACHE && - class->level == 3) { + if (has_csu && class->type == MPAM_CLASS_CACHE) r->mon_capable = true; - } break; case RDT_RESOURCE_MBA: @@ -1458,6 +1477,11 @@ static struct mon_evt llc_occupancy_event = { .evtid = QOS_L3_OCCUP_EVENT_ID, }; +static struct mon_evt l2c_occupancy_event = { + .name = "l2c_occupancy", + .evtid = QOS_L2_OCCUP_EVENT_ID, +}; + static struct mon_evt mbm_total_event = { .name = "mbm_total_bytes", .evtid = QOS_L3_MBM_TOTAL_EVENT_ID, @@ -1468,6 +1492,11 @@ static struct mon_evt mbm_local_event = { .evtid = QOS_L3_MBM_LOCAL_EVENT_ID, }; +static struct mon_evt mbm_core_event = { + .name = "mbm_core_bytes", + .evtid = QOS_L2_MBM_CORE_EVENT_ID, +}; + /* * Initialize the event list for the resource. 
* @@ -1490,6 +1519,14 @@ static void l3_mon_evt_init(struct rdt_resource *r) list_add_tail(&mbm_local_event.list, &r->evt_list); } + if (r->rid == RDT_RESOURCE_L2) { + if (resctrl_arch_is_l2c_occupancy_enabled()) + list_add_tail(&l2c_occupancy_event.list, &r->evt_list); + + if (resctrl_arch_is_mbm_core_enabled()) + list_add_tail(&mbm_core_event.list, &r->evt_list); + } + if ((r->rid == RDT_RESOURCE_MBA) && resctrl_arch_is_mbm_total_enabled()) list_add_tail(&mbm_total_event.list, &r->evt_list); @@ -1498,6 +1535,7 @@ static void l3_mon_evt_init(struct rdt_resource *r) int resctrl_arch_mon_resource_init(void) { l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_L3)); + l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_L2)); l3_mon_evt_init(resctrl_arch_get_resource(RDT_RESOURCE_MBA)); if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c index 402985f25ac6..cbb8fc878cde 100644 --- a/fs/resctrl/monitor.c +++ b/fs/resctrl/monitor.c @@ -574,6 +574,20 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, if (is_mba_sc(NULL)) mbm_bw_count(closid, rmid, &rr); + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } + if (resctrl_arch_is_mbm_core_enabled()) { + rr.evtid = QOS_L2_MBM_CORE_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); } } diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index cc40b178ddee..b4fb125a7509 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -113,13 +113,15 @@ void rdt_staged_configs_clear(void) static bool resctrl_is_mbm_enabled(void) { return (resctrl_arch_is_mbm_total_enabled() || - resctrl_arch_is_mbm_local_enabled()); + resctrl_arch_is_mbm_local_enabled() || + resctrl_arch_is_mbm_core_enabled()); } static bool resctrl_is_mbm_event(int e) { - return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && - e <= QOS_L3_MBM_LOCAL_EVENT_ID); + return (e == QOS_L3_MBM_TOTAL_EVENT_ID || + e == QOS_L3_MBM_LOCAL_EVENT_ID || + e == QOS_L2_MBM_CORE_EVENT_ID); } /* @@ -3842,6 +3844,15 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) return -ENOMEM; } } + if (resctrl_arch_is_mbm_core_enabled()) { + tsize = sizeof(*d->mbm_core); + d->mbm_core = kcalloc(idx_limit, tsize, GFP_KERNEL); + if (!d->mbm_core) { + bitmap_free(d->rmid_busy_llc); + kfree(d->mbm_core); + return -ENOMEM; + } + } return 0; } diff --git a/include/linux/arm_mpam.h b/include/linux/arm_mpam.h index aef951595f65..3a06a8afad7c 100644 --- a/include/linux/arm_mpam.h +++ b/include/linux/arm_mpam.h @@ -64,6 +64,8 @@ static inline bool resctrl_arch_event_is_free_running(enum resctrl_event_id evt) bool resctrl_arch_alloc_capable(void); bool resctrl_arch_mon_capable(void); bool resctrl_arch_is_llc_occupancy_enabled(void); +bool resctrl_arch_is_l2c_occupancy_enabled(void); +bool resctrl_arch_is_mbm_core_enabled(void); bool resctrl_arch_is_mbm_local_enabled(void); bool resctrl_arch_is_mbm_total_enabled(void); bool resctrl_arch_would_mbm_overflow(void); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index c064c01596e0..4388cab9e380 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -113,6 +113,7 @@ struct rdt_domain { unsigned long *rmid_busy_llc; struct mbm_state 
*mbm_total;
	struct mbm_state	*mbm_local;
+	struct mbm_state	*mbm_core;
	struct delayed_work	mbm_over;
	struct delayed_work	cqm_limbo;
	int			mbm_work_cpu;
diff --git a/include/linux/resctrl_types.h b/include/linux/resctrl_types.h
index fd1704766c29..3d56c35877ea 100644
--- a/include/linux/resctrl_types.h
+++ b/include/linux/resctrl_types.h
@@ -99,11 +99,21 @@ enum resctrl_res_level {
 /*
  * Event IDs, the values match those used to program IA32_QM_EVTSEL before
  * reading IA32_QM_CTR on RDT systems.
+ *
+ * Monitor event IDs, representing a variety of monitoring events:
+ * QOS_L3_OCCUP_EVENT_ID:     L3 Cache Occupancy statistics event
+ * QOS_L3_MBM_TOTAL_EVENT_ID: Global Memory Bandwidth statistics event
+ * QOS_L3_MBM_LOCAL_EVENT_ID: L3 Cache Bandwidth statistics event
+ * QOS_L2_OCCUP_EVENT_ID:     L2 Cache Occupancy statistics event
+ * QOS_L2_MBM_CORE_EVENT_ID:  L2 Cache Bandwidth statistics event
 */
enum resctrl_event_id {
	QOS_L3_OCCUP_EVENT_ID		= 0x01,
	QOS_L3_MBM_TOTAL_EVENT_ID	= 0x02,
	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
+
+	QOS_L2_OCCUP_EVENT_ID,
+	QOS_L2_MBM_CORE_EVENT_ID,
 };
 
 #endif /* __LINUX_RESCTRL_TYPES_H */
-- 2.25.1
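For illustration, on hardware whose L2 MSC implements both the CSU and MBWU monitors, the new files read like their L3 counterparts. A sketch with a hypothetical cache id and counter values (note that a later patch in this series additionally hides the L2 resource behind an "-o l2" mount option, so assume it is exposed here):

~~~
# mount -t resctrl resctrl /sys/fs/resctrl
# grep . /sys/fs/resctrl/mon_data/mon_L2_*/*
mon_data/mon_L2_4/l2c_occupancy:18432
mon_data/mon_L2_4/mbm_core_bytes:1024
~~~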

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mp... -------------------------------- The MPAM system registers will be lost if the CPU is reset during PSCI's CPU_SUSPEND. Restore them. mpam_thread_switch(current) can't be used as this won't make any changes if the in-memory copy says the register already has the correct value. In reality the system register is UNKNOWN out of reset. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/kernel/mpam.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c index d8891d5c59c0..d3ac3bfb5564 100644 --- a/arch/arm64/kernel/mpam.c +++ b/arch/arm64/kernel/mpam.c @@ -4,6 +4,7 @@ #include <asm/mpam.h> #include <linux/arm_mpam.h> +#include <linux/cpu_pm.h> #include <linux/jump_label.h> #include <linux/percpu.h> #include <linux/crash_dump.h> @@ -13,6 +14,32 @@ DEFINE_STATIC_KEY_FALSE(mpam_enabled); DEFINE_PER_CPU(u64, arm64_mpam_default); DEFINE_PER_CPU(u64, arm64_mpam_current); +static int mpam_pm_notifier(struct notifier_block *self, + unsigned long cmd, void *v) +{ + u64 regval; + int cpu = smp_processor_id(); + + switch (cmd) { + case CPU_PM_EXIT: + /* + * Don't use mpam_thread_switch() as the system register + * value has changed under our feet. + */ + regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); + write_sysreg_s(0, SYS_MPAM1_EL1); + write_sysreg_s(regval, SYS_MPAM0_EL1); + + return NOTIFY_OK; + default: + return NOTIFY_DONE; + } +} + +static struct notifier_block mpam_pm_nb = { + .notifier_call = mpam_pm_notifier, +}; + static int __init arm64_mpam_register_cpus(void) { u16 partid_max; @@ -29,6 +56,7 @@ static int __init arm64_mpam_register_cpus(void) partid_max = FIELD_GET(MPAMIDR_PARTID_MAX, mpamidr); pmg_max = FIELD_GET(MPAMIDR_PMG_MAX, mpamidr); + cpu_pm_register_notifier(&mpam_pm_nb); return mpam_register_requestor(partid_max, pmg_max); } arch_initcall(arm64_mpam_register_cpus); -- 2.25.1

From: James Morse <james.morse@arm.com> maillist inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mp... -------------------------------- When a CPU is brought online, only the MPAM1_EL1 sysreg is set. For completeness restore the MPAM0_EL1 register too. This isn't expected to cause a problem as cpuhp happens from the idle thread, which can't return to user-space. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/arm64/kernel/cpufeature.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 3958738da32f..995c96ed4b94 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -2378,6 +2378,11 @@ static void __maybe_unused cpu_enable_mpam(const struct arm64_cpu_capabilities *entry) { u64 idr = read_sanitised_ftr_reg(SYS_MPAMIDR_EL1); + int cpu = smp_processor_id(); + u64 regval = 0; + + if (IS_ENABLED(CONFIG_ARM64_MPAM)) + regval = READ_ONCE(per_cpu(arm64_mpam_current, cpu)); /* * Initialise MPAM EL2 registers and disable EL2 traps. @@ -2394,6 +2399,7 @@ cpu_enable_mpam(const struct arm64_cpu_capabilities *entry) * been throttled to release the lock. */ write_sysreg_s(0, SYS_MPAM1_EL1); + write_sysreg_s(regval, SYS_MPAM0_EL1); } static void mpam_extra_caps(void) -- 2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1

--------------------------------

The L2 MSC function interface is hidden by default, and can be enabled
when the user specifies the "-o l2" mount option.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 fs/resctrl/ctrlmondata.c |  6 ++++++
 fs/resctrl/internal.h    |  1 +
 fs/resctrl/rdtgroup.c    | 16 ++++++++++++++++
 include/linux/resctrl.h  |  1 +
 4 files changed, 24 insertions(+)

diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c
index d2793e6407e8..025fbf91dfc6 100644
--- a/fs/resctrl/ctrlmondata.c
+++ b/fs/resctrl/ctrlmondata.c
@@ -311,6 +311,9 @@ static int rdtgroup_parse_resource(char *resname, char *tok,
 	struct resctrl_schema *s;
 
 	list_for_each_entry(s, &resctrl_schema_all, list) {
+		if (s->res->invisible)
+			continue;
+
 		if (!strcmp(resname, s->name) && rdtgrp->closid < s->num_closid)
 			return parse_line(tok, s, rdtgrp);
 	}
@@ -455,6 +458,9 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 	} else {
 		closid = rdtgrp->closid;
 		list_for_each_entry(schema, &resctrl_schema_all, list) {
+			if (schema->res->invisible)
+				continue;
+
 			if (closid < schema->num_closid)
 				show_doms(s, schema, closid);
 		}
diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h
index 9e7ff00b8514..e1574ca71317 100644
--- a/fs/resctrl/internal.h
+++ b/fs/resctrl/internal.h
@@ -58,6 +58,7 @@ struct rdt_fs_context {
 	bool				enable_cdpl3;
 	bool				enable_mba_mbps;
 	bool				enable_debug;
+	bool				enable_l2;
 };
 
 static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc)
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index b4fb125a7509..b2ed307beb58 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -2367,6 +2367,7 @@ static void rdt_disable_ctx(void)
 
 static int rdt_enable_ctx(struct rdt_fs_context *ctx)
 {
+	struct rdt_resource *r;
 	int ret = 0;
 
 	if (ctx->enable_cdpl2) {
@@ -2390,6 +2391,13 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
 	if (ctx->enable_debug)
 		resctrl_debug = true;
 
+	r = resctrl_arch_get_resource(RDT_RESOURCE_L2);
+	/* Only arm64 arch hides L2 resource by default */
+	if (IS_ENABLED(CONFIG_ARM64_MPAM) && !ctx->enable_l2)
+		r->invisible = true;
+	else
+		r->invisible = false;
+
 	return 0;
 
 out_cdpl3:
@@ -2612,6 +2620,7 @@ enum rdt_param {
 	Opt_cdpl2,
 	Opt_mba_mbps,
 	Opt_debug,
+	Opt_l2,
 	nr__rdt_params
 };
 
@@ -2620,6 +2629,7 @@ static const struct fs_parameter_spec rdt_fs_parameters[] = {
 	fsparam_flag("cdpl2",		Opt_cdpl2),
 	fsparam_flag("mba_MBps",	Opt_mba_mbps),
 	fsparam_flag("debug",		Opt_debug),
+	fsparam_flag("l2",		Opt_l2),
 	{}
 };
 
@@ -2648,6 +2658,9 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case Opt_debug:
 		ctx->enable_debug = true;
 		return 0;
+	case Opt_l2:
+		ctx->enable_l2 = true;
+		return 0;
 	}
 
 	return -EINVAL;
@@ -2997,6 +3010,9 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn,
 		if (!r->mon_capable)
 			continue;
 
+		if (r->invisible)
+			continue;
+
 		ret = mkdir_mondata_subdir_alldom(kn, r, prgrp);
 		if (ret)
 			goto out_destroy;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 4388cab9e380..0b04898c90ba 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -218,6 +218,7 @@ struct rdt_resource {
 	int			rid;
 	bool			alloc_capable;
 	bool			mon_capable;
+	bool			invisible;
 	int			num_rmid;
 	int			cache_level;
 	struct resctrl_cache	cache;
-- 2.25.1
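A minimal usage sketch of the new option; the L2 cache ids and masks are illustrative and match the manual added later in this series:

~~~
# mount -t resctrl resctrl /sys/fs/resctrl       # default: no L2 lines in schemata
# umount /sys/fs/resctrl
# mount -t resctrl -o l2 resctrl /sys/fs/resctrl
# cat /sys/fs/resctrl/schemata
L2:4=000ff;8=000ff
~~~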

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1

--------------------------------

When a CPU is brought offline, all registers of volatile MSC components
would lose their configuration context. Because of that, the MPAM driver
refuses to take a CPU offline while the L2 MSC is enabled.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_devices.c  |  6 +++++-
 drivers/platform/mpam/mpam_internal.h |  2 ++
 drivers/platform/mpam/mpam_resctrl.c  | 29 +++++++++++++++++++++++++++
 fs/resctrl/internal.h                 |  1 -
 include/linux/resctrl.h               |  3 +++
 5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/mpam/mpam_devices.c b/drivers/platform/mpam/mpam_devices.c
index 237a7cd54099..5f05771e3e4d 100644
--- a/drivers/platform/mpam/mpam_devices.c
+++ b/drivers/platform/mpam/mpam_devices.c
@@ -1544,9 +1544,13 @@ static int mpam_discovery_cpu_online(unsigned int cpu)
 
 static int mpam_cpu_offline(unsigned int cpu)
 {
-	int idx;
+	int ret, idx;
 	struct mpam_msc *msc;
 
+	ret = mpam_resctrl_prepare_offline();
+	if (ret)
+		return ret;
+
 	idx = srcu_read_lock(&mpam_srcu);
 	list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) {
 		if (!cpumask_test_cpu(cpu, &msc->accessibility))
diff --git a/drivers/platform/mpam/mpam_internal.h b/drivers/platform/mpam/mpam_internal.h
index e9c52078edea..ff1890b3a78e 100644
--- a/drivers/platform/mpam/mpam_internal.h
+++ b/drivers/platform/mpam/mpam_internal.h
@@ -590,4 +590,6 @@ void mpam_resctrl_exit(void);
  */
 #define MSMON_CAPT_EVNT_NOW	BIT(0)
 
+int mpam_resctrl_prepare_offline(void);
+
 #endif /* MPAM_INTERNAL_H */
diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index 0e96b64872aa..bfbd244bd4e0 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -906,6 +906,14 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		 */
 		if (has_csu && class->type == MPAM_CLASS_CACHE)
 			r->mon_capable = true;
+
+		/*
+		 * The power domain of the L2 cache msc is shared with the
+		 * core's, which will cause information of the L2 msc to
+		 * be lost when the core enters power-down state.
+ */ + if (class->level <= 2) + r->is_volatile = true; break; case RDT_RESOURCE_MBA: @@ -1437,6 +1445,27 @@ int mpam_resctrl_online_cpu(unsigned int cpu) return 0; } +int mpam_resctrl_prepare_offline(void) +{ + struct mpam_resctrl_res *res; + int i; + + if (resctrl_mounted) { + for (i = 0; i < RDT_NUM_RESOURCES; i++) { + res = &mpam_resctrl_exports[i]; + + if (res->resctrl_res.is_volatile && + !res->resctrl_res.invisible) { + pr_info("%s is working, umount /sys/fs/resctrl first.\n", + res->resctrl_res.name); + return -EBUSY; + } + } + } + + return 0; +} + int mpam_resctrl_offline_cpu(unsigned int cpu) { int i; diff --git a/fs/resctrl/internal.h b/fs/resctrl/internal.h index e1574ca71317..698af63a2803 100644 --- a/fs/resctrl/internal.h +++ b/fs/resctrl/internal.h @@ -98,7 +98,6 @@ struct rmid_read { }; extern struct list_head resctrl_schema_all; -extern bool resctrl_mounted; enum rdt_group_type { RDTCTRL_GROUP = 0, diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 0b04898c90ba..3f73064769fa 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -219,6 +219,7 @@ struct rdt_resource { bool alloc_capable; bool mon_capable; bool invisible; + bool is_volatile; int num_rmid; int cache_level; struct resctrl_cache cache; @@ -438,6 +439,8 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d); extern unsigned int resctrl_rmid_realloc_threshold; extern unsigned int resctrl_rmid_realloc_limit; +extern bool resctrl_mounted; + int resctrl_init(void); void resctrl_exit(void); -- 2.25.1
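A hypothetical session showing the intended refusal; the shell wording and dmesg formatting vary, and the message text comes from the pr_info() in mpam_resctrl_prepare_offline():

~~~
# mount -t resctrl -o l2 resctrl /sys/fs/resctrl
# echo 0 > /sys/devices/system/cpu/cpu2/online
-bash: echo: write error: Device or resource busy
# dmesg | tail -n1      # timestamp trimmed
L2 is working, umount /sys/fs/resctrl first.
~~~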

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1

--------------------------------

Since L2 MSC components lose their configuration context after entering
the cpuidle powerdown state, refuse to enter the powerdown state while
an L2 MSC is in use, as specified by the user. When the resctrl fs is
unmounted, the MPAM driver restores all MSC partitions to their default
values and enables the cpuidle powerdown state again.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/arm64/kernel/mpam.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/mpam.c b/arch/arm64/kernel/mpam.c
index d3ac3bfb5564..3f070cbab420 100644
--- a/arch/arm64/kernel/mpam.c
+++ b/arch/arm64/kernel/mpam.c
@@ -8,6 +8,7 @@
 #include <linux/jump_label.h>
 #include <linux/percpu.h>
 #include <linux/crash_dump.h>
+#include <linux/resctrl.h>
 
 DEFINE_STATIC_KEY_FALSE(arm64_mpam_has_hcr);
 DEFINE_STATIC_KEY_FALSE(mpam_enabled);
@@ -18,9 +19,21 @@ static int mpam_pm_notifier(struct notifier_block *self,
 			    unsigned long cmd, void *v)
 {
 	u64 regval;
-	int cpu = smp_processor_id();
+	struct rdt_resource *r;
+	int i, cpu = smp_processor_id();
 
 	switch (cmd) {
+	case CPU_PM_ENTER:
+		if (!resctrl_mounted)
+			return NOTIFY_OK;
+
+		for (i = 0; i < RDT_NUM_RESOURCES; i++) {
+			r = resctrl_arch_get_resource(i);
+			if (!r->invisible && r->is_volatile)
+				return NOTIFY_BAD;
+		}
+
+		return NOTIFY_OK;
 	case CPU_PM_EXIT:
 		/*
 		 * Don't use mpam_thread_switch() as the system register
-- 2.25.1

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT -------------------------------- Fix the crash caused by mounting the resctrl filesystem on a platform that does not support any RDT resources. WARNING: CPU: 0 PID: 13291 at mm/page_alloc.c:4780 __alloc_pages Call Trace: __kmalloc_large_node+0x8f/0x190 __kmalloc+0xb8/0x120 closid_init+0x4a/0x90 rdt_get_tree+0x109/0x400 vfs_get_tree+0x25/0xe0 do_new_mount+0x164/0x1c0 __se_sys_mount+0x165/0x1d0 do_syscall_64+0x56/0x100 Fixes: 1e7284b7981c ("fs/resctrl: Remove the limit on the number of CLOSID") Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- fs/resctrl/rdtgroup.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index b2ed307beb58..df7736e9a8af 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -147,7 +147,7 @@ int closids_supported(void) return closid_free_map_len; } -static void closid_init(void) +static int closid_init(void) { struct resctrl_schema *s; u32 rdt_min_closid = ~0; @@ -156,12 +156,20 @@ static void closid_init(void) list_for_each_entry(s, &resctrl_schema_all, list) rdt_min_closid = min(rdt_min_closid, s->num_closid); + if (rdt_min_closid == ~0) + return -EOPNOTSUPP; + closid_free_map = bitmap_alloc(rdt_min_closid, GFP_KERNEL); + if (!closid_free_map) + return -ENOMEM; + bitmap_fill(closid_free_map, rdt_min_closid); /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ __clear_bit(RESCTRL_RESERVED_CLOSID, closid_free_map); closid_free_map_len = rdt_min_closid; + + return 0; } static void closid_exit(void) @@ -2535,7 +2543,9 @@ static int rdt_get_tree(struct fs_context *fc) if (ret) goto out_schemata_free; - closid_init(); + ret = closid_init(); + if (ret) + goto out_schemata_free; if (resctrl_arch_mon_capable()) flags |= RFTYPE_MON; -- 2.25.1
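With closid_init() now returning an error, such a mount is expected to fail cleanly rather than crash. A sketch, assuming no RDT/MPAM resource is usable (the exact mount(8) wording depends on the util-linux version):

~~~
# mount -t resctrl resctrl /sys/fs/resctrl
mount: /sys/fs/resctrl: mount(2) system call failed: Operation not supported.
~~~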

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

Use the maximum number of RMIDs supported by the system as the number
of RMIDs.

Fixes: cc622cf0fb21 ("arm_mpam: resctrl: Pick a value for num_rmid")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index bfbd244bd4e0..08aef61ce4cf 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -1051,7 +1051,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		 * For mpam, each control group has its own pmg/rmid
 		 * space.
 		 */
-		r->num_rmid = mpam_partid_max * mpam_pmg_max;
+		r->num_rmid = resctrl_arch_system_num_rmid_idx();
 	}
 
 	return 0;
-- 2.25.1
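The visible effect is on num_rmids; for illustration, using the value shown by the manual later in this series (the actual number is platform-specific):

~~~
# cat /sys/fs/resctrl/info/L3_MON/num_rmids
128
~~~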

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

The MPAM hardware does not allow setting 0 as the bandwidth limit
value; doing so results in undefined behavior.

Fixes: 58db5c68e84a ("untested: arm_mpam: resctrl: Add support for MB resource")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index 08aef61ce4cf..a709ada864e0 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -926,6 +926,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 
 		r->membw.delay_linear = true;
 		r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED;
+		r->membw.min_bw = 1;
 		r->membw.bw_gran = get_mba_granularity(cprops);
 
 		/* Round up to at least 1% */
-- 2.25.1
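Together with the bw_validate() range check (see the max_bw patch that follows), a zero bandwidth write is now rejected up front. An illustrative session, assuming a platform maximum of 100:

~~~
# echo "MB:0=0" > /sys/fs/resctrl/schemata
-bash: echo: write error: Invalid argument
# cat /sys/fs/resctrl/info/last_cmd_status
MB value 0 out of range [1,100]
~~~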

From: James Morse <james.morse@arm.com> mainline inclusion from mainline-v6.15-rc1 commit 634ebb98b929443ea1a5502c93f26d01aedbd5c8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- __rdt_get_mem_config_amd() and __get_mem_config_intel() both use the default_ctrl property as a maximum value. This is because the MBA schema works differently between these platforms. Doing this complicates determining whether the default_ctrl property belongs to the arch code, or can be derived from the schema format. Deriving the maximum or default value from the schema format would avoid the architecture code having to tell resctrl such obvious things as the maximum percentage is 100, and the maximum bitmap is all ones. Maximum bandwidth is always going to vary per platform. Add max_bw as a special case. This is currently used for the maximum MBA percentage on Intel platforms, but can be removed from the architecture code if 'percentage' becomes a schema format resctrl supports directly. This value isn't needed for other schema formats. This will allow the default_ctrl to be generated from the schema properties when it is needed. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64 Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Peter Newman <peternewman@google.com> Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64 Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64 Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20250311183715.16445-8-james.morse@arm.com Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- arch/x86/kernel/cpu/resctrl/core.c | 2 ++ fs/resctrl/ctrlmondata.c | 4 ++-- include/linux/resctrl.h | 2 ++ 3 files changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 1454aa1a1f84..44379d4568b4 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -200,6 +200,7 @@ static __init bool __get_mem_config_intel(struct rdt_resource *r) hw_res->num_closid = edx.split.cos_max + 1; max_delay = eax.split.max_delay + 1; r->default_ctrl = MAX_MBA_BW; + r->membw.max_bw = MAX_MBA_BW; r->membw.arch_needs_linear = true; if (ecx & MBA_IS_LINEAR) { r->membw.delay_linear = true; @@ -236,6 +237,7 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r) cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx); hw_res->num_closid = edx + 1; r->default_ctrl = 1 << eax; + r->membw.max_bw = 1 << eax; /* AMD does not use delay */ r->membw.delay_linear = false; diff --git a/fs/resctrl/ctrlmondata.c b/fs/resctrl/ctrlmondata.c index 025fbf91dfc6..0c7a6f37405d 100644 --- a/fs/resctrl/ctrlmondata.c +++ b/fs/resctrl/ctrlmondata.c @@ -62,10 +62,10 @@ static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) return true; } - if (bw < r->membw.min_bw || bw > r->default_ctrl) { + if (bw < r->membw.min_bw || bw > r->membw.max_bw) { rdt_last_cmd_printf("%s value %d out of range [%d,%d]\n", r->name, bw, r->membw.min_bw, - r->default_ctrl); + r->membw.max_bw); return false; } diff --git a/include/linux/resctrl.h 
b/include/linux/resctrl.h index 3f73064769fa..7700f7019d77 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -163,6 +163,7 @@ enum membw_throttle_mode { /** * struct resctrl_membw - Memory bandwidth allocation related data * @min_bw: Minimum memory bandwidth percentage user can request + * @max_bw: Maximum memory bandwidth value, used as the reset value * @bw_gran: Granularity at which the memory bandwidth is allocated * @delay_linear: True if memory B/W delay is in linear scale * @arch_needs_linear: True if we can't configure non-linear resources @@ -175,6 +176,7 @@ enum membw_throttle_mode { */ struct resctrl_membw { u32 min_bw; + u32 max_bw; u32 bw_gran; u32 delay_linear; bool arch_needs_linear; -- 2.25.1

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV

--------------------------------

No longer use the configurable maximum value as the default for the QoS
configuration. Instead, use the recommended value as the default.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index a709ada864e0..4df5aff518da 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -922,6 +922,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
 		r->fflags = RFTYPE_RES_MB;
 		r->default_ctrl = MAX_MBA_BW;
+		r->membw.max_bw = MAX_MBA_BW;
 		r->data_width = 3;
 
 		r->membw.delay_linear = true;
@@ -950,6 +951,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
 		r->fflags = RFTYPE_RES_CACHE;
 		r->default_ctrl = MAX_MBA_BW;
+		r->membw.max_bw = MAX_MBA_BW;
 		r->data_width = 3;
 
 		r->cache_level = class->level;
@@ -965,7 +967,8 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		r->format_str = "%d=%0*u";
 		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
 		r->fflags = RFTYPE_RES_CACHE;
-		r->default_ctrl = MAX_MBA_BW;
+		r->default_ctrl = 0;
+		r->membw.max_bw = MAX_MBA_BW;
 		r->data_width = 3;
 
 		r->cache_level = class->level;
@@ -980,7 +983,8 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		r->format_str = "%d=%0*u";
 		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
 		r->fflags = RFTYPE_RES_MB;
-		r->default_ctrl = MAX_MBA_BW;
+		r->default_ctrl = 0;
+		r->membw.max_bw = MAX_MBA_BW;
 		r->data_width = 3;
 
 		r->membw.delay_linear = true;
@@ -1000,7 +1004,8 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		r->format_str = "%d=%0*u";
 		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
 		r->fflags = RFTYPE_RES_CACHE;
-		r->default_ctrl = GENMASK(cprops->intpri_wd - 1, 0);
+		r->default_ctrl = 0;
+		r->membw.max_bw = GENMASK(cprops->intpri_wd - 1, 0);
 		r->data_width = 3;
 
 		r->cache_level = class->level;
@@ -1015,7 +1020,8 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		r->format_str = "%d=%0*u";
 		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
 		r->fflags = RFTYPE_RES_MB;
-		r->default_ctrl = GENMASK(cprops->intpri_wd - 1, 0);
+		r->default_ctrl = 3;
+		r->membw.max_bw = GENMASK(cprops->intpri_wd - 1, 0);
 		r->data_width = 3;
 
 		r->membw.bw_gran = 1;
@@ -1029,6 +1035,7 @@ static int mpam_resctrl_resource_init(struct mpam_resctrl_res *res)
 		r->schema_fmt = RESCTRL_SCHEMA_RANGE;
 		r->fflags = RFTYPE_RES_MB;
 		r->default_ctrl = 1;
+		r->membw.max_bw = 1;
 		r->data_width = 1;
 
 		r->membw.bw_gran = 1;
-- 2.25.1
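A sketch of what a freshly created group's schemata could now look like on hardware that exposes the QoS resources; the domain ids, field widths, and which lines appear are platform-dependent:

~~~
# mkdir /sys/fs/resctrl/p1
# cat /sys/fs/resctrl/p1/schemata
 MBHDL:0=0000001;1=0000001
 MBPRI:0=0000003;1=0000003
 MBMIN:0=0000000;1=0000000
 L3MIN:1=0000000;122=0000000
~~~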

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1

--------------------------------

When the allocation of mbm_core fails due to insufficient system
memory, the previously allocated mbm_total and mbm_local must also be
released to prevent a memory leak.

Fixes: 556688623b2b ("fs/resctrl: Create l2 cache monitors")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 fs/resctrl/rdtgroup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index df7736e9a8af..9252781360c4 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -3875,7 +3875,8 @@ static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
 		d->mbm_core = kcalloc(idx_limit, tsize, GFP_KERNEL);
 		if (!d->mbm_core) {
 			bitmap_free(d->rmid_busy_llc);
-			kfree(d->mbm_core);
+			kfree(d->mbm_total);
+			kfree(d->mbm_local);
 			return -ENOMEM;
 		}
 	}
-- 2.25.1

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC35SV -------------------------------- Each time a new control group is created, all the resctrl_res_level items need to be initialized with default values to prevent the partid from retaining old configurations due to previous usage. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- fs/resctrl/rdtgroup.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c index 9252781360c4..a42da66ae88e 100644 --- a/fs/resctrl/rdtgroup.c +++ b/fs/resctrl/rdtgroup.c @@ -3166,7 +3166,7 @@ static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) } /* Initialize MBA resource with default values. */ -static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) +static void rdtgroup_init_res(struct rdt_resource *r, u32 closid) { struct resctrl_staged_config *cfg; struct rdt_domain *d; @@ -3194,15 +3194,16 @@ static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) list_for_each_entry(s, &resctrl_schema_all, list) { r = s->res; - if (r->rid == RDT_RESOURCE_MBA || - r->rid == RDT_RESOURCE_SMBA) { - rdtgroup_init_mba(r, rdtgrp->closid); - if (is_mba_sc(r)) - continue; - } else { + if (r->rid == RDT_RESOURCE_L2 || + r->rid == RDT_RESOURCE_L3) { ret = rdtgroup_init_cat(s, rdtgrp->closid); if (ret < 0) goto out; + + } else { + rdtgroup_init_res(r, rdtgrp->closid); + if (is_mba_sc(r)) + continue; } ret = resctrl_arch_update_domains(r, rdtgrp->closid); -- 2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

The patch referenced in the Fixes tag resolves a lockdep warning.
However, directly removing rdt_last_cmd_clear() leaves the
last_cmd_status interface with stale logs, which does not conform to
its behavior before the fix. Therefore, perform the
rdt_last_cmd_clear() operation after successfully acquiring
rdtgroup_mutex.

Fixes: c8eafe149530 ("x86/resctrl: Fix potential lockdep warning")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 fs/resctrl/rdtgroup.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index a42da66ae88e..b97d49f01b1c 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -527,6 +527,8 @@ static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of,
 		goto unlock;
 	}
 
+	rdt_last_cmd_clear();
+
 	if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED ||
 	    rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) {
 		ret = -EINVAL;
@@ -3266,6 +3268,8 @@ static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
 		goto out_unlock;
 	}
 
+	rdt_last_cmd_clear();
+
 	if (rtype == RDTMON_GROUP &&
 	    (prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP ||
 	     prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) {
-- 2.25.1
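The resulting behavior can be sketched as follows: a failing command records a message, and the next command, once it has taken rdtgroup_mutex, clears it again (messages illustrative):

~~~
# echo "MB:0=999" > /sys/fs/resctrl/schemata
-bash: echo: write error: Invalid argument
# cat /sys/fs/resctrl/info/last_cmd_status
MB value 999 out of range [1,100]
# mkdir /sys/fs/resctrl/p2        # next command clears the stale log
# cat /sys/fs/resctrl/info/last_cmd_status
ok
~~~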

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IC88OM

--------------------------------

Add a user manual for the MPAM features.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 Documentation/arch/arm64/mpam.md | 587 +++++++++++++++++++++++++++++++
 1 file changed, 587 insertions(+)
 create mode 100644 Documentation/arch/arm64/mpam.md

diff --git a/Documentation/arch/arm64/mpam.md b/Documentation/arch/arm64/mpam.md
new file mode 100644
index 000000000000..60f62cd7f781
--- /dev/null
+++ b/Documentation/arch/arm64/mpam.md
@@ -0,0 +1,587 @@
+MPAM User Manual
+=========
+
+# 1 Introduction to MPAM
+
+MPAM (Memory system component Partitioning and Monitoring) is an extension introduced in Arm Architecture v8.4. It addresses the degradation of key-application performance, or of overall system performance, caused by contention for shared resources (L3/L2 cache, MATA) when different kinds of workloads are deployed together on a server system.
+
+Applying MPAM isolates and controls, per workload, the contention and conflicts that arise on the hardware memory-access path, which helps raise server utilization and lower service deployment costs.
+
+This manual applies only to the OLK-6.6 software release.
+
+# 2 Kernel Build Option
+
+Setting CONFIG_ARM64_MPAM=y enables the complete MPAM functionality.
+
+# 3 Kernel Boot Parameter
+
+MPAM is disabled by default during boot. To initialize MPAM, add the **arm64.mpam** parameter to the kernel **cmdline** and reboot the machine.
+
+# 4 Interface Overview
+
+The MPAM features are presented through the resctrl filesystem, with the mount point at */sys/fs/resctrl*. After the system boots, the resctrl filesystem has to be mounted manually.
+
+## 4.1 Mount Options
+
+resctrl supports several mount modes through mount options; the exact command is:
+
+~~~
+# mount -t resctrl resctrl [-o cdp[,cdpl2][,debug][,l2]] /sys/fs/resctrl
+~~~
+
+The mount options are:
+
+* cdp: for the L3 cache, configure instruction and data accesses separately.
+* cdpl2: for the L2 cache, configure instruction and data accesses separately.
+* debug: enable access to the debug interfaces.
+* l2: enable the L2 cache configuration and monitoring functions; the MPAM L2 function is disabled by default.
+
+## 4.2 Layout of the resctrl Directory
+
+### 4.2.1 The info Directory
+
+The info directory contains information about the enabled resources. Each resource has its own subdirectory, and the subdirectory name reflects the resource name.
+
+Cache resource (L3/L2) subdirectories contain the following files related to allocation:
+
+**num_closids**: the number of valid CLOSIDs (Class of Service IDs) for this resource. The kernel takes the smallest CLOSID count of all enabled resources as the limit.
+
+**cbm_mask**: the valid bitmask for this resource.
+
+**min_cbm_bits**: the minimum number of consecutive bits that must be set when writing a mask.
+
+**shareable_bits**: bitmask of the resource that is shared with other executing entities. Users can consult this field when setting up exclusive cache partitions.
+
+**bit_usage**: an annotated capacity bitmask showing how all instances of the resource are used. The legend is:
+
+  * *0*: the corresponding region is unused. When resources have been allocated and a "0" is found in "bit_usage", this indicates wasted resources.
+
+  * *H*: the corresponding region is used by hardware only but available to software. If bits are set in the resource's "shareable_bits" but do not all appear in the resource groups' schemata, the bits that appear in "shareable_bits" but are not allocated to any resource group are marked "H".
+
+  * *X*: the corresponding region is available for sharing and used by both hardware and software. These bits appear in both "shareable_bits" and the resource groups' allocations.
+
+  * *S*: the corresponding region is used by software and available for sharing.
+
+  * *E*: the corresponding region is used exclusively by one resource group; sharing is not allowed.
+
+**sparse_masks**: indicates whether non-contiguous 1s in the CBM (Capacity Bit Mask) are supported.
+
+  * *0*: only contiguous 1s in the CBM are supported.
+
+  * *1*: non-contiguous 1s in the CBM are supported.
+
+The MB (memory bandwidth) subdirectory contains the following files related to allocation:
+
+**min_bandwidth**: the minimum memory bandwidth percentage the user can request.
+
+**bandwidth_gran**: the granularity of memory bandwidth percentage allocation. The allocated bandwidth percentage is rounded to the next control step available in hardware. The available bandwidth control steps are:
+~~~
+min_bandwidth + N * bandwidth_gran
+~~~
+
+**delay_linear**: indicates whether the delay scale is linear or non-linear. This field is purely informational.
+
+If monitoring is supported, directories named L3_MON and MB_MON exist and contain the following files:
+
+**num_rmids**: the number of available RMIDs (Resource Monitoring IDs). This is the maximum number of "CTRL_MON" plus "MON" groups that can be created.
+
+**mon_features**: lists the monitoring events if monitoring is enabled for the resource. For example:
+~~~
+# grep . /sys/fs/resctrl/info/*_MON/mon_features
+/sys/fs/resctrl/info/L3_MON/mon_features:llc_occupancy
+/sys/fs/resctrl/info/MB_MON/mon_features:mbm_total_bytes
+~~~
+
+**max_threshold_occupancy**: a read/write file providing a maximum value (in bytes) below which a previously used llc_occupancy counter can be considered for reuse.
+
+> Note that once an RMID is freed it may not be available immediately, since the RMID may still be associated with cache lines that previously used it. Such RMIDs are therefore placed on a "limbo" list and checked again after the cache occupancy has dropped. If many RMIDs are in the "limbo" state but not yet ready for use, the user may see an -EBUSY error during mkdir.
+
+> max_threshold_occupancy is a user-configurable value that determines the occupancy at which an RMID can be freed.
+
+Finally, at the top level of the info directory there is a file named **last_cmd_status**. It is reset every time a "command" is issued through the filesystem (for example, creating a new directory or writing to any control file). If the command succeeded, it reads "ok". If the command failed, it provides more information than can be conveyed by the error return of the file operation. For example:
+~~~
+# echo "MB:1=110" > schemata
+-bash: echo: write error: Invalid argument
+# cat /sys/fs/resctrl/info/last_cmd_status
+MB value 110 out of range [0,100]
+~~~
+
+### 4.2.2 Resource Allocation and Monitoring
+
+Resource groups are represented as directories in the resctrl filesystem. The default group is the root directory, which, immediately after mounting, owns all the tasks and CPUs in the system and can make full use of all resources.
+
+On systems supporting RDT control, additional directories can be created under the root directory that specify different amounts of each resource (see "schemata" below). The root directory and these additional top-level directories are referred to below as "CTRL_MON" groups.
+
+On systems supporting RDT monitoring, the root directory and other top-level directories contain a directory named "mon_groups", in which additional directories can be created to monitor subsets of the tasks in the parent "CTRL_MON" group. These are called "MON" groups in the remainder of this document.
+
+Removing a directory moves all tasks and CPUs owned by the group it represents to the parent directory. Removing one of the created "CTRL_MON" groups automatically removes all "MON" groups below it.
+
+Moving a "MON" group directory to a new parent "CTRL_MON" group is supported, so that the resource allocation of a "MON" group can be changed without affecting its monitoring data or assigned tasks. This operation does not apply to "MON" groups that monitor CPUs. Apart from simply renaming "CTRL_MON" or "MON" groups, no other move operation is currently supported.
+
+All groups contain the following files:
+
+**tasks**: reading this file shows the list of all tasks that belong to this group. Writing a task ID to the file adds the task to the group. Multiple tasks can be added by separating the task IDs with commas. Tasks are assigned sequentially. Any single failure while attempting an assignment aborts the operation; tasks added to the group before the failure remain in the group. Failures are logged to /sys/fs/resctrl/info/last_cmd_status.
+
+If the group is a "CTRL_MON" group, the task is removed from the "CTRL_MON" group that previously owned it, and also from any "MON" group that owned it. If the group is a "MON" group, the task must already belong to the group's parent "CTRL_MON" group, and the task is removed from any previous "MON" group.
+
+**cpus**: reading this file shows a bitmask of the logical CPUs owned by this group. Writing a mask to the file adds or removes CPUs from the group. As with the "tasks" file, a hierarchy is maintained: a "MON" group may only contain CPUs owned by its parent "CTRL_MON" group.
+
+**cpus_list**: like "cpus", but using ranges of CPUs instead of a bitmask.
+
+When control is enabled, all "CTRL_MON" groups also contain the following files:
+
+**schemata**: a list of all resources available to this group. Each resource has its own line and format; see the details below.
+
+**size**: mirrors the display of the "schemata" file, but shows the size in bytes of each allocation instead of the bits representing it.
+
+**mode**: the "mode" of a resource group determines how its allocations are shared. A "shareable" resource group allows its allocations to be shared, while an "exclusive" resource group does not.
+
+**ctrl_hw_id**: available only when the debug option is enabled. The identifier used by hardware for the control group. On arm64, this is the PARTID.
+
+When monitoring is enabled, all "MON" groups also contain the following files:
+
+**mon_data**: contains a set of files organized by L3-domain and MB-domain events. For example, on a system with two L3 domains there are subdirectories "mon_L3_00" and "mon_L3_01".
+
+Each of these directories has one file per event (e.g. "llc_occupancy", "mbm_total_bytes"). In a "MON" group these files provide the current value of the event for all tasks in the group. In a "CTRL_MON" group these files provide the sum over all tasks in the "CTRL_MON" group and all tasks in its "MON" groups. See the example section for more details on usage.
+
+**mon_hw_id**: available only when the debug option is enabled. The identifier used by hardware for the monitor group.
+
+The resctrl filesystem directory tree is as follows:
+
+~~~
+/sys/fs/resctrl (root group)
+    ├── cpus           # CPUs associated with the root group, shown as a bitmask
+    ├── cpus_list      # CPUs associated with the root group, shown as a CPU list
+    ├── ctrl_hw_id     # identifier used by hardware for the control group
+    ├── info           # attribute information and error messages
+    │   ├── L3
+    │   │   ├── bit_usage      # annotated capacity bitmask of how all resource instances are used
+    │   │   ├── cbm_mask       # largest supported cache-way bitmask; one bit per cache way
+    │   │   ├── min_cbm_bits   # smallest cache-way bitmask configurable via schemata
+    │   │   ├── num_closids    # maximum number of control groups L3 can provide
+    │   │   ├── shareable_bits # currently the whole cbm_mask is shareable; kept for later extension
+    │   │   └── sparse_masks   # whether non-contiguous 1s in the CBM (Capacity Bit Mask) are supported
+    │   ├── L3_MON
+    │   │   ├── max_threshold_occupancy # below this value, a previously used llc_occupancy counter may be reused
+    │   │   ├── mon_features   # lists the monitoring events
+    │   │   └── num_rmids      # total number of control and monitor groups that can be created
+    │   ├── last_cmd_status    # error message of the last operation
+    │   ├── MB
+    │   │   ├── bandwidth_gran # granularity of the bandwidth percentage configuration
+    │   │   ├── delay_linear   # whether the delay scale is linear or non-linear
+    │   │   ├── min_bandwidth  # minimum configurable bandwidth percentage
+    │   │   └── num_closids    # same as L3 num_closids
+    │   └── MB_MON
+    │       ├── mon_features   # same as L3 mon_features
+    │       └── num_rmids      # same as L3 num_rmids
+    ├── mode           # how the group's allocations are shared
+    ├── mon_data
+    │   ├── mon_L3_01          # the number is the L3 cache id
+    │   │   └── llc_occupancy  # actual L3 cache occupancy of the pids/CPUs associated with this group, ditto below
+    │   ├── mon_L3_122
+    │   │   └── llc_occupancy
+    │   ├── mon_MB_00          # the number is the NUMA node id
+    │   │   └── mbm_total_bytes # memory bandwidth traffic of the pids/CPUs associated with this group, ditto below
+    │   └── mon_MB_01
+    │       └── mbm_total_bytes
+    ├── mon_groups     # directory for creating monitor groups
+    ├── mon_hw_id      # identifier used by hardware for the monitor group
+    ├── schemata       # resource configuration interface
+    ├── size           # size in bytes of each resource allocation
+    └── tasks          # pids associated with the root group
+~~~
+
+### 4.2.3 The Control Group Configuration Interface: the schemata File
+
+Each line in the **schemata** file describes one resource. The line starts with the name of the resource, followed by the specific values to be applied to each instance of that resource in the system.
+
+#### Cache IDs
+
+On current-generation systems there is one L3 cache per socket, and L2 caches are typically shared only by the hyperthreads on a single core, but this is not an architectural requirement. There could be several independent L3 caches on one socket, or several cores sharing one L2 cache. So instead of using "socket" or "core" to define the set of logical CPUs sharing a resource, we use a "Cache ID".
+
+At a given cache level this is a unique number across the whole system (though it is not guaranteed to be a contiguous sequence; there may be gaps). To find the ID of each logical CPU, look at /sys/devices/system/cpu/cpu*/cache/index*/id.
+
+#### Cache Bit Masks (CBM)
+
+For cache resources, a bitmask describes the portion of the cache available for allocation. The maximum value of the mask is defined per CPU model (and may differ between cache levels). It is provided in the "info" directory of the resctrl filesystem, at info/{resource}/cbm_mask.
+
+#### L3 Cache Configuration
+
+When CDP (code/data cache partitioning) is not enabled, the L3 schemata format is:
+
+~~~
+L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+~~~
+
+When CDP is enabled, L3 control is split into two separate resources, so independent masks can be specified for code and data, as follows:
+
+~~~
+L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+~~~
+
+Reading the schemata file shows the state of all resources on all domains. When writing, only the values you wish to change need to be specified.
+
+For example, mount with the default options and set the L3 cache bitmask to only 4 bits:
+
+~~~
+# mount -t resctrl resctrl /sys/fs/resctrl/
+# cat /sys/fs/resctrl/schemata
+L3:1=fffffff;122=fffffff
+
+# echo "L3:122=3c0;" > /sys/fs/resctrl/schemata
+# cat /sys/fs/resctrl/schemata
+L3:1=fffffff;122=00003c0
+~~~
+
+Mount with CDP enabled and set the L3 data cache bitmask to only 4 bits:
+
+~~~
+# mount -t resctrl resctrl /sys/fs/resctrl/ -o cdp
+# cat /sys/fs/resctrl/schemata
+L3DATA:1=fffffff;122=fffffff
+L3CODE:1=fffffff;122=fffffff
+
+# echo "L3DATA:122=3c0;" > schemata
+# cat /sys/fs/resctrl/schemata
+L3DATA:1=fffffff;122=00003c0
+L3CODE:1=fffffff;122=fffffff
+~~~
+
+#### L2 Cache Configuration
+
+The L2 cache configuration function is disabled by default and is only enabled when the l2 mount option is explicitly added. Once the L2 function is enabled, the system disables the cpuidle powerdown state and CPU offlining.
+
+The L2 schemata format is:
+
+~~~
+L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+~~~
+
+CDP can be supported on L2 with the "cdpl2" mount option:
+
+~~~
+L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+~~~
+
+An L2 cache configuration example, setting the L2 cache bitmask to only 4 bits:
+
+~~~
+# mount -t resctrl resctrl /sys/fs/resctrl/ -o l2
+# cat schemata
+L2:4=000ff;8=000ff
+
+# echo "L2:4=f;" > schemata
+# cat schemata
+L2:4=0000f;8=000ff
+~~~
+
+Using the "cdpl2" mount option:
+
+~~~
+# mount -t resctrl resctrl /sys/fs/resctrl/ -o cdpl2,l2
+# cat schemata
+L2DATA:4=000ff;8=000ff
+L2CODE:4=000ff;8=000ff
+
+# echo "L2DATA:4=f;" > schemata
+# cat schemata
+L2DATA:4=0000f;8=000ff  # the group's L2 DATA may only use 4 ways of L2 cache 4
+L2CODE:4=000ff;8=000ff
+~~~
+
+#### MB Memory Bandwidth Allocation
+
+For the memory bandwidth resource, the user by default controls the resource by indicating a percentage of total memory bandwidth.
+
+The minimum bandwidth percentage value for each CPU model is predefined and can be looked up in info/MB/min_bandwidth. The bandwidth granularity that is allocated also depends on the CPU model and can be looked up in info/MB/bandwidth_gran. The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available in hardware.
+
+The MB schemata format is:
+
+~~~
+MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+~~~
+
+An MB bandwidth configuration example:
+
+~~~
+# cat /sys/fs/resctrl/schemata
+MB:0=0000100;1=0000100
+
+# echo "MB:0=50" > /sys/fs/resctrl/schemata
+# cat /sys/fs/resctrl/schemata
+MB:0=0000050;1=0000100  # lower the group's MB bandwidth cap to 50%
+~~~
+
+### 4.2.4 Resource Allocation Rules
+
+When a task is running, the following rules define which resources are available to it:
+1. If the task is a member of a non-default group, the schemata of that group is used.
+2. Otherwise, if the task belongs to the default group but is running on a CPU assigned to some specific group, the schemata of the CPU's group is used.
+3. Otherwise, the schemata of the default group is used.
+
+### 4.2.5 Configuring Monitor Groups
+
+The monitoring data of control groups and monitor groups can be read through the mon_data directory interface:
+
+~~~
+# grep . mon_data/*/*
+mon_data/mon_L3_01/llc_occupancy:73276416
+mon_data/mon_L3_122/llc_occupancy:11875328
+mon_data/mon_MB_00/mbm_total_bytes:32806
+mon_data/mon_MB_01/mbm_total_bytes:31700
+~~~
+
+The monitoring data files under mon_data are:
+ * llc_occupancy: the current L3 cache occupancy, in bytes
+ * mbm_total_bytes: the instantaneous memory bandwidth traffic, in MB/s
+
+Child monitor groups can be created under a control group to monitor a subset of what the parent control group monitors:
+
+~~~
+# cd /sys/fs/resctrl/p1
+# cd mon_groups/ && mkdir m1  # a monitor group only monitors; m1 follows the resource configuration of p1
+# echo '0-1' > cpus_list
+# grep . mon_data/mon_*/*
+mon_data/mon_L3_01/llc_occupancy:18432
+mon_data/mon_L3_122/llc_occupancy:1024
+mon_data/mon_MB_00/mbm_total_bytes:0
+mon_data/mon_MB_01/mbm_total_bytes:0
+~~~
+
+A control group reports the sum of the monitoring values of the control group itself and of all its child monitor groups.
+
+## 4.3 QoS Enhancement Features
+
+### 4.3.1 PRI Priority Settings
+
+Configure the priority of shared resources, including L3PRI and MBPRI:
+
+~~~
+# cat schemata
+ MBPRI:0=0000007;1=0000007
+ L3PRI:1=0000003;122=0000003
+
+# echo "MBPRI:0=0000003" > schemata
+# cat schemata
+ MBPRI:0=0000003;1=0000007  # lower the group's MB priority on NUMA node 0
+ L3PRI:1=0000003;122=0000003
+~~~
+
+The larger the number, the higher the priority; conversely, the smaller the number, the lower the priority.
+
+> The MBPRI default is 3, with a valid range of [0,7]. The L3PRI default is 0, with a valid range of [0,3].
+
+### 4.3.2 MIN Floor Settings
+
+When the actual usage share of a shared resource falls below the configured value, the priority of that resource usage is raised automatically. This covers L3MIN and MBMIN:
+
+~~~
+# cat schemata
+ MBMIN:0=00100
+ L3MIN:1=00100;5=00100
+
+# echo "MBMIN:0=00050" > schemata
+# cat schemata
+ MBMIN:0=00050
+ L3MIN:1=00100;5=00100
+
+# echo "L3MIN:1=00050" > schemata
+# cat schemata
+ MBMIN:0=00050
+ L3MIN:1=00050;5=00100
+~~~
+
+The L3MIN and MBMIN interfaces take a percentage as input, i.e. the configured value as a share of the total resource (memory bandwidth / cache occupancy).
+
+> The MBMIN and L3MIN defaults are 0; both have a valid range of [0,100].
+
+### 4.3.3 HDL Hard-Limit Settings
+
+When MBHDL=1, MB shared-resource usage is limited so that it cannot exceed the configured MB value. When MBHDL=0, MB usage is allowed to exceed the configured MB value while the system is otherwise idle:
+
+~~~
+# cat schemata
+ MBHDL:0=0000001;1=0000001
+
+# echo "MBHDL:0=0000000" > schemata
+# cat schemata
+ MBHDL:0=0000000;1=0000001  # disable the MB hard limit on NUMA node 0
+~~~
+
+> The MBHDL default is 1, with a valid range of [0,1].
+
+### 4.3.4 MAX Ceiling Settings
+
+Set the maximum percentage of cache capacity that may be allocated, via the L3MAX interface:
+
+~~~
+# cat schemata
+ L3MAX:1=00100;5=00100
+
+# echo "L3MAX:1=00050" > schemata
+# cat schemata
+ L3MAX:1=00050;5=00100  # lower the maximum percentage of allocatable L3 cache capacity
+~~~
+
+> The MB and L3MAX defaults are 100; the valid range of MB is [1,100] and that of L3MAX is [0,100].
+
+## 4.4 Device I/O Traffic Control
+
+### 4.4.1 Binding a Device to a Control Group
+
+MPAM supports bandwidth limiting and monitoring of a device's I/O traffic by binding its iommu_group ID.
+
+For example, to control the network device eno2, first look up the device's PCI_SLOT information:
+
+~~~
+# cat /sys/class/net/eno2/device/uevent | grep PCI_SLOT
+PCI_SLOT_NAME=0000:35:00.1
+~~~
+
+Or check the bus-info with the ethtool tool:
+
+~~~
+# ethtool -i eno2 | grep bus-info
+bus-info: 0000:35:00.1
+~~~
+
+Using the device's bus information, find the iommu_group the device belongs to:
+
+~~~
+# find /sys/kernel/iommu_groups/ -name "0000:35:00.1"
+/sys/kernel/iommu_groups/17/devices/0000:35:00.1
+~~~
+
+Or check the iommu_group with the lspci tool:
+
+~~~
+# lspci -vvv -s 0000:35:00.1 | grep "IOMMU group"
+        IOMMU group: 17
+~~~
+
+Bind the group id found above (17) to the chosen control group through the tasks interface:
+
+~~~
+# cd /sys/fs/resctrl/p1/
+# echo "iommu_group:17" > tasks
+# cat tasks
+iommu_group:17  # iommu_group 17 is now bound to control group p1
+~~~
+
+### 4.4.2 Reading Device Bandwidth Traffic
+
+After the control group is bound to the iommu group the device belongs to, the device's traffic bandwidth can be read:
+
+~~~
+# grep . mon_data/mon_MB_0*/*
+mon_data/mon_MB_00/mbm_total_bytes:0
+mon_data/mon_MB_01/mbm_total_bytes:4230
+~~~
+
+### 4.4.3 Limiting Device Bandwidth Traffic
+
+The device's traffic bandwidth can be limited through the MB configuration interface:
+
+~~~
+# echo "MB:1=0000001" > schemata
+# cat schemata
+ MB:0=0000100;1=0000000
+
+# grep . mon_data/*/*
+mon_data/mon_MB_00/mbm_total_bytes:0
+mon_data/mon_MB_01/mbm_total_bytes:1208
+~~~
+
+# 5 Usage Examples for Control and Monitor Groups
+
+/sys/fs/resctrl is the root group by default. Several control groups can be created under the root group; a control group can be associated either with a set of pids/tids or with a set of CPUs.
+
+Create a new control group and associate pids/tids with it:
+
+~~~
+# cd /sys/fs/resctrl/ && mkdir p1
+# cd p1 && echo $$ > tasks  # associate the current shell's pid with group p1
+# cat tasks                 # shows the successfully associated pids
+29190
+29607
+~~~
+
+Alternatively, associate CPUs:
+
+~~~
+# cd p1 && echo '0-1' > cpus_list
+# cat cpus_list             # shows the associated CPUs
+0-1
+~~~
+
+Check how many control groups and monitor groups can be created:
+
+~~~
+# cat info/L3/num_closids   # number of closids, i.e. how many control groups can be created
+32
+# cat info/MB/num_closids
+32
+# cat info/L3_MON/num_rmids # total number of control and monitor groups that can be created
+128
+# cat info/MB_MON/num_rmids
+128
+~~~
+
+Configuring a control group isolates the L3 cache / memory bandwidth, and reading the group's mon_data interface shows the group's resource usage. For example, to limit a control group's L3 cache usage:
+
+~~~
+# cat info/L3/cbm_mask      # check the resource attributes in the info directory
+fffffff
+# cat info/L3/min_cbm_bits
+1
+
+# cd /sys/fs/resctrl/p1
+# cat schemata
+ MB:0=0000100;1=0000100
+ L3:1=fffffff;122=fffffff
+
+# echo 'L3:1=1' > schemata  # assign 1 cache way to group p1
+# cat schemata
+ MB:0=0000100;1=0000100     # memory accesses from the group's pids/CPUs may only allocate into this cache way
+ L3:1=0000001;122=fffffff
+~~~
+
+Limit the control group's MB usage:
+
+~~~
+# cat info/MB/min_bandwidth # as with L3, the MB attributes can be inspected
+1
+# cat info/MB/bandwidth_gran
+1                           # so the minimum bandwidth percentage is 1% and the granularity is 1%
+# cat schemata
+ MB:0=0000100;1=0000100
+ L3:1=0000001;122=fffffff
+# echo 'MB:0=1' > schemata
+# cat schemata
+ MB:0=0000001;1=0000100
+ L3:1=0000001;122=fffffff
+~~~
+
+Child monitor groups can be created under a control group:
+
+~~~
+# cd /sys/fs/resctrl/p1
+# cd mon_groups/ && mkdir m1  # a monitor group only monitors; m1 follows the resource configuration of p1
+# ls m1/
+cpus  cpus_list  mon_data  tasks
+# echo '0-1' > cpus_list
+# cat cpus_list
+0-1
+# grep . mon_data/mon_*/*
+mon_data/mon_L3_01/llc_occupancy:18432
+mon_data/mon_L3_122/llc_occupancy:1024
+mon_data/mon_MB_00/mbm_total_bytes:0
+mon_data/mon_MB_01/mbm_total_bytes:0
+~~~
-- 2.25.1

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAG93D

--------------------------------

The L3 cache occupancy limbo handler monitors idle RMIDs. If it detects
that the cache occupancy of a monitored RMID is below the
max_threshold_occupancy value, it releases the RMID from the limbo list
back to the rmid_free list.

However, when the user sets a high max_threshold_occupancy value and
frequently creates or deletes monitoring groups, or simply mounts and
unmounts the resctrl filesystem frequently, the limbo handler's delayed
scheduling can keep idle RMIDs from being released back to the
rmid_free list in time, which makes creating a monitor group fail with
ENOSPC.

Fixes: 13e249bf4944 ("x86/resctrl: Move the filesystem portions of resctrl to live in '/fs/'")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 fs/resctrl/monitor.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/resctrl/monitor.c b/fs/resctrl/monitor.c
index cbb8fc878cde..a8f8b228b8c2 100644
--- a/fs/resctrl/monitor.c
+++ b/fs/resctrl/monitor.c
@@ -289,7 +289,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
 	 * setup up the limbo worker.
 	 */
 	if (!has_busy_rmid(d))
-		cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL,
+		cqm_setup_limbo_handler(d, 0,
 					RESCTRL_PICK_ANY_CPU);
 	set_bit(idx, d->rmid_busy_llc);
 	entry->busy++;
-- 2.25.1
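For context, the failure this avoids looks roughly like the sketch below when every free RMID is parked on the limbo list; the error message text is illustrative:

~~~
# mkdir /sys/fs/resctrl/mon_groups/m1
mkdir: cannot create directory '/sys/fs/resctrl/mon_groups/m1': No space left on device
# cat /sys/fs/resctrl/info/last_cmd_status
Out of RMIDs
~~~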

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC30P1

--------------------------------

Currently, only L3_MON supports the limbo mechanism, while L2_MON does
not. Therefore, during initialization, keep L2_MON from incorrectly
initializing the resctrl_rmid_realloc_limit and
resctrl_rmid_realloc_threshold variables.

Similarly, since L2_MON does not support the limbo mechanism, read and
write operations on its max_threshold_occupancy interface are not
supported and return directly.

Fixes: 556688623b2b ("fs/resctrl: Create l2 cache monitors")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 drivers/platform/mpam/mpam_resctrl.c | 6 ++++--
 fs/resctrl/rdtgroup.c                | 9 +++++++++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/mpam/mpam_resctrl.c b/drivers/platform/mpam/mpam_resctrl.c
index 4df5aff518da..b3bc9ee5625b 100644
--- a/drivers/platform/mpam/mpam_resctrl.c
+++ b/drivers/platform/mpam/mpam_resctrl.c
@@ -694,8 +694,10 @@ static void mpam_resctrl_pick_caches(void)
 			continue;
 		}
 
-		if (mpam_has_feature(mpam_feat_msmon_csu, cprops))
-			update_rmid_limits(cache_size);
+		if (mpam_has_feature(mpam_feat_msmon_csu, cprops)) {
+			if (class->level == 3)
+				update_rmid_limits(cache_size);
+		}
 
 		if (has_cpor) {
 			if (class->level == 2) {
diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
index b97d49f01b1c..da5c44983183 100644
--- a/fs/resctrl/rdtgroup.c
+++ b/fs/resctrl/rdtgroup.c
@@ -1166,6 +1166,11 @@ static int rdt_delay_linear_show(struct kernfs_open_file *of,
 static int max_threshold_occ_show(struct kernfs_open_file *of,
 				  struct seq_file *seq, void *v)
 {
+	struct rdt_resource *r = of->kn->parent->priv;
+
+	if (r->cache_level != 3)
+		return 0;
+
 	seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold);
 
 	return 0;
@@ -1188,9 +1193,13 @@ static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of,
 static ssize_t max_threshold_occ_write(struct kernfs_open_file *of, char *buf,
 				       size_t nbytes, loff_t off)
 {
+	struct rdt_resource *r = of->kn->parent->priv;
 	unsigned int bytes;
 	int ret;
 
+	if (r->cache_level != 3)
+		return 0;
+
 	ret = kstrtouint(buf, 0, &bytes);
 	if (ret)
 		return ret;
-- 2.25.1
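After this change only the L3 instance of the file reports a value. A sketch, assuming resctrl is mounted with "-o l2" so that the L2_MON directory exists (the threshold value is illustrative):

~~~
# cat /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy
65536
# cat /sys/fs/resctrl/info/L2_MON/max_threshold_occupancy
# (empty: L2_MON does not implement the limbo mechanism)
~~~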

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8T2RT

--------------------------------

The implementation of rdtgroup_pseudo_locked_in_hierarchy() relies on
CONFIG_RESCTRL_FS_PSEUDO_LOCK. When this configuration is disabled, the
function should return false, in line with its return type, instead of
-EOPNOTSUPP.

Fixes: dd5cb90e6b16 ("x86/resctrl: Allow an architecture to disable pseudo lock")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 fs/resctrl/psuedo_lock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/resctrl/psuedo_lock.c b/fs/resctrl/psuedo_lock.c
index 2fa42d0a33ea..f2737fd562cf 100644
--- a/fs/resctrl/psuedo_lock.c
+++ b/fs/resctrl/psuedo_lock.c
@@ -656,7 +656,7 @@ bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_domain *d)
 	lockdep_assert_cpus_held();
 
 	if (!IS_ENABLED(CONFIG_RESCTRL_FS_PSEUDO_LOCK))
-		return -EOPNOTSUPP;
+		return false;
 
 	if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL))
 		return true;
-- 2.25.1

This preparatory patch restores the RDT implementation to its state at commit 543fe1576126 ("bytedance: config: enable pvpanic"), laying the groundwork for the subsequent separation of the common resctrl layer between ARM MPAM and x86 RDT. Signed-off-by: Zeng Heng <zengheng4@huawei.com> --- Documentation/arch/x86/resctrl.rst | 6 + arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/resctrl.h | 55 +- arch/x86/kernel/cpu/resctrl/core.c | 472 +- arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 574 ++- arch/x86/kernel/cpu/resctrl/internal.h | 486 +- arch/x86/kernel/cpu/resctrl/monitor.c | 1059 +++- arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 1128 ++++- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4273 ++++++++++++++++- .../resctrl/{pseudo_lock_event.h => trace.h} | 24 +- include/linux/resctrl.h | 280 +- 11 files changed, 7672 insertions(+), 686 deletions(-) rename arch/x86/kernel/cpu/resctrl/{pseudo_lock_event.h => trace.h} (56%) diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index 908861167821..6f07af4a885e 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -450,6 +450,12 @@ during mkdir. max_threshold_occupancy is a user configurable value to determine the occupancy at which an RMID can be freed. +The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes +for a subset of RMID that are not immediately available for allocation. +This can't be relied on to produce output every second, it may be necessary +to attempt to create an empty monitor group to force an update. Output may +only be produced if creation of a control or monitor group fails. + Schemata files - general concepts --------------------------------- Each line in the file describes one resource. The line starts with diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 182fa1daccb6..0aae302e3daa 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -1156,6 +1156,7 @@ #define MSR_IA32_QM_CTR 0xc8e #define MSR_IA32_PQR_ASSOC 0xc8f #define MSR_IA32_L3_CBM_BASE 0xc90 +#define MSR_RMID_SNC_CONFIG 0xca0 #define MSR_IA32_L2_CBM_BASE 0xd10 #define MSR_IA32_MBA_THRTL_BASE 0xd50 diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h index 406b4cf301bb..8b1b6ce1e51b 100644 --- a/arch/x86/include/asm/resctrl.h +++ b/arch/x86/include/asm/resctrl.h @@ -4,10 +4,8 @@ #ifdef CONFIG_X86_CPU_RESCTRL -#include <linux/jump_label.h> -#include <linux/percpu.h> #include <linux/sched.h> -#include <linux/resctrl_types.h> +#include <linux/jump_label.h> /* * This value can never be a valid CLOSID, and is used when mapping a @@ -16,8 +14,6 @@ */ #define X86_RESCTRL_EMPTY_CLOSID ((u32)~0) -void resctrl_arch_reset_resources(void); - /** * struct resctrl_pqr_state - State cache for the PQR MSR * @cur_rmid: The cached Resource Monitoring ID @@ -44,7 +40,6 @@ DECLARE_PER_CPU(struct resctrl_pqr_state, pqr_state); extern bool rdt_alloc_capable; extern bool rdt_mon_capable; -extern unsigned int rdt_mon_features; DECLARE_STATIC_KEY_FALSE(rdt_enable_key); DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key); @@ -84,25 +79,6 @@ static inline void resctrl_arch_disable_mon(void) static_branch_dec_cpuslocked(&rdt_enable_key); } -static inline bool resctrl_arch_is_llc_occupancy_enabled(void) -{ - return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID)); -} - -static inline bool resctrl_arch_is_mbm_total_enabled(void) -{ - return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID)); -} - -static 
inline bool resctrl_arch_would_mbm_overflow(void) { return true; } - -static inline bool resctrl_arch_is_mbm_local_enabled(void) -{ - return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); -} - -static inline bool resctrl_arch_is_mbm_core_enabled(void) { return false; } - /* * __resctrl_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR * @@ -120,8 +96,8 @@ static inline bool resctrl_arch_is_mbm_core_enabled(void) { return false; } static inline void __resctrl_sched_in(struct task_struct *tsk) { struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state); - u32 closid = READ_ONCE(state->default_closid); - u32 rmid = READ_ONCE(state->default_rmid); + u32 closid = state->default_closid; + u32 rmid = state->default_rmid; u32 tmp; /* @@ -156,13 +132,6 @@ static inline unsigned int resctrl_arch_round_mon_val(unsigned int val) return val * scale; } -static inline void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, - u32 rmid) -{ - WRITE_ONCE(per_cpu(pqr_state.default_closid, cpu), closid); - WRITE_ONCE(per_cpu(pqr_state.default_rmid, cpu), rmid); -} - static inline void resctrl_arch_set_closid_rmid(struct task_struct *tsk, u32 closid, u32 rmid) { @@ -187,12 +156,6 @@ static inline void resctrl_sched_in(struct task_struct *tsk) __resctrl_sched_in(tsk); } -static inline u32 resctrl_arch_system_num_rmid_idx(void) -{ - /* RMID are independent numbers for x86. num_rmid_idx == num_rmid */ - return boot_cpu_data.x86_cache_max_rmid + 1; -} - static inline void resctrl_arch_rmid_idx_decode(u32 idx, u32 *closid, u32 *rmid) { *rmid = idx; @@ -215,20 +178,8 @@ static inline void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid, void *ctx) { }; -u64 resctrl_arch_get_prefetch_disable_bits(void); -int resctrl_arch_pseudo_lock_fn(void *_plr); -int resctrl_arch_measure_cycles_lat_fn(void *_plr); -int resctrl_arch_measure_l2_residency(void *_plr); -int resctrl_arch_measure_l3_residency(void *_plr); void resctrl_cpu_detect(struct cpuinfo_x86 *c); -bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l); -int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); -static inline bool resctrl_arch_hide_cdp(enum resctrl_res_level rid) -{ - return false; -}; - #else static inline void resctrl_sched_in(struct task_struct *tsk) {} diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c index 44379d4568b4..89a3b145dff8 100644 --- a/arch/x86/kernel/cpu/resctrl/core.c +++ b/arch/x86/kernel/cpu/resctrl/core.c @@ -19,7 +19,6 @@ #include <linux/cpu.h> #include <linux/slab.h> #include <linux/err.h> -#include <linux/cacheinfo.h> #include <linux/cpuhotplug.h> #include <asm/intel-family.h> @@ -44,22 +43,24 @@ static DEFINE_MUTEX(domain_list_lock); */ DEFINE_PER_CPU(struct resctrl_pqr_state, pqr_state); +/* + * Used to store the max resource name width and max resource data width + * to display the schemata in a tabular format + */ +int max_name_width, max_data_width; + /* * Global boolean for rdt_alloc which is true if any * resource allocation is enabled. 
*/ bool rdt_alloc_capable; -static void -mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m, - struct rdt_resource *r); -static void -cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r); -static void -mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, - struct rdt_resource *r); +static void mba_wrmsr_intel(struct msr_param *m); +static void cat_wrmsr(struct msr_param *m); +static void mba_wrmsr_amd(struct msr_param *m); -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains) +#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains) +#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains) struct rdt_hw_resource rdt_resources_all[] = { [RDT_RESOURCE_L3] = @@ -67,11 +68,13 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_L3, .name = "L3", - .cache_level = 3, - .domains = domain_init(RDT_RESOURCE_L3), + .ctrl_scope = RESCTRL_L3_CACHE, + .mon_scope = RESCTRL_L3_CACHE, + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3), + .mon_domains = mon_domain_init(RDT_RESOURCE_L3), + .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, - .schema_fmt = RESCTRL_SCHEMA_BITMAP, }, .msr_base = MSR_IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, @@ -81,11 +84,11 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_L2, .name = "L2", - .cache_level = 2, - .domains = domain_init(RDT_RESOURCE_L2), + .ctrl_scope = RESCTRL_L2_CACHE, + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2), + .parse_ctrlval = parse_cbm, .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, - .schema_fmt = RESCTRL_SCHEMA_BITMAP, }, .msr_base = MSR_IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, @@ -95,11 +98,11 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_MBA, .name = "MB", - .cache_level = 3, - .domains = domain_init(RDT_RESOURCE_MBA), + .ctrl_scope = RESCTRL_L3_CACHE, + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA), + .parse_ctrlval = parse_bw, .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, - .schema_fmt = RESCTRL_SCHEMA_RANGE, }, }, [RDT_RESOURCE_SMBA] = @@ -107,21 +110,21 @@ struct rdt_hw_resource rdt_resources_all[] = { .r_resctrl = { .rid = RDT_RESOURCE_SMBA, .name = "SMBA", - .cache_level = 3, - .domains = domain_init(RDT_RESOURCE_SMBA), + .ctrl_scope = RESCTRL_L3_CACHE, + .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA), + .parse_ctrlval = parse_bw, .format_str = "%d=%*u", .fflags = RFTYPE_RES_MB, - .schema_fmt = RESCTRL_SCHEMA_RANGE, }, }, }; -struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l) +u32 resctrl_arch_system_num_rmid_idx(void) { - if (l >= RDT_NUM_RESOURCES) - return NULL; + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; - return &rdt_resources_all[l].r_resctrl; + /* RMID are independent numbers for x86. num_rmid_idx == num_rmid */ + return r->num_rmid; } /* @@ -168,6 +171,21 @@ static inline void cache_alloc_hsw_probe(void) rdt_alloc_capable = true; } +bool is_mba_sc(struct rdt_resource *r) +{ + if (!r) + return rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl.membw.mba_sc; + + /* + * The software controller support is only applicable to MBA resource. + * Make sure to check for resource type. + */ + if (r->rid != RDT_RESOURCE_MBA) + return false; + + return r->membw.mba_sc; +} + /* * rdt_get_mb_table() - get a mapping of bandwidth(b/w) percentage values * exposed to user interface and the h/w understandable delay values. 
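For reference: on hardware with a linear delay scale, the percentage-to-delay mapping handled by delay_bw_map() below reduces to subtracting the requested bandwidth percentage from MAX_MBA_BW. A minimal standalone sketch of that linear case; MAX_MBA_BW mirrors the define this patch adds to internal.h, and linear_delay() is an illustrative name rather than a kernel symbol:

        #include <stdio.h>

        #define MAX_MBA_BW      100u    /* mirrors the define in internal.h */

        /* Linear b/w percentage -> h/w delay value (100% b/w == 0 delay). */
        static unsigned int linear_delay(unsigned int bw_pct)
        {
                return MAX_MBA_BW - bw_pct;
        }

        int main(void)
        {
                /* Requesting 80% of memory bandwidth programs a delay of 20. */
                printf("bw=80%% -> delay=%u\n", linear_delay(80));
                return 0;
        }
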
@@ -200,7 +218,6 @@ static __init bool __get_mem_config_intel(struct rdt_resource *r) hw_res->num_closid = edx.split.cos_max + 1; max_delay = eax.split.max_delay + 1; r->default_ctrl = MAX_MBA_BW; - r->membw.max_bw = MAX_MBA_BW; r->membw.arch_needs_linear = true; if (ecx & MBA_IS_LINEAR) { r->membw.delay_linear = true; @@ -217,6 +234,7 @@ static __init bool __get_mem_config_intel(struct rdt_resource *r) r->membw.throttle_mode = THREAD_THROTTLE_PER_THREAD; else r->membw.throttle_mode = THREAD_THROTTLE_MAX; + thread_throttle_mode_init(); r->alloc_capable = true; @@ -237,7 +255,6 @@ static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r) cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx); hw_res->num_closid = edx + 1; r->default_ctrl = 1 << eax; - r->membw.max_bw = 1 << eax; /* AMD does not use delay */ r->membw.delay_linear = false; @@ -297,17 +314,11 @@ static void rdt_get_cdp_l2_config(void) rdt_get_cdp_config(RDT_RESOURCE_L2); } -bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l) -{ - return rdt_resources_all[l].cdp_enabled; -} - -static void -mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) +static void mba_wrmsr_amd(struct msr_param *m) { + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom); + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); unsigned int i; - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); for (i = m->low; i < m->high; i++) wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]); @@ -327,30 +338,57 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r) return r->default_ctrl; } -static void -mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m, - struct rdt_resource *r) +static void mba_wrmsr_intel(struct msr_param *m) { + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom); + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); unsigned int i; - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); /* Write the delay values for mba. 
*/ for (i = m->low; i < m->high; i++) - wrmsrl(hw_res->msr_base + i, delay_bw_map(hw_dom->ctrl_val[i], r)); + wrmsrl(hw_res->msr_base + i, delay_bw_map(hw_dom->ctrl_val[i], m->res)); } -static void -cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) +static void cat_wrmsr(struct msr_param *m) { + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom); + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); unsigned int i; - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); for (i = m->low; i < m->high; i++) wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]); } +struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r) +{ + struct rdt_ctrl_domain *d; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + /* Find the domain that contains this CPU */ + if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) + return d; + } + + return NULL; +} + +struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r) +{ + struct rdt_mon_domain *d; + + lockdep_assert_cpus_held(); + + list_for_each_entry(d, &r->mon_domains, hdr.list) { + /* Find the domain that contains this CPU */ + if (cpumask_test_cpu(cpu, &d->hdr.cpu_mask)) + return d; + } + + return NULL; +} + u32 resctrl_arch_get_num_closid(struct rdt_resource *r) { return resctrl_to_arch_res(r)->num_closid; @@ -358,40 +396,29 @@ u32 resctrl_arch_get_num_closid(struct rdt_resource *r) void rdt_ctrl_update(void *arg) { + struct rdt_hw_resource *hw_res; struct msr_param *m = arg; - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res); - struct rdt_resource *r = m->res; - int cpu = smp_processor_id(); - struct rdt_domain *d; - d = resctrl_get_domain_from_cpu(cpu, r); - if (d) { - hw_res->msr_update(d, m, r); - return; - } - pr_warn_once("cpu %d not found in any domain for resource %s\n", - cpu, r->name); + hw_res = resctrl_to_arch_res(m->res); + hw_res->msr_update(m); } /* - * rdt_find_domain - Find a domain in a resource that matches input resource id + * rdt_find_domain - Search for a domain id in a resource domain list. * - * Search resource r's domain list to find the resource id. If the resource - * id is found in a domain, return the domain. Otherwise, if requested by - * caller, return the first domain whose id is bigger than the input id. - * The domain list is sorted by id in ascending order. + * Search the domain list to find the domain id. If the domain id is + * found, return the domain. NULL otherwise. If the domain id is not + * found (and NULL returned) then the first domain with id bigger than + * the input id can be returned to the caller via @pos. */ -static struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, - struct list_head **pos) +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id, + struct list_head **pos) { - struct rdt_domain *d; + struct rdt_domain_hdr *d; struct list_head *l; - if (id < 0) - return ERR_PTR(-ENODEV); - - list_for_each(l, &r->domains) { - d = list_entry(l, struct rdt_domain, list); + list_for_each(l, h) { + d = list_entry(l, struct rdt_domain_hdr, list); /* When id is found, return its domain. 
*/ if (id == d->id) return d; @@ -406,11 +433,6 @@ static struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, return NULL; } -struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id) -{ - return rdt_find_domain(r, id, NULL); -} - static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc) { struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); @@ -425,18 +447,23 @@ static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc) *dc = r->default_ctrl; } -static void domain_free(struct rdt_hw_domain *hw_dom) +static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom) +{ + kfree(hw_dom->ctrl_val); + kfree(hw_dom); +} + +static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom) { kfree(hw_dom->arch_mbm_total); kfree(hw_dom->arch_mbm_local); - kfree(hw_dom->ctrl_val); kfree(hw_dom); } -static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d) +static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d) { + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d); struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); struct msr_param m; u32 *dc; @@ -448,9 +475,11 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d) hw_dom->ctrl_val = dc; setup_default_ctrlval(r, dc); + m.res = r; + m.dom = d; m.low = 0; m.high = hw_res->num_closid; - hw_res->msr_update(d, &m, r); + hw_res->msr_update(&m); return 0; } @@ -459,17 +488,17 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d) * @num_rmid: The size of the MBM counter array * @hw_dom: The domain that owns the allocated arrays */ -static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom) +static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom) { size_t tsize; - if (resctrl_arch_is_mbm_total_enabled()) { + if (is_mbm_total_enabled()) { tsize = sizeof(*hw_dom->arch_mbm_total); hw_dom->arch_mbm_total = kcalloc(num_rmid, tsize, GFP_KERNEL); if (!hw_dom->arch_mbm_total) return -ENOMEM; } - if (resctrl_arch_is_mbm_local_enabled()) { + if (is_mbm_local_enabled()) { tsize = sizeof(*hw_dom->arch_mbm_local); hw_dom->arch_mbm_local = kcalloc(num_rmid, tsize, GFP_KERNEL); if (!hw_dom->arch_mbm_local) { @@ -482,37 +511,45 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom) return 0; } -/* - * domain_add_cpu - Add a cpu to a resource's domain list. - * - * If an existing domain in the resource r's domain list matches the cpu's - * resource id, add the cpu in the domain. - * - * Otherwise, a new domain is allocated and inserted into the right position - * in the domain list sorted by id in ascending order. - * - * The order in the domain list is visible to users when we print entries - * in the schemata file and schemata input is validated to have the same order - * as this list. 
- */ -static void domain_add_cpu(int cpu, struct rdt_resource *r) +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope) +{ + switch (scope) { + case RESCTRL_L2_CACHE: + case RESCTRL_L3_CACHE: + return get_cpu_cacheinfo_id(cpu, scope); + case RESCTRL_L3_NODE: + return cpu_to_node(cpu); + default: + break; + } + + return -EINVAL; +} + +static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r) { - int id = get_cpu_cacheinfo_id(cpu, r->cache_level); + int id = get_domain_id_from_scope(cpu, r->ctrl_scope); + struct rdt_hw_ctrl_domain *hw_dom; struct list_head *add_pos = NULL; - struct rdt_hw_domain *hw_dom; - struct rdt_domain *d; + struct rdt_domain_hdr *hdr; + struct rdt_ctrl_domain *d; int err; lockdep_assert_held(&domain_list_lock); - d = rdt_find_domain(r, id, &add_pos); - if (IS_ERR(d)) { - pr_warn("Couldn't find cache id for CPU %d\n", cpu); + if (id < 0) { + pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n", + cpu, r->ctrl_scope, r->name); return; } - if (d) { - cpumask_set_cpu(cpu, &d->cpu_mask); + hdr = rdt_find_domain(&r->ctrl_domains, id, &add_pos); + if (hdr) { + if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN)) + return; + d = container_of(hdr, struct rdt_ctrl_domain, hdr); + + cpumask_set_cpu(cpu, &d->hdr.cpu_mask); if (r->cache.arch_has_per_cpu_cfg) rdt_domain_reconfigure_cdp(r); return; @@ -523,64 +560,187 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r) return; d = &hw_dom->d_resctrl; - d->id = id; - cpumask_set_cpu(cpu, &d->cpu_mask); + d->hdr.id = id; + d->hdr.type = RESCTRL_CTRL_DOMAIN; + cpumask_set_cpu(cpu, &d->hdr.cpu_mask); rdt_domain_reconfigure_cdp(r); - if (r->alloc_capable && domain_setup_ctrlval(r, d)) { - domain_free(hw_dom); + if (domain_setup_ctrlval(r, d)) { + ctrl_domain_free(hw_dom); return; } - if (r->mon_capable && arch_domain_mbm_alloc(r->num_rmid, hw_dom)) { - domain_free(hw_dom); + list_add_tail_rcu(&d->hdr.list, add_pos); + + err = resctrl_online_ctrl_domain(r, d); + if (err) { + list_del_rcu(&d->hdr.list); + synchronize_rcu(); + ctrl_domain_free(hw_dom); + } +} + +static void domain_add_cpu_mon(int cpu, struct rdt_resource *r) +{ + int id = get_domain_id_from_scope(cpu, r->mon_scope); + struct list_head *add_pos = NULL; + struct rdt_hw_mon_domain *hw_dom; + struct rdt_domain_hdr *hdr; + struct rdt_mon_domain *d; + int err; + + lockdep_assert_held(&domain_list_lock); + + if (id < 0) { + pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n", + cpu, r->mon_scope, r->name); + return; + } + + hdr = rdt_find_domain(&r->mon_domains, id, &add_pos); + if (hdr) { + if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) + return; + d = container_of(hdr, struct rdt_mon_domain, hdr); + + cpumask_set_cpu(cpu, &d->hdr.cpu_mask); return; } - list_add_tail_rcu(&d->list, add_pos); + hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu)); + if (!hw_dom) + return; + + d = &hw_dom->d_resctrl; + d->hdr.id = id; + d->hdr.type = RESCTRL_MON_DOMAIN; + d->ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE); + if (!d->ci) { + pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name); + mon_domain_free(hw_dom); + return; + } + cpumask_set_cpu(cpu, &d->hdr.cpu_mask); + + arch_mon_domain_online(r, d); + + if (arch_domain_mbm_alloc(r->num_rmid, hw_dom)) { + mon_domain_free(hw_dom); + return; + } - err = resctrl_online_domain(r, d); + list_add_tail_rcu(&d->hdr.list, add_pos); + + err = resctrl_online_mon_domain(r, d); if (err) { - list_del_rcu(&d->list); + 
list_del_rcu(&d->hdr.list); synchronize_rcu(); - domain_free(hw_dom); + mon_domain_free(hw_dom); } } -static void domain_remove_cpu(int cpu, struct rdt_resource *r) +static void domain_add_cpu(int cpu, struct rdt_resource *r) { - int id = get_cpu_cacheinfo_id(cpu, r->cache_level); - struct rdt_hw_domain *hw_dom; - struct rdt_domain *d; + if (r->alloc_capable) + domain_add_cpu_ctrl(cpu, r); + if (r->mon_capable) + domain_add_cpu_mon(cpu, r); +} + +static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r) +{ + int id = get_domain_id_from_scope(cpu, r->ctrl_scope); + struct rdt_hw_ctrl_domain *hw_dom; + struct rdt_domain_hdr *hdr; + struct rdt_ctrl_domain *d; lockdep_assert_held(&domain_list_lock); - d = rdt_find_domain(r, id, NULL); - if (IS_ERR_OR_NULL(d)) { - pr_warn("Couldn't find cache id for CPU %d\n", cpu); + if (id < 0) { + pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n", + cpu, r->ctrl_scope, r->name); + return; + } + + hdr = rdt_find_domain(&r->ctrl_domains, id, NULL); + if (!hdr) { + pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n", + id, cpu, r->name); return; } - hw_dom = resctrl_to_arch_dom(d); - cpumask_clear_cpu(cpu, &d->cpu_mask); - if (cpumask_empty(&d->cpu_mask)) { - resctrl_offline_domain(r, d); - list_del_rcu(&d->list); + if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN)) + return; + + d = container_of(hdr, struct rdt_ctrl_domain, hdr); + hw_dom = resctrl_to_arch_ctrl_dom(d); + + cpumask_clear_cpu(cpu, &d->hdr.cpu_mask); + if (cpumask_empty(&d->hdr.cpu_mask)) { + resctrl_offline_ctrl_domain(r, d); + list_del_rcu(&d->hdr.list); synchronize_rcu(); /* - * rdt_domain "d" is going to be freed below, so clear + * rdt_ctrl_domain "d" is going to be freed below, so clear * its pointer from pseudo_lock_region struct. */ if (d->plr) d->plr->d = NULL; - domain_free(hw_dom); + ctrl_domain_free(hw_dom); + + return; + } +} + +static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r) +{ + int id = get_domain_id_from_scope(cpu, r->mon_scope); + struct rdt_hw_mon_domain *hw_dom; + struct rdt_domain_hdr *hdr; + struct rdt_mon_domain *d; + + lockdep_assert_held(&domain_list_lock); + + if (id < 0) { + pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n", + cpu, r->mon_scope, r->name); + return; + } + + hdr = rdt_find_domain(&r->mon_domains, id, NULL); + if (!hdr) { + pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n", + id, cpu, r->name); + return; + } + + if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) + return; + + d = container_of(hdr, struct rdt_mon_domain, hdr); + hw_dom = resctrl_to_arch_mon_dom(d); + + cpumask_clear_cpu(cpu, &d->hdr.cpu_mask); + if (cpumask_empty(&d->hdr.cpu_mask)) { + resctrl_offline_mon_domain(r, d); + list_del_rcu(&d->hdr.list); + synchronize_rcu(); + mon_domain_free(hw_dom); return; } } +static void domain_remove_cpu(int cpu, struct rdt_resource *r) +{ + if (r->alloc_capable) + domain_remove_cpu_ctrl(cpu, r); + if (r->mon_capable) + domain_remove_cpu_mon(cpu, r); +} + static void clear_closid_rmid(int cpu) { struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state); @@ -624,6 +784,20 @@ static int resctrl_arch_offline_cpu(unsigned int cpu) return 0; } +/* + * Choose a width for the resource name and resource data based on the + * resource that has widest name and cbm. 
+ */ +static __init void rdt_init_padding(void) +{ + struct rdt_resource *r; + + for_each_alloc_capable_rdt_resource(r) { + if (r->data_width > max_data_width) + max_data_width = r->data_width; + } +} + enum { RDT_FLAG_CMT, RDT_FLAG_MBM_TOTAL, @@ -649,7 +823,7 @@ struct rdt_options { bool force_off, force_on; }; -static struct rdt_options rdt_options[] __ro_after_init = { +static struct rdt_options rdt_options[] __initdata = { RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC), RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL), RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL), @@ -689,7 +863,7 @@ static int __init set_rdt_options(char *str) } __setup("rdt", set_rdt_options); -bool rdt_cpu_has(int flag) +bool __init rdt_cpu_has(int flag) { bool ret = boot_cpu_has(flag); struct rdt_options *o; @@ -709,21 +883,6 @@ bool rdt_cpu_has(int flag) return ret; } -bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt) -{ - if (!rdt_cpu_has(X86_FEATURE_BMEC)) - return false; - - switch (evt) { - case QOS_L3_MBM_TOTAL_EVENT_ID: - return rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL); - case QOS_L3_MBM_LOCAL_EVENT_ID: - return rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL); - default: - return false; - } -} - static __init bool get_mem_config(void) { struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_MBA]; @@ -920,7 +1079,7 @@ void resctrl_cpu_detect(struct cpuinfo_x86 *c) } } -static int __init resctrl_arch_late_init(void) +static int __init resctrl_late_init(void) { struct rdt_resource *r; int state, ret; @@ -936,6 +1095,8 @@ static int __init resctrl_arch_late_init(void) if (!get_rdt_resources()) return -ENODEV; + rdt_init_padding(); + state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/resctrl/cat:online:", resctrl_arch_online_cpu, @@ -943,7 +1104,7 @@ static int __init resctrl_arch_late_init(void) if (state < 0) return state; - ret = resctrl_init(); + ret = rdtgroup_init(); if (ret) { cpuhp_remove_state(state); return ret; @@ -959,13 +1120,18 @@ static int __init resctrl_arch_late_init(void) return 0; } -late_initcall(resctrl_arch_late_init); +late_initcall(resctrl_late_init); -static void __exit resctrl_arch_exit(void) +static void __exit resctrl_exit(void) { + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + cpuhp_remove_state(rdt_online); - resctrl_exit(); + rdtgroup_exit(); + + if (r->mon_capable) + rdt_put_mon_l3_config(); } -__exitcall(resctrl_arch_exit); +__exitcall(resctrl_exit); diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c index c5c3eaea27b6..200d89a64027 100644 --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c @@ -23,39 +23,278 @@ #include "internal.h" -static bool apply_config(struct rdt_hw_domain *hw_dom, - struct resctrl_staged_config *cfg, u32 idx, - cpumask_var_t cpu_mask) +/* + * Check whether MBA bandwidth percentage value is correct. The value is + * checked against the minimum and max bandwidth values specified by the + * hardware. The allocated bandwidth percentage is rounded to the next + * control step available on the hardware. + */ +static bool bw_validate(char *buf, u32 *data, struct rdt_resource *r) { - struct rdt_domain *dom = &hw_dom->d_resctrl; + int ret; + u32 bw; - if (cfg->new_ctrl != hw_dom->ctrl_val[idx]) { - cpumask_set_cpu(cpumask_any(&dom->cpu_mask), cpu_mask); - hw_dom->ctrl_val[idx] = cfg->new_ctrl; + /* + * Only linear delay values is supported for current Intel SKUs. 
+ */ + if (!r->membw.delay_linear && r->membw.arch_needs_linear) { + rdt_last_cmd_puts("No support for non-linear MB domains\n"); + return false; + } + + ret = kstrtou32(buf, 10, &bw); + if (ret) { + rdt_last_cmd_printf("Invalid MB value %s\n", buf); + return false; + } + /* Nothing else to do if software controller is enabled. */ + if (is_mba_sc(r)) { + *data = bw; return true; } - return false; + if (bw < r->membw.min_bw || bw > r->default_ctrl) { + rdt_last_cmd_printf("MB value %u out of range [%d,%d]\n", + bw, r->membw.min_bw, r->default_ctrl); + return false; + } + + *data = roundup(bw, (unsigned long)r->membw.bw_gran); + return true; +} + +int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_ctrl_domain *d) +{ + struct resctrl_staged_config *cfg; + u32 closid = data->rdtgrp->closid; + struct rdt_resource *r = s->res; + u32 bw_val; + + cfg = &d->staged_config[s->conf_type]; + if (cfg->have_new_ctrl) { + rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id); + return -EINVAL; + } + + if (!bw_validate(data->buf, &bw_val, r)) + return -EINVAL; + + if (is_mba_sc(r)) { + d->mbps_val[closid] = bw_val; + return 0; + } + + cfg->new_ctrl = bw_val; + cfg->have_new_ctrl = true; + + return 0; +} + +/* + * Check whether a cache bit mask is valid. + * On Intel CPUs, non-contiguous 1s value support is indicated by CPUID: + * - CPUID.0x10.1:ECX[3]: L3 non-contiguous 1s value supported if 1 + * - CPUID.0x10.2:ECX[3]: L2 non-contiguous 1s value supported if 1 + * + * Haswell does not support a non-contiguous 1s value and additionally + * requires at least two bits set. + * AMD allows non-contiguous bitmasks. + */ +static bool cbm_validate(char *buf, u32 *data, struct rdt_resource *r) +{ + unsigned long first_bit, zero_bit, val; + unsigned int cbm_len = r->cache.cbm_len; + int ret; + + ret = kstrtoul(buf, 16, &val); + if (ret) { + rdt_last_cmd_printf("Non-hex character in the mask %s\n", buf); + return false; + } + + if ((r->cache.min_cbm_bits > 0 && val == 0) || val > r->default_ctrl) { + rdt_last_cmd_puts("Mask out of range\n"); + return false; + } + + first_bit = find_first_bit(&val, cbm_len); + zero_bit = find_next_zero_bit(&val, cbm_len, first_bit); + + /* Are non-contiguous bitmasks allowed? */ + if (!r->cache.arch_has_sparse_bitmasks && + (find_next_bit(&val, cbm_len, zero_bit) < cbm_len)) { + rdt_last_cmd_printf("The mask %lx has non-consecutive 1-bits\n", val); + return false; + } + + if ((zero_bit - first_bit) < r->cache.min_cbm_bits) { + rdt_last_cmd_printf("Need at least %d bits in the mask\n", + r->cache.min_cbm_bits); + return false; + } + + *data = val; + return true; +} + +/* + * Read one cache bit mask (hex). Check that it is valid for the current + * resource type. + */ +int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_ctrl_domain *d) +{ + struct rdtgroup *rdtgrp = data->rdtgrp; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + u32 cbm_val; + + cfg = &d->staged_config[s->conf_type]; + if (cfg->have_new_ctrl) { + rdt_last_cmd_printf("Duplicate domain %d\n", d->hdr.id); + return -EINVAL; + } + + /* + * Cannot set up more than one pseudo-locked region in a cache + * hierarchy. 
+ */ + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && + rdtgroup_pseudo_locked_in_hierarchy(d)) { + rdt_last_cmd_puts("Pseudo-locked region in hierarchy\n"); + return -EINVAL; + } + + if (!cbm_validate(data->buf, &cbm_val, r)) + return -EINVAL; + + if ((rdtgrp->mode == RDT_MODE_EXCLUSIVE || + rdtgrp->mode == RDT_MODE_SHAREABLE) && + rdtgroup_cbm_overlaps_pseudo_locked(d, cbm_val)) { + rdt_last_cmd_puts("CBM overlaps with pseudo-locked region\n"); + return -EINVAL; + } + + /* + * The CBM may not overlap with the CBM of another closid if + * either is exclusive. + */ + if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, true)) { + rdt_last_cmd_puts("Overlaps with exclusive group\n"); + return -EINVAL; + } + + if (rdtgroup_cbm_overlaps(s, d, cbm_val, rdtgrp->closid, false)) { + if (rdtgrp->mode == RDT_MODE_EXCLUSIVE || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + rdt_last_cmd_puts("Overlaps with other group\n"); + return -EINVAL; + } + } + + cfg->new_ctrl = cbm_val; + cfg->have_new_ctrl = true; + + return 0; +} + +/* + * For each domain in this resource we expect to find a series of: + * id=mask + * separated by ";". The "id" is in decimal, and must match one of + * the "id"s for this resource. + */ +static int parse_line(char *line, struct resctrl_schema *s, + struct rdtgroup *rdtgrp) +{ + enum resctrl_conf_type t = s->conf_type; + struct resctrl_staged_config *cfg; + struct rdt_resource *r = s->res; + struct rdt_parse_data data; + struct rdt_ctrl_domain *d; + char *dom = NULL, *id; + unsigned long dom_id; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP && + (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA)) { + rdt_last_cmd_puts("Cannot pseudo-lock MBA resource\n"); + return -EINVAL; + } + +next: + if (!line || line[0] == '\0') + return 0; + dom = strsep(&line, ";"); + id = strsep(&dom, "="); + if (!dom || kstrtoul(id, 10, &dom_id)) { + rdt_last_cmd_puts("Missing '=' or non-numeric domain\n"); + return -EINVAL; + } + dom = strim(dom); + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + if (d->hdr.id == dom_id) { + data.buf = dom; + data.rdtgrp = rdtgrp; + if (r->parse_ctrlval(&data, s, d)) + return -EINVAL; + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + cfg = &d->staged_config[t]; + /* + * In pseudo-locking setup mode and just + * parsed a valid CBM that should be + * pseudo-locked. Only one locked region per + * resource group and domain so just do + * the required initialization for single + * region and return. 
+ */ + rdtgrp->plr->s = s; + rdtgrp->plr->d = d; + rdtgrp->plr->cbm = cfg->new_ctrl; + d->plr = rdtgrp->plr; + return 0; + } + goto next; + } + } + return -EINVAL; +} + +static u32 get_config_index(u32 closid, enum resctrl_conf_type type) +{ + switch (type) { + default: + case CDP_NONE: + return closid; + case CDP_CODE: + return closid * 2 + 1; + case CDP_DATA: + return closid * 2; + } } -int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, +int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d, u32 closid, enum resctrl_conf_type t, u32 cfg_val) { + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d); struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - u32 idx = resctrl_get_config_index(closid, t); + u32 idx = get_config_index(closid, t); struct msr_param msr_param; - if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask)) + if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask)) return -EINVAL; hw_dom->ctrl_val[idx] = cfg_val; msr_param.res = r; + msr_param.dom = d; msr_param.low = idx; msr_param.high = idx + 1; - hw_res->msr_update(d, &msr_param, r); + hw_res->msr_update(&msr_param); return 0; } @@ -63,59 +302,326 @@ int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d, int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid) { struct resctrl_staged_config *cfg; - struct rdt_hw_domain *hw_dom; + struct rdt_hw_ctrl_domain *hw_dom; struct msr_param msr_param; + struct rdt_ctrl_domain *d; enum resctrl_conf_type t; - cpumask_var_t cpu_mask; - struct rdt_domain *d; u32 idx; /* Walking r->domains, ensure it can't race with cpuhp */ lockdep_assert_cpus_held(); - if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) - return -ENOMEM; - - msr_param.res = NULL; - list_for_each_entry(d, &r->domains, list) { - hw_dom = resctrl_to_arch_dom(d); + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + hw_dom = resctrl_to_arch_ctrl_dom(d); + msr_param.res = NULL; for (t = 0; t < CDP_NUM_TYPES; t++) { cfg = &hw_dom->d_resctrl.staged_config[t]; if (!cfg->have_new_ctrl) continue; - idx = resctrl_get_config_index(closid, t); - if (!apply_config(hw_dom, cfg, idx, cpu_mask)) + idx = get_config_index(closid, t); + if (cfg->new_ctrl == hw_dom->ctrl_val[idx]) continue; + hw_dom->ctrl_val[idx] = cfg->new_ctrl; if (!msr_param.res) { msr_param.low = idx; msr_param.high = msr_param.low + 1; msr_param.res = r; + msr_param.dom = d; } else { msr_param.low = min(msr_param.low, idx); msr_param.high = max(msr_param.high, idx + 1); } } + if (msr_param.res) + smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1); + } + + return 0; +} + +static int rdtgroup_parse_resource(char *resname, char *tok, + struct rdtgroup *rdtgrp) +{ + struct resctrl_schema *s; + + list_for_each_entry(s, &resctrl_schema_all, list) { + if (!strcmp(resname, s->name) && rdtgrp->closid < s->num_closid) + return parse_line(tok, s, rdtgrp); } + rdt_last_cmd_printf("Unknown or unsupported resource name '%s'\n", resname); + return -EINVAL; +} + +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct resctrl_schema *s; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + char *tok, *resname; + int ret = 0; - if (cpumask_empty(cpu_mask)) - goto done; + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + buf[nbytes - 1] = '\0'; - /* Update resource control msr on all the 
CPUs. */ - on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1); + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + rdt_last_cmd_clear(); -done: - free_cpumask_var(cpu_mask); + /* + * No changes to pseudo-locked region allowed. It has to be removed + * and re-created instead. + */ + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + ret = -EINVAL; + rdt_last_cmd_puts("Resource group is pseudo-locked\n"); + goto out; + } - return 0; + rdt_staged_configs_clear(); + + while ((tok = strsep(&buf, "\n")) != NULL) { + resname = strim(strsep(&tok, ":")); + if (!tok) { + rdt_last_cmd_puts("Missing ':'\n"); + ret = -EINVAL; + goto out; + } + if (tok[0] == '\0') { + rdt_last_cmd_printf("Missing '%s' value\n", resname); + ret = -EINVAL; + goto out; + } + ret = rdtgroup_parse_resource(resname, tok, rdtgrp); + if (ret) + goto out; + } + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + + /* + * Writes to mba_sc resources update the software controller, + * not the control MSR. + */ + if (is_mba_sc(r)) + continue; + + ret = resctrl_arch_update_domains(r, rdtgrp->closid); + if (ret) + goto out; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + /* + * If pseudo-locking fails we keep the resource group in + * mode RDT_MODE_PSEUDO_LOCKSETUP with its class of service + * active and updated for just the domain the pseudo-locked + * region was requested for. + */ + ret = rdtgroup_pseudo_lock_create(rdtgrp); + } + +out: + rdt_staged_configs_clear(); + rdtgroup_kn_unlock(of->kn); + return ret ?: nbytes; } -u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, +u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d, u32 closid, enum resctrl_conf_type type) { - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); - u32 idx = resctrl_get_config_index(closid, type); + struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d); + u32 idx = get_config_index(closid, type); return hw_dom->ctrl_val[idx]; } + +static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid) +{ + struct rdt_resource *r = schema->res; + struct rdt_ctrl_domain *dom; + bool sep = false; + u32 ctrl_val; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + seq_printf(s, "%*s:", max_name_width, schema->name); + list_for_each_entry(dom, &r->ctrl_domains, hdr.list) { + if (sep) + seq_puts(s, ";"); + + if (is_mba_sc(r)) + ctrl_val = dom->mbps_val[closid]; + else + ctrl_val = resctrl_arch_get_config(r, dom, closid, + schema->conf_type); + + seq_printf(s, r->format_str, dom->hdr.id, max_data_width, + ctrl_val); + sep = true; + } + seq_puts(s, "\n"); +} + +int rdtgroup_schemata_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct resctrl_schema *schema; + struct rdtgroup *rdtgrp; + int ret = 0; + u32 closid; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + list_for_each_entry(schema, &resctrl_schema_all, list) { + seq_printf(s, "%s:uninitialized\n", schema->name); + } + } else if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + seq_printf(s, "%s:%d=%x\n", + rdtgrp->plr->s->res->name, + rdtgrp->plr->d->hdr.id, + rdtgrp->plr->cbm); + } + } else { + closid = rdtgrp->closid; + list_for_each_entry(schema, &resctrl_schema_all, list) { + if (closid < 
schema->num_closid) + show_doms(s, schema, closid); + } + } + } else { + ret = -ENOENT; + } + rdtgroup_kn_unlock(of->kn); + return ret; +} + +static int smp_mon_event_count(void *arg) +{ + mon_event_count(arg); + + return 0; +} + +void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, + struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, + cpumask_t *cpumask, int evtid, int first) +{ + int cpu; + + /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + /* + * Setup the parameters to pass to mon_event_count() to read the data. + */ + rr->rgrp = rdtgrp; + rr->evtid = evtid; + rr->r = r; + rr->d = d; + rr->first = first; + rr->arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, evtid); + if (IS_ERR(rr->arch_mon_ctx)) { + rr->err = -EINVAL; + return; + } + + cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU); + + /* + * cpumask_any_housekeeping() prefers housekeeping CPUs, but + * are all the CPUs nohz_full? If yes, pick a CPU to IPI. + * MPAM's resctrl_arch_rmid_read() is unable to read the + * counters on some platforms if its called in IRQ context. + */ + if (tick_nohz_full_cpu(cpu)) + smp_call_function_any(cpumask, mon_event_count, rr, 1); + else + smp_call_on_cpu(cpu, smp_mon_event_count, rr, false); + + resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx); +} + +int rdtgroup_mondata_show(struct seq_file *m, void *arg) +{ + struct kernfs_open_file *of = m->private; + struct rdt_domain_hdr *hdr; + struct rmid_read rr = {0}; + struct rdt_mon_domain *d; + u32 resid, evtid, domid; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + union mon_data_bits md; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto out; + } + + md.priv = of->kn->priv; + resid = md.u.rid; + domid = md.u.domid; + evtid = md.u.evtid; + r = &rdt_resources_all[resid].r_resctrl; + + if (md.u.sum) { + /* + * This file requires summing across all domains that share + * the L3 cache id that was provided in the "domid" field of the + * mon_data_bits union. Search all domains in the resource for + * one that matches this cache id. + */ + list_for_each_entry(d, &r->mon_domains, hdr.list) { + if (d->ci->id == domid) { + rr.ci = d->ci; + mon_event_read(&rr, r, NULL, rdtgrp, + &d->ci->shared_cpu_map, evtid, false); + goto checkresult; + } + } + ret = -ENOENT; + goto out; + } else { + /* + * This file provides data from a single domain. Search + * the resource to find the domain with "domid". 
+ */ + hdr = rdt_find_domain(&r->mon_domains, domid, NULL); + if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) { + ret = -ENOENT; + goto out; + } + d = container_of(hdr, struct rdt_mon_domain, hdr); + mon_event_read(&rr, r, d, rdtgrp, &d->hdr.cpu_mask, evtid, false); + } + +checkresult: + + if (rr.err == -EIO) + seq_puts(m, "Error\n"); + else if (rr.err == -EINVAL) + seq_puts(m, "Unavailable\n"); + else + seq_printf(m, "%llu\n", rr.val); + +out: + rdtgroup_kn_unlock(of->kn); + return ret; +} diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h index 23d574ab56fb..955999aecfca 100644 --- a/arch/x86/kernel/cpu/resctrl/internal.h +++ b/arch/x86/kernel/cpu/resctrl/internal.h @@ -15,6 +15,12 @@ #define L2_QOS_CDP_ENABLE 0x01ULL +#define CQM_LIMBOCHECK_INTERVAL 1000 + +#define MBM_CNTR_WIDTH_BASE 24 +#define MBM_OVERFLOW_INTERVAL 1000 +#define MAX_MBA_BW 100u +#define MBA_IS_LINEAR 0x4 #define MBM_CNTR_WIDTH_OFFSET_AMD 20 #define RMID_VAL_ERROR BIT_ULL(63) @@ -26,6 +32,341 @@ */ #define MBM_CNTR_WIDTH_OFFSET_MAX (62 - MBM_CNTR_WIDTH_BASE) +/* Reads to Local DRAM Memory */ +#define READS_TO_LOCAL_MEM BIT(0) + +/* Reads to Remote DRAM Memory */ +#define READS_TO_REMOTE_MEM BIT(1) + +/* Non-Temporal Writes to Local Memory */ +#define NON_TEMP_WRITE_TO_LOCAL_MEM BIT(2) + +/* Non-Temporal Writes to Remote Memory */ +#define NON_TEMP_WRITE_TO_REMOTE_MEM BIT(3) + +/* Reads to Local Memory the system identifies as "Slow Memory" */ +#define READS_TO_LOCAL_S_MEM BIT(4) + +/* Reads to Remote Memory the system identifies as "Slow Memory" */ +#define READS_TO_REMOTE_S_MEM BIT(5) + +/* Dirty Victims to All Types of Memory */ +#define DIRTY_VICTIMS_TO_ALL_MEM BIT(6) + +/* Max event bits supported */ +#define MAX_EVT_CONFIG_BITS GENMASK(6, 0) + +/** + * cpumask_any_housekeeping() - Choose any CPU in @mask, preferring those that + * aren't marked nohz_full + * @mask: The mask to pick a CPU from. + * @exclude_cpu:The CPU to avoid picking. + * + * Returns a CPU from @mask, but not @exclude_cpu. If there are housekeeping + * CPUs that don't use nohz_full, these are preferred. Pass + * RESCTRL_PICK_ANY_CPU to avoid excluding any CPUs. + * + * When a CPU is excluded, returns >= nr_cpu_ids if no CPUs are available. + */ +static inline unsigned int +cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu) +{ + unsigned int cpu, hk_cpu; + + if (exclude_cpu == RESCTRL_PICK_ANY_CPU) + cpu = cpumask_any(mask); + else + cpu = cpumask_any_but(mask, exclude_cpu); + + /* Only continue if tick_nohz_full_mask has been initialized. */ + if (!tick_nohz_full_enabled()) + return cpu; + + /* If the CPU picked isn't marked nohz_full nothing more needs doing. 
*/ + if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu)) + return cpu; + + /* Try to find a CPU that isn't nohz_full to use in preference */ + hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask); + if (hk_cpu == exclude_cpu) + hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask); + + if (hk_cpu < nr_cpu_ids) + cpu = hk_cpu; + + return cpu; +} + +struct rdt_fs_context { + struct kernfs_fs_context kfc; + bool enable_cdpl2; + bool enable_cdpl3; + bool enable_mba_mbps; + bool enable_debug; +}; + +static inline struct rdt_fs_context *rdt_fc2context(struct fs_context *fc) +{ + struct kernfs_fs_context *kfc = fc->fs_private; + + return container_of(kfc, struct rdt_fs_context, kfc); +} + +/** + * struct mon_evt - Entry in the event list of a resource + * @evtid: event id + * @name: name of the event + * @configurable: true if the event is configurable + * @list: entry in &rdt_resource->evt_list + */ +struct mon_evt { + enum resctrl_event_id evtid; + char *name; + bool configurable; + struct list_head list; +}; + +/** + * union mon_data_bits - Monitoring details for each event file. + * @priv: Used to store monitoring event data in @u + * as kernfs private data. + * @u.rid: Resource id associated with the event file. + * @u.evtid: Event id associated with the event file. + * @u.sum: Set when event must be summed across multiple + * domains. + * @u.domid: When @u.sum is zero this is the domain to which + * the event file belongs. When @sum is one this + * is the id of the L3 cache that all domains to be + * summed share. + * @u: Name of the bit fields struct. + */ +union mon_data_bits { + void *priv; + struct { + unsigned int rid : 10; + enum resctrl_event_id evtid : 7; + unsigned int sum : 1; + unsigned int domid : 14; + } u; +}; + +/** + * struct rmid_read - Data passed across smp_call*() to read event count. + * @rgrp: Resource group for which the counter is being read. If it is a parent + * resource group then its event count is summed with the count from all + * its child resource groups. + * @r: Resource describing the properties of the event being read. + * @d: Domain that the counter should be read from. If NULL then sum all + * domains in @r sharing L3 @ci.id + * @evtid: Which monitor event to read. + * @first: Initialize MBM counter when true. + * @ci: Cacheinfo for L3. Only set when @d is NULL. Used when summing domains. + * @err: Error encountered when reading counter. + * @val: Returned value of event counter. If @rgrp is a parent resource group, + * @val includes the sum of event counts from its child resource groups. + * If @d is NULL, @val includes the sum of all domains in @r sharing @ci.id, + * (summed across child resource groups if @rgrp is a parent resource group). + * @arch_mon_ctx: Hardware monitor allocated for this read request (MPAM only). 
+ */ +struct rmid_read { + struct rdtgroup *rgrp; + struct rdt_resource *r; + struct rdt_mon_domain *d; + enum resctrl_event_id evtid; + bool first; + struct cacheinfo *ci; + int err; + u64 val; + void *arch_mon_ctx; +}; + +extern unsigned int rdt_mon_features; +extern struct list_head resctrl_schema_all; +extern bool resctrl_mounted; + +enum rdt_group_type { + RDTCTRL_GROUP = 0, + RDTMON_GROUP, + RDT_NUM_GROUP, +}; + +/** + * enum rdtgrp_mode - Mode of a RDT resource group + * @RDT_MODE_SHAREABLE: This resource group allows sharing of its allocations + * @RDT_MODE_EXCLUSIVE: No sharing of this resource group's allocations allowed + * @RDT_MODE_PSEUDO_LOCKSETUP: Resource group will be used for Pseudo-Locking + * @RDT_MODE_PSEUDO_LOCKED: No sharing of this resource group's allocations + * allowed AND the allocations are Cache Pseudo-Locked + * @RDT_NUM_MODES: Total number of modes + * + * The mode of a resource group enables control over the allowed overlap + * between allocations associated with different resource groups (classes + * of service). User is able to modify the mode of a resource group by + * writing to the "mode" resctrl file associated with the resource group. + * + * The "shareable", "exclusive", and "pseudo-locksetup" modes are set by + * writing the appropriate text to the "mode" file. A resource group enters + * "pseudo-locked" mode after the schemata is written while the resource + * group is in "pseudo-locksetup" mode. + */ +enum rdtgrp_mode { + RDT_MODE_SHAREABLE = 0, + RDT_MODE_EXCLUSIVE, + RDT_MODE_PSEUDO_LOCKSETUP, + RDT_MODE_PSEUDO_LOCKED, + + /* Must be last */ + RDT_NUM_MODES, +}; + +/** + * struct mongroup - store mon group's data in resctrl fs. + * @mon_data_kn: kernfs node for the mon_data directory + * @parent: parent rdtgrp + * @crdtgrp_list: child rdtgroup node list + * @rmid: rmid for this rdtgroup + */ +struct mongroup { + struct kernfs_node *mon_data_kn; + struct rdtgroup *parent; + struct list_head crdtgrp_list; + u32 rmid; +}; + +/** + * struct pseudo_lock_region - pseudo-lock region information + * @s: Resctrl schema for the resource to which this + * pseudo-locked region belongs + * @d: RDT domain to which this pseudo-locked region + * belongs + * @cbm: bitmask of the pseudo-locked region + * @lock_thread_wq: waitqueue used to wait on the pseudo-locking thread + * completion + * @thread_done: variable used by waitqueue to test if pseudo-locking + * thread completed + * @cpu: core associated with the cache on which the setup code + * will be run + * @line_size: size of the cache lines + * @size: size of pseudo-locked region in bytes + * @kmem: the kernel memory associated with pseudo-locked region + * @minor: minor number of character device associated with this + * region + * @debugfs_dir: pointer to this region's directory in the debugfs + * filesystem + * @pm_reqs: Power management QoS requests related to this region + */ +struct pseudo_lock_region { + struct resctrl_schema *s; + struct rdt_ctrl_domain *d; + u32 cbm; + wait_queue_head_t lock_thread_wq; + int thread_done; + int cpu; + unsigned int line_size; + unsigned int size; + void *kmem; + unsigned int minor; + struct dentry *debugfs_dir; + struct list_head pm_reqs; +}; + +/** + * struct rdtgroup - store rdtgroup's data in resctrl file system. 
+ * @kn: kernfs node + * @rdtgroup_list: linked list for all rdtgroups + * @closid: closid for this rdtgroup + * @cpu_mask: CPUs assigned to this rdtgroup + * @flags: status bits + * @waitcount: how many cpus expect to find this + * group when they acquire rdtgroup_mutex + * @type: indicates type of this rdtgroup - either + * monitor only or ctrl_mon group + * @mon: mongroup related data + * @mode: mode of resource group + * @plr: pseudo-locked region + */ +struct rdtgroup { + struct kernfs_node *kn; + struct list_head rdtgroup_list; + u32 closid; + struct cpumask cpu_mask; + int flags; + atomic_t waitcount; + enum rdt_group_type type; + struct mongroup mon; + enum rdtgrp_mode mode; + struct pseudo_lock_region *plr; +}; + +/* rdtgroup.flags */ +#define RDT_DELETED 1 + +/* rftype.flags */ +#define RFTYPE_FLAGS_CPUS_LIST 1 + +/* + * Define the file type flags for base and info directories. + */ +#define RFTYPE_INFO BIT(0) +#define RFTYPE_BASE BIT(1) +#define RFTYPE_CTRL BIT(4) +#define RFTYPE_MON BIT(5) +#define RFTYPE_TOP BIT(6) +#define RFTYPE_RES_CACHE BIT(8) +#define RFTYPE_RES_MB BIT(9) +#define RFTYPE_DEBUG BIT(10) +#define RFTYPE_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL) +#define RFTYPE_MON_INFO (RFTYPE_INFO | RFTYPE_MON) +#define RFTYPE_TOP_INFO (RFTYPE_INFO | RFTYPE_TOP) +#define RFTYPE_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL) +#define RFTYPE_MON_BASE (RFTYPE_BASE | RFTYPE_MON) + +/* List of all resource groups */ +extern struct list_head rdt_all_groups; + +extern int max_name_width, max_data_width; + +int __init rdtgroup_init(void); +void __exit rdtgroup_exit(void); + +/** + * struct rftype - describe each file in the resctrl file system + * @name: File name + * @mode: Access mode + * @kf_ops: File operations + * @flags: File specific RFTYPE_FLAGS_* flags + * @fflags: File specific RFTYPE_* flags + * @seq_show: Show content of the file + * @write: Write to the file + */ +struct rftype { + char *name; + umode_t mode; + const struct kernfs_ops *kf_ops; + unsigned long flags; + unsigned long fflags; + + int (*seq_show)(struct kernfs_open_file *of, + struct seq_file *sf, void *v); + /* + * write() is the generic write callback which maps directly to + * kernfs write operation and overrides all other operations. + * Maximum write size is determined by ->max_write_len. + */ + ssize_t (*write)(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off); +}; + +/** + * struct mbm_state - status for each MBM counter in each domain + * @prev_bw_bytes: Previous bytes value read for bandwidth calculation + * @prev_bw: The most recent bandwidth in MBps + */ +struct mbm_state { + u64 prev_bw_bytes; + u32 prev_bw; +}; + /** * struct arch_mbm_state - values used to compute resctrl_arch_rmid_read()s * return value. @@ -39,35 +380,53 @@ struct arch_mbm_state { }; /** - * struct rdt_hw_domain - Arch private attributes of a set of CPUs that share - * a resource + * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share + * a resource for a control function * @d_resctrl: Properties exposed to the resctrl file system * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID) + * + * Members of this structure are accessed via helpers that provide abstraction. 
+ */ +struct rdt_hw_ctrl_domain { + struct rdt_ctrl_domain d_resctrl; + u32 *ctrl_val; +}; + +/** + * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share + * a resource for a monitor function + * @d_resctrl: Properties exposed to the resctrl file system * @arch_mbm_total: arch private state for MBM total bandwidth * @arch_mbm_local: arch private state for MBM local bandwidth * * Members of this structure are accessed via helpers that provide abstraction. */ -struct rdt_hw_domain { - struct rdt_domain d_resctrl; - u32 *ctrl_val; +struct rdt_hw_mon_domain { + struct rdt_mon_domain d_resctrl; struct arch_mbm_state *arch_mbm_total; struct arch_mbm_state *arch_mbm_local; }; -static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r) +static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r) { - return container_of(r, struct rdt_hw_domain, d_resctrl); + return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl); +} + +static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r) +{ + return container_of(r, struct rdt_hw_mon_domain, d_resctrl); } /** * struct msr_param - set a range of MSRs from a domain * @res: The resource to use + * @dom: The domain to update * @low: Beginning index from base MSR * @high: End index */ struct msr_param { struct rdt_resource *res; + struct rdt_ctrl_domain *dom; u32 low; u32 high; }; @@ -77,12 +436,32 @@ static inline bool is_llc_occupancy_enabled(void) return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID)); } +static inline bool is_mbm_total_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID)); +} + +static inline bool is_mbm_local_enabled(void) +{ + return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID)); +} + +static inline bool is_mbm_enabled(void) +{ + return (is_mbm_total_enabled() || is_mbm_local_enabled()); +} + static inline bool is_mbm_event(int e) { return (e >= QOS_L3_MBM_TOTAL_EVENT_ID && e <= QOS_L3_MBM_LOCAL_EVENT_ID); } +struct rdt_parse_data { + struct rdtgroup *rdtgrp; + char *buf; +}; + /** * struct rdt_hw_resource - arch private attributes of a resctrl resource * @r_resctrl: Attributes of the resource used directly by resctrl. @@ -95,6 +474,8 @@ static inline bool is_mbm_event(int e) * @msr_update: Function pointer to update QOS MSRs * @mon_scale: cqm counter * mon_scale = occupancy in bytes * @mbm_width: Monitor width, to detect and correct for overflow. + * @mbm_cfg_mask: Bandwidth sources that can be tracked when Bandwidth + * Monitoring Event Configuration (BMEC) is supported. 
* @cdp_enabled: CDP state of this resource * * Members of this structure are either private to the architecture @@ -105,10 +486,10 @@ struct rdt_hw_resource { struct rdt_resource r_resctrl; u32 num_closid; unsigned int msr_base; - void (*msr_update) (struct rdt_domain *d, struct msr_param *m, - struct rdt_resource *r); + void (*msr_update)(struct msr_param *m); unsigned int mon_scale; unsigned int mbm_width; + unsigned int mbm_cfg_mask; bool cdp_enabled; }; @@ -117,7 +498,26 @@ static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r return container_of(r, struct rdt_hw_resource, r_resctrl); } +int parse_cbm(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_ctrl_domain *d); +int parse_bw(struct rdt_parse_data *data, struct resctrl_schema *s, + struct rdt_ctrl_domain *d); + +extern struct mutex rdtgroup_mutex; + extern struct rdt_hw_resource rdt_resources_all[]; +extern struct rdtgroup rdtgroup_default; +extern struct dentry *debugfs_resctrl; + +enum resctrl_res_level { + RDT_RESOURCE_L3, + RDT_RESOURCE_L2, + RDT_RESOURCE_MBA, + RDT_RESOURCE_SMBA, + + /* Must be the last */ + RDT_NUM_RESOURCES, +}; static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) { @@ -127,6 +527,15 @@ static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) return &hw_res->r_resctrl; } +static inline bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l) +{ + return rdt_resources_all[l].cdp_enabled; +} + +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable); + +void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d); + /* * To return the common struct rdt_resource, which is contained in struct * rdt_hw_resource, walk the resctrl member of struct rdt_hw_resource. 
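For reference: resctrl_to_arch_res() and the other resctrl_to_arch_*() helpers all rely on the same embed-and-container_of pattern, so the shared structure and its arch-private wrapper are two views of one allocation. A standalone sketch, with both structures pared down to a field or two purely for illustration:

        #include <stddef.h>
        #include <stdio.h>

        #define container_of(ptr, type, member) \
                ((type *)((char *)(ptr) - offsetof(type, member)))

        struct rdt_resource { int rid; };               /* shared view */

        struct rdt_hw_resource {
                struct rdt_resource r_resctrl;          /* embedded shared part */
                unsigned int num_closid;                /* arch-private part */
        };

        int main(void)
        {
                struct rdt_hw_resource hw = { .num_closid = 16 };
                struct rdt_resource *r = &hw.r_resctrl; /* what resctrl fs sees */

                /* Walk back from the shared view to the arch wrapper. */
                struct rdt_hw_resource *back =
                        container_of(r, struct rdt_hw_resource, r_resctrl);

                printf("num_closid=%u\n", back->num_closid);    /* prints 16 */
                return 0;
        }
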
@@ -181,10 +590,67 @@ union cpuid_0x10_x_edx { unsigned int full; }; +void rdt_last_cmd_clear(void); +void rdt_last_cmd_puts(const char *s); +__printf(1, 2) +void rdt_last_cmd_printf(const char *fmt, ...); + void rdt_ctrl_update(void *arg); +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn); +void rdtgroup_kn_unlock(struct kernfs_node *kn); +int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name); +int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, + umode_t mask); +struct rdt_domain_hdr *rdt_find_domain(struct list_head *h, int id, + struct list_head **pos); +ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off); +int rdtgroup_schemata_show(struct kernfs_open_file *of, + struct seq_file *s, void *v); +bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d, + unsigned long cbm, int closid, bool exclusive); +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, struct rdt_ctrl_domain *d, + unsigned long cbm); +enum rdtgrp_mode rdtgroup_mode_by_closid(int closid); +int rdtgroup_tasks_assigned(struct rdtgroup *r); +int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp); +int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp); +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm); +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d); +int rdt_pseudo_lock_init(void); +void rdt_pseudo_lock_release(void); +int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp); +void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp); +struct rdt_ctrl_domain *get_ctrl_domain_from_cpu(int cpu, struct rdt_resource *r); +struct rdt_mon_domain *get_mon_domain_from_cpu(int cpu, struct rdt_resource *r); +int closids_supported(void); +void closid_free(int closid); +int alloc_rmid(u32 closid); +void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct rdt_resource *r); -bool rdt_cpu_has(int flag); +void __exit rdt_put_mon_l3_config(void); +bool __init rdt_cpu_has(int flag); +void mon_event_count(void *info); +int rdtgroup_mondata_show(struct seq_file *m, void *arg); +void mon_event_read(struct rmid_read *rr, struct rdt_resource *r, + struct rdt_mon_domain *d, struct rdtgroup *rdtgrp, + cpumask_t *cpumask, int evtid, int first); +void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, + unsigned long delay_ms, + int exclude_cpu); +void mbm_handle_overflow(struct work_struct *work); void __init intel_rdt_mbm_apply_quirk(void); +bool is_mba_sc(struct rdt_resource *r); +void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms, + int exclude_cpu); +void cqm_handle_limbo(struct work_struct *work); +bool has_busy_rmid(struct rdt_mon_domain *d); +void __check_limbo(struct rdt_mon_domain *d, bool force_free); void rdt_domain_reconfigure_cdp(struct rdt_resource *r); +void __init thread_throttle_mode_init(void); +void __init mbm_config_rftype_init(const char *config); +void rdt_staged_configs_clear(void); +bool closid_allocated(unsigned int closid); +int resctrl_find_cleanest_closid(void); #endif /* _ASM_X86_RESCTRL_INTERNAL_H */ diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c index ddc12606c602..cc0f1c48d7b2 100644 --- a/arch/x86/kernel/cpu/resctrl/monitor.c +++ b/arch/x86/kernel/cpu/resctrl/monitor.c @@ -15,6 +15,8 @@ * Software Developer Manual June 2016, volume 3, section 17.17. 
*/ +#define pr_fmt(fmt) "resctrl: " fmt + #include <linux/cpu.h> #include <linux/module.h> #include <linux/sizes.h> @@ -24,6 +26,54 @@ #include <asm/resctrl.h> #include "internal.h" +#include "trace.h" + +/** + * struct rmid_entry - dirty tracking for all RMID. + * @closid: The CLOSID for this entry. + * @rmid: The RMID for this entry. + * @busy: The number of domains with cached data using this RMID. + * @list: Member of the rmid_free_lru list when busy == 0. + * + * Depending on the architecture the correct monitor is accessed using + * both @closid and @rmid, or @rmid only. + * + * Take the rdtgroup_mutex when accessing. + */ +struct rmid_entry { + u32 closid; + u32 rmid; + int busy; + struct list_head list; +}; + +/* + * @rmid_free_lru - A least recently used list of free RMIDs. + * These RMIDs are guaranteed to have an occupancy less than the + * threshold occupancy. + */ +static LIST_HEAD(rmid_free_lru); + +/* + * @closid_num_dirty_rmid - The number of dirty RMID each CLOSID has. + * Only allocated when CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID is defined. + * Indexed by CLOSID. Protected by rdtgroup_mutex. + */ +static u32 *closid_num_dirty_rmid; + +/* + * @rmid_limbo_count - count of currently unused but (potentially) + * dirty RMIDs. + * This counts RMIDs that no one is currently using but that + * may have an occupancy value > resctrl_rmid_realloc_threshold. User can + * change the threshold occupancy value. + */ +static unsigned int rmid_limbo_count; + +/* + * @rmid_ptrs - The entries in the limbo and free lists. + */ +static struct rmid_entry *rmid_ptrs; /* * Global boolean for rdt_monitor which is true if any @@ -36,8 +86,21 @@ bool rdt_mon_capable; */ unsigned int rdt_mon_features; +/* + * This is the threshold cache occupancy in bytes at which we will consider an + * RMID available for re-allocation. + */ +unsigned int resctrl_rmid_realloc_threshold; + +/* + * This is the maximum value for the reallocation threshold, in bytes. + */ +unsigned int resctrl_rmid_realloc_limit; + #define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5)) +static int snc_nodes_per_l3_cache = 1; + /* * The correction factor table is documented in Documentation/arch/x86/resctrl.rst. * If rmid > rmid threshold, MBM total and local values should be multiplied @@ -99,7 +162,70 @@ static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val) return val; } -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) +/* + * x86 and arm64 differ in their handling of monitoring. + * x86's RMIDs are independent numbers; there is only one source of traffic + * with an RMID value of '1'. + * arm64's PMG extends the PARTID/CLOSID space; there are multiple sources of + * traffic with a PMG value of '1', one for each CLOSID, meaning the RMID + * value is no longer unique. + * To account for this, resctrl uses an index. On x86 this is just the RMID, + * on arm64 it encodes the CLOSID and RMID. This gives a unique number. + * + * The domain's rmid_busy_llc and rmid_ptrs[] are sized by index. The arch code + * must accept an attempt to read every index. + */ +static inline struct rmid_entry *__rmid_entry(u32 idx) +{ + struct rmid_entry *entry; + u32 closid, rmid; + + entry = &rmid_ptrs[idx]; + resctrl_arch_rmid_idx_decode(idx, &closid, &rmid); + + WARN_ON_ONCE(entry->closid != closid); + WARN_ON_ONCE(entry->rmid != rmid); + + return entry; +} + +/* + * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by + * "snc_nodes_per_l3_cache == 1") no translation of the RMID value is + * needed.
The physical RMID is the same as the logical RMID. + * + * On a platform with SNC mode enabled, Linux enables RMID sharing mode + * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel + * Resource Director Technology Architecture Specification" for a full + * description of RMID sharing mode). + * + * In RMID sharing mode there are fewer "logical RMID" values available + * to accumulate data ("physical RMIDs" are divided evenly between SNC + * nodes that share an L3 cache). Linux creates an rdt_mon_domain for + * each SNC node. + * + * The value loaded into IA32_PQR_ASSOC is the "logical RMID". + * + * Data is collected independently on each SNC node and can be retrieved + * using the "physical RMID" value computed by this function and loaded + * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node. + * + * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3 + * cache. So a "physical RMID" may be read from any CPU that shares + * the L3 cache with the desired SNC node, not just from a CPU in + * the specific SNC node. + */ +static int logical_rmid_to_physical_rmid(int cpu, int lrmid) +{ + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + + if (snc_nodes_per_l3_cache == 1) + return lrmid; + + return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid; +} + +static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val) { u64 msr_val; @@ -111,7 +237,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62) * are error bits. */ - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid); + wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid); rdmsrl(MSR_IA32_QM_CTR, msr_val); if (msr_val & RMID_VAL_ERROR) @@ -123,7 +249,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val) return 0; } -static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom, +static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom, u32 rmid, enum resctrl_event_id eventid) { @@ -134,8 +260,6 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom, return &hw_dom->arch_mbm_total[rmid]; case QOS_L3_MBM_LOCAL_EVENT_ID: return &hw_dom->arch_mbm_local[rmid]; - default: - break; } /* Never expect to get here */ @@ -144,19 +268,22 @@ static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_domain *hw_dom, return NULL; } -void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, +void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 unused, u32 rmid, enum resctrl_event_id eventid) { - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d); + int cpu = cpumask_any(&d->hdr.cpu_mask); struct arch_mbm_state *am; + u32 prmid; am = get_arch_mbm_state(hw_dom, rmid, eventid); if (am) { memset(am, 0, sizeof(*am)); + prmid = logical_rmid_to_physical_rmid(cpu, rmid); /* Record any initial, non-zero count value. */ - __rmid_read(rmid, eventid, &am->prev_msr); + __rmid_read_phys(prmid, eventid, &am->prev_msr); } } @@ -164,15 +291,15 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, * Assumes that hardware counters are also reset and thus that there is * no need to record initial non-zero counts. 
*/ -void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d) +void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d) { - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d); - if (resctrl_arch_is_mbm_total_enabled()) + if (is_mbm_total_enabled()) memset(hw_dom->arch_mbm_total, 0, sizeof(*hw_dom->arch_mbm_total) * r->num_rmid); - if (resctrl_arch_is_mbm_local_enabled()) + if (is_mbm_local_enabled()) memset(hw_dom->arch_mbm_local, 0, sizeof(*hw_dom->arch_mbm_local) * r->num_rmid); } @@ -185,22 +312,22 @@ static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width) return chunks >> shift; } -int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, +int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d, u32 unused, u32 rmid, enum resctrl_event_id eventid, u64 *val, void *ignored) { + struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d); struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - struct rdt_hw_domain *hw_dom = resctrl_to_arch_dom(d); + int cpu = cpumask_any(&d->hdr.cpu_mask); struct arch_mbm_state *am; u64 msr_val, chunks; + u32 prmid; int ret; resctrl_arch_rmid_read_context_check(); - if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask)) - return -EINVAL; - - ret = __rmid_read(rmid, eventid, &msr_val); + prmid = logical_rmid_to_physical_rmid(cpu, rmid); + ret = __rmid_read_phys(prmid, eventid, &msr_val); if (ret) return ret; @@ -219,63 +346,717 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d, return 0; } -int __init rdt_get_mon_l3_config(struct rdt_resource *r) +static void limbo_release_entry(struct rmid_entry *entry) { - unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset; - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - unsigned int threshold; + lockdep_assert_held(&rdtgroup_mutex); - resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024; - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale; - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1; - hw_res->mbm_width = MBM_CNTR_WIDTH_BASE; + rmid_limbo_count--; + list_add_tail(&entry->list, &rmid_free_lru); - if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX) - hw_res->mbm_width += mbm_offset; - else if (mbm_offset > MBM_CNTR_WIDTH_OFFSET_MAX) - pr_warn("Ignoring impossible MBM counter offset\n"); + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]--; +} + +/* + * Check the RMIDs that are marked as busy for this domain. If the + * reported LLC occupancy is below the threshold clear the busy bit and + * decrement the count. If the busy count gets to zero on an RMID, we + * free the RMID + */ +void __check_limbo(struct rdt_mon_domain *d, bool force_free) +{ + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + struct rmid_entry *entry; + u32 idx, cur_idx = 1; + void *arch_mon_ctx; + bool rmid_dirty; + u64 val = 0; + + arch_mon_ctx = resctrl_arch_mon_ctx_alloc(r, QOS_L3_OCCUP_EVENT_ID); + if (IS_ERR(arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(arch_mon_ctx)); + return; + } /* - * A reasonable upper limit on the max threshold is the number - * of lines tagged per RMID if all RMIDs have the same number of - * lines tagged in the LLC. - * - * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC. 
+ * Skip RMID 0 and start from RMID 1 and check all the RMIDs that + * are marked as busy for occupancy < threshold. If the occupancy + * is less than the threshold decrement the busy counter of the + * RMID and move it to the free list when the counter reaches 0. */ - threshold = resctrl_rmid_realloc_limit / r->num_rmid; + for (;;) { + idx = find_next_bit(d->rmid_busy_llc, idx_limit, cur_idx); + if (idx >= idx_limit) + break; + + entry = __rmid_entry(idx); + if (resctrl_arch_rmid_read(r, d, entry->closid, entry->rmid, + QOS_L3_OCCUP_EVENT_ID, &val, + arch_mon_ctx)) { + rmid_dirty = true; + } else { + rmid_dirty = (val >= resctrl_rmid_realloc_threshold); + + /* + * x86's CLOSID and RMID are independent numbers, so the entry's + * CLOSID is an empty CLOSID (X86_RESCTRL_EMPTY_CLOSID). On Arm the + * RMID (PMG) extends the CLOSID (PARTID) space with bits that aren't + * used to select the configuration. It is thus necessary to track both + * CLOSID and RMID because there may be dependencies between them + * on some architectures. + */ + trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->hdr.id, val); + } + + if (force_free || !rmid_dirty) { + clear_bit(idx, d->rmid_busy_llc); + if (!--entry->busy) + limbo_release_entry(entry); + } + cur_idx = idx + 1; + } + + resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx); +} + +bool has_busy_rmid(struct rdt_mon_domain *d) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + + return find_first_bit(d->rmid_busy_llc, idx_limit) != idx_limit; +} + +static struct rmid_entry *resctrl_find_free_rmid(u32 closid) +{ + struct rmid_entry *itr; + u32 itr_idx, cmp_idx; + + if (list_empty(&rmid_free_lru)) + return rmid_limbo_count ? ERR_PTR(-EBUSY) : ERR_PTR(-ENOSPC); + + list_for_each_entry(itr, &rmid_free_lru, list) { + /* + * Get the index of this free RMID, and the index it would need + * to be if it were used with this CLOSID. + * If the CLOSID is irrelevant on this architecture, the two + * index values are always the same on every entry and thus the + * very first entry will be returned. + */ + itr_idx = resctrl_arch_rmid_idx_encode(itr->closid, itr->rmid); + cmp_idx = resctrl_arch_rmid_idx_encode(closid, itr->rmid); + + if (itr_idx == cmp_idx) + return itr; + } + + return ERR_PTR(-ENOSPC); +} + +/** + * resctrl_find_cleanest_closid() - Find a CLOSID where all the associated + * RMID are clean, or the CLOSID that has + * the most clean RMID. + * + * MPAM's equivalent of RMID are per-CLOSID, meaning a freshly allocated CLOSID + * may not be able to allocate clean RMID. To avoid this the allocator will + * choose the CLOSID with the most clean RMID. + * + * When the CLOSID and RMID are independent numbers, the first free CLOSID will + * be returned. + */ +int resctrl_find_cleanest_closid(void) +{ + u32 cleanest_closid = ~0; + int i = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + if (!IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + return -EIO; + + for (i = 0; i < closids_supported(); i++) { + int num_dirty; + + if (closid_allocated(i)) + continue; + + num_dirty = closid_num_dirty_rmid[i]; + if (num_dirty == 0) + return i; + + if (cleanest_closid == ~0) + cleanest_closid = i; + + if (num_dirty < closid_num_dirty_rmid[cleanest_closid]) + cleanest_closid = i; + } + + if (cleanest_closid == ~0) + return -ENOSPC; + + return cleanest_closid; +} + +/* + * For MPAM the RMID value is not unique, and has to be considered with + * the CLOSID. 
The (CLOSID, RMID) pair is allocated on all domains, which + allows all domains to be managed by a single free list. + Each domain also has an rmid_busy_llc to reduce the work of the limbo handler. + */ +int alloc_rmid(u32 closid) +{ + struct rmid_entry *entry; + + lockdep_assert_held(&rdtgroup_mutex); + + entry = resctrl_find_free_rmid(closid); + if (IS_ERR(entry)) + return PTR_ERR(entry); + + list_del(&entry->list); + return entry->rmid; +} + +static void add_rmid_to_limbo(struct rmid_entry *entry) +{ + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_mon_domain *d; + u32 idx; + + lockdep_assert_held(&rdtgroup_mutex); + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid); + + entry->busy = 0; + list_for_each_entry(d, &r->mon_domains, hdr.list) { + /* + * For the first limbo RMID in the domain, + * set up the limbo worker. + */ + if (!has_busy_rmid(d)) + cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL, + RESCTRL_PICK_ANY_CPU); + set_bit(idx, d->rmid_busy_llc); + entry->busy++; + } + + rmid_limbo_count++; + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) + closid_num_dirty_rmid[entry->closid]++; +} + +void free_rmid(u32 closid, u32 rmid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + struct rmid_entry *entry; + + lockdep_assert_held(&rdtgroup_mutex); /* - * Because num_rmid may not be a power of two, round the value - * to the nearest multiple of hw_res->mon_scale so it matches a - * value the hardware will measure. mon_scale may not be a power of 2. + * Do not allow the default rmid to be freed. Comparing by index + * allows architectures that ignore the closid parameter to avoid an + * unnecessary check. */ - resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold); + if (!resctrl_arch_mon_capable() || + idx == resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID)) + return; - if (rdt_cpu_has(X86_FEATURE_BMEC)) { - u32 eax, ebx, ecx, edx; + entry = __rmid_entry(idx); - /* Detect list of bandwidth sources that can be tracked */ - cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx); - r->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS; + if (is_llc_occupancy_enabled()) + add_rmid_to_limbo(entry); + else + list_add_tail(&entry->list, &rmid_free_lru); +} + +static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid, + u32 rmid, enum resctrl_event_id evtid) +{ + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + + switch (evtid) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return &d->mbm_total[idx]; + case QOS_L3_MBM_LOCAL_EVENT_ID: + return &d->mbm_local[idx]; + default: + return NULL; } +} - r->mon_capable = true; +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr) +{ + int cpu = smp_processor_id(); + struct rdt_mon_domain *d; + struct mbm_state *m; + int err, ret; + u64 tval = 0; - return 0; + if (rr->first) { + resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid); + m = get_mbm_state(rr->d, closid, rmid, rr->evtid); + if (m) + memset(m, 0, sizeof(struct mbm_state)); + return 0; + } + + if (rr->d) { + /* Reading a single domain, must be on a CPU in that domain.
*/ + if (!cpumask_test_cpu(cpu, &rr->d->hdr.cpu_mask)) + return -EINVAL; + rr->err = resctrl_arch_rmid_read(rr->r, rr->d, closid, rmid, + rr->evtid, &tval, rr->arch_mon_ctx); + if (rr->err) + return rr->err; + + rr->val += tval; + + return 0; + } + + /* Summing domains that share a cache, must be on a CPU for that cache. */ + if (!cpumask_test_cpu(cpu, &rr->ci->shared_cpu_map)) + return -EINVAL; + + /* + * Legacy files must report the sum of an event across all + * domains that share the same L3 cache instance. + * Report success if a read from any domain succeeds, -EINVAL + * (translated to "Unavailable" for user space) if reading from + * all domains fails for any reason. + */ + ret = -EINVAL; + list_for_each_entry(d, &rr->r->mon_domains, hdr.list) { + if (d->ci->id != rr->ci->id) + continue; + err = resctrl_arch_rmid_read(rr->r, d, closid, rmid, + rr->evtid, &tval, rr->arch_mon_ctx); + if (!err) { + rr->val += tval; + ret = 0; + } + } + + if (ret) + rr->err = ret; + + return ret; } -void __init intel_rdt_mbm_apply_quirk(void) +/* + * mbm_bw_count() - Update bw count from values previously read by + * __mon_event_count(). + * @closid: The closid used to identify the cached mbm_state. + * @rmid: The rmid used to identify the cached mbm_state. + * @rr: The struct rmid_read populated by __mon_event_count(). + * + * Supporting function to calculate the memory bandwidth in MBps. The chunks + * value previously read by __mon_event_count() is compared with the chunks + * value from the previous invocation. This must be called once per second + * to maintain values in MBps. + */ +static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr) { - int cf_index; + u32 idx = resctrl_arch_rmid_idx_encode(closid, rmid); + struct mbm_state *m = &rr->d->mbm_local[idx]; + u64 cur_bw, bytes, cur_bytes; - cf_index = (boot_cpu_data.x86_cache_max_rmid + 1) / 8 - 1; - if (cf_index >= ARRAY_SIZE(mbm_cf_table)) { - pr_info("No MBM correction factor available\n"); + cur_bytes = rr->val; + bytes = cur_bytes - m->prev_bw_bytes; + m->prev_bw_bytes = cur_bytes; + + cur_bw = bytes / SZ_1M; + + m->prev_bw = cur_bw; +} + +/* + * This is scheduled by mon_event_read() to read the CQM/MBM counters + * on a domain. + */ +void mon_event_count(void *info) +{ + struct rdtgroup *rdtgrp, *entry; + struct rmid_read *rr = info; + struct list_head *head; + int ret; + + rdtgrp = rr->rgrp; + + ret = __mon_event_count(rdtgrp->closid, rdtgrp->mon.rmid, rr); + + /* + * For Ctrl groups read data from child monitor groups and + * add them together. Count events which are read successfully. + * Discard the rmid_read's reported errors. + */ + head = &rdtgrp->mon.crdtgrp_list; + + if (rdtgrp->type == RDTCTRL_GROUP) { + list_for_each_entry(entry, head, mon.crdtgrp_list) { + if (__mon_event_count(entry->closid, entry->mon.rmid, + rr) == 0) + ret = 0; + } + } + + /* + * __mon_event_count() calls for newly created monitor groups may + * report -EINVAL/Unavailable if the monitor hasn't seen any traffic. + * Discard error if any of the monitor event reads succeeded.
+ */ + if (ret == 0) + rr->err = 0; +} + +/* + * Feedback loop for MBA software controller (mba_sc) + * + * mba_sc is a feedback loop where we periodically read MBM counters and + * adjust the bandwidth percentage values via the IA32_MBA_THRTL_MSRs so + * that: + * + * current bandwidth (cur_bw) < user specified bandwidth (user_bw) + * + * This uses the MBM counters to measure the bandwidth and MBA throttle + * MSRs to control the bandwidth for a particular rdtgrp. It builds on the + * fact that resctrl rdtgroups have both monitoring and control. + * + * The frequency of the checks is 1s and we just tag along the MBM overflow + * timer. Having 1s interval makes the calculation of bandwidth simpler. + * + * Although MBA's goal is to restrict the bandwidth to a maximum, there may + * be a need to increase the bandwidth to avoid unnecessarily restricting + * the L2 <-> L3 traffic. + * + * Since MBA controls the L2 external bandwidth whereas MBM measures the + * L3 external bandwidth, the following sequence could lead to such a + * situation. + * + * Consider an rdtgroup which had high L3 <-> memory traffic in initial + * phases -> mba_sc kicks in and reduces bandwidth percentage values -> but + * after some time rdtgroup has mostly L2 <-> L3 traffic. + * + * In this case we may restrict the rdtgroup's L2 <-> L3 traffic as its + * throttle MSRs already have low percentage values. To avoid + * unnecessarily restricting such rdtgroups, we also increase the bandwidth. + */ +static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_mon_domain *dom_mbm) +{ + u32 closid, rmid, cur_msr_val, new_msr_val; + struct mbm_state *pmbm_data, *cmbm_data; + struct rdt_ctrl_domain *dom_mba; + struct rdt_resource *r_mba; + u32 cur_bw, user_bw, idx; + struct list_head *head; + struct rdtgroup *entry; + + if (!is_mbm_local_enabled()) + return; + + r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + + closid = rgrp->closid; + rmid = rgrp->mon.rmid; + idx = resctrl_arch_rmid_idx_encode(closid, rmid); + pmbm_data = &dom_mbm->mbm_local[idx]; + + dom_mba = get_ctrl_domain_from_cpu(smp_processor_id(), r_mba); + if (!dom_mba) { + pr_warn_once("Failure to get domain for MBA update\n"); return; } - mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold; - mbm_cf = mbm_cf_table[cf_index].cf; + cur_bw = pmbm_data->prev_bw; + user_bw = dom_mba->mbps_val[closid]; + + /* MBA resource doesn't support CDP */ + cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE); + + /* + * For Ctrl groups read data from child monitor groups. + */ + head = &rgrp->mon.crdtgrp_list; + list_for_each_entry(entry, head, mon.crdtgrp_list) { + cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid]; + cur_bw += cmbm_data->prev_bw; + } + + /* + * Scale up/down the bandwidth linearly for the ctrl group. The + * bandwidth step is the bandwidth granularity specified by the + * hardware. + * Always increase throttling if current bandwidth is above the + * target set by user. + * But avoid thrashing up and down on every poll by checking + * whether a decrease in throttling is likely to push the group + * back over target. E.g. if currently throttling to 30% of bandwidth + * on a system with 10% granularity steps, check whether moving to + * 40% would go past the limit by multiplying current bandwidth by + * "(30 + 10) / 30".
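+ * + * As a worked example with illustrative numbers (editor's addition): with min_bw = bw_gran = 10, cur_msr_val = 30 (throttled to 30%) and cur_bw = 600 MBps, throttling is only decreased once user_bw > 600 * (30 + 10) / 30 = 800 MBps. So user_bw = 1000 MBps moves the group to 40%, while user_bw = 700 MBps leaves it at 30% rather than thrashing around the target.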
+ */ + if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) { + new_msr_val = cur_msr_val - r_mba->membw.bw_gran; + } else if (cur_msr_val < MAX_MBA_BW && + (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) { + new_msr_val = cur_msr_val + r_mba->membw.bw_gran; + } else { + return; + } + + resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val); +} + +static void mbm_update(struct rdt_resource *r, struct rdt_mon_domain *d, + u32 closid, u32 rmid) +{ + struct rmid_read rr = {0}; + + rr.r = r; + rr.d = d; + + /* + * This is protected from concurrent reads from user + * as both the user and we hold the global mutex. + */ + if (is_mbm_total_enabled()) { + rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } + if (is_mbm_local_enabled()) { + rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID; + rr.val = 0; + rr.arch_mon_ctx = resctrl_arch_mon_ctx_alloc(rr.r, rr.evtid); + if (IS_ERR(rr.arch_mon_ctx)) { + pr_warn_ratelimited("Failed to allocate monitor context: %ld", + PTR_ERR(rr.arch_mon_ctx)); + return; + } + + __mon_event_count(closid, rmid, &rr); + + /* + * Call the MBA software controller only for the + * control groups and when user has enabled + * the software controller explicitly. + */ + if (is_mba_sc(NULL)) + mbm_bw_count(closid, rmid, &rr); + + resctrl_arch_mon_ctx_free(rr.r, rr.evtid, rr.arch_mon_ctx); + } +} + +/* + * Handler to scan the limbo list and move the RMIDs + * to free list whose occupancy < threshold_occupancy. + */ +void cqm_handle_limbo(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL); + struct rdt_mon_domain *d; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + d = container_of(work, struct rdt_mon_domain, cqm_limbo.work); + + __check_limbo(d, false); + + if (has_busy_rmid(d)) { + d->cqm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->cqm_work_cpu, &d->cqm_limbo, + delay); + } + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * cqm_setup_limbo_handler() - Schedule the limbo handler to run for this + * domain. + * @dom: The domain the limbo handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void cqm_setup_limbo_handler(struct rdt_mon_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu); + dom->cqm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay); +} + +void mbm_handle_overflow(struct work_struct *work) +{ + unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL); + struct rdtgroup *prgrp, *crgrp; + struct rdt_mon_domain *d; + struct list_head *head; + struct rdt_resource *r; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* + * If the filesystem has been unmounted this work no longer needs to + * run. 
+ */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + goto out_unlock; + + r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + d = container_of(work, struct rdt_mon_domain, mbm_over.work); + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + mbm_update(r, d, prgrp->closid, prgrp->mon.rmid); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) + mbm_update(r, d, crgrp->closid, crgrp->mon.rmid); + + if (is_mba_sc(NULL)) + update_mba_bw(prgrp, d); + } + + /* + * Re-check for housekeeping CPUs. This allows the overflow handler to + * move off a nohz_full CPU quickly. + */ + d->mbm_work_cpu = cpumask_any_housekeeping(&d->hdr.cpu_mask, + RESCTRL_PICK_ANY_CPU); + schedule_delayed_work_on(d->mbm_work_cpu, &d->mbm_over, delay); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +/** + * mbm_setup_overflow_handler() - Schedule the overflow handler to run for this + * domain. + * @dom: The domain the overflow handler should run for. + * @delay_ms: How far in the future the handler should run. + * @exclude_cpu: Which CPU the handler should not run on, + * RESCTRL_PICK_ANY_CPU to pick any CPU. + */ +void mbm_setup_overflow_handler(struct rdt_mon_domain *dom, unsigned long delay_ms, + int exclude_cpu) +{ + unsigned long delay = msecs_to_jiffies(delay_ms); + int cpu; + + /* + * When a domain comes online there is no guarantee the filesystem is + * mounted. If not, there is no need to catch counter overflow. + */ + if (!resctrl_mounted || !resctrl_arch_mon_capable()) + return; + cpu = cpumask_any_housekeeping(&dom->hdr.cpu_mask, exclude_cpu); + dom->mbm_work_cpu = cpu; + + if (cpu < nr_cpu_ids) + schedule_delayed_work_on(cpu, &dom->mbm_over, delay); +} + +static int dom_data_init(struct rdt_resource *r) +{ + u32 idx_limit = resctrl_arch_system_num_rmid_idx(); + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rmid_entry *entry = NULL; + int err = 0, i; + u32 idx; + + mutex_lock(&rdtgroup_mutex); + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + u32 *tmp; + + /* + * If the architecture hasn't provided a sanitised value here, + * this may result in larger arrays than necessary. Resctrl will + * use a smaller system-wide value based on the resources in + * use. + */ + tmp = kcalloc(num_closid, sizeof(*tmp), GFP_KERNEL); + if (!tmp) { + err = -ENOMEM; + goto out_unlock; + } + + closid_num_dirty_rmid = tmp; + } + + rmid_ptrs = kcalloc(idx_limit, sizeof(struct rmid_entry), GFP_KERNEL); + if (!rmid_ptrs) { + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + err = -ENOMEM; + goto out_unlock; + } + + for (i = 0; i < idx_limit; i++) { + entry = &rmid_ptrs[i]; + INIT_LIST_HEAD(&entry->list); + + resctrl_arch_rmid_idx_decode(i, &entry->closid, &entry->rmid); + list_add_tail(&entry->list, &rmid_free_lru); + } + + /* + * RESCTRL_RESERVED_CLOSID and RESCTRL_RESERVED_RMID are special and + * are always allocated. These are used for the rdtgroup_default + * control group, which will be set up later in rdtgroup_init().
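+ * + * As an illustration (editor's addition, not from this patch): on x86 resctrl_arch_rmid_idx_encode() ignores the CLOSID and simply returns the RMID, so the reserved pair encodes to index RESCTRL_RESERVED_RMID. An architecture where RMID depends on CLOSID might instead encode something like closid * num_rmid + rmid, which is why the reserved entry is removed from the free list by index rather than by RMID alone.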
+ */ + idx = resctrl_arch_rmid_idx_encode(RESCTRL_RESERVED_CLOSID, + RESCTRL_RESERVED_RMID); + entry = __rmid_entry(idx); + list_del(&entry->list); + +out_unlock: + mutex_unlock(&rdtgroup_mutex); + + return err; +} + +static void __exit dom_data_exit(void) +{ + mutex_lock(&rdtgroup_mutex); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + kfree(closid_num_dirty_rmid); + closid_num_dirty_rmid = NULL; + } + + kfree(rmid_ptrs); + rmid_ptrs = NULL; + + mutex_unlock(&rdtgroup_mutex); } static struct mon_evt llc_occupancy_event = { @@ -304,28 +1085,174 @@ static void l3_mon_evt_init(struct rdt_resource *r) { INIT_LIST_HEAD(&r->evt_list); - if (resctrl_arch_is_llc_occupancy_enabled()) + if (is_llc_occupancy_enabled()) list_add_tail(&llc_occupancy_event.list, &r->evt_list); - if (resctrl_arch_is_mbm_total_enabled()) + if (is_mbm_total_enabled()) list_add_tail(&mbm_total_event.list, &r->evt_list); - if (resctrl_arch_is_mbm_local_enabled()) + if (is_mbm_local_enabled()) list_add_tail(&mbm_local_event.list, &r->evt_list); } -int resctrl_arch_mon_resource_init(void) +/* + * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1, which indicates + * that RMIDs are configured in legacy mode. + * This mode is incompatible with Linux resctrl semantics + * as RMIDs are partitioned between SNC nodes, which requires + * a user to know which RMID is allocated to a task. + * Clearing bit 0 reconfigures the RMID counters for use + * in RMID sharing mode. This mode is better for Linux. + * The RMID space is divided between all SNC nodes with the + * RMIDs renumbered to start from zero in each node when + * counting operations from tasks. Code to read the counters + * must adjust RMID counter numbers based on SNC node. See + * logical_rmid_to_physical_rmid() for code that does this. + */ +void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d) +{ + if (snc_nodes_per_l3_cache > 1) + msr_clear_bit(MSR_RMID_SNC_CONFIG, 0); +} + +/* CPU models that support MSR_RMID_SNC_CONFIG */ +static const struct x86_cpu_id snc_cpu_ids[] __initconst = { + X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0), + X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0), + X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0), + X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0), + X86_MATCH_INTEL_FAM6_MODEL(ATOM_CRESTMONT_X, 0), + {} +}; + +/* + * There isn't a simple hardware bit that indicates whether a CPU is running + * in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the + * number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in + * the same NUMA node as CPU0. + * It is not possible to accurately determine SNC state if the system is + * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes + * to L3 caches. It will be OK if the system is booted with hyperthreading + * disabled (since this doesn't affect the ratio).
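+ * + * As a worked example with illustrative numbers (editor's addition): in SNC2 mode an L3 cache shared by 96 CPUs is split into two NUMA nodes of 48 CPUs each, so cpus_per_l3 / cpus_per_node = 2. With 256 physical RMIDs, each node is then left with 128 logical RMIDs, and logical_rmid_to_physical_rmid() maps logical RMID 5 on a CPU in the second node of a socket to physical RMID 5 + 1 * 128 = 133.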
+ */ +static __init int snc_get_config(void) { - struct rdt_resource *r = resctrl_arch_get_resource(RDT_RESOURCE_L3); + struct cacheinfo *ci = get_cpu_cacheinfo_level(0, RESCTRL_L3_CACHE); + const cpumask_t *node0_cpumask; + int cpus_per_node, cpus_per_l3; + int ret; - l3_mon_evt_init(r); + if (!x86_match_cpu(snc_cpu_ids) || !ci) + return 1; + + cpus_read_lock(); + if (num_online_cpus() != num_present_cpus()) + pr_warn("Some CPUs offline, SNC detection may be incorrect\n"); + cpus_read_unlock(); + + node0_cpumask = cpumask_of_node(cpu_to_node(0)); - if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_TOTAL_EVENT_ID)) { - mbm_total_event.configurable = true; - mbm_config_rftype_init("mbm_total_bytes_config"); + cpus_per_node = cpumask_weight(node0_cpumask); + cpus_per_l3 = cpumask_weight(&ci->shared_cpu_map); + + if (!cpus_per_node || !cpus_per_l3) + return 1; + + ret = cpus_per_l3 / cpus_per_node; + + /* sanity check: Only valid results are 1, 2, 3, 4 */ + switch (ret) { + case 1: + break; + case 2 ... 4: + pr_info("Sub-NUMA Cluster mode detected with %d nodes per L3 cache\n", ret); + rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_L3_NODE; + break; + default: + pr_warn("Ignore improbable SNC node count %d\n", ret); + ret = 1; + break; } - if (resctrl_arch_is_evt_configurable(QOS_L3_MBM_LOCAL_EVENT_ID)) { - mbm_local_event.configurable = true; - mbm_config_rftype_init("mbm_local_bytes_config"); + + return ret; +} + +int __init rdt_get_mon_l3_config(struct rdt_resource *r) +{ + unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset; + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + unsigned int threshold; + int ret; + + snc_nodes_per_l3_cache = snc_get_config(); + + resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024; + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache; + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache; + hw_res->mbm_width = MBM_CNTR_WIDTH_BASE; + + if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX) + hw_res->mbm_width += mbm_offset; + else if (mbm_offset > MBM_CNTR_WIDTH_OFFSET_MAX) + pr_warn("Ignoring impossible MBM counter offset\n"); + + /* + * A reasonable upper limit on the max threshold is the number + * of lines tagged per RMID if all RMIDs have the same number of + * lines tagged in the LLC. + * + * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC. + */ + threshold = resctrl_rmid_realloc_limit / r->num_rmid; + + /* + * Because num_rmid may not be a power of two, round the value + * to the nearest multiple of hw_res->mon_scale so it matches a + * value the hardware will measure. mon_scale may not be a power of 2. 
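+ * + * Continuing the example with illustrative numbers (editor's addition): a 35MB LLC with 56 RMIDs gives threshold = 36700160 / 56 = 655360 bytes, and with a mon_scale of, say, 61440 the call below trims this down to a whole multiple, 10 * 61440 = 614400 bytes.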
+ */ + resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold); + + ret = dom_data_init(r); + if (ret) + return ret; + + if (rdt_cpu_has(X86_FEATURE_BMEC)) { + u32 eax, ebx, ecx, edx; + + /* Detect list of bandwidth sources that can be tracked */ + cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx); + hw_res->mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS; + + if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) { + mbm_total_event.configurable = true; + mbm_config_rftype_init("mbm_total_bytes_config"); + } + if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) { + mbm_local_event.configurable = true; + mbm_config_rftype_init("mbm_local_bytes_config"); + } } + l3_mon_evt_init(r); + + r->mon_capable = true; + return 0; } + +void __exit rdt_put_mon_l3_config(void) +{ + dom_data_exit(); +} + +void __init intel_rdt_mbm_apply_quirk(void) +{ + int cf_index; + + cf_index = (boot_cpu_data.x86_cache_max_rmid + 1) / 8 - 1; + if (cf_index >= ARRAY_SIZE(mbm_cf_table)) { + pr_info("No MBM correction factor available\n"); + return; + } + + mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold; + mbm_cf = mbm_cf_table[cf_index].cf; +} diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c index ba1596afee10..180bcacddf75 100644 --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c @@ -11,7 +11,6 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt -#include <linux/cacheinfo.h> #include <linux/cpu.h> #include <linux/cpumask.h> #include <linux/debugfs.h> @@ -31,7 +30,7 @@ #include "internal.h" #define CREATE_TRACE_POINTS -#include "pseudo_lock_event.h" +#include "trace.h" /* * The bits needed to disable hardware prefetching varies based on the @@ -39,9 +38,30 @@ */ static u64 prefetch_disable_bits; +/* + * Major number assigned to and shared by all devices exposing + * pseudo-locked regions. + */ +static unsigned int pseudo_lock_major; +static unsigned long pseudo_lock_minor_avail = GENMASK(MINORBITS, 0); + +static char *pseudo_lock_devnode(const struct device *dev, umode_t *mode) +{ + const struct rdtgroup *rdtgrp; + + rdtgrp = dev_get_drvdata(dev); + if (mode) + *mode = 0600; + return kasprintf(GFP_KERNEL, "pseudo_lock/%s", rdtgrp->kn->name); +} + +static const struct class pseudo_lock_class = { + .name = "pseudo_lock", + .devnode = pseudo_lock_devnode, +}; + /** - * resctrl_arch_get_prefetch_disable_bits - prefetch disable bits of supported - * platforms + * get_prefetch_disable_bits - prefetch disable bits of supported platforms * @void: It takes no parameters. * * Capture the list of platforms that have been validated to support @@ -55,16 +75,14 @@ static u64 prefetch_disable_bits; * in the SDM. * * When adding a platform here also add support for its cache events to - * resctrl_arch_measure_l*_residency() + * measure_cycles_perf_fn() * * Return: * If platform is supported, the bits to disable hardware prefetchers, 0 * if platform is not supported. 
*/ -u64 resctrl_arch_get_prefetch_disable_bits(void) +static u64 get_prefetch_disable_bits(void) { - prefetch_disable_bits = 0; - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || boot_cpu_data.x86 != 6) return 0; @@ -80,8 +98,7 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) * 3 DCU IP Prefetcher Disable (R/W) * 63:4 Reserved */ - prefetch_disable_bits = 0xF; - break; + return 0xF; case INTEL_FAM6_ATOM_GOLDMONT: case INTEL_FAM6_ATOM_GOLDMONT_PLUS: /* @@ -92,16 +109,307 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) * 2 DCU Hardware Prefetcher Disable (R/W) * 63:3 Reserved */ - prefetch_disable_bits = 0x5; - break; + return 0x5; + } + + return 0; +} + +/** + * pseudo_lock_minor_get - Obtain available minor number + * @minor: Pointer to where new minor number will be stored + * + * A bitmask is used to track available minor numbers. Here the next free + * minor number is marked as unavailable and returned. + * + * Return: 0 on success, <0 on failure. + */ +static int pseudo_lock_minor_get(unsigned int *minor) +{ + unsigned long first_bit; + + first_bit = find_first_bit(&pseudo_lock_minor_avail, MINORBITS); + + if (first_bit == MINORBITS) + return -ENOSPC; + + __clear_bit(first_bit, &pseudo_lock_minor_avail); + *minor = first_bit; + + return 0; +} + +/** + * pseudo_lock_minor_release - Return minor number to available + * @minor: The minor number made available + */ +static void pseudo_lock_minor_release(unsigned int minor) +{ + __set_bit(minor, &pseudo_lock_minor_avail); +} + +/** + * region_find_by_minor - Locate a pseudo-lock region by inode minor number + * @minor: The minor number of the device representing pseudo-locked region + * + * When the character device is accessed we need to determine which + * pseudo-locked region it belongs to. This is done by matching the minor + * number of the device to the pseudo-locked region to which it belongs. + * + * Minor numbers are assigned at the time a pseudo-locked region is associated + * with a cache instance. + * + * Return: On success return pointer to resource group owning the pseudo-locked + * region, NULL on failure. + */ +static struct rdtgroup *region_find_by_minor(unsigned int minor) +{ + struct rdtgroup *rdtgrp, *rdtgrp_match = NULL; + + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (rdtgrp->plr && rdtgrp->plr->minor == minor) { + rdtgrp_match = rdtgrp; + break; + } + } + return rdtgrp_match; +} + +/** + * struct pseudo_lock_pm_req - A power management QoS request list entry + * @list: Entry within the @pm_reqs list for a pseudo-locked region + * @req: PM QoS request + */ +struct pseudo_lock_pm_req { + struct list_head list; + struct dev_pm_qos_request req; +}; + +static void pseudo_lock_cstates_relax(struct pseudo_lock_region *plr) +{ + struct pseudo_lock_pm_req *pm_req, *next; + + list_for_each_entry_safe(pm_req, next, &plr->pm_reqs, list) { + dev_pm_qos_remove_request(&pm_req->req); + list_del(&pm_req->list); + kfree(pm_req); + } +} + +/** + * pseudo_lock_cstates_constrain - Restrict cores from entering C6 + * @plr: Pseudo-locked region + * + * To prevent the cache from being affected by power management, entering + * C6 has to be avoided. This is accomplished by requesting a latency + * requirement lower than the lowest C6 exit latency of all supported + * platforms as found in the cpuidle state tables in the intel_idle driver. + * At this time it is possible to do so with a single latency requirement + * for all supported platforms.
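+ * + * For instance (editor's illustration): the per-CPU DEV_PM_QOS_RESUME_LATENCY request of 30 usec added below via dev_pm_qos_add_request() is below the C6 exit latencies listed for the supported platforms in intel_idle, so cpuidle will not select C6 on those cores while the request is active.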
+ * + * Since Goldmont is supported, which is affected by X86_BUG_MONITOR, + * the ACPI latencies need to be considered while keeping in mind that C2 + * may be set to map to deeper sleep states. In this case the latency + * requirement needs to prevent entering C2 also. + * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr) +{ + struct pseudo_lock_pm_req *pm_req; + int cpu; + int ret; + + for_each_cpu(cpu, &plr->d->hdr.cpu_mask) { + pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL); + if (!pm_req) { + rdt_last_cmd_puts("Failure to allocate memory for PM QoS\n"); + ret = -ENOMEM; + goto out_err; + } + ret = dev_pm_qos_add_request(get_cpu_device(cpu), + &pm_req->req, + DEV_PM_QOS_RESUME_LATENCY, + 30); + if (ret < 0) { + rdt_last_cmd_printf("Failed to add latency req CPU%d\n", + cpu); + kfree(pm_req); + ret = -1; + goto out_err; + } + list_add(&pm_req->list, &plr->pm_reqs); + } + + return 0; + +out_err: + pseudo_lock_cstates_relax(plr); + return ret; +} + +/** + * pseudo_lock_region_clear - Reset pseudo-lock region data + * @plr: pseudo-lock region + * + * All content of the pseudo-locked region is reset - any allocated memory + * is freed. + * + * Return: void + */ +static void pseudo_lock_region_clear(struct pseudo_lock_region *plr) +{ + plr->size = 0; + plr->line_size = 0; + kfree(plr->kmem); + plr->kmem = NULL; + plr->s = NULL; + if (plr->d) + plr->d->plr = NULL; + plr->d = NULL; + plr->cbm = 0; + plr->debugfs_dir = NULL; +} + +/** + * pseudo_lock_region_init - Initialize pseudo-lock region information + * @plr: pseudo-lock region + * + * Called after the user provided a schemata to be pseudo-locked. On entry + * the &struct pseudo_lock_region has already been initialized from the + * schemata with the resource, domain, and capacity bitmask. Here the + * information required for pseudo-locking is deduced from this data and the + * &struct pseudo_lock_region is initialized further. This information + * includes: + * - size in bytes of the region to be pseudo-locked + * - cache line size to know the stride with which data needs to be accessed + * to be pseudo-locked + * - a cpu associated with the cache instance on which the pseudo-locking + * flow can be executed + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_init(struct pseudo_lock_region *plr) +{ + enum resctrl_scope scope = plr->s->res->ctrl_scope; + struct cacheinfo *ci; + int ret; + + if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE)) + return -ENODEV; + + /* Pick the first cpu we find that is associated with the cache.
*/ + plr->cpu = cpumask_first(&plr->d->hdr.cpu_mask); + + if (!cpu_online(plr->cpu)) { + rdt_last_cmd_printf("CPU %u associated with cache not online\n", + plr->cpu); + ret = -ENODEV; + goto out_region; + } + + ci = get_cpu_cacheinfo_level(plr->cpu, scope); + if (ci) { + plr->line_size = ci->coherency_line_size; + plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm); + return 0; } - return prefetch_disable_bits; + ret = -1; + rdt_last_cmd_puts("Unable to determine cache line size\n"); +out_region: + pseudo_lock_region_clear(plr); + return ret; } /** - * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache - * @_plr: the pseudo-lock region descriptor + * pseudo_lock_init - Initialize a pseudo-lock region + * @rdtgrp: resource group to which new pseudo-locked region will belong + * + * A pseudo-locked region is associated with a resource group. When this + * association is created the pseudo-locked region is initialized. The + * details of the pseudo-locked region are not known at this time so only + * allocation is done and association established. + * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_init(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr; + + plr = kzalloc(sizeof(*plr), GFP_KERNEL); + if (!plr) + return -ENOMEM; + + init_waitqueue_head(&plr->lock_thread_wq); + INIT_LIST_HEAD(&plr->pm_reqs); + rdtgrp->plr = plr; + return 0; +} + +/** + * pseudo_lock_region_alloc - Allocate kernel memory that will be pseudo-locked + * @plr: pseudo-lock region + * + * Initialize the details required to set up the pseudo-locked region and + * allocate the contiguous memory that will be pseudo-locked to the cache. + * + * Return: 0 on success, <0 on failure. Descriptive error will be written + * to last_cmd_status buffer. + */ +static int pseudo_lock_region_alloc(struct pseudo_lock_region *plr) +{ + int ret; + + ret = pseudo_lock_region_init(plr); + if (ret < 0) + return ret; + + /* + * We do not yet support contiguous regions larger than + * KMALLOC_MAX_SIZE. + */ + if (plr->size > KMALLOC_MAX_SIZE) { + rdt_last_cmd_puts("Requested region exceeds maximum size\n"); + ret = -E2BIG; + goto out_region; + } + + plr->kmem = kzalloc(plr->size, GFP_KERNEL); + if (!plr->kmem) { + rdt_last_cmd_puts("Unable to allocate memory\n"); + ret = -ENOMEM; + goto out_region; + } + + ret = 0; + goto out; +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * pseudo_lock_free - Free a pseudo-locked region + * @rdtgrp: resource group to which pseudo-locked region belonged + * + * The pseudo-locked region's resources have already been released, or not + * yet created at this point. Now it can be freed and disassociated from the + * resource group. + * + * Return: void + */ +static void pseudo_lock_free(struct rdtgroup *rdtgrp) +{ + pseudo_lock_region_clear(rdtgrp->plr); + kfree(rdtgrp->plr); + rdtgrp->plr = NULL; +} + +/** + * pseudo_lock_fn - Load kernel memory into cache + * @_rdtgrp: resource group to which pseudo-lock region belongs * * This is the core pseudo-locking flow. * @@ -118,9 +426,10 @@ u64 resctrl_arch_get_prefetch_disable_bits(void) * * Return: 0. Waiter on waitqueue will be woken on completion. 
*/ -int resctrl_arch_pseudo_lock_fn(void *_plr) +static int pseudo_lock_fn(void *_rdtgrp) { - struct pseudo_lock_region *plr = _plr; + struct rdtgroup *rdtgrp = _rdtgrp; + struct pseudo_lock_region *plr = rdtgrp->plr; u32 rmid_p, closid_p; unsigned long i; u64 saved_msr; @@ -180,8 +489,7 @@ int resctrl_arch_pseudo_lock_fn(void *_plr) * pseudo-locked followed by reading of kernel memory to load it * into the cache. */ - __wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, plr->closid); - + __wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, rdtgrp->closid); /* * Cache was flushed earlier. Now access kernel memory to read it * into cache region associated with just activated plr->closid. @@ -229,8 +537,342 @@ int resctrl_arch_pseudo_lock_fn(void *_plr) } /** - * resctrl_arch_measure_cycles_lat_fn - Measure cycle latency to read - * pseudo-locked memory + * rdtgroup_monitor_in_progress - Test if monitoring in progress + * @rdtgrp: resource group being queried + * + * Return: 1 if monitor groups have been created for this resource + * group, 0 otherwise. + */ +static int rdtgroup_monitor_in_progress(struct rdtgroup *rdtgrp) +{ + return !list_empty(&rdtgrp->mon.crdtgrp_list); +} + +/** + * rdtgroup_locksetup_user_restrict - Restrict user access to group + * @rdtgrp: resource group needing access restricted + * + * A resource group used for cache pseudo-locking cannot have cpus or tasks + * assigned to it. This is communicated to the user by restricting access + * to all the files that can be used to make such changes. + * + * Permissions restored with rdtgroup_locksetup_user_restore() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restriction of access an attempt will be made to restore permissions but + * the state of the mode of these files will be uncertain when a failure + * occurs. + */ +static int rdtgroup_locksetup_user_restrict(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restrict(rdtgrp, "mon_groups"); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); +err_cpus: + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); +err_tasks: + rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); +out: + return ret; +} + +/** + * rdtgroup_locksetup_user_restore - Restore user access to group + * @rdtgrp: resource group needing access restored + * + * Restore all file access previously removed using + * rdtgroup_locksetup_user_restrict() + * + * Return: 0 on success, <0 on failure. If a failure occurs during the + * restoration of access an attempt will be made to restrict permissions + * again but the state of the mode of these files will be uncertain when + * a failure occurs. 
+ */ +static int rdtgroup_locksetup_user_restore(struct rdtgroup *rdtgrp) +{ + int ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "tasks", 0777); + if (ret) + return ret; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0777); + if (ret) + goto err_tasks; + + ret = rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0777); + if (ret) + goto err_cpus; + + if (resctrl_arch_mon_capable()) { + ret = rdtgroup_kn_mode_restore(rdtgrp, "mon_groups", 0777); + if (ret) + goto err_cpus_list; + } + + ret = 0; + goto out; + +err_cpus_list: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus_list"); +err_cpus: + rdtgroup_kn_mode_restrict(rdtgrp, "cpus"); +err_tasks: + rdtgroup_kn_mode_restrict(rdtgrp, "tasks"); +out: + return ret; +} + +/** + * rdtgroup_locksetup_enter - Resource group enters locksetup mode + * @rdtgrp: resource group requested to enter locksetup mode + * + * A resource group enters locksetup mode to reflect that it would be used + * to represent a pseudo-locked region and is in the process of being set + * up to do so. A resource group used for a pseudo-locked region would + * lose the closid associated with it, so we cannot allow it to have any + * tasks or cpus assigned, nor permit tasks or cpus to be assigned in the + * future. Monitoring of a pseudo-locked region is not allowed either. + * + * The above and more restrictions on a pseudo-locked region are checked + * for and enforced before the resource group enters the locksetup mode. + * + * Returns: 0 if the resource group successfully entered locksetup mode, <0 + * on failure. On failure the last_cmd_status buffer is updated with text to + * communicate details of failure to the user. + */ +int rdtgroup_locksetup_enter(struct rdtgroup *rdtgrp) +{ + int ret; + + /* + * The default resource group can neither be removed nor lose the + * default closid associated with it. + */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Cannot pseudo-lock default group\n"); + return -EINVAL; + } + + /* + * Cache Pseudo-locking not supported when CDP is enabled. + * + * Some things to consider if you would like to enable this + * support (using L3 CDP as example): + * - When CDP is enabled two separate resources are exposed, + * L3DATA and L3CODE, but they are actually on the same cache. + * The implication for pseudo-locking is that if a + * pseudo-locked region is created on a domain of one + * resource (eg. L3CODE), then a pseudo-locked region cannot + * be created on that same domain of the other resource + * (eg. L3DATA). This is because the creation of a + * pseudo-locked region involves a call to wbinvd that will + * affect all cache allocations on that particular domain. + * - Considering the previous, it may be possible to only + * expose one of the CDP resources to pseudo-locking and + * hide the other. For example, we could consider only + * exposing L3DATA and since the L3 cache is unified it is + * still possible to place instructions there and execute them. + * - If only one region is exposed to pseudo-locking we should + * still keep in mind that availability of a portion of cache + * for pseudo-locking should take into account both resources. + * Similarly, if a pseudo-locked region is created in one + * resource, the portion of cache used by it should be made + * unavailable to all future allocations from both resources.
+ */ + if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3) || + resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2)) { + rdt_last_cmd_puts("CDP enabled\n"); + return -EINVAL; + } + + /* + * Not knowing the bits to disable prefetching implies that this + * platform does not support Cache Pseudo-Locking. + */ + prefetch_disable_bits = get_prefetch_disable_bits(); + if (prefetch_disable_bits == 0) { + rdt_last_cmd_puts("Pseudo-locking not supported\n"); + return -EINVAL; + } + + if (rdtgroup_monitor_in_progress(rdtgrp)) { + rdt_last_cmd_puts("Monitoring in progress\n"); + return -EINVAL; + } + + if (rdtgroup_tasks_assigned(rdtgrp)) { + rdt_last_cmd_puts("Tasks assigned to resource group\n"); + return -EINVAL; + } + + if (!cpumask_empty(&rdtgrp->cpu_mask)) { + rdt_last_cmd_puts("CPUs assigned to resource group\n"); + return -EINVAL; + } + + if (rdtgroup_locksetup_user_restrict(rdtgrp)) { + rdt_last_cmd_puts("Unable to modify resctrl permissions\n"); + return -EIO; + } + + ret = pseudo_lock_init(rdtgrp); + if (ret) { + rdt_last_cmd_puts("Unable to init pseudo-lock region\n"); + goto out_release; + } + + /* + * If this system is capable of monitoring, an rmid would have been + * allocated when the control group was created. It is not needed + * anymore once this group is used for pseudo-locking. This + * is safe to call on platforms not capable of monitoring. + */ + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + ret = 0; + goto out; + +out_release: + rdtgroup_locksetup_user_restore(rdtgrp); +out: + return ret; +} + +/** + * rdtgroup_locksetup_exit - Resource group exits locksetup mode + * @rdtgrp: resource group + * + * When a resource group exits locksetup mode the earlier restrictions are + * lifted. + * + * Return: 0 on success, <0 on failure + */ +int rdtgroup_locksetup_exit(struct rdtgroup *rdtgrp) +{ + int ret; + + if (resctrl_arch_mon_capable()) { + ret = alloc_rmid(rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + } + + ret = rdtgroup_locksetup_user_restore(rdtgrp); + if (ret) { + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + return ret; + } + + pseudo_lock_free(rdtgrp); + return 0; +} + +/** + * rdtgroup_cbm_overlaps_pseudo_locked - Test if CBM or portion is pseudo-locked + * @d: RDT domain + * @cbm: CBM to test + * + * @d represents a cache instance and @cbm a capacity bitmask that is + * considered for it. Determine if @cbm overlaps with any existing + * pseudo-locked region on @d. + * + * @cbm is unsigned long, even if only 32 bits are used, to make the + * bitmap functions work correctly. + * + * Return: true if @cbm overlaps with pseudo-locked region on @d, false + * otherwise. + */ +bool rdtgroup_cbm_overlaps_pseudo_locked(struct rdt_ctrl_domain *d, unsigned long cbm) +{ + unsigned int cbm_len; + unsigned long cbm_b; + + if (d->plr) { + cbm_len = d->plr->s->res->cache.cbm_len; + cbm_b = d->plr->cbm; + if (bitmap_intersects(&cbm, &cbm_b, cbm_len)) + return true; + } + return false; +} + +/** + * rdtgroup_pseudo_locked_in_hierarchy - Pseudo-locked region in cache hierarchy + * @d: RDT domain under test + * + * The setup of a pseudo-locked region affects all cache instances within + * the hierarchy of the region. It is thus essential to know if any + * pseudo-locked regions exist within a cache hierarchy to prevent any + * attempts to create new pseudo-locked regions in the same hierarchy.
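+ * + * For example (editor's addition): a region pseudo-locked into the L2 cache of CPUs 0-3 marks those CPUs in cpu_with_psl below, so a later attempt to create a region on an L3 domain whose cpu_mask includes any of CPUs 0-3 is rejected, and vice versa.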
+ * + * Return: true if a pseudo-locked region exists in the hierarchy of @d or + * if it is not possible to test due to memory allocation issue, + * false otherwise. + */ +bool rdtgroup_pseudo_locked_in_hierarchy(struct rdt_ctrl_domain *d) +{ + struct rdt_ctrl_domain *d_i; + cpumask_var_t cpu_with_psl; + struct rdt_resource *r; + bool ret = false; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (!zalloc_cpumask_var(&cpu_with_psl, GFP_KERNEL)) + return true; + + /* + * First determine which cpus have pseudo-locked regions + * associated with them. + */ + for_each_alloc_capable_rdt_resource(r) { + list_for_each_entry(d_i, &r->ctrl_domains, hdr.list) { + if (d_i->plr) + cpumask_or(cpu_with_psl, cpu_with_psl, + &d_i->hdr.cpu_mask); + } + } + + /* + * Next test if new pseudo-locked region would intersect with + * existing region. + */ + if (cpumask_intersects(&d->hdr.cpu_mask, cpu_with_psl)) + ret = true; + + free_cpumask_var(cpu_with_psl); + return ret; +} + +/** + * measure_cycles_lat_fn - Measure cycle latency to read pseudo-locked memory * @_plr: pseudo-lock region to measure * * There is no deterministic way to test if a memory region is cached. One @@ -243,7 +885,7 @@ int resctrl_arch_pseudo_lock_fn(void *_plr) * * Return: 0. Waiter on waitqueue will be woken on completion. */ -int resctrl_arch_measure_cycles_lat_fn(void *_plr) +static int measure_cycles_lat_fn(void *_plr) { struct pseudo_lock_region *plr = _plr; u32 saved_low, saved_high; @@ -427,7 +1069,7 @@ static int measure_residency_fn(struct perf_event_attr *miss_attr, return 0; } -int resctrl_arch_measure_l2_residency(void *_plr) +static int measure_l2_residency(void *_plr) { struct pseudo_lock_region *plr = _plr; struct residency_counts counts = {0}; @@ -465,7 +1107,7 @@ int resctrl_arch_measure_l2_residency(void *_plr) return 0; } -int resctrl_arch_measure_l3_residency(void *_plr) +static int measure_l3_residency(void *_plr) { struct pseudo_lock_region *plr = _plr; struct residency_counts counts = {0}; @@ -520,3 +1162,441 @@ int resctrl_arch_measure_l3_residency(void *_plr) wake_up_interruptible(&plr->lock_thread_wq); return 0; } + +/** + * pseudo_lock_measure_cycles - Trigger latency measure to pseudo-locked region + * @rdtgrp: Resource group to which the pseudo-locked region belongs. + * @sel: Selector of which measurement to perform on a pseudo-locked region. + * + * The measurement of latency to access a pseudo-locked region should be + * done from a cpu that is associated with that pseudo-locked region. + * Determine which cpu is associated with this region and start a thread on + * that cpu to perform the measurement, wait for that thread to complete. 
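+ *
+ * For example, assuming debugfs is mounted at /sys/kernel/debug and the
+ * pseudo-locked group is named "newlock", userspace triggers a
+ * measurement with:
+ *
+ *   # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ *
+ * where 1 selects the latency test, 2 the L2 residency test and 3 the
+ * L3 residency test (see the @sel handling below).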
+ * + * Return: 0 on success, <0 on failure + */ +static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int cpu; + int ret = -1; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out; + } + + if (!plr->d) { + ret = -ENODEV; + goto out; + } + + plr->thread_done = 0; + cpu = cpumask_first(&plr->d->hdr.cpu_mask); + if (!cpu_online(cpu)) { + ret = -ENODEV; + goto out; + } + + plr->cpu = cpu; + + if (sel == 1) + thread = kthread_create_on_node(measure_cycles_lat_fn, plr, + cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 2) + thread = kthread_create_on_node(measure_l2_residency, plr, + cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else if (sel == 3) + thread = kthread_create_on_node(measure_l3_residency, plr, + cpu_to_node(cpu), + "pseudo_lock_measure/%u", + cpu); + else + goto out; + + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + goto out; + } + kthread_bind(thread, cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) + goto out; + + ret = 0; + +out: + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +static ssize_t pseudo_lock_measure_trigger(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct rdtgroup *rdtgrp = file->private_data; + size_t buf_size; + char buf[32]; + int ret; + int sel; + + buf_size = min(count, (sizeof(buf) - 1)); + if (copy_from_user(buf, user_buf, buf_size)) + return -EFAULT; + + buf[buf_size] = '\0'; + ret = kstrtoint(buf, 10, &sel); + if (ret == 0) { + if (sel != 1 && sel != 2 && sel != 3) + return -EINVAL; + ret = debugfs_file_get(file->f_path.dentry); + if (ret) + return ret; + ret = pseudo_lock_measure_cycles(rdtgrp, sel); + if (ret == 0) + ret = count; + debugfs_file_put(file->f_path.dentry); + } + + return ret; +} + +static const struct file_operations pseudo_measure_fops = { + .write = pseudo_lock_measure_trigger, + .open = simple_open, + .llseek = default_llseek, +}; + +/** + * rdtgroup_pseudo_lock_create - Create a pseudo-locked region + * @rdtgrp: resource group to which pseudo-lock region belongs + * + * Called when a resource group in the pseudo-locksetup mode receives a + * valid schemata that should be pseudo-locked. Since the resource group is + * in pseudo-locksetup mode the &struct pseudo_lock_region has already been + * allocated and initialized with the essential information. If a failure + * occurs the resource group remains in the pseudo-locksetup mode with the + * &struct pseudo_lock_region associated with it, but cleared from all + * information and ready for the user to re-attempt pseudo-locking by + * writing the schemata again. + * + * Return: 0 if the pseudo-locked region was successfully pseudo-locked, <0 + * on failure. Descriptive error will be written to last_cmd_status buffer. 
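+ *
+ * On success a character device is created for the region (typically
+ * /dev/pseudo_lock/<group>, subject to udev configuration). A sketch of
+ * the expected application usage, assuming a group named "newlock":
+ *
+ *   fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+ *   ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ *
+ * The caller must already be affine to the CPUs of the region's cache
+ * domain and must map MAP_SHARED; both are enforced by
+ * pseudo_lock_dev_mmap().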
+ */ +int rdtgroup_pseudo_lock_create(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + struct task_struct *thread; + unsigned int new_minor; + struct device *dev; + int ret; + + ret = pseudo_lock_region_alloc(plr); + if (ret < 0) + return ret; + + ret = pseudo_lock_cstates_constrain(plr); + if (ret < 0) { + ret = -EINVAL; + goto out_region; + } + + plr->thread_done = 0; + + thread = kthread_create_on_node(pseudo_lock_fn, rdtgrp, + cpu_to_node(plr->cpu), + "pseudo_lock/%u", plr->cpu); + if (IS_ERR(thread)) { + ret = PTR_ERR(thread); + rdt_last_cmd_printf("Locking thread returned error %d\n", ret); + goto out_cstates; + } + + kthread_bind(thread, plr->cpu); + wake_up_process(thread); + + ret = wait_event_interruptible(plr->lock_thread_wq, + plr->thread_done == 1); + if (ret < 0) { + /* + * If the thread does not get on the CPU for whatever + * reason and the process which sets up the region is + * interrupted then this will leave the thread in runnable + * state and once it gets on the CPU it will dereference + * the cleared, but not freed, plr struct resulting in an + * empty pseudo-locking loop. + */ + rdt_last_cmd_puts("Locking thread interrupted\n"); + goto out_cstates; + } + + ret = pseudo_lock_minor_get(&new_minor); + if (ret < 0) { + rdt_last_cmd_puts("Unable to obtain a new minor number\n"); + goto out_cstates; + } + + /* + * Unlock access but do not release the reference. The + * pseudo-locked region will still be here on return. + * + * The mutex has to be released temporarily to avoid a potential + * deadlock with the mm->mmap_lock which is obtained in the + * device_create() and debugfs_create_dir() callpath below as well as + * before the mmap() callback is called. + */ + mutex_unlock(&rdtgroup_mutex); + + if (!IS_ERR_OR_NULL(debugfs_resctrl)) { + plr->debugfs_dir = debugfs_create_dir(rdtgrp->kn->name, + debugfs_resctrl); + if (!IS_ERR_OR_NULL(plr->debugfs_dir)) + debugfs_create_file("pseudo_lock_measure", 0200, + plr->debugfs_dir, rdtgrp, + &pseudo_measure_fops); + } + + dev = device_create(&pseudo_lock_class, NULL, + MKDEV(pseudo_lock_major, new_minor), + rdtgrp, "%s", rdtgrp->kn->name); + + mutex_lock(&rdtgroup_mutex); + + if (IS_ERR(dev)) { + ret = PTR_ERR(dev); + rdt_last_cmd_printf("Failed to create character device: %d\n", + ret); + goto out_debugfs; + } + + /* We released the mutex - check if group was removed while we did so */ + if (rdtgrp->flags & RDT_DELETED) { + ret = -ENODEV; + goto out_device; + } + + plr->minor = new_minor; + + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKED; + closid_free(rdtgrp->closid); + rdtgroup_kn_mode_restore(rdtgrp, "cpus", 0444); + rdtgroup_kn_mode_restore(rdtgrp, "cpus_list", 0444); + + ret = 0; + goto out; + +out_device: + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, new_minor)); +out_debugfs: + debugfs_remove_recursive(plr->debugfs_dir); + pseudo_lock_minor_release(new_minor); +out_cstates: + pseudo_lock_cstates_relax(plr); +out_region: + pseudo_lock_region_clear(plr); +out: + return ret; +} + +/** + * rdtgroup_pseudo_lock_remove - Remove a pseudo-locked region + * @rdtgrp: resource group to which the pseudo-locked region belongs + * + * The removal of a pseudo-locked region can be initiated when the resource + * group is removed from user space via a "rmdir" from userspace or the + * unmount of the resctrl filesystem. On removal the resource group does + * not go back to pseudo-locksetup mode before it is removed, instead it is + * removed directly. 
There is thus asymmetry with the creation where the + * &struct pseudo_lock_region is removed here while it was not created in + * rdtgroup_pseudo_lock_create(). + * + * Return: void + */ +void rdtgroup_pseudo_lock_remove(struct rdtgroup *rdtgrp) +{ + struct pseudo_lock_region *plr = rdtgrp->plr; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + /* + * Default group cannot be a pseudo-locked region so we can + * free closid here. + */ + closid_free(rdtgrp->closid); + goto free; + } + + pseudo_lock_cstates_relax(plr); + debugfs_remove_recursive(rdtgrp->plr->debugfs_dir); + device_destroy(&pseudo_lock_class, MKDEV(pseudo_lock_major, plr->minor)); + pseudo_lock_minor_release(plr->minor); + +free: + pseudo_lock_free(rdtgrp); +} + +static int pseudo_lock_dev_open(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = region_find_by_minor(iminor(inode)); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + filp->private_data = rdtgrp; + atomic_inc(&rdtgrp->waitcount); + /* Perform a non-seekable open - llseek is not supported */ + filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE); + + mutex_unlock(&rdtgroup_mutex); + + return 0; +} + +static int pseudo_lock_dev_release(struct inode *inode, struct file *filp) +{ + struct rdtgroup *rdtgrp; + + mutex_lock(&rdtgroup_mutex); + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + filp->private_data = NULL; + atomic_dec(&rdtgrp->waitcount); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int pseudo_lock_dev_mremap(struct vm_area_struct *area) +{ + /* Not supported */ + return -EINVAL; +} + +static const struct vm_operations_struct pseudo_mmap_ops = { + .mremap = pseudo_lock_dev_mremap, +}; + +static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long vsize = vma->vm_end - vma->vm_start; + unsigned long off = vma->vm_pgoff << PAGE_SHIFT; + struct pseudo_lock_region *plr; + struct rdtgroup *rdtgrp; + unsigned long physical; + unsigned long psize; + + mutex_lock(&rdtgroup_mutex); + + rdtgrp = filp->private_data; + WARN_ON(!rdtgrp); + if (!rdtgrp) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + plr = rdtgrp->plr; + + if (!plr->d) { + mutex_unlock(&rdtgroup_mutex); + return -ENODEV; + } + + /* + * Task is required to run with affinity to the cpus associated + * with the pseudo-locked region. If this is not the case the task + * may be scheduled elsewhere and invalidate entries in the + * pseudo-locked region. + */ + if (!cpumask_subset(current->cpus_ptr, &plr->d->hdr.cpu_mask)) { + mutex_unlock(&rdtgroup_mutex); + return -EINVAL; + } + + physical = __pa(plr->kmem) >> PAGE_SHIFT; + psize = plr->size - off; + + if (off > plr->size) { + mutex_unlock(&rdtgroup_mutex); + return -ENOSPC; + } + + /* + * Ensure changes are carried directly to the memory being mapped, + * do not allow copy-on-write mapping. 
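+	 * A copy-on-write mapping would fault in fresh pages on the first
+	 * write; those pages would lie outside the pseudo-locked region
+	 * and silently defeat it.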
+ */ + if (!(vma->vm_flags & VM_SHARED)) { + mutex_unlock(&rdtgroup_mutex); + return -EINVAL; + } + + if (vsize > psize) { + mutex_unlock(&rdtgroup_mutex); + return -ENOSPC; + } + + memset(plr->kmem + off, 0, vsize); + + if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff, + vsize, vma->vm_page_prot)) { + mutex_unlock(&rdtgroup_mutex); + return -EAGAIN; + } + vma->vm_ops = &pseudo_mmap_ops; + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static const struct file_operations pseudo_lock_dev_fops = { + .owner = THIS_MODULE, + .llseek = no_llseek, + .read = NULL, + .write = NULL, + .open = pseudo_lock_dev_open, + .release = pseudo_lock_dev_release, + .mmap = pseudo_lock_dev_mmap, +}; + +int rdt_pseudo_lock_init(void) +{ + int ret; + + ret = register_chrdev(0, "pseudo_lock", &pseudo_lock_dev_fops); + if (ret < 0) + return ret; + + pseudo_lock_major = ret; + + ret = class_register(&pseudo_lock_class); + if (ret) { + unregister_chrdev(pseudo_lock_major, "pseudo_lock"); + return ret; + } + + return 0; +} + +void rdt_pseudo_lock_release(void) +{ + class_unregister(&pseudo_lock_class); + unregister_chrdev(pseudo_lock_major, "pseudo_lock"); + pseudo_lock_major = 0; +} diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index 10bd76e1d0d5..d7163b764c62 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -13,7 +13,20 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/cpu.h> +#include <linux/debugfs.h> +#include <linux/fs.h> +#include <linux/fs_parser.h> +#include <linux/sysfs.h> +#include <linux/kernfs.h> +#include <linux/seq_buf.h> +#include <linux/seq_file.h> +#include <linux/sched/signal.h> +#include <linux/sched/task.h> #include <linux/slab.h> +#include <linux/task_work.h> +#include <linux/user_namespace.h> + +#include <uapi/linux/magic.h> #include <asm/resctrl.h> #include "internal.h" @@ -22,239 +35,4199 @@ DEFINE_STATIC_KEY_FALSE(rdt_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key); DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key); -/* - * This is safe against resctrl_sched_in() called from __switch_to() - * because __switch_to() is executed with interrupts disabled. A local call - * from update_closid_rmid() is protected against __switch_to() because - * preemption is disabled. - */ -void resctrl_arch_sync_cpu_defaults(void *info) +/* Mutex to protect rdtgroup access. */ +DEFINE_MUTEX(rdtgroup_mutex); + +static struct kernfs_root *rdt_root; +struct rdtgroup rdtgroup_default; +LIST_HEAD(rdt_all_groups); + +/* list of entries for the schemata file */ +LIST_HEAD(resctrl_schema_all); + +/* The filesystem can only be mounted once. 
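+ * A second mount attempt fails with -EBUSY.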
*/ +bool resctrl_mounted; + +/* Kernel fs node for "info" directory under root */ +static struct kernfs_node *kn_info; + +/* Kernel fs node for "mon_groups" directory under root */ +static struct kernfs_node *kn_mongrp; + +/* Kernel fs node for "mon_data" directory under root */ +static struct kernfs_node *kn_mondata; + +static struct seq_buf last_cmd_status; +static char last_cmd_status_buf[512]; + +static int rdtgroup_setup_root(struct rdt_fs_context *ctx); +static void rdtgroup_destroy_root(void); + +struct dentry *debugfs_resctrl; + +static bool resctrl_debug; + +void rdt_last_cmd_clear(void) { - struct resctrl_cpu_sync *r = info; + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_clear(&last_cmd_status); +} - if (r) { - this_cpu_write(pqr_state.default_closid, r->closid); - this_cpu_write(pqr_state.default_rmid, r->rmid); - } +void rdt_last_cmd_puts(const char *s) +{ + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_puts(&last_cmd_status, s); +} - /* - * We cannot unconditionally write the MSR because the current - * executing task might have its own closid selected. Just reuse - * the context switch code. - */ - resctrl_sched_in(current); +void rdt_last_cmd_printf(const char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + lockdep_assert_held(&rdtgroup_mutex); + seq_buf_vprintf(&last_cmd_status, fmt, ap); + va_end(ap); } -#define INVALID_CONFIG_INDEX UINT_MAX +void rdt_staged_configs_clear(void) +{ + struct rdt_ctrl_domain *dom; + struct rdt_resource *r; -/** - * mon_event_config_index_get - get the hardware index for the - * configurable event - * @evtid: event id. + lockdep_assert_held(&rdtgroup_mutex); + + for_each_alloc_capable_rdt_resource(r) { + list_for_each_entry(dom, &r->ctrl_domains, hdr.list) + memset(dom->staged_config, 0, sizeof(dom->staged_config)); + } +} + +/* + * Trivial allocator for CLOSIDs. Since h/w only supports a small number, + * we can keep a bitmap of free CLOSIDs in a single integer. * - * Return: 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID - * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID - * INVALID_CONFIG_INDEX for invalid evtid + * Using a global CLOSID across all resources has some advantages and + * some drawbacks: + * + We can simply set current's closid to assign a task to a resource + * group. + * + Context switch code can avoid extra memory references deciding which + * CLOSID to load into the PQR_ASSOC MSR + * - We give up some options in configuring resource groups across multi-socket + * systems. + * - Our choices on how to configure each resource become progressively more + * limited as the number of resources grows. 
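+ *
+ * For example (made-up numbers): if L3 CAT supports 16 CLOSIDs but MBA
+ * only supports 8, closid_init() below clamps the whole system to 8
+ * control groups, because a single global CLOSID must be valid for
+ * every resource.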
*/ -static inline unsigned int mon_event_config_index_get(u32 evtid) +static unsigned long closid_free_map; +static int closid_free_map_len; + +int closids_supported(void) { - switch (evtid) { - case QOS_L3_MBM_TOTAL_EVENT_ID: - return 0; - case QOS_L3_MBM_LOCAL_EVENT_ID: - return 1; - default: - /* Should never reach here */ - return INVALID_CONFIG_INDEX; - } + return closid_free_map_len; } -void resctrl_arch_mon_event_config_read(void *info) +static void closid_init(void) { - struct resctrl_mon_config_info *mon_info = info; - unsigned int index; - u64 msrval; + struct resctrl_schema *s; + u32 rdt_min_closid = 32; - index = mon_event_config_index_get(mon_info->evtid); - if (index == INVALID_CONFIG_INDEX) { - pr_warn_once("Invalid event id %d\n", mon_info->evtid); - return; - } - rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval); + /* Compute rdt_min_closid across all resources */ + list_for_each_entry(s, &resctrl_schema_all, list) + rdt_min_closid = min(rdt_min_closid, s->num_closid); - /* Report only the valid event configuration bits */ - mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; + closid_free_map = BIT_MASK(rdt_min_closid) - 1; + + /* RESCTRL_RESERVED_CLOSID is always reserved for the default group */ + __clear_bit(RESCTRL_RESERVED_CLOSID, &closid_free_map); + closid_free_map_len = rdt_min_closid; } -void resctrl_arch_mon_event_config_write(void *info) +static int closid_alloc(void) { - struct resctrl_mon_config_info *mon_info = info; - unsigned int index; + int cleanest_closid; + u32 closid; - index = mon_event_config_index_get(mon_info->evtid); - if (index == INVALID_CONFIG_INDEX) { - pr_warn_once("Invalid event id %d\n", mon_info->evtid); - mon_info->err = -EINVAL; - return; + lockdep_assert_held(&rdtgroup_mutex); + + if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID)) { + cleanest_closid = resctrl_find_cleanest_closid(); + if (cleanest_closid < 0) + return cleanest_closid; + closid = cleanest_closid; + } else { + closid = ffs(closid_free_map); + if (closid == 0) + return -ENOSPC; + closid--; } - wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); + __clear_bit(closid, &closid_free_map); - mon_info->err = 0; + return closid; } -static void l3_qos_cfg_update(void *arg) +void closid_free(int closid) { - bool *enable = arg; + lockdep_assert_held(&rdtgroup_mutex); - wrmsrl(MSR_IA32_L3_QOS_CFG, *enable ? L3_QOS_CDP_ENABLE : 0ULL); + __set_bit(closid, &closid_free_map); } -static void l2_qos_cfg_update(void *arg) +/** + * closid_allocated - test if provided closid is in use + * @closid: closid to be tested + * + * Return: true if @closid is currently associated with a resource group, + * false if @closid is free + */ +bool closid_allocated(unsigned int closid) { - bool *enable = arg; + lockdep_assert_held(&rdtgroup_mutex); - wrmsrl(MSR_IA32_L2_QOS_CFG, *enable ? L2_QOS_CDP_ENABLE : 0ULL); + return !test_bit(closid, &closid_free_map); } -static int set_cache_qos_cfg(int level, bool enable) +/** + * rdtgroup_mode_by_closid - Return mode of resource group with closid + * @closid: closid if the resource group + * + * Each resource group is associated with a @closid. Here the mode + * of a resource group can be queried by searching for it using its closid. 
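+ * If @closid is not currently associated with any resource group,
+ * RDT_NUM_MODES is returned.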
+ * + * Return: mode as &enum rdtgrp_mode of resource group with closid @closid + */ +enum rdtgrp_mode rdtgroup_mode_by_closid(int closid) { - void (*update)(void *arg); - struct rdt_resource *r_l; - cpumask_var_t cpu_mask; - struct rdt_domain *d; - int cpu; + struct rdtgroup *rdtgrp; - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); + list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) { + if (rdtgrp->closid == closid) + return rdtgrp->mode; + } - if (level == RDT_RESOURCE_L3) - update = l3_qos_cfg_update; - else if (level == RDT_RESOURCE_L2) - update = l2_qos_cfg_update; - else - return -EINVAL; + return RDT_NUM_MODES; +} - if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) - return -ENOMEM; +static const char * const rdt_mode_str[] = { + [RDT_MODE_SHAREABLE] = "shareable", + [RDT_MODE_EXCLUSIVE] = "exclusive", + [RDT_MODE_PSEUDO_LOCKSETUP] = "pseudo-locksetup", + [RDT_MODE_PSEUDO_LOCKED] = "pseudo-locked", +}; - r_l = &rdt_resources_all[level].r_resctrl; - list_for_each_entry(d, &r_l->domains, list) { - if (r_l->cache.arch_has_per_cpu_cfg) - /* Pick all the CPUs in the domain instance */ - for_each_cpu(cpu, &d->cpu_mask) - cpumask_set_cpu(cpu, cpu_mask); - else - /* Pick one CPU from each domain instance to update MSR */ - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); +/** + * rdtgroup_mode_str - Return the string representation of mode + * @mode: the resource group mode as &enum rdtgroup_mode + * + * Return: string representation of valid mode, "unknown" otherwise + */ +static const char *rdtgroup_mode_str(enum rdtgrp_mode mode) +{ + if (mode < RDT_MODE_SHAREABLE || mode >= RDT_NUM_MODES) + return "unknown"; + + return rdt_mode_str[mode]; +} + +/* set uid and gid of rdtgroup dirs and files to that of the creator */ +static int rdtgroup_kn_set_ugid(struct kernfs_node *kn) +{ + struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID, + .ia_uid = current_fsuid(), + .ia_gid = current_fsgid(), }; + + if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) && + gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID)) + return 0; + + return kernfs_setattr(kn, &iattr); +} + +static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft) +{ + struct kernfs_node *kn; + int ret; + + kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, + 0, rft->kf_ops, rft, NULL, NULL); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; } - /* Update QOS_CFG MSR on all the CPUs in cpu_mask */ - on_each_cpu_mask(cpu_mask, update, &enable, 1); + return 0; +} - free_cpumask_var(cpu_mask); +static int rdtgroup_seqfile_show(struct seq_file *m, void *arg) +{ + struct kernfs_open_file *of = m->private; + struct rftype *rft = of->kn->priv; + if (rft->seq_show) + return rft->seq_show(of, m, arg); return 0; } -/* Restore the qos cfg state when a domain comes online */ -void rdt_domain_reconfigure_cdp(struct rdt_resource *r) +static ssize_t rdtgroup_file_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) { - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + struct rftype *rft = of->kn->priv; - if (!r->cdp_capable) - return; + if (rft->write) + return rft->write(of, buf, nbytes, off); - if (r->rid == RDT_RESOURCE_L2) - l2_qos_cfg_update(&hw_res->cdp_enabled); + return -EINVAL; +} - if (r->rid == RDT_RESOURCE_L3) - l3_qos_cfg_update(&hw_res->cdp_enabled); +static const struct kernfs_ops rdtgroup_kf_single_ops = { + .atomic_write_len = 
PAGE_SIZE,
+	.write			= rdtgroup_file_write,
+	.seq_show		= rdtgroup_seqfile_show,
+};
+
+static const struct kernfs_ops kf_mondata_ops = {
+	.atomic_write_len	= PAGE_SIZE,
+	.seq_show		= rdtgroup_mondata_show,
+};
+
+static bool is_cpu_list(struct kernfs_open_file *of)
+{
+	struct rftype *rft = of->kn->priv;
+
+	return rft->flags & RFTYPE_FLAGS_CPUS_LIST;
 }
 
-static int cdp_enable(int level)
+static int rdtgroup_cpus_show(struct kernfs_open_file *of,
+			      struct seq_file *s, void *v)
 {
-	struct rdt_resource *r_l = &rdt_resources_all[level].r_resctrl;
-	int ret;
+	struct rdtgroup *rdtgrp;
+	struct cpumask *mask;
+	int ret = 0;
 
-	if (!r_l->alloc_capable)
-		return -EINVAL;
+	rdtgrp = rdtgroup_kn_lock_live(of->kn);
 
-	ret = set_cache_qos_cfg(level, true);
-	if (!ret)
-		rdt_resources_all[level].cdp_enabled = true;
+	if (rdtgrp) {
+		if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
+			if (!rdtgrp->plr->d) {
+				rdt_last_cmd_clear();
+				rdt_last_cmd_puts("Cache domain offline\n");
+				ret = -ENODEV;
+			} else {
+				mask = &rdtgrp->plr->d->hdr.cpu_mask;
+				seq_printf(s, is_cpu_list(of) ?
+					   "%*pbl\n" : "%*pb\n",
+					   cpumask_pr_args(mask));
+			}
+		} else {
+			seq_printf(s, is_cpu_list(of) ? "%*pbl\n" : "%*pb\n",
+				   cpumask_pr_args(&rdtgrp->cpu_mask));
+		}
+	} else {
+		ret = -ENOENT;
+	}
+	rdtgroup_kn_unlock(of->kn);
 
 	return ret;
 }
 
-static void cdp_disable(int level)
+/*
+ * This is safe against resctrl_sched_in() called from __switch_to()
+ * because __switch_to() is executed with interrupts disabled. A local call
+ * from update_closid_rmid() is protected against __switch_to() because
+ * preemption is disabled.
+ */
+static void update_cpu_closid_rmid(void *info)
 {
-	struct rdt_hw_resource *r_hw = &rdt_resources_all[level];
+	struct rdtgroup *r = info;
 
-	if (r_hw->cdp_enabled) {
-		set_cache_qos_cfg(level, false);
-		r_hw->cdp_enabled = false;
+	if (r) {
+		this_cpu_write(pqr_state.default_closid, r->closid);
+		this_cpu_write(pqr_state.default_rmid, r->mon.rmid);
 	}
+
+	/*
+	 * We cannot unconditionally write the MSR because the current
+	 * executing task might have its own closid selected. Just reuse
+	 * the context switch code.
+	 */
+	resctrl_sched_in(current);
 }
 
-int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
+/*
+ * Update the PQR_ASSOC MSR on all cpus in @cpu_mask.
+ *
+ * Per task closids/rmids must have been set up before calling this function.
+ */ +static void +update_closid_rmid(const struct cpumask *cpu_mask, struct rdtgroup *r) { - struct rdt_hw_resource *hw_res = &rdt_resources_all[l]; + on_each_cpu_mask(cpu_mask, update_cpu_closid_rmid, r, 1); +} - if (!hw_res->r_resctrl.cdp_capable) +static int cpus_mon_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask) +{ + struct rdtgroup *prgrp = rdtgrp->mon.parent, *crgrp; + struct list_head *head; + + /* Check whether cpus belong to parent ctrl group */ + cpumask_andnot(tmpmask, newmask, &prgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + rdt_last_cmd_puts("Can only add CPUs to mongroup that belong to parent\n"); return -EINVAL; + } - if (enable) - return cdp_enable(l); + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Give any dropped cpus to parent rdtgroup */ + cpumask_or(&prgrp->cpu_mask, &prgrp->cpu_mask, tmpmask); + update_closid_rmid(tmpmask, prgrp); + } - cdp_disable(l); + /* + * If we added cpus, remove them from previous group that owned them + * and update per-cpu rmid + */ + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + if (crgrp == rdtgrp) + continue; + cpumask_andnot(&crgrp->cpu_mask, &crgrp->cpu_mask, + tmpmask); + } + update_closid_rmid(tmpmask, rdtgrp); + } + + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); return 0; } -static int reset_all_ctrls(struct rdt_resource *r) +static void cpumask_rdtgrp_clear(struct rdtgroup *r, struct cpumask *m) { - struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); - struct rdt_hw_domain *hw_dom; - struct msr_param msr_param; - cpumask_var_t cpu_mask; - struct rdt_domain *d; - int i; + struct rdtgroup *crgrp; - /* Walking r->domains, ensure it can't race with cpuhp */ - lockdep_assert_cpus_held(); + cpumask_andnot(&r->cpu_mask, &r->cpu_mask, m); + /* update the child mon group masks as well*/ + list_for_each_entry(crgrp, &r->mon.crdtgrp_list, mon.crdtgrp_list) + cpumask_and(&crgrp->cpu_mask, &r->cpu_mask, &crgrp->cpu_mask); +} - if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) - return -ENOMEM; +static int cpus_ctrl_write(struct rdtgroup *rdtgrp, cpumask_var_t newmask, + cpumask_var_t tmpmask, cpumask_var_t tmpmask1) +{ + struct rdtgroup *r, *crgrp; + struct list_head *head; - msr_param.res = r; - msr_param.low = 0; - msr_param.high = hw_res->num_closid; + /* Check whether cpus are dropped from this group */ + cpumask_andnot(tmpmask, &rdtgrp->cpu_mask, newmask); + if (!cpumask_empty(tmpmask)) { + /* Can't drop from default group */ + if (rdtgrp == &rdtgroup_default) { + rdt_last_cmd_puts("Can't drop CPUs from default group\n"); + return -EINVAL; + } + + /* Give any dropped cpus to rdtgroup_default */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, tmpmask); + update_closid_rmid(tmpmask, &rdtgroup_default); + } /* - * Disable resource control for this resource by setting all - * CBMs in all domains to the maximum mask value. Pick one CPU - * from each domain to update the MSRs below. + * If we added cpus, remove them from previous group and + * the prev group's child groups that owned them + * and update per-cpu closid/rmid. 
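+	 *
+	 * For example, if CPUs 2-3 currently belong to another control
+	 * group "B", writing 0-3 to this group's cpus_list pulls CPUs 2-3
+	 * out of "B" (and out of any of B's monitor groups) into this
+	 * group.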
*/ - list_for_each_entry(d, &r->domains, list) { - hw_dom = resctrl_to_arch_dom(d); - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); - - for (i = 0; i < hw_res->num_closid; i++) - hw_dom->ctrl_val[i] = r->default_ctrl; + cpumask_andnot(tmpmask, newmask, &rdtgrp->cpu_mask); + if (!cpumask_empty(tmpmask)) { + list_for_each_entry(r, &rdt_all_groups, rdtgroup_list) { + if (r == rdtgrp) + continue; + cpumask_and(tmpmask1, &r->cpu_mask, tmpmask); + if (!cpumask_empty(tmpmask1)) + cpumask_rdtgrp_clear(r, tmpmask1); + } + update_closid_rmid(tmpmask, rdtgrp); } - /* Update CBM on all the CPUs in cpu_mask */ - on_each_cpu_mask(cpu_mask, rdt_ctrl_update, &msr_param, 1); + /* Done pushing/pulling - update this group with new mask */ + cpumask_copy(&rdtgrp->cpu_mask, newmask); - free_cpumask_var(cpu_mask); + /* + * Clear child mon group masks since there is a new parent mask + * now and update the rmid for the cpus the child lost. + */ + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + cpumask_and(tmpmask, &rdtgrp->cpu_mask, &crgrp->cpu_mask); + update_closid_rmid(tmpmask, rdtgrp); + cpumask_clear(&crgrp->cpu_mask); + } return 0; } -void resctrl_arch_reset_resources(void) +static ssize_t rdtgroup_cpus_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) { - struct rdt_resource *r; + cpumask_var_t tmpmask, newmask, tmpmask1; + struct rdtgroup *rdtgrp; + int ret; - for_each_capable_rdt_resource(r) - reset_all_ctrls(r); + if (!buf) + return -EINVAL; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + return -ENOMEM; + } + if (!zalloc_cpumask_var(&tmpmask1, GFP_KERNEL)) { + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + return -ENOMEM; + } + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + ret = -ENOENT; + goto unlock; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + if (is_cpu_list(of)) + ret = cpulist_parse(buf, newmask); + else + ret = cpumask_parse(buf, newmask); + + if (ret) { + rdt_last_cmd_puts("Bad CPU list/mask\n"); + goto unlock; + } + + /* check that user didn't specify any offline cpus */ + cpumask_andnot(tmpmask, newmask, cpu_online_mask); + if (!cpumask_empty(tmpmask)) { + ret = -EINVAL; + rdt_last_cmd_puts("Can only assign online CPUs\n"); + goto unlock; + } + + if (rdtgrp->type == RDTCTRL_GROUP) + ret = cpus_ctrl_write(rdtgrp, newmask, tmpmask, tmpmask1); + else if (rdtgrp->type == RDTMON_GROUP) + ret = cpus_mon_write(rdtgrp, newmask, tmpmask); + else + ret = -EINVAL; + +unlock: + rdtgroup_kn_unlock(of->kn); + free_cpumask_var(tmpmask); + free_cpumask_var(newmask); + free_cpumask_var(tmpmask1); + + return ret ?: nbytes; +} + +/** + * rdtgroup_remove - the helper to remove resource group safely + * @rdtgrp: resource group to remove + * + * On resource group creation via a mkdir, an extra kernfs_node reference is + * taken to ensure that the rdtgroup structure remains accessible for the + * rdtgroup_kn_unlock() calls where it is removed. + * + * Drop the extra reference here, then free the rdtgroup structure. + * + * Return: void + */ +static void rdtgroup_remove(struct rdtgroup *rdtgrp) +{ + kernfs_put(rdtgrp->kn); + kfree(rdtgrp); +} + +static void _update_task_closid_rmid(void *task) +{ + /* + * If the task is still current on this CPU, update PQR_ASSOC MSR. 
+ * Otherwise, the MSR is updated when the task is scheduled in. + */ + if (task == current) + resctrl_sched_in(task); +} + +static void update_task_closid_rmid(struct task_struct *t) +{ + if (IS_ENABLED(CONFIG_SMP) && task_curr(t)) + smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1); + else + _update_task_closid_rmid(t); +} + +static bool task_in_rdtgroup(struct task_struct *tsk, struct rdtgroup *rdtgrp) +{ + u32 closid, rmid = rdtgrp->mon.rmid; + + if (rdtgrp->type == RDTCTRL_GROUP) + closid = rdtgrp->closid; + else if (rdtgrp->type == RDTMON_GROUP) + closid = rdtgrp->mon.parent->closid; + else + return false; + + return resctrl_arch_match_closid(tsk, closid) && + resctrl_arch_match_rmid(tsk, closid, rmid); +} + +static int __rdtgroup_move_task(struct task_struct *tsk, + struct rdtgroup *rdtgrp) +{ + /* If the task is already in rdtgrp, no need to move the task. */ + if (task_in_rdtgroup(tsk, rdtgrp)) + return 0; + + /* + * Set the task's closid/rmid before the PQR_ASSOC MSR can be + * updated by them. + * + * For ctrl_mon groups, move both closid and rmid. + * For monitor groups, can move the tasks only from + * their parent CTRL group. + */ + if (rdtgrp->type == RDTMON_GROUP && + !resctrl_arch_match_closid(tsk, rdtgrp->mon.parent->closid)) { + rdt_last_cmd_puts("Can't move task to different control group\n"); + return -EINVAL; + } + + if (rdtgrp->type == RDTMON_GROUP) + resctrl_arch_set_closid_rmid(tsk, rdtgrp->mon.parent->closid, + rdtgrp->mon.rmid); + else + resctrl_arch_set_closid_rmid(tsk, rdtgrp->closid, + rdtgrp->mon.rmid); + + /* + * Ensure the task's closid and rmid are written before determining if + * the task is current that will decide if it will be interrupted. + * This pairs with the full barrier between the rq->curr update and + * resctrl_sched_in() during context switch. + */ + smp_mb(); + + /* + * By now, the task's closid and rmid are set. If the task is current + * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource + * group go into effect. If the task is not current, the MSR will be + * updated when the task is scheduled in. + */ + update_task_closid_rmid(tsk); + + return 0; +} + +static bool is_closid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_alloc_capable() && (r->type == RDTCTRL_GROUP) && + resctrl_arch_match_closid(t, r->closid)); +} + +static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r) +{ + return (resctrl_arch_mon_capable() && (r->type == RDTMON_GROUP) && + resctrl_arch_match_rmid(t, r->mon.parent->closid, + r->mon.rmid)); +} + +/** + * rdtgroup_tasks_assigned - Test if tasks have been assigned to resource group + * @r: Resource group + * + * Return: 1 if tasks have been assigned to @r, 0 otherwise + */ +int rdtgroup_tasks_assigned(struct rdtgroup *r) +{ + struct task_struct *p, *t; + int ret = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + ret = 1; + break; + } + } + rcu_read_unlock(); + + return ret; +} + +static int rdtgroup_task_write_permission(struct task_struct *task, + struct kernfs_open_file *of) +{ + const struct cred *tcred = get_task_cred(task); + const struct cred *cred = current_cred(); + int ret = 0; + + /* + * Even if we're attaching all tasks in the thread group, we only + * need to check permissions on one of them. 
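+	 * The writer must be root (euid 0) or have an euid matching the
+	 * target task's real or saved uid, much like cgroup's attach
+	 * permission check.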
+ */ + if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && + !uid_eq(cred->euid, tcred->uid) && + !uid_eq(cred->euid, tcred->suid)) { + rdt_last_cmd_printf("No permission to move task %d\n", task->pid); + ret = -EPERM; + } + + put_cred(tcred); + return ret; +} + +static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp, + struct kernfs_open_file *of) +{ + struct task_struct *tsk; + int ret; + + rcu_read_lock(); + if (pid) { + tsk = find_task_by_vpid(pid); + if (!tsk) { + rcu_read_unlock(); + rdt_last_cmd_printf("No task %d\n", pid); + return -ESRCH; + } + } else { + tsk = current; + } + + get_task_struct(tsk); + rcu_read_unlock(); + + ret = rdtgroup_task_write_permission(tsk, of); + if (!ret) + ret = __rdtgroup_move_task(tsk, rdtgrp); + + put_task_struct(tsk); + return ret; +} + +static ssize_t rdtgroup_tasks_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + char *pid_str; + int ret = 0; + pid_t pid; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + rdt_last_cmd_clear(); + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto unlock; + } + + while (buf && buf[0] != '\0' && buf[0] != '\n') { + pid_str = strim(strsep(&buf, ",")); + + if (kstrtoint(pid_str, 0, &pid)) { + rdt_last_cmd_printf("Task list parsing error pid %s\n", pid_str); + ret = -EINVAL; + break; + } + + if (pid < 0) { + rdt_last_cmd_printf("Invalid pid %d\n", pid); + ret = -EINVAL; + break; + } + + ret = rdtgroup_move_task(pid, rdtgrp, of); + if (ret) { + rdt_last_cmd_printf("Error while processing task %d\n", pid); + break; + } + } + +unlock: + rdtgroup_kn_unlock(of->kn); + + return ret ?: nbytes; +} + +static void show_rdt_tasks(struct rdtgroup *r, struct seq_file *s) +{ + struct task_struct *p, *t; + pid_t pid; + + rcu_read_lock(); + for_each_process_thread(p, t) { + if (is_closid_match(t, r) || is_rmid_match(t, r)) { + pid = task_pid_vnr(t); + if (pid) + seq_printf(s, "%d\n", pid); + } + } + rcu_read_unlock(); +} + +static int rdtgroup_tasks_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + show_rdt_tasks(rdtgrp, s); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_closid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->closid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +static int rdtgroup_rmid_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + int ret = 0; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (rdtgrp) + seq_printf(s, "%u\n", rdtgrp->mon.rmid); + else + ret = -ENOENT; + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +#ifdef CONFIG_PROC_CPU_RESCTRL + +/* + * A task can only be part of one resctrl control group and of one monitor + * group which is associated to that control group. + * + * 1) res: + * mon: + * + * resctrl is not available. + * + * 2) res:/ + * mon: + * + * Task is part of the root resctrl control group, and it is not associated + * to any monitor group. + * + * 3) res:/ + * mon:mon0 + * + * Task is part of the root resctrl control group and monitor group mon0. 
+ * + * 4) res:group0 + * mon: + * + * Task is part of resctrl control group group0, and it is not associated + * to any monitor group. + * + * 5) res:group0 + * mon:mon1 + * + * Task is part of resctrl control group group0 and monitor group mon1. + */ +int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns, + struct pid *pid, struct task_struct *tsk) +{ + struct rdtgroup *rdtg; + int ret = 0; + + mutex_lock(&rdtgroup_mutex); + + /* Return empty if resctrl has not been mounted. */ + if (!resctrl_mounted) { + seq_puts(s, "res:\nmon:\n"); + goto unlock; + } + + list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) { + struct rdtgroup *crg; + + /* + * Task information is only relevant for shareable + * and exclusive groups. + */ + if (rdtg->mode != RDT_MODE_SHAREABLE && + rdtg->mode != RDT_MODE_EXCLUSIVE) + continue; + + if (!resctrl_arch_match_closid(tsk, rdtg->closid)) + continue; + + seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "", + rdtg->kn->name); + seq_puts(s, "mon:"); + list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, + mon.crdtgrp_list) { + if (!resctrl_arch_match_rmid(tsk, crg->mon.parent->closid, + crg->mon.rmid)) + continue; + seq_printf(s, "%s", crg->kn->name); + break; + } + seq_putc(s, '\n'); + goto unlock; + } + /* + * The above search should succeed. Otherwise return + * with an error. + */ + ret = -ENOENT; +unlock: + mutex_unlock(&rdtgroup_mutex); + + return ret; +} +#endif + +static int rdt_last_cmd_status_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + int len; + + mutex_lock(&rdtgroup_mutex); + len = seq_buf_used(&last_cmd_status); + if (len) + seq_printf(seq, "%.*s", len, last_cmd_status_buf); + else + seq_puts(seq, "ok\n"); + mutex_unlock(&rdtgroup_mutex); + return 0; +} + +static int rdt_num_closids_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + + seq_printf(seq, "%u\n", s->num_closid); + return 0; +} + +static int rdt_default_ctrl_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->default_ctrl); + return 0; +} + +static int rdt_min_cbm_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->cache.min_cbm_bits); + return 0; +} + +static int rdt_shareable_bits_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%x\n", r->cache.shareable_bits); + return 0; +} + +/* + * rdt_bit_usage_show - Display current usage of resources + * + * A domain is a shared resource that can now be allocated differently. Here + * we display the current regions of the domain as an annotated bitmask. 
+ * For each domain of this resource its allocation bitmask + * is annotated as below to indicate the current usage of the corresponding bit: + * 0 - currently unused + * X - currently available for sharing and used by software and hardware + * H - currently used by hardware only but available for software use + * S - currently used and shareable by software only + * E - currently used exclusively by one resource group + * P - currently pseudo-locked by one resource group + */ +static int rdt_bit_usage_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + /* + * Use unsigned long even though only 32 bits are used to ensure + * test_bit() is used safely. + */ + unsigned long sw_shareable = 0, hw_shareable = 0; + unsigned long exclusive = 0, pseudo_locked = 0; + struct rdt_resource *r = s->res; + struct rdt_ctrl_domain *dom; + int i, hwb, swb, excl, psl; + enum rdtgrp_mode mode; + bool sep = false; + u32 ctrl_val; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + hw_shareable = r->cache.shareable_bits; + list_for_each_entry(dom, &r->ctrl_domains, hdr.list) { + if (sep) + seq_putc(seq, ';'); + sw_shareable = 0; + exclusive = 0; + seq_printf(seq, "%d=", dom->hdr.id); + for (i = 0; i < closids_supported(); i++) { + if (!closid_allocated(i)) + continue; + ctrl_val = resctrl_arch_get_config(r, dom, i, + s->conf_type); + mode = rdtgroup_mode_by_closid(i); + switch (mode) { + case RDT_MODE_SHAREABLE: + sw_shareable |= ctrl_val; + break; + case RDT_MODE_EXCLUSIVE: + exclusive |= ctrl_val; + break; + case RDT_MODE_PSEUDO_LOCKSETUP: + /* + * RDT_MODE_PSEUDO_LOCKSETUP is possible + * here but not included since the CBM + * associated with this CLOSID in this mode + * is not initialized and no task or cpu can be + * assigned this CLOSID. + */ + break; + case RDT_MODE_PSEUDO_LOCKED: + case RDT_NUM_MODES: + WARN(1, + "invalid mode for closid %d\n", i); + break; + } + } + for (i = r->cache.cbm_len - 1; i >= 0; i--) { + pseudo_locked = dom->plr ? 
dom->plr->cbm : 0; + hwb = test_bit(i, &hw_shareable); + swb = test_bit(i, &sw_shareable); + excl = test_bit(i, &exclusive); + psl = test_bit(i, &pseudo_locked); + if (hwb && swb) + seq_putc(seq, 'X'); + else if (hwb && !swb) + seq_putc(seq, 'H'); + else if (!hwb && swb) + seq_putc(seq, 'S'); + else if (excl) + seq_putc(seq, 'E'); + else if (psl) + seq_putc(seq, 'P'); + else /* Unused bits remain */ + seq_putc(seq, '0'); + } + sep = true; + } + seq_putc(seq, '\n'); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return 0; +} + +static int rdt_min_bw_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.min_bw); + return 0; +} + +static int rdt_num_rmids_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + seq_printf(seq, "%d\n", r->num_rmid); + + return 0; +} + +static int rdt_mon_features_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + struct mon_evt *mevt; + + list_for_each_entry(mevt, &r->evt_list, list) { + seq_printf(seq, "%s\n", mevt->name); + if (mevt->configurable) + seq_printf(seq, "%s_config\n", mevt->name); + } + + return 0; +} + +static int rdt_bw_gran_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.bw_gran); + return 0; +} + +static int rdt_delay_linear_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + seq_printf(seq, "%u\n", r->membw.delay_linear); + return 0; +} + +static int max_threshold_occ_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + seq_printf(seq, "%u\n", resctrl_rmid_realloc_threshold); + + return 0; +} + +static int rdt_thread_throttle_mode_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + + if (r->membw.throttle_mode == THREAD_THROTTLE_PER_THREAD) + seq_puts(seq, "per-thread\n"); + else + seq_puts(seq, "max\n"); + + return 0; +} + +static ssize_t max_threshold_occ_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + unsigned int bytes; + int ret; + + ret = kstrtouint(buf, 0, &bytes); + if (ret) + return ret; + + if (bytes > resctrl_rmid_realloc_limit) + return -EINVAL; + + resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(bytes); + + return nbytes; +} + +/* + * rdtgroup_mode_show - Display mode of this resource group + */ +static int rdtgroup_mode_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct rdtgroup *rdtgrp; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + seq_printf(s, "%s\n", rdtgroup_mode_str(rdtgrp->mode)); + + rdtgroup_kn_unlock(of->kn); + return 0; +} + +static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type) +{ + switch (my_type) { + case CDP_CODE: + return CDP_DATA; + case CDP_DATA: + return CDP_CODE; + default: + case CDP_NONE: + return CDP_NONE; + } +} + +static int rdt_has_sparse_bitmasks_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource 
*r = s->res;
+
+	seq_printf(seq, "%u\n", r->cache.arch_has_sparse_bitmasks);
+
+	return 0;
+}
+
+/**
+ * __rdtgroup_cbm_overlaps - Does CBM for intended closid overlap with other
+ * @r: Resource to which domain instance @d belongs.
+ * @d: The domain instance for which @closid is being tested.
+ * @cbm: Capacity bitmask being tested.
+ * @closid: Intended closid for @cbm.
+ * @type: CDP type of @r.
+ * @exclusive: Only check if overlaps with exclusive resource groups
+ *
+ * Checks if the provided @cbm intended to be used for @closid on domain
+ * @d overlaps with any other closids or other hardware usage associated
+ * with this domain. If @exclusive is true then only overlaps with
+ * resource groups in exclusive mode will be considered. If @exclusive
+ * is false then overlaps with any resource group or hardware entities
+ * will be considered.
+ *
+ * @cbm is unsigned long, even if only 32 bits are used, to make the
+ * bitmap functions work correctly.
+ *
+ * Return: false if CBM does not overlap, true if it does.
+ */
+static bool __rdtgroup_cbm_overlaps(struct rdt_resource *r, struct rdt_ctrl_domain *d,
+				    unsigned long cbm, int closid,
+				    enum resctrl_conf_type type, bool exclusive)
+{
+	enum rdtgrp_mode mode;
+	unsigned long ctrl_b;
+	int i;
+
+	/* Check for any overlap with regions used by hardware directly */
+	if (!exclusive) {
+		ctrl_b = r->cache.shareable_bits;
+		if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len))
+			return true;
+	}
+
+	/* Check for overlap with other resource groups */
+	for (i = 0; i < closids_supported(); i++) {
+		ctrl_b = resctrl_arch_get_config(r, d, i, type);
+		mode = rdtgroup_mode_by_closid(i);
+		if (closid_allocated(i) && i != closid &&
+		    mode != RDT_MODE_PSEUDO_LOCKSETUP) {
+			if (bitmap_intersects(&cbm, &ctrl_b, r->cache.cbm_len)) {
+				if (exclusive) {
+					if (mode == RDT_MODE_EXCLUSIVE)
+						return true;
+					continue;
+				}
+				return true;
+			}
+		}
+	}
+
+	return false;
+}
+
+/**
+ * rdtgroup_cbm_overlaps - Does CBM overlap with other use of hardware
+ * @s: Schema for the resource to which domain instance @d belongs.
+ * @d: The domain instance for which @closid is being tested.
+ * @cbm: Capacity bitmask being tested.
+ * @closid: Intended closid for @cbm.
+ * @exclusive: Only check if overlaps with exclusive resource groups
+ *
+ * Resources that can be allocated using a CBM can use the CBM to control
+ * the overlap of these allocations. rdtgroup_cbm_overlaps() is the test
+ * for overlap. The overlap test is not limited to the specific resource
+ * for which the CBM is intended though - when dealing with CDP resources
+ * that share the underlying hardware the overlap check should also be
+ * performed on the peer CDP resource sharing the hardware.
+ *
+ * Refer to description of __rdtgroup_cbm_overlaps() for the details of the
+ * overlap test.
+ *
+ * Return: true if CBM overlap detected, false if there is no overlap
+ */
+bool rdtgroup_cbm_overlaps(struct resctrl_schema *s, struct rdt_ctrl_domain *d,
+			   unsigned long cbm, int closid, bool exclusive)
+{
+	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
+	struct rdt_resource *r = s->res;
+
+	if (__rdtgroup_cbm_overlaps(r, d, cbm, closid, s->conf_type,
+				    exclusive))
+		return true;
+
+	if (!resctrl_arch_get_cdp_enabled(r->rid))
+		return false;
+	return __rdtgroup_cbm_overlaps(r, d, cbm, closid, peer_type, exclusive);
+}
+
+/**
+ * rdtgroup_mode_test_exclusive - Test if this resource group can be exclusive
+ * @rdtgrp: Resource group identified through its closid.
+ * + * An exclusive resource group implies that there should be no sharing of + * its allocated resources. At the time this group is considered to be + * exclusive this test can determine if its current schemata supports this + * setting by testing for overlap with all other resource groups. + * + * Return: true if resource group can be exclusive, false if there is overlap + * with allocations of other resource groups and thus this resource group + * cannot be exclusive. + */ +static bool rdtgroup_mode_test_exclusive(struct rdtgroup *rdtgrp) +{ + int closid = rdtgrp->closid; + struct rdt_ctrl_domain *d; + struct resctrl_schema *s; + struct rdt_resource *r; + bool has_cache = false; + u32 ctrl; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_MBA || r->rid == RDT_RESOURCE_SMBA) + continue; + has_cache = true; + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + ctrl = resctrl_arch_get_config(r, d, closid, + s->conf_type); + if (rdtgroup_cbm_overlaps(s, d, ctrl, closid, false)) { + rdt_last_cmd_puts("Schemata overlaps\n"); + return false; + } + } + } + + if (!has_cache) { + rdt_last_cmd_puts("Cannot be exclusive without CAT/CDP\n"); + return false; + } + + return true; +} + +/* + * rdtgroup_mode_write - Modify the resource group's mode + */ +static ssize_t rdtgroup_mode_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct rdtgroup *rdtgrp; + enum rdtgrp_mode mode; + int ret = 0; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + buf[nbytes - 1] = '\0'; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + rdt_last_cmd_clear(); + + mode = rdtgrp->mode; + + if ((!strcmp(buf, "shareable") && mode == RDT_MODE_SHAREABLE) || + (!strcmp(buf, "exclusive") && mode == RDT_MODE_EXCLUSIVE) || + (!strcmp(buf, "pseudo-locksetup") && + mode == RDT_MODE_PSEUDO_LOCKSETUP) || + (!strcmp(buf, "pseudo-locked") && mode == RDT_MODE_PSEUDO_LOCKED)) + goto out; + + if (mode == RDT_MODE_PSEUDO_LOCKED) { + rdt_last_cmd_puts("Cannot change pseudo-locked group\n"); + ret = -EINVAL; + goto out; + } + + if (!strcmp(buf, "shareable")) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = RDT_MODE_SHAREABLE; + } else if (!strcmp(buf, "exclusive")) { + if (!rdtgroup_mode_test_exclusive(rdtgrp)) { + ret = -EINVAL; + goto out; + } + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + ret = rdtgroup_locksetup_exit(rdtgrp); + if (ret) + goto out; + } + rdtgrp->mode = RDT_MODE_EXCLUSIVE; + } else if (!strcmp(buf, "pseudo-locksetup")) { + ret = rdtgroup_locksetup_enter(rdtgrp); + if (ret) + goto out; + rdtgrp->mode = RDT_MODE_PSEUDO_LOCKSETUP; + } else { + rdt_last_cmd_puts("Unknown or unsupported mode\n"); + ret = -EINVAL; + } + +out: + rdtgroup_kn_unlock(of->kn); + return ret ?: nbytes; +} + +/** + * rdtgroup_cbm_to_size - Translate CBM to size in bytes + * @r: RDT resource to which @d belongs. + * @d: RDT domain instance. + * @cbm: bitmask for which the size should be computed. + * + * The bitmask provided associated with the RDT domain instance @d will be + * translated into how many bytes it represents. 
The size in bytes is + * computed by first dividing the total cache size by the CBM length to + * determine how many bytes each bit in the bitmask represents. The result + * is multiplied with the number of bits set in the bitmask. + * + * @cbm is unsigned long, even if only 32 bits are used to make the + * bitmap functions work correctly. + */ +unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r, + struct rdt_ctrl_domain *d, unsigned long cbm) +{ + unsigned int size = 0; + struct cacheinfo *ci; + int num_b; + + if (WARN_ON_ONCE(r->ctrl_scope != RESCTRL_L2_CACHE && r->ctrl_scope != RESCTRL_L3_CACHE)) + return size; + + num_b = bitmap_weight(&cbm, r->cache.cbm_len); + ci = get_cpu_cacheinfo_level(cpumask_any(&d->hdr.cpu_mask), r->ctrl_scope); + if (ci) + size = ci->size / r->cache.cbm_len * num_b; + + return size; +} + +/* + * rdtgroup_size_show - Display size in bytes of allocated regions + * + * The "size" file mirrors the layout of the "schemata" file, printing the + * size in bytes of each region instead of the capacity bitmask. + */ +static int rdtgroup_size_show(struct kernfs_open_file *of, + struct seq_file *s, void *v) +{ + struct resctrl_schema *schema; + enum resctrl_conf_type type; + struct rdt_ctrl_domain *d; + struct rdtgroup *rdtgrp; + struct rdt_resource *r; + unsigned int size; + int ret = 0; + u32 closid; + bool sep; + u32 ctrl; + + rdtgrp = rdtgroup_kn_lock_live(of->kn); + if (!rdtgrp) { + rdtgroup_kn_unlock(of->kn); + return -ENOENT; + } + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + if (!rdtgrp->plr->d) { + rdt_last_cmd_clear(); + rdt_last_cmd_puts("Cache domain offline\n"); + ret = -ENODEV; + } else { + seq_printf(s, "%*s:", max_name_width, + rdtgrp->plr->s->name); + size = rdtgroup_cbm_to_size(rdtgrp->plr->s->res, + rdtgrp->plr->d, + rdtgrp->plr->cbm); + seq_printf(s, "%d=%u\n", rdtgrp->plr->d->hdr.id, size); + } + goto out; + } + + closid = rdtgrp->closid; + + list_for_each_entry(schema, &resctrl_schema_all, list) { + r = schema->res; + type = schema->conf_type; + sep = false; + seq_printf(s, "%*s:", max_name_width, schema->name); + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + if (sep) + seq_putc(s, ';'); + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP) { + size = 0; + } else { + if (is_mba_sc(r)) + ctrl = d->mbps_val[closid]; + else + ctrl = resctrl_arch_get_config(r, d, + closid, + type); + if (r->rid == RDT_RESOURCE_MBA || + r->rid == RDT_RESOURCE_SMBA) + size = ctrl; + else + size = rdtgroup_cbm_to_size(r, d, ctrl); + } + seq_printf(s, "%d=%u", d->hdr.id, size); + sep = true; + } + seq_putc(s, '\n'); + } + +out: + rdtgroup_kn_unlock(of->kn); + + return ret; +} + +struct mon_config_info { + u32 evtid; + u32 mon_config; +}; + +#define INVALID_CONFIG_INDEX UINT_MAX + +/** + * mon_event_config_index_get - get the hardware index for the + * configurable event + * @evtid: event id. 
+ * + * Return: 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID + * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID + * INVALID_CONFIG_INDEX for invalid evtid + */ +static inline unsigned int mon_event_config_index_get(u32 evtid) +{ + switch (evtid) { + case QOS_L3_MBM_TOTAL_EVENT_ID: + return 0; + case QOS_L3_MBM_LOCAL_EVENT_ID: + return 1; + default: + /* Should never reach here */ + return INVALID_CONFIG_INDEX; + } +} + +static void mon_event_config_read(void *info) +{ + struct mon_config_info *mon_info = info; + unsigned int index; + u64 msrval; + + index = mon_event_config_index_get(mon_info->evtid); + if (index == INVALID_CONFIG_INDEX) { + pr_warn_once("Invalid event id %d\n", mon_info->evtid); + return; + } + rdmsrl(MSR_IA32_EVT_CFG_BASE + index, msrval); + + /* Report only the valid event configuration bits */ + mon_info->mon_config = msrval & MAX_EVT_CONFIG_BITS; +} + +static void mondata_config_read(struct rdt_mon_domain *d, struct mon_config_info *mon_info) +{ + smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_read, mon_info, 1); +} + +static int mbm_config_show(struct seq_file *s, struct rdt_resource *r, u32 evtid) +{ + struct mon_config_info mon_info = {0}; + struct rdt_mon_domain *dom; + bool sep = false; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + list_for_each_entry(dom, &r->mon_domains, hdr.list) { + if (sep) + seq_puts(s, ";"); + + memset(&mon_info, 0, sizeof(struct mon_config_info)); + mon_info.evtid = evtid; + mondata_config_read(dom, &mon_info); + + seq_printf(s, "%d=0x%02x", dom->hdr.id, mon_info.mon_config); + sep = true; + } + seq_puts(s, "\n"); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return 0; +} + +static int mbm_total_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_TOTAL_EVENT_ID); + + return 0; +} + +static int mbm_local_bytes_config_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + + mbm_config_show(seq, r, QOS_L3_MBM_LOCAL_EVENT_ID); + + return 0; +} + +static void mon_event_config_write(void *info) +{ + struct mon_config_info *mon_info = info; + unsigned int index; + + index = mon_event_config_index_get(mon_info->evtid); + if (index == INVALID_CONFIG_INDEX) { + pr_warn_once("Invalid event id %d\n", mon_info->evtid); + return; + } + wrmsr(MSR_IA32_EVT_CFG_BASE + index, mon_info->mon_config, 0); +} + +static void mbm_config_write_domain(struct rdt_resource *r, + struct rdt_mon_domain *d, u32 evtid, u32 val) +{ + struct mon_config_info mon_info = {0}; + + /* + * Read the current config value first. If both are the same then + * no need to write it again. + */ + mon_info.evtid = evtid; + mondata_config_read(d, &mon_info); + if (mon_info.mon_config == val) + return; + + mon_info.mon_config = val; + + /* + * Update MSR_IA32_EVT_CFG_BASE MSR on one of the CPUs in the + * domain. The MSRs offset from MSR MSR_IA32_EVT_CFG_BASE + * are scoped at the domain level. Writing any of these MSRs + * on one CPU is observed by all the CPUs in the domain. + */ + smp_call_function_any(&d->hdr.cpu_mask, mon_event_config_write, + &mon_info, 1); + + /* + * When an Event Configuration is changed, the bandwidth counters + * for all RMIDs and Events will be cleared by the hardware. The + * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for + * every RMID on the next read to any event for every RMID. 
+ * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62) + * cleared while it is tracked by the hardware. Clear the + * mbm_local and mbm_total counts for all the RMIDs. + */ + resctrl_arch_reset_rmid_all(r, d); +} + +static int mon_config_write(struct rdt_resource *r, char *tok, u32 evtid) +{ + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + char *dom_str = NULL, *id_str; + unsigned long dom_id, val; + struct rdt_mon_domain *d; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + +next: + if (!tok || tok[0] == '\0') + return 0; + + /* Start processing the strings for each domain */ + dom_str = strim(strsep(&tok, ";")); + id_str = strsep(&dom_str, "="); + + if (!id_str || kstrtoul(id_str, 10, &dom_id)) { + rdt_last_cmd_puts("Missing '=' or non-numeric domain id\n"); + return -EINVAL; + } + + if (!dom_str || kstrtoul(dom_str, 16, &val)) { + rdt_last_cmd_puts("Non-numeric event configuration value\n"); + return -EINVAL; + } + + /* Value from user cannot be more than the supported set of events */ + if ((val & hw_res->mbm_cfg_mask) != val) { + rdt_last_cmd_printf("Invalid event configuration: max valid mask is 0x%02x\n", + hw_res->mbm_cfg_mask); + return -EINVAL; + } + + list_for_each_entry(d, &r->mon_domains, hdr.list) { + if (d->hdr.id == dom_id) { + mbm_config_write_domain(r, d, evtid, val); + goto next; + } + } + + return -EINVAL; +} + +static ssize_t mbm_total_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret = mon_config_write(r, buf, QOS_L3_MBM_TOTAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return ret ?: nbytes; +} + +static ssize_t mbm_local_bytes_config_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + struct rdt_resource *r = of->kn->parent->priv; + int ret; + + /* Valid input requires a trailing newline */ + if (nbytes == 0 || buf[nbytes - 1] != '\n') + return -EINVAL; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + buf[nbytes - 1] = '\0'; + + ret = mon_config_write(r, buf, QOS_L3_MBM_LOCAL_EVENT_ID); + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + return ret ?: nbytes; +} + +/* rdtgroup information files for one cache resource. 
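+ * Which of these files appear in a given directory is decided by
+ * rdtgroup_add_files() below, which only instantiates the entries whose
+ * fflags match the flags passed by the caller.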
*/ +static struct rftype res_common_files[] = { + { + .name = "last_cmd_status", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_last_cmd_status_show, + .fflags = RFTYPE_TOP_INFO, + }, + { + .name = "num_closids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_closids_show, + .fflags = RFTYPE_CTRL_INFO, + }, + { + .name = "mon_features", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_features_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "num_rmids", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_num_rmids_show, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "cbm_mask", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_default_ctrl_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_cbm_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_cbm_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "shareable_bits", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_shareable_bits_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "bit_usage", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bit_usage_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "min_bandwidth", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_min_bw_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "bandwidth_gran", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_bw_gran_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + { + .name = "delay_linear", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_delay_linear_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB, + }, + /* + * Platform specific which (if any) capabilities are provided by + * thread_throttle_mode. Defer "fflags" initialization to platform + * discovery. 
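+ * (thread_throttle_mode_init() and mbm_config_rftype_init() below fill
+ * in the fflags of such entries once the platform has been probed.)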
+ */ + { + .name = "thread_throttle_mode", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_thread_throttle_mode_show, + }, + { + .name = "max_threshold_occupancy", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = max_threshold_occ_write, + .seq_show = max_threshold_occ_show, + .fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "mbm_total_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_total_bytes_config_show, + .write = mbm_total_bytes_config_write, + }, + { + .name = "mbm_local_bytes_config", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = mbm_local_bytes_config_show, + .write = mbm_local_bytes_config_write, + }, + { + .name = "cpus", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "cpus_list", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_cpus_write, + .seq_show = rdtgroup_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_BASE, + }, + { + .name = "tasks", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_tasks_write, + .seq_show = rdtgroup_tasks_show, + .fflags = RFTYPE_BASE, + }, + { + .name = "mon_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_rmid_show, + .fflags = RFTYPE_MON_BASE | RFTYPE_DEBUG, + }, + { + .name = "schemata", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_schemata_write, + .seq_show = rdtgroup_schemata_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "mode", + .mode = 0644, + .kf_ops = &rdtgroup_kf_single_ops, + .write = rdtgroup_mode_write, + .seq_show = rdtgroup_mode_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "size", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_size_show, + .fflags = RFTYPE_CTRL_BASE, + }, + { + .name = "sparse_masks", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_has_sparse_bitmasks_show, + .fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_CACHE, + }, + { + .name = "ctrl_hw_id", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdtgroup_closid_show, + .fflags = RFTYPE_CTRL_BASE | RFTYPE_DEBUG, + }, + +}; + +static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags) +{ + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + lockdep_assert_held(&rdtgroup_mutex); + + if (resctrl_debug) + fflags |= RFTYPE_DEBUG; + + for (rft = rfts; rft < rfts + len; rft++) { + if (rft->fflags && ((fflags & rft->fflags) == rft->fflags)) { + ret = rdtgroup_add_file(kn, rft); + if (ret) + goto error; + } + } + + return 0; +error: + pr_warn("Failed to add %s, err=%d\n", rft->name, ret); + while (--rft >= rfts) { + if ((fflags & rft->fflags) == rft->fflags) + kernfs_remove_by_name(kn, rft->name); + } + return ret; +} + +static struct rftype *rdtgroup_get_rftype_by_name(const char *name) +{ + struct rftype *rfts, *rft; + int len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + return rft; + } + + return NULL; +} + +void __init thread_throttle_mode_init(void) +{ + struct rftype *rft; + + rft = rdtgroup_get_rftype_by_name("thread_throttle_mode"); + if (!rft) + return; + + rft->fflags = RFTYPE_CTRL_INFO | RFTYPE_RES_MB; +} + +void __init mbm_config_rftype_init(const char 
*config) +{ + struct rftype *rft; + + rft = rdtgroup_get_rftype_by_name(config); + if (rft) + rft->fflags = RFTYPE_MON_INFO | RFTYPE_RES_CACHE; +} + +/** + * rdtgroup_kn_mode_restrict - Restrict user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * + * The permissions of named resctrl file, directory, or link are modified + * to not allow read, write, or execute by any user. + * + * WARNING: This function is intended to communicate to the user that the + * resctrl file has been locked down - that it is not relevant to the + * particular state the system finds itself in. It should not be relied + * on to protect from user access because after the file's permissions + * are restricted the user can still change the permissions using chmod + * from the command line. + * + * Return: 0 on success, <0 on failure. + */ +int rdtgroup_kn_mode_restrict(struct rdtgroup *r, const char *name) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn; + int ret = 0; + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + iattr.ia_mode = S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode = S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode = S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +/** + * rdtgroup_kn_mode_restore - Restore user access to named resctrl file + * @r: The resource group with which the file is associated. + * @name: Name of the file + * @mask: Mask of permissions that should be restored + * + * Restore the permissions of the named file. If @name is a directory the + * permissions of its parent will be used. + * + * Return: 0 on success, <0 on failure. 
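+ *
+ * For example, a caller that earlier used rdtgroup_kn_mode_restrict() on
+ * the "tasks" file can undo it with rdtgroup_kn_mode_restore(rdtgrp,
+ * "tasks", 0777), re-applying the mode recorded in res_common_files
+ * masked with 0777.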
+ */ +int rdtgroup_kn_mode_restore(struct rdtgroup *r, const char *name, + umode_t mask) +{ + struct iattr iattr = {.ia_valid = ATTR_MODE,}; + struct kernfs_node *kn, *parent; + struct rftype *rfts, *rft; + int ret, len; + + rfts = res_common_files; + len = ARRAY_SIZE(res_common_files); + + for (rft = rfts; rft < rfts + len; rft++) { + if (!strcmp(rft->name, name)) + iattr.ia_mode = rft->mode & mask; + } + + kn = kernfs_find_and_get_ns(r->kn, name, NULL); + if (!kn) + return -ENOENT; + + switch (kernfs_type(kn)) { + case KERNFS_DIR: + parent = kernfs_get_parent(kn); + if (parent) { + iattr.ia_mode |= parent->mode; + kernfs_put(parent); + } + iattr.ia_mode |= S_IFDIR; + break; + case KERNFS_FILE: + iattr.ia_mode |= S_IFREG; + break; + case KERNFS_LINK: + iattr.ia_mode |= S_IFLNK; + break; + } + + ret = kernfs_setattr(kn, &iattr); + kernfs_put(kn); + return ret; +} + +static int rdtgroup_mkdir_info_resdir(void *priv, char *name, + unsigned long fflags) +{ + struct kernfs_node *kn_subdir; + int ret; + + kn_subdir = kernfs_create_dir(kn_info, name, + kn_info->mode, priv); + if (IS_ERR(kn_subdir)) + return PTR_ERR(kn_subdir); + + ret = rdtgroup_kn_set_ugid(kn_subdir); + if (ret) + return ret; + + ret = rdtgroup_add_files(kn_subdir, fflags); + if (!ret) + kernfs_activate(kn_subdir); + + return ret; +} + +static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) +{ + struct resctrl_schema *s; + struct rdt_resource *r; + unsigned long fflags; + char name[32]; + int ret; + + /* create the directory */ + kn_info = kernfs_create_dir(parent_kn, "info", parent_kn->mode, NULL); + if (IS_ERR(kn_info)) + return PTR_ERR(kn_info); + + ret = rdtgroup_add_files(kn_info, RFTYPE_TOP_INFO); + if (ret) + goto out_destroy; + + /* loop over enabled controls, these are all alloc_capable */ + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + fflags = r->fflags | RFTYPE_CTRL_INFO; + ret = rdtgroup_mkdir_info_resdir(s, s->name, fflags); + if (ret) + goto out_destroy; + } + + for_each_mon_capable_rdt_resource(r) { + fflags = r->fflags | RFTYPE_MON_INFO; + sprintf(name, "%s_MON", r->name); + ret = rdtgroup_mkdir_info_resdir(r, name, fflags); + if (ret) + goto out_destroy; + } + + ret = rdtgroup_kn_set_ugid(kn_info); + if (ret) + goto out_destroy; + + kernfs_activate(kn_info); + + return 0; + +out_destroy: + kernfs_remove(kn_info); + return ret; +} + +static int +mongroup_create_dir(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, + char *name, struct kernfs_node **dest_kn) +{ + struct kernfs_node *kn; + int ret; + + /* create the directory */ + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + if (dest_kn) + *dest_kn = kn; + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + + kernfs_activate(kn); + + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +static void l3_qos_cfg_update(void *arg) +{ + bool *enable = arg; + + wrmsrl(MSR_IA32_L3_QOS_CFG, *enable ? L3_QOS_CDP_ENABLE : 0ULL); +} + +static void l2_qos_cfg_update(void *arg) +{ + bool *enable = arg; + + wrmsrl(MSR_IA32_L2_QOS_CFG, *enable ? 
L2_QOS_CDP_ENABLE : 0ULL); +} + +static inline bool is_mba_linear(void) +{ + return rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl.membw.delay_linear; +} + +static int set_cache_qos_cfg(int level, bool enable) +{ + void (*update)(void *arg); + struct rdt_ctrl_domain *d; + struct rdt_resource *r_l; + cpumask_var_t cpu_mask; + int cpu; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + if (level == RDT_RESOURCE_L3) + update = l3_qos_cfg_update; + else if (level == RDT_RESOURCE_L2) + update = l2_qos_cfg_update; + else + return -EINVAL; + + if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) + return -ENOMEM; + + r_l = &rdt_resources_all[level].r_resctrl; + list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) { + if (r_l->cache.arch_has_per_cpu_cfg) + /* Pick all the CPUs in the domain instance */ + for_each_cpu(cpu, &d->hdr.cpu_mask) + cpumask_set_cpu(cpu, cpu_mask); + else + /* Pick one CPU from each domain instance to update MSR */ + cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask); + } + + /* Update QOS_CFG MSR on all the CPUs in cpu_mask */ + on_each_cpu_mask(cpu_mask, update, &enable, 1); + + free_cpumask_var(cpu_mask); + + return 0; +} + +/* Restore the qos cfg state when a domain comes online */ +void rdt_domain_reconfigure_cdp(struct rdt_resource *r) +{ + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + + if (!r->cdp_capable) + return; + + if (r->rid == RDT_RESOURCE_L2) + l2_qos_cfg_update(&hw_res->cdp_enabled); + + if (r->rid == RDT_RESOURCE_L3) + l3_qos_cfg_update(&hw_res->cdp_enabled); +} + +static int mba_sc_domain_allocate(struct rdt_resource *r, struct rdt_ctrl_domain *d) +{ + u32 num_closid = resctrl_arch_get_num_closid(r); + int cpu = cpumask_any(&d->hdr.cpu_mask); + int i; + + d->mbps_val = kcalloc_node(num_closid, sizeof(*d->mbps_val), + GFP_KERNEL, cpu_to_node(cpu)); + if (!d->mbps_val) + return -ENOMEM; + + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + + return 0; +} + +static void mba_sc_domain_destroy(struct rdt_resource *r, + struct rdt_ctrl_domain *d) +{ + kfree(d->mbps_val); + d->mbps_val = NULL; +} + +/* + * MBA software controller is supported only if + * MBM is supported and MBA is in linear scale, + * and the MBM monitor scope is the same as MBA + * control scope. + */ +static bool supports_mba_mbps(void) +{ + struct rdt_resource *rmbm = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + + return (is_mbm_local_enabled() && + r->alloc_capable && is_mba_linear() && + r->ctrl_scope == rmbm->mon_scope); +} + +/* + * Enable or disable the MBA software controller + * which helps user specify bandwidth in MBps. 
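+ * (selected at mount time via the "mba_MBps" option, see
+ * rdt_parse_param() below).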
+ */ +static int set_mba_sc(bool mba_sc) +{ + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl; + u32 num_closid = resctrl_arch_get_num_closid(r); + struct rdt_ctrl_domain *d; + int i; + + if (!supports_mba_mbps() || mba_sc == is_mba_sc(r)) + return -EINVAL; + + r->membw.mba_sc = mba_sc; + + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + for (i = 0; i < num_closid; i++) + d->mbps_val[i] = MBA_MAX_MBPS; + } + + return 0; +} + +static int cdp_enable(int level) +{ + struct rdt_resource *r_l = &rdt_resources_all[level].r_resctrl; + int ret; + + if (!r_l->alloc_capable) + return -EINVAL; + + ret = set_cache_qos_cfg(level, true); + if (!ret) + rdt_resources_all[level].cdp_enabled = true; + + return ret; +} + +static void cdp_disable(int level) +{ + struct rdt_hw_resource *r_hw = &rdt_resources_all[level]; + + if (r_hw->cdp_enabled) { + set_cache_qos_cfg(level, false); + r_hw->cdp_enabled = false; + } +} + +int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable) +{ + struct rdt_hw_resource *hw_res = &rdt_resources_all[l]; + + if (!hw_res->r_resctrl.cdp_capable) + return -EINVAL; + + if (enable) + return cdp_enable(l); + + cdp_disable(l); + + return 0; +} + +/* + * We don't allow rdtgroup directories to be created anywhere + * except the root directory. Thus when looking for the rdtgroup + * structure for a kernfs node we are either looking at a directory, + * in which case the rdtgroup structure is pointed at by the "priv" + * field, otherwise we have a file, and need only look to the parent + * to find the rdtgroup. + */ +static struct rdtgroup *kernfs_to_rdtgroup(struct kernfs_node *kn) +{ + if (kernfs_type(kn) == KERNFS_DIR) { + /* + * All the resource directories use "kn->priv" + * to point to the "struct rdtgroup" for the + * resource. "info" and its subdirectories don't + * have rdtgroup structures, so return NULL here. + */ + if (kn == kn_info || kn->parent == kn_info) + return NULL; + else + return kn->priv; + } else { + return kn->parent->priv; + } +} + +static void rdtgroup_kn_get(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + atomic_inc(&rdtgrp->waitcount); + kernfs_break_active_protection(kn); +} + +static void rdtgroup_kn_put(struct rdtgroup *rdtgrp, struct kernfs_node *kn) +{ + if (atomic_dec_and_test(&rdtgrp->waitcount) && + (rdtgrp->flags & RDT_DELETED)) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + kernfs_unbreak_active_protection(kn); + rdtgroup_remove(rdtgrp); + } else { + kernfs_unbreak_active_protection(kn); + } +} + +struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return NULL; + + rdtgroup_kn_get(rdtgrp, kn); + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + /* Was this group deleted while we waited? 
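+ * rdtgroup_kn_get() above took a waitcount reference, so returning NULL
+ * is safe: the caller is still expected to call rdtgroup_kn_unlock(),
+ * which drops the reference and frees a deleted group.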
*/ + if (rdtgrp->flags & RDT_DELETED) + return NULL; + + return rdtgrp; +} + +void rdtgroup_kn_unlock(struct kernfs_node *kn) +{ + struct rdtgroup *rdtgrp = kernfs_to_rdtgroup(kn); + + if (!rdtgrp) + return; + + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + + rdtgroup_kn_put(rdtgrp, kn); +} + +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **mon_data_kn); + +static void rdt_disable_ctx(void) +{ + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); + set_mba_sc(false); + + resctrl_debug = false; +} + +static int rdt_enable_ctx(struct rdt_fs_context *ctx) +{ + int ret = 0; + + if (ctx->enable_cdpl2) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, true); + if (ret) + goto out_done; + } + + if (ctx->enable_cdpl3) { + ret = resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, true); + if (ret) + goto out_cdpl2; + } + + if (ctx->enable_mba_mbps) { + ret = set_mba_sc(true); + if (ret) + goto out_cdpl3; + } + + if (ctx->enable_debug) + resctrl_debug = true; + + return 0; + +out_cdpl3: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false); +out_cdpl2: + resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false); +out_done: + return ret; +} + +static int schemata_list_add(struct rdt_resource *r, enum resctrl_conf_type type) +{ + struct resctrl_schema *s; + const char *suffix = ""; + int ret, cl; + + s = kzalloc(sizeof(*s), GFP_KERNEL); + if (!s) + return -ENOMEM; + + s->res = r; + s->num_closid = resctrl_arch_get_num_closid(r); + if (resctrl_arch_get_cdp_enabled(r->rid)) + s->num_closid /= 2; + + s->conf_type = type; + switch (type) { + case CDP_CODE: + suffix = "CODE"; + break; + case CDP_DATA: + suffix = "DATA"; + break; + case CDP_NONE: + suffix = ""; + break; + } + + ret = snprintf(s->name, sizeof(s->name), "%s%s", r->name, suffix); + if (ret >= sizeof(s->name)) { + kfree(s); + return -EINVAL; + } + + cl = strlen(s->name); + + /* + * If CDP is supported by this resource, but not enabled, + * include the suffix. This ensures the tabular format of the + * schemata file does not change between mounts of the filesystem. + */ + if (r->cdp_capable && !resctrl_arch_get_cdp_enabled(r->rid)) + cl += 4; + + if (cl > max_name_width) + max_name_width = cl; + + INIT_LIST_HEAD(&s->list); + list_add(&s->list, &resctrl_schema_all); + + return 0; +} + +static int schemata_list_create(void) +{ + struct rdt_resource *r; + int ret = 0; + + for_each_alloc_capable_rdt_resource(r) { + if (resctrl_arch_get_cdp_enabled(r->rid)) { + ret = schemata_list_add(r, CDP_CODE); + if (ret) + break; + + ret = schemata_list_add(r, CDP_DATA); + } else { + ret = schemata_list_add(r, CDP_NONE); + } + + if (ret) + break; + } + + return ret; +} + +static void schemata_list_destroy(void) +{ + struct resctrl_schema *s, *tmp; + + list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) { + list_del(&s->list); + kfree(s); + } +} + +static int rdt_get_tree(struct fs_context *fc) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + unsigned long flags = RFTYPE_CTRL_BASE; + struct rdt_mon_domain *dom; + struct rdt_resource *r; + int ret; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + /* + * resctrl file system can only be mounted once. 
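+ * A typical first mount is "mount -t resctrl resctrl /sys/fs/resctrl";
+ * any further attempt fails with -EBUSY below.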
+ */ + if (resctrl_mounted) { + ret = -EBUSY; + goto out; + } + + ret = rdtgroup_setup_root(ctx); + if (ret) + goto out; + + ret = rdt_enable_ctx(ctx); + if (ret) + goto out_root; + + ret = schemata_list_create(); + if (ret) { + schemata_list_destroy(); + goto out_ctx; + } + + closid_init(); + + if (resctrl_arch_mon_capable()) + flags |= RFTYPE_MON; + + ret = rdtgroup_add_files(rdtgroup_default.kn, flags); + if (ret) + goto out_schemata_free; + + kernfs_activate(rdtgroup_default.kn); + + ret = rdtgroup_create_info_dir(rdtgroup_default.kn); + if (ret < 0) + goto out_schemata_free; + + if (resctrl_arch_mon_capable()) { + ret = mongroup_create_dir(rdtgroup_default.kn, + &rdtgroup_default, "mon_groups", + &kn_mongrp); + if (ret < 0) + goto out_info; + + ret = mkdir_mondata_all(rdtgroup_default.kn, + &rdtgroup_default, &kn_mondata); + if (ret < 0) + goto out_mongrp; + rdtgroup_default.mon.mon_data_kn = kn_mondata; + } + + ret = rdt_pseudo_lock_init(); + if (ret) + goto out_mondata; + + ret = kernfs_get_tree(fc); + if (ret < 0) + goto out_psl; + + if (resctrl_arch_alloc_capable()) + resctrl_arch_enable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_enable_mon(); + + if (resctrl_arch_alloc_capable() || resctrl_arch_mon_capable()) + resctrl_mounted = true; + + if (is_mbm_enabled()) { + r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl; + list_for_each_entry(dom, &r->mon_domains, hdr.list) + mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL, + RESCTRL_PICK_ANY_CPU); + } + + goto out; + +out_psl: + rdt_pseudo_lock_release(); +out_mondata: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mondata); +out_mongrp: + if (resctrl_arch_mon_capable()) + kernfs_remove(kn_mongrp); +out_info: + kernfs_remove(kn_info); +out_schemata_free: + schemata_list_destroy(); +out_ctx: + rdt_disable_ctx(); +out_root: + rdtgroup_destroy_root(); +out: + rdt_last_cmd_clear(); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); + return ret; +} + +enum rdt_param { + Opt_cdp, + Opt_cdpl2, + Opt_mba_mbps, + Opt_debug, + nr__rdt_params +}; + +static const struct fs_parameter_spec rdt_fs_parameters[] = { + fsparam_flag("cdp", Opt_cdp), + fsparam_flag("cdpl2", Opt_cdpl2), + fsparam_flag("mba_MBps", Opt_mba_mbps), + fsparam_flag("debug", Opt_debug), + {} +}; + +static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + struct fs_parse_result result; + const char *msg; + int opt; + + opt = fs_parse(fc, rdt_fs_parameters, param, &result); + if (opt < 0) + return opt; + + switch (opt) { + case Opt_cdp: + ctx->enable_cdpl3 = true; + return 0; + case Opt_cdpl2: + ctx->enable_cdpl2 = true; + return 0; + case Opt_mba_mbps: + msg = "mba_MBps requires local MBM and linear scale MBA at L3 scope"; + if (!supports_mba_mbps()) + return invalfc(fc, msg); + ctx->enable_mba_mbps = true; + return 0; + case Opt_debug: + ctx->enable_debug = true; + return 0; + } + + return -EINVAL; +} + +static void rdt_fs_context_free(struct fs_context *fc) +{ + struct rdt_fs_context *ctx = rdt_fc2context(fc); + + kernfs_free_fs_context(fc); + kfree(ctx); +} + +static const struct fs_context_operations rdt_fs_context_ops = { + .free = rdt_fs_context_free, + .parse_param = rdt_parse_param, + .get_tree = rdt_get_tree, +}; + +static int rdt_init_fs_context(struct fs_context *fc) +{ + struct rdt_fs_context *ctx; + + ctx = kzalloc(sizeof(struct rdt_fs_context), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + ctx->kfc.magic = RDTGROUP_SUPER_MAGIC; + fc->fs_private = &ctx->kfc; + 
fc->ops = &rdt_fs_context_ops; + put_user_ns(fc->user_ns); + fc->user_ns = get_user_ns(&init_user_ns); + fc->global = true; + return 0; +} + +static int reset_all_ctrls(struct rdt_resource *r) +{ + struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r); + struct rdt_hw_ctrl_domain *hw_dom; + struct msr_param msr_param; + struct rdt_ctrl_domain *d; + int i; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + msr_param.res = r; + msr_param.low = 0; + msr_param.high = hw_res->num_closid; + + /* + * Disable resource control for this resource by setting all + * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU + * from each domain to update the MSRs below. + */ + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + hw_dom = resctrl_to_arch_ctrl_dom(d); + + for (i = 0; i < hw_res->num_closid; i++) + hw_dom->ctrl_val[i] = r->default_ctrl; + msr_param.dom = d; + smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1); + } + + return 0; +} + +/* + * Move tasks from one to the other group. If @from is NULL, then all tasks + * in the systems are moved unconditionally (used for teardown). + * + * If @mask is not NULL the cpus on which moved tasks are running are set + * in that mask so the update smp function call is restricted to affected + * cpus. + */ +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to, + struct cpumask *mask) +{ + struct task_struct *p, *t; + + read_lock(&tasklist_lock); + for_each_process_thread(p, t) { + if (!from || is_closid_match(t, from) || + is_rmid_match(t, from)) { + resctrl_arch_set_closid_rmid(t, to->closid, + to->mon.rmid); + + /* + * Order the closid/rmid stores above before the loads + * in task_curr(). This pairs with the full barrier + * between the rq->curr update and resctrl_sched_in() + * during context switch. + */ + smp_mb(); + + /* + * If the task is on a CPU, set the CPU in the mask. + * The detection is inaccurate as tasks might move or + * schedule before the smp function call takes place. + * In such a case the function call is pointless, but + * there is no other side effect. + */ + if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t)) + cpumask_set_cpu(task_cpu(t), mask); + } + } + read_unlock(&tasklist_lock); +} + +static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp) +{ + struct rdtgroup *sentry, *stmp; + struct list_head *head; + + head = &rdtgrp->mon.crdtgrp_list; + list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) { + free_rmid(sentry->closid, sentry->mon.rmid); + list_del(&sentry->mon.crdtgrp_list); + + if (atomic_read(&sentry->waitcount) != 0) + sentry->flags = RDT_DELETED; + else + rdtgroup_remove(sentry); + } +} + +/* + * Forcibly remove all of subdirectories under root. + */ +static void rmdir_all_sub(void) +{ + struct rdtgroup *rdtgrp, *tmp; + + /* Move all tasks to the default resource group */ + rdt_move_group_tasks(NULL, &rdtgroup_default, NULL); + + list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) { + /* Free any child rmids */ + free_all_child_rdtgrp(rdtgrp); + + /* Remove each rdtgroup other than root */ + if (rdtgrp == &rdtgroup_default) + continue; + + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) + rdtgroup_pseudo_lock_remove(rdtgrp); + + /* + * Give any CPUs back to the default group. We cannot copy + * cpu_online_mask because a CPU might have executed the + * offline callback already, but is still marked online. 
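+ * Instead, each group's cpu_mask is folded back into the default group
+ * and update_closid_rmid(cpu_online_mask, ...) below refreshes every CPU
+ * that is still online.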
+ */ + cpumask_or(&rdtgroup_default.cpu_mask, + &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + + kernfs_remove(rdtgrp->kn); + list_del(&rdtgrp->rdtgroup_list); + + if (atomic_read(&rdtgrp->waitcount) != 0) + rdtgrp->flags = RDT_DELETED; + else + rdtgroup_remove(rdtgrp); + } + /* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */ + update_closid_rmid(cpu_online_mask, &rdtgroup_default); + + kernfs_remove(kn_info); + kernfs_remove(kn_mongrp); + kernfs_remove(kn_mondata); +} + +static void rdt_kill_sb(struct super_block *sb) +{ + struct rdt_resource *r; + + cpus_read_lock(); + mutex_lock(&rdtgroup_mutex); + + rdt_disable_ctx(); + + /*Put everything back to default values. */ + for_each_alloc_capable_rdt_resource(r) + reset_all_ctrls(r); + rmdir_all_sub(); + rdt_pseudo_lock_release(); + rdtgroup_default.mode = RDT_MODE_SHAREABLE; + schemata_list_destroy(); + rdtgroup_destroy_root(); + if (resctrl_arch_alloc_capable()) + resctrl_arch_disable_alloc(); + if (resctrl_arch_mon_capable()) + resctrl_arch_disable_mon(); + resctrl_mounted = false; + kernfs_kill_sb(sb); + mutex_unlock(&rdtgroup_mutex); + cpus_read_unlock(); +} + +static struct file_system_type rdt_fs_type = { + .name = "resctrl", + .init_fs_context = rdt_init_fs_context, + .parameters = rdt_fs_parameters, + .kill_sb = rdt_kill_sb, +}; + +static int mon_addfile(struct kernfs_node *parent_kn, const char *name, + void *priv) +{ + struct kernfs_node *kn; + int ret = 0; + + kn = __kernfs_create_file(parent_kn, name, 0444, + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, + &kf_mondata_ops, priv, NULL, NULL); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + kernfs_remove(kn); + return ret; + } + + return ret; +} + +static void mon_rmdir_one_subdir(struct kernfs_node *pkn, char *name, char *subname) +{ + struct kernfs_node *kn; + + kn = kernfs_find_and_get(pkn, name); + if (!kn) + return; + kernfs_put(kn); + + if (kn->dir.subdirs <= 1) + kernfs_remove(kn); + else + kernfs_remove_by_name(kn, subname); +} + +/* + * Remove all subdirectories of mon_data of ctrl_mon groups + * and monitor groups for the given domain. + * Remove files and directories containing "sum" of domain data + * when last domain being summed is removed. + */ +static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + struct rdt_mon_domain *d) +{ + struct rdtgroup *prgrp, *crgrp; + char subname[32]; + bool snc_mode; + char name[32]; + + snc_mode = r->mon_scope == RESCTRL_L3_NODE; + sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id); + if (snc_mode) + sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id); + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + mon_rmdir_one_subdir(prgrp->mon.mon_data_kn, name, subname); + + list_for_each_entry(crgrp, &prgrp->mon.crdtgrp_list, mon.crdtgrp_list) + mon_rmdir_one_subdir(crgrp->mon.mon_data_kn, name, subname); + } +} + +static int mon_add_all_files(struct kernfs_node *kn, struct rdt_mon_domain *d, + struct rdt_resource *r, struct rdtgroup *prgrp, + bool do_sum) +{ + struct rmid_read rr = {0}; + union mon_data_bits priv; + struct mon_evt *mevt; + int ret; + + if (WARN_ON(list_empty(&r->evt_list))) + return -EPERM; + + priv.u.rid = r->rid; + priv.u.domid = do_sum ? 
d->ci->id : d->hdr.id; + priv.u.sum = do_sum; + list_for_each_entry(mevt, &r->evt_list, list) { + priv.u.evtid = mevt->evtid; + ret = mon_addfile(kn, mevt->name, priv.priv); + if (ret) + return ret; + + if (!do_sum && is_mbm_event(mevt->evtid)) + mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true); + } + + return 0; +} + +static int mkdir_mondata_subdir(struct kernfs_node *parent_kn, + struct rdt_mon_domain *d, + struct rdt_resource *r, struct rdtgroup *prgrp) +{ + struct kernfs_node *kn, *ckn; + char name[32]; + bool snc_mode; + int ret = 0; + + lockdep_assert_held(&rdtgroup_mutex); + + snc_mode = r->mon_scope == RESCTRL_L3_NODE; + sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id); + kn = kernfs_find_and_get(parent_kn, name); + if (kn) { + /* + * rdtgroup_mutex will prevent this directory from being + * removed. No need to keep this hold. + */ + kernfs_put(kn); + } else { + kn = kernfs_create_dir(parent_kn, name, parent_kn->mode, prgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) + goto out_destroy; + ret = mon_add_all_files(kn, d, r, prgrp, snc_mode); + if (ret) + goto out_destroy; + } + + if (snc_mode) { + sprintf(name, "mon_sub_%s_%02d", r->name, d->hdr.id); + ckn = kernfs_create_dir(kn, name, parent_kn->mode, prgrp); + if (IS_ERR(ckn)) { + ret = -EINVAL; + goto out_destroy; + } + + ret = rdtgroup_kn_set_ugid(ckn); + if (ret) + goto out_destroy; + + ret = mon_add_all_files(ckn, d, r, prgrp, false); + if (ret) + goto out_destroy; + } + + kernfs_activate(kn); + return 0; + +out_destroy: + kernfs_remove(kn); + return ret; +} + +/* + * Add all subdirectories of mon_data for "ctrl_mon" groups + * and "monitor" groups with given domain id. + */ +static void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r, + struct rdt_mon_domain *d) +{ + struct kernfs_node *parent_kn; + struct rdtgroup *prgrp, *crgrp; + struct list_head *head; + + list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) { + parent_kn = prgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, prgrp); + + head = &prgrp->mon.crdtgrp_list; + list_for_each_entry(crgrp, head, mon.crdtgrp_list) { + parent_kn = crgrp->mon.mon_data_kn; + mkdir_mondata_subdir(parent_kn, d, r, crgrp); + } + } +} + +static int mkdir_mondata_subdir_alldom(struct kernfs_node *parent_kn, + struct rdt_resource *r, + struct rdtgroup *prgrp) +{ + struct rdt_mon_domain *dom; + int ret; + + /* Walking r->domains, ensure it can't race with cpuhp */ + lockdep_assert_cpus_held(); + + list_for_each_entry(dom, &r->mon_domains, hdr.list) { + ret = mkdir_mondata_subdir(parent_kn, dom, r, prgrp); + if (ret) + return ret; + } + + return 0; +} + +/* + * This creates a directory mon_data which contains the monitored data. + * + * mon_data has one directory for each domain which are named + * in the format mon_<domain_name>_<domain_id>. For ex: A mon_data + * with L3 domain looks as below: + * ./mon_data: + * mon_L3_00 + * mon_L3_01 + * mon_L3_02 + * ... + * + * Each domain directory has one file per event: + * ./mon_L3_00/: + * llc_occupancy + * + */ +static int mkdir_mondata_all(struct kernfs_node *parent_kn, + struct rdtgroup *prgrp, + struct kernfs_node **dest_kn) +{ + struct rdt_resource *r; + struct kernfs_node *kn; + int ret; + + /* + * Create the mon_data directory first. + */ + ret = mongroup_create_dir(parent_kn, prgrp, "mon_data", &kn); + if (ret) + return ret; + + if (dest_kn) + *dest_kn = kn; + + /* + * Create the subdirectories for each domain. 
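+ * (mkdir_mondata_subdir_alldom() below is called once per mon capable
+ * resource.)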
Note that all events
+ * in a domain like L3 are grouped into a resource whose domain is L3.
+ */
+	for_each_mon_capable_rdt_resource(r) {
+		ret = mkdir_mondata_subdir_alldom(kn, r, prgrp);
+		if (ret)
+			goto out_destroy;
+	}
+
+	return 0;
+
+out_destroy:
+	kernfs_remove(kn);
+	return ret;
+}
+
+/**
+ * cbm_ensure_valid - Enforce validity on provided CBM
+ * @_val:	Candidate CBM
+ * @r:		RDT resource to which the CBM belongs
+ *
+ * The provided CBM represents all cache portions available for use. This
+ * may be represented by a bitmap that does not consist of contiguous ones
+ * and thus be an invalid CBM.
+ * Here the provided CBM is forced to be a valid CBM by only considering
+ * the first set of contiguous bits as valid and clearing all other bits.
+ * The intention here is to provide a valid default CBM with which a new
+ * resource group is initialized. The user can follow this with a
+ * modification to the CBM if the default does not satisfy the
+ * requirements.
+ */
+static u32 cbm_ensure_valid(u32 _val, struct rdt_resource *r)
+{
+	unsigned int cbm_len = r->cache.cbm_len;
+	unsigned long first_bit, zero_bit;
+	unsigned long val = _val;
+
+	if (!val)
+		return 0;
+
+	first_bit = find_first_bit(&val, cbm_len);
+	zero_bit = find_next_zero_bit(&val, cbm_len, first_bit);
+
+	/* Clear any remaining bits to ensure contiguous region */
+	bitmap_clear(&val, zero_bit, cbm_len - zero_bit);
+	return (u32)val;
+}
+
+/*
+ * Initialize cache resources per RDT domain
+ *
+ * Set the RDT domain up to start off with all usable allocations. That is,
+ * all shareable and unused bits. All-zero CBM is invalid.
+ */
+static int __init_one_rdt_domain(struct rdt_ctrl_domain *d, struct resctrl_schema *s,
+				 u32 closid)
+{
+	enum resctrl_conf_type peer_type = resctrl_peer_type(s->conf_type);
+	enum resctrl_conf_type t = s->conf_type;
+	struct resctrl_staged_config *cfg;
+	struct rdt_resource *r = s->res;
+	u32 used_b = 0, unused_b = 0;
+	unsigned long tmp_cbm;
+	enum rdtgrp_mode mode;
+	u32 peer_ctl, ctrl_val;
+	int i;
+
+	cfg = &d->staged_config[t];
+	cfg->have_new_ctrl = false;
+	cfg->new_ctrl = r->cache.shareable_bits;
+	used_b = r->cache.shareable_bits;
+	for (i = 0; i < closids_supported(); i++) {
+		if (closid_allocated(i) && i != closid) {
+			mode = rdtgroup_mode_by_closid(i);
+			if (mode == RDT_MODE_PSEUDO_LOCKSETUP)
+				/*
+				 * ctrl values for locksetup aren't relevant
+				 * until the schemata is written, and the mode
+				 * becomes RDT_MODE_PSEUDO_LOCKED.
+				 */
+				continue;
+			/*
+			 * If CDP is active include peer domain's
+			 * usage to ensure there is no overlap
+			 * with an exclusive group.
+			 */
+			if (resctrl_arch_get_cdp_enabled(r->rid))
+				peer_ctl = resctrl_arch_get_config(r, d, i,
+								   peer_type);
+			else
+				peer_ctl = 0;
+			ctrl_val = resctrl_arch_get_config(r, d, i,
+							   s->conf_type);
+			used_b |= ctrl_val | peer_ctl;
+			if (mode == RDT_MODE_SHAREABLE)
+				cfg->new_ctrl |= ctrl_val | peer_ctl;
+		}
+	}
+	if (d->plr && d->plr->cbm > 0)
+		used_b |= d->plr->cbm;
+	unused_b = used_b ^ (BIT_MASK(r->cache.cbm_len) - 1);
+	unused_b &= BIT_MASK(r->cache.cbm_len) - 1;
+	cfg->new_ctrl |= unused_b;
+	/*
+	 * Force the initial CBM to be valid, user can
+	 * modify the CBM based on system availability.
+	 */
+	cfg->new_ctrl = cbm_ensure_valid(cfg->new_ctrl, r);
+	/*
+	 * Assign the u32 CBM to an unsigned long to ensure that
+	 * bitmap_weight() does not access out-of-bound memory.
+ */ + tmp_cbm = cfg->new_ctrl; + if (bitmap_weight(&tmp_cbm, r->cache.cbm_len) < r->cache.min_cbm_bits) { + rdt_last_cmd_printf("No space on %s:%d\n", s->name, d->hdr.id); + return -ENOSPC; + } + cfg->have_new_ctrl = true; + + return 0; +} + +/* + * Initialize cache resources with default values. + * + * A new RDT group is being created on an allocation capable (CAT) + * supporting system. Set this group up to start off with all usable + * allocations. + * + * If there are no more shareable bits available on any domain then + * the entire allocation will fail. + */ +static int rdtgroup_init_cat(struct resctrl_schema *s, u32 closid) +{ + struct rdt_ctrl_domain *d; + int ret; + + list_for_each_entry(d, &s->res->ctrl_domains, hdr.list) { + ret = __init_one_rdt_domain(d, s, closid); + if (ret < 0) + return ret; + } + + return 0; +} + +/* Initialize MBA resource with default values. */ +static void rdtgroup_init_mba(struct rdt_resource *r, u32 closid) +{ + struct resctrl_staged_config *cfg; + struct rdt_ctrl_domain *d; + + list_for_each_entry(d, &r->ctrl_domains, hdr.list) { + if (is_mba_sc(r)) { + d->mbps_val[closid] = MBA_MAX_MBPS; + continue; + } + + cfg = &d->staged_config[CDP_NONE]; + cfg->new_ctrl = r->default_ctrl; + cfg->have_new_ctrl = true; + } +} + +/* Initialize the RDT group's allocations. */ +static int rdtgroup_init_alloc(struct rdtgroup *rdtgrp) +{ + struct resctrl_schema *s; + struct rdt_resource *r; + int ret = 0; + + rdt_staged_configs_clear(); + + list_for_each_entry(s, &resctrl_schema_all, list) { + r = s->res; + if (r->rid == RDT_RESOURCE_MBA || + r->rid == RDT_RESOURCE_SMBA) { + rdtgroup_init_mba(r, rdtgrp->closid); + if (is_mba_sc(r)) + continue; + } else { + ret = rdtgroup_init_cat(s, rdtgrp->closid); + if (ret < 0) + goto out; + } + + ret = resctrl_arch_update_domains(r, rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Failed to initialize allocations\n"); + goto out; + } + + } + + rdtgrp->mode = RDT_MODE_SHAREABLE; + +out: + rdt_staged_configs_clear(); + return ret; +} + +static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp) +{ + int ret; + + if (!resctrl_arch_mon_capable()) + return 0; + + ret = alloc_rmid(rdtgrp->closid); + if (ret < 0) { + rdt_last_cmd_puts("Out of RMIDs\n"); + return ret; + } + rdtgrp->mon.rmid = ret; + + ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn); + if (ret) { + rdt_last_cmd_puts("kernfs subdir error\n"); + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + return ret; + } + + return 0; +} + +static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp) +{ + if (resctrl_arch_mon_capable()) + free_rmid(rgrp->closid, rgrp->mon.rmid); +} + +static int mkdir_rdt_prepare(struct kernfs_node *parent_kn, + const char *name, umode_t mode, + enum rdt_group_type rtype, struct rdtgroup **r) +{ + struct rdtgroup *prdtgrp, *rdtgrp; + unsigned long files = 0; + struct kernfs_node *kn; + int ret; + + prdtgrp = rdtgroup_kn_lock_live(parent_kn); + if (!prdtgrp) { + ret = -ENODEV; + goto out_unlock; + } + + if (rtype == RDTMON_GROUP && + (prdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + prdtgrp->mode == RDT_MODE_PSEUDO_LOCKED)) { + ret = -EINVAL; + rdt_last_cmd_puts("Pseudo-locking in progress\n"); + goto out_unlock; + } + + /* allocate the rdtgroup. 
*/ + rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL); + if (!rdtgrp) { + ret = -ENOSPC; + rdt_last_cmd_puts("Kernel out of memory\n"); + goto out_unlock; + } + *r = rdtgrp; + rdtgrp->mon.parent = prdtgrp; + rdtgrp->type = rtype; + INIT_LIST_HEAD(&rdtgrp->mon.crdtgrp_list); + + /* kernfs creates the directory for rdtgrp */ + kn = kernfs_create_dir(parent_kn, name, mode, rdtgrp); + if (IS_ERR(kn)) { + ret = PTR_ERR(kn); + rdt_last_cmd_puts("kernfs create error\n"); + goto out_free_rgrp; + } + rdtgrp->kn = kn; + + /* + * kernfs_remove() will drop the reference count on "kn" which + * will free it. But we still need it to stick around for the + * rdtgroup_kn_unlock(kn) call. Take one extra reference here, + * which will be dropped by kernfs_put() in rdtgroup_remove(). + */ + kernfs_get(kn); + + ret = rdtgroup_kn_set_ugid(kn); + if (ret) { + rdt_last_cmd_puts("kernfs perm error\n"); + goto out_destroy; + } + + if (rtype == RDTCTRL_GROUP) { + files = RFTYPE_BASE | RFTYPE_CTRL; + if (resctrl_arch_mon_capable()) + files |= RFTYPE_MON; + } else { + files = RFTYPE_BASE | RFTYPE_MON; + } + + ret = rdtgroup_add_files(kn, files); + if (ret) { + rdt_last_cmd_puts("kernfs fill error\n"); + goto out_destroy; + } + + /* + * The caller unlocks the parent_kn upon success. + */ + return 0; + +out_destroy: + kernfs_put(rdtgrp->kn); + kernfs_remove(rdtgrp->kn); +out_free_rgrp: + kfree(rdtgrp); +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +static void mkdir_rdt_prepare_clean(struct rdtgroup *rgrp) +{ + kernfs_remove(rgrp->kn); + rdtgroup_remove(rgrp); +} + +/* + * Create a monitor group under "mon_groups" directory of a control + * and monitor group(ctrl_mon). This is a resource group + * to monitor a subset of tasks and cpus in its parent ctrl_mon group. + */ +static int rdtgroup_mkdir_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp, *prgrp; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTMON_GROUP, &rdtgrp); + if (ret) + return ret; + + prgrp = rdtgrp->mon.parent; + rdtgrp->closid = prgrp->closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) { + mkdir_rdt_prepare_clean(rdtgrp); + goto out_unlock; + } + + kernfs_activate(rdtgrp->kn); + + /* + * Add the rdtgrp to the list of rdtgrps the parent + * ctrl_mon group has to track. + */ + list_add_tail(&rdtgrp->mon.crdtgrp_list, &prgrp->mon.crdtgrp_list); + +out_unlock: + rdtgroup_kn_unlock(parent_kn); + return ret; +} + +/* + * These are rdtgroups created under the root directory. Can be used + * to allocate and monitor resources. + */ +static int rdtgroup_mkdir_ctrl_mon(struct kernfs_node *parent_kn, + const char *name, umode_t mode) +{ + struct rdtgroup *rdtgrp; + struct kernfs_node *kn; + u32 closid; + int ret; + + ret = mkdir_rdt_prepare(parent_kn, name, mode, RDTCTRL_GROUP, &rdtgrp); + if (ret) + return ret; + + kn = rdtgrp->kn; + ret = closid_alloc(); + if (ret < 0) { + rdt_last_cmd_puts("Out of CLOSIDs\n"); + goto out_common_fail; + } + closid = ret; + ret = 0; + + rdtgrp->closid = closid; + + ret = mkdir_rdt_prepare_rmid_alloc(rdtgrp); + if (ret) + goto out_closid_free; + + kernfs_activate(rdtgrp->kn); + + ret = rdtgroup_init_alloc(rdtgrp); + if (ret < 0) + goto out_rmid_free; + + list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups); + + if (resctrl_arch_mon_capable()) { + /* + * Create an empty mon_groups directory to hold the subset + * of tasks and cpus to monitor. 
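+ * Tasks and CPUs are assigned to the monitor groups subsequently created
+ * under it via their "tasks" and "cpus" files (see rdtgroup_mkdir_mon()
+ * above).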
 */
+		ret = mongroup_create_dir(kn, rdtgrp, "mon_groups", NULL);
+		if (ret) {
+			rdt_last_cmd_puts("kernfs subdir error\n");
+			goto out_del_list;
+		}
+	}
+
+	goto out_unlock;
+
+out_del_list:
+	list_del(&rdtgrp->rdtgroup_list);
+out_rmid_free:
+	mkdir_rdt_prepare_rmid_free(rdtgrp);
+out_closid_free:
+	closid_free(closid);
+out_common_fail:
+	mkdir_rdt_prepare_clean(rdtgrp);
+out_unlock:
+	rdtgroup_kn_unlock(parent_kn);
+	return ret;
+}
+
+/*
+ * We allow creating mon groups only within a directory called "mon_groups"
+ * which is present in every ctrl_mon group. Check if this is a valid
+ * "mon_groups" directory.
+ *
+ * 1. The directory should be named "mon_groups".
+ * 2. The mon group itself should "not" be named "mon_groups".
+ *    This makes sure the "mon_groups" directory always has a ctrl_mon
+ *    group as parent.
+ */
+static bool is_mon_groups(struct kernfs_node *kn, const char *name)
+{
+	return (!strcmp(kn->name, "mon_groups") &&
+		strcmp(name, "mon_groups"));
+}
+
+static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
+			  umode_t mode)
+{
+	/* Do not accept '\n' to avoid an unparsable situation. */
+	if (strchr(name, '\n'))
+		return -EINVAL;
+
+	/*
+	 * If the parent directory is the root directory and RDT
+	 * allocation is supported, add a control and monitoring
+	 * subdirectory.
+	 */
+	if (resctrl_arch_alloc_capable() && parent_kn == rdtgroup_default.kn)
+		return rdtgroup_mkdir_ctrl_mon(parent_kn, name, mode);
+
+	/*
+	 * If RDT monitoring is supported and the parent directory is a valid
+	 * "mon_groups" directory, add a monitoring subdirectory.
+	 */
+	if (resctrl_arch_mon_capable() && is_mon_groups(parent_kn, name))
+		return rdtgroup_mkdir_mon(parent_kn, name, mode);
+
+	return -EPERM;
+}
+
+static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+{
+	struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
+	int cpu;
+
+	/* Give any tasks back to the parent group */
+	rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
+
+	/* Update per cpu rmid of the moved CPUs first */
+	for_each_cpu(cpu, &rdtgrp->cpu_mask)
+		per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
+	/*
+	 * Update the MSR on moved CPUs and CPUs which have a moved
+	 * task running on them.
+	 */
+	cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
+	update_closid_rmid(tmpmask, NULL);
+
+	rdtgrp->flags = RDT_DELETED;
+	free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
+
+	/*
+	 * Remove the rdtgrp from the parent ctrl_mon group's list.
+	 */
+	WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list));
+	list_del(&rdtgrp->mon.crdtgrp_list);
+
+	kernfs_remove(rdtgrp->kn);
+
+	return 0;
+}
+
+static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
+{
+	rdtgrp->flags = RDT_DELETED;
+	list_del(&rdtgrp->rdtgroup_list);
+
+	kernfs_remove(rdtgrp->kn);
+	return 0;
+}
+
+static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+{
+	int cpu;
+
+	/* Give any tasks back to the default group */
+	rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
+
+	/* Give any CPUs back to the default group */
+	cpumask_or(&rdtgroup_default.cpu_mask,
+		   &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
+
+	/* Update per cpu closid and rmid of the moved CPUs first */
+	for_each_cpu(cpu, &rdtgrp->cpu_mask) {
+		per_cpu(pqr_state.default_closid, cpu) = rdtgroup_default.closid;
+		per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
+	}
+
+	/*
+	 * Update the MSR on moved CPUs and CPUs which have a moved
+	 * task running on them.
+ */ + cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask); + update_closid_rmid(tmpmask, NULL); + + free_rmid(rdtgrp->closid, rdtgrp->mon.rmid); + closid_free(rdtgrp->closid); + + rdtgroup_ctrl_remove(rdtgrp); + + /* + * Free all the child monitor group rmids. + */ + free_all_child_rdtgrp(rdtgrp); + + return 0; +} + +static int rdtgroup_rmdir(struct kernfs_node *kn) +{ + struct kernfs_node *parent_kn = kn->parent; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret = 0; + + if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) + return -ENOMEM; + + rdtgrp = rdtgroup_kn_lock_live(kn); + if (!rdtgrp) { + ret = -EPERM; + goto out; + } + + /* + * If the rdtgroup is a ctrl_mon group and parent directory + * is the root directory, remove the ctrl_mon group. + * + * If the rdtgroup is a mon group and parent directory + * is a valid "mon_groups" directory, remove the mon group. + */ + if (rdtgrp->type == RDTCTRL_GROUP && parent_kn == rdtgroup_default.kn && + rdtgrp != &rdtgroup_default) { + if (rdtgrp->mode == RDT_MODE_PSEUDO_LOCKSETUP || + rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) { + ret = rdtgroup_ctrl_remove(rdtgrp); + } else { + ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask); + } + } else if (rdtgrp->type == RDTMON_GROUP && + is_mon_groups(parent_kn, kn->name)) { + ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask); + } else { + ret = -EPERM; + } + +out: + rdtgroup_kn_unlock(kn); + free_cpumask_var(tmpmask); + return ret; +} + +/** + * mongrp_reparent() - replace parent CTRL_MON group of a MON group + * @rdtgrp: the MON group whose parent should be replaced + * @new_prdtgrp: replacement parent CTRL_MON group for @rdtgrp + * @cpus: cpumask provided by the caller for use during this call + * + * Replaces the parent CTRL_MON group for a MON group, resulting in all member + * tasks' CLOSID immediately changing to that of the new parent group. + * Monitoring data for the group is unaffected by this operation. + */ +static void mongrp_reparent(struct rdtgroup *rdtgrp, + struct rdtgroup *new_prdtgrp, + cpumask_var_t cpus) +{ + struct rdtgroup *prdtgrp = rdtgrp->mon.parent; + + WARN_ON(rdtgrp->type != RDTMON_GROUP); + WARN_ON(new_prdtgrp->type != RDTCTRL_GROUP); + + /* Nothing to do when simply renaming a MON group. */ + if (prdtgrp == new_prdtgrp) + return; + + WARN_ON(list_empty(&prdtgrp->mon.crdtgrp_list)); + list_move_tail(&rdtgrp->mon.crdtgrp_list, + &new_prdtgrp->mon.crdtgrp_list); + + rdtgrp->mon.parent = new_prdtgrp; + rdtgrp->closid = new_prdtgrp->closid; + + /* Propagate updated closid to all tasks in this group. */ + rdt_move_group_tasks(rdtgrp, rdtgrp, cpus); + + update_closid_rmid(cpus, NULL); +} + +static int rdtgroup_rename(struct kernfs_node *kn, + struct kernfs_node *new_parent, const char *new_name) +{ + struct rdtgroup *new_prdtgrp; + struct rdtgroup *rdtgrp; + cpumask_var_t tmpmask; + int ret; + + rdtgrp = kernfs_to_rdtgroup(kn); + new_prdtgrp = kernfs_to_rdtgroup(new_parent); + if (!rdtgrp || !new_prdtgrp) + return -ENOENT; + + /* Release both kernfs active_refs before obtaining rdtgroup mutex. */ + rdtgroup_kn_get(rdtgrp, kn); + rdtgroup_kn_get(new_prdtgrp, new_parent); + + mutex_lock(&rdtgroup_mutex); + + rdt_last_cmd_clear(); + + /* + * Don't allow kernfs_to_rdtgroup() to return a parent rdtgroup if + * either kernfs_node is a file. 
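+ * (for a file, kernfs_to_rdtgroup() returns the group owning the parent
+ * directory, which is not the object being renamed.)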
+	 */
+	if (kernfs_type(kn) != KERNFS_DIR ||
+	    kernfs_type(new_parent) != KERNFS_DIR) {
+		rdt_last_cmd_puts("Source and destination must be directories");
+		ret = -EPERM;
+		goto out;
+	}
+
+	if ((rdtgrp->flags & RDT_DELETED) || (new_prdtgrp->flags & RDT_DELETED)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	if (rdtgrp->type != RDTMON_GROUP || !kn->parent ||
+	    !is_mon_groups(kn->parent, kn->name)) {
+		rdt_last_cmd_puts("Source must be a MON group\n");
+		ret = -EPERM;
+		goto out;
+	}
+
+	if (!is_mon_groups(new_parent, new_name)) {
+		rdt_last_cmd_puts("Destination must be a mon_groups subdirectory\n");
+		ret = -EPERM;
+		goto out;
+	}
+
+	/*
+	 * If the MON group is monitoring CPUs, the CPUs must be assigned to the
+	 * current parent CTRL_MON group and therefore cannot be assigned to
+	 * the new parent, making the move illegal.
+	 */
+	if (!cpumask_empty(&rdtgrp->cpu_mask) &&
+	    rdtgrp->mon.parent != new_prdtgrp) {
+		rdt_last_cmd_puts("Cannot move a MON group that monitors CPUs\n");
+		ret = -EPERM;
+		goto out;
+	}
+
+	/*
+	 * Allocate the cpumask for use in mongrp_reparent() to avoid the
+	 * possibility of failing to allocate it after kernfs_rename() has
+	 * succeeded.
+	 */
+	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * Perform all input validation and allocations needed to ensure
+	 * mongrp_reparent() will succeed before calling kernfs_rename(),
+	 * otherwise it would be necessary to revert this call if
+	 * mongrp_reparent() failed.
+	 */
+	ret = kernfs_rename(kn, new_parent, new_name);
+	if (!ret)
+		mongrp_reparent(rdtgrp, new_prdtgrp, tmpmask);
+
+	free_cpumask_var(tmpmask);
+
+out:
+	mutex_unlock(&rdtgroup_mutex);
+	rdtgroup_kn_put(rdtgrp, kn);
+	rdtgroup_kn_put(new_prdtgrp, new_parent);
+	return ret;
+}
+
+static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
+{
+	if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3))
+		seq_puts(seq, ",cdp");
+
+	if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2))
+		seq_puts(seq, ",cdpl2");
+
+	if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl))
+		seq_puts(seq, ",mba_MBps");
+
+	if (resctrl_debug)
+		seq_puts(seq, ",debug");
+
+	return 0;
+}
+
+static struct kernfs_syscall_ops rdtgroup_kf_syscall_ops = {
+	.mkdir		= rdtgroup_mkdir,
+	.rmdir		= rdtgroup_rmdir,
+	.rename		= rdtgroup_rename,
+	.show_options	= rdtgroup_show_options,
+};
+
+static int rdtgroup_setup_root(struct rdt_fs_context *ctx)
+{
+	rdt_root = kernfs_create_root(&rdtgroup_kf_syscall_ops,
+				      KERNFS_ROOT_CREATE_DEACTIVATED |
+				      KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK,
+				      &rdtgroup_default);
+	if (IS_ERR(rdt_root))
+		return PTR_ERR(rdt_root);
+
+	ctx->kfc.root = rdt_root;
+	rdtgroup_default.kn = kernfs_root_to_node(rdt_root);
+
+	return 0;
+}
+
+static void rdtgroup_destroy_root(void)
+{
+	kernfs_destroy_root(rdt_root);
+	rdtgroup_default.kn = NULL;
+}
+
+static void __init rdtgroup_setup_default(void)
+{
+	mutex_lock(&rdtgroup_mutex);
+
+	rdtgroup_default.closid = RESCTRL_RESERVED_CLOSID;
+	rdtgroup_default.mon.rmid = RESCTRL_RESERVED_RMID;
+	rdtgroup_default.type = RDTCTRL_GROUP;
+	INIT_LIST_HEAD(&rdtgroup_default.mon.crdtgrp_list);
+
+	list_add(&rdtgroup_default.rdtgroup_list, &rdt_all_groups);
+
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+static void domain_destroy_mon_state(struct rdt_mon_domain *d)
+{
+	bitmap_free(d->rmid_busy_llc);
+	kfree(d->mbm_total);
+	kfree(d->mbm_local);
+}
+
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
+{
+	mutex_lock(&rdtgroup_mutex);
+
+	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA)
+		mba_sc_domain_destroy(r, d);
+
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+{
+	mutex_lock(&rdtgroup_mutex);
+
+	/*
+	 * If resctrl is mounted, remove all the
+	 * per domain monitor data directories.
+	 */
+	if (resctrl_mounted && resctrl_arch_mon_capable())
+		rmdir_mondata_subdir_allrdtgrp(r, d);
+
+	if (is_mbm_enabled())
+		cancel_delayed_work(&d->mbm_over);
+	if (is_llc_occupancy_enabled() && has_busy_rmid(d)) {
+		/*
+		 * When a package is going down, forcefully
+		 * decrement rmid->ebusy. There is no way to know
+		 * that the L3 was flushed and hence may lead to
+		 * incorrect counts in rare scenarios, but leaving
+		 * the RMID as busy creates RMID leaks if the
+		 * package never comes back.
+		 */
+		__check_limbo(d, true);
+		cancel_delayed_work(&d->cqm_limbo);
+	}
+
+	domain_destroy_mon_state(d);
+
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_mon_domain *d)
+{
+	u32 idx_limit = resctrl_arch_system_num_rmid_idx();
+	size_t tsize;
+
+	if (is_llc_occupancy_enabled()) {
+		d->rmid_busy_llc = bitmap_zalloc(idx_limit, GFP_KERNEL);
+		if (!d->rmid_busy_llc)
+			return -ENOMEM;
+	}
+	if (is_mbm_total_enabled()) {
+		tsize = sizeof(*d->mbm_total);
+		d->mbm_total = kcalloc(idx_limit, tsize, GFP_KERNEL);
+		if (!d->mbm_total) {
+			bitmap_free(d->rmid_busy_llc);
+			return -ENOMEM;
+		}
+	}
+	if (is_mbm_local_enabled()) {
+		tsize = sizeof(*d->mbm_local);
+		d->mbm_local = kcalloc(idx_limit, tsize, GFP_KERNEL);
+		if (!d->mbm_local) {
+			bitmap_free(d->rmid_busy_llc);
+			kfree(d->mbm_total);
+			return -ENOMEM;
+		}
+	}
+
+	return 0;
+}
+
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d)
+{
+	int err = 0;
+
+	mutex_lock(&rdtgroup_mutex);
+
+	if (supports_mba_mbps() && r->rid == RDT_RESOURCE_MBA) {
+		/* RDT_RESOURCE_MBA is never mon_capable */
+		err = mba_sc_domain_allocate(r, d);
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+
+	return err;
+}
+
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d)
+{
+	int err;
+
+	mutex_lock(&rdtgroup_mutex);
+
+	err = domain_setup_mon_state(r, d);
+	if (err)
+		goto out_unlock;
+
+	if (is_mbm_enabled()) {
+		INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
+		mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL,
+					   RESCTRL_PICK_ANY_CPU);
+	}
+
+	if (is_llc_occupancy_enabled())
+		INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
+
+	/*
+	 * If the filesystem is not mounted then only the default resource group
+	 * exists. Creation of its directories is deferred until mount time
+	 * by rdt_get_tree() calling mkdir_mondata_all().
+	 * If resctrl is mounted, add per domain monitor data directories.
+	 */
+	if (resctrl_mounted && resctrl_arch_mon_capable())
+		mkdir_mondata_subdir_allrdtgrp(r, d);
+
+out_unlock:
+	mutex_unlock(&rdtgroup_mutex);
+
+	return err;
+}
+
+void resctrl_online_cpu(unsigned int cpu)
+{
+	mutex_lock(&rdtgroup_mutex);
+	/* The CPU is set in default rdtgroup after online. */
+	cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
+{
+	struct rdtgroup *cr;
+
+	list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
+		if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask))
+			break;
+	}
+}
+
+void resctrl_offline_cpu(unsigned int cpu)
+{
+	struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+	struct rdt_mon_domain *d;
+	struct rdtgroup *rdtgrp;
+
+	mutex_lock(&rdtgroup_mutex);
+	list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
+		if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
+			clear_childcpus(rdtgrp, cpu);
+			break;
+		}
+	}
+
+	if (!l3->mon_capable)
+		goto out_unlock;
+
+	d = get_mon_domain_from_cpu(cpu, l3);
+	if (d) {
+		if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
+			cancel_delayed_work(&d->mbm_over);
+			mbm_setup_overflow_handler(d, 0, cpu);
+		}
+		if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
+		    has_busy_rmid(d)) {
+			cancel_delayed_work(&d->cqm_limbo);
+			cqm_setup_limbo_handler(d, 0, cpu);
+		}
+	}
+
+out_unlock:
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+/*
+ * rdtgroup_init - rdtgroup initialization
+ *
+ * Setup resctrl file system including set up root, create mount point,
+ * register rdtgroup filesystem, and initialize files under root directory.
+ *
+ * Return: 0 on success or -errno
+ */
+int __init rdtgroup_init(void)
+{
+	int ret = 0;
+
+	seq_buf_init(&last_cmd_status, last_cmd_status_buf,
+		     sizeof(last_cmd_status_buf));
+
+	rdtgroup_setup_default();
+
+	ret = sysfs_create_mount_point(fs_kobj, "resctrl");
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&rdt_fs_type);
+	if (ret)
+		goto cleanup_mountpoint;
+
+	/*
+	 * Adding the resctrl debugfs directory here may not be ideal since
+	 * it would let the resctrl debugfs directory appear on the debugfs
+	 * filesystem before the resctrl filesystem is mounted.
+	 * It may also be ok since that would enable debugging of RDT before
+	 * resctrl is mounted.
+	 * The reason why the debugfs directory is created here and not in
+	 * rdt_get_tree() is because rdt_get_tree() takes rdtgroup_mutex and
+	 * during the debugfs directory creation also &sb->s_type->i_mutex_key
+	 * (the lockdep class of inode->i_rwsem). Other filesystem
+	 * interactions (eg. SyS_getdents) have the lock ordering:
+	 * &sb->s_type->i_mutex_key --> &mm->mmap_lock
+	 * During mmap(), called with &mm->mmap_lock, the rdtgroup_mutex
+	 * is taken, thus creating dependency:
+	 * &mm->mmap_lock --> rdtgroup_mutex for the latter that can cause
+	 * issues considering the other two lock dependencies.
+	 * By creating the debugfs directory here we avoid a dependency
+	 * that may cause deadlock (even though file operations cannot
+	 * occur until the filesystem is mounted, but I do not know how to
+	 * tell lockdep that).
+	 */
+	debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
+
+	return 0;
+
+cleanup_mountpoint:
+	sysfs_remove_mount_point(fs_kobj, "resctrl");
+
+	return ret;
+}
+
+void __exit rdtgroup_exit(void)
+{
+	debugfs_remove_recursive(debugfs_resctrl);
+	unregister_filesystem(&rdt_fs_type);
+	sysfs_remove_mount_point(fs_kobj, "resctrl");
 }
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h b/arch/x86/kernel/cpu/resctrl/trace.h
similarity index 56%
rename from arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h
rename to arch/x86/kernel/cpu/resctrl/trace.h
index 428ebbd4270b..2a506316b303 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock_event.h
+++ b/arch/x86/kernel/cpu/resctrl/trace.h
@@ -2,8 +2,8 @@
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM resctrl
 
-#if !defined(_TRACE_PSEUDO_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
-#define _TRACE_PSEUDO_LOCK_H
+#if !defined(_TRACE_RESCTRL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RESCTRL_H
 
 #include <linux/tracepoint.h>
 
@@ -35,9 +35,25 @@ TRACE_EVENT(pseudo_lock_l3,
 	    TP_printk("hits=%llu miss=%llu",
 		      __entry->l3_hits, __entry->l3_miss));
 
-#endif /* _TRACE_PSEUDO_LOCK_H */
+TRACE_EVENT(mon_llc_occupancy_limbo,
+	    TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int domain_id, u64 llc_occupancy_bytes),
+	    TP_ARGS(ctrl_hw_id, mon_hw_id, domain_id, llc_occupancy_bytes),
+	    TP_STRUCT__entry(__field(u32, ctrl_hw_id)
+			     __field(u32, mon_hw_id)
+			     __field(int, domain_id)
+			     __field(u64, llc_occupancy_bytes)),
+	    TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id;
+			   __entry->mon_hw_id = mon_hw_id;
+			   __entry->domain_id = domain_id;
+			   __entry->llc_occupancy_bytes = llc_occupancy_bytes;),
+	    TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain_id=%d llc_occupancy_bytes=%llu",
+		      __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->domain_id,
+		      __entry->llc_occupancy_bytes)
+	    );
+
+#endif /* _TRACE_RESCTRL_H */
 
 #undef TRACE_INCLUDE_PATH
 #define TRACE_INCLUDE_PATH .
-#define TRACE_INCLUDE_FILE pseudo_lock_event
+#define TRACE_INCLUDE_FILE trace
 
 #include <trace/define_trace.h>
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 7700f7019d77..d94abba1c716 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -2,15 +2,10 @@
 #ifndef _RESCTRL_H
 #define _RESCTRL_H
 
-#include <linux/cpu.h>
+#include <linux/cacheinfo.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/pid.h>
-#include <linux/resctrl_types.h>
-
-#ifdef CONFIG_ARCH_HAS_CPU_RESCTRL
-#include <asm/resctrl.h>
-#endif
 
 /* CLOSID, RMID value used by the default control group */
 #define RESCTRL_RESERVED_CLOSID		0
@@ -30,52 +25,28 @@ int proc_resctrl_show(struct seq_file *m,
 /* max value for struct rdt_domain's mbps_val */
 #define MBA_MAX_MBPS		U32_MAX
 
-/*
- * Resctrl uses u32 to hold the user-space config. The maximum bitmap size is
- * 32.
+/**
+ * enum resctrl_conf_type - The type of configuration.
+ * @CDP_NONE:	No prioritisation, both code and data are controlled or monitored.
+ * @CDP_CODE:	Configuration applies to instruction fetches.
+ * @CDP_DATA:	Configuration applies to reads and writes.
  */
-#define RESCTRL_MAX_CBM		32
+enum resctrl_conf_type {
+	CDP_NONE,
+	CDP_CODE,
+	CDP_DATA,
+};
 
-extern unsigned int resctrl_rmid_realloc_limit;
-extern unsigned int resctrl_rmid_realloc_threshold;
+#define CDP_NUM_TYPES		(CDP_DATA + 1)
 
-/**
- * struct pseudo_lock_region - pseudo-lock region information
- * @s:			Resctrl schema for the resource to which this
- *			pseudo-locked region belongs
- * @closid:		The closid that this pseudo-locked region uses
- * @d:			RDT domain to which this pseudo-locked region
- *			belongs
- * @cbm:		bitmask of the pseudo-locked region
- * @lock_thread_wq:	waitqueue used to wait on the pseudo-locking thread
- *			completion
- * @thread_done:	variable used by waitqueue to test if pseudo-locking
- *			thread completed
- * @cpu:		core associated with the cache on which the setup code
- *			will be run
- * @line_size:		size of the cache lines
- * @size:		size of pseudo-locked region in bytes
- * @kmem:		the kernel memory associated with pseudo-locked region
- * @minor:		minor number of character device associated with this
- *			region
- * @debugfs_dir:	pointer to this region's directory in the debugfs
- *			filesystem
- * @pm_reqs:		Power management QoS requests related to this region
+/*
+ * Event IDs, the values match those used to program IA32_QM_EVTSEL before
+ * reading IA32_QM_CTR on RDT systems.
  */
-struct pseudo_lock_region {
-	struct resctrl_schema	*s;
-	u32			closid;
-	struct rdt_domain	*d;
-	u32			cbm;
-	wait_queue_head_t	lock_thread_wq;
-	int			thread_done;
-	int			cpu;
-	unsigned int		line_size;
-	unsigned int		size;
-	void			*kmem;
-	unsigned int		minor;
-	struct dentry		*debugfs_dir;
-	struct list_head	pm_reqs;
+enum resctrl_event_id {
+	QOS_L3_OCCUP_EVENT_ID		= 0x01,
+	QOS_L3_MBM_TOTAL_EVENT_ID	= 0x02,
+	QOS_L3_MBM_LOCAL_EVENT_ID	= 0x03,
 };
 
 /**
@@ -88,11 +59,45 @@ struct resctrl_staged_config {
 	bool			have_new_ctrl;
 };
 
+enum resctrl_domain_type {
+	RESCTRL_CTRL_DOMAIN,
+	RESCTRL_MON_DOMAIN,
+};
+
 /**
- * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * struct rdt_domain_hdr - common header for different domain types
  * @list:		all instances of this resource
  * @id:			unique id for this instance
+ * @type:		type of this instance
  * @cpu_mask:		which CPUs share this resource
+ */
+struct rdt_domain_hdr {
+	struct list_head		list;
+	int				id;
+	enum resctrl_domain_type	type;
+	struct cpumask			cpu_mask;
+};
+
+/**
+ * struct rdt_ctrl_domain - group of CPUs sharing a resctrl control resource
+ * @hdr:		common header for different domain types
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_ctrl_domain {
+	struct rdt_domain_hdr		hdr;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct rdt_mon_domain - group of CPUs sharing a resctrl monitor resource
+ * @hdr:		common header for different domain types
+ * @ci:			cache info for this domain
  * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
  * @mbm_total:		saved state for MBM total bandwidth
  * @mbm_local:		saved state for MBM local bandwidth
  * @mbm_over:		worker to periodically read MBM h/w counters
@@ -100,27 +105,17 @@ struct resctrl_staged_config
  * @cqm_limbo:		worker to periodically read CQM h/w counters
  * @mbm_work_cpu:	worker CPU for MBM h/w counters
  * @cqm_work_cpu:	worker CPU for CQM h/w counters
- * @plr:		pseudo-locked region (if any) associated with domain
- * @staged_config:	parsed configuration to be applied
- * @mbps_val:		When mba_sc is enabled, this holds the array of user
- *			specified control values for mba_sc in MBps, indexed
- *			by closid
  */
-struct rdt_domain {
-	struct list_head		list;
-	int				id;
-	struct cpumask			cpu_mask;
+struct rdt_mon_domain {
+	struct rdt_domain_hdr		hdr;
+	struct cacheinfo		*ci;
 	unsigned long			*rmid_busy_llc;
 	struct mbm_state		*mbm_total;
 	struct mbm_state		*mbm_local;
-	struct mbm_state		*mbm_core;
 	struct delayed_work		mbm_over;
 	struct delayed_work		cqm_limbo;
 	int				mbm_work_cpu;
 	int				cqm_work_cpu;
-	struct pseudo_lock_region	*plr;
-	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
-	u32				*mbps_val;
 };
 
 /**
@@ -134,8 +129,6 @@ struct rdt_domain
  * @arch_has_sparse_bitmasks:	True if a bitmask like f00f is valid.
  * @arch_has_per_cpu_cfg:	True if QOS_CFG register for this cache
  *				level has CPU scope.
- * @intpri_wd:		Number of implemented bits in the priority
- *			partition.
  */
 struct resctrl_cache {
 	unsigned int	cbm_len;
@@ -143,7 +136,6 @@ struct resctrl_cache {
 	unsigned int	cbm_len;
 	unsigned int	min_cbm_bits;
 	unsigned int	shareable_bits;
 	bool		arch_has_sparse_bitmasks;
 	bool		arch_has_per_cpu_cfg;
-	unsigned int	intpri_wd;
 };
 
 /**
@@ -163,7 +155,6 @@ enum membw_throttle_mode {
 /**
  * struct resctrl_membw - Memory bandwidth allocation related data
  * @min_bw:		Minimum memory bandwidth percentage user can request
- * @max_bw:		Maximum memory bandwidth value, used as the reset value
 * @bw_gran:		Granularity at which the memory bandwidth is allocated
 * @delay_linear:	True if memory B/W delay is in linear scale
 * @arch_needs_linear:	True if we can't configure non-linear resources
 * @throttle_mode:	Bandwidth throttling mode when threads request
@@ -171,29 +162,24 @@ enum membw_throttle_mode
 *			different memory bandwidths
 * @mba_sc:		True if MBA software controller(mba_sc) is enabled
 * @mb_map:		Mapping of memory B/W percentage to memory B/W delay
- * @intpri_wd:		Number of implemented bits in the priority
- *			partition.
 */
 struct resctrl_membw {
 	u32		min_bw;
-	u32		max_bw;
 	u32		bw_gran;
 	u32		delay_linear;
 	bool		arch_needs_linear;
 	enum membw_throttle_mode throttle_mode;
 	bool		mba_sc;
 	u32		*mb_map;
-	u32		intpri_wd;
 };
 
-/**
- * enum resctrl_schema_fmt - The format user-space provides for a schema.
- * @RESCTRL_SCHEMA_BITMAP:	The schema is a bitmap in hex.
- * @RESCTRL_SCHEMA_RANGE:	The schema is a decimal number.
- */
-enum resctrl_schema_fmt {
-	RESCTRL_SCHEMA_BITMAP,
-	RESCTRL_SCHEMA_RANGE,
+struct rdt_parse_data;
+struct resctrl_schema;
+
+enum resctrl_scope {
+	RESCTRL_L2_CACHE = 2,
+	RESCTRL_L3_CACHE = 3,
+	RESCTRL_L3_NODE,
 };
 
 /**
@@ -202,17 +188,18 @@ enum resctrl_schema_fmt {
  * @alloc_capable:	Is allocation available on this machine
  * @mon_capable:	Is monitor feature available on this machine
  * @num_rmid:		Number of RMIDs available
- * @cache_level:	Which cache level defines scope of this resource
+ * @ctrl_scope:		Scope of this resource for control functions
+ * @mon_scope:		Scope of this resource for monitor functions
  * @cache:		Cache allocation related data
  * @membw:		If the component has bandwidth controls, their properties.
- * @domains:		RCU list of all domains for this resource
+ * @ctrl_domains:	RCU list of all control domains for this resource
+ * @mon_domains:	RCU list of all monitor domains for this resource
  * @name:		Name to use in "schemata" file.
  * @data_width:		Character width of data when displaying
  * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
  * @format_str:		Per resource format string to show domain value
+ * @parse_ctrlval:	Per resource function pointer to parse control values
  * @evt_list:		List of monitoring events
- * @mbm_cfg_mask:	Bandwidth sources that can be tracked when bandwidth
- *			monitoring events can be configured.
  * @fflags:		flags to choose base and info files
  * @cdp_capable:	Is the CDP feature available on this resource
  */
@@ -220,31 +207,25 @@ struct rdt_resource {
 	int			rid;
 	bool			alloc_capable;
 	bool			mon_capable;
-	bool			invisible;
-	bool			is_volatile;
 	int			num_rmid;
-	int			cache_level;
+	enum resctrl_scope	ctrl_scope;
+	enum resctrl_scope	mon_scope;
 	struct resctrl_cache	cache;
 	struct resctrl_membw	membw;
-	struct list_head	domains;
+	struct list_head	ctrl_domains;
+	struct list_head	mon_domains;
 	char			*name;
 	int			data_width;
 	u32			default_ctrl;
 	const char		*format_str;
+	int (*parse_ctrlval)(struct rdt_parse_data *data,
+			     struct resctrl_schema *s,
+			     struct rdt_ctrl_domain *d);
 	struct list_head	evt_list;
-	unsigned int		mbm_cfg_mask;
 	unsigned long		fflags;
 	bool			cdp_capable;
-	enum resctrl_schema_fmt	schema_fmt;
 };
 
-/*
- * Get the resource that exists at this level. If the level is not supported
- * a dummy/not-capable resource can be returned. Levels >= RDT_NUM_RESOURCES
- * will return NULL.
- */
-struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l);
-
 /**
  * struct resctrl_schema - configuration abilities of a resource presented to
  *			   user-space
@@ -259,109 +240,30 @@ struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l);
  */
 struct resctrl_schema {
 	struct list_head		list;
-	char				name[16];
+	char				name[8];
 	enum resctrl_conf_type		conf_type;
 	struct rdt_resource		*res;
 	u32				num_closid;
 };
 
-struct resctrl_cpu_sync {
-	u32 closid;
-	u32 rmid;
-};
-
-struct resctrl_mon_config_info {
-	struct rdt_resource	*r;
-	struct rdt_domain	*d;
-	u32			evtid;
-	u32			mon_config;
-
-	int			err;
-};
-
-/**
- * struct mon_evt - Entry in the event list of a resource
- * @evtid:		event id
- * @name:		name of the event
- * @configurable:	true if the event is configurable
- * @list:		entry in &rdt_resource->evt_list
- */
-struct mon_evt {
-	enum resctrl_event_id	evtid;
-	char			*name;
-	bool			configurable;
-	struct list_head	list;
-};
-
-/*
- * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to
- * struct resctrl_cpu_sync, or NULL.
- */
-void resctrl_arch_sync_cpu_defaults(void *info);
-
 /* The number of closid supported by this resource regardless of CDP */
 u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
-
-struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id);
+u32 resctrl_arch_system_num_rmid_idx(void);
 int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
 
-bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
-void resctrl_arch_mon_event_config_write(void *info);
-void resctrl_arch_mon_event_config_read(void *info);
-
-/* For use by arch code that needs to remap resctrl's smaller CDP closid */
-static inline u32 resctrl_get_config_index(u32 closid,
-					   enum resctrl_conf_type type)
-{
-	switch (type) {
-	default:
-	case CDP_NONE:
-		return closid;
-	case CDP_CODE:
-		return (closid * 2) + 1;
-	case CDP_DATA:
-		return (closid * 2);
-	}
-}
-
-/*
- * Caller must be in a RCU read-side critical section, or hold the
- * cpuhp read lock to prevent the struct rdt_domain being freed.
- */
-static inline struct rdt_domain *
-resctrl_get_domain_from_cpu(int cpu, struct rdt_resource *r)
-{
-	struct rdt_domain *d;
-
-	/*
-	 * Walking r->domains, ensure it can't race with cpuhp.
-	 * Because this is called via IPI by rdt_ctrl_update(), assertions
-	 * about locks this thread holds will lead to false positives. Check
-	 * someone is holding the CPUs lock.
-	 */
-	if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP))
-		lockdep_is_cpus_held();
-
-	list_for_each_entry_rcu(d, &r->domains, list) {
-		/* Find the domain that contains this CPU */
-		if (cpumask_test_cpu(cpu, &d->cpu_mask))
-			return d;
-	}
-
-	return NULL;
-}
-
 /*
  * Update the ctrl_val and apply this config right now.
  * Must be called on one of the domain's CPUs.
  */
-int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
 
-u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
 			    u32 closid, enum resctrl_conf_type type);
-int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
-void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+int resctrl_online_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+int resctrl_online_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
+void resctrl_offline_ctrl_domain(struct rdt_resource *r, struct rdt_ctrl_domain *d);
+void resctrl_offline_mon_domain(struct rdt_resource *r, struct rdt_mon_domain *d);
 void resctrl_online_cpu(unsigned int cpu);
 void resctrl_offline_cpu(unsigned int cpu);
 
@@ -390,7 +292,7 @@ void resctrl_offline_cpu(unsigned int cpu);
  * Return:
  * 0 on success, or -EIO, -EINVAL etc on error.
  */
-int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *arch_mon_ctx);
 
@@ -423,7 +325,7 @@ static inline void resctrl_arch_rmid_read_context_check(void)
  *
  * This can be called from any CPU.
  */
-void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
 			     u32 closid, u32 rmid,
 			     enum resctrl_event_id eventid);
 
@@ -436,17 +338,9 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
 *
 * This can be called from any CPU.
 */
-void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d);
 
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
 
-extern bool resctrl_mounted;
-
-int resctrl_init(void);
-void resctrl_exit(void);
-
-int resctrl_arch_mon_resource_init(void);
-void mbm_config_rftype_init(const char *config);
-
 #endif /* _RESCTRL_H */
-- 
2.25.1
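For readers following the domain split above, a minimal sketch (not part of this
series) of what a walker looks like once rdt_domain is replaced by the
rdt_ctrl_domain/rdt_mon_domain pair; find_ctrl_domain() is a hypothetical
example helper, shown only to illustrate iterating through the embedded
rdt_domain_hdr:

#include <linux/resctrl.h>

/*
 * Hypothetical example: walkers iterate the per-resource ctrl_domains
 * list via the common header embedded in each domain type, rather than
 * a single rdt_domain list. Caller must satisfy the same RCU/cpuhp
 * locking rules as the old resctrl_get_domain_from_cpu().
 */
static struct rdt_ctrl_domain *find_ctrl_domain(struct rdt_resource *r, int id)
{
	struct rdt_ctrl_domain *d;

	list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
		if (d->hdr.id == id)
			return d;
	}

	return NULL;
}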

After merging SNC support, we found that the ARM MPAM and x86 RDT concepts
diverge significantly. Attempts to maintain a shared resctrl layer resulted
in excessive compatibility shims and fragile code. Therefore we now split
the compilation dependencies between ARM MPAM and x86 RDT, and maintain
separate code paths for each architecture.

This preparatory patch restores the x86 RDT implementation to its original
state (commit 543fe1576126), while allowing ARM MPAM to evolve
independently.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
---
 arch/x86/Kconfig                       |   2 -
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |   4 +-
 fs/resctrl/Kconfig                     |   1 +
 include/linux/resctrl.h                |   7 +
 include/linux/resctrl_mpam.h           | 452 +++++++++++++++++++++++++
 5 files changed, 462 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/resctrl_mpam.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4ba29ccfaa9d..c2bdb7b95176 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -482,8 +482,6 @@ config X86_CPU_RESCTRL
 	depends on MISC_FILESYSTEMS
 	select KERNFS
 	select ARCH_HAS_CPU_RESCTRL
-	select RESCTRL_FS
-	select RESCTRL_FS_PSEUDO_LOCK
 	help
 	  Enable x86 CPU resource control support.
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index d7163b764c62..be67cb47a888 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -3036,7 +3036,7 @@ static void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
 	char name[32];
 
 	snc_mode = r->mon_scope == RESCTRL_L3_NODE;
-	sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
+	sprintf(name, "mon_%s_%02ld", r->name, snc_mode ? d->ci->id : d->hdr.id);
 	if (snc_mode)
 		sprintf(subname, "mon_sub_%s_%02d", r->name, d->hdr.id);
 
@@ -3088,7 +3088,7 @@ static int mkdir_mondata_subdir(struct kernfs_node *parent_kn,
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	snc_mode = r->mon_scope == RESCTRL_L3_NODE;
-	sprintf(name, "mon_%s_%02d", r->name, snc_mode ? d->ci->id : d->hdr.id);
+	sprintf(name, "mon_%s_%02ld", r->name, snc_mode ? d->ci->id : d->hdr.id);
 	kn = kernfs_find_and_get(parent_kn, name);
 	if (kn) {
 		/*
diff --git a/fs/resctrl/Kconfig b/fs/resctrl/Kconfig
index 5bbf5a1798cd..540da76e1189 100644
--- a/fs/resctrl/Kconfig
+++ b/fs/resctrl/Kconfig
@@ -1,6 +1,7 @@
 config RESCTRL_FS
 	bool "CPU Resource Control Filesystem (resctrl)"
 	depends on ARCH_HAS_CPU_RESCTRL
+	depends on ARM_CPU_RESCTRL
 	select KERNFS
 	select PROC_CPU_RESCTRL if PROC_FS
 	help
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index d94abba1c716..d86f0b1d3f60 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -2,6 +2,12 @@
 #ifndef _RESCTRL_H
 #define _RESCTRL_H
 
+#ifdef CONFIG_ARM_CPU_RESCTRL
+
+#include <linux/resctrl_mpam.h>
+
+#else
+
 #include <linux/cacheinfo.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
@@ -343,4 +349,5 @@ void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *
 extern unsigned int resctrl_rmid_realloc_threshold;
 extern unsigned int resctrl_rmid_realloc_limit;
 
+#endif /* ARM_CPU_RESCTRL */
 #endif /* _RESCTRL_H */
diff --git a/include/linux/resctrl_mpam.h b/include/linux/resctrl_mpam.h
new file mode 100644
index 000000000000..449620cf04b3
--- /dev/null
+++ b/include/linux/resctrl_mpam.h
@@ -0,0 +1,452 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _RESCTRL_MPAM_H
+#define _RESCTRL_MPAM_H
+
+#include <linux/cpu.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/pid.h>
+#include <linux/resctrl_types.h>
+
+#ifdef CONFIG_ARCH_HAS_CPU_RESCTRL
+#include <asm/resctrl.h>
+#endif
+
+/* CLOSID, RMID value used by the default control group */
+#define RESCTRL_RESERVED_CLOSID		0
+#define RESCTRL_RESERVED_RMID		0
+
+#define RESCTRL_PICK_ANY_CPU		-1
+
+#ifdef CONFIG_PROC_CPU_RESCTRL
+
+int proc_resctrl_show(struct seq_file *m,
+		      struct pid_namespace *ns,
+		      struct pid *pid,
+		      struct task_struct *tsk);
+
+#endif
+
+/* max value for struct rdt_domain's mbps_val */
+#define MBA_MAX_MBPS		U32_MAX
+
+/*
+ * Resctrl uses u32 to hold the user-space config. The maximum bitmap size is
+ * 32.
+ */
+#define RESCTRL_MAX_CBM		32
+
+extern unsigned int resctrl_rmid_realloc_limit;
+extern unsigned int resctrl_rmid_realloc_threshold;
+
+/**
+ * struct pseudo_lock_region - pseudo-lock region information
+ * @s:			Resctrl schema for the resource to which this
+ *			pseudo-locked region belongs
+ * @closid:		The closid that this pseudo-locked region uses
+ * @d:			RDT domain to which this pseudo-locked region
+ *			belongs
+ * @cbm:		bitmask of the pseudo-locked region
+ * @lock_thread_wq:	waitqueue used to wait on the pseudo-locking thread
+ *			completion
+ * @thread_done:	variable used by waitqueue to test if pseudo-locking
+ *			thread completed
+ * @cpu:		core associated with the cache on which the setup code
+ *			will be run
+ * @line_size:		size of the cache lines
+ * @size:		size of pseudo-locked region in bytes
+ * @kmem:		the kernel memory associated with pseudo-locked region
+ * @minor:		minor number of character device associated with this
+ *			region
+ * @debugfs_dir:	pointer to this region's directory in the debugfs
+ *			filesystem
+ * @pm_reqs:		Power management QoS requests related to this region
+ */
+struct pseudo_lock_region {
+	struct resctrl_schema	*s;
+	u32			closid;
+	struct rdt_domain	*d;
+	u32			cbm;
+	wait_queue_head_t	lock_thread_wq;
+	int			thread_done;
+	int			cpu;
+	unsigned int		line_size;
+	unsigned int		size;
+	void			*kmem;
+	unsigned int		minor;
+	struct dentry		*debugfs_dir;
+	struct list_head	pm_reqs;
+};
+
+/**
+ * struct resctrl_staged_config - parsed configuration to be applied
+ * @new_ctrl:		new ctrl value to be loaded
+ * @have_new_ctrl:	whether the user provided new_ctrl is valid
+ */
+struct resctrl_staged_config {
+	u32			new_ctrl;
+	bool			have_new_ctrl;
+};
+
+/**
+ * struct rdt_domain - group of CPUs sharing a resctrl resource
+ * @list:		all instances of this resource
+ * @id:			unique id for this instance
+ * @cpu_mask:		which CPUs share this resource
+ * @rmid_busy_llc:	bitmap of which limbo RMIDs are above threshold
+ * @mbm_total:		saved state for MBM total bandwidth
+ * @mbm_local:		saved state for MBM local bandwidth
+ * @mbm_over:		worker to periodically read MBM h/w counters
+ * @cqm_limbo:		worker to periodically read CQM h/w counters
+ * @mbm_work_cpu:	worker CPU for MBM h/w counters
+ * @cqm_work_cpu:	worker CPU for CQM h/w counters
+ * @plr:		pseudo-locked region (if any) associated with domain
+ * @staged_config:	parsed configuration to be applied
+ * @mbps_val:		When mba_sc is enabled, this holds the array of user
+ *			specified control values for mba_sc in MBps, indexed
+ *			by closid
+ */
+struct rdt_domain {
+	struct list_head		list;
+	int				id;
+	struct cpumask			cpu_mask;
+	unsigned long			*rmid_busy_llc;
+	struct mbm_state		*mbm_total;
+	struct mbm_state		*mbm_local;
+	struct mbm_state		*mbm_core;
+	struct delayed_work		mbm_over;
+	struct delayed_work		cqm_limbo;
+	int				mbm_work_cpu;
+	int				cqm_work_cpu;
+	struct pseudo_lock_region	*plr;
+	struct resctrl_staged_config	staged_config[CDP_NUM_TYPES];
+	u32				*mbps_val;
+};
+
+/**
+ * struct resctrl_cache - Cache allocation related data
+ * @cbm_len:		Length of the cache bit mask
+ * @min_cbm_bits:	Minimum number of consecutive bits to be set.
+ *			The value 0 means the architecture can support
+ *			zero CBM.
+ * @shareable_bits:	Bitmask of shareable resource with other
+ *			executing entities
+ * @arch_has_sparse_bitmasks:	True if a bitmask like f00f is valid.
+ * @arch_has_per_cpu_cfg:	True if QOS_CFG register for this cache
+ *				level has CPU scope.
+ * @intpri_wd:		Number of implemented bits in the priority
+ *			partition.
+ */
+struct resctrl_cache {
+	unsigned int	cbm_len;
+	unsigned int	min_cbm_bits;
+	unsigned int	shareable_bits;
+	bool		arch_has_sparse_bitmasks;
+	bool		arch_has_per_cpu_cfg;
+	unsigned int	intpri_wd;
+};
+
+/**
+ * enum membw_throttle_mode - System's memory bandwidth throttling mode
+ * @THREAD_THROTTLE_UNDEFINED:	Not relevant to the system
+ * @THREAD_THROTTLE_MAX:	Memory bandwidth is throttled at the core
+ *				always using smallest bandwidth percentage
+ *				assigned to threads, aka "max throttling"
+ * @THREAD_THROTTLE_PER_THREAD:	Memory bandwidth is throttled at the thread
+ */
+enum membw_throttle_mode {
+	THREAD_THROTTLE_UNDEFINED = 0,
+	THREAD_THROTTLE_MAX,
+	THREAD_THROTTLE_PER_THREAD,
+};
+
+/**
+ * struct resctrl_membw - Memory bandwidth allocation related data
+ * @min_bw:		Minimum memory bandwidth percentage user can request
+ * @max_bw:		Maximum memory bandwidth value, used as the reset value
+ * @bw_gran:		Granularity at which the memory bandwidth is allocated
+ * @delay_linear:	True if memory B/W delay is in linear scale
+ * @arch_needs_linear:	True if we can't configure non-linear resources
+ * @throttle_mode:	Bandwidth throttling mode when threads request
+ *			different memory bandwidths
+ * @mba_sc:		True if MBA software controller(mba_sc) is enabled
+ * @mb_map:		Mapping of memory B/W percentage to memory B/W delay
+ * @intpri_wd:		Number of implemented bits in the priority
+ *			partition.
+ */
+struct resctrl_membw {
+	u32			min_bw;
+	u32			max_bw;
+	u32			bw_gran;
+	u32			delay_linear;
+	bool			arch_needs_linear;
+	enum membw_throttle_mode throttle_mode;
+	bool			mba_sc;
+	u32			*mb_map;
+	u32			intpri_wd;
+};
+
+/**
+ * enum resctrl_schema_fmt - The format user-space provides for a schema.
+ * @RESCTRL_SCHEMA_BITMAP:	The schema is a bitmap in hex.
+ * @RESCTRL_SCHEMA_RANGE:	The schema is a decimal number.
+ */
+enum resctrl_schema_fmt {
+	RESCTRL_SCHEMA_BITMAP,
+	RESCTRL_SCHEMA_RANGE,
+};
+
+/**
+ * struct rdt_resource - attributes of a resctrl resource
+ * @rid:		The index of the resource
+ * @alloc_capable:	Is allocation available on this machine
+ * @mon_capable:	Is monitor feature available on this machine
+ * @num_rmid:		Number of RMIDs available
+ * @cache_level:	Which cache level defines scope of this resource
+ * @cache:		Cache allocation related data
+ * @membw:		If the component has bandwidth controls, their properties.
+ * @domains:		RCU list of all domains for this resource
+ * @name:		Name to use in "schemata" file.
+ * @data_width:		Character width of data when displaying
+ * @default_ctrl:	Specifies default cache cbm or memory B/W percent.
+ * @format_str:		Per resource format string to show domain value
+ * @evt_list:		List of monitoring events
+ * @mbm_cfg_mask:	Bandwidth sources that can be tracked when bandwidth
+ *			monitoring events can be configured.
+ * @fflags:		flags to choose base and info files
+ * @cdp_capable:	Is the CDP feature available on this resource
+ */
+struct rdt_resource {
+	int			rid;
+	bool			alloc_capable;
+	bool			mon_capable;
+	bool			invisible;
+	bool			is_volatile;
+	int			num_rmid;
+	int			cache_level;
+	struct resctrl_cache	cache;
+	struct resctrl_membw	membw;
+	struct list_head	domains;
+	char			*name;
+	int			data_width;
+	u32			default_ctrl;
+	const char		*format_str;
+	struct list_head	evt_list;
+	unsigned int		mbm_cfg_mask;
+	unsigned long		fflags;
+	bool			cdp_capable;
+	enum resctrl_schema_fmt	schema_fmt;
+};
+
+/*
+ * Get the resource that exists at this level. If the level is not supported
+ * a dummy/not-capable resource can be returned. Levels >= RDT_NUM_RESOURCES
+ * will return NULL.
+ */
+struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l);
+
+/**
+ * struct resctrl_schema - configuration abilities of a resource presented to
+ *			   user-space
+ * @list:	Member of resctrl_schema_all.
+ * @name:	The name to use in the "schemata" file.
+ * @conf_type:	Whether this schema is specific to code/data.
+ * @res:	The resource structure exported by the architecture to describe
+ *		the hardware that is configured by this schema.
+ * @num_closid:	The number of closid that can be used with this schema. When
+ *		features like CDP are enabled, this will be lower than the
+ *		hardware supports for the resource.
+ */
+struct resctrl_schema {
+	struct list_head		list;
+	char				name[16];
+	enum resctrl_conf_type		conf_type;
+	struct rdt_resource		*res;
+	u32				num_closid;
+};
+
+struct resctrl_cpu_sync {
+	u32 closid;
+	u32 rmid;
+};
+
+struct resctrl_mon_config_info {
+	struct rdt_resource	*r;
+	struct rdt_domain	*d;
+	u32			evtid;
+	u32			mon_config;
+
+	int			err;
+};
+
+/**
+ * struct mon_evt - Entry in the event list of a resource
+ * @evtid:		event id
+ * @name:		name of the event
+ * @configurable:	true if the event is configurable
+ * @list:		entry in &rdt_resource->evt_list
+ */
+struct mon_evt {
+	enum resctrl_event_id	evtid;
+	char			*name;
+	bool			configurable;
+	struct list_head	list;
+};
+
+/*
+ * Update and re-load this CPUs defaults. Called via IPI, takes a pointer to
+ * struct resctrl_cpu_sync, or NULL.
+ */
+void resctrl_arch_sync_cpu_defaults(void *info);
+
+/* The number of closid supported by this resource regardless of CDP */
+u32 resctrl_arch_get_num_closid(struct rdt_resource *r);
+
+struct rdt_domain *resctrl_arch_find_domain(struct rdt_resource *r, int id);
+int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid);
+
+bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt);
+void resctrl_arch_mon_event_config_write(void *info);
+void resctrl_arch_mon_event_config_read(void *info);
+
+/* For use by arch code that needs to remap resctrl's smaller CDP closid */
+static inline u32 resctrl_get_config_index(u32 closid,
+					   enum resctrl_conf_type type)
+{
+	switch (type) {
+	default:
+	case CDP_NONE:
+		return closid;
+	case CDP_CODE:
+		return (closid * 2) + 1;
+	case CDP_DATA:
+		return (closid * 2);
+	}
+}
+
+/*
+ * Caller must be in a RCU read-side critical section, or hold the
+ * cpuhp read lock to prevent the struct rdt_domain being freed.
+ */
+static inline struct rdt_domain *
+resctrl_get_domain_from_cpu(int cpu, struct rdt_resource *r)
+{
+	struct rdt_domain *d;
+
+	/*
+	 * Walking r->domains, ensure it can't race with cpuhp.
+	 * Because this is called via IPI by rdt_ctrl_update(), assertions
+	 * about locks this thread holds will lead to false positives. Check
+	 * someone is holding the CPUs lock.
+	 */
+	if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && IS_ENABLED(CONFIG_LOCKDEP))
+		lockdep_is_cpus_held();
+
+	list_for_each_entry_rcu(d, &r->domains, list) {
+		/* Find the domain that contains this CPU */
+		if (cpumask_test_cpu(cpu, &d->cpu_mask))
+			return d;
+	}
+
+	return NULL;
+}
+
+/*
+ * Update the ctrl_val and apply this config right now.
+ * Must be called on one of the domain's CPUs.
+ */
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_domain *d,
+			    u32 closid, enum resctrl_conf_type t, u32 cfg_val);
+
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
+			    u32 closid, enum resctrl_conf_type type);
+int resctrl_online_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
+void resctrl_online_cpu(unsigned int cpu);
+void resctrl_offline_cpu(unsigned int cpu);
+
+/**
+ * resctrl_arch_rmid_read() - Read the eventid counter corresponding to rmid
+ *			      for this resource and domain.
+ * @r:			resource that the counter should be read from.
+ * @d:			domain that the counter should be read from.
+ * @closid:		closid that matches the rmid. Depending on the architecture, the
+ *			counter may match traffic of both @closid and @rmid, or @rmid
+ *			only.
+ * @rmid:		rmid of the counter to read.
+ * @eventid:		eventid to read, e.g. L3 occupancy.
+ * @val:		result of the counter read in bytes.
+ * @arch_mon_ctx:	An architecture specific value from
+ *			resctrl_arch_mon_ctx_alloc(), for MPAM this identifies
+ *			the hardware monitor allocated for this read request.
+ *
+ * Some architectures need to sleep when first programming some of the counters.
+ * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
+ * for a short period of time). Call from a non-migrateable process context on
+ * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
+ * schedule_work_on(). This function can be called with interrupts masked,
+ * e.g. using smp_call_function_any(), but may consistently return an error.
+ *
+ * Return:
+ * 0 on success, or -EIO, -EINVAL etc on error.
+ */
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
+			   u32 closid, u32 rmid, enum resctrl_event_id eventid,
+			   u64 *val, void *arch_mon_ctx);
+
+/**
+ * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
+ *
+ * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
+ * resctrl_arch_rmid_read() is called with preemption disabled.
+ *
+ * The contract with resctrl_arch_rmid_read() is that if interrupts
+ * are unmasked, it can sleep. This allows NOHZ_FULL systems to use an
+ * IPI, (and fail if the call needed to sleep), while most of the time
+ * the work is scheduled, allowing the call to sleep.
+ */
+static inline void resctrl_arch_rmid_read_context_check(void)
+{
+	if (!irqs_disabled())
+		might_sleep();
+}
+
+/**
+ * resctrl_arch_reset_rmid() - Reset any private state associated with rmid
+ *			       and eventid.
+ * @r:		The domain's resource.
+ * @d:		The rmid's domain.
+ * @closid:	closid that matches the rmid. Depending on the architecture, the
+ *		counter may match traffic of both @closid and @rmid, or @rmid only.
+ * @rmid:	The rmid whose counter values should be reset.
+ * @eventid:	The eventid whose counter values should be reset.
+ *
+ * This can be called from any CPU.
+ */
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d,
+			     u32 closid, u32 rmid,
+			     enum resctrl_event_id eventid);
+
+/**
+ * resctrl_arch_reset_rmid_all() - Reset all private state associated with
+ *				   all rmids and eventids.
+ * @r:		The resctrl resource.
+ * @d:		The domain for which all architectural counter state will
+ *		be cleared.
+ *
+ * This can be called from any CPU.
+ */
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_domain *d);
+
+extern unsigned int resctrl_rmid_realloc_threshold;
+extern unsigned int resctrl_rmid_realloc_limit;
+
+extern bool resctrl_mounted;
+
+int resctrl_init(void);
+void resctrl_exit(void);
+
+int resctrl_arch_mon_resource_init(void);
+void mbm_config_rftype_init(const char *config);
+
+#endif /* _RESCTRL_MPAM_H */
-- 
2.25.1
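For readers unfamiliar with the dispatch introduced above, a minimal sketch
(not part of this patch) of what it means for common code, assuming the
CONFIG_ARM_CPU_RESCTRL symbol used here; closids_for_resource() is a
hypothetical example function:

#include <linux/resctrl.h>

/*
 * Hypothetical example: callers keep the single <linux/resctrl.h>
 * include. On builds with CONFIG_ARM_CPU_RESCTRL it pulls in the MPAM
 * definitions from <linux/resctrl_mpam.h>; otherwise the x86 RDT
 * definitions below the #else are used. The helper compiles either way
 * because resctrl_arch_get_num_closid() is declared by both variants.
 */
static u32 closids_for_resource(struct rdt_resource *r)
{
	return resctrl_arch_get_num_closid(r);
}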

Feedback: The patch(es) you sent to kernel@openeuler.org could not be converted to a PR!
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/VWO...
Failure reason: applying the patch(es) failed; Patch failed at 0001 Revert "x86/resctrl: Fix arch_mbm_* array overrun on SNC"
Suggested solution: check the failure reason and confirm that the patch(es) apply on top of the latest code of the intended branch.